
NVIDIA Senior Infrastructure Engineer

Experienced hire | Full-time | Location: Shanghai | Status: Hiring

Requirements


• Master's degree in Electrical Engineering, Computer Science, or Computer Engineering, or equivalent work experience
• Deep understanding of ASIC design or verification, with 3+ years of engineering experience
• Proficient in Python and familiar with Makefiles; C++ and Perl are a plus
• Strong communication and interpersonal skills

Ways to stand out from the crowd:
• Experience designing a complex methodology, flow, or product to solve common issues in IP development
• Ideas about how to improve engineering efficiency

Responsibilities


• Develop and maintain the methodology for IP development
• Develop and maintain flow automation to improve engineering efficiency
• Find and fix flow issues and help IP teams adopt the flows
Related positions

Experienced hire

• Architect, develop, and maintain Python-based tools and services to efficiently run a performance-focused multi-tenant Linux cluster including embedded, desktop, and server systems
• Work with industry-standard tools (Kubernetes, Slurm, Ansible, GitLab, Artifactory, Jira)
• Actively support users doing development, functional testing, and performance testing on current and pre-production GPU cluster systems
• Work with various teams at NVIDIA across different time zones to incorporate and influence the latest tools for operating GPU clusters
• Collaborate with users and system administrators to seek out ways to improve UX and operational efficiency
• Become an expert on the entire AI infrastructure stack

Updated 2025-10-17
Experienced hire

• Design, develop, and improve scalable infrastructure to support the next generation of AI applications, including copilots and agentic tools.
• Drive improvements in architecture, performance, and reliability, enabling teams to bring to bear LLMs and advanced agent frameworks at scale.
• Collaborate across hardware, software, and research teams, mentoring and supporting peers while encouraging best engineering practices and a culture of technical excellence.
• Stay informed of the latest advancements in AI infrastructure and contribute to continuous innovation across the organization.

Updated 2025-09-16
Experienced hire

Joining NVIDIA's DGX Cloud Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation. We are seeking an AI infrastructure software engineer to join our team. You'll be instrumental in designing, building, and maintaining AI infrastructure that enables large-scale AI training and inferencing. The responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of AI systems.

As a senior DGX Cloud AI Infrastructure software engineer at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and data science, and be part of a dynamic and supportive team that values learning and growth. The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking. If you are seeking an exciting and rewarding career that makes a difference, we invite you to apply now!

What you’ll be doing:
• Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
• Develop and optimize tools to improve infrastructure efficiency and resiliency.
• Root-cause, analyze, and triage failures from the application level to the hardware level.
• Enhance infrastructure and products underpinning NVIDIA's AI platforms.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Skilled in problem-solving, root cause analysis, and optimization.

Updated 2025-10-07