
NVIDIA: Senior Infrastructure Software Engineer

Experienced hire | Full-time | Location: Shanghai, Beijing | Status: Hiring

Requirements


• BS or higher degree in computer science with 4+ years of relevant experience
• Adept programming skills in multiple languages including Python
• In-depth experience with distributed systems and cluster management stacks (logging, monitoring, scheduling, etc.)
• Hands-on experience with continuous integration and deployment tools (e.g. GitLab CI)
• Outstanding ability to understand users, prioritize among many contending requests, and build consensus
• Passion for “it just works” automation, eliminating repetitive tasks, and enabling team members
• Deep understanding of Linux system administration and container technologies
• Proficient English communication skills




Ways to stand out from the crowd:

• Experience automating operations for bare-metal clusters
• Experience with GPU computing systems
• Track record of identifying useful new technologies or methods and incorporating them into software development flows
• Experience as an active contributor to a software project involving many developers or as a maintainer of open-source software

Responsibilities


• Architect, develop, and maintain Python-based tools and services to efficiently run a performance-focused multi-tenant Linux cluster including embedded, desktop, and server systems
• Work with industry-standard tools (Kubernetes, Slurm, Ansible, GitLab, Artifactory, Jira)
• Actively support users doing development, functional testing, and performance testing on current and pre-production GPU cluster systems
• Work with various teams at NVIDIA across different time zones to incorporate and influence the latest tools for operating GPU clusters
• Collaborate with users and system administrators to seek out ways to improve UX and operational efficiency
• Become an expert on the entire AI infrastructure stack
Related positions

Experienced hire

• Design, develop, and improve scalable infrastructure to support the next generation of AI applications, including copilots and agentic tools.
• Drive improvements in architecture, performance, and reliability, enabling teams to apply LLMs and advanced agent frameworks at scale.
• Collaborate across hardware, software, and research teams, mentoring and supporting peers while encouraging best engineering practices and a culture of technical excellence.
• Stay informed of the latest advancements in AI infrastructure and contribute to continuous innovation across the organization.

Updated 2025-09-16
Experienced hire

Joining NVIDIA's DGX Cloud Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation.

We are seeking an AI infrastructure software engineer to join our team. You'll be instrumental in designing, building, and maintaining AI infrastructure that enables large-scale AI training and inferencing. The responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of AI systems.

As a senior DGX Cloud AI Infrastructure software engineer at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and data science, and be part of a dynamic and supportive team that values learning and growth. The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking. If you are seeking an exciting and rewarding career that makes a difference, we invite you to apply now!

What you’ll be doing:

• Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
• Develop and optimize tools to improve infrastructure efficiency and resiliency.
• Root-cause, analyze, and triage failures from the application level to the hardware level.
• Enhance infrastructure and products underpinning NVIDIA's AI platforms.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Skilled in problem-solving, root cause analysis, and optimization.

Updated 2025-10-07
Experienced hire

Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing! An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work. Come join the team and see how we can make a lasting impact on the world.

NVIDIA is hiring senior software engineers in its Infrastructure, Planning and Process Team (IPP) to accelerate AI adoption across various engineering workflows within the company. IPP is a global organization within NVIDIA. The group works with various other teams within NVIDIA, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Driverless Cars, to cater to their infrastructure and software development workflow needs. As a senior engineer on AI Workflow, you will create and establish tools and software solutions that leverage Large Language Models and agentic AI to automate end-to-end software engineering workflows and enhance the productivity of engineers across NVIDIA.

What you’ll be doing:

• Develop and implement solutions throughout software development lifecycles to improve developer efficiency, accelerate feedback loops, and boost release reliability.
• Design, develop, and deploy AI agents to automate software development workflows and processes.
• Continuously measure and report on the impact of AI interventions, showing progress in metrics such as cycle time, change failure rate, and mean time to recovery (MTTR).
• Build and deploy predictive models to identify high-risk commits, forecast potential build failures, and flag changes that have a high probability of failure.
• Research emerging AI technologies and engineering best practices to continuously evolve our development ecosystem and maintain a competitive edge.

Updated 2025-09-26
Experienced hire

For more than 25 years, NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing, and now we’re defining the next era of AI-powered computing. From powering breakthroughs in autonomous vehicles to building the next wave of infrastructure, we grow with innovation motivated by the world’s best talent.

Monitor Code Coverage and perform Static Analysis (Coverity) for NVIDIA’s AV software stack. The position entails hands-on engineering tasks, such as composing strategies, building automation, working with developers, and ensuring safety-critical systems meet quality and compliance standards.

What you’ll be doing:

Code Coverage Strategy & Tooling:

• Define, implement, and own the AV software code coverage strategy (statement, branch, MC/DC) for unit, integration, and safety reporting.
• Automate coverage collection in Bazel-based builds and integrate into CI/CD pipelines (GitLab/Jenkins/GitHub Actions).
• Build dashboards and reporting pipelines for developers, safety engineers, and auditors.

Coverity Static Analysis:

• Operate incremental and full scans, automate pipelines, and implement quality gates.
• Triage, classify, and handle findings, including waiver workflows and procedures that adhere to MISRA C/C++ and CERT standards.

Developer & Collaborator Engagement:

• Partner with AV developers to resolve findings, avoid false positives, and improve adoption of coverage and static analysis practices.
• Coordinate with safety, security, and compliance participants to uphold reporting consistency and audit readiness.

Innovation & AI Integration:

• Explore ways to apply AI/LLMs for accelerating triage, generating reports, and improving developer workflows (e.g., editor plugins, code assistants).

Updated 2025-10-17