
NVIDIA Senior System Software Engineer - AI Performance and Efficiency Tools

Experienced hire · Full-time · Location: Shanghai · Status: Open

Requirements


• BS+ in Computer Science or related (or equivalent experience) and 5+ years of software development
• Strong software skills in design, coding (C++ and Python), analysis, and debugging
• Good understanding of Deep Learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
• Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage and networking
• Experience with NVIDIA GPUs, CUDA Programming and NCCL
• Motivated self-starter with strong problem-solving skills and customer-facing communication skills
• Passion for continuous learning. Ability to work concurrently with multiple global groups

Ways to stand out from the crowd:
• Proven experience in GPU cluster scale continuous profiling & analysis tools/platforms
• Solid experience in large AI job performance analysis for training/inference workload
• Knowledge of Linux device drivers and/or compiler implementation
• Knowledge of GPU and/or CPU architecture and general computer architecture principles

Responsibilities


A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and SW/HW teams running AI workloads on GPU clusters. As a member of the software development team, you will work with users from different departments, such as architecture and software teams. Our work gives users intuitive, rich, and accurate insight into their workloads and systems, empowering them to find optimization opportunities in software and hardware, build high-level models to propose and deliver the best hardware and software to our customers, and debug tricky failures and issues to improve the performance and efficiency of the system.
What you’ll be doing:
• Build internal profiling and analysis tools for AI workloads at large scale
• Build debugging tools for commonly encountered problems such as memory and networking issues
• Create benchmarking and simulation technologies for AI systems and GPU clusters
• Partner with HW architects to propose new features or improve existing ones based on real-world use cases
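The role above centers on cluster-scale profiling and analysis of AI jobs on Slurm- or Kubernetes-managed GPU clusters. As a loose illustration of that kind of tooling, here is a minimal sketch that aggregates GPU-hours per user from job accounting records; the record format and function name are hypothetical, not the actual `sacct` output or any NVIDIA-internal API.

```python
# Hypothetical sketch: summarizing GPU-hours per user from cluster job
# accounting records, the kind of utilization analysis this role involves.
# The (user, gpus, hours) record format is illustrative only.
from collections import defaultdict

def gpu_hours_by_user(records):
    """Aggregate GPU-hours per user from (user, gpus, hours) job records."""
    totals = defaultdict(float)
    for user, gpus, hours in records:
        totals[user] += gpus * hours
    return dict(totals)

# Illustrative job records: (user, GPUs allocated, wall-clock hours)
jobs = [
    ("alice", 8, 12.0),   # 8-GPU training run for 12 hours
    ("bob",   4, 6.5),
    ("alice", 16, 2.0),
]

print(gpu_hours_by_user(jobs))  # → {'alice': 128.0, 'bob': 26.0}
```

A production tool would of course pull these records from the scheduler (e.g. Slurm accounting) and join them with per-job profiling data, but the aggregation step looks much like this.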