
NVIDIA: Senior System Software Engineer - AI Performance and Efficiency Tools

Experienced hire · Full-time · Location: Shanghai · Status: Hiring

Requirements


• BS+ in Computer Science or a related field (or equivalent experience) and 5+ years of software development
• Strong software skills in design, coding (C++ and Python), analysis, and debugging
• Good understanding of deep learning frameworks such as PyTorch and TensorFlow, and of distributed training and inference
• Knowledge of GPU cluster job scheduling (Slurm or Kubernetes), storage, and networking
• Experience with NVIDIA GPUs, CUDA programming, and NCCL
• Motivated self-starter with strong problem-sol…

Responsibilities


A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and SW/HW teams running AI workloads on GPU clusters. As a member of the software development team, you will work with users from different departments, such as Architecture and Software teams. Our work gives users intuitive, rich, and accurate insight into the workload and the system, and empowers them to find opportunities in software and hardware, build high-level models to propose and deliver the best hardware and software to our customers, and debug tricky failures and issues to help improve the performance and efficiency of the system.
What you’ll be doing:
• Build internal profiling and analysis tools for AI workloads at large scale
• Build debugging tools for commonly encountered problems, such as memory or networking issues
• Create benchmarking and simulation technologies for AI systems or GPU clusters
• Partner with HW architects to propose new features or improve existing features based on real-world use cases
Related positions

Apple · Experienced hire · Machine

This role requires a blend of skills in software engineering, machine learning, and operations to ensure the smooth functioning of ML systems in production environments. In this role you will:
• Lead the team to design and implement automation for model training, testing, validation, and deployment
• Collaborate with machine learning engineers to ensure efficient deployment and scaling of ML models
• Implement monitoring and alerting systems to track model performance, system health, and data drift
• Optimize compute resources for cost and performance efficiency
• Manage model versions to ensure traceability and reproducibility

Updated 2025-07-22 · Shanghai

NVIDIA · Experienced hire

• Architect, develop, and maintain Python-based tools and services to efficiently run a performance-focused multi-tenant Linux cluster including embedded, desktop, and server systems
• Work with industry-standard tools (Kubernetes, Slurm, Ansible, GitLab, Artifactory, Jira)
• Actively support users doing development, functional testing, and performance testing on current and pre-production GPU cluster systems
• Work with various teams at NVIDIA across different time zones to incorporate and influence the latest tools for operating GPU clusters
• Collaborate with users and system administrators to seek out ways to improve UX and operational efficiency
• Become an expert on the entire AI infrastructure stack

Updated 2025-10-17 · Shanghai | Beijing

Microsoft · Experienced hire · Software

• Design, document, implement, and maintain scalable, secure, and high-performance backend services for Calendar & Places Copilot scenarios.
• Take ownership of service design by driving reliable, scalable, and high-performance solutions.
• Ensure availability, reliability, efficiency, observability, and performance of supported services.
• Develop automation and leverage telemetry to identify patterns and drive continuous improvement.
• Resolve service issues, minimize customer impact, and document solutions to prevent recurrence.
• Collaborate with PMs, applied scientists, and UX designers to deliver intelligent, user-facing features.

Updated 2025-10-28 · Shanghai

Microsoft · Experienced hire · Software

• Design, document, implement, and maintain scalable, secure, and high-performance backend services for Calendar & Places Copilot scenarios.
• Take ownership of service design by driving reliable, scalable, and high-performance solutions.
• Ensure availability, reliability, efficiency, observability, and performance of supported services.
• Develop automation and leverage telemetry to identify patterns and drive continuous improvement.
• Resolve service issues, minimize customer impact, and document solutions to prevent recurrence.
• Collaborate with PMs, applied scientists, and UX designers to deliver intelligent, user-facing features.

Updated 2025-10-28 · Shanghai