NVIDIA Senior System Software Engineer - GPU Performance Profiling Tools
Job Requirements
• BS+ in Computer Science or a related field (or equivalent experience) with 3+ years of software development experience.
• Strong system software development skills in C++.
• Proficiency in using coding agents like Codex, Claude Code, etc.
• A motivated self-starter with strong problem-solving abilities and excellent customer-facing communication skills.
• Passion for continuous learning and the abil…
Job Responsibilities
At NVIDIA, we are proud of our advanced analysis and debugging tools that help engineers reach outstanding performance and power efficiency in products and applications. We invite creative, diligent, and innovative people to join our committed software team with rigorous standards. This software engineering position focuses on building tools for NVIDIA’s internal teams to improve hardware development and software execution.
As part of the software development team, we work with users from different departments, including Architecture and Software teams. Our mission is to give users intuitive, rich, and detailed insights into workloads and systems. This helps them see opportunities in both software and hardware. We then build high-level models that suggest and deliver world-class hardware and software solutions to our customers. We also debug complex issues to improve system performance and efficiency.
What you’ll be doing:
• Build and maintain internal profiling tools aimed at performance and power optimization by using real-world GPU applications, such as games and AI workloads.
• Collaborate with our users to model and improve the design of next-generation GPUs for better performance and power efficiency.
• Partner with hardware architects to propose new features or improve existing ones based on real-world use cases.
• Providing Ethernet and routing expertise to customers during project delivery to design, architect and test Ethernet networking solutions.
• Work on multi-functional teams to provide Ethernet network expertise to server infrastructure builds, accelerated computing workloads and GPU enabled AI applications.
• Crafting and evaluating DevOps automation scripts for network operations, crafting network architectures, and developing switch fabric configurations.
• Implementing tasks related to network configuration and validation for data centers.
• Create Methods of Procedure and deployment documents.
• Use software tools to validate and monitor network performance.
• Primary responsibilities will include deploying, managing and maintaining AI/HPC infrastructure in Linux-based environments for new and existing customers.
• Be the domain expert with customers during planning calls through implementation.
• Hand over related documentation and perform the knowledge transfers required to support customers as they begin rolling out some of the most sophisticated systems in the world!
• Provide feedback to internal teams, such as opening bugs, documenting workarounds, and suggesting improvements.
A key part of NVIDIA's strength is our sophisticated analysis and debugging tools that empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and SW/HW teams running AI workloads on GPU clusters.
As a member of the software development team, you will work with users from different departments, such as Architecture and Software teams. Our work gives users intuitive, rich, and accurate insight into the workload and the system, and empowers them to find opportunities in software and hardware, build high-level models to propose and deliver the best hardware and software to our customers, and debug tricky failures and issues to help improve the performance and efficiency of the system.
What you’ll be doing:
• Build internal profiling and analysis tools for AI workloads at large scale.
• Build debugging tools for commonly encountered problems, such as memory or networking issues.
• Create benchmarking and simulation technologies for AI systems or GPU clusters.
• Partner with HW architects to propose new features or improve existing features with real-world use cases.
Joining NVIDIA's DGX Cloud Team means contributing to the infrastructure that powers our innovative AI research. This team focuses on optimizing the efficiency and resiliency of AI workloads, as well as developing scalable AI and Data infrastructure tools and services. Our objective is to deliver a stable, scalable environment for AI researchers, providing them with the necessary resources and scale to foster innovation. We are seeking an AI infrastructure software engineer to join our team. You'll be instrumental in designing, building, and maintaining AI infrastructure that enables large-scale AI training and inferencing. The responsibilities include implementing software and systems engineering practices to ensure high efficiency and availability of AI systems.
As a senior DGX Cloud AI Infrastructure software engineer at NVIDIA, you will have the opportunity to work on innovative technologies that power the future of AI and data science, and be part of a dynamic and supportive team that values learning and growth. The role provides the autonomy to work on meaningful projects with the support and mentorship needed to succeed, and contributes to a culture of blameless postmortems, iterative improvement, and risk-taking. If you are seeking an exciting and rewarding career that makes a difference, we invite you to apply now!
What you’ll be doing:
• Develop infrastructure software and tools for large-scale AI, LLM, and GenAI infrastructure.
• Develop and optimize tools to improve infrastructure efficiency and resiliency.
• Root-cause, analyze, and triage failures from the application level to the hardware level.
• Enhance infrastructure and products underpinning NVIDIA's AI platforms.
• Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
• Define meaningful and actionable reliability metrics to track and improve system and service reliability.
• Skilled in problem-solving, root cause analysis, and optimization.