NVIDIA Senior System Software Architect, HPC and AI Networking
Requirements
• Ph.D., Master's, or Bachelor's degree in computer science, computer engineering, electrical engineering, or a closely related field.
• 5+ years of experience with DNNs, DNN scaling, parallelism in DNN frameworks, or deep learning training workloads.
• Deep understanding of inference and training workloads and their optimizations, such as prefill/decode, data parallelism, tensor parallelism, FSDP, etc.
• Experience with AI network parallelism using collective libraries and RDMA/RoCE.
• Background in algorithm design, systems programming, and computer architecture.
• Strong programming and software developme…
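As a rough illustration of one of the parallelism strategies named above, the sketch below shards a linear layer's weight matrix column-wise across two "workers" in the tensor-parallel style, then concatenates the partial outputs. It is a minimal, framework-free sketch in plain Python; the helper names are illustrative, not any library's API.

```python
# Tensor parallelism in miniature: split a weight matrix column-wise
# across workers; each worker computes a partial output on the same
# input, and concatenating the partials recovers the full result.
# Pure-Python sketch with hypothetical helper names.

def matmul(x, w):
    """Multiply row-vector x (list) by matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, parts):
    """Shard matrix w column-wise into `parts` equal slices."""
    cols = len(w[0])
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

full = matmul(x, w)                    # unsharded reference result
shards = split_columns(w, 2)           # two "tensor-parallel workers"
partials = [matmul(x, s) for s in shards]
combined = partials[0] + partials[1]   # all-gather along the column axis
assert combined == full
```

In a real system each shard lives on a different GPU and the concatenation is an all-gather collective; data parallelism, by contrast, replicates the whole matrix and splits the inputs.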
Responsibilities
• Design and prototype scalable software systems that optimize distributed AI training and inference, focusing on throughput, latency, and memory efficiency.
• Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
• Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.
• Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
• Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
• Collaborate with customers to understand their needs and provide innovative solutions.
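The core collective that libraries like NCCL provide is all-reduce: after the operation, every rank holds the element-wise sum of all ranks' buffers. The sketch below simulates the reduce-scatter + all-gather decomposition that ring-style implementations use, with "ranks" as in-process lists; it is illustrative only, not NCCL code.

```python
# Simulated all-reduce via reduce-scatter + all-gather, the two-phase
# decomposition ring implementations in NCCL-style libraries rely on.
# "Ranks" here are plain lists in one process (illustrative sketch).

def allreduce(bufs):
    """bufs: one equal-length list per rank; length divisible by rank count.
    Returns each rank's buffer after all-reduce (element-wise sum)."""
    n = len(bufs)
    chunk = len(bufs[0]) // n
    # Reduce-scatter: rank r ends up owning the fully reduced chunk r.
    owned = [
        [sum(b[i] for b in bufs) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(n)
    ]
    # All-gather: every rank collects all the reduced chunks.
    full = [x for part in owned for x in part]
    return [list(full) for _ in range(n)]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]   # 2 ranks, 4 elements each
out = allreduce(ranks)
assert out == [[11, 22, 33, 44], [11, 22, 33, 44]]
```

In data-parallel training this is exactly the gradient synchronization step: each rank contributes its local gradients and receives the sum.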
NVIDIA data center systems, such as DGX and HGX, have become core to NVIDIA's rapidly growing enterprise and cloud provider businesses. These platforms bring together the full power of NVIDIA GPUs, NVIDIA NVLink, NVIDIA InfiniBand networking, NVIDIA Grace CPUs, and a fully optimized NVIDIA AI and HPC software stack. We are hiring a Sr. Software Engineer who will help build simulators for our DGX server platforms. Simulation plays a significant role in building scalable systems at Speed of Light! You will work with world-class engineering teams across HW and SW.
What you'll be doing:
• Contribute to architecting and developing the simulation platform for next-gen NVIDIA DGX platforms.
• Build, integrate, and enhance simulator components with new HW features, and write supporting technical documents.
• Bring the full SW stack up on the DGX simulator; work closely with hardware modeling, kernel, and platform driver teams distributed globally.
• Improve performance, fix bugs across the user and kernel stack, and automate execution flows.
NVIDIA is the world leader in computer graphics, PC gaming, and accelerated computing. Today, we are tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPUs act as the brains of edge computers and robots that can understand the world. Doing what has never been done before takes vision, innovation, and the world's best talent. At NVIDIA, our employees are passionate about accelerated computing. We're united in our quest to transform the way accelerated computing is used for work and play. Our technology powers the large language models behind everyday copilots, the visual experience of video game development, film production, space exploration, medicine, computational finance, and automotive design. And we've only scratched the surface of what we can accomplish. We need passionate, hard-working, and creative people to help us pursue these outstanding opportunities.

We are now looking for a System & Network Solution Architect to join the NVIDIA China Solution Architect team. In this role, you will engage and support design-in projects with major China OEM customers, focusing on integrating NVIDIA's world-class networking portfolio (ConnectX, BlueField, and Spectrum switches). As a Solution Architect, you will act as the technical bridge between NVIDIA engineering and our OEM partners. You will guide customers through the integration of next-generation networking into their server and storage platforms, ensuring seamless compatibility, performance optimization, and successful mass production.

What you'll be doing:
• Lead OEM Design-in & Integration: Work closely with OEM customers to integrate NVIDIA networking products (e.g., CX8/CX9, CX6 Dx, BF3 DPUs) and switch platforms (Spectrum-4/6 Blackbox & Whitebox) into their server lineups.
• Architecture & Customization: Understand customer requirements to provide system-level architectural guidance; lead technical discussions on system topology, thermal/mechanical constraints, firmware customization, and sideband management (NC-SI, PLDM).
• System Bring-up & Support: Support customers during the bring-up phase of new server designs. Diagnose complex system-level issues involving PCIe, BIOS/BMC, firmware, and OS/driver interactions.
• Performance Optimization: Guide customers in optimizing network performance for AI, HPC, and cloud workloads, ensuring the best integration of NVIDIA NICs and DPUs within their specific hardware environments.
• Crisis Management: Handle in-depth, hands-on engagement with customers to resolve critical technical blockers during the NPI (New Product Introduction) and production phases.
• Cross-Functional Leadership: Collaborate with NVIDIA worldwide hardware, firmware, software, and product teams to drive customer requirements and resolve issues. Act as the technical advocate for the customer within NVIDIA.
• Primary responsibilities will include building AI/HPC infrastructure for new and existing customers.
• Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Provide feedback to internal teams, such as opening bugs, documenting workarounds, and suggesting improvements.
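Latency monitoring of the kind described above is usually reported as tail percentiles rather than averages. The sketch below computes P50/P95 with the nearest-rank method; the sample data and function names are illustrative, not any monitoring tool's API.

```python
# Sketch: tail-latency percentiles (nearest-rank method) from request
# samples -- the kind of SLO signal large-cluster monitoring tracks.

import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = [12, 15, 11, 14, 250, 13, 16, 12, 15, 900]  # sample data
p50 = percentile(latencies_ms, 50)   # 14
p95 = percentile(latencies_ms, 95)   # 900
```

A large gap between P50 and P95/P99, as here, flags tail-latency problems even when average latency looks healthy, which is why alerting thresholds are typically set on the high percentiles.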
*Hiring location: Beijing, Shanghai, Guangzhou, Shenzhen, Hong Kong (visa sponsorship provided)

Would you like to join one of the fastest-growing teams within Amazon Web Services (AWS) and help shape the future of GPU optimization and high-performance computing? Join us in helping customers across all industries maximize the performance and efficiency of their GPU workloads on AWS while pioneering innovative optimization solutions. As a Senior Technical Account Manager (Sr. TAM) specializing in GPU optimization in AWS Enterprise Support, you will play a crucial role in two key missions: guiding customers' GPU acceleration initiatives across AWS's comprehensive compute portfolio, and spearheading the development of optimization strategies that revolutionize customer workload performance.

Key Job Responsibilities
- Build and maintain long-term technical relationships with enterprise customers, focusing on GPU performance optimization and resource allocation efficiency on AWS or similar cloud services.
- Analyze customers' current architecture, models, data pipelines, and deployment patterns; create a GPU bottleneck map and measurable KPIs (e.g., GPU utilization, throughput, P95/P99 latency, cost per unit).
- Design and optimize GPU resource usage on EC2/EKS/SageMaker or equivalent cloud compute, container, and ML services; implement node pool tiering, Karpenter/Cluster Autoscaler tuning, auto scaling, and cost governance (Savings Plans/RI/Spot/ODCR or equivalent).
- Drive GPU partitioning and multi-tenant resource-sharing strategies to reduce idle resources and increase overall cluster utilization.
- Guide customers in PyTorch/TensorFlow performance tuning (DataLoader optimization, mixed precision, gradient accumulation, operator fusion, torch.compile) and inference acceleration (ONNX, TensorRT, CUDA Graphs, model compression).
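One of the tuning techniques listed above, gradient accumulation, can be sketched framework-free: summing per-micro-batch gradients weighted by micro-batch size reproduces the full-batch gradient, which is what lets a large effective batch fit in limited GPU memory. The toy loss below is a stand-in for illustration.

```python
# Gradient accumulation sketch: for a mean-squared-error loss over a
# batch, the size-weighted average of per-micro-batch gradients equals
# the full-batch gradient, so stepping once after accumulation matches
# training on the whole batch at once. Framework-free toy example.

def grad(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

full = grad(w, data)                   # one big batch: -18.0
micro = [data[0:2], data[2:4]]         # two micro-batches of 2
accum = sum(grad(w, mb) * len(mb) for mb in micro) / len(data)
assert abs(full - accum) < 1e-12
```

In PyTorch the same idea appears as calling `backward()` on several micro-batches (which sums into `.grad`) before a single `optimizer.step()`, with the loss scaled by the number of accumulation steps.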
- Build GPU observability and monitoring systems (nvidia-smi, CloudWatch or equivalent monitoring tools, profilers, distributed communication metrics) to align capacity planning with SLOs.
- Ensure compatibility across GPU drivers, CUDA, container runtimes, and frameworks; standardize change management and rollback processes.
- Collaborate with cloud provider internal teams and external partners (NVIDIA, ISVs) to resolve cross-domain complex issues and deliver repeatable optimization solutions.
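A typical first step in the observability work described above is scraping `nvidia-smi` in its machine-readable CSV mode, e.g. `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. The sketch below parses such output into per-GPU metrics; the string is sample data standing in for live command output, and the helper name is illustrative.

```python
# Sketch: turning nvidia-smi CSV output (index, utilization.gpu,
# memory.used with --format=csv,noheader,nounits) into per-GPU metrics
# for an observability pipeline. The string below is sample data, not
# a live reading.

sample = """0, 97, 74211
1, 12, 1024
2, 95, 73988
3, 0, 0"""

def parse_gpu_csv(text):
    rows = []
    for line in text.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        rows.append({"gpu": int(idx), "util_pct": int(util), "mem_mib": int(mem)})
    return rows

gpus = parse_gpu_csv(sample)
idle = [g["gpu"] for g in gpus if g["util_pct"] < 10]        # reclaim candidates
avg_util = sum(g["util_pct"] for g in gpus) / len(gpus)      # fleet utilization
```

Feeding these rows into CloudWatch (or an equivalent metrics store) at a fixed interval gives the utilization and idle-GPU signals that the partitioning and cost-governance work above depends on.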