NVIDIA Senior Solutions Architect, GPU System
Requirements
• BS/BA in Computer Science, Electrical/Computer Engineering, or equivalent experience, with 6+ years of experience with data center servers, GPU platforms, or large‑scale AI/HPC infrastructure.
• Strong understanding of GPU server architecture: CPU/GPU balance, memory and PCIe/NVLink topology, storage and NIC placement, and power/cooling considerations.
• Proven experience designing or operating AI or HPC clusters using GPU‑accelerated servers in cloud or on‑prem data centers.
• Solid background in data center and cloud networking for AI workloads, including leaf‑spine fabrics, RDMA, and high‑bandwidth/low‑latency designs.
• Strong Linux system and Linux networking skills, including driver, firmware, and OS‑level tuning for GPU and NIC performance.
• Knowledge of and experience with Kubernetes (K8s) and RDMA/RoCE, ideally including RoCE and InfiniBand AI clusters.
• Excellent communicati…
Responsibilities
• Lead presales and architecture engagements with AI industry customers, focusing on GPU servers, AI clusters, and large‑scale training/inference platforms built on NVIDIA HGX, GPU systems, and reference architectures.
• Design and validate end‑to‑end AI data center solutions, including server platforms, storage connectivity, and high‑performance networking based on Spectrum, Quantum, ConnectX, and BlueField.
• Define system architectures for AI supercomputing, LLM training, and inference workloads, including node configuration, GPU topology, PCIe/NVLink considerations, and network design.
• Support business teams in exploring, developing, and deploying NVIDIA server and GPU solution opportunities, from early technical discovery through POC and production rollout.
• Own and execute POCs and hands‑on labs that validate GPU server performance, scalability, reliability, and interoperability across compute, storage, and network domains.
• Troubleshoot complex end‑to‑end issues involving GPU servers, firmware, drivers, operating systems, and networking stacks, and drive fixes with internal R&D and partners.
• Provide structured feedback on platform features, system requirements, and customer needs to server OEMs, engineering, and product teams to improve NVIDIA AI platforms and ecosystems.
*Hiring location: Beijing, Shanghai, Guangzhou, Shenzhen, Hong Kong (visa sponsorship provided)
Would you like to join one of the fastest-growing teams within Amazon Web Services (AWS) and help shape the future of GPU optimization and high-performance computing? Join us in helping customers across all industries to maximize the performance and efficiency of their GPU workloads on AWS while pioneering innovative optimization solutions.
As a Senior Technical Account Manager (Sr. TAM) specializing in GPU Optimization in AWS Enterprise Support, you will play a crucial role in two key missions: guiding customers' GPU acceleration initiatives across AWS's comprehensive compute portfolio, and spearheading the development of optimization strategies that revolutionize customer workload performance.
Key Job Responsibilities
- Build and maintain long-term technical relationships with enterprise customers, focusing on GPU performance optimization and resource allocation efficiency on AWS cloud or similar cloud services.
- Analyze customers’ current architecture, models, data pipelines, and deployment patterns; create a GPU bottleneck map and measurable KPIs (e.g., GPU utilization, throughput, P95/P99 latency, cost per unit).
- Design and optimize GPU resource usage on EC2/EKS/SageMaker or equivalent cloud compute, container, and ML services; implement node pool tiering, Karpenter/Cluster Autoscaler tuning, auto scaling, and cost governance (Savings Plans/RI/Spot/ODCR or equivalent).
- Drive GPU partitioning and multi-tenant resource sharing strategies to reduce idle resources and increase overall cluster utilization.
- Guide customers in PyTorch/TensorFlow performance tuning (DataLoader optimization, mixed precision, gradient accumulation, operator fusion, torch.compile) and inference acceleration (ONNX, TensorRT, CUDA Graphs, model compression).
- Build GPU observability and monitoring systems (nvidia-smi, CloudWatch or equivalent monitoring tools, profilers, distributed communication metrics) to align capacity planning with SLOs.
- Ensure compatibility across GPU drivers, CUDA, container runtimes, and frameworks; standardize change management and rollback processes.
- Collaborate with cloud provider internal teams and external partners (NVIDIA, ISVs) to resolve cross-domain complex issues and deliver repeatable optimization solutions.
------------------------------------------------------
- As an AIML Specialist Solutions Architect (SA) in AI Infrastructure, you will serve as the Subject Matter Expert (SME) for providing optimal solutions for model training and inference workloads that leverage Amazon Web Services accelerated computing services. As part of the Specialist Solutions Architecture team, you will work closely with other Specialist SAs to enable large-scale customer model workloads and drive the adoption of AWS EC2, EKS, ECS, SageMaker, and other compute platforms for GenAI practice.
- You will interact with other SAs in the field, providing guidance on their customer engagements, and you will develop white papers, blogs, reference implementations, and presentations to enable customers and partners to fully leverage AI Infrastructure on Amazon Web Services. You will also create field enablement materials for the broader SA population, to help them understand how to integrate Amazon Web Services GenAI solutions into customer architectures.
- You must have deep technical experience working with technologies related to Large Language Models (LLMs), Stable Diffusion, and many other SOTA model architectures, from model design and fine-tuning through distributed training to inference acceleration. A strong machine learning development background is preferred, in addition to experience building applications and designing architectures. You will be familiar with the NVIDIA ecosystem and related technical options, and will leverage this knowledge to help Amazon Web Services customers in their selection process.
- Candidates must have great communication skills and be very technical and hands-on, with the ability to impress Amazon Web Services customers at any level, from ML engineers to executives. Previous experience with Amazon Web Services is desired but not required, provided you have experience building large-scale solutions.
You will get the opportunity to work directly with senior engineers at customers, partners and Amazon Web Services service teams, influencing their roadmaps and driving innovations.
• Primary responsibilities will include building AI/HPC infrastructure for new and existing customers.
• Support operational and reliability aspects of large-scale AI clusters, focusing on performance at scale, real-time monitoring, logging, and alerting.
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Provide feedback to internal teams, for example by opening bugs, documenting workarounds, and suggesting improvements.
• Define the end-to-end technical architecture for the NIM Factory, from container build systems and CI/CD to Kubernetes deployment patterns and runtime optimization.
• Drive technical strategy and roadmap, making high-impact decisions on frameworks, technologies, and standards that empower dozens of engineering teams.
• Architect and influence the design of workflow orchestration systems that underpin the NIM Factory.
• Coach and mentor senior engineers across the organization, fostering a culture of technical excellence, innovation, and knowledge sharing.
• Champion best practices in software development, including API design, automation, observability, and secure supply chain management.
• Collaborate with leadership across research, backend, SRE, and product to align technical vision with product goals and influence technical roadmaps.