
NVIDIA Networking Solution Test Engineer - AI Cluster Debugging

Experienced hire | Full-time | Location: Shanghai, Beijing | Status: Hiring

Requirements


• B.A./B.Sc. in Computer Science, Electrical Engineering, or equivalent IT/Network/Systems experience.
• 2+ years of hands‑on networking or system‑level testing and debugging on Linux.
• Strong Linux networking and debugging skills (for example perf, tcpdump, ethtool, iproute2).
• Proven production‑grade debugging experience: forming hypotheses, running experiments, and driving issues to root cause under pressure.
• Expertise in host‑side NIC validation and tuning (offloads, queues, interrupts, firmware/driver interactions).
• Strong knowledge of AI networking libraries (such as NCCL) and protocols (such as RoCE and RDMA), including performance and correctness debugging.
• Ability to read and reason about source code (C/C++/Python or similar) and collaborate closely with developers on fixes.
• Solid scripting and automation skills with Bash / Python / Ansible for setup, log collection, and experiment o…

Responsibilities


Networking Solution Test Engineer – AI Cluster Debugging
We are looking for a networking test engineer with strong system‑level debugging skills to join our End‑to‑End Verification team. You will work on cutting‑edge Ethernet‑based AI clusters, owning complex issues across hardware, system software and AI workloads. 
What you’ll be doing
• Design and review test and product requirements across the Ethernet / NIC / DPU / Switch portfolio, focusing on large‑scale AI cluster behavior.
• Build and maintain realistic customer‑like testbeds, including heterogeneous hardware, OS / driver combinations and complex network fabrics.
• Own end‑to‑end cluster troubleshooting: reproduce customer scenarios, triage across the stack and drive issues to root cause and fix.
• Read and understand relevant source code to identify defects, validate fixes and improve logging and instrumentation.
• Collaborate closely with development teams to debug NCCL, RoCE/RDMA and related networking components using logs, code inspection and targeted experiments.
• Define tests and guide the automation team to implement robust suites that produce actionable logs, metrics and traces.
• Run Regression, Performance, Functional and Scale testing, analyze results and provide clear, data‑driven reports to stakeholders.
• Profile and benchmark deep learning training and inference workloads, correlating model‑level metrics with system and network telemetry to uncover bottlenecks.
Related positions

Experienced hire

• Contribute to design reviews and product feature requirements across the whole Ethernet / NIC / DPU / Switch portfolio. Design and build setup topologies that emulate large-scale, complex customer environments.
• Collaborate closely with multi-functional teams, including hardware engineers, software developers and domain experts, to deliver optimized solutions that meet the demanding requirements of AI workloads.
• Design tests and mentor the test automation team in implementing them. Generate comprehensive test reports during release execution, assist with reproducing and debugging complex customer use cases, determine the root cause of issues, and act as the engineering PIC for the full verification cycle of customer use cases.
• Complete end-to-end test scenarios in different scopes: Regression, Performance, Functional and Scale. Report testing progress and provide summary reports of testing activity.
• Profile, benchmark and analyze deep learning models to identify areas for optimization and improvement in performance, efficiency and accuracy, with a strong emphasis on networking aspects.
• Provide insights and recommendations based on the analysis of large-scale training results, focusing on networking bottlenecks and optimizations, to improve model outcomes and achieve business objectives.

Updated 2025-12-01 | Location: Shanghai, Beijing
Experienced hire

NVIDIA is the world leader in computer graphics, PC gaming, and accelerated computing. Today, we are tapping into the unlimited potential of AI to define the next era of computing, an era in which our GPU acts as the brains of edge computers and robots that can understand the world. Doing what has never been done before takes vision, innovation, and the world’s best talent. At NVIDIA, our employees are passionate about accelerated computing. We're united in our quest to transform the way accelerated computing is used for work and play. Our technology impacts the large language models behind everyday copilots, the visual experience in video game development, film production, space exploration, medicine, computational finance and automotive design. And we've only scratched the surface of what we can accomplish when we apply our technology to it. We need passionate, hard-working and creative people to help us pursue some of these outstanding opportunities.
We are now looking for a hardware expert to join the NVIDIA China FAE (Field Application Engineer) team to engage with and support NVIDIA networking product hardware solutions. As an FAE, you'll collaborate with the sales team to support our customers across networking chips, components and hardware systems. You'll establish relationships with top customers, tackle engineering problems, and help customers build a successful NVIDIA practice.
What you’ll be doing:
• Assist field business development in guiding the customer through the design-win process for NVIDIA data center solutions.
• Work with customers to understand their requirements, and lead support from architecture, schematics, simulation and layout through to production.
• Review customers’ hardware solutions and designs, support bring-up of customer designs, diagnose problems and resolve technical issues.
• Take an active role in assessing the technical details of customer projects. Build close technical relationships with customers and partners.
• Collaborate across the company, working with NVIDIA's worldwide hardware, software, application engineering and product teams to lead technical activities and customer support. Guide the direction of NVIDIA product implementations.

Updated 2025-12-23 | Location: Shenzhen
Experienced hire

• Provide Ethernet and routing expertise to customers during project delivery to design, architect and test Ethernet networking solutions.
• Work on multi-functional teams to provide Ethernet network expertise for server infrastructure builds, accelerated computing workloads and GPU-enabled AI applications.
• Craft and evaluate DevOps automation scripts for network operations, craft network architectures, and develop switch fabric configurations.
• Implement tasks related to network configuration and validation for data centers.
• Create Methods of Procedure and deployment documents.
• Use software tools to validate and monitor network performance.

Updated 2025-09-18 | Location: Beijing, Shanghai, Shenzhen
Experienced hire

• Provide Ethernet and routing expertise to customers during project delivery to design, architect and test Ethernet networking solutions.
• Work on multi-functional teams to provide Ethernet network expertise for server infrastructure builds, accelerated computing workloads and GPU-enabled AI applications.
• Craft and evaluate DevOps automation scripts for network operations, craft network architectures, and develop switch fabric configurations.
• Implement tasks related to network configuration and validation for data centers.
• Create Methods of Procedure and deployment documents.
• Use software tools to validate and monitor network performance.

Updated 2025-10-22 | Location: Beijing, Shanghai, Shenzhen