NVIDIA Senior AI Performance and Efficiency Engineer
Requirements
• BS or similar background in Computer Science or a related area (or equivalent experience)
• 8+ years of experience designing and operating large-scale compute infrastructure
• Strong understanding of modern ML techniques and tools
• Experience investigating and resolving training & inference performance issues end to end
• Debugging and optimization experience with Nsight Systems and Nsight Compute (a short profiling sketch follows this list)
• Experience debugging large-scale distributed training using NCCL
• Proficiency in programming & scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms
• Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML in…
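To make the Nsight and NCCL items above concrete, here is a minimal sketch, assuming PyTorch and a CUDA device, of annotating a toy training step with NVTX ranges so Nsight Systems can attribute GPU time to named phases. The toy model, the step names, and the NCCL_DEBUG setting are illustrative additions, not part of the posting.

```python
# Sketch: annotate a toy PyTorch training step with NVTX ranges so Nsight
# Systems (nsys) can attribute GPU time to named phases.
# Assumes PyTorch and a CUDA device; the model and names are placeholders.
import os
import torch
import torch.nn as nn

# NCCL_DEBUG=INFO is a common first step when debugging distributed training:
# it makes NCCL log communicator setup, topology, and errors (only relevant
# once torch.distributed/NCCL is actually in use).
os.environ.setdefault("NCCL_DEBUG", "INFO")

model = nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")

for step in range(3):
    torch.cuda.nvtx.range_push(f"step_{step}/forward")
    loss = model(x).square().mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push(f"step_{step}/backward")
    loss.backward()
    opt.step()
    opt.zero_grad()
    torch.cuda.nvtx.range_pop()

# Capture a timeline with, e.g.:  nsys profile -t cuda,nvtx python train_step.py
```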
Responsibilities
• Collaborate closely with our AI/ML researchers to make their ML models more efficient, leading to significant productivity improvements and cost savings
• Build tools and frameworks, and apply ML techniques, to detect and analyze efficiency bottlenecks and deliver productivity improvements for our researchers
• Work with researchers on a variety of innovative ML workloads across robotics, autonomous vehicles, LLMs, video, and more
• Collaborate across engineering organizations to deliver efficiency in our usage of hardware, software, and infrastructure
• Proactively monitor fleet-wide utilization patterns, analyze existing inefficiency patterns or discover new ones, and deliver scalable solutions to address them (a minimal monitoring sketch follows this list)
• Keep up to date with the most recent developments in AI/ML technologies, frameworks, and successful strategies, and advocate for their integration within the organization.
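As a concrete illustration of the fleet-monitoring responsibility above, here is a minimal sketch that samples per-GPU utilization via NVML (pynvml). The low-utilization threshold and the output format are hypothetical; a real agent would export these samples to a fleet-wide metrics system.

```python
# Sketch: sample per-GPU SM/memory utilization with NVML (pynvml), the kind of
# signal a fleet-wide efficiency monitor would aggregate.
import time
import pynvml

pynvml.nvmlInit()
try:
    ngpu = pynvml.nvmlDeviceGetCount()
    for _ in range(3):                          # a real agent runs continuously
        for i in range(ngpu):
            h = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)  # % over last window
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)         # bytes
            flag = "LOW-UTIL" if util.gpu < 30 else "ok"    # hypothetical threshold
            print(f"gpu{i}: sm={util.gpu}% mem_bw={util.memory}% "
                  f"mem_used={mem.used / 2**30:.1f}GiB [{flag}]")
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()
```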
A key part of NVIDIA's strength is our sophisticated analysis and debugging tools, which empower NVIDIA engineers to improve the performance and power efficiency of our products and the applications running on them. We are looking for forward-thinking, hard-working, and creative people to join a multifaceted software team with high standards! This software engineering role involves developing tools for AI researchers and SW/HW teams running AI workloads on GPU clusters. As a member of the software development team, you will work with users from departments such as architecture and software teams. Our work gives users intuitive, rich, and accurate insight into the workload and the system, and empowers them to find opportunities in software and hardware, build high-level models to propose and deliver the best hardware and software to our customers, and debug tricky failures and issues to improve the performance and efficiency of the system.
What you’ll be doing:
• Build internal profiling and analysis tools for AI workloads at large scale
• Build debugging tools for commonly encountered problems such as memory and networking issues
• Create benchmarking and simulation technologies for AI systems and GPU clusters (see the benchmarking sketch after this list)
• Partner with HW architects to propose new features or improve existing ones based on real-world use cases
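A minimal sketch of the kind of benchmarking building block referenced above, assuming PyTorch on a CUDA device: timing a GPU operation with CUDA events. The matmul workload, sizes, and iteration counts are placeholders, not anything prescribed by the posting.

```python
# Sketch: micro-benchmark a GPU op with CUDA events and report throughput.
import torch

def time_gpu_op(fn, warmup=10, iters=50):
    """Return mean milliseconds per call of fn(), measured with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters      # elapsed_time() is in milliseconds

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
ms = time_gpu_op(lambda: a @ b)
tflops = 2 * 4096**3 / (ms * 1e-3) / 1e12       # 2*N^3 FLOPs for an N x N matmul
print(f"{ms:.3f} ms/iter, ~{tflops:.1f} TFLOP/s")
```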

• Understand business scenarios and design targeted data acquisition solutions, ensuring data is relevant, high-quality, and aligned with project goals.
• Architect, design, and maintain enterprise-grade databases, data warehouses, and lakehouse systems to support analytical, operational, and AI workloads.
• Model and optimize schema design, storage layouts, data partitioning, clustering, and indexing strategies for large-scale datasets.
• Implement and maintain ETL/ELT pipelines feeding data warehouses (e.g., Snowflake, BigQuery, Redshift, Databricks, or open-lakehouse environments).
• Design, collect, and maintain high-quality datasets for AI inference and LLM optimization, fine-tuning, and testing, ensuring data is formatted and preprocessed to meet model requirements.
• Collaborate with AI application engineers to understand model performance requirements and translate them into targeted data collection and preparation strategies.
• Develop and implement automated data pipelines for efficient data processing, including data cleaning, labeling, augmentation, and transformation (a minimal pipeline sketch follows this list).
• Proactively identify data gaps based on model performance metrics, and design solutions to acquire, clean, and optimize data for improved model accuracy and efficiency.
• Build, clean, and manage diverse data sources, ensuring compliance with data security and privacy standards.
• Conduct exploratory data analysis to discover data patterns, anomalies, and optimization opportunities that directly impact model performance.
• Continuously learn and adapt to the latest advancements in data engineering, AI, and large language model (LLM) technologies.
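To illustrate the automated-pipeline bullet above, here is a minimal sketch, assuming pandas and pyarrow, of one cleaning-and-partitioning step: normalizing records, dropping exact duplicates, and writing a Parquet dataset partitioned by a hypothetical `source` column. The column names and output path are illustrative, not taken from the posting.

```python
# Sketch: tiny cleaning/partitioning step of the kind such pipelines chain
# together before fine-tuning or evaluation jobs consume the data.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

raw = pd.DataFrame({
    "source": ["web", "web", "docs"],
    "prompt": ["  What is NVLink? ", "What is NVLink?", "Explain ETL vs ELT"],
    "response": ["NVLink is ...", "NVLink is ...", "ETL transforms before load ..."],
})

# Basic cleaning: trim whitespace, drop empty rows, drop exact duplicate pairs.
clean = raw.assign(
    prompt=raw["prompt"].str.strip(),
    response=raw["response"].str.strip(),
)
clean = clean[(clean["prompt"] != "") & (clean["response"] != "")]
clean = clean.drop_duplicates(subset=["prompt", "response"])

# Write a partitioned Parquet dataset so downstream jobs can prune by partition
# instead of scanning everything.
pq.write_to_dataset(
    pa.Table.from_pandas(clean, preserve_index=False),
    root_path="finetune_dataset",      # hypothetical output path
    partition_cols=["source"],
)
```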
THE ROLE: AMD drives innovation at the intersection of performance and efficiency to shape the future of AI, cloud computing, and high-performance servers. We seek an experienced IC Connector Engineer to lead the design and development of high-speed connectors, cables, and sockets for advanced servers and AI platforms. This role requires deep technical expertise and program leadership to deliver reliable, cost-effective solutions at scale.
As chip sizes continue to grow, power efficiency has become paramount across all applications, from data centers to automotive and personal computing. Our PMU IP, developed over the past 13 years, is crucial to optimizing chip performance and efficiency in both idle and active scenarios. The PMU IP consists of a RISC-V core and custom-designed control logic. It collects and processes data from the entire chip, working in tandem with software running on the RISC-V core to determine optimal operating points (a conceptual sketch of this control loop follows below). We are seeking a Senior ASIC Engineer who can help architect the next-generation PMU for AI data centers.
What you’ll be doing:
• Collaborate with the production SW team and power architecture team to define the architecture/micro-architecture for various power features.
• Learn how the PMU's function impacts the system and support silicon debug.
• Implement the micro-architecture in RTL.
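Purely as a conceptual illustration of the telemetry-to-operating-point loop the posting describes (real PMU firmware would be C or assembly driving custom hardware, not Python), here is a hedged sketch. The telemetry fields, operating-point table, and thresholds are all hypothetical.

```python
# Conceptual sketch only: firmware on the PMU's RISC-V core reads chip
# telemetry and picks a voltage/frequency operating point within budgets.
from dataclasses import dataclass

@dataclass
class Telemetry:
    temp_c: float          # hottest sensor reading
    power_w: float         # package power estimate
    utilization: float     # 0.0 .. 1.0 activity from performance counters

# Hypothetical operating points, lowest to highest: (voltage V, frequency Hz).
OPERATING_POINTS = [(0.70, 1.2e9), (0.80, 1.6e9), (0.90, 2.0e9)]
POWER_LIMIT_W = 350.0      # hypothetical package power budget
TEMP_LIMIT_C = 90.0        # hypothetical thermal limit

def select_operating_point(t: Telemetry) -> tuple[float, float]:
    """Pick the highest point that stays within thermal and power budgets."""
    if t.temp_c > TEMP_LIMIT_C or t.power_w > POWER_LIMIT_W:
        return OPERATING_POINTS[0]            # over budget: throttle hard
    if t.utilization < 0.2:
        return OPERATING_POINTS[0]            # idle: save power
    if t.utilization < 0.7:
        return OPERATING_POINTS[1]
    return OPERATING_POINTS[-1]               # busy and within budget

print(select_operating_point(Telemetry(temp_c=72.0, power_w=280.0, utilization=0.85)))
```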