NVIDIA AI Computing Performance Architect
Requirements
• An MS or PhD in a relevant field such as Computer Science, Electrical Engineering, or Mathematics.
• At least 3 years of professional experience with performance modeling, analysis, and code optimization for deep learning operators on GPU, CPU, or LPU, including hands-on assembly or SIMD programming.
• Solid foundation in computer architecture.
• Proficiency in programming languages such as C, C++, Perl, or Python.
Ways to stand out from the crowd:
• You’re knowledgeable about LLM frameworks and their fundamentals.
• Experience with parallel programming and CUDA or Open…
Responsibilities
• Analyze the performance of a wide range of machine learning and deep learning algorithms across existing and emerging architectures.
• Identify bottlenecks and devise creative software solutions or recommend improvements in GPU architectures.
• Explore and evaluate how hardware and software architectures interact with future algorithms and applications.
Responsibilities
• Collaborate with the GPU sales team and the SCE AIML TPM team to provide technical support for customers at both the pre-sales and after-sales stages.
• Take ownership of problems and work to identify solutions.
• Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
• Collaborate with customers’ scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
• Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
• Optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques.
• Troubleshoot infrastructure performance, scalability, and reliability issues, and implement solutions to mitigate risks and minimize downtime.
• Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
• Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.
Qualifications
• Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes.
• Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
• Solid understanding of networking concepts, security principles, and best practices.
• Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
• Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
• Strong documentation skills, with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps to facilitate knowledge sharing, ensure maintainability, and enhance team collaboration.
• Strong Linux skills, with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization.
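Team introduction: ByteDance's Doubao large model team (Seed) was founded in 2023. It is dedicated to finding new approaches to general intelligence, pursuing the upper limits of intelligence, and exploring new forms of interaction. The team's research covers LLMs, speech, vision, world models, foundational architecture, AI Infra, and next-generation AI interaction, with labs and positions in China, Singapore, the United States, and other locations.
The Doubao large model team has a long-term vision and commitment in AI. It stays focused on fundamentals and aims to become a world-class AI research team that contributes to technological and social progress. The team has released industry-leading general-purpose large models and cutting-edge multimodal capabilities, supporting more than 50 application scenarios, including Doubao, Coze, and Jimeng.
1. Design and develop resource scheduling for machine learning systems, serving model training, model evaluation, and model inference across scenarios (NLP/CV/Speech, etc.);
2. Optimally orchestrate heterogeneous resources (GPU, CPU, other heterogeneous hardware), making sound use of stable resources, tidal resources, co-located resources, and multi-cloud resources;
3. Use technical means to achieve optimal scheduling of compute resources, RDMA high-speed network resources, and storage resources, fully exploiting the computing power of large-scale distributed clusters;
4. Schedule online and offline jobs and services across multiple data centers, regions, and clouds, so that global load is distributed sensibly.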
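1. Design and develop the inference architecture and products for machine learning systems, supporting the product business of the Volcano Ark (火山方舟) large model platform and the machine learning platform;
2. Design and optimize online architecture centered on deep model inference workloads, making full use of heterogeneous compute (GPU, CPU, other heterogeneous hardware), storage (various cloud storage services), and network (VPC, RDMA) resources; build stability and observability systems for multi-tenant environments; and deliver high-concurrency, high-throughput, large-scale online systems;
3. Productize the inference system, building a stable, observable public cloud inference platform with a first-class user experience.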