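Tencent Large Model Training Performance Optimization Engineer (Training Operators) (Shenzhen/Beijing/Shanghai/Hangzhou)
Experienced hire · Full-time · 2+ years · Public Technology · Location: Shenzhen · Status: Hiring
Requirements
1. Bachelor's degree or above in computer science, software engineering, mathematics, electronic information, automation, or a related field;
2. Solid programming fundamentals, proficiency in C/C++, and high standards for code quality and engineering practice;
3. Proficiency in GPU programming with hands-on CUDA development experience; familiarity with one or more operator development/optimization frameworks such as CUTLASS or Triton;
4. Familiarity with parallel computing principles and a fairly deep understanding of GPU architecture (SM, Warp, Memory Hierarchy, Occupancy, etc.);
5. Hands-on experience with 3D parallel training (e.g. data parallelism, model parallelism, pipeline parallelism, hybrid parallelism) and the ability to understand and analyze its impact on operators and communication patterns;
6. Strong problem-diagnosis and performance-analysis skills; proficient with Nsight, nvprof, p…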
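Responsibilities
1. Design, implement, and optimize operators for deep learning training (CUDA/CUTLASS/Triton);
2. For large-model training scenarios, perform end-to-end performance analysis and tuning of operators, continually finding room to improve throughput, latency, GPU memory utilization, and other metrics;
3. Participate in or lead the design and optimization of operator and communication schemes in 3D parallel (Data / Tensor / Pipeline Parallel, etc.) training systems;
4. Collaborate closely with the distributed training, systems, and model algorithm teams to improve the overall efficiency and stability of large-scale training jobs;
5. Track cutting-edge hardware architecture and system software (GPU architecture, networking, compilers, libraries, etc.) and turn the latest technology into real performance gains.
Includes English-language materials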
FSDP
https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html
In DistributedDataParallel (DDP) training, each rank owns a model replica and processes a batch of data; finally, it uses all-reduce to sync gradients across ranks.
https://www.youtube.com/watch?v=PjEwLgyzuzQ
FSDP provides a comprehensive framework for large model training in PyTorch.
CUDA
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with Nvidia CUDA and leverage GPUs for high-performance computing and deep learning.
Kernel
https://www.youtube.com/watch?v=C43VxGZ_ugU
I rummage around the Linux kernel source and try to understand what makes computers do what they do.
https://www.youtube.com/watch?v=HNIg3TXfdX8&list=PLrGN1Qi7t67V-9uXzj4VSQCffntfvn42v
Learn how to develop your very own kernel from scratch in this programming series!
https://www.youtube.com/watch?v=JDfo2Lc7iLU
Denshi goes over a simple explanation of what computer kernels are and how they work, alongside what makes the Linux kernel special.
Triton Inference Server
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
Triton Inference Server is an open source inference serving software that streamlines AI inferencing.
Performance tuning
https://goperf.dev/
The Go App Optimization Guide is a series of in-depth, technical articles for developers who want to get more performance out of their Go code without relying on guesswork or cargo cult patterns.
https://web.dev/learn/performance
This course is designed for those new to web performance, a vital aspect of the user experience.
https://www.ibm.com/think/insights/application-performance-optimization
Application performance is not just a simple concern for most organizations; it’s a critical factor in their business’s success.
https://www.oreilly.com/library/view/optimizing-java/9781492039259/
Performance tuning is an experimental science, but that doesn’t mean engineers should resort to guesswork and folklore to get the job done.
Nsight
https://developer.nvidia.com/tools-tutorials
NVIDIA Nsight™ Developer tools are a suite of tools for building, profiling, and debugging accelerated applications.
https://www.youtube.com/watch?v=aQ1NYoRvp7o
Profile Python for AI and deep learning applications with NVIDIA's suite of Nsight Developer Tools.
https://www.youtube.com/watch?v=Iuy_RAvguBM
Join NVIDIA’s Jackson Marusarz for an introduction to NVIDIA Nsight Compute, a tool for in-depth analysis of CUDA kernel performance on GPUs.
NVIDIA Visual Profiler
https://developer.nvidia.com/nvidia-visual-profiler
The NVIDIA Visual Profiler is a cross-platform performance profiling tool that delivers developers vital feedback for optimizing CUDA C/C++ applications.
https://docs.nvidia.com/cuda/profiler-users-guide/
The user manual for NVIDIA profiling tools for optimizing performance of CUDA applications.
https://www.youtube.com/watch?v=F_BazucyCMw
https://www.youtube.com/watch?v=SI4UMz430ZU
This video tutorial has been taken from Learning CUDA 10 Programming.
C++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, starts from zero, with complete examples, based on the latest Python 3 release.
https://www.learnpython.org/
A free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Data structures
https://www.youtube.com/watch?v=8hly31xKli0
In this course you will learn about algorithms and data structures, two of the fundamental topics in computer science.
https://www.youtube.com/watch?v=B31LgI4Y4DQ
Learn about data structures in this comprehensive course. We will be implementing these data structures in C or C++.
https://www.youtube.com/watch?v=CBYHwZcbD-s
Data Structures and Algorithms full course tutorial (Java)
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
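Related positions
Campus recruitment · J1020
1. Contribute to large-model inference/training optimization. By developing industry-leading AI Compiler technology, support training compute performance optimization on GPUs for search and recommendation scenarios; support the deployment of large-model inference optimization techniques on heterogeneous hardware; 2. Take on functional development tasks required for large-model inference; develop compiler optimization features, continually pushing the limits of compute performance on different GPU models through graph optimization, operator fusion, high-performance GPU operator development, automatic codegen, and related techniques; 3. Support day-to-day deployment of large-model inference services and contribute to internal productivity tooling.
Updated 2025-08-11 · Hangzhou | Shenzhen | Beijing
Campus recruitment · J1020
1. Contribute to large-model inference/training optimization. By developing industry-leading AI Compiler technology, support training compute performance optimization on GPUs for search and recommendation scenarios; support the deployment of large-model inference optimization techniques on heterogeneous hardware; 2. Take on functional development tasks required for large-model inference; develop compiler optimization features, continually pushing the limits of compute performance on different GPU models through graph optimization, operator fusion, high-performance GPU operator development, automatic codegen, and related techniques; 3. Support day-to-day deployment of large-model inference services and contribute to internal productivity tooling.
Updated 2025-07-22 · Beijing | Shenzhen | Hangzhou
Experienced hire · TEG Technology
1. Help develop and optimize the large-model training framework, supporting efficient and stable training at a scale of more than 10,000 GPUs per job; 2. Contribute to the architecture design of NLP and multimodal large models, and work with business teams to validate training efficiency and model quality; 3. Contribute to training performance acceleration for text-to-image, text-to-video, and text-to-3D workloads; 4. Contribute to low-precision training performance optimization and its rollout to business teams, and to performance optimization for large-context-window training.
Updated 2025-05-26 · Beijing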