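Ant Financial | Ant Digital Technologies - Digital Technology Line - LLM Training and Inference Engine Engineer
Experienced hire | Full-time | 5+ years | Technical - Algorithms
Location: Beijing | Shanghai | Hangzhou | Shenzhen | Chengdu
Status: Hiring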
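Requirements
1. Programming fundamentals: proficient in Python / C++; hands-on experience with GPU programming (CUDA / Triton) or distributed systems is preferred;
2. Framework depth: source-level understanding of at least one of Megatron-LM / DeepSpeed / vLLM / SGLang, rather than…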
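Responsibilities
Depending on your track, you will take on one or more of the following responsibilities:
1. Distributed training track:
(1) Design and tune large-scale hybrid parallelism strategies (combinations of TP/PP/DP/EP/CP) and continuously improve cluster MFU;
(2) Profile training performance and resolve end-to-end bottlenecks; serve as the final escalation point for training-stability issues (NCCL timeouts, gradient anomalies, GPU memory leaks);
(3) Deeply adapt, customize, and fix bugs in training frameworks such as Megatron-LM / DeepSpeed; lead the design and execution of the distributed solution for the mid-train stage; act as the core technical interface with the base model team;
2. Inference serving track:
(1) Build and maintain inference services (deploying, upgrading, and deeply tuning vLLM / SGLang);
(2) Drive batch inference optimization and the rollout of quantization schemes, finding the commercially optimal trade-off between accuracy and throughput;
(3) Build the toolchain for checkpoint management and model format conversion; serve as the bridge layer between the base model team and the algorithm teams, and provide day-to-day infra support;
3. Agentic RL sandbox track:
(1) Build the sandbox environment infrastructure for Agentic RL training from scratch, acting as the technical hub for multi-team collaboration;
(2) Design standardized environment interfaces (observation / action / reward) and ensure sandbox isolation, fault tolerance, and resource control;
(3) Optimize sandbox throughput and latency, and continuously integrate new tools and environments;
4. LLM training participation track:
(1) Own targeted data synthesis, supplying high-quality data to fuel model training;
(2) Take part in targeted capability optimization and model training, turning engine capabilities into real gains in model capability.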
Includes English-language materials
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for complete beginners, with full examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
C++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
CUDA
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with Nvidia CUDA and leverage GPUs for high-performance computing and deep learning.
Triton Inference Server
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
Triton Inference Server is an open source inference serving software that streamlines AI inferencing.
Distributed systems
https://www.distributedsystemscourse.com/
The home page of a free online class in distributed systems.
https://www.youtube.com/watch?v=7VbL89mKK3M&list=PLOE1GTZ5ouRPbpTnrZ3Wqjamfwn_Q5Y9A