Baidu Large Model Distributed Training R&D Engineer (J85174)
Experienced hire · Full-time · TPG · Location: Beijing · Status: Hiring
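Requirements
- Passionate about large model training technology or deep learning framework technology
- Master's degree or above in computer software or a related field
- Development experience on Linux/Unix; familiar with multithreaded programming and network programming
- Familiar with large model training techniques (high performance, algorithmic strategies, cluster fault tolerance) or optimization techniques; candidates familiar with CUDA programming and high-performance optimization are preferred
- Understanding of PaddlePaddle (飞桨) or other distributed deep learning training frameworks such as DeepSpeed and Megatron; candidates with hands-on experience are preferred
- Excellent analytical and problem-solving skills, with a passion for tackling challenging problems
- Clear thinking, with good communication and comprehension skills
- Proactive at work, with a strong sense of responsibility
- Good teamwork spirit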
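Responsibilities
- Take part in and own training optimization and support for Baidu's ERNIE (文心) large models
- Own the development of distributed training features and architecture for PaddlePaddle (飞桨), a core Baidu product
- Take part in exploring and researching frontier large model training techniques and ultra-large-scale distributed training architectures
- Take part in optimizing the PaddlePaddle deep learning framework so that developers can implement all kinds of tasks more simply, lowering learning and development costs
- Own the design and development of heterogeneous high-performance computing platforms, and the development and optimization of high-performance computing libraries and communication libraries
- Explore algorithm-engineering co-optimization approaches for deep learning large language models, cross-modal models, and related areas
- Deliver high-quality development, self-testing, and project documentation in line with the overall technical plan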
Includes English materials
Large models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
Deep learning
https://d2l.ai/
Interactive deep learning book with code, math, and discussions.
Education
Linux
https://ryanstutorials.net/linuxtutorial/
Ok, so you want to learn how to use the Bash command line interface (terminal) on Unix/Linux.
https://ubuntu.com/tutorials/command-line-for-beginners
The Linux command line is a text interface to your computer.
https://www.youtube.com/watch?v=6WatcfENsOU
In this Linux crash course, you will learn the fundamental skills and tools you need to become a proficient Linux system administrator.
https://www.youtube.com/watch?v=v392lEyM29A
Never fear the command line again, make it fear you.
https://www.youtube.com/watch?v=ZtqBQ68cfJc
Unix
[English] The UNIX® Standard
https://www.opengroup.org/membership/forums/platform/unix
https://www.youtube.com/watch?v=IrDUcdpPmdI
UNIX is an operating system which was first developed in the 1970s, and has been under constant development ever since.
Multithreading
https://liaoxuefeng.com/books/java/threading/basic/index.html
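Compared with single-threaded programming, the defining characteristic of multithreaded programming is that threads frequently read and write shared data and therefore need to synchronize.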
https://www.youtube.com/watch?v=_uQgGS_VIXM&list=PLsc-VaxfZl4do3Etp_xQ0aQBoC-x5BIgJ
https://www.youtube.com/watch?v=IEEhzQoKtQU
https://www.youtube.com/watch?v=mTGdtC9f4EU&list=PLL8woMHwr36EDxjUoCzboZjedsnhLP1j4
https://www.youtube.com/watch?v=TPVH_coGAQs&list=PLk6CEY9XxSIAeK-EAh3hB4fgNvYkYmghp
https://www.youtube.com/watch?v=xPqnoB2hjjA
This video is an introduction to multithreading in modern C++.
https://www.youtube.com/watch?v=YKBwKy5PrpQ
Rust threading is easy to implement and improves the efficiency of your applications on multi-core systems!
Network programming
https://www.youtube.com/watch?v=2HrYIl6GpYg
I will make a simple HTTP web server with the C Programming Language.
https://www.youtube.com/watch?v=8z6okCgdREo
This tutorial is for Gophers who have written a command line or an API application, but have little to no experience in lower-level concepts like reading and writing to sockets, working with channels, and managing multiple goroutines.
https://www.youtube.com/watch?v=bdIiTxtMaKA&list=PL9IEJIKnBJjH_zM5LnovnoaKlXML5qh17
https://www.youtube.com/watch?v=bzja9fQWzdA
Implement the ubiquitous TCP protocol that underlies much of the traffic on the internet!
[English] 📺 Network Programming with Python Course (build a port scanner, mailing client, chat room, DDOS)
https://www.youtube.com/watch?v=FGdiSJakIS4
Learn network programming in Python by building four projects. You will learn to build a mailing client, a DDOS script, a port scanner, and a TCP Chat Room.
https://www.youtube.com/watch?v=gntyAFoZp-E
https://www.youtube.com/watch?v=JiuouCJQzSQ
Explore the fundamentals of networking in Rust by building a simple TCP server.
https://www.youtube.com/watch?v=JRTLSxGf_6w
https://www.youtube.com/watch?v=sFizpxHkIlI
In this video we'll cover SOCKET PROGRAMMING in JAVA.
https://www.youtube.com/watch?v=sXW_sNGvqcU
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
CUDA
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with Nvidia CUDA and leverage GPUs for high-performance computing and deep learning.
DeepSpeed
https://www.youtube.com/watch?v=pDGI668pNg0
Megatron
https://www.youtube.com/watch?v=hc0u4avAkuM
Related positions
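Internship · Engine
Responsibilities:
1. Take part in developing the distributed reinforcement learning (RL) training framework for models with hundreds of billions of parameters, improving training throughput and resource utilization at the scale of hundreds to thousands of accelerator cards.
2. Take part in adapting multimodal RL algorithm pipelines (such as DAPO) for models of 100B+ parameters, and in validating RL correctness across tasks in various domains.
3. Experiment with and tune different parallelism strategies (Tensor/ZeRO/FSDP/Pipeline Parallelism) to find the best configuration combinations for ultra-large-scale models.
4. Help locate and analyze key performance bottlenecks in distributed training (such as low GPU utilization, GPU memory bottlenecks, network communication congestion, and I/O latency), then design, implement, and validate optimizations.
5. Take part in developing and optimizing key features of the training engine, such as stable checkpoint-resume training on large clusters, a high-performance asynchronous rollout mechanism, and the integration and optimization of high-performance operators (kernels).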
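Experienced hire · 3+ years · CSIG · Technology
1. Framework development and optimization: design and develop core modules for reinforcement learning, model fine-tuning, knowledge distillation, and related areas, improving the framework's training efficiency and usability.
2. Distributed training support: building on tools such as Megatron-LM and DeepSpeed, optimize distributed training strategies for large models (data parallelism, tensor parallelism, pipeline parallelism, expert parallelism, etc.) and resolve GPU memory, communication, and compute bottlenecks.
3. Toolchain construction: take part in developing lightweight training frameworks (such as LLama-Factory and swift) that support fast model fine-tuning, deployment, and adaptation to multiple hardware platforms.
4. Frontier technology exploration: track academic developments (such as RLHF, MoE architectures, FlashMLA, EPLB, and DualPipe) and turn the latest research results into framework features that strengthen product competitiveness.
5. Collaboration and documentation: work closely with the product team to provide framework-level solutions; write technical documentation and case studies to support public cloud customers.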
Updated on 2025-06-17