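Tongyi Token Foundry - LLM Post-Training Algorithm Expert - Hangzhou/Beijing
Experienced hire | Full-time | 1+ years | Technical - Algorithms | Location: Beijing, Hangzhou | Status: Hiring
Requirements
1. Master's degree or above in computer science, artificial intelligence, mathematics, or a related field, with at least 1 year of hands-on R&D experience in NLP/LLMs.
2. A holistic understanding of the LLM post-training stage: has taken part, as a core contributor, in complete SFT and RLHF iteration cycles for models with tens or hundreds of billions of parameters, and knows the common pitfalls and fixes at each stage.
3. Proficient in mainstream alignment algorithms (e.g., SFT, DPO, GRPO, PPO) and their underlying mathematics; familiar with Megatron-LM, Deep…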
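As one illustration of the "underlying mathematics" these requirements refer to, here is a minimal sketch of the DPO (Direct Preference Optimization) loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model are already computed; the function names, argument names, and the default beta = 0.1 are illustrative choices, not taken from this posting or from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    # Log-ratio of policy vs. reference for the preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled margin between the two implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is identical to softplus(-margin).
    return softplus(-margin)


# The policy has shifted probability toward the chosen answer and away from the
# rejected one, so the loss falls below log(2), its value when policy == reference.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1))
```

Writing -log σ(m) as softplus(-m) avoids overflow for large negative margins; a real trainer would apply the same formula to batched tensors.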
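Responsibilities
1. Own algorithm tuning and strategy design across every stage, from Pre-SFT and SFT to RLHF (DPO/PPO/GRPO, etc.), to improve the model's overall capability along core dimensions such as style control, safety and compliance, domain knowledge injection, multi-turn interaction logic, creative writing, instruction following, emotion control, and agent planning.
2. Design high-quality data collection, cleaning, and annotation schemes for SFT, RL, and other stages; establish data-quality evaluation standards and processes; continuously improve data quality and diversity.
3. Take part in model consolidation work, including data mixture strategy, multi-task training optimization, and mitigation of catastrophic forgetting, so that each business line's agent capabilities are efficiently integrated into the base model.
4. Explore post-training approaches for multimodal (text, vision, audio, etc.) scenarios that address cross-modal alignment and hallucination; explore high-quality data synthesis, self-play, agentic RL, and related methods; work closely with the pretraining, RL, and evaluation teams to drive end-to-end, application-oriented optimization of the base model.
5. Help improve the conversational assistant app, raising dialogue quality, safety, and user experience through fine-grained SFT/RL strategies.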
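GRPO, one of the RLHF methods named in the responsibilities, replaces PPO's learned value function (critic) with a baseline computed from a group of responses sampled for the same prompt. The sketch below shows only that group-relative advantage step, assuming scalar rewards are already available; the clipped policy-gradient objective and KL penalty are omitted, and the function name and eps value are hypothetical.

```python
import statistics


def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages for G responses sampled from one prompt.

    Each reward is normalized by the group's mean and standard deviation, so the
    group itself serves as the baseline instead of a learned critic.
    """
    mean = statistics.fmean(group_rewards)
    # Population std; some implementations use the sample std instead.
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]


# Four sampled answers to one prompt, scored 1 (correct) or 0 (incorrect) by a verifier.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is one reason prompt difficulty matters for GRPO-style training.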
Includes English-language materials
Education
NLP
https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S
Welcome to Zero to Hero for Natural Language Processing using TensorFlow!
https://www.youtube.com/watch?v=R-AG4-qZs1A&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX
A beginner-level Natural Language Processing tutorial series in Python.
https://www.youtube.com/watch?v=rmVRLeJRkl4&list=PLoROMvodv4rMFqRtEuo6SGjY4XbRIVRd4
The foundations of the effective modern methods for deep learning applied to NLP.
Large Language Models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
SFT
https://cameronrwolfe.substack.com/p/understanding-and-using-supervised
Understanding how SFT works from the idea to a working implementation...
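At its core, SFT minimizes next-token cross-entropy on demonstration responses, usually with prompt tokens masked out of the loss. The pure-Python sketch below illustrates that masked loss, assuming the logits are already shifted so that step_logits[i] predicts target_ids[i]; the function names and the 0/1 mask convention are illustrative assumptions, not taken from the linked article.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def sft_loss(step_logits: list[list[float]], target_ids: list[int],
             loss_mask: list[int]) -> float:
    """Mean next-token negative log-likelihood over response tokens only.

    loss_mask[i] is 1 for response tokens and 0 for prompt/system tokens,
    which the model conditions on but is not trained to reproduce.
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / max(count, 1)  # guard against an example with no response tokens


# Toy 4-token vocabulary: the first position is prompt (masked), the rest are response.
print(sft_loss([[10.0, 0.0, 0.0, 0.0], [0.0] * 4, [0.0] * 4], [1, 2, 3], [0, 1, 1]))
```

Averaging over response tokens only keeps long prompts from diluting the training signal; in PyTorch the same effect is commonly achieved by setting masked labels to the `ignore_index` of `cross_entropy`.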
RLHF
[English] What is RLHF?
https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/
Reinforcement learning from human feedback (RLHF) is a machine learning (ML) technique that uses human feedback to optimize ML models to self-learn more efficiently.
https://www.ibm.com/think/topics/rlhf
Reinforcement learning from human feedback (RLHF) is a machine learning technique in which a “reward model” is trained with direct human feedback, then used to optimize the performance of an artificial intelligence agent through reinforcement learning.
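The first RLHF step described above, training a reward model from human feedback, is commonly done with pairwise comparisons and a Bradley-Terry loss: the model should score the human-preferred response higher than the rejected one. This is a minimal sketch under that assumption, taking the reward model's scalar scores as given; the helper names are hypothetical, and the later RL step (e.g., PPO against this reward) is not shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_model_loss(chosen_rewards: list[float],
                      rejected_rewards: list[float]) -> tuple[float, float]:
    """Mean pairwise (Bradley-Terry) loss and pairwise accuracy.

    chosen_rewards[i] and rejected_rewards[i] are the scalar scores assigned to
    the human-preferred and dispreferred responses of pair i (non-empty input).
    """
    losses, correct = [], 0
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        # P(chosen preferred) = sigmoid(r_c - r_r); the loss is its negative log.
        losses.append(softplus(-(r_c - r_r)))
        correct += int(r_c > r_r)
    n = len(losses)
    return sum(losses) / n, correct / n


loss, accuracy = reward_model_loss([1.0, 0.0], [0.0, 0.0])
print(loss, accuracy)
```

Only score differences enter the loss, so the reward model's absolute scale is arbitrary; this is why rewards are often normalized before the RL stage.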
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/