miHoYo (米哈游) LLM Post-train Algorithm Researcher - 星布谷地
Experienced hire, full-time, 3+ years of experience, Programming & Technology
Location: Shanghai | Beijing
Status: Hiring
Requirements
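1) Master's degree or above in computer science, artificial intelligence, machine learning, NLP, or a related field
2) 3+ years of experience in large-model training or NLP algorithms, with hands-on project experience in SFT, RLHF/DPO, and Reward Model training
3) Familiarity with the principles of Transformer / MoE architectures, and proficiency with PyTorch and mainstream large-model training/inference frameworks (e.g., DeepSpeed, Megatron-LM, VeRL, Slime, vLLM, SGLang)
4) Strong engineering skills: able to independently design and build training pipelines, and to quickly reproduce and improve on state-of-the-art algorithms
5) Sensitivity to data quality, with experience building high-quality SFT/preference data and an understanding of how data affects model performance
6) A solid foundation in reinforcement learning, with an understanding of PPO/DPO/GR…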
Responsibilities
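1) Post-training algorithm R&D: contribute to post-training algorithm development for large models in scenarios such as game content and role-play, covering the implementation and optimization of alignment methods such as SFT, RLHF, and DPO, to improve the model's capabilities in story generation, character consistency, dialogue coherence, and emotional expression
2) Reward models and alignment signals: design and train Reward Models, explore the construction of multi-dimensional reward signals (e.g., instruction following, dialogue coherence, creativity, safety), reduce Reward Hacking and bias, and provide high-quality training signals for reinforcement learning
3) Reinforcement learning training and optimization: carry out model alignment training with reinforcement learning algorithms such as PPO/GRPO, explore scalable Verifier signals and RL strategies, improve training stability and efficiency, and advance the model's reasoning and generation capabilities in complex multi-turn dialogue and open-domain scenarios
4) High-quality data engineering: own data governance for the post-training stage, including SFT data construction, preference data collection and cleaning, synthetic data generation, and data mixing strategy design, and address data scarcity in the context of business scenarios
5) Training multiple model types: beyond dialogue models, take part in training and tuning other auxiliary models (e.g., classifiers, decision models) to support the overall model product system
6) Frontier technology exploration: track the latest research in Post-training (e.g., RLAIF, On-Policy Distillation, reasoning-chain compression), and carry out technical pre-research and innovative deployment based on the needs of the game dialogue business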
Includes English-language materials
Education+
Machine learning+
https://www.youtube.com/watch?v=0oyDqO8PjIg
Learn about machine learning and AI with this comprehensive 11-hour course from @LunarTech_ai.
https://www.youtube.com/watch?v=i_LwzRVP7bg
Learn Machine Learning in a way that is accessible to absolute beginners.
https://www.youtube.com/watch?v=NWONeJKn6kc
Learn the theory and practical application of machine learning concepts in this comprehensive course for beginners.
https://www.youtube.com/watch?v=PcbuKRNtCUc
Learn about all the most important concepts and terms related to machine learning and AI.
NLP+
https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S
Welcome to Zero to Hero for Natural Language Processing using TensorFlow!
https://www.youtube.com/watch?v=R-AG4-qZs1A&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX
Natural Language Processing tutorial for beginners series in Python.
https://www.youtube.com/watch?v=rmVRLeJRkl4&list=PLoROMvodv4rMFqRtEuo6SGjY4XbRIVRd4
The foundations of the effective modern methods for deep learning applied to NLP.
Large models+
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
Algorithms+
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
SFT+
https://cameronrwolfe.substack.com/p/understanding-and-using-supervised
Understanding how SFT works from the idea to a working implementation...
RLHF+
[English] What is RLHF?
https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/
Reinforcement learning from human feedback (RLHF) is a machine learning (ML) technique that uses human feedback to optimize ML models to self-learn more efficiently.
https://www.ibm.com/think/topics/rlhf
Reinforcement learning from human feedback (RLHF) is a machine learning technique in which a “reward model” is trained with direct human feedback, then used to optimize the performance of an artificial intelligence agent through reinforcement learning.
Transformer+
https://huggingface.co/learn/llm-course/en/chapter1/4
Breaking down how Large Language Models work, visualizing how data flows through.
https://poloclub.github.io/transformer-explainer/
An interactive visualization tool showing you how transformer models work in large language models (LLM) like GPT.
https://www.youtube.com/watch?v=wjZofJX0v4M
Breaking down how Large Language Models work, visualizing how data flows through.