
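SenseTime Large Model Reinforcement Learning Researcher
Requirements
1. Master's degree or above in computer science, artificial intelligence, statistics, applied mathematics, or a related field.
2. Familiarity with mainstream reinforcement learning algorithms (e.g., PPO, GRPO…
Responsibilities
1. Design and implement reinforcement learning algorithms for multimodal large models to improve their performance in reasoning, tool use, and agentic capabilities.
2. Contribute to the design and implementation of reinforcement learning training pipelines for large models, including reward modeling, policy optimization, and alignment training.
3. Follow the latest results from leading institutions in agentic RL, multi-agent learning, and multimodal reasoning; propose innovative methods grounded in frontier research and drive breakthroughs in complex reasoning, long-horizon planning, and multi-turn dialogue.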
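1. Lead the team in frontier algorithm research, focusing on the design and optimization of reinforcement learning algorithms for large models, covering reinforcement learning algorithms, reward modeling, world models, and related directions.
2. Run large-scale experimental validation in scenarios such as complex reasoning and autonomous exploration and learning with large models, drive the practical application of research results in industry, and publish influential academic papers.
3. Explore frontier large-model technologies and, with future real-world application scenarios in mind, provide innovative technical solutions.
4. Collaborate with cross-functional teams to keep projects on track, and play a leading role in technical breakthroughs.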
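1. Lead the team in frontier algorithm research, concentrating on the design and optimization of reinforcement learning algorithms for large models; research directions include but are not limited to reinforcement learning algorithms, reward modeling, and world models.
2. Validate reinforcement learning algorithms through large-scale experiments in scenarios such as complex reasoning and autonomous exploration and learning with large models, drive adoption of research results in industry, and publish influential papers.
3. Explore frontier large-model technologies and, with future real-world application scenarios in mind, provide technical solutions.
4. Exchange and collaborate with industry peers, and track and analyze the latest research developments in reinforcement learning for large models.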

We empower our people to stay resilient and relevant in a constantly changing world. We're looking for people who are always searching for creative ways to grow and learn. People who want to make a real impact, now and in the future. Does that sound like you? Then it seems like you'd make a great addition to our vibrant international team.

DAI AIX (AI Acceleration and Exploration) works on cutting-edge research in data analytics and AI with the Siemens global technology network, and on consulting, co-creation, and data-driven applications for end customers. The Research Scientist on this team does applied research for industrial AI applications.

We are seeking a Reinforcement Learning (RL) Specialist to lead the design, implementation, and optimization of RL-driven systems for post-training of foundation models. The primary focus of this role is advancing our RL capabilities for real-world applications such as industrial control systems and LLM agents. You will develop cutting-edge algorithms, improve post-training efficiency, and deploy scalable RL solutions in industry.

You'll make an impact by
1. Reinforcement learning development for post-training:
   • Design and implement state-of-the-art RL algorithms (e.g., PPO, SAC, DQN) for post-training of foundation models such as LLMs and time series foundation models.
   • Implement distributed RL training pipelines using frameworks such as Ray RLlib, DeepSpeed, or custom solutions.
   • Design and implement benchmark pipelines for model evaluation.
2. Aligning foundation models such as LLMs and time series foundation models with specific areas and tasks through techniques such as SFT and RL.
3. Coding & Infrastructure:
   • Write production-grade Python code using PyTorch, NumPy, and pandas.
   • Manage Linux-based clusters for distributed training and deployment.
4. Providing any other support required by the line manager.