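ByteDance AI Infra Platform R&D Engineer (Large Model Dev Machine Track) - Seed
Experienced hire | Full-time | Job ID: A149874A | Location: Shanghai | Status: Open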
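Requirements
1. Bachelor's degree or above; majors in computer science, software engineering, or related fields are preferred.
2. Solid software engineering skills; familiarity with at least one mainstream backend language such as Go, Java, Python, or C++; strong system design and implementation ability.
3. Familiarity with application and platform development; experience building complex business systems, foundational platforms, R&D platforms, cloud platforms, or machine learning platforms.
4. Solid architecture fundamentals; understanding of common backend architecture patterns such as distributed systems, high-availability design, service governance, asynchronous tasks, caching, message queues, and database design.
5. Familiarity with containerization and the Kubernetes ecosystem, including core mechanisms such as Pod, Deployment, StatefulSet, CRD, Operator, Scheduler, Volume, and NetworkPolicy; understanding of resource orchestration and scheduling. Experience with CPU/GPU scheduling, queues, quotas, multi-tenant isolation, elastic scaling, or resource reclamation is a plus.
6. Strong problem analysis and troubleshooting skills…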
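Responsibilities
Team introduction: The ByteDance Seed team was founded in 2023. It is dedicated to finding new approaches to general intelligence, pursuing the upper limits of intelligence, and contributing to the development of technology and society. The Seed team has a long-term vision and commitment in AI. Its research directions cover MLLM, GenMedia, AI for Science, robotics, and more, with labs and positions in China, Singapore, the United States, and elsewhere. To date, the team has released industry-leading general-purpose large models and frontier multimodal capabilities, which support more than 50 application scenarios including Doubao, Jimeng, and TRAE and are offered to enterprise customers through Volcano Engine. According to third-party data, the Doubao app ranks first in China by number of users, and the Doubao large model's average daily token call volume leads the industry.
1. Design and develop the backend systems for the large model platform's dev machines, including dev machine lifecycle management, user permissions, and resource isolation.
2. Design and optimize the Kubernetes-based resource orchestration and scheduling system to support complex resource management scenarios covering CPU/GPU, shared storage, networking, and images; follow technical developments in cloud native, AI Infra, GPU scheduling, distributed training, and AI Agents, and drive their adoption in the platform.
3. Build the cloud development experience for algorithm researchers, including VS Code Server, SSH, Web IDE, task environment reuse, image management, and data mounting.
4. Own the platform's core architecture design and engineering, improving stability, scalability, observability, and operational efficiency, including service governance, monitoring and alerting, logging and tracing, fault diagnosis, canary releases, capacity planning, and cost optimization.
5. Collaborate with algorithm, training platform, infrastructure, and operations teams to continuously improve large model R&D efficiency and resource utilization.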
Includes English-language materials
Education
Backend development
https://www.youtube.com/watch?v=tN6oJu2DqCM&list=PLWKjhJtqVAbn21gs5UnLhCQ82f923WCgM
Learn what technologies you should learn first to become a back end web developer.
Go
https://www.youtube.com/watch?v=8uiZC0l4Ajw
A complete Golang tutorial. Under an hour from start to finish, including a full demonstration of how to build an API in Go. No filler, just what you need to know.
Java
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for complete beginners, with full examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
C++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
System design
https://roadmap.sh/system-design
Everything you need to know about designing large scale systems.
https://www.youtube.com/watch?v=F2FmTdLtb_4
This complete system design tutorial covers scalability, reliability, data handling, and high-level architecture with clear explanations, real-world examples, and practical strategies.
Machine learning
https://www.youtube.com/watch?v=0oyDqO8PjIg
Learn about machine learning and AI with this comprehensive 11-hour course from @LunarTech_ai.
https://www.youtube.com/watch?v=i_LwzRVP7bg
Learn Machine Learning in a way that is accessible to absolute beginners.
https://www.youtube.com/watch?v=NWONeJKn6kc
Learn the theory and practical application of machine learning concepts in this comprehensive course for beginners.
https://www.youtube.com/watch?v=PcbuKRNtCUc
Learn about all the most important concepts and terms related to machine learning and AI.
Distributed systems
https://www.distributedsystemscourse.com/
The home page of a free online class in distributed systems.
https://www.youtube.com/watch?v=7VbL89mKK3M&list=PLOE1GTZ5ouRPbpTnrZ3Wqjamfwn_Q5Y9A
High availability
https://redis.io/blog/high-availability-architecture/
A highly available architecture is one in which a number of different components, modules, or services work together to maintain optimal performance, irrespective of peak-time loads.
https://www.ibm.com/think/topics/high-availability
High availability (HA) is a term that refers to a system’s ability to be accessible and reliable close to 100% of the time.
Service governance
https://cloudnativecn.com/blog/istio-traffic-management-series-service-management-concept-theory/
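By reading this article, readers can gain an initial understanding of the concepts and knowledge framework behind Istio traffic management.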
https://juejin.cn/post/6844904006033080334
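Service governance mainly covers service discovery, load balancing, rate limiting, circuit breaking, timeouts, retries, and service tracing. This article focuses on service discovery.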