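ByteDance Public Cloud Machine Learning Systems Engineer (Scheduling)
Experienced hire | Full-time | A11907 | Location: Beijing | Status: Hiring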
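Requirements
1. Proficient in one or two languages such as Go, Java, or Python in a Linux environment.
2. Solid computer science fundamentals and programming skills; familiar with common algorithms and data structures; good coding habits.
3. Familiar with at least one mainstream machine learning framework (TensorFlow, PyTorch, or another in-house framework).
4. Familiar with the Kubernetes architecture and ecosystem and with container technologies such as Docker, Containerd, and Kata; extensive hands-on and development experience with cloud-native machine learning systems.
5. Command of distributed systems principles; has taken part in the design, development, and maintenance of large-scale distributed systems.
6. Excellent logical analysis skills; able to abstract and decompose business logic sensibly.
7. Strong sense of responsibility, with good learning ability, communication skills, and self-motivation; able to respond and act quickly.
8. Good documentation habits; writes and updates workflow and technical documentation promptly and as required.
Bonus points:
1. Hands-on experience with online and offline resource scheduling on large-scale clusters; source-level understanding of the scheduler implementation in one or more open-source projects such as K8S, Volcano, Yarn, or Mesos; familiar with containerization, lightweight virtual machines, and related technologies.
2. Familiar with common scheduling algorithms; deep understanding of and hands-on experience with one or more scheduling problems such as multi-tenant quota governance, preemption, elasticity, fragmentation, tidal (peak/off-peak) scheduling, workload co-location, and QoS; strong analysis and modeling skills for solving complex problems; GPU scheduling experience.
3. Experience in one of the following areas: CUDA, RDMA, AI Infrastructure, HW/SW Co-Design, High Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking), ML for System, Distributed Storage.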
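Responsibilities
1. Design and develop resource scheduling for machine learning systems, supporting the product business of the Volcano Ark (火山方舟) large-model platform and the machine learning platform.
2. Optimize the orchestration and scheduling of resources across multiple data centers and clusters, covering heterogeneous compute (GPU, CPU, and other heterogeneous hardware), storage (various cloud storage services), and networking (VPC, RDMA); under strict multi-tenant isolation, meet the scheduling needs of workloads such as offline training and online inference, and achieve rational, maximal utilization of overall resources.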
Includes English-language materials
Linux+
https://ryanstutorials.net/linuxtutorial/
Ok, so you want to learn how to use the Bash command line interface (terminal) on Unix/Linux.
https://ubuntu.com/tutorials/command-line-for-beginners
The Linux command line is a text interface to your computer.
https://www.youtube.com/watch?v=6WatcfENsOU
In this Linux crash course, you will learn the fundamental skills and tools you need to become a proficient Linux system administrator.
https://www.youtube.com/watch?v=v392lEyM29A
Never fear the command line again, make it fear you.
https://www.youtube.com/watch?v=ZtqBQ68cfJc
Go+
https://www.youtube.com/watch?v=8uiZC0l4Ajw
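A complete tutorial for learning Golang! Under an hour from start to finish, including a full demo of how to build an API in Go. No filler, just what you need to know.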
Java+
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Python+
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for absolute beginners, with complete examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Algorithms+
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
Data structures+
https://www.youtube.com/watch?v=8hly31xKli0
In this course you will learn about algorithms and data structures, two of the fundamental topics in computer science.
https://www.youtube.com/watch?v=B31LgI4Y4DQ
Learn about data structures in this comprehensive course. We will be implementing these data structures in C or C++.
https://www.youtube.com/watch?v=CBYHwZcbD-s
Data Structures and Algorithms full course tutorial in Java
Coding standards+
[English] Google Style Guides
https://google.github.io/styleguide/
Every major open-source project has its own style guide: a set of conventions (sometimes arbitrary) about how to write code for that project. It is much easier to understand a large codebase when all the code in it is in a consistent style.
Machine learning+
https://www.youtube.com/watch?v=0oyDqO8PjIg
Learn about machine learning and AI with this comprehensive 11-hour course from @LunarTech_ai.
https://www.youtube.com/watch?v=i_LwzRVP7bg
Learn Machine Learning in a way that is accessible to absolute beginners.
https://www.youtube.com/watch?v=NWONeJKn6kc
Learn the theory and practical application of machine learning concepts in this comprehensive course for beginners.
https://www.youtube.com/watch?v=PcbuKRNtCUc
Learn about all the most important concepts and terms related to machine learning and AI.
TensorFlow+
https://www.youtube.com/watch?v=tpCFfeUEGs8
Ready to learn the fundamentals of TensorFlow and deep learning with Python? Well, you’ve come to the right place.
https://www.youtube.com/watch?v=ZUKz4125WNI
This part continues right where part one left off so get that Google Colab window open and get ready to write plenty more TensorFlow code.
PyTorch+
https://datawhalechina.github.io/thorough-pytorch/
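PyTorch is an important tool for deep-learning-based data science research, with considerable advantages in flexibility, readability, and performance; in recent years it has become the framework most commonly used in academia to implement deep learning algorithms.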
https://www.youtube.com/watch?v=V_xro1bcAuA
Learn PyTorch for deep learning in this comprehensive course for beginners. PyTorch is a machine learning framework written in Python.
Kubernetes+
https://kubernetes.io/docs/tutorials/kubernetes-basics/
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system.
https://kubernetes.io/zh-cn/docs/tutorials/kubernetes-basics/
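This tutorial covers the basics of the Kubernetes cluster orchestration system. Each module contains background information on major Kubernetes features and concepts, and includes an online tutorial for you to work through.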
https://www.youtube.com/watch?v=s_o8dwzRlu4
Hands-On Kubernetes Tutorial | Learn Kubernetes in 1 Hour - Kubernetes Course for Beginners
https://www.youtube.com/watch?v=X48VuDVv0do
Full Kubernetes Tutorial | Kubernetes Course | Hands-on course with a lot of demos
Docker+
https://www.youtube.com/watch?v=GFgJkfScVNU
Master Docker in one course; learn about images and containers on Docker Hub, running multiple containers with Docker Compose, automating workflows with Docker Compose Watch, and much more. 🐳
https://www.youtube.com/watch?v=kTp5xUtcalw
Learn how to use Docker and Kubernetes in this complete hands-on course for beginners.
Distributed systems+
https://www.distributedsystemscourse.com/
The home page of a free online class in distributed systems.
https://www.youtube.com/watch?v=7VbL89mKK3M&list=PLOE1GTZ5ouRPbpTnrZ3Wqjamfwn_Q5Y9A
Volcano+
[English] Tutorials
https://volcano.sh/en/docs/tutorials/
This section provides guidance to help you quickly get started with Volcano, from deploying a basic Volcano Job/Deployment, to integrating with Volcano Queues
Yarn+
[English] Introduction
https://yarnpkg.com/getting-started
Yarn is an established open-source package manager used to manage dependencies in JavaScript projects.
Mesos+
https://www.baeldung.com/apache-mesos
Apache Mesos is a platform that allows effective resource sharing between such applications.
https://www.oreilly.com/library/view/learn-apache-mesos/9781789137385/
Learn Apache Mesos is the go-to book for anyone eager to master the power of efficient resource management and cluster deployment with Apache Mesos.
CUDA+
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with Nvidia CUDA and leverage GPUs for high-performance computing and deep learning.
Related positions
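Experienced hire | 2+ years | A252507
1. Develop and optimize AML, the machine learning platform, building a China-leading machine learning platform focused on the AI developer experience.
2. Explore and tackle hard technical problems across machine learning system architecture, cloud-native architecture, and public cloud architecture, helping customers build high-performance computing platforms that deliver high performance and high resource utilization.
Updated 2023-09-06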
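Experienced hire | 2+ years | A247110
1. Develop and optimize AML, the machine learning platform, building a China-leading machine learning platform focused on the AI developer experience.
2. Explore and tackle hard technical problems across machine learning system architecture, cloud-native architecture, and public cloud architecture, helping customers build high-performance computing platforms that deliver high performance and high resource utilization.
Updated 2023-11-15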
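Experienced hire | Algorithms
1. Build a multi-cloud heterogeneous resource scheduling system that integrates AI compute from multiple cloud vendors, design priority policies, and achieve cross-platform resource pooling and efficient dynamic allocation.
2. Design an intelligent data routing scheme that keeps training data flowing efficiently in a hybrid-cloud environment and improves cross-cloud data synchronization efficiency.
3. Integrate with the MLOps system to deeply integrate training task orchestration, version control, model monitoring, and related functions.
4. Develop a resource efficiency monitoring system that tracks core metrics such as GPU utilization and task queueing time in real time.
Updated 2025-04-08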