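NetEase Senior AI Engineer (Machine Learning Platform)
Experienced hire | Full-time | 3-5 years of experience | NetEase Games (Interactive Entertainment) | Location: Guangzhou | Status: Hiring
Requirements
1. Proficient in operating and managing large-scale K8s clusters; expert in containers, images, storage, and networking; familiar with mainstream cloud-native CI/CD and service mesh toolchains;
2. Hands-on experience delivering a complete MLOps platform; familiar with at least one of the Kubeflow, MLflow, or Ray ecosystems;
3. Familiar with mainstream frameworks such as TensorFlow / PyTorch, with an understanding of distributed…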
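Responsibilities
1. Build an enterprise-grade cloud-native machine learning platform that supports full lifecycle management of model development, training, deployment, and release;
2. Deliver and optimize the platform's core components, including distributed training scheduling, model version management, and model serving for inference;
3. Run fine-grained operations of the GPU compute cluster, optimizing large-model training / inference cost through resource scheduling, elastic scaling, and heterogeneous compute management;
4. Build the platform's monitoring, alerting, and observability stack to keep machine learning clusters and business systems highly available and stable;
5. Work with algorithm and business teams to break down requirements and provide standardized MLOps platform solutions.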
Includes English-language materials
Kubernetes+
https://kubernetes.io/docs/tutorials/kubernetes-basics/
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system.
https://kubernetes.io/zh-cn/docs/tutorials/kubernetes-basics/
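This tutorial introduces the basics of the Kubernetes cluster orchestration system. Each module contains background information on major Kubernetes features and concepts, and also includes an online tutorial for you to work through.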
https://www.youtube.com/watch?v=s_o8dwzRlu4
Hands-On Kubernetes Tutorial | Learn Kubernetes in 1 Hour - Kubernetes Course for Beginners
https://www.youtube.com/watch?v=X48VuDVv0do
Full Kubernetes Tutorial | Kubernetes Course | Hands-on course with a lot of demos
CI+
https://www.ibm.com/cn-zh/think/topics/continuous-integration
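Continuous integration (CI) is a software development practice in which developers regularly integrate new code and code changes into a central code repository throughout the development cycle. It is a key component of DevOps and agile methodologies.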
https://www.youtube.com/watch?v=42UP1fxi2SY
CD+
https://www.redhat.com/zh-cn/topics/devops/what-is-ci-cd
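CI/CD stands for continuous integration and continuous delivery/deployment; it aims to streamline and accelerate the software development lifecycle.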
https://www.youtube.com/watch?v=R8_veQiYBjI&list=PLy7NrYWoggjzSIlwxeBbcgfAdYoxCIrM2
Service Mesh+
https://aws.amazon.com/cn/what-is/service-mesh/
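A service mesh is a software layer that handles all communication between services in an application. The layer is composed of containerized microservices. As an application scales and the number of microservices grows, monitoring service performance becomes increasingly difficult.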
https://aws.amazon.com/what-is/service-mesh/
A service mesh is a software layer that handles all communication between services in applications. This layer is composed of containerized microservices.
https://www.redhat.com/zh-cn/topics/microservices/what-is-a-service-mesh
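A service mesh is a dedicated infrastructure layer within a software application that handles communication between services. It can handle traffic routing, security, observability, and resilience, while abstracting these concerns away from individual services to reduce complexity.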
Kubeflow+
https://huggingface.co/blog/turhancan97/building-your-first-kubeflow-pipeline
Kubeflow is an open-source platform designed to be end-to-end, facilitating each step of the Machine Learning (ML) workflow.
https://www.kubeflow.org/docs/started/introduction/
Kubeflow is the foundation of tools for AI Platforms on Kubernetes.
https://www.youtube.com/watch?v=6wWdNg0GMV4
In this walk-through I will show you how I've created a machine learning pipeline with Kubeflow 1.5 using Jupyter Notebooks, Kubeflow Pipelines, MinIO and KServe.
MLflow+
https://mlflow.org/docs/latest/ml/getting-started/
If you're new to MLflow or seeking a refresher on its core functionalities, the quickstart tutorials here are the perfect starting point.
https://mlflow.org/docs/latest/ml/tutorials-and-examples/
Here you'll find a curated set of resources to help you get started and deepen your knowledge of MLflow.
https://www.youtube.com/watch?v=cjeCAoW83_U
This is a video version of the MLflow Quickstart guide.
https://www.youtube.com/watch?v=DnpEA1XaYlI
MLflow is designed to simplify the challenges of managing the machine learning lifecycle.