T-Head (平头哥) - AI DevOps Expert - Shanghai
Experienced hire | Full-time | 5+ years | Technology - Chips | Location: Shanghai | Status: Hiring
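Qualifications
We are looking for someone with:
● A bachelor's degree or above in computer science or a related field, and 5+ years of experience in DevOps, SRE, platform engineering, or automation systems development
● Expert knowledge of Linux, containerization (Docker), and orchestration (Kubernetes), with experience managing and tuning large-scale clusters
● Proficiency in at least one mainstream programming language (Python / Go / Java), with sound engineering practices and system design skills
● A deep understanding of core systems such as CI/CD, monitoring and alerting, log analysis, and automated operations, with hands-on experience building platforms or contributing to open-source projects
● A deep understanding of, or practical experience with, AI applied to systems engineering, and familiarity with putting common machine learning algorithms (e.g., classification, clustering, time-series forecasting) to work in log analysis, anomaly detection, resource optimization, and similar scenarios
● Strong systems thinking, problem decomposition, and cross-team collaboration skills, with the ability to independently lead the design and delivery of complex systems
Nice to have:
● MLOps experience and familiarity with toolchains such as MLflow, Kubeflow, Seldon, and KServe
● Project results or published papers in areas such as AIOps, intelligent scheduling, or failure prediction
● Experience building a large enterprise-grade DevOps platform or an internal PaaS/IaaS system
● Familiarity with the cloud-native ecosystem (AWS/GCP/AliCloud) and experience with multi-cloud or hybrid-cloud architectures
● Active participation in open-source communities, with contributions to or maintenance of well-known projects
What we offer:
● The opportunity to take a deep part in innovation at the intersection of AI and systems engineering, with exposure to leading global technology trends
● A growth environment alongside senior architects and technical experts, where you can keep building technical depth and influence
● A highly open, results-oriented team culture that encourages technical innovation and supports independent exploration in key technical directions
● Competitive compensation, flexible working hours, and a long-term career path to support your professional advancement
Join us to reshape R&D productivity with AI, make systems smarter, and let engineers focus on creating value!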
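Responsibilities
We are looking for a DevOps platform engineering expert with deep technical expertise, a forward-looking vision, and extensive hands-on experience to join our core team dedicated to building intelligent R&D infrastructure. Here you will lead the design and delivery of the next-generation AI-driven CI/CD platform and intelligent operations systems, moving the software development process toward automation, observability, self-healing, and data-driven decision-making.
As a technical backbone of the team, you will:
1. Design and build a highly available, intelligent CI/CD platform
Lead the architectural evolution of the continuous integration and continuous delivery systems to support large-scale distributed R&D collaboration; explore applying machine learning to build-failure prediction, intelligent test-case recommendation, resource scheduling optimization, and similar scenarios to significantly improve delivery efficiency and stability.
2. Build an enterprise-grade intelligent operations (AIOps) system
Using Python, Go, and other languages, build automated operations toolchains and platform capabilities, and implement Infrastructure as Code (IaC); introduce AI algorithms such as anomaly detection, root cause analysis, and fault propagation graphs to improve system observability and incident response speed, shifting operations from "reactive response" to "proactive prevention".
3. Build end-to-end intelligent monitoring and self-healing systems
Design and deliver an end-to-end monitoring system covering applications, services, and resources, integrating mainstream stacks such as Prometheus, Grafana, ELK, and OpenTelemetry; combine time-series forecasting (LSTM, Prophet), unsupervised anomaly detection (Isolation Forest, One-Class SVM), and other models to provide early warning of performance bottlenecks, automated diagnosis, and closed-loop self-healing in some scenarios.
4. Drive the deep integration of MLOps and DevOps
Lead the construction of the machine learning model training pipeline (ML Pipeline) and model serving (Model Serving) platforms; design model version management, A/B testing, gradual traffic rollout, monitoring and alerting, and fast rollback mechanisms to support efficient, stable deployment of AI capabilities at scale.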
Includes English-language materials
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, starting from zero, with complete examples, based on the latest Python 3 version.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Java
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Go
https://www.youtube.com/watch?v=8uiZC0l4Ajw
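A complete tutorial for learning Golang! Start to finish in under an hour, including a full demonstration of how to build an API in Go. No filler, just what you need to know.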
Linux
https://ryanstutorials.net/linuxtutorial/
Ok, so you want to learn how to use the Bash command line interface (terminal) on Unix/Linux.
https://ubuntu.com/tutorials/command-line-for-beginners
The Linux command line is a text interface to your computer.
https://www.youtube.com/watch?v=6WatcfENsOU
In this Linux crash course, you will learn the fundamental skills and tools you need to become a proficient Linux system administrator.
https://www.youtube.com/watch?v=v392lEyM29A
Never fear the command line again, make it fear you.
https://www.youtube.com/watch?v=ZtqBQ68cfJc
Unix
[English] The UNIX® Standard
https://www.opengroup.org/membership/forums/platform/unix
https://www.youtube.com/watch?v=IrDUcdpPmdI
UNIX is an operating system which was first developed in the 1970s, and has been under constant development ever since.
TCP/IP
[English] What is TCP/IP?
https://www.techtarget.com/searchnetworking/definition/TCP-IP
TCP/IP stands for Transmission Control Protocol/Internet Protocol and is a suite of communication protocols used to interconnect network devices on the internet.
Spring Boot
https://spring.io/guides/gs/spring-boot
This guide provides a sampling of how Spring Boot helps you accelerate application development.
https://www.youtube.com/watch?v=Nv2DERaMx-4&list=PLzUMQwCOrQTksiYqoumAQxuhPNa3HqasL
The author teaches you how to use Spring Boot from a complete beginner, to building a REST API with a real database, Dockerising it and deploying it to the cloud.
Django
https://www.youtube.com/watch?v=nGIg40xs9e4
Learn how to build a simple Django application in as fast as 20 minutes!
https://www.youtube.com/watch?v=rHux0gMZ3Eg
Learn Django and start building amazing back-ends!
Kubernetes
https://kubernetes.io/docs/tutorials/kubernetes-basics/
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system.
https://kubernetes.io/zh-cn/docs/tutorials/kubernetes-basics/
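This tutorial introduces the basics of the Kubernetes cluster orchestration system. Each module contains some background information on major Kubernetes features and concepts, and includes an online tutorial for you to work through.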
https://www.youtube.com/watch?v=s_o8dwzRlu4
Hands-On Kubernetes Tutorial | Learn Kubernetes in 1 Hour - Kubernetes Course for Beginners
https://www.youtube.com/watch?v=X48VuDVv0do
Full Kubernetes Tutorial | Kubernetes Course | Hands-on course with a lot of demos
Education
DevOps
https://roadmap.sh/devops
Step by step guide for DevOps, SRE or any other Operations Role in 2025
https://zhuanlan.zhihu.com/p/562036793
In DevOps, Dev stands for Development and Ops stands for Operations. In one sentence, DevOps breaks down the wall between development and operations and unifies the two.
Docker
https://www.youtube.com/watch?v=GFgJkfScVNU
Master Docker in one course; learn about images and containers on Docker Hub, running multiple containers with Docker Compose, automating workflows with Docker Compose Watch, and much more. 🐳
https://www.youtube.com/watch?v=kTp5xUtcalw
Learn how to use Docker and Kubernetes in this complete hands-on course for beginners.
System Design
https://roadmap.sh/system-design
Everything you need to know about designing large scale systems.
https://www.youtube.com/watch?v=F2FmTdLtb_4
This complete system design tutorial covers scalability, reliability, data handling, and high-level architecture with clear explanations, real-world examples, and practical strategies.
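CI
https://www.ibm.com/cn-zh/think/topics/continuous-integration
Continuous integration (CI) is a software development practice in which developers regularly integrate new code and code changes into a central code repository throughout the development cycle. It is a key component of DevOps and agile methodologies.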
https://www.youtube.com/watch?v=42UP1fxi2SY
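CD
https://www.redhat.com/zh-cn/topics/devops/what-is-ci-cd
CI/CD stands for continuous integration and continuous delivery/deployment, and aims to streamline and accelerate the software development lifecycle.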
https://www.youtube.com/watch?v=R8_veQiYBjI&list=PLy7NrYWoggjzSIlwxeBbcgfAdYoxCIrM2
Machine Learning
https://www.youtube.com/watch?v=0oyDqO8PjIg
Learn about machine learning and AI with this comprehensive 11-hour course from @LunarTech_ai.
https://www.youtube.com/watch?v=i_LwzRVP7bg
Learn Machine Learning in a way that is accessible to absolute beginners.
https://www.youtube.com/watch?v=NWONeJKn6kc
Learn the theory and practical application of machine learning concepts in this comprehensive course for beginners.
https://www.youtube.com/watch?v=PcbuKRNtCUc
Learn about all the most important concepts and terms related to machine learning and AI.
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
MLflow
https://mlflow.org/docs/latest/ml/getting-started/
If you're new to MLflow or seeking a refresher on its core functionalities, the quickstart tutorials here are the perfect starting point.
https://mlflow.org/docs/latest/ml/tutorials-and-examples/
Here you'll find a curated set of resources to help you get started and deepen your knowledge of MLflow.
https://www.youtube.com/watch?v=cjeCAoW83_U
This is a video version of the MLflow Quickstart guide.
https://www.youtube.com/watch?v=DnpEA1XaYlI
MLflow is designed to simplify the challenges of managing the machine learning lifecycle.
Kubeflow
https://huggingface.co/blog/turhancan97/building-your-first-kubeflow-pipeline
Kubeflow is an open-source platform designed to be end-to-end, facilitating each step of the Machine Learning (ML) workflow.
https://www.kubeflow.org/docs/started/introduction/
Kubeflow is the foundation of tools for AI Platforms on Kubernetes.
https://www.youtube.com/watch?v=6wWdNg0GMV4
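In this walk-through I will show you how I've created a machine learning pipeline with Kubeflow 1.5 using Jupyter Notebooks, Kubeflow pipelines, MinIO and KServe.
PaaS
https://www.ibm.com/cn-zh/think/topics/paas
Platform as a service (PaaS) is a cloud computing model that provides a complete on-demand cloud platform (hardware, software, and infrastructure) for developing, running, and managing applications.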
https://www.ibm.com/think/topics/paas
https://www.youtube.com/watch?v=QAbqJzd0PEE
IaaS
https://www.ibm.com/think/topics/iaas
https://www.youtube.com/watch?v=XRdmfo4M_YA
AWS
https://aws.amazon.com/
Amazon Web Services offers reliable, scalable, and inexpensive cloud computing services. Free to join, pay only for what you use.
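Related Positions
Experienced hire | 8+ years | Technology - Chips
Job description: As a senior technical expert in AI software test development, you will take part in the development of T-Head (平头哥) AI chips from pre-silicon to post-silicon and help bring them to production. You will be responsible for building a high-coverage testing system that ensures the framework's functional correctness, performance optimization, and stability.
Main responsibilities:
● Take part in system testing of AI chip solutions to ensure product delivery quality
● Take part in designing test strategies, test methods, test tools, and test cases for AI inference frameworks and model training
● Take part in designing test strategies, test methods, test tools, and test cases for AI software foundation frameworks, operator libraries, and compilers
● Take part in designing, establishing, and driving the process for continuously improving AI chip software quality
● Work with the development and project management teams to define software requirements and development plans, and define the corresponding test development plans
● Take part in building T-Head's overall software quality process, and monitor and track the quality of software development
Updated 2025-09-22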
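Experienced hire | 5+ years | Technology - Development
1. Develop large models for the massive data of the cloud computing foundation, including but not limited to application algorithm R&D for large models in areas such as code LLMs, omni-modal models, and large-scale graph learning;
2. Take part in the full workflow of large-model application development, including but not limited to model algorithm design, coding, training, deployment optimization, debugging, and evaluation; technical innovation such as writing patents and papers; and external technical outreach;
3. Drive the adoption of large models in business scenarios such as DevOps efficiency, internal and external agent applications, breakout AI-native applications, and security and technical risk prevention and control.
Updated 2025-09-01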
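Experienced hire | 5+ years | R&D
1. Design, develop, and maintain the machine learning platform and related tools, supporting training and inference for NLP, CV, and other models;
2. Own training and inference optimization, including but not limited to GPU compute acceleration, network communication optimization, and storage performance improvements;
3. Work with the algorithm team to build and optimize distributed machine learning training and inference systems, optimizing and tuning from both algorithmic and engineering perspectives based on the characteristics of the data;
4. Build highly available model services that are stable and efficient, so that the platform keeps adapting to the needs and trends of business growth;
5. Contribute to open-source communities and advance the company's technical influence in the industry.
Updated 2025-04-28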