阿里云阿里云智能-云原生k8s SRE平台研发工程师/专家-杭州
社招全职3年以上云智能集团地点:杭州状态:招聘
任职要求
1. 计算机相关专业,3年及以上后端开发或云原生平台开发经验,具备良好的问题排查能力、系统设计能力、和稳定性风险意识; 2. 熟练掌握Java和Go语言,深入理解 Kubernetes 架构与核心组件(如 API Server、etcd、kubelet、kube-proxy 等),熟悉 Helm、Istio、Prometheus、Fluentd 等生态工具; 3. 熟悉常用中间件(如 Kafka、Redis、MySQL等),具备微服务架构设计与落地经验,熟悉服务治理、熔断限流、配置中心(如 Nacos、Consul)、API 网关等核心组件; 4. 有编写AI agent经验,尤其是k8s AI诊断及运维经验者优先,有开源项目贡献经验或技术博客输出者优先; 5. 具备强烈的技术好奇心和探索精神,具有良好的团队协作意识和问题解决能力。
工作职责
1. 负责设计、开发和维护基于 Kubernetes 的自动化运维管理平台,提升对资源成本的控制、保障业务稳定性、提高运维效率; 2. 熟练使用Go/Java语言开发平台服务及底层Kubernetes组件能力; 3. 参与平台的高可用、性能优化、安全加固及自动化运维体系建设; 4. 基于AI技术,智能化解决容器层面的问题诊断、成本治理、告警降噪等问题; 5. 编写高质量、可维护的技术文档,推动团队技术沉淀与标准化。
包括英文材料
后端开发+
https://www.youtube.com/watch?v=tN6oJu2DqCM&list=PLWKjhJtqVAbn21gs5UnLhCQ82f923WCgM
Learn what technologies you should learn first to become a back end web developer.
系统设计+
https://roadmap.sh/system-design
Everything you need to know about designing large scale systems.
https://www.youtube.com/watch?v=F2FmTdLtb_4
This complete system design tutorial covers scalability, reliability, data handling, and high-level architecture with clear explanations, real-world examples, and practical strategies.
Java+
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Go+
https://www.youtube.com/watch?v=8uiZC0l4Ajw
学习Golang的完整教程!从开始到结束不到一个小时,包括如何在Go中构建API的完整演示。没有多余的内容,只有你需要知道的知识。
Kubernetes+
https://kubernetes.io/docs/tutorials/kubernetes-basics/
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system.
https://kubernetes.io/zh-cn/docs/tutorials/kubernetes-basics/
本教程介绍 Kubernetes 集群编排系统的基础知识。每个模块包含关于 Kubernetes 主要特性和概念的一些背景信息,还包括一个在线教程供你学习。
https://www.youtube.com/watch?v=s_o8dwzRlu4
Hands-On Kubernetes Tutorial | Learn Kubernetes in 1 Hour - Kubernetes Course for Beginners
https://www.youtube.com/watch?v=X48VuDVv0do
Full Kubernetes Tutorial | Kubernetes Course | Hands-on course with a lot of demos
Helm+
[英文] Introduction to Helm
https://helm.sh/docs/intro/
Are you new to Helm? This is the place to start!
https://www.baeldung.com/ops/kubernetes-helm
In this tutorial, we’ll understand the basics of Helm and how they form a powerful tool for working with Kubernetes resources.
Istio+
https://istio.io/latest/docs/examples/microservices-istio/
This modular tutorial provides new users with hands-on experience using Istio for common microservices scenarios, one step at a time.
https://www.freecodecamp.org/news/learn-istio-manage-microservices/
In a world without Istio, one service makes direct requests to another and in case of failures, the service is responsible for handling those.
Prometheus+
https://grafana.com/docs/grafana/latest/getting-started/get-started-grafana-prometheus/
Prometheus is an open source monitoring system for which Grafana provides out-of-the-box support.
https://prometheus.io/docs/tutorials/getting_started/
Prometheus is a system monitoring and alerting system.
Fluentd+
https://docs.fluentd.org/
Fluentd is an open-source data collector for a unified logging layer.
[英文] Guides and Recipes
https://www.fluentd.org/guides
Here is a growing collection of Fluentd resources, solution guides and recipes.
https://www.youtube.com/watch?v=Gp0-7oVOtPw
In todays episode, we take a look at the basics of Fluentd.
中间件+
https://www.youtube.com/watch?v=1oWPUpMheGk
Kafka+
https://developer.confluent.io/what-is-apache-kafka/
https://www.youtube.com/watch?v=CU44hKLMg7k
https://www.youtube.com/watch?v=j4bqyAMMb7o&list=PLa7VYi0yPIH0KbnJQcMv5N9iW8HkZHztH
In this Apache Kafka fundamentals course, we introduce you to the basic Apache Kafka elements and APIs, as well as the broader Kafka ecosystem.
Redis+
[英文] Developer Hub
https://redis.io/dev/
Get all the tutorials, learning paths, and more you need to start building—fast.
https://www.runoob.com/redis/redis-tutorial.html
REmote DIctionary Server(Redis) 是一个由 Salvatore Sanfilippo 写的 key-value 存储系统,是跨平台的非关系型数据库。
https://www.youtube.com/watch?v=jgpVdJB2sKQ
In this video I will be covering Redis in depth from how to install it, what commands you can use, all the way to how to use it in a real world project.
MySQL+
https://juejin.cn/post/7190306988939542585
这是一篇 MySQL 通关一篇过硬核经验学习路线,包括数据库相关知识,SQL语句的使用,数据库约束,设计等。
[英文] MySQL Tutorial
https://www.mysqltutorial.org/
your go-to resource for mastering MySQL in a fast, easy, and enjoyable way.
https://www.youtube.com/watch?v=5OdVJbNCSso
MySQL SQL tutorial for beginners
https://www.youtube.com/watch?v=7S_tz1z_5bA
This beginner-friendly course teaches you SQL from scratch.
微服务+
https://learn.microsoft.com/en-us/training/modules/dotnet-microservices/
Microservice applications are composed of small, independently versioned, and scalable customer-focused services that communicate with each other by using standard protocols and well-defined interfaces.
https://microservices.io/
Microservices - also known as the microservice architecture - is an architectural style that structures an application as a collection of two or more services.
https://spring.io/microservices
Building small, self-contained, ready to run applications can bring great flexibility and added resilience to your code.
https://www.ibm.com/think/topics/microservices
Microservices, or microservices architecture, is a cloud-native architectural approach in which a single application is composed of many loosely coupled and independently deployable smaller components or services.
https://www.youtube.com/watch?v=CqCDOosvZIk
https://www.youtube.com/watch?v=hmkF77F9TLw
Learn about software system design and microservices.
服务治理+
https://cloudnativecn.com/blog/istio-traffic-management-series-service-management-concept-theory/
通过阅读本文读者可以初步理解 Istio 流量治理的概念和相关知识框架。
https://juejin.cn/post/6844904006033080334
服务治理主要包括服务发现、负载均衡、限流、熔断、超时、重试、服务追踪等。我们今天要讲的,就是服务发现的内容。
Nacos+
https://nacos.io/docs/latest/overview/
Nacos 是 Dynamic Naming and Configuration Service 的首字母简称,一个易于构建 AI Agent 应用的动态服务发现、配置管理和AI智能体管理平台。
Consul+
[英文] Tutorials | Consul
https://developer.hashicorp.com/consul/tutorials
Start here to learn the basics of Consul on your favorite platform.
[英文] Consul Tutorial
https://www.tutorialspoint.com/consul/index.htm
Consul is an important service discovery tool in the world of Devops.
https://www.youtube.com/watch?v=s3I1kKKfjtQ
Complete Service Mesh and HashiCorp Consul tutorial - Real life demo of setting up Consul in Kubernetes multi cluster, multi cloud with failover
AI agent+
https://www.ibm.com/think/ai-agents
Your one-stop resource for gaining in-depth knowledge and hands-on applications of AI agents.
相关职位
社招A81609
1、负责火山引擎云原生容器平台产品的稳定性保障,通过平台建设/架构优化/组织提升等手段,不断提升云产品系统稳定性; 2、负责容器平台和大规模容器集群的稳定性保障,完成可靠性分析与优化;深入分析业务架构和系统运行时,持续识别稳定性薄弱环节,负责技术难点的攻坚,提升系统核心链路的整体稳定性; 3、参与火山引擎云原生容器平台产品的运维管控平台规划建设,设计实现相关自动化运维、分析诊断和保障体系,打造面向多地域超大规模集群的自动化运维和管控体系。
更新于 2025-06-10
社招A98480A
1、负责火山引擎云原生容器平台产品的稳定性保障,通过平台建设/架构优化/组织提升等手段,不断提升云产品系统稳定性; 2、负责容器平台和大规模容器集群的稳定性保障,完成可靠性分析与优化;深入分析业务架构和系统运行时,持续识别稳定性薄弱环节,负责技术难点的攻坚,提升系统核心链路的整体稳定性; 3、参与火山引擎云原生容器平台产品的运维管控平台规划建设,设计实现相关自动化运维、分析诊断和保障体系,打造面向多地域超大规模集群的自动化运维和管控体系。
更新于 2025-06-10
社招A48924
1、负责火山引擎云原生容器平台产品的稳定性保障,通过平台建设/架构优化/组织提升等手段,不断提升云产品系统稳定性; 2、负责容器平台和大规模容器集群的稳定性保障,完成可靠性分析与优化;深入分析业务架构和系统运行时,持续识别稳定性薄弱环节,负责技术难点的攻坚,提升系统核心链路的整体稳定性; 3、参与火山引擎云原生容器平台产品的运维管控平台规划建设,设计实现相关自动化运维、分析诊断和保障体系,打造面向多地域超大规模集群的自动化运维和管控体系。
更新于 2025-06-10