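Alibaba DAMO Academy - AI Software Stack Toolchain Test Engineer - Computing Technology
Experienced hire | Full-time | 3+ years | Technology - Chips | Location: Chengdu | Beijing | Hangzhou | Shanghai | Status: Hiring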
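Requirements
Required skills
• Bachelor's degree or above in computer science, software, electronics, or a related field; 3+ years of testing or system verification experience.
• Familiar with Linux (processes/permissions/kernel logs/networking/performance tools), with troubleshooting ability (dmesg/journalctl/perf/strace, etc.).
• Familiar with containers and K8s: at minimum, understand and be able to operate mechanisms such as DaemonSet/CRD/Admission/Webhook/RBAC/Node label/taint & toleration/device plugin.
• Experience testing GPU/heterogeneous/high-performance systems (any one of the following is sufficient): GPU drivers/toolchains, CUDA/ROCm-style ecosystems, RDMA/NCCL communication, operator performance/GPU memory/bandwidth testing.
• Able to write automated test code and tool scripts (Pytho…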
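Responsibilities
Job description: Own system testing and quality assurance for the AI/GPU software stack toolchain, covering the end-to-end path from driver/firmware capability exposure → container runtime integration → K8s orchestration and deployment → observability/diagnostics/profiling → debugging and operations. By building an automated verification system, E2E test environments, and stability/compatibility test plans, ensure the toolchain is deliverable, operable, observable, and debuggable in both post-silicon and production-cluster scenarios.
Responsibilities:
1. End-to-end testing and delivery quality of the cloud-native GPU toolchain. Test deployment, upgrade, rollback, configuration change, and failure recovery for GPU Operator / ClusterPolicy; own E2E testing in K8s scenarios; build and maintain cluster-level test baselines covering single-node/multi-node setups, different OSes (Ubuntu/Anolis/CentOS/RHEL, etc.), different containerd/docker setups, and a matrix of K8s versions.
2. Container runtime and device access path testing. Test Container Toolkit / CDI / runtime hooks: correctness of driver/library/device-node mapping, usability inside containers, permissions and isolation, and compatibility with different runtimes. Test Device Plugin / GPU Feature Discovery: device discovery, health checks, resource allocation, configuration hot reload, behavior triggered by node label changes, and verification of faulty-device/bad-card/degradation policies. Cover typical workload validation, using training/inference/HPC demos (for example PyTorch, CUDA samples, NCCL/RCCL-style communication samples) as regression baselines.
3. Operations and diagnostic tool testing (SMI / DCGM-like / diag / exporter). Test SMI and diagnostic tools, build the test monitoring pipeline, align with the hardware capability exposure path, and verify and regression-test that key fields propagate consistently through the FW/KMD/user-space library/tool layers.
4. Profiling and debugging tool testing (Profiling Tool / GDB Debugger). Own functional/performance/stability testing of profiling tools; own testing of Thrive GDB and the heterogeneous debugging path, verifying jointly with OpenOCD/emulators/EMU/hardware boards; cover scenarios such as debug information (DWARF), fatbinary, and runtime passing.
5. Automation and engineering system building. Design and implement automated test frameworks (Python, Go, or Shell), build up reusable E2E test suites, and be familiar with multi-version matrices, nightly regression, long-running soak tests, and performance baselines with trend analysis.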
Includes English-language materials
Linux+
https://ryanstutorials.net/linuxtutorial/
Ok, so you want to learn how to use the Bash command line interface (terminal) on Unix/Linux.
https://ubuntu.com/tutorials/command-line-for-beginners
The Linux command line is a text interface to your computer.
https://www.youtube.com/watch?v=6WatcfENsOU
In this Linux crash course, you will learn the fundamental skills and tools you need to become a proficient Linux system administrator.
https://www.youtube.com/watch?v=v392lEyM29A
Never fear the command line again, make it fear you.
https://www.youtube.com/watch?v=ZtqBQ68cfJc
Kernel+
https://www.youtube.com/watch?v=C43VxGZ_ugU
I rummage around the Linux kernel source and try to understand what makes computers do what they do.
https://www.youtube.com/watch?v=HNIg3TXfdX8&list=PLrGN1Qi7t67V-9uXzj4VSQCffntfvn42v
Learn how to develop your very own kernel from scratch in this programming series!
https://www.youtube.com/watch?v=JDfo2Lc7iLU
Denshi goes over a simple explanation of what computer kernels are and how they work, alongside what makes the Linux kernel special.
Perf+
https://perfwiki.github.io/main/
perf is powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing).
https://www.brendangregg.com/bpf-performance-tools-book.html
This book can help you get the most out of your systems and applications, helping you improve performance, reduce costs, and solve software issues.
[English] perf Examples
https://www.brendangregg.com/perf.html
These are some examples of using the perf Linux profiler, which has also been called Performance Counters for Linux (PCL), Linux perf events (LPE), or perf_events.
https://www.youtube.com/watch?v=M6ldFtwWup0
STrace+
https://opensource.com/article/19/10/strace
Trace the thin layer between user processes and the Linux kernel with strace.
https://www.youtube.com/watch?v=mBfurelWwPQ
In this video, we use the Linux strace command to trace Linux system calls.
Kubernetes+
https://kubernetes.io/docs/tutorials/kubernetes-basics/
This tutorial provides a walkthrough of the basics of the Kubernetes cluster orchestration system.
https://kubernetes.io/zh-cn/docs/tutorials/kubernetes-basics/
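This tutorial introduces the basics of the Kubernetes cluster orchestration system. Each module contains some background information on major Kubernetes features and concepts, and includes an online tutorial for you to learn from.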
https://www.youtube.com/watch?v=s_o8dwzRlu4
Hands-On Kubernetes Tutorial | Learn Kubernetes in 1 Hour - Kubernetes Course for Beginners
https://www.youtube.com/watch?v=X48VuDVv0do
Full Kubernetes Tutorial | Kubernetes Course | Hands-on course with a lot of demos
Node.js+
https://liaoxuefeng.com/books/javascript/nodejs/index.html
Starting with this chapter, we officially begin our journey into back-end development with JavaScript.
https://www.youtube.com/watch?v=32M1al-Y6Ag
This is an intro to Node.js. No frameworks or libraries.
https://www.youtube.com/watch?v=zb3Qk8SG5Ms&list=PL4cUxeGkcC9jsz4LDYc6kv3ymONOKxwBU
In this Node JS tutorial I'll introduce you to what exactly Node is all about, why we'd use it and the technologies you'll need to be familiar with to get started.
CUDA+
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with Nvidia CUDA and leverage GPUs for high-performance computing and deep learning.
NCCL+
https://developer.nvidia.com/nccl
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking.