Alibaba DAMO Academy - AI Software Stack Test Lead - Computing Technology
Experienced hire | Full-time | 8+ years of experience | Technology - Chips | Location: Shanghai | Status: Hiring
Qualifications
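Required:
● Master's degree or higher in computer science, electronic engineering, or a related field;
● 8+ years of software testing experience, including at least 5 years in AI/high-performance computing software;
● 3+ years of technical team management experience, with hands-on practice managing cross-regional or blended (full-time + outsourced) teams;
● In-depth understanding of AI software stack architecture, including but not limited to:
  ○ Driver layer (KMD/UMD/Runtime/CCL/Video/Security)
  ○ Compilers (LLVM, AI Compiler, Triton)
  ○ Operators and model inference (PyTorch, vLLM, Model Zoo)
  ○ …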
Responsibilities
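We are looking for a technical expert with deep experience in testing AI systems and software stacks to serve as AI Software Stack Test Lead. You will own the end-to-end quality assurance system spanning low-level drivers, compilers, operators, frameworks, toolchains, and full-stack integration. You will lead a multi-dimensional test team covering module testing, integration testing, and specialized testing (performance/stability/accuracy), and ensure the functional correctness, performance, and long-term stability of the AI software stack on heterogeneous platforms such as CPU/GPU/NPU.

Core responsibilities:
● Test strategy and system building: define and implement the overall test strategy for the AI software stack, providing full-lifecycle quality assurance across drivers (KMD/UMD/Runtime/CCL/Video/Security), compilers (DFCA/LLVM/Triton/AI Compiler), operators, deep learning frameworks (PyTorch, vLLM, etc.), toolchains (debugging/profiling/coverage), and cloud-native environments.
● Team management and capability building: provide technical leadership for a test team of about 50 people (full-time and outsourced), allocate staff sensibly across module, integration, and specialized testing, and continuously improve automation coverage, test efficiency, and defect interception.
● Cross-module collaboration and shift-left quality: work closely with the AI architecture, driver development, compiler, algorithm, and product teams, get involved at the requirements and design stages, and drive Design for Testability and Quality Built-in.
● Key quality dimensions:
  ○ Functional correctness: lead test coverage of boundary, exception, multi-process, virtualization, and hardware adaptation scenarios;
  ○ Performance and power: establish standardized performance baselines and monitor metrics such as inference/training throughput, latency, and energy efficiency;
  ○ Accuracy consistency: ensure numerical accuracy is aligned across CPU/GPU/NPU, and support mixed-precision validation such as FP16/INT8/BF16;
  ○ Stability and robustness: design and run specialized tests such as long-duration stability, OOM, Harvesting, stress, and fault injection.
● Automation and toolchain building: drive the rollout and optimization of infrastructure such as the EMU emulation environment, automated regression pipelines, device-level code coverage tools, and Sanitizers, strengthening shift-left and shift-right testing capabilities.
● Quality metrics and continuous improvement: build quality dashboards, monitor core metrics such as defect escape rate, regression pass rate, and automation coverage, and drive continuous optimization of processes and technology.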