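miHoYo AI Inference Systems Engineer
Experienced hire · Full-time · 2+ years of experience · Category: Programming & Technology · Location: Shanghai · Status: Hiring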
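Requirements
1. 2+ years of experience in inference deployment or AI performance optimization.
2. Familiarity with the principles and tuning techniques of at least two mainstream inference engines (TensorRT / vLLM / Triton / SGLang, etc.).
3. Familiarity with the NVIDIA GPU ecosystem (CUDA, cuDNN, TensorRT, NCCL) and an understanding of its architectural evolution (A100 → H100 → B200, etc.).
4. Understanding of the evolution path, operator support, and current ecosystem status of at least one of AMD ROCm or domestic (Chinese) NPUs.
5. Hands-on experience deploying and optimizing open-source large models (LLM / diffusion models / multimodal).
6. Solid performance-modeling skills: able to work from FLOPs, bandwidth, GPU memory, Batch Size, Sequence Length …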
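Responsibilities
Own production-grade inference deployment, performance tuning, and stability for AI models across multiple hardware platforms, working closely with the algorithm team to deliver the best deployment plan.
Core responsibilities:
1. Model deployment: deploy LLM, CV, speech, and other model types as inference services, covering NVIDIA (CUDA / TensorRT), AMD (ROCm), and domestic accelerators (Ascend CANN, Cambricon, Enflame, Moore Threads, etc.).
2. Inference engine selection and tuning: compare TensorRT, vLLM, Triton, SGLang, and other engines against the business scenario (throughput / latency / cost), and deliver selection and tuning recommendations.
3. Performance modeling and analysis: carry out quantitative analysis based on the Roofline model, compute-to-memory-access ratio, parallelism strategy, KV Cache, Continuous Batching, and similar factors; locate bottlenecks and propose optimizations.
4. Benchmark system: build offline / online load-testing and regression-testing frameworks, and produce quantitative evaluation reports.
5. Production stability: monitoring, alerting, and incident troubleshooting for inference services, plus management of performance regressions.
6. Cross-team collaboration: work with the algorithm team to understand model structure, operator characteristics, and accuracy constraints, and feed engineering constraints back early into model design and training.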
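As an illustration of the Roofline-style analysis named in responsibility 3, the following is a minimal sketch and not part of the posting. The `Device` class, its numbers, and the helper names are hypothetical placeholders, and the GEMM traffic estimate assumes each matrix is read or written exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical accelerator; the values used below are placeholders, not vendor specs."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP per byte) of an (m x k) @ (k x n) matmul.

    Assumes A, B, and C each cross the memory bus exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(device: Device, intensity: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(device.peak_flops, device.mem_bandwidth * intensity)


def is_memory_bound(device: Device, intensity: float) -> bool:
    """True when the intensity is below the ridge point peak_flops / mem_bandwidth."""
    return intensity < device.peak_flops / device.mem_bandwidth


if __name__ == "__main__":
    dev = Device(peak_flops=100e12, mem_bandwidth=1e12)  # illustrative numbers only
    for label, m in (("decode-like, m=1", 1), ("prefill-like, m=4096", 4096)):
        ai = gemm_intensity(m, 4096, 4096)
        bound = "memory" if is_memory_bound(dev, ai) else "compute"
        print(f"{label}: intensity={ai:.2f} FLOP/B, {bound}-bound, "
              f"attainable={attainable_flops(dev, ai) / 1e12:.2f} TFLOP/s")
```

Under these assumptions, a single-row matmul (typical of one decode step) falls well below the ridge point and is limited by bandwidth, while a large-batch matmul is limited by compute. This is one reason batch size and sequence length appear alongside FLOPs and bandwidth in this kind of model.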
Related English-language materials
Inference engines
https://www.youtube.com/watch?v=_dvk75LEJ34
https://www.youtube.com/watch?v=XtT5i0ZeHHE
TensorRT
https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
This TensorRT Quick Start Guide is a starting point for developers who want to try out the TensorRT SDK; specifically, it demonstrates how to quickly construct an application to run inference on a TensorRT engine.
vLLM
https://www.newline.co/@zaoyang/ultimate-guide-to-vllm--aad8b65d
vLLM is a framework designed to make large language models faster, more efficient, and better suited for production environments.
https://www.youtube.com/watch?v=Ju2FrqIrdx0
vLLM is a cutting-edge serving engine designed for large language models (LLMs), offering unparalleled performance and efficiency for AI-driven applications.
Triton Inference Server
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
Triton Inference Server is an open source inference serving software that streamlines AI inferencing.
SGLang
[English] Install SGLang
https://docs.sglang.ai/get_started/install.html
SGLang is a fast serving framework for large language models and vision language models.
https://github.com/sgl-project/sgl-learning-materials
CUDA
https://developer.nvidia.com/blog/even-easier-introduction-cuda/
This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA.
https://www.youtube.com/watch?v=86FAWCzIe_4
Learn how to program with NVIDIA CUDA and leverage GPUs for high-performance computing and deep learning.
NCCL
https://developer.nvidia.com/nccl
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking.
Large models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g