Xiaomi On-Device Large Model Inference Engineer
Experienced hire | Full-time | A178819 | Location: Beijing | Status: Hiring
Requirements
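1. Familiar with the mainstream large model inference frameworks in industry, with a deep understanding of the design and implementation of open-source frameworks such as MNN-LLM, vLLM, SGLang, and TensorRT-LLM; framework development experience is a plus.
2. Proficient in low-bit quantization for large models, with hands-on experience in low-bit (INT4) quantization using methods such as AWQ, GPTQ, SpinQuant, and Seq-MSE, and familiar with the underlying algorithms and optimization techniques.
3. Deep understanding and working command of the core optimization techniques for large model inference, including but not limited to speculative decoding, Chunk Prefill, Prompt Cache, the FlashAttention family of optimizations, and efficient KV cache management.
4. Familiar with the architectures of mainstream open-source large models and how they have evolved, including the structural characteristics of Llama, Qwen, and DeepSeek; keeps up with the latest developments and trends in large model architecture across academia and industry.
5. On-device hardware optimization skills:
- Familiar with general-purpose programming for on-device CPUs and GPUs (for example, SIMD instruction set optimization).
- Knowledge of Arm's latest compute hardware, CME, is a plus.
- Familiar with the hardware characteristics and inference deployment toolchain of at least one mainstream NPU (for example, Qualcomm, MediaTek, or HiSilicon).
6. Solid engineering skills: proficient in C++ and Python; extensive experience developing large-scale projects is a plus.
7. Strong learning ability, the ability to analyze and solve problems independently, and good teamwork and communication skills.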
Responsibilities
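1. Develop a high-performance on-device large model inference framework that fully exploits the compute of the backend hardware, building an AI framework with industry-leading performance.
2. Deploy the large language models and multimodal large models behind Xiao AI (小爱同学) services to a range of edge devices, including vehicles, phones, and IoT devices.
3. Research and develop low-bit quantization algorithms for large models and ship them in Xiao AI large model services.
4. Track and survey large model inference techniques in industry, and assess the feasibility of putting academic techniques into production.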
Includes English-language materials
Large models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
MNN
https://github.com/alibaba/MNN?tab=readme-ov-file#intro
MNN is a highly efficient and lightweight deep learning framework.
vLLM
https://www.newline.co/@zaoyang/ultimate-guide-to-vllm--aad8b65d
vLLM is a framework designed to make large language models faster, more efficient, and better suited for production environments.
https://www.youtube.com/watch?v=Ju2FrqIrdx0
vLLM is a cutting-edge serving engine designed for large language models (LLMs), offering unparalleled performance and efficiency for AI-driven applications.
SGLang
[English] Install SGLang
https://docs.sglang.ai/get_started/install.html
SGLang is a fast serving framework for large language models and vision language models.
https://github.com/sgl-project/sgl-learning-materials
TensorRT
https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
This TensorRT Quick Start Guide is a starting point for developers who want to try out the TensorRT SDK; specifically, it demonstrates how to quickly construct an application to run inference on a TensorRT engine.
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
Prompt
https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/introduction-prompt-design
A prompt is a natural language request submitted to a language model to receive a response back.
https://learn.microsoft.com/en-us/azure/ai-foundry/openai/concepts/prompt-engineering
These techniques aren't recommended for reasoning models like gpt-5 and o-series models.
https://www.youtube.com/watch?v=LWiMwhDZ9as
Learn and master the fundamentals of Prompt Engineering and LLMs with this 5-HOUR Prompt Engineering Crash Course!
Caching
https://hackernoon.com/the-system-design-cheat-sheet-cache
The cache is a layer that stores a subset of data, typically the most frequently accessed or essential information, in a location quicker to access than its primary storage location.
https://www.youtube.com/watch?v=bP4BeUjNkXc
Caching strategies, Distributed Caching, Eviction Policies, Write-Through Cache and Least Recently Used (LRU) cache are all important terms when it comes to designing an efficient system with a caching layer.
https://www.youtube.com/watch?v=dGAgxozNWFE
Llama
https://github.com/LlamaFamily/Llama-Chinese
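The Llama Chinese community: an up-to-date collection of the latest Llama learning materials, aiming to build the best open-source ecosystem for Chinese Llama models; fully open source and available for commercial use.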
https://www.llama.com/docs/overview/
This guide provides information and resources to help you set up Llama including how to access the model, hosting, how-to and integration guides.
C++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese; free; starts from zero; complete examples; based on the latest version of Python 3.
https://www.learnpython.org/
A free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Related positions
Campus recruitment
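1. Deploy and optimize algorithms of all kinds, including large models, on mobile devices.
2. Develop mobile deep learning frameworks and optimize operators.
3. Continuously tap the compute potential of mobile chips and improve model architectures to achieve industry-leading execution efficiency.
4. Write related papers and patents.
[Project title] On-device large model efficiency optimization
[Project description] Address the performance, power, and memory constraints involved in deploying large models, and deliver the most efficient large model inference solution.
Updated 2025-06-25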
Campus recruitment
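1. [Low memory, low bandwidth] Novel low-bit (1-3 bit) quantization algorithms for large models.
2. [Low memory, low bandwidth] On-device inference for MoE or large-parameter models, addressing their large memory footprint.
3. [High performance] Research on high-performance on-device large model inference (for example, novel speculative decoding, hardware-fused high-performance computing, and novel algorithms that address the compute-bound prefill stage on device).
[Project title] High-performance on-device large model inference
[Project description] Study how to run high-performance large model inference on edge devices (compute, memory, and bandwidth are all strictly constrained on Qualcomm and in-house F3 chips, and even the in-house external BW chip has tight memory constraints), while ensuring that model quality meets business requirements and resource usage meets system requirements, effectively addressing the privacy and cost problems that are prominent with cloud-hosted large models.
Updated 2025-06-25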
Internship
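1. Research and develop quantization algorithms for on-device large models, including but not limited to low-bit and mixed-precision quantization, to improve inference efficiency and reduce compute consumption.
2. Work closely with the large model R&D team to adapt quantization to large models of different architectures, minimizing the quality loss after quantization.
3. Build and optimize the large model quantization toolchain, making the quantization workflow automated and efficient to improve overall R&D productivity.
4. Track the latest developments in large model quantization and bring cutting-edge techniques into real projects to keep the company's technology at the forefront.
Updated 2025-03-17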