
Keep: LLM Inference and Deployment Optimization Engineer (J12297)
Experienced hire | Full-time | 2+ years of experience | Location: Beijing | Status: Hiring
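Requirements
1. Bachelor's degree or above from a full-time, nationally accredited program in computer science, artificial intelligence, or a related field, with at least 2 years of engineering development experience.
2. Proficient in Python and C++, familiar with development on Linux, with a solid foundation in data structures and algorithms.
3. Familiar with mainstream LLM inference frameworks (vLLM / SGLang / TensorRT, etc.), with hands-on experience deploying and optimizing inference services.
4. Understanding of the core principles of LLM inference optimization, including KV Cache, PagedAttention, Prefix Caching, and quantization.
5. …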
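Responsibilities
1. Build and tune the performance of inference services for large models in the sports and health vertical, covering SLM / LLM / VLM and other model types.
2. Using inference frameworks such as vLLM / SGLang, continuously optimize time to first token, throughput, and GPU resource utilization.
3. Design and implement model inference acceleration schemes, including Prefix Caching, Speculative Decoding, and quantized deployment.
4. Develop the gateway for unified multi-model access and request scheduling, implementing intelligent routing and load balancing for different business scenarios.
5. Optimize inference-engine-related Context / KV Cache management (e.g., long context, prefix reuse) to support efficient delivery of upper-layer Agent scenarios.
6. Build the observability stack for model services, including latency monitoring, end-to-end request tracing, cost attribution, and degradation and circuit-breaking strategies.
7. Work closely with algorithm researchers to connect the full pipeline from training output to online serving.
8. Track frontier advances in LLM inference optimization and continuously iterate on the inference architecture and service performance.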
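For readers unfamiliar with the prefix reuse mentioned in the list above, the idea can be sketched in a few lines of Python. This is a toy illustration only, not how vLLM or SGLang implement it: the names `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical, and a placeholder object stands in for the real KV tensors. Each block hash covers the entire prefix up to that block, so two requests share cached blocks only while their token sequences are identical from the start.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cached block; an illustrative value, not a framework default


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one hash per full block; each hash depends on the whole prefix so far."""
    hashes = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixCache:
    """Toy prefix cache: maps prefix-block hashes to placeholder KV entries."""

    def __init__(self):
        self._blocks = {}  # hash -> stand-in for the block's KV tensors

    def insert(self, token_ids):
        """Record every full block of a processed request."""
        for h in block_hashes(token_ids):
            self._blocks.setdefault(h, object())

    def match(self, token_ids):
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for h in block_hashes(token_ids):
            if h not in self._blocks:
                break
            hit += BLOCK_SIZE
        return hit


cache = PrefixCache()
system_prompt = list(range(8))                 # shared 8-token prefix
cache.insert(system_prompt + [100, 101, 102])  # first request fills the cache
reused = cache.match(system_prompt + [200, 201, 202, 203, 204])
print(reused)  # tokens of the second request that can skip prefill
```

In this sketch the second request reuses the two blocks that make up the shared prefix, so only its remaining tokens would need prefill computation.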
Includes English-language materials
Education
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, starts from zero, with complete examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction to all of the core concepts in Python.
C++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
Linux
https://ryanstutorials.net/linuxtutorial/
Ok, so you want to learn how to use the Bash command line interface (terminal) on Unix/Linux.
https://ubuntu.com/tutorials/command-line-for-beginners
The Linux command line is a text interface to your computer.
https://www.youtube.com/watch?v=6WatcfENsOU
In this Linux crash course, you will learn the fundamental skills and tools you need to become a proficient Linux system administrator.
https://www.youtube.com/watch?v=v392lEyM29A
Never fear the command line again, make it fear you.
https://www.youtube.com/watch?v=ZtqBQ68cfJc
Data Structures
https://www.youtube.com/watch?v=8hly31xKli0
In this course you will learn about algorithms and data structures, two of the fundamental topics in computer science.
https://www.youtube.com/watch?v=B31LgI4Y4DQ
Learn about data structures in this comprehensive course. We will be implementing these data structures in C or C++.
https://www.youtube.com/watch?v=CBYHwZcbD-s
Data Structures and Algorithms full course tutorial (Java)
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
Large Models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
vLLM
https://www.newline.co/@zaoyang/ultimate-guide-to-vllm--aad8b65d
vLLM is a framework designed to make large language models faster, more efficient, and better suited for production environments.
https://www.youtube.com/watch?v=Ju2FrqIrdx0
vLLM is a cutting-edge serving engine designed for large language models (LLMs), offering unparalleled performance and efficiency for AI-driven applications.
SGLang
[English] Install SGLang
https://docs.sglang.ai/get_started/install.html
SGLang is a fast serving framework for large language models and vision language models.
https://github.com/sgl-project/sgl-learning-materials
TensorRT
https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
This TensorRT Quick Start Guide is a starting point for developers who want to try out the TensorRT SDK; specifically, it demonstrates how to quickly construct an application to run inference on a TensorRT engine.
Caching
https://hackernoon.com/the-system-design-cheat-sheet-cache
The cache is a layer that stores a subset of data, typically the most frequently accessed or essential information, in a location quicker to access than its primary storage location.
https://www.youtube.com/watch?v=bP4BeUjNkXc
Caching strategies, Distributed Caching, Eviction Policies, Write-Through Cache and Least Recently Used (LRU) cache are all important terms when it comes to designing an efficient system with a caching layer.
https://www.youtube.com/watch?v=dGAgxozNWFE
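As a concrete instance of the Least Recently Used (LRU) eviction policy named in the caching resources above, here is a minimal sketch built on Python's standard `collections.OrderedDict`. The class name `LRUCache` and its methods are illustrative choices, not taken from any of the linked materials.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._data = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key, default=None):
        """Return the cached value and mark the key as most recently used."""
        if key not in self._data:
            return default
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value) -> None:
        """Insert or update a key, evicting the oldest entry if over capacity."""
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used key
cache.put("c", 3)  # exceeds capacity, so "b" is evicted
```

After these calls the cache holds "a" and "c"; "b" was evicted because it was the entry touched least recently.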