Meituan [LongCat Large Model Talent Campus Recruitment] Foundation Large Model Inference Engine Engineer
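Type: Campus recruitment, full-time
Department: Core Local Commerce - Basic R&D Platform
Location: Beijing | Shanghai
Status: Hiring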
Requirements
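1. Theoretical foundation: a deep understanding of the core mechanisms of the Transformer architecture (Attention/MoE/Memory, etc.), and familiarity with the training and inference pipelines of large models.
2. Engineering ability: familiarity with the source code of mainstream inference frameworks (SGLang/vLLM), with hands-on production experience in key techniques such as PD (prefill/decode) disaggregation, model quantization, speculative decoding, scheduling overlap, and prefix caching. Proficient in C++/CUDA/AscendC, with complex operator (e.g. FlashAttention, quantized GEM…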
Responsibilities
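As Agent technology is deployed at scale, token consumption in large model inference is growing exponentially, and the performance and cost of the inference system have become the core bottleneck constraining business growth. The LongCat inference team is committed to building a world-class large model inference engine that is efficient, stable, and scalable, supporting complex online traffic on ultra-large-scale clusters and giving the business the best possible inference performance and cost advantage.

Job responsibilities
We invite engineers with deep expertise in one or more of the following areas to join us:
1. Model-system co-design: Participate deeply in model architecture design and bring inference-efficiency optimization forward into the model design stage. Work closely with the algorithm and training engineering teams to design low-latency, high-throughput model structures from a hardware-affinity perspective, achieving end-to-end optimization of algorithm and system.
2. High-performance operator development: For heterogeneous computing hardware, develop highly optimized fused operators, and explore low-level performance techniques such as tiling strategies, memory access patterns, and pipeline parallelism.
3. Inference framework optimization: Deeply optimize the in-house inference framework, reduce scheduling overhead, achieve efficient overlap of computation and communication, and improve hardware utilization.
4. Distributed system architecture: Design a highly available distributed inference system that uses intelligent request scheduling, dynamic load balancing, backpressure control, and similar mechanisms to safeguard stability and SLA under bursty traffic.
5. Extreme optimization for long-context scenarios: For inference with models at the T (trillion) parameter scale and sequence lengths at the M (million) scale, systematically optimize device memory footprint, I/O bandwidth, compute allocation, and cross-node communication efficiency to fully unlock the hardware's potential.

[Why us]
1. Take on the most central engineering challenge of the large model era: using extreme system optimization to break the boundary between inference cost and performance.
2. From distributed scheduling on large-scale clusters to squeezing hardware performance out of low-level operators; from the device memory revolution in long-context scenarios to the future architecture of model-system co-design. Every line of code will directly affect the inference efficiency of hundreds of billions of tokens and improve the online service experience of hundreds of millions of users.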
Includes English-language materials
Transformer+
https://huggingface.co/learn/llm-course/en/chapter1/4
Breaking down how Large Language Models work, visualizing how data flows through.
https://poloclub.github.io/transformer-explainer/
An interactive visualization tool showing you how transformer models work in large language models (LLM) like GPT.
https://www.youtube.com/watch?v=wjZofJX0v4M
Breaking down how Large Language Models work, visualizing how data flows through.
Large models+
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
SGLang+
[English] Install SGLang
https://docs.sglang.ai/get_started/install.html
SGLang is a fast serving framework for large language models and vision language models.
https://github.com/sgl-project/sgl-learning-materials
vLLM+
https://www.newline.co/@zaoyang/ultimate-guide-to-vllm--aad8b65d
vLLM is a framework designed to make large language models faster, more efficient, and better suited for production environments.
https://www.youtube.com/watch?v=Ju2FrqIrdx0
vLLM is a cutting-edge serving engine designed for large language models (LLMs), offering unparalleled performance and efficiency for AI-driven applications.
Caching+
https://hackernoon.com/the-system-design-cheat-sheet-cache
The cache is a layer that stores a subset of data, typically the most frequently accessed or essential information, in a location quicker to access than its primary storage location.
https://www.youtube.com/watch?v=bP4BeUjNkXc
Caching strategies, Distributed Caching, Eviction Policies, Write-Through Cache and Least Recently Used (LRU) cache are all important terms when it comes to designing an efficient system with a caching layer.
https://www.youtube.com/watch?v=dGAgxozNWFE