Tencent High-Performance Computing Engineer
Experienced hire · Full-time · 3+ years · Tencent Cloud · Technology. Location: Shanghai. Status: Hiring
Requirements
1. Master's degree or above in Computer Science, Artificial Intelligence, Software Engineering, or a related field;
2. 5+ years of experience in AI systems, high-performance computing, or low-level systems development;
3. Hands-on experience architecting and optimizing a large-scale, production-grade LLM online inference system from zero to one;
4. Expert in C++/Python, with a solid systems-programming foundation and a deep, systematic understanding of parallel computing, memory management, and performance tuning;
5. Deep understanding of the Transformer architecture, with kernel-level/source-level optimization experience in mainstream inference frameworks such as vLLM, TensorRT-LLM, and LightLLM; able to make architectural decisions on core techniques such as KV Cache, low-bit quantization, and continuous batching;
6. Able to design and lead the implementation of high-concurrency, ultra-low-latency distributed serving systems; familiar with cloud-native deployment and operations technologies such as Docker/Kubernetes.
Bonus points
1. Proven, strategic-level success porting LLMs, developing low-level operators, or adapting inference engines on domestic AI chip platforms such as Ascend, Hygon, and Iluvatar;
2. Experience with multi-GPU/multi-node communication (N…
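The low-bit quantization named in requirement 5 can be illustrated with a minimal sketch: symmetric per-tensor INT8 quantization. This is a toy stand-in for production schemes such as AWQ, which additionally use activation-aware per-channel scales.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# per-element reconstruction error is bounded by half a quantization step
print(np.abs(w - w_hat).max() <= s / 2 + 1e-6)
```

The trade-off the role describes is exactly here: a coarser grid (fewer bits) shrinks memory and bandwidth but widens the reconstruction error, which is why accuracy must be balanced against compute efficiency.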
Responsibilities
1. Performance engineering for ultra-large-scale LLMs: lead and plan the technical roadmap for extreme performance optimization of models at the hundred-billion-parameter scale; own the deep customization and production-grade architecture of core scheduling strategies such as PagedAttention and continuous batching; drive kernel-level optimization and deployment of mainstream inference frameworks such as vLLM/TensorRT-LLM;
2. Low-bit and sparse model optimization: lead the industrial-grade, systematic adoption of cutting-edge low-bit quantization techniques such as INT4/FP8/AWQ, balancing accuracy against compute efficiency; design distributed scheduling, routing, memory management, and cross-GPU communication optimizations for MoE models;
3. Unified and multimodal architecture: define and design a unified AI inference engine architecture with long-term extensibility to support autoregressive generation, and proactively solve the co-inference deployment challenges of multimodal LLMs (e.g. vision-language models);
4. Heterogeneous compute and domestic-chip adaptation: lead the strategic porting, ecosystem adaptation, and performance optimization of the inference engine on domestic AI chips (e.g. Ascend, Hygon, Iluvatar); deeply optimize and customize communication primitives such as HCCL/NCCL to achieve self-sufficient compute across heterogeneous architectures;
5. Core operator optimization and instruction-architecture innovation (Enhanced Focus): work deep in GPU/NPU hardware internals, leading the design and implementation of LLM-specific high-performance operators, with emphasis on high-performance Attention kernels, deeply customized and fused matrix multiplication (GEMM), and KV Cache read/write optimization;
6. Deep understanding and exploitation of hardware instruction set architectures (ISA) and microarchitecture: use CUDA/Triton or domestic-chip low-level programming languages to apply SIMD/SIMT instruction optimization, instruction-level parallelism (ILP), and register reuse, pushing LLM inference performance toward the hardware's theoretical limit.
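The PagedAttention memory model in item 1 can be sketched at the data-structure level: instead of reserving one contiguous KV-cache slab per request, memory is carved into fixed-size blocks that are handed to sequences on demand, like virtual-memory pages. A toy allocator illustrating the idea, not vLLM's actual implementation:

```python
class PagedKVCache:
    """Toy block allocator illustrating the PagedAttention memory model."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size            # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset), allocating lazily."""
        table = self.block_tables.setdefault(seq_id, [])
        block_idx = pos // self.block_size
        if block_idx == len(table):             # sequence grew past its last block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; scheduler must preempt")
            table.append(self.free_blocks.pop())
        return table[block_idx], pos % self.block_size

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4, block_size=16)
slots = [cache.append_token(seq_id=0, pos=p) for p in range(20)]  # spills into a 2nd block
cache.free(0)
```

Because blocks are allocated only when a sequence actually grows into them, internal fragmentation is capped at one block per sequence, which is what lets a continuous-batching scheduler pack many more concurrent requests into the same GPU memory.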
English materials included
Education+
LLMs+
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
System Design+
https://roadmap.sh/system-design
Everything you need to know about designing large scale systems.
https://www.youtube.com/watch?v=F2FmTdLtb_4
This complete system design tutorial covers scalability, reliability, data handling, and high-level architecture with clear explanations, real-world examples, and practical strategies.
C+++
https://www.learncpp.com/
LearnCpp.com is a free website devoted to teaching you how to program in modern C++.
https://www.youtube.com/watch?v=ZzaPdXTrSb8
Python+
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, beginner-friendly, with complete examples, based on the latest Python 3.
https://www.learnpython.org/
A free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Performance Tuning+
https://goperf.dev/
The Go App Optimization Guide is a series of in-depth, technical articles for developers who want to get more performance out of their Go code without relying on guesswork or cargo cult patterns.
https://web.dev/learn/performance
This course is designed for those new to web performance, a vital aspect of the user experience.
https://www.ibm.com/think/insights/application-performance-optimization
Application performance is not just a simple concern for most organizations; it’s a critical factor in their business’s success.
https://www.oreilly.com/library/view/optimizing-java/9781492039259/
Performance tuning is an experimental science, but that doesn’t mean engineers should resort to guesswork and folklore to get the job done.
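Across languages, the guides above share one premise: measure before optimizing. A minimal micro-benchmark with Python's standard-library `timeit` module (the workload compared here is an illustrative choice, not from any of the linked guides):

```python
import timeit

# Compare two ways of building a string from 1000 pieces.
concat = timeit.timeit(
    "s = ''\nfor w in words: s += w",      # repeated concatenation
    setup="words = ['x'] * 1000",
    number=1000,
)
joined = timeit.timeit(
    "''.join(words)",                      # single join
    setup="words = ['x'] * 1000",
    number=1000,
)
print(f"concat: {concat:.4f}s  join: {joined:.4f}s")
```

Repeating each measurement (`number=1000`) amortizes timer noise; the absolute numbers matter less than measuring the same work the same way before and after a change.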
Transformer+
https://huggingface.co/learn/llm-course/en/chapter1/4
Breaking down how Large Language Models work, visualizing how data flows through.
https://poloclub.github.io/transformer-explainer/
An interactive visualization tool showing you how transformer models work in large language models (LLM) like GPT.
https://www.youtube.com/watch?v=wjZofJX0v4M
Breaking down how Large Language Models work, visualizing how data flows through.
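The core computation those visualizations walk through is scaled dot-product attention. A minimal NumPy version, single head, no masking:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract row max for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(Q K^T / sqrt(d_k)) V  — one head, no mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq, seq) similarity of queries to keys
    return softmax(scores) @ V         # weighted mix of value vectors

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one output vector per query position
```

Everything else in a Transformer block (multiple heads, masking, KV caching at inference time) is layered on top of this one formula.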
vLLM+
https://www.newline.co/@zaoyang/ultimate-guide-to-vllm--aad8b65d
vLLM is a framework designed to make large language models faster, more efficient, and better suited for production environments.
https://www.youtube.com/watch?v=Ju2FrqIrdx0
vLLM is a cutting-edge serving engine designed for large language models (LLMs), offering unparalleled performance and efficiency for AI-driven applications.
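The continuous batching that vLLM popularized differs from static batching in that finished sequences leave the batch and waiting requests join at every decode step, instead of the whole batch draining before new work is admitted. A toy scheduler loop illustrating the idea (not vLLM's API):

```python
from collections import deque

def continuous_batching(requests, max_batch: int):
    """requests: list of (req_id, decode_steps). Returns the batch contents per step."""
    waiting = deque(requests)
    running = {}                 # req_id -> decode steps remaining
    trace = []
    while waiting or running:
        # admit waiting requests into free slots at every iteration
        while waiting and len(running) < max_batch:
            rid, steps = waiting.popleft()
            running[rid] = steps
        trace.append(sorted(running))
        # one decode step for every running sequence; finished ones exit immediately
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
    return trace

steps = continuous_batching([("a", 3), ("b", 1), ("c", 2)], max_batch=2)
# step 1: [a, b] — b finishes; step 2: [a, c] — c joins while a is mid-generation
```

Request "c" starts decoding as soon as "b" finishes rather than waiting for "a", which is where the throughput gain over static batching comes from.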
TensorRT+
https://docs.nvidia.com/deeplearning/tensorrt/latest/getting-started/quick-start-guide.html
This TensorRT Quick Start Guide is a starting point for developers who want to try out the TensorRT SDK; specifically, it demonstrates how to quickly construct an application to run inference on a TensorRT engine.
Kernel+
https://www.youtube.com/watch?v=C43VxGZ_ugU
I rummage around the Linux kernel source and try to understand what makes computers do what they do.
https://www.youtube.com/watch?v=HNIg3TXfdX8&list=PLrGN1Qi7t67V-9uXzj4VSQCffntfvn42v
Learn how to develop your very own kernel from scratch in this programming series!
https://www.youtube.com/watch?v=JDfo2Lc7iLU
Denshi goes over a simple explanation of what computer kernels are and how they work, alongside what makes the Linux kernel special.
Caching+
https://hackernoon.com/the-system-design-cheat-sheet-cache
The cache is a layer that stores a subset of data, typically the most frequently accessed or essential information, in a location quicker to access than its primary storage location.
https://www.youtube.com/watch?v=bP4BeUjNkXc
Caching strategies, Distributed Caching, Eviction Policies, Write-Through Cache and Least Recently Used (LRU) cache are all important terms when it comes to designing an efficient system with a caching layer.
https://www.youtube.com/watch?v=dGAgxozNWFE
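The LRU eviction policy mentioned above is small enough to sketch with the standard library's `OrderedDict`, which keeps keys in insertion order and can move a key to the end in O(1):

```python
from collections import OrderedDict

class LRUCache:
    """Least-recently-used cache: O(1) get/put, evicts the stalest key when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value) -> None:
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # touching "a" makes "b" the eviction candidate
cache.put("c", 3)       # capacity exceeded: "b" is evicted
```

The same access-recency bookkeeping, scaled out with consistent hashing and TTLs, underlies the distributed caching layers the linked talks describe.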
High Concurrency+
https://www.baeldung.com/concurrency-principles-patterns
In this tutorial, we’ll discuss some of the design principles and patterns that have been established over time to build highly concurrent applications.
https://www.baeldung.com/java-concurrency
Handling concurrency in an application can be a tricky process with many potential pitfalls. A solid grasp of the fundamentals will go a long way to help minimize these issues.
https://www.oreilly.com/library/view/concurrency-in-go/9781491941294/
You’ll understand how Go chooses to model concurrency, what issues arise from this model, and how you can compose primitives within this model to solve problems.
https://www.oreilly.com/library/view/modern-concurrency-in/9781098165406/
With this book, you'll explore the transformative world of Java 21's key feature: virtual threads.
https://www.youtube.com/watch?v=qyM8Pi1KiiM
https://www.youtube.com/watch?v=wEsPL50Uiyo
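A recurring pattern in all of the concurrency material above is producer-consumer with a thread-safe queue, which avoids most of the pitfalls (races, lost wake-ups) by funneling shared state through one synchronized channel. A minimal Python sketch using the standard library:

```python
import threading
import queue

def worker(tasks: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        item = tasks.get()
        if item is None:            # sentinel: shut this worker down
            tasks.task_done()
            return
        with lock:                  # protect the shared results list
            results.append(item * item)
        tasks.task_done()

tasks: queue.Queue = queue.Queue()
results: list = []
lock = threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)                 # one sentinel per worker
tasks.join()                        # wait until every task is processed
for t in threads:
    t.join()
print(sorted(results))              # squares of 0..9; completion order varies
```

The queue serializes hand-off between producers and consumers; the explicit lock is still needed because `list.append` results are read later as a unit, and the sentinel-per-worker shutdown avoids hanging `get()` calls.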
Related Positions
Experienced hire · 5+ years · Technology
1. Join the system co-optimization team: develop the serving framework, analyze performance bottlenecks, and tune performance, delivering full-stack Compiler+Serving+Benchmark optimization solutions across layers for e-commerce recommendation scenarios.
2. Introduce new hardware and co-optimize software with it, squeezing the limits of compute on heterogeneous hardware to cut serving resource costs.
3. Continuously track the industry's best implementations of deep learning models, from training through deployment, and surpass and innovate on them in real scenarios.
Updated 2025-06-20 · Shanghai

Experienced hire · Systems Development
1. On-device deployment and performance optimization: lead the software architecture design and chip deployment of world models and driver-assistance systems, combining instruction-set optimization, thread-scheduling strategies, and memory-pool management to achieve extreme performance and resource utilization.
2. New-platform feasibility assessment: systematically evaluate the module SDK's adaptation potential on new automotive platforms, producing technical feasibility reports and performance-optimization roadmaps.
3. Deep collaboration with the chip ecosystem: focus on mainstream automotive chip platforms such as NVIDIA's, completing model deployment and inference performance tuning; leveraging system and hardware architecture characteristics, work with chip vendors on localized hardware/software co-development to meet business needs.
Updated 2025-06-12 · Beijing | Hefei | Shanghai
Experienced hire · 3+ years
1. Productionize AI and traditional image/video algorithms.
2. Push each algorithm to its performance limit on specific processors, including but not limited to instruction-set, cache, bandwidth, and GPU optimization.
3. Analyze each algorithm's performance bottlenecks and feed findings back to algorithm colleagues, further preserving quality while improving speed and reducing load.
4. Package algorithm SDKs and deliver them to downstream teams for integration, aligning on performance and quality.
Updated 2024-06-01 · Shenzhen