NVIDIA Deep Learning Performance Software Engineer
Requirements
• Master's or Ph.D. degree (or equivalent experience) in a relevant discipline (CE, CS&E, CS, AI)
• Excellent C/C++ programming and…
Responsibilities
We are now looking for a Deep Learning Performance Software Engineer! We are expanding our research and development for deep learning, and we seek excellent Software Engineers and Senior Software Engineers to join our team. We specialize in developing GPU-accelerated deep learning software. Researchers around the world are using NVIDIA GPUs to power a revolution in deep learning, enabling breakthroughs in numerous areas. Join the team that builds software to enable new solutions. The ability to work in a fast-paced, customer-oriented team is required, and excellent communication skills are necessary.

What you'll be doing:
• Develop compilers and DSLs for deep learning workloads
• Design and implement highly optimized deep learning kernels
• Continuously improve the compiler architecture for current and next-generation chips
• Perform performance analysis on emerging AI workloads and integrate with AI frameworks
1. Develop innovative algorithms and build prototypes for business problems using the latest large-model, deep learning, machine learning, statistical, and optimization techniques;
2. Use unsupervised learning, clustering algorithms, and related techniques to discover potential patterns and trends in large datasets and propose data-driven business solutions;
3. Collaborate with product managers and cross-functional teams to define user stories and success metrics, managing data projects through the full process from 0 to 1;
4. Use methods such as A/B testing to validate the business value and expected returns of projects, and continuously optimize model performance;
5. Work with engineering teams to deploy data models and scale solutions.
Team Introduction: Data AML is ByteDance's machine learning middle platform, providing training and inference systems for recommendation, advertising, CV (computer vision), speech, and NLP (natural language processing) across businesses such as Douyin, Toutiao, and Xigua Video. AML provides powerful machine learning computing capabilities to internal business units and conducts research on general and innovative algorithms to solve key business challenges. Additionally, through Volcano Engine, it delivers core machine learning and recommendation system capabilities to external enterprise clients. Beyond business applications, AML is also engaged in cutting-edge research in areas such as AI for Science and scientific computing.

Research Project Introduction: Large-scale recommendation systems are increasingly applied to short-video, text-community, image, and other products, and modal information plays a growing role in them. ByteDance's practice has found that modal information works well as a generalization feature supporting business scenarios such as recommendation, and research on end-to-end ultra-large-scale multimodal recommendation systems has enormous potential. Building on algorithm-engineering co-design, the project is expected to further explore directions such as multimodal cotraining, 7B/13B large-scale parameter models, and longer-sequence end-to-end approaches.

Engineering research directions include:
1. Representation of multimodal samples
2. Construction of high-performance multimodal inference engines based on the PyTorch framework
3. Development of high-performance multimodal training frameworks
4. Application of heterogeneous hardware in multimodal recommendation systems

Algorithmic research directions include:
1. Design of sound recommendation-advertising and multimodal cotraining architectures
2. Sparse Mixture of Experts (Sparse MoE)
3. Memory Network
4. Mixed-precision techniques

1. Design and develop machine learning system architecture, and tune system performance;
2. Solve technical challenges such as high concurrency, high reliability, and high scalability;
3. Cover work across multiple sub-areas of machine learning systems, including resource scheduling, task orchestration, model training, model inference, model management, dataset management, workflow orchestration, and ML for System;
4. Research and introduce forward-looking technologies for machine learning systems, such as bringing the latest hardware architectures, heterogeneous computing systems, and GPU optimization techniques into production;
5. Research machine-learning-based methods to analyze and optimize cluster and service resource usage.

• Core Development & Optimization: Participate in the development of cutting-edge applications powered by Large Language Models (LLMs), contributing to code implementation, performance optimization, and debugging.
• Requirement Translation & Feature Implementation: Collaborate closely with senior developers and product teams to deeply understand user requirements and translate them into high-quality functional modules.
• LLM Models & Framework: Responsible for the design, development, and maintenance of LLM models within the team's proprietary LLM framework.
• Advanced LLM Interaction: Skillfully apply prompt engineering techniques, context management, and advanced model interaction as part of LLM application development.
• Continuous Learning & Growth: Actively learn and stay updated with the latest developments in LLM technologies, algorithms, and programming best practices.
• Collaboration & Skill Enhancement: Actively participate in code reviews, pair programming sessions, and technical discussions to continuously grow your development skills.
• Technical Problem Solving: Under the guidance of senior engineers, participate in technical problem-solving, performance optimization, and system debugging.
• AI Agent System Productionization: Work closely with product and research teams to translate AI agent logic (e.g., tool-use, planning, reasoning) into robust, production-grade systems.

• Participate in the development of cutting-edge applications powered by Large Language Models (LLMs), contributing to code implementation, optimization, and debugging.
• Collaborate with senior developers and product teams to understand user requirements and transform them into functional code modules.
• Design, develop, and maintain LLM models within the team’s proprietary LLM framework.
• Implement prompt engineering techniques, context management, and advanced model interaction as part of LLM application development.
• Continuously learn and stay updated with the latest developments in LLM technologies, algorithms, and programming best practices.
• Participate in code reviews, pair programming sessions, and technical discussions to grow your development skills.
• Take part in technical problem-solving, performance optimization, and system debugging under the guidance of senior engineers.