
NVIDIA Senior LLM Train Framework Engineer

Experienced hire | Full-time | Location: Shanghai | Status: Hiring

Requirements


• MS, PhD or equivalent experience in Computer Science, AI, Applied Math, or related fields and 5+ years of industry experience.
• Experience with AI training frameworks (e.g., PyTorch, JAX) and/or inference and deployment environments (e.g., TRTLLM, vLLM, SGLang).
• Proficiency in distributed training.
• Proficient in Python programming, software development, debugging, performance analysis, test design, and documentation.
• CUDA or collective communication programming skills are a big plus.
• Consistent record of working effectively across multiple engineering initiatives and improving AI libraries with new innovations.
• S…

Responsibilities


NVIDIA is now looking for LLM Train Framework Engineers for the Megatron Core team. Megatron Core is an open-source, scalable, cloud-native framework built for researchers and developers working on Large Language Model (LLM) and Multimodal (MM) foundation model pretraining and post-training. Our GenAI Frameworks provide end-to-end model training, including pretraining, alignment, customization, evaluation, deployment, and tooling to optimize performance and user experience. In this role, you will build on Megatron Core's capabilities by inventing advanced distributed training algorithms and model optimizations, and collaborate with partners to implement optimized solutions.
What you’ll be doing:
• Build and develop the open-source Megatron Core.
• Address large-scale AI training and inference challenges across the entire model lifecycle, including orchestration, data preprocessing, model training and tuning, and model deployment.
• Work at the intersection of AI applications, libraries, frameworks, and the entire software stack.
• Spearhead advancements in model architectures, distributed training strategies, and model parallel approaches.
• Accelerate foundation model training and optimization through mixed-precision recipes and advanced NVIDIA GPU architectures.
• Tune and optimize the performance of deep learning frameworks and software components.
• Research, prototype, and develop robust and scalable AI tools and pipelines.