
NVIDIA Systems Infrastructure Software Engineer

Experienced hire · Full-time · Location: Shanghai · Status: Hiring

Job Requirements


The NVIDIA Infrastructure Group is seeking world-class programmers to design, implement, and debug the next generation of large-scale, general-purpose graphics and computing chips. In this role, you will help build the core verification infrastructure that drives the development of our GPU and Tegra chips. This strongly object-oriented C++ and Python infrastructure encompasses several extensive applications that allow us to efficiently verify the world's largest chips with a sophisticated distributed computing execution and triage environment. Come and join our diverse, international, fast-paced team with high production-quality standards.
What You’ll Be Doing:
• Developing environments to program and test next-generation GPU and SoC features well before they are integrated into products or supported by driver software. Every day brings new and meaningful challenges.
• Collaborating with colleagues across architecture, hardware,…

Job Responsibilities


N/A
Includes English materials
C+
Related Positions

NVIDIA · Experienced hire

N/A

Updated 2025-09-17 · Shanghai
Oracle · Experienced hire · PRODEV-S

Responsibilities:
• Collaborate with the GPU sales team and the SCE AIML TPM team to provide technical support for customers at both the pre-sales and after-sales stages. Take ownership of problems and work to identify solutions.
• Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
• Collaborate with customers' scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
• Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
• Optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques.
• Troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime.
• Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
• Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.

Qualifications:
• Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes.
• Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
• Solid understanding of networking concepts, security principles, and best practices.
• Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
• Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
• Strong documentation skills with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps to facilitate knowledge sharing, ensure maintainability, and enhance team collaboration.
• Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization.

Updated 2025-12-09 · Shenzhen
Mi · Experienced hire · A132469

1. R&D Operations Strategy and Planning
- Develop and implement comprehensive R&D operations strategies, plans, and processes that align with the company's overall business objectives in the automotive sector.
- Forecast resource requirements, including personnel, equipment, and budget, for R&D projects based on product development roadmaps and market demands.
- Create and manage the annual R&D operations budget, ensuring efficient allocation of funds and cost control throughout the department.

2. Operation Project Management and Coordination
- Oversee the planning, execution, and delivery of multiple operation projects simultaneously. Define project scopes, set milestones, and establish timelines to ensure on-time completion.
- Coordinate with cross-functional teams, including engineering, design, purchasing, administration, HR, and finance, to ensure that other business functions strongly support R&D activities.
- Facilitate communication and collaboration among project teams, stakeholders, and external partners, ensuring clear and timely information flow.

3. HR Management
- Collaborate with HR to develop and implement training and career development programs tailored to the needs of the R&D team.
- Foster a culture of innovation, collaboration, and continuous learning within the R&D department, encouraging team members to explore new ideas and technologies.
- Resolve conflicts and manage team dynamics effectively to maintain a positive and productive work environment.

4. Financial Management
- Work closely with the finance department to develop and manage the R&D budget. Monitor and control expenses, ensuring that all spending aligns with budgetary constraints and financial targets.
- Analyze financial data related to R&D projects, such as cost overruns, return on investment, and cost-benefit ratios. Provide financial insights and recommendations to senior management to support decision-making.
- Participate in the preparation of financial reports and forecasts for the R&D department, ensuring accuracy and compliance with financial regulations.

5. Supplier Management
- Establish and maintain strong relationships with suppliers of R&D materials, equipment, and services. Collaborate with the procurement department to develop supplier strategies, negotiate contracts, and manage supplier performance.
- Evaluate suppliers based on quality, cost, delivery, and innovation capabilities. Conduct regular supplier audits and performance reviews to ensure compliance with company standards and expectations.
- Identify potential suppliers and new sourcing opportunities to support R&D projects. Drive continuous improvement in supplier performance through supplier development initiatives.

6. Resource Management
- Allocate and manage resources, including personnel, equipment, and facilities, to support R&D projects effectively. Ensure optimal utilization of resources and minimize waste.
- Coordinate with procurement and logistics teams to acquire necessary materials, equipment, and services in a timely manner.
- Manage the maintenance and upgrade of R&D facilities and equipment to ensure they meet the requirements of current and future projects.

7. Facilities Management
- Oversee the planning, design, and construction of R&D facilities, ensuring they are equipped with the necessary infrastructure and technology to support research and development activities.
- Manage the day-to-day operations of R&D facilities, including maintenance, security, and environmental management. Ensure compliance with relevant regulations and standards.
- Plan for future facility expansion and upgrades based on the growth and needs of the R&D department.

Updated 2025-04-01
ByteDance · Campus hire · A158012A

Team Introduction:
Data AML is ByteDance's machine learning middle platform, providing training and inference systems for recommendation, advertising, CV (computer vision), speech, and NLP (natural language processing) across businesses such as Douyin, Toutiao, and Xigua Video. AML provides powerful machine learning computing capabilities to internal business units and conducts research on general and innovative algorithms to solve key business challenges. Additionally, through Volcano Engine, it delivers core machine learning and recommendation system capabilities to external enterprise clients. Beyond business applications, AML is also engaged in cutting-edge research in areas such as AI for Science and scientific computing.

Research Project Introduction:
Large-scale recommendation systems are increasingly applied to short-video, text-community, image, and other products, and modality information plays a growing role in them. ByteDance's practice has found that modality information can serve as a generalization feature to support business scenarios such as recommendation, and research on end-to-end, ultra-large-scale multimodal recommendation systems has enormous potential. Building on algorithm-engineering co-design, the project aims to further explore directions such as multimodal co-training, models at the 7B/13B parameter scale, and longer-sequence end-to-end modeling.

Engineering research directions include:
• Representation of multimodal samples
• Construction of high-performance multimodal inference engines based on the PyTorch framework
• Development of high-performance multimodal training frameworks
• Application of heterogeneous hardware in multimodal recommendation systems

Algorithmic research directions include:
• Design of reasonable recommendation-advertising and multimodal co-training architectures
• Sparse Mixture of Experts (Sparse MoE)
• Memory Network
• Mixed-precision techniques

Responsibilities:
1. Design and develop the machine learning system architecture, and tune system performance;
2. Solve technical challenges such as high concurrency, high reliability, and high scalability;
3. Work across multiple sub-areas of machine learning systems, including resource scheduling, task orchestration, model training, model inference, model management, dataset management, workflow orchestration, and ML for System;
4. Investigate and introduce forward-looking technologies for machine learning systems, such as the latest hardware architectures, heterogeneous computing systems, and GPU optimization techniques;
5. Research machine-learning-based methods for analyzing and optimizing the resource usage of clusters and services.

Updated 2025-05-26 · Singapore