NVIDIA Infrastructure Software Engineer, Deep Learning Libraries
Qualifications
• A Master's degree in Computer Science or Computer Engineering, or equivalent experience
• 3+ years of relevant experience
• Strong programming skills in Python (or similar) and familiarity with C/C++ development
• Experience setting up, maintaining, and automating continuous integration systems (e.g. Jenkins, GitHub Actions, GitLab pipelines, Azure DevOps)
• Experience with HTML5, CSS, NodeJS, or React
• Fluency with SCM tools (e.g. Git, Perforce) and build systems (e.g. Make, CMake, Bazel)

Ways to stand out from the crowd:
• Experience designing and developing automation in Jenkins with Groovy (or similar)
• Background in distributed systems and cluster/cloud computing, especially with Kubernetes
• Experience designing and developing unit and integration test frameworks
• Experience with mobile/embedded platforms and multiple operating systems (Ubuntu, RedHat, Windows, QNX, or similar)
• Track record of identifying useful new technologies and incorporating them into software development flows

This is an opportunity to have a wide impact at NVIDIA by improving development velocity across our many AI/DL/Compute software projects. Are you creative, driven, and autonomous? Do you love a challenge? If so, we want to hear from you!
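The CI-automation experience listed above can be illustrated with a toy pipeline driver in Python. The stage names and commands below are hypothetical placeholders, not NVIDIA's actual build flow; a real setup would typically live in Jenkins or GitHub Actions rather than a hand-rolled script:

```python
import subprocess

# Hypothetical pipeline stages; the commands are placeholders for illustration.
STAGES = [
    ("configure", ["cmake", "-S", ".", "-B", "build"]),
    ("build", ["cmake", "--build", "build"]),
    ("test", ["ctest", "--test-dir", "build"]),
]

def run_pipeline(stages, runner=subprocess.run):
    """Run each stage in order, recording its exit code.

    Stops at the first failing stage (fail-fast, as CI pipelines usually do).
    `runner` is injectable so the driver can be exercised without real tools.
    """
    results = {}
    for name, cmd in stages:
        proc = runner(cmd, capture_output=True)
        results[name] = proc.returncode
        if proc.returncode != 0:
            break  # skip remaining stages once one fails
    return results
```

The injectable `runner` makes the driver unit-testable, mirroring the test-framework experience the posting asks for.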
Responsibilities
• Designing and developing software for testing and analysis of our codebases
• Building scalable automation for build, test, integration, and release processes for publicly distributed deep learning libraries
• Developing throughout the software stack, from the user experience and user interfaces down to the cluster and database layers
• Configuring, maintaining, and building upon deployments of industry-standard tools (e.g. Kubernetes, Jenkins, Docker, CMake, GitHub, GitLab, Jira)
• Developing front-end solutions using HTML, CSS, JavaScript, and related web technologies
• Advancing the state of the art in those industry-standard tools
• Develop, test, and maintain rich web experiences with UIs that address deep domains with high volumes of data
• Build reusable components and front-end libraries for future use
• Work with the backend team to define and integrate APIs
• Implement software designs using JavaScript and related technologies
• Prepare and execute unit and integration tests
• Envision the functional and non-functional requirements needed to build solutions from scratch, and define the technologies, patterns, and prototypes to materialize new requirements as well-functioning projects
• Build the front-end of applications with appealing visual design
• Use test-driven development to ensure responsiveness, consistency, and efficiency, and craft maintainable testing infrastructure
• Build features and applications with mobile-responsive design
• Learn and adopt new technologies to quickly develop required POCs and influence direction
- Keep up to date with and utilize the latest developments in LLM system optimization.
- Take the lead in designing innovative system optimization solutions for internal LLM workloads.
- Discover and solve impactful technical problems, advance state-of-the-art LLM technologies, and translate ideas into production.
- Optimize LLM inference workloads through innovative kernel, algorithm, scheduling, and parallelization technologies.
- Continuously develop and maintain internal LLM inference infrastructure.
- Discover new LLM system optimization needs and innovations.
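One flavor of the scheduling work described above is request batching for LLM inference. The following is a toy greedy token-budget batcher, a simplified sketch of the idea only (the request sizes and budget are made up, and real systems such as continuous-batching servers are far more sophisticated):

```python
from collections import deque

def schedule_batches(requests, max_batch_tokens):
    """Greedily group requests into batches under a token budget.

    `requests` is a list of (request_id, num_tokens) pairs, served FIFO.
    Each batch's total token count stays within `max_batch_tokens`;
    an oversized request is run alone rather than dropped.
    """
    queue = deque(requests)
    batches = []
    while queue:
        batch, used = [], 0
        # Pack requests in arrival order while they fit the budget.
        while queue and used + queue[0][1] <= max_batch_tokens:
            rid, toks = queue.popleft()
            batch.append(rid)
            used += toks
        if not batch:
            # Head request alone exceeds the budget: schedule it by itself.
            rid, _ = queue.popleft()
            batch.append(rid)
        batches.append(batch)
    return batches
```

Batching amortizes per-step kernel-launch and weight-read costs across requests, which is why scheduling policy has such a large effect on inference throughput.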