
AMD · CUDA Kernel Software Engineer

Experienced hire, full-time · Engineering · Location: Shanghai · Status: Hiring

Qualifications


- Direct experience with AMD ROCm development (HIP, MIOpen, Composable Kernel).
- Knowledge of LLM-specific optimizations (e.g., FlashAttention, PagedAttention in vLLM).
- Experience with distributed training/inference or model compression techniques.
- Contributions to open-source ML projects or GPU compute libraries.

ACADEMIC CREDENTIALS:
- Bachelor’s/Master’s in Computer Science, Electrical Engineering, or related field.

#LI-FL1

Responsibilities


THE ROLE:
We are seeking a talented Machine Learning Kernel Developer to design, develop, and optimize low-level machine learning kernels for AMD GPUs using the ROCm software stack. In this role, you will work on high-impact projects to accelerate AI frameworks and libraries, with a focus on emerging technologies like Large Language Models (LLMs) and other generative AI workloads.

THE PERSON:
The ideal candidate will have hands-on experience with GPU programming (ROCm or CUDA) and a passion for pushing the boundaries of AI performance.

KEY RESPONSIBILITIES:
- Design and implement highly optimized ML kernels (e.g., matrix operations, attention mechanisms) for AMD GPUs using ROCm.
- Profile, debug, and tune kernel performance to maximize hardware utilization for AI workloads.
- Collaborate with ML researchers and framework developers to integrate kernels into AI frameworks (e.g., PyTorch, TensorFlow) and inference engines (e.g., vLLM, SGLang).
- Contribute to the ROCm software stack by identifying and resolving bottlenecks in libraries like MIOpen, BLAS, or Composable Kernel.
- Stay updated on the latest AI/ML trends (LLMs, quantization, distributed inference) and apply them to kernel development.
- Document and communicate technical designs, benchmarks, and best practices.
- Troubleshoot and resolve issues related to GPU compatibility, performance, and scalability.

REQUIRED EXPERIENCE:
- 2+ years of experience in GPU kernel development for machine learning (ROCm or CUDA).
- Proficiency in C/C++ and Python, with experience in performance-critical programming.
- Strong understanding of ML frameworks (PyTorch, TensorFlow) and GPU-accelerated libraries.
- Basic knowledge of modern AI technologies (LLMs, transformers, inference optimization).
- Familiarity with parallel computing, memory optimization, and hardware architectures.
- Problem-solving skills and ability to work in a fast-paced environment.
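As an illustrative aside (not part of the posting): the "matrix operations" and "memory optimization" work described above centers on tiling. A minimal CPU-side sketch of a cache-blocked matrix multiply shows the idea that HIP/CUDA kernels apply with LDS/shared memory; the function name and tile size here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile size; on a GPU this would be matched to LDS/shared-memory
// capacity and wavefront/warp shape rather than CPU cache size.
constexpr std::size_t TILE = 32;

// C += A * B for square N x N row-major matrices, processed tile by tile so
// each working set stays cache-resident -- the CPU analogue of staging tiles
// in GPU shared memory before the inner multiply-accumulate loop.
void blocked_matmul(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C, std::size_t N) {
    for (std::size_t ii = 0; ii < N; ii += TILE)
        for (std::size_t kk = 0; kk < N; kk += TILE)
            for (std::size_t jj = 0; jj < N; jj += TILE)
                // Inner kernel: operate on one TILE x TILE block.
                for (std::size_t i = ii; i < std::min(ii + TILE, N); ++i)
                    for (std::size_t k = kk; k < std::min(kk + TILE, N); ++k) {
                        float a = A[i * N + k];
                        for (std::size_t j = jj; j < std::min(jj + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

A production GPU kernel would additionally vectorize loads, double-buffer tiles, and map the i/j loops onto thread blocks, but the loop-blocking structure is the same.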
Related positions

AMD · Experienced hire · Enginee

THE ROLE: You will be responsible for developing and optimizing deep learning operators (kernels) for high-performance training, as well as contributing to the design and implementation of large-scale training frameworks such as Megatron-LM.

Updated 2025-08-29
Microsoft · Experienced hire · Software

- Keep up to date with and utilize the latest developments in LLM system optimization.
- Take the lead in designing innovative system optimization solutions for internal LLM workloads.
- Optimize LLM inference workloads through innovative kernel, algorithm, scheduling, and parallelization technologies.
- Continuously develop and maintain internal LLM inference infrastructure.
- Discover new LLM system optimization needs and innovations.

Updated 2025-10-17
Microsoft · Experienced hire · Software

- Keep up to date with and utilize the latest developments in LLM system optimization.
- Discover/solve impactful technical problems, advance state-of-the-art LLM technologies, and translate ideas into production.
- Optimize LLM inference workloads through innovative kernel, algorithm, scheduling, and parallelization technologies.
- Continuously maintain internal LLM inference infrastructure.

Updated 2025-10-17
NVIDIA · Experienced hire

• Writing highly tuned compute kernels to perform core deep learning operations (e.g. matrix multiplies, convolutions, normalizations)
• Following general software engineering best practices, including support for regression testing and CI/CD flows
• Collaborating with teams across NVIDIA:
  • CUDA compiler team on generating optimal assembly code
  • Deep learning training and inference performance teams on which layers require optimization
  • Hardware and architecture teams on the programming model for new deep learning hardware features
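As an illustrative aside (not part of the posting): one of the "core deep learning operations" named above, layer normalization, reduces each row to its mean and variance and then rescales it. A minimal CPU sketch follows; the function name is hypothetical, and a tuned GPU kernel would instead parallelize the per-row reductions across threads.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative CPU sketch: normalize each row of a row-major [rows x cols]
// matrix in place to zero mean and unit variance. The epsilon guards against
// division by zero on constant rows (1e-5 is a commonly used default).
void layer_norm(std::vector<float>& x, std::size_t rows, std::size_t cols,
                float eps = 1e-5f) {
    for (std::size_t r = 0; r < rows; ++r) {
        float* row = &x[r * cols];
        // First reduction: row mean.
        float mean = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) mean += row[c];
        mean /= float(cols);
        // Second reduction: row variance.
        float var = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            float d = row[c] - mean;
            var += d * d;
        }
        var /= float(cols);
        // Elementwise rescale.
        float inv = 1.0f / std::sqrt(var + eps);
        for (std::size_t c = 0; c < cols; ++c) row[c] = (row[c] - mean) * inv;
    }
}
```

The two sequential reductions per row are exactly what a fused GPU kernel turns into warp- or block-level parallel reductions, which is why this operation appears alongside matmuls and convolutions as a tuning target.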

Updated 2025-09-24