
Alibaba AI Business - AI Infra R&D Expert - Hangzhou

Experienced hire | Full-time | 3+ years of experience | Technical - Development | Location: Hangzhou | Status: Hiring

Requirements


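1. Proficient in programming in languages including but not limited to C, C++, Python, Rust, and Java; familiar with Linux/Unix development environments; solid engineering implementation skills and a strong ability to debug code and resolve technical problems.
2. Familiar with computer systems, with knowledge of computer networks, operating systems, computer architecture, databases, and related areas; interested in AI Infra technologies. Familiarity with NVIDIA GPUs and the surrounding ecosystem is a plus.
3. For model inference and training: familiar with the mainstream industry inference and training frameworks, including but not limited to vllm/sglang/pytorch/megatron/deeps…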

Responsibilities


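Team introduction:
Alibaba International uses AI technology to drive the development of global digital trade and the e-commerce ecosystem. AI Business is the AI department within Alibaba International that combines large-model research with the development of cutting-edge intelligent products. It has developed in-house the multilingual large model Marco, enhanced for cross-border commerce, and the multimodal large model Ovis. Backed by global AI infrastructure and computing resources, it helps platforms such as AliExpress, Lazada, Alibaba.com, Trendyol, and Daraz transform the operating experience and business efficiency across the entire cross-border e-commerce chain. Building on advanced large models and engineering technology, we are creating a new generation of agent (Agent) and intelligent engine (Deep Search) products, and we remain committed to removing language barriers from global commerce and using intelligence to make cross-border trade simpler.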

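Job description:
1. Carry out AI Infra research and development, including but not limited to model inference engines, distributed training frameworks, model deployment and serving, task distribution and scheduling, elastic scaling, and high-performance computing cluster management.
2. Through AI Infra R&D, support the development, deployment, and production serving of large models such as LLMs and multimodal models; support the development and application of new AI products such as Agentic AI; safeguard customer experience; and achieve goals such as business adoption and cost reduction.
3. Work closely with algorithm, product, operations, and engineering teams to advance the development and application of AI products and technologies.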
Skill tags: C, C++, Python, Rust, Java, Linux, Unix, vLLM, SGLang
Related positions

Internship | Alibaba International 2026

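AI Business was established in April 2023 as a first-level business organization of Alibaba International Digital Commerce Group. It focuses on building AI technical capabilities and delivering AI product capabilities, aiming to reshape platform competitiveness with the most advanced AI technology and bring merchants and users an outstanding e-commerce experience. As an AI pioneer in cross-border e-commerce, we firmly believe in the key role of artificial intelligence in shaping the future of e-commerce, and we remain committed to cultivating and developing AI talent. We have brought together top AI algorithm experts, AI engineers, and AI product teams from across the industry, and we sincerely invite AI talent who share our sense of mission and pursue innovation and excellence to join us in writing a new chapter for digital commerce with AI technology.
1. Design and implement AI computing frameworks, including parallel computing, memory-access optimization, quantization, task partitioning and scheduling, and pipelining, to support efficient computation for LLMs, generative CV models, multimodal models, and similar workloads.
2. Implement pooled management of large-scale high-performance computing clusters, including unified task distribution and scheduling, dynamic resource scheduling, and integrated offline and online workloads, to achieve efficient use of computing resources.
3. Through AI Infra R&D, safeguard customer experience and achieve business adoption at low cost.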

Updated 2025-04-15 | Hangzhou
Experienced hire | PRODEV-S

Responsibilities:
1. Collaborate with the GPU sales team and the SCE AIML TPM team to provide technical support for customers at both the pre-sales and after-sales stages.
2. Take ownership of problems and work to identify solutions.
3. Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
4. Collaborate with customers' scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
5. Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
6. Optimize infrastructure performance by tuning parameters, optimizing resource utilization, and implementing caching and data pre-processing techniques.
7. Troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime.
8. Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
9. Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.

Qualifications:
1. Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes.
2. Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
3. Solid understanding of networking concepts, security principles, and best practices.
4. Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
5. Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
6. Strong documentation skills, with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps to facilitate knowledge sharing, ensure maintainability, and enhance team collaboration.
7. Strong Linux skills, with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization.

Updated 2025-12-09 | Shenzhen
Experienced hire | PRODEV-S

Responsibilities and qualifications are identical to the first PRODEV-S listing above.

Updated 2025-12-02
Experienced hire | PRODEV-S

Responsibilities and qualifications are identical to the first PRODEV-S listing above.

Updated 2025-12-05