Alibaba AI Data Engineer
Internship/Part-time | Alibaba Class of 2027 Interns | Location: Beijing, Hangzhou | Status: Hiring
Qualifications
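1. Basic requirements
● Master's or PhD candidates in computer science, software engineering, mathematics, statistics, artificial intelligence, big data, robotics, or a related field are preferred (candidates from other fields with relevant experience are also welcome).
● Top-conference publications, high-impact projects, or open-source contributions are a plus.
2. Professional skills
● Big data processing: a deep understanding of the principles of large-scale distributed data processing systems and familiarity with open-source stacks such as Spark, Flink, and Ray; a deep understanding of stream and batch processing principles (computation models, scheduling and resource management, fault tolerance and consistency, etc.); the ability to independently develop and optimize unified batch-stream data processing for omni-modal data (structured data, text, images, audio, video).
● Understanding and command of large-model technology: a deep understanding of the core principles of large models, including key techniques such as the Transformer architecture, in-context learning (ICL), instruction tuning, retrieval-augmented generation (RAG), and reasoning mechanisms (such as chain-of-thought, CoT); familiarity with the data requirements and optimization logic of the pretraining, supervised fine-tuning (SFT), and reinforcement-learning alignment (RLHF/RLAIF) stages. The ability to design high-quality data processing and synthesis algorithms for domain scenarios and, through a systematic closed loop of data iteration, evaluation feedback, and model fine-tuning, to continuously drive large models' capabilities in specific domains…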
Responsibilities
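Working in a data-driven, evaluation-driven way, build an efficient closed loop for data iteration; establish an end-to-end data system spanning data sourcing, annotation, processing, synthesis, and evaluation; continuously build high-quality datasets and evaluation sets; and keep improving foundation model capabilities to advance AI models and applications.
Specific responsibilities include one or more of the following areas:
1. Omni-modal data processing:
● Contribute to the development of an omni-modal data processing engine for trillion-scale data.
● Design high-performance, reusable data processing operators and build an automated data production pipeline that covers the full data lifecycle.
● Resolve compute bottlenecks in cleaning, de-identifying, and augmenting massive datasets, and use intelligent filtering and precise alignment algorithms to deliver highly competitive, high-quality training sets.
● Continuously optimize end-to-end delivery efficiency so that data quality and processing scale remain world-leading.
2. Large-model data understanding and data asset system:
● Contribute to building omni-modal AI data infrastructure. Design the multimodal semantic labeling standards and feature mapping system that support the evolution of AGI; build advanced quality-measurement models and content-understanding frameworks to automatically refine massive volumes of complex data such as 3D, video, and audio, so that a fine-grained data understanding system makes AGI development more rigorous and more efficient.
● Build a core system of strategic AI data assets. Combining vertical business scenarios with cutting-edge algorithms, take a deep part in parsing, mining, and performance-optimizing massive data, driving in-depth analysis and value discovery across EB-scale omni-modal data; through end-to-end intelligent processing and mining optimization, turn massive data into AI data assets that are highly scarce and form a competitive moat in the industry.
3. Domain-specific end-to-end data strategy:
● Design and implement an end-to-end data system for optimizing model performance in specific large-model subdomains, covering evaluation system design, data processing and data synthesis pipelines, and annotation strategy design.
● Develop a deep understanding of the technical points in large-model subdomains and practice "Evaluation-Driven Development" (EDD) for large-model iteration, ensuring that foundation models such as Qwen (千问) and Wan (万相) remain world-leading.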
Related English-language materials
Big Data
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=H4bf_uuMC-g
With all this talk of Big Data, we got Rebecca Tickle to explain just what makes data into Big Data.
Spark
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
Ray
https://github.com/ray-project/ray
Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://www.youtube.com/watch?v=FhXfEXUUQp0
In this video, I'll teach you everything you need to know about Apache Ray!
https://www.youtube.com/watch?v=fMiAyj2kgac
Using powerful machine learning algorithms is easy using Ray.io and Python.
https://www.youtube.com/watch?v=q_aTbb7XeL4
Parallel and Distributed computing sounds scary until you try this fantastic Python library.
Large Models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
Transformer
https://huggingface.co/learn/llm-course/en/chapter1/4
https://poloclub.github.io/transformer-explainer/
An interactive visualization tool showing you how transformer models work in large language models (LLM) like GPT.
https://www.youtube.com/watch?v=wjZofJX0v4M
Breaking down how Large Language Models work, visualizing how data flows through.
RAG
https://www.youtube.com/watch?v=sVcwVQRHIc8
Learn how to implement RAG (Retrieval Augmented Generation) from scratch, straight from a LangChain software engineer.