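Alibaba Data Technology and Products Department - AI Data Engineering - Hangzhou
Experienced hire | Full-time | 2+ years of experience | Technical - Data | Location: Hangzhou | Status: Hiring
Requirements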
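1. AI + data dual-stack skills: expert in Python and familiar with SQL/Shell; understands the basic principles of LLMs, audio/video models, and multimodal models; has hands-on experience with data construction, cleaning, synthesis, or quality evaluation for large models.
2. Multimodal data skills: familiar with feature engineering, understanding/classification/recognition algorithms, or quality modeling methods for at least one of image, video, or audio; has hands-on deep learning model training experience (PyTorch/TensorFlow).
3. Solid data engineering fundamentals: familiar with mainstream big data platforms (such as Spark/Flink/MaxCompute/Hadoop); has experience with ETL, data modeling, data pipelines, or data warehouse construction.
4. Data governance awareness: understands governance concepts such as metadata, data quality, data lineage, and data standards; …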
Responsibilities
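1. Contribute to the group-level AI data engine: own the collection, cleaning, processing, governance, and asset management of multimodal data (text, audio, image, video); build a reusable, observable, and explainable EB-scale data system that supplies high-quality data for large model training and inference.
2. Intelligent multimodal data processing: lead automatic understanding, label taxonomy construction, semantic feature extraction, quality modeling, and automated governance for audio, video, and image data; design and train multimodal models for classification, recognition, and prediction.
3. AI-native data pipelines: build intelligent data pipelines with LLM + Agent frameworks, automating per-channel filtering, deduplication, quality diagnostics, scheduling and orchestration, and anomaly alerting to significantly reduce labor costs.
4. Data and model closed-loop iteration: design targeted datasets for the weaknesses revealed by evaluation feedback, build observable metrics during training, quantify the contribution of data to model capability gains, and update datasets dynamically to achieve a data → model → evaluation → data optimization loop.
5. Data asset governance: design and implement governance frameworks covering metadata, data lineage, classification and grading, quality scoring, data standards, and value assessment; make data assets visible and operable so that data is manageable, reusable, and able to grow.
6. Integrated algorithm and engineering collaboration: work with model teams on training data construction, data feedback, weakness mining, and the evaluation loop; improve model capability through data and become the core data driver of AI model training.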
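As a rough illustration of the pipeline stages named in item 3 (per-channel filtering, deduplication, quality diagnostics), here is a minimal Python sketch. It is not the team's implementation: the `Record` type, the `run_pipeline` function, the exact-hash deduplication, and the character-ratio quality heuristic are all hypothetical stand-ins chosen to keep the example small.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical record shape: a source channel plus raw text.
    channel: str
    text: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def quality_score(text: str) -> float:
    """Toy heuristic (an assumption): share of alphanumeric or whitespace characters."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def run_pipeline(records, allowed_channels, min_score=0.8):
    """Filter by channel, drop exact duplicates, then drop low-quality records."""
    seen = set()
    kept = []
    stats = {"filtered_channel": 0, "duplicate": 0, "low_quality": 0}
    for rec in records:
        if rec.channel not in allowed_channels:
            stats["filtered_channel"] += 1
            continue
        digest = hashlib.sha256(normalize(rec.text).encode("utf-8")).hexdigest()
        if digest in seen:
            stats["duplicate"] += 1
            continue
        seen.add(digest)
        if quality_score(rec.text) < min_score:
            stats["low_quality"] += 1
            continue
        kept.append(rec)
    return kept, stats


records = [
    Record("web", "Hello world"),
    Record("web", "hello   WORLD"),   # duplicate after normalization
    Record("spam", "buy now"),        # channel not allowed
    Record("web", "@@@@ #### !!!!"),  # mostly symbols, low quality
]
kept, stats = run_pipeline(records, allowed_channels={"web"})
print(len(kept), stats)
```

The per-stage counters in `stats` are the kind of signal that could feed the anomaly alerting the posting mentions. A production system at the scale described would more likely use near-duplicate detection and learned quality models on a distributed engine than the exact hashing shown here.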
Includes English-language materials
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for complete beginners, with full examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
SQL
https://liaoxuefeng.com/books/sql/introduction/index.html
What is SQL? Put simply, SQL is the standard computer language for accessing and processing relational databases.
https://sqlbolt.com/
Learn SQL with simple, interactive exercises.
https://www.youtube.com/watch?v=p3qvj9hO_Bo
In this video we will cover everything you need to know about SQL in only 60 minutes.
Bash
[English] The Bash Guide
https://guide.bash.academy/
A quality-driven guide through the shell's many features.
https://www.youtube.com/watch?v=tK9Oc6AEnR4
Understanding how to use bash scripting will enhance your productivity by automating tasks, streamlining processes, and making your workflow more efficient.
Large models (LLMs)
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
Feature engineering
https://www.ibm.com/think/topics/feature-engineering
Feature engineering preprocesses raw data into a machine-readable format. It optimizes ML model performance by transforming and selecting relevant features.
https://www.kaggle.com/learn/feature-engineering
Better features make better models. Discover how to get the most out of your data.
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
Deep learning
https://d2l.ai/
Interactive deep learning book with code, math, and discussions.
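PyTorch
https://datawhalechina.github.io/thorough-pytorch/
PyTorch is an important tool for data science research with deep learning. It offers considerable advantages in flexibility, readability, and performance, and in recent years has become the framework most commonly used in academia to implement deep learning algorithms.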
https://www.youtube.com/watch?v=V_xro1bcAuA
Learn PyTorch for deep learning in this comprehensive course for beginners. PyTorch is a machine learning framework written in Python.
TensorFlow
https://www.youtube.com/watch?v=tpCFfeUEGs8
Ready to learn the fundamentals of TensorFlow and deep learning with Python? Well, you’ve come to the right place.
https://www.youtube.com/watch?v=ZUKz4125WNI
This part continues right where part one left off so get that Google Colab window open and get ready to write plenty more TensorFlow code.
Big data
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=H4bf_uuMC-g
With all this talk of Big Data, we got Rebecca Tickle to explain just what makes data into Big Data.
Spark
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.