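Alibaba Data Technology and Products Department - Large Model Data Processing Engineer - Audio
Experienced hire | Full-time | 3+ years | Technical - Data | Location: Hangzhou | Status: Hiring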
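Requirements
1. At least 3 years of R&D experience in speech, audio, or natural language processing;
2. Proficient in Python and SQL, with hands-on experience in data processing frameworks (such as Spark / Ray / Dask); understands the training pipeline of speech models and can distinguish how data requirements differ across the pre-training / SFT / RLHF stages;
3. Skilled in experimental design and statistical analysis (A/B testing, effect attribution, confidence intervals), with sharp intuition for data distributions, quality issues, and bias risks;
4. Sustained curiosity about the frontier of the field, and adept at using …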
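Responsibilities
1. For Alibaba Group's speech large models and related business scenarios, take part in the data side of the "evaluate → data → train → re-evaluate" loop, establishing a mechanism that converts evaluation signals into data specifications and covers the data needs surfaced by component-level, system-level, and product-capability evaluations;
2. Run data recipe experiments, using systematic experiments to study how data scale, mixture ratios, and quality affect model behavior, a research question the industry has not yet answered systematically;
3. Build a data quality measurement system that upgrades the evaluation standard from "delivery volume and annotation accuracy" to "the quantifiable contribution of data to model performance," and take part in building and validating data quality standards;
4. Build end-to-end data infrastructure: automation of collection, cleaning, annotation, quality control, and version management (immutability principle and lineage tracking), plus research on the optimal mix of AI pre-annotation and human correction;
5. Work with the algorithm and model training teams to answer, based on experimental evidence, "what data should be used in the next round."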
Includes English-language materials
NLP+
https://www.youtube.com/watch?v=fNxaJsNG3-s&list=PLQY2H8rRoyvzDbLUZkbudP-MFQZwNmU4S
Welcome to Zero to Hero for Natural Language Processing using TensorFlow!
https://www.youtube.com/watch?v=R-AG4-qZs1A&list=PLeo1K3hjS3uuvuAXhYjV2lMEShq2UYSwX
Natural Language Processing tutorial for beginners series in Python.
https://www.youtube.com/watch?v=rmVRLeJRkl4&list=PLoROMvodv4rMFqRtEuo6SGjY4XbRIVRd4
The foundations of the effective modern methods for deep learning applied to NLP.
Python+
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for complete beginners, with full examples, based on the latest Python 3 version.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
SQL+
https://liaoxuefeng.com/books/sql/introduction/index.html
What is SQL? Simply put, SQL is the standard computer language for accessing and processing relational databases.
https://sqlbolt.com/
Learn SQL with simple, interactive exercises.
https://www.youtube.com/watch?v=p3qvj9hO_Bo
In this video we will cover everything you need to know about SQL in only 60 minutes.
Spark+
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Ray+
https://github.com/ray-project/ray
Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://www.youtube.com/watch?v=FhXfEXUUQp0
In this video, I'll teach you everything you need to know about Apache Ray!
https://www.youtube.com/watch?v=fMiAyj2kgac
Using powerful machine learning algorithms is easy using Ray.io and Python.
https://www.youtube.com/watch?v=q_aTbb7XeL4
Parallel and Distributed computing sounds scary until you try this fantastic Python library.
Dask+
https://tutorial.dask.org/00_overview.html
Dask is a parallel and distributed computing library that scales the existing Python and PyData ecosystem.
https://www.youtube.com/watch?v=jstCmSD_LAs
In this video, you will learn how to use Dask, a Python module that enables pandas code to run in parallel on your local machine or scaled out to multiple machines.