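Tongyi | Tongyi Lab - Technical Expert - Large Model Data
Experienced hire | Full-time | 3+ years | Technical - Development | Location: Beijing, Hangzhou | Status: Hiring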
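Requirements
1. Master's degree or above in computer science, artificial intelligence, or a related field; the requirement may be relaxed for outstanding candidates.
2. 3+ years of experience in data processing or model training; proficient in processing unstructured data such as text and multimodal data; expert in data cleaning, feature extraction, and data augmentation; able to solve the range of problems that arise in data work.
3. Expert in at least one programming language such as Python or Java; familiar with common data processing, text processing, and image processing libraries; able to efficiently implement data cleaning and processing algorithms and pipelines.
4. Extensive data lake development experience (Hudi, Iceberg, Hive, etc.), with deep hands-on understanding of data computing frameworks (Spark, Flink, Hadoop, Ray).
5. Strong analytical and problem-solving skills; willing to take on and solve complex problems.
6. Good teamwork and communication skills; able to coordinate resources inside and outside the team to move projects forward.
7. Preference for candidates who have led data platform construction for large models or for offline/online scenarios, or who have experience building platforms for massive image and video data or developing open-source big data frameworks.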
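Responsibilities
1. Own the evolution and rollout of the AI platform's big data architecture: based on the needs of deploying large models across different domains and scenarios, work closely with algorithm teams and IT infrastructure teams, advise on data scale, data types, and data structures for large model training and optimization, and ensure the architecture is implemented effectively.
2. Build the large model data platform: support storage and preprocessing (deduplication, similarity computation, data masking, etc.) of large model data; keep it highly scalable across large model scenarios, data types, and data volumes so that datasets can iterate continuously; build up high-quality datasets; and ensure data security and privacy protection.
3. Work closely with algorithm teams, abstract their R&D needs, and turn them into convenient, practical platform capabilities that improve the whole team's productivity and data processing capability.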
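The preprocessing steps named in responsibility 2 (deduplication, similarity computation, data masking) can be illustrated with a small Python sketch. This is a single-machine illustration under stated assumptions, not the platform's actual implementation: the function names, the shingle size, the placeholder tokens, and the regex patterns (including the mainland China mobile number pattern) are all hypothetical choices made for the example.

```python
import hashlib
import re

# Illustrative patterns only (assumptions); real masking rules are usually broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mobile number shape


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, keyed by a SHA-256 digest
    of its normalized text."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-character substrings of the normalized text."""
    s = normalize(text)
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

At the data volumes the posting describes, pairwise comparison like this would not scale; the same idea is typically approximated with MinHash or SimHash and run on a distributed engine such as Spark or Ray.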
Includes English-language materials
Python
https://liaoxuefeng.com/books/python/introduction/index.html
In Chinese, free, for complete beginners, with full examples, based on the latest Python 3 release.
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Java
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Image processing
https://opencv.org/blog/computer-vision-and-image-processing/
This fascinating journey involves two key fields: Computer Vision and Image Processing.
https://www.geeksforgeeks.org/python/image-processing-in-python/
Image processing involves analyzing and modifying digital images using computer algorithms.
https://www.youtube.com/watch?v=kSqxn6zGE0c
In this Introduction to Image Processing with Python, kaggle grandmaster Rob Mulla shows how to work with image data in python!
Algorithms
https://roadmap.sh/datastructures-and-algorithms
Step by step guide to learn Data Structures and Algorithms in 2025
https://www.hellointerview.com/learn/code
A visual guide to the most important patterns and approaches for the coding interview.
https://www.w3schools.com/dsa/
Hudi
[English] Spark Quick Start
https://hudi.apache.org/docs/quick-start-guide
We will walk through code snippets that allow you to insert, update, delete, and query a Hudi table.
https://www.oreilly.com/library/view/apache-hudi-the/9781098173821/
Overcome challenges in building transactional guarantees on rapidly changing data by using Apache Hudi.
https://www.youtube.com/watch?v=pyK18sDYnS0
In this video, I'll introduce you to one of the most popular Data Lake solutions out there, Apache Hudi!
Iceberg
https://iceberg.apache.org/spark-quickstart/
This guide will get you up and running with Apache Iceberg™ using Apache Spark™, including sample code to highlight some powerful features.
https://www.baeldung.com/apache-iceberg-intro
This tutorial will discuss Apache Iceberg, a popular open table format in today’s big data landscape.
https://www.youtube.com/watch?v=TsmhRZElPvM
You’ve probably heard about Apache Iceberg™—after all, it’s been getting a lot of buzz.
Hive
[English] Hive Tutorial
https://www.tutorialspoint.com/hive/index.htm
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
https://www.youtube.com/watch?v=D4HqQ8-Ja9Y
Spark
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
Hadoop
https://www.runoob.com/w3cnote/hadoop-tutorial.html
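Hadoop provides reliable, scalable application-layer computing and storage support for large computer clusters. It allows distributed processing of large data sets across clusters of computers using simple programming models, and it scales from a single machine to several thousand machines.
[English] Hadoop Tutorial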
https://www.tutorialspoint.com/hadoop/index.htm
Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models.
Ray
https://github.com/ray-project/ray
Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://www.youtube.com/watch?v=FhXfEXUUQp0
In this video, I'll teach you everything you need to know about Ray!
https://www.youtube.com/watch?v=fMiAyj2kgac
Using powerful machine learning algorithms is easy using Ray.io and Python.
https://www.youtube.com/watch?v=q_aTbb7XeL4
Parallel and Distributed computing sounds scary until you try this fantastic Python library.
Large models
https://www.youtube.com/watch?v=xZDB1naRUlk
You will build projects with LLMs that will enable you to create dynamic interfaces, interact with vast amounts of text data, and even empower LLMs with the capability to browse the internet for research papers.
https://www.youtube.com/watch?v=zjkBMFhNj_g
Big data
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=H4bf_uuMC-g
With all this talk of Big Data, we got Rebecca Tickle to explain just what makes data into Big Data.
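Related positions
Experienced hire | 5+ years | Technical - Development
● Participate in or lead engineering R&D for big data businesses, including productionizing algorithms, data processing, service development, SaaS platform construction, and solution delivery;
● Take a deep part in technical design and iteration, including architecture upgrades, performance optimization, code refactoring, and building monitoring systems;
Updated 2025-08-04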
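Experienced hire | 2+ years | Technical - Development
1. Help build the engineering platform for large model data processing, covering system architecture design and development of data services such as web page parsing, processing, labeling, filtering, deduplication, and quality improvement; drive the integration of business and technology; build out the platform's capabilities for unstructured data processing.
2. Help build capabilities for structuring, standardizing, analyzing, mining, and improving the quality of unstructured data such as extracted web pages, text, image-text pairs, and video; build up data assets and productivity tools; support the development of Ant's intelligent technology and ecosystem businesses.
3. Keep technical systems stable and reliable; apply suitable technologies to produce sound technical designs for complex scenarios; analyze and resolve difficult system problems in depth.
4. Have a working understanding of products in the data processing field, and be able to form forward-looking judgments and plans for the area you own.
Updated 2025-06-03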
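Experienced hire | Algorithm
1. Survey state-of-the-art approaches from industry and academia and validate them with prototypes.
2. In line with the company's business plans, explore technical approaches for applying large models to data generation, data mining, data quality, ground-truth construction, and privacy protection, and solve pain points and hard problems in the business.
3. Evaluate the R&D cost and benefit of different technical approaches and recommend which to adopt.
4. Manage risk and correct course while approaches are being put into production.
5. Guide front-line engineers through blockers that arise while new approaches are being put into production.
Updated 2025-04-02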