Qwen - Qwen Business Unit - Data Architecture Expert - Pre-training/RAG - Hangzhou
Experienced hire | Full-time | 1+ years of experience | Technical - Development | Location: Hangzhou | Status: Hiring
Qualifications
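1. Experience processing content data at large scale; familiarity with real-time and batch processing techniques for web pages, documents, video, and other data modalities; a deep understanding of key steps such as data cleaning, deduplication, structuring, feature construction, and the evaluation of quality, authority, and timeliness;
2. Familiarity with distributed data computation and storage technologies such as Ray, Spark, Flink, and Paimon; experience designing and performance-tuning large-scale data processing systems; ability to work with the AI Infra and foundational data platform teams to deliver capabilities;
3.…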
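Job Description

Business context
We are building large-model capabilities and an application stack for the medical and health domain. Medical data is highly specialized, knowledge-dense, and demands extremely high accuracy: a single incorrect fact about a drug interaction could directly affect a user's health decisions.
The team needs to build high-quality pre-training corpora and structured knowledge bases from massive heterogeneous sources, including medical literature, clinical guidelines, drug package inserts, and disease knowledge graphs. These support the authority, correctness, and practical usefulness of large models in scenarios such as medical Q&A, health consultation, and clinical decision support.
You will take part in end-to-end, data-centric work that drives content understanding, knowledge construction, data synthesis, and applications. This work supplies high-quality RAG content for the consumer-facing (toC) medical and health business of the Qwen app, while accumulating high-quality data and safeguarding the iterative improvement of model capabilities.
Your work will directly affect the model's performance ceiling in the medical vertical: data quality determines the upper bound of model capability.

Responsibilities
1. Own the architecture design, core technology R&D, and ongoing optimization of large-scale data collection and content discovery, covering web pages, documents, images, video, and other data forms, to achieve automated discovery, collection, and updating of high-quality data resources.
2. Own the construction and evolution of data infrastructure for large models, supporting the storage, governance, preprocessing, quality assessment, and version management of massive data. This includes building core capabilities such as data cleaning, deduplication, similarity computation, de-identification, and structured conversion, to ensure the high accuracy, security, and compliance of medical data.
3. Combine natural language processing (NLP), multimodal understanding, and large-model techniques to develop core technologies for information extraction, web page analysis, content clustering, and tag taxonomy construction over massive unstructured data; build high-quality training datasets and knowledge base systems to improve RAG effectiveness.
4. Work closely with the algorithm team on the needs of large-model training, fine-tuning, evaluation, and application deployment; design and optimize data scale, data structure, data quality, and data production processes to continuously improve model training results.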
Related learning resources (includes English-language materials)
Ray
https://github.com/ray-project/ray
Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://www.youtube.com/watch?v=FhXfEXUUQp0
In this video, I'll teach you everything you need to know about Apache Ray!
https://www.youtube.com/watch?v=fMiAyj2kgac
Using powerful machine learning algorithms is easy using Ray.io and Python.
https://www.youtube.com/watch?v=q_aTbb7XeL4
Parallel and Distributed computing sounds scary until you try this fantastic Python library.
Spark
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
AI agent
https://www.ibm.com/think/ai-agents
Your one-stop resource for gaining in-depth knowledge and hands-on applications of AI agents.