miHoYo LLM Pretrain Data Researcher
Campus recruitment | Full-time | Programming & Technology | Location: Shanghai | Status: Hiring
Requirements
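1. Proficiency with large-scale data processing frameworks such as Apache Spark or Ray.
2. Solid Python programming skills and familiarity with distributed computing concepts.
3. Strong attention to data quality, with the ability to analyze and handle different code and…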
Responsibilities
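1. Design and implement cleaning pipelines for code and general data across multiple sources, including GitHub repositories, code crawled from the web, and general text data.
2. Develop and iterate on LLM-based data filtering strategies to improve the quality of the pretraining corpus.
3. Develop, maintain, and optimize data pipelines, ensuring their performance and reliability at large scale.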
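As a rough illustration of the kind of pipeline these responsibilities describe, here is a minimal single-process sketch in Python. It is not taken from the posting: the names `Doc`, `normalize`, `heuristic_ok`, `clean`, and `quality_fn`, along with all thresholds, are hypothetical. `quality_fn` stands in for an LLM-based quality scorer, and a production version would presumably run the same stages on a distributed framework such as Spark or Ray.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Doc:
    source: str  # hypothetical source label, e.g. "github" or "web"
    text: str


def normalize(doc: Doc) -> Doc:
    """Unify line endings and drop trailing whitespace on every line."""
    lines = [line.rstrip() for line in doc.text.replace("\r\n", "\n").split("\n")]
    return Doc(doc.source, "\n".join(lines).strip("\n"))


def heuristic_ok(doc: Doc, min_chars: int = 20, max_line_len: int = 1000) -> bool:
    """Cheap rule-based filter; both thresholds are illustrative assumptions."""
    if len(doc.text) < min_chars:
        return False  # too short to be useful
    # Very long lines often indicate minified or generated content.
    return max(len(line) for line in doc.text.split("\n")) <= max_line_len


def clean(
    docs: Iterable[Doc],
    quality_fn: Optional[Callable[[Doc], float]] = None,
    threshold: float = 0.5,
) -> Iterator[Doc]:
    """Normalize, apply heuristics, drop exact duplicates, then score."""
    seen = set()
    for doc in map(normalize, docs):
        if not heuristic_ok(doc):
            continue
        digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        # quality_fn is a placeholder for a model-based scorer in [0, 1].
        if quality_fn is not None and quality_fn(doc) < threshold:
            continue
        yield doc


if __name__ == "__main__":
    sample = [
        Doc("github", "def add(a, b):\r\n    return a + b   \n"),
        Doc("web", "def add(a, b):\n    return a + b"),
        Doc("web", "short"),
    ]
    for kept in clean(sample):
        print(kept.source, repr(kept.text))
```

Running the cheap heuristic and deduplication stages before the model-based scorer keeps the expensive step for documents that survive the inexpensive ones, which matters once the corpus is large.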
Includes English-language materials
Apache
https://www.apache.org/
The Apache® Software Foundation (ASF) provides software for the public good, guided by community over code.
Spark
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Ray
https://github.com/ray-project/ray
Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://www.youtube.com/watch?v=FhXfEXUUQp0
In this video, I'll teach you everything you need to know about Apache Ray!
https://www.youtube.com/watch?v=fMiAyj2kgac
Using powerful machine learning algorithms is easy using Ray.io and Python.
https://www.youtube.com/watch?v=q_aTbb7XeL4
Parallel and Distributed computing sounds scary until you try this fantastic Python library.