百度大数据研发工程师(J85698)
实习兼职MEG地点:北京状态:招聘
任职要求
-本科及以上学历,计算机、数学或相关专业,3年以上大数据开发经验 -扎实的编程能力,熟悉Java/Scala/Python至少一门语言 -精通Hadoop/Spark/Flink/Kafka等大数据生态技术,有实时计算项目经验 -熟悉数据仓库建模、ETL开发及性能调优,熟悉OLAP引擎(如Hive/Presto) -具备业务敏感度,能快速理解产品需求并输出技术方案
工作职责
-负责百度APP、信息流、及百家号业务的大数据平台建设,支撑亿级用户数据PB级分析与实时计算需求 -参与数据采集、存储、计算、治理全链路研发,优化数据架构与处理性能 -搭建实时/离线数仓,支持精准推荐、用户画像、业务分析等场景 -擅长大数据技术(如Flink/Spark/Hadoop/数据湖等),推动技术落地与效率提升 -解决高并发、海量数据场景下的技术难题,保障系统稳定性和可扩展性
包括英文材料
学历+
大数据+
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=H4bf_uuMC-g
With all this talk of Big Data, we got Rebecca Tickle to explain just what makes data into Big Data.
Java+
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Scala+
Python+
https://liaoxuefeng.com/books/python/introduction/index.html
中文,免费,零起点,完整示例,基于最新的Python 3版本。
https://www.learnpython.org/
a free interactive Python tutorial for people who want to learn Python, fast.
https://www.youtube.com/watch?v=K5KVEU3aaeQ
Master Python from scratch 🚀 No fluff—just clear, practical coding skills to kickstart your journey!
https://www.youtube.com/watch?v=rfscVS0vtbw
This course will give you a full introduction into all of the core concepts in python.
Hadoop+
https://www.runoob.com/w3cnote/hadoop-tutorial.html
Hadoop 为庞大的计算机集群提供可靠的、可伸缩的应用层计算和存储支持,它允许使用简单的编程模型跨计算机群集分布式处理大型数据集,并且支持在单台计算机到几千台计算机之间进行扩展。
[英文] Hadoop Tutorial
https://www.tutorialspoint.com/hadoop/index.htm
Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers using simple programming models.
Spark+
[英文] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink+
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
Kafka+
https://developer.confluent.io/what-is-apache-kafka/
https://www.youtube.com/watch?v=CU44hKLMg7k
https://www.youtube.com/watch?v=j4bqyAMMb7o&list=PLa7VYi0yPIH0KbnJQcMv5N9iW8HkZHztH
In this Apache Kafka fundamentals course, we introduce you to the basic Apache Kafka elements and APIs, as well as the broader Kafka ecosystem.
数据仓库+
https://www.youtube.com/watch?v=9GVqKuTVANE
From Zero to Data Warehouse Hero: A Full SQL Project Walkthrough and Real Industry Experience!
https://www.youtube.com/watch?v=k4tK2ttdSDg
ETL+
https://www.ibm.com/think/topics/etl
ETL—meaning extract, transform, load—is a data integration process that combines, cleans and organizes data from multiple sources into a single, consistent data set for storage in a data warehouse, data lake or other target system.
https://www.youtube.com/watch?v=OW5OgsLpDCQ
It explains what ETL is and what it can do for you to improve your data analysis and productivity.
性能调优+
https://goperf.dev/
The Go App Optimization Guide is a series of in-depth, technical articles for developers who want to get more performance out of their Go code without relying on guesswork or cargo cult patterns.
https://web.dev/learn/performance
This course is designed for those new to web performance, a vital aspect of the user experience.
https://www.ibm.com/think/insights/application-performance-optimization
Application performance is not just a simple concern for most organizations; it’s a critical factor in their business’s success.
https://www.oreilly.com/library/view/optimizing-java/9781492039259/
Performance tuning is an experimental science, but that doesn’t mean engineers should resort to guesswork and folklore to get the job done.
OLAP+
https://www.youtube.com/watch?v=iw-5kFzIdgY
OLAP (for online analytical processing) is software for performing multidimensional analysis at high speeds on large volumes of data from a data warehouse, data mart, or some other unified, centralized data store.
Hive+
[英文] Hive Tutorial
https://www.tutorialspoint.com/hive/index.htm
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
https://www.youtube.com/watch?v=D4HqQ8-Ja9Y
Presto+
[英文] What is Presto?
https://prestodb.io/what-is-presto/
https://www.tutorialspoint.com/apache_presto/index.htm
相关职位
社招MEG
-负责用户增长相关的数据仓库建设 -参与用户增长离线特征数据(使用Hadoop/Spark)和实时数据流(使用Flink)的建设 -负责数据指标、常用数据报表的建设和分析 -参与用户增长计算治理、数据治理平台化建设和数据通路性能分析和优化
更新于 2025-06-05
社招A140437
1、广告各类在线业务的离线数据加工与在线数据服务开发与维护; 2、数据服务接口及产品需求研发迭代,代码review、bug修复及日常服务运维; 3、针对海量数据处理和查询需求,设计适应业务变化的合理的多维数据分析系统架构,满足多样性的需求; 4、海量日志清洗加工,并抽象出可以多业务复用的数据模型。
更新于 2023-10-20
社招3年以上J6NQP
1、负责抖音/抖音火山版等多个业务线的策略算法建设与优化工作; 2、通过海量数据,分析与挖掘各种潜在关联,不断优化策略效果,保障用户体验; 3、负责实时及离线特征抽取、融合,为数据挖掘及策略平台提供特征服务; 4、负责大数据能力在产品功能上的落地,推动产品数据化和智能化。
更新于 2021-01-19