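Tencent Intelligent Lakehouse R&D Engineer (Shanghai/Shenzhen)
Experienced hire | Full-time | 3+ years of experience | Public Technology | Location: Shanghai | Status: Hiring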
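Requirements
1. Solid Java / Scala programming fundamentals and a strong computer science foundation, along with good communication and teamwork skills;
2. Familiarity with the principles and source code of the open-source data lake storage solutions Hudi, Iceberg, and Delta Lake; kernel (core engine) development experience or community contributions preferred; open-source committers / PMC members preferred;
3. Familiarity with the Parquet, ORC, and Arrow file formats, or with row-oriented formats such as Avro and Protobuf, preferred;
4. Familiarity with mainstream big data compute engines such as Spark, Flink, Presto, and Hive preferred.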
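Responsibilities
1. Own the deep optimization of the lakehouse storage system kernel; design and implement an asynchronous intelligent lakehouse optimization module to improve data write/query performance and resource utilization;
2. Ecosystem integration and compute convergence: deeply integrate compute engines such as Spark, Flink, and SR so that the lakehouse connects smoothly with unified stream-batch scenarios, supporting coordination between real-time data warehousing and offline analytics;
3. Open-source collaboration and technical influence: contribute to open-source projects such as Iceberg and Lance, lead custom feature development, and drive documentation improvements and community ecosystem building.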
Includes English-language materials
Java+
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Scala+
Hudi+
[English] Spark Quick Start
https://hudi.apache.org/docs/quick-start-guide
We will walk through code snippets that allow you to insert, update, delete and query a Hudi table.
https://www.oreilly.com/library/view/apache-hudi-the/9781098173821/
Overcome challenges in building transactional guarantees on rapidly changing data by using Apache Hudi.
https://www.youtube.com/watch?v=pyK18sDYnS0
In this video, I'll introduce you to one of the most popular Data Lake solutions out there, Apache Hudi!
Iceberg+
https://iceberg.apache.org/spark-quickstart/
This guide will get you up and running with Apache Iceberg™ using Apache Spark™, including sample code to highlight some powerful features.
https://www.baeldung.com/apache-iceberg-intro
This tutorial will discuss Apache Iceberg, a popular open table format in today’s big data landscape.
https://www.youtube.com/watch?v=TsmhRZElPvM
You’ve probably heard about Apache Iceberg™—after all, it’s been getting a lot of buzz.
Delta Lake+
https://delta.io/learn/getting-started/
This guide helps you quickly explore the main features of Delta Lake.
[English] Delta Lake Tutorials
https://delta.io/learn/tutorials/
Try out the latest tutorials for the open-source Delta Lake project.
[English] Tutorial: Delta Lake
https://docs.databricks.com/aws/en/delta/tutorial
This tutorial introduces common Delta Lake operations on Databricks.
https://www.youtube.com/watch?v=fkWxiesfrgk
In this Delta Lake course, we will go through all the important concepts of Delta Lake.
Kernel+
https://www.youtube.com/watch?v=C43VxGZ_ugU
I rummage around the Linux kernel source and try to understand what makes computers do what they do.
https://www.youtube.com/watch?v=HNIg3TXfdX8&list=PLrGN1Qi7t67V-9uXzj4VSQCffntfvn42v
Learn how to develop your very own kernel from scratch in this programming series!
https://www.youtube.com/watch?v=JDfo2Lc7iLU
Denshi goes over a simple explanation of what computer kernels are and how they work, alongside what makes the Linux kernel special.
Parquet+
https://www.youtube.com/watch?v=KLFadWdomyI
Learn all about Apache Parquet, a column-based file format that's popular in the Hadoop/Spark ecosystem.
ProtoBuf+
https://learnxinyminutes.com/protocol-buffer-3/
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
https://protobuf.dev/getting-started/
Each tutorial in this section shows you how to implement a simple application using protocol buffers in your favorite language.
https://www.baeldung.com/google-protocol-buffer
In this article, we’ll be looking at the Google Protocol Buffer (protobuf) – a well-known language-agnostic binary data format.
Spark+
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink+
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
Presto+
[English] What is Presto?
https://prestodb.io/what-is-presto/
https://www.tutorialspoint.com/apache_presto/index.htm
Hive+
[English] Hive Tutorial
https://www.tutorialspoint.com/hive/index.htm
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
https://www.youtube.com/watch?v=D4HqQ8-Ja9Y
Big Data+
https://www.youtube.com/watch?v=bAyrObl7TYE
https://www.youtube.com/watch?v=H4bf_uuMC-g
With all this talk of Big Data, we got Rebecca Tickle to explain just what makes data into Big Data.
Related positions
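Experienced hire | TEG | Technology
1. Own the deep optimization of the lakehouse storage system kernel; design and implement an asynchronous intelligent lakehouse optimization module to improve data write/query performance and resource utilization;
2. Ecosystem integration and compute convergence: deeply integrate compute engines such as Spark, Flink, and SR so that the lakehouse connects smoothly with unified stream-batch scenarios, supporting coordination between real-time data warehousing and offline analytics;
3. Open-source collaboration and technical influence: contribute to open-source projects such as Iceberg, lead custom feature development, and drive documentation improvements and community ecosystem building.
Updated 2025-05-26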
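Experienced hire | Public Technology
1. Lead core R&D of big data compute engines, building an industry-leading compute kernel for Tencent's intelligent lakehouse and advancing the efficient growth of big data workloads;
2. Research frontier technologies in Tencent's compute kernel domain, stay in touch with open-source communities, and introduce frontier technologies based on business characteristics and needs.
Updated 2025-07-22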
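Experienced hire | 8UY51
1. Design and implement a distributed database (cloud-native architecture), building an industry-leading database system;
2. Understand the business and the cloud-native architecture; starting from real scenarios, design and implement high-concurrency, low-latency, highly fault-tolerant systems;
3. Analyze system performance bottlenecks and, through hardware-software co-design, build a highly optimized system;
4. Track frontier database technologies and find opportunities to explore and adopt new ones, including new hardware, intelligent optimizers, and lakehouse integration.
Updated 2020-05-18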