ByteDance Data Lake Senior Engineer / Technical Expert
Experienced hire | Full-time | K4338 | Location: Shanghai | Status: Hiring
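Requirements
1. Solid Java / Scala programming fundamentals and a solid grounding in computer science.
2. Good communication and teamwork skills.
3. Familiarity with the principles and source code of the open-source data lake storage solutions Hudi, Iceberg, and Delta Lake. Core (kernel) development experience or community contributions are preferred, as is open-source community committer / PMC status.
4. Familiarity with distributed storage systems such as Kudu, HBase, and Cassandra, or with the principles of mainstream big data systems such as Spark, Flink, Presto, Doris, Hive, and Impala, is preferred.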
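Responsibilities
The Data Engine - Data Lake team aims to build an industry-leading, EB-scale data lake that supports many of ByteDance's core business lines, such as Douyin, Toutiao, and e-commerce. Building on internal best practices, the team is also developing a cloud-native, real-time lakehouse product for enterprise (toB) customers on Volcano Engine: LAS (LakeHouse Analytics Service).
1. Build an industry-leading, EB-scale data lake based on Hudi that supports many ByteDance business lines (such as Douyin, Toutiao, and e-commerce).
2. Design and develop a real-time data lake storage system that unifies stream and batch processing, and optimize its core as far as possible.
3. Work closely with the open-source community and keep building open-source influence, with the opportunity to grow into a Hudi Committer / PMC member.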
Includes English-language materials
Java+
https://www.youtube.com/watch?v=eIrMbAQSU34
Master Java – a must-have language for software development, Android apps, and more! ☕️ This beginner-friendly course takes you from basics to real coding skills.
Scala+
Kernel+
https://www.youtube.com/watch?v=C43VxGZ_ugU
I rummage around the Linux kernel source and try to understand what makes computers do what they do.
https://www.youtube.com/watch?v=HNIg3TXfdX8&list=PLrGN1Qi7t67V-9uXzj4VSQCffntfvn42v
Learn how to develop your very own kernel from scratch in this programming series!
https://www.youtube.com/watch?v=JDfo2Lc7iLU
Denshi goes over a simple explanation of what computer kernels are and how they work, alongside what makes the Linux kernel special.
HBase+
[English] HBase Tutorial
https://www.tutorialspoint.com/hbase/index.htm
HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. This tutorial provides an introduction to HBase, the procedures to set up HBase on Hadoop File Systems, and ways to interact with the HBase shell.
Cassandra+
[English] Learn Cassandra
https://teddyma.gitbooks.io/learncassandra/content/index.html
This book guides developers step by step through what Cassandra is, how Cassandra works, and how to use the features and capabilities of Apache Cassandra 2.0.
https://www.freecodecamp.org/news/the-apache-cassandra-beginner-tutorial/
In this tutorial I will introduce you to Apache Cassandra, a distributed, horizontally scalable, open-source database.
https://www.youtube.com/watch?v=J-cSy5MeMOA
Apache Cassandra is an open source NoSQL distributed database.
Spark+
[English] Learning Spark Book
https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf
This new edition has been updated to reflect Apache Spark’s evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.
Flink+
https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/learn-flink/overview/
This training presents an introduction to Apache Flink that includes just enough to get you started writing scalable streaming ETL, analytics, and event-driven applications, while leaving out a lot of (ultimately important) details.
https://www.youtube.com/watch?v=WajYe9iA2Uk&list=PLa7VYi0yPIH2GTo3vRtX8w9tgNTTyYSux
Today’s businesses are increasingly software-defined, and their business processes are being automated. Whether it’s orders and shipments, or downloads and clicks, business events can always be streamed. Flink can be used to manipulate, process, and react to these streaming events as they occur.
Presto+
[English] What is Presto?
https://prestodb.io/what-is-presto/
https://www.tutorialspoint.com/apache_presto/index.htm
Doris+
https://doris.apache.org/docs/gettingStarted/what-is-apache-doris
Hive+
[English] Hive Tutorial
https://www.tutorialspoint.com/hive/index.htm
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
https://www.youtube.com/watch?v=D4HqQ8-Ja9Y
Impala+
[English] Impala Tutorials
https://impala.apache.org/docs/build/html/topics/impala_tutorial.html
This section includes tutorial scenarios that demonstrate how to begin using Impala.
Related positions
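Experienced hire | 5+ years | Technical team | AI &
Position overview: As a data development expert, you will design, develop, and maintain data warehouses, data lakes, and data pipelines, ensuring that data is accurate, complete, and accessible. You will work closely with data scientists, analysts, and business teams to provide data support and drive data-driven decisions and innovation.
Design and implement efficient data models that support complex query and analysis needs.
Develop and maintain data integration and ETL (extract, transform, load) processes.
Optimize data storage solutions and ensure data security and compliance.
Work with cross-functional teams to understand business requirements and deliver tailored data solutions.
Monitor data quality to ensure accuracy and consistency.
Track and evaluate emerging data technologies and tools, and drive technical innovation.
Write technical documentation and provide guidance and training to team members.
Manage data project schedules and budgets to ensure high-quality results are delivered on time.
Updated 2024-10-28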
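Experienced hire | 5+ years | Cloud Intelligence Group
1. Serve as the primary owner of technical service for enterprise customers: gain a deep understanding of their business scenarios, work closely with their architecture, development, and operations teams, review and analyze their existing cloud products and application architecture, and design stability optimization plans for their cloud migration and cloud-based business. Help customers continuously improve cloud stability through cloud monitoring, proactive issue detection, disaster drills, fast-recovery and degradation plans, and high-availability architecture changes.
2. Collaborate fully with Alibaba Cloud teams to handle incidents, provide escort support for key events, and manage risk from the perspective of the customer's architecture; distill and publish best practices and tooling; and proactively deliver specialized advanced services that target customer pain points.
3. Track customers' critical stability issues, keep helping customers resolve them, and continuously drive improvements to Alibaba Cloud products and services.
Updated 2025-09-28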
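Experienced hire | X9WV
1. Design and implement sound stream computing systems for large-scale recommendation systems.
2. Design and implement flexible, scalable, stable, high-performance storage systems and computing models.
3. Troubleshoot production systems, and design and implement the mechanisms and tools needed to keep them running stably.
4. Build industry-leading distributed systems such as stream computing frameworks, providing reliable infrastructure for massive data volumes and large-scale business systems.
Updated 2021-12-31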