ByteDance | Structured Data Fusion Large Model Researcher, Risk Control, 筋斗云人才计划
Requirements
1. Holds, or is currently pursuing, a doctoral degree in computer science, cybersecurity, artificial intelligence, or a related field.
2. Excellent coding skills and a solid foundation in data structures and algorithms; proficiency in Python is required, and familiarity with PyTorch or TensorFlow (TF) is preferred.
3. Outstanding ability to define, analyze, and solve problems; candidates with publications in CCF-A category journals or top c…
Responsibilities
Team Introduction: The Risk Control R&D Team addresses the challenges posed by malicious activity (black- and gray-market abuse) across ByteDance's products, including Douyin and Toutiao. Its work spans several areas of risk governance: content, transactions, traffic, and accounts. Using machine learning, multimodal models, and large models, the team analyzes user behavior and content to identify potential risks and issues. By continuously deepening its understanding of the business and of user behavior, the team innovates in models and algorithms, with the aim of building an industry-leading risk control algorithm system.

Project Objectives: Using risk control data as the basis, improve large models' ability to understand and reason about structured data (sequential data and graph data).

Project Necessity: Data in risk control scenarios is primarily structured, whereas the recent gains in large models have come mainly in understanding text and images. Combining large models with the non-text, non-image structured data of risk control scenarios, so that the models understand structured data better, remains an industry-wide challenge. It involves three key difficulties:
1. How to effectively align structured information with the NLP semantic space, so that a model understands both data structure and semantic information.
2. How to use appropriate instructions so that large models can interpret the structural information in structured data.
3. How to give large language models step-by-step reasoning capabilities for downstream graph learning tasks, so that they can progressively infer more complex relationships and attributes.

Project Content: Current industry work on structured data includes:
1. Graph data understanding (e.g., GraphGPT: enabling large models to read graph data, SIGIR'2024).
2. Graph data RAG (e.g., Microsoft GraphRAG: Unlocking LLM discovery on narrative private data).
3. Sequential data understanding (e.g., StructGPT: a large model reasoning framework for structured data, EMNLP-2023).

Most of this work targets a single type of structured data, and several problems remain open in risk control scenarios:
1. How to fuse and jointly understand different types of structured data, especially graph data combined with sequential data.
2. How to address the difficulties listed under "Project Necessity".
3. How to reason on downstream tasks: research here is limited, and research on reasoning over sequential data is especially scarce.

Research Directions:
1. Structured data understanding with large models
2. Structured data RAG with large models
3. Chain-of-thought reasoning with large models
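As a rough illustration of the second difficulty (using instructions to expose structural information to a large model), the sketch below serializes a small graph and per-node event sequences into a single text prompt. It is only a naive baseline under stated assumptions: the function name `build_prompt`, the prompt layout, and the toy data are hypothetical and do not come from the posting or from GraphGPT, GraphRAG, or StructGPT.

```python
from typing import Dict, List, Tuple


def build_prompt(
    edges: List[Tuple[str, str, str]],
    events: Dict[str, List[str]],
    question: str,
) -> str:
    """Serialize graph edges and per-node event sequences into one text prompt.

    Hypothetical layout: one line per edge, one line per node sequence,
    followed by the question and a step-by-step instruction.
    """
    lines = ["Graph edges (source -[relation]-> target):"]
    for src, rel, dst in edges:
        lines.append(f"- {src} -[{rel}]-> {dst}")
    lines.append("Event sequences (oldest first):")
    for node in sorted(events):  # sorted for a deterministic prompt
        lines.append(f"- {node}: " + " > ".join(events[node]))
    lines.append(f"Question: {question}")
    lines.append("Answer step by step.")
    return "\n".join(lines)


# Toy data invented for illustration; not from any real risk control system.
prompt = build_prompt(
    edges=[("account_A", "shares_device", "account_B")],
    events={"account_A": ["login", "change_password", "transfer"]},
    question="Is account_A likely compromised?",
)
print(prompt)
```

Plain-text serialization like this keeps the structure readable to any instruction-tuned model but scales poorly to large graphs, which is one reason the alignment question in the first difficulty matters.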
Team Introduction: Global E-commerce is the e-commerce business built on TikTok (also known as TikTok Shop). It aims to be the first-choice platform where users discover and buy quality products at good prices. Across live-stream commerce, video content commerce, shelf-based commerce, and other scenarios, Global E-commerce seeks to give users a more personalized, proactive, and efficient shopping experience and to give merchants stable, reliable platform services, in pursuit of its mission of bringing novel, quality products to the world and putting a better life within reach.

The Data-E-commerce team is the core algorithm and technology force within Global E-commerce. It focuses on algorithm innovation in e-commerce: helping users efficiently discover products they care about, keeping shopping safe, and making every stage of a transaction more intelligent. Here you will work alongside first-rate product and engineering teams, tackle technical and business challenges together, and drive the deep adoption of technology in e-commerce scenarios.

Topic Content: The global e-commerce ecosystem has accumulated massive heterogeneous data, including user behavior, product images and text, multimedia content, and time series of product sales and logistics. Intelligent systems must operate across increasingly complex and dynamic business environments, yet traditional models still show clear bottlenecks in long-horizon forecasting, cross-modal understanding, and complex decision reasoning. This project proposes to jointly build, on the basis of large models, a foundation model purpose-built for global e-commerce. The model will integrate key business dimensions (users, products, content, logistics, and inventory) into a unified representation. On top of it, the project will design a pluggable Agent framework that integrates task planning, tool use, multi-turn interaction, and environment awareness, enabling end-to-end intelligent decision-making in demand forecasting, traffic distribution, and personalized recommendation.

Topic Challenges:
1. Heterogeneous fusion and alignment: jointly model user behavior sequences, product sales time-series signals, and multimodal product content, achieving deep semantic alignment between high-dimensional time series and image-text representations.
2. Synergy between recommendation LLMs and world models: recast recommendation as the problem of generating a user's recommendation list, and build end-to-end recommendation models with large-model techniques.
3. Tokenizer for recommendation items: how to encode hundreds of millions of items by multimodal and feature semantics so that they support downstream training and generation tasks; pretraining on tens of terabytes of user-behavior tokens; lifting the scaling-law curve through model architecture and training methods; recasting the various recommendation tasks as post-training tasks; modeling recommendation in the spirit of RLVR to maximize GMV and experience value; training and inference optimization; and building a high-performance recommendation service customized on large-model inference suites such as SGLang.
4. Multimodal large models for e-commerce: build a multilingual, multimodal large model for the e-commerce domain that reaches SOTA performance in core e-commerce scenarios, and use it as the base for an e-commerce agent foundation that supports Agent applications across e-commerce scenarios.
5. Agent evaluation, safety, and compliance: build Agent evaluation metrics and benchmarks that reflect real business conditions, ensuring stability, safety, and compliance in strongly constrained, strongly adversarial environments.

Topic Value:
1. Technical value: build a general-purpose multimodal foundation that enables power-law scaling through iterative advances in models, data, and compute, strengthening the technical base for scale.
2. Business value: establish a global e-commerce foundation model that drives GMV growth and user retention through generative recommendation, time-series large models, and agent-based systems, creating a high-leverage revenue engine.
• Evaluate perception-fusion algorithms and KPIs across multiple OEM carlines for both driving and parking functions, including near-field scene understanding for parking and full-field scene understanding for active safety and driving.
• Triage and diagnose perception-fusion issues, identifying root causes behind KPI variations across carlines, regions, ODDs, and operating conditions.
• Propose and prototype innovative perception-fusion solutions to meet new sensor configurations and platform requirements.
• Collaborate cross-functionally with perception, planning, controls, systems, and platform teams to drive the development, optimization, and evolution of perception-fusion algorithms.
• Design and implement end-to-end data pipelines (ETL) to ensure efficient data collection, cleansing, transformation, and storage, supporting both real-time and offline analytics needs.
• Develop automated data monitoring tools and interactive dashboards that give business teams better insight into core metrics (e.g., user behavior, AI model performance).
• Collaborate with cross-functional teams (e.g., Product, Operations, Tech) to align data logic, integrate multi-source data (e.g., user behavior, transaction logs, AI outputs), and build a unified data layer.
• Establish data standardization and governance policies to ensure consistency, accuracy, and compliance.
• Provide structured data inputs for AI model training and inference (e.g., LLM applications, recommendation systems), optimizing feature engineering workflows.
• Explore innovative AI-data integration use cases (e.g., embedding AI-generated insights into BI tools).
• Provide technical guidance and best practices on data architecture that serves both traditional reporting and modern AI Agent requirements.
