ByteDance | Structured Data Fusion Large Model Researcher | Risk Control | Jindouyun (筋斗云) Talent Program

Campus recruitment | Full-time | A40464A | Location: Singapore | Status: Hiring

Qualifications


1. A doctoral degree, completed or currently being pursued, in computer science, cybersecurity, artificial intelligence, or a related field.
2. Excellent coding skills and a solid foundation in data structures and algorithms; proficiency in Python is required, and familiarity with PyTorch or TensorFlow (TF) is preferred.
3. Outstanding ability to define, analyze, and solve problems; candidates with publications in CCF-A category journals or top conferences such as AAAI, NeurIPS, SIGKDD, SIGIR, etc., are preferred.
4. Strong resilience under pressure and excellent communication and teamwork skills; passionate about technology, willing to embrace challenges with the team, and driven to innovate.

Responsibilities


Team Introduction:
The Risk Control R&D Team is dedicated to addressing various challenges posed by malicious activities across ByteDance's products including Douyin and Toutiao. Their work spans multiple domains of risk governance such as content, transactions, traffic, and accounts. By leveraging technologies such as machine learning, multimodal models, and large models, the team strives to understand user behaviors and content, thereby identifying potential risks and issues. By continuously deepening their understanding of business and user behaviors, the team drives innovation in models and algorithms with an aim to build an industry-leading risk control algorithm system.

Project Objectives:
Optimize and enhance large models' ability to understand and reason about structured data (sequential data, graph data) based on risk control data.
Project Necessity:
Data in risk control scenarios is primarily structured, whereas large models have significantly improved their understanding of text and images. Integrating the structured (non-text, non-image) data from risk control scenarios with large models, so that the models comprehend structured data better, remains an industry-wide challenge. This involves three key difficulties:

1. How to effectively align structured information with the NLP semantic space, allowing models to simultaneously understand both data structure and semantic information.
2. How to use appropriate instructions to enable large models to interpret structural information in structured data.
3. How to endow large language models with step-by-step reasoning capabilities for graph learning downstream tasks, thereby inferring more complex relationships and attributes.
Project Content:
Current industry explorations of structured data include:

1. Graph data understanding (e.g., GraphGPT: Enabling large models to read graph data, SIGIR 2024).
2. Graph data RAG (e.g., Microsoft GraphRAG: Unlocking LLM discovery on narrative private data).
3. Sequential data understanding (e.g., StructGPT: A large model reasoning framework for structured data, EMNLP 2023).

However, current efforts mainly focus on understanding single-type structured data, and several challenges remain in risk control scenarios:

1. How to effectively fuse and understand various types of structured data, especially the integration of graph and sequential data.
2. Addressing the challenges listed under "Project Necessity", particularly step-by-step reasoning for downstream tasks, which remains underexplored, especially for sequential data.

Research Directions:
1. Large model structured data understanding
2. Large model structured data RAG
3. Large model chain-of-thought reasoning
