📰 1. News
- [2026-02-02] 🖥️ DataFlow WebUI is now available! Launch the visual pipeline builder with a single command: `dataflow webui`. Build and run DataFlow pipelines through an intuitive web interface. 👉 WebUI Docs
- [2026-01-20] 🌟 Awesome Works Using DataFlow is now live! A new section showcasing open-source projects and research built on DataFlow. Contributions are welcome! 👉 Awesome Works
- [2025-12-19] 📄 Our DataFlow technical report is now available! Read and cite our work on arXiv: https://arxiv.org/abs/2512.16676
- [2025-11-20] 🤖 Introducing new Data Agents for DataFlow! Try them out and follow the tutorial on Bilibili: https://space.bilibili.com/3546929239689711/lists/6761342?type=season
- [2025-06-28] 🎉 DataFlow is officially released! Our data-centric AI system is now public. Stay tuned for future updates.
🔍 2. Overview
DataFlow is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDFs, plain text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or through RAG with knowledge-base cleaning. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.
Specifically, we are constructing diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, collectively forming the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing operators on demand.
🛠️ 3. Operators Functionality
🔧 3.1 How Operators Work
DataFlow adopts a modular operator design philosophy, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator can receive structured data input (such as in json/jsonl/csv format) and, after intelligent processing, output high-quality data results. For a detailed guide on using operators, please refer to the Operator Documentation.
The design of DataFlow operators follows a PyTorch-like style, making them easy to understand and use. The code block below shows a minimal invocation example of PromptedGenerator:
Example input data (json/jsonl-style):
```
// input.json
[
  {"problem": "What is 17 + 25?"},
  {"problem": "If x = 3, compute 2x^2 + 1."}
]
```
Operator invocation code:
```python
from dataflow.operators.core_text import PromptedGenerator
from dataflow.utils.storage import FileStorage
from dataflow.serving import APILLMServing_request

# set input file to global storage class
storage = FileStorage(first_entry_file_name="./input.json")

# configure LLM serving (e.g., OpenAI API)
# the API key needs to be set via `export DF_API_KEY=sk-xxx`
llm_serving = APILLMServing_request(
    api_url="https://api.openai.com/v1/chat/completions",
)

prompted_generator = PromptedGenerator(
    llm_serving=llm_serving,  # pre-configured LLM backend
    system_prompt="Please solve this math problem.",
)

prompted_generator.run(
    storage=storage.step(),  # data management (details omitted)
    input_key="problem",     # read from this column
    output_key="solution",   # write to this column
)
```
After running, the operator will append the generated results into output_key. For example, the output data (json/jsonl-style) becomes:
```
// dataflow_step1.json
[
  {"problem": "What is 17 + 25?", "solution": "42"},
  {"problem": "If x = 3, compute 2x^2 + 1.", "solution": "19"}
]
```
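The column-in/column-out contract behind `run(storage, input_key, output_key)` can be mimicked in plain Python. The sketch below is purely illustrative: the class and method names are hypothetical, not DataFlow's actual API, and a toy transform stands in for the LLM call.

```python
import json


class Operator:
    """Illustrative base class: read one column from each row,
    transform it, and write the result to a new column.
    (Hypothetical names, not DataFlow's actual API.)"""

    def run(self, rows, input_key, output_key):
        for row in rows:
            row[output_key] = self.process(row[input_key])
        return rows

    def process(self, value):
        raise NotImplementedError


class UppercaseOperator(Operator):
    """Toy processing step standing in for a real model/LLM call."""

    def process(self, value):
        return value.upper()


rows = [{"problem": "what is 17 + 25?"}]
result = UppercaseOperator().run(rows, input_key="problem", output_key="normalized")
print(json.dumps(result, indent=2))
```

Because every operator shares this row-wise contract, operators can be chained by feeding one step's output column into the next step's `input_key`.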
📊 3.2 Operator Classification System
In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:
| Operator Type | Quantity | Main Function |
|---|---|---|
| Generic Operators | 80+ | Covers general functions for text evaluation, processing, and synthesis |
| Domain-Specific Operators | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
| Evaluation Operators | 20+ | Comprehensively evaluates data quality across six dimensions |
🛠️ 4. Pipelines Functionality
🔧 4.1 Ready-to-Use Pipelines
The current pipelines in DataFlow are as follows:
⚙️ 4.2 Flexible Operator Pipelines
In this framework, operators are categorized into Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, supporting both data processing and evaluation. Please refer to the documentation for details.
🤖 4.3 Agent-Guided Pipelines
- DataFlow Agent: An intelligent assistant that performs data analysis, writes custom operators, and automatically orchestrates them into pipelines based on specific task objectives.
⚡ 5. Quick Start
🛠️ 5.1 Environment Setup and Installation
DataFlow supports Python >= 3.10 and has been tested on Windows, Linux, and macOS with Python 3.10, 3.11, and 3.12.
Please use the following commands for environment setup and installation 👇
We recommend using uv to install DataFlow for faster installation.
```shell
pip install uv
uv pip install open-dataflow
```
If you want to use your own GPU for local inference, please use:
```shell
pip install uv
uv pip install "open-dataflow[vllm]"
```
After installation, run `dataflow -v` to check whether DataFlow has been installed correctly.
If installed correctly, you should see:
open-dataflow codebase version: 1.0.0
Checking for updates...
Local version: 1.0.0
PyPI newest version: 1.0.0
You are using the latest version: 1.0.0.
🐳 5.1.1 Docker Installation (Alternative)
We also provide a Dockerfile for easy deployment and a pre-built Docker image for immediate use.
Option 1: Use Pre-built Docker Image
You can directly pull and use our pre-built Docker image:
```shell
# Pull the pre-built image
docker pull molyheci/dataflow:cu124

# Run the container with GPU support
docker run --gpus all -it molyheci/dataflow:cu124

# Inside the container, verify installation
dataflow -v
```
Option 2: Build from Dockerfile
Alternatively, you can build the Docker image from the provided Dockerfile:
```shell
# Clone the repository (HTTPS)
git clone https://github.com/OpenDCAI/DataFlow.git
# Or use SSH
# git clone git@github.com:OpenDCAI/DataFlow.git
cd DataFlow

# Build the Docker image
docker build -t dataflow:custom .

# Run the container
docker run --gpus all -it dataflow:custom

# Inside the container, verify installation
dataflow -v
```
Note: The Docker image includes CUDA 12.4.1 support and comes with vLLM pre-installed for GPU acceleration. Make sure you have NVIDIA Container Toolkit installed to use GPU features.
🚀 5.2 Quick Start with Google Colab
You can start your first DataFlow translation project directly on Google Colab. By following the provided guidelines, you can seamlessly scale from a simple translation example to more complex DataFlow pipelines.
👉 Start DataFlow with Google Colab
📖 5.3 Reference Project Documentation
For detailed usage instructions and getting started guide, please visit our Documentation.
🖥️ 5.4 WebUI
DataFlow provides a Web-based UI (WebUI) for visual pipeline construction and execution.
After installing the DataFlow main repository, simply run `dataflow webui`.
This will automatically download and launch the latest DataFlow-WebUI and open it in your browser
(http://localhost:<port>/ if it does not open automatically).
📖 Documentation
- Chinese: https://wcny4qa9krto.feishu.cn/wiki/F4PDw76uDiOG42k76gGc6FaBnod
- English: https://wcny4qa9krto.feishu.cn/wiki/SYELwZhh9ixcNwkNRnhcLGmWnEg
🛠️ Development Repository
🧪 6. Experimental Results
For detailed experiment settings, please see our DataFlow Technical Report.
6.1 Text Pipeline
6.1.1 Pre-training data filter pipeline
From the SlimPajama-627B corpus, we extract a 100B-token subset and apply multiple DataFlow text-pretraining filters. We then train a Qwen2.5-0.5B model from scratch for 30B tokens using the Megatron-DeepSpeed framework. The results are as follows:
| Methods | ARC-C | ARC-E | MMLU | HellaSwag | WinoGrande | Gaokao-MathQA | Avg |
|---|---|---|---|---|---|---|---|
| Random-30B | 25.26 | 43.94 | 27.03 | 37.02 | 50.99 | 27.35 | 35.26 |
| Qurating-30B | 25.00 | 43.14 | 27.50 | 37.03 | 50.67 | 26.78 | 35.02 |
| FineWeb-Edu-30B | 26.45 | 45.41 | 27.41 | 38.06 | 50.43 | 25.64 | 35.57 |
| DataFlow-30B | 25.51 | 45.58 | 27.42 | 37.58 | 50.67 | 27.35 | 35.69 |
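As a quick sanity check, the Avg column can be recomputed from the six per-benchmark scores in the table above (last-digit differences of 0.01 can arise from rounding conventions):

```python
# Recompute the Avg column of the pretraining-filter table.
# Scores are copied verbatim from the table above
# (ARC-C, ARC-E, MMLU, HellaSwag, WinoGrande, Gaokao-MathQA).
table = {
    "Random-30B":      [25.26, 43.94, 27.03, 37.02, 50.99, 27.35],
    "Qurating-30B":    [25.00, 43.14, 27.50, 37.03, 50.67, 26.78],
    "FineWeb-Edu-30B": [26.45, 45.41, 27.41, 38.06, 50.43, 25.64],
    "DataFlow-30B":    [25.51, 45.58, 27.42, 37.58, 50.67, 27.35],
}
for name, scores in table.items():
    print(f"{name}: {sum(scores) / len(scores):.2f}")
```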
6.1.2 SFT data filter and synthesis pipeline
To study small-scale SFT data quality, we fine-tune the Qwen2.5-7B base model with LLaMA-Factory on the WizardLM and Alpaca datasets. For each dataset, we compare a randomly sampled set of 5K instances against 5K instances filtered by DataFlow's SFT pipeline. Additionally, we synthesize a 15K-example dataset, DataFlow-SFT-15K, using DataFlow's Condor Generator and Condor Refiner pipeline, followed by DataFlow's SFT filtering pipeline (excluding the Instagram filter). Benchmarks include comprehensive Math, Code, and Knowledge evaluation suites.
Math Benchmarks
| Methods | math | gsm8k | aime24 | minerva | olympiad | Avg |
|---|---|---|---|---|---|---|
| Alpaca (random) | 54.9 | 77.2 | 13.3 | 14.0 | 27.0 | 37.3 |
| Alpaca (filtered) | 60.3 | 80.0 | 13.3 | 14.7 | 30.7 | 39.8 |
| WizardLM (random) | 61.1 | 84.2 | 6.7 | 18.0 | 29.3 | 39.9 |
| WizardLM (filtered) | 69.7 | 88.8 | 10.0 | 19.9 | 35.4 | 44.8 |
| DataFlow-SFT-15K (random) | 72.6 | 89.6 | 13.3 | 37.9 | 32.9 | 49.3 |
| DataFlow-SFT-15K (filtered) | 73.3 | 90.2 | 13.3 | 36.0 | 35.9 | 49.7 |
Code Benchmarks
| Methods | HumanEval | MBPP | Avg |
|---|---|---|---|
| Alpaca (random) | 71.3 | 75.9 | 73.6 |
| Alpaca (filtered) | 73.8 | 75.7 | 74.8 |
| WizardLM (random) | 75.6 | 82.0 | 78.8 |
| WizardLM (filtered) | 77.4 | 80.4 | 78.9 |
| DataFlow-SFT-15K (random) | 79.9 | 75.9 | 77.9 |
| DataFlow-SFT-15K (filtered) | 82.9 | 74.9 | 78.9 |
Knowledge Benchmarks
| Methods | MMLU | C-EVAL | Avg |
|---|---|---|---|
| Alpaca (random) | 71.8 | 80.0 | 75.9 |
| Alpaca (filtered) | 71.8 | 80.0 | 75.9 |
| WizardLM (random) | 71.8 | 79.2 | 75.5 |
| WizardLM (filtered) | 71.9 | 79.6 | 75.8 |
| DataFlow-SFT-15K (random) | 72.1 | 80.0 | 76.1 |
| DataFlow-SFT-15K (filtered) | 72.2 | 80.4 | 76.3 |
6.1.3 Conversation Synthesis Pipeline
We synthesize DataFlow-Chat-15K using DataFlow's conversation-generation pipeline and fine-tune Qwen2.5-7B-Base on it. Baselines include ShareGPT-15K, UltraChat-15K, and their full (non-truncated) versions. We evaluate on domain-specific tasks (TopDial, Light) and general benchmarks (MMLU, AlpacaEval, Arena-Hard).
Conversation Benchmarks
| Model | TopDial | Light | Avg |
|---|---|---|---|
| Qwen2.5-7B | 7.71 | 7.79 | 7.75 |
| + ShareGPT-15K | 7.75 | 6.72 | 7.24 |
| + UltraChat-15K | 7.72 | 6.83 | 7.28 |
| + DataFlow-Chat-15K | 7.98 | 8.10 | 8.04 |
General Benchmarks
| Model | MMLU | AlpacaEval | Arena-Hard | Avg |
|---|---|---|---|---|
| Qwen2.5-7B | 71.45 | 7.05 | 0.60 | 26.36 |
| + ShareGPT-15K | 73.09 | 3.70 | 1.30 | 26.03 |
| + UltraChat-15K | 72.97 | 3.97 | 0.80 | 25.91 |
| + DataFlow-Chat-15K | 73.41 | 10.11 | 1.10 | 28.21 |
6.2 Reasoning Pipeline
We adopt the NuminaMath dataset as a high-quality seed dataset. We compare three training sources: (1) a random 10K subset from Open-R1, (2) a random 10K subset from Synthetic-1, and (3) our 10K synthesized DataFlow-Reasoning-10K dataset constructed using DataFlow.
| Setting | Model | gsm8k | math | amc23 | olympiad | gaokao24_mix | minerva | AIME24@32 | AIME25@32 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | Qwen2.5-32B-Instruct | 95.8 | 73.5 | 70.0 | 38.5 | 42.9 | 26.5 | 16.8 | 11.6 | 46.95 |
| 1 Epoch | + SYNTHETIC-1-10k | 92.9 | 71.8 | 52.5 | 38.4 | 23.1 | 24.3 | 35.6 | 34.0 | 46.6 |
| 1 Epoch | + Open-R1-10k | 91.5 | 72.3 | 65.0 | 38.4 | 20.9 | 24.6 | 43.0 | 33.5 | 48.7 |
| 1 Epoch | + DataFlow-Reasoning-10K | 93.9 | 72.3 | 72.5 | 38.7 | 38.5 | 26.5 | 35.9 | 34.5 | 51.6 |
| 2 Epochs | + SYNTHETIC-1-10k | 94.5 | 78.4 | 75.0 | 45.0 | 24.2 | 28.3 | 48.4 | 37.9 | 54.0 |
| 2 Epochs | + Open-R1-10k | 93.9 | 77.2 | 80.0 | 44.1 | 20.9 | 25.4 | 51.0 | 40.7 | 54.2 |
| 2 Epochs | + DataFlow-Reasoning-10K | 94.4 | 76.6 | 75.0 | 45.2 | 42.9 | 25.7 | 45.4 | 40.0 | 55.7 |
6.3 Code Pipeline
We randomly sample 20k instances from the Ling-Coder-SFT corpus and process them through the DataFlow Code Pipeline. This yields three curated code instruction datasets of different scales, DataFlow-Code-1K, DataFlow-Code-5K, and DataFlow-Code-10K, each designed to provide high-quality, pipeline-refined supervision signals for code generation tasks.
We compare our synthesized datasets against Code-Alpaca-1k and Self-OSS-Instruct-SC2-Exec-Filter-1k.
Trained on Qwen2.5-7B-Instruct
| Training Data | BigCodeBench | LiveCodeBench (v6) | CruxEval (Input) | CruxEval (Output) | HumanEval+ | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 35.3 | 23.4 | 44.8 | 43.9 | 72.6 | 44.0 |
| + Code Alpaca-1K | 33.3 | 18.7 | 45.6 | 46.4 | 66.5 | 42.1 |
| + Self-OSS | 31.9 | 21.4 | 46.9 | 45.9 | 70.1 | 43.2 |
| + DataFlow-Code-1K | 35.5 | 25.7 | 48.0 | 45.1 | 72.6 | 45.4 |
| + DataFlow-Code-5K | 36.2 | 26.4 | 48.6 | 45.0 | 73.2 | 45.9 |
| + DataFlow-Code-10K | 36.8 | 26.0 | 48.8 | 45.4 | 73.8 | 46.2 |
Trained on Qwen2.5-14B-Instruct
| Training Data | BigCodeBench | LiveCodeBench (v6) | CruxEval (Input) | CruxEval (Output) | HumanEval+ | Avg |
|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 37.5 | 33.4 | 48.0 | 48.5 | 74.4 | 48.4 |
| + Code Alpaca-1K | 37.0 | 28.2 | 50.2 | 49.6 | 71.3 | 47.3 |
| + Self-OSS | 36.9 | 22.3 | 52.6 | 50.1 | 68.3 | 46.0 |
| + DataFlow-Code-1K | 41.4 | 33.7 | 51.0 | 50.9 | 77.3 | 50.9 |
| + DataFlow-Code-5K | 41.1 | 33.2 | 52.5 | 50.6 | 76.2 | 50.7 |
| + DataFlow-Code-10K | 41.9 | 33.2 | 52.9 | 51.0 | 76.2 | 51.0 |
📄 7. Publications
Our team has published the following papers that form core components of the DataFlow system:
| Paper Title | DataFlow Component | Venue | Year |
|---|---|---|---|
| Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL | Text2SQL Data Augmentation | ICDE | 2026 |
| Let's Verify Math Questions Step by Step | Math question quality evaluation | KDD | 2026 |
| MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
| Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |
🏆 8. Awards & Achievements
We are honored to have received first-place awards in two major international AI competitions, recognizing the excellence and robustness of DataFlow and its reasoning capabilities:
| Competition | Track | Award | Organizer | Date |
|---|---|---|---|---|
| ICML 2025 Challenges on Automated Math Reasoning and Extensions | Track 2: Physics Reasoning with Diagrams and Expressions | 🥇 First Place Winner | ICML AI for Math Workshop & AWS Codabench | July 18, 2025 |
| 2025 Language and Intelligence Challenge (LIC) | Track 2 | 🥇 First Prize | Beijing Academy of Artificial Intelligence (BAAI) & Baidu | August 10, 2025 |
🙏 9. Acknowledgements
We sincerely thank MinerU for their outstanding work, whose powerful PDF/document text extraction capabilities provided essential support for our data loading process. We also thank LLaMA-Factory for offering an efficient and user-friendly framework for large model fine-tuning, which greatly facilitated rapid iteration in our training and experimentation workflows. Our gratitude extends to all contributors in the open-source community; their efforts collectively drive the development of DataFlow. We thank Zhongguancun Academy for their API and GPU support.
🌟 10. Awesome Work Using DataFlow & DataFlow Ecosystem
This section highlights projects, research works, and applications built on top of DataFlow or deeply integrated with the DataFlow ecosystem.
👉 Curated list of featured projects: [Awesome Work Using DataFlow]
We warmly welcome the community to contribute new entries via Pull Requests. 👉 The Detailed Guidance can help you create a DataFlow extension repository with DataFlow-CLI.
🤝 11. Community & Support
Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!
- 🎮 GitHub Issues: Report bugs or suggest features
- 🔧 GitHub Pull Requests: Contribute code improvements
- 💬 Join our community groups to connect with us and other contributors!
📄 12. Citation
If you use DataFlow in your research, please cite our work:
```bibtex
@article{liang2025dataflow,
  title={DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI},
  author={Liang, Hao and Ma, Xiaochen and Liu, Zhou and Wong, Zhen Hao and Zhao, Zhengyang and Meng, Zimo and He, Runming and Shen, Chengyu and Cai, Qifeng and Han, Zhaoyang and others},
  journal={arXiv preprint arXiv:2512.16676},
  year={2025}
}
```