Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens

By Asankhaya Sharma

In our previous work, we ran 50+ experiments to find the optimal mixing ratio for pre-training data. We discovered that a static 50-30-20 mix of textbook-quality PDFs, filtered web content, and educational web resources consistently outperformed complex curriculum strategies. We used that recipe to train codelion/gpt-2-70m, achieving over 90% of GPT-2's performance with 10x less data.

That work left us with a natural question: what happens when you take the insights from optimal mixing and scale up the data itself?

This post is the story of Sutra-10B, a 10 billion token pedagogical pre-training dataset, and the framework we built to create it. We describe how the Sutra generation pipeline works, from knowledge graph to quality filtering, what happened when we trained SmolLM2-70M on it for 3 full epochs (30.6 billion tokens total), and what the results tell us about the limits of small models and the value of curated data. Sutra-10B is the largest in a family of pedagogical datasets we have released at multiple scales, all collected in our Sutra Pedagogical Datasets collection.

From Mixing Ratios to Data Generation

Our mixing experiments showed that textbook-quality content (finePDFs) was the most valuable ingredient in the mix, consistently anchoring strong validation performance. But we were limited by available high-quality educational content. FinePDFs, Cosmopedia, and similar sources only go so far when you need billions of tokens.

This is a challenge the field has been grappling with broadly. The HuggingFace team addressed it with FineWeb-Edu [1], using classifier-based quality filtering to extract educational content from web crawls, and later with Cosmopedia [2], which generates synthetic textbooks and articles seeded by 34,000 BISAC subject categories. The SmolLM2 family [3] demonstrated that combining these filtered and synthetic sources with careful multi-stage training can push sub-2B models to state-of-the-art. Microsoft's Phi-4 [4] showed that strategically placed synthetic data throughout pre-training, generated via multi-agent prompting and instruction reversal, can make a 14B model punch well above its weight class on reasoning tasks.

But these approaches share a common limitation: they either filter existing content (losing volume) or generate synthetic content without a structured curriculum. We wanted to combine both. So we built Sutra, a framework for generating pedagogical content at scale, guided by a knowledge graph that defines what to teach, in what order, and across what domains.

The Sutra Framework

Sutra is not a single prompt that asks an LLM to write textbook pages. It is a multi-stage pipeline with six components: a knowledge graph that defines the curriculum, a content generator that produces educational text, and a quality evaluator that scores output on six pedagogical dimensions. It also includes a diversity manager for broad topic coverage, a rephraser that transforms content into multiple formats, and a cleaner that removes duplicates and low-quality entries. Here is how each piece works.

Knowledge Graph

At the heart of Sutra is a knowledge graph containing 1,942 concepts organized across 9 domains: mathematics, science, technology, language arts, social studies, arts and creativity, life skills, philosophy and ethics, and interdisciplinary topics. Each concept carries a complexity level (1 through 10), a list of prerequisites, a set of downstream concepts it builds toward, and cross-domain connections to related concepts in other fields.

Concepts fall into four tiers based on complexity: fundamental (levels 1-3), intermediate (4-6), advanced (7+), and synthesis. The graph validates itself for circular dependencies and missing prerequisites, ensuring that the curriculum structure is internally consistent before any content gets generated.

The most important property of the knowledge graph is that it can produce a learning sequence for any set of target concepts: a generation order that respects prerequisite chains, so foundational content gets created before advanced material. This mirrors how a well-designed textbook builds knowledge incrementally. Cross-domain bridges connect related concepts across fields, so a concept like "statistical mechanics" in science links bidirectionally to "probability distributions" in mathematics.
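The prerequisite-respecting ordering described above is essentially a topological sort over the concept graph. Here is a minimal sketch of how such a learning sequence could be derived; the function name and the transitive-prerequisite expansion are illustrative, not Sutra's actual implementation.

```python
from collections import defaultdict, deque

def learning_sequence(concepts, prerequisites):
    """Order target concepts so every prerequisite precedes its dependents.

    `concepts` is an iterable of target concept names; `prerequisites` maps
    a concept to the concepts it depends on (a DAG). Standard Kahn sort.
    """
    targets = set(concepts)
    # Pull in prerequisites transitively so foundational content is included.
    queue = deque(targets)
    while queue:
        c = queue.popleft()
        for p in prerequisites.get(c, []):
            if p not in targets:
                targets.add(p)
                queue.append(p)

    indegree = {c: 0 for c in targets}
    dependents = defaultdict(list)
    for c in targets:
        for p in prerequisites.get(c, []):
            indegree[c] += 1
            dependents[p].append(c)

    ready = deque(sorted(c for c in targets if indegree[c] == 0))
    order = []
    while ready:
        c = ready.popleft()
        order.append(c)
        for d in dependents[c]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    if len(order) != len(targets):
        raise ValueError("circular dependency detected")  # mirrors graph self-validation
    return order
```

The cycle check at the end corresponds to the graph's self-validation for circular dependencies: a topological order exists only if the prerequisite structure is acyclic.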

Recent curriculum learning studies [5, 6] have shown that ordering training data by difficulty can reduce training steps by 18-45%, though the interaction with learning rate decay schedules complicates things in practice [7]. Our knowledge graph provides the scaffolding for potential curriculum strategies, even though for the Sutra-10B training run itself we used standard shuffled pre-training.

Content Generation

The generator walks the knowledge graph and plans generation tasks across all concept and content type combinations. There are 14 content types, ranging from concept introductions to advanced applications and synthesis pieces. Not every content type applies to every concept: synthesis content only appears at complexity 3+, meta-learning at 4+, and code-related content types (implementation, explanation, debugging, optimization) are restricted to domains where they make sense, like programming, engineering, mathematics, and science.

A breadth-first priority system ensures complete coverage before depth. Concepts with no generated entries yet receive the strongest priority boost. New content types on existing concepts get a moderate boost. Once a concept has multiple entries of a given type, it drops to natural priority. This prevents the generator from producing hundreds of entries about popular topics while leaving obscure but important concepts empty.
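The breadth-first priority rule can be sketched as a small scoring function. The boost magnitudes below are illustrative assumptions; the post only states the ordering (empty concept > new content type > already covered), not the actual values.

```python
def generation_priority(entry_counts, concept, content_type, base=1.0):
    """Breadth before depth: boost under-covered (concept, content_type) slots.

    `entry_counts` maps (concept, content_type) -> entries generated so far.
    Boost factors are hypothetical, chosen only to preserve the ordering.
    """
    total_for_concept = sum(
        n for (c, _), n in entry_counts.items() if c == concept
    )
    if total_for_concept == 0:
        return base * 4.0  # concept has no entries at all: strongest boost
    if entry_counts.get((concept, content_type), 0) == 0:
        return base * 2.0  # new content type on an existing concept
    return base            # slot already covered: natural priority
```

Sorting pending tasks by this score is what keeps obscure concepts from staying empty while popular topics accumulate entries.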

The generation is driven by over 30 structured prompt templates organized into four categories: core curriculum prompts (concept introduction, reasoning demonstration, synthesis, meta-learning, and others), code-specific prompts, rephrase prompts, and general format prompts covering document types from scientific papers to case studies to troubleshooting guides. Each template defines the expected structure, target length, and quality criteria. Every template also includes an explicit instruction to avoid common LLM output wrapper phrases ("Here's...", "Let me...", "I'll create...") and instead start directly with substantive content. This simple prompting technique dramatically reduces the need for post-hoc cleaning.

For code content, programming language selection is deterministic per concept, ensuring consistent assignment across regeneration runs. The language pools vary by domain: Python, JavaScript, Java, C++, Go, and Rust for programming and systems topics; Python, R, MATLAB, and Julia for mathematics and science.
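One way to get the determinism described above is to hash the concept name rather than draw from a random generator; a stable hash gives the same language on every regeneration run. The hashing scheme below is an assumption — the post only states that selection is deterministic per concept.

```python
import hashlib

# Pools as listed in the post; the domain keys here are simplified.
LANGUAGE_POOLS = {
    "programming": ["Python", "JavaScript", "Java", "C++", "Go", "Rust"],
    "mathematics": ["Python", "R", "MATLAB", "Julia"],
}

def language_for_concept(concept, domain):
    """Deterministically pick a language from the domain's pool.

    sha256 (unlike Python's built-in hash()) is stable across processes,
    so regeneration runs assign the same language to the same concept.
    """
    pool = LANGUAGE_POOLS[domain]
    digest = hashlib.sha256(concept.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]
```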

We served the generation model (GLM-4.7-Flash, a 30B total / 3B active parameter MoE model) via vLLM on a single L40S GPU, processing requests in parallel batches for throughput. The BeyondWeb project [8] showed that diversity in generation strategies is critical at trillion-token scale, outperforming Cosmopedia and Nemotron-Synth by up to 5.1 percentage points. The MAGA approach [9] demonstrated that reformulating documents across genre-audience combinations yields a 3.9x token expansion that consistently improves models from 134M to 13B parameters. Our approach shares this bet on diversity, with 20 content styles organized around a pedagogical knowledge graph rather than genre-audience matrices.

Quality Evaluation

Every generated piece runs through a six-dimension quality evaluator before being accepted into the dataset. The dimensions and their weights are:

Dimension Weight What It Measures
Connection richness 0.30 Cross-concept and cross-domain linking
Clarity 0.20 Structural coherence and readability
Practical utility 0.15 Actionable knowledge and examples
Reasoning completeness 0.15 Logical chains, evidence, step-by-step thinking
Information density 0.10 Substantive content per word
Pedagogical structure 0.10 Definitions, mechanisms, examples, progressions

Connection richness carries the highest weight because cross-concept linking is the most distinctive property of pedagogical content versus generic web text. Information density gets the lowest weight to avoid penalizing conversational or narrative styles, which are pedagogically valuable even if they use more words per idea.

Each dimension is scored by looking for specific textual indicators. For reasoning, we look for phrases like "because", "therefore", "for example", and step-by-step markers. For connections, we look for phrases like "connects to", "analogous to", "bridges". For pedagogical structure, we look for definitions, mechanisms, and demonstrations. Filler phrases ("obviously", "as you know") are penalized. For code content, there are separate scoring criteria that evaluate the ratio of code to explanation, the presence of comments and documentation, error handling, and complexity analysis.

Entries must pass hard acceptance gates on top of the weighted score: a minimum overall quality score, a minimum information density, and a minimum reasoning completeness threshold. These gates have content-type-specific adjustments. Worked examples get a more lenient density threshold to compensate for their naturally verbose format. Code implementations get a stricter one.

Recent work [10] found that classifier-based quality filtering improves downstream task performance but does not necessarily improve perplexity on high-quality held-out data. What quality filters really do, it turns out, is domain selection rather than quality measurement. Our multi-dimensional scoring tries to be more targeted: rather than a single quality/not-quality binary, we measure different aspects of pedagogical value and weight them according to what matters for educational content specifically.

Diversity Management

The diversity manager tracks which combinations of concept, content type, and presentation style have already been generated. There are 20 distinct content styles: formal mathematical, intuitive visual, practical applied, historical context, problem solving, comparative analysis, step-by-step, conceptual overview, real-world case study, Socratic method, narrative storytelling, visual/diagrammatic, code demonstration, mathematical proof, interactive dialogue, error correction, pattern matching, analogical reasoning, experimental inquiry, and multimodal synthesis.

When the generator needs a style for a new entry, the diversity manager suggests whichever style has not yet been used for that particular concept and content type. Among unused styles, it picks the one that has seen the least use globally. This simple greedy approach ensures surprisingly even coverage across the style space without requiring complex scheduling.

Novelty checking works at two levels. First, exact text matching catches identical duplicates. Second, a text similarity measure based on overlapping word sequences detects near-duplicates within the same content type and domain. Common pedagogical phrases ("for example", "step by step", "in conclusion") that appear naturally in educational content are excluded from the similarity computation to prevent false positives.

Content Rephrasing

Inspired by the BeyondWeb methodology [8], the rephraser transforms existing content into alternative formats to increase dataset diversity. It implements five strategies: conversational (teacher-student dialogue), tutorial (step-by-step walkthrough), Q&A format (Socratic question-answer pairs), dense reference (compact technical reference), and pedagogical (general educational restructuring).

The strategy selection encodes a key insight from BeyondWeb: conversational format represents less than 3% of web text but is the primary format users encounter during inference (chat, tutoring, Q&A). So the rephraser has a strong bias toward converting non-conversational content into dialogue form. Technical content gets routed preferentially to either dense reference or tutorial format. Content that already has a sequential structure goes to tutorial format. Everything else gets a weighted random selection across the five strategies.

Cleaning Pipeline

The final stage is a five-step cleaning pipeline:

  1. Wrapper detection: Scans the opening of each entry for common LLM output artifacts ("here's", "let me", "i'll", "sure,", and similar). Dialogue entries with speaker labels are protected from this filter, since they naturally start with conversational phrases.

  2. Exact deduplication: Removes identical text entries across the entire dataset.

  3. Length filtering: Removes entries below a minimum token count threshold.

  4. Semantic deduplication: Uses text embeddings with cosine similarity to catch paraphrased duplicates that exact matching would miss.

  5. Quality validation: Removes entries with malformed content, such as those dominated by special characters or containing too few unique characters to be meaningful.
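Steps 1 and 5 are cheap string checks that can be sketched directly. The wrapper phrases come from the post; the thresholds in the malformed-content check are illustrative assumptions.

```python
WRAPPER_OPENERS = ("here's", "let me", "i'll", "sure,")  # phrases listed in the post

def is_wrapper_artifact(text, is_dialogue=False):
    """Step 1: flag entries that open with LLM output wrapper phrases.

    Dialogue entries with speaker labels are exempt, since they
    legitimately start with conversational phrases.
    """
    if is_dialogue:
        return False
    opening = text.strip().lower()[:40]
    return any(opening.startswith(w) for w in WRAPPER_OPENERS)

def is_malformed(text, min_unique_chars=20, max_special_ratio=0.3):
    """Step 5: flag entries dominated by special characters or with too
    little character diversity to be meaningful text. Thresholds are
    hypothetical placeholders."""
    if not text:
        return True
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return (len(set(text)) < min_unique_chars
            or special / len(text) > max_special_ratio)
```

Exact deduplication (step 2) is a set membership test on normalized text; semantic deduplication (step 4) requires an embedding model and is the expensive part of the pipeline.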

Kang et al. [11] found that training on 1/3 rephrased synthetic data mixed with 2/3 natural web text can achieve 5-10x speedup at larger data budgets, while pure textbook-style synthetic data risks model collapse. The related work on strong model collapse [12] showed that even 1-in-1000 synthetic samples can trigger collapse in some settings. Our cleaning pipeline, combined with the 24% natural data mix from external sources, is designed to mitigate these risks.

Building Sutra-10B

With the framework in place, building the 10 billion token dataset involved four phases:

Core Generation: We used GLM-4.7-Flash served via vLLM to generate structured educational content guided by the knowledge graph. Each piece carries metadata for domain, content type, difficulty level, and the full six-dimension quality assessment. This core Sutra content accounts for about 7.8 billion tokens.

Diversity Mixing: Drawing directly from our 50-30-20 insight, we mixed in roughly 2.4 billion tokens from five external sources: Nemotron-CC-Math, OpenWebMath, English Wikipedia, Cosmopedia, and FineWeb-Edu. Each contributes roughly 0.5 billion tokens.

Metadata Enrichment: For entries from external sources that lack Sutra-style metadata, we used GLM-4.7-Flash to classify domain, content type, complexity, and quality dimensions. Entries from the Sutra core already carry this metadata from generation. Every entry also gets a token count computed with the SmolLM2 tokenizer and a unique ID.

Quality Filtering and Deduplication: We deduplicated across all sources and applied quality scoring, keeping only entries above our threshold. The final dataset contains 10.19 million entries totaling 10.2 billion tokens, with an average quality score of 0.70.


[Figure: dataset composition]

The domain breakdown reflects our goal of broad educational coverage:

[Figure: domain distribution]

Interdisciplinary content makes up the largest share (34.9%), followed by technology (21.1%) and science (14.3%). Mathematics, social studies, life skills, arts, language arts, and philosophy round out the rest. The content types are similarly diverse: historical context (30.2%), concept introductions (9.1%), data analysis (7.6%), worked examples (6.8%), problem sets (6.6%), tutorials (6.1%), and many others.

Each entry in the dataset carries 13 metadata fields: id, concept_name, domain, content_type, text, quality_score, information_density, complexity_level, token_count, prerequisites, builds_to, cross_domain_connections, and quality_assessment (a nested object with six sub-scores). This rich metadata makes the dataset useful beyond pre-training, for research on curriculum learning, domain-specific filtering, and data selection strategies.

Sutra-10B is the largest in a series of datasets we built at increasing scales: 10M, 100M, 1B, and 10B tokens. We also release the 30K seed concepts that bootstrap the knowledge graph and an SFT variant for instruction tuning. All of these are available in the Sutra Pedagogical Datasets collection. The smaller datasets are useful for quick experiments and ablations; the 1B version was used as one of our baseline comparisons in this work.

Training Setup

We used SmolLM2-70M as our base model. It is a 69.2M parameter LlamaForCausalLM with 32 layers, 384 hidden dimensions, 6 attention heads (2 KV heads), and an 8192 token context window. We chose this model specifically because it lets us run full training experiments on a single GPU in reasonable time while still being large enough to show meaningful benchmark differences.

All training was done on a single NVIDIA A10 (48GB) with the following configuration:

  • Batch size 4, gradient accumulation 8 (effective batch ~262K tokens per step)
  • Sequence length 8,192
  • AdamW optimizer (fused) with weight decay 0.1
  • Cosine learning rate schedule with warmup
  • Flash Attention 2, TF32 matmul, torch.compile
  • Throughput: roughly 110,000 tokens per second
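The numbers in this configuration are internally consistent, which is worth a quick sanity check: effective batch is batch size × gradient accumulation × sequence length, and the throughput figure predicts the per-epoch wall-clock time.

```python
batch_size = 4
grad_accum = 8
seq_len = 8192

# Effective batch per optimizer step.
tokens_per_step = batch_size * grad_accum * seq_len
assert tokens_per_step == 262_144  # the ~262K tokens per step listed above

# At ~110K tokens/s, one 10.2B-token epoch should take roughly:
hours_per_epoch = 10.2e9 / 110_000 / 3600
print(f"{hours_per_epoch:.1f} hours")  # ≈25.8 hours, matching the measured epoch times
```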

We trained for 3 epochs, reducing the learning rate each time:

Epoch Tokens Hours Peak LR Min LR Warmup Steps Best Perplexity
1 10.2B 25.82 3e-4 3e-5 2,000 39.50
2 10.2B 25.78 1e-4 1e-5 500 37.81
3 10.2B 26.16 3e-5 3e-6 250 37.72
Total 30.6B 77.76 37.72

The total training cost was about 78 hours of single-GPU time.

What We Learned

Perplexity Keeps Improving, Benchmarks Don't

This is the most striking result. Perplexity dropped consistently across all three epochs: 39.50 after epoch 1, 37.81 after epoch 2, 37.72 after epoch 3. The model kept getting better at predicting the next token in our dataset.

But downstream benchmark performance told a different story. The 10-benchmark average went from 34.02 (epoch 1 final) to 34.13 (epoch 2 final) to 34.26 (epoch 3 final). That is a total improvement of 0.24 points over 20.4 billion additional tokens of training.

[Figure: perplexity progression across epochs]

The gap between the perplexity curve and the benchmark curve tells us the model has hit a representational ceiling. With only 69 million parameters, there is a hard limit on how much knowledge and reasoning ability the model can encode, regardless of how much data it sees. The perplexity improvements in later epochs are real. They reflect better surface-level pattern matching, not deeper understanding that would show up on benchmarks.

Huang et al. [13] introduced a dimensionless data-quality parameter extending the Chinchilla framework. High-quality datasets allow strong results with smaller models and less compute, but only up to the model's capacity limit. The "Densing Law" [14] published in Nature Machine Intelligence found that capability density (capability per parameter) doubles approximately every 3.5 months, validating the economic case for investing in small models. But there are floors to what any given model size can achieve, and 69 million parameters hits that floor quickly.

The Detailed Benchmark Picture

We evaluated every checkpoint (both the final model and the best-perplexity model from each epoch) on 11 benchmarks using lm-evaluation-harness v0.4.11:

[Figure: per-benchmark scores by epoch (heatmap)]

A few patterns stand out:

SciQ improved the most. It went from 42.30 (E1-best) to 45.20 (E3-best), a gain of nearly 3 points. This makes sense given the heavy science and educational content in Sutra. The model genuinely learned more science knowledge across epochs.

PIQA also improved steadily. Physical intuition scores went from 53.92 to 54.84 over the three epochs. Physical reasoning benefits from the diverse worked examples and practical applications in the dataset.

TruthfulQA declined slightly. It went from 49.09 (E1-best) to 48.02 (E3-best). Extended training on synthetic pedagogical content may slightly erode the model's calibration on factual accuracy benchmarks. This is worth watching in future work.

MMLU and ARC-Challenge were essentially flat. These benchmarks require the kind of deep knowledge that a 70M parameter model simply cannot store, regardless of training data.

GSM8K stayed near zero. Math reasoning at this model scale is not really feasible. The scores bounced between 0.15 and 0.83 across checkpoints, which is noise.

How Sutra Compares to Other Datasets

We trained the same SmolLM2-70M model on 1B tokens from seven different datasets, each for a single epoch. This gives us a controlled comparison of data quality independent of scale.

[Figure: dataset comparison]

At the 1B scale, all seven datasets produce remarkably similar average benchmark scores, ranging from 31.42 (Synth-1B) to 32.38 (FinePDFs-1B). The differences are within noise. This tells us that at 1B tokens, the model has not yet saturated its capacity, and the data source matters less than having enough data.

But Sutra-10B at 3 epochs (30.6B tokens total) reaches 34.27, a clear separation from the 1B baselines. The improvement comes not from magical data quality but from scale: more tokens, more epochs, more exposure to the patterns in the data.

[Figure: benchmark radar chart]

The radar chart shows where Sutra-10B gains over the 1B baselines. The improvements are broad rather than concentrated in one area: slightly better on PIQA, HellaSwag, WinoGrande, and significantly better on SciQ. The tradeoff is slightly worse TruthfulQA compared to some of the 1B baselines.

One interesting observation: Sutra-1B achieves the lowest training perplexity (10.44) of any dataset by a wide margin. The next closest is Synth-1B at 18.27, while web-crawled datasets range from 28 to 58. Pedagogical content is dramatically easier for the model to learn. But easier-to-learn does not directly translate to better downstream performance, which is a useful lesson about the relationship between training loss and actual capability.

The Capacity Ceiling

The most important takeaway: model size, not data quality or quantity, is the binding constraint once you have a reasonably good dataset.

After 10.2B tokens (1 epoch), the model achieves 34.02 average. After 30.6B tokens (3 epochs), it achieves 34.27. That is 3x the training compute for a 0.7% improvement. The diminishing returns are extreme.

This connects to work on data-constrained pre-training. Goyal et al. [15] found that in data-constrained regimes, text simplification combined with curriculum ordering outperforms both naive repetition and random ordering. Our multi-epoch training is essentially data-constrained repetition, and the results confirm that repeating even high-quality data has sharply diminishing returns for small models.

The practical implication: if you are training small models, focus on data quality over quantity. A well-curated 10B token dataset is enough. Spending compute on additional epochs yields minimal returns. Save that compute for a larger model instead.

Connections to the Broader Field

Several active research threads are relevant here, and the pace of progress in 2025 and early 2026 has been fast.

Synthetic data at scale. The question of whether synthetic data can replace or augment web-crawled data for pre-training is now well-studied. Kang et al. [11] at EMNLP 2025 showed that rephrased synthetic data mixed at 1/3 ratio with natural text achieves 5-10x speedup, while pure synthetic risks collapse. The EntiGraph method [16] at ICLR 2025 showed that synthesizing text by connecting extracted entities from domain corpora enables effective continued pre-training. BeyondWeb [8] demonstrated that diversity in generation strategies is the key ingredient at trillion-token scale. Our Sutra framework draws on these insights, using 20 content styles, 33 prompt templates, and knowledge-graph-guided generation to maximize diversity while maintaining pedagogical structure.

Data quality and filtering. The field has moved beyond simple quality classifiers. Nait Saada et al. [10] questioned the "data-quality illusion" in classifier-based filtering, finding that these filters act more like domain selectors than quality measurers. The ACL 2025 analysis of 400+ models [17] found that high data density significantly shifts the compute-optimal frontier. DatologyAI's UberWeb project [18], published in February 2026, showed that targeted per-language curation of a 20T-token corpus lets 3B/8B models match baselines at 4-10x lower compute. Our six-dimension quality evaluator with pedagogical-specific indicators is an attempt to measure quality along axes that actually matter for educational content.

Data mixing. The problem of optimal mixing ratios has received rigorous treatment. Data Mixing Laws [19] at ICLR 2025 discovered predictable functional relationships between mixture proportions and model performance. The UtiliMax framework [20] frames mixing as portfolio optimization. Our previous work on the 50-30-20 ratio was empirical; these newer frameworks suggest that similar ratios could be derived analytically from small-scale ablations.

Small model efficiency. The economics of small models keep improving. SmolLM2 [3] pushed the boundaries of what sub-2B models can do with careful data curation and multi-stage training. MiniCPM [21] at ICLR 2025 showed that 1.2B models can match 7B-13B performance using warmup-stable-decay scheduling. The Densing Law [14] quantified this trend: equivalent capability per parameter doubles every 3.5 months. Our work adds to this picture by showing where the floor is for 70M parameter models, even with the best data we could build.

Curriculum and pedagogical approaches. Using structured curricula for pre-training remains a mixed bag. Recent work [5, 6] shows 18-45% training step reductions from curriculum ordering, but the interaction with learning rate schedules can neutralize the benefit [7]. PedagoSense [22] in February 2026 introduced pedagogy-grounded strategy detection for learning dialogues. TeachLM [23] showed that training on authentic student-tutor interactions produces high-fidelity pedagogical dialogues. Our knowledge graph provides the infrastructure for curriculum strategies, even though our current training uses shuffled data. The metadata in Sutra-10B (difficulty levels, prerequisites, builds-to relationships) makes it possible for others to experiment with curriculum approaches.

Try It Yourself

Everything from this work is publicly available:

The Dataset: codelion/sutra-10B -- 10.2 billion tokens of pedagogical pre-training data with rich metadata across 13 fields.

The Model: codelion/SmolLM2-70M -- SmolLM2-70M trained for 3 epochs on Sutra-10B (best perplexity checkpoint from epoch 3).

The Sutra Family: Sutra Pedagogical Datasets -- the full collection of Sutra datasets at multiple scales (10M, 100M, 1B, 10B), plus the 30K seed concepts used to bootstrap the knowledge graph and an SFT variant for instruction tuning.

Baseline Comparisons: Pre-training Dataset Samples -- 1B token samples from seven datasets used for our controlled comparisons.

from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("codelion/sutra-10B", split="train", streaming=True)
for example in dataset:
    print(example["text"][:200])
    print(f"Domain: {example['domain']}, Type: {example['content_type']}")
    print(f"Quality: {example['quality_score']}, Complexity: {example['complexity_level']}")
    break

# Load the trained model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M")
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

inputs = tokenizer("The theory of relativity states that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What's Next

This work confirmed two things for us. First, the mixing insights from our original experiments do scale: pedagogical content mixed with diverse web data produces good pre-training datasets at the 10B scale. Second, the 70M model size is too small to fully exploit a dataset of this quality. The obvious next step is training a larger model on Sutra-10B and seeing how much further the benchmarks move.

We are also interested in several directions that the framework makes possible. The knowledge graph and prerequisite chains in Sutra could support true curriculum learning during pre-training, where the model sees foundational content before advanced material. The 13 metadata fields per entry enable experiments in data selection: what happens if you train only on high-connection-richness entries, or only on entries above complexity level 5? The rephrasing pipeline could be expanded to generate multilingual variants, building on recent work in multilingual data curation [18, 24].

The broader lesson from this project is that generating good pre-training data is becoming a systems engineering problem. It is not enough to prompt an LLM and collect the output. You need a curriculum structure to guide what gets generated, quality evaluation to filter what gets kept, diversity management to prevent mode collapse in generation, and careful mixing with natural data to maintain grounding. Each of these components benefits from the rapid progress across the field, and we expect the next generation of pedagogical datasets to be substantially better as these techniques mature.

If you use Sutra-10B in your work or have ideas for what to try next, we would love to hear about it.

References

[1] Lozhkov, A., Ben Allal, L., von Werra, L., & Wolf, T. "FineWeb: decanting the web for the finest text data at scale." NeurIPS 2024 Datasets and Benchmarks Track. HuggingFace

[2] Ben Allal, L., et al. "Cosmopedia v2." HuggingFace, 2024. Blog

[3] HuggingFace Smol Models Research Team. "SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model." arXiv:2502.02737, February 2025. Paper

[4] Abdin, M., et al. "Phi-4 Technical Report." Microsoft Research, December 2024. Paper

[5] "Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning." arXiv:2506.11300, June 2025. Paper

[6] "Curriculum Learning for LLM Pretraining: An Analysis of Learning Dynamics." arXiv:2601.21698, January 2026. Paper

[7] "How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining." arXiv:2511.18903, November 2025. Paper

[8] Maini, P., et al. "BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining." arXiv:2508.10975, August 2025. Paper

[9] "MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion." arXiv:2502.04235, February 2025. Paper

[10] Nait Saada, T., et al. "The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining." arXiv:2510.00866, October 2025. Paper

[11] Kang, et al. "Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls." EMNLP 2025. Paper

[12] "Strong Model Collapse." ICLR 2025. Paper

[13] Huang, et al. "Scaling Laws Revisited: Modeling the Role of Data Quality in Language Model Pretraining." arXiv:2510.03313, October 2025. Paper

[14] "Densing Law of LLMs." Nature Machine Intelligence, 2025. Paper

[15] Goyal, et al. "Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining." arXiv:2509.24356, September 2025. Paper

[16] Yang, Z., Band, N., et al. "Synthetic Continued Pretraining (EntiGraph)." ICLR 2025 (Oral). Paper

[17] "Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies." ACL 2025. Paper

[18] DatologyAI. "UberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset." arXiv:2602.15210, February 2026. Paper

[19] "Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance." ICLR 2025. Paper

[20] "Optimizing Pretraining Data Mixtures with LLM-Estimated Utility (UtiliMax)." arXiv:2501.11747, January 2025. Paper

[21] "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies." ICLR 2025. Paper

[22] "PedagoSense: A Pedagogy Grounded LLM System for Pedagogical Strategy Detection." arXiv:2602.01169, February 2026. Paper

[23] "TeachLM: Post-Training LLMs for Education Using Authentic Learning Data." arXiv:2510.05087, October 2025. Paper

[24] "Enhancing Multilingual LLM Pretraining with Model-Based Data Selection." arXiv:2502.10361, February 2025 (updated February 2026). Paper