Adaptive Data Optimization

2025/04/13

By Allan Zhou and Yiding Jiang, based on Jiang et al., “Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws,” presented at ICLR 2025.


Pretraining foundation models involves vast amounts of computation and trillions of tokens of data. A central challenge is determining how best to allocate compute across different sources of data. Since the behavior of the final model is strongly influenced by its training data1, it is crucial to intelligently select what data the model trains on.

A common approach to training data selection is to start with a base dataset whose data is grouped into distinct domains: code from GitHub, articles from Wikipedia, posts from Reddit, etc. We must then decide what proportion of the training data to draw from each domain. With this setup, the data selection problem is recast as an optimization problem over data mixtures.

Intuitively, to estimate the benefit of a particular data mixture, it helps to train a model on it. As a result, many existing data selection methods require multiple stages of training: earlier stages produce “proxy” models that help optimize the data mixture, and the final stage uses the optimized mixture to train the final model2. However, this can increase the computational cost and complexity of training pipelines.


In contrast, Adaptive Data Optimization (ADO) is an algorithm that dynamically adjusts the training data mixture throughout training, designed with two key properties in mind:

  1. On-the-fly: ADO does not require expensive computation to optimize the data distribution before training starts. It runs alongside training, adjusting the data mixture online.
  2. Efficient: ADO is fast and has minimal impact on training speed.

Dynamic Sample Selection with Scaling Laws

There are many algorithms for data selection3, but they typically operate based on one of two core principles4:

  1. Select data that maximizes the model’s performance on a particular set of downstream tasks.
  2. Select data that lets the model learn as efficiently as possible, without targeting any specific downstream task.

The first objective is well-defined, but it requires having a set of downstream tasks in mind while training a foundation model that is supposed to “do everything.” ADO is based on the second objective, which leads to very different considerations, such as how to avoid learning redundant information and how to select the data that can be learned most efficiently within a given computation budget.

At every training step \(t\), ADO outputs a probability distribution \(\pi(t)\) over \(K\) domains, which defines the current training mixture. The training minibatch at step \(t\) is sampled according to that distribution.
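
As a concrete (hypothetical) illustration of this sampling step, the snippet below draws a minibatch given the current mixture; the per-domain data pools and function names are ours, not from the ADO codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_minibatch(domain_pools, pi, batch_size):
    """Draw a minibatch according to the current ADO mixture pi(t).

    domain_pools: list of K per-domain example pools (illustrative stand-in
                  for the real per-domain data loaders).
    pi:           length-K array of mixture probabilities from ADO.
    """
    # Pick a domain for every example in the batch, then sample an example
    # uniformly from the chosen domain.
    domain_ids = rng.choice(len(domain_pools), size=batch_size, p=pi)
    return [domain_pools[k][rng.integers(len(domain_pools[k]))] for k in domain_ids]
```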

In order to perform data selection on-the-fly, ADO uses per-domain scaling laws to forecast the model’s performance as it trains. For each domain \(k\), it models \(L_k(n)\), the loss on that domain after training on \(n\) tokens, with a power law \(\hat{L}_k(n) = \epsilon_k + \beta_k n^{-\alpha_k}\). ADO fits the parameters \(\{\epsilon_k, \beta_k, \alpha_k\}\) to the loss curve observed so far, re-fitting every few hundred steps because doing so is relatively cheap. Despite their simplicity, power laws can fit loss curves quite well:

Power law fits for loss curves on various domains.
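
As a rough sketch of this fitting step (the paper uses its own fitting procedure; the code below is just one way to fit a three-parameter power law with SciPy, and all names are ours):

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, eps, beta, alpha):
    # L_hat_k(n) = eps_k + beta_k * n^(-alpha_k)
    return eps + beta * np.power(n, -alpha)

def fit_domain_scaling_law(tokens_seen, losses):
    """Fit (eps_k, beta_k, alpha_k) to one domain's observed loss curve.

    tokens_seen: 1-D array of token counts n at which the loss was logged.
    losses:      1-D array of that domain's training loss at those points.
    """
    p0 = (losses.min(), max(losses[0] - losses.min(), 1e-3), 0.5)  # rough init
    params, _ = curve_fit(power_law, tokens_seen, losses, p0=p0,
                          bounds=([0.0, 0.0, 0.0], [np.inf, np.inf, 2.0]))
    return params  # eps_k, beta_k, alpha_k
```

Since each domain has only three parameters and a short history of logged losses, re-fitting every few hundred steps stays cheap.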

Inspecting the derivative of the scaling laws tells us two interesting things about how the model is learning on each domain:

$$\frac{d \hat{L}_k(n)}{dn} = -\alpha_k \beta_k n^{-\alpha_k - 1} = -\frac{\alpha_k}{n} \left(\hat{L}_k(n) - \epsilon_k\right).$$

Here \(\alpha_k\) tells us the speed of learning on domain \(k\), while the factor \(\hat{L}_k(n) - \epsilon_k\) tells us how much is “left to learn.”

The ADO algorithm starts from the simple idea of sampling more data from domains where the model is learning quickly. It uses the fitted scaling laws to predict which domains the model can make the most progress on, and computes a mixture \(\pi(t)\) that puts more weight on those domains, as sketched below. For a more detailed description, see Section 3 of the original paper.
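
The exact update is more involved (it also includes a prior over domains, credit assignment, and smoothing; see Section 3 of the paper), but a heavily simplified sketch of the core idea, in our own notation, weights each domain by its predicted instantaneous rate of loss decrease:

```python
import numpy as np

def mixture_from_scaling_laws(fitted_params, tokens_seen, temperature=1.0):
    """Toy version of the core ADO idea: upweight domains with fast predicted progress.

    fitted_params: list of K (eps_k, beta_k, alpha_k) tuples from the fits above.
    tokens_seen:   total number of tokens n trained on so far.
    Returns a length-K probability vector pi(t).
    """
    n = max(tokens_seen, 1)
    # Predicted learning speed per domain:
    # |d L_hat_k / dn| = (alpha_k / n) * (L_hat_k(n) - eps_k) = (alpha_k / n) * beta_k * n^(-alpha_k)
    speeds = np.array([(alpha / n) * beta * n ** (-alpha)
                       for (eps, beta, alpha) in fitted_params])
    # Normalize (with an optional temperature) to get the sampling distribution.
    weights = speeds ** (1.0 / temperature)
    return weights / weights.sum()
```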

Computational efficiency

A major advantage of ADO is that it runs alongside model training without requiring substantial computational resources. In practice, ADO increases training wall-clock time by about \(0.4\%\) when training a 1.3B-parameter model, with the primary compute cost coming from periodically re-fitting the scaling-law parameters. Because the cost of fitting the scaling laws is independent of model size, the relative overhead shrinks even further when training larger models.

Experimental results

ADO prefers CommonCrawl

The experiments in the original paper use ADO to train language models on The Pile at two model sizes: 124M and 1.3B parameters. The Pile is a diverse text dataset sourced from 23 domains, such as arXiv, GitHub, PubMed Abstracts, and Wikipedia. Plotting how the data mixture \(\pi(t)\) evolves throughout training shows that ADO prefers to sample from CommonCrawl (Pile-CC), which is consistent with findings from prior work (e.g., Xie et al., 2023).

The data mixture \(\pi(t)\) over the course of training.

Benchmark performance and perplexity

When evaluating average performance across seven standard evaluation tasks, models trained with ADO outperform those trained by other data selection strategies:

| Data selection strategy | Avg score (124M) ↑ | Avg score (1.3B) ↑ |
| --- | --- | --- |
| Pile default | 45.4% | - |
| DoReMi | 45.5% | 57.5% |
| ODM | 46.1% | - |
| Balanced | 45.7% | 55.7% |
| Natural | 46.3% | 58.5% |
| ADO | 47.0% | 59.0% |

Condensed version of Table 1 from the original paper: average zero-shot performance across seven standard evaluation tasks.

Though not quite as strong as ADO, the “Natural” strategy stands out for its simplicity: just sample from each domain in proportion to token count. Surprisingly, this simple strategy is not frequently discussed or benchmarked in previous data selection literature.

Despite achieving the best (highest) average benchmark scores, ADO does not achieve the best (lowest) test perplexity on The Pile. One explanation is that The Pile contains Internet-scraped data of varying quality, and the lower-quality data may be harder to learn (e.g., noisy) or may not contain information relevant to the skills tested by typical evaluation tasks. If we instead look at datasets that are curated and more heavily filtered for “high quality” data (such as FineWeb and SlimPajama), ADO does achieve significantly better test perplexity.

It is interesting that ADO seems to target “high quality” data even though it is not explicitly designed for that objective: its only aim is to learn as efficiently as possible.

Independent re-evaluation using Mixtera

Recently, Böther et al. (2024) re-implemented ADO (with a few minor differences5) using Mixtera, an infrastructure for declaratively filtering and mixing large datasets. Their experiments use the same base dataset (The Pile) but a completely independent training infrastructure, model architecture, and tokenizer.

They conducted a more extensive evaluation by training 162M, 1B, and 3B models and testing them on eight standard evaluation tasks. Under their setup, they found that ADO provides no improvement at the 162M scale, modest improvements at the 1B scale, and larger improvements at the 3B scale.

| Mixture | Avg score (162M) ↑ | Avg score (1B) ↑ | Avg score (3B) ↑ |
| --- | --- | --- | --- |
| Pile default | 34.9% | 42.9% | 44.6% |
| Natural | 35.0% | 42.9% | - |
| ADO | 34.4% | 43.4% | 46.5% |

Condensed version of Table 3 in Böther et al., 2024: average zero-shot performance across eight evaluation tasks, measured at the final checkpoint (30k steps).

The average scores are not directly comparable to those from the original paper (above) because (1) the tokenizer, model architecture, training hyperparameters, and overall training pipeline are different, and (2) the set of evaluation tasks is overlapping but not identical.

Conclusion

Optimizing the data mixture is a critical but often computationally expensive step in pretraining powerful foundation models. Adaptive Data Optimization (ADO) presents a compelling alternative: it dynamically adjusts the training data mixture during the training process itself, with minimal impact on the training speed.

There are many interesting technical directions that remain to be explored:

  1. The current division of base datasets into domains is somewhat arbitrary. How do we create better and more fine-grained domains?
  2. The ADO scaling laws naively predict the performance of the model on each domain independently. How can we capture inter-domain interactions efficiently6?
  3. The present results are all for pretraining. Can insights from ADO be applied to other training settings such as continued pretraining or finetuning?

Acknowledgements

Thanks to Samuel Sokota for reviewing an earlier draft of this post.


  1. See “The ‘it’ in AI models is the dataset.” ↩︎

  2. With notable exceptions, such as Albalak et al., 2023. ↩︎

  3. See, e.g., this survey for a more comprehensive overview. ↩︎

  4. This observation is not new; see also https://x.com/giffmana/status/1898664177452953701 ↩︎

  5. For example, the Mixtera implementation only queries for a new data mixture \(\pi(t)\) every few training steps, instead of every step. ↩︎

  6. The ADO per-domain scaling laws require fitting \(3K\) parameters for \(K\) domains. To model the effect of pairwise interactions between domains, you would need \(\sim K^2\) parameters. ↩︎