estimate-train-time
Predict distributed LLM training time before you run. This tool estimates the wall-clock time for training large language models across multiple GPUs using 3D parallelism (pipeline, tensor, and data parallelism), helping you plan capacity and compare parallelization strategies without expensive trial runs.
Installation
```bash
pip install estimate-train-time  # Coming soon to PyPI
```
Note: The PyPI package is coming soon. For now, install directly from the repository:
```bash
git clone https://github.com/DebarghaG/estimate-train-time.git
cd estimate-train-time
pip install -e .
```
Quick Start
```bash
# List available example configurations
estimate-train-time list-examples

# Run prediction with a bundled example (Llama 7B on A100s)
estimate-train-time predict --example llemma_7b_4_2_2_P
```
Output:
```
Estimated time cost of current training config: 9480819.17 us
= 9480.82 ms
= 9.4808 s
```
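This figure appears to be a per-batch estimate (the Python API below exposes the same prediction as `one_batch_predict`), so extrapolating to a full training run means multiplying by the planned number of optimizer steps. A minimal sketch, with a hypothetical step count:
```python
# Scale the per-batch Quick Start estimate to a whole run.
# The 9,480,819.17 us figure is the Quick Start output above;
# the step count is a hypothetical assumption for illustration.
batch_time_us = 9_480_819.17
steps = 100_000  # hypothetical number of training steps
total_hours = batch_time_us * steps / 1e6 / 3600
print(f"Estimated full run: {total_hours:.1f} hours")  # -> ~263.4 hours
```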
Features
- 3D Parallelism Support: Pipeline, tensor (model), and data parallelism (see the GPU-count sketch after this list)
- Pre-trained Regressors: Bundled models for NVIDIA A100 and GH200 GPUs
- No GPU Required: Predictions run on CPU using trained regressors
- Extensible: Add your own GPU profiles and cluster configurations
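To make the 3D-parallelism item concrete: the pipeline, tensor, and data degrees multiply to give the total GPU count. This is standard 3D-parallelism arithmetic, not this tool's internals, and reading the bundled example name `llemma_7b_4_2_2_P` as degrees 4/2/2 is an assumption:
```python
# Standard 3D-parallelism accounting; not part of this tool's API.
# Reading "llemma_7b_4_2_2_P" as pp=4, tp=2, dp=2 is an assumption.
pp, tp, dp = 4, 2, 2           # pipeline, tensor, and data parallel degrees
world_size = pp * tp * dp      # total GPUs = product of the three degrees
print(f"GPUs required: {world_size}")  # -> 16
```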
Documentation
- Getting Started - Installation and first prediction
- Core Concepts - Understanding distributed training estimation
- Configuration Reference - Config file parameters
- CLI Reference - Command-line options
- Python API - Programmatic usage
- Examples - Usage examples and custom configurations
- Advanced - Kernel sampling and extending the tool
Python API
```python
from estimate_train_time import one_batch_predict

# Predict training time for one batch from a config file
time_us = one_batch_predict("path/to/config.yml")
print(f"One batch takes {time_us / 1e6:.2f} seconds")
```
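Because predictions run on CPU with trained regressors, sweeping candidate parallelization strategies is cheap. A hedged sketch: the config file paths below are hypothetical placeholders; only `one_batch_predict` comes from the package:
```python
from estimate_train_time import one_batch_predict

# Compare candidate 3D-parallelism layouts by predicted per-batch time.
# The config paths are hypothetical; write one config file per layout.
configs = {
    "pp=4, tp=2, dp=2": "configs/llemma_7b_4_2_2.yml",
    "pp=2, tp=4, dp=2": "configs/llemma_7b_2_4_2.yml",
}
for layout, path in configs.items():
    time_us = one_batch_predict(path)
    print(f"{layout}: {time_us / 1e6:.2f} s per batch")
```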
Requirements
- Python 3.8+
- pandas, numpy, scikit-learn, xgboost, pyyaml, ijson, joblib
For GPU sampling (optional): torch, flash-attn, deepspeed
Acknowledgements
This work was supported by the National Science Foundation (NSF)-funded AI Institute for Intelligent Cyberinfrastructure with Computational Learning in the Environment (ICICLE), grant OAC 2112606.
Citation
If you use this tool in your research, please cite our paper, accepted to HiPC 2025 (proceedings forthcoming):
```bibtex
@article{zhang2025efficient,
  title={Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM},
  author={Zhang, Biyao and Zheng, Mingkai and Ganguly, Debargha and Zhang, Xuecen and Singh, Vikash and Chaudhary, Vipin and Zhang, Zhao},
  journal={arXiv preprint arXiv:2509.22832},
  year={2025}
}
```
License
MIT License - see LICENSE for details.