What is AutoRound?
AutoRound is an advanced quantization toolkit designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It achieves high accuracy at ultra-low bit widths (2-4 bits) with minimal tuning by leveraging sign-gradient descent and providing broad hardware compatibility. See our paper for more details. For usage instructions, please refer to the User Guide.
What's New
- [2025/12] The SignRoundV2 paper is available. Turn on enable_alg_ext and use the AutoScheme API for mixed-precision quantization to reproduce the results: Paper, Notes for evaluating LLaMA models.
- [2025/11] AutoRound has landed in LLM-Compressor: Usage, vLLM blog, RedHat blog, X post, Intel blog, LinkedIn, WeChat, Zhihu.
- [2025/11] An enhanced GGUF quantization algorithm is available via --enable_alg_ext: Accuracy.
- [2025/10] AutoRound has been integrated into SGLang: Usage, LMSYS blog, X post, Intel blog, LinkedIn.
- [2025/10] A mixed-precision algorithm is available to generate quantization schemes in minutes: Usage, Accuracy.
- [2025/09] MXFP4 and NVFP4 dtypes are available: Accuracy.
- [2025/08] An improved INT2 algorithm is available via --enable_alg_ext: Accuracy.
- [2025/07] The GGUF format is supported: Usage.
- [2025/05] AutoRound has been integrated into vLLM: Usage, Medium blog, RedNote.
- [2025/05] AutoRound has been integrated into Transformers: Blog.
- [2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy: Model.
Key Features
✅ Superior Accuracy Delivers strong performance even at 2-3 bits (example models), with leading results at 4 bits (benchmark).
✅ Ecosystem Integration Seamlessly works with Transformers, vLLM, SGLang, and more.
✅ Multiple Export Formats Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.
✅ Fast Mixed Bits/Dtypes Scheme Generation Automatically generates a configuration in minutes, with about 1.1X-1.5X the model's BF16 RAM size as overhead. Accuracy results and user guide.
✅ Optimized Round-to-Nearest Mode Use --iters 0 for fast quantization with some accuracy drop at 4 bits. Details are shown in opt_rtn mode.
✅ Affordable Quantization Cost Quantizes 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.
✅ 10+ VLMs Support Out-of-the-box quantization for 10+ vision-language models: example models, support matrix.
✅ Multiple Recipes Choose from auto-round-best, auto-round, and auto-round-light to suit your needs. Details are shown in quantization recipes.
✅ Advanced Utilities Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.
✅ Beyond Weight-Only Quantization We are actively expanding support for additional datatypes such as MXFP, NVFP, W8A8, and more.
Installation
Install from PyPI
# CPU/Intel GPU/CUDA
pip install auto-round

# HPU
pip install auto-round-lib
Build from Source
# CPU/Intel GPU/CUDA
pip install .

# HPU
python setup.py install lib
Model Quantization (CPU/Intel GPU/Gaudi/CUDA)
CLI Usage
The full list of supported arguments is available by running auto-round -h in the terminal.
ModelScope is supported for model downloads; simply set AR_USE_MODELSCOPE=1.
auto-round \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--format "auto_round" \
  --output_dir ./tmp_autoround

We offer two other recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.
Other Recipes
# Best accuracy, 3X slower; low_gpu_mem_usage could save ~20GB but is ~30% slower
auto-round-best \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16" \
  --low_gpu_mem_usage
# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
  --model Qwen/Qwen3-0.6B \
  --scheme "W4A16"
In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the
configuration to suit your specific requirements and available resources.
API Usage
from auto_round import AutoRound

# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"

# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")

# Highest accuracy (4-5X slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)

# Faster quantization (2-3X speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)

# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
Important Hyperparameters
Quantization Scheme & Configuration
- scheme (str|dict|AutoScheme): The predefined quantization key, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M. For MXFP4/NVFP4, we recommend exporting to the LLM-Compressor format.
- bits (int): Number of bits for quantization (default is None). If not None, it overrides the scheme setting.
- group_size (int): Size of the quantization group (default is None). If not None, it overrides the scheme setting.
- sym (bool): Whether to use symmetric quantization (default is None). If not None, it overrides the scheme setting.
- layer_config (dict): Configuration for a layer-wise scheme (default is None), mainly for customized mixed schemes.
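For example, a minimal sketch of overriding one scheme field and pinning a single layer to a different scheme via layer_config (the group size and per-layer choice here are illustrative, not recommendations):

from auto_round import AutoRound

model_name_or_path = "Qwen/Qwen3-0.6B"

# Start from the W4A16 scheme, override its group size, and keep lm_head at a
# higher-bit scheme through layer_config (string values follow the scheme names above).
ar = AutoRound(
    model_name_or_path,
    scheme="W4A16",
    group_size=32,                      # overrides the group_size defined by the scheme
    layer_config={"lm_head": "W8A16"},  # illustrative per-layer override
)
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")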
Algorithm Settings
- enable_alg_ext (bool): [Experimental Feature] Only for iters > 0. Enables algorithm variants for specific schemes (e.g., MXFP4/W2A16) that can bring notable improvements. Default is False.
- disable_opt_rtn (bool): Uses pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is False (improved RTN enabled).
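As a sketch, enabling the experimental algorithm extension for a 2-bit run might look like this (the iteration count is illustrative; enable_alg_ext requires iters > 0):

from auto_round import AutoRound

# W2A16 is one of the schemes that benefits from the experimental algorithm extension.
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W2A16", enable_alg_ext=True, iters=1000)
ar.quantize_and_save(output_dir="./qmodel_w2", format="auto_round")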
Tuning Process Parameters
- iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
- lr (float): The learning rate for the rounding value (default is None). When None, it is set to 1.0/iters automatically.
- batch_size (int): Batch size for training (default is 8). 4 is also commonly used.
- enable_deterministic_algorithms (bool): Whether to enable deterministic algorithms for reproducibility (default is False).
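A short sketch combining these parameters (values are illustrative beyond the pairings stated above):

from auto_round import AutoRound

# Lighter tuning: fewer iterations with the matching higher learning rate, a smaller batch,
# and deterministic algorithms for reproducible results.
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    iters=50,
    lr=5e-3,
    batch_size=4,
    enable_deterministic_algorithms=True,
)
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")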
Calibration Dataset
- dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".
- nsamples (int): Number of samples for tuning (default is 128).
- seqlen (int): Sequence length of the tuning data (default is 2048).
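For example, a sketch that points calibration at a combined dataset string and raises the sample budget (the local JSON path is the placeholder from the description above; nsamples=256 is illustrative):

from auto_round import AutoRound

# Combine a local JSON file with two hub datasets for calibration.
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    dataset="./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test",
    nsamples=256,
    seqlen=2048,
)
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")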
Device/Speed Configuration
- enable_torch_compile (bool): We typically recommend setting it to True for faster quantization and lower resource usage, provided no exception is raised.
- low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).
- low_cpu_mem_usage (bool): [Experimental Feature] Whether to enable immediate saving to reduce RAM usage (default is False).
- device_map (str|dict|int): The device to be used for tuning, e.g. auto, cpu, cuda, or GPU indices such as 0 or 1,2 (default is 0). When using auto, it tries to use all available GPUs.
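A sketch of a tuning run that spreads work across all visible GPUs and enables compilation (assumes a multi-GPU CUDA setup; adjust for your hardware):

from auto_round import AutoRound

# Shard tuning across available GPUs, compile for speed, and offload intermediate
# features to CPU if GPU memory is tight (at the cost of ~20% more tuning time).
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    device_map="auto",
    enable_torch_compile=True,
    low_gpu_mem_usage=True,
)
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")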
Adaptive Schemes (Experimental Feature)
AutoScheme provides an automatic algorithm to generate adaptive mixed bits/data-type quantization recipes. Please refer to the user guide for more details on AutoScheme.
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}

# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
Important Hyperparameters of AutoScheme
- avg_bits (float): Target average bit-width for the entire model. Only quantized layers are included in the average bit calculation.
- options (str | list[str] | list[QuantizationScheme]): Candidate quantization schemes to choose from. It can be a single comma-separated string (e.g., "W4A16,W2A16"), a list of strings (e.g., ["W4A16", "W2A16"]), or a list of QuantizationScheme objects.
- ignore_scale_zp_bits (bool): Only supported in API usage. Determines whether to exclude the bits of scale and zero-point from the average bit-width calculation (default: False).
- shared_layers (Iterable[Iterable[str]], optional): Only supported in API usage. Defines groups of layers that share quantization settings.
- batch_size (int, optional): Only supported in API usage. Can be set to 1 to reduce VRAM usage at the expense of longer tuning time.
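As a sketch under the parameters above (the shared_layers group names are hypothetical and depend on the model architecture; the option pair and average bit target are illustrative):

from auto_round import AutoRound, AutoScheme

# Choose between W4A16 and W2A16 per layer to hit a 3-bit average, excluding scale/zero-point
# bits from the average, and let the attention projections share one setting (hypothetical names).
scheme = AutoScheme(
    avg_bits=3.0,
    options=("W4A16", "W2A16"),
    ignore_scale_zp_bits=True,
    shared_layers=[["k_proj", "q_proj", "v_proj"]],  # hypothetical grouping
    batch_size=1,  # trades tuning time for lower VRAM usage
)
ar = AutoRound(model="Qwen/Qwen3-8B", scheme=scheme, iters=200)
ar.quantize_and_save(output_dir="./qmodel_mixed")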
API Usage for VLMs
If you encounter issues during quantization, try setting iters=0 (to enable RTN) and group_size=32 for better results.
This feature is experimental and may be subject to change.
By default, AutoRound quantizes only the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, set quant_nontext_module to True, though support for this feature is limited. For more information, please refer to the AutoRound readme.
from auto_round import AutoRound

# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"

# Quantize the model
ar = AutoRound(model_name_or_path, scheme="W4A16")
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
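To also quantize the non-text (vision) modules, a sketch based on the quant_nontext_module option mentioned above (support for this is limited, as noted):

from auto_round import AutoRound

# Quantize the vision modules as well as the text module.
ar = AutoRound("Qwen/Qwen2.5-VL-7B-Instruct", scheme="W4A16", quant_nontext_module=True)
ar.quantize_and_save("./qmodel_full")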
Model Inference
vLLM (CPU/Intel GPU/CUDA)
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
SGLang (Intel GPU/CUDA)
Please note that support for MoE models and vision-language models is currently limited.
import sglang as sgl

llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
    "Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
Transformers (CPU/Intel GPU/Gaudi/CUDA)
AutoRound supports 10+ backends and automatically selects the best available one based on the installed libraries; when a better backend exists, it prompts the user to install the additional libraries required.
Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.
Support for the Gaudi device is limited.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
Publications & Events
Acknowledgement
Special thanks to open-source low-precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLlamaV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
Support Us
If you find AutoRound helpful, please ⭐ star the repo and share it with your community!