Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
🔥🔥 News
- (🔥 New) [2025/9/29] We released the Jet-Nemotron models and inference code.
- (🔥 New) [2025/9/18] Jet-Nemotron is accepted by NeurIPS 2025! 🎉🎉🎉 See you at San Diego!
- [2025/8/22] We released the Jet-Nemotron technical report on arXiv.
💡 Introduction
Jet-Nemotron is a new family of hybrid-architecture language models that surpass state-of-the-art open-source full-attention language models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2, while achieving significant efficiency gains—up to 53.6× speedup in generation throughput on H100 GPUs (256K context length, maximum batch size). It is built upon two core innovations:
- Post Neural Architecture Search, an efficient post-training architecture exploration and adaptation pipeline applicable to arbitrary pre-trained transformer models;
- JetBlock, a novel linear attention block that significantly outperforms previous designs such as Mamba2.
Highlight 1: PostNAS – Post-Training Architecture Exploration and Adaptation
Unlike prior methods that train from scratch to explore new model architectures, PostNAS builds on a pre-trained transformer model while enabling flexible exploration of attention block designs, greatly reducing the cost and risk of developing new language model architectures.
- PostNAS first identifies the optimal placement of full-attention layers, then searches for improved attention block designs.
- Not all attention layers in a pre-trained transformer contribute equally. PostNAS identifies the important attention layers within the pre-trained model.
- KV cache size is the most critical factor influencing long-context and long-generation throughput (see the back-of-the-envelope sketch below). PostNAS's hardware-aware search discovers architectures that maintain similar generation throughput while using more parameters and achieving better accuracy.
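For intuition on the KV-cache point above, a back-of-the-envelope calculation (with illustrative hyperparameters, not those of any specific model here) shows how quickly the cache dominates GPU memory at long context:

```python
# Rough KV-cache size for a full-attention transformer, in bytes:
#   2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
# Illustrative values only; they do not correspond to any specific model here.
layers, kv_heads, head_dim = 28, 8, 128
seq_len, batch, bytes_per_elem = 262_144, 64, 2  # 256K context, bf16

kv_cache_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_cache_bytes / 2**30:.1f} GiB")  # ~1792 GiB for these settings
# Replacing most full-attention layers with linear attention removes their
# per-token KV state, which is what enables much larger batches at long context.
```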
Highlight 2: JetBlock - A New Linear Attention Module with SOTA Accuracy
With PostNAS, we introduce the JetBlock — a novel linear attention module that integrates dynamic convolution with hardware-aware architecture search to enhance linear attention, delivering substantial accuracy gains over previous designs while maintaining similar training and inference throughput. Below, we present an apples-to-apples comparison between the Mamba2 Block and the JetBlock, using identical training data and training recipes.
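Since JetBlock's key ingredient is an input-conditioned (dynamic) convolution, here is a minimal conceptual sketch of that idea in isolation. It is not the repository's implementation (see `jetai/modeling/hf/jet_block.py` for the real JetBlock); the module name, kernel-generation scheme, and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCausalConv(nn.Module):
    """Depthwise causal convolution whose kernel is predicted from the input.

    Conceptual sketch only; the real JetBlock (jetai/modeling/hf/jet_block.py)
    combines this kind of dynamic convolution with linear attention.
    """
    def __init__(self, dim: int, conv_size: int = 4):
        super().__init__()
        self.conv_size = conv_size
        # Predict one depthwise kernel per channel from a pooled summary of the input.
        self.kernel_proj = nn.Linear(dim, dim * conv_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        b, t, d = x.shape
        # Input-conditioned kernels, one per channel: (batch, dim, conv_size).
        kernels = self.kernel_proj(x.mean(dim=1)).view(b, d, self.conv_size)
        # Left-pad along time so each position only sees the past (causality).
        x_pad = F.pad(x.transpose(1, 2), (self.conv_size - 1, 0))  # (b, d, t + k - 1)
        # Apply a different depthwise kernel per batch element by folding the
        # batch dimension into the channel dimension of a grouped conv1d.
        y = F.conv1d(
            x_pad.reshape(1, b * d, -1),
            kernels.reshape(b * d, 1, self.conv_size),
            groups=b * d,
        )
        return y.view(b, d, t).transpose(1, 2)

# Example usage with the same hidden size as the JetBlock example later in this README.
x = torch.randn(2, 64, 1536)
print(DynamicCausalConv(1536)(x).shape)  # torch.Size([2, 64, 1536])
```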
Performance
Jet-Nemotron-2B and Jet-Nemotron-4B match or surpass the accuracy of leading efficient language models (e.g., Qwen3) across a comprehensive benchmark suite while running significantly faster — 21× and 47× faster than Qwen3-1.7B-Base, respectively.
Contents
- Setup Environments
- Models
- Generate with Jet-Nemotron
- Evaluation on Benchmarks
- Measure Throughput
- Build Your Own JetBlock
- Contact
- License
- Bibtex
1 Setup Environments
```bash
git clone https://github.com/NVlabs/Jet-Nemotron
cd Jet-Nemotron
pip3 install -e .
```
NOTE: To install flash-attn properly, you may need to install a specific release version or build from source.
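For example, a prebuilt wheel can usually be installed with the command below (only an illustration; pick a flash-attn release that matches your PyTorch and CUDA versions, or build from source if no matching wheel exists):

```bash
pip3 install flash-attn --no-build-isolation
```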
(Optional) To support throughput measurement or chunk-prefilling when eval_batch_size > 1, please install a modified version of transformers==4.52.0:
```bash
pip3 install -U transformers@git+https://github.com/jet-ai-projects/transformers.git@jetai
```
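A quick sanity check that both packages are importable and the expected transformers version is active (an optional snippet, not part of the repo):

```python
import flash_attn
import transformers

print(transformers.__version__)  # expect 4.52.0 when using the fork above
print(flash_attn.__version__)
```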
2 Models
- Jet-Nemotron-2B: jet-ai/Jet-Nemotron-2B
- Jet-Nemotron-4B: jet-ai/Jet-Nemotron-4B
Load the model with
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "jet-ai/Jet-Nemotron-2B",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```
NOTE: The kernels in Jet-Nemotron currently do not support running on CPUs. You may get unexpected results on CPUs.
To use or contribute to the model definition files in this repo (jetai/modeling/hf), you can first download or soft-link the model weights and model config to jetai/modeling/hf/:
```bash
hf download jet-ai/Jet-Nemotron-2B --local-dir jetai/modeling/hf --include "*safetensors*" --include "config.json"
```
Then you can load the model with
```python
model = AutoModelForCausalLM.from_pretrained(
    "jetai/modeling/hf",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```
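Both loading paths should produce the same model; a quick way to confirm the weights were actually picked up (illustrative snippet, not part of the repo):

```python
# Rough sanity check: the parameter count should match the model size (~2B here).
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e9:.2f}B parameters")
```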
3 Generate with Jet-Nemotron
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name_or_path = "jet-ai/Jet-Nemotron-2B"
# For local testing, you can use the following path.
# NOTE: Be sure to download or soft-link the model weights to `jetai/modeling/hf`
# model_name_or_path = "jetai/modeling/hf/"

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = model.eval().cuda()

input_str = "Hello, I'm Jet-Nemotron from NVIDIA."
input_ids = tokenizer(input_str, return_tensors="pt").input_ids.cuda()
output = model.generate(input_ids, max_new_tokens=50, do_sample=False)
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
```
or
```bash
python3 jetai/inference/generate.py --model_name_or_path ${PATH_TO_YOUR_MODEL}
```
4 Evaluation on Benchmarks
Run evaluation for the MMLU, MMLU-Pro, BBH, Commonsense, Math, Code, Retrieval, and LongBench tasks.
```bash
bash scripts/eval/2B/mmlu.sh
bash scripts/eval/2B/mmlu_pro.sh
bash scripts/eval/2B/bbh.sh
bash scripts/eval/2B/commonsense.sh
bash scripts/eval/2B/math.sh
bash scripts/eval/2B/code.sh
bash scripts/eval/2B/retrieval.sh
bash scripts/eval/2B/longbench.sh
```
You can use the first command-line argument to specify model_name_or_path:
```bash
bash scripts/eval/2B/mmlu.sh ${PATH_TO_YOUR_MODEL}
```
NOTE: The evaluation code uses the .parquet versions of the social_i_qa, mathqa, and longbench data from our repo because their official repos do not support loading with datasets >= 4.0.0.
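To sweep the full 2B suite against a single checkpoint, the scripts above can be chained in a simple loop (a convenience sketch, not a script shipped in this repo):

```bash
MODEL=${PATH_TO_YOUR_MODEL}
for task in mmlu mmlu_pro bbh commonsense math code retrieval longbench; do
    bash scripts/eval/2B/${task}.sh ${MODEL}
done
```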
5 Measure Throughput
```bash
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-4B --batch_size 64 --prefill_chunk_size 1024
```
Measure Throughput for All Context Lengths
```bash
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 4096 --batch_size 1024 --prefill_chunk_size 256
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 8192 --batch_size 512 --prefill_chunk_size 512
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 16384 --batch_size 512 --prefill_chunk_size 512
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 32768 --batch_size 256 --prefill_chunk_size 1024
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 65536 --batch_size 128 --prefill_chunk_size 2048
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 131072 --batch_size 128 --prefill_chunk_size 2048
python3 jetai/inference/measure_throuput.py --model_name_or_path jetai/Jet-Nemotron-2B --prompt_len 262144 --batch_size 64 --prefill_chunk_size 2048
```
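For reference, such a measurement boils down to timing batched generation after a warm-up run. Below is a minimal self-contained sketch of that idea (not the repository's script, no chunked prefill, and with illustrative batch and sequence sizes):

```python
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "jet-ai/Jet-Nemotron-2B",
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
tokenizer = AutoTokenizer.from_pretrained("jet-ai/Jet-Nemotron-2B", trust_remote_code=True)

batch_size, prompt_len, gen_len = 8, 1024, 256  # illustrative sizes only
input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, prompt_len), device="cuda")

with torch.inference_mode():
    # Warm up so one-time kernel/compilation costs are not counted.
    model.generate(input_ids, max_new_tokens=8, do_sample=False)
    torch.cuda.synchronize()
    start = time.time()
    model.generate(input_ids, max_new_tokens=gen_len, do_sample=False)
    torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{batch_size * gen_len / elapsed:.1f} generated tokens/s")
```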
6 Build Your Own JetBlock
The following code is a minimal example to build your own JetBlock.
```python
import torch
from jetai.modeling.hf.jet_block import JetBlock, JetBlockConfig

# Configure the block: value expansion ratio, number of heads, head dim, and conv kernel size.
jet_block_config = JetBlockConfig(
    expand_v=2.0,
    num_heads=6,
    head_dim=256,
    conv_size=4,
)
jet_block = JetBlock(
    hidden_size=1536,
    initializer_range=0.02,
    jet_block_config=jet_block_config,
).cuda().to(torch.bfloat16)

# Input of shape (batch=16, seq_len=4096, hidden_size=1536).
hidden_states = torch.randn(16, 4096, 1536).cuda().to(torch.bfloat16)
hidden_states, _ = jet_block(hidden_states=hidden_states)
print(hidden_states)
```
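As a quick follow-up, you can check that gradients flow through the block before dropping it into a larger model (a simple smoke test reusing `jet_block` from above; not part of the repo):

```python
# Smoke test: forward + backward through the block with a smaller batch.
x = torch.randn(2, 4096, 1536, device="cuda", dtype=torch.bfloat16, requires_grad=True)
out, _ = jet_block(hidden_states=x)
out.float().mean().backward()
print(x.grad.shape)  # torch.Size([2, 4096, 1536])
```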
License
Contact
📖 BibTeX
```bibtex
@article{gu2025jet,
  title={Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search},
  author={Gu, Yuxian and Hu, Qinghao and Yang, Shang and Xi, Haocheng and Chen, Junyu and Han, Song and Cai, Han},
  journal={arXiv preprint arXiv:2508.15884},
  year={2025}
}
```





