LLM Token Production Energy Cost


Calculation Methodology

Energy Calculation: The calculator uses the active parameter count when estimating compute. Each token triggers roughly $2 \times N_{active}$ floating-point operations, so total energy is $E = \frac{2 \times N_{active} \times T}{\eta}$, where $\eta$ is hardware efficiency in FLOPs/Joule.
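
As a rough sketch of that calculation, the Python below computes the energy estimate and converts it to kWh. The function name and structure are illustrative rather than taken from the calculator, and it takes the raw active-parameter count rather than a value in billions.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules

def inference_energy_kwh(n_active_params: float, tokens: int,
                         flops_per_joule: float = 6.59e11) -> float:
    """Estimate inference energy in kWh: E = 2 * N_active * T / eta."""
    flops = 2.0 * n_active_params * tokens   # ~2 FLOPs per active parameter per token
    joules = flops / flops_per_joule         # eta: hardware efficiency in FLOPs/J
    return joules / JOULES_PER_KWH

# Example: ~70e9 active parameters generating 1,000 tokens
# print(inference_energy_kwh(70e9, 1_000))  # ~5.9e-5 kWh
```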

Overall parameters represent the full model size (all experts for MoE).
Active parameters are the subset actually multiplied for a single token. Dense models have active = overall; MoE models often have active ≪ overall.

This distinction means MoE models show lower compute energy and cost in this calculator than equally sized dense models. We do not currently account for memory bandwidth or expert-routing overhead, so real-world MoE energy can be somewhat higher.
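
Reusing the sketch above, a hypothetical comparison (the parameter counts are invented for illustration) shows why the active/overall split matters:

```python
# Illustrative only: parameter counts below are made up, not from the article.
tokens = 10_000

dense_kwh = inference_energy_kwh(140e9, tokens)  # dense: active = overall = 140B
moe_kwh = inference_energy_kwh(14e9, tokens)     # MoE: 140B overall, 14B active (assumed)

# dense_kwh / moe_kwh == 10: the estimate scales with active parameters only.
# Real MoE energy is somewhat higher once memory bandwidth and routing count.
```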

Carbon Emissions: $E_{\text{kWh}} \times I_{\text{grid}}$, where $I_{\text{grid}}$ is the region-specific carbon intensity (kg CO₂/kWh). Selecting cleaner grids (lower $I_{\text{grid}}$) therefore reduces emissions even when energy use is unchanged.
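
The emissions step is a straight multiplication; a minimal sketch, with grid-intensity values assumed purely for illustration:

```python
def carbon_emissions_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float) -> float:
    """Emissions (kg CO2) = E_kWh * I_grid."""
    return energy_kwh * grid_intensity_kg_per_kwh

# Same energy, different grids (intensity values are assumed for illustration):
# carbon_emissions_kg(1.0, 0.7)   # carbon-heavy grid -> 0.7 kg CO2
# carbon_emissions_kg(1.0, 0.05)  # low-carbon grid   -> 0.05 kg CO2
```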

Hardware Assumptions: Based on NVIDIA H100 specifications (~$6.59 \times 10^{11}$ FLOPs/Joule, a conservative estimate). Precision improvements (FP16/FP8) increase efficiency by 2×/4×, respectively.
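
One possible way to express that precision adjustment, assuming the 2×/4× multipliers apply on top of the conservative baseline figure (the dictionary keys and helper name are illustrative):

```python
# Assumed mapping: the article's baseline figure scaled by its 2x/4x multipliers.
BASELINE_FLOPS_PER_JOULE = 6.59e11
PRECISION_MULTIPLIER = {"baseline": 1.0, "fp16": 2.0, "fp8": 4.0}

def effective_flops_per_joule(precision: str = "baseline") -> float:
    """Hardware efficiency (FLOPs/J) after the precision multiplier."""
    return BASELINE_FLOPS_PER_JOULE * PRECISION_MULTIPLIER[precision]

# Higher efficiency lowers the energy estimate proportionally, e.g.:
# inference_energy_kwh(70e9, 1_000, flops_per_joule=effective_flops_per_joule("fp8"))
```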

Variable definitions: $N_{active}$ – active parameters (billions); $T$ – token count; $E_{\text{kWh}}$ – total energy in kilowatt-hours; $I_{\text{grid}}$ – regional carbon intensity (kg CO₂/kWh); $\eta$ – hardware efficiency (FLOPs/J).

Sources: Özcan et al. (2023), "Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations"