Methodology applied to generative AI
Summary
This methodological note proposes a calculation framework to assess the environmental footprint of generative AI models by integrating training, fine-tuning, and inference. The approach is based on estimating the compute load (FLOPs) required by each usage, converting it into GPU usage time, then into energy consumption and greenhouse gas (GHG) emissions. It also includes the share of impact linked to equipment manufacturing and life cycle. This approach aims to provide a reproducible, transparent method adapted to different models and usage contexts, consistent with Green AI research recommendations.
Principle
The methodology is based on a simple philosophy: directly link real uses of an AI model (training, fine-tuning, inference) to the hardware footprint necessary to perform them.
Rather than starting from global electricity consumption measurements at the data center level, which are often inaccessible or proprietary (Google, 2025), it first evaluates the amount of computation required by the model according to:
- its own characteristics (size, number of parameters, proportion of activated parameters, architecture),
- the volume of tokens consumed or generated (text, images, etc.).
This compute load is expressed in FLOPs, then converted to effective hardware usage time (GPUh) while accounting for real efficiency (Model FLOP Utilization, MFU).
The next step translates this usage time into energy consumption and GHG emissions based on the physical characteristics of GPUs/servers and operating conditions (PUE, electricity emission factor).
Finally, a share of the impact related to manufacturing and the equipment life cycle is added proportionally to usage time, following a life-cycle assessment (LCA) logic (ISO 14040 and 14044).
According to the Green AI study, FLOPs are a relevant metric to measure the impact of generative AI because they express the compute load actually performed, directly correlated with energy consumption, and provide a hardware-agnostic basis to compare different models fairly.
Impact assessment
What is a token?
A token is the discrete unit manipulated by the model to represent an input or an output. Depending on the modality, it can be a word fragment, a spatial position, or a coded temporal unit.
The table below provides a quick reference for each modality and a simple way to estimate the variables used in the formulas.
| Modality | What a token is | Formula (tokens / activations) | Example / estimation |
|---|---|---|---|
| Text | Word fragment (often ~3–4 characters on average) | T_\\text{text} = \\text{number of words} \\times \\text{tokens per word} | 100 words → T_\\text{text} \\approx 130–160 tokens depending on the tokenizer. |
| Image | Spatial / latent token (patch) | T_\\text{image} = (\\text{width}/\\text{patch}) \\times (\\text{height}/\\text{patch}) | 512×512 image, 16×16 patches → 512/16 = 32 tokens per axis → 32×32 = 1,024 tokens. |
| Audio | Temporal token produced by a codec (e.g., EnCodec) | T_\\text{audio} = \\text{duration (s)} \\times \\text{sample rate} \\div \\text{downscale} \\times \\text{latent channels} | 10 s clip, 24 kHz sample rate, downscale 320, 8 channels → T_\\text{audio} \\approx 6{,}000 tokens. |
| Video | Spatial token per frame + number of frames | T_\\text{frame} = (\\text{width}/\\text{patch}) \\times (\\text{height}/\\text{patch}), \\quad T_\\text{video} = T_\\text{frame} \\times \\text{frames} | 4 s at 24 fps → 96 frames. 512×512 frame, 16×16 patches → T_\\text{frame} \\approx 32\\times32=1{,}024 tokens per frame and T_\\text{video} = 96 \\times 1{,}024 = 98{,}304 tokens. |
Explanation of technical terms
- Patch: dividing an image or frame into square blocks processed as tokens by the model.
- Downscale / downsampling: reducing spatial (images/video) or temporal (audio) resolution to move into a smaller latent space used for activations. Example: downscale 8 → width and height divided by 8.
- Latent channels: number of dimensions in the latent space (feature depth) for image, video, or audio.
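As an illustration, the sketch below estimates token counts for each modality using the formulas above. It is a minimal sketch: the tokens-per-word ratio, patch size, and audio codec parameters are assumptions taken from the examples in the table, not universal constants.

```python
# Rough token-count estimates per modality (default values follow the examples above;
# tokenizer ratio, patch size, and codec parameters are assumptions to adapt).

def text_tokens(n_words: float, tokens_per_word: float = 1.4) -> float:
    # ~1.3-1.6 tokens per word depending on the tokenizer
    return n_words * tokens_per_word

def image_tokens(width: int, height: int, patch: int = 16) -> int:
    # one latent token per patch
    return (width // patch) * (height // patch)

def audio_tokens(duration_s: float, sample_rate: int = 24_000,
                 downscale: int = 320, latent_channels: int = 8) -> float:
    # temporal tokens produced by a neural codec (EnCodec-like defaults)
    return duration_s * sample_rate / downscale * latent_channels

def video_tokens(width: int, height: int, n_frames: int, patch: int = 16) -> int:
    # spatial tokens per frame, times the number of frames
    return image_tokens(width, height, patch) * n_frames

if __name__ == "__main__":
    print(text_tokens(100))                 # ~140 tokens for 100 words
    print(image_tokens(512, 512))           # 1024 tokens
    print(audio_tokens(10))                 # 6000 tokens
    print(video_tokens(512, 512, 4 * 24))   # 98304 tokens (4 s at 24 fps)
```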
Estimating compute load
| Use case | Calculation formula | Variables | Explanation |
|---|---|---|---|
| Training | F_\\text{training} = 6 \\times P \\times T | P: total number of model parameters; T: number of tokens processed during training (tokens × batch × steps) | For each token and parameter, 6 FLOPs are needed: 2 FLOPs for the forward pass and 4 for gradient computation and propagation (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Fine-tuning | F_\\text{fine-tuning} = (2 \\times P + 4 \\times P_t) \\times T | P: total number of model parameters; P_t: number of trainable parameters (depends on optimization: LoRA, …); T: number of tokens processed during training (tokens × batch × steps) | Same principle as full training, but the 4 backward FLOPs apply only to the trainable parameters, whose number is lower (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Prompt processing (text) | F_\\text{prompt} \\approx 1 \\times P_a \\times T_p | P_a: number of active parameters; T_p: number of prompt tokens | With KV cache enabled, the prompt is encoded once: cost is reduced to ≈ 1 FLOP per parameter and per token (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Prompt processing (image) | F_\\text{prompt} \\approx 1 \\times P_a \\times A_\\text{image} | P_a: number of active parameters; A_image: number of image activations = width × height × channels | Each prompt image is encoded once by the model. A_image corresponds to the number of latent tokens or encoded pixels. |
| Prompt processing (audio) | F_\\text{prompt} \\approx 1 \\times P_a \\times T_\\text{audio} | P_a: number of active parameters; T_audio: number of audio tokens = duration × sample rate ÷ downscale × latent channels | Each prompt audio clip is encoded once by the model. T_audio corresponds to the latent tokens used to represent the audio signal. |
| Text generation | F_\\text{gen} = 2 \\times P_a \\times T_\\text{out} | P_a: number of active parameters; T_out: number of generated tokens | For each token and parameter, 2 FLOPs are needed for the forward pass. The number of active parameters during inference depends on the model architecture (especially for MoE). (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Image generation | F_\\text{gen} = 2 \\times P_a \\times A | P_a: number of active parameters; A: number of activations = width × height × number of channels | For each activation and parameter, 2 FLOPs are needed for the forward pass (Source: Clockwork Diffusion, Transformers Inference Arithmetic) |
| Video generation (frame by frame) | F_\\text{gen} = 2 \\times P_a \\times A \\times N_f \\times N_s | P_a: number of active parameters; A: number of activations = width × height × number of channels; N_f: number of frames to generate; N_s: number of denoising steps | Generation processes each frame independently (Source: Clockwork Diffusion, Transformers Inference Arithmetic) |
| Video generation (spatio-temporal) | F_\\text{gen} \\approx 2 \\times P_a \\times A \\times N_f \\times N_s + 2 \\times (N_f \\times T_s)^2 \\times d \\times N_s | P_a: number of active parameters; A: number of activations = width × height × number of channels; N_f: number of frames to generate; N_s: number of denoising steps; T_s: number of spatial tokens = width × height; c: latent dimension = number of channels; d: hidden dimension | The first term corresponds to the linear cost of frame generation. The second models the dominant quadratic cost of spatio-temporal self-attention across all video tokens. (Source: Video Killed the Energy Budget) |
| Audio generation (temporal) | F_\\text{gen} \\approx N_s \\times (2 \\times P_a \\times A_\\text{audio} + 2 \\times T_\\text{audio}^2 \\times d) | P_a: number of active parameters; A_audio: number of latent audio activations per step; T_audio: number of temporal audio tokens; d: hidden dimension; N_s: number of denoising steps | Audio is generated by diffusion over a 1D sequence. Cost is linear for latent processing and quadratic for temporal self-attention, at each denoising step. (Sources: AudioLM, MusicLM, Stable Audio Open) |
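As an illustration, the sketch below transcribes the text-centric formulas of the table (training, fine-tuning, prompt processing with KV cache, generation) into code. It is a minimal sketch under the stated assumptions and does not cover the diffusion-based image, video, and audio cases.

```python
# Minimal FLOP estimators following the table above (orders of magnitude only).

def training_flops(n_params: float, n_tokens: float) -> float:
    # 2 FLOPs (forward) + 4 FLOPs (backward) per parameter and per token
    return 6 * n_params * n_tokens

def finetuning_flops(n_params: float, n_trainable: float, n_tokens: float) -> float:
    # forward over all parameters, gradients only over the trainable ones (e.g. LoRA)
    return (2 * n_params + 4 * n_trainable) * n_tokens

def prompt_flops(n_active_params: float, n_prompt_tokens: float) -> float:
    # with KV cache, the prompt is encoded once (~1 FLOP per parameter and token)
    return n_active_params * n_prompt_tokens

def generation_flops(n_active_params: float, n_output_tokens: float) -> float:
    # 2 FLOPs per active parameter and generated token (forward pass only)
    return 2 * n_active_params * n_output_tokens

if __name__ == "__main__":
    # Example: Llama 3.1 405B trained on 15e12 tokens -> ~3.6e25 FLOPs
    print(f"{training_flops(405e9, 15e12):.2e}")
    # Example: a 400-token prompt followed by 250 generated tokens
    print(f"{prompt_flops(405e9, 400) + generation_flops(405e9, 250):.2e}")
```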
Conversion to GPU usage
If the FLOP processing capacity of a GPU is known, it is then trivial to calculate the theoretical usage duration needed to satisfy one of the above use cases:

D_\\text{GPU} = \\dfrac{F}{C_\\text{GPU}}

with D_\\text{GPU} the GPU usage duration in hours, F the compute load in FLOPs, and C_\\text{GPU} the theoretical computing capacity of the GPU in FLOP/h.
In practice, the usable computing capacity of a GPU, taking into account model typology, GPU/TPU type, heavy parallelism, network exchanges, etc., represents only 25 to 50% of the theoretical capacity (see NVIDIA Benchmarks).
This utilization rate is called MFU (Model FLOP Utilization). The effective usage duration therefore becomes:

D_\\text{GPU} = \\dfrac{F}{\\text{MFU} \\times C_\\text{GPU}}
Conversion to energy consumption
If we assume that the GPU draws its maximum power throughout its usage, the calculation of its energy consumption is simple:

E = P_\\text{GPU} \\times D_\\text{GPU}

with P_\\text{GPU} the GPU power in Watts and E the resulting energy in Wh (D_\\text{GPU} being expressed in hours).
In a data center context, it is relevant to multiply this figure by the facility's PUE (Power Usage Effectiveness) to account for the energy overhead of the infrastructure (cooling, power distribution, etc.).
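A minimal sketch of these two conversion steps is given below; the default peak capacity, MFU, GPU power, and PUE are assumptions to adapt to the actual hardware.

```python
def gpu_hours(flops: float, peak_flops_per_s: float, mfu: float = 0.4) -> float:
    """Effective GPU usage time in hours, given peak capacity (FLOP/s) and MFU."""
    return flops / (peak_flops_per_s * 3600 * mfu)

def energy_kwh(gpu_h: float, gpu_power_w: float = 700, pue: float = 1.2) -> float:
    """Energy in kWh, assuming the GPU draws its full power, multiplied by the PUE."""
    return gpu_h * gpu_power_w / 1000 * pue

# Example (assumed values): 1e21 FLOPs on an H100-class GPU (989 TFLOP/s peak), MFU 40%
d = gpu_hours(1e21, 989e12)
print(f"{d:.1f} GPUh, {energy_kwh(d):.1f} kWh")   # ~702 GPUh, ~590 kWh
```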
Environmental impact of energy consumption
To obtain the environmental impact (e.g., GHG emissions) of this energy consumption, simply apply an electricity emission factor such as those available in the D4B Open Data reference:

I_\\text{energy} = E \\times EF_\\text{electricity}

with EF_\\text{electricity} the emission factor of the electricity mix (e.g., in kgCO2e/kWh).
Environmental impact of GPU manufacturing
The impact linked to GPU manufacturing is calculated proportionally to usage duration relative to the estimated GPU lifetime:

I_\\text{embodied} = I_\\text{manufacturing} \\times \\dfrac{D_\\text{GPU}}{\\text{Lifetime}_\\text{GPU}}

with I_\\text{manufacturing} the life-cycle (embodied) impact of one GPU and Lifetime_GPU its total lifetime in hours.
Accounting for server impacts
The impact of the other server components (CPU, RAM, storage, chassis) is also taken into account. Because durations are expressed in GPUh, the impact of these components is allocated in proportion to the number of GPUs per server: in an 8-GPU server, each calculated GPUh carries one eighth of the server's hourly operational and embodied impacts for the non-GPU components.
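The remaining steps (emission factor, GPU manufacturing allocation, and the per-GPU share of server components) can be sketched as follows; the default lifetime, embodied footprints, and 8-GPU layout are illustrative assumptions matching the DGX H100 example used in the Application section.

```python
def operational_emissions_kg(energy_kwh: float, emission_factor: float = 0.420) -> float:
    """GHG emissions (kgCO2e) of the electricity, with the factor in kgCO2e/kWh."""
    return energy_kwh * emission_factor

def gpu_embodied_kg(gpu_h: float, gpu_manufacturing_kg: float = 250,
                    lifetime_h: float = 5 * 24 * 365.25) -> float:
    """Manufacturing impact allocated in proportion to usage time over the GPU lifetime."""
    return gpu_manufacturing_kg * gpu_h / lifetime_h

def server_share_kg(gpu_h: float, gpus_per_server: int = 8,
                    server_power_w: float = 1540,
                    server_embodied_kg_per_h: float = 0.230,
                    pue: float = 1.2, emission_factor: float = 0.420) -> float:
    """Per-GPUh share of the non-GPU components (CPU, RAM, storage, chassis)."""
    operational = gpu_h * server_power_w / gpus_per_server / 1000 * pue * emission_factor
    embodied = gpu_h * server_embodied_kg_per_h / gpus_per_server
    return operational + embodied

# Examples (assumed values)
print(operational_emissions_kg(1000))   # 420 kgCO2e for 1 MWh
print(gpu_embodied_kg(1000))            # ~5.7 kgCO2e for 1,000 GPUh
print(server_share_kg(1000))            # ~126 kgCO2e for 1,000 GPUh
```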
Assumptions & limits
Assumptions
- During inference, a KV cache is assumed to always be present (Transformer Inference Arithmetic).
- Electricity emission factors come from the D4B Open Data reference.
Limitations
- Uncertainties in input data: actual training data, model characteristics often confidential, MFU, etc.
- No accounting for whether models fit in memory on selected hardware.
- No handling of TPU, FPGA, ASIC specificities.
- Lack of reliable LCA data for the equipment.
Perspectives
- Include public metrics such as tokens/s in addition to FLOPs.
- Account for precision (FP32, FP16, ...).
- Integrate overhead to account for parallelism impacts (network, replication, queuing, ...).
- Integrate GPU memory as a bottleneck.
- Integrate amortization of training across inference.
- Adapt MFU according to server characteristics (number of GPUs per server, ...).
- Adapt the methodology to multimodal models (text, image, video).
- Integrate multi-criteria impact factors (primary energy, water, rare metals).
- Integrate training of development versions attributable to the current model version.
Application
This section evaluates the methodology using public data from the open-source LLM Llama 3.1 (405B parameters).
Hardware assumptions
The NVIDIA DGX H100 is a “classic” configuration on which the workloads are executed.
| Component | Characteristics | Power | Life-cycle impact (approximate) |
|---|---|---|---|
| CPU | 2 x Intel Xeon Platinum 8480C processors (112 cores total) | 2 x 350 = 700 W | 2 x 25 = 50 kgCO2e |
| RAM | 2 TB | 2 x 1024 x 0.392 = 803 W | 2 x 1024 x 533 / 384 = 2843 kgCO2e |
| Storage | 30 TB SSD | 30 x 1024 x 0.0012 = 37 W | 30 x 1024 x 0.16 = 4915 kgCO2e |
| GPU | 8 x H100 80 GB (989 TFLOP/s per GPU) | 8 x 700 W | 8 x 250 kgCO2e |
| Chassis | - | - | 250 kgCO2e |
| Total (excluding GPU) | | 1540 W | 10058 kgCO2e |
| Total (excluding GPU), per hour of lifetime | | 1540 W | 10058 / (5 x 24 x 365.25) = 0.230 kgCO2e/h |
Training impact
Llama 3.1 (405B parameters) was trained with approximately 15 trillion (15e12) tokens. According to Huggingface, it was trained with 24,576 H100 GPUs:

| Model | Training Time (GPU hours) | Power Consumption (W) | Emissions (tons CO2eq) |
|---|---|---|---|
| Llama 3.1 8B | 1.46M | 700 | 420 |
| Llama 3.1 70B | 7.0M | 700 | 2,040 |
| Llama 3.1 405B | 30.84M | 700 | 8,930 |
Applying the model formulas with an MFU of 40% for training (to be refined based on NVIDIA benchmarks; it could be closer to 35%), a PUE of 1.2, and a GHG emission factor of 0.420 kgCO2e/kWh yields the operational footprint sketched below.
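A minimal numeric sketch of this calculation, using only the assumptions stated above:

```python
# Operational footprint of Llama 3.1 405B training under the stated assumptions.
N_PARAMS, N_TOKENS = 405e9, 15e12           # parameters, training tokens
PEAK_FLOPS_S, MFU = 989e12, 0.40            # H100 peak (FLOP/s), Model FLOP Utilization
GPU_POWER_W, PUE, EF = 700, 1.2, 0.420      # W, -, kgCO2e/kWh

flops = 6 * N_PARAMS * N_TOKENS                       # ~3.6e25 FLOPs
gpu_h = flops / (PEAK_FLOPS_S * 3600 * MFU)           # ~25.6 million GPUh
energy = gpu_h * GPU_POWER_W / 1000 * PUE             # ~21.5 GWh
emissions_t = energy * EF / 1000                      # ~9,000 tCO2e (vs 8,930 t reported)
print(f"{gpu_h:.2e} GPUh, {energy:.2e} kWh, {emissions_t:,.0f} tCO2e")
```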
The gap between the Huggingface emissions figure and the calculated value is below 2%, which remains very reasonable.
For the embodied impact, we assume a 5-year equipment lifetime and allocate GPU manufacturing in proportion to usage time (see the sketch after the next paragraph).
We observe that embodied impact is considerably lower than operational impact.
To the GPU impact we add the server's operational and embodied impact. With 8 GPUs per server, 1/8 of the non-GPU components is attributed to each GPUh, as sketched below.
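Continuing the sketch above with the 5-year lifetime and the DGX H100 figures from the hardware table; whether the PUE also applies to the non-GPU power draw is an assumption made here.

```python
# Embodied GPU impact and per-GPU server share for the ~25.6M GPUh computed above.
GPU_HOURS = 2.56e7                          # from the operational sketch above
LIFETIME_H = 5 * 24 * 365.25                # 43,830 h over 5 years
GPU_EMBODIED_KG = 250                       # per GPU, from the hardware table
SERVER_POWER_W, SERVER_EMBODIED_KG_H = 1540, 0.230
GPUS_PER_SERVER, PUE, EF = 8, 1.2, 0.420

gpu_embodied_t = GPU_HOURS * GPU_EMBODIED_KG / LIFETIME_H / 1000                     # ~146 tCO2e
server_op_t = GPU_HOURS * SERVER_POWER_W / GPUS_PER_SERVER / 1000 * PUE * EF / 1000  # ~2,500 tCO2e
server_embodied_t = GPU_HOURS * SERVER_EMBODIED_KG_H / GPUS_PER_SERVER / 1000        # ~740 tCO2e
print(gpu_embodied_t, server_op_t, server_embodied_t)
```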
Impact of generating 1 million tokens
In the cloud, when using an LLM in “completion” mode, KV caching means the prompt is encoded only once: each generated token then attends to the cached keys and values, so input tokens add only a linear cost instead of being re-processed for every output token.
If we consider an average prompt size of about 400 tokens, then the impact of a request is about 0.1 gCO2e.
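A minimal sketch combining the formulas above for inference on the same DGX H100 assumptions; the 250 generated tokens per request is a hypothetical value chosen for illustration, and the server and embodied shares are included as in the training example.

```python
# Inference sketch for Llama 3.1 405B on a DGX H100 under the stated assumptions.
# The 250 generated tokens per request below is a hypothetical value for illustration.
N_PARAMS = 405e9
PEAK_FLOPS_S, MFU = 989e12, 0.40
GPU_POWER_W, SERVER_POWER_W, GPUS_PER_SERVER = 700, 1540, 8
PUE, EF = 1.2, 0.420                                  # -, kgCO2e/kWh (= gCO2e/Wh)
LIFETIME_H = 5 * 24 * 365.25
EMBODIED_KG_PER_GPUH = 250 / LIFETIME_H + 0.230 / GPUS_PER_SERVER

def request_impact(prompt_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (energy in Wh, emissions in gCO2e) for one request, KV cache assumed."""
    flops = N_PARAMS * prompt_tokens + 2 * N_PARAMS * output_tokens
    gpu_h = flops / (PEAK_FLOPS_S * 3600 * MFU)
    energy_wh = gpu_h * (GPU_POWER_W + SERVER_POWER_W / GPUS_PER_SERVER) * PUE
    emissions_g = energy_wh * EF + gpu_h * EMBODIED_KG_PER_GPUH * 1000
    return energy_wh, emissions_g

print(request_impact(0, 1_000_000))   # ~610 Wh, ~275 gCO2e for 1M generated tokens
print(request_impact(400, 250))       # ~0.27 Wh, ~0.12 gCO2e for an average request
```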
Simulator
Comparison
This section provides a comparison of available methodologies for evaluating the environmental impacts of generative AI models. It highlights their perimeters, strengths, and limitations, to position the D4B methodology relative to existing approaches.
| Characteristic | Full LCA (Google, 2025) | Ecologits | D4B methodology |
|---|---|---|---|
| Approach type | Full-stack measurement: CPU/DRAM, idle machines, datacenter overhead, water, partial hardware LCA | Bottom-up assessment applied to inference only (usage + manufacturing) | FLOPs → GPUh → impacts modeling |
| Perimeter | Manufacturing (partial), usage (all server components), datacenter infrastructure, water, Scope 2/3 emissions | Infra usage + manufacturing, inference only | Training, fine-tuning, inference usage + GPU and server manufacturing |
| Granularity & measurement | Very fine: real measurements on Gemini production, energy, water, emissions | Medium-high, open data multi-criteria (GWP, PE, ADPe) aggregated per API call | Moderate: depends on available data (FLOPs, TDP, ...) |
| Accessibility | Low: internal Google data not detailed | High: open-source code, open API | High: publicly documented methods and assumptions |
| Reproducibility | Low: proprietary instrumentation and internal data | High: public tool, transparent and reproducible calculations | Medium to high: if input data can be estimated |
| Transparency | Medium: method published but data access limited | High: open-source code, assumptions, and model | High: all formulas and sources are explained |
| Accuracy (inference) | Very high: real measured deployment, includes full energy spectrum | Medium: relies on simplified models and generalized assumptions | Medium to high depending on parameter accuracy |
| Applicability | Limited: specific to Google infrastructure and inference | Medium: inference across various providers, but no training | Very broad: training, fine-tuning, inference based on public data |
| Targeted uses | Internal analysis, detailed reporting, communication | Public assessment, awareness, multi-provider comparison | Research, internal assessment, FinOps, Green AI |
| Quantified results (average prompt, around 400 tokens) | ~0.03 gCO2e, ~0.24 Wh (Gemini) | ~40 gCO2e, ~95 Wh (Llama 3.1 405B) | ~0.12 gCO2e, ~0.27 Wh (Llama 3.1 405B, see Application) |
| Key limitations | Proprietary data, does not cover training, focuses on inference, bias on “median prompt” | Limited perimeter (inference only), possible overestimation due to extrapolation | Highly dependent on assumptions (MFU, lifetime) |
These results show that each approach has a specific positioning: Google prioritizes accuracy but remains closed and non-reproducible, Ecologits focuses on transparency and simplicity but at the cost of possible overestimation, while the D4B methodology offers a reproducible and adaptable compromise for different usage contexts but depends on the precision of input data.