
Methodology applied to generative AI

Summary

This methodological note proposes a calculation framework to assess the environmental footprint of generative AI models by integrating training, fine-tuning, and inference. The approach is based on estimating the compute load (FLOPs) required by each usage, converting it into GPU usage time, then into energy consumption and greenhouse gas (GHG) emissions. It also includes the share of impact linked to equipment manufacturing and life cycle. This approach aims to provide a reproducible, transparent method adapted to different models and usage contexts, consistent with Green AI research recommendations.

Principle

The methodology is based on a simple philosophy: directly link real uses of an AI model (training, fine-tuning, inference) to the hardware footprint necessary to perform them.

Rather than starting from global electricity consumption measurements at the data center level, which are often inaccessible or proprietary (Google, 2025), it first evaluates the amount of computation required by the model according to:

  • its own characteristics (size, number of parameters, proportion of activated parameters, architecture),
  • the volume of tokens consumed or generated (text, images, etc.).

This compute load is expressed in FLOPs, then converted to effective hardware usage time (GPUh) while accounting for real efficiency (Model FLOP Utilization, MFU).

The next step translates this usage time into energy consumption and GHG emissions based on the physical characteristics of GPUs/servers and operating conditions (PUE, electricity emission factor).

Finally, a share of the impact related to manufacturing and the equipment life cycle is added proportionally to usage time, following a life-cycle assessment (LCA) logic (ISO 14040 and 14044).

Why use FLOPs as a metric?

According to the Green AI study, FLOPs are a relevant metric to measure the impact of generative AI because they express the compute load actually performed, directly correlated with energy consumption, and provide a hardware-agnostic basis to compare different models fairly.

Impact assessment

What is a token?

A token is the discrete unit manipulated by the model to represent an input or an output. Depending on the modality, it can be a word fragment, a spatial position, or a coded temporal unit.

The table below provides a quick reference for each modality and a simple way to estimate the variables used in the formulas.

| Modality | What a token is | Formula (tokens / activations) | Example / estimation |
|---|---|---|---|
| Text | Word fragment (often ~3–4 characters on average) | T_\text{text} = \text{number of words} \times \text{tokens per word} | 100 words → T_\text{text} \approx 130–160 tokens depending on the tokenizer |
| Image | Spatial / latent token (patch) | T_\text{image} = (\text{width}/\text{patch}) \times (\text{height}/\text{patch}) | 512×512 image, 16×16 patches → 512/16 = 32 tokens per axis → 32×32 = 1,024 tokens |
| Audio | Temporal token produced by a codec (e.g., EnCodec) | T_\text{audio} = \text{duration (s)} \times \text{sample rate} \div \text{downscale} \times \text{latent channels} | 10 s clip, 24 kHz sample rate, downscale 320, 8 channels → T_\text{audio} \approx 6{,}000 tokens |
| Video | Spatial tokens per frame × number of frames | T_\text{frame} = (\text{width}/\text{patch}) \times (\text{height}/\text{patch}), T_\text{video} = F \times T_\text{frame} | 4 s at 24 fps → F = 96 frames; 512×512 frame, 16×16 patches → T_\text{frame} = 32 \times 32 = 1{,}024 tokens per frame and T_\text{video} = 96 \times 1{,}024 = 98{,}304 tokens |

Explanation of technical terms

  • Patch: dividing an image or frame into square blocks processed as tokens by the model.
  • Downscale / downsampling: reducing spatial (images/video) or temporal (audio) resolution to move into a smaller latent space used for activations. Example: downscale 8 → width and height divided by 8.
  • Latent channels: number of dimensions in the latent space (feature depth) for image, video, or audio.
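
To make these estimations concrete, here is a minimal Python sketch of the formulas above. The function names, default values (1.4 tokens per word, 16×16 patches, EnCodec-like audio defaults) are illustrative assumptions, not part of the methodology.

```python
# Illustrative token estimation per modality (names and defaults are assumptions).

def text_tokens(num_words: int, tokens_per_word: float = 1.4) -> int:
    """T_text = number of words x tokens per word (tokenizer-dependent)."""
    return round(num_words * tokens_per_word)

def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """T_image = (width / patch) x (height / patch)."""
    return (width // patch) * (height // patch)

def audio_tokens(duration_s: float, sample_rate: int = 24_000,
                 downscale: int = 320, latent_channels: int = 8) -> int:
    """T_audio = duration x sample rate / downscale x latent channels."""
    return round(duration_s * sample_rate / downscale * latent_channels)

def video_tokens(width: int, height: int, fps: int, duration_s: float,
                 patch: int = 16) -> int:
    """T_video = number of frames x spatial tokens per frame."""
    frames = round(fps * duration_s)
    return frames * image_tokens(width, height, patch)

# Examples from the table above:
assert image_tokens(512, 512) == 1_024           # 32 x 32 patches
assert audio_tokens(10, 24_000, 320, 8) == 6_000
assert video_tokens(512, 512, 24, 4) == 98_304   # 96 frames x 1,024 tokens
```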

Estimating compute load

| Use case | Calculation formula | Variables | Explanation |
|---|---|---|---|
| Training | FLOP \approx 6 \times P_\text{total} \times T_\text{training} | P_\text{total}: total number of model parameters; T_\text{training}: number of tokens processed during training (tokens × batch × steps) | For each token and parameter, 6 FLOPs are needed: 2 FLOPs for the forward pass and 4 for gradient computation and propagation (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Fine-tuning | FLOP \approx (2 \times P_\text{total} + 4 \times P_\text{tunable}) \times T_\text{training} | P_\text{total}: total number of model parameters; P_\text{tunable}: number of trainable parameters (depends on optimization: LoRA, …); T_\text{training}: number of tokens processed during training (tokens × batch × steps) | Same as full training, but the number of updated parameters is lower (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Prompt processing (text) | FLOP \approx 1 \times P_\text{active} \times T_\text{input} | P_\text{active}: number of active parameters; T_\text{input}: number of prompt tokens | With KV cache enabled, the prompt is encoded once: cost is reduced to ≈ 1 FLOP per parameter/token (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Prompt processing (image) | FLOP \approx 1 \times P_\text{active} \times N_\text{activation} | P_\text{active}: number of active parameters; N_\text{activation}: number of image activations = width × height × channels | Each prompt image is encoded once by the model. N_\text{activation} corresponds to the number of latent tokens or encoded pixels. |
| Prompt processing (audio) | FLOP \approx 1 \times P_\text{active} \times N_\text{audio} | P_\text{active}: number of active parameters; N_\text{audio}: number of audio tokens = duration × sample rate ÷ downscale × latent channels | Each prompt audio clip is encoded once by the model. N_\text{audio} corresponds to the latent tokens used to represent the audio signal. |
| Text generation | FLOP \approx 2 \times P_\text{active} \times T_\text{output} | P_\text{active}: number of active parameters; T_\text{output}: number of generated tokens | For each token and parameter, 2 FLOPs are needed for the forward pass. The number of active parameters during inference depends on the model architecture (especially for MoE). (Source: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic) |
| Image generation | FLOP \approx 2 \times P_\text{active} \times N_\text{activation} | N_\text{activation}: number of activations = width × height × number of channels | For each activation and parameter, 2 FLOPs are needed for the forward pass (Source: Clockwork Diffusion, Transformers Inference Arithmetic) |
| Video generation (frame by frame) | FLOP \approx S \times \big( 2 \times P_\text{active} \times N_\text{activation} \times F \big) | N_\text{activation}: number of activations = width × height × number of channels; F: number of frames to generate; S: number of denoising steps | Generation processes each frame independently (Source: Clockwork Diffusion, Transformers Inference Arithmetic) |
| Video generation (spatio-temporal) | FLOP \approx S \times \big( 2 \times P_\text{active} \times N_\text{activation} \times F + 2 \times (F \times T)^2 \times d \big) | N_\text{activation}: number of activations = width × height × number of channels; F: number of frames to generate; S: number of denoising steps; T: number of spatial tokens = width × height; D: latent dimension = number of channels; d: hidden dimension | The first term corresponds to the linear cost of frame generation. The second models the dominant quadratic cost of spatio-temporal self-attention across all video tokens. (Source: Video Killed the Energy Budget) |
| Audio generation (temporal) | FLOP \approx S \times \big( 2 \times P_\text{active} \times N_\text{audio} + 2 \times T^2 \times d \big) | N_\text{audio}: number of latent audio activations per step; T: number of temporal audio tokens; d: hidden dimension; S: number of denoising steps | Audio is generated by diffusion over a 1D sequence. Cost is linear for latent processing and quadratic for temporal self-attention, at each denoising step. (Sources: AudioLM, MusicLM, Stable Audio Open) |
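
These formulas translate almost line for line into code. The sketch below is illustrative only (function and argument names are ours, not part of the methodology) and covers the most common cases from the table above.

```python
# Illustrative FLOP estimators for the main use cases (names are assumptions).

def training_flops(p_total: float, t_training: float) -> float:
    """FLOP ≈ 6 x P_total x T_training (2 FLOPs forward + 4 backward per token/param)."""
    return 6 * p_total * t_training

def finetuning_flops(p_total: float, p_tunable: float, t_training: float) -> float:
    """FLOP ≈ (2 x P_total + 4 x P_tunable) x T_training."""
    return (2 * p_total + 4 * p_tunable) * t_training

def prompt_flops(p_active: float, t_input: float) -> float:
    """FLOP ≈ 1 x P_active x T_input (prompt encoded once, KV cache enabled)."""
    return p_active * t_input

def text_generation_flops(p_active: float, t_output: float) -> float:
    """FLOP ≈ 2 x P_active x T_output (one forward pass per generated token)."""
    return 2 * p_active * t_output

def video_generation_flops(p_active: float, n_activation: float, frames: int,
                           spatial_tokens: float, hidden_dim: int, steps: int) -> float:
    """Spatio-temporal variant: S x (2 x P_active x N_activation x F + 2 x (F x T)^2 x d)."""
    return steps * (2 * p_active * n_activation * frames
                    + 2 * (frames * spatial_tokens) ** 2 * hidden_dim)

# Example: Llama 3.1 405B full training on 15e12 tokens
print(f"{training_flops(405e9, 15e12):.3e} FLOP")  # ≈ 3.645e25 FLOP
```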

Conversion to GPU usage

If the FLOP processing capacity of a GPU is known, it is then trivial to calculate the theoretical usage duration to satisfy one of the above use cases:

D_{gpu} = \frac{FLOP}{C_{gpu}\times MFU}

With D_{gpu} the GPU usage duration in hours, and C_{gpu} the theoretical computing capacity of the GPU in FLOP/h.

The computing capacity actually usable on a GPU, taking into account model typology, GPU/TPU type, heavy parallelism, network exchanges, etc., typically represents only 25 to 50% of the theoretical capacity (see NVIDIA Benchmarks).

This utilization rate is called the MFU (Model FLOP Utilization).
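
A minimal sketch of this conversion is given below; C_{gpu} is passed in FLOP/s and converted to FLOP/h internally, and the example values (H100 at 989 TFLOP/s, MFU of 40%) are the assumptions used later in the Application section.

```python
# FLOPs -> GPU-hours, accounting for the MFU (example values are assumptions).

def gpu_hours(flop: float, c_gpu_flop_per_s: float, mfu: float = 0.40) -> float:
    """D_gpu = FLOP / (C_gpu x MFU), with C_gpu given in FLOP/s (x 3600 -> FLOP/h)."""
    return flop / (c_gpu_flop_per_s * 3600 * mfu)

# Example: H100 at 989 TFLOP/s, MFU of 40%
print(f"{gpu_hours(3.65e25, 989e12, 0.40):.2e} GPU.h")  # ≈ 2.56e7 GPU.h
```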

Conversion to energy consumption

Assuming the GPU runs at its maximum power while in use, its energy consumption is simply:

E_{gpu} = D_{gpu} \times P_{gpu}

With P_{gpu} the GPU power in watts.

In a data center context, it is relevant to multiply this figure by the facility's PUE (Power Usage Effectiveness) to account for its energy overhead.

Environmental impact of energy consumption

To obtain the environmental impact (e.g., GHG emissions) of energy, simply apply electricity emission factors such as those available in the D4B Open Data reference:

I_{operational} = E_{gpu} \times F_{energy}
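
Combining the last two steps, here is a minimal sketch; the 700 W GPU power, PUE of 1.2 and 0.420 kgCO2e/kWh emission factor in the example are the illustrative assumptions used later in the Application section.

```python
# GPU-hours -> energy (kWh) -> operational GHG impact (example values are assumptions).

def gpu_energy_kwh(d_gpu_h: float, p_gpu_w: float, pue: float = 1.2) -> float:
    """E_gpu = D_gpu x P_gpu, multiplied by the data-center PUE."""
    return d_gpu_h * (p_gpu_w / 1000) * pue

def operational_impact_kgco2e(energy_kwh: float, f_energy: float) -> float:
    """I_operational = E_gpu x F_energy, with F_energy in kgCO2e/kWh."""
    return energy_kwh * f_energy

# Example: 25.6e6 GPU.h on 700 W GPUs, PUE 1.2, emission factor 0.420 kgCO2e/kWh
e = gpu_energy_kwh(25.6e6, 700, 1.2)                          # ≈ 21.5e6 kWh
print(f"{operational_impact_kgco2e(e, 0.420):.2e} kgCO2e")    # ≈ 9.03e6 kgCO2e ≈ 9,030 tCO2e
```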

Environmental impact of GPU manufacturing

The impact linked to GPU manufacturing is calculated proportionally to usage duration relative to the estimated GPU lifetime:

I_{embodied} = I_{manufacturing} \times \frac{D_{usage}}{D_{lifespan}}

Accounting for server impacts

The impact of other components (CPU, RAM, storage, chassis) is also taken into account. Because durations are expressed in GPUh, the impact of these components is allocated in proportion to the number of GPUs per server. For example, in an 8-GPU server, one eighth of the operational and embodied impacts of non-GPU components is attributed to each calculated GPUh.

I_{total} = I_{gpu} + \frac{I_{server}}{N_{gpu/server}}
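
As an illustration, this allocation can be sketched as follows; the 250 kgCO2e per GPU and ~10,058 kgCO2e of non-GPU embodied impact are the DGX H100 figures assumed in the Application section below, and the 1,000 GPU.h of usage is arbitrary.

```python
# Embodied impact prorated by usage, plus per-GPU allocation of the server share.

def embodied_impact(i_manufacturing: float, d_usage_h: float, d_lifespan_h: float) -> float:
    """I_embodied = I_manufacturing x (D_usage / D_lifespan)."""
    return i_manufacturing * d_usage_h / d_lifespan_h

def total_impact(i_gpu: float, i_server_non_gpu: float, n_gpu_per_server: int = 8) -> float:
    """I_total = I_gpu + I_server / N_gpu_per_server."""
    return i_gpu + i_server_non_gpu / n_gpu_per_server

# Example: 1,000 GPU.h on an H100 (250 kgCO2e embodied) in a server whose
# non-GPU components total ~10,058 kgCO2e, over a 5-year lifetime.
lifespan_h = 5 * 24 * 365.25
i_gpu = embodied_impact(250, 1_000, lifespan_h)
i_server = embodied_impact(10_058, 1_000, lifespan_h)
print(f"{total_impact(i_gpu, i_server):.1f} kgCO2e")  # ≈ 34.4 kgCO2e
```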

Assumptions & limits

Assumptions

  • During inference, a KV cache is assumed to be always enabled (Transformer Inference Arithmetic).
  • Electricity emission factors come from the D4B Open Data reference.

Limitations

  • Uncertainties in input data: actual training data, model characteristics often confidential, MFU, etc.
  • No accounting for whether models fit in memory on selected hardware.
  • No handling of TPU, FPGA, ASIC specificities.
  • No reliable LCA data available for the equipment.

Perspectives

  • Include public metrics such as tokens/s in addition to FLOPs.
  • Account for precision (FP32, FP16, ...).
  • Integrate overhead to account for parallelism impacts (network, replication, queuing, ...).
  • Integrate GPU memory as a bottleneck.
  • Integrate amortization of training across inference.
  • Adapt MFU according to server characteristics (number of GPUs per server, ...).
  • Adapt the methodology to multimodal models (text, image, video).
  • Integrate multi-criteria impact factors (primary energy, water, rare metals).
  • Integrate training of development versions attributable to the current model version.

Application

This section applies the methodology to the open-source LLM Llama 3.1 (405B parameters), using publicly available data.

Hardware assumptions

The NVIDIA DGX H100 is taken as a typical configuration on which the workloads are assumed to run.

| Component | Characteristics | Power | Life-cycle impact (approximate) |
|---|---|---|---|
| CPU | 2 × Intel Xeon Platinum 8480C processors (112 cores total) | 2 × 350 = 700 W | 2 × 25 = 50 kgCO2e |
| RAM | 2 TB | 2 × 1024 × 0.392 = 803 W | 2 × 1024 × 533 / 384 = 2,843 kgCO2e |
| Storage | 30 TB SSD | 30 × 1024 × 0.0012 = 37 W | 30 × 1024 × 0.16 = 4,915 kgCO2e |
| GPU | 8 × H100 80 GB (989 TFLOP/s per GPU) | 8 × 700 W | 8 × 250 kgCO2e |
| Chassis | – | – | 2,250 kgCO2e |
| Total (excluding GPU) | | 1,540 W | 10,058 kgCO2e |
| Total (excluding GPU), per hour of lifetime | | 1,540 W | 10,058 / (5 × 24 × 365.25) = 0.230 kgCO2e/h |

Training impact

Llama 3.1 (405B parameters) was trained on approximately 15 trillion (15e12) tokens. According to Hugging Face, it was trained on 24,576 H100 GPUs:

| Model | Training Time (GPU hours) | Power Consumption (W) | Emissions (tCO2eq) |
|---|---|---|---|
| Llama 3.1 8B | 1.46M | 700 | 420 |
| Llama 3.1 70B | 7.0M | 700 | 2,040 |
| Llama 3.1 405B | 30.84M | 700 | 8,930 |

Applying the formulas above, assuming an MFU of 40% for training (to be refined against NVIDIA benchmarks; it could be closer to 35%), a PUE of 1.2, and a GHG emission factor of 0.420 kgCO2e/kWh:

\begin{aligned} &FLOP_{training} = 6 \times P_{total} \times T_{training} = 6 \times 405e9 \times 15e12 = 3.65e25\ FLOP \\ &D_{training} = \frac{FLOP_{training}}{C_{gpu} \times MFU} = \frac{FLOP_{training}}{989e12 \times 3600 \times 0.40} = 25.6e6\ GPU.h \\ &E_{training} = 0.700 \times D_{training} \times PUE = 21.5e6\ kWh \\ &I^{gpu}_{training_{ope}} = 0.420 \times E_{training} = 9{,}030\ tCO2e \end{aligned}
Note: the gap between the Hugging Face figures and this calculation is below 2%, which remains very reasonable.

For embodied impact, we assume a 5-year equipment lifetime:

I^{gpu}_{training_{emb}} = \frac{D_{training}}{D_{lifespan}} \times I^{gpu}_{emb} = \frac{25.6e6}{5 \times 24 \times 365.25} \times 250 = 146\ tCO2e

Note: the embodied impact is considerably lower than the operational impact.

To the GPU impact we add the server's operational and embodied impacts. There are 8 GPUs per server, so 1/8 of the non-GPU components is attributed to each GPU.

\begin{aligned} &I_{training_{ope}} = I^{gpu}_{training_{ope}} + \frac{I^{server}_{training_{ope}}}{8} = 9{,}030 + \frac{25.6e6 \times 1.540 \times 0.420 \times 1.2}{8} = 11{,}513\ tCO2e \\ &I_{training_{emb}} = I^{gpu}_{training_{emb}} + \frac{I^{server}_{training_{emb}}}{8} = 146 + \frac{25.6e6 \times 0.000230}{8} = 899\ tCO2e \end{aligned}
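
For reference, the whole training estimate can be replayed end to end in a few lines of Python. The constants are those listed in this section; small gaps with the figures above come from intermediate rounding.

```python
# End-to-end replay of the Llama 3.1 405B training estimate (constants from this section).

P_TOTAL, T_TRAIN = 405e9, 15e12               # parameters, training tokens
C_GPU, MFU = 989e12 * 3600, 0.40              # FLOP/h per H100, Model FLOP Utilization
P_GPU_KW, PUE, F_ENERGY = 0.700, 1.2, 0.420   # GPU power (kW), PUE, kgCO2e/kWh
P_SRV_KW, I_SRV_EMB_KG = 1.540, 10_058        # non-GPU server power (kW) and embodied impact (kgCO2e)
I_GPU_EMB_KG, LIFESPAN_H = 250, 5 * 24 * 365.25

flop = 6 * P_TOTAL * T_TRAIN                   # ≈ 3.65e25 FLOP
d_gpu = flop / (C_GPU * MFU)                   # ≈ 25.6e6 GPU.h
e_gpu = d_gpu * P_GPU_KW * PUE                 # ≈ 21.5e6 kWh
i_ope = (e_gpu + d_gpu * P_SRV_KW * PUE / 8) * F_ENERGY / 1000         # operational, tCO2e
i_emb = d_gpu / LIFESPAN_H * (I_GPU_EMB_KG + I_SRV_EMB_KG / 8) / 1000  # embodied, tCO2e
print(f"operational ≈ {i_ope:,.0f} tCO2e, embodied ≈ {i_emb:,.0f} tCO2e")
```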

Impact of generating 1 million tokens

In the cloud, when using an LLM in “completion” mode with KV caching, the prompt is encoded once and attention is computed only over newly generated tokens, so the generation cost scales linearly with the number of output tokens.

\begin{aligned} &I^{gpu}_{output_{ope}} = \frac{2 \times 405e9 \times 1e6}{989e12 \times 3600 \times 0.40} \times \left(0.700 + \frac{1.540}{8}\right) \times 1.2 \times 0.420 = 256\ gCO2e \\ &I^{gpu}_{output_{emb}} = \frac{2 \times 405e9 \times 1e6}{989e12 \times 3600 \times 0.40} \times \frac{250 + \frac{10058}{8}}{5 \times 24 \times 365.25} = 20\ gCO2e \end{aligned}

If we consider an average prompt size of about 400 tokens, then the impact of a request is about 0.1 gCO2e.
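
The same calculation in code form, with the constants of this section; the printed result aggregates GPU and server shares, operational and embodied.

```python
# Impact of generating 1 million output tokens with Llama 3.1 405B (constants from this section).

P_ACTIVE, T_OUTPUT = 405e9, 1e6
C_GPU, MFU = 989e12 * 3600, 0.40                           # FLOP/h per H100, MFU
P_GPU_KW, P_SRV_KW, PUE, F_ENERGY = 0.700, 1.540, 1.2, 0.420
I_EMB_KG, LIFESPAN_H = 250 + 10_058 / 8, 5 * 24 * 365.25   # GPU + server share, lifetime hours

d_gpu = 2 * P_ACTIVE * T_OUTPUT / (C_GPU * MFU)             # ≈ 0.57 GPU.h
i_ope = d_gpu * (P_GPU_KW + P_SRV_KW / 8) * PUE * F_ENERGY  # ≈ 0.26 kgCO2e
i_emb = d_gpu * I_EMB_KG / LIFESPAN_H                       # ≈ 0.02 kgCO2e
print(f"≈ {(i_ope + i_emb) * 1000:.0f} gCO2e per million output tokens")  # ≈ 276 gCO2e
```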

Simulator

An interactive simulator accompanies this page. From the chosen parameters, it reports the compute load, latency, processing time, throughput, energy, and GHG emissions, broken down by component (GPU, CPU, RAM, storage, chassis) and split between operational and embodied shares.

Comparison

This section provides a comparison of available methodologies for evaluating the environmental impacts of generative AI models. It highlights their perimeters, strengths, and limitations, to position the D4B methodology relative to existing approaches.

| Characteristic | Full LCA (Google, 2025) | Ecologits | D4B methodology |
|---|---|---|---|
| Approach type | Full-stack measurement: CPU/DRAM, idle machines, datacenter overhead, water, partial hardware LCA | Bottom-up assessment applied to inference only (usage + manufacturing) | FLOPs → GPUh → impacts modeling |
| Perimeter | Manufacturing (partial), usage (all server components), datacenter infrastructure, water, Scope 2/3 emissions | Infra usage + manufacturing, inference only | Training, fine-tuning, inference usage + GPU and server manufacturing |
| Granularity & measurement | Very fine: real measurements on Gemini production, energy, water, emissions | Medium-high: open data, multi-criteria (GWP, PE, ADPe) aggregated per API call | Moderate: depends on available data (FLOPs, TDP, ...) |
| Accessibility | Low: internal Google data not detailed | High: open-source code, open API | High: publicly documented methods and assumptions |
| Reproducibility | Low: proprietary instrumentation and internal data | High: public tool, transparent and reproducible calculations | Medium to high: if input data can be estimated |
| Transparency | Medium: method published but data access limited | High: open-source code, assumptions, and model | High: all formulas and sources are explained |
| Accuracy (inference) | Very high: real measured deployment, includes full energy spectrum | Medium: relies on simplified models and generalized assumptions | Medium to high, depending on parameter accuracy |
| Applicability | Limited: specific to Google infrastructure and inference | Medium: inference across various providers, but no training | Very broad: training, fine-tuning, inference based on public data |
| Targeted uses | Internal analysis, detailed reporting, communication | Public assessment, awareness, multi-provider comparison | Research, internal assessment, FinOps, Green AI |
| Quantified results (average prompt, around 400 tokens) | ~0.03 gCO2e, ~0.24 Wh (Gemini) | ~40 gCO2e, ~95 Wh (Llama 3.1 405B) | ~0.12 gCO2e, ~0.27 Wh (Llama 3.1 405B, see Application) |
| Key limitations | Proprietary data, does not cover training, focuses on inference, bias on “median prompt” | Limited perimeter (inference only), possible overestimation due to extrapolation | Highly dependent on assumptions (MFU, lifetime) |

These results show that each approach has a specific positioning: Google prioritizes accuracy but remains closed and non-reproducible; Ecologits focuses on transparency and simplicity, at the cost of possible overestimation; the D4B methodology offers a reproducible compromise adaptable to different usage contexts, but depends on the precision of its input data.