
Methodology applied to generative AI

Summary

This methodological note proposes a calculation framework to assess the environmental footprint of generative AI models by integrating training, fine-tuning, and inference. The approach is based on estimating the compute load (FLOPs) required by each usage, converting it into GPU usage time, then into energy consumption and greenhouse gas (GHG) emissions. It also includes the share of impact linked to equipment manufacturing and life cycle. This approach aims to provide a reproducible, transparent method adapted to different models and usage contexts, consistent with Green AI research recommendations.

Principle

The methodology is based on a simple philosophy: directly link real uses of an AI model (training, fine-tuning, inference) to the hardware footprint necessary to perform them.

Rather than starting from global electricity consumption measurements at the data center level, which are often inaccessible or proprietary (Google, 2025), it first evaluates the amount of computation required by the model according to:

  • its own characteristics (size, number of parameters, proportion of activated parameters, architecture),
  • the volume of tokens consumed or generated (text, images, etc.).

This compute load is expressed in FLOPs, then converted to effective hardware usage time (GPUh) while accounting for real efficiency (Model FLOP Utilization, MFU).

The next step translates this usage time into energy consumption and GHG emissions based on the physical characteristics of GPUs/servers and operating conditions (PUE, electricity emission factor).

Finally, a share of the impact related to manufacturing and the equipment life cycle is added proportionally to usage time, following a life-cycle assessment (LCA) logic (ISO 14040 and 14044).

Why use FLOPs as a metric?

According to the Green AI study, FLOPs are a relevant metric to measure the impact of generative AI because they express the compute load actually performed, directly correlated with energy consumption, and provide a hardware-agnostic basis to compare different models fairly.

Impact assessment

What is a token?

A token is the discrete unit manipulated by the model to represent an input or an output. Depending on the modality, it can be a word fragment, a spatial position, or a coded temporal unit.

For each modality, the reference below describes what a token is and gives a simple way to estimate the variables used in the formulas.

Text
  • What a token is: a word fragment (often ~3–4 characters on average).
  • Formula: T_\text{text} = \text{number of words} \times \text{tokens per word}
  • Example: 100 words → T_\text{text} \approx 130–160 tokens depending on the tokenizer.

Image
  • What a token is: a spatial / latent token (patch).
  • Formula: T_\text{image} = (\text{width}/\text{patch}) \times (\text{height}/\text{patch})
  • Example: a 512×512 image with 16×16 patches → 512/16 = 32 tokens per axis → 32×32 = 1,024 tokens.

Audio
  • What a token is: a temporal token produced by a codec (e.g., EnCodec).
  • Formula: T_\text{audio} = \text{duration (s)} \times \text{sample rate} \div \text{downscale} \times \text{latent channels}
  • Example: a 10 s clip at 24 kHz, downscale 320, 8 latent channels → T_\text{audio} \approx 6{,}000 tokens.

Video
  • What a token is: a spatial token per frame, multiplied by the number of frames.
  • Formula: T_\text{frame} = (\text{width}/\text{patch}) \times (\text{height}/\text{patch}), then T_\text{video} = F \times T_\text{frame}
  • Example: 4 s at 24 fps → F = 96 frames; each 512×512 frame with 16×16 patches → T_\text{frame} = 32 \times 32 = 1{,}024 tokens, so T_\text{video} = 96 \times 1{,}024 = 98{,}304 tokens.

Explanation of technical terms

  • Patch: dividing an image or frame into square blocks processed as tokens by the model.
  • Downscale / downsampling: reducing spatial (images/video) or temporal (audio) resolution to move into a smaller latent space used for activations. Example: downscale 8 → width and height divided by 8.
  • Latent channels: number of dimensions in the latent space (feature depth) for image, video, or audio.
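As a quick numerical check, the per-modality estimates above can be sketched in Python. The default values (1.3 tokens per word, 16×16 patches, a 24 kHz codec with downscale 320 and 8 latent channels) are the illustrative figures from the examples, not universal constants:

```python
def text_tokens(words: int, tokens_per_word: float = 1.3) -> int:
    """Text: roughly 1.3-1.6 tokens per word depending on the tokenizer."""
    return round(words * tokens_per_word)

def image_tokens(width: int, height: int, patch: int = 16) -> int:
    """Image: one token per patch, (width/patch) x (height/patch)."""
    return (width // patch) * (height // patch)

def audio_tokens(duration_s: float, sample_rate: int = 24_000,
                 downscale: int = 320, latent_channels: int = 8) -> int:
    """Audio: codec tokens = duration x sample rate / downscale x channels."""
    return round(duration_s * sample_rate / downscale * latent_channels)

def video_tokens(frames: int, width: int, height: int, patch: int = 16) -> int:
    """Video: spatial tokens per frame, times the number of frames."""
    return frames * image_tokens(width, height, patch)
```

With these defaults, the worked examples above are reproduced: a 512×512 image yields 1,024 tokens and a 96-frame video 98,304 tokens.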

Estimating compute load

Training
  • Formula: FLOP \approx 6 \times P_\text{total} \times T_\text{training}
  • Variables: P_{total}: total number of model parameters; T_{training}: number of tokens processed during training (tokens × batch × steps)
  • Explanation: for each token and parameter, 6 FLOPs are needed: 2 for the forward pass and 4 for gradient computation and propagation. (Sources: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic)

Fine-tuning
  • Formula: FLOP \approx (2 \times P_\text{total} + 4 \times P_\text{tunable}) \times T_\text{training}
  • Variables: P_{total}: total number of model parameters; P_{tunable}: number of trainable parameters (depends on the optimization: LoRA, …); T_{training}: number of tokens processed during training (tokens × batch × steps)
  • Explanation: same as full training, but the number of updated parameters is lower. (Sources: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic)

Prompt processing (text)
  • Formula: FLOP \approx 1 \times P_\text{active} \times T_\text{input}
  • Variables: P_{active}: number of active parameters; T_{input}: number of prompt tokens
  • Explanation: the prompt is encoded once by the model. During auto-regressive generation, intermediate states that have already been computed are reused to avoid recalculating the full context for each newly generated token. (Sources: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic)

Prompt processing (image)
  • Formula: FLOP \approx 1 \times P_\text{active} \times N_\text{activation}
  • Variables: P_{active}: number of active parameters; N_\text{activation}: number of image activations = width × height × channels
  • Explanation: each prompt image is encoded once by the model; N_\text{activation} corresponds to the number of latent tokens or encoded pixels.

Prompt processing (audio)
  • Formula: FLOP \approx 1 \times P_\text{active} \times N_\text{audio}
  • Variables: P_{active}: number of active parameters; N_\text{audio}: number of audio tokens = duration × sample rate ÷ downscale × latent channels
  • Explanation: each prompt audio clip is encoded once by the model; N_\text{audio} corresponds to the latent tokens used to represent the audio signal.

Text generation
  • Formula: FLOP \approx 2 \times P_\text{active} \times T_\text{output}
  • Variables: P_{active}: number of active parameters; T_{output}: number of generated tokens
  • Explanation: for each token and parameter, 2 FLOPs are needed for the forward pass. The number of active parameters during inference depends on the model architecture (especially for MoE). (Sources: Scaling Law, Transformers FLOPs, Transformers Inference Arithmetic)

Image generation
  • Formula: FLOP \approx 2 \times P_\text{active} \times N_\text{activation}
  • Variables: N_{activation}: number of activations = width × height × number of channels
  • Explanation: for each activation and parameter, 2 FLOPs are needed for the forward pass. (Sources: Clockwork Diffusion, Transformers Inference Arithmetic)

Video generation (frame by frame)
  • Formula: FLOP \approx S \times \big(2 \times P_\text{active} \times N_\text{activation} \times F\big)
  • Variables: N_{activation}: number of activations = width × height × number of channels; F: number of frames to generate; S: number of denoising steps
  • Explanation: generation processes each frame independently. (Sources: Clockwork Diffusion, Transformers Inference Arithmetic)

Video generation (spatio-temporal)
  • Formula: FLOP \approx S \times \big(2 \times P_\text{active} \times N_\text{activation} \times F + 2 \times (F \times T)^2 \times d\big)
  • Variables: N_{activation}: number of activations = width × height × number of channels; F: number of frames to generate; S: number of denoising steps; T: number of spatial tokens = width × height; D: latent dimension = number of channels; d: hidden dimension
  • Explanation: the first term corresponds to the linear cost of frame generation; the second models the dominant quadratic cost of spatio-temporal self-attention across all video tokens. (Source: Video Killed the Energy Budget)

Audio generation (temporal)
  • Formula: FLOP \approx S \times \big(2 \times P_\text{active} \times N_\text{audio} + 2 \times T^2 \times d\big)
  • Variables: N_{audio}: number of latent audio activations per step; T: number of temporal audio tokens; d: hidden dimension; S: number of denoising steps
  • Explanation: audio is generated by diffusion over a 1D sequence. Cost is linear for latent processing and quadratic for temporal self-attention, at each denoising step. (Sources: AudioLM, MusicLM, Stable Audio Open)
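The main formulas above translate directly into code. A minimal sketch for the text and training cases (parameter and token counts are the caller's assumptions; the constants 6, 2, and 4 come from the sources cited above):

```python
def training_flops(p_total: float, t_training: float) -> float:
    """Full training: ~6 FLOPs per parameter and token (2 forward + 4 backward)."""
    return 6 * p_total * t_training

def finetuning_flops(p_total: float, p_tunable: float, t_training: float) -> float:
    """Fine-tuning: full forward pass, backward pass only on tunable parameters."""
    return (2 * p_total + 4 * p_tunable) * t_training

def prompt_flops(p_active: float, t_input: float) -> float:
    """Prompt processing (text): the prompt is encoded once."""
    return 1 * p_active * t_input

def generation_flops(p_active: float, t_output: float) -> float:
    """Text generation: ~2 FLOPs per active parameter and generated token."""
    return 2 * p_active * t_output
```

For instance, `training_flops(405e9, 15e12)` gives roughly 3.65e25 FLOP, the figure used in the Application section.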

Conversion to GPU usage

If the FLOP processing capacity of a GPU is known, it is then trivial to calculate the theoretical usage duration to satisfy one of the above use cases:

D_{gpu} = \frac{FLOP}{C_{gpu} \times MFU}

With D_{gpu} the GPU usage duration in hours, and C_{gpu} the theoretical computing capacity of the GPU in FLOP/h.

In practice, the usable computing capacity of a GPU (accounting for model typology, GPU/TPU type, heavy parallelism, network exchanges, etc.) represents only 25 to 50% of the theoretical capacity (see NVIDIA Benchmarks).

This utilization rate is called the MFU (Model FLOP Utilization).
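A minimal sketch of this conversion, using the H100 peak throughput (989 TFLOP/s) and the 40% MFU from the Application section as illustrative defaults:

```python
def gpu_hours(flop: float, c_gpu_flops_per_s: float = 989e12,
              mfu: float = 0.40) -> float:
    """D_gpu = FLOP / (C_gpu x MFU); C_gpu converted from FLOP/s to FLOP/h."""
    return flop / (c_gpu_flops_per_s * 3600 * mfu)
```

Applied to the ~3.65e25 FLOP of the training example, this yields about 25.6 million GPU hours.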

Conversion to energy consumption

If we assume that the GPU draws its maximum power while in use, its energy consumption is simply:

E_{gpu} = D_{gpu} \times P_{gpu}

With P_{gpu} the GPU power in Watts.

In a data center context, it is relevant to multiply this figure by the PUE (Power Usage Effectiveness) to account for infrastructure overhead.
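This conversion can be sketched as follows; the 700 W GPU power and PUE of 1.2 are the illustrative defaults used in the Application section:

```python
def gpu_energy_kwh(d_gpu_hours: float, p_gpu_watts: float = 700,
                   pue: float = 1.2) -> float:
    """E_gpu = D_gpu x P_gpu, scaled by the data-center PUE; result in kWh."""
    return d_gpu_hours * p_gpu_watts / 1000 * pue
```

For 25.6 million GPU hours, this gives about 21.5 GWh, matching the training example.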

Environmental impact of energy consumption

To obtain the environmental impact (e.g., GHG emissions) of energy, simply apply electricity emission factors such as those available in the D4B Open Data reference:

I_{operational} = E_{gpu} \times F_{energy}
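For example, with the 0.420 kgCO2e/kWh emission factor used later in the Application section (an illustrative value; the actual factor depends on the electricity mix):

```python
def operational_impact_kg(e_kwh: float,
                          f_energy_kg_per_kwh: float = 0.420) -> float:
    """I_operational = E_gpu x F_energy, in kgCO2e."""
    return e_kwh * f_energy_kg_per_kwh
```

Applied to the 21.5 GWh training energy, this yields about 9,030 tCO2e.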

Environmental impact of GPU manufacturing

The impact linked to GPU manufacturing is calculated proportionally to usage duration relative to the estimated GPU lifetime:

I_{embodied} = I_{manufacturing} \times \frac{D_{usage}}{D_{lifespan}}
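A sketch of this pro rata allocation, assuming the 5-year equipment lifetime used in the Application section:

```python
def embodied_impact_kg(i_manufacturing_kg: float, d_usage_h: float,
                       lifespan_years: float = 5.0) -> float:
    """Allocate manufacturing impact pro rata of usage time over the lifetime."""
    d_lifespan_h = lifespan_years * 24 * 365.25
    return i_manufacturing_kg * d_usage_h / d_lifespan_h
```

With 250 kgCO2e per GPU and 25.6 million GPU hours, this gives about 146 tCO2e, as in the training example.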

Accounting for server impacts

The impact of other components (CPU, RAM, storage, chassis) is also taken into account. Because durations are expressed in GPUh, the impact of these components is allocated in proportion to the number of GPUs per server. For example, in an 8-GPU server, one eighth of the operational and embodied impacts of non-GPU components is attributed to each calculated GPUh.

I_{total} = I_{gpu} + \frac{I_{server}}{N_{gpu/server}}
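This allocation can be sketched as:

```python
def total_impact(i_gpu: float, i_server_non_gpu: float,
                 n_gpu_per_server: int = 8) -> float:
    """Add 1/N of the non-GPU server impact to each GPU's impact."""
    return i_gpu + i_server_non_gpu / n_gpu_per_server
```

The default of 8 GPUs per server matches the DGX H100 configuration used in the Application section.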

Taking caches into account

Two caching mechanisms can reduce the effective cost of inference.

The first is an inference cache, used within the same request. Once a prompt has been processed once, the intermediate states associated with previously seen tokens can be reused during generation of subsequent tokens. This mechanism explains why generation does not require recalculating the full context at each step.

The second is a prefix cache, used across multiple distinct requests that share the exact same beginning. In that case, part of the prompt can sometimes be reused from one request to another, which reduces the cost of processing input tokens.

The base methodology first computes a raw impact for prompt processing, without cross-request reuse.

When several requests reuse the same prefix, the reuse rate can be noted r_{cache}:

r_{cache} = \frac{T_{cached}}{T_{input}}

with:

  • T_{input}: the total number of input tokens
  • T_{cached}: the number of input tokens reused from the cache
  • r_{cache} \in [0;1]
  • r_{cache} = 0: no prompt reuse
  • r_{cache} = 1: prompt fully reused

Tokens reused from a cache are not, however, assumed to be impact-free. A residual coefficient \alpha is introduced to represent the impact of a token served from a cache relative to a recalculated token:

  • \alpha \in [0;1]
  • \alpha = 0: tokens served from a cache are assumed negligible compared with the avoided impact
  • \alpha = 1: a token served from a cache is assumed to have the same impact as a recalculated token

The effective impact of the prompt can then be approximated by:

I_{prompt,effective} = I_{prompt,raw} \times \big((1-r_{cache}) + \alpha \times r_{cache}\big)

This correction can be applied identically to the operational and embodied impact of the prompt. The cost of generating output tokens remains unchanged.
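The correction can be sketched as follows; the default α = 0.1 corresponds to the "prudent" mode retained in the assumptions below:

```python
def effective_prompt_impact(i_prompt_raw: float, r_cache: float,
                            alpha: float = 0.1) -> float:
    """I_effective = I_raw x ((1 - r_cache) + alpha x r_cache).

    r_cache: share of input tokens served from a prefix cache.
    alpha:   residual impact of a cached token relative to a
             recomputed one (a modeling assumption).
    """
    return i_prompt_raw * ((1 - r_cache) + alpha * r_cache)
```

With no reuse (r_cache = 0) the raw impact is unchanged; with full reuse and α = 0 the prompt impact vanishes.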

note

When operating data distinguishes input tokens that were effectively recalculated from input tokens reused from a cache, it can be used to estimate r_{cache} empirically. The coefficient \alpha remains a modeling assumption: it aims to represent the residual impact associated with the memory, fast storage, and services required to retain and serve cached states. This estimate therefore reflects a real usage and deployment context, rather than a general property of the model.

Assumptions & limits

Assumptions

  • During auto-regressive generation, an inference cache is generally used to reuse already computed intermediate states.
  • Cross-request reuse of a prompt prefix is not systematic. It depends on the deployment context and the effective stability of prompts.
  • In the absence of direct measurement, the reuse rate r_{cache} is a usage assumption.
  • As a first approximation, three usages can be retained: a simple mode with \alpha = 0, a prudent mode with \alpha = 0.1, and an exploratory mode as a range between 0 and 0.25.

Limitations

  • Uncertainties in input data: actual training data, model characteristics often confidential, MFU, etc.
  • No accounting for whether models fit in memory on selected hardware.
  • No handling of TPU, FPGA, ASIC specificities.
  • No reliable LCA on equipment.
  • The method does not model in detail the activation, retention, and eviction conditions of caches.
  • Tokens served from a cache are not impact-free: the method simply assumes that the avoided computation dominates the memory and service overhead associated with the cache.
  • The actual reuse rate of a prompt depends strongly on usage structure, prefix repetition, and the technical deployment context.
  • The value of \alpha remains uncertain in the absence of direct measurement of the memory and service overhead associated with the cache.
  • Billing or operating data that distinguishes recalculated tokens from reused tokens can serve as an operational proxy, but does not constitute a direct physical measurement of environmental impact.

Perspectives

  • Include public metrics such as tokens/s in addition to FLOPs.
  • Account for precision (FP32, FP16, ...).
  • Integrate overhead to account for parallelism impacts (network, replication, queuing, ...).
  • Integrate GPU memory as a bottleneck.
  • Integrate amortization of training across inference.
  • Adapt MFU according to server characteristics (number of GPUs per server, ...).
  • Adapt the methodology to multimodal models (text, image, video).
  • Integrate multi-criteria impact factors (primary energy, water, rare metals).
  • Integrate training of development versions attributable to the current model version.

Application

This section applies the methodology to the open-source LLM Llama 3.1 (405B parameters), using publicly available data.

Hardware assumptions

The NVIDIA DGX H100 is taken as a typical configuration on which the workloads are assumed to run.

CPU
  • Characteristics: 2 × Intel Xeon Platinum 8480C processors (112 cores total)
  • Power: 2 × 350 = 700 W
  • Life-cycle impact (approximate): 2 × 25 = 50 kgCO2e

RAM
  • Characteristics: 2 TB
  • Power: 2 × 1024 × 0.392 = 803 W
  • Life-cycle impact (approximate): 2 × 1024 × 533 / 384 = 2,843 kgCO2e

Storage
  • Characteristics: 30 TB SSD
  • Power: 30 × 1024 × 0.0012 = 37 W
  • Life-cycle impact (approximate): 30 × 1024 × 0.16 = 4,915 kgCO2e

GPU
  • Characteristics: 8 × H100 80 GB (989 TFLOP/s per GPU)
  • Power: 8 × 700 W
  • Life-cycle impact (approximate): 8 × 250 kgCO2e

Chassis
  • Life-cycle impact (approximate): 250 kgCO2e

Total (excluding GPU)
  • Power: 1,540 W
  • Life-cycle impact (approximate): 10,058 kgCO2e
  • Per hour of use: 10058 / (5 × 24 × 365.25) = 0.230 kgCO2e/h

Training impact

Llama 3.1 (405B parameters) was trained with approximately 15 trillion (15e12) tokens. According to Huggingface, it was trained on 24,576 H100 GPUs:

  • Llama 3.1 8B: 1.46M GPU hours, 700 W, 420 tCO2eq
  • Llama 3.1 70B: 7.0M GPU hours, 700 W, 2,040 tCO2eq
  • Llama 3.1 405B: 30.84M GPU hours, 700 W, 8,930 tCO2eq

According to the model formulas and assuming an MFU of 40% (to be refined based on NVIDIA benchmarks, it could be closer to 35%) for training, a PUE of 1.2 and a GHG emission factor of 0.420 kgCO2e / kWh:

\begin{aligned} &FLOP_{training} = 6 \times P_{total} \times T_{training} = 6 \times 405e9 \times 15e12 = 3.65e25\ \text{FLOP} \\ &D_{training} = \frac{FLOP_{training}}{C_{gpu} \times MFU} = \frac{FLOP_{training}}{989e12 \times 3600 \times 0.40} = 25.6e6\ \text{GPU·h} \\ &E_{training} = 0.700 \times D_{training} \times PUE = 21.5e6\ \text{kWh} \\ &I^{gpu}_{training_{ope}} = 0.420 \times E_{training} = 9{,}030\ \text{tCO2e} \end{aligned}
note

The gap between Huggingface data and the calculation is < 2%, which remains very reasonable.

For embodied impact, we assume a 5-year equipment lifetime:

I^{gpu}_{training_{emb}} = \frac{D_{training}}{D_{lifespan}} \times I_{manufacturing} = \frac{25.6e6}{5 \times 24 \times 365.25} \times 250 = 146\ \text{tCO2e}

note

We observe that embodied impact is considerably lower than operational impact.

To GPU impact we add server operational and embodied impact. There are 8 GPUs per server, so we add 1/8 of non-GPU components.

\begin{aligned} &I_{training_{ope}} = I^{gpu}_{training_{ope}} + \frac{I^{server}_{training_{ope}}}{8} = 9{,}030 + \frac{25.6e6 \times 1.540 \times 1.2 \times 0.420}{8 \times 1000} = 11{,}513\ \text{tCO2e} \\ &I_{training_{emb}} = I^{gpu}_{training_{emb}} + \frac{I^{server}_{training_{emb}}}{8} = 146 + \frac{25.6e6 \times 0.230}{8 \times 1000} = 882\ \text{tCO2e} \end{aligned}
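For verification, the whole training chain of this section can be replayed in a few lines; all constants are the assumptions stated above, and the results reproduce the figures of this section to within rounding:

```python
# Inputs from this section (all modeling assumptions, see text)
P_TOTAL = 405e9            # Llama 3.1 405B parameters
T_TRAIN = 15e12            # training tokens
C_GPU   = 989e12           # H100 peak capacity, FLOP/s
MFU     = 0.40             # model FLOP utilization
P_GPU   = 0.700            # GPU power, kW
PUE     = 1.2
EF      = 0.420            # kgCO2e / kWh
LIFE_H  = 5 * 24 * 365.25  # 5-year lifetime, in hours

flop  = 6 * P_TOTAL * T_TRAIN              # ~3.65e25 FLOP
d_gpu = flop / (C_GPU * 3600 * MFU)        # ~25.6e6 GPU hours
e_kwh = d_gpu * P_GPU * PUE                # ~21.5e6 kWh (GPU only)
i_ope_gpu_t = e_kwh * EF / 1000            # ~9,030 tCO2e
i_emb_gpu_t = d_gpu / LIFE_H * 250 / 1000  # ~146 tCO2e (250 kgCO2e per GPU)

# Server share: 1/8 of the non-GPU operational and embodied impacts
i_ope_t = i_ope_gpu_t + d_gpu * 1.540 * PUE * EF / 8 / 1000  # total operational
i_emb_t = i_emb_gpu_t + d_gpu * 0.230 / 8 / 1000             # total embodied
```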

Impact of generating 1 million tokens

In a completion-type use case, inference cost is split into two parts: initial prompt processing, then output token generation. During generation, intermediate states already computed for the context are reused, which avoids recalculating the full prompt for each new token. When multiple requests also share the same prefix, the cost of processing input tokens can be reduced further if that reuse is effectively exploited. The calculations below nevertheless correspond to a base case without an explicit correction by r_{cache}.

\begin{aligned} &I_{output_{ope}} = \frac{2 \times 405e9 \times 1e6}{989e12 \times 3600 \times 0.40} \times \left(0.700 + \frac{1.540}{8}\right) \times 1.2 \times 0.420 = 256\ \text{gCO2e} \\ &I_{output_{emb}} = \frac{2 \times 405e9 \times 1e6}{989e12 \times 3600 \times 0.40} \times \frac{250 + \frac{10058}{8}}{5 \times 24 \times 365.25} = 20\ \text{gCO2e} \end{aligned}

If we consider an average prompt size of about 400 tokens, then the impact of a request is about 0.1 gCO2e.
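The figures for 1 million generated tokens can be reproduced the same way (same illustrative constants as in the training example):

```python
P_ACTIVE = 405e9  # active parameters (dense model: all parameters)
T_OUT    = 1e6    # generated tokens

# GPU hours for generation: 2 FLOPs per active parameter and token
d_gpu = 2 * P_ACTIVE * T_OUT / (989e12 * 3600 * 0.40)

# Operational: GPU power plus 1/8 of server power, with PUE and emission factor
i_ope_g = d_gpu * (0.700 + 1.540 / 8) * 1.2 * 0.420 * 1000  # in gCO2e

# Embodied: GPU plus 1/8 of server manufacturing, over a 5-year lifetime
i_emb_g = d_gpu * (250 + 10058 / 8) / (5 * 24 * 365.25) * 1000  # in gCO2e
```

This yields roughly 256 gCO2e operational and 20 gCO2e embodied per million output tokens, matching the equations above.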


Comparison

This section provides a comparison of available methodologies for evaluating the environmental impacts of generative AI models. It highlights their perimeters, strengths, and limitations, to position the D4B methodology relative to existing approaches.

Approach type
  • Full LCA (Google, 2025): full-stack measurement: CPU/DRAM, idle machines, datacenter overhead, water, partial hardware LCA
  • Ecologits: bottom-up assessment applied to inference only (usage + manufacturing)
  • D4B methodology: FLOPs → GPUh → impacts modeling

Perimeter
  • Full LCA: manufacturing (partial), usage (all server components), datacenter infrastructure, water, Scope 2/3 emissions
  • Ecologits: infrastructure usage + manufacturing, inference only
  • D4B: training, fine-tuning, inference usage + GPU and server manufacturing

Granularity & measurement
  • Full LCA: very fine: real measurements on Gemini production (energy, water, emissions)
  • Ecologits: medium-high: open data, multi-criteria (GWP, PE, ADPe) aggregated per API call
  • D4B: moderate: depends on available data (FLOPs, TDP, ...)

Accessibility
  • Full LCA: low: internal Google data not detailed
  • Ecologits: high: open-source code, open API
  • D4B: high: publicly documented methods and assumptions

Reproducibility
  • Full LCA: low: proprietary instrumentation and internal data
  • Ecologits: high: public tool, transparent and reproducible calculations
  • D4B: medium to high, if input data can be estimated

Transparency
  • Full LCA: medium: method published but data access limited
  • Ecologits: high: open-source code, assumptions, and model
  • D4B: high: all formulas and sources are explained

Accuracy (inference)
  • Full LCA: very high: real measured deployment, includes the full energy spectrum
  • Ecologits: medium: relies on simplified models and generalized assumptions
  • D4B: medium to high, depending on parameter accuracy

Applicability
  • Full LCA: limited: specific to Google infrastructure and inference
  • Ecologits: medium: inference across various providers, but no training
  • D4B: very broad: training, fine-tuning, inference based on public data

Targeted uses
  • Full LCA: internal analysis, detailed reporting, communication
  • Ecologits: public assessment, awareness, multi-provider comparison
  • D4B: research, internal assessment, FinOps, Green AI

Quantified results (average prompt, around 400 tokens)
  • Full LCA: ~0.03 gCO2e, ~0.24 Wh (Gemini)
  • Ecologits: ~40 gCO2e, ~95 Wh (Llama 3.1 405B)
  • D4B: ~0.12 gCO2e, ~0.27 Wh (Llama 3.1 405B, see Application)

Key limitations
  • Full LCA: proprietary data, does not cover training, focuses on inference, bias on the "median prompt"
  • Ecologits: limited perimeter (inference only), possible overestimation due to extrapolation
  • D4B: highly dependent on assumptions (MFU, lifetime)

These results show that each approach has a specific positioning: Google prioritizes accuracy but remains closed and non-reproducible, Ecologits focuses on transparency and simplicity but at the cost of possible overestimation, while the D4B methodology offers a reproducible and adaptable compromise for different usage contexts but depends on the precision of input data.