How Diffusion Models Work 2026

 

Type a sentence into Midjourney or Stable Diffusion and a picture comes back in seconds. The technology running underneath is the diffusion model. This article explains, in plain terms, how diffusion models build images out of noise, how they compare with GANs and autoregressive models, and what changed in 2026 as the same approach spread into text generation. I have spent close to twenty years running IT security and compliance at a multinational financial institution, and these days I see this technology less as a clever art tool and more as a source of security risk, forged ID images in eKYC flows being the obvious example. That angle is covered here too.

 

 

1. What a Diffusion Model Is and Where the Idea Came From

A generative model that learns to remove noise

A diffusion model is a type of generative model. A generative model learns the distribution of its training data and produces plausible new samples from it. Train one on millions of face photos and it can draw a face that has never existed.

The training recipe, stripped down, goes like this: take a clean image, mix in noise (the static-like specks you see on a broken screen) a little at a time until the image is destroyed, then learn to reverse that process. The corruption procedure is fixed math, so what the model actually learns is the reversal. More precisely, it learns to predict which part of a noisy image is the noise.

It may not be obvious why that amounts to drawing. Give a well-trained model a block of pure noise and it treats it as “an image that got corrupted” and restores it one step at a time. Since there never was an original, the result is a brand-new image that resembles the training data. The model is not creating from nothing; it is repurposing a denoising skill for generation.

Why the name “diffusion”: the physics of an ink drop

The name comes from physics. Drop ink into a glass of water and the molecules spread until the whole glass is uniformly murky. Adding noise to an image step by step looks much the same: a crisp image (the ink drop) spreads out into uniform noise (the murky water).

In physics, diffusion runs in one direction. The model exploits a different fact: if you slice the process finely enough, each small step can be statistically reversed. Split a video of spreading ink into 1,000 frames and the change between any two adjacent frames is small enough to guess what the previous frame looked like. This idea was first proposed in 2015 by Sohl-Dickstein and colleagues, framed in terms of non-equilibrium thermodynamics.

The limits of GANs and the arrival of DDPM

The concept appeared in 2015 but went largely unnoticed for years. Image generation at the time belonged to GANs (Generative Adversarial Networks), which pit a counterfeiter (the generator) against an inspector (the discriminator). GANs produced sharp results but were notorious for unstable training. When the balance between the two networks broke, you got mode collapse, the model emitting the same image over and over, and a small hyperparameter mistake could wreck a training run entirely.

The turning point was the 2020 DDPM (Denoising Diffusion Probabilistic Models) paper by Ho, Jain, and Abbeel at UC Berkeley. It showed that a simple noise-prediction objective could produce image quality competitive with GANs while training stably. DALL·E 2, Imagen, and Stable Diffusion all followed in 2022, and the standard for image generation shifted from GANs to diffusion.

 

 

2. How Diffusion Models Operate: Adding Noise, Then Undoing It

A diffusion model runs on two processes, forward and reverse. Training uses both. Generating an image uses only the reverse.


 

The forward process: breaking an image down into noise

The forward process adds Gaussian noise to a training image over hundreds to thousands of small steps. In the original DDPM setup, 1,000 steps are enough to make any image statistically indistinguishable from pure noise.

Two details matter here. First, nothing is learned in this process. How much noise gets added at each step is fixed in advance by a formula called the schedule. Second, you can jump to any step in a single shot. There is no need to apply 700 small corruptions in sequence; one equation produces “the image as it would look after step 700” directly. That shortcut is a large part of why training is efficient.
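The single-shot jump can be sketched in a few lines. This is a minimal illustration, assuming the linear schedule commonly associated with the original DDPM setup (per-step noise rising from 1e-4 to 0.02 over 1,000 steps); the function names and the 4-value toy "image" are made up for the example.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed DDPM-style values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: product of (1 - beta) up to and including step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_step(x0, t, alpha_bars, rng):
    """Jump directly to step t: x_t = sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*eps."""
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bars = cumulative_signal(linear_beta_schedule())
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25, 0.0]            # toy 4-pixel "image" (hypothetical data)
x_700, eps = noise_to_step(x0, 700, alpha_bars, rng)
print(alpha_bars[-1])                  # fraction of signal variance left at the last step
```

Nothing in this code is trained. The schedule is plain arithmetic, and `alpha_bars` is computed once and reused for every training example.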

The reverse process: rebuilding an image from noise

The reverse process is the part the model actually learns. Training works like this: pick an image, corrupt it to a random step (say, step 412), then ask a neural network (usually a U-Net, or more recently a Transformer-based DiT) to predict the noise that was mixed in. We know the right answer because we added the noise ourselves, so the weights are updated to shrink the gap between prediction and answer. Training is this one simple problem repeated hundreds of millions of times.
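One training example can be sketched as follows. The schedule values are the same assumed linear ones (1e-4 to 0.02 over 1,000 steps), and `placeholder_model` is a hypothetical stand-in for the U-Net or DiT that always predicts zero noise. A real training loop would backpropagate the loss through an actual network, which this sketch does not attempt.

```python
import math
import random

NUM_STEPS = 1000

def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Fraction of original signal variance left at step t (assumed linear schedule)."""
    product = 1.0
    for i in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * i / (NUM_STEPS - 1)
        product *= 1.0 - beta
    return product

def make_training_example(x0, rng):
    """Corrupt x0 to a random step; return the noisy input, the step, and the true noise."""
    t = rng.randrange(NUM_STEPS)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar(t)
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, t, eps

def mse(prediction, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)

def placeholder_model(x_t, t):
    """Hypothetical stand-in for the denoising network: predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
x_t, t, eps = make_training_example([1.0, -0.5, 0.25, 0.0], rng)
loss = mse(placeholder_model(x_t, t), eps)   # the quantity training tries to shrink
print(t, loss)
```

The model never sees the clean image directly. Its only supervision is the noise that was added, which is known exactly because the training code generated it.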

At generation time, the model starts from pure noise and repeatedly subtracts its own noise prediction. The original DDPM needed 1,000 iterations and was slow. Sampling methods like DDIM cut that to 20 to 50 steps, and distillation techniques from 2023 onward brought it down to around 4 steps, in some cases 1 or 2. That is why today’s services return a picture in a few seconds.
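To make the step-skipping concrete, here is a sketch of a deterministic DDIM-style sampler that visits only 20 of the 1,000 training steps. It works on a single number rather than an image, and it replaces the trained network with `oracle_noise`, an exact noise formula for a toy "dataset" that contains only the value 0.7. Both are assumptions made for illustration, as are the schedule values.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions for an assumed linear noise schedule."""
    out, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (num_steps - 1))
        out.append(running)
    return out

def ddim_sample(predict_noise, abar, num_sampling_steps, rng):
    """Deterministic DDIM-style sampling of one scalar (needs at least 2 steps)."""
    total = len(abar)
    # Evenly spaced subset of the training steps, visited from noisiest to cleanest.
    timesteps = [round(i * (total - 1) / (num_sampling_steps - 1))
                 for i in range(num_sampling_steps)][::-1]
    x = rng.gauss(0.0, 1.0)                       # start from pure noise
    for i, t in enumerate(timesteps):
        eps = predict_noise(x, t)
        # Estimate the clean value implied by the current noise prediction.
        x0_estimate = (x - math.sqrt(1.0 - abar[t]) * eps) / math.sqrt(abar[t])
        # After the final step no noise remains, so alpha_bar is treated as 1.
        abar_prev = abar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = math.sqrt(abar_prev) * x0_estimate + math.sqrt(1.0 - abar_prev) * eps
    return x

ABAR = alpha_bars()
TARGET = 0.7   # toy "dataset" containing a single value (hypothetical)

def oracle_noise(x_t, t):
    """Exact noise for the single-value dataset; stands in for a trained network."""
    return (x_t - math.sqrt(ABAR[t]) * TARGET) / math.sqrt(1.0 - ABAR[t])

sample = ddim_sample(oracle_noise, ABAR, num_sampling_steps=20, rng=random.Random(0))
print(sample)   # lands on TARGET up to floating-point error
```

With a perfect noise predictor the step count barely matters. Real networks are imperfect, which is why fewer steps usually trades away some quality.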

Conditioning: how a text prompt steers the image

Everything so far produces “some image that resembles the training data”. To request something specific, like “a cat on a beach at sunset”, the model needs a condition. The text prompt passes through a text encoder such as CLIP or T5 and becomes a numeric vector, and that vector is injected into the denoising network at every step. Since the model consults this text information each time it predicts noise, the restoration keeps getting pulled toward the prompt.

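In practice, a technique called CFG (Classifier-Free Guidance) controls the strength of that pull through a single number, the guidance scale. At each step the model predicts noise twice, once with the prompt and once without, and the scale sets how far the final prediction is pushed toward the prompted one. Turn it up and the model follows the prompt closely but the image oversaturates; turn it down and the image looks natural but drifts from the prompt. The guidance scale slider in most generation tools is exactly this knob.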
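The guidance arithmetic itself is small. The sketch below uses the commonly used form of classifier-free guidance, in which the model produces one noise prediction with the prompt and one without, and the scale extrapolates from the unconditional prediction toward the conditional one. The function name and the two-element toy predictions are illustrative.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    scale = 0 ignores the prompt, scale = 1 uses the conditional prediction
    as-is, and larger values push further in the prompt's direction.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # toy prediction without the prompt
eps_cond = [1.0, 1.0]     # toy prediction with the prompt

print(apply_cfg(eps_uncond, eps_cond, 1.0))   # [1.0, 1.0]
print(apply_cfg(eps_uncond, eps_cond, 7.5))   # [7.5, 1.0]
```

Only the component where the two predictions disagree gets amplified. That is one way to see why very high scales exaggerate prompt-related features and lead to oversaturation.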

Latent diffusion: why Stable Diffusion runs on a single GPU

Running diffusion directly on pixels is expensive. A 1024×1024 image has over a million pixels, and repeating dozens of denoising passes over that space takes several high-end GPUs. The Latent Diffusion Model paper went around the problem: compress the image with an autoencoder (VAE) into a latent space roughly 1/48 the size, run the entire diffusion process in that compressed space, and decode back to pixels only at the end.

Stable Diffusion is exactly this design, and it is the main reason image generation became possible on a single consumer GPU. A reasonable analogy: instead of working on the full-size blueprint, you do all the work on a reduced sketch and enlarge it once at the end.
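The 1/48 figure can be checked with quick arithmetic. The sketch assumes the configuration of the original Stable Diffusion releases, an autoencoder that shrinks each side by a factor of 8 and produces 4 latent channels. Other models use different factors, so treat the numbers as one example rather than a rule.

```python
def pixel_values(height, width, channels=3):
    """Number of values in an RGB image."""
    return height * width * channels

def latent_values(height, width, downsample=8, latent_channels=4):
    """Number of values in the latent (assumed 8x downsampling, 4 channels)."""
    return (height // downsample) * (width // downsample) * latent_channels

for side in (512, 1024):
    pixels = pixel_values(side, side)
    latents = latent_values(side, side)
    print(side, pixels, latents, pixels / latents)
# 512 786432 16384 48.0
# 1024 3145728 65536 48.0
```

Every denoising pass runs over the smaller count, so the saving is paid out once per step, not once per image.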


 

 

3. Diffusion vs GAN vs Autoregressive Models: Three Ways to Generate

Diffusion is not the only generative approach. Knowing how the three families differ makes the 2026 trends in section 4 and the decision criteria in section 6 fall into place.

| | Diffusion | GAN | Autoregressive |
| --- | --- | --- | --- |
| Generation method | Remove noise step by step | Generator vs discriminator | Predict tokens left to right |
| Main domain | Images, video, audio | Images (former leader), super-resolution | Text (GPT, Claude, Gemini) |
| Quality | High, with strong diversity | Sharp but limited diversity | Best in class for text |
| Training stability | Stable | Unstable (mode collapse) | Stable |
| Generation speed | Slow, scales with step count (improving) | Fast, single pass | Scales with output length |
| Editing | Easy to intervene mid-process (inpainting) | Difficult | Requires regeneration |

Look again at the last two rows. Diffusion’s weakness was speed. The autoregressive weakness is that a token, once emitted, cannot be fixed. The 2026 story is these two weaknesses crossing into each other’s territory.

 

 

4. Diffusion Models in 2026: Past Images, Into Text (dLLMs)

Text diffusion models: refining many tokens at once

GPT-style models are autoregressive: they produce tokens left to right, and each token has to wait for the previous one. No matter how fast the hardware gets, that sequential dependency puts a structural ceiling on speed. Text diffusion models (dLLMs, Diffusion Language Models) borrow the image-diffusion idea: start from a fully masked sequence and fill in and refine tokens at many positions in parallel, over several passes.
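The difference in pass counts can be shown with a toy. This is not how any named dLLM is implemented: `toy_predictor` is a hypothetical stand-in that already knows the answer and invents confidence scores, and the loop only fills masks, whereas real models can also revise tokens they have already placed. It illustrates one thing, that several positions are committed per pass instead of one.

```python
import math

MASK = "<mask>"

def toy_predictor(sequence, target):
    """Hypothetical stand-in for a dLLM: a (token, confidence) guess per masked slot."""
    proposals = {}
    for i, tok in enumerate(sequence):
        if tok == MASK:
            # Made-up confidence: higher when neighbouring slots are already filled.
            neighbours = [sequence[j] for j in (i - 1, i + 1) if 0 <= j < len(sequence)]
            filled = sum(1 for n in neighbours if n != MASK)
            proposals[i] = (target[i], 0.5 + 0.2 * filled - 0.001 * i)
    return proposals

def diffusion_decode(target, tokens_per_pass):
    """Start fully masked; on each pass commit the most confident proposals in parallel."""
    sequence = [MASK] * len(target)
    passes = 0
    while MASK in sequence:
        proposals = toy_predictor(sequence, target)
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:tokens_per_pass]:
            sequence[i] = proposals[i][0]
        passes += 1
    return sequence, passes

target = "def add ( a , b ) :".split()          # 8 tokens
decoded, passes = diffusion_decode(target, tokens_per_pass=4)
print(decoded == target, passes, len(target))   # True 2 8
```

An autoregressive decoder would need 8 sequential steps for the same 8 tokens. Here the number of passes is `math.ceil(len(target) / tokens_per_pass)`, which is where the speed advantage comes from.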

| Model | Released | Notes |
| --- | --- | --- |
| Mercury (Inception Labs) | Feb 2025 | First commercial dLLM, 1,000+ tokens/sec on H100 |
| Gemini Diffusion (Google DeepMind) | May 2025 | Experimental, about 1,479 tokens/sec, on par with autoregressive peers on coding benchmarks |
| Mercury 2 (Inception Labs) | Early 2026 | Positioned as the first diffusion-based reasoning model |
| DiffusionGemma (Google) | June 2026 | 26B MoE (3.8B active), open weights, 256 tokens generated in parallel, up to 4x faster |

The newly released DiffusionGemma matters for one reason in particular: it ships under an Apache 2.0 license as open weights and fits within 18GB of VRAM when quantized. That makes it the first realistic dLLM candidate for environments where calling an external API is not an option.

An honest assessment of where things stand: dLLMs are competitive on code generation, translation, and classification, in other words short and structured work. They still trail autoregressive models on long-form coherence and complex reasoning. Five to ten times the speed at 5 to 15 percent lower quality means the use cases split rather than one side winning.

Diffusion spreading into video and audio

Video generation (OpenAI’s Sora, Google’s Veo line) uses diffusion extended along the time axis, and many speech synthesis and music generation systems have also adopted diffusion-based methods. One principle, restoring data from noise step by step, keeps crossing into new modalities.

 

 

5. A Financial IT Security View: The Risks Arrive Before the Benefits

eKYC and AML: forged identity images made with diffusion

My first serious look at diffusion models in a multinational financial environment was not as an art tool. It was as a way around remote identity verification (eKYC). In remote account-opening flows that collect an ID photo and a selfie, a forgery built with diffusion-based inpainting, where only the face or text regions of an ID are swapped in naturally, behaves differently from old photo-editing fakes. The classic detection cues, edge artifacts and lighting mismatches, barely exist.

I came away with two operational conclusions. First, an eKYC process that relies on image inspection alone is no longer an adequate control. You have to combine non-image signals: device fingerprints, behavioral data, and checks against authoritative databases. Second, even if you deploy an AI-generated-image detector, do not lock a detection rate into a KPI. Generative models refresh on a quarterly cadence and there will be stretches where detectors fall behind. Treat the detector as a supporting signal and put the durable control in process design.

Deploying on a restricted network: what to check first

The opposite request also shows up: internal teams wanting to use diffusion models for design drafts or synthetic training data. For an air-gapped or otherwise restricted environment where external API calls are blocked, these are the items that actually caused friction during review:

  • Weight transfer procedure: moving multi-gigabyte weight files across a network boundary requires integrity verification (hash comparison) and provenance evidence. “We downloaded it from Hugging Face” does not pass a security review on its own.
  • License differences: “open model” is not one thing. Apache 2.0 (DiffusionGemma and others) and community licenses with commercial restrictions (some Stable Diffusion variants) come back from legal review with different answers. Check the license before fetching the weights, not after.
  • GPU budget: thanks to latent diffusion, inference fits on a single 24GB-class GPU. Fine-tuning on internal data (LoRA included) jumps a full tier. Decide whether the use case is inference or training before sizing anything.
  • Output controls: settle internal policy up front on the chance that generated images reproduce copyrighted training data, and on watermarking and provenance standards such as C2PA. Untangling this after the fact is much harder.
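For the weight-transfer item, the hash comparison can be as simple as the sketch below. It assumes the expected SHA-256 digest was recorded on the source side and carried across the boundary through a separate, trusted channel. The file name and the placeholder bytes are made up for the demonstration.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte weights never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_weights(path, expected_hex):
    """Return True only if the file's SHA-256 matches the recorded digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())

# Demonstration with a small placeholder file instead of real weights.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.safetensors")        # hypothetical file name
    with open(path, "wb") as handle:
        handle.write(b"placeholder weight bytes")
    recorded = hashlib.sha256(b"placeholder weight bytes").hexdigest()
    ok = verify_weights(path, recorded)                  # digest matches
    tampered = verify_weights(path, "0" * 64)            # digest does not match
print(ok, tampered)   # True False
```

A matching hash shows that the file did not change in transit. It says nothing about where the file originally came from, so the provenance evidence still has to be documented separately.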

 

 

6. Which Generative Model to Use: Decision Criteria by Situation

Pulling the article together as a set of decisions:

  • If the goal is high-quality image or video generation: diffusion is the default. The only real fork is whether an external service (Midjourney and the like) is enough, or whether data-control requirements push you to self-host open weights (the Stable Diffusion or FLUX families).
  • If the image task is extremely latency-sensitive (in-game super-resolution, for example): single-pass GANs are still in active service. Diffusion becoming the standard did not retire them.
  • If the task is general text and reasoning: autoregressive LLMs remain the default. The quality gap with dLLMs has not closed.
  • If the text task is short and latency-sensitive (code completion, translation, classification): it is time to shortlist dLLMs, meaning the Mercury family and DiffusionGemma. At equal quality, response speed decides the user experience.
  • If you operate in a regulated or air-gapped environment: settle licensing, the weight-transfer procedure, and output policy before debating model performance. Run the proof of concept after that, not before.

 

 


Diffusion models are the story of one idea, that learning to remove noise gives you generation for free, reshaping images, video, and audio in turn, and now reaching for text. A follow-up post will take the dLLM topic from section 4 on its own: how the architecture differs from autoregressive LLMs and how to judge adoption. In your own work, which one hurts more right now, generation speed or output quality?

 

 
