Foveated Diffusion:
Efficient Spatially Aware Image and Video Generation

Stanford University

*equal contribution

Full high resolution

1/2x resolution periphery

1/4x resolution periphery

Generated image

Tokenization

Foveated Diffusion employs non-uniform tokenization to concentrate high-resolution tokens on the region of interest (white circle) while using low-resolution tokens in the periphery. We achieve speedups for both image and video generation at 1/2× peripheral resolution, with even higher speedups at 1/4× peripheral resolution. Please hover over the images for the interactive demo.
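To make the token savings concrete, here is a minimal sketch of how a circular foveation mask translates into a mixed-resolution token budget. The function names, the per-pixel approximation of patch coverage, and the 2× peripheral downsampling factor are illustrative assumptions, not the paper's exact accounting.

```python
import numpy as np

def foveation_mask(h, w, center, radius):
    """Boolean mask: True inside the foveal circle (high resolution)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return (ys - center[0]) ** 2 + (xs - center[1]) ** 2 <= radius ** 2

def token_count(mask, patch=16, periphery_scale=2):
    """Approximate mixed-resolution token count: full patch density in the
    fovea, 1/periphery_scale density per axis in the periphery."""
    h, w = mask.shape
    n_full = (h // patch) * (w // patch)
    fovea_frac = mask.mean()  # per-pixel approximation of foveal coverage
    n_fovea = fovea_frac * n_full
    n_periph = (1 - fovea_frac) * n_full / periphery_scale ** 2
    return n_fovea + n_periph, n_full
```

Since attention cost grows quadratically in sequence length, even a modest reduction in token count compounds: a sequence shortened to 40% of full length cuts attention FLOPs by roughly 6×.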

Abstract

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens.

Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field.

Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions.
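The mixed-resolution construction above can be sketched as follows: keep high-resolution latent tokens inside the fovea and low-resolution tokens (encoded from the downsampled image) wherever their footprint lies outside it. The 2× downsampling factor, the treatment of one latent position as one token, and the simple concatenation are assumptions of this sketch, not the paper's exact merging mechanism.

```python
import numpy as np

def build_foveated_sequence(hi, lo, mask_hi):
    """hi: (H, W, C) high-res latent grid; lo: (H//2, W//2, C) latent grid
    encoded from the 2x-downsampled image; mask_hi: (H, W) bool, True = fovea.
    Returns a shortened token sequence mixing both resolutions."""
    H, W, C = hi.shape
    fovea_tokens = hi[mask_hi]                              # (N_f, C)
    # a low-res token covers a 2x2 block of high-res positions; drop any
    # low-res token whose block overlaps the fovea
    block = mask_hi.reshape(H // 2, 2, W // 2, 2).any(axis=(1, 3))
    periph_tokens = lo[~block]                              # (N_p, C)
    return np.concatenate([fovea_tokens, periph_tokens], axis=0)
```

Because both grids are encoded from the same underlying image, the model can be post-trained to treat the concatenated sequence as one coherent scene.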

We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.

Overview

In Foveated Generation, we iteratively denoise a foveated token sequence of reduced length instead of the full high-resolution sequence. The resulting tokens are split into high- and low-resolution grids, decoded by the VAE, and blended using a user-specified foveation mask.
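The final blending step can be sketched as an alpha blend of the two decodes under a feathered mask. The box-filter feathering and the `feather` parameter are illustrative choices to hide the resolution seam; the paper does not specify this particular filter.

```python
import numpy as np

def blend_foveated(img_hi, img_lo, mask, feather=8):
    """Blend the high-res decode with the (already upsampled) low-res
    decode using a feathered foveation mask. img_hi, img_lo: (H, W, 3)."""
    m = mask.astype(float)
    # cheap feathering: average the binary mask over a (2f+1)^2 window
    k = 2 * feather + 1
    pad = np.pad(m, feather, mode="edge")
    soft = np.zeros_like(m)
    for dy in range(k):
        for dx in range(k):
            soft += pad[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    soft /= k * k
    return soft[..., None] * img_hi + (1 - soft[..., None]) * img_lo
```

Deep inside the fovea the output is purely the high-resolution decode; far outside it, purely the upsampled low-resolution decode, with a smooth transition in between.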

We employ Foveated Training to adapt pretrained diffusion transformers (DiTs) to foveated token sequences using low-rank adaptation (LoRA). The image and its downsampled version are independently encoded by the VAE encoder and merged into a clean foveated token sequence for flow-matching training.
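The flow-matching objective on the clean foveated token sequence can be sketched as a standard rectified-flow pair: interpolate the clean sequence toward Gaussian noise and regress the constant velocity. The rectified-flow parameterization is an assumption of this sketch, and the LoRA-wrapped DiT forward pass is omitted.

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Rectified-flow training pair: x0 is the clean foveated token
    sequence (N, C), x1 is Gaussian noise of the same shape, t in [0, 1].
    The (LoRA-adapted) DiT is trained to predict v_target at (xt, t)."""
    xt = (1 - t) * x0 + t * x1
    v_target = x1 - x0  # constant velocity along the straight path
    return xt, v_target
```

Only the low-rank adapter weights are updated during post-training, so the base model's full-resolution behavior is preserved.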

Results

Baseline Comparisons

We provide extended baseline comparisons of image and video generation against full high-resolution generation and naïve mixed-resolution generation using our randomized mask model. Our method consistently generates coherent content, whereas the naïve baseline exhibits significant artifacts such as scale mismatch and distorted structures.

Image Generation

Full high resolution

Naïve mixed resolution

Foveated Diffusion (Ours)

Video Generation

Full high resolution

Naïve mixed resolution

Foveated Diffusion (Ours)

Visual Generation with Different Foveation Masks

Image Generation

We present image generation results where each image is generated with a randomly placed circular foveation mask. Given the same prompt and noise seed, Foveated Diffusion generates coherent and consistent content independent of the foveation mask; the mask only determines which regions are synthesized at high resolution.

Please try the interactive demo below by hovering the cursor over the image. The foveation mask will change dynamically based on the cursor position.

Generated image

Tokenization

Video Generation

We also present video generation results where the high-resolution region varies in shape, size, and position across frames, resulting in various mask trajectories. Given the same prompt and noise seed, Foveated Diffusion generates coherent and consistent content independent of the foveation mask trajectory.

Trajectory 1

Trajectory 2

Trajectory 3

Applications

Saliency-guided Image Generation

Foveated Diffusion can be trained on arbitrary foveation masks, enabling a range of applications. We demonstrate one such application, saliency-guided image generation, by training with DeepGaze-predicted saliency maps as foveation masks instead of randomly placed masks. The saliency-guided model centers the most salient object in the scene within the foveal region, which is synthesized at high resolution, while peripheral regions are generated at lower resolution.
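Turning a predicted saliency map into a training-time foveation mask can be as simple as keeping the top fraction of most salient pixels. The thresholding rule and the `keep` fraction are illustrative assumptions; the paper does not specify how its saliency maps are binarized.

```python
import numpy as np

def mask_from_saliency(saliency, keep=0.15):
    """Binarize a saliency map (e.g. a DeepGaze prediction, values in
    arbitrary units) into a foveation mask by keeping the top `keep`
    fraction of most salient pixels."""
    thresh = np.quantile(saliency, 1.0 - keep)
    return saliency >= thresh
```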

Please hover over the images below to change the foveation mask position.

Randomized Mask

Saliency-guided

Tokenization
(1/2x resolution periphery)

Saliency-guided Video Generation

We illustrate applications of saliency-guided video generation across VR gaming, autonomous driving simulation, and robotics. In each scenario, only the most salient objects are rendered at high resolution while peripheral regions are rendered at lower resolution.

Bounding-box-guided Image Generation

By using YOLO-predicted bounding boxes as foveation masks during training, we can also train a model for bounding-box-guided image generation. The bounding-box-guided model successfully generates objects within the foveation boundary. Compared with saliency guidance, bounding boxes explicitly outline object extents, encouraging the model to place entire objects within or aligned to the foveal region, whereas saliency guidance tends to align only the most visually salient parts of objects with the fovea.
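Rasterizing detector output into a foveation mask is straightforward: mark every pixel covered by a box. The `(x1, y1, x2, y2)` corner format and the optional padding are assumptions of this sketch (YOLO variants also emit center-based formats that would need conversion first).

```python
import numpy as np

def mask_from_boxes(h, w, boxes, pad=0):
    """Rasterize detector boxes given as (x1, y1, x2, y2) pixel corners
    into a boolean foveation mask, optionally padded so whole objects
    fall inside the foveal region."""
    m = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        m[max(y1 - pad, 0):y2 + pad, max(x1 - pad, 0):x2 + pad] = True
    return m
```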

Please drag the slider below to interactively adjust the foveation mask radius.

Saliency-guided

Bounding-box-guided

Tokenization
(1/2x resolution periphery)

Foveation Radius
