Numerous models have been developed for scanpath and saliency prediction; they are typically trained on scanpaths, and the rich information contained in raw eye-tracking trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, yielding outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images with a diffusion model while fully exploiting the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories.
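To make the training idea above concrete, here is a minimal sketch of a DDPM-style noise-prediction step on raw gaze trajectories, conditioned on image patch features. This is an illustrative approximation of the setup described in the abstract, not the authors' implementation: the `denoiser` network, tensor shapes, and noise schedule are assumptions.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, traj, patch_feats, alphas_cumprod):
    """One DDPM-style training step on raw gaze trajectories (sketch).

    traj:           (B, T, 2) normalized (x, y) gaze samples
    patch_feats:    (B, N, D) per-patch visual features used as conditioning
    alphas_cumprod: (num_steps,) cumulative noise schedule
    """
    B = traj.shape[0]
    # Sample a random diffusion timestep per trajectory.
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=traj.device)
    noise = torch.randn_like(traj)
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    # Forward process: corrupt the clean trajectory with Gaussian noise.
    noisy_traj = a_bar.sqrt() * traj + (1 - a_bar).sqrt() * noise
    # The (hypothetical) denoiser predicts the added noise, conditioned on
    # image features, e.g., via cross-attention inside the network.
    pred_noise = denoiser(noisy_traj, t, patch_feats)
    return F.mse_loss(pred_noise, noise)
```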
Figure 1: Comparison of different eye-tracking data types.
(a) Original visual stimulus. (b) Saliency maps highlight regions of interest but do not capture the temporal dynamics of human attention. (c) Scanpaths offer a compressed representation of eye movement trajectories. (d) Full eye movement trajectories, recorded via eye trackers, provide detailed insights into attention dynamics. Example from MIT1003; each color represents a different subject, emphasizing inter-subject variability.
Motivation: Existing models for predicting eye movements often use simplified data like scanpaths (c), which are sequences of discrete points. This discards the rich, continuous information found in raw eye-tracking trajectories (d). Furthermore, these models typically produce a single, deterministic output, failing to capture the natural variation in how different people look at the same image (as shown by the different colored paths). DiffEye addresses these limitations by training directly on raw trajectories to generate diverse and realistic eye movements that better reflect human visual attention.
Figure 2: An illustration of DiffEye.
(a) End-to-end training: noise is added to trajectories; FeatUp extracts patch features; both pass through the CPE module and a U-Net with cross-attention, optimized via diffusion loss. (b) CPE aligns trajectory positions with image patch positions. (c) Inference: starting from noise, the model denoises to produce an eye-movement trajectory, which can be converted into scanpaths or saliency maps.
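Figure 2(b) describes CPE as aligning trajectory positions with image patch positions. The sketch below illustrates one plausible reading of that idea, in which a gaze sample and the patch it lands on share the same positional embedding; the grid size, embedding table, and class names are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class CorrespondingPositionalEmbedding(nn.Module):
    """Sketch of CPE-style alignment: a trajectory sample and the image patch
    it lands on share the same positional embedding, so cross-attention can
    relate gaze positions to patch features spatially."""

    def __init__(self, grid_size: int, dim: int):
        super().__init__()
        self.grid_size = grid_size
        self.pos_emb = nn.Embedding(grid_size * grid_size, dim)

    def patch_index(self, traj):
        # traj: (B, T, 2) gaze coordinates normalized to [0, 1]
        cols = (traj[..., 0] * self.grid_size).long().clamp(0, self.grid_size - 1)
        rows = (traj[..., 1] * self.grid_size).long().clamp(0, self.grid_size - 1)
        return rows * self.grid_size + cols  # (B, T) patch indices

    def forward(self, traj_tokens, traj, patch_tokens):
        # Patch tokens receive their own grid-position embedding
        # (assumes N == grid_size * grid_size patches, in row-major order)...
        B, N, _ = patch_tokens.shape
        patch_ids = torch.arange(N, device=patch_tokens.device).expand(B, N)
        patch_tokens = patch_tokens + self.pos_emb(patch_ids)
        # ...and each trajectory token receives the embedding of the patch
        # its gaze coordinate falls in.
        traj_tokens = traj_tokens + self.pos_emb(self.patch_index(traj))
        return traj_tokens, patch_tokens
```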
Figure 3: Qualitative comparison of scanpath generation.
Scanpaths generated by DiffEye and baseline models are shown alongside ground truth across multiple scenes. Each row is a stimulus; columns show each method’s scanpaths.
Figure 4: Qualitative analysis and ablation study of continuous eye-movement trajectory generation.
(a) Multiple trajectories generated by DiffEye with ground-truth overlays across several scenes. (b) Effect of removing FeatUp, CPE, U-Net cross-attention, and patch-level features on trajectory quality.
Figure 5: Qualitative comparison of saliency map predictions.
Saliency maps produced by DiffEye and baseline models alongside ground truth for multiple scenes. Rows correspond to different stimuli; columns show the stimulus, ground truth, and each model’s prediction.
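Figure 2(c) and the abstract note that generated trajectories can be converted into saliency maps. A common way to do this, shown below as a sketch rather than the authors' exact procedure, is to accumulate gaze samples into a 2D density map and smooth it with a Gaussian kernel (here via `scipy.ndimage.gaussian_filter`).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def trajectory_to_saliency(trajs, height, width, sigma=25.0):
    """Convert gaze trajectories into a saliency map (sketch).

    trajs: list of (T, 2) arrays of (x, y) coordinates normalized to [0, 1]
    Returns an (height, width) map normalized to [0, 1].
    """
    density = np.zeros((height, width), dtype=np.float64)
    for traj in trajs:
        xs = np.clip((traj[:, 0] * width).astype(int), 0, width - 1)
        ys = np.clip((traj[:, 1] * height).astype(int), 0, height - 1)
        # Accumulate each gaze sample into the density map.
        np.add.at(density, (ys, xs), 1.0)
    # Smooth with a Gaussian kernel to obtain a continuous saliency map.
    saliency = gaussian_filter(density, sigma=sigma)
    if saliency.max() > 0:
        saliency /= saliency.max()
    return saliency
```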
@article{kara2025diffeye,
title={DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images},
author={Ozgur Kara and Harris Nisar and James M. Rehg},
journal={arXiv preprint arXiv:2509.16767},
year={2025},
url={https://arxiv.org/abs/2509.16767}
}