We trained a multi-layer feed-forward convolutional neural network (ConvNet). The model takes as input an RGB image from a smartphone’s front-facing camera, cropped to the eye regions, and applies three layers of convolution to extract gaze features. These features are combined in additional layers with automatically extracted eye-corner landmarks, which indicate the eye position within the image, to produce a final on-screen gaze estimate. This base model was first trained on the publicly available GazeCapture dataset37, then fine-tuned with calibration data and personalized by fitting an additional regression model (details in the “Methods” section) to the ConvNet’s gaze feature output, as described below.
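The personalization step described above can be sketched as a lightweight regression fit on the base ConvNet’s penultimate-layer (“gaze feature”) output. The sketch below uses a closed-form ridge regression; the function names, feature dimensionality, and choice of regressor are illustrative assumptions, not the paper’s exact implementation (see “Methods”).

```python
# Illustrative personalization: fit a ridge regressor mapping the base
# model's gaze features to on-screen (x, y) locations in cm.
# All names and the ridge choice are assumptions for this sketch.
import numpy as np

def fit_personalization(features, targets, reg=1.0):
    """features: (n_frames, d) penultimate-layer activations.
    targets:  (n_frames, 2) ground-truth marker locations in cm."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])  # bias term
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + reg*I)^-1 X^T Y
    W = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ targets)
    return W

def predict_gaze(features, W):
    """Return (n_frames, 2) personalized gaze estimates in cm."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ W
```

In this formulation only the small regression head is fit per user, which keeps per-participant calibration cheap relative to retraining the ConvNet.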
During calibration, participants were asked to fixate on a green circular stimulus that appeared at random locations on a black screen. Images from the front-facing camera were recorded at 30 Hz, with timestamps synchronized to the marker location. In ML terminology, the images and marker locations served as inputs and targets, respectively. During inference, camera images were fed in sequence to the fine-tuned base model, whose penultimate-layer output served as input to the regression model to obtain the final, personalized gaze estimate. Model accuracy was evaluated across all participants by computing the error, in cm, between the stimulus locations from the calibration tasks (ground truth) and the estimated gaze locations.
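The accuracy metric described above amounts to a mean Euclidean distance on screen coordinates. A minimal sketch, assuming both ground truth and estimates are expressed in cm in the same screen coordinate frame:

```python
# Sketch of the evaluation metric: mean Euclidean distance in cm between
# ground-truth stimulus locations and estimated gaze locations.
import numpy as np

def gaze_error_cm(stimulus_cm, estimate_cm):
    """Both inputs: (n, 2) arrays of on-screen positions in cm."""
    return float(np.mean(np.linalg.norm(stimulus_cm - estimate_cm, axis=1)))
```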
To test the effect of personalization on model accuracy, we collected data from 26 participants as they viewed stimuli on a phone mounted on a device stand. As in typical desktop eye tracking studies, we focused on a near-frontal headpose (no tilt/pan/roll; see “Methods”, study 1). Figure 1 shows how accuracy varies with the number of calibration frames. While the base model has a high error of 1.92 ± 0.20 cm, personalization with ~100 calibration frames led to a nearly fourfold reduction in error, to 0.46 ± 0.03 cm (t(25) = 7.32, p = 1.13 × 10−7). Note that 100 calibration frames across different screen locations correspond to <30 s of data, which is quite reasonable for eye tracking studies, where calibration is typically performed at the beginning of each study (or during the study to account for breaks or large changes in pose). The best participant had 0.23 cm error, while the worst had 0.75 cm error ([5, 95]th percentiles were [0.31, 0.72] cm). At a viewing distance of 25–40 cm, this corresponds to 0.6–1° accuracy, better than the 2.44–3° reported in previous work37,38.
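The conversion from on-screen error in cm to visual angle in degrees follows from simple geometry: an error of e cm viewed from d cm subtends approximately 2·atan(e / 2d). A sketch of this conversion, which reproduces the reported 0.6–1° range for a 0.46 cm error at 25–40 cm:

```python
# Convert an on-screen gaze error in cm to visual angle in degrees,
# given the viewing distance in cm.
import math

def cm_to_degrees(error_cm, distance_cm):
    return math.degrees(2 * math.atan(error_cm / (2 * distance_cm)))
```

For example, `cm_to_degrees(0.46, 40)` gives roughly 0.66° and `cm_to_degrees(0.46, 25)` roughly 1.05°, consistent with the 0.6–1° figure above.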
The improvements over previous work are due to a combination of a better model architecture, calibration/personalization, and optimal UX settings. In particular, fine-tuning and personalizing the model with ~30 s of calibration data under optimal UX settings (near-frontal headpose, short viewing distance of 25–40 cm) led to large accuracy improvements (from 1.92 to 0.46 cm). While changes in model architecture led to modest improvements in accuracy (0.73 cm37 vs. 0.46 cm for ours, with fine-tuning and personalization applied to both models), they reduced model complexity by 50× (8 M vs. 170 K model parameters), making the model suitable for on-device implementation. Thus, our model is both lightweight and accurate.
As shown in Fig. 1b, errors were comparable across different locations on the phone screen, with slightly larger errors toward the bottom of the screen, since the eyes tend to appear partially closed when participants look down (see Supplementary Fig. 1). While these numbers are reported for Pixel 2 XL phones, personalization was found to help across other devices as well (see Supplementary Fig. 3a). Figure 1a, b focused on the frontal headpose, such that the face covered about one-third of the camera frame. To test the effect of headpose and distance on accuracy, we analyzed the GazeCapture37 dataset on iPhones, which offered more diversity in headpose/distance. As seen in Supplementary Fig. 3b–e, the best performance was achieved for a near-frontal headpose and a shorter distance to the phone (where the eye region appeared bigger), and accuracy decayed with increasing pan/tilt/roll, or as participants moved further away from the phone. Thus, all studies in this paper focused on the optimal UX settings, namely a near-frontal headpose with short viewing distances of 25–40 cm. While this may seem restrictive, it is worth noting that the most common setups in prior eye movement research8,12,14,16,18,29 often require expensive hardware and more controlled settings, such as a chin rest, dim indoor lighting, and a fixed viewing distance.
Comparison with specialized mobile eye trackers
To understand the gap in performance between our smartphone eye tracker and state-of-the-art, expensive mobile eye trackers, we compared our method against Tobii Pro 2 glasses, a head-mounted eye tracker with four infrared cameras near the eyes. We selected the frontal headpose since Tobii glasses work best in this setting. Thirteen participants performed a calibration task under four conditions: with and without Tobii glasses, each with the phone on a fixed device stand or held freely in the hand (see Fig. 2). With the fixed device stand, we found that the smartphone eye tracker’s accuracy (0.42 ± 0.03 cm) was comparable to that of the Tobii glasses (0.55 ± 0.06 cm; two-tailed paired t-test, t(12) = −2.12, p = 0.06). Similar results were obtained in the hand-held setting (0.59 ± 0.03 cm for Tobii vs. 0.50 ± 0.03 cm for ours; t(12) = −1.53, p = 0.15). The per-user error distributions for both the device-stand and hand-held settings can be found in Supplementary Fig. 4.
It is worth noting that specialized eye trackers like Tobii Pro glasses represent a high bar. These are head-mounted glasses with four infrared cameras (two near each eye) and one world-centered camera, so their input is high-resolution, close-up infrared imagery of the eyes (within 5–10 cm of the eye). In contrast, our method uses the smartphone’s single front-facing RGB camera at a larger viewing distance (25–40 cm from the eye), so the eye region appears small. Despite these challenges, it is promising that our smartphone eye tracker achieves accuracy comparable to state-of-the-art mobile eye trackers.
Validation on standard oculomotor tasks
As a research validation, we tested whether key findings from previous eye movement research on oculomotor tasks, obtained with large displays and expensive desktop eye trackers, could be replicated on small smartphone displays using our method. Twenty-two participants performed prosaccade, smooth pursuit, and visual search tasks as described below (details in “Methods”, study 2). Figure 3a shows the setup for the prosaccade task. We computed saccade latency, a commonly studied measure, as the time from when the stimulus appeared to when the participant moved their eyes toward it. As seen in Fig. 3b, the mean saccade latency was 210 ms (median 167 ms), consistent with the 200–250 ms observed in previous studies41.
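One common way to operationalize saccade latency from a sampled gaze trace is a velocity-threshold detector: latency is the time from stimulus onset to the first sample whose gaze velocity exceeds a threshold. The sketch below assumes the 30 Hz sampling rate mentioned earlier; the velocity threshold and the detector itself are illustrative assumptions rather than the paper’s exact pipeline.

```python
# Illustrative saccade-latency computation from a 30 Hz gaze trace.
# The velocity threshold is an assumption for this sketch.
import numpy as np

def saccade_latency_ms(gaze_cm, onset_idx, fs=30.0, vel_thresh_cm_s=10.0):
    """gaze_cm: (n, 2) gaze positions in cm sampled at fs Hz.
    onset_idx: sample index at which the stimulus appeared.
    Returns latency in ms to the first suprathreshold-velocity sample,
    or None if no saccade is detected."""
    # Sample-to-sample speed in cm/s
    vel = np.linalg.norm(np.diff(gaze_cm, axis=0), axis=1) * fs
    for i in range(onset_idx, len(vel)):
        if vel[i] > vel_thresh_cm_s:
            return (i - onset_idx) * 1000.0 / fs
    return None
```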
To investigate smooth pursuit eye movements, participants were asked to perform two types of tasks: one where the object moved smoothly along a circle, and another where it moved along a box. Similar tasks have recently been shown to be useful for detecting concussion42,43. Figure 3c–e shows a sample gaze scanpath from a randomly selected participant, and the population-level heatmap from all users and trials for the smooth pursuit circle task. Consistent with previous literature on desktops, participants performed well in this task, with a low tracking error of 0.39 ± 0.02 cm. Similar results were obtained for the smooth pursuit box task (see Supplementary Fig. 5).
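The pursuit tracking error above can be understood as the mean sample-wise distance between gaze and the moving stimulus along its trajectory. A minimal sketch, with an assumed circular trajectory whose radius and center are illustrative:

```python
# Sketch of the smooth-pursuit tracking-error metric: mean distance
# between gaze and the moving target, sample by sample.
# Trajectory parameters are illustrative assumptions.
import numpy as np

def circle_trajectory(n, radius_cm=3.0, center=(0.0, 0.0)):
    """n target positions evenly spaced along a circle, in cm."""
    t = np.linspace(0, 2 * np.pi, n, endpoint=False)
    return np.stack([center[0] + radius_cm * np.cos(t),
                     center[1] + radius_cm * np.sin(t)], axis=1)

def pursuit_error_cm(gaze_cm, target_cm):
    """Mean Euclidean gaze-to-target distance in cm."""
    return float(np.mean(np.linalg.norm(gaze_cm - target_cm, axis=1)))
```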
Beyond simple oculomotor tasks, we investigated visual search, a key focus of attention research since the 1980s12,44,45. Two well-known phenomena here are: (1) the effect of target saliency (the dissimilarity or contrast between the target and the surrounding distracting items in the display, known as distractors)46,47; and (2) the effect of set size (the number of items in the display)44,45 on visual search behavior.
To test for these effects on phones, we measured gaze patterns as 22 participants performed a series of visual search tasks in which we systematically varied the target’s color intensity or orientation relative to the distractors. When the target’s color (or orientation) appeared similar to the distractors (low target saliency), more fixations were required to find the target (see Fig. 4a, c). In contrast, when the target’s color (or orientation) appeared different from the distractors (high target saliency), fewer fixations were required (Fig. 4b, d). Across all users and trials, the number of fixations to find the target decreased significantly as target saliency increased (see Fig. 4e, f; for color intensity contrast: F(3, 63) = 37.36, p < 10−5; for orientation contrast: F(3, 60) = 22.60, p < 10−5). These results confirm the effect of target saliency on visual search previously seen in desktop studies12,44,46,47.
To test the effect of set size on visual search, we varied the number of items in the display (5, 10, or 15). Figure 4g shows that the effect of set size depends on target saliency. When target saliency is low (difference in orientation between target and distractors, Δθ = 7°), the number of fixations to find the target increased linearly with set size (slope = 0.17; one-way repeated measures ANOVA, F(2, 40) = 3.52, p = 0.04). In contrast, when target saliency is medium-high (Δθ = 15°), the number of fixations did not vary significantly with set size (F(2, 40) = 0.85, p = 0.44). For very highly salient targets (Δθ = 75°), we found a negative effect of set size on the number of fixations (slope = −0.06; F(2, 40) = 4.39, p = 0.02). These findings are consistent with previous work on desktops47,48,49,50. To summarize, in this section we replicated key findings on oculomotor tasks, namely prosaccade, smooth pursuit, and visual search, using our smartphone eye tracker.
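The set-size slopes reported above summarize how many extra fixations each added display item costs, and can be obtained by a simple linear fit of mean fixation count against set size. A sketch, with made-up fixation counts chosen only to illustrate the computation:

```python
# Illustrative set-size slope: fit a line to mean fixation counts at
# set sizes 5, 10, 15. The example fixation counts are made up.
import numpy as np

def set_size_slope(set_sizes, mean_fixations):
    """Least-squares slope of fixation count vs. set size."""
    slope, _intercept = np.polyfit(set_sizes, mean_fixations, 1)
    return float(slope)
```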
Validation on natural images
We further validated our method by testing whether previous findings on eye movements for rich stimuli such as natural images, obtained with expensive desktop eye trackers and large displays, could be replicated on small displays such as smartphones using our method. Some well-known phenomena about gaze on natural images are that gaze is affected by: (a) the task being performed (known since the classic eye tracking experiments by Yarbus in 196730); (b) the saliency of objects in the scene19,51,52; and (c) a tendency to fixate near the center of the scene51,53. To test whether our smartphone eye tracker can reproduce these findings, we collected data from 32 participants as they viewed natural images under two task conditions: (1) free viewing and (2) visual search for a target (see “Methods”, study 3).
As expected, gaze patterns were more dispersed during free viewing and more focused on the target object and its likely locations during visual search (see Fig. 5). For example, the third row of Fig. 5 shows that during free viewing, participants spent time looking at the person and the sign he points to, while during visual search for a “car” they avoided the sign and instead fixated on the person and the car. Across all images, gaze entropy was significantly higher for free viewing than for visual search (16.94 ± 0.03 vs. 16.39 ± 0.04; t(119) = 11.14, p = 10−23). Additional analysis of visual search performance showed that, consistent with previous findings54, the total fixation duration to find the target decreased with the size of the target (r = −0.56, p = 10−11; n = 120 images), confirming that bigger targets are easier to find than smaller ones. Beyond size, we found that target saliency density had a significant effect on the time to find the target (r = −0.30, p = 0.0011; n = 120 images), i.e., more salient targets are easier to find than less salient ones, consistent with previous literature19.
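A common way to quantify gaze dispersion, as in the entropy comparison above, is the Shannon entropy of the fixation density over a spatial grid: concentrated gaze yields low entropy, dispersed gaze high entropy. The grid resolution below is an assumption for the sketch; the paper’s exact binning may differ.

```python
# Sketch of a gaze-entropy measure: Shannon entropy (bits) of the
# fixation histogram over a spatial grid. Grid size is an assumption.
import numpy as np

def gaze_entropy(fix_x, fix_y, bins=64):
    hist, _, _ = np.histogram2d(fix_x, fix_y, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]  # drop empty bins (0 * log 0 := 0)
    return float(-np.sum(p * np.log2(p)))
```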
Second, we tested for the central fixation tendency during free viewing of natural images on smartphones. Figure 6a shows the gaze entropy across all images in this study. Images with low gaze entropy typically contain one or two salient objects (e.g., a single person or animal), while high-entropy images contain multiple objects of interest (e.g., multiple people, or an indoor room with furniture). Similar findings were reported with specialized desktop eye trackers51,52. Averaging the fixations across all users and images from our smartphone eye tracker revealed a center bias (see Fig. 6b), consistent with previous literature on desktops51,53.
Finally, since saliency has been extensively studied using desktop eye trackers19,51,52, we directly compared the gaze patterns obtained from our smartphone eye tracker against those obtained from specialized desktop eye trackers such as the EyeLink 1000 (using the OSIE dataset52). Note that this comparison sets a high bar. Not only did the desktop setup with the EyeLink 1000 involve specialized hardware, with an infrared light source and infrared cameras near the eye at high spatio-temporal resolution (up to 2000 Hz), it also used highly controlled settings, with a chin rest and dim lighting, and displayed the images on a large screen (22″, 33 × 25° viewing angle). In contrast, our study used the smartphone’s existing selfie camera (RGB) in more natural settings (natural indoor lighting, no chin rest, just a stand for the phone), with images viewed on a small mobile screen (6″, median viewing angle of 12 × 9°). Thus, the two setups differ in a number of ways (large-screen desktop vs. small-screen mobile, controlled vs. natural settings, eye tracker cost, sampling rate).
Despite these differences, we found that the gaze heatmaps from the two settings are qualitatively similar. Figure 7 shows the most similar and most dissimilar heatmaps between desktop and mobile (similarity measured using Pearson’s correlation). Our smartphone eye tracker detected gaze hotspots similar to those of the expensive desktop counterpart, with the key difference that the mobile gaze heatmaps appear more blurred (see Supplementary Discussion for further analysis). The blur is due to a combination of the smaller display on the mobile screen and the lower accuracy/higher noise of the smartphone eye tracker (no chin rest, no infrared cameras near the eye). Apart from the blur, the gaze heatmaps from desktop and mobile are highly correlated, both at the pixel level (r = 0.74) and at the object level (r = 0.90; see Table 1). This suggests that our smartphone eye tracker could be used to scale saliency analyses on mobile content, both for static images and for dynamic content (as participants scroll and interact with the content, or watch videos).
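The pixel-level comparison above can be sketched as a Pearson correlation between the two heatmaps flattened to vectors, assuming both have been resampled to a common resolution beforehand (a preprocessing step this sketch does not show):

```python
# Sketch of the pixel-level heatmap comparison: Pearson correlation
# between two gaze heatmaps of the same shape, flattened to vectors.
import numpy as np

def heatmap_correlation(h1, h2):
    """h1, h2: 2D arrays of identical shape. Returns Pearson's r."""
    return float(np.corrcoef(h1.ravel(), h2.ravel())[0, 1])
```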
Testing on reading comprehension task
Beyond research validation on oculomotor tasks and natural images, we tested whether our smartphone eye tracker could help detect reading comprehension difficulty as participants naturally scrolled and read passages on the phone. Seventeen participants read SAT-like passages on the phone (with scroll interactions) and answered two multiple-choice questions (see “Methods”, study 4). One question was factual and could be answered by finding the relevant excerpt within the passage. The other required interpreting the passage in more detail; we call this the “interpretive” task. As expected, gaze patterns differed between the factual and interpretive tasks: gaze was more focused on specific parts of the passage for factual tasks, and more dispersed across the passage for interpretive tasks (see Fig. 8). Across all users and tasks, gaze entropy was higher for interpretive tasks than for factual tasks (8.14 ± 0.16 vs. 7.71 ± 0.15; t(114) = 1.97, p = 0.05).
Within the factual tasks, we examined whether gaze patterns differed when participants answered the question correctly vs. incorrectly. We hypothesized that gaze should be focused on the relevant excerpt in the passage for participants who answered correctly, and more dispersed or focused on other parts of the passage for incorrect answers. Figure 9a shows that participants who answered correctly spent significantly more time fixating within the relevant passage regions than the irrelevant ones (62.29 ± 3.63% of the time on relevant vs. 37.7 ± 3.63% on irrelevant; t(52) = 3.38, p = 0.001). This trend was inverted for wrong answers, though not significantly (41.97 ± 6.99% on relevant vs. 58.03 ± 6.99% on irrelevant; t(12) = −1.15, p = 0.27).
Next, we examined the effect of task-level difficulty on gaze and time-to-answer. We quantified task difficulty as the percentage of incorrect answers per task (see Supplementary Figs. 6–7 for additional measures of task difficulty that take both time and accuracy into account). Figure 9b–f shows example gaze heatmaps for easy vs. difficult tasks, and the corresponding scatterplots of various metrics as a function of task difficulty. As expected, time to answer increased with task difficulty, though not significantly (Spearman’s rank correlation, r = 0.176, p = 0.63). The number of eye fixations on the passage increased with task difficulty (r = 0.67, p = 0.04). A closer look showed that the best predictor was the fraction of gaze time spent on the relevant excerpt (normalized by its height), which was strongly negatively correlated with task difficulty (r = −0.72, p = 0.02). In other words, as task difficulty increased, participants spent more time looking at irrelevant excerpts in the passage before finding the relevant excerpt that contained the answer. These results show that smartphone-based gaze can help detect reading comprehension difficulty.
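The predictor described above, the fraction of gaze time on the relevant excerpt normalized by its height, can be sketched as follows. The coordinate convention (vertical extent of the excerpt) and the exact normalization are assumptions for this illustration, not the paper’s exact definition:

```python
# Illustrative gaze feature for reading: fraction of fixation time
# inside the relevant excerpt's vertical extent, normalized by the
# excerpt's height. Coordinates/normalization are assumptions.
import numpy as np

def frac_time_on_relevant(fix_y, fix_dur, y_top, y_bottom):
    """fix_y: fixation vertical positions; fix_dur: fixation durations;
    [y_top, y_bottom]: vertical extent of the relevant excerpt."""
    fix_y = np.asarray(fix_y, dtype=float)
    fix_dur = np.asarray(fix_dur, dtype=float)
    inside = (fix_y >= y_top) & (fix_y <= y_bottom)
    frac = fix_dur[inside].sum() / fix_dur.sum()
    return float(frac / (y_bottom - y_top))
```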