Standardizing Multilingual Eye-tracking for Psycholinguistic and NLP Research

Authors

  • Andrej Čuber, University of Ljubljana
  • Gaja Prinčič, University of Ljubljana

Abstract

Reading behavior provides a valuable window into cognitive processing, and eye-tracking has become an established method for studying how we read. However, most existing eye-tracking datasets are monolingual and incompatible across languages, limiting their broader applicability. The MultiplEYE COST Action is a European research initiative that aims to address this gap by building a large-scale, multilingual eye-tracking corpus to support psycholinguistic research and the development of computational language models [1].

The project collects data across more than 25 languages using a standardized protocol, enabling controlled cross-linguistic comparison. Within this initiative, we are responsible for collecting the Slovenian data from approximately 100 native speakers. Although full-scale data collection has not yet started, we have conducted a pilot study with seven participants to evaluate the reliability of our equipment and the quality of the data. In the main study, participants will read parallel texts in their native language while their eye movements are recorded. Each passage will be followed by comprehension questions, and additional metadata, such as demographic profiles, reading habits, and optional psychometric assessments, will be gathered to enrich the dataset.

One of the major challenges of the project is coordinating cross-linguistic research at this scale. Differences in hardware, script direction, translation consistency, and reading behavior all present methodological complications. To address these challenges, MultiplEYE has developed detailed data collection guidelines, an open-source preprocessing pipeline, and a data-sharing plan that ensures comparability across sites while respecting the FAIR principles (Findable, Accessible, Interoperable, and Reusable) [2]. Pilot studies have already helped refine these procedures and identify key areas for standardization.

We anticipate that the resulting dataset will support a wide range of uses, such as comparing fixation durations and saccade patterns across writing systems or evaluating natural language processing (NLP) models with gaze-informed benchmarks. While the idea of using gaze data as additional input for training NLP models remains exploratory, we aim to test whether such biologically grounded signals can enhance model performance or interpretability.
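As a rough illustration of the first kind of use, the sketch below averages fixation durations per language. It is a minimal example under stated assumptions: the record layout (language code, duration in milliseconds), the function name mean_fixation_duration, and all values are hypothetical and do not reflect the MultiplEYE data format or any collected data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fixation records: (language code, fixation duration in ms).
# The values are invented for illustration and are not MultiplEYE data.
fixations = [
    ("sl", 210), ("sl", 250), ("sl", 190),
    ("en", 200), ("en", 240),
]


def mean_fixation_duration(records):
    """Return the mean fixation duration (ms) for each language code."""
    by_language = defaultdict(list)
    for language, duration_ms in records:
        by_language[language].append(duration_ms)
    return {lang: mean(durations) for lang, durations in by_language.items()}


summary = mean_fixation_duration(fixations)
print(summary)
```

A real comparison would presumably also need to control for text, participant, and word-level properties; a plain per-language mean only serves to show the shape of the computation.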

A key hypothesis is that language-specific reading behaviors, such as skipping function words or re-reading syntactically complex clauses, can be systematically captured and compared under harmonized protocols. By contributing the Slovenian component to this broader effort, our work will also shed light on cognitive and linguistic variation within the Slovenian-speaking population. MultiplEYE represents a significant step toward more inclusive, transparent, and interdisciplinary language research.
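The two behaviors named in the hypothesis can be expressed as simple word-level measures. The following sketch is only an illustration under assumptions: the function/content tags, the scanpath representation (the index of the fixated word at each fixation, in temporal order), and the helper names skipping_rate and count_regressions are all hypothetical, and the regression definition assumes a left-to-right script.

```python
def skipping_rate(words, fixated_indices, word_class):
    """Share of words of the given class that received no fixation."""
    targets = [i for i, (_, cls) in enumerate(words) if cls == word_class]
    if not targets:
        return 0.0
    fixated = set(fixated_indices)
    skipped = [i for i in targets if i not in fixated]
    return len(skipped) / len(targets)


def count_regressions(fixated_indices):
    """Count fixations landing on an earlier word than the previous fixation.

    Assumes a left-to-right script, so a lower word index means 'earlier'.
    """
    pairs = zip(fixated_indices, fixated_indices[1:])
    return sum(1 for previous, current in pairs if current < previous)


# Hypothetical sentence tagged as function ("F") or content ("C") words.
words = [("the", "F"), ("cat", "C"), ("sat", "C"),
         ("on", "F"), ("the", "F"), ("mat", "C")]
# Hypothetical scanpath: word index fixated at each step, in temporal order.
scanpath = [1, 2, 4, 5, 2, 5]

function_skip = skipping_rate(words, scanpath, "F")
content_skip = skipping_rate(words, scanpath, "C")
regressions = count_regressions(scanpath)
print(function_skip, content_skip, regressions)
```

Computing the same measures with the same definitions for every language is what would make the values comparable across sites; for right-to-left scripts the direction test in count_regressions would need to be reversed.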

References

[1] N. Hollenstein, R. Kasperė, and L. A. Jäger, “Opportunities and Challenges in the MultiplEYE Data Collection,” in MultiplEYE Mid-Term Conference 2024, Tirana, Albania, 2024. [Online]. doi: 10.23668/psycharchives.15479.

[2] M. L. Müller, “Data Management in Eye-Tracking Research: Crucial Steps and Challenges,” in MultiplEYE Mid-Term Conference 2024, Tirana, Albania, 2024. [Online]. doi: 10.23668/psycharchives.15484.

Published

2025-06-10