The Winograd Schema Challenge: A Study of Language Models and Common-sense Reasoning

Authors

  • Tamara Podolski, University of Ljubljana
  • Aleš Žagar, University of Ljubljana

Abstract

Introduction

The Winograd Schema Challenge (WSC) is a widely accepted benchmark for evaluating common-sense reasoning capabilities in artificial intelligence systems. It presents models with a pronoun resolution task that requires not only the ability to parse syntax but also the application of implicit world knowledge and pragmatic inference [1]. To illustrate, determining the referent in the sentence “Joan made sure to thank Susan for all the help she had given” [2] necessitates an understanding of social behaviour rather than reliance on structural or lexical cues alone. While large language models (LLMs) now perform well on the WSC task, their ability to justify their answers through generated explanations remains under-researched.
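
To make the task format concrete, the sketch below shows how a single WSC item might be represented and turned into a justification prompt. The dictionary layout and the build_prompt helper are illustrative assumptions made for this abstract, not the format of the Slovene dataset or the querying code actually used in the thesis:

# A hypothetical WSC item; the structure is an assumption for illustration.
wsc_item = {
    "sentence": "Joan made sure to thank Susan for all the help she had given.",
    "pronoun": "she",
    "candidates": ["Joan", "Susan"],
    "answer": "Susan",
}

def build_prompt(item: dict) -> str:
    """Ask a model to resolve the pronoun and justify its choice."""
    return (
        f"Sentence: {item['sentence']}\n"
        f"Question: To whom does '{item['pronoun']}' refer, "
        f"{' or '.join(item['candidates'])}?\n"
        "Answer with the name, then explain your reasoning in one sentence."
    )

print(build_prompt(wsc_item))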

Studying explanations in this particular context offers insight into the workings of artificial reasoning: it reveals decision pathways, exposes potential errors, and lays the foundation for transparency and trust.

Research Questions and Aims

This thesis aims to examine how LLMs generate and use explanations in the context of the WSC task. It highlights qualitative differences between human- and machine-generated explanations and explores how different prompting strategies affect explanation quality. The study also introduces methods for the automatic evaluation of these explanations, using metrics such as coherence, ontological alignment (how well an explanation maps entities and their relationships to real-world categories), and the inclusion of relevant knowledge types.
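
As a sketch of what differing prompting strategies can look like, the two functions below contrast a direct prompt with a chain-of-thought variant that asks the model to reason before answering. The specific strategies compared in the thesis are not listed in this abstract, so both variants are assumptions:

def direct_prompt(sentence: str, pronoun: str) -> str:
    """Ask for the referent and a short justification in one step."""
    return (
        f"Sentence: {sentence}\n"
        f"Who does '{pronoun}' refer to? Name the referent and briefly explain why."
    )

def chain_of_thought_prompt(sentence: str, pronoun: str) -> str:
    """Ask the model to spell out its world-knowledge reasoning first."""
    return (
        f"Sentence: {sentence}\n"
        f"Think step by step about which world knowledge determines who "
        f"'{pronoun}' refers to, then state the referent on the final line."
    )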

Methodology and Expected Results

The analysis is based on a Slovene WSC dataset that includes both human- and machine-generated explanations. Automated evaluation pipelines will be developed to assess explanation quality across these examples. Although results are currently pending, this research aims to offer insight into the effectiveness of prompting strategies and the general viability of automated evaluation methods for improving the quality, accuracy, and interpretability of machine-generated explanations.
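
A minimal sketch of what such a pipeline could look like is given below. The three metric functions are crude stand-ins invented for illustration, since the actual scoring methods are still being developed; only the overall shape of the pipeline should be read from this:

from statistics import mean

def score_coherence(explanation: str) -> float:
    """Stand-in heuristic: longer, non-empty explanations score higher."""
    return min(len(explanation.split()) / 10.0, 1.0)

def score_ontological_alignment(explanation: str, candidates: list[str]) -> float:
    """Stand-in heuristic: does the explanation mention a candidate referent?"""
    return 1.0 if any(c.lower() in explanation.lower() for c in candidates) else 0.0

def score_knowledge_types(explanation: str) -> float:
    """Stand-in heuristic: look for markers of world or social knowledge."""
    markers = ("because", "typically", "usually", "social")
    return 1.0 if any(m in explanation.lower() for m in markers) else 0.0

def evaluate(explanation: str, candidates: list[str]) -> dict:
    """Aggregate per-metric scores for one generated explanation."""
    scores = {
        "coherence": score_coherence(explanation),
        "ontological_alignment": score_ontological_alignment(explanation, candidates),
        "knowledge_types": score_knowledge_types(explanation),
    }
    scores["overall"] = mean(scores.values())
    return scores

print(evaluate("Susan is the one who gave the help, because Joan is thanking her.", ["Joan", "Susan"]))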

Limitations

Challenges include the language-specific nature of the WSC task, which depends heavily on how readily a given language uses, and leaves ambiguous, its pronouns. Additionally, the complexity of designing automated evaluation methods may hinder their ability to generalise across different models and languages, as such methods may struggle to account for context, nuance, and the subtle variations in meaning that humans resolve naturally.

References

[1] N. Isaak, “The Winograd Schema Challenge: Are You Sure That We Are on the Right Track?,” in Proceedings of the 14th International Conference on Agents and Artificial Intelligence, pp. 881–888, 2022. doi: 10.5220/0010923200003116.

[2] H. J. Levesque, E. Davis, and L. Morgenstern, “The Winograd Schema Challenge,” in Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, pp. 552–561, 2012.

Published

2025-06-10