“Understanding” in the Context of XAI


  • Tim Knapp, University of Vienna



This research project aims to investigate the phenomenon and notion of understanding in the context of Explainable Artificial Intelligence (XAI), considering understanding in both humans and AI systems. We will pursue two primary research questions:

RQ1: What is the most adequate notion of human understanding in the context of XAI?

RQ2: To what extent, if at all, can XAI methods be used to provide information about assumed understanding in AI systems?

This is a theoretical, exploratory research project. The main research method will be a literature review.


The goal of XAI is to make the internal workings of deep learning models transparent and understandable to humans. XAI encompasses a variety of methods, broadly divided into local and global ones [1]. While local methods focus on explaining a specific input instance and the model output associated with it, global methods aim to elucidate the model and its parameters as a whole. Although improving human understanding is unequivocally a central aim of XAI, there is still debate in the field about what the term “understanding” actually means, and consequently ambiguity about how best to improve it [1].
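To make the local/global distinction concrete, the following is a minimal sketch of a local, perturbation-based explanation method. The toy linear “model” and the occlusion-style attribution function are illustrative assumptions, not drawn from any specific XAI library; real local methods such as LIME or SHAP are considerably more sophisticated.

```python
def model(x):
    # Toy "black box": a fixed linear scorer standing in for a trained model.
    weights = [0.5, -2.0, 1.5]
    return sum(w * xi for w, xi in zip(weights, x))

def local_attribution(f, x):
    """Occlusion-style local explanation: for each feature i, report how much
    the output for this *specific* input changes when feature i is zeroed out."""
    baseline = f(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] = 0.0          # "remove" feature i
        scores.append(baseline - f(perturbed))
    return scores

x = [1.0, 1.0, 2.0]
print(local_attribution(model, x))  # per-feature contributions: [0.5, -2.0, 3.0]
```

A global method, by contrast, would try to characterize `model` over its whole input space (here, by reporting the weights themselves) rather than explaining one prediction.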

For example, there is a tension between interpretability and completeness, both of which arguably contribute to understandability. We could provide an incredibly complete, or transparent, account of a model by revealing all of its parameters and the mathematical operations involved, but such an account would do little to make the model more interpretable to humans. On the other hand, there is also a risk of omitting or even distorting key pieces of information in order to create a coherent explanation of a model’s behavior that humans can interpret well.


Proponents of the notion that large language models (LLMs) can understand language cite their exceptional performance on various benchmarks [2] and believe that LLMs’ internal representations capture important aspects of meaning in a way similar to human cognition. Critics, by contrast, argue that the statistical pattern-matching performed by LLMs cannot be meaningfully compared to the way humans understand language. To resolve the debate, some researchers have called for new methods “that can reveal the detailed mechanisms of understanding” [2] in such systems. We will investigate whether, and to what extent, XAI methods could bridge this gap.

For example, researchers at Anthropic have recently argued that a certain feature, or component, of an LLM that they were able to isolate and extract has “a deep connection” to “the model’s understanding of bugs in code” [3]. What is the underlying notion of understanding at play here, and how does it compare to understanding in humans?
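The technique behind such feature extraction is a sparse autoencoder trained on model activations. The following is a purely illustrative sketch of that idea, not Anthropic's implementation: the sizes are toy values and the weights are random stand-ins, whereas in practice they are learned from large numbers of activations so that each feature direction tends to correspond to an interpretable concept.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32          # toy sizes; real models are far larger

# Hypothetical weights: in real work these are *learned* so that features
# become interpretable; random values here only demonstrate the data flow.
W_enc = rng.normal(size=(n_features, d_model))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(d_model, n_features))

def extract_features(activation):
    """Encode an activation vector into sparse feature coefficients.
    The ReLU zeroes out most coefficients, so only a few features 'fire'."""
    return np.maximum(W_enc @ activation + b_enc, 0.0)

activation = rng.normal(size=d_model)     # stand-in for an LLM activation vector
features = extract_features(activation)
reconstruction = W_dec @ features         # decoder approximates the activation

active = np.flatnonzero(features)         # indices of "firing" features
print(f"{len(active)} of {n_features} features active")
```

Interpretability claims of the kind quoted above rest on inspecting which inputs make a given feature fire, which is precisely where questions about the underlying notion of “understanding” arise.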


[1] Longo, L. et al. (2024) ‘Explainable artificial intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions’, Information Fusion, 106. doi:10.1016/j.inffus.2024.102301.

[2] Mitchell, M. and Krakauer, D.C. (2023) ‘The debate over understanding in AI’s large language models’, Proceedings of the National Academy of Sciences, 120(13). doi:10.1073/pnas.2215907120.

[3] Templeton, A. et al. (2024) ‘Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet’, Transformer Circuits Thread.