Human Label Variation of Sexism in German Language News Forums


  • Louisa Venhoff, University of Vienna


Sexism remains a problem in online forums across platforms, hindering equal opportunities for participation across gender identities [1]. Several platforms attempt to address this issue with human moderators and automated moderation tools, with varying outcomes. However, these tools and their underlying guidelines are rarely available to the public, and differences in how individuals perceive sexism are seldom considered. Recent research indicates that annotation differences may result from subjectivity, lack of context, poorly described guidelines, annotators' own beliefs and backgrounds, annotator inattention, or ambiguity in the items themselves. These findings challenge the notion that inter-annotator disagreements are mere deviations from a single correct interpretation of a text [2].

These challenges highlight the need for publicly funded tools that detect hateful comments while accounting for differences in annotator perception and the resulting annotation variation. This project investigates the effects of inter-annotator disagreement in identifying and rating sexist comments, aiming to leverage such disagreement to develop more robust and transparent tools for detecting sexism in online forums. The hypothesis is that inter-annotator disagreement provides useful additional information for the development of sexism classifiers by offering further measures of certainty, thereby improving the evaluation of classifier performance.
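One possible form this certainty signal could take (a sketch, not a method prescribed by the project) is a soft label: instead of collapsing annotator ratings to a single majority class, the empirical rating distribution becomes the training target, and a classifier is scored against it with a soft cross-entropy loss. The function names below are hypothetical.

```python
import math
from collections import Counter

def soft_label(ratings, num_classes=5):
    """Empirical distribution over the 0-4 sexism scale from raw annotator ratings.

    E.g. ratings [1, 1, 2] -> [0.0, 2/3, 1/3, 0.0, 0.0].
    """
    counts = Counter(ratings)
    n = len(ratings)
    return [counts.get(c, 0) / n for c in range(num_classes)]

def soft_cross_entropy(pred_probs, target_dist, eps=1e-12):
    """Cross-entropy of predicted class probabilities against a soft label.

    Unlike accuracy against a hard majority label, this loss penalizes a
    classifier less when it spreads probability over classes that human
    annotators also disagreed about.
    """
    return -sum(t * math.log(p + eps) for t, p in zip(target_dist, pred_probs))
```

Under this framing, an item rated [1, 1, 2] contributes a target of two-thirds "mild" and one-third "moderate" rather than a hard label of 1, so annotator uncertainty is carried directly into training and evaluation.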

This work is part of the shared task “GERMS-DETECT” by the Austrian Research Institute for Artificial Intelligence (OFAI). The objective is to develop a classifier to rate and detect sexist comments in the forum of a widely read Austrian newspaper. OFAI established annotation guidelines and collected 7,995 comments, each rated by three to nine annotators on an ordinal scale from 0 (no sexism) to 4 (severe sexism).
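Given data of this shape, where each comment carries between three and nine ordinal ratings, one simple way to summarize per-item annotation certainty is one minus the normalized Shannon entropy of the rating distribution. This is an illustrative sketch, not part of the shared task definition; note that entropy treats the scale as nominal (it cannot distinguish ratings of [0, 4, 0, 4] from [1, 2, 1, 2]), so a spread measure such as the rating variance is a natural ordinal-aware alternative.

```python
import math
from collections import Counter

def item_certainty(ratings, num_classes=5):
    """1 minus the normalized Shannon entropy of an item's rating distribution.

    Returns 1.0 when all annotators agree and approaches 0.0 when the
    ratings are spread evenly across all five levels of the 0-4 scale.
    """
    counts = Counter(ratings)
    n = len(ratings)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(num_classes)  # uniform distribution over 5 levels
    return 1.0 - entropy / max_entropy
```

Sorting the 7,995 comments by this score would surface the most contested items for the qualitative (deductive and inductive) analysis described below.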

First, the effect of annotation differences will be investigated at the item level by analyzing which items annotators disagree on most, and why, using both deductive and inductive category systems. This quantifies annotation certainty per item and establishes a baseline of annotation variability by comment. Subsequently, the effects of annotator variables will be examined to understand why some annotators behave more similarly than others. Agreement will be quantified using Krippendorff’s Alpha to identify the drivers of inter-annotator disagreement and their potential causes. This analysis will inform how inter-annotator disagreement can improve the quality of the classifier’s training data.
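As a concrete illustration of the agreement statistic, Krippendorff’s Alpha with an interval distance metric (a common simplification of the full ordinal variant for a numeric 0-4 scale) can be computed from scratch as below; this sketch handles the variable number of raters per item, and libraries such as the `krippendorff` Python package additionally implement the exact ordinal metric.

```python
from itertools import combinations

def krippendorff_alpha_interval(items):
    """Krippendorff's Alpha for interval data, alpha = 1 - D_o / D_e.

    `items` is a list of lists: the numeric ratings each item received.
    Items with fewer than two ratings are unpairable and are dropped.
    """
    units = [u for u in items if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable ratings

    # Observed disagreement: squared differences within each item,
    # counted as ordered pairs and weighted by 1 / (m - 1).
    d_obs = 0.0
    for u in units:
        m = len(u)
        d_obs += 2 * sum((a - b) ** 2 for a, b in combinations(u, 2)) / (m - 1)
    d_obs /= n

    # Expected disagreement: squared differences over all ratings pooled,
    # as if values were assigned to items at random.
    values = [v for u in units for v in u]
    d_exp = 2 * sum((a - b) ** 2 for a, b in combinations(values, 2)) / (n * (n - 1))
    if d_exp == 0:
        return 1.0  # every rating identical: no disagreement possible
    return 1.0 - d_obs / d_exp
```

Alpha is 1.0 under perfect agreement, near 0.0 when agreement matches chance, and negative when annotators disagree systematically; computing it per annotator subgroup is one way to locate the drivers of disagreement mentioned above.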

Ultimately, this analysis may allow inter-annotator disagreement to be integrated into the labeling of training data, potentially leading to a more equitable and transparent online moderation tool and shedding light on how sexism is perceived in the context of online moderation.


[1] J. Petrak and B. Krenn, "Misogyny classification of German newspaper forum comments," arXiv preprint arXiv:2211.17163, Nov. 2022. [Online]. Available: https://arxiv.org/abs/2211.17163v1

[2] M. Sandri, E. Leonardelli, S. Tonelli, and E. Jezek, "Why Don’t You Do It Right? Analysing Annotators’ Disagreement in Subjective Tasks," in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 2428–2441, doi: 10.18653/v1/2023.eacl-main.178.