Disentangling Linguistic and Cultural Biases in Large Language Models: A Comparative Study
Introduction
Large Language Models (LLMs) are rapidly becoming ubiquitous tools for communication, research, and decision-making across disciplines and industries. However, they inherit systematic biases from their training data, raising concerns about fairness and representation. Although LLMs are often perceived as neutral tools, training on large, predominantly Western datasets embeds cultural norms and ideological assumptions. This can reinforce dominant worldviews while marginalising non-Western perspectives, shaping how models portray social roles and global narratives. This study investigates two interrelated yet analytically distinct types of bias in LLMs: linguistic bias, which is embedded in the structure of language and its representation within the model's architecture, and cultural bias, which emerges through normative assumptions elicited during prompting. By distinguishing between these forms, we aim to contribute to the development of targeted strategies for bias mitigation, model auditing, and transparency.
Theoretical Framework
Linguistic bias refers to systematic patterns that arise from word co-occurrence, syntactic structures, and embedding-level associations in training corpora. It can result in the underrepresentation or mischaracterisation of dialects and non-standard linguistic forms, and it can reflect societal stereotypes [1]. Cultural bias, by contrast, is reflected in prompt-induced behaviour that draws on specific cultural scripts, stereotypes, or political values; it is more dynamic, sensitive to framing, and often reinforces hegemonic worldviews [2]. We hypothesise that LLMs encode these two forms of bias at different levels of their architecture. In our exploratory design, a language condition is intended to reflect linguistic bias, while a roleplay condition aims to surface cultural bias through interpretive, value-laden tasks. By comparing the outputs of these two branches on the same tasks, we aim to disentangle how each form of bias manifests in model behaviour.
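As a concrete illustration of the two branches, the sketch below shows how the language and roleplay conditions could be expressed as prompt templates. The wording, the example task, and the build_prompt helper are hypothetical placeholders, not the study's final instruments.

```python
# Illustrative prompt templates for the two experimental branches; the exact
# wording, the example task, and this helper are hypothetical placeholders.

ROLEPLAY_TEMPLATE = "You are {identity}, but please respond in English.\n\n{task}"
LANGUAGE_TEMPLATE = "{language_instruction}\n\n{task}"


def build_prompt(branch: str, target: str, task: str) -> str:
    """Compose a prompt for the roleplay (cultural) or language (linguistic) branch."""
    if branch == "roleplay":
        return ROLEPLAY_TEMPLATE.format(identity=target, task=task)
    if branch == "language":
        return LANGUAGE_TEMPLATE.format(language_instruction=target, task=task)
    raise ValueError(f"unknown branch: {branch}")


# The roleplay branch varies the cultural framing while holding the output
# language (English) constant; the language branch varies the output language
# while leaving the cultural framing implicit.
print(build_prompt("roleplay", "German", "Tell a story about the people in this picture."))
print(build_prompt("language", "Bitte antworte auf Deutsch.", "Tell a story about the people in this picture."))
```

Keeping the task text identical across branches is what allows the two forms of bias to be compared on the same outputs.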
Methodology
We will conduct a comparative analysis of two LLMs (GPT-4 and Gemini) across two experimental branches: we prompt the models either to roleplay various cultural identities (e.g., "You are German, but please respond in English") or to respond in different languages (e.g., "Bitte antworte auf Deutsch", i.e., "Please respond in German"). Our target languages are English, Hindi, German, Italian, and Hungarian. Following the initial prompt, we administer a Thematic Apperception Test (TAT) and two quantitative measures: Triandis' Individualism–Collectivism Scale and the Cultural Tightness–Looseness Scale. TAT responses will be scored using the Social Cognition and Object Relations Scale–Global Rating Method (SCORS-G). To explore how model creativity affects bias expression, each condition will be tested at three temperature levels. Output text will be analysed using NLP classifiers, sentiment analysis, and stance detection, with statistical tests (e.g., t-tests, ANOVA, effect size measures) applied to compare conditions. The testing framework will be implemented in Python using API calls; a minimal sketch of the planned harness is given below. As of the time of submission, data collection and testing have not yet commenced.
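To make the planned pipeline concrete, the following sketch shows how the GPT-4 branch could be driven, assuming access through the official openai Python client and scipy for the statistical comparison. The prompt wording, temperature values, task text, and scoring step are placeholders; the Gemini branch, the questionnaire instruments, and SCORS-G scoring are omitted.

```python
# Minimal sketch of the planned testing harness (GPT-4 branch only), assuming
# the official `openai` client and `scipy` for statistics. Prompt wording,
# temperature values, and the task text are illustrative placeholders.

from openai import OpenAI
from scipy import stats

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TEMPERATURES = [0.2, 0.7, 1.2]  # illustrative low / medium / high settings
TASK = "Tell a story about the people in this picture."  # placeholder TAT-style cue

CONDITIONS = {
    # One illustrative target per branch; the full design covers English,
    # Hindi, German, Italian, and Hungarian.
    "roleplay": "You are German, but please respond in English.",
    "language": "Bitte antworte auf Deutsch.",
}


def query_gpt4(prompt: str, temperature: float) -> str:
    """Send a single prompt to GPT-4 and return the reply text."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content


def collect_responses() -> dict:
    """Run every condition at every temperature level and keep the raw text."""
    responses = {}
    for branch, instruction in CONDITIONS.items():
        for temp in TEMPERATURES:
            responses[(branch, temp)] = query_gpt4(f"{instruction}\n\n{TASK}", temp)
    return responses


# Once responses are scored (e.g. SCORS-G ratings or scale totals expressed as
# numbers), the two branches can be compared with a t-test and the temperature
# levels with a one-way ANOVA, for example:
#   t, p = stats.ttest_ind(roleplay_scores, language_scores)
#   f, p = stats.f_oneway(scores_low, scores_mid, scores_high)
```

An analogous wrapper around the Gemini API, together with the sentiment and stance classifiers applied to the collected text, would slot into the same loop.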
References
[1] E. Fleisig, G. Smith, M. Bossi, I. Rustagi, X. Yin, and D. Klein, “Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination,” arXiv, 2024. doi: 10.48550/arXiv.2406.08818.
[2] A. Birhane, S. Prabhu, and E. Kahembwe, “The Values Encoded in Machine Learning Research,” in Proc. ACM Conf. on Fairness, Accountability, and Transparency (FAccT), 2021.
License
Copyright (c) 2025 Philip Nordt

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.