SlovakBabyLM: A Slovak Language Model Based on Cognitive Plausibility
Abstract
Within the Slavic language family, there is a current trend towards building language-specific language models (LMs) based on the BERT architecture [1]. However, these models do not draw on psycholinguistic knowledge and do not address the cognitive plausibility of LMs. Applying cognitive principles may make model training more efficient, which in turn may save resources in subsequent research and in the creation of new LMs.
We use the principle of curriculum learning (CL), which resembles the gradual development of human speech. In CL, the complexity and amount of information presented to the model increase as training progresses. We use it to adapt LM pre-training to the way humans come to produce language, which can be considered a step towards higher cognitive plausibility of the language model. From a technical point of view, CL requires less pre-training data than the ordinary pre-training of a BERT model (3B words). Successful models in the BabyLM Challenge show a better ability to behave in accordance with the structure of English, as measured by the BLiMP evaluation task [2].
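The idea of presenting easier material first and enlarging the training pool over time can be illustrated with a minimal Python sketch. It is not the training procedure of this work: the difficulty measure (sentence length in words), the function names `difficulty` and `curriculum_stages`, and the toy sentences are all assumptions chosen for illustration.

```python
from typing import List


def difficulty(sentence: str) -> int:
    """Hypothetical difficulty proxy: number of whitespace-separated words."""
    return len(sentence.split())


def curriculum_stages(sentences: List[str], n_stages: int) -> List[List[str]]:
    """Return cumulative training pools, easiest sentences first.

    Stage k holds the easiest k/n_stages share of the data, so both the
    difficulty and the amount of data grow from one stage to the next.
    """
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    ordered = sorted(sentences, key=difficulty)  # stable sort keeps ties in input order
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = max(1, len(ordered) * k // n_stages) if ordered else 0
        stages.append(ordered[:cutoff])
    return stages


# Toy corpus (illustrative sentences, not taken from any real dataset).
corpus = [
    "Mama má Emu.",
    "Pes spí.",
    "Kde bolo, tam bolo, bol raz jeden kráľ.",
    "Jazykové modely sa trénujú na veľkých textových korpusoch.",
]
stages = curriculum_stages(corpus, n_stages=2)
```

In this sketch the first stage contains only the two shortest sentences and the second stage contains the whole toy corpus. A real curriculum would likely use a richer difficulty signal than sentence length.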
To evaluate CL as a technique that increases training efficiency, we compare the developed model with the existing SlovakBERT model and with multilingual models. We also aim to create a dataset of 100 million words consisting of child-directed language, fairytales, transcribed speech, and scientific articles. This dataset will be used to pre-train a RoBERTa model. All other attributes, such as the number of parameters and the tuning procedure, will be the same as for SlovakBERT, so that the models are as comparable as possible [1].
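Assembling a corpus from several text types under a fixed word budget could look roughly like the following Python sketch. The round-robin selection strategy, the function name `build_corpus`, the source labels, and the tiny budget are assumptions for illustration only; the actual dataset construction may differ.

```python
from typing import Dict, List


def build_corpus(sources: Dict[str, List[str]], budget_words: int) -> List[str]:
    """Take documents from each source in turn until the word budget would be exceeded."""
    selected: List[str] = []
    total = 0
    iterators = {name: iter(docs) for name, docs in sources.items()}
    while iterators:
        for name in list(iterators):
            doc = next(iterators[name], None)
            if doc is None:
                del iterators[name]  # this source is exhausted
                continue
            n_words = len(doc.split())
            if total + n_words > budget_words:
                return selected  # stop before exceeding the budget
            selected.append(doc)
            total += n_words
    return selected


# Placeholder documents; a real run would use a budget of 100_000_000 words.
toy_sources = {
    "child_directed": ["a b c", "d e"],
    "fairytales": ["f g h i"],
    "transcribed_speech": ["j k"],
}
toy_corpus = build_corpus(toy_sources, budget_words=9)
toy_word_count = sum(len(doc.split()) for doc in toy_corpus)
```

Round-robin selection is only one option; fixed proportions per source would be an equally reasonable design if the balance between text types matters.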
The main motivation for this work lies in the linguistic differences between English and Slovak. Demonstrating CL in another language may therefore bring new insights to the relevant scientific fields. With this effort, we aim to advance the field of psycholinguistically oriented language models and to offer insights that can be applied to other Slavic languages and beyond this language group.
References
[1] Pikuliak, M., Grivalský, Š., Konôpka, M., Blšták, M., Tamajka, M., Bachratý, V., Šimko, M., Balážik, P., Trnka, M., & Uhlárik, F. (2021). SlovakBERT: Slovak masked language model. arXiv preprint arXiv:2109.15254.
[2] Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., & Cotterell, R. (2023). Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning.