The Russian language corpus and a neural network to analyse Internet tweet reports about Covid-19

Moloshnikov, I.; Naumov, A.; Levochkina, A.; Rybka, R.; Sboev, A.; Сбоев, Александр Георгиевич

dc.contributor.author	Moloshnikov, I.
dc.contributor.author	Naumov, A.
dc.contributor.author	Levochkina, A.
dc.contributor.author	Rybka, R.
dc.contributor.author	Sboev, A.
dc.contributor.author	Сбоев, Александр Георгиевич
dc.date.accessioned	2024-12-26T08:18:43Z
dc.date.available	2024-12-26T08:18:43Z
dc.date.issued	2022
dc.description.abstract	© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). This work aims to create a tool for filtering Twitter messages by whether they mention coronavirus disease. For this purpose, a corpus of Russian-language tweets was created. It comprises an annotated part of 10 thousand tweets, manually divided into several classes with different levels of confidence (potentially have covid, have covid now, other cases), and an unannotated part of 2 million tweets on the topic of the pandemic. The paper presents the process of creating the corpus, from data collection through preliminary filtering to annotation according to the presence of a disease description. Machine learning methods were compared on the tweet classification task. A model based on the XLM-RoBERTa architecture with additional training on the corpus of unannotated tweets achieves an F1 score of 0.85 on the binary classification task ("potentially have covid" and "have covid now" vs "other"). This is 12% higher relative to the simplest model, which uses TF-IDF encoding and an SVM classifier, and 5% higher relative to the RuDR-BERT-based model. The created toolkit will expand the feature space of models for predicting the spread of coronavirus infection and other pandemics by adding the dynamics of discussion on social networks, which characterizes people's attitudes towards it.
dc.identifier.citation	The Russian language corpus and a neural network to analyse Internet tweet reports about Covid-19 / Moloshnikov, I. [et al.] // Proceedings of Science. - 2022. - 410. - 10.22323/1.410.0017
dc.identifier.doi	10.22323/1.410.0017
dc.identifier.uri	https://www.doi.org/10.22323/1.410.0017
dc.identifier.uri	https://www.scopus.com/record/display.uri?eid=2-s2.0-85124089684&origin=resultslist
dc.identifier.uri	https://openrepository.mephi.ru/handle/123456789/28722
dc.relation.ispartof	Proceedings of Science
dc.title	The Russian language corpus and a neural network to analyse Internet tweet reports about Covid-19
dc.type	Conference Paper
dspace.entity.type	Publication
oaire.citation.volume	410
relation.isAuthorOfPublication	fc2d63d7-5260-41ba-a952-0420c8848b13
relation.isAuthorOfPublication.latestForDiscovery	fc2d63d7-5260-41ba-a952-0420c8848b13
relation.isOrgUnitOfPublication	ba0b4738-e6bd-4285-bda5-16ab2240dbd1
relation.isOrgUnitOfPublication.latestForDiscovery	ba0b4738-e6bd-4285-bda5-16ab2240dbd1
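
The binary evaluation described in the abstract merges the two disease-related classes into one positive class and compares it against everything else. The snippet below is a minimal sketch of that merging step and of an F1 computation, not the authors' evaluation code. The label strings are taken from the class names in the abstract, the function names and the toy data are hypothetical, and the choice of positive-class F1 is an assumption, since the abstract does not state which F1 averaging was used.

```python
# Assumption: these strings mirror the class names listed in the abstract.
POSITIVE_LABELS = {"potentially have covid", "have covid now"}


def to_binary(label):
    """Map a multi-class label to 1 (disease-related) or 0 (other)."""
    return 1 if label in POSITIVE_LABELS else 0


def binary_f1(gold_labels, predicted_labels):
    """F1 of the merged positive class (assumed metric; hypothetical helper)."""
    gold = [to_binary(x) for x in gold_labels]
    pred = [to_binary(x) for x in predicted_labels]
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only; these are not tweets or results from the corpus.
gold = ["have covid now", "potentially have covid", "other cases",
        "other cases", "have covid now"]
pred = ["potentially have covid", "other cases", "other cases",
        "have covid now", "have covid now"]

score = binary_f1(gold, pred)
print(round(score, 3))
```

Note that a prediction of "potentially have covid" for a tweet annotated "have covid now" counts as correct under this scheme, because both labels fall into the same merged class.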
