Search | VHL Regional Portal

A Sesotho news headlines dataset for sentiment analysis.

Mokhosi, Refuoe; Shivachi, Casper-Shikali; Sethobane, Matello.

Data Brief ; 54: 110371, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38590621

ABSTRACT

Sentiment Analysis (SA) is a subfield of Natural Language Processing (NLP) that has become a promising research area, enabling the provision of language-specific services. Although research in high-resource languages such as English and Chinese has achieved promising results, research in low-resource African languages such as Sesotho is still in its infancy due to limited text and speech datasets. This study contributes in this regard by making available the Sesotho News (SN) dataset, an annotated dataset for the SA and Aspect-Based Sentiment Analysis (ABSA) tasks. This dataset may be used for NLP research to benefit 1.85 million Sesotho speakers in Lesotho and 11.5 million speakers in South Africa. The dataset includes 4651 headlines for the ABSA task and 2401 headlines for the SA task, written in Lesotho's orthography of Sesotho. The news headlines were collected from Sesotho online newspapers and then annotated for the ABSA and SA tasks. Spearman's correlation and Cohen's Kappa show good agreement between the annotators, implying that the SN dataset is of gold-standard quality.
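As an illustration of the agreement metric named in the abstract, the following is a minimal sketch of Cohen's Kappa for two annotators, computed from the standard definition (observed agreement corrected for chance agreement). The function name `cohen_kappa` and the toy labels are assumptions made for this example; they are not taken from the SN dataset or from the authors' code.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of each
    # annotator's marginal probability of using that label.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (invented for this sketch, not from the SN dataset).
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

For the toy labels, observed agreement is 0.75 and chance agreement is 0.5, so the Kappa is 0.5. Spearman's correlation, the other metric the abstract mentions, is not covered by this sketch.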

Enhancing African low-resource languages: Swahili data for language modelling.

Shikali, Casper S; Mokhosi, Refuoe.

Data Brief ; 31: 105951, 2020 Aug.

Article in English | MEDLINE | ID: mdl-32671155

ABSTRACT

Language modelling using neural networks requires adequate data to guarantee quality word representations, which are important for natural language processing (NLP) tasks. However, African languages, Swahili in particular, have been disadvantaged, and most of them are classified as low-resource languages because of inadequate data for NLP. In this article, we derive and contribute an unannotated Swahili dataset, a Swahili syllabic alphabet and a Swahili word analogy dataset to address the need for language processing resources, especially for low-resource languages. We derive the unannotated Swahili dataset by pre-processing raw Swahili data using a Python script, formulate the syllabic alphabet, and develop the Swahili word analogy dataset based on an existing English dataset. We envisage that the datasets will support not only language models but also downstream NLP tasks such as part-of-speech tagging, machine translation and sentiment analysis.
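The abstract does not describe what its Python pre-processing script does, so the following is only a guess at the kind of cleaning such a script might perform: lowercasing, dropping digits and punctuation, and collapsing whitespace. The function name `clean_line` and every rule in it are assumptions for illustration, not the authors' method. The apostrophe is kept because it occurs inside Swahili words such as "ng'ombe".

```python
import re

# Assumed cleaning rules (hypothetical; the source does not specify them).
_NON_LETTER = re.compile(r"[^a-z'\s]")   # anything but a-z, apostrophe, whitespace
_WHITESPACE = re.compile(r"\s+")         # runs of whitespace


def clean_line(line):
    """Return a lowercased, letters-only version of one raw text line."""
    lowered = line.lower()
    # Replace digits and punctuation with spaces so words stay separated.
    letters_only = _NON_LETTER.sub(" ", lowered)
    # Collapse repeated whitespace and trim the ends.
    return _WHITESPACE.sub(" ", letters_only).strip()


cleaned = clean_line("Habari, za  Asubuhi! 2020")
```

Here `cleaned` is the string "habari za asubuhi". A real pipeline would likely need further steps (for example deduplication or sentence splitting), which this sketch leaves out.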
