Hate speech detection in the Arabic language: corpus design, construction, and evaluation.

Ahmad, Ashraf; Azzeh, Mohammad; Alnagi, Eman; Abu Al-Haija, Qasem; Halabi, Dana; Aref, Abdullah; AbuHour, Yousef

Affiliation

Ahmad A; Department of Computer Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan.
Azzeh M; Department of Data Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan.
Alnagi E; Department of Computer Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan.
Abu Al-Haija Q; Department of Cybersecurity, Faculty of Computer and Information Technology, Jordan University of Science and Technology, Irbid, Jordan.
Halabi D; SAE Institute, Luminus Technical University College (LTUC), Amman, Jordan.
Aref A; Department of Computer Science, Princess Sumaya University for Technology (PSUT), Amman, Jordan.
AbuHour Y; Department of Basic Sciences, Princess Sumaya University for Technology (PSUT), Amman, Jordan.

Front Artif Intell ; 7: 1345445, 2024.

Article in English | MEDLINE | ID: mdl-38444962

ABSTRACT

Hate speech detection in Arabic presents a multifaceted challenge because of the language's broad and diverse linguistic terrain. With its multiple dialects and rich cultural subtleties, Arabic requires particular measures to address online hate speech successfully. To address this issue, academics and developers have used natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods were hampered by the lack of a comprehensive dataset/corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using this dataset, we also characterize the performance of multiple machine learning models for hate speech identification in tweets written in the Jordanian Arabic dialect. Specifically, the Word2Vec, TF-IDF, and AraBERT text representation models were applied to produce word vectors, which supply the classification models with vector representations of the text. Seven machine learning classifiers were then evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation showed that, in this challenging and unstructured setting, the gathered and annotated dataset was effective and produced encouraging results. This will enable academics to delve further into this crucial field of study.
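
The representation-then-classification pipeline described in the abstract can be illustrated with a minimal sketch. It pairs one of the named representations (TF-IDF) with one of the named classifiers (Naive Bayes), using only the Python standard library. It is not the authors' implementation: the function and class names, the smoothed IDF formula, the whitespace tokenizer, and the four English toy sentences with two labels are all assumptions made for illustration, and none of the data comes from the 403,688-tweet corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain whitespace splitting. Real Arabic tweets would need
    # normalization (diacritics, elongation, emoji) before this step.
    return text.split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every training term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Map one document to a sparse, L2-normalized TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


class NaiveBayes:
    """Multinomial Naive Bayes over TF-IDF weights with add-one smoothing."""

    def fit(self, vectors, labels):
        self.vocab = {t for v in vectors for t in v}
        self.class_counts = Counter(labels)
        self.term_weights = defaultdict(lambda: defaultdict(float))
        for vec, label in zip(vectors, labels):
            for term, weight in vec.items():
                self.term_weights[label][term] += weight
        return self

    def predict(self, vec):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            weights = self.term_weights[label]
            denom = sum(weights.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for term, w in vec.items():
                score += w * math.log((weights.get(term, 0.0) + 1.0) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data, NOT taken from the paper's corpus.
train_docs = [
    "you are awful and stupid",
    "awful stupid people",
    "nice weather today",
    "the weather is nice",
]
train_labels = ["negative", "negative", "neutral", "neutral"]

idf = fit_idf(train_docs)
model = NaiveBayes().fit([tfidf_vector(d, idf) for d in train_docs], train_labels)
print(model.predict(tfidf_vector("stupid awful comment", idf)))
```

Swapping the representation (for example, averaged Word2Vec embeddings or AraBERT sentence vectors) or the classifier would change only the `tfidf_vector` step or the `NaiveBayes` class; the overall structure of vectorizing first and classifying second stays the same.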

Keywords

Arabic hate speech; Arabic hate speech corpus; Arabic hate speech detection; machine learning; natural language processing (NLP)

Full text: 1 Collection: 01-international Database: MEDLINE Language: English Journal: Front Artif Intell Year: 2024 Document type: Article Country of affiliation: Jordan Country of publication: Switzerland
