A bi-annotated Malay-English code-switching (Manglish) dataset of X posts for biological gender identification and authorship attribution.

Maskat, Ruhaila; Azman, Norazmiera Ayunie; Nulizairos, Nur Shaheera Shastera; Zahidin, Nurul Athirah; Mahadi, Adibah Humairah; Norshamsul, Siti Rubaya; Sharif, Mohd Mukhlis Mohd; Mahdin, Hairulnizam

Affiliation

Maskat R; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Azman NA; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Nulizairos NSS; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Zahidin NA; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Mahadi AH; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Norshamsul SR; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Sharif MMM; College of Computing, Informatics and Mathematics of Universiti Teknologi MARA Shah Alam, 40450, Selangor, Malaysia.
Mahdin H; Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor, Malaysia.

Data Brief ; 52: 110034, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38282916

ABSTRACT

Low-resource languages, like Malay, face the threat of extinction when linguistic resources become scarce. This paper addresses that scarcity by contributing to the inventory of low-resource language data, focusing specifically on Malay-English. Its speakers are primarily located in Malaysia, Indonesia, Brunei, and Singapore. As global adoption of second languages and social media usage increases, language code-switching, such as Spanglish and Chinglish, becomes more prevalent. In the case of Malay-English, this phenomenon is termed Manglish. To enhance the status of the Malay language and support its transition out of the low-resource category, this text corpus, with binary annotations for biological gender and anonymized author identities, is presented. The bi-annotated dataset has applications in various fields, including the investigation of cyberbullying, combating gender bias, and providing targeted recommendations for gender-specific products. The corpus can be used with either annotation alone or with both combined. The dataset comprises posts from 50 Malaysian public figures, equally split between biological males and females. It contains a total of 709,012 raw posts from X (formerly Twitter), with a relatively balanced distribution of 53.72% from biological female authors and 46.28% from biological male authors. The posts were collected through the Twitter API. After pre-processing, the total fell to 650,409 posts, widening the gap between the genders to 56.88% for biological female authors and 43.12% for biological male authors. This dataset is a valuable resource for researchers in Malay-English code-switching Natural Language Processing (NLP) and can be used to train or enhance existing and future Manglish language transformers.
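The class balance reported in the abstract can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the totals and percentages come from the abstract, while the per-gender post counts are estimates derived from those rounded percentages and may differ slightly from the dataset's actual counts. All function and variable names are hypothetical.

```python
# Totals and percentages as reported in the abstract.
RAW_TOTAL = 709_012
CLEAN_TOTAL = 650_409
RAW_SHARE = {"female": 53.72, "male": 46.28}    # percent of raw posts
CLEAN_SHARE = {"female": 56.88, "male": 43.12}  # percent after pre-processing


def approx_counts(total, share_pct):
    """Estimate per-group post counts from a total and rounded percentages."""
    return {group: round(total * pct / 100) for group, pct in share_pct.items()}


def gap_points(share_pct):
    """Female-minus-male gap in percentage points."""
    return round(share_pct["female"] - share_pct["male"], 2)


raw_counts = approx_counts(RAW_TOTAL, RAW_SHARE)        # estimates, not reported values
clean_counts = approx_counts(CLEAN_TOTAL, CLEAN_SHARE)  # estimates, not reported values
removed = RAW_TOTAL - CLEAN_TOTAL                       # posts dropped by pre-processing

print("raw:", raw_counts, "gap (pp):", gap_points(RAW_SHARE))
print("clean:", clean_counts, "gap (pp):", gap_points(CLEAN_SHARE))
print("removed:", removed, f"({removed / RAW_TOTAL:.2%} of raw posts)")
```

Anyone training a gender classifier on the cleaned corpus may want to account for this imbalance, for example through class weighting or stratified splits; the abstract does not say which approach, if any, the authors recommend.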

Keywords

Authorship attribution; Biological gender identification; Code-switching; Malay-English; Manglish; NLP; Text analytics

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Study type: Diagnostic_studies | Aspect: Determinantes_sociais_saude | Language: En | Journal: Data Brief | Year: 2024 | Document type: Article | Country of affiliation: Malaysia | Country of publication: Netherlands
