Improving Text-Independent Forced Alignment to Support Speech-Language Pathologists with Phonetic Transcription.

Li, Ying; Wohlan, Bryce Johannas; Pham, Duc-Son; Chan, Kit Yan; Ward, Roslyn; Hennessey, Neville; Tan, Tele

Affiliation

Li Y; School of EECMS, Curtin University, Bentley, WA 6102, Australia.
Wohlan BJ; School of EECMS, Curtin University, Bentley, WA 6102, Australia.
Pham DS; School of EECMS, Curtin University, Bentley, WA 6102, Australia.
Chan KY; School of EECMS, Curtin University, Bentley, WA 6102, Australia.
Ward R; School of Allied Health, Curtin University, Bentley, WA 6102, Australia.
Hennessey N; School of Allied Health, Curtin University, Bentley, WA 6102, Australia.
Tan T; School of EECMS, Curtin University, Bentley, WA 6102, Australia.

Sensors (Basel); 23(24), 2023 Dec 06.

Article in English | MEDLINE | ID: mdl-38139496

ABSTRACT

Problem:

Phonetic transcription is crucial in diagnosing speech sound disorders (SSDs), but its quality depends on transcriber experience and is susceptible to perceptual bias. Current forced alignment (FA) tools, which annotate audio files to determine what was spoken and where it occurs, often require a manual transcription as input, which limits their effectiveness.

Method:

We introduce a novel, text-independent forced alignment model that autonomously recognises individual phonemes and their boundaries, addressing these limitations. Our approach leverages a pre-trained wav2vec 2.0 model to segment speech into tokens and recognise them automatically. To identify phoneme boundaries accurately, we use an unsupervised segmentation tool, UnsupSeg. Segments are then labelled by nearest-neighbour classification against the wav2vec 2.0 labels taken before connectionist temporal classification (CTC) collapse, with each segment assigned the class label that has the maximum overlap. Additional post-processing, including overfitting cleaning and voice activity detection, is applied to improve segmentation.
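The maximum-overlap labelling step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `label_segments`, the 20 ms frame stride, the `<pad>` blank token, and the choice to ignore blank frames are all assumptions made for illustration. The boundaries stand in for the output of a segmenter such as UnsupSeg, and the frame labels stand in for per-frame wav2vec 2.0 predictions before CTC collapse.

```python
from collections import defaultdict

FRAME_MS = 20  # assumed frame stride in milliseconds (illustrative)


def label_segments(boundaries_ms, frame_labels, frame_ms=FRAME_MS, blank="<pad>"):
    """Give each segment the frame label with the largest total time overlap.

    boundaries_ms: sorted segment boundaries in milliseconds (hypothetical
                   segmenter output), e.g. [0, 70, 150, 200].
    frame_labels:  one label per frame, before CTC collapse (hypothetical).
    Blank frames are ignored (an assumption of this sketch); a segment that
    overlaps only blank frames gets the label None.
    """
    segments = []
    for start, end in zip(boundaries_ms[:-1], boundaries_ms[1:]):
        overlap = defaultdict(int)
        first = start // frame_ms
        last = min(len(frame_labels), end // frame_ms + 1)
        for i in range(first, last):
            frame_start, frame_end = i * frame_ms, (i + 1) * frame_ms
            # Duration shared by this frame and the current segment.
            shared = min(end, frame_end) - max(start, frame_start)
            if shared > 0 and frame_labels[i] != blank:
                overlap[frame_labels[i]] += shared
        label = max(overlap, key=overlap.get) if overlap else None
        segments.append((start, end, label))
    return segments


# Toy input: ten 20 ms frames and three segments.
frames = ["<pad>", "h", "h", "<pad>", "e", "e", "e", "l", "<pad>", "o"]
aligned = label_segments([0, 70, 150, 200], frames)
```

In this toy input the middle segment (70 to 150 ms) overlaps three "e" frames and half of one "l" frame, so it is labelled "e". Keeping the uncollapsed frame labels is what makes this possible, since collapsing would discard the timing needed to measure overlap.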

Results:

We benchmarked our model against existing methods using the TIMIT dataset for normal speakers and, for the first time, evaluated its performance on the TORGO dataset containing SSD speakers. Our model demonstrated competitive performance, achieving a harmonic mean score of 76.88% on TIMIT and 70.31% on TORGO.

Implications:

This research presents a significant advancement in the assessment and diagnosis of SSDs, offering a more objective and less biased approach than traditional methods. Our model's effectiveness, particularly with SSD speakers, opens new avenues for research and clinical application in speech pathology.

Subject(s)

Speech Perception; Voice; Humans; Phonetics; Speech; Pathologists

Keywords

forced alignment; phoneme segmentation; phonological disorders; speech sound disorders; speech therapy; wav2vec 2.0

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Speech Perception / Voice Limits: Humans Language: English Journal: Sensors (Basel) Year: 2023 Document type: Article Affiliation country: Australia Publication country: Switzerland
