ViralVectors: compact and scalable alignment-free virome feature generation.

Ali, Sarwan; Chourasia, Prakash; Tayebi, Zahra; Bello, Babatunde; Patterson, Murray

Affiliation

Ali S; Georgia State University, Atlanta, GA, USA. sali85@student.gsu.edu.
Chourasia P; Georgia State University, Atlanta, GA, USA.
Tayebi Z; Georgia State University, Atlanta, GA, USA.
Bello B; Georgia State University, Atlanta, GA, USA.
Patterson M; Georgia State University, Atlanta, GA, USA.

Med Biol Eng Comput ; 61(10): 2607-2626, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37395885

ABSTRACT

The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus. This volume will continue to grow geometrically for SARS-CoV-2 and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or to regions of interest (e.g., spike). In this work, we propose ViralVectors, a method for generating compact feature vectors from virome sequencing data that allows effective downstream analysis. The generation is based on minimizers, a type of lightweight "signature" of a sequence used traditionally in assembly and read mapping; to our knowledge, this is the first use of minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS read sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.

Graphical abstract showing all the steps of the proposed approach. We start by collecting the sequence-based data. Data cleaning and preprocessing are then applied. After that, we generate the feature embeddings using a minimizer-based approach. Classification and clustering algorithms are then applied to the resulting data, and predictions are made on the test set.
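The minimizer idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes one common definition (the lexicographically smallest m-mer inside each k-mer window) and a simple count-based embedding over all possible m-mers. The function names `minimizers` and `minimizer_vector`, the parameters `k` and `m`, and the nucleotide alphabet are hypothetical choices made for illustration; the abstract does not specify the exact procedure or parameter values used by ViralVectors.

```python
from collections import Counter
from itertools import product


def minimizers(seq: str, k: int, m: int) -> list[str]:
    """Return one minimizer per k-mer window of seq.

    Assumption: the minimizer of a k-mer is its lexicographically
    smallest m-mer (forward strand only). Other definitions exist.
    """
    if not 0 < m <= k:
        raise ValueError("need 0 < m <= k")
    result = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Scan every m-mer inside this k-mer and keep the smallest one.
        result.append(min(kmer[j:j + m] for j in range(k - m + 1)))
    return result


def minimizer_vector(seq: str, k: int, m: int, alphabet: str = "ACGT") -> list[int]:
    """Count minimizers into a fixed-length vector of size len(alphabet) ** m.

    Assumption: a plain frequency vector indexed by all possible m-mers
    in lexicographic order of the given alphabet.
    """
    counts = Counter(minimizers(seq, k, m))
    return [counts["".join(p)] for p in product(alphabet, repeat=m)]


# Hypothetical toy input, not data from the study.
mins = minimizers("ACGTAC", k=4, m=2)
vec = minimizer_vector("ACGTAC", k=4, m=2)
```

In this sketch the vector length depends only on the alphabet size and `m`, not on the sequence length, which is why sequences of different lengths, including unaligned ones, map to vectors of the same dimension.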

Subject(s)

COVID-19; Virome; Humans; SARS-CoV-2; Algorithms; Sequence Analysis, DNA/methods

Keywords

Biological Sequences; Minimizer; Sequence classification; Spike Sequence; k-mers

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Virome / COVID-19 | Limit: Humans | Language: En | Journal: Med Biol Eng Comput | Year: 2023 | Document type: Article | Country of affiliation: United States | Country of publication: United States
