Search | BVS (Virtual Health Library) Regional Portal

GABAC: an arithmetic coding solution for genomic data.

Voges, Jan; Paridaens, Tom; Müntefering, Fabian; Mainzer, Liudmila S; Bliss, Brian; Yang, Mingyu; Ochoa, Idoia; Fostier, Jan; Ostermann, Jörn; Hernaez, Mikel.

Bioinformatics ; 36(7): 2275-2277, 2020 04 01.

Article in English | MEDLINE | ID: mdl-31830243

ABSTRACT

MOTIVATION: In an effort to provide a response to the ever-expanding generation of genomic data, the International Organization for Standardization (ISO) is designing a new solution for the representation, compression and management of genomic sequencing data: the Moving Picture Experts Group (MPEG)-G standard. This paper discusses the first implementation of an MPEG-G compliant entropy codec: GABAC. GABAC combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes and transformations, into a straightforward solution for the compression of sequencing data. RESULTS: We demonstrate that GABAC outperforms well-established (entropy) codecs in a significant set of cases and thus can serve as an extension for existing genomic compression solutions, such as CRAM. AVAILABILITY AND IMPLEMENTATION: The GABAC library is written in C++. We also provide a command line application which exercises all features provided by the library. GABAC can be downloaded from https://github.com/mitogen/gabac. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
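The combination of binarization and context-adaptive binary coding described in the abstract can be illustrated with a small sketch. This is not GABAC's implementation or API: the function names `truncated_unary` and `adaptive_cost_bits`, the choice of truncated unary binarization, and the simple count-based probability model are all assumptions made for illustration. Instead of producing an actual arithmetic-coded bitstream, the sketch sums the ideal code length (-log2 p per bin), which is what an arithmetic coder approaches.

```python
import math


def truncated_unary(value, max_value):
    """Binarize `value` as `value` ones plus a terminating zero (omitted at max_value)."""
    bits = [1] * value
    if value < max_value:
        bits.append(0)
    return bits


def adaptive_cost_bits(symbols, max_value):
    """Ideal code length in bits when each bin position has its own adaptive model.

    Hypothetical model: one context per bin position, with probabilities
    estimated from running counts that start at a uniform prior.
    """
    # counts[i] = [zeros seen, ones seen] for bin position i
    counts = [[1, 1] for _ in range(max_value)]
    total = 0.0
    for s in symbols:
        for i, b in enumerate(truncated_unary(s, max_value)):
            zeros, ones = counts[i]
            p = (ones if b else zeros) / (zeros + ones)
            total += -math.log2(p)  # cost of coding this bin with probability p
            counts[i][b] += 1       # adapt the model after coding the bin
    return total


if __name__ == "__main__":
    skewed = [0] * 90 + [3] * 10
    print(f"adaptive cost: {adaptive_cost_bits(skewed, 3):.1f} bits")
    print(f"fixed 2-bit cost: {2 * len(skewed)} bits")
```

On skewed input the adaptive cost falls well below the fixed-length cost, which is the effect the codec relies on; real codecs use richer context selection and finite-precision probability updates.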

Subject(s)

Data Compression; High-Throughput Nucleotide Sequencing; Genome; Genomics; Software

AQUa: an adaptive framework for compression of sequencing quality scores with random access functionality.

Paridaens, Tom; Van Wallendael, Glenn; De Neve, Wesley; Lambert, Peter.

Bioinformatics ; 34(3): 425-433, 2018 02 01.

Article in English | MEDLINE | ID: mdl-29028894

ABSTRACT

Motivation: The past decade has seen the introduction of new technologies that significantly lowered the cost of genome sequencing. As a result, the amount of genomic data that must be stored and transmitted is increasing exponentially. To mitigate storage and transmission issues, we introduce a framework for lossless compression of quality scores. Results: This article proposes AQUa, an adaptive framework for lossless compression of quality scores. To compress these quality scores, AQUa makes use of a configurable set of coding tools, extended with a Context-Adaptive Binary Arithmetic Coding scheme. When benchmarking AQUa against generic single-pass compressors, file sizes are reduced by up to 38.49% when comparing with GNU Gzip and by up to 6.48% when comparing with 7-Zip at the Ultra Setting, while still providing support for random access. When comparing AQUa with the purpose-built, single-pass, and state-of-the-art compressor SCALCE, which does not support random access, file sizes are reduced by up to 21.14%. When comparing AQUa with the purpose-built, dual-pass, and state-of-the-art compressor QVZ, which does not support random access, file sizes are larger by 6.42-33.47%. However, for one test file, the file size is 0.38% smaller, illustrating the strength of our single-pass compression framework. This work has been spurred by the current activity on genomic information representation (MPEG-G) within the ISO/IEC SC29/WG11 technical committee. Availability and implementation: The software is available on Github: https://github.com/tparidae/AQUa. Contact: tom.paridaens@ugent.be.

Subject(s)

Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Metadata; Software; Algorithms; Escherichia coli/genetics; Genomics/methods; Humans; Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods

AFRESh: an adaptive framework for compression of reads and assembled sequences with random access functionality.

Paridaens, Tom; Van Wallendael, Glenn; De Neve, Wesley; Lambert, Peter.

Bioinformatics ; 33(10): 1464-1472, 2017 May 15.

Article in English | MEDLINE | ID: mdl-28057687

ABSTRACT

MOTIVATION: The past decade has seen the introduction of new technologies that have steadily lowered the cost of genomic sequencing. We can even observe that the cost of sequencing is dropping significantly faster than the cost of storage and transmission. The latter motivates a need for continuous improvements in the area of genomic data compression, not only at the level of effectiveness (compression rate), but also at the level of functionality (e.g. random access), configurability (effectiveness versus complexity, coding tool set) and versatility (support for both sequenced reads and assembled sequences). In that regard, we can point out that current approaches mostly do not support random access, requiring full files to be transmitted, and that current approaches are restricted to either read or sequence compression. RESULTS: We propose AFRESh, an adaptive framework for no-reference compression of genomic data with random access functionality, targeting the effective representation of the raw genomic symbol streams of both reads and assembled sequences. AFRESh makes use of a configurable set of prediction and encoding tools, extended by a Context-Adaptive Binary Arithmetic Coding scheme (CABAC), to compress raw genetic codes. To the best of our knowledge, our paper is the first to describe an effective implementation of CABAC outside of its original application. By applying CABAC, the compression effectiveness improves by up to 19% for assembled sequences and up to 62% for reads. By applying AFRESh to the genomic symbols of the MPEG genomic compression test set for reads, a compression gain is achieved of up to 51% compared to SCALCE, 42% compared to LFQC and 44% compared to ORCOM. When comparing to generic compression approaches, a compression gain is achieved of up to 41% compared to GNU Gzip and 22% compared to 7-Zip at the Ultra setting. Additionally, when compressing assembled sequences of the Human Genome, a compression gain is achieved of up to 34% compared to GNU Gzip and 16% compared to 7-Zip at the Ultra setting. AVAILABILITY AND IMPLEMENTATION: A Windows executable version can be downloaded at https://github.com/tparidae/AFresh. CONTACT: tom.paridaens@ugent.be.
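The random access property mentioned in the abstract, where a reader does not need the full file to obtain a subrange, is commonly obtained by compressing data in independently decodable blocks and keeping an index of their positions. The sketch below shows that general idea only. It is not AFRESh's format or algorithm: it uses `zlib` from the Python standard library as a stand-in codec, and the function names `compress_blocks` and `read_range` are hypothetical.

```python
import zlib


def compress_blocks(data: bytes, block_size: int):
    """Compress fixed-size blocks independently.

    Returns the concatenated payload and an index of (offset, length) pairs,
    one per block, so any block can be located without scanning the payload.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        chunk = zlib.compress(data[start:start + block_size])
        index.append((len(payload), len(chunk)))
        payload += chunk
    return bytes(payload), index


def read_range(payload: bytes, index, block_size: int, start: int, end: int) -> bytes:
    """Return data[start:end], decompressing only the blocks that overlap the range."""
    if start >= end:
        return b""
    first = start // block_size
    last = min((end - 1) // block_size, len(index) - 1)
    out = bytearray()
    for b in range(first, last + 1):
        off, length = index[b]
        out += zlib.decompress(payload[off:off + length])
    skip = start - first * block_size  # bytes to drop from the first block
    return bytes(out[skip:skip + (end - start)])


if __name__ == "__main__":
    genome = b"ACGT" * 1000
    payload, index = compress_blocks(genome, block_size=256)
    print(read_range(payload, index, 256, 1000, 1012))
```

The trade-off is that smaller blocks give finer-grained access but reset the codec's statistics more often, which usually costs compression ratio.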

Subject(s)

Data Compression/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Software; Algorithms; Bacteria/genetics; Genome; Genomics/methods; Humans; Plants/genetics
