WO2016178643A1 - Method for analysis of nucleotide sequence data by joint use of multiple calculation units at different locations - Google Patents

Method for analysis of nucleotide sequence data by joint use of multiple calculation units at different locations

Info

Publication number
WO2016178643A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
calculation unit
block
written
fifo
Prior art date
Application number
PCT/TR2016/050134
Other languages
English (en)
Inventor
Muhammed Oguzhan KÜLEKCI
Mahmut Samil SAGIROGLU
Original Assignee
Erlab Teknoloji Anonim Sirketi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Erlab Teknoloji Anonim Sirketi
Publication of WO2016178643A1

Links

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data

Definitions

  • The invention relates to the analysis of DNA/RNA (deoxyribonucleic acid / ribonucleic acid) sequencing data, and in particular to the integrated use of a calculation unit located at the same site as the data together with a remote second calculation unit.
  • The analysis to be performed on the data is defined as a series of operation blocks that run consecutively.
  • The output of each operation block is written to a FIFO (first in, first out) list.
  • The input of the next operation block is read from the FIFO list of the previous block.
  • DNA/RNA sequencing centers are usually located in institutions, corporations, and organizations working in the life sciences, whereas the cloud computing centers where the data are processed require expertise in computational science.
  • The transfer of data from the point of production to the point of calculation is the weakest link in data analysis.
  • The main reason for this weakness is the sheer size of the data.
  • The first method consists of copying this huge volume of data to storage media at the location of the data and sending the media to a remote calculation center by regular mail.
  • The remote calculation unit copies the data received by mail to its own system.
  • The second method consists of transferring the data over a communication channel (such as the internet or a leased line). Because of the huge size of the data, this kind of transfer is often not an effective solution; instead, conventionally copying the data to media (such as disk, DVD, or CD) and sending it to the other calculation unit by regular mail may be preferred.
  • CPU/GPU, memory, and hard disk storage at the location where the DNA/RNA sequence data are found as raw files.
  • DNA/RNA sequence data (2) is a sequence of values formed of 3 main components about each reading made on the target genome, which are:
  • This data can be stored as a FASTQ, FASTA, text file, or any other digital format.
  • The data compression block (9) is an operation block in which the DNA/RNA sequence data (2) are compressed as a whole by a compression algorithm.
  • The communication line (12) is the line over which the output of the data compression block (9) is transferred. This communication line (12) can be a digital channel such as the internet, or it can be a regular mail channel.
  • The remote calculation unit (13) is a remote calculation unit with a large amount of CPU/GPU capacity, memory, and storage space to be used in processing DNA/RNA data.
  • The data opening block (15) is the operation block in which the compressed data reaching the remote calculation unit (13) is opened (decompressed), so that the original data is recovered.
  • Operation block 1 (3) is the operation block forming the first step of the analysis workflow run on calculation unit 1 (1), where the DNA/RNA sequence data (2) are found.
  • Operation block 2 (5) is the second operation block of the analysis workflow; it uses the output of operation block 1.
  • Operation block N is the final operation block of the analysis workflow.
  • Because of the large size of the data and limited bandwidth, genomic data must be compressed for transfer over the internet. Various methods for compressing genomic data have been developed, especially in recent years (References 1, 2, 3, 4). These studies deal with the FASTQ files produced by sequencing devices and focus on compressing them.
  • The main components of FASTQ files are the base sequences, formed of the letters A, T, C, G, and N, and the quality scores generated by the device to indicate how reliably each base was read.
  • Two main methods are used for compressing base sequences. In the first method, if the organism to be sequenced has a previously assembled reference sequence, the compression makes use of that sequence.
  • In the second method, the coding exploits the repetitions (redundancy) found in the base sequences themselves, without using any additional information (Reference 5).
  • The quality score of each base is provided as a one-byte value.
  • With statistical compression, the best results are obtained with the PPM (prediction by partial matching) method.
  • The value to be encoded in the quality score array is predicted from the previous values, and this prediction is compressed via arithmetic coding.
  • In principle, the calculations made on the sender's side could be reused for the subsequent operations on the cloud computing side; in the prior art this is not achieved, and the calculations made in the compression block are not used after the opening/decoding operation.
  • The invention is inspired by the state of the prior art and aims to solve the problems described above.
  • The main purpose of the invention is to adjust the load distribution between the calculation units so that the whole operation finishes in the shortest possible time.
  • By adjusting the workflow in this way, the aim is to make maximum use of the calculation capacity on the side where the DNA/RNA sequence data are found and to perform the best possible transfer for the available bandwidth.
  • K operation blocks are performed in the calculation unit on the side of the raw DNA/RNA sequence data.
  • The data obtained from these operation blocks are compressed in the compression block and then sent to the remote calculation unit through the communication line.
  • The compressed data reaching the remote calculation unit is opened, and the analysis continues from operation block K+1.
  • At the end of the N operation blocks, the results of the data analysis are formed.
  • The number N is determined by the type of analysis to be performed on the DNA/RNA data. K of these N operation blocks run on the side of the raw data, while the remaining (N-K) run at the remote calculation unit.
  • The number K is variable so that the fullness ratio of the FIFO lists stays within a predetermined interval throughout the workflow; it is adjusted dynamically by the remote calculation unit according to the capacities and fullness ratios of the calculation units and the speed of the communication line.
  • For example, the K value can start at 5, but at any moment it can be reduced to 3, increased to 8, or set to any other possible value.
  • The raw data is processed as it passes through the operation blocks and becomes progressively more refined. Therefore, at the end of each operation (except the standard reading, writing, and sending blocks), the size of the data is reduced and its transfer to the remote system becomes faster. For example, if all of the operation blocks forming the alignment operation can be completed by the sender, the raw DNA/RNA data can become extremely compact.
  • Figure 1 is a diagram showing the components and the relations thereof involved in operation of the method according to the present invention, ensuring the transfer of the DNA/RNA data between two calculation units and the analysis thereof.
  • Figure 2 is a diagram showing the components and the relations thereof involved in operation of the method according to the prior art, ensuring the transfer of the DNA/RNA data between two calculation systems and the analysis thereof.
  • Genomic data analysis results
  • K: the number of operation blocks and FIFO lists found in calculation unit 1
  • N: the total number of operation blocks and FIFO lists found in calculation unit 1 and calculation unit 2
  • Operation block 1 (3) is the first operation block to be processed in the workflow to be performed on the DNA/RNA sequence data (2) to be analyzed.
  • Each possible operation that can be defined within the genome analysis workflow is divided into pieces and defined as operation blocks.
  • The following are determined beforehand for each operation block: its inputs and outputs, the operation blocks from which it can accept input connections, the operation blocks to which it can give output connections, and the kinds of compression it can accept if a compression block is applied after it. All these operation blocks are loaded as a software library both in calculation unit 1 (1), where the DNA/RNA sequence data (2) are found, and in calculation unit 2 (13), which is the remote calculation unit.
  • FIFO list 1 (4) is the FIFO list on which the data coming out of operation block 1 (3) are to be written.
  • The values written on FIFO list 1 (4) are fed into operation block 2 (5) in the order in which they were written.
  • Operation block 2 (5) is the second operation block to be processed in the workflow to be performed on the DNA/RNA sequence data (2) to be analyzed.
  • FIFO list 2 (6) is the FIFO list on which the output of operation block 2 (5) is written.
  • Operation block K (7) is the Kth operation block to be processed in the calculation unit where the DNA/RNA sequence data (2) are found, within the workflow to be performed on the DNA/RNA sequence data (2).
  • The data compression block (9) is the operation block in which the data found on FIFO list K (8), on which the output of operation block K (7) is written, are compressed.
  • The FIFO list compression block (10) is the FIFO list on which the compressed data are written.
  • The communication line transfer block (11) is the operation block by which the compressed data are taken from the FIFO list compression block (10) and sent to the communication line (12).
  • The communication line (12) is the communication line between the two calculation units: calculation unit 1 (1), where the DNA/RNA sequence data (2) are found, and calculation unit 2 (13), which is the remote calculation unit.
  • This communication line can be, for instance, the internet or a leased line.
  • The remote calculation unit (13) is a remote calculation unit with a large amount of CPU/GPU capacity, memory, and storage space used in processing DNA/RNA data.
  • The FIFO list receiving block (14) is the FIFO list on which the data received from the communication line (12) are written.
  • The data opening block (15) is the block where the compressed data written on the FIFO list receiving block (14) are opened.
  • The FIFO list opening block (16) is the FIFO list on which the data opened by the data opening block (15) are written.
  • Operation block K+1 (17) is the (K+1)th operation block within the list of N operation blocks used to analyze the DNA/RNA sequence data (2); it follows the K operation blocks processed in calculation unit 1 (1), where the DNA/RNA sequence data (2) are found, and is the first block to be processed in calculation unit 2 (13).
  • FIFO list K+1 (18) is the FIFO list on which the output of operation block K+1 (17) is to be written.
  • FIFO list K+2 (20) is the FIFO list on which the output of operation block K+2 (19) is to be written.
  • Operation block N is the Nth and final operation block to be processed within the list of N operation blocks used to analyze the DNA/RNA sequence data (2).
  • FIFO list N (22) is the FIFO list on which the output of operation block N (21) is written.
  • Genomic data analysis results (23) are the analysis results formed by collection of the results formed in FIFO list N (22).
  • The DNA/RNA sequence data (2) produced by DNA/RNA sequencing devices are stored in FASTQ, the digital file format in which these devices deliver their output.
  • Before the operation blocks start to operate, calculation unit 1 (1), which contains the DNA/RNA sequence data (2), informs the remote calculation unit 2 (13) through the communication line (12) about the analysis to be performed and about the size, format, and similar properties of the DNA/RNA sequence data: the network bandwidth used by the user while connecting to the service, the size of the user's FASTQ file, the number of operations to be analyzed (which also have a size), and the operation capacity of the computer to be used, in other words calculation unit 1 (1).
  • The remote calculation unit 2 (13) decides which operation blocks are to be used for the desired analysis and informs calculation unit 1 (1), through the communication line (12), which operation blocks are to be used in the analysis and which of these are to run in calculation unit 1 (1), in other words the K value.
  • Calculation unit 1 (1) starts operation according to the K value and workflow it is given.
  • Calculation unit 1 (1) starts to pass the DNA/RNA sequence data (2) through the operation blocks one by one, compresses the data coming out of operation block K (7) via the data compression block (9), and sends it to the remote calculation unit 2 (13) through the communication line (12).
  • Calculation unit 1 (1), which contains the DNA/RNA sequence data (2), also reports to the remote calculation unit 2 (13) the fullness ratios of the FIFO lists at the output of the operation blocks on its own side (operation block 1, operation block 2, ..., operation block K).
  • The remote calculation unit (13) opens the compressed data it receives and continues the operation from operation block K+1 (17) through operation block N (21).
  • Calculation unit 2 (13) can change the K value, considering the FIFO fullness ratios reported by calculation unit 1 (1) and the FIFO fullness ratios of the operation blocks it performs itself.
  • Calculation unit 1 (1), which contains the DNA/RNA sequence data (2), either increases or decreases its number of operation blocks according to the new K value.
  • calculation unit 1 (1), found at the location of the DNA/RNA sequence data (2), informing the remote calculation unit 2 (13), through a communication line (12), about the analysis to be made, the network bandwidth used during connection, the size of the FASTQ file, the number of operations to be analyzed, and the operation capacity of calculation unit 1 (1),
  • the remote calculation unit 2 (13) deciding which operation blocks are to be used for the desired analysis, and informing calculation unit 1 (1), through the communication line (12), which operation blocks are to be used in the analysis and which of these are to be run in calculation unit 1 (1), or in other words, the K value,
  • analysis of the DNA/RNA sequence data (2) to be analyzed, via operation block 1 (3),
  • the compressed data written on the FIFO list compression block (10) being sent by the communication line transfer block (11) from calculation unit 1 (1) to the FIFO list receiving block (14) found in calculation unit 2 (13), through the communication line (12), and being written on the FIFO list receiving block (14),
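
The split workflow described above (K operation blocks on the data side, compression, transfer, opening, and the remaining N-K blocks on the remote side) can be illustrated with a minimal Python sketch. It is not the patented implementation. The three operation blocks, the use of zlib as the compression block, the newline framing of records, and the threshold rule in `adjust_k` are all assumptions made for illustration; the description only states that K is adjusted so that FIFO fullness ratios stay within a predetermined interval. The sketch also runs the blocks one after another, whereas a real system would presumably run them concurrently and measure the fullness ratios itself.

```python
import zlib
from collections import deque

# Hypothetical operation blocks (not from the source); each maps one record to one record.
def strip_whitespace(rec: bytes) -> bytes:
    return rec.strip()

def to_upper(rec: bytes) -> bytes:
    return rec.upper()

def trim_trailing_n(rec: bytes) -> bytes:
    return rec.rstrip(b"N")

# The same block library is loaded on both calculation units; here N = 3.
BLOCKS = [strip_whitespace, to_upper, trim_trailing_n]

def run_blocks(records, blocks):
    """Chain blocks through FIFO lists: each block reads the previous block's FIFO."""
    fifo = deque(records)
    for block in blocks:
        out = deque()
        while fifo:
            out.append(block(fifo.popleft()))  # first in, first out
        fifo = out
    return fifo

def local_side(records, k):
    """Calculation unit 1: run blocks 1..K, then compress FIFO list K.

    Assumes at least one record and no newline bytes inside a record.
    """
    fifo_k = run_blocks(records, BLOCKS[:k])
    return zlib.compress(b"\n".join(fifo_k))

def remote_side(payload, k):
    """Calculation unit 2: open the data, then run blocks K+1..N."""
    records = zlib.decompress(payload).split(b"\n")
    return list(run_blocks(records, BLOCKS[k:]))

def adjust_k(k, n, local_fill, remote_fill, high=0.8):
    """Assumed heuristic: move one block away from whichever side is backing up."""
    if local_fill > high and k > 0:
        return k - 1  # data side overloaded: hand a block to the remote unit
    if remote_fill > high and k < n:
        return k + 1  # remote side overloaded: keep one more block locally
    return k

if __name__ == "__main__":
    reads = [b" acgtn ", b"ttNga\t"]
    k = 2
    print(remote_side(local_side(reads, k), k))
    print(adjust_k(k, len(BLOCKS), local_fill=0.9, remote_fill=0.1))
```

Because both sides share the same ordered block list, the final result does not depend on where the split K falls; only the amount of work done on each side and the size of the transferred payload change.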

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to the analysis of DNA/RNA sequencing data (2), and in particular to a method for the integrated use of a calculation unit located at the same site as the data together with a remote second calculation unit.
PCT/TR2016/050134 2015-05-06 2016-05-03 Method for analysis of nucleotide sequence data by joint use of multiple calculation units at different locations WO2016178643A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
TR201505488 2015-05-06
TR2015/05488 2015-05-06

Publications (1)

Publication Number Publication Date
WO2016178643A1 (fr) 2016-11-10

Family

ID=56203893

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/TR2016/050134 WO2016178643A1 (fr) 2015-05-06 2016-05-03 Method for analysis of nucleotide sequence data by joint use of multiple calculation units at different locations

Country Status (1)

Country Link
WO (1) WO2016178643A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116033033A (zh) * 2022-12-31 2023-04-28 Xidian University A spatial omics data compression and transmission method combining microscopic images and RNA

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010027348A1 (fr) * 2008-09-08 2010-03-11 Ahdoot Ned M Digital video filter and image processing
US20130031092A1 (en) 2010-04-26 2013-01-31 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
US20130166518A1 (en) 2011-12-24 2013-06-27 Tata Consultancy Services Limited Compression Of Genomic Data File
US20130204851A1 (en) 2011-12-05 2013-08-08 Samsung Electronics Co., Ltd. Method and apparatus for compressing and decompressing genetic information obtained by using next generation sequencing (ngs)
US20130275486A1 (en) 2012-04-11 2013-10-17 Illumina, Inc. Cloud computing environment for biological data
WO2014116851A2 (fr) * 2013-01-25 2014-07-31 Illumina, Inc. Methods and systems for using a cloud computing environment to share biological data
US8812243B2 (en) 2012-05-09 2014-08-19 International Business Machines Corporation Transmission and compression of genetic data

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010027348A1 (fr) * 2008-09-08 2010-03-11 Ahdoot Ned M Digital video filter and image processing
US20130031092A1 (en) 2010-04-26 2013-01-31 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
US20130204851A1 (en) 2011-12-05 2013-08-08 Samsung Electronics Co., Ltd. Method and apparatus for compressing and decompressing genetic information obtained by using next generation sequencing (ngs)
US20130166518A1 (en) 2011-12-24 2013-06-27 Tata Consultancy Services Limited Compression Of Genomic Data File
US20130275486A1 (en) 2012-04-11 2013-10-17 Illumina, Inc. Cloud computing environment for biological data
US8812243B2 (en) 2012-05-09 2014-08-19 International Business Machines Corporation Transmission and compression of genetic data
WO2014116851A2 (fr) * 2013-01-25 2014-07-31 Illumina, Inc. Methods and systems for using a cloud computing environment to share biological data

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BONFIELD, J. K.; MAHONEY, M. V.: "Compression of FASTQ and SAM format sequencing data", PLOS ONE, vol. 8, no. 3, 2013, page e59190, XP055330942, DOI: 10.1371/journal.pone.0059190
COX, A. J.; BAUER, M. J.; JAKOBI, T.; ROSONE, G.: "Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform", BIOINFORMATICS, vol. 28, no. 11, 2012, pages 1415-1419
DEOROWICZ, S.; GRABOWSKI, S.: "Compression of DNA sequence reads in FASTQ format", BIOINFORMATICS, vol. 27, no. 6, 2011, pages 860-862, XP055077100, DOI: 10.1093/bioinformatics/btr014
FRITZ, M. H. Y.; LEINONEN, R.; COCHRANE, G.; BIRNEY, E.: "Efficient storage of high throughput DNA sequencing data using reference-based compression", GENOME RESEARCH, vol. 21, no. 5, 2011, pages 734-740
HACH, F.; NUMANAGIC, I.; ALKAN, C.; SAHINALP, S. C.: "SCALCE: boosting sequence compression algorithms using locally consistent encoding", BIOINFORMATICS, vol. 28, no. 23, 2012, pages 3051-3057

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
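CN116033033A (zh) * 2022-12-31 2023-04-28 Xidian University A spatial omics data compression and transmission method combining microscopic images and RNA
CN116033033B (zh) * 2022-12-31 2024-05-17 Xidian University A spatial omics data compression and transmission method combining microscopic images and RNA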

Similar Documents

Publication Publication Date Title
EP2608096B1 (fr) Compression of genomic data files
CN110603595B (zh) Methods and systems for reconstructing a genomic reference sequence from compressed genomic sequence reads
US8812243B2 (en) Transmission and compression of genetic data
Zhu et al. High-throughput DNA sequence data compression
Patro et al. Data-dependent bucketing improves reference-free compression of sequencing reads
Malysa et al. QVZ: lossy compression of quality values
US11165849B2 (en) Accelerated cloud data transfers using optimized file handling and a choice of speeds across heterogeneous network paths
US10090857B2 (en) Method and apparatus for compressing genetic data
US11762813B2 (en) Quality score compression apparatus and method for improving downstream accuracy
KR101074010B1 (ko) Method and apparatus for block-unit data compression and decompression
Wan et al. Transformations for the compression of FASTQ quality scores of next-generation sequencing data
WO2016020682A1 (fr) Methods and systems for data analysis and compression
Bose et al. BIND–An algorithm for loss-less compression of nucleotide sequence data
JP2004240975A (ja) DNA sequence encoding apparatus and method
WO2015180203A1 (fr) High-throughput DNA sequencing quality score lossless compression system and compression method
US10560552B2 (en) Compression and transmission of genomic information
CN108134609A (zh) Multi-threaded compression and decompression method and device for general data in gz format
WO2016178643A1 (fr) Method for analysis of nucleotide sequence data by joint use of multiple calculation units at different locations
KR20040070438A (ko) DNA sequence encoding apparatus and method
Long et al. GeneComp, a new reference-based compressor for SAM files
JP2020509474A (ja) Methods and systems for reconstructing a genomic reference sequence from compressed genomic sequence reads
CN113574603A (zh) Rapid detection of gene fusions
Wang et al. smallWig: parallel compression of RNA-seq WIG files
Ochoa et al. AliCo: A new efficient representation for SAM files
CN1656688B (zh) Processing digital data prior to compression

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16732011

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16732011

Country of ref document: EP

Kind code of ref document: A1