CN106603591B

CN106603591B - Processing method and system for genome detection data transmission and preprocessing

Info

Publication number: CN106603591B
Application number: CN201510663214.1A
Authority: CN
Inventors: 王振飞
Original assignee: Beijing Genedock Technology Co Ltd
Current assignee: Beijing Genedock Technology Co Ltd
Priority date: 2015-10-14
Filing date: 2015-10-14
Publication date: 2020-02-07
Anticipated expiration: 2035-10-14
Also published as: CN106603591A

Abstract

The invention relates to the field of genome sequencing data transmission, analysis and detection, and in particular to a processing method and system for genome detection data transmission and preprocessing. The method comprises the following steps: obtaining the genome detection data and partitioning it, wherein, if the genome detection data is single-stranded data, the genome detection data of M Read short sequences is partitioned, with one partition per N Read short sequences, into P = INT(M/N) partitions, where INT() is a round-up function and P is the number of partitions; if the genome detection data is double-stranded data, the strand data R1 and the strand data R2 are each partitioned according to the single-stranded data partitioning method, generating R1 partition data and R2 partition data, with each R1 partition matched to one R2 partition and vice versa; and transmitting the block data to a server for genome analysis and detection. The invention significantly reduces the time spent on genome data preprocessing and increases the fault tolerance of the processing.

Description

Processing method and system for genome detection data transmission and preprocessing

Technical Field

The invention relates to the field of genome sequencing data transmission, analysis and detection, and in particular to a processing method and a processing system for genome detection data transmission and preprocessing.

Background

The data preprocessing of genomic variation detection in the prior art mainly comprises two main steps: firstly, data are transmitted to a server storage or cloud storage service, and then genome sequencing data of a sample stored by the server or the cloud service are compared with a standard reference genome of a species to which the sample belongs in a single-machine computing mode.

In the prior art, data transmission and alignment against the species standard reference genome are two separate steps that are executed serially; that is, alignment against the species standard reference genome cannot start until all sample genome sequencing data have been transmitted to the server or cloud storage service.

There are many ways to transmit data to the server or the cloud storage service, for example, using a data synchronization tool based on TCP or UDP protocol, such as FTP, SCP, RSYNC, etc., directly mounting a hard disk with genome data or other storage media to the server, or using a client provided by the cloud storage service to transmit.

The process of alignment with a standard reference genome of the species to which the sample belongs is typically a computationally intensive task. The computing tasks for this process in the prior art were handled using a high performance server (e.g., a minicomputer or mainframe).

The prior art has the following problems:

the processes of the original data transmission of the genome sequencing of the sample and the comparison with the standard reference genome of the species are executed in series, and the time consumption is long;

for genome data of double-ended sequencing, it cannot be guaranteed that paired double-ended short sequences are successfully transmitted at the same time, and even the transmission of two double-ended data files is serial, so that the whole data transmission process is long;

the comparison process with the species standard reference genome to which the sample belongs is carried out on a single server, and the computing power of the server becomes the bottleneck of task processing, so that the processing process of the step is time-consuming;

the process is usually run in a single task mode, if the task fails, the whole sample genome sequencing data processing process needs to be run again, the retry cost is high, the fault tolerance capability is weak, and the processing time is further prolonged.

Disclosure of Invention

To address the defects of the prior art, the invention provides a processing method and a system for genome detection data transmission and preprocessing.

The invention provides a processing method for genome detection data transmission and preprocessing, which comprises the following steps:

step 1, acquiring the genome detection data and partitioning it, wherein, if the genome detection data is single-stranded data, the genome detection data of M Read short sequences is partitioned, with one partition per N Read short sequences, into P = INT(M/N) partitions, where INT() is a round-up function and P is the number of partitions; and if the genome detection data is double-stranded data, the strand data R1 and the strand data R2 are each partitioned according to the single-stranded data partitioning method, generating R1 partition data and R2 partition data, with each R1 partition matched to one R2 partition and vice versa;

step 2, transmitting the block data to a server for genome analysis and detection.

In the processing method for genome detection data transmission and preprocessing, step 1 further comprises: selectively compressing the blocks to reduce the size of the transmitted data.

In the processing method for genome detection data transmission and preprocessing, the mutually matched blocks of the R1 block data and the R2 block data are placed into the same data packet, or it is otherwise ensured that the mutually matched blocks are uploaded successfully together and used together as the input of one genome data preprocessing task.

In the processing method for genome detection data transmission and preprocessing, in step 1, the number of short sequences contained in a block is any integer between 1 and M, where M is the number of Read short sequences contained in the sample.

In the processing method for genome detection data transmission and preprocessing, the block data in step 2 are transmitted to the server in parallel and are analyzed and detected in parallel, and an error in any one block does not affect the analysis and detection of the remaining blocks.

The invention also provides a genome detection data transmission and preprocessing oriented processing system, which comprises:

a partitioning module, configured to obtain the genome detection data and partition it, wherein, if the genome detection data is single-stranded data, the genome detection data of M Read short sequences is partitioned, with one partition per N Read short sequences, into P = INT(M/N) partitions, where INT() is a round-up function and P is the number of partitions; and if the genome detection data is double-stranded data, the strand data R1 and the strand data R2 are each partitioned according to the single-stranded data partitioning method, generating R1 partition data and R2 partition data, with each R1 partition matched to one R2 partition and vice versa;

a transmission module, configured to transmit the block data to a server for genome analysis and detection.

In the processing system for genome detection data transmission and preprocessing, the partitioning module further comprises: selectively compressing the blocks to reduce the size of the transmitted data.

In the processing system for genome detection data transmission and preprocessing, the mutually matched blocks of the R1 block data and the R2 block data are placed into the same data packet, or it is otherwise ensured that the mutually matched blocks are uploaded successfully together and used together as the input of one genome data preprocessing task.

In the processing system for genome detection data transmission and preprocessing, the number of short sequences contained in a block is any integer between 1 and M.

In the processing system for genome detection data transmission and preprocessing, the block data in the transmission module are transmitted to the server in parallel and are analyzed and detected in parallel, and an error in any one block does not affect the analysis and detection of the remaining blocks.

As can be seen from the above invention, the advantages of the invention are:

According to the invention, by blocking the data and combining the two steps of data transmission and alignment against the species reference genome into one complete workflow, the complete sample genome sequencing data is processed block by block, and transmission and preprocessing of different data blocks run in parallel. This avoids having to wait for the complete data to be transmitted before the genome data can be aligned, and greatly reduces the time from the start of data transmission to generation of the preprocessed result data file, which is of significant value in production use. A comparison of the method with the prior art is shown in FIG. 3.

In the original scheme, transmitting sample genome sequencing data of M Read short sequences takes UT seconds. In the same network environment, transmitting the same data as INT(M/N) blocks (1 <= N <= M, where N is the number of Read short sequences contained in a single block) takes UTB seconds. Generally UT >= UTB, because block transmission makes better use of the multi-core processing capability of the client computer and uses network bandwidth more effectively. In the original scheme, the sample genome sequencing data can be aligned and sorted to output an alignment result file only after the UT seconds of transmission have elapsed, and this step takes PT seconds. In the present invention, alignment and sorting start immediately after the transfer of a block completes. While one block is being aligned and sorted, other blocks are still in transmission, so block alignment and sorting run in parallel with block transmission. If processing one block, from alignment and sorting to output of the final result file segment, takes about PBT seconds on average, then under the same computing power, for genome sequencing data divided into N blocks, PBT < PT/N.

In the prior art, the process from data transmission to final output of the alignment result data file takes UT + PT seconds.

In the invention, most of the block comparison processing, sequencing and block transmission are parallel, so the time spent by the invention is about UTB + PBT, and the UTB + PBT is far less than UT + PT.
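The comparison between UT + PT and UTB + PBT can be illustrated with a small timing model. The sketch below is only an assumption-laden illustration: the function name `pipelined_finish`, the single shared link on which blocks arrive one after another, and all numeric values are hypothetical and are not taken from the invention.

```python
import heapq
from typing import Sequence


def serial_total(ut: float, pt: float) -> float:
    """Prior art: alignment starts only after the whole transfer, so UT + PT."""
    return ut + pt


def pipelined_finish(transfer_times: Sequence[float],
                     process_times: Sequence[float],
                     workers: int) -> float:
    """Finish time when each block is processed as soon as it has arrived.

    Simplifying assumption: blocks arrive back to back over one link, and
    `workers` alignment workers are available on the server side.
    """
    free_at = [0.0] * workers          # time at which each worker becomes free
    heapq.heapify(free_at)
    arrival = 0.0
    finish = 0.0
    for t, p in zip(transfer_times, process_times):
        arrival += t                   # this block has fully arrived
        start = max(arrival, heapq.heappop(free_at))
        end = start + p
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Illustrative numbers only: 10 blocks, 6 s transfer and 30 s processing each.
prior_art = serial_total(ut=60.0, pt=300.0)
blocked = pipelined_finish([6.0] * 10, [30.0] * 10, workers=10)
```

With enough workers, the modelled finish time equals the total transfer time plus the processing time of one block, which matches the UTB + PBT estimate in the text; with a single worker the benefit shrinks, since processing then dominates.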

Drawings

FIG. 1 is a data flow diagram of the overall data preprocessing for genomic variation detection of paired-end Reads;

FIG. 2 is a flow chart of data preprocessing for genome variation detection for single-ended detection;

FIG. 3 is a diagram comparing the present invention with the prior art.

Detailed Description

The invention addresses data transmission and preprocessing in variation detection on massive genome data. The raw sample genome sequencing data is split by number of short sequences (Reads) into blocks; the blocks are then selectively compressed, transmitted and verified in parallel, and aligned against the standard reference genome of the species to which the sample belongs. Compared with other schemes, the processing flows of different blocks can run simultaneously and in parallel on a computer cluster or a server. Because alignment against the species standard reference genome does not need to wait until transmission of the raw sample genome sequencing data has completed, and because running the per-block workflows in parallel makes full use of the computing power of the computing cluster or high-performance server, the method greatly shortens the time from transmission of the raw sample genome sequencing data to completion of the preprocessing step of alignment against the species standard reference genome. In addition, because both the transmission of the raw genome sequencing data and the mapping to the standard reference genome of the species to which the sample belongs are carried out block by block, a failure in transmitting or mapping a single block does not affect the overall data processing; only the transmission or mapping of the failed block needs to be rerun. Fault tolerance is therefore stronger, which also helps shorten the processing time. It should further be noted that the invention supports not only genome sequencing data from single-ended sequencing but also paired transmission of short sequences for genome sequencing data from double-ended sequencing.

The processing steps of the invention are as follows:

Step 1, data blocking. The blocking strategy of the invention splits the raw sequencing data file by lines. For a file of M Read short sequences, if every N (1 <= N <= M) short sequences form one block, the raw genome sequencing data file is divided into P = INT(M/N) blocks, where INT() is a round-up function and P is the number of blocks. For example, genome sequencing sample data containing 1,000,000 lines of Read short sequences can be divided into 10 (1,000,000/100,000) blocks at one block per 100,000 lines of Reads. Blocks are numbered starting from 0, incrementing by 1 from block to block, and the name of a block is the file name plus the block number, for example, the name of the first block of the test. Genome variation detection can be divided into single-ended sequencing and double-ended sequencing according to the type of sample data. For single-ended sequencing, the data is blocked as described above. Double-ended sequencing is also supported: in double-ended sequencing data, the two short sequence files are blocked according to the same rule with the same values of M and N, so the resulting numbers of blocks are also the same.
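The blocking rule of step 1 might be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes the input is FASTQ-like text with four lines per Read (the text does not state the file format), and the helper names `block_count`, `split_reads` and `block_name`, as well as the `name.number` naming pattern, are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

LINES_PER_READ = 4  # assumption: FASTQ, four text lines per Read


def block_count(m: int, n: int) -> int:
    """P = INT(M/N) with INT() rounding up, for 1 <= N <= M."""
    if not 1 <= n <= m:
        raise ValueError("N must satisfy 1 <= N <= M")
    return (m + n - 1) // n  # integer ceiling, exact for large M


def split_reads(lines: Iterable[str], n: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (block_number, lines) with at most n Reads per block, numbered from 0."""
    it = iter(lines)
    number = 0
    while True:
        chunk = list(islice(it, n * LINES_PER_READ))
        if not chunk:
            return
        yield number, chunk
        number += 1


def block_name(file_name: str, number: int) -> str:
    """Hypothetical naming scheme: file name plus block number."""
    return f"{file_name}.{number}"
```

For double-ended data, calling `split_reads` on the R1 file and on the R2 file with the same `n` yields the same block numbers on both sides, which is what allows blocks to be matched by number later.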

Step 2, block compression. The data can be selectively compressed according to actual conditions. For double-ended Reads, the invention can either wait until the blocks with the same number in both Reads files have been split off and then compress them together, or compress the block of each Reads file first and then put the two compressed blocks together. Compressed blocks of the two Reads files with the same sequence number can be put together in a tar packet, a new compressed packet, or a directory, or by any other method that ensures the two compressed blocks are transmitted successfully together and can be used as the input of a genome preprocessing task.
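Selective compression could look like the sketch below. The criterion used here (keep the compressed form only when it is actually smaller) and the choice of gzip are assumptions made for illustration; the text leaves both the compression format and the selection rule open. The function names are hypothetical.

```python
import gzip
from typing import Tuple


def maybe_compress(data: bytes, compress: bool = True) -> Tuple[bytes, bool]:
    """Return (payload, compressed_flag).

    The raw bytes are kept when compression is disabled or does not
    reduce the size (an assumed selection rule).
    """
    if not compress:
        return data, False
    packed = gzip.compress(data)
    if len(packed) >= len(data):
        return data, False
    return packed, True


def restore(payload: bytes, compressed: bool) -> bytes:
    """Inverse of maybe_compress, used on the receiving side."""
    return gzip.decompress(payload) if compressed else payload
```

Carrying the flag with the block lets the server skip decompression for blocks that were sent uncompressed.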

Step 3, pairing the double-ended detection blocks. Single-ended detection skips this step. For double-ended detection, the blocks of the R1 and R2 files with the same block-number suffix (R1 and R2 represent the two chains of a gene sequence) are packed into the same file packet, called a block; this achieves synchronous transmission and processing of paired-end short sequences. This step may also employ other mechanisms that ensure that the matching blocks of R1 and R2 are both uploaded successfully and used together as the input of a genome data preprocessing task. For convenience of description, the block for single-ended detection of Reads is also referred to as a block.
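One of the packaging options the text names is a tar packet. The following sketch packs a matched R1/R2 pair into a single in-memory tar archive and unpacks it again; the function names and the member file names used in the usage lines are hypothetical, and a directory or another archive format would serve equally well.

```python
import io
import tarfile
from typing import Dict


def pack_pair(r1_name: str, r1_data: bytes, r2_name: str, r2_data: bytes) -> bytes:
    """Put the matched R1 and R2 blocks into one tar packet."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in ((r1_name, r1_data), (r2_name, r2_data)):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_pair(packet: bytes) -> Dict[str, bytes]:
    """Server side: recover both blocks from the packet, keyed by member name."""
    with tarfile.open(fileobj=io.BytesIO(packet), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


packet = pack_pair("sample_R1.fastq.0", b"r1 reads", "sample_R2.fastq.0", b"r2 reads")
members = unpack_pair(packet)
```

Because both halves travel in one packet, either both arrive or neither does, which is the property the pairing step is meant to guarantee.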

The order of step 2 and step 3 does not matter. The blocks with the same number in the double-ended Reads files can be split off first and then compressed together, or the block of each Reads file can be compressed first and the two compressed blocks then put together. Compressed blocks of the two Reads files with the same sequence number can be put together in a tar packet, a new compressed packet, or a directory, or by any other method that ensures the two compressed blocks are transmitted successfully together and can be used as the input of a genome preprocessing task. See FIG. 1 for compression (Compressing) and merging (Merging).

Step 4, block transmission. A client or API of the corresponding storage is called to transmit the block from step 3 to a remote server.
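Parallel block transmission might be organized as in the sketch below. It uses a local file copy as a stand-in for the real transfer, since the actual client or API depends on the storage in use; `transfer_block` and `transfer_all` are hypothetical names, and the thread count is an arbitrary default.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Dict, Iterable


def transfer_block(src: Path, dest_dir: Path) -> Path:
    """Stand-in transfer: a local copy. A real deployment would call the
    storage client or API here instead (assumption)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, dest_dir / src.name))


def transfer_all(blocks: Iterable[Path], dest_dir: Path, workers: int = 4) -> Dict[str, bool]:
    """Transfer blocks in parallel and report success per block name.

    A failure of one block is recorded and does not stop the others.
    """
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(transfer_block, Path(b), Path(dest_dir)): Path(b).name
                   for b in blocks}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                fut.result()
                results[name] = True
            except OSError:
                results[name] = False
    return results
```

Blocks reported as `False` can simply be submitted again; the successfully transferred blocks need no further action.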

Step 5, block decompression. If no compression was performed in step 2, this step is skipped. On the server, for each block successfully transmitted in step 4: for double-ended detection, the file packet is opened to obtain the compressed blocks of the two paired short sequence files; for single-ended detection, the block itself is the compressed short sequence block and needs no unpacking. The compressed blocks are then decompressed according to the compression mode used, yielding the original short sequence block data, and the integrity of the data is checked through a hash value of the file. Any hash algorithm, such as MD5, may be used to calculate the hash value.
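The decompress-and-verify step could be sketched as follows. The sketch assumes gzip compression and assumes that the sender computed the MD5 hash over the uncompressed block and sent it along with the block; the text does not specify which file the hash covers or how it is conveyed. The function names are hypothetical.

```python
import gzip
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 digest of a block as a hex string (any hash algorithm would do)."""
    return hashlib.md5(data).hexdigest()


def decompress_and_verify(payload: bytes, expected_md5: str, compressed: bool = True) -> bytes:
    """Decompress a received block if needed and check its integrity.

    Raises ValueError when the hash does not match, so that only this
    block has to be transferred again.
    """
    data = gzip.decompress(payload) if compressed else payload
    if md5_hex(data) != expected_md5:
        raise ValueError("block failed integrity check")
    return data
```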

Step 6, aligning the block against the species standard reference genome. The block data of the original Reads obtained in step 5 is aligned against the standard reference genome corresponding to the sample species using alignment software (such as BWA), and a result data file is output.
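As an illustration of per-block alignment with BWA, the sketch below builds and runs a `bwa mem` command for one block. It assumes that BWA is installed, that the reference has already been indexed with `bwa index`, and that SAM output written to a per-block file is acceptable; the text does not prescribe these details, and the wrapper names and file names are hypothetical.

```python
import subprocess
from typing import List, Optional


def bwa_mem_command(reference: str, r1_block: str, r2_block: Optional[str] = None) -> List[str]:
    """Command line for aligning one block; pass r2_block for double-ended data."""
    cmd = ["bwa", "mem", reference, r1_block]
    if r2_block is not None:
        cmd.append(r2_block)
    return cmd


def align_block(reference: str, out_sam: str, r1_block: str,
                r2_block: Optional[str] = None) -> None:
    """Run the alignment for one block; bwa mem writes SAM to standard output."""
    with open(out_sam, "w") as out:
        subprocess.run(bwa_mem_command(reference, r1_block, r2_block),
                       stdout=out, check=True)
```

Because each call handles one block and writes its own result file segment, calls for different blocks can be dispatched to different cores or cluster nodes.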

Step 7, storing the originally transmitted short sequence data. The block data of the original Reads obtained in step 5 is saved to the corresponding storage location so that it can later be read directly from storage for genome variation detection, as shown in the box "… …" in FIG. 1 and FIG. 2.

Steps 1 to 7 are processed in a streaming manner. When the processing flows of all blocks have completed, the result data of mapping the whole sample genome sequencing data to the standard reference genome of the species to which the sample belongs is obtained, and can be used by subsequent processing flows as required.

In steps 1 to 7, each step depends on the block data produced by the previous step, and together the steps form the per-block data flow for the data preprocessing of genomic variation detection. The data flows of different blocks are independent of each other and can run in parallel on a computing cluster or a multi-core server.

The whole flow from step 1 to step 7 is in block (block) units, and any failure in any step only causes the failure of the corresponding block processing flow, and does not affect the processing of other blocks. And the fault tolerance of the whole data processing flow can be realized by re-running the flow corresponding to the failed block.
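Block-level fault tolerance can be illustrated with a small retry loop in which only failed blocks are rerun. This is a sketch under assumptions: `process` stands for the whole per-block flow (transfer, decompression, alignment), the attempt limit is arbitrary, and blocks are processed sequentially here for brevity although the text describes them as running in parallel.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_blocks(block_ids: Iterable[int],
                   process: Callable[[int], None],
                   max_attempts: int = 3) -> Tuple[Dict[int, int], List[int]]:
    """Run `process` for every block, rerunning only the blocks that failed.

    Returns (attempts used per block, blocks still failing after max_attempts).
    """
    pending = list(block_ids)
    attempts = {b: 0 for b in pending}
    for _ in range(max_attempts):
        failed = []
        for b in pending:
            attempts[b] += 1
            try:
                process(b)
            except Exception:
                failed.append(b)   # only this block is retried
        pending = failed
        if not pending:
            break
    return attempts, pending
```

In the prior-art single-task mode, the equivalent of one failure would be rerunning the whole sample; here the cost of a failure is bounded by one block.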

For the overall data flow of data preprocessing for genomic variation detection of paired-end Reads, see FIG. 1; for the data preprocessing flow for genomic variation detection with single-ended detection, see FIG. 2.

The system of the invention comprises a blocking module, configured to obtain the genome detection data and block it, wherein, if the genome detection data is single-stranded data, the genome detection data of M Read short sequences is divided, with one block per N (1 <= N <= M) Read short sequences, into P = INT(M/N) blocks, where INT() is a round-up function and P is the number of blocks. For example, genome sequencing sample data containing 1,000,000 Read short sequences can be divided into 10 (1,000,000/100,000) blocks at one block per 100,000 Reads. If the genome detection data is double-stranded data, the strand data R1 and the strand data R2 are each blocked according to the single-stranded data blocking method, generating R1 block data and R2 block data, with each R1 block matched to one R2 block and vice versa;

The blocking module further comprises: and selectively compressing the blocks to achieve the purpose of reducing the size of the transmitted data.

Compressing the R1 blocks and the matching blocks in the R2 blocks into the same packet, or using other mechanisms that ensure that the matching aligned blocks of R1 and R2 are uploaded successfully at the same time and used as input for a genome data preprocessing task.

The block data in the transmission module are transmitted to the server in parallel and are analyzed and detected in parallel, and an error in any one block does not affect the analysis and detection of the remaining blocks.

The present invention also includes the following preferred embodiments, as follows:

For the blocking of sample genome sequencing data in step 1, the blocking strategy is to block by number of Read short sequences, so the size of one block can range from one Read short sequence to multiple Read short sequences. Blocks can be named as described above or by any naming scheme that identifies the block's file and expresses the continuity and order between blocks.

The order of step 2 and step 3 does not matter. The blocks with the same number in the double-ended Reads files can be split off first and then compressed together, or the block of each Reads file can be compressed first and the two compressed blocks then put together. Compressed blocks of the two Reads files with the same sequence number can be put together in a tar packet, a new compressed packet, or a directory, or by any other method that ensures the two compressed blocks are transmitted successfully together and can be used as the input of a genome preprocessing task.

The transmission in step 4 may be any method that can copy or move a data file from one storage location to another, such as SCP, FTP, or local copy. The storage location may be direct-attached local server storage, a SAN or a NAS, a distributed file system, or a cloud storage service.

In step 6, the decompressed Reads blocks are aligned against the standard reference genome of the species to which the sample belongs, using any software capable of such alignment, and an alignment result is output.

The process of step 6, which is compared to the standard reference genome of the species to which the sample belongs, can be run either in a computer cluster or on a server with multi-core processing capabilities.

Claims

1. A processing method for genome detection data transmission and preprocessing is characterized by comprising the following steps:

the number of short sequences contained in the block data is any integer between 1 and M, where M is the number of Read short sequences contained in the sample;

step 2, transmitting the block data to a server for genome analysis and detection; processing in a streaming mode, and when all the processing flows of the blocks are processed, obtaining the result data of mapping the whole sample genome sequencing data and the standard reference genome of the species to which the sample belongs, wherein the result data can be used by the subsequent various processing flows as required;

and placing the matched blocks of the R1 blocks and the R2 blocks into the same data packet or ensuring that the matched blocks of the R1 blocks and the R2 blocks are uploaded successfully at the same time and are used as input of a genome data preprocessing task at the same time.

2. The method for processing genome-oriented detection data transmission and preprocessing as claimed in claim 1, wherein the step 1 further comprises: and selectively compressing the blocks to achieve the purpose of reducing the size of the transmitted data.

3. The method for processing genome-oriented detection data transmission and preprocessing as claimed in claim 1, wherein the block data in step 2 are transmitted to the server in parallel and are analyzed and detected in parallel, and an error in any one block does not affect the analysis and detection of the remaining blocks.

4. A processing system for genome-specific test data transmission and preprocessing, comprising:

the transmission module is used for transmitting the block data to a server for genome analysis and detection; processing in a streaming mode, and when all the processing flows of the blocks are processed, obtaining the result data of mapping the whole sample genome sequencing data and the standard reference genome of the species to which the sample belongs, wherein the result data can be used by the subsequent various processing flows as required;

5. The genome-oriented detection data transmission and preprocessing processing system of claim 4, wherein the partitioning module further comprises: and selectively compressing the blocks to achieve the purpose of reducing the size of the transmitted data.

6. The genome-oriented detection data transmission and preprocessing processing system of claim 4, wherein the block data in the transmission module are transmitted to the server in parallel and are analyzed and detected in parallel, and an error in any one block does not affect the analysis and detection of the remaining blocks.