CN110767265A - Parallel acceleration method for sorting big data genome alignment files - Google Patents

Parallel acceleration method for sorting big data genome alignment files

Info

Publication number
CN110767265A
CN110767265A
Authority
CN
China
Prior art keywords
file
sorting
buffer
reading
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911008972.4A
Other languages
Chinese (zh)
Inventor
张中海
谭光明
张春明
姚二林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN201911008972.4A priority Critical patent/CN110767265A/en
Publication of CN110767265A publication Critical patent/CN110767265A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 - ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 - Sequence alignment; Homology search
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 - Data warehousing; Computing architectures
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 - ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 - Compression of genetic data

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel acceleration method for sorting big data genome alignment files, comprising the following steps: reading and decompressing a target BAM file and storing the decompressed data into a contiguous first buffer B; when the first buffer B is full, sorting its data with multiple threads and merging the results by heap sorting to form an intermediate file; reading the intermediate files in turn into their associated second buffers MB and merging the data of the second buffers MB by heap sorting; and compressing the merged data with multiple threads and writing it into a result file. The invention allocates separate threads for reading and decompression and builds dedicated thread pools for decompression and compression, which reduces the number of threads that must be opened, makes full use of multithreading resources, improves file read/write efficiency, reduces the number of intermediate files and the number of memory copy operations, and thereby shortens the processing time.

Description

Parallel acceleration method for sorting big data genome alignment files
Technical Field
The invention relates to the field of high-performance computing, and in particular to a parallel acceleration method for sorting big data genome alignment files.
Background
In recent years, advances in gene sequencing technology have driven rapid development in the field of genomic health. The rapid growth of genomic data poses ever greater challenges for genetic analysis techniques. How to process this big data from the biological gene field quickly has become a major research focus in bioinformatics and high-performance computing.
In clinical and scientific research, the mainstream analysis pipeline for human genome data includes genome alignment, sorting, duplicate removal, indel realignment, base quality score recalibration, variant detection and the like. The intermediate files to be sorted that are generated along the way range from tens of GB at the small end to hundreds of GB at the large end. Existing processing software, such as SAMtools, carries out the sorting process mainly in two stages:
the first stage, the file to be ordered is read into the memory, only one part is read each time, the data of the part is related to the size of the system memory and can be manually set; then, the part of data is evenly distributed to a plurality of threads, and each thread only sequences the distributed data; writing the ordered data into a temporary file, and continuously reading the next part for the same processing; at the end of this phase, many ordered temporary files will be generated.
In the second stage, the temporary files are merged with a heap sorting algorithm and the result is finally written to a result file.
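For illustration, the following Python sketch shows a generic two-stage external sort of the kind described above (run generation, then heap-based merge). It operates on plain text records and is only a minimal sketch of the general technique, not SAMtools' actual implementation; the portion size is an assumed parameter.

```python
import heapq
import os
import tempfile

def external_sort(in_path: str, out_path: str, records_per_run: int = 100_000) -> None:
    run_paths = []
    with open(in_path) as f:
        while True:
            # Stage 1: read one memory-sized portion, sort it, write a temporary run file.
            portion = [line for _, line in zip(range(records_per_run), f)]
            if not portion:
                break
            portion.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(portion)
            run_paths.append(run_path)
    # Stage 2: heap-merge all sorted runs into the result file.
    runs = [open(p) for p in run_paths]
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)
```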
For alignment files in the BAM format, the conventional sorting method described above must decompress on every read and compress on every write. Decompression and sorting consume relatively few processor resources, whereas compression consumes a great many. This is the biggest difference between sorting big data gene alignment files and other big data sorting workloads. Because too many temporary files are generated during sorting and too many threads are opened for reading and writing files, disk reads become inefficient when the temporary files are merged, so the traditional sorting approach is very inefficient and time-consuming.
It can be seen that, for the sorting of SAM/BAM files, efficiency depends on how well computing resources such as hard disk read/write speed, memory size and processor computing power are coordinated. However, existing sorting methods and tools such as SAMtools cannot make reasonable use of computing and hardware resources, and therefore cannot achieve faster sorting.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a parallel acceleration method for sorting big data genome alignment files. The method makes full use of the read/write performance of the hard disk while coordinating the use of computing and memory resources by the decompression, sorting and compression operations in the file processing flow, thereby improving processing efficiency and reducing processing time.
In order to achieve the above object, in one aspect, the present invention provides a parallel acceleration method for sorting big data genome alignment files, which is characterized by comprising the following steps:
step 101: reading and decompressing a target BAM file, and storing the decompressed data into a contiguous first buffer B;
step 102: after the first buffer B is full, distributing the data in the first buffer B by blocks to a plurality of threads for sorting, merging the results of the sorting threads by heap sorting, and compressing the merged data to form an intermediate file;
step 103: reading the intermediate files in turn, associating each intermediate file to be read with a second buffer, reading and decompressing these intermediate files into their associated second buffers MB, and merging the data of the second buffers MB by heap sorting;
step 104: compressing the merged data with a plurality of threads, and writing the compressed data into a result file.
In a preferred implementation, the step 101 includes allocating a read thread and a decompression thread for the read operation and the decompression operation respectively, where the number of the read threads is less than the number of the decompression threads.
In another preferred implementation, the step 101 includes: when data reading is carried out, one thread is allocated for reading, and a plurality of threads are allocated for decompression.
In another preferred implementation, the step 102 includes dividing the first buffer B into a plurality of buffer blocks, and each sorting thread is associated with a buffer block for sorting data therein.
In another preferred implementation, the step 102 includes creating a first thread pool and a second thread pool each containing a plurality of threads, the first thread pool being used for decompression operations and the second thread pool being used for compression operations.
In another preferred implementation, the step 104 includes: associating each intermediate file F to be read with a read-in thread, a read queue and a second buffer region MB, reading and decompressing the intermediate file F in sequence through the read-in thread and the decompression thread pool, and storing the intermediate file F into the associated second buffer region MB.
In another preferred implementation, the method further comprises: the plurality of intermediate files are respectively read into the associated second buffer areas MB, the files in the second buffer areas MB are read for heap sorting, and the result of the heap sorting is written into the result file.
In another preferred implementation, the step 103 includes: the data of each second buffer MB is merged by means of heap sorting.
In another aspect, the invention provides a computer readable storage medium having a computer program stored thereon, wherein the program when executed by a processor implements the method.
In another aspect, the invention provides a computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the method when executing the program. During data reading, the amount of data read and decompressed in a single pass can be determined according to the memory size.
It should be noted that the "to-be-read" intermediate file mentioned in the present invention refers to an intermediate file for which a free second buffer MB is available, i.e., an intermediate file that is about to be associated with the corresponding second buffer MB and read into it.
The invention has the following advantages:
the parallel acceleration method for sorting files by comparing large data genomes disclosed by the invention has the advantages that threads are respectively and independently allocated for reading and decompressing (the number of the reading threads is less than that of the decompressing threads), thread pools are respectively established for decompressing and compressing, so that the number of the opened threads is greatly reduced, multithreading resources are fully utilized, and the file reading and writing efficiency is further improved.
By constructing two independent buffers, the parallel acceleration method for sorting big data genome alignment files makes full use of the read/write performance of the hard disk; by allocating reading, decompression and compression threads reasonably, it coordinates the use of computing and memory resources by the decompression, sorting and compression operations during file processing, thereby improving processing efficiency and reducing processing time.
Drawings
FIG. 1 is a schematic flow chart of a parallel acceleration method for big data genome alignment file sorting according to the present invention.
FIG. 2 is a first part of a processing flow diagram of a parallel acceleration method for big data genome alignment file sorting according to an embodiment of the present invention.
FIG. 3 is a second portion of the process flow diagram of the embodiment shown in FIG. 2.
FIG. 4 is a third portion of the process flow diagram of the embodiment shown in FIG. 2.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
As shown in fig. 1, the parallel acceleration method for big data genome alignment file sorting according to the present invention generally includes the following steps:
step 101: obtaining a target BAM file to be sorted, reading and decompressing it, and storing the decompressed data into a contiguous first buffer B;
step 102: after the first buffer B is full, distributing its data to a plurality of threads for sorting, merging the results of the sorting threads by heap sorting, and compressing the merged data to form an intermediate file;
step 103: reading the intermediate files in turn, associating a second buffer with each intermediate file to be read, reading and decompressing these intermediate files into their associated second buffers MB, and merging the data of the second buffers MB by heap sorting;
step 104: compressing the merged data with a plurality of threads and writing the compressed data into a result file;
step 105: judging whether all the intermediate files have been processed; if so, the process ends, otherwise it returns to step 103.
The whole flow of the parallel acceleration method for sorting big data genome alignment files according to the invention is described in detail as follows:
Because the input BAM file is too large to be loaded into memory at once, only a part of its content can be read at a time. FIG. 2 shows the process of reading and decompressing the BAM file and placing it into buffer B. To avoid excessive memory allocation and release and copying between memory regions, one contiguous large memory area is requested as buffer B (larger than a preset value, the preset value being determined from the total memory size), and reading, decompression and caching are then carried out block by block.
Specifically, in this embodiment, two thread pools each containing N threads are created for the decompression and compression operations involved: a decompression thread pool and a compression thread pool. The threads in the decompression thread pool are used exclusively for decompression, the threads in the compression thread pool are used exclusively for compression, and all decompression and compression during the sorting process is performed by the threads of these two pools.
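As a concrete illustration, the minimal Python sketch below sets up two fixed-size thread pools dedicated to decompression and compression respectively. The pool size N and the use of gzip in place of the BAM/BGZF codec are illustrative assumptions, not the patented implementation.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

N = 8  # assumed pool size; in practice tuned to the machine

# Threads in this pool are used exclusively for decompression.
decompress_pool = ThreadPoolExecutor(max_workers=N)
# Threads in this pool are used exclusively for compression.
compress_pool = ThreadPoolExecutor(max_workers=N)

def decompress_block(block: bytes) -> bytes:
    return gzip.decompress(block)

def compress_block(data: bytes) -> bytes:
    return gzip.compress(data)

# Every (de)compression task in the sorting pipeline is submitted to the
# corresponding pool instead of spawning a fresh thread per block, e.g.:
#   future = decompress_pool.submit(decompress_block, raw_block)
#   plain = future.result()
```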
The input BAM file is read by an IO thread, and a read-in thread RT and a read queue RQ are associated with it. The read-in thread RT reads content from the input file block by block and places each block of data into the read queue RQ, which serves as the queue of blocks awaiting decompression. The decompression thread pool takes each block from the read queue RQ and hands it to a decompression thread, which decompresses the block and places the decompressed data into buffer B in order of block number. When buffer B is full, the read-in thread RT blocks and waits before reading further. The task queue to be decompressed in FIG. 2 is the read queue RQ, and the memory in FIG. 2 is buffer B in memory.
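A sketch of this producer/consumer structure is shown below: one read-in thread RT fills the read queue RQ block by block, and pool workers place processed blocks into buffer B under their block number. For simplicity the sketch reads a file in fixed-size chunks and omits the real BGZF block decompression; the dict standing in for the contiguous buffer B, the block size and the queue depth are all illustrative assumptions.

```python
import queue
import threading

BLOCK_SIZE = 4 * 1024 * 1024          # assumed read granularity
RQ: "queue.Queue[tuple[int, bytes]]" = queue.Queue(maxsize=16)  # read queue RQ
buffer_B: dict[int, bytes] = {}       # stands in for the contiguous buffer B
buffer_lock = threading.Lock()
EOF = (-1, b"")

def read_thread(path: str) -> None:
    """RT: read the input file block by block and enqueue (block_no, data)."""
    with open(path, "rb") as f:
        block_no = 0
        while chunk := f.read(BLOCK_SIZE):
            RQ.put((block_no, chunk))  # blocks when the queue is full
            block_no += 1
    RQ.put(EOF)

def decompress_worker() -> None:
    """Pool worker: process one block at a time and store it under its block number."""
    while True:
        block_no, chunk = RQ.get()
        if block_no < 0:
            RQ.put(EOF)                # let the remaining workers terminate too
            break
        data = chunk                   # real pipeline: decompress the BGZF block here
        with buffer_lock:
            buffer_B[block_no] = data  # placed by block number, as described above

# Usage: start read_thread(path) in a threading.Thread and submit several
# decompress_worker tasks to the decompression thread pool of the previous sketch.
```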
Sorting is then carried out: the main thread creates PN sorting threads, where PN is a positive integer. As shown in FIG. 3, buffer B is divided into PN blocks; in the example of the figure PN is 4, i.e. the data read into memory (for example, SAM records) is divided into PN blocks, each block is associated with one sorting thread, and each thread sorts only the data associated with it. After all sorting threads have finished, the data of the PN threads is merged with a heap sorting algorithm.
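The block-parallel sort followed by a heap merge can be sketched as follows. Records are modeled as (position, record) tuples; the sort key, the PN value and the in-memory list standing in for buffer B are assumptions. Note that in CPython the GIL limits how much of the comparison work actually runs in parallel, a limitation a native implementation would not have.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

PN = 4  # number of sorting threads, as in the example of FIG. 3

def sort_block(block: list[tuple[int, str]]) -> list[tuple[int, str]]:
    # Each sorting thread sorts only the block it is associated with.
    return sorted(block)

def sort_buffer(records: list[tuple[int, str]]) -> list[tuple[int, str]]:
    # Divide buffer B into PN contiguous blocks, one per sorting thread.
    step = max(1, (len(records) + PN - 1) // PN)
    blocks = [records[i:i + step] for i in range(0, len(records), step)]
    with ThreadPoolExecutor(max_workers=PN) as pool:
        sorted_blocks = list(pool.map(sort_block, blocks))
    # Merge the PN sorted blocks with a heap (k-way merge).
    return list(heapq.merge(*sorted_blocks))

# Example: sort_buffer([(1002, "r3"), (17, "r1"), (530, "r2"), (9, "r0")])
# returns the records ordered by position.
```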
Each time buffer B has been filled, sorted by the multiple threads and merged by heap sorting in this way, the result is written out as one intermediate file F. Repeating the above steps yields a number of intermediate files F.
The merged data is handed to the compression thread pool for compression, the compressed blocks are placed into a write queue WQ in sequence, and an IO thread writes them out. Specifically, the intermediate file F may be associated with a write thread WT and a write queue WQ, and the write thread WT writes the data in the queue to the hard disk in order. The sorted intermediate file F (BAM format) in FIG. 3 is any one of the intermediate files F, and the SAM data read into memory refers to the data held in buffer B.
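The ordered write path can be sketched as follows: the merged records are cut into blocks, each block is compressed by the compression thread pool, and a single write thread WT writes the compressed blocks to disk strictly in block order. gzip again stands in for the BAM/BGZF codec, and the block size and the reorder buffer inside the write thread are assumptions made for illustration.

```python
import gzip
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

compress_pool = ThreadPoolExecutor(max_workers=4)      # compression thread pool
WQ: "queue.Queue[tuple[int, bytes]]" = queue.Queue()   # write queue WQ

def _write_thread(path: str, n_blocks: int) -> None:
    """WT: drain WQ and write blocks to disk in ascending block order."""
    pending: dict[int, bytes] = {}
    next_no = 0
    with open(path, "wb") as f:
        while next_no < n_blocks:
            block_no, data = WQ.get()
            pending[block_no] = data
            while next_no in pending:                   # flush any in-order prefix
                f.write(pending.pop(next_no))
                next_no += 1

def write_file(path: str, merged: list[str], records_per_block: int = 10000) -> None:
    blocks = ["".join(merged[i:i + records_per_block]).encode()
              for i in range(0, len(merged), records_per_block)]
    writer = threading.Thread(target=_write_thread, args=(path, len(blocks)))
    writer.start()
    # Submission order fixes each block's number; compression runs in the pool.
    futures = [compress_pool.submit(gzip.compress, b) for b in blocks]
    for block_no, fut in enumerate(futures):
        WQ.put((block_no, fut.result()))
    writer.join()
```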
These operations are repeated until the entire input file has been processed.
The above processing generates a number of intermediate files F, say M of them. Since M may be large and only a limited number of intermediate files can be processed at the same time, the intermediate files may need to be read and decompressed in several rounds. FIG. 4 shows two intermediate files F being decompressed, buffered, heap-sorted and compression-written simultaneously.
The following description takes the concurrent reading and heap sorting of two intermediate files as an example. For each intermediate file F to be read, a read thread, a read queue and a buffer MB are associated with it when it is read. The intermediate file F is read and decompressed by its read thread together with decompression threads from the decompression thread pool (for example, a single read thread and multiple decompression threads; since this read-and-decompress flow is similar to that of FIG. 2, it is drawn in simplified form in FIG. 4), and the associated buffer MB is filled. The data currently held in all the buffers MB is then merged with a heap sorting algorithm. When a buffer MB has been exhausted (its contents have been read out and heap-sorted), the read thread is started again, the corresponding intermediate file F is read further through the read thread and the decompression threads of the decompression thread pool, and the buffer MB is refilled with decompressed data. Once the buffer MB has been refilled, the merge operation continues.
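The multi-file merge can be sketched as below: each intermediate file F has an associated buffer MB that is refilled whenever it runs empty, and a heap always holds the current smallest record of every buffer. Plain, individually sorted text files stand in for the compressed BAM intermediates, the buffer capacity is an assumption, and the read thread and decompression pool refill path of FIG. 4 is reduced to a simple file read.

```python
import heapq

MB_RECORDS = 1000   # assumed capacity of one second buffer MB

class FileBuffer:
    """One intermediate file F plus its associated second buffer MB."""
    def __init__(self, path: str):
        self._fh = open(path, "r")
        self.buf: list[str] = []
        self.pos = 0
        self.refill()

    def refill(self) -> None:
        # In the real pipeline this re-launches the read thread and the
        # decompression pool; here it simply reads the next MB_RECORDS lines.
        self.buf = [line for _, line in zip(range(MB_RECORDS), self._fh)]
        self.pos = 0

    def next(self):
        if self.pos == len(self.buf):
            self.refill()
        if not self.buf:
            self._fh.close()
            return None
        rec = self.buf[self.pos]
        self.pos += 1
        return rec

def merge_intermediates(paths: list[str], out_path: str) -> None:
    buffers = [FileBuffer(p) for p in paths]
    heap = [(b.next(), i) for i, b in enumerate(buffers)]
    heap = [(r, i) for r, i in heap if r is not None]
    heapq.heapify(heap)
    with open(out_path, "w") as out:       # compression omitted in this sketch
        while heap:
            rec, i = heapq.heappop(heap)
            out.write(rec)
            nxt = buffers[i].next()
            if nxt is not None:
                heapq.heappush(heap, (nxt, i))

# Usage: merge_intermediates(["part0.txt", "part1.txt"], "sorted.txt"),
# where each input file is already sorted.
```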
The merged result is written to a result file FN. Similarly to the writing of an intermediate file F, a write thread and a write queue are associated with the merged output; the compression threads in the compression thread pool CP compress the merged data, and the write thread finally writes the compressed data into the result file FN.
The read queue RQ of FIG. 2 shows only 4 blocks of data; it should be understood that the number of data blocks in the read queue RQ may be 4 or more. The number M of intermediate files F shown in FIG. 4 is 2; it should be understood that M may be 2 or more. The number of data blocks in RQ and the number M of intermediate files F are determined by the data size of the input BAM file and the size of buffer B in memory, and are not described further here.
Compared with existing processing software such as SAMtools, under identical hardware conditions (for example, on the same computer or server) and sorting target files of the same size (for example, 100 GB), the sorting method of the invention increases data sorting speed by 40-50%.
By greatly reducing the number of threads that must be opened, the parallel acceleration method for sorting big data genome alignment files according to the invention makes full use of multithreading resources, thereby improving file read/write efficiency, reducing the number of intermediate files, reducing the number of memory copy operations and shortening the processing time.
The parallel acceleration method for sorting big data genome alignment files according to the invention makes full use of the read/write performance of the hard disk and, by coordinating the use of computing and memory resources by the decompression, sorting and compression operations during file processing, achieves the goals of improving processing efficiency and reducing processing time.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A parallel acceleration method for sorting big data genome alignment files, characterized by comprising the following steps:
step 101: reading and decompressing a target BAM file, and storing the decompressed data into a contiguous first buffer B;
step 102: after the first buffer B is full, distributing the data in the first buffer B by blocks to a plurality of threads for sorting, merging the results of the sorting threads by heap sorting, and compressing the merged data to form an intermediate file;
step 103: reading the intermediate files in turn, associating each intermediate file to be read with a second buffer, reading and decompressing these intermediate files into their associated second buffers MB, and merging the data of the second buffers MB by heap sorting;
step 104: compressing the merged data with a plurality of threads, and writing the compressed data into a result file.
2. The parallel acceleration method for big data genome alignment file sorting according to claim 1, wherein the step 101 comprises allocating a reading thread and a decompression thread for the reading operation and the decompression operation respectively, and the number of the reading threads is less than that of the decompression threads.
3. The parallel acceleration method for big data genome alignment file sorting according to claim 2, wherein the step 101 comprises: when data reading is carried out, one thread is allocated for reading, and a plurality of threads are allocated for decompression.
4. The method of claim 1, wherein the step 102 comprises dividing the first buffer B into a plurality of buffer blocks, and each sorting thread is associated with one buffer block for sorting data therein.
5. The method of claim 1, wherein the step 102 comprises creating a first thread pool and a second thread pool respectively comprising a plurality of threads, the first thread pool being used for decompression operations and the second thread pool being used for compression operations.
6. The parallel acceleration method for big data genome alignment file sorting according to claim 1, wherein the step 104 comprises: associating each intermediate file F to be read with a read-in thread, a read queue and a second buffer region MB, reading and decompressing the intermediate file F in sequence through the read-in thread and the decompression thread pool, and storing the intermediate file F into the associated second buffer region MB.
7. The parallel acceleration method for big data genome alignment file sorting according to claim 6, characterized in that the method further comprises: the plurality of intermediate files are respectively read into the associated second buffer areas MB, the files in the second buffer areas MB are read for heap sorting, and the result of the heap sorting is written into the result file.
8. The parallel acceleration method for big data genome alignment file sorting according to claim 1, wherein the step 103 comprises: the data of each second buffer MB is merged by means of heap sorting.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
10. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 8 when executing the program.
CN201911008972.4A 2019-10-23 2019-10-23 Parallel acceleration method for sorting big data genome alignment files Pending CN110767265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911008972.4A CN110767265A (en) 2019-10-23 2019-10-23 Parallel acceleration method for big data genome comparison file sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911008972.4A CN110767265A (en) 2019-10-23 2019-10-23 Parallel acceleration method for big data genome comparison file sequencing

Publications (1)

Publication Number Publication Date
CN110767265A (en) 2020-02-07

Family

ID=69332927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911008972.4A Pending CN110767265A (en) 2019-10-23 2019-10-23 Parallel acceleration method for big data genome comparison file sequencing

Country Status (1)

Country Link
CN (1) CN110767265A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309269A (en) * 2020-02-28 2020-06-19 苏州浪潮智能科技有限公司 Method, system and equipment for dropping compressed data and readable storage medium
CN114242173A (en) * 2021-12-22 2022-03-25 深圳吉因加医学检验实验室 Data processing method, device and storage medium for identifying microorganisms by using mNGS
US12026371B2 (en) 2020-02-28 2024-07-02 Inspur Suzhou Intelligent Technology Co., Ltd. Method, system, and device for writing compressed data to disk, and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130198464A1 (en) * 2012-01-27 2013-08-01 Comcast Cable Communications, Llc Efficient read and write operations
CN103577559A (en) * 2013-10-23 2014-02-12 华为技术有限公司 Data ordering method and device
CN104572106A (en) * 2015-01-12 2015-04-29 浪潮电子信息产业股份有限公司 Concurrent program developing method for processing of large-scale data based on small memory
CN110299187A (en) * 2019-07-04 2019-10-01 南京邮电大学 A kind of parallelization gene data compression method based on Hadoop


Similar Documents

Publication Publication Date Title
CN103559020B (en) A kind of DNA reads ordinal number according to the compression of FASTQ file in parallel and decompression method
US7401174B2 (en) File system defragmentation and data processing method and apparatus for an information recording medium
EP3309685B1 (en) Method and apparatus for writing data to cache
WO2015145647A1 (en) Storage device, data processing method, and storage system
TW201111986A (en) Memory apparatus and data access method for memories
CN111061434B (en) Gene compression multi-stream data parallel writing and reading method, system and medium
CN107632776A (en) For compressing the data storage device of input data
US8850148B2 (en) Data copy management for faster reads
CN110767265A (en) Parallel acceleration method for big data genome comparison file sequencing
CN108134609A (en) Multithreading compression and decompressing method and the device of a kind of conventional data gz forms
US10810174B2 (en) Database management system, database server, and database management method
WO2024045556A1 (en) L2p table updating method, system and apparatus, and nonvolatile readable storage medium
US8713278B2 (en) System and method for stranded file opens during disk compression utility requests
CN104239231B (en) A kind of method and device for accelerating L2 cache preheating
US8452900B2 (en) Dynamic compression of an I/O data block
CN117369731B (en) Data reduction processing method, device, equipment and medium
EP3869343A1 (en) Storage device and operating method thereof
TW201351276A (en) Scheduling and execution of compute tasks
US9507794B2 (en) Method and apparatus for distributed processing of file
CN108334457B (en) IO processing method and device
CN115933994A (en) Data processing method and device, electronic equipment and storage medium
EP4321981A1 (en) Data processing method and apparatus
CN114816322A (en) External sorting method and device of SSD and SSD memory
US20220188316A1 (en) Storage device adapter to accelerate database temporary table processing
CN112037874B (en) Distributed data processing method based on mapping reduction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200207