CN111326216B

CN111326216B - Rapid partitioning method for big data gene sequencing file

Info

Publication number: CN111326216B
Application number: CN202010122470.0A
Authority: CN
Inventors: 张中海; 谭光明; 张春明; 姚二林
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2023-07-21
Anticipated expiration: 2040-02-27
Also published as: CN111326216A

Abstract

The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files. With this method, the sequencing file does not need to be physically split during multi-node gene analysis, and no subfiles are generated. A flexible partitioning scheme is chosen to suit the subsequent analysis program, so that the load across nodes is better balanced, hard disk reads and writes are reduced, and partitioning efficiency is improved.

Description

Rapid partitioning method for big data gene sequencing file

Technical Field

The invention relates to the field of high-performance computing, and in particular to a rapid partitioning method for big data gene sequencing files.

Background

With the rapid development of the health industry, genetic analysis technology plays an increasingly important role. Gene sequencers produce large numbers of sequencing files, most commonly in the fastq format. A single sequencing file ranges from a few gigabytes to tens or even hundreds of gigabytes. Processing data of this size quickly is increasingly the bottleneck of genetic analysis.

Because sequencing files are large, analysis on a single node takes a long time, so multiple nodes are needed to compute in parallel and shorten the gene analysis. This requires partitioning the sequencing file: each node processes only a portion of the file, and the partial results are finally merged, so that the complete result of the genetic analysis is obtained in a short time.

When multiple nodes are used to process a sequencing file, the conventional partitioning method divides the file equally by the number of nodes, generates the corresponding subfiles, and writes them to the hard disk; each node then reads its own subfile for processing. This method is simple and convenient, but it increases the read/write load on the hard disk.

Moreover, the conventional partitioning method may affect the results of the subsequent program. In sequencing analysis, sequence alignment software such as bwa and bowtie is generally used. The bwa program, for example, reads its input in blocks and processes one block of the fastq file at a time. Because the conventional partitioning method does not take this into account, the output of bwa is affected and the alignment results can easily become inconsistent.

Disclosure of Invention

The invention provides a rapid partitioning method for big data gene sequencing files, which comprises the following steps:

step 101, setting the size of a file block;

step 102, analyzing and counting fastq files according to the file block sizes set in the step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file;

step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;

step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, and providing the reading instruction for a subsequent program in a pipeline mode.

Preferably, step 102 in the above method further comprises: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.

Preferably, the position information of the file block in the above method includes a start position and an end position of the file block.

Preferably, the file block size in the above method may have a value ranging from 1M to 100M.

Preferably, the number of file blocks that each node needs to process in step 103 in the above method is calculated according to the following formula:

$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot B_t$$

wherein $B_i$ is the number of file blocks processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$B_t$ is the total number of file blocks;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

According to another aspect of the present invention, there is provided a rapid partitioning method for big data gene sequencing files, comprising:

step 201, analyzing and counting fastq files according to sequences to obtain position information of each sequence and total number of the sequences;

step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;

step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file, which are obtained in step 202, and providing the read instruction to a subsequent program in a pipeline mode.

Preferably, the position information of the sequence of the above method includes a start position and an end position of the sequence.

Preferably, the number of sequences that each node needs to process in step 202 of the above method is calculated according to the following formula:

$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot S_t$$

wherein $S_i$ is the number of sequences processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$S_t$ is the total number of sequences;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements any of the methods described above.

A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements any of the methods described above when executing the program.

To address the defects of the prior art, the invention adopts a lazy partitioning strategy for fastq files: no subfiles are generated, which avoids reading, writing and storing them. Several partitioning modes are also provided to suit the subsequent analysis software. The method reduces the number of hard disk reads and writes, speeds up file partitioning, and eliminates alignment errors.

Drawings

Embodiments of the invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a flow diagram of a partitioning method by block according to one embodiment of the invention.

FIG. 2 is a flow diagram of a method of partitioning by sequence according to one embodiment of the invention.

Detailed Description

Before the method is described in detail, the format of the fastq file is briefly introduced. A fastq file is a text file in which every four lines form one sequence: the first line is the name of the sequence, the second line is the base sequence, the third line is explanatory information, and the fourth line is the quality score information of the sequence. Sequences do not all have the same length. Sequencing files are either single-ended or double-ended (paired-end): a single-ended sequencing file consists of one file, while a double-ended sequencing file consists of a pair of files in which the sequences correspond to each other one to one.
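The four-line layout described above can be illustrated with a minimal Python sketch. The names `FastqRecord` and `read_fastq` are hypothetical and are not part of the invention; the sketch assumes that every record occupies exactly four lines and that lines are not wrapped.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1: name information, starts with "@"
    bases: str     # line 2: base sequence
    comment: str   # line 3: explanatory line, starts with "+"
    quality: str   # line 4: one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record for every group of four lines."""
    while True:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:  # an empty first line means end of file
            return
        yield FastqRecord(*lines)


# Two records of different lengths, as the description allows.
sample = "@read1\nACGT\n+\nIIII\n@read2\nACGTAC\n+\nIIIIII\n"
records = list(read_fastq(io.StringIO(sample)))
```

Because sequences differ in length, a byte offset chosen at random generally falls inside a record, which is why the partitioning steps below work with record boundaries.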

According to one embodiment of the present invention, a block-wise partitioning method is described in connection with fig. 1, which method comprises the following steps.

In step 101, the file block size is set, and preferably, the value range can be between 1M and 100M.

The inventors have found that the gene sequencing analysis tool bwa reads sequencing files in blocks during alignment, and that partitioning a fastq file by block therefore facilitates load balancing when bwa is run on multiple nodes to align the file in parallel. The file block size can be chosen to suit the analysis software and the processing capacity of the nodes; preferably, it ranges from 1M to 100M. In one embodiment of the present invention, a file block size of 10M gives a good processing speed and load balance.

Step 102, analyzing and counting fastq files according to the block sizes set in the previous step, and dividing the fastq files into a plurality of file blocks; and saving the analysis result in the information file.

The analysis by block size proceeds as follows, taking a block size of 10M as an example. Starting from the beginning of the file, the position is advanced by 10 Mbytes. If the byte at that position is exactly the end of a sequence, the start position of the first file block is set to 0 and its end position to 10M. If the byte falls in the middle of a sequence, the end position of the first file block is set to the end position of that sequence. Once the first file block has been located, its start position and end position are stored in the information file. The first file block is therefore at least 10M in size and contains only complete sequences.

The byte immediately after the end of the first file block is then taken as the start position of the second file block, and the position is again advanced by 10 Mbytes. If the byte at that position is exactly the end of a sequence, it becomes the end position of the second file block; if it falls at the beginning or in the middle of a sequence, the end position of the second file block is set to the end position of that sequence. The analysis continues in this way to the end of the fastq file, and the start and end positions of all file blocks are stored in the information file. According to one embodiment of the invention, only the start positions need to be stored in the information file, since each file block ends immediately before the next one starts.

The last file block may be smaller than 10M. The sizes of the other file blocks differ slightly from one another, each being 10M or slightly larger. Each file block contains a number of complete sequences, and each sequence belongs to exactly one file block. According to one embodiment of the invention, the number of file blocks is also counted during the analysis and stored in the information file.
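The boundary rule of step 102 can be sketched in Python as follows. This is an illustration under stated assumptions, not the inventors' implementation: the function name `index_blocks` is hypothetical, end offsets are exclusive (so the next block starts at the previous end rather than one byte after it), and the file is scanned record by record instead of jumping ahead by the block size, because a record boundary cannot be identified reliably from an arbitrary byte offset (a quality line may itself begin with "@"). The resulting blocks follow the same rule: each block ends at the first record boundary at or beyond the block size.

```python
import io
from typing import BinaryIO, List, Tuple


def index_blocks(handle: BinaryIO, block_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of blocks that end on record boundaries."""
    blocks = []
    start = 0  # start offset of the block being built
    pos = 0    # offset just past the last complete record read
    handle.seek(0)
    while True:
        lines = [handle.readline() for _ in range(4)]  # one four-line record
        if not lines[0]:
            break
        pos += sum(len(line) for line in lines)
        if pos - start >= block_size:
            # The block has reached the target size; close it at this record end.
            blocks.append((start, pos))
            start = pos
    if pos > start:
        blocks.append((start, pos))  # the last block may be smaller
    return blocks


data = b"@r1\nACGT\n+\nIIII\n" * 5  # five records of 16 bytes each
blocks = index_blocks(io.BytesIO(data), block_size=20)
print(blocks)  # [(0, 32), (32, 64), (64, 80)]
```

With a 20-byte block size and 16-byte records, each full block grows to 32 bytes so that no record is split, and the final block holds the remaining 16 bytes.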

For double-ended sequencing files, if the two files are the same size, the block-wise statistical analysis is performed on one file only; if their sizes differ, the two files are partitioned using the other method of the invention, namely partitioning by sequence.

Step 103, calculating the number of file blocks to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 102.

Specifically, in a multi-node genetic analysis workflow the computing nodes may differ in core count and computing power. These differences are taken into account when the sequencing file is partitioned and the processing range of each node is determined, so that the load is more balanced.

According to one embodiment of the invention, the number of file blocks processed by each node is calculated according to equation 1:

$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot B_t \qquad (1)$$

where $B_i$ is the number of file blocks processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$B_t$ is the total number of file blocks;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

After the number of file blocks to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
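Step 103 can be sketched as follows, assuming that blocks are allocated in proportion to core counts. The description does not say how fractional shares are rounded; the largest-remainder rounding used here is an assumption, and the names `blocks_per_node` and `node_ranges` are hypothetical.

```python
from typing import List, Tuple


def blocks_per_node(total_blocks: int, cores: List[int]) -> List[int]:
    """Split total_blocks among nodes in proportion to their core counts.

    Rounding is an assumption: shares are floored, then the leftover
    blocks go to the nodes with the largest fractional remainders.
    """
    total_cores = sum(cores)
    counts = [total_blocks * c // total_cores for c in cores]
    remainders = [total_blocks * c % total_cores for c in cores]
    leftover = total_blocks - sum(counts)
    by_remainder = sorted(range(len(cores)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def node_ranges(blocks: List[Tuple[int, int]], counts: List[int]) -> List[Tuple[int, int]]:
    """Map consecutive runs of contiguous blocks to one byte range per node."""
    # Block boundaries: every block start plus the end of the last block.
    offsets = [start for start, _ in blocks] + [blocks[-1][1]]
    ranges = []
    first = 0
    for count in counts:
        ranges.append((offsets[first], offsets[first + count]))
        first += count
    return ranges


blocks = [(i * 10, (i + 1) * 10) for i in range(10)]  # ten toy blocks of 10 bytes
counts = blocks_per_node(len(blocks), [8, 16, 8])
ranges = node_ranges(blocks, counts)
print(counts)  # [3, 5, 2]
print(ranges)  # [(0, 30), (30, 80), (80, 100)]
```

The 16-core node receives half of the blocks, and the one block left over after flooring goes to the first 8-core node. Only the block positions from the information file are consulted; the fastq file itself is not read at this stage.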

Step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed of each node in the original sequencing file, and providing the reading instruction for a subsequent program in a pipeline mode.

Pipeline commands are prior art and are described here only briefly, taking the Linux system as an example. A Linux pipeline connects multiple commands with the vertical bar "|", which is called the pipe symbol. The syntax of a Linux pipeline is as follows:

command1|command2

In the present invention, command1 is an instruction that reads the given range of the fastq file, and command2 is the invocation of a block-wise analysis tool such as bwa.
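The description does not name the command used to read a byte range. One possible form of command1, shown here purely as an assumption, combines the standard `tail -c` and `head -c` utilities. The helper `build_pipeline` is hypothetical, and `analysis_cmd` stands for whatever invocation of the analysis tool reads its input from standard input.

```python
import shlex


def build_pipeline(fastq_path: str, start: int, end: int, analysis_cmd: str) -> str:
    """Return a shell pipeline that feeds bytes [start, end) to analysis_cmd."""
    length = end - start
    reader = (
        f"tail -c +{start + 1} {shlex.quote(fastq_path)}"  # tail offsets are 1-based
        f" | head -c {length}"                             # stop after `length` bytes
    )
    return f"{reader} | {analysis_cmd}"


# "wc -l" is only a stand-in for the real analysis command.
command = build_pipeline("reads.fastq", 30, 80, "wc -l")
print(command)  # tail -c +31 reads.fastq | head -c 50 | wc -l
```

Each node runs its own pipeline with its own start and end positions, so the selected range is streamed directly to the analysis tool and no subfile is written to disk.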

According to another aspect of the present invention, the inventors have also found that bowtie processes fastq files sequence by sequence. When the subsequent program is bowtie or another tool that analyzes by sequence, the fastq file therefore needs to be partitioned by sequence.

A method of partitioning by sequence according to an embodiment of the invention is described below with reference to fig. 2. The method comprises the following steps.

Step 201, performing analysis statistics on fastq files according to the sequence. And storing the analysis result into the information file while analyzing the fastq file.

Specifically, in the analysis process, the starting and ending positions of each sequence are analyzed and recorded, the number of the sequences is counted, and the sequences are saved in an information file.
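Step 201 can be sketched as follows. The layout of the information file is not specified in the description; the plain-text layout written by `write_info` (total count on the first line, then one "start end" pair per line) is a hypothetical choice, as are both function names. End offsets are exclusive.

```python
import io
from typing import BinaryIO, List, TextIO, Tuple


def index_sequences(handle: BinaryIO) -> List[Tuple[int, int]]:
    """Return the (start, end) byte offsets of every four-line record."""
    index = []
    pos = 0
    handle.seek(0)
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            break
        size = sum(len(line) for line in lines)
        index.append((pos, pos + size))
        pos += size
    return index


def write_info(index: List[Tuple[int, int]], out: TextIO) -> None:
    """Hypothetical layout: sequence count first, then 'start end' per line."""
    out.write(f"{len(index)}\n")
    for start, end in index:
        out.write(f"{start} {end}\n")


data = b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"
index = index_sequences(io.BytesIO(data))
info = io.StringIO()
write_info(index, info)
print(index)  # [(0, 16), (16, 36)]
```

Once this index exists, later steps can locate any sequence range without reading the fastq file again.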

For double-ended sequencing files, if the two files are the same size, the sequence-wise statistical analysis is performed on one file only; if their sizes differ, the sequence-wise statistical analysis is performed on each file separately.
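When the two files of a pair differ in size, the same sequence numbers map to different byte offsets in each file, which is why each file needs its own index. The following sketch illustrates that lookup; `paired_ranges` is a hypothetical helper, and it assumes that sequence k in the first file corresponds to sequence k in the second, as stated in the description of the fastq format.

```python
from typing import List, Tuple

Range = Tuple[int, int]


def paired_ranges(index1: List[Range], index2: List[Range],
                  first: int, count: int) -> Tuple[Range, Range]:
    """Byte ranges covering sequences [first, first + count) in both files."""
    last = first + count - 1
    return (
        (index1[first][0], index1[last][1]),
        (index2[first][0], index2[last][1]),
    )


# Toy per-file indexes: three sequences each, with different record sizes.
index1 = [(0, 16), (16, 36), (36, 52)]
index2 = [(0, 18), (18, 36), (36, 56)]
print(paired_ranges(index1, index2, first=1, count=2))
# ((16, 52), (18, 56))
```

The node assigned sequences 2 and 3 thus reads bytes 16 to 52 of the first file and bytes 18 to 56 of the second, which keeps the pairs aligned.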

Step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed of each node in the fastq file according to the statistical information obtained in step 201.

According to one embodiment of the invention, in the multi-node gene analysis flow, the number of cores and the computing power of each computing node are different, so that when sequencing file division is performed, the processing range of each node is determined according to the situations, and the load is more balanced.

The number of sequences processed by each node is calculated according to equation 2:

$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot S_t \qquad (2)$$

where $S_i$ is the number of sequences processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$S_t$ is the total number of sequences;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

After the number of sequences to be processed of each node is calculated, the starting position and the ending position of the file part to be processed of each node in the original sequencing file can be determined through the information file.
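As a worked illustration, assume that the sequences are allocated in proportion to core counts, and take hypothetical values of three nodes with 8, 16 and 8 cores and a file of 1,000,000 sequences:

```latex
\begin{aligned}
\sum_{j=1}^{3} c_j &= 8 + 16 + 8 = 32 \\
S_1 &= \tfrac{8}{32} \times 1\,000\,000 = 250\,000 \\
S_2 &= \tfrac{16}{32} \times 1\,000\,000 = 500\,000 \\
S_3 &= \tfrac{8}{32} \times 1\,000\,000 = 250\,000
\end{aligned}
```

Under these assumptions the second node would read from the start position of sequence 250,001 to the end position of sequence 750,000, with both positions looked up in the information file.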

Step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the original sequencing file, and providing the read instruction to the subsequent program in a pipeline manner.

The pipe command format is as follows:

command1|command2

command1 is an instruction for reading the fastq file range, and command2 is an instruction for a tool for analyzing sequences, such as bowtie.

The invention provides a rapid dividing method for big data gene sequencing files, which ensures that in the multi-node gene analysis process, the sequencing files are not required to be actually divided, subfiles are not generated, and a flexible dividing scheme is provided according to a subsequent analysis program, so that the loads of all nodes are more balanced, hard disk reading and writing are reduced, and the dividing efficiency is improved.

It should be noted that not all steps in the foregoing embodiments are necessary, and those skilled in the art may make appropriate choices, substitutions, modifications and the like according to actual needs.

Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the embodiments, those skilled in the art will understand that modifications and equivalents may be made thereto without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A rapid partitioning method for big data gene sequencing files, comprising:

step 101, setting the size of a file block;

step 102, analyzing and counting fastq files according to the file block sizes set in the step 101, and dividing the fastq files into a plurality of file blocks; storing the position information of each file block and the total number of the file blocks into an information file, wherein the position information of the file blocks comprises a starting position and an ending position of the file blocks;

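step 103, calculating the number of file blocks to be processed by each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of the file part to be processed by each node in the fastq file;

step 104, generating a reading instruction according to the starting position and the ending position of the file part to be processed by each node in the fastq file, which are determined in step 103, providing the reading instruction to a subsequent program in a pipeline mode,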

the number of file blocks to be processed by each node in step 103 is calculated according to the following formula:

$$B_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot B_t$$

wherein $B_i$ is the number of file blocks processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$B_t$ is the total number of file blocks;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

2. The rapid partitioning method for big data gene sequencing file of claim 1, said step 102 further comprising: if the ending position of a file block is in the middle of a sequence, the file block is extended to the end of the sequence.

3. The rapid partitioning method for big data gene sequencing file as claimed in claim 1, wherein the file block size has a value ranging from 1M to 100M.

4. A rapid partitioning method for big data gene sequencing files, comprising:

step 201, analyzing and counting fastq files according to sequences, obtaining position information of each sequence and total number of the sequences, and storing the position information of each sequence into an information file, wherein the position information of each sequence comprises a starting position and an ending position of each sequence;

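step 202, calculating the number of sequences to be processed of each node according to the number of nodes and the number of cores of each node, and determining the starting position and the ending position of a file part to be processed of each node in a fastq file;

step 203, generating a read instruction according to the start position and the end position of the file part to be processed by each node in the fastq file obtained in step 202, providing the read instruction to a subsequent program in a pipeline manner,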

the number of sequences that each node needs to process in step 202 is calculated according to the following formula:

$$S_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \cdot S_t$$

wherein $S_i$ is the number of sequences processed by the i-th node;

$c_i$ is the number of cores of the i-th node;

$S_t$ is the total number of sequences;

$n$ is the total number of nodes;

$j$ is an integer ranging from 1 to $n$;

$c_j$ is the number of cores of the j-th node.

5. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the method according to any of claims 1-4.

6. A computer device comprising a memory and a processor, on which memory a computer program is stored which can be run on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the program.