CN107169313A

CN107169313A - Method for reading DNA data files and computer-readable storage medium

Info

Publication number: CN107169313A
Application number: CN201710195158.2A
Authority: CN
Inventors: 葛健秋; 孟金涛; 郭宁; 滕彦宁; 魏彦杰
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2017-03-29
Filing date: 2017-03-29
Publication date: 2017-09-15

Abstract

The invention discloses a method for reading DNA data files and a computer-readable storage medium. The method includes: determining the size of each file block of a DNA data file based on a preset number of processes; splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process; and having each process read its corresponding file block in parallel. The invention can speed up the reading of DNA data files and greatly shorten read time, particularly for TB- and PB-scale DNA files, and can reduce the memory a single machine consumes to hold a DNA data file.

Description

Method for reading DNA data files and computer-readable storage medium

Technical field

The present invention relates to the technical field of bioinformatics, and more particularly to a method for reading DNA data files and a computer-readable storage medium.

Background

Software currently used for gene sequencing mainly consists of assembly tools such as EULER, SSAKE, VCAKE, Velvet, and IDBA. Among these, algorithms such as Velvet and IDBA read DNA data files serially, which places very high demands on the memory of a single machine. Because DNA data files are very large, a method for reading DNA data files is needed that reduces the memory a single machine consumes to hold a DNA data file.

Summary of the invention

To overcome the problem of excessive memory consumption when reading DNA data files in the prior art, in one aspect an embodiment of the present invention provides a method for reading DNA data files, including:

Determining the size of each file block of a DNA data file based on a preset number of processes;

Splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process;

Each process reading its corresponding file block in parallel.

Wherein, splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:

Determining the start position of each file block based on the size of each file block;

Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;

Splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process;

Each process reading its corresponding file block in parallel includes:

Each process reading its corresponding file block in parallel based on the start address and end address of each file block.

Wherein, determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file includes:

Searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment;

Setting the start address of that file block to the starting point;

Setting the end address of the file block preceding that file block to the starting point.

Wherein, searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment includes:

Searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment based on the file block dynamic adjustment algorithm FAA.

Wherein, each process reading its corresponding file block in parallel includes:

Each process reading its corresponding file block in parallel in the form of data blocks, the size of a data block being a preset, adjustable size.

In another aspect, an embodiment of the present invention provides a computer-readable storage medium, including:

The medium is used to store a computer program that can be run on a processor; the computer program is used for:

Each process reading its corresponding file block in parallel.

Each process reading its corresponding file block in parallel includes:

Setting the start address of any file block to the starting point;

By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiments of the present invention enable each process to read its corresponding file block in parallel. This speeds up the reading of DNA data files and greatly shortens read time, particularly for TB- and PB-scale DNA files, and can reduce the memory a single machine consumes to hold a DNA data file.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a first embodiment of the method for reading DNA data files of the present invention;

Fig. 2 is a flowchart of a second embodiment of the method for reading DNA data files of the present invention.

Detailed description

To make the technical problem solved by the present invention, its technical solutions, and its beneficial effects clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

Referring to Fig. 1, which is a flowchart of the first embodiment of the method for reading DNA data files of the present invention, the reading method includes the following steps:

101: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of each file block is size/p, where size is the size of the DNA data file and p is the preset number of processes.

102: Split the DNA data file based on the size of each file block to obtain the file block corresponding to each process. Once the size of each file block is determined, the DNA data file can be split sequentially by that size, which yields the file block corresponding to each process.

103: Each process reads its corresponding file block in parallel.

By determining the size of each file block of the DNA data file based on a preset number of processes, and splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process, the embodiment of the present invention enables each process to read its corresponding file block in parallel, which can reduce the memory a single machine consumes to hold the DNA data file.

Referring to Fig. 2, which is a flowchart of the second embodiment of the method for reading DNA data files of the present invention, the reading method includes the following steps:

201: Determine the size of each file block of the DNA data file based on a preset number of processes. Preferably, to keep the file blocks uniform in size, the size of the DNA data file is obtained in advance and the size of each file block is set to size/p, where size is the size of the DNA data file and p is the preset number of processes.

Step 201 can be implemented in an MPI environment. The MPI environment is initialized first: before parallel input is performed, a series of initialization work is carried out and the initialization function MPI_Init() is called. Next, the information of all participating computing nodes, the identification number of each process taking part in group communication, and the total number of processes taking part in group communication are declared. The input file parameters are then initialized: depending on the assembly requirements, the corresponding initialization parameters are supplied, including the k-mer length, the input gene sequencing sequence file, and the output location for the gene sequencing sequence file. Finally, the DNA data file is read and its size, size, is calculated: every participating process calls the MPI_File_open() function in parallel to open the DNA data file to be processed and reads the total size of the DNA data file.

202: Determine the start position of each file block based on the size of each file block. Once the start position of each file block is determined, the end position of each file block follows from the contiguity of the file blocks. If the start position of the file block corresponding to the i-th process is set to i*size/p, then the end position of the file block corresponding to the i-th process is (i+1)*size/p, and the end position of the file block corresponding to the (i-1)-th process is i*size/p.
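As a non-limiting illustration, the nominal positions of step 202 can be sketched as follows. The sketch uses plain Python rather than the MPI environment of the embodiment; the function name nominal_ranges is hypothetical, and the use of integer division (so that the last block ends exactly at size) is an assumption, since the description only gives i*size/p.

```python
def nominal_ranges(size: int, p: int) -> list[tuple[int, int]]:
    """Return the nominal (start, end) byte range for each of p processes.

    Block i spans [i*size/p, (i+1)*size/p). Integer division is an
    assumption of this sketch; it makes the last block end exactly at size.
    """
    if p <= 0:
        raise ValueError("p must be a positive number of processes")
    return [(i * size // p, (i + 1) * size // p) for i in range(p)]


if __name__ == "__main__":
    # Print the nominal byte range assigned to each process rank.
    for rank, (start, end) in enumerate(nominal_ranges(100, 4)):
        print(rank, start, end)
```

Because the end of block i equals the start of block i+1, the blocks are contiguous and cover the whole file without overlap.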

203: Determine the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file. Specifically, step 203 can include: searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment; setting the start address of that file block to the starting point of that DNA sequencing fragment; and setting the end address of the preceding file block to the same starting point. The starting point of the first DNA sequencing fragment after the start position can be found using the file block dynamic adjustment algorithm FAA.

The FAA algorithm is as follows. First, the input is a file in FASTA or FASTQ format whose size is size. Next, the size of the file block assigned to each thread is determined from the number of allocated threads proc; this size is size/p. For thread i, the input start position start is i*size/p and the end position end is (i+1)*size/p. While the character in readBuf is not ">", the index i is incremented until the first ">" is found. The start of the current file block then becomes start = start + sendAdujstDelta, and the end of the preceding file block becomes end = end + sendAdujstDelta. Proceeding in this way, the start position of the file block corresponding to each thread is located exactly.
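As a non-limiting illustration, the boundary adjustment can be sketched in plain Python as follows. The sketch assumes a FASTA file held in memory as bytes and assumes that a record header is a ">" at the start of a line; the names find_record_start and adjusted_ranges are hypothetical, and the readBuf and sendAdujstDelta bookkeeping of the description is not reproduced. A FASTQ file would need a different header test, because "@" can also occur in quality strings.

```python
def find_record_start(data: bytes, pos: int, marker: bytes = b">") -> int:
    """Return the offset of the first record header at or after pos.

    A header is taken to be `marker` at the start of a line (an assumption
    of this sketch). Returns len(data) if no header follows pos.
    """
    if pos <= 0:
        return 0  # the file itself begins with a record header
    # Search from pos - 1 so a header located exactly at pos is found.
    idx = data.find(b"\n" + marker, pos - 1)
    return len(data) if idx == -1 else idx + 1


def adjusted_ranges(data: bytes, p: int) -> list[tuple[int, int]]:
    """Move each nominal block start forward to the next record header.

    The adjusted start of block i is also the end of block i - 1, so the
    blocks stay contiguous and no record is split between two blocks.
    """
    size = len(data)
    starts = [find_record_start(data, i * size // p) for i in range(p)]
    ends = starts[1:] + [size]
    return list(zip(starts, ends))


if __name__ == "__main__":
    fasta = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
    for start, end in adjusted_ranges(fasta, 2):
        print(start, end, fasta[start:end])
```

When there are more processes than records after a given position, the sketch yields empty trailing blocks (start equal to end) rather than splitting a record.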

204: Split the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process. Specifically, in an MPI environment, the MPI_File_set_view() function can be called and combined with the FAA algorithm to partition the DNA data file into blocks.

205: Each process reads its corresponding file block in parallel based on the start address and end address of each file block. Specifically, each process can read its corresponding file block in parallel in the form of data blocks, where the data block size is a preset, adjustable size. Because the data block size is adjustable, tuning it can improve the I/O performance of the parallel file system.
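As a non-limiting illustration, reading one file block in data blocks of adjustable size can be sketched as follows. The function name iter_block and the default of 1 MiB are assumptions of this sketch, not values taken from the embodiment, and an ordinary Python file object stands in for the MPI file handle.

```python
import io
from typing import BinaryIO, Iterator


def iter_block(f: BinaryIO, start: int, end: int,
               chunk_size: int = 1 << 20) -> Iterator[bytes]:
    """Yield the bytes in [start, end) of f in pieces of at most chunk_size.

    chunk_size plays the role of the preset, adjustable data block size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    f.seek(start)
    remaining = end - start
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:  # reached end of file earlier than expected
            break
        remaining -= len(chunk)
        yield chunk


if __name__ == "__main__":
    source = io.BytesIO(b"0123456789")
    for piece in iter_block(source, 2, 9, chunk_size=3):
        print(piece)
```

Yielding the pieces one at a time means a process never has to hold more than one data block of its file block in memory at once.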

The most important requirement for MPI parallel file reading is knowing exactly where in the file the data to be processed is stored; only then can all processes read different parts of the file at the same time and thereby parallelize file input. The embodiment of the present invention uses the multi-view parallel file read/write mode of MPI and, taking into account the characteristics of gene data files, calls the MPI_File_set_view() function to partition the views and uses the FAA algorithm to determine the start and end positions of the file block each process is to handle. This ultimately achieves highly concurrent reading of file blocks by multiple processes.
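As a non-limiting illustration, concurrent reading of record-aligned byte ranges can be sketched as follows. This sketch swaps in Python threads and ordinary file handles for the MPI processes and MPI file views of the embodiment; the names read_range, parallel_read, and demo are hypothetical, and the byte ranges in demo are chosen by hand for a three-record FASTA file.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_range(path: str, start: int, end: int) -> bytes:
    """Open the file independently and read the bytes in [start, end)."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def parallel_read(path: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    """Read every range concurrently; result i corresponds to ranges[i]."""
    with ThreadPoolExecutor(max_workers=max(1, len(ranges))) as pool:
        futures = [pool.submit(read_range, path, s, e) for s, e in ranges]
        return [future.result() for future in futures]


def demo() -> list[bytes]:
    """Write a tiny FASTA file and read two record-aligned blocks from it."""
    data = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return parallel_read(path, [(0, 18), (18, len(data))])
    finally:
        os.remove(path)


if __name__ == "__main__":
    for rank, block in enumerate(demo()):
        print(rank, block)
```

Each worker opens its own handle and seeks to its own start address, so no worker depends on the file position of another.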

In addition, because the embodiment of the present invention positions both the start address and the end address of every file block at the starting point of a DNA sequencing fragment, the DNA sequences are not split crudely and no single DNA sequencing fragment is broken across blocks. This ensures that every DNA sequencing fragment is assigned to exactly one file block.
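As a non-limiting illustration, the difference between a crude split and a record-aligned split can be checked as follows. The function name is_record_aligned is hypothetical, and the check is simplified to FASTA input in which every record begins with ">".

```python
def is_record_aligned(data: bytes, ranges: list[tuple[int, int]],
                      marker: bytes = b">") -> bool:
    """Return True if every non-empty block begins at a record header."""
    return all(
        data.startswith(marker, start)
        for start, end in ranges
        if end > start
    )


if __name__ == "__main__":
    fasta = b">r1\nACGT\n>r2\nGGCC\n>r3\nTTAA\n"
    crude = [(0, 13), (13, 27)]    # cuts record r2 after its header line
    aligned = [(0, 18), (18, 27)]  # both blocks begin at a header
    print(is_record_aligned(fasta, crude))
    print(is_record_aligned(fasta, aligned))
```

In the crude split the second block begins in the middle of record r2, so the header and the sequence of that record would land in different blocks; in the aligned split every record lies wholly inside one block.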

The present invention also provides a computer-readable storage medium for storing a computer program that runs on a processor, where the computer program, when executed, performs the reading method of either of the embodiments of Fig. 1 and Fig. 2. The computer-readable storage medium includes at least one of a USB flash drive, an optical disc, a terminal, and a server, and is not limited herein.

In a specific embodiment, the computer program is used for:

Splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process;

Each process reading its corresponding file block in parallel.

Wherein, splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:

Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;

Each process reading its corresponding file block in parallel includes:

Wherein, determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file includes:

Setting the start address of any file block to the starting point;

Wherein, searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment includes:

Each process reading its corresponding file block in parallel in the form of data blocks, the data block size being a preset, adjustable size.

The foregoing describes specific content with reference to one or more embodiments, and the specific implementation of the present invention is not to be regarded as limited to these descriptions. Anything similar or identical to the method, structure, and so on of the present invention, or any technical deduction or substitution made on the premise of the inventive concept, shall be deemed to fall within the protection scope of the present invention.

Claims

1. A method for reading DNA data files, characterized by including:

Each process reading its corresponding file block in parallel.

2. The reading method as claimed in claim 1, characterized in that splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:

Determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file;

Splitting the DNA data file based on the start address and end address of each file block to obtain the file block corresponding to each process;

Each process reading its corresponding file block in parallel includes:

3. The reading method as claimed in claim 2, characterized in that determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file includes:

Setting the start address of any file block to the starting point;

4. The reading method as claimed in claim 3, characterized in that searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment includes:

Searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment based on the file block dynamic adjustment algorithm FAA.

5. The reading method as claimed in claim 1, characterized in that each process reading its corresponding file block in parallel includes:

Each process reading its corresponding file block in parallel in the form of data blocks, the size of a data block being a preset, adjustable size.

6. A computer-readable storage medium, characterized in that it is used to store a computer program that can be run on a processor; the computer program is used for:

Each process reading its corresponding file block in parallel.

7. The computer-readable storage medium as claimed in claim 6, characterized in that splitting the DNA data file based on the size of each file block to obtain the file block corresponding to each process includes:

Each process reading its corresponding file block in parallel includes:

8. The computer-readable storage medium as claimed in claim 7, characterized in that determining the start address and end address of each file block based on the start position of each file block and the starting points of the DNA sequencing fragments in the DNA data file includes:

Setting the start address of any file block to the starting point;

9. The computer-readable storage medium as claimed in claim 8, characterized in that searching, from the start position of any file block, for the starting point of the first DNA sequencing fragment includes:

10. The computer-readable storage medium as claimed in claim 6, characterized in that each process reading its corresponding file block in parallel includes:

Each process reading its corresponding file block in parallel in the form of data blocks, the size of a data block being a preset, adjustable size.