CN103049680A

CN103049680A - gene sequencing data reading method and system

Info

Publication number: CN103049680A
Application number: CN2012105920612A
Authority: CN
Inventors: 孟金涛; 魏延杰; 成杰峰; 冯圣中
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Senris Biotechnology Shenzhen Co ltd
Priority date: 2012-12-29
Filing date: 2012-12-29
Publication date: 2013-04-17
Anticipated expiration: 2032-12-29
Also published as: CN103049680B

Abstract

The invention relates to the technical field of bioinformatics and provides a gene sequencing data reading method comprising the following steps: parsing the user parameters to determine the number of tasks; dividing the sequencing data into file blocks of equal size according to the number of tasks; adjusting the start address and the end address of each file block; and having each task read its adjusted file block. The invention also provides a gene sequencing data reading system and a gene sequencing data analysis device equipped with the system. The invention achieves parallel reading of gene sequencing data, keeps the file blocks uniform in size, and avoids splitting one sequence across two different file blocks.

Description

Gene sequencing data reading method and system

Technical field

The present invention relates to the technical field of bioinformatics, and in particular to a gene sequencing data reading method and system.

Background art

The sequencing of biological macromolecules, especially nucleic acids and proteins, has run through the entire development of bioinformatics. An organism's genome contains the genetic information for all of its cellular structures and life activities, and fundamentally directs the development of the organism. Obtaining an organism's genetic information accurately and in real time can effectively guide life science research. Sequencing technology can quickly obtain the genetic information carried on DNA and comprehensively reveal the diversity and complexity of genomes, and it is playing an increasingly important role in bioinformatics research.

In recent years, a new generation of sequencing technology has brought dramatic change to bioinformatics, with remarkable progress in sequencing principles, operating details, and technical extensions. Compared with traditional Sanger sequencing, next-generation sequencing platforms avoid the cloning process and directly use adapters to carry out parallel PCR and sequencing reactions, so the data volume is greatly increased and more DNA can be sequenced in less time. For example, drawing the first human genome map with Sanger sequencing took a total of 13 years and hundreds of sequencers, whereas next-generation sequencing can now complete this work within a few months. In addition, the cost of next-generation sequencing is greatly reduced.

Genome source sequences range in length from 100,000 bases (e.g. swinepox virus, Escherichia coli) to 1 billion bases (e.g. soybean, cucumber, and panda genomes), and metagenomic data from complex environments (e.g. seawater, the human large intestine) can even exceed ten billion bases. Because the sequencing coverage for these samples must reach 30 to 100 times, the number of gene sequence fragments produced increases sharply. Processing massive sequence data consumes a huge amount of memory, so the data are commonly divided and processed in parallel. In the prior art, a suitable sequence division strategy must be chosen before the gene sequencing data are divided, so as to avoid splitting one sequence across two different file blocks.

Summary of the invention

The present invention aims to solve the above problems in the prior art and proposes a gene sequencing data reading method, comprising the following steps:

Step a: parse the user parameters and determine the number of tasks;

Step b: divide the sequencing data into file blocks of equal size according to the number of tasks;

Step c: adjust the start address and the end address of each file block;

Step d: each task reads its adjusted file block.

Preferably, the method further comprises the following step before step a: initialize the tasks, establish connections among all nodes, and collect node information and task information.

Preferably, step b is specifically: divide the sequencing data into file blocks of equal size according to the number of tasks, obtaining the start position and the end position of each file block. Step c is specifically: adjust the start position of each file block obtained in step b to the starting point of the first sequence after that start position; adjust the end position of each file block obtained in step b to the starting point of the first sequence after that end position, or to the end-of-file marker after that end position.

Preferably, in step d each task performs multi-viewport parallel file reading on its adjusted file block.

Preferably, the task is a process, or a thread in a program.

Preferably, the process is an MPI process.

Preferably, the user parameters comprise hardware performance, the total size of the gene sequencing data, and the homologous gene reference sequence length.

Preferably, the format of the gene sequencing data is the FASTA format or the FASTQ format.

The present invention also provides a gene sequencing data reading system, comprising:

a parameter parsing unit, configured to parse the user parameters and determine the number of tasks;

a dividing unit, configured to divide the sequencing data into file blocks of equal size according to the number of tasks;

an adjustment unit, configured to adjust the start address and the end address of each file block;

a result reading unit, configured to have each task read its adjusted file block.

Preferably, the system further comprises an initialization unit, configured to initialize the tasks, establish connections among all nodes, and collect node information and task information.

The present invention further provides a gene sequencing data analysis device, which is provided with the above gene sequencing data reading system.

The beneficial effect of the present invention is that it achieves parallel reading of gene sequencing data, keeps the file blocks uniform in size, and avoids splitting one sequence across two different file blocks.

Description of drawings

Fig. 1 is a flowchart of the gene sequencing data reading method provided by Embodiment 1 of the present invention.

Fig. 2 is an example of the FASTA data format.

Fig. 3 is an example of the FASTQ data format.

Fig. 4 is a flowchart of the gene sequencing data reading method provided by Embodiment 2 of the present invention.

Fig. 5 is a schematic diagram of multi-viewport parallel reading in Embodiment 2 of the present invention.

Fig. 6 shows the distribution of read counts across the file blocks in Application Example 1 of the present invention.

Fig. 7 shows how the read time varies with the number of tasks in Application Example 2 of the present invention.

Fig. 8 is a structural block diagram of the gene sequencing data reading system provided by Embodiment 4 of the present invention.

Fig. 9 is a structural block diagram of the gene sequencing data reading system provided by Embodiment 5 of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solution of the present application, the technical solution in the embodiments is described clearly and completely below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.

The embodiments of the present invention first divide the gene sequencing data evenly into file blocks of equal size according to the number of tasks, then adjust the start address and the end address of each file block, and finally have parallel tasks read the different file blocks of the gene sequencing data. This achieves parallel reading of gene sequencing data, keeps the file blocks uniform in size, and avoids splitting one sequence across two different file blocks.

Embodiment 1

Embodiment 1 of the present invention provides a gene sequencing data reading method. As shown in Fig. 1, the method comprises the following steps:

Step S101: parse the user parameters and determine the number of tasks. The user parameters in this embodiment include hardware performance, the total size of the gene sequencing data, the homologous gene reference sequence length, and so on; a reasonable number of tasks is chosen according to these parameters. The tasks in this embodiment are MPI processes.

Step S102: divide the sequencing data into file blocks of equal size according to the number of tasks. Specifically, in this embodiment the sequencing data are divided into file blocks of equal size according to the number of tasks, giving the start position and the end position of each file block. If the number of tasks is n and the total size of the gene sequencing data is S, then the start position of the i-th file block (i = 0, 1, 2, ..., n-1) is i*S/n and its end position is (i+1)*S/n.
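The nominal boundaries of step S102 can be illustrated with a minimal Python sketch. The function name nominal_blocks and the use of integer (floor) division are assumptions made for illustration; the text only gives the formulas i*S/n and (i+1)*S/n.

```python
def nominal_blocks(total_size: int, n_tasks: int) -> list[tuple[int, int]]:
    """Return the (start, end) byte offsets of n_tasks equal-size blocks."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    # Block i nominally spans [i*S/n, (i+1)*S/n); floor division keeps the
    # offsets integral and makes the end of block i the start of block i+1.
    return [
        (i * total_size // n_tasks, (i + 1) * total_size // n_tasks)
        for i in range(n_tasks)
    ]


print(nominal_blocks(1000, 4))  # [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

Because each end offset is computed with the same expression as the next start offset, the blocks cover the file without gaps or overlap even when S is not a multiple of n.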

Step S103: adjust the start address and the end address of each file block. Specifically, in this embodiment the start position of each file block obtained in step S102 is adjusted to the starting point of the first sequence after that start position, and the end position of each file block obtained in step S102 is adjusted to the starting point of the first sequence after that end position, or to the end-of-file marker after that end position. That is, the starting point of the first sequence after the start position i*S/n is start(i), and the starting point of the first sequence after the end position (i+1)*S/n, or the end-of-file marker after it, is end(i).
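The adjustment of step S103 can be sketched as follows for FASTA data held in memory. This is an illustration under stated assumptions, not the patented implementation: the helper names next_record_start and adjusted_blocks are hypothetical, a record is assumed to start wherever ">" directly follows a newline (or sits at offset 0), and a nominal boundary that already falls exactly on a record start is left where it is. FASTQ needs extra care, because "@" can also appear as the first character of a quality line.

```python
def next_record_start(data: bytes, pos: int, marker: bytes = b">") -> int:
    """Offset of the first record header at or after pos, else end of data."""
    if pos >= len(data):
        return len(data)
    if pos == 0 and data.startswith(marker):
        return 0
    # A header is a marker that directly follows a newline. Searching from
    # pos - 1 also finds a header that sits exactly at pos.
    hit = data.find(b"\n" + marker, max(pos - 1, 0))
    return len(data) if hit == -1 else hit + 1


def adjusted_blocks(data: bytes, n_tasks: int) -> list[tuple[int, int]]:
    """Move each nominal cut i*S/n forward to the next record start."""
    size = len(data)
    cuts = [next_record_start(data, i * size // n_tasks) for i in range(n_tasks + 1)]
    # end(i) and start(i+1) come from the same cut, so no record is split.
    return list(zip(cuts[:-1], cuts[1:]))


sample = b">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAC\n"
print(adjusted_blocks(sample, 2))  # [(0, 27), (27, 36)]
```

In this sample the nominal cut at offset 18 falls inside seq2, so it is moved forward to offset 27, where seq3 begins. When a record is longer than a nominal block, two neighbouring cuts can coincide and the block between them is empty.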

Step S104: each task reads its adjusted file block. In this embodiment the n tasks correspond one-to-one to the n file blocks; each task knows the exact location of its file block and reads it sequentially from the start position start(i) to the end position end(i).
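A minimal sketch of the sequential read in step S104 is given below, assuming FASTA input and assuming the adjusted offsets start(i) and end(i) are already known. The function read_block, the sample data, and the hard-coded ranges are all hypothetical and serve only to show that each task touches just its own byte range.

```python
import os
import tempfile


def read_block(path: str, start: int, end: int) -> list[tuple[str, str]]:
    """Read bytes [start, end) sequentially and parse them as FASTA records."""
    with open(path, "rb") as handle:
        handle.seek(start)
        chunk = handle.read(end - start)
    records: list[tuple[str, str]] = []
    for line in chunk.splitlines():
        if line.startswith(b">"):
            # Header line: open a new record.
            records.append((line[1:].decode(), ""))
        elif records:
            # Sequence line: append to the record opened last.
            name, seq = records[-1]
            records[-1] = (name, seq + line.decode())
    return records


sample = b">seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAC\n"
ranges = [(0, 27), (27, 36)]  # adjusted (start, end) offsets assumed for this sample

with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(sample)
path = tmp.name
try:
    # Task i reads only its own block; here the tasks run one after another.
    results = [read_block(path, start, end) for start, end in ranges]
finally:
    os.unlink(path)

print(results)
```

Since every block begins at a record header and ends just before the next one, each task can parse its chunk without consulting its neighbours.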

The gene sequencing data format in this embodiment is the FASTA format (sequence file) or the FASTQ format (quality information file). A FASTA file, as shown in Fig. 2, stores the descriptive information and the sequence content of each sequence. For each sequence, the first line begins with ">" as the identifier, serves as the label of the sequence, and records the chromosome position in the species from which the sequence originates together with other biological information; the lines that follow record the sequence itself. A FASTQ file, as shown in Fig. 3, is stored in units of sequencing reads, with each read occupying four lines. The first and third lines consist of an identifier character and the read name (ID): the first line begins with "@" and the third line begins with "+". The second line is the base sequence, and the fourth line holds the corresponding sequencing quality scores.
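The four-line FASTQ layout described above can be made concrete with a short parsing sketch. The names FastqRead and parse_fastq are hypothetical, and the sketch assumes strictly four lines per read with no wrapped sequence or quality lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRead(NamedTuple):
    name: str
    bases: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRead]:
    """Yield reads from a FASTQ stream, four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError(f"length mismatch in read {header[1:]!r}")
        yield FastqRead(header[1:], bases, quality)


# The first quality line deliberately starts with "@": a line beginning with
# "@" is therefore not always a read header.
sample = "@read1\nACGT\n+\n@IIH\n@read2\nGG\n+read2\nI#\n"
reads = list(parse_fastq(io.StringIO(sample)))
for read in reads:
    print(read.name, read.bases, read.quality)
```

The sample shows why block boundaries in FASTQ data cannot be found by searching for "@" alone; a boundary search would also need to check the neighbouring lines, for example that the line two positions later begins with "+".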

Those of ordinary skill in the art will appreciate that all or some of the steps of the method in this embodiment can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.

The method of this embodiment achieves parallel reading of gene sequencing data, keeps the file blocks uniform in size, and avoids splitting one sequence across two different file blocks.

Embodiment 2

Embodiment 2 of the present invention provides a gene sequencing data reading method. As shown in Fig. 4, the method comprises the following steps:

Step S201: initialize the tasks, establish connections among all nodes, and collect node information and process information. In this embodiment, task initialization obtains information about all compute nodes taking part in the computation, and collects the identification numbers of the tasks in group communication and the number of all potential tasks that can take part in group communication. The tasks in this embodiment are MPI processes.
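The initialization of step S201 might look roughly like the sketch below when the tasks are MPI processes driven from Python. Using the mpi4py package is an assumption (the text does not name a language binding), and the single-process fallback exists only so that the sketch still runs where mpi4py is not installed.

```python
import platform

try:
    from mpi4py import MPI  # assumption: the mpi4py binding is available

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()            # identification number of this task
    size = comm.Get_size()            # number of tasks in the communicator
    node = MPI.Get_processor_name()   # name of the node this task runs on
    all_nodes = comm.allgather(node)  # every task learns every node name
except ImportError:
    # Fallback for illustration only: behave like a single task on one node.
    rank, size = 0, 1
    node = platform.node()
    all_nodes = [node]

print(f"task {rank} of {size} on {node}; nodes in use: {sorted(set(all_nodes))}")
```

Started under an MPI launcher with several processes, each process would report its own rank, while the gathered node list is identical on all of them.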

Step S202: parse the user parameters and determine the number of tasks.

Step S203: divide the sequencing data into file blocks of equal size according to the number of tasks.

Step S204: adjust the start address and the end address of each file block.

The above steps are described in detail in Embodiment 1 and are not repeated here.

Step S205: each task performs multi-viewport parallel file reading on its adjusted file block. In this embodiment, the data are classified by file type according to actual needs, and for each file type it is then specified whether data of that type can be accessed through the viewport; data that cannot be accessed through the viewport are invisible to it. For example, the file types may be divided into basic types and other types derived from the basic types, with the basic types accessible through the viewport and the other types invisible to it; as shown in Fig. 5, the data seen through the viewport are of the basic type. The data may of course be classified in other ways: for example, file types of interest in gene sequencing data analysis may be defined as accessible through the viewport, and file types of low relevance to the analysis may be defined as invisible to it.

Application Example 1

The gene sequencing data reading method of Embodiment 2 was used to read yeast sequencing data produced by a Solexa sequencer. First, based on user parameters such as hardware performance, the total size of the yeast gene sequencing data, and the homologous gene reference sequence length, the number of tasks was set to 16. The sequencing data were then divided into file blocks of equal size according to the number of tasks, and the start address and the end address of each file block were adjusted. The file block information is shown in Table 1, where each position represents one character in the data file and the unit is bit.

Table 1. File block information for the yeast sequencing data

The number of reads in each file block was counted. As shown in Fig. 6, the read counts differ only slightly between file blocks, so the reads are evenly distributed across the file blocks.

Application Example 2

The gene sequencing data reading method of Embodiment 2 was used to read yeast sequencing data produced by a Solexa sequencer. The number of tasks was set to 1, 2, 3, 4, 5, ..., 16 in turn, and the time to read the gene sequencing data was measured for each setting. As shown in Fig. 7, with 1 to 10 processes the read time decreases as the number of processes increases; beyond 10 processes the read time levels off, because the I/O throughput of the storage system saturates once the number of tasks reaches 10.

Embodiment 3

Embodiment 3 of the present invention provides a gene sequencing data reading method. In this embodiment, a single high-performance mainframe computer reads the gene sequencing data using different threads within a program. The method comprises the following steps:

Step S301: parse the user parameters and determine the number of threads of the program.

Step S302: divide the sequencing data into file blocks of equal size according to the number of threads of the program.

Step S303: adjust the start address and the end address of each file block.

Step S304: each thread reads its adjusted file block.

Steps S302 to S304 are described in detail in Embodiment 1 and are not repeated here.
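The thread-based variant of Embodiment 3 can be sketched with Python's standard thread pool. The helper names adjusted_cuts and count_records, the generated FASTA sample, and the choice of four threads are assumptions for illustration; the sketch also assumes the file starts with a ">" header.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def adjusted_cuts(mm: mmap.mmap, n_threads: int) -> list[int]:
    """Nominal cuts i*S/n, each moved forward to the next FASTA header."""
    size = len(mm)
    cuts = []
    for i in range(n_threads + 1):
        pos = i * size // n_threads
        if pos == 0 or pos >= size:
            cuts.append(min(pos, size))  # file start and file end stay fixed
            continue
        hit = mm.find(b"\n>", pos - 1)   # header = ">" right after a newline
        cuts.append(size if hit == -1 else hit + 1)
    return cuts


def count_records(path: str, start: int, end: int) -> int:
    """Each thread opens its own handle and reads only bytes [start, end)."""
    with open(path, "rb") as handle:
        handle.seek(start)
        chunk = handle.read(end - start)
    return sum(1 for line in chunk.splitlines() if line.startswith(b">"))


n_threads = 4
sample = b"".join(b">read%d\nACGTACGT\n" % i for i in range(10))
with tempfile.NamedTemporaryFile(delete=False, suffix=".fa") as tmp:
    tmp.write(sample)
path = tmp.name
try:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        cuts = adjusted_cuts(mm, n_threads)
    ranges = list(zip(cuts[:-1], cuts[1:]))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        counts = list(pool.map(lambda r: count_records(path, *r), ranges))
finally:
    os.unlink(path)

print(counts, sum(counts))
```

Giving each thread its own file handle avoids sharing a seek position between threads, and the per-block counts add up to the total number of records because no record straddles a cut.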

Embodiment 4

Embodiment 4 of the present invention provides a gene sequencing data reading system. As shown in Fig. 8, for ease of description only the parts relevant to this embodiment are shown.

Referring to Fig. 8, the gene sequencing data reading system 1 comprises a parameter parsing unit 11, a dividing unit 12, an adjustment unit 13, and a result reading unit 14.

During reading of the gene sequencing data, the parameter parsing unit 11 parses the user parameters and determines the number of tasks. The dividing unit 12 divides the sequencing data into file blocks of equal size according to the number of tasks determined by the parameter parsing unit 11. The adjustment unit 13 adjusts the start address and the end address of each file block produced by the dividing unit 12. The result reading unit 14 reads the file blocks adjusted by the adjustment unit 13.

By dividing the gene sequencing data into file blocks of equal size, the dividing unit 12 ensures that the file blocks contain comparable numbers of sequence reads, so that the reads are evenly distributed across the file blocks. The adjustment unit 13 adjusts the start address and the end address of each file block produced by the dividing unit 12, ensuring that no sequence is split across two different file blocks. The result reading unit 14 maps the n tasks one-to-one to the n file blocks; each task knows the exact location of its file block and reads it sequentially from the adjusted start position to the adjusted end position. For example, multi-viewport parallel file reading may be performed on the adjusted file blocks.

Embodiment 5

Embodiment 5 of the present invention provides a gene sequencing data reading system. As shown in Fig. 9, for ease of description only the parts relevant to this embodiment are shown.

Referring to Fig. 9, the gene sequencing data reading system 1 comprises an initialization unit 10, a parameter parsing unit 11, a dividing unit 12, an adjustment unit 13, and a result reading unit 14.

The initialization unit 10 initializes the MPI program, establishes connections among all nodes, and collects node information and process information. In this embodiment, the initialization unit 10 performs MPI program initialization, obtains information about all compute nodes taking part in the computation, and collects the identification numbers of the processes in group communication and the number of all potential processes that can take part in group communication.

The parameter parsing unit 11, the dividing unit 12, the adjustment unit 13, and the result reading unit 14 are described in detail in Embodiment 4 and are not repeated here.

Embodiment 6

Embodiment 6 of the present invention provides a gene sequencing data analysis device, which is provided with the gene sequencing data reading system of Embodiment 4 or Embodiment 5. Its working principle is as described above and is not repeated here.

The gene sequencing data analysis device of this embodiment achieves parallel analysis of gene sequencing data, keeps the file blocks uniform in size, and avoids splitting one sequence across two different file blocks.

The embodiments of the present invention described above do not limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

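1. A gene sequencing data reading method, characterized in that it comprises the following steps:

Step a: parse the user parameters and determine the number of tasks;

Step b: divide the sequencing data into file blocks of equal size according to the number of tasks;

Step c: adjust the start address and the end address of each file block;

Step d: each task reads its adjusted file block.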

2. The gene sequencing data reading method according to claim 1, characterized in that the method further comprises the following step before step a: initialize the tasks, establish connections among all nodes, and collect node information and task information.

3. The gene sequencing data reading method according to claim 1, characterized in that step b is specifically: divide the sequencing data into file blocks of equal size according to the number of tasks, obtaining the start position and the end position of each file block; and step c is specifically: adjust the start position of each file block obtained in step b to the starting point of the first sequence after that start position; adjust the end position of each file block obtained in step b to the starting point of the first sequence after that end position, or to the end-of-file marker after that end position.

4. The gene sequencing data reading method according to claim 1, characterized in that in step d each task performs multi-viewport parallel file reading on its adjusted file block.

5. The gene sequencing data reading method according to any one of claims 1 to 4, characterized in that the task is a process, or a thread in a program.

6. The gene sequencing data reading method according to claim 5, characterized in that the process is an MPI process.

7. The gene sequencing data reading method according to claim 1, characterized in that the user parameters comprise hardware performance, the total size of the gene sequencing data, and the homologous gene reference sequence length.

8. The gene sequencing data reading method according to claim 1, characterized in that the format of the gene sequencing data is the FASTA format or the FASTQ format.

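9. A gene sequencing data reading system, characterized in that it comprises:

a parameter parsing unit, configured to parse the user parameters and determine the number of tasks;

a dividing unit, configured to divide the sequencing data into file blocks of equal size according to the number of tasks;

an adjustment unit, configured to adjust the start address and the end address of each file block;

a result reading unit, configured to have each task read its adjusted file block.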

10. The gene sequencing data reading system according to claim 9, characterized in that the system further comprises an initialization unit, configured to initialize the tasks, establish connections among all nodes, and collect node information and task information.

11. A gene sequencing data analysis device, characterized in that the gene sequencing data analysis device is provided with the gene sequencing data reading system according to claim 9 or 10.