CN106295250B

CN106295250B - Method and device for rapid alignment analysis of second-generation sequencing short sequences

Info

Publication number: CN106295250B
Application number: CN201610609337.1A
Authority: CN
Inventors: 郑洪坤; 郭强; 许德德; 马威锋; 孙乔慧
Original assignee: Beijing Hundred Medical Laboratory Ltd
Current assignee: Beijing Pukang Ruiren Medical Laboratory Co., Ltd.
Priority date: 2016-07-28
Filing date: 2016-07-28
Publication date: 2019-03-29
Anticipated expiration: 2036-07-28
Also published as: CN106295250A

Abstract

The present invention discloses a method and device for rapid alignment analysis of second-generation sequencing short sequences, which address the low alignment efficiency and high memory usage of sequencing data analysis. The method includes: obtaining the short DNA sequences produced by sequencing, and encoding each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively; aligning the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, where the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the array index offset at which each unit structure is stored is the corresponding index1 (that is, the index value of the structure array), and K is the fragment sequence length; and, if the alignment succeeds, obtaining the value of the K-mer fragment to which the short DNA sequence aligned, and determining from it the chromosome number of the short DNA sequence and its site on that chromosome.

Description

Method and device for rapid alignment analysis of second-generation sequencing short sequences

Technical field

The invention belongs to the field of bioinformatics engineering and relates to bioinformatics technology and applied computer technology; specifically, it relates to a method for rapid alignment analysis of short sequences from second-generation DNA sequencing.

Background technique

DNA sequencing plays the most fundamental and most widespread role in deciphering the genetic code of living species. Soon after the DNA double helix was discovered, techniques for sequencing DNA were reported, but the procedures were overly complex. Shortly afterwards, in 1977, Sanger invented the chain-termination sequencing method, a milestone in the field. As bioinformatics has advanced, Sanger sequencing can no longer meet research needs, and second-generation sequencing, with lower cost, higher throughput and greater speed, has emerged. Its core idea is sequencing by synthesis. DNA sequence alignment compares the short DNA sequences obtained by sequencing (reads) against a reference genome; it is commonly used to analyze similarity and homology between species, and it also supports computational mining and analysis of genetic information.

At present, alignment techniques for second-generation sequencing have developed in two main directions. The first designs a hash mapping and builds a hash table store; it offers fast alignment and high accuracy, but its memory and CPU usage are relatively high. Representative software includes SOAP and MAQ. The second builds a suffix-tree query data structure based on the BWT transform: the genome sequence is cyclically shifted, sorted, and then compressed with BWT to build an index. During alignment, search and backtracking are used to locate the sequence. This technique requires the index files to be built in advance and operates step by step, and its alignment speed does not match that of a hash table structure with near-constant average time. It can usually handle routine alignment of all sequencing data, but compared with a hash data structure the tree structure occupies a large amount of memory, and its query speed falls well short of the average constant time of a hash table store.

Summary of the invention

In view of this, the present invention provides a method and device for rapid alignment analysis of second-generation sequencing short sequences, to solve the problems of low alignment efficiency and high memory usage for sequencing data.

In one aspect, an embodiment of the present invention proposes a method for rapid alignment analysis of second-generation sequencing short sequences, comprising:

S1. Obtain the short DNA sequences produced by sequencing, and encode each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;

S2. Align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, where the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

S3. According to the alignment result, if the short DNA sequence aligns, obtain the value of the K-mer fragment to which it aligned, and determine from that value the chromosome number of the short DNA sequence and its site on that chromosome.

Preferably, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.

Preferably, S2 comprises:

For each short DNA sequence, look up the memory block whose head address is the first index of that short DNA sequence; if such a memory block is found, determine whether the number of data storage locations in the memory block is 1;

If it is 1, determine that the short DNA sequence aligns to the K-mer fragment corresponding to the structure stored at the data storage location of that memory block.

Preferably, the method further comprises:

If it is not 1, then starting from the head address, use shifted linear probing to obtain the index2 value of the structure stored at the current address, and determine whether that index2 value is identical to the second index of the short DNA sequence;

If the index2 value is identical to the second index of the short DNA sequence, determine that the short DNA sequence aligns to the K-mer fragment corresponding to the structure stored at the current address.

Preferably, the load factor of the index query library is 0.607.

Preferably, before S2, the method further comprises:

S40. According to a prior data model, sort the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each;

S41. According to the sorting result, in order from first to last, slide a window along the current chromosome to extract consecutive K-mer fragments, and record the position information of each fragment;

S42. Remove K-mer fragments that contain degenerate bases, and remove duplicate K-mer fragments;

S43. For each K-mer fragment obtained, apply the first hash algorithm mapping and the second hash algorithm mapping to obtain the index1 value and the index2 value respectively, compute the value that compresses the position information digest of the K-mer fragment, and generate a structure containing the value and the index2 value;

S44. Build the index query library using the index1 value as the index.
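As an illustration only, the sliding-window extraction of S41 and the degenerate-base filter of S42 might be sketched in C as follows. The names kmer_is_clean, slide_kmers and kmer_cb are hypothetical and do not come from the source; the sketch assumes upper-case bases, treats anything other than A, C, G or T as degenerate, and leaves the removal of duplicate K-mers to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: receives each accepted K-mer and its
 * 0-based start position on the chromosome. */
typedef void (*kmer_cb)(const char *kmer, size_t pos, void *ctx);

/* Returns 1 if the k bases starting at s are all A, C, G or T,
 * 0 if any other character (for example the degenerate base N) occurs. */
int kmer_is_clean(const char *s, size_t k) {
    for (size_t j = 0; j < k; j++) {
        if (s[j] == '\0' || strchr("ACGT", s[j]) == NULL) return 0;
    }
    return 1;
}

/* Slides a window of length k over chrom (length n), one base at a time,
 * and reports every window without degenerate bases. Returns the number
 * of windows reported. Duplicate K-mers are NOT removed here. */
size_t slide_kmers(const char *chrom, size_t n, size_t k,
                   kmer_cb emit, void *ctx) {
    size_t count = 0;
    if (k == 0 || n < k) return 0;
    for (size_t pos = 0; pos + k <= n; pos++) {
        if (!kmer_is_clean(chrom + pos, k)) continue;
        if (emit != NULL) emit(chrom + pos, pos, ctx);
        count++;
    }
    return count;
}
```

With K = 4, the sequence ACGTNACGT yields two accepted windows (positions 0 and 5), because every other window overlaps the N.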

Preferably, when the short DNA sequence is aligned against the reference genome, multithreading is used to align the forward-strand and reverse-strand short sequences in parallel.

In another aspect, an embodiment of the present invention proposes a device for rapid alignment analysis of second-generation sequencing short sequences, comprising:

A mapping unit, configured to obtain the short DNA sequences produced by sequencing, and to encode each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;

An alignment unit, configured to align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, where the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

A computing unit, configured to, according to the alignment result, if the short DNA sequence aligns, obtain the value of the K-mer fragment to which it aligned, and determine from that value the chromosome number of the short DNA sequence and its site on that chromosome.

The invention has the following beneficial effects:

1. A hash algorithm is used, so alignment is fast, approaching constant time O(1), which improves the time efficiency of the project product;

2. Secondary hash mapping saves a large amount of memory, because the reads themselves need not be stored, which improves the resource efficiency of the project product;

3. A prior data model is used to reorder the hash index so that reads with a high occurrence frequency are matched first, which speeds up alignment;

4. Multithreading executes alignment tasks in parallel, which shortens alignment time.

Brief description of the drawings

Fig. 1 is a flow diagram of one embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention;

Fig. 2 is a time performance plot of the XDDHash algorithm used by the present invention for the primary hash mapping;

Fig. 3 is a collision-count performance plot of the XDDHash algorithm used by the present invention for the primary hash mapping;

Fig. 4 is a time performance plot of the MWFHash algorithm used by the present invention for the secondary hash mapping;

Fig. 5 is a collision-count performance plot of the MWFHash algorithm used by the present invention for the secondary hash mapping;

Fig. 6 shows the relationship between the load factor of the hash table constructed by the present invention and the average number of probes;

Fig. 7 shows the relationship between the data volume of the aligned sample and the time taken, after the method of the present invention has been implemented as a software program product;

Fig. 8 is a structural diagram of one embodiment of the second-generation sequencing short-sequence rapid alignment analysis device of the present invention.

Specific embodiment

To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. The described embodiments are evidently some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, this embodiment discloses a method for rapid alignment analysis of second-generation sequencing short sequences, comprising:

S2. Align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, where the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

During a query, the sequence position information is obtained as follows:

Memory index acquisition: the first hash algorithm and the second hash algorithm convert the sequence information in the sample into the first index and the second index, from which the value stored in memory is obtained.

Sequence position acquisition:

Chromosome number of the short DNA sequence: K-mer_number = value % 26 (modulo operation, i.e. the remainder);

Site of the short DNA sequence on that chromosome: K-mer_location = value / 26 (integer division).
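The two decoding formulas can be illustrated with a short C sketch. The type KmerPosition, the function decode_value and the macro CHROM_RADIX are hypothetical names introduced here for illustration. Note that these formulas recover the original pair only if the value was packed as site * 26 + chromosome number with a chromosome number below 26; that packing is an assumption inferred from the formulas themselves.

```c
#include <stdint.h>

#define CHROM_RADIX 26u  /* divisor used by the two decoding formulas */

/* Hypothetical container for the decoded position. */
typedef struct {
    uint32_t chrom_number;  /* K-mer_number   = value % 26 */
    uint32_t chrom_site;    /* K-mer_location = value / 26 */
} KmerPosition;

/* Splits a stored value into chromosome number and site. */
KmerPosition decode_value(uint32_t value) {
    KmerPosition p;
    p.chrom_number = value % CHROM_RADIX;  /* remainder */
    p.chrom_site = value / CHROM_RADIX;    /* integer quotient */
    return p;
}
```

For example, a stored value of 26007 decodes to chromosome number 7 and site 1000, since 26007 = 1000 * 26 + 7.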

In this embodiment of the present invention, a two-level hash is used: the result of the primary hash is used for query positioning, and the result of the secondary hash is used to discriminate collisions. This speeds up alignment, and alignment results can be tallied without directly storing the bulky read sequences, which reduces wasted memory.

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.

The XDDHash compression algorithm for the primary mapping produces the digest used as the index key. Its pseudo-function is as follows:

hash[i+1] = (hash[i] * seed + str[i]) & 0x00000000FFFFFFFF (i = 0, 1, 2, ..., N-2)

where hash[0] is set to 63, the multiplication factor seed is 33, str[i] is the ASCII value of the (i+1)-th base character of the read, and N is the length of the read to be aligned. P & 0x00000000FFFFFFFF denotes taking the low 32 bits of P. The recurrence is iterated until hash[N-1] is obtained, which is the first index.
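A minimal C sketch of this recurrence, read literally, is given below. The function name xdd_hash is hypothetical. Because the stated range is i = 0 to N-2, the last base character of the read is not consumed under this literal reading; whether the original implementation behaves that way is not confirmed by the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Literal reading of the XDDHash recurrence:
 *   hash[0]   = 63
 *   hash[i+1] = (hash[i] * 33 + str[i]) & 0xFFFFFFFF,  i = 0 .. N-2
 * Returns hash[N-1], the first index. */
uint32_t xdd_hash(const char *read, size_t n) {
    uint64_t h = 63;                      /* hash[0] */
    const uint64_t seed = 33;             /* multiplication factor */
    for (size_t i = 0; i + 1 < n; i++) {  /* i = 0 .. N-2 */
        h = (h * seed + (unsigned char)read[i]) & 0x00000000FFFFFFFFULL;
    }
    return (uint32_t)h;                   /* hash[N-1] */
}
```

For the three-base read ACG this gives hash[1] = 63 * 33 + 65 = 2144 and hash[2] = 2144 * 33 + 67 = 70819. The 64-bit intermediate keeps the multiplication from overflowing before the low 32 bits are taken.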

Theory and program testing show that this algorithm performs well: it produces few collisions and its compression mapping is fast. Its performance statistics are shown in Figs. 2 and 3. Fig. 2 is the time performance plot of the XDDHash algorithm used for the primary hash mapping, showing how the time taken varies with the number of reads; Fig. 3 is the collision-count performance plot of the XDDHash algorithm used for the primary hash mapping, showing how the number of collisions varies with the number of reads.

The MWFHash compression algorithm for the secondary mapping is used to separate colliding index values. It applies different shift-factor algorithms to odd and even positions, as follows.

Odd-position index algorithm (when i is odd):

hash[i+1] = (hash[i] * seedodd + i * str[i]) & 0x00000000FFFFFFFF

Even-position index algorithm (when i is even):

hash[i+1] = (hash[i] * seedeven + i * str[i]) & 0x00000000FFFFFFFF

where hash[0] is set to 0, the odd shift-factor seed seedodd is 5, and the even shift-factor seed seedeven is 7. The recurrence is iterated to obtain hash[N-1], which is the second index.
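The odd and even recurrences can be sketched in C as below. The function name mwf_hash is hypothetical, and the loop range i = 0 to N-2 is an assumption carried over from the XDDHash description, since the text only states that the iteration stops at hash[N-1].

```c
#include <stddef.h>
#include <stdint.h>

/* Literal reading of the MWFHash recurrence:
 *   hash[0]   = 0
 *   hash[i+1] = (hash[i] * seed_i + i * str[i]) & 0xFFFFFFFF
 * with seed_i = 5 when i is odd and 7 when i is even.
 * Returns hash[N-1], the second index. */
uint32_t mwf_hash(const char *read, size_t n) {
    uint64_t h = 0;                                   /* hash[0] */
    for (size_t i = 0; i + 1 < n; i++) {              /* assumed i = 0 .. N-2 */
        const uint64_t seed = (i % 2 == 1) ? 5 : 7;   /* seedodd / seedeven */
        h = (h * seed + (uint64_t)i * (unsigned char)read[i])
            & 0x00000000FFFFFFFFULL;
    }
    return (uint32_t)h;                               /* hash[N-1] */
}
```

For the read ACGT the steps are hash[1] = 0, hash[2] = 0 * 5 + 1 * 67 = 67 and hash[3] = 67 * 7 + 2 * 71 = 611. Since hash[0] = 0 and the term i * str[i] vanishes at i = 0, the first base does not affect the result under this reading.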

Theory and program testing show that this algorithm performs well: it produces few collisions and its compression mapping is fast. Its performance statistics are shown in Figs. 4 and 5. Fig. 4 is the time performance plot of the MWFHash algorithm used for the secondary hash mapping, showing how the time taken varies with the number of reads; Fig. 5 is the collision-count performance plot of the MWFHash algorithm used for the secondary hash mapping, showing how the number of collisions varies with the number of reads.

In this embodiment of the present invention, when a hash algorithm encodes a read, the low 32 bits of the encoded value are taken directly, which reduces the time complexity of the algorithm.

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, S2 comprises:

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, the method further comprises:

In this embodiment of the present invention, an improved linear probing method is used to resolve hash collisions, based on the characteristics of the hash values of the sample's short sequences. Specifically, when probing downward from an index, only a self-increment of the index is needed, which reduces the time complexity of the search during alignment.
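A simplified sketch of such a self-incrementing probe is shown below. It is not the patented implementation: the empty-slot marker EMPTY_SLOT, the reduction of index1 modulo the table size and the function name probe_lookup are all assumptions made for illustration, and the sketch always compares index2 instead of first checking whether the memory block holds exactly one entry.

```c
#include <stddef.h>
#include <stdint.h>

#define EMPTY_SLOT 0xFFFFFFFFu  /* hypothetical marker for an unused slot */

typedef struct Node {
    unsigned int value;   /* compressed position digest of the K-mer */
    unsigned int index2;  /* second-hash result of the K-mer */
} HashType;

/* Starts at the slot derived from index1 and advances one slot at a time
 * (self-increment, wrapping at the end) until an entry with a matching
 * index2 or an empty slot is found. Returns the entry, or NULL if the
 * read does not align. */
const HashType *probe_lookup(const HashType *table, size_t table_size,
                             uint32_t index1, uint32_t index2) {
    if (table_size == 0) return NULL;
    size_t slot = index1 % table_size;           /* assumed reduction */
    for (size_t probes = 0; probes < table_size; probes++) {
        const HashType *e = &table[slot];
        if (e->value == EMPTY_SLOT) return NULL; /* end of the probe run */
        if (e->index2 == index2) return e;       /* collision resolved */
        slot++;                                  /* index self-increment */
        if (slot == table_size) slot = 0;
    }
    return NULL;
}
```

The only work per probe is one increment and one 32-bit comparison, which is the property the paragraph above relies on.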

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, the load factor of the index query library is 0.607.

The index query library is essentially a hash table. The load factor K of the hash table is chosen as follows:

Load factor: the ratio of the number of records actually stored in the hash table to the size of the hash table; it is an important measure of hash table performance. The larger its value, the more hash collisions occur and the slower the hash table lookup becomes; if its value is too small, memory is wasted. Based on the XDDHash and MWFHash algorithms, we wrote a program to measure the relationship between the load factor and the average number of probes (more probes means more time), shown in Fig. 6, which reflects the query efficiency of the constructed table. The load factor K chosen in the present invention is 0.607; at this value the average number of lookups is only about 3, close to constant-time lookup.
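As a worked illustration of the definition, the number of slots needed for a given number of records at a load factor of 0.607 can be computed as below. The function name table_size_for is hypothetical, and rounding up to a whole slot is an assumption; the source does not say how the table size was actually chosen.

```c
#include <stdint.h>

/* Smallest table size such that n_records / size <= 0.607.
 * Integer form of ceil(n / 0.607), using 0.607 = 607 / 1000. */
uint64_t table_size_for(uint64_t n_records) {
    return (n_records * 1000u + 606u) / 607u;
}
```

For example, 607 records need 1000 slots, and 608 records need 1002 slots.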

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, before S2, the method further comprises:

In a specific embodiment, K-mer fragments containing degenerate bases represent unknown sequence in the reference genome hg19 and cannot be used when building the index library; removing them also saves memory.

The value is the compression of the position information digest of the K-mer fragment sequence. The compression function is as follows:

value = K-mer_location * 26 + K-mer_number,

where K-mer_number is the chromosome number of the K-mer fragment sequence, and K-mer_location is the site of the K-mer fragment sequence on that chromosome.

The index2 value and the value are stored in a hash structure:

typedef struct Node {
    unsigned int value;   /* compressed position digest of the K-mer */
    unsigned int index2;  /* second hash mapping result of the K-mer */
} HashType;

The hash table constructed in this way can be written simply as the pseudo-expression:

hash[index1] = HashType element;

That is, a value of type HashType is stored at memory position index1.
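A hedged sketch of how such a store could be carried out when several K-mers share the same index1 is given below. The empty-slot marker EMPTY_SLOT, the reduction of index1 modulo the table size and the function name table_insert are assumptions for illustration only; the source gives only the pseudo-expression. Under the prior data model described in the text, fragments from the most frequently aligned chromosomes would be inserted first, so they sit closest to their home slot.

```c
#include <stddef.h>
#include <stdint.h>

#define EMPTY_SLOT 0xFFFFFFFFu  /* hypothetical marker for an unused slot */

typedef struct Node {
    unsigned int value;   /* compressed position digest of the K-mer */
    unsigned int index2;  /* second-hash result of the K-mer */
} HashType;

/* Stores element at the slot derived from index1, or at the next free
 * slot found by stepping forward one slot at a time.
 * Returns 0 on success, -1 if the table is full or has size 0. */
int table_insert(HashType *table, size_t table_size,
                 uint32_t index1, HashType element) {
    if (table_size == 0) return -1;
    size_t slot = index1 % table_size;            /* assumed reduction */
    for (size_t probes = 0; probes < table_size; probes++) {
        if (table[slot].value == EMPTY_SLOT) {
            table[slot] = element;                /* hash[index1] = element */
            return 0;
        }
        slot = (slot + 1 == table_size) ? 0 : slot + 1;
    }
    return -1;
}
```

Every slot must be initialised to EMPTY_SLOT before the first insertion, and a real position digest equal to that marker would need separate handling.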

In this embodiment of the present invention, short sequences with a high occurrence frequency are ranked first according to a prior data model, to improve query speed. Specifically, when handling hash indexes with repeated collisions, the reads that occur most often in actual sequencing samples are searched first, which reduces the number of index-shift probes and avoids unnecessary time overhead.

Optionally, in another embodiment of the second-generation sequencing short-sequence rapid alignment analysis method of the present invention, when the short DNA sequence is aligned against the reference genome, multithreading is used to align the forward-strand and reverse-strand short sequences in parallel.
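The source does not say how the reverse-strand query is formed. A minimal sketch, assuming it is the reverse complement of the read, is given below; the function names complement_base and reverse_complement are hypothetical. Because the forward read and its reverse complement are independent queries against a read-only table, each could be handed to its own thread.

```c
#include <stddef.h>

/* Watson-Crick complement of one upper-case base; any other
 * character is mapped to N. */
char complement_base(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return 'N';
    }
}

/* Writes the reverse complement of read (length n) into out, which
 * must have room for n + 1 characters including the terminator. */
void reverse_complement(const char *read, size_t n, char *out) {
    for (size_t i = 0; i < n; i++) {
        out[i] = complement_base(read[n - 1 - i]);
    }
    out[n] = '\0';
}
```

For example, the read AACG has the reverse complement CGTT.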

In a specific application, the sequencer output directory that stores the DNA sequences produced by sequencing can be monitored in real time, so that alignment result data is produced as soon as new sample data comes off the sequencer.

Embodiment 1: applying the second-generation sequencing short-sequence rapid alignment analysis method to data analysis for noninvasive prenatal testing

The workflow by which the present invention implements noninvasive prenatal testing is, first, to preprocess the human reference genome hg19.fa with Perl and shell commands into the format required by the index query library, which has three columns:

Column 1: 36-mer DNA sequence fragment;

Column 2: chromosome number of the fragment;

Column 3: site of the fragment on that chromosome.

The core method is then implemented in C, using its algorithms and data structures.

Hardware and software environment required for the design and operation of the present invention: Linux system; 3 or more cores; 35 or more of memory (unit not stated); the C library on the Linux platform; the gcc compiler; the gdb debugger.

Input sample file format:

Raw NIPT sequencing data from the Illumina sequencing platform, in fastq format, with every 4 lines describing one read:

Line 1: begins with @ and identifies a sequence, including descriptions of the sequencer name, flag position, single-end or paired-end sequencing, filtering status, primer adapters and other related information;

Line 2: the read sequence;

Line 3: begins with '+', followed by the sequence identifier, description information, or nothing;

Line 4: the quality information of the read, corresponding to the sequence on line 2.
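The four-line record layout can be illustrated with a small in-memory parser. This is only a sketch: the names FastqRecord, take_line and parse_fastq_record and the 512-character line cap FQ_LINE_MAX are hypothetical, and a production reader would stream from the file instead of holding the text in memory.

```c
#include <stddef.h>
#include <string.h>

#define FQ_LINE_MAX 512  /* hypothetical cap on line length */

typedef struct {
    char name[FQ_LINE_MAX];  /* line 1, begins with '@' */
    char seq[FQ_LINE_MAX];   /* line 2, the read sequence */
    char qual[FQ_LINE_MAX];  /* line 4, one quality char per base */
} FastqRecord;

/* Copies one line (without its newline) from *cursor into dst and moves
 * *cursor past the newline. Returns 0 when no input is left. */
static int take_line(const char **cursor, char *dst, size_t cap) {
    const char *p = *cursor;
    if (*p == '\0') return 0;
    size_t len = strcspn(p, "\n");
    size_t copy = (len < cap - 1) ? len : cap - 1;
    memcpy(dst, p, copy);
    dst[copy] = '\0';
    if (copy > 0 && dst[copy - 1] == '\r') dst[copy - 1] = '\0';
    *cursor = p + len + (p[len] == '\n' ? 1 : 0);
    return 1;
}

/* Parses one 4-line record starting at text. Returns a pointer to the
 * start of the next record, or NULL at end of input or on a malformed
 * record (missing '@' or '+', or sequence/quality length mismatch). */
const char *parse_fastq_record(const char *text, FastqRecord *rec) {
    char plus[FQ_LINE_MAX];
    const char *cur = text;
    if (!take_line(&cur, rec->name, sizeof rec->name)) return NULL;
    if (!take_line(&cur, rec->seq, sizeof rec->seq)) return NULL;
    if (!take_line(&cur, plus, sizeof plus)) return NULL;
    if (!take_line(&cur, rec->qual, sizeof rec->qual)) return NULL;
    if (rec->name[0] != '@' || plus[0] != '+') return NULL;
    if (strlen(rec->seq) != strlen(rec->qual)) return NULL;
    return cur;
}
```

Only the second line of each record is needed for hashing; the quality line is checked here solely to confirm that the record is well formed.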

Operation and output results

(1) The Linux nohup command and & are used to run the program in the background, where it keeps scanning the sequencer output directory and produces results whenever a new batch of sample data appears. Run command:

nohup ./hashtb_pth.c &

(2) Input sample file data:

We placed the 7.4M sample file Human_A1600169-R169_good_1.fq and the file Human_A1600171-R171_good_1.fq in the sequencer output directory. Input files:

Human_A1600169-R169_good_11.fq, Human_A1600169-R169_good_12.fq, Human_A1600169-R169_good_13.fq, Human_A1600169-R169_good_14.fq, Human_A1600169-R169_good_15.fq, Human_A1600169-R169_good_1.fq and Human_A1600171-R171_good_1.fq. The first 6 are copies of the same sample data, used to test the stability of the program; the last is used to test the program's ability to handle a different sample.

(3) Alignment results

Files generated in the result directory:

Human_A1600169-R169_good_1, Human_A1600169-R169_good_11, Human_A1600169-R169_good_12, Human_A1600169-R169_good_13, Human_A1600169-R169_good_14, Human_A1600169-R169_good_15 and Human_A1600171-R171_good_1.

The result data for the 6 copies of sample Human_A1600169-R169_good_1.fq are identical, which shows that software implemented according to the method of the present invention is stable and reliable even when processing multiple sample data sets in succession.

A simple calculation from the result data gives:

Unique alignment rate of the sample: 72.772%,

Alignment rate of the sample: 74.443%.

A simple calculation on the last 4 rows of the result data for the different sample Human_A1600171-R171_good_1.fq gives:

Unique alignment rate of the sample: 71.531%,

Alignment rate of the sample: 73.487%.

These results differ from those of Human_A1600169-R169_good_1.fq, which shows that the software can process different sample data in the same batch directory continuously and in real time.

(4) Alignment time

Time taken to process the 7 samples: 77.850 seconds.

The relationship between the number of reads in a sample and the alignment time is shown in Fig. 7, which plots the data volume of the aligned sample against the time taken after the method of the present invention has been implemented as a software program product. It was obtained by testing on real samples, using raw data from noninvasive prenatal genetic testing samples, and it reflects the overall effect of the method when implemented in the C programming language.

(5) Verification and comparison of results

For comparison we used BWA, a reliable mainstream tool:

We aligned sample Human_A1600169-R169_good_1.fq with BWA. The procedure is as follows:

1. Build the index library from the reference genome by running

bwa index -a bwtsw hg19.fa

where hg19.fa is the reference genome; because hg19.fa is larger than 2GB, the -a parameter is set to bwtsw.

2. Align the sample sequences against the reference sequence index library by running

bwa aln -f Human_A1600169-R169_good_1.sai hg19.fa Human_A1600169-R169_good_1.fq

This generates a sai file containing a digest of the SA coordinates.

3. Convert the sai file to sam output by running

bwa samse -f Human_A1600169-R169_good_1.sam hg19.fa Human_A1600169-R169_good_1.sai Human_A1600169-R169_good_1.fastq

where the sam file contains the full information of the alignment result.

4. Extract the alignment result data with Perl and compute the statistics

Unique alignment rate of the sample: 72.772%,

Alignment rate of the sample: 74.443%.

The results of this software agree with those of BWA, but the whole BWA workflow takes half an hour and occupies a large amount of resources. BWA also cannot process multiple samples continuously in real time, and for a single sample it does not directly give the statistics needed for NIPT analysis, so an additional script has to be written for the statistical analysis.

Comparison of the two for the noninvasive prenatal testing project (single sample):

Referring to Fig. 8, this embodiment discloses a device for rapid alignment analysis of second-generation sequencing short sequences, comprising:

A mapping unit 1, configured to obtain the short DNA sequences produced by sequencing, and to encode each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;

An alignment unit 2, configured to align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, where the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

A computing unit 3, configured to, according to the alignment result, if the short DNA sequence aligns, obtain the value of the K-mer fragment to which it aligned, and determine from that value the chromosome number of the short DNA sequence and its site on that chromosome.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A method for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:

S1. Obtaining the short DNA sequences produced by sequencing, and encoding each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;

S2. Aligning the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

S3. According to the alignment result, if the short DNA sequence aligns, obtaining the value of the K-mer fragment to which it aligned, and determining from that value the chromosome number of the short DNA sequence and its site on that chromosome;

wherein S2 comprises:

For each short DNA sequence, looking up the memory block whose head address is the first index of that short DNA sequence; if such a memory block is found, determining whether the number of data storage locations in the memory block is 1;

If it is 1, determining that the short DNA sequence aligns to the K-mer fragment corresponding to the structure stored at the data storage location of that memory block.

2. The method according to claim 1, characterized in that the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.

3. The method according to claim 1, characterized by further comprising:

If it is not 1, then starting from the head address, using shifted linear probing to obtain the index2 value of the structure stored at the current address, and determining whether that index2 value is identical to the second index of the short DNA sequence;

If the index2 value is identical to the second index of the short DNA sequence, determining that the short DNA sequence aligns to the K-mer fragment corresponding to the structure stored at the current address.

4. The method according to claim 1, characterized in that the load factor of the index query library is 0.607.

5. The method according to claim 1, characterized in that, before S2, the method further comprises:

S40. According to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each;

S43. For each K-mer fragment obtained, applying the first hash algorithm mapping and the second hash algorithm mapping to obtain the index1 value and the index2 value respectively, computing the value that compresses the position information digest of the K-mer fragment, and generating a structure containing the value and the index2 value;

6. The method according to claim 1, characterized in that, when the short DNA sequence is aligned against the reference genome, multithreading is used to align the forward-strand and reverse-strand short sequences in parallel.

7. A device for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:

A mapping unit, configured to obtain the short DNA sequences produced by sequencing, and to encode each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;

An alignment unit, configured to align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures, each unit structure contains a value field and an index2 field, the value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, the index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment sequence length;

A computing unit, configured to, according to the alignment result, if the short DNA sequence aligns, obtain the value of the K-mer fragment to which it aligned, and determine from that value the chromosome number of the short DNA sequence and its site on that chromosome;

The alignment unit is configured to: