CN105243297A

CN105243297A - Quick comparing and positioning method for gene sequence segments on reference genome

Info

Publication number: CN105243297A
Application number: CN201510648108.6A
Authority: CN
Inventors: 宋卓; 李�根
Original assignee: Human And Future Biotechnology (changsha) Co Ltd
Current assignee: Human And Future Biotechnology (changsha) Co Ltd
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2016-01-13

Abstract

The present invention discloses a quick comparing and positioning method for a gene sequence segment on a reference genome. The method comprises: firstly, extracting the gene sequence segments from the reference genome; aiming at each gene sequence segment, establishing a key value pair by using the gene sequence segment as a key part and using target information of the gene sequence segment as a Value part, mapping the gene sequence segment by adopting a Hash function to determine a target storage location in a database, and writing the key value pair into the target storage location to complete database establishment of the reference genome; and when quick comparison and positioning need to be carried out on a gene sequence segment to be matched, firstly, mapping the gene sequence segment to be matched by adopting the Hash function to determine a target storage location of the gene sequence segment to be matched in the database, and then reading the target information of the gene sequence segment of the key value pair corresponding to the matched gene sequence segment from the target storage location. The quick comparing and positioning method disclosed by the present invention has the advantages of low time complexity, high comparing and positioning speed, high positioning efficiency, wide application range, and applicability to cross-species hybrid quick analysis.

Description

A method for rapid alignment and localization of gene sequence fragments on a reference genome

Technical field

The present invention relates to bioinformatic analysis of gene sequencing data, and in particular to a method for rapid alignment and localization of gene sequence fragments on a reference genome.

Background technology

With the development of gene sequencing technology, the cost of sequencing has fallen exponentially, at a pace that even exceeds Moore's Law. The resulting flood of sequencing data poses a huge challenge for fast and accurate computational analysis. The first step in analyzing sequencing data is sequence alignment, that is, aligning and localizing sequence fragments on a reference genome, which often consumes substantial computing resources and time. Aligning and localizing sequence fragments is becoming a bottleneck in accelerating the data analysis pipeline.

To solve the problem of aligning and localizing sequence fragments on a reference genome, many algorithms have been developed, along with widely used implementations; well-known examples include BWA, Bowtie, and SOAP. Existing alignment algorithms and software are all designed to search the reference genome for the best alignment position of a sequence fragment. Because alignment must tolerate site mutations, the more mutations a sequence fragment carries, the more candidate alignment positions exist on the reference genome, and the greater the computation required. The human genome consists largely of unique sequence fragments. For fragments of 36 bases (36 bp), the 36 bp fragments in at least 2/3 of the regions of the 3-billion-base genome are unique. Current alignment algorithms and software are not designed for these unique-fragment regions; they solve a more general sequence alignment and localization problem, so both the algorithms and their implementations leave room for optimization on this particular class of problem. For sequencing workflows that are not concerned with mutation analysis and only care about uniquely aligned fragments, the alignment method leaves considerable room for optimization: a redesign can save alignment time and further improve efficiency.

Summary of the invention

The technical problem to be solved by the present invention: in view of the above problems of the prior art, to provide a method for rapid alignment and localization of gene sequence fragments on a reference genome that has low time complexity, high alignment and localization speed, high localization efficiency, and a wide range of application, and that is applicable to rapid analysis of cross-species mixtures.

To solve the above technical problem, the present invention adopts the following technical solution:

A method for rapid alignment and localization of gene sequence fragments on a reference genome, comprising the steps of:

1) extracting gene sequence fragments from the reference genome;

2) for each gene sequence fragment on the reference genome, establishing a key-value pair with the gene sequence fragment as the Key part and the target information of the gene sequence fragment as the Value part, mapping the gene sequence fragment with a hash function to determine a target storage location in a database, and writing the key-value pair to the target storage location, thereby completing database construction for the reference genome;

3) when a gene sequence fragment to be matched needs rapid alignment and localization, first mapping the fragment to be matched with the hash function to look up its target storage location in the database; if the lookup succeeds, reading from the target storage location the target information in the key-value pair of the matching gene sequence fragment; otherwise, returning a lookup-failure message.
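Steps 2) and 3) amount to building and querying a hash table. The following is a minimal sketch only, assuming an in-memory Python `dict` stands in for the database and that the target information is a hypothetical `(chromosome, start)` tuple; the names `build_index` and `locate` are illustrative and do not come from the invention.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical target information: (chromosome, 0-based start position).
Target = Tuple[str, int]


def build_index(fragments: Iterable[Tuple[str, Target]]) -> Dict[str, Target]:
    """Step 2: store each fragment as a Key with its target information as Value."""
    index: Dict[str, Target] = {}
    for sequence, target in fragments:
        index[sequence] = target  # the dict's own hashing picks the storage slot
    return index


def locate(index: Dict[str, Target], query: str) -> Optional[Target]:
    """Step 3: hash the query and read its target information; None means lookup failure."""
    return index.get(query)


# Toy fragments, far shorter than real sequencing fragments.
index = build_index([("ACGTAC", ("chr1", 100)), ("TTGACA", ("chr2", 2500))])
hit = locate(index, "TTGACA")
miss = locate(index, "GGGGGG")
```

Because each query is a single hash lookup, its expected cost does not grow with the number of stored fragments, which is the property the method relies on.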

Preferably, the detailed steps of step 1) comprise:

1.1) setting the length L of the gene sequence fragments;

1.2) computing the positions and target information of the unique gene sequence fragments on the reference genome;

1.3) extracting the gene sequence fragments and their target information according to the positions of the gene sequence fragments.

Preferably, a unique gene sequence fragment in step 1.2) specifically means that the edit distance between any two gene sequence fragments is greater than or equal to a set threshold n.
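The uniqueness criterion can be illustrated with a standard Levenshtein edit distance. The sketch below is for illustration only: the brute-force pairwise check is an assumption made for clarity and would be impractical at genome scale, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def all_pairs_at_least(fragments: Sequence[str], n: int) -> bool:
    """True if every pair of fragments is at edit distance >= n."""
    return all(edit_distance(a, b) >= n for a, b in combinations(fragments, 2))
```

With a threshold of n = 2, for example, two fragments that differ by a single substitution would not both count as unique.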

Preferably, the target information comprises at least one of chromosome, chromosome position, GC content, and species classification.

Preferably, the detailed steps of step 2) comprise:

2.1) taking one gene sequence fragment from all the extracted gene sequence fragments as the current gene sequence fragment;

2.2) establishing a key-value pair (Key, Value) with the current gene sequence fragment as the Key part and its target information as the Value part, to describe the mapping between the current gene sequence fragment and its target information;

2.3) encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment, and using the specified hash function to map the current gene sequence fragment to database i among d databases;

2.4) mapping the current gene sequence fragment with the hash function to determine its target storage location in database i, and writing the encoded key-value pair (Key, Value) to that target storage location in database i;

2.5) determining whether all the extracted gene sequence fragments have been processed; if not, jumping back to step 2.1); otherwise, determining that database construction for the reference genome is complete.

Preferably, the detailed steps of mapping the current gene sequence fragment to database i among the d databases in step 2.3) comprise:

2.3.1) setting the number of databases d;

2.3.2) taking the prefix substring of length m of the Key part of the current gene sequence fragment, and using the specified hash function to compute, by formula (1), the database number i corresponding to the current gene sequence fragment among the d databases, thereby mapping the current gene sequence fragment to database i among the d databases;

$$i = f(\mathrm{Key}_{[1:m]}) \bmod d \qquad (1)$$

In formula (1), i is the database number corresponding to the current gene sequence fragment among the d databases, $\mathrm{Key}_{[1:m]}$ is the prefix substring of length m of the Key part of the current gene sequence fragment, d is the preset number of databases, and f is the specified hash function.
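Formula (1) can be sketched as follows. This is an illustration under stated assumptions: `zlib.crc32` is used as a stand-in for the specified hash function f because MurmurHash is not part of the Python standard library, the prefix is taken from the unencoded base string, and the function name `database_number` is hypothetical.

```python
import zlib


def database_number(key: str, m: int, d: int) -> int:
    """Compute i = f(Key[1:m]) mod d, with CRC32 standing in for the hash function f."""
    prefix = key[:m]  # prefix substring of length m
    return zlib.crc32(prefix.encode("ascii")) % d


# Two fragments sharing the same 3-base prefix land in the same database.
first = database_number("ACGTACGT", m=3, d=12)
second = database_number("ACGTTTTT", m=3, d=12)
```

The modulo yields a value from 0 to d - 1, and because only the prefix is hashed, the database holding a fragment can be chosen without reading the whole Key.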

Preferably, the detailed steps of step 3) comprise:

3.1) when a gene sequence fragment to be matched needs rapid alignment and localization, first computing the reverse complement of the fragment to be matched, encoding the fragment to be matched and its reverse complement separately, and then, for the length-m prefix substring of each encoded sequence, using the specified hash function and formula (1) to compute the database number i corresponding to the fragment to be matched and to its reverse complement among the d databases;

3.2) searching in parallel for the fragment to be matched and its reverse complement, each in the target database given by its computed database number i; if either of the two finds a matching record in its target database, returning the target information in the Value part of the matching key-value pair as the alignment and localization result; otherwise, when neither finds a matching record in its target database, returning a lookup-failure message.

Preferably, computing the reverse complement of the gene sequence fragment to be matched in step 3.1) specifically means first reversing the fragment to be matched, and then, in the reversed fragment, exchanging base A with base T and base C with base G, to obtain the reverse complement of the fragment to be matched.
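The reverse-complement computation and the two-strand lookup of step 3) can be sketched as below. This is a simplified illustration: a single in-memory `dict` stands in for the target databases, the two lookups run one after the other rather than in parallel, and the names `reverse_complement` and `locate_either_strand` are hypothetical.

```python
from typing import Dict, Optional

# Translation table that swaps A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse the fragment, then exchange A with T and C with G."""
    return sequence[::-1].translate(_COMPLEMENT)


def locate_either_strand(index: Dict[str, str], query: str) -> Optional[str]:
    """Return the stored target information if the query or its reverse
    complement is in the index; None signals lookup failure."""
    for candidate in (query, reverse_complement(query)):
        if candidate in index:
            return index[candidate]
    return None


index = {"CGTT": "chr1:10"}  # toy index with a hypothetical target string
```

Checking both strands matters because a sequenced fragment may originate from either strand, while the database stores only one orientation of each reference fragment.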

Preferably, encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment in step 2.3) specifically means: for the Key part, encoding by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence, and for the Value part, using variable-length coding; encoding the gene sequence fragment to be matched in step 3.1) specifically means: encoding by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence.
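The 2-bit base encoding can be sketched as follows. The bit codes are those given above; packing the bits most-significant-first into a `bytes` object is an assumption made for illustration, and `encode_key` is a hypothetical name.

```python
# 2-bit codes as specified: A=00, C=01, G=10, T=11.
_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_key(sequence: str) -> bytes:
    """Pack each base into 2 bits, first base in the most significant bits."""
    value = 0
    for base in sequence:
        value = (value << 2) | _CODES[base]
    n_bytes = (2 * len(sequence) + 7) // 8  # round up to whole bytes
    return value.to_bytes(n_bytes, "big")


packed = encode_key("ACGT")       # bits 00 01 10 11
long_key = encode_key("ACGT" * 9)  # 36 bases
```

Under this packing, a 36-base fragment occupies 72 bits, that is, 9 bytes instead of 36 characters.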

Preferably, the specified hash function is the MurmurHash function.

The method of the present invention for rapid alignment and localization of gene sequence fragments on a reference genome has the following advantages:

1. For each gene sequence fragment on the reference genome, the present invention establishes a key-value pair with the gene sequence fragment as the Key part and its target information as the Value part, and maps the gene sequence fragment with a hash function to determine its target storage location in the database. The alignment algorithm is thus replaced by a key-value lookup, which quickly yields the alignment result for a sequence fragment (that is, whether it aligns and, if it does, its target information). Key-value lookup has minimal time complexity, so compared with existing sequence alignment methods the invention offers low time complexity, high alignment and localization speed, and high localization efficiency.

2. Because each gene sequence fragment on the reference genome is stored as a key-value pair whose storage location is determined by a hash function, the unique sequence fragments of multiple genomes can be placed together and multiple genomes can be searched and aligned at the same time. The uniquely alignable regions of the genomes of different species can be combined, and sequence fragments can be aligned and localized on the genomes of several species simultaneously, which makes the method suitable for rapid analysis of cross-species mixtures.

3. For the reference genomes of most species, the fragments in most regions are unique; for this reason, the present invention can significantly accelerate analysis workflows that are not concerned with non-uniquely aligned sequence fragments.

4. Because the present invention establishes key-value pairs with the gene sequence fragment as the Key part and its target information as the Value part, it can, depending on the specific content of the target information, be applied to a variety of genetic analyses that rely on rapid alignment and localization of gene sequence fragments, such as CNV analysis and strain identification, and therefore has a wide range of application.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the rapid alignment and localization method in Embodiment 1 of the present invention.

Fig. 2 is a schematic diagram of the CNV analysis result obtained in Embodiment 1 of the present invention.

Embodiment

Embodiment 1:

The method of the present invention for rapid alignment and localization of gene sequence fragments on a reference genome is further described below, taking CNV analysis (Copy Number Variation analysis) by rapid sequence alignment as an example.

As shown in Fig. 1, the steps of the method of this embodiment comprise:

1) extracting gene sequence fragments from the reference genome;

3) when a gene sequence fragment to be matched needs rapid alignment and localization, mapping the fragment to be matched with the hash function to look up its target storage location in the database; if the lookup succeeds, reading from the target storage location the target information in the key-value pair of the matching gene sequence fragment; otherwise, returning a lookup-failure message.

In this embodiment, the reference genome is taken from the position information of the 36 bp unique gene sequence fragments of the human genome (version hg19) on the UCSC website, at the following address:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig

This file is the 36-mer mappability file of the human genome (version hg19). After downloading, the bigWig file is converted into a BED-format file with the tool bigWigToBedGraph. The BED-format file contains four columns: column 1 is the chromosome number; column 2 is the chromosome start position (counted from 0); column 3 is the chromosome end position (counted from 1); column 4 is the mappability value of the 36 bp sequence starting at each position in the interval from the start position to the end position (end position excluded). A mappability of 1 means that the gene sequence fragment is unique on the genome. In this embodiment, the records with a mappability of 1 are selected from the BED-format file, and the unique gene sequence fragments on the hg19 reference genome are extracted according to the position information given in the BED file. The extracted set of target gene sequence fragments serves as the set of Keys used to build the database. The unique coverage of each chromosome is shown in Table 1.

Table 1: Unique coverage on each chromosome.

Chromosome	Base number	Uniquely aligned base number	Unique coverage
1	249,250,621	171,109,348	0.69
2	243,199,373	187,599,789	0.77
3	198,022,430	155,568,147	0.79
4	191,154,276	148,235,325	0.78
5	180,915,260	139,351,070	0.77
6	171,115,067	132,615,985	0.78
7	159,138,663	115,076,932	0.72
8	146,364,022	113,561,421	0.78
9	141,213,431	85,976,834	0.61
10	135,534,747	100,580,972	0.74
11	135,006,516	101,426,392	0.75
12	133,851,895	101,134,431	0.76
13	115,169,878	77,517,018	0.67
14	107,349,540	68,732,800	0.64
15	102,531,392	59,987,912	0.59
16	90,354,753	56,576,358	0.63
17	81,195,210	55,582,856	0.68
18	78,077,248	60,876,802	0.78
19	59,128,983	35,479,931	0.6
20	63,025,520	47,555,303	0.75
21	48,129,895	26,757,998	0.56
22	51,304,566	24,675,505	0.48
X	155,270,560	103,355,975	0.67
Y	59,373,566	7,016,301	0.12

In Table 1, the first column is the chromosome number, the second column is the number of bases in each chromosome, the third column is the number of bases covered by unique gene sequence fragments, and the fourth column is the ratio of the third column to the second. In this embodiment, the 36 bp unique gene sequence fragments total 2,176,351,405; their bases account for 70.3% of the reference genome, and the mean coverage per chromosome is 67.1%.
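The selection of records with a mappability of 1 and the extraction of the corresponding fragments can be sketched as below. This is a minimal illustration, assuming the converted file is read as whitespace-separated four-column lines and that the genome is available as an in-memory mapping from chromosome name to sequence; the function names and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def unique_starts(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (chromosome, start) for every position whose mappability is 1."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track")):
            continue  # skip blank lines and header lines
        chrom, start, end, score = line.split()
        if float(score) == 1.0:
            # The interval is half-open: the end position is excluded.
            for position in range(int(start), int(end)):
                yield chrom, position


def extract_fragments(genome: Dict[str, str],
                      starts: Iterable[Tuple[str, int]],
                      length: int = 36) -> Iterator[Tuple[str, Tuple[str, int]]]:
    """Yield (fragment, (chromosome, start)) for each unique start position."""
    for chrom, position in starts:
        fragment = genome[chrom][position:position + length]
        if len(fragment) == length:  # drop fragments truncated at a chromosome end
            yield fragment, (chrom, position)


# Toy data: an 8-base "chromosome" and a fragment length of 4 instead of 36.
toy_lines = ["chrT\t0\t2\t1", "chrT\t2\t5\t0.5"]
toy_genome = {"chrT": "ACGTACGT"}
toy_result = list(extract_fragments(toy_genome, unique_starts(toy_lines), length=4))
```

Each yielded pair is ready to serve as a Key and the positional part of its Value when the database is built.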

In this embodiment, the detailed steps of step 1) comprise:

1.1) setting the length L of the gene sequence fragments; in this embodiment, L is 36;

In this embodiment, a unique gene sequence fragment in step 1.2) specifically means that the edit distance between any two gene sequence fragments is greater than or equal to a set threshold n; here n is 2, and other values may be set as needed.

It should be noted that the target information depends on the particular application of rapid alignment and localization; it may comprise at least one of chromosome, chromosome position, GC content, and species classification. In this embodiment, the target information specifically comprises: (1) the chromosome number of the gene sequence fragment, from {1, 2, ..., 22, X, Y}; (2) the start position of the gene sequence fragment on the chromosome (counted from 0); (3) the GC content of the gene sequence fragment (that is, the total number of G and C bases in the fragment).
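The three items of target information used in this embodiment can be gathered as in the following sketch. The `TargetInfo` record and the `make_target_info` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class TargetInfo(NamedTuple):
    chromosome: str  # one of "1" .. "22", "X", "Y"
    start: int       # start position on the chromosome, counted from 0
    gc_count: int    # total number of G and C bases in the fragment


def make_target_info(chromosome: str, start: int, fragment: str) -> TargetInfo:
    """Assemble the Value-side record for one fragment."""
    gc_count = sum(1 for base in fragment if base in "GC")
    return TargetInfo(chromosome, start, gc_count)


# Toy fragment, shorter than the 36 bp used in this embodiment.
info = make_target_info("2", 1500, "ACGTGGCCAT")
```

Storing the GC count with each fragment means that a later GC correction can be computed from lookup results alone, without returning to the reference sequence.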

In this embodiment, the detailed steps of step 2) comprise:

2.3) encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment, and using the specified hash function to map the current gene sequence fragment to database i (1≤i≤d) among d databases;

In this embodiment, the detailed steps of mapping the current gene sequence fragment to database i among the d databases in step 2.3) comprise:

2.3.1) setting the number of databases d; in this embodiment, d is 12;

$$i = f(\mathrm{Key}_{[1:m]}) \bmod d \qquad (1)$$

The specified hash function f maps all gene sequence fragments into the d databases. In this embodiment, f is the MurmurHash function, which distributes the gene sequence fragments evenly across the d databases.

In this embodiment, the detailed steps of step 3) comprise:

3.1) when a gene sequence fragment to be matched needs rapid alignment and localization, first computing the reverse complement of the fragment to be matched, encoding the fragment to be matched and its reverse complement separately, and then, for the length-m prefix substring of each encoded sequence, using the specified hash function and formula (1) to compute the database number i corresponding to the fragment to be matched and to its reverse complement among the d databases (the numbering determined when the database was built for the reference genome); in this embodiment, the prefix substring length m is 3;

In this embodiment, computing the reverse complement of the gene sequence fragment to be matched in step 3.1) specifically means first reversing the fragment to be matched, and then, in the reversed fragment, exchanging base A with base T and base C with base G, to obtain the reverse complement of the fragment to be matched. For example, for a 36-base fragment to be matched AC...ACGT, reversing gives TGCA...CA, and then exchanging base A with base T and base C with base G gives ACGT...GT.

In this embodiment, when the Key part and the Value part of the key-value pair of the current gene sequence fragment are encoded in step 2.3), the Key part is encoded by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence, and the Value part uses variable-length coding. When the fragment to be matched is encoded in step 3.1), it is likewise encoded by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence. In this embodiment, the encoded characters in step 2.3) are ASCII characters; after encoding, the Key part is 9 characters long (36 bases at 2 bits each give 72 bits, that is, nine 8-bit characters) and the Value part is between 3 and 6 characters long. When the fragment to be matched is encoded in step 3.1), the encoded Key part is also 9 characters long. When the Value part is encoded with variable-length coding in step 2.3), the fields are written in the order chromosome number, GC content of the gene sequence fragment, and start position of the gene sequence fragment on the chromosome. The chromosome number and the GC content each take one character, while the start position uses variable-length coding: one character when the position lies in [0, 2⁸-1], two characters in [2⁸, 2¹⁶-1], three characters in [2¹⁶, 2²⁴-1], and four characters in [2²⁴, 2³²-1], which saves storage space.
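The variable-length Value layout described above can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: raw bytes are used in place of characters, chromosomes X and Y are numbered 23 and 24, the position is written big-endian, and the number of position bytes is recovered on decoding from the total length of the Value.

```python
from typing import Tuple

# Hypothetical numbering: "1".."22" map to 1..22, "X" to 23, "Y" to 24.
_CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def encode_value(chromosome: str, gc_count: int, position: int) -> bytes:
    """One byte for the chromosome, one for the GC count, 1 to 4 for the position."""
    chrom_code = _CHROMOSOMES.index(chromosome) + 1
    n_bytes = max(1, (position.bit_length() + 7) // 8)  # fewest bytes that fit
    if n_bytes > 4:
        raise ValueError("position does not fit in four bytes")
    return bytes([chrom_code, gc_count]) + position.to_bytes(n_bytes, "big")


def decode_value(value: bytes) -> Tuple[str, int, int]:
    """Invert encode_value; everything after the first two bytes is the position."""
    chromosome = _CHROMOSOMES[value[0] - 1]
    return chromosome, value[1], int.from_bytes(value[2:], "big")


short_value = encode_value("1", 18, 200)      # position fits in one byte
long_value = encode_value("2", 10, 2 ** 24)   # position needs four bytes
```

The resulting Values are 3 to 6 bytes long, which matches the 3 to 6 characters stated for this embodiment.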

In this embodiment, the data set of gene sequence fragments to be matched is the sequencing result of a real human blood sample; the sequenced fragments are 36 bp long and number 4,929,709. Rapid sequence alignment and localization was carried out with the method of the invention, and the alignment statistics are shown in Table 2.

Table 2: Sequence data for each chromosome.

Chromosome	Aligned sequence number	Uniquely aligned sequence number	GC content of uniquely aligned sequences
1	279880	277196	0.40525
2	299817	296786	0.390581
3	248322	245823	0.385119
4	234336	231968	0.369009
5	220284	218086	0.383703
6	213813	211737	0.384034
7	185199	183429	0.392871
8	180296	178557	0.389799
9	137702	136404	0.403912
10	164105	162531	0.404724
11	161240	159706	0.407018
12	163420	161837	0.394989
13	122829	121550	0.372917
14	110550	109441	0.397038
15	99286	98357	0.409776
16	91390	90550	0.437317
17	91902	91116	0.444336
18	95782	94852	0.387118
19	57913	57502	0.477258
20	76343	75659	0.432834
21	44167	43733	0.402295
22	40754	40440	0.47358
X	164658	163060	0.382272
Y	481	476	0.394771

In Table 2, the first column is the chromosome number; the second column is the number of gene sequence fragments that matched the database; the third column is the number of those matched fragments that are unique; the fourth column is the GC content of the unique matched fragments, that is, the proportion of bases C and G in them. In this embodiment, 3,484,469 gene sequence fragments matched the database, of which 3,450,796 matched uniquely. Uniquely matched fragments account for 70% of all fragments to be matched, consistent with the proportion of reference-genome bases covered by unique gene sequence fragments given with Table 1.

Starting from Table 2, the number of sequences aligned to each chromosome is normalized and corrected for GC content and mappability, and a normal sample is chosen as the reference for analyzing chromosomal copy number variation. A known CNV mutation was accurately found at the end of the q arm of chromosome 2 (Chr2); the CNV analysis result is shown in Fig. 2. In Fig. 2, the x coordinate is the position on chromosome 2 of a 60K window sliding in 5K steps; the y coordinate is log2ratio. A log2ratio that departs markedly from 0 indicates that the sample of this embodiment has a copy number variation at the corresponding chromosome position. The y coordinate log2ratio is computed as shown in formula (2):

$$\mathrm{log2ratio} = \log_{2} \frac{x_{60K}}{\bar{x}_{60K}} \qquad (2)$$

In formula (2), $x_{60K}$ is the normalized number of gene sequence fragments uniquely aligned to each 60K window of chromosome 2 in the sample data of this embodiment, and $\bar{x}_{60K}$ is the mean of the normalized number of gene sequence fragments uniquely aligned to each 60K window of chromosome 2 in the normal sample data.
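The sliding windows and formula (2) can be sketched as below. This is an illustration under stated assumptions: the window and step are taken to be 60,000 and 5,000 bases, normalization is shown as simple division by the total count (the text does not specify the normalization, and the GC and mappability corrections are omitted), and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def window_starts(chrom_length: int, window: int = 60_000, step: int = 5_000) -> List[int]:
    """Start positions of windows of the given size sliding in fixed steps."""
    return list(range(0, max(chrom_length - window, 0) + 1, step))


def normalize(counts: Sequence[int]) -> List[float]:
    """Assumed normalization: divide each window count by the total count."""
    total = sum(counts)
    return [count / total for count in counts]


def log2_ratio(x: float, x_bar: float) -> float:
    """Formula (2): log2 of the sample value over the normal-sample mean.
    Returns NaN when either value is not positive, where the ratio is undefined."""
    if x <= 0 or x_bar <= 0:
        return math.nan
    return math.log2(x / x_bar)
```

A value near 0 means the window matches the normal reference; a window whose normalized count is double the reference mean gives a log2ratio of 1.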

For the 36 bp sequences of this embodiment, the rapid alignment algorithm achieves a throughput of 1 million sequences per second. Compared with traditional general-purpose alignment software based on the BWT, it eliminates the time-consuming Smith-Waterman alignment step, so alignment efficiency improves significantly.

Embodiment 2:

Besides CNV analysis, the method of the present invention for rapid alignment and localization of gene sequence fragments on a reference genome can also be used for strain identification. In this embodiment, the reference genome is a cross-species reference genome. When the database is built for the reference genome in step 2), the reference genome consists of the human reference genome and all bacterial and viral genome sequences downloaded from the NCBI RefSeq database (http://www.ncbi.nlm.nih.gov/refseq/). The 31-mer mappability map of the (human, bacterial, viral) reference sequences is computed with the gem-mappability software, and the unique sequence fragments with a mappability of 1 are selected from it. Detailed species classification information is downloaded from the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy). For each gene sequence fragment on the reference genome, a key-value pair is established with the gene sequence fragment as the Key part and its target information (species classification) as the Value part; the gene sequence fragment is mapped with a hash function to determine its target storage location in the database, and the key-value pair is written to the target storage location, thereby completing database construction for the reference genome. To analyze the composition of human gut microorganisms, sequencing data of fecal samples from normal subjects is downloaded from the HMP website (http://hmpdacc.org/); step 3) is applied to the gene sequence fragments to be matched in this data to carry out rapid sequence alignment and analyze the microbial composition at the genus level. The final result is shown in Table 3.

Table 3: Microbial composition analysis results.

Genus	Classified sequence number	Normalized classified sequence number	Percentage
Bacteroides	3345216	153556.7	15.36
Eubacterium	1597681	73338.94	7.33
Roseburia	224359	10298.83	1.03
Alistipes	197067	9046.04	0.9
Odoribacter	174594	8014.45	0.8
Others	505842	23219.85	2.32

In Table 3, the first column is the name of the microorganism at the genus level; the second column is the number of gene sequence fragments aligned to that genus; the third column is the number of sequence fragments aligned to that genus per 1,000,000 sequence fragments; the fourth column is the normalized count expressed as a percentage of 1,000,000 sequence fragments, which measures the abundance of the microorganism in the sample: the higher the value, the larger the share of that microorganism in the sample. Finally, the strain abundance in the analyzed sample is determined from the unique sequence fragments localized to the different strains; this result is consistent with the analysis result provided by the HMP website. The method of this embodiment allows the reference genome and the target information to be highly customized for specific applications and is therefore flexible. Customizing and simplifying the reference genome greatly improves the efficiency of sequence alignment. While accelerating alignment analysis, the method does not sacrifice analytical accuracy, so it is well suited to large-scale data analysis and to application scenarios with demanding real-time requirements, and has broad application prospects.
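The relationship between the columns of Table 3 can be sketched as follows. The total number of sequence fragments in the sample is not stated in the text, so the example uses made-up toy numbers, and the function names are hypothetical.

```python
def per_million(classified: int, total_fragments: int) -> float:
    """Number of fragments assigned to a genus per 1,000,000 sequence fragments."""
    return classified * 1_000_000 / total_fragments


def percentage(normalized: float) -> float:
    """Normalized count expressed as a percentage of 1,000,000 fragments."""
    return normalized / 1_000_000 * 100


# Toy numbers, not taken from Table 3: 250 of 1,000 fragments assigned to one genus.
toy_normalized = per_million(250, 1_000)
toy_percent = percentage(toy_normalized)
```

The second function reproduces the link between the last two columns of Table 3: a normalized count of 153556.7 per million corresponds to 15.36 percent.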

The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for rapid alignment and localization of gene sequence fragments on a reference genome, characterized in that the steps comprise:

1) extracting gene sequence fragments from the reference genome;

2. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 1, characterized in that the detailed steps of step 1) comprise:

1.1) setting the length L of the gene sequence fragments;

3. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 2, characterized in that a unique gene sequence fragment in step 1.2) specifically means that the edit distance between any two gene sequence fragments is greater than or equal to a set threshold n.

4. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 3, characterized in that the target information comprises at least one of chromosome, chromosome position, GC content, and species classification.

5. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to any one of claims 1 to 4, characterized in that the detailed steps of step 2) comprise:

6. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 5, characterized in that the detailed steps of mapping the current gene sequence fragment to database i among the d databases in step 2.3) comprise:

2.3.1) setting the number of databases d;

$$i = f(\mathrm{Key}_{[1:m]}) \bmod d \qquad (1)$$

7. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 6, characterized in that the detailed steps of step 3) comprise:

8. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 7, characterized in that computing the reverse complement of the gene sequence fragment to be matched in step 3.1) specifically means first reversing the fragment to be matched, and then, in the reversed fragment, exchanging base A with base T and base C with base G, to obtain the reverse complement of the fragment to be matched.

9. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 8, characterized in that encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment in step 2.3) specifically means: for the Key part, encoding by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence, and for the Value part, using variable-length coding; and encoding the gene sequence fragment to be matched in step 3.1) specifically means: encoding by replacing base A with 00, base C with 01, base G with 10, and base T with 11 in sequence.

10. The method for rapid alignment and localization of gene sequence fragments on a reference genome according to claim 9, characterized in that the specified hash function is the MurmurHash function.