CN105243297A - Quick comparing and positioning method for gene sequence segments on reference genome - Google Patents

Publication number
CN105243297A
Authority
CN
China
Prior art keywords
gene order
order fragment
fragment
key
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510648108.6A
Other languages
Chinese (zh)
Inventor
宋卓
李�根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human And Future Biotechnology (changsha) Co Ltd
Original Assignee
Human And Future Biotechnology (changsha) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human And Future Biotechnology (changsha) Co Ltd filed Critical Human And Future Biotechnology (changsha) Co Ltd
Priority to CN201510648108.6A priority Critical patent/CN105243297A/en
Publication of CN105243297A publication Critical patent/CN105243297A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a quick comparing and positioning method for a gene sequence segment on a reference genome. The method comprises: firstly, extracting the gene sequence segments from the reference genome; aiming at each gene sequence segment, establishing a key value pair by using the gene sequence segment as a key part and using target information of the gene sequence segment as a Value part, mapping the gene sequence segment by adopting a Hash function to determine a target storage location in a database, and writing the key value pair into the target storage location to complete database establishment of the reference genome; and when quick comparison and positioning need to be carried out on a gene sequence segment to be matched, firstly, mapping the gene sequence segment to be matched by adopting the Hash function to determine a target storage location of the gene sequence segment to be matched in the database, and then reading the target information of the gene sequence segment of the key value pair corresponding to the matched gene sequence segment from the target storage location. The quick comparing and positioning method disclosed by the present invention has the advantages of low time complexity, high comparing and positioning speed, high positioning efficiency, wide application range, and applicability to cross-species hybrid quick analysis.

Description

A quick comparing and positioning method for gene sequence segments on a reference genome
Technical field
The present invention relates to bioinformatic analysis of gene sequencing data, and in particular to a quick comparing and positioning method for gene sequence segments on a reference genome.
Background
With the development of gene sequencing technology, the price of sequencing has fallen exponentially, at a pace that even exceeds Moore's Law. The resulting flood of sequencing data poses a huge challenge for fast and accurate computational analysis. The first step in analyzing sequencing data is sequence alignment, that is, aligning and positioning sequence segments on a reference genome, which often consumes substantial computing resources and time. Aligning and positioning sequence segments is becoming a bottleneck in accelerating the data analysis pipeline.
To solve the problem of aligning and positioning sequence segments on a reference genome, many algorithms and widely used implementations have been developed; well-known examples include BWA, Bowtie and SOAP. Existing alignment algorithms and software are all designed to search the reference genome for the best alignment position of a sequence segment. Because they tolerate site mutations during alignment, the more mutations a sequence segment carries, the more candidate alignment positions there are on the reference genome and the larger the required computation. The human genome consists largely of unique sequence segments. For segments of 36 bases (36 bp), the 36 bp segments in at least 2/3 of the regions of the 3-billion-base genome are unique. Current alignment algorithms and software are not designed for these unique regions; they solve a more general sequence alignment and positioning problem, so both their algorithms and their implementations leave room for optimization on particular classes of problems. For sequencing workflows that are not concerned with mutation analysis and only care about uniquely aligned segments, the design of the alignment method leaves considerable room for optimization; a redesign can save alignment time and further improve efficiency.
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems of the prior art, to provide a quick comparing and positioning method for gene sequence segments on a reference genome that has low time complexity, high alignment and positioning speed, high positioning efficiency and a wide application range, and that is applicable to rapid cross-species mixed analysis.
To solve the above technical problem, the technical solution adopted by the present invention is:
A quick comparing and positioning method for gene sequence segments on a reference genome, comprising the steps of:
1) extracting gene sequence segments from the reference genome;
2) for each gene sequence segment on the reference genome, establishing a key-value pair with the gene sequence segment as the Key part and the target information of the gene sequence segment as the Value part, mapping the gene sequence segment with a hash function to determine a target storage location in a database, and writing the key-value pair to the target storage location, thereby completing database construction for the reference genome;
3) when a gene sequence segment to be matched needs to be quickly aligned and positioned, first mapping the gene sequence segment to be matched with the hash function to look up the target storage location in the database; if the lookup succeeds, reading from the target storage location the target information in the key-value pair corresponding to the matched gene sequence segment; otherwise, returning a lookup-failure message.
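Steps 1) to 3) can be illustrated with a minimal in-memory sketch. It assumes a Python dict stands in for the database and its built-in hashing for the hash function, and it treats a segment as unique when it occurs exactly once in the reference, which is a simplification of the uniqueness criterion. The names `build_index` and `locate` and the toy reference are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical target information: (chromosome name, 0-based start position).
Target = Tuple[str, int]


def build_index(reference: Dict[str, str], length: int) -> Dict[str, Target]:
    """Steps 1) and 2): store every segment that occurs exactly once as Key -> Value."""
    first_seen: Dict[str, Target] = {}
    repeated = set()
    for chrom, sequence in reference.items():
        for start in range(len(sequence) - length + 1):
            segment = sequence[start:start + length]
            if segment in first_seen:
                repeated.add(segment)  # seen before, so not unique
            else:
                first_seen[segment] = (chrom, start)
    return {key: value for key, value in first_seen.items() if key not in repeated}


def locate(index: Dict[str, Target], segment: str) -> Optional[Target]:
    """Step 3): a single hashed lookup; None plays the role of the failure message."""
    return index.get(segment)


# Toy reference (assumption, for illustration only).
toy_reference = {"chr1": "ACGTACGGA", "chr2": "TTGACGT"}
toy_index = build_index(toy_reference, length=4)
print(locate(toy_index, "CGGA"))  # unique segment on chr1
print(locate(toy_index, "ACGT"))  # occurs on both chromosomes, so not indexed
```

The lookup cost does not depend on the size of the reference, which is the property the method relies on in place of a general alignment search.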
Preferably, the detailed steps of step 1) comprise:
1.1) setting the length L of the gene sequence segments;
1.2) computing the positions and target information of the unique gene sequence segments on the reference genome;
1.3) extracting each gene sequence segment and its target information according to the position of the gene sequence segment.
Preferably, a unique gene sequence segment in step 1.2) specifically means that the edit distance between any two gene sequence segments is greater than or equal to a set threshold n.
Preferably, the target information comprises at least one of chromosome, chromosome position, GC content and species taxonomy.
Preferably, the detailed steps of step 2) comprise:
2.1) taking one gene sequence segment from all extracted gene sequence segments as the current gene sequence segment;
2.2) establishing a key-value pair (Key, Value), with the current gene sequence segment as the Key part and the target information of the current gene sequence segment as the Value part, to describe the mapping between the current gene sequence segment and its target information;
2.3) encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence segment, and using the specified hash function to map the current gene sequence segment to database i among d databases;
2.4) mapping the current gene sequence segment with the hash function to determine its target storage location in database i, and writing the encoded key-value pair (Key, Value) to that target storage location in database i;
2.5) judging whether all extracted gene sequence segments have been processed; if not, jumping to step 2.1); otherwise, database construction for the reference genome is complete.
Preferably, the detailed steps in step 2.3) for mapping the current gene sequence segment to database i among the d databases comprise:
2.3.1) setting the number of databases d;
2.3.2) taking the prefix substring of length m of the Key part of the current gene sequence segment, and using the specified hash function with formula (1) to compute the database number i of the current gene sequence segment among the d databases, thereby mapping the current gene sequence segment to database i among the d databases;
$$i = f(\mathrm{Key}_{[1:m]}) \bmod d \qquad (1)$$
In formula (1), $i$ is the database number of the current gene sequence segment among the d databases, $\mathrm{Key}_{[1:m]}$ is the prefix substring of length m of the Key part of the current gene sequence segment, $d$ is the preset number of databases, and $f$ is the specified hash function.
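Formula (1) can be sketched as below. This is only an illustration: MurmurHash is not part of the Python standard library, so `zlib.crc32` is swapped in as a stand-in for the hash function f, and the function name `database_number` is hypothetical.

```python
import zlib


def database_number(key: bytes, m: int, d: int) -> int:
    """Formula (1): hash the length-m prefix of the key, then reduce modulo d.

    zlib.crc32 is a stand-in for the specified hash function f (assumption).
    """
    prefix = key[:m]
    return zlib.crc32(prefix) % d


# Keys that share the same length-m prefix always land in the same database.
print(database_number(b"ACGTACGT", m=3, d=12))
print(database_number(b"ACGAAAAA", m=3, d=12))
```

The modulo operation yields a number in the range 0 to d-1; add 1 if the databases are numbered from 1.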
Preferably, the detailed steps of step 3) comprise:
3.1) when a gene sequence segment to be matched needs to be quickly aligned and positioned, first computing the reverse complementary sequence of the gene sequence segment to be matched, encoding the gene sequence segment to be matched and its reverse complementary sequence separately, and then, from the prefix substring of length m of each encoded sequence, using the specified hash function with formula (1) to compute the database number i among the d databases for the gene sequence segment to be matched and for its reverse complementary sequence;
3.2) searching in parallel for the gene sequence segment to be matched and for its reverse complementary sequence, each in the target database given by its computed database number i; if either the gene sequence segment to be matched or its reverse complementary sequence finds a matching record in its target database, returning the target information in the Value part of the matched key-value pair as the alignment and positioning result; otherwise, if neither finds a matching record, returning a lookup-failure message.
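The lookup in steps 3.1) and 3.2) can be sketched as follows. The sketch makes several assumptions: a list of Python dicts stands in for the d databases, `zlib.crc32` stands in for the specified hash function, keys are left as plain strings rather than encoded, and the two strands are checked one after the other rather than in parallel. All function names are hypothetical.

```python
import zlib
from typing import Dict, List, Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment: str) -> str:
    """Complement each base, then reverse the result."""
    return segment.translate(_COMPLEMENT)[::-1]


def shard_of(key: str, m: int, d: int) -> int:
    """Database number from the length-m prefix (crc32 is a stand-in hash)."""
    return zlib.crc32(key[:m].encode("ascii")) % d


def build_shards(pairs: Dict[str, str], m: int, d: int) -> List[Dict[str, str]]:
    """Distribute the key-value pairs over d in-memory databases."""
    shards: List[Dict[str, str]] = [dict() for _ in range(d)]
    for key, value in pairs.items():
        shards[shard_of(key, m, d)][key] = value
    return shards


def locate(shards: List[Dict[str, str]], query: str, m: int) -> Optional[str]:
    """Try the query and its reverse complement; None means lookup failure."""
    d = len(shards)
    for candidate in (query, reverse_complement(query)):
        hit = shards[shard_of(candidate, m, d)].get(candidate)
        if hit is not None:
            return hit
    return None


# Toy data (assumption): one indexed segment with a made-up target string.
shards = build_shards({"ACGGTT": "chr1:10"}, m=3, d=4)
print(locate(shards, "ACGGTT", m=3))  # forward strand hit
print(locate(shards, "AACCGT", m=3))  # hit through the reverse complement
```

Because the two candidate keys generally fall into different databases, the two lookups are independent and could be dispatched concurrently.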
Preferably, computing the reverse complementary sequence of the gene sequence segment to be matched in step 3.1) specifically means first reversing the gene sequence segment to be matched, and then, in the reversed segment, exchanging base A with base T and base C with base G, to obtain the reverse complementary sequence of the gene sequence segment to be matched.
Preferably, encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence segment in step 2.3) specifically means: for the Key part, successively replacing base A with 00, base C with 01, base G with 10 and base T with 11; and for the Value part, using variable-length encoding. Encoding the gene sequence segment to be matched in step 3.1) specifically means: successively replacing base A with 00, base C with 01, base G with 10 and base T with 11.
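The 2-bit key encoding can be sketched as below. This is a minimal illustration that packs the bit pairs into bytes in big-endian order; the packing order and the function name `encode_key` are assumptions. A 36-base segment needs 72 bits, that is, 9 bytes.

```python
# Base-to-bits table as stated in the text: A=00, C=01, G=10, T=11.
_CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_key(segment: str) -> bytes:
    """Pack a segment into bytes, two bits per base (big-endian packing is assumed)."""
    packed = 0
    for base in segment:
        packed = (packed << 2) | _CODE[base]  # raises KeyError for non-ACGT bases
    n_bytes = (2 * len(segment) + 7) // 8
    return packed.to_bytes(n_bytes, "big")


print(encode_key("ACGT").hex())      # bits 00 01 10 11 form one byte
print(len(encode_key("ACGT" * 9)))   # 36 bases
```

Since all keys have the same fixed length L, zero padding in the leading byte for lengths that are not a multiple of four does not make two different segments collide.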
Preferably, the specified hash function is the MurmurHash function.
The quick comparing and positioning method for gene sequence segments on a reference genome of the present invention has the following advantages:
1. For each gene sequence segment on the reference genome, the present invention establishes a key-value pair with the gene sequence segment as the Key part and its target information as the Value part, and maps the gene sequence segment with a hash function to determine the target storage location in the database. The alignment algorithm is thereby replaced by a key-value lookup, which quickly determines whether a sequence segment aligns (and, if it does, yields its target information). Key-value lookup has minimal time complexity and, compared with existing sequence alignment methods, the fastest alignment and positioning speed, giving the method low time complexity, high alignment and positioning speed and high positioning efficiency.
2. Because each gene sequence segment on the reference genome is stored as a key-value pair at a hash-determined storage location, the unique sequence segments of multiple genomes can be stored together and multiple genomes searched at once. The uniquely alignable regions of different species' genomes can be pooled so that sequence segments are aligned and positioned on the genomes of several species simultaneously, which makes the method suitable for rapid cross-species mixed analysis.
3. For the reference genomes of most species, the segments in most regions are unique; for this reason, the present invention can significantly accelerate analysis workflows that are not concerned with non-uniquely aligned sequence segments.
4. The present invention establishes key-value pairs with the gene sequence segment as the Key part and its target information as the Value part. Depending on the specific content of the target information, the method can be applied to all kinds of genetic analysis applications that need quick alignment and positioning of gene sequence segments, such as CNV analysis and species identification, and thus has a wide application range.
Brief description of the drawings
Fig. 1 is a schematic flow diagram of the quick comparing and positioning method in embodiment one of the present invention.
Fig. 2 is a schematic diagram of the CNV analysis result obtained in embodiment one of the present invention.
Detailed description of the embodiments
Embodiment one:
The quick comparing and positioning method for gene sequence segments on a reference genome of the present invention is further described below, taking CNV analysis (copy number variation analysis) by rapid sequence alignment as an example.
As shown in Fig. 1, the steps of the quick comparing and positioning method for gene sequence segments on a reference genome of the present embodiment comprise:
1) extracting gene sequence segments from the reference genome;
2) for each gene sequence segment on the reference genome, establishing a key-value pair with the gene sequence segment as the Key part and the target information of the gene sequence segment as the Value part, mapping the gene sequence segment with a hash function to determine a target storage location in a database, and writing the key-value pair to the target storage location, thereby completing database construction for the reference genome;
3) when a gene sequence segment to be matched needs to be quickly aligned and positioned, mapping the gene sequence segment to be matched with the hash function to look up the target storage location in the database; if the lookup succeeds, reading from the target storage location the target information in the key-value pair corresponding to the matched gene sequence segment; otherwise, returning a lookup-failure message.
In the present embodiment, the reference genome data come from the position information of the 36 bp unique gene sequence segments of the human genome (version hg19) on the UCSC website, at:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign36mer.bigWig
This file is the 36mer mappability file of the human genome (version hg19). After download, the bigWig file is converted to a BED-format file with the tool bigWigToBedGraph. The BED-format file contains four columns: column 1 is the chromosome number; column 2 is the chromosome start position (counted from 0); column 3 is the chromosome end position (counted from 1); column 4 is the mappability value of the 36 bp sequence starting at each position in the interval from the start position to the end position (end position not included). A mappability of 1 means the gene sequence segment is unique on the genome. The present embodiment selects the records with mappability 1 in the BED-format file and extracts the unique gene sequence segments on the hg19 reference genome according to the position information given in the BED file. The resulting set of target gene sequence segments is used as the set of Keys for building the database. The unique coverage on each chromosome is shown in Table 1.
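The filtering and extraction just described can be sketched as follows. This is a minimal sketch, assuming the four-column text produced by bigWigToBedGraph is available as a list of lines and the genome is held in memory as a dict from chromosome name to sequence. Skipping segments that contain N, and the function names themselves, are assumptions of the sketch.

```python
from typing import Dict, Iterable, Iterator, Tuple


def unique_starts(bedgraph_lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (chromosome, 0-based start) for every position with mappability 1."""
    for line in bedgraph_lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue  # skip blank lines and header lines
        chrom, start, end, score = line.split()[:4]
        if float(score) == 1.0:
            # The interval is half-open: start included, end excluded.
            for position in range(int(start), int(end)):
                yield chrom, position


def extract_segments(
    genome: Dict[str, str], bedgraph_lines: Iterable[str], length: int = 36
) -> Dict[str, Tuple[str, int]]:
    """Build the Key set: unique segment -> (chromosome, start position)."""
    segments: Dict[str, Tuple[str, int]] = {}
    for chrom, position in unique_starts(bedgraph_lines):
        segment = genome[chrom][position:position + length].upper()
        if len(segment) == length and "N" not in segment:  # assumption: drop gaps
            segments[segment] = (chrom, position)
    return segments


# Toy input (assumption): a 12-base chromosome and 4-base segments.
toy_genome = {"chr1": "ACGTACGGATTC"}
toy_lines = ["chr1\t0\t2\t1", "chr1\t2\t5\t0.5"]
print(extract_segments(toy_genome, toy_lines, length=4))
```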
Table 1: Unique coverage on each chromosome.
Chromosome    Total bases    Uniquely alignable bases    Unique coverage
1             249,250,621    171,109,348                 0.69
2             243,199,373    187,599,789                 0.77
3             198,022,430    155,568,147                 0.79
4             191,154,276    148,235,325                 0.78
5             180,915,260    139,351,070                 0.77
6             171,115,067    132,615,985                 0.78
7             159,138,663    115,076,932                 0.72
8             146,364,022    113,561,421                 0.78
9             141,213,431    85,976,834                  0.61
10            135,534,747    100,580,972                 0.74
11            135,006,516    101,426,392                 0.75
12            133,851,895    101,134,431                 0.76
13            115,169,878    77,517,018                  0.67
14            107,349,540    68,732,800                  0.64
15            102,531,392    59,987,912                  0.59
16            90,354,753     56,576,358                  0.63
17            81,195,210     55,582,856                  0.68
18            78,077,248     60,876,802                  0.78
19            59,128,983     35,479,931                  0.6
20            63,025,520     47,555,303                  0.75
21            48,129,895     26,757,998                  0.56
22            51,304,566     24,675,505                  0.48
X             155,270,560    103,355,975                 0.67
Y             59,373,566     7,016,301                   0.12
See Table 1: the first column is the chromosome number, the second column is the number of bases in each chromosome, the third column is the number of bases covered by unique gene sequence segments, and the fourth column is the ratio of the bases covered by unique gene sequence segments to the bases of each chromosome. In the present embodiment, the 36 bp unique gene sequence segments total 2,176,351,405; their bases account for 70.3% of the reference genome, and the mean coverage across chromosomes is 67.1%.
In the present embodiment, the detailed steps of step 1) comprise:
1.1) setting the length L of the gene sequence segments; in the present embodiment, the length L is 36;
1.2) computing the positions and target information of the unique gene sequence segments on the reference genome;
1.3) extracting each gene sequence segment and its target information according to the position of the gene sequence segment.
In the present embodiment, a unique gene sequence segment in step 1.2) specifically means that the edit distance between any two gene sequence segments is greater than or equal to a set threshold n. The threshold n is set to 2 here, and can be set to other values as required.
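The edit-distance criterion can be illustrated with a short sketch. It assumes Levenshtein distance (insertions, deletions and substitutions each cost 1) as the edit distance, and compares a candidate against an explicit list of other segments. A pairwise comparison like this is only practical for small inputs, and the function names are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_unique(candidate: str, others: Iterable[str], n: int = 2) -> bool:
    """True if every other segment is at edit distance >= n from the candidate."""
    return all(edit_distance(candidate, other) >= n for other in others)


print(is_unique("ACGT", ["ACGA"]))  # one substitution apart, below the threshold
print(is_unique("ACGT", ["TTTT"]))  # three substitutions apart
```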
It should be noted that the target information depends on the specific application of the quick alignment and positioning of gene sequence segments, and may comprise at least one of chromosome, chromosome position, GC content and species taxonomy. In the present embodiment, the target information specifically comprises: (1) the chromosome number of the gene sequence segment, one of {1, 2, ..., 22, X, Y}; (2) the start position of the gene sequence segment on the chromosome (counted from 0); (3) the GC content of the gene sequence segment (that is, the total number of bases G and C in the gene sequence segment).
In the present embodiment, the detailed steps of step 2) comprise:
2.1) taking one gene sequence segment from all extracted gene sequence segments as the current gene sequence segment;
2.2) establishing a key-value pair (Key, Value), with the current gene sequence segment as the Key part and the target information of the current gene sequence segment as the Value part, to describe the mapping between the current gene sequence segment and its target information;
2.3) encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence segment, and using the specified hash function to map the current gene sequence segment to database i (1≤i≤d) among d databases;
2.4) mapping the current gene sequence segment with the hash function to determine its target storage location in database i, and writing the encoded key-value pair (Key, Value) to that target storage location in database i;
2.5) judging whether all extracted gene sequence segments have been processed; if not, jumping to step 2.1); otherwise, database construction for the reference genome is complete.
In the present embodiment, the detailed steps in step 2.3) for mapping the current gene sequence segment to database i among the d databases comprise:
2.3.1) setting the number of databases d; in the present embodiment, d is 12;
2.3.2) taking the prefix substring of length m of the Key part of the current gene sequence segment, and using the specified hash function with formula (1) to compute the database number i of the current gene sequence segment among the d databases, thereby mapping the current gene sequence segment to database i among the d databases;
$$i = f(\mathrm{Key}_{[1:m]}) \bmod d \qquad (1)$$
In formula (1), $i$ is the database number of the current gene sequence segment among the d databases, $\mathrm{Key}_{[1:m]}$ is the prefix substring of length m of the Key part of the current gene sequence segment, $d$ is the preset number of databases, and $f$ is the specified hash function.
The specified hash function f maps all gene sequence segments into the d databases. In the present embodiment, f is the MurmurHash function, which distributes all gene sequence segments evenly across the d databases.
In the present embodiment, the detailed steps of step 3) comprise:
3.1) when a gene sequence segment to be matched needs to be quickly aligned and positioned, first computing the reverse complementary sequence of the gene sequence segment to be matched, encoding the gene sequence segment to be matched and its reverse complementary sequence separately, and then, from the prefix substring of length m of each encoded sequence, using the specified hash function with formula (1) to compute the database number i among the d databases (the numbering determined during database construction for the reference genome) for the gene sequence segment to be matched and for its reverse complementary sequence; in the present embodiment, the prefix substring length m is 3;
3.2) searching in parallel for the gene sequence segment to be matched and for its reverse complementary sequence, each in the target database given by its computed database number i; if either the gene sequence segment to be matched or its reverse complementary sequence finds a matching record in its target database, returning the target information in the Value part of the matched key-value pair as the alignment and positioning result; otherwise, if neither finds a matching record, returning a lookup-failure message.
In the present embodiment, computing the reverse complementary sequence of the gene sequence segment to be matched in step 3.1) specifically means first reversing the gene sequence segment to be matched, and then, in the reversed segment, exchanging base A with base T and base C with base G, to obtain the reverse complementary sequence. For example, for a 36-base gene sequence segment to be matched AC...ACGT, reversal gives TGCA...CA, and exchanging base A with base T and base C with base G then gives ACGT...GT.
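The reverse-then-exchange procedure can be sketched in a few lines. The function name `reverse_complement` is hypothetical, and the sketch assumes upper-case input containing only A, C, G and T.

```python
# Exchange table: A<->T and C<->G.
_EXCHANGE = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment: str) -> str:
    """Reverse the segment first, then exchange A with T and C with G."""
    reversed_segment = segment[::-1]
    return reversed_segment.translate(_EXCHANGE)


print(reverse_complement("AACG"))  # reversed: GCAA, after exchange: CGTT
print(reverse_complement("ACGT"))  # this segment equals its own reverse complement
```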
In the present embodiment, when the Key part and the Value part of the key-value pair of the current gene sequence segment are encoded in step 2.3), the Key part is encoded by successively replacing base A with 00, base C with 01, base G with 10 and base T with 11, and the Value part uses variable-length encoding. When the gene sequence segment to be matched is encoded in step 3.1), base A is likewise replaced with 00, base C with 01, base G with 10 and base T with 11. In the present embodiment, the encoded characters in step 2.3) are ASCII characters; after encoding, the Key part is 9 characters long and the Value part is between 3 and 6 characters long. When the gene sequence segment to be matched is encoded in step 3.1), the encoded Key part is also 9 characters long. When the Value part is variable-length encoded in step 2.3), the fields are encoded in the order chromosome number, GC content of the gene sequence segment, start position of the gene sequence segment on the chromosome. The chromosome number and the GC content each occupy one character, while the start position uses variable-length encoding: a position in the interval [0, 2^8 - 1) is encoded with one character, a position in [2^8, 2^16 - 1) with two characters, a position in [2^16, 2^24 - 1) with three characters, and a position in [2^24, 2^32 - 1) with four characters, which saves storage space.
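The variable-length Value encoding can be sketched as follows. This is a hedged illustration: mapping the chromosome to a list index, choosing the smallest byte count that holds the position, and using big-endian byte order are all assumptions of the sketch, and the function names are hypothetical.

```python
from typing import Tuple

# Hypothetical one-byte chromosome code: index into this list.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def encode_value(chrom: str, gc_count: int, start: int) -> bytes:
    """One byte for the chromosome, one for the GC count, 1 to 4 for the position."""
    n_bytes = max(1, (start.bit_length() + 7) // 8)
    if n_bytes > 4:
        raise ValueError("start position does not fit in four bytes")
    header = bytes([CHROMOSOMES.index(chrom), gc_count])
    return header + start.to_bytes(n_bytes, "big")


def decode_value(blob: bytes) -> Tuple[str, int, int]:
    """Inverse of encode_value: whatever follows the two header bytes is the position."""
    return CHROMOSOMES[blob[0]], blob[1], int.from_bytes(blob[2:], "big")


short = encode_value("1", 18, 200)          # small position: 3 bytes in total
long_ = encode_value("X", 10, 155_000_000)  # large position: 6 bytes in total
print(len(short), len(long_))
print(decode_value(long_))
```

No explicit length field is needed for the position because the Value is stored as a whole record, so the position simply occupies the remaining bytes.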
In the present embodiment, the data set of gene sequence segments to be matched is the sequencing result of a real human blood sample; the sequenced segment length is 36 bp and the number of sequence segments is 4,929,709. Rapid sequence alignment and positioning was performed with the method of the present invention, and the alignment statistics are shown in Table 2 below.
Table 2: Sequence alignment statistics for each chromosome.
Chromosome    Aligned sequences    Uniquely aligned sequences    GC content of uniquely aligned sequences
1             279880               277196                        0.40525
2             299817               296786                        0.390581
3             248322               245823                        0.385119
4             234336               231968                        0.369009
5             220284               218086                        0.383703
6             213813               211737                        0.384034
7             185199               183429                        0.392871
8             180296               178557                        0.389799
9             137702               136404                        0.403912
10            164105               162531                        0.404724
11            161240               159706                        0.407018
12            163420               161837                        0.394989
13            122829               121550                        0.372917
14            110550               109441                        0.397038
15            99286                98357                         0.409776
16            91390                90550                         0.437317
17            91902                91116                         0.444336
18            95782                94852                         0.387118
19            57913                57502                         0.477258
20            76343                75659                         0.432834
21            44167                43733                         0.402295
22            40754                40440                         0.47358
X             164658               163060                        0.382272
Y             481                  476                           0.394771
See Table 2: the first column is the chromosome number; the second column is the number of gene sequence segments that matched the database; the third column is the number of unique gene sequence segments among those that matched the database; the fourth column is the GC content of the unique gene sequence segments among those that matched the database, that is, the proportion of bases C and G in them. In the present embodiment, 3,484,469 gene sequence segments matched the database and 3,450,796 matched uniquely. Uniquely matched gene sequence segments account for 70% of all gene sequence segments to be matched, consistent with the proportion of reference genome bases covered by unique gene sequence segments given for Table 1.
On the basis of Table 2, the number of sequences aligned to each chromosome is normalized, GC correction and mappability correction are applied, and normal samples are chosen as the reference for analyzing chromosome copy number variation. A known CNV mutation was accurately found at the end of the q arm of chromosome 2 (Chr2); the CNV analysis result is shown in Fig. 2. In Fig. 2, the x coordinate is the position on chromosome 2 of 60K windows sliding at 5K intervals, and the y coordinate is log2ratio. A log2ratio that deviates significantly from 0 indicates that the sample of the present embodiment has a copy number variation at the corresponding chromosome position. The y coordinate log2ratio is computed as shown in formula (2):
$$\mathrm{log2ratio} = \log_2 \frac{x_{60K}}{\bar{x}_{60K}} \qquad (2)$$
In formula (2), $x_{60K}$ is the normalized number of gene sequence segments uniquely aligned to each 60K window of chromosome 2 in the sample data of the present embodiment, and $\bar{x}_{60K}$ is the mean of the normalized numbers of gene sequence segments uniquely aligned to each 60K window of chromosome 2 in the normal sample data.
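The sliding-window counting and formula (2) can be sketched as below. The sketch assumes that 60K and 5K mean 60,000 and 5,000 bases, that the inputs to `log2_ratio` have already been normalized and corrected (the normalization itself is not shown), and that the function names are hypothetical.

```python
import math
from bisect import bisect_left
from typing import List


def window_counts(
    starts: List[int], chrom_length: int, window: int = 60_000, step: int = 5_000
) -> List[int]:
    """Count aligned segment start positions in each sliding window [left, left + window)."""
    ordered = sorted(starts)
    counts = []
    for left in range(0, chrom_length - window + 1, step):
        inside = bisect_left(ordered, left + window) - bisect_left(ordered, left)
        counts.append(inside)
    return counts


def log2_ratio(sample_value: float, reference_mean: float) -> float:
    """Formula (2): log2 of the sample window value over the normal-sample mean."""
    return math.log2(sample_value / reference_mean)


# Toy numbers (assumption): a 20-base chromosome with 10-base windows and step 5.
print(window_counts([0, 5, 12], chrom_length=20, window=10, step=5))
print(log2_ratio(2.0, 1.0))  # twice the reference value
print(log2_ratio(0.5, 1.0))  # half the reference value
```

Values near 0 indicate a copy number similar to the normal samples, while sustained positive or negative runs across neighbouring windows suggest a gain or a loss.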
For the 36 bp sequences of the present embodiment, the throughput of the quick alignment algorithm is 1 million sequences per second. Compared with traditional general-purpose alignment software based on BWT, the time-consuming Smith-Waterman alignment step is eliminated and alignment efficiency improves significantly.
Embodiment two:
Besides CNV analysis, the quick comparing and positioning method for gene sequence segments on a reference genome of the present invention can also be used for species identification. In the present embodiment, the reference genome is a cross-species reference genome. When the database for the reference genome is built in step 2), the reference genome consists of the human reference genome and all bacterial and viral genome sequences downloaded from the NCBI RefSeq database (http://www.ncbi.nlm.nih.gov/refseq/). The 31mer mappability map of the (human, bacterial, viral) reference sequences is computed with the gem-mappability software, and the unique sequence segments with mappability 1 are selected from it. The detailed taxonomy information of the species is downloaded from the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy). For each gene sequence segment on the reference genome, a key-value pair is established with the gene sequence segment as the Key part and the target information of the gene sequence segment (species taxonomy) as the Value part; the gene sequence segment is mapped with a hash function to determine the target storage location in the database, and the key-value pair is written to the target storage location, thereby completing database construction for the reference genome. Sequencing data of normal human fecal samples are downloaded from the HMP website (http://hmpdacc.org/) to analyze the composition of human intestinal microorganisms. Step 3) is applied to the gene sequence segments to be matched in these data to analyze the microbial composition at genus level by rapid sequence alignment; the final result is shown in Table 3.
Table 3: Microbial composition analysis results.
Genus          Classified sequences    Normalized classified sequences    Percentage (%)
Bacteroides    3345216                 153556.7                           15.36
Eubacterium    1597681                 73338.94                           7.33
Roseburia      224359                  10298.83                           1.03
Alistipes      197067                  9046.04                            0.9
Odoribacter    174594                  8014.45                            0.8
Others         505842                  23219.85                           2.32
As shown in Table 3, the first column is the microorganism name at the genus level; the second column is the number of gene sequence fragments aligned to the corresponding genus; the third column is the number of sequence fragments aligned to the corresponding genus per 1,000,000 sequence fragments; and the fourth column is the normalized number of classified sequences expressed as a percentage of 1,000,000 sequence fragments. The fourth column measures the abundance of a microorganism in the sample: the higher the value, the larger the proportion of the corresponding microorganism in the sample. Finally, the species abundance in the analyzed sample is determined from the unique sequence fragments located to the different species, and this analysis result is consistent with the analysis result provided by the HMP website. With the quick alignment and localization method of the present embodiment, the reference genome and the target information can be highly customized for a specific application, which makes the method flexible. Customizing and simplifying the reference genome greatly improves the efficiency of sequence alignment. The method accelerates alignment analysis without sacrificing the accuracy of the analysis, is well suited to large-scale data analysis and to application scenarios with demanding real-time requirements, and has broad application prospects.
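The relationship between the third and fourth columns of Table 3 can be checked with a short calculation. The sketch below is an illustration only: the function names are hypothetical, and because the total number of sequence fragments in the sample is not stated in the text, `per_million` takes it as a parameter instead of assuming a value.

```python
def per_million(count, total_fragments):
    """Normalize a raw count to fragments per 1,000,000 fragments."""
    return count * 1_000_000 / total_fragments

def percent_of_million(normalized):
    """Express a per-million value as a percentage of 1,000,000."""
    return normalized / 1_000_000 * 100

# Third-column values taken from Table 3.
normalized_counts = {
    "Bacteroides": 153556.7,
    "Eubacterium": 73338.94,
    "Roseburia": 10298.83,
}
percentages = {genus: round(percent_of_million(value), 2)
               for genus, value in normalized_counts.items()}
```

Rounded to two decimals, these values agree with the percentages listed in the fourth column for the same genera.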
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments; all technical solutions that fall under the concept of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for quick alignment and localization of gene sequence fragments on a reference genome, characterized in that the steps comprise:
1) extracting gene sequence fragments from the reference genome;
2) for each gene sequence fragment on the reference genome, building a key-value pair with the gene sequence fragment as the Key part and the target information of the gene sequence fragment as the Value part, mapping the gene sequence fragment with a hash function to determine a target storage location in a database, and writing the key-value pair to the target storage location, thereby completing the library building for the reference genome;
3) when a gene sequence fragment to be matched needs to be quickly aligned and located, first mapping the gene sequence fragment to be matched with the hash function to search for the target storage location in the database; if the search succeeds, reading from the target storage location the target information of the gene sequence fragment in the key-value pair corresponding to the matched gene sequence fragment; otherwise, returning lookup failure information.
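The three steps of claim 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names are hypothetical, a Python `dict` (which is itself a hash table) stands in for the database, the target information is reduced to the fragment position, and the reference string is a toy example.

```python
def extract_fragments(reference, length):
    """Step 1: yield (fragment, position) for every window of `length` bases."""
    for pos in range(len(reference) - length + 1):
        yield reference[pos:pos + length], pos

def build_database(reference, length):
    """Step 2: write (Key, Value) pairs; the dict hashes the key to find its slot."""
    database = {}
    for fragment, pos in extract_fragments(reference, length):
        # A repeated fragment would overwrite the earlier entry here;
        # claims 2 and 3 restrict the library to unique fragments.
        database[fragment] = pos
    return database

def locate(database, fragment):
    """Step 3: return the target information, or None on lookup failure."""
    return database.get(fragment)

database = build_database("ACGTTGCA", 4)  # toy reference, hypothetical
```

A lookup costs one hash computation and one probe on average, independent of the size of the reference, which is the property the method relies on.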
2. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 1, characterized in that the detailed steps of step 1) comprise:
1.1) setting the length L of the gene sequence fragments;
1.2) computing the positions and target information of the unique gene sequence fragments on the reference genome;
1.3) extracting each gene sequence fragment and its target information according to the position of the gene sequence fragment.
3. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 2, characterized in that a unique gene sequence fragment in step 1.2) specifically means that the edit distance between any two gene sequence fragments is greater than or equal to a set threshold n.
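The uniqueness criterion of claim 3 can be illustrated with a standard Levenshtein edit distance. The sketch below is an assumption-laden illustration: the function names are hypothetical, the all-pairs comparison is quadratic and only suitable for toy inputs, and the reading that a fragment is kept when its distance to every other fragment is at least n is one interpretation of the claim wording.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def select_unique(fragments, n):
    """Keep fragments whose edit distance to every other fragment is >= n."""
    return [f for i, f in enumerate(fragments)
            if all(edit_distance(f, g) >= n
                   for j, g in enumerate(fragments) if j != i)]

unique = select_unique(["ACGT", "ACGA", "TTTT"], 2)  # toy fragments
```

In the toy input, the first two fragments differ by a single substitution and are both discarded, while the third is kept.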
4. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 3, characterized in that the target information comprises at least one of chromosome, chromosome position, GC content, and species taxonomy.
5. The method for quick alignment and localization of gene sequence fragments on a reference genome according to any one of claims 1 to 4, characterized in that the detailed steps of step 2) comprise:
2.1) taking one gene sequence fragment from all the extracted gene sequence fragments as the current gene sequence fragment;
2.2) building a key-value pair (Key, Value) with the current gene sequence fragment as the Key part and the target information of the current gene sequence fragment as the Value part, to describe the mapping between the current gene sequence fragment and its target information;
2.3) encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment, and using a specified hash function to map the current gene sequence fragment to database i among d databases;
2.4) mapping the current gene sequence fragment with the hash function to determine its target storage location in database i, and writing the encoded key-value pair (Key, Value) to that target storage location in database i;
2.5) determining whether all the extracted gene sequence fragments have been processed; if not, jumping back to step 2.1); otherwise, determining that the library building for the reference genome is complete.
6. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 5, characterized in that the detailed steps of mapping the current gene sequence fragment to database i among d databases in step 2.3) comprise:
2.3.1) setting the number of databases d;
2.3.2) taking the prefix substring of length m of the Key part of the current gene sequence fragment, and using the specified hash function to compute, according to formula (1), the database number i corresponding to the current gene sequence fragment among the d databases, thereby mapping the current gene sequence fragment to database i among the d databases;
i = f(Key[1:m]) % d    (1)
In formula (1), i is the database number corresponding to the current gene sequence fragment among the d databases, Key[1:m] is the prefix substring of length m of the Key part of the current gene sequence fragment, d is the preset number of databases, and f is the specified hash function.
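Formula (1) can be sketched directly. In this illustration the function name `database_index` is hypothetical, and `zlib.crc32` from the Python standard library is used only as a stand-in for the specified hash function f; it is not the hash function named in the claims.

```python
import zlib

def database_index(key, m, d):
    """Compute i = f(Key[1:m]) % d, with zlib.crc32 standing in for f."""
    prefix = key[:m]  # Key[1:m] in the 1-based notation of formula (1)
    return zlib.crc32(prefix.encode("ascii")) % d

# Two keys that share the same prefix of length m land in the same database.
index_a = database_index("ACGTAAAA", m=4, d=8)
index_b = database_index("ACGTCCCC", m=4, d=8)
```

Because only the prefix is hashed, the database number of a query can be computed before the full key is processed, and all keys sharing a prefix are grouped in one database.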
7. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 6, characterized in that the detailed steps of step 3) comprise:
3.1) when a gene sequence fragment to be matched needs to be quickly aligned and located, first computing the reverse complement of the gene sequence fragment to be matched, encoding the gene sequence fragment to be matched and its reverse complement separately, and then, for each of them, applying the specified hash function to the prefix substring of length m of the encoded sequence and using formula (1) to compute the database number i corresponding to the gene sequence fragment to be matched and to its reverse complement among the d databases;
3.2) searching in parallel for the gene sequence fragment to be matched and for its reverse complement in the target databases corresponding to their computed database numbers i; if either the gene sequence fragment to be matched or its reverse complement finds a matching record in its target database, returning the target information in the Value part of the matching key-value pair as the result of alignment and localization; otherwise, if neither the gene sequence fragment to be matched nor its reverse complement finds a matching record in its target database, returning lookup failure information.
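The two-strand lookup of claim 7 can be sketched as below. This is only an illustration under several assumptions: all function names are hypothetical, `zlib.crc32` again stands in for the specified hash function, each database is a plain `dict`, the encoding step is skipped, and the two-thread pool merely illustrates the idea of searching both strands concurrently (it does not provide a real speed-up in CPython for such small lookups).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def shard_of(key, m, d):
    """Database number from formula (1); crc32 is a stand-in for f."""
    return zlib.crc32(key[:m].encode("ascii")) % d

def build_shards(pairs, m, d):
    shards = [dict() for _ in range(d)]
    for key, value in pairs:
        shards[shard_of(key, m, d)][key] = value
    return shards

def locate(shards, read, m):
    """Search the read and its reverse complement; None means lookup failure."""
    d = len(shards)
    candidates = (read, reverse_complement(read))
    with ThreadPoolExecutor(max_workers=2) as pool:
        hits = list(pool.map(lambda s: shards[shard_of(s, m, d)].get(s), candidates))
    return next((hit for hit in hits if hit is not None), None)

# Toy library with a single fragment and hypothetical target information.
shards = build_shards([("AACG", ("chr1", 10))], m=2, d=4)
```

Here `"CGTT"` is the reverse complement of the stored key `"AACG"`, so a read sequenced from the opposite strand is still located.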
8. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 7, characterized in that computing the reverse complement of the gene sequence fragment to be matched in step 3.1) specifically means first reversing the gene sequence fragment to be matched, and then, in the reversed gene sequence fragment, exchanging base A with base T and exchanging base C with base G, to obtain the reverse complement of the gene sequence fragment to be matched.
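The two operations named in claim 8, reversal followed by base exchange, can be written out explicitly. The function name is hypothetical and the input is assumed to contain only the bases A, C, G, and T.

```python
def reverse_complement(fragment):
    """Reverse the fragment, then exchange A with T and C with G."""
    reversed_fragment = fragment[::-1]
    swap = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(swap[base] for base in reversed_fragment)

result = reverse_complement("AACG")
```

Applying the function twice returns the original fragment, which is a convenient sanity check.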
9. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 8, characterized in that encoding the Key part and the Value part of the key-value pair (Key, Value) of the current gene sequence fragment in step 2.3) specifically means: for the Key part, encoding by successively replacing base A with 00, base C with 01, base G with 10, and base T with 11; for the Value part, using variable-length coding; and encoding the gene sequence fragment to be matched in step 3.1) specifically means: encoding by successively replacing base A with 00, base C with 01, base G with 10, and base T with 11.
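The 2-bit Key encoding of claim 9 can be sketched as follows. The function names are hypothetical. The claim does not say which variable-length code is used for the Value part, so the LEB128-style varint shown here is purely an assumption chosen for illustration.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def encode_key(fragment):
    """Pack a fragment into an integer, two bits per base, first base highest."""
    value = 0
    for base in fragment:
        value = (value << 2) | CODE[base]
    return value

def decode_key(value, length):
    """Inverse of encode_key; the length is needed because leading A's are zeros."""
    return "".join(BASES[(value >> (2 * (length - 1 - i))) & 0b11]
                   for i in range(length))

def encode_varint(n):
    """Assumed variable-length code: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

packed = encode_key("ACGT")  # bits 00 01 10 11
```

With this encoding a 31-base key fits in 62 bits, and small Value fields such as low chromosome positions occupy fewer bytes than large ones.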
10. The method for quick alignment and localization of gene sequence fragments on a reference genome according to claim 9, characterized in that the specified hash function is the MurmurHash function.
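Claim 10 names MurmurHash without stating a variant. As an illustration only, the sketch below implements the publicly documented 32-bit MurmurHash3 in pure Python; the choice of this variant, the seed parameter, and the function name `murmur3_32` are assumptions, not details taken from the patent.

```python
MASK32 = 0xFFFFFFFF

def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_32(data, seed=0):
    """32-bit MurmurHash3 of a bytes object (assumed variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    nblocks = len(data) // 4
    # Body: mix each 4-byte little-endian block into the state.
    for block in range(nblocks):
        k = int.from_bytes(data[4 * block:4 * block + 4], "little")
        k = (_rotl32((k * c1) & MASK32, 15) * c2) & MASK32
        h = (_rotl32(h ^ k, 13) * 5 + 0xE6546B64) & MASK32
    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[4 * nblocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (_rotl32((k * c1) & MASK32, 15) * c2) & MASK32
        h ^= k
    # Finalization: force all bits of the state to avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

database_number = murmur3_32(b"ACGT") % 8  # usage in the style of formula (1)
```

The hash is non-cryptographic and cheap to compute, which suits a lookup that must be performed for every query fragment.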
CN201510648108.6A 2015-10-09 2015-10-09 Quick comparing and positioning method for gene sequence segments on reference genome Pending CN105243297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510648108.6A CN105243297A (en) 2015-10-09 2015-10-09 Quick comparing and positioning method for gene sequence segments on reference genome

Publications (1)

Publication Number Publication Date
CN105243297A true CN105243297A (en) 2016-01-13

Family

ID=55040943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510648108.6A Pending CN105243297A (en) 2015-10-09 2015-10-09 Quick comparing and positioning method for gene sequence segments on reference genome

Country Status (1)

Country Link
CN (1) CN105243297A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971031A (en) * 2014-05-04 2014-08-06 南京师范大学 Read positioning method oriented to large-scale gene data
US20150234842A1 (en) * 2014-02-14 2015-08-20 Srinivasan Kumar Mapping of Extensible Datasets to Relational Database Schemas

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
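危扬: "Research on quantitative design of high-yield crop plant types based on a light model", China Master's Theses Full-text Database, Agricultural Science and Technology *
李航: "Design and implementation of an MPI-based parallel DNA sequence alignment system", China Master's Theses Full-text Database, Basic Sciences *
王立坤: "Processing and application of RNA-seq data", China Doctoral Dissertations Full-text Database, Basic Sciences *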

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295250A (en) * 2016-07-28 2017-01-04 北京百迈客医学检验所有限公司 Method and device is analyzed in the quick comparison of the short sequence of secondary order-checking
CN106295250B (en) * 2016-07-28 2019-03-29 北京百迈客医学检验所有限公司 Short sequence quick comparison analysis method and device was sequenced in two generations
CN107180166A (en) * 2017-04-21 2017-09-19 北京希望组生物科技有限公司 A kind of full-length genome structure variation analysis method and system being sequenced based on three generations
WO2019023978A1 (en) * 2017-08-02 2019-02-07 深圳市瀚海基因生物科技有限公司 Alignment method, device and system
US11482304B2 (en) 2017-08-02 2022-10-25 Genemind Biosciences Company Limited Alignment methods, devices and systems
CN110021368B (en) * 2017-10-20 2020-07-17 人和未来生物科技(长沙)有限公司 Comparison type gene sequencing data compression method, system and computer readable medium
CN110021368A (en) * 2017-10-20 2019-07-16 人和未来生物科技(长沙)有限公司 Comparison type gene sequencing data compression method, system and computer-readable medium
CN110875084A (en) * 2018-08-13 2020-03-10 深圳华大基因科技服务有限公司 Nucleic acid sequence comparison method
CN110875084B (en) * 2018-08-13 2022-06-21 深圳华大基因科技服务有限公司 Nucleic acid sequence comparison method
CN109584965A (en) * 2018-12-21 2019-04-05 龙口味美思环保科技有限公司 A kind of staggered form genome recombination sort method of GM food
CN109698011A (en) * 2018-12-25 2019-04-30 人和未来生物科技(长沙)有限公司 Indel regional correction method and system based on short sequence alignment
CN109698011B (en) * 2018-12-25 2020-10-23 人和未来生物科技(长沙)有限公司 Indel region correction method and system based on short sequence comparison
CN110060731A (en) * 2019-04-12 2019-07-26 福建师范大学 Determine that overlapping genes are to the method for quantity between genome based on distributed computing
CN110517728A (en) * 2019-08-29 2019-11-29 苏州浪潮智能科技有限公司 A kind of gene order comparison method and device
CN110517728B (en) * 2019-08-29 2022-04-29 苏州浪潮智能科技有限公司 Gene sequence comparison method and device
CN110782946A (en) * 2019-10-17 2020-02-11 南京医基云医疗数据研究院有限公司 Method and device for identifying repeated sequence, storage medium and electronic equipment
CN111063394A (en) * 2019-12-13 2020-04-24 人和未来生物科技(长沙)有限公司 Species rapid searching and database building method, system and medium based on gene sequence
CN111063394B (en) * 2019-12-13 2023-07-11 人和未来生物科技(长沙)有限公司 Method, system and medium for quickly searching and constructing library of species based on gene sequence
WO2022048284A1 (en) * 2020-09-04 2022-03-10 苏州浪潮智能科技有限公司 Hash table lookup method, apparatus, and device for gene comparison, and storage medium
CN112700819B (en) * 2020-12-31 2021-11-30 云舟生物科技(广州)有限公司 Gene sequence processing method, computer storage medium and electronic device
CN112700819A (en) * 2020-12-31 2021-04-23 云舟生物科技(广州)有限公司 Gene sequence processing method, computer storage medium and electronic device
CN113012755A (en) * 2021-04-12 2021-06-22 聊城大学 Genome ATCG search method
CN113012755B (en) * 2021-04-12 2023-10-27 聊城大学 Genome ATCG searching method
CN115862740A (en) * 2022-12-06 2023-03-28 中国人民解放军军事科学院军事医学研究院 Rapid distributed multi-sequence comparison method for large-scale virus genome data
CN115862740B (en) * 2022-12-06 2023-09-12 中国人民解放军军事科学院军事医学研究院 Rapid distributed multi-sequence comparison method for large-scale virus genome data
CN115862735A (en) * 2022-12-28 2023-03-28 郑州思昆生物工程有限公司 Nucleic acid sequence detection method, nucleic acid sequence detection device, computer equipment and storage medium
CN115862735B (en) * 2022-12-28 2024-02-27 郑州思昆生物工程有限公司 Nucleic acid sequence detection method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160113