CN106295250B - Rapid alignment analysis method and device for short reads from second-generation sequencing - Google Patents

Info

Publication number
CN106295250B
Authority
CN
China
Prior art keywords
value
dna
index
short sequence
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610609337.1A
Other languages
Chinese (zh)
Other versions
CN106295250A (en)
Inventor
郑洪坤
郭强
许德德
马威锋
孙乔慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Pukang Ruiren Medical Laboratory Co., Ltd.
Original Assignee
Beijing Hundred Medical Laboratory Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Hundred Medical Laboratory Ltd
Priority to CN201610609337.1A
Publication of CN106295250A
Application granted
Publication of CN106295250B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The present invention discloses a rapid alignment analysis method and device for short reads from second-generation sequencing, which address the low alignment efficiency and high memory usage of sequencing-data alignment. The method comprises: obtaining short DNA sequences produced by sequencing, and mapping each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively; aligning the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structs, each unit struct contains a value field and an index2 field, the array offset at which each unit struct is stored is the corresponding index1, i.e. the index into the struct array, and K is the fragment length; and, according to the alignment result, if the alignment succeeds, obtaining the value of the K-mer fragment matched by the short DNA sequence and determining from it the chromosome number to which the short DNA sequence belongs and the site on that chromosome.

Description

Rapid alignment analysis method and device for short reads from second-generation sequencing
Technical field
The invention belongs to the field of bioinformatics engineering and relates to bioinformatics technology and applied computer technology; specifically, it relates to a rapid alignment analysis method for short reads from second-generation DNA sequencing.
Background
DNA sequencing plays the most basic and most widespread role in interpreting the genetic code of living species. Sequencing techniques for DNA were reported as early as the discovery of the DNA double helix, but the process was overly complex. Shortly afterwards, in 1977, Sanger invented the chain-termination sequencing method, which was a milestone. As bioinformatics advanced, Sanger sequencing could no longer meet research needs, and second-generation sequencing technology, with lower cost, higher throughput and faster speed, emerged. Its core idea is sequencing by synthesis. DNA sequence alignment compares the short DNA sequences (reads) obtained by sequencing with a reference genome; it is commonly used to analyze species similarity and homology, and it also allows gene information to be mined and analyzed by computer.
At present, techniques for aligning second-generation sequencing reads have developed in two main directions. The first designs a hash mapping and builds a hash table index; its advantages are fast alignment and high accuracy, but its memory and CPU usage are relatively high. Representative software includes SOAP and MAQ. The second builds a suffix-tree query data structure based on the BWT (Burrows-Wheeler transform): the genome sequence is first cyclically shifted, then sorted, and then compressed with BWT to build an index. During alignment, lookup and backtracking are used to locate the position of a sequence. This technique requires the index file to be built in advance in separate steps, and its alignment speed does not match that of a hash table structure with near-constant average time. It can generally perform conventional alignment of all sequencing data, but compared with a hash structure the tree structure occupies a large amount of memory, and its query speed is far below that of a hash table index with constant average time.
Summary of the invention
In view of this, the present invention provides a rapid alignment analysis method and device for short reads from second-generation sequencing, to solve the problems of low alignment efficiency and high memory usage for sequencing data.
In one aspect, an embodiment of the present invention provides a rapid alignment analysis method for short reads from second-generation sequencing, comprising:
S1: obtaining short DNA sequences produced by sequencing, and mapping each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
S2: aligning the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structs, each unit struct contains a value field and an index2 field, the value is a compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
S3: according to the alignment result, if the alignment succeeds, obtaining the value of the K-mer fragment matched by the short DNA sequence, and determining from that value the chromosome number to which the short DNA sequence belongs and the site on that chromosome.
Preferably, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
Preferably, S2 comprises:
for each short DNA sequence, searching for the memory block whose first address is the first index of the short DNA sequence; if such a memory block is found, judging whether the memory block holds exactly one data storage location;
if it holds exactly one, determining that the short DNA sequence matches the K-mer fragment corresponding to the struct stored at that data storage location.
Preferably, the method further comprises:
if it does not hold exactly one, starting from the first address and using shifted linear probing, obtaining the index2 of the struct stored at the current address, and judging whether that index2 is identical to the second index of the short DNA sequence;
if the index2 is identical to the second index of the short DNA sequence, determining that the short DNA sequence matches the K-mer fragment corresponding to the struct stored at the current address.
Preferably, the load factor of the index query library is 0.607.
Preferably, before S2, the method further comprises:
S40: according to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of short DNA sequences historically aligned to each of them;
S41: following the order given by the sorting result, cutting the current chromosome with a sliding window to obtain consecutive K-mer fragments, and recording the position information of each fragment;
S42: removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
S43: for each K-mer fragment obtained, mapping the K-mer fragment with the first hash algorithm and the second hash algorithm to obtain index1 and index2, respectively, computing the value that compresses the position-information digest of the K-mer fragment, and generating a struct containing the value and index2;
S44: building the index query library with index1 as the index.
Preferably, when the short DNA sequence is aligned against the reference genome, multithreading is used to align forward-strand and reverse-strand short sequences in parallel.
In another aspect, an embodiment of the present invention provides a rapid alignment analysis device for short reads from second-generation sequencing, comprising:
a mapping unit, configured to obtain short DNA sequences produced by sequencing and to map each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
an alignment unit, configured to align the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structs, each unit struct contains a value field and an index2 field, the value is a compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
a computing unit, configured to, according to the alignment result and if the alignment succeeds, obtain the value of the K-mer fragment matched by the short DNA sequence, and determine from that value the chromosome number to which the short DNA sequence belongs and the site on that chromosome.
The invention has the following beneficial effects:
1. Hash algorithms are used, so alignment is fast, close to constant time O(1), which improves the time efficiency of the project product;
2. The two-level hash mapping technique makes it unnecessary to store the reads directly, which saves a large amount of memory and improves the resource efficiency of the project product;
3. The hash index is reordered using a prior data model so that reads with a high frequency of occurrence are aligned first, which speeds up alignment;
4. Multithreading is used to execute alignment tasks in parallel, which shortens alignment time.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the rapid alignment analysis method for short reads from second-generation sequencing according to the invention;
Fig. 2 is a time performance chart of the XDDHash algorithm used by the invention for the first hash mapping;
Fig. 3 is a collision-count performance chart of the XDDHash algorithm used by the invention for the first hash mapping;
Fig. 4 is a time performance chart of the MWFHash algorithm used by the invention for the second hash mapping;
Fig. 5 is a collision-count performance chart of the MWFHash algorithm used by the invention for the second hash mapping;
Fig. 6 is a chart of the relationship between the load factor of the hash table constructed by the invention and the average number of probes;
Fig. 7 is a chart of the relationship between the data volume of the aligned samples and the time taken, after the method of the invention is implemented as a software program product;
Fig. 8 is a structural diagram of one embodiment of the rapid alignment analysis device for short reads from second-generation sequencing according to the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly below with reference to the drawings. The described embodiments are evidently some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, this embodiment discloses a rapid alignment analysis method for short reads from second-generation sequencing, comprising:
S1: obtaining short DNA sequences produced by sequencing, and mapping each short DNA sequence with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
S2: aligning the short DNA sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structs, each unit struct contains a value field and an index2 field, the value is a compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
S3: according to the alignment result, if the alignment succeeds, obtaining the value of the K-mer fragment matched by the short DNA sequence, and determining from that value the chromosome number to which the short DNA sequence belongs and the site on that chromosome.
The position information of a query sequence is obtained as follows:
Obtaining the memory index: the first hash algorithm and the second hash algorithm convert the sequence information in the sample into the first index and the second index, from which the value stored in memory is obtained.
Obtaining the sequence position:
Chromosome number of the short DNA sequence: K-mer_number = value % 26 (modulo operation, i.e. the remainder);
Site on that chromosome: K-mer_location = value / 26 (integer division).
In this embodiment of the invention, a two-level hash scheme is used: the result of the first hash mapping is used for query positioning, and the result of the second hash mapping is used to discriminate between collisions. This speeds up alignment and allows alignment results to be counted without directly storing the large volume of read sequence information, which reduces wasted memory.
Optionally, in another embodiment of the rapid alignment analysis method of the invention, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
The XDDHash compression algorithm performs the first mapping and computes the digest of the index key. Its recurrence is:
hash[i+1] = (hash[i] * seed + str[i]) & 0x00000000FFFFFFFF (i = 0, 1, 2, ..., N-2)
where hash[0] is set to 63, the multiplication factor seed is 33, str[i] is the ASCII value of the (i+1)-th base character of the read, and N is the length of the read to be aligned. P & 0x00000000FFFFFFFF denotes taking the low 32 bits of P. The recurrence is iterated until hash[N-1] is obtained, which is the first index.
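The recurrence above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the patentee's implementation: the function name xdd_hash is hypothetical, and the loop follows the stated index range i = 0..N-2 literally, which means the last character str[N-1] does not enter hash[N-1]; an implementation that folds in all N bases would iterate one step further.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First-level hash following the recurrence in the text:
 * hash[0] = 63, seed = 33, low 32 bits kept after every step.
 * The name xdd_hash is hypothetical. */
uint32_t xdd_hash(const char *str, size_t n)
{
    assert(str != NULL);
    uint64_t h = 63;                       /* hash[0] */
    /* i runs over 0..N-2 as stated, so the loop ends with hash[N-1]. */
    for (size_t i = 0; i + 1 < n; i++) {
        h = (h * 33u + (unsigned char)str[i]) & 0x00000000FFFFFFFFULL;
    }
    return (uint32_t)h;
}
```

For the two-character input "AC" the single step gives 63 * 33 + 65 = 2144.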
Theory and program testing show that the algorithm performs well: it produces few collisions and its compression mapping is fast. Its performance statistics are shown in Figs. 2 and 3. Fig. 2 is the time performance chart of the XDDHash algorithm used for the first hash mapping and shows how its running time varies with the number of reads; Fig. 3 is the collision-count performance chart of the XDDHash algorithm used for the first hash mapping and shows how the number of collisions it produces varies with the number of reads.
The MWFHash compression algorithm performs the second mapping and is used to separate colliding index values. It applies different shift factors at odd and even positions, as follows.
Odd-position recurrence (when i is odd):
hash[i+1] = (hash[i] * seedodd + i * str[i]) & 0x00000000FFFFFFFF
Even-position recurrence (when i is even):
hash[i+1] = (hash[i] * seedeven + i * str[i]) & 0x00000000FFFFFFFF
where hash[0] is set to 0, the odd shift factor seedodd is 5, and the even shift factor seedeven is 7. The recurrence is iterated to obtain hash[N-1], which is the second index.
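A minimal C sketch of the odd/even recurrence is given below. It assumes that i covers the same range 0..N-2 as in the first-level hash (the text does not restate the range here) and that the parity test applies to the loop index i; the function name mwf_hash is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Second-level hash following the recurrence in the text:
 * hash[0] = 0, factor 5 when i is odd, factor 7 when i is even,
 * low 32 bits kept after every step. The name mwf_hash is hypothetical. */
uint32_t mwf_hash(const char *str, size_t n)
{
    assert(str != NULL);
    uint64_t h = 0;                        /* hash[0] */
    for (size_t i = 0; i + 1 < n; i++) {   /* assumed range: i = 0..N-2 */
        uint64_t seed = (i % 2 == 1) ? 5u : 7u;
        h = (h * seed + (uint64_t)i * (unsigned char)str[i])
            & 0x00000000FFFFFFFFULL;
    }
    return (uint32_t)h;
}
```

Because the term i * str[i] vanishes at i = 0, the first base contributes nothing under this literal reading; the position weighting only takes effect from the second base onward.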
Theory and program testing show that the algorithm performs well: it produces few collisions and its compression mapping is fast. Its performance statistics are shown in Figs. 4 and 5. Fig. 4 is the time performance chart of the MWFHash algorithm used for the second hash mapping and shows how its running time varies with the number of reads; Fig. 5 is the collision-count performance chart of the MWFHash algorithm used for the second hash mapping and shows how the number of collisions it produces varies with the number of reads.
In this embodiment of the invention, when a hash algorithm encodes and maps a read, the encoding directly keeps the low 32 bits of the result, which reduces the time complexity of the algorithm.
Optionally, in another embodiment of the rapid alignment analysis method of the invention, S2 comprises:
for each short DNA sequence, searching for the memory block whose first address is the first index of the short DNA sequence; if such a memory block is found, judging whether the memory block holds exactly one data storage location;
if it holds exactly one, determining that the short DNA sequence matches the K-mer fragment corresponding to the struct stored at that data storage location.
Optionally, another embodiment of the rapid alignment analysis method of the invention further comprises:
if it does not hold exactly one, starting from the first address and using shifted linear probing, obtaining the index2 of the struct stored at the current address, and judging whether that index2 is identical to the second index of the short DNA sequence;
if the index2 is identical to the second index of the short DNA sequence, determining that the short DNA sequence matches the K-mer fragment corresponding to the struct stored at the current address.
In this embodiment of the invention, an improved linear probing method resolves hash collisions according to the characteristics of the hash values of the sample short sequences. Specifically, probing downward from an index requires only incrementing the index, which reduces the time complexity of the search during alignment.
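The probing step can be sketched in C as below. This is a simplified illustration, not the patented lookup: it always compares index2 rather than short-circuiting when a block holds a single entry, it marks unused slots with a hypothetical sentinel EMPTY_VALUE, and it wraps around modulo the table size so that the example works with a small table. The function name probe_lookup is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_VALUE 0xFFFFFFFFu   /* hypothetical marker for an unused slot */

typedef struct Node {
    unsigned int value;
    unsigned int index2;
} HashType;

/* Start at the slot given by index1 and step forward by one until index2
 * matches or an empty slot is reached. Returns 1 and writes *value_out on
 * a hit, 0 on a miss. */
int probe_lookup(const HashType *table, size_t table_size,
                 uint32_t index1, uint32_t index2, unsigned int *value_out)
{
    assert(table != NULL && value_out != NULL && table_size > 0);
    size_t slot = index1 % table_size;
    for (size_t step = 0; step < table_size; step++) {
        if (table[slot].value == EMPTY_VALUE) {
            return 0;                       /* chain ended without a match */
        }
        if (table[slot].index2 == index2) {
            *value_out = table[slot].value;
            return 1;
        }
        slot = (slot + 1) % table_size;     /* probing is a plain increment */
    }
    return 0;
}
```

The only work per probe is one comparison and one increment, which is the property the paragraph above relies on.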
Optionally, in another embodiment of the rapid alignment analysis method of the invention, the load factor of the index query library is 0.607.
The index query library is essentially a hash table. The hash table load factor was chosen as follows:
Load factor: the ratio of the number of records actually stored in the hash table to the size of the hash table; it is an important measure of hash table performance. The larger its value, the more hash collisions occur and the slower queries on the hash table become; if its value is too small, memory is wasted. Using a test program based on the XDDHash and MWFHash algorithms, we measured the relationship between the load factor and the average number of probes (more probes means more time), shown in Fig. 6, which reflects the query efficiency of the constructed table. The load factor chosen for the invention is 0.607, at which the average number of lookups is only about 3, close to constant-time lookup.
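As a rough cross-check of the trade-off described above, the sketch below computes a load factor and evaluates the classical textbook estimates for linear probing under the uniform-hashing assumption: about (1 + 1/(1-a))/2 probes for a successful search and (1 + 1/(1-a)^2)/2 for an unsuccessful one. These formulas are not taken from the patent, whose figure of about 3 probes is an empirical measurement from Fig. 6; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio of stored records to table slots. */
double load_factor(uint64_t records, uint64_t slots)
{
    assert(slots > 0);
    return (double)records / (double)slots;
}

/* Textbook estimate of probes per successful search with linear probing. */
double expected_probes_hit(double alpha)
{
    assert(alpha >= 0.0 && alpha < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

/* Textbook estimate of probes per unsuccessful search with linear probing. */
double expected_probes_miss(double alpha)
{
    assert(alpha >= 0.0 && alpha < 1.0);
    double d = 1.0 - alpha;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

At a load factor of 0.607 these estimates give roughly 1.8 probes per hit and 3.7 per miss, the same order of magnitude as the measured average reported in the text.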
Optionally, another embodiment of the rapid alignment analysis method of the invention further comprises, before S2:
S40: according to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of short DNA sequences historically aligned to each of them;
S41: following the order given by the sorting result, cutting the current chromosome with a sliding window to obtain consecutive K-mer fragments, and recording the position information of each fragment;
S42: removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
In a specific embodiment, K-mer fragments containing degenerate bases represent unknown sequence in the reference genome hg19 and cannot be used when building the index library; removing them also saves memory.
S43: for each K-mer fragment obtained, mapping the K-mer fragment with the first hash algorithm and the second hash algorithm to obtain index1 and index2, respectively, computing the value that compresses the position-information digest of the K-mer fragment, and generating a struct containing the value and index2;
S44: building the index query library with index1 as the index.
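Steps S41 and S42 can be illustrated with the following C sketch, which slides a window of length k along a chromosome string and records the start position of every window that contains only A, C, G or T. It is a simplified illustration under assumptions: the function name extract_kmers is hypothetical, only upper-case bases are treated as unambiguous, and removal of duplicate K-mers is left out.

```c
#include <assert.h>
#include <stddef.h>

static int is_acgt(char c)
{
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

/* Writes up to max_out start positions of K-mer windows that contain no
 * degenerate base (anything other than upper-case A, C, G, T) and returns
 * how many were written. Duplicate removal is not shown. */
size_t extract_kmers(const char *chrom, size_t len, size_t k,
                     size_t *positions, size_t max_out)
{
    assert(chrom != NULL && positions != NULL && k > 0);
    size_t count = 0;
    size_t run = 0;   /* length of the unambiguous run ending at position i */
    for (size_t i = 0; i < len; i++) {
        run = is_acgt(chrom[i]) ? run + 1 : 0;
        if (run >= k && count < max_out) {
            positions[count++] = i + 1 - k;   /* window start */
        }
    }
    return count;
}
```

Tracking the length of the current unambiguous run avoids rescanning each window, so the pass over a chromosome is linear in its length.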
The value is a compressed digest of the position information of the K-mer fragment. The compression function is:
value = K-mer_location * 26 + K-mer_number,
where K-mer_number is the chromosome number of the K-mer fragment and K-mer_location is the site of the fragment on that chromosome. This ordering is the one inverted by the query-side decoding K-mer_number = value % 26 and K-mer_location = value / 26.
The index2 and the value are stored in the hash struct:
typedef struct Node {
    unsigned int value;
    unsigned int index2;
} HashType;
The hash table constructed in this way can be written in pseudo-code simply as:
hash[index1] = element; /* element is of type HashType */
which stores a value of type HashType at memory offset index1.
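A minimal C sketch of filling such a table is shown below. It assumes, beyond what the text states, that unused slots carry a sentinel EMPTY_VALUE, that a colliding entry is placed in the next free slot found by stepping forward one position at a time, and that the offset wraps modulo the table size; table_clear and table_insert are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EMPTY_VALUE 0xFFFFFFFFu   /* hypothetical marker for an unused slot */

typedef struct Node {
    unsigned int value;
    unsigned int index2;
} HashType;

/* Marks every slot as unused. */
void table_clear(HashType *table, size_t table_size)
{
    assert(table != NULL);
    for (size_t i = 0; i < table_size; i++) {
        table[i].value = EMPTY_VALUE;
        table[i].index2 = 0;
    }
}

/* Stores {value, index2} at offset index1, or at the next free slot after
 * it. Returns the slot used, or -1 if the table is full. */
long table_insert(HashType *table, size_t table_size,
                  uint32_t index1, uint32_t index2, unsigned int value)
{
    assert(table != NULL && table_size > 0);
    size_t slot = index1 % table_size;
    for (size_t step = 0; step < table_size; step++) {
        if (table[slot].value == EMPTY_VALUE) {
            table[slot].value = value;
            table[slot].index2 = index2;
            return (long)slot;
        }
        slot = (slot + 1) % table_size;
    }
    return -1;
}
```

With this placement rule, entries inserted earlier sit closer to their home offset, so inserting K-mers from the most frequently hit chromosomes first would tend to shorten the probe sequence for the most common reads.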
In this embodiment of the invention, short sequences with a high frequency of occurrence are ordered first according to the prior data model in order to improve query speed. Specifically, when a hash index with repeated collisions is processed, the reads that occur most frequently in real sequencing samples are searched first, which reduces the number of probing steps and avoids unnecessary time overhead.
Optionally, in another embodiment of the rapid alignment analysis method of the invention, when the short DNA sequence is aligned against the reference genome, multithreading is used to align forward-strand and reverse-strand short sequences in parallel.
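Aligning both strands presupposes that the reverse-strand form of each read is available. The C sketch below computes a reverse complement, which is the input one worker thread could hash while another hashes the read as sequenced. Thread creation and scheduling are deliberately omitted, and the function names are hypothetical; the text does not describe how the threads are organized.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Watson-Crick complement; any other character is mapped to N. */
static char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N';
    }
}

/* Writes the reverse complement of a NUL-terminated read into out,
 * which must have room for strlen(read) + 1 characters. */
void reverse_complement(const char *read, char *out)
{
    assert(read != NULL && out != NULL);
    size_t n = strlen(read);
    for (size_t i = 0; i < n; i++) {
        out[i] = complement_base(read[n - 1 - i]);
    }
    out[n] = '\0';
}
```

Since the two lookups only read the shared index, they need no locking against each other, which is what makes this kind of work straightforward to run in parallel.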
In a specific application, the sequencer output (off-machine) directory that stores the sequenced DNA data can be monitored in real time, so that result data are produced by alignment as soon as new off-machine sample data appear.
Embodiment 1: applying the rapid alignment analysis method for short reads from second-generation sequencing to data analysis for non-invasive prenatal testing
The workflow by which the invention implements non-invasive prenatal testing first uses Perl and shell commands to preprocess the human reference genome hg19.fa into the format required by the index query library, which has three columns:
Column 1: 36-mer DNA sequence fragment;
Column 2: chromosome number of the fragment;
Column 3: site of the fragment on that chromosome.
The core method is then implemented in C on the basis of the algorithms and data structures described above.
Hardware and software environment required for development and operation: Linux system; 3 or more cores; memory of 35 or more (presumably GB); the C library on the Linux platform; the gcc compiler; the gdb debugger.
Input sample file format:
NIPT raw sequencing data from the Illumina sequencing platform, in fastq format, in which every 4 lines describe one read:
Line 1: begins with '@' and identifies a sequence, including the sequencer name, position identifiers, single-end or paired-end sequencing, filtering status, and primer/adapter information;
Line 2: the read sequence;
Line 3: begins with '+', optionally followed by the sequence identifier and description;
Line 4: the quality information of the read, corresponding to the sequence on line 2.
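The four-line record layout described above can be read with a short C routine such as the following. It is a sketch under assumptions: the function name read_fastq_record and the fixed line capacity are hypothetical, and only the checks implied by the format description (leading '@', leading '+', equal sequence and quality lengths) are performed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LINE_CAP 1024   /* assumed upper bound on a FASTQ line */

/* Reads one 4-line FASTQ record and copies the read sequence into seq.
 * Returns 1 on success, 0 at end of file, -1 on a malformed record. */
int read_fastq_record(FILE *fp, char *seq, size_t seq_cap)
{
    char header[LINE_CAP], bases[LINE_CAP], plus[LINE_CAP], qual[LINE_CAP];
    assert(fp != NULL && seq != NULL && seq_cap > 0);
    if (fgets(header, sizeof header, fp) == NULL) {
        return 0;                                /* no more records */
    }
    if (fgets(bases, sizeof bases, fp) == NULL ||
        fgets(plus, sizeof plus, fp) == NULL ||
        fgets(qual, sizeof qual, fp) == NULL) {
        return -1;                               /* truncated record */
    }
    if (header[0] != '@' || plus[0] != '+') {
        return -1;
    }
    bases[strcspn(bases, "\r\n")] = '\0';        /* strip line ending */
    qual[strcspn(qual, "\r\n")] = '\0';
    if (strlen(bases) != strlen(qual) || strlen(bases) >= seq_cap) {
        return -1;
    }
    strcpy(seq, bases);
    return 1;
}
```

Only line 2 is needed for hashing; the header and quality lines are validated and then discarded.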
Operation and output results
(1) Using the Linux nohup and & commands, the program runs in the background and scans the off-machine directory continuously, producing results whenever a new batch of sample data appears. Run command:
nohup ./hashtb_pth.c &
(2) Input sample file data:
We stored the 7.4M sample file Human_A1600169-R169_good_1.fq and the file Human_A1600171-R171_good_1.fq in the off-machine directory. Input files:
Human_A1600169-R169_good_11.fq, Human_A1600169-R169_good_12.fq, Human_A1600169-R169_good_13.fq, Human_A1600169-R169_good_14.fq, Human_A1600169-R169_good_15.fq, Human_A1600169-R169_good_1.fq and Human_A1600171-R171_good_1.fq. The first 6 are copies of the same sample data and are used to test the stability of the program; the last is used to test the program's ability to process a different sample.
(3) Alignment results
Files generated in the result directory:
Human_A1600169-R169_good_1, Human_A1600169-R169_good_11, Human_A1600169-R169_good_12, Human_A1600169-R169_good_13, Human_A1600169-R169_good_14, Human_A1600169-R169_good_15 and Human_A1600171-R171_good_1.
The result data for the 6 copies of sample Human_A1600169-R169_good_1.fq are completely identical, which shows that software implementing the method of the invention is stable and reliable even when it processes multiple sample data sets in succession.
From the result data one can compute:
Unique alignment rate of the sample: 72.772%,
Alignment rate of the sample: 74.443%.
A simple calculation on the last 4 rows of the result data for the different sample Human_A1600171-R171_good_1.fq gives:
Unique alignment rate of the sample: 71.531%,
Alignment rate of the sample: 73.487%.
These results differ from those of Human_A1600169-R169_good_1.fq, which shows that the software can process different sample data in the same batch directory continuously and promptly.
(4) Alignment time
Time to process the 7 samples: 77.850 seconds.
The relationship between the number of reads in a sample and the alignment time is shown in Fig. 7, which plots the data volume of the aligned samples against the time taken after the method of the invention is implemented as a software program product. It was obtained by testing on real samples and reflects the overall performance of the method when implemented in the C programming language. The tests used raw data from non-invasive prenatal genetic testing samples.
(5) Result verification and comparison
As a control, we used the mainstream, reliable BWA software:
We aligned sample Human_A1600169-R169_good_1.fq with BWA. The procedure was as follows:
1. Build the index library from the reference genome by running
bwa index -a bwtsw hg19.fa
where hg19.fa is the reference genome; because the reference genome hg19.fa is larger than 2 GB, the -a parameter is set to bwtsw.
2. Align the sample sequences against the reference sequence index library by running
bwa aln -f Human_A1600169-R169_good_1.sai hg19.fa Human_A1600169-R169_good_1.fq
which generates a sai file containing the SA coordinate information.
3. Convert the sai file to sam output by running
bwa samse -f Human_A1600169-R169_good_1.sam hg19.fa Human_A1600169-R169_good_1.sai Human_A1600169-R169_good_1.fastq
where the sam file contains the full information of the alignment result.
4. Extract and count the alignment result data with Perl:
Unique alignment rate of the sample: 72.772%,
Alignment rate of the sample: 74.443%.
The results of this software agree with those of BWA, but the whole BWA workflow takes half an hour and occupies a large amount of resources. BWA also cannot process multiple samples continuously and promptly, and for a single sample it does not directly produce the statistics needed for NIPT analysis, so an additional script must be written for the statistical analysis.
Comparison of the two for the non-invasive prenatal testing project (single sample):
Referring to Fig. 8, this embodiment discloses a device for rapid alignment analysis of second-generation sequencing short sequences, comprising:
A mapping unit 1, configured to obtain the DNA short sequences produced by sequencing, and to map and encode each DNA short sequence with a first hash algorithm and a second hash algorithm respectively, yielding a first index and a second index;
A comparing unit 2, configured to align the DNA short sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures; each unit structure contains a value and an index2 value; the value is the compressed summary of the position information of the corresponding K-mer fragment of the reference genome; index2 is the result of mapping that K-mer fragment with the second hash algorithm; the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm; and K is the fragment sequence length;
A computing unit 3, configured to, according to the alignment result, if the alignment succeeds, obtain the value of the K-mer fragment matched by the corresponding DNA short sequence, and to determine from that value the chromosome number and the site on that chromosome to which the DNA short sequence maps.
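The lookup performed by the three units can be illustrated with a small Python sketch. It is a simplified illustration under several assumptions, not the patented implementation: hash1 and hash2 are stand-in hash functions (the actual first and second hash algorithms are not reproduced here), TABLE_SIZE is a hypothetical toy size, the value is stored as a plain (chromosome, position) tuple rather than a compressed summary, and the sketch always compares index2 while probing instead of applying any shortcut check.

```python
from typing import List, Optional, Tuple

TABLE_SIZE = 17  # hypothetical toy size; a real index would be far larger

Value = Tuple[str, int]                 # (chromosome, position), uncompressed here
Slot = Optional[Tuple[int, Value]]      # (index2, value) or None if empty


def hash1(kmer: str) -> int:
    """Stand-in first hash: 2-bit base encoding reduced modulo the table size."""
    code = 0
    for base in kmer:
        code = code * 4 + "ACGT".index(base)
    return code % TABLE_SIZE


def hash2(kmer: str) -> int:
    """Stand-in second hash: a simple 32-bit polynomial hash."""
    h = 0
    for base in kmer:
        h = (h * 31 + ord(base)) & 0xFFFFFFFF
    return h


def insert(table: List[Slot], kmer: str, value: Value) -> None:
    """Store (index2, value) at offset index1, or at the next free slot."""
    start = hash1(kmer)
    for step in range(TABLE_SIZE):
        j = (start + step) % TABLE_SIZE
        if table[j] is None:
            table[j] = (hash2(kmer), value)
            return
    raise RuntimeError("index table is full")


def lookup(table: List[Slot], read: str) -> Optional[Value]:
    """Probe linearly from index1 and return the value whose index2 matches."""
    start, key2 = hash1(read), hash2(read)
    for step in range(TABLE_SIZE):
        slot = table[(start + step) % TABLE_SIZE]
        if slot is None:
            return None          # empty slot reached: the read is not in the index
        if slot[0] == key2:
            return slot[1]       # index2 matches: report the stored position
    return None
```

Storing only index2 and the value, rather than the K-mer itself, keeps each array element small; the second hash serves as the check that the slot reached through index1 really belongs to the queried sequence.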
Although embodiments of the invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims (7)

1. A method for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:
S1, obtaining the DNA short sequences produced by sequencing, and mapping and encoding each DNA short sequence with a first hash algorithm and a second hash algorithm respectively, yielding a first index and a second index;
S2, aligning the DNA short sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures; each unit structure contains a value and an index2 value; the value is the compressed summary of the position information of the corresponding K-mer fragment of the reference genome; index2 is the result of mapping that K-mer fragment with the second hash algorithm; the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm; and K is the fragment sequence length;
S3, according to the alignment result, if the alignment succeeds, obtaining the value of the K-mer fragment matched by the corresponding DNA short sequence, and determining from that value the chromosome number and the site on that chromosome to which the DNA short sequence maps;
wherein S2 comprises:
for each DNA short sequence, searching for the storage block whose first address is the first index of the DNA short sequence, and, if such a storage block is found, determining whether the data storage location of the storage block is 1;
if it is 1, determining that the DNA short sequence matches the K-mer fragment corresponding to the structure stored at the data storage location of the storage block.
2. The method according to claim 1, wherein the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
3. The method according to claim 1, further comprising:
if it is not 1, starting from the first address, obtaining the index2 value of the structure stored at the current address by linear probing with displacement, and determining whether that index2 value is identical to the second index of the DNA short sequence;
if the index2 value is identical to the second index of the DNA short sequence, determining that the DNA short sequence matches the K-mer fragment corresponding to the structure stored at the current address.
4. The method according to claim 1, wherein the load factor of the index query library is 0.607.
5. The method according to claim 1, further comprising, before S2:
S40, according to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each;
S41, according to the sorting result and in order from first to last, performing sliding-window extraction on the current chromosome to obtain consecutive K-mer fragments, and marking the position information of each fragment;
S42, removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
S43, for each K-mer fragment obtained, mapping the K-mer fragment with the first hash algorithm and with the second hash algorithm to obtain an index1 value and an index2 value respectively, computing the value as the compressed summary of the position information of the K-mer fragment, and generating a structure containing the value and the index2 value;
S44, building the index query library using the index1 values as the index.
6. The method according to claim 1, wherein, when the DNA short sequence is aligned against the reference genome, multithreading is used to align the forward-strand and reverse-strand short sequences in parallel.
7. A device for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:
a mapping unit, configured to obtain the DNA short sequences produced by sequencing, and to map and encode each DNA short sequence with a first hash algorithm and a second hash algorithm respectively, yielding a first index and a second index;
a comparing unit, configured to align the DNA short sequence against a reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures; each unit structure contains a value and an index2 value; the value is the compressed summary of the position information of the corresponding K-mer fragment of the reference genome; index2 is the result of mapping that K-mer fragment with the second hash algorithm; the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm; and K is the fragment sequence length;
a computing unit, configured to, according to the alignment result, if the alignment succeeds, obtain the value of the K-mer fragment matched by the corresponding DNA short sequence, and to determine from that value the chromosome number and the site on that chromosome to which the DNA short sequence maps;
wherein the comparing unit is configured to:
for each DNA short sequence, search for the storage block whose first address is the first index of the DNA short sequence, and, if such a storage block is found, determine whether the data storage location of the storage block is 1;
if it is 1, determine that the DNA short sequence matches the K-mer fragment corresponding to the structure stored at the data storage location of the storage block.
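As an illustration of the K-mer preparation recited in steps S41 and S42 of claim 5, the following Python sketch slides a window of length k over each chromosome, discards K-mers containing degenerate bases, and discards repeated K-mers. It is a sketch under assumptions, not the claimed implementation: the function name extract_kmers is hypothetical, positions are reported as 1-based start coordinates, the chromosomes are assumed to arrive already sorted as in S40, and "duplicate" is interpreted as dropping every K-mer that occurs more than once.

```python
from collections import Counter
from typing import Dict, List, Tuple


def extract_kmers(chromosomes: List[Tuple[str, str]], k: int) -> Dict[str, Tuple[str, int]]:
    """Map each retained K-mer to its (chromosome, 1-based start position).

    chromosomes: (name, sequence) pairs, assumed to be pre-sorted as in S40.
    """
    counts: Counter = Counter()
    first_pos: Dict[str, Tuple[str, int]] = {}
    for name, seq in chromosomes:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if set(kmer) - set("ACGT"):
                continue  # contains a degenerate base such as N
            counts[kmer] += 1
            first_pos.setdefault(kmer, (name, start + 1))
    # Assumption: a K-mer seen more than once is removed entirely.
    return {kmer: pos for kmer, pos in first_pos.items() if counts[kmer] == 1}
```

Each retained K-mer and its position could then be passed through the two hash mappings of S43 to produce the index1 value, the index2 value and the stored value.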
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610609337.1A CN106295250B (en) 2016-07-28 2016-07-28 Short sequence quick comparison analysis method and device was sequenced in two generations

Publications (2)

Publication Number Publication Date
CN106295250A CN106295250A (en) 2017-01-04
CN106295250B true CN106295250B (en) 2019-03-29

Family

ID=57662886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610609337.1A Active CN106295250B (en) 2016-07-28 2016-07-28 Short sequence quick comparison analysis method and device was sequenced in two generations

Country Status (1)

Country Link
CN (1) CN106295250B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020243009A1 (en) * 2019-05-24 2020-12-03 Illumina, Inc. Flexible seed extension for hash table genomic mapping

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971088A (en) * 2017-03-28 2017-07-21 泽塔生物科技(上海)有限公司 The method for identifying molecules and system of a kind of eukaryot-ic origin composition
CN107273663B (en) * 2017-05-22 2018-12-11 人和未来生物科技(长沙)有限公司 A kind of DNA methylation sequencing data calculating deciphering method
CN110021359B (en) * 2017-07-24 2021-05-04 深圳华大基因科技服务有限公司 Method and device for removing redundancy of combined assembly result of second-generation sequence and third-generation sequence
CN108470113B (en) * 2018-03-14 2019-05-17 四川大学 Several species do not occur the calculating of k-mer subsequence and characteristic analysis method and system
CN108660200B (en) * 2018-05-23 2022-10-18 北京希望组生物科技有限公司 Method for detecting expansion of short tandem repeat sequence
EP3874511A1 (en) 2018-10-31 2021-09-08 Illumina, Inc. Systems and methods for grouping and collapsing sequencing reads
CN109841264B (en) * 2019-01-31 2022-02-18 郑州云海信息技术有限公司 Sequence comparison filtering processing method, system and device and readable storage medium
CN109979537B (en) * 2019-03-15 2020-12-18 南京邮电大学 Multi-sequence-oriented gene sequence data compression method
CN110085284B (en) * 2019-04-29 2021-02-26 深圳大学 SSD (solid State disk) -oriented gene comparison method and system
CN110134694B (en) * 2019-05-20 2020-04-17 上海英方软件股份有限公司 Rapid comparison device and method for table data in double-activity database
CN111798923B (en) * 2019-05-24 2023-01-31 中国科学院计算技术研究所 Fine-grained load characteristic analysis method and device for gene comparison and storage medium
US11515011B2 (en) 2019-08-09 2022-11-29 International Business Machines Corporation K-mer based genomic reference data compression
CN111028897B (en) * 2019-12-13 2023-06-20 内蒙古农业大学 Hadoop-based distributed parallel computing method for genome index construction
CN111863139B (en) * 2020-04-10 2022-10-18 中国科学院计算技术研究所 Gene comparison acceleration method and system based on near-memory computing structure
CN112102883B (en) * 2020-08-20 2023-12-08 深圳华大生命科学研究院 Base sequence coding method and system in FASTQ file compression
AU2020450960A1 (en) 2020-10-22 2022-05-12 Bgi Genomics Co., Ltd Method for processing gene sequencing data and apparatus for processing gene sequencing data
CN112259168B (en) * 2020-10-22 2023-03-28 深圳华大基因科技服务有限公司 Gene sequencing data processing method and gene sequencing data processing device
CN114373508B (en) * 2022-01-24 2024-02-02 浙江天科高新技术发展有限公司 Strain identification method based on 16S rDNA sequence

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103946396A (en) * 2011-10-31 2014-07-23 三星Sds株式会社 Method for sequence recombination and apparatus for ngs
CN104965999A (en) * 2015-06-05 2015-10-07 西安交通大学 Analysis and integration method and device for sequencing of medium-short gene segment
CN105243297A (en) * 2015-10-09 2016-01-13 人和未来生物科技(长沙)有限公司 Quick comparing and positioning method for gene sequence segments on reference genome

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10127346B2 (en) * 2011-04-13 2018-11-13 The Board Of Trustees Of The Leland Stanford Junior University Systems and methods for interpreting a human genome using a synthetic reference sequence

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
K-Mer Index Of DNA Sequence Based On Hash Algorithm;Jinlin Liu et al.;《International Journal on Computational Science & Applications (IJCSA)》;20150831;第5卷(第4期);第19-28页
基于Hash算法的DNA序列k-mer index问题的数学建模;郭方舟 等;《长春理工大学学报(自然科学版)》;20151031;第38卷(第5期);第116-119页

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 9 No. 101300 Beijing Shunyi District city Nanfaxin Zhen Shunping Lu Nan Freeson 1 Building 8 Room 801

Patentee after: Beijing Pukang Ruiren Medical Laboratory Co., Ltd.

Address before: 9 No. 101300 Beijing Shunyi District city Nanfaxin Zhen Shunping Lu Nan Freeson 1 Building 8 Room 801

Patentee before: Beijing hundred medical laboratory Limited