CN106295250B - Method and device for rapid alignment analysis of second-generation sequencing short reads - Google Patents
- Publication number
- CN106295250B (application number CN201610609337.1A)
- Authority
- CN
- China
- Prior art keywords
- value
- dna
- index
- short sequence
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The present invention discloses a method and device for rapid alignment analysis of second-generation sequencing short reads, which address the low alignment efficiency and high memory usage of sequencing data alignment. The method includes: obtaining DNA short reads produced by sequencing, and encoding each DNA short read with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively; aligning the DNA short read against a reference genome based on a preset index lookup library, the first index, and the second index, where the index lookup library consists of an array of unit structs, each unit struct contains a value field and an index2 field, the array offset at which each unit struct is stored is the corresponding index1 (that is, the index of the struct in the array), and K is the fragment length; and, according to the alignment result, if the read aligns, obtaining the value of the K-mer fragment that the DNA short read aligned to and determining the chromosome number of the DNA short read and its position on that chromosome.
Description
Technical field
The invention belongs to the field of bioinformatics engineering and involves bioinformatics technology and applied computer technology. Specifically, it relates to a rapid alignment analysis method for short reads from second-generation DNA sequencing.
Background art
DNA sequencing plays the most fundamental and widespread role in deciphering the genetic code of living species. Soon after the DNA double helix was discovered, techniques for sequencing DNA were already being reported, but the procedures were overly complicated. Shortly afterwards, in 1977, Sanger invented the chain-termination sequencing method, a milestone in the field. As bioinformatics has developed, Sanger sequencing can no longer meet research needs, and second-generation sequencing technology, with lower cost, higher throughput, and faster speed, has emerged. Its core idea is sequencing by synthesis. DNA sequence alignment compares the DNA short sequences (reads) obtained by sequencing against a reference genome. It is commonly used to analyze similarity and homology between species, and it also allows genetic information to be mined and analyzed with computer techniques.
At present, alignment techniques for second-generation sequencing have developed in two main directions. The first designs a hash algorithm and builds a hash table library from its mapping. Its advantages are fast alignment and high accuracy, but its memory and CPU usage are relatively high. Representative software includes SOAP and MAQ. The second builds a suffix-tree query data structure based on the BWT transform: the genome sequence is first cyclically shifted and sorted, then compressed with BWT to build an index. During alignment, search and backtracking are used to locate a sequence. This technique needs a prebuilt index file and proceeds step by step, and its alignment speed falls short of the near-constant average time of a hash table structure. It can usually handle routine alignment of all sequencing data, but compared with a hash data structure, the tree structure occupies a large amount of memory, and its query speed is far below that of a hash table library with constant average time.
Summary of the invention
In view of this, the present invention provides a method and device for rapid alignment analysis of second-generation sequencing short reads, to solve the problems of low alignment efficiency and high memory usage for sequencing data.
In one aspect, an embodiment of the present invention provides a rapid alignment analysis method for second-generation sequencing short reads, comprising:
S1. obtaining DNA short reads produced by sequencing, and encoding each DNA short read with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
S2. aligning the DNA short read against a reference genome based on a preset index lookup library, the first index, and the second index, where the index lookup library consists of an array of unit structs, each unit struct contains a value field and an index2 field, value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
S3. according to the alignment result, if the read aligns, obtaining the value of the K-mer fragment that the DNA short read aligned to, and determining from that value the chromosome number of the DNA short read and its position on that chromosome.
Preferably, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
Preferably, S2 comprises:
for each DNA short read, looking up the memory block whose first address is the first index of the DNA short read; if such a memory block is found, determining whether the memory block has exactly one data storage location;
if it has exactly one, determining that the DNA short read aligns to the K-mer fragment corresponding to the struct stored at that data storage location.
Preferably, the method further comprises:
if it does not have exactly one, starting from the first address and using shifted linear probing, obtaining the index2 value of the struct stored at the current address and determining whether that index2 value equals the second index of the DNA short read;
if the index2 value equals the second index of the DNA short read, determining that the DNA short read aligns to the K-mer fragment corresponding to the struct stored at the current address.
Preferably, the load factor of the index lookup library is 0.607.
Preferably, before S2, the method further comprises:
S40. according to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each;
S41. following the sorted order, from first to last, extracting from the current chromosome with a sliding window to obtain consecutive K-mer fragments, and recording the position of each fragment;
S42. removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
S43. for each remaining K-mer fragment, mapping the fragment with the first hash algorithm and with the second hash algorithm to obtain an index1 value and an index2 value, computing the value that compresses the digest of the fragment's position information, and generating a struct containing the value and index2;
S44. building the index lookup library with the index1 values as indices.
Preferably, when the DNA short reads are aligned against the reference genome, multithreading is used to align forward-strand and reverse-strand short reads in parallel.
In another aspect, an embodiment of the present invention provides a rapid alignment analysis device for second-generation sequencing short reads, comprising:
a mapping unit, configured to obtain DNA short reads produced by sequencing, and to encode each DNA short read with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
an alignment unit, configured to align the DNA short read against a reference genome based on a preset index lookup library, the first index, and the second index, where the index lookup library consists of an array of unit structs, each unit struct contains a value field and an index2 field, value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
a computing unit, configured to, according to the alignment result, if the read aligns, obtain the value of the K-mer fragment that the DNA short read aligned to, and determine from that value the chromosome number of the DNA short read and its position on that chromosome.
The invention has the following beneficial effects:
1. A hash algorithm is used, so alignment is fast, close to constant time O(1), which improves the time efficiency of the project product;
2. The secondary hash mapping technique avoids storing read sequences directly, which saves a large amount of memory and improves the resource efficiency of the project product;
3. The hash index is reordered using a prior data model so that reads with a high frequency of occurrence are compared first, which speeds up alignment;
4. Multithreading is used to run alignment tasks in parallel, which shortens alignment time.
Brief description of the drawings
Fig. 1 is a flow diagram of one embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention;
Fig. 2 is the time performance plot of the XDDHash algorithm used by the present invention for the primary hash mapping;
Fig. 3 is the collision count plot of the XDDHash algorithm used by the present invention for the primary hash mapping;
Fig. 4 is the time performance plot of the MWFHash algorithm used by the present invention for the secondary hash mapping;
Fig. 5 is the collision count plot of the MWFHash algorithm used by the present invention for the secondary hash mapping;
Fig. 6 is a plot of the load factor of the hash table built by the present invention against the average number of probes;
Fig. 7 is a plot of sample data volume against alignment time after the method of the present invention is implemented as a software program product;
Fig. 8 is a structural diagram of one embodiment of the rapid alignment analysis device for second-generation sequencing short reads of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, this embodiment discloses a rapid alignment analysis method for second-generation sequencing short reads, comprising:
S1. obtaining DNA short reads produced by sequencing, and encoding each DNA short read with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
S2. aligning the DNA short read against a reference genome based on a preset index lookup library, the first index, and the second index, where the index lookup library consists of an array of unit structs, each unit struct contains a value field and an index2 field, value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
S3. according to the alignment result, if the read aligns, obtaining the value of the K-mer fragment that the DNA short read aligned to, and determining from that value the chromosome number of the DNA short read and its position on that chromosome.
The position information of a read is obtained at query time as follows:
Obtaining the memory index: the first hash algorithm and the second hash algorithm convert the sequence information in the sample into the first index and the second index, which are used to obtain the value stored in memory.
Obtaining the sequence position:
Chromosome number of the DNA short read: K-mer_number = value % 26 (modulo operation, i.e. the remainder);
Position of the DNA short read on that chromosome: K-mer_location = value / 26 (integer division).
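As a minimal sketch of the two formulas above, the C fragment below unpacks a stored value into a chromosome number and a position. The names KmerPos, pack_value, and unpack_value are hypothetical and do not come from the original implementation, and pack_value assumes the packing that the modulo and division formulas invert, namely position * 26 + chromosome number.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: a decoded genomic position. */
typedef struct {
    uint32_t number;    /* chromosome number, K-mer_number */
    uint32_t location;  /* site on that chromosome, K-mer_location */
} KmerPos;

/* Assumed packing: the inverse of the % 26 and / 26 formulas. */
static uint32_t pack_value(uint32_t number, uint32_t location)
{
    assert(number < 26u);
    assert(location <= (UINT32_MAX - number) / 26u); /* must fit in 32 bits */
    return location * 26u + number;
}

/* K-mer_number = value % 26, K-mer_location = value / 26. */
static KmerPos unpack_value(uint32_t value)
{
    KmerPos p;
    p.number = value % 26u;
    p.location = value / 26u;
    return p;
}
```

For example, unpack_value(pack_value(7, 1000)) yields chromosome number 7 and position 1000. With a 32-bit value, positions above about 165 million would not fit under this packing, which is what the second assertion guards against.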
In this embodiment of the present invention, a secondary hash scheme is used: the result of the primary hash mapping is used for lookup, and the result of the secondary hash mapping is used to discriminate between colliding entries. This speeds up alignment, and alignment results can be tallied without storing the bulky read sequences directly, which reduces wasted memory.
Optionally, in another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention, the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
The XDDHash algorithm, used for the primary mapping, compresses the key into a digest that serves as the index. Its pseudo-function is as follows:
hash[i+1] = (hash[i] * seed + str[i]) & 0x00000000FFFFFFFF (i = 0, 1, 2, ..., N-2)
where hash[0] is set to 63, the multiplication factor seed is 33, str[i] is the ASCII value of the (i+1)-th base character of the read, and N is the length of the read to be aligned. P & 0x00000000FFFFFFFF takes the low 32 bits of P. The recurrence is iterated until hash[N-1] is obtained, which is the first index.
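The recurrence above can be sketched in C as follows. This is an illustration under stated assumptions rather than the original implementation: the function name xdd_hash is hypothetical, and the loop follows the index range i = 0 .. N-2 literally, so the last base of the read does not enter the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XDDHash as described in the text: hash[0] = 63, seed = 33,
 * hash[i+1] = (hash[i] * seed + str[i]) & 0xFFFFFFFF for i = 0 .. N-2.
 * Returns hash[N-1], the first index. */
static uint32_t xdd_hash(const char *read, size_t n)
{
    assert(read != NULL && n >= 1);
    uint64_t h = 63;
    for (size_t i = 0; i + 1 < n; i++) {
        /* 64-bit intermediate, then keep only the low 32 bits */
        h = (h * 33u + (unsigned char)read[i]) & 0x00000000FFFFFFFFull;
    }
    return (uint32_t)h;
}
```

For the 3-base read "ACG" the two iterations give 63 * 33 + 65 = 2144 and then 2144 * 33 + 67 = 70819.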
Theory and program testing both show that the algorithm performs well: it produces few collisions and maps quickly. Its performance statistics are shown in Figs. 2 and 3. Fig. 2 is the time performance plot of the XDDHash algorithm used for the primary hash mapping, showing elapsed time against the number of reads; Fig. 3 is its collision count plot, showing the number of collisions generated against the number of reads.
The MWFHash algorithm, used for the secondary mapping, separates entries whose indices collide. It uses different shift factors for odd and even positions, as follows.
Odd-position index algorithm (when i is odd):
hash[i+1] = (hash[i] * seedodd + i * str[i]) & 0x00000000FFFFFFFF
Even-position index algorithm (when i is even):
hash[i+1] = (hash[i] * seedeven + i * str[i]) & 0x00000000FFFFFFFF
where hash[0] is set to 0, the odd shift factor seedodd is 5, and the even shift factor seedeven is 7. The recurrence is iterated to obtain hash[N-1], which is the second index.
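A minimal C sketch of this odd/even recurrence is shown below. The function name mwf_hash is hypothetical, and the loop range i = 0 .. N-2 is an assumption carried over from the XDDHash description, since the text only says that the recurrence runs until hash[N-1]. Because the character is multiplied by i, the base at i = 0 contributes nothing under this literal reading.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MWFHash as described in the text: hash[0] = 0, factor 5 when i is odd,
 * factor 7 when i is even,
 * hash[i+1] = (hash[i] * factor + i * str[i]) & 0xFFFFFFFF.
 * Returns hash[N-1], the second index. */
static uint32_t mwf_hash(const char *read, size_t n)
{
    assert(read != NULL && n >= 1);
    uint64_t h = 0;
    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t factor = (i % 2 == 1) ? 5u : 7u;
        h = (h * factor + (uint64_t)i * (unsigned char)read[i])
            & 0x00000000FFFFFFFFull;
    }
    return (uint32_t)h;
}
```

For the read "ACGT" the steps are 0, then 0 * 5 + 1 * 67 = 67, then 67 * 7 + 2 * 71 = 611.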
Theory and program testing both show that the algorithm performs well: it produces few collisions and maps quickly. Its performance statistics are shown in Figs. 4 and 5. Fig. 4 is the time performance plot of the MWFHash algorithm used for the secondary hash mapping, showing elapsed time against the number of reads; Fig. 5 is its collision count plot, showing the number of collisions generated against the number of reads.
In this embodiment of the present invention, when a hash algorithm encodes a read sequence, the encoding directly keeps the low 32 binary bits, which reduces the time complexity of the algorithm.
Optionally, in another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention, S2 comprises:
for each DNA short read, looking up the memory block whose first address is the first index of the DNA short read; if such a memory block is found, determining whether the memory block has exactly one data storage location;
if it has exactly one, determining that the DNA short read aligns to the K-mer fragment corresponding to the struct stored at that data storage location.
Optionally, another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention further comprises:
if it does not have exactly one, starting from the first address and using shifted linear probing, obtaining the index2 value of the struct stored at the current address and determining whether that index2 value equals the second index of the DNA short read;
if the index2 value equals the second index of the DNA short read, determining that the DNA short read aligns to the K-mer fragment corresponding to the struct stored at the current address.
In this embodiment of the present invention, based on the characteristics of the hash values of sample short reads, an improved linear probing method is used to resolve hash collisions. Specifically, when probing downward from an index, only an increment of the index is needed, which reduces the time complexity of the search during alignment.
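A minimal sketch of this probing lookup in C is given below. It makes several assumptions that the text does not state: the first index is reduced to a slot with a modulo by the table capacity, an unused slot is marked by the sentinel EMPTY_VALUE, probing wraps around at the end of the array, and the single-entry case is folded into the general probe loop, so index2 is always compared. The names lookup and EMPTY_VALUE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node {
    unsigned int value;   /* compressed position digest */
    unsigned int index2;  /* second hash of the K-mer fragment */
} HashType;

#define EMPTY_VALUE 0xFFFFFFFFu  /* assumed sentinel marking an unused slot */

/* Starts at slot index1 % capacity and increments the slot until an entry
 * with a matching index2 is found (aligned) or an unused slot is reached
 * (not aligned). Returns the matching entry, or NULL. */
static const HashType *lookup(const HashType *table, size_t capacity,
                              uint32_t index1, uint32_t index2)
{
    assert(table != NULL && capacity > 0);
    size_t slot = index1 % capacity;
    for (size_t probes = 0; probes < capacity; probes++) {
        const HashType *entry = &table[slot];
        if (entry->value == EMPTY_VALUE) {
            return NULL;
        }
        if (entry->index2 == index2) {
            return entry;
        }
        slot = (slot + 1) % capacity;  /* linear probing: index increment */
    }
    return NULL;
}
```

The value field of the returned entry can then be decoded into a chromosome number and position with the modulo and division formulas given earlier.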
Optionally, in another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention, the load factor of the index lookup library is 0.607.
The index lookup library is essentially a hash table. The hash table load factor K is chosen as follows:
Load factor: the ratio of the number of records actually stored in the hash table to the size of the hash table, an important indicator of hash table performance. The larger it is, the more hash collisions occur and the slower the table's queries become; if it is too small, memory is wasted. We designed a test program based on the XDDHash and MWFHash algorithms to measure the relationship between the load factor and the average number of probes (more probes means more time). The relationship is shown in Fig. 6, which reflects the query efficiency of the constructed table. The load factor K chosen in the present invention is 0.607, at which the average number of lookups is observed to be only about 3, close to constant-time lookup.
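As a small illustration of how the load factor relates the record count to the table size, the C sketch below computes the number of slots needed to hold a given number of records at a load factor of 0.607. The function names are hypothetical, and integer arithmetic with the ratio 607/1000 is an implementation choice made here so that floating-point rounding cannot affect the slot count.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest slot count such that records / slots <= 0.607,
 * i.e. ceil(records * 1000 / 607). */
static uint64_t slots_for_records(uint64_t records)
{
    assert(records <= (UINT64_MAX - 606u) / 1000u);  /* avoid overflow */
    return (records * 1000u + 606u) / 607u;
}

/* Load factor = stored records / table size. */
static double load_factor(uint64_t records, uint64_t slots)
{
    assert(slots > 0);
    return (double)records / (double)slots;
}
```

For instance, 607 records need 1000 slots, and 608 records need 1002 slots to stay at or below 0.607.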
Optionally, another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention further comprises, before S2:
S40. according to a prior data model, sorting the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each;
S41. following the sorted order, from first to last, extracting from the current chromosome with a sliding window to obtain consecutive K-mer fragments, and recording the position of each fragment;
S42. removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
In a specific embodiment, K-mer fragments containing degenerate bases represent unknown sequence in the reference genome hg19; they are unusable when building the index library, and removing them also saves memory.
S43. for each remaining K-mer fragment, mapping the fragment with the first hash algorithm and with the second hash algorithm to obtain an index1 value and an index2 value, computing the value that compresses the digest of the fragment's position information, and generating a struct containing the value and index2;
S44. building the index lookup library with the index1 values as indices.
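Steps S41 and S42 can be sketched in C as below. The sketch covers only the sliding-window extraction and the removal of windows containing degenerate bases; removing duplicate K-mer fragments is not shown. It assumes that any character other than A, C, G, or T (in either case) counts as a degenerate base, and the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 for an unambiguous base (A, C, G, T in either case), else 0. */
static int is_plain_base(char c)
{
    int u = toupper((unsigned char)c);
    return u == 'A' || u == 'C' || u == 'G' || u == 'T';
}

/* Slides a window of length k over seq[0..len) and writes the 0-based start
 * of every window free of degenerate bases into out (at most out_cap of
 * them). Returns the number of starts written. */
static size_t valid_kmer_starts(const char *seq, size_t len, size_t k,
                                size_t *out, size_t out_cap)
{
    assert(seq != NULL && out != NULL && k >= 1);
    size_t run = 0;    /* length of the current run of plain bases */
    size_t count = 0;
    for (size_t i = 0; i < len && count < out_cap; i++) {
        run = is_plain_base(seq[i]) ? run + 1 : 0;
        if (run >= k) {
            out[count++] = i + 1 - k;  /* window ends at i */
        }
    }
    return count;
}
```

Tracking the length of the current run of plain bases means each position is examined once, instead of rescanning all k characters of every window.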
The value is the compression of the digest of the K-mer fragment's position information. The compression function is:
value = K-mer_location * 26 + K-mer_number,
where K-mer_number is the chromosome number of the K-mer fragment and K-mer_location is the position of the K-mer fragment on that chromosome, so that K-mer_number = value % 26 and K-mer_location = value / 26 recover the two fields.
The index2 value and the value are stored in a hash struct:
typedef struct Node {
    unsigned int value;   /* compressed position digest of the K-mer fragment */
    unsigned int index2;  /* second hash of the K-mer fragment */
} HashType;
The pseudo-expression for the hash table built this way can be written simply as:
hash[index1] = element;  /* element is of type HashType */
which stores a value of type HashType at memory index index1.
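A hedged sketch of how such a table might be filled is shown below. The text only gives the assignment hash[index1] = element, so the collision handling here is an assumption: on a collision the entry goes to the next free slot found by incrementing the index. The modulo reduction, the EMPTY_VALUE sentinel, and the function names are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node {
    unsigned int value;
    unsigned int index2;
} HashType;

#define EMPTY_VALUE 0xFFFFFFFFu  /* assumed sentinel marking an unused slot */

/* Marks every slot of the table as unused. */
static void table_clear(HashType *table, size_t capacity)
{
    for (size_t i = 0; i < capacity; i++) {
        table[i].value = EMPTY_VALUE;
        table[i].index2 = 0;
    }
}

/* Stores (value, index2) at slot index1 % capacity, or at the next free
 * slot found by incrementing the index. Returns 1 on success, 0 if full. */
static int table_insert(HashType *table, size_t capacity, uint32_t index1,
                        uint32_t index2, uint32_t value)
{
    assert(table != NULL && capacity > 0 && value != EMPTY_VALUE);
    size_t slot = index1 % capacity;
    for (size_t probes = 0; probes < capacity; probes++) {
        if (table[slot].value == EMPTY_VALUE) {
            table[slot].value = value;
            table[slot].index2 = index2;
            return 1;
        }
        slot = (slot + 1) % capacity;  /* linear probing: index increment */
    }
    return 0;
}
```

Inserting entries in the order given by the prior data model places the entries of frequently hit chromosomes earlier in each probe sequence, which matches the stated goal of searching common reads first.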
In this embodiment of the present invention, short reads with a high frequency of occurrence are ordered first according to the prior data model, to improve query speed. Specifically, when handling hash indices with repeated collisions, the read sequences that occur most often in real sequencing samples are searched first, which reduces the number of probe steps and avoids unnecessary time overhead.
Optionally, in another embodiment of the rapid alignment analysis method for second-generation sequencing short reads of the present invention, when the DNA short reads are aligned against the reference genome, multithreading is used to align forward-strand and reverse-strand short reads in parallel.
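Aligning the reverse strand requires the reverse complement of each read. The text does not describe how the reverse-strand read is produced, so the C sketch below is only one plausible helper, with a hypothetical function name; the threading itself is not shown. Any base other than uppercase A, C, G, or T is written as N.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the reverse complement of read[0..n) into out, which must have
 * room for n + 1 characters, and terminates it with '\0'. */
static void reverse_complement(const char *read, size_t n, char *out)
{
    assert(read != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (read[n - 1 - i]) {
        case 'A': out[i] = 'T'; break;
        case 'C': out[i] = 'G'; break;
        case 'G': out[i] = 'C'; break;
        case 'T': out[i] = 'A'; break;
        default:  out[i] = 'N'; break;  /* degenerate or unknown base */
        }
    }
    out[n] = '\0';
}
```

One thread could then hash and look up the read as sequenced while another does the same for its reverse complement.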
In a specific application, the sequencer output directory that stores the DNA sequences obtained by sequencing can be monitored in real time; whenever new sample data comes off the sequencer, it is aligned and the result data is produced.
Embodiment 1: applying the rapid alignment analysis method for second-generation sequencing short reads to data analysis for non-invasive prenatal testing
The module workflow by which the present invention implements non-invasive prenatal testing is as follows. First, the human reference genome hg19.fa is preprocessed with Perl and shell commands into the format required by the index lookup library, which has three columns:
Column 1: 36-mer DNA sequence fragment;
Column 2: chromosome number of the fragment;
Column 3: position of the fragment on that chromosome.
The core method is then implemented in C, based on algorithms and data structures.
The design and operation of the present invention require the following hardware and software environment: a Linux system; 3 or more cores; at least 35 (presumably GB) of memory; the C library on the Linux platform; the gcc compiler; the gdb debugger.
Input sample file format:
Raw NIPT sequencing data from the Illumina sequencing platform is in FASTQ format, with every 4 lines describing one read:
Line 1: begins with @ and identifies a sequence; it includes the sequencer name, flags, single-/paired-end sequencing, filtering status, and descriptive information such as primers and adapters;
Line 2: the read sequence;
Line 3: begins with '+', followed by the sequence identifier and description, or by nothing;
Line 4: the quality information of the read, corresponding to the sequence on line 2.
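The four-line record layout above can be read with a short C routine such as the following sketch. The function name next_read and the line-length bound FASTQ_LINE_CAP are assumptions made for illustration; the routine keeps only the read sequence and checks the '@' and '+' markers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FASTQ_LINE_CAP 512  /* assumed upper bound on the length of a line */

/* Reads the next 4-line FASTQ record from fp and copies the read sequence
 * (line 2, without the line ending) into seq. Returns 1 on success and 0
 * at end of file or on a malformed record. */
static int next_read(FILE *fp, char *seq, size_t seq_cap)
{
    char header[FASTQ_LINE_CAP], plus[FASTQ_LINE_CAP], qual[FASTQ_LINE_CAP];
    assert(fp != NULL && seq != NULL && seq_cap > 1);
    if (!fgets(header, sizeof header, fp)) return 0;  /* line 1: @name  */
    if (!fgets(seq, (int)seq_cap, fp)) return 0;      /* line 2: bases  */
    if (!fgets(plus, sizeof plus, fp)) return 0;      /* line 3: +      */
    if (!fgets(qual, sizeof qual, fp)) return 0;      /* line 4: quality */
    if (header[0] != '@' || plus[0] != '+') return 0;
    seq[strcspn(seq, "\r\n")] = '\0';                 /* strip line ending */
    return 1;
}
```

Calling next_read in a loop until it returns 0 yields each read sequence in turn, ready to be passed to the two hash functions.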
Operation and output results
(1) Using the Linux nohup and & commands to run in the background, the program keeps scanning the sequencer output directory and produces results whenever new batch sample data appears. Run command:
nohup ./hashtb_pth.c &
(2) Input sample file data:
We placed the 7.4M sample file Human_A1600169-R169_good_1.fq and the file Human_A1600171-R171_good_1.fq in the sequencer output directory. Input files:
Human_A1600169-R169_good_11.fq, Human_A1600169-R169_good_12.fq, Human_A1600169-R169_good_13.fq, Human_A1600169-R169_good_14.fq, Human_A1600169-R169_good_15.fq, Human_A1600169-R169_good_1.fq, and Human_A1600171-R171_good_1.fq. The first 6 are copies of the same sample, used to test the stability of the program; the last one is used to test the program's ability to handle a different sample.
(3) Alignment results
Files generated in the result directory:
Human_A1600169-R169_good_1, Human_A1600169-R169_good_11, Human_A1600169-R169_good_12, Human_A1600169-R169_good_13, Human_A1600169-R169_good_14, Human_A1600169-R169_good_15, and Human_A1600171-R171_good_1.
The result data for the 6 copies of sample Human_A1600169-R169_good_1.fq is identical, which shows that software implementing the method of the present invention remains stable and reliable even when processing multiple samples in succession.
A simple calculation from the result data gives:
Unique alignment rate of the sample: 72.772%,
Alignment rate of the sample: 74.443%.
A simple calculation from the last 4 lines of the result data of the different sample Human_A1600171-R171_good_1.fq gives:
Unique alignment rate of the sample: 71.531%,
Alignment rate of the sample: 73.487%.
These results differ from those of Human_A1600169-R169_good_1.fq, which shows that the software can process different samples in the same batch directory continuously and in real time.
(4) Alignment time
Time to process the 7 samples: 77.850 seconds.
The relationship between the number of reads in a sample and the alignment time is shown in Fig. 7, a plot of sample data volume against alignment time after the method of the present invention is implemented as a software program product. It was obtained by testing on real samples, using raw data from non-invasive prenatal genetic testing samples, and reflects the overall effect of the method as implemented in the C programming language.
(5) Result verification and comparison
For the control results, we used BWA, a reliable mainstream tool, to align sample Human_A1600169-R169_good_1.fq. The procedure is as follows:
1. Build the index library from the reference genome by running
bwa index -a bwtsw hg19.fa
where hg19.fa is the reference genome; because hg19.fa is larger than 2 GB, the -a parameter is set to bwtsw.
2. Align the sample sequences against the reference index library by running
bwa aln -f Human_A1600169-R169_good_1.sai hg19.fa Human_A1600169-R169_good_1.fq
This generates a .sai file, which contains a digest of the SA coordinates.
3. Convert the .sai file to SAM output by running
bwa samse -f Human_A1600169-R169_good_1.sam hg19.fa Human_A1600169-R169_good_1.sai Human_A1600169-R169_good_1.fastq
where the SAM file contains the full information of the alignment results.
4. Extract the alignment result data with Perl and compute the statistics:
Unique alignment rate of the sample: 72.772%,
Alignment rate of the sample: 74.443%.
The results of this software agree with those of BWA, but the whole BWA workflow takes half an hour and occupies substantial resources. BWA also cannot process multiple samples continuously in real time, and a single-sample run does not directly give the statistics needed for NIPT analysis, so a separate script has to be written for the statistical analysis.
Comparison of the two for the non-invasive prenatal testing project (single sample):
Referring to Fig. 8, this embodiment discloses a rapid alignment analysis device for second-generation sequencing short reads, comprising:
a mapping unit 1, configured to obtain DNA short reads produced by sequencing, and to encode each DNA short read with a first hash algorithm and a second hash algorithm to obtain a first index and a second index, respectively;
an alignment unit 2, configured to align the DNA short read against a reference genome based on a preset index lookup library, the first index, and the second index, where the index lookup library consists of an array of unit structs, each unit struct contains a value field and an index2 field, value is the compressed digest of the position information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array offset at which each unit struct is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
a computing unit 3, configured to, according to the alignment result, if the read aligns, obtain the value of the K-mer fragment that the DNA short read aligned to, and determine from that value the chromosome number of the DNA short read and its position on that chromosome.
Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.
Claims (7)
1. A method for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:
S1, obtaining the DNA short sequences produced by sequencing, and map-encoding each DNA short sequence with a first hash algorithm and a second hash algorithm, obtaining a first index and a second index respectively;
S2, aligning the DNA short sequence against the reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures, each unit structure contains a value and an index2 value, the value is a compressed digest of the location information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
S3, according to the alignment result, if the DNA short sequence is aligned, obtaining the value of the K-mer fragment to which it aligned, and determining from that value the chromosome number of the DNA short sequence and its site on that chromosome;
wherein S2 comprises:
for each DNA short sequence, searching for the memory block whose first address is the first index of the DNA short sequence, and, if such a memory block is found, judging whether the data storage location of the memory block is 1;
if it is 1, determining that the DNA short sequence aligns to the K-mer fragment corresponding to the structure stored at the data storage location of the memory block.
2. The method according to claim 1, characterized in that the first hash algorithm is the XDDHash algorithm and the second hash algorithm is the MWFHash algorithm.
3. The method according to claim 1, characterized by further comprising:
if it is not 1, starting from the first address and using shifted linear probing, obtaining the index2 value of the structure stored at the current address, and judging whether that index2 value is identical to the second index of the DNA short sequence;
if the index2 value is identical to the second index of the DNA short sequence, determining that the DNA short sequence aligns to the K-mer fragment corresponding to the structure stored at the current address.
4. The method according to claim 1, characterized in that the load factor of the index query library is 0.607.
5. The method according to claim 1, characterized in that, before S2, the method further comprises:
S40, establishing a prior data model, and sorting the 24 chromosomes of the reference genome in descending order of the number of DNA sequences historically aligned to each of them;
S41, according to the sorting result, in order from first to last, extracting fragments from the current chromosome with a sliding window to obtain consecutive K-mer fragments, and marking the location information of each fragment;
S42, removing K-mer fragments that contain degenerate bases and duplicate K-mer fragments;
S43, for each K-mer fragment obtained, applying the first hash algorithm mapping and the second hash algorithm mapping to the K-mer fragment to obtain an index1 value and an index2 value respectively, computing the value as a compressed digest of the location information of the K-mer fragment, and generating a structure containing the value and the index2 value;
S44, building the index query library using the index1 value as the index.
6. The method according to claim 1, characterized in that, when the DNA short sequences are aligned against the reference genome, multithreading is used to align forward-strand and reverse-strand short sequences in parallel.
7. A device for rapid alignment analysis of second-generation sequencing short sequences, characterized by comprising:
a mapping unit, configured to obtain the DNA short sequences produced by sequencing, and to map-encode each DNA short sequence with a first hash algorithm and a second hash algorithm, obtaining a first index and a second index respectively;
a comparing unit, configured to align the DNA short sequence against the reference genome based on a preset index query library, the first index and the second index, wherein the index query library consists of an array of unit structures, each unit structure contains a value and an index2 value, the value is a compressed digest of the location information of the corresponding K-mer fragment of the reference genome, index2 is the result of mapping that K-mer fragment with the second hash algorithm, the array index offset at which each unit structure is stored is index1, the result of mapping the corresponding K-mer fragment with the first hash algorithm, and K is the fragment length;
a computing unit, configured to, according to the alignment result, if the DNA short sequence is aligned, obtain the value of the K-mer fragment to which it aligned, and to determine from that value the chromosome number of the DNA short sequence and its site on that chromosome;
wherein the comparing unit is configured to:
for each DNA short sequence, search for the memory block whose first address is the first index of the DNA short sequence, and, if such a memory block is found, judge whether the data storage location of the memory block is 1;
if it is 1, determine that the DNA short sequence aligns to the K-mer fragment corresponding to the structure stored at the data storage location of the memory block.
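The fragment-preparation steps S40 to S42 of claim 5 and the load factor of claim 4 can be illustrated with a short Python sketch. It rests on assumptions that are not in the claims: the prior data model is reduced to a plain dictionary of historical read counts per chromosome, "duplicate K-mer fragments" is read as dropping every K-mer that occurs more than once (keeping the first occurrence would be an equally plausible reading), and positions are 0-based. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterator, List, Tuple


def order_chromosomes(history_counts: Dict[str, int]) -> List[str]:
    """S40: chromosomes sorted by historical aligned-read count, most first."""
    return sorted(history_counts, key=history_counts.get, reverse=True)


def sliding_kmers(seq: str, k: int) -> Iterator[Tuple[int, str]]:
    """S41: consecutive K-mers with their 0-based start position."""
    for pos in range(len(seq) - k + 1):
        yield pos, seq[pos:pos + k]


def collect_kmers(genome: Dict[str, str],
                  history_counts: Dict[str, int],
                  k: int) -> Dict[str, Tuple[str, int]]:
    """S41-S42: keep K-mers made only of A/C/G/T that occur exactly once."""
    seen: Counter = Counter()
    located: Dict[str, Tuple[str, int]] = {}
    for chrom in order_chromosomes(history_counts):
        for pos, kmer in sliding_kmers(genome[chrom].upper(), k):
            if set(kmer) - set("ACGT"):
                continue  # contains a degenerate base such as N
            seen[kmer] += 1
            located.setdefault(kmer, (chrom, pos))
    # Assumed reading of "duplicate": discard every K-mer seen more than once.
    return {kmer: loc for kmer, loc in located.items() if seen[kmer] == 1}


def table_slots(n_kmers: int, load_factor: float = 0.607) -> int:
    """Claim 4: number of array slots so that n_kmers / slots is about 0.607."""
    return math.ceil(n_kmers / load_factor)
```

With a toy genome `{"chr1": "ACGTNACGA", "chr2": "ACGTTT"}` and `k = 4`, the K-mers overlapping `N` are skipped, `ACGT` is dropped because it occurs on both chromosomes, and three K-mers remain, for which `table_slots(3)` gives 5 slots. Steps S43 and S44 would then hash each retained K-mer twice and store a (value, index2) structure at the offset given by index1.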
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610609337.1A CN106295250B (en) | 2016-07-28 | 2016-07-28 | Short sequence quick comparison analysis method and device was sequenced in two generations |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106295250A (en) | 2017-01-04 |
CN106295250B (en) | 2019-03-29 |
Family
ID=57662886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610609337.1A Active CN106295250B (en) | 2016-07-28 | 2016-07-28 | Short sequence quick comparison analysis method and device was sequenced in two generations |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106295250B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020243009A1 (en) * | 2019-05-24 | 2020-12-03 | Illumina, Inc. | Flexible seed extension for hash table genomic mapping |
Families Citing this family (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106971088A (en) * | 2017-03-28 | 2017-07-21 | 泽塔生物科技(上海)有限公司 | The method for identifying molecules and system of a kind of eukaryot-ic origin composition |
CN107273663B (en) * | 2017-05-22 | 2018-12-11 | 人和未来生物科技(长沙)有限公司 | A kind of DNA methylation sequencing data calculating deciphering method |
CN110021359B (en) * | 2017-07-24 | 2021-05-04 | 深圳华大基因科技服务有限公司 | Method and device for removing redundancy of combined assembly result of second-generation sequence and third-generation sequence |
CN108470113B (en) * | 2018-03-14 | 2019-05-17 | 四川大学 | Several species do not occur the calculating of k-mer subsequence and characteristic analysis method and system |
CN108660200B (en) * | 2018-05-23 | 2022-10-18 | 北京希望组生物科技有限公司 | Method for detecting expansion of short tandem repeat sequence |
EP3874511A1 (en) | 2018-10-31 | 2021-09-08 | Illumina, Inc. | Systems and methods for grouping and collapsing sequencing reads |
CN109841264B (en) * | 2019-01-31 | 2022-02-18 | 郑州云海信息技术有限公司 | Sequence comparison filtering processing method, system and device and readable storage medium |
CN109979537B (en) * | 2019-03-15 | 2020-12-18 | 南京邮电大学 | Multi-sequence-oriented gene sequence data compression method |
CN110085284B (en) * | 2019-04-29 | 2021-02-26 | 深圳大学 | SSD (solid State disk) -oriented gene comparison method and system |
CN110134694B (en) * | 2019-05-20 | 2020-04-17 | 上海英方软件股份有限公司 | Rapid comparison device and method for table data in double-activity database |
CN111798923B (en) * | 2019-05-24 | 2023-01-31 | 中国科学院计算技术研究所 | Fine-grained load characteristic analysis method and device for gene comparison and storage medium |
US11515011B2 (en) | 2019-08-09 | 2022-11-29 | International Business Machines Corporation | K-mer based genomic reference data compression |
CN111028897B (en) * | 2019-12-13 | 2023-06-20 | 内蒙古农业大学 | Hadoop-based distributed parallel computing method for genome index construction |
CN111863139B (en) * | 2020-04-10 | 2022-10-18 | 中国科学院计算技术研究所 | Gene comparison acceleration method and system based on near-memory computing structure |
CN112102883B (en) * | 2020-08-20 | 2023-12-08 | 深圳华大生命科学研究院 | Base sequence coding method and system in FASTQ file compression |
AU2020450960A1 (en) | 2020-10-22 | 2022-05-12 | Bgi Genomics Co., Ltd | Method for processing gene sequencing data and apparatus for processing gene sequencing data |
CN112259168B (en) * | 2020-10-22 | 2023-03-28 | 深圳华大基因科技服务有限公司 | Gene sequencing data processing method and gene sequencing data processing device |
CN114373508B (en) * | 2022-01-24 | 2024-02-02 | 浙江天科高新技术发展有限公司 | Strain identification method based on 16S rDNA sequence |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103946396A (en) * | 2011-10-31 | 2014-07-23 | 三星Sds株式会社 | Method for sequence recombination and apparatus for ngs |
CN104965999A (en) * | 2015-06-05 | 2015-10-07 | 西安交通大学 | Analysis and integration method and device for sequencing of medium-short gene segment |
CN105243297A (en) * | 2015-10-09 | 2016-01-13 | 人和未来生物科技(长沙)有限公司 | Quick comparing and positioning method for gene sequence segments on reference genome |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10127346B2 (en) * | 2011-04-13 | 2018-11-13 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for interpreting a human genome using a synthetic reference sequence |
Non-Patent Citations (2)
Title |
---|
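K-Mer Index Of DNA Sequence Based On Hash Algorithm; Jinlin Liu et al.; International Journal on Computational Science & Applications (IJCSA); 2015-08-31; Vol. 5, No. 4; pp. 19-28 |
Mathematical modeling of the k-mer index problem for DNA sequences based on a hash algorithm (in Chinese); Guo Fangzhou et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); 2015-10-31; Vol. 38, No. 5; pp. 116-119 |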
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | Address after: 9 No. 101300 Beijing Shunyi District city Nanfaxin Zhen Shunping Lu Nan Freeson 1 Building 8 Room 801; Patentee after: Beijing Pukang Ruiren Medical Laboratory Co., Ltd.; Address before: 9 No. 101300 Beijing Shunyi District city Nanfaxin Zhen Shunping Lu Nan Freeson 1 Building 8 Room 801; Patentee before: Beijing hundred medical laboratory Limited |