CN102453751A - Method for DNA sequencer to reattach short sequence to genome - Google Patents

Method for DNA sequencer to reattach short sequence to genome

Info

Publication number
CN102453751A
CN102453751A CN2010105197821A CN201010519782A CN102453751A CN 102453751 A CN102453751 A CN 102453751A CN 2010105197821 A CN2010105197821 A CN 2010105197821A CN 201010519782 A CN201010519782 A CN 201010519782A CN 102453751 A CN102453751 A CN 102453751A
Authority
CN
China
Prior art keywords
short sequence
sequence
genome
seed
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105197821A
Other languages
Chinese (zh)
Inventor
马斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLESPOT (BEIJING) CO Ltd
Original Assignee
PEOPLESPOT (BEIJING) CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLESPOT (BEIJING) CO Ltd filed Critical PEOPLESPOT (BEIJING) CO Ltd
Priority to CN2010105197821A priority Critical patent/CN102453751A/en
Publication of CN102453751A publication Critical patent/CN102453751A/en
Pending legal-status Critical Current

Landscapes

  • Lubricants (AREA)
  • Peptides Or Proteins (AREA)

Abstract

The invention relates to a method for processing DNA sequencing data. To improve the efficiency of mapping (reattaching) short sequences back to a genome, the invention provides a method for mapping the short sequences of a DNA sequencer back to a genome. By providing optimized combinations of full-length spaced seeds, the method guarantees 100% recall while improving data-processing efficiency.

Description

Method for mapping the short sequences of a DNA sequencer back to a genome
Technical field
The present invention relates to methods for processing DNA sequencing data, and in particular to a method for mapping the sequencing results, namely short sequences, back to a genome.
Background art
DNA sequencing technology is the technology for determining DNA sequences. In molecular biology research, DNA sequence analysis is the basis for further studying and modifying target genes. The main sequencing techniques are the dideoxy chain-termination method invented by Sanger et al. (1977) and the chemical degradation method invented by Maxam and Gilbert (1977). The two methods differ greatly in principle, but both start from a fixed point on the nucleotide chain and terminate randomly at a specific base, producing four groups of nucleotide fragments of different lengths ending in A, T, C, and G. The fragments are then separated by electrophoresis on a urea-denaturing PAGE gel to obtain the DNA sequence.
Next-generation sequencing technology, which has emerged in recent years, offers outstanding advantages over traditional Sanger sequencing, such as high throughput, high accuracy, and low operating cost. It is a revolutionary change in sequencing technology, has spurred research in many frontier areas of biology, and has very broad application prospects. The GA sequencer from Illumina and the SOLiD sequencer from Applied Biosystems are currently the two mainstream sequencers on the market. Because the nucleotide sequences produced by these two sequencers are relatively short (15 bp to 100 bp), an indispensable step in the data-analysis pipeline, between the data produced by the sequencer and the many downstream biological applications, is mapping the short sequences back to the genome. That is, the high-throughput short sequences produced by the sequencer are compared with the long genome sequence, the most similar fragment in the genome sequence is found for each short sequence, and the matching position is output. In essence, mapping short sequences back to a genome is the problem of aligning a short sequence against a long sequence. This is one of the most basic and most commonly used algorithms in bioinformatics, and nearly every bioinformatics processing task may use it. As the volume of biological sequence data available for comparative analysis grows explosively, the continually emerging requirements for sequence comparison pose new challenges to sequence alignment methods.
From the 1970s to the 1980s, dynamic-programming alignment algorithms, represented by the Needleman-Wunsch and Smith-Waterman algorithms, were suitable only for comparing small numbers of biological sequences. From the 1980s to the 1990s, algorithms represented by FASTA and BLAST indexed the biological sequences to quickly filter out highly dissimilar sequences, and then checked the few remaining candidate match positions more precisely (using an exact algorithm such as the dynamic programming mentioned above). This greatly improved speed while maintaining a certain level of accuracy. Indexing biological sequences therefore became an indispensable step for completing large-scale sequence comparison with limited resources in limited time. The indexing scheme is the key factor in the efficiency and accuracy of sequence comparison, and the indexing schemes of FASTA and BLAST are heuristics that trade accuracy for speed. Since the late 1990s, researchers have worked to improve these indexing schemes, hoping to raise alignment accuracy as much as possible at the speed of BLAST, even approaching the accuracy of dynamic programming. Algorithms represented by PatternHunter opened up research on indexing methods built around the "spaced seed". Both BLAST and PatternHunter build an index entry for every position of the sequence and store the index in memory or in an external file. However, the Illumina Genome Analyzer and the AB SOLiD produce unprecedented amounts of data per unit time (one experiment produces 1.5G of data in two days). If existing sequence alignment methods were used to map short sequences back to the genome, mapping the data from a single experiment would take several months, and the number of index entries, several times the number of short sequences, would exceed what existing computer memory can support. Mapping short sequences back to the genome has therefore become a huge bottleneck in the data-analysis pipelines of these two sequencers.
The paper "Zillions of Oligos Mapped" (Bioinformatics 24(21): 2431-2437, 2008) describes a method for mapping short sequences back to a genome that uses a number of spaced seeds to reach 100% recall. However, it only proves how many spaced seeds are needed and does not give concrete spaced seed combinations.
Summary of the invention
To improve the efficiency of mapping short sequences back to a genome, the invention provides a method for mapping the short sequences of a DNA sequencer back to a genome, and achieves this object by providing optimized combinations of full-length spaced seeds.
The technical solution of the invention is as follows:
The method for mapping the short sequences of a DNA sequencer back to a genome comprises the following steps:
The short sequences produced by the DNA sequencer and the genome are indexed with a combination of full-length spaced seeds, so as to filter out the set of positions to which a short sequence may map;
A full-length spaced seed is a code string whose length equals that of the short sequence, and the code string is composed of match codes and wildcards. A match code marks a position at which the short sequence and the genome must be compared; a wildcard marks a position at which they need not be compared. The combination of full-length spaced seeds is the combination with the minimum number of full-length spaced seeds needed for mapping the short sequences back to the genome with 100% recall;
The combinations of full-length spaced seeds for the various cases are listed below. Each label s#.w#.r# or s#.w#.z# denotes one combination of full-length spaced seeds, where # stands for a number: the number after s is the seed length; the number after w is the seed weight; the number after r is the number of mismatches allowed at 100% recall; and the number after z is the number of color-space mismatches allowed at 100% recall. The labels r and z correspond to the short sequences output by the two kinds of DNA sequencer. In the seeds, 1 denotes a match code and * denotes a wildcard. The combinations of full-length spaced seeds (hereinafter "the spaced seed combinations of the invention") also include the result of any column permutation of a listed combination. A column permutation is defined as follows: for the combination denoted by one s#.w#.r# or s#.w#.z# label, write each spaced seed as a row, so that corresponding positions of the seeds form the columns of a matrix; any two columns of this matrix may be swapped to form a new matrix. That is, combinations obtained from those listed below by such column permutations also achieve the object of the invention:
s15.w11.r1
*11111111111***
****11111111111
1111****1111111
11111111****111
s17.w11.r1
*11111111111*****
******11111111111
111111******11111
s18.w13.r1
1111111111111*****
*****1111111111111
111111111****1111*
11111****11111111*
s20.w13.r1
1111111111111*******
*******1111111111111
*111111******1111111
s26.w13.r1
1111111111111*************
*************1111111111111
s14.w8.r2
11**111111****
**11**111111**
****11**111111
111111****11**
**111111****11
1111****11**11
11****11**1111
s16.w8.r2
11111111********
****11111111****
********11111111
1111****1111****
****1111****1111
1111********1111
s17.w9.r2
111**111111******
***111**111111***
******111**111111
11111******1**111
11111*11****11***
111******1**11111
***111*1*11***111
s18.w9.r2
111111***111******
***111111***111***
******111111***111
111******111111***
***111******111111
111***111******111
s19.w9.r2
111111111**********
****111111111******
**********111111111
1111*****11111*****
*****1111*****11111
1111**********11111
s20.w10.r2
11111*****11111*****
*****1111111111*****
11111**********11111
*****11111*****11111
1111111111**********
**********1111111111
s22.w10-12.r2
11111111111***********
*****11111111111******
***********11111111111
11111******11111******
*****111111*****111111
11111***********111111
s23.w11.r2
11111111111************
*****11111111111*******
************11111111111
11111******111111******
******11111******111111
11111************111111
s24.w12.r2
111111111111************
************111111111111
111111******111111******
111111************111111
******111111111111******
******111111******111111
s26.w12-13.r2
1111111111111*************
******1111111111111*******
*************1111111111111
111111*******111111*******
******1111111******111111*
111111*************1111111
s27.w13.r2
1111111111111**************
******1111111111111********
**************1111111111111
111111*******1111111*******
*******111111*******1111111
111111**************1111111
s28.w13.r2
111*1**111*1**111*1**1******
*111*1**111*1**111*1**1*****
**111*1**111*1**111*1**1****
***111*1**111*1**111*1**1***
****111*1**111*1**111*1**1**
*****111*1**111*1**111*1**1*
******111*1**111*1**111*1**1
s30.w13.r2
1111111111111*****************
*************1111111111111****
*****************1111111111111
********111111111*********1111
11111********1111*********1111
s32.w13.r2
1111111111111*******************
111111*******1111111************
111111**************1111111*****
******1111111111111*************
*******************1111111111111
s33.w13.r2
*******1111111111111*************
********************1111111111111
1111111111111********************
1111111******111111**************
s39.w13.r2
1111111111111**************************
*************1111111111111*************
**************************1111111111111
s25.w10.r3
1111111111***************
*****1111111111**********
**********1111111111*****
***************1111111111
11111*****11111**********
*****11111*****11111*****
**********11111*****11111
11111**********11111*****
*****11111**********11111
11111***************11111
s30.w12.r3
111111111111******************
******111111111111************
************111111111111******
******************111111111111
111111******111111************
******111111******111111******
************111111******111111
111111************111111******
******111111************111111
111111******************111111
s34.w13.r3
1111111111111*********************
*******1111111111111**************
**************1111111111111*******
*********************1111111111111
1111111*******111111**************
*******1111111*******111111*******
**************1111111*******111111
1111111**************111111*******
*******1111111**************111111
1111111*********************111111
s36.w13.r3
111*1**1***111*1**1***111***********
*111*1**1***111*1**1***111**********
**111*1**1***111*1**1***111*********
***111*1**1***111*1**1***111********
****111*1**1***111*1**1***111*******
*****111*1**1***111*1**1***111******
******111*1**1***111*1**1***111*****
*******111*1**1***111*1**1***111****
********111*1**1***111*1**1***111***
*********111*1**1***111*1**1***111**
**********111*1**1***111*1**1***111*
s42.w12.r3
111111111111******************************
******111111111111************************
************111111111111******************
******************111111111111************
************************111111111111******
******************************111111111111
111111******************************111111
s46.w13.r3
1111111111111*********************************
*******1111111111111**************************
********************1111111111111*************
*********************************1111111111111
1111111******111111***************************
s52.w13.r3
1111111111111***************************************
*************1111111111111**************************
**************************1111111111111*************
***************************************1111111111111
s30.w10.r4
1111111111********************
*****1111111111***************
**********1111111111**********
***************1111111111*****
********************1111111111
11111*****11111***************
*****11111*****11111**********
**********11111*****11111*****
***************11111*****11111
11111**********11111**********
*****11111**********11111*****
**********11111**********11111
11111***************11111*****
*****11111***************11111
11111********************11111
s36.w12.r4
111111111111************************
******111111111111******************
************111111111111************
******************111111111111******
************************111111111111
111111******111111******************
******111111******111111************
************111111******111111******
******************111111******111111
111111************111111************
******111111************111111******
************111111************111111
111111******************111111******
******111111******************111111
111111************************111111
s41.w13.r4
1111111*111111***************************
*******1111111*111111********************
**************1111111*111111*************
*********************1111111*111111******
1111111********111111********************
*******1111111********111111*************
**************1111111********111111******
1111111***************111111*************
*******1111111***************111111******
**************1111111**************111111
1111111**********************111111******
1111111****************************111111
*******1111111*********************111111
*********************1111111*******111111
****************************1111111111111
s42.w12.r4
111111111111******************************
************111111111111******************
************************111111111111******
******************************111111111111
************************111111******111111
111111******111111************************
111111************111111******************
******111111111111************************
******111111******111111******************
s45.w12-13.r4
1111111111111********************************
*************1111111111111*******************
**************************1111111111111******
********************************1111111111111
**************************111111*******111111
111111*******111111**************************
111111*************1111111*******************
******1111111111111**************************
******1111111******111111********************
s46.w12-13.r4
1111111111111*********************************
*************1111111111111********************
**************************1111111111111*******
*********************************1111111111111
**************************1111111******111111*
1111111******111111***************************
1111111*************111111********************
*******1111111111111**************************
*******111111*******111111********************
s49.w13.r4
1111111111111************************************
**************1111111111111**********************
****************************1111111111111********
***********************************1111111111111*
****************************1111111*******111111*
1111111*******111111*****************************
1111111**************111111**********************
*******1111111111111*****************************
*******1111111*******111111**********************
s65.w13.r4
1111111111111****************************************************
*************1111111111111***************************************
**************************1111111111111**************************
***************************************1111111111111*************
****************************************************1111111111111
s78.w13.r5
1111111111111*****************************************************************
*************1111111111111****************************************************
**************************1111111111111***************************************
***************************************1111111111111**************************
****************************************************1111111111111*************
*****************************************************************1111111111111
s91.w13.r6
1111111111111******************************************************************************
*************1111111111111*****************************************************************
**************************1111111111111****************************************************
***************************************1111111111111***************************************
****************************************************1111111111111**************************
*****************************************************************1111111111111*************
******************************************************************************1111111111111
s23.w11.z2
11111111111************
************11111111111
s24.w12.z4
1111*1111*1111**********
1111*1111******1111*****
1111*1111***********1111
1111******1111*1111*****
1111******1111******1111
1111***********1111*1111
*****1111*1111*1111*****
*****1111*1111******1111
*****1111******1111*1111
**********1111*1111*1111
s25.w12-13.z4
11111*1111*1111**********
11111*1111******1111*****
11111*1111***********1111
11111******1111*1111*****
11111******1111******1111
11111***********1111*1111
******1111*1111*1111*****
******1111*1111******1111
******1111******1111*1111
***********1111*1111*1111
s34.w13.z4
11111*11111111********************
***************1111*111111111*****
********************111111111*1111
*********11111*1111***********1111
11111**********1111***********1111
s38.w12.z4
111111111111**************************
*************111111111111*************
**************************111111111111
s41.w13.z4
1111111111111****************************
**************1111111111111**************
****************************1111111111111
s29.w12.z6
1111*1111*1111***************
1111*1111******1111**********
1111*1111***********1111*****
1111*1111****************1111
1111******1111*1111**********
1111******1111******1111*****
1111******1111***********1111
1111***********1111*1111*****
1111***********1111******1111
1111****************1111*1111
*****1111*1111*1111**********
*****1111*1111******1111*****
*****1111*1111***********1111
*****1111******1111*1111*****
*****1111******1111******1111
*****1111***********1111*1111
**********1111*1111*1111*****
**********1111*1111******1111
**********1111******1111*1111
***************1111*1111*1111
s34.w12.z6
111111*111111*********************
111111********111111**************
111111***************111111*******
111111**********************111111
*******111111*111111**************
*******111111********111111*******
*******111111***************111111
**************111111*111111*******
**************111111********111111
*********************111111*111111
s35.w12-13.z6
1111111*111111*********************
1111111********111111**************
1111111***************111111*******
1111111**********************111111
********111111*111111**************
********111111********111111*******
********111111***************111111
***************111111*111111*******
***************111111********111111
**********************111111*111111
s39.w13.z6
1111111*111111*************************
1111111*********111111*****************
1111111*****************111111*********
1111111*************************111111*
********1111111*111111*****************
********1111111*********111111*********
********1111111*****************111111*
****************1111111*111111*********
****************1111111*********111111*
************************1111111*111111*
s47.w11.z6
11111111111************************************
************11111111111************************
************************11111111111************
************************************11111111111
s48.w12.z6
111111************************************111111
111111*111111***********************************
*******111111*111111****************************
**************111111*111111*********************
*********************111111*111111**************
****************************111111*111111*******
***********************************111111*111111
s41.w12.z8
111111*111111****************************
111111********111111*********************
111111***************111111**************
111111**********************111111*******
111111*****************************111111
*******111111*111111*********************
*******111111********111111**************
*******111111***************111111*******
*******111111**********************111111
**************111111*111111**************
**************111111********111111*******
**************111111***************111111
*********************111111*111111*******
*********************111111********111111
****************************111111*111111
s48.w12.z8
111111*111111***********************************
**************111111*111111*********************
****************************111111*111111*******
***********************************111111*111111
****************************111111********111111
111111********111111****************************
111111***************111111*********************
*******111111*111111****************************
*******111111********111111*********************
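As a reading aid for the notation above, the following Python sketch parses a label, measures seed weight, and applies a column swap. The function names `parse_label`, `seed_weight`, and `swap_columns` are illustrative and do not come from the patent, and reading a label such as `w10-12` as a range of weights is an assumption.

```python
import re

def parse_label(label):
    """Split a label such as 's22.w10-12.r2' into its parts (illustrative helper)."""
    m = re.fullmatch(r"s(\d+)\.w(\d+)(?:-(\d+))?\.([rz])(\d+)", label)
    if m is None:
        raise ValueError(f"unrecognized label: {label}")
    length, w_lo, w_hi, space, mismatches = m.groups()
    w_lo = int(w_lo)
    w_hi = int(w_hi) if w_hi else w_lo  # assumption: 'w10-12' denotes a weight range
    return {"length": int(length), "weights": (w_lo, w_hi),
            "space": space, "mismatches": int(mismatches)}

def seed_weight(seed):
    """Weight of a seed: the number of match codes ('1')."""
    return seed.count("1")

def swap_columns(seeds, i, j):
    """Swap columns i and j in every seed of a combination (one column permutation step)."""
    swapped = []
    for seed in seeds:
        cells = list(seed)
        cells[i], cells[j] = cells[j], cells[i]
        swapped.append("".join(cells))
    return swapped

# The s15.w11.r1 combination copied from the list above.
S15_W11_R1 = [
    "*11111111111***",
    "****11111111111",
    "1111****1111111",
    "11111111****111",
]
```

A column swap changes where the match codes sit but leaves every seed's length and weight unchanged, which is easy to check with `seed_weight` before and after calling `swap_columns`.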
For short sequences output by the DNA sequencer denoted by s#.w#.z#, the process of mapping back to the genome is accelerated with the SSE instruction set of Intel chips. The pseudocode is as follows:
Input: short sequence R = R_0...R_{L-1} (L is the sequence length, R_i ∈ {0, 1, 2, 3}), whose adaptor is Ada(R) ∈ {A, C, G, T};
reference sequence G (G_i ∈ {A, C, G, T});
Output: the number of mismatches under the best alignment of R and G
int min_mismatch(R, G)
/*
Let F(A)=0, F(C)=1, F(G)=2, F(T)=3 map letters to numbers;
Let X be a 128-bit integer composed of four 32-bit integers X_0 X_1 X_2 X_3, where each X_i is in turn composed of four 8-bit integers X_i^0 X_i^1 X_i^2 X_i^3;
Let T be a 128-bit integer of the same length as X;
Let Y[0...3] and Z[0...3] be arrays of 128-bit integers of the same length as X, defined as follows:
Y[0] = 0001 0001 0001 0000; (128 bits in total)
Y[1] = 0001 0001 0000 0001;
Y[2] = 0001 0000 0001 0001;
Y[3] = 0000 0001 0001 0001;
Z[0] = 0111 1011 1101 1110;
Z[1] = 1011 0111 1110 1101;
Z[2] = 1101 1110 0111 1011;
Z[3] = 1110 1101 1011 0111;
*/
X = Y[F(Ada(R))];
for i from 0 to L-1
    T_0^0 = X_0^0; T_0^1 = X_1^0; T_0^2 = X_2^0; T_0^3 = X_3^0; // PACKSSDW, PACKSSWB
    T_1 = T_0; T_2 = T_0; T_3 = T_0; // PUNPCKLDQ
    X += Z[R_i]; // PADDB
    X_k^0 = min{X_k^0, X_k^1, X_k^2, X_k^3} for k = 0...3; // PMINUB
    X += Y[G_i]; // PADDB
done
return min{X_0^0, X_1^0, X_2^0, X_3^0}; // PMINUB
end
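The pseudocode above can be read as a small dynamic program over the four possible decoded bases. The following Python sketch is a scalar rendering of that reading, not the SSE implementation. It rests on two assumptions: that the Z tables encode the usual color rule (the color of two adjacent bases is the XOR of their 2-bit codes, which is consistent with the positions of the zeros in Z[0...3]), and that the broadcast copy T of the previous costs is what gets combined with Z[R_i] in each step.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # the mapping F from the pseudocode

def min_mismatch(colors, adaptor, reference):
    """Scalar sketch of the color-space comparison.

    colors    -- list of ints in {0, 1, 2, 3}, the color-space read R
    adaptor   -- adaptor base Ada(R), one of 'A', 'C', 'G', 'T'
    reference -- string over 'ACGT' with the same length as colors
    """
    if len(colors) != len(reference):
        raise ValueError("read and reference must have the same length")
    # cost[b]: best cost so far if the current decoded base is b (X = Y[F(Ada(R))]).
    cost = [0 if b == BASE_CODE[adaptor] else 1 for b in range(4)]
    for color, ref_base in zip(colors, reference):
        g = BASE_CODE[ref_base]
        new_cost = []
        for b in range(4):
            # Z[color] term: 0 if previous base j followed by b yields the read color.
            best = min(cost[j] + (0 if (j ^ b) == color else 1) for j in range(4))
            # Y[g] term: 0 if the decoded base equals the reference base.
            new_cost.append(best + (0 if b == g else 1))
        cost = new_cost
    return min(cost)
```

Under these assumptions, a read whose colors agree with the reference yields 0, and either a single color error or a single base difference yields 1. The SSE version keeps the same 4x4 table of byte counters in one 128-bit register so that each step costs a handful of instructions.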
Technical effects of the invention:
The invention uses "full-length spaced seeds". A full-length spaced seed has the same length as the sequence to be indexed, so each sequence produces only one record in the index, and the number of index entries equals the number of sequences. By using a combination of several full-length spaced seeds (whose match-code positions differ), the invention guarantees 100% recall. The fewer seeds a combination contains, the less memory and running time it requires.
Verification shows that all spaced seed combinations of the invention meet the 100% recall requirement. The verification method is as follows:
Suppose the genome to which the short sequence is mapped contains a segment of the following kind: the segment has no more than the maximum number of mismatches m, and for every spaced seed s in the spaced seed combination S of the invention, at least one of the mismatch positions falls on a match-code position of s, so that no seed in S can detect this mismatch pattern. If exhaustive enumeration shows that no such segment exists, the spaced seed combination of the invention is verified to have 100% recall for the mismatch number m.
The program pseudocode is as follows:
Input: full-length spaced seed combination S; maximum number of mismatches m;
Output: empty if S can detect every segment with at most m mismatches; otherwise, one mismatch pattern that cannot be detected.
position_set verify_seed_set(seed_set S, int m)
P = {}; /* P is the mismatch pattern */
T = S;
/* set the first mismatch position */
for p in oneset(T[1]) /* oneset(T[1]) is the set of match-code positions of the first seed in T */
    P += p;
    T = {x | x in S and x[P] == *}; /* T is the subset of remaining seeds whose match-code positions have not yet been hit by P */
    if (T == {}) return P;
    /* set the second mismatch position */
    for p in oneset(T[1])
        P += p;
        T = {x | x in S and x[P] == *};
        if (T == {}) return P;
        ......
        /* set the m-th mismatch position */
        for p in oneset(T[1])
            P += p;
            T = {x | x in S and x[P] == *};
            if (T == {}) return P;
return NULL; /* all patterns of up to m mismatches have been tested; S has 100% recall */
end
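The nested loops of the verification pseudocode can be written as a recursion. The Python sketch below is one possible rendering under that reading; seeds are given as strings of `1` and `*`, and the helper name `search` is illustrative. It returns `None` where the pseudocode returns NULL.

```python
def verify_seed_set(seeds, max_mismatches):
    """Return None if every pattern of at most max_mismatches mismatch positions
    leaves some seed untouched; otherwise return one undetectable pattern."""

    def search(pattern, remaining):
        # Seeds whose match codes avoid every mismatch position chosen so far
        # (the set T in the pseudocode).
        alive = [s for s in seeds if all(s[p] == "*" for p in pattern)]
        if not alive:
            return sorted(pattern)  # every seed is hit, so recall is below 100%
        if remaining == 0:
            return None             # mismatch budget spent and a seed still detects
        # To defeat the first surviving seed, a mismatch must land on one of
        # its match-code positions; try each of them in turn.
        for p, code in enumerate(alive[0]):
            if code == "1":
                found = search(pattern + [p], remaining - 1)
                if found is not None:
                    return found
        return None

    return search([], max_mismatches)

# The s26.w13.r1 combination from the list: two seeds of length 26 and weight 13.
S26_W13_R1 = ["1" * 13 + "*" * 13, "*" * 13 + "1" * 13]
```

For `S26_W13_R1`, one mismatch can hit only one of the two seeds, so the check passes for m = 1; with m = 2 the search finds a pattern with one mismatch under each seed.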
The method above can verify the recall of the full-length spaced seed combinations for the short sequences output by the AB SOLiD sequencer, as well as the recall of the full-length spaced seed combinations for the short sequences produced by the Illumina sequencer. Experimental verification confirms that all spaced seed combinations of the invention have 100% recall.
Description of the drawings
Fig. 1 is a flowchart of the method of the invention.
Fig. 2 is an example of short sequences produced by the Illumina sequencer.
Fig. 3 is an example of short sequences produced by the AB SOLiD sequencer.
Fig. 4 shows the optimal spaced seed combination of weight 13 required for 100% recall with two mismatches allowed at length 33; "1" marks a match-code position and "*" marks a wildcard position.
Fig. 5 is a schematic diagram of encoding the short sequence ACGTAT into two codes of equal length.
Embodiment
The technical solution of the invention is described in detail below with reference to the drawings.
Fig. 1 shows an embodiment of the method of the invention for mapping the short sequences of a DNA sequencer back to a genome.
First, the short-sequence files produced by the DNA sequencer are imported. This embodiment handles the short sequences produced by two kinds of DNA sequencer, that is, it imports the short-sequence files produced by the Illumina sequencer and the AB SOLiD sequencer. In this step the short sequences are identified in preparation for mapping. The data file formats produced by the Illumina GA sequencer are *_seq.txt, *_qual.txt, FASTA, and FASTQ. The data file formats produced by the AB SOLiD sequencer are *.csfasta and *.qual. The files contain the name, sequence, or sequencing quality scores of each short sequence. Because the sequencing technologies differ, the alphabet of the short sequences produced by the Illumina sequencer is the four nucleotides {A, C, T, G}, whereas the alphabet of the short sequences produced by the AB sequencer is the four colors {0, 1, 2, 3} of color space.
An example of the short sequences produced by the Illumina sequencer is shown in Fig. 2; they are sequences over {A, C, G, T, N}. An example of the short sequences produced by the AB SOLiD sequencer is shown in Fig. 3; they are sequences over the four colors {0, 1, 2, 3}. Each pair of adjacent bases forms one color.
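To make the relation between bases and colors concrete, the following Python sketch converts a base sequence to color space. It assumes the usual two-base encoding in which, with A=0, C=1, G=2, T=3, the color of two adjacent bases is the XOR of their codes; the text above states only that adjacent bases form a color, so this rule and the default adaptor base `T` are assumptions made for illustration. The function name `to_color_space` is likewise illustrative.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_color_space(bases, adaptor="T"):
    """Encode a base sequence as colors, one color per adjacent base pair.

    The first color is formed by the adaptor base and the first base of the
    sequence (assumed convention)."""
    previous = BASE_CODE[adaptor]
    colors = []
    for base in bases:
        current = BASE_CODE[base]
        colors.append(previous ^ current)  # assumed rule: color = XOR of the 2-bit codes
        previous = current
    return colors
```

Under this rule, a repeated base always gives color 0, and changing a single base changes the two colors that involve it.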
Second, the short sequences or the genome are indexed with a full-length discrete seed combination, and the candidate match set is filtered out. According to the length of the short sequence, this step selects, from the aforementioned discrete seed combinations of the present invention, one full-length discrete seed combination of the same length (that is, the discrete seeds needed for that length) and uses it to index the short sequences or the genome, so that the number of indexes is minimized. Each discrete seed in the combination consists of some matching codes and wildcards. When a short sequence output by the DNA sequencer is compared with the genome, the discrete seed has the same length as the short sequence (called sequence 1 here), and a window of the same length as the short sequence is taken from the genome starting at its first position (called sequence 2 here). Sequence 1 and sequence 2 are compared on the basis of the discrete seed: each position of the seed is aligned with the corresponding position of sequence 1 and of sequence 2, and if sequence 1 and sequence 2 have identical values at every position that corresponds to a matching code (positions corresponding to wildcards are ignored), sequence 1 and sequence 2 are provisionally considered a match. Next, a window of the same length as sequence 1 is taken starting at the second position of the genome (called sequence 3 here), and the same comparison is performed to decide whether to record it in the candidate match set. This continues until the whole genome has been compared. Concretely, a hash value is computed from the characters of sequence 1 and of sequence 2 at the matching-code positions of the discrete seed and stored in an index table; after the whole genome has been processed, the genome sequences whose index entries are identical to that of sequence 1 are looked up in the index table and form the candidate match set. The subsequent, more accurate sequence comparison is carried out only within the candidate match set, which reduces the amount of computation.
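The indexing and filtering step can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it indexes the genome (rather than the reads), uses a Python dictionary as the hash table, and the names `seed_key`, `build_index` and `candidates` are hypothetical.

```python
from collections import defaultdict


def seed_key(window: str, seed: str) -> str:
    """Keep only the characters at matching-code ('1') positions of the seed."""
    return "".join(c for c, s in zip(window, seed) if s == "1")


def build_index(genome: str, seeds: list[str]) -> list[dict]:
    """One hash table per seed: key at matching positions -> genome start positions."""
    length = len(seeds[0])
    index = [defaultdict(list) for _ in seeds]
    for pos in range(len(genome) - length + 1):
        window = genome[pos:pos + length]
        for table, seed in zip(index, seeds):
            table[seed_key(window, seed)].append(pos)
    return index


def candidates(read: str, seeds: list[str], index: list[dict]) -> list[int]:
    """Genome positions that agree with the read on all matching codes of some seed."""
    hits = set()
    for table, seed in zip(index, seeds):
        hits.update(table.get(seed_key(read, seed), []))
    return sorted(hits)


if __name__ == "__main__":
    seeds = ["1111****", "****1111"]  # toy seeds of length 8, not taken from the patent
    genome = "ACGTACGTTTGACCA"
    index = build_index(genome, seeds)
    print(candidates("TACGTTTA", seeds, index))  # read with one mismatch at its last base
```

Only the candidate positions returned here would then be passed on to the more accurate comparison.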
Indexing the short sequences or the genome only requires building an index over the matching-code positions of the discrete seeds. Because of the wildcard positions, fewer indexes need to be built, which improves efficiency.
Fig. 4 shows the optimal discrete seed combination of weight 13 needed to reach 100% recall for length 33 when two mismatches are allowed.
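Whether a seed combination reaches 100% recall for a given number of mismatches can be checked by brute force: for every possible placement of the mismatches, at least one seed must have wildcards at all mismatch positions. The sketch below is a hypothetical checker (the function name `has_full_recall` is an assumption); the example seeds are the s33.w13.r2 group listed in the claims, written with string repetition to keep the lengths explicit.

```python
from itertools import combinations


def has_full_recall(seeds: list[str], mismatches: int) -> bool:
    """True if every placement of `mismatches` mismatches is missed by some seed's matching codes."""
    length = len(seeds[0])
    for bad in combinations(range(length), mismatches):
        # A seed still hits when all mismatch positions fall on its wildcards.
        if not any(all(seed[p] == "*" for p in bad) for seed in seeds):
            return False
    return True


# The s33.w13.r2 group from the claims (length 33, weight 13, two mismatches).
S33_W13_R2 = [
    "*" * 7 + "1" * 13 + "*" * 13,
    "*" * 20 + "1" * 13,
    "1" * 13 + "*" * 20,
    "1" * 7 + "*" * 6 + "1" * 6 + "*" * 14,
]

if __name__ == "__main__":
    print(has_full_recall(S33_W13_R2, 2))
```

The check enumerates C(33, 2) = 528 placements here, so it is cheap for seed lengths of this size; it says nothing about whether the combination is the smallest possible one.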
Third, a parallel bit-operation algorithm and the SSE instruction set of Intel chips are used to perform fast sequence alignment on the candidate matches; that is, each short sequence is aligned against the set of genomic fragments at the candidate mapping positions filtered out in the previous step. This step differs between sequencers.
For the nucleotide-space short sequences produced by Illumina sequencers: the short sequence and the genomic fragment are encoded, with the alphabet Σ = {A, C, G, T} of both encoded as {00, 01, 10, 11}. The parallel bit-operation method is then used to complete the comparison quickly.
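A minimal sketch of this two-bit encoding, split into a high-bit word and a low-bit word, is given below. The bit order (letter i stored in bit i, least significant bit first), the restriction to at most 64 letters, and the name `encode` are assumptions for illustration; the letter N is not handled.

```python
# Two-bit codes as given in the text: A=00, C=01, G=10, T=11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> tuple[int, int]:
    """Return (upper, lower): bit i of each word is the high/low code bit of letter i."""
    if len(seq) > 64:
        raise ValueError("sketch assumes the sequence fits in one 64-bit word")
    upper = lower = 0
    for i, ch in enumerate(seq):
        code = CODE[ch]
        upper |= (code >> 1) << i
        lower |= (code & 1) << i
    return upper, lower


if __name__ == "__main__":
    upper, lower = encode("ACGTAT")
    print(format(upper, "06b"), format(lower, "06b"))
```

With both sequences stored this way, a whole read can be compared against a genomic fragment with a handful of word-wide XOR and OR operations.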
For the color-space short sequences produced by AB sequencers, the steps are as follows: the short sequence and the genomic fragment are encoded, with the color-space alphabet {0, 1, 2, 3} of both encoded as {00, 01, 10, 11}. The parallel bit-operation method is used to screen out candidate positions at which the distance between the short sequence and the genomic fragment exceeds twice the threshold. The SSE instruction set of Intel chips is used to make this comparison fast.
The parallel bit-operation algorithm is as follows. Fig. 5 shows how the short sequence r = ACGTAT is encoded: r is encoded as two binary strings rU and rD of length m, each stored in one 64-bit machine word. The i-th bit of rU is the high bit of the code of the i-th letter of r, and the i-th bit of rD is the low bit of the code of the i-th letter of r. The genomic fragment g_j is likewise encoded as two binary strings gU and gD of length m. The parallel bit manipulation uses the algorithm for counting the number of 1 bits in a binary string from (Warren, H.S., Hacker's Delight. 2002: Addison-Wesley Longman Publishing Co., Inc. 368.). The pseudo-code is as follows:
Input: candidate pair (g, r) to verify, mismatch threshold k
Output: whether the candidate pair passes verification
bool verify(g, r, k)
/* let gU/gD and rU/rD be the binary encodings of g and r, respectively */
vector = (gU ^ rU) | (gD ^ rD);
vector = (vector & 0x5555555555555555) + ((vector >> 1) & 0x5555555555555555);
vector = (vector & 0x3333333333333333) + ((vector >> 2) & 0x3333333333333333);
vector = (vector + (vector >> 4)) & 0x0F0F0F0F0F0F0F0F;
vector = (vector + (vector >> 8)) & 0x00FF00FF00FF00FF;
vector = (vector + (vector >> 16)) & 0x0000FFFF0000FFFF;
mismatch = (vector + (vector >> 32)) & 0x00000000FFFFFFFF;
return (mismatch <= k);
end
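The verification pseudo-code above can be tried out with the following runnable sketch. It takes the two bit planes of each sequence as plain integers; the tuple layout `(upper, lower)` and the helper name `popcount64` are assumptions for illustration. A position mismatches when either its high bits or its low bits differ, so the number of 1 bits in the combined difference word is the mismatch count.

```python
MASK64 = (1 << 64) - 1


def popcount64(v: int) -> int:
    """Count the 1 bits of a 64-bit word with the divide-and-conquer steps of the pseudo-code."""
    v &= MASK64
    v = (v & 0x5555555555555555) + ((v >> 1) & 0x5555555555555555)
    v = (v & 0x3333333333333333) + ((v >> 2) & 0x3333333333333333)
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0F
    v = (v + (v >> 8)) & 0x00FF00FF00FF00FF
    v = (v + (v >> 16)) & 0x0000FFFF0000FFFF
    return (v + (v >> 32)) & 0x00000000FFFFFFFF


def verify(g_planes: tuple[int, int], r_planes: tuple[int, int], k: int) -> bool:
    """True if the encoded fragment g and read r differ at no more than k positions."""
    (g_upper, g_lower), (r_upper, r_lower) = g_planes, r_planes
    diff = (g_upper ^ r_upper) | (g_lower ^ r_lower)
    return popcount64(diff) <= k


if __name__ == "__main__":
    read = (0b101100, 0b101010)      # ACGTAT, letter i stored in bit i (assumed layout)
    fragment = (0b111100, 0b111010)  # ACGTTT, differs from the read at one position
    print(verify(fragment, read, 1), verify(fragment, read, 0))
```

Because the count is taken over a whole machine word at once, the cost of verifying one candidate does not depend on where the mismatches fall.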
The alignment of AB SOLiD short sequences uses the verification method of (Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-60.). That dynamic programming algorithm is too time-consuming, so the present invention uses the SSE instruction set of Intel chips to accelerate it.
The pseudo-code is as follows (the SSE instructions used are listed alongside):
Input: short sequence R = R[0..L-1] (L is the sequence length, R[i] ∈ {0, 1, 2, 3}), whose adaptor is Ada(R) ∈ {A, C, G, T};
Reference sequence G (G[i] ∈ {A, C, G, T});
Output: the number of mismatches under the best alignment of R and G
int min_mismatch(R, G)
/*
Let F(A)=0, F(C)=1, F(G)=2, F(T)=3 map letters to numbers;
Let X be a 128-bit integer composed of four 32-bit integers X[0] X[1] X[2] X[3]; each X[j] is in turn composed of four 8-bit
integers X[j][0] X[j][1] X[j][2] X[j][3];
Let T be a 128-bit integer like X;
Let Y[0..3] and Z[0..3] be two arrays of 128-bit integers like X, defined as follows:
Y[0] = 0001 0001 0001 0000; (128 bits in total)
Y[1] = 0001 0001 0000 0001;
Y[2] = 0001 0000 0001 0001;
Y[3] = 0000 0001 0001 0001;
Z[0] = 0111 1011 1101 1110;
Z[1] = 1011 0111 1110 1101;
Z[2] = 1101 1110 0111 1011;
Z[3] = 1110 1101 1011 0111;
*/
X = Y[F(Ada(R))];
for i from 0 to L-1
    T[0][0] = X[0][0]; T[0][1] = X[1][0]; T[0][2] = X[2][0]; T[0][3] = X[3][0]; // PACKSSDW, PACKSSWB
    T[1] = T[0]; T[2] = T[0]; T[3] = T[0]; // PUNPCKLDQ
    X += Z[R[i]]; // PADDB
    X[j][0] = min{X[j][0], X[j][1], X[j][2], X[j][3]} for j = 0..3; // PMINUB
    X += Y[G[i]]; // PADDB
done
return min{X[0][0], X[1][0], X[2][0], X[3][0]}; // PMINUB
end
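For readers who want to follow the recurrence without the SSE register layout, here is a scalar sketch of one reading of the pseudo-code above. It rests on several assumptions that the text does not state explicitly: that the value T broadcast from the previous step is the one to which Z[R[i]] is added, that Z[c] charges 1 when the color of a base pair (taken as the XOR of the base codes) differs from the read color c, and that Y[b] charges 1 for every base other than b. Under this reading the result is the minimum total of color mismatches plus base mismatches over all decoded base sequences.

```python
# Letter-to-number map F from the pseudo-code.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def min_mismatch(colors: list[int], adaptor: str, reference: str) -> int:
    """Scalar dynamic program: cost[b] is the best cost of a decoding that ends in base b."""
    # Corresponds to X = Y[F(Ada(R))]: every start base except the adaptor costs 1.
    cost = [0 if b == BASE_CODE[adaptor] else 1 for b in range(4)]
    for color, ref_base in zip(colors, reference):
        g = BASE_CODE[ref_base]
        cost = [
            # Assumed Z term: 1 if the transition a -> b does not produce the read color.
            min(cost[a] + int((a ^ b) != color) for a in range(4))
            # Assumed Y term: 1 if the decoded base b differs from the reference base.
            + int(b != g)
            for b in range(4)
        ]
    return min(cost)


if __name__ == "__main__":
    # Adaptor T followed by reference ACGT gives the colors 3, 1, 3, 1 under the XOR rule.
    print(min_mismatch([3, 1, 3, 1], "T", "ACGT"))
    print(min_mismatch([3, 1, 0, 1], "T", "ACGT"))
```

The 16 byte lanes of the 128-bit register in the pseudo-code hold the 4 x 4 table of previous base against current base that this loop evaluates one entry at a time.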
Fourth, the mapping of the short sequences is completed and a file of their mapped positions is output: for each short sequence, the genome position with the fewest mismatches is chosen as its mapped position, and this position information is written to a file for use by downstream data analysis. The output includes the name of the short sequence, its sequence, its sequencing quality score, the mapped position, the number of mismatches, and the mismatch positions. For AB SOLiD short sequences, it also includes converting the color-space sequence into a nucleotide sequence according to the matched position.
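The selection of the mapped position and a simple color-to-nucleotide conversion can be sketched as below. Both functions are hypothetical illustrations: ties between positions with equal mismatch counts are broken by the lower position, which is an assumption, and `decode_colors` performs a plain error-free decoding from the adaptor base under the XOR color rule rather than the position-guided conversion mentioned in the text, whose details are not given here.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"


def best_position(candidates: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the (position, mismatches) pair with the fewest mismatches (lowest position on ties)."""
    return min(candidates, key=lambda c: (c[1], c[0]))


def decode_colors(adaptor: str, colors: list[int]) -> str:
    """Decode colors to bases, assuming no color errors and color = XOR of adjacent base codes."""
    prev = BASE_CODE[adaptor]
    bases = []
    for color in colors:
        prev ^= color
        bases.append(LETTER[prev])
    return "".join(bases)


if __name__ == "__main__":
    print(best_position([(100, 2), (42, 1), (7, 3)]))
    print(decode_colors("T", [3, 1, 3, 1]))
```

A single wrong color changes every base after it in this naive decoding, which is why a conversion guided by the matched genome position is preferable in practice.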

Claims (2)

  1. A method for mapping short sequences from a DNA sequencer back to a genome, characterized in that it comprises the following steps:
    indexing the short sequences produced by the DNA sequencer and the genome with a full-length discrete seed combination, so as to filter out the set of positions to which they may map;
    wherein a full-length discrete seed is a code string of the same length as the short sequence, composed of some matching codes and wildcards; a matching code denotes a position at which the short sequence and the genome need to be compared, and a wildcard denotes a position at which the short sequence and the genome do not need to be compared; the full-length discrete seed combination is the combination that requires the minimum number of full-length discrete seeds for mapping the short sequence back to the genome with 100% recall;
    the full-length discrete seed combinations for the various cases are as follows; each s#.w#.r# or s#.w#.z# denotes one group of full-length discrete seeds, where # denotes a number: the number after s is the seed length, the number after w is the seed weight, the number after r is the number of mismatches that can be tolerated with 100% recall, and the number after z is the number of color-space mismatches that can be tolerated with 100% recall; r and z correspond respectively to the short sequences output by the two kinds of DNA sequencers; 1 denotes a matching code and * denotes a wildcard; the full-length discrete seed combinations also include the result of any column permutation of any of the following combinations:
    s15.w11.r1
    *11111111111***
    ****11111111111
    1111****1111111
    11111111****111
    s17.w11.r1
    *11111111111*****
    ******11111111111
    111111******11111
    s18.w13.r1
    1111111111111*****
    *****1111111111111
    111111111****1111*
    11111****11111111*
    s20.w13.r1
    1111111111111*******
    *******1111111111111
    *111111******1111111
    s26.w13.r1
    1111111111111*************
    *************1111111111111
    s14.w8.r2
    11**111111****
    **11**111111**
    ****11**111111
    111111****11**
    **111111****11
    1111****11**11
    11****11**1111
    s16.w8.r2
    11111111********
    ****11111111****
    ********11111111
    1111****1111****
    ****1111****1111
    1111********1111
    s17.w9.r2
    111**111111******
    ***111**111111***
    ******111**111111
    11111******1**111
    11111*11****11***
    111******1**11111
    ***111*1*11***111
    s18.w9.r2
    111111***111******
    ***111111***111***
    ******111111***111
    111******111111***
    ***111******111111
    111***111******111
    s19.w9.r2
    111111111**********
    ****111111111******
    **********111111111
    1111*****11111*****
    *****1111*****11111
    1111**********11111
    s20.w10.r2
    11111*****11111*****
    *****1111111111*****
    11111**********11111
    *****11111*****11111
    1111111111**********
    **********1111111111
    s22.w10-12.r2
    11111111111***********
    *****11111111111******
    ***********11111111111
    11111******11111******
    *****111111*****111111
    11111***********111111
    s23.w11.r2
    11111111111************
    *****11111111111*******
    ************11111111111
    11111******111111******
    ******11111******111111
    11111************111111
    s24.w12.r2
    111111111111************
    ************111111111111
    111111******111111******
    111111************111111
    ******111111111111******
    ******111111******111111
    s26.w12-13.r2
    1111111111111*************
    ******1111111111111*******
    *************1111111111111
    111111*******111111*******
    ******1111111******111111*
    111111*************1111111
    s27.w13.r2
    1111111111111**************
    ******1111111111111********
    **************1111111111111
    111111*******1111111*******
    *******111111*******1111111
    111111**************1111111
    s28.w13.r2
    111*1**111*1**111*1**1******
    *111*1**111*1**111*1**1*****
    **111*1**111*1**111*1**1****
    ***111*1**111*1**111*1**1***
    ****111*1**111*1**111*1**1**
    *****111*1**111*1**111*1**1*
    ******111*1**111*1**111*1**1
    s30.w13.r2
    1111111111111*****************
    *************1111111111111****
    *****************1111111111111
    ********111111111*********1111
    11111********1111*********1111
    s32.w13.r2
    1111111111111*******************
    111111*******1111111************
    111111**************1111111*****
    ******1111111111111*************
    *******************1111111111111
    s33.w13.r2
    *******1111111111111*************
    ********************1111111111111
    1111111111111********************
    1111111******111111**************
    s39.w13.r2
    1111111111111**************************
    *************1111111111111*************
    **************************1111111111111
    s25.w10.r3
    1111111111***************
    *****1111111111**********
    **********1111111111*****
    ***************1111111111
    11111*****11111**********
    *****11111*****11111*****
    **********11111*****11111
    11111**********11111*****
    *****11111**********11111
    11111***************11111
    s30.w12.r3
    111111111111******************
    ******111111111111************
    ************111111111111******
    ******************111111111111
    111111******111111************
    ******111111******111111******
    ************111111******111111
    111111************111111******
    ******111111************111111
    111111******************111111
    s34.w13.r3
    1111111111111*********************
    *******1111111111111**************
    **************1111111111111*******
    *********************1111111111111
    1111111*******111111**************
    *******1111111*******111111*******
    **************1111111*******111111
    1111111**************111111*******
    *******1111111**************111111
    1111111*********************111111
    s36.w13.r3
    111*1**1***111*1**1***111***********
    *111*1**1***111*1**1***111**********
    **111*1**1***111*1**1***111*********
    ***111*1**1***111*1**1***111********
    ****111*1**1***111*1**1***111*******
    *****111*1**1***111*1**1***111******
    ******111*1**1***111*1**1***111*****
    *******111*1**1***111*1**1***111****
    ********111*1**1***111*1**1***111***
    *********111*1**1***111*1**1***111**
    **********111*1**1***111*1**1***111*
    s42.w12.r3
    111111111111******************************
    ******111111111111************************
    ************111111111111******************
    ******************111111111111************
    ************************111111111111******
    ******************************111111111111
    111111******************************111111
    s46.w13.r3
    1111111111111*********************************
    *******1111111111111**************************
    ********************1111111111111*************
    *********************************1111111111111
    1111111******111111***************************
    s52.w13.r3
    1111111111111***************************************
    *************1111111111111**************************
    **************************1111111111111*************
    ***************************************1111111111111
    s30.w10.r4
    1111111111********************
    *****1111111111***************
    **********1111111111**********
    ***************1111111111*****
    ********************1111111111
    11111*****11111***************
    *****11111*****11111**********
    **********11111*****11111*****
    ***************11111*****11111
    11111**********11111**********
    *****11111**********11111*****
    **********11111**********11111
    11111***************11111*****
    *****11111***************11111
    11111********************11111
    s36.w12.r4
    111111111111************************
    ******111111111111******************
    ************111111111111************
    ******************111111111111******
    ************************111111111111
    111111******111111******************
    ******111111******111111************
    ************111111******111111******
    ******************111111******111111
    111111************111111************
    ******111111************111111******
    ************111111************111111
    111111******************111111******
    ******111111******************111111
    111111************************111111
    s41.w13.r4
    1111111*111111***************************
    *******1111111*111111********************
    **************1111111*111111*************
    *********************1111111*111111******
    1111111********111111********************
    *******1111111********111111*************
    **************1111111********111111******
    1111111***************111111*************
    *******1111111***************111111******
    **************1111111**************111111
    1111111**********************111111******
    1111111****************************111111
    *******1111111*********************111111
    *********************1111111*******111111
    ****************************1111111111111
    s42.w12.r4
    111111111111******************************
    ************111111111111******************
    ************************111111111111******
    ******************************111111111111
    ************************111111******111111
    111111******111111************************
    111111************111111******************
    ******111111111111************************
    ******111111******111111******************
    s45.w12-13.r4
    1111111111111********************************
    *************1111111111111*******************
    **************************1111111111111******
    ********************************1111111111111
    **************************111111*******111111
    111111*******111111**************************
    111111*************1111111*******************
    ******1111111111111**************************
    ******1111111******111111********************
    s46.w12-13.r4
    1111111111111*********************************
    *************1111111111111********************
    **************************1111111111111*******
    *********************************1111111111111
    **************************1111111******111111*
    1111111******111111***************************
    1111111*************111111********************
    *******1111111111111**************************
    *******111111*******111111********************
    s49.w13.r4
    1111111111111************************************
    **************1111111111111**********************
    ****************************1111111111111********
    ***********************************1111111111111*
    ****************************1111111*******111111*
    1111111*******111111*****************************
    1111111**************111111**********************
    *******1111111111111*****************************
    *******1111111*******111111**********************
    s65.w13.r4
    1111111111111****************************************************
    *************1111111111111***************************************
    **************************1111111111111**************************
    ***************************************1111111111111*************
    ****************************************************1111111111111
    s78.w13.r5
    1111111111111*****************************************************************
    *************1111111111111****************************************************
    **************************1111111111111***************************************
    ***************************************1111111111111**************************
    ****************************************************1111111111111*************
    *****************************************************************1111111111111
    s91.w13.r6
    1111111111111******************************************************************************
    *************1111111111111*****************************************************************
    **************************1111111111111****************************************************
    ***************************************1111111111111***************************************
    ****************************************************1111111111111**************************
    *****************************************************************1111111111111*************
    ******************************************************************************1111111111111
    s23.w11.z2
    11111111111************
    ************11111111111
    s24.w12.z4
    1111*1111*1111**********
    1111*1111******1111*****
    1111*1111***********1111
    1111******1111*1111*****
    1111******1111******1111
    1111***********1111*1111
    *****1111*1111*1111*****
    *****1111*1111******1111
    *****1111******1111*1111
    **********1111*1111*1111
    s25.w12-13.z4
    11111*1111*1111**********
    11111*1111******1111*****
    11111*1111***********1111
    11111******1111*1111*****
    11111******1111******1111
    11111***********1111*1111
    ******1111*1111*1111*****
    ******1111*1111******1111
    ******1111******1111*1111
    ***********1111*1111*1111
    s34.w13.z4
    11111*11111111********************
    ***************1111*111111111*****
    ********************111111111*1111
    *********11111*1111***********1111
    11111**********1111***********1111
    s38.w12.z4
    111111111111**************************
    *************111111111111*************
    **************************111111111111
    s41.w13.z4
    1111111111111****************************
    **************1111111111111**************
    ****************************1111111111111
    s29.w12.z6
    1111*1111*1111***************
    1111*1111******1111**********
    1111*1111***********1111*****
    1111*1111****************1111
    1111******1111*1111**********
    1111******1111******1111*****
    1111******1111***********1111
    1111***********1111*1111*****
    1111***********1111******1111
    1111****************1111*1111
    *****1111*1111*1111**********
    *****1111*1111******1111*****
    *****1111*1111***********1111
    *****1111******1111*1111*****
    *****1111******1111******1111
    *****1111***********1111*1111
    **********1111*1111*1111*****
    **********1111*1111******1111
    **********1111******1111*1111
    ***************1111*1111*1111
    s34.w12.z6
    111111*111111*********************
    111111********111111**************
    111111***************111111*******
    111111**********************111111
    *******111111*111111**************
    *******111111********111111*******
    *******111111***************111111
    **************111111*111111*******
    **************111111********111111
    *********************111111*111111
    s35.w12-13.z6
    1111111*111111*********************
    1111111********111111**************
    1111111***************111111*******
    1111111**********************111111
    ********111111*111111**************
    ********111111********111111*******
    ********111111***************111111
    ***************111111*111111*******
    ***************111111********111111
    **********************111111*111111
    s39.w13.z6
    1111111*111111*************************
    1111111*********111111*****************
    1111111*****************111111*********
    1111111*************************111111*
    ********1111111*111111*****************
    ********1111111*********111111*********
    ********1111111*****************111111*
    ****************1111111*111111*********
    ****************1111111*********111111*
    ************************1111111*111111*
    s47.w11.z6
    11111111111************************************
    ************11111111111************************
    ************************11111111111************
    ************************************11111111111
    s48.w12.z6
    111111************************************111111
    111111*111111***********************************
    *******111111*111111****************************
    **************111111*111111*********************
    *********************111111*111111**************
    ****************************111111*111111*******
    ***********************************111111*111111
    s41.w12.z8
    111111*111111****************************
    111111********111111*********************
    111111***************111111**************
    111111**********************111111*******
    111111*****************************111111
    *******111111*111111*********************
    *******111111********111111**************
    *******111111***************111111*******
    *******111111**********************111111
    **************111111*111111**************
    **************111111********111111*******
    **************111111***************111111
    *********************111111*111111*******
    *********************111111********111111
    ****************************111111*111111
    s48.w12.z8
    111111*111111***********************************
    **************111111*111111*********************
    ****************************111111*111111*******
    ***********************************111111*111111
    ****************************111111********111111
    111111********111111****************************
    111111***************111111*********************
    *******111111*111111****************************
    *******111111********111111*********************.
  2. The method for mapping short sequences from a DNA sequencer back to a genome according to claim 1, characterized in that the mapping back to the genome of the short sequences output by the DNA sequencer represented by s#.w#.z# is accelerated with the SSE instruction set of Intel chips, with the following pseudo-code:
    Input: short sequence R = R[0..L-1] (L is the sequence length, R[i] ∈ {0, 1, 2, 3}), whose adaptor is Ada(R) ∈ {A, C, G, T};
    Reference sequence G (G[i] ∈ {A, C, G, T});
    Output: the number of mismatches under the best alignment of R and G
    int min_mismatch(R, G)
    /*
    Let F(A)=0, F(C)=1, F(G)=2, F(T)=3 map letters to numbers;
    Let X be a 128-bit integer composed of four 32-bit integers X[0] X[1] X[2] X[3]; each X[j] is in turn composed of four 8-bit
    integers X[j][0] X[j][1] X[j][2] X[j][3];
    Let T be a 128-bit integer of the same length as X;
    Let Y[0..3] and Z[0..3] be arrays of 128-bit integers of the same length as X, defined as follows:
    Y[0] = 0001 0001 0001 0000; (128 bits in total)
    Y[1] = 0001 0001 0000 0001;
    Y[2] = 0001 0000 0001 0001;
    Y[3] = 0000 0001 0001 0001;
    Z[0] = 0111 1011 1101 1110;
    Z[1] = 1011 0111 1110 1101;
    Z[2] = 1101 1110 0111 1011;
    Z[3] = 1110 1101 1011 0111;
    */
    X = Y[F(Ada(R))];
    for i from 0 to L-1
        T[0][0] = X[0][0]; T[0][1] = X[1][0]; T[0][2] = X[2][0]; T[0][3] = X[3][0]; // PACKSSDW, PACKSSWB
        T[1] = T[0]; T[2] = T[0]; T[3] = T[0]; // PUNPCKLDQ
        X += Z[R[i]]; // PADDB
        X[j][0] = min{X[j][0], X[j][1], X[j][2], X[j][3]} for j = 0..3; // PMINUB
        X += Y[G[i]]; // PADDB
    done
    return min{X[0][0], X[1][0], X[2][0], X[3][0]}; // PMINUB
    end.
CN2010105197821A 2010-10-19 2010-10-19 Method for DNA sequencer to reattach short sequence to genome Pending CN102453751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105197821A CN102453751A (en) 2010-10-19 2010-10-19 Method for DNA sequencer to reattach short sequence to genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105197821A CN102453751A (en) 2010-10-19 2010-10-19 Method for DNA sequencer to reattach short sequence to genome

Publications (1)

Publication Number Publication Date
CN102453751A true CN102453751A (en) 2012-05-16

Family

ID=46037447

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105197821A Pending CN102453751A (en) 2010-10-19 2010-10-19 Method for DNA sequencer to reattach short sequence to genome

Country Status (1)

Country Link
CN (1) CN102453751A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN105069325A (en) * 2012-07-28 2015-11-18 盛司潼 Method for matching nucleic acid sequence information
CN106096333A (en) * 2016-06-02 2016-11-09 广州麦仑信息科技有限公司 A kind of method that gene information is carried out visual image expression
WO2018053761A1 (en) * 2016-09-22 2018-03-29 华为技术有限公司 Data processing method and device, and computing node
CN106096333B (en) * 2016-06-02 2019-07-16 广州麦仑信息科技有限公司 A method of gene information is subjected to visual image expression

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030120429A1 (en) * 2001-09-07 2003-06-26 Ming Li Method and system for faster and more sensitive homology searching

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAO LIN ET AL.: "ZOOM! Zillions of oligos mapped", 《BIOINFORMATICS》 *
LUCIAN ILIE ET AL.: "Multiple spaced seeds for homology search", 《BIOINFORMATICS》 *
MING LI ET AL: "PatternHunter II: Highly Sensitive and Fast Homology Search", 《GENOME INFORMATICS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105069325A (en) * 2012-07-28 2015-11-18 盛司潼 Method for matching nucleic acid sequence information
CN105069325B (en) * 2012-07-28 2018-10-09 盛司潼 It is a kind of that matched method is carried out to nucleic acid sequence information
CN103336916A (en) * 2013-07-05 2013-10-02 中国科学院数学与系统科学研究院 Sequencing sequence mapping method and sequencing sequence mapping system
CN103336916B (en) * 2013-07-05 2016-04-06 中国科学院数学与系统科学研究院 A kind of sequencing sequence mapping method and system
CN106096333A (en) * 2016-06-02 2016-11-09 广州麦仑信息科技有限公司 A kind of method that gene information is carried out visual image expression
CN106096333B (en) * 2016-06-02 2019-07-16 广州麦仑信息科技有限公司 A method of gene information is subjected to visual image expression
WO2018053761A1 (en) * 2016-09-22 2018-03-29 华为技术有限公司 Data processing method and device, and computing node
CN109477140A (en) * 2016-09-22 2019-03-15 华为技术有限公司 A kind of data processing method, device and calculate node
CN109477140B (en) * 2016-09-22 2022-05-31 华为技术有限公司 Data processing method and device and computing node

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120516

WD01 Invention patent application deemed withdrawn after publication