CN104573407A - Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing - Google Patents
- Publication number
- CN104573407A (application CN201510070781.6A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- bar code
- endogenous
- sample
- endogenous bar
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a method for searching for species-specific endogenous barcodes and its application in multi-sample mixed sequencing. The search method comprises the following steps: determining, collecting and aligning candidate genomic sequences; calculating the degree of variation of the sequences within the current sliding window and the conservation degree of the sequences on both sides of the window; and determining the endogenous barcode from the results of the sliding-window scan. Once the endogenous barcode is determined, overlap extension PCR is used to amplify the endogenous barcode and the target sequence to be sequenced and to join them; the joined fragments are loaded onto the sequencer, and the sample of origin of each sequenced fragment is determined from the features of its endogenous barcode. Compared with labelling samples with exogenous barcodes synthesized in vitro, endogenous barcodes require no artificial DNA synthesis, and multiple samples can be amplified and joined to their respective barcodes and target sequences simultaneously in a one-step reaction. This simplifies the experimental procedure of first extracting the target sequence and then attaching in vitro synthesized barcodes one by one, and thereby reduces the cost of sequencing.
Description
Technical field
The invention belongs to the field of gene sequencing and in particular relates to a method for searching for species-specific endogenous barcodes and its application in multi-sample mixed sequencing.
Background technology
High-throughput sequencing technology has developed rapidly in recent years and its range of applications keeps expanding; it is often necessary to sequence a specific DNA region in a large number of samples. To improve the ability to sequence multiple samples in parallel, high-throughput sequencing platforms generally provide physically separated sequencing lanes, but parallel capacity is limited by the number of lanes and still cannot meet the demand for parallel sequencing of large numbers of samples. Experimental schemes for multi-sample mixed sequencing have therefore emerged. The main mixed-sequencing methods at present are DNA barcode labelling and overlapping pool sequencing.
A DNA barcode (also known as a DNA tag) is an artificial DNA sequence a few bases long. During sample preparation before sequencing, it is inserted into the corresponding sample sequence by PCR or a ligation reaction, and each DNA barcode sequence corresponds to exactly one sequenced sample. The labelled samples are then mixed and sequenced, and the sample to which each sequenced DNA fragment belongs is determined from the barcode information obtained, so that multiple samples are analysed in parallel in a single sequencing run.
In current DNA barcode applications, two operations are rather laborious. First, DNA barcodes must be synthesized in vitro, and the barcode for each sample must be unique, so as many DNA barcodes must be synthesized as there are samples to be mixed. Second, after synthesis each barcode must be attached to its corresponding sample, and this must also be done sample by sample. When the number of samples is large, these two operations undoubtedly consume a great deal of labour and material.
A biological endogenous barcode is a short DNA fragment within an organism that can be used to identify its species: it is conserved within a species and sufficiently variable between species. Similar in design to the Universal Product Code, the barcode common in everyday life, a biological barcode attempts to provide identifying information for a species with a short DNA sequence. Such species-specific biological barcodes have become an important research tool for taxonomists and are widely used for species identification, the discovery of cryptic species and biodiversity research.
Biological barcodes originate inside the organism, need not be synthesized in vitro and have good species specificity. We therefore consider whether, when sequencing a mixture of samples from different species, they could serve as DNA barcodes to label specific samples before mixed sequencing. Barcodes used in taxonomy are generally hundreds to thousands of bases long, whereas a sample-labelling barcode for high-throughput mixed sequencing is constrained by effective read length and sequencing cost and should be as short as possible while still distinguishing the samples. Do the known biological barcodes contain short regions that can still effectively distinguish samples from different species? Even if such a region exists, are the sequences at its two ends conserved enough to allow the endogenous barcode DNA to be excised easily in a uniform way? This calls for a general and effective search method that takes only a sub-region of a biological barcode for labelling mixed samples.
Once a short sample-specific barcode region has been found, how can it be joined simply to the sequencing target? Overlap extension PCR, also called fusion PCR, uses primers with complementary ends so that the PCR products form overlapping strands; in the subsequent amplification reaction the overlapping strands are extended, splicing together amplified fragments from different sources. With this technique, multiple samples can each amplify and join their own barcode and target sequence simultaneously in a one-step reaction, which simplifies the procedure of first extracting the sequencing target and then attaching in vitro synthesized barcodes one by one.
Summary of the invention
Purpose of the invention: in current sequencing workflows, exogenous barcodes must be synthesized one by one and attached to the sequencing targets one by one. To address this, the invention provides a method for searching for species-specific endogenous barcodes and its application in multi-sample mixed sequencing. By searching for short species-specific sequence fragments inside the organism and joining them to the sequencing target while the target is being extracted, multiple samples are processed simultaneously, library preparation is streamlined and sequencing efficiency is improved.
Technical scheme: to achieve the above technical purpose, the invention proposes a method for searching for species-specific endogenous barcodes, comprising the following steps:
(1) Determine the candidate genomic sequence containing the endogenous barcode: according to the characteristics of the samples to be sequenced, select as the search range a biological barcode widely used in taxonomy for the species concerned, thereby determining the general region of the endogenous barcode sequence on the whole genome; by searching within this range, obtain as short a genomic sequence as possible as the region of the endogenous barcode. Typically, for animals one can choose the mitochondrially encoded COI gene, one of the three cytochrome oxidase subunits; for plants, chloroplast-encoded genes such as matK and rbcL; and for bacteria, the gene encoding 16S rRNA or the gene encoding the mitochondrial functional protein cpn60;
(2) Collect candidate genomic sequences: collect and download the sequenced candidate genomic sequences of species closely related in evolution to the samples to be sequenced;
(3) Align the collected candidate genomic sequences: make all sequences the same length and aligned, so that a sliding window can scan base by base within this range to find the region corresponding to the shortest possible sample-specific sequence;
(4) Set the barcode length parameter: the length of the endogenous barcode depends on the number of samples to be sequenced and on the evolutionary relationships between them. The more samples there are, the longer the endogenous barcode must be to guarantee that it is specific to each sample; the more closely related the samples are, the smaller the sequence variation between their endogenous barcode regions, and therefore the longer the endogenous barcode sequence must be. Consequently, when this method is applied to a large number of samples or to closely related samples, the endogenous barcode found to distinguish the samples may be too long, and an overlong barcode takes up sequencing capacity and increases sequencing cost. Taking the actual read length and the target DNA length into account, the method therefore sets a barcode length parameter, calculated as the proportion of the actual read length occupied by the endogenous barcode; the default value is 20%, which is the upper limit of the barcode length. If the target DNA is short, the endogenous barcode length may be increased appropriately; otherwise it should be reduced appropriately;
(5) Calculate the degree of variation of the sequences within the sliding window and the conservation degree of the sequences on both sides of the window: initialize the window width, take the barcode length parameter as the maximum window width, and perform the following loop. Slide a window of fixed width base by base along the aligned sequence region, calculating for each window the degree of variation of the sequences within it and the conservation degree of the sequences of specified length on both sides; then gradually widen the window until a highly variable region with highly conserved flanks that meets the requirements is found, or the upper limit of the window width is reached. The degree of variation characterizes how much the sequences within the current sliding window vary; it should be as large as possible so that the window distinguishes the different species and serves as a true barcode. It is expressed as the proportion of sequences in the current sliding window that differ from all the other sequences, relative to the total number of sequences, and is defined as follows:
Let sequence set A be the set of equal-length sequences that the genomic sequences of all samples present under the current sliding window, and let the specific sequence set B be the set of sequences in A that differ from every other sequence in A by at least one base. Then
degree of variation = card(B) / card(A) × 100%
where card(X) is the number of elements in set X;
The conservation degree characterizes how conserved the sequences of specified length on both sides of the current sliding window are; it should be as high as possible so that primers can be designed to amplify the endogenous barcode sequence. The conservation degree is calculated separately for each side of the window, and both sides must be conserved so that the same pair of primers can bind the flanking sequences of all samples and amplify them. The conservation degree of each side is calculated from the Hamming distance between sequences and is defined as follows:
Let sequence set C be the set of genomic sequences of the specified length on one side of the sliding window, let a be the sequence that occurs most often in C, and let D = {y | hamming(y, a) ≤ 3, y ∈ C} be the set of sequences y in C within Hamming distance 3 of a. Then
conservation degree = card(D) / card(C) × 100%
where hamming(y, a) denotes the Hamming distance between sequence y and sequence a. Only when the conservation degree on both sides exceeds a given value are the sequences in the current sliding-window region considered amplifiable with the same primers;
(6) Determine the endogenous barcode from the sliding-window scan results: select as the species-specific endogenous barcode the sequence for which the degree of variation within the window is 100% and the conservation degree of the flanking sequences on both sides is also 100%. To distinguish the different samples accurately, the degree of variation within the selected window must be 100%, i.e. every sequence in the window differs from every other sequence by at least one base; and so that the barcodes of all samples can be amplified, the conservation degree of the flanking sequences must also be 100%, i.e. all barcode sequences can be amplified with universal primers. If the window width reaches the set upper limit and no endogenous barcode meeting the requirements has been found, the search stops and it is concluded that no suitable endogenous barcode exists under the current parameters. In that case the barcode length parameter should be raised appropriately, or the samples should be divided into groups that are mixed and sequenced separately, so that reducing the number of samples per group allows a qualifying endogenous barcode to be found.
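Steps (5) and (6) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented program: the function names, the `flank` length of 20 bases and the treatment of alignment gaps as ordinary characters are all assumptions introduced here, and ties for the most frequent flanking sequence are broken arbitrarily.

```python
from collections import Counter

def variation_degree(windows):
    """card(B) / card(A): share of window sequences that differ from all others."""
    counts = Counter(windows)
    unique = sum(1 for w in windows if counts[w] == 1)
    return unique / len(windows)

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def conservation_degree(flanks, max_mismatch=3):
    """card(D) / card(C): share of flanks within max_mismatch of the commonest flank."""
    most_common = Counter(flanks).most_common(1)[0][0]
    close = sum(1 for f in flanks if hamming(f, most_common) <= max_mismatch)
    return close / len(flanks)

def find_endogenous_barcode(aligned, max_width, min_width=1, flank=20,
                            min_variation=1.0, min_conservation=1.0):
    """Return (start, width) of the narrowest qualifying window, or None.

    `aligned` is a list of equal-length aligned sequences, one per sample.
    Windows are widened step by step up to `max_width` (the length cap).
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    for width in range(min_width, max_width + 1):
        # Leave room for a full flank on each side of the window.
        for start in range(flank, length - width - flank + 1):
            end = start + width
            if variation_degree([s[start:end] for s in aligned]) < min_variation:
                continue
            left = [s[start - flank:start] for s in aligned]
            right = [s[end:end + flank] for s in aligned]
            if (conservation_degree(left) >= min_conservation
                    and conservation_degree(right) >= min_conservation):
                return start, width
    return None  # caller may raise the cap or split the samples into groups
```

Returning `None` corresponds to the case in step (6) where the search stops without a result; the caller would then raise the length parameter or regroup the samples.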
The invention further proposes the application of the above search method in a multi-sample mixed sequencing method.
Specifically, the application comprises the following processes:
Process one: use the above search method to obtain the genomic region containing the species-specific endogenous barcode, thereby determining the species-specific endogenous barcode and its amplification primers;
Process two: amplify and join the endogenous barcode and the target DNA sequence, then sequence on the instrument. For each sample, based on overlap extension PCR, design primers for the conserved sequences at both ends of the endogenous barcode and of the target DNA sequence, and amplify the endogenous barcode and the target DNA simultaneously. The two fragments are then joined through the complementary base sequences built into the primers, so that each sample yields a joined fragment of the target DNA sequence and its endogenous barcode. The joined fragments of all samples are then mixed, sequencing adapters are added to form the sequencing library, and the library is sent to the DNA sequencer for sequencing;
Process three: determine the sample of origin of the sequenced fragments. After sequencing is complete, analyse the sequencing results and, from the features of the endogenous barcode of each sample, trace which sample each sequence fragment in the results comes from.
Specifically, process two is carried out as follows:
(1) Design amplification primers for the endogenous barcode sequence and the target sequence separately: design primers to amplify the endogenous barcode according to the genomic region in which it lies; for the target sequence, find conserved regions on both sides of it by conventional methods or with primer-design software and design primers to amplify it. Then, following the principle of overlap extension PCR, the 3'-end primer of the endogenous barcode sequence and the 5'-end primer of the target DNA sequence, besides being complementary to their respective target regions, must each be extended by a mutually complementary region of 15 to 25 bp;
(2) Amplify and join the endogenous barcode and target DNA sequence of all samples simultaneously: for each sample, the first round of PCR amplifies the barcode and the target DNA simultaneously, and the second round of PCR joins the barcode and the target DNA through the complementary sequences built into the primers, so that each sample forms a joined fragment of the target DNA sequence and its endogenous barcode.
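The complementary 15 to 25 bp extensions described in step (1) can be illustrated with a short Python sketch. It assumes one particular way of building the overlap, namely adding a shared linker to the 5' end of the target's forward primer and its reverse complement to the 5' end of the barcode's reverse primer; the patent does not prescribe this construction, and the function names and the linker are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def add_overlap_tails(barcode_rev_primer, target_fwd_primer, linker):
    """Attach mutually complementary 5' tails to the two inner primers.

    The target forward primer receives the linker itself; the barcode
    reverse primer receives its reverse complement, so the barcode
    amplicon's 3' end and the target amplicon's 5' end share the linker.
    """
    if not 15 <= len(linker) <= 25:
        raise ValueError("overlap region should be 15 to 25 bp long")
    tailed_target_fwd = linker + target_fwd_primer
    tailed_barcode_rev = reverse_complement(linker) + barcode_rev_primer
    return tailed_barcode_rev, tailed_target_fwd
```

In the second PCR round the two products can then anneal through the shared linker region and be extended into one joined fragment. Melting temperature, secondary structure and primer specificity are deliberately left out of this sketch.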
The idea of the invention can be summarized as follows: when a certain DNA region is to be sequenced in multiple samples from different species, first find another DNA region to serve as the endogenous barcode, whose sequence is sample-specific and easy to amplify. Then, while amplifying the target DNA sequence, join the endogenous barcode sequence of each sample to its target DNA sequence by overlap extension PCR, mix the samples, add sequencing adapters and sequence. By analysing the sequencing results, the sample from which each target DNA sequence comes is identified from the sequence features of the endogenous barcode of each sample.
Beneficial effects: compared with the prior art, the invention has the following advantages:
(1) The mixed sequencing method proposed by the invention, which labels samples with endogenous barcodes, makes full use of naturally occurring species-specific sequence features of organisms as sample labels and avoids synthesizing DNA barcodes in vitro one by one. By using overlap extension PCR, the endogenous barcode and the target sequence are amplified and joined simultaneously, which streamlines library preparation and provides a new approach to high-throughput mixed sequencing.
(2) The invention proposes a method for searching for qualifying endogenous DNA barcodes based on the species specificity of genomic sequences, providing a bioinformatics basis for the practical use of endogenous DNA barcodes in mixed sequencing. Because biological barcodes based on species-specific genomic sequence are conserved within a species, each sample in a multi-sample sequencing run should, as far as possible, belong to a different species or genus. The search range is currently limited to biological barcode regions used in taxonomy, but hypervariable regions that are specific to individuals may well exist elsewhere in the genome and could form further endogenous barcodes. As sequencing technology becomes more widespread and more species are sequenced, the search principle for endogenous barcodes can be applied to a wider genomic range. As high-throughput sequencing technology advances, read lengths will keep increasing, and third-generation single-molecule sequencing is even expected to remove the read-length limit, so genomic endogenous barcodes should find broader application in high-throughput sequencing in the future.
Accompanying drawing explanation
Fig. 1 is a schematic diagram of the principle of the method;
Fig. 2 is a flow chart of the program that searches for endogenous barcodes;
Fig. 3 is a schematic diagram of searching for biological endogenous barcodes with a sliding window;
Fig. 4 is a schematic diagram of overlap extension PCR;
Fig. 5 shows the result of searching for hypervariable regions with a 45 bp sliding window in the 16S rRNA gene sequences of 39 samples from different genera; the dashed box marks the V3 hypervariable region of the 16S rRNA gene;
Fig. 6 shows how the maximum degree of variation changes with sliding-window width for the 16S rRNA gene sequences of the 39 samples from different genera.
Embodiment
The invention proposes a mixed sequencing method that labels samples with endogenous barcodes (as shown in Fig. 1), comprising the following processes:
Process one: determine the species-specific barcode. Search for an endogenous barcode region that can distinguish all the samples from different species and is easy to amplify (the method flow is shown in Fig. 2).
1.1 Determine the candidate genomic sequence containing the endogenous barcode: for mixed sequencing of samples from different species, determine the general region of the endogenous barcode sequence on the whole genome and, by searching within this range, obtain as short a genomic sequence as possible as the region of the endogenous barcode. According to the characteristics of the samples to be sequenced, a biological barcode widely used in taxonomy for the species concerned can be selected as the search range. For animals one can choose the mitochondrially encoded COI gene, one of the three cytochrome oxidase subunits; for plants, chloroplast-encoded genes such as matK and rbcL; and for bacteria, the gene encoding 16S rRNA or the gene encoding the mitochondrial functional protein cpn60.
1.2 Collect candidate genomic sequences: collect and download the sequenced candidate genomic sequences of species closely related in evolution to the samples to be sequenced.
1.3 Align and arrange the collected candidate genomic sequences: compare and align the candidate genomic sequences so that a sliding window can scan base by base within this range to find the region corresponding to the shortest possible sample-specific sequence (as shown in Fig. 3).
1.4 Set the barcode length parameter: taking the actual read length and the target DNA length into account, the method sets a barcode length parameter, calculated as the proportion of the actual read length occupied by the endogenous barcode; the default value is 20%, which is the upper limit of the barcode length. If the target DNA is short, the endogenous barcode length may be increased appropriately; otherwise it should be reduced appropriately.
1.5 Calculate the degree of variation of the sequences within the sliding window and the conservation degree of the sequences on both sides of the window: initialize the window width, take the barcode length parameter as the maximum window width, and perform the following loop. Slide a window of fixed width base by base along the aligned sequence region, calculating for each window the degree of variation of the sequences within it and the conservation degree of the sequences of specified length on both sides; then gradually widen the window until a highly variable region with highly conserved flanks that meets the requirements is found, or the upper limit of the window width is reached. The degree of variation is defined as follows:
Let sequence set A be the set of equal-length sequences that the genomic sequences of all samples present under the current sliding window, and let the specific sequence set B be the set of sequences in A that differ from every other sequence in A by at least one base. Then
degree of variation = card(B) / card(A) × 100%
where card(X) is the number of elements in set X.
The conservation degree characterizes how conserved the sequences of specified length on both sides of the current sliding window are; it should be as high as possible so that primers can be designed to amplify the endogenous barcode sequence. The conservation degree is calculated separately for each side of the window, and both sides must be conserved so that the same pair of primers can bind the flanking sequences of all samples and amplify them. The conservation degree of each side is calculated from the Hamming distance between sequences, as follows:
Let sequence set C be the set of genomic sequences of the specified length on one side of the sliding window, let a be the sequence that occurs most often in C, and let D = {y | hamming(y, a) ≤ 3, y ∈ C} be the set of sequences y in C within Hamming distance 3 of a. Then
conservation degree = card(D) / card(C) × 100%
where hamming(y, a) denotes the Hamming distance between sequence y and sequence a. Only when the conservation degree on both sides exceeds a given value are the sequences in the current sliding-window region considered amplifiable with the same primers.
1.6 Determine the endogenous barcode from the sliding-window scan results: select as the species-specific endogenous barcode the sequence for which the degree of variation within the window is 100% and the conservation degree of the flanking sequences on both sides is also 100%. If the window width reaches the set upper limit and no endogenous barcode meeting the requirements has been found, the search stops and it is concluded that no suitable endogenous barcode exists under the current parameters. In that case the barcode length parameter should be raised appropriately, or the samples should be divided into groups that are mixed and sequenced separately, so that reducing the number of samples per group allows a qualifying endogenous barcode to be found.
Process two: amplify and join the endogenous barcode and the target DNA sequence, then sequence on the instrument. For each sample, based on overlap extension PCR, design primers for the conserved sequences at both ends of the endogenous barcode and of the target DNA sequence, amplify the barcode and the target DNA simultaneously under suitable experimental conditions, and then join the barcode and the target DNA through the complementary base sequences built into the primers (as shown in Fig. 4), so that each sample forms a joined fragment of the target DNA sequence and its endogenous barcode. Once it is ensured that the target DNA of each sample is joined to its own endogenous barcode, the fragments are mixed and sequencing adapters are added to form the sequencing library, which is sent to the DNA sequencer for sequencing.
Process three: determine the sample of origin of the sequenced fragments. After sequencing is complete, analyse the sequencing results and, from the features of the endogenous barcode of each sample, trace which sample each sequence fragment in the results comes from.
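Process three amounts to demultiplexing reads by their endogenous barcode. The Python sketch below shows one simple way this could be done; it assumes exact barcode matches in the forward orientation and ignores sequencing errors, and the function names are hypothetical rather than taken from the patent.

```python
def assign_sample(read, barcodes):
    """Return the single sample whose barcode occurs in the read, else None.

    `barcodes` maps a sample name to its endogenous barcode sequence.
    Reads matching no barcode, or more than one, are left unassigned.
    """
    hits = [sample for sample, barcode in barcodes.items() if barcode in read]
    return hits[0] if len(hits) == 1 else None

def demultiplex(reads, barcodes):
    """Group reads by sample of origin; unassigned reads go under key None."""
    groups = {}
    for read in reads:
        groups.setdefault(assign_sample(read, barcodes), []).append(read)
    return groups
```

A practical pipeline would probably also check the reverse complement of each read and tolerate a small number of mismatches, which is safe only as long as the barcodes of different samples stay distinguishable.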
The invention is described in detail below through a specific embodiment; the scope of protection of the invention is not limited to the description of this embodiment.
This embodiment uses 39 samples from different genera of Enterobacteriaceae to illustrate the search for an endogenous barcode for mixed sequencing. Suppose the fragment to be sequenced is the ccmD gene, which is associated with a protein involved in cytochrome c biosynthesis and electron transfer and is about 210 bp long. Sequence information for 671 related genes can be found in the gene database of NCBI (http://www.ncbi.nlm.nih.gov). The GeneID of the ccmD gene of the Escherichia coli K-12 strain is 12931490, and the corresponding nucleotide sequence is: ATGACCCCTGCATTTGCTTCCTGGAATGAATTTTTCGCAATGGGCGGTTACGCCTTTTTTGTCTGGCTGGCGGTGGTGATGACCGTTATTCCGCTGGTGGTTTTGGTCGTGCACTCGGTGATGCAACATCGCGCAATTCTGCGTGGCGTGGCGCAACAGCGGGCGCGTGAGGCGCGTTTACGTGCTGCGCAACAGCAGGAGGCTGCATGA. Primer-design software gives the forward primer 5'-GAGGCCGTAAATGACCCC and the reverse primer 5'-GGCAATCCACAAGCGGT.
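As a quick consistency check on the values quoted above, the following Python sketch confirms that the listed ccmD sequence is 210 bases long, starts with ATG and ends with the stop codon TGA, and that the 3' end of the forward primer overlaps the start of the gene. The helper `shared_overlap` is a hypothetical name introduced here; the reverse primer is not checked because it lies outside the quoted sequence.

```python
# ccmD sequence as quoted in the text, kept in the four chunks of the source.
CCMD = "".join([
    "ATGACCCCTGCATTTGCTTCCTGGAATGAATTTTTCGCAATGGGCGGTTACGCCTT",
    "TTTTGTCTGGCTGGCGGTGGTGATGACCGTTATTCCGCTGGTGGTTTTGGTCGTGC",
    "ACTCGGTGATGCAACATCGCGCAATTCTGCGTGGCGTGGCGCAACAGCGGGCGCGT",
    "GAGGCGCGTTTACGTGCTGCGCAACAGCAGGAGGCTGCATGA",
])
FORWARD_PRIMER = "GAGGCCGTAAATGACCCC"

def shared_overlap(primer, gene):
    """Length of the longest primer suffix that equals a prefix of the gene."""
    for k in range(min(len(primer), len(gene)), 0, -1):
        if primer[-k:] == gene[:k]:
            return k
    return 0

if __name__ == "__main__":
    print(len(CCMD), CCMD[:3], CCMD[-3:])
    print(shared_overlap(FORWARD_PRIMER, CCMD))
```

The eight-base overlap suggests that the forward primer anchors at the gene start, with its remaining bases lying in the upstream region that is not shown in the text.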
Step 1: search for species-specific endogenous barcodes.
1. The 16S rRNA gene is the DNA sequence encoding the 16S rRNA used in bacterial identification, and it is an ideal endogenous biological barcode for bacteria. The 16S rRNA gene is both highly conserved and highly specific, and it is also relatively easy to amplify by PCR. We therefore perform sliding-window scanning within the 16S rRNA gene region to search for hypervariable regions suitable as endogenous barcodes.
2. Collect candidate biological barcode sequences. On the NCBI website, search the "Nucleotide" database with the keyword "16s rRNA gene" together with the names of the different genera of Enterobacteriaceae, and collect the sequences of this gene from 39 different genera.
3. In the MEGA software, align the 39 sequences with the ClustalW and MUSCLE alignment algorithms separately, and combine the results of the two algorithms to obtain a reasonable alignment.
4. Set the parameters. The current mainstream second-generation sequencing platform from Illumina, using paired-end sequencing, has a read length of 500 bp. The default upper limit on the endogenous barcode length is 20% of the read length, i.e. 100 bp, so the upper limit on the sliding-window width is 100 bp.
5. Slide a window of fixed width base by base over the aligned sequence region. For the current window, calculate the variation degree of the sequences within it according to the variation-degree formula, and calculate the conservation degree of the sequences flanking both ends of the window according to the conservation-degree formula (as shown in Fig. 3). Record the window position at which a window of the current width attains the maximum variation degree, then gradually widen the window until a hypervariable region meeting the requirements is found or the upper limit on the window width is reached.
6. The sliding-window scan of the 16S rRNA genes of the 39 genera gives, for a window width of 45 bp, the variation-degree results shown in Fig. 5. Within hypervariable region V3 of the 16S rRNA gene, the 45 bp sequence starting at site 269 achieves 100% individual specificity (45 bp is below the length limit of 100 bp), and the 20 bp regions on both sides are 100% conserved. This means that the sequence in this 45 bp region can be amplified from all 39 samples with the same primers, which shows that the sequences in this region have the potential to serve as endogenous barcodes (the endogenous barcode sequence for the genus Escherichia, to which Escherichia coli belongs, is: TACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCA). Fig. 6 shows how the maximum variation degree attainable by the sliding window changes with the window width: only when the window width reaches 45 bp or more does the corresponding maximum variation degree reach 100%, i.e. an endogenous barcode must be at least 45 bp long to fully distinguish the 39 samples.
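The scan in items 4 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: variation degree is taken as the fraction of samples whose window sequence differs from every other sample's, conservation degree as the fraction of flanking sequences within Hamming distance 3 of the most common flank, alignment gaps are treated as ordinary characters, and the function returns the first qualifying window rather than tracking the maximum per width. All function and parameter names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def variation_degree(windows):
    """Fraction of window sequences that occur exactly once (assumed definition)."""
    counts = Counter(windows)
    return sum(1 for w in windows if counts[w] == 1) / len(windows)


def conservation_degree(flanks, max_mismatch=3):
    """Fraction of flanks within max_mismatch of the most common flank (assumed definition)."""
    reference, _ = Counter(flanks).most_common(1)[0]
    return sum(1 for f in flanks if hamming(f, reference) <= max_mismatch) / len(flanks)


def find_barcode(aligned, max_width, flank=20, min_width=1):
    """Return (start, width) of the narrowest fully specific window with conserved flanks.

    aligned: equal-length aligned sequences, one per sample.
    Returns None if no window up to max_width qualifies.
    """
    length = len(aligned[0])
    for width in range(min_width, max_width + 1):
        for start in range(flank, length - width - flank + 1):
            end = start + width
            if variation_degree([s[start:end] for s in aligned]) < 1.0:
                continue
            left = conservation_degree([s[start - flank:start] for s in aligned])
            right = conservation_degree([s[end:end + flank] for s in aligned])
            if left == 1.0 and right == 1.0:
                return start, width
    return None


# Width limit from item 4: 20% of a 500 bp read.
MAX_WIDTH = 500 * 20 // 100

# Toy alignment (hypothetical data): only position 5 differs between samples.
toy = ["AAAAACGGGG", "AAAAAGGGGG", "AAAAATGGGG"]
print(find_barcode(toy, max_width=2, flank=4))
```

Starting from the narrowest width and stopping at the first hit is a design choice that favors the shortest barcode, which matches the stated goal of keeping the barcode as short as possible; the start index here is 0-based.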
Step 2: carry out overlap extension PCR in two rounds of PCR to join the endogenous barcode and the ccmD gene (see Fig. 4). The first round of PCR uses two primer pairs. The forward primer (5'-CCCAACATTTCGTGAAAGTC) and reverse primer (5'-GCTGGCACGGAGTTAGC) amplify the 63 bp hypervariable region (i.e. the barcode), where the 5' end of the reverse primer carries a fragment of 10 repeats of the dinucleotide CT (20 bases in total). The forward primer (5'-GAGGCCGTAAATGACCCC) and reverse primer (5'-GGCAATCCACAAGCGGT) amplify the ccmD gene, where the 5' end of the forward primer carries a fragment of 10 repeats of the dinucleotide AG (20 bases in total). Following the principle of overlap extension PCR, the second round of PCR uses the forward primer for the barcode (5'-CCCAACATTTCGTGAAAGTC) and the reverse primer for the ccmD gene (5'-GGCAATCCACAAGCGGT). Under PCR conditions, the 20-base repeat on the 3'-end primer of the endogenous barcode sequence is complementary to the 20-base repeat on the 5'-end primer of the target DNA sequence, so the barcode and the ccmD gene are joined through this complementary sequence, and the ccmD gene of each sample thereby carries a sample-specific sequence barcode. The overlap extension PCR of the 39 samples can be run in parallel in separate reactions, which greatly improves library preparation efficiency compared with exogenous barcodes.
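The complementarity of the two 20-base primer tails can be checked with a short Python sketch. It verifies that the reverse complement of the (CT)x10 tail is the (AG)x10 tail, and then models the second-round join as a simple string merge across the shared overlap. This is an idealized in-silico model, not a simulation of PCR chemistry; the function names and the toy amplicon sequences are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# Tails described in Step 2: 10 CT repeats on the barcode reverse primer,
# 10 AG repeats on the ccmD forward primer.
BARCODE_REVERSE_TAIL = "CT" * 10
TARGET_FORWARD_TAIL = "AG" * 10


def fuse(barcode_top, target_top, overlap):
    """Join two top-strand amplicons that share `overlap` at their junction."""
    if not barcode_top.endswith(overlap) or not target_top.startswith(overlap):
        raise ValueError("amplicons do not share the expected overlap")
    return barcode_top + target_top[len(overlap):]


# On the top strand, the barcode amplicon ends with the reverse complement of
# its reverse-primer tail, and the target amplicon starts with its forward tail.
overlap = reverse_complement(BARCODE_REVERSE_TAIL)
print(overlap == TARGET_FORWARD_TAIL)

# Toy amplicon bodies (hypothetical placeholders, not real sequences).
barcode_top = "CCCAAC" + overlap
target_top = TARGET_FORWARD_TAIL + "GAGGCC"
print(fuse(barcode_top, target_top, overlap))
```

Because the overlap appears once in the fused product, the junction fragment is shorter than the sum of the two amplicons by the overlap length.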
Step 3: once every sample is confirmed to be joined to its specific endogenous barcode, pool the samples and add sequencing adapters to form a sequencing library, then sequence it. After sequencing is complete, analyze the sequencing results and trace each sequence fragment in the results back to the sample it came from according to the features of the endogenous barcode in each sample.
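The tracing step can be illustrated with a minimal Python sketch that assigns each read to the one sample whose barcode it contains. Exact substring matching is an assumption made for brevity: it does not tolerate sequencing errors or reverse-complemented reads, which a practical pipeline would need to handle. The barcodes and reads below are hypothetical toy data, and the function names are illustrative.

```python
from collections import Counter


def assign_sample(read, barcodes):
    """Return the sample whose barcode occurs in the read, or None if zero or several match."""
    hits = [name for name, barcode in barcodes.items() if barcode in read]
    return hits[0] if len(hits) == 1 else None


def demultiplex(reads, barcodes):
    """Count reads per sample; unassignable reads are counted under None."""
    return Counter(assign_sample(read, barcodes) for read in reads)


# Hypothetical toy barcodes and reads (not taken from the patent).
barcodes = {"sample_A": "ACGTAC", "sample_B": "TGCATG"}
reads = [
    "ACGTAC" + "AGAG" + "ATGACC",  # carries the sample_A barcode
    "TGCATG" + "AGAG" + "ATGACC",  # carries the sample_B barcode
    "GGGGGG" + "AGAG" + "ATGACC",  # carries no known barcode
]
print(demultiplex(reads, barcodes))
```

Returning None when more than one barcode matches keeps ambiguous reads out of every sample rather than guessing.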
The preferred embodiment of the present invention has been described in detail above. However, the invention is not limited to the specific details of that embodiment; within the scope of the technical concept of the invention, many equivalent variations can be made to its technical solution, and all such equivalent variations fall within the scope of protection of the invention.
It should further be noted that the specific technical features described in the above embodiment may be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the various possible combinations are not described separately.
In addition, the different embodiments of the invention may also be combined arbitrarily; as long as a combination does not depart from the idea of the invention, it should likewise be regarded as content disclosed by the invention.
Claims (4)
1. A method for searching for species-specific endogenous barcodes, characterized by comprising the following steps:
(1) determining the candidate genomic sequence in which the endogenous barcode lies: according to the characteristics of the samples to be sequenced, selecting as the search range a biological barcode that is widely used in taxonomy for the corresponding species, thereby determining the approximate region of the endogenous barcode sequence within the whole genome, and searching within this range to obtain a genomic sequence as short as possible as the region corresponding to the endogenous barcode;
(2) collecting candidate genomic sequences: collecting and downloading the already sequenced candidate genomic sequences of species closely related in evolution to the samples to be sequenced;
(3) aligning the collected candidate genomic sequences: bringing all sequences to the same length and aligning them, so that a sliding window can scan this range base by base to find the region corresponding to a sample-specific sequence that is as short as possible;
(4) setting the barcode length parameter: taking into account the actual sequencing read length and the target DNA length, setting the barcode length parameter, the barcode length parameter being calculated from the proportion of the actual sequencing read length occupied by the endogenous barcode;
(5) calculating the variation degree of the sequences within the sliding window and the conservation degree of the sequences flanking the window: initializing the window width and, with the barcode length parameter as the maximum window width, performing the following loop: sliding a window of fixed width base by base over the aligned sequence region, and calculating for each window the variation degree of the sequences within it and the conservation degree of the flanking sequences of specified length; then gradually widening the window until a highly variable region with highly conserved flanks that meets the requirements is found, or the upper limit on the window width is reached; wherein the variation degree is defined as follows:
let sequence set A be the set of equal-length sequences formed by the aligned genomic sequences of all samples under the corresponding sliding window, and let the specific sequence set B be the set of sequences in A that differ from every other sequence in A by at least one base; then
variation degree = card(B) / card(A) × 100%
where card(X) is the number of elements in set X;
the conservation degree is defined as follows:
let sequence set C be the set of genomic sequences of the specified length on one side of the sliding window, let a be the identical sequence that occurs most often in C, and let the sequences y in C form the set D = {y | hamming(y, a) ≤ 3}; then
conservation degree = card(D) / card(C) × 100%
where hamming(y, a) denotes the Hamming distance between sequence y and sequence a;
(6) determining the endogenous barcode from the results of the sliding-window scan: selecting as the species-specific endogenous barcode a sequence for which the variation degree of the sequences within the sliding window is 100% and the conservation degree of the sequences flanking the window is also 100%; if the window width reaches the set upper limit and no endogenous barcode meeting the requirements has been found, stopping the search and concluding that no suitable endogenous barcode can be found under the current parameter settings, in which case the barcode length parameter needs to be raised appropriately, or the samples need to be divided into groups that are pooled and sequenced separately, so that an endogenous barcode meeting the requirements can be found by reducing the number of samples in each group.
2. Use of the search method according to claim 1 in a multi-sample mixed sequencing method.
3. The use according to claim 2, characterized by comprising the following processes:
Process one: using the search method of claim 1, obtaining the genomic region in which the species-specific endogenous barcode lies, thereby determining the species-specific endogenous barcode and its amplification primers;
Process two: amplifying and joining the endogenous barcode and the target DNA sequence, then sequencing: for each sample, based on overlap extension PCR, designing primers separately for the endogenous barcode and for the conserved sequences at both ends of the target DNA sequence, amplifying the endogenous barcode and the target DNA simultaneously, and then joining the endogenous barcode and the target DNA through the complementary base sequences built into the primers at design time, so that each sample forms a junction fragment of the target DNA sequence and its corresponding endogenous barcode; then pooling the junction fragments of all samples and adding sequencing adapters to form a sequencing library, which is sent to a DNA sequencer for sequencing;
Process three: determining the sample of origin of each sequenced fragment: after sequencing is complete, analyzing the sequencing results and tracing each sequence fragment in the results back to the sample it came from according to the features of the endogenous barcode in each sample.
4. The use according to claim 3, characterized in that process two is carried out as follows:
(1) designing amplification primers separately for the endogenous barcode sequence and the target sequence to be sequenced: designing primers for amplification according to the genomic region in which the endogenous barcode lies; for the target sequence to be sequenced, finding conserved regions on both sides of the target sequence by conventional methods or primer-design software and designing primers for amplification; then, following the principle of overlap extension PCR, the 3'-end primer of the endogenous barcode sequence and the 5'-end primer of the target DNA sequence, in addition to being complementary to their respective target regions, each need to be extended by a mutually complementary region of 15 to 25 bp;
(2) amplifying all samples simultaneously and joining their respective endogenous barcodes and target DNA sequences: for each sample, the first round of PCR amplifies the barcode and the target DNA simultaneously, and the second round of PCR joins the barcode and the target DNA through the complementary sequences built into the primers at design time, so that each sample forms a junction fragment of the target DNA sequence and its corresponding endogenous barcode.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510070781.6A CN104573407B (en) | 2015-02-10 | 2015-02-10 | Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104573407A (en) | 2015-04-29 |
CN104573407B (en) | 2017-05-24 |
Family
ID=53089453
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510070781.6A Expired - Fee Related CN104573407B (en) | 2015-02-10 | 2015-02-10 | Searching method for species-specific endogenous barcodes and application thereof in multi-sample mixed sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104573407B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060194214A1 (en) * | 2005-02-28 | 2006-08-31 | George Church | Methods for assembly of high fidelity synthetic polynucleotides |
CN101429559A (en) * | 2008-12-12 | 2009-05-13 | 深圳华大基因研究院 | Environmental microorganism detection method and system |
CN103805689A (en) * | 2012-11-15 | 2014-05-21 | 深圳华大基因科技服务有限公司 | Characteristic kmer based metatypic chromosomal sequence assembly method and application thereof |
CN103984879A (en) * | 2014-03-14 | 2014-08-13 | 中国科学院上海生命科学研究院 | Method and system for measuring regional RPKM of to-be-measured genome |
CN104164479A (en) * | 2014-04-04 | 2014-11-26 | 深圳华大基因科技服务有限公司 | Heterozygous genome processing method |
Non-Patent Citations (2)
Title |
---|
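叶剑波: "Investigation of the Torque Teno virus infection rate in dogs in a suburban area of Shanghai and whole-genome sequence analysis", China Master's Theses Full-text Database, Agricultural Science and Technology * |
陈纪云: "Evaluation of six candidate DNA barcodes for Chinese Ficus plants", China Master's Theses Full-text Database, Basic Sciences * |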
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107815489A (en) * | 2017-12-07 | 2018-03-20 | 江汉大学 | A kind of method for screening the high polymorphic molecular marker site of plant |
CN107937502A (en) * | 2017-12-07 | 2018-04-20 | 江汉大学 | A kind of method for screening the high polymorphic molecular marker site of microorganism |
CN107815489B (en) * | 2017-12-07 | 2021-06-29 | 江汉大学 | Method for screening plant high polymorphism molecular marker locus |
CN107937502B (en) * | 2017-12-07 | 2021-06-29 | 江汉大学 | Method for screening high-polymorphism molecular marker loci of microorganisms |
CN108004335A (en) * | 2017-12-08 | 2018-05-08 | 江汉大学 | A kind of bacterial leaf spot bacterium microspecies isolation and identification method |
CN108004335B (en) * | 2017-12-08 | 2021-07-02 | 江汉大学 | Method for separating and identifying small species of rhizoctonia solani |
Also Published As
Publication number | Publication date |
---|---|
CN104573407B (en) | 2017-05-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20170524 |