CN106650311A - Detection and recognition method and system for microorganisms - Google Patents
- Publication number: CN106650311A (application CN201611213197.2A)
- Authority: CN (China)
- Prior art keywords
- sequence
- dna
- labels
- characteristic sequences
- dna sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- Chemical & Material Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Engineering & Computer Science (AREA)
- Organic Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Genetics & Genomics (AREA)
- Evolutionary Biology (AREA)
- Wood Science & Technology (AREA)
- Zoology (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biochemistry (AREA)
- Immunology (AREA)
- Microbiology (AREA)
- General Engineering & Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention is applicable to the field of bioengineering and provides a method and system for detecting and identifying microorganisms. The method comprises: sequencing DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences; removing vector contamination present in the DNA tag sequences; and comparing the DNA tag sequences with known sequences in a known database and determining the classification of the DNA tag sequences from the comparison result. The embodiment can detect which microbial species, or which classes of microbial species, may be present in the environmental sample.
Description
Technical field
The invention belongs to the field of bioengineering and relates in particular to a method and system for detecting and identifying microorganisms.
Background technology
The proteins and RNA molecules that determine an organism's traits are encoded by the sequence of the four DNA bases, and this information is stored in the organism's cells. The DNA molecule carries the organism's complete set of genetic information. To understand the function and role of genetic information as a whole, the most important step is to determine this complete set, that is, to know the order of all of the organism's DNA bases. Traditional genome sequencing mainly uses the Sanger method, also called the chain-termination method. Its main drawbacks are high cost and low throughput. In recent years, a new generation of high-throughput sequencing technologies, represented by Solexa, has emerged. Solexa sequencing is based on the principle of sequencing by synthesis and effectively overcomes the shortcomings of traditional Sanger sequencing, offering low cost, high throughput, short run time, high sequencing accuracy, and ease of operation.
Microorganisms are ubiquitous in nature and exist in enormous numbers. They are critical to life on Earth: they convert important elements into energy, maintain the chemical balance of the atmosphere, and provide nutrients for plants and animals. Microorganisms can also serve many commercial purposes, such as manufacturing antibiotics, raising agricultural efficiency, and producing biofuels. A small fraction of microorganisms is harmful to humans and causes various diseases. Historically, microbial research has focused on individual species. However, most microorganisms exist as communities in various environments (inside organisms, in external environments, in extreme environments, and so on) and cannot be cultured individually in the laboratory. For complex microbial communities in the environment, the traditional approach is to amplify specific conserved genes (such as 16S rRNA) by PCR and then sequence them. Environmental microorganisms are then classified by phylogenetic analysis of these conserved genes. This approach detects environmental microorganisms at the species level or at higher taxonomic levels. It can detect unknown microorganisms in the environment and is simple to operate, technically mature, and inexpensive. However, as microbial research has deepened and the number of published microbial genomes has grown, detection based on sequencing conserved genes has been found to have the following limitations:
1. Trace species cannot be identified. PCR amplification followed by sequencing yields gene sequences only from high-abundance species. Finding low-abundance species requires a large amount of Sanger sequencing.
2. Species cannot be detected from a few genes alone. Comparative analysis of 703 existing bacterial genome sequences, together with 16S rRNA sequencing of real environmental samples, shows that the 16S rRNA genes of many closely related species are highly conserved and almost identical, even though the species differ significantly in phenotype and function.
3. Detection is limited to the species level or higher, and the resulting higher-level classification information is of little use for later functional studies. Even within the same bacterial species, different strains can differ greatly.
Summary of the invention
An object of the invention is to provide a method and system for detecting and identifying microorganisms, intended to solve the problem that existing environmental microorganism detection methods have difficulty identifying trace species.
The invention is realized as an environmental microorganism detection method comprising the following steps:
Sequencing the DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences;
Removing vector contamination present in the DNA tag sequences;
Comparing the DNA tag sequences obtained after vector contamination removal with known sequences in a known database, and determining the classification of the DNA tag sequences from the comparison result.
In one embodiment, the method further comprises the following steps:
Preprocessing the known sequences in the known database to obtain DNA sequence fragments (characteristic sequences) that uniquely represent a species;
Calculating the coverage depth of DNA tag sequences at each base position of the characteristic sequences, and obtaining the average sequencing depth of the characteristic sequences by fitting a Poisson distribution;
Calculating how many base positions of the characteristic sequences are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequences;
Calculating how many base positions of the whole sequence are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence;
Judging, from the average sequencing depth of the characteristic region, the coverage of the characteristic sequences, and the coverage of the whole sequence, the confidence with which the species represented by the characteristic sequences has been found.
Another object of the invention is to provide an environmental microorganism detection system, the system comprising:
A DNA sequencing unit, for sequencing the DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences;
A vector contamination removal unit, for removing vector contamination present in the DNA tag sequences;
A classification determining unit, for comparing the DNA tag sequences obtained after vector contamination removal with known sequences in a known database, and determining the classification of the DNA tag sequences from the comparison result.
In one embodiment, the system further comprises:
A known sequence preprocessing unit, for preprocessing the known sequences in the known database to obtain DNA sequence fragments that uniquely represent a species;
A sequencing depth calculation unit, for calculating the coverage depth of DNA tag sequences at each base position of the characteristic sequences and obtaining the average sequencing depth of the characteristic sequences by fitting a Poisson distribution;
A coverage calculation unit, for calculating how many base positions of the characteristic sequences are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequences, and for calculating how many base positions of the whole sequence are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence;
A confidence judgment unit, for judging, from the average sequencing depth of the characteristic region, the coverage of the characteristic sequences, and the coverage of the whole sequence, the confidence with which the species represented by the characteristic sequences has been found.
In the environmental microorganism detection method and system provided by the invention, a high-throughput sequencing technology is used to sequence the DNA extracted from environmental samples. During sequence comparison, vector contamination is removed first, and the DNA tag sequences are then compared comprehensively with the known sequences in the known database. More of the DNA in an environmental sample, or even all of it, can thus be sequenced and compared more completely, so that trace species can be identified effectively and the microbial species, or classes of microbial species, that may be present in the sample can be detected. Furthermore, by processing more, or even all, of the characteristic sequences in the known database to obtain the average sequencing depth, the coverage of the characteristic sequences, and the coverage of the whole sequence, the confidence with which the species represented by the characteristic sequences has been found can be determined, so that detection is fine enough to distinguish closely related species and even different strains.
Description of the drawings
Fig. 1 is a flowchart of the environmental microorganism detection method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of comparing DNA tag sequences with known sequences and determining the classification of the DNA tag sequences, provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of determining characteristic sequences from simulated tag sequences that map continuously to unique positions, provided by an embodiment of the invention;
Fig. 4 is a structural block diagram of the environmental microorganism detection system provided by an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it.
In the embodiments of the invention, the DNA extracted from an environmental sample is sequenced using a high-throughput sequencing technology to obtain DNA tag sequences. After any vector contamination in the DNA tag sequences is removed, the DNA tag sequences are compared with the known sequences in a known database to obtain the classification of the DNA tag sequences.
Fig. 1 shows the flow of the environmental microorganism detection method provided by an embodiment of the invention, detailed as follows:
In step S101, the DNA extracted from an environmental sample is sequenced using a high-throughput sequencing technology to obtain DNA tag sequences.
Here, high-throughput sequencing technologies are the second-generation sequencing technologies represented by Solexa, SOLiD, and the like. Because the detailed process of sequencing DNA with a high-throughput sequencing technology is prior art, this embodiment only outlines it:
A. Extract a DNA sample from the environmental sample. The extraction must preserve the quality of the DNA in the sample and the diversity of the microorganisms.
B. Prepare a library from the DNA sample. In this embodiment, if a paired-end sequencing library needs to be built, an insert length of less than 200 bp is generally preferred during library preparation, in order to effectively address the difficulty of sequencing high-content species.
C. Carry out the high-throughput DNA sequencing reaction to obtain a large number of DNA tag sequences.
To improve detection accuracy, preferably all of the DNA extracted from the environmental sample is sequenced in this step.
In step S102, vector contamination that may be present in the DNA tag sequences obtained in step S101 is removed.
Because the vector sequences used in the sequencing reaction are known and specific, the DNA tag sequences obtained by the sequencing reaction may contain these vector sequences or parts of them. Searching a DNA tag sequence for the specific vector sequence strings shows whether the tag is contaminated by a vector sequence, and any vector contamination found in the DNA tag sequences is then removed.
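The string search described in step S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `min_overlap` threshold, and the choice to discard a contaminated tag entirely (rather than trim the contaminated part) are all assumptions made for the example.

```python
def contains_vector(tag: str, vector: str, min_overlap: int = 8) -> bool:
    """Return True if the tag contains the vector, or any part of the
    vector that is at least min_overlap bases long (assumed threshold)."""
    if vector in tag:
        return True
    # Any shared substring of length >= min_overlap contains a shared
    # substring of exactly min_overlap bases, so indexing those suffices.
    seeds = {vector[i:i + min_overlap]
             for i in range(len(vector) - min_overlap + 1)}
    return any(tag[i:i + min_overlap] in seeds
               for i in range(len(tag) - min_overlap + 1))


def remove_vector_contamination(tags, vectors, min_overlap=8):
    """Keep only the tags that share no vector fragment with any vector.
    Discarding the whole tag is an assumption; trimming is an alternative."""
    return [t for t in tags
            if not any(contains_vector(t, v, min_overlap) for v in vectors)]


# Hypothetical data for illustration only.
vectors = ["AGATCGGAAGAGC"]
tags = ["ACGTACGTACGTACGT",        # clean
        "TTTTAGATCGGAAGAGCTTTT",   # contains the full vector
        "CCCCGATCGGAAGACCCC"]      # contains part of the vector
clean_tags = remove_vector_contamination(tags, vectors)
```

Exact substring matching is the simplest reading of "searching for the vector sequence string"; a production pipeline would typically also tolerate sequencing errors in the vector portion.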
In step S103, the decontaminated DNA tag sequences are compared with known sequences in a known database, and the classification of each DNA tag sequence is obtained from the comparison result.
The known databases include, but are not limited to, bacterial genome databases, fungal genome databases, viral GenBank databases, the ribosomal database (RDP), the non-redundant GenBank database of environmental microorganisms, and the non-redundant GenBank database. In this embodiment, the known sequences of one or more of these databases can be selected for comparison with the DNA tag sequences according to the detection requirements. When the environmental sample is complex, the known sequences of all of the known databases can be selected.
In this embodiment, a short-sequence mapping method is used to compare the DNA tag sequences with the known sequences in the known database, and the classification of the best-matching sequence between a DNA tag sequence and the known sequences is taken as the classification of that DNA tag sequence.
The best-matching sequence is the known sequence to which the DNA tag sequence aligns with the fewest base mismatches. When DNA tag sequences are compared with the known sequences using a short-sequence mapping method, several best-matching sequences may be obtained; that is, a DNA tag sequence may align equally well to several known sequences. In that case, the nearest common classification of the known sequences to which the tag aligns is taken as the classification of the DNA tag sequence.
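The "nearest common classification" rule for a tag with several equally good matches can be illustrated as a lowest-common-ancestor lookup over taxonomic lineages. The sketch below assumes that each known sequence carries a lineage listed from the highest rank down to the species; the function name and the lineage representation are hypothetical and are not taken from the patent.

```python
def nearest_common_classification(lineages):
    """Return the deepest run of ranks shared by all lineages.

    Each lineage is a list ordered from the highest rank to the lowest
    (assumed representation). An empty list means nothing is shared."""
    if not lineages:
        return []
    common = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Illustrative lineages of two known sequences that a tag matches equally well.
hit_a = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae",
         "Escherichia", "Escherichia coli"]
hit_b = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae",
         "Shigella", "Shigella flexneri"]
assigned = nearest_common_classification([hit_a, hit_b])
```

With these two hits the tag is assigned at the family level, which is the lowest rank the two lineages have in common; a single best hit would keep its full lineage.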
Because microbial genomes have a relatively high mutation rate, a predetermined number of mismatches and small insertions and deletions are allowed when DNA tag sequences are compared with the known sequences in the known database. The predetermined number of mismatches can be set empirically.
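A mismatch-tolerant comparison that keeps only the best matches might look like the sketch below. It is a brute-force illustration under stated assumptions: only substitutions are counted (the small insertions and deletions that the text also allows are ignored), `max_mismatches` stands in for the empirically chosen threshold, and the function names are hypothetical. A real short-sequence mapper would use an index instead of scanning every position.

```python
def mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_matches(tag, references, max_mismatches=2):
    """Return (fewest mismatches, names of the references achieving it).

    references maps a name to a known sequence. Returns (None, []) when
    no reference matches within max_mismatches (assumed threshold)."""
    best, hits = None, []
    n = len(tag)
    for name, seq in references.items():
        # Fewest mismatches over all ungapped placements of the tag.
        scores = [mismatches(tag, seq[i:i + n])
                  for i in range(len(seq) - n + 1)]
        if not scores or min(scores) > max_mismatches:
            continue
        score = min(scores)
        if best is None or score < best:
            best, hits = score, [name]
        elif score == best:
            hits.append(name)
    return best, hits


# Hypothetical reference sequences for illustration only.
references = {"sp1": "AAACCCGGGTTT", "sp2": "AAACCAGGGTTT", "sp3": "TTTTTTTTTTTT"}
unique_hit = best_matches("CCCGGG", references)   # exact match to sp1 only
tied_hits = best_matches("CCTGGG", references)    # one mismatch to sp1 and sp2
```

When several references tie, as in the second call, the tag would be assigned to the nearest common classification of the tied references.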
Through the above steps, diversity information about the environmental sample at different classification levels can be obtained.
The detection method above can detect which microbial species, or which class of microbial species, may be present in an environmental sample, but it is difficult to determine the confidence that a species is present or, when that confidence is high, the proportion of the species in the environment.
To address these two problems, another embodiment of the invention further includes steps S104 to S107 below. Steps S104 to S107 may be performed before the comparison of DNA tag sequences with known sequences in step S103, concurrently with step S103, or after step S103.
In step S104, the known sequences in the known database are preprocessed to obtain characteristic sequences that uniquely represent a species. The specific steps are as follows:
A. Generate simulated tag sequences from the known sequences in the known database, as follows:
Starting from the first base of a known sequence, take a DNA sequence of a preset length as the first simulated tag sequence; then, starting from the second base, take a DNA sequence of the same length as the second simulated tag sequence; and so on, taking a DNA sequence of the same length starting from each base of the known sequence as a simulated tag sequence.
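The sliding-window construction in step A can be written in a few lines. In this sketch the function name is hypothetical, and windows that would run past the end of the known sequence are simply skipped, which is an assumption since the text does not say how the last few bases are handled.

```python
def simulated_tags(known_sequence: str, tag_length: int):
    """Return one simulated tag per start position, each tag_length long.

    Start positions too close to the end to yield a full-length tag are
    skipped (assumption)."""
    return [known_sequence[i:i + tag_length]
            for i in range(len(known_sequence) - tag_length + 1)]


tags = simulated_tags("ACGTAC", 4)
```

A known sequence of length n therefore yields n − tag_length + 1 simulated tags, and the i-th tag starts at base i of the known sequence.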
B. Map each simulated tag sequence onto the known sequences, and record the simulated tag sequences that map to a unique position.
In this embodiment, any sequence mapping method, such as the SOAP alignment method, can be used to map the simulated tag sequences onto the known sequences, so the mapping itself is not described further. Because sequenced fragments always carry a certain error rate, a real DNA tag sequence could be mapped to a wrong position in practice because of a sequencing error. To avoid this, the simulated tag sequences are mapped onto the known sequences under the premise that sequencing errors are allowed.
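The bookkeeping in step B, recording which simulated tags map to a unique position, can be sketched with a simple occurrence count. This example is deliberately simplified: it treats a tag as uniquely mapped when its exact sequence occurs once in the whole database, whereas the text calls for a mapper such as SOAP that tolerates sequencing errors. Reverse-complement matches are also ignored, and the function name is hypothetical.

```python
from collections import Counter


def unique_start_flags(known_sequences, tag_length):
    """For each known sequence, flag every start position whose simulated
    tag occurs exactly once across all known sequences.

    known_sequences maps a name to a sequence. Exact matching is an
    assumption made to keep the sketch short."""
    counts = Counter()
    for seq in known_sequences.values():
        for i in range(len(seq) - tag_length + 1):
            counts[seq[i:i + tag_length]] += 1
    return {
        name: [counts[seq[i:i + tag_length]] == 1
               for i in range(len(seq) - tag_length + 1)]
        for name, seq in known_sequences.items()
    }


# Hypothetical two-species database for illustration only.
database = {"A": "ACGTTT", "B": "ACGAAA"}
flags = unique_start_flags(database, 3)
```

Here the tag "ACG" occurs in both sequences, so start position 0 is flagged as not unique in each, while all later start positions are unique.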
C. Find the simulated tag sequences that map continuously to unique positions, and obtain the characteristic sequences that uniquely represent a species. A characteristic sequence is a DNA sequence fragment that uniquely represents a species. There are generally several characteristic sequences, and to improve detection accuracy this embodiment preferably finds all of them. The sequencing depth of the characteristic sequences reflects the content of the species in the sample. The process is as follows:
Find the simulated tag sequences that map continuously to unique positions, which gives a continuous region of uniquely mapped simulated tag sequences. Remove (simulated tag length − 1) sites from each end of the continuous region, and take the remaining sequence as a characteristic sequence. The two ends are removed because their sites are covered by only some of the uniquely mapped simulated tag sequences, whereas ideally only a continuous region in which every site is covered by a full tag length of uniquely mapped simulated tag sequences can uniquely represent a species. Finally, all of the characteristic sequences on a known sequence are joined together as the "characteristic sequence" that uniquely represents the species. In this embodiment, when the confidence of presence and the proportion in the environment need to be known for all microbial species detected in the environmental sample, all known sequences in the known database must be preprocessed in this way to obtain the characteristic regions that uniquely represent a species. Because the known database may contain many species, preprocessing yields multiple characteristic regions, each uniquely representing a different species.
Referring to Fig. 3, when the simulated tag sequences found to map continuously to unique positions are short sequence 1 through short sequence n, (simulated tag length − 1) sites are removed from each end of the continuously and uniquely aligned region, and the remaining continuous region is taken as the characteristic sequence.
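The trimming rule can be made concrete with a short sketch. It assumes the input is a list of flags, one per start position, saying whether the simulated tag starting there maps uniquely; the function name and the half-open interval convention are choices made for the example. For a run of unique start positions s to e, the tags cover sites s to e + L − 1 (L is the tag length), and removing L − 1 sites from each end leaves sites s + L − 1 to e, which are exactly the sites covered by L uniquely mapped tags.

```python
def characteristic_regions(unique_flags, tag_length):
    """Return half-open (start, end) site intervals of characteristic sequence.

    unique_flags[i] is True when the simulated tag starting at site i maps
    to a unique position (assumed input format)."""
    regions = []
    run_start = None
    # A trailing False acts as a sentinel that closes the final run.
    for i, flag in enumerate(list(unique_flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            run_end = i - 1                     # last unique start in the run
            start = run_start + tag_length - 1  # trim tag_length - 1 sites on the left
            end = run_end + 1                   # trimming on the right leaves sites up to run_end
            if start < end:
                regions.append((start, end))
            run_start = None
    return regions


flags = [False, True, True, True, True, True, False, True]
regions = characteristic_regions(flags, 3)
```

In this example the run of unique starts 1 to 5 covers sites 1 to 7, and trimming two sites from each end leaves sites 3 to 5. The isolated unique start at position 7 is too short to leave any site after trimming, so it contributes nothing.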
In step S105, the coverage depth of DNA tag sequences at each base position of the characteristic sequences is calculated, and the average sequencing depth of the characteristic sequences (denoted d) is obtained by fitting a Poisson distribution. The DNA tag sequences in this step are the decontaminated DNA tag sequences from step S102. According to experimental results, the content in the sample of the species represented by a characteristic sequence increases with the average sequencing depth of that characteristic sequence. Therefore, when the relative content of the species detected in the environmental sample needs to be known, the average sequencing depth is calculated for the characteristic sequences of every species, and the method further includes the following step:
From the ratio of the calculated average sequencing depths of the characteristic sequences that uniquely represent each species, obtain the relative content ratio of the species represented by the characteristic sequences. Because the content of a species in the sample increases with the average sequencing depth of its characteristic sequences, the ratio of the calculated average sequencing depths is taken as the relative content ratio of the species.
For example, if the calculated average sequencing depth of the characteristic sequences that uniquely represent species A is 20, that for species B is 100, and that for species C is 30, then the relative content ratio of species A, species B, and species C is 20:100:30.
In step S106, the number of base positions in the characteristic sequences that are covered by DNA tag sequences is counted and divided by the total number of base positions in the characteristic sequences, which gives the coverage of the characteristic sequences (denoted c). Likewise, the number of base positions in the whole sequence (including the characteristic sequences and the sequences to which DNA tag sequences align non-uniquely) that are covered by DNA tag sequences is counted and divided by the number of base positions in the whole sequence, which gives the coverage of the whole sequence (denoted c′). For example, if a sequence has 100 base positions (a length of 100 bp) and 80 of them are covered, its coverage is 0.8.
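The coverage calculation reduces to counting covered positions. The sketch below assumes the per-base depths are already available as a list; the function name is hypothetical.

```python
def coverage(depth):
    """Fraction of base positions covered by at least one tag.

    depth is the list of per-base coverage depths (assumed input format)."""
    covered = sum(1 for d in depth if d > 0)
    return covered / len(depth)


# The worked example from the text: 100 positions, 80 of them covered.
example_depth = [1] * 80 + [0] * 20
c = coverage(example_depth)
```

The same function applies to both quantities: passing the depths over the characteristic sequences gives their coverage, and passing the depths over the whole sequence gives the whole-sequence coverage.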
In step S107, the confidence that the species represented by the characteristic sequences has been found is calculated from the average sequencing depth d of the DNA tag sequences, the coverage c of the characteristic sequences, and the coverage c′ of the whole sequence. The confidence P can be calculated, for example, with the following algorithm (the formula is missing from this text). When P is close to 1, the confidence is highest; when P is close to 0, the confidence is lowest. Here θ is a correction factor for the sequencing, and its value may differ between sequencing methods. Normally c < c′ holds; if c > c′ in real data, the species sequence is abnormal.
Fig. 4 shows the structure of the environmental microorganism detection system provided by an embodiment of the invention. For ease of description, only the parts related to the embodiment are shown. In the system:
The DNA sequencing unit 41 sequences the DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences. High-throughput sequencing technologies are the second-generation sequencing technologies represented by Solexa, SOLiD, and the like. The DNA sequencing unit 41 includes a DNA sample extraction module 411, a library preparation module 412, and a sequencing module 413. The DNA sample extraction module 411 extracts a DNA sample from the environmental sample; the extraction must preserve the quality of the DNA in the sample and the diversity of the microorganisms. The library preparation module 412 prepares a library from the DNA sample. The sequencing module 413 carries out the high-throughput DNA sequencing reaction to obtain a large number of DNA tag sequences. Because the sequencing procedure of the sequencing module 413 is prior art, it is not described further here.
The vector contamination removal unit 42 removes vector contamination that may be present in the DNA tag sequences obtained by the DNA sequencing unit 41. In this embodiment, because the vector sequences used in the sequencing reaction are known and specific, the DNA tag sequences obtained by the sequencing reaction may contain these vector sequences or parts of them. Searching a DNA tag sequence for the specific vector sequence strings shows whether the tag is contaminated by a vector sequence, and any vector contamination found in the DNA tag sequences is then removed.
The classification determining unit 43 compares the DNA tag sequences processed by the vector contamination removal unit 42 with the known sequences in the known database, and obtains the classification of the DNA tag sequences from the comparison result. The known database is one or a combination of a bacterial genome database, a fungal genome database, a viral GenBank database, the RDP database, and the nt database.
In this embodiment, a short-sequence mapping method is used to compare the DNA tag sequences with the known sequences in the known database and obtain the best match between a DNA tag sequence and the known sequences. The best match is the position on a known sequence at which the DNA tag sequence aligns with the fewest base mismatches. The classification of the DNA tag sequence is obtained from this best match. When a short-sequence mapping method is used, several best matches may be obtained; that is, a DNA tag sequence may align equally well to several known sequences. In that case, the nearest common classification of the known sequences to which the DNA tag sequence aligns is taken as the classification of the DNA tag sequence.
The detection described above can detect which microbial species, or which class of microbial species, may be present in an environmental sample, but it is difficult to determine the confidence that a species is present or, when that confidence is high, the proportion of the species in the environment. To address these two problems, in another embodiment of the invention the system further includes a known sequence preprocessing unit 44, a sequencing depth calculation unit 45, a coverage calculation unit 46, and a confidence judgment unit 47.
The known sequence preprocessing unit 44 preprocesses the known sequences in the known database to obtain DNA sequence fragments that uniquely represent a species. It includes a simulated tag sequence generation module 441, a simulated tag sequence mapping module 442, and a characteristic sequence acquisition module 443.
The simulated tag sequence generation module 441 takes a DNA sequence of the same length starting from each base of a known sequence as a simulated tag sequence.
The simulated tag sequence mapping module 442 maps each simulated tag sequence onto the known sequences and records the simulated tag sequences that map to a unique position.
The characteristic sequence acquisition module 443 searches for regions in which consecutive simulated
sequence labels map to unique positions, removes (simulated sequence label length - 1) sites from each of
the head and the tail of such a region, and takes the remaining continuous region as a characteristic
sequence. Finally, all characteristic sequences on a known sequence are joined together to form the
"characteristic sequences", the DNA sequence fragment that can uniquely represent the species. The trimming
is needed because the sites in the head and tail portions of the region are covered only partly by uniquely
mapped simulated sequence labels, whereas ideally only a continuous region in which every site is covered
by a full label length of uniquely mapped simulated sequence labels can uniquely represent a species.
Removing (simulated sequence label length - 1) sites from each end therefore ensures that the DNA sequence
fragment of the characteristic sequences uniquely represents one species.
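The trimming step can be made concrete with a short sketch. It assumes a list of flags in which entry i
says whether the simulated label starting at base i mapped uniquely; the function name and the half-open
interval convention are illustrative choices, not part of the source.

```python
from typing import List, Tuple


def characteristic_regions(unique_flags: List[bool], label_length: int) -> List[Tuple[int, int]]:
    """Return half-open base intervals [start, end) in which every covering label is unique."""
    regions: List[Tuple[int, int]] = []
    run_start = None
    # A trailing False closes a run that reaches the end of the list.
    for i, flag in enumerate(list(unique_flags) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            run_end = i - 1  # start index of the last unique label in the run
            # The run spans bases [run_start, run_end + label_length - 1]; removing
            # (label_length - 1) bases from each end leaves [run_start + label_length - 1, run_end].
            start = run_start + label_length - 1
            if start <= run_end:
                regions.append((start, run_end + 1))
            run_start = None
    return regions


flags = [False, True, True, True, True, True, False, True, True]
print(characteristic_regions(flags, label_length=3))
```

A run of fewer than `label_length` consecutive unique labels leaves nothing after trimming, so it yields
no characteristic sequence, as with the final two flags in the example.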
The sequencing depth calculation unit 45 calculates the depth to which DNA sequence labels cover each base
of the characteristic sequences, and obtains the average sequencing depth of the characteristic sequences
(denoted d) by fitting a Poisson distribution. This average sequencing depth reflects the content in the
sample of the species represented by the DNA sequence labels aligned to the characteristic sequences.
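The depth calculation can be sketched as below. The source does not say how the Poisson fit is carried
out; this sketch assumes a maximum-likelihood fit, for which the estimated Poisson mean equals the sample
mean of the per-base depths. The function names and the (start, length) alignment format are hypothetical.

```python
from typing import List, Sequence, Tuple


def per_base_depth(region_length: int, alignments: Sequence[Tuple[int, int]]) -> List[int]:
    """Count how many aligned labels cover each base of a region of the given length."""
    diff = [0] * (region_length + 1)
    for start, length in alignments:
        lo, hi = max(start, 0), min(start + length, region_length)  # clip to the region
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    depths, running = [], 0
    for delta in diff[:region_length]:
        running += delta
        depths.append(running)
    return depths


def poisson_mean_depth(depths: Sequence[int]) -> float:
    """Maximum-likelihood Poisson mean of the depths (assumed fitting method)."""
    return sum(depths) / len(depths) if depths else 0.0


depths = per_base_depth(6, [(0, 3), (1, 3), (4, 2)])
print(depths, poisson_mean_depth(depths))
```

The difference-array pass avoids touching every base of every label, which matters when the number of
aligned labels is large.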
The coverage calculation unit 46 calculates the coverage of the characteristic sequences and of the whole
sequence. It includes a characteristic sequence coverage calculation module 461 and a whole sequence
coverage calculation module 462. Module 461 calculates how many bases of the characteristic sequences are
covered by DNA sequence labels, giving the coverage of the characteristic sequences (denoted c). Module 462
calculates how many bases of the whole sequence (including the characteristic sequences and the sequences
to which DNA sequence labels align non-uniquely) are covered by DNA sequence labels, giving the coverage of
the whole sequence (denoted c').
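As a small illustration, the two coverages can be computed from per-base depths. The sketch assumes that
coverage is expressed as the fraction of bases covered at least once; the function name and the example
depth lists are hypothetical.

```python
from typing import Sequence


def coverage(depths: Sequence[int]) -> float:
    """Fraction of bases covered by at least one DNA sequence label."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth > 0) / len(depths)


# Hypothetical per-base depths for the characteristic sequences and for the whole sequence.
c = coverage([1, 0, 2, 0])
c_whole = coverage([1, 0, 2, 0, 3, 1, 1, 2])
print(c, c_whole)
```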
The credibility judgement unit 47 judges the confidence with which the species represented by the
characteristic sequences has been found, according to the average sequencing depth d of the characteristic
sequences, the coverage c of the characteristic sequences and the coverage c' of the whole sequence. In
this embodiment, when c is approximately equal to the value expected from d and θ, and c ≤ c', the species
is considered to have been found with high confidence, where θ is the sequencing correction factor, whose
value may differ between sequencing methods. Otherwise, the species is considered to have been found with
low confidence.
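The exact expression against which c is compared is not legible in this text, so the following is only a
sketch under stated assumptions. It assumes that the expected coverage follows the usual Poisson result,
in which a base at mean depth d is covered at least once with probability 1 - exp(-d), and that θ scales
the depth. The form 1 - exp(-θd), the tolerance parameter and the function names are all assumptions and
should not be read as the formula of the source.

```python
import math


def expected_coverage(mean_depth: float, theta: float = 1.0) -> float:
    """Expected covered fraction under a Poisson model (assumed form, not from the source)."""
    return 1.0 - math.exp(-theta * mean_depth)


def is_credible(c: float, c_whole: float, mean_depth: float,
                theta: float = 1.0, tolerance: float = 0.05) -> bool:
    """High confidence when c is close to the expected coverage and c <= c'."""
    close_to_expected = abs(c - expected_coverage(mean_depth, theta)) <= tolerance
    return close_to_expected and c <= c_whole


print(is_credible(c=0.86, c_whole=0.90, mean_depth=2.0))
print(is_credible(c=0.40, c_whole=0.90, mean_depth=2.0))
```

The tolerance stands in for "approximately equal" and would need to be chosen for the sequencing method
in use.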
When the relative content ratio of the species detected in an environmental sample needs to be known, in
another embodiment of the invention the system further includes a content ratio calculation unit 48. The
content ratio calculation unit 48 obtains the relative content ratio of the species represented by each
characteristic sequence from the ratio of the calculated average sequencing depths of the characteristic
sequences that uniquely represent each species. Because the content in the sample of the species
represented by a characteristic sequence increases with the average sequencing depth of that characteristic
sequence, the ratio of the average sequencing depths of the characteristic sequences that uniquely
represent each species is the relative content ratio of the species they represent.
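A worked example of the content ratio: the sketch below normalises the average sequencing depths so that
they sum to 1, which is one convenient way of expressing the ratio. The function name and the species
labels are hypothetical.

```python
from typing import Dict


def relative_content(mean_depths: Dict[str, float]) -> Dict[str, float]:
    """Express each species' average sequencing depth as a share of the total."""
    total = sum(mean_depths.values())
    if total == 0:
        return {species: 0.0 for species in mean_depths}
    return {species: depth / total for species, depth in mean_depths.items()}


# Hypothetical average sequencing depths of two characteristic sequences.
shares = relative_content({"species_A": 30.0, "species_B": 10.0})
print(shares)
```

Here the depths 30 and 10 give a content ratio of 3:1, that is, shares of 0.75 and 0.25.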
In the embodiments of the invention, the extracted DNA sample is sequenced using high-throughput sequencing
technology to obtain DNA sequence labels, the sequenced labels are compared with the known sequences in the
known database, and the affiliated classification of each DNA sequence label is obtained from the
comparison result, so that the microbial species, or class of microbial species, that may be present in an
environmental sample can be detected. By preprocessing the known sequences in the known database,
characteristic sequences that uniquely represent a species are obtained. The depth to which DNA sequence
labels cover each base of the characteristic sequences is then calculated, and the average sequencing depth
of the characteristic sequences is obtained by fitting a Poisson distribution, which gives the content in
the sample of the species that the characteristic sequences represent. In addition, the coverage of the
characteristic sequences and the coverage of the whole sequence are calculated, so that the confidence with
which the species represented by the characteristic sequences has been found can be judged from the average
sequencing depth of the characteristic sequences, the coverage of the characteristic sequences and the
coverage of the whole sequence.
The foregoing is only a preferred embodiment of the invention and is not intended to limit the invention.
Any modification, equivalent replacement or improvement made within the spirit and principles of the
invention shall fall within the scope of protection of the invention.
Claims (10)
1. An environmental microorganism detection method, characterized in that the method comprises the following steps:
Inputting the DNA extracted from an environmental sample, and sequencing the DNA extracted from the
environmental sample using a high-throughput sequencing method to obtain DNA sequence labels;
Removing vector contamination present in the DNA sequence labels;
Comparing the DNA sequence labels obtained after removing vector contamination with the known sequences in
a known database, and determining the classification to which the DNA sequence labels belong according to
the comparison result.
2. The method of claim 1, characterized in that the step of comparing the DNA sequence labels obtained
after removing vector contamination with the known sequences in the known database, and determining the
classification to which the DNA sequence labels belong according to the comparison result, further
comprises:
Comparing the DNA sequence labels with the known sequences in the known database using a short-sequence
mapping method, and determining the classification of the best-matching sequence between a DNA sequence
label and the known sequences as the classification to which the DNA sequence label belongs, wherein the
best-matching sequence between the DNA sequence label and the known sequences is the known sequence to
which the DNA sequence label aligns with the fewest base mismatches.
3. The method of claim 2, characterized in that when there are multiple best-matching sequences between a
DNA sequence label and the known sequences, the nearest common classification of the multiple best-matching
sequences is determined as the classification of the DNA sequence label.
4. The method of claim 1, characterized in that the method further comprises the following steps:
Preprocessing the known sequences in the known database to obtain DNA sequence fragments that uniquely represent a species;
Calculating the depth to which DNA sequence labels cover each base of the characteristic sequences, and
obtaining the average sequencing depth of the characteristic sequences by fitting a Poisson distribution;
Calculating how many bases of the characteristic sequences are covered by DNA sequence labels, so as to obtain the coverage of the characteristic sequences;
Calculating how many bases of the whole sequence are covered by DNA sequence labels, so as to obtain the coverage of the whole sequence;
Judging the confidence with which the species represented by the characteristic sequences has been found,
according to the average sequencing depth of the characteristic sequences, the coverage of the
characteristic sequences and the coverage of the whole sequence.
5. The method of claim 4, characterized in that the step of preprocessing the known sequences in the known
database to obtain DNA sequence fragments that uniquely represent a species comprises:
Starting from each base position of a known sequence, taking a DNA sequence of a preset length as a simulated sequence label;
Mapping the simulated sequence labels onto the known sequences, and recording the simulated sequence labels that map to a unique position;
Searching for regions in which consecutive simulated sequence labels map to unique positions, removing
(simulated sequence label length - 1) sites from each of the head and the tail of such a region, taking the
remaining continuous region as a characteristic sequence, and joining together the characteristic sequences
on the known sequence as the characteristic sequences, the DNA sequence fragment that can uniquely
represent a species.
6. The method of claim 4, characterized in that the step of judging the confidence with which the species
represented by the characteristic sequences has been found, according to the average sequencing depth of
the characteristic sequences, the coverage of the characteristic sequences and the coverage of the whole
sequence, is specifically:
A confidence level p is computed from c, d, c' and θ; when p is close to 1, the confidence is highest, and
when p is close to 0, the confidence is lowest, where c is the coverage of the characteristic sequences, d
is the average sequencing depth of the characteristic sequences, c' is the coverage of the whole sequence,
and θ is the sequencing correction factor.
7. The method of claim 4, characterized in that the step of calculating the depth to which DNA sequence
labels cover each base of the characteristic sequences and obtaining the average sequencing depth of the
characteristic sequences by fitting a Poisson distribution further comprises the following step:
Obtaining the relative content ratio of the species represented by each characteristic sequence from the
ratio of the calculated average sequencing depths of the characteristic sequences that uniquely represent
each species.
8. The method of any one of claims 1 to 7, characterized in that sequencing the DNA extracted from the
environmental sample using high-throughput sequencing technology means sequencing all of the DNA extracted
from the environmental sample.
9. An environmental microorganism detection system, characterized in that the system comprises:
A DNA sequencing unit, for sequencing the input DNA extracted from an environmental sample using
high-throughput sequencing technology to obtain DNA sequence labels;
A vector contamination removal unit, for removing vector contamination present in the DNA sequence labels;
An affiliated classification determining unit, for comparing the DNA sequence labels obtained after
removing vector contamination with the known sequences in a known database, and determining the
classification to which the DNA sequence labels belong according to the comparison result.
10. The system of claim 9, characterized in that the system further comprises:
A known sequence preprocessing unit, for preprocessing the known sequences in the known database to obtain
DNA sequence fragments that uniquely represent a species;
A sequencing depth calculation unit, for calculating the depth to which DNA sequence labels cover each base
of the characteristic sequences, and obtaining the average sequencing depth of the characteristic sequences
by fitting a Poisson distribution;
A coverage calculation unit, for calculating how many bases of the characteristic sequences are covered by
DNA sequence labels, so as to obtain the coverage of the characteristic sequences, and calculating how many
bases of the whole sequence are covered by DNA sequence labels, so as to obtain the coverage of the whole
sequence;
A credibility judgement unit, for judging the confidence with which the species represented by the
characteristic sequences has been found, according to the average sequencing depth of the characteristic
sequences, the coverage of the characteristic sequences and the coverage of the whole sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611213197.2A CN106650311A (en) | 2016-12-23 | 2016-12-23 | Detection and recognition method and system for microorganisms |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611213197.2A CN106650311A (en) | 2016-12-23 | 2016-12-23 | Detection and recognition method and system for microorganisms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106650311A true CN106650311A (en) | 2017-05-10 |
Family
ID=58826927
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611213197.2A Pending CN106650311A (en) | 2016-12-23 | 2016-12-23 | Detection and recognition method and system for microorganisms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106650311A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109637585A (en) * | 2018-12-27 | 2019-04-16 | 北京优迅医学检验实验室有限公司 | Method and device for correcting sequencing depth |
CN114596917A (en) * | 2022-05-10 | 2022-06-07 | 天津诺禾致源生物信息科技有限公司 | Method and device for eliminating bacterial contamination sequence by sequencing data |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101748213A (en) * | 2008-12-12 | 2010-06-23 | 深圳华大基因研究院 | Environmental microorganism detection method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101748213B (en) | Environmental microorganism detection method and system | |
US20140162274A1 (en) | Compositions and methods for identifying and comparing members of microbial communities using amplicon sequences | |
CN108048595A (en) | Indel molecular labelings and its application with pumpkin photoperiod insensitive character close linkage | |
Hong et al. | Molecular markers reveal epidemiological patterns and evolutionary histories of the human pathogenic Cryptococcus | |
CN108154010A (en) | A kind of ctDNA low frequencies mutation sequencing data analysis method and device | |
CN101429559A (en) | Environmental microorganism detection method and system | |
CN111088382A (en) | Corn whole genome SNP chip and application thereof | |
CN106650311A (en) | Detection and recognition method and system for microorganisms | |
Ceballos-Escalera et al. | Metabarcoding of insect-associated fungal communities: a comparison of internal transcribed spacer (ITS) and large-subunit (LSU) rRNA markers | |
CN106555008A (en) | Detection and identification method and system for microorganisms | |
CN101979540B (en) | Method for designing microRNA probe sequence | |
CN113122651A (en) | SNP molecular marker linked with major QTL locus of lotus rhizome expansion character and application thereof | |
Carrieri et al. | A fast machine learning workflow for rapid phenotype prediction from whole shotgun metagenomes | |
Milne et al. | Molecular evidence indicates that subarctic willow communities in Scotland support a diversity of host-associated Melampsora rust taxa | |
Owen | Bacterial taxonomics: finding the wood through the phylogenetic trees | |
CN109706231A (en) | A kind of high-throughput SNP classifying method for litopenaeus vannamei molecular breeding | |
CN108416189A (en) | A kind of variety of crops Heterosis identification method based on molecular marking technique | |
Jiang et al. | SNP molecular markers development and genetic diversity analysis of Forsythia suspensa based on SLAF-seq technology | |
CN108875300A (en) | A kind of method and application adapting to potentiality using landscape genomics assessment species | |
Regalado et al. | Combining whole genome shotgun sequencing and rDNA amplicon analyses to improve detection of microbe-microbe interaction networks in plant leaves | |
US20220230704A1 (en) | Dna methylation based high resolution characterization of microbiome using nanopore sequencing | |
CN101565744B (en) | Polynary high-throughput genetic marking system and genetic analysis method for blue crabs | |
CN107630104A (en) | A kind of phylogenetic tree and authentication method for being used to identify Dendrobidium huoshanness or dendrobium candidum | |
CN113981081A (en) | Breast cancer molecular marker based on RNA editing level and diagnosis model | |
CN102831331A (en) | Primer design developing method of length polymorphism sign based on restriction enzyme digestion database-establishing pair-end sequencing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170510 |