CN106650311A

CN106650311A - Detection and recognition method and system for microorganisms

Info

Publication number: CN106650311A
Application number: CN201611213197.2A
Authority: CN
Inventors: 刘恩浩
Original assignee: JINULI (TIANJIN) BIOTECHNOLOGY CO Ltd
Current assignee: JINULI (TIANJIN) BIOTECHNOLOGY CO Ltd
Priority date: 2016-12-23
Filing date: 2016-12-23
Publication date: 2017-05-10

Abstract

The invention belongs to the field of bioengineering and provides a method and system for detecting and identifying microorganisms. The method comprises: sequencing DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences; removing carrier contamination present in the DNA tag sequences; and comparing the DNA tag sequences with known sequences in a known database, the classification of each DNA tag sequence being determined from the comparison result. The embodiment makes it possible to detect which microbial species, or which class of microbial species, may be present in the environmental sample.

Description

A method and system for detecting and identifying microorganisms

Technical field

The invention belongs to the field of bioengineering, and more particularly relates to a method and system for detecting and identifying microorganisms.

Background art

The proteins and RNA molecules that determine biological traits are all stored in cells as coding sequences made up of the four DNA bases. These DNA molecules carry the complete genetic information of an organism. To understand the function and role of genetic information as a whole, the most important step is to determine the organism's complete genetic information, that is, to establish the order of all of its DNA bases. Traditional genome sequencing mainly uses the Sanger method, also called the chain-termination method. Its main drawbacks are high cost and low yield. In recent years, "next-generation high-throughput sequencing technologies", represented by Solexa, have emerged. Solexa sequencing, based on the principle of "sequencing by synthesis", effectively overcomes the shortcomings of traditional Sanger sequencing and offers low cost, high throughput, short turnaround time, high sequencing accuracy, and ease of operation.

Microorganisms are ubiquitous in nature and exist in huge numbers. They are critical to life on Earth: they convert key elements into energy, maintain the chemical balance of the atmosphere, and provide nutrients to plants and animals. Microorganisms can also serve many commercial purposes, such as manufacturing antibiotics, improving farm efficiency, and producing biofuels. A small fraction of microorganisms are harmful to humans and cause various diseases. Historically, microbial research focused mainly on individual species. However, most microorganisms live as communities in various environments (inside organisms, in the external environment, in extreme environments, and so on) and cannot be cultured individually in the laboratory. For complex microbial communities in the environment, the traditional research method is to amplify specific conserved genes (such as 16S rRNA) by PCR and then sequence them. Environmental microorganisms are then classified through phylogenetic analysis of these conserved genes. This approach detects environmental microorganisms at the species level or at higher taxonomic levels. It can detect unknown microorganisms in the environment and is simple to operate, technically mature, and inexpensive. However, as microbial research has deepened and the number of published microbial genomes has grown, detection based on sequencing conserved genes has been found to have the following limitations:

1. Trace species cannot be identified. PCR amplification followed by sequencing yields gene sequences only from high-abundance species. Finding low-abundance species requires a large amount of Sanger sequencing.

2. Species cannot be detected reliably from just a few genes. Comparative analysis of 703 existing bacterial genome sequences, together with 16S rRNA sequencing analysis of real environmental samples, showed that the 16S rRNA genes of many closely related species are highly conserved and almost identical, yet the species differ significantly in phenotype and function.

3. Detection is possible only at the species level or higher, and the resulting higher-level classification information is of little use for later functional studies. Even within the same bacterial species, different strains can differ greatly.

Summary of the invention

An object of the invention is to provide a method and system for detecting and identifying microorganisms, aiming to solve the problem that existing environmental microorganism detection methods have difficulty identifying trace species.

The invention is realized as an environmental microorganism detection method comprising the following steps:

sequencing DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences;

removing carrier contamination present in the DNA tag sequences;

comparing the DNA tag sequences obtained after removal of carrier contamination with known sequences in a known database, and determining the classification of the DNA tag sequences according to the comparison result.

In one embodiment, the method further comprises the following steps:

preprocessing the known sequences in the known database to obtain characteristic sequences, that is, DNA sequence fragments each of which uniquely represents one species;

calculating the depth of coverage by DNA tag sequences at each base of a characteristic sequence, and obtaining the average sequencing depth of the characteristic sequence by fitting a Poisson distribution;

calculating how many bases of the characteristic sequence are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequence;

calculating how many bases of the whole sequence are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence;

judging, according to the average sequencing depth of the characteristic sequence, the coverage of the characteristic sequence, and the coverage of the whole sequence, the confidence with which the species represented by the characteristic sequence has been found.

Another object of the invention is to provide an environmental microorganism detection system, the system comprising:

a DNA sequencing unit, for sequencing DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences;

a carrier contamination removal unit, for removing carrier contamination present in the DNA tag sequences;

a classification determining unit, for comparing the DNA tag sequences obtained after removal of carrier contamination with known sequences in a known database, and determining the classification of the DNA tag sequences according to the comparison result.

In one embodiment, the system further comprises:

a known sequence preprocessing unit, for preprocessing the known sequences in the known database to obtain characteristic sequences, that is, DNA sequence fragments each of which uniquely represents one species;

a sequencing depth calculation unit, for calculating the depth of coverage by DNA tag sequences at each base of a characteristic sequence, and obtaining the average sequencing depth of the characteristic sequence by fitting a Poisson distribution;

a coverage calculation unit, for calculating how many bases of the characteristic sequence are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequence, and for calculating how many bases of the whole sequence are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence;

a confidence judgment unit, for judging, according to the average sequencing depth of the characteristic sequence, the coverage of the characteristic sequence, and the coverage of the whole sequence, the confidence with which the species represented by the characteristic sequence has been found.

The environmental microorganism detection method and system provided by the invention introduce high-throughput sequencing technology into the sequencing of DNA extracted from environmental samples. During sequence comparison, carrier contamination is removed first, and the DNA tag sequences are then compared comprehensively with the known sequences in the known database. More of the DNA in an environmental sample, or even all of it, can thus be sequenced and compared more thoroughly, so that trace species can be identified effectively and it can be detected which microbial species, or which class of microbial species, may be present in the sample. Furthermore, by processing many or even all of the characteristic sequences in the known database to obtain the average sequencing depth, the coverage of the characteristic sequences, and the coverage of the whole sequences, the confidence with which the species represented by a characteristic sequence has been found can be determined, so that detection is fine enough to distinguish closely related species and even different strains.

Brief description of the drawings

Fig. 1 is a flowchart of the environmental microorganism detection method provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of comparing DNA tag sequences with known sequences to determine the classification of the DNA tag sequences, provided by an embodiment of the invention;

Fig. 3 is a schematic diagram of determining a characteristic sequence from simulated tag sequences that map continuously to unique positions, provided by an embodiment of the invention;

Fig. 4 is a structural block diagram of the environmental microorganism detection system provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

In an embodiment of the invention, DNA extracted from an environmental sample is sequenced using a high-throughput sequencing technology to obtain DNA tag sequences. After any carrier contamination that may be present in the DNA tag sequences is removed, the DNA tag sequences are compared with the known sequences in a known database to obtain the classification of the DNA tag sequences.

Fig. 1 shows the flow of the environmental microorganism detection method provided by an embodiment of the invention, detailed as follows:

In step S101, DNA extracted from an environmental sample is sequenced using a high-throughput sequencing technology to obtain DNA tag sequences.

Here, high-throughput sequencing technologies are the second-generation sequencing technologies represented by Solexa, SOLiD, and the like. Because the detailed process of sequencing DNA with high-throughput sequencing technology is prior art, the embodiment only outlines how the DNA extracted from the environmental sample is sequenced:

A. Extract a DNA sample from the environmental sample. During extraction, the high quality of the DNA and the diversity of the microorganisms in the sample must be preserved.

B. Prepare a library from the DNA sample. In the embodiment, if a paired-end sequencing library is to be built, then to deal effectively with the difficulty of sequencing high-abundance species, an insert length of less than 200 bp is generally suitable during library preparation.

C. Carry out the high-throughput DNA sequencing reaction to obtain a large number of DNA tag sequences.

To improve detection accuracy, all of the DNA extracted from the environmental sample is preferably sequenced in this step.

In step S102, any carrier contamination that may be present in the DNA tag sequences obtained in step S101 is removed.

Because the carrier sequences used in the sequencing reaction are known and specific, the DNA tag sequences obtained from the sequencing reaction may contain these carrier sequences or parts of them. Searching a DNA tag sequence for the carrier sequence string shows whether the DNA tag sequence is contaminated by the carrier sequence, and the carrier contamination present in the DNA tag sequence can then be removed.
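The string search described in step S102 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `strip_carrier`, the `min_overlap` parameter, and the choice to trim the read rather than discard it are all assumptions, and only exact matches are considered.

```python
def strip_carrier(read: str, carrier: str, min_overlap: int = 8) -> str:
    """Trim a carrier sequence (or a leading part of it) from a DNA tag sequence.

    Hypothetical helper: exact string matching only, no mismatches allowed.
    """
    # Case 1: the complete carrier sequence occurs inside the read.
    # Keep only the bases that precede it.
    pos = read.find(carrier)
    if pos != -1:
        return read[:pos]
    # Case 2: only the first part of the carrier was sequenced, at the end
    # of the read. Try the longest possible overlap first.
    for k in range(min(len(carrier), len(read)) - 1, min_overlap - 1, -1):
        if read.endswith(carrier[:k]):
            return read[:-k]
    # No contamination found: return the read unchanged.
    return read


# Illustrative carrier string (an assumption, not taken from the patent).
CARRIER = "AGATCGGAAGAGC"
print(strip_carrier("ACGTACGT" + CARRIER + "TT", CARRIER))
print(strip_carrier("ACGTACGTAGATCGGAAG", CARRIER))
```

The `min_overlap` threshold guards against trimming reads whose last few bases match the start of the carrier by chance; a suitable value would have to be chosen empirically.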

In step S103, the decontaminated DNA tag sequences are compared with the known sequences in the known database, and the classification of the DNA tag sequences is obtained according to the comparison result.

The known database includes, but is not limited to, bacterial genome databases, fungal genome databases, the viral GenBank database, the ribosomal database (RDP database), the non-redundant GenBank database of environmental microorganisms, and the non-redundant GenBank database. In the embodiment, the known sequences of one or more of these known databases can be selected for comparison with the DNA tag sequences according to the detection requirements. When the environmental sample is complex, the known sequences of all the known databases can be selected for comparison with the DNA tag sequences.

In the embodiment, a short-sequence mapping method is used to compare the DNA tag sequences with the known sequences in the known database, and the classification of the known sequence that best matches a DNA tag sequence is taken as the classification of that DNA tag sequence.

The best-matching sequence is the known sequence to which the DNA tag sequence aligns with the fewest base mismatches. When the short-sequence mapping method is used, several best-matching sequences may be obtained; that is, a DNA tag sequence may align equally well to several known sequences. In that case, the nearest common classification of the known sequences to which the DNA tag sequence aligns is taken as the classification of the DNA tag sequence.
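The "nearest common classification" rule can be illustrated with a short Python sketch. It assumes that each known sequence carries a lineage listed from the broadest rank to the narrowest, and that the mapping step has already produced a mismatch count per known sequence; the names `classify` and `nearest_common_category` and the lineage labels are hypothetical.

```python
from typing import Dict, List, Tuple

# A lineage is listed from the broadest rank to the narrowest (assumption).
Lineage = Tuple[str, ...]


def nearest_common_category(lineages: List[Lineage]) -> Lineage:
    """Return the longest prefix shared by all lineages."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:   # lineages diverge at this rank
            break
        common.append(ranks[0])
    return tuple(common)


def classify(mismatches: Dict[str, int], taxonomy: Dict[str, Lineage]) -> Lineage:
    """Assign a DNA tag sequence to the category shared by all of its best matches."""
    best = min(mismatches.values())  # fewest base mismatches
    hits = [taxonomy[ref] for ref, mm in mismatches.items() if mm == best]
    return nearest_common_category(hits)


# Made-up lineages for three known sequences.
taxonomy = {
    "ref1": ("Bacteria", "PhylumA", "GenusA", "species_a1"),
    "ref2": ("Bacteria", "PhylumA", "GenusA", "species_a2"),
    "ref3": ("Bacteria", "PhylumB", "GenusB", "species_b1"),
}
# The tag matches ref1 and ref2 equally well, so it is assigned at genus level.
print(classify({"ref1": 0, "ref2": 0, "ref3": 2}, taxonomy))
```

When a single known sequence is the best match, the full lineage of that sequence is returned, which corresponds to the unambiguous case described above.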

Because microbial genomes have a relatively high mutation rate, a preset number of mismatches and small insertions and deletions are allowed when the DNA tag sequences are compared with the known sequences in the known database. The preset number of mismatches can be set empirically.
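A mismatch-tolerant comparison can be sketched as a brute-force scan. This is only an illustration under simplifying assumptions: it handles substitutions but not the small insertions and deletions mentioned above, it ignores the reverse strand, and a real short-sequence mapper would use an index rather than scanning every position. The names `map_tag` and `max_mismatches` are hypothetical.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_tag(tag: str, known: str, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for every best placement of `tag` on `known`.

    Placements with more than `max_mismatches` substitutions are rejected.
    An empty list means the tag does not map within the limit.
    """
    hits = []
    for start in range(len(known) - len(tag) + 1):
        mm = hamming(tag, known[start:start + len(tag)])
        if mm <= max_mismatches:
            hits.append((start, mm))
    if not hits:
        return []
    best = min(mm for _, mm in hits)
    return [(start, mm) for start, mm in hits if mm == best]


print(map_tag("TTGC", "ACGTTGCAACGT"))
print(map_tag("TTGA", "ACGTTGCAACGT", max_mismatches=1))
```

If the returned list has more than one entry, the tag maps equally well to several positions and is therefore not uniquely mapped.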

Through the above steps, diversity information for the environmental sample at different taxonomic levels can be obtained.

The above detection method can detect which microbial species, or which class of microbial species, may be present in the environmental sample, but it cannot readily determine the confidence that a species is present or, when that confidence is high, the proportion of the species in the environment.

To address these two problems, another embodiment of the invention further includes steps S104 to S107 below. Steps S104 to S107 may be performed before the comparison of the DNA tag sequences with the known sequences in step S103, at the same time as step S103, or after step S103.

In step S104, the known sequences in the known database are preprocessed to obtain characteristic sequences, each of which uniquely represents one species. The specific steps are as follows:

A. Generate simulated tag sequences from the known sequences in the known database, as follows:

Starting from the first base of a known sequence, take a DNA sequence of preset length as the first simulated tag sequence; then, starting from the second base, take a DNA sequence of the same length as the second simulated tag sequence; and so on, taking a DNA sequence of the same length starting from each base of the known sequence as a simulated tag sequence.

B. Map each simulated tag sequence onto the known sequences, and record the simulated tag sequences that map to a unique position.

In the embodiment, any sequence mapping method, such as the SOAP alignment method, can be used to map the simulated tag sequences onto the known sequences, so the mapping itself is not described further here. Because fragments obtained by sequencing always carry a certain error rate, the simulated tag sequences are mapped onto the known sequences under a setting that tolerates sequencing errors, so that in practice a sequencing error does not cause a real DNA tag sequence to be mapped to a different position.

C. Find the simulated tag sequences that map continuously to unique positions, and obtain the characteristic sequences that uniquely represent a species. A characteristic sequence is a DNA sequence fragment that uniquely represents one species. There are generally several characteristic sequences; to improve detection accuracy, all of them are preferably found in this embodiment. The sequencing depth of a characteristic sequence represents the content of the species in the sample. The detailed process is as follows:

Find the simulated tag sequences that map continuously to unique positions to obtain a continuous region of uniquely mapped simulated tag sequences. Remove (simulated tag sequence length - 1) sites from each of the two ends of the continuous region, and take the remaining sequence as a characteristic sequence. The two ends are removed because their sites are uniquely mapped by only some of the simulated tag sequences that overlap them, whereas ideally a region uniquely represents a species only if every site in it is uniquely mapped by as many simulated tag sequences as the tag length. Finally, all the characteristic sequences on a known sequence are joined together to form the "characteristic sequence" that uniquely represents the species. In the embodiment, when the confidence of presence and the proportion in the environment are required for all microbial species detected in the environmental sample, all known sequences in the known database must be preprocessed in this way. Because the known database may contain many species, preprocessing yields multiple characteristic regions, each uniquely representing a different species.

Referring to Fig. 3, when the simulated tag sequences found to map continuously to unique positions are short sequence 1 to short sequence II, (simulated tag sequence length - 1) sites are removed from each of the two ends of the continuously and uniquely aligned region, and the remaining continuous region is taken as the characteristic sequence.
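Steps A to C of the preprocessing can be sketched together in Python. The sketch makes several simplifying assumptions that are not in the source: simulated tag sequences are mapped by exact matching (no sequencing errors tolerated), only the forward strand is considered, and the known sequence being processed is itself one of the entries in `database`. The function name `characteristic_regions` and the parameter `k` (the simulated tag length) are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def characteristic_regions(known: str, database: List[str], k: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of `known` kept as characteristic sequence."""
    # Step A: every length-k window of every known sequence is a simulated tag.
    counts = Counter(seq[i:i + k] for seq in database for i in range(len(seq) - k + 1))
    # Step B: a simulated tag maps to a unique position if it occurs exactly once.
    unique = [counts[known[i:i + k]] == 1 for i in range(len(known) - k + 1)]
    # Step C: find runs of consecutive uniquely mapped tags and trim the ends.
    regions: List[Tuple[int, int]] = []
    run_start = None
    for i, flag in enumerate(unique + [False]):   # sentinel closes the final run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            # Tags starting at run_start..i-1 span bases run_start..i+k-2.
            # Dropping (k - 1) bases at each end leaves run_start+k-1..i-1.
            start, end = run_start + (k - 1), i
            if start < end:
                regions.append((start, end))
            run_start = None
    return regions


known = "ACGTGGATCC"
database = [known, "ACGTTT"]      # toy database with one other sequence
regions = characteristic_regions(known, database, k=3)
# Join all retained regions into one characteristic sequence.
print(regions, "".join(known[s:e] for s, e in regions))
```

In this toy case the first two tags of `known` also occur in the other sequence, so only the run of unique tags after them contributes, and its two ends are trimmed by `k - 1` bases each.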

In step S105, the depth of coverage by DNA tag sequences at each base of the characteristic sequence is calculated, and the average sequencing depth of the characteristic sequence (denoted d) is obtained by fitting a Poisson distribution. The DNA tag sequences in this step are the decontaminated DNA tag sequences from step S102. According to test results, the content in the sample of the species represented by a characteristic sequence increases as the average sequencing depth of the characteristic sequence increases. Therefore, when the relative content ratio of the species detected in the environmental sample is required, the average sequencing depth is calculated for the characteristic sequence that uniquely represents each species, and the method further includes the following step:

Obtain the relative content ratio of the species from the ratio of the average sequencing depths of the characteristic sequences that uniquely represent them. Because the content in the sample of the species represented by a characteristic sequence increases with the average sequencing depth of that characteristic sequence, the ratio of the average sequencing depths of the characteristic sequences is the relative content ratio of the species they represent.

For example, if the calculated average sequencing depth of the characteristic sequence uniquely representing species A is 20, that of species B is 100, and that of species C is 30, then the relative content ratio of species A, species B, and species C is 20:100:30.
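The depth calculation and the content ratio can be sketched as follows. The source does not say how the Poisson fit is performed; this sketch assumes a maximum-likelihood fit, for which the estimated Poisson rate equals the sample mean of the per-base depths. The function names and the `(start, length)` representation of aligned DNA tag sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def per_base_depth(length: int, alignments: List[Tuple[int, int]]) -> List[int]:
    """Count how many tags cover each base; alignments are (start, tag_length) pairs."""
    depth = [0] * length
    for start, tag_len in alignments:
        for pos in range(max(start, 0), min(start + tag_len, length)):
            depth[pos] += 1
    return depth


def poisson_mean_depth(depth: List[int]) -> float:
    """Maximum-likelihood Poisson rate for the depths, which is their sample mean."""
    return sum(depth) / len(depth)


def relative_content(mean_depths: Dict[str, float]) -> Dict[str, float]:
    """Scale the average depths so that the shares sum to 1."""
    total = sum(mean_depths.values())
    return {name: d / total for name, d in mean_depths.items()}


depth = per_base_depth(6, [(0, 4), (2, 4)])
print(depth, poisson_mean_depth(depth))
# The depths from the worked example: A = 20, B = 100, C = 30.
print(relative_content({"A": 20, "B": 100, "C": 30}))
```

Normalizing to shares is only a convenience; the ratio 20:100:30 itself is unchanged by the scaling.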

In step S106, the number of bases in the characteristic sequence that are covered by DNA tag sequences is counted and divided by the total number of bases in the characteristic sequence to obtain the coverage of the characteristic sequence (denoted c). Likewise, the number of bases in the whole sequence (which includes the characteristic sequence and the sequence to which DNA tag sequences align non-uniquely) that are covered by DNA tag sequences is counted and divided by the number of bases in the whole sequence to obtain the coverage of the whole sequence (denoted c'). For example, if a sequence has 100 bases (that is, a length of 100 bp) and 80 of them are covered, the coverage of the sequence is 0.8.
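The coverage calculation is a simple ratio and can be sketched in a few lines of Python. The sketch assumes that a per-base depth list is already available for the sequence in question; the function name `coverage` is hypothetical.

```python
from typing import List


def coverage(depth: List[int]) -> float:
    """Fraction of bases covered by at least one DNA tag sequence."""
    covered = sum(1 for d in depth if d > 0)
    return covered / len(depth)


# The 100 bp example: 80 covered bases and 20 uncovered bases.
depth = [3] * 80 + [0] * 20
print(coverage(depth))
```

The same function applies to both the characteristic sequence and the whole sequence; only the depth list passed in differs.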

In step S107, the confidence with which the species sequence represented by the characteristic sequence has been found is calculated from the average sequencing depth d of the characteristic sequence, the coverage c of the characteristic sequence, and the coverage c' of the whole sequence. For example, the confidence P can be calculated with the following formula:

[formula missing in the source]

When P is close to 1, the confidence is highest; when P is close to 0, the confidence is lowest. Here θ is the correction factor for sequencing, and its value may differ between sequencing methods. Under normal circumstances c < c' holds; if c > c' in real data, the species sequence is abnormal.
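The confidence formula itself is not available in this text, so the following Python sketch does not reproduce it. It only illustrates one plausible consistency check, under stated assumptions: if per-base depth follows a Poisson distribution with mean d, the expected fraction of bases covered at least once is 1 - exp(-d). Treating θ as a multiplier on d, the `tolerance` threshold, and the function names are all assumptions made for illustration.

```python
import math


def expected_coverage(mean_depth: float, theta: float = 1.0) -> float:
    """Expected covered fraction under a Poisson depth model (assumed form)."""
    return 1.0 - math.exp(-theta * mean_depth)


def looks_credible(c: float, c_whole: float, mean_depth: float,
                   theta: float = 1.0, tolerance: float = 0.1) -> bool:
    """Hypothetical check: observed coverage c should not exceed the whole-sequence
    coverage and should be close to the coverage expected from the depth."""
    close_to_expected = abs(c - expected_coverage(mean_depth, theta)) <= tolerance
    return c <= c_whole and close_to_expected


print(expected_coverage(2.0))
print(looks_credible(c=0.85, c_whole=0.90, mean_depth=2.0))
print(looks_credible(c=0.30, c_whole=0.90, mean_depth=2.0))
```

The idea behind the check is that a species that is truly present should have reads spread over its characteristic sequence roughly as the depth predicts, whereas high depth concentrated on a small part of the sequence suggests a spurious match.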

Fig. 4 shows the structure of the environmental microorganism detection system provided by an embodiment of the invention. For ease of description, only the parts relevant to the embodiment are shown.

The DNA sequencing unit 41 sequences DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences. Here, high-throughput sequencing technologies are the second-generation sequencing technologies represented by Solexa, SOLiD, and the like. The DNA sequencing unit 41 includes a DNA sample extraction module 411, a library preparation module 412, and a sequencing module 413. The DNA sample extraction module 411 extracts a DNA sample from the environmental sample. During extraction, the high quality of the DNA and the diversity of the microorganisms in the sample must be preserved. The library preparation module 412 prepares a library from the DNA sample. The sequencing module 413 carries out the high-throughput DNA sequencing reaction to obtain a large number of DNA tag sequences. Because the specific sequencing process of the sequencing module 413 is prior art, it is not described further here.

The carrier contamination removal unit 42 removes any carrier contamination that may be present in the DNA tag sequences obtained by the DNA sequencing unit 41. In the embodiment, because the carrier sequences used in the sequencing reaction are known and specific, the DNA tag sequences obtained from the sequencing reaction may contain these carrier sequences or parts of them. Searching a DNA tag sequence for the carrier sequence string shows whether the DNA tag sequence is contaminated by the carrier sequence, and the carrier contamination present in the DNA tag sequence can then be removed.

The classification determining unit 43 compares the DNA tag sequences processed by the carrier contamination removal unit 42 with the known sequences in the known database, and obtains the classification of the DNA tag sequences according to the comparison result. The known database is one of, or a combination of several of, the bacterial genome database, the fungal genome database, the viral GenBank database, the RDP database, and the nt database.

In the embodiment, a short-sequence mapping method is used to compare the DNA tag sequences with the known sequences in the known database to obtain the best match between a DNA tag sequence and the known sequences. The best match is the position on a known sequence to which the DNA tag sequence aligns with the fewest base mismatches. The classification of the DNA tag sequence is obtained from this best match. When the short-sequence mapping method is used, several best matches may be obtained; that is, a DNA tag sequence may align equally well to several known sequences. In that case, the nearest common classification of the known sequences to which the DNA tag sequence aligns is taken as the classification of the DNA tag sequence.

The units described above can detect which microbial species, or which class of microbial species, may be present in the environmental sample, but they cannot readily determine the confidence that a species is present or, when that confidence is high, the proportion of the species in the environment. To address these two problems, in another embodiment of the invention the system further includes a known sequence preprocessing unit 44, a sequencing depth calculation unit 45, a coverage calculation unit 46, and a confidence judgment unit 47.

The known sequence preprocessing unit 44 preprocesses the known sequences in the known database to obtain DNA sequence fragments that each uniquely represent one species. It includes a simulated tag sequence generation module 441, a simulated tag sequence mapping module 442, and a characteristic sequence acquisition module 443.

The simulated tag sequence generation module 441 takes a DNA sequence of the same length starting from each base of a known sequence as a simulated tag sequence.

The simulated tag sequence mapping module 442 maps each simulated tag sequence onto the known sequences and records the simulated tag sequences that map to a unique position.

The characteristic sequence acquisition module 443 finds the region of simulated tag sequences that map continuously to unique positions, removes (simulated tag sequence length - 1) sites from each of the two ends of the region, and takes the remaining sequence as a characteristic sequence. Finally, all the characteristic sequences on a known sequence are joined together to form the "characteristic sequence" that uniquely represents the species. The two ends are removed because their sites are uniquely mapped by only some of the simulated tag sequences that overlap them, whereas ideally a region uniquely represents a species only if every site in it is uniquely mapped by as many simulated tag sequences as the tag length. Trimming the ends in this way ensures that the resulting characteristic sequence uniquely represents one species.

The sequencing depth calculation unit 45 calculates the depth of coverage by DNA tag sequences at each base of the characteristic sequence and obtains the average sequencing depth of the characteristic sequence (denoted d) by fitting a Poisson distribution. The average sequencing depth of the characteristic sequence reflects the content in the sample of the species represented by the DNA tag sequences that align to the characteristic sequence.

The coverage calculation unit 46 calculates the coverage of the characteristic sequence and of the whole sequence. It includes a characteristic sequence coverage calculation module 461 and a whole sequence coverage calculation module 462. The characteristic sequence coverage calculation module 461 calculates how many bases of the characteristic sequence are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequence (denoted c). The whole sequence coverage calculation module 462 calculates how many bases of the whole sequence (which includes the characteristic sequence and the sequence to which DNA tag sequences align non-uniquely) are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence (denoted c').

The confidence judgment unit 47 judges the confidence with which the species sequence represented by the characteristic sequence has been found according to the average sequencing depth d of the characteristic sequence, the coverage c of the characteristic sequence, and the coverage c' of the whole sequence. In the embodiment, when c is approximately equal to [expression missing in the source] and c ≤ c', the species sequence is considered to have been found with high confidence, where θ is the correction factor for sequencing and its value may differ between sequencing methods. Otherwise, the species sequence is considered to have been found with low confidence.

When the relative content ratio of the species detected in the environmental sample needs to be known, in another embodiment of the present invention the system further includes a content ratio calculation unit 48. The content ratio calculation unit 48 obtains the relative content ratio of the species represented by the characteristic sequences from the ratio of the calculated average sequencing depths of the characteristic sequences that uniquely represent each species. Because the content in the sample of the species represented by a characteristic sequence increases as the average sequencing depth of that characteristic sequence increases, the ratio of the average sequencing depths of the characteristic sequences that uniquely represent each species is the relative content ratio of the species they represent.
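A minimal sketch of this ratio, assuming the average sequencing depths have already been computed for each species; the function name and the dictionary format are hypothetical, and the ratios are normalized to sum to 1 for convenience.

```python
def relative_content(depths):
    """Convert per-species average sequencing depths into relative content ratios.

    depths: hypothetical dict mapping species name -> average sequencing depth
    of the characteristic sequence that uniquely represents that species.
    """
    total = sum(depths.values())
    if total <= 0:
        raise ValueError("no sequencing depth to compare")
    return {species: d / total for species, d in depths.items()}
```

For example, depths of 30 and 10 for two species correspond to a relative content ratio of 3:1.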

In the embodiment of the present invention, the extracted DNA sample is sequenced using a high-throughput sequencing technology to obtain DNA tag sequences, the sequenced tag sequences are then compared with known sequences in a known database, and the classification to which each DNA tag sequence belongs is obtained from the comparison result, so that it is possible to detect which microbial species, or which classes of microbial species, may be present in the environmental sample. The known sequences in the known database are preprocessed to obtain characteristic sequences that each uniquely represent one species; the degree to which each base of a characteristic sequence is covered by DNA tag sequences is then calculated, and the average sequencing depth of the characteristic sequence is obtained by Poisson distribution fitting, so that the content in the sample of the species represented by the characteristic sequence is detected. In addition, the coverage of the characteristic region and the coverage of the whole sequence are calculated, so that the credibility with which the species represented by the characteristic sequence is found can be judged from the average sequencing depth of the characteristic sequence, the coverage of the characteristic region and the coverage of the whole sequence.
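The classification step summarized above can be illustrated as follows. This is only a sketch: `classify_tag`, the `(reference_id, mismatches)` hit format and the root-to-leaf lineage lists are hypothetical names and structures chosen for the example. It assumes that the best match is the hit with the fewest base mismatches, and that a tag sequence with several equally good matches is assigned to the nearest classification those matches share.

```python
def classify_tag(hits, lineages):
    """Assign one DNA tag sequence to a classification.

    hits: hypothetical list of (reference_id, mismatches) pairs reported by a
    short-sequence mapper for this tag.
    lineages: hypothetical dict mapping reference_id -> list of classification
    names ordered from the root down to the species.
    Returns the classification name, or None if the tag has no usable hit.
    """
    if not hits:
        return None
    # Best matches are the hits with the fewest base mismatches.
    fewest = min(mismatches for _, mismatches in hits)
    best_refs = [ref for ref, mismatches in hits if mismatches == fewest]

    # Walk the lineages level by level and keep the shared prefix; its last
    # entry is the nearest classification common to all best matches.
    common = []
    for level in zip(*(lineages[ref] for ref in best_refs)):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return common[-1] if common else None
```

With a single best match the shared prefix is the full lineage, so the tag is assigned at species level; with several best matches from different genera it is assigned higher up.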

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An environmental microorganism detection method, characterized in that the method comprises the following steps:

receiving as input DNA extracted from an environmental sample, and sequencing the DNA extracted from the environmental sample by a high-throughput sequencing method to obtain DNA tag sequences;

removing carrier contamination present in the DNA tag sequences;

comparing the DNA tag sequences obtained after removal of the carrier contamination with known sequences in a known database, and determining, according to the comparison result, the classification to which the DNA tag sequences belong.

2. The method of claim 1, characterized in that the step of comparing the DNA tag sequences obtained after removal of the carrier contamination with the known sequences in the known database, and determining, according to the comparison result, the classification to which the DNA tag sequences belong, further comprises:

comparing the DNA tag sequences with the known sequences in the known database using a short-sequence mapping method, and determining the classification of the best matching sequence between a DNA tag sequence and the known sequences as the classification to which the DNA tag sequence belongs, the best matching sequence between the DNA tag sequence and the known sequences being the sequence on the known sequences to which the DNA tag sequence aligns with the fewest base mismatches.

3. The method of claim 2, characterized in that, when there are multiple best matching sequences between the DNA tag sequence and the known sequences, the nearest classification common to the multiple best matching sequences is determined as the classification of the DNA tag sequence.

4. The method of claim 1, characterized in that the method further comprises the following steps:

preprocessing the known sequences in the known database to obtain DNA sequence fragments that uniquely represent one species;

calculating the degree to which each base of the characteristic sequence is covered by DNA tag sequences, and obtaining the average sequencing depth of the characteristic sequence by Poisson distribution fitting;

judging, according to the average sequencing depth of the characteristic region, the coverage of the characteristic sequence and the coverage of the whole sequence, the credibility with which the species represented by the characteristic sequence is found.

5. The method of claim 4, characterized in that the step of preprocessing the known sequences in the known database to obtain DNA sequence fragments that uniquely represent one species comprises:

taking, starting from each base of a known sequence, a DNA sequence of preset length as a simulated tag sequence;

mapping the simulated tag sequences onto the known sequences, and recording the simulated tag sequences that are mapped to unique positions;

searching for regions in which simulated tag sequences are continuously mapped to unique positions, removing (simulated tag sequence length − 1) sites from each of the head and the tail of each such region, taking the sequence in the remaining continuous region as a characteristic sequence, and joining together the characteristic sequences on the known sequence to form the characteristic sequence that uniquely represents the DNA sequence fragments of one species.

6. The method of claim 4, characterized in that the step of judging, according to the average sequencing depth of the characteristic region, the coverage of the characteristic sequence and the coverage of the whole sequence, the credibility with which the species represented by the characteristic sequence is found is specifically:

the credibility p [expression not reproduced in the source text]; when p is close to 1, the credibility is highest; when p is close to 0, the credibility is lowest, where c is the coverage of the characteristic sequence, d is the average sequencing depth of the characteristic sequence, c' is the coverage of the whole sequence, and θ is the sequencing correction factor.

7. The method of claim 4, characterized in that the step of calculating the degree to which each base of the characteristic sequence is covered by DNA tag sequences and obtaining the average sequencing depth of the characteristic sequence by Poisson distribution fitting further comprises the following step:

obtaining the relative content ratio of the species represented by the characteristic sequences from the ratio of the calculated average sequencing depths of the characteristic sequences that uniquely represent each species.

8. The method of any one of claims 1 to 7, characterized in that sequencing the DNA extracted from the environmental sample using the high-throughput sequencing technology consists of sequencing all of the DNA extracted from the environmental sample.

9. An environmental microorganism detection system, characterized in that the system comprises: a DNA sequencing unit, configured to sequence input DNA extracted from an environmental sample using a high-throughput sequencing technology to obtain DNA tag sequences;

a classification determination unit, configured to compare the DNA tag sequences obtained after removal of carrier contamination with known sequences in a known database, and to determine, according to the comparison result, the classification to which the DNA tag sequences belong.

10. The system of claim 9, characterized in that the system further comprises:

a known sequence preprocessing unit, configured to preprocess the known sequences in the known database to obtain DNA sequence fragments that uniquely represent one species;

a sequencing depth calculation unit, configured to calculate the degree to which each base of the characteristic sequence is covered by DNA tag sequences, and to obtain the average sequencing depth of the characteristic sequence by Poisson distribution fitting;

a coverage calculation unit, configured to calculate how many bases of the characteristic sequence are covered by DNA tag sequences, thereby obtaining the coverage of the characteristic sequence, and to calculate how many bases of the whole sequence are covered by DNA tag sequences, thereby obtaining the coverage of the whole sequence;

a credibility judgment unit, configured to judge, according to the average sequencing depth of the characteristic region, the coverage of the characteristic sequence and the coverage of the whole sequence, the level of credibility with which the species represented by the characteristic sequence is found.