CN108268753A

CN108268753A - Microbiome identification method, apparatus, and device

Info

Publication number: CN108268753A
Application number: CN201810073198.4A
Authority: CN
Inventors: 王子承; 江瑞; 陈挺
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2018-01-25
Filing date: 2018-01-25
Publication date: 2018-07-10
Anticipated expiration: 2038-01-25
Also published as: CN108268753B

Abstract

A microbiome identification method, apparatus, and device. The microbiome identification method includes: obtaining microbiome feature information of multiple biological individuals to generate multiple samples; calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities; and establishing a similarity probability distribution model of the first sample from the multiple similarities. A sample to be tested is then obtained, its similarity to the first sample is calculated, a first probability value of the sample to be tested is determined from that similarity and the similarity probability distribution model of the first sample, and whether the sample to be tested and the first sample belong to the same biological individual is judged according to the first probability value. The scheme provided in this embodiment can effectively identify microbiomes.

Description

Microbiome identification method, apparatus, and device

Technical field

The present invention relates to biometric identification technology, and in particular to a microbiome identification method, apparatus, and device.

Background art

Microorganisms are found everywhere in the natural environment, and the human body is no exception: from the gut inside the body to the skin outside, the number of bacteria, the main members of the microbial community, is even comparable to the number of human cells. Microorganisms do not occur in isolation but usually exist as microbial communities. Because it is still difficult under existing experimental conditions to isolate and culture all microorganisms, DNA sequencing is used to obtain the basic composition of a microbial community indirectly, i.e., the microbiome. The microbiome is the sum of all genetic material of a microbial community. Because high-throughput sequencing yields mixed genome fragment data, the sequencing data of a microbiome is referred to as a metagenome.

The microbiome of an individual is highly specific, which has been confirmed in many metagenomic sequencing data sets. Some methods extract features from sequences to uniquely characterize a person's microbiome, which can serve as a specific molecular label for that person within a certain period of time, and these methods have been applied in small-sample experiments. However, because an individual's microbiome changes constantly, metagenomic sequencing data is not as stable as the genome and cannot remain effective as a molecular label over time.

Summary of the invention

At least one embodiment of the present invention provides a microbiome identification method, apparatus, and device that can effectively identify microbiomes.

To achieve the object of the present invention, at least one embodiment of the present invention provides a microbiome identification method, including:

obtaining microbiome feature information of multiple biological individuals to generate multiple samples, calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establishing a similarity probability distribution model of the first sample from the multiple similarities;

obtaining a sample to be tested, calculating the similarity between the sample to be tested and the first sample, determining a first probability value of the sample to be tested from that similarity and the similarity probability distribution model of the first sample, and judging according to the first probability value whether the sample to be tested and the first sample belong to the same biological individual.

At least one embodiment of the present invention provides a microbiome identification apparatus, including:

an information acquisition module, configured to obtain microbiome feature information of multiple biological individuals to generate multiple samples, and to obtain a sample to be tested;

a similarity calculation module, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the sample to be tested and the first sample;

a similarity distribution establishing module, configured to establish the similarity probability distribution model of the first sample from the multiple similarities;

an identification module, configured to judge whether the sample to be tested and the first sample belong to the same biological individual according to the position of their similarity in the similarity probability distribution model of the first sample.

An embodiment of the present invention provides a microbiome identification device, including a memory and a processor, where the memory stores a program which, when read and executed by the processor, implements the microbiome identification method of any of the above embodiments.

Compared with the related art, in an embodiment of the present invention a similarity probability distribution model is established for a sample, and whether the sample to be tested and that sample belong to the same biological individual is judged from the probability value, in that model, of the similarity between the sample to be tested and the sample. The scheme of the present application can thus identify microbiomes.

Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.

Description of the drawings

The accompanying drawings are provided for a further understanding of the technical solution of the present invention and form part of the specification. Together with the embodiments of the present application they serve to explain the technical solution of the present invention and do not limit it.

Fig. 1 is a flowchart of a microbiome identification method provided by an embodiment of the present invention;

Fig. 2 is a block diagram of a microbiome identification apparatus provided by an embodiment of the present invention;

Fig. 3 is a block diagram of an identification module provided by an embodiment of the present invention;

Fig. 4 is a block diagram of an identification module provided by another embodiment of the present invention;

Fig. 5 is a schematic diagram of a microbiome identification method provided by an embodiment of the present invention;

Fig. 6 is a flowchart of a microbiome identification method provided by an embodiment of the present invention;

Fig. 7 compares the success rate of the microbiome identification method provided by an embodiment of the present invention with that of other methods.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the present invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another arbitrarily.

The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one given here.

Unless otherwise defined, the technical or scientific terms used in this disclosure shall have the ordinary meaning understood by a person of ordinary skill in the field to which the present invention belongs. "First", "second", and similar words used in this disclosure do not indicate any order, quantity, or importance, but are only used to distinguish different components. "Comprising", "including", and similar words mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects.

In the present application, individuals are identified by building a similarity distribution model of microbiome feature information between individuals. Rather than looking for a fixed label, the approach relies on the observation that the microbiome similarity between samples from the same individual is significantly greater than the similarity with other people. If the similarity of two metagenomic samples is significantly high, they are considered to come from the same biological individual. For a distribution, a value that falls in a rarely occurring position can be regarded as significant to some degree; the present application follows a similar idea and identifies individual samples through the similarity distribution generated between individuals. Regarding this between-individual similarity distribution: if microbiome specificity is described as a classification problem in which the different samples of one biological individual form one class, then all remaining samples form the other class. Therefore, for the distribution generated by a sample of a biological individual at one time point and the samples of other biological individuals, a sample of the same individual taken at another time point should not belong to this distribution; that is, its p-value (probability value) will be sufficiently small.

As shown in Fig. 1, an embodiment of the present invention provides a microbiome identification method, including:

Step 101: obtain microbiome feature information of multiple biological individuals to generate multiple samples, calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establish the similarity probability distribution model of the first sample from the multiple similarities.

Here, the first sample is any one of the multiple samples. Each sample corresponds to the microbiome feature information of one biological individual.

The multiple biological individuals are known; for example, when the biological individuals are people, the person corresponding to each sample is known. The multiple samples may also include microbiome feature information collected from one biological individual at different times. The more samples there are, the more accurate the similarity probability distribution model is, so as many samples as possible should be obtained. The multiple biological individuals include different biological individuals. A biological individual can be a human, an animal, or another organism. The microbiome is, for example, gut microorganisms (which can be extracted from feces), oral microorganisms, and so on.

It should be noted that, in other embodiments, similarity probability distribution models may also be calculated for the samples other than the first sample.

Step 102: obtain a sample to be tested and calculate the similarity between the sample to be tested and the first sample.

Here, the sample to be tested is the microbiome feature information of an unknown biological individual.

Step 103: determine a first probability value of the sample to be tested from the similarity between the sample to be tested and the first sample and the similarity probability distribution model of the first sample, and judge according to the first probability value whether the sample to be tested and the first sample belong to the same biological individual.

The microbiome identification method provided in this embodiment establishes the similarity probability distribution model of the first sample and uses the probability value, in that model, of the similarity of the sample to be tested to judge whether the sample to be tested and the first sample come from the same biological individual.

In an embodiment, in step 101, the microbiome feature information includes: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.

In an embodiment, when the feature information of the microbiome is metagenomic sequencing data of the microbiome, k-mer segmentation is performed on the metagenomic sequencing data, where k is greater than 1. In one example, k is greater than 15, for example k = 18. It should be noted that k-mer segmentation can be omitted and the similarity calculated directly from the metagenomic sequencing data; performing k-mer segmentation before calculating the similarity, however, greatly reduces the amount of computation. In other embodiments, the metagenomic sequencing data may also be annotated with species labels or with gene functional group labels, and the similarity calculated from the species information or gene information.

A k-mer is a subsequence cut from a sequence at a fixed length k. Sliding along a sequence one base at a time yields one k-mer per position, so a sequence of length n yields n-k+1 k-mers. Extracting k-mers does not involve any reference genome, so all sequences can be used. The length of the k-mer must be chosen according to the application. When k equals 1, one is simply counting the distribution of bases. A k below 10 can be regarded as a short sequence; counting the occurrence frequencies of such k-mers can be used to compare samples, and k-mers of this length are also commonly used as features in some classification problems. A k between 10 and 15 can be regarded as a medium-sized choice; it can serve as the basic k-mer length in assembly and gives some discrimination between bacteria, and because the number of distinct k-mers is on the order of one billion (4^15, slightly over one billion, for k = 15), dimensionality reduction can sometimes be left aside. A k greater than 15 can be regarded as a long k-mer; as features these can distinguish many bacteria, and particularly for sequences longer than 30, many k-mers can serve as molecular labels that uniquely identify a strain.
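As a minimal sketch of the sliding-window extraction described above (the function name `kmers` and the example read are illustrative assumptions, not part of the patent), in Python:

```python
def kmers(sequence: str, k: int) -> list:
    """Return every length-k window of `sequence`, sliding one base at a time."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # range() is empty when the sequence is shorter than k, giving no k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "ACGTACGTGA"          # hypothetical read of length n = 10
windows = kmers(read, k=4)   # n - k + 1 windows, duplicates included
kmer_set = set(windows)      # distinct k-mer types, as used for set similarity
```

Keeping the set of distinct k-mers rather than the full list is what the set-based similarity measures discussed later operate on.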

In theory the similarity distribution should resemble a normal distribution, but its range is (0, 1). The Beta distribution, a variant of the Gamma distribution on the interval (0, 1), is therefore selected as the model for the similarity distribution between individuals.
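A minimal sketch of how such a Beta model could be fitted and queried is given below. The patent does not state the fitting procedure or which tail is used for the probability value; this sketch assumes a method-of-moments fit and takes the probability value to be the upper-tail probability P(X >= s), so that an unusually high similarity gives a small value. The function names `fit_beta` and `beta_upper_tail` and the example numbers are hypothetical, and the tail is integrated numerically using only the standard library.

```python
import math
from statistics import mean, variance


def fit_beta(similarities):
    """Method-of-moments Beta(a, b) fit to similarities lying in (0, 1)."""
    m, v = mean(similarities), variance(similarities)
    if not 0.0 < v < m * (1.0 - m):
        raise ValueError("sample moments are incompatible with a Beta fit")
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def beta_upper_tail(x, a, b, steps=20000):
    """P(X >= x) for X ~ Beta(a, b), by midpoint-rule integration of the pdf."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = (1.0 - x) / steps
    total = 0.0
    for i in range(steps):
        t = x + (i + 0.5) * width  # midpoint of the i-th slice of [x, 1]
        total += math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log1p(-t))
    return min(1.0, total * width)


# Hypothetical similarities between one sample and samples of other individuals.
others = [0.12, 0.18, 0.22, 0.25, 0.30, 0.15]
alpha, beta = fit_beta(others)
p_value = beta_upper_tail(0.85, alpha, beta)  # similarity of a sample to be tested
```

In practice a library routine for the regularized incomplete beta function would normally replace the numerical integration; it is written out here only so the sketch has no dependencies.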

When calculating the similarity, the microbiome feature information used can be species information or gene information derived from the metagenomic sequencing data, or the k-mers directly. The similarity metric can be a spatial distance, the Jaccard distance, the Bray-Curtis distance, and so on. When k-mers are used as features for calculating similarity, the number of k-mers should be kept as uniform as possible across samples, with appropriate trade-offs made according to the sample size required.
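For count-based features, one of the metrics named above can be sketched as follows. This is an illustration only: it assumes k-mer occurrence counts as the feature vector and reports similarity as one minus the Bray-Curtis dissimilarity; the function name and the example profiles are hypothetical.

```python
from collections import Counter


def bray_curtis_similarity(counts_a: Counter, counts_b: Counter) -> float:
    """1 - Bray-Curtis dissimilarity between two k-mer count profiles."""
    keys = set(counts_a) | set(counts_b)
    total = sum(counts_a[k] + counts_b[k] for k in keys)
    if total == 0:
        return 0.0  # convention chosen here for two empty profiles
    # A Counter returns 0 for a missing key, so unshared k-mers contribute nothing.
    shared = sum(min(counts_a[k], counts_b[k]) for k in keys)
    return 2.0 * shared / total


profile_a = Counter({"AC": 2, "CG": 1})   # hypothetical k-mer counts
profile_b = Counter({"AC": 1, "GT": 1})
similarity = bray_curtis_similarity(profile_a, profile_b)
```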

In an embodiment, the similarity is obtained with the MinHash algorithm (min-hash, the locality-sensitive hashing (LSH) optimization of the Jaccard distance).

The core idea of LSH is to map spatial relationships in a high-dimensional space into a low-dimensional space while recovering the original relationships as far as possible; it is a simplifying method rather than a strengthening one. That is, if samples have similarity relationships in the original space, an LSH algorithm can recover these relationships comparatively quickly. Many similarity or distance measures have corresponding LSH algorithms, including Euclidean distance, cosine distance, and Jaccard similarity; the LSH algorithm corresponding to Jaccard similarity is the min-hash algorithm (MinHash for short).

A set of k-mer types can be obtained for each sample. Suppose sets A and B come from two samplings of the gut microbial community, each sequenced and converted to k-mers. The Jaccard similarity of the two samples is then:

J(A, B) = |A ∩ B| / |A ∪ B|
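A direct, exact computation of this quantity on two small k-mer sets might look like the following sketch (the function name `jaccard`, the example sets, and the convention for two empty sets are assumptions made for illustration):

```python
def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


sample_a = {"ACGT", "CGTA", "GTAC", "TACG"}   # hypothetical k-mer sets
sample_b = {"ACGT", "CGTA", "TTTT"}
similarity = jaccard(sample_a, sample_b)       # 2 shared k-mers out of 5 distinct
```

The exact computation needs both full sets in memory, which is the cost the MinHash approximation is meant to avoid.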

Now suppose there is a hash function h(), which is a random permutation of all k-mers in sets A and B, and define h_min(S) as the first k-mer of set S to appear under that hash function (h() is an ordered arrangement; walking through the arrangement in order and checking whether each k-mer is in S, the first k-mer found gives h_min(S) as its index in the arrangement). Then:

Pr(h_min(A) = h_min(B)) = J(A, B)

To estimate the probability that h_min(A) = h_min(B), multiple random h() must naturally be generated. Suppose n random trials are carried out and h_min(A) = h_min(B) holds in m of them; J(A, B) can then be approximated by m/n. This is the original definition of MinHash: the Jaccard similarity is approximated through multiple h(), and the expected error of this estimate has been shown to be O(1/√n).
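The m/n estimate over n random arrangements can be sketched as follows. This is an illustrative implementation under stated assumptions: each h() is realized by shuffling the union of the two sets with a seeded generator, and the function name, the `seed` parameter, and the example sets are hypothetical.

```python
import random


def minhash_estimate(a: set, b: set, n: int, seed: int = 0) -> float:
    """Estimate J(A, B) as m / n over n random permutations of the union."""
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    rng = random.Random(seed)
    universe = sorted(a | b)
    matches = 0
    for _ in range(n):
        rng.shuffle(universe)  # one random arrangement h()
        rank = {item: i for i, item in enumerate(universe)}
        # h_min(S) is the element of S that appears first in the arrangement.
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            matches += 1
    return matches / n


set_a = set(range(0, 60))     # stand-ins for k-mer sets; true J = 30 / 90
set_b = set(range(30, 90))
estimate = minhash_estimate(set_a, set_b, n=2000)
```

Each trial costs a full shuffle of the union, which illustrates why generating many permutations becomes the bottleneck for large k-mer sets.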

Computing MinHash is not complicated, but the rate-limiting step is often generating the n hash functions (random permutations), which is time-consuming especially when the two sets contain many k-mers. This led to a variant that uses a single hash function. Define h_(n)(S) as the first n k-mers of set S to appear in the arrangement h(); J(A, B) can then be approximated as:

J(A, B) ≈ |h_(n)(A ∪ B) ∩ h_(n)(A) ∩ h_(n)(B)| / n
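A minimal sketch of this single-hash variant follows. It assumes that the ordering h() is given by a 64-bit BLAKE2b digest of each k-mer and that each set holds at least n k-mers; the function names and the choice of hash are illustrative assumptions and are not specified in the patent.

```python
import hashlib


def hash_value(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer; its position under h() is this value."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(kmers: set, n: int) -> set:
    """h_(n)(S): the n smallest hash values of the set."""
    return set(sorted(hash_value(x) for x in kmers)[:n])


def sketch_jaccard(sketch_a: set, sketch_b: set, n: int) -> float:
    """Size of h_(n)(A u B) & h_(n)(A) & h_(n)(B), divided by n."""
    # The n smallest values of the union all lie in one of the two sketches,
    # so h_(n)(A u B) can be rebuilt from the sketches alone.
    union_sketch = set(sorted(sketch_a | sketch_b)[:n])
    return len(union_sketch & sketch_a & sketch_b) / n
```

Only the two sketches need to be stored per sample, so a new sample can be compared against many existing ones without revisiting their full k-mer sets.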

It should be noted that the above gives only one way of calculating similarity; the present application is not limited to it, and other methods of calculating similarity are also applicable.

In an embodiment, in step 103, judging according to the first probability value whether the sample to be tested and the first sample belong to the same biological individual includes:

when the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, they do not belong to the same biological individual.

The first preset threshold is set according to the requirement for statistically significant similarity; for example, it can be set to 0.01, or set as needed.

When the sample to be tested is tested against multiple distributions, a false discovery rate (FDR) correction can also be applied to avoid false positives. The Benjamini-Yekutieli (BY) method can be used for the correction, although other methods can of course be used as well. In an embodiment, the method further includes:

establishing similarity probability distribution models for the samples other than the first sample, obtaining the similarity between the sample to be tested and each of those other samples, and determining the other probability values of the sample to be tested from these similarities and the similarity probability distribution models of the other samples. For example, when there are n samples, a similarity probability distribution model can be established for each sample, giving n models in total; the similarity between the sample to be tested and each of the n samples is calculated, giving n similarities; and from the n models, n probability values of the sample to be tested are obtained.

Judging according to the first probability value whether the sample to be tested and the first sample belong to the same biological individual then includes: applying false discovery rate correction to the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when it is greater than or equal to the second preset threshold, they do not.

False discovery rate correction is applied to the n probability values to obtain n corrected probability values, and the n corrected values are then used to judge, for each of the n samples, whether the sample to be tested belongs to the same biological individual. The second preset threshold represents the false discovery rate; it can be set to 0.01 or to other values, and a smaller value generally means a smaller false discovery rate.
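The Benjamini-Yekutieli adjustment mentioned above can be sketched as follows. This is a generic illustration of the BY step-up adjustment, which multiplies each ranked probability value by n·c(n)/rank, with c(n) = 1 + 1/2 + ... + 1/n, and then enforces monotonicity; the function name and the example values are hypothetical, and the patent does not prescribe this particular implementation.

```python
def benjamini_yekutieli(p_values):
    """Return BY-adjusted values in the same order as `p_values`."""
    n = len(p_values)
    if n == 0:
        return []
    harmonic = sum(1.0 / k for k in range(1, n + 1))   # c(n)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):        # walk from the largest p to the smallest
        idx = order[rank - 1]
        candidate = p_values[idx] * n * harmonic / rank
        running_min = min(running_min, candidate)      # keeps values monotone, capped at 1
        adjusted[idx] = running_min
    return adjusted


p_values = [0.9, 0.001, 0.5]            # hypothetical values against three samples
q_values = benjamini_yekutieli(p_values)
same_individual = [q < 0.01 for q in q_values]   # second preset threshold of 0.01
```

With these example values only the second reference sample would be reported as a match after correction.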

Compared with a scheme that judges directly by similarity, the scheme provided in this embodiment can reduce the probability of error through false discovery rate correction. For example, a false discovery rate threshold of 0.01 represents a 1% probability of a wrong judgment; if the sample to be tested is significant in the distribution of only one of all the samples, the result can essentially be considered reliable, because the probability of error is 0.01.

An embodiment of the present invention provides a microbiome identification apparatus which, as shown in Fig. 2, includes:

an information acquisition module 201, configured to obtain microbiome feature information of multiple biological individuals to generate multiple samples, and to obtain a sample to be tested;

a similarity calculation module 202, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the sample to be tested and the first sample;

a similarity distribution establishing module 203, configured to determine the similarity probability distribution model of the first sample from the multiple similarities;

an identification module 204, configured to judge whether the sample to be tested and the first sample belong to the same biological individual according to the position of their similarity in the similarity probability distribution model of the first sample.

In an embodiment, the microbiome feature information includes: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.

In an embodiment, the similarity calculation module 202 calculating the similarity includes: when the feature information of the microbiome is metagenomic sequencing data of the microbiome, performing k-mer segmentation on the metagenomic sequencing data, where k is greater than 1, and calculating the similarity from the metagenomic sequencing data after k-mer segmentation.

In an embodiment, the similarity calculation module 202 can calculate the similarity with various algorithms, for example the MinHash algorithm. Other algorithms can of course be used; the present application places no limitation on this.

In an embodiment, as shown in Fig. 3, the identification module 204 includes a first probability value determination unit 301 and a first judging unit 302, where:

the first probability value determination unit 301 is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample;

the first judging unit 302 is configured to compare the first probability value with a preset threshold: when the first probability value is less than the first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, they do not belong to the same biological individual.

In an embodiment, as shown in Fig. 4, the identification module 204 includes a second probability value determination unit 401, a correction unit 402, and a second judging unit 403, where:

the similarity calculation module is further configured to calculate the pairwise similarities among the samples other than the first sample, and to calculate the similarity between the sample to be tested and each of those other samples;

the similarity distribution establishing module is further configured to establish the similarity probability distribution models of the samples other than the first sample, and to obtain the similarity between the sample to be tested and those other samples;

the second probability value determination unit is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample, and to determine the other probability values of the sample to be tested from its similarities to the other samples and the similarity probability distribution models of the other samples;

the correction unit is configured to apply false discovery rate correction to the first probability value and the other probability values to obtain a corrected first probability value;

the second judging unit is configured to compare the corrected first probability value with a preset threshold: when the corrected first probability value is less than the second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when it is greater than or equal to the second preset threshold, they do not belong to the same biological individual.

The present application is further illustrated below through a specific embodiment.

As shown in Fig. 5 and Fig. 6, the microbiome identification method provided in this embodiment includes:

Step 601: obtain the metagenomic sequencing data of n samples and perform k-mer segmentation to obtain the segmented metagenomic sequencing data.

Step 602: calculate the pairwise similarities between the n samples.

The similarity is calculated using the MinHash method. Specifically, after k-mer segmentation, each metagenomic sample yields a corresponding k-mer set. The hash function is an ordered arrangement of k-mers, so each metagenomic sample can be mapped by the function to a set of indices. These indices are the hash values. The m smallest hash values are selected under the hash function, and the similarity is then calculated as follows:

J(A, B) ≈ |h_(m)(A ∪ B) ∩ h_(m)(A) ∩ h_(m)(B)| / m

Step 603: for each sample, generate the similarity distribution model of that sample from its n-1 similarities to the other n-1 samples. A Beta distribution can be fitted to obtain the similarity distribution model, giving n similarity distribution models in total.

Step 604: calculate the similarity between the sample to be tested and each of the n samples to obtain n similarities. For any sample, obtain a probability value from the similarity between the sample to be tested and that sample and the similarity probability distribution model of that sample, and judge from the probability value whether the sample to be tested and that sample come from the same person. The judgment is whether the probability value indicates significant similarity, i.e., the value is compared with the first preset threshold: a value below the threshold indicates significant similarity, and a value greater than or equal to it indicates non-significant similarity. For example, in Fig. 5, P2 < α, so the sample to be tested and the target sample come from the same person; P1 > α, so the sample to be tested and the target sample come from different biological individuals. α is the first preset threshold and can be, for example, 0.01.
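Steps 602 and 603 can be sketched as follows, under the assumption that exact Jaccard similarity stands in for the MinHash estimate and that samples are held as k-mer sets keyed by a sample name (the data layout, the names, and the tiny example sets are all hypothetical). Each sample ends up with the n-1 similarities from which its distribution model would be fitted.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, used here in place of a MinHash estimate."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reference_similarities(samples: dict) -> dict:
    """For each sample name, list its similarities to the other n - 1 samples."""
    refs = {name: [] for name in samples}
    for name_a, name_b in combinations(samples, 2):
        s = jaccard(samples[name_a], samples[name_b])  # each pair computed once
        refs[name_a].append(s)
        refs[name_b].append(s)
    return refs


samples = {                      # hypothetical k-mer sets for n = 3 samples
    "s1": {"AAC", "ACG"},
    "s2": {"ACG", "CGT"},
    "s3": {"AAC", "ACG", "CGT"},
}
refs = reference_similarities(samples)   # refs["s1"] feeds the model for s1
```

Computing each pair once and recording it for both samples halves the number of similarity calculations compared with looping over every ordered pair.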

In another embodiment, after the n probability values p1, p2, ..., pn are obtained in step 604, false discovery rate correction can be applied to them to obtain q1, q2, ..., qn, which are then used to judge whether the sample to be tested and each target sample come from the same person. Specifically, q1, q2, ..., qn can be compared with a threshold q: when qi (i = 1, ..., n) is less than q, the sample to be tested and the target sample corresponding to qi come from the same person; when qi is greater than or equal to q, they come from different biological individuals. The threshold q is the false discovery rate and can be set as needed, for example to 0.01.

Fig. 7 is a schematic diagram, provided by an embodiment of the present invention, of the results when different feature information is used for similarity calculation. Fig. 7 shows test results on 612 samples. The left panel of Fig. 7 is the receiver operating characteristic (ROC) curve, comparing metagenomic sequencing data processed with k-mer segmentation (labelled Gemini in the figure) against species (Species) labels and gene (KEGG) labels. The right panel of Fig. 7 is the precision-recall curve (PRC) for the same three feature types. The solid line is the Gemini result, the rounded-corner dashed line is the species result, and the right-angle dashed line is the gene result. The Gemini method identifies individuals well, as shown by the auROC and auPRC values, where higher values indicate more accurate prediction. Moreover, k-mers perform better as features than species or genes: the auROC and auPRC are both higher than those obtained with species or genes as features.

One embodiment of the invention provides a kind of computer readable storage medium, and the computer-readable recording medium storage has One or more program, one or more of programs can be performed by one or more processor, to realize above-mentioned Microorganism group recognition methods described in one embodiment.

The computer readable storage medium includes：It is USB flash disk, read-only memory (ROM, Read-Only Memory), random Access memory (RAM, Random Access Memory), mobile hard disk, magnetic disc or CD etc. are various can to store program The medium of code.

Although the embodiments disclosed herein are as described above, their content is provided only to facilitate understanding of the present invention and is not intended to limit the present invention. Any person skilled in the art to which the present invention pertains may make modifications and variations in the form and details of implementation without departing from the spirit and scope disclosed by the present invention, but the scope of patent protection of the present invention shall still be as defined by the appended claims.

Claims

1. A microbiome identification method, comprising:

obtaining microbiome characteristic information of a plurality of biological individuals to generate a plurality of samples, calculating the similarity between a first sample among the plurality of samples and each of the other samples to obtain a plurality of similarities, and establishing a similarity probability distribution model of the first sample according to the plurality of similarities;

obtaining a sample to be tested, calculating the similarity between the sample to be tested and the first sample, determining a first probability value of the sample to be tested according to the similarity between the sample to be tested and the first sample and the similarity probability distribution model of the first sample, and judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual.

2. The microbiome identification method according to claim 1, characterized in that the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.

3. The microbiome identification method according to claim 2, characterized in that, when the similarity is calculated, if the characteristic information of the microbiome is metagenomic sequencing data of the microbiome, k-mer segmentation is performed on the metagenomic sequencing data and the similarity is calculated based on the metagenomic sequencing data after k-mer segmentation, where k is greater than 1.

4. The microbiome identification method according to claim 1, characterized in that the similarity is obtained based on the MinHash algorithm.

5. The microbiome identification method according to any one of claims 1 to 4, characterized in that judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual comprises:

when the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.

6. The microbiome identification method according to any one of claims 1 to 4, characterized in that the method further comprises: establishing similarity probability distribution models of the other samples among the plurality of samples other than the first sample, obtaining the similarities between the sample to be tested and the other samples, and determining other probability values of the sample to be tested according to the similarities between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;

judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual comprises: performing false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.

7. A microbiome identification device, characterized by comprising:

a similarity calculation module, configured to calculate the similarity between a first sample among a plurality of samples and each of the other samples to obtain a plurality of similarities, and to calculate the similarity between a sample to be tested and the first sample;

a similarity distribution establishing module, configured to establish a similarity probability distribution model of the first sample according to the plurality of similarities;

an identification module, configured to judge whether the sample to be tested and the first sample belong to the same biological individual according to the position of the similarity between the sample to be tested and the first sample in the similarity probability distribution model of the first sample.

8. The microbiome identification device according to claim 7, characterized in that the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.

9. The microbiome identification device according to claim 8, characterized in that the similarity calculation module calculating the similarity comprises: when the characteristic information of the microbiome is metagenomic sequencing data of the microbiome, performing k-mer segmentation on the metagenomic sequencing data, where k is greater than 1, and calculating the similarity based on the metagenomic sequencing data after k-mer segmentation.

10. The microbiome identification device according to claim 7, characterized in that the similarity calculation module calculates the similarity based on the MinHash algorithm.

11. The microbiome identification device according to any one of claims 7 to 10, characterized in that the identification module comprises a first probability value determination unit and a first judging unit, wherein:

the first probability value determination unit is configured to determine, according to the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample;

the first judging unit is configured to compare the first probability value with a preset threshold: when the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.

12. The microbiome identification device according to any one of claims 7 to 10, characterized in that:

the similarity calculation module is further configured to calculate the pairwise similarities between the other samples among the plurality of samples other than the first sample, and to calculate the similarities between the sample to be tested and the other samples;

the similarity distribution establishing module is further configured to establish similarity probability distribution models of the other samples among the plurality of samples other than the first sample, and to obtain the similarities between the sample to be tested and the other samples;

the identification module comprises a second probability value determination unit, a correction unit, and a second judging unit, wherein:

the second probability value determination unit is configured to determine, according to the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample, and to determine other probability values of the sample to be tested according to the similarities between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;

the correction unit is configured to perform false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value;

the second judging unit is configured to compare the corrected first probability value with a preset threshold: when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.

13. Microbiome identification equipment, characterized by comprising a memory and a processor, the memory storing a program which, when read and executed by the processor, implements the microbiome identification method according to any one of claims 1 to 6.