CN108268753A - Microbiome identification method, apparatus, and device - Google Patents

Info

Publication number
CN108268753A
Authority
CN
China
Prior art keywords
sample
similarity
tested
microorganism group
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810073198.4A
Other languages
Chinese (zh)
Other versions
CN108268753B (en)
Inventor
王子承
江瑞
陈挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810073198.4A
Publication of CN108268753A
Application granted
Publication of CN108268753B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A microbiome identification method, apparatus, and device. The microbiome identification method includes: obtaining microbiome characteristic information of multiple biological individuals to generate multiple samples; calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities; and establishing a similarity probability distribution model of the first sample from the multiple similarities. A test sample is then obtained, and the similarity between the test sample and the first sample is calculated. A first probability value of the test sample is determined from that similarity and the similarity probability distribution model of the first sample, and whether the test sample and the first sample belong to the same biological individual is judged according to the first probability value. The scheme provided in this embodiment can identify a microbiome effectively.

Description

Microbiome identification method, apparatus, and device
Technical field
The present invention relates to biometric identification technology, and in particular to a microbiome identification method, apparatus, and device.
Background
Microorganisms are found everywhere in the natural environment, and the human body is no exception: from the gut inside the body to the skin outside, the number of bacteria, the main members of the microbial world, is roughly comparable to the number of human cells. Microorganisms do not occur in isolation but usually exist as microbial communities. Because current laboratory conditions still make it difficult to isolate and culture all microorganisms, DNA sequencing is used to obtain the basic composition of the community indirectly, i.e. the microbiome. The microbiome is the sum of all genetic material of a microbial community; since high-throughput sequencing yields mixed genome fragment data, the sequencing data of a microbiome is referred to as a metagenome.
An individual's microbiome is highly specific, which has been confirmed in many metagenomic sequencing data sets. Some methods uniquely characterize a person's microbiome by extracting features from the sequences; within a certain period these features can serve as a specific molecular label of that person, and this has been applied in experiments with small sample sizes. However, because an individual's microbiome changes constantly, metagenomic sequencing data are not as stable as a genome and cannot remain effective as a molecular label over time.
Summary of the invention
At least one embodiment of the invention provides a microbiome identification method, apparatus, and device that can identify a microbiome effectively.
To achieve the object of the invention, at least one embodiment of the invention provides a microbiome identification method, including:
obtaining microbiome characteristic information of multiple biological individuals to generate multiple samples, calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establishing a similarity probability distribution model of the first sample from the multiple similarities;
obtaining a test sample, calculating the similarity between the test sample and the first sample, determining a first probability value of the test sample from that similarity and the similarity probability distribution model of the first sample, and judging according to the first probability value whether the test sample and the first sample belong to the same biological individual.
At least one embodiment of the invention provides a microbiome identification apparatus, including:
an information acquisition module, configured to obtain microbiome characteristic information of multiple biological individuals to generate multiple samples, and to obtain a test sample;
a similarity calculation module, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the test sample and the first sample;
a similarity distribution establishing module, configured to establish the similarity probability distribution model of the first sample from the multiple similarities;
an identification module, configured to judge whether the test sample and the first sample belong to the same biological individual according to the position of the similarity between the test sample and the first sample in the similarity probability distribution model of the first sample.
An embodiment of the invention provides a microbiome identification device, including a memory and a processor. The memory stores a program which, when read and executed by the processor, implements the microbiome identification method of any of the above embodiments.
Compared with the related art, in an embodiment of the invention a similarity probability distribution model of a sample is established, a probability value is obtained from the position of the similarity between the test sample and that sample in the model, and it is then judged whether the test sample and that sample belong to the same biological individual. The scheme of the application can thus identify a microbiome.
Other features and advantages of the invention will be set forth in the description that follows, and in part will become apparent from the description or be understood by practicing the invention. The objectives and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.
Brief description of the drawings
The drawings provide further understanding of the technical solution of the invention and constitute a part of the specification. Together with the embodiments of the application they serve to explain the technical solution of the invention and do not limit it.
Fig. 1 is a flowchart of the microbiome identification method provided by an embodiment of the invention;
Fig. 2 is a block diagram of the microbiome identification apparatus provided by an embodiment of the invention;
Fig. 3 is a block diagram of the identification module provided by an embodiment of the invention;
Fig. 4 is a block diagram of the identification module provided by another embodiment of the invention;
Fig. 5 is a schematic diagram of the microbiome identification method provided by an embodiment of the invention;
Fig. 6 is a flowchart of the microbiome identification method provided by an embodiment of the invention;
Fig. 7 compares the success rate of the microbiome identification method provided by an embodiment of the invention with that of other methods.
Detailed description
To make the objectives, technical solutions, and advantages of the invention clearer, embodiments of the invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of the application and the features in the embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Unless otherwise defined, technical or scientific terms used in this disclosure have the ordinary meaning understood by a person of ordinary skill in the field of the invention. "First", "second", and similar words used in this disclosure do not indicate any order, quantity, or importance; they only distinguish different components. "Comprise", "include", and similar words mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects.
This application identifies individuals by building a similarity distribution model of microbiome characteristic information between individuals. Rather than searching for a fixed label, it relies on the observation that the similarity between microbiome samples from the same individual is significantly greater than the similarity to samples from other people. If the similarity of two metagenomic samples is significantly high, they are considered to come from the same biological individual. For a distribution, a value that falls in a region that rarely occurs can be regarded as significant to some degree; the application applies the same idea and identifies individual samples using the similarity distribution generated between individuals. Regarding this inter-individual similarity distribution: if microbiome specificity is described as a classification problem in which the different samples of one biological individual form one class, then all remaining samples form the other class. Therefore, given the distribution generated by a sample of a biological individual at some time point and the samples of other biological individuals, a sample of the same individual taken at another time point should not belong to this distribution, that is, its p-value (probability value) will be sufficiently small.
As shown in Fig. 1, an embodiment of the invention provides a microbiome identification method, including:
Step 101: obtain microbiome characteristic information of multiple biological individuals to generate multiple samples, calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establish the similarity probability distribution model of the first sample from the multiple similarities.
Here the first sample is any sample among the multiple samples. Each sample corresponds to the microbiome characteristic information of one biological individual.
The multiple biological individuals are known; for example, when the biological individuals are people, the person corresponding to each sample is known. The multiple samples may include microbiome characteristic information collected from one biological individual at different times. The more samples there are, the more accurate the similarity probability distribution model is, so as many samples as possible should be obtained. The multiple biological individuals include different biological individuals. A biological individual may be a human, an animal, or another organism. The microbiome may be, for example, gut microorganisms (which can be extracted from feces) or oral microorganisms.
It should be noted that in other embodiments, the similarity probability distribution models of the other samples besides the first sample may also be established.
Step 102: obtain a test sample and calculate the similarity between the test sample and the first sample.
Here the test sample is the microbiome characteristic information of an unknown biological individual.
Step 103: determine the first probability value of the test sample from the similarity between the test sample and the first sample and the similarity probability distribution model of the first sample, and judge according to the first probability value whether the test sample and the first sample belong to the same biological individual.
The microbiome identification method provided in this embodiment establishes the similarity probability distribution model of the first sample and uses the probability value of the test sample's similarity within that model to judge whether the test sample and the first sample come from the same biological individual.
In one embodiment, in step 101, the microbiome characteristic information includes: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
In one embodiment, when the characteristic information of the microbiome is metagenomic sequencing data, k-mer segmentation is performed on the metagenomic sequencing data, where k is greater than 1. In one implementation example, k is greater than 15, for example k = 18. It should be noted that k-mer segmentation may be omitted and the similarity calculated directly from the metagenomic sequencing data. Performing k-mer segmentation before calculating the similarity can greatly reduce the amount of computation. In other embodiments, the metagenomic sequencing data may also be annotated with species labels or gene function labels, and the similarity calculated from the species information or gene information.
A k-mer is obtained by cutting a sequence into pieces of fixed length k. Sliding along a sequence one base at a time yields one k-mer per position, so a sequence of length n yields n-k+1 k-mers. Extracting k-mers does not involve any reference genome, so all sequences can be used. The k-mer length must be adjusted to the task. When k equals 1, this amounts to counting the base distribution. For k below 10, the k-mers can be regarded as short sequences; counting their frequencies can be used to compare samples, and k-mers of this size are also commonly used as features in classification problems. k between 10 and 15 can be regarded as a medium-sized choice; it can be used as the basic k-mer length in assembly and gives a certain power to discriminate bacteria. Since the number of k-mer types is around one billion at most (slightly more than one billion when k equals 15), dimensionality reduction sometimes need not be considered. k above 15 can be regarded as a long k-mer, which as a feature can distinguish many bacteria; particularly for lengths above 30, many k-mers can serve as molecular labels that uniquely identify a strain.
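The sliding-window extraction described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names `kmers` and `kmer_set` are hypothetical, and reads are assumed to be plain strings of bases.

```python
from __future__ import annotations


def kmers(sequence: str, k: int) -> list[str]:
    """Slide a window of length k over the sequence, one base at a time.

    A sequence of length n yields n - k + 1 k-mers (none if n < k).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_set(reads: list[str], k: int) -> set[str]:
    """Collect the distinct k-mer types over all reads of one sample."""
    result: set[str] = set()
    for read in reads:
        result.update(kmers(read, k))
    return result
```

For example, `kmers("ACGTACGT", 3)` returns six k-mers, matching n-k+1 = 8-3+1.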
In theory the similarity distribution should resemble a normal distribution, but its support is the interval (0, 1). The Beta distribution, a variant of the Gamma distribution on the interval (0, 1), is therefore selected as the model of the similarity distribution between individuals.
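As a rough sketch of how a Beta model could be fitted and queried: the text does not say how the Beta parameters are estimated or which tail is used, so the method-of-moments fit and the upper-tail probability P(X ≥ s) below are assumptions, and the function names are hypothetical. The tail is integrated numerically with the midpoint rule to stay within the standard library; a statistics library routine would be preferable for very small tail probabilities.

```python
from __future__ import annotations

import math
import statistics


def fit_beta_moments(similarities: list[float]) -> tuple[float, float]:
    """Estimate Beta(alpha, beta) from similarities in (0, 1) by matching moments."""
    mean = statistics.mean(similarities)
    var = statistics.variance(similarities)
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("moment estimates do not define a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def beta_upper_tail(s: float, alpha: float, beta: float, steps: int = 20000) -> float:
    """Approximate P(X >= s) for X ~ Beta(alpha, beta) with the midpoint rule."""
    if s <= 0.0:
        return 1.0
    if s >= 1.0:
        return 0.0
    # Log of the normalizing constant 1 / B(alpha, beta).
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    width = (1.0 - s) / steps
    total = 0.0
    for i in range(steps):
        x = s + (i + 0.5) * width  # midpoint, strictly inside (s, 1)
        total += math.exp(
            log_norm + (alpha - 1.0) * math.log(x) + (beta - 1.0) * math.log(1.0 - x)
        )
    return min(1.0, total * width)
```

Under these assumptions, the probability value for a test similarity `s` would be `beta_upper_tail(s, *fit_beta_moments(similarities_of_first_sample))`.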
When calculating similarity, the microbiome characteristic information used may be the species information or gene information obtained from the metagenomic sequencing data, or the k-mers directly. The similarity may be measured by spatial distance, Jaccard distance, Bray-Curtis distance, and so on. When k-mers are used as features for calculating similarity, the number of k-mers should be kept as uniform as possible across samples, with suitable trade-offs made according to the required sample size.
In one embodiment, the similarity is obtained with the MinHash algorithm (min-wise hashing, a locality-sensitive hashing (LSH) optimization for Jaccard distance).
The core idea of LSH is to map spatial relationships in a high-dimensional space into a low-dimensional space while recovering the original correspondence as far as possible; it is a simplifying method rather than a strengthening one. That is, if samples have a certain correspondence in the original space, an LSH algorithm can restore that relationship relatively quickly. Many similarity or distance measures have corresponding LSH algorithms: Euclidean distance, cosine distance, and Jaccard similarity all do, and the LSH algorithm for Jaccard similarity is the min-wise hashing algorithm (MinHash for short).
A set of k-mer types can be obtained for each sample. Suppose sets A and B come from two sequencing runs of gut microbial samples, with k-mers computed for each. The Jaccard similarity of the two samples is then:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
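For small k-mer sets the Jaccard similarity can be computed exactly, which is useful as a reference when checking an approximation. A minimal sketch follows; the function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets have similarity 0
    return len(a & b) / len(union)
```

With A = {AC, CG, GT} and B = {CG, GT, TT}, the intersection has 2 k-mers and the union has 4, so the similarity is 0.5.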
Now assume there is a hash function $h(\cdot)$ that is a random permutation of all k-mers in sets A and B, and define $h_{\min}(S)$ as the first k-mer of set S to appear under that hash function. ($h(\cdot)$ is an ordered arrangement; the arrangement is scanned in order, checking whether each k-mer is in S, and the index in the arrangement of the first k-mer found is $h_{\min}(S)$.) Then:
$$\Pr\left(h_{\min}(A) = h_{\min}(B)\right) = J(A, B)$$
To estimate the probability that $h_{\min}(A) = h_{\min}(B)$, multiple functions $h(\cdot)$ naturally need to be generated at random. Suppose n random experiments are carried out and $h_{\min}(A) = h_{\min}(B)$ holds in m of them; then J(A, B) can be approximated by m/n. This is the original definition of MinHash: the Jaccard similarity is approximated using multiple functions $h(\cdot)$, and the expected error of this estimate has also been proven to be $O(1/\sqrt{n})$.
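The multi-permutation estimate m/n can be sketched directly by shuffling the union of the two sets. This is an illustration only; it builds explicit permutations, which is exactly the costly step at realistic k-mer counts. The function name, the default number of trials, and the fixed seed are assumptions.

```python
from __future__ import annotations

import random


def minhash_estimate(a: set[str], b: set[str], n_trials: int = 200, seed: int = 0) -> float:
    """Estimate J(A, B) as m / n over n random permutations of A ∪ B."""
    if not a or not b:
        return 0.0
    rng = random.Random(seed)
    universe = sorted(a | b)
    matches = 0
    for _ in range(n_trials):
        rng.shuffle(universe)  # one random permutation h(.)
        rank = {kmer: i for i, kmer in enumerate(universe)}
        # h_min(S): the element of S that appears first in the permutation.
        h_min_a = min(a, key=rank.__getitem__)
        h_min_b = min(b, key=rank.__getitem__)
        if h_min_a == h_min_b:
            matches += 1
    return matches / n_trials
```

Identical sets always give 1.0 and disjoint sets always give 0.0; for other pairs the estimate fluctuates around the exact Jaccard similarity.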
Computing MinHash is not complicated, but the rate-limiting step is often generating n hash functions (random permutations), which is time-consuming particularly when the two sets contain many k-mers. A variant that uses a single hash function was therefore devised. Define $h_{(n)}(S)$ as the first n k-mers of set S to appear in the arrangement of $h(\cdot)$. Then J(A, B) can be approximated as:
$$J(A, B) \approx \frac{\left| h_{(n)}(A \cup B) \cap h_{(n)}(A) \cap h_{(n)}(B) \right|}{n}$$
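The single-hash variant can be sketched with a bottom-n sketch per sample. The hash function here (the first 8 bytes of SHA-1) is an assumed stand-in for $h(\cdot)$, and the function names are hypothetical. The code uses the fact that the n smallest hashes of A ∪ B can be taken from the union of the two sketches, and divides by the size of the union sketch, which equals n whenever A ∪ B has at least n distinct k-mers.

```python
from __future__ import annotations

import hashlib


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer; the hash order plays the role of h(.)."""
    return int.from_bytes(hashlib.sha1(kmer.encode("ascii")).digest()[:8], "big")


def bottom_sketch(kmer_types: set[str], n: int) -> set[int]:
    """h_(n)(S): the n smallest hash values among the k-mers of S."""
    return set(sorted(kmer_hash(x) for x in kmer_types)[:n])


def sketch_jaccard(sketch_a: set[int], sketch_b: set[int], n: int) -> float:
    """Approximate J(A, B) from two bottom-n sketches."""
    union_sketch = set(sorted(sketch_a | sketch_b)[:n])  # equals h_(n)(A ∪ B)
    if not union_sketch:
        return 0.0
    shared = union_sketch & sketch_a & sketch_b
    return len(shared) / len(union_sketch)
```

Each sample is hashed once and only its sketch is stored, so comparing a test sample with many reference samples does not require revisiting the full k-mer sets.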
It should be noted that the above gives only one method of calculating similarity; the application is not limited to it, and other similarity calculation methods are also applicable.
In one embodiment, in step 103, judging according to the first probability value whether the test sample and the first sample belong to the same biological individual includes:
when the first probability value is less than a first preset threshold, the test sample and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the test sample and the first sample do not belong to the same biological individual.
The first preset threshold is set according to what is statistically regarded as significantly similar; for example, it may be set to 0.01, or set as needed.
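The decision rule above is a strict comparison against the threshold. A minimal sketch, with a hypothetical function name and the example threshold 0.01 as the default:

```python
def same_individual(probability_value: float, threshold: float = 0.01) -> bool:
    """Return True when the probability value is strictly below the threshold."""
    return probability_value < threshold
```

A probability value exactly equal to the threshold is treated as "not the same individual", following the "greater than or equal to" branch of the rule.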
When a test sample is tested against multiple distributions, a false discovery rate (FDR) correction may also be applied to avoid false positives. The Benjamini-Yekutieli (BY) method may be used for the false discovery rate correction; other methods may of course be used as well.
In one embodiment, the method further includes:
establishing the similarity probability distribution models of the other samples among the multiple samples besides the first sample, obtaining the similarities between the test sample and the other samples, and determining the other probability values of the test sample from those similarities and the similarity probability distribution models of the other samples. For example, when there are n samples, a similarity probability distribution model can be established for each sample, giving n models in total; the similarities between the test sample and the n samples are calculated, giving n similarities; and the n probability values of the test sample are then obtained from the n similarity probability distribution models.
Judging according to the first probability value whether the test sample and the first sample belong to the same biological individual includes: applying a false discovery rate correction to the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the test sample and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the test sample and the first sample do not belong to the same biological individual.
The false discovery rate correction is applied to the n probability values to obtain n corrected probability values, and whether the test sample belongs to the same biological individual as each of the n samples is then judged from the corresponding corrected probability value. The second preset threshold represents the false discovery rate and may be set to 0.01; it may of course be set to other values, and in general the smaller the value, the smaller the false discovery rate.
Compared with using the similarity directly as the basis for judgment, the scheme provided in this embodiment can reduce the probability of error through the false discovery rate correction. For example, a false discovery rate threshold of 0.01 means the probability of a wrong judgment is 1%; so if the test sample is significant in only one of all the sample distributions, the result can basically be considered reliable, because the probability of error is 0.01.
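The Benjamini-Yekutieli correction mentioned above can be sketched as the standard step-up adjustment: each sorted probability value p_(i) is scaled by m·c(m)/i with c(m) = 1 + 1/2 + ... + 1/m, and monotonicity is then enforced from the largest value downwards. The function name is hypothetical, and the code is a generic sketch of the BY procedure rather than the patent's implementation.

```python
from __future__ import annotations


def by_adjust(p_values: list[float]) -> list[float]:
    """Benjamini-Yekutieli adjusted probability values, in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted
```

With probability values [0.0001, 0.5, 0.9] for three reference samples, only the first corrected value (about 0.00055) falls below a threshold of 0.01, so the test sample would be matched to the first reference sample only.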
An embodiment of the invention provides a microbiome identification apparatus which, as shown in Fig. 2, includes:
an information acquisition module 201, configured to obtain microbiome characteristic information of multiple biological individuals to generate multiple samples, and to obtain a test sample;
a similarity calculation module 202, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the test sample and the first sample;
a similarity distribution establishing module 203, configured to establish the similarity probability distribution model of the first sample from the multiple similarities;
an identification module 204, configured to judge whether the test sample and the first sample belong to the same biological individual according to the position of the similarity between the test sample and the first sample in the similarity probability distribution model of the first sample.
In one embodiment, the microbiome characteristic information includes: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
In one embodiment, the similarity calculation module 202 calculating similarity includes: when the characteristic information of the microbiome is metagenomic sequencing data, performing k-mer segmentation on the metagenomic sequencing data, where k is greater than 1, and calculating the similarity from the metagenomic sequencing data after k-mer segmentation.
In one embodiment, the similarity calculation module 202 may calculate similarity with any of several algorithms, for example the MinHash algorithm. Other algorithms may of course be used; the application does not limit this.
In one embodiment, as shown in Fig. 3, the identification module 204 includes a first probability value determination unit 301 and a first judgment unit 302, where:
the first probability value determination unit 301 is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the test sample and the first sample;
the first judgment unit 302 is configured to compare the first probability value with the first preset threshold; when the first probability value is less than the first preset threshold, the test sample and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the test sample and the first sample do not belong to the same biological individual.
In one embodiment, as shown in Fig. 4, the identification module 204 includes a second probability value determination unit 401, a correction unit 402, and a second judgment unit 403, where:
the similarity calculation module is further configured to calculate the pairwise similarities among the other samples besides the first sample, and to calculate the similarities between the test sample and the other samples;
the similarity distribution establishing module is further configured to establish the similarity probability distribution models of the other samples among the multiple samples besides the first sample, and to obtain the similarities between the test sample and the other samples;
the second probability value determination unit is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the test sample and the first sample, and to determine the other probability values of the test sample from the similarities between the test sample and the other samples and the similarity probability distribution models of the other samples;
the correction unit is configured to apply a false discovery rate correction to the first probability value and the other probability values to obtain the corrected first probability value;
the second judgment unit is configured to compare the corrected first probability value with the second preset threshold; when the corrected first probability value is less than the second preset threshold, the test sample and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the test sample and the first sample do not belong to the same biological individual.
The application is further illustrated below by a specific embodiment.
As shown in Fig. 5 and Fig. 6, the microbiome identification method provided in this embodiment includes:
Step 601: obtain the metagenomic sequencing data of n samples and perform k-mer segmentation to obtain the segmented metagenomic sequencing data.
Step 602: calculate the pairwise similarities among the n samples.
The similarities are calculated with the MinHash method. Specifically, after k-mer segmentation each metagenomic sample yields a corresponding k-mer set. The hash function is an ordered arrangement of k-mers, so each metagenomic sample can be mapped by this function to a set of indices. This set of indices is the set of hash values; the m smallest hash values are selected according to the hash function, and the similarity is then calculated as follows:
$$J(A, B) \approx \frac{\left| h_{(m)}(A \cup B) \cap h_{(m)}(A) \cap h_{(m)}(B) \right|}{m}$$
Step 603: for each sample, generate the similarity distribution model of that sample from its n-1 similarities to the other n-1 samples. A Beta distribution can be fitted to obtain the similarity distribution model, which yields n similarity distribution models in total.
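Steps 602 and 603 both rely on grouping the pairwise similarities by sample, so that each sample ends up with its own list of n-1 values to which a distribution can be fitted. A minimal sketch of that bookkeeping follows; it uses the exact Jaccard similarity in place of MinHash for brevity, and the function names and the dictionary layout (sample name to k-mer set) are assumptions.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, used here as a stand-in for the MinHash estimate."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarities_to_others(samples: dict[str, set[str]]) -> dict[str, list[float]]:
    """For each sample, list its similarities to the other n - 1 samples."""
    names = list(samples)
    result: dict[str, list[float]] = {name: [] for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            # Each pair is computed once and recorded for both samples.
            s = jaccard(samples[first], samples[second])
            result[first].append(s)
            result[second].append(s)
    return result
```

Each list returned here would then be the input to the Beta fit for the corresponding sample.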
Step 604, the similarity of sample to be tested and sample each in the n sample is calculated, n similarity is obtained, it is right Any sample according in the similarity probability Distribution Model of the similarity and the sample of sample to be tested and the sample, obtains one Whether probability value judges sample to be tested with the sample from same person according to the probability value.Judgment method is according to probability value Judge whether significantly it is similar, i.e., compared with preset first threshold value, when less than the first predetermined threshold value, represent significantly it is similar, when big When equal to the first predetermined threshold value, represent non-significant similar.For example, in Fig. 5, P2<α, sample to be tested is with target sample from same One people, P1>α, sample to be tested is with target sample from different biology individual, and α is preset first threshold value, for example, 0.01 can be taken.
In another embodiment, after the n probability values p1, p2, ..., pn are obtained in step 604, false discovery rate correction can be applied to p1, p2, ..., pn to obtain q1, q2, ..., qn, and whether the sample to be tested and each target sample come from the same person is judged according to q1, q2, ..., qn respectively. Specifically, q1, q2, ..., qn can be compared with a threshold q: when qi (i = 1, ..., n) is less than q, the sample to be tested and the target sample corresponding to qi come from the same person; when qi is greater than or equal to q, the sample to be tested and the target sample corresponding to qi come from different biological individuals. The threshold q is the false discovery rate and can be set as needed, for example to 0.01.
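The text does not name a specific false discovery rate procedure. As an illustration only, the sketch below assumes the widely used Benjamini-Hochberg adjustment; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values q_i for p_1..p_n."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum of
    # p * n / rank over all ranks at or above its own.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values
```

Each `q_values[i]` would then be compared with the threshold q (for example 0.01) in the same way as described above.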
Fig. 7 is a schematic diagram, provided by an embodiment of the present invention, of similarity calculation using different feature information. Fig. 7 shows test results on 612 samples. The left panel of Fig. 7 shows receiver operating characteristic (ROC) curves comparing the results obtained from metagenomic sequencing data with k-mer segmentation (labeled Gemini in the figure), with species (Species) annotation, and with gene (KEGG) annotation. The right panel of Fig. 7 shows precision-recall curves (PRC) for the same three feature types. The solid line represents the Gemini result, the rounded dashed line the species result, and the right-angle dashed line the gene result. As the auROC and auPRC values show (higher values indicate more accurate prediction), the Gemini method identifies individuals well. In addition, k-mers perform better as features than species or genes: the auROC and auPRC obtained with k-mers are higher than those obtained with species or gene features.
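For readers unfamiliar with auROC, the sketch below shows one standard way to compute it from scores and binary labels, using the pairwise-ranking (Mann-Whitney) formulation. It is not taken from the patent; the function name `auroc` and the idea of scoring sample pairs by their similarity are assumptions made for illustration.

```python
def auroc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive has the higher score, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Here a label of 1 would mark a pair of samples from the same individual and 0 a pair from different individuals. A value of 1.0 means the scores separate the two groups perfectly, and 0.5 corresponds to random ranking.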
An embodiment of the present invention provides a microbiome identification equipment comprising a memory and a processor, wherein the memory stores a program which, when read and executed by the processor, implements the microbiome identification method described in any of the above embodiments.
An embodiment of the present invention provides a computer-readable storage medium storing one or more programs, wherein the one or more programs can be executed by one or more processors to implement the microbiome identification method described in any of the above embodiments.
The computer-readable storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Although the embodiments disclosed herein are as described above, they are adopted only to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the art to which the present invention pertains may make modifications and changes in the form and details of implementation without departing from the spirit and scope disclosed herein, but the scope of patent protection of the present invention shall still be subject to the scope defined by the appended claims.

Claims (13)

1. A microbiome identification method, comprising:
Obtaining microbiome characteristic information of a plurality of biological individuals to generate a plurality of samples, calculating the similarity between a first sample of the plurality of samples and each of the other samples to obtain a plurality of similarities, and establishing a similarity probability distribution model of the first sample according to the plurality of similarities;
Obtaining a sample to be tested, calculating the similarity between the sample to be tested and the first sample, determining a first probability value of the sample to be tested according to the similarity between the sample to be tested and the first sample and the similarity probability distribution model of the first sample, and judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual.
2. The microbiome identification method according to claim 1, characterized in that the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
3. The microbiome identification method according to claim 2, characterized in that, when the similarity is calculated, if the characteristic information of the microbiome is the metagenomic sequencing data of the microbiome, k-mer segmentation is performed on the metagenomic sequencing data, and the similarity is calculated based on the metagenomic sequencing data after k-mer segmentation, where k is greater than 1.
4. The microbiome identification method according to claim 1, characterized in that the similarity is obtained based on the MinHash algorithm.
5. The microbiome identification method according to any one of claims 1 to 4, characterized in that judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual comprises:
When the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
6. The microbiome identification method according to any one of claims 1 to 4, characterized in that the method further comprises: establishing similarity probability distribution models of the other samples in the plurality of samples other than the first sample, obtaining the similarity between the sample to be tested and the other samples, and determining other probability values of the sample to be tested according to the similarity between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;
Judging, according to the first probability value, whether the sample to be tested and the first sample belong to the same biological individual comprises: performing false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
7. A microbiome identification device, characterized by comprising:
An information acquisition module, configured to obtain microbiome characteristic information of a plurality of biological individuals to generate a plurality of samples, and to obtain a sample to be tested;
A similarity calculation module, configured to calculate the similarity between a first sample of the plurality of samples and each of the other samples to obtain a plurality of similarities, and to calculate the similarity between the sample to be tested and the first sample;
A similarity distribution establishing module, configured to establish a similarity probability distribution model of the first sample according to the plurality of similarities;
An identification module, configured to judge whether the sample to be tested and the first sample belong to the same biological individual according to the position of the similarity between the sample to be tested and the first sample in the similarity probability distribution model of the first sample.
8. The microbiome identification device according to claim 7, characterized in that the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
9. The microbiome identification device according to claim 8, characterized in that the similarity calculation module calculating the similarity comprises: when the characteristic information of the microbiome is the metagenomic sequencing data of the microbiome, performing k-mer segmentation on the metagenomic sequencing data, where k is greater than 1, and calculating the similarity based on the metagenomic sequencing data after k-mer segmentation.
10. The microbiome identification device according to claim 7, characterized in that the similarity calculation module calculates the similarity based on the MinHash algorithm.
11. The microbiome identification device according to any one of claims 7 to 10, characterized in that the identification module comprises a first probability value determination unit and a first judging unit, wherein:
The first probability value determination unit is configured to determine, according to the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample;
The first judging unit is configured to compare the first probability value with a preset threshold; when the first probability value is less than the first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
12. The microbiome identification device according to any one of claims 7 to 10, characterized in that:
The similarity calculation module is further configured to calculate the pairwise similarities of the other samples in the plurality of samples other than the first sample, and to calculate the similarity between the sample to be tested and the other samples;
The similarity distribution establishing module is further configured to establish similarity probability distribution models of the other samples in the plurality of samples other than the first sample, and to obtain the similarity between the sample to be tested and the other samples;
The identification module comprises a second probability value determination unit, a correction unit and a second judging unit, wherein:
The second probability value determination unit is configured to determine, according to the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the sample to be tested and the first sample, and to determine other probability values of the sample to be tested according to the similarity between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;
The correction unit is configured to perform false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value;
The second judging unit is configured to compare the corrected first probability value with a preset threshold; when the corrected first probability value is less than the second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
13. A microbiome identification equipment, characterized by comprising a memory and a processor, wherein the memory stores a program which, when read and executed by the processor, implements the microbiome identification method according to any one of claims 1 to 6.
CN201810073198.4A 2018-01-25 2018-01-25 Method, device and equipment for identifying microbiome Active CN108268753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810073198.4A CN108268753B (en) 2018-01-25 2018-01-25 Method, device and equipment for identifying microbiome

Publications (2)

Publication Number Publication Date
CN108268753A true CN108268753A (en) 2018-07-10
CN108268753B CN108268753B (en) 2021-12-03

Family

ID=62776724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810073198.4A Active CN108268753B (en) 2018-01-25 2018-01-25 Method, device and equipment for identifying microbiome

Country Status (1)

Country Link
CN (1) CN108268753B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105986013A (en) * 2015-02-02 2016-10-05 广州华大基因医学检验所有限公司 Method and device for determining microbial species
CN106202989A (en) * 2015-04-30 2016-12-07 中国科学院青岛生物能源与过程研究所 A kind of method obtaining child's individuality biological age based on oral microbial community
CN106202999A (en) * 2016-07-21 2016-12-07 厦门大学 Microorganism high-pass sequencing data based on different scale tuple word frequency analyzes agreement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG,YUQING 等: "Inference of Environmental Factor-Microbe and Microbe-Microbe Associations from Metagenomic Data Using a Hierarchical Bayesian Statistical Model", 《CELL SYSTEMS》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522378A (en) * 2018-10-10 2019-03-26 深圳韦格纳医学检验实验室 The display methods and display equipment of hereditary birthplace probability distribution
CN110245685A (en) * 2019-05-15 2019-09-17 清华大学 Genome unit point makes a variation pathogenic prediction technique, system and storage medium
CN110245685B (en) * 2019-05-15 2022-03-25 清华大学 Method, system and storage medium for predicting pathogenicity of genome single-site variation

Also Published As

Publication number Publication date
CN108268753B (en) 2021-12-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant