CN108268753A - Microbiome identification method, apparatus, and device - Google Patents
- Publication number: CN108268753A
- Application number: CN201810073198.4A
- Authority: CN (China)
- Prior art keywords
- sample
- similarity
- tested
- microorganism group
- probability
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Data Mining & Analysis (AREA)
- Public Health (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Bioethics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Abstract
A microbiome identification method, apparatus, and device. The microbiome identification method includes: obtaining microbiome feature information of multiple biological individuals to generate multiple samples; calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities; establishing a similarity probability distribution model of the first sample from the multiple similarities; obtaining a test sample and calculating the similarity between the test sample and the first sample; determining a first probability value of the test sample from that similarity and the similarity probability distribution model of the first sample; and judging, according to the first probability value, whether the test sample and the first sample belong to the same biological individual. The scheme provided in this embodiment can effectively identify a microbiome.
Description
Technical field
The present invention relates to biometric identification technology, and in particular to a microbiome identification method, apparatus, and device.
Background
Microorganisms are found everywhere in the natural environment, and the human body is no exception: from the gut inside the body to the skin outside, the number of bacteria, the main members of the microbial community, is even comparable to the number of human cells. Microorganisms do not occur alone but usually exist as microbial communities. Because current experimental conditions still make it difficult to isolate and culture all microorganisms, DNA sequencing is used to obtain the basic composition of the community indirectly, i.e., the microbiome. The microbiome is the sum of all genetic material of a microbial community. Because high-throughput sequencing yields mixed genome fragment data, the sequencing data of a microbiome are referred to as a metagenome.
The microbiome of an individual is highly specific, which has been confirmed in many metagenomic sequencing datasets. Some methods uniquely characterize a person's microbiome by extracting features from the sequences; within a certain period, these features can serve as a specific molecular label for that person, and they have been applied in small-sample experiments. However, because an individual's microbiome changes constantly, metagenomic sequencing data are not as stable as a genome and cannot remain effective as a molecular label over time.
Summary of the invention
At least one embodiment of the present invention provides a microbiome identification method, apparatus, and device that can effectively identify a microbiome.
To achieve the object of the invention, at least one embodiment of the present invention provides a microbiome identification method, including:
obtaining microbiome feature information of multiple biological individuals to generate multiple samples, calculating the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establishing a similarity probability distribution model of the first sample from the multiple similarities;
obtaining a test sample, calculating the similarity between the test sample and the first sample, determining a first probability value of the test sample from that similarity and the similarity probability distribution model of the first sample, and judging, according to the first probability value, whether the test sample and the first sample belong to the same biological individual.
At least one embodiment of the present invention provides a microbiome identification apparatus, including:
an information acquisition module, configured to obtain microbiome feature information of multiple biological individuals to generate multiple samples, and to obtain a test sample;
a similarity calculation module, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the test sample and the first sample;
a similarity distribution establishing module, configured to establish a similarity probability distribution model of the first sample from the multiple similarities;
an identification module, configured to judge whether the test sample and the first sample belong to the same biological individual according to the position of the similarity between the test sample and the first sample in the similarity probability distribution model of the first sample.
An embodiment of the present invention provides a microbiome identification device, including a memory and a processor. The memory stores a program which, when read and executed by the processor, implements the microbiome identification method described in any of the above embodiments.
Compared with the related art, an embodiment of the present invention establishes a similarity probability distribution model of a sample, obtains the probability value of the similarity between the test sample and that sample in the similarity probability distribution model, and thereby judges whether the test sample and that sample belong to the same biological individual. The scheme of the present application can thus identify a microbiome.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the drawings.
Brief description of the drawings
The drawings provide a further understanding of the technical solution of the present invention and form part of the description. Together with the embodiments of the present application, they serve to explain the technical solution of the invention and do not limit it.
Fig. 1 is a flowchart of a microbiome identification method provided by an embodiment of the present invention;
Fig. 2 is a block diagram of a microbiome identification apparatus provided by an embodiment of the present invention;
Fig. 3 is a block diagram of an identification module provided by an embodiment of the present invention;
Fig. 4 is a block diagram of an identification module provided by another embodiment of the present invention;
Fig. 5 is a schematic diagram of a microbiome identification method provided by an embodiment of the present invention;
Fig. 6 is a flowchart of a microbiome identification method provided by an embodiment of the present invention;
Fig. 7 is a chart comparing the success rate of the microbiome identification method provided by an embodiment of the present invention with that of other methods.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, embodiments of the invention are described in detail below with reference to the drawings. It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Unless otherwise defined, technical or scientific terms used in this disclosure have the ordinary meaning understood by a person of ordinary skill in the field of the invention. "First", "second", and similar words used in this disclosure do not indicate any order, quantity, or importance, and are used only to distinguish different components. "Including", "comprising", and similar words mean that the element or object preceding the word covers the elements or objects listed after the word and their equivalents, without excluding other elements or objects.
In the present application, individuals are identified by building a similarity distribution model of microbiome feature information between individuals. Rather than looking for a fixed label, the approach relies on the observation that the microbiome similarity of an individual to itself is significantly greater than its similarity to other individuals. If the similarity of two metagenomic samples is significantly high, they are considered to come from the same biological individual. For a distribution, a value that falls in a rarely occurring position can be regarded as significant to some degree; the present application applies a similar idea and identifies individual samples through the similarity distribution generated between individuals. Regarding the inter-individual similarity distribution: if microbial community specificity is framed as a classification problem in which the different samples of one biological individual form one class, then all remaining samples form the other class. Therefore, for the distribution generated by the sample of a biological individual at one time point and the samples of other biological individuals, a sample of the same individual taken at another time point should not belong to this distribution, that is, its p-value (probability value) will be sufficiently small.
As shown in Fig. 1, an embodiment of the present invention provides a microbiome identification method, including:
Step 101: obtain microbiome feature information of multiple biological individuals to generate multiple samples, calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and establish a similarity probability distribution model of the first sample from the multiple similarities;
Here, the first sample is any one of the multiple samples. Each sample corresponds to the microbiome feature information of one biological individual.
The multiple biological individuals are known; for example, when the biological individuals are people, the person corresponding to each sample is known. In addition, the multiple samples may include microbiome feature information collected from one biological individual at different times. The more samples there are, the more accurate the similarity probability distribution model, so as many samples as possible should be obtained. The multiple biological individuals include different biological individuals. A biological individual may be a person, an animal, or another organism. The microbiome may be, for example, the gut microbiome (which can be extracted from feces) or the oral microbiome.
It should be noted that, in other embodiments, the similarity probability distribution models of the other samples among the multiple samples besides the first sample may also be calculated.
Step 102: obtain a test sample and calculate the similarity between the test sample and the first sample;
Here, the test sample is the microbiome feature information of an unknown biological individual.
Step 103: determine a first probability value of the test sample from the similarity between the test sample and the first sample and the similarity probability distribution model of the first sample, and judge, according to the first probability value, whether the test sample and the first sample belong to the same biological individual.
The microbiome identification method provided in this embodiment establishes the similarity probability distribution model of the first sample and uses the probability value of the test sample's similarity in that model to judge whether the test sample and the first sample come from the same biological individual.
In one embodiment, in step 101, the microbiome feature information includes: metagenomic sequencing data of the microbiome, or microarray (chip) data of the microbiome, or staining information of the microbiome.
In one embodiment, when the microbiome feature information is metagenomic sequencing data of the microbiome, k-mer segmentation is performed on the metagenomic sequencing data, where k is greater than 1. In one implementation example, k is greater than 15; for example, k is 18. It should be noted that k-mer segmentation may be omitted and the similarity calculated directly from the metagenomic sequencing data. Performing k-mer segmentation before calculating the similarity greatly reduces the amount of computation. In other embodiments, the metagenomic sequencing data may instead be annotated by species or by gene functional group, and the similarity calculated from the species information or gene information.
A k-mer is a subsequence of fixed length k cut from a sequence. Sliding along a sequence one base at a time yields one k-mer per position, so a sequence of length n yields n-k+1 k-mers. Extracting k-mers does not involve any reference genome, so all sequences can be used. The k-mer length must be chosen according to the application. When k equals 1, counting k-mers amounts to counting the base distribution. For k below 10, k-mers can be regarded as short sequences whose occurrence frequencies can be compared between samples, and k-mers of this length are commonly used as features in classification problems. A value of k between 10 and 15 is a medium-sized choice that can serve as the basic k-mer length in assembly and has some discriminating power for bacteria; because the number of possible k-mer types does not exceed one billion (it exceeds one billion only at k = 15, since 4^15 ≈ 1.07 × 10^9), dimensionality reduction sometimes does not need to be considered. A value of k greater than 15 gives long k-mers, which as features can distinguish many bacteria; in particular, for sequences longer than 30, many k-mers can serve as molecular labels that uniquely identify a strain.
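The k-mer segmentation described above can be illustrated with a minimal Python sketch. The function names `kmers` and `kmer_set` and the toy read are assumptions made for illustration; the patent does not prescribe an implementation.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring obtained by sliding one base at a time."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # A sequence of length n yields n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_set(reads: list[str], k: int) -> set[str]:
    """Collect the distinct k-mer types found across all reads of one sample."""
    return {kmer for read in reads for kmer in kmers(read, k)}


read = "ACGTACGTAC"             # toy read of length n = 10 (illustrative only)
print(len(kmers(read, 4)))      # n - k + 1 = 7
print(sorted(kmer_set([read], 4)))
```

Representing each sample as a set of k-mer types, as in `kmer_set`, is what makes set-based similarity measures such as the Jaccard similarity applicable.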
In theory the similarity distribution should resemble a normal distribution, but its range is (0, 1). Therefore the Beta distribution, a variant of the Gamma distribution on the interval (0, 1), is selected as the model of the similarity distribution between individuals.
When calculating the similarity, the microbiome feature information used may be species information or gene information derived from the metagenomic sequencing data, or the k-mers may be used directly. The similarity may be measured by spatial distance, Jaccard distance, Bray-Curtis distance, and so on. When k-mers are used as features for calculating the similarity, the number of k-mers should be kept as uniform as possible across samples, with suitable trade-offs made according to the sample size.
In one embodiment, the similarity is obtained with the MinHash algorithm (min-hash, a locality-sensitive hashing (LSH) optimization of the Jaccard distance).
The core idea of LSH is to map spatial relationships in a high-dimensional space into a low-dimensional space while preserving the original correspondence as far as possible; it is a simplification method rather than an enhancement method. That is, if samples have a similar correspondence in the original space, an LSH algorithm can recover this relationship relatively quickly. Many similarity or distance measures have corresponding LSH algorithms, including Euclidean distance, cosine distance, and Jaccard similarity; the LSH algorithm corresponding to Jaccard similarity is the min-hash algorithm (MinHash for short).
A set of k-mer types can be obtained for each sample. Suppose sets A and B come from two samplings of the gut microbial community that were sequenced and converted to k-mers. The Jaccard similarity of the two samples is then:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
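As a small worked example of the Jaccard similarity, the following Python sketch computes it exactly for two toy k-mer sets. The function name `jaccard`, the toy sets, and the convention of returning 0.0 for two empty sets are assumptions for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return len(a & b) / len(union)


sample_a = {"ACG", "CGT", "GTA"}            # toy k-mer set (illustrative)
sample_b = {"CGT", "GTA", "TAC", "ACC"}     # toy k-mer set (illustrative)
print(jaccard(sample_a, sample_b))          # 2 shared / 5 distinct = 0.4
```

Computing this exactly requires holding both full k-mer sets in memory, which is the cost that the MinHash approximation is meant to avoid.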
Now suppose there is a hash function $h(\cdot)$ that is a random permutation of all k-mers in sets A and B, and define $h_{\min}(S)$ as the first k-mer of set S to appear under this hash function ($h(\cdot)$ is an ordered arrangement; the k-mers in the arrangement are checked in order for membership in S, and the position in the arrangement of the first one found is $h_{\min}(S)$). Then:
$$\Pr[h_{\min}(A) = h_{\min}(B)] = J(A, B)$$
To estimate the probability that $h_{\min}(A) = h_{\min}(B)$, multiple random $h(\cdot)$ must be generated. Suppose n random trials are performed and $h_{\min}(A) = h_{\min}(B)$ in m of them; then $J(A, B)$ can be approximated by $m/n$. This is the original definition of MinHash: the Jaccard similarity is approximated with multiple $h(\cdot)$, and the expected error of the estimate has also been proven to be bounded (it shrinks as n grows).
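The multi-permutation estimate can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the function name `minhash_estimate`, the fixed seed, and the toy sets are assumptions, and each random permutation is generated explicitly, which is practical only for small sets.

```python
import random


def minhash_estimate(a: set[str], b: set[str], n: int, seed: int = 0) -> float:
    """Estimate J(A, B) as m / n over n random permutations of A ∪ B."""
    if not a or not b:
        return 0.0
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    universe = sorted(a | b)
    matches = 0
    for _ in range(n):
        rng.shuffle(universe)          # one random permutation h()
        rank = {kmer: i for i, kmer in enumerate(universe)}
        # h_min(S): the element of S that appears first in the permutation
        if min(a, key=rank.__getitem__) == min(b, key=rank.__getitem__):
            matches += 1               # counts the m trials where the minima agree
    return matches / n


set_a = {f"kmer{i}" for i in range(0, 60)}     # toy sets; true J = 30 / 90
set_b = {f"kmer{i}" for i in range(30, 90)}
print(minhash_estimate(set_a, set_b, n=2000))
```

Generating n full permutations is the rate-limiting step mentioned in the text, which motivates the single-hash variant.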
Computing MinHash is not complicated, but the rate-limiting step is often generating the n hash functions (random permutations), which is time-consuming especially when the two sets contain many k-mers. A variant that uses a single hash function was therefore introduced. Define $h_{(n)}(S)$ as the first n k-mers of set S to appear in the arrangement of $h(\cdot)$; then $J(A, B)$ can be approximated as:
$$J(A, B) \approx \frac{|h_{(n)}(A \cup B) \cap h_{(n)}(A) \cap h_{(n)}(B)|}{n}$$
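A minimal Python sketch of the single-hash variant follows. The names `stable_hash`, `bottom_n`, and `sketch_jaccard` are hypothetical, and using a truncated SHA-1 digest as the ordering function is an assumption made here to obtain a repeatable pseudo-random order; the patent does not specify the hash function.

```python
import hashlib


def stable_hash(kmer: str) -> int:
    """Deterministic 64-bit hash used as the single ordering function h()."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_n(s: set[str], n: int) -> set[str]:
    """h_(n)(S): the n elements of S with the smallest hash values."""
    return set(sorted(s, key=stable_hash)[:n])


def sketch_jaccard(a: set[str], b: set[str], n: int) -> float:
    """Approximate J(A, B) from bottom-n sketches only."""
    sketch_a, sketch_b = bottom_n(a, n), bottom_n(b, n)
    # The bottom-n of A ∪ B can be derived from the two sketches alone.
    sketch_union = bottom_n(sketch_a | sketch_b, n)
    if not sketch_union:
        return 0.0
    shared = sketch_union & sketch_a & sketch_b
    # The divisor equals n whenever A ∪ B holds at least n elements.
    return len(shared) / len(sketch_union)


set_a = {f"kmer{i}" for i in range(0, 6000)}       # toy sets; true J = 1/3
set_b = {f"kmer{i}" for i in range(3000, 9000)}
print(sketch_jaccard(set_a, set_b, n=1000))
```

Only the two sketches need to be stored per sample, so comparing a new sample against many stored samples does not require their full k-mer sets.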
It should be noted that the above is only one way of calculating the similarity. The present application is not limited to it, and other methods of calculating the similarity are also applicable.
In one embodiment, in step 103, judging according to the first probability value whether the test sample and the first sample belong to the same biological individual includes:
when the first probability value is less than a first preset threshold, the test sample and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the test sample and the first sample do not belong to the same biological individual.
The first preset threshold is set according to the requirement for statistically significant similarity; for example, it can be set to 0.01, or set as needed.
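The threshold decision can be sketched in Python as follows. Several points are assumptions rather than statements from the source: the probability value is taken to be the upper-tail probability of the observed similarity under a Beta model, the Beta parameters (2, 8) are made up for illustration, and the tail is computed by simple midpoint-rule integration (suitable when both parameters exceed 1) instead of a library routine.

```python
import math


def beta_pdf(x: float, a: float, b: float) -> float:
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def upper_tail(x: float, a: float, b: float, steps: int = 20000) -> float:
    """P(S >= x) for S ~ Beta(a, b), by midpoint-rule integration over [x, 1]."""
    if x <= 0.0:
        return 1.0
    if x >= 1.0:
        return 0.0
    width = (1.0 - x) / steps
    total = sum(beta_pdf(x + (i + 0.5) * width, a, b) for i in range(steps))
    return min(1.0, total * width)


def same_individual(similarity: float, a: float, b: float, alpha: float = 0.01):
    """Return (decision, p_value); the decision is True when p_value < alpha."""
    p_value = upper_tail(similarity, a, b)
    return p_value < alpha, p_value


# Hypothetical model of the first sample: Beta(2, 8), mean similarity 0.2.
print(same_individual(0.80, 2.0, 8.0))   # unusually high similarity
print(same_individual(0.20, 2.0, 8.0))   # typical inter-individual similarity
```

With these made-up parameters, a similarity far above what the first sample shows against other individuals yields a small probability value and is classified as the same individual.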
When the test sample is tested against multiple distributions, a false discovery rate (FDR) correction can also be applied to avoid false positives. The Benjamini & Yekutieli (BY) method can be used for the correction; other methods can of course be used as well. In one embodiment, the method further includes:
establishing the similarity probability distribution models of the other samples among the multiple samples besides the first sample, obtaining the similarity between the test sample and the other samples, and determining other probability values of the test sample from those similarities and the similarity probability distribution models of the other samples. For example, when there are n samples, a similarity probability distribution model can be established for each sample, giving n similarity probability distribution models in total; the similarity between the test sample and each of the n samples is calculated to obtain n similarities, and n probability values of the test sample are then obtained from the n similarity probability distribution models.
Judging according to the first probability value whether the test sample and the first sample belong to the same biological individual then includes: performing false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the test sample and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the test sample and the first sample do not belong to the same biological individual.
False discovery rate correction is performed on the n probability values to obtain n corrected probability values, and whether the test sample belongs to the same biological individual as each of the n samples is then judged from the n corrected probability values. The second preset threshold represents the false discovery rate and can be set to 0.01; it can of course be set to other values. In general, a smaller value means a lower false discovery rate.
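The BY correction mentioned above can be sketched in pure Python. The function name `benjamini_yekutieli` and the example probability values are assumptions for illustration; the procedure shown is the standard step-up adjustment, in which each sorted p-value is scaled by m · c(m) / rank with c(m) = 1 + 1/2 + ... + 1/m, and monotonicity is enforced from the largest p-value downward.

```python
def benjamini_yekutieli(p_values: list[float]) -> list[float]:
    """Return BY-adjusted values (q-values) in the same order as the input."""
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / i for i in range(1, m + 1))        # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical probability values of one test sample against three stored samples.
p = [0.001, 0.5, 0.9]
q = benjamini_yekutieli(p)
threshold = 0.01                                       # second preset threshold
matches = [i for i, value in enumerate(q) if value < threshold]
print(q, matches)
```

In this toy case only the first stored sample remains below the threshold after correction, so only that sample would be reported as coming from the same individual.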
Compared with a scheme that uses the similarity directly as the criterion, the scheme provided in this embodiment reduces the error probability through false discovery rate correction. For example, a false discovery rate threshold of 0.01 means that the probability of a wrong judgment is 1%; if the test sample is significant in the distribution of only one of all the samples, the result can basically be considered reliable, because the probability of error is 0.01.
An embodiment of the present invention provides a microbiome identification apparatus, as shown in Fig. 2, including:
an information acquisition module 201, configured to obtain microbiome feature information of multiple biological individuals to generate multiple samples, and to obtain a test sample;
a similarity calculation module 202, configured to calculate the similarity between a first sample among the multiple samples and each of the other samples to obtain multiple similarities, and to calculate the similarity between the test sample and the first sample;
a similarity distribution establishing module 203, configured to determine the similarity probability distribution model of the first sample from the multiple similarities;
an identification module 204, configured to judge whether the test sample and the first sample belong to the same biological individual according to the position of the similarity between the test sample and the first sample in the similarity probability distribution model of the first sample.
In one embodiment, the microbiome feature information includes: metagenomic sequencing data of the microbiome, or microarray (chip) data of the microbiome, or staining information of the microbiome.
In one embodiment, calculating the similarity by the similarity calculation module 202 includes: when the microbiome feature information is metagenomic sequencing data of the microbiome, performing k-mer segmentation on the metagenomic sequencing data, where k is greater than 1, and calculating the similarity from the k-mer-segmented metagenomic sequencing data.
In one embodiment, the similarity calculation module 202 may calculate the similarity with any of several algorithms, for example the MinHash algorithm. Other algorithms may of course be used, and the present application places no limitation on this.
In one embodiment, as shown in Fig. 3, the identification module 204 includes a first probability value determination unit 301 and a first judging unit 302, wherein:
the first probability value determination unit 301 is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the test sample and the first sample;
the first judging unit 302 is configured to compare the first probability value with the first preset threshold: when the first probability value is less than the first preset threshold, the test sample and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the test sample and the first sample do not belong to the same biological individual.
In one embodiment, as shown in Fig. 4, the identification module 204 includes a second probability value determination unit 401, a correction unit 402, and a second judging unit 403, wherein:
the similarity calculation module is further configured to calculate the pairwise similarities between the other samples among the multiple samples besides the first sample, and to calculate the similarity between the test sample and the other samples;
the similarity distribution establishing module is further configured to establish the similarity probability distribution models of the other samples among the multiple samples besides the first sample, and to obtain the similarity between the test sample and the other samples;
the second probability value determination unit is configured to determine, from the similarity probability distribution model of the first sample, the first probability value corresponding to the similarity between the test sample and the first sample, and to determine other probability values of the test sample from the similarity between the test sample and the other samples and the similarity probability distribution models of the other samples;
the correction unit is configured to perform false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value;
the second judging unit is configured to compare the corrected first probability value with the second preset threshold: when the corrected first probability value is less than the second preset threshold, the test sample and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the test sample and the first sample do not belong to the same biological individual.
The present application is further illustrated below with a specific embodiment.
As shown in Fig. 5 and Fig. 6, the microbiome identification method provided in this embodiment includes:
Step 601: obtain the metagenomic sequencing data of n samples and perform k-mer segmentation to obtain the segmented metagenomic sequencing data;
Step 602: calculate the pairwise similarities of the n samples;
The similarity is calculated with the MinHash method. Specifically, after k-mer segmentation, each metagenomic sample yields a corresponding k-mer set. The hash function is an ordered arrangement of the k-mers, so each metagenomic sample can be mapped by this function to a group of sequence numbers. These sequence numbers are the hash values; the m smallest hash values are selected according to the hash function, and the similarity is then calculated as follows:
$$J(A, B) \approx \frac{|h_{(m)}(A \cup B) \cap h_{(m)}(A) \cap h_{(m)}(B)|}{m}$$
Step 603: for each sample, generate the similarity distribution model of that sample from its n-1 similarities to the other n-1 samples. A Beta distribution can be fitted to obtain the similarity distribution model, which yields n similarity distribution models in total.
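Step 603 can be sketched in Python as follows. The source says only that a Beta distribution can be fitted; the method-of-moments estimator used here is a substitute chosen for simplicity, not necessarily the fitting procedure the authors used, and the function names and the 4 × 4 similarity matrix are made up for illustration.

```python
from statistics import fmean, variance


def fit_beta_moments(values: list[float]) -> tuple[float, float]:
    """Method-of-moments estimate of Beta(a, b) from similarities in (0, 1)."""
    mean = fmean(values)
    var = variance(values)                    # sample variance, needs >= 2 values
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("moments are not compatible with a Beta distribution")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common


def fit_all_models(similarity: list[list[float]]) -> list[tuple[float, float]]:
    """Fit one model per sample from its n - 1 similarities to the other samples."""
    n = len(similarity)
    return [
        fit_beta_moments([similarity[i][j] for j in range(n) if j != i])
        for i in range(n)
    ]


# Hypothetical pairwise similarity matrix for n = 4 samples.
sim = [
    [1.00, 0.10, 0.20, 0.30],
    [0.10, 1.00, 0.15, 0.25],
    [0.20, 0.15, 1.00, 0.12],
    [0.30, 0.25, 0.12, 1.00],
]
for index, (a, b) in enumerate(fit_all_models(sim)):
    print(index, round(a, 3), round(b, 3))
```

A maximum-likelihood fit from a statistics library would be a reasonable alternative when more samples are available; the moment estimator is shown because it needs no external dependency.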
Step 604: calculate the similarity between the test sample and each of the n samples to obtain n similarities. For any sample, a probability value is obtained from the similarity between the test sample and that sample and from the similarity probability distribution model of that sample, and whether the test sample and that sample come from the same person is judged according to the probability value. The judgment checks whether the similarity is significant by comparing the probability value with the first preset threshold: a value less than the first preset threshold indicates significant similarity, and a value greater than or equal to the first preset threshold indicates non-significant similarity. For example, in Fig. 5, P2 < α, so the test sample and the target sample come from the same person; P1 > α, so the test sample and the target sample come from different biological individuals. Here α is the first preset threshold and can be, for example, 0.01.
In another embodiment, after the n probability values p1, p2, ..., pn are obtained in step 604, false discovery rate correction can be applied to p1, p2, ..., pn to obtain q1, q2, ..., qn, and whether the sample to be tested and each target sample come from the same person is judged from q1, q2, ..., qn respectively. Specifically, q1, q2, ..., qn can each be compared with a threshold q: when qi (i=1, ..., n) is less than q, the sample to be tested and the target sample corresponding to qi come from the same person; when qi is greater than or equal to q, the sample to be tested and the target sample corresponding to qi come from different biological individuals. The threshold q is the false discovery rate and can be set as needed, for example to 0.01.
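The correction step can be sketched as below. The text does not name a specific false discovery rate procedure; the Benjamini-Hochberg adjustment is assumed here because it is the most widely used one, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted values q_i in the order of the input p_i."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that each q_i is the minimum of
    # p * n / rank over its own rank and all larger ranks.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


def matching_samples(p_values, q_threshold=0.01):
    """Indices i whose corrected value q_i is below the threshold q."""
    return [i for i, q in enumerate(benjamini_hochberg(p_values)) if q < q_threshold]
```

Each returned index identifies a target sample judged to come from the same person as the sample to be tested; all other target samples are judged to come from different biological individuals.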
Fig. 7 is a schematic diagram, provided by one embodiment of the invention, of similarity calculation using different kinds of characteristic information. Fig. 7 shows the test results on 612 samples. The left panel of Fig. 7 is a receiver operating characteristic (ROC) curve comparing metagenomic sequencing data processed with k-mer segmentation (labelled Gemini in the figure) with species (Species) annotation and with gene (KEGG) annotation. The right panel of Fig. 7 is a precision-recall curve (PRC) for the same comparison. The solid line shows the Gemini results, the round-dotted line shows the species results, and the square-dashed line shows the gene results. As can be seen, the Gemini method identifies individuals well; see the auROC and auPRC values, where a higher value indicates a more accurate prediction. In addition, k-mers give better results as features than species or genes: the auROC and auPRC are both higher than the values obtained with species or genes as features.
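For readers unfamiliar with the metric, the auROC reported in Fig. 7 can be computed from labelled scores as in the sketch below. This is only an illustration of the metric itself under the standard rank (Mann-Whitney) formulation; the function name is hypothetical, and it is not a reconstruction of how the figure was produced.

```python
def au_roc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive item has the higher score (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Here a label of 1 would mark a pair of samples that truly come from the same individual, and the score would be any quantity that grows with the evidence of a match.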
One embodiment of the invention provides microbiome identification equipment comprising a memory and a processor, wherein the memory stores a program, and the program, when read and executed by the processor, implements the microbiome identification method described in any of the above embodiments.
One embodiment of the invention provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the microbiome identification method described in any of the above embodiments.
The computer-readable storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Although the embodiments disclosed herein are as described above, their content is only intended to facilitate understanding of the invention and does not limit the invention. Any person skilled in the art to which the invention belongs may make any modification or variation in the form and details of implementation without departing from the spirit and scope disclosed by the invention, but the scope of patent protection of the invention shall still be defined by the appended claims.
Claims (13)
1. A microbiome identification method, comprising:
obtaining microbiome characteristic information of a plurality of biological individuals to generate a plurality of samples, calculating the similarity between a first sample among the plurality of samples and each of the other samples to obtain a plurality of similarities, and establishing a similarity probability distribution model of the first sample from the plurality of similarities;
obtaining a sample to be tested, calculating the similarity between the sample to be tested and the first sample, determining a first probability value of the sample to be tested from the similarity between the sample to be tested and the first sample and the similarity probability distribution model of the first sample, and judging from the first probability value whether the sample to be tested and the first sample belong to the same biological individual.
2. The microbiome identification method according to claim 1, wherein the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
3. The microbiome identification method according to claim 2, wherein, when the similarity is calculated, if the characteristic information of the microbiome is metagenomic sequencing data of the microbiome, k-mer segmentation is performed on the metagenomic sequencing data, and the similarity is calculated based on the metagenomic sequencing data after k-mer segmentation, k being greater than 1.
4. The microbiome identification method according to claim 1, wherein the similarity is obtained based on the MinHash algorithm.
5. The microbiome identification method according to any one of claims 1 to 4, wherein judging from the first probability value whether the sample to be tested and the first sample belong to the same biological individual comprises:
when the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
6. The microbiome identification method according to any one of claims 1 to 4, wherein the method further comprises: establishing similarity probability distribution models of the other samples among the plurality of samples other than the first sample, obtaining the similarities between the sample to be tested and the other samples, and determining other probability values of the sample to be tested from the similarities between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;
wherein judging from the first probability value whether the sample to be tested and the first sample belong to the same biological individual comprises: performing false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value; when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
7. A microbiome identification device, comprising:
an information acquisition module, configured to obtain microbiome characteristic information of a plurality of biological individuals to generate a plurality of samples, and to obtain a sample to be tested;
a similarity calculation module, configured to calculate the similarity between a first sample among the plurality of samples and each of the other samples to obtain a plurality of similarities, and to calculate the similarity between the sample to be tested and the first sample;
a similarity distribution establishing module, configured to establish a similarity probability distribution model of the first sample from the plurality of similarities;
an identification module, configured to judge whether the sample to be tested and the first sample belong to the same biological individual according to the position of the similarity between the sample to be tested and the first sample within the similarity probability distribution model of the first sample.
8. The microbiome identification device according to claim 7, wherein the microbiome characteristic information comprises: metagenomic sequencing data of the microbiome, or microarray data of the microbiome, or staining information of the microbiome.
9. The microbiome identification device according to claim 8, wherein calculating the similarity by the similarity calculation module comprises: when the characteristic information of the microbiome is metagenomic sequencing data of the microbiome, performing k-mer segmentation on the metagenomic sequencing data, k being greater than 1, and calculating the similarity based on the metagenomic sequencing data after k-mer segmentation.
10. The microbiome identification device according to claim 7, wherein the similarity calculation module calculates the similarity based on the MinHash algorithm.
11. The microbiome identification device according to any one of claims 7 to 10, wherein the identification module comprises a first probability value determination unit and a first judging unit, wherein:
the first probability value determination unit is configured to determine, from the similarity probability distribution model of the first sample, a first probability value corresponding to the similarity between the sample to be tested and the first sample;
the first judging unit is configured to compare the first probability value with a preset threshold; when the first probability value is less than a first preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the first probability value is greater than or equal to the first preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
12. The microbiome identification device according to any one of claims 7 to 10, wherein:
the similarity calculation module is further configured to calculate the pairwise similarities of the other samples among the plurality of samples other than the first sample, and to calculate the similarities between the sample to be tested and the other samples;
the similarity distribution establishing module is further configured to establish similarity probability distribution models of the other samples among the plurality of samples other than the first sample, and to obtain the similarities between the sample to be tested and the other samples;
the identification module comprises a second probability value determination unit, a correction unit and a second judging unit, wherein:
the second probability value determination unit is configured to determine, from the similarity probability distribution model of the first sample, a first probability value corresponding to the similarity between the sample to be tested and the first sample, and to determine other probability values of the sample to be tested from the similarities between the sample to be tested and the other samples and the similarity probability distribution models of the other samples;
the correction unit is configured to perform false discovery rate correction on the first probability value and the other probability values to obtain a corrected first probability value;
the second judging unit is configured to compare the corrected first probability value with a preset threshold; when the corrected first probability value is less than a second preset threshold, the sample to be tested and the first sample belong to the same biological individual; when the corrected first probability value is greater than or equal to the second preset threshold, the sample to be tested and the first sample do not belong to the same biological individual.
13. Microbiome identification equipment, comprising a memory and a processor, wherein the memory stores a program, and the program, when read and executed by the processor, implements the microbiome identification method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810073198.4A CN108268753B (en) | 2018-01-25 | 2018-01-25 | Method, device and equipment for identifying microbiome |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268753A (en) | 2018-07-10 |
CN108268753B (en) | 2021-12-03 |
Family
ID=62776724
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810073198.4A Active CN108268753B (en) | 2018-01-25 | 2018-01-25 | Method, device and equipment for identifying microbiome |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268753B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522378A (en) * | 2018-10-10 | 2019-03-26 | 深圳韦格纳医学检验实验室 | The display methods and display equipment of hereditary birthplace probability distribution |
CN110245685A (en) * | 2019-05-15 | 2019-09-17 | 清华大学 | Genome unit point makes a variation pathogenic prediction technique, system and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105986013A (en) * | 2015-02-02 | 2016-10-05 | 广州华大基因医学检验所有限公司 | Method and device for determining microbial species |
CN106202989A (en) * | 2015-04-30 | 2016-12-07 | 中国科学院青岛生物能源与过程研究所 | A kind of method obtaining child's individuality biological age based on oral microbial community |
CN106202999A (en) * | 2016-07-21 | 2016-12-07 | 厦门大学 | Microorganism high-pass sequencing data based on different scale tuple word frequency analyzes agreement |
Non-Patent Citations (1)
Title |
---|
YANG,YUQING 等: "Inference of Environmental Factor-Microbe and Microbe-Microbe Associations from Metagenomic Data Using a Hierarchical Bayesian Statistical Model", 《CELL SYSTEMS》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522378A (en) * | 2018-10-10 | 2019-03-26 | 深圳韦格纳医学检验实验室 | The display methods and display equipment of hereditary birthplace probability distribution |
CN110245685A (en) * | 2019-05-15 | 2019-09-17 | 清华大学 | Genome unit point makes a variation pathogenic prediction technique, system and storage medium |
CN110245685B (en) * | 2019-05-15 | 2022-03-25 | 清华大学 | Method, system and storage medium for predicting pathogenicity of genome single-site variation |
Also Published As
Publication number | Publication date |
---|---|
CN108268753B (en) | 2021-12-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||