CN115831224B

CN115831224B - Method and device for predicting the probiotic potential of microorganisms

Info

Publication number: CN115831224B
Application number: CN202211397170.9A
Authority: CN
Inventors: 左永春; 李海成; 孙宇; 郭树春; 赵小庆
Original assignee: Inner Mongolia University
Current assignee: Inner Mongolia University
Priority date: 2022-11-09
Filing date: 2022-11-09
Publication date: 2024-05-03
Anticipated expiration: 2042-11-09
Also published as: CN115831224A

Abstract

The invention provides a method and a device for predicting the probiotic potential of microorganisms. The method comprises the following steps: determining a sample genome sequence corresponding to a microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism; based on the sample genome sequence, determining, by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence; and applying a support vector machine model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence. Embodiments of the invention can improve the efficiency of determining the probiotic potential of a microorganism while maintaining the accuracy of the prediction.

Description

Method and device for predicting the probiotic potential of microorganisms

Technical Field

The invention relates to the technical field of gene sequencing, and in particular to a method and a device for predicting the probiotic potential of microorganisms.

Background

Genomic research based on high-throughput sequencing is a current research hotspot. Bioinformatic methods for genome analysis are steadily maturing and have greatly advanced genome research, with notable results in heredity and evolution, gene discovery, and research related to human diseases.

Microorganisms are numerous in both type and number. Obtaining microorganism samples and exploring their probiotic potential currently requires a complex experimental verification process with a long experimental cycle and high labor cost. A method for predicting the probiotic potential of microorganisms is therefore needed to improve the efficiency with which that potential is determined.

Disclosure of Invention

The invention aims to provide a method and a device for predicting the probiotic potential of microorganisms that improve the efficiency of determining the probiotic potential of a microorganism while maintaining the accuracy of the prediction.

To achieve the above object, in a first aspect, the present invention provides a method for predicting the probiotic potential of a microorganism, comprising:

determining a sample genome sequence corresponding to the microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism;

determining, based on the sample genome sequence and by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence; and

applying a support vector machine model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence.

Optionally, before the support vector machine model is applied to the sub-fragment set and the abundance of the sub-fragment set, the method further comprises:

training an initial support vector machine model to obtain the support vector machine model, wherein the initial support vector machine model is a model that has not yet been trained.

Optionally, training the initial support vector machine model comprises:

determining a historical genome training sequence; and

training the initial support vector machine model based on the historical genome training sequence.

Optionally, determining the historical genome training sequence comprises:

acquiring a historical genome sequence;

performing k-mer calculation on the historical genome sequence to obtain a historical sub-fragment set and the abundance of the historical sub-fragment set; and

normalizing the historical sub-fragment set and the abundance of the historical sub-fragment set, and performing feature screening to obtain the historical genome training sequence.

Optionally, determining the sample genome sequence corresponding to the microorganism comprises:

performing high-throughput sequencing on the genomic DNA of a sample of the microorganism to obtain a sequencing result; and

aligning the sequencing result with a reference genome to obtain the sample genome sequence.

In a second aspect, the present invention provides an apparatus for predicting the probiotic potential of a microorganism, comprising:

a sample genome sequence determining module, configured to determine a sample genome sequence corresponding to the microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism;

a sub-fragment extraction module, configured to determine, based on the sample genome sequence and by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence; and

a prediction module, configured to apply a support vector machine model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence.

Optionally, the apparatus further comprises:

a training module, configured to train an initial support vector machine model to obtain the support vector machine model, wherein the initial support vector machine model is a model that has not yet been trained.

Optionally, the training module is configured to train an initial support vector machine model, and includes:

Determining a historical genome training sequence;

Optionally, the training module is configured to determine a historical genomic training sequence, including:

acquiring a historical genome sequence;

Optionally, the sample genome sequence determining module is configured to determine a sample genome sequence corresponding to a microorganism, and includes:

performing high-throughput sequencing on the genomic DNA of a sample of the microorganism to obtain a sequencing result; and aligning the sequencing result with a reference genome to obtain the sample genome sequence.

In summary, the invention provides a method and a device for predicting the probiotic potential of microorganisms. The method comprises: determining a sample genome sequence corresponding to the microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism; determining, based on the sample genome sequence and by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence; and applying a support vector machine model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence. Because the sample genome sequence is screened by gene sequence, the step of manually screening microorganisms is replaced and the selection process is shortened. Because prediction is performed by the support vector machine model, the efficiency of determining the probiotic potential is improved, and because that model has been trained in advance, the accuracy of the prediction can also be improved. Embodiments of the invention can therefore improve the efficiency of determining the probiotic potential of a microorganism while maintaining the accuracy of the prediction.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of the steps of a method for predicting the probiotic potential of a microorganism provided in an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating steps for training an initial SVM model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram illustrating steps for determining a historical genome training sequence according to an embodiment of the present invention;

FIG. 4 is a graph of AUROC values corresponding to an initial support vector machine model in an embodiment of the present invention;

FIG. 5 is an optional block diagram of an apparatus for predicting the probiotic potential of a microorganism provided in an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

As described in the background, existing microorganism selection processes suffer from high screening cost and long screening cycles, and obtaining the probiotic potential of a microorganism is complex and inefficient. The inventors found that, during selection, different microorganisms can be screened effectively on the basis of their differing gene sequences. The embodiments of the invention therefore provide a method for predicting the probiotic potential of microorganisms. First, a sample genome sequence corresponding to the microorganism is determined, the sample genome sequence being obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism. Next, based on the sample genome sequence, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set are determined by using a k-mer algorithm. Finally, a support vector machine model is applied to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, the support vector machine model being obtained by training an initial support vector machine model on a historical genome training sequence. In the embodiments of the invention, specific microorganisms can be screened from a large pool of microorganisms by means of their gene sequences to obtain sample genome sequences, and the support vector machine model then predicts from the sub-fragment set and abundance derived from each sample genome sequence to give the corresponding probiotic potential result.

Because the sample genome sequence is screened by gene sequence, the step of manually screening microorganisms is replaced and the selection process is shortened. Because prediction is performed by the support vector machine model, the efficiency of determining the probiotic potential is improved, and because that model has been trained in advance, the accuracy of the prediction can also be improved. Embodiments of the invention can therefore improve the efficiency of determining the probiotic potential of a microorganism while maintaining the accuracy of the prediction.

FIG. 1 is a schematic diagram of the steps of a method for predicting the probiotic potential of a microorganism provided in an embodiment of the present invention.

Referring to fig. 1, the method for predicting the probiotic potential of microorganisms specifically includes:

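Step S12, determining a sample genome sequence corresponding to the microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism.

Optionally, determining the sample genome sequence corresponding to the microorganism comprises performing high-throughput sequencing on the genomic DNA of a sample of the microorganism to obtain a sequencing result, and aligning the sequencing result with a reference genome to obtain the sample genome sequence.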

In an optional embodiment, the microorganism is a probiotic, that is, an edible microorganism that is generally considered beneficial to the host (e.g., an animal or a human) after ingestion. In this embodiment, the sample genome sequence corresponding to the probiotic may be in FASTA format.
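As an illustration of the FASTA input mentioned above, the following is a minimal sketch of reading such a file into header and sequence pairs. The function name `read_fasta`, the record names, and the toy sequences are hypothetical and do not come from the patent; the sketch assumes the common FASTA convention that a header line begins with `>` and is followed by one or more sequence lines.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are concatenated and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record input standing in for a sample genome file.
example = io.StringIO(">contig_1 hypothetical\nATGATC\natga\n>contig_2\nGGTTCA\n")
records = list(read_fasta(example))
```

Each sequence returned this way can then be passed to the k-mer step described below.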

In other optional implementations of the invention, the sequencing result may also be data that has been evaluated and summarized to provide statistics on, and an ordering of, the microorganisms.

Step S14, based on the sample genome sequence, determining, by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence.

Illustratively, a k-mer algorithm is used to extract, from the sample genome sequence, a sub-fragment set in which each sub-fragment contains k bases, together with the abundance of the sub-fragment set. If the genome length is L and the k-mer length is set to k, the number of sub-fragments in the resulting sub-fragment set is L - k + 1.
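The extraction described above can be sketched as a sliding window of width k moved one base at a time. This is a minimal illustration, not the patented implementation: the function name `kmer_abundance` and the toy sequence are hypothetical, and abundance is taken here to mean the raw occurrence count of each k-mer.

```python
from collections import Counter


def kmer_abundance(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence (window step of one base)."""
    sequence = sequence.upper()
    if k <= 0 or k > len(sequence):
        raise ValueError("k must satisfy 1 <= k <= len(sequence)")
    # A sequence of length L yields exactly L - k + 1 windows.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


toy_genome = "ATGATCATGA"  # hypothetical 10-base sequence, not real data
counts = kmer_abundance(toy_genome, k=5)
total_windows = sum(counts.values())  # equals len(toy_genome) - 5 + 1
```

The sum of all counts equals the number of windows, which is a quick way to check the L - k + 1 relationship on any input.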

The sub-fragment set may be a set of k-mer fragments, from which a model subsequently predicts the probiotic potential of the microorganism.

Step S16, applying a support vector machine (SVM) model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence.

In a further optional implementation of the present invention, with continued reference to FIG. 1, before the support vector machine model is applied to the sub-fragment set and the abundance of the sub-fragment set in step S16, the method further includes:

Step S15, training an initial support vector machine model to obtain the support vector machine model, wherein the initial support vector machine model is a model that has not yet been trained.

Optionally, referring to FIG. 2, training the initial support vector machine model may specifically include:

Step S151, determining a historical genome training sequence.

The historical genome training sequence is derived from previously obtained microbial genome sequences. The longer the sequences it contains, the better the training effect on the initial support vector machine model.

Optionally, as shown in FIG. 3, determining the historical genome training sequence may include:

Step S1511, acquiring a historical genome sequence;

Step S1512, performing k-mer calculation on the historical genome sequence to obtain a historical sub-fragment set and the abundance of the historical sub-fragment set;

Step S1513, normalizing the historical sub-fragment set and the abundance of the historical sub-fragment set, and performing feature screening to obtain the historical genome training sequence.
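Steps S1512 and S1513 can be sketched as follows. The patent does not state which normalization or which feature-screening criterion is used, so both choices here are assumptions made for illustration: counts are normalized to relative frequencies, and a k-mer is kept only if its frequency varies across genomes by at least a threshold. The names `normalize` and `screen_features`, the threshold, and the counts are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def normalize(counts: Counter) -> dict:
    """Convert raw k-mer counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def screen_features(profiles: list, min_variance: float) -> list:
    """Keep k-mers whose frequency variance across genomes is >= min_variance."""
    vocabulary = sorted(set().union(*profiles))
    kept = []
    for kmer in vocabulary:
        values = [profile.get(kmer, 0.0) for profile in profiles]
        if pvariance(values) >= min_variance:
            kept.append(kmer)
    return kept


# Hypothetical k-mer counts for three historical genomes (illustrative only).
raw = [
    Counter({"GTCAT": 6, "ATGAC": 2, "CATGA": 2}),
    Counter({"GTCAT": 1, "ATGAC": 2, "CATGA": 7}),
    Counter({"GTCAT": 5, "ATGAC": 2, "CATGA": 3}),
]
profiles = [normalize(c) for c in raw]
selected = screen_features(profiles, min_variance=1e-3)
```

In this toy data `ATGAC` has the same frequency in every genome, so it carries no information for separating classes and is screened out. A real pipeline might instead screen against the class labels; the variance filter is only the simplest stand-in.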

Step S152, training the initial support vector machine model based on the historical genome training sequence.
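To make the training and prediction steps concrete, the following is a minimal sketch of a linear support vector machine trained by sub-gradient descent on the regularized hinge loss. The patent does not specify the kernel, the optimizer, or any hyperparameters, so the linear kernel, the values of `lam`, `lr`, and `epochs`, the +1/-1 label encoding, and the toy feature vectors are all assumptions for illustration. In practice an established SVM library would normally be used instead.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss (labels in {-1, +1})."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:
                # Margin violated: step against the hinge-loss sub-gradient.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regularization term contributes.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical normalized abundances of two k-mers for four historical genomes.
X_train = [[0.6, 0.1], [0.5, 0.2], [0.1, 0.6], [0.2, 0.5]]
y_train = [1, 1, -1, -1]  # assumed encoding: +1 probiotic potential, -1 none

w, b = train_linear_svm(X_train, y_train)
predictions = [predict(w, b, x) for x in X_train]
```

Once `w` and `b` are learned, applying the model to a new sample reduces to calling `predict` on that sample's feature vector, which corresponds to step S16.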

In an optional implementation of the present invention, with reference to FIG. 4, training may be terminated once the average AUROC (area under the receiver operating characteristic curve) of the initial support vector machine model reaches 98.00%. The trained model is then used as the support vector machine model to predict the probiotic potential of the microorganism.
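The stopping criterion above can be sketched as follows. AUROC is computed here from its pairwise definition: the fraction of positive and negative pairs in which the positive sample receives the higher score, with ties counted as one half. How the average is formed is not stated in the patent; averaging over validation folds is an assumption, and the fold data and the names `auroc`, `mean_auroc`, and `TARGET_AUROC` are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties = 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_auroc(folds):
    """Average AUROC over (labels, scores) pairs, one pair per validation fold."""
    return sum(auroc(y, s) for y, s in folds) / len(folds)


TARGET_AUROC = 0.98  # the 98.00% threshold stated in the description

# Hypothetical validation folds: (true labels, model scores).
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 0, 1, 0], [0.7, 0.2, 0.3, 0.4]),
]
average = mean_auroc(folds)
training_done = average >= TARGET_AUROC
```

In this toy example the second fold ranks one negative above one positive, so the average stays below the threshold and training would continue.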

Further, in an optional embodiment of the present invention, the sub-fragment set is:

GTCAT, ATGAC, CATGA, TCATG, GTTCA, TGAAC, CAATTG, TGATCA, GATCAT, ATGATC, CGTCAA, AATCGT, TCATGA, ACGATT, CATGAT, ATCATG, TAATCG, CGATTA, GGAATC, CGAATT, AATTCG, TTCATG, GAATTC, CATGAA, GGTTCA, TGAACC, ATGAAC, GTTCAT, TCAATTG, TGATCAT, ATGATCA, AATCGTT, GGCAATT, TCATGAT, ATCATGA, TCGTCAA, TTGACGA, AATTGCC, TTCATGA, TCATGAA, TGTCAGC, CGATTGA, TCGTCAT, GCTGACA, AACGGTT, GTCATTG, TTAATCG, TGATGAC, TTCGTCA, ATGACGA, CAATCGT, TGACGAA, AACCGTT, ACGATTG, CGTTCAA, CAATGAC, CGTCAAT, TTGACGG, TTGAACG, TAATCGT, ATTGACG, AACGGCT, ACGATTA, AATCGTC, CGGAATT, TGAAGAC, GACGATT, CGTTCTT, AAGAACG, AATTCCG, ATTGTCG, TTAACGG, AATGACG, CGACAAT, CGTCATT, AAGACGA, ACGTCAA, ACGAATT, GGAATCA, AATTCGT, GTCATGA, TCGGAAT, GAATTCA, ATGATCC, GGTCATT, CGAATTG, CAATTCG, GGATCAT, CTTCATG, AATCGTG, TGAATTC, TGATCAAG, CCGATTA, GACCGTT, CCGAATT, AACGGTC, AATTCGA, TCAATTGA, CGTAATC, CGTCCAA, AATTCGG, CTTGATCA, GGAATTC, TAATCGG, ATGACGT, TCGAATT, TTGATCAT, TGTTCAT, ACGTCTT, ATGATCAA, CATGTCA, ATTCACG, TGACATG, ATGAACC, CGGCAATT, ATCATGAT, TTGTCAGC, GGCAATTG, TGTCAGCA, AATTGCCG, GGTTCAT, TCATGATC, CAATTGCC, GGGAACA, GATCATGA, ATGGTCC, CGATTAAT, GGGTTCA, AATTGACG, ATTAATCG, GACATGT, CAGTCATT, CCGCAATT, AATTGCGG, TGTCAGCT, CGTTAATT, AATTAACG, GGATCAAT, CCGATTAA, TTAATCGG, CAAGTCCG, CCGTTAAT, TTAAGACG, GCCATATG, CGTTAGTC, GGTCTAT, GTGTAGA, CCCTATC, ACCCTAT, TCTACAC, CTCTACC, GCTATAC, ACTCCTG, CTCTACA, TGTAGAG, TCTATCC, GTAGAGA, TCTCTAC, CCATAGC, GCTATGG, GCTCTAT, GAGATAG, TATCTGC, CTATCTC, CTATCTG, GATAGAG, GGTATAG, ATAGAAC, GTTCTAT, CTATACC, CTCTATC, ATAGGG, CCCTAT, GTAGAG, AGATAGA, TCTATCT, CTCTAC, CTATGG, AGATAG, TCTATC, CTATCT, CTCTAT, GTAGA, TCTAC.
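Because the listed k-mers range from 5 to 8 bases, a feature vector over this set has to count each k-mer with a window of its own length. The sketch below illustrates that on a small subset of the list. Dividing each count by its window count L - k + 1 is an assumed normalization, and the function name `feature_vector` and the toy sequence are hypothetical.

```python
def feature_vector(sequence, kmers):
    """Relative abundance of each listed k-mer, using a window of that k-mer's length."""
    sequence = sequence.upper()
    vector = []
    for kmer in kmers:
        k = len(kmer)
        windows = len(sequence) - k + 1  # L - k + 1 windows for this k
        if windows <= 0:
            vector.append(0.0)  # sequence shorter than the k-mer
            continue
        hits = sum(1 for i in range(windows) if sequence[i:i + k] == kmer)
        vector.append(hits / windows)
    return vector


# Four k-mers taken from the list above (two 5-mers, two 6-mers).
SUBSET = ["GTCAT", "ATGAC", "CAATTG", "TGATCA"]
toy_sequence = "GTCATGATCAATTG"  # hypothetical 14-base sequence, not real data
features = feature_vector(toy_sequence, SUBSET)
```

The resulting vector has one entry per listed k-mer in a fixed order, which is the form a support vector machine expects as input.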

In summary, in the embodiments of the invention, specific microorganisms can be screened from a large pool of microorganisms by means of their gene sequences to obtain sample genome sequences, and the support vector machine model then predicts from the sub-fragment set and abundance derived from each sample genome sequence to give the corresponding probiotic potential result. Screening the sample genome sequence by gene sequence replaces the step of manually screening microorganisms, prediction by the support vector machine model improves the efficiency of determining the probiotic potential, and because that model has been trained in advance, the accuracy of the prediction is also improved.

The foregoing describes the method for predicting the probiotic potential of microorganisms provided in the embodiments of the invention. Correspondingly, an embodiment of the invention further provides an apparatus for predicting the probiotic potential of microorganisms. Because the apparatus embodiments are substantially similar to the method embodiments, they are described only briefly; for details of the relevant technical features, refer to the corresponding descriptions of the method embodiments above. The following description of the apparatus embodiments is merely illustrative. FIG. 5 shows an optional block diagram of the apparatus for predicting the probiotic potential of a microorganism provided in this embodiment, which comprises:

a sample genome sequence determining module 520, configured to determine a sample genome sequence corresponding to a microorganism, wherein the sample genome sequence is obtained by high-throughput sequencing of the genomic DNA of a sample of the microorganism;

a sub-fragment extraction module 540, configured to determine, based on the sample genome sequence and by using a k-mer algorithm, a sub-fragment set corresponding to the sub-genome sequences and the abundance of the sub-fragment set, wherein the sample genome sequence comprises a plurality of sub-genome sequences, and the abundance of the sub-fragment set represents the abundance distribution of the sub-genome sequences in the sample genome sequence; and

a prediction module 560, configured to apply a support vector machine model to the sub-fragment set and the abundance of the sub-fragment set to obtain a prediction of the probiotic potential of the microorganism, wherein the support vector machine model is obtained by training an initial support vector machine model on a historical genome training sequence.

Optionally, with continued reference to FIG. 5, the apparatus further includes:

a training module 550, configured to train an initial support vector machine model to obtain the support vector machine model, wherein the initial support vector machine model is a model that has not yet been trained.

Optionally, the training module 550 is configured to train an initial support vector machine model, including:

Determining a historical genome training sequence;

Optionally, the training module 550 is configured to determine a historical genomic training sequence, including:

acquiring a historical genome sequence;

Optionally, the sample genome sequence determining module 520 is configured to determine a sample genome sequence corresponding to a microorganism, and includes:

The foregoing describes several embodiments of the present invention. The alternatives presented in the various embodiments may be combined and cross-referenced with one another where no conflict arises, thereby yielding further possible embodiments, all of which are regarded as embodiments disclosed by the present invention.

Although the embodiments of the present invention are disclosed above, the present invention is not limited thereto. A person skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

Claims

1. A method of predicting the probiotic potential of a microorganism comprising:

extracting a sub-fragment set containing k bases and the abundance of the sub-fragment set from a sample genome sequence by adopting a k-mer algorithm, wherein the genome length is L, the k-mer length is set as k, and the number of sub-fragments in the generated sub-fragment set is L-k+1;

2. The method of claim 1, wherein prior to predicting the set of subfragments and the abundance of the set of subfragments using a support vector machine model, further comprising:

3. A method of predicting the probiotic potential of a microorganism according to claim 2, wherein said training an initial support vector machine model comprises:

Determining a historical genome training sequence;

An initial support vector machine model is trained based on the historical genomic training sequence.

4. A method of predicting the probiotic potential of a microorganism according to claim 3, wherein said determining a historical genomic training sequence comprises:

acquiring a historical genome sequence;

5. The method of claim 1, wherein determining the sample genomic sequence corresponding to the microorganism comprises

6. An apparatus for predicting the probiotic potential of a microorganism, comprising:

extracting a sub-fragment set containing k bases and the abundance of the sub-fragment set from a sample genome sequence by adopting a k-mer algorithm, wherein the genome length is L, the k-mer length is set as k, and the number of sub-fragments in the generated sub-fragment set is L-k+1; the prediction module is used for predicting the sub-fragment set and the abundance of the sub-fragment set by using a support vector machine model to obtain a result of the probiotic potential of the microorganism, and the support vector machine model is obtained by training an initial support vector machine model through a historical genome training sequence.

7. The apparatus for predicting the probiotic potential of a microorganism of claim 6, further comprising:

8. The apparatus for predicting the probiotic potential of a microorganism of claim 7, wherein the training module is configured to train an initial support vector machine model, comprising:

Determining a historical genome training sequence;

9. The apparatus of claim 8, wherein the training module is configured to determine a historical genomic training sequence comprising:

acquiring a historical genome sequence;

10. The apparatus of claim 6, wherein the sample genomic sequence determination module is configured to determine a sample genomic sequence corresponding to the microorganism, comprising: