CN117198397A - Disease prediction method and system based on variational neural network - Google Patents
- Publication number: CN117198397A
- Application number: CN202311028109.1A
- Authority: CN (China)
- Prior art keywords: data, neural network, omics, joint, training
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a disease prediction method and system based on a variational neural network. The method comprises the following steps. Step 1: extract DNA, RNA and metabolites from a sample, amplify the DNA and RNA into libraries suitable for high-throughput sequencing, obtain raw sequencing data using high-throughput sequencing technology, process the raw data and perform species annotation and functional annotation respectively to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtain metabolite abundance data of the metabolome by mass spectrometry. Step 2: preprocess the multi-omics data, including data transformation and normalization. Step 3: feed the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a disease probability value, the output of the algorithm framework being either diseased or not diseased. The invention can be used to integrate incomplete microbiome multi-omics data, predict disease and discover disease-related biomarkers.
Description
Technical Field
The invention relates to a disease prediction method and system based on a variational neural network, and belongs to the technical field of disease prediction.
Background
The human intestinal microbiota is a complex microbial ecosystem whose collective genome contains about 3 million genes, roughly 150 times more than the genome of the human host. The intestinal microbiota plays an important role in human metabolism by synthesizing enzymes that the human genome cannot encode; these enzymes promote the breakdown of polysaccharides and polyphenols, promote nutrient absorption, and provide protection against pathogens. A growing number of studies show that dysbiosis of the intestinal microbiota may be closely related to various diseases, especially those affecting the gastrointestinal system. To reveal the relationship between microorganisms and human health, various omics techniques such as metagenomics, metatranscriptomics and metabolomics have emerged. Each provides information on molecular mechanisms or biological processes at a specific omics level. Over the past few years, a growing number of studies have shown that combining omics data generally provides more complete information and a better understanding of the microecology, which can increase the accuracy of human disease prediction, increase the robustness of the analysis, and enable the discovery of important biomarkers. Notably, microbiome multi-omics data comprises various types of disparate data and is known for its heterogeneity, sparsity and high dimensionality. Given these properties, data processing requires specialized analysis methods to support deeper understanding and knowledge discovery. High-performance machine learning methods are currently receiving considerable attention in biology, and a large number of models have been developed to exploit the potential of multi-omics information.
Incomplete omics data is common in public databases. This can be attributed to various factors, such as limited funding, ethical considerations and privacy concerns, that affect sample availability, and it presents a significant challenge for integrated analysis. In this situation, dropping samples or mean imputation may be considered. However, the former greatly reduces the number of usable samples, while the latter may severely distort the true distribution of the data. Existing machine learning algorithms that predict disease from incomplete multi-omics data have two main problems: first, they cannot effectively extract relevant features from high-dimensional omics data while filtering out irrelevant ones; second, they struggle to integrate incomplete multi-omics data flexibly and cannot fully exploit the information in such data to achieve efficient prediction.
Disclosure of Invention
To solve the above problems, the invention provides a disease prediction method and system based on a variational neural network. The method uses incomplete intestinal multi-omics data for disease prediction: human intestinal fecal samples are collected; flora abundance data, pathway data and metabolite abundance data of the samples are obtained through sequencing and analysis technologies; the multi-omics data is preprocessed; and the omics data available for each sample is fed into a trained algorithm framework to obtain a disease probability value. The algorithm framework outputs one of two classes, diseased or not diseased.
In one aspect, the invention provides a disease prediction method based on a variational neural network, comprising the following steps:
Step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data using high-throughput sequencing technology, processing the raw data and performing species annotation and functional annotation respectively to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtaining metabolite abundance data of the metabolome by mass spectrometry;
Step 2: preprocessing the multi-omics data, including data transformation and normalization;
Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a disease probability value, the output of the algorithm framework being either diseased or not diseased.
In one embodiment of the present invention, the data transformation and normalization in step 2 comprises the following steps:
Step 2.1: transforming the data so that it can be reasonably analyzed by a neural network:
$$x = \log_2(2x + 0.00001)$$
where $x$ denotes the flora abundance data;
Step 2.2: after the flora abundance data has been transformed, performing the following normalization on each omics dataset:
where $x_{mean}$ is the mean of the omics dataset, $x_{max}$ is its maximum value, and $x_{min}$ is its minimum value.
In one embodiment of the present invention, obtaining the disease probability value from the multi-omics data with the algorithm framework in step 3 comprises the following steps:
Step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are arranged into three matrices, where the matrix $x_v \in R^{n \times d_v}$ of the $v$-th omics data consists of $d_v$ features of $n$ samples; each matrix first undergoes feature selection through the trained feature selection layer, calculated as follows:
$$u_v = x_v \cdot s_v$$
where $s_v \in R^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training, and feature selection is performed on each omics dataset to obtain $u_v \in R^{n \times F}$;
Step 3.2: the feature-selected omics data is passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics type, calculated as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v$$
where $f_{\theta_v}$ denotes the nonlinear transformation of a multi-layer neural network; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is complete; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset;
Step 3.3: a joint omics encoder simply integrates the incomplete multi-omics data, whichever omics types are missing, to obtain the joint latent representation $z$, calculated as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1)$$
where $V$ denotes the number of omics types available for the sample; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is complete; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint mean $\mu$ and variance $\sigma$, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data;
Step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the disease probability value, calculated as follows:
where $f_\psi$ denotes the nonlinear transformation of a multi-layer neural network, which yields the predicted label of whether the sample is diseased based on the latent representation $z$ of the joint omics data.
In one embodiment of the present invention, the training process of the algorithm framework comprises the following steps:
Step S1: collecting intestinal fecal samples from healthy people and from people diagnosed with the target disease that the model is to predict, and labeling them manually, with fecal samples from diseased individuals labeled 1 and fecal samples from non-diseased individuals labeled 0; obtaining the corresponding multi-omics data of the samples through sequencing and analysis technologies, or collecting public data to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples;
Step S2: performing data transformation and normalization on the multi-omics data;
Step S3: dividing the labeled dataset into a training set and a test set, performing supervised training of the algorithm framework on the training set, and testing on the test set.
In one embodiment of the present invention, one training iteration of the supervised training of the algorithm framework on the training set in step S3 comprises the following steps:
Step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are arranged into three matrices, where the matrix $x_v \in R^{n \times d_v}$ of the $v$-th omics data consists of $d_v$ features of $n$ samples; each matrix first undergoes a linear transformation through the feature selection layer:
$$T_e = T_0 \cdot (T_E / T_0)^{e/E}$$
$$u_v = x_v \cdot s_v$$
where $E$ denotes the total number of training iterations, $e$ denotes the current iteration, $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively, $\gamma_v$ denotes the parameters of the fully connected layer, $\varepsilon$ is randomly sampled from the uniform distribution on (0, 1), softmax() denotes the activation function, and $x_v$ is transformed to obtain $u_v \in R^{n \times F}$;
Step S3.2: the omics data transformed by the feature selection layer is passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics type, calculated as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v$$
where $f_{\theta_v}$ denotes the nonlinear transformation of a multi-layer neural network, $\theta_v$ denotes the parameters of that network, and $\epsilon$ is randomly sampled from the standard normal distribution; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset;
Step S3.3: a joint omics encoder simply integrates the incomplete multi-omics data, whichever omics types are missing, to obtain the joint latent representation $z$, calculated as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1)$$
where $V$ denotes the number of omics types available for the sample, $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, and $\epsilon$ is randomly sampled from the standard normal distribution; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint mean $\mu$ and variance $\sigma$, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data;
Step S3.4: the latent representation $z_v$ of each omics dataset and the joint latent representation $z$ are passed through an omics-specific predictor and the joint predictor respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability value $y_v$ and the final predicted probability value, calculated as follows:
where the omics-specific predictor function and $f_\psi$ denote the nonlinear transformations of multi-layer neural networks, and the omics-specific predictor parameters and $\psi$ are the parameters of those networks;
Step S3.5: the loss is calculated according to the loss function of the model and backpropagated to update the parameters of the model's neural networks, calculated as follows:
$$L_T = L_J + \alpha \sum_{v \in V} L_v$$
where $\alpha$ and $\beta$ are balance coefficients combining the different losses, $N(\mu_0, \sigma_0)$ denotes the prior distribution, $N(\mu_v, \sigma_v)$ denotes the distribution of the latent representation of the $v$-th omics type, $N(\mu, \sigma)$ denotes the distribution of the joint latent representation, KL() denotes the KL divergence, i.e., the relative entropy, between two distributions, $y$ is the true label in one-hot form, $y_v$ is the omics-specific predicted probability value, the final predicted probability value is the output of the joint predictor, $n$ is the number of samples, and $\lambda$ is the learning rate used to update the parameters of the neural networks in the algorithm framework during training. This completes one training iteration, in which the parameters of the neural networks are updated.
In another aspect, the invention also provides a disease prediction system based on a variational neural network, which applies the above disease prediction method based on a variational neural network, the system comprising:
a feature selection layer module for selecting important features by linearly transforming the omics data;
an encoder module for stochastically encoding the omics data into latent representations;
a joint encoder module for integrating the latent representations of the individual omics types; and
a predictor and joint predictor module for providing label inference, i.e., the disease prediction result, from the latent representations encoded by the encoder module.
In one embodiment of the present invention, the feature selection layer module is a Boolean matrix.
In one embodiment of the invention, the encoder module is formed by a fully connected neural network.
In one embodiment of the present invention, the joint encoder module is a calculation module.
In one embodiment of the invention, the predictor and joint predictor module is comprised of a fully connected neural network.
The disease prediction method and system based on a variational neural network provided by the invention have the following advantages. First, the invention provides a new framework based on a variational neural network that can integrate incomplete microbiome multi-omics data, predict disease and search for disease-related biomarkers. Second, the invention introduces a specific distribution that selects, within each omics type, the features most relevant to the target disease, thereby improving the interpretability of the model. Third, the algorithm uses the information bottleneck principle to construct the training loss function, which promotes learning of both the single-omics latent representations and the joint latent representation and gives the algorithm high prediction accuracy and robustness. Fourth, the invention places low requirements on the completeness of the multi-omics data and can flexibly use whatever omics data a sample has to predict disease.
Drawings
FIG. 1 is a flow chart of a disease prediction method based on a variational neural network of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
As shown in FIG. 1, the present invention provides a disease prediction method based on a variational neural network, and in some embodiments, the method comprises the following steps:
Step 1: collect intestinal fecal samples from the people to be tested. First extract DNA, RNA and metabolites from the samples, amplify the DNA and RNA into libraries suitable for high-throughput sequencing, and obtain raw sequencing data using high-throughput sequencing technology. After processing the raw data, perform species annotation and functional annotation respectively to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtain metabolite abundance data of the metabolome by mass spectrometry.
Because of sequencing cost, obtaining complete multi-omics data for a large number of samples requires considerable money and time, so samples with incomplete multi-omics data are common. The omics data that a sample does have is fed into the trained model, the missing omics types do not take part in the integration calculation, and the subsequent prediction accuracy can still be high.
Step 2: preprocess the multi-omics data, including data transformation and normalization.
Optionally, in some embodiments, the data transformation and normalization in step 2 comprises the following steps:
Step 2.1: for a neural network to analyze the data reasonably, the data must first be transformed. The invention considers the relative abundance of the flora. These values range from 0 up to relatively large values, and most lie between $10^{-4}$ and $10^{-1}$. Since low-abundance features may play an important role in health, the following logarithmic transformation is designed and applied:
$$x = \log_2(2x + 0.00001) \tag{1}$$
where $x$ denotes the flora abundance data; the constant 0.00001 is added to avoid numerical problems at zero.
Step 2.2: after the flora abundance data has been transformed, the following normalization is performed on each omics dataset:
where $x_{mean}$ is the mean of the omics dataset, $x_{max}$ is its maximum value, and $x_{min}$ is its minimum value.
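As an illustration, the preprocessing in steps 2.1 and 2.2 might be sketched in Python as follows. The logarithmic transformation follows equation (1). The normalization formula is not legible in this text, so the sketch assumes mean normalization, (x - x_mean) / (x_max - x_min), which is consistent with the three quantities defined above but remains an assumption. The function names and the toy abundance values are hypothetical.

```python
import math


def log_transform(x, pseudo=1e-5):
    """Equation (1): log2(2x + 0.00001) for one relative abundance value."""
    return math.log2(2.0 * x + pseudo)


def mean_normalize(values):
    """ASSUMED normalization form: (x - mean) / (max - min) over one omics dataset."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        # A constant dataset carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / span for v in values]


# Toy relative abundances, including an exact zero.
abundance = [0.0, 0.0001, 0.01, 0.1]
transformed = [log_transform(a) for a in abundance]
normalized = mean_normalize(transformed)
```

Under this assumed form the normalized values are centered on zero and span a range of exactly one, which keeps the three omics datasets on comparable scales.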
Step 3: feed the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain the disease probability value of the people to be tested; the algorithm framework outputs one of two classes, diseased or not diseased.
Optionally, in some embodiments, obtaining the disease probability value from the multi-omics data with the algorithm framework in step 3 comprises the following steps:
Step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are arranged into three matrices, where the matrix $x_v \in R^{n \times d_v}$ of the $v$-th omics data consists of $d_v$ features of $n$ samples. Each matrix first undergoes feature selection through the trained feature selection layer, calculated as follows:
$$u_v = x_v \cdot s_v \tag{3}$$
where $s_v \in R^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training, and feature selection is performed on each omics dataset to obtain $u_v \in R^{n \times F}$.
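To make the effect of a one-hot selection matrix concrete, the following minimal Python sketch applies equation (3) to a toy matrix. The values, the choice F = 2, and the helper name `matmul` are illustrative assumptions only.

```python
def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x F) matrix, both given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# x_v: n = 2 samples, d_v = 4 features (toy values).
x_v = [[0.1, 0.2, 0.3, 0.4],
       [0.5, 0.6, 0.7, 0.8]]

# s_v: d_v x F with F = 2; each column is one-hot, picking feature 2 and feature 0.
s_v = [[0, 1],
       [0, 0],
       [1, 0],
       [0, 0]]

u_v = matmul(x_v, s_v)  # shape n x F: the selected feature columns
```

Because each column of the trained matrix is close to one-hot, the product simply copies the chosen feature columns, so the selected features can be read off from the positions of the ones.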
Step 3.2: the feature-selected omics data is passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics type, calculated as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v \tag{5}$$
where $f_{\theta_v}$ denotes the nonlinear transformation of a multi-layer neural network; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is complete. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset.
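The reparameterization in equation (5) can be sketched in a few lines of Python. The function name and the `training` flag are hypothetical; the sketch multiplies the noise by $\sigma_v$ directly, exactly as the equation is written.

```python
import random


def reparameterize(mu, sigma, training, rng=random):
    """Equation (5): z = mu + eps * sigma, element by element.

    eps is drawn from N(0, 1) during training and fixed to 0 afterwards,
    so inference returns the mean deterministically.
    """
    eps = [rng.gauss(0.0, 1.0) if training else 0.0 for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]


mu_v = [0.5, -1.0]
sigma_v = [0.2, 0.3]
z_train = reparameterize(mu_v, sigma_v, training=True, rng=random.Random(0))
z_infer = reparameterize(mu_v, sigma_v, training=False)
```

Fixing the noise to 0 after training makes the prediction for a given sample reproducible.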
Step 3.3: a joint omics encoder simply integrates the incomplete multi-omics data, whichever omics types are missing, to obtain the joint latent representation $z$, calculated as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1) \tag{8}$$
where $V$ denotes the number of omics types available for the sample, set to 3 in the embodiment of the invention; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiment of the invention; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is complete. The means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint mean $\mu$ and variance $\sigma$, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data.
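The integration formulas for the joint mean and variance are not legible in this text. One common way to fuse Gaussian latent distributions while skipping missing modalities is a product of experts with the prior as an extra expert, and the Python sketch below assumes that form purely for illustration; it is not confirmed to be the patent's formula. The function name and the use of `None` for a missing omics type are hypothetical.

```python
def product_of_experts(experts, mu0=0.0, var0=1.0):
    """ASSUMED fusion rule: precision-weighted product of Gaussians.

    experts: one (mu_v, var_v) pair per omics type, or None when that
    omics type is missing for the sample. The prior N(mu0, var0) always
    takes part, so the result is defined even if every omics type is missing.
    """
    precision = 1.0 / var0
    weighted_mean = mu0 / var0
    for expert in experts:
        if expert is None:
            continue  # a missing omics type does not take part in the integration
        mu_v, var_v = expert
        precision += 1.0 / var_v
        weighted_mean += mu_v / var_v
    var = 1.0 / precision
    return weighted_mean * var, var


# One latent dimension; the metatranscriptome and metabolome are missing.
mu, var = product_of_experts([(2.0, 1.0), None, None])
```

Whatever the exact rule, the key property described above is the same: only the omics types a sample actually has contribute to the joint mean and variance.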
Step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the disease probability value, calculated as follows:
where $f_\psi$ denotes the nonlinear transformation of a multi-layer neural network, which yields the predicted label of whether the people to be tested are diseased based on the latent representation $z$ of the joint omics data.
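A joint predictor of the kind described, fully connected layers followed by activation functions, might look like the Python sketch below. The layer sizes, the ReLU hidden activation, the softmax output over the two classes, and all weight values are assumptions chosen for illustration; the source does not specify them.

```python
import math


def dense(v, weights, bias):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def joint_predictor(z, w1, b1, w2, b2):
    """Hypothetical f_psi: dense -> ReLU -> dense -> softmax over two classes."""
    return softmax(dense(relu(dense(z, w1, b1)), w2, b2))


# Toy weights for a 2-dimensional latent z and 3 hidden units.
w1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]
b1 = [0.0, 0.1, 0.0]
w2 = [[0.2, -0.1, 0.3], [-0.2, 0.4, 0.1]]
b2 = [0.0, 0.0]

probs = joint_predictor([1.0, -0.5], w1, b1, w2, b2)  # [P(not diseased), P(diseased)]
predicted_label = probs.index(max(probs))
```

The class ordering of the two outputs is also an assumption; it only needs to match the one-hot label encoding used during training.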
Optionally, in some embodiments, the training process of the algorithm framework of the disease prediction method based on a variational neural network comprises the following steps:
Step S1: collect intestinal fecal samples from healthy people and from people diagnosed with the target disease that the model is to predict, and label them manually, with fecal samples from diseased individuals labeled 1 and fecal samples from non-diseased individuals labeled 0. Obtain the corresponding multi-omics data of the samples through sequencing and analysis technologies, or collect public data from the internet to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples.
Step S2: perform data transformation and normalization on the multi-omics data.
Step S3: divide the labeled dataset into a training set and a test set, perform supervised training of the algorithm framework on the training set, and test on the test set.
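The split in step S3 can be illustrated with a short Python sketch. The 80/20 ratio, the fixed seed, and the function and sample names are assumptions; the source does not state how the dataset is divided.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle labeled samples and split them; the 0.2 test fraction is an assumption."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


samples = [f"sample_{i}" for i in range(10)]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # 1 = diseased, 0 = not diseased
(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels)
```

Each sample keeps its label through the shuffle, and the test set is held out from training so that it measures generalization.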
For further explanation, one training iteration of the supervised training of the algorithm framework on the training set in step S3 is described as follows:
Step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are arranged into three matrices, where the matrix $x_v \in R^{n \times d_v}$ of the $v$-th omics data consists of $d_v$ features of $n$ samples. Each matrix first undergoes a linear transformation through the feature selection layer:
$$T_e = T_0 \cdot (T_E / T_0)^{e/E} \tag{10}$$
$$u_v = x_v \cdot s_v \tag{12}$$
where $E$ denotes the total number of training iterations, set to 2000 in the embodiment of the invention; $e$ denotes the current iteration; $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively; $\gamma_v$ denotes the parameters of the fully connected layer; $\varepsilon$ is randomly sampled from the uniform distribution on (0, 1); softmax() denotes the activation function; and $x_v$ is transformed to obtain $u_v \in R^{n \times F}$.
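The temperature schedule of equation (10) can be written directly in Python with the embodiment's values (T_0 = 10, T_E = 0.1, E = 2000). The formula that builds $s_v$ from $\gamma_v$, $\varepsilon$ and softmax() is not legible in this text, so the second function assumes a Gumbel-softmax style relaxation, which matches the ingredients listed above but is an assumption rather than the confirmed formula. Function names are hypothetical.

```python
import math
import random


def temperature(e, total=2000, t0=10.0, t_end=0.1):
    """Equation (10): exponential annealing from T_0 down to T_E over E iterations."""
    return t0 * (t_end / t0) ** (e / total)


def relaxed_selection_column(gamma, temp, rng):
    """ASSUMED form of one column of s_v: softmax((gamma + Gumbel noise) / T_e)."""
    noisy = []
    for g in gamma:
        eps = max(rng.random(), 1e-12)       # epsilon ~ Uniform(0, 1), guarded against 0
        gumbel = -math.log(-math.log(eps))   # Gumbel(0, 1) noise built from epsilon
        noisy.append((g + gumbel) / temp)
    m = max(noisy)                           # subtract the maximum for stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [v / total for v in exps]


rng = random.Random(0)
early = relaxed_selection_column([0.0, 2.0, 0.0], temperature(0), rng)     # soft weights
late = relaxed_selection_column([0.0, 2.0, 0.0], temperature(2000), rng)   # sharper weights
```

A high temperature early in training spreads weight across many features, while the low final temperature pushes each column toward the one-hot form that the trained selection matrix is said to approach.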
Step S3.2: the omics data transformed by the feature selection layer is passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics type, calculated as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v \tag{14}$$
where $f_{\theta_v}$ denotes the nonlinear transformation of a multi-layer neural network, $\theta_v$ denotes the parameters of that network, and $\epsilon$ is randomly sampled from the standard normal distribution. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset.
Step S3.3: a joint omics encoder simply integrates the incomplete multi-omics data, whichever omics types are missing, to obtain the joint latent representation $z$, calculated as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1) \tag{17}$$
where $V$ denotes the number of omics types available for the sample, set to 3 in the embodiment of the invention; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiment of the invention; and $\epsilon$ is randomly sampled from the standard normal distribution. The means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint mean $\mu$ and variance $\sigma$, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data.
Step S3.4: the latent representation $z_v$ of each omics dataset and the joint latent representation $z$ are passed through an omics-specific predictor and the joint predictor respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability value $y_v$ and the final predicted probability value, calculated as follows:
where the omics-specific predictor function and $f_\psi$ denote the nonlinear transformations of multi-layer neural networks, and the omics-specific predictor parameters and $\psi$ are the parameters of those networks.
Step S3.5: and calculating loss according to a loss function of the model, and carrying out gradient feedback to update parameters of the model neural network, wherein the calculation process is as follows:
L T =L J +α∑ v∈V L v (22)
where α and β are balance coefficients combining different losses, which in the present embodiment are set to 1 and 0.001, respectively.
$N(\mu_0, \sigma_0)$ denotes the prior distribution, $N(\mu_v, \sigma_v)$ denotes the distribution of the latent representation of the $v$-th omics, $N(\mu, \sigma)$ denotes the distribution of the joint latent representation, and $KL(\cdot)$ denotes the KL divergence, i.e., the relative entropy, between two distributions. $y$ is the true label in one-hot form, $y_v$ is the omics-specific prediction probability, the final prediction probability is the output of the joint predictor, and $n$ is the number of samples. $\lambda$ is the learning rate used when training the algorithm framework, set to 0.01 in the embodiment of the present invention, and the quantities being updated are the parameters of the neural networks in the algorithm framework. This completes one training iteration of the model, in which the parameters of the neural networks are updated.
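The combination of losses in equation (22) may be sketched as follows, for illustration only. The expressions for $L_J$ and $L_v$ are not reproduced in the text above, so the sketch assumes that each of them is a cross-entropy term plus β times a KL divergence to the prior, and that σ is a standard deviation. The function names and the dictionary layout are hypothetical.

```python
import numpy as np


def kl_to_prior(mu, sigma, mu0=0.0, sigma0=1.0):
    """KL(N(mu, sigma) || N(mu0, sigma0)), summed over latent
    dimensions and averaged over the n samples (closed form)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    kl = (np.log(sigma0 / sigma)
          + (sigma ** 2 + (mu - mu0) ** 2) / (2.0 * sigma0 ** 2)
          - 0.5)
    return kl.sum(axis=1).mean()


def cross_entropy(y_onehot, probs, tiny=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    return -(y_onehot * np.log(probs + tiny)).sum(axis=1).mean()


def total_loss(y, joint, per_omics, alpha=1.0, beta=0.001):
    """L_T = L_J + alpha * sum_v L_v, with each branch assumed to be
    cross-entropy + beta * KL (an assumption of this sketch)."""
    def branch(b):
        return cross_entropy(y, b["probs"]) + beta * kl_to_prior(b["mu"], b["sigma"])

    return branch(joint) + alpha * sum(branch(b) for b in per_omics)


# Two samples, two classes, three latent dimensions, three omics.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
branch_out = {"probs": probs, "mu": np.zeros((2, 3)), "sigma": np.ones((2, 3))}
loss = total_loss(y, branch_out, [branch_out] * 3)
```

The defaults α = 1 and β = 0.001 follow the values stated for the present embodiment.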
The invention also provides a disease prediction system based on a variational neural network, to which the above disease prediction method based on a variational neural network is applied. In some embodiments, the system comprises:
a feature selection layer module for selecting important features by linearly transforming the omics data;
an encoder module for stochastically encoding the omics data into a latent representation;
a joint encoder module for integrating the latent representations of the individual omics; and
a predictor and joint predictor module for providing label inferences, i.e., disease prediction results, from the latent representations encoded by the encoder module.
Optionally, in some implementations, the feature selection layer module is a boolean matrix.
Optionally, in some embodiments, the encoder module is comprised of a fully connected neural network.
Optionally, in some embodiments, the joint encoder module is a calculation module.
Optionally, in some embodiments, the predictor and joint predictor module is comprised of a fully connected neural network.
The method and system process the individual omics separately and then integrate them, making full use of the information within each omics while also mining the interaction information among the omics. They are suitable for prediction on samples with incomplete multi-omics data, reduce the requirement on the completeness of a sample's multi-omics data, and provide a technical means for disease prediction from gut multi-omics data.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and similar elements thereof may be made without departing from the spirit and principles of the present invention.
Claims (10)
1. A disease prediction method based on a variational neural network, characterized by comprising the following steps:
step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data using a high-throughput sequencing technology, processing the raw data and then performing species annotation and functional annotation, respectively, to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtaining metabolite abundance data of the metabolome by mass spectrometry;
step 2: preprocessing the multi-omics data, including data transformation and normalization;
step 3: inputting the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability of disease, wherein the output of the algorithm framework is whether or not the sample is diseased.
2. The disease prediction method based on the variational neural network according to claim 1, wherein the data transformation and normalization in step 2 comprise the following steps:
step 2.1: applying the following transformation to the data so that it can be reasonably analysed by a neural network:
$$x = \log_2(2x + 0.00001)$$
wherein $x$ represents the flora abundance data;
step 2.2: after the flora abundance data has been transformed, applying the following normalization to each omics data set:
wherein $x_{mean}$ is the mean of the omics data set, $x_{max}$ is the maximum value in the data set, and $x_{min}$ is the minimum value in the data set.
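For illustration, the preprocessing of steps 2.1 and 2.2 may be sketched as follows. The log transformation follows the formula given in step 2.1. The normalization formula is not reproduced in the text, so the sketch assumes the common mean normalization $(x - x_{mean}) / (x_{max} - x_{min})$ computed over the whole omics matrix; this form, the function names and the example values are assumptions.

```python
import numpy as np


def log_transform(x, pseudo=0.00001):
    """Step 2.1: x = log2(2x + 0.00001), applied to flora abundance data."""
    return np.log2(2.0 * np.asarray(x, dtype=float) + pseudo)


def mean_normalize(x):
    """Step 2.2 (assumed form): (x - mean) / (max - min) over the matrix."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0.0:  # constant data: avoid division by zero
        return np.zeros_like(x)
    return (x - x.mean()) / span


# Hypothetical relative abundances: 2 samples x 3 taxa.
abundance = np.array([[0.50, 0.25, 0.25],
                      [0.10, 0.00, 0.90]])
transformed = log_transform(abundance)
normalized = mean_normalize(transformed)
```

The small constant inside the logarithm keeps zero abundances finite, which matters for sparse microbiome tables.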
3. The disease prediction method based on the variational neural network according to claim 2, wherein the algorithm framework in step 3 obtains the probability of disease from the multi-omics data through the following steps:
step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are organized into three matrices, wherein the matrix $x_v$ representing the $v$-th omics data consists of $d_v$ features for each of $n$ samples; each matrix first undergoes feature selection through a trained feature selection layer, the calculation process being as follows:
$$u_v = x_v \cdot s_v$$
wherein $s_v$ is a linear transformation matrix that approaches one-hot form after training, and feature selection is performed on each omics data set to obtain $u_v \in R^{n \times F}$;
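As a small worked example of step 3.1, the product $u_v = x_v \cdot s_v$ with a one-hot selection matrix simply picks out $F$ columns of $x_v$. The matrix values and the selected indices below are hypothetical and serve only to show the mechanics.

```python
import numpy as np

# n = 2 samples, d_v = 4 features (hypothetical values).
x_v = np.array([[0.1, 0.2, 0.3, 0.4],
                [1.0, 2.0, 3.0, 4.0]])

# Hypothetical outcome of training: F = 2 selected feature indices.
selected = [2, 0]

# One-hot selection matrix s_v of shape (d_v, F): column j has a single 1
# in the row of the j-th selected feature.
s_v = np.zeros((x_v.shape[1], len(selected)))
s_v[selected, np.arange(len(selected))] = 1.0

u_v = x_v @ s_v  # shape (n, F); equal to x_v[:, selected]
```

Because each column of $s_v$ is one-hot, the selected features remain directly interpretable as individual taxa, pathways or metabolites.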
step 3.2: the feature-selected omics data are passed through a trained encoder consisting of a fully connected layer and an activation function to obtain the latent representation of each omics, the calculation process being as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v$$
wherein the encoder mapping denotes the nonlinear transformation of a multi-layer neural network; $\epsilon$ is randomly sampled from a standard normal distribution while the network is being trained and is fixed to 0 after model training is complete; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set;
step 3.3: a joint omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and obtains a joint latent representation $z$, the calculation process being as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1)$$
wherein $V$ denotes the number of omics available for the sample, and $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution; $\epsilon$ is randomly sampled from a standard normal distribution while the network is being trained and is fixed to 0 after model training is complete; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint omics mean $\mu$ and variance $\sigma$, and the latent representation $z$ of the joint omics data is obtained using the reparameterization technique;
step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of a fully connected layer and an activation function to obtain the probability of disease, the calculation process being as follows:
wherein $f_\psi$ denotes the nonlinear transformation of a multi-layer neural network, which derives a predicted label of whether or not the sample is diseased from the latent representation $z$ of the joint omics data.
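A minimal sketch of the joint predictor of step 3.4 is given below, assuming a single fully connected layer followed by a softmax activation over two classes. The weights `W` and bias `b` are hypothetical stand-ins for trained parameters, and the choice of softmax is an assumption, since the claim only states that an activation function is used.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def joint_predict(z, W, b):
    """Fully connected layer + softmax; returns class probabilities and labels."""
    probs = softmax(z @ W + b)
    return probs, probs.argmax(axis=1)


# After training eps is fixed to 0, so z equals the joint mean mu.
z = np.array([[1.0, 0.0],
              [0.0, 1.0]])          # n = 2 samples, 2 latent dimensions
W = np.array([[0.0, 5.0],
              [5.0, 0.0]])          # hypothetical trained weights
b = np.zeros(2)                     # hypothetical trained bias
probs, labels = joint_predict(z, W, b)  # label 1 = diseased, 0 = not diseased
```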
4. A disease prediction method based on a variational neural network as claimed in claim 3, wherein the training process of the algorithm framework comprises the steps of:
step S1: collecting fecal samples from healthy individuals and from individuals diagnosed with the target disease that the model is to predict, and labeling them manually, with fecal samples from diseased individuals labeled 1 and fecal samples from non-diseased individuals labeled 0; obtaining the corresponding multi-omics data of the samples through sequencing and analysis techniques, or collecting public data to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples;
step S2: performing data transformation and normalization on the multi-omics data;
step S3: dividing the labeled data set into a training set and a test set, performing supervised training of the algorithm framework on the training set data, and testing on the test set.
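The division into a training set and a test set in step S3 may, for example, be sketched as a label-stratified random split. The split ratio, the seed and the function name are assumptions; the claim does not specify how the data set is divided.

```python
import numpy as np


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the sample indices of this class, then cut off the test part.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return (np.sort(np.array(train_idx, dtype=int)),
            np.sort(np.array(test_idx, dtype=int)))


# 10 non-diseased (0) and 10 diseased (1) fecal samples.
labels = np.array([0] * 10 + [1] * 10)
train_idx, test_idx = stratified_split(labels)
```

The same index arrays can then be applied to each of the three omics matrices so that every sample stays aligned across omics.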
5. The disease prediction method based on the variational neural network according to claim 4, wherein one training iteration of the supervised training of the algorithm framework on the training set data in step S3 comprises the following steps:
step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are organized into three matrices, wherein the matrix $x_v$ representing the $v$-th omics data consists of $d_v$ features for each of $n$ samples; each matrix is first linearly transformed through a feature selection layer:
$$T_e = T_0 \cdot (T_E / T_0)^{e/E}$$
$$u_v = x_v \cdot s_v$$
wherein $E$ denotes the total number of training iterations, $e$ denotes the current iteration, $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1, respectively, $\gamma_v$ are the parameters of the fully connected layer, $\epsilon$ is randomly sampled from the uniform distribution on $(0, 1)$, and softmax() denotes an activation function; $x_v$ is thereby transformed to obtain $u_v \in R^{n \times F}$;
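The temperature schedule $T_e = T_0 \cdot (T_E / T_0)^{e/E}$ can be computed directly, as sketched below. The expression for $s_v$ is not reproduced in the text, so `relaxed_selection` assumes a Gumbel-softmax style relaxation built from the stated ingredients (parameters $\gamma_v$, uniform noise, softmax and temperature); that construction and all function names are assumptions of this sketch.

```python
import numpy as np


def temperature(e, E, T0=10.0, TE=0.1):
    """Exponential annealing from T0 at e = 0 to TE at e = E."""
    return T0 * (TE / T0) ** (e / E)


def relaxed_selection(gamma, T, rng):
    """Assumed relaxation: column-wise softmax((gamma + Gumbel noise) / T).

    gamma has shape (d_v, F); each column of the result is a distribution
    over the d_v input features and sharpens towards one-hot as T decreases.
    """
    u = rng.uniform(1e-10, 1.0, size=gamma.shape)   # uniform noise in (0, 1)
    gumbel = -np.log(-np.log(u))
    logits = (gamma + gumbel) / T
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
gamma_v = rng.standard_normal((4, 2))               # d_v = 4, F = 2
x_v = rng.standard_normal((3, 4))                   # n = 3 samples
s_v = relaxed_selection(gamma_v, temperature(50, 100), rng)
u_v = x_v @ s_v                                     # shape (n, F)
```

With the stated values $T_0 = 10$ and $T_E = 0.1$, the temperature passes through 1 at the midpoint of training, so the selection is nearly uniform early on and close to one-hot at the end.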
step S3.2: the omics data transformed by the feature selection layer are passed through an encoder consisting of a fully connected layer and an activation function to obtain the latent representation of each omics, the calculation process being as follows:
$$z_v = \mu_v + \epsilon \cdot \sigma_v$$
wherein the encoder mapping denotes the nonlinear transformation of a multi-layer neural network with parameters $\theta_v$, and $\epsilon$ is randomly sampled from a standard normal distribution; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set;
step S3.3: a joint omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and obtains a joint latent representation $z$, the calculation process being as follows:
$$z = \mu + \epsilon \cdot \sigma, \quad \epsilon \sim N(0, 1)$$
wherein $V$ denotes the number of omics available for the sample, $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, and $\epsilon$ is randomly sampled from a standard normal distribution; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the joint omics mean $\mu$ and variance $\sigma$, and the latent representation $z$ of the joint omics data is obtained using the reparameterization technique;
step S3.4: the latent representation $z_v$ of each omics data set and the joint latent representation $z$ are passed through a predictor and a joint predictor, respectively, each consisting of a fully connected layer and an activation function, to obtain the omics-specific prediction probability $y_v$ and the final prediction probability, the calculation process being as follows:
wherein the omics-specific predictor mapping and $f_\psi$ each denote the nonlinear transformation of a multi-layer neural network, and the omics-specific predictor parameters and $\psi$ are the parameters of the respective networks;
step S3.5: the loss is calculated according to the loss function of the model, and the gradient is backpropagated to update the parameters of the model's neural networks, the calculation process being as follows:
$$L_T = L_J + \alpha \sum_{v \in V} L_v$$
wherein $\alpha$ and $\beta$ are balance coefficients combining the different losses, $N(\mu_0, \sigma_0)$ denotes the prior distribution, $N(\mu_v, \sigma_v)$ denotes the distribution of the latent representation of the $v$-th omics, $N(\mu, \sigma)$ denotes the distribution of the joint latent representation, $KL(\cdot)$ denotes the KL divergence, i.e., the relative entropy, between two distributions, $y$ is the true label in one-hot form, $y_v$ is the omics-specific prediction probability, the final prediction probability is the output of the joint predictor, $n$ is the number of samples, $\lambda$ is the learning rate used when training the algorithm framework, and the quantities being updated are the parameters of the neural networks in the algorithm framework; this completes one training iteration of the model, in which the parameters of the neural networks are updated.
6. A disease prediction system based on a variational neural network, wherein the disease prediction method based on a variational neural network according to any one of claims 1 to 5 is applied, said system comprising:
a feature selection layer module for selecting important features by linearly transforming the omics data;
an encoder module for stochastically encoding the omics data into a latent representation;
a joint encoder module for integrating the latent representations of the individual omics; and
a predictor and joint predictor module for providing label inferences, i.e., disease prediction results, from the latent representations encoded by the encoder module.
7. The disease prediction system based on a variational neural network of claim 6 wherein said feature selection layer module is a boolean matrix.
8. The disease prediction system based on a variational neural network of claim 6 wherein said encoder module is comprised of a fully connected neural network.
9. The disease prediction system based on a variational neural network of claim 6, wherein said joint encoder module is a calculation module.
10. The disease prediction system based on a variational neural network of claim 6 wherein said predictor and joint predictor module is comprised of a fully connected neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311028109.1A CN117198397A (en) | 2023-08-15 | 2023-08-15 | Disease prediction method and system based on variational neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117198397A true CN117198397A (en) | 2023-12-08 |
Family
ID=89002565
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311028109.1A Pending CN117198397A (en) | 2023-08-15 | 2023-08-15 | Disease prediction method and system based on variational neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117198397A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118352007A (en) * | 2024-04-30 | 2024-07-16 | 中国人民解放军总医院第一医学中心 | Disease data analysis method and system based on crowd queue multiunit study data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||