CN117198397A - Disease prediction method and system based on variational neural network - Google Patents

Publication number
CN117198397A
Authority
CN
China
Prior art keywords
data
neural network
omics
joint
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311028109.1A
Other languages
Chinese (zh)
Inventor
朱金林
胡明逸
陆文伟
王鸿超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University
Priority to CN202311028109.1A
Publication of CN117198397A

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a disease prediction method and system based on a variational neural network. The method comprises the following steps. Step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data using high-throughput sequencing technology, processing the raw data and then performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, respectively, and obtaining metabolite abundance data of the metabolome by mass spectrometry. Step 2: preprocessing the multi-omics data, including data transformation and normalization. Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability of disease, wherein the output of the algorithm framework is one of two classes: diseased or not diseased. The invention can be used to integrate incomplete microbiome multi-omics data, predict disease, and find disease-related biomarkers.

Description

Disease prediction method and system based on variational neural network
Technical Field
The invention relates to a disease prediction method and system based on a variational neural network, and belongs to the technical field of disease prediction.
Background
The human gut microbiota is a complex microbial ecosystem whose collective genome contains about 3 million genes, roughly 150 times as many as the genome of the human host. The gut microbiota plays an important role in human metabolism by synthesizing enzymes that the human genome cannot encode; these enzymes promote the breakdown of polysaccharides and polyphenols, promote nutrient absorption, and provide protection against pathogens. A growing number of studies show that dysbiosis of the gut microbiota may be closely related to various diseases, especially those affecting the gastrointestinal system. To reveal the relationship between microorganisms and human health, various omics techniques have emerged, such as metagenomics, metatranscriptomics and metabolomics. Each provides information on molecular mechanisms or biological processes at a specific omics level. Over the past few years, more and more studies have shown that combining omics data generally provides more complete information and a better understanding of the microbial ecology, which can increase the accuracy of human disease prediction, make the analysis more robust, and enable the discovery of important biomarkers. Notably, microbiome multi-omics data comprises several disparate data types and is known for its heterogeneity, sparsity and high dimensionality. Given these properties, processing the data requires specialized analysis methods to support deeper understanding and knowledge discovery. High-performance machine learning methods are currently receiving considerable attention in biology, and a large number of models have been developed to exploit the potential of multi-omics information.
Incomplete omics data is common in public databases. It can be attributed to factors such as limited funding, ethical considerations and privacy concerns, all of which affect sample availability. This poses a significant challenge for integrated analysis. Discarding incomplete samples or mean imputation may be considered in this case; however, the former greatly reduces the number of available samples, while the latter may severely distort the true distribution of the data. Existing machine learning algorithms that predict disease from incomplete multi-omics data have two main problems: first, they cannot effectively extract relevant features from high-dimensional omics data and filter out irrelevant ones; second, they struggle to integrate incomplete multi-omics data flexibly and, at the same time, cannot fully exploit the information in such data to achieve efficient prediction.
Disclosure of Invention
To solve the above problems, the invention provides a disease prediction method and system based on a variational neural network. The method uses incomplete gut multi-omics data for disease prediction: human fecal samples are collected; flora abundance data, pathway data and metabolite abundance data of the samples are obtained by sequencing and analysis techniques; the multi-omics data are preprocessed; and the omics data available for each sample are fed into a trained algorithm framework to obtain a probability of disease. The framework outputs one of two classes, diseased or not diseased.
In one aspect, the invention provides a disease prediction method based on a variational neural network, comprising the following steps:
Step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data using high-throughput sequencing technology, processing the raw data and then performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, respectively, and obtaining metabolite abundance data of the metabolome by mass spectrometry;
Step 2: preprocessing the multi-omics data, including data transformation and normalization;
Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability of disease, wherein the output of the algorithm framework is one of two classes: diseased or not diseased.
In one embodiment of the present invention, the data transformation and normalization in step 2 include the following steps:
Step 2.1: transforming the data so that it can be analyzed sensibly by a neural network:
$x = \log_2(2x + 0.00001)$
where x represents flora abundance data;
Step 2.2: once the flora abundance data has been transformed, the following normalization is performed on each omics dataset:
where $x_{mean}$ is the mean of the dataset, $x_{max}$ is its maximum, and $x_{min}$ is its minimum.
In one embodiment of the present invention, obtaining the probability of disease from the multi-omics data with the algorithm framework in step 3 includes the following steps:
Step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are arranged into three matrices. Let the matrix $x_v \in \mathbb{R}^{n \times d_v}$ represent the v-th omics data, consisting of $d_v$ features for n samples. Each matrix first undergoes feature selection through the trained feature selection layer, calculated as follows:
$u_v = x_v \cdot s_v$
where $s_v \in \mathbb{R}^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training; it performs feature selection on each omics dataset to give $u_v \in \mathbb{R}^{n \times F}$;
Step 3.2: the feature-selected omics data is passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$\mu_v + \epsilon \cdot \sigma_v = z_v$
where $f_{\theta_v}$ represents the nonlinear transformation of a multi-layer neural network; ε is randomly sampled from the standard normal distribution while the network is being trained and fixed to 0 once model training is complete; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization trick is then used to obtain the latent representation $z_v$ of each omics dataset;
Step 3.3: a joint-omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation z, calculated as follows:
$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$
where V denotes the number of omics available for the sample; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution; ε is randomly sampled from the standard normal distribution while the network is being trained and fixed to 0 once model training is complete; the means $\mu_v$ and variances $\sigma_v$ of the available omics are integrated to give the joint-omics mean μ and variance σ, and the reparameterization trick is used to obtain the latent representation z of the joint omics data;
Step 3.4: the latent representation z of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the probability of disease, calculated as follows:
where $f_\psi$ represents the nonlinear transformation of a multi-layer neural network, which yields the predicted label of whether the sample is diseased, based on the latent representation z of the joint omics data.
In one embodiment of the present invention, the training process of the algorithm framework includes the steps of:
Step S1: collecting intestinal fecal samples from healthy people and from people with a confirmed diagnosis of the target disease that the model is to predict; labeling the samples manually, marking fecal samples from diseased subjects as 1 and fecal samples from non-diseased subjects as 0; and obtaining the corresponding multi-omics data of the samples by sequencing and analysis techniques, or collecting public data to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples;
Step S2: performing data transformation and normalization on the multi-omics data;
step S3: dividing the marked data set into a training set and a testing set, performing supervised training on the algorithm framework by using the training set data, and testing on the testing set.
In one embodiment of the present invention, one training iteration of the supervised training of the algorithm framework on the training set data in step S3 includes the following steps:
Step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are arranged into three matrices. Let the matrix $x_v \in \mathbb{R}^{n \times d_v}$ represent the v-th omics data, consisting of $d_v$ features for n samples. Each matrix is first linearly transformed by the feature selection layer:
$T_e = T_0 \cdot (T_E / T_0)^{e/E}$
$u_v = x_v \cdot s_v$
where E is the total number of training iterations; e is the current iteration; $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively; $\gamma_v$ is the parameter of the fully connected layer; ε is randomly sampled from the uniform distribution on (0, 1); softmax() denotes the activation function; and $x_v$ is transformed to obtain $u_v \in \mathbb{R}^{n \times F}$;
Step S3.2: the omics data transformed by the feature selection layer is passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$\mu_v + \epsilon \cdot \sigma_v = z_v$
where $f_{\theta_v}$ represents the nonlinear transformation of a multi-layer neural network, $\theta_v$ is the parameter of the neural network, and ε is randomly sampled from the standard normal distribution; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization trick is then used to obtain the latent representation $z_v$ of each omics dataset;
Step S3.3: a joint-omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation z, calculated as follows:
$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$
where V denotes the number of omics available for the sample; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution; and ε is randomly sampled from the standard normal distribution; the means $\mu_v$ and variances $\sigma_v$ of the available omics are integrated to give the joint-omics mean μ and variance σ, and the reparameterization trick is used to obtain the latent representation z of the joint omics data;
Step S3.4: the latent representation $z_v$ of each omics dataset and the joint latent representation z are passed through a predictor and a joint predictor, respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability $y_v$ and the final predicted probability, calculated as follows:
where the omics-specific predictor and the joint predictor $f_\psi$ each represent the nonlinear transformation of a multi-layer neural network, parameterized by the omics-specific predictor parameters and by ψ, respectively;
Step S3.5: the loss is calculated according to the loss function of the model, and backpropagation is performed to update the parameters of the model's neural networks, calculated as follows:
$L_T = L_J + \alpha \sum_{v \in V} L_v$
where α and β are balance coefficients that combine the different losses; $N(\mu_0, \sigma_0)$ denotes the prior distribution; $N(\mu_v, \sigma_v)$ denotes the distribution of the latent representation of the v-th omics; $N(\mu, \sigma)$ denotes the distribution of the joint latent representation; KL() denotes the KL divergence, i.e., the relative entropy, between two distributions; y is the true label in one-hot form; $y_v$ is the omics-specific predicted probability value and the output of the joint predictor is the final predicted probability value; n is the number of samples; λ is the learning rate used when training the algorithm framework; and the updated quantities are the parameters of the neural networks in the algorithm framework. This completes one training iteration, in which the parameters of the neural networks are updated.
In another aspect, the invention also provides a disease prediction system based on a variational neural network, which applies the disease prediction method based on a variational neural network described above. The system comprises:
a feature selection layer module for selecting important features by linearly transforming the omics data;
an encoder module for stochastically encoding the omics data into latent representations;
a joint encoder module for integrating the latent representations of the individual omics; and
a predictor and joint predictor module for inferring labels, i.e., disease prediction results, from the latent representations encoded by the encoder module.
In one embodiment of the present invention, the feature selection layer module is a boolean matrix.
In one embodiment of the invention, the encoder module is formed by a fully connected neural network.
In one embodiment of the present invention, the joint encoder module is a calculation module.
In one embodiment of the invention, the predictor and joint predictor module is comprised of a fully connected neural network.
The disease prediction method and system based on a variational neural network provided by the invention have the following advantages. First, the invention provides a new framework based on a variational neural network that can be used to integrate incomplete microbiome multi-omics data, predict disease, and search for disease-related biomarkers. Second, the invention introduces the Concrete distribution, which makes it possible to select, in each omics, the features most relevant to the target disease, thereby improving the interpretability of the model. Third, the algorithm uses the information bottleneck principle to construct the training loss function, which promotes learning of single-omics and joint-omics latent representations and gives the algorithm high prediction accuracy and robustness. Fourth, the invention places low requirements on the completeness of the multi-omics data and can flexibly use whichever omics data a sample has to predict disease.
Drawings
FIG. 1 is a flow chart of a disease prediction method based on a variational neural network of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides a disease prediction method based on a variational neural network, and in some embodiments, the method includes the following steps:
Step 1: collecting intestinal fecal samples from the population to be tested; extracting DNA, RNA and metabolites from the samples; amplifying the DNA and RNA into libraries suitable for high-throughput sequencing; obtaining raw sequencing data using high-throughput sequencing technology; processing the raw data and then performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, respectively; and obtaining metabolite abundance data of the metabolome by mass spectrometry.
Considering sequencing cost, obtaining complete multi-omics data for a large number of samples requires considerable cost and time, so samples with incomplete multi-omics data arise easily. In that case, the omics data that a sample does have is fed into the trained model, the missing omics data does not participate in the integration calculation, and the subsequent prediction accuracy can be higher.
Step 2: preprocessing the multi-omics data, including data transformation and normalization.
Optionally, in some embodiments, the data transformation and normalization in step 2 include the following steps:
Step 2.1: for a neural network to analyze the data sensibly, the data first needs to be transformed. The invention considers the relative abundance of the flora. These values range from 0 up to relatively large values, with most magnitudes between $10^{-4}$ and $10^{-1}$. Considering that low-abundance features may play an important role in health, the following logarithmic transformation is designed and applied:
$x = \log_2(2x + 0.00001)$ (1)
where x represents flora abundance data; 0.00001 is added to avoid numerical problems at zero.
Step 2.2: once the flora abundance data has been transformed, the following normalization is performed on each omics dataset:
where $x_{mean}$ is the mean of the dataset, $x_{max}$ is its maximum, and $x_{min}$ is its minimum.
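As a minimal illustration of steps 2.1 and 2.2, the following Python sketch applies the logarithmic transformation of equation (1) and then a normalization. The normalization formula itself is not reproduced in the text above, so the sketch assumes the mean-normalization form (x - x_mean) / (x_max - x_min), which is only suggested by the three quantities defined for it. The function names and sample values are hypothetical.

```python
import math


def log_transform(x, eps=1e-5):
    """Equation (1): log2(2x + eps); eps keeps zero abundances finite."""
    return math.log2(2.0 * x + eps)


def mean_normalize(values):
    """ASSUMED normalization: (x - mean) / (max - min) over one dataset."""
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        # A constant dataset carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / span for v in values]


# Made-up relative abundances, including an exact zero.
abundances = [0.0, 1e-4, 1e-3, 1e-1]
transformed = [log_transform(a) for a in abundances]
normalized = mean_normalize(transformed)
```

Under this assumed form, the normalized values are centered on zero and span a range of exactly one, which keeps the three omics on comparable scales before they reach the network.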
Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain the probability of disease for the population to be tested, wherein the output of the algorithm framework is one of two classes: diseased or not diseased.
Optionally, in some embodiments, obtaining the probability of disease from the multi-omics data with the algorithm framework in step 3 includes the following steps:
Step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are arranged into three matrices. Let the matrix $x_v \in \mathbb{R}^{n \times d_v}$ represent the v-th omics data, consisting of $d_v$ features for n samples. Each matrix first undergoes feature selection through the trained feature selection layer, calculated as follows:
$u_v = x_v \cdot s_v$ (3)
where $s_v \in \mathbb{R}^{d_v \times F}$ is a linear transformation matrix that approaches one-hot (one-hot encoding) form after training; it performs feature selection on each omics dataset to give $u_v \in \mathbb{R}^{n \times F}$.
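The effect of a selection matrix in exact one-hot form can be sketched as follows. This is an illustration only: the selected indices, the helper names and the toy matrix are hypothetical, and the trained $s_v$ in the invention only approaches this form.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def one_hot_selection(selected, d):
    """Build a d x F matrix whose j-th column is one-hot at row selected[j]."""
    return [
        [1.0 if selected[j] == i else 0.0 for j in range(len(selected))]
        for i in range(d)
    ]


# n = 2 samples, d_v = 4 features (made-up values).
x_v = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]
# F = 2: keep feature 2, then feature 0 (hypothetical choice).
s_v = one_hot_selection([2, 0], d=4)
u_v = matmul(x_v, s_v)  # shape n x F: the two chosen feature columns
```

With exact one-hot columns, the product simply copies the chosen feature columns, which is why the selected features can be read off directly from the trained matrix.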
Step 3.2: the feature-selected omics data is passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$\mu_v + \epsilon \cdot \sigma_v = z_v$ (5)
where $f_{\theta_v}$ represents the nonlinear transformation of a multi-layer neural network; ε is randomly sampled from the standard normal distribution while the network is being trained and fixed to 0 once model training is complete. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization trick is then used to obtain the latent representation $z_v$ of each omics dataset.
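Equation (5) and the two regimes for ε can be sketched in a few lines of Python. The function name and the example vectors are hypothetical; the encoder that would produce the mean and scale vectors is not shown.

```python
import random


def reparameterize(mu, sigma, training, rng=random):
    """z = mu + eps * sigma, elementwise.

    eps ~ N(0, 1) while training; eps = 0 after training, so z equals mu.
    """
    eps = [rng.gauss(0.0, 1.0) if training else 0.0 for _ in mu]
    return [m + e * s for m, e, s in zip(mu, eps, sigma)]


mu_v = [1.0, -2.0]    # made-up encoder outputs
sigma_v = [0.5, 0.3]

z_train = reparameterize(mu_v, sigma_v, training=True, rng=random.Random(0))
z_infer = reparameterize(mu_v, sigma_v, training=False)
```

Fixing ε to 0 makes prediction deterministic: the same sample always yields the same latent representation, namely the mean.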
Step 3.3: a joint-omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation z, calculated as follows:
$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$ (8)
where V denotes the number of omics available for the sample, set to 3 in the embodiments of the invention; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiments of the invention; ε is randomly sampled from the standard normal distribution while the network is being trained and fixed to 0 once model training is complete. The means $\mu_v$ and variances $\sigma_v$ of the available omics are integrated to give the joint-omics mean μ and variance σ, and the reparameterization trick is used to obtain the latent representation z of the joint omics data.
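The formulas that combine the per-omics means and variances are not reproduced in the text above. One common way to fuse Gaussian latent distributions when some inputs are missing is a product of experts that includes the prior as an extra expert. The sketch below assumes that rule, and treats σ as a standard deviation, purely for illustration; it should not be read as the patented formula. All names are hypothetical.

```python
import math


def fuse_product_of_experts(experts, mu0=0.0, sigma0=1.0):
    """Fuse one latent dimension across the omics that are present.

    experts: list of (mu_v, sigma_v) pairs; a missing omics is simply left out.
    ASSUMPTION: precision-weighted product of Gaussians with the prior
    N(mu0, sigma0) as an additional expert.
    """
    precisions = [1.0 / sigma0 ** 2] + [1.0 / s ** 2 for _, s in experts]
    means = [mu0] + [m for m, _ in experts]
    total = sum(precisions)
    mu = sum(m * p for m, p in zip(means, precisions)) / total
    sigma = math.sqrt(1.0 / total)
    return mu, sigma


# A sample with all three omics, and one with only a single omics available.
full = fuse_product_of_experts([(2.0, 1.0), (1.0, 0.5), (0.0, 2.0)])
partial = fuse_product_of_experts([(2.0, 1.0)])
```

Under this assumption a missing omics needs no imputation: it contributes no expert, and the joint distribution is simply wider than it would be with more omics present.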
Step 3.4: the latent representation z of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the probability of disease, calculated as follows:
where $f_\psi$ represents the nonlinear transformation of a multi-layer neural network, which yields the predicted label of whether the population to be tested is diseased, based on the latent representation z of the joint omics data.
Optionally, in some embodiments, the training process of the algorithm framework of the disease prediction method based on the variational neural network includes the following steps:
Step S1: collecting intestinal fecal samples from healthy people and from people with a confirmed diagnosis of the target disease that the model is to predict; labeling the samples manually, marking fecal samples from diseased subjects as 1 and fecal samples from non-diseased subjects as 0; and obtaining the corresponding multi-omics data of the samples by sequencing and analysis techniques, or collecting public data from the internet to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples.
Step S2: performing data transformation and normalization on the multi-omics data.
Step S3: dividing the marked data set into a training set and a testing set, performing supervised training on the algorithm framework by using the training set data, and testing on the testing set.
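The split in step S3 can be sketched as a shuffled partition of the labeled samples. The source does not state a split ratio or a shuffling scheme, so the 0.8 training fraction, the fixed seed and the function name below are assumptions made only for illustration.

```python
import random


def split_dataset(samples, labels, train_fraction=0.8, seed=0):
    """Shuffle sample indices and cut them into a training and a test set.

    ASSUMPTION: train_fraction and seed are illustrative, not from the source.
    """
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train = [(samples[i], labels[i]) for i in indices[:cut]]
    test = [(samples[i], labels[i]) for i in indices[cut:]]
    return train, test


# Ten placeholder samples labeled 1 (diseased) or 0 (not diseased).
sample_ids = list(range(10))
sample_labels = [i % 2 for i in sample_ids]
train_set, test_set = split_dataset(sample_ids, sample_labels)
```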
For further explanation, one training iteration of the supervised training of the algorithm framework on the training set data in step S3 proceeds as follows:
Step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are arranged into three matrices. Let the matrix $x_v \in \mathbb{R}^{n \times d_v}$ represent the v-th omics data, consisting of $d_v$ features for n samples. Each matrix is first linearly transformed by the feature selection layer:
$T_e = T_0 \cdot (T_E / T_0)^{e/E}$ (10)
$u_v = x_v \cdot s_v$ (12)
where E is the total number of training iterations, set to 2000 in the embodiments of the invention; e is the current iteration; $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively; $\gamma_v$ is the parameter of the fully connected layer; ε is randomly sampled from the uniform distribution on (0, 1); softmax() denotes the activation function; and $x_v$ is transformed to obtain $u_v \in \mathbb{R}^{n \times F}$.
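The temperature schedule of equation (10) is fully specified and can be computed directly. The formula that turns $\gamma_v$, ε and the temperature into $s_v$ is not reproduced in the text above; the second function below assumes a Gumbel-softmax style relaxation, which matches the ingredients listed (uniform noise, softmax, a temperature) but is an illustration rather than the source's exact formula. Function names are hypothetical.

```python
import math
import random


def temperature(e, E=2000, T0=10.0, TE=0.1):
    """Equation (10): exponential decay from T0 at e = 0 to TE at e = E."""
    return T0 * (TE / T0) ** (e / E)


def relaxed_one_hot(gamma, T, rng):
    """ASSUMED form of one column of s_v: softmax((gamma + g) / T).

    g is Gumbel noise built from uniform samples, g = -log(-log(u)).
    """
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in gamma]
    logits = [(g + n) / T for g, n in zip(gamma, noise)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [v / total for v in exps]


rng = random.Random(0)
gamma_column = [0.0, 5.0, 0.0]  # made-up layer parameters for 3 features
early = relaxed_one_hot(gamma_column, temperature(0), rng)     # smooth weights
late = relaxed_one_hot(gamma_column, temperature(2000), rng)   # near one-hot
```

A high starting temperature lets every feature receive some weight early in training, while the low final temperature pushes each column toward a single feature, consistent with the near one-hot $s_v$ used after training.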
Step S3.2: the omics data transformed by the feature selection layer is passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$\mu_v + \epsilon \cdot \sigma_v = z_v$ (14)
where $f_{\theta_v}$ represents the nonlinear transformation of a multi-layer neural network, $\theta_v$ is the parameter of the neural network, and ε is randomly sampled from the standard normal distribution. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization trick is then used to obtain the latent representation $z_v$ of each omics dataset.
Step S3.3: a joint-omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation z, calculated as follows:
$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$ (17)
where V denotes the number of omics available for the sample, set to 3 in the embodiments of the invention; $\mu_0$ and $\sigma_0$ denote the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiments of the invention; and ε is randomly sampled from the standard normal distribution. The means $\mu_v$ and variances $\sigma_v$ of the available omics are integrated to give the joint-omics mean μ and variance σ, and the reparameterization trick is used to obtain the latent representation z of the joint omics data.
Step S3.4: the latent representation $z_v$ of each omics dataset and the joint latent representation z are passed through a predictor and a joint predictor, respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability $y_v$ and the final predicted probability, calculated as follows:
where the omics-specific predictor and the joint predictor $f_\psi$ each represent the nonlinear transformation of a multi-layer neural network, parameterized by the omics-specific predictor parameters and by ψ, respectively.
Step S3.5: the loss is calculated according to the loss function of the model, and backpropagation is performed to update the parameters of the model's neural networks, calculated as follows:
$L_T = L_J + \alpha \sum_{v \in V} L_v$ (22)
where α and β are balance coefficients that combine the different losses, set to 1 and 0.001 respectively in the embodiments of the invention.
$N(\mu_0, \sigma_0)$ denotes the prior distribution; $N(\mu_v, \sigma_v)$ denotes the distribution of the latent representation of the v-th omics; $N(\mu, \sigma)$ denotes the distribution of the joint latent representation; KL() denotes the KL divergence, i.e., the relative entropy, between two distributions; y is the true label in one-hot form; $y_v$ is the omics-specific predicted probability value and the output of the joint predictor is the final predicted probability value; n is the number of samples; λ is the learning rate used when training the algorithm framework, set to 0.01 in the embodiments of the invention; and the updated quantities are the parameters of the neural networks in the algorithm framework. This completes one training iteration, in which the parameters of the neural networks are updated.
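Only the combination rule of equation (22) survives in the text above; the expressions for $L_J$ and $L_v$ are not reproduced. The quantities that are described (a one-hot label, predicted probabilities, KL divergences to the prior, and a coefficient β) are consistent with an information-bottleneck style loss of the form cross-entropy plus β times KL, so the sketch below assumes that form for both losses. It also treats σ as a standard deviation. All function names are hypothetical.

```python
import math


def kl_to_prior(mu, sigma, mu0=0.0, sigma0=1.0):
    """KL( N(mu, sigma) || N(mu0, sigma0) ) for one latent dimension."""
    return (
        math.log(sigma0 / sigma)
        + (sigma ** 2 + (mu - mu0) ** 2) / (2.0 * sigma0 ** 2)
        - 0.5
    )


def cross_entropy(y_onehot, probs, tiny=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p + tiny) for t, p in zip(y_onehot, probs))


def bottleneck_loss(y_onehot, probs, mu, sigma, beta=0.001):
    """ASSUMED form of L_J and of each L_v: cross-entropy + beta * KL."""
    kl = sum(kl_to_prior(m, s) for m, s in zip(mu, sigma))
    return cross_entropy(y_onehot, probs) + beta * kl


def total_loss(joint_loss, omics_losses, alpha=1.0):
    """Equation (22): L_T = L_J + alpha * sum of L_v over available omics."""
    return joint_loss + alpha * sum(omics_losses)


# One diseased sample (label [0, 1]) with two omics available (made-up values).
y = [0.0, 1.0]
L_J = bottleneck_loss(y, [0.2, 0.8], mu=[0.3, -0.1], sigma=[0.9, 1.1])
L_1 = bottleneck_loss(y, [0.4, 0.6], mu=[0.5, 0.0], sigma=[1.0, 1.0])
L_2 = bottleneck_loss(y, [0.3, 0.7], mu=[0.0, 0.2], sigma=[0.8, 1.2])
L_T = total_loss(L_J, [L_1, L_2])
```

Summing only over the omics a sample actually has keeps the loss well defined for incomplete samples, which is the property the method relies on.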
The invention also provides a disease prediction system based on a variational neural network, which applies the disease prediction method based on a variational neural network described above. In some embodiments, the system comprises:
a feature selection layer module for selecting important features by linearly transforming the omics data;
an encoder module for stochastically encoding the omics data into latent representations;
a joint encoder module for integrating the latent representations of the individual omics; and
a predictor and joint predictor module for inferring labels, i.e., disease prediction results, from the latent representations encoded by the encoder module.
Optionally, in some implementations, the feature selection layer module is a boolean matrix.
Optionally, in some embodiments, the encoder module is comprised of a fully connected neural network.
Optionally, in some embodiments, the joint encoder module is a calculation module.
Optionally, in some embodiments, the predictor and joint predictor module is comprised of a fully connected neural network.
The method and system process the multi-omics data both separately and jointly, making full use of the information within each omics while mining the interactions among omics. They are suitable for prediction on samples with incomplete multi-omics data, reduce the requirement on the completeness of a sample's multi-omics data, and provide a technical means for disease prediction from gut multi-omics data.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and similar elements thereof may be made without departing from the spirit and principles of the present invention.

Claims (10)

1. A disease prediction method based on a variational neural network, characterized by comprising the following steps:
Step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data using high-throughput sequencing technology, processing the raw data and then performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, respectively, and obtaining metabolite abundance data of the metabolome by mass spectrometry;
Step 2: preprocessing the multi-omics data, including data transformation and normalization;
Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability of disease, wherein the output of the algorithm framework is one of two classes: diseased or not diseased.
2. The disease prediction method based on the variational neural network according to claim 1, wherein the data transformation and normalization in step 2 comprises the following steps:
step 2.1: performing a logarithmic transformation on the data so that it can be reasonably analyzed by the neural network:
$$x = \log_2(2x + 0.00001)$$
wherein x represents the flora abundance data;
step 2.2: after the flora abundance data has been transformed, the following normalization is performed on each omics dataset:
wherein $x_{mean}$ is the mean of the omics dataset, $x_{max}$ is its maximum value, and $x_{min}$ is its minimum value.
3. The disease prediction method based on the variational neural network according to claim 2, wherein obtaining the probability value of illness from the multi-omics data by the algorithm framework in step 3 comprises the following steps:
step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are organized into three matrices; assume the matrix $x_v \in R^{n \times d_v}$ representing the v-th omics data consists of $d_v$ features from n samples; first, each matrix undergoes feature selection through a trained feature selection layer, calculated as follows:
$$u_v = x_v \cdot s_v$$
wherein $s_v \in R^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training and performs feature selection on each omics dataset to obtain $u_v \in R^{n \times F}$;
step 3.2: the feature-selected omics data is passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$$\mu_v + \epsilon \cdot \sigma_v = z_v$$
wherein the encoder function represents the nonlinear transformation process of a multi-layer neural network; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is finished; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset;
step 3.3: incomplete multi-omics data with arbitrary missing omics is integrated by a joint encoder to obtain a joint latent representation z, calculated as follows:
$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$
wherein V represents the number of omics available for the sample, and $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is finished; the mean $\mu$ and variance $\sigma$ of the joint omics are obtained by integrating the means $\mu_v$ and variances $\sigma_v$ of the available omics data, and the reparameterization technique is used to obtain the latent representation z of the joint omics data;
step 3.4: the latent representation z of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to derive the probability value of illness, calculated as follows:
wherein $f_\psi$ represents the nonlinear transformation process of a multi-layer neural network, which derives the predicted label of whether the sample is ill from the latent representation z of the joint omics data.
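The inference path of claim 3 can be sketched in Python as below. This is an illustration under stated assumptions, not the claimed implementation: the formula that integrates the per-omics means and variances is not reproduced in this text, so a product-of-Gaussian-experts fusion (a common choice in multimodal variational models) is assumed, with σ treated as a standard deviation; the encoder and predictor are reduced to single layers with random, untrained weights; all function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_features(x_v, s_v):
    # Step 3.1: u_v = x_v . s_v, with s_v of shape (d_v, F)
    return x_v @ s_v

def encode(u_v, w_mu, w_sigma):
    # Step 3.2, simplified to one layer: mean and positive scale
    mu = np.tanh(u_v @ w_mu)
    sigma = np.exp(u_v @ w_sigma)  # exp keeps sigma positive (assumption)
    return mu, sigma

def joint_encode(mus, sigmas, mu0=0.0, sigma0=1.0):
    # ASSUMED product-of-experts fusion of the prior N(mu0, sigma0)
    # with whichever omics are available for the sample
    precisions = [np.full_like(mus[0], 1.0 / sigma0**2)]
    precisions += [1.0 / s**2 for s in sigmas]
    means = [np.full_like(mus[0], mu0)] + list(mus)
    total = sum(precisions)
    mu = sum(m * p for m, p in zip(means, precisions)) / total
    return mu, np.sqrt(1.0 / total)

def predict(z, w_out):
    # Step 3.4, simplified: sigmoid output read as probability of illness
    return 1.0 / (1.0 + np.exp(-(z @ w_out)))

n, F, K = 4, 3, 2  # samples, selected features, latent size (hypothetical)
# Two of the three omics are present; the third is missing for these samples
x = {"metagenome": rng.random((n, 6)), "metabolome": rng.random((n, 5))}
mus, sigmas = [], []
for x_v in x.values():
    s_v = np.eye(x_v.shape[1])[:, :F]  # one-hot columns: keep first F features
    u_v = select_features(x_v, s_v)
    mu_v, sigma_v = encode(u_v, rng.normal(size=(F, K)),
                           0.1 * rng.normal(size=(F, K)))
    mus.append(mu_v)
    sigmas.append(sigma_v)

mu, sigma = joint_encode(mus, sigmas)
eps = 0.0  # fixed to 0 after training, per the claim
z = mu + eps * sigma
prob = predict(z, rng.normal(size=(K, 1)))
```

Because the fusion only iterates over the omics that are present, a sample with one or two missing omics passes through the same code path, which is the property the claim relies on for incomplete data.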
4. The disease prediction method based on the variational neural network according to claim 3, wherein the training process of the algorithm framework comprises the following steps:
step S1: collecting intestinal fecal samples from healthy people and from people diagnosed with the target disease that the model is to predict, and labeling them manually, with fecal samples from diseased subjects labeled 1 and fecal samples from non-diseased subjects labeled 0; obtaining the corresponding multi-omics data of the samples through sequencing and analysis techniques, or collecting public data to construct a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples;
step S2: performing data transformation and normalization on the multi-omics data;
step S3: dividing the labeled dataset into a training set and a test set, performing supervised training of the algorithm framework on the training set, and testing on the test set.
5. The disease prediction method based on the variational neural network according to claim 4, wherein one training iteration of the supervised training of the algorithm framework on the training set data in step S3 comprises the following steps:
step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are organized into three matrices; assume the matrix $x_v \in R^{n \times d_v}$ representing the v-th omics data consists of $d_v$ features from n samples; first, each matrix is linearly transformed by the feature selection layer:
$$T_e = T_0 \cdot (T_E / T_0)^{e/E}$$
$$u_v = x_v \cdot s_v$$
wherein E represents the total number of training iterations, e represents the current iteration, $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively, $\gamma_v$ denotes the parameters of the fully connected layer, $\epsilon$ is randomly sampled from the uniform distribution on (0, 1), and softmax() represents an activation function; $x_v$ is transformed to obtain $u_v \in R^{n \times F}$;
step S3.2: the omics data transformed by the feature selection layer is passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, calculated as follows:
$$\mu_v + \epsilon \cdot \sigma_v = z_v$$
wherein the encoder function represents the nonlinear transformation process of a multi-layer neural network, $\theta_v$ denotes the parameters of the neural network, and $\epsilon$ is randomly sampled from the standard normal distribution; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics dataset;
step S3.3: incomplete multi-omics data with arbitrary missing omics is integrated by a joint encoder to obtain a joint latent representation z, calculated as follows:
$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$
wherein V represents the number of omics available for the sample, $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution, and $\epsilon$ is randomly sampled from the standard normal distribution; the mean $\mu$ and variance $\sigma$ of the joint omics are obtained by integrating the means $\mu_v$ and variances $\sigma_v$ of the available omics data, and the reparameterization technique is used to obtain the latent representation z of the joint omics data;
step S3.4: the latent representation $z_v$ of each omics dataset and the joint latent representation z are passed through the predictors and the joint predictor respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability value $y_v$ and the final predicted probability value, calculated as follows:
wherein the omics-specific predictor functions and $f_\psi$ represent the nonlinear transformation processes of multi-layer neural networks, and the omics-specific predictor parameters and $\psi$ are parameters of the neural networks;
step S3.5: the loss is calculated according to the loss function of the model and the gradient is backpropagated to update the parameters of the model's neural networks, calculated as follows:
$$L_T = L_J + \alpha \sum_{v \in V} L_v$$
wherein $\alpha$ and $\beta$ are balance coefficients for combining the different losses, $N(\mu_0, \sigma_0)$ represents the prior distribution, $N(\mu_v, \sigma_v)$ represents the distribution of the latent representation of the v-th omics, $N(\mu, \sigma)$ represents the distribution of the joint latent representation, KL() denotes computing the KL divergence, i.e. the relative entropy, between two distributions, y is the true label in one-hot form, $y_v$ is the omics-specific predicted probability value and the output of the joint predictor is the final predicted probability value, n is the number of samples, $\lambda$ is the learning rate used when training the algorithm framework, and the updated quantities are the parameters of the neural networks in the algorithm framework; one training iteration of the model is thus completed, updating the parameters of the neural networks.
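Two pieces of the training iteration in claim 5 are concrete enough to sketch in Python: the temperature schedule T_e and the combination of losses L_T. The expression for s_v and the definitions of L_J and L_v are not reproduced in this text, so the sketch assumes a Gumbel-softmax relaxation for the feature selection matrix (suggested by the uniform sample ε, the softmax and the temperature) and the closed-form KL divergence between diagonal Gaussians with σ treated as a standard deviation. All function names are hypothetical.

```python
import numpy as np

def temperature(e, E, T0=10.0, TE=0.1):
    # T_e = T_0 * (T_E / T_0) ** (e / E), as given in step S3.1
    return T0 * (TE / T0) ** (e / E)

def relaxed_selector(gamma_v, T, rng):
    # ASSUMED Gumbel-softmax form: softmax((gamma + g) / T) per column,
    # where g is Gumbel noise built from a uniform sample on (0, 1)
    eps = rng.uniform(1e-12, 1.0, size=gamma_v.shape)
    g = -np.log(-np.log(eps))
    logits = (gamma_v + g) / T
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=0, keepdims=True)

def kl_to_prior(mu, sigma, mu0=0.0, sigma0=1.0):
    # ASSUMED closed form of KL(N(mu, sigma) || N(mu0, sigma0)),
    # summed over latent dimensions
    term = (np.log(sigma0 / sigma)
            + (sigma**2 + (mu - mu0) ** 2) / (2.0 * sigma0**2) - 0.5)
    return np.sum(term, axis=-1)

def total_loss(L_J, L_v_list, alpha):
    # L_T = L_J + alpha * sum over available omics of L_v
    return L_J + alpha * sum(L_v_list)

rng = np.random.default_rng(0)
E = 100
# Selector for d_v = 5 candidate features and F = 3 selected features
s_early = relaxed_selector(np.zeros((5, 3)), temperature(0, E), rng)
s_late = relaxed_selector(np.zeros((5, 3)), temperature(E, E), rng)
```

With T_0 = 10 and T_E = 0.1 the temperature decays geometrically, so under the assumed relaxation each column of s_v starts close to uniform and sharpens toward a one-hot vector by the end of training, which matches the claim's statement that s_v approaches one-hot form.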
6. A disease prediction system based on a variational neural network, to which the disease prediction method based on a variational neural network according to any one of claims 1 to 5 is applied, the system comprising:
a feature selection layer module for linearly transforming the omics data to select important features;
an encoder module for stochastically encoding the omics data into latent representations;
a joint encoder module for integrating the latent representations of the individual omics; and
a predictor and joint predictor module for providing label inference, i.e. the disease prediction result, from the latent representations encoded by the encoder module.
7. The disease prediction system based on a variational neural network of claim 6 wherein said feature selection layer module is a boolean matrix.
8. The disease prediction system based on a variational neural network of claim 6 wherein said encoder module is comprised of a fully connected neural network.
9. The disease prediction system based on a variational neural network of claim 6, wherein said joint encoder module is a calculation module.
10. The disease prediction system based on a variational neural network of claim 6 wherein said predictor and joint predictor module is comprised of a fully connected neural network.
CN202311028109.1A 2023-08-15 2023-08-15 Disease prediction method and system based on variational neural network Pending CN117198397A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311028109.1A CN117198397A (en) 2023-08-15 2023-08-15 Disease prediction method and system based on variational neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311028109.1A CN117198397A (en) 2023-08-15 2023-08-15 Disease prediction method and system based on variational neural network

Publications (1)

Publication Number Publication Date
CN117198397A true CN117198397A (en) 2023-12-08

Family

ID=89002565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311028109.1A Pending CN117198397A (en) 2023-08-15 2023-08-15 Disease prediction method and system based on variational neural network

Country Status (1)

Country Link
CN (1) CN117198397A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118352007A (en) * 2024-04-30 2024-07-16 中国人民解放军总医院第一医学中心 Disease data analysis method and system based on crowd queue multiunit study data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination