CN117198397A - Disease prediction method and system based on variational neural network

Info

Publication number: CN117198397A
Application number: CN202311028109.1A
Authority: CN
Inventors: 朱金林; 胡明逸; 陆文伟; 王鸿超
Original assignee: Jiangnan University
Current assignee: Jiangnan University
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2023-12-08

Abstract

The invention discloses a disease prediction method and system based on a variational neural network. The method comprises the following steps. Step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data by high-throughput sequencing, processing the raw data and performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtaining metabolite abundance data of the metabolome by mass spectrometry. Step 2: preprocessing the multi-omics data, including data transformation and normalization. Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability value of disease, wherein the output of the algorithm framework falls into two classes, diseased and not diseased. The invention can be used to integrate incomplete microbiome multi-omics data, predict disease and identify disease-related biomarkers.

Description

Disease prediction method and system based on variational neural network

Technical Field

The invention relates to a disease prediction method and system based on a variational neural network, and belongs to the technical field of disease prediction.

Background

The human intestinal microbiota is a complex microbial ecosystem whose collective genome contains about 3 million genes, roughly 150 times as many as the genome of the human host. The intestinal microbiota plays an important role in human metabolism by synthesizing enzymes that the human genome does not encode; these enzymes promote the breakdown of polysaccharides and polyphenols, promote nutrient absorption, and provide protection against pathogens. A growing number of studies show that dysbiosis of the intestinal microbiota may be closely related to various diseases, especially those affecting the gastrointestinal system. To reveal the relationship between microorganisms and human health, various omics techniques have emerged, such as metagenomics, metatranscriptomics and metabolomics. Each provides information about molecular mechanisms or biological processes at a specific omics level. In recent years, more and more studies have shown that combining omics data generally provides more complete information and a better understanding of the microecology, which can improve the accuracy of human disease prediction, increase the robustness of the analysis, and enable the discovery of important biomarkers. Notably, microbiome multi-omics data comprise several disparate data types and are known for their heterogeneity, sparsity and high dimensionality. Given these characteristics, the data require specialized analysis methods to support deeper understanding and knowledge discovery. High-performance machine learning methods are currently receiving considerable attention in biology, and a large number of models have been developed to exploit the potential of multi-omics information.

Incomplete omics data are common in public databases. This can be attributed to factors such as limited funding, ethical considerations and privacy concerns, all of which affect sample availability, and it poses a significant challenge for integrated analysis. In this situation, discarding samples or mean imputation may be considered. However, the former greatly reduces the number of usable samples, while the latter may severely distort the true distribution of the data. Existing machine learning algorithms that predict disease from incomplete multi-omics data have two main problems: first, they cannot effectively extract relevant features from high-dimensional omics data while filtering out irrelevant ones; second, they struggle to integrate incomplete multi-omics data flexibly and therefore cannot make full use of the information in such data to achieve efficient prediction.

Disclosure of Invention

To solve the above problems, the invention provides a disease prediction method and system based on a variational neural network. The method uses incomplete intestinal multi-omics data for disease prediction: human intestinal fecal samples are collected; flora abundance data, pathway data and metabolite abundance data of the samples are obtained through sequencing and analysis techniques; the multi-omics data are preprocessed; and the omics data available for each sample are fed into a trained algorithm framework to obtain a probability value of disease. The output of the algorithm framework falls into two classes, diseased and not diseased.

In one aspect, the invention provides a disease prediction method based on a variational neural network, comprising the following steps:

step 1: extracting DNA, RNA and metabolites from a sample, amplifying the DNA and RNA into libraries suitable for high-throughput sequencing, obtaining raw sequencing data by high-throughput sequencing, processing the raw data and performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome, and obtaining metabolite abundance data of the metabolome by mass spectrometry;

step 2: preprocessing the multi-omics data, including data transformation and normalization;

step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into a trained algorithm framework to obtain a probability value of disease, wherein the output of the algorithm framework falls into two classes, diseased and not diseased.

In one embodiment of the present invention, the data transformation and normalization in step 2 comprises the following steps:

step 2.1: transforming the data as follows so that they can be reasonably analyzed by a neural network:

$$x = \log_2(2x + 0.00001)$$

wherein $x$ represents the flora abundance data;

step 2.2: after the flora abundance data have been transformed, normalizing each omics data set as follows:

$$x = \frac{x - x_{mean}}{x_{max} - x_{min}}$$

wherein $x_{mean}$ is the mean of the data set, $x_{max}$ is the maximum value in the data set, and $x_{min}$ is the minimum value in the data set.
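The two preprocessing steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `log_transform` and `mean_normalize` and the sample values are hypothetical, and the normalization is taken to be mean normalization, (x - x_mean) / (x_max - x_min), consistent with the symbols x_mean, x_max and x_min defined above.

```python
import math

def log_transform(values, pseudo=0.00001):
    """Apply x -> log2(2x + pseudo) to relative-abundance values."""
    return [math.log2(2.0 * x + pseudo) for x in values]

def mean_normalize(values):
    """Mean-normalize one data set: (x - mean) / (max - min).

    Assumed form; a constant data set is mapped to zeros to avoid
    division by zero.
    """
    mean = sum(values) / len(values)
    span = max(values) - min(values)
    if span == 0:
        return [0.0 for _ in values]
    return [(x - mean) / span for x in values]

abundance = [0.0, 0.0001, 0.01, 0.1]   # hypothetical relative abundances
transformed = log_transform(abundance)
normalized = mean_normalize(transformed)
```

After normalization the values are centered on zero and span a range of exactly one, which keeps the three omics data sets on a comparable scale before they enter the network.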

In one embodiment of the present invention, obtaining the probability value of disease from the multi-omics data with the algorithm framework in step 3 comprises the following steps:

step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are organized into three matrices, where the matrix $x_v \in \mathbb{R}^{n \times d_v}$ representing the $v$-th omics data consists of $d_v$ features of $n$ samples; each matrix first undergoes feature selection through the trained feature selection layer, computed as follows:

$$u_v = x_v \cdot s_v$$

wherein $s_v \in \mathbb{R}^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training and performs feature selection on each omics data set to obtain $u_v \in \mathbb{R}^{n \times F}$;

step 3.2: the feature-selected omics data are passed through a trained encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, computed as follows:

$$\mu_v + \epsilon \cdot \sigma_v = z_v$$

wherein the encoder represents the nonlinear transformation of a multi-layer neural network; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is finished; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set;

step 3.3: a joint omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation $z$, computed as follows:

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$

wherein $V$ represents the number of omics possessed by the sample; $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is completed; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the mean $\mu$ and variance $\sigma$ of the joint omics, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data;

step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the probability value of disease $\hat{y}$, computed as follows:

$$\hat{y} = f_\psi(z)$$

wherein $f_\psi$ represents the nonlinear transformation of a multi-layer neural network, which derives the predicted label of whether the sample is diseased from the latent representation $z$ of the joint omics data.
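Step 3.1 can be illustrated with a toy example: when every column of s_v is exactly one-hot, the product x_v · s_v simply copies F of the d_v input columns into u_v, which is why a near-one-hot matrix acts as a feature selector. The sketch below uses hypothetical values and a hand-written `matmul` helper; it is not the trained feature selection layer.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# x_v: n = 2 samples, d_v = 4 features (hypothetical values)
x_v = [[0.1, 0.2, 0.3, 0.4],
       [0.5, 0.6, 0.7, 0.8]]

# s_v: d_v x F with F = 2; each column is one-hot and selects
# input feature 2 and input feature 0, in that order
s_v = [[0, 1],
       [0, 0],
       [1, 0],
       [0, 0]]

u_v = matmul(x_v, s_v)   # n x F matrix holding only the selected features
```

Reading off the row index of the 1 in each column of s_v gives the indices of the selected features, which is how such a layer can point to candidate biomarkers.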

In one embodiment of the present invention, the training process of the algorithm framework includes the steps of:

step S1: collecting intestinal fecal samples from healthy individuals and from individuals diagnosed with the target disease that the model is to predict, and labeling them manually, with fecal samples from diseased individuals labeled 1 and fecal samples from non-diseased individuals labeled 0; obtaining the corresponding multi-omics data of the samples through sequencing and analysis techniques, or collecting public data to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples;

step S2: performing data transformation and normalization on the multi-omics data;

step S3: dividing the labeled data set into a training set and a test set, performing supervised training of the algorithm framework on the training set, and testing on the test set.
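Step S3 does not specify how the labeled data set is divided. As one possible approach, the sketch below performs a seeded random split that preserves the ratio of diseased (1) to non-diseased (0) labels. The 80/20 ratio, the seed, the sample counts and the function name `stratified_split` are all assumptions made for illustration.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each label class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Keep at least one sample of every class in the test set.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 10 + [0] * 20   # hypothetical: 10 diseased, 20 non-diseased
train_idx, test_idx = stratified_split(labels)
```

Splitting per class avoids a test set that happens to contain only one label, which would make the reported accuracy hard to interpret.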

In one embodiment of the present invention, one training iteration of the supervised training of the algorithm framework on the training set in step S3 comprises the following steps:

step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are organized into three matrices, where the matrix $x_v \in \mathbb{R}^{n \times d_v}$ representing the $v$-th omics data consists of $d_v$ features of $n$ samples; each matrix is first linearly transformed by the feature selection layer:

$$T_e = T_0 \cdot (T_E / T_0)^{e/E}$$

$$u_v = x_v \cdot s_v$$

wherein $E$ represents the total number of training iterations, $e$ represents the current iteration number, $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively, $\gamma_v$ denotes the parameters of the fully connected layer, $\varepsilon$ is randomly sampled from the uniform distribution on (0, 1), and softmax() represents the activation function; through this transformation $x_v$ is mapped to $u_v \in \mathbb{R}^{n \times F}$;

step S3.2: the omics data transformed by the feature selection layer are passed through an encoder consisting of fully connected layers and activation functions to obtain the latent representation of each omics, computed as follows:

$$\mu_v + \epsilon \cdot \sigma_v = z_v$$

wherein the encoder represents the nonlinear transformation of a multi-layer neural network, $\theta_v$ denotes the parameters of that neural network, and $\epsilon$ is randomly sampled from the standard normal distribution; the mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set;

step S3.3: a joint omics encoder integrates, in a simple manner, incomplete multi-omics data with arbitrary missing omics and yields the joint latent representation $z$, computed as follows:

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$

wherein $V$ represents the number of omics possessed by the sample, $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution, and $\epsilon$ is randomly sampled from the standard normal distribution; the means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the mean $\mu$ and variance $\sigma$ of the joint omics, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data;

step S3.4: the latent representation $z_v$ of each omics data set and the joint latent representation $z$ are passed through an omics-specific predictor and a joint predictor, respectively, each consisting of fully connected layers and activation functions, to obtain the omics-specific predicted probability value $y_v$ and the final predicted probability value $\hat{y}$; the omics-specific predictor and the joint predictor $f_\psi$ each represent the nonlinear transformation of a multi-layer neural network, and $\psi$ and the corresponding omics-specific parameters are the parameters of these neural networks;

step S3.5: the loss is calculated according to the loss function of the model, and gradients are backpropagated to update the parameters of the model's neural networks, computed as follows:

$$L_T = L_J + \alpha \sum_{v \in V} L_v$$

wherein $\alpha$ and $\beta$ are balance coefficients for combining the different losses; $N(\mu_0, \sigma_0)$ represents the prior distribution; $N(\mu_v, \sigma_v)$ represents the distribution of the latent representation of the $v$-th omics; $N(\mu, \sigma)$ represents the distribution of the joint latent representation; KL() denotes the KL divergence, i.e. the relative entropy, between two distributions; $y$ is the true label in one-hot form; $y_v$ is the omics-specific predicted probability value; $\hat{y}$ is the final predicted probability value; $n$ is the number of samples; $\lambda$ is the learning rate used when training the algorithm framework; and the quantities being updated are the parameters of the neural networks in the algorithm framework. This completes one training iteration of the model, in which the parameters of the neural networks are updated.
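The text above gives only the total loss L_T and names its ingredients (a prediction term against the one-hot label, a KL term against the prior, and the coefficients α and β). A minimal per-sample sketch follows, assuming that the joint loss L_J and each omics loss L_v take the usual variational information bottleneck form, cross-entropy plus β times the KL divergence to the prior, and that σ is used as a standard deviation as in z = μ + ∈·σ. These forms, the function names and the toy values are assumptions, not the patent's stated equations; the defaults α = 1 and β = 0.001 are the values the description gives for its embodiment.

```python
import math

def kl_gaussian(mu, sigma, mu0=0.0, sigma0=1.0):
    """KL( N(mu, sigma^2) || N(mu0, sigma0^2) ), summed over latent dimensions."""
    return sum(math.log(sigma0 / s) + (s * s + (m - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5
               for m, s in zip(mu, sigma))

def cross_entropy(y_onehot, y_prob):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, y_prob))

def branch_loss(y, y_prob, mu, sigma, beta):
    """Assumed form of L_J and L_v: prediction loss plus beta-weighted KL term."""
    return cross_entropy(y, y_prob) + beta * kl_gaussian(mu, sigma)

def total_loss(joint, per_omics, alpha=1.0, beta=0.001):
    """L_T = L_J + alpha * sum of L_v over the omics the sample actually has."""
    l_j = branch_loss(*joint, beta)
    return l_j + alpha * sum(branch_loss(*b, beta) for b in per_omics)

y = [0.0, 1.0]                                      # one-hot label: diseased
joint = (y, [0.2, 0.8], [0.1, -0.3], [0.9, 1.1])    # (y, prediction, mu, sigma)
per_omics = [(y, [0.4, 0.6], [0.0, 0.0], [1.0, 1.0])]
loss = total_loss(joint, per_omics)
```

Because the sum runs only over the omics present for a sample, a sample with a missing omics contributes fewer L_v terms rather than requiring imputed inputs.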

In another aspect, the invention also provides a disease prediction system based on a variational neural network, which applies the above disease prediction method based on a variational neural network, the system comprising:

a feature selection layer module for selecting important features by linearly transforming the omics data;

an encoder module for stochastically encoding the omics data into latent representations;

a joint encoder module for integrating the latent representations of the individual omics; and

a predictor and joint predictor module for providing label inference, i.e. the disease prediction result, from the latent representations encoded by the encoder module.

In one embodiment of the present invention, the feature selection layer module is a boolean matrix.

In one embodiment of the invention, the encoder module is formed by a fully connected neural network.

In one embodiment of the present invention, the joint encoder module is a calculation module.

In one embodiment of the invention, the predictor and joint predictor module is comprised of a fully connected neural network.

The disease prediction method and system based on a variational neural network provided by the invention have the following advantages. First, the invention provides a new framework based on a variational neural network that can be used to integrate incomplete microbiome multi-omics data, predict diseases and search for disease-related biomarkers. Second, the invention introduces a specific distribution with which the features most relevant to the target disease can be selected in each microbiome omics, thereby improving the interpretability of the model. Third, the algorithm uses the information bottleneck principle to construct the training loss function, which promotes the learning of both single-omics and joint-omics latent representations and gives the algorithm high prediction accuracy and robustness. Fourth, the invention places low requirements on the completeness of the multi-omics data and can flexibly use whatever omics data a sample has to predict disease.

Drawings

FIG. 1 is a flow chart of a disease prediction method based on a variational neural network of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As shown in fig. 1, the present invention provides a disease prediction method based on a variational neural network, and in some embodiments, the method includes the following steps:

step 1: collecting intestinal fecal samples from the population to be tested; extracting DNA, RNA and metabolites from the samples; amplifying the DNA and RNA into libraries suitable for high-throughput sequencing; obtaining raw sequencing data by high-throughput sequencing; processing the raw data and performing species annotation and functional annotation to obtain flora abundance data of the metagenome and pathway data of the metatranscriptome; and obtaining metabolite abundance data of the metabolome by mass spectrometry.

Considering sequencing cost, obtaining complete multi-omics data for a large number of samples requires considerable money and time, so samples with incomplete multi-omics data arise easily. In that case, the omics data that a sample does have are fed into the trained model, and the missing omics data do not take part in the integration calculation, so that subsequent prediction can still achieve high accuracy.

Step 2: preprocessing the multi-omics data, including data transformation and normalization.

Optionally, in some embodiments, the data transformation and normalization in step 2 comprises the following steps:

step 2.1: for the data to be reasonably analyzed by a neural network, they first need to be transformed. The present invention considers the relative abundance of the flora. These values range from 0 up to relatively large values, with most magnitudes lying between $10^{-4}$ and $10^{-1}$. Considering that low-abundance features may play an important role in health, the following logarithmic transformation is designed and applied:

$$x = \log_2(2x + 0.00001) \tag{1}$$

A constant of 0.00001 is added to avoid numerical problems at zero, where $x$ represents the flora abundance data.
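As a quick numerical check of equation (1) over the stated relative-abundance range (values rounded to three decimals, inputs chosen for illustration):

```latex
\begin{align*}
x = 0:       &\quad \log_2(2 \cdot 0 + 0.00001) = \log_2(10^{-5}) \approx -16.610 \\
x = 10^{-4}: &\quad \log_2(2 \cdot 10^{-4} + 0.00001) = \log_2(2.1 \times 10^{-4}) \approx -12.217 \\
x = 10^{-1}: &\quad \log_2(2 \cdot 10^{-1} + 0.00001) = \log_2(0.20001) \approx -2.322
\end{align*}
```

A zero abundance therefore maps to a finite value, and abundances three orders of magnitude apart end up roughly ten units apart on the transformed scale instead of being crowded near zero.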

Step 3: feeding the processed flora abundance data, pathway data and metabolite abundance data into the trained algorithm framework to obtain the probability value of disease for the population to be tested, wherein the output of the algorithm framework falls into two classes, diseased and not diseased.

Optionally, in some embodiments, obtaining the probability value of disease from the multi-omics data with the algorithm framework in step 3 comprises the following steps:

$$u_v = x_v \cdot s_v \tag{3}$$

wherein $s_v \in \mathbb{R}^{d_v \times F}$ is a linear transformation matrix that approaches one-hot form after training and performs feature selection on each omics data set to obtain $u_v \in \mathbb{R}^{n \times F}$.

$$\mu_v + \epsilon \cdot \sigma_v = z_v \tag{5}$$

wherein the encoder represents the nonlinear transformation of a multi-layer neural network; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is completed. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set.

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1) \tag{8}$$

wherein $V$ represents the number of omics possessed by the sample, set to 3 in the embodiment of the present invention; $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiment of the present invention; $\epsilon$ is randomly sampled from the standard normal distribution while the network is being trained and is fixed to 0 after model training is completed. The means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the mean $\mu$ and variance $\sigma$ of the joint omics, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data.
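The rule that integrates the per-omics means and variances into the joint μ and σ is not reproduced in the text above. One common choice for this kind of fusion, assumed here purely for illustration, is a precision-weighted product of Gaussians that includes the prior N(μ_0, σ_0) with μ_0 = 0 and σ_0 = 1; a missing omics is handled by leaving its term out. The function name `fuse`, the one-dimensional setting and the toy values are hypothetical, and σ is treated as a standard deviation.

```python
import math

def fuse(experts, mu0=0.0, sigma0=1.0):
    """Fuse available (mu_v, sigma_v) pairs with the prior N(mu0, sigma0^2).

    Assumed rule: precision-weighted product of one-dimensional Gaussians.
    Missing omics are simply left out of `experts`.
    """
    precision = 1.0 / sigma0 ** 2
    weighted_mean = mu0 / sigma0 ** 2
    for mu_v, sigma_v in experts:
        precision += 1.0 / sigma_v ** 2
        weighted_mean += mu_v / sigma_v ** 2
    return weighted_mean / precision, math.sqrt(1.0 / precision)

# A sample with all three omics versus one missing the metabolome
full = fuse([(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)])
partial = fuse([(1.0, 1.0), (2.0, 1.0)])
prior_only = fuse([])
```

Under this assumed rule the joint σ shrinks as more omics are available, so a sample with fewer omics yields a wider, less certain joint latent distribution, and a sample with no omics falls back to the prior.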

Step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the probability value of disease $\hat{y}$, computed as follows:

$$\hat{y} = f_\psi(z) \tag{9}$$

wherein $f_\psi$ represents the nonlinear transformation of a multi-layer neural network, which derives the predicted label of whether the population to be tested is diseased from the latent representation $z$ of the joint omics data.
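The joint predictor of step 3.4 can be pictured as a small fully connected network that ends in class probabilities. The sketch below is a hedged illustration only: one hidden layer, ReLU, a softmax over the two classes, the 0.5 decision threshold and all weights are assumptions, since the text specifies only "fully connected layers and activation functions".

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    top = max(x)                               # subtract max for stability
    exps = [math.exp(v - top) for v in x]
    norm = sum(exps)
    return [v / norm for v in exps]

def predict(z, w1, b1, w2, b2):
    """Map a joint latent vector z to [P(not diseased), P(diseased)]."""
    return softmax(dense(relu(dense(z, w1, b1)), w2, b2))

# Hypothetical weights for a 2-dimensional latent and 3 hidden units
w1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
w2 = [[0.2, -0.5, 0.3], [-0.1, 0.6, 0.4]]
b2 = [0.0, 0.0]

probs = predict([1.5, 0.5], w1, b1, w2, b2)
label = "diseased" if probs[1] >= 0.5 else "not diseased"
```

At inference time z is the joint mean μ, because ∈ is fixed to 0, so the same sample always yields the same probability.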

Optionally, in some embodiments, the training process of the algorithm framework of the disease prediction method based on the variational neural network includes the following steps:

step S1: collecting intestinal fecal samples from healthy individuals and from individuals diagnosed with the target disease that the model is to predict, and labeling them manually, with fecal samples from diseased individuals labeled 1 and fecal samples from non-diseased individuals labeled 0; obtaining the corresponding multi-omics data of the samples through sequencing and analysis techniques, or collecting public data from the internet to build a multi-omics database, thereby obtaining multi-omics data of labeled fecal samples.

Step S2: performing data transformation and normalization on the multi-omics data.

For further explanation, one training iteration of the supervised training of the algorithm framework on the training set in step S3 is described as follows:

step S3.1: the flora abundance data, pathway data and metabolite abundance data of the training set are organized into three matrices, where the matrix $x_v \in \mathbb{R}^{n \times d_v}$ representing the $v$-th omics data consists of $d_v$ features of $n$ samples. Each matrix is first linearly transformed by the feature selection layer:

$$T_e = T_0 \cdot (T_E / T_0)^{e/E} \tag{10}$$

$$u_v = x_v \cdot s_v \tag{12}$$

wherein $E$ represents the total number of training iterations, set to 2000 in the embodiment of the present invention, $e$ represents the current iteration number, $T_0$ and $T_E$ are hyperparameters of the algorithm model, set to 10 and 0.1 respectively, $\gamma_v$ denotes the parameters of the fully connected layer, $\varepsilon$ is randomly sampled from the uniform distribution on (0, 1), and softmax() represents the activation function; through this transformation $x_v$ is mapped to $u_v \in \mathbb{R}^{n \times F}$.
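Equation (10) is an exponential temperature schedule: with T_0 = 10, T_E = 0.1 and E = 2000, the temperature falls from 10 at the first iteration to 0.1 at the last. The sketch below computes the schedule and, as an assumption, uses it in a Gumbel-softmax style relaxation built from the ingredients the text names (parameters γ_v, uniform noise ε and softmax). The exact selection formula is not reproduced in the text, so `relaxed_one_hot` and the toy logits are hypothetical.

```python
import math
import random

def temperature(e, total=2000, t_start=10.0, t_end=0.1):
    """T_e = T_0 * (T_E / T_0) ** (e / E): exponential decay from T_0 to T_E."""
    return t_start * (t_end / t_start) ** (e / total)

def relaxed_one_hot(gamma, temp, rng):
    """Assumed Gumbel-softmax relaxation of one column of the selection matrix.

    gamma: learnable logits for one output feature (length d_v).
    Gumbel noise is built from uniform samples as -log(-log(eps)).
    """
    noisy = []
    for g in gamma:
        eps = rng.uniform(1e-10, 1.0 - 1e-10)   # keep eps strictly inside (0, 1)
        noisy.append((g - math.log(-math.log(eps))) / temp)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]   # numerically stable softmax
    norm = sum(exps)
    return [v / norm for v in exps]

rng = random.Random(0)
gamma = [0.0, 2.0, 0.5]                                  # hypothetical logits, d_v = 3
soft = relaxed_one_hot(gamma, temperature(0), rng)       # high temperature: smooth weights
sharp = relaxed_one_hot(gamma, temperature(2000), rng)   # low temperature: typically peaked
```

A high temperature early in training keeps the weights spread over many features so that gradients reach all of them; as the temperature drops, each column tends toward a single dominant entry, which matches the description of a matrix that approaches one-hot form after training.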

$$\mu_v + \epsilon \cdot \sigma_v = z_v \tag{14}$$

wherein the encoder represents the nonlinear transformation of a multi-layer neural network, $\theta_v$ denotes the parameters of that neural network, and $\epsilon$ is randomly sampled from the standard normal distribution. The mean $\mu_v$ and variance $\sigma_v$ of the latent representation are first obtained from $u_v$, and the reparameterization technique is then used to obtain the latent representation $z_v$ of each omics data set.
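The reparameterization in equation (14) separates the randomness (∈) from the quantities the encoder produces (μ_v and σ_v), so gradients can flow through μ_v and σ_v during training. A minimal sketch follows, with hypothetical encoder outputs and a hypothetical `training` flag that switches between sampling ∈ from N(0, 1) and fixing it to 0, as the description does once training is complete.

```python
import random

def reparameterize(mu, sigma, training, rng=random):
    """Return z = mu + eps * sigma, with eps ~ N(0, 1) in training and 0 otherwise."""
    z = []
    for m, s in zip(mu, sigma):
        eps = rng.gauss(0.0, 1.0) if training else 0.0
        z.append(m + eps * s)
    return z

mu_v, sigma_v = [0.3, -1.2], [0.5, 0.9]     # hypothetical encoder outputs
z_train = reparameterize(mu_v, sigma_v, training=True, rng=random.Random(0))
z_infer = reparameterize(mu_v, sigma_v, training=False)
```

With ∈ = 0 the latent representation equals the mean, so predictions for a given sample are deterministic after training.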

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1) \tag{17}$$

wherein $V$ represents the number of omics possessed by the sample, set to 3 in the embodiment of the present invention; $\mu_0$ and $\sigma_0$ represent the mean and variance of the prior distribution, set to 0 and 1 respectively in the embodiment of the present invention; and $\epsilon$ is randomly sampled from the standard normal distribution. The means $\mu_v$ and variances $\sigma_v$ of the available omics data are integrated to obtain the mean $\mu$ and variance $\sigma$ of the joint omics, and the reparameterization technique is used to obtain the latent representation $z$ of the joint omics data.

wherein the omics-specific predictor function and $f_\psi$ represent the nonlinear transformations of multi-layer neural networks, and $\psi$ and the corresponding omics-specific parameters are the parameters of these neural networks.

$$L_T = L_J + \alpha \sum_{v \in V} L_v \tag{22}$$

wherein $\alpha$ and $\beta$ are balance coefficients for combining the different losses, set to 1 and 0.001 respectively in the embodiment of the present invention.

$N(\mu_0, \sigma_0)$ represents the prior distribution; $N(\mu_v, \sigma_v)$ represents the distribution of the latent representation of the $v$-th omics; $N(\mu, \sigma)$ represents the distribution of the joint latent representation; KL() denotes the KL divergence, i.e. the relative entropy, between two distributions; $y$ is the true label in one-hot form; $y_v$ is the omics-specific predicted probability value; $\hat{y}$ is the final predicted probability value; $n$ is the number of samples; $\lambda$ is the learning rate used when training the algorithm framework, set to 0.01 in the embodiment of the present invention; and the quantities being updated are the parameters of the neural networks in the algorithm framework. This completes one training iteration of the model, in which the parameters of the neural networks are updated.

The invention also provides a disease prediction system based on a variational neural network, which applies the above disease prediction method based on a variational neural network. In some embodiments, the system comprises:

a joint encoder module for integrating the latent representations of the individual omics; and

Optionally, in some embodiments, the feature selection layer module is a boolean matrix.

Optionally, in some embodiments, the encoder module is comprised of a fully connected neural network.

Optionally, in some embodiments, the joint encoder module is a calculation module.

Optionally, in some embodiments, the predictor and joint predictor module is comprised of a fully connected neural network.

The invention processes the multi-omics data both separately and jointly, making full use of the information within each omics while also mining the interaction information between omics. It is suitable for prediction on samples with incomplete multi-omics data, reduces the requirement on the completeness of a sample's multi-omics data, and provides a technical means for disease prediction from intestinal multi-omics data.

Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and similar elements thereof may be made without departing from the spirit and principles of the present invention.

Claims

1. The disease prediction method based on the variational neural network is characterized by comprising the following steps of:

2. The disease prediction method based on the variational neural network according to claim 1, wherein the data conversion and normalization process in the step 2 comprises the steps of:

$$x = \log_2(2x + 0.00001)$$

wherein x represents flora abundance data;

3. The disease prediction method based on the variational neural network according to claim 2, wherein obtaining the probability value of disease from the multi-omics data with the algorithm framework in step 3 comprises the following steps:

step 3.1: the processed flora abundance data, pathway data and metabolite abundance data are organized into three matrices, where the matrix $x_v \in \mathbb{R}^{n \times d_v}$ representing the $v$-th omics data consists of $d_v$ features of $n$ samples; each matrix first undergoes feature selection through the trained feature selection layer, computed as follows:

$$u_v = x_v \cdot s_v$$

$$\mu_v + \epsilon \cdot \sigma_v = z_v$$

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$

step 3.4: the latent representation $z$ of the joint omics data is passed through a trained joint predictor consisting of fully connected layers and activation functions to obtain the probability value of disease $\hat{y}$, computed as follows:

4. A disease prediction method based on a variational neural network as claimed in claim 3, wherein the training process of the algorithm framework comprises the steps of:

5. The disease prediction method based on the variational neural network according to claim 4, wherein the one training process of the algorithm framework in step S3 for performing supervised training by using training set data comprises the following steps:

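$$T_e = T_0 \cdot (T_E / T_0)^{e/E}$$

$$u_v = x_v \cdot s_v$$

$$\mu_v + \epsilon \cdot \sigma_v = z_v$$

$$\mu + \epsilon \cdot \sigma = z, \quad \epsilon \sim N(0, 1)$$

$$L_T = L_J + \alpha \sum_{v \in V} L_v$$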

6. A disease prediction system based on a variational neural network, which applies the disease prediction method based on a variational neural network according to any one of claims 1 to 5, said system comprising:

a joint encoder module for integrating the latent representations of the individual omics; and

7. The disease prediction system based on a variational neural network of claim 6 wherein said feature selection layer module is a boolean matrix.

8. The disease prediction system based on a variational neural network of claim 6 wherein said encoder module is comprised of a fully connected neural network.

9. The disease prediction system based on a variational neural network of claim 6, wherein said joint encoder module is a calculation module.

10. The disease prediction system based on a variational neural network of claim 6 wherein said predictor and joint predictor module is comprised of a fully connected neural network.