CN114783608B

CN114783608B - Method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder

Info

Publication number: CN114783608B
Application number: CN202210507317.9A
Authority: CN
Inventors: 邱航; 胡智栩; 杨萍; 王利亚
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-05-10
Filing date: 2022-05-10
Publication date: 2023-05-05
Anticipated expiration: 2042-05-10
Also published as: CN114783608A

Abstract

The invention relates to the technical field of medical information, and in particular to a method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder. The method constructs patient-disease bipartite graphs from the hospitalization records and historical disease information of patients, and then extracts feature vectors for patients and diseases respectively. Finally, a disease risk prediction model based on a graph attention mechanism is built on the graph autoencoder framework to predict the future disease risk of chronic disease patients. Because the attention mechanism is used in the encoder of the disease risk prediction model and the edge weights are taken into account, the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.

Description

Method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder

Technical Field

The invention relates to the technical field of medical information, and in particular to a method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder.

Background

Population aging and the rapid rise in the incidence of chronic diseases impose a heavy social and economic burden worldwide. It is estimated that more than 75% of the elderly have more than one chronic disease, and multimorbidity in the elderly (two or more chronic diseases) has become a prominent global problem, leading to greater medical needs, heavier use of medical services and higher costs. Chronic diseases are correlated in complex ways; some may lead to the onset of others, further increasing the treatment burden on patients. The prevention and treatment of chronic diseases and their complications has therefore become an unprecedented challenge. Effectively predicting the future disease risk of chronic disease patients allows doctors to intervene early and reduces the risk of related diseases, which helps prevent those diseases and is of great practical significance. Existing disease risk prediction methods mainly have the following problems:

(1) Some prediction methods model disease prediction as a series of binary classification models, each predicting whether one disease occurs. The number of models therefore grows with the number of diseases to be predicted, which limits the practicality of these methods.

(2) Some prediction methods use the patient's historical disease information, abstract it into a patient-disease bipartite graph, and model the task as a link prediction problem, predicting disease risk with heuristics such as the Common Neighbors (CN) index and the Adamic-Adar (AA) index. These methods consider only the topology of the bipartite graph and ignore individual differences between patients, such as sex and age.

(3) Most existing prediction methods do not consider the complex influence relationships among diseases, so their prediction performance is poor.

Disclosure of Invention

To solve the technical problems in the prior art, the invention provides a method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder. It addresses the problem described in the background, namely that existing prediction methods do not consider the complex relationships among diseases and therefore predict poorly.

The technical scheme adopted by the invention is as follows:

the method for constructing a disease risk prediction model for chronic disease patient populations based on a graph autoencoder comprises the following steps:

step 1: acquiring a data set of historical medical record front pages, preprocessing the data in the data set, and storing the preprocessed historical case data in a storage space established on a storage medium;

step 2: dividing the preprocessed historical case data, in time order, into diseases the patient has had historically and diseases the patient develops in the future; constructing a patient-disease encoding bipartite graph from the historical diseases, and a patient-disease decoding bipartite graph from the diseases developed in the next N years;

step 3: retrieving the historical case data from the storage space, and extracting patient feature vectors and disease feature vectors from it;

step 4: establishing an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and establishing a disease risk prediction model based on the encoder and the decoder;

step 4.1: establishing a heuristic feature extraction model;

step 4.2: establishing a neighbor sampling strategy;

step 4.3: using a graph attention network as the encoder, wherein the encoder comprises at least one graph convolution module, and the graph convolution layer of each graph convolution module learns the weights of different neighbors with a graph attention mechanism to obtain the final embedding vectors;

step 4.4: constructing a bilinear decoder based on the patient-disease decoding bipartite graph, wherein the bilinear decoder predicts the existence probability of each edge in the decoding graph from the known patient and disease embedding vectors and the heuristic features of the edge;

step 5: training the disease risk prediction model on the data set of historical medical record front pages.

For a new hospitalization record, the invention likewise obtains the patient's individual information, hospital information and historical disease diagnoses, and extracts the corresponding patient feature vector and disease feature vectors; the data of the new hospitalization record is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model outputs the patient's risk for each of the other diseases, and the diseases are ranked in descending order of risk.

The encoder of the disease risk prediction model uses the attention mechanism and also takes the edge weights into account, so the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.

Preferably, the preprocessing in step 1 removes variables whose missing rate in the data set is greater than 30%, and, for the remaining variables that still contain missing values, fills each missing value with the mean of the non-missing values.
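The preprocessing rule can be sketched as follows. This is a minimal illustration, assuming records are held as a list of dictionaries with `None` marking a missing value and that all variables are numeric; the function name `preprocess` and the sample fields (`age`, `bmi`, `stay_days`) are hypothetical and do not come from the patent.

```python
from statistics import mean

def preprocess(records, max_missing_rate=0.30):
    """Drop variables whose missing rate exceeds the threshold, then
    mean-fill the remaining gaps. None marks a missing value."""
    n = len(records)
    kept, fill = [], {}
    for col in records[0].keys():
        present = [r[col] for r in records if r[col] is not None]
        missing_rate = 1 - len(present) / n
        if missing_rate > max_missing_rate:
            continue  # variable rejected: too many missing values
        kept.append(col)
        fill[col] = mean(present)  # mean of the non-missing part
    return [
        {col: (r[col] if r[col] is not None else fill[col]) for col in kept}
        for r in records
    ]

# Hypothetical toy records: "bmi" is 75% missing, "stay_days" is 25% missing.
rows = [
    {"age": 70, "bmi": None, "stay_days": 5},
    {"age": 64, "bmi": None, "stay_days": None},
    {"age": 81, "bmi": 23.0, "stay_days": 9},
    {"age": 77, "bmi": None, "stay_days": 7},
]
clean = preprocess(rows)
```

In this toy input `bmi` is dropped and the missing `stay_days` value is replaced by the mean of the three observed values.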

Preferably, an edge in the patient-disease encoding bipartite graph represents a disease that the patient has had historically, and its weight represents the number of times the disease occurred; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein a positive sample is a disease the patient newly develops in the next N years, and a negative sample is a disease the patient does not newly develop in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedding vectors of patient nodes and disease nodes, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.

Preferably, the patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the Elixhauser comorbidity index (ECI) of the historical diseases; discrete features are one-hot encoded into 0-1 binary variables; numerical features are treated as continuous, real-valued features; and discrete features whose values have an ordinal relationship are encoded as numerical features.

Preferably, the disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease (the number of patients with the disease divided by the total number of patients) is calculated as a feature characterizing how common the disease is.

Preferably, the step 4 includes the steps of:

step 4.1: establishing a heuristic feature extraction model:

$$CN_{ij} = \left|\Gamma(i) \cap \hat{\Gamma}(j)\right|$$

$$AA_{ij} = \sum_{z \in \Gamma(i) \cap \hat{\Gamma}(j)} \frac{1}{\log\left|\Gamma(z)\right|}$$

$$JC_{ij} = \frac{\left|\Gamma(i) \cap \hat{\Gamma}(j)\right|}{\left|\Gamma(i) \cup \hat{\Gamma}(j)\right|}$$

$$PA_{ij} = \left|\Gamma(i)\right| \cdot \left|\Gamma(j)\right|$$

where $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor sets of nodes i, j and z respectively, node i being the central node; $|\cdot|$ is the size of a set; $\hat{\Gamma}(j)$ is the second-order neighbor set of node j; $CN_{ij}$ is the Common Neighbors index of edge (i, j) of the patient-disease encoding bipartite graph, $AA_{ij}$ is its Adamic-Adar index, $JC_{ij}$ is its Jaccard's coefficient index, and $PA_{ij}$ is its Preferential Attachment index; the larger the value of an index, the higher the occurrence probability of the edge;
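The four heuristic indices can be illustrated with a short sketch. It assumes the encoding graph is stored as a dictionary `adj` from each node to its neighbor set, that the second-order neighbor set of j excludes j itself, and that neighbors of degree 1 are skipped in the Adamic-Adar sum to avoid dividing by log 1 = 0; these conventions, and the name `heuristic_features`, are illustrative assumptions rather than details stated in the patent.

```python
import math

def heuristic_features(adj, i, j):
    """Return (CN, AA, JC, PA) for candidate edge (patient i, disease j).

    adj maps every node (patient or disease) to its neighbour set in the
    encoding bipartite graph.
    """
    gamma_i = adj[i]
    # Second-order neighbours of j: nodes two hops from j (assumed to exclude j).
    second_j = {z for u in adj[j] for z in adj[u]} - {j}
    common = gamma_i & second_j
    cn = len(common)
    # Degree-1 nodes are skipped because log(1) = 0 (an assumption).
    aa = sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)
    union = gamma_i | second_j
    jc = cn / len(union) if union else 0.0
    pa = len(gamma_i) * len(adj[j])
    return cn, aa, jc, pa

# Hypothetical toy graph: three patients, three diseases.
adj = {
    "p1": {"A", "B"}, "p2": {"A", "C"}, "p3": {"B", "C"},
    "A": {"p1", "p2"}, "B": {"p1", "p3"}, "C": {"p2", "p3"},
}
cn, aa, jc, pa = heuristic_features(adj, "p1", "C")
```

Here patient `p1` has not had disease `C`, but both of `p1`'s diseases co-occur with `C` in other patients, so all four indices are relatively high for this candidate edge.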

step 4.2: establishing a neighbor sampling strategy:

$$p_{ij} = \frac{w_{ij}}{\sum_{u \in \Gamma(i)} w_{iu}}$$

where $w_{ij}$ and $p_{ij}$ are the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph respectively, $w_{iu}$ is the weight of edge (i, u) of the patient-disease encoding bipartite graph, and $\Gamma(i)$ is the neighbor set of the central node i; based on the sampling probability $p_{ij}$, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
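A minimal sketch of this weight-proportional sampling with replacement, using Python's standard `random.choices`. The helper names `sampling_probs` and `sample_neighbors` and the toy weights are hypothetical.

```python
import random

def sampling_probs(edge_weights):
    """Sampling probability of each neighbour: its edge weight over the sum."""
    total = sum(edge_weights.values())
    return {nbr: w / total for nbr, w in edge_weights.items()}

def sample_neighbors(edge_weights, k, rng):
    """Draw k neighbours with replacement, proportionally to edge weight."""
    nbrs = list(edge_weights)
    return rng.choices(nbrs, weights=[edge_weights[n] for n in nbrs], k=k)

# Hypothetical patient with hypertension recorded 3 times and diabetes once.
weights = {"I10": 3, "E11": 1}
probs = sampling_probs(weights)
sample = sample_neighbors(weights, k=10, rng=random.Random(0))
```

Because sampling is with replacement, a fixed number of neighbors (10 here) can be drawn even when the node has fewer distinct neighbors.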

step 4.3: using a graph attention network as the encoder, wherein the encoder comprises at least one graph convolution module, and the graph convolution layer of each graph convolution module learns the weights of different neighbors with a graph attention mechanism to obtain the final embedding vectors; the features of layer l of the encoder are defined as $H^{(l)}$, with $h_i^{(l)}$ the embedding vector of node i; the multi-head attention weight $\alpha_{c,ij}^{(l)}$ from node j to node i is calculated by the following formulas:

$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}$$

$$k_{c,j}^{(l)} = W_{c,k}^{(l)} \left(h_j^{(l)} \,\|\, w_{ij}\right) + b_{c,k}^{(l)}$$

$$\alpha_{c,ij}^{(l)} = \frac{\left\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} \right\rangle}{\sum_{u \in \Gamma(i)} \left\langle q_{c,i}^{(l)}, k_{c,u}^{(l)} \right\rangle}, \qquad \langle q, k \rangle = \exp\left(\frac{q^{T} k}{\sqrt{d}}\right)$$

where $q_{c,i}^{(l)}$ is the query vector of the central node i for the c-th attention head in layer l of the encoder; $W_{c,q}^{(l)}$ is the weight matrix of the query vector q for the c-th head in layer l; $h_i^{(l)}$ is the embedding vector of the central node i in layer l; $b_{c,q}^{(l)}$ is the bias term of the query vector q for the c-th head in layer l; $k_{c,j}^{(l)}$ is the key vector of node j for the c-th head in layer l; $W_{c,k}^{(l)}$ is the weight matrix of the key vector k for the c-th head in layer l; $h_j^{(l)}$ is the embedding vector of node j in layer l; $w_{ij}$ is the weight of edge (i, j); $\|$ denotes vector concatenation; $b_{c,k}^{(l)}$ is the bias term of the key vector k for the c-th head in layer l; $\alpha_{c,ij}^{(l)}$ is the attention weight of edge (i, j) for the c-th head in layer l; $k_{c,u}^{(l)}$ is the key vector of node u for the c-th head in layer l, with u ranging over the neighbor set $\Gamma(i)$ of node i; $\langle q, k \rangle$ is the exponentially scaled vector dot product, and d is the dimension of the vectors;
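The attention weight computation for a single head can be sketched in plain Python. This is an illustrative sketch only: the helper `linear`, the function name `attention_weights` and the toy matrices are hypothetical, and a real implementation would use a tensor library and several heads.

```python
import math

def linear(W, x, b):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def attention_weights(h_i, neighbours, Wq, bq, Wk, bk):
    """Normalised attention weights of one head.

    neighbours is a list of (h_j, w_ij): neighbour embedding and edge weight.
    """
    q = linear(Wq, h_i, bq)
    d = len(q)
    scores = []
    for h_j, w_ij in neighbours:
        # The key is built from the neighbour embedding joined with the edge weight.
        key = linear(Wk, h_j + [w_ij], bk)
        dot = sum(a * b for a, b in zip(q, key))
        scores.append(math.exp(dot / math.sqrt(d)))  # exponentially scaled dot product
    total = sum(scores)
    return [s / total for s in scores]

# Toy setup: identical neighbour embeddings, different edge weights.
Wq, bq = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
Wk, bk = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], [0.0, 0.0]
alpha = attention_weights([1.0, 0.0], [([1.0, 0.0], 2.0), ([1.0, 0.0], 1.0)], Wq, bq, Wk, bk)
```

With these toy matrices the two neighbors differ only in edge weight, so the neighbor with the heavier edge receives the larger attention weight.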

after the multi-head attention weights are obtained, a message aggregation operation is performed on the embedding vectors of the different neighbors:

$$v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}$$

$$\hat{h}_i^{(l)} = \Big\Vert_{c=1}^{C} \left[ \sum_{j \in \Gamma(i)} \alpha_{c,ij}^{(l)} \, v_{c,j}^{(l)} \right]$$

where $v_{c,j}^{(l)}$ is the value vector of node j for the c-th attention head in layer l of the encoder; $W_{c,v}^{(l)}$ is the weight matrix of the value vector v for the c-th head in layer l; $b_{c,v}^{(l)}$ is the bias term of the value vector v for the c-th head in layer l; $\alpha_{c,ij}^{(l)}$ is the attention weight of edge (i, j) for the c-th head; $\hat{h}_i^{(l)}$ is the attention vector of the central node i in layer l; C is the total number of attention heads; and $\Vert$ denotes the vector concatenation operation;

the embedding vector $h_i^{(l)}$ of the central node i is combined with the attention vector $\hat{h}_i^{(l)}$, and a gated residual mechanism selectively controls the inflow of information, thereby giving the embedding vector of the next layer, $h_i^{(l+1)}$. The specific calculation formulas are as follows:

$$r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}$$

$$\beta_i^{(l)} = \mathrm{sigmoid}\left(W_g^{(l)} \left[\hat{h}_i^{(l)} \,\|\, r_i^{(l)} \,\|\, \hat{h}_i^{(l)} - r_i^{(l)}\right]\right)$$

$$h_i^{(l+1)} = \mathrm{ReLU}\left(\mathrm{LayerNorm}\left(\left(1 - \beta_i^{(l)}\right) \hat{h}_i^{(l)} + \beta_i^{(l)} r_i^{(l)}\right)\right)$$

where $r_i^{(l)}$ is the information of the central node i in layer l of the encoder; $W_r^{(l)}$ is the weight matrix of the central node i in layer l; $b_r^{(l)}$ is the bias term of the central node i in layer l; $\beta_i^{(l)}$ is the gated residual weight of the central node i in layer l; $\hat{h}_i^{(l)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l)} - r_i^{(l)}$ are concatenated in sequence, linearly transformed by the weight matrix $W_g^{(l)}$, and mapped to the interval from 0 to 1 by a sigmoid function, so as to control the inflow of information from $r_i^{(l)}$ and $\hat{h}_i^{(l)}$; finally, the embedding vector $h_i^{(l+1)}$ of the central node i in layer l+1 is obtained through LayerNorm and a ReLU activation function;
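The gated residual step can be sketched as below. Several details are assumptions made for illustration: the gate input is taken to be the aggregated message, the transformed center embedding and their difference (a common gated-residual formulation), the gate weight multiplies the residual branch, the gate has no bias, and LayerNorm is applied without learnable scale and shift. The name `gated_residual` is hypothetical.

```python
import math

def gated_residual(h_hat, r, w_g, eps=1e-5):
    """Mix aggregated message h_hat with transformed centre embedding r.

    Returns (beta, h_next). Gate inputs and mixing convention are assumptions.
    """
    gate_in = h_hat + r + [a - b for a, b in zip(h_hat, r)]  # concatenation
    logit = sum(w * x for w, x in zip(w_g, gate_in))
    beta = 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the gate to (0, 1)
    mixed = [(1 - beta) * a + beta * b for a, b in zip(h_hat, r)]
    # LayerNorm without learnable affine parameters, then ReLU.
    mu = sum(mixed) / len(mixed)
    var = sum((x - mu) ** 2 for x in mixed) / len(mixed)
    normed = [(x - mu) / math.sqrt(var + eps) for x in mixed]
    return beta, [max(0.0, x) for x in normed]

# With zero gate weights the gate is exactly 0.5 (an even mix of both inputs).
beta, h_next = gated_residual([0.0, 4.0], [0.0, 0.0], [0.0] * 6)
```

A gate near 0 keeps mostly the neighbor message, while a gate near 1 keeps mostly the node's own transformed embedding, which is how the mechanism limits over-smoothing from neighbors.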

Step 4.4: constructing a bilinear decoder, wherein the bilinear decoder takes the known embedding vectors of patients and diseases and predicts the existence probability of each edge in the patient-disease decoding graph, calculated as follows:

$$z_{ij} = \Big\Vert_{b=1}^{B} \; h_i^{T} W_b \, h_j$$

$$x_{ij} = z_{ij} \,\|\, s_{ij}$$

$$p_{ij} = \mathrm{sigmoid}\left(W_o \, x_{ij} + b_o\right)$$

where $s_{ij}$ denotes the indices corresponding to edge (i, j), which are used as heuristic features; $h_i^{T}$ is the transpose of the embedding vector of node i, and $h_j$ is the embedding vector of node j; following the multi-head attention mechanism, several weight matrices $W_b$ (b = 1, ..., B) are used to learn the interaction between $h_i^{T}$ and $h_j$, and the learned results are concatenated to obtain $z_{ij}$; $z_{ij}$ is concatenated with the heuristic features to form the hidden-layer feature $x_{ij}$ of the edge; finally, a linear transformation by the weight matrix $W_o$ plus the bias term $b_o$ gives the result of the output layer, and a sigmoid activation function gives the predicted probability $p_{ij}$ of edge (i, j).
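A minimal sketch of such a bilinear decoder in plain Python. The function name `bilinear_decoder`, the two toy weight matrices and the two heuristic values are hypothetical; in practice the embeddings would come from the trained encoder and all parameters would be learned.

```python
import math

def bilinear_decoder(h_i, h_j, W_list, heuristics, w_o, b_o):
    """Predicted probability that edge (i, j) exists."""
    # One bilinear score h_i^T W h_j per weight matrix, concatenated.
    z = [
        sum(h_i[a] * W[a][b] * h_j[b] for a in range(len(h_i)) for b in range(len(h_j)))
        for W in W_list
    ]
    x = z + list(heuristics)  # hidden feature of the edge
    logit = sum(w * v for w, v in zip(w_o, x)) + b_o
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

# Toy inputs: 2-d embeddings, two bilinear matrices, two heuristic features.
W_list = [[[0.0, 1.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]]
p = bilinear_decoder([1.0, 0.0], [0.0, 1.0], W_list, (2.0, 0.5),
                     w_o=[1.0, 1.0, -0.5, 0.0], b_o=0.0)
```

Using several bilinear matrices lets each one capture a different kind of patient-disease interaction, while the heuristic features add purely topological evidence to the same output layer.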

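The loss function uses cross entropy and is calculated as follows:

$$Loss = -\sum_{e_{ij} \in G_{dec}} \left[ y_{ij} \log p_{ij} + \left(1 - y_{ij}\right) \log\left(1 - p_{ij}\right) \right]$$

where $G_{dec}$ is the decoding graph, $e_{ij}$ is edge (i, j), $y_{ij}$ is the label of the edge, and $p_{ij}$ is the predicted probability of the edge; the Loss of the model is optimized with a gradient descent algorithm to train the disease risk prediction model.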
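The cross-entropy loss over the edges of the decoding graph can be written as a short sketch. Summing over edges (rather than averaging) and clipping probabilities by a small `eps` for numerical safety are assumptions of this illustration; `bce_loss` is a hypothetical name.

```python
import math

def bce_loss(probs, labels, eps=1e-12):
    """Binary cross entropy summed over decoding-graph edges."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# One positive and one negative edge, both predicted at 0.5.
loss = bce_loss([0.5, 0.5], [1, 0])
```

The loss shrinks as positive edges receive probabilities close to 1 and negative edges probabilities close to 0.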

Preferably, the preprocessed data set is divided into a training set, a validation set and a test set in a ratio of 7:1:2; the training set is used to train the disease risk prediction model, the validation set is used to tune the parameters of the disease risk prediction model, and the test set is used to evaluate the generalization performance of the disease risk prediction model.

Preferably, all negative samples in the data set are collected to form a negative sample set, the negative sample set is sampled to generate the negative samples used for training the disease risk prediction model, and the ratio of positive samples to negative samples is set to 1:10.

The beneficial effects of the invention include:

1. For a new hospitalization record, the invention likewise obtains the patient's individual information, hospital information and historical disease diagnoses, and extracts the corresponding patient feature vector and disease feature vectors; the data of the new hospitalization record is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model outputs the patient's risk for each of the other diseases, and the diseases are ranked in descending order of risk.

The encoder of the disease risk prediction model uses the attention mechanism and also takes the edge weights into account, so the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.

2. The final output ranks the predicted probabilities of the diseases in descending order, realizing risk prediction for all diseases, which gives the invention wide practical value.

3. The invention completes modeling using only the patient's medical record front page data; it extracts patient feature vectors and disease feature vectors, mines the available information from every aspect, and strengthens the predictive ability of the model.

4. In addition to the node embedding vectors learned by the encoder, the decoder of the disease risk prediction model extracts heuristic features such as CN and AA for each edge; these features supply additional information, so the model converges faster and performs better.

Drawings

FIG. 1 is a construction of a patient-disease bipartite graph of the present invention.

Fig. 2 is a diagram showing a disease risk prediction model structure according to the present invention.

FIG. 3 is a training flow chart of the disease risk prediction model of the present invention.

Fig. 4 is a prediction flow chart of the disease risk prediction model of the present invention.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.

Embodiments of the present invention are described in further detail below with reference to figs. 1 to 4:

The medical record front page data is the record generated after a patient completes a hospitalization. Each record contains the patient's individual information (encrypted identity card number, sex, age, admission time, discharge time and the like), hospital information (hospital grade, hospital address and the like) and the diagnoses of the hospitalization (one principal diagnosis and at most 15 secondary diagnoses), coded according to the International Classification of Diseases, 10th Revision (ICD-10). The data must be preprocessed: variables whose missing rate in the data set is greater than 30% are removed, and for the remaining variables that still contain missing values, each missing value is filled with the mean of the non-missing values. The resulting data without missing values is stored in a storage space established on a storage medium, such as a database.

The objective of the present invention is to predict a patient's disease risk over the next N years based on the patient's historical diseases and individual information. Therefore, before the data without missing values is formed, patients whose hospitalization records span more than N years must be selected; the chronic disease diagnoses from hospitalizations in the last N years are treated as prediction labels, and the earlier hospitalization history is treated as known information. By setting the value of N, the invention can predict future disease risk at different time granularities.

Referring to fig. 1, to predict the diseases a patient may develop in the next N years, the present invention abstracts the task into a link prediction problem on a bipartite graph. The left-hand nodes of the bipartite graph represent different patients, the right-hand nodes represent different diseases, and only edges from patients to diseases exist. The patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedding vectors of patient nodes and disease nodes, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.

An edge in the patient-disease encoding bipartite graph represents a disease that the patient has had historically, and its weight represents the number of times the disease occurred. A solid line in the patient-disease decoding bipartite graph represents a positive sample, i.e., a disease the patient newly develops in the next N years; a dashed line represents a negative sample, i.e., a disease the patient does not newly develop in the next N years. The edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph.
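The construction of the two graphs can be sketched as follows, assuming diagnoses are available as (patient, disease) pairs already split into a history window and a future N-year window. The function name `build_graphs` and the toy data are hypothetical.

```python
from collections import Counter

def build_graphs(history, future, diseases):
    """Build the encoding graph (weighted edges) and decoding graph (labelled edges).

    history / future: lists of (patient, disease) pairs before and within
    the N-year prediction window.
    """
    # Encoding graph: edge weight = number of times the diagnosis occurred.
    enc_edges = Counter(history)
    patients = sorted({p for p, _ in history})
    future_set = set(future)
    dec_edges = {}
    for p in patients:
        for d in diseases:
            if (p, d) in enc_edges:
                continue  # complete bipartite graph minus the encoding graph
            # Label 1: newly developed in the window (positive); 0: negative.
            dec_edges[(p, d)] = 1 if (p, d) in future_set else 0
    return dict(enc_edges), dec_edges

history = [("p1", "I10"), ("p1", "I10"), ("p1", "E11"), ("p2", "I10")]
future = [("p1", "N18"), ("p2", "E11")]
enc, dec = build_graphs(history, future, ["E11", "I10", "N18"])
```

In this toy example the repeated hypertension diagnosis of `p1` becomes an encoding edge of weight 2, and every patient-disease pair absent from the history becomes a labelled decoding edge.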

The patient-disease encoding bipartite graph constructed by the present invention is used by the disease risk prediction model to automatically learn the embedding vectors of patient nodes and disease nodes, while the patient-disease decoding bipartite graph is used to infer the patient's future disease risk. (The self-built model is named the GADP model, for Graph Attention Disease Prediction; for convenience of description it is referred to herein as the disease risk prediction model.)

The patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the Elixhauser Comorbidity Index (ECI) of the historical diseases; the ECI quantifies the patient's physical condition to some extent. Discrete features are one-hot encoded into 0-1 binary variables; numerical features are treated as continuous, real-valued features; and discrete features whose values have an ordinal relationship are encoded as numerical features.

See in particular table 1 below:

TABLE 1 extraction of feature vectors for patient nodes

Referring to table 1 above, the third column in table 1 is the data type of each feature. A numerical feature is treated as a continuous, real-valued feature. A discrete feature is converted into 0-1 binary variables by one-hot encoding. However, some discrete fields have ordered values: for example, the admission condition field takes the values critical, urgent and general. Such a field is encoded as a numerical feature, i.e., 1, 2 and 3, which reduces the feature dimension while retaining the ordinal information.
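The three encoding rules can be illustrated on a single record. The field names (`age`, `sex`, `admission_condition`) and their value sets are hypothetical stand-ins, since the contents of table 1 are not reproduced here.

```python
def encode_patient(record):
    """Encode one patient record into a flat feature vector."""
    sexes = ["F", "M"]  # unordered discrete field: one-hot
    condition_rank = {"critical": 1, "urgent": 2, "general": 3}  # ordered: 1, 2, 3
    one_hot_sex = [1.0 if record["sex"] == s else 0.0 for s in sexes]
    return (
        [float(record["age"])]  # numerical field: kept as a real value
        + one_hot_sex
        + [float(condition_rank[record["admission_condition"]])]
    )

vec = encode_patient({"age": 72, "sex": "M", "admission_condition": "urgent"})
```

One-hot encoding the ordered field would need three columns and lose the ordering; the ordinal code needs a single column.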

The disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease is calculated as a feature characterizing how common the disease is.
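A minimal sketch of the disease feature vector, assuming the input is a mapping from each patient to the set of ICD-10 codes in that patient's history and that prevalence is the number of patients with the disease divided by the total number of patients. The function name `disease_features` and the toy data are hypothetical.

```python
def disease_features(patient_diseases):
    """Return {ICD-10 code: one-hot vector + [prevalence]}."""
    # Ascending sort of ICD-10 codes gives each disease its serial number.
    codes = sorted({c for ds in patient_diseases.values() for c in ds})
    index = {c: k for k, c in enumerate(codes)}
    total = len(patient_diseases)
    feats = {}
    for c in codes:
        one_hot = [0.0] * len(codes)
        one_hot[index[c]] = 1.0
        prevalence = sum(c in ds for ds in patient_diseases.values()) / total
        feats[c] = one_hot + [prevalence]
    return feats

feats = disease_features(
    {"p1": {"I10", "E11"}, "p2": {"I10"}, "p3": {"N18"}, "p4": {"I10"}}
)
```

Each disease vector therefore has one position per disease plus one trailing prevalence value.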

The present invention uses a graph autoencoder (GAE) as the basic architecture for link prediction. As an end-to-end model, the graph autoencoder automatically learns the embedding vector of each node in the encoding graph, and the decoder then predicts the probability of each edge in the decoding graph. The core components of the autoencoder are the encoder and the decoder. The present invention uses a graph attention network (GAT) as the encoder and a bilinear layer as the decoder, and names this model the Graph Attention Disease Prediction (GADP) model; its network structure is shown in fig. 2.

The step 4 comprises the following steps:

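step 4.1: establishing a heuristic feature extraction model:

$$CN_{ij} = \left|\Gamma(i) \cap \hat{\Gamma}(j)\right|, \qquad AA_{ij} = \sum_{z \in \Gamma(i) \cap \hat{\Gamma}(j)} \frac{1}{\log\left|\Gamma(z)\right|}$$

$$JC_{ij} = \frac{\left|\Gamma(i) \cap \hat{\Gamma}(j)\right|}{\left|\Gamma(i) \cup \hat{\Gamma}(j)\right|}, \qquad PA_{ij} = \left|\Gamma(i)\right| \cdot \left|\Gamma(j)\right|$$

where $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor sets of nodes i, j and z respectively, and $\hat{\Gamma}(j)$ is the second-order neighbor set of node j;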

step 4.2: neighbor sampling strategies in graph neural networks usually sample a certain number of neighbors from a uniform random distribution. However, different diseases affect a patient to different degrees. The present invention therefore takes the edge weights of the patient-disease encoding bipartite graph into account and designs a non-uniform neighbor sampling strategy in which a larger weight gives a higher sampling probability. The specific neighbor sampling strategy is as follows:

$$p_{ij} = \frac{w_{ij}}{\sum_{u \in \Gamma(i)} w_{iu}}$$

where $w_{ij}$ and $p_{ij}$ are the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph respectively, and $\Gamma(i)$ is the neighbor set of the central node i;

step 4.3: referring to fig. 2, a graph attention network is used as the encoder. The encoder comprises two identical graph convolution modules, and the graph convolution layer of each module learns the weights of different neighbors with a graph attention mechanism to obtain the final embedding vectors. The features of layer l of the encoder are defined as $H^{(l)}$, with $h_i^{(l)}$ the embedding vector of node i. The multi-head attention weight $\alpha_{c,ij}^{(l)}$ from node j to node i is calculated by the following formulas:

$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}$$

$$k_{c,j}^{(l)} = W_{c,k}^{(l)} \left(h_j^{(l)} \,\|\, w_{ij}\right) + b_{c,k}^{(l)}$$

$$\alpha_{c,ij}^{(l)} = \frac{\left\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} \right\rangle}{\sum_{u \in \Gamma(i)} \left\langle q_{c,i}^{(l)}, k_{c,u}^{(l)} \right\rangle}, \qquad \langle q, k \rangle = \exp\left(\frac{q^{T} k}{\sqrt{d}}\right)$$

where $\langle q, k \rangle$ is the exponentially scaled vector dot product, d is the dimension of the vectors, and $\Gamma(i)$ is the neighbor set of node i.

First, the layer-l embedding vector of the central node, $h_i^{(l)}$, is linearly transformed by $W_{c,q}^{(l)}$ into the query vector $q_{c,i}^{(l)}$. The neighbor embedding vector $h_j^{(l)}$ is concatenated with the edge weight $w_{ij}$ and linearly transformed by $W_{c,k}^{(l)}$ into the key vector $k_{c,j}^{(l)}$. The attention weight of the edge is then calculated with $\langle q, k \rangle$. Finally, a normalization operation yields the normalized attention weight $\alpha_{c,ij}^{(l)}$.

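After the multi-head attention weights of the graph are obtained, a message aggregation operation is performed on the embedding vectors of the different neighbors:

$$v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}$$

$$\hat{h}_i^{(l)} = \Big\Vert_{c=1}^{C} \left[ \sum_{j \in \Gamma(i)} \alpha_{c,ij}^{(l)} \, v_{c,j}^{(l)} \right]$$

where C is the total number of attention heads and $\Vert$ is the vector concatenation operation. First, the value vector $v_{c,j}^{(l)}$ is obtained from $h_j^{(l)}$ by a linear transformation with $W_{c,v}^{(l)}$. Then the value vectors of the neighbors are summed, weighted by the attention weights $\alpha_{c,ij}^{(l)}$. The results of the individual heads are then concatenated to form the neighbor-aggregated multi-head attention vector $\hat{h}_i^{(l)}$.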

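The embedding vector $h_i^{(l)}$ of the central node is combined with the attention vector $\hat{h}_i^{(l)}$ under a gated residual mechanism. The specific calculation formulas are as follows:

$$r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}$$

$$\beta_i^{(l)} = \mathrm{sigmoid}\left(W_g^{(l)} \left[\hat{h}_i^{(l)} \,\|\, r_i^{(l)} \,\|\, \hat{h}_i^{(l)} - r_i^{(l)}\right]\right)$$

$$h_i^{(l+1)} = \mathrm{ReLU}\left(\mathrm{LayerNorm}\left(\left(1 - \beta_i^{(l)}\right) \hat{h}_i^{(l)} + \beta_i^{(l)} r_i^{(l)}\right)\right)$$

where $r_i^{(l)}$ is obtained from the embedding vector $h_i^{(l)}$ of the central node by a linear transformation with $W_r^{(l)}$, and $\beta_i^{(l)}$ is the weight of the gated residual. $\hat{h}_i^{(l)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l)} - r_i^{(l)}$ are concatenated in sequence, linearly transformed by $W_g^{(l)}$, and mapped to the interval from 0 to 1 by a sigmoid function, so as to control the inflow of information from $r_i^{(l)}$ and $\hat{h}_i^{(l)}$. Finally, the embedding vector $h_i^{(l+1)}$ of the central node i in layer l+1 is obtained through LayerNorm and a ReLU activation function.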

Step 4.4: constructing a bilinear decoder. Each edge in the patient-disease decoding bipartite graph corresponds to a unique patient and disease. The bilinear decoder takes the known embedding vectors of the patient and the disease and predicts the existence probability of the edge in the decoding graph, calculated as follows:

$$z_{ij} = \Big\Vert_{b=1}^{B} \; h_i^{T} W_b \, h_j, \qquad x_{ij} = z_{ij} \,\|\, s_{ij}, \qquad p_{ij} = \mathrm{sigmoid}\left(W_o \, x_{ij} + b_o\right)$$

where several weight matrices $W_b$ learn the interaction between $h_i^{T}$ and $h_j$, and the learned results are concatenated together with the heuristic features $s_{ij}$ to form the hidden-layer feature $x_{ij}$ of the edge; the subscript b of $W_b$ is used only to distinguish the different weight matrices.

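The loss function uses cross entropy and is calculated as follows:

$$Loss = -\sum_{e_{ij} \in G_{dec}} \left[ y_{ij} \log p_{ij} + \left(1 - y_{ij}\right) \log\left(1 - p_{ij}\right) \right]$$

where $G_{dec}$ is the decoding graph, $e_{ij}$ is edge (i, j), $y_{ij}$ is the label of the edge and $p_{ij}$ is its predicted probability.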

To train the disease risk prediction model quickly, the negative samples used for training are sampled from the full negative sample set. The invention sets the sampling ratio of positive to negative samples to 1:10; for example, if a patient has 3 positive samples, 30 negative samples are sampled.
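The 1:10 negative sampling can be sketched per patient as below. Sampling without replacement and capping at the number of available negatives are assumptions of this illustration; `sample_negatives` and the placeholder disease codes are hypothetical.

```python
import random

def sample_negatives(positives, negatives, ratio=10, rng=None):
    """Sample `ratio` negative edges per positive edge for one patient."""
    rng = rng or random.Random()
    k = min(len(negatives), ratio * len(positives))  # cap at what is available
    return rng.sample(negatives, k)  # without replacement (an assumption)

positives = ["N18", "I25", "J44"]                 # 3 newly developed diseases
negatives = [f"D{n:02d}" for n in range(50)]      # 50 placeholder candidates
sampled = sample_negatives(positives, negatives, rng=random.Random(0))
```

With 3 positive samples and 50 candidate negatives, 30 negatives are drawn, matching the example in the text.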

In the data set division stage, the data set is divided by patient into a training set, a validation set and a test set in a ratio of 7:1:2. The training set is used to train the disease risk prediction model; the validation set is used to tune the parameters of the model; the test set is used to evaluate the generalization performance of the model. During model inference, all candidate samples are tested to obtain the predicted probability of each disease. The predicted probabilities are then sorted to obtain the risk ranking of the different diseases.
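A patient-level 7:1:2 split can be sketched as follows. The fixed seed, the integer rounding rule and the name `split_patients` are illustrative assumptions.

```python
import random

def split_patients(patient_ids, seed=0):
    """Split unique patient IDs into train / validation / test at 7:1:2."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 7 // 10
    n_val = len(ids) // 10
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_patients([f"p{k}" for k in range(100)])
```

Splitting by patient rather than by record keeps all hospitalizations of one patient in the same subset, so no patient contributes to both training and evaluation.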

The disease risk prediction model is trained in mini-batches: each step samples a subset of nodes and their neighbors to train the network, so the network can be trained on large-scale graph data. The model performs well and scales well. When new data must be predicted, the model does not need to be retrained on the whole graph as some other graph neural network models do; the prediction requires only the neighbor information of the nodes concerned. The number of neighbors sampled per layer of the disease risk prediction model is 10. To optimize the model parameters, gradient descent with back-propagation is used to optimize the parameters of the weight matrices, giving a trained disease risk prediction model.

And carrying out disease risk prediction on the new hospitalization record by adopting a trained disease risk prediction model:

Referring to fig. 4, for a new hospitalization record, the patient's individual information, hospital information and historical disease diagnoses can likewise be obtained, and the corresponding patient feature vector and disease feature vectors can be extracted. The patient is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the GADP model is used to obtain the patient's risk for each of the other diseases, the risks are sorted in descending order, and the top-N diseases are returned.
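The final ranking step can be sketched as follows, assuming the model's output is available as a mapping from disease code to predicted probability. The function name `top_n_risks` and the toy probabilities are hypothetical.

```python
def top_n_risks(risk_by_disease, known_diseases, n):
    """Rank diseases the patient does not yet have by predicted risk."""
    candidates = {d: p for d, p in risk_by_disease.items() if d not in known_diseases}
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]

risks = {"I10": 0.9, "E11": 0.4, "N18": 0.7, "J44": 0.2}
top2 = top_n_risks(risks, known_diseases={"I10"}, n=2)
```

Diseases already in the patient's history are excluded because they are edges of the encoding graph rather than candidates in the decoding graph.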

The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.

Claims

1. The method for constructing the slow patient group disease risk prediction model based on the graph self-encoder is characterized by comprising the following steps of:

step 4: an encoder and a decoder are established based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, the encoder being a graph attention network, and a disease risk prediction model is established based on the encoder and the decoder, specifically comprising the following steps:

step 4.1: establishing a heuristic feature extraction model;

step 4.2: establishing a neighbor sampling strategy;

2. The method according to claim 1, wherein the preprocessing in step 1 comprises discarding the variables in the data set whose missing rate is greater than 30%, and, for the remaining variables that have missing values, filling the missing values with the mean of the non-missing values.
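
For illustration only, this preprocessing rule can be sketched as below. Representing a missing value as `None`, the column names and the sample rows are assumptions of the example.

```python
def preprocess(rows, max_missing_rate=0.3):
    """Drop high-missingness variables; mean-fill the remaining gaps."""
    n_rows = len(rows)
    cleaned = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        if (n_rows - len(present)) / n_rows > max_missing_rate:
            continue  # missing rate above 30%: discard the variable
        mean = sum(present) / len(present)
        cleaned[col] = [mean if v is None else v for v in values]
    return cleaned

# Hypothetical records: "age" is 20% missing, "bmi" is 40% missing.
rows = [
    {"age": 60, "bmi": 24.0},
    {"age": 70, "bmi": None},
    {"age": None, "bmi": None},
    {"age": 80, "bmi": 27.5},
    {"age": 50, "bmi": 22.1},
]
cleaned = preprocess(rows)
```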

3. The method of constructing a model for predicting disease risk in a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein the edges in the patient-disease encoding bipartite graph represent the diseases that the patient has had historically, and the edge weights represent the number of occurrences of each disease; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein a positive sample is a disease that the patient newly develops in the next N years, and a negative sample is a disease that the patient does not newly develop in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedding vector representations of the patient nodes and the disease nodes and to extract heuristic features, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
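
A minimal sketch of how the two bipartite graphs could be assembled is given below. The dictionary-based edge representation, the function name `build_graphs` and the toy diagnoses are assumptions of the example.

```python
from collections import Counter

def build_graphs(history, future, all_diseases):
    """Build weighted encoding edges and labelled decoding edges."""
    enc_edges = {}  # (patient, disease) -> number of historical occurrences
    dec_edges = {}  # (patient, disease) -> 1 (new disease) or 0 (not new)
    for patient, past in history.items():
        counts = Counter(past)
        for disease, weight in counts.items():
            enc_edges[(patient, disease)] = weight
        new = set(future.get(patient, ()))
        for disease in all_diseases:
            if disease not in counts:  # complete graph minus encoding edges
                dec_edges[(patient, disease)] = int(disease in new)
    return enc_edges, dec_edges

# Hypothetical diagnoses keyed by patient, using ICD-10 codes.
history = {"p1": ["I10", "I10", "E11"], "p2": ["N18"]}
future = {"p1": ["N18"]}  # new diseases within the next N years
all_diseases = ["E11", "I10", "I25", "N18"]
enc_edges, dec_edges = build_graphs(history, future, all_diseases)
```

In this toy example the decoding graph holds one positive edge (`p1`, `N18`) and four negative edges, and it shares no edge with the encoding graph.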

4. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein the extracted patient feature vector includes individual information, hospital information, the number of historical diseases and the ECI comorbidity index of the historical diseases; features of discrete type are one-hot encoded and converted into 0-1 binary variables; features of numerical type are treated as continuous features taking real values; and features that are discrete but have an ordinal relationship are encoded as numerical features.
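
The three encoding rules can be illustrated with the following sketch. All field names (`sex`, `age`, `hospital_level`, `n_history_diseases`, `eci`) and category lists are hypothetical and chosen only to show one feature of each type.

```python
def one_hot(value, categories):
    """Encode a discrete value as a 0-1 vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_patient(patient, sex_categories, level_order):
    """Concatenate one-hot, continuous and ordinal features into one vector."""
    vec = one_hot(patient["sex"], sex_categories)                    # discrete
    vec.append(float(patient["age"]))                                # numerical
    vec.append(float(level_order.index(patient["hospital_level"])))  # ordinal
    vec.append(float(patient["n_history_diseases"]))                 # numerical
    vec.append(float(patient["eci"]))                                # numerical
    return vec

# Hypothetical patient record.
patient = {"sex": "F", "age": 72, "hospital_level": "tertiary",
           "n_history_diseases": 3, "eci": 5}
vector = encode_patient(patient, ["F", "M"], ["primary", "secondary", "tertiary"])
```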

5. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein the disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain the serial number of each disease node, and then generating a vector for each disease node by one-hot encoding; and the prevalence of each disease is calculated as a feature characterizing how common the disease is.
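
As an illustrative sketch, the disease feature vector can be formed by appending the prevalence to the one-hot vector. Prevalence is computed here as the fraction of patients whose history contains the disease; this definition, the function name and the toy histories are assumptions of the example.

```python
def disease_features(codes, patient_histories):
    """Return {ICD-10 code: one-hot vector + [prevalence]} for each disease."""
    ordered = sorted(set(codes))  # ascending ICD-10 order gives serial numbers
    index = {code: i for i, code in enumerate(ordered)}
    n_patients = len(patient_histories)
    features = {}
    for code in ordered:
        vec = [0.0] * len(ordered)
        vec[index[code]] = 1.0
        prevalence = sum(code in h for h in patient_histories) / n_patients
        features[code] = vec + [prevalence]
    return features

# Hypothetical histories of four patients.
histories = [{"I10", "E11"}, {"I10"}, {"N18"}, {"I10"}]
features = disease_features(["I10", "E11", "N18"], histories)
```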

6. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein step 4 comprises the following steps:

step 4.1: establishing a heuristic feature extraction model:

[…]

where […] and […]; […] is the second-order neighbor set of node j;

step 4.2: establishing a neighbor sampling strategy:

[…]

where w_ij and […];

the multi-head attention weight from node j to node i, […], is calculated by the following formula:

[…]

where […] represents the query vector of the c-th attention head at central node i in the l-th layer of the encoder; […] represents the attention weight of edge (i, j) at the c-th attention head in the l-th layer of the encoder;

[…]

where the embedding vector of central node i, […], and […]; the specific calculation formula is as follows:

[…]

where r_i^(l) represents the information of central node i in the l-th layer of the encoder; W_r^(l) represents the weight matrix of central node i in the l-th layer of the encoder; […] represents the bias term of central node i in the l-th layer of the encoder;

r_i^(l) and […] are concatenated in turn and linearly transformed by the weight matrix W_g^(l), and the value range is mapped to the interval from 0 to 1 by a sigmoid function, thereby controlling r_i^(l) and […];

[…]

where […] represents the index corresponding to edge (i, j) of the patient-disease encoding bipartite graph, which is used as a heuristic feature; […] learning […] and h_j, after which the learned results are concatenated to obtain […]; […]

The loss function uses cross entropy and is calculated as follows:

[…]

where G_dec represents the decoding graph, e_ij represents edge (i, j), and y_ij represents the label of the edge; the loss of the model is optimized using a gradient descent algorithm to train the disease risk prediction model.
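
The cross-entropy objective over the edges of the decoding graph can be illustrated as follows. Because the formula itself is not reproduced in this text, the sketch assumes the standard binary cross entropy averaged over the edges e_ij of G_dec; the averaging, the clipping constant `eps` and the sample probabilities are assumptions of the example.

```python
import math

def cross_entropy_loss(pred, labels, eps=1e-12):
    """Mean binary cross entropy over labelled edges of the decoding graph."""
    total = 0.0
    for edge, y in labels.items():
        p = min(max(pred[edge], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

# Hypothetical edge labels y_ij and predicted edge probabilities.
labels = {("p1", "N18"): 1, ("p1", "I25"): 0}
uninformed = {("p1", "N18"): 0.5, ("p1", "I25"): 0.5}
confident = {("p1", "N18"): 0.9, ("p1", "I25"): 0.1}
loss_uninformed = cross_entropy_loss(uninformed, labels)
loss_confident = cross_entropy_loss(confident, labels)
```

A prediction of 0.5 for every edge gives a loss of ln 2, and predictions closer to the labels give a smaller loss, which is the quantity that gradient descent reduces during training.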

7. The method for constructing a slow patient group disease risk prediction model based on a graph self-encoder according to claim 1, wherein the preprocessed data set is divided into a training set, a validation set and a test set in a ratio of 7:1:2; the training set is used to train the disease risk prediction model, the validation set is used to tune the parameters of the disease risk prediction model, and the test set is used to evaluate the generalization performance of the disease risk prediction model.

8. The method of constructing a disease risk prediction model for a slow patient group based on a graph self-encoder according to claim 1, wherein all negative samples in the data set are collected to form a negative sample set, the negative sample set is sampled to generate the negative samples used for training the disease risk prediction model, and the ratio of positive samples to negative samples is set to 1:10.
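
For illustration, negative sampling at a 1:10 ratio can be sketched as below. Uniform sampling without replacement, the fixed seed and the toy edge labels are assumptions of the example.

```python
import random

def sample_negatives(dec_edges, ratio=10, seed=0):
    """Keep all positive edges and sample `ratio` negatives per positive."""
    positives = sorted(e for e, y in dec_edges.items() if y == 1)
    negatives = sorted(e for e, y in dec_edges.items() if y == 0)
    k = min(len(negatives), ratio * len(positives))
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    return positives, rng.sample(negatives, k)

# Hypothetical decoding edges: 3 positive edges and 120 negative edges.
dec_edges = {(f"p{i}", "N18"): 1 for i in range(3)}
dec_edges.update({(f"p{i}", f"d{j}"): 0 for i in range(3) for j in range(40)})
positives, negatives = sample_negatives(dec_edges)
```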