CN114783608A - Construction method of a chronic disease patient population disease risk prediction model based on a graph autoencoder - Google Patents
Construction method of a chronic disease patient population disease risk prediction model based on a graph autoencoder
- Publication number
- CN114783608A CN114783608A CN202210507317.9A CN202210507317A CN114783608A CN 114783608 A CN114783608 A CN 114783608A CN 202210507317 A CN202210507317 A CN 202210507317A CN 114783608 A CN114783608 A CN 114783608A
- Authority
- CN
- China
- Prior art keywords
- disease
- patient
- encoder
- vector
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention relates to the technical field of medical information, and in particular to a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder. A patient-disease bipartite graph is constructed from patients' hospitalization records and historical disease information, and feature vectors are then extracted for patients and diseases respectively. Finally, a disease risk prediction model based on a graph attention mechanism is built on the graph autoencoder architecture to predict the future disease risk of chronic disease patients. Because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
Description
Technical Field
The invention relates to the technical field of medical information, and in particular to a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder.
Background
The worsening of population aging and the steep rise in the incidence of chronic diseases impose a severe social and economic burden worldwide. It is estimated that over 75% of the elderly have more than one chronic disease, and multimorbidity (two or more chronic diseases present simultaneously) among the elderly has become a prominent global problem, resulting in greater medical needs and higher use of medical services and costs. Complex interrelationships exist between chronic diseases, and some chronic diseases may induce others, further increasing patients' treatment burden. The prevention and treatment of chronic diseases and related complications has become an unavoidable problem. Effectively predicting the future disease risk of chronic disease patients allows doctors to intervene in advance and reduce the risk of related diseases, nipping them in the bud, which is of great practical significance. The existing disease risk prediction methods mainly have the following problems:
(1) Some prediction methods model the disease prediction problem as a series of binary classification models, each predicting whether one disease occurs. This modeling approach causes the number of models to grow with the number of predicted diseases, limiting the models' practicality.
(2) Some prediction methods use a patient's historical disease information, abstract it into a patient-disease bipartite graph, model the problem as a link prediction problem, and predict disease risk with heuristic methods such as the Common Neighbors (CN) index and the Adamic-Adar (AA) index. These consider only the topological information of the bipartite graph and ignore patients' individual differences, such as gender and age.
(3) Most existing prediction methods do not consider the complex influence relationships among diseases, resulting in poor prediction performance.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder, aiming to solve the technical problem, described in the background art, that existing prediction methods ignore the complex influence relationships among diseases and therefore predict poorly.
The technical scheme adopted by the invention is as follows:
The construction method of the chronic disease patient population disease risk prediction model based on the graph autoencoder comprises the following steps:
Step 1: acquire a data set of historical medical record first pages, preprocess the data in the data set, and store the preprocessed historical case data in a storage space established on a storage medium;
Step 2: divide the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient will have in the future; construct the historical diseases as a patient-disease encoding bipartite graph, and construct the diseases of the next N years as a patient-disease decoding bipartite graph;
Step 3: retrieve the historical case data from the storage space, and extract patient feature vectors and disease feature vectors from it;
Step 4: establish an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and build the disease risk prediction model from the encoder and the decoder;
Step 5: train the disease risk prediction model on the data set of historical medical record first pages.
For a new hospitalization record, the invention can likewise obtain the patient's individual information, hospital information and historical disease diagnoses, and extract the corresponding patient feature vector and disease feature vectors; the new hospitalization record data are added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model yields the patient's risk for other diseases, sorted in descending order of risk.
In addition, because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
Preferably, the preprocessing in step 1 removes variables with a missing rate greater than 30% from the data set, and fills the missing values of the remaining variables with the mean of their non-missing parts.
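As an illustrative sketch of this preprocessing rule (plain Python with hypothetical variable names; a real implementation would typically use a dataframe library), records are dicts in which None marks a missing value:

```python
from statistics import mean

def preprocess(records, max_missing=0.30):
    """Drop variables whose missing rate exceeds max_missing, then fill the
    remaining missing values with the mean of the non-missing part."""
    n = len(records)
    variables = {k for r in records for k in r}
    kept = {}
    for var in variables:
        values = [r.get(var) for r in records]
        if sum(v is None for v in values) / n <= max_missing:
            kept[var] = values
    filled = {}
    for var, values in kept.items():
        m = mean(v for v in values if v is not None)  # mean of non-missing part
        filled[var] = [m if v is None else v for v in values]
    return filled
```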
Preferably, the edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, with weights representing the number of occurrences of each disease; the patient-disease decoding bipartite graph comprises positive and negative samples, where positive samples are diseases the patient newly develops in the next N years and negative samples are diseases the patient does not newly develop in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the encoding bipartite graph is used by the encoder to automatically learn the embedded vector expressions of patient nodes and disease nodes, and the decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
Preferably, the patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the Elixhauser Comorbidity Index (ECI) of the historical diseases; features of discrete type are one-hot encoded into 0-1 binary variables; features of numerical type are treated as continuous features taking real values; and discrete features whose values have an ordinal relationship are encoded as numerical features.
Preferably, the disease feature vectors are extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease (the number of patients with the disease divided by the total number of patients) is calculated as a feature characterizing how common the disease is.
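A minimal sketch of this disease-node feature construction (one-hot over the ascending-sorted ICD-10 codes plus the prevalence); the function name and the sample codes in the test are illustrative:

```python
def disease_features(icd_codes, patient_counts, total_patients):
    """icd_codes: ICD-10 codes of the disease nodes; patient_counts: dict
    code -> number of patients who had the disease. Each node gets a one-hot
    vector over the ascending-sorted codes, plus its prevalence (patients
    with the disease divided by total patients) as a hotness feature."""
    order = sorted(icd_codes)                       # ascending ICD-10 order
    index = {code: i for i, code in enumerate(order)}
    feats = {}
    for code in icd_codes:
        onehot = [0.0] * len(order)
        onehot[index[code]] = 1.0
        feats[code] = onehot + [patient_counts[code] / total_patients]
    return feats
```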
Preferably, the step 4 comprises the following steps:
step 4.1: establishing a heuristic characteristic extraction model:
In the formulas, $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor-node sets of nodes i, j and z respectively, where node i is the central node; $|\cdot|$ is the size of a set; $\Gamma^2(j)$ is the second-order neighbor set of node j. For edge (i, j) of the patient-disease encoding bipartite graph, the four heuristic indices are: the common-neighbors index $CN_{ij}=|\Gamma(i)\cap\Gamma^2(j)|$; the Adamic-Adar index $AA_{ij}=\sum_{z\in\Gamma(i)\cap\Gamma^2(j)}\frac{1}{\log|\Gamma(z)|}$; the Jaccard coefficient $JC_{ij}=\frac{|\Gamma(i)\cap\Gamma^2(j)|}{|\Gamma(i)\cup\Gamma^2(j)|}$; and the preferential-attachment index $PA_{ij}=|\Gamma(i)|\cdot|\Gamma^2(j)|$. The larger the value of an index, the higher the occurrence probability of the edge;
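The four heuristic indices can be sketched as follows for a patient node i and disease node j, taking common neighbours between $\Gamma(i)$ and the second-order neighbourhood $\Gamma^2(j)$ as the text describes; the `nb` adjacency dict and the toy graph in the test are assumptions:

```python
import math

def heuristic_features(nb, i, j):
    """Heuristic indices for edge (i, j) of the patient-disease encoding graph.
    nb maps each node to its neighbour set; for patient i and disease j the
    common neighbours are taken between Gamma(i) and the second-order
    neighbourhood Gamma2(j)."""
    gamma_i = nb[i]
    gamma2_j = set().union(*(nb[z] for z in nb[j])) if nb[j] else set()
    common = gamma_i & gamma2_j
    cn = len(common)                                    # common-neighbours index
    aa = sum(1.0 / math.log(len(nb[z])) for z in common if len(nb[z]) > 1)
    union = gamma_i | gamma2_j
    jc = cn / len(union) if union else 0.0              # Jaccard coefficient
    pa = len(gamma_i) * len(gamma2_j)                   # preferential attachment
    return cn, aa, jc, pa
```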
step 4.2: establishing a neighbor sampling strategy:
in the formula: w is aijAndweights and sampling probabilities, w, of the edges i, j of the patient-disease encoded bipartite graph, respectivelyiuWeights representing the edges i, u of the patient-disease encoded bipartite graph based on the sampling probabilitiesPerforming replacement sampling on neighbors of the central node to obtain a fixed number of neighbor samples;
step 4.3: the graph attention network is used as the encoder; the encoder comprises at least one graph convolution module, and the graph convolution layer of each module learns the weights of different neighbors with the graph attention mechanism to obtain the final embedded vector expression. In layer l of the encoder, the multi-head attention weight $\alpha_{ij}^{(l),c}$ from node j to node i at the c-th head is calculated by the following formulas:
$q_{c,i}^{(l)}=W_c^{(l),q}h_i^{(l)}+b_c^{(l),q}$,
$k_{c,j}^{(l)}=W_c^{(l),k}[h_j^{(l)}\,\|\,w_{ij}]+b_c^{(l),k}$,
$\alpha_{ij}^{(l),c}=\frac{\exp(\langle q_{c,i}^{(l)},k_{c,j}^{(l)}\rangle/\sqrt{d})}{\sum_{u\in\Gamma(i)}\exp(\langle q_{c,i}^{(l)},k_{c,u}^{(l)}\rangle/\sqrt{d})}$,
where $q_{c,i}^{(l)}$ is the query vector of central node i at the c-th head in the l-th layer network of the encoder, with weight matrix $W_c^{(l),q}$ and bias term $b_c^{(l),q}$; $h_i^{(l)}$ and $h_j^{(l)}$ are the embedded vectors of central node i and node j in the l-th layer; $k_{c,j}^{(l)}$ is the key vector of node j at the c-th head, with weight matrix $W_c^{(l),k}$ and bias term $b_c^{(l),k}$; $w_{ij}$ is the weight of edge (i, j); $\alpha_{ij}^{(l),c}$ is the attention weight of edge (i, j) at the c-th head in the l-th layer; $k_{c,u}^{(l)}$ is the key vector of node u; and $\sqrt{d}$ exponentially scales the vector dot product, d being the dimension of the vector;
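A minimal plain-Python sketch of one attention head, assuming (consistent with the surrounding description) that each key is computed from the neighbour embedding concatenated with the edge weight; `matvec` and all matrix values are illustrative stand-ins for learned parameters:

```python
import math

def matvec(W, x, b):
    """Affine map W x + b using plain lists (illustrative helper)."""
    return [sum(w * xc for w, xc in zip(row, x)) + bc for row, bc in zip(W, b)]

def attention_weights(h_i, neighbors, Wq, bq, Wk, bk):
    """One attention head of an encoder layer: project the central node's
    embedding h_i to a query; append the edge weight w_ij to each neighbour
    embedding h_j before projecting it to a key; the attention weights are
    the softmax of the scaled dot products <q, k> / sqrt(d)."""
    q = matvec(Wq, h_i, bq)
    scale = math.sqrt(len(q))
    scores = []
    for h_j, w_ij in neighbors:
        k = matvec(Wk, h_j + [w_ij], bk)
        scores.append(sum(qc * kc for qc, kc in zip(q, k)) / scale)
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]   # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```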
after obtaining the multi-head attention weight of the graph, performing message aggregation operation on the embedded vectors of different neighbors:
In the formulas, $v_{c,j}^{(l)}=W_c^{(l),v}h_j^{(l)}+b_c^{(l),v}$ is the value vector of node j at the c-th head in the l-th layer network of the encoder, with $W_c^{(l),v}$ and $b_c^{(l),v}$ the weight matrix and bias term of the value vector v at the c-th head; and $\hat{h}_i^{(l+1)}=\big\Vert_{c=1}^{C}\sum_{j\in\Gamma(i)}\alpha_{ij}^{(l),c}v_{c,j}^{(l)}$ is the attention vector of central node i in the (l+1)-th layer network of the encoder;
The attention vector $\hat{h}_i^{(l+1)}$ is combined with the embedded vector of central node i through a gated residual mechanism, which selectively controls the inflow of information to compute the embedded vector representation of the next layer. The specific calculation formulas are:
$r_i^{(l)}=W_r^{(l)}h_i^{(l)}$,
$g_i^{(l)}=\mathrm{sigmoid}(W_g^{(l)}[\hat{h}_i^{(l+1)}\,\|\,r_i^{(l)}\,\|\,\hat{h}_i^{(l+1)}-r_i^{(l)}])$,
$h_i^{(l+1)}=\mathrm{ReLU}(\mathrm{LayerNorm}((1-g_i^{(l)})\hat{h}_i^{(l+1)}+g_i^{(l)}r_i^{(l)}))$,
where $r_i^{(l)}$ is the residual information of central node i in the l-th layer network of the encoder, $W_r^{(l)}$ is the weight matrix of central node i, and $g_i^{(l)}$ is the gated-residual weight of central node i. $\hat{h}_i^{(l+1)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l+1)}-r_i^{(l)}$ are spliced in sequence, linearly transformed by the weight matrix $W_g^{(l)}$, and mapped to the range (0, 1) by the sigmoid function, thereby controlling the information inflow from $r_i^{(l)}$ and $\hat{h}_i^{(l+1)}$; finally, the embedded vector representation $h_i^{(l+1)}$ of central node i for the (l+1)-th layer network is obtained through LayerNorm and the ReLU activation function;
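The gated residual can be sketched as below; LayerNorm and ReLU are omitted for brevity, and the gate here is a scalar with illustrative weights, not the patent's learned parameters:

```python
import math

def gated_residual(h_hat, r, Wg, bg):
    """Gated residual: splice h_hat, r and their difference, map them to a
    scalar gate g in (0, 1) with a linear layer plus sigmoid, and use the
    gate to mix the attention output with the residual. LayerNorm and ReLU,
    applied afterwards in the text, are omitted here for brevity."""
    z = list(h_hat) + list(r) + [a - b for a, b in zip(h_hat, r)]
    g = 1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(Wg, z)) + bg)))
    return [(1 - g) * a + g * b for a, b in zip(h_hat, r)]
```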
Step 4.4: construct a bilinear decoder that predicts the existence probability of edges in the patient-disease decoding bipartite graph from the known embedded vector representations of patients and diseases. The calculation is as follows: $s_{ij}^{(c)}=h_i^{\top}W_ch_j$, where $h_i^{\top}$ is the transpose of the embedded vector of node i, $h_j$ is the vector of node j, and multiple weight matrices $W_c$ are used, borrowing the multi-head attention mechanism, to learn the combination of $h_i$ and $h_j$ from different angles; $f_{ij}$ denotes the heuristic features, i.e., the indices corresponding to edge (i, j); the learned results are then spliced with the heuristic features to form the hidden-layer feature of the edge $z_{ij}=[s_{ij}^{(1)}\,\|\,\cdots\,\|\,s_{ij}^{(C)}\,\|\,f_{ij}]$; finally, a linear transformation by the weight matrix $W_o$ plus a bias term $b_o$ gives the output layer, and the sigmoid activation function yields the prediction probability of edge (i, j): $p_{ij}=\mathrm{sigmoid}(W_oz_{ij}+b_o)$.
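A sketch of the bilinear decoder: each weight matrix gives one bilinear score, the scores are spliced with the heuristic features, and a linear output layer plus a sigmoid gives the edge probability. All parameter values here are illustrative:

```python
import math

def bilinear_decode(h_i, h_j, Ws, f_ij, Wo, bo):
    """Bilinear decoder: each weight matrix W_c yields one bilinear score
    h_i^T W_c h_j (several matrices mimic multi-head attention); the scores
    are spliced with the heuristic features f_ij of the edge, and a linear
    output layer followed by a sigmoid gives the edge probability p_ij."""
    scores = [sum(h_i[a] * W[a][b] * h_j[b]
                  for a in range(len(h_i)) for b in range(len(h_j)))
              for W in Ws]
    hidden = scores + list(f_ij)           # hidden-layer feature of the edge
    logit = sum(w * x for w, x in zip(Wo, hidden)) + bo
    return 1.0 / (1.0 + math.exp(-logit))
```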
The loss function uses cross entropy and is calculated as follows:
$\mathrm{Loss}=-\frac{1}{|G_{dec}|}\sum_{e_{ij}\in G_{dec}}\left[y_{ij}\log p_{ij}+(1-y_{ij})\log(1-p_{ij})\right]$, wherein $G_{dec}$ denotes the decoding graph, $e_{ij}$ the edge (i, j), and $y_{ij}$ the label of the edge; the model's Loss is optimized with a gradient descent algorithm to train the disease risk prediction model.
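The cross-entropy over decoding-graph edges can be sketched as:

```python
import math

def edge_cross_entropy(probs, labels):
    """Cross-entropy averaged over the edges of the decoding graph: probs are
    the predicted edge probabilities p_ij, labels the 0/1 edge labels y_ij."""
    eps = 1e-12                            # clip to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)
```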
Preferably, the preprocessed data set is divided into a training set, a validation set and a test set in the ratio 7:1:2; the training set is used to train the disease risk prediction model, the validation set to tune its parameters, and the test set to evaluate its generalization.
Preferably, all negative samples in the data set are collected to form a negative-sample set, which is then sampled to generate the negative samples used to train the disease risk prediction model, with the positive-to-negative sample ratio set to 1:10.
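A sketch of the 1:10 negative-sampling step; the seed and edge tuples are illustrative:

```python
import random

def sample_negatives(negative_edges, n_positive, ratio=10, seed=0):
    """Sample from the full negative-sample set so the training data keeps a
    positive:negative ratio of 1:`ratio` (seed fixed only for illustration)."""
    rng = random.Random(seed)
    k = min(len(negative_edges), n_positive * ratio)
    return rng.sample(list(negative_edges), k)
```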
The beneficial effects of the invention include:
1. For a new hospitalization record, the invention can likewise obtain the patient's individual information, hospital information and historical disease diagnoses, and extract the corresponding patient feature vector and disease feature vectors; the new hospitalization record data are added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model yields the patient's risk for other diseases, sorted in descending order of risk.
In addition, because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
2. The invention sorts the final output in descending order of disease probability, realizing risk prediction over all diseases, and thus has broad practical value.
3. According to the invention, modeling can be completed using only the first-page data of patients' medical records, from which the patient feature vectors and disease feature vectors are extracted; available information is thus comprehensively mined, enhancing the model's predictive ability.
4. The decoder part of the disease risk prediction model not only uses the node embedded vectors learned by the encoder but also extracts heuristic features such as CN and AA for each edge; these heuristic features supply additional information, so the model converges faster and performs better.
Drawings
FIG. 1 shows the construction of the patient-disease bipartite graph of the invention.
FIG. 2 is a diagram of the disease risk prediction model architecture of the present invention.
Fig. 3 is a training flowchart of the disease risk prediction model of the present invention.
Fig. 4 is a prediction flowchart of the disease risk prediction model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The following is a detailed description of embodiments of the invention, made with reference to FIGS. 1 to 4:
The construction method of the chronic disease patient population disease risk prediction model based on the graph autoencoder comprises the following steps:
Step 1: acquire a data set of historical medical record first pages, preprocess the data in the data set, and store the preprocessed historical case data in a storage space established on a storage medium;
the historical case first page data is a record item generated after the patient is completely hospitalized, each record comprises individual information (encrypted information such as identification number, sex, age, hospitalization time and discharge time) of the patient, information (information such as hospital grade and hospital address) of the hospitalization hospital and hospitalization Disease diagnosis (main diagnosis and at most 15 secondary diagnoses) of the patient, and the 10 th edition code of International Classification of Disease-review 10, ICD-10 is adopted; based on the above, data needs to be preprocessed, that is, variables with a deletion rate greater than 30% in a data set are removed, and the residual data with the deletion rate are filled with the missing values by using the mean value of the non-missing parts; data without missing values are obtained and stored in a storage space established in the storage medium, such as a database.
The goal of the invention is to predict a patient's disease risk N years into the future based on the patient's historical diseases and individual information. Therefore, before forming the data without missing values, patients whose hospitalization records span more than N years must be screened; the chronic disease diagnoses in each such patient's hospitalization records of the last N years are taken as prediction labels, and the historical diseases as known information. By setting the value of N, the invention can predict the patient's future disease risk at different time granularities.
Step 2: divide the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient will have in the future; construct the historical diseases as a patient-disease encoding bipartite graph, and construct the diseases of the next N years as a patient-disease decoding bipartite graph;
referring to fig. 1, in order to be able to predict the disease that may occur in a patient N years in the future, the present invention abstracts the task scenario into the link prediction problem of a bipartite graph, whose left nodes represent different patients and right nodes represent different diseases, and where only edges from patient to disease exist, a patient-disease encoded bipartite graph is used for an encoder to automatically learn the patient nodes and disease node embedded vector expressions, and a patient-disease decoded bipartite graph is used for a decoder to learn the occurrence probability of each edge.
The edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, with weights representing the number of occurrences of each disease. A solid line in the patient-disease decoding bipartite graph represents a positive sample, i.e., a disease the patient newly develops within the next N years; a dashed line represents a negative sample, i.e., a disease the patient does not newly develop within the next N years. The edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph.
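The subtraction that yields the decoding-graph edges can be sketched as:

```python
from itertools import product

def decoding_edges(patients, diseases, coding_edges):
    """Candidate edges of the patient-disease decoding bipartite graph: the
    complete bipartite graph between patients and diseases minus the edges
    already present in the encoding graph."""
    complete = set(product(patients, diseases))
    return complete - set(coding_edges)
```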
The patient-disease encoding bipartite graph constructed by the method is used by the disease risk prediction model (named the GADP model, i.e., Graph Attention Disease risk Prediction model; for convenience of expression it is referred to herein as the disease risk prediction model) to automatically learn the embedded vector expressions of patient nodes and disease nodes, and the patient-disease decoding bipartite graph is used to compute the patient's future disease risk.
Step 3: retrieve the historical case data from the storage space, and extract patient feature vectors and disease feature vectors from it;
the extraction of the patient feature vector comprises individual information, hospital information, historical disease number and ECI (ECI) co-morbidity Index of historical diseases, wherein the ECI co-morbidity Index can quantify the physical condition of the patient to a certain extent; carrying out one-hot coding on the data with the characteristic type of discrete type, and converting the data into a binary variable of 0-1; taking the data with the characteristic type of numerical value as continuous characteristic, and taking the value as real number; and (3) encoding the data with the characteristic type of discrete type and the data with the value having the sequence relation into the numerical characteristic.
See in particular table 1 below:
TABLE 1 extraction of feature vectors for patient nodes
Referring to Table 1 above, the third column gives the data type of each feature: numerical features are regarded as continuous and take real values, while discrete features must be one-hot encoded into 0-1 binary variables. However, the "hospitalization condition" field takes the values critical, urgent and general; although these are discrete, they have an ordinal relationship, so to reduce the data dimension the feature is encoded as a numerical feature, i.e., 1, 2 and 3. This reduces the feature dimension while preserving the ordering information.
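A sketch of this ordinal encoding for the "hospitalization condition" field; the English value names are assumed translations of the original field values:

```python
# Ordinal encoding for the "hospitalization condition" field: its discrete
# values carry an order (critical < urgent < general), so a single numeric
# code preserves that order in one dimension instead of a three-dimensional
# one-hot vector.
CONDITION_CODE = {"critical": 1, "urgent": 2, "general": 3}

def encode_condition(value):
    return CONDITION_CODE[value]
```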
The disease feature vectors are extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease is calculated as a feature characterizing how common the disease is.
Step 4: establish an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and build the disease risk prediction model from the encoder and the decoder;
the invention uses Graph auto-encoder (GAE) as basic prediction architecture of link prediction. The graph self-encoder is used as an end-to-end model, and can automatically learn the embedded vector expression of each node in the encoding graph, and then a decoder is used for predicting the probability of decoding each edge in the graph. The core components of the graph self-encoder are an encoder and a decoder. The invention uses Graph Attention Networks (GAT) as an encoder, Bilinear layer (Bilinear layer) as a decoder, the model is named Graph Attention Disease Prediction (GADP) model, and the network structure is shown as figure 2.
The step 4 comprises the following steps:
step 4.1: establishing a heuristic characteristic extraction model:
In the formulas, $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor-node sets of nodes i, j and z respectively, where node i is the central node; $|\cdot|$ is the size of a set; $\Gamma^2(j)$ is the second-order neighbor set of node j. For edge (i, j) of the patient-disease encoding bipartite graph, the four heuristic indices are: the common-neighbors index $CN_{ij}=|\Gamma(i)\cap\Gamma^2(j)|$; the Adamic-Adar index $AA_{ij}=\sum_{z\in\Gamma(i)\cap\Gamma^2(j)}\frac{1}{\log|\Gamma(z)|}$; the Jaccard coefficient $JC_{ij}=\frac{|\Gamma(i)\cap\Gamma^2(j)|}{|\Gamma(i)\cup\Gamma^2(j)|}$; and the preferential-attachment index $PA_{ij}=|\Gamma(i)|\cdot|\Gamma^2(j)|$. The larger the value of an index, the higher the occurrence probability of the edge;
step 4.2: in the neighbor sampling strategies of graph neural networks, a certain number of neighbors are usually sampled under a uniform random distribution; however, because different diseases influence patients to different degrees, a non-uniform neighbor sampling strategy is designed that takes into account the edge weights of the patient-disease encoding bipartite graph, so that the larger the weight, the higher the sampling probability. The specific neighbor sampling strategy is as follows:
In the formula, $p_{ij}=\frac{w_{ij}}{\sum_{u\in\Gamma(i)}w_{iu}}$, where $w_{ij}$ and $p_{ij}$ are respectively the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph, and $w_{iu}$ is the weight of edge (i, u); based on the sampling probabilities $p_{ij}$, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
step 4.3: referring to fig. 2, a graph attention network is used as the encoder. The encoder comprises two identical graph convolution modules, and the graph convolution layer of each module learns the weights of the different neighbors with an attention mechanism to obtain the final embedded vector representation. In layer l of the encoder, the multi-head attention weight from node j to node i at head c is calculated by the following formulas:

q_{c,i}^(l) = W_{c,q}^(l) h_i^(l) + b_{c,q}^(l)
k_{c,j}^(l) = W_{c,k}^(l) [h_j^(l) || w_ij] + b_{c,k}^(l)
α_{c,ij}^(l) = exp(⟨q_{c,i}^(l), k_{c,j}^(l)⟩ / √d) / Σ_{u ∈ N(i)} exp(⟨q_{c,i}^(l), k_{c,u}^(l)⟩ / √d)

where q_{c,i}^(l) is the query vector of central node i at head c in layer l of the encoder; W_{c,q}^(l) is the weight matrix of the query vector q at head c in layer l; h_i^(l) is the embedded vector of central node i in layer l; b_{c,q}^(l) is the bias term of the query vector q at head c in layer l; k_{c,j}^(l) is the key vector of node j at head c in layer l; W_{c,k}^(l) is the weight matrix of the key vector k at head c in layer l; h_j^(l) is the embedded vector of node j in layer l; w_ij is the weight of edge (i, j); b_{c,k}^(l) is the bias term of the key vector k at head c in layer l; α_{c,ij}^(l) is the attention weight of edge (i, j) at head c in layer l; k_{c,u}^(l) is the key vector of node u at head c in layer l; and the dot product is scaled by √d, where d is the dimension of the vectors;
firstly, the embedded vector h_i^(l) of the central node at layer l is linearly transformed by W_{c,q}^(l) into the query vector q_{c,i}^(l); the embedded vector h_j^(l) of a neighbor node and the edge weight w_ij are concatenated and then linearly transformed by W_{c,k}^(l) into the key vector k_{c,j}^(l). The attention score of the edge is then computed as the dot product ⟨q, k⟩ scaled by √d, where d is the dimension of the vectors; finally, a softmax normalization yields the normalized attention weight α_{c,ij}^(l).
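A single-head, plain-Python sketch of this attention computation (multi-head operation repeats it once per head; all names and shapes are illustrative assumptions, not the patent's implementation):

```python
import math

def attention_weights(h_i, h_nbrs, w_edges, Wq, Wk, bq, bk):
    """Single-head attention weights of one center node over its neighbors.

    The query comes from the center embedding; the key comes from each
    neighbor embedding concatenated with the scalar edge weight. Scores
    are scaled by sqrt(d) and softmax-normalized over the neighbors.
    """
    def matvec(W, x):
        return [sum(wr * xc for wr, xc in zip(row, x)) for row in W]

    d = len(bq)
    q = [a + b for a, b in zip(matvec(Wq, h_i), bq)]
    scores = []
    for h_j, w in zip(h_nbrs, w_edges):
        k = [a + b for a, b in zip(matvec(Wk, h_j + [w]), bk)]
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    m = max(scores)                         # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The returned weights sum to one and favor neighbors whose keys align with the center node's query.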
After obtaining the multi-head attention weights, a message aggregation operation is performed over the embedded vectors of the different neighbors:

v_{c,j}^(l) = W_{c,v}^(l) h_j^(l) + b_{c,v}^(l)
ĥ_i^(l+1) = ||_{c=1..C} Σ_{j ∈ N(i)} α_{c,ij}^(l) v_{c,j}^(l)

where C is the total number of attention heads and || is the vector concatenation operation. First the value vector v_{c,j}^(l) is obtained from h_j^(l) through the linear transformation W_{c,v}^(l); the value vectors are then weighted and summed with the previously computed attention weights α_{c,ij}^(l); finally, the results of the C heads are concatenated into the neighbor-aggregated multi-head attention vector ĥ_i^(l+1).
The embedded vector h_i^(l) of the central node and the aggregated vector ĥ_i^(l+1) are then combined through a gated residual mechanism that selectively controls the inflow of information to compute the embedded vector representation of the next layer:

r_i^(l) = W_r^(l) h_i^(l)
β_i^(l) = sigmoid( W_g^(l) [ ĥ_i^(l+1) || r_i^(l) || (ĥ_i^(l+1) - r_i^(l)) ] )
h_i^(l+1) = ReLU( LayerNorm( (1 - β_i^(l)) ĥ_i^(l+1) + β_i^(l) r_i^(l) ) )

where r_i^(l) is the embedded vector h_i^(l) of the central node linearly transformed by W_r^(l), and β_i^(l) is the gated residual weight: ĥ_i^(l+1), r_i^(l) and ĥ_i^(l+1) - r_i^(l) are concatenated in sequence, linearly transformed by W_g^(l), and mapped into the interval (0, 1) by a sigmoid function, thereby controlling how much information flows in from r_i^(l) and from ĥ_i^(l+1); finally, the embedded vector representation h_i^(l+1) of central node i at layer l+1 is obtained through LayerNorm and a ReLU activation function.
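A sketch of the gated residual combination; the exact gating formula appears only in prose above, so the concatenation order and the mixing convention (1 - beta) * h_agg + beta * r are assumptions:

```python
import math

def gated_residual(h_agg, r, gate_w):
    """Gate between the aggregated neighbor message h_agg and the
    transformed center embedding r. The gate value beta is a sigmoid of
    a linear map over the concatenation [h_agg ; r ; h_agg - r], and the
    output mixes the two streams as (1 - beta) * h_agg + beta * r.
    LayerNorm/ReLU of the full layer are omitted for brevity."""
    diff = [a - b for a, b in zip(h_agg, r)]
    x = h_agg + r + diff  # concatenation [h_agg ; r ; h_agg - r]
    beta = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(gate_w, x))))
    return [(1 - beta) * a + beta * b for a, b in zip(h_agg, r)]
```

With an all-zero gate vector the sigmoid gives beta = 0.5, so the output is the elementwise mean of the two streams.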
Step 4.4: constructing a bilinear decoder. In the patient-disease decoding bipartite graph, one edge corresponds to a unique patient and a unique disease; given the embedded vector representations of the patient and the disease, the bilinear decoder predicts the existence probability of the edge by the following formulas:

s_ij^(b) = h_i^T W_b h_j, b = 1, ..., B
z_ij = [ s_ij^(1) || ... || s_ij^(B) || f_ij ]
p_ij = sigmoid( W_o z_ij + b_o )

where f_ij is the heuristic feature vector corresponding to edge (i, j); h_i^T is the transpose of the embedded vector of node i, and h_j is the embedded vector of node j. Multiple weight matrices W_b are used, in the manner of a multi-head mechanism, to learn the combinations of h_i and h_j from different angles; the learned results are concatenated with the heuristic features to form the hidden-layer feature z_ij of the edge (the subscript b serves only to distinguish the different weight matrices). Finally, a linear transformation with the weight matrix W_o plus the bias term b_o gives the output-layer result, and the sigmoid activation function yields the prediction probability p_ij of edge (i, j).
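A sketch of the bilinear decoder for a single edge (names and the list-based linear algebra are illustrative; a real implementation would use a tensor library):

```python
import math

def bilinear_decode(h_i, h_j, Ws, heuristics, w_o, b_o):
    """Edge probability for (i, j): several bilinear scores h_i^T W_b h_j
    are concatenated with the heuristic features and passed through a
    linear output layer followed by a sigmoid."""
    scores = []
    for W in Ws:  # one bilinear score per weight matrix W_b
        Wh_j = [sum(w * x for w, x in zip(row, h_j)) for row in W]
        scores.append(sum(a * b for a, b in zip(h_i, Wh_j)))
    z = scores + list(heuristics)  # hidden-layer feature of the edge
    logit = sum(w * x for w, x in zip(w_o, z)) + b_o
    return 1.0 / (1.0 + math.exp(-logit))
```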
The loss function uses cross entropy and is calculated as follows:

Loss = - Σ_{e_ij ∈ G_dec} [ y_ij log p_ij + (1 - y_ij) log(1 - p_ij) ]

where G_dec denotes the decoding graph, e_ij denotes edge (i, j), and y_ij is the label of the edge. The Loss of the model is optimized with a gradient descent algorithm to train the disease risk prediction model.
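The cross-entropy loss over the decoding-graph edges can be sketched as follows (reducing to a mean is an assumption; the patent's formula may sum instead):

```python
import math

def edge_cross_entropy(preds, labels):
    """Mean binary cross entropy over edges of the decoding graph.
    A small epsilon guards against log(0)."""
    eps = 1e-12
    losses = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
              for p, y in zip(preds, labels)]
    return sum(losses) / len(losses)
```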
Step 5: training the disease risk prediction model based on the data set of historical case home pages.
In order to train the disease risk prediction model quickly, negatives are sampled from the whole negative-sample set to generate the training negative samples. The invention sets the positive-to-negative sampling ratio to 1:10; for example, if a patient has 3 positive samples, 30 negative samples are drawn.
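A sketch of the per-patient negative sampling at the stated 1:10 ratio (the function name and data layout are assumptions):

```python
import random

def sample_negatives(pos_diseases, all_diseases, ratio=10, seed=0):
    """Draw ratio * |positives| negative diseases for one patient from
    the diseases the patient does not newly develop."""
    rng = random.Random(seed)
    candidates = [d for d in all_diseases if d not in pos_diseases]
    k = ratio * len(pos_diseases)
    return rng.sample(candidates, min(k, len(candidates)))
```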
In the data set division stage, the data set is divided by patient into a training set, a validation set and a test set at a ratio of 7:1:2. The training set is used to train the disease risk prediction model; the validation set is used to tune the parameters of the model; the test set is used to evaluate the generalization of the model. At inference time, the full set of candidate samples is scored to obtain the prediction probability of each disease, and sorting these probabilities yields the risk ranking of the different diseases.
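The patient-level 7:1:2 split can be sketched as follows (seeding and rounding behavior are assumptions); splitting by patient rather than by edge keeps all records of one patient in the same subset:

```python
import random

def split_by_patient(patient_ids, seed=0):
    """Split patients into train/validation/test subsets at 7:1:2."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```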
The disease risk prediction model is trained in mini-batches: each step samples a subset of nodes and their neighbors to train the network, which makes training on large-scale graph data feasible. The model is effective and highly scalable; when predictions are needed on new data, only the neighbor information of the new nodes is required, without retraining on the whole graph as other graph neural network models require. The number of neighbor samples per layer of the disease risk prediction model is 10. To optimize the model parameters, back-propagation with gradient descent is used to optimize the weight matrices, yielding a trained disease risk prediction model.
The trained disease risk prediction model is then used to predict the disease risk of a new hospitalization record: referring to fig. 4, for a new hospitalization record, the patient's individual information, hospital information and historical disease diagnoses are likewise obtained, and the corresponding patient feature vector and disease feature vectors are extracted. The patient is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the GADP model computes the patient's risk for the other diseases, and sorting these risks in descending order returns the top-N diseases.
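The final top-N ranking step can be sketched as:

```python
def top_n_risks(disease_probs, n=5):
    """Sort predicted per-disease probabilities in descending order and
    return the top-N (disease, probability) pairs."""
    ranked = sorted(disease_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```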
The above-mentioned embodiments only express specific implementations of the present application, and their description is comparatively specific and detailed, but they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, and all such changes and modifications fall within the protection scope of the present application.
Claims (8)
1. The construction method of the slow patient group disease risk prediction model based on the graph self-encoder is characterized by comprising the following steps of:
step 1: acquiring a data set of a historical case homepage, preprocessing data in the data set, and storing the preprocessed historical case data into a storage space established by a storage medium;
step 2: dividing the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient develops in the future; constructing the historical diseases as a patient-disease encoding bipartite graph, and constructing the diseases developed in the next N years as a patient-disease decoding bipartite graph;
step 3: calling the historical case data in the storage space, and extracting patient feature vectors and disease feature vectors based on the historical case data;
step 4: establishing an encoder and a decoder based respectively on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph, wherein the encoder is a graph attention network, and establishing a disease risk prediction model based on the encoder and the decoder;
step 5: training the disease risk prediction model based on the data set of the historical case home page.
2. The method of claim 1, wherein the preprocessing in step 1 eliminates variables whose missing rate in the data set exceeds 30%, and fills the missing values of the remaining variables with the mean of the non-missing part.
3. The method for constructing a slow patient group disease risk prediction model based on a graph self-encoder according to claim 1, wherein the edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, and the weights represent the number of occurrences of each disease; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein the positive samples are diseases the patient newly develops in the next N years, and the negative samples are diseases that do not newly occur in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedded vector representations of patient nodes and disease nodes and to extract heuristic features, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
4. The method of constructing a slow patient population risk of illness prediction model based on graph self-encoder as claimed in claim 1, wherein the patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the ECI comorbidity index of the historical diseases; data whose feature type is discrete are one-hot encoded and converted into 0-1 binary variables; data whose feature type is numerical are taken as continuous features with real values; and discrete data whose values have an ordinal relation are encoded as numerical features.
5. The method for constructing a slow patient group disease risk prediction model based on a graph self-encoder according to claim 1, wherein the disease feature vectors are extracted by arranging the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node and generating a vector for each disease node through one-hot encoding; and the prevalence rate of each disease is calculated as a feature characterizing how common the disease is.
6. The method for constructing a model for predicting the risk of a disease in a chronic patient group based on a graph self-encoder as claimed in claim 1, wherein the step 4 comprises the steps of:
step 4.1: establishing a heuristic feature extraction model, in which the four heuristic indices of an edge (i, j) of the patient-disease encoding bipartite graph are

CN_ij = |N(i) ∩ N2(j)| (Common Neighbors index)
AA_ij = Σ_{z ∈ N(i) ∩ N2(j)} 1 / log|N(z)| (Adamic-Adar index)
JC_ij = |N(i) ∩ N2(j)| / |N(i) ∪ N2(j)| (Jaccard coefficient index)
PA_ij = |N(i)| · |N2(j)| (preferential attachment index)

where N(i), N(j) and N(z) are the neighbor-node sets of nodes i, j and z, node i being the central node; |·| is the size of a set; and N2(j) is the second-order neighbor set of node j; the larger the value of an index, the higher the occurrence probability of the edge;
step 4.2: establishing a neighbor sampling strategy, in which the sampling probability is

p_ij = w_ij / Σ_{u ∈ N(i)} w_iu

where w_ij and p_ij are respectively the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph, and w_iu is the weight of edge (i, u); based on the sampling probabilities p_ij, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
step 4.3: a graph attention network is used as the encoder; the encoder comprises at least one graph convolution module, and the graph convolution layer of each module learns the weights of the different neighbors with a graph attention mechanism to obtain the final embedded vector representation; in layer l of the encoder, the multi-head attention weight from node j to node i at head c is calculated by the following formulas:

q_{c,i}^(l) = W_{c,q}^(l) h_i^(l) + b_{c,q}^(l)
k_{c,j}^(l) = W_{c,k}^(l) [h_j^(l) || w_ij] + b_{c,k}^(l)
α_{c,ij}^(l) = exp(⟨q_{c,i}^(l), k_{c,j}^(l)⟩ / √d) / Σ_{u ∈ N(i)} exp(⟨q_{c,i}^(l), k_{c,u}^(l)⟩ / √d)

where q_{c,i}^(l) is the query vector of central node i at head c in layer l of the encoder; W_{c,q}^(l) is the weight matrix of the query vector q at head c in layer l; h_i^(l) is the embedded vector of central node i in layer l; b_{c,q}^(l) is the bias term of the query vector q at head c in layer l; k_{c,j}^(l) is the key vector of node j at head c in layer l; W_{c,k}^(l) is the weight matrix of the key vector k at head c in layer l; h_j^(l) is the embedded vector of node j in layer l; w_ij is the weight of edge (i, j); b_{c,k}^(l) is the bias term of the key vector k at head c in layer l; α_{c,ij}^(l) is the attention weight of edge (i, j) at head c in layer l; k_{c,u}^(l) is the key vector of node u at head c in layer l; and the dot product is scaled by √d, where d is the dimension of the vectors;
after obtaining the multi-head attention weights, a message aggregation operation is performed over the embedded vectors of the different neighbors:

v_{c,j}^(l) = W_{c,v}^(l) h_j^(l) + b_{c,v}^(l)
ĥ_i^(l+1) = ||_{c=1..C} Σ_{j ∈ N(i)} α_{c,ij}^(l) v_{c,j}^(l)

where v_{c,j}^(l) is the value vector of node j at head c in layer l of the encoder; W_{c,v}^(l) is the weight matrix of the value vector v at head c in layer l; b_{c,v}^(l) is the bias term of the value vector v at head c in layer l; and ĥ_i^(l+1) is the attention vector of central node i in layer l+1 of the encoder;
the embedded vector h_i^(l) of central node i and the aggregated vector ĥ_i^(l+1) are combined through a gated residual mechanism that selectively controls the inflow of information to compute the embedded vector representation of the next layer:

r_i^(l) = W_r^(l) h_i^(l)
β_i^(l) = sigmoid( W_g^(l) [ ĥ_i^(l+1) || r_i^(l) || (ĥ_i^(l+1) - r_i^(l)) ] )
h_i^(l+1) = ReLU( LayerNorm( (1 - β_i^(l)) ĥ_i^(l+1) + β_i^(l) r_i^(l) ) )

where r_i^(l) represents the information of central node i in layer l of the encoder; W_r^(l) is the corresponding weight matrix of central node i in layer l; and β_i^(l) is the gated residual weight of central node i in layer l: ĥ_i^(l+1), r_i^(l) and ĥ_i^(l+1) - r_i^(l) are concatenated in sequence, linearly transformed by the weight matrix W_g^(l), and mapped into the interval (0, 1) by a sigmoid function, thereby controlling the inflow of information from r_i^(l) and ĥ_i^(l+1); finally, the embedded vector representation h_i^(l+1) of central node i in layer l+1 is obtained through LayerNorm and a ReLU activation function;
Step 4.4: constructing a bilinear decoder which predicts the existence probability of edges in a patient-decoding image for the embedded vector expression of known patients and diseases, and the calculation formula is as follows:
in the formula (I), the compound is shown in the specification,representing the index corresponding to the edge i, j of the patient-disease encoding bipartite graph as a heuristic characteristic;transpose of the embedded vector representing node i, hjA vector representing node j; the above formula uses multiple weight matrices to use a multi-head attention mechanismLearning from different anglesAnd hjThe combination method of (3) and then splicing the learned results with heuristic features to form hidden layer features of edgesFinally pass through WoThe weight matrix is linearly transformed by adding an offset term boObtaining the result of the output layer, and obtaining the prediction probability p of the edge i, j by using the sigmoid activation functionij:
the loss function uses cross entropy and is calculated as follows:

Loss = - Σ_{e_ij ∈ G_dec} [ y_ij log p_ij + (1 - y_ij) log(1 - p_ij) ]

where G_dec denotes the decoding graph, e_ij denotes edge (i, j), and y_ij is the label of the edge; the Loss of the model is optimized with a gradient descent algorithm to train the disease risk prediction model.
7. The method for constructing a slow patient population disease risk prediction model based on a graph self-encoder according to claim 1, wherein the preprocessed data set is divided into a training set, a validation set and a test set according to a ratio of 7:1: 2; the training set is used for training the disease risk prediction model, the verification set is used for optimizing parameters of the disease risk prediction model, and the test set is used for evaluating the generalization effect of the disease risk prediction model.
8. The method of claim 1, wherein all negative samples in the data set are collected to form a negative sample set, the negative sample set is sampled to generate the negative samples used to train the disease risk prediction model, and the ratio of positive samples to negative samples is set to 1:10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210507317.9A CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114783608A true CN114783608A (en) | 2022-07-22 |
CN114783608B CN114783608B (en) | 2023-05-05 |
Family
ID=82436498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210507317.9A Active CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783608B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713986A (en) * | 2022-11-11 | 2023-02-24 | 中南大学 | Attention mechanism-based material crystal property prediction method |
CN116072298A (en) * | 2023-04-06 | 2023-05-05 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116825360A (en) * | 2023-07-24 | 2023-09-29 | 湖南工商大学 | Method and device for predicting chronic disease co-morbid based on graph neural network and related equipment |
CN117438023A (en) * | 2023-10-31 | 2024-01-23 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117476240A (en) * | 2023-12-28 | 2024-01-30 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050222867A1 (en) * | 2004-03-31 | 2005-10-06 | Aetna, Inc. | System and method for administering health care cost reduction |
WO2013108122A1 (en) * | 2012-01-20 | 2013-07-25 | Mueller-Wolf Martin | "indima apparatus" system, method and computer program product for individualized and collaborative health care |
CN109036553A (en) * | 2018-08-01 | 2018-12-18 | 北京理工大学 | A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge |
CN111462896A (en) * | 2020-03-31 | 2020-07-28 | 重庆大学 | Real-time intelligent auxiliary ICD coding system and method based on medical record |
CN113689954A (en) * | 2021-08-24 | 2021-11-23 | 平安科技(深圳)有限公司 | Hypertension risk prediction method, device, equipment and medium |
CN114023449A (en) * | 2021-11-05 | 2022-02-08 | 中山大学 | Diabetes risk early warning method and system based on depth self-encoder |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713986A (en) * | 2022-11-11 | 2023-02-24 | 中南大学 | Attention mechanism-based material crystal property prediction method |
CN115713986B (en) * | 2022-11-11 | 2023-07-11 | 中南大学 | Attention mechanism-based material crystal attribute prediction method |
CN116072298A (en) * | 2023-04-06 | 2023-05-05 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116072298B (en) * | 2023-04-06 | 2023-08-15 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116825360A (en) * | 2023-07-24 | 2023-09-29 | 湖南工商大学 | Method and device for predicting chronic disease co-morbid based on graph neural network and related equipment |
CN117438023A (en) * | 2023-10-31 | 2024-01-23 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117438023B (en) * | 2023-10-31 | 2024-04-26 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117476240A (en) * | 2023-12-28 | 2024-01-30 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
CN117476240B (en) * | 2023-12-28 | 2024-04-05 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
Also Published As
Publication number | Publication date |
---|---|
CN114783608B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114783608A (en) | Construction method of slow patient group disease risk prediction model based on graph self-encoder | |
CN114169330B (en) | Chinese named entity recognition method integrating time sequence convolution and transform encoder | |
WO2022057669A1 (en) | Method for pre-training knowledge graph on the basis of structured context information | |
CN113905391B (en) | Integrated learning network traffic prediction method, system, equipment, terminal and medium | |
CN111738535A (en) | Method, device, equipment and storage medium for predicting rail transit time-space short-time passenger flow | |
CN110110318B (en) | Text steganography detection method and system based on cyclic neural network | |
CN113535984A (en) | Attention mechanism-based knowledge graph relation prediction method and device | |
CN104572583A (en) | Densification of longitudinal emr for improved phenotyping | |
CN114519469A (en) | Construction method of multivariate long sequence time sequence prediction model based on Transformer framework | |
Mustika et al. | Analysis accuracy of xgboost model for multiclass classification-a case study of applicant level risk prediction for life insurance | |
CN112215604A (en) | Method and device for identifying information of transaction relationship | |
CN113780665B (en) | Private car stay position prediction method and system based on enhanced recurrent neural network | |
CN116187555A (en) | Traffic flow prediction model construction method and prediction method based on self-adaptive dynamic diagram | |
CN116403730A (en) | Medicine interaction prediction method and system based on graph neural network | |
CN111178946B (en) | User behavior characterization method and system | |
CN112749791A (en) | Link prediction method based on graph neural network and capsule network | |
CN113345564B (en) | Early prediction method and device for patient hospitalization duration based on graph neural network | |
CN114398500A (en) | Event prediction method based on graph-enhanced pre-training model | |
CN112201348B (en) | Knowledge-aware-based multi-center clinical data set adaptation device | |
CN114513337A (en) | Privacy protection link prediction method and system based on mail data | |
CN113255750A (en) | VCC vehicle attack detection method based on deep learning | |
CN114418158A (en) | Cell network load index prediction method based on attention mechanism learning network | |
CN114840777B (en) | Multi-dimensional endowment service recommendation method and device and electronic equipment | |
CN115906846A (en) | Document-level named entity identification method based on double-graph hierarchical feature fusion | |
CN115035455A (en) | Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||