CN114783608A - Construction method of a chronic disease patient population disease risk prediction model based on a graph autoencoder - Google Patents
Construction method of a chronic disease patient population disease risk prediction model based on a graph autoencoder
- Publication number
- CN114783608A CN114783608A CN202210507317.9A CN202210507317A CN114783608A CN 114783608 A CN114783608 A CN 114783608A CN 202210507317 A CN202210507317 A CN 202210507317A CN 114783608 A CN114783608 A CN 114783608A
- Authority
- CN
- China
- Prior art keywords
- disease
- patient
- encoder
- vector
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Public Health (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Epidemiology (AREA)
- Primary Health Care (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention relates to the technical field of medical information, and in particular to a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder. A patient-disease bipartite graph is constructed from patients' hospitalization records and historical disease information, and feature vectors are then extracted for patients and diseases respectively. Finally, a disease risk prediction model based on a graph attention mechanism is built on the graph autoencoder architecture to predict the future disease risk of chronic disease patients. Because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
Description
Technical Field
The invention relates to the technical field of medical information, and in particular to a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder.
Background
The worsening of population aging and the steep rise in the incidence of chronic diseases impose a severe social and economic burden worldwide. It is estimated that over 75% of the elderly have more than one chronic disease, and multimorbidity (two or more chronic diseases present simultaneously) among the elderly has become a prominent global problem, resulting in greater medical needs and higher use of medical services and costs. Complex interrelationships exist between chronic diseases, and some chronic diseases may induce others, further increasing patients' treatment burden. The prevention and treatment of chronic diseases and related complications has become an unavoidable problem. Effectively predicting the future disease risk of chronic disease patients allows doctors to intervene in advance and reduce the risk of related diseases, nipping them in the bud, which is of great practical significance. The existing disease risk prediction methods mainly have the following problems:
(1) Some prediction methods model the disease prediction problem as a series of binary classification models, each predicting whether one disease occurs. This modeling approach causes the number of models to grow with the number of predicted diseases, limiting the models' practicality.
(2) Some prediction methods use a patient's historical disease information, abstract it into a patient-disease bipartite graph, model the problem as a link prediction problem, and predict disease risk with heuristic methods such as the Common Neighbors (CN) index and the Adamic-Adar (AA) index. These consider only the topological information of the bipartite graph and ignore patients' individual differences, such as gender and age.
(3) Most existing prediction methods do not consider the complex influence relationships among diseases, resulting in poor prediction performance.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a method of constructing a chronic disease patient population disease risk prediction model based on a graph autoencoder, aiming to solve the technical problem, described in the background art, that existing prediction methods ignore the complex influence relationships among diseases and therefore predict poorly.
The technical scheme adopted by the invention is as follows:
The construction method of the chronic disease patient population disease risk prediction model based on the graph autoencoder comprises the following steps:
Step 1: acquire a data set of historical medical record first pages, preprocess the data in the data set, and store the preprocessed historical case data in a storage space established on a storage medium;
Step 2: divide the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient will have in the future; construct the historical diseases as a patient-disease encoding bipartite graph, and construct the diseases of the next N years as a patient-disease decoding bipartite graph;
Step 3: retrieve the historical case data from the storage space, and extract patient feature vectors and disease feature vectors from it;
Step 4: establish an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and build the disease risk prediction model from the encoder and the decoder;
Step 5: train the disease risk prediction model on the data set of historical medical record first pages.
For a new hospitalization record, the invention can likewise obtain the patient's individual information, hospital information and historical disease diagnoses, and extract the corresponding patient feature vector and disease feature vectors; the new hospitalization record data are added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model yields the patient's risk for other diseases, sorted in descending order of risk.
In addition, because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
Preferably, the preprocessing in step 1 removes variables with a missing rate greater than 30% from the data set, and fills the missing values of the remaining variables with the mean of their non-missing parts.
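As an illustrative sketch of this preprocessing rule (plain Python with hypothetical variable names; a real implementation would typically use a dataframe library), records are dicts in which None marks a missing value:

```python
from statistics import mean

def preprocess(records, max_missing=0.30):
    """Drop variables whose missing rate exceeds max_missing, then fill the
    remaining missing values with the mean of the non-missing part."""
    n = len(records)
    variables = {k for r in records for k in r}
    kept = {}
    for var in variables:
        values = [r.get(var) for r in records]
        if sum(v is None for v in values) / n <= max_missing:
            kept[var] = values
    filled = {}
    for var, values in kept.items():
        m = mean(v for v in values if v is not None)  # mean of non-missing part
        filled[var] = [m if v is None else v for v in values]
    return filled
```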
Preferably, the edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, with weights representing the number of occurrences of each disease; the patient-disease decoding bipartite graph comprises positive and negative samples, where positive samples are diseases the patient newly develops in the next N years and negative samples are diseases the patient does not newly develop in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the encoding bipartite graph is used by the encoder to automatically learn the embedded vector expressions of patient nodes and disease nodes, and the decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
Preferably, the patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the Elixhauser Comorbidity Index (ECI) of the historical diseases; features of discrete type are one-hot encoded into 0-1 binary variables; features of numerical type are treated as continuous features taking real values; and discrete features whose values have an ordinal relationship are encoded as numerical features.
Preferably, the disease feature vectors are extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease (the number of patients with the disease divided by the total number of patients) is calculated as a feature characterizing how common the disease is.
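A minimal sketch of this disease-node feature construction (one-hot over the ascending-sorted ICD-10 codes plus the prevalence); the function name and the sample codes in the test are illustrative:

```python
def disease_features(icd_codes, patient_counts, total_patients):
    """icd_codes: ICD-10 codes of the disease nodes; patient_counts: dict
    code -> number of patients who had the disease. Each node gets a one-hot
    vector over the ascending-sorted codes, plus its prevalence (patients
    with the disease divided by total patients) as a hotness feature."""
    order = sorted(icd_codes)                       # ascending ICD-10 order
    index = {code: i for i, code in enumerate(order)}
    feats = {}
    for code in icd_codes:
        onehot = [0.0] * len(order)
        onehot[index[code]] = 1.0
        feats[code] = onehot + [patient_counts[code] / total_patients]
    return feats
```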
Preferably, the step 4 comprises the following steps:
step 4.1: establishing a heuristic characteristic extraction model:
In the formulas, $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor-node sets of nodes i, j and z respectively, where node i is the central node; $|\cdot|$ is the size of a set; $\Gamma^2(j)$ is the second-order neighbor set of node j. For edge (i, j) of the patient-disease encoding bipartite graph, the four heuristic indices are: the common-neighbors index $CN_{ij}=|\Gamma(i)\cap\Gamma^2(j)|$; the Adamic-Adar index $AA_{ij}=\sum_{z\in\Gamma(i)\cap\Gamma^2(j)}\frac{1}{\log|\Gamma(z)|}$; the Jaccard coefficient $JC_{ij}=\frac{|\Gamma(i)\cap\Gamma^2(j)|}{|\Gamma(i)\cup\Gamma^2(j)|}$; and the preferential-attachment index $PA_{ij}=|\Gamma(i)|\cdot|\Gamma^2(j)|$. The larger the value of an index, the higher the occurrence probability of the edge;
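The four heuristic indices can be sketched as follows for a patient node i and disease node j, taking common neighbours between $\Gamma(i)$ and the second-order neighbourhood $\Gamma^2(j)$ as the text describes; the `nb` adjacency dict and the toy graph in the test are assumptions:

```python
import math

def heuristic_features(nb, i, j):
    """Heuristic indices for edge (i, j) of the patient-disease encoding graph.
    nb maps each node to its neighbour set; for patient i and disease j the
    common neighbours are taken between Gamma(i) and the second-order
    neighbourhood Gamma2(j)."""
    gamma_i = nb[i]
    gamma2_j = set().union(*(nb[z] for z in nb[j])) if nb[j] else set()
    common = gamma_i & gamma2_j
    cn = len(common)                                    # common-neighbours index
    aa = sum(1.0 / math.log(len(nb[z])) for z in common if len(nb[z]) > 1)
    union = gamma_i | gamma2_j
    jc = cn / len(union) if union else 0.0              # Jaccard coefficient
    pa = len(gamma_i) * len(gamma2_j)                   # preferential attachment
    return cn, aa, jc, pa
```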
step 4.2: establishing a neighbor sampling strategy:
in the formula: w is aijAndweights and sampling probabilities, w, of the edges i, j of the patient-disease encoded bipartite graph, respectivelyiuWeights representing the edges i, u of the patient-disease encoded bipartite graph based on the sampling probabilitiesPerforming replacement sampling on neighbors of the central node to obtain a fixed number of neighbor samples;
step 4.3: the graph attention network is used as the encoder; the encoder comprises at least one graph convolution module, and the graph convolution layer of each module learns the weights of different neighbors with the graph attention mechanism to obtain the final embedded vector expression. In layer l of the encoder, the multi-head attention weight $\alpha_{ij}^{(l),c}$ from node j to node i at the c-th head is calculated by the following formulas:
$q_{c,i}^{(l)}=W_c^{(l),q}h_i^{(l)}+b_c^{(l),q}$,
$k_{c,j}^{(l)}=W_c^{(l),k}[h_j^{(l)}\,\|\,w_{ij}]+b_c^{(l),k}$,
$\alpha_{ij}^{(l),c}=\frac{\exp(\langle q_{c,i}^{(l)},k_{c,j}^{(l)}\rangle/\sqrt{d})}{\sum_{u\in\Gamma(i)}\exp(\langle q_{c,i}^{(l)},k_{c,u}^{(l)}\rangle/\sqrt{d})}$,
where $q_{c,i}^{(l)}$ is the query vector of central node i at the c-th head in the l-th layer network of the encoder, with weight matrix $W_c^{(l),q}$ and bias term $b_c^{(l),q}$; $h_i^{(l)}$ and $h_j^{(l)}$ are the embedded vectors of central node i and node j in the l-th layer; $k_{c,j}^{(l)}$ is the key vector of node j at the c-th head, with weight matrix $W_c^{(l),k}$ and bias term $b_c^{(l),k}$; $w_{ij}$ is the weight of edge (i, j); $\alpha_{ij}^{(l),c}$ is the attention weight of edge (i, j) at the c-th head in the l-th layer; $k_{c,u}^{(l)}$ is the key vector of node u; and $\sqrt{d}$ exponentially scales the vector dot product, d being the dimension of the vector;
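A minimal plain-Python sketch of one attention head, assuming (consistent with the surrounding description) that each key is computed from the neighbour embedding concatenated with the edge weight; `matvec` and all matrix values are illustrative stand-ins for learned parameters:

```python
import math

def matvec(W, x, b):
    """Affine map W x + b using plain lists (illustrative helper)."""
    return [sum(w * xc for w, xc in zip(row, x)) + bc for row, bc in zip(W, b)]

def attention_weights(h_i, neighbors, Wq, bq, Wk, bk):
    """One attention head of an encoder layer: project the central node's
    embedding h_i to a query; append the edge weight w_ij to each neighbour
    embedding h_j before projecting it to a key; the attention weights are
    the softmax of the scaled dot products <q, k> / sqrt(d)."""
    q = matvec(Wq, h_i, bq)
    scale = math.sqrt(len(q))
    scores = []
    for h_j, w_ij in neighbors:
        k = matvec(Wk, h_j + [w_ij], bk)
        scores.append(sum(qc * kc for qc, kc in zip(q, k)) / scale)
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]   # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```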
after obtaining the multi-head attention weight of the graph, performing message aggregation operation on the embedded vectors of different neighbors:
In the formulas, $v_{c,j}^{(l)}=W_c^{(l),v}h_j^{(l)}+b_c^{(l),v}$ is the value vector of node j at the c-th head in the l-th layer network of the encoder, with $W_c^{(l),v}$ and $b_c^{(l),v}$ the weight matrix and bias term of the value vector v at the c-th head; and $\hat{h}_i^{(l+1)}=\big\Vert_{c=1}^{C}\sum_{j\in\Gamma(i)}\alpha_{ij}^{(l),c}v_{c,j}^{(l)}$ is the attention vector of central node i in the (l+1)-th layer network of the encoder;
The attention vector $\hat{h}_i^{(l+1)}$ is combined with the embedded vector of central node i through a gated residual mechanism, which selectively controls the inflow of information to compute the embedded vector representation of the next layer. The specific calculation formulas are:
$r_i^{(l)}=W_r^{(l)}h_i^{(l)}$,
$g_i^{(l)}=\mathrm{sigmoid}(W_g^{(l)}[\hat{h}_i^{(l+1)}\,\|\,r_i^{(l)}\,\|\,\hat{h}_i^{(l+1)}-r_i^{(l)}])$,
$h_i^{(l+1)}=\mathrm{ReLU}(\mathrm{LayerNorm}((1-g_i^{(l)})\hat{h}_i^{(l+1)}+g_i^{(l)}r_i^{(l)}))$,
where $r_i^{(l)}$ is the residual information of central node i in the l-th layer network of the encoder, $W_r^{(l)}$ is the weight matrix of central node i, and $g_i^{(l)}$ is the gated-residual weight of central node i. $\hat{h}_i^{(l+1)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l+1)}-r_i^{(l)}$ are spliced in sequence, linearly transformed by the weight matrix $W_g^{(l)}$, and mapped to the range (0, 1) by the sigmoid function, thereby controlling the information inflow from $r_i^{(l)}$ and $\hat{h}_i^{(l+1)}$; finally, the embedded vector representation $h_i^{(l+1)}$ of central node i for the (l+1)-th layer network is obtained through LayerNorm and the ReLU activation function;
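The gated residual can be sketched as below; LayerNorm and ReLU are omitted for brevity, and the gate here is a scalar with illustrative weights, not the patent's learned parameters:

```python
import math

def gated_residual(h_hat, r, Wg, bg):
    """Gated residual: splice h_hat, r and their difference, map them to a
    scalar gate g in (0, 1) with a linear layer plus sigmoid, and use the
    gate to mix the attention output with the residual. LayerNorm and ReLU,
    applied afterwards in the text, are omitted here for brevity."""
    z = list(h_hat) + list(r) + [a - b for a, b in zip(h_hat, r)]
    g = 1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(Wg, z)) + bg)))
    return [(1 - g) * a + g * b for a, b in zip(h_hat, r)]
```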
Step 4.4: construct a bilinear decoder that predicts the existence probability of edges in the patient-disease decoding bipartite graph from the known embedded vector representations of patients and diseases. The calculation is as follows: $s_{ij}^{(c)}=h_i^{\top}W_ch_j$, where $h_i^{\top}$ is the transpose of the embedded vector of node i, $h_j$ is the vector of node j, and multiple weight matrices $W_c$ are used, borrowing the multi-head attention mechanism, to learn the combination of $h_i$ and $h_j$ from different angles; $f_{ij}$ denotes the heuristic features, i.e., the indices corresponding to edge (i, j); the learned results are then spliced with the heuristic features to form the hidden-layer feature of the edge $z_{ij}=[s_{ij}^{(1)}\,\|\,\cdots\,\|\,s_{ij}^{(C)}\,\|\,f_{ij}]$; finally, a linear transformation by the weight matrix $W_o$ plus a bias term $b_o$ gives the output layer, and the sigmoid activation function yields the prediction probability of edge (i, j): $p_{ij}=\mathrm{sigmoid}(W_oz_{ij}+b_o)$.
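A sketch of the bilinear decoder: each weight matrix gives one bilinear score, the scores are spliced with the heuristic features, and a linear output layer plus a sigmoid gives the edge probability. All parameter values here are illustrative:

```python
import math

def bilinear_decode(h_i, h_j, Ws, f_ij, Wo, bo):
    """Bilinear decoder: each weight matrix W_c yields one bilinear score
    h_i^T W_c h_j (several matrices mimic multi-head attention); the scores
    are spliced with the heuristic features f_ij of the edge, and a linear
    output layer followed by a sigmoid gives the edge probability p_ij."""
    scores = [sum(h_i[a] * W[a][b] * h_j[b]
                  for a in range(len(h_i)) for b in range(len(h_j)))
              for W in Ws]
    hidden = scores + list(f_ij)           # hidden-layer feature of the edge
    logit = sum(w * x for w, x in zip(Wo, hidden)) + bo
    return 1.0 / (1.0 + math.exp(-logit))
```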
The loss function uses cross entropy and is calculated as follows:
$\mathrm{Loss}=-\frac{1}{|G_{dec}|}\sum_{e_{ij}\in G_{dec}}\left[y_{ij}\log p_{ij}+(1-y_{ij})\log(1-p_{ij})\right]$, wherein $G_{dec}$ denotes the decoding graph, $e_{ij}$ the edge (i, j), and $y_{ij}$ the label of the edge; the model's Loss is optimized with a gradient descent algorithm to train the disease risk prediction model.
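The cross-entropy over decoding-graph edges can be sketched as:

```python
import math

def edge_cross_entropy(probs, labels):
    """Cross-entropy averaged over the edges of the decoding graph: probs are
    the predicted edge probabilities p_ij, labels the 0/1 edge labels y_ij."""
    eps = 1e-12                            # clip to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)
```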
Preferably, the preprocessed data set is divided into a training set, a validation set and a test set in the ratio 7:1:2; the training set is used to train the disease risk prediction model, the validation set to tune its parameters, and the test set to evaluate its generalization.
Preferably, all negative samples in the data set are collected to form a negative-sample set, which is then sampled to generate the negative samples used to train the disease risk prediction model, with the positive-to-negative sample ratio set to 1:10.
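A sketch of the 1:10 negative-sampling step; the seed and edge tuples are illustrative:

```python
import random

def sample_negatives(negative_edges, n_positive, ratio=10, seed=0):
    """Sample from the full negative-sample set so the training data keeps a
    positive:negative ratio of 1:`ratio` (seed fixed only for illustration)."""
    rng = random.Random(seed)
    k = min(len(negative_edges), n_positive * ratio)
    return rng.sample(list(negative_edges), k)
```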
The beneficial effects of the invention include:
1. For a new hospitalization record, the invention can likewise obtain the patient's individual information, hospital information and historical disease diagnoses, and extract the corresponding patient feature vector and disease feature vectors; the new hospitalization record data are added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model yields the patient's risk for other diseases, sorted in descending order of risk.
In addition, because the encoder part of the disease risk prediction model uses the attention mechanism and also considers edge weight information, the model can simultaneously account for the topological information of the bipartite graph and the individual differences of patients, learn the complex influence relationships among diseases, and thereby improve the prediction performance.
2. The invention sorts the final output in descending order of disease probability, realizing risk prediction over all diseases, and thus has broad practical value.
3. According to the invention, modeling can be completed using only the first-page data of patients' medical records, from which the patient feature vectors and disease feature vectors are extracted; available information is thus comprehensively mined, enhancing the model's predictive ability.
4. The decoder part of the disease risk prediction model not only uses the node embedded vectors learned by the encoder but also extracts heuristic features such as CN and AA for each edge; these heuristic features supply additional information, so the model converges faster and performs better.
Drawings
FIG. 1 shows the construction of the patient-disease bipartite graph of the invention.
FIG. 2 is a diagram of the disease risk prediction model architecture of the present invention.
Fig. 3 is a training flowchart of the disease risk prediction model of the present invention.
Fig. 4 is a prediction flowchart of the disease risk prediction model of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The following is a detailed description of embodiments of the invention, made with reference to FIGS. 1 to 4:
The construction method of the chronic disease patient population disease risk prediction model based on the graph autoencoder comprises the following steps:
Step 1: acquire a data set of historical medical record first pages, preprocess the data in the data set, and store the preprocessed historical case data in a storage space established on a storage medium;
the historical case first page data is a record item generated after the patient is completely hospitalized, each record comprises individual information (encrypted information such as identification number, sex, age, hospitalization time and discharge time) of the patient, information (information such as hospital grade and hospital address) of the hospitalization hospital and hospitalization Disease diagnosis (main diagnosis and at most 15 secondary diagnoses) of the patient, and the 10 th edition code of International Classification of Disease-review 10, ICD-10 is adopted; based on the above, data needs to be preprocessed, that is, variables with a deletion rate greater than 30% in a data set are removed, and the residual data with the deletion rate are filled with the missing values by using the mean value of the non-missing parts; data without missing values are obtained and stored in a storage space established in the storage medium, such as a database.
The goal of the invention is to predict a patient's disease risk N years into the future based on the patient's historical diseases and individual information. Therefore, before forming the data without missing values, patients whose hospitalization records span more than N years must be screened; the chronic disease diagnoses in each such patient's hospitalization records of the last N years are taken as prediction labels, and the historical diseases as known information. By setting the value of N, the invention can predict the patient's future disease risk at different time granularities.
Step 2: divide the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient will have in the future; construct the historical diseases as a patient-disease encoding bipartite graph, and construct the diseases of the next N years as a patient-disease decoding bipartite graph;
referring to fig. 1, in order to be able to predict the disease that may occur in a patient N years in the future, the present invention abstracts the task scenario into the link prediction problem of a bipartite graph, whose left nodes represent different patients and right nodes represent different diseases, and where only edges from patient to disease exist, a patient-disease encoded bipartite graph is used for an encoder to automatically learn the patient nodes and disease node embedded vector expressions, and a patient-disease decoded bipartite graph is used for a decoder to learn the occurrence probability of each edge.
The edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, with weights representing the number of occurrences of each disease. A solid line in the patient-disease decoding bipartite graph represents a positive sample, i.e., a disease the patient newly develops within the next N years; a dashed line represents a negative sample, i.e., a disease the patient does not newly develop within the next N years. The edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph.
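The subtraction that yields the decoding-graph edges can be sketched as:

```python
from itertools import product

def decoding_edges(patients, diseases, coding_edges):
    """Candidate edges of the patient-disease decoding bipartite graph: the
    complete bipartite graph between patients and diseases minus the edges
    already present in the encoding graph."""
    complete = set(product(patients, diseases))
    return complete - set(coding_edges)
```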
The patient-disease encoding bipartite graph constructed by the method is used by the disease risk prediction model (named the GADP model, i.e., Graph Attention Disease risk Prediction model; for convenience of expression it is referred to herein as the disease risk prediction model) to automatically learn the embedded vector expressions of patient nodes and disease nodes, and the patient-disease decoding bipartite graph is used to compute the patient's future disease risk.
Step 3: retrieve the historical case data from the storage space, and extract patient feature vectors and disease feature vectors from it;
the extraction of the patient feature vector comprises individual information, hospital information, historical disease number and ECI (ECI) co-morbidity Index of historical diseases, wherein the ECI co-morbidity Index can quantify the physical condition of the patient to a certain extent; carrying out one-hot coding on the data with the characteristic type of discrete type, and converting the data into a binary variable of 0-1; taking the data with the characteristic type of numerical value as continuous characteristic, and taking the value as real number; and (3) encoding the data with the characteristic type of discrete type and the data with the value having the sequence relation into the numerical characteristic.
See in particular table 1 below:
TABLE 1 extraction of feature vectors for patient nodes
Referring to Table 1 above, the third column gives the data type of each feature: numerical features are regarded as continuous and take real values, while discrete features must be one-hot encoded into 0-1 binary variables. However, the "hospitalization condition" field takes the values critical, urgent and general; although these are discrete, they have an ordinal relationship, so to reduce the data dimension the feature is encoded as a numerical feature, i.e., 1, 2 and 3. This reduces the feature dimension while preserving the ordering information.
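A sketch of this ordinal encoding for the "hospitalization condition" field; the English value names are assumed translations of the original field values:

```python
# Ordinal encoding for the "hospitalization condition" field: its discrete
# values carry an order (critical < urgent < general), so a single numeric
# code preserves that order in one dimension instead of a three-dimensional
# one-hot vector.
CONDITION_CODE = {"critical": 1, "urgent": 2, "general": 3}

def encode_condition(value):
    return CONDITION_CODE[value]
```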
The disease feature vectors are extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease is calculated as a feature characterizing how common the disease is.
Step 4: establish an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and build the disease risk prediction model from the encoder and the decoder;
the invention uses Graph auto-encoder (GAE) as basic prediction architecture of link prediction. The graph self-encoder is used as an end-to-end model, and can automatically learn the embedded vector expression of each node in the encoding graph, and then a decoder is used for predicting the probability of decoding each edge in the graph. The core components of the graph self-encoder are an encoder and a decoder. The invention uses Graph Attention Networks (GAT) as an encoder, Bilinear layer (Bilinear layer) as a decoder, the model is named Graph Attention Disease Prediction (GADP) model, and the network structure is shown as figure 2.
The step 4 comprises the following steps:
step 4.1: establishing a heuristic characteristic extraction model:
In the formulas, $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor-node sets of nodes i, j and z respectively, where node i is the central node; $|\cdot|$ is the size of a set; $\Gamma^2(j)$ is the second-order neighbor set of node j. For edge (i, j) of the patient-disease encoding bipartite graph, the four heuristic indices are: the common-neighbors index $CN_{ij}=|\Gamma(i)\cap\Gamma^2(j)|$; the Adamic-Adar index $AA_{ij}=\sum_{z\in\Gamma(i)\cap\Gamma^2(j)}\frac{1}{\log|\Gamma(z)|}$; the Jaccard coefficient $JC_{ij}=\frac{|\Gamma(i)\cap\Gamma^2(j)|}{|\Gamma(i)\cup\Gamma^2(j)|}$; and the preferential-attachment index $PA_{ij}=|\Gamma(i)|\cdot|\Gamma^2(j)|$. The larger the value of an index, the higher the occurrence probability of the edge;
step 4.2: in the neighbor sampling strategies of graph neural networks, a certain number of neighbors are usually sampled under a uniform random distribution; however, because different diseases influence patients to different degrees, a non-uniform neighbor sampling strategy is designed that takes into account the edge weights of the patient-disease encoding bipartite graph, so that the larger the weight, the higher the sampling probability. The specific neighbor sampling strategy is as follows:
In the formula, $p_{ij}=\frac{w_{ij}}{\sum_{u\in\Gamma(i)}w_{iu}}$, where $w_{ij}$ and $p_{ij}$ are respectively the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph, and $w_{iu}$ is the weight of edge (i, u); based on the sampling probabilities $p_{ij}$, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
step 4.3: referring to fig. 2, a graph attention network is used as the encoder. The encoder comprises two identical graph convolution modules, and the graph convolution layer of each module learns the weights of the different neighbors with an attention mechanism to obtain the final embedded vector representation. In layer l of the encoder, the multi-head attention weight from node j to node i at head c is calculated by the following formulas:

q_{c,i}^(l) = W_{c,q}^(l) h_i^(l) + b_{c,q}^(l)
k_{c,j}^(l) = W_{c,k}^(l) [h_j^(l) || w_ij] + b_{c,k}^(l)
α_{c,ij}^(l) = exp(⟨q_{c,i}^(l), k_{c,j}^(l)⟩ / √d) / Σ_{u ∈ N(i)} exp(⟨q_{c,i}^(l), k_{c,u}^(l)⟩ / √d)

where q_{c,i}^(l) is the query vector of central node i at head c in layer l of the encoder; W_{c,q}^(l) is the weight matrix of the query vector q at head c in layer l; h_i^(l) is the embedded vector of central node i in layer l; b_{c,q}^(l) is the bias term of the query vector q at head c in layer l; k_{c,j}^(l) is the key vector of node j at head c in layer l; W_{c,k}^(l) is the weight matrix of the key vector k at head c in layer l; h_j^(l) is the embedded vector of node j in layer l; w_ij is the weight of edge (i, j); b_{c,k}^(l) is the bias term of the key vector k at head c in layer l; α_{c,ij}^(l) is the attention weight of edge (i, j) at head c in layer l; k_{c,u}^(l) is the key vector of node u at head c in layer l; and the dot product is scaled by √d, where d is the dimension of the vectors;
firstly, the embedded vector h_i^(l) of the central node at layer l is linearly transformed by W_{c,q}^(l) into the query vector q_{c,i}^(l); the embedded vector h_j^(l) of a neighbor node and the edge weight w_ij are concatenated and then linearly transformed by W_{c,k}^(l) into the key vector k_{c,j}^(l). The attention score of the edge is then computed as the dot product ⟨q, k⟩ scaled by √d, where d is the dimension of the vectors; finally, a softmax normalization yields the normalized attention weight α_{c,ij}^(l).
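A single-head, plain-Python sketch of this attention computation (multi-head operation repeats it once per head; all names and shapes are illustrative assumptions, not the patent's implementation):

```python
import math

def attention_weights(h_i, h_nbrs, w_edges, Wq, Wk, bq, bk):
    """Single-head attention weights of one center node over its neighbors.

    The query comes from the center embedding; the key comes from each
    neighbor embedding concatenated with the scalar edge weight. Scores
    are scaled by sqrt(d) and softmax-normalized over the neighbors.
    """
    def matvec(W, x):
        return [sum(wr * xc for wr, xc in zip(row, x)) for row in W]

    d = len(bq)
    q = [a + b for a, b in zip(matvec(Wq, h_i), bq)]
    scores = []
    for h_j, w in zip(h_nbrs, w_edges):
        k = [a + b for a, b in zip(matvec(Wk, h_j + [w]), bk)]
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    m = max(scores)                         # numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The returned weights sum to one and favor neighbors whose keys align with the center node's query.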
After obtaining the multi-head attention weights, a message aggregation operation is performed over the embedded vectors of the different neighbors:

v_{c,j}^(l) = W_{c,v}^(l) h_j^(l) + b_{c,v}^(l)
ĥ_i^(l+1) = ||_{c=1..C} Σ_{j ∈ N(i)} α_{c,ij}^(l) v_{c,j}^(l)

where C is the total number of attention heads and || is the vector concatenation operation. First the value vector v_{c,j}^(l) is obtained from h_j^(l) through the linear transformation W_{c,v}^(l); the value vectors are then weighted and summed with the previously computed attention weights α_{c,ij}^(l); finally, the results of the C heads are concatenated into the neighbor-aggregated multi-head attention vector ĥ_i^(l+1).
The embedded vector h_i^(l) of the central node and the aggregated vector ĥ_i^(l+1) are then combined through a gated residual mechanism that selectively controls the inflow of information to compute the embedded vector representation of the next layer:

r_i^(l) = W_r^(l) h_i^(l)
β_i^(l) = sigmoid( W_g^(l) [ ĥ_i^(l+1) || r_i^(l) || (ĥ_i^(l+1) - r_i^(l)) ] )
h_i^(l+1) = ReLU( LayerNorm( (1 - β_i^(l)) ĥ_i^(l+1) + β_i^(l) r_i^(l) ) )

where r_i^(l) is the embedded vector h_i^(l) of the central node linearly transformed by W_r^(l), and β_i^(l) is the gated residual weight: ĥ_i^(l+1), r_i^(l) and ĥ_i^(l+1) - r_i^(l) are concatenated in sequence, linearly transformed by W_g^(l), and mapped into the interval (0, 1) by a sigmoid function, thereby controlling how much information flows in from r_i^(l) and from ĥ_i^(l+1); finally, the embedded vector representation h_i^(l+1) of central node i at layer l+1 is obtained through LayerNorm and a ReLU activation function.
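A sketch of the gated residual combination; the exact gating formula appears only in prose above, so the concatenation order and the mixing convention (1 - beta) * h_agg + beta * r are assumptions:

```python
import math

def gated_residual(h_agg, r, gate_w):
    """Gate between the aggregated neighbor message h_agg and the
    transformed center embedding r. The gate value beta is a sigmoid of
    a linear map over the concatenation [h_agg ; r ; h_agg - r], and the
    output mixes the two streams as (1 - beta) * h_agg + beta * r.
    LayerNorm/ReLU of the full layer are omitted for brevity."""
    diff = [a - b for a, b in zip(h_agg, r)]
    x = h_agg + r + diff  # concatenation [h_agg ; r ; h_agg - r]
    beta = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(gate_w, x))))
    return [(1 - beta) * a + beta * b for a, b in zip(h_agg, r)]
```

With an all-zero gate vector the sigmoid gives beta = 0.5, so the output is the elementwise mean of the two streams.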
Step 4.4: constructing a bilinear decoder. In the patient-disease decoding bipartite graph, one edge corresponds to a unique patient and a unique disease; given the embedded vector representations of the patient and the disease, the bilinear decoder predicts the existence probability of the edge by the following formulas:

s_ij^(b) = h_i^T W_b h_j, b = 1, ..., B
z_ij = [ s_ij^(1) || ... || s_ij^(B) || f_ij ]
p_ij = sigmoid( W_o z_ij + b_o )

where f_ij is the heuristic feature vector corresponding to edge (i, j); h_i^T is the transpose of the embedded vector of node i, and h_j is the embedded vector of node j. Multiple weight matrices W_b are used, in the manner of a multi-head mechanism, to learn the combinations of h_i and h_j from different angles; the learned results are concatenated with the heuristic features to form the hidden-layer feature z_ij of the edge (the subscript b serves only to distinguish the different weight matrices). Finally, a linear transformation with the weight matrix W_o plus the bias term b_o gives the output-layer result, and the sigmoid activation function yields the prediction probability p_ij of edge (i, j).
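A sketch of the bilinear decoder for a single edge (names and the list-based linear algebra are illustrative; a real implementation would use a tensor library):

```python
import math

def bilinear_decode(h_i, h_j, Ws, heuristics, w_o, b_o):
    """Edge probability for (i, j): several bilinear scores h_i^T W_b h_j
    are concatenated with the heuristic features and passed through a
    linear output layer followed by a sigmoid."""
    scores = []
    for W in Ws:  # one bilinear score per weight matrix W_b
        Wh_j = [sum(w * x for w, x in zip(row, h_j)) for row in W]
        scores.append(sum(a * b for a, b in zip(h_i, Wh_j)))
    z = scores + list(heuristics)  # hidden-layer feature of the edge
    logit = sum(w * x for w, x in zip(w_o, z)) + b_o
    return 1.0 / (1.0 + math.exp(-logit))
```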
The loss function uses cross entropy and is calculated as follows:

Loss = - Σ_{e_ij ∈ G_dec} [ y_ij log p_ij + (1 - y_ij) log(1 - p_ij) ]

where G_dec denotes the decoding graph, e_ij denotes edge (i, j), and y_ij is the label of the edge. The Loss of the model is optimized with a gradient descent algorithm to train the disease risk prediction model.
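The cross-entropy loss over the decoding-graph edges can be sketched as follows (reducing to a mean is an assumption; the patent's formula may sum instead):

```python
import math

def edge_cross_entropy(preds, labels):
    """Mean binary cross entropy over edges of the decoding graph.
    A small epsilon guards against log(0)."""
    eps = 1e-12
    losses = [-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))
              for p, y in zip(preds, labels)]
    return sum(losses) / len(losses)
```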
Step 5: training the disease risk prediction model based on the data set of historical case home pages.
In order to train the disease risk prediction model quickly, negatives are sampled from the whole negative-sample set to generate the training negative samples. The invention sets the positive-to-negative sampling ratio to 1:10; for example, if a patient has 3 positive samples, 30 negative samples are drawn.
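A sketch of the per-patient negative sampling at the stated 1:10 ratio (the function name and data layout are assumptions):

```python
import random

def sample_negatives(pos_diseases, all_diseases, ratio=10, seed=0):
    """Draw ratio * |positives| negative diseases for one patient from
    the diseases the patient does not newly develop."""
    rng = random.Random(seed)
    candidates = [d for d in all_diseases if d not in pos_diseases]
    k = ratio * len(pos_diseases)
    return rng.sample(candidates, min(k, len(candidates)))
```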
In the data set division stage, the data set is divided by patient into a training set, a validation set and a test set at a ratio of 7:1:2. The training set is used to train the disease risk prediction model; the validation set is used to tune the parameters of the model; the test set is used to evaluate the generalization of the model. At inference time, the full set of candidate samples is scored to obtain the prediction probability of each disease, and sorting these probabilities yields the risk ranking of the different diseases.
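The patient-level 7:1:2 split can be sketched as follows (seeding and rounding behavior are assumptions); splitting by patient rather than by edge keeps all records of one patient in the same subset:

```python
import random

def split_by_patient(patient_ids, seed=0):
    """Split patients into train/validation/test subsets at 7:1:2."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```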
The disease risk prediction model is trained in mini-batches: each step samples a subset of nodes and their neighbors to train the network, which makes training on large-scale graph data feasible. The model is effective and highly scalable; when predictions are needed on new data, only the neighbor information of the new nodes is required, without retraining on the whole graph as other graph neural network models require. The number of neighbor samples per layer of the disease risk prediction model is 10. To optimize the model parameters, back-propagation with gradient descent is used to optimize the weight matrices, yielding a trained disease risk prediction model.
The trained disease risk prediction model is then used to predict the disease risk of a new hospitalization record: referring to fig. 4, for a new hospitalization record, the patient's individual information, hospital information and historical disease diagnoses are likewise obtained, and the corresponding patient feature vector and disease feature vectors are extracted. The patient is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the GADP model computes the patient's risk for the other diseases, and sorting these risks in descending order returns the top-N diseases.
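The final top-N ranking step can be sketched as:

```python
def top_n_risks(disease_probs, n=5):
    """Sort predicted per-disease probabilities in descending order and
    return the top-N (disease, probability) pairs."""
    ranked = sorted(disease_probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```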
The above-mentioned embodiments only express specific implementations of the present application, and their description is comparatively specific and detailed, but they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, and all such changes and modifications fall within the protection scope of the present application.
Claims (8)
1. The construction method of the slow patient group disease risk prediction model based on the graph self-encoder is characterized by comprising the following steps of:
step 1: acquiring a data set of a historical case homepage, preprocessing data in the data set, and storing the preprocessed historical case data into a storage space established by a storage medium;
step 2: dividing the preprocessed historical case data, based on the time sequence, into diseases the patient has had historically and diseases the patient develops in the future; constructing the historical diseases as a patient-disease encoding bipartite graph, and constructing the diseases developed in the next N years as a patient-disease decoding bipartite graph;
step 3: calling the historical case data in the storage space, and extracting patient feature vectors and disease feature vectors based on the historical case data;
step 4: establishing an encoder and a decoder based respectively on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph, wherein the encoder is a graph attention network, and establishing a disease risk prediction model based on the encoder and the decoder;
step 5: training the disease risk prediction model based on the data set of the historical case home page.
2. The method of claim 1, wherein the preprocessing in step 1 eliminates variables whose missing rate in the data set exceeds 30%, and fills the missing values of the remaining variables with the mean of the non-missing part.
3. The method for constructing a slow patient group disease risk prediction model based on a graph self-encoder according to claim 1, wherein the edges in the patient-disease encoding bipartite graph represent diseases the patient has had historically, and the weights represent the number of occurrences of each disease; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein the positive samples are diseases the patient newly develops in the next N years, and the negative samples are diseases that do not newly occur in the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedded vector representations of patient nodes and disease nodes and to extract heuristic features, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
4. The method of constructing a slow patient population risk of illness prediction model based on graph self-encoder as claimed in claim 1, wherein the patient feature vector is extracted from individual information, hospital information, the number of historical diseases and the ECI comorbidity index of the historical diseases; data whose feature type is discrete are one-hot encoded and converted into 0-1 binary variables; data whose feature type is numerical are taken as continuous features with real values; and discrete data whose values have an ordinal relation are encoded as numerical features.
5. The method for constructing a slow patient group disease risk prediction model based on a graph self-encoder according to claim 1, wherein the disease feature vectors are extracted by arranging the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node and generating a vector for each disease node through one-hot encoding; and the prevalence rate of each disease is calculated as a feature characterizing how common the disease is.
6. The method for constructing a model for predicting the risk of a disease in a chronic patient group based on a graph self-encoder as claimed in claim 1, wherein the step 4 comprises the steps of:
step 4.1: establishing a heuristic feature extraction model, in which the four heuristic indices of an edge (i, j) of the patient-disease encoding bipartite graph are

CN_ij = |N(i) ∩ N2(j)| (Common Neighbors index)
AA_ij = Σ_{z ∈ N(i) ∩ N2(j)} 1 / log|N(z)| (Adamic-Adar index)
JC_ij = |N(i) ∩ N2(j)| / |N(i) ∪ N2(j)| (Jaccard coefficient index)
PA_ij = |N(i)| · |N2(j)| (preferential attachment index)

where N(i), N(j) and N(z) are the neighbor-node sets of nodes i, j and z, node i being the central node; |·| is the size of a set; and N2(j) is the second-order neighbor set of node j; the larger the value of an index, the higher the occurrence probability of the edge;
step 4.2: establishing a neighbor sampling strategy, in which the sampling probability is

p_ij = w_ij / Σ_{u ∈ N(i)} w_iu

where w_ij and p_ij are respectively the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph, and w_iu is the weight of edge (i, u); based on the sampling probabilities p_ij, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
step 4.3: a graph attention network is used as the encoder; the encoder comprises at least one graph convolution module, and the graph convolution layer of each module learns the weights of the different neighbors with a graph attention mechanism to obtain the final embedded vector representation; in layer l of the encoder, the multi-head attention weight from node j to node i at head c is calculated by the following formulas:

q_{c,i}^(l) = W_{c,q}^(l) h_i^(l) + b_{c,q}^(l)
k_{c,j}^(l) = W_{c,k}^(l) [h_j^(l) || w_ij] + b_{c,k}^(l)
α_{c,ij}^(l) = exp(⟨q_{c,i}^(l), k_{c,j}^(l)⟩ / √d) / Σ_{u ∈ N(i)} exp(⟨q_{c,i}^(l), k_{c,u}^(l)⟩ / √d)

where q_{c,i}^(l) is the query vector of central node i at head c in layer l of the encoder; W_{c,q}^(l) is the weight matrix of the query vector q at head c in layer l; h_i^(l) is the embedded vector of central node i in layer l; b_{c,q}^(l) is the bias term of the query vector q at head c in layer l; k_{c,j}^(l) is the key vector of node j at head c in layer l; W_{c,k}^(l) is the weight matrix of the key vector k at head c in layer l; h_j^(l) is the embedded vector of node j in layer l; w_ij is the weight of edge (i, j); b_{c,k}^(l) is the bias term of the key vector k at head c in layer l; α_{c,ij}^(l) is the attention weight of edge (i, j) at head c in layer l; k_{c,u}^(l) is the key vector of node u at head c in layer l; and the dot product is scaled by √d, where d is the dimension of the vectors;
after obtaining the multi-head attention weights, a message aggregation operation is performed over the embedded vectors of the different neighbors:

v_{c,j}^(l) = W_{c,v}^(l) h_j^(l) + b_{c,v}^(l)
ĥ_i^(l+1) = ||_{c=1..C} Σ_{j ∈ N(i)} α_{c,ij}^(l) v_{c,j}^(l)

where v_{c,j}^(l) is the value vector of node j at head c in layer l of the encoder; W_{c,v}^(l) is the weight matrix of the value vector v at head c in layer l; b_{c,v}^(l) is the bias term of the value vector v at head c in layer l; and ĥ_i^(l+1) is the attention vector of central node i in layer l+1 of the encoder;
the embedded vector h_i^(l) of central node i and the aggregated vector ĥ_i^(l+1) are combined through a gated residual mechanism that selectively controls the inflow of information to compute the embedded vector representation of the next layer:

r_i^(l) = W_r^(l) h_i^(l)
β_i^(l) = sigmoid( W_g^(l) [ ĥ_i^(l+1) || r_i^(l) || (ĥ_i^(l+1) - r_i^(l)) ] )
h_i^(l+1) = ReLU( LayerNorm( (1 - β_i^(l)) ĥ_i^(l+1) + β_i^(l) r_i^(l) ) )

where r_i^(l) represents the information of central node i in layer l of the encoder; W_r^(l) is the corresponding weight matrix of central node i in layer l; and β_i^(l) is the gated residual weight of central node i in layer l: ĥ_i^(l+1), r_i^(l) and ĥ_i^(l+1) - r_i^(l) are concatenated in sequence, linearly transformed by the weight matrix W_g^(l), and mapped into the interval (0, 1) by a sigmoid function, thereby controlling the inflow of information from r_i^(l) and ĥ_i^(l+1); finally, the embedded vector representation h_i^(l+1) of central node i in layer l+1 is obtained through LayerNorm and a ReLU activation function;
Step 4.4: constructing a bilinear decoder which predicts the existence probability of edges in a patient-decoding image for the embedded vector expression of known patients and diseases, and the calculation formula is as follows:
in the formula (I), the compound is shown in the specification,representing the index corresponding to the edge i, j of the patient-disease encoding bipartite graph as a heuristic characteristic;transpose of the embedded vector representing node i, hjA vector representing node j; the above formula uses multiple weight matrices to use a multi-head attention mechanismLearning from different anglesAnd hjThe combination method of (3) and then splicing the learned results with heuristic features to form hidden layer features of edgesFinally pass through WoThe weight matrix is linearly transformed by adding an offset term boObtaining the result of the output layer, and obtaining the prediction probability p of the edge i, j by using the sigmoid activation functionij:
the loss function uses cross entropy and is calculated as follows:

Loss = - Σ_{e_ij ∈ G_dec} [ y_ij log p_ij + (1 - y_ij) log(1 - p_ij) ]

where G_dec denotes the decoding graph, e_ij denotes edge (i, j), and y_ij is the label of the edge; the Loss of the model is optimized with a gradient descent algorithm to train the disease risk prediction model.
7. The method for constructing a slow patient population disease risk prediction model based on a graph self-encoder according to claim 1, wherein the preprocessed data set is divided into a training set, a validation set and a test set according to a ratio of 7:1: 2; the training set is used for training the disease risk prediction model, the verification set is used for optimizing parameters of the disease risk prediction model, and the test set is used for evaluating the generalization effect of the disease risk prediction model.
8. The method of claim 1, wherein all negative samples in the data set are collected to form a negative sample set, the negative sample set is sampled to generate the negative samples used to train the disease risk prediction model, and the ratio of positive samples to negative samples is set to 1:10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210507317.9A CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114783608A true CN114783608A (en) | 2022-07-22 |
CN114783608B CN114783608B (en) | 2023-05-05 |
Family
ID=82436498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210507317.9A Active CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783608B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713986A (en) * | 2022-11-11 | 2023-02-24 | 中南大学 | Attention mechanism-based material crystal property prediction method |
CN116072298A (en) * | 2023-04-06 | 2023-05-05 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116825360A (en) * | 2023-07-24 | 2023-09-29 | 湖南工商大学 | Method and device for predicting chronic disease co-morbid based on graph neural network and related equipment |
CN117438023A (en) * | 2023-10-31 | 2024-01-23 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117476240A (en) * | 2023-12-28 | 2024-01-30 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050222867A1 (en) * | 2004-03-31 | 2005-10-06 | Aetna, Inc. | System and method for administering health care cost reduction |
WO2013108122A1 (en) * | 2012-01-20 | 2013-07-25 | Mueller-Wolf Martin | "indima apparatus" system, method and computer program product for individualized and collaborative health care |
CN109036553A (en) * | 2018-08-01 | 2018-12-18 | 北京理工大学 | A kind of disease forecasting method based on automatic extraction Medical Technologist's knowledge |
CN111462896A (en) * | 2020-03-31 | 2020-07-28 | 重庆大学 | Real-time intelligent auxiliary ICD coding system and method based on medical record |
CN113689954A (en) * | 2021-08-24 | 2021-11-23 | 平安科技(深圳)有限公司 | Hypertension risk prediction method, device, equipment and medium |
CN114023449A (en) * | 2021-11-05 | 2022-02-08 | 中山大学 | Diabetes risk early warning method and system based on depth self-encoder |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713986A (en) * | 2022-11-11 | 2023-02-24 | 中南大学 | Attention mechanism-based material crystal property prediction method |
CN115713986B (en) * | 2022-11-11 | 2023-07-11 | 中南大学 | Attention mechanism-based material crystal attribute prediction method |
CN116072298A (en) * | 2023-04-06 | 2023-05-05 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116072298B (en) * | 2023-04-06 | 2023-08-15 | 之江实验室 | Disease prediction system based on hierarchical marker distribution learning |
CN116825360A (en) * | 2023-07-24 | 2023-09-29 | 湖南工商大学 | Method and device for predicting chronic disease co-morbid based on graph neural network and related equipment |
CN117438023A (en) * | 2023-10-31 | 2024-01-23 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117438023B (en) * | 2023-10-31 | 2024-04-26 | 灌云县南岗镇卫生院 | Hospital information management method and system based on big data |
CN117476240A (en) * | 2023-12-28 | 2024-01-30 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
CN117476240B (en) * | 2023-12-28 | 2024-04-05 | 中国科学院自动化研究所 | Disease prediction method and device with few samples |
Also Published As
Publication number | Publication date |
---|---|
CN114783608B (en) | 2023-05-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114783608A (en) | Construction method of slow patient group disease risk prediction model based on graph self-encoder | |
CN114169330B (en) | Chinese named entity recognition method integrating time sequence convolution and transform encoder | |
WO2022057669A1 (en) | Method for pre-training knowledge graph on the basis of structured context information | |
CN113905391B (en) | Integrated learning network traffic prediction method, system, equipment, terminal and medium | |
CN111738535A (en) | Method, device, equipment and storage medium for predicting rail transit time-space short-time passenger flow | |
CN110110318B (en) | Text steganography detection method and system based on cyclic neural network | |
CN113535984A (en) | Attention mechanism-based knowledge graph relation prediction method and device | |
CN104572583A (en) | Densification of longitudinal emr for improved phenotyping | |
CN114519469A (en) | Construction method of multivariate long sequence time sequence prediction model based on Transformer framework | |
Mustika et al. | Analysis accuracy of xgboost model for multiclass classification-a case study of applicant level risk prediction for life insurance | |
CN112215604A (en) | Method and device for identifying information of transaction relationship | |
CN113780665B (en) | Private car stay position prediction method and system based on enhanced recurrent neural network | |
CN116187555A (en) | Traffic flow prediction model construction method and prediction method based on self-adaptive dynamic diagram | |
CN116403730A (en) | Medicine interaction prediction method and system based on graph neural network | |
CN111178946B (en) | User behavior characterization method and system | |
CN112749791A (en) | Link prediction method based on graph neural network and capsule network | |
CN113345564B (en) | Early prediction method and device for patient hospitalization duration based on graph neural network | |
CN114398500A (en) | Event prediction method based on graph-enhanced pre-training model | |
CN112201348B (en) | Knowledge-aware-based multi-center clinical data set adaptation device | |
CN114513337A (en) | Privacy protection link prediction method and system based on mail data | |
CN113255750A (en) | VCC vehicle attack detection method based on deep learning | |
CN114418158A (en) | Cell network load index prediction method based on attention mechanism learning network | |
CN114840777B (en) | Multi-dimensional endowment service recommendation method and device and electronic equipment | |
CN115906846A (en) | Document-level named entity identification method based on double-graph hierarchical feature fusion | |
CN115035455A (en) | Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||