CN114783608B - Method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder - Google Patents
- Publication number
- CN114783608B (application CN202210507317.9A)
- Authority
- CN
- China
- Prior art keywords
- disease
- patient
- encoder
- graph
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/50—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Pathology (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Primary Health Care (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention relates to the technical field of medical information, and in particular to a method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder. The method constructs a patient-disease bipartite graph from the hospitalization records and historical disease information of patients, and then extracts feature vectors for the patients and the diseases respectively. Finally, a disease risk prediction model based on a graph attention mechanism is built on a graph autoencoder framework to predict the future disease risk of chronic disease patients. The encoder part of the model uses the attention mechanism and also takes the edge weights into account, so that the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.
Description
Technical Field
The invention relates to the technical field of medical information, and in particular to a method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder.
Background
Population aging and the rapid rise in the incidence of chronic diseases impose a serious social and economic burden worldwide. It is estimated that more than 75% of the elderly have more than one chronic disease, and multimorbidity in the elderly (two or more chronic diseases) has become a prominent global problem, resulting in greater medical needs, more use of medical services and higher costs. Chronic diseases are correlated in complex ways: some chronic diseases may lead to the occurrence of others, further increasing the treatment burden on the patient. Prevention and treatment of chronic diseases and their complications has become an unprecedented challenge. Effectively predicting the future disease risk of chronic disease patients allows doctors to intervene in advance and reduces the risk that related diseases occur, and is therefore of great practical significance for disease prevention. Existing disease risk prediction methods mainly have the following problems:
(1) Some prediction methods model disease prediction as a series of binary classification models, each of which predicts whether one disease occurs. With this approach the number of models grows with the number of diseases to be predicted, which limits the practicality of the models.
(2) Some prediction methods use the patient's historical disease information, abstract it into a patient-disease bipartite graph, and model the task as a link prediction problem, predicting disease risk with heuristic methods such as the Common Neighbors (CN) index and the Adamic-Adar (AA) index. These methods consider only the topology of the bipartite graph and ignore individual differences between patients, such as sex and age.
(3) Most existing prediction methods do not consider the complex influence relationships among diseases, so their prediction performance is poor.
Disclosure of Invention
To solve the technical problems in the prior art, the invention provides a method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder. It addresses the problem noted in the Background that existing prediction methods do not consider the complex relationships among diseases and therefore predict poorly.
The technical scheme adopted by the invention is as follows:
The method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder comprises the following steps:
step 1: acquiring a data set of first pages of historical medical records, preprocessing the data in the data set, and storing the preprocessed historical case data in a storage space established on a storage medium;
step 2: dividing the preprocessed historical case data, based on time order, into diseases the patient has had historically and diseases the patient has in the future, constructing the historical diseases into a patient-disease encoding bipartite graph, and constructing the diseases the patient has in the next N years into a patient-disease decoding bipartite graph;
step 3: retrieving the historical case data from the storage space, and extracting patient feature vectors and disease feature vectors from the historical case data;
step 4: establishing an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and establishing a disease risk prediction model based on the encoder and the decoder;
step 4.1: establishing a heuristic feature extraction model;
step 4.2: establishing a neighbor sampling strategy;
step 4.3: using a graph attention network as the encoder, wherein the encoder comprises at least one graph convolution module, and the graph convolution layer of each graph convolution module uses a graph attention mechanism to learn the weights of different neighbors and obtain the final embedded vector expression;
step 4.4: constructing a bilinear decoder based on the patient-disease decoding bipartite graph, wherein the bilinear decoder takes the known embedded vector expressions of patients and diseases and the heuristic features of the edges, and predicts the existence probability of each edge in the patient-disease decoding bipartite graph;
step 5: training the disease risk prediction model on the data set of first pages of historical medical records.
For a new hospitalization record, the invention likewise obtains the patient's individual information, hospital information and historical disease diagnoses, and extracts the corresponding patient feature vector and disease feature vectors; the patient's new hospitalization record is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model outputs the patient's risk for each of the other diseases, and the diseases are sorted in descending order of risk.
In the encoder part of the disease risk prediction model, the attention mechanism is used and the edge weights are taken into account, so that the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.
Preferably, the preprocessing in step 1 removes variables whose missing rate in the data set exceeds 30%, and, for the remaining variables that still contain missing values, fills each missing value with the mean of the non-missing values.
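The preprocessing rule can be illustrated with a minimal sketch. The record layout (a list of dictionaries with `None` marking a missing value), the function name `preprocess` and the example variables are assumptions made for illustration; the sketch also assumes the retained variables are numeric so that a mean is defined.

```python
def preprocess(rows, max_missing_rate=0.30):
    """rows: list of dicts mapping variable name -> number, or None if missing.

    Hypothetical helper: drops variables whose missing rate exceeds the
    threshold, then mean-fills the remaining missing values.
    """
    columns = list(rows[0].keys())
    kept = []
    for col in columns:
        missing = sum(1 for r in rows if r[col] is None)
        if missing / len(rows) <= max_missing_rate:
            kept.append(col)
    # Mean of the non-missing part of each retained variable.
    means = {}
    for col in kept:
        present = [r[col] for r in rows if r[col] is not None]
        means[col] = sum(present) / len(present)
    return [
        {col: (r[col] if r[col] is not None else means[col]) for col in kept}
        for r in rows
    ]


# Illustrative records only; the variable names are not taken from the patent.
records = [
    {"age": 70, "length_of_stay": 5, "bmi": None},
    {"age": None, "length_of_stay": 7, "bmi": None},
    {"age": 80, "length_of_stay": 9, "bmi": 24.0},
    {"age": 75, "length_of_stay": 3, "bmi": None},
]
cleaned = preprocess(records)
```

Here `bmi` is missing in 75% of the records and is dropped, while the single missing `age` is replaced by the mean of the other three values.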
Preferably, the edges in the patient-disease encoding bipartite graph represent diseases that the patient has had historically, and the edge weights represent the number of occurrences of the disease; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein a positive sample is a disease newly occurring in the patient within the next N years, and a negative sample is a disease that does not newly occur in the patient within the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedded vector expressions of the patient nodes and the disease nodes, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
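A small sketch may make the relationship between the two graphs concrete. It assumes diagnoses arrive as `(patient_id, disease_code)` events already split into a historical part and a future part; the function name `build_graphs` and the example codes are illustrative only.

```python
from collections import Counter
from itertools import product


def build_graphs(history, future):
    """history / future: lists of (patient_id, disease_code) diagnosis events."""
    # Encoding graph: one edge per historical (patient, disease) pair,
    # weighted by the number of occurrences.
    enc_edges = Counter(history)
    patients = sorted({p for p, _ in history})
    diseases = sorted({d for _, d in history} | {d for _, d in future})
    # A future diagnosis is "new" only if it is not already a historical edge.
    future_new = set(future) - set(enc_edges)
    # Decoding graph: complete bipartite graph minus the encoding edges,
    # labelled 1 (positive sample) or 0 (negative sample).
    dec_edges = {}
    for edge in product(patients, diseases):
        if edge not in enc_edges:
            dec_edges[edge] = 1 if edge in future_new else 0
    return enc_edges, dec_edges


history = [("p1", "I10"), ("p1", "I10"), ("p1", "E11"), ("p2", "I10")]
future = [("p1", "N18"), ("p2", "E11"), ("p2", "I10")]
enc, dec = build_graphs(history, future)
```

In this example the repeated future diagnosis `("p2", "I10")` does not appear in the decoding graph, because that edge already exists in the encoding graph.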
Preferably, the patient feature vector is extracted from the individual information, the hospital information, the number of historical diseases and the ECI comorbidity index of the historical diseases; features of discrete type are one-hot encoded into 0-1 binary variables; features of numerical type are treated as continuous features with real values; and features that are discrete but whose values have an order relation are encoded as numerical features.
Preferably, the disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node and then generating a one-hot vector for each disease node; in addition, the prevalence of each disease (the number of patients with the disease divided by the total number of patients) is calculated as a feature that characterizes how common the disease is.
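The disease feature vector described above can be sketched as follows. The input format (a mapping from patient ID to the set of ICD-10 codes the patient has) and the function name `disease_features` are assumptions for illustration; appending the prevalence as the last component is one possible layout, not necessarily the one used in the invention.

```python
def disease_features(patient_diseases):
    """patient_diseases: dict patient_id -> set of ICD-10 codes."""
    # Ascending sort of ICD-10 codes gives each disease node its serial number.
    codes = sorted({c for ds in patient_diseases.values() for c in ds})
    index = {code: k for k, code in enumerate(codes)}
    n_patients = len(patient_diseases)
    features = {}
    for code in codes:
        one_hot = [0.0] * len(codes)
        one_hot[index[code]] = 1.0
        # Prevalence: patients with the disease divided by all patients.
        prevalence = sum(code in ds for ds in patient_diseases.values()) / n_patients
        features[code] = one_hot + [prevalence]
    return features


feats = disease_features(
    {"p1": {"I10", "E11"}, "p2": {"I10"}, "p3": {"N18", "I10"}, "p4": {"E11"}}
)
```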
Preferably, step 4 includes the following steps:
step 4.1: establishing a heuristic feature extraction model:
$$CN_{ij} = |\Gamma(i) \cap \hat{\Gamma}(j)|$$
$$AA_{ij} = \sum_{z \in \Gamma(i) \cap \hat{\Gamma}(j)} \frac{1}{\log |\Gamma(z)|}$$
$$JC_{ij} = \frac{|\Gamma(i) \cap \hat{\Gamma}(j)|}{|\Gamma(i) \cup \hat{\Gamma}(j)|}$$
$$PA_{ij} = |\Gamma(i)| \cdot |\Gamma(j)|$$
where $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the neighbor node sets of nodes i, j and z respectively, and node i is the central node; $|\cdot|$ is the size of a set; $\hat{\Gamma}(j)$ is the second-order neighbor set of node j; $CN_{ij}$ is the Common Neighbors index of edge (i, j) of the patient-disease encoding bipartite graph, $AA_{ij}$ is the Adamic-Adar index of edge (i, j), $JC_{ij}$ is the Jaccard's coefficient index of edge (i, j), and $PA_{ij}$ is the Preferential Attachment index of edge (i, j); the larger the value of an index, the higher the occurrence probability of the edge;
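As a minimal sketch of the four indices in step 4.1, the code below evaluates them for one candidate patient-disease edge. It assumes the standard textbook forms of the Common Neighbors, Adamic-Adar, Jaccard and Preferential Attachment indices, and it assumes that the second-order neighbor set of j is the union of the neighbor sets of j's neighbors; the function and variable names are hypothetical.

```python
import math


def heuristic_features(adj, i, j):
    """adj: dict node -> set of neighbours in the encoding bipartite graph."""
    gamma_i, gamma_j = adj[i], adj[j]
    # ASSUMPTION: second-order neighbours of j = neighbours of j's neighbours.
    gamma_j2 = set().union(*(adj[z] for z in gamma_j))
    common = gamma_i & gamma_j2
    union = gamma_i | gamma_j2
    cn = len(common)
    # Nodes of degree 1 are skipped to avoid dividing by log(1) = 0.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common if len(adj[z]) > 1)
    jc = cn / len(union) if union else 0.0
    pa = len(gamma_i) * len(gamma_j)
    return {"CN": cn, "AA": aa, "JC": jc, "PA": pa}


# Toy graph: patients p1..p3 on one side, diseases d1..d3 on the other.
adj = {
    "p1": {"d1", "d2"},
    "p2": {"d1", "d2", "d3"},
    "p3": {"d2"},
    "d1": {"p1", "p2"},
    "d2": {"p1", "p2", "p3"},
    "d3": {"p2"},
}
scores = heuristic_features(adj, "p1", "d3")
```

For the candidate edge (p1, d3), the only patient with d3 is p2, whose diseases d1 and d2 are both shared with p1, so the common-neighbor count is 2.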
step 4.2: establishing a neighbor sampling strategy:
$$p_{ij} = \frac{w_{ij}}{\sum_{u \in \Gamma(i)} w_{iu}}$$
where $w_{ij}$ and $p_{ij}$ denote the weight and the sampling probability of edge (i, j) of the patient-disease encoding bipartite graph, $w_{iu}$ denotes the weight of edge (i, u) of the patient-disease encoding bipartite graph, and $\Gamma(i)$ is the neighbor set of the central node i; the neighbors of the central node are sampled with replacement according to the sampling probability $p_{ij}$ to obtain a fixed number of neighbor samples;
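The weighted sampling in step 4.2 can be sketched with the standard library, whose `random.choices` draws with replacement. The function name `sample_neighbors` and the example weights are illustrative assumptions.

```python
import random


def sample_neighbors(weighted_neighbors, k, rng=random):
    """weighted_neighbors: dict neighbour -> edge weight w_iu of centre node i.

    Returns k neighbours drawn with replacement, plus the sampling
    probabilities p_ij = w_ij / sum_u w_iu.
    """
    nodes = list(weighted_neighbors)
    total = sum(weighted_neighbors.values())
    probs = [weighted_neighbors[u] / total for u in nodes]
    sample = rng.choices(nodes, weights=probs, k=k)
    return sample, dict(zip(nodes, probs))


# A patient diagnosed three times with I10 and once with E11.
sample, probs = sample_neighbors({"I10": 3, "E11": 1}, k=5, rng=random.Random(0))
```

Because sampling is with replacement, a fixed number of samples is obtained even when the central node has fewer than k neighbors.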
step 4.3: using a graph attention network as the encoder, wherein the encoder comprises at least one graph convolution module, and the graph convolution layer of each graph convolution module uses a graph attention mechanism to learn the weights of different neighbors and obtain the final embedded vector expression; denoting the features of the l-th layer of the encoder by $H^{(l)}$, the multi-head attention weight $\alpha_{c,ij}^{(l)}$ from node j to node i is calculated as follows:
$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}$$
$$\alpha_{c,ij}^{(l)} = \frac{\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} \rangle}{\sum_{u \in \Gamma(i)} \langle q_{c,i}^{(l)}, k_{c,u}^{(l)} \rangle}, \qquad \langle q, k \rangle = \exp\left(\frac{q^{\top} k}{\sqrt{d}}\right)$$
where $q_{c,i}^{(l)}$ denotes the query vector of the central node i for the c-th attention head in the l-th layer of the encoder; $W_{c,q}^{(l)}$ denotes the weight matrix of the query vector q for the c-th head in the l-th layer; $h_i^{(l)}$ denotes the embedded vector of the central node i in the l-th layer; $b_{c,q}^{(l)}$ denotes the bias term of the query vector q for the c-th head in the l-th layer; $k_{c,j}^{(l)}$ denotes the key vector of node j for the c-th head in the l-th layer, obtained from the embedded vector $h_j^{(l)}$ of node j and the edge weight $w_{ij}$ through the weight matrix $W_{c,k}^{(l)}$ and the bias term $b_{c,k}^{(l)}$ of the key vector k; $w_{ij}$ denotes the weight of edge (i, j); $\alpha_{c,ij}^{(l)}$ denotes the attention weight of edge (i, j) for the c-th head in the l-th layer; $k_{c,u}^{(l)}$ denotes the key vector of node u for the c-th head in the l-th layer, and $\Gamma(i)$ is the neighbor set of node i; $\langle q, k \rangle$ is the exponentially scaled dot product of the two vectors, and d is the dimension of the vectors;
after the multi-head attention weights are obtained, a message aggregation operation is performed on the embedded vectors of the different neighbors:
$$v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}$$
$$\hat{h}_i^{(l+1)} = \Big\Vert_{c=1}^{C} \Big[ \sum_{j \in \Gamma(i)} \alpha_{c,ij}^{(l)} v_{c,j}^{(l)} \Big]$$
where $v_{c,j}^{(l)}$ denotes the value vector of node j for the c-th attention head in the l-th layer of the encoder; $W_{c,v}^{(l)}$ denotes the weight matrix of the value vector v for the c-th head in the l-th layer; $b_{c,v}^{(l)}$ denotes the bias term of the value vector v for the c-th head in the l-th layer; $\alpha_{c,ij}^{(l)}$ is the attention weight of edge (i, j) for the c-th head; $\hat{h}_i^{(l+1)}$ denotes the attention vector of the central node i computed by the l-th layer; C is the number of attention heads; and $\Vert$ denotes the concatenation of vectors;
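The attention and aggregation of step 4.3 can be sketched in plain Python for a single central node. This is an illustration under stated assumptions rather than the patented implementation: in particular, the text does not show exactly how the edge weight enters the key vector, so the sketch assumes it scales the neighbor embedding before the key projection. All function names and the toy parameters are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def attention_head(h_i, neighbors, p):
    """neighbors: list of (h_j, w_ij); p: dict with Wq, bq, Wk, bk, Wv, bv."""
    q = add(matvec(p["Wq"], h_i), p["bq"])
    d = len(q)
    scores, values = [], []
    for h_j, w_ij in neighbors:
        # ASSUMPTION: the edge weight w_ij scales h_j inside the key projection.
        k = add(matvec(p["Wk"], [w_ij * x for x in h_j]), p["bk"])
        values.append(add(matvec(p["Wv"], h_j), p["bv"]))
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    # Softmax over neighbours, i.e. normalised exp(q.k / sqrt(d)).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    out = [sum(a * v[t] for a, v in zip(alphas, values))
           for t in range(len(values[0]))]
    return alphas, out


def multi_head(h_i, neighbors, heads):
    """Concatenate the aggregated messages of all attention heads."""
    out = []
    for p in heads:
        out.extend(attention_head(h_i, neighbors, p)[1])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
params = {"Wq": identity, "bq": zero, "Wk": identity, "bk": zero,
          "Wv": identity, "bv": zero}
neigh = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0)]
alphas, out = attention_head([1.0, 0.0], neigh, params)
h_hat = multi_head([1.0, 0.0], neigh, [params, params])
```

With identity projections, the first neighbor is both aligned with the query and carries the larger edge weight, so it receives the larger attention weight.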
the embedded vector $h_i^{(l)}$ of the central node i is combined with the attention vector $\hat{h}_i^{(l+1)}$ obtained from the message aggregation, and a gated residual mechanism selectively controls the inflow of information, so as to calculate the embedded vector expression $h_i^{(l+1)}$ of the next layer; the specific calculation formulas are as follows:
$$r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}$$
$$\beta_i^{(l)} = \mathrm{sigmoid}\Big(W_g^{(l)} \big[\hat{h}_i^{(l+1)} \,\Vert\, r_i^{(l)} \,\Vert\, \hat{h}_i^{(l+1)} - r_i^{(l)}\big]\Big)$$
$$h_i^{(l+1)} = \mathrm{ReLU}\Big(\mathrm{LayerNorm}\big((1 - \beta_i^{(l)})\, \hat{h}_i^{(l+1)} + \beta_i^{(l)}\, r_i^{(l)}\big)\Big)$$
where $r_i^{(l)}$ denotes the information of the central node i in the l-th layer of the encoder; $W_r^{(l)}$ denotes the weight matrix of the central node i in the l-th layer; $b_r^{(l)}$ denotes the bias term of the central node i in the l-th layer; $\beta_i^{(l)}$ denotes the gated residual weight of the central node i in the l-th layer; $\hat{h}_i^{(l+1)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l+1)} - r_i^{(l)}$ are concatenated in sequence and linearly transformed by the weight matrix $W_g^{(l)}$, and the result is mapped to the interval from 0 to 1 by a sigmoid function, which controls how much information flows in from $r_i^{(l)}$ and from $\hat{h}_i^{(l+1)}$; finally, the embedded vector representation $h_i^{(l+1)}$ of the central node i is obtained through LayerNorm and the ReLU activation function.
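A minimal sketch of a gated residual update of this kind is given below. It assumes a scalar gate per node, a gate input formed by concatenating the attention vector, the residual vector and their difference, and a LayerNorm without learnable scale and shift; these details and all names are assumptions for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable affine parameters (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gated_residual(h_hat, r, w_g):
    """h_hat: aggregated attention vector; r: projected centre-node vector;
    w_g: 3*d gate weights producing one scalar gate for the node."""
    diff = [a - b for a, b in zip(h_hat, r)]
    gate_input = h_hat + r + diff          # concatenation of the three vectors
    beta = sigmoid(sum(w * x for w, x in zip(w_g, gate_input)))
    # The gate interpolates between the aggregated message and the residual.
    mixed = [(1.0 - beta) * a + beta * b for a, b in zip(h_hat, r)]
    return beta, [max(0.0, v) for v in layer_norm(mixed)]


beta, h_next = gated_residual([0.0, 4.0], [0.0, 0.0], [0.0] * 6)
```

With all gate weights at zero the gate is exactly 0.5, so the two inputs are averaged before normalization and activation.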
Step 4.4: constructing a bilinear decoder, wherein the bilinear decoder takes the known embedded vector expressions of patients and diseases and predicts the existence probability of each edge in the patient-disease decoding bipartite graph, calculated as follows:
$$z_{ij} = \Big\Vert_{c=1}^{C} \, h_i^{\top} W_c \, h_j$$
$$x_{ij} = \big[ z_{ij} \,\Vert\, s_{ij} \big]$$
$$p_{ij} = \mathrm{sigmoid}\big(W_o \, x_{ij} + b_o\big)$$
where $s_{ij}$ denotes the indices corresponding to edge (i, j), which are used as heuristic features; $h_i^{\top}$ denotes the transpose of the embedded vector of node i, and $h_j$ denotes the embedded vector of node j; following the multi-head attention mechanism, several weight matrices $W_c$ are used to learn the interaction between $h_i^{\top}$ and $h_j$, and the learned results are concatenated to obtain $z_{ij}$; $z_{ij}$ is concatenated with the heuristic features to form the hidden layer feature expression $x_{ij}$ of the edge; finally, a linear transformation by the weight matrix $W_o$ and the addition of the bias term $b_o$ give the result of the output layer, and a sigmoid activation function gives the prediction probability $p_{ij}$ of edge (i, j).
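The decoder of step 4.4 can be sketched as several bilinear forms whose outputs are concatenated with the heuristic features and passed through a linear output layer. The function name, the number of bilinear matrices and the toy parameter values are assumptions for illustration.

```python
import math


def bilinear_decoder(h_i, h_j, heads, s_ij, w_o, b_o):
    """heads: list of weight matrices W_c; s_ij: heuristic features of the edge."""
    z = []
    for W in heads:
        # One bilinear form h_i^T W_c h_j per weight matrix.
        Wh = [sum(w * x for w, x in zip(row, h_j)) for row in W]
        z.append(sum(a * b for a, b in zip(h_i, Wh)))
    hidden = z + list(s_ij)                 # concatenate with heuristic features
    logit = sum(w * x for w, x in zip(w_o, hidden)) + b_o
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> edge probability


heads = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
p_ij = bilinear_decoder(
    h_i=[1.0, 2.0], h_j=[3.0, 4.0], heads=heads,
    s_ij=[2.0, 0.5], w_o=[0.1, -0.1, 0.5, 0.0], b_o=-1.1,
)
```

In this toy case the two bilinear terms are 11 and 10, and the output weights are chosen so that the logit is zero and the predicted probability is 0.5.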
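The loss function is the cross entropy, calculated as follows:
$$Loss = -\sum_{e_{ij} \in G_{dec}} \big[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \big]$$
where $G_{dec}$ denotes the decoding graph, $e_{ij}$ denotes the edge (i, j), $y_{ij}$ denotes the label of the edge, and $p_{ij}$ is the predicted probability of the edge; the Loss of the model is minimized with a gradient descent algorithm to train the disease risk prediction model.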
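A cross-entropy loss over the decoding-graph edges can be sketched as below. Summing rather than averaging over edges, and clipping probabilities for numerical safety, are choices made for this illustration.

```python
import math


def cross_entropy_loss(edges, eps=1e-12):
    """edges: iterable of (p_ij, y_ij) pairs for the decoding-graph edges."""
    total = 0.0
    for p, y in edges:
        p = min(max(p, eps), 1.0 - eps)    # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


uninformed = cross_entropy_loss([(0.5, 1), (0.5, 0)])
confident = cross_entropy_loss([(0.9, 1), (0.1, 0)])
```

Predictions that agree with the labels yield a smaller loss than uninformed predictions of 0.5, which is the quantity gradient descent drives down during training.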
Preferably, the preprocessed data set is divided into a training set, a validation set and a test set in the ratio 7:1:2; the training set is used to train the disease risk prediction model, the validation set is used to tune the parameters of the disease risk prediction model, and the test set is used to evaluate the generalization performance of the disease risk prediction model.
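A 7:1:2 split can be sketched as follows. The text does not state the unit being split, so the sketch treats the items generically (for example patient IDs); the random shuffle and the fixed seed are assumptions.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train / validation / test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * ratios[0] // sum(ratios)
    n_val = n * ratios[1] // sum(ratios)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]          # remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
```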
Preferably, all negative samples in the data set are acquired to form a negative sample set, the negative sample set is sampled to generate a negative sample for training a disease risk prediction model, and the ratio of the positive sample to the negative sample is set to be 1:10.
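The 1:10 negative sampling can be illustrated with a short sketch. Sampling without replacement and capping at the size of the negative pool are assumptions; the function name and the toy edges are hypothetical.

```python
import random


def sample_negatives(positives, negative_pool, ratio=10, seed=0):
    """Draw `ratio` negative edges per positive edge from the negative pool."""
    k = min(len(negative_pool), ratio * len(positives))
    return random.Random(seed).sample(list(negative_pool), k)


positives = [("p1", "N18"), ("p2", "E11")]
negative_pool = [("p%d" % i, "I10") for i in range(3, 53)]   # 50 candidate edges
negatives = sample_negatives(positives, negative_pool)
```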
The beneficial effects of the invention include:
1. For a new hospitalization record, the invention likewise obtains the patient's individual information, hospital information and historical disease diagnoses, and extracts the corresponding patient feature vector and disease feature vectors; the patient's new hospitalization record is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the trained disease risk prediction model outputs the patient's risk for each of the other diseases, and the diseases are sorted in descending order of risk.
In the encoder part of the disease risk prediction model, the attention mechanism is used and the edge weights are taken into account, so that the topology of the bipartite graph and the individual differences between patients are considered together, the complex influence relationships among diseases are learned, and the prediction performance is improved.
2. The final output ranks the diseases in descending order of predicted probability, which provides risk prediction for all diseases and gives the invention broad practical value.
3. The invention can complete modeling using only the first-page data of patients' medical records; it extracts patient feature vectors and disease feature vectors, mines the available information from every aspect, and strengthens the predictive ability of the model.
4. In addition to the node embedded vectors learned by the encoder, the decoder part of the disease risk prediction model extracts heuristic features such as CN and AA for each edge; these heuristic features supply additional information, so that the model converges faster and performs better.
Drawings
FIG. 1 is a construction of a patient-disease bipartite graph of the present invention.
Fig. 2 is a diagram showing a disease risk prediction model structure according to the present invention.
FIG. 3 is a training flow chart of the disease risk prediction model of the present invention.
Fig. 4 is a prediction flow chart of the disease risk prediction model of the present invention.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
Embodiments of the present invention are described in further detail below with reference to FIGS. 1 to 4:
The method for constructing a disease risk prediction model for chronic disease patient groups based on a graph autoencoder comprises the following steps:
step 1: acquiring a data set of first pages of historical medical records, preprocessing the data in the data set, and storing the preprocessed historical case data in a storage space established on a storage medium;
The first-page data of a historical medical record is the record entry generated after a patient completes a hospitalization. Each record contains the patient's individual information (encrypted identity card number, sex, age, admission time, discharge time, etc.), information on the hospital (hospital grade, hospital address, etc.) and the patient's inpatient disease diagnoses (one main diagnosis and at most 15 secondary diagnoses), which are coded with the International Classification of Diseases, 10th Revision (ICD-10). The data therefore needs to be preprocessed: variables whose missing rate in the data set exceeds 30% are removed, and for the remaining variables that still contain missing values, each missing value is filled with the mean of the non-missing values. The resulting data without missing values is stored in a storage space established on a storage medium, such as a database.
The objective of the invention is to predict a patient's disease risk over the next N years from the patient's historical diseases and individual information. Therefore, before the data without missing values is formed, patients whose hospitalization records span more than N years need to be selected; the chronic disease diagnoses from hospitalizations in the last N years are taken as the prediction labels, and the earlier hospitalization history is taken as the known information. By setting the value of N, the invention can predict future disease risk at different time granularities.
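The patient selection and time-based split can be sketched as below. The record layout, the function name `split_by_horizon` and the approximation of one year as 365 days are assumptions made for illustration.

```python
from datetime import date, timedelta


def split_by_horizon(records, n_years):
    """records: list of (patient_id, admission_date, icd10_code)."""
    by_patient = {}
    for pid, day, code in records:
        by_patient.setdefault(pid, []).append((day, code))
    history, labels = [], []
    for pid, visits in by_patient.items():
        first = min(day for day, _ in visits)
        last = max(day for day, _ in visits)
        # ASSUMPTION: N years approximated as 365 * N days before the last visit.
        cutoff = last - timedelta(days=365 * n_years)
        if first >= cutoff:
            continue   # record span is not longer than N years: patient excluded
        for day, code in visits:
            (labels if day > cutoff else history).append((pid, code))
    return history, labels


records = [
    ("p1", date(2015, 3, 1), "I10"),
    ("p1", date(2018, 6, 1), "E11"),
    ("p1", date(2020, 5, 1), "N18"),
    ("p2", date(2019, 1, 1), "I10"),
    ("p2", date(2020, 1, 1), "E11"),
]
history, labels = split_by_horizon(records, n_years=3)
```

Patient p2 is excluded because the records span only one year, which is shorter than the three-year prediction horizon.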
Step 2: dividing the preprocessed historical case data, based on time order, into diseases the patient has had historically and diseases the patient has in the future, constructing the historical diseases into a patient-disease encoding bipartite graph, and constructing the diseases the patient has in the next N years into a patient-disease decoding bipartite graph;
Referring to FIG. 1, in order to predict the diseases a patient may develop in the next N years, the invention abstracts the task into a link prediction problem on a bipartite graph. The nodes on the left of the bipartite graph represent different patients, the nodes on the right represent different diseases, and edges exist only between patients and diseases. The patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedded vector expressions of the patient nodes and disease nodes, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
The edges in the patient-disease encoding bipartite graph represent diseases that the patient has had historically, and the edge weights represent the number of occurrences of the disease; a solid line in the patient-disease decoding bipartite graph represents a positive sample, i.e., a disease newly occurring in the patient within the next N years; a dashed line in the patient-disease decoding bipartite graph represents a negative sample, i.e., a disease that does not newly occur in the patient within the next N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the complete bipartite graph.
The patient-disease encoding bipartite graph constructed by the invention is used by the disease risk prediction model to automatically learn the embedded vector expressions of the patient nodes and disease nodes, while the patient-disease decoding bipartite graph is used to decode the patient's future disease risk. (The model constructed by the invention is named the GADP model, for Graph Attention Disease Prediction; for convenience of description it is referred to herein as the disease risk prediction model.)
Step 3: retrieving the historical case data from the storage space, and extracting patient feature vectors and disease feature vectors from the historical case data;
The patient feature vector is extracted from the individual information, the hospital information, the number of historical diseases and the Elixhauser Comorbidity Index (ECI) of the historical diseases; the ECI quantifies the patient's physical condition to some extent. Features of discrete type are one-hot encoded into 0-1 binary variables; features of numerical type are treated as continuous features with real values; and features that are discrete but whose values have an order relation are encoded as numerical features.
See in particular table 1 below:
TABLE 1 extraction of feature vectors for patient nodes
Referring to Table 1, the third column of the table is the data type of each feature. Numerical features are treated as continuous features with real values. Discrete features are converted into 0-1 binary variables by one-hot encoding. However, some discrete fields have ordered values: the "hospitalization" field, for example, takes the values critical, urgent and general, and it is coded as the numerical values 1, 2 and 3. This reduces the feature dimension while preserving the order information.
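The three encoding rules can be sketched for a single patient record. Because the contents of Table 1 are not reproduced here, the field names, category lists and example values below are hypothetical and serve only to show one-hot, ordinal and continuous encoding side by side.

```python
# Hypothetical category lists; the real ones would come from Table 1.
SEXES = ("female", "male")
HOSPITAL_GRADES = ("primary", "secondary", "tertiary")
CONDITION_ORDER = {"critical": 1, "urgent": 2, "general": 3}


def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]


def encode_patient(rec):
    """Build a patient feature vector from one first-page record."""
    return (
        one_hot(rec["sex"], SEXES)                          # discrete -> one-hot
        + one_hot(rec["hospital_grade"], HOSPITAL_GRADES)   # discrete -> one-hot
        + [float(CONDITION_ORDER[rec["admission_condition"]])]  # ordered -> 1, 2, 3
        + [float(rec["age"]), float(rec["n_history_diseases"]), float(rec["eci"])]
    )


vec = encode_patient({
    "sex": "male", "hospital_grade": "tertiary", "admission_condition": "urgent",
    "age": 72, "n_history_diseases": 4, "eci": 6,
})
```

Encoding the ordered field as a single number uses one dimension instead of three, which is the dimension saving the text describes.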
The disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; in addition, the prevalence of each disease is calculated and used as a feature characterizing how common the disease is.
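A minimal sketch of the disease feature extraction follows. It assumes that prevalence means the fraction of patients in the data set who have the disease; the description does not give the exact definition, and the function name and input layout are hypothetical.

```python
def disease_features(icd_codes, patient_diseases):
    """Build one feature vector per disease.

    icd_codes: iterable of ICD-10 codes (the disease nodes).
    patient_diseases: dict mapping patient id -> set of ICD-10 codes.
    Returns a dict: code -> one-hot vector followed by the prevalence.
    """
    ordered = sorted(set(icd_codes))  # ascending ICD-10 order
    serial = {code: k for k, code in enumerate(ordered)}
    n_patients = len(patient_diseases)
    features = {}
    for code in ordered:
        vec = [0.0] * len(ordered)
        vec[serial[code]] = 1.0  # one-hot on the serial number
        # Assumed definition: share of patients having this disease.
        prevalence = sum(code in ds for ds in patient_diseases.values()) / n_patients
        features[code] = vec + [prevalence]
    return features


feats = disease_features(
    ["I10", "E11", "J44"],
    {"p1": {"I10", "E11"}, "p2": {"I10"}},
)
```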
Step 4: establishing an encoder and a decoder based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph respectively, wherein the encoder is a graph attention network, and establishing a disease risk prediction model based on the encoder and the decoder;
the present invention uses a graph self-encoder (Graph Auto-Encoder, GAE) as the basic link prediction architecture. As an end-to-end model, the graph self-encoder automatically learns the embedded vector expression of each node in the encoding graph, and the decoder then predicts the probability of each edge in the decoding graph. The core components of the graph self-encoder are the encoder and the decoder. The present invention uses a graph attention network (Graph Attention Networks, GAT) as the encoder and a bilinear layer (Bilinear layer) as the decoder, and names this model the graph attention disease prediction (Graph Attention Disease Prediction, GADP) model, the network structure of which is shown in fig. 2.
The step 4 comprises the following steps:
step 4.1: establishing a heuristic feature extraction model:
$$CN_{ij} = |\Gamma(i) \cap \hat{\Gamma}(j)|$$
$$AA_{ij} = \sum_{z \in \Gamma(i) \cap \hat{\Gamma}(j)} \frac{1}{\log |\Gamma(z)|}$$
$$JC_{ij} = \frac{|\Gamma(i) \cap \hat{\Gamma}(j)|}{|\Gamma(i) \cup \hat{\Gamma}(j)|}$$
$$PA_{ij} = |\Gamma(i)| \cdot |\Gamma(j)|$$
where $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the sets of neighbor nodes of nodes i, j and z, respectively, node i being the central node; $|\cdot|$ is the size of a set; $\hat{\Gamma}(j)$ is the second-order neighbor set of node j; $CN_{ij}$ is the Common Neighbors index, $AA_{ij}$ the Adamic-Adar index, $JC_{ij}$ the Jaccard's Coefficient index and $PA_{ij}$ the Preferential Attachment index of the edge (i, j) of the patient-disease encoding bipartite graph; the larger the value of an index, the higher the occurrence probability of the edge;
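The four heuristic indices can be sketched as follows for a patient i and a disease j. This is an illustration under assumptions, not the patent's own implementation: the graph is a plain dict from patient to disease set, the second-order neighbor set of j is taken to be the diseases that co-occur with j in at least one patient (excluding j itself), and neighbors of degree 1 are skipped in the Adamic-Adar sum to avoid dividing by log 1 = 0.

```python
import math


def heuristic_features(patient_to_diseases, i, j):
    """Return (CN, AA, JC, PA) for the candidate edge (patient i, disease j)."""
    gamma_i = patient_to_diseases[i]  # diseases of patient i
    # patients that have disease j
    gamma_j = {p for p, ds in patient_to_diseases.items() if j in ds}
    # second-order neighbors of j: diseases co-occurring with j (assumption: j excluded)
    gamma2_j = set().union(*(patient_to_diseases[p] for p in gamma_j)) - {j}

    def degree(z):
        """Number of patients having disease z."""
        return sum(z in ds for ds in patient_to_diseases.values())

    common = gamma_i & gamma2_j
    union = gamma_i | gamma2_j
    cn = len(common)
    aa = sum(1.0 / math.log(degree(z)) for z in common if degree(z) > 1)
    jc = cn / len(union) if union else 0.0
    pa = len(gamma_i) * len(gamma_j)
    return cn, aa, jc, pa


graph = {"p1": {"A", "B"}, "p2": {"B", "C"}, "p3": {"A", "C"}}
cn, aa, jc, pa = heuristic_features(graph, "p1", "C")
```

In this toy graph, patient p1 shares both of its diseases with patients who have C, so all four indices are comparatively high for the edge (p1, C).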
step 4.2: neighbor sampling strategies in graph neural networks generally sample a certain number of neighbors from a uniform random distribution; however, because different diseases affect a patient to different degrees, a non-uniform neighbor sampling strategy is designed that takes the edge weights of the patient-disease encoding bipartite graph into account, so that the larger the weight, the higher the sampling probability. The specific neighbor sampling strategy is as follows:
$$\pi_{ij} = \frac{w_{ij}}{\sum_{u \in \Gamma(i)} w_{iu}}$$
where $w_{ij}$ and $\pi_{ij}$ are the weight and the sampling probability of the edge (i, j) of the patient-disease encoding bipartite graph, respectively, and $w_{iu}$ is the weight of the edge (i, u) of the patient-disease encoding bipartite graph; based on the sampling probability $\pi_{ij}$, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
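The weighted sampling with replacement can be sketched with the standard library alone. The function names are hypothetical; `random.Random.choices` draws with replacement according to the supplied weights.

```python
import random


def sampling_probabilities(weights):
    """Normalize the edge weights of one central node into sampling probabilities."""
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def sample_neighbors(weights, k, rng):
    """Draw k neighbors with replacement, proportionally to the edge weights."""
    probs = sampling_probabilities(weights)
    neighbors = list(probs)
    return rng.choices(neighbors, weights=[probs[j] for j in neighbors], k=k)


# Edge weights of one patient: number of occurrences of each historical disease.
edge_weights = {"I10": 3, "E11": 1}
probs = sampling_probabilities(edge_weights)
sample = sample_neighbors(edge_weights, k=10, rng=random.Random(0))
```

Sampling with replacement means that a patient with fewer neighbors than the fixed sample size still yields a sample of the required length.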
step 4.3: referring to fig. 2, a graph attention network is used as the encoder; the encoder comprises two identical graph convolution modules, and the graph convolution layer of each graph convolution module uses a graph attention mechanism to learn the weights of different neighbors and obtain the final embedded vector expression. Denoting the features of the l-th layer of the encoder by $H^{(l)}$, the multi-head attention weight $\alpha_{c,ij}^{(l)}$ from node j to node i is calculated by the following formulas:
$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}$$
$$k_{c,j}^{(l)} = W_{c,k}^{(l)} \left[ h_j^{(l)} \,\Vert\, w_{ij} \right] + b_{c,k}^{(l)}$$
$$\alpha_{c,ij}^{(l)} = \frac{\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} \rangle}{\sum_{u} \langle q_{c,i}^{(l)}, k_{c,u}^{(l)} \rangle}, \qquad \langle q, k \rangle = \exp\left( \frac{q^{T} k}{\sqrt{d}} \right)$$
where $q_{c,i}^{(l)}$ is the query vector of the central node i at the c-th attention head in the l-th layer of the encoder; $W_{c,q}^{(l)}$ is the weight matrix of the query vector q at the c-th head in the l-th layer; $h_i^{(l)}$ is the embedded vector of the central node i in the l-th layer; $b_{c,q}^{(l)}$ is the bias term of the query vector q at the c-th head in the l-th layer; $k_{c,j}^{(l)}$ is the key vector of node j at the c-th head in the l-th layer; $W_{c,k}^{(l)}$ is the weight matrix of the key vector k at the c-th head in the l-th layer; $h_j^{(l)}$ is the embedded vector of node j in the l-th layer; $w_{ij}$ is the weight of the edge (i, j); $b_{c,k}^{(l)}$ is the bias term of the key vector k at the c-th head in the l-th layer; $\alpha_{c,ij}^{(l)}$ is the attention weight of the edge (i, j) at the c-th head in the l-th layer; $k_{c,u}^{(l)}$ is the key vector of node u at the c-th head in the l-th layer, the sum running over the neighbors u of node i; $\langle q, k \rangle$ is the exponentially scaled dot product of the vectors, and d is the dimension of the vectors;
first, the embedded vector $h_i^{(l)}$ of the central node in the l-th layer is linearly transformed by $W_{c,q}^{(l)}$ into the query vector $q_{c,i}^{(l)}$; the embedded vector $h_j^{(l)}$ of a neighbor node is spliced with the edge weight $w_{ij}$ and linearly transformed by $W_{c,k}^{(l)}$ into the key vector $k_{c,j}^{(l)}$; the attention weight of the edge is then calculated with $\langle q, k \rangle$; finally, a normalization operation gives the normalized attention weight $\alpha_{c,ij}^{(l)}$.
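The attention weight computation for a single head can be sketched in plain Python as follows. The helper `matvec`, the toy matrices and the dict-based neighbor layout are illustrative assumptions; a real encoder would use a tensor library and learned parameters.

```python
import math


def matvec(W, x, b):
    """Affine map W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def attention_weights(h_i, neighbors, edge_w, Wq, bq, Wk, bk):
    """Normalized attention weights of one head.

    neighbors: dict j -> embedded vector h_j (list of floats).
    edge_w: dict j -> edge weight w_ij.
    """
    q = matvec(Wq, h_i, bq)  # query of the central node
    d = len(q)
    scores = {}
    for j, h_j in neighbors.items():
        # key built from the neighbor embedding spliced with the edge weight
        k = matvec(Wk, h_j + [edge_w[j]], bk)
        scores[j] = math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    total = sum(scores.values())
    return {j: s / total for j, s in scores.items()}


alpha = attention_weights(
    h_i=[1.0, 0.0],
    neighbors={"a": [1.0, 0.0], "b": [0.0, 1.0]},
    edge_w={"a": 2.0, "b": 1.0},
    Wq=[[1.0, 0.0], [0.0, 1.0]], bq=[0.0, 0.0],
    Wk=[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]], bk=[0.0, 0.0],
)
```

Because the key matrix here lets the edge weight contribute to the first key component, the neighbor with the larger edge weight receives the larger attention weight.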
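After the multi-head attention weights of the graph are obtained, a message aggregation operation is performed on the embedded vectors of the different neighbors:
$$v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}$$
$$\hat{h}_i^{(l)} = \Big\Vert_{c=1}^{C} \Big[ \sum_{j} \alpha_{c,ij}^{(l)} v_{c,j}^{(l)} \Big]$$
where C is the total number of attention heads and $\Vert$ is the vector splicing operation. First, $h_j^{(l)}$ is linearly transformed by $W_{c,v}^{(l)}$ into the value vector $v_{c,j}^{(l)}$; then the value vectors are summed with the attention weights $\alpha_{c,ij}^{(l)}$ as coefficients; finally, the results of the different heads are spliced together to form the neighbor-aggregated multi-head attention vector $\hat{h}_i^{(l)}$.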
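The embedded vector $h_i^{(l)}$ of the central node is combined with $\hat{h}_i^{(l)}$, and a gated residual mechanism is used to selectively control the inflow of information, thereby calculating the embedded vector expression $h_i^{(l+1)}$ of the next layer. The specific calculation formula is as follows:
$$r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}$$
$$\beta_i^{(l)} = \mathrm{sigmoid}\left( W_g^{(l)} \left[ \hat{h}_i^{(l)} \,\Vert\, r_i^{(l)} \,\Vert\, \hat{h}_i^{(l)} - r_i^{(l)} \right] \right)$$
$$h_i^{(l+1)} = \mathrm{ReLU}\left( \mathrm{LayerNorm}\left( (1 - \beta_i^{(l)}) \hat{h}_i^{(l)} + \beta_i^{(l)} r_i^{(l)} \right) \right)$$
where $r_i^{(l)}$ is obtained from the embedded vector $h_i^{(l)}$ of the central node by the linear transformation $W_r^{(l)}$, and $\beta_i^{(l)}$ is the weight of the gated residual: $\hat{h}_i^{(l)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l)} - r_i^{(l)}$ are spliced in sequence, linearly transformed by $W_g^{(l)}$, and mapped to the interval from 0 to 1 by a sigmoid function, so as to control the inflow of information from $r_i^{(l)}$ and $\hat{h}_i^{(l)}$; finally, the embedded vector representation $h_i^{(l+1)}$ of the central node i in layer l+1 is obtained through LayerNorm and the ReLU activation function.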
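The aggregation and gated residual steps can be sketched as follows. Several choices here are assumptions rather than statements of the patent: the gate input is taken to be the aggregated vector, the residual vector and their difference spliced together, the gate is a single scalar, the mixing is (1 - beta) times the aggregated vector plus beta times the residual, and LayerNorm is applied without learnable scale or shift.

```python
import math


def aggregate(alphas, values):
    """Weighted sum of value vectors per head, with the heads spliced together.

    alphas: list over heads of dict j -> attention weight.
    values: list over heads of dict j -> value vector.
    """
    out = []
    for alpha_c, v_c in zip(alphas, values):
        dim = len(next(iter(v_c.values())))
        out += [sum(alpha_c[j] * v_c[j][t] for j in alpha_c) for t in range(dim)]
    return out


def layer_norm(x, eps=1e-5):
    """Layer normalization without learnable parameters (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gated_update(h_hat, r, w_g):
    """Mix aggregated message h_hat and residual r through a sigmoid gate."""
    concat = h_hat + r + [a - b for a, b in zip(h_hat, r)]
    beta = 1.0 / (1.0 + math.exp(-sum(w * c for w, c in zip(w_g, concat))))
    mixed = [(1.0 - beta) * a + beta * b for a, b in zip(h_hat, r)]
    return [max(0.0, v) for v in layer_norm(mixed)]  # ReLU after LayerNorm


h_hat = aggregate(
    alphas=[{"a": 0.75, "b": 0.25}, {"a": 0.5, "b": 0.5}],
    values=[{"a": [1.0, 0.0], "b": [0.0, 1.0]}, {"a": [2.0, 0.0], "b": [0.0, 2.0]}],
)
h_next = gated_update(h_hat, r=[0.0] * 4, w_g=[0.0] * 12)
```

With an all-zero gate weight the gate evaluates to 0.5, so the aggregated message and the residual contribute equally before normalization.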
Step 4.4: constructing a bilinear decoder, wherein each edge in the patient-disease decoding bipartite graph corresponds to a unique patient and a unique disease; given the embedded vector expressions of a patient and a disease, the bilinear decoder predicts the existence probability of the corresponding edge in the patient-disease decoding bipartite graph, and the calculation formula is as follows:
$$z_{ij} = \left[ h_i^{T} W_1 h_j \,\Vert\, \cdots \,\Vert\, h_i^{T} W_B h_j \,\Vert\, CN_{ij} \,\Vert\, AA_{ij} \,\Vert\, JC_{ij} \,\Vert\, PA_{ij} \right]$$
$$p_{ij} = \mathrm{sigmoid}\left( W_o z_{ij} + b_o \right)$$
where $CN_{ij}$, $AA_{ij}$, $JC_{ij}$ and $PA_{ij}$ are the indices corresponding to the edge (i, j) and are used as heuristic features; $h_i^{T}$ is the transpose of the embedded vector of node i, and $h_j$ is the embedded vector of node j; following the multi-head attention mechanism, several weight matrices $W_b$ (b = 1, ..., B) are used to learn the interaction between $h_i^{T}$ and $h_j$, and the learned results are spliced together with the heuristic features to form the hidden layer feature $z_{ij}$ of the edge; the subscript b of $W_b$ is used only to distinguish the different weight matrices;
finally, $z_{ij}$ is linearly transformed by the weight matrix $W_o$ and the bias term $b_o$ is added to obtain the result of the output layer, and a sigmoid activation function gives the prediction probability $p_{ij}$ of the edge (i, j).
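A minimal sketch of the bilinear decoder follows, assuming that each weight matrix produces one scalar interaction term and that the four heuristic indices are appended to these terms before the output layer. The toy matrices and the all-zero output weights are illustrative only.

```python
import math


def bilinear(h_i, W, h_j):
    """Scalar interaction h_i^T W h_j."""
    return sum(
        h_i[a] * W[a][b] * h_j[b]
        for a in range(len(h_i))
        for b in range(len(h_j))
    )


def edge_probability(h_i, h_j, bilinear_weights, heuristics, w_o, b_o):
    """Predicted existence probability of the edge (i, j)."""
    # hidden edge feature: bilinear terms spliced with the heuristic features
    z = [bilinear(h_i, W, h_j) for W in bilinear_weights] + list(heuristics)
    logit = sum(w * v for w, v in zip(w_o, z)) + b_o
    return 1.0 / (1.0 + math.exp(-logit))


h_patient, h_disease = [1.0, 2.0], [3.0, 4.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.0, 1.0], [1.0, 0.0]]
p = edge_probability(
    h_patient, h_disease, [W1, W2],
    heuristics=(2.0, 2.885, 1.0, 4.0),  # CN, AA, JC, PA of the edge
    w_o=[0.0] * 6, b_o=0.0,
)
```

With untrained (zero) output weights the prediction is 0.5 for every edge; training moves the probabilities toward the edge labels.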
The loss function is the cross entropy, calculated as follows:
$$Loss = -\sum_{e_{ij} \in G_{dec}} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \right]$$
where $G_{dec}$ is the decoding graph, $e_{ij}$ is the edge (i, j), and $y_{ij}$ is the label of the edge; the Loss of the model is minimized with a gradient descent algorithm to train the disease risk prediction model.
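The cross entropy over the edges of the decoding graph can be sketched as below. The sum form without averaging and the small clipping constant `eps` are assumptions made for the illustration.

```python
import math


def cross_entropy(edges, eps=1e-12):
    """Binary cross entropy summed over (p_ij, y_ij) pairs of the decoding graph."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in edges
    )


# One positive and one negative edge, both predicted with probability 0.5.
loss = cross_entropy([(0.5, 1), (0.5, 0)])
```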
Step 5: the disease risk prediction model is trained based on the data set of the first pages of the historical medical records.
To train the disease risk prediction model quickly, the negative samples used for training are sampled from the full negative sample set. The invention sets the ratio of positive to negative samples to 1:10; for example, if a patient has 3 positive samples, 30 negative samples are sampled.
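The 1:10 negative sampling for one patient can be sketched as follows. It assumes that candidate negatives are all diseases that are neither future positives nor historical diseases of the patient, in line with the decoding graph being the full bipartite graph minus the encoding graph; the function name and disease codes are hypothetical.

```python
import random


def sample_negatives(positives, all_diseases, history, ratio, rng):
    """Sample `ratio` negatives per positive, without replacement."""
    candidates = sorted(set(all_diseases) - set(positives) - set(history))
    k = min(len(candidates), ratio * len(positives))
    return rng.sample(candidates, k)


all_diseases = [f"D{n:02d}" for n in range(40)]
positives = ["D00", "D01", "D02"]  # diseases newly developed in the future N years
history = ["D03", "D04"]           # diseases already in the encoding graph
negatives = sample_negatives(positives, all_diseases, history, 10, random.Random(0))
```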
In the data set division stage, the data set is divided by patient into a training set, a validation set and a test set in the ratio 7:1:2. The training set is used for training the disease risk prediction model; the validation set is used for tuning the parameters of the model; the test set is used for evaluating the generalization effect of the model. During model inference, all candidate samples are tested to obtain the prediction probability of each disease, and the diseases are ranked by prediction probability to obtain the risk ranking of the different diseases.
The disease risk prediction model is trained in mini-batches: each time, a subset of nodes and their neighbors is sampled to train the network, so that the network can be trained on large-scale graph data and scales well. When new data needs to be predicted, the whole graph does not need to be retrained as in other graph neural network models; the prediction can be made using only the neighbor information of the nodes. The number of neighbors sampled per layer of the disease risk prediction model is 10. To optimize the model parameters, gradient descent with back propagation is used to optimize the parameters of the weight matrices, yielding a trained disease risk prediction model.
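A patient-level 7:1:2 split can be sketched as below. Splitting on patient identifiers keeps all records of one patient in the same subset; the function name and the integer-ratio arithmetic are illustrative choices.

```python
import random


def split_patients(patient_ids, rng, ratios=(7, 1, 2)):
    """Shuffle patient ids and split them into train, validation and test sets."""
    ids = list(patient_ids)
    rng.shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


train, val, test = split_patients(range(100), random.Random(0))
```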
And carrying out disease risk prediction on the new hospitalization record by adopting a trained disease risk prediction model:
referring to fig. 4, for a new hospitalization record, the individual information, hospitalization hospital information and historical disease diagnosis information of the patient can likewise be obtained, and the corresponding patient feature vector and disease feature vectors can be extracted. The patient is added to both the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph. Finally, the GADP model is used to obtain the patient's risk for each of the other diseases, the risks are sorted in descending order, and the top-N diseases are returned.
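The final ranking step can be sketched as follows, assuming the model has already produced a prediction probability for every disease and that diseases already in the patient's history are excluded from the returned list. The names and probabilities are made up for the illustration.

```python
def top_n_diseases(probabilities, history, n):
    """Return the n highest-risk diseases that the patient has not had before."""
    candidates = {d: p for d, p in probabilities.items() if d not in history}
    return sorted(candidates, key=lambda d: candidates[d], reverse=True)[:n]


predicted = {"I10": 0.9, "E11": 0.8, "J44": 0.3, "N18": 0.6}
ranking = top_n_diseases(predicted, history={"I10"}, n=2)
```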
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.
Claims (8)
1. The method for constructing the slow patient group disease risk prediction model based on the graph self-encoder is characterized by comprising the following steps of:
step 1: acquiring a data set of a first page of a historical medical record, preprocessing data in the data set, and storing the preprocessed historical case data into a storage space established by a storage medium;
step 2: dividing the preprocessed historical case data into a disease which the patient has historically and a disease which the patient has in the future based on a time sequence, constructing the disease which the patient has historically into a patient-disease coding bipartite graph, and constructing the disease which the patient has in the future N years into a patient-disease decoding bipartite graph;
step 3: invoking historical case data in the storage space, and extracting a patient feature vector and a disease feature vector based on the historical case data;
step 4: an encoder and a decoder are respectively established based on the patient-disease encoding bipartite graph and the patient-disease decoding bipartite graph, the encoder is a graph attention network, and a disease risk prediction model is established based on the encoder and the decoder, and the method specifically comprises the following steps:
step 4.1: establishing a heuristic feature extraction model;
step 4.2: establishing a neighbor sampling strategy;
step 4.3: using a graph attention network as an encoder, wherein the encoder comprises at least one graph convolution module, and a graph convolution layer of each graph convolution module learns weights of different neighbors by adopting a graph attention mechanism to obtain a final embedded vector expression;
step 4.4: constructing a bilinear decoder based on the patient-disease decoding bipartite graph, wherein the bilinear decoder predicts the existence probability of the edges in the patient-disease decoding bipartite graph from the known embedded vector expressions of the patients and diseases and the heuristic features of the edges;
step 5: the disease risk prediction model is trained based on the data set of the first pages of the historical medical records.
2. The method according to claim 1, wherein the preprocessing in step 1 is to discard the variables in the data set whose missing rate is greater than 30%, and, for the remaining variables with missing values, to fill the missing values with the mean of the non-missing values.
3. The method of constructing a model for predicting disease risk in a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein the edges in the patient-disease encoding bipartite graph represent the diseases that the patient has had historically, and the edge weights represent the number of occurrences of the disease; the patient-disease decoding bipartite graph comprises positive samples and negative samples, wherein a positive sample is a disease newly developed by the patient in the future N years, and a negative sample is a disease not newly developed by the patient in the future N years; the edges of the patient-disease decoding bipartite graph are obtained by subtracting the patient-disease encoding bipartite graph from the full bipartite graph; the patient-disease encoding bipartite graph is used by the encoder to automatically learn the embedded vector expressions of the patient nodes and the disease nodes and to extract the heuristic features, and the patient-disease decoding bipartite graph is used by the decoder to learn the occurrence probability of each edge.
4. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph-based self-encoder as claimed in claim 1, wherein the patient feature vector is extracted from individual information, hospitalization hospital information, the number of historical diseases and the ECI comorbidity index of the historical diseases; features of discrete type are one-hot encoded into 0-1 binary variables; features of numerical type are treated as continuous features with real values; and features that are discrete but whose values have an ordinal relationship are encoded as numerical features.
5. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph self-encoder as claimed in claim 1, wherein the disease feature vector is extracted by sorting the ICD-10 codes of the disease nodes in ascending order to obtain a serial number for each disease node, and then generating a vector for each disease node by one-hot encoding; and the prevalence of each disease is calculated and used as a feature characterizing how common the disease is.
6. The method for constructing a model for predicting disease risk of a group of slow patients based on a graph-based self-encoder as claimed in claim 1, wherein the step 4 comprises the steps of:
step 4.1: establishing a heuristic feature extraction model:
$$CN_{ij} = |\Gamma(i) \cap \hat{\Gamma}(j)|$$
$$AA_{ij} = \sum_{z \in \Gamma(i) \cap \hat{\Gamma}(j)} \frac{1}{\log |\Gamma(z)|}$$
$$JC_{ij} = \frac{|\Gamma(i) \cap \hat{\Gamma}(j)|}{|\Gamma(i) \cup \hat{\Gamma}(j)|}$$
$$PA_{ij} = |\Gamma(i)| \cdot |\Gamma(j)|$$
where $\Gamma(i)$, $\Gamma(j)$ and $\Gamma(z)$ are the sets of neighbor nodes of nodes i, j and z, respectively, node i being the central node; $|\cdot|$ is the size of a set; $\hat{\Gamma}(j)$ is the second-order neighbor set of node j; $CN_{ij}$ is the Common Neighbors index, $AA_{ij}$ the Adamic-Adar index, $JC_{ij}$ the Jaccard's Coefficient index and $PA_{ij}$ the Preferential Attachment index of the edge (i, j) of the patient-disease encoding bipartite graph; the larger the value of an index, the higher the occurrence probability of the edge;
step 4.2: establishing a neighbor sampling strategy:
$$\pi_{ij} = \frac{w_{ij}}{\sum_{u \in \Gamma(i)} w_{iu}}$$
where $w_{ij}$ and $\pi_{ij}$ are the weight and the sampling probability of the edge (i, j) of the patient-disease encoding bipartite graph, respectively, and $w_{iu}$ is the weight of the edge (i, u) of the patient-disease encoding bipartite graph; based on the sampling probability $\pi_{ij}$, the neighbors of the central node are sampled with replacement to obtain a fixed number of neighbor samples;
step 4.3: using a graph attention network as the encoder, wherein the encoder comprises at least one graph convolution module, and the graph convolution layer of each graph convolution module uses a graph attention mechanism to learn the weights of different neighbors and obtain the final embedded vector expression; denoting the features of the l-th layer of the encoder by $H^{(l)}$, the multi-head attention weight $\alpha_{c,ij}^{(l)}$ from node j to node i is calculated by the following formulas:
$$q_{c,i}^{(l)} = W_{c,q}^{(l)} h_i^{(l)} + b_{c,q}^{(l)}$$
$$k_{c,j}^{(l)} = W_{c,k}^{(l)} \left[ h_j^{(l)} \,\Vert\, w_{ij} \right] + b_{c,k}^{(l)}$$
$$\alpha_{c,ij}^{(l)} = \frac{\langle q_{c,i}^{(l)}, k_{c,j}^{(l)} \rangle}{\sum_{u} \langle q_{c,i}^{(l)}, k_{c,u}^{(l)} \rangle}, \qquad \langle q, k \rangle = \exp\left( \frac{q^{T} k}{\sqrt{d}} \right)$$
where $q_{c,i}^{(l)}$ is the query vector of the central node i at the c-th attention head in the l-th layer of the encoder; $W_{c,q}^{(l)}$ is the weight matrix of the query vector q at the c-th head in the l-th layer; $h_i^{(l)}$ is the embedded vector of the central node i in the l-th layer; $b_{c,q}^{(l)}$ is the bias term of the query vector q at the c-th head in the l-th layer; $k_{c,j}^{(l)}$ is the key vector of node j at the c-th head in the l-th layer; $W_{c,k}^{(l)}$ is the weight matrix of the key vector k at the c-th head in the l-th layer; $h_j^{(l)}$ is the embedded vector of node j in the l-th layer; $w_{ij}$ is the weight of the edge (i, j); $b_{c,k}^{(l)}$ is the bias term of the key vector k at the c-th head in the l-th layer; $\alpha_{c,ij}^{(l)}$ is the attention weight of the edge (i, j) at the c-th head in the l-th layer; $k_{c,u}^{(l)}$ is the key vector of node u at the c-th head in the l-th layer, the sum running over the neighbors u of node i; $\langle q, k \rangle$ is the exponentially scaled dot product of the vectors, and d is the dimension of the vectors;
after the multi-head attention weights are obtained, a message aggregation operation is performed on the embedded vectors of the different neighbors:
$$v_{c,j}^{(l)} = W_{c,v}^{(l)} h_j^{(l)} + b_{c,v}^{(l)}$$
$$\hat{h}_i^{(l)} = \Big\Vert_{c=1}^{C} \Big[ \sum_{j} \alpha_{c,ij}^{(l)} v_{c,j}^{(l)} \Big]$$
where $v_{c,j}^{(l)}$ is the value vector of node j at the c-th attention head in the l-th layer of the encoder; $W_{c,v}^{(l)}$ is the weight matrix of the value vector v at the c-th head in the l-th layer; $b_{c,v}^{(l)}$ is the bias term of the value vector v at the c-th head in the l-th layer; $\hat{h}_i^{(l)}$ is the attention vector of the central node i in the l-th layer; C is the total number of attention heads; $\Vert$ is the vector splicing operation;
the embedded vector $h_i^{(l)}$ of the central node i is combined with $\hat{h}_i^{(l)}$, and a gated residual mechanism is used to selectively control the inflow of information, thereby calculating the embedded vector expression $h_i^{(l+1)}$ of the next layer; the specific calculation formula is as follows:
$$r_i^{(l)} = W_r^{(l)} h_i^{(l)} + b_r^{(l)}$$
$$\beta_i^{(l)} = \mathrm{sigmoid}\left( W_g^{(l)} \left[ \hat{h}_i^{(l)} \,\Vert\, r_i^{(l)} \,\Vert\, \hat{h}_i^{(l)} - r_i^{(l)} \right] \right)$$
$$h_i^{(l+1)} = \mathrm{ReLU}\left( \mathrm{LayerNorm}\left( (1 - \beta_i^{(l)}) \hat{h}_i^{(l)} + \beta_i^{(l)} r_i^{(l)} \right) \right)$$
where $r_i^{(l)}$ is the information of the central node i in the l-th layer of the encoder; $W_r^{(l)}$ is the weight matrix of the central node i in the l-th layer; $b_r^{(l)}$ is the bias term of the central node i in the l-th layer; $\beta_i^{(l)}$ is the weight of the gated residual of the central node i in the l-th layer; $\hat{h}_i^{(l)}$, $r_i^{(l)}$ and $\hat{h}_i^{(l)} - r_i^{(l)}$ are spliced in sequence, linearly transformed by the weight matrix $W_g^{(l)}$, and mapped to the interval from 0 to 1 by a sigmoid function, so as to control the inflow of information from $r_i^{(l)}$ and $\hat{h}_i^{(l)}$; finally, the embedded vector representation $h_i^{(l+1)}$ of the central node i in layer l+1 is obtained through LayerNorm and the ReLU activation function;
Step 4.4: constructing a bilinear decoder, wherein, given the embedded vector expressions of a patient and a disease, the bilinear decoder predicts the existence probability of the corresponding edge in the patient-disease decoding bipartite graph, and the calculation formula is as follows:
$$z_{ij} = \left[ h_i^{T} W_1 h_j \,\Vert\, \cdots \,\Vert\, h_i^{T} W_B h_j \,\Vert\, CN_{ij} \,\Vert\, AA_{ij} \,\Vert\, JC_{ij} \,\Vert\, PA_{ij} \right]$$
$$p_{ij} = \mathrm{sigmoid}\left( W_o z_{ij} + b_o \right)$$
where $CN_{ij}$, $AA_{ij}$, $JC_{ij}$ and $PA_{ij}$ are the indices corresponding to the edge (i, j) of the patient-disease encoding bipartite graph and are used as heuristic features; $h_i^{T}$ is the transpose of the embedded vector of node i, and $h_j$ is the embedded vector of node j; following the multi-head attention mechanism, several weight matrices $W_b$ (b = 1, ..., B) are used to learn the interaction between $h_i^{T}$ and $h_j$; the learned results are spliced together and then spliced with the heuristic features to form the hidden layer feature expression $z_{ij}$ of the edge; finally, $z_{ij}$ is linearly transformed by the weight matrix $W_o$ and the bias term $b_o$ is added to obtain the result of the output layer, and a sigmoid activation function gives the prediction probability $p_{ij}$ of the edge (i, j);
The loss function is the cross entropy, calculated as follows:
$$Loss = -\sum_{e_{ij} \in G_{dec}} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \right]$$
where $G_{dec}$ is the decoding graph, $e_{ij}$ is the edge (i, j), and $y_{ij}$ is the label of the edge; the Loss of the model is minimized with a gradient descent algorithm to train the disease risk prediction model.
7. The method for constructing a slow patient group disease risk prediction model based on a graph-based self-encoder according to claim 1, wherein the preprocessed data set is divided into a training set, a validation set and a test set according to a ratio of 7:1:2; the training set is used for training the disease risk prediction model, the verification set is used for optimizing parameters of the disease risk prediction model, and the test set is used for evaluating the generalization effect of the disease risk prediction model.
8. The method of constructing a disease risk prediction model for a slow patient group based on a graph-based self-encoder according to claim 1, wherein all negative samples in the dataset are acquired, a negative sample set is formed, the negative sample set is sampled, negative samples for training the disease risk prediction model are generated, and the ratio of the positive samples to the negative samples is set to 1:10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210507317.9A CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114783608A CN114783608A (en) | 2022-07-22 |
CN114783608B true CN114783608B (en) | 2023-05-05 |
Family
ID=82436498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210507317.9A Active CN114783608B (en) | 2022-05-10 | 2022-05-10 | Construction method of slow patient group disease risk prediction model based on graph self-encoder |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783608B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||