CN115083616A

CN115083616A - Chronic nephropathy subtype mining system based on self-supervision graph clustering

Info

Publication number: CN115083616A
Application number: CN202210980822.5A
Authority: CN
Inventors: 李劲松; 池胜强; 徐铭鸿; 李雪瑶; 田雨; 周天舒
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2022-08-16
Filing date: 2022-08-16
Publication date: 2022-09-20
Anticipated expiration: 2042-08-16
Also published as: JP7404581B1; CN115083616B; JP2024027086A

Abstract

The invention discloses a chronic kidney disease subtype mining system based on self-supervision graph clustering, which comprises: a data acquisition module, used for collecting the structured data in chronic kidney disease diagnosis and treatment records; a data extraction and preprocessing module, used for extracting and preprocessing the structured data to obtain an entity set and a visit set; a chronic kidney disease subtype mining module, used for constructing a chronic kidney disease subtype mining model from the entity set and the visit set; a chronic kidney disease phenotype subtype evaluation module, used for evaluating the chronic kidney disease subtype mining model; and a chronic kidney disease subtype prediction module, used for predicting the chronic kidney disease subtype from the structured data of a patient. The invention solves the problem that process mining methods cannot handle the coexistence, in longitudinal electronic medical record data, of multi-granularity information such as event information within a single visit and event information across multiple visits.

Description

Chronic nephropathy subtype mining system based on self-supervision graph clustering

Technical Field

The invention relates to the technical field of medical health information, in particular to a chronic kidney disease subtype mining system based on self-supervision graph clustering.

Background

According to clinical guidelines, chronic kidney disease is graded based on the patient's estimated glomerular filtration rate (eGFR) and urinary albumin-creatinine ratio (UACR). While eGFR and UACR can be used for screening and monitoring chronic kidney disease, phenotypic differences between individuals with chronic kidney disease cannot be characterized by eGFR and UACR alone. Chronic kidney disease is a highly heterogeneous disease, closely related to systemic diseases and conditions such as diabetes, hypertension, autoimmune diseases, genetic predisposition, and congenital abnormalities. There are significant differences between individuals with chronic kidney disease, which can be described by phenotypic data such as laboratory tests, medical history, medication history, and social factors. Differences in patients' initial phenotypes also lead to differences in their diagnosis and treatment processes and complications. A rational phenotypic classification of chronic kidney disease should distinguish between patient subpopulations and reveal the disease characteristics and underlying pathology of each subpopulation, thereby helping to better understand the different mechanisms of disease onset and progression.

Existing methods for classifying chronic kidney disease subtypes are mainly based on cluster analysis of patients' initial static phenotype data. These methods mainly use multidimensional data collected at the start of a study, such as patient demographics, biomarkers, and clinical characteristics, and mine the phenotype classification of chronic kidney disease patients with common clustering algorithms such as hierarchical clustering and consensus clustering. However, chronic kidney disease has a long course and many complications, so the diagnosis and treatment process differs greatly among patients. The clinical process data may contain important information for distinguishing different phenotypes of chronic kidney disease patients. From the diagnosis and treatment process data collected and stored in an electronic medical record system, event information such as operations, examinations, laboratory tests, and medication for a specific patient can be extracted, together with the times at which the events occurred. Clustering patients on their diagnosis and treatment process data and studying their disease phenotype patterns is of great significance for identifying and characterizing different patient subgroups. Commonly used methods for mining disease diagnosis and treatment process data include: (1) Process mining methods: information is extracted from the event logs generated during a patient's diagnosis and treatment and arranged in chronological order to form diagnosis and treatment event sequences. Different patterns in these sequences are then mined as different clinical pathways of the disease, thereby classifying the patient's disease phenotype. Such methods have difficulty using the co-occurrence information among events, and cannot handle the association and ordering relationships among events in the multi-visit data of longitudinal electronic medical records. The mined diagnosis and treatment processes are complex, and their representativeness and coverage are poor. (2) Tensor decomposition-based methods: the three dimensions of patient, time, and phenotype are combined into a third-order tensor, which is decomposed to mine the latent phenotype classification of patients. Such methods only consider disease phenotype transitions between consecutive visits and cannot handle phenotype evolution information over a long-range diagnosis and treatment process.

Therefore, we propose a chronic kidney disease subtype mining system based on the self-supervision graph clustering to solve the above technical problem.

Disclosure of Invention

In order to solve the technical problems, the invention provides a chronic kidney disease subtype mining system based on self-supervision graph clustering.

The technical scheme adopted by the invention is as follows:

A chronic kidney disease subtype mining system based on self-supervision graph clustering comprises:

a data acquisition module: used for collecting the structured data in chronic kidney disease diagnosis and treatment records;

a data extraction and preprocessing module: used for extracting and preprocessing the structured data to obtain an entity set and a visit set;

a chronic kidney disease subtype mining module: used for constructing a chronic kidney disease subtype mining model from the entity set and the visit set;

a chronic kidney disease phenotype subtype evaluation module: used for evaluating the chronic kidney disease subtype mining model;

a chronic kidney disease subtype prediction module: used for predicting the chronic kidney disease subtype from the structured data of a patient.

Further, the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and/or medication data during an observation window.

Further, the data extraction and preprocessing module is specifically configured to extract the structured data in the chronic kidney disease diagnosis and treatment records of the electronic medical record system and to preprocess the extracted structured data, where the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and medication data during the observation window. For the laboratory test data, only test items that are abnormal with respect to the normal reference range are considered; the result of each abnormal test item is classified as either low or high, and the name of the abnormal test item and its abnormal category are retained. Medical examination and surgery data are processed with simple natural language processing techniques, and the examined body part, the examination type, and the surgery name are retained. For the medication data, only the use of six classes of drugs is considered, namely antihyperglycemic drugs, antihypertensive drugs, lipid-regulating drugs, non-steroidal anti-inflammatory drugs, antiplatelet drugs, and steroids; the drugs in the medication data are classified into these six classes, and the drug class is retained. A diagnosis set, a medication set, an operation set, and a test set are obtained, together with the number of diagnosis types, the number of medication types, the number of operation types, the number of test types, and the number of visit records; the diagnosis set, the medication set, the operation set, and the test set are combined to form the entity set, and the visit records of the patients are combined to form the visit set.
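The laboratory test preprocessing described above can be illustrated by a minimal sketch. The function names, the entity label format, and the reference ranges below are hypothetical and chosen only for illustration; they are not taken from the invention and are not clinical reference values.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical reference ranges (low, high), for illustration only.
REFERENCE_RANGES: Dict[str, Tuple[float, float]] = {
    "serum_creatinine": (0.6, 1.3),
    "hemoglobin": (12.0, 16.0),
}


def categorize_lab_result(name: str, value: float,
                          low: float, high: float) -> Optional[str]:
    """Return a test entity label for an abnormal item, or None if in range."""
    if value < low:
        return f"{name}:low"
    if value > high:
        return f"{name}:high"
    return None  # normal results are not kept as entities


def extract_test_entities(results: Dict[str, float]) -> List[str]:
    """Keep only abnormal test items, labeled with name and abnormal category."""
    entities = []
    for name, value in results.items():
        low, high = REFERENCE_RANGES[name]
        label = categorize_lab_result(name, value, low, high)
        if label is not None:
            entities.append(label)
    return entities


visit_tests = {"serum_creatinine": 2.1, "hemoglobin": 13.0}
test_entities = extract_test_entities(visit_tests)
```

In this sketch each retained entity carries both the test item name and its abnormal category, so that a low and a high result of the same item become distinct members of the test set.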

Further, the chronic kidney disease subtype mining module specifically comprises:

a visit network construction unit: used for constructing a visit network using the visit set and the entity set;

an embedded representation construction unit: used for constructing an entity co-occurrence matrix using the entity set and obtaining an entity node initial embedded representation and a visit node initial embedded representation through the entity co-occurrence matrix, the entity node initial embedded representation and the visit node initial embedded representation together forming the node initial embedded representation;

a clustering network construction unit: used for constructing an adjacency matrix from the relationships among nodes in the visit network, and training a visit node clustering network model based on self-supervision graph clustering using the adjacency matrix and the node initial embedded representation;

a chronic kidney disease subtype mining model construction unit: used for constructing the chronic kidney disease subtype mining model through the visit node clustering network model based on self-supervision graph clustering.

Further, the visit network construction unit is specifically:

used for forming a node set from the visit set and the entity set;

used for constructing an edge set from the node co-occurrence relationships in the node set;

used for constructing the visit network using the node set and the edge set.

Further, the embedded representation construction unit is specifically:

used for constructing an entity co-occurrence matrix using the entity set;

used for calculating the initial embedded representation of each entity node with the GloVe algorithm, based on the entity co-occurrence matrix;

used for obtaining the visit node initial embedded representation by calculating the average of the entity node initial embedded representations of all entity nodes adjacent to the visit node, the visit node initial embedded representation and the entity node initial embedded representation together forming the node initial embedded representation.

Further, the clustering network construction unit is specifically:

used for constructing an adjacency matrix from the relationships among the nodes in the visit network, and inputting the adjacency matrix and the node initial embedded representation into the visit node clustering network model based on self-supervision graph clustering for graph attention training to obtain a node embedded representation, where the node embedded representation comprises a visit node embedded representation and an entity node embedded representation;

used for reconstructing the visit network using the node embedded representation and calculating a visit network reconstruction error;

used for inputting the entity node embedded representation into a decoder of a neural network for training, taking the output of the last layer of the decoder as the entity node reconstructed embedded representation, and calculating an entity node reconstruction error;

used for performing a softmax regression operation on the visit node embedded representation to obtain the probability distribution of the visit nodes, and calculating a clustering loss from the probability distribution of the visit nodes;

used for constructing the overall loss function of the visit node clustering network model based on self-supervision graph clustering from the visit network reconstruction error, the entity node reconstruction error, and the clustering loss.

Further, the chronic kidney disease subtype mining model construction unit is specifically:

used for taking the visit node cluster distribution obtained by the visit node clustering network model based on self-supervision graph clustering as the category distribution of the visit nodes, selecting the category with the highest probability in the category distribution as the category label of each visit node, and arranging all the visit nodes of each patient in chronological order;

used for constructing an event matrix by arranging the visit nodes;

used for searching for frequent events to determine nodes: the frequent events are used as nodes in an event flow, and the remaining events go directly to an end node; each of the frequent events is used as the starting node of the next round of search, the corresponding event vectors are extracted and combined into a new event matrix, and the same frequent event search is performed after the first column is removed; the nodes obtained in each round of search are connected to the starting node to extend the event flow, until the frequent events are empty or the event flow length reaches the maximum event flow length; the chronic kidney disease subtype mining model is obtained when the loop ends.

Further, the chronic kidney disease subtype prediction module is specifically:

used for inputting the preprocessed structured data of a patient into the visit node clustering network model based on self-supervision graph clustering for prediction to obtain the probability distribution of the patient's visit nodes;

used for determining the cluster category of the visit nodes from the probability distribution of the visit nodes, and constructing a visit event sequence;

used for inputting the visit event sequence into the chronic kidney disease subtype mining model, fitting the nodes in the chronic kidney disease subtype mining model in sequence order to obtain an event flow, and determining from the event flow which chronic kidney disease subtype the patient belongs to.

The invention has the following beneficial effects: the invention provides a chronic kidney disease subtype mining system based on self-supervision graph clustering. First, the longitudinal electronic medical record data of a patient's multiple visits is constructed into a visit network, which contains multi-dimensional diagnosis and treatment event information such as visits, diagnoses, laboratory tests, medical examinations, surgeries, and medication. Second, vector representations of the diagnosis and treatment events are obtained from their co-occurrence information. The visit events are then clustered with the visit node clustering network model based on self-supervision graph clustering, and each visit event is labeled. Next, at the visit level, the patients' diagnosis and treatment pathways are mined to obtain the different subtypes of the chronic kidney disease phenotype. Finally, a phenotype subtype evaluation method is provided to assess whether clinically interpretable differences exist among the mined subtypes, using a series of comprehensive indicators including patient demographics, medication, complications, and survival rates.

Event information within each visit, such as diagnoses, laboratory tests, medical examinations, surgeries, and medication, is used to train the visit node clustering network model based on self-supervision graph clustering to obtain a category label for each visit; in this process, low-level, fine-grained information is aggregated into high-level, coarse-grained summary information. The category labels of the visits are then used for diagnosis and treatment pathway mining, which solves the problem that process mining methods cannot handle the coexistence, in longitudinal electronic medical record data, of multi-granularity information such as event information within a single visit and event information across multiple visits.

Event vector representations are obtained from co-occurrence information and used in the graph model, which effectively solves the problem that process mining methods have difficulty using event co-occurrence information, and enables comprehensive mining of disease features using cross-sectional and longitudinal electronic medical record data together.

The self-supervision graph clustering algorithm provided by the invention incorporates the information of a patient's multiple visits into the visit node clustering network model based on self-supervision graph clustering and trains the embedded representations of the nodes, so that phenotype evolution information over a long-range diagnosis and treatment process can be handled. The different nodes and relationships in the visit network are then supervised and learned separately: the reconstruction error of the nodes is calculated with the L2 norm, based on the decoder's reconstruction of the embedded representations of the lower-level nodes; the reconstruction error of the graph relationships is calculated with cross entropy; and the clustering error of the visit nodes is calculated with the KL divergence.

Based on the similarity of the event label distributions of the visit nodes, similar adjacent events are merged, which optimizes the process mining method, simplifies the mined diagnosis and treatment processes, and improves their representativeness and coverage.

Drawings

FIG. 1 is a schematic structural diagram of a chronic kidney disease subtype mining system based on self-supervision graph clustering according to the present invention;

FIG. 2 is a functional flow diagram of a chronic kidney disease subtype mining system based on self-supervision graph clustering according to the present invention;

FIG. 3 is a visit network according to an embodiment of the present invention;

FIG. 4 is a co-occurrence matrix according to an embodiment of the present invention;

FIG. 5 is a structural diagram of the visit node clustering network model based on self-supervision graph clustering according to an embodiment of the present invention.

Detailed Description

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a chronic kidney disease subtype mining system based on self-supervision graph clustering comprises:

a data acquisition module: used for collecting the structured data in chronic kidney disease diagnosis and treatment records;

a chronic kidney disease phenotype subtype evaluation module: used for evaluating the chronic kidney disease subtype mining model;

Referring to FIG. 2, the functional flow of the chronic kidney disease subtype mining system based on self-supervision graph clustering comprises the following steps:

Step S1: collecting, through the data acquisition module, the structured data in chronic kidney disease diagnosis and treatment records to construct a data set; the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and/or medication data during an observation window;

Step S2: preprocessing the structured data through the data extraction and preprocessing module to obtain a visit set and an entity set. The structured data in the chronic kidney disease diagnosis and treatment records of the electronic medical record system is extracted, where the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and medication data during the observation window, and the extracted structured data is preprocessed. For the laboratory test data, only test items that are abnormal with respect to the normal reference range are considered; the result of each abnormal test item is classified as either low or high, and the name of the abnormal test item and its abnormal category are retained. Medical examination and surgery data are processed with simple natural language processing techniques, and the examined body part, the examination type, and the surgery name are retained. For the medication data, only the use of six classes of drugs is considered, namely antihyperglycemic drugs, antihypertensive drugs, lipid-regulating drugs, non-steroidal anti-inflammatory drugs, antiplatelet drugs, and steroids; the drugs in the medication data are classified into these six classes, and the drug class is retained. A diagnosis set, a medication set, an operation set, and a test set are obtained, together with the number of diagnosis types, the number of medication types, the number of operation types, the number of test types, and the number of visit records; the diagnosis set, the medication set, the operation set, and the test set are combined to form the entity set, and the visit records of the patients are combined to form the visit set.

Step S3: inputting the visit set and the entity set into the chronic kidney disease subtype mining module, and constructing a chronic kidney disease subtype mining model through the chronic kidney disease subtype mining module;

Step S31: constructing a visit network using the visit set and the entity set;

Step S311: forming a node set from the visit set and the entity set;

Step S312: constructing an edge set from the node co-occurrence relationships in the node set;

Step S313: constructing the visit network using the node set and the edge set.

Step S32: constructing an entity co-occurrence matrix using the entity set, obtaining an entity node initial embedded representation and a visit node initial embedded representation through the entity co-occurrence matrix, and forming the node initial embedded representation from the entity node initial embedded representation and the visit node initial embedded representation;

Step S321: constructing an entity co-occurrence matrix using the entity set;

Step S322: calculating the initial embedded representation of each entity node with the GloVe algorithm, based on the entity co-occurrence matrix;

Step S323: obtaining the visit node initial embedded representation by calculating the average of the entity node initial embedded representations of all entity nodes adjacent to the visit node, where the visit node initial embedded representation and the entity node initial embedded representation together form the node initial embedded representation.
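The averaging in step S323 can be illustrated by a minimal sketch, assuming the entity node initial embedded representations have already been computed (for example by GloVe in step S322). The function name and the toy two-dimensional vectors are hypothetical; the text elsewhere uses 128-dimensional vectors.

```python
from typing import Dict, List


def visit_initial_embedding(entity_ids: List[str],
                            entity_embeddings: Dict[str, List[float]]) -> List[float]:
    """Average the initial embeddings of all entity nodes adjacent to one visit node."""
    if not entity_ids:
        raise ValueError("a visit node needs at least one adjacent entity node")
    vectors = [entity_embeddings[e] for e in entity_ids]
    dim = len(vectors[0])
    # Component-wise mean over the adjacent entity nodes.
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy embeddings, for illustration only.
entity_embeddings = {
    "hypothyroidism": [1.0, 2.0],
    "levothyroxine": [3.0, 4.0],
}
visit_vector = visit_initial_embedding(
    ["hypothyroidism", "levothyroxine"], entity_embeddings
)
```

Because a visit node is adjacent exactly to the entities recorded in that visit, its initial representation lies in the same vector space as the entity representations, so both can be stacked into one node initial embedded representation.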

Step S33: constructing an adjacency matrix from the relationships among nodes in the visit network, and training the visit node clustering network model based on self-supervision graph clustering using the adjacency matrix and the node initial embedded representation;

Step S331: constructing an adjacency matrix from the relationships among the nodes in the visit network, and inputting the adjacency matrix and the node initial embedded representation into the visit node clustering network model based on self-supervision graph clustering for graph attention training to obtain a node embedded representation, where the node embedded representation comprises a visit node embedded representation and an entity node embedded representation;

Step S332: reconstructing the visit network using the node embedded representation, and calculating a visit network reconstruction error;

Step S333: inputting the entity node embedded representation into a decoder of a neural network for training, taking the output of the last layer of the decoder as the entity node reconstructed embedded representation, and calculating an entity node reconstruction error;

Step S334: performing a softmax regression operation on the visit node embedded representation to obtain the probability distribution of the visit nodes, and calculating a clustering loss from the probability distribution of the visit nodes;

Step S335: constructing the overall loss function of the visit node clustering network model based on self-supervision graph clustering from the visit network reconstruction error, the entity node reconstruction error, and the clustering loss.
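How the three terms of steps S332 to S335 might be combined can be illustrated by a minimal sketch. The text states only that cross entropy, the L2 norm, and the KL divergence are used; the inner-product graph decoder, the sharpened target distribution for the KL term, and the weights `alpha`, `beta`, and `gamma` are assumptions introduced here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax of one visit node's scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def graph_cross_entropy(A, Z):
    """Cross entropy between adjacency A and sigmoid(z_i . z_j) (assumed decoder)."""
    n, eps, total = len(A), 1e-12, 0.0
    for i in range(n):
        for j in range(n):
            score = sum(a * b for a, b in zip(Z[i], Z[j]))
            p = 1.0 / (1.0 + math.exp(-score))
            total -= A[i][j] * math.log(p + eps) + (1 - A[i][j]) * math.log(1 - p + eps)
    return total / (n * n)


def l2_reconstruction_error(X, X_hat):
    """Mean squared L2 distance between entity embeddings and their reconstructions."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, xh))
               for x, xh in zip(X, X_hat)) / len(X)


def target_distribution(Q):
    """Assumed sharpened target: p_ik proportional to q_ik^2 / sum_i q_ik."""
    col = [sum(q[k] for q in Q) for k in range(len(Q[0]))]
    P = []
    for q in Q:
        w = [q[k] ** 2 / col[k] for k in range(len(q))]
        s = sum(w)
        P.append([x / s for x in w])
    return P


def kl_divergence(P, Q):
    """KL(P || Q) summed over all visit nodes."""
    return sum(p * math.log(p / q)
               for pr, qr in zip(P, Q) for p, q in zip(pr, qr) if p > 0)


def overall_loss(A, Z, X_entity, X_entity_hat, visit_scores,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    Q = [softmax(r) for r in visit_scores]
    P = target_distribution(Q)
    return (alpha * graph_cross_entropy(A, Z)
            + beta * l2_reconstruction_error(X_entity, X_entity_hat)
            + gamma * kl_divergence(P, Q))


A = [[0, 1], [1, 0]]                     # toy adjacency of two nodes
Z = [[0.5, 0.1], [0.4, 0.2]]             # toy node embedded representations
X_entity = [[0.1, -0.1]]
X_entity_hat = [[0.1, -0.1]]             # perfect reconstruction in this toy case
visit_scores = [[2.0, 0.5], [0.3, 1.2]]
loss = overall_loss(A, Z, X_entity, X_entity_hat, visit_scores)
```

The sketch only evaluates the loss on fixed toy values; in the described system the embeddings would come from graph attention training and the loss would be minimized by gradient-based optimization.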

Step S34: constructing the chronic kidney disease subtype mining model through the visit node clustering network model based on self-supervision graph clustering.

Step S341: taking the visit node cluster distribution obtained by the visit node clustering network model based on self-supervision graph clustering as the category distribution of the visit nodes, selecting the category with the highest probability in the category distribution as the category label of each visit node, and arranging all the visit nodes of each patient in chronological order;

Step S342: for consecutive visit nodes that have the same category label, calculating the cosine similarity between their category distributions to decide whether to merge them or keep them separate, and constructing an event matrix by arranging the visit nodes;

Step S343: searching for frequent events to determine nodes, and connecting the visit nodes in sequence to form an event flow: starting from the first column of the event matrix, the events in each column whose frequency of occurrence is greater than a threshold are selected as frequent events and used as nodes in the event flow, and the remaining events go directly to an end node; each of the frequent events is used as the starting node of the next round of search, the corresponding event vectors are extracted and combined into a new event matrix, and the same frequent event search is performed after the first column is removed; the nodes obtained in each round of search are connected to the starting node to extend the event flow, until the frequent events are empty or the event flow length reaches the maximum event flow length; the chronic kidney disease subtype mining model is obtained when the loop ends.
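Steps S342 and S343 can be illustrated by a minimal sketch. It assumes that the frequency threshold is an absolute count (`min_count`), that consecutive visits are merged when their cosine similarity reaches a hypothetical `threshold`, and that an event flow is represented as a tuple of events; none of these choices is specified in the text.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two category distributions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def merge_visits(labels, distributions, threshold=0.9):
    """Merge consecutive visits with the same label and similar distributions."""
    merged = [labels[0]]
    for i in range(1, len(labels)):
        same_label = labels[i] == labels[i - 1]
        if same_label and cosine(distributions[i - 1], distributions[i]) >= threshold:
            continue  # merged into the previous event
        merged.append(labels[i])
    return merged


def mine_event_flows(sequences, min_count, max_length):
    """Column-wise frequent event search over the event matrix (one row per patient)."""
    flows = []

    def search(rows, prefix):
        if len(prefix) >= max_length:
            flows.append(prefix)
            return
        first_column = Counter(row[0] for row in rows if row)
        frequent = sorted(e for e, c in first_column.items() if c > min_count)
        if not frequent:
            if prefix:
                flows.append(prefix)  # remaining events go to the end node
            return
        for event in frequent:
            # Keep the rows starting with this event and drop the first column.
            next_rows = [row[1:] for row in rows if row and row[0] == event]
            search(next_rows, prefix + (event,))

    search([tuple(s) for s in sequences], ())
    return flows


merged = merge_visits(
    [0, 0, 1, 1],
    [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9]],
)
sequences = [["A", "B", "C"], ["A", "B", "D"], ["A", "B", "C"], ["A", "C"], ["E", "B"]]
flows = mine_event_flows(sequences, min_count=1, max_length=3)
```

In this toy run the first two visits are merged because their distributions are nearly parallel, while the last two are kept separate; the only event flow whose every step occurs more than once is A, B, C.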

Step S4: evaluating the chronic kidney disease subtype mining model through the chronic kidney disease phenotype subtype evaluation module;

Step S5: predicting the chronic kidney disease subtype from the structured data of a patient through the chronic kidney disease subtype prediction module;

Step S51: preprocessing the structured data of the patient and inputting it into the visit node clustering network model based on self-supervision graph clustering for prediction, to obtain the probability distribution of the patient's visit nodes;

Step S52: determining the cluster category of the visit nodes from the probability distribution of the visit nodes, and constructing a visit event sequence;

Step S53: inputting the visit event sequence into the chronic kidney disease subtype mining model, fitting the nodes in the chronic kidney disease subtype mining model in sequence order to obtain an event flow, and determining from the event flow which chronic kidney disease subtype the patient belongs to.
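Steps S52 and S53 can be illustrated by a minimal sketch. Selecting the most probable category follows the text; matching the visit event sequence to the mined event flow with the longest common prefix is an assumption made here, since the text does not specify how the sequence is fitted. All names are hypothetical.

```python
def visit_event_sequence(visit_probabilities):
    """Step S52: take the most probable cluster category of each visit, in time order."""
    return [max(range(len(p)), key=p.__getitem__) for p in visit_probabilities]


def match_event_flow(sequence, flows):
    """Step S53 (assumed rule): pick the mined flow sharing the longest prefix."""
    best, best_len = None, 0
    for flow in flows:
        n = 0
        for a, b in zip(sequence, flow):
            if a != b:
                break
            n += 1
        if n > best_len:
            best, best_len = flow, n
    return best  # None if no flow starts with the patient's first event


# Toy probability distributions of three visits over three clusters.
probabilities = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.6, 0.2]]
sequence = visit_event_sequence(probabilities)

# Toy event flows standing in for the mined subtype model.
mined_flows = [(0, 1, 2), (0, 2), (1,)]
subtype_flow = match_event_flow(sequence, mined_flows)
```

Each mined event flow stands for one subtype in this sketch, so the matched flow identifies the predicted subtype of the patient.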

Example:

a data acquisition module: used for collecting the structured data in chronic kidney disease diagnosis and treatment records to construct a data set; the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and/or medication data during an observation window;

a data extraction and preprocessing module: used for extracting and preprocessing the structured data to obtain a visit set and an entity set. The data extraction and preprocessing module is specifically used for extracting the structured data in the chronic kidney disease diagnosis and treatment records of the electronic medical record system, where the structured data includes basic information of the patient, visit records, and diagnosis, laboratory test, medical examination, surgery, and medication data during the observation window, and for preprocessing the extracted structured data. For the laboratory test data, only test items that are abnormal with respect to the normal reference range are considered; the result of each abnormal test item is classified as either low or high, and the name of the abnormal test item and its abnormal category are retained. Medical examination and surgery data are processed with simple natural language processing techniques, and the examined body part, the examination type, and the surgery name are retained. For the medication data, only the use of six classes of drugs is considered, namely antihyperglycemic drugs, antihypertensive drugs, lipid-regulating drugs, non-steroidal anti-inflammatory drugs, antiplatelet drugs, and steroids; the drugs in the medication data are classified into these six classes, and the drug class is retained. A diagnosis set, a medication set, an operation set, and a test set are obtained, together with the number of diagnosis types, the number of medication types, the number of operation types, the number of test types, and the number of visit records; the diagnosis set, the medication set, the operation set, and the test set are combined to form the entity set, and the visit records of the patients are combined to form the visit set.

a chronic kidney disease subtype mining module: used for taking the visit set and the entity set as input and constructing a chronic kidney disease subtype mining model;

The visit set is $V = \{v_1, v_2, \dots, v_N\}$, where $N$ indicates the number of visits. $D$, $M$, $O$, and $T$ are respectively the diagnosis set, the medication set, the operation set, and the test set, $D = \{d_1, \dots, d_{N_D}\}$, $M = \{m_1, \dots, m_{N_M}\}$, $O = \{o_1, \dots, o_{N_O}\}$, $T = \{t_1, \dots, t_{N_T}\}$, where $N_D$, $N_M$, $N_O$, and $N_T$ respectively represent the number of diagnosis types, the number of medication types, the number of operation types, and the number of test types. Together they compose the entity set $E = D \cup M \cup O \cup T$; the number of entity types is $N_E = N_D + N_M + N_O + N_T$.

The entity set and the visit set form the node set $U = V \cup E$, with number of nodes $N + N_E$;

the entities present in the same visit $v_i$ ($v_i \in V$) constitute an entity subset $E_i \subseteq E$, and $n_i = |E_i|$ represents the number of entities in the entity subset $E_i$. Each entity subset and its corresponding visit form a visit connected subset $C_i = \{v_i\} \cup E_i$. A visit connected subset comprises one visit node and all entity nodes in that visit; all nodes in a visit connected subset have a co-occurrence relationship and are connected pairwise to form an edge subset; all the edge subsets together form the edge set $R = \bigcup_{i=1}^{N} \{(a, b) : a, b \in C_i,\ a \neq b\}$;

the visit network $G = (U, R)$ is constructed using the node set and the edge set.

Referring to FIG. 3, in visit $v_1$ the physician makes two diagnoses, goiter ($d_1$) and thyroid nodule ($d_2$), performs a partial thyroidectomy ($o_1$), and prescribes levothyroxine sodium tablets ($m_1$). $\{v_1, d_1, d_2, o_1, m_1\}$ then forms a visit connected subset, and these 5 nodes are connected pairwise in the visit network. In visit $v_2$, after a TSH measurement ($t_1$), the physician diagnoses hypothyroidism ($d_3$) and prescribes levothyroxine sodium tablets ($m_1$). $\{v_2, t_1, d_3, m_1\}$ is likewise a visit connected subset, and these 4 nodes are connected pairwise in the visit network. Because $m_1$ appears in both $v_1$ and $v_2$, $m_1$ is connected in the visit network to the other nodes of both visit connected subsets.
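The FIG. 3 example can be reproduced with a short sketch. The function name `build_visit_network` and the string identifiers are hypothetical; the only logic taken from the text is that each visit node and the entities recorded in that visit are connected pairwise.

```python
from itertools import combinations


def build_visit_network(visits):
    """visits: dict mapping a visit id to the entity ids recorded in that visit."""
    nodes = set()
    edges = set()
    for visit_id, entities in visits.items():
        subset = {visit_id, *entities}  # one visit node plus its entity nodes
        nodes |= subset
        # Connect every pair inside the subset; sorting makes edges undirected.
        for a, b in combinations(sorted(subset), 2):
            edges.add((a, b))
    return nodes, edges


visits = {
    "v1": ["goiter", "thyroid_nodule", "partial_thyroidectomy", "levothyroxine"],
    "v2": ["tsh_test", "hypothyroidism", "levothyroxine"],
}
nodes, edges = build_visit_network(visits)
```

Because the levothyroxine node belongs to both subsets, it ends up adjacent to every other node of the two visits, which is what links the two visits in the network.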

used for constructing an entity co-occurrence matrix from the entity set;

Using the entity set $E$, the entity co-occurrence matrix $X$ is constructed. Referring to FIG. 4, $X$ has dimension $N_E \times N_E$, where $N_E$ is the number of entity types; each row and each column represents one entity in $E$, and $X_{ij}$ represents the co-occurrence information of entity $e_i$ and entity $e_j$. $X_{ij}$ is calculated as follows:

$$X_{ij} = \sum_{k=1}^{N} \mathbb{1}(e_i \in E_k \wedge e_j \in E_k), \quad i \neq j$$

where the indicator $\mathbb{1}(\cdot)$ equals 1 if entity $e_i$ and entity $e_j$ both occur in visit $v_k$ and 0 otherwise, $N$ is the number of visits, and $E_k$ is the entity subset formed by all entities present in visit $v_k$. The entity co-occurrence matrix $X$ is symmetric, that is, $X_{ij}$ and $X_{ji}$ are equal, and the co-occurrence information of an entity with itself, on the diagonal, is recorded as 0.
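A minimal sketch of the co-occurrence count described above: every pair of distinct entities recorded in the same visit adds 1 to the corresponding symmetric entries, and the diagonal stays 0. The function name and the toy visits are hypothetical; the third visit is added only so that one pair co-occurs twice.

```python
from itertools import combinations


def cooccurrence_matrix(entities, visit_subsets):
    """Count, for each entity pair, the number of visits in which both occur."""
    index = {e: i for i, e in enumerate(entities)}
    n = len(entities)
    X = [[0] * n for _ in range(n)]
    for subset in visit_subsets:
        # combinations() never pairs an entity with itself, so the diagonal stays 0.
        for a, b in combinations(sorted(set(subset)), 2):
            i, j = index[a], index[b]
            X[i][j] += 1
            X[j][i] += 1
    return X


entities = ["d1", "d2", "d3", "m1", "o1", "t1"]
visit_subsets = [
    ["d1", "d2", "o1", "m1"],
    ["t1", "d3", "m1"],
    ["d3", "m1"],  # hypothetical extra visit
]
X = cooccurrence_matrix(entities, visit_subsets)
```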

The relationship between the entity node initial embedded representations and the entity co-occurrence matrix is expressed as:

$$w_i^{\top} w_j + b_i + b_j = \log X_{ij}$$

where $w_i$ and $w_j$ are the entity node initial embedded representations of entity $e_i$ and entity $e_j$ that are ultimately to be solved, each randomly initialized as a 128-dimensional vector with values between -0.1 and 0.1; the superscript $\top$ denotes transposition; $b_i$ and $b_j$ are the bias terms of the two entity node initial embedded representations, with initial value 0.

Based on this relationship between the entity co-occurrence matrix and the entity node initial embedded representations, the objective function $J$ is constructed;

$$J = \sum_{i,j=1}^{N_E} f(X_{ij}) \left( w_i^{\top} w_j + b_i + b_j - \log X_{ij} \right)^2, \qquad f(x) = \begin{cases} (x / x_{\max})^{\alpha}, & x < x_{\max} \\ 1, & \text{otherwise} \end{cases}$$

where $x_{\max}$ is the co-occurrence information threshold and $\alpha$ is an exponential parameter.

If two entity nodes never co-occur, that is, $X_{ij} = 0$, they do not participate in the calculation of the objective function. The objective function is optimized with the AdaDelta gradient descent algorithm until convergence, yielding for each entity $e_i$ in the entity set the corresponding entity node initial embedded representation $w_i$;
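The objective can be evaluated with a few lines of code. The sketch below assumes a weighted least-squares form in the style of GloVe embeddings, which is what the threshold, the exponential parameter and the bias terms suggest; that correspondence is an assumption. The defaults `x_max=100.0` and `alpha=0.75` are placeholders, since the text gives no values, and the AdaDelta optimization loop is omitted.

```python
import math


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences; counts at or above x_max get weight 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_objective(X, W, b, x_max=100.0, alpha=0.75):
    """Weighted squared error between w_i.w_j + b_i + b_j and log X_ij."""
    total = 0.0
    n = len(X)
    for i in range(n):
        for j in range(n):
            if X[i][j] <= 0:
                continue  # pairs that never co-occur are skipped
            dot = sum(wi * wj for wi, wj in zip(W[i], W[j]))
            residual = dot + b[i] + b[j] - math.log(X[i][j])
            total += glove_weight(X[i][j], x_max, alpha) * residual ** 2
    return total


# Toy case: two entities that co-occur 4 times, zero embeddings and biases.
X_toy = [[0, 4], [4, 0]]
W_toy = [[0.0, 0.0], [0.0, 0.0]]
b_toy = [0.0, 0.0]
J_toy = glove_objective(X_toy, W_toy, b_toy, x_max=4.0)
```

In a full implementation `W` would hold 128-dimensional vectors initialized in [-0.1, 0.1] and the gradient of this objective would be passed to an AdaDelta optimizer.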

The visit node initial embedded representation is obtained by averaging the entity node initial embedded representations of all adjacent entity nodes, and the visit node initial embedded representations and the entity node initial embedded representations together form the node initial embedded representation;

For a visit node $v_i$, the set of all adjacent entity nodes is $E_i$, and the initial embedded representation of $v_i$ is:

$$h_{v_i} = \frac{1}{n_i} \sum_{e_j \in E_i} w_j$$

where $n_i$ is the number of entity nodes in $E_i$ and $w_j$ is the entity node initial embedded representation of entity $e_j$.

The node initial embedded representation is $H^{(0)} = [H_V^{(0)}; H_E^{(0)}]$, where $H_V^{(0)}$ is the visit node initial embedded representation and $H_E^{(0)}$ is the entity node initial embedded representation.
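The averaging step can be sketched as below. The function name and the 2-dimensional toy vectors are hypothetical; the embeddings in the text are 128-dimensional.

```python
def mean_embedding(entity_ids, entity_vectors):
    """Initial visit-node embedding: element-wise mean of adjacent entity embeddings."""
    vectors = [entity_vectors[e] for e in entity_ids]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


entity_vectors = {"d1": [1.0, 3.0], "m1": [3.0, 5.0]}
visit_vector = mean_embedding(["d1", "m1"], entity_vectors)
```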

Clustering network construction unit: used for constructing an adjacency matrix from the relationships among the nodes in the visit network, and for training the self-supervision graph clustering-based visit node clustering network model with the adjacency matrix and the node initial embedded representation; referring to FIG. 5, the self-supervision graph clustering-based visit node clustering network model consists of 3 parts: graph attention, an autoencoder, and self-supervision.

An adjacency matrix $A$ is constructed from the relationships among the nodes in the visit network. The adjacency matrix $A$ and the node initial embedded representation $H^{(0)}$ are input into the self-supervision graph clustering-based visit node clustering network model for $L$ layers of graph attention training. The node embedded representation of the $l$-th layer, $H^{(l)}$, is calculated as follows:

$$H^{(l)} = \mathrm{relu}\left( \tilde{A} H^{(l-1)} W^{(l)} \right)$$

where $\mathrm{relu}$ is the activation function and $W^{(l)}$ is the graph attention weight of the $l$-th layer. $\tilde{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2}$ is the normalized adjacency matrix, $I$ is the identity matrix, and $\tilde{D}$ is the diagonal degree matrix of $A + I$. After $L$ layers of graph attention training, the node embedded representation $Z = H^{(L)}$ is obtained.

Like the node initial embedded representation $H^{(0)}$, $Z$ is composed of the updated visit node embedded representation $Z_V$ and the entity node embedded representation $Z_E$, $Z = [Z_V; Z_E]$.
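One layer of the propagation described above (relu applied to the normalized adjacency matrix times the previous node embeddings times a layer weight) can be sketched in plain Python. The symmetric normalization with self-loops used here is an assumption about how the normalized adjacency matrix is formed, and all function names are hypothetical.

```python
import math


def normalize_adjacency(A):
    """ASSUMPTION: symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    n = len(A)
    A_loop = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_loop]
    return [[A_loop[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(P, Q):
    """Plain matrix product of two nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def graph_layer(A_norm, H, W):
    """One layer: relu(A_norm @ H @ W)."""
    pre = matmul(matmul(A_norm, H), W)
    return [[max(0.0, x) for x in row] for row in pre]


A = [[0.0, 1.0], [1.0, 0.0]]       # two connected nodes
H0 = [[1.0, -2.0], [3.0, -2.0]]    # toy initial embeddings
W1 = [[1.0, 0.0], [0.0, 1.0]]      # identity weight for illustration
A_norm = normalize_adjacency(A)
H1 = graph_layer(A_norm, H0, W1)
```

Stacking several such layers, each with its own trainable weight, would give the final node embedded representation.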

The visit network is reconstructed from the node embedded representation. The reconstructed adjacency matrix $\hat{A}$ is:

$$\hat{A} = \sigma\left( Z Z^{\top} \right)$$

where $Z^{\top}$ is the transpose of the node embedded representation $Z$ and $\sigma$ is the sigmoid activation function.

The visit network reconstruction error is then calculated from the adjacency matrix of the visit network and the reconstructed adjacency matrix $\hat{A}$.
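The reconstruction step (a sigmoid applied to the inner products of the node embeddings) can be sketched as follows. The exact form of the reconstruction error is not recoverable from the text, so the binary cross-entropy used here is purely an assumption, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reconstruct_adjacency(Z):
    """A_hat[i][j] = sigmoid(z_i . z_j) for node embeddings z_i, z_j."""
    n = len(Z)
    return [[sigmoid(sum(a * b for a, b in zip(Z[i], Z[j]))) for j in range(n)]
            for i in range(n)]


def bce_reconstruction_error(A, A_hat, eps=1e-12):
    """ASSUMPTION: mean binary cross-entropy between A and A_hat."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(A_hat[i][j], eps), 1.0 - eps)  # guard against log(0)
            total -= A[i][j] * math.log(p) + (1 - A[i][j]) * math.log(1.0 - p)
    return total / (n * n)


Z_toy = [[0.0, 0.0], [1.0, 2.0]]
A_toy = [[0, 1], [1, 0]]
A_hat = reconstruct_adjacency(Z_toy)
net_error = bce_reconstruction_error(A_toy, A_hat)
```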

The entity node embedded representation $Z_E$ is input into a decoder consisting of a multi-layer neural network for training. The representation of the nodes in the $l$-th decoder layer is calculated from the representation in the preceding layer using the network weights and the bias of the $l$-th decoder layer, and the input of the decoder is $Z_E$. The output of the last decoder layer is taken as the reconstructed entity node embedded representation, from which the entity node reconstruction error is calculated.

A softmax regression operation is performed on the visit node embedded representation $Z_V$ to obtain the visit node probability distribution $Y$, where $Y$ has dimension $N \times K$, $N$ is the number of visits, and $K$ is the preset number of cluster centers, that is, the number of visit node categories. $K$ is chosen empirically by trying 3, 5 and 10 and keeping the number of categories that gives the better result. $y_{ij}$ denotes the probability that the $i$-th sample belongs to the $j$-th class.
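The softmax step turns one score per category into a probability distribution for each visit node. The sketch below assumes that each visit node already has one score per category; how those scores are derived from the embedding is not specified here, and the function name is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one visit node's category scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs_equal = softmax([0.0, 0.0])
probs_three = softmax([2.0, 1.0, 0.1])  # toy scores for 3 categories
```

Repeating the experiment with 3, 5 and 10 categories, as the text suggests, only changes the length of the score vector.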

The clustering loss is calculated from the probability distribution of the visit nodes;

For the $i$-th visit sample and the $j$-th cluster, the Student's t-distribution is used to measure the similarity between the data representation $h_i$ and the cluster center $\mu_j$. $h_i$ is the $i$-th row of the visit node representation matrix, $\mu_j$ is a cluster center initialized by the K-means method based on the visit node probability distribution, and $\nu$ is the degree of freedom of the Student's t-distribution. $q_{ij}$ is calculated as follows:

$$q_{ij} = \frac{\left(1 + \|h_i - \mu_j\|^2 / \nu\right)^{-\frac{\nu + 1}{2}}}{\sum_{j'} \left(1 + \|h_i - \mu_{j'}\|^2 / \nu\right)^{-\frac{\nu + 1}{2}}}$$

where $q_{ij}$ is the probability that the $i$-th sample is assigned to the $j$-th cluster, and $Q = [q_{ij}]$ is the set of cluster assignments of all samples. After the cluster distribution $Q$ is obtained, the target distribution $P$ is calculated. The target distribution $P$ assigns samples with higher confidence, so the data distribution can be optimized according to $P$ to bring the data closer to the cluster centers. $Q$ and $P$ both have dimension $N \times K$, where $N$ is the number of visits and $K$ is the number of cluster centers. Each element $p_{ij}$ of the target distribution $P$ is calculated as follows:

$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}$$

where $f_j = \sum_i q_{ij}$. Because $q_{ij}$ is squared in the target distribution $P$, $P$ has higher confidence. The clustering loss is calculated as follows:

$$L_{clu} = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
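The self-supervised clustering step can be sketched in three small functions: a Student's t soft assignment, a sharpened target distribution built from squared assignments, and a loss comparing the two. Taking the loss to be the KL divergence from the target distribution to the soft assignment is an assumption, as is the default degree of freedom `nu=1.0`; all names and the toy data are hypothetical.

```python
import math


def soft_assignment(H, centers, nu=1.0):
    """q_ij: Student's t similarity between sample i and cluster center j."""
    Q = []
    for h in H:
        sims = []
        for mu in centers:
            dist2 = sum((a - b) ** 2 for a, b in zip(h, mu))
            sims.append((1.0 + dist2 / nu) ** (-(nu + 1.0) / 2.0))
        total = sum(sims)
        Q.append([s / total for s in sims])
    return Q


def target_distribution(Q):
    """p_ij: squared assignments, normalized by soft cluster frequency f_j."""
    k = len(Q[0])
    f = [sum(row[j] for row in Q) for j in range(k)]
    P = []
    for row in Q:
        weights = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def kl_divergence(P, Q):
    """ASSUMPTION: clustering loss taken as KL(P || Q)."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)


H_toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]
centers = [[0.0, 0.0], [1.0, 1.0]]
Q = soft_assignment(H_toy, centers)
P = target_distribution(Q)
cluster_loss = kl_divergence(P, Q)
```

In the toy data each sample's dominant assignment becomes more pronounced in `P` than in `Q`, which is the "higher confidence" effect the text attributes to squaring.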

The overall loss function of the self-supervision graph clustering-based visit node clustering network model is constructed from the visit network reconstruction error, the entity node reconstruction error and the clustering loss. In the overall loss function, the hyperparameter that adjusts the relative importance of the different loss terms is set to 0.1 by default.

The visit node cluster distribution $Q$ obtained by the self-supervision graph clustering-based visit node clustering network model is taken as the category distribution of the visit nodes, and the category with the highest probability in the category distribution is selected as the category label of each visit node; the category label corresponding to visit node $v_i$ is $c_i = \arg\max_j q_{ij}$. All visit nodes of each patient are arranged in chronological order, taking the recording time of the first medical record of a single visit as the start time of the visit node and the recording time of the last medical record as its end time.
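Labelling and ordering can be sketched as below. The dictionary keys (`id`, `start`, `distribution`) and the dates are hypothetical; the logic is only the arg-max over the category distribution followed by a sort on the visit start time.

```python
from datetime import datetime


def label_and_order(visits):
    """Attach the most probable category to each visit and sort by start time."""
    labelled = []
    for visit in visits:
        dist = visit["distribution"]
        label = max(range(len(dist)), key=dist.__getitem__)  # arg-max category
        labelled.append({**visit, "label": label})
    return sorted(labelled, key=lambda v: v["start"])


patient_visits = [
    {"id": "v2", "start": datetime(2021, 3, 1), "distribution": [0.2, 0.7, 0.1]},
    {"id": "v1", "start": datetime(2021, 1, 5), "distribution": [0.6, 0.3, 0.1]},
]
ordered_visits = label_and_order(patient_visits)
```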

Whether visit nodes are merged or kept separate is determined by calculating the cosine similarity between the category distributions of consecutive visit nodes that have the same category label, and an event matrix is then constructed by arranging the visit nodes;

For two consecutive visit nodes $v_i$ and $v_{i+1}$ with the same category label, the cosine similarity between their category distributions is calculated:

$$\cos(q_i, q_{i+1}) = \frac{q_i \cdot q_{i+1}}{\|q_i\| \, \|q_{i+1}\|}$$

where $q_i$ is the category distribution of event $v_i$.

If the cosine similarity of the two consecutive visit nodes is greater than 0.8, they are merged into one visit node, which receives a merged category distribution; otherwise the two visit nodes are kept separate. For a run of several consecutive visit nodes with the same category label, the cosine similarity judgment is applied from front to back in order of arrangement to decide whether to merge or keep each node separately.
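The merging rule can be sketched as a single front-to-back pass. The threshold 0.8 comes from the text; taking the merged category distribution to be the element-wise mean of the two merged distributions is an assumption, and the function names and toy events are hypothetical.

```python
import math


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))


def merge_events(events, threshold=0.8):
    """events: chronologically ordered list of (label, category_distribution)."""
    merged = []
    for label, dist in events:
        if merged:
            prev_label, prev_dist = merged[-1]
            if prev_label == label and cosine_similarity(prev_dist, dist) > threshold:
                # ASSUMPTION: merged distribution is the element-wise mean.
                merged[-1] = (label, [(a + b) / 2.0 for a, b in zip(prev_dist, dist)])
                continue
        merged.append((label, list(dist)))
    return merged


events = [
    (0, [0.9, 0.05, 0.05]),
    (0, [0.8, 0.1, 0.1]),   # similar to the previous event: merged
    (0, [0.4, 0.3, 0.3]),   # same label but dissimilar distribution: kept
    (1, [0.1, 0.8, 0.1]),   # different label: kept
]
merged_events = merge_events(events)
```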

The final visit nodes of each patient are arranged into an event vector whose length equals the number of visit nodes of the patient with the most visit nodes; for a patient with fewer visit nodes, the remaining positions of the event vector are filled with 0. The event vectors of all patients are combined into an event matrix, in which each row is the event vector of one patient.

Chronic kidney disease phenotype subtype assessment module: used for evaluating the chronic kidney disease subtype mining model by comparing the differences between patients of different phenotype subtypes and checking whether the characteristics of the mined subtypes differ statistically, thereby assessing whether the disease subtypes obtained by the phenotype subtype mining method are clinically meaningful. The specific evaluation scheme is as follows:

Indexes such as sex, age and glomerular filtration rate are calculated for the patients of each phenotype subtype, and a statistical test is used to judge whether the clinical manifestations of patients of different phenotype subtypes differ.

Important medication data, such as the usage of recombinant human erythropoietin, metformin, candesartan and pravastatin, are tallied for the patients of each subtype, and a statistical test is used to analyze whether they differ.

For each subtype, the number of patients with each complication, including heart failure, coronary heart disease, hypertension, diabetes and hyperlipidemia, is counted, the proportion of each complication is calculated, and the proportions are checked for differences between subtypes.

For each subtype, the total number of patients and the number surviving at different time points are counted, and the survival rates of patients of different subtypes are compared. The difference in survival rate over time between patients of different subtypes is observed and analyzed using the log-rank test.

If more than 50% of the characteristics differ significantly between the patient groups of different subtypes, the mined subtypes are considered to have good clinical value.
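The final decision rule can be sketched as a simple count over per-characteristic test results. The significance level 0.05 is an assumption, since the text does not state one, and the characteristic names and p-values below are made-up placeholders rather than results.

```python
def subtypes_clinically_useful(p_values, alpha=0.05, min_fraction=0.5):
    """Return (decision, fraction, names) for characteristics with p < alpha."""
    significant = [name for name, p in p_values.items() if p < alpha]
    fraction = len(significant) / len(p_values)
    return fraction > min_fraction, fraction, significant


# Placeholder p-values, one per compared characteristic.
toy_p_values = {"feature_a": 0.01, "feature_b": 0.001, "feature_c": 0.4}
useful, fraction, significant = subtypes_clinically_useful(toy_p_values)
```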

Chronic kidney disease subtype prediction module: used for making predictions from the structured data of a patient;

the preprocessed structured data of the patient are input into the self-supervision graph clustering-based visit node clustering network model for prediction, yielding the probability distribution of the patient's visit nodes;

the visit event sequence is then input into the chronic kidney disease subtype mining model, and the nodes in the chronic kidney disease subtype mining model are fitted in sequence to obtain an event flow, from which the chronic kidney disease subtype of the patient is determined.

The invention provides a visit node clustering network model based on self-supervision graph clustering, in which a decoder is added to the graph attention training to reconstruct the node embedded representation, and a self-supervision loss is added to train the clustering model. The visit node clustering network model aggregates low-level, fine-grained information of chronic kidney disease patients into high-level, coarse-grained general information for mining diagnosis and treatment processes, which solves the problem that process mining cannot handle the coexistence of multi-granularity information in longitudinal electronic medical record data, such as event information within a single visit and event information across multiple visits. Based on the self-supervision graph clustering method, the multi-dimensional diagnosis and treatment information within a single visit of a patient and the temporal information across multiple visits are fully integrated, so that features are mined from the electronic medical record data along both the cross-sectional and the longitudinal dimension. Based on the similarity of the event label distributions of the visit nodes, similar adjacent events are merged, which optimizes the process mining method, simplifies the mined diagnosis and treatment processes, and improves their representativeness and coverage.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A chronic kidney disease subtype mining system based on self-supervision graph clustering is characterized by comprising:

2. The system of claim 1, wherein the structured data comprises basic patient information, visit records, diagnoses during the observation window, laboratory tests, medical examinations, operation data and/or medication data.

3. The chronic kidney disease subtype mining system based on self-supervision graph clustering as claimed in claim 1, wherein the data extraction and preprocessing module is specifically configured to: extract the structured data from the chronic kidney disease diagnosis and treatment records in the electronic medical record system, the structured data comprising basic patient information, visit records, diagnoses during the observation window, laboratory tests, medical examinations, operation data and medication data; and preprocess the extracted structured data, wherein for laboratory tests only abnormal test items, as judged against the normal reference range, are considered, the result of each abnormal test item is classified as low or high, and the name and class of the abnormal test item are retained; medical examination and operation data are processed with simple natural language processing techniques, and the examined body part, the examination type and the operation name are retained; for medication data, only six classes of drugs are considered, namely antihyperglycemic drugs, antihypertensive drugs, lipid-regulating drugs, non-steroidal anti-inflammatory drugs, antiplatelet drugs and steroids, each drug is assigned to one of the six classes, and the drug class is retained; a diagnosis set, a medication set, an operation set and a test set are thereby obtained, together with the number of diagnosis types, medication types, operation types, test types and visit records; the diagnosis set, the medication set, the operation set and the test set are combined to form the entity set, and the visit records of the patients are combined to form the visit set.

4. The chronic kidney disease subtype mining system based on self-supervision graph clustering of claim 1, wherein the chronic kidney disease subtype mining module specifically includes:

5. The chronic kidney disease subtype mining system based on self-supervision graph clustering as claimed in claim 4, wherein the visit network construction unit specifically includes:

6. The chronic kidney disease subtype mining system based on self-supervision graph clustering as claimed in claim 4, wherein the embedded representation construction unit specifically includes:

for constructing an entity co-occurrence matrix using the set of entities;

7. The chronic kidney disease subtype mining system based on self-supervision graph clustering according to claim 4, characterized in that the clustering network construction unit specifically includes:

used for constructing an adjacency matrix from the relationships among the nodes in the visit network, and inputting the adjacency matrix and the node initial embedded representation into the self-supervision graph clustering-based visit node clustering network model for graph attention training to obtain a node embedded representation, the node embedded representation comprising a visit node embedded representation and an entity node embedded representation;

used for reconstructing the visit network from the node embedded representation and calculating the visit network reconstruction error;

used for performing a softmax regression operation on the visit node embedded representation to obtain the probability distribution of the visit nodes, and calculating the clustering loss from the probability distribution of the visit nodes;

and used for constructing the overall loss function of the self-supervision graph clustering-based visit node clustering network model from the visit network reconstruction error, the entity node reconstruction error and the clustering loss.

8. The chronic kidney disease subtype mining system based on self-supervision graph clustering according to claim 4, characterized in that the chronic kidney disease subtype mining model building unit specifically includes:

used for taking the visit node cluster distribution obtained by the self-supervision graph clustering-based visit node clustering network model as the category distribution of the visit nodes, selecting the category with the highest probability in the category distribution as the category label of each visit node, and arranging all visit nodes of each patient in chronological order;

9. The chronic kidney disease subtype mining system based on self-supervision graph clustering of claim 1, wherein the chronic kidney disease subtype prediction module specifically includes: