CN115719647A

CN115719647A - Hemodialysis-concurrent cardiovascular disease prediction system integrating active learning and contrast learning

Info

Publication number: CN115719647A
Application number: CN202310029096.3A
Authority: CN
Inventors: 李劲松; 王丰; 池胜强; 朱伟伟
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-01-09
Filing date: 2023-01-09
Publication date: 2023-02-28
Anticipated expiration: 2043-01-09
Also published as: CN115719647B

Abstract

The invention discloses a hemodialysis-complicated cardiovascular disease prediction system integrating active learning and contrast learning, comprising: a hemodialysis data preparation module, which extracts structured data of patient samples from the hospital electronic information system and daily monitoring equipment and processes the structured data to obtain amplified structured data; and a hemodialysis-complicated cardiovascular disease risk prediction module, which constructs a risk evaluation model, trains the model on the amplified structured data to obtain patient characterizations and scores, and uses the patient characterizations and scores to predict the risk of hemodialysis-complicated cardiovascular disease. The invention addresses the problem of matching positive and negative samples, iteratively updates the parameters of the contrast learning model using the real label data of hemodialysis-complicated cardiovascular disease, and thereby uses the real complication outcome labels to improve model performance. It also addresses the problems of too few samples or an unbalanced number of positive and negative samples, and reduces the difference between the amplified data and the original data.

Description

Hemodialysis-concurrent cardiovascular disease prediction system integrating active learning and contrast learning

Technical Field

The invention relates to the technical field of medical health information, in particular to a hemodialysis-associated cardiovascular disease prediction system integrating active learning and comparative learning.

Background

Maintenance hemodialysis is one of the main treatments for end-stage renal disease, and ensuring that hemodialysis patients are treated effectively is an urgent need in clinical care. Hemodialysis is a long-term treatment that continues throughout the course of the disease. Various cardiovascular complications can occur during long-term hemodialysis and seriously affect patient survival. Risk prediction and early intervention for the cardiovascular complications of maintenance hemodialysis are therefore crucial to improving the quality of life of patients with end-stage renal disease.

Contrast learning is a self-supervised algorithm that is widely applied in fields such as computer vision and natural language processing; in recent years it has even exceeded supervised learning on a number of mainstream tasks. Difficulties remain, however, in applying a contrast learning method designed for self-supervised tasks to the supervised task of predicting hemodialysis-complicated cardiovascular disease. On the one hand, cardiovascular complication prediction is a supervised task that, compared with an unsupervised task, provides additional label information, so a key problem is how to use the real complication outcome labels effectively to improve model performance. On the other hand, the key to contrast learning lies in matching suitable positive and negative samples; an unsuitable matching method seriously degrades model performance, so a second key problem is how to match suitable, high-value positive and negative samples.

To address these problems, this patent constructs, for the scenario of predicting hemodialysis-complicated cardiovascular disease, a prediction system that integrates active learning and contrast learning, in order to provide accurate and effective support for clinical decisions.

Disclosure of Invention

In order to solve the technical problems, the invention provides a hemodialysis-complicated cardiovascular disease prediction system integrating active learning and comparative learning.

The technical scheme adopted by the invention is as follows:

a hemodialysis-complicated cardiovascular disease prediction system that fuses active learning and contrast learning, comprising:

the hemodialysis data preparation module is used for extracting structured data of a patient sample by utilizing a hospital electronic information system and daily monitoring equipment and processing the structured data to obtain amplified structured data;

and the hemodialysis concurrent cardiovascular disease risk prediction module is used for constructing a risk evaluation model, training and learning the amplification structured data through the risk evaluation model to obtain patient characterization and scores, and predicting the hemodialysis concurrent cardiovascular disease risk by using the patient characterization and scores.

Further, the structured data includes demographic data, clinical event data, medication data, and daily monitoring data.

Further, the hemodialysis data preparation module specifically includes:

the data acquisition unit is used for extracting the structured data of the patient sample by utilizing the hospital electronic information system and the wearable equipment;

the data cleaning unit is used for carrying out missing value processing, error value detection, repeated data elimination and/or inconsistency elimination on the structured data to obtain static data and time sequence data;

the data fusion unit is used for splicing one-dimensional compressed data obtained by performing convolution operation on the time sequence data and the static data to obtain original fusion characteristics;

and the data amplification unit is used for obtaining the amplification structured data by adopting a single-feature randomization method for the original fusion features.

Further, the amplification process of the data amplification unit is as follows:

step S1: taking patients with cardiovascular complications as original positive samples, taking patients without cardiovascular complications as original negative samples, wherein all the original positive samples form an original positive sample set, and all the original negative samples form an original negative sample set;

step S2: when the number of the original positive samples is smaller than that of the original negative samples, amplifying the original positive sample set to obtain amplified positive samples until the number of the positive samples is equal to that of the original negative samples; when the number of the original positive samples is larger than that of the original negative samples, amplifying the original negative sample set to obtain amplified negative samples until the number of the negative samples is equal to that of the original positive samples;

and step S3: the original positive sample set and the amplified positive samples form a positive sample amplification set, and the original negative sample set and the amplified negative samples form a negative sample amplification set;

and step S4: the positive sample amplification set and the negative sample amplification set together constitute amplification structured data.
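Steps S1 to S4 amount to a simple balancing rule: the smaller class is amplified until both classes have the same size. A minimal sketch of that rule follows; the function name `amplification_plan` and its return format are illustrative assumptions, not part of the patent text.

```python
from typing import Tuple


def amplification_plan(num_positive: int, num_negative: int) -> Tuple[str, int]:
    """Return which class to amplify and how many new samples are needed.

    Hypothetical helper: mirrors step S2, where the smaller class is
    amplified until it matches the size of the larger class.
    """
    if num_positive < num_negative:
        return ("positive", num_negative - num_positive)
    if num_positive > num_negative:
        return ("negative", num_positive - num_negative)
    return ("none", 0)  # already balanced, nothing to amplify


# Example: 30 patients with complications, 100 without.
plan = amplification_plan(30, 100)
```

In this example the plan is to create 70 amplified positive samples, so that the positive and negative amplification sets of step S3 both contain 100 samples.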

Further, the process of obtaining the amplification positive sample in step S2 is:

combining the original fusion features with the original positive sample set to obtain a combined positive sample set, wherein the combined positive sample set comprises a single original fusion feature and a single positive sample set corresponding to the single original fusion feature;

taking a single original fusion feature in a single combined positive sample set as an intervention feature, taking the rest original fusion features in the single combined positive sample set as a fixed feature set, taking a positive sample in the single positive sample set as an amplification object to perform sample amplification to obtain a single amplification positive sample, and completing the whole amplification process until the amplification times are the difference value between the original negative sample and the original positive sample to obtain a final amplification positive sample;

the process of obtaining the amplification negative sample comprises the following steps:

combining the original fusion features with the original negative sample set to obtain a combined negative sample set, wherein the combined negative sample set comprises a single original fusion feature and a single negative sample set corresponding to the single original fusion feature;

and taking a single original fusion feature in the single combined negative sample set as an intervention feature, taking the rest original fusion features in the single combined negative sample set as a fixed feature set, taking the negative samples in the single negative sample set as amplification objects to carry out sample amplification to obtain a single amplification negative sample, and completing the whole amplification process until the amplification times are the difference value between the original negative sample and the original positive sample to obtain a final amplification negative sample.

Further, the module for predicting risk of hemodialysis complicated cardiovascular diseases specifically comprises:

a risk evaluation unit: the risk evaluation model is constructed, and the amplification structured data is used as training data of the model to obtain scores and patient phenotypes;

an active learning unit: for selecting positive and negative samples from said amplified structured data by a positive and negative sample selection normalizer using said score and said patient phenotype;

a comparison learning unit: and the system is used for performing comparison learning by using the positive and negative samples and updating the network parameters of the encoder shared by the risk evaluation unit.

Further, the risk evaluation unit specifically includes:

for constructing the risk evaluation model from an encoder and a risk evaluation network, and optimizing it through a loss function;

for extracting a patient phenotype with the encoder of the risk evaluation model, and computing from that phenotype a score for hemodialysis-complicated cardiovascular disease through the risk evaluation network;

for setting a real label for each patient, the real label being 1 when the patient has cardiovascular complications and 0 otherwise;

for optimizing the loss function using the score and the real label.

Further, the active learning unit specifically includes:

for normalizing the patient phenotype output by the risk evaluation model, the normalization mapping the resulting patient representation into the 0-1 space;

for calculating, by means of a positive and negative sample selection rule device, the included angle in the 0-1 space between each sample representation in the amplified structured data and every other sample representation;

for dividing the included angles calculated for each sample into a first group and a second group according to whether the real label of the other sample is the same as the real label of the current sample, and sorting each group internally from small to large;

for selecting the upper quartile of the sorted first group as the positive sample set and the lower quartile of the sorted second group as the negative sample set.

Further, the comparison learning unit is specifically used for performing contrast learning with the positive samples and the negative samples, where a positive sample has the same real label as the patient sample and a negative sample has a different real label: the cosine distance between the patient sample and its positive sample and the cosine distance between the patient sample and its negative sample are calculated, the loss function of the comparison learning unit is constructed from these two cosine distances, and the network parameters of the encoder shared with the risk evaluation unit are updated.

Further, the risk evaluation unit, the active learning unit and the comparison learning unit share the encoder, the encoder is a 5-layer fully-connected network, the number of nodes in each layer is 1024, 512, 256, 128 and 64, respectively, and the activation function is ReLU.

The invention has the beneficial effects that:

1. the invention provides a positive and negative sample matching method based on active learning, which is used for selecting high-value comparison samples to improve the model performance and solve the problem of positive and negative sample matching.

2. The invention provides a training method for integrating active learning and comparative learning, which iteratively updates comparative learning model parameters by using real label data of hemodialysis complicated cardiovascular diseases, and solves the problem of how to effectively utilize real complicated symptom result labels to improve model performance in a supervised scene.

3. The invention provides a single-feature randomization method for amplifying original data, solves the problems of too few collected samples or unbalanced number of positive samples and negative samples, and reduces the difference between the amplified data and the original data.

Drawings

FIG. 1 is a block diagram of a hemodialysis-complicated cardiovascular disease prediction system incorporating active learning and contrast learning in accordance with the present invention;

FIG. 2 is a block diagram of a hemodialysis data preparation module of the present invention;

fig. 3 is a block diagram of the module for predicting risk of hemodialysis complicated with cardiovascular diseases according to the present invention.

Detailed Description

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

Referring to fig. 1, a hemodialysis-complicated cardiovascular disease prediction system fusing active learning and contrast learning, includes:

the structured data includes demographic data, clinical event data, medication data, and daily monitoring data;

(1) Demographic data: age, sex, region, etc.; (2) clinical event data: hemodialysis events, diagnostic events, etc.; (3) medication data: drug name, dosage, etc.; (4) daily monitoring data: blood pressure, heart rate, body weight, etc.

Referring to fig. 2, the hemodialysis data preparation module specifically includes:

Taking complication diagnosis events as an example: when the diagnosis time of a complication is missing, it can be filled with the time at which medication for that complication was first used; when the name of a complication is missing, the specific complication can be inferred from the medication prescribed for it; if the name cannot be inferred from the medication information, the record with missing diagnosis information is actively screened out.
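The first of these rules can be sketched as follows. The function name `fill_diagnosis_date` and the use of plain `date` objects are assumptions made for illustration; the patent does not specify a record format.

```python
from datetime import date
from typing import List, Optional


def fill_diagnosis_date(
    diagnosis_date: Optional[date],
    medication_dates: List[date],
) -> Optional[date]:
    """Fill a missing diagnosis date with the first medication date.

    Hypothetical helper: returns the recorded date when present, the
    earliest complication-medication date when the diagnosis date is
    missing, and None when neither is available (left for screening).
    """
    if diagnosis_date is not None:
        return diagnosis_date
    if medication_dates:
        return min(medication_dates)
    return None


filled = fill_diagnosis_date(None, [date(2022, 5, 3), date(2022, 3, 1)])
```

Here the missing diagnosis date is filled with 1 March 2022, the earliest of the two medication dates.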

The acquired basic patient information, such as age and sex, is static data, whereas hemodialysis information and daily monitoring information are one-dimensional time series data. A convolution operation is applied to the one-dimensional time series data so that it can be fused with the static data, which facilitates subsequent data processing and model training.
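A minimal sketch of this fusion step is given below, assuming a single fixed averaging kernel and a stride of 2; the patent does not state the kernel, stride, or number of channels, so `conv1d_valid`, `fuse`, and all numeric values are illustrative only.

```python
from typing import List, Sequence


def conv1d_valid(series: Sequence[float], kernel: Sequence[float], stride: int = 1) -> List[float]:
    """Slide `kernel` over `series` without padding and return the dot products."""
    width = len(kernel)
    return [
        sum(series[start + k] * kernel[k] for k in range(width))
        for start in range(0, len(series) - width + 1, stride)
    ]


def fuse(static: Sequence[float], series: Sequence[float],
         kernel: Sequence[float], stride: int = 1) -> List[float]:
    """Splice the static features with the compressed time series."""
    return list(static) + conv1d_valid(series, kernel, stride)


static_features = [63.0, 1.0]                            # age, encoded sex
systolic = [130.0, 134.0, 128.0, 140.0, 138.0, 132.0]    # six daily readings
fused = fuse(static_features, systolic, kernel=[0.5, 0.5], stride=2)
# fused == [63.0, 1.0, 132.0, 134.0, 135.0]
```

The six readings are compressed to three values, which are then concatenated with the two static features into one fused feature vector of length five.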

When too few samples are collected, or when the numbers of positive and negative samples are unbalanced, the training of the model is affected. To reduce this effect, the invention amplifies the original data with a single-feature randomization method, which addresses both problems. To keep the difference between the amplified data and the actual data as small as possible, each amplification step selects only one feature as the intervention feature and keeps the remaining feature set fixed.

The amplification process of the data amplification unit comprises the following steps:

the process of obtaining the amplification positive sample comprises the following steps:

The number of existing original positive samples is $M$ and the number of original negative samples is $N$, with $M < N$.

Because the original positive and negative samples are unbalanced, the original positive samples need to be amplified by $Q$ samples, i.e. $Q = N - M$. The amplification of these $Q$ samples is described in detail below.

The original fusion features are denoted $V = \{v_1, v_2, \ldots, v_n\}$, where $v_i$ is the $i$-th single original fusion feature and $n$ is the number of features of an original positive sample. All original positive samples are then randomly and evenly divided into $n$ groups, and the original positive sample set is denoted $X = \{X_1, X_2, \ldots, X_n\}$, where $X_i$ is the single positive sample set of the $i$-th group, $X_i = \{x_{i1}, x_{i2}, \ldots, x_{im_i}\}$, $m_i$ is the total number of positive samples assigned to the $i$-th group, and $x_{ij}$ is the $j$-th positive sample of the $i$-th group.

The original fusion features $V$ are then combined with the original positive sample set $X$ to obtain the combined positive sample set $VX = \{VX_1, VX_2, \ldots, VX_n\}$, where the single combined positive sample set is $VX_i = (v_i, X_i)$, $v_i$ is the $i$-th single original fusion feature, and $X_i$ is the single positive sample set of the $i$-th group. After combination, for each single combined positive sample set $VX_i$ of $VX$, the single original fusion feature $v_i$ is taken as the intervention feature, $V \setminus \{v_i\}$ (the features of $V$ other than $v_i$) is taken as the fixed feature set, and the positive samples in the single positive sample set $X_i$ are taken as amplification objects for sample amplification. Each single combined positive sample set $VX_i$ yields $q_i$ amplified samples, with $\sum_{i=1}^{n} q_i = Q$. The amplified data are denoted $X' = \{X'_1, X'_2, \ldots, X'_n\}$, where $X'_i$ is the sample set amplified from the $i$-th group of the combined positive sample set $VX$, $X'_i = \{x'_{i1}, x'_{i2}, \ldots, x'_{iq_i}\}$, $q_i$ is the number of amplifications of the $i$-th group, and the single amplified positive sample $x'_{ij}$ is the $j$-th sample amplified from the single combined positive sample set $VX_i$.

To generate a single amplified positive sample $x'_{ij}$, two samples $x_{ia}$ and $x_{ib}$ are first selected at random from the single positive sample set $X_i$ of $VX_i$. The features of sample $x_{ia}$ are denoted $(x_{ia}^{(1)}, \ldots, x_{ia}^{(n)})$ and the features of sample $x_{ib}$ are denoted $(x_{ib}^{(1)}, \ldots, x_{ib}^{(n)})$. The intervention feature of the single amplified positive sample $x'_{ij}$ is then given by

$$x'^{(i)}_{ij} = x_{ia}^{(i)} + r \left( x_{ib}^{(i)} - x_{ia}^{(i)} \right),$$

while the features in the fixed feature set are not randomized. Here $r$ is a random number in the range $(0, 1)$; $x_{ia}^{(k)}$ is the value of the $k$-th feature of sample $x_{ia}$; $x_{ib}^{(k)}$ is the value of the $k$-th feature of sample $x_{ib}$; and $x'^{(k)}_{ij}$ is the value of the $k$-th feature of the amplified sample $x'_{ij}$. In other words, the value of the amplified sample $x'_{ij}$ on feature $v_i$ is a random number between the values of samples $x_{ia}$ and $x_{ib}$ on feature $v_i$, which reduces the difference between the amplified data and the original data.
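The single-feature randomization procedure can be sketched as follows. Several details are assumptions made only for this illustration: samples are dealt into one group per feature in round-robin order after shuffling, groups are visited cyclically until the requested number of samples is reached, and the fixed features of a new sample are copied from the first of the two randomly chosen samples. The names `amplify_one` and `amplify` are hypothetical.

```python
import random
from typing import List


def amplify_one(group: List[List[float]], feature_idx: int, rng: random.Random) -> List[float]:
    """Create one amplified sample from `group` by randomizing one feature."""
    a, b = rng.sample(group, 2)          # two distinct samples of the group
    r = rng.random()                     # random number in [0, 1)
    new_sample = list(a)                 # assumption: fixed features come from a
    # Intervention feature: a random value between the values of a and b.
    new_sample[feature_idx] = a[feature_idx] + r * (b[feature_idx] - a[feature_idx])
    return new_sample


def amplify(samples: List[List[float]], num_new: int, seed: int = 0) -> List[List[float]]:
    """Generate `num_new` amplified samples, one intervention feature per group."""
    num_features = len(samples[0])
    if len(samples) < 2 * num_features:
        raise ValueError("need at least two samples per feature group")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # One group per feature; group i uses feature i as its intervention feature.
    groups = [shuffled[i::num_features] for i in range(num_features)]
    return [
        amplify_one(groups[n % num_features], n % num_features, rng)
        for n in range(num_new)
    ]


positives = [
    [1.0, 10.0, 100.0], [2.0, 20.0, 200.0], [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0], [5.0, 50.0, 500.0], [6.0, 60.0, 600.0],
]
new_samples = amplify(positives, num_new=4)
```

Each amplified sample differs from an original sample in at most one feature, and that feature stays within the range spanned by two real samples, which is how the method keeps the amplified data close to the original data.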

The hemodialysis-complicated cardiovascular disease risk prediction module comprises three parts: a risk evaluation unit, an active learning unit and a contrast learning unit, as shown in fig. 3. First, the amplified structured data is used as the input of the system, and a preliminary risk evaluation model is trained by the risk evaluation unit. Then, the active learning unit uses the output score p of the risk evaluation unit and the patient phenotypes s1 and s2 to select high-value contrast samples from the amplified structured data through the positive and negative sample rule selector R, for the contrast learning unit to learn from. Finally, the contrast learning unit learns from the high-quality contrast samples selected by the active learning unit, so that samples with the same label move closer together and samples with different labels move farther apart; at the same time, the parameters of the encoder f shared by the contrast learning unit and the risk evaluation unit are updated, which makes the risk evaluation model more accurate.

The hemodialysis complicated cardiovascular disease risk prediction module specifically comprises:

the risk evaluation unit specifically includes:

for optimizing a loss function using the score and the truth label.

The risk evaluation unit, the active learning unit and the comparison learning unit share the encoder f, the encoder f is a 5-layer fully-connected network, the number of nodes in each layer is 1024, 512, 256, 128 and 64 respectively, and the activation function is ReLU.
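The shape of the shared encoder can be illustrated with a small forward pass written with the standard library only. The input dimension (16), the weight initialization, and the application of ReLU after every layer including the last are assumptions for this sketch; the patent specifies only the layer widths and the ReLU activation.

```python
import random
from typing import List, Tuple

ENCODER_SIZES = [1024, 512, 256, 128, 64]   # layer widths stated in the text

Layer = Tuple[List[List[float]], List[float]]


def init_mlp(input_dim: int, layer_sizes: List[int], rng: random.Random) -> List[Layer]:
    """Build randomly initialized fully-connected layers (weights, biases)."""
    layers = []
    fan_in = input_dim
    for fan_out in layer_sizes:
        scale = (2.0 / fan_in) ** 0.5       # assumed He-style initialization
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)] for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def relu_mlp_forward(layers: List[Layer], x: List[float]) -> List[float]:
    """Apply each layer followed by ReLU (assumed for every layer)."""
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


rng = random.Random(0)
encoder = init_mlp(input_dim=16, layer_sizes=ENCODER_SIZES, rng=rng)
fused_features = [rng.random() for _ in range(16)]   # stand-in for one patient
phenotype = relu_mlp_forward(encoder, fused_features)
```

The result is a 64-dimensional, non-negative vector. A practical implementation would use a deep learning framework; the point here is only the 1024-512-256-128-64 structure shared by the three units.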

After the encoder $f$ extracts the patient phenotype $s$ (a 64-dimensional vector) from the patient's original fusion features, the risk evaluation network $g$ computes the patient's score for cardiovascular complications, $p = g(s)$. The risk evaluation network $g$ consists of 4 fully-connected layers with 128, 32, 8 and 2 nodes respectively. The first three layers use the ReLU activation function, and the last layer is the output layer. The entire network uses stochastic gradient descent (SGD) as the optimizer. The prediction loss of the risk evaluation unit takes the binary cross-entropy form:

$$L_{\mathrm{risk}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

where $N$ is the number of all samples in the amplified structured data, $p_i$ is the score predicted by the risk evaluation unit that input patient sample $i$ has cardiovascular disease, and $y_i$ is the real label of patient $i$: $y_i = 1$ when patient $i$ has cardiovascular disease and $y_i = 0$ when patient $i$ does not. When patient $i$ has cardiovascular disease, $y_i = 1$ and $1 - y_i = 0$, so the larger the predicted score $p_i$, the smaller the overall loss; similarly, when patient $i$ does not have cardiovascular disease, the smaller the predicted score $p_i$, the smaller the overall loss.
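The behaviour described here (the loss falls as the score rises for patients with the disease, and falls as the score drops for patients without it) is that of binary cross-entropy. The sketch below assumes that form, averaged over the samples; the clipping constant `eps` is an added numerical safeguard, not something stated in the patent.

```python
import math
from typing import Sequence


def binary_cross_entropy(scores: Sequence[float], labels: Sequence[int],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between predicted scores and 0/1 labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(scores)


labels = [1, 0]                                   # patient 1 has CVD, patient 2 does not
confident = binary_cross_entropy([0.9, 0.1], labels)
hesitant = binary_cross_entropy([0.6, 0.4], labels)
```

With these inputs `confident` is smaller than `hesitant`: scores that agree more strongly with the real labels give a lower loss.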

An active learning unit: for selecting positive and negative samples from said augmented structured data using said score and said patient phenotype by a positive and negative sample selection normalizer;

the active learning unit is used for selecting high-value comparison samples for the comparison learning unit to learn by combining the risk evaluation unit, so that the patient characteristics of the same label are closer, and the patient characteristics of different labels are farther.

The active learning unit specifically includes:

First, the patient phenotype $s$ generated by the risk evaluation unit is normalized as $\hat{s} = s / \lVert s \rVert_1$, where $s$ is the patient characterization vector of length 64 and $\lVert s \rVert_1$ is the L1 norm of $s$. After normalization, the patient characterization is mapped into the 0-1 space, which simplifies the subsequent selection calculations.
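A short sketch of the L1 normalization follows. The handling of an all-zero vector is an assumption added to avoid division by zero, and a 4-dimensional vector is used instead of 64 dimensions to keep the example readable.

```python
from typing import List, Sequence


def l1_normalize(s: Sequence[float]) -> List[float]:
    """Divide a vector by its L1 norm (sum of absolute values)."""
    norm = sum(abs(v) for v in s)
    if norm == 0.0:
        return list(s)        # assumption: leave an all-zero vector unchanged
    return [v / norm for v in s]


normalized = l1_normalize([1.0, 3.0, 0.0, 4.0])
# normalized == [0.125, 0.375, 0.0, 0.5]
```

Because the ReLU encoder output is non-negative, every component of the normalized vector lies between 0 and 1 and the components sum to 1.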

The positive and negative sample rule selector R selects positive and negative samples from the original input sample set according to a selection rule.

The rule selector R is based on the following principle: the patient phenotype vectors of samples with the same label should be close in cosine distance, and the patient phenotype vectors of samples with different labels should be far apart. A positive sample j of sample i is a sample that has the same real label as sample i but is far from sample i in cosine distance; contrast learning is expected to bring sample i and its positive sample j closer. A negative sample k of sample i is a sample that has a different real label from sample i but is close to sample i in cosine distance; contrast learning is expected to push sample i and its negative sample k farther apart.

The included angle $\theta_{ij}$ between each sample characterization in the amplified structured data and every other sample characterization is calculated from

$$\cos \theta_{ij} = \frac{\hat{s}_i \cdot \hat{s}_j}{\lVert \hat{s}_i \rVert_2 \, \lVert \hat{s}_j \rVert_2}, \qquad j \neq i,$$

where $\hat{s}_i$ is the characterization of sample $i$, i.e. the $i$-th sample characterization vector after normalization, and $\hat{s}_j$ is the characterization of sample $j$. In general, two samples with the same label should have the same or a similar direction in space, so the included angle between them is small; two samples with different labels should have different directions in space, so the included angle between them is large.

The included angles between sample $i$ and the other samples are divided into two groups according to whether the real label of the other sample is the same as the real label of sample $i$. The first group contains the samples whose real label is the same as that of sample $i$, $A_i = \{\theta_{ij} \mid y_j = y_i\}$, where $y_i$ is the real label of sample $i$ and $y_j$ is the real label of sample $j$. The second group contains the samples whose real label differs from that of sample $i$, $B_i = \{\theta_{ik} \mid y_k \neq y_i\}$. Each of $A_i$ and $B_i$ is sorted internally from small to large.

From the sorted first group $A_i$, the upper quartile is selected as the positive sample set of sample $i$, i.e. the same-label samples with the largest included angles. From the sorted second group $B_i$, the lower quartile is selected as the negative sample set of sample $i$, i.e. the different-label samples with the smallest included angles.
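The selection rule can be sketched as follows. The sketch ranks samples by cosine distance (1 minus the cosine of the included angle), which orders samples the same way as the angle itself. How the quartile is rounded for small groups (at least one sample, otherwise a quarter of the group) is an assumption, and `select_contrast_samples` is a hypothetical name.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def select_contrast_samples(reps: List[Sequence[float]], labels: List[int],
                            i: int) -> Tuple[List[int], List[int]]:
    """Return (positive indices, negative indices) for sample i."""
    same, diff = [], []
    for j, (rep, label) in enumerate(zip(reps, labels)):
        if j == i:
            continue
        entry = (cosine_distance(reps[i], rep), j)
        (same if label == labels[i] else diff).append(entry)
    same.sort()   # ascending distance
    diff.sort()
    n_pos = max(1, len(same) // 4) if same else 0
    n_neg = max(1, len(diff) // 4) if diff else 0
    positives = [j for _, j in same[len(same) - n_pos:]]  # farthest same-label
    negatives = [j for _, j in diff[:n_neg]]              # nearest other-label
    return positives, negatives


reps = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.7, 0.3],
        [0.8, 0.2], [0.4, 0.6], [0.2, 0.8], [0.0, 1.0]]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]
positives, negatives = select_contrast_samples(reps, labels, i=0)
```

For sample 0, the selected positive is sample 3 (same label, largest angle) and the selected negative is sample 5 (different label, smallest angle): these are the hard cases from which contrast learning gains the most.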

The comparison learning unit is specifically used for performing contrast learning with the positive samples and the negative samples, where a positive sample has the same real label as the patient sample and a negative sample has a different real label: the cosine distance between the patient sample and its positive sample and the cosine distance between the patient sample and its negative sample are calculated, the loss function of the comparison learning unit is constructed from these two cosine distances, and the network parameters of the encoder shared with the risk evaluation unit are updated.

In the comparison learning unit, the positive and negative samples of an original sample are those selected by the active learning unit on the basis of the real labels and the patient characterizations. The encoder $f$ produces the patient characterization $s$ of each positive and negative sample, and a projector $h$ maps $s$ to the contrast characterization $t = h(s)$. The projector $h$ is a 3-layer fully-connected network with 512, 256 and 128 nodes respectively; its activation function is ReLU and it uses SGD as the optimizer. The mapped characterization is normalized as $\hat{t} = (t - \mu) / \sigma$, where $\mu$ is the mean over the feature dimension of the contrast characterization $t$ and $\sigma$ is the standard deviation over the feature dimension of $t$.
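The normalization of the projected characterization is an ordinary standardization over the feature dimension. A minimal sketch, assuming the population standard deviation (division by the number of features) and a zero vector as the fallback for a constant input:

```python
import math
from typing import List, Sequence


def standardize(t: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation of the features."""
    mean = sum(t) / len(t)
    std = math.sqrt(sum((v - mean) ** 2 for v in t) / len(t))
    if std == 0.0:
        return [0.0] * len(t)   # assumption: constant vectors map to zeros
    return [(v - mean) / std for v in t]


t = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
t_hat = standardize(t)
# t_hat == [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this step every contrast characterization has zero mean and unit variance across its features, so cosine distances between samples are not dominated by differences in overall scale or offset.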

$\hat{t}_j^{+}$ is the characterization vector of a positive sample $j$ of patient $i$ screened by the active learning unit, and $\hat{t}_k^{-}$ is the characterization vector of a negative sample $k$ of patient $i$ screened by the active learning unit. $d(\hat{t}_i, \hat{t}_j^{+})$ denotes the cosine distance between sample $i$ and positive sample $j$, and $d(\hat{t}_i, \hat{t}_k^{-})$ denotes the cosine distance between sample $i$ and negative sample $k$. As described for the active learning unit, positive sample $j$ has the same real label as sample $i$, so for the loss, the smaller the cosine distance between positive sample $j$ and sample $i$ the better; likewise, negative sample $k$ has a different real label from sample $i$, so the larger the cosine distance between negative sample $k$ and sample $i$ the better. The loss function of the comparison learning unit is constructed from these two distances accordingly.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A hemodialysis-complicated cardiovascular disease prediction system that incorporates active learning and contrast learning, comprising:

2. The system of claim 1, wherein the structured data comprises demographic data, clinical event data, medication data, and daily monitoring data.

3. The hemodialysis-complicated cardiovascular disease prediction system combining active learning and contrast learning according to claim 1, wherein the hemodialysis data preparation module specifically comprises:

4. The system for predicting hemodialysis-complicated cardiovascular diseases by combining active learning and contrast learning according to claim 3, wherein the data amplification unit comprises:

5. The system for predicting hemodialysis-complicated cardiovascular disease through active learning and contrast learning according to claim 4, wherein the process of obtaining the amplification positive sample in step S2 comprises:

taking a single original fusion feature in a single combined positive sample set as an intervention feature, taking the rest original fusion features in the single combined positive sample set as a fixed feature set, taking a positive sample in the single positive sample set as an amplification object to carry out sample amplification, obtaining a single amplification positive sample, completing the whole amplification process until the amplification times are the difference value between the original negative sample and the original positive sample, and obtaining a final amplification positive sample;

and taking a single original fusion feature in the single combined negative sample set as an intervention feature, taking the rest original fusion features in the single combined negative sample set as a fixed feature set, taking the negative samples in the single negative sample set as amplification objects to carry out sample amplification, obtaining a single amplification negative sample, completing the whole amplification process until the amplification times are the difference between the original negative sample and the original positive sample, and obtaining a final amplification negative sample.

6. The system for predicting hemodialysis-complicated cardiovascular disease fused with active learning and comparative learning according to claim 1, wherein the module for predicting risk of hemodialysis-complicated cardiovascular disease comprises:

a risk evaluation unit: the risk evaluation model is constructed, and the amplification structured data is used as training data of the model to obtain scores and a patient phenotype;

7. The system for predicting hemodialysis-complicated cardiovascular diseases by combining active learning and comparative learning according to claim 6, wherein the risk evaluation unit specifically comprises:

for optimizing a loss function using the score and the truth label.

8. The system for predicting hemodialysis-complicated cardiovascular diseases by combining active learning and contrast learning according to claim 6, wherein the active learning unit comprises:

the positive and negative sample selection rulers are used for respectively calculating included angles from each sample characterization in the amplification structured data to other sample characterizations in the 0-1 space direction;

9. The system for predicting hemodialysis-complicated cardiovascular diseases by combining active learning and comparative learning according to claim 6, wherein the comparative learning unit is specifically used for performing contrast learning with the positive samples and the negative samples, where a positive sample has the same real label as the patient sample and a negative sample has a different real label: the cosine distance between the patient sample and its positive sample and the cosine distance between the patient sample and its negative sample are calculated, the loss function of the comparative learning unit is constructed from these two cosine distances, and the network parameters of the encoder shared with the risk evaluation unit are updated.

10. The hemodialysis-complicated cardiovascular disease prediction system combining active learning and comparative learning according to claim 6, wherein the risk evaluation unit, the active learning unit and the comparative learning unit share the encoder, the encoder is a 5-layer fully-connected network, each layer has 1024 nodes, 512 nodes, 256 nodes, 128 nodes and 64 nodes, respectively, and the activation function is ReLU.