WO2023056918A1 - False negative sample recognition-based physical examination assistant decision-making system

Info

Publication number: WO2023056918A1
Application number: PCT/CN2022/123731
Authority: WO
Inventors: 李劲松; 周天舒; 田雨; 吴承凯
Original assignee: 浙江大学
Priority date: 2021-10-09
Filing date: 2022-10-07
Publication date: 2023-04-13
Also published as: CN113611411A; CN113611411B

Abstract

The present invention discloses a physical examination assistant decision-making system based on false negative sample recognition. The system comprises a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample recognition module, a prediction model construction module, and an assistant decision-making module. The invention simulates the general clinical diagnosis process, analyzes the data-related causes of missed diagnoses, and models that process. The result is more consistent with clinical logic, discovers false negative samples in real-world medical data more effectively, and makes real-world medical data more usable both for building physical examination assistant decision-making models and for clinical assistant decision-making. The invention needs no additional data during modeling or clinical assistant decision-making. It embeds the general clinical decision-making process into the development logic of the model without introducing application-specific medical knowledge, and is therefore highly general.

Description

A physical examination assistant decision-making system based on false negative sample identification

Technical field

The invention belongs to the technical field of medical and health information, and in particular relates to a physical examination assistant decision-making system based on false negative sample identification.

Background

Retrospective clinical research and clinical assistant decision support based on real-world clinical data (typified by electronic medical record data) have become common and important tools in medical informatics research. Compared with prospective randomized controlled trials (RCTs), informatics modeling on retrospective real-world data offers large data volumes, complete clinical scenarios, and patient distributions that closely resemble practice. It is therefore closer to actual diagnosis and treatment scenarios and has better clinical application value.

Physical examination is an important means of discovering potential diseases, and laboratory indicators such as routine blood and urine tests carry a great deal of health status information. However, the current physical examination process can provide suggestive screening for only a small number of diseases. Retrospective modeling based on electronic medical record data can greatly improve the ability of physical examination data to identify diseases outside the scope of current physical examination findings, and increase the health value of a single physical examination.

However, because real-world medical data come from complex sources, their accuracy and completeness are affected by the diagnosis and treatment process under which the data were entered. A typical form of incompleteness is a missing positive label in the real diagnostic labels (i.e., a false negative sample), which strongly affects subsequent prediction modeling and clinical application. Possible reasons for a missing positive label include: 1) during the current visit, other unrelated indicators or diseases drew more of the recording doctor's subjective attention; 2) the registered department or the reason for the visit did not match the target disease; 3) the doctor made omissions when entering the diagnosis.

Because false negative samples are common in real-world data, many studies have addressed the issue. The technical solutions closest to this application are: ① Positive-unlabeled learning (PU learning), which treats samples without a positive label as unlabeled samples that may be either positive or negative. Jinbo Chen et al. [1] eliminated the influence of false negative samples on the overall model by adjusting the sample weights. Based on logistic regression, this technique treats the global proportion of positive samples as an additional unknown parameter and estimates its optimal value on the data set by maximizing a likelihood function that includes the global positive proportion and the weight matrix; the estimate is then used to correct the model's predicted values and obtain the final prediction. ② Representation learning, such as the work of Kavishwar B. Wagholikar et al. [2]: based on a coding set, additional associated data of the samples (such as text data or omics data) are screened so that unlabeled samples with a high probability of being positive are marked as positive, reducing the overall impact of false negative samples on the modeling process.

Existing techniques of type ① correct the final model parameters by adjusting the loss function, the sample weights, and similar elements of the modeling process. When setting these adjustment parameters, such techniques simply assume that the false negative samples in the data set are a random subset of the positive samples. They do not consider the real medical circumstances that produce false negative samples, that is, patients who are actually positive for the target disease but were not diagnosed or whose diagnosis was not entered. In practice, the distribution of false negative samples often differs greatly from a random one. The random assumption is inconsistent with how false negative samples actually arise, which degrades clinical prediction performance.

Existing techniques of type ② supplement the positive samples through representation learning. However, representation learning often requires building a disease-specific terminology set, which demands substantial medical knowledge and hinders general use of the technique. This solution also needs a large amount of additional medical data to discover false negative samples. For single-visit patient cases, which make up the majority of real-world data, sufficient additional data are lacking, so representation learning-based methods cannot address the false negatives.

[1] Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients [J]. Journal of the American Medical Informatics Association, 2020, 27(1): 119-126.

[2] Wagholikar K B, Estiri H, Murphy M, et al. Polar labeling: silver standard algorithm for training disease classifiers [J]. Bioinformatics, 2020, 36(10): 3200-3206.

[3] Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework [J]. Journal of the American Medical Informatics Association, 2016, 23(4): 731-740.

Summary of the invention

The present invention builds on the basic setting of PU learning. By analyzing how false negative samples commonly arise in real-world medical data, it replaces the data-set-level assumption of the prior art, that false negative samples are randomly distributed, with a feature-level assumption: the feature dimensions of physical examination data can be split into two types, direct correlation dimensions and competition dimensions, which behave differently in the data. This resolves the inconsistency between the assumptions of PU learning and the distribution of real-world medical data, improves the ability to use real-world data, and thereby improves the accuracy and scope of assistant decision-making for potential diseases based on physical examination data. The invention determines adaptively, in a data-driven manner and for each clinical feature dimension, how the data influence clinical disease diagnosis and the entry of physical examination results. It is general across different target physical examination results and does not depend on a prior medical knowledge system. It is applicable to diseases that can be preliminarily diagnosed from basic physiological indicators and routine laboratory indicators, and is therefore especially suitable for large-scale physical examination scenarios. The identification of false negative samples does not depend on an additional representation mining process, so the analysis is not affected when the medical data lack additional associated data.

The object of the present invention is achieved by the following technical solution: a physical examination assistant decision-making system based on false negative sample identification, comprising the following modules:

Data acquisition module: obtains a real-world physical examination data set and arranges it as an original data set comprising an input feature matrix and real diagnostic labels; samples with negative physical examination results are treated as unlabeled samples;

Data preprocessing module: forms a standardized data set by unifying the mean and standard deviation of each feature component in the original data set; separates the positive and negative semi-axis components of each feature component in the standardized data set and adds the corresponding trainable upper or lower limit to each semi-axis component, forming an extended data set;

Basic feature analysis module: uses a logistic regression model, treating unlabeled samples as negative samples, to train the feature weight with which each feature dimension contributes to generating the real diagnostic label when false negative samples are not considered;

False negative sample identification module: divides the feature dimensions into two categories, direct correlation dimensions and competition dimensions. From a medical point of view, a direct correlation dimension directly affects the judgment of the target physical examination result; a competition dimension does not, but it competes with the target physical examination result for attention, so that the target result goes unrecorded and a false negative sample arises. The module constructs two logistic regression models and a joint loss function for joint training. The joint loss function screens the unlabeled samples for suspected true negative and suspected false negative samples, so that the direct correlation dimensions distinguish the positive samples from the screened suspected true negative samples as well as possible, and the competition dimensions distinguish the positive samples from the screened suspected false negative samples as well as possible. A false negative index indicates the possibility that a sample is a false negative sample;

Prediction model construction module: builds a multi-layer neural network, introduces a loss function containing the false negative index, and trains a physical examination assistant decision-making model on the standardized data set and the false negative index;

Assistant decision-making module: obtains a standardized feature vector from the examinee's physical examination data through the data preprocessing module, obtains a prediction result through the physical examination assistant decision-making model, and outputs it to the clinician as the assistant decision-making result of the physical examination.

Further, in the data acquisition module, the feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators. The basic physiological indicators include height, weight, BMI, systolic blood pressure, and diastolic blood pressure; the routine laboratory indicators include routine blood tests and routine urine tests. The real diagnostic label is the physical examination result.

Further, in the data acquisition module, the physical examination data set is arranged as an original data set $(X, y)$, where

$$X=[x_1,x_2,\ldots,x_n]^T=[f_1,f_2,\ldots,f_p]\in\mathbb{R}^{n\times p}$$

is the input feature matrix, $n$ is the sample size, $p$ is the total number of physical examination indicators, $x_1$ to $x_n$ are the samples, $f_1$ to $f_p$ are the feature components of the original data set on each feature dimension, and $T$ denotes transposition; $y=[y_1,y_2,\ldots,y_n]\in\{0,1\}^n$ is the real diagnostic label of the $n$ samples, where $y_i=1$ means the $i$-th sample is a positive sample and $y_i=0$ means the $i$-th sample is either a true negative or a false negative sample and is treated as an unlabeled sample. The positive sample set is denoted $S_P$, the unlabeled sample set $S_N$, the true negative sample set $S_{TN}$, and the false negative sample set $S_{FN}$, with $S_{TN}\cup S_{FN}=S_N$ and $S_{TN}\cap S_{FN}=\varnothing$. The sample composition of $S_P$ and $S_N$ is known; that of $S_{TN}$ and $S_{FN}$ is unknown.

Further, in the data preprocessing module, each feature component in $X$ is standardized so that, over all physical examination data, each feature component has standard deviation 1 and mean 0. The standardized feature matrix is denoted $X'=[x'_1,x'_2,\ldots,x'_n]^T=[f'_1,f'_2,\ldots,f'_p]$, where $x'_i\in\mathbb{R}^p$ is the $i$-th standardized sample and $f'_j$ is the $j$-th feature component after standardization; $X'$ and $y$ form the standardized data set $(X', y)$;

$X'$ is expanded to form a trainable feature matrix $X''$:

$$X''=[x''_1,x''_2,\ldots,x''_n]^T=[f'_{11},f'_{12},f'_{21},f'_{22},\ldots,f'_{p1},f'_{p2}]+t=[f''_{11},f''_{12},\ldots,f''_{p1},f''_{p2}]$$

where $x''_i\in\mathbb{R}^{2p}$ is the $i$-th sample after data expansion; $f'_{j1}=\max(f'_j,0)$ and $f'_{j2}=\min(f'_j,0)$ are the positive and negative semi-axis components of $f'_j$, respectively; and $t=[t_{11},t_{12},t_{21},t_{22},\ldots,t_{p1},t_{p2}]\in\mathbb{R}^{2p}$ is the offset vector formed by the trainable upper and lower limits on each component. The addition is performed by broadcasting. The trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.

Further, in the basic feature analysis module, unlabeled samples are treated as negative samples, and a logistic regression model $M_0$ with loss function $L_0(w,t,b)$ is constructed on the extended data set $(X'', y)$. Its output probability is

$$p_0(x''_i)=\mathrm{sig}(w^T x''_i+b)$$

where $w\in\mathbb{R}^{2p}$ is a trainable feature weight vector and $b$ is a trainable intercept; $\mathrm{sig}(\cdot)$ is the sigmoid function; $w^T x''_i+b$ is the decision function, whose value is the decision value; and $p_0(x''_i)$ is the output probability of $M_0$, obtained by normalizing the decision value with the sigmoid function.
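The explicit expression for $L_0$ is not shown above. As a hedged sketch only, assuming $M_0$ uses the standard binary cross-entropy of logistic regression (an assumption; the patent may apply a different normalization or class weighting), the loss would take the form:

```latex
L_0(w, t, b) = -\sum_{i=1}^{n} \Big[ y_i \log p_0(x''_i) + (1 - y_i) \log\big(1 - p_0(x''_i)\big) \Big]
```

Under this assumption, $t$ enters the loss through $x''_i$, because each expanded sample already includes the trainable offset.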

Further, the false negative sample identification module includes:

Take the feature weight vector $w$ trained in the basic feature analysis module, and set trainable non-negative matrices $A_D, A_F\in[0,1]^{2p\times 2p}$ whose sum is the identity matrix, $E=A_D+A_F$;

Construct two logistic regression models $M_D$ and $M_F$ with feature weight coefficients $w_D=w^T A_D$ and $w_F=w^T A_F$ and trainable intercepts $b_D$ and $b_F$, respectively. The output probabilities of the two models after sigmoid normalization are:

$$p_D(x''_i)=\mathrm{sig}(w^T A_D x''_i+b_D)$$

$$p_F(x''_i)=\mathrm{sig}(w^T A_F x''_i+b_F)$$

where $p_D(x''_i)$ is the direct probability and $p_F(x''_i)$ is the attention probability;

Using the extended data set $(X'', y)$, minimize the joint loss function $L_1(A_D,b_D,b_F)$ to obtain the optimal parameters. The joint loss function involves a sample category weight, a screening coefficient $\gamma$, and a term that does not participate in gradient backpropagation during model training;

For a sample $x''_i$ in the unlabeled sample set, the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ are obtained from the models $M_D$ and $M_F$, respectively, and the false negative index $r_i=p_D(x''_i)\cdot(1-p_F(x''_i))$ indicates the probability that $x''_i$ is a false negative.
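The false negative index can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's implementation: the matrix A_D is stored as a vector `a_d` of its diagonal entries (non-negative matrices that sum to the identity must have zero off-diagonal entries), A_F is taken as `1 - a_d`, and the helper name and all numeric values are hypothetical.

```python
import math

def sig(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def false_negative_index(x, w, a_d, b_d, b_f):
    """Return (p_D, p_F, r) for one expanded sample x''_i.

    a_d holds the diagonal of A_D; the diagonal of A_F is 1 - a_d.
    """
    z_d = sum(wj * aj * xj for wj, aj, xj in zip(w, a_d, x)) + b_d
    z_f = sum(wj * (1.0 - aj) * xj for wj, aj, xj in zip(w, a_d, x)) + b_f
    p_d, p_f = sig(z_d), sig(z_f)
    return p_d, p_f, p_d * (1.0 - p_f)

# Hypothetical values: feature 0 is treated as direct (a_d = 1),
# feature 1 as competing (a_d = 0).
w = [2.0, 3.0]
a_d = [1.0, 0.0]
x = [1.5, -1.0]
p_d, p_f, r = false_negative_index(x, w, a_d, b_d=0.0, b_f=0.0)
```

With these made-up numbers the sample resembles a positive on the direct features but not on the attention features, so `r` comes out close to 1.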

Further, in the false negative sample identification module, for the logistic regression model $M_F$, the multiplication term in the joint loss function screens the unlabeled samples whose output probability $p_0(x''_i)$ predicted by $M_0$ is close to 1. The samples in this screened unlabeled set (the suspected false negative samples) differ from the positive sample set $S_P$ in the features of the competition dimension class F, and should show no significant difference in the features of the direct correlation dimension class D. By training $M_F$ with $S_P$ as the positive class and the screened set as the negative class, the model identifies the features that belong to the competition dimension class F. Training optimizes $A_F$ and $b_F$ simultaneously to best distinguish $S_P$ from the screened set, so that the attention probability $p_F(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i\in S_P$.

Further, in the false negative sample identification module, for the logistic regression model $M_D$, the multiplication term in the joint loss function screens the unlabeled samples whose attention probability $p_F(x''_i)$ predicted by $M_F$ is close to 1. The samples in this screened unlabeled set (the suspected true negative samples) differ from the positive sample set $S_P$ in the features of the direct correlation dimension class D, and should show no obvious difference in the features of the competition dimension class F. By training $M_D$ with $S_P$ as the positive class and the screened set as the negative class, the model identifies the features that belong to the direct correlation dimension class D. Training optimizes $A_D$ and $b_D$ simultaneously to best distinguish $S_P$ from the screened set, so that the direct probability $p_D(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i\in S_P$.

Further, in the prediction model construction module, based on the standardized data set $(X', y)$ and the false negative index $r=[r_1,\ldots,r_n]\in(0,1)^n$ of the samples, a multi-layer neural network $M_{net}$ is constructed with $p$ input-layer nodes, one output-layer node, a sigmoid activation function at the output layer, and $w_{net}$ as the set of transfer matrices between layers. The output obtained by passing a sample $x'_i\in X'$ through $M_{net}$ is the prediction result for that sample.

The optimal parameters of $M_{net}$ are obtained by minimizing the loss function $L_2(w_{net})$, which introduces the false negative index;

$M_{net}$ is then the constructed physical examination assistant decision-making model, optimized with the false negative index.

Further, in the assistant decision-making module, the $p$ physical examination indicators obtained from a single examinee's physical examination, corresponding to the feature dimensions, are passed through the data preprocessing module to obtain the standardized feature vector $x'_u$, which is input into the physical examination assistant decision-making model built by the prediction model construction module to output a prediction result. When the prediction result tends to 1, the physical examination result tends to be positive; when it tends to 0, the physical examination result tends to be negative. The prediction result is provided to the clinician as the assistant decision-making result of the physical examination.
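The decision step can be sketched as a forward pass. This is a minimal illustration under assumptions that the text does not fix: hidden layers use ReLU, weights are supplied as plain lists, and the function name, layer sizes, and all numbers are hypothetical. Only the single sigmoid output node and the standardization with the training-set mean and standard deviation follow the description.

```python
import math

def sig(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(raw, means, stds, layers):
    """Forward pass for one examinee.

    raw: the p raw indicator values.
    means, stds: per-indicator statistics saved from the training data.
    layers: list of (W, b) pairs, with W given row by row.
    Hidden layers use ReLU (an assumption); the output layer uses sigmoid.
    """
    a = [(v - m) / s for v, m, s in zip(raw, means, stds)]  # standardized x'_u
    for k, (W, b) in enumerate(layers):
        z = [sum(wij * aj for wij, aj in zip(row, a)) + bi
             for row, bi in zip(W, b)]
        is_output = k == len(layers) - 1
        a = [sig(v) for v in z] if is_output else [max(v, 0.0) for v in z]
    return a[0]

# Hypothetical two-indicator model with one hidden layer of two nodes.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # hidden layer
    ([[2.0, 2.0]], [-1.0]),                  # output layer, single node
]
y_hat = predict([7.9, 150.0], means=[5.5, 120.0], stds=[1.2, 15.0], layers=layers)
```

A value of `y_hat` near 1 points toward a positive physical examination result and a value near 0 toward a negative one.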

The beneficial effects of the present invention are:

1. Existing positive-unlabeled learning techniques treat missing clinical diagnoses as random events. The present invention simulates the general clinical diagnosis process, analyzes the data-related causes of missed diagnoses, and models that process. This is more consistent with clinical logic, discovers false negative samples in real-world medical data more effectively, and makes real-world medical data more usable for building physical examination assistant decision-making models and for clinical assistant decision-making.

2. Existing representation learning techniques need a large amount of additional data and a degree of medical expertise to support the representation mining process, so they generalize poorly. The present invention needs no additional data during modeling or clinical assistant decision-making. It embeds the general clinical decision-making process into the development logic of the model without introducing application-specific medical knowledge, and is therefore highly general.

Description of drawings

Fig. 1 is a structural diagram of a medical examination assistant decision-making system based on false negative sample identification provided by an embodiment of the present invention;

Fig. 2 is a flowchart of false negative sample identification provided by an embodiment of the present invention;

Fig. 3 is a flowchart of constructing the physical examination assistant decision-making model optimized with the false negative index, provided by an embodiment of the present invention.

Detailed description

In order to make the above objects, features and advantages of the present invention more comprehensible, specific implementations of the present invention will be described in detail below in conjunction with the accompanying drawings.

An embodiment of the present invention provides a physical examination assistant decision-making system based on false negative sample identification. As shown in Fig. 1, the system includes a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module, and an assistant decision-making module. The implementation of each module is described in detail below.

1. Data acquisition module: obtains a real-world physical examination data set and arranges it as an original data set comprising an input feature matrix and real diagnostic labels; samples with negative physical examination results are treated as unlabeled samples.

Specifically, the data acquisition module acquires a real-world physical examination data set stored in a .csv file, containing the feature dimensions and the real diagnostic labels. The feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators. The basic physiological indicators are height, weight, BMI, systolic blood pressure, and diastolic blood pressure. The routine laboratory indicators are routine blood tests (total protein, albumin, globulin, albumin/globulin ratio, alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenylate deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high-density lipoprotein-C, low-density lipoprotein-C, very low-density lipoprotein-C, fasting blood glucose, potassium, sodium, chloride, total calcium, inorganic phosphorus, glycyl-proline dipeptidyl aminopeptidase, α-fucosidase) and routine urine tests (urine protein, urine ketone bodies, urine glucose, urine bilirubin, urine sediment white blood cells, urine sediment red blood cells, urobilinogen, uric acid). The real diagnostic label is the physical examination result, for example a diabetes diagnosis.

The physical examination data set is arranged as an original data set $(X, y)$, where

$$X=[x_1,x_2,\ldots,x_n]^T=[f_1,f_2,\ldots,f_p]\in\mathbb{R}^{n\times p}$$

is the input feature matrix; $n$ is the sample size and $p$ is the total number of physical examination indicators, with $n=25000$ and $p=45$ in the example; $x_1$ to $x_n$ are the samples, expressed as feature vectors; $f_1$ to $f_p$ are the feature components of the original data set on each feature dimension; and $T$ denotes transposition. $y=[y_1,y_2,\ldots,y_n]\in\{0,1\}^n$ is the real diagnostic label of the $n$ samples, that is, the target label. $y_i=1$ means that the physical examination result of the $i$-th sample is positive, so the sample is a positive sample; $y_i=0$ means that the physical examination result of the $i$-th sample is negative, so the sample may be a true negative or a false negative sample, and such samples are treated as unlabeled samples. The positive sample set, containing all samples with $y_i=1$, is denoted $S_P$; the unlabeled sample set, containing all samples with $y_i=0$, is denoted $S_N$; the true negative sample set is denoted $S_{TN}$ and the false negative sample set $S_{FN}$, with $S_{TN}\cup S_{FN}=S_N$ and $S_{TN}\cap S_{FN}=\varnothing$.
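A minimal sketch of this matrixing step, assuming a CSV layout with one column per indicator and a final 0/1 label column. The column names, the values, and the helper `load_dataset` are hypothetical; the patent's example uses n = 25000 samples and p = 45 indicators.

```python
import csv
import io

# Hypothetical three-indicator extract standing in for the .csv file.
RAW_CSV = """height,weight,fasting_glucose,label
170,65,5.1,0
165,80,7.9,1
180,90,6.8,0
158,52,4.7,0
"""

def load_dataset(text):
    """Return (X, y, feature_names) from CSV text whose last column is the label."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    X = [[float(v) for v in row[:-1]] for row in body]  # n x p feature matrix
    y = [int(row[-1]) for row in body]                  # real diagnostic labels
    return X, y, header[:-1]

X, y, names = load_dataset(RAW_CSV)

# Index sets: S_P is known to be positive; S_N is unlabeled, because each
# member may be a true negative or a false negative.
S_P = [i for i, label in enumerate(y) if label == 1]
S_N = [i for i, label in enumerate(y) if label == 0]
```

Only `S_P` and `S_N` can be built from the labels; the split of `S_N` into true and false negatives is what the later modules estimate.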

2. Data preprocessing module: forms a standardized data set by unifying the mean and standard deviation of each feature component in the original data set; separates the positive and negative semi-axis components of each feature component in the standardized data set and adds the corresponding trainable upper or lower limit to each semi-axis component, forming an extended data set. This includes:

Each feature component $f_j$ in $X$ is standardized by a component-specific transformation $\varphi_j$, so that over all physical examination data the component has standard deviation 1 and mean 0:

$$f'_j=\varphi_j(f_j)=\frac{f_j-\lambda_j}{\sigma_j}$$

where $f'_j$ is the $j$-th feature component after standardization, $\lambda_j$ is the mean of the $n$ samples on component $f_j$, and $\sigma_j$ is the standard deviation of the $n$ samples on component $f_j$. The standardized feature matrix is denoted $X'=[x'_1,x'_2,\ldots,x'_n]^T=[f'_1,f'_2,\ldots,f'_p]$, where $x'_i\in\mathbb{R}^p$ is the $i$-th standardized sample in feature-vector form; $X'$ and $y$ form the standardized data set $(X', y)$.

In actual use, physical examination indicators usually inform decisions in the form "above the upper limit of normal" or "below the lower limit of normal", and the physical examination results suggested by these two kinds of information are often not exact opposites. The data preprocessing of the present invention therefore treats positive and negative values separately and adds a trainable offset vector, so that the feature matrix of the extended data set is close to clinical usage. Specifically, the positive and negative semi-axis components of each feature component $f'_j$ of $X'$ are separated to model the difference between the two kinds of information, and the offset vector $t$ is added to model the normal upper and lower limits of the indicators.

On this basis, $X'$ is expanded to form a trainable feature matrix $X''$:

$$X''=[x''_1,x''_2,\ldots,x''_n]^T=[f'_{11},f'_{12},f'_{21},f'_{22},\ldots,f'_{p1},f'_{p2}]+t=[f''_{11},f''_{12},\ldots,f''_{p1},f''_{p2}]$$

where $f'_{j1}=\max(f'_j,0)$ and $f'_{j2}=\min(f'_j,0)$ are the positive and negative semi-axis components of $f'_j$, respectively, and $t=[t_{11},t_{12},t_{21},t_{22},\ldots,t_{p1},t_{p2}]\in\mathbb{R}^{2p}$ is the offset vector composed of the trainable upper and lower limits on each component. The addition is performed by broadcasting.

The trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.
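The standardization and expansion described above can be sketched as follows. This is an illustrative sketch, not the patent's code: it assumes the population standard deviation, assumes no indicator is constant across samples, and takes the offset vector `t` as a plain list ordered as [t_11, t_12, ..., t_p1, t_p2]. The function names are hypothetical.

```python
import statistics

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1.

    Returns the standardized rows plus the per-column means and standard
    deviations, which are needed again when a new examinee is scored.
    """
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]  # population std (assumption)
    Xs = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]
    return Xs, means, stds

def expand(Xs, t):
    """Split each feature into max(f, 0) and min(f, 0), then add offset t.

    Adding t to every row mirrors the broadcast addition in the text.
    """
    out = []
    for row in Xs:
        halves = []
        for v in row:
            halves.extend([max(v, 0.0), min(v, 0.0)])
        out.append([h + tj for h, tj in zip(halves, t)])
    return out

X_raw = [[1.0, 10.0], [3.0, 30.0]]
X_std, means, stds = standardize(X_raw)
X_ext = expand(X_std, t=[0.0, 0.0, 0.0, 0.0])  # p = 2, so 2p = 4 columns
```

Each indicator contributes two columns, so a deviation above the mean and a deviation below it can receive different weights in the later logistic regression models.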

Of the preprocessed data sets above, the extended data set $(X'', y)$ is used by the basic feature analysis module and the false negative sample identification module, and the standardized data set $(X', y)$ is used by the prediction model construction module and the assistant decision-making module.

3. Basic feature analysis module: uses a logistic regression model, treating unlabeled samples as negative samples, to train the feature weight with which each feature dimension contributes to generating the real diagnostic label when false negative samples are not considered. This includes:

Treat all unlabeled samples as negative samples and build a logistic regression model $M_0$ with loss function $L_0(w,t,b)$ on the preprocessed extended data set $(X'', y)$. Its output probability is

$$p_0(x''_i)=\mathrm{sig}(w^T x''_i+b)$$

where $w\in\mathbb{R}^{2p}$ is a trainable feature weight vector; $b$ is a trainable intercept; $x''_i\in\mathbb{R}^{2p}$ is the $i$-th sample after data expansion, in feature-vector form; $y_i$ is the real diagnostic label of the $i$-th sample; $\mathrm{sig}(\cdot)$ is the sigmoid function; $w^T x''_i+b$ is the decision function, whose value is the decision value; and $p_0(x''_i)$ is the output probability of $M_0$ obtained by normalizing the decision value with the sigmoid function, that is, the probability predicted by $M_0$ that sample $x''_i$ is positive. In the example, mini-batch gradient descent is used for model training, with 500 samples per batch.
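The training of the base model can be sketched as below. This is a minimal illustration, assuming a binary cross-entropy loss (the loss expression is not shown in this text) and plain mini-batch gradient descent; the learning rate, epoch count, and function names are hypothetical. The batch size of 500 follows the example. The offset `t` is trained together with `w` and `b`, as the signature $L_0(w,t,b)$ suggests.

```python
import math
import random

def sig(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_m0(H, y, epochs=200, batch_size=500, lr=0.1, seed=0):
    """Fit w, t, b by mini-batch gradient descent.

    H: half-axis features (n x 2p) before the offset is added.
    y: labels, with unlabeled samples already coded as 0.
    """
    rng = random.Random(seed)
    n, d = len(H), len(H[0])
    w, t, b = [0.0] * d, [0.0] * d, 0.0
    idx = list(range(n))
    for _ in range(epochs):
        rng.shuffle(idx)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            gw, gt, gb = [0.0] * d, [0.0] * d, 0.0
            for i in batch:
                x = [h + tj for h, tj in zip(H[i], t)]  # x''_i = half-axis + offset
                err = sig(sum(wj * xj for wj, xj in zip(w, x)) + b) - y[i]
                for j in range(d):
                    gw[j] += err * x[j]   # gradient with respect to w_j
                    gt[j] += err * w[j]   # gradient with respect to t_j
                gb += err
            m = len(batch)
            w = [wj - lr * g / m for wj, g in zip(w, gw)]
            t = [tj - lr * g / m for tj, g in zip(t, gt)]
            b -= lr * gb / m
    return w, t, b

# Tiny made-up data set: positives deviate upward on feature 0,
# unlabeled samples deviate downward on feature 1.
H = [[1.0, 0.0], [2.0, 0.0], [0.0, -1.0], [0.0, -2.0]]
y = [1, 1, 0, 0]
w, t, b = train_m0(H, y)
p0 = [sig(sum(wj * (h + tj) for wj, h, tj in zip(w, row, t)) + b) for row in H]
```

The trained `w` and `t` are the quantities that the false negative sample identification module reuses.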

4. False negative sample identification module: divides the feature dimensions into two categories, direct correlation dimensions and competition dimensions. From a medical point of view, a direct correlation dimension directly affects the judgment of the target physical examination result; a competition dimension does not, but it competes with the target physical examination result for attention, so that the target result goes unrecorded and a false negative sample arises. The module constructs two logistic regression models and a joint loss function for joint training. The joint loss function screens the unlabeled samples for suspected true negative and suspected false negative samples, so that the direct correlation dimensions distinguish the positive samples from the screened suspected true negative samples as well as possible, and the competition dimensions distinguish the positive samples from the screened suspected false negative samples as well as possible. A false negative index indicates the possibility that a sample is a false negative sample. This includes:

Based on how physical examination results are generated in clinical practice, the feature dimensions are divided into two categories: direct correlation dimensions D and competition dimensions F. They are defined as follows: from a medical point of view, features in class D directly affect the judgment of the target physical examination result; features in class F do not, but they compete with the target physical examination result for attention, which may cause the target result to go unrecorded and produce a false negative sample. Logically, the feature weight vector $w$ arises from the joint action of these two classes of features. The core idea of the false negative sample identification module is to identify the class D and class F features by data induction, and thereby to evaluate how likely an unlabeled sample is to be a false negative.

Take the feature weight vector $w$ trained in the basic feature analysis module, and set trainable non-negative matrices $A_D, A_F\in[0,1]^{2p\times 2p}$ whose sum is the identity matrix, $E=A_D+A_F$; then:

$$p_0(x''_i)=\mathrm{sig}(w^T x''_i+b)=\mathrm{sig}(w^T E x''_i+b)=\mathrm{sig}(w^T(A_D+A_F)x''_i+b)$$

Here the decision value contributed by the class D features is $w^T A_D x''_i$, which should distinguish the positive sample set $S_P$ from the true negative sample set $S_{TN}$ as well as possible; the decision value contributed by the class F features is $w^T A_F x''_i$, which should distinguish the positive sample set $S_P$ from the false negative sample set $S_{FN}$ as well as possible.
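The split of the decision value can be checked numerically. Because $A_D$ and $A_F$ are non-negative and sum to the identity matrix, their off-diagonal entries must all be zero, so the sketch below stores $A_D$ as a vector `a_d` of diagonal entries in [0, 1] and uses `1 - a_d` for $A_F$. The function name and the numbers are hypothetical.

```python
def split_decision(w, x, a_d):
    """Return the class D and class F parts of w^T x.

    a_d is the diagonal of A_D; A_F = E - A_D has diagonal 1 - a_d.
    """
    d_part = sum(wj * aj * xj for wj, aj, xj in zip(w, a_d, x))
    f_part = sum(wj * (1.0 - aj) * xj for wj, aj, xj in zip(w, a_d, x))
    return d_part, f_part

w = [1.0, 2.0]
x = [3.0, 4.0]
a_d = [1.0, 0.25]  # feature 0 fully direct, feature 1 mostly competing
d_part, f_part = split_decision(w, x, a_d)
total = sum(wj * xj for wj, xj in zip(w, x))  # w^T x, the M_0 decision value without b
```

The two parts always add up to the original decision value, which is why training $A_D$ redistributes the evidence learned by $M_0$ rather than changing it.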

Based on the above understanding, the false negative sample identification module completes the following steps:

Construct two logistic regression models $M_D$ and $M_F$, with feature weight coefficients $w_D = w^T A_D$ and $w_F = w^T A_F$ and trainable intercept values $b_D$ and $b_F$, respectively. The output probabilities of the two models, normalized by the sigmoid function, are:

$$p_D(x''_i) = \mathrm{sig}(w^T A_D x''_i + b_D)$$

$$p_F(x''_i) = \mathrm{sig}(w^T A_F x''_i + b_F)$$

$p_D(x''_i)$ is called the direct probability, and $p_F(x''_i)$ the attention probability.
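The decomposition of the decision value and the two output probabilities can be illustrated with a minimal NumPy sketch. It is not the implementation of the invention: the function names, the toy data, and the choice of a diagonal $A_D$ are assumptions made only for illustration, and no training is performed. The feature expansion follows the definitions $f'_{j1} = \max(f'_j, 0)$, $f'_{j2} = \min(f'_j, 0)$ and the offset vector $t$ used by the data preprocessing module.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sig(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def expand_features(X_std, t):
    """Build X'' from standardized features: interleave max(f', 0) and
    min(f', 0) for every feature, then add the offset vector t (length 2p)."""
    pos = np.maximum(X_std, 0.0)
    neg = np.minimum(X_std, 0.0)
    interleaved = np.stack([pos, neg], axis=-1).reshape(X_std.shape[0], -1)
    return interleaved + t  # t is broadcast over the samples

def split_decision_values(X_ext, w, A_D):
    """Return (w^T A_D x'', w^T A_F x'') for every row, with A_F = E - A_D."""
    A_F = np.eye(A_D.shape[0]) - A_D
    return X_ext @ (A_D.T @ w), X_ext @ (A_F.T @ w)

def direct_and_attention_probabilities(X_ext, w, A_D, b_D, b_F):
    """Direct probability p_D and attention probability p_F for every row."""
    d_D, d_F = split_decision_values(X_ext, w, A_D)
    return sigmoid(d_D + b_D), sigmoid(d_F + b_F)

# Toy data: n = 4 samples, p = 3 features (illustrative values only).
rng = np.random.default_rng(0)
X_std = rng.normal(size=(4, 3))
t = np.zeros(6)                                 # untrained offsets
w = rng.normal(size=6)                          # stands in for the trained w
A_D = np.diag(rng.uniform(0.0, 1.0, size=6))    # assumption: diagonal A_D
X_ext = expand_features(X_std, t)
d_D, d_F = split_decision_values(X_ext, w, A_D)
p_D, p_F = direct_and_attention_probabilities(X_ext, w, A_D, b_D=0.0, b_F=0.0)
```

Because $A_D + A_F = E$, the two partial decision values `d_D` and `d_F` always add up to $w^T x''_i$, whatever values $A_D$ takes.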

Under the optimal feature classification, $M_D$ should distinguish the positive sample set $S_P$ from the true negative sample set $S_{TN}$ to the greatest extent, and $M_F$ should distinguish the positive sample set $S_P$ from the false negative sample set $S_{FN}$ to the greatest extent. The trainable parameters therefore include $A_D$, $A_F = E - A_D$, $b_D$ and $b_F$; their optimal values are obtained by minimizing the joint loss function $L_1(A_D, b_D, b_F)$ on the extended data set $(X'', y)$. The offset vector $t$ in the extended data set $(X'', y)$ takes the optimized value obtained when $M_0$ was trained in the basic feature analysis module and is not trained further.

In the joint loss function (formula not reproduced in this text):

The sample category weight adjusts the proportion of the different sample categories during training (its symbol and the value used in the example are not reproduced in this text).

$\gamma$ is the screening coefficient. The larger $\gamma$ is, the more strongly every part of the joint loss function screens the unlabeled samples it treats as false negative or true negative samples, but the less diverse the screened samples become. In the example, $\gamma = 2$ is used.

One quantity in the joint loss function (its formula is not reproduced in this text) does not participate in gradient backpropagation during model training. In the example, the mini-batch gradient descent method is used for joint training of the models $M_D$ and $M_F$, with 500 samples per batch.
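The trade-off controlled by the screening coefficient can be made concrete with a small sketch. The formula of the screening term is not reproduced in this text, so the power form `p ** gamma` below is purely an assumption chosen for illustration, as are the function names and the use of the Kish effective sample size as a stand-in for "diversity of the screened samples". The batching helper only illustrates mini-batches of 500 samples.

```python
import numpy as np

def screening_weights(p, gamma):
    """Hypothetical screening term p**gamma (assumed form, not from the source).
    It is used as a constant weight, i.e. no gradient flows through it."""
    return np.asarray(p, dtype=float) ** gamma

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def iterate_minibatches(n_samples, batch_size=500, seed=0):
    """Yield shuffled index arrays of at most batch_size samples each."""
    order = np.random.default_rng(seed).permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

# Three unlabeled samples with different predicted probabilities.
p = np.array([0.9, 0.5, 0.1])
ess = [effective_sample_size(screening_weights(p, g)) for g in (0, 1, 2)]
batch_sizes = [len(batch) for batch in iterate_minibatches(1200)]
```

Under this assumed form, a larger `gamma` concentrates the weight on the samples with the highest probability (stronger screening), and the effective sample size shrinks accordingly (less diversity).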

The construction logic of the joint loss function $L_1(A_D, b_D, b_F)$ is as follows:

(1) For the model $M_F$, a multiplicative term screens the unlabeled samples with a higher output probability $p_0(x''_i)$ predicted by $M_0$; the set of these samples is the screened set of suspected false negative samples. Relative to the overall unlabeled sample set $S_N$, the screened set contains a higher proportion of false negative samples.

The screened set differs from the positive sample set $S_P$ in the features of the competition dimension class F, but should show no significant difference in the features of the direct correlation dimension class D. Therefore the model $M_F$, trained with $S_P$ as the positive class and the screened set as the negative class, identifies the features that belong to class F. The training process optimizes $A_F$ and $b_F$ simultaneously to obtain the optimal distinction between $S_P$ and the screened set, such that the attention probability $p_F(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i \in S_P$.

(2) For the model $M_D$, a multiplicative term screens the unlabeled samples with a higher attention probability $p_F(x''_i)$ predicted by $M_F$; the set of these samples is the screened set of suspected true negative samples. Relative to the overall unlabeled sample set $S_N$, the screened set contains a higher proportion of true negative samples.

The screened set differs from the positive sample set $S_P$ in the features of the direct correlation dimension class D, but should show no significant difference in the features of the competition dimension class F. Therefore the model $M_D$, trained with $S_P$ as the positive class and the screened set as the negative class, identifies the features that belong to class D. The training process optimizes $A_D$ and $b_D$ simultaneously to obtain the optimal distinction between $S_P$ and the screened set, such that the direct probability $p_D(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i \in S_P$.

(3) Because of the constraint $A_D + A_F = E$ during model training, the parameters of the two models cannot be optimized separately; the joint loss function $L_1(A_D, b_D, b_F)$ is used, and all parameters are optimized through joint training of the models $M_D$ and $M_F$.

After the optimal parameters are obtained, for each sample $x''_i \in S_N$ the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ are obtained from the models $M_D$ and $M_F$, respectively. If $x''_i$ is a false negative sample, $p_D(x''_i)$ should tend to 1 and $p_F(x''_i)$ should tend to 0, so the false negative index $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ indicates the likelihood that sample $x''_i$ is a false negative.

The flow chart of false negative sample identification is shown in Figure 2.
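The false negative index $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ can be computed directly from the two probabilities. The following sketch uses made-up probabilities; the function name and the ranking step are assumptions added for illustration.

```python
import numpy as np

def false_negative_index(p_D, p_F):
    """r_i = p_D(x_i) * (1 - p_F(x_i)) for every unlabeled sample."""
    p_D = np.asarray(p_D, dtype=float)
    p_F = np.asarray(p_F, dtype=float)
    return p_D * (1.0 - p_F)

# Illustrative probabilities for three unlabeled samples.
p_D = np.array([0.9, 0.9, 0.1])   # direct probability
p_F = np.array([0.1, 0.9, 0.1])   # attention probability
r = false_negative_index(p_D, p_F)

# Samples ordered from most to least likely false negative.
ranking = np.argsort(-r)
```

Only the first sample combines a high direct probability with a low attention probability, so it receives the largest index (0.81, against 0.09 for the other two).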

5. Prediction model building module: a multi-layer neural network is built with a loss function that incorporates the false negative index, and the medical examination assistant decision-making model is trained on the standardized data set and the false negative index. The module includes:

Based on the standardized data set $(X', y)$ and the false negative index $r = [r_1, \ldots, r_n] \in (0,1)^n$ of each sample, construct a multi-layer neural network $M_{net}$ in which the input layer has $p$ nodes, the output layer has 1 node, the activation function of the output layer is the sigmoid function, and the set of transfer matrices between layers is $W_{net}$. The output obtained for a sample $x'_i \in X'$ after the operation of $M_{net}$, and the vector composed of all outputs, are defined by formulas that are not reproduced in this text. The optimal parameters of $M_{net}$ are obtained by minimizing the loss function $L_2(W_{net})$, which incorporates the false negative index (formula not reproduced in this text).

The optimized $M_{net}$ is the medical examination assistant decision-making model constructed with the false negative index. Refer to Figure 3 for the construction process of this model.

In the example, a three-layer neural network $M_{net}$ is constructed. The input layer has $p = 45$ nodes, the middle layer has 20 nodes, and the output layer has 1 node. The set of transfer matrices between layers is $W_{net} = \{W_{12}, W_{23}\}$, where $W_{12}$ is the transfer matrix from the input layer to the middle layer and $W_{23}$ is the transfer matrix from the middle layer to the output layer; the activation functions of the successive layers are {ReLU, sigmoid}. The mini-batch gradient descent method is used for model training, with 500 samples per batch.
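The forward pass of the example network (45 input nodes, 20 middle nodes, 1 output node, ReLU then sigmoid) can be sketched as follows. This is only an illustration of the architecture: the random initialization, the function names and the absence of bias terms are assumptions (the text mentions only the transfer matrices), and the loss function $L_2$ is not implemented because its formula is not reproduced in this text.

```python
import numpy as np

def init_network(p=45, hidden=20, seed=0):
    """Randomly initialized transfer matrices W12 (p x hidden) and W23 (hidden x 1)."""
    rng = np.random.default_rng(seed)
    W12 = rng.normal(0.0, 0.1, size=(p, hidden))
    W23 = rng.normal(0.0, 0.1, size=(hidden, 1))
    return W12, W23

def forward(X_std, W12, W23):
    """Input -> ReLU middle layer -> sigmoid output; one probability per sample."""
    hidden = np.maximum(X_std @ W12, 0.0)        # ReLU
    logits = hidden @ W23
    return (1.0 / (1.0 + np.exp(-logits))).ravel()

# One mini-batch of 500 standardized samples with 45 features (random toy data).
W12, W23 = init_network()
X_batch = np.random.default_rng(1).normal(size=(500, 45))
y_hat = forward(X_batch, W12, W23)
```

In a training loop, each mini-batch of 500 samples would pass through `forward`, and the transfer matrices would be updated by gradient descent on the loss.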

6. Auxiliary decision-making module: based on the physical examination data of the examinee, a standardized feature vector is obtained through the data preprocessing module, a prediction result is obtained through the medical examination assistant decision-making model, and the prediction result is output to the clinician as the assistant decision-making result of the physical examination. The module includes:

For a single examinee, the $p$ physical examination indicators that correspond to the feature dimensions are passed through the data preprocessing module to obtain the standardized feature vector $x'_u$. $x'_u$ is then input into the medical examination assistant decision-making model built in the prediction model building module, which outputs the prediction result. When the prediction result tends to 1, the physical examination result tends to be positive; when it tends to 0, the physical examination result tends to be negative. The prediction result is provided to clinicians as the assistant decision-making result of the physical examination.
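The decision step for a single examinee can be sketched as below. Several elements are assumptions made for illustration only: standardization reuses the mean and standard deviation of the training data, `toy_model` is a placeholder for the trained $M_{net}$, the indicator values are invented, and the 0.5 threshold is just one way to express "tends to 1" versus "tends to 0".

```python
import numpy as np

def standardize(x_raw, train_mean, train_std):
    """Standardize one raw indicator vector with training-set statistics."""
    return (np.asarray(x_raw, dtype=float) - train_mean) / train_std

def assist_decision(x_raw, train_mean, train_std, predict_fn, threshold=0.5):
    """Return the predicted probability and whether it leans positive."""
    x_u = standardize(x_raw, train_mean, train_std)
    y_hat = float(predict_fn(x_u[np.newaxis, :])[0])
    return {"probability": y_hat, "leans_positive": y_hat >= threshold}

def toy_model(X):
    """Placeholder for the trained model: sigmoid of the feature sum."""
    return 1.0 / (1.0 + np.exp(-X.sum(axis=1)))

# Three invented indicators (p = 3) for a single examinee.
train_mean = np.array([170.0, 65.0, 120.0])
train_std = np.array([10.0, 12.0, 15.0])
result = assist_decision([190.0, 89.0, 150.0], train_mean, train_std, toy_model)
```

The probability itself, rather than the thresholded flag, is what the text describes as being provided to the clinician.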

The above descriptions are only preferred embodiments of the present invention. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the present invention. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution of the present invention, or modify it into equivalent embodiments. Therefore, any simple modifications, equivalent changes and modifications made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still fall within the protection scope of the technical solution of the present invention.

Claims

1. A medical examination assistant decision-making system based on false negative sample identification, characterized in that it comprises:

Data acquisition module: used to obtain a real-world physical examination data set, which is matrixed into an original data set comprising an input feature matrix and real diagnostic labels, with samples whose physical examination result is negative treated as unlabeled samples;

Data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative semi-axis components of each feature component in the standardized data set, and adds a corresponding trainable upper or lower limit to each positive and negative semi-axis component, forming an extended data set;

Basic feature analysis module: using a logistic regression model in which unlabeled samples are regarded as negative samples, training obtains the feature weight with which each feature dimension contributes to generating the real diagnostic label when false negative samples are not considered;

False negative sample identification module: divides the feature dimensions into two categories, the direct correlation dimension and the competition dimension, wherein the direct correlation dimension has a direct impact on the judgment of the target physical examination result from a medical point of view, and the competition dimension has no direct impact on that judgment from a medical point of view but competes with the target physical examination result for attention, resulting in missing target physical examination results and false negative samples; constructs two logistic regression models and a joint loss function for joint training, and uses the joint loss function to screen suspected true negative samples and suspected false negative samples, so that the direct correlation dimension distinguishes the positive samples from the screened suspected true negative samples to the greatest extent and the competition dimension distinguishes the positive samples from the screened suspected false negative samples to the greatest extent; the possibility that a sample is a false negative sample is indicated by a false negative index;

Prediction model building module: builds a multi-layer neural network with a loss function that incorporates the false negative index, and trains a medical examination assistant decision-making model based on the standardized data set and the false negative index;

Auxiliary decision-making module: Based on the physical examination data of the examinee, the standardized feature vector is obtained through the data preprocessing module, and the prediction result is obtained through the auxiliary decision-making model of the physical examination, and output to the clinician as the auxiliary decision-making result of the physical examination.
2. The medical examination assistant decision-making system based on false negative sample identification according to claim 1, wherein, in the data acquisition module, the feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators; the basic physiological indicators include height, body weight, BMI, systolic blood pressure and diastolic blood pressure; the routine laboratory indicators include blood routine and urine routine; and the real diagnostic label is the physical examination result.
3. The medical examination assistant decision-making system based on false negative sample identification according to claim 1, wherein, in the data acquisition module, the physical examination data set is matrixed into an original data set $(X, y)$, where $X = [x_1, x_2, \ldots, x_n]^T = [f_1, f_2, \ldots, f_p]$ is the input feature matrix, $n$ is the sample size, $p$ is the total number of physical examination indicators, $x_1$ to $x_n$ are the samples, $f_1$ to $f_p$ are the feature components of the original data set on each feature dimension, and $T$ denotes transposition; $y = [y_1, y_2, \ldots, y_n] \in \{0,1\}^n$ is the real diagnostic label of the $n$ samples, where $y_i = 1$ means that the i-th sample is a positive sample and $y_i = 0$ means that the i-th sample is a true negative or false negative sample, regarded as an unlabeled sample; the positive sample set is denoted $S_P$, the unlabeled sample set $S_N$, the true negative sample set $S_{TN}$, and the false negative sample set $S_{FN}$, with a relation among these sets whose formula is not reproduced in this text; the specific sample composition of $S_P$ and $S_N$ is known, and that of $S_{TN}$ and $S_{FN}$ is unknown.
4. The medical examination assistant decision-making system based on false negative sample identification according to claim 3, wherein, in the data preprocessing module, each feature component in $X$ is standardized so that, over all physical examination data, each feature component has a standard deviation of 1 and a mean of 0; the standardized feature matrix is denoted $X' = [x'_1, x'_2, \ldots, x'_n]^T = [f'_1, f'_2, \ldots, f'_p]$, where $x'_i$ denotes the i-th standardized sample and $f'_j$ is the j-th feature component after standardization; $X'$ and $y$ form the standardized data set $(X', y)$;

Expand $X'$ to form a trainable feature matrix $X''$:

$$X'' = [x''_1, x''_2, \ldots, x''_n]^T = [f'_{11}, f'_{12}, f'_{21}, f'_{22}, \ldots, f'_{p1}, f'_{p2}] + t = [f''_{11}, f''_{12}, \ldots, f''_{p1}, f''_{p2}]$$

where $x''_i$ denotes the i-th sample after data expansion; $f'_{j1} = \max(f'_j, 0)$ and $f'_{j2} = \min(f'_j, 0)$ are the positive and negative semi-axis components of $f'_j$, respectively; $t = [t_{11}, t_{12}, t_{21}, t_{22}, \ldots, t_{p1}, t_{p2}]$ is the offset vector formed by the trainable upper and lower limits on each component, and the addition is performed through a broadcast mechanism; the trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.
5. The medical examination assistant decision-making system based on false negative sample identification according to claim 4, wherein, in the basic feature analysis module, unlabeled samples are regarded as negative samples and a logistic regression model $M_0$ is constructed based on the extended data set $(X'', y)$; the loss function $L_0(w, t, b)$ of $M_0$ (formula not reproduced in this text) uses:

$$p_0(x''_i) = \mathrm{sig}(w^T x''_i + b)$$

where $w$ is a trainable feature weight vector and $b$ is a trainable intercept value; $\mathrm{sig}(\cdot)$ is the sigmoid function; $w^T x''_i + b$ is the decision function, whose value is the decision value; and $p_0(x''_i)$ is the output probability of the logistic regression model $M_0$ obtained after normalization by the sigmoid function.
6. The medical examination assistant decision-making system based on false negative sample identification according to claim 5, wherein the false negative sample identification module comprises:

Taking the feature weight vector $w$ obtained by training in the basic feature analysis module, and setting trainable non-negative matrices $A_D, A_F \in [0,1]^{2p \times 2p}$ whose sum is the identity matrix, $E = A_D + A_F$;

Constructing two logistic regression models $M_D$ and $M_F$, with feature weight coefficients $w_D = w^T A_D$ and $w_F = w^T A_F$ and trainable intercept values $b_D$ and $b_F$, respectively, so that the output probabilities of the two models, normalized by the sigmoid function, are:

$$p_D(x''_i) = \mathrm{sig}(w^T A_D x''_i + b_D)$$

$$p_F(x''_i) = \mathrm{sig}(w^T A_F x''_i + b_F)$$

where $p_D(x''_i)$ is the direct probability and $p_F(x''_i)$ is the attention probability;

Using the extended data set $(X'', y)$ to minimize the joint loss function $L_1(A_D, b_D, b_F)$ (formula not reproduced in this text) and obtain the optimal parameters, where the joint loss function contains a sample category weight, a screening coefficient $\gamma$, and a quantity that does not participate in gradient backpropagation during model training;

For each sample $x''_i$ in the unlabeled sample set, obtaining the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ through the models $M_D$ and $M_F$, respectively, and using the false negative index $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ to indicate the probability that sample $x''_i$ is a false negative.
7. The medical examination assistant decision-making system based on false negative sample identification according to claim 6, wherein, in the false negative sample identification module, for the logistic regression model $M_F$, a multiplicative term screens the unlabeled samples whose output probability $p_0(x''_i)$ predicted by $M_0$ is close to 1, and the screened unlabeled samples form a screened set (the formulas of the term and of the set are not reproduced in this text); the screened set differs from the positive sample set $S_P$ in the features of the competition dimension class F and should show no significant difference in the features of the direct correlation dimension class D; the model $M_F$, trained with $S_P$ as the positive class and the screened set as the negative class, identifies the features belonging to the competition dimension class F; the training process optimizes $A_F$ and $b_F$ simultaneously to obtain the optimal distinction between $S_P$ and the screened set, such that the attention probability $p_F(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i \in S_P$.
8. The medical examination assistant decision-making system based on false negative sample identification according to claim 6, wherein, in the false negative sample identification module, for the logistic regression model $M_D$, a multiplicative term screens the unlabeled samples whose attention probability $p_F(x''_i)$ predicted by $M_F$ is close to 1, and the screened unlabeled samples form a screened set (the formulas of the term and of the set are not reproduced in this text); the screened set differs from the positive sample set $S_P$ in the features of the direct correlation dimension class D and should show no significant difference in the features of the competition dimension class F; the model $M_D$, trained with $S_P$ as the positive class and the screened set as the negative class, identifies the features belonging to the direct correlation dimension class D; the training process optimizes $A_D$ and $b_D$ simultaneously to obtain the optimal distinction between $S_P$ and the screened set, such that the direct probability $p_D(x''_i)$ tends to 0 for samples in the screened set and tends to 1 for samples $x''_i \in S_P$.
9. The medical examination assistant decision-making system based on false negative sample identification according to claim 6, wherein, in the prediction model building module, using the false negative index $r = [r_1, \ldots, r_n] \in (0,1)^n$ of each sample, a multi-layer neural network $M_{net}$ is constructed in which the input layer has $p$ nodes, the output layer has 1 node, the activation function of the output layer is the sigmoid function, and the set of transfer matrices between layers is $W_{net}$; the output of a sample $x'_i \in X'$ after the operation of $M_{net}$ is defined by a formula that is not reproduced in this text; the optimal parameters of $M_{net}$ are obtained by minimizing the loss function $L_2(W_{net})$, which incorporates the false negative index (formula not reproduced in this text);

The optimized $M_{net}$ is the medical examination assistant decision-making model constructed with the false negative index.
10. The medical examination assistant decision-making system based on false negative sample identification according to claim 9, wherein, in the auxiliary decision-making module, the $p$ physical examination indicators of a single examinee that correspond to the feature dimensions are passed through the data preprocessing module to obtain the standardized feature vector $x'_u$; $x'_u$ is input into the medical examination assistant decision-making model built in the prediction model building module, which outputs the prediction result; when the prediction result tends to 1, the physical examination result tends to be positive, and when it tends to 0, the physical examination result tends to be negative; the prediction result is provided to clinicians as the assistant decision-making result of the physical examination.