WO2023056918A1 - False negative sample recognition-based physical examination assistant decision-making system - Google Patents

False negative sample recognition-based physical examination assistant decision-making system

Info

Publication number
WO2023056918A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
false negative
feature
physical examination
dimension
Prior art date
Application number
PCT/CN2022/123731
Other languages
French (fr)
Chinese (zh)
Inventor
李劲松
周天舒
田雨
吴承凯
Original Assignee
浙江大学
Priority date
Filing date
Publication date
Application filed by 浙江大学
Publication of WO2023056918A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Definitions

  • The invention belongs to the technical field of medical and health information, and in particular relates to a medical examination assistant decision-making system based on false negative sample identification.
  • Physical examination is an important means of discovering potential diseases; routine blood tests, routine urine tests and other laboratory indicators carry a large amount of health status information.
  • However, the current physical examination process can only provide suggestive screening for a small number of diseases.
  • Retrospective modeling based on electronic case data can greatly improve the ability of physical examination data to identify diseases that are not covered by current physical examination findings, and increase the health value that a single physical examination can bring.
  • In representation-learning approaches, the additional associated data of each sample (such as text data and omics data) is screened against a code set, so that unlabeled samples that are very likely positive are marked as positive and the overall impact of false negative samples on the modeling process is reduced.
  • Existing techniques similar to technical solution ① correct the final model parameters by adjusting the loss function, sample weights and similar quantities during modeling.
  • When setting the adjustment parameters, this type of technique simply assumes that the false negative samples in the data set are a random subset of the positive samples. It does not consider the actual reasons why, in real medical scenarios, a patient who is positive for the target disease is either not diagnosed or has no diagnosis recorded. In fact, the distribution of false negative samples often departs substantially from a random one. The randomness assumption is inconsistent with how false negative samples actually arise, which degrades actual clinical prediction performance.
  • Based on the basic setting of PU learning, the present invention analyzes the common generation logic of false negative samples in real-world medical data. It replaces the data-set-granularity assumption of "randomly distributed false negative samples" used by existing techniques with a feature-granularity assumption: the feature dimensions of physical examination data can be split into two types, direct correlation dimensions and competition dimensions, which behave differently at different levels of the data. This resolves the inconsistency between the assumptions of PU learning modeling and the distribution of real-world medical data, improving the ability to use real-world data and, in turn, the accuracy and scope of physical examination data for assistant decision-making on potential diseases.
  • In each clinical feature dimension, the present invention adaptively determines, in a data-driven manner, the impact of the data on clinical disease diagnosis and on the entry of physical examination results; it is universal across different target physical examination results, does not depend on a priori medical knowledge system, and is beneficial
  • The present invention is applicable to various diseases that can be preliminarily diagnosed from basic physiological indicators and routine laboratory indicators, so it is especially suitable for large-scale medical examination scenarios.
  • The identification of false negative samples in the present invention does not depend on an additional representation mining process, so the data analysis result is not affected by a lack of additional associated data in the medical data used.
  • A medical examination assistant decision-making system based on false negative sample identification, the system including the following modules:
  • Data acquisition module: obtains a real-world physical examination data set and converts it into an original data set consisting of an input feature matrix and true diagnostic labels, with samples whose physical examination result is negative treated as unlabeled samples;
  • Data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set and adds the corresponding trainable upper and lower limit values to each half-axis component, forming an extended data set;
  • Basic feature analysis module: using a logistic regression model in which unlabeled samples are treated as negative samples, trains the feature weight with which each feature dimension generates the true diagnostic label when false negative samples are not considered;
  • False negative sample identification module: divides the feature dimensions into two categories, direct correlation dimensions and competition dimensions.
  • A direct correlation dimension directly affects the judgment of the target physical examination result from a medical point of view; a competition dimension does not directly affect that judgment.
  • Prediction model construction module: builds a multi-layer neural network, introduces a loss function that incorporates the false negative index, and trains a medical examination assistant decision-making model on the standardized data set and the false negative index;
  • Auxiliary decision-making module: from the physical examination data of the examinee, obtains the standardized feature vector through the data preprocessing module, obtains the prediction result through the medical examination assistant decision-making model, and outputs it to the clinician as the assistant decision-making result of the physical examination.
  • The feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators.
  • The basic physiological indicators include height, weight, BMI, systolic blood pressure and diastolic blood pressure.
  • The routine laboratory indicators include blood routine and urine routine.
  • The true diagnostic label is the result of the physical examination.
  • X′′ = [x′′_1, x′′_2, ..., x′′_n]
  • f′_{j1} = max(f′_j, 0)
  • t = [t_11, t_12, t_21, t_22, ..., t_p1, t_p2] is the offset vector formed by the trainable upper and lower limits on each component; the addition is done through a broadcast mechanism. The trainable feature matrix X′′ and y form the extended data set (X′′, y).
  • A logistic regression model M_0 is constructed on the extended data set (X′′, y), with loss function L_0(w, t, b), where:
  • b is a trainable intercept value
  • sig(·) is the sigmoid function
  • w^T x′′_i + b is the decision function, and its value is the decision value
  • p_0(x′′_i) is the output probability of the logistic regression model M_0, obtained after normalization by the sigmoid function.
  • The false negative sample identification module includes:
  • p_D(x′′_i) is the direct probability
  • p_F(x′′_i) is the attention probability
  • For model M_F, the multiplicative term screens the unlabeled samples whose output probability p_0(x′′_i) predicted by M_0 is close to 1. The screened unlabeled sample set differs from the positive sample set S_P in the features of the competition dimension (class F), and should show no significant difference in the features of the direct correlation dimension (class D).
  • It can therefore be used to train, with S_P as the positive class and the screened set as the negative class, the model M_F, which identifies the features belonging to the competition dimension F among the feature dimensions.
  • The training process optimizes A_F and b_F simultaneously to obtain the optimal distinction between the screened set and S_P, such that the attention probability p_F(x′′_i) tends to 0 for samples in the screened set and tends to 1 for samples x′′_i ∈ S_P.
  • For model M_D, the multiplicative term screens the unlabeled samples whose attention probability p_F(x′′_i) predicted by M_F is close to 1. The screened unlabeled sample set differs from the positive sample set S_P in the features of the direct correlation dimension (class D), and should show no obvious difference in the features of the competition dimension (class F).
  • It can therefore be used to train, with S_P as the positive class and the screened set as the negative class, the model M_D, which identifies the features belonging to the direct correlation dimension D among the feature dimensions.
  • The training process optimizes A_D and b_D simultaneously to obtain the optimal distinction between the screened set and S_P, such that the direct probability p_D(x′′_i) tends to 0 for samples in the screened set and tends to 1 for samples x′′_i ∈ S_P.
  • A multi-layer neural network M_net is built in which the number of input layer nodes is p,
  • the number of output layer nodes is 1,
  • the activation function of the output layer is the sigmoid function,
  • and the set of transfer matrices between layers is W_net.
  • Each sample x′_i ∈ X′ is passed through M_net to produce an output;
  • the optimal parameters of M_net are obtained by minimizing the loss function L_2(W_net), which incorporates the false negative index;
  • M_net is then the constructed medical examination assistant decision-making model, optimized by introducing the false negative index.
  • The p physical examination indicators of a single examinee, corresponding to the feature dimensions, are passed through the data preprocessing module to obtain the standardized feature vector x′_u.
  • The standardized feature vector x′_u is input into the medical examination assistant decision-making model built by the prediction model construction module, which outputs the prediction result. When the output tends to 1, the physical examination result tends to be positive; when it tends to 0, the physical examination result tends to be negative. The prediction is provided to clinicians as the assistant decision-making result of the physical examination.
  • The present invention simulates the universal clinical diagnosis process, analyzes the data-related causes of missed diagnoses, and models that process. This is more in line with clinical logic, better discovers false negative samples in real-world medical data, and improves the usefulness of real-world medical data for building physical examination assistant decision-making models and for clinical assistant decision-making.
  • Existing representation learning techniques require a large amount of additional data and a certain amount of medical expertise to support the representation mining process, so their universality is weak.
  • The present invention needs no additional data for modeling or clinical assistant decision-making, and it embeds the universal clinical decision-making process into the development logic of the model without introducing additional medical knowledge for each application case, so it has strong universality.
  • Fig. 1 is a structural diagram of a medical examination assistant decision-making system based on false negative sample identification provided by an embodiment of the present invention;
  • Fig. 2 is a flowchart of false negative sample identification provided by an embodiment of the present invention;
  • Fig. 3 is a flowchart of constructing the medical examination assistant decision-making model optimized by introducing the false negative index, provided by an embodiment of the present invention.
  • An embodiment of the present invention provides a medical examination assistant decision-making system based on false negative sample identification. As shown in Fig. 1, the system includes a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an auxiliary decision-making module. The implementation of each module is described in detail below.
  • Data acquisition module: obtains a real-world medical examination data set and converts it into an original data set consisting of an input feature matrix and true diagnostic labels, with samples whose medical examination result is negative treated as unlabeled samples;
  • The data acquisition module acquires the real-world physical examination data set stored in a .csv file, including the feature dimensions and the true diagnostic labels.
  • The feature dimensions of physical examination data include basic physiological indicators and routine laboratory indicators; basic physiological indicators include height, weight, BMI, systolic blood pressure, and diastolic blood pressure; routine laboratory indicators include blood routine (total protein, albumin, globulin, albumin ratio, alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenylate deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high-density lipoprotein-C, low-density lipoprotein-C, very low-density lipoprotein-
  • Data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates the positive and negative half-axis components of each feature component in the standardized data set and adds the corresponding trainable upper and lower limit values to each half-axis component, forming an extended data set, including:
  • f′_j is the j-th feature component after normalization
  • μ_j is the mean of the n samples on component f_j
  • σ_j is the standard deviation of the n samples on component f_j.
  • The data preprocessing process treats positive and negative data separately and adds a trainable offset vector, so that the feature matrix of the extended data set is close to the clinical use scenario. Specifically, the positive and negative half-axis components of each feature component f′_j of X′ are separated to simulate the difference between the two kinds of decision-relevant information, and the offset vector t is added to simulate the normal upper and lower limits of each physical examination indicator.
  • X′′ = [x′′_1, x′′_2, ..., x′′_n]
  • f′_{j1} = max(f′_j, 0)
  • t = [t_11, t_12, t_21, t_22, ..., t_p1, t_p2] is the offset vector composed of the trainable upper and lower limits on each component; the addition is done by broadcasting.
  • The trainable feature matrix X′′ and y form the extended data set (X′′, y).
  • The extended data set (X′′, y) is used by the basic feature analysis module and the false negative sample identification module; the standardized data set (X′, y) is used by the prediction model construction module and the auxiliary decision-making module.
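The preprocessing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the z-score form (f_j − μ_j)/σ_j, the negative half-axis definition min(f′_j, 0), the column ordering, and the function names `standardize` and `extend` are all assumptions made for the example; only f′_{j1} = max(f′_j, 0) and the broadcast addition of t appear in the text.

```python
import numpy as np

def standardize(X):
    """Z-score each feature column (assumed form: (f_j - mu_j) / sigma_j)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard for constant columns (assumption)
    return (X - mu) / sigma, mu, sigma

def extend(X_std, t):
    """Split each standardized feature into half-axis parts and add offsets t."""
    pos = np.maximum(X_std, 0.0)  # f'_{j1} = max(f'_j, 0), as in the text
    neg = np.minimum(X_std, 0.0)  # f'_{j2} = min(f'_j, 0), assumed counterpart
    n, p = X_std.shape
    halves = np.empty((n, 2 * p))
    halves[:, 0::2] = pos  # columns ordered j1, j2 to match t = [t_11, t_12, ...]
    halves[:, 1::2] = neg
    return halves + t      # t has shape (2p,) and broadcasts over the n rows

# Toy data: three examinees, two indicators (illustrative values only).
X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 70.0]])
X_std, mu, sigma = standardize(X)
t = np.zeros(2 * X.shape[1])  # offsets start at zero here; the text makes them trainable
X_ext = extend(X_std, t)
```

With t set to zero, the two half-axis columns of each feature sum back to the standardized value, which is a quick way to check the split.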
  • Basic feature analysis module: using a logistic regression model in which unlabeled samples are treated as negative samples, trains the feature weight with which each feature dimension generates the true diagnostic label when false negative samples are not considered, including:
  • b is a trainable intercept value, x′′_i denotes the i-th sample after data expansion, in the form of a feature vector, and y_i is the true diagnostic label of the i-th sample;
  • sig(·) is the sigmoid function, w^T x′′_i + b is the decision function, whose value is the decision value, and p_0(x′′_i) is the output probability of the logistic regression model M_0 obtained after normalization by the sigmoid function, that is, the probability predicted by M_0 that the sample x′′_i is positive.
  • The mini-batch gradient descent method is used for model training, and the sample size used in a single batch is 500.
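A minimal sketch of this training step, assuming the loss L_0 is the standard logistic (cross-entropy) loss; the extracted text does not show the formula, so that choice is an assumption. The offset vector t is held fixed here for brevity, and the function name `train_m0`, the learning rate and the epoch count are hypothetical. Only the batch size of 500 and the treatment of unlabeled samples as negatives come from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_m0(X_ext, y, epochs=200, batch_size=500, lr=0.1, seed=0):
    """Mini-batch gradient descent for logistic regression.

    y holds 1 for labeled positives and 0 for unlabeled samples,
    which are treated as negatives at this stage.
    """
    rng = np.random.default_rng(seed)
    n, d = X_ext.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Derivative of the cross-entropy loss w.r.t. the decision value.
            err = sigmoid(X_ext[idx] @ w + b) - y[idx]
            grad_w = X_ext[idx].T @ err / len(idx)
            grad_b = err.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic demonstration data (not from the patent).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(1000, 4))
y_demo = (X_demo[:, 0] > 0).astype(float)
w, b = train_m0(X_demo, y_demo)
p0 = sigmoid(X_demo @ w + b)  # output probability p_0 for each sample
```

The learned `w` plays the role of the feature weight vector that the false negative sample identification module later decomposes.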
  • False negative sample identification module: divides the feature dimensions into two categories, direct correlation dimensions and competition dimensions. A direct correlation dimension directly affects the judgment of the target physical examination result from a medical point of view. A competition dimension does not directly affect that judgment, but it competes with the target physical examination result for attention, causing the target result to go unrecorded and producing false negative samples. The module constructs two logistic regression models and a joint loss function for joint training, and uses the joint loss function to screen suspected true negative and suspected false negative samples, so that the direct correlation dimensions distinguish the positive samples from the screened suspected true negative samples to the greatest extent, and the competition dimensions distinguish the positive samples from the screened suspected false negative samples to the greatest extent;
  • The false negative index indicates the possibility that a sample is a false negative sample;
  • The feature dimensions are divided into two categories: direct correlation dimension D and competition dimension F, defined as follows. Features in class D directly affect the judgment of the target physical examination result from a medical point of view. Features in class F do not directly affect that judgment, but compete with the target physical examination result for attention, which may cause the target result to go unrecorded and produce false negative samples.
  • The feature weight vector w is generated under the joint action of these two types of features. The core idea of the false negative sample identification module is to identify the two feature classes D and F by data induction, so as to evaluate the possibility that an unlabeled sample is a false negative.
  • The decision value contributed by class D features is w^T A_D x′′_i;
  • with it, the positive sample set S_P and the true negative sample set S_TN should be distinguished to the greatest extent.
  • The decision value contributed by class F features is w^T A_F x′′_i;
  • with it, the positive sample set S_P should be distinguished from the false negative sample set S_FN to the greatest extent.
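The two decision values can be illustrated with a short sketch. It assumes that A_D and A_F are diagonal matrices (represented by their diagonals `a_d` and `a_f`) and that the direct and attention probabilities are obtained by applying the sigmoid to the respective decision value plus an intercept; the text names A_D, A_F, b_D and b_F but the extracted formulas are missing, so these forms and the function names are assumptions. The joint loss function itself is not reproduced here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_decision_values(X_ext, w, a_d, a_f):
    """Decision values contributed by class D and class F features.

    a_d and a_f are the diagonals of the (assumed diagonal) matrices A_D, A_F,
    so (X_ext * a) @ w equals w^T A x''_i for every row x''_i.
    """
    dec_d = (X_ext * a_d) @ w  # w^T A_D x''_i
    dec_f = (X_ext * a_f) @ w  # w^T A_F x''_i
    return dec_d, dec_f

def class_probabilities(X_ext, w, a_d, a_f, b_d=0.0, b_f=0.0):
    """Direct probability p_D and attention probability p_F (assumed sigmoid form)."""
    dec_d, dec_f = class_decision_values(X_ext, w, a_d, a_f)
    return sigmoid(dec_d + b_d), sigmoid(dec_f + b_f)

# Toy example: four extended features, the first two assigned to class D.
X_ext = np.array([[0.5, 0.0, 1.2, 0.0],
                  [0.0, -0.7, 0.0, -0.3]])
w = np.array([1.0, -2.0, 0.5, 0.8])
a_d = np.array([1.0, 1.0, 0.0, 0.0])
a_f = 1.0 - a_d
dec_d, dec_f = class_decision_values(X_ext, w, a_d, a_f)
p_d, p_f = class_probabilities(X_ext, w, a_d, a_f)
```

When the two diagonals sum to one for every feature, the class D and class F contributions add up to the full decision value w^T x′′_i, which matches the statement that w arises from the joint action of both feature types.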
  • The false negative sample identification module completes the following steps:
  • The offset vector t in the extended data set (X′′, y) takes the optimized value obtained from training M_0 in the basic feature analysis module and is not trained further.
  • is the screening coefficient.
  • is the screening strength of unlabeled samples classified as false negatives and true negative samples by all parts of the joint loss function.
  • The mini-batch gradient descent method is used for joint training of the models M_D and M_F, and the sample size used in a single batch is 500.
  • The training process optimizes A_F and b_F simultaneously to obtain the optimal distinction between the screened set and S_P, such that the attention probability p_F(x′′_i) tends to 0 for samples in the screened set and tends to 1 for samples x′′_i ∈ S_P.
  • For model M_D, the multiplicative term screens the unlabeled samples with a higher attention probability p_F(x′′_i) predicted by M_F. Relative to the overall unlabeled sample set S_N, the screened set contains a higher proportion of true negative samples. It differs from the positive sample set S_P in the features of the direct correlation dimension (class D), but should show no significant difference in the features of the competition dimension (class F), so it can be used to train, with S_P as the positive class and the screened set as the negative class, the model M_D, which identifies the features belonging to the direct correlation dimension D among the feature dimensions.
  • The training process optimizes A_D and b_D simultaneously to obtain the optimal distinction between the screened set and S_P, such that the direct probability p_D(x′′_i) tends to 0 for samples in the screened set and tends to 1 for samples x′′_i ∈ S_P.
  • Prediction model construction module: builds a multi-layer neural network, introduces a loss function that incorporates the false negative index, and trains the medical examination assistant decision-making model on the standardized data set and the false negative index, including:
  • A multi-layer neural network M_net is built in which the number of nodes in the input layer is p,
  • the number of nodes in the output layer is 1,
  • the activation function of the output layer is the sigmoid function,
  • and the set of transfer matrices between layers is W_net.
  • Each sample x′_i ∈ X′ is passed through the neural network M_net to produce an output, and the outputs of all samples form a prediction vector. The optimal parameters of M_net are then obtained by minimizing the loss function L_2(W_net), which incorporates the false negative index.
  • M_net is then the medical examination assistant decision-making model, optimized by introducing the false negative index.
  • See Fig. 3 for the construction process of the medical examination assistant decision-making model.
  • The mini-batch gradient descent method is used for model training, and the sample size used in a single batch is 500.
  • Auxiliary decision-making module: from the physical examination data of the examinee, obtains the standardized feature vector through the data preprocessing module, obtains the prediction result through the medical examination assistant decision-making model, and outputs it to the clinician as the assistant decision-making result of the physical examination, including:
  • The p physical examination indicators of a single examinee, corresponding to the feature dimensions, are passed through the data preprocessing module to obtain the standardized feature vector x′_u.
  • x′_u is input into the medical examination assistant decision-making model built by the prediction model construction module, which outputs the prediction result. When the output tends to 1, the physical examination result tends to be positive; when it tends to 0, the physical examination result tends to be negative. The prediction is provided to clinicians as the assistant decision-making result of the physical examination.
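The network shape described above (p input nodes, one sigmoid output, a set of transfer matrices W_net) can be sketched as below. The exact form of L_2 is not recoverable from the extracted text, so the loss shown is only one plausible way to incorporate a false negative index: it is an assumption that unlabeled samples receive their index as a soft target. The ReLU hidden activation, the absence of bias terms, and the names `forward` and `fn_weighted_loss` are likewise assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X_std, weights):
    """Forward pass: ReLU hidden layers (assumed), sigmoid output layer."""
    h = X_std
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)
    return sigmoid(h @ weights[-1]).ravel()

def fn_weighted_loss(y_hat, y, fn_index, eps=1e-12):
    """Cross-entropy against soft targets (assumed form, not the patent's L_2).

    Labeled positives keep target 1; each unlabeled sample uses its
    false negative index in [0, 1] as the target.
    """
    target = np.where(y == 1, 1.0, fn_index)
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(target * np.log(y_hat) + (1.0 - target) * np.log(1.0 - y_hat))

# Toy network: p = 3 inputs, one hidden layer of 8 nodes, 1 output node.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 8)), rng.normal(scale=0.5, size=(8, 1))]
X_std = rng.normal(size=(5, 3))
y = np.array([1, 0, 0, 1, 0])
fn_index = np.array([0.0, 0.9, 0.1, 0.0, 0.5])
y_hat = forward(X_std, weights)
loss = fn_weighted_loss(y_hat, y, fn_index)
```

Under this assumed loss, an unlabeled sample with a high false negative index is pulled toward a positive prediction rather than being forced to 0, which is the intended effect of introducing the index.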

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Theoretical Computer Science (AREA)
  • Primary Health Care (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

Disclosed in the present invention is a false negative sample recognition-based physical examination assistant decision-making system. The system comprises a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample recognition module, a prediction model construction module, and an assistant decision-making module. According to the present invention, a universal clinical diagnosis process is simulated, the data incentive caused by missing diagnosis is analyzed, and the process is modeled, thereby being more in line with clinical logic, better discovering false negative samples in real-world medical data, and improving the application ability of the real-world medical data in the construction of a physical examination assistant decision-making model and clinical assistant decision-making. According to the present invention, there is no need to use additional data in the process of modeling and clinical assistant decision-making, and a universal clinical actual decision-making process is embedded into the development logic of the model, without introducing additional medical knowledge for application cases, thereby achieving high universality.

Description

A medical examination assistant decision-making system based on false negative sample identification
Technical Field
The invention belongs to the technical field of medical and health information, and in particular relates to a medical examination assistant decision-making system based on false negative sample identification.
Background
Retrospective clinical research and clinical assistant decision support based on real-world clinical data (typified by electronic medical record data) have become common and important tools in current medical informatics research. Compared with prospective randomized controlled trials (RCTs), informatics modeling on retrospective real-world data offers large data volume, full coverage of clinical scenarios, and patient distributions that closely resemble practice, so it is closer to actual diagnosis and treatment scenarios and has better clinical application value.
Physical examination is an important means of discovering potential diseases; routine blood tests, routine urine tests and other laboratory indicators carry a large amount of health status information. However, the current physical examination process can only provide suggestive screening for a small number of diseases. Retrospective modeling based on electronic case data can greatly improve the ability of physical examination data to identify diseases that are not covered by current physical examination findings, and increase the health value that a single physical examination can bring.
However, because the sources of real-world medical data are complex, the accuracy and completeness of the data are affected by the diagnosis and treatment process at the time the data was entered. A typical form of incompleteness is a missing positive label in the true diagnostic labels (that is, a false negative sample), which has a considerable impact on subsequent prediction modeling and clinical application. Possible reasons for a missing positive label include: 1) during the visit there were other, unrelated indicators or diseases that drew more of the recording doctor's subjective attention; 2) the department registered with, or the reason for the visit, did not match the target disease; 3) the doctor omitted the disease when recording the diagnosis.
Because false negative samples are widespread in real-world data, many studies have taken this problem into account. The technical solutions closest to this application are: ① Positive and unlabeled learning (PU learning), which treats the unmarked samples in the data as unlabeled samples that may be either positive or negative. Jinbo Chen et al. [1] eliminate the influence of false negative samples on the overall model by adjusting sample weights. Building on the logistic regression algorithm, this technique treats the global proportion of positive samples as an additional unknown parameter and, by maximizing a likelihood function that contains this proportion and the weight matrix, obtains the optimal value of the global positive proportion for the data set, which is then used to correct the model's predicted values and obtain the final prediction result. ② Representation learning: for example, Kavishwar B. Wagholikar et al. [2] and Yoni Halpern et al. [3] manually or semi-automatically construct a code set associated with the target diagnosis and use it to screen each sample's additional associated data (such as text data and omics data), so that unlabeled samples that are very likely positive are marked as positive, reducing the overall impact of false negative samples on the modeling process.
Existing techniques similar to solution ① correct the final model parameters by adjusting the loss function, the sample weights, and so on during modeling. When setting the adjustment parameters, such techniques simply assume that the false negative samples in the data set are a random subset of the positive samples. They do not consider the actual causes, in real medical settings, of false negative samples in which "the patient is actually positive for the target disease but was not diagnosed, or the diagnosis was not recorded". In fact, the distribution of false negative samples often differs considerably from a random distribution. The randomness assumption is inconsistent with the way false negative samples actually arise, which degrades clinical prediction performance in practice.
Existing techniques similar to solution ② supplement the positive samples through representation learning. However, representation learning usually requires building, for each specific disease, a terminology set that demands a high level of medical knowledge, which hinders general use of the technique. These solutions also need a large amount of additional medical data in order to discover false negative samples. For single-visit patients, who account for the majority of real-world data, sufficient additional data are not available, so representation-learning-based methods cannot be used to address false negatives in real-world medical data.
[1] Zhang L, Ding X, Ma Y, et al. A maximum likelihood approach to electronic health record phenotyping using positive and unlabeled patients [J]. Journal of the American Medical Informatics Association, 2020, 27(1): 119-126.
[2] Wagholikar K B, Estiri H, Murphy M, et al. Polar labeling: silver standard algorithm for training disease classifiers [J]. Bioinformatics, 2020, 36(10): 3200-3206.
[3] Halpern Y, Horng S, Choi Y, et al. Electronic medical record phenotyping using the anchor and learn framework [J]. Journal of the American Medical Informatics Association, 2016, 23(4): 731-740.
Summary of the Invention
Based on the basic setting of PU learning, the present invention analyzes the common logic by which false negative samples arise in real-world medical data and adopts a feature-granularity assumption, namely that "the feature dimensions of physical examination data can be split into two classes, directly related dimensions and competing dimensions, which behave differently at the data level". This assumption replaces the data-set-granularity assumption that "false negative samples are randomly distributed", which the prior art adopts by default, and resolves the inconsistency between the assumption made in PU learning modeling and the distribution of real-world medical data. This improves the ability to use real-world data and, in turn, the accuracy and scope of decision support for potential diseases based on physical examination data. The invention adaptively determines, in a data-driven manner, how the data in each clinical feature dimension affect clinical disease diagnosis and the recording of physical examination results. It is generally applicable across different target physical examination results and does not rely on a prior system of medical knowledge, which facilitates its application to diseases that can be preliminarily diagnosed from basic physiological indicators and routine laboratory indicators; it is therefore particularly suitable for large-scale physical examination scenarios. The identification of false negative samples in the present invention does not depend on an additional representation mining process, so the analysis results are not affected when the medical data used lack additional associated data.
The object of the present invention is achieved by the following technical solution: a physical examination assistant decision-making system based on false negative sample identification, comprising the following modules:
Data acquisition module: acquires a real-world physical examination data set and converts it into matrix form as an original data set comprising an input feature matrix and true diagnostic labels; samples whose physical examination result is negative are treated as unlabeled samples;
Data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates each feature component of the standardized data set into its positive and negative half-axis components and adds the corresponding trainable upper or lower limit value to each half-axis component, forming an extended data set;
Basic feature analysis module: uses a logistic regression model, treating unlabeled samples as negative samples, to obtain by training the feature weight of each feature dimension in producing the true diagnostic labels when false negative samples are not taken into account;
False negative sample identification module: divides the feature dimensions into two classes, directly related dimensions and competing dimensions. From a medical standpoint, directly related dimensions directly influence the determination of the target physical examination result; competing dimensions do not directly influence that determination but compete with the target result for attention, so that the target result goes unrecorded and false negative samples are produced. The module constructs two logistic regression models and a joint loss function and trains them jointly. The joint loss function is used to screen true negative and false negative samples, such that the directly related dimensions best separate the positive samples from the screened suspected true negative samples, and the competing dimensions best separate the positive samples from the screened suspected false negative samples. A false negative index indicates how likely a sample is to be a false negative sample;
Prediction model construction module: constructs a multi-layer neural network and a loss function that incorporates the false negative index, and trains the physical examination assistant decision-making model on the standardized data set and the false negative index;
Assistant decision-making module: based on an examinee's physical examination data, obtains a standardized feature vector through the data preprocessing module, obtains a prediction through the physical examination assistant decision-making model, and outputs it to the clinician as the physical examination assistant decision-making result.
Further, in the data acquisition module, the feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators; the basic physiological indicators include height, weight, BMI, systolic blood pressure and diastolic blood pressure, and the routine laboratory indicators include routine blood tests and routine urine tests; the true diagnostic label is the physical examination result.
Further, in the data acquisition module, the physical examination data set is converted into matrix form as the original data set $(X, y)$, where
Figure PCTCN2022123731-appb-000001
Figure PCTCN2022123731-appb-000002
is the input feature matrix, $n$ is the number of samples, $p$ is the total number of physical examination indicators, $x_1$ to $x_n$ denote the samples, $f_1$ to $f_p$ are the feature components of the original data set in each feature dimension, and $T$ denotes transposition; $y = [y_1, y_2, \ldots, y_n] \in \{0, 1\}^n$ are the true diagnostic labels of the $n$ samples, where $y_i = 1$ means the $i$-th sample is a positive sample and $y_i = 0$ means the $i$-th sample is either a true negative or a false negative sample and is treated as an unlabeled sample. The set of positive samples is denoted $S_P$, the set of unlabeled samples $S_N$, the set of true negative samples $S_{TN}$, and the set of false negative samples $S_{FN}$, with $S_{TN} \cup S_{FN} = S_N$,
Figure PCTCN2022123731-appb-000003
where the membership of $S_P$ and $S_N$ is known and the membership of $S_{TN}$ and $S_{FN}$ is unknown.
Further, in the data preprocessing module, each feature component of $X$ is standardized so that, over all physical examination data, each component has standard deviation 1 and mean 0. The standardized feature matrix is denoted $X' = [x'_1, x'_2, \ldots, x'_n]^T = [f'_1, f'_2, \ldots, f'_p]$, where
Figure PCTCN2022123731-appb-000004
denotes the $i$-th standardized sample and $f'_j$ is the $j$-th standardized feature component; $X'$ and $y$ form the standardized data set $(X', y)$;
$X'$ is expanded into the trainable feature matrix $X''$:
$$X'' = [x''_1, x''_2, \ldots, x''_n]^T = [f'_{11}, f'_{12}, f'_{21}, f'_{22}, \ldots, f'_{p1}, f'_{p2}] + t = [f''_{11}, f''_{12}, \ldots, f''_{p1}, f''_{p2}]$$
where
Figure PCTCN2022123731-appb-000005
denotes the $i$-th sample after data expansion; $f'_{j1} = \max(f'_j, 0)$ and $f'_{j2} = \min(f'_j, 0)$ are the positive and negative half-axis components of $f'_j$, respectively; $t = [t_{11}, t_{12}, t_{21}, t_{22}, \ldots, t_{p1}, t_{p2}]$ is the offset vector formed by the trainable upper and lower limit values on each component,
Figure PCTCN2022123731-appb-000006
and the addition is performed by broadcasting; the trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.
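The expansion described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `expand_features` is hypothetical, the input is assumed to be already standardized, and the offset vector `t` is passed in as a fixed array, whereas in the described system it is a trainable parameter.

```python
import numpy as np

def expand_features(X_std, t):
    """Build X'' from a standardized matrix X' (illustrative sketch).

    X_std : array of shape (n, p), the standardized features f'_1 ... f'_p.
    t     : array of shape (2p,), ordered [t_11, t_12, ..., t_p1, t_p2].
            Treated as a constant here; in the described system it is trainable.
    Returns an array of shape (n, 2p) with columns
    [f''_11, f''_12, ..., f''_p1, f''_p2].
    """
    n, p = X_std.shape
    pos = np.maximum(X_std, 0.0)  # positive half-axis components f'_j1
    neg = np.minimum(X_std, 0.0)  # negative half-axis components f'_j2
    # Stack to (n, p, 2) and flatten so the two halves of each feature are adjacent.
    halves = np.stack([pos, neg], axis=2).reshape(n, 2 * p)
    # t has shape (2p,) and is added to every row by broadcasting.
    return halves + t
```

Interleaving the two halves keeps each pair $f''_{j1}, f''_{j2}$ in adjacent columns, which matches the column order written in the formula for $X''$.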
Further, in the basic feature analysis module, unlabeled samples are treated as negative samples and a logistic regression model $M_0$ is constructed on the extended data set $(X'', y)$; the loss function $L_0(w, t, b)$ of $M_0$ is:
Figure PCTCN2022123731-appb-000007
$$p_0(x''_i) = \operatorname{sig}(w^T x''_i + b)$$
where
Figure PCTCN2022123731-appb-000008
is the trainable feature weight vector and $b$ is the trainable intercept; $\operatorname{sig}(\cdot)$ is the sigmoid function, $w^T x''_i + b$ is the decision function, whose value is the decision value, and $p_0(x''_i)$ is the output probability of the logistic regression model $M_0$ after normalization by the sigmoid function.
Further, the false negative sample identification module comprises:
taking the feature weight vector $w$ obtained by training in the basic feature analysis module, and setting trainable non-negative matrices $A_D, A_F \in [0, 1]^{2p \times 2p}$ whose sum is the identity matrix, $E = A_D + A_F$;
constructing two logistic regression models $M_D$ and $M_F$ with feature weight coefficients $w_D = w^T A_D$ and $w_F = w^T A_F$ and trainable intercepts $b_D$ and $b_F$, respectively, so that the output probabilities of the two models after normalization by the sigmoid function are:
$$p_D(x''_i) = \operatorname{sig}(w^T A_D x''_i + b_D)$$
$$p_F(x''_i) = \operatorname{sig}(w^T A_F x''_i + b_F)$$
where $p_D(x''_i)$ is the direct probability and $p_F(x''_i)$ is the attention probability;
obtaining the optimal parameters by minimizing the joint loss function $L_1(A_D, b_D, b_F)$ on the extended data set $(X'', y)$:
Figure PCTCN2022123731-appb-000009
where
Figure PCTCN2022123731-appb-000010
is the sample class weight; $\gamma$ is the screening coefficient; and
Figure PCTCN2022123731-appb-000011
which does not participate in gradient backpropagation during model training;
for each sample $x''_i$ in the unlabeled sample set, obtaining the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ through the models $M_D$ and $M_F$, respectively, and using the false negative index $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ to indicate how likely the sample $x''_i$ is to be a false negative.
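As a rough illustration of how the false negative index could be evaluated once $w$, $A_D$, $b_D$ and $b_F$ are available, consider the NumPy sketch below. The function names are hypothetical and the parameters are simply passed in; how they are trained (the joint loss $L_1$) is not reproduced here. Note that because $A_D$ and $A_F$ are both non-negative and sum to the identity matrix, their off-diagonal entries are necessarily zero, so the example uses a diagonal $A_D$.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sig(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def false_negative_index(X_ext, w, A_D, b_D, b_F):
    """Compute r_i = p_D(x''_i) * (1 - p_F(x''_i)) for every row of X_ext.

    X_ext : (n, 2p) extended feature matrix X''.
    w     : (2p,) feature weights taken from the base model M_0.
    A_D   : (2p, 2p) matrix for the directly related dimensions;
            A_F is derived from the constraint A_D + A_F = E.
    b_D, b_F : scalar intercepts of M_D and M_F.
    """
    A_F = np.eye(A_D.shape[0]) - A_D
    # w^T A x equals (A^T w) . x, evaluated for all rows at once.
    p_D = sigmoid(X_ext @ (A_D.T @ w) + b_D)  # direct probability
    p_F = sigmoid(X_ext @ (A_F.T @ w) + b_F)  # attention probability
    return p_D * (1.0 - p_F)
```

A sample scores high only when it looks positive on the directly related dimensions (large $p_D$) and its attention probability is low (small $p_F$), which is the pattern the text associates with a missed diagnosis.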
Further, in the false negative sample identification module, for the logistic regression model $M_F$, the multiplicative term
Figure PCTCN2022123731-appb-000012
selects the unlabeled samples whose output probability $p_0(x''_i)$ predicted by $M_0$ is close to 1; the selected set of unlabeled samples is denoted
Figure PCTCN2022123731-appb-000013
This set differs from the positive sample set $S_P$ in the features of the competing-dimension class F and should show no obvious difference in the features of the directly-related-dimension class D. By training the model $M_F$ with $S_P$ as the positive class and
Figure PCTCN2022123731-appb-000014
as the negative class, the features belonging to the competing-dimension class F are identified. The training process simultaneously optimizes $A_F$ and $b_F$ to obtain the optimal separation between
Figure PCTCN2022123731-appb-000015
and $S_P$, such that for samples
Figure PCTCN2022123731-appb-000016
the attention probability $p_F(x''_i)$ tends to 0, and for samples $x''_i \in S_P$ the attention probability $p_F(x''_i)$ tends to 1.
Further, in the false negative sample identification module, for the logistic regression model $M_D$, the multiplicative term
Figure PCTCN2022123731-appb-000017
selects the unlabeled samples whose attention probability $p_F(x''_i)$ predicted by $M_F$ is close to 1; the selected set of unlabeled samples is denoted
Figure PCTCN2022123731-appb-000018
This set differs from the positive sample set $S_P$ in the features of the directly-related-dimension class D and should show no obvious difference in the features of the competing-dimension class F. By training the model $M_D$ with $S_P$ as the positive class and
Figure PCTCN2022123731-appb-000019
as the negative class, the features belonging to the directly-related-dimension class D are identified. The training process simultaneously optimizes $A_D$ and $b_D$ to obtain the optimal separation between
Figure PCTCN2022123731-appb-000020
and $S_P$, such that for samples
Figure PCTCN2022123731-appb-000021
the direct probability $p_D(x''_i)$ tends to 0, and for samples $x''_i \in S_P$ the direct probability $p_D(x''_i)$ tends to 1.
Further, in the prediction model construction module, based on the standardized data set $(X', y)$ and the false negative indices $r = [r_1, \ldots, r_n] \in (0, 1)^n$ of the samples, a multi-layer neural network $M_{net}$ is constructed with $p$ input-layer nodes, 1 output-layer node, a sigmoid activation function in the output layer, and a set of inter-layer transfer matrices $w_{net}$. The output of $M_{net}$ for a sample $x'_i \in X'$ is defined as
Figure PCTCN2022123731-appb-000022
The optimal parameters of $M_{net}$ are obtained by minimizing the loss function $L_2(w_{net})$, which incorporates the false negative index:
Figure PCTCN2022123731-appb-000023
$M_{net}$ is then the constructed physical examination assistant decision-making model optimized with the false negative index.
Further, in the assistant decision-making module, the $p$ physical examination indicators obtained by a single examinee, which correspond to the feature dimensions, are passed through the data preprocessing module to obtain the standardized feature vector $x'_u$; $x'_u$ is input to the physical examination assistant decision-making model built by the prediction model construction module, which outputs the prediction
Figure PCTCN2022123731-appb-000024
When
Figure PCTCN2022123731-appb-000025
tends to 1, the physical examination result tends to be positive; when
Figure PCTCN2022123731-appb-000026
tends to 0, the physical examination result tends to be negative. The prediction is provided to the clinician as the physical examination assistant decision-making result.
The beneficial effects of the present invention are:
1. Existing positive and unlabeled learning techniques treat a missing clinical diagnosis as a random event. The present invention simulates the general clinical diagnostic workflow, analyzes the data-level causes of missing diagnoses, and models that process. This is more consistent with clinical logic, enables better discovery of false negative samples in real-world medical data, and improves the usefulness of real-world medical data for constructing physical examination assistant decision-making models and for clinical decision support.
2. Existing representation learning techniques need a large amount of additional data and a certain amount of medical expertise to support the representation mining process, so their general applicability is weak. The present invention needs no additional data for modeling or for clinical decision support, and it embeds the general clinical decision process in the model's design logic, so no additional medical knowledge has to be introduced for each application case; it is therefore broadly applicable.
Brief Description of the Drawings
Fig. 1 is a structural diagram of the physical examination assistant decision-making system based on false negative sample identification provided by an embodiment of the present invention;
Fig. 2 is a flow chart of false negative sample identification provided by an embodiment of the present invention;
Fig. 3 is a flow chart of constructing the physical examination assistant decision-making model optimized with the false negative index, provided by an embodiment of the present invention.
Detailed Description
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
An embodiment of the present invention provides a physical examination assistant decision-making system based on false negative sample identification. As shown in Fig. 1, the system comprises a data acquisition module, a data preprocessing module, a basic feature analysis module, a false negative sample identification module, a prediction model construction module and an assistant decision-making module. The implementation of each module is described in detail below.
1. Data acquisition module: acquires a real-world physical examination data set and converts it into matrix form as an original data set comprising an input feature matrix and true diagnostic labels; samples whose physical examination result is negative are treated as unlabeled samples.
Specifically, the data acquisition module acquires a real-world physical examination data set stored in a .csv file, comprising feature dimensions and true diagnostic labels. The feature dimensions of the physical examination data include basic physiological indicators and routine laboratory indicators. The basic physiological indicators include height, weight, BMI, systolic blood pressure and diastolic blood pressure. The routine laboratory indicators include routine blood tests (total protein, albumin, globulin, albumin/globulin ratio, alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, cholinesterase, total bile acid, total bilirubin, direct bilirubin, indirect bilirubin, adenylate deaminase, glutamyl transpeptidase, glomerular filtration rate, creatinine, urea, uric acid, cystatin C, triglycerides, total cholesterol, high-density lipoprotein-C, low-density lipoprotein-C, very-low-density lipoprotein-C, fasting blood glucose, potassium, sodium, chloride, total calcium, inorganic phosphorus, glycylproline dipeptidyl aminopeptidase, α-fucosidase) and routine urine tests (urine protein, urine ketone bodies, urine glucose, urine bilirubin, urinary sediment white blood cells, urinary sediment red blood cells, urobilinogen, urine acidity). The true diagnostic label is the physical examination result, for example a diabetes diagnosis.
The physical examination data set is converted into matrix form as the original data set $(X, y)$, where
Figure PCTCN2022123731-appb-000027
is the input feature matrix; $n$ is the number of samples and $p$ is the total number of physical examination indicators (in this example, $n = 25000$ and $p = 45$); $x_1$ to $x_n$ denote the samples, each represented as a feature vector; $f_1$ to $f_p$ are the feature components of the original data set in each feature dimension; and $T$ denotes transposition. $y = [y_1, y_2, \ldots, y_n] \in \{0, 1\}^n$ are the true diagnostic labels of the $n$ samples, that is, the target labels: $y_i = 1$ means the physical examination result of the $i$-th sample is positive, so the sample is a positive sample; $y_i = 0$ means the physical examination result of the $i$-th sample is negative, in which case the sample may be either a true negative or a false negative sample and is treated as an unlabeled sample. The set of positive samples is denoted $S_P$ and contains all samples with $y_i = 1$; the set of unlabeled samples is denoted $S_N$ and contains all samples with $y_i = 0$; the set of true negative samples is denoted $S_{TN}$ and the set of false negative samples $S_{FN}$, with $S_{TN} \cup S_{FN} = S_N$,
Figure PCTCN2022123731-appb-000028
where the membership of $S_P$ and $S_N$ is known and the membership of $S_{TN}$ and $S_{FN}$ is unknown.
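The known part of this partition follows directly from the label vector. A minimal NumPy sketch, assuming the labels have already been loaded into an array (the function name `split_positive_unlabeled` is hypothetical, and reading the .csv file is left out):

```python
import numpy as np

def split_positive_unlabeled(y):
    """Return the index sets S_P (y_i = 1) and S_N (y_i = 0).

    S_TN and S_FN cannot be computed from y alone: both are hidden
    inside S_N, which is why those samples are treated as unlabeled.
    """
    y = np.asarray(y)
    if not np.isin(y, [0, 1]).all():
        raise ValueError("labels must be 0 or 1")
    S_P = np.flatnonzero(y == 1)  # indices of positive samples
    S_N = np.flatnonzero(y == 0)  # indices of unlabeled samples
    return S_P, S_N
```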
2. Data preprocessing module: forms a standardized data set by unifying the standard deviation and mean of each feature component in the original data set; separates each feature component of the standardized data set into its positive and negative half-axis components and adds the corresponding trainable upper or lower limit value to each half-axis component, forming an extended data set. This comprises:
Each feature component $f_j$ of $X$ is standardized by a component-specific transform $\phi_j$ so that, over all physical examination data, the component has standard deviation 1 and mean 0. The standardized feature matrix is denoted $X' = [x'_1, x'_2, \ldots, x'_n]^T = [f'_1, f'_2, \ldots, f'_p]$, where
Figure PCTCN2022123731-appb-000029
denotes the $i$-th standardized sample, represented as a feature vector; $X'$ and $y$ form the standardized data set $(X', y)$:
Figure PCTCN2022123731-appb-000030
where $f'_j$ is the $j$-th standardized feature component, $\lambda_j$ is the mean of the $n$ samples on component $f_j$, and $\sigma_j$ is the standard deviation of the $n$ samples on component $f_j$.
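The formula image for this step is not reproduced in the text. Given the stated goal (mean 0, standard deviation 1) and the definitions of $\lambda_j$ and $\sigma_j$, it is presumably the usual z-score transform, which can be written as follows. The entry notation $x_{ij}$ for the value of sample $i$ on component $f_j$, and the use of the population rather than the sample standard deviation, are assumptions made here:

$$f'_j = \phi_j(f_j) = \frac{f_j - \lambda_j}{\sigma_j}, \qquad \lambda_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{ij} - \lambda_j\right)^2}$$

With these definitions each $f'_j$ has mean 0 and standard deviation 1 over the $n$ samples, provided $\sigma_j \neq 0$.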
In practice, physical examination indicators usually provide decision support information in the form "above the upper limit of normal" or "below the lower limit of normal", and the examination results that these two kinds of information point to are often not exact opposites. The data preprocessing of the present invention therefore treats positive and negative values separately and adds a trainable offset vector, so that the feature matrix of the resulting extended data set is close to the clinical usage scenario. Specifically, the positive and negative half-axis components of each feature component $f'_j$ of $X'$ are separated to model the difference between the two kinds of decision support information, and the offset vector $t$ is added to model the normal upper and lower limits of the physical examination indicators.
On this basis, $X'$ is expanded into the trainable feature matrix $X''$:
$$X'' = [x''_1, x''_2, \ldots, x''_n]^T = [f'_{11}, f'_{12}, f'_{21}, f'_{22}, \ldots, f'_{p1}, f'_{p2}] + t = [f''_{11}, f''_{12}, \ldots, f''_{p1}, f''_{p2}]$$
where $f'_{j1} = \max(f'_j, 0)$ and $f'_{j2} = \min(f'_j, 0)$ are the positive and negative half-axis components of $f'_j$, respectively; $t = [t_{11}, t_{12}, t_{21}, t_{22}, \ldots, t_{p1}, t_{p2}]$ is the offset vector formed by the trainable upper and lower limit values on each component, with
Figure PCTCN2022123731-appb-000031
Figure PCTCN2022123731-appb-000032
and the addition is performed by broadcasting.
The trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.
Of the preprocessed data sets above, the extended data set $(X'', y)$ is used by the basic feature analysis module and the false negative sample identification module, and the standardized data set $(X', y)$ is used by the prediction model construction module and the assistant decision-making module.
三、基础特征分析模块:使用逻辑回归模型,将无标签样本视为负样本,训练获得在不考虑假阴性样本的情况下,各特征维度对产生真实诊断标签的特征权重,包括:3. Basic feature analysis module: use the logistic regression model to treat unlabeled samples as negative samples, and obtain the feature weights of each feature dimension for generating true diagnostic labels without considering false negative samples during training, including:
All unlabeled samples are treated as negative samples, and a logistic regression model $M_0$ is built on the preprocessed extended data set $(X'', y)$. The loss function $L_0(w, t, b)$ of $M_0$ is:
Figure PCTCN2022123731-appb-000033
$$p_0(x''_i) = \mathrm{sig}(w^T x''_i + b)$$
where
Figure PCTCN2022123731-appb-000034
is the trainable feature weight vector, $b$ is the trainable intercept,
Figure PCTCN2022123731-appb-000035
denotes the $i$-th sample after data expansion, in feature-vector form, and $y_i$ is the true diagnostic label of the $i$-th sample; $\mathrm{sig}(\cdot)$ is the sigmoid function, $w^T x''_i + b$ is the decision function, whose value is the decision value, and $p_0(x''_i)$ is the output probability of the logistic regression model $M_0$ after sigmoid normalization, i.e. the probability predicted by $M_0$ that sample $x''_i$ is positive. In the example, mini-batch gradient descent is used for model training, with a batch size of 500.
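A minimal sketch of training $M_0$ by mini-batch gradient descent is given below. It assumes that the loss in the figure is the standard binary cross-entropy, which the extracted text does not confirm; the function name `train_m0`, the learning rate, the epoch count, and the zero initialization are all hypothetical. Only the batch size of 500 comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_m0(X_split, y, batch_size=500, lr=0.1, epochs=50, seed=0):
    """Fit w, t, b so that p0 = sig(w^T (x_split + t) + b).

    X_split: (n, 2p) half-axis components before the offset is added.
    y:       (n,) labels, with unlabeled samples coded as 0.
    Assumes a mean binary cross-entropy loss (an assumption, see lead-in).
    """
    rng = np.random.default_rng(seed)
    n, d = X_split.shape
    w, t, b = np.zeros(d), np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb = X_split[idx] + t                 # x'' for this batch
            err = sigmoid(Xb @ w + b) - y[idx]    # dL/dz for cross-entropy
            grad_w = Xb.T @ err / len(idx)
            grad_t = w * err.mean()               # dz/dt = w for every sample
            grad_b = err.mean()
            w -= lr * grad_w
            t -= lr * grad_t
            b -= lr * grad_b
    return w, t, b
```

The returned `w` and `t` are the quantities that the later modules reuse: `w` is split between the two feature classes, and `t` is kept fixed.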
4. False negative sample identification module: the feature dimensions are divided into directly related dimensions and competing dimensions. From a medical point of view, a directly related dimension directly affects the determination of the target examination result, whereas a competing dimension does not, but it competes with the target examination result for attention, so that the target result may be omitted and a false negative sample produced. Two logistic regression models and a joint loss function are constructed and trained jointly; the joint loss function screens true negative and false negative samples, so that the directly related dimensions separate positive samples from the screened suspected true negatives as well as possible, and the competing dimensions separate positive samples from the screened suspected false negatives as well as possible. A false negative indicator expresses the likelihood that a sample is a false negative. The module includes:
Following the logic by which examination results are produced in clinical practice, the feature dimensions are divided into two classes: directly related dimensions (class D) and competing dimensions (class F). Features in class D directly affect, from a medical point of view, the determination of the target examination result. Features in class F do not directly affect that determination, but they compete with the target examination result for attention, which may cause the target result to be omitted and produce a false negative sample. The feature weight vector $w$ therefore results from the joint action of both classes of features. The core idea of the false negative sample identification module is to identify the class D and class F features by induction from the data, and thereby to evaluate the likelihood that an unlabeled sample is a false negative.
Take the feature weight vector $w$ trained in the basic feature analysis module, and define trainable non-negative matrices $A_D, A_F \in [0,1]^{2p \times 2p}$ whose sum is the identity matrix, $E = A_D + A_F$. Then:
$$p_0(x''_i) = \mathrm{sig}(w^T x''_i + b) = \mathrm{sig}(w^T E x''_i + b) = \mathrm{sig}(w^T (A_D + A_F) x''_i + b)$$
Here, the decision value contributed by the class D features is $w^T A_D x''_i$, which should separate the positive sample set $S_P$ from the true negative sample set $S_{TN}$ as well as possible; the decision value contributed by the class F features is $w^T A_F x''_i$, which should separate $S_P$ from the false negative sample set $S_{FN}$ as well as possible.
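The split of the decision value can be checked numerically. Note that because both matrices have entries in $[0, 1]$ and sum to the identity, their off-diagonal entries must be zero, so each is in effect a diagonal matrix of per-dimension shares. The sketch below uses that diagonal form; the random values standing in for $w$, $x''_i$ and the shares are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                           # 2p for a toy case with p = 3
w = rng.normal(size=d)          # stands in for the weights learned by M0
x = rng.normal(size=d)          # one expanded sample x''
a = rng.uniform(size=d)         # hypothetical class D share of each dimension

A_D = np.diag(a)
A_F = np.eye(d) - A_D           # enforces the constraint A_D + A_F = E

direct_part = w @ A_D @ x       # decision value contributed by class D
competing_part = w @ A_F @ x    # decision value contributed by class F
total = w @ x                   # decision value of M0 without the intercept
```

A share near 1 assigns a dimension almost entirely to class D, and a share near 0 assigns it to class F.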
Based on the above, the false negative sample identification module performs the following steps:
Build two logistic regression models $M_D$ and $M_F$, with feature weight coefficients $w_D = w^T A_D$ and $w_F = w^T A_F$ and trainable intercepts $b_D$ and $b_F$, respectively. The output probabilities of the two models after sigmoid normalization are:
$$p_D(x''_i) = \mathrm{sig}(w^T A_D x''_i + b_D)$$
$$p_F(x''_i) = \mathrm{sig}(w^T A_F x''_i + b_F)$$
$p_D(x''_i)$ is called the direct probability and $p_F(x''_i)$ the attention probability.
Under the optimal feature classification, $M_D$ should separate the positive sample set $S_P$ from the true negative sample set $S_{TN}$ as well as possible, and $M_F$ should separate $S_P$ from the false negative sample set $S_{FN}$ as well as possible. The trainable parameters are therefore $A_D$, $A_F = E - A_D$, $b_D$ and $b_F$, and the optimal parameters are obtained by minimizing the joint loss function $L_1(A_D, b_D, b_F)$ on the extended data set $(X'', y)$. The offset vector $t$ in the extended data set $(X'', y)$ takes the optimized value obtained from training $M_0$ in the basic feature analysis module and is not trained further.
Figure PCTCN2022123731-appb-000036
where
Figure PCTCN2022123731-appb-000037
is the sample class weight, which adjusts the proportion that each class of samples takes in training; the example uses
Figure PCTCN2022123731-appb-000038
$\gamma$ is the screening coefficient: as $\gamma$ increases, each part of the joint loss function screens unlabeled samples into false negatives and true negatives more strongly, but the diversity of the screened samples decreases; the example uses $\gamma = 2$;
Figure PCTCN2022123731-appb-000039
but does not take part in gradient backpropagation during model training. In the example, mini-batch gradient descent is used to jointly train the models $M_D$ and $M_F$, with a batch size of 500.
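The trade-off described for $\gamma$ can be illustrated with a toy calculation. The exact screening term is given only in the figure, so the form `p ** gamma` used below is purely an assumption chosen to show the qualitative effect, and the probabilities are hypothetical. In a real training loop such a term would be treated as a constant, since the text states that it does not take part in backpropagation.

```python
import numpy as np

def screening_share(p, gamma):
    """Share of total weight per sample under the ASSUMED term p ** gamma."""
    weights = np.asarray(p, dtype=float) ** gamma
    return weights / weights.sum()

p0 = [0.1, 0.5, 0.9]   # hypothetical M0 outputs for three unlabeled samples
for gamma in (1, 2, 4):
    print(gamma, np.round(screening_share(p0, gamma), 3))
```

Under this assumed form, a larger $\gamma$ concentrates the weight on the samples with the highest predicted probability, which matches the stated effect: stronger screening, less diversity among the screened samples.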
The joint loss function $L_1(A_D, b_D, b_F)$ is constructed as follows:
(1) For model $M_F$, the multiplicative term
Figure PCTCN2022123731-appb-000040
screens for unlabeled samples with a high output probability $p_0(x''_i)$ predicted by $M_0$; the set of screened unlabeled samples is denoted
Figure PCTCN2022123731-appb-000041
Relative to the whole unlabeled sample set $S_N$,
Figure PCTCN2022123731-appb-000042
contains a larger proportion of false negative samples.
Figure PCTCN2022123731-appb-000043
differs from the positive sample set $S_P$ on the class F (competing dimension) features, but should show no obvious difference on the class D (directly related dimension) features. Therefore, by training the model $M_F$ with $S_P$ as the positive class and
Figure PCTCN2022123731-appb-000044
as the negative class, the features belonging to class F can be identified. Training optimizes $A_F$ and $b_F$ simultaneously to obtain the best separation between
Figure PCTCN2022123731-appb-000045
and $S_P$, such that for samples
Figure PCTCN2022123731-appb-000046
the attention probability $p_F(x''_i)$ tends to 0, and for samples $x''_i \in S_P$ the attention probability $p_F(x''_i)$ tends to 1.
(2) For model $M_D$, the multiplicative term
Figure PCTCN2022123731-appb-000047
screens for unlabeled samples with a high attention probability $p_F(x''_i)$ predicted by $M_F$; the set of screened unlabeled samples is denoted
Figure PCTCN2022123731-appb-000048
Relative to the whole unlabeled sample set $S_N$,
Figure PCTCN2022123731-appb-000049
contains a larger proportion of true negative samples.
Figure PCTCN2022123731-appb-000050
differs from the positive sample set $S_P$ on the class D (directly related dimension) features, but should show no obvious difference on the class F (competing dimension) features. Therefore, by training the model $M_D$ with $S_P$ as the positive class and
Figure PCTCN2022123731-appb-000051
as the negative class, the features belonging to class D can be identified. Training optimizes $A_D$ and $b_D$ simultaneously to obtain the best separation between
Figure PCTCN2022123731-appb-000052
and $S_P$, such that for samples
Figure PCTCN2022123731-appb-000053
the direct probability $p_D(x''_i)$ tends to 0, and for samples $x''_i \in S_P$ the direct probability $p_D(x''_i)$ tends to 1.
(3) Because training is subject to the constraint $A_D + A_F = E$, the parameters must be optimized with the joint loss function $L_1(A_D, b_D, b_F)$ by training the models $M_D$ and $M_F$ jointly.
Once the optimal parameters have been obtained, the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ of each sample $x''_i \in S_N$ are computed with the models $M_D$ and $M_F$, respectively. If $x''_i$ is a false negative sample, $p_D(x''_i)$ should tend to 1 and $p_F(x''_i)$ should tend to 0. The false negative indicator $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ is used to indicate the likelihood that each sample $x''_i$ is a false negative.
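The false negative indicator is a direct product of the two model outputs. A minimal sketch, with hypothetical probabilities for three unlabeled samples and a hypothetical function name:

```python
import numpy as np

def false_negative_index(p_d, p_f):
    """r_i = p_D(x''_i) * (1 - p_F(x''_i)), computed elementwise."""
    return np.asarray(p_d) * (1.0 - np.asarray(p_f))

p_d = np.array([0.9, 0.9, 0.1])   # hypothetical direct probabilities
p_f = np.array([0.1, 0.9, 0.1])   # hypothetical attention probabilities
r = false_negative_index(p_d, p_f)
```

Only the first sample, which combines a high direct probability with a low attention probability, receives a large indicator; either a high $p_F$ or a low $p_D$ pulls the value towards 0.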
The flow of false negative sample identification is shown in Figure 2.
5. Prediction model building module: construct a multi-layer neural network and a loss function that incorporates the false negative indicator, and train the physical examination auxiliary decision-making model on the standardized data set and the false negative indicator, including:
Based on the standardized data set $(X', y)$ and the false negative indicators $r = [r_1, \ldots, r_n] \in (0,1)^n$ of the samples, build a multi-layer neural network $M_{net}$ with $p$ input-layer nodes, 1 output-layer node, a sigmoid output-layer activation function, and inter-layer transfer matrix set $W_{net}$. The output of sample $x'_i \in X'$ after passing through $M_{net}$ is defined as
Figure PCTCN2022123731-appb-000054
and the vector of all outputs is denoted
Figure PCTCN2022123731-appb-000055
The optimal parameters of $M_{net}$ are then obtained by minimizing the loss function $L_2(W_{net})$, which incorporates the false negative indicator.
Figure PCTCN2022123731-appb-000056
$M_{net}$ is then the constructed physical examination auxiliary decision-making model, optimized with the false negative indicator. The model construction flow is shown in Figure 3.
In the example, a three-layer neural network $M_{net}$ is built with $p = 45$ input-layer nodes, 20 middle-layer nodes and 1 output-layer node. The inter-layer transfer matrix set is $W_{net} = \{W_{12}, W_{23}\}$, where $W_{12}$ is the transfer matrix from the input layer to the middle layer and $W_{23}$ is the transfer matrix from the middle layer to the output layer; the activation functions between layers are {ReLU, sigmoid}. Mini-batch gradient descent is used for model training, with a batch size of 500.
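The forward pass of the example network can be sketched as below. Only the layer sizes, the activations and the batch size come from the text; the random initialization and input data are hypothetical, bias terms are omitted because the text names only the two transfer matrices, and the loss $L_2$ is not implemented because its form is given only in the figure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X_std, W12, W23):
    """Forward pass: input (p) -> ReLU middle layer (20) -> sigmoid output (1)."""
    hidden = np.maximum(X_std @ W12, 0.0)    # ReLU between input and middle layer
    return sigmoid(hidden @ W23).ravel()     # one probability per sample

rng = np.random.default_rng(0)
p, hidden_nodes = 45, 20
W12 = rng.normal(scale=0.1, size=(p, hidden_nodes))   # hypothetical initialization
W23 = rng.normal(scale=0.1, size=(hidden_nodes, 1))
X_batch = rng.normal(size=(500, p))                   # one mini-batch of standardized samples
y_hat = forward(X_batch, W12, W23)
```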
6. Auxiliary decision-making module: based on the examinee's physical examination data, obtain a standardized feature vector through the data preprocessing module, obtain a prediction result from the physical examination auxiliary decision-making model, and output it to the clinician as the auxiliary decision-making result, including:
The $p$ examination indicators of a single examinee that correspond to the feature dimensions are passed through the data preprocessing module to obtain the standardized feature vector $x'_u$. $x'_u$ is then input into the physical examination auxiliary decision-making model built by the prediction model building module, which outputs the prediction result
Figure PCTCN2022123731-appb-000057
When
Figure PCTCN2022123731-appb-000058
tends to 1, the examination result tends to be positive; when
Figure PCTCN2022123731-appb-000059
tends to 0, the examination result tends to be negative. The prediction result is provided to the clinician as the auxiliary decision-making result.
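The inference path for one examinee can be sketched as follows. It assumes that the preprocessing module reuses the per-feature mean and standard deviation stored from the training data, which the text does not state explicitly; the function names and the `model` callable are hypothetical.

```python
import numpy as np

def standardize_examinee(x_raw, train_mean, train_std):
    """Map one examinee's raw indicators to x'_u (assumes stored training statistics)."""
    return (np.asarray(x_raw, dtype=float) - train_mean) / train_std

def assist(x_raw, train_mean, train_std, model):
    """Return the model output for one examinee.

    `model` is any callable mapping a (1, p) array to a single probability.
    """
    x_u = standardize_examinee(x_raw, train_mean, train_std)
    return float(np.ravel(model(x_u[None, :]))[0])
```

The returned value is the quantity shown to the clinician: values near 1 suggest a positive result and values near 0 a negative one.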
The above describes only preferred embodiments of the present invention. Although the invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the methods and technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the present invention.

Claims (10)

  1. A physical examination auxiliary decision-making system based on false negative sample identification, characterized by comprising:
    a data acquisition module, configured to acquire a real-world physical examination data set, matrix it into an original data set comprising an input feature matrix and true diagnostic labels, and treat samples with negative examination results as unlabeled samples;
    a data preprocessing module, configured to form a standardized data set by unifying the standard deviation and mean of each feature component in the original data set, to separate the positive and negative half-axis components of each feature component in the standardized data set, and to add the corresponding trainable upper or lower limit to each half-axis component, forming an extended data set;
    a basic feature analysis module, configured to use a logistic regression model, treating unlabeled samples as negative samples, to train and obtain the weight of each feature dimension in producing the true diagnostic label when false negative samples are not considered;
    a false negative sample identification module, configured to divide the feature dimensions into directly related dimensions and competing dimensions, wherein a directly related dimension directly affects, from a medical point of view, the determination of the target examination result, and a competing dimension does not directly affect that determination but competes with the target examination result for attention, causing the target examination result to be omitted and producing false negative samples; to construct two logistic regression models and a joint loss function and train them jointly, using the joint loss function to screen true negative samples and false negative samples, so that the directly related dimensions separate positive samples from the screened suspected true negative samples as well as possible and the competing dimensions separate positive samples from the screened suspected false negative samples as well as possible; and to indicate, by a false negative indicator, the likelihood that a sample is a false negative sample;
    a prediction model building module, configured to construct a multi-layer neural network and a loss function that incorporates the false negative indicator, and to train a physical examination auxiliary decision-making model based on the standardized data set and the false negative indicator;
    an auxiliary decision-making module, configured to obtain, based on the examinee's physical examination data, a standardized feature vector through the data preprocessing module, to obtain a prediction result from the physical examination auxiliary decision-making model, and to output it to the clinician as the auxiliary decision-making result.
  2. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 1, characterized in that, in the data acquisition module, the feature dimensions of the physical examination data comprise basic physiological indicators and routine laboratory indicators, the basic physiological indicators comprising height, weight, BMI, systolic blood pressure and diastolic blood pressure, and the routine laboratory indicators comprising routine blood and routine urine tests; the true diagnostic label is the examination result.
  3. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 1, characterized in that, in the data acquisition module, the physical examination data set is matrixed into an original data set $(X, y)$,
    Figure PCTCN2022123731-appb-100001
    Figure PCTCN2022123731-appb-100002
    is the input feature matrix, $n$ is the sample size, $p$ is the total number of examination indicators, $x_1$ to $x_n$ denote the samples, $f_1$ to $f_p$ are the feature components of the original data set on each feature dimension, and $T$ denotes transposition; $y = [y_1, y_2, \ldots, y_n] \in \{0,1\}^n$ are the true diagnostic labels of the $n$ samples, where $y_i = 1$ means that the $i$-th sample is a positive sample and $y_i = 0$ means that the $i$-th sample is a true negative or false negative sample, which is treated as an unlabeled sample; the positive sample set is denoted $S_P$, the unlabeled sample set $S_N$, the true negative sample set $S_{TN}$ and the false negative sample set $S_{FN}$, with
    Figure PCTCN2022123731-appb-100003
    and the specific sample composition of $S_P$ and $S_N$ is known, while that of $S_{TN}$ and $S_{FN}$ is unknown.
  4. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 3, characterized in that, in the data preprocessing module, each feature component in $X$ is standardized so that all examination data on each feature component have a standard deviation of 1 and a mean of 0; the standardized feature matrix is denoted
    Figure PCTCN2022123731-appb-100004
    denotes the $i$-th standardized sample, $f'_j$ is the standardized $j$-th feature component, and $X'$ and $y$ form the standardized data set $(X', y)$;
    $X'$ is expanded into the trainable feature matrix $X''$:
    $$X'' = [x''_1, x''_2, \ldots, x''_n]^T = [f'_{11}, f'_{12}, f'_{21}, f'_{22}, \ldots, f'_{p1}, f'_{p2}] + t = [f''_{11}, f''_{12}, \ldots, f''_{p1}, f''_{p2}]$$
    where
    Figure PCTCN2022123731-appb-100005
    denotes the $i$-th sample after data expansion, $f'_{j1} = \max(f'_j, 0)$ and $f'_{j2} = \min(f'_j, 0)$ are the positive and negative half-axis components of $f'_j$, respectively; $t = [t_{11}, t_{12}, t_{21}, t_{22}, \ldots, t_{p1}, t_{p2}]$ is the offset vector composed of the trainable upper and lower limits on each component,
    Figure PCTCN2022123731-appb-100006
    the addition is performed by broadcasting; the trainable feature matrix $X''$ and $y$ form the extended data set $(X'', y)$.
  5. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 4, characterized in that, in the basic feature analysis module, unlabeled samples are treated as negative samples and a logistic regression model $M_0$ is built on the extended data set $(X'', y)$; the loss function $L_0(w, t, b)$ of $M_0$ is:
    Figure PCTCN2022123731-appb-100007
    $$p_0(x''_i) = \mathrm{sig}(w^T x''_i + b)$$
    where
    Figure PCTCN2022123731-appb-100008
    is the trainable feature weight vector and $b$ is the trainable intercept; $\mathrm{sig}(\cdot)$ is the sigmoid function, $w^T x''_i + b$ is the decision function, whose value is the decision value, and $p_0(x''_i)$ is the output probability of the logistic regression model $M_0$ after sigmoid normalization.
  6. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 5, characterized in that the false negative sample identification module comprises:
    taking the feature weight vector $w$ trained in the basic feature analysis module, and defining trainable non-negative matrices $A_D, A_F \in [0,1]^{2p \times 2p}$ whose sum is the identity matrix, $E = A_D + A_F$;
    building two logistic regression models $M_D$ and $M_F$, with feature weight coefficients $w_D = w^T A_D$ and $w_F = w^T A_F$ and trainable intercepts $b_D$ and $b_F$, respectively, the output probabilities of the two models after sigmoid normalization being:
    $$p_D(x''_i) = \mathrm{sig}(w^T A_D x''_i + b_D)$$
    $$p_F(x''_i) = \mathrm{sig}(w^T A_F x''_i + b_F)$$
    where $p_D(x''_i)$ is the direct probability and $p_F(x''_i)$ is the attention probability;
    obtaining the optimal parameters by minimizing the joint loss function $L_1(A_D, b_D, b_F)$ on the extended data set $(X'', y)$;
    Figure PCTCN2022123731-appb-100009
    where
    Figure PCTCN2022123731-appb-100010
    is the sample class weight; $\gamma$ is the screening coefficient;
    Figure PCTCN2022123731-appb-100011
    but does not take part in gradient backpropagation during model training;
    for each sample $x''_i$ in the unlabeled sample set, obtaining the direct probability $p_D(x''_i)$ and the attention probability $p_F(x''_i)$ from the models $M_D$ and $M_F$, respectively, and using the false negative indicator $r_i = p_D(x''_i) \cdot (1 - p_F(x''_i))$ to indicate the likelihood that sample $x''_i$ is a false negative.
  7. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 6, characterized in that, in the false negative sample identification module, for the logistic regression model $M_F$, the multiplicative term
    Figure PCTCN2022123731-appb-100012
    screens for unlabeled samples whose output probability $p_0(x''_i)$ predicted by $M_0$ is close to 1, and the set of screened unlabeled samples is denoted
    Figure PCTCN2022123731-appb-100013
    which differs from the positive sample set $S_P$ on the class F (competing dimension) features and should show no obvious difference on the class D (directly related dimension) features; by training the model $M_F$ with $S_P$ as the positive class and
    Figure PCTCN2022123731-appb-100014
    as the negative class, the features belonging to class F are identified; training optimizes $A_F$ and $b_F$ simultaneously to obtain the best separation between
    Figure PCTCN2022123731-appb-100015
    and $S_P$, such that for samples
    Figure PCTCN2022123731-appb-100016
    the attention probability $p_F(x''_i)$ tends to 0, and for samples $x''_i \in S_P$ the attention probability $p_F(x''_i)$ tends to 1.
  8. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 6, wherein, in the false negative sample identification module, for the logistic regression model M_D, the multiplicative term
    Figure PCTCN2022123731-appb-100017
    is used to screen the unlabeled samples whose attention probability p_F(x″_i), as predicted by M_F, is close to 1; the screened set of unlabeled samples, denoted
    Figure PCTCN2022123731-appb-100018
    differs from the positive sample set S_P in the features of the directly related dimension class D and should show no obvious difference in the features of the competition dimension class F; by training the model M_D with S_P as the positive class and
    Figure PCTCN2022123731-appb-100019
    as the negative class, the features belonging to the directly related dimension class D among the feature dimensions are identified; the training process optimizes A_D and b_D simultaneously to obtain the optimal separation between
    Figure PCTCN2022123731-appb-100020
    and S_P, such that for a sample
    Figure PCTCN2022123731-appb-100021
    the direct probability p_D(x″_i) tends to 0, and for a sample x″_i ∈ S_P the direct probability p_D(x″_i) tends to 1.
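The screening step of claim 8 can be illustrated with a weighted logistic regression. The following is only a minimal sketch under stated assumptions: the multiplicative term itself is given in a formula image that is not reproduced here, so the sketch assumes it amounts to weighting each unlabeled sample's loss term by p_F(x″_i). The function name `train_weighted_logistic`, the plain gradient-descent optimizer, the absence of regularization, and the synthetic two-feature data are all hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_logistic(X_pos, X_unl, w_unl, lr=0.1, epochs=2000):
    """Fit p_D(x) = sigmoid(x @ A + b) with S_P as class 1 and the
    unlabeled samples as class 0, each unlabeled term weighted by w_unl.
    ASSUMPTION: w_unl = p_F(x) stands in for the claim's multiplicative term."""
    X = np.vstack([X_pos, X_unl])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])
    w = np.concatenate([np.ones(len(X_pos)), w_unl])
    A = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ A + b)
        g = w * (p - y)                 # weighted cross-entropy gradient w.r.t. the logits
        A -= lr * (X.T @ g) / w.sum()
        b -= lr * g.sum() / w.sum()
    return A, b

# Synthetic data: column 0 plays the role of a class-D feature, column 1 a class-F feature.
rng = np.random.default_rng(0)
X_pos = rng.normal([2.0, 1.0], 0.5, size=(50, 2))     # positive set S_P
X_scr = rng.normal([-2.0, 1.0], 0.5, size=(50, 2))    # unlabeled, p_F close to 1
X_other = rng.normal([0.0, -1.0], 0.5, size=(50, 2))  # unlabeled, p_F close to 0
X_unl = np.vstack([X_scr, X_other])
p_F = np.concatenate([np.full(50, 0.99), np.full(50, 0.01)])

A_D, b_D = train_weighted_logistic(X_pos, X_unl, p_F)
print("A_D =", A_D, "b_D =", b_D)
```

In this toy setup the unlabeled samples with p_F near 0 contribute almost nothing to the fit, so the learned weight concentrates on the column where S_P and the screened set differ, which mirrors the claim's intent of isolating class-D features.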
  9. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 6, wherein, in the prediction model building module, based on the standardized data set (X′, y) and the false negative index r = [r_1, …, r_n] ∈ (0,1)^n of each sample, a multilayer neural network M_net is constructed with p input-layer nodes, 1 output-layer node, a sigmoid function as the output-layer activation function, and W_net as the set of transition matrices between layers; the output of a sample x′_i ∈ X′ after the M_net operation is defined as
    Figure PCTCN2022123731-appb-100022
    and the optimal parameters of M_net are obtained by minimizing the loss function L_2(W_net), which introduces the false negative index:
    Figure PCTCN2022123731-appb-100023
    M_net is then the constructed physical examination auxiliary decision-making model optimized by introducing the false negative index.
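The exact form of L_2(W_net) is given in a formula image that is not reproduced here. As a hedged illustration of how a false negative index could enter a loss, the sketch below assumes a soft-label cross-entropy: a positive-labeled sample keeps target 1, while a negative-labeled sample uses its index r_i as the target. This is one plausible form and not necessarily the claimed one; the name `fn_aware_loss` is hypothetical.

```python
import numpy as np

def fn_aware_loss(y_hat, y, r, eps=1e-12):
    """Cross-entropy with soft targets (ASSUMED form, not the patent's formula):
    target t_i = 1 if y_i == 1, otherwise t_i = r_i (the false negative index)."""
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    r = np.asarray(r, dtype=float)
    t = np.where(y == 1, 1.0, r)
    return float(-np.mean(t * np.log(y_hat) + (1.0 - t) * np.log(1.0 - y_hat)))

# A negative-labeled sample with a high false negative index (r = 0.9):
# predicting "positive" is penalized less than predicting "negative".
loss_predict_pos = fn_aware_loss([0.9], [0], [0.9])
loss_predict_neg = fn_aware_loss([0.1], [0], [0.9])
print(loss_predict_pos, loss_predict_neg)
```

Under this assumed form, setting every r_i to 0 recovers the ordinary binary cross-entropy, so the index only changes the treatment of negative-labeled samples suspected of being false negatives.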
  10. The physical examination auxiliary decision-making system based on false negative sample identification according to claim 9, wherein, in the auxiliary decision-making module, the p physical examination indicators corresponding to the feature dimensions, obtained by a single examinee through physical examination, are passed through the data preprocessing module to obtain the standardized feature vector x′_u; x′_u is input into the physical examination auxiliary decision-making model built by the prediction model building module, which outputs the prediction result
    Figure PCTCN2022123731-appb-100024
    when
    Figure PCTCN2022123731-appb-100025
    tends to 1, the physical examination result tends to be positive; when
    Figure PCTCN2022123731-appb-100026
    tends to 0, the physical examination result tends to be negative; the prediction result is provided to the clinician as the physical examination auxiliary decision-making result.
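The inference path of claim 10 can be sketched as follows. This is a minimal illustration under assumptions, not the claimed implementation: z-score standardization stands in for the data preprocessing module, tanh is assumed for the hidden layers (the claims specify only the sigmoid output), no bias terms are used because the claims mention only transition matrices, and the weights are random placeholders rather than trained parameters. The function name `predict` and all variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x_u, mean, std, W_net):
    """Standardize one examinee's p indicators and run them through M_net.
    ASSUMPTIONS: z-score standardization, tanh hidden activations, no biases."""
    h = (np.asarray(x_u, dtype=float) - mean) / std
    for W in W_net[:-1]:
        h = np.tanh(h @ W)
    return sigmoid(h @ W_net[-1]).item()   # single sigmoid output in (0, 1)

rng = np.random.default_rng(0)
p = 3                                       # number of examination indicators
mean = np.array([120.0, 5.5, 24.0])         # illustrative training-set statistics
std = np.array([15.0, 1.0, 4.0])
W_net = [rng.normal(size=(p, 4)), rng.normal(size=(4, 1))]  # placeholder weights

y_u = predict([135.0, 6.8, 29.0], mean, std, W_net)
print("predicted probability of a positive result:", y_u)
```

A value near 1 would be read as tending toward a positive result and a value near 0 as tending toward a negative result; the output is meant to support the clinician's decision rather than replace it.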
PCT/CN2022/123731 2021-10-09 2022-10-07 False negative sample recognition-based physical examination assistant decision-making system WO2023056918A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111175001.6 2021-10-09
CN202111175001.6A CN113611411B (en) 2021-10-09 2021-10-09 Body examination aid decision-making system based on false negative sample identification

Publications (1)

Publication Number Publication Date
WO2023056918A1 (en) 2023-04-13

Family

ID=78343379

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123731 WO2023056918A1 (en) 2021-10-09 2022-10-07 False negative sample recognition-based physical examination assistant decision-making system

Country Status (2)

Country Link
CN (1) CN113611411B (en)
WO (1) WO2023056918A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150369A (en) * 2023-10-30 2023-12-01 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment

Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
CN113611411B (en) * 2021-10-09 2021-12-31 浙江大学 Body examination aid decision-making system based on false negative sample identification
CN113990494B (en) * 2021-12-24 2022-03-25 浙江大学 Tic disorder auxiliary screening system based on video data

Citations (6)

Publication number Priority date Publication date Assignee Title
CN107887036A (en) * 2017-11-09 2018-04-06 北京纽伦智能科技有限公司 Construction method, device and the clinical decision accessory system of clinical decision accessory system
CN109830303A (en) * 2019-02-01 2019-05-31 上海众恒信息产业股份有限公司 Clinical data mining analysis and aid decision-making method based on internet integration medical platform
CN111180068A (en) * 2019-12-19 2020-05-19 浙江大学 Chronic disease prediction system based on multi-task learning model
CN111312401A (en) * 2020-01-14 2020-06-19 之江实验室 After-physical-examination chronic disease prognosis system based on multi-label learning
US20200210899A1 (en) * 2017-11-22 2020-07-02 Alibaba Group Holding Limited Machine learning model training method and device, and electronic device
CN113611411A (en) * 2021-10-09 2021-11-05 浙江大学 Body examination aid decision-making system based on false negative sample identification

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN106383853A (en) * 2016-08-30 2017-02-08 刘勇 Realization method and system for electronic medical record post-structuring and auxiliary diagnosis
CN110084374A (en) * 2019-04-24 2019-08-02 第四范式(北京)技术有限公司 Construct method, apparatus and prediction technique, device based on the PU model learnt
US20210174448A1 (en) * 2019-12-04 2021-06-10 Michael William Kotarinos Artificial intelligence decision modeling processes using analytics and data shapely for multiple stakeholders

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117150369A (en) * 2023-10-30 2023-12-01 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment
CN117150369B (en) * 2023-10-30 2024-01-26 恒安标准人寿保险有限公司 Training method of overweight prediction model and electronic equipment

Also Published As

Publication number Publication date
CN113611411A (en) 2021-11-05
CN113611411B (en) 2021-12-31

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22877934

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE