CN114613497A - Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level - Google Patents

Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Info

Publication number
CN114613497A
Authority
CN
China
Prior art keywords
patient sample
adagbdt
sample
patient
embedding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210295832.5A
Other languages
Chinese (zh)
Inventor
朱振峰
国圳宇
常冬霞
赵耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University
Priority to CN202210295832.5A
Publication of CN114613497A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Primary Health Care (AREA)
  • Software Systems (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention provides an intelligent medical auxiliary diagnosis method for patient samples based on GBDT at the sample level. The method comprises the following steps: constructing an adaptive AdaGBDT module from historical patient data, and calculating a diagnosis result for a patient sample to be diagnosed through an iterative training process using the decision trees in the AdaGBDT module; calculating an interpretable feature embedding of the patient sample from its decision path, and accumulating the results across the different decision trees of the AdaGBDT model to obtain the feature importance embedding of the patient sample; and using a case-based reasoning system to search a historical patient case database, in the embedding space and with a self-weighted distance metric, for the several cases most similar to the patient sample, comparing those cases with the diagnosis result of the patient sample, and, if the comparison result is inconsistent, marking the patient sample as a difficult sample. The method draws more attention to cases with inconsistent prediction results, which helps improve the recall of the system and reduce the number of missed cases.

Description

Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level
Technical Field
The invention relates to the technical field of medical informatization, and in particular to a sample-level interpretable intelligent medical auxiliary diagnosis method based on GBDT (Gradient Boosting Decision Tree).
Background
In modern medical diagnosis, the rapid growth of data volume poses a great challenge to doctors, and diagnosing diseases accurately and quickly has become a problem that cannot be ignored. In recent years, significant progress has been made in applying artificial intelligence to medical diagnosis. Machine learning models, especially tree-based models, are among the most popular non-linear predictive models in today's big data scenarios because of their strong classification performance on structured data. However, interpretability tends to decrease as predictive power increases, and current work pays little attention to interpreting the predictions of such models. Machine learning models are generally more effective in disease diagnosis than traditional models based on expert knowledge; unfortunately, existing machine learning models, including deep learning models, are black boxes that clinicians cannot easily interpret. A trade-off must therefore be made between transparency and predictive power, and high-risk applications often adopt simpler, more transparent systems whose predictions the clinician can easily trace back.
If clinical medicine is to benefit from machine learning models with higher predictive power, interpretable and transparent algorithms are undoubtedly important, because erroneous predictions may have serious consequences. Clinicians must be able to understand the underlying decision-making principles of AI (Artificial Intelligence) models so that they can trust the predictions and identify cases where a model may give a false prediction.
In fact, researchers have long been concerned with how to interpret AI models. Scholars have proposed rule-based expert systems that are easy to understand. CBR (Case-Based Reasoning) solves new problems by building a historical case database and reusing the experience and results of similar historical cases. It offers another highly interpretable approach, because a physician's diagnosis and treatment strategies are typically based on historical experience, case knowledge, and pattern recognition performed by matching query cases to historical cases. However, conventional CBR systems are often constrained by the data distribution and lack sufficient accuracy. Therefore, an AI model with both high predictive ability and good interpretability is essential for deploying AI in medicine.
Tree-based machine learning models are the most popular non-linear models today. From simple decision trees and random forests to gradient boosting trees, they are widely used in finance, medicine, public health, manufacturing, and other fields. For these applications, a model must be both accurate and interpretable, where interpretability means that we can understand how the model uses input features to make predictions. However, although global interpretation methods for tree models, which summarize the influence of input features on the model as a whole, are well developed, local interpretation at the sample level, which explains the influence of input features on the prediction for a single sample, has received much less attention.
Disclosure of Invention
The embodiment of the invention provides an intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels, which aims to overcome the problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
An intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels comprises the following steps:
constructing an adaptive AdaGBDT module according to historical patient data, and calculating a diagnosis result of a patient sample to be diagnosed through an iterative training process by using a decision tree in the AdaGBDT module;
calculating interpretable feature embedding of the patient sample according to a decision path of the patient sample obtained by an AdaGBDT module, and accumulating results among different decision trees of the AdaGBDT module by utilizing the interpretable feature embedding of the patient sample to obtain feature importance embedding of the patient sample;
searching a plurality of cases which are most similar to the patient sample in a historical patient case database by using a case reasoning system in an embedding space by using self-weighted distance measurement, comparing the plurality of cases with the diagnosis result of the patient sample, and if the comparison result is inconsistent, marking the patient sample as a difficult sample; and if the comparison result is consistent, determining that the patient sample passes the verification, and storing the patient sample into a database.
Preferably, the constructing an adaptive AdaGBDT module according to historical patient data, and calculating a diagnosis result of a patient sample to be diagnosed through an iterative training process using a decision tree in the AdaGBDT module, includes:
the adaptive AdaGBDT is an ensemble of a plurality of decision trees, the number of which is adjusted dynamically according to the decay of the loss function: when the decay of the loss function is smaller than a threshold value, the training of the AdaGBDT is stopped. The specific training process is as follows:
for a patient sample x_i to be diagnosed, the loss function of the AdaGBDT model is set as:
Figure BDA0003563287130000021
the negative gradient of patient sample x_i is calculated as:
Figure BDA0003563287130000031
wherein L(·) is the loss function of the AdaGBDT model, and F(·) is the decision function of the AdaGBDT.
The sampling probability of patient sample x_i in the T-th round of training is calculated by the following formula:
Figure BDA0003563287130000032
the formula is a softmax normalization function that calculates the sampling probability p_i^T from the magnitude of the negative gradient of patient sample x_i, where exp(·) denotes exponentiation with base e;
according to the calculated sampling probability p_i^T, a new data set (X_t, Y_t) is obtained by weighted sampling with replacement, and round T+1 of the AdaGBDT is trained on the data set (X_t, Y_t);
the above process is repeated: the negative gradients of the samples are recalculated, a new data set (X_t, Y_t) is drawn by weighted sampling with replacement according to the calculated sampling probability p_i^T, and the training of the AdaGBDT model continues until a preset number of training rounds is reached or the decay of the loss function is smaller than the threshold value, whereupon training stops and the trained AdaGBDT model is obtained; the finally obtained negative gradient of patient sample x_i is taken as the diagnosis result of the patient sample x_i to be diagnosed given by the AdaGBDT model.
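The resampling step described above can be sketched in a few lines of Python. This is an illustration only, not the formula shown in the patent figures: the function names are hypothetical, the "magnitude" of the negative gradient is assumed to be its absolute value, and the gradient values are made-up numbers.

```python
import math
import random


def sampling_probabilities(neg_gradients):
    """Softmax over the magnitudes of the per-sample negative gradients."""
    magnitudes = [abs(g) for g in neg_gradients]  # assumption: magnitude = absolute value
    peak = max(magnitudes)  # subtract the maximum for numerical stability
    weights = [math.exp(m - peak) for m in magnitudes]
    total = sum(weights)
    return [w / total for w in weights]


def resample_with_replacement(X, y, probs, rng):
    """Draw len(X) samples with replacement, each picked with probability probs[i]."""
    idx = rng.choices(range(len(X)), weights=probs, k=len(X))
    return [X[i] for i in idx], [y[i] for i in idx]


rng = random.Random(0)
X = [[0.1], [0.4], [0.9], [1.5]]
y = [0, 0, 1, 1]
neg_grad = [0.05, -0.10, 0.80, -0.02]  # hypothetical residuals after round T
p = sampling_probabilities(neg_grad)
X_t, Y_t = resample_with_replacement(X, y, p, rng)
```

Samples with large residuals receive a higher sampling probability, so the next round sees the poorly fitted samples more often.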
Preferably, the calculating the interpretable feature embedding of the patient sample according to the decision path of the patient sample obtained by the AdaGBDT module, and accumulating the results of different decision trees of the AdaGBDT model by using the interpretable feature embedding of the patient sample to obtain the feature importance embedding of the patient sample includes:
using the feature embedding module based on bidirectional mutual information backtracking (BME) to backtrack from the leaf nodes of each decision tree of the trained AdaGBDT model, from bottom to top, and calculate the score information of each parent node by the following formula:
Figure BDA0003563287130000033
wherein N_x(·) represents the number of samples on a node, C_i(·) represents a child of a node, S_i represents the score at a node, and MI_i represents the mutual information between node i and its child nodes;
according to the decision path of the patient sample x_i to be diagnosed, calculating the feature importance of patient sample x_i from top to bottom by the following formula:
Figure BDA0003563287130000041
wherein S_p and S_q respectively represent the score information of a parent node and a child node, and the index q ranges over the index set of the nodes where the parent node intersects the backtracking path;
after the feature importance of patient sample x_i has been calculated on each decision tree of the AdaGBDT model, combining the feature importance of patient sample x_i over all decision trees to obtain the final feature importance embedding of the patient sample x_i to be diagnosed, with the following combination formula:
Figure BDA0003563287130000042
wherein FC^(t)(x[i]) represents the importance of the i-th feature of sample x in the t-th tree, and x^e represents the final feature importance embedding of the patient sample x_i to be diagnosed.
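The two passes over a single tree can be illustrated with a simplified sketch. It is not the BME formula of the invention: the mutual-information term is omitted, the parent score is assumed to be the sample-weighted mean of the child scores, and each score change along the decision path is credited to the splitting feature (a Saabas-style path attribution). The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    n_samples: int
    feature: Optional[int] = None   # split feature index; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    score: float = 0.0              # leaf value; computed for internal nodes


def backfill_scores(node):
    """Bottom-up pass: parent score = sample-weighted mean of child scores (assumption)."""
    if node.feature is None:
        return node.score
    left = backfill_scores(node.left)
    right = backfill_scores(node.right)
    node.score = (node.left.n_samples * left + node.right.n_samples * right) / node.n_samples
    return node.score


def path_contributions(root, x, n_features):
    """Top-down pass: credit each score change on x's decision path to the split feature."""
    contrib = [0.0] * n_features
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contrib[node.feature] += child.score - node.score
        node = child
    return contrib


# Toy tree with two splits and three leaves.
tree = Node(10, feature=0, threshold=0.5,
            left=Node(6, score=-1.0),
            right=Node(4, feature=1, threshold=2.0,
                       left=Node(1, score=0.0), right=Node(3, score=2.0)))
backfill_scores(tree)
fc = path_contributions(tree, [0.9, 3.0], n_features=2)
```

In this simplified scheme the contributions sum to the difference between the reached leaf score and the root score, which is what makes them a per-sample explanation of that tree's output.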
Preferably, the searching a plurality of cases which are most similar to the patient sample in a historical patient case database by using a case reasoning system in an embedding space by using self-weighted distance measurement, comparing the plurality of cases with the diagnosis result of the patient sample, and marking the patient sample as a difficult sample if the comparison result is inconsistent, includes:
constructing a case-based reasoning (CBR) model in the embedding space, wherein the distance metric between different samples uses self-attention weighting, with the following formula:
Figure BDA0003563287130000043
wherein
Figure BDA0003563287130000044
represent the query sample and the existing sample, respectively, and x^e[k] represents the k-th dimension of the importance embedding of sample x;
taking the patient sample x_i to be diagnosed as the query sample and the cases in the historical patient case database as the existing samples, and using the CBR model to search the historical patient case database, in the embedding space and with the self-weighted distance metric, for the k cases most similar to the patient sample x_i to be diagnosed, which serve as the inference prediction result for the patient sample x_i;
comparing the inference prediction result of the CBR model with the prediction result obtained by the AdaGBDT model; if the two are inconsistent, labeling the patient sample x_i as a difficult sample; if the two are consistent, the patient sample passes verification, and the patient sample and the prediction result are stored in the database.
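A nearest-case search in the embedding space might look like the following sketch. The exact self-attention weighting is given only in the patent figure, so the weighting here is an assumption: each dimension is weighted by the normalized absolute value of the query's own importance embedding. Function names and the embedding values are hypothetical.

```python
import math


def self_weighted_distance(query_emb, other_emb):
    """Weighted Euclidean distance whose weights come from the query embedding itself."""
    total = sum(abs(v) for v in query_emb) or 1.0  # guard against an all-zero query
    weights = [abs(v) / total for v in query_emb]
    return math.sqrt(sum(w * (q - o) ** 2
                         for w, q, o in zip(weights, query_emb, other_emb)))


def nearest_cases(query_emb, case_embs, k):
    """Return the indices of the k historical cases closest to the query."""
    order = sorted(range(len(case_embs)),
                   key=lambda j: self_weighted_distance(query_emb, case_embs[j]))
    return order[:k]


# Hypothetical feature importance embeddings of four historical cases.
history = [[1.4, 0.6], [1.6, 0.4], [-1.0, 0.1], [0.0, 3.0]]
query = [1.5, 0.5]
neighbours = nearest_cases(query, history, k=2)
```

Weighting by the query's own embedding means that differences in the features that matter most for the queried patient dominate the distance.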
According to the technical scheme provided by the embodiments of the invention, the method draws more attention to cases whose prediction results are inconsistent, which helps improve the recall of the system and reduce the number of missed cases. The system promotes cooperation between human and machine while ensuring patient safety.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation of an intelligent medical auxiliary diagnosis method for a patient sample based on GBDT sample level according to an embodiment of the present invention;
Fig. 2 is a processing flow chart of an intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample level according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the calculation of bidirectional backtracking feature importance according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the active labeling process for difficult samples according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an interpretable result according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an interpretable feature according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an active labeling result for difficult samples according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiment of the invention provides an intelligent medical auxiliary diagnosis method for a patient sample based on a GBDT sample level, which can obtain sample-level interpretable embeddings by backtracking along the decision paths of the tree model while preserving the local fidelity of the model. Based on high-quality sample embeddings, a CBR system based on self-weighted KNN (K-nearest neighbors) can be constructed in the embedding space.
The embodiment of the invention improves on gradient boosting (GBDT, gradient boosting decision tree) by implementing an adaptive GBDT, namely AdaGBDT, to speed up the convergence of the system, and extracts the feature importance information of a specific patient through a mutual-information-based bidirectional backtracking algorithm; this feature importance information can be used to provide interpretable information to doctors. By constructing the CBR system in the embedding space, similar patients can be traced, which assists doctors in diagnosis. Finally, through the cooperation of the CBR and the GBDT, a difficult sample mining technique is constructed that can mine low-confidence samples for further diagnosis by doctors.
In the embodiment of the invention, interpretable methods for tree models are divided by scope into global and local methods. For a tree model, a global interpretable method summarizes the influence of the input features on the model as a whole, while a local interpretable method examines, at a finer granularity, the influence of the input features on the decision result for the current input sample. Although global interpretation methods for tree models have a long history, locally interpretable methods have received little attention. Compared with existing interpretable methods, the embodiment of the invention achieves sample-level interpretability through a feature importance calculation method based on bidirectional backtracking of information entropy.
An implementation schematic diagram of an intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels provided by an embodiment of the present invention is shown in fig. 1, and a specific processing flow is shown in fig. 2, and includes the following processing steps:
Step S10: an adaptive AdaGBDT module is constructed from the existing patient data. The adaptive AdaGBDT module calculates the negative gradient of each patient sample through a decision tree, and uses the negative gradients to perform weighted sampling with replacement over the patient samples, which yields the samples required for the next round of AdaGBDT training. The result of each decision tree is the negative gradient between the previous round's training set and the corresponding labels, and this negative gradient is used as the label for the next round of training. The diagnosis result of the patient sample to be diagnosed is obtained once the AdaGBDT module finishes training.
Step S20: the BME (Bi-side MI based feature embedding, i.e. feature embedding based on bidirectional mutual information backtracking) module calculates the interpretable feature embedding of the patient sample to be diagnosed according to its decision path, and accumulates the results across the different decision trees of the AdaGBDT to obtain the feature importance embedding of the patient sample.
Step S30: the final negative gradient of the patient sample to be diagnosed and the feature importance embedding of the patient sample are input into a CBR module. The CBR module searches the historical patient case database, in the embedding space and using a self-weighted distance metric, for the several cases most similar to the patient sample and compares them with the diagnosis result of the patient sample. If the comparison result is inconsistent, the patient sample is marked as a difficult sample; if the comparison result is consistent, the patient sample is determined to have passed verification and is stored in the database.
Further, step S10 specifically includes the following. The adaptive AdaGBDT is an ensemble of multiple decision trees, the number of which is adjusted dynamically according to the decay of the loss function: when the decay of the loss function is smaller than the threshold, the training of the AdaGBDT is stopped. The specific training process is as follows:
For sample x_i, the loss function of the AdaGBDT model is:
Figure BDA0003563287130000071
The corresponding negative gradient of the sample is calculated as:
Figure BDA0003563287130000072
wherein L(·) is the loss function of the AdaGBDT model, and F(·) is the decision function of the AdaGBDT.
The sampling probability of sample x_i in the T-th round of training is calculated by the following formula:
Figure BDA0003563287130000073
The formula is a softmax normalization function that calculates the probability from the magnitude of the negative gradient of the sample, where exp(·) denotes exponentiation with base e.
Based on the calculated sampling probability, a new data set (X_t, Y_t) is obtained by weighted sampling with replacement, and round T+1 of the AdaGBDT is trained on the data set (X_t, Y_t).
The above steps are repeated: the negative gradients of the samples are recalculated, a new data set (X_t, Y_t) is drawn by weighted sampling with replacement according to the calculated sampling probability p_i^T, and the training of the AdaGBDT model continues until a preset number of training rounds is reached or the decay of the loss function is smaller than the threshold, whereupon training stops and the trained AdaGBDT model is obtained. The finally obtained negative gradient of patient sample x_i is taken as the diagnosis result of the patient sample x_i to be diagnosed given by the AdaGBDT model.
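The stopping rule, under which the number of trees is decided by how much the loss still decays, can be sketched as a small training driver. The callback `fit_round`, the parameter names, and the toy loss curve are hypothetical; a real implementation would fit one decision tree per round on the resampled data.

```python
def train_until_converged(fit_round, max_rounds, min_decay):
    """Run boosting rounds until the loss decay drops below min_decay or max_rounds is hit.

    fit_round(t) is a hypothetical callback that trains round t and returns the loss.
    """
    losses = []
    for t in range(max_rounds):
        loss = fit_round(t)
        converged = bool(losses) and (losses[-1] - loss) < min_decay
        losses.append(loss)
        if converged:
            break
    return losses


# Toy stand-in for real training: the loss halves in every round.
loss_curve = train_until_converged(lambda t: 1.0 / 2 ** t, max_rounds=50, min_decay=0.01)
n_trees = len(loss_curve)
```

With this toy curve the loop stops well before the round limit, which is how the ensemble size adapts to the data instead of being fixed in advance.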
Further, step S20 specifically includes a feature importance extraction algorithm based on bidirectional mutual information backtracking along decision paths, where a decision path is the path a sample follows in a decision tree from the root node to a leaf node. Fig. 3 is a schematic diagram of the calculation of bidirectional backtracking feature importance according to an embodiment of the present invention, and the specific operation process is as follows:
The AdaGBDT model is trained according to the AdaGBDT training method described above.
The BME module backtracks from the leaf nodes of each decision tree of the AdaGBDT model, from bottom to top, and calculates the score information of each parent node by the following formula:
Figure BDA0003563287130000081
wherein N_x(·) represents the number of samples on a node, C_i(·) represents a child of a node, S_i represents the score at a node, and MI_i represents the mutual information between node i and its child nodes.
According to the decision path of the patient sample x_i to be diagnosed, the feature importance information of the sample is calculated from top to bottom by the following formula:
Figure BDA0003563287130000082
wherein S_p and S_q respectively represent the score information of a parent node and a child node, and the index q ranges over the index set of the nodes where the parent node intersects the backtracking path.
After the feature importance of patient sample x_i has been calculated on each decision tree of the AdaGBDT model, the feature importance of patient sample x_i over all decision trees is combined to obtain the final feature importance embedding of the patient sample x_i to be diagnosed. The combination formula is as follows:
Figure BDA0003563287130000091
wherein FC^(t)(x[i]) represents the importance of the i-th feature of the sample in the t-th tree, and x^e represents the final feature importance embedding of the patient sample x_i to be diagnosed.
After the AdaGBDT is constructed, the system calculates the score of each parent node from bottom to top according to the contribution between parent and child nodes, that is, by mutual information backtracking; it then calculates the personalized interpretable feature embedding of each sample from top to bottom along the sample's decision path, and finally accumulates the results across the different decision trees of the AdaGBDT to obtain the final feature importance embedding of the sample.
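The final accumulation across trees can be shown with a short sketch. It assumes that "accumulating" means a plain per-feature sum over all trees; the combination formula itself is only available in the patent figure, and the contribution values below are invented for illustration.

```python
def importance_embedding(per_tree_contribs):
    """Sum the per-feature contributions of one sample over all trees (assumed plain sum)."""
    n_features = len(per_tree_contribs[0])
    return [sum(tree[i] for tree in per_tree_contribs) for i in range(n_features)]


# Hypothetical contributions FC^(t)(x[i]) for 3 trees and 4 features.
fc_per_tree = [
    [0.50, 0.00, -0.25, 0.000],
    [0.25, 0.00, 0.00, 0.125],
    [0.00, -0.50, 0.00, 0.000],
]
x_e = importance_embedding(fc_per_tree)

# Index of the feature with the largest absolute importance for this sample.
top_feature = max(range(len(x_e)), key=lambda i: abs(x_e[i]))
```

The resulting vector has one entry per input feature, so it can serve both as the per-patient explanation and as the coordinates of the sample in the embedding space.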
Further, step S30 specifically includes a CBR system based on adaptive distance weighting and a method for labeling difficult samples interactively with AdaGBDT. CBR is case-based reasoning: through the CBR, information on similar patients, analogous to a doctor's experience, can be retrieved, and the model's difficult samples are presented to the doctor to decide on. Fig. 4 is a schematic diagram of the active labeling process for difficult samples provided by an embodiment of the present invention, and the operation flow is as follows:
The feature importance embedding of the patient sample x_i to be diagnosed is calculated by the BME module.
A CBR model is constructed in the embedding space, where the distance metric between different samples uses self-attention weighting, with the following formula:
Figure BDA0003563287130000092
wherein
Figure BDA0003563287130000093
represent the query sample and the existing sample, respectively, and x^e[k] represents the k-th dimension of the importance embedding of sample x.
The patient sample x_i to be diagnosed is taken as the query sample and the cases in the historical patient case database as the existing samples. The CBR model searches the historical patient case database, in the embedding space and using the self-weighted distance metric, for the k cases most similar to the patient sample x_i to be diagnosed; these serve as the inference prediction result for the patient sample x_i and provide more auxiliary information for doctors.
The inference prediction result of the CBR model is compared with the prediction result obtained by the AdaGBDT model. If the two are inconsistent, the patient sample x_i is labeled as a difficult sample and visualized using t-SNE, the patient sample information is submitted to a doctor for diagnosis, and the patient sample is stored in the database once the diagnosis has produced a result. If the two are consistent, the prediction result passes verification and the patient sample is stored in the database. Fig. 7 is a schematic diagram of an active labeling result for difficult samples according to an embodiment of the present invention.
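The comparison and routing logic at the end of this flow can be sketched as follows. The sketch assumes that the CBR prediction is the majority label among the k retrieved cases, which the text does not state explicitly; the function names, the dictionary used as a database, and the review queue are all hypothetical.

```python
from collections import Counter


def cbr_vote(neighbour_labels):
    """Majority label among the k retrieved cases (assumed aggregation rule)."""
    return Counter(neighbour_labels).most_common(1)[0][0]


def route_sample(sample_id, gbdt_label, neighbour_labels, database, review_queue):
    """Store samples on which both models agree; queue disagreements for the doctor."""
    if cbr_vote(neighbour_labels) == gbdt_label:
        database[sample_id] = gbdt_label
        return "verified"
    review_queue.append(sample_id)
    return "difficult"


database, review_queue = {}, []
r1 = route_sample("p1", 1, [1, 1, 0], database, review_queue)
r2 = route_sample("p2", 0, [1, 1, 0], database, review_queue)
```

Only samples on which the two models agree are stored automatically; every disagreement ends up in the review queue, where a doctor supplies the final diagnosis before the sample is stored.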
(1) The adaptive AdaGBDT module is used for being responsible for tree model construction and case decision path mining, and the traditional GBDT is fast in convergence on simple samples and slow in convergence on difficult samples, so that an overfitting phenomenon is easily caused, and noise is introduced into subsequent parts. Therefore, the embodiment of the invention provides a sample adaptive weighted gradient lifting tree AdaGBDT which can effectively accelerate the convergence rate and reduce the overfitting phenomenon. The simple sample is a sample with high convergence rate in AdaGBDT and less iteration times, and the difficult sample is just the opposite.
(2) The BME (Bi-side MI based feature embedding) module is used to extract the feature importance embedding of a sample. Compared with existing single-side or distribution-based methods, the BME module provided by the embodiment of the invention takes into account the role that a node's splitting feature plays in its descendants, and is therefore fairer when calculating feature importance.
The BME module extracts the feature importance embedding based on bilateral mutual information. It makes full use of the score information in the regression trees and considers feature importance more comprehensively. The CBR module performs active labeling based on AdaGBDT and CBR. Visualization provides an intuitive form of interpretability, CBR retrieves information on similar patients in a way that matches doctors' experience, and information on the samples the model finds difficult is provided to the doctor for consideration.
(3) The hard sample mining module based on active learning addresses the fact that feature importance alone is not enough to help a doctor make a decision. Following the actual diagnosis process, it uses the CBR method to search the embedding space for the k cases most similar to the current case, providing more auxiliary information for the doctor. It also performs decision-level fusion of the decision results of AdaGBDT and CBR, and mines hard samples from inconsistencies between the two. Fig. 5 is a schematic diagram of an interpretable result according to an embodiment of the present invention, and Fig. 6 is a schematic diagram of an interpretable feature according to an embodiment of the present invention. Fig. 5 compares the global importance index of the method of the present invention and of other existing methods against the gold standard index; the method of the present invention agrees more closely with the gold standard at the global level. Fig. 6 shows the importance of personalized features in the intelligent medical auxiliary diagnosis system: the differing importance of a patient's features to the diagnosis result can be seen directly, giving clinicians a more convincing basis for diagnosis.
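For the BME module in item (2), the final step of combining per-tree feature importances into one embedding can be sketched as below. The sketch assumes the combination is a plain element-wise sum over trees, which is how the claims describe it ("accumulating results among different decision trees"); the exact formula appears only in a figure. The per-tree scores are taken as given inputs, and `combine_tree_importances` is a hypothetical name.

```python
def combine_tree_importances(per_tree_importances):
    """Sum per-tree feature importance vectors into one embedding.

    per_tree_importances[t][i] is the importance of feature i for this
    sample in tree t. ASSUMPTION: trees are combined by plain summation.
    """
    if not per_tree_importances:
        raise ValueError("at least one tree is required")
    n_features = len(per_tree_importances[0])
    embedding = [0.0] * n_features
    for tree_scores in per_tree_importances:
        if len(tree_scores) != n_features:
            raise ValueError("all trees must score the same features")
        for i, score in enumerate(tree_scores):
            embedding[i] += score
    return embedding
```

The resulting vector has one entry per input feature, so every patient sample is embedded in the same space regardless of which path it takes through each tree.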
GBDT is one of the most popular models in machine learning, but it overfits easily. A GBDT model converges quickly on the easy samples in a case set but needs more iterations for the difficult samples, and those extra iterations inevitably introduce noise on the simple samples that have already converged. The embodiment of the invention therefore constructs the adaptive AdaGBDT module, which in each iteration performs adaptive sample weighting according to the gradients obtained in the previous iteration.
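The adaptive weighting step can be sketched as follows, based on the description in the claims: a softmax over the magnitudes of the per-sample negative gradients gives a sampling probability, and the next round's training set is drawn by weighted sampling with replacement. This is a sketch under assumptions, not the patented implementation: the loss function and the exact softmax form are shown only in figures, the gradients are taken as given inputs, and the function names are hypothetical.

```python
import math
import random


def sampling_probabilities(negative_gradients):
    """Softmax over gradient magnitudes.

    Samples with larger residual gradients (harder samples) receive a
    higher probability of being drawn for the next round.
    """
    magnitudes = [abs(g) for g in negative_gradients]
    m = max(magnitudes)
    exps = [math.exp(v - m) for v in magnitudes]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def resample_with_replacement(X, y, probabilities, rng):
    """Draw a data set of the same size by weighted sampling with replacement."""
    indices = rng.choices(range(len(X)), weights=probabilities, k=len(X))
    return [X[i] for i in indices], [y[i] for i in indices]
```

A typical call would pass `random.Random(seed)` as `rng` so that the resampling is reproducible. Samples that have already converged have small gradients and are drawn less often, which is the mechanism the text credits with reducing the noise that extra iterations introduce on simple samples.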
The active learning-based hard sample mining module performs active labeling of hard samples based on AdaGBDT and CBR. CBR is a form of analogical reasoning centered on historical knowledge, and a KNN-based CBR system faces two major questions: (1) in which space should similarity be measured? (2) with which metric should the similarity of two cases be measured? To address both, the embodiment of the invention provides a self-attention weighted metric in a homogeneous feature space, which avoids the problem of inconsistent data dimensions in the original space and adapts as feature importance changes. On this basis, the embodiment provides an active labeling technique based on CBR and AdaGBDT. First, a sample obtains two prediction results, one from AdaGBDT and one from CBR. Second, the two results are checked for consistency. If they are consistent, the prediction result passes verification and is recorded, and the sample is stored in the database. If they are inconsistent, the sample information is submitted to a doctor for diagnosis, and the sample is stored in the database after the diagnosis result is obtained.
In conclusion, the method provided by the embodiment of the invention increases the attention paid to cases with inconsistent prediction results, which helps improve the recall of the system and reduce the number of missed cases. The system promotes cooperation between human and machine while ensuring patient safety.
By introducing the adaptive AdaGBDT and performing adaptive sample weighting in each iteration according to the gradients obtained in the previous iteration, the embodiment of the invention greatly reduces the number of training rounds required and avoids introducing noise, while keeping precision unchanged.
The invention addresses the interpretability problem of conventional artificial-intelligence-based medical auxiliary diagnosis systems, helps accelerate their practical deployment, strengthens trust between doctors and the system, improves the speed and accuracy of medical diagnosis, and reduces the burden on medical staff.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in some parts of the embodiments.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one to another, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant details can be found in the corresponding descriptions of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (4)

1. An intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels, comprising:
constructing an adaptive AdaGBDT module according to historical patient data, and calculating a diagnosis result of a patient sample to be diagnosed through an iterative training process by using a decision tree in the AdaGBDT module;
calculating interpretable feature embedding of the patient sample according to a decision path of the patient sample obtained by an AdaGBDT module, and accumulating results among different decision trees of the AdaGBDT module by utilizing the interpretable feature embedding of the patient sample to obtain feature importance embedding of the patient sample;
searching, by a case-based reasoning system in an embedding space and using a self-weighted distance measure, a historical patient case database for a plurality of cases most similar to the patient sample, comparing the plurality of cases with the diagnosis result of the patient sample, and, if the comparison result is inconsistent, marking the patient sample as a difficult sample; if the comparison result is consistent, determining that the patient sample passes verification and storing the patient sample in a database.
2. The method of claim 1, wherein constructing the adaptive AdaGBDT module according to historical patient data and calculating the diagnosis result of the patient sample to be diagnosed through an iterative training process by using the decision trees in the AdaGBDT module comprises:
the adaptive AdaGBDT is an ensemble of a plurality of decision trees whose number is adjusted dynamically through the attenuation of the loss function: when the attenuation of the loss function is smaller than a threshold, training of the AdaGBDT is stopped. The specific training process is as follows:
for a patient sample x_i to be diagnosed, the loss function of the AdaGBDT model is set as:
Figure FDA0003563287120000011
calculating the negative gradient of patient sample x_i as:
Figure FDA0003563287120000012
wherein L(·) is the loss function of the AdaGBDT model and F(·) is the decision function of the AdaGBDT;
calculating the sampling probability of patient sample x_i in the T-th round of training, according to the formula:
Figure FDA0003563287120000013
the formula is a softmax normalization function that calculates the sampling probability p_i^T from the magnitude of the negative gradient of patient sample x_i, where exp(·) denotes the exponential function with base e;
obtaining a new data set (X_t, Y_t) by weighted sampling with replacement according to the calculated sampling probability p_i^T, and performing round T+1 of AdaGBDT training on the data set (X_t, Y_t);
repeating the above steps, continuing to calculate the negative gradients of the samples, obtaining a new data set (X_t, Y_t) by weighted sampling with replacement according to the calculated sampling probability p_i^T, and continuing to train the AdaGBDT model until a preset number of training rounds is reached or the attenuation of the loss function is smaller than the threshold, at which point training is stopped and the trained AdaGBDT model is obtained; the finally obtained negative gradient of patient sample x_i is taken as the diagnosis result, obtained by the AdaGBDT model, of the patient sample x_i to be diagnosed.
3. The method of claim 2, wherein calculating the interpretable feature embedding of the patient sample according to the decision path of the patient sample obtained by the AdaGBDT module, and accumulating results among different decision trees of the AdaGBDT model by using the interpretable feature embedding of the patient sample to obtain the feature importance embedding of the patient sample, comprises:
using the BME module, which performs feature embedding based on bi-side mutual information backtracking, to backtrack bottom-up from the leaf nodes of each decision tree of the trained AdaGBDT model and calculate the score information of each parent node, according to the formula:
Figure FDA0003563287120000021
wherein N_x(·) denotes the number of samples at a node, C_i(·) denotes a child of a node, S_i denotes the score at a node, and MI_i denotes the mutual information between node i and its child nodes;
calculating, from top to bottom along the decision path of the patient sample x_i to be diagnosed, the feature importance of the patient sample x_i, according to the formula:
Figure FDA0003563287120000022
wherein S_p and S_q denote the score information of a parent node and a child node, respectively, and the index q belongs to the index set of the nodes at which the parent node intersects the backtracking path;
after calculating the feature importance of patient sample x_i on each decision tree of the AdaGBDT model, combining the feature importances of patient sample x_i over all decision trees to obtain the final feature importance embedding of the patient sample x_i to be diagnosed, the combination formula being:
Figure FDA0003563287120000031
wherein FC^(t)(x[i]) denotes the importance of the i-th feature of sample x in the t-th tree, and x_e denotes the final feature importance embedding of the patient sample x_i to be diagnosed.
4. The method of claim 3, wherein searching, by the case-based reasoning system in the embedding space and using the self-weighted distance measure, the historical patient case database for the plurality of cases most similar to the patient sample, comparing the plurality of cases with the diagnosis result of the patient sample, marking the patient sample as a difficult sample if the comparison result is inconsistent, and determining that the patient sample passes verification and storing the patient sample in the database if the comparison result is consistent, comprises:
constructing a case-based reasoning (CBR) model in the embedding space, in which the distance between different samples is measured with self-attention weighting, according to the formula:
Figure FDA0003563287120000032
wherein
Figure FDA0003563287120000033
representing the query sample and the existing sample, respectively, and x_e[k] represents the k-th dimension of the importance embedding of sample x;
taking the patient sample x_i to be diagnosed as the query sample and the cases in the historical patient case database as existing samples, and using the CBR model to search the historical patient case database, in the embedding space and with the self-weighted distance measure, for the k cases most similar to the patient sample x_i, the k cases serving as the reasoning prediction result for the patient sample x_i;
comparing the reasoning prediction result of the CBR model with the prediction result obtained by the AdaGBDT model, and, if the comparison result is inconsistent, labeling the patient sample x_i as a difficult sample; if the comparison result is consistent, the prediction result passes verification, and the patient sample and the prediction result are stored in the database.
CN202210295832.5A 2022-03-24 2022-03-24 Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level Pending CN114613497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210295832.5A CN114613497A (en) 2022-03-24 2022-03-24 Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210295832.5A CN114613497A (en) 2022-03-24 2022-03-24 Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Publications (1)

Publication Number Publication Date
CN114613497A true CN114613497A (en) 2022-06-10

Family

ID=81865147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210295832.5A Pending CN114613497A (en) 2022-03-24 2022-03-24 Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Country Status (1)

Country Link
CN (1) CN114613497A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304932A (en) * 2023-05-19 2023-06-23 湖南工商大学 Sample generation method, device, terminal equipment and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304932A (en) * 2023-05-19 2023-06-23 湖南工商大学 Sample generation method, device, terminal equipment and medium
CN116304932B (en) * 2023-05-19 2023-09-05 湖南工商大学 Sample generation method, device, terminal equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination