CN114613497A

CN114613497A - Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Info

Publication number: CN114613497A
Application number: CN202210295832.5A
Authority: CN
Inventors: 朱振峰; 国圳宇; 常冬霞; 赵耀
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-06-10

Abstract

The invention provides a GBDT-based, sample-level intelligent medical auxiliary diagnosis method for patient samples. The method comprises the following steps: constructing an adaptive AdaGBDT module from historical patient data, and calculating a diagnosis result for a patient sample to be diagnosed through an iterative training process using the decision trees in the AdaGBDT module; calculating an interpretable feature embedding of the patient sample according to its decision path, and accumulating the results across the different decision trees of the AdaGBDT model to obtain a feature importance embedding of the patient sample; and using a case-based reasoning system with a self-weighted distance metric in the embedding space to search a historical patient case database for the cases most similar to the patient sample, comparing those cases with the diagnosis result of the patient sample, and, if the comparison result is inconsistent, marking the patient sample as a difficult sample. The method draws more attention to cases with inconsistent prediction results, which helps improve the recall of the system and reduce the number of missed cases.

Description

Intelligent medical auxiliary diagnosis method of patient sample based on GBDT sample level

Technical Field

The invention relates to the technical field of medical informatization, in particular to a sample-level interpretable intelligent medical auxiliary diagnosis method based on GBDT (Gradient Boosting Decision Tree).

Background

In modern medical diagnosis, the rapid growth in data volume poses a great challenge to doctors, and diagnosing diseases accurately and quickly has become a problem that cannot be ignored. In recent years, significant progress has been made in applying artificial intelligence to medical diagnosis. Machine learning models, especially tree-based models, are among the most popular non-linear predictive models in today's big data scenarios because of their strong classification performance on structured data. However, interpretability tends to decrease as predictive power increases, and current work pays little attention to interpreting these models' predictions. Machine learning models are generally more effective in disease diagnosis than traditional models based on expert knowledge; unfortunately, existing machine learning models, including deep learning models, are black-box models that clinicians cannot easily interpret. A trade-off must therefore be made between transparency and predictive power, and high-risk applications often employ simpler, more transparent systems whose predictions the clinician can easily trace back.

If clinical medicine is to benefit from machine learning models with higher predictive power, interpretable and transparent machine learning algorithms are undoubtedly important, because erroneous predictions may have serious consequences. Clinicians must be able to understand the underlying decision-making principles of AI (Artificial Intelligence) models so that they can trust the predictions and identify cases where a model may give a false prediction.

In fact, researchers have long been concerned with how to interpret AI models. Scholars have proposed rule-based expert systems that are easy to understand. CBR (Case-Based Reasoning) solves new problems by building a historical case database and drawing on the experience and outcomes of similar historical cases. It offers another highly interpretable approach, because a physician's diagnosis and treatment strategies are typically based on historical experience, case knowledge, and pattern recognition that matches the query case to historical cases. However, conventional CBR systems are often limited by the data distribution and lack sufficient accuracy. An AI model with both high predictive ability and good interpretability is therefore essential for deploying AI in the medical field.

Tree-based machine learning models are the most popular non-linear models today. Models ranging from basic decision trees and random forests to gradient boosting trees are widely used in finance, medicine, public health, manufacturing, and other fields. For these applications, the model must be both accurate and interpretable, where interpretability means that we can understand how the model uses the input features to make predictions. Global interpretation methods for tree models, which summarize the influence of the input features on the model as a whole, are well developed. Local, sample-level interpretation, which explains the influence of the input features on the prediction for a single sample, has received far less attention.

Disclosure of Invention

The embodiment of the invention provides an intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels, which aims to overcome the problems in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme.

An intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels comprises the following steps:

constructing an adaptive AdaGBDT module according to historical patient data, and calculating a diagnosis result of a patient sample to be diagnosed through an iterative training process by using a decision tree in the AdaGBDT module;

calculating interpretable feature embedding of the patient sample according to a decision path of the patient sample obtained by an AdaGBDT module, and accumulating results among different decision trees of the AdaGBDT module by utilizing the interpretable feature embedding of the patient sample to obtain feature importance embedding of the patient sample;

using a case-based reasoning system with a self-weighted distance metric in the embedding space to search a historical patient case database for a plurality of cases most similar to the patient sample, comparing the plurality of cases with the diagnosis result of the patient sample, and if the comparison result is inconsistent, marking the patient sample as a difficult sample; and if the comparison result is consistent, determining that the patient sample passes the verification, and storing the patient sample into the database.

Preferably, the constructing an adaptive AdaGBDT module according to historical patient data, and calculating a diagnosis result of a patient sample to be diagnosed through an iterative training process using a decision tree in the AdaGBDT module, includes:

the adaptive AdaGBDT is an ensemble of a plurality of decision trees whose number is adjusted dynamically according to the decrease in the loss function: when the decrease in the loss function is smaller than a threshold value, the training of the AdaGBDT is stopped. The specific training process is as follows:

for a patient sample x_i to be diagnosed, the loss function of the AdaGBDT model is set as:

the negative gradient of the patient sample x_i is calculated as:

wherein L(·) is the loss function of the AdaGBDT model, and F(·) is the decision function of the AdaGBDT model.

The sampling probability p_Ti of the patient sample x_i in the T-th round of training is calculated by the following formula:

the formula is a softmax normalization function that calculates the sampling probability p_Ti according to the magnitude of the negative gradient of the patient sample x_i, wherein exp(·) represents the exponential function with base e;

according to the calculated sampling probability p_Ti, a new data set (X^t, Y^t) is obtained by weighted sampling with replacement, and round T+1 of AdaGBDT is trained on the data set (X^t, Y^t);

the above process is repeated: the negative gradient of each sample is recalculated, the sampling probability p_Ti is recalculated from it, and a new data set (X^t, Y^t) is obtained by weighted sampling with replacement to continue training the AdaGBDT model, until a preset number of training rounds is reached or the decrease in the loss function is smaller than a threshold value; training is then stopped to obtain the trained AdaGBDT model, and the finally obtained negative gradient of the patient sample x_i is taken as the diagnosis result of the patient sample x_i to be diagnosed obtained by the AdaGBDT model.

Preferably, the calculating the interpretable feature embedding of the patient sample according to the decision path of the patient sample obtained by the AdaGBDT module, and accumulating the results of different decision trees of the AdaGBDT model by using the interpretable feature embedding of the patient sample to obtain the feature importance embedding of the patient sample includes:

using a BME module, which performs feature embedding based on bidirectional mutual information backtracking, to backtrack from the leaf nodes of each decision tree of the trained AdaGBDT model, from bottom to top, and calculate the score information of each parent node, wherein the formula is as follows:

wherein N_x(·) represents the number of samples on a node, C_i(·) represents a child of a node, S_i represents the score at a node, and MI_i represents the mutual information between node i and its child nodes;

according to the decision path of the patient sample x_i to be diagnosed, the feature importance of the patient sample x_i is calculated from top to bottom by the following formula:

wherein S_p and S_q respectively represent the score information of a parent node and a child node, and the index q is the index set of the intersection nodes of the parent node and the backtracking path;

after the feature importance of the patient sample x_i has been computed on each decision tree of the AdaGBDT model, the feature importances of the patient sample x_i on all decision trees are combined to obtain the final feature importance embedding of the patient sample x_i to be diagnosed, wherein the combination formula is as follows:

wherein FC^(t)(x[i]) represents the importance of the i-th feature of sample x in the t-th tree, and x^e represents the final feature importance embedding of the patient sample x_i to be diagnosed.

a case-based reasoning (CBR) model is constructed in the embedding space, and the distance between different samples is measured with self-attention weighting, wherein the formula is as follows:

wherein the two samples in the formula respectively represent the query sample and the existing sample, and x^e[k] represents the k-th dimension of the feature importance embedding of sample x;

the patient sample x_i to be diagnosed is used as the query sample and the cases in the historical patient case database are used as the existing samples; the CBR model uses the self-weighted distance metric in the embedding space to search the historical patient case database for the k cases most similar to the patient sample x_i to be diagnosed, and these k cases serve as the inference prediction result for the patient sample x_i to be diagnosed;

the inference prediction result of the CBR model is compared with the prediction result obtained by the AdaGBDT model; if the comparison result is inconsistent, the patient sample x_i is labeled as a difficult sample; if the comparison result is consistent, the prediction result passes the verification, and the patient sample and the prediction result are stored in the database.

According to the technical scheme provided by the embodiment of the invention, the method draws more attention to cases with inconsistent prediction results, which helps improve the recall of the system and reduce the number of missed cases. The system promotes cooperation between human and machine while ensuring patient safety.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation of an intelligent medical auxiliary diagnosis method for a patient sample based on GBDT sample level according to an embodiment of the present invention;

Fig. 2 is a processing flow chart of an intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample level according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the bidirectional backtracking feature importance calculation according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the active labeling process for difficult samples according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of an interpretable result according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of an interpretable feature according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the active labeling result for difficult samples according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

The embodiment of the invention provides an intelligent medical auxiliary diagnosis method for a patient sample based on a GBDT sample level, which obtains sample-level interpretable embeddings by backtracking along the decision paths of the tree model while preserving the local fidelity of the model. On the basis of high-quality sample embeddings, a CBR system based on self-weighted KNN (K-Nearest Neighbor) can be constructed in the embedding space.

The embodiment of the invention improves on gradient boosting (GBDT) by implementing an adaptive GBDT, namely AdaGBDT (adaptive gradient boosting decision tree), to increase the convergence speed of the system. It extracts the feature importance information of a specific patient through a mutual-information-based bidirectional backtracking algorithm, and this information can provide doctors with an interpretable basis. By constructing the CBR system in the embedding space, similar patients can be traced, which assists doctors in diagnosis. Finally, through the cooperation of CBR and GBDT, a difficult sample mining technique is constructed that picks out low-confidence samples for further diagnosis by doctors.

In the embodiment of the invention, interpretability methods for tree models are divided by scope into global and local methods. A global method summarizes the influence of the input features on the model as a whole, whereas a local method seeks the finer-grained influence of the input features on the decision for the current input sample. Although global interpretation methods for tree models have a long history, local methods have received little attention. Compared with existing interpretability methods, the embodiment of the invention achieves sample-level interpretability through a feature importance calculation method based on information-entropy bidirectional backtracking.

Fig. 1 is a schematic diagram of an implementation of the intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels provided by an embodiment of the present invention, and Fig. 2 shows the specific processing flow, which includes the following processing steps:

Step S10: an adaptive AdaGBDT module is constructed from the existing patient data. The adaptive AdaGBDT module calculates the negative gradient of each patient sample through its decision trees and uses the negative gradients to perform weighted sampling with replacement on the patient samples, which yields the samples required for the next round of AdaGBDT training. Each decision tree fits the negative gradient computed between the previous round's training set and the corresponding labels, and this negative gradient serves as the label for the next round of training. When the AdaGBDT module finishes training, the diagnosis result of the patient sample to be diagnosed is obtained.

Step S20: the BME (Bi-side MI-based feature Embedding, that is, feature embedding based on bidirectional mutual information backtracking) module calculates the interpretable feature embedding of the patient sample to be diagnosed according to its decision path, and accumulates the interpretable feature embeddings across the different decision trees of AdaGBDT to obtain the feature importance embedding of the patient sample.

Step S30: the final negative gradient of the patient sample to be diagnosed and the feature importance embedding of the patient sample are input into a CBR module. The CBR module uses a self-weighted distance metric in the embedding space to search the historical patient case database for a plurality of cases most similar to the patient sample and compares those cases with the diagnosis result of the patient sample. If the comparison result is inconsistent, the patient sample is marked as a difficult sample; if the comparison result is consistent, the patient sample is determined to pass the verification and is stored in the database.

Further, in step S10, the adaptive AdaGBDT is an ensemble of multiple decision trees whose number is adjusted dynamically according to the decrease in the loss function: when the decrease in the loss function is smaller than the threshold, the training of the AdaGBDT is stopped. The specific training process is as follows:

for a sample x_i, the loss function of the AdaGBDT model is:

the corresponding negative gradient of the sample is calculated as:

The sampling probability p_Ti of sample x_i in the T-th round of training is calculated by the following formula:

the formula is a softmax normalization function that calculates the probability according to the magnitude of the negative gradient of the sample, wherein exp(·) represents the exponential function with base e.

A new data set (X^t, Y^t) is obtained by weighted sampling with replacement based on the calculated sampling probability, and round T+1 of AdaGBDT is trained on the data set (X^t, Y^t).
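The sampling step described above can be sketched as follows. This is a minimal illustration rather than the patented formula: it assumes that the softmax is taken over the absolute values of the negative gradients, and the function names `sampling_probabilities` and `resample_with_replacement`, as well as the toy data, are hypothetical.

```python
import math
import random

def sampling_probabilities(neg_gradients):
    """Softmax over the magnitudes of the per-sample negative gradients.

    Assumption: 'magnitude' is read as the absolute value of the gradient.
    """
    magnitudes = [abs(g) for g in neg_gradients]
    peak = max(magnitudes)  # subtract the maximum for numerical stability
    weights = [math.exp(m - peak) for m in magnitudes]
    total = sum(weights)
    return [w / total for w in weights]

def resample_with_replacement(X, y, probs, rng):
    """Draw len(X) samples with replacement, weighted by probs."""
    idx = rng.choices(range(len(X)), weights=probs, k=len(X))
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data (hypothetical): the third sample has the largest residual,
# so it receives the largest sampling probability.
X = [[0.1], [0.4], [0.9], [0.5]]
y = [0.0, 0.0, 1.0, 1.0]
neg_grad = [0.05, -0.10, 0.80, 0.30]

probs = sampling_probabilities(neg_grad)
X_next, y_next = resample_with_replacement(X, y, probs, random.Random(0))
```

Samples with larger gradient magnitude are drawn more often, so the next round of training concentrates on the samples that the current model fits worst.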

The above steps are repeated: the negative gradient of each sample is recalculated, the sampling probability p_Ti is recalculated from it, and a new data set (X^t, Y^t) is obtained by weighted sampling with replacement to continue training the AdaGBDT model, until a preset number of training rounds is reached or the decrease in the loss function is smaller than a threshold value. Training is then stopped to obtain the trained AdaGBDT model, and the finally obtained negative gradient of the patient sample x_i is taken as the diagnosis result of the patient sample x_i to be diagnosed obtained by the AdaGBDT model.
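The iterative loop with its two stopping conditions might look like the sketch below. It is written under several stated assumptions that are not taken from the source: a squared-error loss (so the negative gradient is simply the residual), one-split regression stumps on a single feature in place of full decision trees, a learning rate `lr`, and the choice to discard the round that fails the loss-decrease test. All names are hypothetical.

```python
import math
import random

def fit_stump(xs, targets):
    """Fit a one-split regression stump on a single feature (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lmean) ** 2 for t in left) + sum((t - rmean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x values identical: predict the mean target
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def train_adagbdt_sketch(xs, ys, max_rounds=50, min_decrease=1e-4, lr=0.5, seed=0):
    rng = random.Random(seed)
    preds = [0.0] * len(xs)
    stumps = []
    loss = sum(0.5 * (y - p) ** 2 for y, p in zip(ys, preds))
    for _ in range(max_rounds):  # stop condition 1: preset number of rounds
        # Negative gradient of the squared-error loss is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        peak = max(abs(r) for r in residuals)
        weights = [math.exp(abs(r) - peak) for r in residuals]  # unnormalized softmax
        idx = rng.choices(range(len(xs)), weights=weights, k=len(xs))
        # The resampled residuals are the labels of the next tree.
        stump = fit_stump([xs[i] for i in idx], [residuals[i] for i in idx])
        new_preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
        new_loss = sum(0.5 * (y - p) ** 2 for y, p in zip(ys, new_preds))
        if loss - new_loss < min_decrease:  # stop condition 2: loss decrease too small
            break  # assumption: the round that fails the test is discarded
        stumps.append(stump)
        preds, loss = new_preds, new_loss
    return stumps, preds, loss

# Toy one-dimensional data (hypothetical).
xs = [0.0, 0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
initial_loss = sum(0.5 * y ** 2 for y in ys)
stumps, preds, final_loss = train_adagbdt_sketch(xs, ys)
```

Because the number of accepted rounds depends on the loss decrease, the size of the ensemble is determined during training rather than fixed in advance, which matches the dynamic adjustment described above.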

Further, step S20 specifically comprises a feature importance extraction algorithm based on bidirectional mutual information backtracking along decision paths, where a decision path is the path a sample follows in a decision tree from the root node to a leaf node. Fig. 3 is a schematic diagram of the bidirectional backtracking feature importance calculation according to an embodiment of the present invention, and the specific operation process is as follows:

The AdaGBDT model is trained according to the AdaGBDT training method described above.

The BME module backtracks from the leaf nodes of each decision tree of the AdaGBDT model, from bottom to top, and calculates the score information of each parent node by the following formula:

wherein N_x(·) represents the number of samples on a node, C_i(·) represents a child of a node, S_i represents the score at a node, and MI_i represents the mutual information between node i and its child nodes.
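As an illustration of the mutual information term only, the sketch below computes the mutual information between child membership and a discrete label from sample counts. How MI_i enters the parent score is not reproduced here, and the use of discrete diagnosis labels, the function name `mutual_information`, and the count tables are assumptions made for the example.

```python
import math

def mutual_information(counts):
    """Mutual information (in nats) between child membership and class label.

    counts[c][k] is the number of samples that fall into child c of a node
    and carry label k.
    """
    total = sum(sum(row) for row in counts)
    child_totals = [sum(row) for row in counts]
    label_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for c, row in enumerate(counts):
        for k, n in enumerate(row):
            if n == 0:
                continue  # empty cells contribute nothing
            # p(c, k) * log( p(c, k) / (p(c) * p(k)) )
            mi += (n / total) * math.log(n * total / (child_totals[c] * label_totals[k]))
    return mi

# A split that separates the two labels perfectly carries log(2) nats.
perfect_split = mutual_information([[10, 0], [0, 10]])
# A split that leaves the label mix unchanged carries no information.
useless_split = mutual_information([[5, 5], [5, 5]])
```

A split whose children separate the labels well yields a large value, while a split that leaves the label distribution unchanged yields zero.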

According to the decision path of the patient sample x_i to be diagnosed, the feature importance information of the sample is calculated from top to bottom by the following formula:

wherein S_p and S_q respectively represent the score information of a parent node and a child node, and the index q is the index set of the intersection nodes of the parent node and the backtracking path.

After the feature importance of the patient sample x_i has been computed on each decision tree of the AdaGBDT model, the feature importances of the patient sample x_i on all decision trees are combined to obtain the final feature importance embedding of the patient sample x_i to be diagnosed. The combination formula is as follows:

wherein FC^(t)(x[i]) represents the importance of the i-th feature of sample x on the t-th tree, and x^e represents the final feature importance embedding of the patient sample x_i to be diagnosed.
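The two passes and the accumulation across trees can be illustrated with a simplified stand-in. The sketch below swaps in a known path-attribution scheme, often called Saabas-style attribution: the bottom-up score of a parent is the sample-weighted mean of its children's scores (the mutual information term of BME is omitted), the top-down pass credits each score change S_q - S_p to the feature split at the parent, and the embedding is the sum of these contributions over all trees. The tree layout as nested dictionaries and all names are hypothetical.

```python
def node_score(node):
    """Bottom-up score: a leaf's value, or the sample-weighted mean of the children."""
    if "value" in node:
        return node["value"]
    left, right = node["left"], node["right"]
    total = left["n"] + right["n"]
    return (left["n"] * node_score(left) + right["n"] * node_score(right)) / total

def path_contributions(tree, sample, n_features):
    """Top-down pass along the decision path of one sample.

    At each step from parent p to child q, the score change S_q - S_p is
    credited to the feature that the parent splits on.
    """
    contrib = [0.0] * n_features
    node = tree
    while "value" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        contrib[node["feature"]] += node_score(child) - node_score(node)
        node = child
    return contrib

def importance_embedding(trees, sample, n_features):
    """Sum the per-tree contributions into one embedding vector."""
    embedding = [0.0] * n_features
    for tree in trees:
        for k, c in enumerate(path_contributions(tree, sample, n_features)):
            embedding[k] += c
    return embedding

# Two hypothetical trees over two features; "n" is the sample count at a node.
tree_1 = {"feature": 0, "threshold": 0.5, "n": 10,
          "left": {"value": -1.0, "n": 6},
          "right": {"feature": 1, "threshold": 2.0, "n": 4,
                    "left": {"value": 0.5, "n": 2},
                    "right": {"value": 2.5, "n": 2}}}
tree_2 = {"feature": 1, "threshold": 1.0, "n": 10,
          "left": {"value": -0.5, "n": 5},
          "right": {"value": 0.5, "n": 5}}

sample = [0.8, 3.0]
embedding = importance_embedding([tree_1, tree_2], sample, n_features=2)
```

In this simplified scheme the contributions along a path telescope, so for each tree the root score plus the sum of contributions equals the value of the leaf that the sample reaches.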

After the AdaGBDT is constructed, the system calculates the score of each parent node from bottom to top according to the degree of contribution between parent and child, that is, by mutual information backtracking. It then calculates the personalized interpretable feature embedding of each sample from top to bottom along the sample's decision path, and finally accumulates the results across the different decision trees of AdaGBDT to obtain the final feature importance embedding of the sample.

Further, step S30 specifically comprises a CBR system based on adaptive distance weighting and a difficult-sample labeling method that interacts with AdaGBDT. CBR is reasoning based on sample cases: through CBR, information on similar patients, analogous to a doctor's experience, can be obtained, and information on the samples the model finds difficult is provided for the doctor to choose from. Fig. 4 is a schematic diagram of the active labeling process for difficult samples provided by an embodiment of the present invention, and the operation flow is as follows:

The feature importance embedding of the patient sample x_i to be diagnosed is calculated by the BME module.

A CBR model is constructed in the embedding space, and the distance between different samples is measured with self-attention weighting, wherein the formula is as follows:

wherein the two samples in the formula respectively represent the query sample and the existing sample, and x^e[k] represents the k-th dimension of the feature importance embedding of sample x.

The patient sample x_i to be diagnosed is used as the query sample and the cases in the historical patient case database are used as the existing samples. The CBR model uses the self-weighted distance metric in the embedding space to search the historical patient case database for the k cases most similar to the patient sample x_i to be diagnosed. These k cases serve as the inference prediction result for the patient sample x_i to be diagnosed and provide more auxiliary information for doctors.
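A nearest-case search with a self-weighted distance might be sketched as follows. The exact distance formula is not reproduced in the text, so this example makes an explicit assumption: the per-dimension weights are a softmax over the absolute values of the query sample's own importance embedding, and the distance is a weighted Euclidean distance. The function names and the small case database are hypothetical.

```python
import math

def self_weights(query_emb):
    """Assumed weighting: softmax over the magnitudes of the query's own embedding."""
    magnitudes = [abs(v) for v in query_emb]
    peak = max(magnitudes)
    exps = [math.exp(m - peak) for m in magnitudes]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_distance(query_emb, case_emb):
    """Weighted Euclidean distance in the embedding space."""
    weights = self_weights(query_emb)
    return math.sqrt(sum(w * (q - c) ** 2
                         for w, q, c in zip(weights, query_emb, case_emb)))

def k_nearest_cases(query_emb, case_db, k):
    """case_db is a list of (case_id, embedding, diagnosis) tuples."""
    ranked = sorted(case_db, key=lambda case: weighted_distance(query_emb, case[1]))
    return ranked[:k]

# Hypothetical case database with two-dimensional importance embeddings.
case_db = [
    ("a", [2.1, 0.9], 1),
    ("b", [1.0, 0.1], 0),
    ("c", [1.9, -0.5], 1),
]
query = [2.0, 0.1]
neighbors = k_nearest_cases(query, case_db, k=2)
neighbor_ids = [case_id for case_id, _, _ in neighbors]
```

Because the weights come from the query's own embedding, dimensions that matter most for the query sample dominate the comparison. In the toy data, case "b" matches the query exactly on the unimportant second dimension but is still ranked last.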

The inference prediction result of the CBR model is compared with the prediction result obtained by the AdaGBDT model. If the comparison result is inconsistent, the patient sample x_i is labeled as a difficult sample and visualized using t-SNE, the patient sample information is submitted to a doctor for diagnosis, and the patient sample is stored in the database once the diagnosis result is obtained. If the comparison result is consistent, the prediction result passes the verification and is recorded, and the patient sample is stored in the database. Fig. 7 is a schematic diagram of the active labeling result for difficult samples according to an embodiment of the present invention.
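The consistency check can be sketched as a small triage routine. The example assumes that the CBR prediction is the majority diagnosis among the k retrieved cases, which the text does not state explicitly; the names `cbr_vote` and `triage` and the in-memory lists standing in for the database and the doctor's review queue are hypothetical.

```python
from collections import Counter

def cbr_vote(neighbor_labels):
    """Assumed CBR prediction: the most common diagnosis among the retrieved cases."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def triage(sample_id, model_label, neighbor_labels, database, review_queue):
    """Compare the AdaGBDT label with the CBR vote and route the sample."""
    cbr_label = cbr_vote(neighbor_labels)
    if cbr_label == model_label:
        # Consistent: the prediction passes verification and is stored.
        database.append((sample_id, model_label))
        return "verified"
    # Inconsistent: flag as a difficult sample for the doctor to diagnose.
    review_queue.append((sample_id, model_label, cbr_label))
    return "difficult"

database, review_queue = [], []
status_1 = triage("patient-001", 1, [1, 1, 0], database, review_queue)
status_2 = triage("patient-002", 0, [1, 1, 0], database, review_queue)
```

In a fuller workflow, a sample in the review queue would be stored in the database only after the doctor's diagnosis is available, as described above.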

(1) The adaptive AdaGBDT module is responsible for constructing the tree model and mining the decision paths of cases. A conventional GBDT converges quickly on simple samples and slowly on difficult samples, which easily causes overfitting and introduces noise into the subsequent parts. The embodiment of the invention therefore provides a sample-adaptive weighted gradient boosting tree, AdaGBDT, which can effectively accelerate convergence and reduce overfitting. A simple sample is one that converges quickly in AdaGBDT and needs few iterations; a difficult sample is the opposite.

(2) The BME (Bi-side MI-based feature Embedding) module extracts the feature importance embedding of a sample. Compared with existing methods based on a single side or on distributions, the BME module provided by the embodiment of the invention also considers the role of a node's splitting feature in its descendants and is therefore fairer in calculating feature importance.

The BME module extracts feature importance embeddings based on bilateral mutual information. The module makes full use of the score information in the regression trees and considers the importance of features more comprehensively. The CBR module is based on active labeling with AdaGBDT and CBR and uses visualization as an intuitive form of interpretation: through CBR, information on similar patients, analogous to a doctor's experience, can be obtained, and information on the samples the model finds difficult is provided for the doctor to choose from.

(3) The active-learning-based difficult sample mining module. Feature importance alone is not enough to help a doctor make a decision. Following the actual diagnostic process, this module therefore uses the CBR method to search the embedding space for the k cases most similar to the current case, providing the doctor with more auxiliary information. It also performs decision-level fusion of the AdaGBDT and CBR decision results and mines difficult samples through inconsistencies between them. Fig. 5 is a schematic diagram of an interpretable result according to an embodiment of the present invention, and Fig. 6 is a schematic diagram of an interpretable feature according to an embodiment of the present invention. Fig. 5 compares how far the global importance index of the method of the present invention and those of other existing methods deviate from the gold standard index; the method of the present invention is more consistent with the gold standard at the global level. Fig. 6 illustrates the importance of personalized features in the intelligent medical auxiliary diagnosis system: the differing importance of a patient's features to the diagnosis result can be seen directly, which gives clinicians a more convincing basis for diagnosis.

GBDT is one of the most popular models in the field of machine learning, but it overfits easily. A GBDT model converges quickly on the easy samples in a case set but needs more iterations for the difficult samples, and these extra iterations inevitably introduce noise on the simple samples that have already converged. The embodiment of the invention therefore constructs the adaptive AdaGBDT module, which in each iteration weights the samples adaptively according to the gradients obtained in the previous iteration.

The active-learning-based difficult sample mining module actively labels difficult samples based on AdaGBDT and CBR. CBR is a form of analogical reasoning centered on historical knowledge, and a KNN-based CBR system faces two major questions: (1) in which space should similarity be measured, and (2) with what metric should the similarity of two cases be measured? To address these two questions, the embodiment of the invention provides a technique for self-attention-weighted measurement in a homogeneous feature space. The technique avoids the problem of inconsistent scales (units) across features in the original space, and the metric changes as the feature importance changes. On this basis, the embodiment of the invention provides an active labeling technique based on CBR and AdaGBDT. First, two prediction results are obtained for a sample, one from AdaGBDT and one from CBR. Second, the two results are checked for consistency. If they are consistent, the prediction passes the check and is recorded, and the sample is stored in the database. If they are inconsistent, the sample information is submitted to a doctor for diagnosis, and the sample is stored in the database after the diagnosis result is obtained.

In conclusion, the method provided by the embodiment of the invention draws more attention to cases with inconsistent prediction results, which helps improve the recall of the system and reduce the number of missed cases. The system promotes cooperation between human and machine while ensuring patient safety.

By introducing the adaptive AdaGBDT, which in each iteration performs adaptive sample weighting according to the gradient obtained in the previous iteration, the embodiment of the invention greatly reduces the number of training rounds required and avoids introducing noise while keeping the precision unchanged.
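A toy training loop may help to show how gradient-based weighting and the stopping rule stated in claim 2 (stop when the decrease of the loss function falls below a threshold) fit together. The following Python sketch is illustrative only: the squared-error loss, the single-feature depth-1 trees, the use of softmax values as sample weights, and every name and parameter value are assumptions, not the embodiment's implementation.

```python
import math

def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def half_mean_squared_loss(targets, preds):
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / (2 * len(targets))

def fit_weighted_stump(xs, residuals, weights):
    """Depth-1 regression tree on one feature, minimizing weighted squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        sides = [x <= threshold for x in xs]
        values = {}
        for side in (True, False):
            idx = [i for i, s in enumerate(sides) if s == side]
            weight_sum = sum(weights[i] for i in idx)
            values[side] = sum(weights[i] * residuals[i] for i in idx) / weight_sum
        error = sum(w * (r - values[s]) ** 2 for w, r, s in zip(weights, residuals, sides))
        if best is None or error < best[0]:
            best = (error, threshold, values[True], values[False])
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def train_adaptive_boosting(xs, ys, learning_rate=0.5, min_decrease=1e-4, max_rounds=200):
    preds = [0.0] * len(xs)
    trees = []
    losses = [half_mean_squared_loss(ys, preds)]
    for _ in range(max_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of the loss
        weights = softmax([abs(r) for r in residuals])  # weights from the previous gradient
        tree = fit_weighted_stump(xs, residuals, weights)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        losses.append(half_mean_squared_loss(ys, preds))
        if losses[-2] - losses[-1] < min_decrease:  # loss decrease below threshold: stop
            break
    return trees, losses

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
trees, losses = train_adaptive_boosting(xs, ys)
```

On this toy data the loop ends well before max_rounds, so the number of trees is determined by the loss threshold rather than fixed in advance.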

The invention addresses the lack of interpretability of traditional artificial-intelligence-based medical auxiliary diagnosis systems, helps to accelerate the practical deployment of such systems, strengthens the trust between doctors and the system, improves the speed and accuracy of medical diagnosis, and reduces the burden on medical staff.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or in some parts of the embodiments.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus and system embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the method embodiments may be consulted for details. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An intelligent medical auxiliary diagnosis method for patient samples based on GBDT sample levels, comprising:

2. The method of claim 1, wherein constructing the adaptive AdaGBDT module according to historical patient data, and calculating the diagnosis result of the patient sample to be diagnosed through an iterative training process by using the decision trees in the AdaGBDT module, comprises:

the adaptive AdaGBDT is an ensemble of a plurality of decision trees whose number is adjusted dynamically according to the decrease of the loss function: when the decrease of the loss function is smaller than a threshold value, the training of the AdaGBDT is stopped. The specific training process is as follows:

calculating the negative gradient of the patient sample x_i as:

the formula is a softmax normalization function, by which the sampling probability p_Ti of the patient sample x_i is calculated from the magnitude of its negative gradient, where exp(·) denotes the exponential function with base e;

3. The method of claim 2, wherein calculating the interpretable feature embedding of the patient sample according to the decision path of the patient sample obtained by the AdaGBDT module, and accumulating the interpretable feature embeddings of the patient sample across the different decision trees of the AdaGBDT model to obtain the feature importance embedding of the patient sample, comprises:

following the decision path of the patient sample x_i to be diagnosed from top to bottom, calculating the interpretable feature embedding of the patient sample x_i according to the formula:

4. The method of claim 3, wherein calculating the interpretable feature embedding of the patient sample according to the decision path of the patient sample obtained by the AdaGBDT module, and accumulating the interpretable feature embeddings of the patient sample across the different decision trees of the AdaGBDT model to obtain the feature importance embedding of the patient sample, comprises:

wherein,

taking the cases in the historical patient case database as existing samples, using the CBR model to search the historical patient case database, in the embedding space and with a self-weighted distance metric, for the k cases most similar to the patient sample x_i to be diagnosed, and taking these k cases as the inference prediction result for the patient sample x_i;

comparing the inference prediction result of the CBR model with the prediction result obtained by the AdaGBDT model; if the two results are inconsistent, labeling the patient sample x_i as a difficult sample; if the two results are consistent, the check is passed and the patient sample and its prediction result are stored in the database.