CN115148319A

CN115148319A - Auxiliary classification method, equipment and storage medium for multi-clinical stage diseases

Info

Publication number: CN115148319A
Application number: CN202210877630.1A
Authority: CN
Inventors: 张宏国; 任涵彬; 杜宇芳; 方舟; 白瑞; 杨霄璇; 宋雪; 李锐; 刘明鸽; 齐红; 何晨龙; 耿瑞
Original assignee: Heilongjiang Network Space Research Center; Harbin University of Science and Technology
Current assignee: Heilongjiang Network Space Research Center; Harbin University of Science and Technology
Priority date: 2022-07-25
Filing date: 2022-07-25
Publication date: 2022-10-04

Abstract

The invention provides an auxiliary classification method, equipment and a storage medium for multi-clinical stage diseases. The method comprises the following steps: determining a medical record data set; extracting characteristic values and labels from the data to form a characteristic value set and a label set; performing binary classification on the medical record data set by using a binary classification model; analyzing the association degree of the characteristic value set to obtain an optimized characteristic value set; screening the optimized characteristic value set to obtain a key characteristic value set; searching the healthy data set for medical record data whose characteristic values meet the diagnosis-confirming conditions and adding them to the diseased data set to form a new diseased data set; and performing multi-class classification on the new diseased data set to predict the different stages of the disease. The invention predicts the disease stage through classification models and assists doctors in diagnosing the disease.

Description

Auxiliary classification method, equipment and storage medium for multi-clinical stage diseases

Technical Field

The application relates to the field of intelligent medical treatment, in particular to an auxiliary classification method, equipment and a storage medium for multi-clinical stage diseases.

Background

Disease staging was at first purely clinical, e.g., mild versus severe symptoms, and gradually evolved toward a more advanced clinicopathological perspective as autopsy, imaging, and biomarker research progressed. Staging applies to diseases that may be slow to heal, that cause progressive functional deterioration, and/or that may lead to early death. For most diseases, the early stage is relatively stable and has a high clinical cure rate, whereas the late stage progresses quickly and has a low cure rate. If a disease is detected and treated at an early stage, before the condition worsens, the patient's chance of clinical cure improves greatly, so accurately diagnosing the stage of a disease is one of the important problems in clinical medicine. With the development of machine learning and the improvement of electronic medical records, data-driven intelligent diagnosis and treatment methods have become mainstream. Intelligent medical treatment has been a focus of academic research in recent years and of work combining the computer and medical fields, so how to support disease stage diagnosis through intelligent medical treatment is a problem to be solved.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, and a storage medium for assisting classification of multiple clinical stage diseases, so as to solve the problem of helping disease stage diagnosis through intelligent medical treatment.

The implementation method of the technical scheme of the application comprises the following steps:

an assisted classification method for multi-clinical stage diseases, comprising:

determining a medical record data set S1, wherein the medical record data set S1 comprises medical record data of at least one patient;

extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises diseased or healthy labels determined based on the diagnosis results of doctors;

performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D to obtain a healthy data set and a diseased data set;

analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1;

based on the medical field information, screening the optimized characteristic value set F1 to obtain a key characteristic value set F2 and conditions corresponding to the characteristics in the key characteristic value set F2;

searching the healthy data set for medical record data whose characteristic values meet the diagnosis-confirming conditions in F2, and adding them to the diseased data set to form a new diseased data set S3;

the new diseased data set S3 is multi-classified to obtain predictions of different stages of the disease.

In the method, the physical examination data at least comprises: height, weight, pain level, smoking history, drinking history, and medical history;

the examination result data includes at least: blood and urine biochemical test results and imaging examination results.

In the method, performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D includes:

establishing a candidate binary classification model library, wherein the library comprises a plurality of binary classification models;

and executing the plurality of binary classification models simultaneously to obtain the accuracy, recall, and F1 score of each model, comprehensively considering these three evaluation metrics, and selecting the binary classification model with the best evaluation results to perform binary classification on the medical record data set S1.
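The selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the two candidate "models" are toy threshold rules standing in for real classifiers such as SVM or XGBoost, the records and feature names are invented, and ranking by the unweighted mean of the three metrics is an assumption (the text only says the metrics are considered comprehensively).

```python
from typing import Callable, Dict, Sequence, Tuple

Record = Tuple[float, ...]
Model = Callable[[Record], int]


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, recall and F1 score for binary labels (1 = diseased, 0 = healthy)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


def select_best(models: Dict[str, Model], X: Sequence[Record], y: Sequence[int]):
    """Run every candidate model and keep the one with the best mean metric.

    Assumption: the three metrics are weighted equally. In practice the
    metrics would be computed on held-out records, not on training data.
    """
    scores = {name: evaluate(y, [m(x) for x in X]) for name, m in models.items()}
    best = max(scores, key=lambda name: sum(scores[name].values()) / 3)
    return best, scores


# Invented records: (uric_acid, pain_level); labels: 1 = diseased, 0 = healthy.
X = [(300, 1), (520, 7), (480, 6), (350, 2), (610, 8), (400, 5)]
y = [0, 1, 1, 0, 1, 0]

# Hypothetical candidate library; real entries would be trained classifiers.
candidate_models: Dict[str, Model] = {
    "uric_acid_rule": lambda x: int(x[0] > 450),
    "pain_rule": lambda x: int(x[1] >= 5),
}

best_name, all_scores = select_best(candidate_models, X, y)
print(best_name, all_scores[best_name])
```

Keeping the library as a plain name-to-model mapping makes it easy to add or remove candidates for a particular disease without touching the selection logic.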

In the method, the analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1 includes:

and performing association degree analysis on the characteristic values in the characteristic value set F through the chi-square test, or sample variance, or mutual information between discrete categories, and deleting the characteristic values with a lower association degree to obtain an optimized characteristic value set F1.

In the method, the optimized characteristic value set F1 is screened based on the medical field information to obtain a key characteristic value set F2, wherein the key characteristic value set is the characteristic value set which has decisive influence on the confirmed disease.

In the method, before the multi-classifying the new diseased data set S3, the method further includes:

filling missing feature items in the new diseased data set S3 with a specific value, or an average value, or a mode according to the corresponding medical meaning;

the data in the filled diseased data set S3 is normalized to constitute a data set S4.

In the method, performing multi-class classification on the new diseased data set S3 specifically comprises:

determining a new label set D' according to the disease type, wherein the new label set D' is the set of staging diagnoses corresponding to the disease;

performing multi-class classification on S4 based on a deep neural network model, wherein:

the number of neurons in the input layer corresponds to the number of characteristic values in the characteristic value set F1;

the number of neurons in the output layer corresponds to the number of disease stages, i.e., the number of values in the label set D';

the ReLU function is used as the activation function of each hidden layer, a softmax function is used at the output layer, and the disease stage prediction is determined.

The invention also provides auxiliary classification equipment for multi-clinical stage diseases, which comprises: a processor and a memory;

the processor is used for storing a computer program for realizing the auxiliary classification method of the multi-clinical stage diseases.

The invention also proposes a storage medium for storing at least one set of instructions;

the set of instructions is for being invoked and performing at least the assisted classification method for the multi-clinical stage disease.

The method provided by the invention is suitable for diagnosing multi-stage diseases. First, a machine learning binary classification model classifies whether the disease is present; then professional knowledge from the medical field is applied to determine the characteristic value set; finally, a deep learning multi-class model classifies the diseased data from the binary classification result to achieve disease stage diagnosis.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings may be obtained according to the drawings without inventive labor.

FIG. 1 is a flow chart of an embodiment of a method for assisted classification of multiple clinical stage diseases according to the present invention;

FIG. 2 is a schematic structural diagram of an auxiliary classification device for multi-clinical stage diseases according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

With the development of machine learning and the improvement of electronic medical records, data-driven intelligent diagnosis and treatment methods have become mainstream, and the large number of electronic medical records now being generated provides an adequate data source for intelligent medical treatment. At the same time, accurate disease staging is one of the major difficulties clinicians face in diagnosis: for most diseases, the early stage is relatively stable and has a high clinical cure rate, whereas the late stage develops rapidly and has a low cure rate. Timely and accurate staging can greatly improve the patient's survival rate and prognosis. In view of these practical problems, the invention provides an auxiliary classification method for multi-clinical stage diseases, which is suitable for predicting diseases with multiple clinical stages and assisting clinicians in diagnosis.

the embodiment of the invention provides an auxiliary classification method for multi-clinical stage diseases, which comprises the following steps of:

S101: determining a medical record data set S1, wherein the medical record data set S1 comprises medical record data of at least one patient; the medical record data set S1 may comprise the electronic medical records of a plurality of patients from a hospital medical record library;

S102: extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises diseased or healthy labels determined based on the diagnosis results of doctors;

When a patient sees a doctor, the doctor enters the patient's medical record through the hospital's electronic medical record information system. The electronic medical record data include the patient's personal information, symptom data, physical examination and biochemical test data, the doctor's diagnostic orders, and medication data. The patient's electronic medical record data are then exported from the system. All disease characteristics in the medical record data set, such as physical examination data (height, weight, pain level, smoking history, drinking history, medical history) and results such as blood and urine biochemical tests and imaging examinations, are used as characteristic values, and the multi-stage diagnosis made by the doctor is simplified to diseased or healthy and used as the label.

S103: performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D to obtain a healthy data set and a diseased data set;

Based on preliminary observation and statistics of the electronic medical records, we found that the overall diagnostic results can be divided into two categories, "healthy" and "diseased", according to whether the patient has the disease. We then perform a preliminary screening of the fields in the medical record data set S1, removing the patient's personal information and the medication advice, which have no effect on disease diagnosis, and use the remaining characteristic values, such as pain level, height and weight from the physical examination, and blood and urine biochemical test results, as the characteristic value set, denoted F = {f1, f2, …, fn}. The label set D indicates whether the patient is diseased: D = {1, 0}, where 1 denotes diseased and 0 denotes healthy. Binary classification is performed on S1 on this basis;
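The extraction of F and D described above can be sketched as follows. All field names and record values here are hypothetical, and encoding the doctor's diagnosis as an integer stage (0 for healthy) is an assumption made only for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical field names; a real electronic medical record export will differ.
EXCLUDED_FIELDS = {"name", "id_number", "medication_advice"}
DIAGNOSIS_FIELD = "diagnosis_stage"  # assumed encoding: 0 = healthy, 1..n = stage


def extract(records: List[Dict]) -> Tuple[List[str], List[List], List[int]]:
    """Return feature names, the characteristic value set F and the label set D."""
    feature_names = sorted(
        key for key in records[0]
        if key not in EXCLUDED_FIELDS and key != DIAGNOSIS_FIELD
    )
    F = [[record[key] for key in feature_names] for record in records]
    # The multi-stage diagnosis is simplified to diseased (1) or healthy (0).
    D = [1 if record[DIAGNOSIS_FIELD] > 0 else 0 for record in records]
    return feature_names, F, D


S1 = [
    {"name": "A", "id_number": "001", "height": 172, "weight": 80,
     "pain_level": 6, "medication_advice": "n/a", "diagnosis_stage": 2},
    {"name": "B", "id_number": "002", "height": 165, "weight": 58,
     "pain_level": 0, "medication_advice": "n/a", "diagnosis_stage": 0},
]

feature_names, F, D = extract(S1)
print(feature_names, F, D)
```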

Since medical domain knowledge is not introduced at this point, the concept of a "candidate binary classification model library" is proposed to better support binary diagnosis of different types of multi-clinical stage diseases: for a specific scenario (i.e., the electronic medical records of a specific multi-clinical stage disease), the user can execute several binary classification models simultaneously and then select the most suitable one according to the actual test results. Commonly used binary classification algorithms include logistic regression, K-nearest neighbors (KNN), and support vector machines (SVM). Beyond these mainstream algorithms, random forest and XGBoost models also perform well on classification problems. Take as an example a candidate binary classification model library containing two models, SVM and XGBoost:

A support vector machine (SVM) is a typical binary classification model. Its basic form is a maximum-margin linear classifier defined on the feature space; the maximum margin is what distinguishes it from a perceptron. The SVM also supports the kernel trick, which makes it in effect a non-linear classifier. The basic idea is to find the hyperplane that correctly separates the data set and has the largest geometric margin. In sample space, a hyperplane (ω, b) is determined by a normal vector ω and a displacement term b, and the distance from any point x in sample space to the hyperplane can be written as:

$$r = \frac{|\omega^{T}x + b|}{\|\omega\|}$$

Among the many hyperplanes that separate the two classes of samples, the separating hyperplane with the largest margin must satisfy the following, where $y_i \in \{+1, -1\}$ is the class label of sample $x_i$:

$$\max_{\omega,\,b}\ \frac{2}{\|\omega\|} \quad \text{s.t.}\quad y_i(\omega^{T}x_i + b) \ge 1,\quad i = 1, 2, \dots, m$$
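As a small numeric illustration of the point-to-hyperplane distance |ωᵀx + b| / ‖ω‖ and the constraint y_i(ωᵀx_i + b) ≥ 1, the sketch below checks a hand-picked hyperplane against four invented points. It does not train an SVM; the hyperplane, the points, and the mapping of the labels {1, 0} onto {+1, −1} are assumptions made for the example.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Compute w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Distance from point x to the hyperplane (w, b): |w^T x + b| / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def satisfies_margin(w, b, X, y) -> bool:
    """Check y_i (w^T x_i + b) >= 1 for every sample, with y_i in {+1, -1}."""
    return all(yi * decision_value(w, b, xi) >= 1 for xi, yi in zip(X, y))


def margin_width(w: Sequence[float]) -> float:
    """Width of the margin between the two classes: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hand-picked hyperplane x1 + x2 - 3 = 0 and invented samples.
w, b = (1.0, 1.0), -3.0
X = [(1.0, 1.0), (2.0, 2.0), (0.0, 1.0), (3.0, 1.0)]
y = [-1, +1, -1, +1]  # +1 = diseased, -1 = healthy (assumed mapping)

print(satisfies_margin(w, b, X, y))
print(distance_to_hyperplane(w, b, (2.0, 2.0)))
print(margin_width(w))
```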

XGBoost is an optimized distributed gradient boosting library. It implements machine learning algorithms under the gradient boosting framework and can process large-scale training samples efficiently, flexibly, and conveniently. The objective function of XGBoost is:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $l$ is the loss function, $n$ is the number of medical record samples, $y_i$ is the true diagnosis of the i-th medical record, and $\hat{y}_i$ is the model's predicted diagnosis for the i-th sample. $K$ denotes the number of regression trees, $f_k$ denotes the k-th tree, and $\Omega$, the complexity of a regression tree, serves as the regularization term and is expressed as follows:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^{2}$$

where $T$ is the number of leaf nodes and $w$ is the vector of leaf node values. $\gamma$ prunes the tree when there are too many leaf nodes, and $\lambda$ keeps the leaf node values from becoming too large, which would cause overfitting.

Our optimization objective is to minimize this objective function, where, in the optimal case, the prediction contributed by the i-th regression tree is the value of the leaf node into which the sample falls.
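To make the roles of γ and λ concrete, the sketch below evaluates a regularized objective of the XGBoost form, a training loss plus a per-tree penalty γT + ½λΣw², for given predictions and leaf values. The squared-error loss, the numbers, and the representation of a tree as a plain list of leaf values are assumptions for illustration; the sketch does not build or fit any trees.

```python
from typing import List, Sequence


def regularization(leaf_values: Sequence[float], gamma: float, lam: float) -> float:
    """Tree complexity: gamma * T + 0.5 * lambda * sum of squared leaf values."""
    num_leaves = len(leaf_values)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_values)


def objective(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    trees: List[Sequence[float]],
    gamma: float,
    lam: float,
) -> float:
    """Training loss plus the complexity of every tree.

    Assumption: squared error is used as the loss l(y, y_hat).
    """
    loss = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    penalty = sum(regularization(tree, gamma, lam) for tree in trees)
    return loss + penalty


# Invented values: three samples and two trees given by their leaf values.
y_true = [1.0, 0.0, 1.0]
y_pred = [0.8, 0.1, 0.6]
trees = [[0.5, -0.5], [0.3, 0.1, -0.2]]

obj = objective(y_true, y_pred, trees, gamma=1.0, lam=2.0)
print(round(obj, 4))
```

Raising `gamma` makes every additional leaf more expensive, while raising `lam` penalizes large leaf values; both push the model toward simpler trees.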

S104: analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1;

To improve the accuracy of model prediction, the chi-square test, sample variance, and mutual information are considered together to analyze the association of the characteristic values in F. Taking the chi-square test and mutual information as examples, the chi-square statistic between a characteristic value $f_i$ and the disease label D is:

$$\chi^{2} = \sum \frac{(A - T)^{2}}{T}$$

where A is the actual (observed) value of $f_i$ with respect to D and T is the theoretical (expected) value. $\chi^{2}$ measures how far the actual values deviate from the theoretical values; the larger $\chi^{2}$ is, the greater the influence of $f_i$ on the disease.
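A minimal sketch of the chi-square statistic for one discrete characteristic value against the diseased/healthy label follows. Here the actual values A are the observed counts of each (feature, label) combination and the theoretical values T are the counts expected if the feature and the label were independent; this contingency-table reading and the sample data are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_square(feature: Sequence[Hashable], label: Sequence[Hashable]) -> float:
    """Sum of (A - T)^2 / T over all cells of the feature-by-label table."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # A: actual counts per cell
    feature_counts = Counter(feature)
    label_counts = Counter(label)
    statistic = 0.0
    for f_value, f_count in feature_counts.items():
        for l_value, l_count in label_counts.items():
            expected = f_count * l_count / n  # T: count expected under independence
            statistic += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return statistic


# Invented data: 1 = has smoking history / diseased, 0 = otherwise.
smoking_history = [1, 1, 1, 1, 0, 0, 0, 0]
unrelated_flag = [1, 0, 1, 0, 1, 0, 1, 0]
diseased = [1, 1, 1, 0, 0, 0, 0, 1]

print(chi_square(smoking_history, diseased))
print(chi_square(unrelated_flag, diseased))
```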

Mutual information between discrete categories, "mutual information" for short, is a method for screening characteristic values in feature engineering. For discrete random variables X and Y, mutual information is computed as:

$$I(X;Y) = \sum_{x \in X}\sum_{y \in Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$

If X and Y are independent, then p(x, y) = p(x)p(y) and I(X; Y) is 0; a larger value of I(X; Y) therefore indicates a stronger association between the two variables.

On this basis, the characteristic values with a low association degree are deleted, yielding the preliminarily optimized characteristic value set F1.
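The screening by mutual information can be sketched as below: I(X; Y) is estimated from empirical frequencies, and characteristic values whose score does not exceed a threshold are dropped. The feature names, the data, and the threshold of 0.05 are hypothetical; the text does not specify how low an association degree must be for deletion.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def mutual_information(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Empirical I(X; Y) = sum p(x, y) * log(p(x, y) / (p(x) p(y))), in nats."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def select_features(
    features: Dict[str, Sequence[Hashable]],
    label: Sequence[Hashable],
    threshold: float,
) -> List[str]:
    """Keep the features whose mutual information with the label exceeds threshold."""
    return [
        name for name, values in features.items()
        if mutual_information(values, label) > threshold
    ]


# Invented data: 1 = present / diseased, 0 = absent / healthy.
diseased = [1, 1, 1, 0, 0, 0, 0, 1]
F = {
    "smoking_history": [1, 1, 1, 1, 0, 0, 0, 0],
    "unrelated_flag": [1, 0, 1, 0, 1, 0, 1, 0],
}

F1_names = select_features(F, diseased, threshold=0.05)  # threshold is an assumption
print(F1_names)
```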

S105: based on the medical field information, screening the optimized characteristic value set F1 to obtain a key characteristic value set F2 and conditions corresponding to the characteristics in the key characteristic value set F2;

On this basis, medical domain knowledge is introduced, and the characteristic value set F2 = {fn, …, fm} that has a decisive influence on the prediction result, together with its conditions, is screened from the characteristic value set F1;

This process relies on disease diagnosis knowledge from the medical field to screen the characteristic values that have a decisive influence on confirming the disease; for gout, for example, the gout quantitative score has a decisive influence on whether gout is confirmed;

S106: searching the healthy data set for medical record data whose characteristic values meet the diagnosis-confirming conditions in F2, and adding them to the diseased data set to form a new diseased data set S3;

On this basis, medical domain knowledge is introduced: the characteristic value set F2 = {fn, …, fm} that has a decisive influence on the prediction result, together with its conditions, is screened from F1; the healthy data set S2 (D = 0 after binary classification) is searched for the data S2' whose F2 values meet the diagnosis-confirming conditions; and S2' is then added to the data set with D = 1 to form the data set S3 used for multi-class classification.
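The formation of S3 can be sketched as follows. The key feature name, the threshold in its condition, and the records are hypothetical, and treating a record as confirmed only when every listed condition holds is an assumption; real conditions would come from clinical diagnostic criteria for the disease in question.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]

# Hypothetical key characteristic value set F2 with its confirming conditions.
F2_CONDITIONS: Dict[str, Callable[[float], bool]] = {
    "gout_score": lambda value: value >= 8,  # threshold chosen for illustration
}


def meets_conditions(record: Record, conditions: Dict[str, Callable[[float], bool]]) -> bool:
    """True when the record satisfies every confirming condition (assumed: all)."""
    return all(
        name in record and check(record[name])
        for name, check in conditions.items()
    )


def build_s3(healthy: List[Record], diseased: List[Record],
             conditions: Dict[str, Callable[[float], bool]]) -> List[Record]:
    """Add the healthy-set records that meet the conditions (S2') to the diseased set."""
    s2_prime = [record for record in healthy if meets_conditions(record, conditions)]
    return diseased + s2_prime


# Invented output of the binary classification step.
healthy_set = [{"id": 1, "gout_score": 9}, {"id": 2, "gout_score": 3}]
diseased_set = [{"id": 3, "gout_score": 12}]

S3 = build_s3(healthy_set, diseased_set, F2_CONDITIONS)
print([record["id"] for record in S3])
```

This step acts as a safety net: records that the binary classifier labelled healthy but that meet a decisive clinical criterion are still passed on to stage classification.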

S107: the new diseased data set S3 is multi-classified to obtain predictions of different stages of the disease.

the examination result data includes at least: blood and urine biochemical test results and imaging examination results.

In the method, the optimized characteristic value set F1 is screened based on the medical field information to obtain a key characteristic value set F2, wherein the key characteristic value set is the characteristic value set which has decisive influence on the diagnosed diseases.

For characteristic items with missing data in S3, the missing value is filled with a specific value, the average, or the mode, according to the medical meaning of the item. For example, for the number of painful joints, a missing value indicates that the patient has no joint pain, so it is filled with 0 by default. If the drink type is missing from the drinking history, it is filled with the most frequently occurring type, "beer".
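The three filling strategies can be sketched as below. The field names, the records, and the per-field choice of strategy are hypothetical; in practice the strategy for each item would be chosen according to its medical meaning, as described above.

```python
from collections import Counter
from statistics import mean
from typing import Any, Dict, List, Optional, Tuple

Record = Dict[str, Any]
Strategy = Tuple[str, Optional[Any]]  # ("constant", value) | ("mean", None) | ("mode", None)


def impute(records: List[Record], strategies: Dict[str, Strategy]) -> List[Record]:
    """Return copies of the records with missing (None or absent) fields filled."""
    fill_values: Dict[str, Any] = {}
    for field, (kind, value) in strategies.items():
        present = [r[field] for r in records if r.get(field) is not None]
        if kind == "constant":
            fill_values[field] = value
        elif kind == "mean":
            fill_values[field] = mean(present)
        elif kind == "mode":
            fill_values[field] = Counter(present).most_common(1)[0][0]
        else:
            raise ValueError(f"unknown strategy: {kind}")
    return [
        {**r, **{f: v for f, v in fill_values.items() if r.get(f) is None}}
        for r in records
    ]


# Invented records; None marks a missing value.
S3 = [
    {"painful_joints": 3, "drink_type": "beer", "weight": 70.0},
    {"painful_joints": None, "drink_type": "beer", "weight": 80.0},
    {"painful_joints": 1, "drink_type": None, "weight": None},
    {"painful_joints": 2, "drink_type": "wine", "weight": 90.0},
]

filled = impute(S3, {
    "painful_joints": ("constant", 0),  # missing means no joint pain
    "drink_type": ("mode", None),       # most frequent category
    "weight": ("mean", None),           # average of the known values
})
print(filled[1], filled[2])
```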

Because different evaluation indicators often have different dimensions and units, the Z-Score method is used to standardize the data and remove the effect of these differences on data analysis, scaling the data proportionally so that it falls within a comparable range:

$$z = \frac{x - \mu}{\sigma}$$

where x is the actual value of a characteristic value in F1, μ is the mean, and σ is the standard deviation. The Z-Score method converts data of different magnitudes into a unified measure and improves the comparability of the data. The data set S4, obtained after missing value filling and standardization, can be used as the input of the multi-class model for disease stage prediction.
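A minimal sketch of Z-Score standardization, z = (x − μ) / σ, applied column by column is given below. Using the population standard deviation and mapping a constant column to zeros are assumptions of this example, and the sample rows are invented.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def z_score_column(values: Sequence[float]) -> List[float]:
    """Standardize one characteristic value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply the Z-Score to every column of a row-oriented data set."""
    columns = list(zip(*rows))
    scaled_columns = [z_score_column(list(column)) for column in columns]
    return [list(row) for row in zip(*scaled_columns)]


# Invented rows of (height in cm, weight in kg) after missing value filling.
filled_rows = [[160, 50], [170, 60], [180, 70]]
S4 = standardize(filled_rows)
print(S4)
```

After this step each column has mean 0 and standard deviation 1, so height and weight contribute on the same scale regardless of their original units.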

using the ReLU function as the activation function of each hidden layer and a softmax function at the output layer, the disease stage prediction is determined.

We perform multi-class classification on S4 using a deep neural network (DNN) model. A DNN is a neural network with multiple hidden layers; its layers fall into three categories: input layer, hidden layers, and output layer. The number of neurons in the input layer corresponds to the number of characteristic values in the characteristic value set F1, and the number of neurons in the output layer corresponds to the number of disease stages, i.e., the number of values in the label set $D' = \{d_1, d_2, \dots, d_n \mid d_i \in \mathbb{N}^{+}\}$, where $d_1$ to $d_n$ are the different stage diagnoses of the disease. The ReLU function is used as the activation function of each hidden layer, and a softmax function is used as the activation function of the output layer to solve the multi-class problem. The softmax function is defined as follows:

$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$

where $z_i$ is the output value of the i-th node, i.e., the output value for a given disease stage, and C is the number of output nodes, i.e., the number of disease stages. Cross-entropy, which performs well on multi-class problems, is used as the loss function for the different disease types to further improve the accuracy of predicting the different disease stages.
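The forward pass of such a network, with ReLU hidden layers, a softmax output layer, and a cross-entropy loss, can be sketched in plain Python as below. The layer sizes and weights are invented and fixed by hand so that the arithmetic can be followed; a real model would learn its weights by training, typically with a deep learning framework.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(z: Sequence[float]) -> List[float]:
    """exp(z_i) / sum_c exp(z_c), shifted by max(z) for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x: Sequence[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully connected layer: each weight row produces one neuron output."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(x: Sequence[float], hidden_layers: List[Layer], output_layer: Layer) -> List[float]:
    """ReLU on every hidden layer, softmax on the output layer."""
    activation = list(x)
    for weights, bias in hidden_layers:
        activation = relu(dense(activation, weights, bias))
    weights, bias = output_layer
    return softmax(dense(activation, weights, bias))


def cross_entropy(probabilities: Sequence[float], true_stage: int) -> float:
    """Loss for one sample whose true stage has index true_stage."""
    return -math.log(probabilities[true_stage])


# Invented network: 2 input features, one hidden layer of 3 neurons, 3 stages.
hidden = [([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])]
output = ([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])

probs = forward([1.0, -1.0], hidden, output)
predicted_stage = probs.index(max(probs))
print(predicted_stage, [round(p, 3) for p in probs])
```

The output list has one probability per disease stage, and the predicted stage is the index with the highest probability.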

In another embodiment, the present invention further provides an auxiliary classification device for multi-clinical stage diseases, comprising: a processor 201 and a memory 202;

In yet another embodiment, the present invention further provides a storage medium for storing at least one set of instructions;

The method provided by the invention is suitable for diagnosing multi-stage diseases. First, a machine learning binary classification model classifies whether the disease is present; then professional knowledge from the medical field is applied to determine the characteristic value set; finally, a deep learning multi-class model classifies the diseased data from the binary classification result to achieve disease stage diagnosis. Combining professional medical knowledge, the disease characteristics in the complex and varied electronic medical record data collected by a hospital are segmented and screened, and the screened data are used to predict diseases with multiple clinical stages and to assist clinicians in diagnosis.

The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present application, should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present application should be included in the scope of the present application.

Claims

1. An auxiliary classification method for multi-clinical stage diseases is characterized by comprising the following steps:

extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises two types of labels of diseases or health which are determined based on doctor diagnosis results;

searching the healthy data set for medical record data whose characteristic values meet the diagnosis-confirming conditions in F2, and adding them to the diseased data set to form a new diseased data set S3;

2. The method of claim 1, wherein the physical examination data comprises at least: height, weight, pain level, smoking history, drinking history, and medical history;

the examination result data at least includes: blood and urine biochemical test results and imaging examination results.

3. The method according to claim 1, wherein performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D comprises:

4. The method according to claim 1, wherein analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1 comprises:

and analyzing the association degree of the characteristic values in the characteristic value set F through the chi-square test, or sample variance, or mutual information between discrete categories, and deleting the characteristic values with a lower association degree to obtain the optimized characteristic value set F1.

5. The method according to claim 1, wherein the optimized characteristic value set F1 is screened based on the medical field information to obtain a key characteristic value set F2, wherein the key characteristic value set is the characteristic value set that has a decisive influence on confirming the disease.

6. The method according to claim 1, wherein prior to multi-classifying the new diseased data set S3, further comprising:

7. The method according to claim 6, wherein the new diseased data set S3 is multi-classified, in particular:

determining a new label set D' according to the disease type, wherein the new label set D' is the set of staging diagnoses corresponding to the disease;

the number of neurons in the output layer corresponds to the number of disease stages, i.e., the number of values in the label set D';

8. An auxiliary classification apparatus for multi-clinical stage disease, comprising: a processor and a memory;

the processor is for storing a computer program for implementing a method for assisted classification of a multi-clinical stage disease according to any one of claims 1-7.

9. A storage medium storing at least one set of instructions;

the set of instructions for being invoked and performing at least the method of assisted classification of a multi-clinical stage disease as claimed in any one of claims 1 to 7.