CN115148319A - Auxiliary classification method, equipment and storage medium for multi-clinical stage diseases - Google Patents
- Publication number
- CN115148319A (application CN202210877630.1A)
- Authority
- CN
- China
- Prior art keywords
- classification
- data set
- disease
- characteristic value
- medical record
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides an auxiliary classification method, equipment and a storage medium for multi-clinical stage diseases, wherein the method comprises the following steps: determining a medical record data set; extracting characteristic values and labels from the data to form a characteristic value set and a label set; performing binary classification on the medical record data set by using a binary classification model; analyzing the association degree of the characteristic value set to obtain an optimized characteristic value set; screening the optimized characteristic value set to obtain a key characteristic value set; searching the healthy data set for medical record data whose characteristic values meet the confirmed-diagnosis conditions and adding them to the diseased data set to form a new diseased data set; and performing multi-classification on the new diseased data set to obtain predictions of the different stages of the disease. The invention predicts the disease stage with classification models and assists doctors in diagnosing the disease.
Description
Technical Field
The application relates to the field of intelligent medical treatment, in particular to an auxiliary classification method, equipment and a storage medium for multi-clinical stage diseases.
Background
Disease staging initially remained at a purely clinical level, e.g., mild versus severe symptoms, and then gradually evolved toward a more advanced clinicopathological perspective, guided by progress in autopsy, imaging, and biomarkers. Staging is suited to diseases that may be slow to heal, involve progressive functional deterioration, and/or may lead to early death. For most diseases, the early stage is relatively stable and the clinical cure rate is high, while the late stage progresses quickly and the cure rate is low. If a disease is found and treated early in its development, before the condition worsens, the patient's clinical cure rate improves greatly, so accurately diagnosing the stage of a disease is one of the important problems in clinical medicine. With the development of machine learning and the improvement of electronic medical records, data-driven intelligent diagnosis and treatment methods have become mainstream. Intelligent medical treatment has been a hot topic of academic research in recent years and a focus of the combination of the computer and medical fields, so how to support disease stage diagnosis through intelligent medical treatment is a problem to be solved.
Disclosure of Invention
In view of the above, the present application provides a method, an apparatus, and a storage medium for assisting classification of multiple clinical stage diseases, so as to solve the problem of helping disease stage diagnosis through intelligent medical treatment.
The implementation method of the technical scheme of the application comprises the following steps:
an assisted classification method for multi-clinical stage diseases, comprising:
determining a medical record data set S1, wherein the medical record data set S1 comprises medical record data of at least one patient;
extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises diseased or healthy labels determined based on the diagnosis results of doctors;
performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D to obtain a healthy data set and a diseased data set;
analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1;
based on the medical field information, screening the optimized characteristic value set F1 to obtain a key characteristic value set F2 and conditions corresponding to the characteristics in the key characteristic value set F2;
searching the healthy data set for medical record data whose characteristic values meet the confirmed-diagnosis conditions in F2, and adding the data to the diseased data set to form a new diseased data set S3;
performing multi-classification on the new diseased data set S3 to obtain predictions of different stages of the disease.
In the method, the physical examination data at least comprises: height, weight, pain level, smoking history, drinking history, and medical history;
the examination result data includes at least: blood and urine biochemical test results and imaging examination results.
In the method, the binary classification of the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D includes:
establishing a candidate binary classification model library, wherein the candidate binary classification model library comprises a plurality of binary classification models;
executing the plurality of binary classification models simultaneously to obtain the accuracy, recall and F1 score of each model, weighing these three evaluation indexes together, and selecting the one binary classification model with the best evaluation indexes to perform binary classification on the medical record data set S1.
In the method, the analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1 includes:
performing association degree analysis on the characteristic values in the characteristic value set F through the chi-square test, the sample variance, or the mutual information between discrete categories, and deleting the characteristic values with lower association degree to obtain an optimized characteristic value set F1.
In the method, the optimized characteristic value set F1 is screened based on the medical field information to obtain a key characteristic value set F2, wherein the key characteristic value set is the characteristic value set which has decisive influence on the confirmed disease.
In the method, before the multi-classifying the new diseased data set S3, the method further includes:
filling missing feature items in the new diseased data set S3 with a specific value, or an average value, or a mode according to the corresponding medical meaning;
the data in the filled diseased data set S3 is normalized to constitute a data set S4.
In the method, the multi-classification is performed on the new diseased data set S3, specifically:
determining a new label set D 'according to the disease type, wherein the new label set D' is a stage diagnosis set corresponding to the disease;
performing multi-classification on S4 based on a deep neural network model, wherein:
the number of the neurons of the input layer corresponds to the number of the characteristic values in the characteristic set F1;
the number of the neurons of the output layer corresponds to the number of disease stages, namely the number of numerical values in the label set D';
the ReLU function is used as the activation function of each hidden layer and a softmax function is created for the output layer, and the disease stage prediction is determined.
The invention also provides auxiliary classification equipment for multi-clinical stage diseases, which comprises: a processor and a memory;
the processor is used for storing a computer program for realizing the auxiliary classification method of the multi-clinical stage diseases.
The invention also proposes a storage medium for storing at least one set of instructions;
the set of instructions is for being invoked and performing at least the assisted classification method for the multi-clinical stage disease.
The method provided by the invention is suitable for multi-stage disease diagnosis. First, a machine learning binary classification model classifies whether the disease is diagnosed or not; then professional knowledge in the medical field is applied to determine a key characteristic value set; finally, the confirmed data from the binary classification result are classified by a deep learning multi-classification model to realize disease stage diagnosis.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments will be briefly introduced below, and it is apparent that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings may be obtained according to the drawings without inventive labor.
FIG. 1 is a flow chart of an embodiment of a method for assisted classification of multiple clinical stage diseases according to the present invention;
FIG. 2 is a schematic structural diagram of an auxiliary classification device for multi-clinical stage diseases according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In view of the above, the present application provides a method, an apparatus, and a storage medium for assisting classification of multiple clinical stage diseases, so as to solve the problem of helping disease stage diagnosis through intelligent medical treatment.
With the development of machine learning and the improvement of electronic medical records, data-driven intelligent diagnosis and treatment methods have become mainstream. The large number of electronic medical records being generated provides an adequate data source for intelligent medical treatment. On the other hand, accurate disease staging is one of the major difficulties clinicians face in diagnosis: for most diseases, the early condition is relatively stable and the clinical cure rate is high, while the late condition develops rapidly and the cure rate is low. Timely and accurate disease staging greatly improves the patient's survival rate and quality of life after recovery. In view of these practical problems, the invention provides an auxiliary classification method for multi-clinical stage diseases, which is suitable for predicting diseases with multiple clinical stages and assisting clinicians in diagnosing them.
The implementation method of the technical scheme of the application comprises the following steps:
the embodiment of the invention provides an auxiliary classification method for multi-clinical stage diseases, which comprises the following steps of:
S101: determining a medical record data set S1, wherein the medical record data set S1 comprises medical record data of at least one patient; the medical record data set S1 can comprise, as a data set, the electronic medical records of a plurality of patients in a hospital medical record library;
S102: extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises diseased or healthy labels determined based on the diagnosis results of doctors;
When a patient visits a doctor, the doctor enters the patient's medical record through the hospital's electronic medical record information system. The electronic medical record data comprises the patient's personal information, symptom data, physical examination data, biochemical test data, the doctor's diagnostic orders, and medication data. The patient's electronic medical record data is then exported from the electronic medical record information system. All disease characteristics in the medical record data set, such as physical examination data (height, weight, pain level, smoking history, drinking history, medical history) and examination results (blood and urine biochemical test results, imaging examination results), are used as characteristic values, and the multi-stage disease condition diagnosed by the doctor is simplified to diseased or healthy and used as the label.
S103: performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D to obtain a healthy data set and a diseased data set;
Based on preliminary observation and statistics of electronic medical records, we have found that the overall diagnostic results can be divided into two categories, "healthy" and "diseased", according to whether the patient has the disease. We then perform a preliminary screening of the fields in the medical record data set S1: the patient's personal information and medication orders, which have no effect on disease diagnosis, are removed, and the remaining characteristic values, such as pain level, height and weight from the physical examination, and blood and urine biochemical test results, form the characteristic set, denoted F = {f1, f2, …, fn}. Let D denote the label set indicating whether the patient is diseased; then D = {1, 0}, where 1 denotes diseased and 0 denotes healthy. Binary classification is performed on S1 on this basis;
Since medical field knowledge has not yet been introduced at this point, and in order to better support binary diagnosis for different types of multi-clinical stage diseases, the concept of a "candidate binary classification model library" is proposed: for a specific scenario (i.e., the electronic medical records of a specific multi-clinical stage disease), a user can execute a plurality of binary classification models simultaneously and then select the most suitable one according to the actual test results. Commonly used binary classification algorithms include logistic regression, K-nearest neighbors (KNN), and support vector machines (SVM). Beyond these mainstream algorithms, random forest and XGBoost models also perform well on classification problems. Take as an example a candidate binary classification model library that uses two models, SVM and XGBoost:
A support vector machine (SVM) is a typical binary classification model. Its basic form is a maximum-margin linear classifier defined on a feature space; the maximum margin distinguishes it from a perceptron. The SVM also includes the kernel trick, which makes it effectively a nonlinear classifier. The basic idea is to solve for the hyperplane that correctly partitions the data set with the largest geometric margin. In sample space, a hyperplane (ω, b) is determined by a normal vector ω and a displacement term b, and the distance from any point x in sample space to the hyperplane can be written as:
$$r = \frac{|\omega^{T} x + b|}{\lVert \omega \rVert}$$
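Among the many hyperplanes that separate the two classes of samples, the partition hyperplane with the largest margin needs to satisfy the following constraints:
$$\max_{\omega, b} \frac{2}{\lVert \omega \rVert} \quad \text{s.t.} \quad y_i(\omega^{T} x_i + b) \ge 1, \quad i = 1, 2, \dots, m$$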
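The point-to-hyperplane distance |ωᵀx + b| / ‖ω‖ and the margin constraint above can be checked numerically. The following is a minimal sketch, not the patent's implementation: the function names, the example weights and the sample points are hypothetical, and it assumes class labels are encoded as +1/−1 (so the label set D = {1, 0} would first be mapped to {+1, −1}).

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance |w.x + b| / ||w|| from point x to the hyperplane (w, b)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

def satisfies_margin(w, b, samples):
    """Check y_i * (w.x_i + b) >= 1 for every sample (x_i, y_i), y_i in {+1, -1}."""
    return all(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1
        for x, y in samples
    )

# Hypothetical hyperplane and labeled samples (illustrative values only).
w, b = [1.0, 1.0], -3.0
samples = [([1.0, 1.0], -1), ([2.0, 2.0], 1), ([3.0, 1.0], 1)]

print(distance_to_hyperplane(w, b, [2.0, 2.0]))
print(satisfies_margin(w, b, samples))
```

In this example every sample lies exactly on the margin boundary, so the constraint holds with equality; a point closer to the hyperplane would violate it.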
XGBoost is an optimized distributed gradient boosting library. It implements machine learning algorithms under the gradient boosting framework and can process large-scale training samples efficiently, flexibly and conveniently. The objective function of XGBoost is:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where l is the loss function, n is the number of medical record samples, $y_i$ is the true diagnosis of the ith medical record, and $\hat{y}_i$ is the model's predicted diagnosis for the ith sample. K denotes the number of regression trees, $f_k$ denotes the kth tree, and Ω, the regularization term, is the complexity of a regression tree, expressed as follows:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^{2}$$
where T is the number of leaf nodes and w is the vector of leaf node values; γ prunes the tree when there are too many leaf nodes, and λ controls the overfitting that occurs when w is too large.
Our optimization objective isWhereinThe sample falls to the leaf node value of the ith regression tree for the optimal case.
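The selection step of the candidate binary classification model library (run several models, compare accuracy, recall and F1 score, keep the best one) can be sketched as follows. This is an illustration under stated assumptions: the source does not say how the three indexes are combined, so the sketch simply averages them; the model names and prediction lists are made-up placeholders standing in for the outputs of trained SVM and XGBoost models.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall and F1 score for labels in {1 (diseased), 0 (healthy)}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

def select_best_model(y_true, predictions):
    """Score every candidate and pick the one with the highest mean of the
    three indexes (the averaging rule is an assumption, not from the source)."""
    scores = {name: binary_metrics(y_true, y_pred)
              for name, y_pred in predictions.items()}
    best = max(scores, key=lambda name: sum(scores[name].values()) / 3)
    return best, scores

# Hypothetical held-out labels and per-model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
predictions = {
    "svm":     [1, 1, 0, 0, 0, 1, 1, 0],
    "xgboost": [1, 1, 1, 0, 0, 0, 0, 0],
}

best, scores = select_best_model(y_true, predictions)
print(best, scores[best])
```

A weighted combination (for example, emphasizing recall so that fewer diseased patients are missed) would be an equally valid reading of "comprehensively considering" the three indexes.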
S104: analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1;
In order to improve the accuracy of model prediction, the chi-square test, the sample variance and the mutual information between discrete categories are considered together to analyze the relevance of the characteristic values in F. Taking the chi-square test and mutual information as examples, the chi-square statistic between a characteristic value Fi and the disease label D is:
$$\chi^{2} = \sum \frac{(A - T)^{2}}{T}$$
where A is the actual (observed) value of Fi over D and T is the theoretical (expected) value. χ² measures the magnitude of the deviation of the actual value from the theoretical value; the larger χ² is, the greater the influence of Fi on the disease.
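As an illustration of the chi-square statistic, the sketch below builds the contingency table between one categorical characteristic value and the diseased/healthy label, takes the observed counts as A and the expected counts under independence as T, and sums (A − T)² / T. The function name and the example data (a binary smoking-history flag) are hypothetical.

```python
from collections import Counter

def chi_square(feature, labels):
    """Chi-square statistic sum((A - T)^2 / T) over the contingency table
    of a categorical feature against the class label."""
    n = len(feature)
    observed = Counter(zip(feature, labels))   # A: observed cell counts
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for d, d_count in label_totals.items():
            expected = f_count * d_count / n   # T: expected count if independent
            actual = observed.get((f, d), 0)
            stat += (actual - expected) ** 2 / expected
    return stat

# Hypothetical data: smoking history (1 = yes) versus label (1 = diseased).
smoking = [1, 1, 1, 1, 0, 0, 0, 0]
disease = [1, 1, 1, 0, 0, 0, 0, 1]
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]

print(chi_square(smoking, disease))
print(chi_square(unrelated, disease))
```

A characteristic value whose distribution is identical in both classes scores 0 and would be a candidate for deletion.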
The mutual information between discrete categories, "mutual information" for short, is a method for screening characteristic values in feature engineering. For discrete random variables X and Y, mutual information is computed as follows:
$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
If X and Y are independent, then p(x, y) = p(x)p(y) and I(X; Y) above is 0; a larger value of I(X; Y) therefore indicates a stronger correlation between the two variables.
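The mutual information of two discrete variables can be estimated directly from empirical frequencies. The sketch below is one way to do so, assuming the natural logarithm (the source does not fix the base) and using hypothetical example data.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X; Y) = sum p(x, y) * log(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical data: the first feature determines the label, the second is independent of it.
label = [1, 1, 0, 0]
informative = [1, 1, 0, 0]
independent = [1, 0, 1, 0]

print(mutual_information(informative, label))
print(mutual_information(independent, label))
```

Only cells with non-zero joint frequency are visited, which matches the convention that 0 · log 0 contributes nothing to the sum.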
On this basis, the characteristic values with a low association degree are deleted, and the preliminarily optimized characteristic set F1 is obtained.
S105: based on the medical field information, screening the optimized characteristic value set F1 to obtain a key characteristic value set F2 and conditions corresponding to the characteristics in the key characteristic value set F2;
On this basis, medical field knowledge is introduced, and a characteristic value set F2 = {fn, …, fm} that has a decisive influence on the prediction result, together with its conditions, is screened from the characteristic value set F1;
This process is based on disease diagnosis knowledge in the medical field and screens out the characteristic values that have a decisive influence on confirming the disease; for example, for gout, the gout quantitative score has a decisive influence on whether gout is confirmed;
S106: searching the healthy data set for medical record data whose characteristic values meet the confirmed-diagnosis conditions in F2, and adding the data to the diseased data set to form a new diseased data set S3;
With the characteristic value set F2 = {fn, …, fm} and its conditions screened from F1 as described above, the healthy data set S2 (the records with D = 0 after binary classification) is searched, the data S2' whose F2 characteristic values meet the confirmed-diagnosis conditions are selected, and S2' is then added to the data set with D = 1 to form the data set S3 used for multi-classification.
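Step S106 amounts to a rule-based filter over the records classified as healthy. The sketch below shows one possible shape for it. Everything named here is hypothetical: the record layout, the `gout_score` field, the threshold of 8, and the choice to require that all conditions hold (the source does not say whether conditions are combined with "and" or "or").

```python
def build_multiclass_set(healthy, diseased, conditions):
    """Return S3: the diseased records plus every healthy record whose key
    characteristic values satisfy all confirmed-diagnosis conditions."""
    confirmed = [
        record for record in healthy
        if all(
            record.get(name) is not None and predicate(record[name])
            for name, predicate in conditions.items()
        )
    ]
    return diseased + confirmed

# Hypothetical key characteristic set F2 with one illustrative condition.
conditions = {"gout_score": lambda value: value >= 8}

healthy = [                      # S2: predicted healthy (D = 0)
    {"id": 1, "gout_score": 9},  # meets the condition, moved into S3
    {"id": 2, "gout_score": 3},
    {"id": 3},                   # score missing, left in the healthy set
]
diseased = [{"id": 4, "gout_score": 10}]   # predicted diseased (D = 1)

s3 = build_multiclass_set(healthy, diseased, conditions)
print([record["id"] for record in s3])
```

The effect is to recover records that the binary classifier missed but that medical knowledge would still treat as confirmed cases, before stage prediction is attempted.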
S107: the new diseased data set S3 is multi-classified to obtain predictions of different stages of the disease.
In the method, the physical examination data at least comprises: height, weight, pain level, smoking history, drinking history, and medical history;
the examination result data includes at least: blood and urine biochemical test results and imaging examination results.
In the method, the binary classification of the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D includes:
establishing a candidate binary classification model library, wherein the candidate binary classification model library comprises a plurality of binary classification models;
executing the plurality of binary classification models simultaneously to obtain the accuracy, recall and F1 score of each model, weighing these three evaluation indexes together, and selecting the one binary classification model with the best evaluation indexes to perform binary classification on the medical record data set S1.
In the method, the analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1 includes:
performing association degree analysis on the characteristic values in the characteristic value set F through the chi-square test, the sample variance, or the mutual information between discrete categories, and deleting the characteristic values with lower association degree to obtain an optimized characteristic value set F1.
In the method, the optimized characteristic value set F1 is screened based on the medical field information to obtain a key characteristic value set F2, wherein the key characteristic value set is the characteristic value set which has decisive influence on the diagnosed diseases.
In the method, before the multi-classifying the new diseased data set S3, the method further includes:
filling missing feature items in the new diseased data set S3 with a specific value, or an average value, or a mode according to the corresponding medical meaning;
For a characteristic item with missing data in S3, the missing value is filled with a specific value, the average value or the mode according to the medical meaning of the item. For example, for the number of painful joints, a missing value indicates that the patient has no symptoms of joint pain, so it is filled with 0 by default. If the drink type is missing from the drinking history, it is filled with the most frequently occurring type, "beer".
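The three filling strategies can be sketched with the standard library alone. The field names (`painful_joints`, `drink_type`, `weight`) and the sample records are hypothetical; a missing value is represented here as `None` or an absent key.

```python
from collections import Counter
from statistics import mean

def fill_missing(records, feature, strategy, value=None):
    """Fill missing entries of one feature with a constant, the mean, or the mode."""
    present = [r[feature] for r in records if r.get(feature) is not None]
    if strategy == "constant":
        fill = value
    elif strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Return new dicts so the input records are left unmodified.
    return [
        {**r, feature: fill} if r.get(feature) is None else r
        for r in records
    ]

records = [
    {"painful_joints": 2, "drink_type": "beer", "weight": 70.0},
    {"painful_joints": None, "drink_type": "beer", "weight": None},
    {"painful_joints": 1, "drink_type": None, "weight": 80.0},
    {"painful_joints": None, "drink_type": "wine", "weight": 75.0},
]

filled = fill_missing(records, "painful_joints", "constant", value=0)
filled = fill_missing(filled, "drink_type", "mode")
filled = fill_missing(filled, "weight", "mean")
print(filled[1])
print(filled[2])
```

Which strategy applies to which item is a medical decision, as the text notes: a missing joint count means "no painful joints", whereas a missing weight carries no such meaning and is better replaced by a typical value.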
The data in the filled diseased data set S3 is normalized to constitute a data set S4.
Because different evaluation indexes often have different dimensions and units, the Z-Score method is used for standardization to offset the influence of this problem on data analysis; it scales the data proportionally so that they fall into a specific interval:
$$z = \frac{x - \mu}{\sigma}$$
where x is the actual value of a certain characteristic value in F1, μ is its mean, and σ is its standard deviation. The Z-Score method converts data of different magnitudes into a unified measure, improving the comparability of the data. The data set S4, obtained after missing value filling and normalization, can be used as the input of the multi-classification model for disease stage prediction.
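Z-Score standardization computes z = (x − μ) / σ for every value of a characteristic. A minimal sketch follows; it assumes the population standard deviation (`statistics.pstdev`), since the source does not say whether the population or sample form is intended, and it maps a constant column to zeros to avoid division by zero. The example weights are made up.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize one feature column: z = (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)          # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(x - mu) / sigma for x in values]

weights_kg = [60.0, 70.0, 80.0]     # hypothetical weight column
z = z_score(weights_kg)
print(z)
```

After standardization the column has mean 0 and standard deviation 1, so characteristics measured in different units contribute on a comparable scale to the neural network input.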
In the method, the multi-classification is performed on the new diseased data set S3, specifically:
determining a new label set D 'according to the disease type, wherein the new label set D' is a stage diagnosis set corresponding to the disease;
performing multi-classification on S4 based on a deep neural network model, wherein:
the number of the neurons of the input layer corresponds to the number of the characteristic values in the characteristic set F1;
the number of the neurons of the output layer corresponds to the number of disease stages, namely the number of numerical values in the label set D';
the ReLU function is used as the activation function of each hidden layer and a softmax function is created for the output layer, and the disease stage prediction is determined.
We multi-classify S4 using a deep neural network model (DNN model). A DNN is a neural network comprising a plurality of hidden layers, and its internal layers can be divided into three categories: an input layer, hidden layers, and an output layer. The number of neurons in the input layer corresponds to the number of characteristic values in the characteristic set F1, and the number of neurons in the output layer corresponds to the number of disease stages, i.e., the size of the label set $D' = \{d_1, d_2, \dots, d_n \mid d_i \in \mathbb{N}^{+}\}$, where $d_1$ to $d_n$ are the diagnoses of the different stages of the disease. The ReLU function is used as the activation function of each hidden layer, and a softmax function is created as the activation function of the output layer to solve the multi-classification problem. The softmax function is defined as follows:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$
where $z_i$ is the output value of the ith node, i.e., the output value for a certain disease stage, and C is the number of output nodes, i.e., the number of disease stages. Cross entropy, which performs well in multi-classification problems, is used as the loss function according to the disease type, to further improve the prediction accuracy for the different disease stages.
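The forward pass of such a network (ReLU hidden layer, softmax output, cross-entropy loss) can be written out in a few lines. This is a toy sketch for illustration, not the patent's model: the layer sizes, weights and input vector are arbitrary hypothetical values, only one hidden layer is shown, and no training is performed.

```python
import math

def relu(vector):
    """ReLU activation applied element-wise."""
    return [max(0.0, v) for v in vector]

def softmax(z):
    """softmax(z_i) = exp(z_i) / sum_c exp(z_c), shifted by max(z) for stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def cross_entropy(probs, true_index):
    """Cross-entropy loss for a single sample with a one-hot target."""
    return -math.log(probs[true_index])

def predict_stage(x, hidden, output):
    """Forward pass; returns (0-based index of the most probable stage, probabilities)."""
    h = relu(dense(x, *hidden))
    probs = softmax(dense(h, *output))
    return probs.index(max(probs)), probs

# Hypothetical network: 2 standardized features, 3 hidden neurons, 3 disease stages.
hidden = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
output = ([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]], [0.0, 0.0, 0.0])

stage_index, probs = predict_stage([0.5, -1.0], hidden, output)
print(stage_index + 1, probs)       # +1 maps the index to stage labels d_1..d_n
print(cross_entropy(probs, stage_index))
```

Subtracting the maximum before exponentiating leaves the softmax result unchanged while avoiding overflow for large output values.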
In another embodiment, the present invention further provides an auxiliary classification device for multi-clinical stage diseases, comprising: a processor 201 and a memory 202;
the processor is used for storing a computer program for realizing the auxiliary classification method of the multi-clinical stage diseases.
In yet another embodiment, the present invention further provides a storage medium for storing at least one set of instructions;
the set of instructions is for being invoked and performing at least the assisted classification method for the multi-clinical stage disease.
The method provided by the invention is suitable for multi-stage disease diagnosis. First, a machine learning binary classification model classifies each record as diseased or healthy. Professional knowledge in the medical field is then applied to determine a key characteristic value set, and the diseased data from the binary classification result is classified by a deep learning multi-classification model to realize disease stage diagnosis. By combining professional knowledge in the medical field, disease characteristics are segmented and screened from the complex and varied electronic medical record data collected by a hospital, and the screened data is used to predict diseases with multiple clinical stages and to assist clinicians in disease diagnosis.
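The step in which key characteristic conditions are used to recover records from the healthy data set can be sketched as follows. The feature names (`marker_a`, `imaging_positive`), the threshold, and the rule that satisfying any one condition is sufficient are all assumptions made for illustration; the patent leaves the concrete conditions to medical domain knowledge.

```python
def build_diseased_set(healthy, diseased, conditions):
    """Move healthy-classified records that meet a key-feature condition
    into the diseased set, returning (new_diseased_set, remaining_healthy)."""
    recovered, remaining = [], []
    for record in healthy:
        # Assumption: meeting any single condition counts as a confirmed diagnosis.
        if any(check(record) for check in conditions.values()):
            recovered.append(record)
        else:
            remaining.append(record)
    return diseased + recovered, remaining

# Hypothetical key feature set F2 with one condition per feature.
conditions = {
    "marker_a": lambda r: r.get("marker_a", 0.0) > 5.0,
    "imaging_positive": lambda r: r.get("imaging_positive", False),
}

# Illustrative records as split by the binary classifier.
healthy = [
    {"id": 1, "marker_a": 2.0, "imaging_positive": False},
    {"id": 2, "marker_a": 7.5, "imaging_positive": False},
    {"id": 3, "marker_a": 1.0, "imaging_positive": True},
]
diseased = [{"id": 4, "marker_a": 9.0, "imaging_positive": True}]

s3, still_healthy = build_diseased_set(healthy, diseased, conditions)
```

Here `s3` plays the role of the new diseased data set S3 that is passed on to multi-classification.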
The above embodiments further illustrate the objects, technical solutions and advantages of the present application. It should be understood that they are only examples of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present application shall fall within the scope of the present application.
Claims (9)
1. An auxiliary classification method for multi-clinical stage diseases is characterized by comprising the following steps:
determining a medical record data set S1, wherein the medical record data set S1 comprises medical record data of at least one patient;
extracting characteristic values and labels of medical records in the medical record data set S1 to form a characteristic value set F and a label set D, wherein the characteristic value set F comprises physical examination data and examination result data in the medical record data of patients, and the label set D comprises two types of labels of diseases or health which are determined based on doctor diagnosis results;
performing binary classification on the medical record data set S1 by using a binary classification model based on the characteristic value set F and the label set D to obtain a healthy data set and a diseased data set;
analyzing the association degree of the characteristic value set F to obtain an optimized characteristic value set F1;
based on the medical field information, screening the optimized characteristic value set F1 to obtain a key characteristic value set F2 and conditions corresponding to the characteristics in the key characteristic value set F2;
searching the healthy data set for medical record data whose characteristic values meet the confirmed-diagnosis conditions in F2, and adding that data to the diseased data set to form a new diseased data set S3;
performing multi-classification on the new diseased data set S3 to obtain predictions of the different stages of the disease.
2. The method of claim 1, wherein the physical examination data comprises at least: height, weight, pain level, smoking history, drinking history, and medical history;
the inspection result data at least includes: biochemical test result of hematuria and imaging test result.
3. The method according to claim 1, wherein performing binary classification on the medical record data set S1 based on the characteristic value set F and the label set D using a binary classification model comprises:
establishing a candidate binary classification model library, wherein the candidate binary classification model library comprises a plurality of binary classification models;
running the plurality of binary classification models to obtain the accuracy, recall, and F1 score of each model, considering these three evaluation indexes together, and selecting the one binary classification model with the best evaluation indexes to perform binary classification on the medical record data set S1.
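The model selection in claim 3 can be sketched as below. The model names, the prediction vectors, and the rule for combining the three indexes (an unweighted mean) are assumptions for illustration; the claim only says the three indexes are considered together.

```python
def evaluate(y_true, y_pred):
    # Label 1 means diseased, label 0 means healthy.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "recall": recall, "f1": f1}

def select_best(y_true, predictions_by_model):
    scores = {name: evaluate(y_true, pred)
              for name, pred in predictions_by_model.items()}
    # Assumption: rank candidates by the unweighted mean of the three indexes.
    best = max(scores, key=lambda name: sum(scores[name].values()) / 3)
    return best, scores

# Illustrative labels and predictions from three hypothetical candidates.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
predictions_by_model = {
    "model_a": [1, 1, 0, 0, 0, 0, 1, 0],
    "model_b": [1, 1, 1, 1, 1, 0, 1, 0],
    "model_c": [0, 1, 0, 0, 1, 0, 1, 0],
}

best_model, all_scores = select_best(y_true, predictions_by_model)
```

In a screening setting a different weighting, for example one that favours recall, may be more appropriate; the combination rule is a design choice rather than something the source fixes.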
4. The method according to claim 1, wherein the analyzing the eigenvalue set F for relevancy to obtain an optimized eigenvalue set F1 comprises:
analyzing the association degree of the characteristic values in the characteristic value set F by a chi-square test, sample variance, or mutual information for discrete categories, and deleting the characteristic values with a lower association degree to obtain the optimized characteristic value set F1.
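Of the three options named in claim 4, the chi-square test can be sketched in plain Python as follows. The feature columns are illustrative data, and the cut-off of 3.841 (the 0.05 critical value for one degree of freedom) is an assumed threshold, not one given in the patent.

```python
from collections import Counter

def chi_square(feature_values, labels):
    """Pearson chi-square statistic between a categorical feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat

def select_features(columns, labels, threshold):
    # Keep only features whose association with the label reaches the threshold.
    return [name for name, values in columns.items()
            if chi_square(values, labels) >= threshold]

# Illustrative binary features for eight records (1 = yes, 0 = no).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "smoking_history": [1, 1, 1, 1, 0, 0, 0, 0],
    "drinking_history": [1, 1, 1, 0, 0, 0, 0, 1],
    "unrelated_flag": [1, 0, 1, 0, 1, 0, 1, 0],
}

optimized_f1 = select_features(columns, labels, threshold=3.841)
```

With so few records the statistic is only indicative; the sketch shows the mechanics of scoring and filtering, not a recommended sample size.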
5. The method according to claim 1, wherein the optimized feature value set F1 is screened based on the medical field information to obtain a key feature value set F2, wherein the key feature value set is a feature value set that is decisive for determining the disease.
6. The method according to claim 1, wherein prior to multi-classifying the new diseased data set S3, further comprising:
filling missing feature items in the new diseased data set S3 with a specific value, or an average value, or a mode according to the corresponding medical meaning;
the data in the filled diseased data set S3 is normalized to constitute a data set S4.
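The preprocessing in claim 6 can be sketched as below. The claim does not say which normalization is used, so min-max scaling is an assumption here, as are the example columns and the choice of fill strategy for each one.

```python
from statistics import mean, mode

def fill_missing(values, strategy, constant=None):
    """Replace None entries using a constant, the mean, or the mode."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "mode":
        fill = mode(present)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    # Assumption: min-max scaling to [0, 1]; a constant column maps to 0.0.
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

# Illustrative columns from the diseased data set S3 (None marks a missing item).
weight = [60.0, None, 80.0, 70.0]
pain_level = [2, None, 2, 5]
smoking_history = [1, 0, None, 1]

weight_filled = fill_missing(weight, "mean")
pain_filled = fill_missing(pain_level, "mode")
smoking_filled = fill_missing(smoking_history, "constant", constant=0)

# Normalized columns together would make up the data set S4.
weight_norm = min_max_normalize(weight_filled)
pain_norm = min_max_normalize(pain_filled)
```

Which strategy suits which feature depends on its medical meaning, as the claim states; the assignments above are placeholders.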
7. The method according to claim 6, wherein the new diseased data set S3 is multi-classified, in particular:
determining a new label set D' based on the disease category, the new label set D' being the set of stage diagnoses corresponding to said disease;
performing multi-classification on S4 based on a deep neural network model, wherein:
the number of the neurons of the input layer corresponds to the number of the characteristic values in the characteristic set F1;
the number of neurons in the output layer corresponds to the number of disease stages, i.e. the number of values in the label set D';
using the relu function as an activation function for each hidden layer and creating a softmax function, a disease stage prediction is determined.
8. An auxiliary classification apparatus for multi-clinical stage disease, comprising: a processor and a memory;
the processor is for storing a computer program for implementing a method for assisted classification of a multi-clinical stage disease according to any one of claims 1-7.
9. A storage medium storing at least one set of instructions;
the set of instructions for being invoked and performing at least the method of assisted classification of a multi-clinical stage disease as claimed in any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210877630.1A CN115148319A (en) | 2022-07-25 | 2022-07-25 | Auxiliary classification method, equipment and storage medium for multi-clinical stage diseases |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115148319A (en) | 2022-10-04 |
Family
ID=83414231
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210877630.1A Pending CN115148319A (en) | 2022-07-25 | 2022-07-25 | Auxiliary classification method, equipment and storage medium for multi-clinical stage diseases |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115148319A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065171A (en) * | 2018-11-05 | 2018-12-21 | 苏州贝斯派生物科技有限公司 | The construction method and system of Kawasaki disease risk evaluation model based on integrated study |
CN109785976A (en) * | 2018-12-11 | 2019-05-21 | 青岛中科慧康科技有限公司 | A kind of goat based on Soft-Voting forecasting system by stages |
CN110347837A (en) * | 2019-07-17 | 2019-10-18 | 电子科技大学 | A kind of unplanned Risk Forecast Method of being hospitalized again of cardiovascular disease |
CN112541542A (en) * | 2020-12-11 | 2021-03-23 | 第四范式(北京)技术有限公司 | Method and device for processing multi-classification sample data and computer readable storage medium |
CN113555077A (en) * | 2021-09-18 | 2021-10-26 | 北京大学第三医院(北京大学第三临床医学院) | Suspected infectious disease prediction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||