CN116741393A - Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium - Google Patents
- Publication number
- CN116741393A (application CN202310648204.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- classification
- feature
- medical record
- thyroid disease
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention discloses a method, in the field of medical equipment, for constructing a medical record-based thyroid disease dataset classification model, comprising the following steps: 1) collecting medical record data of a certain number of thyroid disease patients; 2) extracting features from the medical record content, current medical history and physical examination results to form a feature vector representing each medical record; 3) performing data preprocessing, including data cleaning, missing value processing and feature normalization; 4) taking the preprocessed feature vectors as input, using a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm and a random forest algorithm as base classifiers, building an ensemble model, and then comparing the ensemble side by side with the results of the individual classifiers that make it up; 5) training and evaluating each classification model using the training set and the test set respectively, and selecting the model whose predictions are most accurate. The invention thereby improves the accuracy and reliability of disease classification.
Description
Technical Field
The invention relates to the field of medical equipment, in particular to a method for constructing a disease data set classification model.
Background
In modern medical systems, disease classification remains one of the major challenges in the medical field. Accurate judgment of a patient's condition is the precondition for, and the key to, treatment. With the continuous development of modern medical technology, the volume of medical data has grown sharply, which makes the classification of medical disease datasets more challenging. Faced with a large number of cases and complicated conditions, a doctor's judgment may contain errors and subjectivity, which affects the quality of care.
In recent years, with the development of artificial intelligence and machine learning, the medical field has begun to use these techniques to improve the accuracy and efficiency of disease dataset classification. The rapid development of information technology brings new opportunities to the medical industry: such techniques are already widely used in drug discovery, medical imaging and disease classification, and machine learning is expected to become increasingly important in disease classification. Ensemble learning is an effective way to improve classifier accuracy and has been shown to outperform single classifiers in many fields. Machine learning techniques can learn patterns and rules from data and thereby provide physicians with more accurate disease dataset classification results. Meanwhile, as medical data continue to accumulate and be digitized, the application prospects of machine learning in the medical field keep widening.
Among the many diseases, thyroid diseases have relatively clear and regular characteristics but have been little studied in this respect. Heterogeneous ensemble learning is therefore well suited to the field of thyroid disease classification.
In existing machine learning technology, multi-label classification is an important technique that can identify and rank several possible diseases from features such as the disease description and the physical examination content. Meanwhile, heterogeneous ensemble learning can effectively combine several different classification algorithms and improve the accuracy and reliability of prediction.
While the prior art provides some machine learning-based disease dataset classification methods, some problems remain. For example, existing methods can generally classify only a single disease and have difficulty handling cases with multiple diseases; moreover, existing methods usually predict with only one or a few classification algorithms, so their classification accuracy and reliability are hard to improve. Adopting machine learning methods to assist disease classification can therefore effectively improve the accuracy and reliability of diagnostic classification.
Currently, automatic analysis of medical images is widely used; for example, computer-aided classification systems based on deep learning have achieved significant results in image classification, disease detection and lesion segmentation. Automated disease classification from medical records, however, is comparatively more difficult, because the records contain doctors' textual descriptions, whose forms of expression are more ambiguous and diverse.
Some current research mainly adopts deep learning to solve this problem, for example by converting the text of medical records into vector representations and then classifying the records with a deep neural network. This approach, however, requires a large amount of labelled data for training, places high demands on the training data because of the diversity and complexity of medical records, and may suffer from problems such as overfitting.
Disclosure of Invention
To address the problems described in the background, each hospital medical record is used as one input sample for the model, the multi-label classification problem is converted into multiple binary classification problems through feature extraction and data preprocessing, and several machine learning algorithms are trained and fused, so that a reliable disease classification model is finally obtained and the accuracy and reliability of disease classification are improved.
The purpose of the invention is realized in the following way: a method for constructing a thyroid disease data set classification model based on medical records comprises the following steps:
step 1) collecting medical record data of a certain number of thyroid disease patients;
step 2) extracting features from medical record contents, current medical history and physical examination results to form feature vectors to represent each medical record;
step 3) data preprocessing, including data cleaning, missing value processing and feature normalization;
step 4) taking the preprocessed feature vectors as input, using a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm and a random forest algorithm as base classifiers, building an ensemble model, and then comparing the ensemble side by side with the results of the individual classifiers that make it up;
step 5) training and evaluating each classification model using the training set and the test set respectively, and selecting the model with the highest prediction accuracy.
As a further definition of the present invention, the feature extraction in step 2) may specifically include:
step 2-1) extracting key features related to thyroid diseases from the medical record content by keyword matching;
step 2-2) converting the medical record content into numerical features with a bag-of-words model, using natural language processing technology;
step 2-3) combining the history of the disease with the physical examination results to extract numerical features associated with thyroid disease.
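By way of non-limiting illustration, steps 2-1) and 2-2) might be sketched as follows. The keyword list, the English sample sentences and the letter-based tokenizer are assumptions made only for this example; the disclosure does not enumerate its keywords, and real medical records would need a tokenizer suited to their language.

```python
import re
from collections import Counter

# Hypothetical keyword list; the disclosure does not enumerate its keywords.
KEYWORDS = ["palpitation", "fatigue", "hunger"]

def keyword_features(text, keywords=KEYWORDS):
    """Step 2-1 sketch: 1 if the keyword occurs in the record text, else 0."""
    lowered = text.lower()
    return [1 if kw in lowered else 0 for kw in keywords]

def bag_of_words(texts):
    """Step 2-2 sketch: map each text to a vector of token counts."""
    tokenized = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

# Illustrative records only, not taken from the patent's data.
records = [
    "Palpitation and hunger for two weeks",
    "Fatigue, fatigue and weight gain",
]
vocab, vectors = bag_of_words(records)
kw = [keyword_features(r) for r in records]
```

The keyword vector and the bag-of-words vector can then be concatenated with the numerical features of step 2-3) to form the feature vector of one medical record.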
As a further definition of the present invention, the data preprocessing in step 3 specifically includes:
step 3-1) data cleaning, including processing missing values, outliers and noise data;
step 3-2) feature normalization, scaling values of different features to a uniform range using a normalization method;
step 3-3) feature selection: using correlation analysis, feature importance evaluation and model-based feature selection methods, selecting the feature subset that is significantly correlated with the thyroid disease classification result and contributes most to the classification.
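As a minimal sketch of steps 3-1) and 3-2), the following assumes mean imputation for missing values and min-max scaling to the range [0, 1]. Both choices are assumptions made for illustration; the disclosure names neither a specific imputation rule nor a specific normalization formula.

```python
def impute_missing(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def min_max_scale(rows):
    """Scale every column to [0, 1]; constant columns are mapped to 0."""
    n_cols = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n_cols)]
    hi = [max(r[j] for r in rows) for j in range(n_cols)]
    return [[(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(n_cols)]
            for r in rows]

# Toy feature matrix with two missing entries.
raw = [[1.0, 10.0], [None, 20.0], [3.0, None]]
clean = min_max_scale(impute_missing(raw))
```

Imputing before scaling keeps the filled-in values inside the observed range, so the scaled output stays within [0, 1].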
As a further definition of the present invention, constructing a plurality of classification models in step 4 specifically includes:
step 4-1) using a support vector machine algorithm to perform model training by adjusting a kernel function and regularization parameters;
step 4-2) using a decision tree algorithm to perform model training by selecting the optimal splitting feature and the depth of the tree;
step 4-3) performing model training through maximum likelihood estimation by using a logistic regression algorithm;
step 4-4) performing model training by constructing a plurality of decision trees and voting decisions by using a random forest algorithm.
As a further definition of the present invention, the step 4-1) specifically includes:
step 4-1-1) selecting a Gaussian kernel function and regularization parameters;
step 4-1-2) searching for a hyperplane on the training data set according to the maximum-margin classification principle, so as to separate the data set into different categories;
step 4-1-3) determining the position and shape of the hyperplane by a convex optimization method, so that different categories can be distinguished;
step 4-1-4) establishing a support vector machine classification model according to the feature vectors and the class labels of the training data set.
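For reference, steps 4-1-1) to 4-1-3) correspond to the standard soft-margin support vector machine with a Gaussian kernel, which may be written as below. The symbols are the usual textbook notation and are assumptions, not terms taken from the disclosure: $\gamma$ is the kernel width, $C$ the regularization parameter, $\xi_i$ the slack variables, $\phi$ the feature map induced by the kernel, $m$ the number of training samples and $y_i \in \{-1, +1\}$ the class label of sample $x_i$.

```latex
\[
K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j)
            = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right),
\qquad \gamma > 0
\]
\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.}\quad & y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \qquad i = 1, \dots, m
\end{aligned}
\]
```

The objective is convex, which is why step 4-1-3) can determine the hyperplane by a convex optimization method; a larger $C$ penalizes margin violations more heavily.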
As a further definition of the present invention, the step 4-2) specifically includes:
step 4-2-1) constructing a decision tree model based on the characteristics and the category labels of the training data set;
step 4-2-2) selecting, as the optimal splitting feature, the feature that most effectively distinguishes the different thyroid disease categories, so that the categories are best separated by the values of that feature;
step 4-2-3) recursively performing the splitting operation until a predetermined maximum tree depth is reached;
step 4-2-4) forming a model capable of carrying out classification prediction according to the input feature vector through the learning process of the decision tree.
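The choice of the optimal splitting feature in step 4-2-2) can be illustrated with the short sketch below. It assumes binary (0/1) features and uses Gini impurity as the splitting criterion; the disclosure does not name a criterion, so Gini is an assumption here, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return (feature index, weighted impurity) of the best binary-feature split."""
    best = (None, float("inf"))
    n = len(y)
    for j in range(len(X[0])):
        left = [y[i] for i in range(n) if X[i][j] == 0]
        right = [y[i] for i in range(n) if X[i][j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (j, score)
    return best

# Toy data: feature 0 separates the two classes perfectly, feature 1 does not.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
feature, impurity = best_split(X, y)
```

A full tree would apply `best_split` recursively to the left and right subsets until the predetermined maximum depth of step 4-2-3) is reached.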
As a further definition of the present invention, the step 4-3) specifically includes:
step 4-3-1) modeling the relationship between the feature vector and the class label as a probability distribution;
step 4-3-2) estimating model parameters based on a training data set by a maximum likelihood estimation method, so that the probability of model prediction is the most consistent with the probability of an actual observation value;
step 4-3-3) iteratively updating the model parameters by a gradient descent method until the change in the model parameters or the change in the loss function is no longer significant;
step 4-3-4) establishing a logistic regression classification model that predicts the corresponding class probabilities from the input feature vectors.
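Steps 4-3-1) to 4-3-4) may be sketched, under stated assumptions, as batch gradient descent on the mean negative log-likelihood of a binary logistic model. The learning rate, iteration cap and stopping tolerance are hypothetical values chosen only for this toy example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, max_iter=5000, tol=1e-8):
    """Minimise the mean negative log-likelihood by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    prev_loss = float("inf")
    eps = 1e-12  # guards against log(0)
    for _ in range(max_iter):
        probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in X]
        loss = -sum(yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
                    for yi, p in zip(y, probs)) / n
        # Step 4-3-3: stop once the loss no longer changes significantly.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        grad_w = [sum((p - yi) * x[j] for p, yi, x in zip(probs, y, X)) / n
                  for j in range(d)]
        grad_b = sum(p - yi for p, yi in zip(probs, y)) / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-dimensional data: small values are class 0, large values class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
```

Maximising the likelihood and minimising the negative log-likelihood are equivalent, which is why the gradient descent of step 4-3-3) realises the maximum likelihood estimation of step 4-3-2).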
As a further definition of the present invention, the step 4-4) specifically includes:
step 4-4-1) based on the training data set, randomly selecting a feature subset and a sample subset to construct a plurality of decision tree models;
step 4-4-2) for the new input samples, each decision tree performs independent classification prediction;
step 4-4-3) combining the classification results of each decision tree in an average manner to obtain a final model prediction result.
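The following is a deliberately small sketch of steps 4-4-1) to 4-4-3). Each "tree" is reduced to a depth-1 stump on a single randomly chosen binary feature, trained on a bootstrap sample, and the forest output is the average of the stump outputs. The stump simplification, the number of trees and the fixed seed are assumptions made for brevity, not details from the disclosure.

```python
import random

def train_stump(X, y, feature):
    """Depth-1 tree on one binary feature: stores the positive rate on each side."""
    def rate(value):
        ys = [yi for xi, yi in zip(X, y) if xi[feature] == value]
        return sum(ys) / len(ys) if ys else 0.5  # 0.5 when that side is empty
    rates = {0: rate(0), 1: rate(1)}
    return lambda x: rates[x[feature]]

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 4-4-1: random sample subset (bootstrap) and random feature subset.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Steps 4-4-2 and 4-4-3: independent predictions, combined by averaging."""
    return sum(tree(x) for tree in forest) / len(forest)

# Toy data in which the label equals feature 0.
X = [[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 1, 0, 0, 1, 0]
forest = train_forest(X, y)
p_pos = predict_forest(forest, [1, 0])
p_neg = predict_forest(forest, [0, 0])
```

Thresholding the averaged score at 0.5 would turn the output into a class decision.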
A thyroid disease data set classification device based on medical records comprises a processor, a memory and a construction program stored on the memory and capable of running on the processor, wherein the construction program realizes the steps of the construction method when being executed by the processor.
A computer-readable storage medium having stored thereon a construction program which, when executed by a processor, implements the steps of the construction method described above.
Compared with the prior art, the invention has the beneficial effects that:
the invention classifies on the basis of 138 thyroid-related symptoms and 35 corresponding diseases, covers a more comprehensive and finer range of diseases, and can classify diseases more accurately;
a multi-label classification model from machine learning is adopted, so that each medical record can carry several disease labels, which better matches actual clinical situations;
the heterogeneous ensemble learning method of the Bagging algorithm is used, which effectively reduces overfitting of the model and improves its generalization capability;
by adopting a model fusion strategy and carrying out error analysis on the prediction results of different classifiers, the types of diseases can be predicted more accurately, and the accuracy of classification of the disease data set is improved;
by carrying out feature extraction and data preprocessing on a large amount of medical record data, the data can be better utilized to classify the disease data set, and the efficiency of classifying the disease data set is improved;
in conclusion, the invention has innovations in aspects of classification range, model selection, algorithm optimization, model fusion, data utilization and the like, can more accurately classify diseases, and improves the efficiency and accuracy of clinical work.
Drawings
FIG. 1 is a schematic flow chart of the present invention
FIG. 2 is a flow chart of the heterogeneous ensemble learning algorithm according to the present invention
Detailed Description
Example 1
The method for constructing the thyroid disease data set classification model based on medical records shown in fig. 1 comprises the following steps:
step 1, acquiring 1880 medical record sample data and preprocessing;
step 2, constructing an algorithm model;
step 3, training a model and predicting;
step 4, predicting on the test set and integrating the models.
Further, in embodiment 1, the data collection and preprocessing in step 1 specifically includes:
step 1-1, collecting data: medical records of 1880 thyroid disease patients, including current medical history, physical examination content and doctor judgment results, were collected from hospitals.
Step 1-2, feature extraction: using a feature extraction algorithm, sequences of 138 thyroid-related symptoms (such as palpitations, fatigue, obesity and hunger) and 35 corresponding diseases (such as hyperthyroidism, hypothyroidism, thyroid nodule and papillary thyroid cancer) are extracted from all medical record samples and used as the characteristics of the samples.
Step 1-3, data preprocessing: for each medical record, a 138-dimensional feature vector and its corresponding 35-dimensional target vector are generated from the symptom sequence and the disease sequence respectively. If a symptom is present in the medical record, the corresponding entry is 1; otherwise it is 0. Likewise, if a disease is present in the medical record, the corresponding entry is 1; otherwise it is 0.
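The 0/1 encoding of step 1-3 can be illustrated as follows. The short vocabularies below are stand-ins for the 138 symptoms and 35 diseases, whose full lists are not given in the disclosure, and the sample record is hypothetical.

```python
# Stand-ins for the full 138-symptom and 35-disease vocabularies.
SYMPTOMS = ["palpitation", "fatigue", "obesity", "hunger"]
DISEASES = ["hyperthyroidism", "hypothyroidism", "thyroid nodule"]

def multi_hot(present, vocabulary):
    """Return a 0/1 vector with 1 where the vocabulary item is present."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]

# Hypothetical medical record after symptom and disease extraction.
record = {
    "symptoms": ["palpitation", "hunger"],
    "diseases": ["hyperthyroidism", "thyroid nodule"],
}
x = multi_hot(record["symptoms"], SYMPTOMS)   # feature vector
t = multi_hot(record["diseases"], DISEASES)   # multi-label target vector
```

With the full vocabularies, `x` would have 138 entries and `t` would have 35, and a record with several diagnoses simply has several 1 entries in `t`.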
The following is the distribution of part of the samples in one embodiment.
Disease name | Number of cases | Label |
Inflammatory hyperthyroidism | 878 | 0 or 1 |
Primary hyperthyroidism | 567 | 0 or 1 |
Papillary thyroid carcinoma | 3 | 0 or 1 |
Follicular carcinoma of thyroid gland | 2 | 0 or 1 |
Medullary thyroid carcinoma | 1 | 0 or 1 |
Thyroid benign tumor postoperative hypothyroidism | 5 | 0 or 1 |
Non-toxic goiter | 121 | 0 or 1 |
Hypothyroidism | 234 | 0 or 1 |
Obesity | 145 | 0 or 1 |
Further, in embodiment 1, the algorithm model is constructed in step 2, and the specific process includes:
Step 2-1, dividing the 1880 collected medical record samples into a training set and a test set.
Step 2-2, using the heterogeneous ensemble learning method of the Bagging algorithm, randomly sampling the training set with replacement multiple times to construct N different training sets, and training a multi-label classification model on each training set, where N is greater than or equal to 2.
Step 2-3, using a model fusion strategy, evaluating the classifiers for each class according to the error between each classifier's predictions and the actual values of the samples, so as to obtain the trusted model for each class and thereby the prediction result.
Step 2-4, training the N multi-label classification models with the One-VS-the-Rest strategy, using multi-label SVM, decision tree, logistic regression and random forest algorithms.
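Steps 2-2 and 2-4 can be sketched as below: sampling with replacement to build N training sets, and splitting the multi-label target matrix into one binary target per disease in the One-VS-the-Rest manner. The function names and the fixed random seed are assumptions for illustration only.

```python
import random

def bootstrap_sets(samples, n_sets, seed=0):
    """Step 2-2 sketch: draw n_sets training sets, each sampled with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in samples] for _ in range(n_sets)]

def one_vs_rest_targets(Y):
    """Step 2-4 sketch: turn an (n_samples x n_labels) 0/1 matrix into one
    binary target list per label."""
    return [list(column) for column in zip(*Y)]

# Sample indices stand in for medical records.
sample_ids = list(range(10))
training_sets = bootstrap_sets(sample_ids, n_sets=4)

# Two records, three disease labels.
Y = [[1, 0, 1],
     [0, 0, 1]]
per_label_targets = one_vs_rest_targets(Y)
```

Each of the N training sets would then be paired with one of the algorithms (SVM, decision tree, logistic regression, random forest), and one binary classifier would be fitted per entry of `per_label_targets`.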
Further, in embodiment 1, the training of the model and the prediction in step 3 include:
Step 3-1, training each of the models on the pre-divided training sets.
Step 3-2, predicting on the test set with each multi-label classification model to obtain N prediction results. For each test sample, the N prediction results are converted into multidimensional row vectors and compared with the real results to obtain a trusted model covering multiple categories.
Further, in embodiment 1, the predicting and model integrating the test set in step 4 specifically includes:
Step 4-1, predicting on the test set with the trained models and integrating the models. For each medical record in the test set, the prediction is first decomposed into multiple binary classification problems by the One-VS-the-Rest strategy, each problem corresponding to one classifier, and the prediction result of the medical record under each classifier is obtained.
Step 4-2, comparing the prediction result of each classifier with the true values of the test set to obtain the accuracy, recall and F1 value of the classifier, and calculating the error of the classifier.
Step 4-3, for each class, defining a classifier whose error is smaller than a certain threshold as a trusted model of that class. If a class is predicted for a test sample by one of its trusted models, that class is added to the final prediction result for that sample.
Step 4-4, finally, summarizing the final prediction results of all samples to obtain the prediction result for the whole test set, and calculating the accuracy, recall and F1 value of the prediction result to evaluate the performance of the whole model.
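The evaluation quantities of steps 4-2 and 4-4 can be computed for one binary classifier as in the sketch below. Because the F1 value is defined from precision and recall, the sketch reports precision in addition to accuracy; treating the disclosure's "accuracy" as both quantities is an assumption made here, and the label vectors are toy values.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "error": 1.0 - accuracy}

# Toy true labels and predictions for one disease class.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
```

For rare classes such as those with only a few cases, accuracy alone can look high even when recall is poor, which is one reason to report recall and F1 as well.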
With this method, a large amount of hospital medical record data can be used effectively to construct a reliable multi-label disease classification model, and accurate disease classification predictions can be made for the medical records in the test set.
Example 2
Referring to fig. 2, the present invention provides a model fusion algorithm for step 2-3 of example 1, which is used to construct an algorithm model, comprising:
step 1, given a data set D containing n samples, sampling it to generate a data set D';
step 2, each time randomly selecting one sample from D, copying it into D', and then putting it back into the initial data set D so that it may still be drawn in the next sampling; repeating this process k times, with k = n, yields a data set D' containing n samples.
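Clearly, some of the samples in D appear repeatedly in D', while others do not appear at all. The probability that a given sample is never drawn in the k sampling rounds is $\left(1 - \frac{1}{n}\right)^{k}$. With $k = n$, taking the limit gives $\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.368$. Thus, with self-sampling (bootstrap sampling), about 36.8% of the samples in the original data set D do not appear in the sampled data set D'.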
D' can then be used as the training set and D\D' as the test set, so that both the model actually evaluated and the model one wishes to evaluate are trained on n samples.
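The 36.8% figure can be checked numerically. The sketch below evaluates the probability that a given sample is never drawn when n draws with replacement are made from n samples; the function name is hypothetical, and 1880 is used because it is the sample count of Example 1.

```python
import math

def prob_never_drawn(n, k=None):
    """Probability that one given sample is missed in k draws with
    replacement from n samples (k defaults to n)."""
    k = n if k is None else k
    return (1.0 - 1.0 / n) ** k

small = prob_never_drawn(10)      # already close to 1/e for small n
large = prob_never_drawn(1880)    # sample count used in Example 1
limit = 1.0 / math.e
```

For n = 1880 the value differs from 1/e by well under 0.001, so roughly 690 of the 1880 records would be expected to fall outside any one bootstrap sample.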
Step 3, after sampling is completed, training over multiple rounds with different algorithms to obtain a sequence of prediction functions, the final prediction function fusing their results for the classification problem. By fusing multiple classification models in this way, the probability of mispredicting a new sample is lower than with a single model.
Further, in embodiment 2, step 3 first applies the "self-sampling" (bootstrap) method and then applies the model fusion algorithm to the SVM, decision tree, logistic regression and random forest classifiers, iterating over them to generate a trusted model for each class, after which the final model result is selected.
The specific process is as follows:
step 4-1, inputting a sample training data set;
step 4-2, training multi-label classification models of N different algorithms;
step 4-3, comparing 0/1 multidimensional row vectors of the N model prediction results with real results respectively in all test samples to obtain a trusted model comprising a plurality of categories;
step 4-4, for the test set, if a category is predicted by one of its trusted models, adding that disease to the final prediction result, and finally comparing the accuracy, recall and F1 value of the models.
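Steps 4-3 and 4-4 of the fusion procedure may be sketched as follows. The error measure (fraction of samples on which a model's 0/1 prediction for a class is wrong), the threshold value and the model names are assumptions for illustration; the disclosure states only that a classifier whose error is below a threshold is trusted for that class.

```python
def class_errors(Y_true, Y_pred):
    """Per-class error rate of one model's 0/1 predictions."""
    n = len(Y_true)
    n_classes = len(Y_true[0])
    return [sum(1 for t, p in zip(Y_true, Y_pred) if t[c] != p[c]) / n
            for c in range(n_classes)]

def trusted_models(Y_true, predictions, threshold):
    """For each class, list the models whose error is below the threshold.
    predictions maps a model name to its 0/1 prediction matrix."""
    errors = {name: class_errors(Y_true, Y_pred)
              for name, Y_pred in predictions.items()}
    n_classes = len(Y_true[0])
    return [[name for name in predictions if errors[name][c] < threshold]
            for c in range(n_classes)]

def fuse(sample_preds, trusted):
    """A class enters the final result if any model trusted for it predicts it.
    sample_preds maps a model name to its 0/1 row for one sample."""
    return [1 if any(sample_preds[m][c] == 1 for m in trusted[c]) else 0
            for c in range(len(trusted))]

# Toy data: four samples, two disease classes, two hypothetical models.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
predictions = {
    "svm":  [[1, 0], [0, 0], [1, 0], [0, 0]],   # reliable on class 0 only
    "tree": [[0, 0], [0, 1], [1, 1], [1, 0]],   # reliable on class 1 only
}
trusted = trusted_models(Y_true, predictions, threshold=0.25)
final = fuse({"svm": [1, 0], "tree": [0, 1]}, trusted)
```

In this toy case each model is trusted for exactly one class, and the fused prediction combines the class each model handles reliably.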
In this technical scheme, multi-label classification models of N distinct algorithms are trained with the One-VS-the-Rest strategy and then fused to improve prediction performance; the data are represented by 138-dimensional feature vectors and 35-dimensional target vectors, which better reflect the characteristics of each sample. The disease prediction method based on multi-label classification features high accuracy, high recall and a high F1 value, and is suitable for predicting multiple diseases such as thyroid diseases.
The invention classifies on the basis of 138 thyroid-related symptoms and 35 corresponding diseases, covers a more comprehensive and finer range of diseases, and can classify diseases more accurately. The model adopts a multi-label classification model from machine learning, so that each medical record can carry several disease labels, which better matches actual clinical situations. Meanwhile, the model adopts algorithm optimization based on heterogeneous ensemble learning, and classification accuracy and efficiency are improved through means such as the model fusion strategy and data preprocessing.
Example 3
A medical record-based thyroid disease dataset classification device comprising a processor, a memory, and a construction program stored on the memory and executable on the processor, wherein the construction program, when executed by the processor, implements the steps of the construction method of embodiment 1.
Example 4
A computer-readable storage medium having stored thereon a construction program which, when executed by a processor, implements the steps of the construction method described in embodiment 1.
The invention is not limited to the above embodiments, and based on the technical solution disclosed in the invention, a person skilled in the art may make some substitutions and modifications to some technical features thereof without creative effort according to the technical content disclosed, and all the substitutions and modifications are within the protection scope of the invention.
Claims (10)
1. The method for constructing the thyroid disease dataset classification model based on medical records is characterized by comprising the following steps of:
step 1) collecting medical record data of a certain number of thyroid disease patients;
step 2) extracting features from medical record contents, current medical history and physical examination results to form feature vectors to represent each medical record;
step 3) data preprocessing, including data cleaning, missing value processing and feature normalization;
step 4) taking the preprocessed feature vectors as input, using a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm and a random forest algorithm as base classifiers, building an ensemble model, and then comparing the ensemble side by side with the results of the individual classifiers that make it up;
step 5) training and evaluating each classification model using the training set and the test set respectively, and selecting the model with the highest prediction accuracy.
2. The method for constructing a medical record-based thyroid disease dataset classification model according to claim 1, wherein the feature extraction in step 2) may specifically include:
step 2-1) extracting key features related to thyroid diseases from the medical record content by keyword matching;
step 2-2) converting the medical record content into numerical features with a bag-of-words model, using natural language processing technology;
step 2-3) combining the history of the disease with the physical examination results to extract numerical features associated with thyroid disease.
3. The method for constructing a medical record-based thyroid disease dataset classification model according to claim 1 or 2, wherein the data preprocessing in step 3 specifically comprises:
step 3-1) data cleaning, including processing missing values, outliers and noise data;
step 3-2) feature normalization, scaling values of different features to a uniform range using a normalization method;
step 3-3) feature selection, wherein feature subsets which have remarkable correlation with thyroid disease classification, have a certain degree of correlation between features and thyroid disease classification results and have high contribution to thyroid disease classification are selected according to correlation analysis, feature importance evaluation and a model-based feature selection method.
4. The method for constructing a medical record-based thyroid disease dataset classification model according to claim 1 or 2, wherein constructing a plurality of classification models in step 4) specifically comprises:
step 4-1) using a support vector machine algorithm to perform model training by adjusting a kernel function and regularization parameters;
step 4-2) using a decision tree algorithm to perform model training by selecting the optimal splitting feature and the depth of the tree;
step 4-3) performing model training through maximum likelihood estimation by using a logistic regression algorithm;
step 4-4) performing model training by constructing a plurality of decision trees and voting decisions by using a random forest algorithm.
5. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-1) specifically comprises:
step 4-1-1) selecting a Gaussian kernel function and regularization parameters;
step 4-1-2) searching, based on the training data set and the maximum-margin classification principle, for a hyperplane that separates the data set into different categories;
step 4-1-3) determining the position and shape of the hyperplane by a convex optimization method, so that different categories can be distinguished;
step 4-1-4) establishing a support vector machine classification model according to the feature vectors and the class labels of the training data set.
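The role of the Gaussian kernel in steps 4-1-1) to 4-1-4) can be illustrated with the kernel function and the standard kernel decision function. This sketch does not train anything: the support vectors, coefficients and bias are hand-picked assumptions, and `gamma` is an assumed name for the kernel width parameter. In practice a convex (quadratic programming) solver would find these values from the training set.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); gamma is an assumed width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + b; the sign of f(x) gives the class."""
    return sum(a * y * gaussian_kernel(sv, x, gamma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

def classify(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """Return +1 or -1 depending on which side of the separating surface x falls."""
    value = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if value >= 0 else -1

# Hand-picked values for illustration only. A trained model would obtain the
# alphas and the bias from the optimizer, with the regularization parameter C
# bounding each coefficient (0 <= alpha_i <= C).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
sv_labels = [-1, 1]
alphas = [1.0, 1.0]
bias = 0.0
```

A point near `(2.0, 2.0)` is assigned to class +1 and a point near `(0.0, 0.0)` to class -1, because the kernel value decays with squared distance from each support vector.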
6. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-2) specifically comprises:
step 4-2-1) constructing a decision tree model based on the characteristics and the category labels of the training data set;
step 4-2-2) selecting a feature which can most effectively distinguish different thyroid categories as an optimal splitting feature, so that different categories can be distinguished best on the value of the feature;
step 4-2-3) recursively performing the splitting operation on the data set until a predetermined maximum depth of the tree is reached;
step 4-2-4) forming a model capable of carrying out classification prediction according to the input feature vector through the learning process of the decision tree.
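Steps 4-2-1) to 4-2-4) can be sketched as a small recursive tree builder. Gini impurity is assumed here as the measure of how well a feature separates the categories; the source does not name a splitting criterion. The feature columns (TSH, FT4), the class names and the data are made up.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Split recursively until no split improves purity or max_depth is reached."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, l) for r, l in zip(X, y) if r[f] <= t]
    right = [(r, l) for r, l in zip(X, y) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))

def predict(tree, x):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

# Toy data (made-up): columns are (TSH, FT4); labels are hypothetical classes.
X = [(1.0, 1.2), (2.0, 1.1), (6.0, 0.7), (8.0, 0.6)]
y = ["normal", "normal", "hypothyroid", "hypothyroid"]
tree = build_tree(X, y, max_depth=2)
```

Internal nodes are stored as tuples and leaves as plain labels, so this representation assumes class labels are never tuples themselves.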
7. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-3) specifically comprises:
step 4-3-1) modeling the relationship between the feature vector and the class label as a probability distribution;
step 4-3-2) estimating the model parameters from the training data set by maximum likelihood estimation, so that the probabilities predicted by the model agree as closely as possible with the actually observed values;
step 4-3-3) iteratively updating the model parameters by a gradient descent method until the change in the model parameters or the change in the loss function is no longer significant;
step 4-3-4) establishing a logistic regression classification model that can predict the corresponding class probabilities from the input feature vectors.
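Steps 4-3-1) to 4-3-4) correspond to the following sketch of binary logistic regression fitted by batch gradient descent on the negative log-likelihood. The learning rate, tolerance, iteration cap and data are assumptions chosen for illustration; the source states only that iteration stops when the parameters or the loss no longer change significantly.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(y = 1 | x) under the logistic model."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def log_loss(weights, bias, X, y):
    """Mean negative log-likelihood; minimizing it maximizes the likelihood."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, t in zip(X, y):
        p = predict_proba(weights, bias, x)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(y)

def fit(X, y, lr=0.5, tol=1e-6, max_iter=10000):
    """Batch gradient descent; stops when the loss changes by less than tol."""
    weights, bias = [0.0] * len(X[0]), 0.0
    prev = log_loss(weights, bias, X, y)
    for _ in range(max_iter):
        # Prediction error per sample drives the gradient of the log-loss.
        errors = [predict_proba(weights, bias, x) - t for x, t in zip(X, y)]
        for j in range(len(weights)):
            grad_j = sum(e * x[j] for e, x in zip(errors, X)) / len(y)
            weights[j] -= lr * grad_j
        bias -= lr * sum(errors) / len(y)
        loss = log_loss(weights, bias, X, y)
        if abs(prev - loss) < tol:
            break
        prev = loss
    return weights, bias

# Toy, already-normalized data (made-up values): one feature, binary label.
X = [(0.0,), (0.2,), (0.8,), (1.0,)]
y = [0, 0, 1, 1]
weights, bias = fit(X, y)
```

Calling `predict_proba(weights, bias, x)` then returns the estimated probability of class 1 for a new feature vector `x`.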
8. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-4) specifically comprises:
step 4-4-1) based on the training data set, randomly selecting a feature subset and a sample subset to construct a plurality of decision tree models;
step 4-4-2) for the new input samples, each decision tree performs independent classification prediction;
step 4-4-3) combining the classification results of the decision trees by averaging to obtain the final model prediction result.
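Steps 4-4-1) to 4-4-3) can be sketched with a very small forest. To keep the example short, each tree is a depth-one decision stump, which is a simplification and not something the source specifies. Redrawing bootstrap samples that contain only one class is a further simplification added here. The parameter names, seed and data are assumptions.

```python
import random

def fit_stump(X, y, features):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            for polarity in (1, 0):
                # polarity 1 predicts class 1 when value > t, polarity 0 the reverse
                errors = sum(((row[f] > t) == bool(polarity)) != bool(lab)
                             for row, lab in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]

def stump_predict(stump, x):
    """Apply one stump to a feature vector and return 0 or 1."""
    f, t, polarity = stump
    return int((x[f] > t) == bool(polarity))

def fit_forest(X, y, n_trees=15, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        while True:
            idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
            if len({y[i] for i in idx}) > 1:  # simplification: redraw single-class samples
                break
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def forest_predict_proba(forest, x):
    """Average the 0/1 predictions of all trees (fraction voting for class 1)."""
    return sum(stump_predict(s, x) for s in forest) / len(forest)

def forest_predict(forest, x):
    """Final label: class 1 if at least half of the trees predict class 1."""
    return int(forest_predict_proba(forest, x) >= 0.5)

# Toy data (made-up): columns are (TSH, FT4); 1 = disease, 0 = normal.
X = [(1.0, 1.3), (2.0, 1.2), (1.5, 1.1), (6.0, 0.7), (8.0, 0.6), (7.0, 0.8)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
```

Averaging 0/1 predictions and thresholding at 0.5 is equivalent to a majority vote for binary labels, which reconciles the "voting" wording of step 4-4) with the "averaging" wording of step 4-4-3).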
9. A medical record-based thyroid disease dataset classification device, comprising a processor, a memory and a construction program stored on the memory and executable on the processor, the construction program, when executed by the processor, implementing the steps of the construction method of any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that it has stored thereon a construction program which, when executed by a processor, implements the steps of the construction method according to any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310648204.5A CN116741393A (en) | 2023-06-02 | 2023-06-02 | Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310648204.5A CN116741393A (en) | 2023-06-02 | 2023-06-02 | Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116741393A (en) | 2023-09-12 |
Family
ID=87914394
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310648204.5A Pending CN116741393A (en) | 2023-06-02 | 2023-06-02 | Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116741393A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117743957A (en) * | 2024-02-06 | 2024-03-22 | 北京大学第三医院(北京大学第三临床医学院) | Data sorting method and related equipment of Th2A cells based on machine learning |
CN118136247A (en) * | 2024-02-26 | 2024-06-04 | 中国人民解放军总医院第一医学中心 | Method and system for evaluating cognitive function of chronic low-perfusion cerebrovascular patient group |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117743957A (en) * | 2024-02-06 | 2024-03-22 | 北京大学第三医院(北京大学第三临床医学院) | Data sorting method and related equipment of Th2A cells based on machine learning |
CN117743957B (en) * | 2024-02-06 | 2024-05-07 | 北京大学第三医院(北京大学第三临床医学院) | Data sorting method and related equipment of Th2A cells based on machine learning |
CN118136247A (en) * | 2024-02-26 | 2024-06-04 | 中国人民解放军总医院第一医学中心 | Method and system for evaluating cognitive function of chronic low-perfusion cerebrovascular patient group |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Amirgaliyev et al. | Analysis of chronic kidney disease dataset by applying machine learning methods | |
Zheng et al. | Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms | |
JP2022538866A (en) | System and method for image preprocessing | |
Wakili et al. | Classification of breast cancer histopathological images using DenseNet and transfer learning | |
CN116741393A (en) | Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium | |
Dong et al. | Cervical cell classification based on the CART feature selection algorithm | |
Thilagaraj et al. | [Retracted] Classification of Breast Cancer Images by Implementing Improved DCNN with Artificial Fish School Model | |
Bajcsi et al. | Towards feature selection for digital mammogram classification | |
Folorunso et al. | EfficientNets transfer learning strategies for histopathological breast cancer image analysis | |
Salman et al. | Gene expression analysis via spatial clustering and evaluation indexing | |
Roselin et al. | Fuzzy-rough feature selection for mammogram classification | |
Mohapatra et al. | Automated invasive cervical cancer disease detection at early stage through deep learning | |
Monfared et al. | Assessing classical and evolutionary preprocessing approaches for breast cancer diagnosis | |
Patel et al. | Two-Stage Feature Selection Method Created for 20 Neurons Artificial Neural Networks for Automatic Breast Cancer Detection | |
Laishram et al. | An optimized ensemble classifier for mammographic mass classification | |
Cripsy et al. | Lung Cancer Disease Prediction and Classification based on Feature Selection method using Bayesian Network, Logistic Regression, J48, Random Forest, and Naïve Bayes Algorithms | |
Ponraj et al. | Deep learning with histogram of oriented gradients-based computer-aided diagnosis for breast cancer detection and classification | |
Latif et al. | Improving Thyroid Disorder Diagnosis via Ensemble Stacking and Bidirectional Feature Selection. | |
Boumaraf et al. | Conventional Machine Learning versus Deep Learning for Magnification Dependent Histopathological Breast Cancer Image Classification: A Comparative Study with Visual Explanation. Diagnostics, 2021; 11 (3): 528 | |
Kadhim et al. | Ensemble Model for Prostate Cancer Detection Using MRI Images | |
Safia Naveed | Prediction of breast cancer through Random Forest | |
Dutta et al. | Cross-validated AdaBoost classifier used for brain tumor detection | |
Kasthuri et al. | AI‐Driven Healthcare Analysis | |
Kaur et al. | Prediction of Liver Disorders Using Simple Logistic Technique of Machine Learning | |
Sumithra et al. | Optimizing Brain Tumor Recognition with Ensemble support Vector-based Local Coati Algorithm and CNN Feature Extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||