CN116741393A

CN116741393A - Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium

Info

Publication number: CN116741393A
Application number: CN202310648204.5A
Authority: CN
Inventors: 窦琪翔; 朱毅; 朱顺
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2023-06-02
Filing date: 2023-06-02
Publication date: 2023-09-12

Abstract

The invention discloses a method, in the field of medical equipment, for constructing a medical record-based thyroid disease dataset classification model, comprising the following steps: 1) collecting medical record data from a number of thyroid disease patients; 2) extracting features from the medical record content, current medical history and physical examination results to form a feature vector representing each medical record; 3) preprocessing the data, including data cleaning, missing-value handling and feature normalization; 4) taking the preprocessed feature vectors as input, using a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm and a random forest algorithm as base classifiers to build an ensemble model, and comparing the ensemble model side by side with the results of the individual classifiers that form it; 5) training and evaluating each classification model on the training set and the test set respectively, and selecting the model whose predictions are most accurate. The invention thereby improves the accuracy and reliability of disease classification.

Description

Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium

Technical Field

The invention relates to the field of medical equipment, in particular to a method for constructing a disease data set classification model.

Background

In modern medical systems, disease classification has been one of the major challenges in the medical field. In the medical process, accurate judgment of the disease condition is a precondition and key for treating patients, and with the continuous development of modern medical technology, the medical data volume is increased sharply, so that classification of medical disease data sets is more challenging. In the face of a large number of cases and complicated illness, errors and subjectivity may exist in the judgment result of doctors, so that the medical quality is affected.

In recent years, with the development of artificial intelligence and machine learning, the medical field has begun to use these techniques to improve the accuracy and efficiency of disease dataset classification. The rapid development of information technology brings new opportunities to the medical industry: such techniques are already widely used in drug discovery, medical imaging and disease classification, and machine learning is expected to become increasingly important in disease classification. Ensemble learning is an effective way to improve classifier accuracy and has been shown to outperform a single classifier in many fields. Machine learning techniques can learn patterns and rules from data and thereby provide physicians with more accurate disease classification results. Meanwhile, as medical data continue to accumulate and be digitized, the prospects for applying machine learning in the medical field keep widening.

Among the many diseases, thyroid diseases have relatively clear and regular characteristics, yet they have received little research of this kind. Heterogeneous ensemble learning is therefore well suited to thyroid disease classification.

In existing machine learning technology, multi-label classification is an important technique that can identify several possible diseases and rank them according to features such as the description of the illness and the physical examination content. Meanwhile, heterogeneous ensemble learning can effectively combine several different classification algorithms and thereby improve the accuracy and reliability of prediction.

Although the prior art provides some machine learning based methods for classifying disease datasets, problems remain. For example, existing methods can generally classify only a single disease and have difficulty with cases involving multiple diseases; in addition, existing methods usually predict with only one or a few classification algorithms, which makes it hard to improve classification accuracy and reliability. Using machine learning to assist disease classification in a way that addresses these limits can therefore effectively improve the accuracy and reliability of diagnostic classification.

Currently, automatic analysis of medical images has been widely used, such as computer-aided classification systems based on deep learning have achieved significant results in image classification, disease detection, lesion segmentation, and the like. However, in terms of medical disease classification, automated handling of disease classification is relatively more difficult because the content of medical records involves textual descriptions of doctors, whose expression forms are more ambiguous and diverse.

Some current research mainly adopts deep learning to solve this problem, for example by converting the text in medical records into vector representations and then classifying the records with a deep neural network. However, this approach requires a large amount of labeled data for training, places high demands on the training data because of the diversity and complexity of medical records, and may suffer from problems such as overfitting.

Disclosure of Invention

To address the problems described in the background, the invention treats each hospital medical record as one input sample, converts the multi-label classification problem into a plurality of binary classification problems through feature extraction and data preprocessing, and trains and fuses several machine learning algorithms to obtain a reliable disease classification model, thereby improving the accuracy and reliability of disease classification.

The purpose of the invention is realized in the following way: a method for constructing a thyroid disease data set classification model based on medical records comprises the following steps:

step 1) collecting medical record data of a certain number of thyroid disease patients;

step 2) extracting features from medical record contents, current medical history and physical examination results to form feature vectors to represent each medical record;

step 3) data preprocessing, including data cleaning, missing value processing and feature normalization;

step 4) taking the preprocessed feature vectors as input, using a support vector machine algorithm, a decision tree algorithm, a logistic regression algorithm and a random forest algorithm as base classifiers to establish an ensemble model, and then comparing the ensemble model side by side with the results of the individual classifiers that form it;

step 5) training and evaluating each classification model on the training set and the test set respectively, and selecting the model with the highest prediction accuracy.

As a further definition of the present invention, the feature extraction in step 2) may specifically include:

step 2-1), extracting key features related to thyroid diseases according to medical record content and a keyword matching method;

step 2-2) converting the medical record content into numerical features with a bag-of-words model, using natural language processing techniques;

step 2-3) combining the history of the disease with the physical examination results to extract numerical features associated with thyroid disease.
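Steps 2-1) and 2-2) can be illustrated with a minimal sketch. The keyword list, the vocabulary and the sample sentence below are hypothetical stand-ins chosen for illustration; the source does not disclose its actual keyword list or tokenizer.

```python
import re
from collections import Counter

# Hypothetical keyword list; the embodiment works with 138 thyroid-related symptoms.
SYMPTOM_KEYWORDS = ["palpitation", "hypodynamia", "hunger"]

def keyword_features(record_text, keywords=SYMPTOM_KEYWORDS):
    """Keyword matching: 1 if the keyword occurs in the record text, else 0."""
    text = record_text.lower()
    return [1 if kw in text else 0 for kw in keywords]

def bag_of_words(record_text, vocabulary):
    """Bag-of-words: count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", record_text.lower()))
    return [counts[word] for word in vocabulary]

# Hypothetical record text, for illustration only.
record = "Patient reports palpitation and hunger; hunger worsens at night"
matched = keyword_features(record)
counted = bag_of_words(record, ["hunger", "palpitation", "tremor"])
print(matched, counted)
```

Word order is discarded by the bag-of-words representation, which is why it pairs naturally with the fixed-length feature vectors used in the later steps.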

As a further definition of the present invention, the data preprocessing in step 3 specifically includes:

step 3-1) data cleaning, including processing missing values, outliers and noise data;

step 3-2) feature normalization, scaling values of different features to a uniform range using a normalization method;

step 3-3) feature selection: using correlation analysis, feature importance evaluation and model-based feature selection methods, selecting the feature subset that is significantly correlated with the thyroid disease classification result and contributes most to thyroid disease classification.
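As a rough sketch of steps 3-1) and 3-2), the following assumes mean imputation for missing values and min-max scaling to the range [0, 1]. Both choices are assumptions made for illustration; the source names the preprocessing stages but not the specific techniques.

```python
def impute_missing(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Scale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical numerical feature column with one missing value.
heart_rate = [60, None, 100, 80]
filled = impute_missing(heart_rate)
scaled = min_max_scale(filled)
print(filled, scaled)
```

Scaling matters most for the support vector machine and logistic regression base classifiers, whose results depend on feature magnitudes; tree-based classifiers are largely insensitive to it.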

As a further definition of the present invention, constructing a plurality of classification models in step 4 specifically includes:

step 4-1) using a support vector machine algorithm, training the model by adjusting the kernel function and regularization parameters;

step 4-2) using a decision tree algorithm to perform model training by selecting the optimal splitting feature and the depth of the tree;

step 4-3) performing model training through maximum likelihood estimation by using a logistic regression algorithm;

step 4-4) performing model training by constructing a plurality of decision trees and voting decisions by using a random forest algorithm.

As a further definition of the present invention, the step 4-1) specifically includes:

step 4-1-1) selecting a Gaussian kernel function and regularization parameters;

step 4-1-2) finding a separating hyperplane on the training data set according to the maximum-margin classification principle, so that the data set is divided into different categories;

step 4-1-3) determining the position and shape of the hyperplane by a convex optimization method, so that different categories can be distinguished;

step 4-1-4) establishing a support vector machine classification model according to the feature vectors and the class labels of the training data set.
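The role of the Gaussian kernel in step 4-1) can be sketched as follows. The support vectors, dual coefficients, bias and `gamma` value are hypothetical placeholders; in practice they would come out of the convex optimization in step 4-1-3), which is not reproduced here.

```python
import math

def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svm_decision(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Classify x by the sign of sum_i coef_i * K(sv_i, x) + bias,
    where coef_i = alpha_i * y_i for support vector sv_i."""
    score = sum(c * gaussian_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, dual_coefs))
    return 1 if score + bias >= 0 else -1

# Hypothetical trained model: one support vector per class.
svs = [[0.0, 0.0], [2.0, 2.0]]
coefs = [1.0, -1.0]
near_first = svm_decision([0.1, 0.1], svs, coefs, bias=0.0)
near_second = svm_decision([1.9, 1.9], svs, coefs, bias=0.0)
print(near_first, near_second)
```

A larger `gamma` makes the kernel more local, so the decision boundary follows individual training points more closely; the regularization parameter selected in step 4-1-1) limits how far that can go.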

As a further definition of the present invention, the step 4-2) specifically includes:

step 4-2-1) constructing a decision tree model based on the characteristics and the category labels of the training data set;

step 4-2-2) selecting a feature which can most effectively distinguish different thyroid categories as an optimal splitting feature, so that different categories can be distinguished best on the value of the feature;

step 4-2-3) recursively performing the splitting operation until a predetermined maximum depth of the tree is reached;

step 4-2-4) forming a model capable of carrying out classification prediction according to the input feature vector through the learning process of the decision tree.
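Selecting the optimal splitting feature in step 4-2-2) can be sketched for binary features and binary labels. Gini impurity is used as the split criterion here as an assumption; the source does not state which criterion it uses.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def best_split_feature(X, y):
    """Return (feature index, weighted Gini) of the binary feature whose
    0/1 split separates the labels best (lowest weighted impurity)."""
    n = len(y)
    best_feature, best_score = None, float("inf")
    for j in range(len(X[0])):
        left = [label for row, label in zip(X, y) if row[j] == 0]
        right = [label for row, label in zip(X, y) if row[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_feature, best_score = j, score
    return best_feature, best_score

# Hypothetical data: feature 0 matches the label exactly, feature 1 is uninformative.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
chosen = best_split_feature(X, y)
print(chosen)
```

A full tree would apply `best_split_feature` recursively to the two partitions until the predetermined maximum depth is reached.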

As a further definition of the present invention, the step 4-3) specifically includes:

step 4-3-1) modeling the relationship between the feature vector and the class label as a probability distribution;

step 4-3-2) estimating model parameters based on a training data set by a maximum likelihood estimation method, so that the probability of model prediction is the most consistent with the probability of an actual observation value;

step 4-3-3) iteratively updating the model parameters by a gradient descent method until the change in the model parameters or the change in the loss function is no longer significant;

step 4-3-4) establishing a logistic regression classification model that predicts the corresponding class probabilities from the input feature vector.
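Steps 4-3-1) to 4-3-4) can be sketched as batch gradient descent on the negative log-likelihood. The learning rate, iteration cap, tolerance and toy data are illustrative assumptions rather than values from the source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.5, max_iter=5000, tol=1e-6):
    """Maximum likelihood estimation by batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(y)
    for _ in range(max_iter):
        errors = [predict_proba(row, w, b) - label for row, label in zip(X, y)]
        grad_w = [sum(e * row[j] for e, row in zip(errors, X)) / n
                  for j in range(len(w))]
        grad_b = sum(errors) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
        # Stop once the parameter update is no longer significant.
        if lr * max(abs(g) for g in grad_w + [grad_b]) < tol:
            break
    return w, b

# Hypothetical one-dimensional feature; larger values indicate the positive class.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
print(predict_proba([0.0], w, b), predict_proba([3.0], w, b))
```

The stopping test mirrors step 4-3-3): iteration ends when the change in the parameters falls below a tolerance, or when the iteration cap is reached.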

As a further definition of the present invention, the step 4-4) specifically includes:

step 4-4-1) based on the training data set, randomly selecting a feature subset and a sample subset to construct a plurality of decision tree models;

step 4-4-2) for the new input samples, each decision tree performs independent classification prediction;

step 4-4-3) combining the classification results of each decision tree in an average manner to obtain a final model prediction result.
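The random subset selection of step 4-4-1) and the averaging of step 4-4-3) can be sketched as below. The number of sub-features and the 0.5 threshold are assumptions for illustration, and the individual decision trees are represented only by their 0/1 outputs.

```python
import random

def draw_subsets(n_samples, n_features, n_sub_features, rng):
    """For one tree: a bootstrap sample of row indices (with replacement)
    and a random subset of feature indices (without replacement)."""
    sample_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    feature_idx = sorted(rng.sample(range(n_features), n_sub_features))
    return sample_idx, feature_idx

def average_predictions(tree_outputs, threshold=0.5):
    """Average the 0/1 outputs of the trees and threshold the mean."""
    mean = sum(tree_outputs) / len(tree_outputs)
    return (1 if mean >= threshold else 0), mean

rng = random.Random(0)
sample_idx, feature_idx = draw_subsets(10, 138, 12, rng)
label, mean_vote = average_predictions([1, 1, 0])  # hypothetical outputs of three trees
print(label, mean_vote)
```

With 0/1 outputs and a 0.5 threshold, averaging gives the same decision as a majority vote among the trees.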

A thyroid disease data set classification device based on medical records comprises a processor, a memory and a construction program stored on the memory and capable of running on the processor, wherein the construction program realizes the steps of the construction method when being executed by the processor.

A computer-readable storage medium having stored thereon a construction program which, when executed by a processor, implements the steps of the construction method described above.

Compared with the prior art, the invention has the beneficial effects that:

the invention works with 138 thyroid-related symptoms and 35 corresponding diseases, covering a more comprehensive and finer-grained range of diseases and allowing them to be classified more accurately;

a multi-label classification model from machine learning is adopted, so that each medical record can carry several disease labels, which better matches actual clinical situations;

a heterogeneous ensemble learning method based on the Bagging algorithm is used, which effectively reduces overfitting and improves the generalization capability of the model;

by adopting a model fusion strategy and carrying out error analysis on the prediction results of different classifiers, the types of diseases can be predicted more accurately, and the accuracy of classification of the disease data set is improved;

by carrying out feature extraction and data preprocessing on a large amount of medical record data, the data can be better utilized to classify the disease data set, and the efficiency of classifying the disease data set is improved;

in conclusion, the invention has innovations in aspects of classification range, model selection, algorithm optimization, model fusion, data utilization and the like, can more accurately classify diseases, and improves the efficiency and accuracy of clinical work.

Drawings

FIG. 1 is a schematic flow chart of the present invention

FIG. 2 is a flow chart of the heterogeneous ensemble learning algorithm according to the present invention

Detailed Description

Example 1

The method for constructing the thyroid disease data set classification model based on medical records shown in fig. 1 comprises the following steps:

step 1, acquiring 1880 medical record sample data and preprocessing;

step 2, constructing an algorithm model;

step 3, training a model and predicting;

step 4, predicting the test set and integrating the model.

Further, in embodiment 1, the data collection and preprocessing in step 1 specifically includes:

step 1-1, collecting data: medical records of 1880 thyroid disease patients, including current medical history, physical examination content and doctor judgment results, were collected from hospitals.

Step 1-2, feature extraction: a feature extraction algorithm is used to extract, from all medical record samples, sequences of 138 thyroid-related symptoms (such as palpitation, hypodynamia, obesity and hunger) and 35 corresponding diseases (such as hyperthyroidism, hypothyroidism, thyroid nodule and papillary thyroid cancer), which serve as the features of the samples.

Step 1-3, data preprocessing: for each medical record, a 138-dimensional feature vector and its corresponding 35-dimensional target vector are generated from the symptom sequence and the disease sequence, respectively. If a symptom is present in the medical record, its entry is 1, otherwise 0. Likewise, if a disease is present in the medical record, its entry is 1, otherwise 0.
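The 0/1 encoding described in step 1-3 can be sketched as follows. The three-item lists are shortened, hypothetical stand-ins for the 138 symptoms and 35 diseases of the embodiment.

```python
# Shortened stand-ins for the 138-symptom and 35-disease sequences.
SYMPTOMS = ["palpitation", "hypodynamia", "hunger"]
DISEASES = ["hyperthyroidism", "hypothyroidism", "thyroid nodule"]

def multi_hot(present, vocabulary):
    """Return a 0/1 vector with 1 where the vocabulary item is present."""
    present = set(present)
    return [1 if item in present else 0 for item in vocabulary]

# One hypothetical medical record: two symptoms, one diagnosed disease.
x = multi_hot(["hypodynamia", "hunger"], SYMPTOMS)   # feature vector
y = multi_hot(["hyperthyroidism"], DISEASES)         # target vector
print(x, y)
```

Because the target vector may contain more than one 1, each record can carry several disease labels at once, which is what makes the task multi-label.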

The following table shows part of the sample distribution in one embodiment.

Disease name	Number of samples	Label
Inflammatory hyperthyroidism	878	0 or 1
Primary hyperthyroidism	567	0 or 1
Papillary thyroid carcinoma	3	0 or 1
Follicular carcinoma of thyroid gland	2	0 or 1
Medullary thyroid carcinoma	1	0 or 1
Thyroid benign tumor postoperative hypothyroidism	5	0 or 1
Non-toxic goiter	121	0 or 1
Hypothyroidism	234	0 or 1
Obesity	145	0 or 1

Further, in embodiment 1, the algorithm model is constructed in step 2, and the specific process includes:

Step 2-1, dividing the acquired 1880 medical record samples into a training set and a test set.

Step 2-2, adopting a heterogeneous ensemble learning method based on the Bagging algorithm: the training set is randomly sampled with replacement multiple times to construct N different training sets, and a multi-label classification model is trained on each training set, where N ≥ 2.

Step 2-3, adopting a model fusion strategy: the classifiers are screened according to the error between each classifier's predictions and the actual values of the samples, yielding the trusted models for each class, from which the prediction result is obtained.

Step 2-4, training the N multi-label classification models with the One-vs-the-Rest strategy, using multi-label SVM, decision tree, logistic regression and random forest algorithms.
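The One-vs-the-Rest strategy of step 2-4 can be sketched as training one binary classifier per disease label. `BestFeatureRule` is a deliberately trivial, hypothetical base learner used only to keep the example self-contained; in the embodiment its place would be taken by an SVM, decision tree, logistic regression or random forest.

```python
class BestFeatureRule:
    """Toy base learner: predicts the value of the single binary feature
    that agrees most often with the label on the training data."""
    def fit(self, X, y):
        agreement = [sum(1 for row, t in zip(X, y) if row[j] == t)
                     for j in range(len(X[0]))]
        self.feature = agreement.index(max(agreement))
        return self

    def predict(self, x):
        return x[self.feature]

def fit_one_vs_rest(X, Y, make_classifier):
    """Train one binary classifier per label column of the 0/1 target matrix Y."""
    return [make_classifier().fit(X, [row[k] for row in Y])
            for k in range(len(Y[0]))]

def predict_one_vs_rest(models, x):
    """Collect the binary decisions of all per-label classifiers."""
    return [m.predict(x) for m in models]

# Hypothetical data: label k happens to coincide with feature k.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]
models = fit_one_vs_rest(X, Y, BestFeatureRule)
print(predict_one_vs_rest(models, [1, 0]))
```

Each label is decided independently, so a record may receive zero, one or several disease labels.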

Further, in embodiment 1, the training of the model and the prediction in step 3 include:

Step 3-1, training each model on the pre-divided training sets.

Step 3-2, running each multi-label classification model on the test set to obtain N prediction results. For each test sample, the N prediction results are converted into multidimensional row vectors and compared with the true results to obtain the trusted models for the various categories.

Further, in embodiment 1, the predicting and model integrating the test set in step 4 specifically includes:

Step 4-1, predicting the test set with the trained models and integrating the models. For each medical record in the test set, the prediction is first decomposed into a plurality of binary classification problems through the One-vs-the-Rest strategy, each problem corresponding to one classifier, and the prediction result of the medical record under each classifier is obtained.

Step 4-2, comparing the prediction result of each classifier with the true values of the test set to obtain the accuracy, recall and F1 value of the classifier, and calculating the error of the classifier.

Step 4-3, for each class, defining a classifier whose error is smaller than a certain threshold as a trusted model of that class. If a class is predicted for a sample in the test set by a trusted model, that class is added to the final prediction result for that sample.

Step 4-4, finally, aggregating the final prediction results of all samples to obtain the prediction result for the whole test set, and calculating the accuracy, recall and F1 value of that result to evaluate the performance of the overall model.

By the method, a reliable multi-label disease classification model can be constructed by effectively utilizing a large amount of medical record data of a hospital, and accurate disease classification prediction can be carried out on medical records in a test set.
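The per-classifier metrics of step 4-2 can be computed as in the sketch below for a single disease label. Precision is included alongside recall and F1 because the F1 value is defined from precision and recall; the label vectors are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def error_rate(y_true, y_pred):
    """Fraction of samples on which prediction and true label disagree."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

# Hypothetical true labels and predictions for one disease over five records.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
err = error_rate(y_true, y_pred)
print(precision, recall, f1, err)
```

For rare diseases, such as those with only a handful of samples in the table above, the error rate alone can look small even when recall is poor, which is one reason to report recall and F1 as well.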

Example 2

Referring to fig. 2, the present invention provides a model fusion algorithm for step 2-3 of example 1, which is used to construct an algorithm model, comprising:

step 1, given a data set D containing n samples, a data set D' is generated from it by sampling;

step 2, each time, one sample is randomly selected from D and copied into D', and the sample is then put back into the initial data set D so that it can still be drawn in later samplings; this process is repeated n times to obtain a data set D' containing n samples.

Some samples of D appear repeatedly in D', while others do not appear at all. The probability that a given sample is never drawn in the n samplings is $\left(1 - \frac{1}{n}\right)^{n}$, and taking the limit gives

$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^{n} = \frac{1}{e} \approx 0.368.$$

That is, with self-sampling (bootstrap sampling), about 36.8% of the samples in the original data set D do not appear in the sampled data set D'.

D' may be used as the training set and D\D' as the test set, so that both the model actually evaluated and the model one wishes to evaluate are trained on n samples.
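The self-sampling procedure and the 36.8% figure can be checked numerically with a short sketch. The data set size of 1000 and the fixed random seed are arbitrary choices for illustration.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement (D'); indices never drawn form D \\ D'."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(train_idx)
    test_idx = [i for i in range(n) if i not in chosen]
    return train_idx, test_idx

def prob_never_drawn(n):
    """Probability that a given sample is missed in all n draws: (1 - 1/n)^n."""
    return (1 - 1 / n) ** n

rng = random.Random(0)
n = 1000
train_idx, test_idx = bootstrap_split(n, rng)
oob_fraction = len(test_idx) / n
print(oob_fraction, prob_never_drawn(n))
```

Both printed values should lie close to 1/e, and the left-out indices provide a test set that never overlaps the training draw.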

Step 3, after sampling is completed, multiple rounds of training with different algorithms yield a sequence of prediction functions, and the final prediction function fuses them for each classification problem. By fusing multiple classification models in this way, the probability of mispredicting a new sample is lower than with a single model.

Further, in embodiment 2, step 3 first applies the self-sampling method and then uses the model fusion algorithm to screen the SVM, decision tree, logistic regression and random forest classifiers, so that a trusted model is produced iteratively for each class, after which the final model result is selected.

The specific process is as follows:

step 4-1, inputting a sample training data set;

step 4-2, training multi-label classification models of N different algorithms;

step 4-3, over all test samples, comparing the 0/1 multidimensional row vectors of the N model prediction results with the true results to obtain the trusted models for the various categories;

step 4-4, for the test set, if a category is predicted by a trusted model of that category, adding the disease to the final prediction result, and finally comparing the accuracy, recall and F1 value of the models.
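Steps 4-3 and 4-4 can be sketched as follows. The error threshold of 0.2 and the rule that a label is added when any trusted model predicts it are assumptions made for illustration; the source only states that the error must be below "a certain threshold".

```python
def select_trusted_models(predictions, Y_true, max_error=0.2):
    """predictions[m][i][k] is the 0/1 output of model m for sample i, label k.
    For each label, return the indices of models whose error rate on that
    label is below max_error."""
    n_samples, n_labels = len(Y_true), len(Y_true[0])
    trusted = []
    for k in range(n_labels):
        ok = []
        for m, model_preds in enumerate(predictions):
            errors = sum(1 for i in range(n_samples)
                         if model_preds[i][k] != Y_true[i][k])
            if errors / n_samples < max_error:
                ok.append(m)
        trusted.append(ok)
    return trusted

def fuse(sample_predictions, trusted):
    """sample_predictions[m][k] is model m's output for one sample.
    A label is added if any trusted model for that label predicts it."""
    return [1 if any(sample_predictions[m][k] == 1 for m in trusted[k]) else 0
            for k in range(len(trusted))]

# Hypothetical outputs of N = 2 models on 4 samples with 2 labels.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
model_0 = [[1, 1], [0, 0], [1, 0], [0, 1]]   # reliable on label 0 only
model_1 = [[0, 0], [1, 1], [1, 1], [0, 0]]   # reliable on label 1 only
trusted = select_trusted_models([model_0, model_1], Y_true)
final = fuse([[1, 1], [0, 0]], trusted)
print(trusted, final)
```

In the usage above, model 0's vote for label 1 is ignored because model 0 is not trusted for that label, so only label 0 enters the final prediction.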

In this technical scheme, multi-label classification models of N different algorithms are trained with the One-vs-the-Rest strategy and then fused to improve predictive performance, while 138-dimensional feature vectors and 35-dimensional target vectors are used to represent the data so that the characteristics of each sample are better reflected. The disease prediction method based on multi-label classification is characterized by high accuracy, high recall and a high F1 value, and is suitable for predicting multiple diseases, such as thyroid diseases.

The invention works with 138 thyroid-related symptoms and 35 corresponding diseases, covering a more comprehensive and finer-grained range of diseases and allowing them to be classified more accurately. The model adopts a multi-label classification model from machine learning, so that each medical record can carry several disease labels, which better matches actual clinical situations. Meanwhile, the model is optimized on the basis of heterogeneous ensemble learning, and classification accuracy and efficiency are improved through the model fusion strategy and data preprocessing.

Example 3

A medical record-based thyroid disorder dataset classifying device comprising a processor, a memory, and a build program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the build method of embodiment 1.

Example 4

A computer-readable storage medium having stored thereon a construction program which, when executed by a processor, implements the steps of the construction method described in embodiment 1.

The invention is not limited to the above embodiments, and based on the technical solution disclosed in the invention, a person skilled in the art may make some substitutions and modifications to some technical features thereof without creative effort according to the technical content disclosed, and all the substitutions and modifications are within the protection scope of the invention.

Claims

1. The method for constructing the thyroid disease dataset classification model based on medical records is characterized by comprising the following steps of:

2. The method for constructing a medical record-based thyroid disease dataset classification model according to claim 1, wherein the feature extraction in step 2) may specifically include:

3. The method for constructing a medical record-based thyroid disease dataset classification model according to claim 1 or 2, wherein the data preprocessing in step 3 specifically comprises:

4. The method for constructing a classification model based on a thyroid disease dataset as claimed in claim 1 or 2, wherein constructing a plurality of classification models in step 4 specifically comprises:

5. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-1) specifically comprises:

step 4-1-1) selecting a Gaussian kernel function and regularization parameters;

6. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-2) specifically comprises:

7. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-3) specifically comprises:

8. The method for constructing a medical record-based thyroid disease dataset classification model as claimed in claim 4, wherein said step 4-4) specifically comprises:

9. A medical record based thyroid disorder dataset classification apparatus comprising a processor, a memory and a build program stored on the memory and executable on the processor, the build program when executed by the processor implementing the steps of the build method of any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that it has stored thereon a construction program which, when executed by a processor, implements the steps of the construction method according to any of claims 1 to 8.