CN114999628B

CN114999628B - Method for searching for obvious characteristic of degenerative knee osteoarthritis by using machine learning

Info

Publication number: CN114999628B
Application number: CN202210445525.0A
Authority: CN
Inventors: 张佳; 张子龙; 龙锦益
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2022-04-26
Filing date: 2022-04-26
Publication date: 2023-06-02
Anticipated expiration: 2042-04-26
Also published as: CN114999628A

Abstract

The invention discloses a method for finding the significant features of degenerative knee osteoarthritis using machine learning, which relates to the technical field of intelligent medical treatment and comprises the following steps: S1, acquiring traditional Chinese medicine and Western medicine information of a clinical patient to be diagnosed, preprocessing the information, and constructing a knee osteoarthritis feature data set; S2, training an autoencoder to learn the risk features of knee osteoarthritis by exploiting the autoencoder's feature dimensionality reduction capability; S3, ranking the features of the knee osteoarthritis feature data set using 6 existing feature selection algorithms; S4, training a model using an SVM classifier; S5, extracting the features that appear with high frequency in the results of the 6 algorithms, and comparing the effect of the autoencoder and the traditional feature selection methods on the selection of risk factors. The risk factors screened by the invention can provide a scientific and reliable reference for the diagnosis of knee osteoarthritis in traditional Chinese medicine, and allow a more accurate and reliable disease identification model to be constructed.

Description

Method for searching for obvious characteristic of degenerative knee osteoarthritis by using machine learning

Technical Field

The invention relates to the technical field of intelligent medical treatment, and in particular to a method for finding the significant features of degenerative knee osteoarthritis using machine learning.

Background

Degenerative knee osteoarthritis is one of the diseases in which traditional Chinese medicine (TCM) orthopaedics has a recognized advantage. Over the long history of TCM practice, generations of physicians have accumulated abundant clinical diagnostic experience and formed a complete diagnostic system unique to China, namely the four diagnostic methods (inspection, listening and smelling, inquiry, and palpation) and syndrome differentiation. The distinctive diagnostic methods of TCM and its understanding of the state of human vital activity have played an important clinical role since ancient times, have been continuously enriched and developed, and have had some influence on medicine abroad. Owing to historical limitations, however, TCM diagnostic methods are somewhat subjective. For example, tongue diagnosis and pulse diagnosis are unique to TCM and have important diagnostic value, but they rely on the practitioner's experience and on the subjective impressions of the eyes and fingers, and lack objective indices as standards for judging tongue and pulse manifestations. Clarifying the value of tongue and pulse diagnosis, and making them objective and practical, is therefore needed for the development of TCM. Accordingly, with the transformation of modern medical models, artificial intelligence technology is used to study a method for finding the significant features of degenerative knee osteoarthritis, so as to verify the scientific basis and feasibility of TCM diagnosis, construct a more accurate and reliable disease identification model, bring the advantages of artificial intelligence technology into play, and promote the joint development of the interdisciplinary fields.

The purpose of data normalization is to eliminate differences in scale between features, which makes the weights easier to learn. In machine learning, different evaluation indices (that is, the different features in a feature vector) often have different dimensions and units, which can distort the results of data analysis. To eliminate this effect, the data must be standardized so that the indices become comparable. After standardization, all indices are of the same order of magnitude and are suitable for comprehensive comparison and evaluation. The most typical form of such processing is data normalization, which is required when the values in the knee osteoarthritis data set differ too greatly in scale.
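As a minimal illustration of the normalization described above, the following sketch rescales one continuous feature to the range [0, 1]. It assumes min-max scaling; the function name `min_max_normalize` and the sample values are hypothetical and are not taken from the patent.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical example: patient ages in years.
ages = [40, 55, 70]
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.5, 1.0]
```

After scaling, a feature measured in years and a feature measured in, say, milligrams per litre both lie in the same interval, so neither dominates purely because of its unit.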

The autoencoder is an unsupervised learning algorithm whose output reproduces its input data. The concept, first proposed by Rumelhart et al., is a data compression algorithm that uses an encoder to compress data and a decoder to decompress it. The encoding stage maps high-dimensional data to low-dimensional data, reducing the amount of data; the decoding stage does the reverse, so as to reproduce the input data. In the course of its development, the autoencoder has been applied in fields such as image classification, face recognition, and natural language processing, with good results. In addition to feature dimensionality reduction, the new features learned by the autoencoder can be fed into a supervised learning model, so the autoencoder can also serve as a feature extractor. In this patent, it is used to extract the risk factors of knee osteoarthritis.

Feature selection is an important problem in feature engineering; its goal is to find the optimal feature subset. Feature selection can eliminate irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and reducing running time. It has received wide attention and application in pattern recognition, text classification, biological genetics, information retrieval, data analysis, and other fields. Specifically, the disease features of a clinical patient include both traditional Chinese medicine and Western medicine features and number in the hundreds, so the optimal feature subset must be found in order to identify the significant features. Feature selection is therefore introduced into the analysis of the significant features of degenerative knee osteoarthritis using artificial intelligence technology.

Disclosure of Invention

In view of the fact that degenerative knee osteoarthritis is a disease in which traditional Chinese medicine orthopaedics has an advantage, and that the sources of diagnostic information are diverse and subjective, the invention provides a method for finding the significant features of degenerative knee osteoarthritis using machine learning.

In order to achieve the above purpose, the present invention provides the following technical solutions: a method for searching for the significant characteristics of degenerative knee osteoarthritis by using machine learning, which comprises the following specific steps:

S1, acquiring the traditional Chinese medicine information (inspection, listening and smelling, inquiry, palpation, and the like) and the Western medicine information of a clinical patient, thoroughly preprocessing the collected information, and constructing a knee osteoarthritis feature data set;

S2, training an autoencoder to learn the risk features of knee osteoarthritis by exploiting the autoencoder's feature dimensionality reduction capability, so as to achieve the purpose of feature selection;

S3, ranking the features of the knee osteoarthritis feature data set using 6 existing feature selection algorithms, while preserving the physical meaning of the risk factors;

S4, training a model using an SVM classifier, evaluating classification performance as the size of the ranked feature subset increases from small to large, retaining for each of the 6 algorithms the feature subset with the best classification performance, and comparing the effect of the autoencoder with that of the traditional feature selection methods on the selection of risk features;

S5, extracting the features that appear with high frequency in the results of the 6 algorithms, so that the finally obtained features have better generalization and significance.

Further, the step S1 specifically includes:

S11, extracting the information of a clinical patient to be diagnosed from the early-stage identification table of knee osteoarthritis, entering it as case data, and writing the related data dictionary. Where the same field records different symptoms, each symptom is separated out as its own feature, with 0 meaning no and 1 meaning yes;

S12, constructing the knee osteoarthritis data set and labeling the knee osteoarthritis status of each patient in clinical treatment, where 0 means not diseased and 1 means diseased;

S13, removing rows and columns in which the proportion of missing values is too large;

S14, normalizing the continuous data using maximum normalization to reduce the differences between data and accelerate training, and deleting useless features so that only features of analytical significance are retained;

S15, splitting discrete features that have multiple states, so as to obtain the degree of influence of the different states of the same feature on osteoarthropathy.

Further, step S2 specifically includes:

S21, constructing and compiling a traditional three-layer autoencoder model, and setting the number of neurons of the intermediate hidden layer. In general, a conventional autoencoder mainly comprises an encoding stage and a decoding stage and is structurally symmetrical. The purpose of the autoencoder is to reconstruct the input data at the output layer; in the ideal case, the output signal y is identical to the input signal x. According to the structure shown in FIG. 1, the encoding and decoding processes of the conventional autoencoder can be described as:

The encoding process: $h_1 = \sigma_e(W_1 x + b_1)$ (1);

The decoding process: $y = \sigma_d(W_2 h_1 + b_2)$ (2);

where $W_1$ and $b_1$ are the encoding weights and biases, $W_2$ and $b_2$ are the decoding weights and biases, and $\sigma_e$ is the nonlinear activation function of the encoder, for which sigmoid, tanh, and ReLU are commonly used at present. $\sigma_d$ may be the same activation function as in the encoding process. The loss function of the autoencoder is therefore to minimize the error between y and x:

$J(W_1, b_1, W_2, b_2) = \lVert y - x \rVert^2$ (3)

S22, setting up five-fold cross-validation on the data set, and training the unsupervised autoencoder model using the training set as both the input signal x and the target output signal y;

S23, compressing the features of the training set using the trained encoder, and evaluating the effect of feature selection using an SVM classifier. The encoding stage may be regarded as a deterministic mapping that converts the input signal into a hidden-layer representation, while the decoding stage maps the hidden-layer representation back to the input signal as closely as possible. Besides the mean square error given by equation (3), the cross entropy may also be chosen as the loss function, specifically expressed as:

$J(W_1, b_1, W_2, b_2) = -\sum_{j}\left[x_j \log y_j + (1 - x_j)\log(1 - y_j)\right]$

where $x_j$ and $y_j$ denote the j-th components of x and y.

S24, repeating steps S21-S23 to obtain the SVM classification performance for different hidden-layer sizes, and saving the evaluation indices of the best-performing configuration.
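The encoding and decoding steps of equations (1) and (2) can be sketched in plain Python as follows. This is only a forward pass with hand-picked weights, not a trained model: the weight values, the 3-2-3 layer sizes, and the helper names (`affine`, `encode`, `decode`, `mse`, `cross_entropy`) are assumptions for illustration. The two loss helpers show one common form of the mean square error and of the cross entropy; the exact normalization used in the patent is not recoverable from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b for a weight matrix W (list of rows), vector v, and bias b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def encode(x, W1, b1):
    """Equation (1): h1 = sigma_e(W1 x + b1), here with sigma_e = sigmoid."""
    return [sigmoid(z) for z in affine(W1, x, b1)]


def decode(h1, W2, b2):
    """Equation (2): y = sigma_d(W2 h1 + b2), here with sigma_d = sigmoid."""
    return [sigmoid(z) for z in affine(W2, h1, b2)]


def mse(x, y):
    """Mean of the squared differences between input x and reconstruction y."""
    return sum((yj - xj) ** 2 for xj, yj in zip(x, y)) / len(x)


def cross_entropy(x, y, eps=1e-12):
    """Binary cross entropy between x (values in [0, 1]) and reconstruction y."""
    return -sum(
        xj * math.log(yj + eps) + (1 - xj) * math.log(1 - yj + eps)
        for xj, yj in zip(x, y)
    )


# Hypothetical 3-2-3 autoencoder: 3 normalized inputs, 2 hidden neurons.
x_in = [0.0, 1.0, 0.5]
W1, b1 = [[0.5, -0.5, 0.1], [0.3, 0.8, -0.2]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]

h1 = encode(x_in, W1, b1)    # compressed 2-dimensional representation
y_out = decode(h1, W2, b2)   # 3-dimensional reconstruction
print(len(h1), len(y_out), mse(x_in, y_out), cross_entropy(x_in, y_out))
```

Training would adjust `W1`, `b1`, `W2`, and `b2` to reduce the chosen loss; after training, only `encode` is needed to produce the compressed features that are passed to the classifier.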

Further, the step S3 specifically includes:

S31, applying each of the existing feature selection algorithms in turn to rank the features of the knee osteoarthritis feature data set by importance, and storing the indices of the features in the original data set in descending order of importance.
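The ranking step can be illustrated with a short sketch. The text available here does not name the 6 feature selection algorithms, so the scoring rule below (absolute Pearson correlation between each feature and the label) is a stand-in chosen purely for illustration; the function names and the toy data are likewise hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def rank_features(X, y):
    """Return the column indices of X in descending order of importance."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data: 4 patients, 3 features; labels 1 = diseased, 0 = not diseased.
X_toy = [[1, 0, 5], [1, 1, 3], [0, 0, 4], [0, 1, 6]]
y_toy = [1, 1, 0, 0]
ranking = rank_features(X_toy, y_toy)
print(ranking)  # [0, 2, 1]
```

Because only indices are stored, every ranked position still refers to a named column of the original data set, which is how the physical meaning of each risk factor is preserved.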

Further, the step S4 specifically includes:

S41, selecting the first 1 to N ranked features of the knee osteoarthritis data set (N being the maximum number of features in the data set), and extracting the corresponding features from the original data set by index to form a new data set for training;

S42, validating the algorithm by five-fold cross-validation: the preprocessed, normalized data are divided into training data and test data at a ratio of 4:1;

S43, training a model using an SVM classifier and predicting the disease status of the patient. Each feature selection algorithm is trained N times, and the size M of the best-performing feature subset and its accuracy index are saved across the N classification tests. The first M of the N ranked features are taken as the final selection result of that feature selection algorithm.
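Steps S41 to S43 amount to a sweep over subset sizes with five-fold cross-validation. The sketch below shows that loop under stated assumptions: the classifier is passed in as a callable, and the example uses a simple nearest-centroid rule as a stand-in, not an SVM, so that the code runs with the standard library alone. All function names and the toy data are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 1/k of the data."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def best_top_m(X, y, ranking, fit_predict, k=5):
    """Try the first 1..N ranked features; return (M, accuracy) of the best subset."""
    best_m, best_acc = 0, -1.0
    for m in range(1, len(ranking) + 1):
        cols = ranking[:m]
        correct = 0
        for train, test in k_fold_indices(len(X), k):
            X_tr = [[X[i][c] for c in cols] for i in train]
            y_tr = [y[i] for i in train]
            X_te = [[X[i][c] for c in cols] for i in test]
            preds = fit_predict(X_tr, y_tr, X_te)
            correct += sum(p == y[i] for p, i in zip(preds, test))
        acc = correct / len(X)
        if acc > best_acc:  # strict '>' keeps the smallest M on ties
            best_m, best_acc = m, acc
    return best_m, best_acc


def nearest_centroid(X_tr, y_tr, X_te):
    """Stand-in classifier (NOT an SVM): predict the class whose mean is closer."""
    centroids = {}
    for label in set(y_tr):
        pts = [x for x, t in zip(X_tr, y_tr) if t == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in X_te]


# Toy data: feature 0 determines the label, feature 1 is noise.
X_demo = [[0, 5], [1, 5], [0, 1], [1, 1], [0, 3], [1, 3], [0, 2], [1, 2], [0, 4], [1, 4]]
y_demo = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
best = best_top_m(X_demo, y_demo, [0, 1], nearest_centroid)
print(best)  # (1, 1.0)
```

In a real run, `fit_predict` would wrap an SVM from whatever library is in use, and the same sweep would be repeated once per feature selection algorithm.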

Further, step S5 specifically includes: combining the prediction results of the feature subsets of the various algorithms, and taking the features that occur with high frequency as the final result, so that the obtained features have better robustness and significance.
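The frequency-based combination of step S5 can be sketched as a simple vote count. The text does not state what counts as "high frequency", so the `min_votes` threshold, the function name, and the feature names below are assumptions for illustration only.

```python
from collections import Counter


def frequent_features(subsets, min_votes):
    """Return the features selected by at least min_votes of the algorithms."""
    votes = Counter(f for subset in subsets for f in set(subset))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical output of three feature selection algorithms.
subsets = [
    ["knee_pain", "age", "bmi"],
    ["knee_pain", "age"],
    ["knee_pain", "tongue_pale"],
]
final_features = frequent_features(subsets, min_votes=2)
print(final_features)  # ['age', 'knee_pain']
```

A feature that only one algorithm favours is dropped, which is the sense in which the surviving features are more robust than the output of any single algorithm.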

The invention has the technical effects and advantages that:

the invention can screen out the risk factors of the degenerative knee osteoarthritis, so that a more accurate and reliable disease identification model is constructed, and scientific reference is provided for the diagnosis of the knee osteoarthritis by traditional Chinese medicine.

Compared with the prior art, the invention can integrate the traditional Chinese medicine and western medicine characteristic information of the clinical patient, thereby obtaining more accurate and reliable characteristic analysis results.

In a word, the invention can provide accurate and reliable risk factors of the degenerative knee osteoarthritis and provide scientific and reliable basis for traditional Chinese medicine diagnosis.

Drawings

FIG. 1 is a conventional autoencoder network architecture;

FIG. 2 is a schematic diagram of a prior art feature selection using a machine learning method;

FIG. 3 is a flow chart of the present invention;

FIG. 4 is a traditional Chinese medical auxiliary diagnostic tool for osteoarthropathy;

FIG. 5 is a two-dimensional code diagram of an applet.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1-5 of the specification, the invention provides a method for searching for the significant characteristics of degenerative knee osteoarthritis by using machine learning, which comprises the following steps:

S1, collecting information on 5025 clinical patient cases, including Western medicine information such as blood tests and CT and traditional Chinese medicine information such as inspection, listening and smelling, inquiry, and palpation, preprocessing the collected information, and constructing a knee osteoarthritis feature data set with 254 features in total.

S11, extracting information of a clinical patient to be diagnosed from the early-stage identification table of the knee osteoarthritis, recording the information into case data, and writing a related data dictionary.

S12, constructing a knee osteoarthritis data set, and marking the knee osteoarthritis disease state of a patient in clinical treatment, wherein 0 is not diseased and 1 is diseased.

S13, removing rows and columns with a large proportion of missing values (30% in this example).

S14, normalizing the continuous data using maximum normalization to reduce the differences between data and accelerate training, and deleting useless features so that only features of analytical significance are retained.

S15, splitting discrete features that have multiple states, so as to obtain the degree of influence of the different states of the same feature on osteoarthropathy. After preprocessing, 3338 cases remained, and the knee osteoarthritis feature data set contained 178 features in total.
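The cleaning in S13 and the splitting in S15 can be sketched as follows. The sketch makes several assumptions that the text does not settle: missing values are represented as `None`, columns are filtered before rows, and the 30% figure is treated as an upper bound on the allowed share of missing values. The function names and the sample records are hypothetical.

```python
def drop_sparse(rows, max_missing=0.3):
    """Drop columns, then rows, whose share of missing (None) values exceeds max_missing."""
    columns = list(rows[0].keys())
    keep_cols = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing
    ]
    if not keep_cols:
        return []
    kept = []
    for r in rows:
        missing = sum(r[c] is None for c in keep_cols) / len(keep_cols)
        if missing <= max_missing:
            kept.append({c: r[c] for c in keep_cols})
    return kept


def split_discrete(rows, column):
    """Replace one multi-state column by one 0/1 column per observed state."""
    states = sorted({r[column] for r in rows if r[column] is not None})
    out = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for s in states:
            new[f"{column}={s}"] = 1 if r[column] == s else 0
        out.append(new)
    return out


# Hypothetical case records; "crp" is mostly missing and gets dropped.
records = [
    {"age": 61, "tongue_color": "pale", "crp": None, "label": 1},
    {"age": 47, "tongue_color": "red", "crp": None, "label": 0},
    {"age": 58, "tongue_color": "pale", "crp": 3.1, "label": 1},
]
cleaned = drop_sparse(records)
encoded = split_discrete(cleaned, "tongue_color")
print(encoded[0])
```

Splitting a multi-state field into separate 0/1 columns lets each state be ranked on its own, so a later step can report, for instance, that one tongue colour is a risk factor while another is not.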

S2, training an autoencoder to learn the risk features of knee osteoarthritis by exploiting the autoencoder's feature dimensionality reduction capability, so as to achieve the purpose of feature selection.

S21, constructing and compiling a traditional three-layer autoencoder model, and setting the number of neurons of the intermediate hidden layer. In general, a conventional autoencoder mainly comprises an encoding stage and a decoding stage and is structurally symmetrical. The purpose of the autoencoder is to reconstruct the input data at the output layer; in the ideal case, the output signal y is identical to the input signal x. According to the structure shown in FIG. 1, the encoding and decoding processes of the conventional autoencoder can be described as:

The encoding process: $h_1 = \sigma_e(W_1 x + b_1)$ (1)

The decoding process: $y = \sigma_d(W_2 h_1 + b_2)$ (2)

where $W_1$ and $b_1$ are the encoding weights and biases, $W_2$ and $b_2$ are the decoding weights and biases, and $\sigma_e$ is the nonlinear activation function of the encoder, for which sigmoid, tanh, and ReLU are commonly used at present. $\sigma_d$ may be the same activation function as in the encoding process. In this example, both the encoding and decoding activation functions are sigmoid, and the model is compiled with the RMSProp optimizer.

S22, setting up five-fold cross-validation on the data set, and training the unsupervised autoencoder model using the training set as both the input signal x and the target output signal y.

S23, compressing the features of the training set using the trained encoder, and evaluating the effect of feature selection using an SVM classifier. The encoding stage may be regarded as a deterministic mapping that converts the input signal into a hidden-layer representation, while the decoding stage maps the hidden-layer representation back to the input signal as closely as possible. In this example the cross entropy is chosen as the loss function, specifically expressed as:

$J(W_1, b_1, W_2, b_2) = -\sum_{j}\left[x_j \log y_j + (1 - x_j)\log(1 - y_j)\right]$

where $x_j$ and $y_j$ denote the j-th components of x and y.

S3, ranking the features of the knee osteoarthritis data set using each of 6 existing feature selection algorithms, as shown in Table 1.

S31, applying each of the existing feature selection algorithms in turn, sorting the features of the knee osteoarthritis feature data set in descending order of importance, and storing their indices in an array.

S4, evaluating the feature selection effect by combining with the SVM classifier, and respectively reserving feature subsets with the best classification results in the 6 algorithms.

S41, iterating over the first 1 to 178 ranked features (178 being the maximum number of features in the knee osteoarthritis data set), and extracting the corresponding features from the original data set by index to form a new data set for training.

S42, validating the algorithm by five-fold cross-validation: the preprocessed, normalized data are divided into training data and test data at a ratio of 4:1.

S43, predicting whether a patient is diseased using an SVM classifier, and saving, across the 178 classification runs, the size X of the best-performing feature subset together with each of its evaluation indices, where X can be regarded as the optimal dimension under that feature selection algorithm. The first X of the 178 ranked features are taken as the final selection result of the feature selection algorithm. Classification performance is evaluated using the 6 indices.

S44, repeating steps S41-S43 to obtain the feature subset results of the different feature selection algorithms, and comparing these results with those of the autoencoder, as shown in Table 2. As can be seen from Table 2, the autoencoder obtains the best result on every evaluation index, so its feature selection effect is the most satisfactory; the 6 conventional feature selection algorithms, although their classification performance is poorer, can select risk factors with actual physical meaning:

Table 2: Comparison of the performance indices of the autoencoder and the 6 feature selection algorithms under the SVM classifier
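The evaluation indices enumerated in the claims (accuracy, precision, recall, balanced F score, and AUC) can be computed as in the following minimal sketch. The function names and the toy labels and scores are hypothetical, and the values are not results from the patent. AUC is computed directly from its stated meaning: the probability that a positive case is scored above a negative case, with ties counted as one half.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auc_score(y_true, scores):
    """Share of (positive, negative) pairs in which the positive is scored higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 2 diseased and 2 non-diseased patients.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]           # hard predictions
y_score = [0.9, 0.4, 0.5, 0.1]  # classifier scores used for AUC
metrics = binary_metrics(y_true, y_pred)
auc = auc_score(y_true, y_score)
print(metrics, auc)
```

Here one diseased patient is missed, so recall drops to 0.5 while precision stays at 1.0, which illustrates why the indices are reported together rather than accuracy alone.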

S5, combining prediction results of feature subsets of various algorithms, so that the finally obtained features have better generalization and significance.

The prediction results of the feature subsets of the various algorithms are combined, and the features that occur with high frequency are taken as the final result. Normal physiological factors and factors unrelated to the disease, such as K&L grade 0, no history of myocardial infarction, and no family history of genetic disease, are removed from the feature subsets, so that the obtained risk factors have better robustness and significance. The finally extracted risk factors for degenerative knee osteoarthritis are shown in Table 3. As can be seen from Tables 3 and 4, after a round of feature selection and extraction, 11 risk factors with traditional Chinese medicine meaning are finally obtained, 5 of which are consistent with traditional Chinese medicine syndrome differentiation, which makes the result more interpretable. In theory, therefore, the risk factors obtained by the machine learning method can serve as a scientific reference for the diagnosis of knee osteoarthritis in traditional Chinese medicine and allow a more accurate and reliable disease identification model to be established, and the scientific basis and practicability of traditional Chinese medicine diagnosis are indirectly verified.

Table 3: risk factors for degenerative knee osteoarthritis

Table 4: traditional Chinese medicine differentiation type of knee osteoarthritis

Finally: the foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method for searching for the significant characteristics of degenerative knee osteoarthritis by using machine learning, which is characterized by comprising the following steps: the method comprises the following specific steps:

S1, acquiring traditional Chinese medicine and Western medicine information of a clinical patient to be diagnosed, preprocessing the collected information, and constructing a knee osteoarthritis feature data set;

S2, training an autoencoder to learn the risk features of knee osteoarthritis by exploiting the autoencoder's feature dimensionality reduction capability;

the step S2 specifically comprises the following steps:

S21, constructing and compiling a traditional three-layer autoencoder model, and setting the number of neurons of the intermediate hidden layer; the conventional autoencoder comprises an encoding stage and a decoding stage and is structurally symmetrical, and the encoding and decoding processes of the conventional autoencoder are described as follows:

The encoding process: $h_1 = \sigma_e(W_1 x + b_1)$ (1);

The decoding process: $y = \sigma_d(W_2 h_1 + b_2)$ (2);

where $W_1$ and $b_1$ are the encoding weights and biases, $W_2$ and $b_2$ are the decoding weights and biases, $\sigma_e$ is an activation function for nonlinear transformation, and $\sigma_d$ is the same activation function as in the encoding process:

S23, compressing the features of the training set using the trained encoder, and evaluating the effect of feature selection using an SVM classifier; the encoding stage can be regarded as a deterministic mapping that converts the input signal into a hidden-layer representation, while the decoding stage maps the hidden-layer representation back to the input signal; besides the mean square error given by equation (3), the cross entropy can also be chosen as the loss function, expressed in particular as:

S24, repeating steps S21-S23 to obtain the SVM classification performance with different hidden layers, and saving the performance evaluation indices;

S4, training a model using an SVM classifier, evaluating classification performance as the size of the ranked feature subset increases from small to large, retaining for each of the 6 algorithms the feature subset with good classification performance, and comparing the effect of the autoencoder and the traditional feature selection methods on the selection of risk features;

the step S4 specifically comprises the following steps:

S41, selecting the first 1 to N ranked features of the knee osteoarthritis data set, and extracting the corresponding features from the original data set by index to form a new data set for training;

S43, training a model using an SVM classifier and predicting the disease status of the patient, each feature selection algorithm being trained N times; saving, across the N classification tests, the size M of the best-performing feature subset and its accuracy index; taking the first M of the N ranked features as the final selection result of the feature selection algorithm; and evaluating classification performance using the following five indices:

A. Accuracy: the ratio of the number of correctly classified samples to the total number of samples, i.e. the probability of a correct prediction;

B. Precision: with respect to the predicted results, the proportion of samples predicted as positive that are truly positive;

C. Recall: with respect to the original samples, the proportion of positive samples that are correctly predicted;

D. Balanced F score: the harmonic mean of Precision and Recall;

E. AUC: defined as the area under the ROC curve; AUC is an evaluation index for measuring the quality of a binary classification model and represents the probability that a positive example is ranked ahead of a negative example;

S44, repeating steps S41-S43 to obtain the feature subset results of the different feature selection algorithms, and comparing the effect of the autoencoder and the traditional feature selection methods on the selection of risk features;

S5, extracting the features that appear with high frequency in the results of the 6 algorithms.

2. A method for finding a salient feature of degenerative knee osteoarthritis using machine learning as claimed in claim 1, wherein: the step S1 specifically comprises the following steps:

S11, extracting the information of a clinical patient to be diagnosed from the early-stage identification table of knee osteoarthritis, entering it as case data, and writing the related data dictionary; where the same field records different symptoms, each symptom is separated out as its own feature, with 0 meaning no and 1 meaning yes;

S13, removing rows and columns with a large proportion of missing values;

S14, normalizing the continuous data using maximum normalization;

S15, splitting discrete features that have multiple states, so as to obtain the degree of influence of the different states of the same feature on osteoarthropathy.

3. A method for finding a salient feature of degenerative knee osteoarthritis using machine learning as claimed in claim 1, wherein: the step S3 specifically comprises the following steps: applying each of the existing feature selection algorithms in turn to rank the features of the knee osteoarthritis feature data set by importance, and storing the indices of the features in the original data set in descending order of importance.

4. A method for finding a salient feature of degenerative knee osteoarthritis using machine learning as claimed in claim 1, wherein: the step S5 specifically comprises the following steps: and combining the prediction results of the feature subsets of the various algorithms, and taking out the features which occur frequently as final results.