CN114037011A

CN114037011A - Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample

Info

Publication number: CN114037011A
Application number: CN202111316442.3A
Authority: CN
Inventors: 卓力; 李艳萍; 孙亮亮; 张雷; 张菁; 李晓光; 张辉
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-11

Abstract

The invention discloses an automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling samples. By comparing the probability relationship between the predicted label and the manually annotated label and adopting two different screening strategies, the method achieves accurate, automatic identification and cleaning of noisily labeled tongue color data. The invention calls the manual annotation the hard label, the label prediction probability obtained by the model the soft label, and the label corresponding to the maximum prediction probability the pseudo label. Because a deep network model is used to predict sample labels and the noise samples are then identified and screened automatically, the result is more objective and accurate. In addition, no expert takes part in the process, so no manual labor is consumed, the possibility of human-introduced noise is reduced, and the accuracy of identifying noise labeling samples is improved. Because the data set is processed before model training, the processed data set can be used with other classification models.

Description

Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample

Technical Field

The invention belongs to the field of computer vision and traditional Chinese medicine diagnostics, and particularly relates to technologies such as computer image processing, deep learning and traditional Chinese medicine tongue diagnosis.

Background

Among the four diagnostic methods of traditional Chinese medicine, inspection comes first; as the classical saying goes, to know by observation is called divine. Tongue diagnosis is an important part of inspection. It is an examination method for understanding the physiological functions and pathological changes of the human body by observing changes in the tongue manifestation. The physician diagnoses disease by observing various manifestations of the tongue body and tongue coating, including color, thickness, texture, moisture, shape and state. Tongue color is the most intuitive and important diagnostic feature in traditional Chinese medicine diagnosis and treatment, and can be classified into 4 categories, such as pale red, magenta and purple.

When a computer is used for automatic analysis of tongue color in traditional Chinese medicine, the task is usually treated as a classification problem and solved with machine learning. Such methods learn the tongue color classification rule from sample data manually labeled by doctors, and then apply the learned model to new samples to classify tongue color automatically. However, limited by the doctor's knowledge level, way of thinking and diagnostic experience, and influenced by external objective factors such as lighting and temperature, the doctor's labels often contain errors. The labeled sample data therefore contains a certain amount of noise, which affects the training of the tongue color classification model and lowers the accuracy of tongue color classification.

The invention provides an automatic identification and cleaning method of a traditional Chinese medicine tongue color noise labeling sample. By adopting the method to process the labeled sample data, the automatic screening of the noise labeled sample can be realized, and the consistency of the sample labeling is improved. The processed data set can obtain higher identification accuracy.

Disclosure of Invention

The invention aims to provide an automatic identification and cleaning method of a traditional Chinese medicine tongue color noise labeling sample, which realizes accurate and automatic identification and cleaning of tongue color noise labeling data by comparing the probability relation between a prediction label and a manual labeling label and adopting two different screening strategies.

In order to achieve the aim, the technical scheme adopted by the invention is an automatic identification and cleaning method of a traditional Chinese medicine tongue color noise labeling sample, which comprises the steps of estimating joint probability distribution of sample labels, screening the noise sample, correcting the sample labels and the like. Each step is described in detail below. The invention calls the manual labeling label as a hard label, calls the label prediction probability obtained by the model as a soft label, and calls the label corresponding to the maximum value of the prediction probability as a pseudo label.

Step 1: estimating joint probability distribution of sample labels

Step 1.1 training tongue color classification model and determining sample pseudo label in cross validation mode

ResNet18 is taken as the backbone network, and a channel attention mechanism and the ACON (Activate Or Not) activation function are applied to it to construct a classification network model. The labeled tongue color sample data is divided into a training set and a test set; the training set is used to train the classification network model, and the trained model is used to determine the pseudo label of each sample in the test set. To improve the robustness and reliability of the classification result, the invention adopts an ensemble learning strategy that integrates a plurality of classification network models. Cross validation is performed with the integrated network model until every sample has been predicted exactly once. Through this cyclic estimation, the prediction probability of each sample label is obtained, forming a probability matrix.

Step 1.2 Joint probability distribution estimation of sample labels

First, noise samples are screened from all samples. A sample whose pseudo label is inconsistent with its hard label is regarded as a noise sample. According to the prediction probability of the sample label, the class with the maximum probability is taken as the pseudo label of the sample; the pseudo label is then compared with the hard label, and samples for which the two differ are judged to be noise samples.

The noise samples contain both incorrect label samples and inconsistent label samples. Inconsistent label samples are samples that themselves contain information from different categories, which blurs the category boundaries. Such samples cause overfitting during training, hinder the convergence of model optimization, and degrade performance. Incorrect label samples are due to human error. Next, incorrect and inconsistent label samples are distinguished. Analysis of the probability distribution of the sample labels shows that, for inconsistent label samples, the maximum value of the soft label is generally low and its difference from the second largest value is small, whereas for clean samples and incorrect label samples the maximum value of the soft label is generally high. A confidence threshold is therefore set using this characteristic: if the maximum value in a sample's soft label is larger than the preset confidence threshold and the pseudo label of the sample is inconsistent with the hard label, the sample is judged to be an incorrect label sample. In this way, incorrect label samples can be separated out from the noise samples.
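The three-way distinction described above can be illustrated with a minimal sketch. It assumes a single scalar confidence threshold for simplicity (the detailed description uses per-category thresholds), and the function name `categorize` and the toy probabilities are hypothetical, not taken from the patent.

```python
import numpy as np

def categorize(P, y, threshold):
    """Label each sample as clean, incorrect, or inconsistent.

    P: (n, m) soft labels; y: (n,) hard labels; threshold: assumed scalar.
    """
    pseudo = P.argmax(axis=1)      # pseudo label = class with maximum probability
    max_prob = P.max(axis=1)       # maximum value of the soft label
    mismatch = pseudo != y         # noise samples: pseudo label differs from hard label
    return np.where(
        ~mismatch, "clean",
        np.where(max_prob > threshold, "incorrect", "inconsistent"),
    )

# Toy soft labels for three samples that all carry hard label 0.
P = np.array([[0.90, 0.10],
              [0.05, 0.95],
              [0.45, 0.55]])
y = np.array([0, 0, 0])
kinds = categorize(P, y, threshold=0.8)
```

With these toy values the first sample agrees with its hard label, the second is confidently predicted as another class, and the third is a low-confidence mismatch.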

And finally, constructing a counting matrix, and obtaining the joint probability distribution of the pseudo label and the hard label of the sample through a series of calculations. The joint probability distribution can fully reflect the incidence relation between the pseudo label and the manually marked hard label, presents the distribution condition of the number of samples except for the inconsistent label samples, and provides a basis for the subsequent sample cleaning.

Step 2: screening and correction of noise samples

The invention respectively provides two noise sample screening strategies by utilizing the joint probability distribution of the sample labels, wherein the first strategy is used for identifying the noise samples, and the second strategy is used for identifying the incorrect label samples.

For an incorrect label sample, the label is corrected to the pseudo label and the sample is used for training the classification model. Inconsistent label samples are not used for training the classification model and are directly eliminated. After the noise samples are cleaned, clean samples are obtained and used for training the tongue color classification model.

Compared with the prior art, the invention has the following obvious advantages and beneficial effects:

1. The identification accuracy is high. Compared with traditional manual cleaning, the method predicts sample labels with a deep network model and then automatically identifies and screens the noise samples, so the result is more objective and accurate. In addition, no expert takes part in the process, so no manual labor is consumed, the possibility of human-introduced noise is reduced, and the accuracy of identifying noise labeling samples is improved.

2. The sample utilization rate is high. The method can distinguish incorrect label samples from inconsistent label samples and handles each in a different way. Therefore, each sample can be fully utilized, and the utilization rate of the samples is improved.

3. The flexibility and adaptability are high. The invention does not rely on noise-sample learning approaches in which algorithms and models are designed for an individual data set; it only processes the data set before model training, so the processed data set is suitable for other classification models.

Drawings

FIG. 1 is a diagram of a deep neural network architecture for cross-validation.

Fig. 2 is an example of a count matrix and a joint probability distribution matrix.

FIG. 3 is an overall block diagram of the identification and cleaning method.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and examples.

A method for automatically identifying and cleaning a traditional Chinese medicine tongue color noise labeling sample comprises the following steps.

Step 1: estimating joint probability distribution of sample labels

Suppose the expert annotation, which contains noise samples, is the hard label $y$, and the pseudo label obtained through tongue color classification model prediction is $y^*$. Let the total number of samples be $n$, and let the set of class labels be $\{1, 2, \ldots, m\}$, denoted $[m]$. Let $X_{y=j,\,y^*=k}$ denote the set of samples with hard label $j$ and pseudo label $k$, where $j, k \in [m]$.

Let the original data set be denoted $S$, where $x_i$ represents the $i$-th sample in the data set, $y_i \in [m]$ represents the hard label of the $i$-th sample, and $y_i^* \in [m]$ represents the pseudo label of the $i$-th sample.

S1.1 Training the tongue color classification model and determining sample pseudo labels by cross validation

The invention identifies noisy labels over the whole data set, and uses cross validation to calculate the probability P[i][j] of the i-th sample under the j-th category. Through cyclic estimation, the predicted label probability matrix over all samples is obtained. Because tongue image data sets contain few samples, the invention uses a larger number of folds to increase the number of samples in each training set, which benefits model training and yields more accurate results.
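The cross validation loop above can be sketched as follows. This is only an illustration under stated assumptions: a nearest-centroid softmax classifier stands in for the ResNet18-based network, the synthetic 2-D data replaces tongue images, and the names `fit_predict_centroid` and `out_of_fold_probs` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_predict_centroid(x_train, y_train, x_test, m):
    """Stand-in classifier (assumption): softmax over negative squared
    distance to each class centroid computed from the training fold."""
    centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in range(m)])
    d2 = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return softmax(-d2)

def out_of_fold_probs(x, y, m, n_folds, seed=0):
    """Fill P[i][j] so that every sample is predicted exactly once,
    by a model that never saw it during training."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    P = np.zeros((len(y), m))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        P[test_idx] = fit_predict_centroid(x[train_mask], y[train_mask], x[test_idx], m)
    return P

# Synthetic data: 4 well-separated classes, 30 samples each.
rng = np.random.default_rng(1)
m = 4
x = np.concatenate([rng.normal(loc=3.0 * c, scale=0.5, size=(30, 2)) for c in range(m)])
y = np.repeat(np.arange(m), 30)
P = out_of_fold_probs(x, y, m, n_folds=10)
```

A larger `n_folds` leaves more samples in each training fold, which matches the motivation given above for small data sets.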

To address the small sample size and class imbalance of tongue image data sets, the invention designs a deep classification network model based on ensemble learning for cross validation. The model takes ResNet18 as the backbone network and adds a channel attention mechanism and the ACON activation function to it, constructing a 'ResNet18 + CA + ACON' network structure, as shown in FIG. 1. The network is trained with model integration so that the predicted labels are more accurate and stable. Compared with other network models, this network addresses the model degradation problem and performs better on tongue color classification; furthermore, ResNet18 has relatively few parameters, so overfitting is less likely on small-sample problems.

(1) Channel attention mechanism

The attention mechanism focuses on local information, and the attended region tends to change as the task changes. The invention adds a channel attention mechanism to the last layer of the ResNet18 model. By modeling the importance of each feature channel, the channel attention mechanism can enhance or suppress different channels for the tongue color classification task. Although it adds a small amount of computation, it effectively improves the expressive power of the features and thereby yields better classification performance.
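The text does not specify the exact form of the channel attention module. The sketch below assumes a squeeze-and-excitation style design (global average pooling, a two-layer bottleneck, sigmoid gating) and shows only the forward pass in NumPy; the weights `w1` and `w2`, the reduction ratio `r` and the function name are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W) feature map; w1: (C//r, C); w2: (C, C//r)."""
    squeezed = feat.mean(axis=(1, 2))          # one descriptor per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)    # bottleneck with ReLU
    weights = sigmoid(w2 @ hidden)             # per-channel importance in (0, 1)
    return feat * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C)) * 0.1
w2 = rng.normal(size=(C, C // r)) * 0.1
out, weights = channel_attention(feat, w1, w2)
```

Each channel is rescaled by a single learned weight, so channels judged important are kept close to their original magnitude and the others are suppressed.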

(2) ACON activation function

The ACON activation function adaptively selects whether to activate a neuron, and replacing the activation layer of the original network with it can improve classification accuracy to a certain extent. This activation behavior helps improve the generalization ability and performance of the network. Considering that the color features of the tongue image are mainly contained in the shallow layers of the network, the invention replaces the ReLU activation function in the first layer of the network with the ACON activation function.
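The text does not state which ACON variant is used. As an illustration, the sketch below assumes the ACON-C form, f(x) = (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x, with scalar parameters; in a trained network p1, p2 and beta would be learned, typically per channel.

```python
import numpy as np

def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    """ACON-C activation (assumed variant). beta controls how sharply the
    unit switches between its linear and non-linear regimes."""
    d = (p1 - p2) * x
    return d / (1.0 + np.exp(-beta * d)) + p2 * x

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
near_relu = acon_c(x, p1=1.0, p2=0.0, beta=50.0)   # large beta: close to ReLU
linear = acon_c(x, p1=1.2, p2=0.4, beta=0.0)       # beta = 0: purely linear
swish = acon_c(x)                                   # defaults: x * sigmoid(x)
```

With beta = 0 the function reduces to the line ((p1 + p2) / 2) * x, which is the sense in which the unit can choose not to activate.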

(3) Model integration strategy

Ensemble learning completes a learning task by combining a plurality of base classifiers; it can improve accuracy and generalizes better than a single classifier. Model integration is one form of ensemble learning: by integrating multiple trained models, randomness in the classification results is reduced. When sample labels are predicted through cross validation, the accuracy and stability of the model strongly influence the prediction result. The invention therefore integrates 10 models and averages their predicted label probabilities, taking the average as the prediction probability of the label, as shown in FIG. 1.
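The averaging step can be sketched as follows. The random logits merely stand in for the outputs of 10 independently trained models; the array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, m = 10, 5, 4

# Stand-in for per-model outputs: (model, sample, class) logits.
logits = rng.normal(size=(n_models, n_samples, m))
e = np.exp(logits - logits.max(axis=2, keepdims=True))
probs = e / e.sum(axis=2, keepdims=True)     # per-model softmax probabilities

soft_labels = probs.mean(axis=0)             # ensemble average = soft label
pseudo_labels = soft_labels.argmax(axis=1)   # class with maximum averaged probability
```

Because each model's output is a probability vector, the average is also a probability vector, so it can be used directly as the soft label.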

S1.2 Joint probability distribution estimation of sample labels

The invention takes the average probability $t_j$ under each manually labeled category $j$ as the confidence threshold, expressed by formula (1):

$$t_j = \frac{1}{|X_{y=j}|} \sum_{x \in X_{y=j}} \hat{p}(y=j;\, x, \theta) \tag{1}$$

where $\hat{p}(y=j;\, x, \theta)$ represents the prediction probability of category $j$ for a sample $x$ whose hard label is $j$ when the model parameters are $\theta$, and $|X_{y=j}|$ represents the number of samples of category $j$.
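A minimal sketch of the per-category threshold in formula (1), using 0-based class indices and toy values; the function name `class_thresholds` is hypothetical.

```python
import numpy as np

def class_thresholds(P, y, m):
    """t[j] = mean predicted probability of class j over samples with hard label j."""
    return np.array([P[y == j, j].mean() for j in range(m)])

P = np.array([[0.9, 0.1],
              [0.7, 0.3],
              [0.2, 0.8],
              [0.4, 0.6]])
y = np.array([0, 0, 1, 0])
t = class_thresholds(P, y, 2)   # t[0] averages 0.9, 0.7, 0.4; t[1] is 0.8
```

Each category therefore gets its own threshold, which adapts to how confident the model is on that category overall.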

Samples whose maximum soft label value is larger than the confidence threshold are then screened out, expressed by formula (2):

$$\hat{X}_{y=j,\,y^*=k} = \left\{ x \in X_{y=j} : \hat{p}(y=k;\, x, \theta) > t_k,\ k = \arg\max_{l \in [m]} \hat{p}(y=l;\, x, \theta) \right\} \tag{2}$$

where $l \in [m]$ represents any class label, $k$ represents the pseudo label obtained through classification model prediction, $t_k$ is the confidence threshold of category $k$, and $\hat{p}$ is the predicted probability.

The screened samples are divided and counted according to the relationship between the hard label and the pseudo label of each sample, and a count matrix is constructed. The diagonal of the matrix counts the clean samples whose hard label and pseudo label agree; the off-diagonal entries count the noise labeling samples whose hard label and pseudo label are inconsistent. The count matrix is expressed by formula (3):

$$C_{y=j,\,y^*=k} = \left| \hat{X}_{y=j,\,y^*=k} \right| \tag{3}$$
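The screening and counting steps can be sketched as below. The sketch assumes the maximum soft label value is compared against the threshold of the pseudo-label category; samples that fail the test are left out of the count matrix. The function name and toy data are hypothetical.

```python
import numpy as np

def count_matrix(P, y, t):
    """C[j, k] = number of confidently predicted samples with hard label j and pseudo label k."""
    m = P.shape[1]
    pseudo = P.argmax(axis=1)
    confident = P[np.arange(len(y)), pseudo] > t[pseudo]   # assumed comparison against t[k]
    C = np.zeros((m, m), dtype=int)
    np.add.at(C, (y[confident], pseudo[confident]), 1)      # accumulate repeated index pairs
    return C, confident

P = np.array([[0.95, 0.05],
              [0.85, 0.15],
              [0.10, 0.90],
              [0.55, 0.45],
              [0.92, 0.08],
              [0.20, 0.80]])
y = np.array([0, 0, 1, 1, 1, 1])
t = np.array([P[y == j, j].mean() for j in range(2)])       # per-category thresholds
C, confident = count_matrix(P, y, t)
```

In this toy data the sample at index 4 carries hard label 1 but is confidently predicted as class 0, so it lands in an off-diagonal cell; the two low-confidence samples are excluded.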

The count matrix is then processed: the counts are first rescaled proportionally so that their sum matches the total number of samples in the original data set, and the result is then normalized to obtain the joint probability distribution estimate $\hat{Q}_{y=j,\,y^*=k}$ of the hard label and the pseudo label.

The joint probability distribution reflects the association between the pseudo label and the hard label and, importantly, presents the distribution of noise labeling samples among all samples, which provides a basis for the subsequent sample cleaning.

$\hat{Q}_{y=j,\,y^*=k}$ is calculated as follows:

$$\hat{Q}_{y=j,\,y^*=k} = \frac{\dfrac{C_{y=j,\,y^*=k}}{\sum_{k' \in [m]} C_{y=j,\,y^*=k'}} \cdot |X_{y=j}|}{\sum_{j' \in [m]} \sum_{k' \in [m]} \left( \dfrac{C_{y=j',\,y^*=k'}}{\sum_{k'' \in [m]} C_{y=j',\,y^*=k''}} \cdot |X_{y=j'}| \right)}$$

where $C_{y=j,\,y^*=k}$ is the count matrix entry for hard label $j$ and pseudo label $k$, and $|X_{y=j}|$ represents the total number of samples manually labeled $j$.
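The rescaling and normalization can be sketched as follows, assuming every category has at least one confidently counted sample so that no row sum is zero. The counts are illustrative values only, not the contents of FIG. 2.

```python
import numpy as np

def joint_distribution(C, class_counts):
    """Rescale each row of C to the original size of its hard-label class, then normalize."""
    row_sums = C.sum(axis=1, keepdims=True)
    calibrated = C / row_sums * class_counts[:, None]
    return calibrated / calibrated.sum()

C = np.array([[40, 10],
              [5, 45]])                 # hypothetical count matrix
class_counts = np.array([60, 55])       # hypothetical number of samples per hard label
Q = joint_distribution(C, class_counts)
```

After calibration the rows hold 48, 12 and 5.5, 49.5 samples respectively, so each row of `Q` sums to that category's share of the original data set.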

To express the count matrix and the joint probability distribution matrix more intuitively, a specific example is presented. Suppose the tongue color classification problem has 4 classes, represented by 0, 1, 2 and 3, and the number of samples is 415. After the prediction probabilities and related quantities are calculated, 400 samples satisfy the condition that the maximum value of the soft label is greater than the confidence threshold; these include a large number of clean samples whose hard labels agree with their pseudo labels and a small number of incorrect label samples whose hard labels do not. The resulting count matrix and joint probability distribution matrix are shown in FIG. 2.

Step 2: screening and correction of noise samples

Using the joint probability distribution between the pseudo label and the manually annotated hard label, the invention obtains the noise distribution of the data set and provides noise sample screening strategies that identify the noise samples. The noise samples comprise incorrect label samples and inconsistent label samples; the invention provides 2 strategies by which the two kinds can be separated. The overall framework of noise sample screening is shown in FIG. 3.

Strategy one: samples whose hard labels and pseudo labels are inconsistent are screened out to form the noise sample set $N$.

This strategy screens out incorrect label samples and inconsistent label samples in the annotation data at the same time. The data set after cleaning with strategy one is denoted $S'$, where $S' = S - N$.

Strategy two: incorrect label samples are screened out. From the calculation and estimation in step 1, the sample distribution over the off-diagonal cells of the joint probability distribution is the distribution of incorrect label samples to be screened out, so the same proportion of samples is next screened out from the original data set. The larger the predicted probability of a sample for its hard label class, the more likely the hard label agrees with the pseudo label; the smaller that probability, the less likely the hard label is consistent with the pseudo label, and the more likely the sample is an incorrect label sample. The invention therefore provides the following incorrect label sample screening strategy:

Under each manually labeled category $j$, the samples are sorted from low to high by their predicted probability for that category, and the number of samples corresponding to the off-diagonal cells of the joint probability distribution for that category, namely the first

$$n \cdot \sum_{k \in [m],\, k \neq j} \hat{Q}_{y=j,\,y^*=k}$$

samples, are screened out, where $\hat{Q}_{y=j,\,y^*=k}$ is the estimated joint probability of hard label $j$ and pseudo label $k$. This strategy cleans out the samples in the off-diagonal cells of the joint probability distribution, namely the incorrect label samples, which form the incorrect label sample set $E$. The data set after cleaning with strategy two is denoted $S''$, where $S'' = S - E$.

The samples screened out by strategy two are removed from the samples screened out by strategy one, and the remaining samples are the inconsistent label samples; that is, the inconsistent label sample set is $U = S'' - S'$.
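Strategy two can be sketched as a per-category selection of the lowest-probability samples. The sketch assumes the per-category count is rounded to the nearest integer, which the text does not specify; the joint distribution `Q`, the toy data and the function name are hypothetical.

```python
import numpy as np

def incorrect_label_indices(P, y, Q):
    """For each hard-label class j, pick the n * (off-diagonal mass of row j)
    samples with the lowest predicted probability for class j."""
    n, m = P.shape
    picked = []
    for j in range(m):
        off_diag_mass = Q[j].sum() - Q[j, j]
        n_j = int(np.rint(n * off_diag_mass))        # assumed rounding rule
        members = np.flatnonzero(y == j)
        order = members[np.argsort(P[members, j], kind="stable")]
        picked.extend(order[:n_j].tolist())
    return sorted(picked)

P = np.array([[0.95, 0.05],
              [0.85, 0.15],
              [0.10, 0.90],
              [0.55, 0.45],
              [0.92, 0.08],
              [0.20, 0.80]])
y = np.array([0, 0, 1, 1, 1, 1])
Q = np.array([[1 / 3, 0.0],
              [2 / 9, 4 / 9]])                        # hypothetical joint distribution
E = incorrect_label_indices(P, y, Q)
```

Here only category 1 has off-diagonal mass, and its lowest-probability member is the sample at index 4.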

Incorrect label samples are corrected: the pseudo label replaces the original label. The corrected samples can then be used for training the classification model. Inconsistent label samples are directly removed and are not used for training the classification model. After the noise samples are cleaned, clean samples are obtained for training the tongue color classification model.
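The final correction and removal step can be sketched as follows, taking the noise set from strategy one and the incorrect label set from strategy two as given index lists. The function name and toy values are hypothetical.

```python
import numpy as np

def clean_dataset(y, pseudo, noise_idx, incorrect_idx):
    """Relabel incorrect samples with their pseudo labels and drop inconsistent ones."""
    incorrect = np.array(sorted(set(incorrect_idx)), dtype=int)
    inconsistent = sorted(set(noise_idx) - set(incorrect_idx))   # noise samples not in E
    corrected = y.copy()
    corrected[incorrect] = pseudo[incorrect]                      # pseudo label replaces hard label
    keep = np.setdiff1d(np.arange(len(y)), np.array(inconsistent, dtype=int))
    return keep, corrected[keep], inconsistent

y = np.array([0, 0, 1, 1, 1, 1])
pseudo = np.array([0, 0, 1, 0, 0, 1])
N = [3, 4]    # strategy one: pseudo label differs from hard label
E = [4]       # strategy two: incorrect label samples
keep, y_clean, U = clean_dataset(y, pseudo, N, E)
```

The sample at index 4 is kept with its label changed to 0, while the sample at index 3 is removed as an inconsistent label sample.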

The automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling samples provided by the invention processes the labeled sample data, achieves automatic screening of noise labeling samples, and improves the consistency of sample labeling, thereby improving the accuracy of the classification model. Compared with traditional manual data cleaning, the method saves manpower and material resources, improves the accuracy of identification and cleaning, improves sample utilization, and offers good flexibility.

Claims

1. A method for automatically identifying and cleaning a traditional Chinese medicine tongue color noise labeling sample is characterized by comprising the following steps: the method comprises the following steps of,

step 1: estimating a joint probability distribution of the sample labels;

step 1.1, training a tongue color classification model and determining a sample pseudo label in a cross validation mode;

using ResNet18 as a backbone network, applying a channel attention mechanism and an ACON activation function to the network, and constructing a classification network model; dividing the labeled tongue color sample data into a training set and a test set, wherein the training set is used for training the classification network model, and the trained model is used for determining the pseudo label of each sample of the test set; an ensemble learning strategy is adopted to integrate a plurality of classification network models, so that the robustness and stability of prediction are improved; performing cross validation by using the integrated network model until every sample has been predicted exactly once; obtaining the prediction probability of each sample label through cyclic estimation processing to form a probability matrix;

step 1.2, estimating the joint probability distribution of the sample labels;

firstly, screening noise samples from all samples; a sample whose pseudo label is inconsistent with its hard label is taken as a noise sample; according to the prediction probability of the sample label, taking the category corresponding to the maximum probability as the pseudo label of the sample; judging whether the pseudo label is consistent with the hard label, and judging inconsistent samples as noise samples;

the noise samples comprise both incorrect label samples and inconsistent label samples; an inconsistent label sample is a sample that contains information from different categories, so that the category boundary is blurred; incorrect label samples and inconsistent label samples are then distinguished; analysis of the probability distribution of the sample labels shows that the maximum value of the soft label of inconsistent label samples is generally low and its difference from the second largest value is small, while the maximum value of the soft label of clean samples and incorrect label samples is generally high; setting a confidence threshold, and if the maximum value in the sample soft label is greater than the preset confidence threshold and the pseudo label of the sample is inconsistent with the hard label, judging the sample to be an incorrect label sample, thereby separating incorrect label samples from the noise samples;

finally, constructing a counting matrix, and obtaining the joint probability distribution of the pseudo label and the hard label of the sample through a series of calculations; the joint probability distribution fully reflects the incidence relation between the pseudo label and the manually marked hard label, presents the distribution condition of the number of samples except for the inconsistent label samples and provides a basis for the subsequent sample cleaning;

step 2: screening and correction of noise samples

Two noise sample screening strategies are respectively provided by utilizing the joint probability distribution of the sample labels, wherein the first strategy is used for identifying the noise samples, and the second strategy is used for distinguishing the incorrect label samples;

correcting the label of the incorrect label sample into a pseudo label for training a classification model; for inconsistent label samples, the inconsistent label samples are not used for training the classification model and are directly eliminated; after the noise sample is cleaned, a clean sample is obtained and is used for training the tongue color classification model.

2. The automatic identification and cleaning method of the tongue color noise labeling sample of traditional Chinese medicine according to claim 1, characterized in that: in step 1, the expert annotation containing noise samples is taken as the hard label $y$, and the pseudo label $y^*$ is obtained through tongue color classification model prediction; the total number of samples is $n$, and the set of class labels is $\{1, 2, \ldots, m\}$, denoted $[m]$; $X_{y=j,\,y^*=k}$ denotes the set of samples with hard label $j$ and pseudo label $k$, where $j, k \in [m]$;

the original data set is denoted $S$, wherein $x_i$ represents the $i$-th sample in the data set, $y_i \in [m]$ represents the hard label of the $i$-th sample in the data set, and $y_i^* \in [m]$ represents the pseudo label of the $i$-th sample in the data set.

3. the automatic identification and cleaning method of the tongue color noise labeling sample of traditional Chinese medicine according to claim 2, characterized in that: identifying a noise labeling label aiming at the whole data set, and calculating the probability P [ i ] [ j ] of the ith sample under the jth category by adopting a cross validation method; obtaining a prediction label probability matrix of each sample through cyclic estimation processing;

ResNet18 is used as the backbone network, a channel attention mechanism and an ACON activation function are added to the backbone network to construct a 'ResNet18 + CA + ACON' network structure, and the network is trained by model integration so that the predicted labels are more accurate and stable.

4. The automatic identification and cleaning method of the tongue color noise labeling samples in traditional Chinese medicine according to claim 3, characterized in that: the attention mechanism is a mechanism focusing on local information, and an attention area is changed along with the change of a task; adding a channel attention mechanism in the last layer of the ResNet18 model, and strengthening or inhibiting different channels aiming at the tongue color classification task by modeling the importance degree of each characteristic channel so as to learn the importance of the different channels; although a small amount of calculation is added, the expression capability of the features can be effectively improved, and therefore better classification performance is obtained.

5. The automatic identification and cleaning method of the tongue color noise labeling samples in traditional Chinese medicine according to claim 3, characterized in that: the ACON activation function adaptively selects whether to activate a neuron, and classification accuracy can be improved to a certain extent by replacing an activation layer of the original network with it; this activation behavior helps to improve the generalization ability and performance of the network; considering that the color features of the tongue image are contained in the shallow layers of the network, the ReLU activation function in the first layer of the network is replaced with the ACON activation function.

6. The automatic identification and cleaning method of the tongue color noise labeling samples in traditional Chinese medicine according to claim 3, characterized in that: averaging the prediction probabilities of the multiple models in a mode of integrating 10 models to serve as the prediction probability of the label;

the average probability $t_j$ under each manually labeled category $j$ is taken as the confidence threshold, expressed by formula (1):

$$t_j = \frac{1}{|X_{y=j}|} \sum_{x \in X_{y=j}} \hat{p}(y=j;\, x, \theta) \tag{1}$$

wherein $\hat{p}(y=j;\, x, \theta)$ represents the prediction probability of category $j$ for a sample $x$ whose hard label is $j$ when the model parameters are $\theta$, and $|X_{y=j}|$ represents the number of samples of category $j$;

samples whose maximum soft label value is greater than the confidence threshold are screened out, expressed by formula (2):

$$\hat{X}_{y=j,\,y^*=k} = \left\{ x \in X_{y=j} : \hat{p}(y=k;\, x, \theta) > t_k,\ k = \arg\max_{l \in [m]} \hat{p}(y=l;\, x, \theta) \right\} \tag{2}$$

wherein $l \in [m]$ represents any class label, and $k$ represents the pseudo label obtained through classification model prediction;

the screened samples are divided and counted according to the relationship between the hard label and the pseudo label of each sample, and a count matrix is constructed; the diagonal of the matrix counts the clean samples whose hard label and pseudo label agree; the off-diagonal entries count the noise labeling samples whose hard label and pseudo label are inconsistent; the count matrix is expressed by formula (3):

$$C_{y=j,\,y^*=k} = \left| \hat{X}_{y=j,\,y^*=k} \right| \tag{3}$$

the counts of the count matrix are rescaled proportionally to the total number of samples in the original data set and then normalized to obtain the joint probability distribution estimate $\hat{Q}_{y=j,\,y^*=k}$ of the hard label and the pseudo label; the joint probability distribution reflects the association between the pseudo label and the hard label and, importantly, presents the distribution of noise labeling samples among all samples, which provides a basis for the subsequent sample cleaning;

$\hat{Q}_{y=j,\,y^*=k}$ is calculated as follows:

$$\hat{Q}_{y=j,\,y^*=k} = \frac{\dfrac{C_{y=j,\,y^*=k}}{\sum_{k' \in [m]} C_{y=j,\,y^*=k'}} \cdot |X_{y=j}|}{\sum_{j' \in [m]} \sum_{k' \in [m]} \left( \dfrac{C_{y=j',\,y^*=k'}}{\sum_{k'' \in [m]} C_{y=j',\,y^*=k''}} \cdot |X_{y=j'}| \right)}$$

wherein $|X_{y=j}|$ represents the total number of samples manually labeled $j$;

to express the count matrix and the joint probability distribution matrix more intuitively, a specific example is given; suppose the tongue color classification problem has 4 classes, represented by 0, 1, 2 and 3, and the number of samples is 415; after the prediction probabilities are calculated, 400 samples satisfy the condition that the maximum value of the soft label is greater than the confidence threshold, and these comprise clean samples whose hard labels agree with their pseudo labels and a small number of incorrect label samples whose hard labels do not; the count matrix and the joint probability distribution matrix are obtained accordingly.

7. The automatic identification and cleaning method of the tongue color noise labeling sample of traditional Chinese medicine according to claim 1, characterized in that: obtaining the noise distribution condition of the data set by using the joint probability distribution between the pseudo label and the manually marked hard label, and providing a noise sample screening strategy to identify the noise sample; the noise samples comprise an incorrect label sample and an inconsistent label sample, 2 strategies are respectively provided, and the incorrect label sample and the inconsistent label sample are respectively screened out;

strategy one: screening out samples whose hard labels and pseudo labels are inconsistent to form a noise sample set $N$; this strategy screens out incorrect label samples and inconsistent label samples in the labeling data at the same time; the data set after cleaning with strategy one is denoted $S'$, where $S' = S - N$;

strategy two: screening out incorrect label samples; from the calculation and estimation in step 1, the sample distribution over the off-diagonal cells of the joint probability distribution is the distribution of incorrect label samples to be screened out, and the same proportion of samples is screened out from the original data set using an incorrect label sample screening strategy.

8. The method for automatically recognizing and cleaning the tongue color noise labeling sample of traditional Chinese medicine according to claim 7, wherein: in the incorrect label sample screening strategy, under each manually labeled category $j$, samples are sorted from low to high by their predicted probability for that category, and the number of samples corresponding to the off-diagonal cells of the joint probability distribution for that category, namely the first

$$n \cdot \sum_{k \in [m],\, k \neq j} \hat{Q}_{y=j,\,y^*=k}$$

samples, are screened out, where $\hat{Q}_{y=j,\,y^*=k}$ is the estimated joint probability of hard label $j$ and pseudo label $k$; this strategy cleans out the samples in the off-diagonal cells of the joint probability distribution, namely the incorrect label samples, which form an incorrect label sample set $E$; the data set after cleaning with strategy two is denoted $S''$, where $S'' = S - E$;

removing the samples screened out by strategy two from the samples screened out by strategy one, wherein the remaining samples are inconsistent label samples, namely the inconsistent label sample set is $U = S'' - S'$;

correcting the incorrect label samples, namely replacing the original label with the pseudo label; the corrected samples are used for training the classification model; the inconsistent label samples are directly removed and are not used for training the classification model; after the noise samples are cleaned, clean samples are obtained for training the tongue color classification model.