CN114037011A - Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample - Google Patents


Info

Publication number
CN114037011A
Authority
CN
China
Prior art keywords
label
sample
samples
noise
inconsistent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111316442.3A
Other languages
Chinese (zh)
Inventor
卓力
李艳萍
孙亮亮
张雷
张菁
李晓光
张辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202111316442.3A
Publication of CN114037011A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention discloses a method for automatically identifying and cleaning noisily labeled tongue color samples in traditional Chinese medicine. The method compares the probabilistic relationship between predicted labels and manually annotated labels and applies two different screening strategies, so that noisily labeled tongue color data are identified and cleaned accurately and automatically. The manually annotated label is called the hard label, the label prediction probabilities output by the model are called the soft label, and the label with the maximum prediction probability is called the pseudo label. Because a deep network model predicts the sample labels, noise samples are identified and screened automatically, and the result is more objective and accurate. No expert takes part in the process, which saves labor, reduces the chance of human-introduced noise, and improves the accuracy of identifying noisily labeled samples. Because the data set is processed before model training, the processed data set can be used with other classification models.

Description

Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample
Technical Field
The invention belongs to the field of computer vision and traditional Chinese medicine diagnostics, and particularly relates to technologies such as computer image processing, deep learning and traditional Chinese medicine tongue diagnosis.
Background
Among the four diagnostic methods of traditional Chinese medicine, inspection comes first; a classical saying holds that one who can diagnose by observation alone is the most skilled. Tongue diagnosis is an important part of inspection: it is an examination method for understanding the physiological functions and pathological changes of the human body by observing changes in the tongue's appearance. The physician diagnoses disease by observing the tongue body and tongue coating, including color, thickness, texture, moisture, shape and state. Tongue color is the most intuitive and important diagnostic feature in traditional Chinese medicine diagnosis and treatment, and is classified into 4 categories, such as pale red, magenta and purple.
When a computer is used to analyze tongue color automatically, the task is usually treated as a classification problem and solved with machine learning. Such a method learns tongue color classification rules from sample data manually labeled by doctors, builds a model, and uses it to classify new samples automatically. However, doctors' labels are limited by their knowledge, way of thinking and diagnostic experience, and are affected by external factors such as lighting and temperature, so labeling errors often occur. The resulting label noise degrades the training of the tongue color classification model and lowers classification accuracy.
The invention provides a method for automatically identifying and cleaning noisily labeled tongue color samples in traditional Chinese medicine. Processing the labeled sample data with this method screens out noisily labeled samples automatically and improves labeling consistency, and a model trained on the processed data set achieves higher recognition accuracy.
Disclosure of Invention
The invention aims to provide an automatic identification and cleaning method of a traditional Chinese medicine tongue color noise labeling sample, which realizes accurate and automatic identification and cleaning of tongue color noise labeling data by comparing the probability relation between a prediction label and a manual labeling label and adopting two different screening strategies.
To achieve this aim, the invention adopts the following technical scheme: a method for automatically identifying and cleaning noisily labeled tongue color samples in traditional Chinese medicine, comprising the steps of estimating the joint probability distribution of sample labels, screening noise samples, and correcting sample labels. Each step is described in detail below. The manually annotated label is called the hard label, the label prediction probabilities output by the model are called the soft label, and the label with the maximum prediction probability is called the pseudo label.
Step 1: estimating joint probability distribution of sample labels
Step 1.1 Training the tongue color classification model and determining sample pseudo labels by cross validation
A classification network model is constructed with ResNet18 as the backbone network, to which a channel attention mechanism and the ACON (Activate Or Not) activation function are applied. The labeled tongue color sample data are divided into a training set and a test set; the training set is used to train the classification network model, and the trained model determines the pseudo label of each test-set sample. To make the classification results more robust and reliable, the invention adopts an ensemble learning strategy that integrates several classification network models. Cross validation is performed with the ensemble model until every sample has been predicted exactly once. This round-by-round estimation yields the prediction probabilities of each sample's label, which form a probability matrix.
Step 1.2 Joint probability distribution estimation of sample labels
First, noise samples are screened from all samples. For each sample, the class with the maximum prediction probability is taken as its pseudo label. The pseudo label is then compared with the hard label, and a sample whose pseudo label differs from its hard label is judged to be a noise sample.
Noise samples include both incorrect label samples and inconsistent label samples. An inconsistent label sample is one that itself contains information from several categories, which blurs the category boundaries. Such samples cause overfitting during training, keep model optimization from converging in its early stage, and degrade performance. An incorrect label sample results from human error. The two kinds are distinguished as follows. Analysis of the label probability distributions shows that the soft-label maximum of an inconsistent label sample is generally low and differs little from the second-largest value, whereas the soft-label maximum of clean samples and incorrect label samples is generally high. A confidence threshold is therefore set: if the maximum value in a sample's soft label exceeds the preset confidence threshold and the sample's pseudo label differs from its hard label, the sample is judged to be an incorrect label sample. In this way incorrect label samples are separated from the other noise samples.
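The rule in the paragraph above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `categorize`, the per-class threshold list and the example probabilities are all hypothetical.

```python
def categorize(soft_label, hard_label, thresholds):
    """Label one sample as 'clean', 'incorrect' or 'inconsistent'.

    soft_label: predicted probabilities, one per class
    hard_label: index of the manually annotated class
    thresholds: hypothetical per-class confidence thresholds
    """
    # Pseudo label: the class with the maximum predicted probability.
    pseudo = max(range(len(soft_label)), key=lambda k: soft_label[k])
    if pseudo == hard_label:
        return "clean"
    # Noise sample: a confident disagreement is read as an incorrect label,
    # a low-confidence disagreement as an inconsistent label.
    if soft_label[pseudo] > thresholds[pseudo]:
        return "incorrect"
    return "inconsistent"


thresholds = [0.7, 0.7, 0.7, 0.7]  # assumed values for illustration only
kinds = [
    categorize([0.90, 0.05, 0.03, 0.02], 0, thresholds),
    categorize([0.05, 0.90, 0.03, 0.02], 0, thresholds),
    categorize([0.40, 0.45, 0.10, 0.05], 0, thresholds),
]
```

The third sample shows the pattern the text describes for inconsistent label samples: a low maximum that is close to the second-largest value.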
Finally, a count matrix is constructed, and the joint probability distribution of the pseudo labels and hard labels is derived from it. The joint probability distribution reflects the relationship between the pseudo labels and the manually annotated hard labels, describes how the samples other than the inconsistent label samples are distributed, and provides the basis for the subsequent sample cleaning.
Step 2: screening and correction of noise samples
Using the joint probability distribution of the sample labels, the invention provides two noise sample screening strategies: the first identifies all noise samples, and the second identifies the incorrect label samples.
For an incorrect label sample, the label is corrected to the pseudo label and the sample is used for training the classification model. Inconsistent label samples are removed directly and are not used for training. After the noise samples are cleaned, the remaining clean samples are used to train the tongue color classification model.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
1. High identification accuracy. Compared with traditional manual cleaning, the method predicts sample labels with a deep network model and then identifies and screens noise samples automatically, so the result is more objective and accurate. No expert takes part in the process, which saves labor, reduces the chance of human-introduced noise, and improves the accuracy of identifying noisily labeled samples.
2. High sample utilization. The method distinguishes incorrect label samples from inconsistent label samples and handles them differently: incorrect label samples are relabeled and kept rather than discarded, which raises sample utilization.
3. High flexibility and adaptability. Unlike noisy-sample learning approaches that design the algorithm and model for one particular data set, the invention only processes the data set before model training, so the processed data set can be used with other classification models.
Drawings
FIG. 1 is a diagram of a deep neural network architecture for cross-validation.
Fig. 2 is an example of a count matrix and a joint probability distribution matrix.
FIG. 3 is an overall block diagram of the identification and cleaning method.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples.
A method for automatically identifying and cleaning noisily labeled tongue color samples in traditional Chinese medicine comprises the following steps.
Step 1: estimating joint probability distribution of sample labels
Let the expert annotation, which contains noise, be the hard label $y$, and let the label predicted by the tongue color classification model be the pseudo label $y^*$. Let the total number of samples be $n$ and the set of class labels be $\{1, 2, \dots, m\}$, denoted $[m]$. Let $X_{y=j,\,y^*=k}$ denote the set of samples with hard label $j$ and pseudo label $k$, where $j, k \in [m]$.
Let the original data set be $S$, where $x_i$ denotes the $i$-th sample in the data set, $y_i \in [m]$ its hard label, and $y_i^* \in [m]$ its pseudo label.
S1.1 Training the tongue color classification model and determining sample pseudo labels by cross validation
The invention identifies noisy labels over the whole data set, using cross validation to compute the probability P[i][j] that the i-th sample belongs to the j-th category. Round-by-round estimation yields the predicted label probability matrix for all samples. Because tongue image data sets contain few samples, the invention uses a larger number of folds so that each training set contains more samples, which benefits model training and gives more accurate results.
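The cross-validation loop that fills P[i][j] can be sketched as follows. This is a minimal sketch, assuming a round-robin fold split and a trivial stand-in classifier; `k_fold_indices`, `out_of_fold_probabilities`, `fit_prior` and `predict_prior` are hypothetical names, and a real run would plug in the ResNet18-based ensemble instead of the stand-in.

```python
def k_fold_indices(n_samples, n_folds):
    """Partition sample indices into n_folds disjoint test folds (round-robin)."""
    return [list(range(f, n_samples, n_folds)) for f in range(n_folds)]


def out_of_fold_probabilities(samples, labels, n_classes, n_folds, fit, predict_proba):
    """Return P, where P[i][j] is the probability of class j for sample i,
    predicted by a model that never saw sample i during training."""
    n = len(samples)
    P = [None] * n
    for test_idx in k_fold_indices(n, n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx], n_classes)
        for i in test_idx:  # each index appears in exactly one fold
            P[i] = predict_proba(model, samples[i])
    return P


# Hypothetical stand-in classifier: predicts the training-set class frequencies.
def fit_prior(train_samples, train_labels, n_classes):
    counts = [0] * n_classes
    for y in train_labels:
        counts[y] += 1
    return [c / len(train_labels) for c in counts]


def predict_prior(model, sample):
    return list(model)


samples = list(range(10))
labels = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
P = out_of_fold_probabilities(samples, labels, 4, 5, fit_prior, predict_prior)
```

With more folds, each training split keeps a larger share of the data, which is the reason the text gives for choosing a larger fold count on small tongue image data sets.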
Because tongue image data sets have few samples and imbalanced categories, the invention designs a deep classification network model based on ensemble learning for cross validation. The model takes ResNet18 as the backbone network and adds a channel attention mechanism and the ACON activation function to it, giving the "ResNet18 + CA + ACON" network structure shown in FIG. 1. The network is trained with model ensembling so that the predicted labels are more accurate and stable. Compared with other network models, this network addresses the model degradation problem and performs better on tongue color classification; furthermore, ResNet18 has a small number of parameters, so it is less prone to overfitting on small-sample problems.
(1) Channel attention mechanism
An attention mechanism focuses on local information, and the attended region changes with the task. The invention adds a channel attention mechanism to the last layer of the ResNet18 model. By modeling the importance of each feature channel, channel attention learns how important the different channels are and strengthens or suppresses them for the tongue color classification task. It adds a small amount of computation but effectively improves the expressive power of the features, and thus the classification performance.
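The patent does not specify the exact form of its channel attention block. As an illustration only, the sketch below assumes a squeeze-and-excitation style design, one common form of channel attention: global average pooling per channel, two small fully connected layers, and a sigmoid gate that rescales each channel. The function name and the weight values are hypothetical.

```python
import math


def channel_attention(feature_maps, w1, w2):
    """Reweight channels, assuming a squeeze-and-excitation style block.

    feature_maps: list of C channels, each an H x W nested list
    w1: hidden x C weight matrix, w2: C x hidden weight matrix (hypothetical)
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
               for row in w2]
    # Scale: strengthen or suppress each channel by its learned weight.
    scaled = [[[v * a for v in row] for row in ch]
              for ch, a in zip(feature_maps, weights)]
    return scaled, weights


features = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
            [[2.0, 2.0], [2.0, 2.0]]]   # channel 1
w1 = [[1.0, 1.0]]        # 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # 1 hidden unit -> 2 channel weights
scaled, weights = channel_attention(features, w1, w2)
```

In this toy setting channel 0 receives a weight above 0.5 and channel 1 a weight below 0.5, which shows how the gate can enhance one channel while suppressing another.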
(2) ACON activation function
The ACON activation function adaptively chooses whether to activate a neuron, and replacing activation layers of the original network with it can improve classification accuracy to some extent. This activation behavior helps improve the generalization ability and performance of the network. Because the color features of the tongue image are mainly captured in the shallow layers of the network, the invention replaces the ReLU activation function in the first layer of the network with the ACON activation function.
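The patent does not say which ACON variant it uses. The sketch below assumes the ACON-C form, f(x) = (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x, in which p1, p2 and beta are learnable in a real network; here they are plain scalar arguments chosen for illustration.

```python
import math


def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    """ACON-C activation on a scalar (assumed variant).

    beta controls the switch: a large beta approaches max(p1*x, p2*x),
    i.e. the neuron is 'activated'; beta = 0 gives the linear mean
    (p1 + p2) / 2 * x, i.e. 'not activated'.
    """
    d = (p1 - p2) * x
    return d * (1.0 / (1.0 + math.exp(-beta * d))) + p2 * x


linear = acon_c(2.0, beta=0.0)       # linear regime
sharp_pos = acon_c(2.0, beta=50.0)   # close to max(2, 0)
sharp_neg = acon_c(-2.0, beta=50.0)  # close to max(-2, 0)
```

With p1 = 1 and p2 = 0 the function reduces to Swish, and with a large beta it behaves like ReLU, which is why it can stand in for the first-layer ReLU.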
(3) Model integration strategy
Ensemble learning completes a learning task by combining several base classifiers; it can improve accuracy and generalizes better than a single classifier. Model ensembling is one form of ensemble learning: combining several trained models reduces the randomness of the classification results. When sample labels are predicted through cross validation, the accuracy and stability of the model strongly affect the prediction, so the invention ensembles 10 models and takes the average of their predicted label probabilities as the prediction probability of the label, as shown in FIG. 1.
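The averaging step can be sketched in a few lines. The function name and the two-model example are hypothetical; the patent ensembles 10 models, and the same function applies to any number.

```python
def average_soft_labels(per_model_probs):
    """Average the class probabilities predicted by several models for one sample.

    per_model_probs[k][j] is the probability of class j from model k.
    """
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[j] for p in per_model_probs) / n_models
            for j in range(n_classes)]


# Two hypothetical models disagree on confidence; the ensemble takes the mean.
soft_label = average_soft_labels([[0.6, 0.4], [0.8, 0.2]])
```

Because each input row sums to 1, the averaged soft label also sums to 1 and can be used directly as a row of the probability matrix.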
S1.2 Joint probability distribution estimation of sample labels
The invention takes the average predicted probability $t_j$ of each manually annotated category $j$ as the confidence threshold, expressed by formula (1):
$$t_j = \frac{1}{|X_{y=j}|} \sum_{x \in X_{y=j}} \hat{p}(y=j;\, x, \theta) \tag{1}$$
where $\hat{p}(y=j;\, x, \theta)$ is the probability, predicted by the model with parameters $\theta$, that a sample $x$ belongs to category $j$, and $|X_{y=j}|$ is the number of samples whose hard label is $j$.
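Formula (1) can be computed directly from the probability matrix and the hard labels. A minimal sketch, assuming every class has at least one sample; the function name and the toy data are hypothetical.

```python
def class_thresholds(P, hard_labels, n_classes):
    """t_j: mean predicted probability of class j over samples with hard label j.

    Assumes every class has at least one sample (otherwise division by zero).
    """
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for probs, y in zip(P, hard_labels):
        sums[y] += probs[y]
        counts[y] += 1
    return [s / c for s, c in zip(sums, counts)]


P = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
hard_labels = [0, 0, 1]
t = class_thresholds(P, hard_labels, 2)
```

Here class 0 averages 0.9 and 0.7, and class 1 has the single value 0.8, so both thresholds come out at 0.8.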
Samples whose soft-label maximum exceeds the confidence threshold are then screened out, expressed by formula (2):
$$\hat{X}_{y=j,\,y^*=k} = \left\{ x \in X_{y=j} : \hat{p}(y=k;\, x, \theta) > t_k,\ k = \arg\max_{l \in [m]} \hat{p}(y=l;\, x, \theta) \right\} \tag{2}$$
where $l \in [m]$ ranges over all labels and $k$ is the pseudo label predicted by the classification model.
The screened samples are partitioned and counted according to the relationship between each sample's hard label and pseudo label to construct a count matrix. The diagonal entries count the clean samples, whose hard label and pseudo label agree; the off-diagonal entries count the noisily labeled samples, whose hard label and pseudo label differ. The count matrix is expressed by formula (3):
$$C_{y=j,\,y^*=k} = \left| \hat{X}_{y=j,\,y^*=k} \right| \tag{3}$$
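The screening and counting described above can be sketched together. This is an illustration under assumptions: the function name is hypothetical, the thresholds are passed in as given values rather than computed, and the comparison is strict, following the wording "greater than".

```python
def count_matrix(P, hard_labels, thresholds):
    """C[j][k]: number of confidently predicted samples with hard label j
    and pseudo label k. Samples below the threshold are not counted."""
    m = len(thresholds)
    C = [[0] * m for _ in range(m)]
    for probs, j in zip(P, hard_labels):
        k = max(range(m), key=lambda l: probs[l])  # pseudo label
        if probs[k] > thresholds[k]:               # confidence screening
            C[j][k] += 1
    return C


P = [[0.90, 0.10], [0.60, 0.40], [0.85, 0.15], [0.20, 0.80]]
hard_labels = [0, 0, 1, 1]
thresholds = [0.75, 0.60]  # assumed values for illustration only
C = count_matrix(P, hard_labels, thresholds)
```

The second sample is dropped because its maximum (0.60) does not exceed the class 0 threshold; the third sample lands off the diagonal because it is confidently predicted as class 0 while labeled class 1.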
The count matrix is then processed: each row is first rescaled so that its sum matches the number of samples with that hard label in the original data (so the counts total the original sample count), and the result is normalized to obtain the estimated joint probability distribution $\hat{Q}_{y,\,y^*}$ of the hard label and the pseudo label. The joint probability distribution reflects the relationship between the pseudo label and the hard label and, importantly, describes how the noisily labeled samples are distributed among all samples, providing the basis for the subsequent sample cleaning. $\hat{Q}_{y,\,y^*}$ is calculated as follows:
$$\hat{Q}_{y=j,\,y^*=k} = \frac{\dfrac{C_{y=j,\,y^*=k}}{\sum_{k' \in [m]} C_{y=j,\,y^*=k'}} \cdot |X_{y=j}|}{\sum_{j' \in [m]} \sum_{k' \in [m]} \left( \dfrac{C_{y=j',\,y^*=k'}}{\sum_{k'' \in [m]} C_{y=j',\,y^*=k''}} \cdot |X_{y=j'}| \right)}$$
where $|X_{y=j}|$ is the total number of samples manually labeled $j$.
To illustrate the count matrix and the joint probability distribution matrix, a specific example is given. Suppose the tongue color classification problem has 4 classes, denoted 0, 1, 2 and 3, and the data set contains 415 samples. After the prediction probabilities and thresholds are computed, 400 samples have a soft-label maximum above the confidence threshold; these comprise a large number of clean samples, whose hard and pseudo labels agree, and a small number of incorrect label samples, whose hard and pseudo labels differ. The resulting count matrix and joint probability distribution matrix are shown in FIG. 2.
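The step from a count matrix to a joint distribution can be worked through numerically. The numbers below are hypothetical and are not the values of FIG. 2; they only reuse the same totals (400 counted samples out of 415). The function name is also an assumption, and every row of the count matrix is assumed to be nonzero.

```python
def joint_distribution(C, class_totals):
    """Rescale each row of C to the class size, then normalize to sum to 1.

    C[j][k]: confident count for hard label j and pseudo label k
    class_totals[j]: number of samples with hard label j in the full data set
    """
    m = len(C)
    calibrated = []
    for j in range(m):
        row_sum = sum(C[j])  # assumed nonzero
        calibrated.append([C[j][k] / row_sum * class_totals[j] for k in range(m)])
    total = sum(sum(row) for row in calibrated)
    return [[v / total for v in row] for row in calibrated]


# Hypothetical 4-class count matrix (entries sum to 400).
C = [[95, 3, 1, 1],
     [4, 110, 5, 1],
     [2, 6, 90, 2],
     [1, 1, 3, 75]]
class_totals = [104, 124, 104, 83]  # hypothetical, sums to 415
Q = joint_distribution(C, class_totals)
```

After rescaling, row j sums to the class size, so each row of Q sums to the share of that class in the full data set, and the off-diagonal mass of a row estimates the fraction of all samples that carry hard label j but belong elsewhere.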
Step 2: screening and correction of noise samples
Using the joint probability distribution between the pseudo labels and the manually annotated hard labels, the invention obtains the noise distribution of the data set and provides screening strategies that identify noise samples. Noise samples comprise incorrect label samples and inconsistent label samples, and the invention provides 2 strategies with which the two kinds can be separated. The overall framework of noise sample screening is shown in FIG. 3.
Strategy one: screen out the samples whose hard label and pseudo label differ to form the noise sample set $N$:
$$N = \{ x_i \in S : y_i \neq y_i^* \}$$
This strategy screens out the incorrect label samples and the inconsistent label samples in the annotated data at the same time. The data set after cleaning with strategy one is denoted $S'$, where $S' = S - N$.
Strategy two: screen out the incorrect label samples. According to the estimate from step 1, the off-diagonal entries of the joint probability distribution give the distribution of the incorrect label samples to be screened out, so the same proportion of samples is next selected from the original data set. The larger a sample's predicted probability for its hard label class, the more likely the hard label agrees with the pseudo label; the smaller that probability, the less likely they agree and the more likely the sample is an incorrect label sample. The invention therefore provides the following incorrect label sample screening strategy:
Under each manually annotated category $j$, the samples are sorted from low to high by their predicted probability for that category, and the number of samples given by the off-diagonal entries of the joint probability distribution for that category, namely the first
$$n \cdot \sum_{k \in [m],\, k \neq j} \hat{Q}_{y=j,\,y^*=k}$$
samples, are screened out. This strategy cleans out the samples in the off-diagonal entries of the joint probability distribution, that is, the incorrect label samples, which form the incorrect label sample set $E$. The data set after cleaning with strategy two is denoted $S''$, where $S'' = S - E$.
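Strategy two can be sketched as a per-class ranking. This is a minimal sketch under assumptions: the function name is hypothetical, the sample count per class is rounded with Python's built-in `round` (the patent does not state a rounding rule), and the joint distribution is a toy example.

```python
def select_incorrect(P, hard_labels, Q):
    """Return the sorted indices forming the incorrect label sample set E.

    For each hard label class j, take the samples with the lowest predicted
    probability for class j; their number follows the off-diagonal mass of
    row j of the joint distribution Q.
    """
    n, m = len(P), len(Q)
    E = []
    for j in range(m):
        off_diag = sum(Q[j][k] for k in range(m) if k != j)
        n_remove = round(n * off_diag)  # rounding rule is an assumption
        members = [i for i in range(n) if hard_labels[i] == j]
        members.sort(key=lambda i: P[i][j])  # lowest self-confidence first
        E.extend(members[:n_remove])
    return sorted(E)


P = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8], [0.1, 0.9]]
hard_labels = [0, 0, 1, 1]
Q = [[0.25, 0.25],   # hypothetical joint distribution
     [0.00, 0.50]]
E = select_incorrect(P, hard_labels, Q)
```

Row 0 of Q has off-diagonal mass 0.25, so one of the four samples is selected from class 0, and it is the one with the lowest probability for its own hard label.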
Removing the samples screened out by strategy two from those screened out by strategy one leaves the inconsistent label samples; that is, the inconsistent label sample set is $U = S'' - S'$, which equals $N - E$.
Incorrect label samples are corrected: the pseudo label replaces the original label, and the corrected samples can be used to train the classification model. Inconsistent label samples are removed directly and are not used for training. After the noise samples are cleaned, clean samples are obtained for training the tongue color classification model.
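The final cleaning step combines both strategies. The sketch below is an illustration with hypothetical names and toy labels: it builds the noise set N, derives the inconsistent set U as N minus E, relabels the samples in E with their pseudo labels, and drops the samples in U.

```python
def clean_dataset(hard_labels, pseudo_labels, E):
    """Return (cleaned, N, U).

    cleaned maps each kept sample index to its training label;
    N is the noise set (hard label != pseudo label);
    U = N - E is the inconsistent label set, which is removed.
    """
    E = set(E)
    N = {i for i, (y, y_star) in enumerate(zip(hard_labels, pseudo_labels))
         if y != y_star}
    U = N - E
    cleaned = {}
    for i, y in enumerate(hard_labels):
        if i in U:
            continue  # inconsistent label sample: removed
        cleaned[i] = pseudo_labels[i] if i in E else y  # corrected or kept
    return cleaned, N, U


hard_labels = [0, 0, 1, 1, 2]
pseudo_labels = [0, 1, 1, 0, 2]
E = [1]  # hypothetical output of strategy two
cleaned, N, U = clean_dataset(hard_labels, pseudo_labels, E)
```

Sample 1 is kept with its corrected label, sample 3 is discarded as inconsistent, and the remaining samples keep their original hard labels.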
The invention provides a method for automatically identifying and cleaning noisily labeled tongue color samples in traditional Chinese medicine. Processing the labeled sample data with it screens out noisily labeled samples automatically and improves labeling consistency, which in turn improves the precision of the classification model. Compared with traditional manual data cleaning, the method saves manpower and material resources, improves the accuracy of identification and cleaning, raises sample utilization, and is flexible.

Claims (8)

1. A method for automatically identifying and cleaning a traditional Chinese medicine tongue color noise labeling sample, characterized by comprising the following steps:
step 1: estimating a joint probability distribution of the sample labels;
step 1.1, training a tongue color classification model and determining sample pseudo labels by cross validation;
constructing a classification network model with ResNet18 as the backbone network, to which a channel attention mechanism and an ACON activation function are applied; dividing the labeled tongue color sample data into a training set and a test set, wherein the training set is used for training the classification network model, and the trained model is used for determining the pseudo label of each sample of the test set; adopting an ensemble learning strategy to integrate a plurality of classification network models, so that the robustness and stability of prediction are improved; performing cross validation with the ensemble model until every sample has been predicted exactly once; obtaining the prediction probabilities of each sample label through round-by-round estimation to form a probability matrix;
step 1.2, estimating the joint probability distribution of the sample labels;
firstly, screening noise samples from all samples; according to the prediction probabilities of the sample label, taking the category with the maximum probability as the pseudo label of the sample; judging whether the pseudo label is consistent with the hard label, and judging a sample for which they are inconsistent to be a noise sample;
the noise samples comprise both incorrect label samples and inconsistent label samples; an inconsistent label sample is a sample that contains information of different categories, so that the category boundary is blurred; the incorrect label samples and inconsistent label samples are distinguished as follows: analysis of the probability distributions of the sample labels shows that the soft-label maximum of an inconsistent label sample is generally low and differs little from the second-largest value, whereas the soft-label maximum of clean samples and incorrect label samples is generally high; a confidence threshold is set, and if the maximum value in a sample's soft label is greater than the preset confidence threshold and the pseudo label of the sample is inconsistent with the hard label, the sample is judged to be an incorrect label sample, whereby incorrect label samples are separated from the other noise samples;
finally, constructing a count matrix and deriving from it the joint probability distribution of the pseudo labels and hard labels of the samples; the joint probability distribution reflects the relationship between the pseudo labels and the manually annotated hard labels, describes how the samples other than the inconsistent label samples are distributed, and provides a basis for the subsequent sample cleaning;
step 2: screening and correction of noise samples;
two noise sample screening strategies are provided using the joint probability distribution of the sample labels, wherein the first strategy is used for identifying the noise samples, and the second strategy is used for identifying the incorrect label samples;
the label of an incorrect label sample is corrected to the pseudo label and the sample is used for training the classification model; inconsistent label samples are removed directly and are not used for training the classification model; after the noise samples are cleaned, clean samples are obtained and used for training the tongue color classification model.
2. The automatic identification and cleaning method of the tongue color noise labeling sample of traditional Chinese medicine according to claim 1, characterized in that: in step 1, the expert annotation, which contains noise, is the hard label $y$, and the label predicted by the tongue color classification model is the pseudo label $y^*$; the total number of samples is $n$, and the set of class labels is $\{1, 2, \dots, m\}$, denoted $[m]$; $X_{y=j,\,y^*=k}$ denotes the set of samples with hard label $j$ and pseudo label $k$, where $j, k \in [m]$;
the original data set is $S$, where $x_i$ denotes the $i$-th sample in the data set, $y_i \in [m]$ its hard label, and $y_i^* \in [m]$ its pseudo label.
3. The automatic identification and cleaning method of the tongue color noise labeling sample of traditional Chinese medicine according to claim 2, characterized in that: noisy labels are identified over the whole data set, and cross validation is used to compute the probability P[i][j] that the i-th sample belongs to the j-th category; round-by-round estimation yields the predicted label probability matrix for all samples;
ResNet18 is used as the backbone network, and a channel attention mechanism and the ACON activation function are added to it to construct the ResNet18 + CA + ACON network structure; the network is trained with model ensembling so that the predicted labels are more accurate and stable.
4. The automatic identification and cleaning method of the tongue color noise labeling samples in traditional Chinese medicine according to claim 3, characterized in that: the attention mechanism focuses on local information, and the attended region changes with the task; a channel attention mechanism is added to the last layer of the ResNet18 model, which, by modeling the importance of each feature channel, learns the importance of the different channels and strengthens or suppresses them for the tongue color classification task; a small amount of computation is added, but the expressive power of the features is effectively improved, so that better classification performance is obtained.
5. The automatic identification and cleaning method of the tongue color noise labeling samples in traditional Chinese medicine according to claim 3, characterized in that: the ACON activation function adaptively chooses whether to activate a neuron, and replacing an activation layer of the original network with it improves classification accuracy to some extent; this activation behavior helps improve the generalization ability and performance of the network; because the color features of the tongue image are captured in the shallow layers of the network, the ReLU activation function in the first layer of the network is replaced with the ACON activation function.
6. The automatic identification and cleaning method for noisy-label samples of traditional Chinese medicine tongue color according to claim 3, characterized in that: 10 models are ensembled, and the prediction probabilities of the models are averaged to serve as the prediction probability of the label;
the average probability t_j under each manually labeled category j is set as the confidence threshold, expressed by formula (1):
Figure FDA0003343831120000031
wherein
Figure FDA0003343831120000032
represents the predicted probability of class j for a sample x whose hard label is j, given the model parameters θ, and |X_{y=j}| represents the number of samples of category j;
the samples whose maximum soft-label value is greater than the confidence threshold are screened out, expressed by formula (2):
Figure FDA0003343831120000033
wherein l ∈ [m] denotes any class label, and k denotes the pseudo label predicted by the classification model;
the samples screened out above are partitioned and counted according to the relationship between the hard label and the pseudo label of each sample, and a count matrix is constructed; the diagonal of the matrix counts the clean samples whose hard label and pseudo label agree; the off-diagonal part counts the noisy-label samples whose hard label and pseudo label disagree; the count matrix is expressed by formula (3):
Figure FDA0003343831120000034
the count matrix is then processed: its counts are first scaled proportionally so that they sum to the total number of original data samples, and are then normalized to obtain the estimated joint probability distribution of hard labels and pseudo labels
Figure FDA0003343831120000035
The joint probability distribution fully reflects the association between pseudo labels and hard labels and, importantly, presents the distribution of noisy-label samples among all samples, which provides a basis for the subsequent sample cleaning;
Figure FDA0003343831120000036
is calculated as follows:
Figure FDA0003343831120000037
wherein |X_{y=j}| represents the total number of samples manually labeled j;
to show the count matrix and the joint probability distribution matrix more intuitively, a specific example is given: suppose the tongue color classification problem has 4 classes, denoted 0, 1, 2 and 3, and 415 samples in total; after the prediction probabilities are calculated, 400 samples satisfy the condition that the maximum value of the sample's soft label is greater than the confidence threshold, comprising clean samples whose hard label agrees with the pseudo label and a small portion of incorrect-label samples whose hard label disagrees with the pseudo label; from these, the count matrix and the joint probability distribution matrix are obtained.
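The chain of steps in claim 6 (ensemble averaging, per-class thresholds, count matrix, joint distribution) can be sketched as below. The formula images are not reproduced in this text, so the code follows the prose only: the function names are hypothetical, the strict "greater than" comparison mirrors the claim wording, and the row-wise rescaling to |X_{y=j}| before normalization is an assumed reading of "scaled proportionally to the total number of original data samples".

```python
from typing import List

Probs = List[List[float]]


def average_models(per_model_probs: List[Probs]) -> Probs:
    """Average probabilities indexed [model][sample][class] over the model axis."""
    n_models = len(per_model_probs)
    return [
        [sum(vals) / n_models for vals in zip(*rows)]
        for rows in zip(*per_model_probs)
    ]


def class_thresholds(probs: Probs, hard: List[int], n_classes: int) -> List[float]:
    """t_j: mean predicted probability of class j over samples hard-labeled j.
    Assumes every class has at least one sample."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, hard):
        sums[y] += p[y]
        counts[y] += 1
    return [s / max(c, 1) for s, c in zip(sums, counts)]


def count_matrix(probs: Probs, hard: List[int], thresholds: List[float]) -> List[List[int]]:
    """C[j][k]: confident samples with hard label j and pseudo label k."""
    m = len(thresholds)
    C = [[0] * m for _ in range(m)]
    for p, y in zip(probs, hard):
        k = max(range(m), key=lambda l: p[l])  # pseudo label = most probable class
        if p[k] > thresholds[k]:  # keep only samples above the class threshold
            C[y][k] += 1
    return C


def joint_distribution(C: List[List[int]], hard: List[int]) -> List[List[float]]:
    """Rescale each row of C to the class size |X_{y=j}|, then normalize to sum to 1."""
    m = len(C)
    sizes = [hard.count(j) for j in range(m)]
    scaled = [
        [C[j][k] / sum(C[j]) * sizes[j] if sum(C[j]) else 0.0 for k in range(m)]
        for j in range(m)
    ]
    total = sum(map(sum, scaled))
    return [[v / total for v in row] for row in scaled]
```

Diagonal entries of the resulting matrix estimate the share of clean samples per class, and off-diagonal entries estimate the share of samples whose manual label j is contradicted by pseudo label k.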
7. The automatic identification and cleaning method for noisy-label samples of traditional Chinese medicine tongue color according to claim 1, characterized in that: the noise distribution of the data set is obtained from the joint probability distribution between the pseudo labels and the manually annotated hard labels, and noise sample screening strategies are provided to identify the noisy samples; the noisy samples comprise incorrect-label samples and inconsistent-label samples, and 2 strategies are provided to screen out the incorrect-label samples and the inconsistent-label samples respectively;
strategy one: the samples whose hard label and pseudo label are inconsistent are screened out to form a noise sample set
Figure FDA0003343831120000041
Figure FDA0003343831120000042
so that the incorrect-label samples and the inconsistent-label samples in the labeled data are screened out at the same time; the data set after strategy-one cleaning is denoted S', S' = S - N;
strategy two: the incorrect-label samples are screened out; from the calculation and estimation in step 1, the sample distribution of the off-diagonal cells of the joint probability distribution is known to be the distribution of incorrect-label samples that need to be screened out, and the same proportion of samples is screened out of the original data set; an incorrect-label sample screening strategy is selected for this processing.
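Strategy one of claim 7 amounts to a simple disagreement filter. The sketch below is illustrative only: the function name `strategy_one` is hypothetical, and it assumes the hard labels and pseudo labels are given as equal-length lists of class indices.

```python
from typing import List, Set, Tuple


def strategy_one(hard: List[int], pseudo: List[int]) -> Tuple[Set[int], List[int]]:
    """Return (N, S_prime): the indices whose hard and pseudo labels disagree,
    and the remaining indices, i.e. S' = S - N."""
    noisy = {i for i, (y, k) in enumerate(zip(hard, pseudo)) if y != k}
    kept = [i for i in range(len(hard)) if i not in noisy]
    return noisy, kept
```

Because this filter removes every disagreement, N contains both the incorrect-label samples and the inconsistent-label samples; strategy two is what separates the two groups.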
8. The automatic identification and cleaning method for noisy-label samples of traditional Chinese medicine tongue color according to claim 7, characterized in that: in the incorrect-label sample screening strategy, under each manually labeled category, the samples are sorted from low to high by their probability under that category, and a number of samples equal to the sample count of the off-diagonal cells of the joint probability distribution under that category, namely
Figure FDA0003343831120000043
is screened out; this strategy cleans out the samples of the off-diagonal cells of the joint probability distribution, i.e. the incorrect-label samples, which form the incorrect-label sample set E; the data set after strategy-two cleaning is denoted S'', S'' = S - E;
the samples screened out by strategy two are removed from the samples screened out by strategy one, and the remaining samples are the inconsistent-label samples, i.e. the inconsistent-label sample set is U = S'' - S';
the incorrect-label samples are corrected, i.e. the pseudo label replaces the original label, and the corrected samples are used to train the classification model; the inconsistent-label samples are removed directly and are not used to train the classification model; after the noisy samples are cleaned, clean samples are obtained for training the tongue color classification model.
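Strategy two and the final cleaning step of claim 8 can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the per-class removal count is taken as the off-diagonal mass of the joint distribution times the data set size, and rounding that count to an integer is an assumption not stated in the claim.

```python
from typing import Dict, List, Set, Tuple


def strategy_two(probs: List[List[float]], hard: List[int], joint: List[List[float]]) -> Set[int]:
    """Return E: for each hard class j, the lowest-confidence samples, as many
    as the off-diagonal cells of row j of the joint distribution predict."""
    n, m = len(hard), len(joint)
    E: Set[int] = set()
    for j in range(m):
        # Assumption: expected count = off-diagonal probability mass * n, rounded.
        n_off = round(n * sum(joint[j][k] for k in range(m) if k != j))
        members = sorted(
            (i for i in range(n) if hard[i] == j), key=lambda i: probs[i][j]
        )
        E.update(members[:n_off])
    return E


def clean_labels(
    hard: List[int], pseudo: List[int], N: Set[int], E: Set[int]
) -> Tuple[Dict[int, int], Set[int]]:
    """Relabel incorrect-label samples (E) with their pseudo label and drop
    inconsistent-label samples U = N - E. Returns (index -> training label, U)."""
    U = N - E
    labels = {
        i: (pseudo[i] if i in E else y)
        for i, y in enumerate(hard)
        if i not in U
    }
    return labels, U
```

Here `N - E` is the same set as S'' - S' in the claim, since S' = S - N and S'' = S - E; the returned dictionary holds the cleaned training set with corrected labels.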
CN202111316442.3A 2021-11-08 2021-11-08 Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample Pending CN114037011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111316442.3A CN114037011A (en) 2021-11-08 2021-11-08 Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample

Publications (1)

Publication Number Publication Date
CN114037011A true CN114037011A (en) 2022-02-11

Family

ID=80136853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111316442.3A Pending CN114037011A (en) 2021-11-08 2021-11-08 Automatic identification and cleaning method for traditional Chinese medicine tongue color noise labeling sample

Country Status (1)

Country Link
CN (1) CN114037011A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination