CN116824237A - Image recognition and classification method based on two-stage active learning - Google Patents


Info

Publication number
CN116824237A
CN116824237A (application CN202310715906.0A)
Authority
CN
China
Prior art keywords: sample, stage, samples, sampling, view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310715906.0A
Other languages
Chinese (zh)
Inventor
杨育彬
范译
江彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310715906.0A priority Critical patent/CN116824237A/en
Publication of CN116824237A publication Critical patent/CN116824237A/en
Pending legal-status Critical Current


Abstract

The invention discloses an image recognition and classification method based on two-stage active learning, which comprises the following steps: performing first-stage active sampling with a conventional active learning method; clustering the samples in each view; calculating the consistency between pairs of views; performing second-stage active sampling using the expression degree and stability degree of each sample; training a neural network; and iteratively optimizing the model. The invention is a novel active learning method that is independent of any specific learning task and decoupled from existing active learning methods, so it can be applied in a wide range of active learning scenarios and combined with existing active learning methods. In addition, because both the expressiveness and the stability of each sample are considered, the samples selected for labeling have higher credibility. The invention realizes an accurate and highly extensible two-stage active learning method that computes sample expression degree and stability based on multi-view clustering, and therefore has high practical value.

Description

Image recognition and classification method based on two-stage active learning
Technical Field
The invention relates to an image recognition method, in particular to an image recognition and classification method based on two-stage active learning.
Background
In the field of computer vision, high-quality labeled data is indispensable for complex tasks and the neural networks designed to handle them. However, in many application scenarios, high-quality labeled data is difficult to obtain in large quantities. Active Learning (AL) aims to achieve the same effect as fully supervised training while controlling the cost of data labeling, by using as few labeled samples as possible while still capturing a sufficiently large amount of information. In the traditional pool-based active learning scenario, a large number of unlabeled samples form a pool of candidate samples (the unlabeled pool), while the training set is limited. The model repeatedly selects key samples from the unlabeled pool through a specific sampling strategy and requests manual labeling, thereby expanding the training set and iteratively optimizing the current model.
The main idea of current AL methods is to design different active sampling strategies within the above framework. For example, in classification tasks, the classical Least Confidence (LC), Margin, and Entropy algorithms all use the prediction uncertainty of the current model as the basis for sampling. In object detection tasks, some methods directly borrow ideas from classification and sample only on the classification branch, while others focus on the regression branch and use the stability of bounding-box predictions as the basis for sampling.
However, the AL sampling strategies in the above approaches depend on the specific task. Although appropriate modifications may be made to accommodate other tasks, these methods tend to be ineffective on new tasks. In recent years, researchers have begun to explore task-agnostic AL methods, which aim to provide a general-purpose sampling strategy. For example, Yoo et al. proposed a task-agnostic loss prediction module that directly predicts sample loss to guide sampling. Sener et al. proposed Coreset, a method that performs active sampling by measuring the data distribution. Unfortunately, the sampling criteria of these methods are still one-sided: Yoo et al. consider only the feedback of the model, ignoring the features of the data, while Sener et al. consider only the macro-level feature distribution of the data.
Disclosure of Invention
The invention aims to: the invention aims to solve the technical problem of providing an image recognition and classification method based on two-stage active learning aiming at the defects of the prior art.
In order to solve the technical problems, the invention discloses an image recognition and classification method based on two-stage active learning, which comprises the following steps:
step 1, determining the data quantity to be marked in a data set for training an image recognition classification model;
Step 2, performing first-stage active sampling by using an active learning method to obtain a first-stage sample;
step 3, clustering the samples in the first stage by adopting a multi-view clustering method;
step 4, calculating the consistency between any two views;
step 5, calculating a sampling score as a sampling strategy, performing second-stage active sampling, and manually marking a second-stage sample obtained through the second-stage active sampling;
step 6, training the image recognition classification model;
step 7, repeating steps 1 to 6 to iteratively optimize the image recognition classification model; when the number of manually labeled samples reaches the data volume determined in step 1, steps 1 to 5 are skipped and the image recognition classification model is trained in step 6 using only the task loss as the loss function;
and 8, performing image recognition and classification by applying the optimized image recognition and classification model.
The beneficial effects are that:
the invention adopts an active learning method which is independent of specific learning tasks and decoupled from the existing active learning method to identify and classify images. The method is mainly applied to classification tasks regardless of specific learning tasks, and can be used for any deep learning task as long as a large number of unlabeled samples exist, and the neural network is used for extracting the characteristics of the unlabeled samples in the learning architecture. The decoupling with the existing active learning method mainly realizes that the core of the invention is a highly packaged sample sampling module, and the whole framework of the model is not modified in a large scale, so that the model can be randomly combined with other sample sampling strategies to make up for the strong effect. In conclusion, the method has strong adaptability and can be applied to various active learning scenes. In addition, the expressive property and the stability of the sample are considered, and the sampled sample for labeling has higher credibility. Therefore, the two-stage active learning method for calculating the sample expression degree and the stability based on the multi-view clustering provided by the invention is used for carrying out image recognition and classification, and a better effect is obtained.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a schematic diagram of clustering samples of various views.
FIG. 3 is a flow chart of the consistency calculation between two views.
FIG. 4 is a flow chart of sample expression level calculation.
FIG. 5 is a flow chart of sample stability calculation.
Fig. 6 is a flow chart of model iterative optimization.
FIG. 7 is a flow chart of a task for classifying race using the present invention.
Detailed Description
As shown in fig. 1, the invention discloses an image recognition and classification method based on two-stage active learning, which comprises the following steps:
step 1, determining the data quantity to be marked;
step 2, performing first-stage active sampling by using a conventional active learning method;
step 3, clustering samples in each view;
step 4, calculating the consistency between the two views;
step 5, performing active sampling at a second stage by using the expression degree and the stability degree;
step 6, training the neural network;
and 7, performing iterative optimization on the model.
The step 1 specifically comprises the following steps:
The total amount of data to be labeled is determined according to the labeling cost of the dataset. For example, suppose a race classification model is to be trained: the input of the model is a face image, and the output is the race to which the face belongs. Assume that 1,000,000 face images from around the world have been crawled from the Internet, but the corresponding races are unknown. To complete training, the race of each face needs to be manually labeled.
However, manually annotating race data is costly. For example, general picture classification labeling in a cloud marketplace is priced at about 48 yuan per 10,000 images, but for images in specialized fields such as faces the cost is higher; a face attribute labeling task covering attributes such as gender, expression, and head pose is priced at about 6,000 yuan per 10,000 images. Assuming that race annotation in this application scenario is priced at 500 yuan per 10,000 images and the total budget for data annotation is 5,000 yuan, 100,000 images can be annotated in total. Clearly, training with only 100,000 images is less effective than labeling all 1,000,000 images and using them all for training. The problem to be solved by active learning is to screen out the 100,000 most valuable images from the 1,000,000 unlabeled images and train the model with them, so that the training effect is as good as possible.
When selecting samples with an active learning method, certain output values of the classification model are needed in addition to the features of the samples, and the selected samples are more accurate once the classification model has been trained to some extent. Therefore, the invention divides the sampling process into multiple rounds, so that in every round except the first, the classification model has already been trained to some degree when sampling is performed. For example, the sampling process may be divided into 10 rounds, each of which samples 10,000 images.
The step 2 specifically comprises the following steps:
The temporarily unlabeled dataset is referred to as the unlabeled pool. A round of sampling is divided into two stages, where the second stage samples from the result of the first stage. For example, for the race classification model described above, the first stage may sample 20,000 images from the unlabeled pool, and the second stage then selects 10,000 images from those 20,000. In this step, a conventional active learning method such as the Least Confidence (LC) algorithm, the Learning Loss (LL) algorithm, or the Coreset algorithm is applied to the data in the unlabeled pool as the first-stage active sampling strategy, yielding a first-stage sampling result that is a subset of the unlabeled pool.
The LC algorithm selects samples in the unlabeled pool whose model-output probability distributions have low confidence. Specifically, let the input sample (a face image in the race classification task above) be x, the parameters of the model (here the race classification model) be θ, and the output of the model be p(x|θ). Taking the classification task as an example, p(x|θ) can be written as [p_1, p_2, …, p_C], where C is the number of categories and p_c (1 ≤ c ≤ C) is the probability that x belongs to class c. A score S_LC of sample x is computed from [p_1, p_2, …, p_C]; the higher the score, the higher the priority at sampling time. Two ways of computing S_LC are given here. The first examines the probability of the predicted class: the lower this probability, the less certain the model is about its prediction and the more the sample should be selected, i.e. S_LC = 1 − max_{1≤c≤C} p_c. The second examines the entropy of [p_1, p_2, …, p_C]: the larger the entropy, the greater the uncertainty of the model output and the more the sample should be selected, i.e. S_LC = −Σ_{c=1}^{C} p_c log p_c. In the sampling process, the samples with the highest scores are selected as the sampling result.
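The two S_LC variants above can be sketched as follows (a minimal numpy illustration with made-up model outputs; the function names are hypothetical and not part of the invention):

```python
import numpy as np

def lc_score(probs: np.ndarray) -> np.ndarray:
    """Least-confidence score: 1 minus the probability of the top class.
    probs has shape (num_samples, num_classes); higher score = sample first."""
    return 1.0 - probs.max(axis=1)

def entropy_score(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy of the predicted distribution; higher entropy = more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_top(scores: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` highest-scoring samples."""
    return np.argsort(scores)[::-1][:budget]

# Toy model outputs for 4 samples over 3 classes.
p = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.35, 0.25],
              [0.34, 0.33, 0.33],
              [0.60, 0.30, 0.10]])
picked = select_top(lc_score(p), budget=2)  # the two least-confident samples
```

In the race classification task above, `budget` would be 20,000 and `probs` the softmax outputs of the current classifier over the unlabeled pool.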
The LL algorithm uses a loss predictor to predict the loss-function value of each sample in the unlabeled pool; the larger the value, the greater the expected decrease in the loss function when training with that sample, and thus the more it should be sampled. Specifically, the loss predictor L_loss extracts the features of sample x at several layers of the neural network, passes each feature through an independent global average pooling layer, a fully connected layer, and a ReLU activation layer, concatenates all the resulting features, and finally maps them to a scalar through another fully connected layer; this scalar is the predicted loss value l̂(x) of the sample. In the sampling process, the samples with the largest predicted loss values are selected as the sampling result.
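A forward-pass sketch of such a loss predictor follows (numpy only, with made-up layer sizes and random weights for illustration; in a real system these weights are trained jointly with the network):

```python
import numpy as np

rng = np.random.default_rng(0)

def gap(feat: np.ndarray) -> np.ndarray:
    """Global average pooling over spatial dims: (C, H, W) -> (C,)."""
    return feat.mean(axis=(1, 2))

def loss_predictor(feats, Ws, bs, W_out, b_out) -> float:
    """feats: list of per-layer feature maps for one sample.
    Each goes through GAP -> FC -> ReLU; the results are concatenated
    and mapped to a scalar predicted loss by a final FC layer."""
    hidden = []
    for f, W, b in zip(feats, Ws, bs):
        hidden.append(np.maximum(W @ gap(f) + b, 0.0))  # FC + ReLU
    h = np.concatenate(hidden)
    return float(W_out @ h + b_out)

# Two layers with channel counts 8 and 16, hidden size 4 each (made-up sizes).
feats = [rng.standard_normal((8, 5, 5)), rng.standard_normal((16, 3, 3))]
Ws = [rng.standard_normal((4, 8)), rng.standard_normal((4, 16))]
bs = [np.zeros(4), np.zeros(4)]
W_out, b_out = rng.standard_normal(8), 0.0
pred = loss_predictor(feats, Ws, bs, W_out, b_out)  # predicted loss scalar
```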
The Coreset algorithm selects from the unlabeled pool a set of samples whose loss function is closest to the loss function over all samples, and converts this, through theoretical analysis, into the following problem: taking the sampled samples as center points, consider the distance from every other sample to its nearest center point, with the goal of minimizing the maximum of these distances. Specifically, let the samples in the set X of all samples be x_i (1 ≤ i ≤ N, where N is the total number of samples), let the set of currently labeled samples be s_0 (the algorithm also applies when some samples are already labeled in the initial state; if no samples are labeled, s_0 is the empty set), and let the number of samples to be sampled in this step be b (for the race classification task above, b = 20,000). The following steps are performed:
S1: Let s denote the set of all labeled samples in the current state; initially s = s_0.
S2: Let x = argmax_{x_i ∈ X∖s} min_{x_j ∈ s} d(x_i, x_j), where d(x_i, x_j) denotes the distance between samples x_i and x_j.
S3: Add x to s, i.e. s = s ∪ {x}.
S4: Repeat S2 and S3 until ||s|| = ||s_0|| + b, where ||·|| denotes the number of elements in a set.
S5: Take s − s_0 as the sampling result of this step.
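Steps S1–S5 are the classical greedy k-center (farthest-point) selection; a compact sketch with toy one-dimensional features (names hypothetical, at least one labeled seed assumed):

```python
import numpy as np

def greedy_k_center(X: np.ndarray, labeled_idx: list, budget: int) -> list:
    """Greedy farthest-point version of the Coreset selection: repeatedly
    add the sample farthest from its nearest already-selected center.
    X: (N, d) features; labeled_idx must be non-empty (the set s_0)."""
    selected = list(labeled_idx)
    # Distance of every sample to its nearest current center (S1).
    dist = np.min(
        np.linalg.norm(X[:, None, :] - X[selected][None, :, :], axis=2), axis=1
    )
    for _ in range(budget):
        x = int(np.argmax(dist))  # S2: farthest sample from all centers
        selected.append(x)        # S3: add it as a new center
        dist = np.minimum(dist, np.linalg.norm(X - X[x], axis=1))
    return selected[len(labeled_idx):]  # S5: newly sampled indices only

# 5 points on a line; start from point 0, sample 2 more.
X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
new = greedy_k_center(X, labeled_idx=[0], budget=2)  # picks the far point, then a gap-filler
```

On this toy data the first pick is index 4 (farthest from the seed at 0), and the second is index 2, which best covers the remaining gap.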
To make the Coreset algorithm more robust, it can be further optimized. A hyperparameter Ξ is specified, representing the number of outliers allowed by the algorithm, where an "outlier" is a sample that is far from all center points and is no longer considered when computing the distance maximum. Although introducing outliers relaxes the definition of the problem, it can further improve the result of the algorithm. The algorithm performs the following steps:
S1: obtaining a preliminary sampling result by using the basic algorithm, and recording the preliminary sampling result as s g Calculation ofLet lb=δ/2, ub=δ;
s2: with intermediate variables Judging whether the following conditions are satisfied: (1) Sigma (sigma) j u j =|s 0 |+b;(2)∑ i,j ξ i,j ≤Ξ;(3)/>(4)/> (5)(6)/>If d (x) i ,x j )>Delta, w i,j =ξ i,j . If the above 6 conditions are satisfied at the same time, then executeOtherwise execute->
S3: repeating S2 until ub=lb;
s4: executing delta≡ub;
s5: substituting delta into 6 conditions in S2, and solving u according to the 6 conditions i 、ξ i,j and wi,j (S2 and S3 operations have guaranteed that the solution here is unique. In the actual calculation process, ub=lb of S3 is difficult to satisfy, and in general, when the value of ub-lb is a small positive number epsilon, iteration S2 can be stopped, which has the consequence that there may be a plurality of solutions here, but as long as epsilon is sufficiently small, the difference between these solutions is small and they can be considered approximately the same solution);
s6: will set { x } i |u i =1 } as the final result of this step sampling.
The step 3 specifically comprises the following steps:
From this step, the second-stage sampling, i.e. the multi-view clustering algorithm, formally begins. First, as shown in fig. 2, the features of each sample at several layers of the model (the race classification model, for the task above) are extracted; the features of all samples at the same layer are collectively referred to as one view. The number of layers chosen for feature extraction is U, so there are U views.
The distribution of the samples in each view is then modeled with a Gaussian Mixture Model (GMM), and a clustering result is formed on this basis. The GMM describes the distribution of the samples in one view with the probability density function p(x|θ) = Σ_{k=1}^{K} α_k φ(x|μ_k, σ_k), where x is the input sample, K is the number of Gaussian components in the GMM, φ(x|μ_k, σ_k) is the k-th Gaussian component with mean μ_k and variance σ_k, and α_k is its weight, which can be understood as the probability that the current sample belongs to the k-th Gaussian component.
The specific operation of the GMM is as follows. Let the set of all samples be {x^(1), x^(2), …, x^(N)}, where x^(i) (1 ≤ i ≤ N) is the i-th first-stage sample, and introduce hidden variables γ_{ik} (1 ≤ i ≤ N, 1 ≤ k ≤ K), where γ_{ik} = 1 if sample x^(i) comes from the k-th Gaussian component and γ_{ik} = 0 otherwise. The following steps are then performed:
S1: Randomly initialize θ (including α_k, μ_k, and σ_k, 1 ≤ k ≤ K).
S2 (E-step): Compute the estimate of γ_{ik}: γ̂_{ik} = α_k φ(x^(i)|μ_k, σ_k) / Σ_{k'=1}^{K} α_{k'} φ(x^(i)|μ_{k'}, σ_{k'}).
S3 (M-step): Compute the estimates of the model parameters: μ̂_k = Σ_{i=1}^{N} γ̂_{ik} x^(i) / Σ_{i=1}^{N} γ̂_{ik}, σ̂_k = Σ_{i=1}^{N} γ̂_{ik} (x^(i) − μ̂_k)² / Σ_{i=1}^{N} γ̂_{ik}, and α̂_k = Σ_{i=1}^{N} γ̂_{ik} / N.
S4: Repeat S2 and S3 until the model converges, and finally use μ̂_k, σ̂_k, and α̂_k as the values of μ_k, σ_k, and α_k.
After the above operations, U distributions p_1(x|θ_1), p_2(x|θ_2), …, p_U(x|θ_U) are obtained, one per view. From these distributions, each sample can be assigned a category on each view, namely the component with the largest responsibility γ̂_{ik}, as the clustering result.
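The EM procedure S1–S4 can be sketched for one-dimensional features as follows (deterministic quantile initialization is used instead of random initialization so the example is reproducible; this is an illustrative sketch, not the patented implementation):

```python
import numpy as np

def gmm_em_1d(x: np.ndarray, K: int = 2, iters: int = 50):
    """EM for a 1-D Gaussian mixture: E-step computes responsibilities
    gamma_ik, M-step re-estimates alpha, mu, sigma; returns hard labels."""
    N = len(x)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)  # S1: deterministic init
    sigma = np.full(K, x.std() + 1e-6)
    alpha = np.full(K, 1.0 / K)
    for _ in range(iters):
        # S2 (E-step): responsibility of component k for sample i
        pdf = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
              / (np.sqrt(2 * np.pi) * sigma)
        gamma = alpha * pdf
        gamma /= gamma.sum(axis=1, keepdims=True)
        # S3 (M-step): re-estimate weights, means, variances
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-6
        alpha = Nk / N
    return alpha, mu, sigma, gamma.argmax(axis=1)  # hard cluster labels

# Two well-separated 1-D clusters around 0 and 10.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.5, 100), rng.normal(10.0, 0.5, 100)])
alpha, mu, sigma, labels = gmm_em_1d(x, K=2)
```

On real view features x would be a vector, with full-covariance components; the structure of the E- and M-steps is unchanged.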
The step 4 specifically comprises the following steps:
For a fully trained neural network, the clustering results on the various views should be as similar as possible. Therefore, this step uses the Rand statistic to compute the degree of agreement R(V_m, V_n) between two views V_m and V_n (1 ≤ m ≤ U, 1 ≤ n ≤ U), as shown in fig. 3.
Specifically, denote the clustering result of sample x_i on view V_m as c_i^m, and denote the total number of samples as s, so that all samples together form s(s−1)/2 sample pairs of the form (x_i, x_j) with i ≠ j. Examining the clustering results of the elements x_i and x_j of a sample pair on views V_m and V_n, four cases arise: (1) c_i^m = c_j^m and c_i^n = c_j^n; (2) c_i^m ≠ c_j^m and c_i^n ≠ c_j^n; (3) c_i^m = c_j^m and c_i^n ≠ c_j^n; (4) c_i^m ≠ c_j^m and c_i^n = c_j^n. Each of the s(s−1)/2 sample pairs belongs to exactly one of these cases. The set of all sample pairs satisfying (1) or (2) is denoted s_p (views V_m and V_n agree on whether the two samples of the pair belong to the same class), and the set of all sample pairs satisfying (3) or (4) is denoted s_n (the two views disagree).
After s_p and s_n are obtained, the degree of agreement between V_m and V_n can be computed as R(V_m, V_n) = ||s_p|| / (||s_p|| + ||s_n||), where ||·|| denotes the number of elements in a set.
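The Rand-statistic consistency R(V_m, V_n) can be sketched as follows (toy cluster labels; O(s²) enumeration over all sample pairs, mirroring the four-case analysis above):

```python
from itertools import combinations

def rand_consistency(labels_m, labels_n) -> float:
    """Rand statistic between two clusterings of the same s samples:
    the fraction of the s(s-1)/2 pairs on which the two views agree
    about 'same cluster' vs 'different cluster'."""
    agree = total = 0
    for i, j in combinations(range(len(labels_m)), 2):
        same_m = labels_m[i] == labels_m[j]
        same_n = labels_n[i] == labels_n[j]
        agree += same_m == same_n  # cases (1) and (2)
        total += 1                 # every pair falls in exactly one case
    return agree / total

# Views agree on all pairs except (2, 3): 5 of 6 pairs agree.
r = rand_consistency([0, 0, 1, 1], [0, 0, 1, 2])
```

Note that the statistic only compares the pair structure, so the cluster label values themselves need not match between views.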
The step 5 specifically comprises the following steps:
The sampling score is calculated from the probability density functions p(x|θ) obtained by the GMM algorithm and the pairwise view consistencies R, and is used as the second-stage active sampling strategy. The score takes both the expression degree and the stability degree of each sample into account.
For the expression degree of a sample, as shown in fig. 4, the probability density at the sample's location in a typical view is taken as the evaluation criterion. A "typical view" is a view whose clustering result is similar to those of most other views. Specifically, the consistency of a view V_m is computed as R̄(V_m) = (1/(U−1)) Σ_{n≠m} R(V_m, V_n); the view V_o with the highest consistency, o = argmax_{1≤m≤U} R̄(V_m), is then taken as the typical view. On this basis, the expression degree Rep(x_i) of sample x_i is defined as the probability density function value obtained by the GMM algorithm on V_o, i.e. Rep(x_i) = p_o(x_i|θ_o). The expression degree obtained in this way reflects the distance from the sample to the cluster center: the higher the expression degree, the closer the sample is to the cluster center, the more frequently samples with similar features appear, and the more samples it can represent.
For the stability degree of a sample, as shown in fig. 5, the degree to which the same samples appear in the sample's category across the views is taken as the evaluation criterion. Specifically, denote the set of samples in the same class as sample x_i on view V_m as Φ_i^m. First compute the stability of x_i between views V_m and V_n as Stab_{m,n}(x_i) = ||Φ_i^m ∩ Φ_i^n|| / ||Φ_i^m ∪ Φ_i^n||, where ||·|| denotes the number of elements in a set. The stability Stab(x_i) of x_i is then the average over all view pairs: Stab(x_i) = (2/(U(U−1))) Σ_{1≤m<n≤U} Stab_{m,n}(x_i).
Finally, the sampling score S(x_i) of sample x_i is the weighted sum of its expression degree and stability degree, i.e. S(x_i) = Rep(x_i) + λ·Stab(x_i), where λ > 0 is a hyperparameter that balances the model's emphasis on expression degree versus stability degree and needs to be tuned by repeated experiments in the actual application scenario. After the sampling scores of all samples are obtained, the samples with the highest scores among those obtained by the first-stage active sampling are selected as the result of the second-stage active sampling and handed over for manual labeling. The proportion of sampled samples is determined by the scale of manual labeling; for example, in the race classification task, the 10,000 face images with the highest scores should be selected.
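A plausible reading of the per-pair stability is an intersection-over-union of a sample's cluster mates across view pairs; under that assumption, the final score S(x_i) = Rep(x_i) + λ·Stab(x_i) can be sketched as (toy two-view labels, hypothetical names):

```python
import numpy as np

def stability(i: int, view_labels: list) -> float:
    """Average Jaccard overlap of sample i's cluster mates across all
    view pairs (assumed intersection-over-union form of Stab_{m,n})."""
    U = len(view_labels)
    mates = [set(np.flatnonzero(l == l[i])) for l in view_labels]
    vals = []
    for m in range(U):
        for n in range(m + 1, U):
            vals.append(len(mates[m] & mates[n]) / len(mates[m] | mates[n]))
    return sum(vals) / len(vals)

def sampling_score(rep: float, stab: float, lam: float = 1.0) -> float:
    """S(x_i) = Rep(x_i) + lambda * Stab(x_i)."""
    return rep + lam * stab

# Two views over 4 samples; they agree on sample 0's mates, disagree on 2's.
v1 = np.array([0, 0, 1, 1])
v2 = np.array([0, 0, 2, 3])
s0 = stability(0, [v1, v2])  # mates {0,1} in both views
s2 = stability(2, [v1, v2])  # mates {2,3} vs {2}
```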
The step 6 specifically comprises the following steps:
This step trains the model using the labels of the samples obtained so far. During training, the parameters of the neural network are optimized so that, on the one hand, the model better completes the specified task and, on the other hand, extracts better features. To this end, the loss function consists of two parts: the task loss L_TL and the multi-view clustering loss L_MVCL. L_TL depends on the specific task; for example, cross-entropy loss may be used in classification tasks, and only labeled samples are used in its calculation. L_MVCL is determined by the consistency between views: the lower the pairwise consistencies R(V_m, V_n), the larger the loss, and all samples are used in its calculation.
Finally, the total loss function of the training process is L = L_TL + μ·L_MVCL, where μ > 0 is a hyperparameter. When the cost of manual labeling is limited and only a small number of samples can be labeled, the value of μ can be increased appropriately to focus supervision on sample consistency, so that every sampled sample is selected as carefully as possible; conversely, μ can be decreased appropriately.
Furthermore, if the method selected for first-stage active sampling requires iterative optimization during training, the corresponding term must be added to the loss function here. For example, when the LL algorithm is selected for first-stage active sampling, the loss predictor L_loss must be optimized on the labeled data. Specifically, the batch size is set to an even number during training, and the samples in a batch are paired two by two as (x_i, x_j) in each iteration. After x_i and x_j are input to the neural network, the corresponding outputs are denoted p(x_i|θ) and p(x_j|θ), and their manually annotated true labels are denoted y(x_i) and y(x_j); from these the true losses l(x_i) and l(x_j) are computed. At the same time, the predicted losses l̂(x_i) and l̂(x_j) are computed from the features of x_i and x_j at several layers of the neural network. The loss function for the predicted loss of the sample pair (x_i, x_j) is then L_pair(x_i, x_j) = max(0, −sign(l(x_i) − l(x_j)) · (l̂(x_i) − l̂(x_j)) + ξ), where ξ is a hyperparameter (a margin). Note that in numerical calculation, l(x_i) = l(x_j) almost never occurs and is therefore ignored here. The total predicted loss is the average of L_pair over the set of all sample pairs in the current batch. During training, this total predicted loss is added to the total loss function L, with a hyperparameter coefficient balancing its importance against the other terms.
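The pairwise predictor loss described above matches the standard margin-ranking form; a sketch under that assumption (toy loss values, with ξ as the margin):

```python
def pairwise_loss(l_i: float, l_j: float,
                  lhat_i: float, lhat_j: float, xi: float = 1.0) -> float:
    """Margin ranking loss for the loss predictor: predicted losses should
    be ordered the same way as the true losses, with margin xi.
    Ties (l_i == l_j) are ignored, as noted in the text."""
    sign = 1.0 if l_i > l_j else -1.0
    return max(0.0, -sign * (lhat_i - lhat_j) + xi)

# True losses say sample i is harder; predictor agrees with margin > xi -> zero loss.
a = pairwise_loss(2.0, 0.5, lhat_i=3.0, lhat_j=1.0, xi=1.0)
# Predictor orders the pair the wrong way -> penalised.
b = pairwise_loss(2.0, 0.5, lhat_i=1.0, lhat_j=3.0, xi=1.0)
```

The loss depends only on the ordering and gap of the predicted values, not their scale, which is why the true losses enter only through the sign.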
The step 7 specifically comprises the following steps:
Steps 1 to 5 are repeated to iteratively optimize the model, as shown in fig. 6. In each iteration, all temporarily unlabeled samples constitute the unlabeled pool. Two-stage active sampling is performed on the samples in the unlabeled pool, the sampled samples are handed over for manual labeling, and after labeling is completed the model is trained with all samples. As the number of iterations increases, the number of labeled samples grows; once it reaches the predetermined scale, the iteration process skips active sampling entirely and only trains the model, with the task loss L_TL as the only loss function. For example, for the race classification task described above, after 10 iterations the number of labeled samples reaches 100,000, at which point only model training is performed.
After model training is completed, the model is tested. Once the test results meet expectations, the model can be put into use for the actual task.
Examples
In this embodiment, taking the race classification task as an example, face images are crawled from the Internet and a race classification model is trained with the active learning method provided by the invention; the overall flow is shown in fig. 7. It comprises the following parts:
and step 1, determining the data quantity to be marked.
The total amount of data to be labeled is determined according to the labeling cost of the dataset. In this part, a race classification model is trained: the input of the model is a face image, and the output is the race to which the face belongs, with 5 categories. Assume that 1,000,000 face images from around the world have been crawled from the Internet, but the corresponding races are unknown. To complete training, the race of each face needs to be manually labeled.
However, manually annotating race data is costly. For example, general picture classification labeling in a cloud marketplace is priced at about 48 yuan per 10,000 images, but for images in specialized fields such as faces the cost is higher; a face attribute labeling task covering attributes such as gender, expression, and head pose is priced at about 6,000 yuan per 10,000 images. Assuming that race annotation in this application scenario is priced at 500 yuan per 10,000 images and the total budget for data annotation is 5,000 yuan, 100,000 images can be annotated in total. Clearly, training with only 100,000 images is less effective than labeling all 1,000,000 images and using them all for training. The problem to be solved by active learning is to screen out the 100,000 most valuable images from the 1,000,000 unlabeled images and train the model with them, so that the training effect is as good as possible.
When selecting samples with an active learning method, certain output values of the classification model are needed in addition to the features of the samples, and the selected samples are more accurate once the classification model has been trained to some extent. Therefore, the invention divides the sampling process into multiple rounds, so that in every round except the first, the classification model has already been trained to some degree when sampling is performed. For example, the sampling process may be divided into 10 rounds, each of which samples 10,000 images.
To facilitate the model's processing of the images, the images are scaled here, with the resolution of all images adjusted to 224 × 224.
And 2, performing first-stage active sampling by using a conventional active learning method.
The temporarily unlabeled dataset is referred to as the unlabeled pool. A round of sampling is divided into two stages, where the second stage samples from the result of the first stage. For example, for the race classification model described above, the first stage may sample 20,000 images from the unlabeled pool, and the second stage then selects 10,000 images from those 20,000. In this step, three schemes are used on the data in the unlabeled pool as the first-stage active sampling strategy: the Least Confidence (LC) algorithm, the Learning Loss (LL) algorithm, and the Coreset algorithm. After sampling is completed, the first-stage sampling result, a subset of the unlabeled pool, is obtained.
The LC algorithm selects samples in the unlabeled pool whose model-output probability distributions have low confidence. Specifically, let the face image be x, the parameters of the race classification model be θ, and the output of the model be p(x|θ). Since the task is to divide all x into 5 categories, p(x|θ) can be written as [p_1, p_2, …, p_5], where p_c (1 ≤ c ≤ 5) is the probability that x belongs to class c. A score S_LC of sample x is computed from [p_1, p_2, …, p_5]; the higher the score, the higher the priority at sampling time. Two ways of computing S_LC are given here. The first examines the probability of the predicted class: the lower this probability, the less certain the model is about its prediction and the more the sample should be selected, i.e. S_LC = 1 − max_{1≤c≤5} p_c. The second examines the entropy of [p_1, p_2, …, p_5]: the larger the entropy, the greater the uncertainty of the model output and the more the sample should be selected, i.e. S_LC = −Σ_{c=1}^{5} p_c log p_c. After the scores are computed, the 20,000 images with the highest scores are selected as the sampling result.
The LL algorithm uses a loss predictor to predict the loss function value of each sample in the unlabeled pool; the larger the predicted value, the larger the expected decrease of the loss function when training with the sample, and therefore the more it should be selected. Specifically, the loss predictor extracts the features of a sample x at several layers of the neural network, passes each feature through its own global average pooling layer, fully connected layer and ReLU activation, concatenates all the resulting features, and finally maps them through a fully connected layer to a scalar, which is the predicted loss value l̂(x) of the sample. During sampling, the 20,000 images with the largest predicted loss values are taken as the sampling result.
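The forward pass of such a predictor — per-layer global average pooling, a fully connected layer with ReLU, concatenation, and a final fully connected layer producing a scalar — can be sketched in NumPy as follows (a sketch only; the number of tapped layers, the hidden width and the weight layout are assumptions):

```python
import numpy as np

def loss_predictor(layer_feats, weights):
    # layer_feats: list of (C_l, H_l, W_l) feature maps from chosen layers
    # weights: dict with per-layer FC matrices "fc" and a final vector "head"
    # (the layer choice and dimensions are illustrative assumptions)
    hidden = []
    for feat, w in zip(layer_feats, weights["fc"]):
        gap = feat.mean(axis=(1, 2))             # global average pooling -> (C_l,)
        hidden.append(np.maximum(gap @ w, 0.0))  # fully connected + ReLU
    h = np.concatenate(hidden)                   # concatenate all branches
    return float(h @ weights["head"])            # final FC -> scalar predicted loss
```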
The Coreset algorithm instead selects from the unlabeled pool the subset whose loss is closest to the loss over all samples, which theoretical analysis reduces to the following problem: taking the sampled samples as center points, consider the distance from every other sample to its nearest center point, and minimize the maximum of these distances. In the face image dataset X, let the samples be x_i (1 ≤ i ≤ N), let the set of currently labeled samples be s_0 (the empty set if no sample is labeled), and let the number of samples to be selected in this step be 20,000. The greedy version performs the following steps:
S1: let s denote the set of all currently labeled samples; obviously s = s_0 initially;
S2: let x = argmax_{x_i ∈ X∖s} min_{x_j ∈ s} d(x_i, x_j), where d(x_i, x_j) is the distance between samples x_i and x_j, for example the Euclidean distance ‖x_i − x_j‖_2; the x obtained in this step is the next sample to be labeled — the sample farthest from all samples already selected, and therefore highly representative of the region not yet covered;
S3: add x to s, i.e. let s = s ∪ {x};
S4: repeat S2 and S3 until |s| = |s_0| + 20,000, i.e. until the number of newly selected samples reaches the preset 20,000;
S5: take s ∖ s_0 as the sampling result of this step.
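Steps S1–S5 are a greedy k-center selection; a minimal sketch follows (the function name and the choice of Euclidean distance are assumptions consistent with the text):

```python
import numpy as np

def greedy_k_center(X, labeled_idx, budget):
    # X: (N, d) sample features; labeled_idx: indices of s_0 (may be empty)
    N = X.shape[0]
    if labeled_idx:
        # distance from every sample to its nearest existing center
        d_min = np.min(np.linalg.norm(X[:, None] - X[list(labeled_idx)][None], axis=2), axis=1)
    else:
        d_min = np.full(N, np.inf)
    picked = []
    for _ in range(budget):
        x = int(np.argmax(d_min))  # farthest point from the current centers
        picked.append(x)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[x], axis=1))
    return picked                  # the newly selected samples, i.e. s \ s_0
```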
Because our face images are obtained from the Internet, abnormal images such as masked faces and cartoon faces are unavoidable, and such images do not help train the model. The Coreset algorithm can therefore be further optimized for robustness. The abnormal images are treated as "outliers" that lie far from all center points, and a hyperparameter Ξ is introduced to denote the allowed number of outliers. Outliers are excluded when computing the maximum distance. Although introducing outliers complicates the formulation of the problem, it improves the quality of the solution. The robust algorithm performs the following steps:
S1: obtain a preliminary sampling result with the greedy algorithm above and denote it s_g; compute δ = max_{x_i} min_{x_j ∈ s_g} d(x_i, x_j); let lb = δ/2 and ub = δ;
S2: let δ = (lb + ub)/2 and, with indicator variables u_j (x_j is selected as a center), ω_{i,j} (sample x_i is assigned to center x_j) and ξ_{i,j} (the assignment of x_i to x_j is treated as an outlier), judge whether the following conditions can be satisfied simultaneously: (1) Σ_j u_j = |s_0| + b, where b is the sampling budget of this step; (2) Σ_{i,j} ξ_{i,j} ≤ Ξ; (3) Σ_j ω_{i,j} = 1 for every i; (4) ω_{i,j} ≤ u_j for every i and j; (5) u_j = 1 for every x_j ∈ s_0; (6) ω_{i,j} = ξ_{i,j} whenever d(x_i, x_j) > δ. If all six conditions can be satisfied simultaneously, execute ub = δ; otherwise execute lb = δ;
S3: repeat S2 until ub = lb;
S4: set δ = ub;
S5: substitute δ into the six conditions of S2 and solve for u_i, ξ_{i,j} and ω_{i,j} (the operations of S2 and S3 guarantee that the solution here is unique; in practice the condition ub = lb of S3 is hard to satisfy exactly, so iteration of S2 can be stopped once ub − lb falls below a small positive number ε; there may then be several solutions, but as long as ε is small enough they differ only slightly and can be treated as the same solution);
S6: take the set {x_i | u_i = 1} as the final sampling result of this step.
Step 3, cluster the samples in each view.
This step begins the second-stage sampling, i.e. formally enters the multi-view clustering algorithm. First, the 20,000 face image samples obtained by the first-stage sampling are fed into the neural network model, and their features at several layers are extracted; the features of all samples at the same layer are called a view. For example, if a ViT network with 12 Transformer layers is used and the features after Transformer layers 2, 4, 6, 8, 10 and 12 are extracted (the features after the 12th Transformer layer are those fed into the classification head), there are 6 views.
The distribution of the samples in each of the 6 views is then modeled with a Gaussian mixture model (GMM), and a clustering result is formed on that basis. In a GMM, the probability density function p(x|θ) = Σ_{k=1..K} α_k·φ(x|μ_k, σ_k) describes the distribution of the samples within one view, where x represents an input sample, K is the number of Gaussian models in the GMM, φ(x|μ_k, σ_k) is the k-th Gaussian model with mean μ_k and variance σ_k, and α_k is the weight, which can be understood as the probability that the current sample belongs to the k-th Gaussian model.
The specific operation of the GMM is as follows. Let the set of all samples be {x^(1), …, x^(N)}, where x^(i) (1 ≤ i ≤ N) is the i-th first-stage sample, and introduce hidden variables γ_ik, where γ_ik = 1 if x^(i) belongs to the k-th Gaussian model and γ_ik = 0 otherwise. Then perform the following steps:
S1: randomly initialize θ (comprising α_k, μ_k and σ_k, 1 ≤ k ≤ K);
S2: compute the estimate γ̂_ik of E[γ_ik]: γ̂_ik = α_k·φ(x^(i)|μ_k, σ_k) / Σ_{k'=1..K} α_k'·φ(x^(i)|μ_k', σ_k');
S3: compute the estimates of the model parameters: μ̂_k = Σ_i γ̂_ik·x^(i) / Σ_i γ̂_ik, σ̂_k = Σ_i γ̂_ik·(x^(i) − μ̂_k)² / Σ_i γ̂_ik, α̂_k = Σ_i γ̂_ik / N;
S4: repeat S2 and S3 until the model converges, and finally use α̂_k, μ̂_k and σ̂_k as the values of α_k, μ_k and σ_k.
After the above operations, 6 distributions p_1(x|θ_1), p_2(x|θ_2), …, p_6(x|θ_6) are obtained for the 6 views. From these distributions, each sample is assigned a category on each view — the Gaussian model with the largest responsibility γ̂_ik — as the clustering result.
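The EM iteration S1–S4 can be sketched for the one-dimensional case (a simplification: the method itself fits the GMM on multi-dimensional view features, and the quantile initialization used here is an assumption):

```python
import numpy as np

def gmm_em(x, K, iters=100):
    # EM for a 1-D Gaussian mixture; returns (weights, means, stds)
    N = len(x)
    alpha = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)   # spread-out initialization
    sigma = np.full(K, x.std() + 1e-3)
    for _ in range(iters):
        # E-step: responsibilities gamma_hat_{ik}
        dens = alpha * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) \
               / (sigma * np.sqrt(2.0 * np.pi))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate alpha_k, mu_k, sigma_k
        Nk = gamma.sum(axis=0)
        alpha = Nk / N
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk) + 1e-8
    return alpha, mu, sigma
```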
Step 4, compute the consistency between each pair of views.
For a fully trained neural network, the clustering results on the various views should be as similar as possible. This step therefore uses the Rand statistic to compute the consistency R(V_m, V_n) (1 ≤ m ≤ U, 1 ≤ n ≤ U) between two views V_m and V_n.
Specifically, record the clustering result of each sample x_i on each view, and denote the total number of samples by s, so that all samples form s(s−1)/2 sample pairs (x_i, x_j) (i ≠ j). Examining the clustering results of the two elements x_i and x_j of a pair on views V_m and V_n, four cases arise: (1) x_i and x_j belong to the same class on V_m and to the same class on V_n; (2) x_i and x_j belong to different classes on V_m and to different classes on V_n; (3) x_i and x_j belong to the same class on V_m but to different classes on V_n; (4) x_i and x_j belong to different classes on V_m but to the same class on V_n. Each of the s(s−1)/2 pairs belongs to exactly one case. The set of all pairs satisfying (1) or (2) is denoted s_p (views V_m and V_n agree on whether the two samples of the pair belong to the same class), and the set of all pairs satisfying (3) or (4) is denoted s_n (views V_m and V_n disagree on whether the two samples of the pair belong to the same class).
After s_p and s_n are obtained, the consistency between V_m and V_n can be computed as R(V_m, V_n) = ‖s_p‖ / (‖s_p‖ + ‖s_n‖), where ‖·‖ denotes the number of elements in a set; in other words, R is the fraction of the s(s−1)/2 sample pairs on which the partitions of V_m and V_n agree. Since there are 6 views in total, 6×5/2 = 15 consistency values are obtained.
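The Rand consistency between the cluster labels of two views can be sketched as follows (the label arrays and the function name are illustrative):

```python
import numpy as np

def rand_consistency(a, b):
    # a, b: cluster labels of the same samples under two views
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)         # all s(s-1)/2 unordered pairs
    agree = (same_a[iu] == same_b[iu]).sum()  # pairs in s_p (cases 1 and 2)
    return agree / len(iu[0])
```

Note that identical partitions score 1 even under relabeling, e.g. labels [0, 0, 1, 1] versus [1, 1, 0, 0].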
Step 5, perform the second-stage active sampling using the expression degree and the stability.
A sampling score is computed from the probability density function p(x|θ) obtained by the GMM algorithm and the pairwise view consistencies R, and used as the second-stage active sampling strategy. The score takes both the expression degree and the stability of a sample into account.
For the expression degree of a sample, the probability density at the sample's location in a typical view is used as the evaluation criterion. A "typical view" is a view whose clustering result is similar to those of most other views. Specifically, the consistency of a view V_m is computed as the average of its consistencies with the other views, Cons(V_m) = (1/(U−1))·Σ_{n≠m} R(V_m, V_n), and the view V_o with the highest consistency is taken as the typical view. In the ViT network we use, the 5th view is the typical view. Intuitively, random noise in the samples is gradually removed during the forward propagation of the network while essential features are preserved, so views located later in the model are more likely to have high consistency with most views, which matches the actual computation results. On this basis, the expression degree Rep(x_i) of a sample x_i is defined as the probability density obtained by the GMM algorithm on V_o, i.e. Rep(x_i) = p_o(x_i|θ_o), where o is the index of the typical view. The expression degree computed this way reflects the distance from the sample to the cluster center: the higher the expression degree, the closer the sample is to the cluster center, the more frequently samples with similar features occur, and the more samples it can represent.
Concretely, after the face images are clustered, the images of one race tend to be spread over several categories (when many categories are set, different regional appearances within the same race fall into different categories: with 5 categories each race roughly corresponds to one category and most members of a race end up together, while with 30 categories the sub-populations within a race are very likely to be split apart). Within each race there are many "common faces", and emphasizing them lets the model learn the main characteristics of the race; conversely, if the labeled data contained a large number of uncommon faces, the model would base its judgment of the race on incidental features of those faces, and the classification results would deviate greatly. Samples with a high expression degree correspond exactly to "common faces".
For the stability of a sample, the degree to which the sample appears together with the same-class samples across views is used as the evaluation criterion. Specifically, denote the set of samples that belong to the same class as x_i in view V_m by A_i^m. First compute the stability of sample x_i between views V_m and V_n as Stab_mn(x_i) = ‖A_i^m ∩ A_i^n‖ / ‖A_i^m ∪ A_i^n‖, where ‖·‖ denotes the number of elements in a set; the stability Stab(x_i) is then the average of Stab_mn(x_i) over all pairs of views. In particular, the face images may contain "mixed-race" images — for example an A–B mixed-race face with a brown skin tone whose facial features are closer to race A. One view of the model may pay more attention to skin tone and cluster the image with race B, while another view may pay more attention to the facial features and cluster it with race A. Whether such an image is manually labeled "race A" or "race B", the label fits it poorly, and the mismatched part misleads the training of the model, so sampling such images should be avoided as much as possible. Stability captures exactly this aspect of sampling.
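The pairwise form of Stab_mn is not fully legible in the source; one natural reading, assumed here, is the intersection-over-union of the two same-class sets, averaged over all view pairs:

```python
import numpy as np

def stability(labels, i):
    # labels: (U, N) cluster assignments of N samples in each of U views
    # assumption: Stab_mn is the Jaccard overlap of the same-class sets of x_i
    U = labels.shape[0]
    sets = [set(np.flatnonzero(labels[m] == labels[m, i])) for m in range(U)]
    vals = []
    for m in range(U):
        for n in range(m + 1, U):
            vals.append(len(sets[m] & sets[n]) / len(sets[m] | sets[n]))
    return sum(vals) / len(vals)
```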
Finally, the sampling score S(x_i) of sample x_i is the weighted sum of the expression degree and the stability, i.e. S(x_i) = Rep(x_i) + λ·Stab(x_i), where λ > 0 is a hyperparameter that balances the importance the model attaches to the two quantities and needs to be tuned for the actual application scenario; in our race classification model, λ = 0.05 works well. After the sampling scores of all samples are obtained, the 10,000 images with the highest scores are selected from the 20,000 face images obtained by the first-stage active sampling as the result of the second-stage active sampling, and are submitted for manual labeling.
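The final second-stage selection then reduces to a weighted sum and a top-k cut (a sketch; the array names are assumptions):

```python
import numpy as np

def second_stage_select(rep, stab, k, lam=0.05):
    # S(x_i) = Rep(x_i) + lambda * Stab(x_i); keep the k highest scores
    scores = rep + lam * stab
    return np.argsort(scores)[::-1][:k]
```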
Step 6, train the neural network.
This step trains the model using the labels of the labeled samples obtained so far. During training, the parameters of the neural network are optimized so that, on the one hand, the model completes the race classification task better and, on the other hand, the model extracts better features. To this end, the loss function consists of two parts: the task loss (TL) L_TL and the multi-view clustering loss (MVCL) L_MVCL. L_TL uses the cross-entropy loss and uses only the labeled samples in its computation, while L_MVCL is determined by the consistency of the clustering results across views and uses all samples. The total loss function of the training process is L = L_TL + μ·L_MVCL, where μ = 0.1.
During training we set the learning rate to 1×10⁻³ and use the Adam optimizer (β₁ = 0.9, β₂ = 0.99).
When the first-stage active sampling strategy is the LL algorithm, the loss predictor also needs to be optimized on the labeled data. Specifically, a loss-prediction loss L_loss is introduced. The batch size is set to 64 during training; in one iteration, the samples in the batch are paired to form sample pairs (x_i, x_j). After x_i and x_j are fed into the neural network, the corresponding outputs are denoted ŷ(x_i) and ŷ(x_j), the manually annotated true labels are denoted y(x_i) and y(x_j), and the true losses l(x_i) and l(x_j) are computed from them; at the same time, the predicted losses l̂(x_i) and l̂(x_j) are computed from the features of several layers of the network. The loss for the pair (x_i, x_j) is then L_pair(x_i, x_j) = max(0, −sign(l(x_i) − l(x_j))·(l̂(x_i) − l̂(x_j)) + ξ), where the hyperparameter ξ is 1. It should be noted that in numerical computation l(x_i) = l(x_j) almost never occurs and is therefore ignored here. The total prediction loss is L_loss = (1/‖P‖)·Σ_{(x_i,x_j)∈P} L_pair(x_i, x_j), where P is the set of all sample pairs in the current batch. During training, L_loss is added to the total loss L, with a hyperparameter coefficient balancing its importance against the other terms; we set this coefficient to 0.1, i.e. L = L_TL + μ·L_MVCL + 0.1·L_loss.
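The pairwise ranking loss for the loss predictor can be sketched as follows (the sign-margin form is the standard learning-loss pairing; routing the rare tie l(x_i) = l(x_j) to the negative branch is an assumption):

```python
def pair_ranking_loss(l_true, l_pred, xi=1.0):
    # average margin ranking loss over all sample pairs in the batch:
    # a pair is penalized when the predicted losses are ordered differently
    # from the true losses (or ordered correctly by less than the margin xi)
    total, count = 0.0, 0
    n = len(l_true)
    for i in range(n):
        for j in range(i + 1, n):
            sign = 1.0 if l_true[i] > l_true[j] else -1.0
            total += max(0.0, -sign * (l_pred[i] - l_pred[j]) + xi)
            count += 1
    return total / count
```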
Step 7, iteratively optimize the model.
Steps 1–5 are repeated for 10 rounds of iterative optimization of the model. In each iteration, all temporarily unlabeled samples constitute the unlabeled pool; two-stage active sampling on this pool yields 10,000 face images, which are manually labeled, and after labeling is complete the ViT model is trained with all labeled samples as in step 6. After the 10 iterations are complete, only the model is trained further, and the loss function of that training uses only L_TL.
After model training is completed, the model is tested: 1,000 face images similar in style to the training data are crawled from the Internet and used to test the trained model. When the LC, LL and Coreset algorithms are used as the first-stage sampling strategy, the top-1 classification accuracies reached are 89.7%, 90.2% and 86.3% respectively, so the model corresponding to the LL algorithm as the first-stage sampling strategy is finally selected for the face classification task.
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, may perform some or all of the steps of the image recognition method based on two-stage active learning in the summary and embodiments provided by the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present invention may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and include several instructions for causing a device comprising a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU, a network device, etc.) to perform the methods described in the embodiments or certain parts of the embodiments of the present invention.
The present invention provides an idea of and a method for image recognition based on two-stage active learning; there are many specific ways to implement the technical solution, and the above description is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (10)

1. An image recognition and classification method based on two-stage active learning, characterized by comprising the following steps:
step 1, determining the amount of data to be labeled in the dataset for training the image recognition classification model;
step 2, performing the first-stage active sampling by using an active learning method to obtain the first-stage samples;
step 3, clustering the first-stage samples by a multi-view clustering method;
step 4, computing the consistency between any two views;
step 5, computing a sampling score as the sampling strategy, performing the second-stage active sampling, and manually labeling the second-stage samples obtained by the second-stage active sampling;
step 6, training the image recognition classification model;
step 7, repeating steps 1 to 6 to iteratively optimize the image recognition classification model; when the number of manually labeled samples reaches the data amount determined in step 1, skipping steps 1 to 5 and training the image recognition classification model using only the task loss of step 6 as the loss function;
step 8, applying the optimized image recognition classification model to perform image recognition and classification.
2. The image recognition and classification method based on two-stage active learning according to claim 1, wherein performing the first-stage active sampling by using an active learning method in step 2 comprises:
using an active learning method on the data in the original dataset, namely the unlabeled pool, as the first-stage active sampling strategy to obtain the first-stage sampling result, namely the first-stage samples, the set of which is a subset of the unlabeled pool.
3. The image recognition and classification method based on two-stage active learning according to claim 2, wherein clustering the first-stage samples in step 3 specifically comprises:
step 3-1, extracting the features of each first-stage sample at each layer of the image recognition classification model, and defining the features of all samples at the same layer as a view; the image recognition classification model has U layers in total from which features are extracted, i.e. U views;
step 3-2, modeling the distribution of the first-stage samples in each view by using a Gaussian mixture model GMM to obtain a GMM model and form a clustering result.
4. The image recognition and classification method based on two-stage active learning according to claim 3, wherein modeling the distribution of the first-stage samples in each view using the Gaussian mixture model GMM and forming a clustering result in step 3-2 specifically comprises:
in the Gaussian mixture model GMM, a probability density function p(x|θ) = Σ_{k=1..K} α_k·φ(x|μ_k, σ_k) describes the distribution of the samples in one view, where x represents an input sample, θ represents the parameters of the GMM model, K represents the number of Gaussian models in the Gaussian mixture model, φ(x|μ_k, σ_k) represents the k-th Gaussian model with mean μ_k and variance σ_k, and α_k is the weight, i.e. the probability that the current sample belongs to the k-th Gaussian model;
the specific operation of forming the clustering result using the Gaussian mixture model GMM is as follows: let the set of all first-stage samples be {x^(1), x^(2), …, x^(N)}, where N is the number of first-stage samples and x^(i) (1 ≤ i ≤ N) represents the i-th first-stage sample;
introduce hidden variables γ_ik, where γ_ik = 1 if x^(i) belongs to the k-th Gaussian model and γ_ik = 0 otherwise; then perform the following steps:
step 3-2-1: randomly initialize θ, comprising the weights α_k, means μ_k and variances σ_k;
step 3-2-2: compute the estimates γ̂_ik of the hidden variables γ_ik as follows: γ̂_ik = α_k·φ(x^(i)|μ_k, σ_k) / Σ_{k'=1..K} α_k'·φ(x^(i)|μ_k', σ_k');
step 3-2-3: compute the estimates of the GMM model parameter θ as follows: μ̂_k = Σ_i γ̂_ik·x^(i) / Σ_i γ̂_ik, σ̂_k = Σ_i γ̂_ik·(x^(i) − μ̂_k)² / Σ_i γ̂_ik, α̂_k = Σ_i γ̂_ik / N, and use α̂_k, μ̂_k and σ̂_k as the new values of α_k, μ_k and σ_k;
step 3-2-4: repeat steps 3-2-2 and 3-2-3 until the model converges, i.e. for preset thresholds t_α, t_μ and t_σ: |α̂_k − α_k| < t_α, |μ̂_k − μ_k| < t_μ and |σ̂_k − σ_k| < t_σ; finally use α̂_k, μ̂_k and σ̂_k as the approximations of α_k, μ_k and σ_k;
step 3-2-5: for the U views, obtain U distributions p_1(x|θ_1), p_2(x|θ_2), …, p_U(x|θ_U), where θ_U represents the parameters of the U-th view; according to these distributions, assign each first-stage sample a category on each view — the Gaussian model with the largest γ̂_ik — as the clustering result.
5. The image recognition and classification method based on two-stage active learning according to claim 4, wherein computing the consistency between any two views in step 4, i.e. computing the consistency R(V_m, V_n) (1 ≤ m ≤ U, 1 ≤ n ≤ U) between the m-th view V_m and the n-th view V_n using the Rand statistic, is performed as follows:
record the clustering result of each first-stage sample x_i on each view, and denote the total number of first-stage samples as s, so that all samples form s(s−1)/2 sample pairs, where the pair composed of sample x_i and sample x_j is denoted (x_i, x_j) (i ≠ j); the clustering results of the elements x_i and x_j of a pair on views V_m and V_n fall into the following four cases:
case 1: x_i and x_j belong to the same class on V_m and to the same class on V_n;
case 2: x_i and x_j belong to different classes on V_m and to different classes on V_n;
case 3: x_i and x_j belong to the same class on V_m but to different classes on V_n;
case 4: x_i and x_j belong to different classes on V_m but to the same class on V_n;
for all s(s−1)/2 sample pairs, each pair belongs to exactly one of the above cases; the set of all pairs satisfying case 1 or case 2 is denoted s_p, and the set of all pairs satisfying case 3 or case 4 is denoted s_n; the consistency R(V_m, V_n) between V_m and V_n is computed as:
R(V_m, V_n) = ‖s_p‖ / (‖s_p‖ + ‖s_n‖),
where ‖·‖ denotes the number of elements in a set.
6. The image recognition and classification method based on two-stage active learning according to claim 5, wherein computing the sampling score as the sampling strategy in step 5 means computing the sampling score from the sample expression level and the sample stability and using it as the second-stage active sampling strategy;
wherein the sampling score S(x_i) of the first-stage sample x_i is a weighted sum of the expression level and the stability, namely:
S(x_i) = Rep(x_i) + λ·Stab(x_i),
where λ is a hyperparameter satisfying λ > 0, Rep(x_i) is the sample expression level of sample x_i, and Stab(x_i) is the sample stability of sample x_i;
the second-stage active sampling strategy is: after the sampling scores of all first-stage samples are obtained, select the samples with the highest scores as the result of the second-stage active sampling.
7. The image recognition and classification method based on two-stage active learning according to claim 6, wherein the sample expression level in step 5 is computed as follows:
compute the consistency Cons(V_m) of view V_m as the average of its consistencies with the other views, namely: Cons(V_m) = (1/(U−1))·Σ_{n≠m} R(V_m, V_n);
take the view V_o with the highest consistency as the typical view; the sample expression level Rep(x_i) is the probability density function value on V_o obtained using the Gaussian mixture model GMM in step 3, namely:
Rep(x_i) = p_o(x_i|θ_o),
where o is the index of the typical view.
8. The image recognition and classification method based on two-stage active learning according to claim 7, wherein the sample stability in step 5 is computed as follows:
denote the set of samples belonging to the same class as x_i in view V_m by A_i^m and the set of samples belonging to the same class as x_i in view V_n by A_i^n; first compute the stability of sample x_i between views V_m and V_n as: Stab_mn(x_i) = ‖A_i^m ∩ A_i^n‖ / ‖A_i^m ∪ A_i^n‖, where ‖·‖ denotes the number of elements in a set;
then compute the stability Stab(x_i) of x_i as the average of Stab_mn(x_i) over all pairs of views: Stab(x_i) = (2/(U(U−1)))·Σ_{m<n} Stab_mn(x_i).
9. The image recognition and classification method based on two-stage active learning according to claim 8, wherein training the image recognition classification model in step 6 means training the image recognition classification model using the second-stage samples manually labeled in step 5;
during training, the loss function consists of two parts, the task loss L_TL and the multi-view clustering loss L_MVCL, wherein the task loss L_TL uses the cross-entropy loss of the classification task and uses only the labeled samples in its computation, and the multi-view clustering loss L_MVCL is computed from the consistency of the clustering results across views and uses all samples in its computation;
finally, the total loss function L of the training process is:
L = L_TL + μ·L_MVCL,
where μ is a hyperparameter satisfying μ > 0.
10. The image recognition and classification method based on two-stage active learning according to claim 9, wherein, for the total loss function L of the training process in step 6, when the active learning method of step 2 is the learning loss LL algorithm, a loss-prediction loss L_loss is introduced and added to L as one of its terms; the loss L_pair(x_i, x_j) of a sample pair (x_i, x_j) is:
L_pair(x_i, x_j) = max(0, −sign(l(x_i) − l(x_j))·(l̂(x_i) − l̂(x_j)) + ξ),
where ξ is a hyperparameter, l(·) is the true loss and l̂(·) is the predicted loss;
the loss-prediction loss L_loss is:
L_loss = (1/‖P‖)·Σ_{(x_i,x_j)∈P} L_pair(x_i, x_j),
where P is the set of all sample pairs in the current training batch.
CN202310715906.0A 2023-06-16 2023-06-16 Image recognition and classification method based on two-stage active learning Pending CN116824237A (en)

Publications (1)

Publication Number Publication Date
CN116824237A (en) 2023-09-29


