CN114419394A

CN114419394A - Method and device for recognizing semantic soft label image with limited and unbalanced data

Info

Publication number: CN114419394A
Application number: CN202210063646.9A
Authority: CN
Inventors: 王瑞轩; 钟哲灏
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-04-29

Abstract

The invention discloses a semantic soft label image recognition method and device for limited and unbalanced data. The method comprises the following steps: constructing a semantic soft label image recognition model; pre-training a self-supervised network on a large-scale text data set to obtain a word embedding module; generating a corresponding soft label for each category in the training data set by using the word embedding module; inputting the training data set into a feature extractor to obtain feature vectors, and using the corresponding soft labels to guide training to obtain a trained semantic soft label image recognition model; and inputting the test data set into the trained semantic soft label image recognition model for testing to obtain an image recognition result. Because the word embedding module is trained on a large-scale text data set and is used to generate, for each class of the data set, a soft label containing rich semantic information, an image recognition model with strong generalization performance can be trained under limited and unbalanced data, and recognition performance is improved.

Description

Method and device for recognizing semantic soft label image with limited and unbalanced data

Technical Field

The invention belongs to the technical field of image recognition, and particularly relates to a semantic soft label image recognition method and device with limited and unbalanced data.

Background

In recent years, deep-learning-based image processing has advanced rapidly, and existing image recognition models have reached an accuracy of 88.3% on the ImageNet image recognition data set, which contains 1000 classes of natural images such as airplanes, cars, birds, and cats. Image recognition is essentially a classification task, i.e. assigning an input picture to its true category, and usually requires a well-performing image feature extractor and feature classifier.

However, image recognition based on deep learning relies on a large amount of training data and uses statistical principles to learn the feature distribution of each real class. In practical applications such as face recognition and medical image analysis, images are hard to obtain (for example, images of rare diseases) and expensive to label (a professionally trained doctor must spend time labeling them), so a large amount of training data is generally not available for training a deep learning model. When the amount of training data is limited, the deep learning model overfits the training data set severely, i.e. the classification accuracy on the test data set drops as the classification accuracy on the training data set rises. When the training data is unbalanced, the uneven distribution of data among classes in the training set makes the deep learning model tend to assign images to the classes with more samples, which leads to model bias. In both cases, existing deep learning models do not perform well, which limits the further development of deep learning in these areas.

To address these problems, researchers have proposed separate methods for the limited-data and unbalanced-data cases. For the limited-data case, there are two main approaches: 1. transferring knowledge learned from an additional training data set, which is related to the current training data set and has sufficient data, to the deep learning model being trained; this generally involves training a well-performing feature extractor on the additional data set, then fine-tuning or freezing the parameters of the feature extractor on the current task and training a new feature classifier; 2. applying data augmentation such as rotation, cropping, and translation to the training images to generate new training images, which indirectly increases the amount of training data; Cutout, Random Erasing, and GridMask are more recent methods of this kind. Cutout replaces a random area of the original image with the value 0 to generate a new training image; Random Erasing replaces a random area of the original image with random values; GridMask replaces multiple different random areas of the original image with the value 0 or with random values.

For the unbalanced-data case, researchers have proposed a series of class balancing methods applied during training, such as increasing the weight of training samples from rarer classes or sampling rarer classes more often. Recent methods focus on the design of the loss function used in deep learning training, and mainly consider the prediction challenge at the instance level and the data distribution at the class level. The instance-level prediction challenge is represented by CBFocal. Observing that the gain in model performance diminishes as training data increases, because different images can contain similar (overlapping) information, this method introduces the effective number of samples, i.e. the number of samples needed to provide sufficient information for each class, and performs class balancing with it. The class-level data distribution is represented by LDAMloss, which designs the loss function so that, in the feature space, the margin of rarer classes is larger than that of other classes, thereby improving generalization on the rarer classes.
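The erasing-style augmentations described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the published algorithms: the image is a plain list of rows, the patch is a square that lies fully inside the image, and the function name `erase_patch` and its parameters are hypothetical.

```python
import random

def erase_patch(image, size, fill="zero", rng=None):
    """Return a copy of `image` (a list of rows of floats) with one random
    size x size square overwritten.

    fill="zero"   -> Cutout-style: the patch is set to 0.
    fill="random" -> Random-Erasing-style: the patch gets random values in [0, 1).
    Assumption: the patch is kept fully inside the image for simplicity.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    top = rng.randrange(0, height - size + 1)
    left = rng.randrange(0, width - size + 1)
    out = [row[:] for row in image]  # copy so the original stays intact
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = 0.0 if fill == "zero" else rng.random()
    return out

# Toy 8 x 8 single-channel image filled with ones.
original = [[1.0] * 8 for _ in range(8)]
cutout_image = erase_patch(original, size=3, fill="zero", rng=random.Random(0))
erased_image = erase_patch(original, size=3, fill="random", rng=random.Random(0))
```

Both variants only hide pixels that are already in the image, which is consistent with the limitation noted in the background: they add no new semantic information about the class.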

However, these methods have drawbacks. In the limited-data case, the knowledge transfer method needs an additional training data set containing a large number of labeled images related to the current training data; in tasks such as medical image analysis or pest and disease monitoring, such a data set is almost impossible to obtain because collection and labeling are very difficult. The data augmentation method generates new images by applying transformations to the original training images; compared with the original images, however, the transformed images hardly introduce any effective high-level semantic information (such as highly generalized knowledge about the category of the current image), so their ability to further improve the classification performance of the model is limited. In the unbalanced-data case, the rebalancing method is in fact a trade-off in classification performance between frequent and rare classes: the performance on frequent classes is sacrificed to improve the performance on rare classes, which usually leaves the model under-fitted on the frequent classes (with room for further improvement). When the rare classes have very little training data, this artificial rebalancing may also cause the model to overfit on the rare classes.

Disclosure of Invention

The invention mainly aims to overcome the defects of the prior art and provide a semantic soft label image recognition method and device with limited and unbalanced data.

In order to achieve the purpose, the invention adopts the following technical scheme:

In one aspect, the invention provides a semantic soft label image recognition method with limited and unbalanced data, which comprises the following steps:

constructing a semantic soft label image recognition model; the semantic soft label image recognition model comprises a feature extractor and a word embedding module;

pre-training a self-supervised network on a large-scale text data set to obtain a word embedding module;

generating a corresponding soft label for each category in the training data set by using a word embedding module;

inputting a training data set into a feature extractor to obtain a feature vector, and using a corresponding soft label to guide training to obtain a trained semantic soft label image recognition model;

and inputting the test data set into the trained semantic soft label image recognition model for testing to obtain an image recognition result.

As a preferred technical solution, the word embedding module training step is:

processing the training texts in a large-scale text data set, and generating a vocabulary of V words after removing garbled characters and symbols;

assigning each word in the vocabulary an N-dimensional learnable word vector;

training the self-supervised network by using the N-dimensional learnable word vectors of the V words in the vocabulary to obtain the trained self-supervised network;

and retaining the feature encoder part of the trained self-supervised network as the word embedding module.

As a preferred technical solution, the self-supervised network comprises a Word2Vec, GloVe, FastText, Kazumachor, or BERT network.

As a preferred technical solution, the generating of the corresponding soft label specifically includes:

inputting the category words of the training data set into the word embedding module, and acquiring a soft label corresponding to each category, wherein the soft label is expressed as:

w_c ∈ R^D

wherein w_c represents the soft label of the c-th class in the training data set, and R^D represents the D-dimensional real vector space;

the training data set is represented as:

DA = {(x_i, y_i), i = 1, …, N}

wherein x_i represents the i-th training image, y_i represents the real one-hot label corresponding to the i-th training image, and N represents the total number of training images in the training data set.

As a preferred technical solution, the obtaining the feature vector specifically includes:

extracting the features of the training image x_i using the feature extractor, expressed as:

f_i = F(x_i)

wherein f_i represents the D-dimensional feature vector output by the feature extractor F for the input training image x_i.

As a preferred technical solution, the obtaining of the trained semantic soft label image recognition model specifically includes:

for each training image x_i, calculating the cosine similarity between the feature vector f_i output by the feature extractor and the soft labels, and obtaining a cross entropy loss function from these similarities by means of a cosine distance loss, wherein the formula is as follows:

wherein s(f_i, w_i) represents the cosine similarity between the feature vector f_i of the i-th training image and the soft label w_i corresponding to the i-th training image, C denotes the total number of classes in the training data set, s(f_i, w_j) is the cosine similarity between the feature vector f_i of the i-th training image and the soft label w_j corresponding to the j-th category in the training data set, and τ is a temperature hyper-parameter;
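The formula itself is not reproduced in this text. A form consistent with the symbols defined above, offered as a reconstruction under that assumption rather than as the published expression, is the temperature-scaled softmax cross entropy over cosine similarities:

```latex
L_i = -\log \frac{\exp\left(s(f_i, w_i)/\tau\right)}{\sum_{j=1}^{C} \exp\left(s(f_i, w_j)/\tau\right)},
\qquad
s(f, w) = \frac{f^{\top} w}{\lVert f \rVert \, \lVert w \rVert}
```

Here w_i in the numerator is the soft label of the true class of x_i, and the denominator runs over the soft labels of all C classes. Whether the published loss also averages L_i over the N training images is not stated in the text.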

training the feature extractor by using the cross entropy loss function, and updating the parameters of the feature extractor through stochastic gradient descent;

and obtaining the trained semantic soft label image recognition model.

As a preferred technical scheme, the feature extractor adopts a ResNet, VGG, DenseNet or ViT network structure.

As a preferred technical scheme, the soft label may alternatively be obtained by performing feature extraction on the images of each class in the training data set with a deep learning network trained on a large amount of image data, and calculating the class center as the soft label.

In another aspect, the invention provides a semantic soft label image recognition system with limited and unbalanced data, which applies the above semantic soft label image recognition method with limited and unbalanced data and comprises a model construction module, a word embedding training module, a soft label generation module, a model training module, and an image recognition module;

the model construction module is used for constructing a semantic soft label image recognition model, and the model comprises a feature extractor and a word embedding module;

the word embedding training module is used for pre-training a self-supervised network on a large-scale text data set to obtain a word embedding module;

the soft label generation module generates a corresponding soft label for each category in the training data set by using a word embedding module;

the model training module inputs a training data set into the feature extractor to obtain a feature vector, and guides training by using a corresponding soft label to obtain a trained semantic soft label image recognition model;

and the image recognition module is used for inputting the test data set into the trained semantic soft label image recognition model for testing to obtain an image recognition result.

The invention also provides a computer readable storage medium storing a program which, when executed by a processor, implements a method for identifying semantic soft label images with limited and unbalanced data as described above.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. the method adopts a self-supervision network trained on a large-scale text data set to obtain a word embedding module, introduces the prior knowledge of the relationship between categories into the model training process in the form of soft labels, and helps train a model with stronger generalization performance under the condition of limited or unbalanced data;

2. the method obtains the classification probability of the corresponding category by directly carrying out similarity calculation on the extracted feature vector and the soft label, can directly guide an image feature extractor to learn and extract visual features on a feature level, and simultaneously learns and extracts unique features of the category for distinguishing between the categories with similar semantics, thereby improving the image identification performance;

3. the image recognition method provided by the invention obtains the word embedding model by training on a large amount of natural language material and uses it to generate the soft labels, so the soft labels are not affected by the image training data set, and the method is widely applicable to image recognition tasks with limited or unbalanced data.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a semantic soft label image recognition method for data limitation and imbalance according to an embodiment of the present invention;

FIG. 2 is a block diagram of a semantic soft label image recognition system with limited and unbalanced data according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention;

FIG. 4(a) is a diagram illustrating cosine similarity between two similar class soft labels on a CIFAR-100 dataset according to an embodiment of the present invention;

FIG. 4(b) is a diagram illustrating category examples of two soft labels with higher cosine similarity according to an embodiment of the present invention;

FIG. 5 is a graph of performance of an embodiment of the present invention on a limited data set versus a baseline method.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Examples

The main goal of training a deep learning image recognition model is for the model to learn knowledge about the classes of the training images during training, i.e. the representative or discriminative features of each class, so that it can accurately classify new images; this capability is called the generalization performance of the model. However, when the training data is limited or the classes are unbalanced, it is very challenging for the model to learn this knowledge adequately from so few samples. Transferring or embedding prior knowledge about the classes into the model can help it learn class knowledge efficiently in such cases. The embedding vector of each class word, i.e. the vector representation obtained by embedding a natural-language word into a numerical embedding space, is taken as the soft label of that class, so that the relationships between classes are naturally embedded, as part of the class knowledge, into the training of the deep learning image recognition model.

As shown in fig. 1, the present embodiment provides a semantic soft label image recognition method with limited and unbalanced data, which includes the following steps:

S1, constructing a semantic soft label image recognition model, including a feature extractor and a word embedding module;

S2, pre-training the self-supervised network on a large-scale text data set to obtain a word embedding module;

S3, generating a corresponding soft label for each category in the training data set by using the word embedding module;

S4, inputting the training data set into the feature extractor to obtain feature vectors, and using the corresponding soft labels to guide training to obtain a trained semantic soft label image recognition model;

and S5, inputting the test data set into the trained semantic soft label image recognition model for testing to obtain an image recognition result.

More specifically, in step S2, the training step of the word embedding module is:

The word embedding model is obtained by training a self-supervised network (i.e. the training data does not need to be manually labeled; the training data itself serves as the supervision signal) on a large-scale text data set (such as a large amount of text crawled from the web using the crawler technology proposed by Google). The training task is either to predict the current word vector with a neural network from the combined vectors of the word's context, or to predict the vectors of the context words centered on a word with a neural network from the current word vector. Once the self-supervised network has been trained, its feature encoder part (i.e. the network that remains after removing the task-specific part, such as the prediction neural network mentioned above) is used as the word embedding model: for each input word, the word embedding model outputs the corresponding semantic representation, i.e. a vector, in the embedding feature space. Since the word embedding model is trained on corpora of millions or even billions of sentences, the embedded feature vectors capture latent semantic relationships between words. In particular, two semantically closely related words (such as "boy" and "man") are usually more similar in the embedding feature space.

More specifically, the self-supervised network used to obtain the word embedding module in the embodiment of the invention includes networks such as Word2Vec, GloVe, FastText, kazumachor, and BERT;

two methods of obtaining the word embedding module are described below:

first, a word embedding module is obtained with a Word2Vec network, and the training steps are as follows:

setting a window size, sampling the V words and their in-window contexts from the training text to form training data, looking them up in the vocabulary, and converting them into the corresponding vector representations;

representing the current word as w(t), and its context as w(t+i) and w(t-i);

the Word2Vec network has two training modes:

in CBOW mode, the vectors of the context words w(t+i) and w(t-i) are summed and the middle word w(t) is predicted from the sum by a neural network;

in Skip-gram mode, the context words w(t+i) and w(t-i) are predicted by a neural network from the vector of the current word w(t);

the N-dimensional vector representations of the V words in the vocabulary are optimized with either training mode; after training is finished, the neural network is discarded and the N-dimensional vector representations of the V words in the vocabulary are kept as the word embedding module.
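The difference between the two training modes lies in how the (input, target) pairs are formed from a window. The following minimal sketch shows only that pairing step, not the network or its optimization; the function name `training_pairs` and the toy sentence are illustrative assumptions.

```python
def training_pairs(tokens, window, mode="cbow"):
    """Build (input, target) pairs from a token list.

    cbow:      input = list of context words, target = center word w(t)
    skip-gram: input = center word w(t),      target = one context word
    """
    pairs = []
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        context = [tokens[k] for k in range(lo, hi) if k != t]
        if mode == "cbow":
            pairs.append((context, center))
        else:
            pairs.extend((center, ctx) for ctx in context)
    return pairs

sentence = ["the", "boy", "rides", "a", "bike"]
cbow_pairs = training_pairs(sentence, window=1, mode="cbow")
skipgram_pairs = training_pairs(sentence, window=1, mode="skipgram")
```

With a window of 1, CBOW yields one pair per position (for example, the context `["the", "rides"]` predicting `"boy"`), while Skip-gram yields one pair per (center, context word) combination.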

Secondly, a GloVe network is adopted to obtain a word embedding module, and the training steps are as follows:

setting a window size according to the training text, and constructing a V × V co-occurrence matrix X for the V words, in which each element X_i,j represents the number of times the i-th word and the j-th word appear in the same window in the training text; the window is moved over the training text, and the co-occurrence matrix is updated once per move;
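A minimal sketch of the co-occurrence counting step follows. It assumes a window centered on each word in turn and counts every other word inside it once; the exact windowing and any distance weighting used in practice are not specified in the text, and the function name `cooccurrence_matrix` is hypothetical.

```python
def cooccurrence_matrix(tokens, vocab, window):
    """Count how often word j appears within `window` positions of word i."""
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    X = [[0] * size for _ in range(size)]
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for k in range(lo, hi):
            if k != t:  # skip the center word itself
                X[index[center]][index[tokens[k]]] += 1
    return X

tokens = ["a", "b", "a", "c"]
vocab = ["a", "b", "c"]
X = cooccurrence_matrix(tokens, vocab, window=1)
```

Because every pair is counted once from each side, the resulting matrix is symmetric under this windowing assumption.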

the loss function for training the GloVe network is:

which expresses that the predicted similarity is given by the dot product of two word vectors, while the actual similarity is given by how often the two words appear together; wherein f(X_i,j) is a manually set weight matrix, which is used to suppress the influence of very frequent words such as "the" and "a"; by continually minimizing this squared error loss, the predicted similarity is brought ever closer to the actual similarity, and when the two are very close, the desired word vectors are obtained;
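The loss formula is not reproduced in this text. For reference, the objective from the original GloVe publication has the following form; it is given here as the standard expression, on the assumption that the patent uses the same one. The symbols v_i, ṽ_j (word and context vectors) and b_i, b̃_j (bias terms) are notation chosen for this sketch.

```latex
J = \sum_{i,j=1}^{V} f(X_{i,j}) \left( v_i^{\top} \tilde{v}_j + b_i + \tilde{b}_j - \log X_{i,j} \right)^{2},
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha}, & x < x_{\max} \\
1, & \text{otherwise}
\end{cases}
```

In this standard form the dot product is fitted to the logarithm of the co-occurrence count rather than to the raw count, and the weighting function f caps the contribution of very frequent word pairs.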

and finally, retaining N-dimensional expected word vector representations of the V words to obtain a word embedding module.

More specifically, in step S3, the category words of the training data set are input into the trained word embedding module, and a soft label corresponding to each category is generated, where the soft label is expressed as:

w_c ∈ R^D

the training data set is represented as:

DA = {(x_i, y_i), i = 1, …, N}

wherein x_i represents the i-th training image, and y_i represents the real one-hot label corresponding to the i-th training image, namely a label vector which consists only of 0s and 1s and is 1 only at the position of the corresponding class index; N represents the total number of training images in the training data set.
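Soft label generation amounts to looking up one embedding vector per class name. The sketch below uses a hypothetical hand-written 3-dimensional table in place of a trained word embedding module; the values are invented for illustration only, and the names `embedding_table` and `soft_labels` are assumptions.

```python
import math

# Hypothetical 3-dimensional embedding table standing in for a trained
# word embedding module; the numbers are made up for illustration.
embedding_table = {
    "boy": [0.9, 0.1, 0.0],
    "man": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

def soft_labels(class_names, table):
    """Look up one embedding vector w_c per class name."""
    return [table[name] for name in class_names]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

W = soft_labels(["boy", "man", "truck"], embedding_table)
sim_boy_man = cosine(W[0], W[1])
sim_boy_truck = cosine(W[0], W[2])
```

In this toy table the soft labels of the semantically related classes "boy" and "man" are far more similar to each other than either is to "truck", which is the kind of inter-class relationship the soft labels are meant to carry.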

More specifically, in step S4, the features of the training image x_i are extracted using the feature extractor, expressed as:

f_i = F(x_i)

wherein f_i represents the D-dimensional feature vector output by the feature extractor F for the input training image x_i;

in the method, similarity measures other than cosine similarity may also be used; they are not described in detail here.

If the output of the feature extractor is similar to the corresponding soft label, the feature extractor is able to extract visual features that carry the inter-class relationships contained in the soft label vectors; therefore, guiding the training of the feature extractor with the soft labels helps it learn prior knowledge of the inter-class relationships even when the training data is limited or the classes are unbalanced;

therefore, the feature extractor is trained with the cross entropy loss function, and its parameters are updated through stochastic gradient descent;

for each training image, the soft labels of all classes take part in the cosine distance loss calculation, and the feature extractor is trained to minimize the distance between each feature vector and its corresponding soft label and to maximize the distance between that feature vector and the soft labels of the other classes; in this way, for semantically similar categories, the feature extractor is trained to extract more discriminative features, since otherwise the similar soft labels would produce larger loss values; finally, the trained semantic soft label image recognition model can better distinguish different categories, i.e. it predicts the category whose soft label vector is closest to the output vector of the feature extractor.
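The loss computation and the nearest-soft-label prediction rule can be sketched as follows. This is a minimal illustration assuming a softmax cross entropy over temperature-scaled cosine similarities; the temperature value 0.1 and the function names `soft_label_loss` and `predict` are assumptions, and gradient updates of the feature extractor are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def soft_label_loss(feature, soft_labels, target, tau=0.1):
    """Cross entropy over cosine similarities scaled by temperature tau.

    `target` is the class index of the image's true soft label.
    tau=0.1 is an assumed value, not one taken from the text.
    """
    logits = [cosine(feature, w) / tau for w in soft_labels]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def predict(feature, soft_labels):
    """Return the index of the soft label most similar to the feature."""
    sims = [cosine(feature, w) for w in soft_labels]
    return max(range(len(sims)), key=sims.__getitem__)

W = [[1.0, 0.0], [0.0, 1.0]]   # two toy soft labels
f = [0.9, 0.1]                  # toy feature vector close to class 0
loss_correct = soft_label_loss(f, W, target=0)
loss_wrong = soft_label_loss(f, W, target=1)
```

A feature close to its own soft label gives a small loss, while the same feature scored against the other class gives a large one; a feature equally similar to both labels gives a loss of log 2.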

FIG. 4(a) shows the cosine similarity between pairs of class soft labels on the CIFAR-100 data set, where each point represents the cosine similarity of the soft labels of the two classes at the corresponding coordinates, and a brighter point represents a higher similarity; FIG. 4(b) shows example images of categories whose two soft labels have a very high similarity.

In this embodiment, the feature extractor can adopt various network frameworks; network structures such as ResNet, VGG, DenseNet, or ViT can be combined with the method of the present invention.

The key point of the method is that a large amount of natural language materials in the field of natural language processing are used for training to obtain a word embedding module, and the word embedding module is used for generating a corresponding soft label containing rich semantic information for each class in a data set; an effective image recognition method is provided based on the generated soft label, namely semantic information in the soft label is utilized to help training to obtain an image recognition model with better generalization performance, the image recognition model is not influenced by an image training data set, and good image recognition performance can be obtained under the conditions of limited data and unbalanced data.

FIG. 5 compares the performance of the baseline methods tested on limited data sets with the performance of the same baselines combined with the present method. Training is carried out on a limited number of training images per class (50, 100, and 200) sampled from the CIFAR10, CIFAR100, and mini-ImageNet data sets. On these limited-data classification tasks, combining the common data augmentation methods Cutout, Random Erasing, and GridMask with the present method further improves test accuracy.

Meanwhile, in the invention, besides the soft label is generated by the trained word embedding module, the characteristic extraction of each type of image in the current data set can be carried out by utilizing a deep learning network trained by a large amount of image data, and the class center is calculated to be used as the soft label.
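The class-center alternative can be sketched as below, taking the class center to be the mean feature vector of each class. That interpretation, the function name `class_centers`, and the toy features are assumptions; the sketch also assumes every class has at least one image and that the features have already been produced by a pre-trained network.

```python
def class_centers(features, labels, num_classes):
    """Average the feature vectors of each class to obtain its center."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feature, label in zip(features, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += feature[d]
    # Assumes counts[c] > 0 for every class c.
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]

features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]  # toy pre-extracted features
labels = [0, 0, 1]
centers = class_centers(features, labels, num_classes=2)
```

The resulting centers would then play the same role as the word-embedding soft labels w_c during training.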

It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.

Based on the same idea as the semantic soft label image recognition method with limited and unbalanced data in the above embodiment, the invention also provides a semantic soft label image recognition system with limited and unbalanced data, which can be used to execute that method. For ease of illustration, the schematic structural diagram of the system embodiment shows only the parts relevant to the embodiments of the present invention; those skilled in the art will appreciate that the illustrated structure does not limit the apparatus, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.

As shown in FIG. 2, another embodiment of the present invention provides a semantic soft label image recognition system with limited and unbalanced data, which comprises the following modules:

the word embedding training module is used for pre-training the self-supervised network on a large-scale text data set to obtain a word embedding module;

the soft label generating module generates a corresponding soft label for each category in the training data set by using the word embedding module;

the model training module inputs a training data set into the feature extractor to obtain a feature vector, and the corresponding soft label is used for guiding training to obtain a trained semantic soft label image recognition model;

It should be noted that the semantic soft label image recognition system with limited and unbalanced data of the present invention corresponds one-to-one to the semantic soft label image recognition method with limited and unbalanced data of the present invention; the technical features and beneficial effects described in the method embodiments also apply to the system embodiments. For details, refer to the description of the method embodiments, which is not repeated here.

In addition, in the implementation of the semantic soft label image recognition system with limited and unbalanced data according to the above embodiments, the logical division of each program module is only an example, and in practical applications, the above function distribution may be performed by different program modules according to needs, for example, due to configuration requirements of corresponding hardware or convenience of implementation of software, that is, the internal structure of the semantic soft label image recognition system with limited and unbalanced data is divided into different program modules to perform all or part of the above described functions.

As shown in FIG. 3, in an embodiment, a computer-readable storage medium is provided, which stores a program that, when executed by a processor, implements the above method for recognizing semantic soft label images with limited and unbalanced data, specifically:

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others.

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of the present specification.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A semantic soft label image recognition method with limited and unbalanced data, characterized by comprising the following steps:

2. The method for recognizing semantic soft label images with limited and unbalanced data according to claim 1, wherein the word embedding module training step is as follows:

3. The method for recognizing semantic soft label images with limited and unbalanced data according to claim 2, wherein the self-supervised network comprises a Word2Vec, GloVe, FastText, kazumachor and BERT network.

4. The method for recognizing semantic soft label images with limited and unbalanced data according to claim 2, wherein the generating of the corresponding soft label specifically comprises:

w_c∈R^D

the training data set is represented as:

DA = {(x_i, y_i), i = 1, …, N}

wherein x_i represents the i-th training image, y_i represents the true one-hot label corresponding to the i-th training image, and N represents the total number of training images in the training data set.

5. The method for recognizing semantic soft label images with limited and unbalanced data according to claim 4, wherein the obtaining feature vectors specifically comprises:

extracting the feature vector of the training image x_i using the feature extractor, expressed as:

f_i = F(x_i)

6. The method for recognizing the semantic soft label image with limited and unbalanced data according to claim 5, wherein the obtaining of the trained semantic soft label image recognition model specifically comprises:

and obtaining the trained semantic soft label image recognition model.

7. The method for recognizing semantic soft label images with limited and unbalanced data as claimed in claim 6, wherein the feature extractor adopts a ResNet, VGG, DenseNet or ViT network structure.

8. The method for recognizing semantic soft label images with limited and unbalanced data according to claim 2, wherein the generating of the corresponding soft label specifically comprises:

9. A semantic soft label image recognition system with limited and unbalanced data, characterized by being applied to the semantic soft label image recognition method with limited and unbalanced data of any one of claims 1 to 8, and comprising a model construction module, a word embedding training module, a soft label generation module, a model training module and an image recognition module;

10. A computer-readable storage medium storing a program, wherein the program, when executed by a processor, implements the semantic soft label image recognition method with limited and unbalanced data according to any one of claims 1 to 8.