CN115546553A - Zero sample classification method based on dynamic feature extraction and attribute correction - Google Patents

Zero sample classification method based on dynamic feature extraction and attribute correction

Info

Publication number
CN115546553A
CN115546553A (application CN202211268579.0A)
Authority
CN
China
Prior art keywords
attribute
feature extraction
features
attribute correction
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211268579.0A
Other languages
Chinese (zh)
Inventor
贺喆南
徐浚哲
吕建成
汤臣薇
江姗霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202211268579.0A priority Critical patent/CN115546553A/en
Publication of CN115546553A publication Critical patent/CN115546553A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a zero sample classification method based on dynamic feature extraction and attribute correction, which comprises the following steps: acquiring visual samples and semantic features; constructing a zero sample learning network based on dynamic feature extraction and attribute correction; feeding the visual samples and the semantic features into the network, computing a loss value from the visual sample features and the corrected semantic features, back-propagating the loss value, and repeating these steps until training is finished; verifying the trained zero sample learning network based on dynamic feature extraction and attribute correction, and, if its accuracy is higher than a preset value, proceeding to the next step, otherwise returning to the previous step; and classifying the data set with the trained network. The invention applies different feature extraction methods to attributes of different natures, introduces the concept of attribute correction, and thereby enhances the characterization capability of the network.

Description

Zero sample classification method based on dynamic feature extraction and attribute correction
Technical Field
The invention relates to the field of zero sample identification, in particular to a zero sample classification method based on dynamic feature extraction and attribute correction.
Background
In conventional deep-learning classification research, the training set contains samples for every label in the data set. The model can therefore learn the full distribution of the data from the training set, and its learning effect is verified by its prediction accuracy on the test set. In such a setting, the key premise for verifying the model is that the training set and the test set share the same label space. In some special application scenarios, however, training samples of certain categories are difficult to obtain, or the samples are difficult to label. Because the label information of these categories is absent, a model trained in advance cannot make predictions on them, which greatly limits the application range of deep learning models. To address prediction on new classes, the zero sample learning (zero-shot learning) task was proposed: it requires a model to accurately identify samples of classes never seen in the training set while still recognizing the classes that do appear in it. Enabling the model to acquire knowledge of unseen classes without observing any of their samples greatly widens the application range of deep learning and therefore has high research value.
To study zero sample learning, researchers have proposed and designed several data sets, each containing a large number of visual samples X. The classes of all visual samples consist of two parts. The visible classes are the classes the model can see during training; their number is N_s, and the visual samples belonging to them are denoted X_s. The invisible classes are used by the test set to measure the model's zero sample learning performance; their number is N_u, and the visual samples belonging to them are denoted X_u. Notably, the visible and invisible classes have no overlap and together cover all classes in the data set.
To let the model learn without any sample of a class, researchers introduced the concept of semantic features into the data sets, one semantic feature per class. The semantic features of all classes in a data set are denoted A, which splits into the semantic features of the visible classes and those of the invisible classes; K denotes the dimensionality of the semantic feature vector, and each dimension can be regarded as a specific attribute, so every semantic feature can be expressed as a combination of K attributes. When the zero sample learning model is trained, it can see the visual samples X_s of the visible classes and the semantics A of all classes, including the invisible ones. The aim of zero sample learning is to use the semantic features A as a bridge, so that from the relationship between the semantics of the visible and invisible classes the model learns the relationship between the corresponding visual samples and thereby makes accurate predictions on the invisible-class visual samples in the test set.
Currently, zero sample learning follows three main technical routes:
Prior art 1: learning algorithms based on cross-modal mapping. Visual samples, originally distributed in the visual space, and semantic features, distributed in the semantic space, are mapped into the same space; the distribution of the visual samples is aligned using the semantic features as center points; and in the test stage the invisible-class visual samples are mapped into this space for classification.
The drawback of this method is that the quality of the features extracted from the visual samples cannot be guaranteed: only the global features of the visual samples are aligned with the semantic features, while the extraction and understanding of the samples' local features is neglected, so redundant features of the visual samples interfere with the training of the model and ultimately degrade the performance of the algorithm.
Prior art 2: generation-based methods. These directly address the core problem of zero sample learning, namely that samples of the invisible classes are missing. By generating a large number of invisible-class samples with the semantics as a reference, the zero sample learning task is finally converted into a standard supervised learning task.
The main drawback of this technique is similar to that of prior art 1: global features are used as the feature expression of the visual samples for model training, and the importance of local features is ignored. High-quality generation of invisible-class samples requires the model to generate well the specific attributes related to the semantics, whereas the background parts unrelated to the semantics matter much less; a generation method based on global features does not take this into account, so the generation quality cannot be guaranteed.
Prior art 3: methods based on the attention mechanism. The semantics are decomposed into different attributes, features are extracted from the visual picture attribute by attribute, and the extracted attribute features serve as the feature expression of the picture and are aligned with the semantics. Because the semantics are combinations of different attributes and the attributes are shared across classes, invisible-class visual samples can be predicted well from the features extracted for each attribute.
Although this technical route is the first to consider the importance of local features, it still has two major drawbacks. The first is that the types of attributes are not treated in a targeted way. Semantic attributes can generally be divided into two categories: low-level, texture-based attributes, which usually describe the color or shape of specific parts of the subject of the visual sample and can easily be extracted by a model, and high-level abstract attributes that require understanding of the relevant content, such as the "grass" attribute of an animal, which cannot be captured by low-level texture. Existing schemes uniformly apply a feature extraction method designed for low-level texture attributes to all attributes and give no consideration to high-level abstract attributes. The second drawback is that existing techniques tend to regress toward fixed attribute values for prediction, whereas in practice the semantic expression may vary because different visual samples are taken from different angles and under different lighting. Describing all visual samples of one class with fixed attribute values therefore ignores the variation of the attribute features across visual samples, and the feature extraction effect ultimately suffers.
Disclosure of Invention
Aiming at the above defects in the prior art, the zero sample classification method based on dynamic feature extraction and attribute correction provided by the invention solves the problems that the prior art does not consider high-level abstract attributes and ignores the variation of attribute features across different visual samples.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that: a zero sample classification method based on dynamic feature extraction and attribute correction comprises the following steps:
S1, obtaining a visual sample x and semantic features α;
S2, constructing a zero sample learning network based on dynamic feature extraction and attribute correction;
S3, feeding the visual sample and the semantic features into the zero sample learning network based on dynamic feature extraction and attribute correction, obtaining the visual sample features and the corrected semantic features and building a loss function from them, calculating a loss value according to the loss function and back-propagating it as gradients, and repeating this step until training is finished;
S4, verifying the trained zero sample learning network based on dynamic feature extraction and attribute correction: if the accuracy is higher than a preset value, proceeding to step S5; otherwise, returning to step S3;
S5, classifying the data set with the trained zero sample learning network based on dynamic feature extraction and attribute correction.
Furthermore, the zero sample learning network based on dynamic feature extraction and attribute correction comprises a feature extraction backbone network, an attribute positioning network, an attribute correction network, a scale control unit and a loss value calculation module;
the first output end of the feature extraction backbone network is connected with the first input end of the attribute correction network; the second output end of the feature extraction backbone network is connected with the first input end of the attribute positioning network; the third output end of the feature extraction backbone network is connected with the input end of the scale control unit; a first output end of the scale control unit is connected with a second input end of the attribute correction network; a second output end of the scale control unit is connected with a second input end of the attribute positioning network; and the output end of the attribute positioning network and the output end of the attribute correction network are connected with a loss value calculation module.
Further, the specific implementation manner of step S3 is as follows:
s3-1, positioning the feature attributes of the visual sample through an attribute positioning network and extracting local features and global features;
s3-2, extracting local features and global features required by attribute correction through an attribute correction network;
s3-3, fusing local features and global features extracted by the attribute positioning network and the attribute correction network through the scale control unit to obtain an attribute correction value and visual sample features;
s3-4, correcting the semantic features according to the attribute correction values to obtain corrected semantic features;
s3-5, calculating a loss value according to the distance between the visual sample characteristic and the corrected semantic characteristic; returning the loss value, and updating the zero sample learning network parameters based on dynamic feature extraction and attribute correction.
Further, the specific implementation manner of step S3-1 is as follows:
S3-1-1, passing the visual sample x through the feature extraction backbone network to obtain a visual sample feature map whose shape is R^(C×H×W), where C represents the number of channels of the feature map, i.e., the feature dimension of each pixel; H represents the height of the feature map; W represents the width of the feature map; and R^(C×H×W) represents the shape of the data;
S3-1-2, according to the local-feature formula [formula image not reproduced in the text], obtaining the local features u_L of the visual sample, where i represents the height index of the feature map and j its width index; the attribute map represents the distribution of the attributes over the feature map, K represents the number of attributes, w represents the attention weight, and v represents the specific distribution value of the attributes; the softmax function normalizes the pixel values of the feature map on each channel to between 0 and 1; and φ_v and φ_w represent two convolution layers with a convolution kernel size of 1×1;
S3-1-3, according to the global-feature formula [formula image not reproduced in the text], obtaining the global features u_G of the visual sample, where i' represents the height index of the feature map and j' its width index.
Further, the specific implementation manner of step S3-2 is as follows:
according to the attribute-correction formula [formula image not reproduced in the text], obtaining the local feature t_L of each attribute and the global feature t_G of each attribute, where φ_r represents a convolution layer with a convolution kernel size of 1×1 that computes the attribute correction values; max_{c',d'} represents global max pooling; c' represents the height index of the feature map; and d' represents the width index of the feature map.
Further, the specific implementation manner of step S3-3 is as follows:
S3-3-1, according to the gating formula [formula image not reproduced in the text], obtaining the probability g of whether each attribute is a local attribute or a global attribute, where φ_s represents a convolution layer with a convolution kernel of 1×1; c represents the height index of the feature map; and d represents the width index of the feature map;
S3-3-2, according to the fusion formula [formula image not reproduced in the text], obtaining the attribute correction values and the visual sample feature ψ(x).
Further, the specific implementation manner of step S3-4 is as follows:
according to the semantic-correction formula [formula image not reproduced in the text], obtaining the corrected semantic feature π_m(α), where normalize indicates normalizing the vector length to 1; one term of the formula represents the value of the n-th dimension of the semantic feature of the m-th class, n = 1, 2, ..., K; and the other term represents the n-th dimension of the attribute correction values.
Further, the specific implementation manner of step S3-5 is as follows:
S3-5-1, according to the loss formulas [formula images not reproduced in the text], obtaining the classification loss and the distance loss, where N_B represents the batch size of the visual samples drawn in each round of learning; exp denotes the natural exponential; cos denotes cosine similarity; τ denotes the temperature coefficient; α_y denotes the semantic feature of the class to which sample x_p belongs; ||·||_2^2 denotes the square of the L2 norm; the class set appearing in the formula is the set of visible classes; and α_q is the semantic feature of the q-th visible class;
S3-5-2, according to the total-loss formula [formula image not reproduced in the text], obtaining the difference between the predicted value and the true value of the zero sample learning network based on dynamic feature extraction and attribute correction, i.e., the final loss function;
S3-5-3, calculating a loss value according to the loss function, back-propagating the gradients, and updating the parameters of the zero sample learning network based on dynamic feature extraction and attribute correction.
The invention has the beneficial effects that: the invention classifies semantic attributes and designs a comprehensive attribute feature extraction method. For low-level texture-based attributes, local feature extraction based on the attention mechanism is retained; for high-level abstract attributes based on content understanding, the global features of the visual sample are used as their feature expression. The local and global features are fused, with a gate control unit providing the weights that adjust the ratio of the two kinds of features for each attribute, finally realizing feature extraction for the visual sample. The invention further proposes the concept of attribute correction and designs an attribute correction module that modifies the attribute values so that they are closer to the real expression of the visual sample. The features extracted from the visual sample are aligned with the corrected attributes, which enhances the characterization capability of the network.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a network architecture according to the present invention;
FIG. 3 is a visualization of an attribute localization module attention mechanism.
Detailed Description
The following description of specific embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be clear that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are possible without departing from the spirit and scope of the invention as defined and determined by the appended claims, and all inventions and creations made using the inventive concept are under protection.
As shown in fig. 1, a zero sample classification method based on dynamic feature extraction and attribute correction includes the following steps:
S1, obtaining a visual sample x and semantic features α;
S2, constructing a zero sample learning network based on dynamic feature extraction and attribute correction;
S3, feeding the visual sample and the semantic features into the zero sample learning network based on dynamic feature extraction and attribute correction, obtaining the visual sample features and the corrected semantic features and building a loss function from them, calculating a loss value according to the loss function and back-propagating it as gradients, and repeating this step until training is finished;
S4, verifying the trained zero sample learning network based on dynamic feature extraction and attribute correction: if the accuracy is higher than a preset value, proceeding to step S5; otherwise, returning to step S3;
S5, classifying the data set with the trained zero sample learning network based on dynamic feature extraction and attribute correction (a sketch of this training-and-verification procedure is given below).
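To make steps S1 to S5 concrete, the following PyTorch-style sketch shows one possible training-and-verification loop. It is only an illustration under assumptions: the model is expected to return the visual sample features and the corrected semantics, loss_fn stands for a loss such as the one described in step S3-5, and the function name train_zero_shot, the accuracy threshold of 0.6 and the other hyperparameters are hypothetical and not specified by the patent.

```python
import torch
import torch.nn.functional as F

def train_zero_shot(model, loss_fn, train_loader, val_loader, semantics,
                    acc_threshold=0.6, max_rounds=100, lr=1e-4):
    """Hypothetical driver for steps S1-S5: train until the validation
    accuracy exceeds the preset threshold, then return the model."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_rounds):
        model.train()                                  # step S3: forward pass, loss, gradient return
        for x, y in train_loader:                      # visual samples x and visible-class labels y
            psi_x, corrected_sem = model(x, semantics)
            loss = loss_fn(psi_x, corrected_sem, y)    # scalar loss combining the two terms of S3-5
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        model.eval()                                   # step S4: verify the trained network
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                psi_x, corrected_sem = model(x, semantics)
                sims = F.cosine_similarity(psi_x.unsqueeze(1), corrected_sem, dim=-1)
                correct += (sims.argmax(dim=1) == y).sum().item()
                total += y.numel()
        if total and correct / total > acc_threshold:  # accuracy above the preset value
            return model                               # step S5: ready to classify the data set
    return model
```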
The specific implementation manner of step S3 is as follows:
s3-1, positioning the feature attributes of the visual sample through an attribute positioning network and extracting local features and global features;
s3-2, extracting local features and global features required by attribute correction through an attribute correction network;
s3-3, fusing local features and global features extracted by the attribute positioning network and the attribute correction network through the scale control unit to obtain an attribute correction value and visual sample features;
s3-4, correcting the semantic features according to the attribute correction values to obtain corrected semantic features;
s3-5, calculating a loss value according to the distance between the visual sample characteristic and the corrected semantic characteristic; returning a loss value, and updating zero sample learning network parameters based on dynamic feature extraction and attribute correction.
The specific implementation manner of the step S3-1 is as follows (a code sketch of one possible reading follows this step):
S3-1-1, passing the visual sample x through the feature extraction backbone network to obtain a visual sample feature map whose shape is R^(C×H×W), where C represents the number of channels of the feature map, i.e., the feature dimension of each pixel; H represents the height of the feature map; W represents the width of the feature map; and R^(C×H×W) represents the shape of the data;
S3-1-2, according to the local-feature formula [formula image not reproduced in the text], obtaining the local features u_L of the visual sample, where i represents the height index of the feature map and j its width index; the attribute map represents the distribution of the attributes over the feature map, K represents the number of attributes, w represents the attention weight, and v represents the specific distribution value of the attributes; the softmax function normalizes the pixel values of the feature map on each channel to between 0 and 1; and φ_v and φ_w represent two convolution layers with a convolution kernel size of 1×1;
S3-1-3, according to the global-feature formula [formula image not reproduced in the text], obtaining the global features u_G of the visual sample, where i' represents the height index of the feature map and j' its width index.
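Because the formula images for the local and global features are not reproduced in this text, the following sketch shows only one plausible reading of step S3-1 under the stated definitions: φ_w and φ_v are 1×1 convolutions, the softmax is taken over the spatial positions of each attribute channel, the local feature u_L is the attention-weighted sum of the value map, and the global feature u_G is taken here as the spatial average. The class name and the exact aggregation are assumptions, not the patent's formulas.

```python
import torch.nn as nn
import torch.nn.functional as F

class AttributeLocalization(nn.Module):
    """Illustrative reading of step S3-1: per-attribute local features via
    spatial softmax attention, plus per-attribute global features."""
    def __init__(self, channels: int, num_attributes: int):
        super().__init__()
        self.phi_w = nn.Conv2d(channels, num_attributes, kernel_size=1)  # attention weights w
        self.phi_v = nn.Conv2d(channels, num_attributes, kernel_size=1)  # attribute values v

    def forward(self, feat_map):                  # feat_map: backbone output of shape (B, C, H, W)
        attn = self.phi_w(feat_map).flatten(2)    # (B, K, H*W)
        attn = F.softmax(attn, dim=-1)            # normalize each attribute channel over positions (i, j)
        vals = self.phi_v(feat_map).flatten(2)    # (B, K, H*W)
        u_local = (attn * vals).sum(dim=-1)       # u_L: attention-weighted sum over i, j -> (B, K)
        u_global = vals.mean(dim=-1)              # u_G: average over all positions i', j' -> (B, K)
        return u_local, u_global, attn
```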
The specific implementation manner of the step S3-2 is as follows (a code sketch of one possible reading follows this step):
according to the attribute-correction formula [formula image not reproduced in the text], obtaining the local feature t_L of each attribute and the global feature t_G of each attribute, where φ_r represents a convolution layer with a convolution kernel size of 1×1 that computes the attribute correction values; max_{c',d'} represents global max pooling; c' represents the height index of the feature map; and d' represents the width index of the feature map.
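Similarly, the following sketch is one possible reading of step S3-2: φ_r is a 1×1 convolution producing per-attribute correction maps, t_G is obtained by global max pooling over positions c', d', and t_L is assumed here to reuse the spatial attention from step S3-1; that reuse is an assumption rather than something stated in the text.

```python
import torch.nn as nn

class AttributeCorrection(nn.Module):
    """Illustrative reading of step S3-2: per-attribute correction features."""
    def __init__(self, channels: int, num_attributes: int):
        super().__init__()
        self.phi_r = nn.Conv2d(channels, num_attributes, kernel_size=1)  # correction-value conv phi_r

    def forward(self, feat_map, attn):            # feat_map: (B, C, H, W); attn: (B, K, H*W) from S3-1
        r = self.phi_r(feat_map).flatten(2)       # (B, K, H*W)
        t_local = (attn * r).sum(dim=-1)          # t_L: attention-weighted local correction -> (B, K)
        t_global = r.max(dim=-1).values           # t_G: global max pooling over positions -> (B, K)
        return t_local, t_global
```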
the specific implementation manner of the step S3-3 is as follows:
s3-3-1, according to the formula:
Figure BDA0003894453150000103
obtaining the probability g of whether the attribute is a local attribute or a global attribute; wherein phi is s A convolutional layer representing a convolution kernel of 1 × 1; c represents the height of the feature map; d represents the width of the feature map;
s3-3-2, according to the formula:
Figure BDA0003894453150000104
obtaining attribute correction values
Figure BDA0003894453150000105
And a visual sample characteristic ψ (x).
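The scale control unit of step S3-3 can be read as a per-attribute gate. Consistent with the embodiment text (global average pooling followed by a sigmoid), the sketch below computes g from φ_s and uses it to mix the global and local branches of both the feature and the correction paths; the exact mixing formula is an assumption, since the formula images are not reproduced.

```python
import torch
import torch.nn as nn

class ScaleControlUnit(nn.Module):
    """Illustrative reading of step S3-3: a per-attribute gate g in [0, 1]
    mixes the global and local branches."""
    def __init__(self, channels: int, num_attributes: int):
        super().__init__()
        self.phi_s = nn.Conv2d(channels, num_attributes, kernel_size=1)  # gate conv phi_s

    def forward(self, feat_map, u_local, u_global, t_local, t_global):
        # g: probability that each attribute is a global attribute, shape (B, K);
        # global average pooling followed by a sigmoid, as in the embodiment text
        g = torch.sigmoid(self.phi_s(feat_map).mean(dim=(2, 3)))
        psi_x = g * u_global + (1.0 - g) * u_local       # visual sample feature psi(x)
        correction = g * t_global + (1.0 - g) * t_local  # attribute correction values
        return psi_x, correction, g
```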
The specific implementation manner of step S3-4 is as follows (a code sketch of one possible reading follows this step):
according to the semantic-correction formula [formula image not reproduced in the text], obtaining the corrected semantic feature π_m(α), where normalize indicates normalizing the vector length to 1; one term of the formula represents the value of the n-th dimension of the semantic feature of the m-th class, n = 1, 2, ..., K; and the other term represents the n-th dimension of the attribute correction values.
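For step S3-4 the text only states that the corrected semantics are normalized to unit length; how the correction values enter each dimension is in the missing formula image. The sketch below assumes a simple element-wise combination before normalization and is illustrative only.

```python
import torch.nn.functional as F

def correct_semantics(semantics, correction):
    """Illustrative reading of step S3-4.
    semantics:  (M, K) semantic features of the M classes
    correction: (B, K) attribute correction values from step S3-3
    Returns corrected semantics of shape (B, M, K), normalized to unit length.
    The element-wise addition is an assumption; only the normalization of the
    vector length to 1 is stated explicitly in the text."""
    corrected = semantics.unsqueeze(0) + correction.unsqueeze(1)  # combine per dimension n
    return F.normalize(corrected, p=2, dim=-1)                    # normalize vector length to 1
```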
The specific implementation manner of the step S3-5 is as follows (a code sketch of one possible reading follows this step):
S3-5-1, according to the loss formulas [formula images not reproduced in the text], obtaining the classification loss and the distance loss, where N_B represents the batch size of the visual samples drawn in each round of learning; exp denotes the natural exponential; cos denotes cosine similarity; τ denotes the temperature coefficient; α_y denotes the semantic feature of the class to which sample x_p belongs; ||·||_2^2 denotes the square of the L2 norm; the class set appearing in the formula is the set of visible classes; and α_q is the semantic feature of the q-th visible class;
S3-5-2, according to the total-loss formula [formula image not reproduced in the text], obtaining the difference between the predicted value and the true value of the zero sample learning network based on dynamic feature extraction and attribute correction, i.e., the final loss function;
S3-5-3, calculating a loss value according to the loss function, back-propagating the gradients, and updating the parameters of the zero sample learning network based on dynamic feature extraction and attribute correction.
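From the symbols listed for step S3-5 (batch size N_B, cosine similarity, temperature coefficient τ, squared L2 norm, visible-class semantics α_q), the two losses appear to be a temperature-scaled cosine-similarity classification loss over the visible classes and an L2 distance loss to the corrected semantics of the true class. The sketch below implements that reading; the equal-weight sum of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def zero_shot_loss(psi_x, corrected_sem, labels, tau=0.1):
    """Illustrative reading of step S3-5.
    psi_x:         (B, K) visual sample features
    corrected_sem: (B, M, K) corrected semantics of the M visible classes
    labels:        (B,) ground-truth visible-class indices
    Returns a scalar loss combining the classification and distance terms."""
    # classification loss: temperature-scaled cosine similarities over the visible classes
    sims = F.cosine_similarity(psi_x.unsqueeze(1), corrected_sem, dim=-1)   # (B, M)
    cls_loss = F.cross_entropy(sims / tau, labels)
    # distance loss: squared L2 distance to the corrected semantics of the true class
    target = corrected_sem[torch.arange(psi_x.size(0)), labels]             # (B, K)
    dist_loss = ((psi_x - target) ** 2).sum(dim=-1).mean()
    # equal-weight sum of the two terms (the weighting is an assumption)
    return cls_loss + dist_loss
```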
As shown in fig. 2, the zero-sample learning network based on dynamic feature extraction and attribute correction includes a feature extraction backbone network, an attribute positioning network, an attribute correction network, a scale control unit, and a loss value calculation module;
the first output end of the feature extraction backbone network is connected with the first input end of the attribute correction network; the second output end of the feature extraction backbone network is connected with the first input end of the attribute positioning network; the third output end of the feature extraction backbone network is connected with the input end of the scale control unit; a first output end of the scale control unit is connected with a second input end of the attribute correction network; a second output end of the scale control unit is connected with a second input end of the attribute positioning network; and the output end of the attribute positioning network and the output end of the attribute correction network are connected with a loss value calculation module.
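Putting the pieces of fig. 2 together, one possible end-to-end wiring is sketched below. It reuses the illustrative modules from the step-by-step sketches above and assumes a torchvision ResNet-101 backbone; neither the backbone choice nor the class name DFEACNet is mandated by the patent. A model built this way can be passed, together with the loss sketch of step S3-5, to the training loop sketched after steps S1 to S5 above.

```python
import torch.nn as nn
import torchvision

class DFEACNet(nn.Module):
    """Illustrative wiring of the network of fig. 2: backbone -> attribute
    positioning and attribute correction branches -> scale control unit ->
    semantic correction; the outputs are handed to the loss module."""
    def __init__(self, num_attributes: int):
        super().__init__()
        resnet = torchvision.models.resnet101(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # feature-map extractor
        self.localize = AttributeLocalization(2048, num_attributes)   # attribute positioning network
        self.correct = AttributeCorrection(2048, num_attributes)      # attribute correction network
        self.gate = ScaleControlUnit(2048, num_attributes)            # scale control unit

    def forward(self, x, semantics):
        feat_map = self.backbone(x)                                   # (B, 2048, H, W)
        u_local, u_global, attn = self.localize(feat_map)             # step S3-1
        t_local, t_global = self.correct(feat_map, attn)              # step S3-2
        psi_x, correction, _ = self.gate(feat_map, u_local, u_global,
                                         t_local, t_global)           # step S3-3
        corrected_sem = correct_semantics(semantics, correction)      # step S3-4
        return psi_x, corrected_sem                                   # used by the loss (step S3-5)
```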
As shown in fig. 3, SUN denotes a scene understanding dataset and CUB denotes a fine-grained bird classification dataset. It can be seen that the model localizes local features very accurately, for example different parts of a bird's body, or still water and fences in complex scenes. In addition, for complex attributes that require content understanding, such as the open field attribute in the SUN dataset, the model assigns a higher attention weight to the whole picture, which conforms to the definition of an open field.
In one embodiment of the invention, the softmax function normalizes the pixel values of the feature map on each channel to between 0 and 1, thereby representing the attention weight; pixels with high values are more important. Global max pooling can be regarded as a special case of the attention mechanism in which exactly one pixel has weight 1 and all other pixels have weight 0. The zero sample learning network based on dynamic feature extraction and attribute correction integrates the features of each attribute over the image through global average pooling to obtain a score judging whether the attribute is a global or a local attribute of the image, and finally a sigmoid function normalizes this score to between 0 and 1. The classification loss draws together, in cosine similarity, the visual sample features extracted by the attribute positioning module and the semantic features corrected by the attribute correction module; this is alignment at the level of the whole semantics. The distance loss directly requires the sample features and the corrected semantics to be identical in every dimension; this is alignment at the attribute level.
In terms of quantitative analysis, the method of the invention achieves higher prediction accuracy on the test set than the prior art, as shown in Table 1.
TABLE 1
[table image not reproduced in the text: prediction accuracy S on the visible classes, accuracy U on the invisible classes, and harmonic mean H, for the invention and for prior-art methods]
The zero sample learning task is quantified by three indicators: the prediction accuracy S of the zero sample learning network based on dynamic feature extraction and attribute correction on the visible classes, its prediction accuracy U on the invisible classes, and the harmonic mean H of the two accuracies; in general, the higher the harmonic mean, the better the overall performance of the algorithm. As can be seen from Table 1, the invention improves the harmonic mean of the accuracies considerably compared with the prior art, which demonstrates its superiority.
In terms of qualitative analysis, the visualization of the attention mechanism shows that the invention achieves a good effect on the key task of attribute feature extraction.
The invention classifies semantic attributes and designs a comprehensive attribute feature extraction method. For low-level texture-based attributes, local feature extraction based on the attention mechanism is retained; for high-level abstract attributes based on content understanding, the global features of the visual sample are adopted as their feature expression. The local and global features are fused, with a gate control unit providing the weights that adjust the ratio of the two kinds of features for each attribute, finally realizing feature extraction for the visual sample. The invention further proposes the concept of attribute correction and designs an attribute correction module that modifies the attribute values so that they are closer to the real expression of the visual sample. The features extracted from the visual sample are aligned with the corrected attributes, which enhances the characterization capability of the network.

Claims (8)

1. A zero sample classification method based on dynamic feature extraction and attribute correction is characterized by comprising the following steps:
S1, obtaining a visual sample x and semantic features α;
S2, constructing a zero sample learning network based on dynamic feature extraction and attribute correction;
S3, feeding the visual sample and the semantic features into the zero sample learning network based on dynamic feature extraction and attribute correction, obtaining the visual sample features and the corrected semantic features and building a loss function from them, calculating a loss value according to the loss function and back-propagating it as gradients, and repeating this step until training is finished;
S4, verifying the trained zero sample learning network based on dynamic feature extraction and attribute correction: if the accuracy is higher than a preset value, proceeding to step S5; otherwise, returning to step S3;
S5, classifying the data set with the trained zero sample learning network based on dynamic feature extraction and attribute correction.
2. The zero sample classification method based on the dynamic feature extraction and the attribute correction as claimed in claim 1, wherein the zero sample learning network based on the dynamic feature extraction and the attribute correction comprises a feature extraction backbone network, an attribute positioning network, an attribute correction network, a scale control unit and a loss value calculation module;
a first output end of the characteristic extraction backbone network is connected with a first input end of the attribute correction network; the second output end of the feature extraction backbone network is connected with the first input end of the attribute positioning network; the third output end of the feature extraction backbone network is connected with the input end of the scale control unit; a first output end of the scale control unit is connected with a second input end of the attribute correction network; a second output end of the scale control unit is connected with a second input end of the attribute positioning network; and the output end of the attribute positioning network and the output end of the attribute correction network are connected with a loss value calculation module.
3. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 2, wherein the specific implementation manner of step S3 is as follows:
s3-1, positioning the feature attributes of the visual sample through an attribute positioning network and extracting local features and global features;
s3-2, extracting local features and global features required by attribute correction through an attribute correction network;
s3-3, fusing local features and global features extracted by the attribute positioning network and the attribute correction network through the scale control unit to obtain an attribute correction value and visual sample features;
s3-4, correcting the semantic features according to the attribute correction values to obtain corrected semantic features;
s3-5, calculating a loss value according to the distance between the visual sample characteristic and the corrected semantic characteristic; returning a loss value, and updating zero sample learning network parameters based on dynamic feature extraction and attribute correction.
4. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 3, wherein the specific implementation manner of step S3-1 is as follows:
S3-1-1, passing the visual sample x through the feature extraction backbone network to obtain a visual sample feature map whose shape is R^(C×H×W), where C represents the number of channels of the feature map, i.e., the feature dimension of each pixel; H represents the height of the feature map; W represents the width of the feature map; and R^(C×H×W) represents the shape of the data;
S3-1-2, according to the local-feature formula [formula image not reproduced in the text], obtaining the local features u_L of the visual sample, where i represents the height index of the feature map and j its width index; the attribute map represents the distribution of the attributes over the feature map, K represents the number of attributes, w represents the attention weight, and v represents the specific distribution value of the attributes; the softmax function normalizes the pixel values of the feature map on each channel to between 0 and 1; and φ_v and φ_w represent two convolution layers with a convolution kernel size of 1×1;
S3-1-3, according to the global-feature formula [formula image not reproduced in the text], obtaining the global features u_G of the visual sample, where i' represents the height index of the feature map and j' its width index.
5. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 4, wherein the specific implementation manner of step S3-2 is as follows:
according to the attribute-correction formula [formula image not reproduced in the text], obtaining the local feature t_L of each attribute and the global feature t_G of each attribute, where φ_r represents a convolution layer with a convolution kernel size of 1×1 that computes the attribute correction values; max_{c',d'} represents global max pooling; c' represents the height index of the feature map; and d' represents the width index of the feature map.
6. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 5, wherein the specific implementation manner of step S3-3 is as follows:
S3-3-1, according to the gating formula [formula image not reproduced in the text], obtaining the probability g of whether each attribute is a local attribute or a global attribute, where φ_s represents a convolution layer with a convolution kernel of 1×1; c represents the height index of the feature map; and d represents the width index of the feature map;
S3-3-2, according to the fusion formula [formula image not reproduced in the text], obtaining the attribute correction values and the visual sample feature ψ(x).
7. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 6, wherein the specific implementation manner of step S3-4 is as follows:
according to the semantic-correction formula [formula image not reproduced in the text], obtaining the corrected semantic feature π_m(α), where normalize indicates normalizing the vector length to 1; one term of the formula represents the value of the n-th dimension of the semantic feature of the m-th class, n = 1, 2, ..., K; and the other term represents the n-th dimension of the attribute correction values.
8. The zero sample classification method based on dynamic feature extraction and attribute correction as claimed in claim 7, wherein the specific implementation manner of step S3-5 is as follows:
S3-5-1, according to the loss formulas [formula images not reproduced in the text], obtaining the classification loss and the distance loss, where N_B represents the batch size of the visual samples drawn in each round of learning; exp denotes the natural exponential; cos denotes cosine similarity; τ denotes the temperature coefficient; α_y denotes the semantic feature of the class to which sample x_p belongs; ||·||_2^2 denotes the square of the L2 norm; the class set appearing in the formula is the set of visible classes; and α_q is the semantic feature of the q-th visible class;
S3-5-2, according to the total-loss formula [formula image not reproduced in the text], obtaining the difference between the predicted value and the true value of the zero sample learning network based on dynamic feature extraction and attribute correction, i.e., the final loss function;
S3-5-3, calculating a loss value according to the loss function, back-propagating the gradients, and updating the parameters of the zero sample learning network based on dynamic feature extraction and attribute correction.
CN202211268579.0A 2022-10-17 2022-10-17 Zero sample classification method based on dynamic feature extraction and attribute correction Pending CN115546553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211268579.0A CN115546553A (en) 2022-10-17 2022-10-17 Zero sample classification method based on dynamic feature extraction and attribute correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211268579.0A CN115546553A (en) 2022-10-17 2022-10-17 Zero sample classification method based on dynamic feature extraction and attribute correction

Publications (1)

Publication Number Publication Date
CN115546553A (en) 2022-12-30

Family

ID=84736103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211268579.0A Pending CN115546553A (en) 2022-10-17 2022-10-17 Zero sample classification method based on dynamic feature extraction and attribute correction

Country Status (1)

Country Link
CN (1) CN115546553A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274717A (en) * 2023-10-24 2023-12-22 中国人民解放军空军预警学院 Ballistic target identification method based on global and local visual feature mapping network
CN117274717B (en) * 2023-10-24 2024-07-02 中国人民解放军空军预警学院 Ballistic target identification method based on global and local visual feature mapping network
CN117388893A (en) * 2023-12-11 2024-01-12 深圳市移联通信技术有限责任公司 Multi-device positioning system based on GPS
CN117388893B (en) * 2023-12-11 2024-03-12 深圳市移联通信技术有限责任公司 Multi-device positioning system based on GPS

Similar Documents

Publication Publication Date Title
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN109993072B (en) Low-resolution pedestrian re-identification system and method based on super-resolution image generation
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
CN110807434A (en) Pedestrian re-identification system and method based on combination of human body analysis and coarse and fine particle sizes
US20230162522A1 (en) Person re-identification method of integrating global features and ladder-shaped local features and device thereof
CN113177559B (en) Image recognition method, system, equipment and medium combining breadth and dense convolutional neural network
CN111738113A (en) Road extraction method of high-resolution remote sensing image based on double-attention machine system and semantic constraint
CN111738355A (en) Image classification method and device with attention fused with mutual information and storage medium
CN114119585B (en) Method for identifying key feature enhanced gastric cancer image based on Transformer
CN115761757A (en) Multi-mode text page classification method based on decoupling feature guidance
Pei et al. Consistency guided network for degraded image classification
CN112149689B (en) Unsupervised domain adaptation method and system based on target domain self-supervised learning
CN112712052A (en) Method for detecting and identifying weak target in airport panoramic video
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN115391625A (en) Cross-modal retrieval method and system based on multi-granularity feature fusion
CN114820655A (en) Weak supervision building segmentation method taking reliable area as attention mechanism supervision
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN110991374B (en) Fingerprint singular point detection method based on RCNN
CN112819837A (en) Semantic segmentation method based on multi-source heterogeneous remote sensing image
CN115546553A (en) Zero sample classification method based on dynamic feature extraction and attribute correction
CN113255701B (en) Small sample learning method and system based on absolute-relative learning framework
CN105740879B (en) The zero sample image classification method based on multi-modal discriminant analysis
CN111144466B (en) Image sample self-adaptive depth measurement learning method
CN110533074B (en) Automatic image category labeling method and system based on double-depth neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination