CN112200267A - Zero sample learning classification method based on multi-scale feature fusion - Google Patents

Zero sample learning classification method based on multi-scale feature fusion Download PDF

Info

Publication number
CN112200267A
CN112200267A (application number CN202011190644.3A)
Authority
CN
China
Prior art keywords
scale
features
fine
image
zero sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011190644.3A
Other languages
Chinese (zh)
Inventor
王国威
管乃洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202011190644.3A
Publication of CN112200267A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a zero sample learning classification method based on multi-scale feature fusion. A deep neural network model is used to extract original-scale image features; the feature map extracted by the deep neural network is reprocessed with a recalibration technique; a cropping technique is applied to the processed feature map to generate a fine-scale image; the features of the fine-scale image are then extracted with a parameter-shared deep neural network. The original-scale and fine-scale image features are fused, the fused features are projected into a semantic space and a latent space, and the parameters are optimized and updated with a softmax loss and a triplet loss, respectively. In the prediction stage, images of classes with given semantic information are classified and recognized using the mapping matrix W obtained in the training stage. Because the depth features extracted from the original-scale and fine-scale images participate in training simultaneously, the fused features contain multi-scale information, which improves the robustness of the model and the classification accuracy.

Description

Zero sample learning classification method based on multi-scale feature fusion
Technical Field
The invention relates to the field of deep learning and zero sample learning, and in particular to a zero sample learning classification method based on multi-scale feature fusion.
Background
In recent years, deep learning has grown at an unprecedented pace, driven by the increasing complexity of algorithms and models, the growing computing power of machines, and the availability of large-scale data. In fact, the successful application of deep learning algorithms is inseparable from the support of large amounts of data: for a deep-learning-based classification model to reach high accuracy, a large number of high-quality, manually labeled samples are required for training.
For some common categories, a large number of images can be acquired relatively easily, by collecting them from the web or photographing them in the field. For some rare categories, such as endangered species, not only are the populations extremely small, but the species sometimes live in extreme natural environments, which increases the difficulty of image acquisition and makes it hard to collect large numbers of high-quality samples. For newly emerging classes, the sample size is simply zero. In either case, it is impractical to obtain a data volume sufficient to improve model accuracy, and labeling samples can be prohibitively expensive and time-consuming. Moreover, conventional deep-learning systems can only recognize classes seen in the training phase and cannot recognize a class that never occurred during training.
The most widely studied family of zero sample learning algorithms is the compatibility-based one. It projects the sample image into a visual space and the sample's category into a semantic space, and reduces the discrepancy between the two spaces as much as possible, improving their compatibility; in the end-to-end variant, the original images are used directly for training.
Disclosure of Invention
To address the above deficiencies, the present invention provides a zero sample learning classification method based on multi-scale feature fusion.
The invention is realized by constructing a zero sample learning classification method based on multi-scale feature fusion, comprising the following steps:
s1, extracting the features of the original image by using a deep neural network to obtain a feature map of original size features;
s2, obtaining a fine-scale image by using a recalibration positioning and cutting combination technology for the original image and the feature map obtained in the previous step;
s3, extracting the features of the fine-scale image by using a parameter-shared deep neural network to obtain fine-scale features, and fusing the two features;
s4, projecting the fusion features obtained in the previous step to a semantic space and a hidden space, and respectively adopting softmax loss and triplet loss to carry out parameter optimization;
and S5, repeating the above steps over a number of training epochs to finally obtain a zero sample learning model with strong representation capability.
Further, the cropping is used to perform a cropping operation on the original image, and the cropped area generally contains the whole of the object or a part of the object.
Furthermore, the fine-scale image is an area which reserves the whole original image object and is rich in semantic information.
Further, a target area is automatically cropped based on the recalibrated feature map.
Further, the output computed from the recalibrated feature map is a group of values containing three parameters, which respectively represent the coordinates of the center point of the region to be cropped on the original image and the side length of the region to be cropped.
Further, the target area is designed to be a square.
Furthermore, the image obtained by clipping is converted into the same size as the original image again by bilinear interpolation.
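Purely as an illustration and not part of the original disclosure, the following PyTorch-style sketch shows one way steps S1 to S4 could be wired together. The module interfaces are assumptions: backbone is assumed to return both pooled features and a spatial feature map, and recalibrate and cropper stand for the recalibration and cropping operations detailed in the description below.

```python
import torch
import torch.nn as nn

class MultiScaleZSL(nn.Module):
    """Illustrative sketch of steps S1-S4; module interfaces are assumptions."""
    def __init__(self, backbone, recalibrate, cropper,
                 feat_dim: int, attr_dim: int, latent_dim: int):
        super().__init__()
        self.backbone = backbone        # shared-parameter CNN used at both scales (S1, S3)
        self.recalibrate = recalibrate  # SENet-style channel recalibration (S2)
        self.cropper = cropper          # attention-based crop-and-resize module (S2)
        self.W = nn.Linear(2 * feat_dim, attr_dim, bias=False)  # visual-semantic projection
        self.sigma = nn.Linear(2 * feat_dim, latent_dim)        # visual-latent projection

    def forward(self, x):
        feat_orig, fmap = self.backbone(x)                      # S1: original-scale features
        x_fine, box = self.cropper(x, self.recalibrate(fmap))   # S2: fine-scale image
        feat_fine, _ = self.backbone(x_fine)                    # S3: parameter-shared extraction
        fused = torch.cat([feat_orig, feat_fine], dim=1)        # S3: feature fusion
        return self.W(fused), self.sigma(fused), box            # S4: semantic/latent projections
```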
The invention has the following advantages: through the above improvements, the zero sample learning classification method based on multi-scale feature fusion provides the following beneficial effects compared with methods of the same type:
Advantage 1: by training directly on original-image samples, the model is better adapted to the zero sample learning classification task.
Advantage 2: the attention-based repositioning and cropping techniques are used together: on the basis of the original image, the feature map extracted by the deep neural network is processed to obtain a fine-scale image, fine-scale features are extracted by passing it through the deep neural network again, and the two kinds of features are fused. The fused features greatly improve the representation capability of the model and thus its accuracy.
Advantage 3: the image features are projected into a semantic space and a latent space, and the parameters are optimized and updated with a softmax loss and a triplet loss, respectively. Compared with traditional methods, the learned features are constrained to be discriminative during training, which improves the model's learning capability.
Drawings
Fig. 1 is an explanatory view of the present invention.
Wherein: 1, deep neural network; 2, original image; 3, original-size features; 4, recalibration; 5, cropping; 6, fine-scale image; 7, parameter sharing; 8, semantic space; 9, hidden space; 10, fine-scale features.
Detailed Description
The present invention will be described in detail with reference to fig. 1, and the technical solutions in the embodiments of the present invention will be clearly and completely described, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Basic setting:
Suppose there is a training set composed of N_s samples, D^s = {(x_i^s, y_i^s)}_{i=1}^{N_s}, where x_i^s represents the i-th image of the seen classes and y_i^s ∈ Y^s represents its corresponding class label. The test set consists of N_u samples, D^u = {(x_j^u, y_j^u)}_{j=1}^{N_u}, where x_j^u represents the j-th image of the unseen classes and y_j^u ∈ Y^u represents its class label. The seen and unseen classes are disjoint, Y^s ∩ Y^u = ∅, and Y^s ∪ Y^u = Y. Let ψ(x) = θ(x)^T W denote the mapping of the visual features into the semantic space, regarded as the predicted semantic vector, where θ(x) denotes the visual features and W the mapping matrix. Let σ(x) denote the predicted latent feature vector.
Recalibration operation:
The original image is subjected to a cropping operation, the cropped area typically containing the whole of the object or parts of it. Since the target area should reflect as much attribute information as possible, the goal is to crop out an area that preserves the whole object and is rich in semantic information. To this end, drawing on the SENet approach, a recalibrated feature map is obtained through a set of global scaling operations. Specifically, the first step describes the channel information by computing a set of inter-channel descriptors through global average pooling,

p = [p_1, p_2, ..., p_C] ∈ R^C,   p_c = (1/(H×W)) Σ_{i=1}^{H} Σ_{j=1}^{W} b_c(i, j),

where b_c is the c-th channel of the feature map B. Second, the inter-channel descriptors are passed through two fully connected layers to capture the interrelations between the feature channels,

a = ρ(W_2 f(W_1 p)) ∈ R^C,

where f(·) and ρ(·) denote the ReLU and Sigmoid activation functions, respectively. Because the Sigmoid activation allows a_c to model non-mutually-exclusive relationships among the channels, a is used to re-weight the original feature channels,

M = ρ(F_fc(p)) ⊙ B,

where F_fc(·) denotes the two fully connected layers, ρ(·) the Sigmoid activation function, and ⊙ channel-wise multiplication. The recalibrated feature map strengthens the originally important feature-channel information and weakens the less important channel information, so the feature map carries a more pronounced semantic meaning, ensuring that the subsequent cropping operation starts from a semantically salient target area.
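A minimal PyTorch sketch of this recalibration step, assuming an SENet-style reduction ratio r (not specified in the text):

```python
import torch
import torch.nn as nn

class Recalibration(nn.Module):
    """SENet-style channel recalibration; reduction ratio r is an assumed hyperparameter."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),  # W1
            nn.ReLU(inplace=True),               # f(.)
            nn.Linear(channels // r, channels),  # W2
        )

    def forward(self, b: torch.Tensor) -> torch.Tensor:
        # b: (N, C, H, W) feature map B
        p = b.mean(dim=(2, 3))                    # global average pooling -> p in R^C
        a = torch.sigmoid(self.fc(p))             # a = rho(W2 f(W1 p)) in R^C
        return b * a.unsqueeze(-1).unsqueeze(-1)  # channel-wise re-weighting -> M
```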
Cropping operation:
After the recalibrated feature map is available, the target area is automatically cropped on the feature map. The input of this operation is the recalibrated feature map, and the output is a set of three parameters representing the coordinates of the center point of the area to be cropped on the original image and its side length. To improve computational efficiency, the target area is designed as a square, which can be expressed as

[t_x, t_y, t_l] = F_CM(M),

where t_x and t_y are the x-axis and y-axis coordinates, respectively, t_l is the side length of the square region, and F_CM denotes the cropping function. Once the recalibrated feature map M is obtained, a finer-scale image can be cropped from the original-scale image. Specifically, a two-dimensional continuous mask V(x, y) = V_x · V_y is first generated, where

V_x = f(x - t_x + 0.5 t_l) - f(x - t_x - 0.5 t_l),
V_y = f(y - t_y + 0.5 t_l) - f(y - t_y - 0.5 t_l).

Here the non-linear mapping

f(x) = 1 / (1 + exp(-kx))

is used with k set to 10. The cropped area is then obtained by multiplying the mask with the original image,

x_crop = x ⊙ V.

Afterwards, the cropped image is rescaled to the size of the original image by bilinear interpolation. Learning on the finer-scale image is important for obtaining finer, discriminative visual features, and compensates for the performance shortfall on fine-grained image recognition.
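The cropping step could be sketched in PyTorch as follows; pixel-coordinate crop parameters, the clamping of box bounds, and the per-sample loop are implementation assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def soft_crop(x: torch.Tensor, tx: torch.Tensor, ty: torch.Tensor,
              tl: torch.Tensor, k: float = 10.0) -> torch.Tensor:
    """x: (N, C, H, W) images; tx, ty, tl: (N,) crop parameters in pixels (assumed)."""
    N, C, H, W = x.shape
    xs = torch.arange(W, device=x.device, dtype=x.dtype).view(1, 1, W)
    ys = torch.arange(H, device=x.device, dtype=x.dtype).view(1, H, 1)

    def f(z):  # f(z) = 1 / (1 + exp(-k z))
        return torch.sigmoid(k * z)

    tx_, ty_, tl_ = (t.view(-1, 1, 1) for t in (tx, ty, tl))
    Vx = f(xs - tx_ + 0.5 * tl_) - f(xs - tx_ - 0.5 * tl_)
    Vy = f(ys - ty_ + 0.5 * tl_) - f(ys - ty_ - 0.5 * tl_)
    V = (Vy * Vx).unsqueeze(1)   # (N, 1, H, W) mask, ~1 inside the square, ~0 outside
    x_masked = x * V             # x_crop = x (elementwise) V

    out = []
    for n in range(N):           # crop each sample's square and upsample it
        x0 = int((tx[n] - 0.5 * tl[n]).clamp(0, W - 2))
        y0 = int((ty[n] - 0.5 * tl[n]).clamp(0, H - 2))
        x1 = max(x0 + 1, int((tx[n] + 0.5 * tl[n]).clamp(1, W)))
        y1 = max(y0 + 1, int((ty[n] + 0.5 * tl[n]).clamp(1, H)))
        patch = x_masked[n:n + 1, :, y0:y1, x0:x1]
        # rescale the cropped region back to the original resolution (bilinear)
        out.append(F.interpolate(patch, size=(H, W), mode='bilinear',
                                 align_corners=False))
    return torch.cat(out, dim=0)
```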
Feature fusion and hybrid projection:
A hybrid projection mode is adopted: the image features are projected into two spaces simultaneously, exploiting both the plausibility of manually defined attributes and the discriminativeness of latent features. There are therefore two mappings, a visual-semantic mapping and a visual-latent mapping.
For the visual-semantic mapping, let φ(y) be the semantic feature (attribute) vector of category y; the compatibility score can then be defined as

s(x, y) = θ(x)^T W φ(y),

where θ(x) represents the visual features and W the visual-semantic mapping matrix to be learned. Treating the compatibility scores as the logits of a softmax, the softmax loss can be expressed as

L_att = -(1/N_s) Σ_{i=1}^{N_s} log [ exp(s(x_i, y_i)) / Σ_{c∈Y^s} exp(s(x_i, c)) ].

For the visual-latent mapping, a triplet loss is adopted to minimize the intra-class distance and maximize the inter-class distance, yielding discriminative latent features,

L_lat = Σ_{(i,j,k)} max(0, ||σ(x_i) - σ(x_j)||_2^2 - ||σ(x_i) - σ(x_k)||_2^2 + mrg),

where x_i, x_j, x_k are the anchor, a positive-class sample, and a negative-class sample, respectively, and mrg represents the separation margin, set to 1.0.
The feature map M of the cropping module is passed through two fully connected layers to obtain the three parameters [t_x, t_y, t_l], where t_x, t_y represent the coordinates of the target area and t_l the side length of the square region. Taking the peak-response coordinate of the image as the center and t_l as the side length defines the ground-truth target area [r_x, r_y, r_l], and the cropping module is trained with an MSE loss,

L_mse = || [t_x, t_y, t_l] - [r_x, r_y, r_l] ||_2^2.

Combining the loss functions of the visual-semantic mapping, the visual-latent mapping, and the cropping network, the overall loss function can be expressed as

L = L_att + α L_lat + β L_mse,

where α and β are balance factors, both set to 1.0.
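A hedged PyTorch sketch of the overall loss L = L_att + α L_lat + β L_mse; the in-batch triplet sampler make_triplets is a hypothetical helper, and cross-entropy over the compatibility logits realizes the softmax loss above:

```python
import torch
import torch.nn.functional as F

def total_loss(sem_pred, latent, boxes, labels, attributes, gt_boxes,
               make_triplets, alpha: float = 1.0, beta: float = 1.0,
               mrg: float = 1.0) -> torch.Tensor:
    """sem_pred: (N, A) predicted semantic vectors psi(x); latent: (N, D) sigma(x);
    boxes, gt_boxes: (N, 3) crop parameters [tx, ty, tl]; attributes: (C, A) phi(y)."""
    # L_att: softmax loss over compatibility scores s(x, y) = psi(x) . phi(y)
    logits = sem_pred @ attributes.t()                 # (N, C) compatibility logits
    l_att = F.cross_entropy(logits, labels)
    # L_lat: triplet loss on the latent projections (sampler is a stand-in)
    anchor, pos, neg = make_triplets(latent, labels)
    l_lat = F.triplet_margin_loss(anchor, pos, neg, margin=mrg)
    # L_mse: regress crop parameters toward the peak-response target region
    l_mse = F.mse_loss(boxes, gt_boxes)
    return l_att + alpha * l_lat + beta * l_mse
```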
Zero sample learning prediction:
Because a hybrid mapping mode is used in the training stage, i.e. the visual-semantic mapping and the visual-latent mapping are learned simultaneously, prediction in the testing stage is correspondingly hybrid. For the visual-semantic mapping, given a test image x whose projection in the semantic space is ψ(x), the goal is to assign it a class label,

ŷ = argmax_{u∈Y^u} s(ψ(x), φ(u)).

For the visual-latent mapping, given a test image x whose projection in the latent space is σ(x), the prototype of a seen class c is the mean of its latent features,

h_c = (1/N_c) Σ_{i: y_i = c} σ(x_i).

For an unseen class u, its relationship to all the seen classes is first computed in the semantic space,

β_{uc} = exp(s(φ(u), φ(c))) / Σ_{c'∈Y^s} exp(s(φ(u), φ(c'))).

Assuming that the unseen class u keeps in the latent space the same relationships it has in the semantic space, its latent prototype is

h_u = Σ_{c∈Y^s} β_{uc} h_c.

The overall hybrid prediction can then be expressed as

ŷ = argmax_{u∈Y^u} [ s(ψ(x), φ(u)) + s(σ(x), h_u) ],

where s(·, ·) is a compatibility function.
The parameters and English abbreviations have the following meanings:
N_s, N_u: numbers of seen-class (training) and unseen-class (test) samples
θ(x): visual features extracted by the deep neural network
W: visual-semantic mapping matrix
ψ(x) = θ(x)^T W: predicted semantic vector
σ(x): predicted latent feature vector
φ(y): semantic feature (attribute) vector of category y
t_x, t_y, t_l: center coordinates and side length of the square region to be cropped
r_x, r_y, r_l: parameters of the ground-truth target area
mrg: separation margin of the triplet loss (set to 1.0)
α, β: balance factors of the overall loss (set to 1.0)
the invention provides a zero sample learning classification method based on multi-scale feature fusion through improvement, and the working principle is as follows;
firstly, extracting features from an original image 2 through a deep neural network 1 to obtain a feature map of original size features 3, wherein the features are called original scale features;
secondly, carrying out recalibration 4 positioning and cutting 5 combined technology on the original scale features to obtain a fine scale image 6;
thirdly, obtaining a fine-scale feature 10 through the fine-scale image 6 by the deep neural network 1 with the parameter sharing 7 and fusing the fine-scale feature with the original-scale feature;
fourthly, projecting the obtained fusion features to a semantic space 8 and a hidden space 9, and respectively adopting softmax loss and triplet loss to carry out parameter optimization.
Fifthly, repeating the steps, setting a plurality of periods for training, and finally obtaining a zero sample learning model with strong characterization capability.
The zero sample learning classification method based on multi-scale feature fusion is provided through improvement, and a model is more adaptive to a zero sample learning classification task by directly training samples of an original image; the repositioning technology and the cutting technology based on the attention mechanism are used in a matched mode, on the basis of an original image, a feature map extracted through a deep neural network is processed to obtain a fine-scale image, fine-scale features are obtained through the deep neural network again, the two features are fused, the characterization capability of the model is greatly improved through the fusion features, and the model precision is improved; compared with the traditional method, the method has the advantages that the learned feature points are restrained to have identifiability in the training stage, and the model learning capacity is improved.

Claims (7)

1. A zero sample learning classification method based on multi-scale feature fusion, characterized by comprising the following steps:
s1, extracting the features of the original image (2) by using the deep neural network (1) to obtain a feature map of the original size features (3);
s2, obtaining a fine-scale image (6) by using a recalibration (4) positioning and cutting (5) combination technology for the original image (2) and the feature map obtained in the previous step;
s3, extracting the features of the fine-scale image (6) by using the parameter-sharing (7) deep neural network (1) to obtain fine-scale features (10), and fusing the two kinds of features;
s4, projecting the fusion features obtained in the previous step to a semantic space (8) and a hidden space (9), and respectively performing parameter optimization by adopting softmax loss and triplet loss;
and S5, repeating the above steps over a number of training epochs to finally obtain a zero sample learning model with strong representation capability.
2. The zero sample learning classification method based on the multi-scale feature fusion as claimed in claim 1, characterized in that: the cropping (5) is used to perform a cropping operation on the original image, the cropped area typically containing the whole of the object or parts of the object.
3. The zero sample learning classification method based on the multi-scale feature fusion as claimed in claim 1, characterized in that: the fine-scale image (6) is an area which reserves the whole object of the original image (2) and is rich in semantic information.
4. The zero sample learning classification method based on the multi-scale feature fusion as claimed in claim 1, characterized in that: a target area is automatically cropped from the feature map after recalibration (4).
5. The zero sample learning classification method based on the multi-scale feature fusion as claimed in claim 1, characterized in that: the output computed from the feature map after recalibration (4) is a group of values containing three parameters, which respectively represent the coordinates of the center point of the area to be cropped on the original image (2) and the side length of the area to be cropped.
6. The zero sample learning classification method based on multi-scale feature fusion as claimed in claim 4, wherein: the target area is designed as a square.
7. The zero sample learning classification method based on multi-scale feature fusion as claimed in claim 4, wherein: the cropped image is re-transformed to the same size as the original image (2) by bilinear interpolation.
CN202011190644.3A 2020-10-30 2020-10-30 Zero sample learning classification method based on multi-scale feature fusion Withdrawn CN112200267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011190644.3A CN112200267A (en) 2020-10-30 2020-10-30 Zero sample learning classification method based on multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011190644.3A CN112200267A (en) 2020-10-30 2020-10-30 Zero sample learning classification method based on multi-scale feature fusion

Publications (1)

Publication Number Publication Date
CN112200267A (en) 2021-01-08

Family

ID=74012167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011190644.3A Withdrawn CN112200267A (en) 2020-10-30 2020-10-30 Zero sample learning classification method based on multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN112200267A (en)


Legal Events

Date Code Title Description
PB01 Publication
WW01 Invention patent application withdrawn after publication (application publication date: 20210108)