CN111325237A - Image identification method based on attention interaction mechanism - Google Patents

Image identification method based on attention interaction mechanism

Info

Publication number
CN111325237A
Authority
CN
China
Prior art keywords
feature
image
features
gate
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010070791.0A
Other languages
Chinese (zh)
Other versions
CN111325237B (en)
Inventor
乔宇
庄培钦
王亚立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202010070791.0A
Publication of CN111325237A
Application granted
Publication of CN111325237B
Active legal status
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Abstract

The invention provides an image recognition method based on an attention interaction mechanism, which uses a pre-trained image recognition model to classify a picture under test. The training process of the image recognition model comprises the following steps: for each of N image categories, selecting K images and inputting them into a convolutional neural network for feature extraction to obtain a plurality of image features; constructing image feature pairs according to the similarity between different image features; extracting a common feature vector from each constructed image feature pair through common feature learning; calculating, based on the common feature vector, a gate feature vector corresponding to each feature in the image feature pair; and feeding the combination of each feature in the image feature pair with the gate feature vectors into a classifier, optimizing according to a set loss function, and thereby obtaining the trained convolutional neural network and classifier. The invention improves the accuracy of image recognition and is particularly suitable for fine-grained image recognition.

Description

Image identification method based on attention interaction mechanism
Technical Field
The invention relates to the technical field of computer vision, and in particular to an image recognition method based on an attention interaction mechanism.
Background
In recent years, deep learning methods have achieved major breakthroughs in computer vision, most notably in image recognition tasks. Within image recognition, however, progress on fine-grained (sub-category) recognition remains limited. Compared with general object recognition, the difficulty of fine-grained image recognition lies mainly in two aspects: 1) the categories in such datasets are extremely fine, so images in adjacent sub-categories are highly similar and differ only in subtle visual details that are hard to find and distinguish; 2) because of factors such as lighting, viewing angle, and pose during image acquisition, images within the same category can differ greatly. Fine-grained images thus exhibit small inter-class differences and large intra-class differences, which makes the recognition task challenging. The need for fine-grained recognition arises frequently in biological species identification tasks, where categories form a taxonomic hierarchy.
In the prior art, fine-grained image recognition methods generally follow three main ideas. 1) Key-part localization. Since images of similar categories in fine-grained tasks differ only slightly and are not easily distinguished, features with high discriminative power must be selected for the final classification. These methods aim to automatically localize several key parts in the image and extract features from those local regions; however, because experiments usually provide only weak supervision (image-level labels), their ability to localize key parts is limited. 2) High-order feature learning. Because the image content in fine-grained tasks is complex and varied while conventional feature extractors have limited expressive power, these methods try to improve the expressiveness of the features and thereby the capability of the algorithm. 3) Metric-learning-based methods. Since fine-grained images have small inter-class differences and large intra-class differences, metric learning is expected to improve this situation; however, it only improves the distribution of samples in the feature space and lacks the ability to discover differences between samples, so it cannot substantially improve recognition performance.
Because the differences between similar images in fine-grained recognition are subtle, existing methods take corresponding measures against the complexity of fine-grained image content. For example, some construct high-order image features to increase the expressiveness and quality of the features, thereby improving recognition performance; others use detection and segmentation techniques to find important local regions in the original image and extract features from those key regions. However, existing methods all model a single image in isolation, so they cannot discover the differing parts between two similar images and cannot truly and efficiently find highly discriminative image regions.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide an image recognition method based on an attention interaction mechanism which, by mimicking the human cognitive process of comparing two highly similar images, discovers the differences within image pairs and thereby distinguishes the images accurately.
According to a first aspect of the present invention, a method of constructing an image recognition model based on an attention interaction mechanism is provided. The method comprises the following steps:
for each of N image categories, selecting K images, inputting the K images into a convolutional neural network for feature extraction, and obtaining a plurality of image features, wherein N and K are integers greater than or equal to 2;
establishing an image feature pair according to the similarity between different image features;
extracting common feature vectors from the constructed image feature pairs through common feature learning;
calculating a gate feature vector corresponding to each feature in the image feature pair based on the common feature vector;
and inputting the features obtained by combining each feature in the image feature pair with the gate feature vectors into a classifier, and optimizing according to a set loss function to obtain a trained convolutional neural network and a trained classifier.
In one embodiment, constructing image feature pairs according to the similarity between different image features comprises: for each image feature $x_1$, finding the nearest image feature within its class and the nearest one among the other classes according to Euclidean distance, each denoted $x_2$, thereby forming 2 × N × K image feature pairs.
In one embodiment, extracting the common feature vector comprises:
concatenating the image feature pair $x_1$ and $x_2$ and feeding the concatenated feature through several fully connected layers to obtain the common feature vector, expressed as:
$x_m = f_m([x_1, x_2])$.
In one embodiment, calculating the gate feature vector corresponding to each feature in the image feature pair comprises:
multiplying the common feature vector $x_m$ element-wise with each feature of the image feature pair and normalizing through a sigmoid function to obtain the corresponding gate feature vectors, expressed as:
$g_i = \mathrm{sigmoid}(x_m \odot x_i), \quad i \in \{1, 2\}$.
In one embodiment, the features obtained by combining each feature in the image feature pair with the gate feature vectors take four forms:

$x_1^{self} = x_1 + x_1 \odot g_1$
$x_2^{self} = x_2 + x_2 \odot g_2$
$x_1^{other} = x_1 + x_1 \odot g_2$
$x_2^{other} = x_2 + x_2 \odot g_1$

wherein $x_i^{self}$ denotes the result of combining an image feature with its own gate feature vector, $x_i^{other}$ denotes the result of combining an image feature with the other image's gate feature vector, and $g_1, g_2$ denote the gate feature vectors.
In one embodiment, the loss function is set to:

$L_{ce} = -\sum_{i \in \{1,2\}} \sum_{j \in \{self, other\}} y_i^{\top} \log p_i^{j}$

wherein $y_i$ reflects the true classification label and $p_i^{j}$ represents the classification probability vector output by the classifier.
In one embodiment, the loss function is set to:

$L_{rk} = \sum_{i \in \{1,2\}} \max\left(0,\ p_i^{other}(c_i) - p_i^{self}(c_i) + \epsilon\right)$

wherein $p_i^{j}(c_i)$ denotes the score of the probability vector $p_i^{j}$ on the $c_i$-th class and $\epsilon$ denotes a threshold.
According to a second aspect of the present invention, an image recognition method based on an attention interaction mechanism is provided. The method comprises the following steps:
feeding a single picture into the trained convolutional neural network of the invention, extracting the corresponding image feature $x^*$, and feeding $x^*$ into the trained classifier to obtain the final classification result.
Compared with the prior art, the invention has the advantage that it remedies the limitation of existing methods, which model only a single picture and therefore neglect to discover the differences between image pairs.
Drawings
The invention is illustrated and described in the following drawings by way of example only and without limiting its scope:
FIG. 1 is a flow diagram of an image recognition method based on an attention interaction mechanism, according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a common feature vector learning module according to one embodiment of the invention;
FIG. 3 is a schematic diagram of an attention interaction mechanism, according to one embodiment of the invention;
FIG. 4 is a schematic diagram of an image recognition system based on an attention interaction mechanism, according to one embodiment of the present invention;
FIG. 5 is a schematic diagram of a terminal device according to one embodiment of the invention;
FIG. 6 is a schematic diagram of an application embodiment according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, design methods, and advantages of the present invention more apparent, the present invention will be further described in detail by specific embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not as a limitation. Thus, other examples of the exemplary embodiments may have different values.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
The image recognition method based on the attention interaction mechanism operates on image pairs and discovers the feature differences within a pair through comparison, so that the two images can be distinguished correctly. In brief, the method takes a pair of similar images as input simultaneously. It first constructs a common (mutual) feature vector that contains the contrastive semantic features of the image pair; it then multiplies each image feature element-wise with the common feature vector and normalizes the result to generate a gate feature vector that locates channels carrying highly discriminative semantic features; finally, the original image features interact with the gate feature vectors to improve the classifier's sensitivity to subtle differences between features.
Specifically, referring to fig. 1, an image recognition method provided by the embodiment of the present invention includes the following steps:
In step S110, for each of a plurality of picture categories, several pictures are randomly selected.
First, N categories are randomly selected from the database, and for each category K pictures are randomly selected, i.e. N × K pictures are chosen as the input of each batch. Compared with fully random selection, choosing the input pictures per batch with this category-aware strategy helps to guarantee the diversity of data within the same batch.
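For illustration only, the following minimal sketch (in Python, with hypothetical names such as `sample_episode` and a flat `labels` list; none of this is prescribed by the patent) shows one way to implement the N-category, K-picture batch selection strategy:

```python
# Illustrative sketch of the N x K batch sampling strategy; `labels` is a
# hypothetical flat list of class ids, one per dataset index.
import random
from collections import defaultdict

def sample_episode(labels, N, K):
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= K]
    classes = random.sample(eligible, N)             # N random categories
    batch = []
    for c in classes:
        batch.extend(random.sample(by_class[c], K))  # K pictures per category
    return batch  # N*K dataset indices forming one batch
```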
In step S120, the selected pictures are input into a convolutional neural network for feature extraction.
Specifically, the selected pictures are input into the convolutional neural network, and the image feature $x \in \mathbb{R}^D$ is obtained through the network's final global average pooling (GAP) operation, where D is the dimension of the feature. For example, a ResNet50 network or another type of convolutional neural network can be selected for feature extraction according to the complexity of the data and the nature of the task.
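A minimal feature-extraction sketch, assuming a torchvision ResNet50 backbone (the patent does not mandate this architecture; `weights=None` and the dummy batch are illustrative):

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=None)   # or another CNN, per the task
backbone.fc = torch.nn.Identity()          # keep the GAP output, drop the head

images = torch.randn(8, 3, 224, 224)       # a dummy N*K batch of pictures
features = backbone(images)                # x in R^D, with D = 2048 here
```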
In step S130, image feature pairs are selected for the extracted image features based on the degree of similarity between them.
In one embodiment, image features with high similarity are selected to form pairs. For example, the Euclidean distances between different image features are computed first; for each image feature $x_1$, the nearest image feature within its class and the nearest one among the other classes are then found according to Euclidean distance, each denoted $x_2$. In other embodiments, the distance metric may be replaced by another type such as cosine distance, and the nearest intra-class and inter-class criteria may be replaced by farthest, etc.
By selecting the most similar image pairs in step S130, the difficulty of the recognition task is raised, which in turn increases the robustness of the network.
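A sketch of this pair-construction step under the stated strategy (function and variable names are illustrative); each feature contributes one intra-class and one inter-class pair, giving 2 × N × K pairs per batch:

```python
import torch

def build_pairs(x, y):
    """x: (B, D) image features; y: (B,) integer class labels."""
    dist = torch.cdist(x, x)                 # pairwise Euclidean distances
    dist.fill_diagonal_(float('inf'))        # exclude self-matches
    same = y.unsqueeze(0) == y.unsqueeze(1)  # (B, B) same-class mask
    intra = dist.masked_fill(~same, float('inf')).argmin(dim=1)  # nearest within class
    inter = dist.masked_fill(same, float('inf')).argmin(dim=1)   # nearest across classes
    return intra, inter  # two partners per feature -> 2*N*K pairs overall
```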
In step S140, a common feature vector is extracted from the pair of image features.
The image features are passed through a common feature vector learning module to obtain the corresponding common feature vector $x_m \in \mathbb{R}^D$. Denoting the common feature vector learning process as $f_m$, the common feature vector can be expressed as:

$x_m = f_m([x_1, x_2]) \quad (1)$

The operation represented by equation (1) concatenates the feature pair $x_1$ and $x_2$ and feeds the concatenated feature into several fully connected layers. For example, as shown in fig. 2 with two fully connected layers, the feature mapping dimension is first reduced from 2048 to 512 and then restored from 512 to 2048. In further embodiments, $f_m$ may be replaced by other forms such as bilinear pooling, element-wise multiplication, or element-wise addition.
It should be noted that the number of fully connected layers and the dimensions of the feature mappings are not limited by the present invention; those skilled in the art can set them according to requirements such as training precision and training speed.
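A sketch of the mutual-vector mapping $f_m$ as two fully connected layers; note that with concatenated inputs the first layer takes a 2 × D vector (the 2048 → 512 → 2048 dimensions follow the example above but are assumptions, not requirements):

```python
import torch
import torch.nn as nn

class MutualVector(nn.Module):
    def __init__(self, dim=2048, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),  # concatenated pair -> 512
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),      # 512 -> 2048
        )

    def forward(self, x1, x2):
        return self.mlp(torch.cat([x1, x2], dim=-1))  # x_m in R^D
```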
In step S150, a gate feature vector corresponding to each feature in the image feature pair is calculated based on the common feature vector.
The generated common feature vector is multiplied element-wise with each vector of the image feature pair, and the result is normalized by a nonlinear function; for example, the nonlinear function may be a sigmoid. This finally generates the gate feature vector $g_i \in \mathbb{R}^D$, expressed as:

$g_i = \mathrm{sigmoid}(x_m \odot x_i), \quad i \in \{1, 2\} \quad (2)$

Each element of $g_i$ lies between 0 and 1; a value close to 1 indicates that the semantic feature in that channel plays an important role in classifying the feature $x_i$ and is highly discriminative.
In other embodiments, the normalization may be performed by using a tanh function or other non-linear function, which is not limited in the present invention.
Unlike conventional operations, the common feature vector captures the strongly contrasting features within the image pair, which makes it well suited to serve subsequently as context information guiding the discovery of specific semantic features in each image.
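Equation (2) reduces to a few lines of code; this sketch uses the sigmoid variant (tanh or another nonlinearity could be substituted, as noted above):

```python
import torch

def gate_vectors(x_m, x1, x2):
    g1 = torch.sigmoid(x_m * x1)  # channel-wise product, squashed to (0, 1)
    g2 = torch.sigmoid(x_m * x2)
    return g1, g2  # values near 1 mark highly discriminative channels
```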
In step S160, each feature in the image feature pair is combined with the gate feature vectors, yielding both the combination of each image feature with its own gate feature vector and its combination with the other image's gate feature vector.
Combining the original image features (i.e., each feature in the image feature pair) with the gate feature vectors produces features in four forms, as shown in fig. 3:

$x_1^{self} = x_1 + x_1 \odot g_1$
$x_2^{self} = x_2 + x_2 \odot g_2$
$x_1^{other} = x_1 + x_1 \odot g_2$
$x_2^{other} = x_2 + x_2 \odot g_1 \quad (3)$

wherein $x_i^{self}$ denotes the result of combining an image feature with its own gate feature vector, and $x_i^{other}$ denotes the result of combining an image feature with the other image's gate feature vector; $x_i^{self}$ should be more discriminative than $x_i^{other}$.
Through the attention interaction mechanism, this step enriches the diversity of the features and raises the difficulty of the classification task.
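A sketch of the interaction of equation (3) in the residual form described above:

```python
def interact(x1, x2, g1, g2):
    x1_self  = x1 + x1 * g1   # own feature gated by its own gate vector
    x2_self  = x2 + x2 * g2
    x1_other = x1 + x1 * g2   # own feature gated by the partner's gate vector
    x2_other = x2 + x2 * g1
    return x1_self, x2_self, x1_other, x2_other
```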
In step S170, the combination results are input into a classifier for optimization, and the trained convolutional neural network and classifier are obtained.
The combined features are fed into the classifier in turn to obtain the corresponding classification probability vectors $p_i^{j} \in \mathbb{R}^C$, where C is the number of classes, expressed as:

$p_i^{j} = \mathrm{softmax}(W x_i^{j} + b), \quad i \in \{1, 2\},\ j \in \{self, other\}$

wherein $p_i^{j}$ is the probability vector normalized by the softmax function, and W and b represent the weight and bias of the classifier, respectively. On the basis of these probability vectors, a corresponding loss function is introduced to guide the optimization of the whole network (i.e., the convolutional neural network used for feature extraction together with the classifier).
In one embodiment, the optimization process first uses a cross-entropy loss function, expressed as:

$L_{ce} = -\sum_{i \in \{1,2\}} \sum_{j \in \{self, other\}} y_i^{\top} \log p_i^{j} \quad (4)$

wherein $y_i$ represents the true class label, e.g., as a one-hot encoded vector whose dimension for the true label is 1 while all other dimensions are 0.
Further, considering that different feature vectors have different priorities and correspond to different classification results, a score ranking loss can be introduced, expressed as:

$L_{rk} = \sum_{i \in \{1,2\}} \max\left(0,\ p_i^{other}(c_i) - p_i^{self}(c_i) + \epsilon\right) \quad (5)$

wherein $p_i^{j}(c_i)$ denotes the score of the probability vector $p_i^{j}$ on the $c_i$-th class, and $\epsilon$ denotes a threshold. The score ranking loss expects the score of $p_i^{self}$ on the $c_i$-th class to exceed the score of $p_i^{other}$ on the $c_i$-th class by at least the threshold $\epsilon$. The threshold $\epsilon$ can be set according to factors such as the required classification accuracy; the invention places no limitation on it.
By adding the score ranking loss, the influence of subtle feature differences on the classification result is taken into account, which increases the classifier's sensitivity to subtle image differences and improves the robustness of classification.
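The following sketch combines the cross-entropy term of equation (4) with the score-ranking hinge of equation (5) for one image of a pair; `classifier` (a linear layer), `eps`, and the weighting `lam` are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def pair_loss(classifier, x_self, x_other, y, eps=0.05, lam=1.0):
    """x_self, x_other: (B, D) combined features; y: (B,) true class labels."""
    logits_self, logits_other = classifier(x_self), classifier(x_other)
    ce = F.cross_entropy(logits_self, y) + F.cross_entropy(logits_other, y)
    p_self = F.softmax(logits_self, dim=-1).gather(1, y[:, None]).squeeze(1)
    p_other = F.softmax(logits_other, dim=-1).gather(1, y[:, None]).squeeze(1)
    rank = F.relu(p_other - p_self + eps).mean()  # self score should win by eps
    return ce + lam * rank
```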
Through the above training process, the optimized convolutional neural network parameters and classifier parameters, i.e. the trained image recognition model, are obtained. In practical application, a single picture to be classified is fed into the trained convolutional neural network to extract the corresponding image feature $x^*$, and $x^*$ is fed into the trained classifier to obtain the final classification result.
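At test time no pairing or gating is needed; a single image passes through the trained backbone and classifier, as in this sketch:

```python
import torch

@torch.no_grad()
def predict(backbone, classifier, image):
    backbone.eval(); classifier.eval()
    x = backbone(image.unsqueeze(0))     # (3, H, W) -> feature x* of shape (1, D)
    logits = classifier(x)
    return logits.argmax(dim=-1).item()  # predicted class index
```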
For example, as shown in FIG. 4, the system comprises:
• a data input module, which selects pictures according to a preset data selection strategy: N classes per batch, with K pictures selected from each class;
• an image pair selection module, which, after obtaining the N × K image features, calculates the Euclidean distance between every two image features and, for each feature, selects the intra-class and inter-class features with the smallest Euclidean distance to form feature pairs, obtaining 2 × N × K image feature pairs;
• a common feature vector learning module, which, for each feature pair, obtains the common feature of the image pair through the mapping of fully connected layers;
• a gate feature vector generation module, which multiplies the common feature element-wise with each feature of the pair and normalizes the results to obtain two gate feature vectors, each of which marks the channels carrying highly discriminative semantic features in its image;
• an attention interaction module, which combines each feature pair with its two gate feature vectors in residual form to obtain four types of image features; and
• a classification module, which feeds the four features into the classifier to realize the final classification.
The invention can be used in various image recognition scenarios, for example on a mobile terminal. Referring to fig. 5, the mobile terminal comprises a data acquisition module, an algorithm processing module, and a user-interface display module. The specific process is as follows: a picture to be predicted is acquired through the mobile phone terminal and simply preprocessed; the image is then sent to the algorithm recognition module, where features are extracted by the pre-trained convolutional neural network model, and the extracted features are fed into the classifier recognition module to obtain the prediction result. Furthermore, the recognition result can be returned to the mobile phone terminal, and the acquired image together with its recognition result can be shown on the display interface.
The invention aims to discover highly discriminative semantic features in fine-grained images by inputting similar image pairs simultaneously during training, and ultimately to improve recognition performance. It is particularly suitable for recognizing fine-grained images in real life, and for tasks such as object recognition, face recognition, pedestrian re-identification, and biological category recognition. Fine-grained images include, for example, birds, flowers, cars, and biological categories organized in a taxonomic hierarchy. Referring to fig. 6, the specific process comprises: collecting the corresponding dataset and splitting off a training set; selecting reasonable hyper-parameters and strategies, including but not limited to the backbone network, batch size, learning rate, and common-vector generation module, and optimizing the network with the strategy provided by the present scheme under the given hyper-parameters; and feeding a given picture under test into the network to obtain its predicted label and the name of the corresponding picture category.
Experiments verify that the image recognition method based on the attention interaction mechanism effectively improves recognition accuracy: compared with other existing methods, it improves image recognition accuracy by 1 to 2 percentage points on several databases, and the effect is especially pronounced on fine-grained images.
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of constructing an image recognition model based on an attention interaction mechanism, comprising the steps of:
for each of N image categories, selecting K images, inputting the K images into a convolutional neural network for feature extraction, and obtaining a plurality of image features, wherein N and K are integers greater than or equal to 2;
establishing an image feature pair according to the similarity between different image features;
extracting common feature vectors from the constructed image feature pairs through common feature learning;
calculating a gate feature vector corresponding to each feature in the image feature pair based on the common feature vector;
and inputting the features obtained by combining each feature in the image feature pair with the gate feature vectors into a classifier, and optimizing according to a set loss function to obtain a trained convolutional neural network and a trained classifier.
2. The method of claim 1, wherein constructing image feature pairs according to the similarity between different image features comprises:
for each image feature $x_1$, finding the nearest image feature within its class and the nearest one among the other classes according to Euclidean distance, each denoted $x_2$, thereby forming 2 × N × K image feature pairs.
3. The method of claim 2, wherein extracting the common feature vector comprises:
concatenating the image feature pair $x_1$ and $x_2$ and feeding the concatenated feature through several fully connected layers to obtain the common feature vector, expressed as:
$x_m = f_m([x_1, x_2])$.
4. The method of claim 3, wherein calculating the gate feature vector for each feature in the image feature pair comprises:
multiplying the common feature vector $x_m$ element-wise with each feature of the image feature pair and normalizing through a sigmoid function to obtain the corresponding gate feature vectors, expressed as:
$g_i = \mathrm{sigmoid}(x_m \odot x_i), \quad i \in \{1, 2\}$.
5. The method of claim 4, wherein the features obtained by combining each feature in the image feature pair with the gate feature vectors take four forms:

$x_1^{self} = x_1 + x_1 \odot g_1$
$x_2^{self} = x_2 + x_2 \odot g_2$
$x_1^{other} = x_1 + x_1 \odot g_2$
$x_2^{other} = x_2 + x_2 \odot g_1$

wherein $x_i^{self}$ denotes the result of combining an image feature with its own gate feature vector, $x_i^{other}$ denotes the result of combining an image feature with the other image's gate feature vector, and $g_1, g_2$ denote the gate feature vectors.
6. The method of claim 1, wherein the loss function is set to:

$L_{ce} = -\sum_{i \in \{1,2\}} \sum_{j \in \{self, other\}} y_i^{\top} \log p_i^{j}$

wherein $y_i$ represents the true classification label and $p_i^{j}$ represents the classification probability vector output by the classifier.
7. The method of claim 1, wherein the loss function is represented as:

$L_{rk} = \sum_{i \in \{1,2\}} \max\left(0,\ p_i^{other}(c_i) - p_i^{self}(c_i) + \epsilon\right)$

wherein $p_i^{j}(c_i)$ denotes the score of the probability vector $p_i^{j}$ on the $c_i$-th class and $\epsilon$ denotes a threshold.
8. An image recognition method based on an attention interaction mechanism, comprising the following steps:
feeding a single picture into the trained convolutional neural network of claim 1, extracting the corresponding image feature $x^*$, and feeding $x^*$ into the trained classifier to obtain a final classification result.
9. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
10. An electronic device comprising a memory and a processor, on which a computer program is stored which is executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the processor executes the program.
CN202010070791.0A 2020-01-21 2020-01-21 Image recognition method based on attention interaction mechanism Active CN111325237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010070791.0A CN111325237B (en) 2020-01-21 2020-01-21 Image recognition method based on attention interaction mechanism

Publications (2)

Publication Number Publication Date
CN111325237A true CN111325237A (en) 2020-06-23
CN111325237B CN111325237B (en) 2024-01-05

Family

ID=71163304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010070791.0A Active CN111325237B (en) 2020-01-21 2020-01-21 Image recognition method based on attention interaction mechanism

Country Status (1)

Country Link
CN (1) CN111325237B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180068463A1 (en) * 2016-09-02 2018-03-08 Artomatix Ltd. Systems and Methods for Providing Convolutional Neural Network Based Image Synthesis Using Stable and Controllable Parametric Models, a Multiscale Synthesis Framework and Novel Network Architectures
CN110119477A (en) * 2019-05-14 2019-08-13 腾讯科技(深圳)有限公司 A kind of information-pushing method, device and storage medium
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Pedestrian based on unsupervised learning recognition methods, device and medium again

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931859A (en) * 2020-08-28 2020-11-13 中国科学院深圳先进技术研究院 Multi-label image identification method and device
CN111931859B (en) * 2020-08-28 2023-10-24 中国科学院深圳先进技术研究院 Multi-label image recognition method and device
CN112487227A (en) * 2020-11-27 2021-03-12 北京邮电大学 Deep learning fine-grained image classification method and device
CN112487227B (en) * 2020-11-27 2023-12-26 北京邮电大学 Fine granularity image classification method and device for deep learning
CN115457308A (en) * 2022-08-18 2022-12-09 苏州浪潮智能科技有限公司 Fine-grained image recognition method and device and computer equipment
CN115457308B (en) * 2022-08-18 2024-03-12 苏州浪潮智能科技有限公司 Fine granularity image recognition method and device and computer equipment
CN116051948A (en) * 2023-03-08 2023-05-02 中国海洋大学 Fine granularity image recognition method based on attention interaction and anti-facts attention
CN116051948B (en) * 2023-03-08 2023-06-23 中国海洋大学 Fine granularity image recognition method based on attention interaction and anti-facts attention

Also Published As

Publication number Publication date
CN111325237B (en) 2024-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant