CN113255821B

CN113255821B - Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium

Info

Publication number: CN113255821B
Application number: CN202110657873.XA
Authority: CN
Inventors: 李硕豪; 李小飞; 张军; 雷军; 赵翔; 葛斌; 谭真; 胡艳丽; 肖卫东; 肖华欣; 张萌萌
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2021-06-15
Filing date: 2021-06-15
Publication date: 2021-10-29
Anticipated expiration: 2041-06-15
Also published as: CN113255821A

Abstract

The invention provides an attention-based image recognition method, an attention-based image recognition system, electronic equipment and a storage medium, wherein the method is realized by a trained attention bilinear pooling network model which comprises a space attention module and a channel attention module which are arranged in parallel.

Description

Attention-based image recognition method, attention-based image recognition system, electronic device and storage medium

Technical Field

The present invention relates to the field of image recognition technologies, and in particular, to an attention-based image recognition method, system, electronic device, and storage medium.

Background

Fine-grained image recognition is a branch of image recognition that aims to distinguish different subclasses within the same broad category. Compared with ordinary image recognition, its granularity is finer: it is often used, for example, to identify different species of birds or different models of automobiles, so the specific kind of bird or automobile must be identified accurately. Conventional image recognition, by contrast, typically distinguishes between different categories, such as cats versus dogs, and does not need to determine which kind of cat or dog is shown.

The key to fine-grained image recognition is learning the discriminative features of images. Most current methods for fine-grained image recognition rely on weakly supervised local positioning, image region cropping, and multi-level training. Although these methods can achieve a good recognition rate, they suffer from problems such as inaccurate local positioning and a tendency to crop in background regions.

Disclosure of Invention

In view of the above, the present invention is directed to an attention-based image recognition method, system, electronic device and storage medium.

Based on the above object, the present invention provides an attention-based image recognition method, wherein the method is implemented by a trained attention bilinear pooling network model, the attention bilinear pooling network model includes a spatial attention module and a channel attention module, which are arranged in parallel, and the method includes:

acquiring global features of an image to be identified;

acquiring channel characteristics of the image to be identified based on the channel attention module;

acquiring the spatial features of the image to be recognized based on the spatial attention module;

performing feature fusion on the channel features and the space features through a bilinear pooling operation to obtain fused local features;

identifying the image to be identified based on the global features and the local features;

and when the attention bilinear pooling network model is trained, predicting the fused local features by adopting a cross entropy loss function, and predicting the recognition result of the attention bilinear pooling network model by adopting the cross entropy loss function.

Optionally, the obtaining, based on the channel attention module, a channel feature of the image to be recognized specifically includes:

acquiring a feature map of the image to be identified and carrying out global average pooling on the feature map to obtain unit channel features;

inputting the unit channel features into a first fully connected layer and then performing a first activation through a first preset activation function;

inputting the result of the first activation into a second fully connected layer, and then performing a second activation through a second preset activation function to obtain a channel attention weight;

and multiplying the channel attention weight by the feature map to obtain the channel features.

Optionally, the obtaining the spatial feature of the image to be recognized based on the spatial attention module specifically includes:

acquiring a feature map of the image to be identified, carrying out global average pooling and global maximum pooling on the feature map respectively, and splicing the two pooling results along the channel direction;

performing convolution processing on the spliced result, and obtaining a spatial attention weight through a third preset activation function;

and multiplying the spatial attention weight by the feature map to obtain the spatial features.

Optionally, before obtaining the global feature of the image to be recognized, the method further includes:

performing data augmentation on the image to be identified, wherein the data augmentation comprises one or more of image scale normalization, random image cropping, image value normalization, image flipping, image scaling, image rotation and image inclination.
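As an illustration of a few of the augmentation operations listed above, the following is a minimal NumPy sketch of random cropping, horizontal flipping and value normalization. The function name `augment`, the crop size, the 0.5 flip probability and the per-image, per-channel normalization are assumptions made for this example; the source does not specify parameters or an implementation.

```python
import numpy as np

def augment(image, crop_size, rng):
    """Randomly crop, randomly flip, and normalize an (H, W, 3) uint8 image.

    All parameter choices here are illustrative assumptions.
    """
    h, w, _ = image.shape
    # Random cropping: pick the top-left corner of a crop_size x crop_size window.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    # Image flipping: mirror horizontally with probability 0.5 (assumed).
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    # Value normalization: scale to [0, 1], then standardize each channel.
    patch = patch.astype(np.float32) / 255.0
    mean = patch.mean(axis=(0, 1), keepdims=True)
    std = patch.std(axis=(0, 1), keepdims=True) + 1e-6
    return (patch - mean) / std

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
augmented = augment(image, crop_size=24, rng=rng)
```

Calling `augment` several times on the same image yields different crops and flips, which is one way to present multiple views of a single image.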

As can be seen from the above, the attention-based image recognition method provided by the present invention is implemented by a trained attention bilinear pooling network model, which includes a spatial attention module and a channel attention module arranged in parallel. The method uses the channel attention module and the spatial attention module to extract discriminative features at the channel level and the spatial level of the image, respectively, then fuses the extracted channel features and spatial features by a hierarchical bilinear pooling operation so that they are associated as local features, and performs the final image recognition according to the local features and the global features learned by the main branch, thereby improving the accuracy of image recognition. Meanwhile, when the attention bilinear pooling network model is trained, a cross-entropy loss function is used to evaluate the prediction from the fused local features, and a cross-entropy loss function is also used to evaluate the recognition result of the model, which further improves the efficiency and accuracy of parameter adjustment during model training.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the related art, the drawings required to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart illustrating an attention-based image recognition method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating another attention-based image recognition method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an attention-based image recognition system according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a hardware structure of a specific electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that technical terms or scientific terms used in the embodiments of the present invention should have the ordinary meanings as understood by those having ordinary skill in the art to which the present invention belongs, unless otherwise defined. The use of "first," "second," and similar language in the embodiments of the present invention does not denote any order, quantity, or importance, but rather the terms "first," "second," and similar language are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

As described in the background section, the biggest challenges of fine-grained image recognition are that the differences between classes are small while the differences within a class are large, and that objects of the same class photographed from different angles present different poses. Enabling the network to fully learn the discriminative features of different classes is therefore the key to fine-grained image recognition. The invention learns the channel-level features and the spatial-level features of the image through the channel attention module and the spatial attention module, respectively, which are arranged in parallel, so as to fully mine the local features of the image. For a feature map of size C × H × W, the spatial attention mechanism assigns the same weight across every channel C while learning different attention weights over the planar dimensions H × W; spatial attention is primarily concerned with "where" the object is, that is, it learns which regions of the image to focus on. After the channel-level features and the spatial-level features are obtained, they are fused by a bilinear pooling operation, so that the channel features and the spatial features are associated and mutually enhanced, forming complete local features and yielding richer discriminative local features. Finally, image recognition is performed using the discriminative local features together with the obtained global features, which improves the accuracy of image recognition.

In a specific application scenario of the invention, the attention-based image recognition method can be applied to classification recognition of fine-grained images, such as classification recognition of naval vessel pictures of navy, or recognition of specific classes of certain animals.

In another specific application scenario of the present invention, the attention-based image recognition method of the present invention may perform secondary recognition on the recognition result of other image recognition methods, or perform primary recognition by the attention-based image recognition method of the present invention, and then perform secondary or multiple recognition by other image recognition methods.

Referring to fig. 1, a schematic flowchart of an attention-based image recognition method according to an embodiment of the present invention is shown, where the method is implemented by a trained attention bilinear pooling network model, where the attention bilinear pooling network model includes a spatial attention module and a channel attention module that are arranged in parallel, and the method includes the following steps:

S101, acquiring the global features of the image to be recognized.

In this step, the attention bilinear pooling network model obtained through training first obtains the global features of the image to be recognized, and optionally, the last layer of the attention bilinear pooling network model is used as a main branch, and the global features of the image to be recognized are obtained through the main branch.

S102, acquiring the channel characteristics of the image to be recognized based on the channel attention module.

In this step, the channel features of the image to be recognized are acquired through the channel attention module of the trained attention bilinear pooling network model, wherein the channel features mainly represent characteristics unique to the image to be recognized.

In order to accurately acquire the channel features of the image to be recognized, in some embodiments, acquiring the channel features of the image to be recognized based on the channel attention module specifically includes:

Specifically, a feature map of the image to be recognized is obtained through the attention bilinear pooling network model, and global average pooling is then performed on the feature map to obtain unit channel features; for example, for a feature map of C × H × W, global average pooling converts the C × H × W input into unit channel features of 1 × C. The unit channel features are then input into a first fully connected layer, which reduces the dimensionality from the original dimensionality to a preset dimensionality, and a first activation is performed through a first preset activation function, after which the features are no longer linearly related to the features before activation. The first preset activation function can be selected as needed; optionally, it is set as a ReLU function. The result of the first activation is then input into a second fully connected layer, which no longer performs dimensionality reduction but instead reorganizes the result of the first activation, and a second activation is performed through a second preset activation function to obtain the channel attention weight.

It should be noted that the first activation may be regarded as obtaining an initial, approximate channel attention weight, and the second activation refines this approximate weight to obtain a more accurate channel attention weight. The second preset activation function may be set as needed; optionally, it is set as a Sigmoid function. After the channel attention weight is obtained through the two activations, the channel features are obtained by multiplying the channel attention weight by the feature map.
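The channel attention steps described above can be sketched in NumPy as follows. This is only an illustrative sketch: the function name `channel_attention`, the reduced dimension `C_r`, the random weights, and the assumption that the second fully connected layer maps back to C outputs (so the weights can multiply the feature map channel by channel) are not taken from the source.

```python
import numpy as np

def channel_attention(feature_map, w1, b1, w2, b2):
    """feature_map: (C, H, W); w1: (C_r, C); w2: (C, C_r)."""
    # Global average pooling: (C, H, W) -> unit channel features of length C.
    unit = feature_map.mean(axis=(1, 2))
    # First fully connected layer reduces C to C_r, followed by ReLU.
    hidden = np.maximum(w1 @ unit + b1, 0.0)
    # Second fully connected layer (assumed to map C_r back to C), then Sigmoid.
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))
    # Multiply each channel of the feature map by its attention weight.
    return feature_map * weights[:, None, None], weights

rng = np.random.default_rng(0)
C, H, W, C_r = 8, 4, 4, 2  # illustrative sizes
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C_r, C)) * 0.1
w2 = rng.standard_normal((C, C_r)) * 0.1
channel_feat, channel_w = channel_attention(x, w1, np.zeros(C_r), w2, np.zeros(C))
```

The output keeps the C × H × W shape of the input; only the relative scale of each channel changes.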

S103, acquiring the spatial features of the image to be recognized based on the spatial attention module.

In this step, the spatial features of the image to be recognized are obtained through the spatial attention module of the trained attention bilinear pooling network model. The spatial features mainly represent the feature differences between different regions of the image to be recognized, and thereby indicate where the important features of the image are located.

In order to accurately acquire the spatial features of the image to be recognized, in some embodiments, acquiring the spatial features of the image to be recognized based on the spatial attention module specifically includes:

Specifically, a feature map of the image to be recognized is obtained through the attention bilinear pooling network model, global average pooling and global maximum pooling are then performed on the feature map respectively, and the two pooling results are spliced along the channel direction. Convolution processing is performed on the spliced result, and a spatial attention weight is obtained through a third preset activation function. The convolution processing reduces the spliced result to 1 channel; optionally, a 1 × 1 convolution kernel is used. The third preset activation function can be selected as needed; optionally, it is set as a Sigmoid function. After the spatial attention weight is obtained, it is multiplied by the feature map to obtain the spatial features.
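A minimal NumPy sketch of the spatial attention steps follows. It assumes that the average and maximum pooling are taken across the channel axis, so that each yields one H × W map (a common reading of this kind of module, not confirmed by the source), and it models the 1 × 1 convolution as a two-element kernel. The names `spatial_attention`, `kernel` and `bias` are hypothetical.

```python
import numpy as np

def spatial_attention(feature_map, kernel, bias=0.0):
    """feature_map: (C, H, W); kernel: (2,) weights of a 1x1 convolution."""
    # Average and maximum pooling across channels (assumed axis): two (H, W) maps.
    avg_map = feature_map.mean(axis=0)
    max_map = feature_map.max(axis=0)
    # Splice the two pooling results along the channel direction: (2, H, W).
    stacked = np.stack([avg_map, max_map], axis=0)
    # A 1x1 convolution from 2 channels to 1 is a weighted sum per pixel.
    logits = np.einsum("k,khw->hw", kernel, stacked) + bias
    # Sigmoid gives one attention weight per spatial position.
    weights = 1.0 / (1.0 + np.exp(-logits))
    # The same spatial weight is applied to every channel.
    return feature_map * weights[None, :, :], weights

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 4))
spatial_feat, spatial_w = spatial_attention(x, kernel=np.array([0.5, 0.5]))
```

Here the weight map has shape H × W and is broadcast over all C channels, matching the description that spatial attention gives the same weight on each channel while varying over the plane.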

S104, performing feature fusion on the channel features and the spatial features through a bilinear pooling operation to obtain fused local features.

In this step, the obtained channel features and spatial features are used as the two input features of the bilinear pooling operation, and feature fusion is performed to obtain the fused local features. Fusion lets the channel features and the spatial features enhance each other. Because the channel features focus on the characteristics of the local features and the spatial features focus on their positions, fusing the two yields the characteristics of the local features at each position, so that for fine-grained image recognition the differences between images to be recognized at different positions can be found effectively. In addition, fusion enables mutual error correction between the channel features and the spatial features; that is, features that were extracted erroneously can be eliminated through the degree of matching during fusion.
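For illustration, the fusion step can be sketched with the classic form of bilinear pooling: an outer product of the two feature maps averaged over spatial positions, followed by signed square root and L2 normalization. The source does not give the exact formulation it uses, so this specific variant, the function name `bilinear_pool` and the input sizes are assumptions.

```python
import numpy as np

def bilinear_pool(a, b, eps=1e-12):
    """Fuse two feature maps a: (C_a, H, W) and b: (C_b, H, W) into one vector."""
    c_a, h, w = a.shape
    c_b = b.shape[0]
    # Flatten the spatial dimensions so each row is one channel.
    a_flat = a.reshape(c_a, h * w)
    b_flat = b.reshape(c_b, h * w)
    # Outer product at every position, averaged over positions: (C_a, C_b).
    outer = a_flat @ b_flat.T / (h * w)
    vec = outer.reshape(-1)
    # Signed square root and L2 normalization, as commonly used with bilinear pooling.
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    return vec / (np.linalg.norm(vec) + eps)

rng = np.random.default_rng(2)
channel_features = rng.standard_normal((8, 4, 4))  # stand-in for the channel branch
spatial_features = rng.standard_normal((8, 4, 4))  # stand-in for the spatial branch
local_features = bilinear_pool(channel_features, spatial_features)
```

Each entry of the fused vector measures how strongly one channel of the first input co-occurs with one channel of the second across positions, which is what associates the two kinds of features.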

S105, identifying the image to be identified based on the global features and the local features.

In this step, after the global features and the fused local features are obtained, the image to be recognized is recognized according to the global features and the local features. Optionally, the fused local features and the global features are spliced together, and the image to be recognized is recognized through the output of a fully connected layer.
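The optional splice-then-classify step can be sketched as below. The feature sizes, the number of classes, the random weights and the softmax used to turn the fully connected output into class probabilities are assumptions for this example; the source only states that the spliced features pass through a fully connected layer.

```python
import numpy as np

def classify(global_feat, local_feat, weight, bias):
    """Splice global and local features, then apply one fully connected layer."""
    joint = np.concatenate([global_feat, local_feat])
    logits = weight @ joint + bias
    # Softmax (assumed) with max subtraction for numerical stability.
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return int(probs.argmax()), probs

rng = np.random.default_rng(3)
global_feat = rng.standard_normal(16)   # illustrative global feature vector
local_feat = rng.standard_normal(64)    # illustrative fused local feature vector
num_classes = 5
weight = rng.standard_normal((num_classes, 16 + 64)) * 0.1
label, probs = classify(global_feat, local_feat, weight, np.zeros(num_classes))
```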

In order to further improve the accuracy of image recognition, in some embodiments, when the attention bilinear pooling network model is trained, a cross-entropy loss function is first used to evaluate the prediction from the fused local features; sample image recognition is then performed based on the fused local features that pass this evaluation and the obtained global features, and a cross-entropy loss function is used to evaluate the recognition result of the model. The fused local features are thus evaluated separately and the corresponding parameters adjusted, which guarantees the fusion effect, while the final recognition result is also evaluated. This improves the efficiency of parameter adjustment during model training and the accuracy of image recognition.

To avoid the contingency of recognizing a single image, in some embodiments, before acquiring the global features of the image to be recognized, the method further comprises: performing data augmentation on the image to be recognized, wherein the data augmentation comprises one or more of image scale normalization, random image cropping, image value normalization, image flipping, image scaling, image rotation and image inclination.
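The two uses of the cross-entropy loss can be illustrated with a small NumPy sketch. The logits below are made-up numbers, and adding the two losses with equal weight is an assumption; the source does not state how the two losses are combined or what threshold makes a local-feature prediction "qualified".

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of a single sample given raw class scores."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])

# Hypothetical class scores from an auxiliary head on the fused local features
# and from the final head on the spliced global and local features.
local_logits = np.array([2.0, 0.5, -1.0])
final_logits = np.array([3.0, 0.0, -2.0])
label = 0

loss_local = cross_entropy(local_logits, label)
loss_final = cross_entropy(final_logits, label)
# Equal-weight sum is an assumed way to train on both signals together.
total_loss = loss_local + loss_final
```

Supervising the local branch directly gives its parameters their own training signal instead of relying only on gradients that pass through the final classifier.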

Referring to FIG. 2, a schematic flowchart of another attention-based image recognition method according to an embodiment of the present invention is shown. A feature map of the image to be recognized is obtained, and the channel features, spatial features and global features are obtained from the feature map. The channel features and the spatial features are fused to obtain the fused local features; optionally, this fusion is performed by bilinear pooling. The fused local features are then fused with the global features, and image recognition is performed according to the fused result.

Experiments show that the attention-based image recognition method reaches an accuracy of 91.3% for fine-grained image recognition on a ship data set. To study how much channel attention, spatial attention and bilinear pooling each contribute to recognition accuracy, an ablation experiment was carried out. Without the channel attention mechanism the accuracy is only 87.8%, a reduction of 2.5% compared with the complete method, which shows that the channel attention features have a large influence on model accuracy; without spatial attention the accuracy is 2% lower than with the complete method; and when local feature fusion is performed without bilinear pooling the accuracy is 1.1% lower. Therefore, the channel attention, the spatial attention and the bilinear pooling operation in the attention-based image recognition method each contribute to the accuracy of image recognition.

In the attention-based image recognition method provided by the invention, the channel attention mechanism and the spatial attention mechanism are used as two independent branches to extract features on the channel and spatial dimensions respectively, so that more distinctive and specific channel attention features and spatial attention features can be extracted. The separately extracted channel attention features and spatial attention features are then fused by a bilinear pooling operation, so that the two kinds of features are associated and mutually enhanced to form complete local features. Finally, image recognition is performed according to the local features and the global features learned by the main branch, which improves the accuracy of image recognition.

It should be noted that the method of the embodiment of the present invention may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In the case of such a distributed scenario, one of the multiple devices may only perform one or more steps of the method according to the embodiment of the present invention, and the multiple devices interact with each other to complete the method.

It should be noted that the above describes some embodiments of the invention. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Based on the same inventive concept, the invention also provides an attention-based image recognition system corresponding to the method of any embodiment.

Referring to fig. 3, the attention-based image recognition system includes:

a first feature obtaining unit 301, which obtains global features of an image to be identified;

a second feature obtaining unit 302, configured to obtain a channel feature of the image to be recognized based on the channel attention module;

a third feature obtaining unit 303, configured to obtain a spatial feature of the image to be recognized based on the spatial attention module;

a fusion unit 304, which performs feature fusion on the channel features and the spatial features through bilinear pooling operation to obtain fused local features;

an identifying unit 305, configured to identify the image to be identified based on the global feature and the local feature;

In some specific application scenarios, the second feature obtaining unit 302 is specifically configured to:

In some specific application scenarios, the third feature obtaining unit 303 is specifically configured to:

For convenience of description, the above system is described as being divided into various modules by function, which are described separately. Of course, when implementing the invention, the functions of the modules may be implemented in one or more pieces of software and/or hardware.

The apparatus of the foregoing embodiment is used to implement the corresponding attention-based image recognition method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the invention further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the attention-based image recognition method according to any of the above-mentioned embodiments.

Fig. 4 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.

The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.

The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.

The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.

The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).

Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.

It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.

The electronic device of the above embodiment is used to implement the corresponding attention-based image recognition method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present invention also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the attention-based image recognition method according to any of the above-described embodiments.

Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the attention-based image recognition method according to any one of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to those examples. Within the idea of the invention, technical features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the invention as described above, which are not provided in detail for the sake of brevity.

In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present invention are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that embodiments of the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.

While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.

The present embodiments are intended to embrace all such alternatives, modifications and variations which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the invention.

Claims

1. An attention-based image recognition method, wherein the method is implemented by a trained attention bilinear pooling network model, the attention bilinear pooling network model comprises a spatial attention module and a channel attention module which are arranged in parallel, and the method comprises the following steps:

acquiring global features of an image to be recognized;

when the attention bilinear pooling network model is trained, predicting the fused local features by adopting a cross entropy loss function, identifying a sample image based on the fused local features and the global features which are qualified through prediction, and predicting the identification result of the attention bilinear pooling network model by adopting the cross entropy loss function;

acquiring the channel features of the image to be recognized based on the channel attention module specifically comprises the following steps:

inputting the result after the first activation into a second fully connected layer, and then performing second activation through a second preset activation function to obtain a channel attention weight, wherein the second fully connected layer does not perform dimensionality reduction;
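As an illustration of the last step recited in claim 1, the following Python sketch applies a second fully connected layer and a second activation to the result of the first activation. It is a minimal sketch under stated assumptions, not the claimed implementation: the sigmoid function stands in for the unspecified second preset activation function, the names `sigmoid` and `channel_attention_weight` and all numeric values are hypothetical, and the absence of dimensionality reduction is read here as the layer producing at least as many outputs as it has inputs.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function (assumed second activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_attention_weight(hidden, weight, bias):
    """Second fully connected layer followed by the second activation.

    hidden: result of the first activation, a list of floats.
    weight: list of rows, one per output channel, each of length len(hidden).
    bias:   one float per output channel.
    """
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of rows")
    if any(len(row) != len(hidden) for row in weight):
        raise ValueError("each weight row must match the input length")
    if len(weight) < len(hidden):
        # Assumption: "no dimensionality reduction" means output size >= input size.
        raise ValueError("second layer must not reduce dimensionality")
    return [
        sigmoid(sum(w * x for w, x in zip(row, hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Usage with hypothetical values: identity weights and zero bias.
hidden = [0.0, 1.0]
attention = channel_attention_weight(
    hidden, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
)
```

With a sigmoid as the second activation, each channel attention weight lies strictly between 0 and 1, so it can be used to rescale the corresponding channel.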

2. The image recognition method according to claim 1, wherein the obtaining of the spatial feature of the image to be recognized based on the spatial attention module specifically includes:

3. The image recognition method of claim 1, wherein before obtaining the global features of the image to be recognized, the method further comprises:

4. An attention-based image recognition system, wherein the system is implemented by a trained attention bilinear pooling network model, the attention bilinear pooling network model comprises a spatial attention module and a channel attention module which are arranged in parallel, and the system comprises:

the first feature acquisition unit is used for acquiring the global features of the image to be recognized;

the second feature acquisition unit is used for acquiring the channel features of the image to be recognized based on the channel attention module;

the third feature acquisition unit is used for acquiring the spatial features of the image to be recognized based on the spatial attention module;

the fusion unit is used for performing feature fusion on the channel features and the spatial features through bilinear pooling operation to obtain fused local features;

the identification unit is used for recognizing the image to be recognized based on the global features and the fused local features;

when the attention bilinear pooling network model is trained, predicting the fused local features by adopting a cross entropy loss function, and predicting the recognition result of the attention bilinear pooling network model by adopting the cross entropy loss function;

the second feature acquisition unit is specifically configured to:

5. The system according to claim 4, wherein the third feature acquisition unit is specifically configured to:

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 3 when executing the program.

7. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 3.
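For illustration only, the bilinear pooling fusion and the two cross-entropy losses recited in claims 1 and 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the channel features and spatial features are assumed to be lists of rows sampled at the same N spatial positions, the outer products are averaged over positions without further normalization, and the two losses are combined as an unweighted sum. The function names `bilinear_pool` and `cross_entropy`, the logits, and the label are hypothetical.

```python
import math


def bilinear_pool(channel_feats, spatial_feats):
    """Fuse two feature maps by averaging their outer products over positions.

    channel_feats: C rows, each holding N position values.
    spatial_feats: D rows, each holding N position values.
    Returns a flattened list of C * D fused local features.
    """
    if not channel_feats or not spatial_feats:
        raise ValueError("both feature maps must be non-empty")
    n = len(channel_feats[0])
    if n == 0 or any(len(r) != n for r in channel_feats + spatial_feats):
        raise ValueError("all rows must share the same non-zero length")
    return [
        sum(a * b for a, b in zip(c_row, s_row)) / n
        for c_row in channel_feats
        for s_row in spatial_feats
    ]


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against a single class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Usage with hypothetical values.
channel_feats = [[1.0, 2.0], [3.0, 4.0]]
spatial_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = bilinear_pool(channel_feats, spatial_feats)

local_logits = [2.0, 0.5]  # prediction from the fused local features
final_logits = [1.5, 0.2]  # prediction from local plus global features
label = 0
# Assumption: the two losses are simply added during training.
total_loss = cross_entropy(local_logits, label) + cross_entropy(final_logits, label)
```

Supervising the fused local features with their own loss, in addition to the loss on the final recognition result, follows the two uses of the cross-entropy loss function described in the claims; how the two terms are weighted is not specified there.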