CN116543216A - Fine granularity image classification optimization method and system - Google Patents

Fine granularity image classification optimization method and system

Info

Publication number
CN116543216A
Authority
CN
China
Prior art keywords
convolution
network
stage
fine
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310525083.5A
Other languages
Chinese (zh)
Inventor
谭志
胥子皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture filed Critical Beijing University of Civil Engineering and Architecture
Priority to CN202310525083.5A priority Critical patent/CN116543216A/en
Publication of CN116543216A publication Critical patent/CN116543216A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fine-grained image classification optimization method and system, belonging to the technical field of image processing. Exploiting the flexibility and additivity of the convolution operation, the method uses two asymmetric convolutions as data-enhancement branches of a classical convolution and merges the branches into the main path by structural re-parameterization, reducing model parameters; data enhancement is thus achieved without increasing the parameter count or computational cost, improving model performance. A convolution and attention fusion module is proposed, providing a brand-new solution for fusing attention networks and convolutional networks that is lighter than existing ones. The asymmetric convolution data-enhancement module and the convolution and attention fusion module are integrated into a residual network, yielding an improved asymmetric convolution and attention fusion network; by drawing on the downsampling-layer techniques of attention networks, a downsampling layer suited to the convolution and attention fusion network is provided, so that the fusion technique adapts better to the residual network.

Description

Fine granularity image classification optimization method and system
Technical Field
The invention relates to the technical field of image processing, in particular to a fine-grained image classification optimization method and system based on a residual network fusing convolution and the self-attention mechanism.
Background
The key to solving the fine-grained image classification problem is finding the distinctive fine-grained feature regions of an image. To this end, a large number of conventional methods and complex neural networks have been used to extract fine features from images. Among them, convolutional neural networks based on the convolution operation became the mainstream in this field owing to their performance advantage over traditional methods. With the attention-based Vision Transformer (ViT) model, neural networks adopting the attention mechanism were shown to perform even better in fine-grained image classification, and a variety of attention-based neural network models have since been proposed, raising the classification performance of fine-grained image classification tasks to a brand-new stage.
However, although neural networks built on the attention mechanism outperform convolutional neural networks, their high training cost (time, large data volume, and the like) and hardware requirements pose no small obstacle to practical application and technical deployment. Convolutional neural networks, by contrast, though weaker than attention networks in performance, remain popular in practical applications owing to their low cost and light weight.
Sanghyun et al propose a Convolutional Block Attention Module (CBAM) method. The method comprises the steps of connecting a channel attention module and a space attention module in series, calculating an attention map of a feature map from two dimensions of the channel and the space by the CBAM, and then multiplying the feature map and the attention map to perform self-adaptive learning of the features. The method has excellent performance in a plurality of computer vision fields such as face recognition, fine granularity image classification fields and the like.
Such methods merely place the attention mechanisms into the convolutional neural network mechanically, without further fusing the two, so although the new network possesses both convolution and attention, the effect is not ideal; after the attention mechanism is added to the convolutional neural network, the other constituent modules are left unchanged, so the attention mechanism cannot adapt well to the convolutional network. In short, although attempts have been made to combine convolution and attention mechanisms, there remains room for improvement in both model computation and classification accuracy.
Disclosure of Invention
The invention aims to provide a fine-granularity image classification optimization method and system based on a convolution and self-attention mechanism fusion residual network, which are used for solving at least one technical problem in the background technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in one aspect, the present invention provides a fine-grained image classification optimization method, which is characterized by comprising:
acquiring an image to be classified and optimized;
processing the acquired image to be classified and optimized by using a pre-trained model to obtain a classification and optimization result of the image; the model is trained in advance and comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
Optionally, the residual network structure is composed of four different stages with sequentially increasing channel numbers, each stage consisting of a channel shuffling module and n residual blocks; on entering each stage, the input data is first passed to the channel shuffling module to unify the channel number, and then to the n residual blocks for feature extraction, completing the computation of one stage; this is repeated until the data has passed through all four stages, after which the computed data is passed to a classification layer for feature classification and the classification result is output.
Optionally, the asymmetric convolution comprises 3 parallel branches, namely a 1×3 convolution, a 3×3 convolution and a 3×1 convolution, each branch extracting an intermediate feature map.
Optionally, assume that there is a convolution kernel K ∈ R^(k×k×c_in×c_out), where k is the size of the convolution kernel and c_in and c_out represent the number of input and output channels, respectively; let the tensors F ∈ R^(c_in×H×W) and Y ∈ R^(c_out×H×W) be the input and output feature maps, where H and W represent the height and width of the feature map, respectively; denote by f_ij and y_ij the feature tensors of the pixel point (i, j) in F and Y, respectively. The operation of the standard convolution is then:

y_ij = Σ_{a,b} K_{a,b} f_{i+a−⌊k/2⌋, j+b−⌊k/2⌋}

where K_{a,b} ∈ R^(c_out×c_in), a, b ∈ {0, 1, …, k−1}, represents the weight at kernel position (a, b);

the standard convolution is expressed as the following two stages:

stage one: g_ij^(a,b) = K_{a,b} f_ij

stage two: y_ij = Σ_{a,b} Shift(g^(a,b), a−⌊k/2⌋, b−⌊k/2⌋)_ij

In the first stage, the convolution calculation stage, the input feature map is linearly projected by the kernel weight at each position; in the second stage, the shift and aggregation stage, the projected feature maps are shifted according to their kernel positions and aggregated.
Optionally, if the multi-head attention mechanism has N heads, let F ∈ R^(c_in×H×W) and G ∈ R^(c_out×H×W) represent the input and output features, and f_ij and g_ij the tensors corresponding to a specific point (i, j) in the image; a single head of the multi-head attention mechanism is then:

g_ij = Σ_{(a,b)∈N_k(i,j)} softmax( (Z_q f_ij)ᵀ (Z_k f_ab) ) Z_v f_ab

where Z_q, Z_k, Z_v are the projection matrices corresponding to Q, K, V; N_1 denotes a multi-head attention mechanism whose number of heads is 1; N_k(i, j) represents the local region with center pixel (i, j) and spatial extent k; and Z_q f_ij, Z_k f_ab, Z_v f_ab are the query, key and value features over N_k(i, j);

the multi-head attention mechanism is expressed as two stages:

stage one: q_ij = Z_q f_ij, k_ij = Z_k f_ij, v_ij = Z_v f_ij

stage two: g_ij = Σ_{(a,b)∈N_k(i,j)} softmax(q_ijᵀ k_ab) v_ab

In the first stage, 1×1 convolutions are performed first, projecting the input features into Q, K and V, the three intermediate quantities of the attention operation; the second stage is the computation of the attention weights and the aggregation of the value matrix, i.e. the aggregation of local features.
Optionally, a 4×4 convolution with a stride of 4 is chosen as the initial downsampling layer; for the downsampling layers between different stages of the residual network, a 2×2 convolution kernel with a stride of 2 is adopted, i.e. the input feature map is divided into mutually non-overlapping 2×2 blocks, gradually concentrating the key information so that the network achieves better results at a reasonable computational cost.
In a second aspect, the present invention provides a fine-grained image classification optimization system comprising:
the acquisition module is used for acquiring the image to be classified and optimized;
the processing module is used for processing the acquired image to be classified and optimized by utilizing the pre-trained model to obtain a classification and optimization result of the image; the model is trained in advance and comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
In a third aspect, the present invention provides a non-transitory computer readable storage medium for storing computer instructions which, when executed by a processor, implement a fine-grained image classification optimization method as described above.
In a fourth aspect, the present invention provides a computer program product comprising a computer program for implementing a fine-grained image classification optimization method as described above when run on one or more processors.
In a fifth aspect, the present invention provides an electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and the computer program is stored in the memory, which processor executes the computer program stored in the memory when the electronic device is running, to cause the electronic device to execute instructions for implementing the fine-grained image classification optimization method as described above.
The invention has the beneficial effects that: by combining the common points of convolution operation and attention mechanism, the fusion of the convolution and the attention mechanism is realized, and the convolution and the attention mechanism can be simultaneously used under lower parameter quantity; an improved module is provided, so that the module built on the basis can have the characteristics of a convolution network and an attention network; the feature extraction mode used in the model is improved, and the efficiency is improved so as to better balance the relation between the complexity of the model and the model effect.
The advantages of additional aspects of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of convolution asymmetry according to an embodiment of the present disclosure.
Fig. 2 is a block diagram of an asymmetric convolution enhancement module according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating a convolution operation process according to an embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating an exploded view of the attention mechanism operation process according to an embodiment of the present invention.
Fig. 5 is a block diagram of a convolution and attention fusion module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a downsampling layer structure according to an embodiment of the invention.
Fig. 7 is a basic block diagram of a residual block according to an embodiment of the present invention.
Fig. 8 is a diagram of a network model according to an embodiment of the present invention.
Fig. 9 is a network flow chart of a method implementation according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout or elements having like or similar functionality. The embodiments described below by way of the drawings are exemplary only and should not be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or groups thereof.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
In order that the invention may be readily understood, a further description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings and are not to be construed as limiting embodiments of the invention.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of examples and that the elements of the drawings are not necessarily required to practice the invention.
Example 1
In this embodiment 1, in order to solve the problems of the convolutional neural network and the attention neural network, according to the common points in the convolutional operation and the attention mechanism, a convolutional and attention fusion module is provided, and an asymmetric convolutional enhancement module and a fragmented downsampling layer are respectively added, and a brand-new residual network, which is called Asymmetric Convolution and self-Attention Network (ACANet), is provided.
In this embodiment, there is provided a fine-grained image classification optimization system, including: the acquisition module, used for acquiring the image to be classified and optimized; and the processing module, used for processing the acquired image to be classified and optimized by utilizing the pre-trained model to obtain a classification and optimization result of the image; the model is trained in advance and comprises a feature extraction network and a neural network classifier; wherein, based on the additivity of the convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through their decomposition, obtaining a fusion module; and the asymmetric convolution and fusion modules are embedded into the residual block of a residual network.
In this embodiment 1, the fine-grained image classification optimization method is implemented using the system described above, and includes: acquiring an image to be classified and optimized by using the acquisition module; and processing the acquired image based on a pre-trained model by utilizing the processing module to obtain a classification and optimization result of the image; the model is trained in advance and comprises a feature extraction network and a neural network classifier; wherein, based on the additivity of the convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through their decomposition, obtaining a fusion module; and the asymmetric convolution and fusion modules are embedded into the residual block of a residual network.
The residual network structure consists of four different stages with sequentially increasing channel numbers, each stage consists of a channel shuffling module and n residual blocks, when the input data enter each stage, the input data are firstly transmitted into the channel shuffling module to unify the channel numbers, and then are transmitted into the n residual blocks to perform feature extraction so as to finish the calculation of one stage, and the calculation is repeated until the input data completely pass through the four stages; and then transmitting the calculated data to a classification layer for characteristic classification and outputting a classification result.
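The four-stage layout described above can be sketched as a simple forward pass. This is a minimal illustration only: the patent does not specify the channel widths, the number of residual blocks per stage, or the internals of the channel shuffling module, so here the shuffle module is stood in for by a 1×1 projection that unifies the channel count, the residual blocks are simplified to identity-plus-1×1-transform, and the widths 32–256 and two blocks per stage are assumed values.

```python
import numpy as np

def channel_adjust(x, c_out, rng):
    """Stand-in for the channel shuffling module: a 1x1 projection that
    unifies the channel count (an assumption, not the patent's exact module)."""
    W = rng.standard_normal((c_out, x.shape[0]))
    return np.einsum('oc,chw->ohw', W, x)

def residual_block(x, rng):
    """Identity-plus-transform residual block (simplified to a 1x1 conv)."""
    W = rng.standard_normal((x.shape[0], x.shape[0])) * 0.01
    return x + np.einsum('oc,chw->ohw', W, x)

rng = np.random.default_rng(4)
x = rng.standard_normal((3, 16, 16))   # input feature map (c, H, W)
channels = [32, 64, 128, 256]          # four stages with increasing widths (illustrative)
blocks_per_stage = [2, 2, 2, 2]        # n residual blocks per stage (illustrative)
for c, n in zip(channels, blocks_per_stage):
    x = channel_adjust(x, c, rng)      # unify the channel number on entering the stage
    for _ in range(n):
        x = residual_block(x, rng)     # feature extraction by the residual blocks
print(x.shape)   # (256, 16, 16): channel width of the final stage
```

The computed features would then go to the classification layer, as the paragraph above describes.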
For the asymmetric convolution module, a convolution with a kernel size of 3×3 and a stride of 1 is adopted; the asymmetric convolution is composed of 3 parallel branches, namely a 1×3 convolution, a 3×3 convolution and a 3×1 convolution, each branch extracting an intermediate feature map.
For the convolution operation, assume that there is a convolution kernel K ∈ R^(k×k×c_in×c_out), where k is the size of the convolution kernel and c_in and c_out represent the number of input and output channels, respectively; let the tensors F ∈ R^(c_in×H×W) and Y ∈ R^(c_out×H×W) be the input and output feature maps, where H and W represent the height and width of the feature map, respectively; denote by f_ij and y_ij the feature tensors of the pixel point (i, j) in F and Y, respectively. The operation of the standard convolution is then:

y_ij = Σ_{a,b} K_{a,b} f_{i+a−⌊k/2⌋, j+b−⌊k/2⌋}

where K_{a,b} ∈ R^(c_out×c_in), a, b ∈ {0, 1, …, k−1}, represents the weight at kernel position (a, b);

the standard convolution is expressed as the following two stages:

stage one: g_ij^(a,b) = K_{a,b} f_ij

stage two: y_ij = Σ_{a,b} Shift(g^(a,b), a−⌊k/2⌋, b−⌊k/2⌋)_ij

In the first stage, the convolution calculation stage, the input feature map is linearly projected by the kernel weight at each position; in the second stage, the shift and aggregation stage, the projected feature maps are shifted according to their kernel positions and aggregated.
If the multi-head attention mechanism has N heads, let F ∈ R^(c_in×H×W) and G ∈ R^(c_out×H×W) represent the input and output features, and f_ij and g_ij the tensors corresponding to a specific point (i, j) in the image; a single head of the multi-head attention mechanism is then:

g_ij = Σ_{(a,b)∈N_k(i,j)} softmax( (Z_q f_ij)ᵀ (Z_k f_ab) ) Z_v f_ab

where Z_q, Z_k, Z_v are the projection matrices corresponding to Q, K, V; N_1 denotes a multi-head attention mechanism whose number of heads is 1; N_k(i, j) represents the local region with center pixel (i, j) and spatial extent k; and Z_q f_ij, Z_k f_ab, Z_v f_ab are the query, key and value features over N_k(i, j);

the multi-head attention mechanism is expressed as two stages:

stage one: q_ij = Z_q f_ij, k_ij = Z_k f_ij, v_ij = Z_v f_ij

stage two: g_ij = Σ_{(a,b)∈N_k(i,j)} softmax(q_ijᵀ k_ab) v_ab

In the first stage, 1×1 convolutions are performed first, projecting the input features into Q, K and V, the three intermediate quantities of the attention operation; the second stage is the computation of the attention weights and the aggregation of the value matrix, i.e. the aggregation of local features.
A 4×4 convolution with a stride of 4 is selected as the initial downsampling layer; for the downsampling layers between different stages of the residual network, a 2×2 convolution kernel with a stride of 2 is adopted, i.e. the input feature map is divided into mutually non-overlapping 2×2 blocks, gradually concentrating the key information so that the network achieves better results at a reasonable computational cost.
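Assuming a 224×224 input (an illustrative choice; the patent does not fix the input size), the spatial sizes produced by this patchify scheme can be computed directly from the standard convolution output-size formula:

```python
def out_hw(n, k, stride, pad=0):
    # spatial size after a convolution/pooling layer:
    # floor((n + 2*pad - k) / stride) + 1
    return (n + 2 * pad - k) // stride + 1

n = 224
n = out_hw(n, k=4, stride=4)    # initial 4x4, stride-4 patchify: 224 -> 56
sizes = [n]
for _ in range(3):              # 2x2, stride-2 downsampling between the later stages
    n = out_hw(n, k=2, stride=2)
    sizes.append(n)
print(sizes)   # [56, 28, 14, 7]
```

Because the stride equals the kernel size in both cases, the windows tile the feature map without overlap, which is exactly the non-overlapping 2×2 block division described above.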
Example 2
In this embodiment 2, by decomposing the convolution operation and the attention mechanism and exploiting their common structure, a basic module fusing convolution and attention is proposed. The traditional convolution is data-enhanced with asymmetric convolutions, a brand-new pure-convolution module is designed, and on this basis the module is embedded into the residual block of a residual network, giving a residual network basic module. Finally, drawing on advanced experience from attention networks, a downsampling layer suited to the convolution and attention fusion module is provided. The improved network designed from these modules for fine-grained image classification is called the Asymmetric Convolution and self-Attention Network (ACANet); the model structure mainly comprises a feature extraction network and a neural network classifier.
In this embodiment 2, for the designed asymmetric convolution data enhancement module, the following is specifically described:
in this embodiment, by decomposing the convolution operation, the characteristics of the convolution operation are carefully studied, and the characteristics of flexibility and additivity of the convolution operation are found. A k x k-sized convolution can be split into a set of 1 x k and k x 1-sized asymmetric convolutions to reduce the number of parameters of the model while achieving equivalent operations. The principle is that if a two-dimensional matrix has a rank of 1, the matrix can be equivalently converted into a series of one-dimensional matrices. The additivity with respect to convolution operations may be expressed by the following equation (1):
I×K 1 +I×K 2 =I×(K 1 +K 2 ) (1)
wherein K is 1 And K 2 Respectively representing two convolution kernels, and I representing the feature map input. Thus, a 3×3 symmetric convolution, which is commonly used for example with a 3×3 convolution, can be split into a set of asymmetric convolution combinations of 1×3 convolution and 3×1 convolution as shown in fig. 1.
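The additivity in equation (1) can be checked numerically. The sketch below implements a plain 2-D cross-correlation by hand (NumPy is used only for illustration) and verifies that convolving with two kernels and summing equals convolving with the summed kernel:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
I = rng.standard_normal((8, 8))
K1 = rng.standard_normal((3, 3))
K2 = rng.standard_normal((3, 3))

lhs = conv2d(I, K1) + conv2d(I, K2)   # I*K1 + I*K2
rhs = conv2d(I, K1 + K2)              # I*(K1 + K2)
print(np.allclose(lhs, rhs))
```

The identity holds because convolution is linear in the kernel; this is what allows parallel branches to be folded into a single kernel after training.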
However, in practical applications, if the convolution kernel is regarded as a matrix, its rank is usually not equal to 1, so directly replacing a conventional convolution with a pair of asymmetric convolutions tends to give unsatisfactory results. This embodiment therefore proposes to use a set of asymmetric convolutions as data enhancement for the classical convolution, which improves the feature extraction capability of the network at little cost. The data enhancement module structure is shown in fig. 2.
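The enhancement-plus-re-parameterization idea can be sketched as follows: at training time the three branches (1×3, 3×3, 3×1) run in parallel and their outputs are summed; at inference time, by the additivity of equation (1), the 1×3 and 3×1 kernels can be folded into the middle row and column of the 3×3 kernel, leaving a single convolution with no extra parameters or computation. (Batch normalization on each branch, present in typical re-parameterization schemes, is omitted here for brevity.)

```python
import numpy as np

def conv2d_same(img, k):
    """Stride-1 cross-correlation with zero padding so output size equals input size."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i+kh, j:j+kw] * k)
    return out

rng = np.random.default_rng(1)
I = rng.standard_normal((6, 6))
k13 = rng.standard_normal((1, 3))   # 1x3 branch
k33 = rng.standard_normal((3, 3))   # 3x3 main branch
k31 = rng.standard_normal((3, 1))   # 3x1 branch

# training time: three parallel branches, outputs summed
train_out = conv2d_same(I, k13) + conv2d_same(I, k33) + conv2d_same(I, k31)

# inference time: fuse the branches into one 3x3 kernel (structural re-parameterization)
fused = k33.copy()
fused[1:2, :] += k13   # the 1x3 kernel sits on the middle row of the 3x3 kernel
fused[:, 1:2] += k31   # the 3x1 kernel sits on the middle column
infer_out = conv2d_same(I, fused)

print(np.allclose(train_out, infer_out))
```

The fused single convolution reproduces the three-branch output exactly, which is how the branches are merged into the main path without increasing the deployed model's parameters.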
In this embodiment 2, for the design convolution and attention mechanism fusion module, the following is specifically described:
as an important component of convolutional neural networks, convolutions are typically set to a 3 x 3 size, with a step size of 1. The specific procedure of the convolution operation may be as shown in fig. 3.
Assume that there is a convolution kernel K ∈ R^(k×k×c_in×c_out), where k is the size of the convolution kernel and c_in and c_out represent the number of input and output channels, respectively. Let the tensors F ∈ R^(c_in×H×W) and Y ∈ R^(c_out×H×W) be the input and output feature maps, where H and W represent the height and width of the feature map, respectively. We denote by f_ij and y_ij the feature tensors of the pixel point (i, j) in F and Y, respectively. The operation of the standard convolution is as shown in equation (2):

y_ij = Σ_{a,b} K_{a,b} f_{i+a−⌊k/2⌋, j+b−⌊k/2⌋} (2)

where K_{a,b} ∈ R^(c_out×c_in), a, b ∈ {0, 1, …, k−1}, represents the weight at kernel position (a, b). The standard convolution can be expressed as two stages:

stage one: g_ij^(a,b) = K_{a,b} f_ij (3)

stage two: y_ij = Σ_{a,b} Shift(g^(a,b), a−⌊k/2⌋, b−⌊k/2⌋)_ij (4)

In the first stage, the 1×1 convolution calculation stage, the input feature map is linearly projected position by position by the kernel weights, which is almost the same as a classical 1×1 convolution. In the second stage, the shift and aggregation stage, the projected feature maps are shifted according to their kernel positions and aggregated. A careful count shows that the computational cost of the convolution operation comes mainly from the first stage, while the second stage is relatively light.
Since the proposal of the ViT model, the attention mechanism has become another important basic module besides convolution. Compared with traditional convolution, the attention mechanism allows the model to attend to a larger range of image information. The operation of the attention mechanism is shown in fig. 4, where Q, K, V represent the three intermediate quantities of the attention operation, whose function is similar to the 1×1 convolution in the convolution process.
Suppose a multi-head attention mechanism has N heads; let $F \in \mathbb{R}^{H \times W \times c_{in}}$ and $Y \in \mathbb{R}^{H \times W \times c_{out}}$ denote the input and output features, and let $f_{ij}$ and $y_{ij}$ denote the tensors at the point (i, j) of the image. A single head of the multi-head attention mechanism can then be written as equation (6).

$$y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\big((Z_q f_{ij})^{\top} (Z_k f_{ab})\big)\, Z_v f_{ab} \qquad (6)$$

wherein $Z_q$, $Z_k$, $Z_v$ are the projection matrices corresponding to Q, K, V; equation (6) describes the case where the number of heads of the multi-head attention mechanism is 1; $\mathcal{N}_k(i,j)$ represents the local region of spatial extent k centred at pixel (i, j); and $f_{ab}$, with $a, b \in \mathcal{N}_k(i,j)$, are the corresponding features within $\mathcal{N}_k(i,j)$. Likewise, the multi-head attention mechanism can be expressed as two stages:

stage one: $q_{ij} = Z_q f_{ij}, \quad k_{ij} = Z_k f_{ij}, \quad v_{ij} = Z_v f_{ij}$ (7)

stage two: $y_{ij} = \big\Vert_{l=1}^{N} \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\big((q^{(l)}_{ij})^{\top} k^{(l)}_{ab}\big)\, v^{(l)}_{ab}$ (8)
similar to the conventional convolution, a 1 x 1 convolution is first performed in stage I, projecting the input features as Q, K and V. The second stage is the calculation of the attention weights and the aggregation of the value matrix, i.e. the aggregation of the local features. The corresponding computation costs have also proved to be smaller compared to phase one, following the same pattern as convolution. The composition of the convolution and attention fusion module is shown in fig. 5.
In this embodiment 2, the design of the downsampling layer is as follows.

In general, the downsampling layer determines how the input image data is processed for subsequent operations. Because natural images contain considerable redundancy, the downsampling layers of convolutional and attention networks typically downsample the input image aggressively, bringing it to a size suitable for subsequent operations.
In classical residual networks, the input image is first passed through a 7 × 7 convolution layer with a stride of 2 and then further reduced by a max-pooling layer, removing redundant information. The residual network uses this pair of operations as its initial downsampling layer, through which the input image shrinks to one quarter of its original size.
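The quarter-size claim can be verified with the standard convolution output-size formula; the stride/padding values below are the usual residual-network stem settings and are assumed here rather than taken from the text:

```python
def conv_out(size, kernel, stride, padding):
    # Standard convolution/pooling output-size formula
    return (size + 2 * padding - kernel) // stride + 1

# 7x7 convolution, stride 2, padding 3, followed by a
# 3x3 max-pool, stride 2, padding 1 (assumed classic stem settings)
after_conv = conv_out(224, 7, 2, 3)   # 224 -> 112
after_pool = conv_out(after_conv, 3, 2, 1)  # 112 -> 56
```

A 224 × 224 input therefore leaves the stem at 56 × 56, i.e. one quarter of the original side length.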
In the Swin Transformer model, a more distinctive downsampling strategy is adopted: a patch-embedding layer of size 4 divides the whole input image into non-overlapping blocks the same size as the convolution kernel in order to perform the downsampling operation. This embodiment therefore attempts to apply this method to the classical residual network so that the attention mechanism can play its role better, realizing the fusion of convolution and attention.
In this embodiment, a 4 × 4 convolution with a stride of 4 is selected as the initial downsampling layer. For the downsampling layers of the different stages of the residual network, the invention adopts a 2 × 2 convolution kernel with a stride of 2, i.e. the input feature map is divided into non-overlapping 2 × 2 blocks, so that key information is gradually concentrated and the network can reach higher accuracy at a reasonable computational cost. The downsampling layer structure is shown in fig. 6.
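A 4 × 4 convolution with stride 4 is exactly one linear projection per non-overlapping 4 × 4 patch, which is what makes it interchangeable with a patch-embedding layer. A minimal NumPy sketch of this view (illustrative; the channel sizes are arbitrary assumptions):

```python
import numpy as np

def patch_downsample(F, W4):
    # F: (H, W, c_in); W4: (4, 4, c_in, c_out).
    # A 4x4 convolution applied with stride 4 = one linear projection
    # of each non-overlapping 4x4 patch.
    H, Wd, _ = F.shape
    c_out = W4.shape[3]
    Y = np.zeros((H // 4, Wd // 4, c_out))
    for i in range(0, H, 4):
        for j in range(0, Wd, 4):
            patch = F[i:i + 4, j:j + 4]                  # (4, 4, c_in)
            Y[i // 4, j // 4] = np.einsum('abc,abcd->d', patch, W4)
    return Y
```

The stage-wise 2 × 2, stride-2 layers described above work the same way with a 2 × 2 patch grid.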
In this embodiment 2, an improved network for fine-grained image classification (ACANet) is constructed by integrating the asymmetric convolution data enhancement module, the convolution and attention mechanism fusion module, and the downsampling layer designed above into one neural network structure. In the improved network, the overall residual network structure is pyramid-shaped and consists of four stages with successively increasing channel numbers; each stage consists of a channel shuffling module and n residual blocks, where n is a configurable hyperparameter. The residual block structure is shown in fig. 7. On entering each stage, the input data is first passed into the channel shuffling module to unify the number of channels, and then into the n residual blocks for feature extraction, completing the computation of one stage; this is repeated until all four stages have been traversed. The resulting data is then passed to a classification layer for feature classification, and the classification result is output. The overall network structure is shown in fig. 8.
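The per-stage flow (channel shuffle, then n residual blocks) can be sketched abstractly as follows. The shuffle shown is the usual group-interleaving rule, an assumption since the text does not spell out the shuffling scheme, and the channel-number adjustment performed by the real module is omitted:

```python
import numpy as np

def channel_shuffle(x, groups):
    # Group-interleaving shuffle (assumed rule): x has shape (C, H, W)
    C, H, W = x.shape
    return x.reshape(groups, C // groups, H, W).transpose(1, 0, 2, 3).reshape(C, H, W)

def stage(x, blocks, groups=2):
    # One pyramid stage: channel shuffle, then n residual blocks y = f(y) + y
    x = channel_shuffle(x, groups)
    for f in blocks:
        x = f(x) + x   # residual connection around each block
    return x
```

Four such stages with growing channel counts, followed by a classification layer, give the pyramid structure described above.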
In this embodiment 2, with reference to fig. 9, the implementation procedure of the whole method is as follows:
First, the runtime environment is configured: the PyTorch deep learning framework is selected for training; before model training begins, PyTorch, NumPy, os and other libraries are installed to support training, and a virtual environment is configured with Python 3.10.
Second, the datasets for model training are prepared: the CUB-200-2011, Stanford Cars and Flowers-102 datasets are selected, containing 5994, 8144 and 2040 training images with label data respectively, together with images and label data for testing.
Third, the model training files and related parameters are set. In addition to the configuration file of the whole system, the configuration is tailored to the characteristics of the three datasets. Since the training sets contain few images, the invention sets the number of epochs to 600 and the initial learning rate to 0.1, and divides the learning rate by ten every two hundred epochs.
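The schedule described (initial rate 0.1, divided by ten every two hundred epochs, over 600 epochs) corresponds to a simple step decay, for example:

```python
def learning_rate(epoch, base_lr=0.1, step=200, factor=0.1):
    # Divide the learning rate by ten every `step` epochs,
    # matching the training setup described above
    return base_lr * factor ** (epoch // step)
```

Over 600 epochs this yields three plateaus: 0.1, 0.01 and 0.001.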
Fourth, the image preprocessing and loading stage. Preprocessing operations such as resizing, cropping and random rotation are applied to the input images, unifying the image format while augmenting the data, so that the model avoids problems such as overfitting during training and generalizes better.
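This preprocessing step can be expressed as a torchvision transform pipeline. A sketch only — the exact target sizes and rotation angle are not given in the text and are assumed here:

```python
from torchvision import transforms

# Hypothetical values: the text names resizing, cropping and random
# rotation but gives no concrete sizes or angles.
train_transforms = transforms.Compose([
    transforms.Resize(256),         # unify the image format
    transforms.RandomCrop(224),     # cropping
    transforms.RandomRotation(15),  # random rotation for augmentation
    transforms.ToTensor(),
])
```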
Fifth, the image feature extraction stage. The preprocessed images are fed into the network for feature extraction. A channel shuffling module in the network first adjusts the number of channels, widening the network, after which several residual blocks extract features; ReLU layers are added to the network to introduce nonlinearity so that the data features are fitted better during training.
Sixth, the feature classification stage. The extracted features are rearranged and passed into a fully connected layer, where they are classified by the relevant classification function.
Seventh, loss and accuracy are calculated. The classification results are compared with the label data to compute the accuracy, and are passed into a loss function to compute the loss. The loss functions CELoss, FocalLoss and MCLoss are used.
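Of the three losses named, FocalLoss is the least standard; a common formulation (assumed here — the text does not give its exact variant) down-weights well-classified examples and reduces to plain cross-entropy at γ = 0:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def focal_loss(logits, labels, gamma=2.0):
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified
    # samples; gamma = 0 recovers plain cross-entropy
    p = softmax(logits)
    pt = p[np.arange(len(labels)), labels]
    return (-(1 - pt) ** gamma * np.log(pt)).mean()
```

Since (1 − p_t)^γ ≤ 1 for γ > 0, the focal loss is never larger than the cross-entropy on the same batch.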
Eighth, the weights are updated by the gradient descent algorithm. The optimal solution of the model is continuously sought by gradient descent, and the gradients are propagated back to update the network weights and guide model training.
Ninth, the optimal model is saved. The accuracy of model training is recorded, and the model accuracy is evaluated once per epoch until the set number of training rounds (epochs) is reached.
Tenth, the whole process ends.
In this embodiment 2, simulation experiments were performed on the three datasets, and the results were compared with those of other existing models and loss functions; the comparison is shown in table 1. It is clear from table 1 that the method of this embodiment 2 achieves better performance.
TABLE 1
In summary, in embodiment 2: first, exploiting the flexibility and additivity of the convolution operation, two asymmetric convolutions are used as data-enhancement branches alongside the classical convolution during training and feature extraction, and a structural re-parameterization technique merges these branches into the main branch at test time to reduce the number of model parameters; this is called the asymmetric convolution data enhancement module, and achieves data enhancement without increasing the parameter count or computational cost, improving the model's performance. Second, by decomposing the convolution operation and the attention mechanism and combining the two, a convolution and attention fusion module is proposed, providing a brand-new, lighter solution than previous approaches for fusing attention networks and convolutional networks. Third, the proposed asymmetric convolution data enhancement module and convolution-attention fusion module are integrated into a residual network, yielding the improved asymmetric convolution and attention fusion network (ACANet), which obtains better results than many traditional methods on several public datasets. Finally, drawing on the downsampling-layer techniques of attention networks, a downsampling layer suited to the convolution-attention fusion network is proposed, so that the fusion technique adapts better to the residual network.
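The structural re-parameterization in the summary above relies on convolution being linear in its kernel: the 1 × 3 and 3 × 1 branch kernels can be zero-padded to 3 × 3 and added to the main kernel, so a single merged convolution reproduces the sum of the three branches at test time. A minimal single-channel sketch (BatchNorm folding omitted):

```python
import numpy as np

def embed_3x3(k):
    # Place a 1x3 or 3x1 kernel at the centre of a zero-filled 3x3 grid
    out = np.zeros((3, 3))
    h, w = k.shape
    out[(3 - h) // 2:(3 - h) // 2 + h, (3 - w) // 2:(3 - w) // 2 + w] = k
    return out

def merge_branches(k3x3, k1x3, k3x1):
    # Convolution is linear in the kernel, so adding the zero-padded
    # branch kernels to the main kernel merges all three branches
    return k3x3 + embed_3x3(k1x3) + embed_3x3(k3x1)

def conv2d_same(x, k):
    # Single-channel 'same' convolution with zero padding and a 3x3 kernel
    H, W = x.shape
    xp = np.pad(x, 1)
    y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = (xp[i:i + 3, j:j + 3] * k).sum()
    return y
```

One convolution with the merged kernel then matches the three-branch output for any input, which is why the merge costs nothing in accuracy while removing the branch parameters.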
Example 3
Embodiment 3 provides a non-transitory computer-readable storage medium storing computer instructions that, when executed by a processor, implement a fine-grained image classification optimization method as described above, the method comprising:
acquiring an image to be classified and optimized;
processing the acquired image to be classified and optimized by using a pre-trained model to obtain the classification and optimization result of the image, wherein the pre-trained model comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
Example 4
This embodiment 4 provides a computer program product comprising a computer program for implementing a fine-grained image classification optimization method as described above when run on one or more processors, the method comprising:
acquiring an image to be classified and optimized;
processing the acquired image to be classified and optimized by using a pre-trained model to obtain the classification and optimization result of the image, wherein the pre-trained model comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
Example 5
Embodiment 5 provides an electronic apparatus including: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, said processor executing the computer program stored in said memory when the electronic device is running, to cause the electronic device to execute instructions for implementing a fine-grained image classification optimization method as described above, the method comprising:
acquiring an image to be classified and optimized;
processing the acquired image to be classified and optimized by using a pre-trained model to obtain the classification and optimization result of the image, wherein the pre-trained model comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it should be understood that various changes and modifications could be made by one skilled in the art without the need for inventive faculty, which would fall within the scope of the invention.

Claims (10)

1. A fine-grained image classification optimization method, characterized by comprising:
acquiring an image to be classified and optimized;
processing the acquired image to be classified and optimized by using a pre-trained model to obtain the classification and optimization result of the image, wherein the pre-trained model comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
2. The fine-grained image classification optimization method according to claim 1, wherein the residual network structure consists of four different stages with successively increasing channel numbers, each stage consisting of a channel shuffling module and n residual blocks; on entering each stage, input data is first passed into the channel shuffling module to unify the number of channels, and then into the n residual blocks for feature extraction so as to complete the calculation of one stage, repeating until all four stages have been passed through; the calculated data is then passed to a classification layer for feature classification and the classification result is output.
3. The fine-grained image classification optimization method according to claim 2, characterized in that the asymmetric convolution comprises 3 branches side by side, respectively a 1 x 3 convolution, a 3 x 3 convolution, and a 3 x 1 convolution, each branch extracting an intermediate feature map.
4. The fine-grained image classification optimization method of claim 2, wherein a convolution kernel $K \in \mathbb{R}^{k \times k \times c_{in} \times c_{out}}$ is assumed to exist, where k is the size of the convolution kernel and $c_{in}$ and $c_{out}$ represent the numbers of input and output channels, respectively; tensors $F \in \mathbb{R}^{H \times W \times c_{in}}$ and $Y \in \mathbb{R}^{H \times W \times c_{out}}$ are assumed to be the input and output feature maps, where H and W represent the height and width of the feature map, respectively; $f_{ij}$ and $y_{ij}$ denote the feature tensors of the pixel (i, j) in F and Y, respectively; the operation procedure of the standard convolution is:

$$y_{ij} = \sum_{a,b \in \{0,1,\dots,k-1\}} K_{a,b}\, f_{i+a-\lfloor k/2 \rfloor,\; j+b-\lfloor k/2 \rfloor}$$

wherein $K_{a,b} \in \mathbb{R}^{c_{in} \times c_{out}}$, $a, b \in \{0, 1, \dots, k-1\}$, represents the weight at kernel position (a, b);

the standard convolution is expressed as the following two stages:

stage one: $\tilde{g}^{(a,b)}_{ij} = K_{a,b}\, f_{ij}$

stage two: $g^{(a,b)}_{ij} = \operatorname{Shift}\big(\tilde{g}^{(a,b)}_{ij},\, a - \lfloor k/2 \rfloor,\, b - \lfloor k/2 \rfloor\big), \quad y_{ij} = \sum_{a,b} g^{(a,b)}_{ij}$

in the first stage, namely the convolution calculation stage, the input feature map is linearly projected from each position by the kernel weights; in the second stage, the shift-and-aggregation stage, the projected feature maps are shifted according to the kernel positions and aggregated.
5. The fine-grained image classification optimization method according to claim 4, wherein, for a multi-head attention mechanism with N heads, $F \in \mathbb{R}^{H \times W \times c_{in}}$ and $Y \in \mathbb{R}^{H \times W \times c_{out}}$ represent the input and output features and $f_{ij}$, $y_{ij}$ represent the tensors corresponding to the point (i, j) in the image; a single head of the multi-head attention mechanism is:

$$y_{ij} = \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\big((Z_q f_{ij})^{\top} (Z_k f_{ab})\big)\, Z_v f_{ab}$$

wherein $Z_q$, $Z_k$, $Z_v$ are the projection matrices corresponding to Q, K, V; this equation describes the case where the number of heads of the multi-head attention mechanism is 1; $\mathcal{N}_k(i,j)$ represents the local region of spatial extent k centred on the pixel (i, j); and $f_{ab}$, with $a, b \in \mathcal{N}_k(i,j)$, are the corresponding features within $\mathcal{N}_k(i,j)$;

the multi-head attention mechanism is expressed as two stages:

stage one: $q_{ij} = Z_q f_{ij}, \quad k_{ij} = Z_k f_{ij}, \quad v_{ij} = Z_v f_{ij}$

stage two: $y_{ij} = \big\Vert_{l=1}^{N} \sum_{a,b \in \mathcal{N}_k(i,j)} \operatorname{softmax}\big((q^{(l)}_{ij})^{\top} k^{(l)}_{ab}\big)\, v^{(l)}_{ab}$

in the first stage, 1 × 1 convolutions are first performed, projecting the input features into Q, K and V, which represent three intermediate quantities of the attention operation; the second stage is the calculation of the attention weights and the aggregation of the value matrix, i.e. the aggregation of the local features.
6. The fine-grained image classification optimization method according to claim 5, wherein a 4 × 4 convolution with a stride of 4 is chosen as the initial downsampling layer; for the downsampling layers of the different stages in the residual network, a 2 × 2 convolution kernel with a stride of 2 is adopted, i.e. the input feature map is divided into non-overlapping 2 × 2 blocks so as to gradually concentrate key information, so that the network can reach higher accuracy at a reasonable computational cost.
7. A fine-grained image classification optimization system, comprising:
the acquisition module is used for acquiring the image to be classified and optimized;
the processing module is used for processing the acquired image to be classified and optimized by utilizing a pre-trained model to obtain the classification and optimization result of the image, wherein the pre-trained model comprises a feature extraction network and a neural network classifier;
wherein, based on the additivity of convolution operation, data enhancement is performed by using asymmetric convolution; the convolution operation and the attention mechanism are fused through decomposition of the convolution operation and the attention mechanism, and a fusion module is obtained; and embedding the asymmetric convolution and fusion module into a residual block of a residual network.
8. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement the fine-grained image classification optimization method of any of claims 1-6.
9. A computer program product comprising a computer program for implementing the fine-grained image classification optimization method according to any of the claims 1-6 when run on one or more processors.
10. An electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, which processor executes the computer program stored in the memory when the electronic device is running, to cause the electronic device to execute instructions for implementing the fine-grained image classification optimization method according to any of the claims 1-6.
CN202310525083.5A 2023-05-10 2023-05-10 Fine granularity image classification optimization method and system Pending CN116543216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310525083.5A CN116543216A (en) 2023-05-10 2023-05-10 Fine granularity image classification optimization method and system


Publications (1)

Publication Number Publication Date
CN116543216A true CN116543216A (en) 2023-08-04

Family

ID=87451923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310525083.5A Pending CN116543216A (en) 2023-05-10 2023-05-10 Fine granularity image classification optimization method and system

Country Status (1)

Country Link
CN (1) CN116543216A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671431A (en) * 2024-01-29 2024-03-08 杭州安脉盛智能技术有限公司 Industrial defect image generation method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361546A (en) * 2021-06-18 2021-09-07 合肥工业大学 Remote sensing image feature extraction method integrating asymmetric convolution and attention mechanism
WO2022127227A1 (en) * 2020-12-15 2022-06-23 西安交通大学 Multi-view semi-supervised lymph node classification method and system, and device
CN115798043A (en) * 2022-11-30 2023-03-14 北京邮电大学 Lightweight-designed dynamic gesture recognition neural network model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
小飞龙飞飞飞: "ACmix", pages 2 - 3, Retrieved from the Internet <URL:https://blog.csdn.net/qq_42129459/article/details/124228607> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117671431A (en) * 2024-01-29 2024-03-08 杭州安脉盛智能技术有限公司 Industrial defect image generation method, device, equipment and storage medium
CN117671431B (en) * 2024-01-29 2024-05-07 杭州安脉盛智能技术有限公司 Industrial defect image generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Dong et al. Network pruning via transformable architecture search
Jiao et al. A survey on the new generation of deep learning in image processing
AU2022200600B2 (en) Superpixel methods for convolutional neural networks
Parmar et al. Image transformer
Chen et al. Encoder-decoder with atrous separable convolution for semantic image segmentation
Cui et al. Deep network cascade for image super-resolution
Li et al. Neural architecture search for lightweight non-local networks
Chavan et al. Vision transformer slimming: Multi-dimension searching in continuous optimization space
Chen et al. Compressing convolutional neural networks
Bulat et al. Incremental multi-domain learning with network latent tensor factorization
Li et al. Data-driven neuron allocation for scale aggregation networks
CN116543216A (en) Fine granularity image classification optimization method and system
Li et al. A lightweight multi-scale channel attention network for image super-resolution
Zhao et al. Thumbnet: One thumbnail image contains all you need for recognition
Zhong et al. Shift-based primitives for efficient convolutional neural networks
CN113222818A (en) Method for reconstructing super-resolution image by using lightweight multi-channel aggregation network
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
Huang et al. Differentiable neural architecture search for extremely lightweight image super-resolution
Zhang et al. Fsanet: Frequency self-attention for semantic segmentation
CN112767255B (en) Image super-resolution reconstruction method and system based on feature separation fusion network
Qi et al. Learning low resource consumption cnn through pruning and quantization
Du et al. Lightweight image super-resolution with mobile share-source network
Han et al. Learning versatile convolution filters for efficient visual recognition
Chen et al. Muffnet: Multi-layer feature federation for mobile deep learning
Xiong et al. Noucsr: Efficient super-resolution network without upsampling convolution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination