CN114067153A - Image classification method and system based on parallel dual-attention lightweight residual network - Google Patents

Image classification method and system based on parallel dual-attention lightweight residual network

Info

Publication number
CN114067153A
CN114067153A (application CN202111290845.5A)
Authority
CN
China
Prior art keywords
characteristic information
information matrix
attention
input
output
Prior art date
Legal status
Granted
Application number
CN202111290845.5A
Other languages
Chinese (zh)
Other versions
CN114067153B (en)
Inventor
骆爱文
路畅
黄蓓蓓
李媛
王芮
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date
Filing date
Publication date
Application filed by Jinan University
Priority to CN202111290845.5A
Publication of CN114067153A
Application granted
Publication of CN114067153B
Legal status: Active
Anticipated expiration


Classifications

    • G06F18/241: Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045: Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06N3/048: Neural networks; architecture; activation functions
    • G06N3/08: Neural networks; learning methods


Abstract

The invention provides an image classification method and system based on a parallel dual-attention lightweight residual network. The method optimizes the structure of the residual network's convolution kernels and extracts a feature information matrix from the input image; applies global average pooling to the feature information matrix output by the parallel dual-attention lightweight residual structure, integrating spatial information across all layers and converting it into a one-dimensional feature information matrix; and inputs this one-dimensional matrix into a fully-connected layer to obtain a matrix whose size equals the number of classes in the classification task, from which the image classification result is output. By compressing the residual network and adopting a dual-branch spatial-channel attention mechanism, the invention reduces parameters and computation and raises processing speed while preserving accuracy, thereby improving the overall efficiency of deep-neural-network-based target classification and recognition.

Description

Image classification method and system based on parallel dual-attention lightweight residual network
Technical Field
The invention relates to the field of image processing, and in particular to an image classification method and system based on a parallel dual-attention lightweight residual network.
Background
The purpose of image object classification is to locate objects against a background image and output their corresponding categories. The ResNet residual network is a deep neural network widely used today in image classification. By means of its residual layers, ResNet effectively alleviates the accuracy degradation caused by deepening the network, making the training of deep neural networks feasible, so that deeper networks can be used to obtain better recognition results. Residual layers also allow neural networks to easily exceed hundreds or even thousands of layers, yielding stronger image feature expression and target classification and recognition capability.
A method for classifying chip-defect images based on a ResNet network is disclosed in publication CN113076989A (published 2021-07-06). Classification is carried out through a ResNet network as follows: the obtained sample data is divided into a training set, a validation set and a test set; the samples are preprocessed; the training and validation sets of the processed sample images are used to train the constructed network model; the trained model is then used as the test model, the remaining test set is fed through the test network, and the classification result is finally output through an activation function. Preprocessing avoids the heavy computation and time cost of processing the full-size original sample images; data augmentation reduces the risk of over-fitting; and adding more feature information improves classification performance. The ResNet residual block addresses gradient vanishing, gradient explosion and the degradation of learning efficiency.
This method classifies images with a ResNet network; however, the parameters and floating-point operations (FLOPs) of ResNet remain high, which entails a large amount of computation and leads to a low computation speed as measured by frame rate (fps). Moreover, once a ResNet network becomes very deep, further gains in classification accuracy are very limited, which cannot meet the development requirements of edge machine-vision applications.
Disclosure of Invention
To overcome the drawbacks of the existing ResNet residual network, namely its large number of parameters and heavy computation, which slow down inference, and the loss of recognition accuracy when the network is compressed, the invention provides an image classification method and system based on a parallel dual-attention lightweight residual network.
In order to solve the technical problems, the technical scheme of the invention is as follows:
In a first aspect, the invention provides an image classification method based on a parallel dual-attention lightweight residual network, comprising the following steps:
S1: input the image into the residual network and preprocess it.
S2: optimize the structure of the residual network's convolution kernels and extract a feature information matrix from the input image.
S3: apply batch normalization to the feature information matrix and perform nonlinear activation.
S4: process the feature information matrix obtained in S3 with channel attention and spatial attention in parallel, and output a new feature information matrix.
S5: apply global average pooling to the feature information matrix obtained in S4, integrating spatial information across all layers and converting it into a one-dimensional feature information matrix.
S6: input the one-dimensional feature information matrix into the fully-connected layer to obtain a matrix whose size equals the number of classes in the classification task, and output the image classification result.
Preferably, the preprocessing in S1 includes resizing the image to a uniform size by padding or cropping.
Preferably, S2 specifically includes: splitting an A × A large convolution kernel at the residual network input layer into several serially connected layers of small symmetric convolution kernels, then feeding the preprocessed input image through these layers in sequence to extract the feature information matrix of the input image; each small symmetric convolution kernel has size B × B, with A > B ≥ 1.
Preferably, S2 specifically includes: decomposing the A × A large convolution kernel at the residual network input layer into one layer of A × 1 and one layer of 1 × A asymmetric convolution kernels connected in sequence, then feeding the preprocessed input image through them in turn to extract the feature information matrix of the input image; A is a positive integer greater than 1.
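A back-of-envelope comparison shows why both decompositions in S2 save weights. Assuming C input and C output channels throughout (C = 48 is an assumed width for illustration, and biases are ignored):

```python
def conv_params(kh, kw, in_ch, out_ch):
    """Weight count of one convolution layer with a kh x kw kernel."""
    return kh * kw * in_ch * out_ch

C = 48
large = conv_params(7, 7, C, C)                            # one 7x7 kernel:    49*C*C
stacked = 3 * conv_params(3, 3, C, C)                      # three 3x3 layers:  27*C*C
asym = conv_params(7, 1, C, C) + conv_params(1, 7, C, C)   # 7x1 then 1x7:      14*C*C
print(large, stacked, asym)  # 112896 62208 32256
```

Under these assumptions the asymmetric 7 × 1 plus 1 × 7 pair uses the fewest weights, consistent with the patent's preference for it.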
Preferably, S3 specifically includes the following:
s3.1: the characteristic information matrix is subjected to batch normalization processing, and the calculation formula is as follows:
Figure BDA0003334698350000021
wherein, FoutputAn output characteristic information matrix representing batch normalization processing; finputAn input characteristic information matrix representing batch normalization processing; mean () represents the Mean calculation; var (·) denotes variance calculation; eps indicates the introduction of errors and avoids the denominator being zero; gamma represents a scaling factor; β represents a characteristic translation factor;
s3.2: output characteristic information matrix F for batch normalization processing by ReLU functionoutputAnd carrying out nonlinear activation, wherein the output size of the nonlinear activation is kept unchanged, and the formula is as follows:
Figure BDA0003334698350000031
s3.3: down-sampling the characteristic information matrix subjected to nonlinear activation through maximum pooling operation, and changing the output size of the characteristic information matrix;
s3.4: and (4) performing 1 × 1 convolution operation on the matrix obtained in the step (S3.3), performing batch normalization processing, and activating by using a ReLU function to obtain a new characteristic information matrix.
Preferably, S4 specifically includes the following:
s4.1: the characteristic information matrix is divided into two parts according to the equal channel number, and the two parts respectively enter two parallel characteristic screening branches.
S4.2: and splicing the characteristic information matrixes respectively output by the two parallel characteristic screening branches, and performing batch normalization processing on the spliced characteristic information matrixes and activating by utilizing a ReLU function.
S4.3: and performing parallel processing of channel attention and space attention on the characteristic information matrix obtained in the step S4.2, and adding attention to the characteristic information matrix.
S4.4: and adding the characteristic information matrix obtained in the step S3.4 and the characteristic information matrix added with attention in the step S4.3 to output a new characteristic information matrix.
Preferably, in S4.1, each of the two parallel feature screening branches consists of a 1 × 1 point convolution, a 3 × 3 depth-separable convolution and a 1 × 1 point convolution, connected in sequence.
Preferably, in S4.1, the parallel feature screening branches support a variable-size processing operation and an invariant-size processing operation:
in the variable-size operation, after the feature information matrix passes through the 3 × 3 depth-separable convolution, its height and width are halved and its channel count is doubled;
in the invariant-size operation, the size of the feature information matrix is unchanged after the 3 × 3 depth-separable convolution.
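The shape bookkeeping for the two branch modes is simple; a minimal sketch (the starting shape 56 × 56 × 48 matches the size trace given elsewhere in the embodiment):

```python
def branch_output_shape(h, w, c, variable_size):
    """Output (H, W, C) of a feature screening branch for the two modes above."""
    if variable_size:
        # variable-size mode: the 3x3 depth-separable conv downsamples,
        # halving height and width and doubling the channel count
        return h // 2, w // 2, 2 * c
    # invariant-size mode: the shape passes through unchanged
    return h, w, c

print(branch_output_shape(56, 56, 48, variable_size=True))   # (28, 28, 96)
print(branch_output_shape(56, 56, 48, variable_size=False))  # (56, 56, 48)
```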
Preferably, S4.3 specifically includes the following:
s4.3.1: compressing the characteristic information matrix in the spatial dimension, and performing global average pooling, multi-layer perceptron and Sigmoid activation function processing to obtain a multi-layer information matrix with the size of 1 × 1 and unchanged channels, wherein the multi-layer information matrix is a weight matrix of the characteristic information in the channel dimension, namely a channel attention Foutput_CThe formula is as follows:
Foutput_C=σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))
wherein the first perception operation MLP1(·) and the second perception operation MLP2(·) are a channel-reducing and a channel-restoring 1 × 1 convolution respectively, i.e. MLP1(·) = Conv(inchannel, inchannel/r, 1) and MLP2(·) = Conv(inchannel/r, inchannel, 1) for a channel reduction ratio r;
inchannel is the number of channels of the feature information matrix output by S4.2; Conv(in, out, kernel_size) denotes a convolution operation with in input channels, out output channels and a kernel of size kernel_size; ReLU(·) is the ReLU function; σ(·) denotes the Sigmoid activation function,
σ(x) = 1 / (1 + e^(−x));
AvgPool(·) denotes the average pooling operation; and F′input is the feature information matrix obtained in S4.2;
compress the feature information matrix in the channel dimension through channel averaging and a Sigmoid activation function, yielding a single-channel information matrix of unchanged size. This matrix is the weight matrix of the feature information in the spatial dimension, i.e. the spatial attention Foutput_S. The formula is:
Foutput_S=σ(Mean(F′input))
where Mean(·) denotes the mean over the channel dimension;
s4.3.2: attention to the channel Foutput_CAnd spatial attention Foutput_SThe feature information matrix F 'obtained in S4.2 is added in a multiplication mode'inputAnd merge the input features FinputNamely, the input characteristics, the channel attention and the space attention are fused in parallel to obtain a characteristic information matrix Foutput_dualThe formula is as follows:
Foutput_dual=Finput*σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))*σ(Mean(F′input))。
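A simplified pure-Python sketch of S4.3 follows: channel and spatial attention weights are computed from F′ and applied multiplicatively to the input features F. To keep the example short, the perceptron MLP2(ReLU(MLP1(·))) is omitted (treated as identity), which is an assumption, not the patent's exact pipeline.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(F):
    """[C][H][W] -> [C]: global average pool each channel, then Sigmoid."""
    return [sigmoid(sum(sum(r) for r in ch) / (len(ch) * len(ch[0]))) for ch in F]

def spatial_attention(F):
    """[C][H][W] -> [H][W]: channel mean at each position, then Sigmoid."""
    C, H, W = len(F), len(F[0]), len(F[0][0])
    return [[sigmoid(sum(F[c][i][j] for c in range(C)) / C) for j in range(W)]
            for i in range(H)]

def dual_attention(F_in, F_prime):
    """Fuse input features with both attention weights multiplicatively."""
    ca, sa = channel_attention(F_prime), spatial_attention(F_prime)
    return [[[F_in[c][i][j] * ca[c] * sa[i][j]
              for j in range(len(F_in[0][0]))]
             for i in range(len(F_in[0]))]
            for c in range(len(F_in))]

F = [[[1.0, 1.0], [1.0, 1.0]]]   # one channel, 2x2, all ones
out = dual_attention(F, F)
```

With an all-ones input, every attention weight equals σ(1), so each output element is σ(1)² ≈ 0.534, illustrating how the two weight matrices jointly rescale the features.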
In a second aspect, the invention further provides an image classification system based on a parallel dual-attention lightweight residual network, applying the image classification method of any of the above solutions, and comprising:
a preprocessing module for preprocessing an input image;
a feature information extraction module for optimizing the structure of the residual network's convolution kernels and extracting a feature information matrix from the input image;
a feature information processing module for processing the feature information matrix with the parallel dual-attention lightweight residual structure in the residual network, obtaining a one-dimensional feature information matrix containing accurate feature information;
and an image classification module for inputting the one-dimensional feature information matrix into the fully-connected layer of the residual network, obtaining a matrix whose size equals the number of classes in the classification task, and outputting the image classification result.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
(1) The invention reduces network parameters and hardware memory usage through convolution kernel decomposition, channel separation, depth-separable convolution and network width adjustment, thereby lowering the network's computational load, accelerating its computation and achieving a lightweight model.
(2) The invention further adopts a dual-branch spatial-channel attention mechanism (DBSC) to improve recognition capability, raising the accuracy of the whole network on top of the lightweight model and achieving a better classification effect.
While preserving accuracy, the invention compresses the parameters and computation and raises the processing speed, thereby improving the overall efficiency of deep-neural-network-based target classification and recognition.
Drawings
FIG. 1 is a flowchart of the image classification method based on a parallel dual-attention lightweight residual network in embodiment 1.
Fig. 2 is an overall framework diagram of the parallel dual-attention lightweight residual network model in embodiment 1.
FIG. 3 is a comparison of three different convolution kernels in embodiment 1.
Fig. 4 is a flowchart of the parameter weight reduction applied to the original Bottleneck residual structure by channel-separated parallel computation in embodiment 1.
Fig. 5 is a schematic diagram of adjusting the network width in the Bottleneck residual structure in embodiment 1.
Fig. 6 is a schematic diagram of the operation of the dual-branch spatial-channel attention mechanism in embodiment 1.
Fig. 7 is a diagram of the overall architecture of the four Bottleneck residual structures in embodiment 2.
FIG. 8 shows the Top-1 error evolution of the four residual structures evaluated on the Animals-10 and CIFAR-10 datasets in embodiment 2.
Fig. 9 is a schematic diagram of the image classification system based on a parallel dual-attention lightweight residual network.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
Referring to fig. 1 to fig. 6, the present embodiment provides an image classification method based on a parallel dual attention lightweight residual error network, including the following steps:
s1: the image is input to a residual network and preprocessed.
In this embodiment, the Animals-10 image dataset and the CIFAR-10 dataset are used as input images. Both Animals-10 and CIFAR-10 contain 10 classes of images. For the Animals-10 dataset, this embodiment uses 25,000 images as the training set, 1,000 as the validation set and 1,000 as the test set. For the CIFAR-10 dataset, 50,000 images are used as the training set, 5,000 as the validation set and 5,000 as the test set.
In this embodiment, the input image is resized to a uniform 224 × 224 × 3 by padding or cropping. As shown in fig. 2, which is an overall framework diagram of the parallel dual-attention lightweight residual network model, the number of input channels of the first layer is 3 and the number of output channels is aN, where a is a channel width factor and N is the number of output channels of the original residual network ResNet.
In this embodiment, the number of input and output channels of each residual-structure layer in the residual network can be set or modified, via the channel dimension of each convolution kernel, according to the convolution operation executed by that layer.
S2: and carrying out structure optimization on the convolution kernel of the residual error network, and extracting a characteristic information matrix of the input image.
The input layer of the existing residual network convolves the input with a large 7 × 7 kernel to strengthen the correlation between the input image and the first-layer feature map. However, the accuracy gain obtained by large 7 × 7 or 5 × 5 kernels is not proportional to their resource consumption: computing-power consumption grows much faster, making the computation very expensive.
In this embodiment, two schemes are proposed to replace the large 7 × 7 kernel for extracting the feature information matrix of the input image. One splits the 7 × 7 kernel of the residual network input layer into several serially connected layers of small symmetric kernels, preferably three consecutive 3 × 3 layers, which are then used to extract the feature information matrix of the input image. The other decomposes the 7 × 7 kernel into one layer of 7 × 1 and one layer of 1 × 7 asymmetric kernels, which extract the feature information matrix in turn; this transformation is called spatial decomposition into asymmetric convolutions. As shown in fig. 3, the structures of the three different convolution kernels designed in this embodiment are compared.
From the experimental results in table 1, three consecutive layers of 3 × 3 symmetric small kernels improve computational efficiency, but the improvement is rather limited and comes at the expense of reduced expressive ability. The one-layer 7 × 1 plus one-layer 1 × 7 asymmetric kernels keep the same input size and output depth while producing fewer model parameters and a faster computation speed (fps), with the recognition accuracy of the algorithm remaining almost unchanged.
TABLE 1 comparison of training time and accuracy before and after model compression
(Table 1 appears as an image in the original document.)
As can be seen from table 1, both the small convolution kernels and the asymmetric convolution kernels effectively reduce the network parameters and improve memory utilization: the multi-layer consecutive symmetric small kernels reduce training time by 22.3%, and the asymmetric kernels reduce it by about 29.0%. However, during inference the frame rate of the multi-layer small-kernel operation is similar to that of the large kernel, while the asymmetric convolution greatly improves the frame recognition rate (fps) and accelerates inference; the invention therefore preferentially selects the asymmetric kernels to optimize the large-kernel operation of the network input layer. When the feature information matrix is extracted with one layer of 7 × 1 and one layer of 1 × 7 asymmetric kernels, its size changes as 224 × 224 × 3 → 112 × 224 × 3 → 112 × 112 × 48.
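The spatial part of the size trace above can be checked with the standard convolution output-size formula, assuming "same"-style padding (pad 3 on the 7-tap axis) and stride 2 along that axis; these settings are plausible inferences, not spelled out in the patent text.

```python
def conv_out(size, kernel, stride, pad):
    """Output size of a 1-D convolution axis: floor((size + 2*pad - kernel)/stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

h = conv_out(224, 7, 2, 3)   # 7x1 kernel, stride 2 along height: 224 -> 112
w = conv_out(224, 7, 2, 3)   # 1x7 kernel, stride 2 along width:  224 -> 112
print(h, w)  # 112 112
```

With stride 1 the same padding would leave the axis at 224, matching the intermediate 112 × 224 × 3 stage where only the height has been reduced.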
S3: and carrying out batch normalization processing on the characteristic information matrix and carrying out nonlinear activation. When a deep neural network model is constructed, the perception effect of the middle layer (hidden layer) of the neural network is almost disappeared because the linear function only generates linear change to the input signal, namely the output signal is always the linear combination of the input signal. Therefore, it is necessary to introduce a nonlinear function to approximate the input signal of the corresponding layer, so that each layer of the neural network can generate corresponding spatial mapping or transformation in the nonlinear function, and thus the neural network generates multi-layer perception through approximation of the nonlinear function. In the present embodiment, the ReLU function is preferably used to perform nonlinear activation on the neural network model.
Specifically, the step of S3 includes the following:
s3.1: the characteristic information matrix is subjected to batch normalization processing, and the calculation formula is as follows:
Figure BDA0003334698350000072
wherein, FoutputAn output characteristic information matrix representing batch normalization processing; finputAn input characteristic information matrix representing batch normalization processing; mean () means allCalculating a value; var (·) denotes variance calculation; eps is an introduced error, and the default value is 1 e-5; gamma represents a scaling factor, with a default value of 1; beta represents a characteristic translation factor, and the default value is 0;
s3.2: output characteristic information matrix F for batch normalization processing by ReLU functionoutputAnd carrying out nonlinear activation, wherein the output size of the nonlinear activation is kept unchanged, and the formula is as follows:
Figure BDA0003334698350000081
s3.3: the non-linearly activated feature information matrix is down-sampled by a maximum pooling operation with a step size of 1,3 × 3 so that the output size of the feature information matrix becomes 56 × 56 × 48.
S3.4: and (4) performing 1 × 1 convolution operation on the matrix obtained in the step (S3.4), performing batch normalization processing, and activating by using a ReLU function to obtain a new characteristic information matrix.
S4: the parallel processing of the channel attention and the spatial attention is performed on the feature information matrix obtained in S3, and a new feature information matrix is output, which specifically includes the following steps:
s4.1: dividing the characteristic information matrix into two parts according to the equal channel number, and respectively entering two parallel characteristic branches; wherein, the two parallel characteristic branches respectively comprise a 1 × 1 point convolution, a 3 × 3 depth separable convolution and a 1 × 1 point convolution which are connected in sequence.
In this embodiment, the parallel feature screening branch is provided with a variable size processing operation and a non-variable size processing operation: after the characteristic information matrix is subjected to separable convolution with the depth of 3 multiplied by 3 in the variable-size operation, the height and the width are halved, and the number of channels is doubled; the characteristic information matrix in the operation of the unchanged size has no change in size after being subjected to the depth separable convolution of 3 multiplied by 3. In the implementation process, when the 3 × 3 deep separable convolution is performed in each layer network, the modification of the size of the feature map matrix can be realized by changing the number of channels of the 3 × 3 convolution kernel and the convolution operation.
Because the middle hidden layers of the neural network in the invention are mainly a stack of residual structures, each layer's residual structure adopts either the variable-size or the invariant-size operation, determining which feature map information is retained in that layer and passed downward. The feature map information to be retained is set according to the result of the previous layer (for example, a matching feature map size) and the relative quality of the experimental test results.
In this embodiment, channel separation is performed at the input layer: the feature information matrix entering the residual structure is divided into two parts with equal channel counts, enabling parallel computation over multiple convolution kernels, improving convolution efficiency and reducing the parameter count of the network. The channel separation technique compresses the parameters of the Bottleneck residual structure that forms the core backbone of the residual network.
In the implementation, the input feature information of each Bottleneck module is evenly divided into C groups. Assuming the feature map entering the current residual module has N input channels, the feature map of each convolution group (a 1 × 1 point convolution, a 3 × 3 depth-separable convolution and a 1 × 1 point convolution in sequence) has N/C channels; that is, the N channels of the original feature map are divided evenly so that each parallel convolution group handles N/C channels. Each divided group of feature channels is processed and computed by an independent convolution group, realizing multi-branch parallel computation; C = 2 is preferred in this embodiment.
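The channel-separation step amounts to an even partition of the N channel indices into C groups (C = 2 being the preferred value here); a minimal sketch:

```python
def channel_split(n_channels, groups=2):
    """Divide n_channels channel indices evenly into `groups` parallel groups."""
    assert n_channels % groups == 0, "channels must divide evenly into groups"
    per_group = n_channels // groups
    return [list(range(g * per_group, (g + 1) * per_group)) for g in range(groups)]

print(channel_split(8, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Each returned group would then be fed to its own 1 × 1, 3 × 3 depth-separable, 1 × 1 convolution sequence, so the branches can run in parallel.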
In this embodiment, the possibility that parameter optimization can be performed is continuously searched for inside a convolution group including 1 × 1-point convolution, 3 × 3-depth separable convolution, and 1 × 1-point convolution in this order. The 3 × 3 Convolution operation in the original ResNet residual error network is a standard Convolution operation with high calculation power consumption, and in order to further improve the calculation speed and reduce the parameter quantity and the calculation quantity, the invention adopts the depth Separable Convolution (Depthwise Separable Convolition) with higher calculation efficiency to replace the standard 3 × 3 Convolution operation in the Bottleneck residual error structure. The main idea of the depth separable convolution is to combine the depth convolution with a 1 x 1 point-by-point convolution instead of the standard convolution.
For a standard convolution, each kernel must be applied across all N input channels, which requires many computational operations and correspondingly more energy. Since the 3 × 3 standard convolution kernel is the main computational cost of the Bottleneck residual structure, a depthwise convolution is first used here to apply a single 3 × 3 kernel to each input channel, reducing the computational complexity; a point-by-point convolution, implemented as a simple 1 × 1 convolution, then creates a linear combination of the depthwise outputs, and its depth can be flexibly controlled to map the features to higher dimensions. In addition, Batch Normalization (BN) and ReLU nonlinear activation are applied after both stages.
The depthwise convolution splits the 3 × 3 convolution into per-channel kernels: each input channel is convolved only with its own single-layer 3 × 3 kernel, yielding one output channel of information per input channel, and these per-channel outputs are then integrated by the 1 × 1 point-by-point convolution to obtain the complete output features. The depthwise separable convolution thus reduces the computation of the convolution and compresses the network parameter amount, at the cost of sacrificing some correlation between channels.
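The parameter saving of the depthwise separable replacement can be made concrete with a short sketch (PyTorch assumed; the 64-channel sizes are illustrative, not taken from the patent). A standard 3 × 3 convolution needs Cin × Cout × 9 weights, while depthwise (Cin × 9) plus 1 × 1 pointwise (Cin × Cout) is far smaller:

```python
import torch
import torch.nn as nn

cin, cout = 64, 64

# Standard 3x3 convolution: every kernel spans all input channels
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False)

# Depthwise: groups=cin gives one single-channel 3x3 kernel per input channel
depthwise = nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin, bias=False)
# Pointwise: 1x1 convolution linearly combines the depthwise outputs
pointwise = nn.Conv2d(cin, cout, kernel_size=1, bias=False)

n_std = sum(p.numel() for p in standard.parameters())        # 64*64*3*3 = 36864
n_sep = (sum(p.numel() for p in depthwise.parameters())
         + sum(p.numel() for p in pointwise.parameters()))   # 64*3*3 + 64*64 = 4672

x = torch.randn(1, cin, 32, 32)
y = pointwise(depthwise(x))  # same output shape as standard(x)
print(n_std, n_sep, y.shape)
```

Here the separable variant uses roughly one eighth of the standard convolution's weights while producing a feature map of identical shape.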
The depthwise convolution in this embodiment also supports network width adjustment: a channel parameter α (0 < α ≤ 1) is introduced, and the number of channels of the feature map — the convolution channel depth of the Bottleneck residual structure — is adjusted according to the formula M′ = αM, where M represents the initial channel depth and M′ the modified channel depth. As shown in fig. 5, the network width is reduced according to the size of α, reducing the computation amount and compressing the network parameters, and an optimal α balancing accuracy and lightness is obtained from the experimental results.
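The width adjustment M′ = αM can be sketched as a small helper (illustrative only; the rounding rule and the floor of one channel are assumptions, since the patent does not state how fractional channel counts are handled):

```python
def adjust_width(m: int, alpha: float) -> int:
    """Scale a channel depth M by the width multiplier alpha (0 < alpha <= 1),
    truncating to an integer and keeping at least one channel (assumed rule)."""
    assert 0 < alpha <= 1
    return max(1, int(m * alpha))

print(adjust_width(256, 0.5))   # 128
print(adjust_width(256, 1.0))   # 256
```

Applying the same α to every Bottleneck shrinks both the depthwise and pointwise weight tensors, which is why the multiplier compresses the whole network rather than a single layer.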
S4.2: splicing (Concat) the characteristic information matrixes respectively output by the two parallel characteristic branches, and performing batch normalization processing on the spliced characteristic information matrixes and activating by utilizing a ReLU function;
s4.3: and performing parallel processing of channel attention and space attention on the characteristic information matrix obtained in the step S4.2, and adding attention to the characteristic information matrix.
The attention mechanism mainly enables the model to learn, during training, to ignore relatively irrelevant information in an image and pay more attention to the information of interest, recovering accuracy lost through the lightweight transformation; it is essentially a weighting operation. The feature information of the Bottleneck input features is processed by channel averaging and convolution averaging respectively to obtain the channel attention and the spatial attention, i.e. the weights; the two kinds of weights are then simultaneously multiplied with the Bottleneck output features, assigning the weights to the output features, increasing their feature differences and improving recognition accuracy. The working principle is shown in fig. 6, and specifically comprises the following steps:
s4.3.1: compressing the characteristic information matrix in the spatial dimension, and performing global average pooling, multi-layer perceptron and Sigmoid activation function processing to obtain a multi-layer information matrix with the size of 1 × 1 and unchanged channels, wherein the multi-layer information matrix is a weight matrix of the characteristic information in the channel dimension, namely a channel attention Foutput_CThe formula is as follows:
Foutput_C=σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))
wherein the first perception operation MLP1(·) and the second perception operation MLP2(·) are convolution operations whose channel mappings are given as formula images in the original, in the Conv(in, out, kernel_size) notation defined below;
wherein inchannel is the number of channels of the characteristic information matrix output by S4.2; Conv(in, out, kernel_size) represents a convolution operation, where in is the number of input channels, out is the number of output channels, and kernel_size is the size of the convolution kernel; ReLU(·) is the ReLU function; σ(·) denotes the Sigmoid activation function, and
σ(x) = 1/(1 + e^(−x));
AvgPool(·) represents the average pooling operation; F′input is the characteristic information matrix obtained in S4.2;
compressing the characteristic information matrix in the channel dimension, and performing channel averaging and Sigmoid activation function processing, yields a single-channel information matrix of unchanged spatial size; this matrix is the weight matrix of the characteristic information in the spatial dimension, i.e. the spatial attention Foutput_S, with the formula:
Foutput_S=σ(Mean(F′input))
wherein Mean (·) represents the Mean calculation;
s4.3.2: attention to the channel Foutput_CAnd spatial attention Foutput_SThe feature information matrix F 'obtained in S4.2 is added in a multiplication mode'inputThe formula is as follows:
Foutput_dual=Finput*σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))*σ(Mean(F′input)).
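The parallel channel/spatial weighting above can be sketched as a small module (PyTorch assumed; implementing MLP1/MLP2 as 1 × 1 convolutions with a reduction ratio r is a common convention and an assumption here, since the patent gives their definitions only as formula images):

```python
import torch
import torch.nn as nn

class DBSCAttention(nn.Module):
    """Sketch of the parallel dual-branch spatial-channel attention:
    channel branch sigma(MLP2(ReLU(MLP1(AvgPool(F'))))), spatial branch
    sigma(Mean(F')), both multiplied onto the input features."""

    def __init__(self, channels: int, r: int = 16):  # r is an assumed ratio
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.mlp1 = nn.Conv2d(channels, channels // r, kernel_size=1)
        self.mlp2 = nn.Conv2d(channels // r, channels, kernel_size=1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_in: torch.Tensor, f_prime: torch.Tensor) -> torch.Tensor:
        # Channel attention: spatial compression to a 1x1 map per channel
        ca = self.sigmoid(self.mlp2(self.relu(self.mlp1(self.avg_pool(f_prime)))))
        # Spatial attention: channel averaging to a single-channel map
        sa = self.sigmoid(f_prime.mean(dim=1, keepdim=True))
        # Parallel fusion: F_input * CA * SA, broadcast over H, W and channels
        return f_in * ca * sa

att = DBSCAttention(64)
f = torch.randn(2, 64, 16, 16)
out = att(f, f)
print(out.shape)
```

Both attention maps broadcast against the full feature tensor, so the output keeps the input's shape while each position is reweighted by its channel and spatial saliency.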
Since a single attention mechanism can usually only focus on key information in its own one-dimensional feature space, different attention architectures produce distinctly different results in deep convolutional neural networks. Unlike many previous network frameworks that use only spatial attention in low-level feature extraction and apply channel attention in high-level feature extraction, the present invention combines spatial attention and channel attention into a dual-branch spatial-channel (DBSC) attention mechanism, i.e. the channel attention processing and the spatial attention processing are performed in parallel. At the output of the residual structure, the channel attention Foutput_C and the spatial attention Foutput_S are multiplied and simultaneously applied to the feature information matrix F′input, and the input features Finput are merged; that is, the input features, the channel attention and the spatial attention are fused in parallel to obtain the feature information matrix Foutput_dual.
The advantages of the double-branch space channel are: on the one hand, the salient position information of each feature map can be emphasized through spatial attention; on the other hand, a significant region present in some feature maps may be captured by the channel attention in another branch.
S4.4: and adding the characteristic information matrix obtained in the step S3.4 and the characteristic information matrix added with attention in the step S4.3 to output a new characteristic information matrix, wherein the matrix can retain the characteristic information of the original image to the greatest extent.
Because the neural networks with different depths can be realized by superposing different numbers of network layers, the depth of the whole residual error network is changed by superposing S residual error structures. In the embodiment, when the performance of the network architecture is tested, it is determined through experimental data that S is set to 16, and the "precision-speed" efficiency ratio is the highest at this time. The number and size of input/output channels of the residual error structure of each network layer are set by a convolution kernel adopted in convolution operation to be executed by the network layer according to the information of the connected upper and lower layer characteristic graphs, and final characteristic information is obtained after transformation of an S layer residual error module.
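The depth adjustment by stacking S residual structures can be sketched as follows (PyTorch assumed; `Block` is a simplified stand-in for the actual Bottleneck residual structure, and S = 16 follows the embodiment's choice):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Simplified residual block with an identity shortcut (illustrative only;
    the real structure uses the split/depthwise-separable Bottleneck)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(x + self.bn(self.conv(x)))  # shortcut addition

def make_backbone(channels: int, s: int = 16) -> nn.Sequential:
    """Stack S residual structures to set the overall network depth."""
    return nn.Sequential(*[Block(channels) for _ in range(s)])

backbone = make_backbone(64, s=16)
y = backbone(torch.randn(1, 64, 8, 8))
print(len(backbone), y.shape)
```

Varying `s` changes the network depth without touching any individual block, which is the "superposition" mechanism the paragraph describes.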
S5: performing global average pooling on the characteristic information matrix output by the residual error structure, integrating full-layer spatial information, and converting the characteristic information matrix into a one-dimensional characteristic information matrix through a flatten operation;
s6: and inputting the one-dimensional characteristic information matrix into the full-connection layer to obtain a matrix of the number of classes corresponding to the classification task, and realizing image classification.
Replacing the large-kernel convolution of the input layer with multiple layers of small convolution kernels or with asymmetric convolution reduces the input parameter amount and improves the computation speed (fps), while also allowing the network depth to be increased to raise network capacity and obtain sufficient image feature information to keep the recognition accuracy stable. By optimizing the residual structure and applying network compression techniques such as replacing standard convolution with depthwise separable convolution, the parameter amount of the residual structure is again greatly reduced, and after the multi-layer residual structures are stacked the overall model is lighter. The image classification method based on the parallel dual-attention lightweight residual network can therefore perform fast, high-precision image target classification and recognition locally (at the network edge), reducing hardware resource consumption and energy consumption. The method has high application value for many devices, such as high-definition televisions, computer monitors, cameras, smartphones and tablet computers. It can also be applied in various computer vision fields, such as object detection, medical imaging, security and surveillance imaging, face recognition and remote sensing. By selecting a suitable computing platform, the deep residual network model trained by the invention can be deployed at key nodes of new-generation information technologies such as big data, the Internet of Things and cloud services, i.e. at edge device terminals.
Example 2
Referring to fig. 7-8, this embodiment provides an image classification method based on a parallel dual-attention lightweight residual network, which further includes various improved designs of the overall Bottleneck residual structure: a Bottleneck residual structure PAResNet with only the dual-branch spatial-channel attention mechanism, a Bottleneck residual structure LightResNet combining channel splitting and depthwise separable convolution, and a Bottleneck residual structure ALResNet combining channel splitting, depthwise separable convolution and the parallel dual-branch spatial-channel attention mechanism.
The lightweight residual error bottleneck structure is important for constructing a lightweight residual error network. Therefore, different residual networks can be formed by overlapping according to different residual structures in fig. 7, as shown in table 3.
TABLE 3 comparison of four residual network architectures
(table provided as an image in the original document)
Wherein the second residual network PAResNet is constructed by stacking the Bottleneck residual structures shown in fig. 7 (b); the third residual network LightResNet is constructed by stacking the channel-split Bottleneck residual structures shown in fig. 7 (c); and the attention-driven ALResNet lightweight residual network is constructed by combining the parallel dual-branch spatial-channel attention mechanism with the lightweight-oriented techniques, as shown in fig. 7 (d). Asymmetric convolution is applied at the inputs of LightResNet and ALResNet instead of a 7 × 7 convolution kernel. Unlike previous work that uses separate attention strategies at different levels of feature extraction, the present invention uses spatial attention and channel attention in parallel for low-, medium- and high-level feature extraction. Since ALResNet fuses the advantages of model compression and attention mechanisms, it is expected to obtain better accuracy with fewer parameters and less computational cost. More specifically, the input of the dual-branch spatial-channel attention module is connected to the output of the convolutional layers in the Bottleneck, and the identity-mapping branch of the Bottleneck is connected to the output of the attention module, as illustrated in fig. 7. However, stacking the attention module directly leads to significant performance degradation. Although the dual-branch spatial-channel attention mechanism can be integrated into the residual Bottleneck in different ways to obtain more accurate feature information, it also brings more model parameters and computation; therefore, when optimizing the Bottleneck residual structure, the invention combines the dual-branch spatial-channel attention mechanism with the lightweight network compression techniques to form the Bottleneck residual structure shown in fig. 7(d), balancing parameter count and network precision.
Furthermore, the dual branch spatial channel attention mechanism of the present invention may also be applied to other types of layers, blocks, or networks.
This example performed a number of experiments on the Animals-10 dataset and the CIFAR-10 dataset, the results of which are shown in Table 4, Table 5, Table 6, Table 7 and FIG. 8.
TABLE 4 Experimental results for four residual structures estimated on the Animals-10 dataset
(table provided as an image in the original document)
TABLE 5 Experimental results for four residual structures estimated on CIFAR-10 dataset
(table provided as an image in the original document)
TABLE 6 evaluation of the comparison of Performance of network models formed by superposition of different residual structures on the Animals-10 dataset
(table provided as an image in the original document)
TABLE 7 evaluation of the comparison of the Performance of a network model formed by the superposition of different residual structures on the CIFAR-10 dataset
(table provided as an image in the original document)
From the experimental results it can be seen that the model size of the residual network with the added attention mechanism increases only slightly, by about 4.95%, with the parameter amount reaching 24.82M, but considerable improvement is achieved in validation accuracy (up to 97.4% on Animals-10 and up to 92.5% on CIFAR-10) and in test accuracy (up to 95.2% on Animals-10 and up to 92.6% on CIFAR-10). According to the experimental results, the accuracy of the different attention mechanisms follows: channel attention > spatial attention; fused attention > single attention; parallel attention > serial attention. In particular, PAResNet integrated with parallel DBSC attention achieves the best accuracy. Nevertheless, the model size of PAResNet is still far from satisfactory for edge computation.
In this embodiment, ALResNet with S = 16 stacked residual structures has a parameter amount of 4.77M, only one fifth of that of the original ResNet-50, with an inference speed of up to 14.90 fps on Animals-10 and up to 16.21 fps on CIFAR-10. In addition, ALResNet achieves 92.1% top-1 test accuracy on Animals-10 and 89.4% top-1 test accuracy on CIFAR-10, at a computational cost of 736.82 MFLOPs. These results demonstrate the effectiveness of the spatial-channel attention mechanism and the lightweight-oriented network compression techniques. Compared with the most advanced studies, the proposed ALResNet achieves a good trade-off between accuracy and computational efficiency for fast inference on resource-limited mobile devices in vision-based tasks.
This embodiment further studies the performance of the lightweight residual network LightResNet, which involves only lightweight-oriented compression techniques. Its impact on model size, computational efficiency and recognition accuracy is evaluated separately. The experimental results on Animals-10 and CIFAR-10 are summarized in tables 6 and 7, respectively. According to these results, PAResNet with uncompressed network scale achieves the best accuracy, but its parameter amount and inference speed need improvement. In contrast, LightResNet, which involves no attention-driven layers, substantially reduces the parameter amount to 4.08M and achieves an inference speed two times higher than ResNet-50; using model compression techniques to improve computational efficiency is therefore feasible. However, LightResNet achieves the worst accuracy on both datasets. The model error-epoch curves shown in fig. 8 likewise demonstrate a non-negligible error for the lightweight LightResNet. In other words, the recognition capability of LightResNet is diminished by the loss of feature information during network compression.
The three networks each have their own advantages: the lightweight LightResNet has the fastest inference speed and the smallest parameter count, but also the highest error rate; PAResNet, which only adds the attention mechanism, shows better recognition accuracy, but also has the highest parameter count; and ALResNet, which combines both the attention mechanism and the network compression techniques, achieves the best trade-off between accuracy and speed.
Example 3
Referring to fig. 9, the present embodiment further provides an image classification system based on a parallel dual attention lightweight residual error network, which is applied to the image classification method based on a parallel dual attention lightweight residual error network in the foregoing embodiment, and includes:
the device comprises a preprocessing module, a characteristic information extraction module, a characteristic information processing module and an image classification module.
In the specific implementation process, the preprocessing module performs uniform-size preprocessing on the input image; the feature information extraction module performs structural optimization on the convolution kernels of the residual network and extracts the feature information matrix of the input image; the feature information processing module processes the feature information matrix using the parallel dual-attention lightweight residual structure to obtain a one-dimensional feature information matrix containing accurate feature information; and the image classification module inputs the one-dimensional feature information matrix into the fully connected layer of the residual network to obtain a matrix of the number of classes corresponding to the classification task, realizing image classification.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. The image classification method based on the parallel double-attention light-weight residual error network is characterized by comprising the following steps of:
s1: inputting the image into a residual error network and preprocessing the image;
s2: performing structural optimization on a convolution kernel of the residual error network, and extracting a characteristic information matrix of an input image;
s3: carrying out batch normalization processing on the characteristic information matrix and carrying out nonlinear activation;
s4: performing parallel processing of channel attention and space attention on the characteristic information matrix obtained in the step S3, and outputting a new characteristic information matrix;
s5: performing global average pooling on the characteristic information matrix obtained in the step S4, integrating full-layer spatial information, and converting the characteristic information matrix into a one-dimensional characteristic information matrix;
s6: and inputting the one-dimensional characteristic information matrix into the full-connection layer to obtain a matrix of the number of classes corresponding to the classification task, and outputting an image classification result.
2. The method of image classification based on parallel dual attention lightweight residual network according to claim 1, characterized in that the preprocessing of the image in S1 includes a unified modification of the size of the image by means of supplementation or cropping.
3. The method for image classification based on the parallel dual-attention lightweight residual network according to claim 1, wherein S2 specifically comprises: dividing an A × A large convolution kernel arranged at the residual network input layer into a plurality of layers of serially connected small symmetric convolution kernels, then sequentially inputting the preprocessed input image into the serially connected small symmetric convolution kernels, and extracting the characteristic information matrix of the input image; wherein the size of any one of the small symmetric convolution kernels is B × B, and A > B ≥ 1.
4. The method for image classification based on the parallel dual-attention lightweight residual network according to claim 1, wherein S2 specifically comprises: decomposing the A multiplied by A large convolution kernel arranged on the residual network input layer into a layer of A multiplied by 1 and a layer of 1 multiplied by A asymmetric convolution kernels which are connected in sequence, then sequentially inputting the preprocessed input image into the layer of A multiplied by 1 and the layer of 1 multiplied by A asymmetric convolution kernels, and extracting to obtain a characteristic information matrix of the input image; wherein A is a positive integer greater than 1.
5. The method for image classification based on the parallel dual attention lightweight residual network according to claim 1, wherein S3 specifically comprises the following:
s3.1: the characteristic information matrix is subjected to batch normalization processing, and the calculation formula is as follows:
Foutput = γ·(Finput − Mean(Finput))/√(Var(Finput) + eps) + β
wherein, FoutputAn output characteristic information matrix representing batch normalization processing; finputAn input characteristic information matrix representing batch normalization processing; mean () represents the Mean calculation; var (·) denotes variance calculation; eps is an introduced error; gamma represents a scaling factor; β represents a characteristic translation factor;
s3.2: output characteristic information matrix F for batch normalization processing by ReLU functionoutputAnd carrying out nonlinear activation, wherein the output size of the nonlinear activation is kept unchanged, and the formula is as follows:
ReLU(x) = max(0, x)
s3.3: down-sampling the characteristic information matrix subjected to nonlinear activation through maximum pooling operation, and changing the output size of the characteristic information matrix;
s3.4: and (4) performing 1 × 1 convolution operation on the matrix obtained in the step (S3.3), performing batch normalization processing, and activating by using a ReLU function to obtain a new characteristic information matrix.
6. The method for image classification based on the parallel dual attention lightweight residual network according to claim 5, wherein S4 specifically comprises the following steps:
s4.1: dividing the characteristic information matrix into two parts according to the equal channel number, and respectively entering two parallel characteristic screening branches;
s4.2: splicing the characteristic information matrixes respectively output by the two parallel characteristic screening branches, and performing batch normalization processing on the spliced characteristic information matrixes and activating by utilizing a ReLU function;
s4.3: performing parallel processing of channel attention and space attention on the characteristic information matrix obtained in the S4.2, and adding attention to the characteristic information matrix;
s4.4: and adding the characteristic information matrix obtained in the step S3.4 and the characteristic information matrix added with attention in the step S4.3 to output a new characteristic information matrix.
7. The method for image classification based on the parallel double-attention light-weight residual error network according to the claim 6, wherein in S4.1, two parallel feature screening branches respectively comprise a 1 x 1 point convolution, a 3 x 3 depth separable convolution and a 1 x 1 point convolution which are connected in sequence.
8. The parallel dual-attention light-weight residual error network-based image classification method according to claim 7, characterized in that in S4.1, the parallel feature screening branch is provided with variable-size processing operation and invariable-size processing operation:
in the variable-size processing operation, after the characteristic information matrix is subjected to 3 multiplied by 3 depth separable convolution, the height and the width are halved, and the number of channels is doubled;
in the invariant size processing operation, the feature information matrix is not changed in size after being subjected to 3 × 3 depth separable convolution.
9. The method for image classification based on the parallel dual attention lightweight residual network according to claim 6, wherein S4.3 specifically comprises the following:
s4.3.1: compressing the characteristic information matrix in the spatial dimension, and performing global average pooling, multi-layer perceptron and Sigmoid activation function processing to obtain a multi-layer information matrix with the size of 1 × 1 and unchanged channels, wherein the multi-layer information matrix is a weight matrix of the characteristic information in the channel dimension, namely a channel attention Foutput_CThe formula is as follows:
Foutput_C=σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))
wherein the first perception operation MLP1(·) and the second perception operation MLP2(·) are convolution operations whose channel mappings are given as formula images in the original, in the Conv(in, out, kernel_size) notation defined below;
wherein inchannel is the number of channels of the characteristic information matrix output by S4.2; Conv(in, out, kernel_size) represents a convolution operation, where in is the number of input channels, out is the number of output channels, and kernel_size is the size of the convolution kernel; ReLU(·) is the ReLU function; σ(·) denotes the Sigmoid activation function, and
σ(x) = 1/(1 + e^(−x));
AvgPool(·) represents the average pooling operation; F′input is the characteristic information matrix obtained in S4.2;
compressing the characteristic information matrix in the channel dimension, and performing channel averaging and Sigmoid activation function processing to obtain a size-invariant single-channel information matrix, which is the weight matrix of the characteristic information in the spatial dimension, i.e. the spatial attention Foutput_S, with the formula:
Foutput_S=σ(Mean(F′input))
wherein Mean (·) represents the Mean calculation;
s4.3.2: attention to the channel Foutput_CAnd spatial attention Foutput_SThe feature information matrix F 'obtained in S4.2 is added in a multiplication mode'inputAnd merge the input features FinputNamely, the input characteristics, the channel attention and the space attention are fused in parallel to obtain a characteristic information matrix Foutput_dualThe formula is as follows:
Foutput_dual=Finput*σ(MLP2(ReLU(MLP1(AvgPool(F′input)))))*σ(Mean(F′input))。
10. An image classification system based on the parallel double-attention light-weight residual error network, applied to the above image classification method based on the parallel double-attention light-weight residual error network, characterized by comprising:
the preprocessing module is used for preprocessing an input image;
the characteristic information extraction module is used for carrying out structural optimization on a convolution kernel of the residual error network and extracting a characteristic information matrix of the input image;
the characteristic information processing module is used for processing a characteristic information matrix by utilizing a parallel double-attention light-weight residual error structure in a residual error network to obtain a one-dimensional characteristic information matrix comprising accurate characteristic information;
and the image classification module is used for inputting the one-dimensional characteristic information matrix into the full connection layer of the residual error network to obtain a matrix of the number of classes corresponding to the classification task and outputting an image classification result.
CN202111290845.5A 2021-11-02 2021-11-02 Image classification method and system based on parallel double-attention light-weight residual error network Active CN114067153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111290845.5A CN114067153B (en) 2021-11-02 2021-11-02 Image classification method and system based on parallel double-attention light-weight residual error network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111290845.5A CN114067153B (en) 2021-11-02 2021-11-02 Image classification method and system based on parallel double-attention light-weight residual error network

Publications (2)

Publication Number Publication Date
CN114067153A true CN114067153A (en) 2022-02-18
CN114067153B CN114067153B (en) 2022-07-12

Family

ID=80236549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111290845.5A Active CN114067153B (en) 2021-11-02 2021-11-02 Image classification method and system based on parallel double-attention light-weight residual error network

Country Status (1)

Country Link
CN (1) CN114067153B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949297A (en) * 2019-03-20 2019-06-28 天津工业大学 Pulmonary nodule detection method based on Reception and Faster R-CNN
CN110929602A (en) * 2019-11-09 2020-03-27 北京工业大学 Foundation cloud picture cloud shape identification method based on convolutional neural network
CN111191737A (en) * 2020-01-05 2020-05-22 天津大学 Fine-grained image classification method based on multi-scale repeated attention mechanism
CN111274869A (en) * 2020-01-07 2020-06-12 中国地质大学(武汉) Method for classifying hyperspectral images based on parallel attention mechanism residual error network
WO2021177628A1 (en) * 2020-03-04 2021-09-10 Samsung Electronics Co., Ltd. Method and apparatus for action recognition
CN111598939A (en) * 2020-05-22 2020-08-28 中原工学院 Human body circumference measuring method based on multi-vision system
CN111523521A (en) * 2020-06-18 2020-08-11 西安电子科技大学 Remote sensing image classification method for double-branch fusion multi-scale attention neural network
CN111985370A (en) * 2020-08-10 2020-11-24 华南农业大学 Crop pest and disease fine-grained identification method based on improved mixed attention module
CN111898709A (en) * 2020-09-30 2020-11-06 中国人民解放军国防科技大学 Image classification method and device
CN112101318A (en) * 2020-11-17 2020-12-18 深圳市优必选科技股份有限公司 Image processing method, device, equipment and medium based on neural network model
CN112733774A (en) * 2021-01-18 2021-04-30 大连海事大学 Light-weight ECG classification method based on combination of BiLSTM and serial-parallel multi-scale CNN
CN112990391A (en) * 2021-05-20 2021-06-18 四川大学 Feature fusion based defect classification and identification system of convolutional neural network
CN113343799A (en) * 2021-05-25 2021-09-03 山东师范大学 Method and system for realizing automatic classification of white blood cells based on mixed attention residual error network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FEI WANG et al.: "Residual Attention Network for Image Classification", arXiv *
QIAO Sibo et al.: "Convolutional neural network model for brain CT image classification based on a residual hybrid attention mechanism", Acta Electronica Sinica *
NING Shangming et al.: "Entity relation extraction from electronic medical records based on a multi-channel self-attention mechanism", Chinese Journal of Computers *
SONG Tainian et al.: "Improved dual-channel attention mechanism image classification method for lightweight networks", Aero Weaponry *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114722928A (en) * 2022-03-29 2022-07-08 河海大学 Blue-green algae image identification method based on deep learning
CN114722928B (en) * 2022-03-29 2024-04-16 河海大学 Blue algae image recognition method based on deep learning
CN114898463A (en) * 2022-05-09 2022-08-12 河海大学 Sitting posture identification method based on improved depth residual error network
CN114898463B (en) * 2022-05-09 2024-05-14 河海大学 Sitting posture identification method based on improved depth residual error network
CN114612694B (en) * 2022-05-11 2022-07-29 合肥高维数据技术有限公司 Picture invisible watermark detection method based on two-channel differential convolutional network
CN114612694A (en) * 2022-05-11 2022-06-10 合肥高维数据技术有限公司 Picture invisible watermark detection method based on two-channel differential convolutional network
CN115082928A (en) * 2022-06-21 2022-09-20 电子科技大学 Method for asymmetric double-branch real-time semantic segmentation of network for complex scene
CN115082928B (en) * 2022-06-21 2024-04-30 电子科技大学 Method for asymmetric double-branch real-time semantic segmentation network facing complex scene
CN115348215B (en) * 2022-07-25 2023-11-24 南京信息工程大学 Encryption network traffic classification method based on space-time attention mechanism
CN115348215A (en) * 2022-07-25 2022-11-15 南京信息工程大学 Encrypted network flow classification method based on space-time attention mechanism
CN115577242A (en) * 2022-10-14 2023-01-06 成都信息工程大学 Electroencephalogram signal classification method based on attention mechanism and neural network
CN115607170B (en) * 2022-11-18 2023-04-25 中国科学技术大学 Lightweight sleep staging method based on single-channel electroencephalogram signals and application
CN115607170A (en) * 2022-11-18 2023-01-17 中国科学技术大学 Lightweight sleep staging method based on single-channel electroencephalogram signal and application
CN116186593B (en) * 2023-03-10 2023-10-03 山东省人工智能研究院 Electrocardiosignal detection method based on separable convolution and attention mechanism
CN116186593A (en) * 2023-03-10 2023-05-30 山东省人工智能研究院 Electrocardiosignal detection method based on separable convolution and attention mechanism

Also Published As

Publication number Publication date
CN114067153B (en) 2022-07-12

Similar Documents

Publication Publication Date Title
CN114067153B (en) Image classification method and system based on a parallel dual-attention lightweight residual network
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
US10983754B2 (en) Accelerated quantized multiply-and-add operations
CN111639692B (en) Shadow detection method based on attention mechanism
WO2021018163A1 (en) Neural network search method and apparatus
Liu et al. FDDWNet: a lightweight convolutional neural network for real-time semantic segmentation
CN112446476A (en) Neural network model compression method, device, storage medium and chip
Li et al. Depth-wise asymmetric bottleneck with point-wise aggregation decoder for real-time semantic segmentation in urban scenes
US11216913B2 (en) Convolutional neural network processor, image processing method and electronic device
Chen et al. StereoEngine: An FPGA-based accelerator for real-time high-quality stereo estimation with binary neural network
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN110738241A (en) Binocular stereo vision matching method based on a neural network, and its computation framework
CN115081588A (en) Neural network parameter quantification method and device
CN116012722A (en) Remote sensing image scene classification method
CN113297959A (en) Target tracking method and system based on corner attention twin network
Xu et al. Faster BiSeNet: A faster bilateral segmentation network for real-time semantic segmentation
Ujiie et al. Approximated prediction strategy for reducing power consumption of convolutional neural network processor
US11948090B2 (en) Method and apparatus for video coding
Wang et al. Msfnet: multistage fusion network for infrared and visible image fusion
CN114049491A (en) Fingerprint segmentation model training method, fingerprint segmentation device, fingerprint segmentation equipment and fingerprint segmentation medium
Li et al. Holoparser: Holistic visual parsing for real-time semantic segmentation in autonomous driving
Gao et al. Multi-branch aware module with channel shuffle pixel-wise attention for lightweight image super-resolution
EP4075343A1 (en) Device and method for realizing data synchronization in neural network inference
Gong et al. Research on mobile traffic data augmentation methods based on SA-ACGAN-GN
Feng et al. Real-time object detection method based on YOLOv5 and efficient mobile network
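The headline technique of this patent, a parallel dual-attention lightweight residual block, can be illustrated with a minimal numeric sketch: a channel-attention branch and a spatial-attention branch are computed independently from the same input, their re-weighted feature maps are fused, and the result is added back to the input through a residual shortcut. Everything in the sketch below is an assumption for illustration only — the function name, the mean-based descriptors, the sigmoid gates, and the additive fusion are not taken from the patent, whose claimed block uses learned convolutional layers rather than these fixed operations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def parallel_dual_attention_block(x):
    """Illustrative parallel dual-attention residual block (assumed form).

    x: feature map of shape (C, H, W). Both attention branches read the
    same input (hence "parallel"), and a residual shortcut preserves it.
    """
    # Channel attention: global average pooling -> one gate per channel.
    channel_desc = x.mean(axis=(1, 2))              # shape (C,)
    channel_gate = sigmoid(channel_desc)
    channel_out = x * channel_gate[:, None, None]   # re-weight channels

    # Spatial attention: cross-channel mean -> one gate per position.
    spatial_desc = x.mean(axis=0)                   # shape (H, W)
    spatial_gate = sigmoid(spatial_desc)
    spatial_out = x * spatial_gate[None, :, :]      # re-weight positions

    # Fuse the two branches and close the residual shortcut.
    return x + 0.5 * (channel_out + spatial_out)
```

In a real lightweight network, the fixed means and sigmoids above would be replaced by small learned layers (e.g. pointwise convolutions), but the data flow — two attention branches in parallel over one shared input, fused and added to a residual path — is the pattern the title describes.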

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant