CN115601583A - Deep convolutional network target recognition method with a dual-channel attention mechanism - Google Patents

Deep convolutional network target recognition method with a dual-channel attention mechanism

Info

Publication number: CN115601583A
Application number: CN202211090432.7A
Authority: CN (China)
Prior art keywords: neural network, attention mechanism, target, channel, network
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 王俊杰, 赵立业, 黄程韦
Current and original assignee: Southeast University
Application filed by Southeast University; priority/filing date 2022-09-07; published as CN115601583A on 2023-01-13
Classifications

    • G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning: classification, e.g. of video objects
    • G06N3/084 — Computing arrangements based on biological models; neural networks; learning methods: backpropagation, e.g. using gradient descent
    • G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning: neural networks


Abstract

The invention discloses a deep convolutional network target recognition method with a dual-channel attention mechanism, comprising the following steps: constructing a convolutional neural network that takes a single sample pair as input and extracts high-dimensional feature maps; constructing spatial attention and channel attention mechanism modules that take the two high-dimensional feature maps extracted by the neural network as input, compute the correlation between feature pixels in the spatial dimension (and between feature channels in the channel dimension), and add the result to the original features element by element; stacking the outputs of the spatial and channel attention modules along the channel dimension to obtain the final feature representation of the model; constructing training sample pairs, where the number of same-class pairs is increased through data augmentation and different-class targets are paired directly; and computing a cross-entropy loss and learning the network parameters through stochastic gradient descent to obtain a neural network model able to distinguish target classes. The method improves the accuracy of visual target image recognition in single-sample scenarios and for target classes that did not participate in training.

Description

Deep convolutional network target recognition method with a dual-channel attention mechanism
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a deep convolutional network target recognition method based on a dual-channel attention mechanism.
Background
In the last decade, deep learning has achieved great success in the field of computer vision, and more and more researchers have begun to focus on the application of neural networks to object recognition.
Although neural network models achieve excellent results on most target recognition tasks, practical production environments still pose challenges such as a wide variety of target types, insufficient training samples, fine-grained intra-class variation, and growing numbers of classes. A neural network is a typical supervised learning algorithm that relies on a large-scale labeled training data set, and the cost of such data is not negligible, so enough images cannot be collected for every target class for training. In addition, when classes change frequently, a typical neural network classifier cannot effectively handle classes that did not participate in training. These are among the problems to be solved before the technology can be applied in practice.
Disclosure of Invention
To solve these problems, the invention discloses a deep convolutional network target recognition method with a dual-channel attention mechanism, which achieves automatic classification of visual targets even when each target class has only one training image sample.
To achieve this, the technical scheme of the invention is as follows:
A deep convolutional network target recognition method with a dual-channel attention mechanism comprises the following steps:
Step 1: construct a convolutional neural network that takes an image sample pair as input and extracts high-dimensional feature maps;
Step 2: construct a spatial attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature pixels in the spatial dimension, and adds the result to the original features element by element;
Step 3: construct a channel attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature channels in the channel dimension, and adds the result to the original features element by element;
Step 4: stack the outputs of the spatial attention mechanism module and the channel attention mechanism module along the channel dimension to obtain the final feature representation of the model;
Step 5: construct training sample pairs, where the number of same-class pairs is increased through data augmentation and different-class targets are paired directly;
Step 6: compute the cross-entropy loss and learn the network parameters through stochastic gradient descent to obtain a neural network model able to distinguish target classes.
Further, in the present invention, step 1 comprises the following steps:
Step 1-1: construct a convolutional neural network containing 17 convolutional layers. The head convolutional layer consists of 64 convolution kernels of size 7×7 with stride 2, so that the input image is downsampled by a factor of 2 and the number of feature-map channels is raised to 64; a max-pooling layer with a 3×3 window and stride 2 downsamples the feature map by a further factor of 2. Apart from the head convolutional layer, every 2 convolutional layers with 3×3 kernels are joined by a shortcut connection into a residual module, giving 8 residual modules in total; the first convolutional layer of each downsampling residual block has stride 2 and the remaining layers have stride 1, and the number of convolution kernels keeps increasing with network depth. The network finally yields a high-dimensional feature map whose spatial size is 1/32 of the input image and whose channel dimension is raised to 512. The network weights are obtained by random initialization and continuously updated through backpropagation during training;
Step 1-2: construct two identical paths of the convolutional neural network described in step 1-1; each path receives one image of the image sample pair as input and outputs a high-dimensional feature map, $F_1$ and $F_2$ respectively.
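For concreteness, the following is a minimal PyTorch sketch of such a two-path (Siamese) backbone. It assumes a ResNet-18-style layout — the stride pattern below is chosen so that the output is 1/32 of the input resolution with 512 channels, as stated above — and all module and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers joined by a shortcut connection (step 1-1)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection so the shortcut can be added when the shape changes
        self.short = (nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                                    nn.BatchNorm2d(out_ch))
                      if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.short(x))

class Backbone(nn.Module):
    """17 conv layers: one 7x7 head conv + 8 residual blocks (2 convs each)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),  # H/2
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1))                  # H/4
        chans   = [64, 64, 128, 128, 256, 256, 512, 512]
        strides = [1, 1, 2, 1, 2, 1, 2, 1]   # overall H/32, 512 channels
        blocks, in_ch = [], 64
        for c, s in zip(chans, strides):
            blocks.append(ResidualBlock(in_ch, c, s))
            in_ch = c
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(self.head(x))

# Siamese use (step 1-2): the same network processes both images of a pair
backbone = Backbone()
x1, x2 = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
F1, F2 = backbone(x1), backbone(x2)   # each: (1, 512, 7, 7)
```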
Further, in the present invention, step 2 comprises the following steps:
Step 2-1: feed the original high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the convolutional neural network in step 1, where H, W and C denote the height, width and number of channels of the feature map respectively, into three groups of 1×1 convolutional layers to obtain three new feature maps $F_a$, $F_b$, $F_c$, and flatten their width and height dimensions, i.e. $\{F_a, F_b, F_c\} \in \mathbb{R}^{C\times(H\times W)}$. Subsequently, multiply the transpose of $F_a$ with $F_b$ and apply a Softmax function to obtain the spatial attention matrix $M_s \in \mathbb{R}^{(H\times W)\times(H\times W)}$, specifically

$$M_s^{(i,j)} = \frac{\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}{\sum_{i=1}^{H\times W}\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}$$

where $M_s^{(i,j)}$ denotes the correlation between the feature pixels at the i-th and j-th positions, T denotes transposition, and $F_a$, $F_b$ are the feature maps output by the 1×1 convolutional layers.
Step 2-2: multiply $F_c$ by $M_s$ and add the result element by element to the original high-dimensional feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature $F_s$, specifically

$$F_s = \eta_s \sum_{j=1}^{H\times W}\big(M_s^{(i,j)} F_c^{(j)}\big) + F$$

where $\eta_s$ is a trainable scale factor initialized to 0 so that the attention term $\sum_j M_s^{(i,j)} F_c^{(j)}$ does not dominate early in training, and j is the subscript of the spatial position.
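For illustration, a minimal PyTorch sketch of this spatial attention module follows, continuing the imports of the backbone sketch above. It implements the DANet-style position attention that the equations describe; the softmax axis and the channel widths of the three 1×1 branches are assumptions, as the patent does not pin them down.

```python
class SpatialAttention(nn.Module):
    """Spatial attention (step 2): correlate feature pixels across positions."""
    def __init__(self, ch):
        super().__init__()
        self.to_a = nn.Conv2d(ch, ch, 1)   # the three 1x1 conv branches: Fa, Fb, Fc
        self.to_b = nn.Conv2d(ch, ch, 1)
        self.to_c = nn.Conv2d(ch, ch, 1)
        self.eta = nn.Parameter(torch.zeros(1))  # trainable scale, initialized to 0

    def forward(self, F):
        B, C, H, W = F.shape
        Fa = self.to_a(F).flatten(2)               # (B, C, HW)
        Fb = self.to_b(F).flatten(2)
        Fc = self.to_c(F).flatten(2)
        # attention over positions: softmax(Fa^T @ Fb), shape (B, HW, HW)
        Ms = torch.softmax(Fa.transpose(1, 2) @ Fb, dim=1)
        out = Fc @ Ms                              # weighted sum over positions
        return self.eta * out.view(B, C, H, W) + F # element-wise residual add
```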
Further, in the present invention, step 3 comprises the following steps:
Step 3-1: for the high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the neural network in step 1, where H, W and C denote the height, width and number of channels of the feature map respectively, flatten the width and height dimensions of the feature map and take the product of the result and its own transpose to obtain the channel attention matrix $M_t \in \mathbb{R}^{C\times C}$. Let i, j denote the i-th and j-th channels (the matrix is C×C, so its indices run over channels) and let T denote the transposition operation; specifically

$$M_t^{(i,j)} = \frac{\exp\!\big(F^{(i)} (F^{(j)})^T\big)}{\sum_{i=1}^{C}\exp\!\big(F^{(i)} (F^{(j)})^T\big)}$$

Step 3-2: multiply $M_t$ by $F$ and add the result element by element to the original feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature $F_t$, specifically

$$F_t = \eta_t \sum_{i=1}^{C}\big(M_t^{(i,j)} F^{(i)}\big) + F$$

where $\eta_t$ is a trainable scale factor initialized to 0 so that the attention term does not dominate early in training.
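A matching PyTorch sketch of the channel attention module follows, under the same assumptions as above; note that it has no 1×1 convolution branches, since the equations correlate the flattened feature map directly with its own transpose.

```python
class ChannelAttention(nn.Module):
    """Channel attention (step 3): correlate feature channels."""
    def __init__(self):
        super().__init__()
        self.eta = nn.Parameter(torch.zeros(1))  # trainable scale, initialized to 0

    def forward(self, F):
        B, C, H, W = F.shape
        Ff = F.flatten(2)                                    # (B, C, HW)
        Mt = torch.softmax(Ff @ Ff.transpose(1, 2), dim=1)   # (B, C, C)
        out = Mt @ Ff                                        # weighted sum over channels
        return self.eta * out.view(B, C, H, W) + F           # element-wise residual add
```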
Further, in the present invention, step 4 comprises the following steps:
Step 4-1: for the attention mechanism modules described in step 2 and step 3, stack the resulting $F_s$ and $F_t$ along the channel dimension to form a dual-attention module;
Step 4-2: pass the two high-dimensional feature maps $F_1, F_2 \in \mathbb{R}^{C\times H\times W}$ output by the neural network in step 1 through the dual-attention module respectively to obtain the improved feature representations $F'_1, F'_2 \in \mathbb{R}^{(2\times C)\times H\times W}$.
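Combining the two modules as steps 4-1 and 4-2 describe is then a single concatenation along the channel dimension; a brief sketch (names illustrative):

```python
class DualAttention(nn.Module):
    """Dual-attention (step 4): concatenate spatial and channel outputs."""
    def __init__(self, ch):
        super().__init__()
        self.spatial = SpatialAttention(ch)
        self.channel = ChannelAttention()

    def forward(self, F):
        # stack along the channel dimension: (B, C, H, W) -> (B, 2C, H, W)
        return torch.cat([self.spatial(F), self.channel(F)], dim=1)
```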
Further, in the present invention, step 5 comprises the following steps:
Step 5-1: for images of different target classes, directly form sample pairs for training the network;
Step 5-2: for images of the same target class, apply random scaling, rotation, affine transformation and adjustments of brightness, saturation and contrast, so that the two images in each same-class sample pair differ from each other, and form sample pairs such that the number of same-class sample pairs matches the number of different-class sample pairs. Assuming there are E classes of visual targets to be identified, the single-sample training set described in this patent contains E target images; after the sample pairs are constructed according to the method described in step 5, the total number of sample pairs $N_{pairs}$ is

$$N_{pairs} = 2\sum_{n=1}^{E-1} n = E(E-1)$$

where n is the summation index.
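A small Python sketch of the pair construction follows. The augmentation magnitudes are illustrative assumptions — the patent names the transformation types but not their parameter ranges — and `build_pairs` is a hypothetical helper name.

```python
import random
from itertools import combinations
from torchvision import transforms

# Augmentations for same-class pairs (step 5-2); parameter values are assumed.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.3, saturation=0.3, contrast=0.3),
])

def build_pairs(images):
    """images: one PIL image per class (the single-sample training set)."""
    pairs = []
    # different-class pairs: direct pairing, label 0 -> E(E-1)/2 pairs
    for i, j in combinations(range(len(images)), 2):
        pairs.append((images[i], images[j], 0))
    n_diff = len(pairs)
    # same-class pairs: two distinct augmentations of one image, label 1,
    # matched in number to the different-class pairs
    for _ in range(n_diff):
        img = random.choice(images)
        pairs.append((augment(img), augment(img), 1))
    random.shuffle(pairs)
    return pairs
```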
Further, in the present invention, step 6 comprises the following steps:
Step 6-1: pass the pair of feature maps $F'_1, F'_2 \in \mathbb{R}^{(2\times C)\times H\times W}$ output by the neural network after dual-attention processing through global average pooling to obtain a pair of feature vectors $f_1, f_2 \in \mathbb{R}^{2\times C}$. We compute the distance between $f_1$ and $f_2$ and map it into the range [0,1] via a Sigmoid function to obtain the final output $y_i$ of the neural network. Subsequently, the cross-entropy loss function is defined, specifically

$$loss = -\sum_{i}\big[t_i \log y_i + (1 - t_i)\log(1 - y_i)\big]$$

where i denotes the i-th pair of output features, $y_i$ is the output of the neural network, and $t_i \in \{0,1\}$ is the ground-truth label of the i-th sample pair (1 for a same-class pair, 0 for a different-class pair).
Step 6-2: using loss as the loss function and the sample pairs of step 5 as input, train the neural network of steps 1 to 4 with the adaptive moment estimation algorithm, which dynamically adjusts the learning rate of each parameter using the first and second moment estimates of the gradient. The weight decay of the adaptive moment estimation algorithm is set to 5e-5, 32 samples are input as a mini-batch, the learning rate is initialized to 4e-3 and decays to half its value every 40 iteration epochs, and 200 epochs are iterated to obtain the neural network model able to distinguish target classes.
The invention has the beneficial effects that:
under the condition that each type of target only has one training image, the neural network is trained by constructing a data enhancement sample pair to expand training data; the neural network structure enables the model to have the capability of training by using a small number of samples and identifying target classes which do not participate in training; the double-attention mechanism improves the distinction degree between intra-class compactness and inter-class compactness and improves the identification accuracy; the cross entropy loss function avoids the punishment degree imbalance caused by manually setting the margin in the training process, and the identification accuracy is improved.
Drawings
FIG. 1 is a schematic overall flow diagram of the process of the present invention;
FIG. 2 is a diagram of a convolutional network structure for image feature extraction in the present invention;
FIG. 3 is a schematic view of a spatial attention mechanism module of the present invention;
FIG. 4 is a schematic diagram of a channel attention mechanism module of the present invention;
FIG. 5 shows the results of ablation experiments with the method of the present invention on an experimental data set.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
As shown in fig. 1, the overall flow of the deep convolutional network target recognition method with a dual-channel attention mechanism provided by the present invention comprises the following steps:
Step 1: construct a convolutional neural network that takes an image sample pair as input and extracts high-dimensional feature maps.
Step 2: construct a spatial attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature pixels in the spatial dimension, and adds the result to the original features element by element.
Step 3: construct a channel attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature channels in the channel dimension, and adds the result to the original features element by element.
Step 4: stack the outputs of the spatial attention mechanism module and the channel attention mechanism module along the channel dimension to obtain the final feature representation of the model.
Step 5: construct training sample pairs, where the number of same-class pairs is increased through data augmentation and different-class targets are paired directly.
Step 6: compute the cross-entropy loss and learn the network parameters through stochastic gradient descent to obtain a neural network model able to distinguish target classes.
As shown in fig. 2, two identical convolutional neural network paths are constructed; each path receives one image of the image sample pair as input and outputs a high-dimensional feature map, $F_1$ and $F_2$ respectively. The constructed network contains 17 convolutional layers: the head convolutional layer consists of 64 convolution kernels of size 7×7 with stride 2, downsampling the input image by a factor of 2 and raising the number of feature-map channels to 64; a max-pooling layer with a 3×3 window and stride 2 downsamples the feature map by a further factor of 2; apart from the head convolutional layer, every 2 convolutional layers with 3×3 kernels are joined by a shortcut connection into a residual module, giving 8 residual modules in total, where the first convolutional layer of each downsampling residual block has stride 2 and the remaining layers have stride 1, and the number of convolution kernels keeps increasing with network depth. The network finally yields a high-dimensional feature map whose spatial size is 1/32 of the input image and whose channel dimension is raised to 512; the network weights are obtained by random initialization and continuously updated through backpropagation during training.
As shown in fig. 3, the high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the convolutional neural network, where H, W and C denote the height, width and number of channels respectively, is fed into three groups of 1×1 convolutional layers to obtain three new feature maps $F_a$, $F_b$, $F_c$, whose width and height dimensions are flattened, i.e. $\{F_a, F_b, F_c\} \in \mathbb{R}^{C\times(H\times W)}$. Subsequently, the transpose of $F_a$ is multiplied with $F_b$ and a Softmax function is applied to obtain the spatial attention matrix $M_s \in \mathbb{R}^{(H\times W)\times(H\times W)}$, specifically

$$M_s^{(i,j)} = \frac{\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}{\sum_{i=1}^{H\times W}\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}$$

where $M_s^{(i,j)}$ denotes the correlation between the feature pixels at the i-th and j-th positions.
$F_c$ is then multiplied by $M_s$ and the result is added element by element to the original feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature

$$F_s = \eta_s \sum_{j=1}^{H\times W}\big(M_s^{(i,j)} F_c^{(j)}\big) + F$$

where $\eta_s$ is a trainable scale factor initialized to 0 so that the attention term does not dominate early in training. $F_s$ aggregates features selectively according to the spatial attention matrix, so that strongly correlated features reinforce each other; this improves intra-class compactness and semantic consistency and lets the network better distinguish target images of different classes.
As shown in figure 4, a channel attention mechanism module is constructed. The width and height dimensions of the high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the neural network, where H, W and C denote the height, width and number of channels respectively, are flattened, and the result is multiplied by its own transpose to obtain the channel attention matrix $M_t \in \mathbb{R}^{C\times C}$, specifically

$$M_t^{(i,j)} = \frac{\exp\!\big(F^{(i)} (F^{(j)})^T\big)}{\sum_{i=1}^{C}\exp\!\big(F^{(i)} (F^{(j)})^T\big)}$$

$M_t$ is then multiplied by $F$ and the result is added element by element to the original feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature

$$F_t = \eta_t \sum_{i=1}^{C}\big(M_t^{(i,j)} F^{(i)}\big) + F$$

where $\eta_t$ is a trainable scale factor initialized to 0 so that the attention term does not dominate early in training. The feature $F_t$ produced at each location is thus a weighted sum of the features of all channels added to the original features; it models the dependencies between feature-map channels, enhances inter-class distinguishability and feature discriminability, and lets the network highlight feature representations of fine-grained variations in the target image.
As shown in fig. 5, to verify the beneficial effect of the proposed deep convolutional network target recognition method with a dual-channel attention mechanism, the following experiment was performed:
Ablation experiments were carried out on a target recognition data set, performing single-sample visual target recognition with four configurations: a single-path convolutional network; a convolutional neural network with sample-pair construction; a convolutional neural network with sample-pair construction and the dual-attention mechanism; and a convolutional neural network with sample-pair construction, the dual-attention mechanism and the proposed loss. Top-1 accuracy, Top-5 accuracy and the average F1 score were selected as evaluation indices, where Top-1 accuracy is the proportion of samples whose top-ranked predicted class matches the ground truth, Top-5 accuracy is the proportion of samples whose five top-ranked predicted classes contain the ground truth, and the F1 score is defined as

$$F_1 = \frac{1}{N}\sum_{c=1}^{N}\frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$$

where, for class c, $TP_c$, $FP_c$ and $FN_c$ are respectively the numbers of positive samples judged positive (true positives), negative samples judged positive (false positives) and positive samples judged negative (false negatives), and N is the number of classes.
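For reference, a short sketch of the macro-averaged F1 score defined above, assuming NumPy arrays of integer class labels:

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Macro-averaged F1 over N classes, per the definition above."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```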
It can be observed that the full configuration of the present invention (convolutional neural network + sample-pair construction + dual-attention mechanism + loss) obtains the best recognition results on both data sets; the convolutional neural network structure together with the large-scale training sample pairs it requires plays the decisive role in the performance improvement, while the dual-attention mechanism and the loss each improve the results to a certain extent.
It should be noted that the above-mentioned contents only illustrate the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and it will be apparent to those skilled in the art that several modifications and embellishments can be made without departing from the principle of the present invention, and these modifications and embellishments fall within the protection scope of the claims of the present invention.

Claims (7)

1. A deep convolutional network target recognition method with a dual-channel attention mechanism, characterized by comprising the following steps:
Step 1: construct a convolutional neural network that takes an image sample pair as input and extracts high-dimensional feature maps;
Step 2: construct a spatial attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature pixels in the spatial dimension, and adds the result to the original features element by element;
Step 3: construct a channel attention mechanism module that takes the two high-dimensional feature maps extracted by the neural network as input, computes the correlation between feature channels in the channel dimension, and adds the result to the original features element by element;
Step 4: stack the outputs of the spatial attention mechanism module and the channel attention mechanism module along the channel dimension to obtain the final feature representation of the model;
Step 5: construct training sample pairs, where the number of same-class pairs is increased through data augmentation and different-class targets are paired directly;
Step 6: compute the cross-entropy loss and learn the network parameters through stochastic gradient descent to obtain a neural network model able to distinguish target classes.
2. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 1, characterized in that step 1 specifically comprises the steps of:
Step 1-1: construct a convolutional neural network containing 17 convolutional layers, in which the head convolutional layer consists of 64 convolution kernels of size 7×7 with stride 2, so that the input image is downsampled by a factor of 2 and the number of feature-map channels is raised to 64; a max-pooling layer with a 3×3 window and stride 2 downsamples the feature map by a further factor of 2; apart from the head convolutional layer, every 2 convolutional layers with 3×3 kernels are joined by a shortcut connection into a residual module, giving 8 residual modules in total, where the first convolutional layer of each downsampling residual block has stride 2 and the remaining layers have stride 1, and the number of convolution kernels keeps increasing with network depth; the network finally yields a high-dimensional feature map whose spatial size is 1/32 of the input image and whose channel dimension is raised to 512, with the network weights obtained by random initialization and continuously updated through backpropagation during training;
Step 1-2: construct two identical paths of the convolutional neural network described in step 1-1, each path receiving one image of the image sample pair as input and outputting a high-dimensional feature map, $F_1$ and $F_2$ respectively.
3. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 2, characterized in that step 2 specifically comprises the steps of:
Step 2-1: feed the original high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the convolutional neural network in step 1, where H, W and C denote the height, width and number of channels of the feature map, into three groups of 1×1 convolutional layers to obtain three new feature maps $F_a$, $F_b$, $F_c$, and flatten their width and height dimensions, i.e. $\{F_a, F_b, F_c\} \in \mathbb{R}^{C\times(H\times W)}$; subsequently, multiply the transpose of $F_a$ with $F_b$ and apply a Softmax function to obtain the spatial attention matrix $M_s \in \mathbb{R}^{(H\times W)\times(H\times W)}$, specifically

$$M_s^{(i,j)} = \frac{\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}{\sum_{i=1}^{H\times W}\exp\!\big((F_a^{(i)})^T F_b^{(j)}\big)}$$

where $M_s^{(i,j)}$ denotes the correlation between the feature pixels at the i-th and j-th positions, T denotes transposition, and $F_a$, $F_b$ are the feature maps output by the convolutional layers;
Step 2-2: multiply $F_c$ by $M_s$ and add the result element by element to the original high-dimensional feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature $F_s$, specifically

$$F_s = \eta_s \sum_{j=1}^{H\times W}\big(M_s^{(i,j)} F_c^{(j)}\big) + F$$

where $\eta_s$ is a trainable scale factor initialized to 0 so that the attention term does not dominate early in training, and j is the subscript of the spatial position.
4. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 3, characterized in that step 3 specifically comprises the steps of:
Step 3-1: for the high-dimensional feature map $F \in \mathbb{R}^{C\times H\times W}$ extracted by the neural network in step 1, where H, W and C denote the height, width and number of channels of the feature map respectively, flatten the width and height dimensions of the feature map and multiply the result by its own transpose to obtain the channel attention matrix $M_t \in \mathbb{R}^{C\times C}$; letting i, j denote the i-th and j-th channels and T the transposition operation, specifically

$$M_t^{(i,j)} = \frac{\exp\!\big(F^{(i)} (F^{(j)})^T\big)}{\sum_{i=1}^{C}\exp\!\big(F^{(i)} (F^{(j)})^T\big)}$$

Step 3-2: multiply $M_t$ by $F$ and add the result element by element to the original feature $F \in \mathbb{R}^{C\times H\times W}$ to obtain the output feature $F_t$, specifically

$$F_t = \eta_t \sum_{i=1}^{C}\big(M_t^{(i,j)} F^{(i)}\big) + F$$

where $\eta_t$ is a trainable scale factor initialized to 0 so that the attention term does not dominate early in training.
5. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 4, characterized in that step 4 further comprises the steps of:
Step 4-1: for the attention mechanism modules described in step 2 and step 3, stack the respective outputs $F_s$ and $F_t$ along the channel dimension to form a dual-attention module;
Step 4-2: pass the two high-dimensional feature maps $F_1, F_2 \in \mathbb{R}^{C\times H\times W}$ output by the neural network in step 1 through the dual-attention module respectively to obtain the improved feature representations $F'_1, F'_2 \in \mathbb{R}^{(2\times C)\times H\times W}$.
6. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 5, characterized in that step 5 further comprises the steps of:
Step 5-1: for images of different target classes, directly form sample pairs for training the network;
Step 5-2: for images of the same target class, apply random scaling, rotation, affine transformation and adjustments of brightness, saturation and contrast, so that the two images in each same-class sample pair differ from each other, and form sample pairs such that the number of same-class sample pairs matches the number of different-class sample pairs; assuming there are E classes of visual targets to be identified, the single-sample training set described in this patent contains E target images, and after the sample pairs are constructed according to the method described in step 5, the total number of sample pairs $N_{pairs}$ is

$$N_{pairs} = 2\sum_{n=1}^{E-1} n = E(E-1)$$

where n is the summation index.
7. The deep convolutional network target recognition method with a dual-channel attention mechanism according to claim 6, characterized in that step 6 further comprises the steps of:
Step 6-1: pass the pair of feature maps $F'_1, F'_2 \in \mathbb{R}^{(2\times C)\times H\times W}$ output by the neural network after dual-attention processing through global average pooling to obtain a pair of feature vectors $f_1, f_2 \in \mathbb{R}^{2\times C}$, compute the distance between $f_1$ and $f_2$ and map it into the range [0,1] via a Sigmoid function to obtain the final output $y_i$ of the neural network; subsequently, define the cross-entropy loss function, specifically

$$loss = -\sum_{i}\big[t_i \log y_i + (1 - t_i)\log(1 - y_i)\big]$$

where i denotes the i-th pair of output features, $y_i$ is the output of the neural network, and $t_i$ is the ground-truth label of the i-th sample pair;
Step 6-2: using loss as the loss function and the sample pairs of step 5 as input, train the neural network of steps 1 to 4 with the adaptive moment estimation algorithm, which dynamically adjusts the learning rate of each parameter using the first and second moment estimates of the gradient; the weight decay of the adaptive moment estimation algorithm is set to 5e-5, 32 samples are input as a mini-batch, the learning rate is initialized to 4e-3 and decays to half its value every 40 iteration epochs, and 200 epochs are iterated to obtain the neural network model able to distinguish target classes.


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116106856A (en) * 2023-04-13 2023-05-12 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Identification model establishment method and identification method for thunderstorm strong wind and computing equipment
CN116106856B (en) * 2023-04-13 2023-08-18 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Identification model establishment method and identification method for thunderstorm strong wind and computing equipment
CN116416479A (en) * 2023-06-06 2023-07-11 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features
CN116416479B (en) * 2023-06-06 2023-08-29 江西理工大学南昌校区 Mineral classification method based on deep convolution fusion of multi-scale image features
CN116579616A (en) * 2023-07-10 2023-08-11 武汉纺织大学 Risk identification method based on deep learning
CN116579616B (en) * 2023-07-10 2023-09-29 武汉纺织大学 Risk identification method based on deep learning


Legal Events

Code: PB01 — Publication
Code: SE01 — Entry into force of request for substantive examination