CN113065586B - Non-local image classification device, method and storage medium - Google Patents

Non-local image classification device, method and storage medium

Info

Publication number
CN113065586B
CN113065586B
Authority
CN
China
Prior art keywords
module
vector
local
attention
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110308766.6A
Other languages
Chinese (zh)
Other versions
CN113065586A (en)
Inventor
卢丽
孙亚楠
韩强
闫超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co ltd
Original Assignee
Sichuan Yifei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co ltd filed Critical Sichuan Yifei Technology Co ltd
Priority to CN202110308766.6A priority Critical patent/CN113065586B/en
Publication of CN113065586A publication Critical patent/CN113065586A/en
Application granted granted Critical
Publication of CN113065586B publication Critical patent/CN113065586B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a non-local image classification device, method and storage medium, in which a convolutional network is composed of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back; the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back. The coordinate stitching module adds the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module outputs a query vector, a key vector and a value vector, which are input to the non-local attention module and processed to obtain an attention output tensor; the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module. Through the non-local module, every feature point on the output feature map can acquire information from the entire global region, so the accuracy is significantly improved and network performance is effectively enhanced.

Description

Non-local image classification device, method and storage medium
Technical Field
The invention belongs to the technical field of image classification in computer vision, and particularly relates to a non-local image classification device, method and storage medium.
Background
At present, neural network technology in computer vision is widely applied in fields such as image classification, object detection, image segmentation, face recognition and behavior recognition. Among these, image classification is the most fundamental technique: networks used in the other fields mostly take an image classification network as their backbone and add further functional modules on top. A high-performance image classification network is therefore very important for machine vision based on neural network technology.
Image classification networks are typically built on convolution operations. Convolution is essentially a local operation: the receptive field of a feature point on the feature map output by a convolution is local, i.e. it perceives only the feature information of a region in the previous layer the size of the convolution kernel. Convolution kernels in common networks are generally small, typically 1×1, 3×3 or 5×5. Although convolutional networks can enlarge the theoretical receptive field by stacking convolutions, many studies have found that even when the theoretical receptive field of a deep convolutional layer is large, the effective receptive field remains far smaller than the theoretical value, so a convolutional network is still, to a large extent, a local network. This limits the accuracy a convolutional network can reach.
Methods based on global information, such as the Vision Transformer, abandon convolution entirely and require large amounts of training data to perform well. It is therefore desirable to find a method that retains the efficiency of convolutional image feature extraction while mitigating its locality limitation.
Disclosure of Invention
An object of the present invention is to provide a non-local image classification device, method and storage medium that solve the above problems. Through the non-local module, the neural network can obtain global information, overcoming the defect that convolution can only capture local information, and thereby improving network accuracy.
The invention is mainly realized by the following technical scheme:
a non-local image classification device comprises a data acquisition module, a training module and a classification module, wherein the data acquisition module is used for collecting data and forming a training sample; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; the classification module is used for inputting the image to be detected into the optimal image classification model and inputting a classification result; the image classification network consists of a root module, a plurality of residual modules, a non-local module and a head module which are sequentially connected from front to back, wherein the non-local module consists of a coordinate splicing module, a key value generation module, a non-local attention module and an attention fusion module which are sequentially connected from front to back; the root module is used for converting the pixel information of the input image and outputting a feature map consisting of feature information; the residual error modules are used for gradually extracting semantic information of higher layers in the feature map and outputting the feature map to the non-local area module; the head module is used for converting the feature map containing the semantics into an image classification result; the coordinate splicing module is used for adding absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating a query vector, a key vector and a value vector, inputting the query vector, the key vector and the value vector to the non-local attention module, calculating the correlation between each feature point on the feature map and all the feature points and generating an attention map; the attention fusion module is used for feeding back information obtained by the attention map to the feature map, and the input of the attention fusion module is the input of the non-local area module and the output of the non-local area attention module.
The root module is used for converting the pixel information of the input image into coarse feature information and outputting a feature map formed from this information. The residual module applies several convolution operations to progressively extract finer feature information with richer semantics and outputs a feature map formed from this information. Stacking several residual modules gradually extracts higher-level semantic information.
A non-local module is added after some of the residual modules. The residual module is mainly composed of convolution layers. Convolution is characterized by the fact that the output at a feature point is influenced only by other feature points within the convolution-kernel-sized region, i.e. it is local. Image information naturally has a certain locality: a given pixel generally forms a shape or object together with the pixels around it, so convolution is efficient at extracting features. However, images also contain non-local information; for example, determining the category of an object may require using a shape on the left side of the image and a shape on the right side at the same time.
The head module is used for converting the feature map containing the semantic information into an image classification result.
The specific working principle of the non-local module is as follows:
1) The coordinate stitching module adds coordinate information into the feature information, strengthening the spatial information of the features and improving the precision of the subsequent attention maps.
2) The key value generation module and the non-local attention module compute the correlation between each feature point and all feature points on the feature map and generate the attention map; the higher the correlation, the larger the value on the attention map. The biggest difference from the convolution operation is that the correlation is computed over all feature points and is no longer limited to the region covered by the convolution kernel, and is therefore non-local.
3) The attention fusion module feeds the information obtained from the attention map back into the feature map, remedying the lack of non-local information in the feature map produced by the residual modules.
In order to better implement the present invention, further, the root module is obtained by connecting a convolution layer, a batch normalization layer and an activation layer in sequence from front to back and encapsulating them. A residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output. The head module is obtained by connecting a global average pooling layer, a fully connected layer and an activation layer in sequence from front to back and encapsulating them. Connecting the root module, several residual modules, the non-local modules and the head module in sequence yields the non-local convolutional network. The positions and number of the non-local modules can be adjusted according to the application's requirements.
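For concreteness, a minimal PyTorch sketch of these three conventional building blocks follows. The patent fixes only the layer ordering, so the kernel sizes, strides and channel counts below are assumptions:

```python
import torch
import torch.nn as nn

class RootModule(nn.Module):
    """Convolution -> batch normalization -> activation; kernel size,
    stride and channel counts are illustrative assumptions."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class ResidualModule(nn.Module):
    """Main branch: (conv -> BN -> activation) repeated; bypass branch:
    conv + BN when downsampling, identity otherwise."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        if downsample or in_ch != out_ch:
            self.bypass = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.bypass = nn.Identity()  # input passed through unchanged

    def forward(self, x):
        return self.main(x) + self.bypass(x)

class HeadModule(nn.Module):
    """Global average pooling -> fully connected layer -> activation."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.pool(x).flatten(1)               # [B, C, 1, 1] -> [B, C]
        return torch.softmax(self.fc(x), dim=1)   # class probabilities
```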
In order to better implement the present invention, further, the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
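A minimal PyTorch sketch of the coordinate stitching module under this expression, assuming the [-1, 1] coordinate normalization given in the detailed embodiments below:

```python
import torch
import torch.nn as nn

class CoordConcat(nn.Module):
    """Appends a 2-channel absolute-coordinate map to the feature map along
    the channel dimension: [b, c, h, w] -> [b, c + 2, h, w]."""
    def forward(self, x):
        b, _, h, w = x.shape
        # Row/column coordinates normalized to [-1, 1), matching
        # coord_map[:,0,i,:] = -1 + (i/h)*2 and coord_map[:,1,:,j] = -1 + (j/w)*2.
        ys = -1 + 2 * torch.arange(h, device=x.device, dtype=x.dtype) / h
        xs = -1 + 2 * torch.arange(w, device=x.device, dtype=x.dtype) / w
        coord_map = torch.stack([
            ys.view(h, 1).expand(h, w),  # channel 0: row coordinate
            xs.view(1, w).expand(h, w),  # channel 1: column coordinate
        ]).unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([x, coord_map], dim=1)  # dim=1 is the channel dimension
```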
In order to better implement the present invention, further, the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation. The key vector and the query vector are both L2-regularized, so that the attention value depends on the angle between the key vector and the query vector, that is, on whether the two vectors point in similar directions in the high-dimensional space, rather than on the magnitudes of the vectors. The formulas for the query, key and value vectors are as follows:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k, conv_v are the convolution functions for the query vector, key vector and value vector respectively.
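The following is a sketch of the key value generation module under these formulas, with PyTorch's F.normalize standing in for l2_norm; the out_channels argument is an assumption added so that the value vector's channel count can later match the feature map summed with it in the fusion step:

```python
import torch.nn as nn
import torch.nn.functional as F

class KeyValueGeneration(nn.Module):
    """1x1 convolutions followed by reshaping to [B, C, H*W]; the query and
    key vectors are additionally L2-normalized along the channel dimension."""
    def __init__(self, in_channels, out_channels=None):
        super().__init__()
        out_channels = out_channels or in_channels  # patent: output == input channels
        self.conv_q = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv_k = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv_v = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        q = F.normalize(self.conv_q(x).flatten(2), dim=1)  # l2_norm over C
        k = F.normalize(self.conv_k(x).flatten(2), dim=1)
        v = self.conv_v(x).flatten(2)                      # no L2 on the value vector
        return q, k, v
```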
As noted above, the non-local module first uses the coordinate stitching module to explicitly add the absolute coordinate information of each feature point on the feature map into the feature vector.
In order to better implement the present invention, further, the non-local attention module first generates an attention tensor from the query vector and the key vector, calculated as follows:
Attn = softmax(exp(s * Q^T × K))
wherein:
softmax() is the softmax function,
Q^T is the transpose of the query vector,
K is the key vector,
exp() is the exponential function with base e,
s is a constant parameter whose optimal value can be found by experiment,
Attn is the attention tensor;
the non-local attention module then matrix-multiplies the transpose of the attention tensor with the value vector and takes the transpose of the result to obtain the attention output tensor, calculated as follows:
Attn_out = (Attn^T × V)^T
wherein:
Attn^T is the transpose of the attention tensor,
V is the value vector,
^T denotes the transpose operation,
Attn_out is the output of the non-local attention module.
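A sketch of the non-local attention module implementing the two formulas above with batched matrix products; the softmax axis and the transpose order follow a shape-consistent reading of the formulas, which the patent leaves implicit:

```python
import torch

def non_local_attention(q, k, v, s=3.0):
    """q, k, v: [B, C, N] with N = H*W; q and k are L2-normalized, so
    Q^T x K holds cosine similarities. Returns Attn_out of shape [B, C, N]."""
    # Attn = softmax(exp(s * Q^T x K)), shape [B, N, N]
    attn = torch.softmax(torch.exp(s * torch.bmm(q.transpose(1, 2), k)), dim=-1)
    # Attn_out read as V x Attn^T: each output position is a weighted sum of
    # the value vectors at all positions (the "non-local" step).
    return torch.bmm(v, attn.transpose(1, 2))  # [B, C, N] x [B, N, N] -> [B, C, N]
```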
In order to better implement the invention, further, the constant parameter satisfies 2 ≤ s ≤ 5.
In order to better implement the present invention, further, the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_out is the output of the non-local attention module;
out is the output of the attention fusion module;
input is the input of the whole non-local module;
Attn_reshape is the reshaped output of the non-local attention module;
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
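The fusion step then reduces to a reshape followed by an element-wise residual sum, as in this sketch:

```python
def attention_fusion(module_input, attn_out):
    """module_input: [B, C, H, W], the input of the whole non-local module;
    attn_out: [B, C, H*W], the output of the non-local attention module."""
    b, c, h, w = module_input.shape
    attn_reshape = attn_out.reshape(b, c, h, w)  # [B, C, H*W] -> [B, C, H, W]
    return module_input + attn_reshape           # element-wise sum
```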
In the self-attention mechanism adopted by the invention, the key vectors and query vectors are L2-regularized. This is equivalent to fixing the modulus of both vectors to 1, so that when the attention tensor is subsequently computed, the attention value depends on the angle between the key vector and the query vector, i.e. on whether the two vectors point in similar directions in the high-dimensional space. In a conventional self-attention mechanism without L2 regularization, the attention value depends not only on this angle but also on the product of the two vectors' moduli. On a feature map in a convolutional neural network, feature points in background regions have smaller moduli, while feature points where objects appear have larger moduli; the attention tensor should focus on the degree of correlation between feature points at different locations rather than on the magnitudes of the feature points themselves.
To better implement the present invention, further, the number of non-local modules is 2, located before the head module and before the last downsampling residual module, respectively.
The invention is mainly realized by the following technical scheme:
a non-local image classification method is carried out by adopting the image classification device, and comprises the following steps:
step S1: collecting image data and marking to form training data,
step S2: respectively packaging to obtain a root module, a residual error module, a non-local module and a head module, and sequentially connecting the root module, the residual error modules, the non-local module and the head module from front to back to obtain an image classification network; then, inputting training samples in the training data into an image classification network for training to obtain a trained optimal image classification model;
and step S3: and inputting the image to be detected into the optimal image classification model and inputting a classification result.
A computer readable storage medium storing computer program instructions which, when executed by a processor, implement a non-local image classification method.
The invention has the beneficial effects that:
(1) Through the non-local module, every feature point on the output feature map can acquire information from the entire global region, remedying the locality of the convolution operation in ordinary convolutional networks; the accuracy is significantly improved and network performance is effectively enhanced. The invention has a novel structure, is simple to implement, and gives a clear accuracy gain;
(2) In the non-local module, the attention tensor is computed with an attention mechanism, so that when integrating global information the network focuses more on relevant information and reduces the influence of irrelevant information;
(3) The invention adopts a self-attention mechanism and explicitly introduces absolute coordinate information. Compared with conventional non-local methods in convolutional neural networks, the absolute coordinate information clearly improves network performance;
(4) In the self-attention mechanism adopted by the invention, the key vectors and query vectors are L2-regularized, which is equivalent to fixing their moduli to 1; when the attention tensor is subsequently computed, the attention value depends on the angle between the key vector and the query vector, i.e. on whether the two vectors point in similar directions in the high-dimensional space. Compared with the prior art, the attention tensor attends more to the correlation between feature points at different positions rather than to the magnitudes of the feature points, and the accuracy is significantly improved;
(5) The non-local module can be conveniently inserted into a common convolutional neural network, such as a residual network, achieving plug-and-play and effectively improving network accuracy.
Drawings
FIG. 1 is a schematic view of the overall structure of the present invention;
FIG. 2 is a schematic structural view of a root module according to the present invention;
FIG. 3 is a schematic structural diagram of a head module according to the present invention;
FIG. 4 is a schematic structural diagram of a non-local area module according to the present invention;
FIG. 5 is a functional block diagram of a coordinate stitching module of the present invention;
FIG. 6 is a functional block diagram of a key value generation module of the present invention;
FIG. 7 is a functional block diagram of a non-local attention module of the present invention;
FIG. 8 is a functional block diagram of the attention fusion module of the present invention;
FIG. 9 is a functional block diagram of a residual module of the present invention without downsampling;
fig. 10 is a functional block diagram of a residual block for downsampling according to the present invention.
Detailed Description
Example 1:
A non-local image classification device comprises a data acquisition module, a training module and a classification module. The data acquisition module is used for collecting data and forming training samples; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; and the classification module is used for inputting the image to be classified into the optimal image classification model and outputting the classification result.
As shown in fig. 1, the image classification network is composed of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back; the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back.
The coordinate stitching module is used for adding the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating a query vector, a key vector and a value vector, inputting them to the non-local attention module, and processing them to obtain an attention output tensor; the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module.
Further, as shown in fig. 2, the root module is obtained by sequentially connecting a convolution layer, a batch normalization layer and an activation layer from front to back.
As shown in fig. 9 and 10, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
As shown in fig. 3, the head module is obtained by sequentially connecting a global average pooling layer, a fully connected layer and an activation layer from front to back.
The non-local module provided by this embodiment can effectively introduce global information, remedying the defect that the convolutional neural network focuses excessively on local information, thereby improving network performance.
Example 2:
This embodiment is optimized on the basis of embodiment 1; the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is optimized on the basis of embodiment 1 or 2. As shown in fig. 6, the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation. The key vector and the query vector are both L2-regularized, so that the attention value depends on the angle between the key vector and the query vector, that is, on whether the two vectors point in similar directions in the high-dimensional space, rather than on the magnitudes of the vectors. The corresponding formulas are:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k and conv_v are convolution operations with kernel size 1 whose output channels equal their input channels, for the query vector, key vector and value vector respectively.
Further, as shown in fig. 7, the non-local attention module first generates an attention tensor by using the query vector and the key vector, and the calculation formula is as follows:
Attn=softmax(exp(s*Q T ×K)
wherein:
softmax () is a softmax function,
Q T is a transposed vector of the query vector,
k is a key vector and is a key vector,
exp () is an exponential operation based on a natural number e,
s is a constant parameter, the optimum value can be obtained by experiment,
attn is the attention tensor;
then, the non-local attention module performs matrix multiplication on the attention tensor and the transpose of the value vector, and then obtains the attention output tensor by taking the transpose, wherein a calculation formula is as follows:
Attn_out=(Attn T × V) T
wherein:
Attn T the transposed tensor of the attention tensor,
v is a vector of values, and V is a vector of values,
t denotes a transpose operation and,
attn _ out is the output of the non-local attention module.
Further, the constant parameter s satisfies 2 ≤ s ≤ 5.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is optimized on the basis of any one of embodiments 1 to 3. As shown in fig. 8, the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_reshape is the reshaped output of the non-local attention module,
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
Other parts of this embodiment are the same as any of embodiments 1-3, and therefore are not described again.
Example 5:
a non-local image classification method is carried out by adopting the image classification device, and comprises the following steps:
step S1: collecting image data and marking to form training data,
step S2: respectively packaging to obtain a root module, a residual error module, a non-local module and a head module, and sequentially connecting the root module, the residual error modules, the non-local module and the head module from front to back to obtain an image classification network; then, inputting training samples in the training data into an image classification network for training to obtain a trained optimal image classification model;
and step S3: and inputting the image to be detected into the optimal image classification model and inputting a classification result.
The non-local module provided by the embodiment can effectively introduce the information of the whole local area, and the defect that the convolutional neural network excessively focuses on the local information is overcome, so that the purpose of improving the network performance is achieved.
Example 6:
As shown in fig. 1, the image classification network is a convolutional network composed of a root module, several residual modules, a non-local module and a head module connected in sequence from front to back. Compared with a traditional convolutional neural network, the non-local module provided by this embodiment can effectively introduce global information, remedying the defect that the convolutional neural network focuses excessively on local information, thereby improving network performance.
Further, as shown in fig. 2, the convolution layer, the batch normalization layer and the activation layer are connected in sequence from front to back and encapsulated as the root module.
Further, as shown in fig. 9, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back. As shown in fig. 10, when the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
As shown in fig. 4, the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back.
As shown in fig. 3, the global average pooling layer, the fully connected layer and the activation layer are connected in sequence from front to back and encapsulated as the head module.
And sequentially connecting the root module, the residual modules, the non-local area module and the head module to obtain a non-local area convolution network. The connection sequence and the number of the non-local modules can be adjusted according to the application requirements.
Further, as shown in fig. 5, the non-local module first uses the coordinate stitching module to explicitly add the absolute coordinate information of each feature point on the feature map into the feature vector. The expression is as follows:
X'=concat([X,coord_map],dim=channel)
where X is the input feature map, X' is the output feature map, concat denotes the concatenation operation, and dim=channel indicates that concatenation is along the feature channel dimension. coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w], with its values determined by the following equations:
coord_map[:,0,i,:]=-1+(i/h)*2
coord_map[:,1,:,j]=-1+(j/w)*2
further, as shown in fig. 6, the key-value generation module in the non-local area module needs to generate a key feature map, a value feature map and a query feature map. The three feature maps are generated by the convolution operation that the size of a convolution kernel is 1, an output channel is equal to an input channel, and the deformation operation is added. The corresponding formula is:
Q=l2_norm(reshape(conv q (X)))
K=l2_norm(reshape(conv k (X)))
V=reshape(conv v (X))
wherein, X is the input of the key value generation module, convq, convk and convv are respectively the convolution operation functions of which the sizes of three convolution kernels are 1 and the output channel is equal to the input channel. reshape is a deformation, and the dimension of the vector is changed from [ B, C, H, W ] to [ B, C, hxW ], wherein B is the batch number, C is the channel number, H is the height of the feature map, and W is the width of the feature map. L2_ norm () is an L2 regularization function with the regularized channel in dimension 1, i.e., the C dimension. Q, K, V are the query vector, key vector and value vector, respectively. These three vectors are the outputs of the key-value generation module.
Further, as shown in fig. 7, the non-local attention module in the non-local module generates an attention tensor from the query vector and key vector output by the key value generation module. The attention tensor is given by the following equation:
Attn = softmax(exp(s * Q^T × K))
where softmax() is the softmax function, Q^T is the transpose of the query vector, K is the key vector, exp() is the exponential function with base e, s is a constant parameter whose optimal value can be found by experiment, and Attn is the attention tensor.
Further, the constant parameter s takes any value between 2 and 5.
Then the transpose of the attention tensor is matrix-multiplied with the value vector, and the transpose of the result gives the output of the attention module; the whole process is given by the following formula:
Attn_out = (Attn^T × V)^T
where Attn^T is the transpose of the attention tensor, V is the value vector, ^T denotes the transpose operation, and Attn_out is the output of the non-local attention module.
Further, the attention fusion module in the non-local module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module. The whole process is given by the following formulas:
Attn_reshape=reshape(Attn_out)
out=input+Attn_reshape
the reshape () is a morphing operation, morphing a tensor with dimensions [ B, C, hxW ] into a tensor with dimensions [ B, C, H, W ], where B is the number of batches, C is the number of channels, H is the height of the feature map, and W is the width of the feature map.
Further, the number of non-local modules in the whole network is 2, and the non-local modules are respectively positioned before the head module and before the last downsampled residual module.
Example 7:
a construction method of a non-local image classification network comprises the following steps:
(1) As shown in fig. 2, the convolution layer, the batch normalization layer and the activation layer are connected in sequence from front to back and encapsulated as the root module.
(2) As shown in fig. 9 and fig. 10, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
(3) As shown in fig. 5, the operation of explicitly adding the absolute coordinate information of each feature point on the feature map into the feature map is encapsulated as the coordinate stitching module.
(4) As shown in fig. 6, the structure is divided into 3 branches, each connecting vector calculation, reshaping and L2 regularization modules in sequence to obtain the value vector, key vector and query vector; the whole structure is encapsulated as the key value generation module. The value vector's branch contains no L2 regularization module, and the vector calculation is a convolution operation.
(5) As shown in fig. 7, the query vector is transposed and matrix-multiplied with the key vector; the base-e exponential (scaled by the constant s) and the softmax are applied to the result, per the attention formula above; the result is transposed, matrix-multiplied with the value vector and transposed again to obtain the attention output tensor. The whole structure is encapsulated as the non-local attention module.
(6) As shown in fig. 8, the attention output tensor is reshaped and summed element-wise with the input of the whole non-local module; the whole structure is encapsulated as the attention fusion module.
(7) As shown in fig. 4, the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back and encapsulated as the non-local module. The input of the attention fusion module is the input of the non-local module as a whole together with the output of the non-local attention module.
(8) As shown in fig. 3, the global average pooling layer, the fully connected layer and the activation layer are connected in sequence from front to back and encapsulated as the head module.
(9) As shown in fig. 1, the root module, several residual modules, the non-local modules and the head module are connected in sequence to obtain the non-local convolutional network. The positions and number of the non-local modules can be adjusted according to the application's requirements.
Further, this embodiment constructs a 50-layer non-local convolutional neural network by adding 2 non-local modules, one before the head module and one before the last downsampling residual module. The remaining structure is identical to an ordinary residual convolutional neural network.
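Putting the earlier sketches together, a hedged assembly of the non-local module and a small end-to-end network might read as follows; the depths and widths are illustrative rather than the 50-layer configuration, and the key value convolutions map the c+2 stitched channels back to c so that the fusion sum type-checks, a detail the patent leaves implicit:

```python
import torch.nn as nn

class NonLocalModule(nn.Module):
    """Coordinate stitching -> key value generation -> non-local attention ->
    attention fusion, reusing CoordConcat, KeyValueGeneration,
    non_local_attention and attention_fusion from the sketches above."""
    def __init__(self, channels, s=3.0):
        super().__init__()
        self.coords = CoordConcat()
        # +2 input channels from the coordinate map; output mapped back to
        # `channels` so the residual sum in attention_fusion matches.
        self.kv = KeyValueGeneration(channels + 2, channels)
        self.s = s

    def forward(self, x):
        q, k, v = self.kv(self.coords(x))
        return attention_fusion(x, non_local_attention(q, k, v, self.s))

def build_network(num_classes=5):
    """Root -> residual stack -> non-local modules -> head, per Fig. 1."""
    return nn.Sequential(
        RootModule(3, 64),
        ResidualModule(64, 128, downsample=True),
        NonLocalModule(128),   # before the last downsampling residual module
        ResidualModule(128, 256, downsample=True),
        NonLocalModule(256),   # before the head module
        HeadModule(256, num_classes),
    )
```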
The dataset used in the experiment is the tf_flower dataset, which contains pictures of 5 different kinds of flowers; the training set contains 3308 pictures and the test set contains 372 pictures. During training, each picture is scaled to 256×256, then a 224×224 region is randomly cropped and randomly horizontally flipped. During testing, the picture is scaled directly to 224×224. All training parameters are identical for the two network structures.
As shown in Table 1, the accuracy of the non-local convolutional neural network in this embodiment is 89.65%, which is 1.64% higher than that of the common residual convolutional neural network; the improvement is significant.
TABLE 1
Network architecture                              Accuracy (%)
Common residual convolutional neural network      88.01
Non-local convolutional neural network            89.65
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (9)

1. A non-local image classification device, characterized by comprising a data acquisition module, a training module and a classification module, wherein the data acquisition module is used for collecting data and forming training samples; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; and the classification module is used for inputting the image to be classified into the optimal image classification model and outputting a classification result;
the image classification network consists of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back, wherein the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back; the operation of adding the absolute coordinate information of each feature point on the feature map into the feature map is encapsulated as the coordinate stitching module; the key value generation module is divided into 3 branches, each connecting vector calculation, reshaping and L2 regularization modules in sequence to obtain a value vector, a key vector and a query vector, the whole structure being encapsulated as the key value generation module, wherein the value vector's branch contains no L2 regularization module and the vector calculation is a convolution operation; the query vector is transposed and matrix-multiplied with the key vector, the base-e exponential is applied to the result, and the result is transposed and matrix-multiplied with the value vector to obtain the attention output tensor, the whole structure being encapsulated as the non-local attention module; the attention output tensor is reshaped and summed element-wise with the input of the non-local module, the whole structure being encapsulated as the attention fusion module; the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back and encapsulated as the non-local module, wherein the input of the attention fusion module is the input of the non-local module as a whole together with the output of the non-local attention module;
the root module is used for converting the pixel information of the input image and outputting a feature map composed of feature information; the residual modules are used for progressively extracting higher-level semantic information from the feature map and outputting the feature map to the non-local module; the head module is used for converting the feature map containing the semantics into an image classification result;
the coordinate stitching module is used for adding the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating the query vector, the key vector and the value vector, inputting them to the non-local attention module, computing the correlation between each feature point on the feature map and all feature points, and generating an attention map; the attention fusion module is used for feeding the information obtained from the attention map back into the feature map, and the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module.
2. The non-local image classification device according to claim 1, wherein the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
3. The non-local image classification device according to claim 1, wherein the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation, the key vector and the query vector both being L2-regularized; the formulas for the query vector, key vector and value vector are as follows:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k, conv_v are the convolution functions for the query vector, key vector and value vector respectively.
4. The non-local image classification device according to claim 3, wherein the non-local attention module first generates an attention tensor from the query vector and the key vector, calculated as follows:
Attn = softmax(exp(s * Q^T × K))
wherein:
softmax() is the softmax function,
Q^T is the transpose of the query vector,
K is the key vector,
exp() is the exponential function with base e,
s is a constant parameter whose optimal value can be found by experiment,
Attn is the attention tensor;
the non-local attention module then matrix-multiplies the transpose of the attention tensor with the value vector and takes the transpose of the result to obtain the attention output tensor, calculated as follows:
Attn_out = (Attn^T × V)^T
wherein:
Attn^T is the transpose of the attention tensor,
V is the value vector,
^T denotes the transpose operation,
Attn_out is the output of the non-local attention module.
5. The non-local image classification device according to claim 4, wherein the constant parameter satisfies 2 ≤ s ≤ 5.
6. The non-local image classification device according to claim 1, wherein the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_out is the output of the non-local attention module;
out is the output of the attention fusion module;
input is the input of the whole non-local module;
Attn_reshape is the reshaped output of the non-local attention module;
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
7. The non-local image classification device according to any one of claims 1 to 6, wherein the number of non-local modules is 2, located before the head module and before the last downsampling residual module, respectively.
8. A non-local image classification method using the image classification device according to any one of claims 1 to 7, comprising the following steps:
step S1: collecting image data and labeling it to form training data;
step S2: encapsulating a root module, residual modules, a non-local module and a head module, and connecting the root module, a plurality of residual modules, the non-local module and the head module in sequence from front to back to obtain an image classification network; then inputting training samples from the training data into the image classification network for training to obtain a trained optimal image classification model;
step S3: inputting the image to be classified into the optimal image classification model and outputting the classification result.
9. A computer readable storage medium storing computer program instructions, characterized in that the program instructions, when executed by a processor, implement the method of claim 8.
CN202110308766.6A 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium Active CN113065586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110308766.6A CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110308766.6A CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Publications (2)

Publication Number Publication Date
CN113065586A CN113065586A (en) 2021-07-02
CN113065586B (en) 2022-10-18

Family

ID=76563190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110308766.6A Active CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Country Status (1)

Country Link
CN (1) CN113065586B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469111A (en) * 2021-07-16 2021-10-01 中国银行股份有限公司 Image key point detection method and system, electronic device and storage medium
CN113569735B (en) * 2021-07-28 2023-04-07 中国人民解放军空军预警学院 Complex input feature graph processing method and system based on complex coordinate attention module
CN114565941B (en) * 2021-08-24 2024-09-24 商汤国际私人有限公司 Texture generation method, device, equipment and computer readable storage medium
CN113722549B (en) * 2021-09-03 2022-06-21 优维科技(深圳)有限公司 Data state fusion storage system and method based on graph

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015147764A1 (en) * 2014-03-28 2015-10-01 Kisa Mustafa A method for vehicle recognition, measurement of relative speed and distance with a single camera
CN111339364B (en) * 2020-02-28 2023-09-29 网易(杭州)网络有限公司 Video classification method, medium, device and computing equipment
CN111583210B (en) * 2020-04-29 2022-03-15 北京小白世纪网络科技有限公司 Automatic breast cancer image identification method based on convolutional neural network model integration
CN111754637B (en) * 2020-06-30 2021-01-29 华东交通大学 Large-scale three-dimensional face synthesis system with suppressed sample similarity
CN111932553B (en) * 2020-07-27 2022-09-06 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN112419184B (en) * 2020-11-19 2022-11-04 重庆邮电大学 Spatial attention map image denoising method integrating local information and global information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A novel supervised feature extraction and classification fusion algorithm for land cover recognition of the off-land scenario; Yan Cui; Neurocomputing; 2014-09-22; Vol. 140; pp. 1-7 *
Multi-Head Self-Attention for 3D Point Cloud Classification; Xue-Yao Gao; IEEE Access; 2021-01-11; Vol. 9; pp. 1-12 *

Also Published As

Publication number Publication date
CN113065586A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065586B (en) Non-local image classification device, method and storage medium
CN112949565B (en) Single-sample partially-shielded face recognition method and system based on attention mechanism
US11328172B2 (en) Method for fine-grained sketch-based scene image retrieval
Wang et al. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
EP3813661A1 (en) Human pose analysis system and method
CN112131959B (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN110245621B (en) Face recognition device, image processing method, feature extraction model, and storage medium
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN110866938B (en) Full-automatic video moving object segmentation method
CN117079098A (en) Space small target detection method based on position coding
CN115205933A (en) Facial expression recognition method, device, equipment and readable storage medium
CN113887385A (en) Three-dimensional point cloud classification method based on multi-view attention convolution pooling
CN114005046A (en) Remote sensing scene classification method based on Gabor filter and covariance pooling
CN113780140A (en) Gesture image segmentation and recognition method and device based on deep learning
Wu et al. Deep texture exemplar extraction based on trimmed T-CNN
CN107403145B (en) Image feature point positioning method and device
CN114863132B (en) Modeling and capturing method, system, equipment and storage medium for image airspace information
CN112907607B (en) Deep learning, target detection and semantic segmentation method based on differential attention
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism
CN114612758A (en) Target detection method based on deep grouping separable convolution
CN113989906B (en) Face recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant