CN113065586B - Non-local image classification device, method and storage medium - Google Patents

Non-local image classification device, method and storage medium

Info

Publication number
CN113065586B
CN113065586B
Authority
CN
China
Prior art keywords
module
vector
local
attention
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110308766.6A
Other languages
Chinese (zh)
Other versions
CN113065586A (en)
Inventor
卢丽
孙亚楠
韩强
闫超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Yifei Technology Co ltd
Original Assignee
Sichuan Yifei Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Yifei Technology Co ltd filed Critical Sichuan Yifei Technology Co ltd
Priority to CN202110308766.6A priority Critical patent/CN113065586B/en
Publication of CN113065586A publication Critical patent/CN113065586A/en
Application granted granted Critical
Publication of CN113065586B publication Critical patent/CN113065586B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 - Fusion techniques
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a non-local image classification device, method and storage medium, in which a convolutional network is composed of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back; the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back. The coordinate stitching module adds the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module outputs a query vector, a key vector and a value vector, which are input to the non-local attention module and processed to obtain an attention output tensor; the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module. Through the non-local module, every feature point on the output feature map can acquire information from the entire global region, so the accuracy is significantly improved and network performance is effectively enhanced.

Description

Non-local image classification device, method and storage medium
Technical Field
The invention belongs to the technical field of image classification in computer vision, and particularly relates to a non-local image classification device, method and storage medium.
Background
At present, neural network technology in computer vision is widely applied in fields such as image classification, object detection, image segmentation, face recognition and behavior recognition. Among these, image classification is the most fundamental technique: networks used in the other fields mostly take an image classification network as their backbone and add further functional modules on top. A high-performance image classification network is therefore very important for machine vision based on neural network technology.
Image classification networks are typically built on convolution operations. Convolution is essentially a local operation: the receptive field of a feature point on the feature map output by a convolution is local, i.e. it perceives only the feature information of a region in the previous layer the size of the convolution kernel. Convolution kernels in common networks are generally small, typically 1×1, 3×3 or 5×5. Although convolutional networks can enlarge the theoretical receptive field by stacking convolutions, many studies have found that even when the theoretical receptive field of a deep convolutional layer is large, the effective receptive field remains far smaller than the theoretical value, so a convolutional network is still, to a large extent, a local network. This limits the accuracy a convolutional network can reach.
Methods based on global information, such as the Vision Transformer, abandon convolution entirely and require large amounts of training data to perform well. It is therefore desirable to find a method that retains the efficiency of convolutional image feature extraction while mitigating its locality limitation.
Disclosure of Invention
An object of the present invention is to provide a non-local image classification device, method and storage medium that solve the above problems. Through the non-local module, the neural network can obtain global information, overcoming the defect that convolution can only capture local information, and thereby improving network accuracy.
The invention is mainly realized by the following technical scheme:
a non-local image classification device comprises a data acquisition module, a training module and a classification module, wherein the data acquisition module is used for collecting data and forming a training sample; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; the classification module is used for inputting the image to be detected into the optimal image classification model and inputting a classification result; the image classification network consists of a root module, a plurality of residual modules, a non-local module and a head module which are sequentially connected from front to back, wherein the non-local module consists of a coordinate splicing module, a key value generation module, a non-local attention module and an attention fusion module which are sequentially connected from front to back; the root module is used for converting the pixel information of the input image and outputting a feature map consisting of feature information; the residual error modules are used for gradually extracting semantic information of higher layers in the feature map and outputting the feature map to the non-local area module; the head module is used for converting the feature map containing the semantics into an image classification result; the coordinate splicing module is used for adding absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating a query vector, a key vector and a value vector, inputting the query vector, the key vector and the value vector to the non-local attention module, calculating the correlation between each feature point on the feature map and all the feature points and generating an attention map; the attention fusion module is used for feeding back information obtained by the attention map to the feature map, and the input of the attention fusion module is the input of the non-local area module and the output of the non-local area attention module.
The root module is used for converting the pixel information of the input image into coarse feature information and outputting a feature map formed from this information. The residual module applies several convolution operations to progressively extract finer feature information with richer semantics and outputs a feature map formed from this information. Stacking several residual modules gradually extracts higher-level semantic information.
A non-local module is added after some of the residual modules. The residual module is mainly composed of convolution layers. Convolution is characterized by the fact that the output at a feature point is influenced only by other feature points within the convolution-kernel-sized region, i.e. it is local. Image information naturally has a certain locality: a given pixel generally forms a shape or object together with the pixels around it, so convolution is efficient at extracting features. However, images also contain non-local information; for example, determining the category of an object may require using a shape on the left side of the image and a shape on the right side at the same time.
The head module is used for converting the feature map containing the semantic information into an image classification result.
The specific working principle of the non-local module is as follows:
1) The coordinate stitching module adds coordinate information into the feature information, strengthening the spatial information of the features and improving the precision of the subsequent attention maps.
2) The key value generation module and the non-local attention module compute the correlation between each feature point and all feature points on the feature map and generate the attention map; the higher the correlation, the larger the value on the attention map. The biggest difference from the convolution operation is that the correlation is computed over all feature points and is no longer limited to the region covered by the convolution kernel, and is therefore non-local.
3) The attention fusion module feeds the information obtained from the attention map back into the feature map, remedying the lack of non-local information in the feature map produced by the residual modules.
In order to better implement the present invention, further, the root module is obtained by connecting a convolution layer, a batch normalization layer and an activation layer in sequence from front to back and encapsulating them. A residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output. The head module is obtained by connecting a global average pooling layer, a fully connected layer and an activation layer in sequence from front to back and encapsulating them. Connecting the root module, several residual modules, the non-local modules and the head module in sequence yields the non-local convolutional network. The positions and number of the non-local modules can be adjusted according to the application's requirements.
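For concreteness, a minimal PyTorch sketch of these three conventional building blocks follows. The patent fixes only the layer ordering, so the kernel sizes, strides and channel counts below are assumptions:

```python
import torch
import torch.nn as nn

class RootModule(nn.Module):
    """Convolution -> batch normalization -> activation; kernel size,
    stride and channel counts are illustrative assumptions."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class ResidualModule(nn.Module):
    """Main branch: (conv -> BN -> activation) repeated; bypass branch:
    conv + BN when downsampling, identity otherwise."""
    def __init__(self, in_ch, out_ch, downsample=False):
        super().__init__()
        stride = 2 if downsample else 1
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        if downsample or in_ch != out_ch:
            self.bypass = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.bypass = nn.Identity()  # input passed through unchanged

    def forward(self, x):
        return self.main(x) + self.bypass(x)

class HeadModule(nn.Module):
    """Global average pooling -> fully connected layer -> activation."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, x):
        x = self.pool(x).flatten(1)               # [B, C, 1, 1] -> [B, C]
        return torch.softmax(self.fc(x), dim=1)   # class probabilities
```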
In order to better implement the present invention, further, the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
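A minimal PyTorch sketch of the coordinate stitching module under this expression, assuming the [-1, 1] coordinate normalization given in the detailed embodiments below:

```python
import torch
import torch.nn as nn

class CoordConcat(nn.Module):
    """Appends a 2-channel absolute-coordinate map to the feature map along
    the channel dimension: [b, c, h, w] -> [b, c + 2, h, w]."""
    def forward(self, x):
        b, _, h, w = x.shape
        # Row/column coordinates normalized to [-1, 1), matching
        # coord_map[:,0,i,:] = -1 + (i/h)*2 and coord_map[:,1,:,j] = -1 + (j/w)*2.
        ys = -1 + 2 * torch.arange(h, device=x.device, dtype=x.dtype) / h
        xs = -1 + 2 * torch.arange(w, device=x.device, dtype=x.dtype) / w
        coord_map = torch.stack([
            ys.view(h, 1).expand(h, w),  # channel 0: row coordinate
            xs.view(1, w).expand(h, w),  # channel 1: column coordinate
        ]).unsqueeze(0).expand(b, -1, -1, -1)
        return torch.cat([x, coord_map], dim=1)  # dim=1 is the channel dimension
```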
In order to better implement the present invention, further, the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation. The key vector and the query vector are both L2-regularized, so that the attention value depends on the angle between the key vector and the query vector, that is, on whether the two vectors point in similar directions in the high-dimensional space, rather than on the magnitudes of the vectors. The formulas for the query, key and value vectors are as follows:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k, conv_v are the convolution functions for the query vector, key vector and value vector respectively.
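The following is a sketch of the key value generation module under these formulas, with PyTorch's F.normalize standing in for l2_norm; the out_channels argument is an assumption added so that the value vector's channel count can later match the feature map summed with it in the fusion step:

```python
import torch.nn as nn
import torch.nn.functional as F

class KeyValueGeneration(nn.Module):
    """1x1 convolutions followed by reshaping to [B, C, H*W]; the query and
    key vectors are additionally L2-normalized along the channel dimension."""
    def __init__(self, in_channels, out_channels=None):
        super().__init__()
        out_channels = out_channels or in_channels  # patent: output == input channels
        self.conv_q = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv_k = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.conv_v = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        q = F.normalize(self.conv_q(x).flatten(2), dim=1)  # l2_norm over C
        k = F.normalize(self.conv_k(x).flatten(2), dim=1)
        v = self.conv_v(x).flatten(2)                      # no L2 on the value vector
        return q, k, v
```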
As noted above, the non-local module first uses the coordinate stitching module to explicitly add the absolute coordinate information of each feature point on the feature map into the feature vector.
In order to better implement the present invention, further, the non-local attention module first generates an attention tensor from the query vector and the key vector, calculated as follows:
Attn = softmax(exp(s * Q^T × K))
wherein:
softmax() is the softmax function,
Q^T is the transpose of the query vector,
K is the key vector,
exp() is the exponential function with base e,
s is a constant parameter whose optimal value can be found by experiment,
Attn is the attention tensor;
the non-local attention module then matrix-multiplies the transpose of the attention tensor with the value vector and takes the transpose of the result to obtain the attention output tensor, calculated as follows:
Attn_out = (Attn^T × V)^T
wherein:
Attn^T is the transpose of the attention tensor,
V is the value vector,
^T denotes the transpose operation,
Attn_out is the output of the non-local attention module.
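A sketch of the non-local attention module implementing the two formulas above with batched matrix products; the softmax axis and the transpose order follow a shape-consistent reading of the formulas, which the patent leaves implicit:

```python
import torch

def non_local_attention(q, k, v, s=3.0):
    """q, k, v: [B, C, N] with N = H*W; q and k are L2-normalized, so
    Q^T x K holds cosine similarities. Returns Attn_out of shape [B, C, N]."""
    # Attn = softmax(exp(s * Q^T x K)), shape [B, N, N]
    attn = torch.softmax(torch.exp(s * torch.bmm(q.transpose(1, 2), k)), dim=-1)
    # Attn_out read as V x Attn^T: each output position is a weighted sum of
    # the value vectors at all positions (the "non-local" step).
    return torch.bmm(v, attn.transpose(1, 2))  # [B, C, N] x [B, N, N] -> [B, C, N]
```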
In order to better implement the invention, further, the constant parameter satisfies 2 ≤ s ≤ 5.
In order to better implement the present invention, further, the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_out is the output of the non-local attention module;
out is the output of the attention fusion module;
input is the input of the whole non-local module;
Attn_reshape is the reshaped output of the non-local attention module;
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
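The fusion step then reduces to a reshape followed by an element-wise residual sum, as in this sketch:

```python
def attention_fusion(module_input, attn_out):
    """module_input: [B, C, H, W], the input of the whole non-local module;
    attn_out: [B, C, H*W], the output of the non-local attention module."""
    b, c, h, w = module_input.shape
    attn_reshape = attn_out.reshape(b, c, h, w)  # [B, C, H*W] -> [B, C, H, W]
    return module_input + attn_reshape           # element-wise sum
```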
In the self-attention mechanism adopted by the invention, the key vectors and query vectors are L2-regularized. This is equivalent to fixing the modulus of both vectors to 1, so that when the attention tensor is subsequently computed, the attention value depends on the angle between the key vector and the query vector, i.e. on whether the two vectors point in similar directions in the high-dimensional space. In a conventional self-attention mechanism without L2 regularization, the attention value depends not only on this angle but also on the product of the two vectors' moduli. On a feature map in a convolutional neural network, feature points in background regions have smaller moduli, while feature points where objects appear have larger moduli; the attention tensor should focus on the degree of correlation between feature points at different locations rather than on the magnitudes of the feature points themselves.
To better implement the present invention, further, the number of non-local modules is 2, located before the head module and before the last downsampling residual module, respectively.
The invention is mainly realized by the following technical scheme:
a non-local image classification method is carried out by adopting the image classification device, and comprises the following steps:
step S1: collecting image data and marking to form training data,
step S2: respectively packaging to obtain a root module, a residual error module, a non-local module and a head module, and sequentially connecting the root module, the residual error modules, the non-local module and the head module from front to back to obtain an image classification network; then, inputting training samples in the training data into an image classification network for training to obtain a trained optimal image classification model;
and step S3: and inputting the image to be detected into the optimal image classification model and inputting a classification result.
A computer readable storage medium storing computer program instructions which, when executed by a processor, implement a non-local image classification method.
The invention has the beneficial effects that:
(1) Through the non-local module, every feature point on the output feature map can acquire information from the entire global region, remedying the locality of the convolution operation in ordinary convolutional networks; the accuracy is significantly improved and network performance is effectively enhanced. The invention has a novel structure, is simple to implement, and gives a clear accuracy gain;
(2) In the non-local module, the attention tensor is computed with an attention mechanism, so that when integrating global information the network focuses more on relevant information and reduces the influence of irrelevant information;
(3) The invention adopts a self-attention mechanism and explicitly introduces absolute coordinate information. Compared with conventional non-local methods in convolutional neural networks, the absolute coordinate information clearly improves network performance;
(4) In the self-attention mechanism adopted by the invention, the key vectors and query vectors are L2-regularized, which is equivalent to fixing their moduli to 1; when the attention tensor is subsequently computed, the attention value depends on the angle between the key vector and the query vector, i.e. on whether the two vectors point in similar directions in the high-dimensional space. Compared with the prior art, the attention tensor attends more to the correlation between feature points at different positions rather than to the magnitudes of the feature points, and the accuracy is significantly improved;
(5) The non-local module can be conveniently inserted into a common convolutional neural network, such as a residual network, achieving plug-and-play and effectively improving network accuracy.
Drawings
FIG. 1 is a schematic view of the overall structure of the present invention;
FIG. 2 is a schematic structural view of a root module according to the present invention;
FIG. 3 is a schematic structural diagram of a head module according to the present invention;
FIG. 4 is a schematic structural diagram of a non-local area module according to the present invention;
FIG. 5 is a functional block diagram of a coordinate stitching module of the present invention;
FIG. 6 is a functional block diagram of a key value generation module of the present invention;
FIG. 7 is a functional block diagram of a non-local attention module of the present invention;
FIG. 8 is a functional block diagram of the attention fusion module of the present invention;
FIG. 9 is a functional block diagram of a residual module of the present invention without downsampling;
fig. 10 is a functional block diagram of a residual block for downsampling according to the present invention.
Detailed Description
Example 1:
A non-local image classification device comprises a data acquisition module, a training module and a classification module. The data acquisition module is used for collecting data and forming training samples; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; and the classification module is used for inputting the image to be classified into the optimal image classification model and outputting the classification result.
As shown in fig. 1, the image classification network is composed of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back; the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back.
The coordinate stitching module is used for adding the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating a query vector, a key vector and a value vector, inputting them to the non-local attention module, and processing them to obtain an attention output tensor; the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module.
Further, as shown in fig. 2, the root module is obtained by sequentially connecting a convolution layer, a batch normalization layer and an activation layer from front to back.
As shown in fig. 9 and 10, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
As shown in fig. 3, the head module is obtained by sequentially connecting a global average pooling layer, a fully connected layer and an activation layer from front to back.
The non-local module provided by this embodiment can effectively introduce global information, remedying the defect that the convolutional neural network focuses excessively on local information, thereby improving network performance.
Example 2:
This embodiment is optimized on the basis of embodiment 1; the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
This embodiment is optimized on the basis of embodiment 1 or 2. As shown in fig. 6, the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation. The key vector and the query vector are both L2-regularized, so that the attention value depends on the angle between the key vector and the query vector, that is, on whether the two vectors point in similar directions in the high-dimensional space, rather than on the magnitudes of the vectors. The corresponding formulas are:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k and conv_v are convolution operations with kernel size 1 whose output channels equal their input channels, for the query vector, key vector and value vector respectively.
Further, as shown in fig. 7, the non-local attention module first generates an attention tensor by using the query vector and the key vector, and the calculation formula is as follows:
Attn=softmax(exp(s*Q T ×K)
wherein:
softmax () is a softmax function,
Q T is a transposed vector of the query vector,
k is a key vector and is a key vector,
exp () is an exponential operation based on a natural number e,
s is a constant parameter, the optimum value can be obtained by experiment,
attn is the attention tensor;
then, the non-local attention module performs matrix multiplication on the attention tensor and the transpose of the value vector, and then obtains the attention output tensor by taking the transpose, wherein a calculation formula is as follows:
Attn_out=(Attn T × V) T
wherein:
Attn T the transposed tensor of the attention tensor,
v is a vector of values, and V is a vector of values,
t denotes a transpose operation and,
attn _ out is the output of the non-local attention module.
Further, the constant parameter s satisfies 2 ≤ s ≤ 5.
The rest of this embodiment is the same as embodiment 1 or 2, and therefore, the description thereof is omitted.
Example 4:
This embodiment is optimized on the basis of any one of embodiments 1 to 3. As shown in fig. 8, the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_reshape is the reshaped output of the non-local attention module,
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
Other parts of this embodiment are the same as any of embodiments 1-3, and therefore are not described again.
Example 5:
a non-local image classification method is carried out by adopting the image classification device, and comprises the following steps:
step S1: collecting image data and marking to form training data,
step S2: respectively packaging to obtain a root module, a residual error module, a non-local module and a head module, and sequentially connecting the root module, the residual error modules, the non-local module and the head module from front to back to obtain an image classification network; then, inputting training samples in the training data into an image classification network for training to obtain a trained optimal image classification model;
and step S3: and inputting the image to be detected into the optimal image classification model and inputting a classification result.
The non-local module provided by the embodiment can effectively introduce the information of the whole local area, and the defect that the convolutional neural network excessively focuses on the local information is overcome, so that the purpose of improving the network performance is achieved.
Example 6:
As shown in fig. 1, the image classification network is a convolutional network composed of a root module, several residual modules, a non-local module and a head module connected in sequence from front to back. Compared with a traditional convolutional neural network, the non-local module provided by this embodiment can effectively introduce global information, remedying the defect that the convolutional neural network focuses excessively on local information, thereby improving network performance.
Further, as shown in fig. 2, the convolution layer, the batch normalization layer and the activation layer are connected in sequence from front to back and encapsulated as the root module.
Further, as shown in fig. 9, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back. As shown in fig. 10, when the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
As shown in fig. 4, the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back.
As shown in fig. 3, the global average pooling layer, the fully connected layer and the activation layer are connected in sequence from front to back and encapsulated as the head module.
And sequentially connecting the root module, the residual modules, the non-local area module and the head module to obtain a non-local area convolution network. The connection sequence and the number of the non-local modules can be adjusted according to the application requirements.
Further, as shown in fig. 5, the non-local module first uses the coordinate stitching module to explicitly add the absolute coordinate information of each feature point on the feature map into the feature vector. The expression is as follows:
X'=concat([X,coord_map],dim=channel)
where X is the input feature map, X' is the output feature map, concat denotes the concatenation operation, and dim=channel indicates that concatenation is along the feature channel dimension. coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w], with its values determined by the following equations:
coord_map[:,0,i,:]=-1+(i/h)*2
coord_map[:,1,:,j]=-1+(j/w)*2
further, as shown in fig. 6, the key-value generation module in the non-local area module needs to generate a key feature map, a value feature map and a query feature map. The three feature maps are generated by the convolution operation that the size of a convolution kernel is 1, an output channel is equal to an input channel, and the deformation operation is added. The corresponding formula is:
Q=l2_norm(reshape(conv q (X)))
K=l2_norm(reshape(conv k (X)))
V=reshape(conv v (X))
wherein, X is the input of the key value generation module, convq, convk and convv are respectively the convolution operation functions of which the sizes of three convolution kernels are 1 and the output channel is equal to the input channel. reshape is a deformation, and the dimension of the vector is changed from [ B, C, H, W ] to [ B, C, hxW ], wherein B is the batch number, C is the channel number, H is the height of the feature map, and W is the width of the feature map. L2_ norm () is an L2 regularization function with the regularized channel in dimension 1, i.e., the C dimension. Q, K, V are the query vector, key vector and value vector, respectively. These three vectors are the outputs of the key-value generation module.
Further, as shown in fig. 7, the non-local attention module in the non-local module generates an attention tensor from the query vector and key vector output by the key value generation module. The attention tensor is given by the following equation:
Attn = softmax(exp(s * Q^T × K))
where softmax() is the softmax function, Q^T is the transpose of the query vector, K is the key vector, exp() is the exponential function with base e, s is a constant parameter whose optimal value can be found by experiment, and Attn is the attention tensor.
Further, the constant parameter s takes any value between 2 and 5.
Then the transpose of the attention tensor is matrix-multiplied with the value vector, and the transpose of the result gives the output of the attention module; the whole process is given by the following formula:
Attn_out = (Attn^T × V)^T
where Attn^T is the transpose of the attention tensor, V is the value vector, ^T denotes the transpose operation, and Attn_out is the output of the non-local attention module.
Further, the attention fusion module in the non-local module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module. The whole process is given by the following formulas:
Attn_reshape=reshape(Attn_out)
out=input+Attn_reshape
the reshape () is a morphing operation, morphing a tensor with dimensions [ B, C, hxW ] into a tensor with dimensions [ B, C, H, W ], where B is the number of batches, C is the number of channels, H is the height of the feature map, and W is the width of the feature map.
Further, the number of non-local modules in the whole network is 2, and the non-local modules are respectively positioned before the head module and before the last downsampled residual module.
Example 7:
a construction method of a non-local image classification network comprises the following steps:
(1) As shown in fig. 2, the convolution layer, the batch normalization layer and the activation layer are connected in sequence from front to back and encapsulated as the root module.
(2) As shown in fig. 9 and fig. 10, a residual module is obtained by encapsulating a main branch and a bypass branch in parallel; the main branch is formed by repeating the sequence convolution layer, batch normalization layer, activation layer several times from front to back; if the residual module performs downsampling, the bypass branch consists of a convolution layer and a batch normalization layer; if the residual module does not perform downsampling, the bypass branch is an identity module, i.e. the module's input is taken directly as its output.
(3) As shown in fig. 5, the operation of explicitly adding the absolute coordinate information of each feature point on the feature map into the feature map is encapsulated as the coordinate stitching module.
(4) As shown in fig. 6, the structure is divided into 3 branches, each connecting vector calculation, reshaping and L2 regularization modules in sequence to obtain the value vector, key vector and query vector; the whole structure is encapsulated as the key value generation module. The value vector's branch contains no L2 regularization module, and the vector calculation is a convolution operation.
(5) As shown in fig. 7, the query vector is transposed and matrix-multiplied with the key vector; the base-e exponential (scaled by the constant s) and the softmax are applied to the result, per the attention formula above; the result is transposed, matrix-multiplied with the value vector and transposed again to obtain the attention output tensor. The whole structure is encapsulated as the non-local attention module.
(6) As shown in fig. 8, the attention output tensor is reshaped and summed element-wise with the input of the whole non-local module; the whole structure is encapsulated as the attention fusion module.
(7) As shown in fig. 4, the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back and encapsulated as the non-local module. The input of the attention fusion module is the input of the non-local module as a whole together with the output of the non-local attention module.
(8) As shown in fig. 3, the global average pooling layer, the fully connected layer and the activation layer are connected in sequence from front to back and encapsulated as the head module.
(9) As shown in fig. 1, the root module, several residual modules, the non-local modules and the head module are connected in sequence to obtain the non-local convolutional network. The positions and number of the non-local modules can be adjusted according to the application's requirements.
Further, this embodiment constructs a 50-layer non-local convolutional neural network by adding 2 non-local modules, one before the head module and one before the last downsampling residual module. The remaining structure is identical to an ordinary residual convolutional neural network.
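Putting the earlier sketches together, a hedged assembly of the non-local module and a small end-to-end network might read as follows; the depths and widths are illustrative rather than the 50-layer configuration, and the key value convolutions map the c+2 stitched channels back to c so that the fusion sum type-checks, a detail the patent leaves implicit:

```python
import torch.nn as nn

class NonLocalModule(nn.Module):
    """Coordinate stitching -> key value generation -> non-local attention ->
    attention fusion, reusing CoordConcat, KeyValueGeneration,
    non_local_attention and attention_fusion from the sketches above."""
    def __init__(self, channels, s=3.0):
        super().__init__()
        self.coords = CoordConcat()
        # +2 input channels from the coordinate map; output mapped back to
        # `channels` so the residual sum in attention_fusion matches.
        self.kv = KeyValueGeneration(channels + 2, channels)
        self.s = s

    def forward(self, x):
        q, k, v = self.kv(self.coords(x))
        return attention_fusion(x, non_local_attention(q, k, v, self.s))

def build_network(num_classes=5):
    """Root -> residual stack -> non-local modules -> head, per Fig. 1."""
    return nn.Sequential(
        RootModule(3, 64),
        ResidualModule(64, 128, downsample=True),
        NonLocalModule(128),   # before the last downsampling residual module
        ResidualModule(128, 256, downsample=True),
        NonLocalModule(256),   # before the head module
        HeadModule(256, num_classes),
    )
```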
The dataset used in the experiment is the tf_flower dataset, which contains pictures of 5 different kinds of flowers; the training set contains 3308 pictures and the test set contains 372 pictures. During training, each picture is scaled to 256×256, then a 224×224 region is randomly cropped and randomly horizontally flipped. During testing, the picture is scaled directly to 224×224. All training parameters are identical for the two network structures.
As shown in Table 1, the accuracy of the non-local convolutional neural network in this embodiment is 89.65%, which is 1.64% higher than that of the common residual convolutional neural network; the improvement is significant.
TABLE 1
Network architecture                              Accuracy (%)
Common residual convolutional neural network      88.01
Non-local convolutional neural network            89.65
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (9)

1. A non-local image classification device, characterized by comprising a data acquisition module, a training module and a classification module, wherein the data acquisition module is used for collecting data and forming training samples; the training module is used for inputting training samples into an image classification network for training to obtain an optimal image classification model; and the classification module is used for inputting the image to be classified into the optimal image classification model and outputting a classification result;
the image classification network consists of a root module, a plurality of residual modules, a non-local module and a head module connected in sequence from front to back, wherein the non-local module consists of a coordinate stitching module, a key value generation module, a non-local attention module and an attention fusion module connected in sequence from front to back; the operation of adding the absolute coordinate information of each feature point on the feature map into the feature map is encapsulated as the coordinate stitching module; the key value generation module is divided into 3 branches, each connecting vector calculation, reshaping and L2 regularization modules in sequence to obtain a value vector, a key vector and a query vector, the whole structure being encapsulated as the key value generation module, wherein the value vector's branch contains no L2 regularization module and the vector calculation is a convolution operation; the query vector is transposed and matrix-multiplied with the key vector, the base-e exponential is applied to the result, and the result is transposed and matrix-multiplied with the value vector to obtain the attention output tensor, the whole structure being encapsulated as the non-local attention module; the attention output tensor is reshaped and summed element-wise with the input of the non-local module, the whole structure being encapsulated as the attention fusion module; the coordinate stitching module, the key value generation module, the non-local attention module and the attention fusion module are connected in sequence from front to back and encapsulated as the non-local module, wherein the input of the attention fusion module is the input of the non-local module as a whole together with the output of the non-local attention module;
the root module is used for converting the pixel information of the input image and outputting a feature map composed of feature information; the residual modules are used for progressively extracting higher-level semantic information from the feature map and outputting the feature map to the non-local module; the head module is used for converting the feature map containing the semantics into an image classification result;
the coordinate stitching module is used for adding the absolute coordinate information of each feature point on the feature map into the feature vector; the key value generation module is used for generating the query vector, the key vector and the value vector, inputting them to the non-local attention module, computing the correlation between each feature point on the feature map and all feature points, and generating an attention map; the attention fusion module is used for feeding the information obtained from the attention map back into the feature map, and the inputs of the attention fusion module are the input of the non-local module and the output of the non-local attention module.
2. The non-local image classification device according to claim 1, wherein the expression of the coordinate stitching module is as follows:
X' = concat([X, coord_map], dim=channel)
wherein:
X' is the output feature map,
X is the input feature map,
concat denotes the concatenation operation,
dim=channel indicates that concatenation is along the feature channel dimension,
coord_map is the coordinate map; assuming the feature map X has size [b, c, h, w], where b is the batch size, c the number of channels, h the height and w the width of the feature map, coord_map has size [b, 2, h, w].
3. The non-local image classification device according to claim 1, wherein the key value generation module is configured to generate a query vector, a key vector and a value vector, each produced by a convolution with kernel size 1 whose output channels equal its input channels, followed by a reshaping operation, the key vector and the query vector both being L2-regularized; the formulas for the query vector, key vector and value vector are as follows:
Q = l2_norm(reshape(conv_q(X)))
K = l2_norm(reshape(conv_k(X)))
V = reshape(conv_v(X))
wherein:
X is the input of the key value generation module;
Q, K, V are the query vector, key vector and value vector respectively;
l2_norm() is an L2 regularization function applied along dimension 1, i.e. the C dimension;
reshape is a reshaping operation changing the vector's dimensions from [B, C, H, W] to [B, C, H×W], where B is the batch size, C the number of channels, H the height and W the width of the feature map;
conv_q, conv_k, conv_v are the convolution functions for the query vector, key vector and value vector respectively.
4. The non-local image classification device according to claim 3, wherein the non-local attention module first generates an attention tensor from the query vector and the key vector, calculated as follows:
Attn = softmax(exp(s * Q^T × K))
wherein:
softmax() is the softmax function,
Q^T is the transpose of the query vector,
K is the key vector,
exp() is the exponential function with base e,
s is a constant parameter whose optimal value can be found by experiment,
Attn is the attention tensor;
the non-local attention module then matrix-multiplies the transpose of the attention tensor with the value vector and takes the transpose of the result to obtain the attention output tensor, calculated as follows:
Attn_out = (Attn^T × V)^T
wherein:
Attn^T is the transpose of the attention tensor,
V is the value vector,
^T denotes the transpose operation,
Attn_out is the output of the non-local attention module.
5. The non-local image classification device according to claim 4, wherein the constant parameter satisfies 2 ≤ s ≤ 5.
6. The non-local image classification device according to claim 1, wherein the attention fusion module first applies a reshaping operation to the output of the non-local attention module and then sums it element-wise with the input of the whole non-local module, calculated as follows:
Attn_reshape = reshape(Attn_out)
out = input + Attn_reshape
wherein:
Attn_out is the output of the non-local attention module;
out is the output of the attention fusion module;
input is the input of the whole non-local module;
Attn_reshape is the reshaped output of the non-local attention module;
reshape() is a reshaping operation that transforms a tensor of shape [B, C, H×W] into a tensor of shape [B, C, H, W], where B is the batch size, C the number of channels, H the height and W the width of the feature map.
7. The non-local image classification device according to any one of claims 1 to 6, wherein the number of non-local modules is 2, located before the head module and before the last downsampling residual module, respectively.
8. A non-local image classification method using the image classification device according to any one of claims 1 to 7, comprising the following steps:
step S1: collecting image data and labeling it to form training data;
step S2: encapsulating a root module, residual modules, a non-local module and a head module, and connecting the root module, a plurality of residual modules, the non-local module and the head module in sequence from front to back to obtain an image classification network; then inputting training samples from the training data into the image classification network for training to obtain a trained optimal image classification model;
step S3: inputting the image to be classified into the optimal image classification model and outputting the classification result.
9. A computer readable storage medium storing computer program instructions, characterized in that the program instructions, when executed by a processor, implement the method of claim 8.
CN202110308766.6A 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium Active CN113065586B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110308766.6A CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110308766.6A CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Publications (2)

Publication Number Publication Date
CN113065586A CN113065586A (en) 2021-07-02
CN113065586B (en) 2022-10-18

Family

ID=76563190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110308766.6A Active CN113065586B (en) 2021-03-23 2021-03-23 Non-local image classification device, method and storage medium

Country Status (1)

Country Link
CN (1) CN113065586B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469111A (en) * 2021-07-16 2021-10-01 中国银行股份有限公司 Image key point detection method and system, electronic device and storage medium
CN113569735B (en) * 2021-07-28 2023-04-07 中国人民解放军空军预警学院 Complex input feature graph processing method and system based on complex coordinate attention module
CN114565941B (en) * 2021-08-24 2024-09-24 商汤国际私人有限公司 Texture generation method, device, equipment and computer readable storage medium
CN113722549B (en) * 2021-09-03 2022-06-21 优维科技(深圳)有限公司 Data state fusion storage system and method based on graph

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015147764A1 (en) * 2014-03-28 2015-10-01 Kisa Mustafa A method for vehicle recognition, measurement of relative speed and distance with a single camera
CN111339364B (en) * 2020-02-28 2023-09-29 网易(杭州)网络有限公司 Video classification method, medium, device and computing equipment
CN111583210B (en) * 2020-04-29 2022-03-15 北京小白世纪网络科技有限公司 Automatic breast cancer image identification method based on convolutional neural network model integration
CN111754637B (en) * 2020-06-30 2021-01-29 华东交通大学 Large-scale three-dimensional face synthesis system with suppressed sample similarity
CN111932553B (en) * 2020-07-27 2022-09-06 北京航空航天大学 Remote sensing image semantic segmentation method based on area description self-attention mechanism
CN112419184B (en) * 2020-11-19 2022-11-04 重庆邮电大学 Spatial attention map image denoising method integrating local information and global information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A novel supervised feature extraction and classification fusion algorithm for land cover recognition of the off-land scenario; Yan Cui; Neurocomputing; 2014-09-22; Vol. 140; pp. 1-7 *
Multi-Head Self-Attention for 3D Point Cloud Classification; Xue-Yao Gao; IEEE Access; 2021-01-11; Vol. 9; pp. 1-12 *

Also Published As

Publication number Publication date
CN113065586A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065586B (en) Non-local image classification device, method and storage medium
CN112949565B (en) Single-sample partially-shielded face recognition method and system based on attention mechanism
US11328172B2 (en) Method for fine-grained sketch-based scene image retrieval
Wang et al. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition
CN111738344B (en) Rapid target detection method based on multi-scale fusion
CN110619638A (en) Multi-mode fusion significance detection method based on convolution block attention module
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
EP3813661A1 (en) Human pose analysis system and method
CN112131959B (en) 2D human body posture estimation method based on multi-scale feature reinforcement
CN110245621B (en) Face recognition device, image processing method, feature extraction model, and storage medium
CN111368637B (en) Transfer robot target identification method based on multi-mask convolutional neural network
CN110866938B (en) Full-automatic video moving object segmentation method
CN117079098A (en) Space small target detection method based on position coding
CN115205933A (en) Facial expression recognition method, device, equipment and readable storage medium
CN113887385A (en) Three-dimensional point cloud classification method based on multi-view attention convolution pooling
CN114005046A (en) Remote sensing scene classification method based on Gabor filter and covariance pooling
CN113780140A (en) Gesture image segmentation and recognition method and device based on deep learning
Wu et al. Deep texture exemplar extraction based on trimmed T-CNN
CN107403145B (en) Image feature point positioning method and device
CN114863132B (en) Modeling and capturing method, system, equipment and storage medium for image airspace information
CN112907607B (en) Deep learning, target detection and semantic segmentation method based on differential attention
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism
CN114612758A (en) Target detection method based on deep grouping separable convolution
CN113989906B (en) Face recognition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant