CN113674334A - Texture recognition method based on depth self-attention network and local feature coding - Google Patents

Texture recognition method based on depth self-attention network and local feature coding Download PDF

Info

Publication number
CN113674334A
CN113674334A
Authority
CN
China
Prior art keywords
attention
self
window
local
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110760949.1A
Other languages
Chinese (zh)
Other versions
CN113674334B (en)
Inventor
Peng Bo (彭博)
The other inventors have requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202110760949.1A priority Critical patent/CN113674334B/en
Publication of CN113674334A publication Critical patent/CN113674334A/en
Application granted granted Critical
Publication of CN113674334B publication Critical patent/CN113674334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G06T7/41 Analysis of texture based on statistical description of texture
    • G06T7/44 Analysis of texture based on statistical description of texture using image operators, e.g. filters, edge density metrics or local histograms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a texture recognition method based on a depth self-attention network and local feature coding. A depth self-attention module with four stages is designed according to the characteristics of texture images: in the first three stages, local image blocks are merged to enlarge the receptive field, and the self-attention calculation is restricted to local windows of fixed size; in the last stage, block merging is removed and self-attention is computed globally, capturing the relations between local image blocks. In this way, the texture features of local regions are extracted well while the global features are preserved. The proposed PET network fully exploits the texture information of local regions in the image: the two-dimensional features output by the backbone network are reshaped into a three-dimensional feature map, block descriptors of multiple scales are densely sampled from the feature map with sliding windows to obtain a set of multi-scale local representations, and finally the multi-scale block features are locally encoded and fused to generate a fixed-size texture representation for the final classification.

Description

Texture recognition method based on depth self-attention network and local feature coding
Technical Field
The invention belongs to the technical field of texture classification and material classification, and particularly relates to a texture recognition method based on a depth self-attention network and local feature coding.
Background
In classical texture recognition methods based on the bag-of-words model, features are extracted with hand-crafted descriptors (such as GLCM, LBP, LPQ), each descriptor is assigned to the nearest visual word in a codebook, and classification is performed by counting the occurrence frequency of visual words or aggregating residuals. With the rapid development of deep learning, convolutional neural networks (CNN) are now widely used in place of hand-crafted feature extraction, followed by a texture encoding strategy for the final texture classification.
Most existing methods, such as FV-CNN (1), DeepTEN (2), DEP-Net (3) and LSCTN (4), typically perform texture encoding on the global features extracted by a CNN. In a texture image, the pixel arrangement and variation pattern of the whole image are often the same as those of a local region, so local regions carry strong texture-discriminative information. Existing global encoding methods usually rely on a CNN for feature extraction and neglect to combine local features in the texture encoding, which limits texture recognition performance.
The defects of the prior art are as follows:
(1) Classical texture recognition methods usually depend on image preprocessing, hand-crafted feature extraction and bag-of-words models; their low recognition performance cannot meet current requirements, and they are not optimized within a deep learning framework;
(2) Existing deep learning methods of this kind generally use a deep convolutional network (CNN) for feature extraction. Although CNNs have demonstrated strong feature-capturing ability on object-centric images, their ability to extract texture features is limited. Moreover, in texture images the local regions carry strong texture-discriminative information, and existing methods neglect to combine local features in the texture encoding, which restricts the recognition ability of the model on texture data.
References:
(1): M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3836, 2015;
Technical comparison: the article proposes using the Fisher Vector (FV) as an encoding layer to obtain an orderless representation, but the CNN and the Fisher Vector encoding layer are trained independently and do not form an end-to-end structure;
(2): Hang Zhang, Jia Xue, and Kristin Dana. 2017. Deep TEN: Texture encoding network. In IEEE Conference on Computer Vision and Pattern Recognition, 708–717;
Technical comparison: the article proposes integrating feature extraction and dictionary encoding into an end-to-end model; it takes the correlation between visual words and assignments into account during dictionary learning and improves the VLAD scheme, but it does not consider local features or multi-scale feature encoding;
(3): Jia Xue, Hang Zhang, and Kristin Dana. 2018. Deep texture manifold for ground terrain recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 558–567;
Technical comparison: this work points out that the texture of a material surface is not completely disordered and that the ordered information of local space is also crucial for texture recognition. The orderless information obtained by dictionary encoding and the ordered information obtained by an ordered pooling layer are fused through a bilinear model, but multi-scale features and per-local-region feature fusion are not considered;
(4): Xingyuan Bu, Yuwei Wu, Zhi Gao, and Yunde Jia. 2019. Deep convolutional network with locality and sparsity constraints for texture classification. Pattern Recognition 91 (2019), 34–46;
Technical comparison: this article proposes an encoding layer with locality constraints, in which the dictionary and the encoded representation are learned simultaneously, but the method does not take into account the significant effect of local features on texture classification.
Disclosure of Invention
The invention aims to provide a texture recognition method based on a depth self-attention network and local feature coding.
The invention provides a local feature coding (PET) network based on a depth self-attention network: a backbone based on the depth self-attention network (Transformer) is designed to replace the convolutional neural network (CNN) for feature extraction. A depth self-attention module with four stages is designed according to the characteristics of texture images: in the first three stages, local image blocks are merged to enlarge the receptive field, and the self-attention calculation is restricted to local windows of fixed size; in the last stage, block merging is removed and self-attention is computed globally, capturing the relations between local image blocks. In this way, the texture features of local regions are extracted well while the global features are preserved.
Through the PET network, which fully exploits the texture information of local regions in the image, the invention provides a local feature encoding method: the two-dimensional features output by the backbone network are reshaped into a three-dimensional feature map; block descriptors of multiple scales are densely sampled from the feature map with sliding windows to obtain a set of multi-scale local representations; finally, the multi-scale block features are locally encoded and fused to generate a fixed-size texture representation for the final classification.
The invention provides a texture recognition method based on a depth self-attention network and local feature coding, which comprises the following steps:
(1): Given an input image, the image is normalized and standardized and then divided into blocks, each of size p × p × 3. Each image block is linearly transformed into a one-dimensional vector of dimension D, yielding an input vector z of dimension N × D, which is fed into the depth self-attention backbone network; N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space;
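The patch-embedding operation of step (1) can be illustrated with a short sketch. The following is a minimal example assuming PyTorch; the class name, the p = 4 patch size and the D = 96 embedding dimension follow the worked example further below, while the use of a strided convolution as the per-patch linear map is an implementation choice of this sketch rather than a detail given in the patent.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p patches and project each patch to a D-dimensional vector."""
    def __init__(self, p: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # a p x p convolution with stride p is equivalent to a per-patch linear transform
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, D, H/p, W/p) -> (B, N, D) with N = (H/p) * (W/p)
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)

z = PatchEmbed()(torch.randn(1, 3, 224, 224))  # -> shape (1, 3136, 96), i.e. N x D per image
```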
(2): Two self-attention calculation methods, a multi-head self-attention module (MSA) and a window-based self-attention module (WMSA), are combined in series to form the depth self-attention feature extraction network. The window-based self-attention module (WMSA) performs the self-attention calculation within a local region, so that it focuses more on the local information inside a window. Its mechanism is to divide the image into several sub-images and compute self-attention within each sub-image; when entering the next WMSA stage the sub-images are merged so that the receptive field is enlarged, e.g. the side length of each sub-image is doubled. The WMSA module concatenates the calculation results of the sub-images, giving the same output dimension as the global calculation of the multi-head self-attention module (MSA). The self-attention calculation of WMSA and MSA is as follows:
z_l = WMSA(LN(z_{l-1})),
z_l = MLP(LN(z_l)),
z_{l+1} = MSA(LN(z_l)),
z_{l+1} = MLP(LN(z_{l+1}))
wherein: z_{l-1} denotes the N image blocks with embedded features, of dimension N × D; z_l and z_{l+1} are the output vectors after the self-attention and fully connected layers; LN is the layer normalization operation; MLP denotes a two-layer fully connected network used for the nonlinear transformation; MSA is the multi-head self-attention module; WMSA is the window-based self-attention module, which differs from MSA in that the image is divided into several sub-images on which self-attention is computed separately and the results are concatenated. The self-attention calculation inside MSA and WMSA is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein: Q, K and V are obtained from the input vector through the projection matrices W_q, W_k and W_v respectively, and d_k is the dimension of a single attention head; h groups of self-attention (Attention) heads are defined and their results concatenated to obtain the multi-head self-attention calculation result z_l;
(3): The input vector z is fed into the first three stages of the depth self-attention feature extraction network for calculation; the input dimension is N × D, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space. The depth self-attention feature extraction network consists of four stages, namely three window-based self-attention stages and one global multi-head self-attention stage. The first three stages use the window merging mechanism and WMSA modules, computing self-attention within local regions while enlarging the receptive field of the model and extracting deep local features; the numbers of stacked self-attention blocks in the first three stages are 2, 2 and 4 respectively. Each region merge doubles the width W and height H of a block. After the window-based self-attention calculation of the first three stages, the output vector x_3 is reduced to dimension n × d, where n = N/64 and d = 8D;
(4): The vector x_3 output by the third feature extraction stage is input to the fourth-stage global multi-head self-attention (MSA) module. Window merging is removed in this stage, which contains 4 consecutive self-attention calculation blocks whose calculation is the same as in step (2); the dimension is unchanged by the calculation, and the stage outputs a feature vector x_4 of dimension n × d, where n = N/64 and d = 8D;
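Steps (3) and (4) rely on the region-merging operation that enlarges the receptive field between stages. Under the assumption that merging concatenates 2 × 2 neighbouring blocks and projects them linearly (which is consistent with the stated dimension changes: three merges give n = N/64 tokens of dimension d = 8D), a minimal sketch could look as follows; the class name and layer choices are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn as nn

class RegionMerge(nn.Module):
    """One region-merging step: each 2 x 2 neighbourhood of blocks is concatenated and
    linearly projected, so the number of tokens shrinks by 4 and the channel dimension
    doubles (N -> N/4, D -> 2D per merge)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduce = nn.Linear(4 * dim, 2 * dim)

    def forward(self, z: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, D = z.shape                        # N == H * W
        z = z.view(B, H, W, D)
        # gather the four blocks of every 2 x 2 neighbourhood along the channel axis
        z = torch.cat([z[:, 0::2, 0::2], z[:, 1::2, 0::2],
                       z[:, 0::2, 1::2], z[:, 1::2, 1::2]], dim=-1)
        z = z.view(B, (H // 2) * (W // 2), 4 * D)
        return self.reduce(self.norm(z))         # (B, N/4, 2D)

z = torch.randn(1, 3136, 96)                     # 56 x 56 blocks of dimension 96
z = RegionMerge(96)(z, H=56, W=56)               # -> (1, 784, 192)
```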
(5): The features extracted by the backbone network in step (4) are spatially reshaped by the local feature coding module: the output two-dimensional n × d vector is converted into a three-dimensional w × w × d feature, where w = n^(1/2), restoring the spatial structure of the deep features and giving a three-dimensional feature map;
(6): The local feature coding module intercepts local features from the three-dimensional feature map with square windows; to obtain uniform deep window features, a window is slid over the height and width dimensions of the feature map. The total number of patches N_p after the window finishes sliding is:
N_p = ((H − k)/s + 1) × ((W − k)/s + 1)
wherein: H and W are the height and width of the feature map, k is the side length of the square window, and s is the step size of one slide of the sliding window. To let the deeply encoded features perceive texture variation over different ranges, a multi-scale window interception strategy is designed: windows of different sizes are used to intercept the feature map. Specifically, the windows are set to 2 × 2, 3 × 3 and 5 × 5, and these three windows slide over the feature map to sample it. The resulting deep local feature blocks, which have the same depth but different widths and heights, are input to the texture coding module;
(7): The feature map of step (5), after being cut into blocks by the windows of different scales, is sent to the texture coding module for encoding. The local feature coding module takes the N_p visual descriptors as a set X = {x_1, x_2, ..., x_{N_p}}. A codebook C with K visual-word cluster centers is defined as a learnable parameter of the model, with dimension K × D. For each descriptor x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k. Unlike hard assignment, soft assignment distributes each descriptor over the cluster centers of the codebook through a softmax function. The output vector E encoded with the codebook has dimension K × D and can be expressed as:
E_k = Σ_{i=1}^{N_p} a_ik · r_ik
wherein: a is the soft-assignment weight function over the residuals and can be expressed as:
a_ik = exp(−s_k · ||r_ik||²) / Σ_{m=1}^{K} exp(−s_m · ||r_im||²)
wherein: s is a learnable smoothing factor parameter. This encoding method allows input variables of different dimensions to be encoded into the same K × D dimensional feature space. The feature E output by the coding layer has dimension N × K × D, where N is the number of deep local feature blocks sampled by the multi-scale windows;
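The residual soft-assignment encoding of step (7) can be sketched as a small PyTorch module following the two equations above; the codebook initialization and the parameter names are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualEncoding(nn.Module):
    """Encode any number of D-dimensional local descriptors into a fixed K x D representation."""
    def __init__(self, K: int = 32, D: int = 128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(K, D) * 0.1)  # C: K learnable cluster centers
        self.s = nn.Parameter(torch.ones(K))                   # learnable smoothing factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (Np, D) descriptors; residuals r_ik = x_i - c_k -> (Np, K, D)
        r = x.unsqueeze(1) - self.codebook.unsqueeze(0)
        # soft assignment a_ik = softmax_k( -s_k * ||r_ik||^2 ) -> (Np, K)
        a = F.softmax(-self.s * r.pow(2).sum(dim=-1), dim=1)
        # E_k = sum_i a_ik * r_ik -> (K, D)
        return (a.unsqueeze(-1) * r).sum(dim=0)

E = ResidualEncoding()(torch.randn(36, 128))  # 36 descriptors -> fixed 32 x 128 encoding
```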
(8): The N × K × D encoded features of step (7) are fused: the N groups of K × D features are weighted and summed to obtain a texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein: E_i denotes each encoded vector, N denotes the number of encoded vectors, and w_i is the weight of the corresponding window size;
(9): The fused vector E_fusion is flattened into a one-dimensional vector of length K × D and passed through one fully connected layer, which outputs a one-dimensional vector of dimension nclass, where nclass denotes the number of categories.
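Steps (8) and (9) reduce the N encoded representations to a single classification vector. The following sketch, assuming PyTorch, shows the weighted fusion, flattening and fully connected classification; the shapes, the uniform weights and the number of classes are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def fuse_and_classify(encoded: torch.Tensor, weights: torch.Tensor,
                      classifier: nn.Linear) -> torch.Tensor:
    """Weighted sum of N encoded K x D representations, flattened and classified."""
    # encoded: (N, K, D), weights: (N,)
    e_fusion = (weights.view(-1, 1, 1) * encoded).sum(dim=0)  # -> (K, D)
    return classifier(e_fusion.flatten())                     # -> (nclass,)

encoded = torch.randn(70, 32, 128)        # 70 encoded local feature blocks
weights = torch.full((70,), 1.0 / 70)     # illustrative per-window weights
logits = fuse_and_classify(encoded, weights, nn.Linear(32 * 128, 47))
```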
The invention has the beneficial effects that:
the experimental results based on four data sets of DTD, MINC, FMD and Fabrics show that the proposed network can remarkably improve the classification accuracy compared with the latest model, and the classification accuracy on the four public data sets is all higher than that of the current best method. The effectiveness of the proposed method is shown to be much better than the latest method.
Drawings
Fig. 1 presents an overview of the local feature coding network (PET) based on the depth self-attention network, in which: 101 is the first three window-based self-attention (WMSA) stages of the feature extractor; 102 is the fourth stage of the feature extractor, the global multi-head self-attention module (MSA); 103 is the local feature coding module;
FIG. 2 is a schematic diagram of a window-based multi-head attention mechanism region merging process.
Detailed Description
The invention is further illustrated by the following examples.
Example 1:
first, the proposed backbone network based on the deep self-attention network (Transformer) is trained on the ImageNet training dataset to obtain pre-training weights. The image is first partitioned using texture/material dependent dataset images (DTD, MINC, FMD, Fabrics), see fig. 2, with an initial sub-block size of 4 x4 pixels, and then embedded by featuresIn-layer maps the image blocks to dimension 96, so that the overall image input dimension is 3136 × 96; the vector is input to a feature extractor in the PET network. Referring to the first three window-based self-attention modules (WMSA)101 of the feature extractor in fig. 1, the first three window-based self-attention modules (WSMA) of the feature extractor are first used to perform window merging and local self-attention calculation to obtain a feature x3The output dimension is 49 x 768. Please refer to the fourth stage of the feature extractor in FIG. 1, the Global Multi-headed self attention Module (MSA)102, x, which is the output of the first three window-based self attention calculation modules3The feature vector is input to the fourth stage of the feature extractor, and the global multi-head self-attention Module (MSA) performs global self-attention calculation to obtain x4 with dimension 49 × 768.
Referring to the local feature coding module 103 in Fig. 1, the two-dimensional feature x_4 output by the feature extraction layers is first reshaped into a three-dimensional tensor of dimension 7 × 7 × 768, restoring the spatial structure of the deep features and giving a three-dimensional feature map. Square windows are then used to intercept deep local features from the three-dimensional feature map. Local features are extracted with square windows of size 2 × 2, 3 × 3 and 5 × 5; to obtain uniform deep window features, each window slides over the height and width dimensions of the feature map, giving three groups of deep local features at different scales containing 36, 25 and 9 local feature blocks respectively, 70 in total. The 70 deep local feature blocks of different scales are then input to the coding layer. A learnable dictionary C is first defined as the codebook for residual encoding; the dimension of C is 32 × 128, where 32 is the number of cluster words and 128 is the dimension of each cluster center. For each deep local feature block x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k. Unlike hard assignment, soft assignment distributes each descriptor over the codewords through a softmax function, and the cluster-center codebook is learned with a learnable smoothing factor. The 70 feature blocks encoded with the codebook yield 70 fixed-size representations of dimension 32 × 128. These 70 encoded features of dimension 32 × 128 are then weighted and summed to obtain the texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein E_i denotes each encoded 32 × 128 vector, N denotes the number of encoded vectors, here 70, and w_i is the weight of the corresponding window size; in this example the 2 × 2, 3 × 3 and 5 × 5 windows are given the weights 0.35, 0.45 and 0.2 respectively. The fused vector E_fusion is flattened into a one-dimensional vector of length 32 × 128 and passed through one fully connected layer, so that the final output vector has the number of categories as its dimension. The network is trained with SGD as the optimizer, an input image size of 224 × 224 and a training batch size of 64. The learning rate starts at 0.004 and is divided by 10 when the error plateaus; the weight decay is set to 0.0001 and the momentum to 0.9.
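The optimization settings quoted above can be summarized in a short sketch, assuming PyTorch and interpreting the attenuation rate as weight decay; the stand-in model below is a placeholder rather than the PET network itself.

```python
import torch
import torch.nn as nn

model = nn.Linear(32 * 128, 47)  # placeholder for the PET network
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.9, weight_decay=0.0001)
# divide the learning rate by 10 whenever the validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)
# after each training epoch: scheduler.step(validation_error)
```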
The experimental results of the PET network of this example compared with recent methods on the DTD, MINC-2500, FMD and Fabrics datasets are shown in Table 1:
table 1:
[Table 1 appears as an image in the original publication and is not reproduced here.]
table 2 shows that for the ablation experiment of local feature coding module (PE) in PET, the sizes of various fixed blocks and the current most advanced coding method are compared with those of our method (PE) (DTD, MINC data set), and in order to ensure fairness, the backbone network adopts a residual network (ResNet50) with 50 layers;
table 2:
[Table 2 appears as an image in the original publication and is not reproduced here.]
table 3 is an ablation experiment for backbone networks in PET, compared to other currently widely used backbone networks (DTD, MINC dataset);
table 3:
[Table 3 appears as an image in the original publication and is not reproduced here.]

Claims (1)

1. the texture recognition method based on the depth self-attention network and the local feature coding is characterized by comprising the following specific steps of:
(1): Given an input image, the image is normalized and standardized and then divided into blocks, each of size p × p × 3; each image block is linearly transformed into a one-dimensional vector of dimension D, yielding an input vector z of dimension N × D, which is fed into the depth self-attention backbone network, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space;
(2): Two self-attention calculation methods, a multi-head self-attention module (MSA) and a window-based self-attention module (WMSA), are combined in series to form the depth self-attention feature extraction network; the window-based self-attention module (WMSA) performs the self-attention calculation within a local region, so that it focuses more on the local information inside a window; its mechanism is to divide the image into several sub-images and compute self-attention within each sub-image; when entering the next WMSA stage the sub-images are merged so that the receptive field is enlarged, e.g. the side length of each sub-image is doubled; the WMSA module concatenates the calculation results of the sub-images, giving the same output dimension as the global calculation of the multi-head self-attention module (MSA); the self-attention calculation of WMSA and MSA is as follows:
z_l = WMSA(LN(z_{l-1})),
z_l = MLP(LN(z_l)),
z_{l+1} = MSA(LN(z_l)),
z_{l+1} = MLP(LN(z_{l+1}))
wherein: z_{l-1} denotes the N image blocks with embedded features, of dimension N × D; z_l and z_{l+1} are the output vectors after the self-attention and fully connected layers; LN is the layer normalization operation; MLP denotes a two-layer fully connected network used for the nonlinear transformation; MSA is the multi-head self-attention module; WMSA is the window-based self-attention module, which differs from MSA in that the image is divided into several sub-images on which self-attention is computed separately and the results are concatenated; the self-attention calculation inside MSA and WMSA is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein: Q, K and V are obtained from the input vector through the projection matrices W_q, W_k and W_v respectively, and d_k is the dimension of a single attention head; h groups of self-attention (Attention) heads are defined and their results concatenated to obtain the multi-head self-attention calculation result z_l;
(3): The input vector z is fed into the first three stages of the depth self-attention feature extraction network for calculation; the input dimension is N × D, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space; the depth self-attention feature extraction network consists of four stages, namely three window-based self-attention stages and one global multi-head self-attention stage; the first three stages use the window merging mechanism and WMSA modules, computing self-attention within local regions while enlarging the receptive field of the model and extracting deep local features, the numbers of stacked self-attention blocks in the first three stages being 2, 2 and 4 respectively; each region merge doubles the width W and height H of a block; after the window-based self-attention calculation of the first three stages, the output vector x_3 is reduced to dimension n × d, where n = N/64 and d = 8D;
(4): The vector x_3 output by the third feature extraction stage is input to the fourth-stage global multi-head self-attention (MSA) module; window merging is removed in this stage, which contains 4 consecutive self-attention calculation blocks whose calculation is the same as in step (2); the dimension is unchanged by the calculation, and the stage outputs a feature vector x_4 of dimension n × d, where n = N/64 and d = 8D;
(5): The features extracted by the backbone network in step (4) are spatially reshaped by the local feature coding module: the output two-dimensional n × d vector is converted into a three-dimensional w × w × d feature, where w = n^(1/2), restoring the spatial structure of the deep features and giving a three-dimensional feature map;
(6): The local feature coding module intercepts local features from the three-dimensional feature map with square windows; to obtain uniform deep window features, a window is slid over the height and width dimensions of the feature map; the total number of patches N_p after the window finishes sliding is:
N_p = ((H − k)/s + 1) × ((W − k)/s + 1)
wherein: H and W are the height and width of the feature map, k is the side length of the square window, and s is the step size of one slide of the sliding window; to let the deeply encoded features perceive texture variation over different ranges, a multi-scale window interception strategy is designed: windows of different sizes are used to intercept the feature map, specifically the windows are set to 2 × 2, 3 × 3 and 5 × 5, and these three windows slide over the feature map to sample it; the resulting deep local feature blocks, which have the same depth but different widths and heights, are input to the texture coding module;
(7): The feature map of step (5), after being cut into blocks by the windows of different scales, is sent to the texture coding module for encoding; the local feature coding module takes the N_p visual descriptors as a set
X = {x_1, x_2, ..., x_{N_p}}
A codebook C with K visual-word cluster centers is defined as a learnable parameter of the model, with dimension K × D; for each descriptor x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k; unlike hard assignment, soft assignment distributes each descriptor over the codewords through a softmax function, and the cluster-center codebook is learned with a learnable smoothing factor; the output vector E encoded with the codebook has dimension K × D and can be expressed as:
E_k = Σ_{i=1}^{N_p} a_ik · r_ik
wherein: a is the soft-assignment weight function over the residuals and can be expressed as:
a_ik = exp(−s_k · ||r_ik||²) / Σ_{m=1}^{K} exp(−s_m · ||r_im||²)
wherein: s is a learnable smoothing factor; this encoding method allows input variables of different dimensions to be encoded into the same K × D dimensional feature space; the feature E output by the coding layer has dimension N × K × D, where N is the number of deep local feature blocks sampled by the multi-scale windows;
(8): The N × K × D encoded features of step (7) are fused: the N groups of K × D features are weighted and summed to obtain a texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein: E_i denotes each encoded vector, N denotes the number of encoded vectors, and w_i is the weight of the corresponding window size;
(9): The fused vector E_fusion is flattened into a one-dimensional vector of length K × D and passed through one fully connected layer, which outputs a one-dimensional vector of dimension nclass, where nclass denotes the number of categories.
CN202110760949.1A 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding Active CN113674334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110760949.1A CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110760949.1A CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Publications (2)

Publication Number Publication Date
CN113674334A true CN113674334A (en) 2021-11-19
CN113674334B CN113674334B (en) 2023-04-18

Family

ID=78538860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110760949.1A Active CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Country Status (1)

Country Link
CN (1) CN113674334B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks
CN114220012A (en) * 2021-12-16 2022-03-22 池明旻 Textile cotton and linen identification method based on deep self-attention network
CN114418030A (en) * 2022-01-27 2022-04-29 腾讯科技(深圳)有限公司 Image classification method, and training method and device of image classification model
CN114627006A (en) * 2022-02-28 2022-06-14 复旦大学 Progressive image restoration method based on depth decoupling network
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN116070172A (en) * 2022-11-16 2023-05-05 北京理工大学 Method for enhancing characteristic expression of time series
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902692A (en) * 2019-01-14 2019-06-18 北京工商大学 A kind of image classification method based on regional area depth characteristic coding
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
US20200250398A1 (en) * 2019-02-01 2020-08-06 Owkin Inc. Systems and methods for image classification
CN111523462A (en) * 2020-04-22 2020-08-11 南京工程学院 Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112819039A (en) * 2021-01-14 2021-05-18 华中科技大学 Texture recognition model establishing method based on multi-scale integrated feature coding and application
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
CN109902692A (en) * 2019-01-14 2019-06-18 北京工商大学 A kind of image classification method based on regional area depth characteristic coding
US20200250398A1 (en) * 2019-02-01 2020-08-06 Owkin Inc. Systems and methods for image classification
CN111523462A (en) * 2020-04-22 2020-08-11 南京工程学院 Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112819039A (en) * 2021-01-14 2021-05-18 华中科技大学 Texture recognition model establishing method based on multi-scale integrated feature coding and application
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220012A (en) * 2021-12-16 2022-03-22 池明旻 Textile cotton and linen identification method based on deep self-attention network
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks
CN114418030A (en) * 2022-01-27 2022-04-29 腾讯科技(深圳)有限公司 Image classification method, and training method and device of image classification model
CN114418030B (en) * 2022-01-27 2024-04-23 腾讯科技(深圳)有限公司 Image classification method, training method and device for image classification model
CN114627006A (en) * 2022-02-28 2022-06-14 复旦大学 Progressive image restoration method based on depth decoupling network
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115409855B (en) * 2022-09-20 2023-07-07 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN116070172A (en) * 2022-11-16 2023-05-05 北京理工大学 Method for enhancing characteristic expression of time series
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism
CN116543146B (en) * 2023-07-06 2023-09-26 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Also Published As

Publication number Publication date
CN113674334B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113674334B (en) Texture recognition method based on depth self-attention network and local feature coding
CN107330127B (en) Similar text detection method based on text picture retrieval
Thai et al. Image classification using support vector machine and artificial neural network
CN109635744A (en) A kind of method for detecting lane lines based on depth segmentation network
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN103927531A (en) Human face recognition method based on local binary value and PSO BP neural network
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN110414616B (en) Remote sensing image dictionary learning and classifying method utilizing spatial relationship
CN111652273B (en) Deep learning-based RGB-D image classification method
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN108734199A (en) High spectrum image robust classification method based on segmentation depth characteristic and low-rank representation
CN109002771B (en) Remote sensing image classification method based on recurrent neural network
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN112926533A (en) Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
Sethy et al. Off-line Odia handwritten numeral recognition using neural network: a comparative analysis
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
CN115775350A (en) Image enhancement method and device and computing equipment
CN114818889A (en) Image classification method based on linear self-attention transducer
CN114492581A (en) Method for classifying small sample pictures based on transfer learning and attention mechanism element learning application
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN109558819B (en) Depth network lightweight method for remote sensing image target detection
CN111401434A (en) Image classification method based on unsupervised feature learning
Masruroh et al. Deep Convolutional Neural Networks Transfer Learning Comparison on Arabic Handwriting Recognition System
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant