CN113674334A - Texture recognition method based on depth self-attention network and local feature coding - Google Patents

Texture recognition method based on depth self-attention network and local feature coding Download PDF

Info

Publication number
CN113674334A
CN113674334A
Authority
CN
China
Prior art keywords
attention
self
window
local
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110760949.1A
Other languages
Chinese (zh)
Other versions
CN113674334B (en)
Inventor
Peng Bo (彭博)
The other inventors have requested that their names not be disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202110760949.1A priority Critical patent/CN113674334B/en
Publication of CN113674334A publication Critical patent/CN113674334A/en
Application granted granted Critical
Publication of CN113674334B publication Critical patent/CN113674334B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G06T7/41 Analysis of texture based on statistical description of texture
    • G06T7/44 Analysis of texture based on statistical description of texture using image operators, e.g. filters, edge density metrics or local histograms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/002 Image coding using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention relates to a texture recognition method based on a depth self-attention network and local feature coding. A depth self-attention module with four stages is designed according to the characteristics of texture images: in the first three stages, local image blocks are merged to enlarge the receptive field, and the self-attention calculation is restricted to local windows of fixed size; in the last stage, block merging is removed and self-attention is computed globally, capturing the relations between local image blocks. In this way, the texture features of local regions are extracted well while the global features are preserved. The proposed PET network fully exploits the texture information of local regions in the image: the two-dimensional features output by the backbone network are reshaped into a three-dimensional feature map, block descriptors of multiple scales are densely sampled from the feature map with sliding windows to obtain a set of multi-scale local representations, and finally the multi-scale block features are locally encoded and fused to generate a fixed-size texture representation for the final classification.

Description

Texture recognition method based on depth self-attention network and local feature coding
Technical Field
The invention belongs to the technical field of texture classification and material classification, and particularly relates to a texture recognition method based on a depth self-attention network and local feature coding.
Background
In classical texture recognition methods based on the bag-of-words model, features are extracted with hand-crafted descriptors (such as GLCM, LBP, LPQ), each descriptor is assigned to the nearest visual word in a codebook, and classification is performed by counting the occurrence frequency of visual words or aggregating residuals. With the rapid development of deep learning, convolutional neural networks (CNN) are now widely used in place of hand-crafted feature extraction, followed by a texture encoding strategy for the final texture classification.
Most existing methods, such as FV-CNN (1), DeepTEN (2), DEP-Net (3) and LSCTN (4), typically perform texture encoding on the global features extracted by a CNN. In a texture image, the pixel arrangement and variation pattern of the whole image are often the same as those of a local region, so local regions carry strong texture-discriminative information. Existing global encoding methods usually rely on a CNN for feature extraction and neglect to combine local features in the texture encoding, which limits texture recognition performance.
The defects of the prior art are as follows:
(1) Classical texture recognition methods usually depend on image preprocessing, hand-crafted feature extraction and bag-of-words models; their low recognition performance cannot meet current requirements, and they are not optimized within a deep learning framework;
(2) Existing deep learning methods of this kind generally use a deep convolutional network (CNN) for feature extraction. Although CNNs have demonstrated strong feature-capturing ability on object-centric images, their ability to extract texture features is limited. Moreover, in texture images the local regions carry strong texture-discriminative information, and existing methods neglect to combine local features in the texture encoding, which restricts the recognition ability of the model on texture data.
References:
(1): M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3836, 2015;
Technical comparison: the article proposes using the Fisher Vector (FV) as an encoding layer to obtain an orderless representation, but the CNN and the Fisher Vector encoding layer are trained independently and do not form an end-to-end structure;
(2): Hang Zhang, Jia Xue, and Kristin Dana. 2017. Deep TEN: Texture encoding network. In IEEE Conference on Computer Vision and Pattern Recognition, 708–717;
Technical comparison: the article proposes integrating feature extraction and dictionary encoding into an end-to-end model; it takes the correlation between visual words and assignments into account during dictionary learning and improves the VLAD scheme, but it does not consider local features or multi-scale feature encoding;
(3): Jia Xue, Hang Zhang, and Kristin Dana. 2018. Deep texture manifold for ground terrain recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 558–567;
Technical comparison: this work points out that the texture of a material surface is not completely disordered and that the ordered information of local space is also crucial for texture recognition. The orderless information obtained by dictionary encoding and the ordered information obtained by an ordered pooling layer are fused through a bilinear model, but multi-scale features and per-local-region feature fusion are not considered;
(4): Xingyuan Bu, Yuwei Wu, Zhi Gao, and Yunde Jia. 2019. Deep convolutional network with locality and sparsity constraints for texture classification. Pattern Recognition 91 (2019), 34–46;
Technical comparison: this article proposes an encoding layer with locality constraints, in which the dictionary and the encoded representation are learned simultaneously, but the method does not take into account the significant effect of local features on texture classification.
Disclosure of Invention
The invention aims to provide a texture recognition method based on a depth self-attention network and local feature coding.
The invention provides a local feature coding (PET) network based on a depth self-attention network: a backbone based on the depth self-attention network (Transformer) is designed to replace the convolutional neural network (CNN) for feature extraction. A depth self-attention module with four stages is designed according to the characteristics of texture images: in the first three stages, local image blocks are merged to enlarge the receptive field, and the self-attention calculation is restricted to local windows of fixed size; in the last stage, block merging is removed and self-attention is computed globally, capturing the relations between local image blocks. In this way, the texture features of local regions are extracted well while the global features are preserved.
Through the PET network, which fully exploits the texture information of local regions in the image, the invention provides a local feature encoding method: the two-dimensional features output by the backbone network are reshaped into a three-dimensional feature map; block descriptors of multiple scales are densely sampled from the feature map with sliding windows to obtain a set of multi-scale local representations; finally, the multi-scale block features are locally encoded and fused to generate a fixed-size texture representation for the final classification.
The invention provides a texture recognition method based on a depth self-attention network and local feature coding, which comprises the following steps:
(1): Given an input image, the image is normalized and standardized and then divided into blocks, each of size p × p × 3. Each image block is linearly transformed into a one-dimensional vector of dimension D, yielding an input vector z of dimension N × D, which is fed into the depth self-attention backbone network; N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space;
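The patch-embedding operation of step (1) can be illustrated with a short sketch. The following is a minimal example assuming PyTorch; the class name, the p = 4 patch size and the D = 96 embedding dimension follow the worked example further below, while the use of a strided convolution as the per-patch linear map is an implementation choice of this sketch rather than a detail given in the patent.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into p x p patches and project each patch to a D-dimensional vector."""
    def __init__(self, p: int = 4, in_chans: int = 3, embed_dim: int = 96):
        super().__init__()
        # a p x p convolution with stride p is equivalent to a per-patch linear transform
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) -> (B, D, H/p, W/p) -> (B, N, D) with N = (H/p) * (W/p)
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)

z = PatchEmbed()(torch.randn(1, 3, 224, 224))  # -> shape (1, 3136, 96), i.e. N x D per image
```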
(2): Two self-attention calculation methods, a multi-head self-attention module (MSA) and a window-based self-attention module (WMSA), are combined in series to form the depth self-attention feature extraction network. The window-based self-attention module (WMSA) performs the self-attention calculation within a local region, so that it focuses more on the local information inside a window. Its mechanism is to divide the image into several sub-images and compute self-attention within each sub-image; when entering the next WMSA stage the sub-images are merged so that the receptive field is enlarged, e.g. the side length of each sub-image is doubled. The WMSA module concatenates the calculation results of the sub-images, giving the same output dimension as the global calculation of the multi-head self-attention module (MSA). The self-attention calculation of WMSA and MSA is as follows:
z_l = WMSA(LN(z_{l-1})),
z_l = MLP(LN(z_l)),
z_{l+1} = MSA(LN(z_l)),
z_{l+1} = MLP(LN(z_{l+1}))
wherein: z_{l-1} denotes the N image blocks with embedded features, of dimension N × D; z_l and z_{l+1} are the output vectors after the self-attention and fully connected layers; LN is the layer normalization operation; MLP denotes a two-layer fully connected network used for the nonlinear transformation; MSA is the multi-head self-attention module; WMSA is the window-based self-attention module, which differs from MSA in that the image is divided into several sub-images on which self-attention is computed separately and the results are concatenated. The self-attention calculation inside MSA and WMSA is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein: Q, K and V are obtained from the input vector through the projection matrices W_q, W_k and W_v respectively, and d_k is the dimension of a single attention head; h groups of self-attention (Attention) heads are defined and their results concatenated to obtain the multi-head self-attention calculation result z_l;
(3): The input vector z is fed into the first three stages of the depth self-attention feature extraction network for calculation; the input dimension is N × D, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space. The depth self-attention feature extraction network consists of four stages, namely three window-based self-attention stages and one global multi-head self-attention stage. The first three stages use the window merging mechanism and WMSA modules, computing self-attention within local regions while enlarging the receptive field of the model and extracting deep local features; the numbers of stacked self-attention blocks in the first three stages are 2, 2 and 4 respectively. Each region merge doubles the width W and height H of a block. After the window-based self-attention calculation of the first three stages, the output vector x_3 is reduced to dimension n × d, where n = N/64 and d = 8D;
(4): The vector x_3 output by the third feature extraction stage is input to the fourth-stage global multi-head self-attention (MSA) module. Window merging is removed in this stage, which contains 4 consecutive self-attention calculation blocks whose calculation is the same as in step (2); the dimension is unchanged by the calculation, and the stage outputs a feature vector x_4 of dimension n × d, where n = N/64 and d = 8D;
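Steps (3) and (4) rely on the region-merging operation that enlarges the receptive field between stages. Under the assumption that merging concatenates 2 × 2 neighbouring blocks and projects them linearly (which is consistent with the stated dimension changes: three merges give n = N/64 tokens of dimension d = 8D), a minimal sketch could look as follows; the class name and layer choices are illustrative, not prescribed by the patent.

```python
import torch
import torch.nn as nn

class RegionMerge(nn.Module):
    """One region-merging step: each 2 x 2 neighbourhood of blocks is concatenated and
    linearly projected, so the number of tokens shrinks by 4 and the channel dimension
    doubles (N -> N/4, D -> 2D per merge)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduce = nn.Linear(4 * dim, 2 * dim)

    def forward(self, z: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, D = z.shape                        # N == H * W
        z = z.view(B, H, W, D)
        # gather the four blocks of every 2 x 2 neighbourhood along the channel axis
        z = torch.cat([z[:, 0::2, 0::2], z[:, 1::2, 0::2],
                       z[:, 0::2, 1::2], z[:, 1::2, 1::2]], dim=-1)
        z = z.view(B, (H // 2) * (W // 2), 4 * D)
        return self.reduce(self.norm(z))         # (B, N/4, 2D)

z = torch.randn(1, 3136, 96)                     # 56 x 56 blocks of dimension 96
z = RegionMerge(96)(z, H=56, W=56)               # -> (1, 784, 192)
```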
(5): The features extracted by the backbone network in step (4) are spatially reshaped by the local feature coding module: the output two-dimensional n × d vector is converted into a three-dimensional w × w × d feature, where w = n^(1/2), restoring the spatial structure of the deep features and giving a three-dimensional feature map;
(6): The local feature coding module intercepts local features from the three-dimensional feature map with square windows; to obtain uniform deep window features, a window is slid over the height and width dimensions of the feature map. The total number of patches N_p after the window finishes sliding is:
N_p = ((H − k)/s + 1) × ((W − k)/s + 1)
wherein: H and W are the height and width of the feature map, k is the side length of the square window, and s is the step size of one slide of the sliding window. To let the deeply encoded features perceive texture variation over different ranges, a multi-scale window interception strategy is designed: windows of different sizes are used to intercept the feature map. Specifically, the windows are set to 2 × 2, 3 × 3 and 5 × 5, and these three windows slide over the feature map to sample it. The resulting deep local feature blocks, which have the same depth but different widths and heights, are input to the texture coding module;
(7): The feature map of step (5), after being cut into blocks by the windows of different scales, is sent to the texture coding module for encoding. The local feature coding module takes the N_p visual descriptors as a set X = {x_1, x_2, ..., x_{N_p}}. A codebook C with K visual-word cluster centers is defined as a learnable parameter of the model, with dimension K × D. For each descriptor x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k. Unlike hard assignment, soft assignment distributes each descriptor over the cluster centers of the codebook through a softmax function. The output vector E encoded with the codebook has dimension K × D and can be expressed as:
E_k = Σ_{i=1}^{N_p} a_ik · r_ik
wherein: a is the soft-assignment weight function over the residuals and can be expressed as:
a_ik = exp(−s_k · ||r_ik||²) / Σ_{m=1}^{K} exp(−s_m · ||r_im||²)
wherein: s is a learnable smoothing factor parameter. This encoding method allows input variables of different dimensions to be encoded into the same K × D dimensional feature space. The feature E output by the coding layer has dimension N × K × D, where N is the number of deep local feature blocks sampled by the multi-scale windows;
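The residual soft-assignment encoding of step (7) can be sketched as a small PyTorch module following the two equations above; the codebook initialization and the parameter names are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualEncoding(nn.Module):
    """Encode any number of D-dimensional local descriptors into a fixed K x D representation."""
    def __init__(self, K: int = 32, D: int = 128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(K, D) * 0.1)  # C: K learnable cluster centers
        self.s = nn.Parameter(torch.ones(K))                   # learnable smoothing factors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (Np, D) descriptors; residuals r_ik = x_i - c_k -> (Np, K, D)
        r = x.unsqueeze(1) - self.codebook.unsqueeze(0)
        # soft assignment a_ik = softmax_k( -s_k * ||r_ik||^2 ) -> (Np, K)
        a = F.softmax(-self.s * r.pow(2).sum(dim=-1), dim=1)
        # E_k = sum_i a_ik * r_ik -> (K, D)
        return (a.unsqueeze(-1) * r).sum(dim=0)

E = ResidualEncoding()(torch.randn(36, 128))  # 36 descriptors -> fixed 32 x 128 encoding
```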
(8): The N × K × D encoded features of step (7) are fused: the N groups of K × D features are weighted and summed to obtain a texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein: E_i denotes each encoded vector, N denotes the number of encoded vectors, and w_i is the weight of the corresponding window size;
(9): The fused vector E_fusion is flattened into a one-dimensional vector of length K × D and passed through one fully connected layer, which outputs a one-dimensional vector of dimension nclass, where nclass denotes the number of categories.
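Steps (8) and (9) reduce the N encoded representations to a single classification vector. The following sketch, assuming PyTorch, shows the weighted fusion, flattening and fully connected classification; the shapes, the uniform weights and the number of classes are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def fuse_and_classify(encoded: torch.Tensor, weights: torch.Tensor,
                      classifier: nn.Linear) -> torch.Tensor:
    """Weighted sum of N encoded K x D representations, flattened and classified."""
    # encoded: (N, K, D), weights: (N,)
    e_fusion = (weights.view(-1, 1, 1) * encoded).sum(dim=0)  # -> (K, D)
    return classifier(e_fusion.flatten())                     # -> (nclass,)

encoded = torch.randn(70, 32, 128)        # 70 encoded local feature blocks
weights = torch.full((70,), 1.0 / 70)     # illustrative per-window weights
logits = fuse_and_classify(encoded, weights, nn.Linear(32 * 128, 47))
```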
The invention has the beneficial effects that:
the experimental results based on four data sets of DTD, MINC, FMD and Fabrics show that the proposed network can remarkably improve the classification accuracy compared with the latest model, and the classification accuracy on the four public data sets is all higher than that of the current best method. The effectiveness of the proposed method is shown to be much better than the latest method.
Drawings
Fig. 1 presents an overview of the local feature coding network (PET) based on the depth self-attention network, in which: 101 is the first three window-based self-attention (WMSA) stages of the feature extractor; 102 is the fourth stage of the feature extractor, the global multi-head self-attention module (MSA); 103 is the local feature coding module;
FIG. 2 is a schematic diagram of a window-based multi-head attention mechanism region merging process.
Detailed Description
The invention is further illustrated by the following examples.
Example 1:
first, the proposed backbone network based on the deep self-attention network (Transformer) is trained on the ImageNet training dataset to obtain pre-training weights. The image is first partitioned using texture/material dependent dataset images (DTD, MINC, FMD, Fabrics), see fig. 2, with an initial sub-block size of 4 x4 pixels, and then embedded by featuresIn-layer maps the image blocks to dimension 96, so that the overall image input dimension is 3136 × 96; the vector is input to a feature extractor in the PET network. Referring to the first three window-based self-attention modules (WMSA)101 of the feature extractor in fig. 1, the first three window-based self-attention modules (WSMA) of the feature extractor are first used to perform window merging and local self-attention calculation to obtain a feature x3The output dimension is 49 x 768. Please refer to the fourth stage of the feature extractor in FIG. 1, the Global Multi-headed self attention Module (MSA)102, x, which is the output of the first three window-based self attention calculation modules3The feature vector is input to the fourth stage of the feature extractor, and the global multi-head self-attention Module (MSA) performs global self-attention calculation to obtain x4 with dimension 49 × 768.
Referring to the local feature coding module 103 in Fig. 1, the two-dimensional feature x_4 output by the feature extraction layers is first reshaped into a three-dimensional tensor of dimension 7 × 7 × 768, restoring the spatial structure of the deep features and giving a three-dimensional feature map. Square windows are then used to intercept deep local features from the three-dimensional feature map. Local features are extracted with square windows of size 2 × 2, 3 × 3 and 5 × 5; to obtain uniform deep window features, each window slides over the height and width dimensions of the feature map, giving three groups of deep local features at different scales containing 36, 25 and 9 local feature blocks respectively, 70 in total. The 70 deep local feature blocks of different scales are then input to the coding layer. A learnable dictionary C is first defined as the codebook for residual encoding; the dimension of C is 32 × 128, where 32 is the number of cluster words and 128 is the dimension of each cluster center. For each deep local feature block x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k. Unlike hard assignment, soft assignment distributes each descriptor over the codewords through a softmax function, and the cluster-center codebook is learned with a learnable smoothing factor. The 70 feature blocks encoded with the codebook yield 70 fixed-size representations of dimension 32 × 128. These 70 encoded features of dimension 32 × 128 are then weighted and summed to obtain the texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein E_i denotes each encoded 32 × 128 vector, N denotes the number of encoded vectors, here 70, and w_i is the weight of the corresponding window size; in this example the 2 × 2, 3 × 3 and 5 × 5 windows are given the weights 0.35, 0.45 and 0.2 respectively. The fused vector E_fusion is flattened into a one-dimensional vector of length 32 × 128 and passed through one fully connected layer, so that the final output vector has the number of categories as its dimension. The network is trained with SGD as the optimizer, an input image size of 224 × 224 and a training batch size of 64. The learning rate starts at 0.004 and is divided by 10 when the error plateaus; the weight decay is set to 0.0001 and the momentum to 0.9.
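The optimization settings quoted above can be summarized in a short sketch, assuming PyTorch and interpreting the attenuation rate as weight decay; the stand-in model below is a placeholder rather than the PET network itself.

```python
import torch
import torch.nn as nn

model = nn.Linear(32 * 128, 47)  # placeholder for the PET network
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,
                            momentum=0.9, weight_decay=0.0001)
# divide the learning rate by 10 whenever the validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1)
# after each training epoch: scheduler.step(validation_error)
```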
The experimental results of the PET network of this example compared with recent methods on the DTD, MINC-2500, FMD and Fabrics datasets are shown in Table 1:
table 1:
[Table 1 appears as an image in the original publication and is not reproduced here.]
table 2 shows that for the ablation experiment of local feature coding module (PE) in PET, the sizes of various fixed blocks and the current most advanced coding method are compared with those of our method (PE) (DTD, MINC data set), and in order to ensure fairness, the backbone network adopts a residual network (ResNet50) with 50 layers;
table 2:
[Table 2 appears as an image in the original publication and is not reproduced here.]
table 3 is an ablation experiment for backbone networks in PET, compared to other currently widely used backbone networks (DTD, MINC dataset);
table 3:
[Table 3 appears as an image in the original publication and is not reproduced here.]

Claims (1)

1. the texture recognition method based on the depth self-attention network and the local feature coding is characterized by comprising the following specific steps of:
(1): Given an input image, the image is normalized and standardized and then divided into blocks, each of size p × p × 3; each image block is linearly transformed into a one-dimensional vector of dimension D, yielding an input vector z of dimension N × D, which is fed into the depth self-attention backbone network, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space;
(2): Two self-attention calculation methods, a multi-head self-attention module (MSA) and a window-based self-attention module (WMSA), are combined in series to form the depth self-attention feature extraction network; the window-based self-attention module (WMSA) performs the self-attention calculation within a local region, so that it focuses more on the local information inside a window; its mechanism is to divide the image into several sub-images and compute self-attention within each sub-image; when entering the next WMSA stage the sub-images are merged so that the receptive field is enlarged, e.g. the side length of each sub-image is doubled; the WMSA module concatenates the calculation results of the sub-images, giving the same output dimension as the global calculation of the multi-head self-attention module (MSA); the self-attention calculation of WMSA and MSA is as follows:
z_l = WMSA(LN(z_{l-1})),
z_l = MLP(LN(z_l)),
z_{l+1} = MSA(LN(z_l)),
z_{l+1} = MLP(LN(z_{l+1}))
wherein: z_{l-1} denotes the N image blocks with embedded features, of dimension N × D; z_l and z_{l+1} are the output vectors after the self-attention and fully connected layers; LN is the layer normalization operation; MLP denotes a two-layer fully connected network used for the nonlinear transformation; MSA is the multi-head self-attention module; WMSA is the window-based self-attention module, which differs from MSA in that the image is divided into several sub-images on which self-attention is computed separately and the results are concatenated; the self-attention calculation inside MSA and WMSA is as follows:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
wherein: Q, K and V are obtained from the input vector through the projection matrices W_q, W_k and W_v respectively, and d_k is the dimension of a single attention head; h groups of self-attention (Attention) heads are defined and their results concatenated to obtain the multi-head self-attention calculation result z_l;
(3): The input vector z is fed into the first three stages of the depth self-attention feature extraction network for calculation; the input dimension is N × D, where N is the number of initially divided windows and D is the dimension of the embedding layer that maps the image into the one-dimensional space; the depth self-attention feature extraction network consists of four stages, namely three window-based self-attention stages and one global multi-head self-attention stage; the first three stages use the window merging mechanism and WMSA modules, computing self-attention within local regions while enlarging the receptive field of the model and extracting deep local features, the numbers of stacked self-attention blocks in the first three stages being 2, 2 and 4 respectively; each region merge doubles the width W and height H of a block; after the window-based self-attention calculation of the first three stages, the output vector x_3 is reduced to dimension n × d, where n = N/64 and d = 8D;
(4): The vector x_3 output by the third feature extraction stage is input to the fourth-stage global multi-head self-attention (MSA) module; window merging is removed in this stage, which contains 4 consecutive self-attention calculation blocks whose calculation is the same as in step (2); the dimension is unchanged by the calculation, and the stage outputs a feature vector x_4 of dimension n × d, where n = N/64 and d = 8D;
(5): The features extracted by the backbone network in step (4) are spatially reshaped by the local feature coding module: the output two-dimensional n × d vector is converted into a three-dimensional w × w × d feature, where w = n^(1/2), restoring the spatial structure of the deep features and giving a three-dimensional feature map;
(6): The local feature coding module intercepts local features from the three-dimensional feature map with square windows; to obtain uniform deep window features, a window is slid over the height and width dimensions of the feature map; the total number of patches N_p after the window finishes sliding is:
N_p = ((H − k)/s + 1) × ((W − k)/s + 1)
wherein: H and W are the height and width of the feature map, k is the side length of the square window, and s is the step size of one slide of the sliding window; to let the deeply encoded features perceive texture variation over different ranges, a multi-scale window interception strategy is designed: windows of different sizes are used to intercept the feature map, specifically the windows are set to 2 × 2, 3 × 3 and 5 × 5, and these three windows slide over the feature map to sample it; the resulting deep local feature blocks, which have the same depth but different widths and heights, are input to the texture coding module;
(7): The feature map of step (5), after being cut into blocks by the windows of different scales, is sent to the texture coding module for encoding; the local feature coding module takes the N_p visual descriptors as a set
X = {x_1, x_2, ..., x_{N_p}}
A codebook C with K visual-word cluster centers is defined as a learnable parameter of the model, with dimension K × D; for each descriptor x_i, the residual vector with respect to the k-th cluster center c_k of the dictionary C is r_ik = x_i − c_k; unlike hard assignment, soft assignment distributes each descriptor over the codewords through a softmax function, and the cluster-center codebook is learned with a learnable smoothing factor; the output vector E encoded with the codebook has dimension K × D and can be expressed as:
E_k = Σ_{i=1}^{N_p} a_ik · r_ik
wherein: a is the soft-assignment weight function over the residuals and can be expressed as:
a_ik = exp(−s_k · ||r_ik||²) / Σ_{m=1}^{K} exp(−s_m · ||r_im||²)
wherein: s is a learnable smoothing factor; this encoding method allows input variables of different dimensions to be encoded into the same K × D dimensional feature space; the feature E output by the coding layer has dimension N × K × D, where N is the number of deep local feature blocks sampled by the multi-scale windows;
(8): The N × K × D encoded features of step (7) are fused: the N groups of K × D features are weighted and summed to obtain a texture representation E_fusion that fuses the multi-scale local features, which can be expressed as:
E_fusion = Σ_{i=1}^{N} w_i · E_i
wherein: E_i denotes each encoded vector, N denotes the number of encoded vectors, and w_i is the weight of the corresponding window size;
(9): The fused vector E_fusion is flattened into a one-dimensional vector of length K × D and passed through one fully connected layer, which outputs a one-dimensional vector of dimension nclass, where nclass denotes the number of categories.
CN202110760949.1A 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding Active CN113674334B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110760949.1A CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110760949.1A CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Publications (2)

Publication Number Publication Date
CN113674334A true CN113674334A (en) 2021-11-19
CN113674334B CN113674334B (en) 2023-04-18

Family

ID=78538860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110760949.1A Active CN113674334B (en) 2021-07-06 2021-07-06 Texture recognition method based on depth self-attention network and local feature coding

Country Status (1)

Country Link
CN (1) CN113674334B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks
CN114220012A (en) * 2021-12-16 2022-03-22 池明旻 Textile cotton and linen identification method based on deep self-attention network
CN114418030A (en) * 2022-01-27 2022-04-29 腾讯科技(深圳)有限公司 Image classification method, and training method and device of image classification model
CN114627006A (en) * 2022-02-28 2022-06-14 复旦大学 Progressive image restoration method based on depth decoupling network
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN116070172A (en) * 2022-11-16 2023-05-05 北京理工大学 Method for enhancing characteristic expression of time series
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902692A (en) * 2019-01-14 2019-06-18 北京工商大学 A kind of image classification method based on regional area depth characteristic coding
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
US20200250398A1 (en) * 2019-02-01 2020-08-06 Owkin Inc. Systems and methods for image classification
CN111523462A (en) * 2020-04-22 2020-08-11 南京工程学院 Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112819039A (en) * 2021-01-14 2021-05-18 华中科技大学 Texture recognition model establishing method based on multi-scale integrated feature coding and application
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019153908A1 (en) * 2018-02-11 2019-08-15 北京达佳互联信息技术有限公司 Image recognition method and system based on attention model
CN109902692A (en) * 2019-01-14 2019-06-18 北京工商大学 A kind of image classification method based on regional area depth characteristic coding
US20200250398A1 (en) * 2019-02-01 2020-08-06 Owkin Inc. Systems and methods for image classification
CN111523462A (en) * 2020-04-22 2020-08-11 南京工程学院 Video sequence list situation recognition system and method based on self-attention enhanced CNN
CN112418074A (en) * 2020-11-20 2021-02-26 重庆邮电大学 Coupled posture face recognition method based on self-attention
CN112819039A (en) * 2021-01-14 2021-05-18 华中科技大学 Texture recognition model establishing method based on multi-scale integrated feature coding and application
CN112861978A (en) * 2021-02-20 2021-05-28 齐齐哈尔大学 Multi-branch feature fusion remote sensing scene image classification method based on attention mechanism

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220012A (en) * 2021-12-16 2022-03-22 池明旻 Textile cotton and linen identification method based on deep self-attention network
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks
CN114418030A (en) * 2022-01-27 2022-04-29 腾讯科技(深圳)有限公司 Image classification method, and training method and device of image classification model
CN114418030B (en) * 2022-01-27 2024-04-23 腾讯科技(深圳)有限公司 Image classification method, training method and device for image classification model
CN114627006A (en) * 2022-02-28 2022-06-14 复旦大学 Progressive image restoration method based on depth decoupling network
CN115409855A (en) * 2022-09-20 2022-11-29 北京百度网讯科技有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN115409855B (en) * 2022-09-20 2023-07-07 北京百度网讯科技有限公司 Image processing method, device, electronic equipment and storage medium
CN116070172A (en) * 2022-11-16 2023-05-05 北京理工大学 Method for enhancing characteristic expression of time series
CN116543146A (en) * 2023-07-06 2023-08-04 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism
CN116543146B (en) * 2023-07-06 2023-09-26 贵州大学 Image dense description method based on window self-attention and multi-scale mechanism

Also Published As

Publication number Publication date
CN113674334B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113674334B (en) Texture recognition method based on depth self-attention network and local feature coding
CN107330127B (en) Similar text detection method based on text picture retrieval
Thai et al. Image classification using support vector machine and artificial neural network
CN109635744A (en) A kind of method for detecting lane lines based on depth segmentation network
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN103927531A (en) Human face recognition method based on local binary value and PSO BP neural network
CN110188827B (en) Scene recognition method based on convolutional neural network and recursive automatic encoder model
CN110414616B (en) Remote sensing image dictionary learning and classifying method utilizing spatial relationship
CN111652273B (en) Deep learning-based RGB-D image classification method
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN108734199A (en) High spectrum image robust classification method based on segmentation depth characteristic and low-rank representation
CN109002771B (en) Remote sensing image classification method based on recurrent neural network
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN112926533A (en) Optical remote sensing image ground feature classification method and system based on bidirectional feature fusion
Sethy et al. Off-line Odia handwritten numeral recognition using neural network: a comparative analysis
CN115965864A (en) Lightweight attention mechanism network for crop disease identification
CN115775350A (en) Image enhancement method and device and computing equipment
CN114818889A (en) Image classification method based on linear self-attention transducer
CN114492581A (en) Method for classifying small sample pictures based on transfer learning and attention mechanism element learning application
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN109558819B (en) Depth network lightweight method for remote sensing image target detection
CN111401434A (en) Image classification method based on unsupervised feature learning
Masruroh et al. Deep Convolutional Neural Networks Transfer Learning Comparison on Arabic Handwriting Recognition System
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant