CN115761258A - Image direction prediction method based on multi-scale fusion and attention mechanism - Google Patents

Image direction prediction method based on multi-scale fusion and attention mechanism Download PDF

Info

Publication number
CN115761258A
Authority
CN
China
Prior art keywords
feature
image
attention
convolution
lbp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211406464.3A
Other languages
Chinese (zh)
Inventor
白茹意
郭小英
贾春花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University filed Critical Shanxi University
Priority to CN202211406464.3A priority Critical patent/CN115761258A/en
Publication of CN115761258A publication Critical patent/CN115761258A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image direction prediction method based on multi-scale fusion and an attention mechanism, and belongs to the technical field of computer vision and image processing. Aiming at the problem of automatic prediction of image direction, the invention takes the feature maps output by the last residual structure of each of the last four stages of the ResNet50 network and passes each of them through an attention mechanism module to obtain 4 spatial attention maps, which are added to the corresponding elements of the original feature maps. The three smaller-scale features are then up-sampled by bilinear interpolation to the same resolution as the largest scale and concatenated along the channel dimension to obtain the final multi-scale attention fusion feature, referred to as the local feature. Secondly, 4 VR_LBP feature maps of the image at different scales are taken as network input, and ResNet50 fused with residual dilated convolution is adopted to obtain 4 feature maps, whose corresponding elements are then added to obtain the global feature. Finally, the local feature and the global feature are concatenated and fused, and direction prediction is realized through global average pooling (GAP) and a fully connected layer.

Description

Image direction prediction method based on multi-scale fusion and attention mechanism
Technical Field
The invention belongs to the technical field of image processing and computer vision perception, and particularly relates to an image direction prediction method based on multi-scale fusion and an attention mechanism.
Background
Advances in digital imaging technology, as well as the proliferation of digital cameras, smart phones, and other devices, have resulted in a significant increase in the number of photographs that people take. Since the camera is not always held horizontally during shooting, the resulting picture usually requires rotational correction in order to be displayed in its correct orientation, which is defined as the orientation in which the scene initially appears. Most digital cameras have a built-in orientation sensor that allows the camera orientation to be recorded in the EXIF metadata of the image during capture, but this field is not uniformly managed and updated by several image processing applications and image formats. Automatic detection of the standard image orientation is therefore an important task for applications such as the automatic creation of digital photo albums, the digitization of analog photos, and computer vision applications that require images to be input in an upright orientation; without it, user intervention is required, since humans can use their own image understanding capabilities to recognize the correct orientation of a photograph. Generally, the orientation of a picture is determined by the rotation of the camera when the picture is taken, with 90-degree rotations being the most common, even though any angle is possible. Therefore, it is generally assumed that an image is taken in one of four directions: 0 degrees (up), 90 degrees (right), 180 degrees (down) or 270 degrees (left). Accomplishing this task automatically is challenging due to the wide variability of scene content.
In current research, most image direction identification methods adopt image processing and machine learning algorithms. Nevertheless, these methods have some problems: (1) Some orientation detection methods rely on low-level features and then implement orientation detection with appropriate classifiers; however, low-level features fail to capture much of the semantic content of an image. (2) Some direction detection methods using neural networks need to scale the original image; for example, a VGG network scales the image to 224 × 224. However, the aspect ratio is one of the factors that characterize an image, and scaling loses some of its information. (3) At present, most neural network methods for detecting image direction are fine-tuned from an existing backbone network without considering whether the extracted features can express human visual perception, so the generalization capability of such models is not high.
Disclosure of Invention
Aiming at the problem of image direction identification at present, the invention provides an image direction prediction method based on multi-scale fusion and attention mechanism.
In order to achieve the purpose, the invention adopts the following technical scheme:
an image direction prediction method based on multi-scale fusion and attention mechanism comprises the following steps:
step 1, rotating each image clockwise by three angles, 90 degrees, 180 degrees and 270 degrees, respectively, so that images in four different directions, namely an upper direction, a right direction, a lower direction and a left direction, are finally obtained from each image;
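As an illustrative sketch only (not part of the claimed method), the following Python code, assuming the Pillow library, shows how the four orientation classes could be generated from a single source image:

from PIL import Image

def make_orientation_classes(path):
    """Return the four orientation variants of one image.
    Label convention assumed here: 0 = up (original), 1 = right (90 degrees clockwise),
    2 = down (180 degrees), 3 = left (270 degrees clockwise)."""
    img = Image.open(path)
    # PIL's rotate() is counter-clockwise, so negative angles give clockwise rotation;
    # expand=True keeps the full rotated image instead of cropping it
    return {
        0: img,
        1: img.rotate(-90, expand=True),
        2: img.rotate(-180, expand=True),
        3: img.rotate(-270, expand=True),
    }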
step 2, extracting local features of each image by adopting ResNet50 fused with a residual attention mechanism, specifically comprising the following steps:
step 2.1, the ResNet50 network consists of 6 parts, respectively: a convolutional layer, C0, C1, C2, C3, and C4; C0 comprises 1 convolution layer of 7 × 7 with stride 2 and 1 max-pooling layer of 3 × 3 with stride 2; the feature maps of C1, C2, C3 and C4 are 1/4, 1/8, 1/16 and 1/32 of the original image size, respectively, and comprise 3, 4, 6 and 3 bottleneck layers (Bottleneck, BTNK for short), respectively;
step 2.2, marking the feature maps output by the last residual structure of each of the parts C1, C2, C3 and C4 as C_1, C_2, C_3 and C_4, respectively; passing each of the four feature maps of different scales through an attention mechanism module (Convolutional Block Attention Module, CBAM) to obtain 4 spatial attention maps, denoted A_1, A_2, A_3 and A_4;
Step 2.3, adding the corresponding elements of each spatial attention map and its corresponding original feature map, denoted as

F_i = A_i ⊕ C_i (i = 1, 2, 3, 4)

where ⊕ represents the addition of corresponding elements;
step 2.4, up-sampling the three smaller-scale feature maps F_2, F_3 and F_4 to the same scale as F_1 by bilinear interpolation, performing a concatenation operation along the channel dimension, and then performing a 1 × 1 convolution operation to obtain the final multi-scale attention fusion feature, the local feature, recorded as: Local_Feature = concat(F_1, up_2x(F_2), up_4x(F_3), up_8x(F_4)); where concat represents feature concatenation and up_2x, up_4x and up_8x represent up-sampling by factors of 2, 4 and 8, respectively;
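The following sketch outlines this local-feature branch in Python (PyTorch is used purely for illustration and is not prescribed by the invention; the stage channel widths, the 256 output channels and the cbam_factory argument, which defaults to an identity placeholder and is meant to be replaced by the CBAM module described further below, are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFeatureBranch(nn.Module):
    """Multi-scale attention fusion over the four ResNet50 stage outputs (steps 2.2 to 2.4)."""
    def __init__(self, cbam_factory=nn.Identity, channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # one attention module per scale; cbam_factory(c) must return a module whose
        # output has the same shape as its input (e.g. the CBAM sketched further below)
        self.cbams = nn.ModuleList([cbam_factory(c) for c in channels])
        # 1x1 convolution applied after concatenation along the channel dimension
        self.fuse = nn.Conv2d(sum(channels), out_channels, kernel_size=1)

    def forward(self, c1, c2, c3, c4):
        feats = []
        for cbam, c in zip(self.cbams, (c1, c2, c3, c4)):
            # step 2.3: add the attention output to the original feature map element-wise
            feats.append(cbam(c) + c)
        # step 2.4: up-sample F_2, F_3, F_4 to the resolution of F_1 by bilinear interpolation
        target = feats[0].shape[-2:]
        ups = [feats[0]] + [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
                            for f in feats[1:]]
        # concatenate along the channel dimension, then apply the 1x1 convolution
        return self.fuse(torch.cat(ups, dim=1))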
step 3, taking the VR_LBP (rotation-variable local binary pattern) feature maps of the image at 4 different scales as network input, and extracting the global features of the image by adopting ResNet50 fused with residual dilated convolution (Residual Dilated Convolution);
step 3.1, calculating, in the standard three-primary-color (RGB) mode, VR_LBP feature maps capable of expressing the direction characteristic of the image; 4 different scales, VR_LBP_{1,8}, VR_LBP_{2,16}, VR_LBP_{3,24} and VR_LBP_{4,32}, are adopted in the calculation to generate 4 VR_LBP feature maps, denoted P_1, P_2, P_3 and P_4, which serve as the input of the ResNet50 network;
step 3.2, inputting the four VR_LBP feature maps P_1, P_2, P_3 and P_4 of different scales into ResNet50, and marking the feature maps output by the last convolution block of the ResNet50 network as {RP_1, RP_2, RP_3, RP_4}; inputting the 4 feature maps into residual dilated convolution blocks with the corresponding sampling rates, the 4 sampling rates corresponding to the R values of the respective VR_LBP feature maps; each residual dilated convolution block is formed by adding a shortcut connection (Shortcut Connection) consisting of a 1 × 1 convolution to one 3 × 3 dilated convolution; the shortcut connection matches the spatial dimensions of the feature maps, and the residual block realizes identity mapping while the convolution extracts image features. The 4 feature maps obtained after the residual dilated convolution blocks are denoted RPD_1, RPD_2, RPD_3 and RPD_4, and their corresponding elements are added to obtain the global feature: Global_Feature = RPD_1 ⊕ RPD_2 ⊕ RPD_3 ⊕ RPD_4;
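A minimal sketch of one such residual dilated convolution block follows (PyTorch for illustration only; the batch normalization, the padding equal to the dilation rate that keeps the spatial size unchanged, and the ReLU after the addition are assumptions not specified by the invention):

import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """A 3x3 dilated convolution plus a 1x1 shortcut connection."""
    def __init__(self, in_channels, out_channels, dilation):
        super().__init__()
        # 3x3 dilated convolution; padding=dilation keeps the height and width unchanged
        self.dilated = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # 1x1 convolution on the shortcut to match the feature dimensions
        self.shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.dilated(x) + self.shortcut(x))

In the method, four such blocks with dilation rates 1, 2, 3 and 4 would process RP_1 to RP_4, matching the R values of the four VR_LBP scales, and their outputs RPD_1 to RPD_4 are then summed element-wise to give Global_Feature.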
Step 4, splicing and fusing the local features obtained in the step 2.4 and the global features obtained in the step 3.2, and finally realizing direction prediction;
step 4.1, down-sampling Local_Feature to the same resolution as Global_Feature by bilinear interpolation, and then concatenating the two to obtain the fused feature; LG_Feature = concat(down(Local_Feature), Global_Feature), where down represents down-sampling;
step 4.2, subjecting LG_Feature to global average pooling (GAP) to obtain a one-dimensional vector, and then predicting the image direction through a 256-dimensional fully connected layer;
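A sketch of the fusion and prediction head of steps 4.1 and 4.2 (illustrative PyTorch; the second linear layer mapping the 256-dimensional vector to the four direction classes is an assumption based on the four orientation classes of step 1):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, local_channels, global_channels, num_classes=4):
        super().__init__()
        self.fc1 = nn.Linear(local_channels + global_channels, 256)  # 256-dimensional fully connected layer
        self.fc2 = nn.Linear(256, num_classes)                       # assumed 4-way direction output

    def forward(self, local_feat, global_feat):
        # step 4.1: down-sample the local feature to the global feature's resolution, then concatenate
        local_feat = F.interpolate(local_feat, size=global_feat.shape[-2:],
                                   mode="bilinear", align_corners=False)
        fused = torch.cat([local_feat, global_feat], dim=1)
        # step 4.2: global average pooling turns each feature map into a single value per channel
        pooled = fused.mean(dim=(2, 3))
        return self.fc2(F.relu(self.fc1(pooled)))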
step 4.3, taking the logistic regression maximum likelihood loss function as the loss function to realize direction classification and automatic prediction of the image direction, the loss function being defined as follows:
h_θ(x_i) = 1 / (1 + e^(−θ^T x_i))

J(θ) = −(1/m) Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
where h_θ(x) represents the probability that sample x belongs to a class; y_i is the predicted direction class, x_i represents the feature of the i-th sample, m is the number of samples, θ is the parameter solved by the network model, and T represents matrix transposition.
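For illustration, a minimal NumPy sketch of this maximum-likelihood (log-loss) objective in the binary logistic-regression form written above; in practice the four-direction task would use the softmax generalization of the same loss, which is an assumption rather than a statement of the patent:

import numpy as np

def logistic_loss(theta, X, y):
    """Negative average log-likelihood of logistic regression.
    X: (m, d) sample features, y: (m,) labels in {0, 1}, theta: (d,) parameters."""
    h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x_i) for every sample
    eps = 1e-12                            # numerical safety for log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))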
Further, the bottleneck layer BTNK in step 2.1 has two types, BTNK1 and BTNK2; BTNK2 has 3 conv+BN+ReLU convolution blocks on its left branch, the convolution result F(x) is added to the input x, namely F(x) + x, and then passed through 1 ReLU activation function, and the numbers of input and output channels of this module are the same; BTNK1 has 3 conv+BN+ReLU convolution blocks F(x) on the left branch and 1 conv+BN convolution block G(x) on the right branch, which matches the difference between the input and output dimensions, i.e. makes the numbers of channels of F(x) and G(x) the same, after which the sum F(x) + G(x) is taken, and the numbers of input and output channels of this module are different. The ResNet50 network is formed by stacking a plurality of bottleneck layers BTNK of the two types.
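A compact sketch of the two bottleneck variants (PyTorch for illustration; the 1-3-1 kernel layout and the bottleneck width of out_channels // 4 follow the standard ResNet50 design and are assumptions here):

import torch.nn as nn

def conv_bn_relu(in_c, out_c, k, stride=1):
    # one conv+BN+ReLU block as used inside BTNK1 and BTNK2
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True),
    )

class Bottleneck(nn.Module):
    """Acts as BTNK2 when input and output channels match (identity shortcut),
    and as BTNK1 otherwise (conv+BN projection shortcut G(x))."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        mid = out_channels // 4
        # left branch F(x): three conv+BN+ReLU blocks
        self.f = nn.Sequential(
            conv_bn_relu(in_channels, mid, 1),
            conv_bn_relu(mid, mid, 3, stride=stride),
            conv_bn_relu(mid, out_channels, 1),
        )
        # right branch: identity for BTNK2, 1x1 conv+BN projection G(x) for BTNK1
        if in_channels == out_channels and stride == 1:
            self.g = nn.Identity()
        else:
            self.g = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # F(x) + x (BTNK2) or F(x) + G(x) (BTNK1), followed by one ReLU
        return self.relu(self.f(x) + self.g(x))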
Further, the attention mechanism module CBAM of step 2.2 combines a channel attention module and a spatial attention module; the channel attention module applies global max pooling and global average pooling to the input feature map to obtain two C × 1 × 1 features, where C represents the number of channels, and then sends the two C × 1 × 1 features into a two-layer neural network MLP;
the number of neurons in the first layer of the two-layer neural network MLP is C/r, where r is the reduction ratio, the activation function is ReLU, the number of neurons in the second layer is C, and the two-layer neural network MLP is shared by the two pooled features;
the two features output by the MLP are added element by element and passed through a sigmoid activation operation to generate the final channel attention feature; finally, the channel attention feature is multiplied element by element with the input feature to generate the input feature required by the spatial attention module;
the spatial attention module first applies global max pooling and global average pooling along the channel dimension to the output feature of the channel attention module to obtain two 1 × H × W single-channel pooled features, then concatenates the two 1 × H × W single-channel pooled features along the channel dimension, reduces the number of channels to 1 through a 7 × 7 convolution operation, generates the spatial attention feature through a sigmoid, and finally multiplies the sigmoid-generated spatial attention feature element by element with the input feature of the spatial attention module to obtain the weighted feature fusing channel attention and spatial attention.
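A sketch of such a CBAM module in PyTorch (illustrative only; the default reduction ratio r = 16 is an assumption):

import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        # shared two-layer MLP: C -> C/r (ReLU) -> C
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # 7x7 convolution over the two stacked single-channel pooling maps
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: global max and global average pooling, shared MLP, sigmoid
        max_feat = self.mlp(x.amax(dim=(2, 3)))
        avg_feat = self.mlp(x.mean(dim=(2, 3)))
        channel_att = torch.sigmoid(max_feat + avg_feat).view(b, c, 1, 1)
        x = x * channel_att
        # spatial attention: channel-wise max and average pooling, concatenation, 7x7 conv, sigmoid
        max_map = x.amax(dim=1, keepdim=True)
        avg_map = x.mean(dim=1, keepdim=True)
        spatial_att = torch.sigmoid(self.spatial_conv(torch.cat([max_map, avg_map], dim=1)))
        return x * spatial_att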
Further, the VR_LBP feature map in step 3.1 takes a pixel in the image as the center point and, based on

(x_p, y_p) = (x_c + R·cos(2πp/P), y_c − R·sin(2πp/P)), p = 0, 1, …, P − 1,

obtains by interpolation the neighborhood points forming a circular sampling point set around the center point, where R is the radius and P is the number of sampling points; the value of the center pixel is then compared with the value of each neighboring pixel: if the neighboring pixel's value is greater than that of the center pixel, that neighborhood position is set to 1, otherwise it is set to 0; the circular sampling points are then read clockwise and finally combined into a binary number sequence, and the sequence is converted to decimal to give the VR_LBP_{R,P} code, calculated as follows:
VR_LBP_{R,P} = Σ_{i=0}^{P−1} t(gray_i − gray_c) · 2^i
where gray_c is the gray value of the current (center) pixel and gray_i is the gray value of its i-th neighborhood point; t(x) is 0 when x is less than 0, and 1 otherwise.
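For illustration, a minimal NumPy sketch of this circular code for a single pixel (the bilinear interpolation of the sampling points and the handling of image borders, which this sketch omits, are simplifying assumptions):

import numpy as np

def lbp_code(gray, xc, yc, R, P):
    """Circular LBP code of the pixel at integer coordinates (xc, yc) in a 2-D gray image."""
    code = 0
    for p in range(P):
        # sampling point on a circle of radius R around the center point
        x = xc + R * np.cos(2 * np.pi * p / P)
        y = yc - R * np.sin(2 * np.pi * p / P)
        # bilinear interpolation of the neighbor's gray value
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        dx, dy = x - x0, y - y0
        g = (gray[y0, x0] * (1 - dx) * (1 - dy) + gray[y0, x0 + 1] * dx * (1 - dy)
             + gray[y0 + 1, x0] * (1 - dx) * dy + gray[y0 + 1, x0 + 1] * dx * dy)
        # threshold against the center pixel: t(gray_i - gray_c) contributes 2^p to the code
        code += (1 if g >= gray[yc, xc] else 0) << p
    return code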
Further, in step 3.2, the dilated convolution has a hyper-parameter, the dilation rate, which defines the spacing between the values the convolution kernel acts on, i.e. (dilation rate − 1) zeros are inserted between the elements of the convolution kernel; therefore, when different dilation rates are set, the receptive fields differ and multi-scale information is obtained; the effective kernel size of the dilated convolution is K = k + (k − 1)(r − 1), where k is the size of the original convolution kernel and r is the dilation rate; 4 different dilated convolution kernels are used, one for each scale of VR_LBP, with r being 1, 2, 3 and 4, respectively.
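For example, with the 3 × 3 kernel used in the residual dilated convolution blocks, the effective kernel sizes for the four dilation rates work out as follows:

k = 3
for r in (1, 2, 3, 4):
    K = k + (k - 1) * (r - 1)   # effective kernel size of the dilated convolution
    print(f"dilation rate r = {r}: effective kernel {K} x {K}")
# prints 3 x 3, 5 x 5, 7 x 7, 9 x 9: the receptive field grows with the dilation rate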
Further, the window size of the global average pooling in step 4.2 is the size of the whole feature map: an average value is calculated over all pixels of the feature map of each output channel, so that a feature vector of size 1 × 1 × C is obtained after the global average pooling, where C is the number of channels of the original feature map.
Compared with the prior art, the invention has the following advantages:
(1) The feature maps output by the last residual structure of each of the last four parts of the ResNet50 network are passed through an attention mechanism module (CBAM) to obtain 4 spatial attention maps, which are added to the corresponding elements of the original feature maps. The three smaller-scale features are then up-sampled by bilinear interpolation to the same resolution as the largest scale and concatenated along the channel dimension to obtain the final multi-scale attention fusion feature, namely the local feature. Because an attention mechanism is adopted, the machine's judgment of direction is more consistent with human vision. (2) LBP feature maps of the image at 4 different scales are taken as network input, ResNet50 fused with residual dilated convolution is adopted to obtain 4 feature maps, and their corresponding elements are then added to obtain the global feature. The direction characteristics of the image are thus extracted at multiple scales and expressed from different fields of view, which improves the accuracy of direction detection. (3) The local features and the global features are concatenated and fused, and direction prediction is finally realized through GAP and a fully connected layer. The GAP module converts an image of any input size into a feature vector of fixed size, which reduces overfitting and accelerates network convergence.
Drawings
FIG. 1 is a schematic view of an image of the present invention;
FIG. 2 is a flow chart of the present invention;
FIG. 3 is a diagram of the dilated convolution kernels of different scales of the present invention;
FIG. 4 is a diagram of the VR_LBP feature maps of different scales of the present invention;
FIG. 5 is a network model framework of the present invention;
FIG. 6 is a diagram of performance index data for the 8 models Lmodel1 to Lmodel8 of the present invention;
FIG. 7 is a diagram of performance index data for the 10 models Gmodel1 to Gmodel10 of the present invention.
Detailed Description
In order to make the technical problems to be solved, the technical solutions and the advantageous effects of the present invention clearer, the present invention is further described in detail below with reference to the embodiments and the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. The technical solutions of the present invention are described in detail below with reference to the embodiments and the drawings, but the scope of protection is not limited thereto.
Example 1
The invention selects a public data set for the experiment; the specific implementation steps are as follows:
Step 1: the SUN dataset, a public data set comprising 108,754 images in 397 scene categories, is selected for testing. Each image is rotated clockwise by three angles: 90 degrees, 180 degrees and 270 degrees, so that images in four different directions, namely upper, right, lower and left, are finally obtained from each image.
Step 2: local features of the image are extracted by adopting ResNet50 fused with a residual attention mechanism, the specific steps being as follows:
Step 2.1: the ResNet50 network consists of 6 parts, namely a convolutional layer, C0, C1, C2, C3 and C4. C0 comprises 1 convolution layer of 7 × 7 with stride 2 and 1 max-pooling layer of 3 × 3 with stride 2; the feature maps of C1, C2, C3 and C4 are 1/4, 1/8, 1/16 and 1/32 of the original image size, respectively, and comprise 3, 4, 6 and 3 bottleneck layers (BTNK), respectively. The bottleneck layer BTNK comprises BTNK1 and BTNK2; BTNK2 has 3 conv+BN+ReLU convolution blocks on its left branch, the convolution result F(x) is added to the input x, namely F(x) + x, and then passed through 1 ReLU activation function, and the numbers of input and output channels of this module are the same; BTNK1 has 3 conv+BN+ReLU convolution blocks F(x) on the left branch and 1 conv+BN convolution block G(x) on the right branch, which matches the difference between the input and output dimensions, i.e. makes the numbers of channels of F(x) and G(x) the same, after which the sum F(x) + G(x) is taken, and the numbers of input and output channels of this module are different.
Step 2.2: the feature maps output by the last residual structure of each of the parts C1, C2, C3 and C4 are marked as C_1, C_2, C_3 and C_4, respectively; the four feature maps of different scales are each passed through an attention mechanism module (Convolutional Block Attention Module, CBAM) to obtain 4 spatial attention maps, denoted A_1, A_2, A_3 and A_4. The attention mechanism module CBAM combines a channel attention module and a spatial attention module; the channel attention module applies global max pooling and global average pooling to the input feature map to obtain two C × 1 × 1 features, where C represents the number of channels, and then sends the two C × 1 × 1 features into a two-layer neural network MLP; the number of neurons in the first layer of the two-layer neural network MLP is C/r, where r is the reduction ratio, the activation function is ReLU, the number of neurons in the second layer is C, and the two-layer neural network MLP is shared by the two pooled features; the two features output by the MLP are added element by element and passed through a sigmoid activation operation to generate the final channel attention feature; finally, the channel attention feature is multiplied element by element with the input feature to generate the input feature required by the spatial attention module; the spatial attention module first applies global max pooling and global average pooling along the channel dimension to the output feature of the channel attention module to obtain two 1 × H × W single-channel pooled features, then concatenates the two 1 × H × W single-channel pooled features along the channel dimension, reduces the number of channels to 1 through a 7 × 7 convolution operation, generates the spatial attention feature through a sigmoid, and finally multiplies the sigmoid-generated spatial attention feature element by element with the input feature of the spatial attention module to obtain the weighted feature fusing channel attention and spatial attention.
Step 2.3: the corresponding elements of each spatial attention map and its corresponding original feature map are added, denoted as

F_i = A_i ⊕ C_i (i = 1, 2, 3, 4)

where ⊕ represents the addition of corresponding elements.
Step 2.4: f is to be 2 、F 3 、F 4 Three small-scale feature maps are upsampled to F by bilinear interpolation 1 Same dimension, edgeAnd splicing the channels, and performing 1 × 1 convolution operation to obtain final multi-scale attention fusion features, wherein local features are recorded as: local _ Feature = concat (F) 1 ,up_2x(F 2 ),up_4x(F 3 ),up_8x(F 4 ) ); wherein concat represents feature concatenation, up _2x represents up-sampling by a factor of 2;
and 3, step 3: taking a rotational Variable Local binary pattern (VR _ LBP) feature map of 4 different scales of an image as network input, and extracting global features of the image by adopting ResNet50 fusion Residual hole Convolution (Residual scaled Convolution), wherein the specific steps are as follows:
Step 3.1: VR_LBP feature maps capable of expressing the direction characteristic of the image are calculated in the standard three-primary-color (RGB) mode; 4 different scales, VR_LBP_{1,8}, VR_LBP_{2,16}, VR_LBP_{3,24} and VR_LBP_{4,32}, are adopted in the calculation, generating 4 VR_LBP feature maps denoted P_1, P_2, P_3 and P_4, which serve as the input of the ResNet50 network. The VR_LBP feature map takes a pixel in the image as the center point and, based on

(x_p, y_p) = (x_c + R·cos(2πp/P), y_c − R·sin(2πp/P)), p = 0, 1, …, P − 1,

obtains by interpolation the neighborhood points forming a circular sampling point set around the center point, where R is the radius and P is the number of sampling points; the value of the center pixel is then compared with the value of each neighboring pixel: if the neighboring pixel's value is greater than that of the center pixel, that neighborhood position is set to 1, otherwise it is set to 0; the circular sampling points are then read clockwise and finally combined into a binary number sequence, and the sequence is converted to decimal to give the VR_LBP_{R,P} code, calculated as:

VR_LBP_{R,P} = Σ_{i=0}^{P−1} t(gray_i − gray_c) · 2^i

where gray_c is the gray value of the current (center) pixel and gray_i is the gray value of its i-th neighborhood point; t(x) is 0 when x is less than 0, and 1 otherwise.
Step 3.2: the four VR_LBP feature maps P_1, P_2, P_3 and P_4 of different scales are input into ResNet50, and the feature maps output by the last convolution block of the ResNet50 network are marked as {RP_1, RP_2, RP_3, RP_4}; the 4 feature maps are input into residual dilated convolution blocks with the corresponding sampling rates, the 4 sampling rates corresponding to the R values of the respective VR_LBP feature maps.
Each residual dilated convolution block is formed by adding a shortcut connection (Shortcut Connection) consisting of a 1 × 1 convolution to one 3 × 3 dilated convolution. The shortcut connection matches the spatial dimensions of the feature maps, and the residual block realizes identity mapping while the convolution extracts image features. The 4 feature maps obtained after the residual dilated convolution blocks are denoted RPD_1, RPD_2, RPD_3 and RPD_4, and their corresponding elements are added to obtain the global feature:

Global_Feature = RPD_1 ⊕ RPD_2 ⊕ RPD_3 ⊕ RPD_4

where ⊕ indicates the addition of corresponding elements.
The dilated convolution has a hyper-parameter, the dilation rate, which defines the spacing between the values the convolution kernel acts on, i.e. (dilation rate − 1) zeros are inserted between the elements of the convolution kernel, so that when different dilation rates are set the receptive fields differ, and multi-scale information is obtained. The effective kernel size of the dilated convolution is K = k + (k − 1)(r − 1), where k is the size of the original convolution kernel and r is the dilation rate. The invention adopts 4 different dilated convolution kernels, one for each scale of VR_LBP, with r being 1, 2, 3 and 4, respectively.
Step 4: the local features obtained in step 2.4 and the global features obtained in step 3.2 are concatenated and fused to finally realize direction prediction, the specific steps being as follows:
Step 4.1: Local_Feature is down-sampled to the same resolution as Global_Feature by bilinear interpolation, and the two are then concatenated to obtain the fused feature: LG_Feature = concat(down(Local_Feature), Global_Feature), where down represents down-sampling.
Step 4.2: LG_Feature is subjected to global average pooling to obtain a one-dimensional vector, and the image direction is then predicted through a 256-dimensional fully connected layer. The window size of the GAP is the size of the whole feature map: an average value is calculated over all pixels of the feature map of each output channel, so that a feature vector of size 1 × 1 × C is obtained after the global average pooling, where C is the number of channels of the original feature map.
Step 4.3: the logistic regression maximum likelihood loss function is used as the loss function to realize direction classification and automatic prediction of the image direction, the loss function being defined as follows:
h_θ(x_i) = 1 / (1 + e^(−θ^T x_i))

J(θ) = −(1/m) Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
where h_θ(x) represents the probability that sample x belongs to a class; y_i is the predicted direction class, x_i represents the feature of the i-th sample, m is the number of samples, θ is the parameter solved by the network model, and T represents matrix transposition.
Step 5: the experimental environment is Anaconda3 with the TensorFlow (GPU) deep learning framework. 70% of each data set is selected as the training set and 30% as the test set. The original image size remains unchanged. 10-fold cross validation is adopted, so the final evaluation index is the average accuracy over the 10 folds.
ResNet50 is pre-trained on the ImageNet data set, the obtained convolutional layer parameters are applied to the method provided by the invention, and the other modules are fine-tuned on this basis.
The relevant parameters of the experiment are set as follows: the batch size is set to 128, the network is trained end to end using the SGD optimizer with momentum, the momentum is set to 0.9, the learning rate is 0.001, the number of iterations is 30, and L2 regularization is added to prevent overfitting. Since the method of the invention addresses a multi-classification problem, classification accuracy (ACC), macro-average precision (MAP), macro-average recall (MAR) and the confusion matrix are used to evaluate the performance of the model.
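A sketch of this training configuration (written with PyTorch purely for illustration, whereas the experiments above use TensorFlow; the weight_decay value standing in for the L2 regularization strength is an assumption):

import torch
import torch.nn as nn

def make_optimizer(model):
    # SGD with momentum 0.9 and learning rate 0.001, as in the experiments;
    # weight_decay supplies the L2 regularization (the value 1e-4 is an assumption)
    return torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)

def train(model, loader, epochs=30, device="cuda"):
    """End-to-end training loop; the batch size of 128 is set when building `loader`."""
    model.to(device)
    optimizer = make_optimizer(model)
    criterion = nn.CrossEntropyLoss()   # four-way direction classification
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model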
In order to fully verify the effectiveness and applicability of the method, the effectiveness of the local features in the image direction detection task is verified first: ablation experiments are carried out on the fusion of features from different layers of ResNet50 used in this part of the network structure and on the CBAM, while the global features adopt the structure proposed by the invention, generating 8 different models. As shown in Table 1, the feature maps adopted by Lmodel1 and Lmodel5 are C4; the feature maps adopted by Lmodel2 and Lmodel6 are C3 and C4; the feature maps adopted by Lmodel3 and Lmodel7 are C2, C3 and C4; and the feature maps adopted by Lmodel4 and Lmodel8 are C1, C2, C3 and C4.
In addition, whether CBAM is incorporated is also considered: CBAM is not added in Lmodel1 to Lmodel4, which instead adopt direct up-sampling fusion, while CBAM is added in Lmodel5 to Lmodel8. Lmodel1 is the backbone network ResNet50, and Lmodel8 is the model proposed by the present invention. As can be seen from the performance indexes of the 8 models in FIG. 6, the accuracy, macro-average precision and macro-average recall of the model proposed by the invention (Lmodel8) are 99.2%, 97.1% and 95.5%, respectively, which are superior to those of the other models.
TABLE 1
(Table 1, presented as an image in the original document, lists the ResNet50 feature maps and CBAM configurations of the 8 ablation models Lmodel1 to Lmodel8.)
To verify the effectiveness of the global features proposed by the invention in the image direction detection task, ablation experiments were performed on the different-scale VR_LBP images and the residual dilated convolution used in this part of the network structure, while the local features adopt the structure proposed by the invention, generating 10 different models, as shown in Table 2. First, the VR_LBP images of the original image at different scales are calculated, and either the original image or VR_LBP images of different scales are selected as the network input of the model. As shown in the table, the input of Gmodel1 and Gmodel2 is the original image, the input of Gmodel3 and Gmodel7 is VR_LBP_{1,8}, the input of Gmodel4 and Gmodel8 is VR_LBP_{1,8} and VR_LBP_{2,16}, the input of Gmodel5 and Gmodel9 is VR_LBP_{1,8}, VR_LBP_{2,16} and VR_LBP_{3,24}, and the input of Gmodel6 and Gmodel10 is VR_LBP_{1,8}, VR_LBP_{2,16}, VR_LBP_{3,24} and VR_LBP_{4,32}. In addition, whether the residual dilated convolution layer is incorporated is also considered: the residual dilated convolution layers are not added in Gmodel1 and Gmodel3 to Gmodel6, and are added in Gmodel2 and Gmodel7 to Gmodel10. As can be seen from the performance indexes of the 10 models in FIG. 7, the accuracy, macro-average precision and macro-average recall of the model proposed by the invention (Gmodel10) are 99.2%, 97.5% and 94.7%, respectively, which are superior to those of the other models.
TABLE 2
(Table 2, presented as an image in the original document, lists the inputs and residual dilated convolution configurations of the 10 ablation models Gmodel1 to Gmodel10.)
In order to verify the effectiveness of fusing the local features and the global features in this task, experiments are carried out on four models: the backbone network, the local feature network, the global feature network, and the fusion of the two features. Table 3 shows that the classification accuracy of the fusion method proposed by the invention on the data set is 99.2%, which is superior to the single-feature models. The experimental results show that the accuracy of the local and global feature fusion model is higher than that of the backbone network or of a network using a single feature. This also verifies that, when judging the direction of a picture, a viewer attends both to the specific content in the picture and to its overall layout, so that the classification model achieves a good classification effect on various types of images.
TABLE 3
(Table 3, presented as an image in the original document, lists the classification accuracy of the backbone, local-feature, global-feature and fused models.)
The invention performs experiments on the SUN data set, and compared with current related research the classification effect is significant. The designed VR_LBP image descriptor expresses the direction characteristic of the image well, and the fusion of local and global features helps the model perceive the direction of the image from different visual features. The method of the invention performs well on different data sets.
Compared with existing image direction perception methods, the advantages of the invention are: (1) The invention does not scale the original images in the data set; keeping the initial size of the image retains more of its effective information. (2) Attention mechanism features are extracted from the feature maps of the neural network model at different scales and fused to obtain local features. This approach is similar to the human visual attention mechanism, capturing more detailed information about the target while ignoring other irrelevant information, so that high-value information can be screened out of a large amount of information rapidly with limited attention resources. (3) VR_LBP (rotation-variable local binary pattern) features of different scales are extracted from the image, ResNet50 fused with residual dilated convolution is adopted to obtain 4 feature maps, and their corresponding elements are then added to obtain the global feature. VR_LBP expresses the directional characteristics of the image more accurately and improves the generalization capability of the model. (4) The fusion of global and local features expresses the direction semantics of the image more comprehensively and improves the classification accuracy of the model.
Those skilled in the art will appreciate that the invention may be practiced without these specific details. Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of these embodiments; various changes apparent to those skilled in the art are permissible as long as they fall within the spirit and scope of the present invention as defined by the appended claims, and all inventions utilizing the inventive concept are protected.

Claims (7)

1. An image direction prediction method based on multi-scale fusion and attention mechanism is characterized in that: the method comprises the following steps:
step 1, respectively rotating each image clockwise by three angles, 90 degrees, 180 degrees and 270 degrees, so that images in four different directions, namely an upper direction, a right direction, a lower direction and a left direction, are finally obtained from each image;
step 2, extracting local features of each image by adopting ResNet50 fused with a residual attention mechanism, specifically comprising the following steps:
step 2.1, the ResNet50 network consists of 6 parts, respectively: a convolutional layer, C0, C1, C2, C3, and C4; C0 comprises 1 convolution layer of 7 × 7 with stride 2 and 1 max-pooling layer of 3 × 3 with stride 2; the feature maps of C1, C2, C3 and C4 are 1/4, 1/8, 1/16 and 1/32 of the original image size, respectively, and comprise 3 bottleneck layers BTNK, 4 bottleneck layers BTNK, 6 bottleneck layers BTNK and 3 bottleneck layers BTNK, respectively;
step 2.2, marking the feature maps output by the last residual structure of each of the parts C1, C2, C3 and C4 as C_1, C_2, C_3 and C_4, respectively; passing each of the four feature maps of different scales through an attention mechanism module CBAM to obtain 4 spatial attention maps, denoted A_1, A_2, A_3 and A_4;
step 2.3, adding the corresponding elements of each spatial attention map and its corresponding original feature map, denoted as F_i = A_i ⊕ C_i (i = 1, 2, 3, 4), where ⊕ denotes the addition of corresponding elements;
step 2.4, up-sampling the three smaller-scale feature maps F_2, F_3 and F_4 to the same scale as F_1 by bilinear interpolation, performing a concatenation operation along the channel dimension, and then performing a 1 × 1 convolution operation to obtain the final multi-scale attention fusion feature, the local feature, recorded as: Local_Feature = concat(F_1, up_2x(F_2), up_4x(F_3), up_8x(F_4)); where concat represents feature concatenation and up_2x, up_4x and up_8x represent up-sampling by factors of 2, 4 and 8, respectively;
step 3, taking the rotation-variable local binary pattern feature maps VR_LBP of the image at 4 different scales as network input, and extracting the global features of the image by adopting ResNet50 fused with residual dilated convolution, the specific steps being as follows;
step 3.1, calculating, in the standard three-primary-color (RGB) mode, VR_LBP feature maps capable of expressing the direction characteristic of the image; 4 different scales, VR_LBP_{1,8}, VR_LBP_{2,16}, VR_LBP_{3,24} and VR_LBP_{4,32}, are adopted in the calculation to generate 4 VR_LBP feature maps, denoted P_1, P_2, P_3 and P_4, which serve as the input of the ResNet50 network;
step 3.2, inputting the four VR_LBP feature maps P_1, P_2, P_3 and P_4 of different scales into ResNet50, and marking the feature maps output by the last convolution block of the ResNet50 network as RP_1, RP_2, RP_3 and RP_4; inputting the 4 feature maps into residual dilated convolution blocks with the corresponding sampling rates, the 4 sampling rates corresponding to the R values of the respective VR_LBP feature maps; the 4 feature maps obtained after the residual dilated convolution blocks are denoted RPD_1, RPD_2, RPD_3 and RPD_4, and their corresponding elements are added to obtain the global feature: Global_Feature = RPD_1 ⊕ RPD_2 ⊕ RPD_3 ⊕ RPD_4;
Step 4, splicing and fusing the local features obtained in the step 2.4 and the global features obtained in the step 3.2, and finally realizing direction prediction;
step 4.1, down-sampling the Local _ Feature to the resolution ratio same as the Global _ Feature through bilinear interpolation, and then splicing and connecting to obtain a fused Feature; LG _ Feature = concat (down (Local _ Feature), global _ Feature), down representing downsampling;
step 4.2, performing global average pooling on the LG _ Feature to obtain a one-dimensional vector; then, the image direction is predicted through a 256 full-connection layer;
and 4.3, using the logistic regression maximum likelihood loss function as a loss function to realize direction classification and realize automatic prediction of image direction, wherein the loss function is defined as follows:
h_θ(x_i) = 1 / (1 + e^(−θ^T x_i))

J(θ) = −(1/m) Σ_{i=1}^{m} [ y_i log h_θ(x_i) + (1 − y_i) log(1 − h_θ(x_i)) ]
wherein h_θ(x) represents the probability that sample x belongs to a class; y_i is the predicted direction class, x_i represents the feature of the i-th sample, m is the number of samples, θ is the parameter solved by the network model, and T represents matrix transposition.
2. The method of claim 1, wherein the bottleneck layer BTNK in step 2.1 comprises BTNK1 and BTNK2; BTNK2 has 3 conv+BN+ReLU convolution blocks on its left branch, the convolution result F(x) is added to the input x, namely F(x) + x, and then passed through 1 ReLU activation function, and the numbers of input and output channels of this module are the same; BTNK1 has 3 conv+BN+ReLU convolution blocks F(x) on the left branch and 1 conv+BN convolution block G(x) on the right branch, which matches the difference between the input and output dimensions, i.e. makes the numbers of channels of F(x) and G(x) the same, after which the sum F(x) + G(x) is taken, so the numbers of input and output channels of this module are different; the ResNet50 network is formed by stacking a plurality of bottleneck layers BTNK of the two types.
3. The method of claim 1, wherein in step 2.2 the attention mechanism module CBAM combines a channel attention module and a spatial attention module; the channel attention module applies global max pooling and global average pooling to the input feature map to obtain two C × 1 × 1 features, where C represents the number of channels, and then sends the two C × 1 × 1 features into a two-layer neural network MLP;
the number of neurons in the first layer of the two-layer neural network MLP is C/r, where r is the reduction ratio, the activation function is ReLU, the number of neurons in the second layer is C, and the two-layer neural network MLP is shared by the two pooled features;
the two features output by the MLP are added element by element and passed through a sigmoid activation operation to generate the final channel attention feature; finally, the channel attention feature is multiplied element by element with the input feature to generate the input feature required by the spatial attention module;
the spatial attention module first applies global max pooling and global average pooling along the channel dimension to the output feature of the channel attention module to obtain two 1 × H × W single-channel pooled features, then concatenates the two 1 × H × W single-channel pooled features along the channel dimension, reduces the number of channels to 1 through a 7 × 7 convolution operation, generates the spatial attention feature through a sigmoid, and finally multiplies the sigmoid-generated spatial attention feature element by element with the input feature of the spatial attention module to obtain the weighted feature fusing channel attention and spatial attention.
4. The method of claim 1, wherein the VR_LBP feature map in step 3.1 takes a pixel in the image as the center point and, based on

(x_p, y_p) = (x_c + R·cos(2πp/P), y_c − R·sin(2πp/P)), p = 0, 1, …, P − 1,

obtains by interpolation the neighborhood points forming a circular sampling point set around the center point, where R is the radius and P is the number of sampling points; the value of the center pixel is then compared with the value of each neighboring pixel: if the neighboring pixel's value is greater than that of the center pixel, that neighborhood position is set to 1, otherwise it is set to 0; the circular sampling points are then read clockwise and finally combined into a binary number sequence, and the sequence is converted to decimal to give the VR_LBP_{R,P} code, calculated as:
VR_LBP_{R,P} = Σ_{i=0}^{P−1} t(gray_i − gray_c) · 2^i
wherein gray_c is the gray value of the current (center) pixel and gray_i is the gray value of its i-th neighborhood point; t(x) is 0 when x is less than 0, and 1 otherwise.
5. The method of claim 1, wherein in step 3.2 the dilated convolution has a hyper-parameter, the dilation rate, which defines the spacing between the values the convolution kernel acts on, i.e. (dilation rate − 1) zeros are inserted between the elements of the convolution kernel; therefore, when different dilation rates are set, the receptive fields differ and multi-scale information is obtained; the effective kernel size of the dilated convolution is K = k + (k − 1)(r − 1), where k is the size of the original convolution kernel and r is the dilation rate; 4 different dilated convolution kernels are used, one for each scale of VR_LBP, with r being 1, 2, 3 and 4, respectively.
6. The method of claim 1, wherein the residual dilated convolution block is formed by adding a shortcut connection (Shortcut Connection) consisting of a 1 × 1 convolution to one 3 × 3 dilated convolution; the shortcut connection matches the spatial dimensions of the feature maps, and the residual block realizes identity mapping while the convolution extracts image features.
7. The method of claim 1, wherein the window size of the global average pooling in step 4.2 is the size of the whole feature map: an average value is calculated over all pixels of the feature map of each output channel, so that a feature vector of size 1 × 1 × C is obtained after the global average pooling, where C is the number of channels of the original feature map.
CN202211406464.3A 2022-11-10 2022-11-10 Image direction prediction method based on multi-scale fusion and attention mechanism Pending CN115761258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211406464.3A CN115761258A (en) 2022-11-10 2022-11-10 Image direction prediction method based on multi-scale fusion and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211406464.3A CN115761258A (en) 2022-11-10 2022-11-10 Image direction prediction method based on multi-scale fusion and attention mechanism

Publications (1)

Publication Number Publication Date
CN115761258A true CN115761258A (en) 2023-03-07

Family

ID=85369046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211406464.3A Pending CN115761258A (en) 2022-11-10 2022-11-10 Image direction prediction method based on multi-scale fusion and attention mechanism

Country Status (1)

Country Link
CN (1) CN115761258A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129207A (en) * 2023-04-18 2023-05-16 江西师范大学 Image data processing method for attention of multi-scale channel
CN116129207B (en) * 2023-04-18 2023-08-04 江西师范大学 Image data processing method for attention of multi-scale channel
CN116563615A (en) * 2023-04-21 2023-08-08 南京讯思雅信息科技有限公司 Bad picture classification method based on improved multi-scale attention mechanism
CN116563615B (en) * 2023-04-21 2023-11-07 南京讯思雅信息科技有限公司 Bad picture classification method based on improved multi-scale attention mechanism
CN116740866A (en) * 2023-08-11 2023-09-12 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine
CN116740866B (en) * 2023-08-11 2023-10-27 上海银行股份有限公司 Banknote loading and clearing system and method for self-service machine


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination