CN116563615B - Bad picture classification method based on improved multi-scale attention mechanism - Google Patents

Bad picture classification method based on improved multi-scale attention mechanism

Info

Publication number
CN116563615B
CN116563615B
Authority
CN
China
Prior art keywords
feature
attention
feature map
picture
channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310434595.0A
Other languages
Chinese (zh)
Other versions
CN116563615A (en)
Inventor
吴馨
石晓涛
王哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Xunsiya Information Technology Co ltd
Original Assignee
Nanjing Xunsiya Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Xunsiya Information Technology Co ltd filed Critical Nanjing Xunsiya Information Technology Co ltd
Priority to CN202310434595.0A priority Critical patent/CN116563615B/en
Publication of CN116563615A publication Critical patent/CN116563615A/en
Application granted granted Critical
Publication of CN116563615B publication Critical patent/CN116563615B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/52Scale-space analysis, e.g. wavelet analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

A bad picture classification method based on an improved multi-scale attention mechanism relates to the technical fields of artificial intelligence, attention and multi-scale feature fusion, and picture classification. The picture to be classified is obtained and its quality is judged; picture enhancement is applied to low-quality pictures (e.g., blurred or noisy ones) to improve their definition, followed by preprocessing operations such as resizing and normalization. The improved multi-scale attention module is embedded in the ResUnit of the ResNet network model. The picture features are processed by the ResUnit module to obtain a calibration feature X'' that fuses the channel weights and the position weights; the calibration feature pays more attention to important positions and to the classification categories the user cares about. The improved ResNet network model outputs the classification category and score of the picture, and the results are post-processed according to the user-set threshold, categories of interest, etc., to obtain the final output.

Description

Bad picture classification method based on improved multi-scale attention mechanism
Technical Field
The invention relates to the technical fields of artificial intelligence, attention and multi-scale feature fusion, and picture classification, and in particular to a bad picture classification method based on an improved multi-scale attention mechanism.
Background
With the wide application and rapid development of the internet, harmful information in cyberspace, especially pornographic and violent pictures, has become an increasingly serious problem. It is therefore ever more important to strengthen network monitoring with bad picture classification techniques. Many bad picture classification methods based on deep learning models have been developed to improve the safety and health of the network environment.
In recent years, researchers have found that the human visual system automatically focuses on important regions when processing images. This way of focusing on important information, namely the attention mechanism, has been widely applied to image classification based on convolutional neural networks, improving model performance and accuracy. However, when extracting attention feature vectors, existing attention-based picture classification methods usually compress the channel-domain and spatial-domain information with global maximum pooling or global average pooling in order to reduce computation, and this compression loses a large amount of fusion information. The attention feature map is then obtained through a single convolution layer, whose limited receptive field cannot handle attention targets at multiple scales. These methods therefore show certain limitations when processing multi-scale or blurred targets.
Disclosure of Invention
Technical purpose: aiming at the defects of the prior art, the invention discloses a bad picture classification method based on an improved multi-scale attention mechanism.
A bad picture classification method based on an improved multi-scale attention mechanism comprises the following steps:
Step S1: a feature map X ∈ R^(C*H*W) is obtained by processing the picture to be classified through the network model; this feature map is the input feature map of the scheme, with dimensions C, H and W;
Step S2: a 3*3 convolution and a 3*3 dilated convolution (dilation rate = 2) are applied to the input feature map respectively, and two feature maps F1, F2 with different receptive fields are obtained through the two transformations:
F1 = f(X) ∈ R^(H*W*C), F2 = f'(X) ∈ R^(H*W*C)
Step S3: global average pooling is applied to the two feature maps respectively to obtain two 1*1*C feature vectors corresponding to the two different receptive fields, which are then fused into a feature vector s:
s = AvgPool(F1) + AvgPool(F2)
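The pooling-and-fusion step above can be sketched in a few lines of NumPy. Note that F1 and F2 here are random stand-ins for the outputs of the two convolutions, and the H*W*C layout follows the description; this is an illustrative sketch, not the patented implementation.

```python
import numpy as np

# F1, F2 stand in for the outputs of the 3*3 convolution and the
# 3*3 dilated convolution (dilation rate 2); layout is H*W*C.
H, W, C = 8, 8, 16
rng = np.random.default_rng(0)
F1 = rng.standard_normal((H, W, C))
F2 = rng.standard_normal((H, W, C))

def global_avg_pool(F):
    """Collapse the spatial dimensions, leaving one value per channel (1*1*C)."""
    return F.mean(axis=(0, 1))

# Step S3: pool each branch, then fuse the two pooled vectors by addition.
s = global_avg_pool(F1) + global_avg_pool(F2)
print(s.shape)  # (16,)
```

Because each branch sees a different receptive field, the fused vector s carries multi-scale channel statistics rather than a single-scale summary.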
Step S4: a hyperparameter K is introduced, and a (K+2)*1 dilated convolution and a K*1 convolution are applied to the fused feature vector to obtain feature 1 and feature 2 respectively; feature 1 and feature 2 are then fused to obtain the channel-domain attention feature map, i.e., the channel weight Ac:
Ac = σ(C_(K+2)*1(s) + C_K*1(s))
where r and b are hyperparameters with r = 2 and b = 1, C is the number of channels of the fused feature vector, and σ is the sigmoid function;
Step S5: the channel weight Ac obtained in step S4 is applied to the original input feature map to obtain the calibration feature map X' fusing the channel attention:
X' = F_scale(X, Ac) = X * Ac, X' ∈ R^(H*W*C)
that is, the H*W values of each channel of the original input feature map X are multiplied by the weight of the corresponding channel;
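Steps S4 and S5 can be sketched as follows. The 1-D kernels w1 and w2 are random illustrative weights rather than learned parameters, and conv1d_same is a minimal hand-rolled convolution, so this shows the data flow under those assumptions, not the patented module itself.

```python
import numpy as np

def conv1d_same(x, kernel, dilation=1):
    """Minimal 1-D convolution over a channel vector with 'same' padding."""
    k = len(kernel)
    span = dilation * (k - 1)
    pad = span // 2
    xp = np.pad(x, (pad, span - pad))
    return np.array([
        sum(kernel[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

C = 16
K = 3  # hyperparameter K from step S4 (illustrative choice)
rng = np.random.default_rng(1)
s = rng.standard_normal(C)           # fused 1*1*C vector from step S3
w1 = rng.standard_normal(K + 2) / K  # (K+2)*1 dilated-conv weights (illustrative)
w2 = rng.standard_normal(K) / K      # K*1 conv weights (illustrative)

# Step S4: fuse the two branch outputs and squash to (0, 1) channel weights.
Ac = sigmoid(conv1d_same(s, w1, dilation=2) + conv1d_same(s, w2))

# Step S5: rescale the input feature map channel-wise, X' = X * Ac.
X = rng.standard_normal((8, 8, C))
X_prime = X * Ac  # broadcasts the per-channel weight over H and W
```

With K = 3, the two branches cover channel neighborhoods of different sizes, which is the multi-scale idea behind step S4.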
Step S6: the calibration feature map X' fusing the channel attention is taken as the input feature map for obtaining the spatial attention feature map;
Step S7: global maximum pooling and global average pooling are applied to the calibration feature map X' along the channel domain to obtain S1 and S2 respectively, embedding the channel-domain information into the spatial domain; S1 and S2 are then fused to obtain a feature map S:
S = concat(AvgPool(X'), MaxPool(X'))
where S ∈ R^(H*W*2);
Step S8: two 3*3 dilated convolutions (dilation rate = 2) are applied to the fused feature map S to obtain two features s1 and s2 with different receptive fields; s1 and s2 are then concatenated to obtain a feature map F fusing importance at different scales:
F = concat(s1, C_3*3(s1)) ∈ R^(1*H*W), where s2 = C_3*3(s1)
Step S9: a spatial attention feature map, i.e., the spatial position weight As, is obtained from the feature map F (which fuses importance at different scales) by a sigmoid function; the value of each pixel in As represents the importance of that pixel in the feature map;
Step S10: the spatial position weight As obtained in step S9 is applied to the calibration feature map X' fusing the channel attention:
X'' = F_scale(X', As) = X' * As, X'' ∈ R^(H*W*C)
that is, the values on all channels at each position of the calibration feature map X' are multiplied by the weight of the corresponding spatial position.
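A minimal NumPy sketch of steps S7 through S10. A fixed averaging kernel stands in for the two learned 3*3 dilated convolutions of step S8, so this only illustrates the data flow, not the learned behavior:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H, W, C = 8, 8, 16
rng = np.random.default_rng(2)
X_prime = rng.standard_normal((H, W, C))  # channel-calibrated map X' from step S5

# Step S7: pool along the channel axis and stack, giving S in R^(H*W*2).
S = np.stack([X_prime.mean(axis=2), X_prime.max(axis=2)], axis=2)

# Steps S8-S9: the patent applies two 3*3 dilated convolutions and concatenation
# here; a plain channel average stands in for that learned fusion, yielding one
# importance value per pixel after the sigmoid.
F = S.mean(axis=2)  # placeholder for the fused multi-scale map, shape (H, W)
As = sigmoid(F)     # spatial position weights, one per pixel, in (0, 1)

# Step S10: rescale every channel at each position, X'' = X' * As.
X_double_prime = X_prime * As[:, :, None]
```

The broadcast in the last line multiplies all C channel values at each spatial position by that position's weight, exactly mirroring the per-channel broadcast of step S5 transposed into the spatial domain.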
Compared with the prior art, the invention has the following advantages:
1. Features at different scales are extracted from the input feature map by a 3*3 convolution and a 3*3 dilated convolution and then fused, which is more reasonable than directly applying global average pooling. Global average pooling compresses the feature map of each channel into a single value, and losing so much position information reduces detection accuracy. Extracting and fusing features with the 3*3 convolution and the 3*3 dilated convolution yields a multi-scale feature representation and retains more position information.
2. A (K+2)*1 dilated convolution and a K*1 convolution are applied to the fused feature vector to obtain feature 1 and feature 2, which are then fused to obtain the channel-domain attention feature map. Enlarging the receptive field widens the scope of feature extraction so that more channel information is fused, features of different scales and directions are better captured, and the richness and discriminability of the feature representation are improved, making the obtained channel weights more reasonable and accurate.
3. When acquiring the spatial weights, position information from different receptive fields is fused through two 3*3 dilated convolutions and a skip connection, capturing wider picture features and richer position information; this makes the obtained position weights more reasonable and accurate and improves classification accuracy.
Drawings
FIG. 1 is a flow chart of acquiring the channel-domain attention feature map according to the present invention;
FIG. 2 is a flow chart of acquiring the spatial-domain attention feature map according to the present invention;
FIG. 3 is a flow chart of acquiring the attention-fused feature map according to the present invention;
FIG. 4 is a flow chart of bad picture classification based on the improved multi-scale attention mechanism of the present invention.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings:
As shown in FIG. 3, a bad picture classification method based on an improved multi-scale attention mechanism is characterized by comprising the following steps:
Step S1: as shown in FIG. 1, a feature map X ∈ R^(C*H*W) is obtained by processing the picture to be classified through the network model; this feature map is the input feature map of the scheme, with dimensions C, H and W;
Step S2: as shown in FIG. 1, a 3*3 convolution and a 3*3 dilated convolution (dilation rate = 2) are applied to the input feature map of step S1 respectively, and two feature maps F1, F2 with different receptive fields are obtained through the two transformations:
F1 = f(X) ∈ R^(H*W*C), F2 = f'(X) ∈ R^(H*W*C)
Step S3: as shown in FIG. 1, global average pooling is applied to the two feature maps of step S2 respectively to obtain two 1*1*C feature vectors corresponding to the two different receptive fields, which are then fused into a feature vector s:
s = AvgPool(F1) + AvgPool(F2)
Step S4: as shown in FIG. 1, a hyperparameter K is introduced, and a (K+2)*1 dilated convolution and a K*1 convolution are applied to the fused feature vector of step S3 to obtain feature 1 and feature 2 respectively; feature 1 and feature 2 are then fused to obtain the channel-domain attention feature map, i.e., the channel weight Ac:
Ac = σ(C_(K+2)*1(s) + C_K*1(s))
where r and b are hyperparameters with r = 2 and b = 1, C is the number of channels of the fused feature vector, and σ is the sigmoid function;
Step S5: as shown in FIG. 2, the channel weight Ac obtained in step S4 is applied to the original input feature map to obtain the calibration feature map X' fusing the channel attention:
X' = F_scale(X, Ac) = X * Ac, X' ∈ R^(H*W*C)
that is, the H*W values of each channel of the original input feature map X are multiplied by the weight of the corresponding channel;
Step S6: the calibration feature map X' fusing the channel attention is taken as the input feature map for obtaining the spatial attention feature map;
Step S7: global maximum pooling and global average pooling are applied to the calibration feature map X' along the channel domain to obtain S1 and S2 respectively, embedding the channel-domain information into the spatial domain; S1 and S2 are then fused to obtain a feature map S:
S = concat(AvgPool(X'), MaxPool(X'))
where S ∈ R^(H*W*2);
Step S8: two 3*3 dilated convolutions (dilation rate = 2) are applied to the fused feature map S to obtain two features s1 and s2 with different receptive fields; s1 and s2 are then concatenated to obtain a feature map F fusing importance at different scales:
F = concat(s1, C_3*3(s1)) ∈ R^(1*H*W), where s2 = C_3*3(s1)
Step S9: a spatial attention feature map, i.e., the spatial position weight As, is obtained from the feature map F by a sigmoid function; the value of each pixel in As represents the importance of that pixel in the feature map;
Step S10: the spatial position weight As obtained in step S9 is applied to the calibration feature map X' fusing the channel attention:
X'' = F_scale(X', As) = X' * As, X'' ∈ R^(H*W*C)
that is, the values on all channels at each position of the calibration feature map X' are multiplied by the weight of the corresponding spatial position.
As shown in FIG. 4, the present invention classifies bad pictures as follows:
1. Obtain the picture to be classified and judge its quality; apply picture enhancement to low-quality pictures (e.g., blurred or noisy ones) to improve their definition, and then perform preprocessing operations. Picture enhancement operations include, but are not limited to, super-resolution reconstruction and image deblurring; preprocessing operations include, but are not limited to, resizing (Resize) and normalization.
2. Embed the improved multi-scale attention module into the ResUnit of the ResNet network model, and feed the preprocessed picture into the improved ResNet network model.
3. The picture features are processed by the ResUnit module to obtain the calibration feature X'' that fuses the channel weights and the position weights; the calibration feature pays more attention to important positions and to the classification categories the user cares about:
X' = F_scale(X, Ac) = X * Ac, X' ∈ R^(H*W*C)
X'' = F_scale(X', As) = X' * As, X'' ∈ R^(H*W*C)
4. The improved ResNet network model outputs the classification category and score of the picture. With ResNet as the backbone, a custom classification model can be trained: the ResNet model is improved by introducing the fused weights into each ResUnit, and the classification model for predicting pictures is trained on the improved ResNet model.
5. Post-process the results according to the user-set threshold, categories of interest, etc., to obtain the final output.
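The post-processing of step 5 can be sketched as a simple filter. The category names and threshold below are illustrative assumptions, not values given in the patent:

```python
# The model outputs (category, score) pairs; filter_results keeps only the
# categories the user cares about whose score clears the user-set threshold.
# Category names and the default threshold are hypothetical examples.
def filter_results(predictions, threshold=0.8, watch_categories=("porn", "violence")):
    """Post-process raw classification results into the final output."""
    return [
        (category, score)
        for category, score in predictions
        if category in watch_categories and score >= threshold
    ]

raw = [("porn", 0.93), ("normal", 0.05), ("violence", 0.62)]
print(filter_results(raw))  # [('porn', 0.93)]
```

Lowering the threshold or widening the watch list changes what survives post-processing, which is exactly the user-controlled behavior step 5 describes.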

Claims (1)

1. The bad picture classification method based on the improved multi-scale attention mechanism is characterized by comprising the following steps:
(1) Obtain the picture to be classified and judge its quality; apply picture enhancement to low-quality pictures to improve their definition, and then perform resizing and normalization preprocessing operations;
(2) Embed the improved multi-scale attention module into a ResUnit of a ResNet network model, and feed the preprocessed picture into the improved ResNet network model;
(3) The picture features are processed by the ResUnit module to obtain the calibration feature X'' that fuses the channel weights and the position weights; the calibration feature pays more attention to important positions and to the classification categories the user cares about;
(4) The improved ResNet network model outputs the classification category and score of the picture; with ResNet as the backbone, the classification model is trained: the ResNet model is improved by introducing the fused weights into each ResUnit, and the classification model for predicting pictures is trained on the improved ResNet model;
(5) Post-process the results according to the user-set threshold and the categories of interest to obtain the final output;
Specifically:
Step S1: a feature map X ∈ R^(C*H*W) is obtained by processing the picture to be classified through the network model; this feature map is the input feature map, with dimensions C, H and W;
Step S2: a 3*3 convolution and a 3*3 dilated convolution (dilation rate = 2) are applied to the input feature map respectively, and two feature maps F1, F2 with different receptive fields are obtained through the two transformations:
F1 = f(X) ∈ R^(H*W*C), F2 = f'(X) ∈ R^(H*W*C)
Step S3: global average pooling is applied to the two feature maps respectively to obtain two 1*1*C feature vectors corresponding to the two different receptive fields, which are then fused to obtain a feature vector s:
s = AvgPool(F1) + AvgPool(F2)
Step S4: a hyperparameter K is introduced, and a (K+2)*1 dilated convolution and a K*1 convolution are applied to the fused feature vector to obtain feature 1 and feature 2 respectively; feature 1 and feature 2 are then fused to obtain the channel-domain attention feature map, i.e., the channel weight Ac:
Ac = σ(C_(K+2)*1(s) + C_K*1(s))
where r and b are hyperparameters with r = 2 and b = 1, C is the number of channels of the fused feature vector, and σ is the sigmoid function;
Step S5: the channel weight Ac obtained in step S4 is applied to the original input feature map to obtain the calibration feature map X' fusing the channel attention:
X' = F_scale(X, Ac) = X * Ac, X' ∈ R^(H*W*C)
that is, the H*W values of each channel of the original input feature map X are multiplied by the weight of the corresponding channel;
Step S6: the calibration feature map X' fusing the channel attention is taken as the input feature map for obtaining the spatial attention feature map;
Step S7: global maximum pooling and global average pooling are applied to the calibration feature map X' along the channel domain to obtain S1 and S2 respectively, embedding the channel-domain information into the spatial domain; S1 and S2 are then fused to obtain a feature map S:
S = concat(AvgPool(X'), MaxPool(X'))
where S ∈ R^(H*W*2);
Step S8: two 3*3 dilated convolutions (dilation rate = 2) are applied to the fused feature map S to obtain two features s1 and s2 with different receptive fields; s1 and s2 are then concatenated to obtain a feature map F fusing importance at different scales:
F = concat(s1, C_3*3(s1)) ∈ R^(1*H*W), where s2 = C_3*3(s1)
Step S9: a spatial attention feature map, i.e., the spatial position weight As, is obtained from the feature map F by a sigmoid function; the value of each pixel in As represents the importance of that pixel in the feature map;
Step S10: the spatial position weight As obtained in step S9 is applied to the calibration feature map X' fusing the channel attention:
X'' = F_scale(X', As) = X' * As, X'' ∈ R^(H*W*C)
that is, the values on all channels at each position of the calibration feature map X' are multiplied by the weight of the corresponding spatial position.
CN202310434595.0A 2023-04-21 2023-04-21 Bad picture classification method based on improved multi-scale attention mechanism Active CN116563615B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310434595.0A CN116563615B (en) 2023-04-21 2023-04-21 Bad picture classification method based on improved multi-scale attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310434595.0A CN116563615B (en) 2023-04-21 2023-04-21 Bad picture classification method based on improved multi-scale attention mechanism

Publications (2)

Publication Number Publication Date
CN116563615A (en) 2023-08-08
CN116563615B (en) 2023-11-07

Family

ID=87490813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310434595.0A Active CN116563615B (en) 2023-04-21 2023-04-21 Bad picture classification method based on improved multi-scale attention mechanism

Country Status (1)

Country Link
CN (1) CN116563615B (en)

Citations (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2455088A1 (en) * 2003-02-28 2004-08-28 Eastman Kodak Company Method and system for enhancing portrait images that are processed in a batch mode
WO2019037654A1 (en) * 2017-08-23 2019-02-28 京东方科技集团股份有限公司 3d image detection method and apparatus, electronic device, and computer readable medium
CN110414377A (en) * 2019-07-09 2019-11-05 武汉科技大学 A kind of remote sensing images scene classification method based on scale attention network
CN111199233A (en) * 2019-12-30 2020-05-26 四川大学 Improved deep learning pornographic image identification method
WO2020108366A1 (en) * 2018-11-27 2020-06-04 腾讯科技(深圳)有限公司 Image segmentation method and apparatus, computer device, and storage medium
WO2020253663A1 (en) * 2019-06-20 2020-12-24 腾讯科技(深圳)有限公司 Artificial intelligence-based image region recognition method and apparatus, and model training method and apparatus
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
WO2021115159A1 (en) * 2019-12-09 2021-06-17 中兴通讯股份有限公司 Character recognition network model training method, character recognition method, apparatuses, terminal, and computer storage medium therefor
WO2021139069A1 (en) * 2020-01-09 2021-07-15 南京信息工程大学 General target detection method for adaptive attention guidance mechanism
CN113610144A (en) * 2021-08-02 2021-11-05 合肥市正茂科技有限公司 Vehicle classification method based on multi-branch local attention network
WO2021249255A1 (en) * 2020-06-12 2021-12-16 青岛理工大学 Grabbing detection method based on rp-resnet
US11222217B1 (en) * 2020-08-14 2022-01-11 Tsinghua University Detection method using fusion network based on attention mechanism, and terminal device
CN114067107A (en) * 2022-01-13 2022-02-18 中国海洋大学 Multi-scale fine-grained image recognition method and system based on multi-grained attention
CN114091551A (en) * 2021-10-22 2022-02-25 北京奇艺世纪科技有限公司 Pornographic image identification method and device, electronic equipment and storage medium
CN114202502A (en) * 2021-08-30 2022-03-18 浙大宁波理工学院 Thread turning classification method based on convolutional neural network
WO2022127227A1 (en) * 2020-12-15 2022-06-23 西安交通大学 Multi-view semi-supervised lymph node classification method and system, and device
CN114708511A (en) * 2022-06-01 2022-07-05 成都信息工程大学 Remote sensing image target detection method based on multi-scale feature fusion and feature enhancement
WO2022160771A1 (en) * 2021-01-26 2022-08-04 武汉大学 Method for classifying hyperspectral images on basis of adaptive multi-scale feature extraction model
CN115331109A (en) * 2022-08-27 2022-11-11 南京理工大学 Remote sensing image target detection method based on rotation equal-variation convolution channel attention enhancement and multi-scale feature fusion
CN115761258A (en) * 2022-11-10 2023-03-07 山西大学 Image direction prediction method based on multi-scale fusion and attention mechanism

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106295678B (en) * 2016-07-27 2020-03-06 北京旷视科技有限公司 Neural network training and constructing method and device and target detection method and device
CN108875752B (en) * 2018-03-21 2022-06-07 北京迈格威科技有限公司 Image processing method and apparatus, computer readable storage medium
US20220415027A1 (en) * 2021-06-29 2022-12-29 Shandong Jianzhu University Method for re-recognizing object image based on multi-feature information capture and correlation analysis


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Multi-Scale Feature Fusion: Learning Better Semantic Segmentation for Road Pothole Detection; Jiahe Fan et al.; arXiv; full text *
Retinal OCT image classification based on multi-scale residual networks; Li Bing et al.; Proceedings of the 2022 China Automation Congress; full text *

Also Published As

Publication number Publication date
CN116563615A (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN111080629B (en) Method for detecting image splicing tampering
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN110348319B (en) Face anti-counterfeiting method based on face depth information and edge image fusion
CN111612008B (en) Image segmentation method based on convolution network
CN110648334A (en) Multi-feature cyclic convolution saliency target detection method based on attention mechanism
CN110879982B (en) Crowd counting system and method
CN112541864A (en) Image restoration method based on multi-scale generation type confrontation network model
CN112052877B (en) Picture fine granularity classification method based on cascade enhancement network
CN114782298B (en) Infrared and visible light image fusion method with regional attention
CN111161224A (en) Casting internal defect grading evaluation system and method based on deep learning
CN110020658B (en) Salient object detection method based on multitask deep learning
CN113222124B (en) SAUNet + + network for image semantic segmentation and image semantic segmentation method
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN111008570B (en) Video understanding method based on compression-excitation pseudo-three-dimensional network
CN113807356B (en) End-to-end low-visibility image semantic segmentation method
Babu et al. An efficient image dehazing using GoogLeNet based convolutional neural networks
CN116452469B (en) Image defogging processing method and device based on deep learning
CN111401209B (en) Action recognition method based on deep learning
CN111696090A (en) Method for evaluating quality of face image in unconstrained environment
CN116630245A (en) Polyp segmentation method based on saliency map guidance and uncertainty semantic enhancement
CN116563615B (en) Bad picture classification method based on improved multi-scale attention mechanism
CN109165551B (en) Expression recognition method for adaptively weighting and fusing significance structure tensor and LBP characteristics
CN113450313B (en) Image significance visualization method based on regional contrast learning
CN115019367A (en) Genetic disease face recognition device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant