CN115131782A - Image small target classification method based on multi-scale features and attention - Google Patents

Image small target classification method based on multi-scale features and attention

Info

Publication number
CN115131782A
CN115131782A (application CN202210768041.XA)
Authority
CN
China
Prior art keywords
attention
scale
convolution
scale features
mfanet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210768041.XA
Other languages
Chinese (zh)
Inventor
元昌安
覃晓
龙珑
牙姗姗
蒋建辉
陈龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Academy of Sciences
Original Assignee
Guangxi Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Academy of Sciences filed Critical Guangxi Academy of Sciences
Priority to CN202210768041.XA priority Critical patent/CN115131782A/en
Publication of CN115131782A publication Critical patent/CN115131782A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/66Trinkets, e.g. shirt buttons or jewellery items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection


Abstract

The invention discloses an image small-target classification method based on multi-scale features and attention, which comprises the following steps: S1, designing an MFA module based on multi-scale features and an attention mechanism; S2, taking the ResNet-50 network structure as a reference, replacing the 3×3 convolution block in each ResNet-50 residual block with the MFA module to obtain MFANet, a deep neural network model based on multi-scale features and attention; S3, training the multi-scale feature and attention-based deep neural network model MFANet with a small-target image data set; and S4, recognizing small targets in an image to be recognized with the trained MFANet. The invention improves small-target recognition accuracy while consuming fewer computing resources.

Description

Image small target classification method based on multi-scale features and attention
Technical Field
The invention relates to the technical field of computer vision. More particularly, the invention relates to a method for classifying small objects in images based on multi-scale features and attention.
Background
Current image classification methods based on convolutional neural networks classify large-scale objects in images, such as people and animals, with good results. However, the classification of small objects against a complex background, such as collar classification in garment pictures, remains insufficiently studied, so a classification method for small objects in complex scenes is needed.
ResNet is an excellent convolutional neural network for image classification; its residual connections solve the gradient explosion (vanishing) problem caused by very deep networks. In recent years, much research has shown that designing multi-scale feature extraction into a neural network structure improves performance. Res2Net improves the ResNet structure by building hierarchical residual connections from several convolution operators of a single size, giving ResNet multi-scale feature extraction capability. However, forming different receptive fields by stacking many layers of single-size convolution operators makes the model overly complex, and the large number of convolution operations brings excessive computational cost.
Attention mechanisms let a neural network adaptively focus on the important parts of an image and have been widely applied to visual tasks. Existing attention modules such as SE, BAM, CBAM and CA fuse channel and spatial information through attention, but to avoid extra computational overhead they usually apply a dimensionality-reduction operation when extracting channel attention, and this reduction loses some information.
On one hand, multi-scale feature information lets a model better grasp object-level information and thus better understand context; on the other hand, attention further helps the model focus adaptively on important parts. How to improve small-target recognition accuracy while consuming fewer computing resources, however, remains an urgent open problem.
Disclosure of Invention
An object of the present invention is to solve the above-described problems and provide advantages which will be described later.
It is still another object of the present invention to provide an image small-target classification method based on multi-scale features and attention that improves small-target recognition accuracy while consuming fewer computing resources.
To achieve these objects and other advantages in accordance with the purpose of the invention, there is provided an image small object classification method based on multi-scale features and attention, including:
S1, designing an MFA module based on multi-scale features and an attention mechanism;
S2, taking the ResNet-50 network structure as a reference, replacing the 3×3 convolution block in each ResNet-50 residual block with the MFA module to obtain MFANet, a deep neural network model based on multi-scale features and attention;
S3, training the multi-scale feature and attention-based deep neural network model MFANet with a small-target image data set;
and S4, recognizing small targets in an image to be recognized with the trained multi-scale feature and attention-based deep neural network model MFANet.
Preferably, in step S1, the process of designing the MFA module based on the multi-scale features and attention mechanism includes:
S101, performing parallel depthwise separable convolutions on an input feature map with convolution operators of different sizes to obtain a plurality of feature maps containing different scale information from the input feature map;
S102, obtaining the channel weight vector of each feature map containing different scale information with an attention mechanism;
S103, concatenating the feature maps containing different scale information of the input feature map, and performing element-wise multiplication between the concatenated feature map and the weight vectors to highlight important region representations.
Preferably, in step S101, the convolution kernel sizes of the parallel depthwise separable convolutions are set to K = (1, 3, 5, 7), and the operation of step S101 is expressed as:
F_i = Conv(1×1)(Conv(k_i×k_i, g=C)(X)), i = 0, 1, 2, …, S−1;
where F_i is the feature map containing one scale of information obtained after the convolutions, Conv is the convolution operation, k_i is the convolution kernel size, g is the number of convolution groups, C is the number of channels of the input feature map, and X is the input feature map.
Preferably, in step S102, the attention mechanism is the ECA attention mechanism, and the operation of step S102 is expressed as:
Z_i = ECA(F_i), i = 0, 1, 2, …, S−1;
where Z_i is the attention weight vector and ECA is the method used to extract channel attention.
Preferably, in step S103, the feature maps containing different scale information of the input feature map are concatenated, and element-wise multiplication between the concatenated feature map and the weight vectors highlights the important regions; the operation is expressed as:
F = Cat([F_0, F_1, …, F_{S−1}]);
X_Out = δ(Cat([Z_0, Z_1, …, Z_{S−1}])) ⊙ F;
where F is the feature map obtained after concatenation, Cat is the concatenation operation, X_Out is the feature map output by step S103, and δ is the SoftMax function.
Preferably, the process of training the multi-scale feature and attention-based deep neural network model MFANet with the small-target image data set in step S3 includes:
S301, performing data enhancement on the small-target image data set with random horizontal flipping, and converting it into the Tensor format;
S302, setting the number of categories output by the fully connected layer of the multi-scale feature and attention-based deep neural network model MFANet according to the number of small-target categories in the small-target image data set;
S303, optimizing with an Adam optimizer, setting the initial learning rate to 0.01, and realizing a cosine-annealing-with-restarts learning rate mechanism with the custom learning rate adjustment function LambdaLR.
The invention has at least the following beneficial effects: through a single network, small targets in a complex image can be accurately detected and classified without additional manual annotation such as bounding boxes or key points; the multi-scale feature extraction method acquires fine-grained object-level information and effectively eliminates the interference of noise on classification accuracy, while the depthwise separable convolutions reduce the computational overhead of multiple convolutions; and the attention weights are obtained with ECA, a lightweight channel attention method without dimensionality reduction, which avoids the information loss caused by channel dimensionality reduction.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 is a flowchart of a method for classifying small objects based on multi-scale features and attention according to an embodiment of the present invention;
FIG. 2 is a flow diagram of the design of a multi-scale feature and attention mechanism based MFA module according to an embodiment of the present invention;
FIG. 3 is a block diagram of an MFA according to an embodiment of the present invention;
FIG. 4 is a hierarchy diagram of the MFANet based on multi-scale features and attention in the embodiment of the present invention;
FIG. 5 is a flowchart of the training of the MFANet based deep neural network model with multi-scale features and attention according to the embodiment of the present invention;
FIG. 6 is a flowchart of the multi-scale feature and attention based deep neural network model MFANet for small target recognition classification according to the embodiment of the present invention;
fig. 7 is a diagram illustrating an example of partial data of a collar data set according to an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawings so that those skilled in the art can implement the invention by referring to the description text.
It is to be noted that the experimental methods described in the following embodiments are conventional unless otherwise specified, and the reagents and materials are commercially available unless otherwise specified. In the description of the present invention, terms such as "lateral", "longitudinal", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner" and "outer" indicate orientations or positional relationships based on the drawings; they are used only for convenience and simplicity of description, do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation, and should not be construed as limiting the invention.
Small objects are defined by the relative proportion of the object to the image; typically the median ratio of the object bounding-box area to the image area lies between 0.08% and 0.58%. The following definitions are also common: 1. the ratio of the width and height of the target bounding box to the width and height of the image is less than a threshold, commonly 0.1; 2. the ratio of the target bounding-box area to the image area is less than a threshold, commonly 0.03; 3. the small object is defined by the ratio of the pixels actually covered by the object to the total pixels of the image. A minimal check following these criteria is sketched below.
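The following snippet illustrates the two threshold-based definitions above; it is a minimal sketch, and the function name and default thresholds are ours, not part of the patent.

```python
def is_small_target(box_w, box_h, img_w, img_h,
                    area_ratio_thresh=0.03, side_ratio_thresh=0.1):
    """Return True if the box qualifies as a small target under either of
    the two common threshold definitions described in the text."""
    area_ratio = (box_w * box_h) / (img_w * img_h)   # definition 2
    side_ratio = max(box_w / img_w, box_h / img_h)   # definition 1
    return area_ratio < area_ratio_thresh or side_ratio < side_ratio_thresh

# Example: a 60x40 collar region in an 800x800 product photo
print(is_small_target(60, 40, 800, 800))  # True (area ratio = 0.00375)
```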
As shown in FIG. 1, the invention provides a method for classifying small objects of images based on multi-scale features and attention, which comprises the following steps:
S1, designing an MFA module based on multi-scale features and an attention mechanism;
specifically, as shown in fig. 2, the process of designing the MFA module based on the multi-scale features and attention mechanism includes:
S101, performing parallel depthwise separable convolutions on an input feature map with convolution operators of different sizes to obtain a plurality of feature maps containing different scale information from the input feature map;
here, the convolution operator group of the parallel depth separable convolution operation may be set to K ═ 1, 3, 5, 7, and if S feature maps are obtained after the input feature maps are subjected to the depth separable convolution operation, the number of channels of each feature map is 1/S of the number of channels of the input feature map, and then the operation of step S101 may be represented as:
F i =Conv(1×1)(conv(k i ×k i ,g=C)(X)) i=0,1,2…,S-1;
wherein, F i Conv is convolution operation, k is a characteristic diagram containing information of different scales obtained after the convolution operation i The size of the convolution kernel, g the number of groups of convolution, C the number of channels of the input feature map, and X the input feature map.
For small-target recognition, a convolution operator of a single size cannot extract multi-scale information well, while an overly large convolution operator introduces noise and increases the amount of computation; the parallel multi-size design avoids both problems, as sketched below.
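The following PyTorch sketch shows one way to realise step S101 under the settings above (K = (1, 3, 5, 7), each branch outputting C/S channels); the layer names and module structure are our assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class MultiScaleDWConv(nn.Module):
    """Parallel depthwise separable convolutions of step S101."""
    def __init__(self, channels, kernels=(1, 3, 5, 7)):
        super().__init__()
        s = len(kernels)
        assert channels % s == 0, "C must be divisible by S"
        self.branches = nn.ModuleList(
            nn.Sequential(
                # depthwise k_i x k_i convolution with g = C groups
                nn.Conv2d(channels, channels, k, padding=k // 2,
                          groups=channels, bias=False),
                # pointwise 1x1 convolution reducing C -> C/S channels
                nn.Conv2d(channels, channels // s, 1, bias=False),
            )
            for k in kernels
        )

    def forward(self, x):
        # returns the list [F_0, ..., F_{S-1}]
        return [branch(x) for branch in self.branches]

# e.g. C = 64, S = 4 -> four feature maps of 16 channels each
feats = MultiScaleDWConv(64)(torch.randn(1, 64, 56, 56))
print([tuple(f.shape) for f in feats])  # 4 x (1, 16, 56, 56)
```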
S102, obtaining the channel weight vector of each feature map containing different scale information with an attention mechanism;
the attention mechanism here may adopt an ECA attention mechanism, and the operation of step S102 may be expressed as:
Z i =ECA(F i ),i=0,1,2…,S-1;
wherein, Z i An attention weight vector is represented and ECA represents the method used in extracting the channel attention.
The ECA attention mechanism is lightweight and performs no dimensionality reduction, so it avoids the information loss caused by channel dimensionality reduction; a sketch follows.
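The sketch below follows the published ECA-Net design (global average pooling followed by a 1-D convolution across channels); the adaptive kernel-size rule and the hyperparameters gamma and b come from that paper and are not specified in this patent.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """ECA channel attention: no dimensionality reduction."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1  # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                    # global average pool -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # 1-D conv across channels
        return torch.sigmoid(y)                   # channel weight vector Z_i
```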
S103, concatenating the feature maps containing different scale information of the input feature map, and performing element-wise multiplication between the concatenated feature map and the weight vectors to highlight important region representations.
The operation of step S103 may be expressed as:
F = Cat([F_0, F_1, …, F_{S−1}]);
X_Out = δ(Cat([Z_0, Z_1, …, Z_{S−1}])) ⊙ F;
where F is the feature map obtained after concatenation, Cat is the concatenation operation, X_Out is the feature map output by step S103, and δ is the SoftMax function.
The MFA module constructed by the above process is shown in fig. 3; a sketch assembling the pieces above is given below.
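The following sketch assembles MultiScaleDWConv and ECA from the previous sketches into a full MFA module. Two details are our assumptions: the ECA module is shared across the S branches, and the SoftMax is taken across the branches per channel position (an EPSANet-style recalibration); the patent text fixes neither.

```python
class MFA(nn.Module):
    """MFA module: steps S101 (multi-scale), S102 (ECA), S103 (fusion)."""
    def __init__(self, channels, kernels=(1, 3, 5, 7)):
        super().__init__()
        self.s = len(kernels)
        self.ms_conv = MultiScaleDWConv(channels, kernels)
        self.eca = ECA(channels // self.s)

    def forward(self, x):
        feats = self.ms_conv(x)              # [F_0, ..., F_{S-1}]
        atts = [self.eca(f) for f in feats]  # [Z_0, ..., Z_{S-1}]
        f = torch.cat(feats, dim=1)          # F = Cat([...]): (N, C, H, W)
        z = torch.cat(atts, dim=1)           # (N, C)
        c = f.shape[1]
        # delta: SoftMax across the S branches for each channel position
        z = torch.softmax(z.view(-1, self.s, c // self.s), dim=1)
        z = z.reshape(-1, c, 1, 1)           # broadcast over H and W
        return f * z                         # X_Out: reweighted features
```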
S2, taking the ResNet-50 network structure as a reference, replacing the 3×3 convolution block in each ResNet-50 residual block with the MFA module to obtain MFANet, a deep neural network model based on multi-scale features and attention;
a deep neural network model MFANet based on multi-scale features and attention draws reference to a ResNet-50 network structure to build a classification network, and as the strong feature extraction capability of the deep neural network comes from continuously superposed convolution operation, the deeper the network is designed, the better the feature extraction capability is, however, the gradient disappearance (explosion) problem can be caused when the network is deepened by stacking the convolution operation in a tasting way, and the ResNet designs a residual structure to well solve the problem. After the advent of ResNet, many of the subsequent studies were based on improvements in the ResNet model. The application replaces the 3 × 3 convolution block in the ResNet-50 residual block with the MFA module to construct the MFANet, and the obtained MFANet is shown in fig. 4.
S3, training a multi-scale feature and attention-based deep neural network model (MFANet) by using a small target image data set;
specifically, as shown in fig. 5, the process of training the multi-scale feature and attention-based deep neural network model MFANet using the small target image data set in step S3 includes:
S301, performing data enhancement on the small-target image data set with random horizontal flipping, and converting it into the Tensor format recognized by PyTorch (the deep-learning framework used to build the multi-scale feature and attention-based deep neural network model);
After data enhancement, the model's ability to recognize tilted or flipped pictures is improved.
Here, the small-target image data set is a set of images in which the small-target category of each image has been labeled.
S302, setting the category number output by a full connection layer in a deep neural network model MFANet based on multi-scale features and attention according to the category number of the small targets in the small target image data set;
s303, optimizing by using an Adam optimizer, setting the initial learning rate to be 0.01, and realizing a cosine annealing restart learning rate mechanism by using a custom learning rate adjusting function Lambdalr.
Finally, the Tensor-format small-target image data set is input into the multi-scale feature and attention-based deep neural network model MFANet for training; a sketch of this training setup follows.
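The snippet below realises steps S301-S303 with standard PyTorch components. The cycle length T, the number of collar categories, and the MFANet constructor (assembling the MFABottleneck blocks above into the full network) are assumptions; the text itself names only horizontal-flip augmentation, Adam with an initial learning rate of 0.01, and a LambdaLR cosine-annealing-with-restarts schedule.

```python
import math
import torch
from torchvision import transforms

# S301: data enhancement and conversion to Tensor format
transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# S302: fc output size = number of small-target categories
num_collar_classes = 5                          # assumption
model = MFANet(num_classes=num_collar_classes)  # assumed constructor

# S303: Adam with initial lr 0.01 and cosine annealing with restarts
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
T = 10  # epochs per restart cycle (assumption)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # cosine decay that restarts from the initial lr every T epochs
    lr_lambda=lambda epoch: 0.5 * (1 + math.cos(math.pi * (epoch % T) / T)),
)
```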
And S4, recognizing small targets in the image to be recognized with the trained multi-scale feature and attention-based deep neural network model MFANet.
As shown in fig. 6, the process of recognizing small targets in an image to be recognized with the multi-scale feature and attention-based deep neural network model MFANet includes:
the image to be recognized passes through a convolution layer of 64 7×7 convolution kernels to obtain a 64×112×112 feature map;
the 64×112×112 feature map passes through a 3×3 max-pooling layer to obtain a 64×56×56 feature map;
the 64×56×56 feature map passes through a stage of 3 residual blocks to obtain a 256×56×56 feature map;
the 256×56×56 feature map passes through a stage of 4 residual blocks to obtain a 512×28×28 feature map;
the 512×28×28 feature map passes through a stage of 6 residual blocks to obtain a 1024×14×14 feature map;
the 1024×14×14 feature map passes through a stage of 3 residual blocks to obtain a 2048×7×7 feature map;
then an average pooling layer and a fully connected (fc) layer output the classification result, i.e. the probability of the small target in the image to be recognized belonging to each category.
Here, each residual block above consists of a 1×1 convolution layer, the MFA module, and another 1×1 convolution layer. An inference sketch following these shapes is given below.
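A minimal inference sketch for step S4, following the stage shapes above (which imply a 224×224 input); the checkpoint path and the MFANet constructor are assumptions standing in for the trained model.

```python
import torch
from PIL import Image
from torchvision import transforms

model = MFANet(num_classes=num_collar_classes)          # assumed constructor
model.load_state_dict(torch.load("mfanet_collar.pth"))  # assumed checkpoint
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # matches the 224x224 input implied above
    transforms.ToTensor(),
])
img = preprocess(Image.open("collar.jpg")).unsqueeze(0)  # (1, 3, 224, 224)

with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)  # probability per category
print(probs.argmax(dim=1))                    # predicted small-target class
```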
To illustrate the effectiveness of the present application, we performed verification on a collar data set containing garment images collected from major e-commerce platforms; the collar occupies a small portion of the overall image and there is significant background noise (part of the data is shown by example in fig. 7). The pictures were input into different recognition models to obtain the recognition accuracy comparison shown in Table 1.
TABLE 1 Recognition accuracy comparison results
Network          Parameters   FLOPs   Top-1 Accuracy (%)
EMRes-50         28.02M       4.34G   73.6
ResNet-50        23.52M       4.12G   66.5
ResNeXt-50       22.99M       4.26G   75.7
Res2Net          23.66M       4.29G   74.8
DenseNet-161     26.49M       7.82G   72.3
Xception         20.82M       4.58G   76.3
EPSANet          20.53M       3.63G   78.1
SKNet            25.44M       4.51G   56.1
MFANet (Ours)    13.81M       2.61G   80.4
In the table, Network is the model name, Parameters is the number of model parameters, FLOPs is the floating-point operation count, and Accuracy is the recognition accuracy.
As can be seen from Table 1, the present method achieves higher recognition accuracy than the other models while using fewer parameters and less floating-point computation.
While embodiments of the invention have been described above, the invention is not limited to the applications set forth in the description and embodiments; it is fully applicable in various fields to which it pertains, and further modifications may readily be made by those skilled in the art. The invention is therefore not limited to the details shown and described herein, so long as they do not depart from the general concept defined by the appended claims and their equivalents.

Claims (6)

1. An image small-target classification method based on multi-scale features and attention, characterized by comprising the following steps:
S1, designing an MFA module based on multi-scale features and an attention mechanism;
S2, taking the ResNet-50 network structure as a reference, replacing the 3×3 convolution block in each ResNet-50 residual block with the MFA module to obtain MFANet, a deep neural network model based on multi-scale features and attention;
S3, training the multi-scale feature and attention-based deep neural network model MFANet with a small-target image data set;
and S4, recognizing small targets in an image to be recognized with the trained multi-scale feature and attention-based deep neural network model MFANet.
2. The image small-target classification method based on multi-scale features and attention according to claim 1, wherein the process of designing the MFA module based on multi-scale features and an attention mechanism in step S1 comprises:
S101, performing parallel depthwise separable convolutions on an input feature map with convolution operators of different sizes to obtain a plurality of feature maps containing different scale information from the input feature map;
S102, obtaining the channel weight vector of each feature map containing different scale information with an attention mechanism;
S103, concatenating the feature maps containing different scale information of the input feature map, and performing element-wise multiplication between the concatenated feature map and the weight vectors to highlight important region representations.
3. The image small-target classification method based on multi-scale features and attention according to claim 2, wherein in step S101 the convolution kernel sizes of the parallel depthwise separable convolutions are set to K = (1, 3, 5, 7), and the operation of step S101 is expressed as:
F_i = Conv(1×1)(Conv(k_i×k_i, g=C)(X)), i = 0, 1, 2, …, S−1;
where F_i is the feature map containing one scale of information obtained after the convolutions, Conv is the convolution operation, k_i is the convolution kernel size, g is the number of convolution groups, C is the number of channels of the input feature map, and X is the input feature map.
4. The image small-target classification method based on multi-scale features and attention according to claim 2, wherein in step S102 the attention mechanism is the ECA attention mechanism, and the operation of step S102 is expressed as:
Z_i = ECA(F_i), i = 0, 1, 2, …, S−1;
where Z_i is the attention weight vector and ECA is the method used to extract channel attention.
5. The image small-target classification method based on multi-scale features and attention according to claim 2, wherein in step S103 the feature maps containing different scale information of the input feature map are concatenated, and element-wise multiplication between the concatenated feature map and the weight vectors highlights the important region representations; the operation is expressed as:
F = Cat([F_0, F_1, …, F_{S−1}]);
X_Out = δ(Cat([Z_0, Z_1, …, Z_{S−1}])) ⊙ F;
where F is the feature map obtained after concatenation, Cat is the concatenation operation, X_Out is the feature map output by step S103, and δ is the SoftMax function.
6. The image small-target classification method based on multi-scale features and attention according to claim 1, wherein the process of training the multi-scale feature and attention-based deep neural network model MFANet with the small-target image data set in step S3 comprises:
S301, performing data enhancement on the small-target image data set with random horizontal flipping, and converting it into the Tensor format;
S302, setting the number of categories output by the fully connected layer of the multi-scale feature and attention-based deep neural network model MFANet according to the number of small-target categories in the small-target image data set;
S303, optimizing with an Adam optimizer, setting the initial learning rate to 0.01, and realizing a cosine-annealing-with-restarts learning rate mechanism with the custom learning rate adjustment function LambdaLR.
CN202210768041.XA 2022-07-01 2022-07-01 Image small target classification method based on multi-scale features and attention Pending CN115131782A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210768041.XA CN115131782A (en) 2022-07-01 2022-07-01 Image small target classification method based on multi-scale features and attention


Publications (1)

Publication Number Publication Date
CN115131782A true CN115131782A (en) 2022-09-30

Family

ID=83381686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210768041.XA Pending CN115131782A (en) 2022-07-01 2022-07-01 Image small target classification method based on multi-scale features and attention

Country Status (1)

Country Link
CN (1) CN115131782A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116091896A (en) * 2023-04-12 2023-05-09 吉林农业大学 Method and system for identifying origin of radix sileris based on IRESNet model network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination