CN114037839A - Small target identification method, system, electronic equipment and medium


Info

Publication number
CN114037839A
Authority
CN
China
Prior art keywords
feature map, small target, target image, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111225109.1A
Other languages
Chinese (zh)
Inventor
彭建
赵乙芳
章登勇
李峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha University of Science and Technology
Original Assignee
Changsha University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha University of Science and Technology
Priority to CN202111225109.1A
Publication of CN114037839A
Legal status: Pending

Classifications

    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Fusion techniques of extracted features
    • G06N3/045: Neural networks, combinations of networks
    • G06N3/048: Neural networks, activation functions
    • G06N3/08: Neural networks, learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small target identification method, system, electronic device, and medium. Features of a target image are extracted multiple times to obtain several feature maps that differ in resolution, semantic weight, and position-feature weight. The features in these maps that benefit small target identification are then fused into a feature map that performs well across resolution, semantics, position, and other dimensions, and this fused feature map improves the accuracy and performance of a convolutional neural network model for small target identification.

Description

Small target identification method, system, electronic equipment and medium
Technical Field
The invention relates to the technical field of deep-learning image processing, and in particular to a small target identification method, system, electronic device, and medium.
Background
Object detection is a fundamental task in computer vision and pattern recognition: finding objects of interest in an image and determining their category and location. However, small targets remain a difficulty in visual tasks such as object detection and semantic segmentation; detection accuracy on small targets is usually only about half that on large targets.
There are many reasons why small objects are detected less accurately than large ones, but the main one is the contradiction between resolution and semantic information. When extracting features from an image, obtaining a high-resolution feature map sacrifices semantic and position information, while obtaining a feature map with strong semantic and position information sacrifices resolution. A small target has a small outline and is often located near the edge of the image, so its resolution, semantic information, and position information are weak to begin with. If its semantic and position information are enhanced, its resolution becomes weaker still and the convolutional neural network model struggles to identify it; if its resolution is enhanced, its semantic and position information become weaker and the model struggles just the same.
Disclosure of Invention
The present invention aims to solve at least one of the problems in the prior art. It therefore provides a small target identification method, system, electronic device, and medium that can improve the performance of a convolutional neural network model on small target detection.
In a first aspect, an embodiment of the present invention provides a small target identification method, including the following steps:
acquiring a target image input into a convolutional neural network model, wherein the target image contains a small target to be identified;
extracting features of the small target in the target image to obtain an original feature map of the target image;
extracting features of the original feature map through a channel attention mechanism to obtain a semantic feature map emphasizing small-target semantic features, extracting features of the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing small-target position features, and fusing the original feature map, the semantic feature map, and the position feature map to obtain a fused feature map of the target image;
and classifying the small target in the target image based on the fused feature map to obtain an identification result for the small target.
According to the embodiment of the invention, at least the following technical effects are achieved:
the method comprises the steps of extracting information of a target image for the first time, obtaining an original feature map with high resolution but weak semantic information and position information, extracting features of the original feature map by using a channel attention mechanism, obtaining a semantic feature map emphasizing small target semantic features, extracting features of the original feature map by using a space attention mechanism, obtaining a position feature map emphasizing small target position features, fusing the original feature map, the semantic feature map and the position feature map, making up for deficiencies, obtaining an enhanced fused feature map on the resolution, the semantic information and the position information, and carrying out classification and identification on small targets in the target image by a convolutional neural network based on the fused feature map, so that the identification accuracy of the convolutional neural network model on the small target can be improved.
According to some embodiments of the present invention, extracting features of the original feature map of the target image through a channel attention mechanism to obtain the semantic feature map emphasizing small-target semantic features includes: performing global average pooling and global max pooling on the original feature map to obtain an average feature matrix and a maximum feature matrix, and adding the two element-wise; convolving the sum with a 1 × 1 × C/r convolution kernel and activating with a ReLU nonlinear activation layer; and convolving the ReLU output with a 1 × 1 × C convolution kernel and activating with a sigmoid nonlinear activation function to obtain the semantic feature map emphasizing small-target semantic features.
According to some embodiments of the present invention, extracting features of the original feature map of the target image through a spatial attention mechanism to obtain the position feature map emphasizing small-target position features includes: performing global average pooling and max pooling on the original feature map to obtain an average feature matrix and a maximum feature matrix, and concatenating the two; and convolving the concatenated result and activating with a sigmoid nonlinear activation function to obtain the position feature map emphasizing small-target position features.
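As a rough illustration of the spatial-attention branch just described, the NumPy sketch below average-pools and max-pools along the channel axis, merges the two maps, and squashes the result with a sigmoid. The function name and the fixed merge weights are stand-ins for a learned convolution, not the patent's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(x, kernel=None):
    """Spatial attention over a feature map x of shape (H, W, C).

    Average- and max-pool along the channel axis, stack the two H x W
    maps, combine them into one map, and squash with a sigmoid.
    """
    avg = x.mean(axis=-1)                    # (H, W) channel-wise average
    mx = x.max(axis=-1)                      # (H, W) channel-wise maximum
    stacked = np.stack([avg, mx], axis=-1)   # (H, W, 2)
    if kernel is None:
        # Hypothetical fixed weights standing in for a learned convolution
        kernel = np.array([0.5, 0.5])
    fused = stacked @ kernel                 # (H, W)
    return sigmoid(fused)                    # attention weights in (0, 1)

mask = spatial_attention(np.random.rand(8, 8, 16))
assert mask.shape == (8, 8)
assert ((mask > 0) & (mask < 1)).all()
```

In a trained network the 2-to-1 combination would be a convolution with learned weights (often with a larger spatial kernel); the scalar weights here only show the data flow.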
According to some embodiments of the present invention, fusing the original feature map, the semantic feature map, and the position feature map to obtain the fused feature map of the target image includes: multiplying the original feature map element-wise by the semantic feature map twice and element-wise by the position feature map once, and adding the three products element-wise; and multiplying the sum element-wise by the original feature map to obtain the fused feature map of the target image.
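The three-product fusion rule can be written directly in NumPy, with broadcasting standing in for the element-wise products; this is a sketch of the claim wording (including the duplicated semantic product), not the patent's code:

```python
import numpy as np

def saff_fuse(x, sem, pos):
    """Fuse per the claim: multiply x by the semantic map twice and by the
    position map once (all element-wise), sum the products, then multiply
    the sum element-wise by x again."""
    a = x * sem          # first product with the semantic weights (1,1,C)
    b = x * sem          # second product with the semantic weights
    c = x * pos          # product with the position weights (H,W,1)
    return (a + b + c) * x

x = np.ones((2, 2, 3))
sem = np.full((1, 1, 3), 0.5)   # per-channel semantic weights
pos = np.full((2, 2, 1), 0.5)   # per-position spatial weights
out = saff_fuse(x, sem, pos)
assert out.shape == (2, 2, 3)
assert np.allclose(out, 1.5)    # (0.5 + 0.5 + 0.5) * 1
```

The shapes assume the channel branch yields 1 × 1 × C weights and the spatial branch H × W × 1 weights, which NumPy broadcasting expands over the full H × W × C map.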
According to some embodiments of the invention, the features of the small target in the target image are extracted by convolution or dilated convolution.
According to some embodiments of the invention, the identification result for the small target includes: marking the category of the small target with text, marking its position with a rectangular box, and masking small targets of different categories with different colors.
In a second aspect, an embodiment of the present invention provides a small target detection system, including:
an input module, configured to acquire a target image input into the convolutional neural network model, wherein the target image contains a small target to be identified;
a feature extraction module, configured to extract features of the small target in the target image to obtain an original feature map of the target image;
a feature fusion module, configured to extract features of the original feature map through a channel attention mechanism to obtain a semantic feature map emphasizing small-target semantic features, extract features of the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing small-target position features, and fuse the original feature map, the semantic feature map, and the position feature map to obtain a fused feature map of the target image;
and an output module, configured to classify the small target in the target image based on the fused feature map to obtain an identification result for the small target.
According to the embodiment of the invention, at least the following technical effects are achieved:
the feature extraction module extracts information of the target image for the first time, the obtained original feature map is high in resolution but weak in semantic information and position information, the feature fusion module extracts features of the original feature map by using a channel attention mechanism to obtain a semantic feature map emphasizing small target semantic features, extracts features of the original feature map by using a space attention mechanism to obtain a position feature map emphasizing small target position features, the original feature map, the semantic feature map and the position feature map are fused, the advantages and the disadvantages are made, an enhanced fused feature map is obtained on the resolution, the semantic information and the position information, and the convolutional neural network classifies and identifies small targets in the target image based on the fused feature map, so that the identification accuracy of the convolutional neural network model for identifying the small targets can be improved.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing:
a small object recognition method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions for performing:
a small object recognition method as described in the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic flow chart of a small target identification method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an SAFF module of the small target recognition method according to the embodiment of the present invention;
FIG. 3 is a schematic diagram of an internal structure of a channel attention branch according to the small target identification method of the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of Stage6 of the small target identification method according to the embodiment of the present invention;
FIG. 5 is a diagram of a DSAFF-Net structure of the small target recognition method according to the embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a small target detection system according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be fully described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
With the continuous development of artificial intelligence research, deep learning has advanced rapidly in recent years. It has become the most widely used technology in the field of computer vision and shows great advantages in image recognition, target detection, object tracking, and other fields.
The target detection task is a fundamental task, and one of the core tasks, in the field of computer vision and pattern recognition. Its job is to find objects of interest in an image and determine their category and location; in other words, to answer the questions "where?" and "what?". This provides reliable information for many subsequent tasks such as target tracking, behavior recognition, and scene understanding. Before deep learning, traditional target detection methods were generally divided into three steps. First, candidate regions are extracted by a selective-search method using sliding windows of different sizes. Then relevant visual features of each candidate region are extracted, such as the Haar features commonly used for face detection or the HOG features commonly used for pedestrian detection and general target detection. Finally, a trained classifier performs classification. Traditional target detection methods have many defects, such as slow detection speed, low accuracy, poor real-time performance, and a large amount of computation.
With the rapid development of deep learning in recent years, target detection algorithms have shifted from traditional algorithms based on hand-crafted features to detection techniques based on deep neural networks. The convolutional neural network is an important deep learning method. LeNet-5, proposed by Yann LeCun in 1998, applied the convolutional neural network to image recognition for the first time with good results in character recognition, and the appearance of the convolutional neural network greatly promoted the development of deep learning. In 2012, Alex Krizhevsky et al. of the University of Toronto proposed the AlexNet architecture, a milestone for image processing based on convolutional neural networks that drew wide attention to the field. Currently, CNN-based target detectors can be divided into two categories. The first is the single-stage (one-stage) detector, which needs no separate search for candidate regions: the network extracts features directly to predict object category and position regression, which can be understood simply as "one step in place". Common one-stage detection algorithms include YOLO, SSD, and RetinaNet. The second is the two-stage detector, which works in two steps: candidate regions (pre-selection boxes that may contain an object to be detected) are obtained first, and then classification and position regression are performed by a convolutional neural network using the features of those candidate regions. Common two-stage detection algorithms include R-CNN, Faster R-CNN, and Mask R-CNN.
The one-stage detection algorithm has a speed advantage over the two-stage algorithm, while the two-stage algorithm has a precision advantage. With continuous optimization, however, both the precision and the speed of target detection methods have greatly improved.
As part of target detection, small target detection arises widely in images taken with large fields of view or at long distances, such as houses in drone aerial images or flowers in landscape photographs. There are two common formal definitions of a small target. One is relative: according to the SPIE definition, a target occupying fewer than 80 pixels in a 256 × 256 image, i.e. less than about 0.12% of the image, is small. The other is absolute: by the definition of the COCO dataset, a target smaller than 32 × 32 pixels is considered small. Most existing convolutional-neural-network-based detection algorithms are evaluated on general datasets, but small targets usually occupy a small proportion of an image and their edge features are weak or even missing; with limited resolution and semantic information, deep-learning-based detectors perform poorly on small targets in conventional datasets. Small target detection is therefore a broad and important research direction within target detection, and many researchers have proposed optimizations for it. Current ideas for improving small target detection mainly include data augmentation, feature fusion, use of context information, suitable training methods, generating denser anchors, applying generative adversarial networks, and feature amplification.
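The two definitions above can be captured in a few lines of Python; the function name and the area-based reading of "fewer than 80 pixels" / "smaller than 32 × 32" are assumptions made for illustration:

```python
def is_small_target(w, h, definition="coco"):
    """Classify a w x h target as 'small' under the two definitions cited:
    COCO (absolute, under 32 x 32 pixels) or SPIE (relative, under 80
    pixels in a 256 x 256 image, roughly 0.12% of the image)."""
    if definition == "coco":
        return w * h < 32 * 32
    return w * h < 80  # "spie"

assert is_small_target(20, 20)                     # 400 px, small under COCO
assert not is_small_target(40, 40)                 # 1600 px, not small
assert is_small_target(8, 8, definition="spie")    # 64 px < 80
assert not is_small_target(10, 10, definition="spie")
```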
It is well known that in target detectors based on the COCO dataset, detection performance on small targets is far inferior to that on large targets. Several reasons cause this. 1) The nature of network feature extraction. Deep-learning-based detection networks typically use a CNN as the feature extraction tool. To obtain features with strong semantic information and a large receptive field, the CNN keeps deepening; but as the depth increases, the feature map keeps shrinking, so information from small regions, i.e. small-target feature information, is hard to propagate to the later stages of the detector. The features of small targets are therefore hard to extract or even disappear, and small target detection performance naturally suffers. 2) The imbalance between large and small targets in the detection dataset. In COCO, large targets far outnumber small ones, so a deep-learning-based detection network is friendlier to large targets and has difficulty adapting to targets of different sizes. 3) The network loss function, which is unfriendly to small targets when selecting positive and negative samples.
Features with both high resolution and strong semantic information are crucial to the accuracy of target detection, and of small target detection in particular. But, like the fish and the bear's paw of the Chinese proverb, high resolution and strong semantic information cannot both be had at once. To obtain features with strong semantic information, the number of downsampling steps must increase; the receptive field grows with it, and the semantic information of each pixel on the deep feature map is continuously strengthened. But to keep the feature map at high resolution, ensure that small-target features do not disappear, and obtain clear edge information for small targets, the number of downsampling steps must be reduced; then the size of the receptive field cannot be guaranteed, and more computation and memory are required.
In the Mask R-CNN network, the ResNet-FPN outputs [P2, P3, P4, P5, P6] are used as the input of the RPN layer. The feature map P6 is obtained from P5 by max-pooling downsampling with stride 2 and is used only for obtaining region proposals in the RPN layer, with a larger anchor size of 512 × 512. However, although direct downsampling expands the receptive field, it introduces no learnable parameters and loses part of the spatial resolution, which is unfavorable both for accurately locating large targets and for identifying small targets.
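The stride-2 max-pooling step that derives P6 from P5 can be sketched in NumPy; the helper name is hypothetical, and the point is simply that the operation halves the spatial resolution while keeping only the strongest response in each 2 × 2 block, with no learnable parameters:

```python
import numpy as np

def maxpool_stride2(fm):
    """Downsample a (H, W) feature map with 2 x 2 max pooling, stride 2,
    as the text describes for deriving P6 from P5."""
    h, w = fm.shape
    fm = fm[:h - h % 2, :w - w % 2]           # drop odd trailing row/col
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

p5 = np.arange(16.0).reshape(4, 4)
p6 = maxpool_stride2(p5)
assert p6.shape == (2, 2)                      # spatial size halved
assert np.array_equal(p6, [[5.0, 7.0], [13.0, 15.0]])
```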
In order to solve one of the above problems, in a first aspect, referring to fig. 1, an embodiment of the present invention provides a small target identification method, including the following steps:
step S110: acquiring a target image input by a convolutional neural network model, wherein the target image contains a small target to be identified;
step S120: extracting features of the small target in the target image to obtain an original feature map of the target image;
step S130: extracting features of the original feature map through a channel attention mechanism to obtain a semantic feature map emphasizing small-target semantic features, extracting features of the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing small-target position features, and fusing the original feature map, the semantic feature map, and the position feature map to obtain a fused feature map of the target image;
step S140: and classifying the small targets in the target image based on the feature map after the target image is fused to obtain the identification result of the small targets.
This embodiment takes the Mask R-CNN convolutional neural network as its basis and improves its feature extraction part while keeping the Mask R-CNN backbone essentially unchanged. First, a three-way feature attention fusion module, the Sandwich Attention Feature Fusion Module (SAFF Module), is designed to strengthen the semantic information of shallow features and the resolution of deep features; combined with the FPN structure, it performs feature fusion effectively and improves the accuracy of target classification and position regression, especially for small targets. Second, a new neural network stage is created in the backbone by dilated convolution, alleviating the loss of resolution that occurs when a sampling layer enlarges the receptive field.
In steps S110 to S120, the target image containing the small target to be detected is input into the trained convolutional neural network model. The model extracts information from the target image, retaining features relevant to the small target and discarding irrelevant ones, to obtain the original feature map.
Because the semantic and position information of the original feature map is weak, the convolutional neural network cannot effectively distinguish the small target in the target image from the original feature map alone, so this information must be enhanced. In step S130, a channel attention mechanism extracts features from the original feature map to obtain a semantic feature map emphasizing small-target semantic features, and a spatial attention mechanism extracts features from the original feature map to obtain a position feature map emphasizing small-target position features; these two maps make up precisely for the original feature map's weakness in semantic and position information. Finally, the original, semantic, and position feature maps are fused to obtain a feature map enhanced in resolution, semantic information, and position information.
In step S140, the convolutional neural network classifies the small targets in the target image based on the feature map after the target image is fused, so as to obtain the recognition result of the small targets.
The embodiment of the invention first extracts the information of the target image, obtaining an original feature map with high resolution but weak semantic and position information. A channel attention mechanism extracts features from the original feature map to obtain a semantic feature map emphasizing small-target semantic features, and a spatial attention mechanism extracts features to obtain a position feature map emphasizing small-target position features. The three maps are fused into a feature map enhanced in resolution, semantic information, and position information, and the convolutional neural network classifies and identifies the small target in the target image based on this fused feature map, improving the identification accuracy of the model on small targets.
In some alternative embodiments, the target image may be convolved by dilated convolution during feature extraction. Dilated convolution injects holes into the kernel of a standard convolution to enlarge the receptive field, allowing the convolution output to cover a larger range of information and avoiding the unnecessary loss of part of it.
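How injecting holes enlarges the receptive field can be shown with the standard formula for the effective kernel size of a dilated convolution; the helper below is a worked illustration, not part of the patent:

```python
def effective_kernel(k, d):
    """Effective span of a k x k convolution with dilation rate d.

    Holes (d - 1 zeros) are injected between adjacent kernel taps, so the
    kernel covers k + (k - 1) * (d - 1) pixels per side without adding
    parameters."""
    return k + (k - 1) * (d - 1)

assert effective_kernel(3, 1) == 3   # d = 1 is an ordinary convolution
assert effective_kernel(3, 2) == 5   # a 3x3 kernel with dilation 2 spans 5x5
assert effective_kernel(3, 4) == 9   # dilation 4 spans 9x9, still 9 weights
```

This is why the embodiment can enlarge the receptive field without the resolution loss of a pooling layer: the parameter count stays at k × k while the span grows with d.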
The implementation of the above embodiment is specifically described by taking a Mask RCNN model in a convolutional neural network as an example.
The embodiment of the invention improves the feature extraction backbone of the two-stage detector Mask R-CNN in two main ways. First, a three-way feature attention fusion module, the SAFF Module, is designed to enhance the semantic information of shallow features and the resolution of deep features, and to perform feature fusion effectively in combination with the Feature Pyramid Network (FPN) structure, improving the accuracy of target classification and position regression, especially for small targets. Second, a new neural network stage is created in the backbone by dilated convolution, alleviating the loss of resolution that occurs when downsampling (pooling) enlarges the receptive field.
1. SAFF module
Referring to fig. 2 and 3, the Sandwich Attention Feature Fusion Module (SAFF module) is a three-way feature attention fusion module formed by alternately superimposing two channel attention mechanisms and one spatial attention mechanism. It aims to strengthen the semantic information of shallow features, improve the resolution of deep features, and optimize small target detection performance.
The channel attention mechanism automatically learns the importance of each feature channel and uses it to enhance useful features and suppress features that are not useful for the current task. First, a feature map X is input (its dimensions are H × W × C, where H and W are the height and width of the feature map and C is the number of input channels), and two channel descriptors of size 1 × 1 × C are obtained through Global Average Pooling (GAP) and max pooling, respectively. In short, GAP pools each feature map globally, compressing its global information into a single real number; this descriptor has a receptive field covering the whole map and can directly give each channel an actual class meaning, while also greatly reducing network parameters. Max pooling divides the feature map into blocks and takes the maximum of each block, extracting the relatively strongest information in the feature map and discarding the weaker information before the next layer.
Then, the feature vectors obtained by GAP and max pooling are added element-wise, and the result is fed into the following convolution layers. The first convolution kernel is 1 × 1 × C/r (r is the channel compression ratio); it compresses the number of channels to C/r, i.e. 1/r of the original size, reducing the dimensionality of the feature map. The second convolution kernel is 1 × 1 × C and restores the number of channels to the original size. Using 1 × 1 convolution kernels for this dimension reduction and restoration is in effect a linear recombination of the information across channels, creating information interaction between channels. A ReLU nonlinear activation layer sits between the two convolutions, adding nonlinearity to improve the expressive capability of the network. Finally, a scaled attention vector is obtained through the sigmoid nonlinear activation function.
The processed output characteristics can be formulated as:
C(x)=δ(Conv(σ(Conv(GAP(x)+MaxPool(x)))))
where σ denotes the ReLU nonlinear activation function and δ denotes the sigmoid nonlinear activation function.
The output C(x) is multiplied element-wise with the original input feature map through a long skip connection to obtain the fused feature F1(x), which can be expressed by the following formula, where X denotes the original input feature map.
F1(x)=C(x)×X
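As a hedged illustration of the channel-attention branch described above (a NumPy sketch with random stand-in weights `W1`, `W2`, not the patent's trained model), the GAP/max-pool descriptors, the two 1 × 1 convolutions with ReLU between them, the sigmoid, and the skip-connection product F1(x) = C(x) × X can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_attention(X, W1, W2):
    """Channel attention C(x) in NumPy. X: (H, W, C); W1: (C, C//r);
    W2: (C//r, C). A 1x1 convolution applied to a 1x1xC tensor is just a
    matrix product over the channel axis."""
    gap = X.mean(axis=(0, 1))              # global average pooling -> (C,)
    gmp = X.max(axis=(0, 1))               # global max pooling     -> (C,)
    z = gap + gmp                          # element-wise addition
    z = np.maximum(0.0, z @ W1)            # 1x1 conv to C/r channels + ReLU
    return 1.0 / (1.0 + np.exp(-(z @ W2))) # 1x1 conv back to C + sigmoid

H, W, C, r = 8, 8, 16, 4
X = rng.standard_normal((H, W, C))
W1 = rng.standard_normal((C, C // r)) * 0.1  # stand-in, untrained
W2 = rng.standard_normal((C // r, C)) * 0.1  # stand-in, untrained

cx = channel_attention(X, W1, W2)
F1 = X * cx                        # F1(x) = C(x) x X, broadcast per channel
print(cx.shape, F1.shape)          # (16,) (8, 8, 16)
```

The sigmoid keeps every channel weight strictly between 0 and 1, so the skip-connection product rescales channels rather than discarding them outright.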
The spatial attention mechanism differs from the channel attention mechanism in that it focuses on enhancing the position information of the features, and is therefore complementary to it. First, pooling is performed across the channel dimension in two different ways, Global Average Pooling (GAP) and Global Max Pooling (GMP), yielding two feature maps of the same dimensions. The two feature maps are then concatenated (concat) along the channel dimension. A dimension-reducing convolution is applied to the concatenated map, and a spatial matrix carrying the spatial attention weights is obtained through the sigmoid nonlinear activation function. Finally, the spatial attention matrix is multiplied element-wise with the original feature map to obtain the spatially enhanced feature layer F2(x), which can be expressed by the following formulas.
S(x)=δ(Conv(GAP(x);MaxPool(x)))
F2(x)=S(x)×X
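A corresponding hedged sketch of the spatial branch (for brevity a 1 × 1 convolution, i.e. a 2-element weight vector `w`, stands in for the larger kernel a real model would typically use; the weights are random, not trained):

```python
import numpy as np

rng = np.random.default_rng(1)

def spatial_attention(X, w):
    """Spatial attention S(x): pool across channels, concatenate the two
    maps, reduce to a single map, squash with sigmoid."""
    avg = X.mean(axis=2, keepdims=True)      # channel-wise GAP -> (H, W, 1)
    mx = X.max(axis=2, keepdims=True)        # channel-wise GMP -> (H, W, 1)
    cat = np.concatenate([avg, mx], axis=2)  # concat -> (H, W, 2)
    return 1.0 / (1.0 + np.exp(-(cat @ w)))  # sigmoid weights -> (H, W)

H, W, C = 6, 6, 8
X = rng.standard_normal((H, W, C))
w = rng.standard_normal(2)                   # stand-in 1x1 conv weights

s = spatial_attention(X, w)
F2 = X * s[..., None]                        # F2(x) = S(x) x X, per position
print(s.shape, F2.shape)                     # (6, 6) (6, 6, 8)
```

Here every spatial position gets one weight shared by all channels, the mirror image of the channel branch, which is what makes the two mechanisms complementary.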
After the feature map passes through the Sandwich Attention Feature Fusion Module, the final feature map X′ with enhanced channel and spatial information can be expressed by the following formula:
X′=X×(2F1(x)+F2(x))
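The fusion itself is a few element-wise operations. The following NumPy shape check, with random stand-ins for F1(x) and F2(x) (hypothetical untrained weights, not the module's real outputs), verifies that X′ keeps the dimensions of X:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4, 8))
F1 = X * rng.uniform(size=(8,))       # stand-in channel-enhanced map
F2 = X * rng.uniform(size=(4, 4, 1))  # stand-in spatially enhanced map

X_prime = X * (2 * F1 + F2)           # X' = X x (2*F1(x) + F2(x))
print(X_prime.shape)                  # (4, 4, 8)
```

The factor 2 on F1(x) reflects the sandwich structure: two channel attention mechanisms to one spatial attention mechanism.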
2. Stage6 module
Referring to fig. 4, Mask R-CNN uses ResNet101, consisting of 5 stages, as the backbone network for feature extraction. Each stage is composed of a different number of convolution layers and contains two types of residual block: the convolution module (Convolutional Block) and the identity module (Identity Block). The input and output dimensions of a Convolutional Block differ, so it can change the dimensions of the network but cannot be stacked in series; the input and output dimensions of an Identity Block are the same, so it can be stacked in series to deepen the network. The feature maps output by Stage1-5 are C1, C2, C3, C4 and C5 respectively, with sizes 1/2, 1/4, 1/8, 1/16 and 1/32 of the original image.
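The stage output sizes quoted above follow from each stage halving the spatial resolution; a small sketch (the 1024 × 1024 input is an arbitrary example, not a value from the patent):

```python
def stage_sizes(h, w, n_stages=6):
    """Each backbone stage halves the spatial resolution, so stage i
    outputs a map 1/2**i the input size; the added Stage6 continues
    the sequence down to 1/64."""
    sizes = []
    for _ in range(n_stages):
        h, w = h // 2, w // 2
        sizes.append((h, w))
    return sizes

print(stage_sizes(1024, 1024))
# [(512, 512), (256, 256), (128, 128), (64, 64), (32, 32), (16, 16)]
```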
Constructing the feature pyramid FPN realizes multi-scale fusion of features. In Mask R-CNN, after the input picture is processed by the ResNet-FPN feature extraction network, the layers P2, P3, P4, P5 and P6 are obtained and used as the effective feature layers from which the RPN obtains prediction boxes. Each P layer handles a single scale: anchors of five scales {32 × 32, 64 × 64, 128 × 128, 256 × 256, 512 × 512} correspond to the five feature layers {P2, P3, P4, P5, P6} respectively, and each feature layer handles candidate boxes of the three aspect ratios 1:1, 1:2 and 2:1.
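For illustration, the anchor widths and heights implied by one scale and the three aspect ratios can be computed by keeping the anchor area at scale² while varying w:h (a common convention assumed here, not stated explicitly in the patent):

```python
def anchors_for_level(scale, ratios=(1.0, 0.5, 2.0)):
    """Width/height pairs of the anchors attached to one pyramid level:
    the anchor area stays scale**2 while the aspect ratio w:h runs over
    1:1, 1:2 and 2:1."""
    out = []
    for r in ratios:
        w = scale * (r ** 0.5)
        h = scale / (r ** 0.5)
        out.append((round(w, 1), round(h, 1)))
    return out

levels = {f"P{i + 2}": s for i, s in enumerate([32, 64, 128, 256, 512])}
for name, s in levels.items():
    print(name, anchors_for_level(s))  # e.g. P2 [(32.0, 32.0), (22.6, 45.3), (45.3, 22.6)]
```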
Compared with ResNet, the network of the embodiment of the invention retains Stage1-5 of ResNet101 and adds a Stage6 module composed of two dilated identity blocks and one dilated convolution block. After the feature map passes through the Stage6 module, the output is P6, whose size is 1/64 of the original image. In the original network, P6 is designed specifically for the RPN: it does not participate in the later layers of the network, handles candidate boxes of size 512 × 512, and is obtained directly from P5 by downsampling. Although a feature layer obtained by downsampling enlarges the receptive field and lets the convolution see more information, it keeps only the important information during dimension reduction and loses the rest; the premise of pooling enlarging the receptive field is therefore that some information is lost and the resolution is reduced, which affects the accuracy of the final target position regression to a certain extent. Using the dilation technique, the receptive field can be enlarged without pooling, the convolution output contains information from a larger range, and the unnecessary loss of part of the information is avoided.
The DSAFF-Net feature extraction backbone details are shown in table 1:
TABLE 1
Referring to fig. 5, the following describes a specific embodiment of the present invention in further detail with reference to the accompanying drawings:
1) Dataset preparation: the MS COCO public dataset, comprising a training set and a test set, was used as the experimental subject.
2) Building a two-stage-based target detection network model
2.1) Input a picture; ResNet, the FPN, the SAFF module and Stage6 form the new feature extraction network structure, yielding the feature layers after feature extraction;
2.2) Extract candidate boxes through the RPN network. The feature layer that handles candidate boxes of size 512 × 512, obtained in the original network directly from P5 by downsampling, is here produced by the Stage6 module composed of dilated identity blocks and a dilated convolution block;
2.3) Scale the target candidate boxes obtained in step 2.2 to a uniform size using ROIAlign (a region feature aggregation method);
2.4) Classify the regions of interest (ROIs), perform bounding box regression, and generate masks (an FCN operation inside each ROI).
In a second aspect, referring to fig. 6, an embodiment of the present invention provides a small target detection system, including an input module 210, a feature extraction module 220, a feature fusion module 230, and an output module 240, where:
The input module 210 is configured to obtain a target image input to the convolutional neural network model, where the target image contains a small target to be identified. The feature extraction module 220 is configured to extract features of the small targets in the target image to obtain an original feature map of the target image. The feature fusion module 230 is configured to extract features from the original feature map of the target image through a channel attention mechanism to obtain a semantic feature map emphasizing the semantic features of the small targets, extract features from the original feature map through a spatial attention mechanism to obtain a position feature map emphasizing the position features of the small targets, and fuse the original feature map, the semantic feature map and the position feature map to obtain the fused feature map of the target image. The output module 240 is configured to classify the small targets in the target image based on the fused feature map to obtain the identification results of the small targets.
The feature extraction module 220 of the embodiment of the present invention performs an initial extraction of the information in the target image; the resulting original feature map has high resolution but weak semantic and position information. The feature fusion module 230 extracts features from the original feature map with a channel attention mechanism to obtain a semantic feature map emphasizing the semantic features of the small targets, extracts features from the original feature map with a spatial attention mechanism to obtain a position feature map emphasizing the position features of the small targets, and fuses the original feature map, the semantic feature map and the position feature map so that each compensates for the deficiencies of the others, obtaining a fused feature map enhanced in resolution, semantic information and position information. The convolutional neural network classifies and identifies the small targets in the target image based on the fused feature map, so that the identification accuracy of the convolutional neural network model on small targets can be improved.
In addition, referring to fig. 7, the present application also provides a computer device 301, comprising: a memory 310, a processor 320, and a computer program 311 stored in the memory 310 and executable on the processor; when executing the computer program 311, the processor 320 implements:
such as the small object recognition method described above.
The processor 320 and memory 310 may be connected by a bus or other means.
The memory 310, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 310 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 310 may optionally include memory located remotely from the processor, which may be connected to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the small target recognition method of the above embodiments are stored in the memory and, when executed by a processor, perform that method, for example method steps S110 to S140 in fig. 1.
Additionally, referring to fig. 8, the present application also provides a computer-readable storage medium 401 storing computer-executable instructions 410, the computer-executable instructions 410 being configured to perform:
such as the small object recognition method described above.
The computer-readable storage medium 401 stores computer-executable instructions 410, and the execution of the computer-executable instructions 410 by a processor or controller, for example, by a processor in the above-mentioned electronic device embodiment, may cause the above-mentioned processor to execute the small target identification method in the above-mentioned embodiment, for example, execute the above-mentioned method steps S110 to S140 in fig. 1.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of data such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired data and which can be accessed by the computer. In addition, communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any data delivery media as known to one of ordinary skill in the art.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (9)

1. A small target identification method is characterized by comprising the following steps:
acquiring a target image input by a convolutional neural network model, wherein the target image contains a small target to be identified;
extracting the characteristics of small targets in the target image to obtain an original characteristic diagram of the target image;
extracting the features of the original feature map of the target image through a channel attention mechanism to obtain a semantic feature map with small target semantic features, extracting the features of the original feature map of the target image through a space attention mechanism to obtain a position feature map with small target position features, and fusing the original feature map of the target image, the semantic feature map and the position feature map to obtain a feature map after the target image is fused;
and classifying the small targets in the target images based on the feature map after the target images are fused to obtain the identification results of the small targets.
2. The small target identification method according to claim 1, wherein the extracting of the feature of the original feature map of the target image through a channel attention mechanism to obtain the semantic feature map emphasizing the semantic features of the small target comprises the steps of:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and adding the average feature matrix and the maximum feature matrix element-wise;
convolving the result of the element-wise addition with a 1 × 1 × C/r convolution kernel and activating with a ReLU nonlinear activation layer;
and convolving the output of the activated ReLU nonlinear activation layer with a 1 × 1 × C convolution kernel and activating with a sigmoid nonlinear activation function to obtain the semantic feature map emphasizing the small target semantic features.
3. The small target identification method according to claim 1, wherein the extracting the feature of the original feature map of the target image through a spatial attention mechanism to obtain the position feature map emphasizing the small target position feature comprises the following steps:
performing global average pooling and max pooling on the original feature map of the target image to obtain an average feature matrix and a maximum feature matrix, and concatenating the average feature matrix and the maximum feature matrix;
and convolving the concatenated result and activating with a sigmoid nonlinear activation function to obtain the position feature map emphasizing the small target position features.
4. The small target identification method according to claim 1, wherein the step of fusing the original feature map of the target image, the semantic feature map and the position feature map to obtain the feature map after the target image is fused comprises the steps of:
multiplying the original feature map and the semantic feature map element-wise twice, multiplying the original feature map and the position feature map element-wise once, and adding the results of the multiplications element-wise;
and multiplying the result of the element-wise addition by the original feature map element-wise to obtain the fused feature map of the target image.
5. The small target identification method of claim 1, wherein: the features of the small targets in the target image are extracted through convolution or dilated convolution.
6. The small target identification method of claim 1, wherein the identification result of the small targets comprises: marking the category of the small targets with text, marking the position of the small targets with a rectangle, and masking small targets of different categories with different colors.
7. A small object detection system, comprising:
the system comprises an input module, a target recognition module and a target recognition module, wherein the input module is used for acquiring a target image input by a convolutional neural network model, and the target image contains a small target to be recognized;
the feature extraction module is used for extracting features of small targets in the target image to obtain an original feature map of the target image;
the feature fusion module is used for extracting features of an original feature map of the target image through a channel attention mechanism to obtain a semantic feature map of the semantic features of the small targets with emphasis, extracting features of the original feature map of the target image through a space attention mechanism to obtain a position feature map of the position features of the small targets with emphasis, and fusing the original feature map of the target image, the semantic feature map and the position feature map to obtain a feature map after the target image is fused;
and the output module is used for classifying the small targets in the target images based on the feature map after the target images are fused to obtain the identification results of the small targets.
8. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the computer program implementing:
a small object recognition method as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium having stored thereon computer-executable instructions for performing:
a small object recognition method as claimed in any one of claims 1 to 6.
CN202111225109.1A 2021-10-21 2021-10-21 Small target identification method, system, electronic equipment and medium Pending CN114037839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111225109.1A CN114037839A (en) 2021-10-21 2021-10-21 Small target identification method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111225109.1A CN114037839A (en) 2021-10-21 2021-10-21 Small target identification method, system, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114037839A true CN114037839A (en) 2022-02-11

Family

ID=80135631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111225109.1A Pending CN114037839A (en) 2021-10-21 2021-10-21 Small target identification method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114037839A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677661A (en) * 2022-03-24 2022-06-28 智道网联科技(北京)有限公司 Roadside identifier identification method and device and electronic equipment
CN115631196A (en) * 2022-12-20 2023-01-20 杭州太美星程医药科技有限公司 Image segmentation method, model training method, device, equipment and storage medium
CN115631196B (en) * 2022-12-20 2023-03-10 杭州太美星程医药科技有限公司 Image segmentation method, model training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107341517B (en) Multi-scale small object detection method based on deep learning inter-level feature fusion
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
US10198657B2 (en) All-weather thermal-image pedestrian detection method
CN109241982B (en) Target detection method based on deep and shallow layer convolutional neural network
CN108229397B (en) Method for detecting text in image based on Faster R-CNN
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN108921875A (en) A kind of real-time traffic flow detection and method for tracing based on data of taking photo by plane
CN111833273B (en) Semantic boundary enhancement method based on long-distance dependence
CN106778757A (en) Scene text detection method based on text conspicuousness
CN108305260B (en) Method, device and equipment for detecting angular points in image
CN114037839A (en) Small target identification method, system, electronic equipment and medium
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
CN103020606A (en) Pedestrian detection method based on spatio-temporal context information
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN104657724A (en) Method for detecting pedestrians in traffic videos
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111353544A (en) Improved Mixed Pooling-Yolov 3-based target detection method
Zang et al. Traffic lane detection using fully convolutional neural network
Zhou et al. Building segmentation from airborne VHR images using Mask R-CNN
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN114596592B (en) Pedestrian re-identification method, system, equipment and computer readable storage medium
Shi Object detection models and research directions
CN112347967B (en) Pedestrian detection method fusing motion information in complex scene
Song et al. MEB-YOLO: An Efficient Vehicle Detection Method in Complex Traffic Road Scenes.
CN109284752A (en) A kind of rapid detection method of vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination