CN111461110A - Small target detection method based on multi-scale image and weighted fusion loss - Google Patents

Small target detection method based on multi-scale image and weighted fusion loss

Info

Publication number
CN111461110A
CN111461110A (application CN202010134062.7A)
Authority
CN
China
Prior art keywords
image
convolution
feature
layer
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010134062.7A
Other languages
Chinese (zh)
Other versions
CN111461110B (en)
Inventor
林坤阳
罗家祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010134062.7A priority Critical patent/CN111461110B/en
Publication of CN111461110A publication Critical patent/CN111461110A/en
Application granted granted Critical
Publication of CN111461110B publication Critical patent/CN111461110B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of image and video processing, and relates to a small target detection method based on multi-scale images and weighted fusion loss, which comprises the following steps: extracting a plurality of groups of feature vectors from images of different scales based on an improved Mask RCNN model, fusing the plurality of groups of feature vectors, and constructing a feature pyramid; generating candidate detection frames based on the feature pyramid and screening them to obtain suggested detection frames; mapping the suggested detection frames back to the feature maps of the feature pyramid, and aligning and intercepting the feature maps; inputting the aligned suggested detection frames into a classifier layer to obtain the class confidence and position offset of each suggested detection frame; in the testing stage, screening the suggested detection frames according to their class confidence scores and performing non-maximum suppression; in the training stage, weighting the loss function calculated by the small target detection feature layer and fusing it with the loss functions of the large and medium target detection layers, so that the sensitivity of the model to small target objects is enhanced.

Description

Small target detection method based on multi-scale image and weighted fusion loss
Technical Field
The invention belongs to the field of image and video processing, and relates to a small target detection method based on multi-scale images and weighted fusion loss.
Background
With the development of machine learning and deep learning, the fields of pattern recognition and computer vision have received unprecedented attention and popularity thanks to the powerful learning ability of convolutional neural networks. In the era of widespread machine automation and artificial intelligence, the role played by cameras is increasingly comparable to that of human eyes, so the development of the computer vision field is particularly important and has received wide attention from industry and academia. Among its tasks, target detection is a prominent and continuously advancing direction of computer vision. However, many of the target objects in pictures and videos appear in extremely small form: they occupy very few pixels in a frame, in most cases fewer than 49 pixels, so the task of detecting tiny objects is both difficult and very important.
The difficulty of small target detection lies in scale: extracting feature information by feeding an image of a single size into the network model usually cannot take care of targets of every scale. Although the existing Mask RCNN model performs well on target detection, problems remain such as the single scale of the input image, uncertain resolution, insufficient utilization of context information, and insensitivity to small target objects.
Disclosure of Invention
Aiming at the defects of the traditional Mask RCNN model, the invention provides a small target detection method based on multi-scale images and weighted fusion loss.
The invention is realized by adopting the following technical scheme:
a small target detection method based on multi-scale images and weighted fusion loss is realized based on an improved Mask RCNN model and comprises the following steps:
s1, building an improved Mask RCNN model; the improved Mask RCNN model comprises: the system comprises a residual backbone network, a characteristic pyramid network layer, an area generation network layer, an interested frame alignment layer, a classifier layer, a loss function calculation layer and a test layer;
s2, constructing an image pyramid: carrying out scaling processing on the original image, and forming an image pyramid by the original image, the image with the reduced size and the image with the enlarged size;
s3, randomly cutting the image in the image pyramid;
s4, sending the randomly cut images into a residual backbone network for convolution, batch normalization and pooling, and outputting a plurality of groups of feature maps with different sizes;
s5, fusing a plurality of groups of feature maps with different scales, and further processing to obtain feature maps P2-P6;
s6, generating candidate detection frames which are not screened for the feature maps P2-P6 respectively;
s7, inputting the feature maps P2-P6 into the area generation network layer, and obtaining the offsets and confidences of the candidate detection frames through a series of convolution operations;
s8, combining the offset of the candidate detection frame in the S7 and the data of the candidate detection frame which is not screened and obtained in the S6, and screening the candidate detection frame with a set amount as the interested detection frame;
s9, respectively corresponding the interested detection frames to the feature maps P2-P6, and carrying out alignment operation;
s10, inputting the result of the alignment operation into a classification layer, and outputting the predicted category score, the category probability and the coordinate offset of the interested detection frame;
s11, inputting the predicted category scores, category probabilities and coordinate offsets of the detection frames of interest into the test layer, where the maximum of the category probability is selected as the predicted target category of each detection frame of interest, redundant detection frames of interest are then filtered by non-maximum suppression, and the finally predicted detection frames of interest and their corresponding predicted target categories are obtained in the test layer.
Further, the training phase also comprises:
s12, inputting the classification scores of the interested detection boxes predicted in S10 into a loss function calculation layer, and taking the classification scores and the actual classification labels as the input of a cross entropy function to calculate a classification loss value so as to obtain the classification prediction loss of the feature maps P2-P6;
the coordinate offset of the interested detection frame predicted in the S10 and the offset of the real target frame are taken as the input of a regression loss function, so that the regression prediction loss of the characteristic map P2-P6 is obtained;
s13, weighting the category prediction losses of the feature map P2 and the feature map P3 respectively, and adding the category prediction losses of the feature map P4, the feature map P5 and the feature map P6 to obtain a total category prediction loss;
weighting the regression prediction losses of the feature map P2 and the feature map P3 respectively, and adding the weighted regression prediction losses of the feature map P4, the feature map P5 and the feature map P6 to obtain a total regression prediction loss;
and S14, iteratively updating parameters and weights of the improved Mask RCNN model through back propagation, specifically, respectively utilizing total class prediction loss and total regression prediction loss, and performing optimization iteration and changing the weight value of the improved Mask RCNN model.
Further, the improvements of the improved Mask RCNN model comprise:
①, in the alignment of the detection frames of interest, the alignment is no longer performed uniformly; instead, different feature layers are aligned separately, and after alignment their outputs are not fused directly but are input into the classifier layer for classification and regression respectively and then into separate loss function calculation layers, where the loss function calculated by the small target feature layer is weighted and fused with the loss functions of the large and medium target layers;
②, adding an effective characteristic layer P6 in the original Mask RCNN model;
③, removing the image segmentation module in the original Mask RCNN and canceling the Mask branch.
Preferably, the scaling process performed on the original image in S2 includes:
the formula for scaling pictures is expressed as:
Image_New=Image*scale (1)
wherein: image _ New represents a zoomed picture, Image represents a picture before zooming, and scale represents a zooming scale;
the scale is determined by the following factors:
if the length of the minimum edge after the scaling is finished cannot be smaller than min _ dim, min () represents the minimum value operation, h represents the height of the original image, w represents the width of the original image, when min _ dim is larger than min (h, w),
scale=min_dim/min(h,w) (2)
otherwise scale is 1;
if the length of the longest edge after the scaling is finished is max _ dim, if the picture is scaled according to equation (2), and if the longest edge of the scaled picture exceeds max _ dim, the following steps are performed:
scale=max_dim/image_max (3)
otherwise, continuously scaling according to scale min _ dim/min (h, w);
the size of the final zoomed picture is max _ dim × max _ dim, and in addition, if the scale of the final zooming is larger than 1, the original picture is magnified by a bilinear interpolation method; for the part of the picture after the last scaling which is less than max _ dim, zero values are used to fill in the pixel values.
Preferably, the formula for randomly cropping the picture in S3 is expressed as follows:
Y1=randi([0,image_size(1)-crop_size(1)]) (4)
X1=randi([0,image_size(2)-crop_size(2)]) (5)
wherein: y1 and X1 represent the lower left-hand ordinate and lower left-hand abscissa, respectively, at which cropping of the picture begins; randi represents random access, and the access range is the range inside the small brackets; image _ size is the size of the picture before clipping, the width of the first dimension storing the picture, and the length of the second dimension storing the picture; crop _ size is the size of the area to be cut, the width of the first-dimension storage area and the length of the second-dimension storage area;
Y2=min(image_size(1),Y1+crop_size(1)) (6)
X2=min(image_size(2),X1+crop_size(2)) (7)
wherein: y2 and X2 respectively represent the ordinate and abscissa of the upper right corner at the start of cropping; randi represents random access; min () represents taking the minimum value;
the specific position of the cropping is determined by using the two coordinates obtained by the formulas (4) and (7), and if the cropping area overflows from the original image, pad filling is carried out to obtain the cropped image.
Preferably, the convolution of the residual backbone network comprises two convolution modules, block1 and block2, wherein:
the convolution module block1 workflow comprises:
①, for Branch 1, the output and input remain consistent;
②, for branch 2, sequentially using 1 × 1 convolution kernel, 3 × 3 convolution kernel and 1 × 1 convolution kernel to perform convolution operation, and performing mean value normalization on the output feature vector after each convolution is completed;
the convolution module block2 workflow comprises:
①, for branch 1, performing convolution operation by using 1 × 1 convolution kernel, and then performing mean value normalization on the output feature vector;
②, for branch 2, convolution operation is performed by using 1 × 1 convolution kernel, 3 × 3 convolution kernel and 1 × 1 convolution kernel in sequence, and mean normalization is performed on the output feature vector after each convolution is completed.
Preferably, outputting a plurality of sets of feature maps of different sizes in S4 includes:
for the original input, a five-layer feature map output is constructed: C2, C3, C4, C5, C6; for the input reduced to half the size of the original, a five-layer feature map output is constructed: C2s, C3s, C4s, C5s, C6s; for the input enlarged to twice the size of the original, a five-layer feature map output is constructed: C2l, C3l, C4l, C5l, C6l.
Preferably, step S5 includes:
s51, performing interpolation-based upsampling on C2s-C6s to double the size of C2s-C6s;
s52, performing maximum pooling on C2l-C6l to halve the size of C2l-C6l;
s53, adding C2-C6, the doubled C2s-C6s and the halved C2l-C6l to obtain C2-C6 fused with image features of different scales;
s54, further processing the C2-C6 fused with the image features of different scales to obtain feature maps P2-P6.
Preferably, S54 includes:
using 256 convolution kernels of 1 × 1 to convolve the C6 fused with image features of different scales to obtain a feature map P6 with an output of 16 × 16 × 256;
the C5 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P6 by a factor of two, and a 3 × 3 convolution is then carried out to obtain a feature map P5 with an output of 32 × 32 × 256;
the C4 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P5 by a factor of two, and a 3 × 3 convolution is then carried out to obtain a feature map P4 with an output of 64 × 64 × 256;
the C3 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P4 by a factor of two, and a 3 × 3 convolution is then carried out to obtain a feature map P3 with an output of 128 × 128 × 256;
the C2 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P3 by a factor of two, and a 3 × 3 convolution is then carried out to obtain a feature map P2 with an output of 256 × 256 × 256.
Preferably, the height h and the width w of the candidate detection frame are:
h = scale_length/√ratios
w = scale_length × √ratios
wherein: scale_length is the height and width, at the pixel level of the original image, of the candidate detection frame when its height equals its width (i.e. when the frame is square).
Compared with the existing Mask RCNN model small target detection, the method has the following beneficial effects:
(1) a method for constructing an image pyramid as Mask RCNN model input is provided, one image is preprocessed to be changed into multiple scales as input instead of a single scale, and the capability of extracting the characteristics of a small target object is enhanced.
(2) An image segmentation module in the original Mask RCNN model is removed, a Mask branch is eliminated, network parameters are reduced, and the training classification and regression part is more efficient.
(3) The alignment of the interested regions is not aligned uniformly any more, but different feature layers are aligned separately, then the different feature layers are input into the classification layers for classification and regression respectively, finally the input loss function calculation layers are separated, the loss functions calculated by the small target feature layer are weighted and fused with the loss functions of the large and medium target layers, the influence of the small target object on the model loss function is enhanced, and the model can learn the features of the small target object better.
(4) A layer of effective characteristic layer P6 is added in the original Mask RCNN model, so that the detection precision of a small target object is improved, the detection precision of a large object is ensured not to be reduced, and the detection accuracy and precision of the small target object in target detection are improved.
Drawings
FIG. 1 is a diagram of an improved Mask RCNN model architecture in accordance with an embodiment of the present invention;
FIG. 2 is a diagram illustrating the convolution blocks in the improved Mask RCNN model according to an embodiment of the present invention;
fig. 3 is a flow chart of implementation of small target detection in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
A small target detection method based on multi-scale images and weighted fusion loss is realized based on an improved Mask RCNN model and comprises the following steps:
and S1, constructing an improved Mask RCNN model.
In a preferred embodiment, the improved Mask RCNN model includes a backbone network part, a candidate window generation part and a classification layer part, and is built by using the keras platform, including: the residual backbone network, the feature pyramid network layer, the area generation network layer, the frame-of-interest alignment layer, the classifier layer, the loss function calculation layer and the test layer. Compared with the original Mask RCNN, the improvements comprise:
①, in the alignment of the regions of interest, different feature layers are aligned separately; after alignment their outputs are not fused directly but are input into the classifier layer for classification and regression respectively and then into separate loss function calculation layers, where the loss function calculated by the small target feature layer is weighted and fused with the loss functions of the large and medium target layers, enhancing the influence of small target objects on the model loss function and enabling the model to learn the features of small target objects better.
②, adding an effective characteristic layer P6 in the original Mask RCNN model, so that the detection precision of the small target object is improved, the detection precision of the large object is not reduced, and the detection accuracy and precision of the small target object in the target detection are improved.
③, removing the image segmentation module in the original Mask RCNN, and canceling the Mask branch, thereby reducing the network parameters and making the training classification and regression part more efficient.
And S2, constructing an image pyramid.
And respectively carrying out image size reduction and image size amplification on the original image data by one time, and reserving the original size image data, wherein the original image, the reduced image and the amplified image form an image pyramid together.
The first step of changing the size of the picture is to zoom the picture, and the formula for zooming the picture is expressed as:
Image_New=Image*scale (1)
wherein: image _ New represents the zoomed picture, Image represents the picture before zooming, and scale represents the zoom scale. The scale is determined by the following factors:
if the length of the minimum edge after the scaling is finished cannot be smaller than min _ dim, min () represents the minimum value operation, h represents the height of the original image, w represents the width of the original image, when min _ dim is larger than min (h, w),
scale=min_dim/min(h,w) (2)
otherwise scale is 1.
If the length of the longest edge after the scaling is finished is max _ dim, if the picture is scaled according to equation (2), and if the longest edge of the scaled picture exceeds max _ dim, the following steps are performed:
scale=max_dim/image_max (3)
otherwise, continue scaling by scale min _ dim/min (h, w).
And the size of the final zoomed picture is max _ dim × max _ dim, and in addition, if the scale of the final zooming is larger than 1, namely the original picture is zoomed in, the original picture is zoomed in by a bilinear interpolation method. For the part of the picture after the last scaling which is less than max _ dim, zero values are used to fill in the pixel values.
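As an illustrative, non-limiting sketch, the scaling rule of equations (1)-(3) can be written in Python as follows; min_dim = 800 and max_dim = 1024 are assumed default values not fixed by the text, and PIL is used only as one possible bilinear resize backend:

```python
import numpy as np
from PIL import Image as PILImage

def compute_scale(h, w, min_dim, max_dim):
    scale = 1.0
    if min_dim > min(h, w):                  # equation (2)
        scale = min_dim / min(h, w)
    if round(max(h, w) * scale) > max_dim:   # longest edge would exceed max_dim
        scale = max_dim / max(h, w)          # equation (3), with image_max = max(h, w)
    return scale

def scale_and_pad(image, min_dim=800, max_dim=1024):
    """Scale an (H, W, 3) uint8 image per equations (1)-(3) and zero-pad to max_dim."""
    h, w = image.shape[:2]
    scale = compute_scale(h, w, min_dim, max_dim)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = np.asarray(PILImage.fromarray(image).resize(
        (new_w, new_h), resample=PILImage.BILINEAR))      # bilinear interpolation
    padded = np.zeros((max_dim, max_dim, image.shape[2]), dtype=image.dtype)
    padded[:new_h, :new_w] = resized          # pixels short of max_dim are zero-filled
    return padded, scale
```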
S3, the original image, the reduced image and the enlarged image in the image pyramid are each randomly cropped to 512 × 512 as the input of one training pass.
The formula for randomly cropping the picture is expressed as follows:
Y1=randi([0,image_size(1)-crop_size(1)]) (4)
X1=randi([0,image_size(2)-crop_size(2)]) (5)
wherein: y1 and X1 represent the lower left-hand ordinate and lower left-hand abscissa, respectively, at which cropping of the picture begins; randi represents random access, and the access range is the range inside the small brackets; image _ size is the size of the picture before clipping, the width of the first dimension storing the picture, and the length of the second dimension storing the picture; crop _ size is the size of the area to be cut, the width of the first dimension storage area, and the length of the second dimension storage area.
Y2=min(image_size(1),Y1+crop_size(1)) (6)
X2=min(image_size(2),X1+crop_size(2)) (7)
Wherein: y2 and X2 respectively represent the ordinate and abscissa of the upper right corner at the start of cropping; randi represents a random number, the number range is the range inside the small brackets, min () represents the minimum value, and the two numbers to be compared are inside the small brackets.
And determining the specific position of the cutting by using the two obtained coordinates, and if the cutting area overflows from the original image, performing pad filling to obtain the image after cutting. pad padding is to zero-pad the overflow area in three channels of pixels, i.e. each channel is assigned a value of 0.
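An illustrative Python sketch of the random cropping of equations (4)-(7) is given below; the 512 × 512 crop size follows the embodiment, and the zero padding of any overflowing area corresponds to the pad filling described above:

```python
import numpy as np

def random_crop(image, crop_size=(512, 512)):
    """Randomly crop an (H, W, 3) image per equations (4)-(7), zero-padding any overflow."""
    ih, iw = image.shape[:2]
    ch, cw = crop_size
    y1 = np.random.randint(0, max(ih - ch, 0) + 1)   # equation (4)
    x1 = np.random.randint(0, max(iw - cw, 0) + 1)   # equation (5)
    y2 = min(ih, y1 + ch)                            # equation (6)
    x2 = min(iw, x1 + cw)                            # equation (7)
    out = np.zeros((ch, cw, image.shape[2]), dtype=image.dtype)   # pad filling: all
    out[:y2 - y1, :x2 - x1] = image[y1:y2, x1:x2]                 # three channels zeroed
    return out
```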
And S4, sending the cut image into a residual backbone network for convolution, batch normalization and pooling, and outputting three groups of feature maps with different sizes.
In a preferred embodiment, the residual backbone network performs a total of 60 convolutions, with a maximum pooling at the beginning, an average pooling at the end, and a batch normalization after each convolution. Using specific convolution modules and according to the size relation of the feature maps after convolution, a five-layer feature map output is constructed for the original image input: C2, C3, C4, C5, C6; for the input reduced to half the size of the original, a five-layer feature map output is constructed: C2s, C3s, C4s, C5s, C6s; for the input enlarged to twice the size of the original, a five-layer feature map output is constructed: C2l, C3l, C4l, C5l, C6l.
The convolution formula is expressed as follows:
Output = Σ(i=1..n) Σ(j=1..n) w(i,j) × Input(i',j') (8)
where Output is the value of each point in the feature map output by the convolution, w(i,j) is the weight of the n × n convolution kernel at location (i, j), and Input(i',j') is the pixel value of the input map at the location corresponding to convolution kernel location (i, j).
For the max pooling operation, a kernel of size 3 × 3 is selected and slid over the input map with a stride of 2, and the maximum value in each 3 × 3 local receptive field is selected as the value of the corresponding point of the output map. The formula is expressed as:
Output_max = max(Area_input) (9)
wherein: Area_input represents the values of all the pixels in the local receptive field.
For the average value pooling operation, the values of all the pixel points of the input feature map are summed and then averaged, finally giving a 1 × 1 output without changing the number of channels. The formula is expressed as follows:
Output_avg = (1/(n × n)) × Σ(i=1..n) Σ(j=1..n) Input(i,j) (10)
wherein: Input(i,j) is the pixel value of the input map at pixel point (i, j), and n × n is the size of the input feature map.
Batch normalization converts the value distribution of each point of the convolution map output by each hidden layer of the deep neural network into a normal distribution with mean 0 and unit variance, so as to accelerate network convergence. For a batch of inputs, assuming n samples, the output of a certain hidden layer I is {z(1), z(2), z(3), z(4), ..., z(n)}. The mean of this batch of outputs is:
μ = (1/n) × Σ(i=1..n) z(i) (11)
The variance is then calculated:
σ² = (1/n) × Σ(i=1..n) (z(i) − μ)² (12)
the output is normalized (batch normalization), i.e. the output of each feed sample is subjected to the following operations:
z_norm(i) = (z(i) − μ)/√(σ² + ε) (13)
wherein ε is added to prevent an invalid calculation when the variance is 0.
Since such normalization alone weakens the expressive power of the data, the invention introduces two learnable parameters γ and β, and the batch normalized output obtained above is further processed as follows:
z_out(i) = γ × z_norm(i) + β (14)
thereby restoring the expressive power of the data itself.
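For clarity, equations (11)-(14) can be sketched in Python as follows; the function and parameter names are illustrative:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Batch normalization of a (batch, features) output per equations (11)-(14)."""
    mu = z.mean(axis=0)                       # equation (11): batch mean
    var = ((z - mu) ** 2).mean(axis=0)        # equation (12): batch variance
    z_norm = (z - mu) / np.sqrt(var + eps)    # equation (13): normalize, eps guards var = 0
    return gamma * z_norm + beta              # equation (14): learnable scale and shift

# example: a batch of 32 hidden-layer outputs with 256 features
out = batch_norm(np.random.randn(32, 256), gamma=np.ones(256), beta=np.zeros(256))
```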
In a preferred embodiment, the activation function of the improved Mask RCNN model is unified by a Sigmoid function, and the mathematical expression of the Sigmoid function is as follows:
Sigmoid(x) = 1/(1 + e^(−x)) (15)
s5, fusing three groups of feature maps C2S-C6S, C2-C6 and C2l-C6l with different scales obtained in S4, and further obtaining feature maps P2-P6. The specific operation mode is as follows:
①, performing interpolation-based upsampling on C2s-C6s to double the size of C2s-C6s;
②, performing maximum pooling on C2l-C6l to halve the size of C2l-C6l;
the principle of maximum pooling is as follows: selecting a kernel with the size of 3 x 3, sliding in the input graph by the step size of 2, selecting the maximum value in the sliding 3 x 3 area as the value of the corresponding point of the output graph, and expressing the formula as follows:
Output_max = max(Area_input) (16)
wherein: Area_input represents the values of all the pixels in the local receptive field.
③, adding C2-C6, C2s-C6s which is enlarged by one time and C2l-C6l which is reduced by one time to obtain C2-C6 with fused image features of different scales.
C2-C6 at this time is different from C2-C6 in S4, and C2-C6 at this time fuses the features extracted from the images with different scales.
④, and further processing the C2-C6 fused with the image features of different scales to obtain feature maps P2-P6.
Specifically, the method comprises the following steps: the C6 fused with image features of different scales is convolved with 256 convolution kernels of 1 × 1 to obtain a feature map P6 with an output of 16 × 16 × 256.
The C5 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P6 by a factor of two, and a 3 × 3 convolution is then performed to obtain a feature map P5 with an output of 32 × 32 × 256.
The C4 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P5 by a factor of two, and a 3 × 3 convolution is then performed to obtain a feature map P4 with an output of 64 × 64 × 256.
The C3 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P4 by a factor of two, and a 3 × 3 convolution is then performed to obtain a feature map P3 with an output of 128 × 128 × 256.
The C2 fused with image features of different scales is convolved by 256 convolution kernels of 1 × 1, the result is added to the output obtained by upsampling P3 by a factor of two, and a 3 × 3 convolution is then performed to obtain a feature map P2 with an output of 256 × 256 × 256.
The above obtained P2-P6 are combined together to form a feature map matrix.
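An illustrative Python sketch of the fusion in S5 and of the top-down construction of P2-P6 is given below. The nearest-neighbour upsampling, 2 × 2 max pooling and randomly initialised 1 × 1 projection are simplifications standing in for the interpolation-based upsampling, the 3 × 3 pooling kernel and the trained convolution kernels described above:

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour stand-in for the interpolation-based 2x upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

def maxpool2x(x):
    # 2x2 max pooling stand-in for the 3x3, stride-2 max pooling
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fuse_scales(C, Cs, Cl):
    # C:  maps of the original image; Cs: maps of the half-size image (upsampled by 2);
    # Cl: maps of the double-size image (pooled by 2); element-wise addition fuses them
    return [c + upsample2x(cs) + maxpool2x(cl) for c, cs, cl in zip(C, Cs, Cl)]

def conv1x1(x, out_ch=256):
    # randomly initialised 1x1 projection standing in for the trained 256-kernel 1x1 conv
    w = np.random.randn(x.shape[-1], out_ch) * 0.01
    return x @ w

def build_pyramid(C_fused):
    C2, C3, C4, C5, C6 = C_fused
    P6 = conv1x1(C6)
    P5 = conv1x1(C5) + upsample2x(P6)   # a 3x3 smoothing convolution would follow
    P4 = conv1x1(C4) + upsample2x(P5)   # in the full model for P2-P5
    P3 = conv1x1(C3) + upsample2x(P4)
    P2 = conv1x1(C2) + upsample2x(P3)
    return [P2, P3, P4, P5, P6]

# usage: P2, P3, P4, P5, P6 = build_pyramid(fuse_scales(C, Cs, Cl))
```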
And S6, generating an unscreened candidate detection frame on the feature map P2-P6 obtained in S5.
The height h and width w of the candidate detection frame are as follows:
h = scale_length/√ratios (17)
w = scale_length × √ratios (18)
wherein: scale_length is the height and width, at the pixel level of the original image, of the candidate detection frame when its height equals its width; for P2 to P6 it takes the values 32, 64, 128, 256 and 512, respectively. ratios gives the three aspect ratios of the candidate detection frames at each size: 0.5, 1 and 2.
Candidate detection frames of different sizes and different scales are generated on the feature maps P2-P6, taking the pixel points as centers: each pixel point of each layer of feature map serves as the center coordinate of a group of candidate detection frames.
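An illustrative Python sketch of the candidate detection frame generation is given below; the parameterisation of equations (17)-(18), with h = w = scale_length at ratio 1, is an assumed standard form, and the strides 4, 8, 16, 32, 64 for P2-P6 follow from the feature map sizes of a 1024 × 1024 input:

```python
import itertools
import numpy as np

def make_anchors(feat_size, scale_length, stride, ratios=(0.5, 1.0, 2.0)):
    """Generate candidate detection frames centred on every pixel of one feature map."""
    anchors = []
    for y, x in itertools.product(range(feat_size), repeat=2):
        cy, cx = (y + 0.5) * stride, (x + 0.5) * stride   # centre in original-image pixels
        for r in ratios:
            h = scale_length / np.sqrt(r)   # assumed form of equation (17)
            w = scale_length * np.sqrt(r)   # assumed form of equation (18)
            anchors.append([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2])
    return np.array(anchors)

# P2-P6: feature sizes 256...16, scale_length 32...512, strides 4...64
levels = zip([256, 128, 64, 32, 16], [32, 64, 128, 256, 512], [4, 8, 16, 32, 64])
all_anchors = np.concatenate([make_anchors(n, s, st) for n, s, st in levels])
```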
S7, the feature maps P2 to P6 obtained in S5 are input to the RPN (Region Proposal Network, i.e. the area generation network layer), where they are convolved by a shared convolution layer without changing the feature map size. The convolved feature maps are then convolved with a 1 × 1 convolution kernel to obtain rpn_score; a softmax operation outputs the candidate detection frame confidence rpn_pro with 2 × (the number of candidate detection frames per pixel) channels; on this basis, another 1 × 1 convolution outputs the candidate detection frame offset information rpn_bbox with 4 × (the number of unscreened candidate detection frames per pixel) channels.
S8, rpn_bbox from S7 is combined with the unscreened candidate detection frames obtained in S6 to generate the detection frames of interest (also called ROI frames or suggested detection frames, i.e. the complete candidate detection frames after screening); their scores are obtained from the rpn_score of S7, the scores are ranked from large to small, non-maximum suppression is performed, and finally the first 1500 detection frames of interest are retained.
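The screening of S8 can be sketched in Python as follows; the box-offset parameterisation (dy, dx, dh, dw with exponential height/width scaling) and the greedy IoU non-maximum suppression are assumed standard forms, with the foreground confidence taken from rpn_pro (reshaped to two columns per frame) and the first 1500 frames retained:

```python
import numpy as np

def apply_deltas(boxes, deltas):
    """Refine (y1, x1, y2, x2) boxes with (dy, dx, dh, dw) offsets (assumed parameterisation)."""
    h, w = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    cy, cx = boxes[:, 0] + 0.5 * h, boxes[:, 1] + 0.5 * w
    cy, cx = cy + deltas[:, 0] * h, cx + deltas[:, 1] * w
    h, w = h * np.exp(deltas[:, 2]), w * np.exp(deltas[:, 3])
    return np.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], axis=1)

def nms(boxes, scores, iou_thr=0.7, keep=1500):
    """Greedy non-maximum suppression, keeping at most `keep` boxes ranked by score."""
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order, kept = scores.argsort()[::-1], []
    while order.size and len(kept) < keep:
        i, rest = order[0], order[1:]
        kept.append(i)
        yy1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        xx1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        yy2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        xx2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(yy2 - yy1, 0, None) * np.clip(xx2 - xx1, 0, None)
        iou = inter / (area[i] + area[rest] - inter)
        order = rest[iou <= iou_thr]
    return boxes[kept]

def propose(anchors, rpn_bbox, rpn_prob):
    """rpn_bbox: (N, 4) offsets; rpn_prob: (N, 2) per-frame confidences (rpn_pro reshaped)."""
    refined = apply_deltas(anchors, rpn_bbox)
    return nms(refined, rpn_prob[:, 1])   # column 1 = foreground confidence
```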
And S9, performing alignment operation on the 1500 interested detection frames obtained in the S8 respectively corresponding to the P2-P6 feature maps.
In the alignment, the average target size of the P5 feature layer is 224 × 224 and its receptive field relative to the original image is 32, so the P5 feature layer should be aligned to a size of 7 × 7 (224/32 = 7). The receptive fields from the P2 to P6 layers increase by a factor of 2 layer by layer, so by analogy P2, P3, P4 and P6 are also aligned to a size of 7 × 7. The outputs after alignment are, respectively: align_p2, align_p3, align_p4, align_p5, align_p6.
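An illustrative Python sketch of the per-layer assignment in S9 is given below; the level-selection rule k = 5 + log2(√(w·h)/224) is an assumption consistent with the 224 × 224 average target size stated for P5, and the actual 7 × 7 ROI-align crop on each feature map is omitted:

```python
import numpy as np

def assign_level(box, k_min=2, k_max=6):
    """Map a proposal (y1, x1, y2, x2) to a pyramid level P2-P6 by its size."""
    h, w = box[2] - box[0], box[3] - box[1]
    k = 5 + np.log2(np.sqrt(max(h * w, 1e-6)) / 224.0)   # 224x224 boxes land on P5
    return int(np.clip(np.round(k), k_min, k_max))

def split_rois_by_level(rois):
    """Group the 1500 proposals per feature level; each group is then aligned to 7x7x256."""
    per_level = {k: [] for k in range(2, 7)}
    for box in rois:
        per_level[assign_level(box)].append(box)
    return per_level   # keys 2..6 correspond to align_p2 ... align_p6
```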
S10, the outputs align_p2, align_p3, align_p4, align_p5 and align_p6 obtained after the alignment in S9 are input into the classification layer, where they are converted into vectors by convolution operations, and the category score, the category probability and the coordinate offset of the predicted detection frames of interest are output.
Specifically, the (x, 256, 7, 7) vector is converted into (x, 256, 1, 1) data by a 7 × 7 convolution, and the (x, 256, 1, 1) vector is converted into (x, 81) by a full connection, the output being the category score, where 81 is the number of ROI frame categories; a softmax operation on the (x, 81) vector outputs the category probability; a further full connection then converts (x, 256, 1, 1) into (x, 81 × 4), where 4 represents the 4 coordinate offsets of each of the 81 categories of the ROI frame. Note that "x" above indicates the number of detection frames of interest for each layer of the feature map, and its value may differ between layers.
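An illustrative Keras sketch of the classifier layer of S10 is given below (the model is built on the keras platform according to the embodiment); the layer names are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def classifier_head(num_classes=81):
    """7x7x256 aligned proposal -> 81 class scores, class probabilities, 81*4 offsets."""
    rois = layers.Input(shape=(7, 7, 256))                    # aligned proposals
    x = layers.Conv2D(256, (7, 7), padding="valid")(rois)     # -> (1, 1, 256)
    x = layers.Flatten()(x)
    class_scores = layers.Dense(num_classes, name="class_scores")(x)
    class_probs = layers.Softmax(name="class_probs")(class_scores)
    bbox_offsets = layers.Dense(num_classes * 4, name="bbox_offsets")(x)
    return tf.keras.Model(rois, [class_scores, class_probs, bbox_offsets])
```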
S11, in the testing stage, after S10, the category score, the category probability and the coordinate offset are input into the test layer, where the maximum value of the category probability is selected as the predicted target category of each detection frame of interest, and non-maximum suppression with a threshold of 0.7 is then carried out to filter redundant detection frames of interest. Finally, the finally predicted detection frames of interest and the corresponding predicted target categories are obtained in the test layer.
In the training phase, the method further comprises the following steps:
and weighting the loss function calculated by the small target detection feature layer, fusing the loss function with the loss functions of the large and medium target detection layers, enhancing the influence of the small target object on the model loss function, and enabling the improved Mask RCNN model to better learn the features of the small target object.
Specifically, the method comprises the following steps:
and S12, inputting the 81 category scores output by each interested detection box in each layer feature map P2-P6 in S10 into a loss function calculation layer, and using the loss function calculation layer and the actual category labels as the input of a cross entropy function to calculate a classification loss value to obtain category prediction loss. Wherein the cross entropy function is expressed as:
Figure BDA0002396690390000121
wherein: y'iFor the ith value in the true category label, c represents the total number of category labels, yiFor predicting the corresponding value in the vector after the category is subjected to softmax normalization, the more accurate the classification is, the more accurate yiThe closer to 1, the loss function value L ossy′The smaller (y) is.
The 4 coordinate offsets of the 81 categories output by each detection frame of interest of each layer of feature map in S10 are used, together with the actual target frame offsets, as the input of the regression loss function smooth_l1 to obtain the regression prediction loss. The function is expressed as:
smooth_l1(x) = 0.5 × x² if |x| < 1, |x| − 0.5 otherwise (20)
wherein: x represents the difference between the predicted detection frame of interest offset coordinates and the true target frame offset coordinates.
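For clarity, the two loss terms of equations (19) and (20) can be sketched in Python as follows; the function names and the 1e-12 numerical guard are illustrative:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(class_scores, one_hot_labels):
    """Equation (19): cross entropy over the softmax-normalised 81 category scores."""
    y = softmax(class_scores)
    return float(np.mean(-np.sum(one_hot_labels * np.log(y + 1e-12), axis=-1)))

def smooth_l1(pred_offsets, true_offsets):
    """Equation (20): 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(pred_offsets - true_offsets)
    return float(np.mean(np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)))
```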
S13, the category prediction losses Loss_class_P2 and Loss_class_P3 of the P2 and P3 layers in S12 are weighted and then added to Loss_class_P4, Loss_class_P5 and Loss_class_P6 to obtain the total category prediction loss:
Loss_class_P2'=4*Loss_class_P2
Loss_class_P3'=2*Loss_class_P3
Loss_class=Loss_class_P2'+Loss_class_P3'+Loss_class_P4+Loss_class_P5+Loss_class_P6 (21)
The regression prediction losses Loss_reg_P2 and Loss_reg_P3 of the P2 and P3 layers output in S12 are weighted and added to Loss_reg_P4, Loss_reg_P5 and Loss_reg_P6 to obtain the total regression prediction loss:
Loss_reg_P2'=4*Loss_reg_P2
Loss_reg_P3'=2*Loss_reg_P3
Loss_reg=Loss_reg_P2'+Loss_reg_P3'+Loss_reg_P4+Loss_reg_P5+Loss_reg_P6 (22)
Finally, Loss_class and Loss_reg are respectively used to perform optimization iterations and modify the weight values of the improved Mask RCNN model, thereby achieving the learning objective of the improved Mask RCNN model.
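An illustrative Python sketch of the weighted fusion of equations (21)-(22) is given below; the per-level loss values are placeholders, shown only to illustrate the call:

```python
def fuse_losses(per_level, w_p2=4.0, w_p3=2.0):
    """Weighted fusion per equations (21)-(22): weight P2 by 4 and P3 by 2, add P4-P6 as is."""
    return (w_p2 * per_level["P2"] + w_p3 * per_level["P3"]
            + per_level["P4"] + per_level["P5"] + per_level["P6"])

# placeholder per-level loss values
loss_class_per_level = {"P2": 0.8, "P3": 0.6, "P4": 0.5, "P5": 0.4, "P6": 0.3}
loss_reg_per_level = {"P2": 0.4, "P3": 0.3, "P4": 0.3, "P5": 0.2, "P6": 0.2}
Loss_class = fuse_losses(loss_class_per_level)   # equation (21)
Loss_reg = fuse_losses(loss_reg_per_level)       # equation (22)
```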
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1-3, a small target detection method based on multi-scale images and weighted fusion loss includes:
(1) Based on the improved Mask RCNN model, feature extraction is performed by the residual backbone network on the original image and on the images enlarged and reduced by one time, giving three groups of features of different scales; the three groups of features of different scales are then fused to obtain output feature maps C2, C3, C4, C5 and C6 in which features of different scales are merged.
In a preferred embodiment, different convolution modules are mainly used for extracting the features of the pictures with different scales. Convolution modules in the feature extraction layer (residual backbone network) are shown in fig. 2, and there are two types of convolution modules, which are respectively called "block 1" and "block 2" for short.
The convolution module block1 workflow comprises:
①, for Branch 1, the output and input remain consistent.
②, for branch 2, sequentially using 1 × 1 convolution kernel, 3 × 3 convolution kernel, 1 × 1 convolution kernel to perform convolution operation, and performing mean normalization on the output feature vector after each convolution is completed, specifically, the number of feature vector channels after each convolution operation is completed is in a proportional relationship of 1: 1: 4.
The convolution module block2 workflow comprises:
①, for branch 1, a convolution operation is performed using a 1 x1 convolution kernel followed by mean normalization of the output feature vector.
②, for branch 2, sequentially using 1 × 1 convolution kernel, 3 × 3 convolution kernel, 1 × 1 convolution kernel to perform convolution operation, and performing mean normalization on the output feature vector after each convolution is completed, specifically, the number of feature vector channels after each convolution operation is completed is in a proportional relationship of 1: 1: 4.
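An illustrative Keras sketch of the two convolution modules is given below; BatchNormalization stands in for the mean normalization described above, the 1:1:4 channel ratio is kept, and placing the Sigmoid activation after the addition is an assumption (the embodiment only fixes Sigmoid as the unified activation function):

```python
from tensorflow.keras import layers

def branch2(x, filters):
    # 1x1 -> 3x3 -> 1x1 convolutions, each followed by normalization (channel ratio 1:1:4)
    for k, f in [(1, filters), (3, filters), (1, 4 * filters)]:
        x = layers.Conv2D(f, (k, k), padding="same")(x)
        x = layers.BatchNormalization()(x)
    return x

def block1(x, filters):
    # branch 1 is the identity: output stays consistent with the input
    y = branch2(x, filters)
    return layers.Activation("sigmoid")(layers.Add()([x, y]))

def block2(x, filters):
    # branch 1 is a 1x1 convolution followed by normalization
    shortcut = layers.BatchNormalization()(layers.Conv2D(4 * filters, (1, 1))(x))
    y = branch2(x, filters)
    return layers.Activation("sigmoid")(layers.Add()([shortcut, y]))

# example stage in the style of C2 (assumed filter count 64)
inputs = layers.Input(shape=(256, 256, 64))   # e.g. the (x/4, x/4, 64) stem output
x = block2(inputs, 64)
x = block1(x, 64)
x = block1(x, 64)
```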
In a specific embodiment, it is assumed that pictures are input at a size of x × x × 3; the picture size is unified to the standard size of 1024 × 1024 for the original, that is, x = 1024 for the original, x = 512 when the picture is reduced by one time, and x = 2048 when the picture is enlarged by one time. For simplicity, the mean normalization operation performed after each convolution operation is not described again, and the padding and stride parameters of the convolutions and poolings are omitted when the padding is 0 and the stride is 1.
Preprocessing is performed on the pictures before the residual backbone network starts, specifically: the three input pictures are padded and convolved with 7 × 7 × 64 convolution kernels with a stride of 2, outputting a feature vector of size (x/2) × (x/2) × 64; max pooling with a 3 × 3 pooling kernel and a stride of 2 is then applied, outputting a feature vector of size (x/4) × (x/4) × 64.
The C2 layer sequentially comprises a block2, a block1, a block1 and a block2, and outputs a feature vector of spatial size (x/4) × (x/4).
The C3 layer sequentially comprises a block2, a block1, a block1 and a block1, and outputs a feature vector of spatial size (x/8) × (x/8).
The C4 layer sequentially comprises a block2, a block1, a block1, a block1 and a block1, and outputs a feature vector of spatial size (x/16) × (x/16).
The C5 layer sequentially comprises a block2, a block1 and a block1, and outputs a feature vector of spatial size (x/32) × (x/32).
The C6 layer sequentially comprises a block2, a block1 and a block1, and outputs a feature vector of spatial size (x/64) × (x/64).
(2) Next comes the feature pyramid network layer. After the five layers of feature vectors output in the previous step are passed through a 1 × 1 convolution to change the number of channels to 256, the addition-fused feature vectors of the n-th layer and the (n+1)-th layer are taken as the output of the Pn feature layer, while the highest layer P6 is output directly. In addition, to add more candidate detection frames, a maximum value pooling with stride 2 could be performed on P6 to obtain a P7 feature vector output, which is not adopted in this embodiment in order to save training space.
(3) Candidate detection frames are generated on the P2 to P6 feature maps, the candidate detection frames with three proportions and three length-width ratios are generated by taking each pixel point as the center of each feature map, and the length, the width and the center coordinates of the candidate detection frames are uniformly scaled to the interval of 0 to 1 according to the proportional relation of the sizes of the P2 to P6 feature maps.
(4) Next, at the RPN layer, through a series of convolution operations, the offset and confidence of the candidate detection boxes are generated, combined with the candidate detection boxes, ranked from high to low by the confidence, and the non-maximum suppression algorithm screens out 1500 final proposed detection boxes.
(5) The 1500 suggested detection boxes are then aligned on the corresponding original P2, P3, P4, P5 and P6 feature maps, that is, the feature layer to which each suggested detection box belongs is found according to the size of the box, and the corresponding region is intercepted on that feature map using an interpolation algorithm. Each feature layer contains the 7 × 7 × 256 outputs of its suggested detection boxes after alignment.
(6) The output of the proposed detection box alignment layer is re-input into the classifier layer. The classifier layer converts 7 × 7 feature vectors input by 7 × 7 convolution into 1 × 1 feature vectors without changing the number of channels, and converts the feature vectors into 81 classes of category score output and four coordinate position offsets of each class by using full connection; and performing softmax operation on the 81 class score output vectors, and outputting the result as class probability.
(7) In the testing stage, the category score information and the position offset information obtained by the classifier layer are input into the test layer, where the maximum value of the category probability is selected as the predicted target category of each detection frame of interest, and non-maximum suppression with a threshold of 0.7 is then used to filter redundant detection frames of interest. Finally, the final predicted detection frames and the corresponding predicted target classes are obtained in the test layer.
In the training stage, in the loss function calculation layer, the loss of the real information is calculated by utilizing the category score information and the position offset information obtained by the classifier layer, and the model parameters and the weight are modified through back propagation.
The method is specifically applied as follows:
step one, obtaining a picture containing a large number of small target objects, modifying the picture into a standard size, respectively performing up-sampling and down-sampling for one time, and forming an image pyramid with an original image to be used as input of an improved Mask RCNN model.
And step two, inputting the image pyramid into a residual backbone network of the improved Mask RCNN model, extracting the features of the three different-scale images in the image pyramid, and outputting three groups of feature vectors.
And step three, fusing the three groups of feature vectors through a feature pyramid network layer to construct a feature pyramid.
And step four, generating a candidate detection frame in each layer of the characteristic pyramid and sending the candidate detection frame into the RPN layer.
And step five, correspondingly generating confidence coefficient and position offset information for each candidate detection frame in the RPN layer, and screening a certain amount of effective suggested detection frames according to the confidence coefficient high-low ordering and non-maximum value inhibition.
And step six, correspondingly returning the screened suggestion detection frames to the feature graphs generated in the feature pyramid, aligning and intercepting the feature graphs, uniformly adjusting the feature graphs to be in a fixed size, and still separately placing the feature graphs according to different feature layers.
And step seven, inputting the aligned suggested detection frame into a classifier layer to obtain confidence degrees of all classes and position offset information of the suggested detection frame. Determining the category of the suggested detection frame according to the maximum category probability; then, considering the offset of the suggested detection frame as the offset corresponding to the maximum value of the category, and adjusting the position of the suggested detection frame; removing objects belonging to the background in each suggested detection frame, then taking a threshold value of 0.7 according to the confidence score of the maximum category of each suggested detection frame, then screening out a certain ROI (Region of interest) and performing non-maximum suppression.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A small target detection method based on multi-scale images and weighted fusion loss is characterized in that the method is realized based on an improved Mask RCNN model and comprises the following steps:
s1, building an improved Mask RCNN model; the improved Mask RCNN model comprises: the system comprises a residual backbone network, a characteristic pyramid network layer, an area generation network layer, an interested frame alignment layer, a classifier layer, a loss function calculation layer and a test layer;
s2, constructing an image pyramid: carrying out scaling processing on the original image, and forming an image pyramid by the original image, the image with the reduced size and the image with the enlarged size;
s3, randomly cutting the image in the image pyramid;
s4, sending the randomly cut images into a residual backbone network for convolution, batch normalization and pooling, and outputting a plurality of groups of feature maps with different sizes;
s5, fusing a plurality of groups of feature maps with different scales, and further processing to obtain feature maps P2-P6;
s6, generating candidate detection frames which are not screened for the feature maps P2-P6 respectively;
s7, inputting the feature maps P2-P6 into the area generation network layer, and obtaining the offsets and confidences of the candidate detection frames through a series of convolution operations;
s8, combining the offset of the candidate detection frame in the S7 and the data of the candidate detection frame which is not screened and obtained in the S6, and screening the candidate detection frame with a set amount as the interested detection frame;
s9, respectively corresponding the interested detection frames to the feature maps P2-P6, and carrying out alignment operation;
s10, inputting the result of the alignment operation into a classification layer, and outputting the predicted category score, the category probability and the coordinate offset of the interested detection frame;
s11, inputting the predicted category scores, category probabilities and coordinate offsets of the detection frames of interest into the test layer, where the maximum of the category probability is selected as the predicted target category of each detection frame of interest, redundant detection frames of interest are then filtered by non-maximum suppression, and the finally predicted detection frames of interest and their corresponding predicted target categories are obtained in the test layer.
2. The small object detection method according to claim 1, further comprising, in a training phase:
s12, inputting the classification scores of the interested detection boxes predicted in S10 into a loss function calculation layer, and taking the classification scores and the actual classification labels as the input of a cross entropy function to calculate a classification loss value so as to obtain the classification prediction loss of the feature maps P2-P6;
the coordinate offset of the interested detection frame predicted in the S10 and the offset of the real target frame are taken as the input of a regression loss function, so that the regression prediction loss of the characteristic map P2-P6 is obtained;
s13, weighting the category prediction losses of the feature map P2 and the feature map P3 respectively, and adding the category prediction losses of the feature map P4, the feature map P5 and the feature map P6 to obtain a total category prediction loss;
weighting the regression prediction losses of the feature map P2 and the feature map P3 respectively, and adding the weighted regression prediction losses of the feature map P4, the feature map P5 and the feature map P6 to obtain a total regression prediction loss;
and S14, iteratively updating parameters and weights of the improved Mask RCNN model through back propagation, specifically, respectively utilizing total class prediction loss and total regression prediction loss, and performing optimization iteration and updating the weight values of the improved Mask RCNN model.
3. The small object detection method according to claim 1, wherein the improvement of the improved Mask RCNN model comprises:
①, in the alignment of the detection frames of interest, the alignment is no longer performed uniformly; instead, different feature layers are aligned separately, and after alignment their outputs are not fused directly but are input into the classifier layer for classification and regression respectively and then into separate loss function calculation layers, where the loss function calculated by the small target feature layer is weighted and fused with the loss functions of the large and medium target layers;
②, adding an effective characteristic layer P6 in the original Mask RCNN model;
③, removing the image segmentation module in the original Mask RCNN and canceling the Mask branch.
4. The small object detection method according to claim 1, wherein the scaling of the original image in S2 includes:
the formula for scaling pictures is expressed as:
Image_New=Image*scale (1)
wherein: image _ New represents a zoomed picture, Image represents a picture before zooming, and scale represents a zooming scale;
the scale is determined by the following factors:
if the length of the minimum edge after the scaling is finished cannot be smaller than min _ dim, min () represents the minimum value operation, h represents the height of the original image, w represents the width of the original image, when min _ dim is larger than min (h, w),
scale=min_dim/min(h,w) (2)
otherwise scale is 1;
if the length of the longest edge after the scaling is finished is max _ dim, if the picture is scaled according to equation (2), and if the longest edge of the scaled picture exceeds max _ dim, the following steps are performed:
scale=max_dim/image_max (3)
otherwise, continuously scaling according to scale min _ dim/min (h, w);
the size of the final zoomed picture is max _ dim × max _ dim, and in addition, if the scale of the final zooming is larger than 1, the original picture is magnified by a bilinear interpolation method; for the part of the picture after the last scaling which is less than max _ dim, zero values are used to fill in the pixel values.
5. The small object detection method according to claim 1, wherein the formula for randomly cropping the picture in S3 is expressed as follows:
Y1=randi([0,image_size(1)-crop_size(1)]) (4)
X1=randi([0,image_size(2)-crop_size(2)]) (5)
wherein: y1 and X1 represent the lower left-hand ordinate and lower left-hand abscissa, respectively, at which cropping of the picture begins; randi represents random access, and the access range is the range inside the small brackets; image _ size is the size of the picture before clipping, the width of the first dimension storing the picture, and the length of the second dimension storing the picture; crop _ size is the size of the area to be cut, the width of the first-dimension storage area and the length of the second-dimension storage area;
Y2=min(image_size(1),Y1+crop_size(1)) (6)
X2=min(image_size(2),X1+crop_size(2)) (7)
wherein: y2 and X2 respectively represent the ordinate and abscissa of the upper right corner at the start of cropping; randi represents random access; min () represents taking the minimum value;
the specific position of the crop is determined by the two corner coordinates obtained from formulas (4)-(7), and if the cropping region overflows the original image, padding is applied to obtain the cropped image.
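The random crop of equations (4)-(7) can be sketched as follows; dimension ordering follows the equations (Y1 is drawn from the first dimension of image_size), a 3-channel image is assumed, and zero-valued padding is one possible reading of the padding mentioned above.

```python
import numpy as np

def random_crop(image, crop_size):
    """crop_size = (size along dim 0, size along dim 1); assumes crop_size <= image size."""
    image_size = image.shape[:2]
    y1 = np.random.randint(0, image_size[0] - crop_size[0] + 1)   # equation (4)
    x1 = np.random.randint(0, image_size[1] - crop_size[1] + 1)   # equation (5)
    y2 = min(image_size[0], y1 + crop_size[0])                    # equation (6)
    x2 = min(image_size[1], x1 + crop_size[1])                    # equation (7)
    crop = image[y1:y2, x1:x2]
    pad_d0 = crop_size[0] - crop.shape[0]                         # pad if the region falls short
    pad_d1 = crop_size[1] - crop.shape[1]
    if pad_d0 > 0 or pad_d1 > 0:
        crop = np.pad(crop, ((0, pad_d0), (0, pad_d1), (0, 0)), mode="constant")
    return crop
```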
6. The small object detection method according to claim 1, wherein the convolution of the residual backbone network comprises two types of convolution modules, block1 and block2, wherein:
the convolution module block1 workflow comprises:
①, for branch 1, the output remains identical to the input;
②, for branch 2, convolution operations are performed sequentially with a 1 × 1 convolution kernel, a 3 × 3 convolution kernel and a 1 × 1 convolution kernel, and mean value normalization is applied to the output feature vector after each convolution;
the convolution module block2 workflow comprises:
①, for branch 1, a convolution operation is performed with a 1 × 1 convolution kernel, and mean value normalization is then applied to the output feature vector;
②, for branch 2, convolution operations are performed sequentially with a 1 × 1 convolution kernel, a 3 × 3 convolution kernel and a 1 × 1 convolution kernel, and mean value normalization is applied to the output feature vector after each convolution.
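Read as standard residual modules, block1 keeps an identity shortcut while block2 projects the shortcut with a 1 × 1 convolution. The sketch below uses PyTorch; the channel counts, the ReLU activations and the use of BatchNorm2d for the "mean value normalization" are assumptions added for a runnable illustration and are not stated in the claim.

```python
import torch.nn as nn

class Block1(nn.Module):
    """block1: branch 1 is the identity; branch 2 is 1x1 -> 3x3 -> 1x1, each followed by normalization."""
    def __init__(self, channels, mid_channels):
        super().__init__()
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.branch2(x)   # branch 1 output stays identical to the input

class Block2(nn.Module):
    """block2: branch 1 is a 1x1 projection with normalization; branch 2 as in block1."""
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 1, stride=stride), nn.BatchNorm2d(out_channels))
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, stride=stride), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1), nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        return self.branch1(x) + self.branch2(x)
```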
7. The small object detection method according to claim 1, wherein outputting a plurality of sets of feature maps of different sizes in S4 includes:
for the original input, a five-layer feature map output is constructed: C2, C3, C4, C5, C6; for the input reduced to half of the original size, a five-layer feature map output is constructed: C2s, C3s, C4s, C5s, C6s; for the input enlarged to twice the original size, a five-layer feature map output is constructed: C2l, C3l, C4l, C5l, C6l.
8. The small object detection method according to claim 7, wherein step S5 includes:
S51, performing interpolation-based upsampling on C2s-C6s, enlarging C2s-C6s to twice their size;
S52, performing maximum pooling on C2l-C6l, reducing C2l-C6l to half their size;
S53, adding C2-C6, the enlarged C2s-C6s and the reduced C2l-C6l to obtain C2-C6 fused with image features of different scales;
S54, further processing the C2-C6 fused with image features of different scales to obtain the feature maps P2-P6.
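A minimal sketch of S51-S53 under the reading above (C2s-C6s from the half-size input, C2l-C6l from the double-size input): after 2× upsampling and 2× max pooling all three pyramids share the same spatial size and can be added element-wise. Bilinear interpolation and the variable names are assumptions.

```python
import torch.nn.functional as F

def fuse_multiscale(C, Cs, Cl):
    """C, Cs, Cl: lists of five feature maps [C2..C6], [C2s..C6s], [C2l..C6l]."""
    fused = []
    for c, cs, cl in zip(C, Cs, Cl):
        cs_up = F.interpolate(cs, scale_factor=2, mode="bilinear", align_corners=False)  # S51
        cl_down = F.max_pool2d(cl, kernel_size=2, stride=2)                              # S52
        fused.append(c + cs_up + cl_down)                                                # S53
    return fused   # C2-C6 fused with image features of different scales
```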
9. The small object detection method according to claim 8, wherein S54 includes:
using 256 convolution kernels of size 1 × 1 to convolve the C6 fused with image features of different scales, obtaining a feature map P6 with an output of 16 × 16 × 256;
the C5 fused with image features of different scales is convolved with 256 convolution kernels of size 1 × 1, the result is added to the output obtained by upsampling P6 by a factor of two, and a 3 × 3 convolution is then applied to obtain a feature map P5 with an output of 32 × 32 × 256;
the C4 fused with image features of different scales is convolved with 256 convolution kernels of size 1 × 1, the result is added to the output obtained by upsampling P5 by a factor of two, and a 3 × 3 convolution is then applied to obtain a feature map P4 with an output of 64 × 64 × 256;
the C3 fused with image features of different scales is convolved with 256 convolution kernels of size 1 × 1, the result is added to the output obtained by upsampling P4 by a factor of two, and a 3 × 3 convolution is then applied to obtain a feature map P3 with an output of 128 × 128 × 256;
the C2 fused with image features of different scales is convolved with 256 convolution kernels of size 1 × 1, the result is added to the output obtained by upsampling P3 by a factor of two, and a 3 × 3 convolution is then applied to obtain a feature map P2 with an output of 256 × 256 × 256.
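The top-down construction of claim 9 follows a standard feature-pyramid pathway: a 1 × 1 lateral convolution per level, 2× upsampling of the higher level, element-wise addition, and a 3 × 3 smoothing convolution, with P6 receiving only the lateral convolution. The sketch below assumes hypothetical channel counts for the fused C2-C6 maps and nearest-neighbour upsampling.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024, 2048, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels[:-1]])   # 3x3 convs for P2-P5 only

    def forward(self, feats):                       # feats = [C2, C3, C4, C5, C6] after fusion
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        outputs = [laterals[-1]]                    # P6: 1x1 convolution of C6
        for i in range(len(laterals) - 2, -1, -1):  # build P5, P4, P3, P2 top-down
            up = F.interpolate(outputs[0], scale_factor=2, mode="nearest")
            outputs.insert(0, self.smooth[i](laterals[i] + up))
        return outputs                              # [P2, P3, P4, P5, P6]
```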
10. The small object detection method according to claim 1, wherein the height h and width w of the candidate detection frame are:
[The formulas for h and w are given as images in the original publication (FDA0002396690380000051 and FDA0002396690380000052) and are not reproduced here.]
wherein: scale_length denotes the side length, in pixels of the original image, of a candidate detection frame whose height equals its width.
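The formulas themselves are only available as images in the source; purely as an illustration, the widely used Mask RCNN convention derives the anchor height and width from a square of side scale_length and an aspect ratio, which may or may not coincide with the patented formula.

```python
import math

def anchor_hw(scale_length, ratio=1.0):
    # common convention (assumption): stretch a square anchor of side scale_length by sqrt(ratio)
    h = scale_length / math.sqrt(ratio)
    w = scale_length * math.sqrt(ratio)
    return h, w   # for ratio == 1.0 the height equals the width equals scale_length
```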
CN202010134062.7A 2020-03-02 2020-03-02 Small target detection method based on multi-scale image and weighted fusion loss Active CN111461110B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010134062.7A CN111461110B (en) 2020-03-02 2020-03-02 Small target detection method based on multi-scale image and weighted fusion loss


Publications (2)

Publication Number Publication Date
CN111461110A true CN111461110A (en) 2020-07-28
CN111461110B CN111461110B (en) 2023-04-28

Family

ID=71682457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010134062.7A Active CN111461110B (en) 2020-03-02 2020-03-02 Small target detection method based on multi-scale image and weighted fusion loss

Country Status (1)

Country Link
CN (1) CN111461110B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
CN110348445A (en) * 2019-06-06 2019-10-18 华中科技大学 A kind of example dividing method merging empty convolution sum marginal information
CN110738642A (en) * 2019-10-08 2020-01-31 福建船政交通职业学院 Mask R-CNN-based reinforced concrete crack identification and measurement method and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIAXIANG LUO ET AL.: "A Fast Circle Detection Method Based on a Tri-Class Thresholding for High Detail FPC Images", 《IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT》 *

Cited By (64)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11487288B2 (en) 2017-03-23 2022-11-01 Tesla, Inc. Data synthesis for autonomous control systems
US11893393B2 (en) 2017-07-24 2024-02-06 Tesla, Inc. Computational array microprocessor system with hardware arbiter managing memory requests
US11409692B2 (en) 2017-07-24 2022-08-09 Tesla, Inc. Vector computational unit
US11403069B2 (en) 2017-07-24 2022-08-02 Tesla, Inc. Accelerated mathematical engine
US11561791B2 (en) 2018-02-01 2023-01-24 Tesla, Inc. Vector computational unit receiving data elements in parallel from a last row of a computational array
US11797304B2 (en) 2018-02-01 2023-10-24 Tesla, Inc. Instruction set architecture for a vector computational unit
US11734562B2 (en) 2018-06-20 2023-08-22 Tesla, Inc. Data pipeline and deep learning system for autonomous driving
US11841434B2 (en) 2018-07-20 2023-12-12 Tesla, Inc. Annotation cross-labeling for autonomous control systems
US11636333B2 (en) 2018-07-26 2023-04-25 Tesla, Inc. Optimizing neural network structures for embedded systems
US11562231B2 (en) 2018-09-03 2023-01-24 Tesla, Inc. Neural networks for embedded devices
US11893774B2 (en) 2018-10-11 2024-02-06 Tesla, Inc. Systems and methods for training machine models with augmented data
US11665108B2 (en) 2018-10-25 2023-05-30 Tesla, Inc. QoS manager for system on a chip communications
US11816585B2 (en) 2018-12-03 2023-11-14 Tesla, Inc. Machine learning models operating at different frequencies for autonomous vehicles
US11908171B2 (en) 2018-12-04 2024-02-20 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11537811B2 (en) 2018-12-04 2022-12-27 Tesla, Inc. Enhanced object detection for autonomous vehicles based on field view
US11610117B2 (en) 2018-12-27 2023-03-21 Tesla, Inc. System and method for adapting a neural network model on a hardware platform
US11748620B2 (en) 2019-02-01 2023-09-05 Tesla, Inc. Generating ground truth for machine learning from time series elements
US11567514B2 (en) 2019-02-11 2023-01-31 Tesla, Inc. Autonomous and user controlled vehicle summon to a target
US11790664B2 (en) 2019-02-19 2023-10-17 Tesla, Inc. Estimating object properties using visual image data
CN112016467A (en) * 2020-08-28 2020-12-01 展讯通信(上海)有限公司 Traffic sign recognition model training method, recognition method, system, device and medium
CN112016467B (en) * 2020-08-28 2022-09-20 展讯通信(上海)有限公司 Traffic sign recognition model training method, recognition method, system, device and medium
CN112052787A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN112132206A (en) * 2020-09-18 2020-12-25 青岛商汤科技有限公司 Image recognition method, training method of related model, related device and equipment
CN112215179A (en) * 2020-10-19 2021-01-12 平安国际智慧城市科技股份有限公司 In-vehicle face recognition method, device, apparatus and storage medium
CN112215179B (en) * 2020-10-19 2024-04-19 平安国际智慧城市科技股份有限公司 In-vehicle face recognition method, device, apparatus and storage medium
CN112307976A (en) * 2020-10-30 2021-02-02 北京百度网讯科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN112257809A (en) * 2020-11-02 2021-01-22 浙江大华技术股份有限公司 Target detection network optimization method and device, storage medium and electronic equipment
CN112508863B (en) * 2020-11-20 2023-07-18 华南理工大学 Target detection method based on RGB image and MSR image double channels
CN112508863A (en) * 2020-11-20 2021-03-16 华南理工大学 Target detection method based on RGB image and MSR image dual channels
CN112381030B (en) * 2020-11-24 2023-06-20 东方红卫星移动通信有限公司 Satellite optical remote sensing image target detection method based on feature fusion
CN112381030A (en) * 2020-11-24 2021-02-19 东方红卫星移动通信有限公司 Satellite optical remote sensing image target detection method based on feature fusion
CN112418108A (en) * 2020-11-25 2021-02-26 西北工业大学深圳研究院 Remote sensing image multi-class target detection method based on sample reweighing
CN112419310B (en) * 2020-12-08 2023-07-07 中国电子科技集团公司第二十研究所 Target detection method based on cross fusion frame optimization
CN112419310A (en) * 2020-12-08 2021-02-26 中国电子科技集团公司第二十研究所 Target detection method based on intersection and fusion frame optimization
CN112841154A (en) * 2020-12-29 2021-05-28 长沙湘丰智能装备股份有限公司 Disease and pest control system based on artificial intelligence
CN112634313B (en) * 2021-01-08 2021-10-29 云从科技集团股份有限公司 Target occlusion assessment method, system, medium and device
CN112634313A (en) * 2021-01-08 2021-04-09 云从科技集团股份有限公司 Target occlusion assessment method, system, medium and device
CN112949520A (en) * 2021-03-10 2021-06-11 华东师范大学 Aerial photography vehicle detection method and detection system based on multi-scale small samples
CN112949520B (en) * 2021-03-10 2022-07-26 华东师范大学 Aerial photography vehicle detection method and detection system based on multi-scale small samples
CN112950703A (en) * 2021-03-11 2021-06-11 江苏禹空间科技有限公司 Small target detection method and device, storage medium and equipment
CN112950703B (en) * 2021-03-11 2024-01-19 无锡禹空间智能科技有限公司 Small target detection method, device, storage medium and equipment
CN113160141A (en) * 2021-03-24 2021-07-23 华南理工大学 Steel sheet surface defect detecting system
CN113326734B (en) * 2021-04-28 2023-11-24 南京大学 Rotational target detection method based on YOLOv5
WO2022227770A1 (en) * 2021-04-28 2022-11-03 北京百度网讯科技有限公司 Method for training target object detection model, target object detection method, and device
CN113326734A (en) * 2021-04-28 2021-08-31 南京大学 Rotary target detection method based on YOLOv5
JP2023527615A (en) * 2021-04-28 2023-06-30 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Target object detection model training method, target object detection method, device, electronic device, storage medium and computer program
CN113538331A (en) * 2021-05-13 2021-10-22 中国地质大学(武汉) Metal surface damage target detection and identification method, device, equipment and storage medium
CN113408429B (en) * 2021-06-22 2023-06-09 深圳市华汉伟业科技有限公司 Target detection method and system with rotation adaptability
CN113408429A (en) * 2021-06-22 2021-09-17 深圳市华汉伟业科技有限公司 Target detection method and system with rotation adaptability
CN114067110A (en) * 2021-07-13 2022-02-18 广东国地规划科技股份有限公司 Method for generating instance segmentation network model
CN113469100A (en) * 2021-07-13 2021-10-01 北京航科威视光电信息技术有限公司 Method, device, equipment and medium for detecting target under complex background
CN113657174A (en) * 2021-07-21 2021-11-16 北京中科慧眼科技有限公司 Vehicle pseudo-3D information detection method and device and automatic driving system
CN113657214B (en) * 2021-07-30 2024-04-02 哈尔滨工业大学 Building damage assessment method based on Mask RCNN
CN113657214A (en) * 2021-07-30 2021-11-16 哈尔滨工业大学 Mask RCNN-based building damage assessment method
CN113705387B (en) * 2021-08-13 2023-11-17 国网江苏省电力有限公司电力科学研究院 Interference object detection and tracking method for removing overhead line foreign matters by laser
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113628250A (en) * 2021-08-27 2021-11-09 北京澎思科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN113743521B (en) * 2021-09-10 2023-06-27 中国科学院软件研究所 Target detection method based on multi-scale context awareness
CN113743521A (en) * 2021-09-10 2021-12-03 中国科学院软件研究所 Target detection method based on multi-scale context sensing
CN113870254A (en) * 2021-11-30 2021-12-31 中国科学院自动化研究所 Target object detection method and device, electronic equipment and storage medium
CN113963274A (en) * 2021-12-22 2022-01-21 中国人民解放军96901部队 Satellite image target intelligent identification system and method based on improved SSD algorithm
CN113963274B (en) * 2021-12-22 2022-03-04 中国人民解放军96901部队 Satellite image target intelligent identification system and method based on improved SSD algorithm
CN116645523A (en) * 2023-07-24 2023-08-25 济南大学 Rapid target detection method based on improved RetinaNet
CN116645523B (en) * 2023-07-24 2023-12-01 江西蓝瑞存储科技有限公司 Rapid target detection method based on improved RetinaNet


Similar Documents

Publication Publication Date Title
CN111461110A (en) Small target detection method based on multi-scale image and weighted fusion loss
CN111639692B (en) Shadow detection method based on attention mechanism
EP3540637B1 (en) Neural network model training method, device and storage medium for image processing
WO2020221013A1 (en) Image processing method and apparaus, and electronic device and storage medium
CN110570371B (en) Image defogging method based on multi-scale residual error learning
CN111126472A (en) Improved target detection method based on SSD
JP2022173399A (en) Image processing apparatus, and image processing method
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN111898406B (en) Face detection method based on focus loss and multitask cascade
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN111898668A (en) Small target object detection method based on deep learning
CN111027382B (en) Attention mechanism-based lightweight face detection method and model
CN111401293B (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN113762409A (en) Unmanned aerial vehicle target detection method based on event camera
CN111860683B (en) Target detection method based on feature fusion
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN112927209B (en) CNN-based significance detection system and method
CN110599455A (en) Display screen defect detection network model, method and device, electronic equipment and storage medium
CN113313810A (en) 6D attitude parameter calculation method for transparent object
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN113393434A (en) RGB-D significance detection method based on asymmetric double-current network architecture
CN112613442A (en) Video sequence emotion recognition method based on principle angle detection and optical flow conversion
CN116758340A (en) Small target detection method based on super-resolution feature pyramid and attention mechanism
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant