CN110503112B

CN110503112B - Small target detection and identification method for enhancing feature learning

Info

Publication number: CN110503112B
Application number: CN201910794606.XA
Authority: CN
Inventors: 程建; 林莉; 李�灿; 周晓晔; 李月男
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-08-27
Filing date: 2019-08-27
Publication date: 2023-02-03
Anticipated expiration: 2039-08-27
Also published as: CN110503112A

Abstract

The invention discloses a small target detection and identification method for enhancing feature learning, belongs to the field of image processing, pattern identification and computer vision, and solves the problems of low detection precision and low network efficiency of small target detection and identification tasks in the prior art. The method comprises the steps of sequentially constructing a basic network module, a feature extraction module, a candidate frame generation module and a prediction output module as a small target detection and identification network; based on the extracted small target sample image data, preprocessing the extracted small target sample image data; inputting the obtained preprocessed small target sample image data into a small target detection and recognition network with initialized parameters for training to obtain a trained small target detection and recognition network; and inputting the small target image to be predicted into the trained small target detection and identification network, and outputting the position and the class information of the prediction frame of the small target end to end through forward propagation. The method is used for small target detection and identification.

Description

Small target detection and identification method for enhancing feature learning

Technical Field

A small target detection and identification method for enhancing feature learning is used for small target detection and identification and belongs to the field of image processing, pattern identification and computer vision.

Background

The task of target detection and identification is still one of the popular research directions in the field of computer vision so far, and due to the wide engineering application, the task is rapidly developed and innovated in the field of academic research. In fact, the target detection and recognition task plays an important role in life, for example, the safety inspection application in important public transportation places such as airports, railway stations and the like based on face recognition under the target detection and recognition task; the vehicle license plate detection and recognition based on the target detection and recognition task has important practical significance for traffic regulation and driving safety detection.

The target detection and identification task is different from the ordinary classification task in that the traditional classification task only needs to output a single class result, namely the probability that the input picture belongs to a certain class. Therefore, when there are multiple objects of interest in a picture to be examined, a simple classification task is not sufficient to meet this requirement. In contrast, the object detection and recognition task can locate the position of the object of interest through the detection network and make a judgment on the category of the object of interest. For the small target detection and identification, which is a subtask of target detection and identification, the small target detection and identification task is not greatly improved in recent years because the learning capability of the neural network for the small target features is insufficient.

Most of the traditional target detection and identification methods are based on an anchor mechanism method, namely, a plurality of prediction frames are designed on a feature map, the prediction frames are compared with real frames, one prediction frame closest to the real frame is selected as a prediction frame of a network by utilizing a certain pre-designed evaluation standard, and the category of the prediction frame is predicted. With the development of deep learning, the target detection and identification task is gradually developed and innovated in the aspect of improving performance, and the current detection and identification task is mainly divided into two different solutions: 1) A target detection and identification method of a two-stage task; 2) Provided is a target detection and identification method of a single-stage task. Specifically, if the target detection and recognition task is decomposed into two independent subtasks: the detection and classification process is called the two-stage task target detection and identification. Similarly, if the target detection and identification task is realized on an end-to-end basis, the method is called a single-stage task target detection and identification method. Whether the method is a single-stage method or a two-stage method, the detection precision of the small target object is low (note: the small target object refers to a target with a pixel point area smaller than 32x32 in an image). The reason is mainly due to the following two aspects: 1) The characteristics learned by the neural network have insufficient characteristic capability on the characteristics of a small target object; 2) The traditional target detection algorithm based on the anchor mechanism is to calculate the IOU (namely the size of the overlapping area) between a prediction frame and a real frame and then reject a smaller area to be selected of the IOU by using a preset threshold. However, for small target detection, the area of the small target mapped on the original image of any one feature layer is generally small, and by using the judgment criterion, a large probability of missing detection phenomenon occurs for a small target detection task, and the network efficiency is low (i.e. the detection speed is slow).

Disclosure of Invention

In view of the above research problems, an object of the present invention is to provide a small target detection and identification method for enhancing feature learning, which solves the problems of low detection precision and low network efficiency (i.e., low detection speed) of small target detection and identification tasks in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme:

a small target detection and identification method for enhancing feature learning comprises the following steps:

s1, sequentially constructing a basic network module for extracting the features of a small target and outputting a preliminary feature map, a feature extraction module formed by two hourglass stacks for further extracting the features and outputting the feature map on the basis of the preliminary feature map, a candidate frame generation module for generating a candidate frame on the basis of the feature map, and a prediction output module for performing prediction frame coordinate regression and prediction frame class classification on the basis of the candidate frame, wherein the prediction output module is used as a small target detection and identification network, namely a deep neural network, and randomly initializing the parameters of the small target detection and identification network after construction;

s2, extracting small target sample image data based on the COCO data set, namely extracting small target sample image data with pixel point area of less than 32x32, and preprocessing the extracted small target sample image data; inputting the obtained preprocessed small target sample image data into a small target detection and recognition network with initialized parameters for training to obtain a trained small target detection and recognition network;

and S3, inputting the small target image to be predicted into the trained small target detection and identification network, and outputting the position and the class information of the prediction frame of the small target end to end through forward propagation.

Further, the basic network module in the step S1 is improved ResNet-101 or improved VGG16.

Further, the improved ResNet-101 sequentially includes an input layer, a first group of convolutional layers, a maximum pooling layer, a second group of convolutional layers, a third group of convolutional layers, a fourth group of convolutional layers, and a fifth group of convolutional layers, wherein the input size of the input layer is 513x513 image data; the first group of convolutional layers sequentially comprises two parts of 1 7x7 convolutional operation and 1 nonlinear activation function operation, the second group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, the third group of convolutional layers sequentially comprises 12 convolutional layers, a nonlinear activation layer and an average pooling layer, the fourth group of convolutional layers sequentially comprises 69 convolutional layers, a nonlinear activation layer and an average pooling layer, and the fifth group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, wherein each convolutional layer in the second group of convolutional layers to the fifth group of convolutional layers sequentially passes through 1x1 convolution, 13x 3 convolutional layers and 1x1 convolutional operation.

Further, the modified VGG16 sequentially includes a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, and a fifth convolutional layer, where the first convolutional layer and the second convolutional layer sequentially include convolution operations in which 2 convolution kernels are 3x3 and nonlinear activation function operations, and the third convolutional layer, the fourth convolutional layer, and the fifth convolutional layer sequentially include convolution operations in which 3 convolution kernels are 3x3 and nonlinear activation function operations.

Further, the feature extraction module in the step S1 is immediately connected to the basic network module, a single hourglass stack in the feature extraction module is composed of 3-order sampling units and is in an hourglass shape, and each-order sampling unit comprises a convolution module and an identity mapping module; the convolution module sequentially comprises 1 down-sampling layer, 3 convolution layers and 1 up-sampling layer, wherein the second convolution layer in the convolution module in the second-order sampling unit is the convolution module in the first-order sampling unit, the second convolution layer in the convolution module in the third-order sampling unit is the convolution module in the second-order sampling unit, each convolution layer is a convolution layer with the size of 3x3, and the down-sampling ratio in the down-sampling layer is

Wherein d represents the d-th branch in the convolution module, and the upsampling in the upsampling layer adopts a bilinear interpolation method; the identity mapping module is used for carrying out jump connection on the input of a down sampling layer of the convolution module and the output of an upper sampling layer, and is used for learning the detail information of the shallow feature in the deep network.

Further, the candidate frame generation module generates 9 candidate frames with different sizes at each pixel point position of the feature map output by the feature extraction module based on an anchor generation mechanism, and each candidate frame is mapped onto the original image and corresponds to one candidate frame.

Further, in the step S1, the prediction output module follows the candidate frame generation module, where performing prediction frame coordinate regression is to regress a prediction frame coordinate position through a Smooth L1 loss function, the prediction frame category classification is to obtain category information of a corresponding prediction frame through a softmax loss function after regressing a numerical value through wx + b based on a feature map corresponding to the prediction frame, and x is a pixel point value in the feature map, where the prediction frame is a candidate frame; the method comprises the following specific steps:

the coordinates of the central point of the prediction frame are x and y, the width and the height are w and h respectively, if the position of the central point of any prediction frame and the height and width information are x _a ，y _a ，w _a ，h _a The position of the center point of the real frame is x _t ，y _t ，w _t ，h _t And the central point position of the prediction frame is as follows: x, y, w, h, and g = (g) the offset between the real frame and the candidate frame _x ，g _y ，g _w ，g _h ) The concrete solving formula is:

the actual offset obtained is l = (l) _x ，l _y ，l _w ，l _h ) The concrete solving formula is:

and (3) regressing the coordinate position of the prediction frame by using a Smooth L1 loss function, wherein the corresponding solving formula is as follows:

in the formula, i represents all positive sample sets, i.e. any one of the prediction box sets;

wherein the Smooth L1 loss function is:

after a numerical value is regressed through wx + b based on the feature graph corresponding to the prediction frame, the category information of the corresponding prediction frame is obtained through a softmax loss function, wherein the softmax loss function is as follows:

where c is a predicted box label, i.e., a predicted class, k is a true box label, i.e., a true box class of the small-target sample image data, L _cls The classification loss function alpha is a hyperparameter which can be automatically adjusted in an experiment, and the classification loss function comprises plants, televisions, ships and chairs.

Further, in step S1, the randomly initializing the parameters of the small target detection and identification network refers to pre-training the small target detection and identification network by using larger public data to obtain a set of initialized parameters, where the larger public data is lmageNet.

Further, when the small target detection and recognition network training is performed in step S2, a central point discrimination module block is added, which is used for obtaining a candidate frame predicted by the prediction output module based on the candidate frame generated by the candidate frame generation module, the k-nearest neighbor method and the non-maximum suppression method, and specifically includes the following steps: according to the central point position of the real frame of the small target sample image data, determining candidate frames corresponding to k nearest central point positions around the small target sample image data by using a k nearest neighbor method, taking the candidate frames as preliminarily selected candidate frames, processing the candidate frames by using the k nearest neighbor method, and obtaining the optimal candidate frame by using a non-maximum suppression method.

Further, in the step S2, the preprocessing of the extracted small target sample image data refers to performing plus/minus 90-degree rotation, random clipping or scale scaling on the small target sample image data;

the specific implementation process of obtaining the trained small target detection and identification network is as follows: inputting the small target sample image data obtained by preprocessing into a small target detection and identification network with initialized parameters, carrying out forward propagation to obtain regression and classification results, solving the loss of the regression and classification results according to the small target sample image data, updating network parameters of the small target detection and identification network by utilizing the loss reverse propagation, and obtaining the trained small target detection and identification network after an iteration condition is met.

Compared with the prior art, the invention has the beneficial effects that:

1. for the small target detection network based on the feature extraction module formed by two hourglass stacks, semantic information and detail information can be effectively fused together in the feature extraction module in a pyramid fusion-like mode, so that the feature expression capacity of the small target is enhanced, the problem of low detection precision of the small target is solved, and a plurality of redundant candidate frames can be removed according to the central point discrimination module, so that the frame calculation in the subsequent steps is reduced; it can be prevented that small candidate blocks are lost as negative sample pairs in that part of the NMS.

2. The method introduces a mode of judging according to the geometric distance of the central point of the real frame, and efficiently selects the positive sample frame (prediction frame) with the most probable occurrence of the target, thereby reducing the burden of the redundant frame on network calculation.

3. The invention focuses more on the frames around the real target, thereby reducing the calculation of the frames far away from the real position, and improving the detection speed.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a network architecture diagram of a feature base module and a prediction output module formed by two hourglass stacks in accordance with the present invention;

FIG. 3 is a schematic diagram of a feature extraction module of the present invention;

fig. 4 is a schematic diagram of a center point discrimination prediction frame obtained according to the present invention.

FIG. 5 is a graph comparing the effect of a feature basis module formed using SSD, DSSD, 1 hourglass stack and 2 hourglass stacks on a basis of the underlying network in accordance with the present invention; wherein, one-hour glass represents a characteristic basic module formed by only containing 1 hourglass stack; two-hourglass represents a feature basic module formed by 2 hourglass stacks, AP represents average precision, subscripts S, M and L represent small-scale, medium-scale and large-scale targets respectively, SSD represents that a feature fusion module is adopted to predict shallow and deep features, and DSSD represents that Two feature fusion modules are adopted to perform fusion on the shallow and deep features in a deconvolution mode to predict.

Fig. 6 is a schematic diagram showing the detection results of the SSD and the DSSD according to the present invention and the prior art in an embodiment of the present invention, wherein (a) shows a SSD network experiment result, b) shows a DSSD network experiment result, and (c) shows an experiment result of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific embodiments.

In order to solve the problems, the invention adopts two stacked hourglass to form a feature extraction module, enhances the semantic information of shallow features, simultaneously fuses the shallow features and deep features, and enhances the detailed features of the deep features, thereby enhancing the feature description of the network on small target objects and improving the precision of small target detection. On the other hand, the evaluation criterion of the original 1OU is improved, and a prediction frame closest to the real frame is screened by using a central point centralized evaluation mode. And finally, respectively regressing the position information and the category information of the prediction frame by using a prediction discrimination module.

As shown in fig. 1, a small target detection and identification method for enhancing feature learning includes the following steps:

the basic network module in the step S1 is improved ResNet-101 or improved VGG16.

As shown in fig. 2, the modified ResNet-101 sequentially includes an input layer, a first set of convolutional layers, a maximum pooling layer, a second set of convolutional layers, a third set of convolutional layers, a fourth set of convolutional layers, and a fifth set of convolutional layers, wherein the input size of the input layer is 513x513 image data; the first group of convolutional layers sequentially comprises two parts of 1 7x7 convolution operation and 1 nonlinear activation function operation, the second group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, the third group of convolutional layers sequentially comprises 12 convolutional layers, a nonlinear activation layer and an average pooling layer, the fourth group of convolutional layers sequentially comprises 69 convolutional layers, a nonlinear activation layer and an average pooling layer, and the fifth group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, wherein each convolutional layer in the second group of convolutional layers to the fifth group of convolutional layers sequentially passes through 1x1 convolution, 1x 3 convolutional layer and 1x1 convolution operation.

The improved VGG16 comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer in sequence, wherein the first convolution layer and the second convolution layer sequentially comprise 2 convolution operations with convolution kernels of 3x3 and nonlinear activation function operations, and the third convolution layer, the fourth convolution layer and the fifth convolution layer sequentially comprise 3 convolution operations with convolution kernels of 3x3 and nonlinear activation function operations.

As shown in fig. 3, the feature extraction module is next to the basic network module, a single hourglass stack in the feature extraction module is composed of 3-order sampling units, and is in an hourglass shape, and each-order sampling unit comprises a convolution module and an identity mapping module; the convolution module sequentially comprises 1 down-sampling layer, 3 convolution layers and 1 up-sampling layer, a second convolution layer in the convolution module in the second-order sampling unit is a convolution module in the first-order sampling unit, a second convolution layer in the convolution module in the third-order sampling unit is a convolution module in the second-order sampling unit, and each convolution layer is a convolution layer of 3x3 and a down-sampling layer in the down-sampling layerSample ratio of

The candidate frame generation module generates 9 candidate frames with different sizes at each pixel point position of the feature image output by the feature extraction module based on an anchor generation mechanism, and each candidate frame is mapped to the original image and corresponds to one candidate frame.

The prediction output module is next to the candidate frame generation module, wherein the prediction frame coordinate regression is performed by regressing the coordinate position of the prediction frame through a Smooth L1 loss function, the classification of the category of the prediction frame refers to that the category information of the corresponding prediction frame is obtained through a softmax loss function after a numerical value is regressed through wx + b based on the feature map corresponding to the prediction frame, x refers to the pixel point value in the feature map, and the prediction frame refers to the candidate frame; the method comprises the following specific steps:

the coordinates of the central point of the prediction frame are x and y, the width and the height are w and h respectively, if the position of the central point of any prediction frame and the height and width information are x _a ，y _a ，w _a ，h _a The position of the center point of the real frame is x _t ，y _t ，w _t ，h _t The central point of the prediction frame is as follows: x, y, w, h, and g = (g) the offset between the real frame and the candidate frame _x ，g _y ，g _w ，g _h ) The specific solving formula is:

wherein the Smooth L1 loss function is:

wherein c is a prediction label, i.e. a prediction class, k is a true label, i.e. a true class, alpha is a hyper-parameter, which can be automatically adjusted in an experiment, L _cls To classify the loss function, the categories include plants, televisions, boats, chairs, etc.

Randomly initializing parameters of the small target detection and identification network refers to pre-training the small target detection and identification network by using larger public data to obtain a group of initialized parameters, wherein the larger public data is lmageNet.

S2, extracting small target sample image data based on the COCO data set, wherein the COCO data set accounts for 41% of the small target sample image data, namely extracting small target sample image data with pixel point area of less than 32x32, and preprocessing the extracted small target sample image data; inputting the obtained preprocessed small target sample image data into a small target detection and recognition network with initialized parameters for training to obtain a trained small target detection and recognition network;

as shown in fig. 4, when performing small target detection and recognition network training, a central point discriminating module block is added for predicting the candidate frame obtained by the prediction output module based on the candidate frame generated by the candidate frame generating module, the k-nearest neighbor method and the non-maximum suppression method, and the specific steps are as follows: according to the central point position of the real frame of the small target sample image data, determining candidate frames corresponding to k nearest central point positions around the small target sample image data by using a k nearest neighbor method, taking the candidate frames as preliminarily selected candidate frames, processing the candidate frames by using the k nearest neighbor method, and obtaining the best candidate frame by using a non-maximum suppression method. The central point discrimination module can eliminate a plurality of redundant candidate frames so as to reduce the calculation of the frames in the subsequent steps; it can be prevented that small candidate blocks are lost as negative sample pairs in that part of the NMS.

The preprocessing of the extracted small target sample image data refers to the rotation, random cutting or scale scaling operation of plus or minus 90 degrees on the small target sample image data;

the specific implementation process of obtaining the trained small target detection and identification network is as follows: inputting the small target sample image data obtained by preprocessing into a small target detection and identification network with initialized parameters, carrying out forward propagation to obtain regression and classification results, solving the loss of the regression and classification results according to the small target sample image data, updating network parameters of the small target detection and identification network by utilizing the loss backward propagation, and obtaining the trained small target detection and identification network after the iteration condition is reached.

And S3, inputting the small target image to be predicted into the trained small target detection and identification network, and outputting the position and the class information of the prediction frame of the small target end to end through forward propagation. The trained small target detection and recognition network can detect images without real frames, and the method specifically comprises the following steps:

and (3) a testing stage:

1) In the training stage, the weight parameters of the small target detection and identification network are obtained, namely the trained small target detection and identification network is obtained, and the predicted small target image is input;

2) Basic features (such as edge information, color, shape and other information) of the small target image are learned through a basic network module;

3) Learning characteristic diagram information of different scales through a characteristic extraction module to obtain a multi-scale characteristic diagram;

4) After obtaining the multi-scale feature images, generating 9 anchors at each pixel point position of any one feature image, wherein each anchor is mapped to an original image and corresponds to a candidate frame;

5) Performing regression calculation on the coordinates of the candidate box by using the trained weights w and b (wx + b);

6) Obtaining a score value (namely a probability value belonging to any one category) through a softmax function after regression, and using the probability value to perform NMS;

7) And selecting the best box as a final predicted value.

Examples

Extracting small target sample image data from the COCO data set as a test set, and inputting the small target images in the test set into the SSD, the DSSD and the method of the invention respectively for detection to obtain results shown in FIGS. 5 and 6, wherein FIG. 6 shows the detection results of three small target images in the SSD, the DSSD and the method of the invention. The invention has the advantages that the detection precision is higher than that of the prior art in the detection of small targets, medium targets and large targets, and the SSD network and the DSSD network have a great number of missed detection problems in the detection of the small targets, but the model structure provided by the invention is better improved.

The above are merely representative examples of the many specific applications of the present invention, and do not limit the scope of the invention in any way. All the technical solutions formed by the transformation or the equivalent substitution fall within the protection scope of the present invention.

Claims

1. A small target detection and identification method for enhancing feature learning is characterized by comprising the following steps:

s3, inputting the small target image to be predicted into the trained small target detection and identification network, and outputting the position and the category information of the prediction frame of the small target end to end through forward propagation;

the basic network module in the step S1 is improved ResNet-101 or improved VGG16;

the improved ResNet-101 sequentially comprises an input layer, a first group of convolutional layers, a maximum pooling layer, a second group of convolutional layers, a third group of convolutional layers, a fourth group of convolutional layers and a fifth group of convolutional layers, wherein the input size of the input layer is 513x513 image data; the first group of convolutional layers sequentially comprises two parts of 1 7x7 convolution operation and 1 nonlinear activation function operation, the second group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, the third group of convolutional layers sequentially comprises 12 convolutional layers, a nonlinear activation layer and an average pooling layer, the fourth group of convolutional layers sequentially comprises 69 convolutional layers, a nonlinear activation layer and an average pooling layer, and the fifth group of convolutional layers sequentially comprises 9 convolutional layers, a nonlinear activation layer and an average pooling layer, wherein each convolutional layer in the second group of convolutional layers to the fifth group of convolutional layers sequentially passes through 1x1 convolution, 13x 3 convolutional layers and 1x1 convolution operation;

the improved VGG16 sequentially comprises a first convolution layer, a second convolution layer, a third convolution layer, a fourth convolution layer and a fifth convolution layer, wherein the first convolution layer and the second convolution layer sequentially comprise 2 convolution operations with convolution kernels of 3x3 and nonlinear activation function operations, and the third convolution layer, the fourth convolution layer and the fifth convolution layer sequentially comprise 3 convolution operations with convolution kernels of 3x3 and nonlinear activation function operations;

the characteristic extraction module in the step S1 is next to the basic network module, a single hourglass stack in the characteristic extraction module consists of 3-order sampling units and is in an hourglass shape, and each-order sampling unit comprises a convolution module and an identity mapping module; the convolution module sequentially comprises 1 down-sampling layer, 3 convolution layers and 1 up-sampling layer, wherein the second convolution layer in the convolution module in the second-order sampling unit is the convolution module in the first-order sampling unit, the second convolution layer in the convolution module in the third-order sampling unit is the convolution module in the second-order sampling unit, each convolution layer is a convolution layer with the size of 3x3, and the down-sampling ratio in the down-sampling layer is

Wherein d represents the d-th branch in the convolution module, and the upsampling in the upsampling layer adopts a bilinear interpolation method; the identity mapping module is used for carrying out jump connection on the input of a down sampling layer of the convolution module and the output of an upper sampling layer and learning the detail information of the shallow feature in a deep network.

2. The method for small target detection and identification with enhanced feature learning of claim 1, wherein: the candidate frame generation module generates 9 candidate frames with different sizes at each pixel point position of the feature image output by the feature extraction module based on an anchor generation mechanism, and each candidate frame is mapped to the original image and corresponds to one candidate frame.

3. The method of claim 2, wherein the method comprises the steps of: in the step S1, the prediction output module follows the candidate frame generation module, where performing prediction frame coordinate regression is to regress the coordinate position of the prediction frame through a Smooth L1 loss function, the prediction frame category classification is to obtain category information of the corresponding prediction frame through a softmax loss function after regressing a numerical value through wx + b based on a feature map corresponding to the prediction frame, and x is a pixel point value in the feature map, where the prediction frame is a candidate frame; the method comprises the following specific steps:

the coordinates of the central point of the prediction frame are x and y, the width and the height are w and h respectively, if the position of the central point of any prediction frame and the height and width information are x _a ，y _a ，w _a ，h _a The position of the center point of the real frame is x _t ，y _t ，w _t ，h _t And the central point position of the prediction frame is as follows: x, y, w, h, and the offset between the real frame and the candidate frame is set as g = (g) _x ，g _y ，g _w ，g _h ) The concrete solving formula is:

the actual offset obtained is l = (l) _x ，l _y ，l _w ，l _h ) Concretely, to findThe solution is:

wherein the Smooth L1 loss function is:

obtaining the category information of the corresponding prediction frame through a softmax loss function after a numerical value is regressed through wx + b on the basis of the feature graph corresponding to the prediction frame, wherein the softmax loss function is as follows:

4. The method for small object detection and identification with enhanced feature learning of claim 1, wherein: in the step S1, randomly initializing parameters of the small target detection and recognition network refers to pre-training the small target detection and recognition network by using larger public data to obtain a set of initialized parameters, where the larger public data is ImageNet.

5. The method of claim 2, wherein the method comprises the steps of: when small target detection and recognition network training is performed in the step S2, a central point distinguishing module is added for obtaining a candidate frame predicted by the prediction output module based on the candidate frame generated by the candidate frame generation module, the k-nearest neighbor method and the non-maximum suppression method, and the specific steps are as follows: according to the central point position of the real frame of the small target sample image data, determining candidate frames corresponding to k nearest central point positions around the small target sample image data by using a k nearest neighbor method, taking the candidate frames as preliminarily selected candidate frames, processing the candidate frames by using the k nearest neighbor method, and obtaining the optimal candidate frame by using a non-maximum suppression method.

6. The method for small object detection and identification by enhancing feature learning according to claim 1 or 4, wherein: in the step S2, the preprocessing of the extracted small target sample image data refers to performing plus/minus 90-degree rotation, random cropping or scale scaling operation on the small target sample image data;