CN111339934A - Human head detection method integrating image preprocessing and deep learning target detection - Google Patents
- Publication number
- CN111339934A (application CN202010116670.5A)
- Authority
- CN
- China
- Prior art keywords
- anchor
- neural network
- image data
- label
- confidence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/30—Noise filtering
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a rapid and accurate human head detection method integrating image preprocessing and deep learning target detection, which comprises the following steps: acquiring monitoring image data in real time; preprocessing the acquired monitoring image data; inputting the preprocessed monitoring image data into a pre-trained neural network to obtain regions to be selected in the monitoring image data together with the offset value and confidence corresponding to each region; selecting the anchors containing a target according to the confidences output by the neural network; calculating the position and size of the prediction bounding box corresponding to each target according to the offset values output by the neural network; drawing the prediction bounding boxes in the monitoring image data according to their positions and sizes to obtain a result image; and outputting the result image and the number of detected heads in the image, the number of heads being the number of prediction bounding boxes in the image. The method improves both the speed and the accuracy of human head detection in public places.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a rapid and accurate human head detection method integrating image preprocessing and deep learning target detection.
Background
In a large number of public places, such as shopping malls, supermarkets, tourist attractions, large and small traffic hubs, banks, subways, schools and the like, the crowd density in surveillance footage needs to be analyzed in real time to ensure orderly and stable operation of the public place. For example, real-time statistical analysis of the number of people in a large shopping mall lets decision makers relieve overly crowded areas in time and prevent trampling accidents. The technology can also be applied to a campus scene, where the number of people in each classroom is monitored and analyzed in real time: students can find a suitable self-study classroom more quickly from the occupancy information provided in real time, shortening the time spent looking for one, and teachers can allocate teaching resources more reasonably, improving teaching quality. Accurately calculating pedestrian volume in public places in real time therefore has good application prospects and commercial value. However, current people-counting research across multiple scenes faces challenges: the resolution of the pedestrians to be counted in an image is low, people occlude each other severely in crowds, and so on.
There are currently two main ways of achieving crowd counting: people-counting methods based on feature regression and people-counting methods based on target detection, both of which involve supervised deep learning. The first directly regresses a crowd density map from the image, but it can only yield a single crowding index for the whole scene, cannot give the specific position of each individual in the crowd, and is sensitive to image resolution, which hinders switching between multiple scenes. The second uses a mainstream object detection method, for example Faster R-CNN or SSD, to detect the positions of "persons" and thereby obtain their number in the image. Its disadvantage is poor performance when people occlude each other or light is insufficient, and the more crowded the scene, the more likely mutual occlusion becomes, limiting the algorithm's use. In addition, these target detection algorithms are not optimized for the head detection task, so their false detection and missed detection rates are high on low-resolution targets.
Due to the limitations of the hardware of current monitoring equipment, whether network cameras or wired cameras, monitoring video images generally suffer from blur and unstable noise.
Disclosure of Invention
The invention aims to provide a human head detection method integrating image preprocessing and deep learning target detection, which can improve both the speed and the accuracy of human head detection in public places.
The technical scheme adopted by the invention is as follows: a method of human head detection, comprising:
acquiring monitoring image data in real time;
preprocessing the acquired monitoring image data;
inputting the preprocessed monitoring image data into a pre-trained neural network to obtain regions to be selected in the monitoring image data together with the offset value and confidence corresponding to each region; the neural network is a deep learning neural network, and its training samples are positive and negative anchor samples selected from an image data set according to the intersection-over-union (IoU) of the candidate anchors in the image data with the ground-truth of the head calibration area, the positive samples carrying offset value labels and confidence labels and the negative samples carrying confidence labels only;
selecting the anchors containing a target according to the confidence of each anchor output by the neural network;
calculating the position and size of the prediction bounding box corresponding to each target according to the offset value of the anchor output by the neural network;
drawing the prediction bounding box in the monitoring image data according to the position and the size of the prediction bounding box to obtain a result image;
and outputting the result image and the number of detected heads in the image, the number of heads being the number of prediction bounding boxes in the image.
Optionally, the training step of the neural network includes:
collecting a plurality of historical monitoring images to form an image data set;
marking each monitoring image in the image data set, recording the position information of all human heads in the image as the head calibration area ground-truth;
determining the size of an anchor used by the neural network;
for each monitoring image, selecting positive and negative anchor samples for training;
respectively generating labels required by training for each selected positive and negative sample, wherein the positive sample label comprises an offset label and a confidence label, and the negative sample label comprises a confidence label;
building a neural network model, configuring the parameters of the neural network model, and designing a loss function optimized with the Adam optimizer;
inputting the monitoring image data into the neural network, calculating the loss function using the offset value labels and confidence labels corresponding to the image, and back-propagating the loss through the optimizer to optimize the model parameters.
In the neural network training process, a validation data set can be used for periodic testing; when the model parameters perform well on the validation data set, training is stopped and neural network parameter optimization is finished, yielding the final model parameters. The validation data set may likewise consist of historical monitoring image data.
Optionally, convolutional layers are adopted as the neural network layers of the neural network, pooling layers are replaced by convolution kernels with a stride of 2, and an upsampling and fusion mechanism is used to finally generate two feature maps of different sizes for target detection.
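The effect of replacing a pooling layer with a stride-2 convolution can be illustrated with a minimal NumPy sketch (not the patent's actual network code; the 40 × 30 map, the 3 × 3 kernel, and the random weights are illustrative assumptions):

```python
import numpy as np

def conv2d(x, k, stride=2, pad=1):
    """Single-channel 2D convolution (cross-correlation) with stride and zero padding."""
    x = np.pad(x, pad)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # weighted sum over the receptive window, stepping by `stride`
            out[i, j] = np.sum(x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k)
    return out

feat = np.random.rand(40, 30)          # a feature map (illustrative size)
k = np.random.rand(3, 3)               # a learnable 3x3 kernel (random stand-in)
down = conv2d(feat, k, stride=2, pad=1)
```

With stride 2 and padding 1, each spatial dimension is halved (40 × 30 → 20 × 15), exactly the downsampling a 2 × 2 pooling layer would provide, but with learnable weights.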
Optionally, the size calculation formula of the anchor is as follows:
anchor_size=layer_stride*aspect_ratio*anchor_scale
wherein layer_stride is the downsampling multiple, aspect_ratio is the aspect ratio of the anchor, and anchor_scale is the scale of the anchor on the feature map. The anchor is determined according to the relation between the theoretical receptive field and the effective receptive field.
Optionally, for the large 40 × 30 feature map, the downsampling multiple layer_stride is 16, aspect_ratio is set to 1, and anchor_scale is set to 2 and 4; for the small 10 × 8 feature map, layer_stride is 64, aspect_ratio is 1, and anchor_scale is 2 and 4.
By the formula, the anchor sizes of the large feature map on the original image are 32 × 32 and 64 × 64, and those of the small feature map are 128 × 128 and 256 × 256. The calculated anchor sizes are 1/7 to 1/3 of the theoretical receptive field size and are very close to the effective receptive field size corresponding to the theoretical receptive field.
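The anchor sizes above follow directly from the formula; a minimal sketch in Python (the function name is illustrative, not from the patent):

```python
def anchor_size(layer_stride, aspect_ratio, anchor_scale):
    """anchor_size = layer_stride * aspect_ratio * anchor_scale."""
    return layer_stride * aspect_ratio * anchor_scale

# Large 40 x 30 feature map: downsampling multiple 16, scales 2 and 4
large_anchors = [anchor_size(16, 1, s) for s in (2, 4)]   # 32 x 32 and 64 x 64
# Small 10 x 8 feature map: downsampling multiple 64, scales 2 and 4
small_anchors = [anchor_size(64, 1, s) for s in (2, 4)]   # 128 x 128 and 256 x 256
```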
Optionally, the positive and negative anchor samples selected for training are:
marking an anchor whose IoU with a ground-truth is greater than or equal to 0.7, or the anchor with the largest IoU with a ground-truth, as a positive sample;
marking an anchor whose IoU with every ground-truth is less than 0.3 as a negative sample;
the ratio of the number of positive and negative samples is 1:1. Through this sample selection strategy, equal numbers of the best positive and negative samples can be selected from all samples of the monitoring image data, ensuring the balance of positive and negative samples during training.
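The selection rules above can be sketched as follows. This is a simplified illustration with corner-format boxes and made-up coordinates; subsampling to equal positive/negative counts is omitted:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Label anchors 1 (positive), 0 (negative) or -1 (ignored) by IoU."""
    ious = np.array([[iou(a, g) for g in gts] for a in anchors])
    labels = np.full(len(anchors), -1)
    labels[ious.max(axis=1) >= pos_thr] = 1   # criterion: IoU >= 0.7
    labels[ious.max(axis=1) < neg_thr] = 0    # negative: IoU < 0.3 with every ground-truth
    labels[ious.argmax(axis=0)] = 1           # best anchor per ground-truth is also positive
    return labels

anchors = [(0, 0, 10, 10), (20, 20, 30, 30), (0, 0, 10, 14), (0, 0, 10, 20)]
gts = [(0, 0, 10, 10)]
labels = label_anchors(anchors, gts)   # anchors 0 and 2 positive, 1 negative, 3 ignored
```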
Optionally, generating the offset value label and the confidence label required by training includes:
for the offset value label of the positive sample, define a ═ ax,Ay,Aw,Ah) Denotes the original anchor, using G ═ G (G)x,Gy,Gw,Gh) Ground-truth, A representing the targetx,Ay,Aw,AhCorresponding to the coordinate and width and height, G, of the anchorx,Gy,Gw,GhRespectively corresponding to the coordinate and width and height of the ground-truth;
the goal of the offset value generation strategy is to find a relationship such that the input original A can be mapped to the real box G, i.e., to find a transformation F such that F (A)x,Ay,Aw,Ah)=(Gx,Gy,Gw,Gh) (ii) a Wherein the prediction result is G '═ G'x,G′y,G′w,G′h) Then G' should be about equal to G, i.e.: (G'x,G′y,G′w,G′h)≈(Gx,Gy,Gw,Gh);
Assuming that f (a) ≈ G' is achieved by translation and scaling, then:
translation:
zooming:
the offset value label can be calculated according to G and A by a formulaFitting the deviation value label by adopting a regression algorithm in the training process, wherein the result after fitting is tx(A),ty(A),tw(A),th(A) Namely:and then, restoring the coordinates G 'of the prediction frame by using a translation scaling formula'x,G′yAnd wide height G'w,G′hTo thereby realize (G'x,G′y,G′w,G′h)≈(Gx,Gy,Gw,Gh). This process only computes positive samples because in the loss function, the offset values of the negative samples are not used to participate in the portion of the loss function where the offset value loss is computed. The fitting is typically to a loss function using a gradient descentFitting is carried out by the method.
For the confidence label, the confidence label of the positive sample is set to 1, which indicates that the detection target exists in the anchor, and the confidence of the negative sample is set to 0.
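The label generation and its inverse (restoring the prediction box) can be sketched as an encode/decode pair; the function names and the example boxes are illustrative, not from the patent:

```python
import math

def encode(A, G):
    """Offset value label (t_x, t_y, t_w, t_h) mapping anchor A onto ground-truth G.
    Boxes are (center_x, center_y, width, height)."""
    ax, ay, aw, ah = A
    gx, gy, gw, gh = G
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def decode(A, t):
    """Restore the prediction box G' from anchor A and fitted offsets
    via the translation and scaling formulas."""
    ax, ay, aw, ah = A
    tx, ty, tw, th = t
    return (aw * tx + ax, ah * ty + ay, aw * math.exp(tw), ah * math.exp(th))

A = (50.0, 50.0, 32.0, 32.0)   # an anchor (illustrative values)
G = (54.0, 48.0, 40.0, 36.0)   # its matched ground-truth
t = encode(A, G)
G_prime = decode(A, t)         # decode(encode(...)) round-trips back to G
```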
Optionally, the invention adopts a deep learning framework to build the neural network model; the configured neural network model parameters include the learning rate and regularization parameters, and the loss function optimized with the Adam optimizer is:
L({p_i}, {t_i}) = (1/N) Σ_i L_cls(p_i, p_i*) + (1/N) Σ_i p_i* · L_reg(t_i, t_i*)
where L_cls is the loss function for the confidence, L_reg is the loss function for the offset value, N is the number of valid samples of the input picture, and i is the index of the anchor corresponding to a valid sample.
L_cls is computed as:
L_cls(p_i, p_i*) = −[p_i* · log(p_i) + (1 − p_i*) · log(1 − p_i)]
where p_i is the prediction confidence calculated for the i-th anchor and p_i* is the confidence label corresponding to the anchor.
L_reg is computed as:
L_reg(t_i, t_i*) = smooth_L1(t_i − t_i*)
where t_i = (t_i1, t_i2, t_i3, t_i4) is the predicted offset value calculated by the i-th anchor, a four-component vector, and t_i* is the offset value label corresponding to the anchor, which corresponds to (t_x*, t_y*, t_w*, t_h*).
In the loss function, L_reg is multiplied by p_i*, which means that only anchors belonging to positive samples participate in calculating the regression of the bounding box offset value.
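The exact form of L_cls is not fully spelled out in the text; the sketch below assumes binary cross-entropy (the confidence labels are 1 or 0), with Smooth L1 for L_reg as named in the embodiment, and p* gating the offset term so only positive anchors regress boxes:

```python
import numpy as np

def smooth_l1(d):
    d = np.abs(d)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)

def multitask_loss(p, p_star, t, t_star, eps=1e-7):
    """Confidence loss (binary cross-entropy, an assumption here) plus
    offset loss (Smooth L1) multiplied by p* per anchor."""
    n = len(p)
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1)
    return l_cls.sum() / n + (p_star * l_reg).sum() / n

p = np.array([0.9, 0.2])        # predicted confidences (illustrative)
p_star = np.array([1.0, 0.0])   # confidence labels: one positive, one negative
t = np.random.randn(2, 4)       # predicted offsets
t_star = np.zeros((2, 4))       # offset value labels
loss = multitask_loss(p, p_star, t, t_star)
```

Changing the predicted offsets of the negative anchor leaves the loss unchanged, which is exactly the effect of the p* factor.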
Optionally, preprocessing the acquired monitoring image data includes removing noise in the image with the BM3D denoising algorithm and the USM sharpening algorithm;
the neural network training step further comprises: performing image denoising on each monitoring image in the image data set with the BM3D denoising algorithm and the USM sharpening algorithm. Noise at image details is thus removed, making small targets with few pixels more visible in the image.
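As a hedged sketch of the sharpening half of this preprocessing: USM (unsharp masking) amplifies the high-frequency residual between the image and a blurred copy. BM3D, which the method runs alongside USM, requires a third-party `bm3d` package and is omitted here; the sigma and amount values are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(img, sigma=1.0, amount=1.5):
    """USM sharpening: add back amount * (img - blurred), clipped to [0, 255]."""
    blurred = gaussian_filter(img.astype(float), sigma)
    return np.clip(img + amount * (img - blurred), 0.0, 255.0)

# A synthetic step edge: sharpening overshoots on both sides of the edge,
# which is what makes small, low-contrast details stand out.
img = np.full((20, 20), 100.0)
img[:, 10:] = 150.0
out = unsharp_mask(img)
```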
Optionally, the neural network training step further includes: transforming each monitoring image and its corresponding ground-truth through cropping, rotation and/or stretching operations to augment the data set.
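A minimal sketch of two such augmentations, cropping and 90-degree rotation, with the ground-truth boxes remapped in step (stretching is analogous but needs image interpolation, so it is omitted; boxes are in (x1, y1, x2, y2) corner format and all coordinates are illustrative):

```python
import numpy as np

def crop(img, boxes, x0, y0, w, h):
    """Crop a window and shift/clip the ground-truth boxes into it;
    boxes falling entirely outside the window are dropped."""
    window = img[y0:y0 + h, x0:x0 + w]
    kept = []
    for (x1, y1, x2, y2) in boxes:
        nx1, ny1 = max(x1 - x0, 0), max(y1 - y0, 0)
        nx2, ny2 = min(x2 - x0, w), min(y2 - y0, h)
        if nx2 > nx1 and ny2 > ny1:
            kept.append((nx1, ny1, nx2, ny2))
    return window, kept

def rotate90(img, boxes):
    """Rotate the image 90 degrees counter-clockwise and remap the boxes:
    a point (x, y) maps to (y, W - x)."""
    W = img.shape[1]
    return np.rot90(img), [(y1, W - x2, y2, W - x1) for (x1, y1, x2, y2) in boxes]

img = np.arange(100).reshape(10, 10)
boxes = [(1, 2, 4, 6)]
win, wboxes = crop(img, boxes, 3, 3, 5, 5)   # box clipped into the window
rot, rboxes = rotate90(img, boxes)           # box follows the rotation
```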
Advantageous effects
The invention uses image preprocessing technology to weaken the noise corresponding to small-target pixels, improve the fine granularity of the image and refine the features of image details, which helps the neural network detect small targets. Meanwhile, deep learning is used to detect the human heads present in the image; through a well-designed anchor scheme and convolutional neural network structure, the method not only performs well when people occlude each other severely but also has great advantages in detection speed. By detecting the position of each human head in the image, the algorithm achieves more accurate crowd counting, avoids missing targets whose bodies are largely covered, and obtains the accurate position and size of each human head in the image.
Drawings
FIG. 1 is a convolutional neural network structure employed by the present invention;
FIG. 2 is a diagram showing the relationship between the anchor and the ground-truth in the original image when calculating the offset value label;
FIG. 3 shows test results of an exemplary embodiment of the present invention;
FIG. 4 is a test result of a second application example of the present invention;
fig. 5 shows the test results of the third application example of the present invention.
Detailed Description
The following further description is made in conjunction with the accompanying drawings and the specific embodiments.
Example 1
The embodiment is a human head detection method, which comprises the following steps:
acquiring monitoring image data in real time;
preprocessing the acquired monitoring image data;
inputting the preprocessed monitoring image data into a pre-trained neural network to obtain regions to be selected in the monitoring image data together with the offset value and confidence corresponding to each region; the neural network is a deep learning neural network, and its training samples are positive and negative anchor samples selected from an image data set according to the intersection-over-union (IoU) of the candidate anchors in the image data with the ground-truth of the head calibration area, the positive samples carrying offset value labels and confidence labels and the negative samples carrying confidence labels only;
selecting the anchors containing a target according to the confidence of each anchor output by the neural network;
calculating the position and size of the prediction bounding box corresponding to each target according to the offset value of the anchor output by the neural network;
drawing the prediction bounding box in the monitoring image data according to the position and the size of the prediction bounding box to obtain a result image;
and outputting the result image and the number of detected heads in the image, the number of heads being the number of prediction bounding boxes in the image.
The training step of the pre-trained neural network comprises the following steps:
collecting a plurality of historical monitoring images to form an image data set;
marking each monitoring image in the image data set, recording the position information of all human heads in the image as the head calibration area ground-truth;
determining the size of an anchor used by the neural network;
for each monitoring image, selecting positive and negative anchor samples for training;
respectively generating labels required by training for each selected positive and negative sample, wherein the positive sample label comprises an offset label and a confidence label, and the negative sample label comprises a confidence label;
building a neural network model, configuring the parameters of the neural network model, and designing a loss function optimized with the Adam optimizer;
inputting the monitoring image data into the neural network, calculating the loss function using the offset value labels and confidence labels corresponding to the image, and back-propagating the loss through the optimizer to optimize the model parameters.
In the neural network training process, a validation data set can be used for periodic testing; when the model parameters perform well on the validation data set, training is stopped and neural network parameter optimization is finished, yielding the final model parameters. The validation data set may likewise consist of historical monitoring image data.
Example 2
Based on the same inventive concept as Embodiment 1, this embodiment describes in detail a fast and accurate human head detection algorithm combining image preprocessing and deep learning target detection; the implementation comprises the following two stages:
In the training phase:
step one, acquiring monitoring images of a plurality of public places to obtain an image data set;
step two, manually marking the images in the data set, marking the position information of all human heads to obtain ground-truth labels;
step three, for each image in the data set, removing noise at image details with the BM3D denoising algorithm and the USM sharpening algorithm, so that small targets with few pixels in the image become more obvious;
step four, transforming the images and their corresponding labels through operations such as cropping, rotation and stretching to augment the data set;
step five, designing the neural network structure, as shown in fig. 1: all neural network layers adopted by the algorithm are convolutional layers, convolution kernels with a stride of 2 replace pooling layers, an upsampling and fusion mechanism is introduced, and two feature maps of different sizes are finally generated for target detection.
Step six, calculating the size of the anchor used by the model according to the relation between the theoretical receptive field and the effective receptive field;
the Anchor size is calculated by the formula as,
anchor_size=layer_stride*aspect_ratio*anchor_scale。
wherein layer_stride is the downsampling multiple; for the large 40 × 30 feature map the downsampling multiple layer_stride is 16; since there is no need to vary anchors of different shapes at a fixed area, aspect_ratio is set to 1; anchor_scale is the scale of the anchor on the feature map, which we set to 2 and 4 according to the size of the effective receptive field. By the above formula, the anchor sizes of the large feature map on the original image are 32 × 32 and 64 × 64. The anchor sizes calculated by the formula are 1/7 to 1/3 of the theoretical receptive field size and at the same time very close to the effective receptive field size corresponding to the theoretical receptive field. For the anchor selection of the small 10 × 8 feature map, layer_stride is 64, aspect_ratio is also set to 1, and anchor_scale is likewise 2 and 4, so the anchor sizes corresponding to the small feature map are 128 × 128 and 256 × 256.
step seven, selecting positive and negative samples according to the following strategy, ensuring a positive-to-negative ratio of 1:1;
the sample selection strategy is as follows,
there are two criteria for selecting positive samples: (1) an anchor whose IoU with a ground-truth is greater than or equal to 0.7 is marked as a positive sample; (2) the anchor with the largest IoU with a ground-truth is marked as a positive sample. The algorithm selects positive samples according to both criteria because in some cases criterion (1) marks no anchor as a positive sample; an anchor is therefore marked positive as long as it satisfies either criterion.
An anchor whose IoU with every ground-truth is less than 0.3 is assigned as a negative sample. Anchors meeting neither the positive nor the negative selection condition are assigned no label and do not participate in training. Through this sample selection strategy, the best 16 positive samples and 16 negative samples are selected from all samples, the ratio of positive to negative samples is 1:1, and the balance of positive and negative samples during training is guaranteed.
Step eight, generating labels required by training for the selected positive and negative samples according to the ground-route labels and the anchors corresponding to the positive and negative samples;
the tag computation strategy is as follows,
for the offset value labels required in training, the algorithm uses A = (A_x, A_y, A_w, A_h) to represent the original anchor and G = (G_x, G_y, G_w, G_h) to represent the ground-truth of the target, as shown in FIG. 2. The goal of the strategy is to find a relationship mapping the input original anchor A onto the real box G, with the model prediction result G', i.e.,
given: A = (A_x, A_y, A_w, A_h), G = (G_x, G_y, G_w, G_h);
find a transformation F such that F(A_x, A_y, A_w, A_h) = (G_x, G_y, G_w, G_h), where (G'_x, G'_y, G'_w, G'_h) ≈ (G_x, G_y, G_w, G_h).
F(A) ≈ G' is achieved by translation and scaling:
translation: G'_x = A_w · t_x(A) + A_x, G'_y = A_h · t_y(A) + A_y
scaling: G'_w = A_w · exp(t_w(A)), G'_h = A_h · exp(t_h(A))
From the above formulas, the offset value label (t_x*, t_y*, t_w*, t_h*) can be calculated from G and A:
t_x* = (G_x − A_x)/A_w, t_y* = (G_y − A_y)/A_h, t_w* = log(G_w/A_w), t_h* = log(G_h/A_h)
The offset value label is fitted by regression during training; the fitted result is t_x(A), t_y(A), t_w(A), t_h(A), and the coordinates G'_x, G'_y and the width and height G'_w, G'_h of the prediction box are then restored with the translation and scaling formulas. This process is computed only for positive samples, because in the loss function the offset values of negative samples do not participate in the offset loss term.
For the confidence labels required in training, the confidence label of the positive sample is set to be 1, which indicates that a detection target exists in the anchor, and the confidence of the negative sample is set to be 0.
Step nine, building the neural network with a deep learning framework, determining configurations such as the learning rate and regularization parameters, and designing the loss function optimized with the Adam optimizer;
the loss function is calculated as follows,
the algorithm employs a multitasking loss function that is calculated from a confidence loss function LclsAnd a loss function L for calculating the offset valueregTwo parts are formed. Wherein, N is the number of valid samples of the input picture, and i is the index of the anchor corresponding to the valid samples.
LclsThe calculation formula is as follows:
wherein p isiThe prediction confidence calculated for the ith anchor,the confidence label corresponding to the anchor has a value of 1 or 0.
LregThe calculation formula of (a) is as follows:
for LregThe algorithm uses Smooth L1 as the loss function, tiRepresenting the predicted offset value (t) calculated by the ith anchori1,ti2,ti3,ti4) The vector is a four-point vector,is the offset value label corresponding to anchor, which corresponds to
Loss function at LregFront surface is multiplied byThis means that only anchors belonging to positive samples will participate in the regression that calculates the bezel offset value.
step ten, inputting the data in the data set into the neural network, calculating the loss function with the training labels corresponding to the images, back-propagating through the optimizer, and optimizing the model parameters;
step eleven, performing periodic validation during model training with a validation data set;
step twelve, when the model parameters perform well on the validation data set, stopping training; neural network parameter optimization is finished and the final model parameters are obtained.
In the testing phase:
step thirteen, acquiring a real-time monitoring image through a local camera or a network camera;
step fourteen, applying the BM3D denoising algorithm and the USM sharpening algorithm to the captured image to remove noise in the image;
step fifteen, inputting the denoised image into the neural network, whose parameters are the trained model parameters;
step sixteen, obtaining the output of the network, namely the offset value and confidence corresponding to each anchor;
step seventeen, screening out all anchors containing a target through the confidence and a manually set threshold;
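The screening step reduces to a simple comparison against the threshold; the threshold value 0.5 and the confidences below are illustrative, not values from the patent:

```python
import numpy as np

def select_detections(confidences, threshold=0.5):
    """Return the indices of anchors whose predicted confidence exceeds the
    manually set threshold; the head count is simply how many survive."""
    return np.flatnonzero(confidences > threshold)

conf = np.array([0.91, 0.12, 0.73, 0.48])   # illustrative network outputs
kept = select_detections(conf, threshold=0.5)
head_count = len(kept)                       # two anchors pass the threshold
```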
step eighteen, for the anchors with targets selected in step seventeen, calculating the coordinates and the width and height of the prediction bounding boxes through the offset values and the translation and scaling formulas given in step eight;
step nineteen, drawing all the prediction bounding boxes obtained in the previous step in the original image to obtain the result image;
step twenty, returning the result image and the number of detected heads in the image;
step twenty-one, if the latest monitoring image needs to be detected, returning to step thirteen; otherwise, executing step twenty-two.
step twenty-two, the algorithm execution is finished.
The rapid and accurate human head detection algorithm integrating image preprocessing and deep learning target detection reduces the noise corresponding to small-target pixel regions with a noise reduction algorithm, improves the fine granularity of the image and refines the features of image details, which helps the neural network detect small targets. Meanwhile, deep learning is used to detect the human heads present in the image; through a well-designed anchor scheme and convolutional neural network structure, the algorithm not only performs well when people occlude each other severely but also has great advantages in detection speed. By detecting the position of each human head in the image, the algorithm achieves more accurate crowd counting, avoids missing targets whose bodies are largely covered, and obtains the position and size of each human head in the image.
To further verify the effect of the present algorithm, the following experiment was performed.
Firstly, the model is trained with the data set and labels to obtain trained model parameters. Then 3 high-density head images are randomly selected from Google Images and input into the convolutional neural network, and the heads in the images are detected using the trained model parameters as network weights; the results are shown in figures 3, 4 and 5.
According to the experimental results, the algorithm has very high detection precision: it detects obvious large-scale targets and also accurately detects small targets with more noise and fewer pixels.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a system for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implements the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (10)
1. A human head detection method is characterized by comprising the following steps:
acquiring monitoring image data in real time;
preprocessing the acquired monitoring image data;
inputting the preprocessed monitoring image data into a pre-trained neural network to obtain candidate regions in the monitoring image data, together with an offset value and a confidence corresponding to each candidate region; the neural network is a deep learning neural network whose training samples are image data sets, in which positive and negative anchor samples are selected according to the intersection-over-union between candidate anchors in the image data and the ground-truth of the calibrated head regions; positive samples carry an offset value label and a confidence label, and negative samples carry a confidence label;
selecting the anchors that contain a target according to the anchor confidences output by the neural network;
calculating the position and size of the prediction bounding box of each corresponding target according to the anchor offset values output by the neural network;
drawing the prediction bounding box in the monitoring image data according to the position and the size of the prediction bounding box to obtain a result image;
and outputting the result image and the number of detected heads in the image, wherein the number of heads is the number of prediction bounding boxes in the image.
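The selection and decoding steps of claim 1 can be sketched roughly as follows (a minimal illustration, not the patented implementation; the confidence threshold of 0.5 and the `(x, y, w, h)` anchor/offset format are assumptions):

```python
import math

def decode_boxes(anchors, offsets, confidences, conf_threshold=0.5):
    """Select anchors whose confidence exceeds the threshold and decode
    each one's predicted bounding box from its offset values.
    anchors:     list of (x, y, w, h) anchor boxes
    offsets:     list of (tx, ty, tw, th) predicted offset values
    confidences: list of predicted confidences in [0, 1]
    Returns a list of predicted (x, y, w, h) boxes; the head count is len(result).
    """
    boxes = []
    for (ax, ay, aw, ah), (tx, ty, tw, th), p in zip(anchors, offsets, confidences):
        if p < conf_threshold:
            continue
        # Inverse of the offset encoding: translate by (aw*tx, ah*ty), scale by exp.
        gx = ax + aw * tx
        gy = ay + ah * ty
        gw = aw * math.exp(tw)
        gh = ah * math.exp(th)
        boxes.append((gx, gy, gw, gh))
    return boxes
```

The head count reported in the last step of the claim is simply the length of the returned list.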
2. The method of claim 1, wherein the step of training the neural network comprises:
collecting a plurality of historical monitoring images to form an image data set;
marking each monitoring image in the image data set, recording the position information of all human heads in the image as the ground-truth head calibration areas;
determining the size of an anchor used by the neural network;
for each monitoring image, selecting an anchor positive and negative sample for training respectively;
respectively generating labels required by training for each selected positive and negative sample, wherein the positive sample label comprises an offset value label and a confidence label, and the negative sample label comprises a confidence label;
building a neural network model, configuring parameters of the neural network model, designing a loss function, and selecting the Adam optimizer;
inputting the monitoring image data into the neural network, calculating the loss function using the offset value labels and confidence labels corresponding to the image, and back-propagating the loss through the optimizer to optimize the model parameters.
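The training step above relies on the Adam optimizer; as a minimal illustration of a single Adam parameter update (a textbook sketch with default hyperparameters, not values taken from the patent):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.
    theta: current parameter value; grad: its gradient;
    m, v: running first/second moment estimates; t: 1-based step counter.
    """
    m = b1 * m + (1 - b1) * grad          # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad   # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

In practice a deep learning framework applies this update to every network weight; the sketch only shows the arithmetic of one step.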
3. The method as claimed in claim 1 or 2, wherein the neural network uses convolutional layers throughout, the pooling layers are replaced by convolution kernels with a stride of 2, and an up-sampling and fusion mechanism finally generates two feature maps of different sizes for target detection.
4. The method of claim 3, wherein the anchor size is calculated by the formula:
anchor_size = layer_stride * aspect_ratio * anchor_scale
wherein layer_stride is the down-sampling multiple, aspect_ratio is the aspect ratio of the anchor, and anchor_scale is the scale of the anchor on the feature map.
5. The method of claim 4, wherein for the large 40×30 feature map, the down-sampling multiple layer_stride is 16, aspect_ratio is set to 1, and anchor_scale is set to 2 and 4; for the small 10×8 feature map, layer_stride is 64, aspect_ratio is 1, and anchor_scale is 2 and 4.
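Plugging the values of claim 5 into the formula of claim 4 gives the concrete anchor sizes; a direct evaluation of the stated formula:

```python
def anchor_sizes(layer_stride, aspect_ratio, anchor_scales):
    """anchor_size = layer_stride * aspect_ratio * anchor_scale,
    evaluated for each configured scale."""
    return [layer_stride * aspect_ratio * s for s in anchor_scales]

# 40x30 large feature map: stride 16, ratio 1, scales 2 and 4 -> anchors of 32 and 64
large = anchor_sizes(16, 1, [2, 4])
# 10x8 small feature map: stride 64, ratio 1, scales 2 and 4 -> anchors of 128 and 256
small = anchor_sizes(64, 1, [2, 4])
```

So the large feature map covers small heads with 32- and 64-pixel anchors, while the small feature map covers large heads with 128- and 256-pixel anchors.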
6. The method of claim 2, wherein the choice of anchor positive and negative samples for training is:
marking any anchor whose intersection-over-union with a ground-truth is greater than or equal to 0.7, or the anchor with the maximum intersection-over-union with a ground-truth, as a positive sample;
marking any anchor whose intersection-over-union with every ground-truth is less than or equal to 0.3 as a negative sample;
the ratio of the number of positive samples to negative samples is 1:1.
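The selection rule of claim 6 can be sketched as follows (a minimal illustration; the `(x, y, w, h)` top-left box format and the function names are assumptions of this sketch):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Return (positives, negatives) index lists following claim 6:
    IoU >= 0.7 (or best match for some ground-truth) -> positive;
    IoU <= 0.3 with every ground-truth -> negative."""
    best = [max(iou(a, g) for g in gts) for a in anchors]
    positives = {i for i, s in enumerate(best) if s >= pos_thr}
    # each ground-truth also claims its best-matching anchor as a positive
    for g in gts:
        positives.add(max(range(len(anchors)), key=lambda i: iou(anchors[i], g)))
    negatives = [i for i, s in enumerate(best) if s <= neg_thr and i not in positives]
    return sorted(positives), negatives
```

Balancing positives and negatives 1:1 would then be a simple subsampling of the longer list.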
7. The method of claim 1, wherein generating the offset value labels and confidence labels required for training comprises:
for the offset value label of a positive sample, let A = (A_x, A_y, A_w, A_h) denote the original anchor and G = (G_x, G_y, G_w, G_h) denote the ground-truth of the target, where A_x, A_y, A_w, A_h are the coordinates, width, and height of the anchor, and G_x, G_y, G_w, G_h are the coordinates, width, and height of the ground-truth;
the goal of the offset value generation strategy is to find a relationship that maps the input original anchor A to the real box G, i.e., a transformation F such that F(A_x, A_y, A_w, A_h) = (G_x, G_y, G_w, G_h); the prediction result is G' = (G'_x, G'_y, G'_w, G'_h), and G' should be approximately equal to G;
assuming that F(A) ≈ G' is achieved by translation and scaling, then:
translation: G'_x = A_w · t_x(A) + A_x,  G'_y = A_h · t_y(A) + A_y
scaling: G'_w = A_w · exp(t_w(A)),  G'_h = A_h · exp(t_h(A))
the offset value label is calculated from G and A by the formulas t*_x = (G_x − A_x)/A_w, t*_y = (G_y − A_y)/A_h, t*_w = log(G_w/A_w), t*_h = log(G_h/A_h); during training a regression algorithm is used to fit the offset value label, the fitted results being t_x(A), t_y(A), t_w(A), t_h(A), i.e. t_x(A) ≈ t*_x, t_y(A) ≈ t*_y, t_w(A) ≈ t*_w, t_h(A) ≈ t*_h;
for the confidence label, the confidence label of a positive sample is set to 1, and that of a negative sample to 0.
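The offset-label computation of claim 7 amounts to an encode/decode pair under the standard translation-and-scaling parameterization (a minimal sketch; the function names are mine, and a round-trip through both functions should recover the ground-truth box):

```python
import math

def encode_offsets(anchor, gt):
    """Offset value label (tx, ty, tw, th) from anchor A and ground-truth G:
    tx = (Gx - Ax)/Aw, ty = (Gy - Ay)/Ah, tw = log(Gw/Aw), th = log(Gh/Ah)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))

def decode_offsets(anchor, t):
    """Apply the translation-and-scaling transform F(A) to recover the box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = t
    return (ax + aw * tx, ay + ah * ty,
            aw * math.exp(tw), ah * math.exp(th))
```

The log/exp pairing keeps predicted widths and heights positive regardless of the regressed value, which is one reason this parameterization is widely used for anchor regression.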
8. The method of claim 2, wherein a deep learning framework is adopted to build the neural network model, the configuration of the neural network model parameters comprises configuration of the learning rate and regularization parameters, and the loss function optimized with the Adam optimizer is:
L = (1/N) · Σ_i [ L_cls(p_i, p*_i) + p*_i · L_reg(t_i, t*_i) ]
in the formula, L_cls is the loss function for the confidence, L_reg is the loss function for the offset value, N is the number of effective samples of the input picture, and i is the index of the anchor corresponding to an effective sample;
L_cls is computed as the binary cross-entropy:
L_cls(p_i, p*_i) = −[ p*_i · log(p_i) + (1 − p*_i) · log(1 − p_i) ]
wherein p_i is the prediction confidence calculated for the i-th anchor and p*_i is the confidence label corresponding to that anchor;
L_reg is computed as the smooth-L1 loss over the four offset components:
L_reg(t_i, t*_i) = Σ_{j ∈ {x,y,w,h}} smooth_L1(t_{i,j} − t*_{i,j}), where smooth_L1(x) = 0.5·x² if |x| < 1, and |x| − 0.5 otherwise.
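Assuming the common pairing of binary cross-entropy for the confidence and smooth-L1 for the offsets (the original equation images do not survive in this text, so this is a sketch under those assumptions, not the patent's exact formula), the per-image loss can be computed as:

```python
import math

def smooth_l1(x):
    """Smooth-L1: 0.5*x**2 for |x| < 1, |x| - 0.5 otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def detection_loss(preds, labels):
    """Average per-anchor loss over N effective samples.
    preds:  list of (p, (tx, ty, tw, th)) predicted confidence and offsets
    labels: list of (p_star, offset_label); offset_label is None for
            negative samples, which contribute only the confidence term."""
    total = 0.0
    for (p, t), (p_star, t_star) in zip(preds, labels):
        # binary cross-entropy on the confidence
        total += -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))
        # regression term only for positive samples (p_star == 1)
        if p_star == 1:
            total += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return total / len(preds)
```

The `p_star` factor in front of the regression term is what makes negative anchors contribute only classification loss.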
9. The method according to claim 1 or 2, wherein the pre-processing comprises: removing noise points in the image by using a BM3D denoising algorithm and a USM sharpening algorithm;
the neural network training step further comprises: performing image denoising processing on each monitoring image in the image data set by using the BM3D denoising algorithm and the USM sharpening algorithm.
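BM3D is an involved block-matching algorithm best taken from a library, but the USM (unsharp mask) sharpening step of claim 9 is simple enough to sketch; a 1-D pure-Python illustration (real implementations operate on 2-D images, typically with Gaussian rather than box blur):

```python
def box_blur_1d(signal, radius=1):
    """Simple moving-average blur with edge clamping."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def unsharp_mask_1d(signal, amount=1.0, radius=1):
    """USM sharpening: output = input + amount * (input - blurred).
    Flat regions pass through unchanged; edges are amplified."""
    blurred = box_blur_1d(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]
```

The overshoot it introduces on either side of an edge is what makes small, low-contrast head regions stand out more clearly to the detector.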
10. The method of claim 2, wherein the neural network training step further comprises: transforming each monitoring image and its corresponding ground-truth through cropping, rotation and/or stretching operations to augment the data set.
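The augmentation of claim 10 must transform the ground-truth boxes together with the image; for example, a horizontal flip remaps each box as follows (a minimal sketch with the `(x, y, w, h)` top-left box format assumed, and flipping chosen as the simplest representative transform):

```python
def hflip_image(rows):
    """Horizontally flip an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in rows]

def hflip_boxes(boxes, image_width):
    """Remap (x, y, w, h) ground-truth boxes for a horizontal flip:
    the new left edge is image_width - (old left edge + width)."""
    return [(image_width - (x + w), y, w, h) for (x, y, w, h) in boxes]
```

Crops and rotations follow the same principle: whatever coordinate transform is applied to the pixels must also be applied to every ground-truth box.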
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010116670.5A CN111339934A (en) | 2020-02-25 | 2020-02-25 | Human head detection method integrating image preprocessing and deep learning target detection |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111339934A true CN111339934A (en) | 2020-06-26 |
Family
ID=71185662
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010116670.5A Withdrawn CN111339934A (en) | 2020-02-25 | 2020-02-25 | Human head detection method integrating image preprocessing and deep learning target detection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111339934A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113283578A (en) * | 2021-04-14 | 2021-08-20 | 南京大学 | Data denoising method based on marking risk control |
CN114241340A (en) * | 2021-12-16 | 2022-03-25 | 北京工业大学 | Image target detection method and system based on double-path depth residual error network |
WO2024125583A1 (en) * | 2022-12-14 | 2024-06-20 | 中国电信股份有限公司 | Detection method and related device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107967451B (en) | Method for counting crowd of still image | |
CN106960195B (en) | Crowd counting method and device based on deep learning | |
CN112926410B (en) | Target tracking method, device, storage medium and intelligent video system | |
CN111027493B (en) | Pedestrian detection method based on deep learning multi-network soft fusion | |
CN111626176B (en) | Remote sensing target rapid detection method and system based on dynamic attention mechanism | |
CN111259779B (en) | Video motion detection method based on center point track prediction | |
CN113160062B (en) | Infrared image target detection method, device, equipment and storage medium | |
CN113591795A (en) | Lightweight face detection method and system based on mixed attention feature pyramid structure | |
CN101470809B (en) | Moving object detection method based on expansion mixed gauss model | |
CN107273832B (en) | License plate recognition method and system based on integral channel characteristics and convolutional neural network | |
CN111339934A (en) | Human head detection method integrating image preprocessing and deep learning target detection | |
CN111709285A (en) | Epidemic situation protection monitoring method and device based on unmanned aerial vehicle and storage medium | |
CN109902576B (en) | Training method and application of head and shoulder image classifier | |
CN113033523B (en) | Method and system for constructing falling judgment model and falling judgment method and system | |
CN110956158A (en) | Pedestrian shielding re-identification method based on teacher and student learning frame | |
JP6700373B2 (en) | Apparatus and method for learning object image packaging for artificial intelligence of video animation | |
CN109800682A (en) | Driver attributes' recognition methods and Related product | |
Xing et al. | Traffic sign recognition using guided image filtering | |
CN109191498A (en) | Object detection method and system based on dynamic memory and motion perception | |
CN109543617A (en) | The detection method of intelligent vehicle movement traffic information based on YOLO target detection technique | |
CN108961385A (en) | A kind of SLAM patterning process and device | |
CN116740539A (en) | Visual SLAM method and system based on lightweight target detection network | |
CN104168444A (en) | Target tracking method of tracking ball machine and tracking ball machine | |
CN114565842A (en) | Unmanned aerial vehicle real-time target detection method and system based on Nvidia Jetson embedded hardware | |
CN113673478B (en) | Port large-scale equipment detection and identification method based on deep learning panoramic stitching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WW01 | Invention patent application withdrawn after publication | Application publication date: 20200626 |