CN115527133A

CN115527133A - High-resolution image background optimization method based on target density information

Info

Publication number: CN115527133A
Application number: CN202211282570.5A
Authority: CN
Inventors: 陈琛; 肖华欣; 刘煜; 张茂军; 马屹钦
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2022-10-19
Filing date: 2022-10-19
Publication date: 2022-12-27

Abstract

The invention discloses a high-resolution image background optimization method based on target density information, comprising the following steps: acquiring a high-resolution image; predicting a sparse density map for each input image; computing pedestrian-dense regions by applying a clustering method to the predicted sparse density map; for each resulting cluster, counting the number N of people in the region and, if N is larger than a given threshold T, expanding the region according to an expansion coefficient to obtain the mask of a background-optimized sub-image, then performing background optimization on the original image according to the mask to generate a sub-image training set; and fusing the prediction results of the original image and the sub-images to generate the final prediction result, and outputting a plurality of sub-images containing pedestrians. The method can significantly improve detection accuracy in low-altitude overhead scenes.

Description

High-resolution image background optimization method based on target density information

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a high-resolution image background optimization method based on target density information.

Background

As the main actors in production and daily life, pedestrians are a central concern of many real-world applications, and pedestrian detection technology has advanced greatly. Unmanned aerial vehicles (UAVs) are fast and flexible, and combining them with pedestrian detection increases their value in fields such as intelligent security and military applications. The invention refers to scenes captured by a UAV from low altitude as low-altitude overhead scenes. For pedestrian detection, such scenes pose unique challenges, including complex content, illumination changes and viewpoint changes. False detections and missed detections remain persistent obstacles, and improving the stability and real-time performance of pedestrian detection algorithms in these scenes is still a difficult problem. Dedicated deep-learning research that improves pedestrian detection performance, reduces missed and false detections, and compresses model size is therefore urgently needed for pedestrian detection in low-altitude overhead images and for intelligent UAV applications.

The target size distribution in low-altitude overhead scenes differs greatly from that in general scenes, which poses a major challenge to conventional detectors based on convolutional neural networks. In aerial images, target scales are spread over a wide range and unevenly distributed, so a detector must extract multi-scale information well and detect reliably at different scales, which places higher demands on anchor design. In addition, large background areas and small targets interfere with feature extraction, the effective feature information obtained by the model is sparse, and detection capability in crowded regions drops, lowering overall recall. Early research focused on improving detector performance on small targets. For low-altitude images, a naive technical route that clearly improves detector performance is as follows: the input high-resolution image is split evenly into low-resolution sub-images, separate detectors are trained on the original images and on the sub-images, and the results of the two detectors are fused and output as the final prediction. This simple strategy improves detector performance in most cases, because the fraction of pixels occupied by a small pedestrian is significantly higher in a sub-image than in the original image. Although this strategy alleviates the scale problem and the small-target problem to some extent, it ignores the target density distribution contained in the image, spends a large amount of computation on sub-images with few or no pedestrians, and may crudely cut apart pedestrian instances that occupy many pixels. What is needed is a reasonable way to extract only the sub-images that contain pedestrians.

Compared with still images of ordinary scenes, aerial images have higher resolution, more diverse pedestrian appearance, a wider range of target sizes and a larger number of targets, a higher proportion of small targets, targets that tend to concentrate in particular areas, and more severe intra-class occlusion than in ordinary dense scenes. In low-altitude overhead scenes, the spatial distribution of pedestrians is important guiding information: pedestrians gather mainly in areas that accommodate large numbers of people moving about, such as the center of a square, the sidewalks on both sides of a road, or a court in a park. Thus, in many low-altitude overhead datasets, most pedestrians are concentrated in specific areas of the image because of the acquisition viewpoint and the geometric constraints of the scene. However, few effective image background optimization strategies have been proposed, mainly because of the following difficulties: first, the form of a pedestrian gathering scene is not fixed and changes with the shooting angle of the UAV; second, across the variety of low-altitude overhead images, methods based on handcrafted features or prior knowledge cover only a limited range of scene types, so the estimated gathering regions are unreliable; finally, deep-learning-based semantic segmentation can segment pedestrians from the background effectively and automatically, but introduces considerable computational overhead.

Low-altitude overhead scenes pose two main challenges: 1) pedestrians are unevenly distributed in the image, and the large redundant background areas reduce the detection accuracy and speed of the model; 2) the shooting position of the aircraft is not fixed, so the size, viewing angle and shape of the targets in the image change drastically. The common solution is to background-optimize a large input image into small sub-images according to some strategy and to detect pedestrians in the original image and in the sub-images separately.

Disclosure of Invention

In view of the above, the present invention provides a high-resolution image background optimization method based on target density information. The invention addresses the challenging problem of pedestrian detection in low-altitude overhead scenes. Low-altitude overhead images generally have high resolution, and pedestrians are unevenly distributed within them; detecting pedestrians directly on a high-resolution image wastes a large amount of computation on background areas that contain no pedestrians and seriously degrades detection performance and efficiency. The invention therefore provides an image background optimization strategy based on the density distribution of pedestrians. The invention also studies and provides a dual-task network that combines pedestrian detection and semantic segmentation. Further study of occluded scenes shows that the pedestrian's head is a reliable and stable cue; accordingly, a pedestrian detection method based on the visible parts of the human body is studied, and a constraint based on one-to-one matching between a pedestrian's head region and body region is provided to filter out false detections caused by occlusion. Building on existing frameworks, the invention further studies how to optimize a pedestrian detection model for low-power devices and provides a minimized pedestrian detector that greatly reduces the resources occupied by the model while preserving accuracy.

To solve the above problems, the invention provides a density-map-based pedestrian detection network (DANet), which splits a high-resolution low-altitude overhead image into a plurality of sub-images containing pedestrians. In DANet, a density-aware component (DAM) is first used to indicate the regions where targets are present and the density of objects within those regions. The density-aware component guides the background optimization process and removes background regions that contain no targets to be detected, which greatly reduces computational cost and improves detection efficiency. Inspired by the crowd counting task, a crowd counting network based on Bayesian distribution is used to predict a sparse density map, and clustering is then used to find regions of high crowd density. After the original image is partitioned, the original images and the background-optimized sub-images are used as two separate datasets to train two Faster R-CNN structures with feature pyramids, which serve as a global detector and a sub-detector respectively. At inference time, the DAM first splits the input image, the two detectors make predictions separately, the results of the sub-detector are mapped back to the original image, and the final detection result is output after NMS (non-maximum suppression) post-processing.
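The final fusion step described above (mapping sub-detector boxes back to the original image, then applying NMS) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the box format `(x1, y1, x2, y2, score)`, the function names `map_to_original`, `nms` and `fuse`, and the IoU threshold of 0.5 are all assumptions introduced here.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]  # assumed format: x1, y1, x2, y2, score


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def map_to_original(boxes: List[Box], topx: float, topy: float) -> List[Box]:
    """Shift sub-image boxes by the sub-image's top-left corner in the original image."""
    return [(x1 + topx, y1 + topy, x2 + topx, y2 + topy, s)
            for x1, y1, x2, y2, s in boxes]


def nms(boxes: List[Box], iou_thr: float = 0.5) -> List[Box]:
    """Greedy non-maximum suppression: keep the highest-scoring box, drop overlaps."""
    kept: List[Box] = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_thr for k in kept):
            kept.append(box)
    return kept


def fuse(global_boxes: List[Box],
         sub_results: List[Tuple[Tuple[float, float], List[Box]]],
         iou_thr: float = 0.5) -> List[Box]:
    """Merge global-detector boxes with mapped sub-detector boxes, then apply NMS."""
    merged = list(global_boxes)
    for (topx, topy), boxes in sub_results:
        merged.extend(map_to_original(boxes, topx, topy))
    return nms(merged, iou_thr)
```

Here each entry of `sub_results` pairs the top-left corner of a sub-image with the boxes the sub-detector predicted inside it; duplicates of the same pedestrian found by both detectors are collapsed by the NMS step.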

Specifically, the invention discloses a high-resolution image background optimization method based on target density information, which comprises the following steps:

acquiring a high-resolution image containing a pedestrian;

predicting a sparse density map corresponding to each input image;

calculating a pedestrian dense region by using a clustering method for the sparse density map generated by prediction;

for each cluster obtained from the pedestrian-dense regions, counting the number N of people in the region and, if N is larger than a given threshold T, expanding the region according to a region expansion method to obtain the mask of a background-optimized sub-image, and performing background optimization on the original image according to the mask to generate a sub-image training set;

and fusing the prediction results of the original image and the subgraph to generate a final prediction result, and outputting a plurality of subgraphs containing pedestrians.

Further, a density-aware component uses the sparse density map to indicate the regions where targets are present and the object density within them, and removes background regions that contain no targets to be detected.

Furthermore, the density-aware component predicts the sparse density map with a crowd counting network based on Bayesian distribution, and then finds regions of high crowd density by a clustering method in order to partition the image.

Further, after the original image is partitioned, the original images and the background-optimized sub-images are used as two separate datasets to train two Faster R-CNN structures with feature pyramids, which serve as a global detector and a sub-detector; at inference time, the density-aware component splits the input image, the two detectors make predictions separately, the results of the sub-detector are mapped back to the original image, and the final detection result is output after non-maximum suppression post-processing.

Further, a weighted sum of several loss functions is used to measure the difference between the vectorized binary map of the ground-truth point map and the vectorized density map predicted by the network, as follows:

$$\ell(z, \hat{z}) = \ell_C(z, \hat{z}) + \lambda_1 \ell_{OT}(z, \hat{z}) + \lambda_2 \ell_{TV}(z, \hat{z})$$

wherein $z$ is the vectorized binary map of the ground-truth point map, $\hat{z}$ is the predicted vectorized density map, $\lambda_1$ and $\lambda_2$ are weighting coefficients, $z$ and $\hat{z}$ are treated as unnormalized density functions, $\ell_C$ is the count loss, which measures the error between the crowd counts of the density map and the ground-truth map, $\ell_{OT}$ is an improved optimal transport loss function, and $\ell_{TV}$ is a total variation loss function.

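Further, the count loss is defined as follows:

$$\ell_C(z, \hat{z}) = \big|\, \|z\|_1 - \|\hat{z}\|_1 \big|$$

wherein $\|z\|_1$ and $\|\hat{z}\|_1$ are the ground-truth and predicted crowd counts, and the difference between them should be as small as possible;

$z$ and $\hat{z}$ are two unnormalized density functions; dividing each by its sum, $\|z\|_1$ and $\|\hat{z}\|_1$ respectively, turns them into probability density functions, and the difference between the two probability distributions is measured using an optimal transport loss function, as follows:

$$\ell_{OT}(z, \hat{z}) = \left\langle \alpha^*, \frac{z}{\|z\|_1} \right\rangle + \left\langle \beta^*, \frac{\hat{z}}{\|\hat{z}\|_1} \right\rangle$$

wherein $\alpha^*$ and $\beta^*$ are the solutions of the dual optimal transport problem, and the squared transport cost $c(z(i), \hat{z}(j)) = \|z(i) - \hat{z}(j)\|_2^2$ is used, wherein $z(i)$ and $\hat{z}(j)$ are the two-dimensional coordinates of positions $i$ and $j$, respectively; a very small positive number is added to ensure that the denominator is not zero, and the gradient of $\ell_{OT}$ with respect to $\hat{z}$ is then:

$$\frac{\partial \ell_{OT}(z, \hat{z})}{\partial \hat{z}} = \frac{\beta^*}{\|\hat{z}\|_1} - \frac{\langle \beta^*, \hat{z} \rangle}{\|\hat{z}\|_1^2}$$

the total variation loss function is defined as follows:

$$\ell_{TV}(z, \hat{z}) = \frac{1}{2} \left\| \frac{z}{\|z\|_1} - \frac{\hat{z}}{\|\hat{z}\|_1} \right\|_1$$

the gradient of the total variation loss function is:

$$\frac{\partial \ell_{TV}(z, \hat{z})}{\partial \hat{z}} = -\frac{1}{2} \left( \frac{\operatorname{sign}(v)}{\|\hat{z}\|_1} - \frac{\langle \operatorname{sign}(v), \hat{z} \rangle}{\|\hat{z}\|_1^2} \right)$$

wherein $v = z / \|z\|_1 - \hat{z} / \|\hat{z}\|_1$ and $\operatorname{sign}(\cdot)$ is the element-wise sign function of a vector.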
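As a rough numerical illustration of the count loss and the total variation loss described above, the following sketch evaluates both on small vectors. It assumes that the TV loss is half the L1 distance between the two normalized maps and that the overall loss is the count loss plus λ₁ times the OT loss plus λ₂ times the TV loss; the function names, the `eps` guard and the externally supplied `ot_loss_value` are assumptions introduced for this example, not details confirmed by the source.

```python
from typing import Sequence


def l1_norm(v: Sequence[float]) -> float:
    """Sum of absolute values; for a density map this is the crowd count."""
    return sum(abs(x) for x in v)


def count_loss(z: Sequence[float], z_hat: Sequence[float]) -> float:
    """Absolute difference between ground-truth and predicted counts."""
    return abs(l1_norm(z) - l1_norm(z_hat))


def tv_loss(z: Sequence[float], z_hat: Sequence[float], eps: float = 1e-8) -> float:
    """Assumed form: half the L1 distance between the two normalized maps.

    eps is a small positive number that keeps the denominators non-zero.
    """
    nz, nzh = l1_norm(z) + eps, l1_norm(z_hat) + eps
    return 0.5 * sum(abs(a / nz - b / nzh) for a, b in zip(z, z_hat))


def total_loss(z: Sequence[float], z_hat: Sequence[float],
               ot_loss_value: float, lam1: float, lam2: float) -> float:
    """Assumed weighted sum: count loss + lam1 * OT loss + lam2 * TV loss."""
    return count_loss(z, z_hat) + lam1 * ot_loss_value + lam2 * tv_loss(z, z_hat)
```

For example, with `z = [0, 1, 0, 1]` and `z_hat = [0.5, 0.5, 0.5, 0.5]` the predicted count matches the true count, so the count loss vanishes, while the TV loss still penalizes the fact that the predicted mass is spread evenly instead of sitting on the two annotated points.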

Further, the clustering method is a mean shift method, and when the generated sparse density map is processed, a pedestrian dense region is estimated by combining the mean shift method and a region expansion strategy.

Further, a clustering method based on mean shift is adopted, wherein the mean shift refers to the shifted mean vector; the probability density distribution of the sample points is estimated from the sample points themselves, the points in the density map are clustered into different clusters, the target instances in each cluster are counted, and clusters whose count is larger than a threshold are retained while those whose count is smaller are discarded;

suppose that in the d-dimensional space $\mathbb{R}^d$ there are n discrete samples $x_i$ ($i = 1, 2, \ldots, n$); the mean vector in the mean shift is defined as follows:

$$M_h(x) = \frac{1}{k} \sum_{x_i \in S_h} (x_i - x)$$

wherein $S_h$ is a high-dimensional spherical region of radius h, which for image data is a circular region of radius h, i.e. the set of points y satisfying the following relationship:

$$S_h(x) = \{\, y : (y - x)^T (y - x) \le h^2 \,\}$$

the mean vector $M_h(x)$ is the average offset of the k samples falling into $S_h$ and always points in the gradient direction of the probability density; considering that points at different distances should be weighted differently, a kernel function K(x) is introduced, and the probability density function f(x) is then expressed as:

$$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)$$

where K is a kernel function defined as follows:

$$K(x) = c_{k,d}\, k(\|x\|^2)$$

the normalization coefficient $c_{k,d}$ ensures that the probability density integrates to 1, and the extreme points of f(x) are obtained by finding the zeros of its partial derivatives.

Further, the points in the density map are clustered by the mean shift method to obtain different clusters $S_n$; the number N of people in each cluster region is counted, and if N is larger than a given threshold T, background optimization is performed on that cluster; filtering the clusters $S_n$ with the given T yields candidate clusters $S'_n$ that accurately indicate dense crowd areas;

performing background optimization according to the candidate clusters $S'_n$ comprises the following: to include all annotation boxes as far as possible, during training the positions of all annotation boxes whose center points fall within the cluster are collected, and the cluster boundary is expanded according to the sizes of these annotation boxes, so that objects occupying a large share of pixels are not cut apart by the background optimization operation; during inference, since the ground-truth boxes are unknown, the region expansion method is used for prediction.

Further, the region expansion strategy comprises the following steps:

for an input image I of size (W, H) and a cluster set $S = \{ s_i \mid i = 1, \ldots, n \}$, where each cluster $s_i$ is a set of points (x, y):

S11: initialize S;

S12: while clusters remain to be processed, execute steps S13 to S18 for the current cluster $s_i$;

S13: find the maximum $(x_{max}, y_{max})$ and minimum $(x_{min}, y_{min})$ of the horizontal and vertical coordinates of the points in cluster $s_i$;

S14: if $x_{min} - \lambda_x (x_{max} - x_{min}) > 0$, then $topx_i = x_{min} - \lambda_x (x_{max} - x_{min})$; otherwise $topx_i = 0$;

S15: if $y_{min} - \lambda_y (y_{max} - y_{min}) > 0$, then $topy_i = y_{min} - \lambda_y (y_{max} - y_{min})$; otherwise $topy_i = 0$;

S16: if $x_{max} + \lambda_x (x_{max} - x_{min}) < W$, then $w_i = (1 + 2\lambda_x)(x_{max} - x_{min})$; otherwise $w_i = W - x_{min}$;

S17: if $y_{max} + \lambda_y (y_{max} - y_{min}) < H$, then $h_i = (1 + 2\lambda_y)(y_{max} - y_{min})$; otherwise $h_i = H - y_{min}$;

S18: set i = i + 1 and return to step S12;

S19: output the set of background optimization regions $C = \{ c_i \mid c_i = (topx_i, topy_i, w_i, h_i) \}$,

wherein $\lambda_x$ and $\lambda_y$ are expansion coefficients that depend on the dataset.
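Steps S11 to S19 above translate almost line for line into code. The sketch below follows the stated rules literally, including the clamped widths `W - x_min` and `H - y_min`; the function name `expand_regions`, the point and region tuple formats, and the example values of the expansion coefficients are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]                  # (x, y) in image coordinates
Region = Tuple[float, float, float, float]   # (topx, topy, w, h)


def expand_regions(clusters: Sequence[Sequence[Point]], W: float, H: float,
                   lam_x: float, lam_y: float) -> List[Region]:
    """Expand the bounding box of each cluster by lam_x / lam_y, clamped to the image."""
    regions: List[Region] = []
    for cluster in clusters:
        xs = [p[0] for p in cluster]
        ys = [p[1] for p in cluster]
        x_min, x_max = min(xs), max(xs)      # S13: extreme coordinates of the cluster
        y_min, y_max = min(ys), max(ys)
        dx, dy = x_max - x_min, y_max - y_min

        topx = max(x_min - lam_x * dx, 0)    # S14: clamp the left edge at 0
        topy = max(y_min - lam_y * dy, 0)    # S15: clamp the top edge at 0

        # S16 / S17: expanded size, or the clamped size stated in the steps.
        w = (1 + 2 * lam_x) * dx if x_max + lam_x * dx < W else W - x_min
        h = (1 + 2 * lam_y) * dy if y_max + lam_y * dy < H else H - y_min

        regions.append((topx, topy, w, h))   # S19: collect the output region
    return regions
```

For a 1024 x 768 image and coefficients of 0.1, a cluster spanning x from 100 to 200 and y from 100 to 300 is expanded to a region whose top-left corner is (90, 80) with width 120 and height 240, while a cluster touching the right or top border is clamped by the corresponding rule.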

The invention has the following beneficial effects:

For low-altitude overhead scenes captured by unmanned aerial vehicles, the density-aware image background optimization strategy provided by the invention introduces a sparse heat map of the density distribution of the pedestrians to be detected in the input image; the pixel values of the heat map reflect how densely pedestrians are distributed in the image. Crowded sub-images are extracted by a clustering strategy for background optimization, and a region expansion strategy is provided so that pedestrians occupying large pixel areas are not cut apart during background optimization. Experiments on a public dataset show that the method significantly improves pedestrian detection performance in low-altitude overhead scenes.

Drawings

FIG. 1 is a flow chart of a high-resolution image background optimization method based on target density information according to the present invention;

FIG. 2 is an input high-resolution low-altitude overhead image;

FIG. 3 is a predicted population distribution graph according to the present invention;

FIG. 4 is a population distribution plot predicted based on the MCNN method;

FIG. 5 is a graph showing the results of the detection according to the present invention;

FIG. 6 is a graph showing the detection results based on Faster R-CNN;

FIG. 7 is a visualization of the detection results of the present invention on the VisDrone dataset when ResNeXt-101 is used as the backbone network;

FIG. 8 is an original image;

FIG. 9 is a schematic diagram of a subgraph generated after an equal background optimization method is adopted;

FIG. 10 is a schematic diagram of a subgraph generated by using a random background optimization method;

FIG. 11 is a schematic diagram of a subgraph generated after the density background optimization method of the present invention is adopted;

FIG. 12 is a set of test samples in the VisDrone dataset;

FIG. 13 is a density map generated by the model of the present invention based on FIG. 12;

FIG. 14 is another set of test samples in the VisDrone dataset;

FIG. 15 is a density map generated by the model of the present invention based on FIG. 14.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

Fig. 1 shows the density-aware pedestrian detection network framework provided by the present invention, which mainly comprises a global detection network, a density-aware background optimization module and a result fusion module. Specifically, a density-aware network based on a CNN (convolutional neural network) is first trained to predict a sparse density map for each input image; a clustering method is then applied to the predicted density map to find regions with a high total pixel value, namely pedestrian-dense regions; each pedestrian-dense region is slightly expanded according to an expansion coefficient to obtain the mask of a background-optimized sub-image, and background optimization is performed on the original image according to the mask to generate a sub-image training set; finally, the prediction results of the original image and the sub-images are fused to generate the final prediction result.

As shown in Figs. 2 to 4, Fig. 2 is an input high-resolution low-altitude overhead image, Fig. 3 is the crowd distribution predicted by the method of the present invention, and Fig. 4 is the crowd distribution predicted by the MCNN method. It can be seen intuitively that the pedestrian distribution predicted by the method of the present invention is clearer and more reliable.

Density-aware component

Density map estimation was first applied mainly to crowd counting: given an image containing a large number of people or heads, the number of people in the image is computed by a vision-based method. Such images generally contain many overlapping pedestrian instances, and instance-by-instance algorithms based on object detection are almost ineffective in these scenes, so counting the people in the image is difficult. The relevant datasets therefore place one annotation point on the head or forehead of each person, and the total number of pedestrians is obtained by counting the points. Because a density map reflects head positions and provides a spatial density distribution, the prevailing approach to crowd counting is to generate a density map from the input picture and integrate it to obtain the number of people. That is, given training images with point annotations, training the density map estimation network amounts to optimizing the network parameters so as to minimize the difference between the ground-truth point map and the predicted density map. However, crowd counting annotations form a discrete binary mask; in such a sparse binary matrix, the reconstruction loss is severely imbalanced between positive samples (points marked as heads) and negative samples (background pixels). To make training easier, the ground-truth labels need to be spread more evenly over the image. A common approach is to apply a Gaussian convolution kernel at each annotated point in the ground-truth map, producing a smoother density map. Such methods generally train the network with a pixel-wise mean absolute error (MAE); the smoothed density map is given by formula 1:

$$D(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x) \qquad (1)$$

wherein $x_i$ is the position of the i-th target instance, $\delta$ is the Dirac delta function, $N$ is the number of annotated points, and $G_{\sigma_i}$ is a Gaussian convolution kernel whose size $\sigma_i$ is determined by the mean distance to the K nearest points. Applying the corresponding Gaussian convolution to all annotation points in the ground-truth map effectively reduces its sparsity. The effectiveness of such methods depends heavily on the quality of the resulting "pseudo ground truth". Because the size and shape of the bodies and heads in a dense crowd vary dramatically, choosing the size of the Gaussian blur kernel is very difficult.
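The Gaussian smoothing of point annotations described above can be sketched in a few lines. The example assumes that the kernel width of each point is `beta` times the mean distance to its K nearest annotated neighbours, and that each point's kernel is normalized to sum to 1 so the map still integrates to the crowd count; the function names and the values `k=3`, `beta=0.3` and `default_sigma=2.0` are illustrative assumptions, not parameters taken from the source.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) annotation point


def knn_mean_distance(points: Sequence[Point], idx: int, k: int) -> float:
    """Mean distance from points[idx] to its k nearest other points (0.0 if none)."""
    px, py = points[idx]
    dists = sorted(math.hypot(px - qx, py - qy)
                   for j, (qx, qy) in enumerate(points) if j != idx)[:k]
    return sum(dists) / len(dists) if dists else 0.0


def gaussian_density_map(points: Sequence[Point], width: int, height: int,
                         k: int = 3, beta: float = 0.3,
                         default_sigma: float = 2.0) -> List[List[float]]:
    """Place one normalized Gaussian per annotation point on a height x width grid."""
    density = [[0.0] * width for _ in range(height)]
    for i, (px, py) in enumerate(points):
        mean_d = knn_mean_distance(points, i, k)
        sigma = beta * mean_d if mean_d > 0 else default_sigma
        weights = [[math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                    for x in range(width)] for y in range(height)]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                # Each point contributes unit mass, so the map sums to the count.
                density[y][x] += weights[y][x] / total
    return density
```

The sketch also makes the difficulty noted above concrete: the result depends directly on `k` and `beta`, which have to be chosen by hand for each dataset.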

The method regards the given ground-truth point map and the predicted density map as two density distributions and obtains a reliable density map prediction by reducing the difference between them. A loss function based on Monge-Kantorovich optimal transport (OT) is introduced to change the network training objective; for optimal transport, see the non-patent literature "Optimal Transport: Old and New [J]. 2009". The optimal transport problem is essentially that of finding the minimum cost of transforming one probability distribution into another. Assume that two sets of points in d-dimensional vector space are given,

$$X = \{ x_i \mid x_i \in \mathbb{R}^d \}_{i=1}^{n} \quad \text{and} \quad Y = \{ y_j \mid y_j \in \mathbb{R}^d \}_{j=1}^{n},$$

with corresponding probability measures $\mu$ and $\nu$, where $\mu, \nu \in \mathbb{R}_+^n$ and $\mathbf{1}_n^T \mu = \mathbf{1}_n^T \nu = 1$. Let $c : X \times Y \to \mathbb{R}_+$ denote the cost of moving from a point in X to a point in Y; then $C_{ij} = c(x_i, y_j)$ is the $n \times n$ cost matrix between the two point sets. Let $\Gamma$ be the set of all plans that move the probability mass from X to Y:

$$\Gamma = \{\, \gamma \in \mathbb{R}_+^{n \times n} : \gamma \mathbf{1} = \mu,\ \gamma^T \mathbf{1} = \nu \,\}$$

The Monge-Kantorovich OT cost between $\mu$ and $\nu$ is defined as:

$$W(\mu, \nu) = \min_{\gamma \in \Gamma} \langle C, \gamma \rangle \qquad (2)$$

If the probability distributions $\mu$ and $\nu$ are viewed as unit amounts of mass piled on X and Y respectively, the OT cost is the minimum cost of transporting one pile onto the other. The OT cost quantifies the difference between the two probability distributions while also taking into account the distance between the locations of the mass. The OT cost can also be written in the dual form of equation 3:

$$W(\mu, \nu) = \max_{\alpha, \beta \in \mathbb{R}^n} \langle \alpha, \mu \rangle + \langle \beta, \nu \rangle \quad \text{s.t.} \quad \alpha_i + \beta_j \le c(x_i, y_j),\ \forall i, j \qquad (3)$$

The present invention treats density estimation as a problem of matching two distributions. Compared with methods that process the ground-truth map with a Gaussian blur kernel, this approach does not require a carefully chosen Gaussian kernel for preprocessing the ground truth. Let $z$ denote the vectorized binary map of the ground-truth point map and $\hat{z}$ the vectorized density map predicted by the network. Treating $z$ and $\hat{z}$ as unnormalized density functions, a weighted sum of several loss terms is used to measure the difference between them, as shown in equation 4:

$$\ell(z, \hat{z}) = \ell_C(z, \hat{z}) + \lambda_1 \ell_{OT}(z, \hat{z}) + \lambda_2 \ell_{TV}(z, \hat{z}) \qquad (4)$$

The first term is the count loss, which measures the error between the crowd counts of the density map and the ground-truth map. The counts $\|z\|_1$ and $\|\hat{z}\|_1$ should differ as little as possible, so the count loss is defined as the absolute error of equation 5:

$$\ell_C(z, \hat{z}) = \big|\, \|z\|_1 - \|\hat{z}\|_1 \big| \qquad (5)$$

The second term in equation 4 is an improved OT loss function. Note that $z$ and $\hat{z}$ are two unnormalized density functions; dividing each by its sum, $\|z\|_1$ and $\|\hat{z}\|_1$ respectively, turns them into probability density functions (PDFs). Although the KL divergence (Kullback-Leibler divergence) and the JS divergence (Jensen-Shannon divergence) can also measure the difference between two probability density functions, they do not provide a useful gradient when the source distribution does not overlap the target distribution, and therefore cannot be used to train the neural network. The present invention therefore measures the difference between the two probability distributions with the OT loss function of formula 6:

$$\ell_{OT}(z, \hat{z}) = \left\langle \alpha^*, \frac{z}{\|z\|_1} \right\rangle + \left\langle \beta^*, \frac{\hat{z}}{\|\hat{z}\|_1} \right\rangle \qquad (6)$$

wherein $\alpha^*$ and $\beta^*$ are the solutions of the dual problem in equation 3, and the squared transport cost $c(z(i), \hat{z}(j)) = \|z(i) - \hat{z}(j)\|_2^2$ is used, wherein $z(i)$ and $\hat{z}(j)$ are the two-dimensional coordinates of positions $i$ and $j$, respectively. A very small positive number is added to ensure that the denominator is not zero, and the gradient of equation 6 with respect to $\hat{z}$ is then:

$$\frac{\partial \ell_{OT}(z, \hat{z})}{\partial \hat{z}} = \frac{\beta^*}{\|\hat{z}\|_1} - \frac{\langle \beta^*, \hat{z} \rangle}{\|\hat{z}\|_1^2} \qquad (7)$$

During iterative training, the OT loss is solved approximately with the Sinkhorn algorithm. In practice, however, the objective initially decreases rapidly and then converges only slowly toward the optimum. Because a maximum number of iterations is set, the number of iterations needed for an exact solution is often larger than the number actually performed, so only an approximate solution is returned. As a result, when the OT loss is solved with the Sinkhorn algorithm, the network can only predict a density map that approximates the ground-truth map; specifically, the OT loss performs well in densely populated areas, but its performance drops markedly in areas of lower crowd density. The present invention therefore introduces an additional total variation (TV) loss function to address this problem, defined as shown in equation 8:

$$\ell_{TV}(z, \hat{z}) = \frac{1}{2} \left\| \frac{z}{\|z\|_1} - \frac{\hat{z}}{\|\hat{z}\|_1} \right\|_1 \qquad (8)$$

The TV loss not only remedies the poor performance in low-density areas but also improves the stability of network training. When the Sinkhorn algorithm is used to optimize the OT loss, the training of the network resembles that of a generative adversarial network (GAN): it is a min-max saddle-point optimization. Adding an extra reconstruction loss function can markedly improve the stability of GAN training. Here the TV loss plays the role of the reconstruction loss, and its gradient is:

$$\frac{\partial \ell_{TV}(z, \hat{z})}{\partial \hat{z}} = -\frac{1}{2} \left( \frac{\operatorname{sign}(v)}{\|\hat{z}\|_1} - \frac{\langle \operatorname{sign}(v), \hat{z} \rangle}{\|\hat{z}\|_1^2} \right) \qquad (9)$$

wherein $v = z / \|z\|_1 - \hat{z} / \|\hat{z}\|_1$ and $\operatorname{sign}(\cdot)$ is the element-wise sign function of a vector.
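The approximate Sinkhorn solution of the OT loss mentioned above can be illustrated on a toy problem. The sketch below is a generic entropy-regularized Sinkhorn iteration, not the patent's training code: the regularization strength `reg`, the stopping tolerance `tol`, the iteration cap `max_iter` and the function name `sinkhorn` are assumptions introduced here. The iteration cap is what makes the returned transport plan only approximate when convergence is slow.

```python
import math
from typing import List, Sequence, Tuple


def sinkhorn(mu: Sequence[float], nu: Sequence[float],
             cost: Sequence[Sequence[float]], reg: float = 0.05,
             max_iter: int = 200, tol: float = 1e-9
             ) -> Tuple[float, List[List[float]]]:
    """Approximate the OT cost between histograms mu and nu.

    Returns (transport cost, transport plan) from entropy-regularized
    Sinkhorn scaling with Gibbs kernel K = exp(-cost / reg).
    """
    n, m = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(max_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u_new = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u_new[i] for i in range(n)) for j in range(m)]
        delta = max(abs(a - b) for a, b in zip(u, u_new))
        u = u_new
        if delta < tol:
            break  # scaling vectors have stopped changing
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    ot_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return ot_cost, plan
```

With two pixel locations at unit distance and the squared cost `[[0, 1], [1, 0]]`, moving all mass from the first location to the second costs approximately 1, while two identical distributions have a cost close to 0.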

Background optimized mask based on Meanshift density information

The core idea of DANet is to use the contextual information provided by the density map to background-optimize large input images into small sub-images in a principled way. In the density map, regions of higher crowd density contain more predicted points and larger pixel values than regions of lower density. A naive idea is to use a clustering method to group the points of the density map into clusters, count the target instances in each cluster, keep the clusters whose count exceeds a threshold and discard the rest. The clusters that remain after filtering are the dense regions being sought.

The invention adopts a clustering method based on mean shift (Meanshift), where mean shift refers to the shifted mean vector. The algorithm is a non-parametric clustering method based on probability density: it estimates the probability density distribution of the sample points from the samples themselves, without knowing the probability density function of the data in advance. Assume that in the d-dimensional space $\mathbb{R}^d$ there are n discrete samples $x_i$ ($i = 1, 2, \ldots, n$). The mean vector in the mean shift is defined as equation 10:

$$M_h(x) = \frac{1}{k} \sum_{x_i \in S_h} (x_i - x) \qquad (10)$$

wherein $S_h$ is a high-dimensional spherical region of radius h, which for image data is a circular region of radius h, i.e. the set of points y satisfying the relation of equation 11:

$$S_h(x) = \{\, y : (y - x)^T (y - x) \le h^2 \,\} \qquad (11)$$

The mean vector $M_h(x)$ is the average offset of the k samples falling into $S_h$ and always points in the gradient direction of the probability density. Considering that points at different distances should be weighted differently, a kernel function K(x) is introduced, and the probability density function f(x) is then given by equation 12:

$$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right) \qquad (12)$$

where K is a kernel function, defined as follows:

$$K(x) = c_{k,d}\, k(\|x\|^2) \qquad (13)$$

The normalization coefficient $c_{k,d}$ ensures that the probability density integrates to 1. The extreme points of f(x) are obtained by finding the zeros of its partial derivatives. The mean shift algorithm is essentially an adaptive, incremental, iterative search for the peaks of the probability density of the data. The detailed flow of the mean shift algorithm is given in Algorithm 1.

Algorithm 1 Mean shift algorithm flow

Input: number of iterations t, search space $S_h$, initial point x, threshold $\varepsilon$

Output: sample set S, clustering center C

1: initialize t, $S_h$

2: while $\|m_h(x^t)\| > \varepsilon$ do

3: calculate the probability density gradient $m_h(x^t)$

4: update the search space $S_h$, $x^{t+1} = x^t + m_h(x^t)$

5: end while

6: return the sample set S and the clustering center C
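The flow of Algorithm 1 can be illustrated with a minimal pure-Python sketch. It is not the implementation used in the experiments: it assumes a flat kernel (all samples inside the radius h are weighted equally), and the function names, the tolerance `eps`, and the rule of merging converged points closer than `h / 2` into one cluster center are illustrative assumptions.

```python
import math


def mean_shift_point(x, samples, h, eps=1e-3, max_iter=100):
    """Shift a start point x toward a local density peak (flat kernel, assumed)."""
    x = list(x)
    for _ in range(max_iter):
        # S_h(x): samples that lie within radius h of the current position
        window = [s for s in samples if math.dist(s, x) <= h]
        if not window:
            break
        # m_h(x): mean offset of the window samples from x
        shift = [sum(s[d] for s in window) / len(window) - x[d]
                 for d in range(len(x))]
        x = [x[d] + shift[d] for d in range(len(x))]
        # stop once the shift falls below the threshold
        if math.hypot(*shift) < eps:
            break
    return tuple(x)


def mean_shift_cluster(samples, h, merge_tol=None):
    """Assign each sample to the mode it converges to; returns (centers, labels)."""
    merge_tol = h / 2 if merge_tol is None else merge_tol  # assumed merge rule
    centers, labels = [], []
    for s in samples:
        mode = mean_shift_point(s, samples, h)
        for idx, c in enumerate(centers):
            if math.dist(mode, c) <= merge_tol:
                labels.append(idx)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return centers, labels


# Two well-separated groups of density-map points (toy data)
points = [(10, 10), (11, 10), (10, 11), (11, 11), (50, 50), (51, 50), (50, 51)]
centers, labels = mean_shift_cluster(points, h=5.0)
```

On this toy input the first four points converge to one center and the last three to another, so the number of clusters does not have to be fixed in advance.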

The points in the density map are clustered with the mean shift method to obtain different clusters $S_n$. The number N of people in each cluster region is counted, and if N is greater than a given threshold T, background optimization is performed on that cluster. In general, the crowd density on the background-optimized sub-images should be high, so that the sub-detector focuses on feature extraction in dense regions and the detection effect there improves. The number of subgraphs generated by background optimization depends on T: if T is too small, the generated subgraphs contain too many sparse crowds and too much redundant background; if T is too large, too many clusters are filtered out and too few subgraphs are generated, leaving the sub-detector with too few training samples. In the present experiments, T = 3.7. After the clusters $S_n$ are filtered with the given T, candidate clusters $S'_n$ that accurately indicate dense crowd areas are obtained. There are various technical routes for performing background optimization based on the candidate clusters $S'_n$. An intuitive method is to compute the minimum bounding rectangle of each candidate cluster and to background-optimize a subgraph from the original image according to that rectangle. This method can maximally contain all the labeling frames, but the generated image easily contains excessive redundant background. To contain all labeling frames as far as possible, during training the positions of all labeling frames whose center points fall inside the cluster region are collected, and the cluster boundary is expanded according to the sizes of these labeling frames, which prevents objects with a large pixel ratio from being split by the background optimization operation. During inference, since the truth-value boxes are unknown, a region expansion method is used for prediction. The detailed procedure is given in Algorithm 2, where $\lambda_x$ and $\lambda_y$ are data-set-dependent expansion coefficients, set to 0.015 and 0.009, respectively, on the VisDrone data set.
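The threshold filtering of clusters described in this passage can be sketched as follows. The sketch assumes that the count N of a cluster is the sum of the density-map values at the points assigned to it (an assumption suggested by the non-integer threshold T = 3.7, not stated in the text); the function name and the toy density map are hypothetical.

```python
def filter_clusters(points, labels, density, threshold=3.7):
    """Keep clusters whose estimated count N exceeds the threshold T.

    points  : list of (x, y) pixel coordinates in the density map
    labels  : cluster index of each point
    density : 2-D list indexed as density[y][x]
    """
    counts = {}
    for (x, y), lab in zip(points, labels):
        # Assumed definition of N: sum of density values inside the cluster
        counts[lab] = counts.get(lab, 0.0) + density[y][x]
    return {lab: n for lab, n in counts.items() if n > threshold}


# Toy 4x4 density map with one dense cluster and one sparse cluster
density = [[0.0] * 4 for _ in range(4)]
density[0][0], density[0][1], density[1][0] = 1.5, 1.2, 1.3
density[3][3] = 0.9
points = [(0, 0), (1, 0), (0, 1), (3, 3)]
labels = [0, 0, 0, 1]

kept = filter_clusters(points, labels, density)
```

Here cluster 0 has an estimated count of about 4.0 and is kept as a candidate cluster, while cluster 1 (about 0.9) is discarded.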

Algorithm 2 expansion region algorithm flow
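A minimal sketch of region expansion at inference time is given here for illustration. It is a simplified variant rather than a transcription of Algorithm 2: each side of the bounding rectangle of a cluster is pushed outward by $\lambda_x$ or $\lambda_y$ times the cluster extent and then clamped to the image, whereas the steps listed in claim 10 compute the width and height slightly differently. The function name and the (left, top, width, height) return format are assumptions.

```python
def expand_region(cluster, img_w, img_h, lam_x=0.015, lam_y=0.009):
    """Expand the bounding rectangle of a cluster and clamp it to the image.

    cluster : list of (x, y) points belonging to one candidate cluster
    returns : (left, top, width, height) of the sub-image to cut out
    """
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)

    # Margins proportional to the extent of the cluster
    dx = lam_x * (x_max - x_min)
    dy = lam_y * (y_max - y_min)

    # Clamp every side to the image bounds (simplification, see lead-in)
    left = max(0.0, x_min - dx)
    top = max(0.0, y_min - dy)
    right = min(float(img_w), x_max + dx)
    bottom = min(float(img_h), y_max + dy)
    return left, top, right - left, bottom - top


# A cluster well inside a 2000x1500 image, and one touching the borders
inner = expand_region([(100, 200), (300, 400)], 2000, 1500)
border = expand_region([(0, 0), (2000, 1500)], 2000, 1500)
```

With the VisDrone coefficients the margins are small (a 200-pixel-wide cluster grows by about 3 pixels per side), so the expansion mainly protects objects that straddle the cluster boundary.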

Experiments and analyses

Because few public aerial overhead data sets currently target the pedestrian detection task alone, the invention mainly carries out comparison experiments on the VisDrone data set, which contains pedestrian categories. Since VisDrone is mainly a target detection data set covering ten object categories including pedestrians, average precision (AP), a common index in target detection tasks, is used as the basis for judging each method. First, DANet is compared with other advanced target detection methods on the VisDrone data set to demonstrate its superiority; then extensive ablation experiments are carried out on the pedestrian category of the VisDrone data set to assess the effectiveness of each proposed module.

When training the density-aware network, the data set is simply preprocessed: since it does not provide point labels like those in crowd counting tasks, the center of each labeling box is used as the point label. The DAM uses VGG-19 as the backbone network for feature extraction. To better fit the task requirements, the last pooling layer of the VGG-19 network and the fully connected layers following it are removed. The output of the backbone network is upsampled to 1/8 of the input image size by bilinear interpolation. After the backbone network, a 1x1 convolutional layer and two 3x3 convolutional layers with 256 and 128 channels, respectively, are added. Because the images in the VisDrone data set are large, they are uniformly scaled to 512x512 before entering the network. The density-aware network uses Adam as the optimizer, with an initial learning rate of $10^{-5}$, a weight decay of 0.0001, and a batch size of 8, for a total of 70k training iterations. The only image enhancement strategy is random background optimization, with a background-optimized region of 256x256.
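The point-label preprocessing can be illustrated with a short sketch. It assumes that each labeling box is given as (left, top, width, height) in pixels and that the image is resized to 512x512 without preserving the aspect ratio; the function names and the 2000x1500 source size are hypothetical.

```python
def boxes_to_points(boxes):
    """Replace each (left, top, width, height) box by its center point."""
    return [(left + width / 2.0, top + height / 2.0)
            for left, top, width, height in boxes]


def rescale_points(points, src_size, dst_size=(512, 512)):
    """Map points from the source image size to the network input size."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return [(x * sx, y * sy) for x, y in points]


boxes = [(100, 50, 20, 40), (900, 700, 10, 30)]
points = boxes_to_points(boxes)
scaled = rescale_points(points, src_size=(2000, 1500))
```

The resulting points play the role of the head annotations used in crowd counting data sets, so a counting-style density network can be trained without extra manual labeling.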

Both the global detector and the sub-detector use the Faster RCNN framework with a feature pyramid network. For the global detector, the input image is scaled to 1000 × 600, stochastic gradient descent is used as the optimizer, the momentum is set to 0.9, and the weight decay is set to 0.005. The initial learning rate of the model is $10^{-4}$; it is reduced to $10^{-5}$ after 90k training iterations and to $10^{-6}$ after 130k training iterations, for 150k training iterations in total. For the sub-detector, the input image size is 256 × 256.
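The step schedule of the global detector can be written as a small function. This is only a sketch of the schedule stated in the text; the function name is hypothetical, and treating iteration 90k and 130k themselves as the first iterations at the lower rate is an assumption.

```python
def global_detector_lr(iteration):
    """Learning rate of the global detector at a given training iteration."""
    if iteration < 90_000:
        return 1e-4   # initial learning rate
    if iteration < 130_000:
        return 1e-5   # after 90k iterations
    return 1e-6       # after 130k iterations, until training ends at 150k


schedule = [global_detector_lr(i) for i in (0, 89_999, 90_000, 130_000, 149_999)]
```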

Data enhancement methods include horizontal flipping, random scaling, and color jittering. In addition, random background optimization is used as a data enhancement method when training the global detector, and Mosaic data enhancement is used when training the sub-detector.

The evaluation indexes of the VisDrone data set are consistent with those of the MS COCO data set: a prediction is judged against the truth value by their IoU under different thresholds, and six AP evaluation indexes are adopted: AP, AP50, AP75, APs, APm, and APl. The definitions of the various APs are shown in Table 1; each index evaluates the performance of the detector from a different perspective.
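As an illustration of how the IoU threshold enters these indexes, the following sketch computes the IoU of two boxes and checks it against the thresholds 0.5 and 0.75 that the MS COCO convention uses for AP50 and AP75. The (x1, y1, x2, y2) box format and the function name are assumptions, and the sketch does not compute AP itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


truth = (0, 0, 10, 10)
prediction = (2, 0, 12, 10)          # shifted 2 pixels to the right
overlap = iou(truth, prediction)     # 80 / 120

matches_at_50 = overlap >= 0.50      # counted as a hit for AP50
matches_at_75 = overlap >= 0.75      # not a hit for AP75
```

The same prediction can therefore count as correct under AP50 and as a miss under AP75, which is why AP75 is read as a measure of localization quality.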

Table 2 shows the results of the proposed DANet on the VisDrone data set. The compared methods are ClusDet, DMNet, and AMRNet. Experiments are carried out with three backbone networks of successively stronger performance: ResNet50, ResNet101, and ResNeXt101. As Table 2 shows, the proposed DANet consistently surpasses the other high-performance methods by about 1-5 percentage points under different backbone networks. Further, on AP75, DANet is approximately 7 percentage points higher than ClusDet, which likewise considers the object density distribution, and approximately 3 percentage points higher than DMNet, showing that the proposed method is more robust under a higher IoU threshold. Meanwhile, the proposed method surpasses AMRNet by about 2 percentage points on both APs and APm, showing that the density-perception-based method can markedly improve the performance of the network in small target detection.

TABLE 1 different AP definitions

TABLE 2 comparison of VisDrone data set with other advanced methods

Method	Backbone network	AP	AP50	AP75	APs	APm	APl
ClusDet[73]	ResNet50	26.7	50.6	24.7	17.6	38.9	51.4
ClusDet[73]	ResNet101	26.7	50.4	25.2	17.2	39.3	54.9
ClusDet[73]	ResNeXt101	28.4	53.2	26.4	19.1	40.8	54.4
DMNet[74]	ResNet50	28.2	47.6	28.9	19.9	39.6	55.8
DMNet[74]	ResNet101	28.5	48.1	29.4	20.0	39.7	57.1
DMNet[74]	ResNeXt101	29.4	49.3	30.6	21.6	41.0	56.9
AMRNet[76]	ResNet50	31.7	52.7	33.1	23.0	43.4	58.1
AMRNet[76]	ResNet101	31.7	52.6	33.0	22.9	43.4	59.5
AMRNet[76]	ResNeXt101	32.1	53.0	33.2	23.2	43.9	60.5
DANet	ResNet50	33.4	55.2	31.6	24.4	45.4	59.2
DANet	ResNet101	33.5	55.7	31.2	24.3	45.6	59.9
DANet	ResNeXt101	34.9	56.0	32.3	26.6	46.9	61.7

Figs. 5 and 6 compare the detection results of DANet and Faster RCNN on the VisDrone data set; the classes corresponding to the prediction boxes of different colors are given above the images. From the original images it can be seen directly that the recall of the proposed DANet is noticeably higher than that of Faster RCNN. In the experiment, some background-optimized regions are marked and enlarged for closer observation, and DANet is found to clearly improve the detection of small targets in crowded regions. Fig. 7 shows the visual detection results of DANet on the VisDrone data set with ResNeXt101 as the backbone network. Taken together, the comparison experiments show that the proposed DANet markedly improves detection on low-altitude overhead images, and that splitting subgraphs based on the object density in the image captures the dense regions more effectively without relying on heavy preprocessing of the truth values.

Ablation experiment

In this section, a series of ablation experiments are designed, and contributions of components such as a background optimization strategy, a clustering strategy and an expansion strategy based on density information to model performance in the DANet are analyzed.

Background optimization strategy based on density information

To verify the effectiveness of the background optimization strategy based on density information, comparison experiments are performed against three alternatives: no background optimization, equal-division background optimization, and random background optimization. Equal-division background optimization divides the image equally into 3x3 subgraphs, each about 666x500 in size. Random background optimization randomly selects two groups of values, with the random numbers in the range [0.1, 0.7], and cuts the image into 3x3 subgraphs. Figs. 8-11 show one image sample from the VisDrone data set and some of the subgraphs obtained with three different strategies (Figs. 9-11). Table 3 shows the performance of the models trained with the different background optimization strategies on the pedestrian category of the VisDrone data set.
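The equal-division baseline is simple enough to sketch directly. The example assumes a 2000x1500 input image, which is consistent with the stated subgraph size of about 666x500 but is not given in the text; the function name and the (left, top, width, height) tile format are hypothetical, and the few remainder pixels at the right edge are ignored.

```python
def equal_split(img_w, img_h, rows=3, cols=3):
    """Divide an image into rows x cols equally sized tiles.

    Returns a list of (left, top, width, height) tuples in row-major order.
    """
    tile_w, tile_h = img_w // cols, img_h // rows
    return [(c * tile_w, r * tile_h, tile_w, tile_h)
            for r in range(rows) for c in range(cols)]


tiles = equal_split(2000, 1500)
```

Because the grid ignores where the targets are, a pedestrian standing on a tile boundary is cut in two, which is the weakness that the density-based strategy is meant to avoid.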

TABLE 3 comparison of model Performance under different background optimization strategies

Method	Backbone network	AP	APs	APm	APl
No background optimization	ResNet50	34.7	22.4	37.1	60.4
Equal-division background optimization	ResNet50	40.1	32.9	45.6	58.8
Random background optimization	ResNet50	38.4	31.8	42.5	59.1
Density background optimization	ResNet50	44.8	38.2	51.7	67.9

A large number of small-pixel pedestrians exist in low-altitude overhead images, and effectively improving their detection greatly improves the overall performance of the model. Table 3 shows that without a background optimization strategy, the model detects small-target pedestrians poorly, and both the overall AP and APs are low. After a background optimization strategy is introduced, the performance of the model improves greatly. On the one hand, the background-optimized subgraph acts as an attention cue that guides the model to learn small-target pedestrians in a targeted way; on the other hand, introducing background optimization also introduces the sub-detector, and combining the global detector with the sub-detector improves performance. Table 3 further shows that equal-division and random background optimization greatly improve APs over the model without background optimization, but both carry a risk of cutting large pedestrians, which is reflected in their APl dropping by about 1.5 percentage points. It is worth noting that large-pixel targets account for only a small share of the pedestrian category in the VisDrone data set, which partly offsets this shortcoming of equal-division and random background optimization. The density-information-based background optimization strategy provided by the invention not only effectively improves the overall performance of the model but also alleviates the splitting of large targets during background optimization, which shows that it can cope with the drastic changes in target scale in low-altitude overhead scenes.

Table 4 comparison under different clustering strategies under VisDrone dataset

Method	Backbone network	Number of training samples	Number of test samples	AP
K-means	ResNet50	36154	2641	23.5
K-means++	ResNet50	35902	2715	24.9
DBSCAN	ResNet50	37078	2883	27.2
Meanshift	ResNet50	34276	2639	30.8

TABLE 5 comparison on the pedestrian category of the VisDrone data set under different clustering strategies

Method	Backbone network	Number of training samples	Number of test samples	AP
K-means	ResNet50	9842	1127	46.2
K-means++	ResNet50	9693	945	48.1
DBSCAN	ResNet50	10004	1036	50.5
Meanshift	ResNet50	9957	991	51.6

After the DAM generates the density map of the input image, the choice of point clustering method has a large influence on the subgraphs finally produced by background optimization. Figs. 12 and 14 show two test samples from the VisDrone data set, and Figs. 13 and 15 are the density maps generated by the model of the present invention. It can be seen that the sparsity, positions, and number of clusters of pedestrian targets differ greatly between images. To demonstrate the effectiveness of the proposed mean-shift-based background optimization strategy, three classical clustering algorithms are chosen for comparison: K-means, K-means++, and DBSCAN. K-means and K-means++ are partition-based clustering methods that require the number of clusters K, that is, the number of subgraphs to split, to be given in advance; K is set to 6 in the experiments. Tables 4 and 5 list, for the full VisDrone data set and for its pedestrian category, the number of subgraphs obtained with each method and the AP of the model trained on those subgraphs.

Tables 4 and 5 show that K-means and K-means++ generate approximately the same number of subgraphs, but the models trained on them perform poorly. Both strategies need the number of clusters to be fixed in advance, and in some images the number of targets to be detected is smaller than the preset K, so the partition is poor and produces noise-like or oversized subgraphs. If K is reduced further, the number of generated subgraphs drops markedly and the density levels of the targets across subgraphs become very uneven. DBSCAN and Meanshift do not need the number of clusters to be preset, but DBSCAN has two obvious disadvantages: first, it is very sensitive to the clustering density and performs poorly on images whose sparsity varies widely; second, it places no constraint on cluster shape, which is normally an advantage of the algorithm, but in this scenario it easily produces ring-shaped clusters, which are unfriendly to splitting subgraphs. Meanshift is therefore chosen as the clustering strategy. Experiments also show that although the Meanshift clustering strategy does not generate the largest number of subgraphs, the training subgraphs it generates give the best results.

Furthermore, to verify the influence of the proposed region expansion strategy on model performance, comparison experiments are performed on the VisDrone data set and on its pedestrian category, using Meanshift as the clustering method. The baseline is denoted Meanshift, and the method with the expansion strategy is denoted Meanshift + expansion. The results of the experiment are shown in Tables 6 and 7:

TABLE 6 ablation experiments for dilation strategy effectiveness under VisDrone dataset

Method	Backbone network	AP	APs	APm	APl
Meanshift	ResNet50	30.8	23.8	42.1	48.9
Meanshift + expansion	ResNet50	31.9	23.2	44.7	53.0

Table 7 ablation experiments on effectiveness of dilation strategy under VisDrone dataset pedestrian categories

Table 6 shows that the expansion strategy improves the overall AP on the VisDrone data set from 30.8% to 31.9%, and in particular improves APl from 48.9% to 53.0%. On the pedestrian category of the VisDrone data set, the expansion strategy improves AP by 0.6 percentage points, a slightly smaller gain. This is because the VisDrone data set includes large targets such as trucks and buses, and the expansion strategy significantly reduces the splitting of large targets during inference. Pedestrian targets occupy few pixels in absolute terms and are seldom split, so the improvement is less pronounced than on the full VisDrone data set. Overall, however, the expansion strategy provided by the invention helps improve the accuracy of the detection model, especially for large targets.

The invention studies pedestrian detection in low-altitude overhead scenes, where the main challenges are: the input image is large while the targets to be detected occupy few pixels in the original image; the scale and pose of the targets vary drastically; and the targets are unevenly distributed in the image, so that excessive background regions affect the detection result and waste computing resources. To address these problems, the invention adopts a divide-and-conquer technical route: the image is first background-optimized into several subgraphs according to a strategy based on the density distribution of the targets to be detected, detectors are then trained on the original images and the subgraphs respectively, and the outputs of the two detectors are fused into the final detection result. Inspired by the crowd counting task, a density-aware network is proposed; compared with previous methods, it generates a sparse density response that reflects position information to some extent while providing the density distribution of the targets to be detected. When the generated density map is processed, the Meanshift clustering method and the region expansion strategy are combined to better identify the dense regions of the targets to be detected. Finally, extensive comparison experiments on the VisDrone data set demonstrate the effectiveness of the proposed method. Extensive ablation experiments on the VisDrone data set and on its pedestrian category show that the proposed density-aware network stably generates high-quality subgraphs while processing the truth values as little as possible; that, compared with other clustering methods, the Meanshift-based clustering method performs well over complex density ranges; and that the region expansion strategy alleviates the cutting of targets during inference, especially for large targets. The experimental results show that the proposed method markedly improves the accuracy of detection tasks in low-altitude overhead scenes.

The above embodiment is an embodiment of the present invention, but the embodiment of the present invention is not limited by the above embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims

1. A high-resolution image background optimization method based on target density information is characterized by comprising the following steps:

acquiring a high-resolution image containing a pedestrian;

predicting a sparse density map corresponding to each input image;

2. The method for optimizing the background of the high-resolution image based on the density information of the target as claimed in claim 1, wherein the sparse density map utilizes a density sensing component to indicate the existence region of the target and the density of the objects in the region, and the background region not containing the target to be detected is eliminated.

3. The method for optimizing the background of the high-resolution image based on the target density information according to claim 2, wherein the density perception component predicts a sparse density map based on a Bayesian distribution crowd counting network, and then classifies the image into blocks according to an area with high crowd density counted by a clustering method.

4. The method for high-resolution image background optimization based on target density information according to claim 1, wherein after the original images are partitioned, the original images are taken as one data set and the background-optimized sub-images as another data set, and two Faster RCNN structures with feature pyramids are trained as a global detector and a sub-detector; when an input image is inferred, the input image is split by the density sensing component, prediction is performed with the two detectors, the results of the sub-detector are mapped back to the original image, and a final detection result is output after post-processing by a non-maximum suppression method.

5. The method for optimizing the background of the high-resolution image based on the target density information according to claim 1, wherein a plurality of weighted loss functions are used to measure the difference between the vector binary image density function of the truth point diagram and the vectorized density image density function predicted by the network, and the specific formula is as follows:

wherein

A vector binary map which is a truth point map,

lambda is a predicted vectorized density map ₁ And λ ₂ Is a weight coefficient, z is

For the purpose of the non-regularized density function,

for an improved optimal transportation loss function,

is a total variation loss function.

6. The method for high-resolution image background optimization based on target density information according to claim 5, wherein the count loss is defined as follows:

wherein

the difference between them is as small as possible;

calory | | z | | ₁ And

Wherein z (i) and

The gradient of (a) is:

the total variation loss function is defined as follows:

the gradient of the total variation loss function is recorded as:

wherein

Sign (.) is a Sign function of the vector.

7. The method for optimizing the background of the high-resolution image based on the target density information as claimed in claim 1, wherein the clustering method is a mean shift method, and a pedestrian dense region is estimated by combining the mean shift method and a region expansion strategy when the generated sparse density map is processed.

8. The method for optimizing the background of the high-resolution image based on the density information of the target according to claim 7, wherein a clustering method based on mean shift is adopted, the mean shift referring to a shifted mean vector; a probability density distribution interval of the sample points is calculated from the sample points, the points in the density map are clustered into different clusters, the target instances in each cluster are counted, clusters whose count is larger than a threshold value are retained, and clusters whose count is smaller than the threshold value are discarded;

suppose that in the d-dimensional space $\mathbb{R}^d$ there are n discrete samples $x_i$ ($i = 1, 2, \ldots, n$); the mean vector in the mean shift is defined as follows:

$$M_h(x) = \frac{1}{k} \sum_{x_i \in S_h(x)} (x_i - x)$$

wherein $S_h$ is a high-dimensional sphere region with radius h, that is, the set of points satisfying:

$$S_h(x) = \{\, y : (y - x)^T (y - x) \le h^2 \,\}$$

a kernel function K(x) is introduced, and the probability density function f(x) is expressed as:

$$f(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where K is a kernel function, defined as follows:

$$K(x) = c_{k,d}\, k(\|x\|^2)$$

the regularization coefficient $c_{k,d}$ is used for ensuring that the integral of the probability density is 1, and the extreme points of the probability density function f(x) are calculated by solving for the zeros of its partial derivatives.

9. The method for optimizing the background of the high-resolution image based on the target density information as claimed in claim 8, wherein different clusters $S_n$ are obtained after clustering the points in the density map by the mean shift method; the number N of people in each cluster region is counted, and if N is larger than a given threshold value T, background optimization is performed on the cluster; after the clusters $S_n$ are filtered with the given T, candidate clusters $S'_n$ accurately indicating dense crowd areas are obtained;

performing background optimization according to the candidate clusters $S'_n$ comprises the following steps: in order to contain all the labeling frames as far as possible, in the training process, the positions of all the labeling frames whose center points fall into the cluster region are counted, and the boundary of the cluster is expanded according to the sizes of the given labeling frames, so that objects with a large pixel ratio are prevented from being segmented by the background optimization operation; in the inference process, since the information of the truth-value boxes cannot be known, a region expansion method is adopted for prediction.

10. The method for high-resolution image background optimization based on target density information according to claim 9, wherein the region expansion strategy comprises the following steps:

for an input image $I = (W, H)$ and clusters $S = \{ S_i \mid S_i = (x_i, y_i),\ i = 1, \ldots, n \}$:

S11: initialize S;

S12: if $i < n$, iteratively execute steps S13-S18;

S14: if $x_{min} - \lambda_x (x_{max} - x_{min}) > 0$, then $topx_j = x_{min} - \lambda_x (x_{max} - x_{min})$; otherwise $topx_j = 0$;

S15: if $y_{min} - \lambda_y (y_{max} - y_{min}) > 0$, then $topy_j = y_{min} - \lambda_y (y_{max} - y_{min})$; otherwise $topy_j = 0$;

S16: if $x_{max} + \lambda_x (x_{max} - x_{min}) < W$, then $w_j = (1 + 2\lambda_x)(x_{max} - x_{min})$; otherwise $w_j = W - x_{min}$;

S17: if $y_{max} + \lambda_y (y_{max} - y_{min}) < H$, then $h_j = (1 + 2\lambda_y)(y_{max} - y_{min})$; otherwise $h_j = H - y_{min}$;

S18: $i = i + 1$, and return to step S12;

wherein $\lambda_x$ and $\lambda_y$ are expansion coefficients associated with the data set.