CN114494999B - Double-branch combined target intensive prediction method and system - Google Patents

Double-branch combined target intensive prediction method and system

Info

Publication number
CN114494999B
Authority
CN
China
Prior art keywords
map
density
predicted
prediction
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210058467.6A
Other languages
Chinese (zh)
Other versions
CN114494999A (en)
Inventor
吴晓
张基
谭舒月
李威
彭强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210058467.6A priority Critical patent/CN114494999B/en
Publication of CN114494999A publication Critical patent/CN114494999A/en
Application granted granted Critical
Publication of CN114494999B publication Critical patent/CN114494999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74 - Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target dense prediction and discloses a double-branch combined target dense prediction method and system. The double-branch combined target dense prediction method comprises the following steps: S1, generating a feature map by using an image of a target to be predicted; S2, generating a predicted density map by using the counting features in the feature map, and generating a predicted position map by using the positioning features in the feature map; and S3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map. The invention solves the problems of attenuation of density map position information, inaccurate target position prediction, and the like in the prior art.

Description

Double-branch combined target intensive prediction method and system
Technical Field
The invention relates to the technical field of target dense prediction, in particular to a double-branch combined target dense prediction method and system.
Background
In the technical field of target dense prediction, crowd dense prediction is a typical application scenario. Crowd counting and crowd localization are two similar but distinct research tasks in crowd analysis; both analyze dense scenes based on annotations of pedestrian head center points. The goal of crowd counting is to predict a density map and estimate the number of pedestrians in an image, whereas crowd localization aims to estimate the location of each individual.
Mainstream crowd counting methods estimate the number of pedestrians by integrating a predicted density map, so how to generate high-quality density maps becomes a key issue in crowd counting. Some research efforts have employed multi-column structures, multi-scale structures, and dilated convolutions to generate high-quality density maps. However, because the mean squared error loss makes it difficult to learn the spatial differences between the predicted density map and the ground-truth annotation, these methods suffer from attenuation of position information in the density map and inaccurate prediction of pedestrian positions. By observation, the predicted density map typically lacks location information strong enough to distinguish foreground from background regions, which severely impacts prediction accuracy. There are two main cases: (i) certain background regions are overestimated due to background noise; (ii) some crowd foreground regions in the image are identified as background, resulting in underestimation. To address this problem, attention modules have been used to mitigate background noise, and a maximum-excess-over-pixels loss has been used to preserve the high-frequency spatial variation of density. However, the features or density maps used to enhance the network's spatial awareness in these methods are optimized for crowd counting, which makes them insensitive to location information and limits the improvement. Therefore, more accurate location information should be learned to better express the spatial distribution and thereby facilitate crowd counting.
Crowd localization methods can be roughly divided into two categories according to the form of supervision: box-supervised methods and point-supervised methods. Box-supervised methods typically apply an object detection model to heads or faces to locate pedestrians. These methods suffer from small objects and severe occlusion, resulting in low recall; moreover, bounding-box annotation is labor-intensive, especially in densely crowded scenes. Point-supervised methods try to regress a location map or a binary map and estimate the head center locations through post-processing. Compared with density maps, location maps provide more accurate object instance locations but lack density information, and thus tend to duplicate or miss objects in locally dense regions.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a double-branch combined target dense prediction method and a double-branch combined target dense prediction system, and solves the problems of density map position information attenuation, inaccurate target position prediction and the like in the prior art.
The technical scheme adopted by the invention for solving the problems is as follows:
a double branch combined target dense prediction method, the method comprises the following steps:
s1, generating a characteristic map by using an image of a target to be predicted;
s2, generating a prediction density map by using the counting features in the feature map, and generating a prediction position map by using the positioning features in the feature map;
and S3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
As a preferred technical solution, the step S3 includes the following steps:
s31, constructing a position sensing module with a training function, and enabling the position sensing module to generate a prediction mask image by using positioning features from the feature image;
and S32, enabling the position perception module to update and optimize the generated prediction density map by using the prediction mask map.
As a preferred technical solution, in the step S31, in the process of generating the prediction mask map, the labeled location map is further converted into a labeled mask map for supervising the training of the location awareness module.
As a preferred technical solution, the method further comprises the following steps:
s33, measuring the difference between the predicted density graph and the labeled density graph by using an MSE loss function, and updating and optimizing the generated predicted density graph by using the MSE loss function, wherein the formula of the MSE loss function is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map.
As a preferred technical solution, the step S3 further includes the following steps:
s34, constructing an adaptive density sensing module with a training function, and enabling the adaptive density sensing module to generate a prediction attention map by using counting features from the feature map;
and S35, enabling the position perception module to update and optimize the generated predicted position map by using the predicted attention map.
As a preferred technical solution, the step S3 further includes the following steps:
s36, finding out a dense area in the density map by utilizing the prediction attention map, adjusting the weight of areas with different density levels pixel by pixel in a positioning loss function, and supervising the self-adaptive density sensing module.
As a preferred technical solution, in step S36, a density-aware localization loss function is used as the positioning loss function. The density-aware localization loss is a variant of the focal loss in which the localization loss of each pixel is weighted according to the predicted attention map, wherein L_dal represents the density-aware localization loss function, M represents the number of positive samples, A_p represents the pixel value of the p-th pixel point of the foreground map of the predicted attention map, Y_p represents the pixel value of the p-th pixel point of the foreground map of the annotated position map, P_pr is the center point map of the predicted position map, P_pr(p) represents the pixel value of the p-th pixel point of the center point map of the predicted position map, and γ and δ represent hyper-parameters, both being positive real numbers.
As a preferred technical solution, in step S32, the training of the location awareness module is supervised by using a focal loss function. The focal mask loss is computed pixel by pixel between the prediction mask map and the annotation mask map, wherein L_mask represents the mask loss function, H represents the height of the prediction mask map, W represents the width of the prediction mask map, M_pr represents the prediction mask map, M_pr(p) represents the pixel value of the p-th pixel point of the prediction mask map, and M_gt(p) represents the pixel value of the p-th pixel point of the annotation mask map.
A dual-branch combined target dense prediction system is based on the dual-branch combined target dense prediction method and comprises the following modules:
a feature map generation module: used for generating a feature map by using an image of a target to be predicted;
a density map generation module: used for generating a predicted density map by using the counting features in the feature map;
a location map generation module: used for generating a predicted position map by using the positioning features in the feature map;
a location awareness module: used for updating and optimizing the generated predicted density map by using the process information for generating the predicted position map.
As a preferred technical solution, the method further comprises the following modules:
an adaptive density sensing module: used for updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method relieves the problem of position information attenuation in the density map, reduces the position map errors in dense regions, and improves the accuracy of target dense prediction;
(2) The invention integrates a positioning branch and a location awareness module to provide complementary features and a refined density map, and outperforms existing crowd counting methods by a large margin on the NWPU-Crowd and JHU-Crowd++ datasets;
(3) According to the invention, the density map generated by the counting branch is refined by directly using the position information generated by the positioning branch, so that a better prediction effect is achieved; the new density-aware localization loss (DAL) is employed to improve the localization performance in dense regions by increasing the proportion of dense-region losses in the overall loss. Finally, more accurate position information is obtained to refine the density map and promote crowd counting;
(4) The invention provides an adaptive density perception module (ADAM) that generates a predicted attention map to guide the position map to focus on dense regions. Based on the density information of the counting branch, ADAM alleviates the problem that the position map is prone to errors in dense regions, adaptively highlighting the error-prone dense regions in the image while ignoring the regions that are easy to localize.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a dual-branch joint type target dense prediction method according to the present invention;
fig. 2 is a schematic structural diagram of a dual-branch combined target dense prediction system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Examples
As shown in fig. 1 and 2, SCALNet treats crowd counting and localization as dense prediction problems and integrates them into a single framework, in which the model independently outputs a predicted position map and a predicted density map. However, the link between crowd counting and crowd localization has not been thoroughly explored. Inspired by SCALNet, we investigate in depth how to improve crowd counting through localization.
Motivated by the need to solve the problems existing in the prior art, a double-branch combined dense counting network is provided to alleviate the position-information attenuation of the predicted density map and the errors of the predicted position map in dense regions. First, an image is input into the network, and a predicted density map and a predicted position map are generated by the counting branch and the positioning branch, respectively. To address the position-information attenuation of the predicted density map, a location awareness module (LAM) predicts a mask map that, using the position information of the positioning branch, highlights the foreground regions and suppresses the background regions of the predicted density map. In addition, an annotation mask map is obtained by binarizing the annotation position map and is used to supervise the optimization of the LAM. An adaptive density perception module (ADAM) is also proposed to generate a predicted attention map that guides the predicted position map to focus on dense regions. Based on the density information of the counting branch, ADAM alleviates the tendency of the predicted position map to make errors in dense regions. Furthermore, a new density-aware localization loss (DAL) is employed to improve localization performance in dense regions by increasing the proportion of dense-region losses in the overall loss. Finally, more accurate position information is obtained to refine the predicted density map and thereby promote crowd counting.
The invention has the following characteristics:
1. A double-branch combined network;
in order to alleviate the challenges caused by the position attenuation of the solution density map and the error of the dense region prediction position map, a double-branch combined dense counting network is proposed to promote the crowd counting through positioning. The network sharing backbone network (namely a characteristic diagram generation module) adopts a counting branch and a positioning branch for extracting deep characteristics, and the two branches are optimized under mutual guidance. In the method, firstly, an image is input into a backbone network, and then extracted features are input into the two branches to respectively obtain a prediction density map and a prediction position map. A location-aware module (LAM) utilizes features from the located branches to generate a prediction mask map, which contains foreground and background that can be used to refine the prediction density map. Meanwhile, the label position image is binarized into a label mask image only containing foreground and background information to supervise the training of the LAM. In addition, an adaptive density perception module (ADAM) uses features from the counting branch to generate a predictive attention map containing density information. The prediction attention is directed to finding dense regions in the predicted density map in order to adjust the weights of the different density level regions pixel by pixel in the localization loss function. And finally, integrating the whole predicted density graph to obtain a counting result. The localization result includes the center points of the head, which are the peaks of the predicted location map, obtained by local maximum filtering and threshold selection. In particular, in view of the necessity of high resolution representation, HRNet is used as the backbone network of the network. The position branch is a continuous layer of one convolution and two transposed convolutions defined as:
Figure BDA0003475673250000071
the counting branch consists of three convolutional layers, defined as:
Figure BDA0003475673250000072
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003475673250000073
representing the convolutional layer with kernel k and step size s, output represents the number of output channels,
Figure BDA0003475673250000074
transposed convolutional layer with 4 cores, step size 2, and number of output channels 64,
Figure BDA0003475673250000075
Denotes a transposed convolution layer with kernel 4, step size 2, and output channel number 1, BN denotes a batch normalization operation, and R denotes a ReLU operation.
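For illustration, the following is a minimal PyTorch sketch of the two prediction heads and of the local-maximum peak extraction described above; the class and function names, the kernel size and channel width of every layer whose parameters are not stated in the text, the sigmoid on the position map, the peak threshold, and the input feature width are assumptions rather than the exact configuration of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationBranch(nn.Module):
    # One convolution followed by two transposed convolutions (kernel 4, stride 2),
    # producing a single-channel predicted position map; the first layer's kernel size,
    # all channel widths, and the sigmoid on the output are assumptions.
    def __init__(self, in_channels=270):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

class CountingBranch(nn.Module):
    # Three convolutional layers producing a single-channel predicted density map;
    # kernel sizes and intermediate widths are assumptions.
    def __init__(self, in_channels=270):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, feats):
        return self.head(feats)

def extract_peaks(position_map, threshold=0.5, window=3):
    # Local-maximum filtering plus threshold selection: a pixel is a head center if it
    # equals the maximum of its neighborhood and exceeds the threshold (values assumed).
    pooled = F.max_pool2d(position_map, kernel_size=window, stride=1, padding=window // 2)
    peaks = (position_map == pooled) & (position_map > threshold)
    return peaks.nonzero()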
2. A Location Awareness Module (LAM);
due to the optimization goal of the counting fingers and the characteristics of the MSE loss, the density map generated by the counting fingers is more susceptible to interference from background noise. In contrast, the positioning branch is used for modeling the position of the human head in the image, and the features from the positioning branch contain more accurate position information of the object instance, which can help predict the density map to enhance the position information. An intuitive idea is to use the predicted location map of the positioning branch generation directly to refine the predicted density map of the counting branch generation. In practice, the locating branch locates the center point of the head by pixel-by-pixel classification. In the predicted position map, only the position of the center point of the human head is regarded as a positive sample having a higher value. The pixels around the center point are all marked as negative examples, their values decreasing with distance from the center point. The predicted location map may improve the location sensitivity of the predicted density map to some extent, but may also affect the results of some foreground regions. Therefore, it is necessary to generate binary classification maps containing foreground and background information by using the features of the positioning branches.
The location awareness module (LAM) aims to identify the foreground and background regions of an image and refine the predicted density map. The features of the first convolutional layer of the positioning branch are input into the LAM to generate a prediction mask map, which is directly used to focus the foreground region of the predicted density map. This operation can be expressed as follows:
D_{rf}^{i}(p) = M(p) \cdot D^{i}(p), \quad \forall p \in X
wherein D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th refined predicted density map, D^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map before refinement, M(p) represents the pixel value of the p-th pixel point of the mask map, p is the index of the pixel point of the image of the target to be predicted, and X is the set of pixels of the image of the target to be predicted.
Furthermore, we explore the similarities and differences between the optimization objectives of the LAM and the positioning branch. The LAM is used to identify the head regions in the image, while the positioning branch is intended to locate the head center points. The annotation mask map is generated by binarizing the annotation position map and can be used to guide the optimization of the LAM. The location awareness module consists of multiple convolutional layers followed by a Sigmoid activation function.
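A minimal sketch of such a location awareness module, of the mask-based refinement, and of the label-mask binarization is given below; the number of layers, the channel widths, the binarization threshold, and the names used are assumptions, while the convolution-plus-Sigmoid composition and the mask-times-density refinement follow the description above.

import torch.nn as nn

class LocationAwareModule(nn.Module):
    # Multi-layer convolution followed by a Sigmoid that predicts a foreground/background
    # mask from positioning-branch features; layer count and widths are assumptions.
    def __init__(self, in_channels=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, loc_feats):
        return self.layers(loc_feats)

def refine_density(coarse_density, mask):
    # D_rf(p) = M(p) * D(p): keep foreground density responses, suppress background ones.
    return coarse_density * mask

def make_label_mask(label_position_map, threshold=0.0):
    # Binarize the annotated position map into a foreground/background label mask
    # used to supervise the LAM (the binarization threshold is an assumption).
    return (label_position_map > threshold).float()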
3. An adaptive density perception module (ADAM);
although the position branch has rich position information and can assist the prediction density map to suppress background noise, the prediction position map is easy to repeatedly count or miss-detect pedestrians in a dense area. To address this problem, a simple but effective solution is to further constrain the predicted location map in a dense population. The predicted density map reflects the density of people for different pixel regions in the image, including density regions of different levels. If we directly use the predicted density map to adjust the weights of the localization loss of different regions, although dense regions can be highlighted to some extent, the loss weights of other regions can also be increased. This also inhibits the locating branch from effectively focusing on the error-prone dense areas. Therefore, adaptively highlighting error-prone dense areas in an image while ignoring easily located areas is a better solution. Inspired by the attention mechanism, an adaptive density perception module (ADAM) is proposed to adaptively focus on error-prone dense regions of a location map based on density information of counting branches. In ADAM, features of the counting branch are taken to generate a predictive attention map. Meanwhile, a new density sensing positioning loss function is proposed to improve the performance of the predicted position map in the dense area, and the function utilizes the predicted attention map to increase the weight of the error-prone area loss in the final loss. The structural composition of ADAM is:
Figure BDA0003475673250000091
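Since the internal structure of ADAM is not spelled out above, the following is only a rough sketch of one plausible arrangement that turns counting-branch features into a per-pixel attention map; the layer stack, the Sigmoid normalization, and the class name are assumptions.

import torch.nn as nn

class AdaptiveDensityAwareModule(nn.Module):
    # Maps counting-branch features to a per-pixel attention map highlighting dense,
    # error-prone regions; the concrete layer stack and output normalization are assumed.
    def __init__(self, in_channels=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, count_feats):
        return self.layers(count_feats)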
4. a loss function.
1) Counting loss:
the MSE loss is used to measure the difference between the predicted density map and the annotated density map. The formula for MSE loss is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map.
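For reference, a minimal sketch of this counting loss in PyTorch; the function name and the (N, 1, H, W) batch layout are assumptions.

def mse_counting_loss(pred_density, gt_density):
    # L_MSE = (1/N) * sum over images and pixels of the squared per-pixel difference;
    # pred_density and gt_density are tensors of shape (N, 1, H, W).
    n = pred_density.shape[0]
    return ((pred_density - gt_density) ** 2).sum() / n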
2) Density-aware localization loss:
we take a variant of the focus loss as the basic loss function. In order to solve the problem that a dense region location map is prone to errors, a novel density-sensing localization loss (DAL) is designed by combining an ADAM-generated prediction attention map with a basic loss function. DAL loss demonstrates the performance of locating branches by increasing the proportion of dense error-prone regions in the overall loss based on the results of predictive attention maps. DAL loss can be expressed as follows:
Figure BDA0003475673250000101
wherein L is dal Representing the density-aware localization loss function, M representing the number of positive samples, A p Pixel value, Y, of the p-th pixel point of the foreground map representing the predictive attention map p Representing the pixel value, P, of the P-th pixel of the foreground map of the annotated position map pr Is a center point diagram of the predicted position map,
Figure BDA0003475673250000102
and expressing the pixel value of the p-th pixel point of the central point diagram of the predicted position diagram, wherein gamma and delta both express hyper-parameters, and gamma and delta both are positive real numbers.
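As a rough illustration only, the following sketches one plausible instantiation of such a density-aware localization loss: a focal-loss variant over the predicted center-point map in which each pixel's term is additionally up-weighted by the predicted attention map. The (1 + A_p) weighting, the roles assigned to γ and δ, their default values, and the function name are all assumptions and not necessarily the exact form used by the invention.

import torch

def density_aware_localization_loss(pred_center, gt_center, attention,
                                    gamma=2.0, delta=4.0, eps=1e-6):
    # One assumed instantiation: a focal-loss variant over the predicted center-point map
    # P_pr, with each pixel's term scaled by (1 + A_p) so that dense, error-prone regions
    # take a larger share of the total loss. All tensors have shape (N, 1, H, W);
    # the (1 + A_p) weighting and the default gamma/delta values are assumptions.
    pred = pred_center.clamp(eps, 1.0 - eps)
    pos = (gt_center == 1).float()
    weight = 1.0 + attention
    pos_term = pos * (1.0 - pred) ** gamma * torch.log(pred)
    neg_term = (1.0 - pos) * (1.0 - gt_center) ** delta * pred ** gamma * torch.log(1.0 - pred)
    num_pos = pos.sum().clamp(min=1.0)
    return -(weight * (pos_term + neg_term)).sum() / num_pos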
3) Mask loss:
the mask penalty employs a focus penalty that is used to constrain the LAM module for foreground identification. The loss of focus is given by:
the formula for the focus loss function is:
Figure BDA0003475673250000103
wherein L is mask Representing the mask loss function, H representing the height of the prediction mask map, W representing the width of the prediction mask map, M pr A prediction mask map is represented that represents the prediction mask map,
Figure BDA0003475673250000104
the pixel value representing the p-th pixel point of the prediction mask map,
Figure BDA0003475673250000105
and representing the pixel value of the p-th pixel point of the labeling mask image.
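Likewise, the following is a minimal sketch of a pixel-wise focal mask loss averaged over the H × W pixels, offered as one plausible reading; the focusing exponent, its default value, and the function name are assumptions.

import torch

def focal_mask_loss(pred_mask, gt_mask, gamma=2.0, eps=1e-6):
    # Pixel-wise focal loss between the predicted mask M_pr and the binary annotation
    # mask, averaged over the H * W pixels of each of the N maps; the focusing exponent
    # gamma and its default value are assumptions.
    pred = pred_mask.clamp(eps, 1.0 - eps)
    pos_term = gt_mask * (1.0 - pred) ** gamma * torch.log(pred)
    neg_term = (1.0 - gt_mask) * pred ** gamma * torch.log(1.0 - pred)
    n, _, h, w = pred_mask.shape
    return -(pos_term + neg_term).sum() / (n * h * w)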
The final loss is obtained as a weighted sum of the MSE loss, the DAL loss, and the mask loss:
L_{final} = L_{MSE} + \lambda_{1} L_{dal} + \lambda_{2} L_{mask}
wherein λ_1 and λ_2 are introduced to balance the loss weights.
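A short sketch of the weighted combination; the default values of λ1 and λ2 below are placeholders, since no numeric values are stated.

def total_loss(l_mse, l_dal, l_mask, lambda1=1.0, lambda2=1.0):
    # L_final = L_MSE + lambda1 * L_dal + lambda2 * L_mask; the default weights are
    # placeholders rather than values taken from the text.
    return l_mse + lambda1 * l_dal + lambda2 * l_mask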
Due to the adoption of the technical scheme of the invention, the following technical effects are realized:
1. counting effect:
in the population counting task, we compared the proposed method with the most advanced method on four reference datasets NWPU-Crowd, shanghai science a and B parts, and JHU-Crowd + +. For the convenience of comparison, the methods are classified into a crowd counting method and a crowd positioning method according to whether the position information is output or not.
Compared with crowd counting methods, our method achieves the best performance on the NWPU-Crowd, ShanghaiTech Part A, and JHU-Crowd++ datasets. In particular, compared with DM-Count, our method reduces the MAE on the NWPU-Crowd dataset by 18 and the MSE by 63.5. Meanwhile, our method achieves comparable performance on the ShanghaiTech Part B dataset. The proposed method integrates a localization branch and a location awareness module to provide complementary features and refine the predicted density map, outperforming crowd counting methods by a large margin on the NWPU-Crowd and JHU-Crowd++ datasets.
Compared with crowd localization methods, our method outperforms the state of the art on the NWPU-Crowd dataset and obtains the best MAE result on the JHU-Crowd++ dataset. Relative to P2PNet, our method improves the MAE and MSE on the NWPU-Crowd dataset by 12.3% and 8.6%, respectively. Our method also achieves comparable performance on the ShanghaiTech Part A and Part B datasets. Images in the ShanghaiTech Part A and Part B datasets have low resolution, and the proportion of background regions in the images is relatively small. In contrast, the images in the NWPU-Crowd and JHU-Crowd++ datasets have higher resolution and contain more complex backgrounds. The proposed method employs HRNet with a high downsampling rate as the backbone network and utilizes the position information from the positioning branch to mitigate the background noise of the predicted density map. In conclusion, the method makes great progress on the NWPU-Crowd and JHU-Crowd++ datasets and also has a positive effect on ShanghaiTech Part A and Part B to a certain extent.
2. The positioning effect is as follows:
for fair comparison, the method is divided into a frame supervision method and a point supervision method. Compared to the results for P2Pnet, our method improved the F1 value and recall by 3.6% and 4.4%, respectively. Although FasterRCNN, supervised by box, achieved the highest accuracy, its recall rate was very poor at 3.5%. The proposed model employs an adaptive density awareness module and density awareness loss to effectively improve model recall. To verify the robustness of the proposed method, we reproduced four most advanced methods on the Shanghai science and technology part A dataset using the original setup. In particular, crowd-SDNet1 and SCALNet2 are trained with official codes. Fasterncn 3 and TinyFaces4 are implemented by common code. The proposed method is superior to point surveillance methods in terms of F1 value, accuracy and recall. Although our process increased the recall by 5.5%, the results were still lower than TinyFaces. This is because TinyFaces uses box information as supervision and adopts a paradigm of a specific scale to optimize small target detection, effectively improving recall. Our point surveillance approach achieves crowd localization from a dense prediction perspective and uses high shrinkage rates in the model. The resolution of images in shanghai science and technology part a is generally low. The highly-reduced model is easy to lose information when extracting image features, and recall rate results are influenced.
In the method, HRNet is adopted as the network backbone. An adaptive moment estimation (Adam) optimizer is used to train the model, with an initial learning rate of 1e-4 for the NWPU-Crowd dataset and 1e-6 for the other datasets. The learning rate is decayed by a factor of 0.99 every epoch. The number of training epochs is 600 and the batch size is set to 16. In addition, random horizontal flipping and random cropping are used to augment the training data, where the crop size is 256 × 256 for ShanghaiTech Part A and Part B and 512 × 512 for the other datasets. We also resize the images so that the long edge of images in the NWPU-Crowd and JHU-Crowd++ datasets is less than 2048 pixels.
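The training configuration above maps onto standard PyTorch components as sketched below; the data pipeline, the model interface, and the use of an exponential scheduler to realize the per-epoch 0.99 decay are assumptions.

from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR
from torch.utils.data import DataLoader

def train(model, train_dataset, device="cuda", initial_lr=1e-4, epochs=600, batch_size=16):
    # Adam optimizer, initial learning rate 1e-4 (NWPU-Crowd; 1e-6 for the other datasets),
    # learning rate multiplied by 0.99 after every epoch, 600 epochs, batch size 16.
    # The dataset is assumed to yield (image, gt_density, gt_position, gt_mask) tuples and
    # the model to return (density, position, mask, attention); both are assumptions, as
    # are the loss helpers sketched earlier, which are assumed to be in scope.
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = Adam(model.parameters(), lr=initial_lr)
    scheduler = ExponentialLR(optimizer, gamma=0.99)
    model.to(device).train()
    for epoch in range(epochs):
        for images, gt_density, gt_position, gt_mask in loader:
            density, position, mask, attention = model(images.to(device))
            loss = total_loss(
                mse_counting_loss(density, gt_density.to(device)),
                density_aware_localization_loss(position, gt_position.to(device), attention),
                focal_mask_loss(mask, gt_mask.to(device)),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()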
It is worth mentioning that:
Preferably, as shown in fig. 2, the counting branch and the positioning branch first generate the predicted density map and the predicted position map, respectively; the location awareness module then outputs a prediction mask map to refine the predicted density map; the adaptive density perception module adaptively highlights the error-prone regions of the predicted position map; and the density-aware localization loss takes the predicted position map and the predicted attention map as inputs for the loss calculation.
In the description of the steps of the method of the present invention, the order of the steps is not particularly emphasized, and in many cases the description is not limited to a single order: for example, steps S31, S32, S34, and S35 may be executed with step S31 and step S32 first and then step S34 and step S35, or with step S34 and step S35 first and then step S31 and step S32. Such variations are considered to be included in the technical solution disclosed by the invention as long as they do not violate common sense, the technical logic of the field, or the inventive concept disclosed by the invention.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (3)

1. A dual-branch combined target dense prediction method is characterized by comprising the following steps:
s1, generating a characteristic map by using an image of a target to be predicted;
s2, generating a prediction density map by using the counting features in the feature map, and generating a prediction position map by using the positioning features in the feature map;
s3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map;
step S3 includes the following steps:
s31, constructing a position sensing module with a training function, and enabling the position sensing module to generate a prediction mask image by using positioning features from the feature image;
in step S31, in the process of generating the prediction mask map, the labeled position map is further converted into a labeled mask map for supervising the training of the position sensing module;
s32, enabling the position perception module to update and optimize the generated prediction density map by using the prediction mask map;
s33, measuring the difference between the predicted density graph and the marked density graph by using an MSE (mean square error) loss function, and updating and optimizing the generated predicted density graph by using the MSE loss function, wherein the formula of the MSE loss function is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map;
s34, constructing an adaptive density sensing module with a training function, and enabling the adaptive density sensing module to generate a prediction attention diagram by using counting features from the feature diagram;
s35, enabling the position perception module to update and optimize the generated predicted position map by using the predicted attention map;
s36, finding out a dense area in the density map by utilizing the prediction attention map, adjusting the weight of areas with different density levels pixel by pixel in a positioning loss function, and supervising the self-adaptive density sensing module;
in step S36, a density-aware localization loss function is used as the positioning loss function, the density-aware localization loss being a variant of the focal loss in which the localization loss of each pixel is weighted according to the predicted attention map, wherein L_dal represents the density-aware localization loss function, M represents the number of positive samples, A_p represents the pixel value of the p-th pixel point of the foreground map of the predicted attention map, Y_p represents the pixel value of the p-th pixel point of the foreground map of the annotated position map, P_pr is the center point map of the predicted position map, P_pr(p) represents the pixel value of the p-th pixel point of the center point map of the predicted position map, and γ and δ represent hyper-parameters, both being positive real numbers.
2. The method according to claim 1, wherein in step S32, the training of the location awareness module is supervised by a focal loss function computed pixel by pixel between the prediction mask map and the annotation mask map, wherein L_mask represents the mask loss function, H represents the height of the prediction mask map, W represents the width of the prediction mask map, M_pr represents the prediction mask map, M_pr(p) represents the pixel value of the p-th pixel point of the prediction mask map, and M_gt(p) represents the pixel value of the p-th pixel point of the annotation mask map.
3. A dual-branch combined target dense prediction system, characterized in that, based on the method of claim 1 or 2, the system comprises the following modules:
a feature map generation module: used for generating a feature map by using an image of a target to be predicted;
a density map generation module: used for generating a predicted density map by using the counting features in the feature map;
a location map generation module: used for generating a predicted position map by using the positioning features in the feature map;
a location awareness module: used for updating and optimizing the generated predicted density map by using the process information for generating the predicted position map;
an adaptive density sensing module: used for updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
CN202210058467.6A 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system Active CN114494999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210058467.6A CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210058467.6A CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Publications (2)

Publication Number Publication Date
CN114494999A CN114494999A (en) 2022-05-13
CN114494999B true CN114494999B (en) 2022-11-15

Family

ID=81473215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210058467.6A Active CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Country Status (1)

Country Link
CN (1) CN114494999B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024044875A1 (en) * 2022-08-29 2024-03-07 Robert Bosch Gmbh Computer-implemented method and network for dense prediction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN113409246A (en) * 2021-04-14 2021-09-17 宁波海棠信息技术有限公司 Method and system for counting and positioning reinforcing steel bar heads

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241895B (en) * 2018-08-28 2021-06-04 北京航空航天大学 Dense crowd counting method and device
CN110188597B (en) * 2019-01-04 2021-06-15 北京大学 Crowd counting and positioning method and system based on attention mechanism cyclic scaling
CN109903282B (en) * 2019-02-28 2023-06-09 安徽省农业科学院畜牧兽医研究所 Cell counting method, system, device and storage medium
CN110020658B (en) * 2019-03-28 2022-09-30 大连理工大学 Salient object detection method based on multitask deep learning
CN110866445A (en) * 2019-10-12 2020-03-06 西南交通大学 Crowd counting and density estimation method based on deep learning
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN113409246A (en) * 2021-04-14 2021-09-17 宁波海棠信息技术有限公司 Method and system for counting and positioning reinforcing steel bar heads

Also Published As

Publication number Publication date
CN114494999A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
Li et al. Adaptively constrained dynamic time warping for time series classification and clustering
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN106886995A (en) Polyteny example returns the notable object segmentation methods of image of device polymerization
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
Wan et al. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN111091101B (en) High-precision pedestrian detection method, system and device based on one-step method
Jiang et al. A self-attention network for smoke detection
Lei et al. Boundary extraction constrained siamese network for remote sensing image change detection
CN111368634B (en) Human head detection method, system and storage medium based on neural network
Cheng et al. YOLOv3 Object Detection Algorithm with Feature Pyramid Attention for Remote Sensing Images.
CN109697727A (en) Method for tracking target, system and storage medium based on correlation filtering and metric learning
CN114494999B (en) Double-branch combined target intensive prediction method and system
Viraktamath et al. Comparison of YOLOv3 and SSD algorithms
Li et al. Robust detection of farmed fish by fusing YOLOv5 with DCM and ATM
CN114049503A (en) Saliency region detection method based on non-end-to-end deep learning network
CN117315284A (en) Image tampering detection method based on irrelevant visual information suppression
Kong et al. Collaborative model tracking with robust occlusion handling
CN116824488A (en) Target detection method based on transfer learning
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN114067359B (en) Pedestrian detection method integrating human body key points and visible part attention characteristics
CN113344005B (en) Image edge detection method based on optimized small-scale features
CN115578364A (en) Weak target detection method and system based on mixed attention and harmonic factor
Ji et al. Influence of embedded microprocessor wireless communication and computer vision in Wushu competition referees’ decision support

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant