CN114494999B - Double-branch combined target intensive prediction method and system - Google Patents

Double-branch combined target intensive prediction method and system

Info

Publication number
CN114494999B
Authority
CN
China
Prior art keywords
map
density
predicted
prediction
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210058467.6A
Other languages
Chinese (zh)
Other versions
CN114494999A (en)
Inventor
吴晓
张基
谭舒月
李威
彭强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210058467.6A priority Critical patent/CN114494999B/en
Publication of CN114494999A publication Critical patent/CN114494999A/en
Application granted granted Critical
Publication of CN114494999B publication Critical patent/CN114494999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 - Administration; Management
    • G06Q 10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74 - Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of target dense prediction and discloses a double-branch combined target dense prediction method and system. The double-branch combined target dense prediction method comprises the following steps: S1, generating a feature map by using an image of a target to be predicted; S2, generating a predicted density map by using the counting features in the feature map, and generating a predicted position map by using the positioning features in the feature map; and S3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map. The invention solves the problems of attenuation of density map position information, inaccurate target position prediction, and the like in the prior art.

Description

Double-branch combined target intensive prediction method and system
Technical Field
The invention relates to the technical field of target dense prediction, in particular to a double-branch combined target dense prediction method and system.
Background
In the technical field of target dense prediction, crowd dense prediction is a typical application scenario. Crowd counting and crowd localization are two similar but distinct research tasks in crowd analysis; both analyze dense scenes based on annotations of pedestrian head center points. The goal of crowd counting is to predict a density map and estimate the number of pedestrians in an image, whereas crowd localization aims to estimate the location of each individual.
Mainstream crowd counting methods estimate the number of pedestrians by integrating a predicted density map, so how to generate high-quality density maps becomes a key issue in crowd counting. Some research efforts have employed multi-column structures, multi-scale structures, and dilated convolutions to generate high-quality density maps. However, because the mean squared error loss makes it difficult to learn the spatial differences between the predicted density map and the ground-truth annotation, these methods suffer from attenuation of position information in the density map and inaccurate prediction of pedestrian positions. By observation, the predicted density map typically lacks location information strong enough to distinguish foreground from background regions, which severely impacts prediction accuracy. There are two main cases: (i) certain background regions are overestimated due to background noise; (ii) some crowd foreground regions in the image are identified as background, resulting in underestimation. To address this problem, attention modules have been used to mitigate background noise, and a maximum-excess-over-pixels loss has been used to preserve the high-frequency spatial variation of density. However, the features or density maps used to enhance the network's spatial awareness in these methods are optimized for crowd counting, which makes them insensitive to location information and limits the improvement. Therefore, more accurate location information should be learned to better express the spatial distribution and thereby facilitate crowd counting.
Crowd localization methods can be roughly divided into two categories according to the form of supervision: box-supervised methods and point-supervised methods. Box-supervised methods typically apply an object detection model to heads or faces to locate pedestrians. These methods suffer from small objects and severe occlusion, resulting in low recall; moreover, bounding-box annotation is labor-intensive, especially in densely crowded scenes. Point-supervised methods try to regress a location map or a binary map and estimate the head center locations through post-processing. Compared with density maps, location maps provide more accurate object instance locations but lack density information, and thus tend to duplicate or miss objects in locally dense regions.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a double-branch combined target dense prediction method and a double-branch combined target dense prediction system, and solves the problems of density map position information attenuation, inaccurate target position prediction and the like in the prior art.
The technical scheme adopted by the invention for solving the problems is as follows:
a double branch combined target dense prediction method, the method comprises the following steps:
s1, generating a characteristic map by using an image of a target to be predicted;
s2, generating a prediction density map by using the counting features in the feature map, and generating a prediction position map by using the positioning features in the feature map;
and S3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
As a preferred technical solution, the step S3 includes the following steps:
s31, constructing a position sensing module with a training function, and enabling the position sensing module to generate a prediction mask image by using positioning features from the feature image;
and S32, enabling the position perception module to update and optimize the generated prediction density map by using the prediction mask map.
As a preferred technical solution, in the step S31, in the process of generating the prediction mask map, the labeled location map is further converted into a labeled mask map for supervising the training of the location awareness module.
As a preferred technical solution, the method further comprises the following steps:
s33, measuring the difference between the predicted density graph and the labeled density graph by using an MSE loss function, and updating and optimizing the generated predicted density graph by using the MSE loss function, wherein the formula of the MSE loss function is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map.
As a preferred technical solution, the step S3 further includes the following steps:
s34, constructing an adaptive density sensing module with a training function, and enabling the adaptive density sensing module to generate a prediction attention map by using counting features from the feature map;
and S35, enabling the position perception module to update and optimize the generated predicted position map by using the predicted attention map.
As a preferred technical solution, the step S3 further includes the following steps:
s36, finding out a dense area in the density map by utilizing the prediction attention map, adjusting the weight of areas with different density levels pixel by pixel in a positioning loss function, and supervising the self-adaptive density sensing module.
As a preferred technical solution, in step S36, a density-aware localization loss function is used as the positioning loss function. The density-aware localization loss is a variant of the focal loss in which the localization loss of each pixel is weighted according to the predicted attention map, wherein L_dal represents the density-aware localization loss function, M represents the number of positive samples, A_p represents the pixel value of the p-th pixel point of the foreground map of the predicted attention map, Y_p represents the pixel value of the p-th pixel point of the foreground map of the annotated position map, P_pr is the center point map of the predicted position map, P_pr(p) represents the pixel value of the p-th pixel point of the center point map of the predicted position map, and γ and δ represent hyper-parameters, both being positive real numbers.
As a preferred technical solution, in step S32, the training of the location awareness module is supervised by using a focal loss function. The focal mask loss is computed pixel by pixel between the prediction mask map and the annotation mask map, wherein L_mask represents the mask loss function, H represents the height of the prediction mask map, W represents the width of the prediction mask map, M_pr represents the prediction mask map, M_pr(p) represents the pixel value of the p-th pixel point of the prediction mask map, and M_gt(p) represents the pixel value of the p-th pixel point of the annotation mask map.
A dual-branch combined target dense prediction system is based on the dual-branch combined target dense prediction method and comprises the following modules:
a feature map generation module: used for generating a feature map by using an image of a target to be predicted;
a density map generation module: used for generating a predicted density map by using the counting features in the feature map;
a location map generation module: used for generating a predicted position map by using the positioning features in the feature map;
a location awareness module: used for updating and optimizing the generated predicted density map by using the process information for generating the predicted position map.
As a preferred technical solution, the method further comprises the following modules:
an adaptive density sensing module: used for updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method relieves the problem of position information attenuation in the density map, reduces the position map errors in dense regions, and improves the accuracy of target dense prediction;
(2) The invention integrates a positioning branch and a location awareness module to provide complementary features and a refined density map, and outperforms existing crowd counting methods by a large margin on the NWPU-Crowd and JHU-Crowd++ datasets;
(3) According to the invention, the density map generated by the counting branch is refined by directly using the position information generated by the positioning branch, so that a better prediction effect is achieved; the new density-aware localization loss (DAL) is employed to improve the localization performance in dense regions by increasing the proportion of dense-region losses in the overall loss. Finally, more accurate position information is obtained to refine the density map and promote crowd counting;
(4) The invention provides an adaptive density perception module (ADAM) that generates a predicted attention map to guide the position map to focus on dense regions. Based on the density information of the counting branch, ADAM alleviates the problem that the position map is prone to errors in dense regions, adaptively highlighting the error-prone dense regions in the image while ignoring the regions that are easy to localize.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a dual-branch joint type target dense prediction method according to the present invention;
fig. 2 is a schematic structural diagram of a dual-branch combined target dense prediction system according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Examples
As shown in fig. 1 and 2, SCALNet treats crowd counting and localization as dense prediction problems and integrates them into a single framework, in which the model independently outputs a predicted position map and a predicted density map. However, the link between crowd counting and crowd localization has not been thoroughly explored. Inspired by SCALNet, we investigate in depth how to improve crowd counting through localization.
Motivated by the need to solve the problems existing in the prior art, a double-branch combined dense counting network is provided to alleviate the position-information attenuation of the predicted density map and the errors of the predicted position map in dense regions. First, an image is input into the network, and a predicted density map and a predicted position map are generated by the counting branch and the positioning branch, respectively. To address the position-information attenuation of the predicted density map, a location awareness module (LAM) predicts a mask map that, using the position information of the positioning branch, highlights the foreground regions and suppresses the background regions of the predicted density map. In addition, an annotation mask map is obtained by binarizing the annotation position map and is used to supervise the optimization of the LAM. An adaptive density perception module (ADAM) is also proposed to generate a predicted attention map that guides the predicted position map to focus on dense regions. Based on the density information of the counting branch, ADAM alleviates the tendency of the predicted position map to make errors in dense regions. Furthermore, a new density-aware localization loss (DAL) is employed to improve localization performance in dense regions by increasing the proportion of dense-region losses in the overall loss. Finally, more accurate position information is obtained to refine the predicted density map and thereby promote crowd counting.
The invention has the following characteristics:
1. A double-branch combined network;
in order to alleviate the challenges caused by the position attenuation of the solution density map and the error of the dense region prediction position map, a double-branch combined dense counting network is proposed to promote the crowd counting through positioning. The network sharing backbone network (namely a characteristic diagram generation module) adopts a counting branch and a positioning branch for extracting deep characteristics, and the two branches are optimized under mutual guidance. In the method, firstly, an image is input into a backbone network, and then extracted features are input into the two branches to respectively obtain a prediction density map and a prediction position map. A location-aware module (LAM) utilizes features from the located branches to generate a prediction mask map, which contains foreground and background that can be used to refine the prediction density map. Meanwhile, the label position image is binarized into a label mask image only containing foreground and background information to supervise the training of the LAM. In addition, an adaptive density perception module (ADAM) uses features from the counting branch to generate a predictive attention map containing density information. The prediction attention is directed to finding dense regions in the predicted density map in order to adjust the weights of the different density level regions pixel by pixel in the localization loss function. And finally, integrating the whole predicted density graph to obtain a counting result. The localization result includes the center points of the head, which are the peaks of the predicted location map, obtained by local maximum filtering and threshold selection. In particular, in view of the necessity of high resolution representation, HRNet is used as the backbone network of the network. The position branch is a continuous layer of one convolution and two transposed convolutions defined as:
Figure BDA0003475673250000071
the counting branch consists of three convolutional layers, defined as:
Figure BDA0003475673250000072
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003475673250000073
representing the convolutional layer with kernel k and step size s, output represents the number of output channels,
Figure BDA0003475673250000074
transposed convolutional layer with 4 cores, step size 2, and number of output channels 64,
Figure BDA0003475673250000075
Denotes a transposed convolution layer with kernel 4, step size 2, and output channel number 1, BN denotes a batch normalization operation, and R denotes a ReLU operation.
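For illustration, the following is a minimal PyTorch sketch of the two prediction heads and of the local-maximum peak extraction described above; the class and function names, the kernel size and channel width of every layer whose parameters are not stated in the text, the sigmoid on the position map, the peak threshold, and the input feature width are assumptions rather than the exact configuration of the invention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalizationBranch(nn.Module):
    # One convolution followed by two transposed convolutions (kernel 4, stride 2),
    # producing a single-channel predicted position map; the first layer's kernel size,
    # all channel widths, and the sigmoid on the output are assumptions.
    def __init__(self, in_channels=270):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

class CountingBranch(nn.Module):
    # Three convolutional layers producing a single-channel predicted density map;
    # kernel sizes and intermediate widths are assumptions.
    def __init__(self, in_channels=270):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, feats):
        return self.head(feats)

def extract_peaks(position_map, threshold=0.5, window=3):
    # Local-maximum filtering plus threshold selection: a pixel is a head center if it
    # equals the maximum of its neighborhood and exceeds the threshold (values assumed).
    pooled = F.max_pool2d(position_map, kernel_size=window, stride=1, padding=window // 2)
    peaks = (position_map == pooled) & (position_map > threshold)
    return peaks.nonzero()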
2. A Location Awareness Module (LAM);
due to the optimization goal of the counting fingers and the characteristics of the MSE loss, the density map generated by the counting fingers is more susceptible to interference from background noise. In contrast, the positioning branch is used for modeling the position of the human head in the image, and the features from the positioning branch contain more accurate position information of the object instance, which can help predict the density map to enhance the position information. An intuitive idea is to use the predicted location map of the positioning branch generation directly to refine the predicted density map of the counting branch generation. In practice, the locating branch locates the center point of the head by pixel-by-pixel classification. In the predicted position map, only the position of the center point of the human head is regarded as a positive sample having a higher value. The pixels around the center point are all marked as negative examples, their values decreasing with distance from the center point. The predicted location map may improve the location sensitivity of the predicted density map to some extent, but may also affect the results of some foreground regions. Therefore, it is necessary to generate binary classification maps containing foreground and background information by using the features of the positioning branches.
The location awareness module (LAM) aims to identify the foreground and background regions of an image and refine the predicted density map. The features of the first convolutional layer of the positioning branch are input into the LAM to generate a prediction mask map, which is directly used to focus the foreground region of the predicted density map. This operation can be expressed as follows:
D_{rf}^{i}(p) = M(p) \cdot D^{i}(p), \quad \forall p \in X
wherein D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th refined predicted density map, D^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map before refinement, M(p) represents the pixel value of the p-th pixel point of the mask map, p is the index of the pixel point of the image of the target to be predicted, and X is the set of pixels of the image of the target to be predicted.
Furthermore, we explore the similarities and differences between the optimization objectives of the LAM and the positioning branch. The LAM is used to identify the head regions in the image, while the positioning branch is intended to locate the head center points. The annotation mask map is generated by binarizing the annotation position map and can be used to guide the optimization of the LAM. The location awareness module consists of multiple convolutional layers followed by a Sigmoid activation function.
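A minimal sketch of such a location awareness module, of the mask-based refinement, and of the label-mask binarization is given below; the number of layers, the channel widths, the binarization threshold, and the names used are assumptions, while the convolution-plus-Sigmoid composition and the mask-times-density refinement follow the description above.

import torch.nn as nn

class LocationAwareModule(nn.Module):
    # Multi-layer convolution followed by a Sigmoid that predicts a foreground/background
    # mask from positioning-branch features; layer count and widths are assumptions.
    def __init__(self, in_channels=64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, loc_feats):
        return self.layers(loc_feats)

def refine_density(coarse_density, mask):
    # D_rf(p) = M(p) * D(p): keep foreground density responses, suppress background ones.
    return coarse_density * mask

def make_label_mask(label_position_map, threshold=0.0):
    # Binarize the annotated position map into a foreground/background label mask
    # used to supervise the LAM (the binarization threshold is an assumption).
    return (label_position_map > threshold).float()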
3. An adaptive density perception module (ADAM);
although the position branch has rich position information and can assist the prediction density map to suppress background noise, the prediction position map is easy to repeatedly count or miss-detect pedestrians in a dense area. To address this problem, a simple but effective solution is to further constrain the predicted location map in a dense population. The predicted density map reflects the density of people for different pixel regions in the image, including density regions of different levels. If we directly use the predicted density map to adjust the weights of the localization loss of different regions, although dense regions can be highlighted to some extent, the loss weights of other regions can also be increased. This also inhibits the locating branch from effectively focusing on the error-prone dense areas. Therefore, adaptively highlighting error-prone dense areas in an image while ignoring easily located areas is a better solution. Inspired by the attention mechanism, an adaptive density perception module (ADAM) is proposed to adaptively focus on error-prone dense regions of a location map based on density information of counting branches. In ADAM, features of the counting branch are taken to generate a predictive attention map. Meanwhile, a new density sensing positioning loss function is proposed to improve the performance of the predicted position map in the dense area, and the function utilizes the predicted attention map to increase the weight of the error-prone area loss in the final loss. The structural composition of ADAM is:
Figure BDA0003475673250000091
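Since the internal structure of ADAM is not spelled out above, the following is only a rough sketch of one plausible arrangement that turns counting-branch features into a per-pixel attention map; the layer stack, the Sigmoid normalization, and the class name are assumptions.

import torch.nn as nn

class AdaptiveDensityAwareModule(nn.Module):
    # Maps counting-branch features to a per-pixel attention map highlighting dense,
    # error-prone regions; the concrete layer stack and output normalization are assumed.
    def __init__(self, in_channels=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, count_feats):
        return self.layers(count_feats)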
4. a loss function.
1) Counting loss:
the MSE loss is used to measure the difference between the predicted density map and the annotated density map. The formula for MSE loss is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map.
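For reference, a minimal sketch of this counting loss in PyTorch; the function name and the (N, 1, H, W) batch layout are assumptions.

def mse_counting_loss(pred_density, gt_density):
    # L_MSE = (1/N) * sum over images and pixels of the squared per-pixel difference;
    # pred_density and gt_density are tensors of shape (N, 1, H, W).
    n = pred_density.shape[0]
    return ((pred_density - gt_density) ** 2).sum() / n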
2) Density-aware localization loss:
we take a variant of the focus loss as the basic loss function. In order to solve the problem that a dense region location map is prone to errors, a novel density-sensing localization loss (DAL) is designed by combining an ADAM-generated prediction attention map with a basic loss function. DAL loss demonstrates the performance of locating branches by increasing the proportion of dense error-prone regions in the overall loss based on the results of predictive attention maps. DAL loss can be expressed as follows:
Figure BDA0003475673250000101
wherein L is dal Representing the density-aware localization loss function, M representing the number of positive samples, A p Pixel value, Y, of the p-th pixel point of the foreground map representing the predictive attention map p Representing the pixel value, P, of the P-th pixel of the foreground map of the annotated position map pr Is a center point diagram of the predicted position map,
Figure BDA0003475673250000102
and expressing the pixel value of the p-th pixel point of the central point diagram of the predicted position diagram, wherein gamma and delta both express hyper-parameters, and gamma and delta both are positive real numbers.
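As a rough illustration only, the following sketches one plausible instantiation of such a density-aware localization loss: a focal-loss variant over the predicted center-point map in which each pixel's term is additionally up-weighted by the predicted attention map. The (1 + A_p) weighting, the roles assigned to γ and δ, their default values, and the function name are all assumptions and not necessarily the exact form used by the invention.

import torch

def density_aware_localization_loss(pred_center, gt_center, attention,
                                    gamma=2.0, delta=4.0, eps=1e-6):
    # One assumed instantiation: a focal-loss variant over the predicted center-point map
    # P_pr, with each pixel's term scaled by (1 + A_p) so that dense, error-prone regions
    # take a larger share of the total loss. All tensors have shape (N, 1, H, W);
    # the (1 + A_p) weighting and the default gamma/delta values are assumptions.
    pred = pred_center.clamp(eps, 1.0 - eps)
    pos = (gt_center == 1).float()
    weight = 1.0 + attention
    pos_term = pos * (1.0 - pred) ** gamma * torch.log(pred)
    neg_term = (1.0 - pos) * (1.0 - gt_center) ** delta * pred ** gamma * torch.log(1.0 - pred)
    num_pos = pos.sum().clamp(min=1.0)
    return -(weight * (pos_term + neg_term)).sum() / num_pos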
3) Mask loss:
the mask penalty employs a focus penalty that is used to constrain the LAM module for foreground identification. The loss of focus is given by:
the formula for the focus loss function is:
Figure BDA0003475673250000103
wherein L is mask Representing the mask loss function, H representing the height of the prediction mask map, W representing the width of the prediction mask map, M pr A prediction mask map is represented that represents the prediction mask map,
Figure BDA0003475673250000104
the pixel value representing the p-th pixel point of the prediction mask map,
Figure BDA0003475673250000105
and representing the pixel value of the p-th pixel point of the labeling mask image.
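Likewise, the following is a minimal sketch of a pixel-wise focal mask loss averaged over the H × W pixels, offered as one plausible reading; the focusing exponent, its default value, and the function name are assumptions.

import torch

def focal_mask_loss(pred_mask, gt_mask, gamma=2.0, eps=1e-6):
    # Pixel-wise focal loss between the predicted mask M_pr and the binary annotation
    # mask, averaged over the H * W pixels of each of the N maps; the focusing exponent
    # gamma and its default value are assumptions.
    pred = pred_mask.clamp(eps, 1.0 - eps)
    pos_term = gt_mask * (1.0 - pred) ** gamma * torch.log(pred)
    neg_term = (1.0 - gt_mask) * pred ** gamma * torch.log(1.0 - pred)
    n, _, h, w = pred_mask.shape
    return -(pos_term + neg_term).sum() / (n * h * w)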
The final loss is obtained as a weighted sum of the MSE loss, the DAL loss, and the mask loss:
L_{final} = L_{MSE} + \lambda_{1} L_{dal} + \lambda_{2} L_{mask}
wherein λ_1 and λ_2 are introduced to balance the loss weights.
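A short sketch of the weighted combination; the default values of λ1 and λ2 below are placeholders, since no numeric values are stated.

def total_loss(l_mse, l_dal, l_mask, lambda1=1.0, lambda2=1.0):
    # L_final = L_MSE + lambda1 * L_dal + lambda2 * L_mask; the default weights are
    # placeholders rather than values taken from the text.
    return l_mse + lambda1 * l_dal + lambda2 * l_mask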
Due to the adoption of the technical scheme of the invention, the following technical effects are realized:
1. counting effect:
in the population counting task, we compared the proposed method with the most advanced method on four reference datasets NWPU-Crowd, shanghai science a and B parts, and JHU-Crowd + +. For the convenience of comparison, the methods are classified into a crowd counting method and a crowd positioning method according to whether the position information is output or not.
Compared with crowd counting methods, our method achieves the best performance on the NWPU-Crowd, ShanghaiTech Part A, and JHU-Crowd++ datasets. In particular, compared with DM-Count, our method reduces the MAE on the NWPU-Crowd dataset by 18 and the MSE by 63.5. Meanwhile, our method achieves comparable performance on the ShanghaiTech Part B dataset. The proposed method integrates a localization branch and a location awareness module to provide complementary features and refine the predicted density map, outperforming crowd counting methods by a large margin on the NWPU-Crowd and JHU-Crowd++ datasets.
Compared with crowd localization methods, our method outperforms the state of the art on the NWPU-Crowd dataset and obtains the best MAE result on the JHU-Crowd++ dataset. Relative to P2PNet, our method improves the MAE and MSE on the NWPU-Crowd dataset by 12.3% and 8.6%, respectively. Our method also achieves comparable performance on the ShanghaiTech Part A and Part B datasets. Images in the ShanghaiTech Part A and Part B datasets have low resolution, and the proportion of background regions in the images is relatively small. In contrast, the images in the NWPU-Crowd and JHU-Crowd++ datasets have higher resolution and contain more complex backgrounds. The proposed method employs HRNet with a high downsampling rate as the backbone network and utilizes the position information from the positioning branch to mitigate the background noise of the predicted density map. In conclusion, the method makes great progress on the NWPU-Crowd and JHU-Crowd++ datasets and also has a positive effect on ShanghaiTech Part A and Part B to a certain extent.
2. The positioning effect is as follows:
for fair comparison, the method is divided into a frame supervision method and a point supervision method. Compared to the results for P2Pnet, our method improved the F1 value and recall by 3.6% and 4.4%, respectively. Although FasterRCNN, supervised by box, achieved the highest accuracy, its recall rate was very poor at 3.5%. The proposed model employs an adaptive density awareness module and density awareness loss to effectively improve model recall. To verify the robustness of the proposed method, we reproduced four most advanced methods on the Shanghai science and technology part A dataset using the original setup. In particular, crowd-SDNet1 and SCALNet2 are trained with official codes. Fasterncn 3 and TinyFaces4 are implemented by common code. The proposed method is superior to point surveillance methods in terms of F1 value, accuracy and recall. Although our process increased the recall by 5.5%, the results were still lower than TinyFaces. This is because TinyFaces uses box information as supervision and adopts a paradigm of a specific scale to optimize small target detection, effectively improving recall. Our point surveillance approach achieves crowd localization from a dense prediction perspective and uses high shrinkage rates in the model. The resolution of images in shanghai science and technology part a is generally low. The highly-reduced model is easy to lose information when extracting image features, and recall rate results are influenced.
In the method, HRNet is adopted as the network backbone. An adaptive moment estimation (Adam) optimizer is used to train the model, with an initial learning rate of 1e-4 for the NWPU-Crowd dataset and 1e-6 for the other datasets. The learning rate is decayed by a factor of 0.99 every epoch. The number of training epochs is 600 and the batch size is set to 16. In addition, random horizontal flipping and random cropping are used to augment the training data, where the crop size is 256 × 256 for ShanghaiTech Part A and Part B and 512 × 512 for the other datasets. We also resize the images so that the long edge of images in the NWPU-Crowd and JHU-Crowd++ datasets is less than 2048 pixels.
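The training configuration above maps onto standard PyTorch components as sketched below; the data pipeline, the model interface, and the use of an exponential scheduler to realize the per-epoch 0.99 decay are assumptions.

from torch.optim import Adam
from torch.optim.lr_scheduler import ExponentialLR
from torch.utils.data import DataLoader

def train(model, train_dataset, device="cuda", initial_lr=1e-4, epochs=600, batch_size=16):
    # Adam optimizer, initial learning rate 1e-4 (NWPU-Crowd; 1e-6 for the other datasets),
    # learning rate multiplied by 0.99 after every epoch, 600 epochs, batch size 16.
    # The dataset is assumed to yield (image, gt_density, gt_position, gt_mask) tuples and
    # the model to return (density, position, mask, attention); both are assumptions, as
    # are the loss helpers sketched earlier, which are assumed to be in scope.
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = Adam(model.parameters(), lr=initial_lr)
    scheduler = ExponentialLR(optimizer, gamma=0.99)
    model.to(device).train()
    for epoch in range(epochs):
        for images, gt_density, gt_position, gt_mask in loader:
            density, position, mask, attention = model(images.to(device))
            loss = total_loss(
                mse_counting_loss(density, gt_density.to(device)),
                density_aware_localization_loss(position, gt_position.to(device), attention),
                focal_mask_loss(mask, gt_mask.to(device)),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()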
It is worth mentioning that:
Preferably, as shown in fig. 2, the counting branch and the positioning branch first generate the predicted density map and the predicted position map, respectively; the location awareness module then outputs a prediction mask map to refine the predicted density map; the adaptive density perception module adaptively highlights the error-prone regions of the predicted position map; and the density-aware localization loss takes the predicted position map and the predicted attention map as inputs for the loss calculation.
In the description of the steps of the method of the present invention, the order of the steps is not particularly emphasized, and in many cases the description is not limited to a single order: for example, steps S31, S32, S34, and S35 may be executed with step S31 and step S32 first and then step S34 and step S35, or with step S34 and step S35 first and then step S31 and step S32. Such variations are considered to be included in the technical solution disclosed by the invention as long as they do not violate common sense, the technical logic of the field, or the inventive concept disclosed by the invention.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (3)

1. A dual-branch combined target dense prediction method is characterized by comprising the following steps:
s1, generating a characteristic map by using an image of a target to be predicted;
s2, generating a prediction density map by using the counting features in the feature map, and generating a prediction position map by using the positioning features in the feature map;
s3, updating and optimizing the generated predicted density map by using the process information for generating the predicted position map, and/or updating and optimizing the generated predicted position map by using the process information for generating the predicted density map;
step S3 includes the following steps:
s31, constructing a position sensing module with a training function, and enabling the position sensing module to generate a prediction mask image by using positioning features from the feature image;
in step S31, in the process of generating the prediction mask map, the labeled position map is further converted into a labeled mask map for supervising the training of the position sensing module;
s32, enabling the position perception module to update and optimize the generated prediction density map by using the prediction mask map;
s33, measuring the difference between the predicted density graph and the marked density graph by using an MSE (mean square error) loss function, and updating and optimizing the generated predicted density graph by using the MSE loss function, wherein the formula of the MSE loss function is as follows:
L_{MSE}(D_{rf}, D_{gt}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{p} \left( D_{rf}^{i}(p) - D_{gt}^{i}(p) \right)^{2}
wherein L_MSE(D_rf, D_gt) represents the MSE loss function between the predicted density map and the annotated density map, N represents the number of input images of the target to be predicted (N ≥ 1, an integer), D_rf represents the predicted density map, D_gt represents the annotated density map, i represents the index of the image of the target to be predicted (1 ≤ i ≤ N, a positive integer), p represents the index of the pixel point of the image of the target to be predicted, D_rf^i(p) represents the pixel value of the p-th pixel point of the i-th predicted density map, and D_gt^i(p) represents the pixel value of the p-th pixel point of the i-th annotated density map;
s34, constructing an adaptive density sensing module with a training function, and enabling the adaptive density sensing module to generate a prediction attention diagram by using counting features from the feature diagram;
s35, enabling the position perception module to update and optimize the generated predicted position map by using the predicted attention map;
s36, finding out a dense area in the density map by utilizing the prediction attention map, adjusting the weight of areas with different density levels pixel by pixel in a positioning loss function, and supervising the self-adaptive density sensing module;
in step S36, a density-aware localization loss function is used as the positioning loss function, the density-aware localization loss being a variant of the focal loss in which the localization loss of each pixel is weighted according to the predicted attention map, wherein L_dal represents the density-aware localization loss function, M represents the number of positive samples, A_p represents the pixel value of the p-th pixel point of the foreground map of the predicted attention map, Y_p represents the pixel value of the p-th pixel point of the foreground map of the annotated position map, P_pr is the center point map of the predicted position map, P_pr(p) represents the pixel value of the p-th pixel point of the center point map of the predicted position map, and γ and δ represent hyper-parameters, both being positive real numbers.
2. The method according to claim 1, wherein in step S32, the training of the location awareness module is supervised by a focal loss function computed pixel by pixel between the prediction mask map and the annotation mask map, wherein L_mask represents the mask loss function, H represents the height of the prediction mask map, W represents the width of the prediction mask map, M_pr represents the prediction mask map, M_pr(p) represents the pixel value of the p-th pixel point of the prediction mask map, and M_gt(p) represents the pixel value of the p-th pixel point of the annotation mask map.
3. A dual-branch combined target dense prediction system, characterized in that, based on the method of claim 1 or 2, the system comprises the following modules:
a feature map generation module: used for generating a feature map by using an image of a target to be predicted;
a density map generation module: used for generating a predicted density map by using the counting features in the feature map;
a location map generation module: used for generating a predicted position map by using the positioning features in the feature map;
a location awareness module: used for updating and optimizing the generated predicted density map by using the process information for generating the predicted position map;
an adaptive density sensing module: used for updating and optimizing the generated predicted position map by using the process information for generating the predicted density map.
CN202210058467.6A 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system Active CN114494999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210058467.6A CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210058467.6A CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Publications (2)

Publication Number Publication Date
CN114494999A CN114494999A (en) 2022-05-13
CN114494999B true CN114494999B (en) 2022-11-15

Family

ID=81473215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210058467.6A Active CN114494999B (en) 2022-01-18 2022-01-18 Double-branch combined target intensive prediction method and system

Country Status (1)

Country Link
CN (1) CN114494999B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024044875A1 (en) * 2022-08-29 2024-03-07 Robert Bosch Gmbh Computer-implemented method and network for dense prediction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN113409246A (en) * 2021-04-14 2021-09-17 宁波海棠信息技术有限公司 Method and system for counting and positioning reinforcing steel bar heads

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241895B (en) * 2018-08-28 2021-06-04 北京航空航天大学 Dense crowd counting method and device
CN110188597B (en) * 2019-01-04 2021-06-15 北京大学 Crowd counting and positioning method and system based on attention mechanism cyclic scaling
CN109903282B (en) * 2019-02-28 2023-06-09 安徽省农业科学院畜牧兽医研究所 Cell counting method, system, device and storage medium
CN110020658B (en) * 2019-03-28 2022-09-30 大连理工大学 Salient object detection method based on multitask deep learning
CN110866445A (en) * 2019-10-12 2020-03-06 西南交通大学 Crowd counting and density estimation method based on deep learning
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109903339A (en) * 2019-03-26 2019-06-18 南京邮电大学 A kind of video group personage's position finding and detection method based on multidimensional fusion feature
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN113409246A (en) * 2021-04-14 2021-09-17 宁波海棠信息技术有限公司 Method and system for counting and positioning reinforcing steel bar heads

Also Published As

Publication number Publication date
CN114494999A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
Li et al. Adaptively constrained dynamic time warping for time series classification and clustering
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN110147743A (en) Real-time online pedestrian analysis and number system and method under a kind of complex scene
CN106886995A (en) Polyteny example returns the notable object segmentation methods of image of device polymerization
Wang et al. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection
Wan et al. AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning
CN111325750B (en) Medical image segmentation method based on multi-scale fusion U-shaped chain neural network
CN111091101B (en) High-precision pedestrian detection method, system and device based on one-step method
Jiang et al. A self-attention network for smoke detection
Lei et al. Boundary extraction constrained siamese network for remote sensing image change detection
CN111368634B (en) Human head detection method, system and storage medium based on neural network
Cheng et al. YOLOv3 Object Detection Algorithm with Feature Pyramid Attention for Remote Sensing Images.
CN109697727A (en) Method for tracking target, system and storage medium based on correlation filtering and metric learning
CN114494999B (en) Double-branch combined target intensive prediction method and system
Viraktamath et al. Comparison of YOLOv3 and SSD algorithms
Li et al. Robust detection of farmed fish by fusing YOLOv5 with DCM and ATM
CN114049503A (en) Saliency region detection method based on non-end-to-end deep learning network
CN117315284A (en) Image tampering detection method based on irrelevant visual information suppression
Kong et al. Collaborative model tracking with robust occlusion handling
CN116824488A (en) Target detection method based on transfer learning
CN116451081A (en) Data drift detection method, device, terminal and storage medium
CN114067359B (en) Pedestrian detection method integrating human body key points and visible part attention characteristics
CN113344005B (en) Image edge detection method based on optimized small-scale features
CN115578364A (en) Weak target detection method and system based on mixed attention and harmonic factor
Ji et al. Influence of embedded microprocessor wireless communication and computer vision in Wushu competition referees’ decision support

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant