CN110188597A

CN110188597A - Dense crowd counting and accurate localization method and system based on attention-mechanism recurrent zooming

Info

Publication number: CN110188597A
Application number: CN201910293903.6A
Authority: CN
Inventors: 陈刚; 刘臣臣; 王成成; 黄波; 韩峻; 糜俊青; 翁昕钰; 穆亚东
Original assignee: Mid Star Technology Ltd By Share Ltd; Peking University
Current assignee: Mid Star Technology Ltd By Share Ltd; Peking University
Priority date: 2019-01-04
Filing date: 2019-04-12
Publication date: 2019-08-30
Anticipated expiration: 2039-04-12
Also published as: CN110188597B

Abstract

The present invention relates to a dense crowd counting and accurate localization method and system based on attention-mechanism recurrent zooming. Unlike existing crowd counting methods that are based on density maps or that estimate the crowd size through face or pedestrian detection, the invention uses a carefully designed three-branch deep neural network to obtain, for an input image, a crowd counting density map, a crowd location map, and a dense-region proposal attention map. The initial crowd count in the image is obtained from the crowd counting density map; the position coordinates of each person in the image are obtained from the crowd location map; and several densely crowded regions in the image are obtained from the dense-region proposal attention map. These regions are cropped from the original image, their resolution is enlarged to twice the original, and they are fed into a subsequent network to obtain more accurate localization results.

Description

Dense crowd counting and accurate localization method and system based on attention-mechanism recurrent zooming

Technical field

The present invention relates to a method for counting dense crowds and accurately localizing people in an image, and more particularly to a method and system that obtain accurate crowd localization through attention-mechanism recurrent zooming. It belongs to the field of computer vision.

Background technique

As urbanization advances, urban populations have risen sharply, and video surveillance cameras are densely installed throughout many cities and increasingly used in daily work and life. One of the most important applications of this video data is intelligent video surveillance. In China, with a population of 1.3 billion, the problems caused by sheer population size have long been a threat to public security. Elsewhere in the world, too, overcrowding at large events can lead to uncontrollable incidents. Making effective use of surveillance data to deploy law-enforcement personnel sensibly, and to build additional transport facilities that guide and divert crowds, is therefore of great significance for maintaining public order and protecting personal safety. Traditional video surveillance, however, relies on people watching the feeds and reporting developments, which consumes considerable manpower and resources. Automated video analysis and processing not only frees up labor but can also mine massive amounts of video for useful knowledge and patterns. Crowd counting, as one field of video analysis, is important for pedestrian crowd analysis, emergency monitoring, traffic planning, and many other applications.

Existing crowd counting techniques fall into two broad categories: estimating the count by integrating a density map, and estimating the count by detecting faces or pedestrians. With the development of deep learning, many researchers use deep neural networks to learn a density map of the crowd and obtain the number of people in the picture by integrating it. This approach already achieves good accuracy. Its main drawback is that, although the integral of the learned density map is close to the number of people in the picture, the distribution of the learned density map differs considerably from the true density distribution, which hinders further crowd analysis.

Deep learning has also brought significant progress to the traditional object detection task, so some researchers estimate the crowd size by detecting the faces or pedestrians that appear in the image. Although this approach gives accurate positions of people and avoids the inaccurate distributions predicted by density-map methods, it has a serious problem of its own: existing face and pedestrian detectors perform poorly in extremely dense scenes. Crowd estimation typically involves exactly such scenes, in which faces and bodies are hard to make out, so detection-based methods rarely achieve good results there.

Summary of the invention

To address the inaccurate predictions of density-map-based methods and the poor performance of detection-based methods in dense scenes, the object of the present invention is to provide a dense crowd counting and accurate localization method and system based on attention-mechanism recurrent zooming. Using deep learning, the invention proposes an attention-based recurrent zooming network that turns the problem of estimating the crowd size in a dense picture into two problems: initial crowd estimation and accurate crowd localization.

Unlike existing density-map-based crowd counting methods and methods that estimate the crowd size through face or pedestrian detection, the invention uses a carefully designed three-branch deep neural network to obtain, for the input image, a crowd counting density map, a crowd location map, and a zooming region proposal attention map. The initial crowd count in the image is obtained from the crowd counting density map; the position coordinates of each person in the image are obtained from the crowd location map; and several densely crowded regions in the image are obtained from the zooming region proposal attention map. These regions are cropped from the original image, their resolution is enlarged to twice the original, and they are fed into the subsequent recurrent zooming network to obtain more accurate localization results. Because a crowd count can be obtained from both the crowd counting density map and the crowd location map, the invention also proposes a scene-adaptive weight, with which the two counts are combined by weighting to give a more accurate estimate of the crowd size.

The dense crowd counting and accurate localization method of the invention, based on attention-mechanism recurrent zooming, comprises the following steps:

1) establishing a three-branch deep neural network to obtain, for the input image, a crowd counting density map, a crowd location map, and a zooming region proposal attention map;

2) obtaining the initial crowd count in the image from the crowd counting density map, obtaining the position coordinates of each person in the image from the crowd location map, and obtaining several densely crowded regions in the image from the zooming region proposal attention map;

3) cropping the densely crowded regions from the image, obtaining accurate localization results by increasing their resolution, and updating the crowd location map with those results;

4) obtaining an accurate crowd count by weighting the crowd count obtained from the crowd counting density map and the crowd count obtained from the crowd location map.

The method is described in further detail below. Its detailed flow is shown in Fig. 1 and comprises the following steps:

Step1: Network construction and parameter initialization. As shown in Fig. 1, the proposed method comprises two main networks: the main network (MainNet) and the recurrent zooming network (Recurrent Attention Zooming Net, RAZNet for short). MainNet comprises a localization branch (Localization Branch), a counting branch (Counting Branch), and a zooming region proposal branch (Zooming Region Proposal Branch).

MainNet uses the first 13 layers of the VGG-16 network as its base network. The localization branch consists of dilated convolutional layers and 3 deconvolutional layers, and its final output is a single-layer feature map with the same resolution as the original image. The counting branch consists only of dilated convolutional layers and outputs a feature map at 1/8 of the original picture resolution. The feature map output by the localization branch is concatenated with the upsampled feature map output by the counting branch, and the result serves as the input to the zooming region proposal branch.

Compared with MainNet, RAZNet lacks the counting branch; the rest is identical to MainNet. We initialize the MainNet base network with VGG-16 parameters trained on the ImageNet dataset, and initialize RAZNet with the parameters of the trained MainNet.

Step2: Model training. To help the model converge, we train the three branches one after another in the order counting branch, localization branch, zooming region proposal branch. After MainNet has been trained, its parameters are used to initialize RAZNet, which is then fine-tuned.

Step3: Selection of the fusion weight. After the model has been trained, we obtain on the training set the crowd counts produced by the localization branch and by the counting branch. Using the known true crowd count of each image, we learn a fusion weight between the two branch counts that brings the predicted value closer to the true value.

Step4: Network inference. After the model has been trained, each test picture is passed through MainNet to obtain its crowd density map, crowd location map, and zooming region proposal attention map. Several dense regions are identified from the attention map and cropped from the original image, and their height and width are enlarged to twice the original. These enlarged pictures are passed through RAZNet to obtain new crowd location maps and zooming region proposal attention maps. Inference ends when no new dense region can be found in the zooming region proposal attention maps.
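The control flow of Step4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `main_net`, `raz_net`, `find_dense_regions`, and `crop_and_upscale` are hypothetical callables supplied by the caller, and `max_rounds` is an assumed safety bound that the source does not mention.

```python
def recurrent_zoom_inference(image, main_net, raz_net, find_dense_regions,
                             crop_and_upscale, max_rounds=8):
    """Sketch of the Step4 loop; every callable is a hypothetical placeholder.

    main_net(image)  -> (density_map, location_map, attention_map)
    raz_net(patch)   -> (location_map, attention_map)
    find_dense_regions(attention_map) -> list of region boxes
    crop_and_upscale(picture, box, factor) -> enlarged patch
    """
    density_map, location_map, attention_map = main_net(image)
    # Each pending item pairs a picture with the attention map computed for it.
    pending = [(image, attention_map)]
    refined = []  # (source picture, box, refined location map) triples
    rounds = 0
    while pending and rounds < max_rounds:
        next_pending = []
        for picture, attention in pending:
            for box in find_dense_regions(attention):
                # Crop the dense region and enlarge height and width twofold.
                patch = crop_and_upscale(picture, box, 2)
                patch_locations, patch_attention = raz_net(patch)
                refined.append((picture, box, patch_locations))
                next_pending.append((patch, patch_attention))
        # The loop stops once no attention map yields a new dense region.
        pending = next_pending
        rounds += 1
    return density_map, location_map, refined
```

The returned `refined` list keeps the box of every zoomed patch so that its localization result can later be mapped back onto the picture it was cropped from.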

Step5: Obtaining the final crowd count and head position coordinates. We take the positions of the peak points in the crowd location map as the final predicted head coordinates. To obtain the peak points, we first apply non-maximum suppression (NMS) to the crowd location map and then take every location whose response exceeds a certain threshold as a head localization point. Using the fusion weight obtained in Step3, we compute the fused crowd count of the counting branch and the localization branch as the final crowd count.

As shown in Fig. 1, the method contains two basic network modules, MainNet and RAZNet. MainNet has three branches and RAZNet has two. The names and functions of the network modules and branches are as follows:

1. Main network (MainNet): performs an initial crowd count and a coarse crowd localization on the initial input picture. The zooming region proposal attention map produced by this network guides the subsequent cropping and enlargement of dense regions.

2. Recurrent zooming network (RAZNet): performs crowd localization on the dense regions selected by MainNet to obtain more accurate localization results for those local regions. The network also produces its own zooming region proposal attention map, which determines whether regions are cropped again and passed through RAZNet once more.

3. Localization branch (Localization Branch): takes features from the base network, passes them through 6 dilated convolutional layers with 3 deconvolutional layers interspersed among them, and outputs a crowd location map with the same resolution as the network input image.

4. Counting branch (Counting Branch): takes features from the base network, passes them through 6 dilated convolutional layers, and outputs a crowd density map whose height and width are each 1/8 of those of the network input image.

5. Zooming region proposal branch (Zooming Region Proposal Branch): takes features from the localization branch and the counting branch, concatenates them as its input, passes them through 3 dilated convolutional layers, and outputs a zooming region proposal attention map with the same resolution as the network input image.

Corresponding to the above method, the present invention also provides a dense crowd counting and accurate localization system based on attention-mechanism recurrent zooming, comprising:

a main network module, which contains a three-branch deep neural network for obtaining, for the input image, a crowd counting density map, a crowd location map, and a zooming region proposal attention map; obtains the initial crowd count in the image from the crowd counting density map, the position coordinates of each person in the image from the crowd location map, and several densely crowded regions in the image from the zooming region proposal attention map; and crops the densely crowded regions from the image and increases their resolution;

a recurrent zooming network module, which takes the resolution-increased densely crowded regions as input, obtains accurate localization results, and updates the crowd location map with them;

a fusion counting module, which obtains an accurate crowd count by weighting the crowd count obtained from the crowd counting density map and the crowd count obtained from the crowd location map.

Compared with existing crowd counting techniques, the dense crowd counting and accurate localization method described in this invention, which performs recurrent zooming based on an attention mechanism, has the following advantages:

1. The described technique gives accurate positions of the people in the picture.

2. The attention mechanism automatically finds the densely crowded regions in the image, and accurate localization results are obtained by increasing the resolution of those regions.

3. The crowd counting and crowd localization results are fused using a scene-adaptive weight, which improves the accuracy of the crowd count.

Brief description of the drawings

Fig. 1 is a schematic diagram of the network structure;

Fig. 2 is a schematic diagram of the densely crowded candidate regions generated from the attention map.

Specific embodiment

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below through specific embodiments and the accompanying drawings.

1. Target data generation

For model training, each image needs a corresponding true crowd counting density map, true crowd location map (head location map), and true zooming region proposal attention map as training data.

(1) Generating the true crowd counting density map: following earlier work on crowd counting, we generate the crowd density map from the head coordinates given in the annotations. A Gaussian kernel is introduced for each annotated head. For each pixel coordinate $x$ in the true crowd counting density map, the density value $D(x)$ is computed by the formula below, where $N$ is the total number of heads in the image, the $N$ head coordinates are denoted $x_1, \dots, x_N$, $\bar{d}_i$ is the average distance from $x_i$ to its 4 nearest heads, $Z_i$ is the normalization parameter of the Gaussian kernel for each head, and $\beta$ is the scale factor applied to the distance, which we set empirically to 0.1.

$$D(x) = \sum_{i=1}^{N} \frac{1}{Z_i} \exp\left(-\frac{\lVert x - x_i \rVert^2}{2(\beta \bar{d}_i)^2}\right)$$
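The density map generation described above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `density_map` is hypothetical, heads are assumed to be integer (row, column) pixel coordinates inside the image, the Gaussian width is assumed to be β times the average distance to the 4 nearest heads, and the normalization parameter of each kernel is assumed to be its sum over the image grid, so that every head contributes a total mass of 1.

```python
import math

def density_map(heads, height, width, beta=0.1, k=4):
    """Return a height x width density map for integer (row, col) head points."""
    dmap = [[0.0] * width for _ in range(height)]
    for i, (hr, hc) in enumerate(heads):
        # Distances from head i to every other annotated head, nearest first.
        dists = sorted(math.hypot(hr - r, hc - c)
                       for j, (r, c) in enumerate(heads) if j != i)
        nearest = dists[:k]
        if nearest:
            sigma = max(beta * sum(nearest) / len(nearest), 1e-6)
        else:
            sigma = 1.0  # assumed fallback when the image holds a single head
        kernel = [[math.exp(-((r - hr) ** 2 + (c - hc) ** 2) / (2 * sigma ** 2))
                   for c in range(width)] for r in range(height)]
        z = sum(map(sum, kernel))  # assumed normalizer: kernel sum over the grid
        for r in range(height):
            for c in range(width):
                dmap[r][c] += kernel[r][c] / z
    return dmap
```

Because each kernel is normalized, summing the resulting map recovers the number of annotated heads, which is the property the counting branch relies on.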

(2) Generating the true head location map: we set each annotated head point and its four-neighborhood to 1 to obtain the final head location map.
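A minimal sketch of this labeling step follows. The function name `head_location_map` is hypothetical, heads are assumed to be integer (row, column) coordinates, and neighbors that fall outside the image are assumed to be skipped.

```python
def head_location_map(heads, height, width):
    """Binary map with each head point and its four-neighborhood set to 1."""
    lmap = [[0] * width for _ in range(height)]
    for r, c in heads:
        # The head pixel itself plus its up, down, left, and right neighbors.
        for dr, dc in ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < height and 0 <= cc < width:
                lmap[rr][cc] = 1
    return lmap
```

For a single head in the middle of a 3x3 image this produces a plus-shaped pattern of five ones.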

(3) Generating the true zooming region proposal attention map: for each pixel in the map, we find the three head positions nearest to it, compute the average of the three distances, and apply a Gaussian transformation to that value to obtain the pixel's response. The resulting attention map reflects how densely crowded the different regions are.
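The attention target can be sketched as below. The source does not give the exact form of the Gaussian transformation, so this sketch assumes the response exp(-d²/(2σ²)) for the average distance d; the function name `attention_map` and the width `sigma` are hypothetical.

```python
import math

def attention_map(heads, height, width, sigma=10.0, k=3):
    """Per-pixel crowd density response from the k nearest head distances."""
    amap = [[0.0] * width for _ in range(height)]
    if not heads:
        return amap  # no heads annotated: no dense region anywhere
    for r in range(height):
        for c in range(width):
            dists = sorted(math.hypot(r - hr, c - hc) for hr, hc in heads)
            nearest = dists[:k]
            mean_dist = sum(nearest) / len(nearest)
            # Assumed Gaussian transform: small distances give responses near 1.
            amap[r][c] = math.exp(-mean_dist ** 2 / (2 * sigma ** 2))
    return amap
```

Pixels surrounded by nearby heads receive responses close to 1, while pixels far from any head decay toward 0, which matches the stated purpose of reflecting crowd density per region.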

2. Network construction

The present invention is a method that uses deep learning for crowd counting and accurate crowd localization; the structure of its deep neural network is shown in Fig. 1. The network comprises two main networks, MainNet and RAZNet. MainNet uses the first 13 layers of VGG-16 as its base network, followed by three parts: the localization branch, the counting branch, and the zooming region proposal branch. RAZNet consists of two parts: the localization branch and the zooming region proposal branch. The detailed configuration parameters of MainNet and RAZNet are given in Table 1 below.

Table 1. Configuration parameters of MainNet and RAZNet

3. Training of MainNet and RAZNet

We train MainNet first. As described in section 2, MainNet consists of the counting branch, the localization branch, and the zooming region proposal branch. To help the model converge, we train them one after another in that order.

(1) For the counting branch, we use the MSE loss between the density map output by the branch and the true density map as the optimization objective. The MSE is computed as in the formula below, where $\varepsilon_{den}(I)$ is the loss on picture $I$, $m$ and $n$ are the height and width of the input picture, and $\phi(p)$ and $\hat{\phi}(p)$ are the predicted and true values at the $p$-th pixel of the output crowd counting density map.

$$\varepsilon_{den}(I) = \frac{1}{m \cdot n} \sum_{p=1}^{m \cdot n} \left(\phi(p) - \hat{\phi}(p)\right)^2$$
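As a small illustration of this objective, the per-image loss can be written as follows. The sketch assumes the squared differences are averaged over all pixels of the map; the exact normalization and the function name `mse_loss` are assumptions.

```python
def mse_loss(pred, target):
    """Mean squared difference between two equally sized 2-D maps."""
    m, n = len(pred), len(pred[0])
    total = sum((pred[r][c] - target[r][c]) ** 2
                for r in range(m) for c in range(n))
    return total / (m * n)
```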

(2) After the counting branch has converged, its learned parameters are used to initialize the localization branch. Unlike the counting branch, the localization branch uses a weighted binary cross-entropy (BCE) loss between the predicted head location map and the true head location map as its optimization objective. $\varepsilon_{loc}(I)$ is the BCE loss on picture $I$, accumulated over the per-pixel terms $l(x_p)$ defined below, where $m$ and $n$ are the height and width of the input picture, $Y(x_p)$ is the true value at the $p$-th pixel, $\psi(p)$ is the predicted value at the $p$-th pixel, and $\gamma$ is a weight that we set empirically to 100.

$$l(x_p) = -\gamma \, Y(x_p) \cdot \log(\psi(p)) - (1 - Y(x_p)) \cdot \log(1 - \psi(p))$$
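The per-pixel term above can be turned into a loss over a whole map as sketched below. The averaging over all pixels, the clamping constant `eps` that guards against log(0), and the function name `weighted_bce_loss` are assumptions of this sketch; only the per-pixel formula and γ = 100 come from the text.

```python
import math

def weighted_bce_loss(pred, target, gamma=100.0, eps=1e-12):
    """Weighted binary cross-entropy averaged over a 2-D map."""
    m, n = len(pred), len(pred[0])
    total = 0.0
    for r in range(m):
        for c in range(n):
            # Clamp the prediction so that neither logarithm sees zero.
            p = min(max(pred[r][c], eps), 1.0 - eps)
            y = target[r][c]
            # Head pixels (y = 1) are weighted gamma times more than background.
            total += -gamma * y * math.log(p) - (1 - y) * math.log(1 - p)
    return total / (m * n)
```

The large weight on positive pixels compensates for the fact that head pixels are a tiny fraction of the location map.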

(3) After the counting branch and the localization branch have been trained, we fix the parameters of both branches and begin training the zooming region proposal branch, which uses the MSE loss as its optimization objective.

After MainNet has been trained, we train RAZNet, which retains only the localization branch and the zooming region proposal branch. The training data for RAZNet differ from those for MainNet: as shown in Fig. 2, we use the attention map to find several densely crowded regions in the original image and use them as training samples for RAZNet. Because the network structures of RAZNet and MainNet are almost identical, we initialize RAZNet with the parameters learned by MainNet and then fine-tune the localization branch and the zooming region proposal branch in turn.

4. Obtaining the total number of people in the image from the counting branch

Integrating the crowd density map produced by the counting branch gives the total number of people in the image as predicted by that branch.

5. Obtaining head position coordinates and the total number of people from the localization branch

The localization branch outputs a head location map. To obtain the final head coordinates, we extract the local peak points in this map by a non-maximum suppression (NMS) operation:

1) we first pass the head location map through an average pooling layer with a 3x3 kernel to highlight the possible peak points in each local region;

2) we then pass the result of step 1) through a max pooling layer with a 3x3 kernel and compare the max-pooled map with the map before max pooling pixel by pixel; the positions at which the two maps are equal are the local peak points we need;

3) among the peak points obtained in step 2), those whose response exceeds a certain threshold are taken as the final head position coordinates;

4) counting the head position coordinates obtained gives the total number of people in the image.

6. Learning the fusion weight of the localization branch and the counting branch according to the scene

After model training, we obtain the crowd counts of the counting branch and the localization branch on the training set as described in sections 4 and 5. Using the known true crowd count of each image, we learn a fusion weight between the localization branch count and the counting branch count (denoted α in Fig. 1) that brings the predicted value closer to the true value. For example, when the crowd counts from the counting branch and the localization branch differ by more than 150, the counting branch's value is the more accurate one, and we choose to trust the counting branch's result.
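The source does not state how the fusion weight is fitted, so the following is only one possible sketch. It assumes a single scalar α that multiplies the localization count (with 1 - α on the counting branch), fitted by least squares on the training set and clamped to [0, 1], and it applies the 150-person gap rule from the example above. The function names, the clamping, and the 0.5 fallback are all assumptions.

```python
def learn_fusion_weight(loc_counts, den_counts, true_counts):
    """Least-squares alpha for alpha * loc + (1 - alpha) * den ~ true."""
    num = sum((a - b) * (t - b)
              for a, b, t in zip(loc_counts, den_counts, true_counts))
    den = sum((a - b) ** 2 for a, b in zip(loc_counts, den_counts))
    if den == 0:
        return 0.5  # assumed fallback: both branches always agree
    return min(max(num / den, 0.0), 1.0)

def fused_count(loc_count, den_count, alpha, max_gap=150):
    """Fuse the two branch counts, trusting the counting branch on large gaps."""
    if abs(loc_count - den_count) > max_gap:
        return den_count
    return alpha * loc_count + (1 - alpha) * den_count
```

For instance, with localization counts [100, 200], density counts [120, 240], and true counts [110, 220], the fitted weight is 0.5, and the fused count for a new image with branch counts 100 and 120 is 110.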

7. Fusing the results of the MainNet and RAZNet localization branches

By design, RAZNet produces an accurate localization result for a particular region of the original image, which in theory is more accurate than the result of the MainNet localization branch for that region. Replacing the MainNet detections in the region with the RAZNet detections for the same region completes the accurate head localization task.
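The replacement step can be sketched as follows. The sketch assumes that a region is an axis-aligned box given as (top, left, height, width) in original-image pixels and that the patch passed to RAZNet was enlarged by a factor of 2, so patch coordinates are divided by that factor and shifted by the box origin; the function name `merge_region_detections` is hypothetical.

```python
def merge_region_detections(main_points, region_box, zoomed_points, factor=2):
    """Replace MainNet detections inside a region with RAZNet detections."""
    top, left, height, width = region_box
    # Keep only the MainNet detections that lie outside the zoomed region.
    kept = [(r, c) for r, c in main_points
            if not (top <= r < top + height and left <= c < left + width)]
    # Map RAZNet detections from patch coordinates back to the original image.
    mapped = [(top + r / factor, left + c / factor) for r, c in zoomed_points]
    return kept + mapped
```

Applying this once per zoomed region, from the deepest zoom level outward, would yield a single list of head coordinates in the coordinate frame of the original image.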

8. Fusing the density-map-based and detection-based results with the scene-adaptive fusion weight to improve the accuracy of head counting

Using the fusion weight of the localization branch and the counting branch learned in section 6, we fuse the results of the two branches on the test set to obtain the final crowd count.

The performance of the present invention on three datasets commonly used for crowd counting, ShanghaiTech_A, ShanghaiTech_B, and UCF_QNRF, is shown in Table 2. On both evaluation metrics, mean absolute error (MAE) and mean squared error (MSE), it outperforms previous methods. In the table, "-" indicates that the method did not report performance on that dataset.

Table 2. Comparison of the present invention with other methods

The methods compared with the present invention are MCNN (Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, 2016), Switch-CNN (D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017), CP-CNN (V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In ICCV, 2017), and CSRNet (Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, 2018).

Another embodiment of the present invention provides a dense crowd counting and accurate localization system based on attention-mechanism recurrent zooming, comprising the main network module, the recurrent zooming network module, and the fusion counting module described above.

In the present invention, the VGG16 base network of MainNet can be replaced with a stronger model such as VGG19 or a ResNet-series model; a stronger base network can bring better results.

In the present invention, when training RAZNet, the resolution can be enlarged to twice the original or by a higher factor, as far as GPU memory allows.

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the principle and scope of the invention. The scope of protection of the invention shall be as defined in the claims.

Claims

1. A dense crowd counting and accurate localization method based on attention-mechanism recurrent zooming, characterized by comprising the following steps:

1) establishing a three-branch deep neural network to obtain, for the input image, a crowd counting density map, a crowd location map, and a zooming region proposal attention map;

2) obtaining the initial crowd count in the image from the crowd counting density map, obtaining the position coordinates of each person in the image from the crowd location map, and obtaining several densely crowded regions in the image from the zooming region proposal attention map;

3) cropping the densely crowded regions from the image, obtaining accurate localization results by increasing their resolution, and updating the crowd location map with those results;

4) obtaining an accurate crowd count by weighting the crowd count obtained from the crowd counting density map and the crowd count obtained from the crowd location map.

2. The method according to claim 1, characterized in that the three-branch deep neural network constitutes a main network comprising a localization branch, a counting branch, and a zooming region proposal branch; the localization branch consists of dilated convolutional layers and 3 deconvolutional layers and finally outputs a single-layer crowd location map with the same resolution as the original image; the counting branch consists only of dilated convolutional layers and outputs a crowd counting density map at 1/8 of the original image resolution; the feature maps output by the localization branch and the counting branch are concatenated as the input to the zooming region proposal branch, which passes them through 3 dilated convolutional layers and outputs a zooming region proposal attention map with the same resolution as the input image.

3. The method according to claim 1 or 2, characterized in that increasing the resolution means enlarging the resolution to twice the original.

4. The method according to claim 2, characterized in that obtaining accurate localization results by increasing the resolution means feeding the resolution-increased densely crowded regions into a recurrent zooming network to obtain accurate localization results; the recurrent zooming network contains no counting branch, and the rest of it is identical to the main network.

5. The method according to claim 4, characterized in that the recurrent zooming network itself produces a zooming region proposal attention map, according to which it is decided whether to crop regions again and pass them through the recurrent zooming network, until no new densely crowded region can be found in the zooming region proposal attention map.

6. The method according to claim 4, characterized in that the three branches of the main network are trained one after another in the order counting branch, localization branch, zooming region proposal branch; and the parameters of the trained main network are used as the initialization parameters of the recurrent zooming network, which is then fine-tuned.

7. The method according to claim 6, characterized in that, for the counting branch, the MSE loss between the crowd counting density map output by the branch and the true crowd counting density map is used as the optimization objective, and gradient updates are applied to the model parameters of the branch; after the counting branch has converged, its learned parameters are used as the initialization parameters of the localization branch, which uses the weighted BCE loss between the predicted head location map and the true head location map as its optimization objective, and gradient updates are applied to the model parameters of the branch; after the counting branch and the localization branch have been trained, the parameters of both branches are fixed and training of the zooming region proposal branch begins, using the MSE loss as its optimization objective.

8. The method according to claim 1, characterized in that the weight used in obtaining the accurate crowd count by weighting is obtained as follows:

a) on the training set, obtaining a crowd count from the crowd counting density map and a crowd count from the crowd location map;

b) using the known true crowd count of each image, learning the fusion weight between the two crowd counts obtained in step a).

9. The method according to claim 1, characterized in that the crowd count is obtained from the crowd location map as follows:

a) applying non-maximum suppression to the crowd location map and taking every location whose response exceeds a certain threshold as a peak point;

b) taking the positions of the peak points in the crowd location map as the head position coordinates;

c) counting the head position coordinates to obtain the total number of people in the image.

10. A dense crowd counting and accurate localization system based on attention-mechanism recurrent zooming, characterized by comprising:

a main network module, which contains a three-branch deep neural network for obtaining, for the input image, a crowd counting density map, a crowd location map, and a zooming region proposal attention map; obtains the initial crowd count in the image from the crowd counting density map, the position coordinates of each person in the image from the crowd location map, and several densely crowded regions in the image from the zooming region proposal attention map; and crops the densely crowded regions from the image and increases their resolution;

a recurrent zooming network module, which takes the resolution-increased densely crowded regions as input, obtains accurate localization results, and updates the crowd location map with them;

a fusion counting module, which obtains an accurate crowd count by weighting the crowd count obtained from the crowd counting density map and the crowd count obtained from the crowd location map.