CN108921822A - Image object counting method based on convolutional neural networks

Info

Publication number: CN108921822A
Application number: CN201810564162.6A
Authority: CN
Inventors: 王子磊; 刘旭
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-06-04
Filing date: 2018-06-04
Publication date: 2018-11-30

Abstract

The invention discloses an image object counting method based on convolutional neural networks. A robustness enhancement layer makes the features learned by the network more robust to target deformation while reducing the computational complexity of the model. Density estimation is performed by a pyramid hierarchical counting module, which makes full use of the multi-scale information contained in the hierarchical features of the convolutional neural network and significantly improves computational efficiency while achieving accurate counting. In short, the invention achieves accurate counting of targets in images based on convolutional neural networks, is applicable to object counting tasks in complex scenes, has low computational complexity, and has high practical application value.

Description

Image object counting method based on convolutional neural networks

Technical field

The present invention relates to the technical field of computer vision, and more particularly to an image object counting method based on convolutional neural networks.

Background art

With the rapid development of computer technology, network communication technology and electronic technology, and with people's rising demands for public security, intelligent video surveillance systems based on intelligent video analysis have come into wide use. As an important part of intelligent video surveillance, object counting has many real-world application scenarios, and accurately estimating the number of targets in an image is key to the processing performed by such systems. In an intelligent transportation system, an accurate estimate of the number of vehicles in a traffic scene gives traffic management departments an important basis for managing public transport; counting the customer flow in a shopping mall can guide its business hours and staff scheduling; and monitoring crowd density in public places such as large shopping malls makes it possible to detect safety hazards in time and issue early warnings.

The goal of the counting task is to let a computer accurately estimate the number of objects of interest in an image. Current mainstream object counting methods fall into two groups: methods based on regional feature regression and density estimation methods based on neural networks. Regional feature regression directly estimates the total number of targets in a scene by building a regression model between the image features of the foreground region and the target count. Such algorithms have low computational complexity, but they ignore the spatial distribution of targets in the scene and yield only a single scalar statistic; moreover, feature extraction depends on the quality of the foreground segmentation, so robustness is insufficient. Density estimation methods generate a density map of the targets to be counted from manually annotated samples and directly learn a mapping from pixel features to the target density map. The resulting density map contains complete density information, so the number of targets in any region can be obtained by summing the density over that region, and it also preserves the spatial distribution of targets in the image. This approach is the focus of current research.

Most current neural-network-based density estimation methods rely on multi-branch network structures to extract multi-scale features. For example, Gao Shenghua, in Chinese Patent Publication No. CN105528589A, "Single-image crowd counting algorithm based on multi-column convolutional neural networks", uses a multi-column convolutional network to extract scene features; each sub-network uses convolution kernels of a different size, and features with different receptive fields are combined to handle the scale variation of targets in the scene. Liu Yu et al., in Chinese Patent Publication No. CN107506692, "A dense crowd counting and people distribution estimation method based on deep learning", likewise use a multi-column convolutional structure, extracting multi-scale features with four columns of deep residual networks. Similarly, Deng Tengfei et al., in Chinese Patent Publication No. CN107301387A, "An image dense crowd counting method based on deep learning", use two columns of convolutional networks to learn high-level and low-level features respectively. In all of these schemes, however, the multi-column convolutional model has a large number of parameters and high computational complexity, which makes it difficult to meet the efficiency requirements of practical applications.

Summary of the invention

The object of the present invention is to provide an image object counting method based on convolutional neural networks that further improves model performance without significantly increasing the complexity of network feature extraction.

The object of the present invention is achieved through the following technical solution:

An image object counting method based on convolutional neural networks, comprising:

Establishing a pyramid object counting network based on a convolutional neural network;

Using manual annotations to build ground-truth density maps of the targets of interest on the training images;

Augmenting the counting data by applying random cropping and horizontal flipping to the training images in the training set and to their corresponding ground-truth density maps;

Feeding the augmented training images and ground-truth target density maps to the pyramid object counting network, and completing the training of the network through repeated iterative optimization to produce a pyramid object counting network model;

When a new image is input, generating image blocks of the same size as the network input by means of a sliding window, feeding them to the pyramid object counting network model to obtain predicted density maps, and averaging the density values of overlapping parts to obtain the final output density map, from which the number of targets is obtained.

As can be seen from the technical solution provided above, the present invention achieves accurate counting of targets in images based on convolutional neural networks, is applicable to object counting tasks in complex scenes, has low computational complexity, and has high practical application value.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of an image object counting method based on convolutional neural networks provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the pyramid object counting network provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of a target center point map and a ground-truth density map provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of the pyramid hierarchical counting module provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of output results on the Shanghaitech-B dataset provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of output results on the TRANCOS dataset provided by an embodiment of the present invention.

Detailed description of embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Natural scenes are usually complex and changeable, and the image object counting task is easily affected by many factors, such as severe occlusion between targets, target deformation, uneven target distribution, cluttered background interference and camera perspective distortion. The perspective effect of the camera is especially influential: the size of the same object varies widely with its depth in the scene, and the camera viewpoint also differs from scene to scene. As noted above, existing methods address these problems mainly by introducing multi-branch networks to extract multi-scale features. However, multi-branch networks greatly increase the number of network parameters and the computational complexity, and therefore cannot meet practical requirements. In addition, training a multi-branch network is usually much harder than training a single network model. In fact, a convolutional neural network is itself a pyramid-like multi-level structure: the model takes the raw image signal as input and builds an increasingly abstract representation of the image layer by layer, higher layers have larger receptive fields, and the features of the different layers together contain rich multi-scale information. The present invention therefore discloses an image object counting method based on convolutional neural networks that performs object counting with a single network and makes full use of the multi-scale information contained in the convolutional neural network itself, achieving good counting performance while reducing model complexity. The method provided by the embodiments of the present invention is described in detail below.

Fig. 1 shows an image object counting method based on convolutional neural networks provided by an embodiment of the present invention, in which steps 1 to 4 form the training stage and step 5 is the test stage.

The images used in the training stage may be scene pictures from the representative crowd counting dataset Shanghaitech-B and the vehicle counting dataset TRANCOS. The Shanghaitech-B dataset is provided by (Single-image crowd counting via multi-column convolutional neural network. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.), and the TRANCOS dataset is provided by (Extremely Overlapping Vehicle Counting. Proceedings of Iberian Conference on Pattern Recognition and Image Analysis, 2015.).

The main steps of the above method are as follows:

1. Establish a pyramid object counting network based on a convolutional neural network.

In the embodiment of the present invention, a representative convolutional neural network model is structurally adjusted and redesigned according to the characteristics and requirements of the image object counting task, yielding the pyramid object counting network. As shown in Fig. 2, the network mainly comprises a feature extraction module, a robustness enhancement module and a density estimation module.

1) The feature extraction module uses a fully convolutional network to extract image features and comprises five regular convolutional layers and two dilated convolutional layers.

In the embodiment of the present invention, two different network structures are used to extract image features in different scenes. Correspondingly, image blocks of two sizes, 72 × 72 and 144 × 144, are used as the input sizes of the first and second network structures respectively. The two structures differ mainly in the receptive field of the kernels in the first three convolutional layers: the first structure uses 3 × 3 kernels throughout, while the first three layers of the second structure use 5 × 5 kernels. In both structures the number of channels starts at 16 and doubles after each max pooling layer, and is then reduced from 64 back to 16 and kept unchanged. The network contains two max pooling layers, located after the first and second convolutional layers respectively, with a kernel size of 2 × 2 and a stride of 1.

It should be noted that the specific values given above for the image block sizes and the numbers of channels are merely examples and do not constitute a limitation.

The feature extraction module ends with two dilated convolutional layers. A dilated convolution inserts holes into the kernel of a standard convolution, increasing the spacing between the kernel values when the kernel processes data; compared with a regular convolutional layer, it enlarges the receptive field without increasing the number of network parameters.
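As a minimal illustration of why inserting holes enlarges the receptive field without adding parameters, the sketch below computes the spatial extent of a dilated kernel using the standard relation k_eff = k + (k - 1)(d - 1). The function names and the sample dilation rates are illustrative assumptions and are not values taken from the embodiment.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated convolution kernel.

    A dilation of 1 is an ordinary convolution; a dilation of d inserts
    d - 1 holes between neighbouring kernel values.
    """
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def parameter_count(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a 2-D convolution (bias ignored).

    The count does not depend on the dilation rate.
    """
    return kernel_size * kernel_size * in_channels * out_channels


# Hypothetical dilation rates, 16 input and 16 output channels.
for d in (1, 2, 4):
    print(d, effective_kernel_size(3, d), parameter_count(3, 16, 16))
```

The weight count stays the same for every dilation rate, while the covered extent grows from 3 to 5 to 9 pixels per side.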

Illustratively, the two dilated convolutional layers may use a step size of 2, and the output of every convolutional layer uses the rectified linear unit (ReLU) as the activation function, which gives the network the ability to model nonlinear mappings.

2) The robustness enhancement module uses spatial pyramid pooling. On the feature map output by the feature extraction module, spatial pooling at four different scales, N₁×N₁, N₂×N₂, N₃×N₃ and N₄×N₄, constructs sub-blocks of different sizes and thereby captures spatial information of the image at different resolutions. This makes the features learned by the network more robust to target deformation while reducing the computational complexity of the model.

Illustratively, the four spatial pooling scales may be 1 × 1, 2 × 2, 4 × 4 and 6 × 6; combined with the example in the feature extraction module, the feature dimension after robustness enhancement is then 16 × (1 × 1 + 2 × 2 + 4 × 4 + 6 × 6) = 912.
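The dimension arithmetic above can be checked with a small pure-Python sketch of spatial pyramid pooling. It assumes max pooling inside each grid cell, bins that may overlap by one row or column when the map size is not divisible by the grid size, and an 18 × 18 feature map; none of these details is stated in the source, and the function names are hypothetical.

```python
import math


def spp_bins(size, n):
    """Split `size` positions into n bins (neighbouring bins may overlap slightly)."""
    return [(math.floor(i * size / n), math.ceil((i + 1) * size / n)) for i in range(n)]


def spatial_pyramid_pool(feature_maps, levels=(1, 2, 4, 6)):
    """Max-pool every channel over an n x n grid for each n in `levels`
    and concatenate the results into one fixed-length vector.

    feature_maps: list of channels, each a list of rows (H x W).
    """
    pooled = []
    for channel in feature_maps:
        height, width = len(channel), len(channel[0])
        for n in levels:
            for r0, r1 in spp_bins(height, n):
                for c0, c1 in spp_bins(width, n):
                    pooled.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return pooled


def spp_dimension(channels, levels=(1, 2, 4, 6)):
    """Length of the pooled vector: channels * sum of n * n over all levels."""
    return channels * sum(n * n for n in levels)


# 16 channels of an 18 x 18 map; the spatial size is an assumption of this sketch.
maps = [[[float(ch + r + c) for c in range(18)] for r in range(18)] for ch in range(16)]
vector = spatial_pyramid_pool(maps)
print(len(vector), spp_dimension(16))  # prints "912 912"
```

Because the output length depends only on the channel count and the pooling grid, the same fully connected layers can follow feature maps of different spatial sizes.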

3) The density estimation module uses a pyramid hierarchical counting module that learns complementary information at different scales to generate the target density map.

The pyramid hierarchical counting module estimates target density separately on each robustness-enhanced feature map of the feature extraction module, and the final output density map is the sum of the outputs of the individual layers. Each density estimate uses two fully connected layers to build a nonlinear mapping from image features to density values. The first fully connected layer is connected to the output features of the robustness enhancement module (SPP) and has 1000 neurons. The second fully connected layer is the output layer: it has 324 output neurons when the feature extraction module uses the first network structure and 1296 when it uses the second. The first fully connected layer is followed by a rectified linear unit (ReLU) activation function and a Dropout layer with a Dropout parameter of 0.5.

It should be noted that the specific values given above for the numbers of neurons and the Dropout parameter are merely examples and do not constitute a limitation.

2. Use manual annotations to build ground-truth density maps of the targets of interest on the training images.

In the embodiment of the present invention, the ground-truth density map of the targets of interest is obtained by applying Gaussian filtering to the manually annotated target center point map of the training image.

Specifically, each annotated target center in the center point map serves as the center of a Gaussian kernel, and the density map is generated by Gaussian filtering. As shown in Fig. 3, for a given training image, let P be the set of annotated geometric center points of the targets and let D denote the corresponding density map. The density value D(i, j) of the pixel at (i, j) is then computed as:

$$D(i, j) = \sum_{(m, n) \in P} \mathcal{N}\big((i, j);\ (m, n),\ \sigma^{2} I_{2\times 2}\big)$$

In the above formula, $\mathcal{N}\big((i, j);\ (m, n),\ \sigma^{2} I_{2\times 2}\big)$ is the value at pixel (i, j) of a two-dimensional Gaussian distribution whose mean is located at the annotated position (m, n), and $\sigma^{2} I_{2\times 2}$ is the covariance matrix.

Illustratively, in a specific implementation the size of the Gaussian kernel may be set to 10 and 15 for the Shanghaitech-B and TRANCOS datasets respectively.
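The density map construction described above (one isotropic two-dimensional Gaussian per annotated center, summed over all centers) can be sketched as follows. This is a minimal illustration under stated assumptions: each Gaussian is renormalized over the image grid so that it contributes exactly one count, and the image size, points and `sigma` value are toy values rather than the settings of the embodiment.

```python
import math


def density_map(height, width, centers, sigma):
    """Sum one normalised isotropic 2-D Gaussian per annotated centre.

    centers: iterable of (m, n) = (row, column) annotations.
    Each Gaussian is renormalised over the image grid so that it adds
    exactly 1 to the total even when the image border truncates it
    (this renormalisation is an assumption of the sketch).
    """
    density = [[0.0] * width for _ in range(height)]
    for m, n in centers:
        kernel = [[math.exp(-((i - m) ** 2 + (j - n) ** 2) / (2.0 * sigma ** 2))
                   for j in range(width)] for i in range(height)]
        total = sum(map(sum, kernel))
        for i in range(height):
            for j in range(width):
                density[i][j] += kernel[i][j] / total
    return density


# Three hypothetical annotations on a 48 x 64 image.
points = [(10, 12), (30, 40), (5, 60)]
D = density_map(48, 64, points, sigma=3.0)
count = sum(map(sum, D))
print(round(count, 6))
```

Summing the map over the whole image recovers the number of annotated targets, and summing over any sub-region gives the expected count for that region.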

3. Augment the counting data by applying random cropping and horizontal flipping to the training images in the training set and to their corresponding ground-truth density maps.

A convolutional neural network has many parameters to train, so a large amount of training data is needed to train such a model. In the training stage, the training data are therefore augmented by randomly cropping the training images, which produces a large number of training image blocks and their corresponding ground-truth density maps. The images are scale-normalized according to the input size, a large number of image blocks are randomly cropped, and the data are further augmented by horizontal flipping; the resulting training samples are finally used for model training.

The main steps are as follows:

1) Normalize the size of the input training images;

2) Randomly crop image blocks of the same size from the normalized training images as new training images;

3) Flip the new training images horizontally to obtain a further series of new training images;

4) Apply the same processing (steps 1) to 3)) to the ground-truth density maps, then normalize them so that the number of targets in each ground-truth density map remains unchanged after scaling.
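The augmentation steps above can be sketched in pure Python as below. The sketch assumes nearest-neighbor resizing, crops before it resizes, and restores the count by multiplying the resized density map by the ratio of the sums before and after resizing; the source does not specify the interpolation or the exact normalization, and all function names are hypothetical.

```python
import random


def crop(image, top, left, size):
    """Cut a size x size block out of a 2-D list."""
    return [row[left:left + size] for row in image[top:top + size]]


def hflip(image):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in image]


def resize_nearest(image, new_size):
    """Nearest-neighbour resize of a 2-D list to new_size x new_size."""
    old_h, old_w = len(image), len(image[0])
    return [[image[r * old_h // new_size][c * old_w // new_size]
             for c in range(new_size)] for r in range(new_size)]


def resize_density(density, new_size):
    """Resize a density map, then rescale it so that its sum (the count) is unchanged."""
    before = sum(map(sum, density))
    resized = resize_nearest(density, new_size)
    after = sum(map(sum, resized))
    scale = before / after if after > 0 else 0.0
    return [[v * scale for v in row] for row in resized]


def augment(image, density, crop_size, out_size, rng):
    """Return two (image, density) training pairs: a random crop and its mirror image."""
    top = rng.randrange(len(image) - crop_size + 1)
    left = rng.randrange(len(image[0]) - crop_size + 1)
    img = resize_nearest(crop(image, top, left, crop_size), out_size)
    den = resize_density(crop(density, top, left, crop_size), out_size)
    return [(img, den), (hflip(img), hflip(den))]


# Toy data: a 30 x 40 image with a uniform density of 0.01 per pixel.
rng = random.Random(0)
image = [[(r * 7 + c * 3) % 256 for c in range(40)] for r in range(30)]
density = [[0.01] * 40 for _ in range(30)]
pairs = augment(image, density, crop_size=20, out_size=10, rng=rng)
```

The image and its density map share the same crop offsets and the same flip, so every annotation stays aligned with the pixels it describes.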

Illustratively, for the Shanghaitech-B dataset, following the experimental setup in (Single-image crowd counting via multi-column convolutional neural network. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.), 400 pictures are used as training images and the remaining 316 as test pictures. The pictures in the Shanghaitech-B dataset have a high resolution. To train the counting model, in our implementation 200 image blocks of size 200 × 200 are randomly cropped from every training picture, each block is then scaled (normalized) to 144 × 144, and the second feature extraction network is used for feature extraction. Data augmentation by image flipping is likewise applied. Of course, these specific values are only examples and do not constitute a limitation.

4. Feed the augmented training images and ground-truth target density maps to the pyramid object counting network, and complete the training of the network through repeated iterative optimization to produce a pyramid object counting network model.

When the pyramid object counting network is trained, the augmented training images and ground-truth target density maps serve as training samples, and the Euclidean distance between the predicted density map and the ground-truth density map serves as the loss function. The network is trained by stochastic gradient descent, with the model parameters updated in each optimization iteration. The loss function L(Θ) is defined as follows:

$$L(\Theta) = \frac{1}{2N} \sum_{k=1}^{N} \left\lVert F(X_k; \Theta) - D_k \right\rVert_2^2$$

In the above formula, Θ denotes the network parameters learned by the model, N is the number of training samples, $F(X_k; \Theta)$ is the density map predicted by the pyramid object counting network, and $D_k$ is the ground-truth density map of the k-th training sample $X_k$.
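A minimal sketch of a Euclidean loss over predicted and ground-truth density maps is given below. The 1/(2N) normalization is an assumption borrowed from common practice in density-based counting, and the function name and toy values are hypothetical.

```python
def euclidean_loss(predictions, targets):
    """L = 1 / (2N) * sum_k ||F_k - D_k||_2^2 over N (prediction, target) pairs.

    predictions, targets: lists of density maps, each a list of rows.
    The 1 / (2N) factor is an assumed normalisation.
    """
    n = len(predictions)
    if n == 0 or n != len(targets):
        raise ValueError("need the same non-zero number of predictions and targets")
    total = 0.0
    for pred, true in zip(predictions, targets):
        for pred_row, true_row in zip(pred, true):
            total += sum((p - t) ** 2 for p, t in zip(pred_row, true_row))
    return total / (2.0 * n)


# One toy 2 x 2 prediction and its ground truth.
pred = [[[0.1, 0.2], [0.0, 0.3]]]
true = [[[0.1, 0.0], [0.0, 0.5]]]
print(euclidean_loss(pred, true))
```

The loss compares the maps pixel by pixel, so it penalizes misplaced density as well as a wrong total count.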

As shown in Fig. 4, once the Euclidean distance has been fixed as the optimization objective, model training proceeds as follows. First, through end-to-end training, a density regression model is built on the last-layer feature of the robustness-enhanced feature extraction module (the feature map output by the last dilated convolutional layer), giving an initial density estimate. Then, to refine the counting result, the feature extraction module and the initial regression model are fixed, and a new regression model is built on another robustness-enhanced feature layer of the feature extraction module (the feature output by the last regular convolutional layer), with the residual between the ground-truth density and the current predicted density as its optimization target. Finally, the back-propagation algorithm is used to jointly train the parameters of the whole network. In this way the two regression models learn complementary information from the multi-scale features of the convolutional neural network and together produce the final density estimate. Following this training strategy, further regression models can be trained in the pyramid hierarchical counting module in a similar way. In the specific implementation, the final counting network uses two regression models for density estimation, built on the last dilated convolutional layer and the last regular convolutional layer of the feature extraction network respectively.
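The residual strategy described above can be illustrated with a deliberately simplified stand-in: two one-parameter least-squares regressors replace the two fully connected regression models, and hand-written feature lists replace the outputs of the two feature layers. All names and numbers are hypothetical, and the final joint fine-tuning step is omitted.

```python
def fit_scale(features, targets):
    """Least-squares weight w minimising sum (w * x - y)^2."""
    denom = sum(x * x for x in features)
    return sum(x * y for x, y in zip(features, targets)) / denom if denom else 0.0


def sse(predictions, targets):
    """Sum of squared errors between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


deep = [1.0, 2.0, 3.0, 4.0]       # stand-in for last dilated-layer features
shallow = [1.0, -1.0, 1.0, -1.0]  # stand-in for last regular-layer features
density = [1.5, 1.5, 3.5, 3.5]    # toy ground-truth density values

# Stage 1: initial regressor on the deep features.
w1 = fit_scale(deep, density)
stage1 = [w1 * x for x in deep]

# Stage 2: stage 1 is frozen; a second regressor fits what is left over.
residual = [y - p for y, p in zip(density, stage1)]
w2 = fit_scale(shallow, residual)

# Final estimate: the sum of the two regressors' outputs.
final = [p + w2 * x for p, x in zip(stage1, shallow)]
print(sse(stage1, density), sse(final, density))
```

Because the second regressor is fitted to the residual, adding its output cannot increase the squared error of the first stage, which is the sense in which the two models learn complementary information.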

5. When a new image is input, generate image blocks of the same size as the network input by means of a sliding window, feed them to the pyramid object counting network model to obtain predicted density maps, and average the density values of overlapping parts to obtain the final output density map, from which the number of targets is obtained.

To demonstrate the performance of the above scheme, it was also tested and evaluated.

After counting network models were trained separately on the Shanghaitech-B and TRANCOS datasets, the performance of the network was evaluated as follows. For every test image in the test set, a window is slid with a stride of 10 pixels to generate image blocks of the same size as the network input; these are fed to the trained counting network model to obtain predicted density maps, and the density values of overlapping parts are averaged to obtain the final output density map. The ground-truth density maps of the test set are compared with the predicted density maps to obtain the evaluation results. Fig. 5 and Fig. 6 show example predictions. In both figures, the first column is the input image, the second column is the ground-truth density map, the third column is the density map predicted by the network model of the invention, and the number below each density map is the target count.
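The sliding-window inference with overlap averaging can be sketched as follows. The sketch assumes that the patch predictor returns a density map at the same resolution as the patch and that the last window in each direction is aligned to the image border; `predict_patch` is a hypothetical stand-in for the trained network, and here it simply returns the patch itself.

```python
def window_starts(length, window, stride):
    """Start offsets covering [0, length); the last window is aligned to the border."""
    if window > length:
        raise ValueError("window must not exceed the image size")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts


def predict_full_density(image, predict_patch, window, stride=10):
    """Slide a window over the image, predict each patch, average the overlaps."""
    height, width = len(image), len(image[0])
    total = [[0.0] * width for _ in range(height)]
    hits = [[0] * width for _ in range(height)]
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            patch = [row[left:left + window] for row in image[top:top + window]]
            patch_density = predict_patch(patch)  # window x window density values
            for r in range(window):
                for c in range(window):
                    total[top + r][left + c] += patch_density[r][c]
                    hits[top + r][left + c] += 1
    # Every pixel is covered by at least one window, so hits is never zero.
    return [[total[r][c] / hits[r][c] for c in range(width)] for r in range(height)]


# Toy check: the "image" doubles as its own density and the predictor is the identity.
truth = [[((r * 31 + c * 17) % 5) / 100.0 for c in range(47)] for r in range(35)]
result = predict_full_density(truth, lambda patch: patch, window=24, stride=10)
count = sum(map(sum, result))
```

Averaging rather than summing the overlapping predictions keeps the total count unbiased when a pixel is covered by several windows.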

For the Shanghaitech-B dataset, the mean absolute error (MAE) and root mean square error (RMSE) are used as evaluation metrics, defined as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| C_k - C_k^{GT} \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} \left( C_k - C_k^{GT} \right)^{2}}$$

In the above formulas, N is the number of test samples, $C_k$ is the number of targets in the k-th picture as predicted by the model, and $C_k^{GT}$ is the corresponding true number of targets.
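The two metrics can be computed from per-image counts as in the short sketch below, which uses the standard definitions of MAE and RMSE. The counts are made-up toy values, not results from the source.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean square error between predicted and true per-image counts."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


predicted_counts = [102.0, 48.0, 250.0, 31.0]
true_counts = [100.0, 50.0, 240.0, 31.0]
print(mae(predicted_counts, true_counts))   # (2 + 2 + 10 + 0) / 4 = 3.5
print(rmse(predicted_counts, true_counts))  # sqrt((4 + 4 + 100 + 0) / 4) = sqrt(27)
```

RMSE weights large errors more heavily than MAE, which is why the single error of 10 dominates it in this example.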

Table 1 compares the present invention with existing methods on the Shanghaitech-B dataset. It can be seen that the present invention achieves very high crowd counting accuracy.

Table 1 Comparison results

For the TRANCOS dataset, the Grid Average Mean Absolute Error (GAME) is used as the evaluation metric. GAME takes into account both the counting precision and the accuracy with which the target distribution is localized. For a specified level L, GAME(L) divides the picture into $4^{L}$ non-overlapping regions and then computes the mean absolute error over the regions:

$$\mathrm{GAME}(L) = \frac{1}{N} \sum_{k=1}^{N} \left( \sum_{l=1}^{4^{L}} \left| C_k^{l} - C_k^{l,GT} \right| \right)$$

In the above formula, N is the number of test samples, $C_k^{l}$ is the number of targets in the l-th region of the k-th picture as predicted by the model, and $C_k^{l,GT}$ is the corresponding true number of targets. In particular, GAME(0) is equivalent to the MAE criterion.
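A minimal sketch of GAME on density maps is shown below. It assumes that the 4^L regions form a regular 2^L × 2^L grid whose boundaries are obtained by integer division; the function names and the 2 × 2 toy maps are hypothetical.

```python
def region_counts(density, level):
    """Sum a density map over a 2^level x 2^level grid (4^level regions)."""
    cells = 2 ** level
    height, width = len(density), len(density[0])
    counts = []
    for gr in range(cells):
        r0, r1 = gr * height // cells, (gr + 1) * height // cells
        for gc in range(cells):
            c0, c1 = gc * width // cells, (gc + 1) * width // cells
            counts.append(sum(sum(row[c0:c1]) for row in density[r0:r1]))
    return counts


def game(predicted_maps, true_maps, level):
    """Grid Average Mean Absolute Error at the given level."""
    total = 0.0
    for pred, true in zip(predicted_maps, true_maps):
        total += sum(abs(p - t) for p, t in
                     zip(region_counts(pred, level), region_counts(true, level)))
    return total / len(true_maps)


# Same total count (2) but every target predicted in the wrong quadrant.
pred_maps = [[[1.0, 0.0], [0.0, 1.0]]]
true_maps = [[[0.0, 1.0], [1.0, 0.0]]]
print(game(pred_maps, true_maps, 0))  # 0.0: the total counts agree
print(game(pred_maps, true_maps, 1))  # 4.0: each of the 4 regions is off by 1
```

The example shows why GAME complements MAE: a prediction with the right total but the wrong spatial layout scores perfectly at level 0 and poorly at level 1.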

Table 2 compares the present invention with existing methods on the TRANCOS dataset. It can be seen that the present invention achieves very high vehicle counting accuracy.

Table 2 Comparison results

To address the complexity and variability of real scenes, the above scheme of the embodiments of the present invention has the following advantages. First, the robustness enhancement module makes the features learned by the network more robust to target deformation while reducing the computational complexity of the model. Second, density estimation is performed with the pyramid hierarchical counting module, which makes full use of the multi-scale information contained in the hierarchical features of the convolutional neural network and significantly improves computational efficiency while achieving accurate counting. In short, the present invention achieves accurate counting of targets in images based on convolutional neural networks, is applicable to object counting tasks in complex scenes, has low computational complexity, and has high practical application value.

From the above description of the embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented in software, or in software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which may be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive or removable hard disk) and which includes instructions that cause a computer device (such as a personal computer, server or network device) to execute the methods described in the embodiments of the present invention.

The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. An image object counting method based on convolutional neural networks, characterized by comprising:

Feeding the augmented training images and ground-truth target density maps to the pyramid object counting network, and completing the training of the network through repeated iterative optimization to produce a pyramid object counting network model;

When a new image is input, generating image blocks of the same size as the network input by means of a sliding window, feeding them to the pyramid object counting network model to obtain predicted density maps, and averaging the density values of overlapping parts to obtain the final output density map, from which the number of targets is obtained.

2. The image object counting method based on convolutional neural networks according to claim 1, characterized in that the established pyramid object counting network comprises a feature extraction module, a robustness enhancement module and a density estimation module, wherein:

The feature extraction module uses a fully convolutional network to extract image features and comprises five regular convolutional layers and two dilated convolutional layers; a dilated convolutional layer inserts holes into the kernel of a standard convolution, increasing the spacing between the kernel values when the kernel processes data;

The robustness enhancement module uses spatial pyramid pooling: on the feature map output by the feature extraction module, spatial pooling at four different scales, N₁×N₁, N₂×N₂, N₃×N₃ and N₄×N₄, constructs sub-blocks of different sizes to capture spatial information of the image at different resolutions;

The density estimation module uses a pyramid hierarchical counting module that learns complementary information at different scales to generate the target density map.

3. The image object counting method based on convolutional neural networks according to claim 2, characterized in that the pyramid hierarchical counting module estimates target density separately on each robustness-enhanced feature map of the feature extraction module, and the final output density map is the sum of the outputs of the individual layers; wherein each density estimate uses two fully connected layers to build a nonlinear mapping from image features to density values.

4. The image object counting method based on convolutional neural networks according to claim 1, characterized in that using manual annotations to build the ground-truth density maps of the targets of interest on the training images comprises:

Applying Gaussian filtering to the manually annotated target center point map of the training image to obtain the ground-truth density map of the targets of interest;

Wherein each annotated target center in the center point map serves as the center of a Gaussian kernel, and the density map is generated by Gaussian filtering: let P be the set of annotated geometric center points of the targets in the image and let D denote the corresponding density map; the density value D(i, j) of the pixel at (i, j) is then computed as:

$$D(i, j) = \sum_{(m, n) \in P} \mathcal{N}\big((i, j);\ (m, n),\ \sigma^{2} I_{2\times 2}\big)$$

In the above formula, $\mathcal{N}\big((i, j);\ (m, n),\ \sigma^{2} I_{2\times 2}\big)$ is the value at pixel (i, j) of a two-dimensional Gaussian distribution whose mean is located at the annotated position (m, n), and $\sigma^{2} I_{2\times 2}$ is the covariance matrix.

5. The image object counting method based on convolutional neural networks according to claim 1, characterized in that the step of augmenting the counting data by applying random cropping and horizontal flipping to the training images in the training set and to their corresponding ground-truth density maps comprises:

Normalizing the size of the input training images;

Randomly cropping image blocks of the same size from the normalized training images as new training images;

Flipping the new training images horizontally to obtain a further series of new training images;

Applying the same processing to the ground-truth density maps, then normalizing them so that the number of targets in each ground-truth density map remains unchanged after scaling.

6. The image object counting method based on convolutional neural networks according to claim 1, characterized in that, when the pyramid object counting network is trained, the augmented training images and ground-truth target density maps serve as training samples, the Euclidean distance between the predicted density map and the ground-truth density map serves as the loss function, the network is trained by stochastic gradient descent with the model parameters updated in each optimization iteration, and the loss function L(Θ) is defined as follows:

$$L(\Theta) = \frac{1}{2N} \sum_{k=1}^{N} \left\lVert F(X_k; \Theta) - D_k \right\rVert_2^2$$

In the above formula, Θ denotes the network parameters learned by the model, N is the number of training samples, $F(X_k; \Theta)$ is the density map predicted by the pyramid object counting network, and $D_k$ is the ground-truth density map of the k-th training sample $X_k$.