CN106056078B - Crowd density estimation method based on multi-feature regression type ensemble learning

Info

Publication number: CN106056078B
Application number: CN201610374700.6A
Authority: CN
Inventors: 郑宏; 张洞明
Original assignee: Shenzhen Research Institute of Wuhan University
Current assignee: Shenzhen Research Institute of Wuhan University
Priority date: 2016-05-31
Filing date: 2016-05-31
Publication date: 2021-09-14
Anticipated expiration: 2036-05-31
Also published as: CN106056078A

Abstract

The invention relates to a crowd density estimation method based on multi-feature regression-type ensemble learning. Taking the width of a human head as a reference, the scene frame image is divided into multi-level image blocks, and the blocks are scaled and Gamma-corrected so that image scale and illumination are consistent. A density estimation model is built from the preprocessed samples: three features, D-SIFT, GLCM and GIST, are extracted to build a first-layer support vector regression (SVR) coarse prediction model; the coarse prediction results are taken as new features to build a second-layer SVR fine prediction model; and the fine prediction results of all sub-images are summed, with density estimated according to the people-count grades set for the scene. By building the model from samples of multiple scenes, with multiple features and regression-type ensemble learning, the method addresses scene illumination change, changes in camera height and angle, and pedestrian occlusion, and is applicable to crowd density estimation in multiple different scenes.

Description

Crowd density estimation method based on multi-feature regression type ensemble learning

Technical Field

The invention relates to the technical field of digital image processing and pattern recognition, in particular to a crowd density estimation method based on multi-feature regression type ensemble learning.

Background

As living standards improve and urbanization accelerates, large-scale gatherings in public places have become increasingly frequent, and accidents caused by dense crowds have occurred often in recent years. Using computer vision to monitor crowds intelligently in real time, estimate crowd density promptly and take effective measures is therefore of great significance for safeguarding social stability and crowd safety.

Current methods for estimating crowd density can be divided into two major categories:

1) Direct methods: direct methods use classifiers to segment or detect each individual in a crowd and then count the individuals to obtain the crowd density. These methods can be further divided into two subclasses. First, model-based methods perform detection or segmentation using a model or the shape contour of a person. For example, Lin et al. extract head contour features based on the Haar wavelet transform and detect pedestrians in combination with a support vector machine (Lin S F, Chen J Y, Chao H X. Estimation of number of people in crowded scenes using perspective transformation [J]. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 2001, 31(6): 645-654); Felzenszwalb et al. propose a pedestrian detection method based on parts and improved histogram of oriented gradients (HOG) features (Felzenszwalb P F, Girshick R B, McAllester D, et al. Object detection with discriminatively trained part-based models [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(9): 1627-1645); and a pedestrian detection method using a Hough forest framework has also been proposed (Gall J, Lempitsky V. Class-specific Hough forests for object detection [M]. Decision Forests for Computer Vision and Medical Image Analysis. Springer London, 2013: 143-). Second, trajectory-clustering-based methods detect each individual by clustering interest points on pedestrians through long-term tracking. For example, Rabaud and Belongie propose a method that uses a Kanade-Lucas-Tomasi (KLT) tracker and a series of low-level features to cluster trajectories and infer the number of people in a scene (Rabaud V, Belongie S. Counting crowded moving objects [C]. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2006, 1: 705-711); Rao et al. (Rao A S, Gubbi J, Marusic S, et al.) is another example.
Direct methods work well when there are few people in the scene, but they have an obvious drawback: in crowded conditions people overlap heavily and the performance of direct methods drops sharply.

2) Indirect methods: indirect methods treat the crowd as a whole and obtain the crowd density by extracting features such as texture from the crowd and combining them with a regression model. Indirect methods can be divided into three categories. First, pixel-based analysis: these methods first remove the scene background and then use very low-level features to estimate crowd density. Davies et al. (Davies A C, Yin J H, Velastin S A. Crowd monitoring using image processing [J]. Electronics & Communication Engineering Journal, 1995, 7(1): 37-47) extract the foreground, analyse the crowd foreground and edge pixels, add perspective correction, and estimate the number of people through a linear relationship. Hussain et al. (Hussain N, Yatim H S M, Hussain N L, et al. CDES: A pixel-based crowd density estimation system for Masjid al-Haram [J]. Safety Science, 2011, 49(6): 824-833) extract low-level features from foreground pixels that have been scaled to correct perspective distortion and then train a neural network under supervision; the trained model estimates sparse crowds accurately, but as density increases and people occlude one another, the estimation error rises sharply. Second, texture- and gradient-based methods: compared with pixel-based approaches, texture and gradient features can better represent the number of people in a scene. Texture and gradient features used in crowd density estimation include the gray-level co-occurrence matrix (GLCM), the uniform local binary pattern (ULBP) feature, the HOG feature, and the gradient orientation co-occurrence matrix (GOCM), among others. Third, feature-point-based methods: feature points are pixels of interest, such as corner points detected in the image. Examples include Conte et al. (Conte D, Foggia P, Percannella G, et al. Counting moving persons in crowded scenes [J]. Machine Vision and Applications, 2013, 24(5): 1029-) and Kishore et al. (Kishore P V V, Rahul R, Sravya K, et al. Crowd density analysis and tracking [C]. 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2015: 1209-).

Indirect methods usually need to extract foreground or motion information to reduce background interference. In practical applications, illumination changes, continuous crowding of pedestrians, varied background factors and the like make extracting foreground and motion information difficult, so these methods struggle to produce accurate estimates in practice.

Disclosure of Invention

The invention aims to provide a crowd density estimation method based on multi-feature regression type ensemble learning.

In order to achieve the purpose, the invention adopts the following technical scheme: a crowd density estimation method based on multi-feature regression ensemble learning comprises the following steps:

image blocking step: acquiring a video surveillance frame image of a scene, performing multi-level image blocking on the scene with the width of a human head as a reference, scaling the multi-level block images to a uniform size, and applying Gamma correction preprocessing to obtain sub-image samples;

a crowd density estimation step: performing coarse prediction on each of the three features D-SIFT, GLCM and GIST of the sub-image samples with a first-layer support vector regression model; performing fine prediction with a second-layer support vector regression model that takes the coarse prediction results as new features; summing the fine prediction results of all the sub-image samples; and performing density estimation according to the crowd density grades set for the scene.

Preferably, the multi-level image blocking specifically comprises the following steps:

first, a scene region of interest is delimited and the size of the first-layer block image is determined: a reference pedestrian is selected; when the head of the reference pedestrian just enters the bottom boundary of the region of interest, the head width is measured as w pixels, and the width of the first-layer block image is set to w × 128/42 pixels; the reference pedestrian then continues to move forward until the head width is w × 21/42 = w/2 pixels, and the distance from the top of the head to the bottom boundary of the region of interest at that moment is the height of the first-layer block image;

the size of the second-layer block image is then determined: a reference pedestrian is selected; when the head of the reference pedestrian just crosses the upper edge of the first-layer block image, the head width is measured as w₁ pixels, and the width of the second-layer block image is set to w₁ × 128/42 pixels; the reference pedestrian then continues to move forward until the head width is w₁ × 21/42 = w₁/2 pixels, and the distance from the top of the head to the upper edge of the first-layer block image at that moment is the height of the second-layer block image;

the size of the third-layer block image is determined by analogy, and so on, until the multi-level block images cover the scene region of interest completely and without overlap.
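The block-width rule can be illustrated with a short sketch. One reading of the 128/42 and 21/42 ratios, which the text does not state explicitly, is that once a block is scaled to 128 pixels wide, heads at its near edge are about 42 pixels wide and heads at its far edge about 21 pixels wide. The function names below are hypothetical and chosen only for illustration.

```python
TARGET = 128    # side length of every normalised sub-image, in pixels
HEAD_NEAR = 42  # assumed head width at the near edge after scaling
HEAD_FAR = 21   # assumed head width at the far edge after scaling


def block_width(head_width_px):
    """Block width w * 128/42 for a head measured as w pixels at the near edge."""
    return head_width_px * TARGET / HEAD_NEAR


def far_head_width(head_width_px):
    """Head width w * 21/42 = w/2 at which the block's far edge is placed."""
    return head_width_px * HEAD_FAR / HEAD_NEAR


def scale_factor(head_width_px):
    """Horizontal factor that maps the block width to TARGET pixels."""
    return TARGET / block_width(head_width_px)


# A head measured at 84 px gives a 256 px wide block that is shrunk by half.
print(block_width(84), far_head_width(84), scale_factor(84))
```

Under this reading, each further level starts where the head has shrunk to half its width at the previous level's near edge, so the head-width range inside every scaled block is the same.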

Preferably, after scaling to a uniform size, each multi-level block image is 128 pixels wide and 128 pixels high.

Preferably, the step of obtaining the sub-images from the multi-level block images through Gamma correction preprocessing comprises: first dividing the pixel value range 0-255 into three intervals, and then converting the pixel value into an angle, with the specific expression as follows:

where x is the pixel value, x₀ and x₁ are preset pixel thresholds, E₁ = [0, x₀], E₂ = [x₀, x₁], E₃ = [x₁, 255],

and the resulting value is the converted angle;

the Gamma value γ(x) is then determined using trigonometric relationships, defined as follows:

adjusting the Gamma value by the weight a alone makes the Gamma value fluctuate too much, so a weight b is introduced and the linear correction function shown in formula (3) is used for correction.

The final corrected Gamma value is defined as

The pixel corrected value is

Preferably, the crowd density estimating step comprises:

extracting the D-SIFT, GLCM and GIST features from each sub-image sample;

training a coarse prediction model for each extracted feature with a first-layer support vector regression model, and passing the test sample set through the coarse prediction models to obtain coarse people-count predictions corresponding to the three features D-SIFT, GLCM and GIST;

taking the coarse people-count predictions as a new feature, training a fine prediction model with a second-layer support vector regression model, and passing the coarse predictions through the fine prediction model to obtain a more accurate people-count prediction for the sub-image sample, namely the fine prediction value;

summing the fine prediction values of all sub-image samples of one frame to obtain the number of people in the scene region of interest;

and obtaining the crowd density estimate of the current frame according to the density classification standard of the scene region of interest.
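Of the three features named in the steps above, GLCM is the simplest to illustrate. The following is a minimal sketch of a gray-level co-occurrence matrix and two commonly used statistics derived from it. The quantisation to 4 gray levels, the single horizontal offset and the choice of statistics are assumptions made for illustration; the text does not specify the GLCM parameters used.

```python
def glcm(image, levels=4, dx=1, dy=0):
    """Normalised gray-level co-occurrence matrix for one pixel offset.

    image is a list of rows of 8-bit gray values (0-255).
    """
    h, w = len(image), len(image[0])
    # Quantise 0-255 into `levels` bins.
    q = [[min(p * levels // 256, levels - 1) for p in row] for row in image]
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                counts[q[y][x]][q[y2][x2]] += 1
                total += 1
    return [[n / total for n in row] for row in counts]


def glcm_stats(p):
    """Return (contrast, angular second moment) of a normalised GLCM."""
    n = len(p)
    contrast = sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))
    asm = sum(p[i][j] ** 2 for i in range(n) for j in range(n))
    return contrast, asm


flat = [[100] * 4 for _ in range(4)]   # uniform patch: no texture
stripes = [[0, 255, 0, 255]] * 4       # vertical stripes: strong texture
flat_stats = glcm_stats(glcm(flat))
stripe_stats = glcm_stats(glcm(stripes))
print(flat_stats, stripe_stats)
```

The uniform patch has zero contrast, while the striped patch has high contrast, which is the kind of texture difference a regression model can relate to the number of people in a sub-image.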

Compared with the prior art, the invention has the following beneficial effects: by building the model from samples of multiple scenes, with multiple features and regression-type ensemble learning, the method addresses scene illumination change, changes in camera height and angle, and pedestrian occlusion, and is applicable to crowd density estimation in multiple different scenes.

The invention is further described below with reference to the accompanying drawings and specific embodiments.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of block image sizing;

FIG. 3 is a schematic diagram of the correspondence between the multi-level block images and the scene region of interest;

FIG. 4 is a flowchart of regression-type ensemble learning.

Detailed Description

In order to more fully understand the technical contents of the present invention, the technical solutions of the present invention will be further described and illustrated with reference to specific embodiments.

As shown in FIG. 1, a schematic flow chart of the present invention, a crowd density estimation method based on multi-feature regression ensemble learning includes the following steps:

Further, FIG. 2 is a schematic diagram of block image sizing, and FIG. 3 is a schematic diagram of the correspondence between the multi-level block images and the scene region of interest. In the above technical solution, the specific steps of multi-level image blocking are as follows:

Image blocking with the human head width as a reference divides each frame into multiple levels of blocks of different sizes, from near to far. The blocks serve as the basic units for building the model and predicting the number of people, which addresses the perspective projection effect.

After image blocking, many multi-level block images are obtained that differ in distance, size, time of day and weather. Before features are extracted, these block images are preprocessed to reduce environmental interference and the amount of training.

First, the multi-level block images are scaled to a uniform size of 128 pixels in both width and height. With the size normalised, block images at different distances become samples of the same size that can be trained together; samples at near and far positions do not need to be trained separately, which greatly reduces the amount of training.

Second, to reduce the influence of ambient light, Gamma correction is applied to the block images; the multi-level block images are preprocessed by Gamma correction to obtain the sub-images. Specifically, the pixel value range 0-255 is first divided into three intervals, and the pixel value is then converted into an angle, with the specific expression as follows:

where x is the pixel value, x₀ and x₁ are preset pixel thresholds, E₁ = [0, x₀], E₂ = [x₀, x₁], E₃ = [x₁, 255],

and the resulting value is the converted angle;

The final corrected Gamma value is defined as

The pixel corrected value is
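A minimal sketch of per-pixel Gamma correction follows, assuming the conventional power-law form x' = 255 · (x/255)^γ(x). The function `example_gamma`, its thresholds and its three constant Gamma values are hypothetical placeholders for a three-interval rule and are not the angle-based Gamma definition of the invention.

```python
def gamma_correct(pixels, gamma_of):
    """Apply x' = 255 * (x / 255) ** gamma_of(x) to each 8-bit pixel value."""
    return [round(255 * (x / 255) ** gamma_of(x)) for x in pixels]


def example_gamma(x, x0=85, x1=170):
    """Hypothetical three-interval rule: brighten dark pixels, darken bright ones."""
    if x <= x0:
        return 0.8   # gamma < 1 raises dark values
    if x <= x1:
        return 1.0   # mid-tones are left unchanged
    return 1.2       # gamma > 1 lowers bright values


corrected = gamma_correct([64, 128, 200], example_gamma)
print(corrected)
```

A Gamma value that depends on the pixel value, as here, compresses the spread between dark and bright regions, which is the purpose of the illumination normalisation described for the sub-images.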

Further, as shown in FIG. 4, a flowchart of regression-type ensemble learning, the crowd density estimation step includes:

extracting the D-SIFT, GLCM and GIST features from each sub-image sample, denoted x_D-SIFT, x_GLCM and x_GIST;

training a coarse prediction model for each extracted feature with a first-layer support vector regression model; regression fitting yields three models f₁(x_D-SIFT), f₂(x_GLCM) and f₃(x_GIST). For the test sample set, the models output the predicted values y_D-SIFT, y_GLCM and y_GIST for the three features D-SIFT, GLCM and GIST, and the three predicted values, which are the coarse people-count predictions, are combined into a new feature:

x_ALL = [y_D-SIFT, y_GLCM, y_GIST] (11)

The new feature is used to train a fine prediction model f_Final(x_ALL) with a second-layer support vector regression model; passing the coarse people-count predictions through the fine prediction model yields a more accurate people-count prediction y_Final for the sub-image sample, namely the fine prediction value.

Regression-type ensemble learning comprises two parts, a training (learning) part and a prediction (application) part, as shown in FIG. 4. The training part trains the regression models. First, features are extracted from a number of sub-images, and the number of people in each sub-image is counted as its people-count label, forming the sample set of the training part. The sample set is then divided into a training set and a test set; the coarse regression models corresponding to the three features are trained on the training set, and the test set is passed through the coarse regression models to obtain the corresponding prediction outputs, namely the coarse predicted values. The coarse predicted values of the three models are taken as new features and combined with the people-count labels to form a new sample set, which is again divided into a new training set and a new test set. The fine regression model is trained on the new training set, and the new test set is passed through the fine regression model to obtain fine predicted values, which are used to judge whether the model is accurate.

The prediction part predicts the number of people with the trained models. Features are extracted from a test sample whose people count is unknown, the coarse regression models trained in the training part produce three coarse predicted values, and these are input as a new feature to the fine regression model to obtain the fine predicted value, namely the final people-count prediction.

Because different features differ in their sensitivity to crowd density, two-layer regression lets the features compensate for one another's weaknesses and improves prediction accuracy.
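The two-layer structure described above can be sketched as follows. To keep the example dependency-free, a small ridge-regularised least-squares regressor stands in for the support vector regression models; it is not SVR. The function names, the synthetic feature vectors and the 60/40 split are assumptions for illustration only, and the second-layer hold-out check described in the text is omitted for brevity.

```python
import random


def fit_ridge(X, y, lam=1e-3):
    """Least-squares fit y ~ w.x + b with a small ridge term (stand-in for SVR)."""
    rows = [list(x) + [1.0] for x in X]            # append a bias column
    d = len(rows[0])
    # Normal equations: (A^T A + lam * I) w = A^T y
    M = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    v = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    for c in range(d):                             # elimination with partial pivoting
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p], v[c], v[p] = M[p], M[c], v[p], v[c]
        for r in range(c + 1, d):
            f = M[r][c] / M[c][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[c])]
            v[r] -= f * v[c]
    w = [0.0] * d
    for c in reversed(range(d)):                   # back substitution
        w[c] = (v[c] - sum(M[c][k] * w[k] for k in range(c + 1, d))) / M[c][c]
    return lambda x: sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))


def train_two_layer(feature_sets, counts, split):
    """feature_sets maps a feature name to one feature vector per sub-image."""
    names = ("D-SIFT", "GLCM", "GIST")
    # Layer 1: one coarse model per feature, trained on the first `split` samples.
    coarse = {n: fit_ridge(feature_sets[n][:split], counts[:split]) for n in names}
    # Coarse predictions on the held-out samples form the new feature x_ALL.
    x_all = [[coarse[n](feature_sets[n][i]) for n in names]
             for i in range(split, len(counts))]
    # Layer 2: the fine model maps x_ALL to the people count.
    fine = fit_ridge(x_all, counts[split:])
    return lambda sample: fine([coarse[n](sample[n]) for n in names])


# Synthetic stand-ins for real features: noisy linear functions of the count.
rng = random.Random(0)
counts = [float(rng.randint(0, 20)) for _ in range(100)]


def noisy(c, scale):
    return [scale * c + rng.gauss(0, 0.2), 0.5 * scale * c + rng.gauss(0, 0.2)]


features = {"D-SIFT": [noisy(c, 1.0) for c in counts],
            "GLCM": [noisy(c, 2.0) for c in counts],
            "GIST": [noisy(c, 0.5) for c in counts]}
predict = train_two_layer(features, counts, split=60)
estimate = predict({"D-SIFT": [10.0, 5.0], "GLCM": [20.0, 10.0], "GIST": [5.0, 2.5]})
print(round(estimate, 2))
```

Training the fine model only on coarse predictions for samples the coarse models have not seen follows the train/test division described above and keeps the second layer from learning from over-fitted first-layer outputs.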

and obtaining the crowd density estimate of the current frame according to the density classification standard of the scene region of interest. For example, taking the maximum number of people n_max that the current scene can accommodate as the standard and dividing evenly into five grades gives [0, n_max/5], [n_max/5, 2n_max/5], [2n_max/5, 3n_max/5], [3n_max/5, 4n_max/5] and [4n_max/5, ∞), denoted VL (very low), L (low), M (medium), H (high) and VH (very high) respectively; comparing the counted number of people in the scene region of interest against these grades completes the crowd density estimation.
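The final grading step can be sketched as below. The intervals listed in the text share their end points, so this sketch assumes each boundary value belongs to the higher grade; that tie-breaking rule, the clamping of negative totals to VL, and the function name are assumptions for illustration.

```python
GRADES = ["VL", "L", "M", "H", "VH"]


def density_grade(sub_image_counts, n_max):
    """Sum the fine predictions of one frame and map the total to a grade."""
    total = sum(sub_image_counts)
    step = n_max / 5                       # width of each of the five grades
    index = int(total // step)             # boundary values go to the higher grade
    index = max(0, min(index, len(GRADES) - 1))
    return total, GRADES[index]


# Three sub-images with predicted counts 10, 20 and 15 in a scene holding 100 people.
print(density_grade([10, 20, 15], 100))
```

Clamping the index also covers totals above n_max, which fall in the open-ended top grade [4n_max/5, ∞).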

The above embodiments further illustrate the technical content of the present invention to aid the reader's understanding, but the embodiments of the present invention are not limited thereto; any technical extension or re-creation based on the present invention falls within the protection scope of the present invention.

Claims

1. A crowd density estimation method based on multi-feature regression ensemble learning is characterized by comprising the following steps:

a crowd density estimation step: performing coarse prediction on each of the three features D-SIFT, GLCM and GIST of the sub-image samples with a first-layer support vector regression model; performing fine prediction with a second-layer support vector regression model that takes the coarse prediction results as new features; summing the fine prediction results of all sub-image samples; and performing density estimation according to the crowd density grades set for the scene;

the method for partitioning the multi-level image comprises the following specific steps:

2. The crowd density estimation method based on multi-feature regression ensemble learning of claim 1, wherein, after scaling to a uniform size, each multi-level block image is 128 pixels wide and 128 pixels high.

3. The crowd density estimation method based on multi-feature regression ensemble learning of claim 2, wherein the step of obtaining the sub-images from the multi-level block images through Gamma correction preprocessing comprises: first dividing the pixel value range 0-255 into three intervals, and then converting the pixel value into an angle, with the specific expression as follows:

where x is the pixel value, x₀ and x₁ are preset pixel thresholds, E₁ = [0, x₀], E₂ = [x₀, x₁], E₃ = [x₁, 255],

and the resulting value is the converted angle;

The final corrected Gamma value is defined as

The pixel corrected value is

4. The crowd density estimation method based on the multi-feature regression-based ensemble learning according to claim 1, 2 or 3, wherein the crowd density estimation step comprises:

and summing the fine predicted values of all sub-image samples of one frame to obtain the number of people in the scene region of interest, and obtaining the crowd density estimate of the current frame according to the density classification standard of the scene region of interest.