CN111709300B - Crowd counting method based on video image - Google Patents

Crowd counting method based on video image

Info

Publication number
CN111709300B
Authority
CN
China
Prior art keywords
image
density
background
pedestrian
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010430583.7A
Other languages
Chinese (zh)
Other versions
CN111709300A (en)
Inventor
韩铠宇
翁立
王建中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202010430583.7A priority Critical patent/CN111709300B/en
Publication of CN111709300A publication Critical patent/CN111709300A/en
Application granted granted Critical
Publication of CN111709300B publication Critical patent/CN111709300B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a crowd counting method based on video images. The input data are consecutive video-frame images; redundant information is separated out by pixel-wise subtraction between each captured frame and a given background image, yielding a preprocessed input image. The preprocessed image is fed into a density-classification-based encoding-decoding network model: a backbone network extracts multi-scale features, a density-regression branch fuses these features and assigns weights by density regression, and the same multi-scale features are upsampled into per-scale density estimation maps, which are finally weighted to obtain the final density estimation map. Aimed at crowd counting in video images, the method exploits the similarity between pedestrians to a certain extent and filters out redundant information, so that it both obtains pedestrian counts in real time and keeps the background image current in real time.

Description

Crowd counting method based on video image
Technical Field
The invention belongs to the field of crowd image processing in computer vision, and particularly relates to a method for counting crowds and segmenting a pedestrian background in an image.
Background
Crowd counting is the task of counting the number of pedestrians in an image or a sequence of video frames. In real life, effective pedestrian counting matters in fields such as safety control, area planning, and behavior analysis; for example, it provides data support for stampede prevention, traffic route design, advertisement placement, and building site selection.
Current pedestrian counting methods fall mainly into three categories: early detection-based approaches, regression-based approaches, and today's density-map regression approaches. Detection-based methods detect pedestrians through a sliding window using features such as edges; they are limited by pedestrian occlusion and suit scenes with dispersed targets. Regression-based methods improve counting accuracy in occluded crowds to a certain extent, but cannot capture the spatial distribution of pedestrians well.
With the continuous development of computer vision, pedestrian counting has turned to density-map regression. Compared with the two methods above, density-map regression handles the occlusion problem while also providing the distribution of pedestrians, and thus yields concrete spatial-distribution information.
Pedestrian counting still faces difficulties shared by many computer vision tasks. For example, the distortion introduced by perspective transformation makes it harder to detect people at different scales. Most existing counting methods combine deep learning with multi-scale feature extraction: multi-layer or multi-column convolutions extract pedestrian features at different scales, which alleviates the perspective problem to a certain extent, but room for improvement remains.
In fact, pedestrian counting in a fixed scene involves a large amount of redundant information: surrounding buildings and parked vehicles, for instance, often remain unchanged over a given period. Existing deep-learning density-map methods spend resources computing over this interfering data, which slows down computation. Such background interference can be filtered out in advance by online background updating and background segmentation within the video-stream processing.
Combining the above ideas, the invention provides a crowd counting method based on video images and online background segmentation.
Disclosure of Invention
To address the problems in the existing pedestrian counting field, the invention provides a crowd counting method based on video images. The method has the following advantages:
In the model training stage: 1) a mature small multi-layer convolutional neural network (CNN), such as a VGG-16 structure, is selected for primary feature extraction, which gives the image a strong representation while reducing parameters, keeping the model simple and widely applicable; 2) density estimation is performed on the image using the obtained multi-scale features. When pedestrians resemble one another and the density is high, counting is performed more effectively mainly from low-level features; in less dense cases, high-level pedestrian features make the count more accurate. Classifying by density therefore lets the statistics adapt to different occlusion conditions and improves counting accuracy.
In the application stage, a background segmentation method separates out environmental interference while retaining the key information; the portion of the image participating in computation is reduced to a sparse matrix, which speeds up the subsequent pedestrian-count regression. The background is continuously updated using the spatial information from pedestrian detection together with the information retained by the background segmentation, until the complete background is finally separated out.
A crowd counting method based on video images comprises the following steps:
Step one: select a pedestrian image data set with annotation information, splitting it into a test set and a training set at a ratio of 6:4 (the proportion may be adjusted to the actual data set); then apply Gaussian-function processing to the head-annotation pixels of each image to generate an initial ground-truth density map corresponding to the original image;
Step two: build a density-classification-based encoding-decoding convolutional network model.
The density-classification-based encoding-decoding convolutional network model is divided into a backbone network and two branches. A VGG-16 network serves as the backbone, and its layers extract features at the corresponding scales. The density-regression branch takes the fused multi-scale features as input and performs density classification by regression to obtain the weights for the decoding branches; each decoding branch upsamples and decodes the features of one scale back toward the image resolution, generating a crowd-density estimation map for that scale, and the per-scale maps are weighted with the density-regression branch's weights to obtain the final density estimation map.
Step three: train the density-classification-based encoding-decoding convolutional network model built in step two on the training set, optimizing the parameters by stochastic gradient descent and measuring the loss between the density estimation map and the ground-truth density map with the Euclidean distance. The complete model with the best performance is retained for actual detection;
Step four: reduce the input image with the background-separation preprocessing method to complete the generation of a sparse matrix, then obtain the final counting result with the density-classification-based encoding-decoding convolutional network model from step three.
Background-separation method: perform pixel-wise subtraction between each captured video frame and a given background image, and use thresholding to retain only the image content that does not belong to the background, reducing the input image content and improving convolution efficiency. The pedestrian-containing regions are then extracted from the final density estimation map generated by the encoding-decoding convolutional network model, and the remaining regions are merged into the background layer, so the background is updated in real time.
The specific content of the first step is as follows:
and converting the pedestrian image with the head position mark in the data set into a true value density map by utilizing a two-dimensional Gaussian convolution kernel for loss difference calculation. Selecting a density map based on a geometrically adapted Gaussian kernel, and formulating as follows:
$$F(x)=\sum_{i=1}^{N}\delta(x-x_i)\ast G_{\sigma_i}(x),\qquad \sigma_i=\beta\,\bar{d}_i$$

The ground-truth density map is obtained by convolving a delta impulse function with a Gaussian function, convolving first and then summing over all heads. Here $x_i$ denotes the pixel position of the $i$-th head in the image; $\delta(x-x_i)$ is the impulse function at that head position; $N$ is the total number of heads in the image; $G_{\sigma_i}(x)$ is a Gaussian kernel of width $\sigma_i$; $\bar{d}_i$ is the average distance from $x_i$ to its $m$ nearest neighboring heads; and $\beta$ is a fixed value, the width parameter used to generate the Gaussian function.
Further, β is 0.3.
The above operation converts the pedestrian images with head annotations into ground-truth density maps, which serve as the comparison targets for the output of the convolutional neural network in subsequent training.
The third step comprises the following specific contents:
and (4) training the coding-decoding convolution network model which is built in the step two and is based on density classification by using the test set image as input, and reserving model parameters. The loss between the final density estimate map and the true density map is calculated using the euclidean distance. The parameters are optimized using a random gradient descent algorithm until the loss values converge to the expected values.
With the Euclidean distance measuring the distance between the generated density map and the ground truth, the loss function is defined as follows:
$$L(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|Z(X_i;\Theta)-Z_i^{GT}\right\|_2^2$$

where $N$ denotes the number of images input to the encoding-decoding convolutional network model, $Z(X_i;\Theta)$ is the final density estimation map corresponding to the $i$-th input image, $Z_i^{GT}$ is the corresponding ground-truth density map, and $\Theta$ represents the network parameters to be learned.
The encoding-decoding convolutional network model is evaluated with the mean squared error (MSE) and the mean absolute error (MAE). MSE describes the accuracy of the model (the smaller the MSE, the higher the accuracy), while MAE reflects the error of the predicted values.
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|C_i-C_i^{GT}\right|,\qquad \mathrm{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i-C_i^{GT}\right)^2}$$

where $C_i$ denotes the predicted count for the $i$-th image and $C_i^{GT}$ denotes the actual count.
Testing process: select the test set, feed it into the trained model, output the final crowd-density maps, and tally the counting results. The parameters that give the best result are retained and packaged as the model parameters.
The concrete content of the fourth step is as follows:
and subtracting the background image from the collected continuous video frames by using a background separation method, namely obtaining a difference image by performing pixel subtraction on the input initial image and the background image. The difference map contains information of all irrelevant backgrounds, including the change of shadows caused by pedestrians, vehicles and light irradiation. And performing threshold division on the difference map to filter out small interference such as illumination and the like to obtain a region of interest (ROI) separating the background. And (4) reserving the ROI image, namely the effective image in the model in the input step three. In the process, the filtering of redundant information is realized, and the convolution rate of the coding-decoding convolution network model is improved in a sparse matrix form.
After the final density estimation map of the ROI image is obtained, a pedestrian mask template is constructed by manual calibration (set according to the actual conditions). A dilation operation from morphological image processing is applied between the pedestrian mask template and the final density estimation map (each highlighted point in the density map is convolved with the mask template to produce a dilated region indicating that pedestrians are present there), yielding a pedestrian map; inverting the pixel values of the pedestrian map gives the background-update mask. Point-wise multiplication of the background-update mask with the initial image produces an updated background image, which replaces the background image used in the background subtraction, realizing online updating of the background image.
The captured information is preprocessed by step four, and pedestrians are then detected and counted with the optimal model selected in step three, realizing efficient pedestrian counting and spatial-information feedback.
The invention has the following beneficial effects:
the invention adopts a coding-decoding network based on density classification to generate a final density estimation graph; and the preprocessing of the image is realized by utilizing a background separation method, and the generation of a final density estimation graph is accelerated.
The input data are consecutive video-frame images; the preprocessed input image is obtained by pixel-wise subtraction between each captured frame and a given background image, which separates out the redundant information. The preprocessed image is fed into the density-classification-based encoding-decoding network. Exploiting the similar appearance of pedestrians at a given density, the network extracts multi-scale features with the backbone, fuses them, and assigns weights by density regression; the extracted multi-scale features are simultaneously upsampled into per-scale density estimation maps, which are finally weighted to obtain the final density estimation map. Compared with existing crowd counting techniques, the method targets crowd counting in video images, exploits the similarity between pedestrians to a certain extent, and filters out redundant information, so it both obtains pedestrian counts in real time and keeps the background image current in real time. In addition, the density-classification-based encoding-decoding network can be used on its own for pedestrian counting on single images.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a model of a density classification based encoding-decoding convolutional network;
FIG. 3 is a network model training flow diagram of the present invention;
FIG. 4 is a schematic diagram of a background separation process;
FIG. 5 is a flow chart of the present invention.
Detailed Description
The method of the invention is further described below with reference to the accompanying drawings and examples.
As shown in fig. 1, a method for counting people based on video images includes the following steps:
Step one: select a pedestrian image data set with annotation information, splitting it into a test set and a training set at a ratio of 6:4 (the proportion may be adjusted to the actual data set); then apply Gaussian-function processing to the head-annotation pixels of each image to generate an initial ground-truth density map corresponding to the original image;
the concrete content is as follows:
and converting the pedestrian image with the head position mark in the data set into a true value density map by utilizing a two-dimensional Gaussian convolution kernel for loss difference calculation. In order to make the density map correspond to the image with different visual angles and dense crowd better, the density map based on the geometric adaptive Gaussian kernel is selected, and the formula is as follows:
$$F(x)=\sum_{i=1}^{N}\delta(x-x_i)\ast G_{\sigma_i}(x),\qquad \sigma_i=\beta\,\bar{d}_i$$

The ground-truth density map is obtained by convolving a delta impulse function with a Gaussian function, convolving first and then summing over all heads. Here $x_i$ denotes the pixel position of the $i$-th head in the image; $\delta(x-x_i)$ is the impulse function at that head position; $N$ is the total number of heads in the image; $G_{\sigma_i}(x)$ is a Gaussian kernel of width $\sigma_i$; $\bar{d}_i$ is the average distance from $x_i$ to its $m$ nearest neighboring heads; and $\beta$ is a fixed value, the width parameter used to generate the Gaussian function.
Further, β is 0.3.
The above operation converts the pedestrian images with head annotations into ground-truth density maps, which serve as the comparison targets for the output of the convolutional neural network in subsequent training.
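For illustration only, the following is a minimal Python sketch of the geometry-adaptive ground-truth generation described above. The single-head fallback value of σ and the use of SciPy's KDTree and gaussian_filter are implementation assumptions, not details fixed by the invention.

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.ndimage import gaussian_filter

def geometry_adaptive_density(shape, heads, beta=0.3, m=3):
    """Ground-truth density map: one Gaussian per annotated head.

    shape: (H, W) of the image; heads: (N, 2) array of (x, y) head pixels.
    The kernel width is sigma_i = beta * d_bar_i, where d_bar_i is the
    average distance from head i to its m nearest neighboring heads.
    """
    density = np.zeros(shape, dtype=np.float32)
    if len(heads) == 0:
        return density
    # Distances to the m nearest neighbors (k = m+1: the first hit is the point itself).
    dists, _ = KDTree(heads).query(heads, k=min(m + 1, len(heads)))
    for i, (x, y) in enumerate(heads):
        impulse = np.zeros(shape, dtype=np.float32)
        impulse[min(int(y), shape[0] - 1), min(int(x), shape[1] - 1)] = 1.0
        # A single-head image has no neighbors; fall back to a fixed sigma (assumed value).
        sigma = beta * float(dists[i][1:].mean()) if len(heads) > 1 else 15.0
        density += gaussian_filter(impulse, sigma, mode='constant')
    return density
```

Since each Gaussian integrates to approximately one, summing the resulting map approximates the annotated head count N.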
And step two, building a coding-decoding convolution network model based on density classification.
As shown in fig. 2, the density-classification-based encoding-decoding convolutional network model is divided into a backbone network and two branches. A VGG-16 network serves as the backbone, and its layers extract features at the corresponding scales. The density-regression branch takes the fused multi-scale features as input and performs density classification by regression to obtain the weights for the decoding branches; each decoding branch upsamples and decodes the features of one scale back toward the image resolution, generating a crowd-density estimation map for that scale, and the per-scale maps are weighted with the density-regression branch's weights to obtain the final density estimation map.
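A possible PyTorch sketch of such a model is given below. The exact tap points on VGG-16, the number of upsampling stages per decoding branch, and the use of three density classes are illustrative assumptions; the description fixes only the overall structure (a VGG-16 backbone, a density-regression branch that yields weights, and per-scale decoding branches whose density maps are combined by those weights).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DensityClassEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16(weights="IMAGENET1K_V1").features  # torchvision >= 0.13
        # Split the VGG-16 backbone so multi-scale feature maps can be tapped.
        self.stage1 = feats[:16]    # through conv3_3: 256 ch at 1/4 resolution
        self.stage2 = feats[16:23]  # through conv4_3: 512 ch at 1/8 resolution
        self.stage3 = feats[23:30]  # through conv5_3: 512 ch at 1/16 resolution
        # Decoding branches: upsample each scale back to a density map.
        self.dec = nn.ModuleList([self._decoder(256, 2),
                                  self._decoder(512, 3),
                                  self._decoder(512, 4)])
        # Density-regression branch: fused features -> one weight per branch.
        self.cls = nn.Sequential(nn.Flatten(),
                                 nn.Linear(256 + 512 + 512, 64),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(64, 3))

    @staticmethod
    def _decoder(in_ch, n_up):
        layers, ch = [], in_ch
        for _ in range(n_up):  # double the resolution n_up times
            layers += [nn.Conv2d(ch, ch // 2, 3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.Upsample(scale_factor=2, mode='bilinear',
                                   align_corners=False)]
            ch //= 2
        return nn.Sequential(*layers, nn.Conv2d(ch, 1, 1))

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        # Fuse globally pooled multi-scale features for density classification.
        pooled = torch.cat([F.adaptive_avg_pool2d(f, 1) for f in (f1, f2, f3)], 1)
        w = torch.softmax(self.cls(pooled), dim=1)  # (B, 3) branch weights
        maps = [F.interpolate(dec(f), size=x.shape[2:], mode='bilinear',
                              align_corners=False)
                for dec, f in zip(self.dec, (f1, f2, f3))]
        # Final density map: weighted sum of the per-scale estimates.
        return sum(w[:, i].view(-1, 1, 1, 1) * m for i, m in enumerate(maps))
```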
Step three: train the density-classification-based encoding-decoding convolutional network model built in step two on the training set, optimizing the parameters by stochastic gradient descent and measuring the loss between the density estimation map and the ground-truth density map with the Euclidean distance. The complete model with the best performance is retained for actual detection;
as shown in fig. 3, the specific content is:
Train the density-classification-based encoding-decoding convolutional network model built in step two using the training-set images as input, and retain the model parameters. The loss between the final density estimation map and the ground-truth density map is computed with the Euclidean distance, and the parameters are optimized by stochastic gradient descent until the loss value converges to the expected value.
With the Euclidean distance measuring the distance between the generated density map and the ground truth, the loss function is defined as follows:
$$L(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|Z(X_i;\Theta)-Z_i^{GT}\right\|_2^2$$

where $N$ denotes the number of images input to the encoding-decoding convolutional network model, $Z(X_i;\Theta)$ is the final density estimation map corresponding to the $i$-th input image, $Z_i^{GT}$ is the corresponding ground-truth density map, and $\Theta$ represents the network parameters to be learned.
The encoding-decoding convolutional network model is evaluated with the mean squared error (MSE) and the mean absolute error (MAE). MSE describes the accuracy of the model (the smaller the MSE, the higher the accuracy), while MAE reflects the error of the predicted values.
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|C_i-C_i^{GT}\right|,\qquad \mathrm{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i-C_i^{GT}\right)^2}$$

where $C_i$ denotes the predicted count for the $i$-th image and $C_i^{GT}$ denotes the actual count.
The testing process: select the test set, feed it into the trained model, output the final crowd-density maps, and tally the counting results. The parameters that give the best result are retained and packaged as the model parameters.
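Assuming a data set that yields (image, ground-truth density map) tensor pairs, the training and evaluation described above could be sketched as follows; the batch size, learning rate, momentum, and epoch count are illustrative values rather than settings specified by the invention.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=200, lr=1e-6, device="cuda"):
    """SGD training with the Euclidean loss L(Θ) = 1/(2N) Σ ||Z(X;Θ) - Z_GT||²."""
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loader = DataLoader(train_set, batch_size=1, shuffle=True)
    for _ in range(epochs):
        for img, gt in loader:
            est = model(img.to(device))
            # Per-sample squared L2 distance, averaged over the batch, halved.
            loss = 0.5 * (est - gt.to(device)).pow(2).sum(dim=(1, 2, 3)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

@torch.no_grad()
def evaluate(model, test_set, device="cuda"):
    """MAE and MSE over a test set; counts are integrals of the density maps."""
    model.to(device).eval()
    abs_err, sq_err, n = 0.0, 0.0, 0
    for img, gt in DataLoader(test_set, batch_size=1):
        pred = model(img.to(device)).sum().item()  # predicted count C_i
        true = gt.sum().item()                     # ground-truth count
        abs_err += abs(pred - true)
        sq_err += (pred - true) ** 2
        n += 1
    return abs_err / n, (sq_err / n) ** 0.5        # MAE, MSE
```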
Step four: reduce the input image with the background-separation preprocessing method to complete the generation of a sparse matrix, then obtain the final counting result with the density-classification-based encoding-decoding convolutional network model from step three.
As shown in fig. 4, the background-separation method: perform pixel-wise subtraction between each captured video frame and a given background image, and use thresholding to retain only the image content that does not belong to the background, reducing the input image content and improving convolution efficiency. The pedestrian-containing regions are then extracted from the final density estimation map generated by the encoding-decoding convolutional network model, and the remaining regions are merged into the background layer, so the background is updated in real time.
The concrete contents are as follows:
and subtracting the background image from the collected continuous video frames by using a background separation method, namely obtaining a difference image by performing pixel subtraction on the input initial image and the background image. The difference map contains information of all irrelevant backgrounds, including the change of shadows caused by pedestrians, vehicles and light irradiation. And performing threshold division on the difference map to filter out small interference such as illumination and the like to obtain a region of interest (ROI) separating the background. And (4) reserving the ROI image, namely the effective image in the model in the input step three. In the process, the filtering of redundant information (background interference) is realized, and the convolution rate of the coding-decoding convolution network model is improved in a sparse matrix form.
After the final density estimation map of the ROI image is obtained, a pedestrian mask template is constructed by manual calibration (set according to the actual conditions). A dilation operation from morphological image processing is applied between the pedestrian mask template and the final density estimation map (each highlighted point in the density map is convolved with the mask template to produce a dilated region indicating that pedestrians are present there), yielding a pedestrian map (it contains only pedestrians, each replaced by one mask template, and is best understood as a pedestrian mask rather than the template itself); inverting the pixel values of the pedestrian map (after binarization, 0 becomes 1 and 1 becomes 0) gives the background-update mask. Point-wise multiplication of the background-update mask with the initial image produces an updated background image, which replaces the background image used in the background subtraction, realizing online updating of the background image.
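The online background update could be sketched as follows; the elliptical structuring element standing in for the manually calibrated pedestrian mask template, and the density threshold, are illustrative assumptions.

```python
import cv2
import numpy as np

def update_background(frame, background, density_map,
                      template_size=15, density_thresh=1e-3):
    """Update the background layer from the regions without pedestrians."""
    # Highlighted points of the density map mark pedestrian locations.
    ped = (density_map > density_thresh).astype(np.uint8)
    # Dilation with the pedestrian mask template expands each point into a region.
    template = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
                                         (template_size, template_size))
    ped = cv2.dilate(ped, template)
    keep = (1 - ped).astype(bool)   # inverted mask: the background-update mask
    updated = background.copy()
    updated[keep] = frame[keep]     # merge pedestrian-free pixels into the layer
    return updated
```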
As shown in fig. 5, the captured information is preprocessed by step four, and pedestrians are then detected and counted with the optimal model selected in step three, realizing efficient pedestrian counting and spatial-information feedback.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and it will be apparent to those skilled in the art that various modifications and variations can be made in the present invention. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A crowd counting method based on video images is characterized by comprising the following steps:
step one, selecting a pedestrian image data set with annotation information, wherein the test set and the training set are split at a ratio of 6:4, and performing Gaussian-function processing according to the head-annotation pixels of the image to generate an initial ground-truth density map corresponding to the original image;
step two, building a density-classification-based encoding-decoding convolutional network model;
the density-classification-based encoding-decoding convolutional network model being divided into a backbone network and two branches: a VGG-16 network serves as the backbone, and its layers extract features at the corresponding scales; the density-regression branch takes the fused multi-scale features as input and performs density classification by regression to obtain the weights for the decoding branches; each decoding branch upsamples and decodes the features of one scale back toward the image resolution, generating a crowd-density estimation map for that scale, and the per-scale maps are weighted with the weights obtained by the density-regression branch to obtain a final density estimation map;
step three, training the density-classification-based encoding-decoding convolutional network model built in step two on the training set, optimizing the parameters by stochastic gradient descent, and measuring the loss between the density estimation map and the ground-truth density map with the Euclidean distance; the complete model with the best performance is retained for actual detection;
step four, reducing the input image with the background-separation preprocessing method to complete the generation of a sparse matrix, and obtaining the final counting result through the density-classification-based encoding-decoding convolutional network model from step three;
method of background separation: the pixel subtraction is carried out on the collected continuous video frames and a given background image, and the image content of all irrelevant background information is reserved in a threshold dividing mode, so that the reduction of the input image content is realized, and the convolution efficiency is improved; and extracting the pedestrian-containing part through a final density estimation image generated by the coding-decoding convolutional network model, and updating the rest part to a background image layer in a background mode to realize the real-time updating of the background.
2. The method according to claim 1, wherein the step one comprises the following steps:
converting the pedestrian images with head-position annotations in the data set into ground-truth density maps using a two-dimensional Gaussian convolution kernel, for use in the loss computation; a density map based on a geometry-adaptive Gaussian kernel is selected, formulated as follows:
$$F(x)=\sum_{i=1}^{N}\delta(x-x_i)\ast G_{\sigma_i}(x),\qquad \sigma_i=\beta\,\bar{d}_i$$

the ground-truth density map is obtained by convolving a delta impulse function with a Gaussian function, convolving first and then summing; $x_i$ denotes the pixel position of the $i$-th head in the image; $\delta(x-x_i)$ is the impulse function at that head position; $N$ is the total number of heads in the image; $G_{\sigma_i}(x)$ is a Gaussian kernel of width $\sigma_i$; $\bar{d}_i$ is the average distance from $x_i$ to its $m$ nearest neighboring heads; $\beta$ is a fixed value, the width parameter used to generate the Gaussian function;
the above operation converts the pedestrian images with head annotations into ground-truth density maps, which serve as the comparison targets for the output of the convolutional neural network in subsequent training.
3. The method according to claim 2, wherein the third step comprises:
training the density-classification-based encoding-decoding convolutional network model built in step two using the training-set images as input, and retaining the model parameters; computing the loss between the final density estimation map and the ground-truth density map with the Euclidean distance; optimizing the parameters by stochastic gradient descent until the loss value converges to the expected value;
with the Euclidean distance measuring the distance between the generated density map and the ground truth, the loss function is defined as follows:
$$L(\Theta)=\frac{1}{2N}\sum_{i=1}^{N}\left\|Z(X_i;\Theta)-Z_i^{GT}\right\|_2^2$$

where $N$ denotes the number of images input to the encoding-decoding convolutional network model, $Z(X_i;\Theta)$ is the final density estimation map corresponding to the $i$-th input image, $Z_i^{GT}$ represents the corresponding ground-truth density map, and $\Theta$ represents the network parameters to be learned;
evaluating the encoding-decoding convolutional network model with the mean squared error (MSE) and the mean absolute error (MAE); MSE describes the accuracy of the model, the smaller the MSE the higher the accuracy, while MAE reflects the error of the predicted values;
$$\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|C_i-C_i^{GT}\right|,\qquad \mathrm{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i-C_i^{GT}\right)^2}$$

where $C_i$ denotes the predicted count for the $i$-th image and $C_i^{GT}$ denotes the actual count;
the testing process: selecting the test set, feeding it into the trained model, outputting the final crowd-density maps, and tallying the counting results; the parameters giving the best result are retained and packaged as the model parameters.
4. The method according to claim 3, wherein the detailed contents of the fourth step are as follows:
subtracting the background image from each captured video frame using the background-separation method, namely obtaining a difference map by pixel-wise subtraction between the input initial image and the background image; the difference map contains everything that does not belong to the background, including pedestrians, vehicles, and shadow changes caused by lighting; thresholding the difference map filters out small illumination disturbances, yielding a region of interest (ROI) with the background separated; the retained ROI image is the effective image fed into the model of step three; this process filters out the redundant information and, through the sparse-matrix form, raises the convolution rate of the encoding-decoding convolutional network model;
after the final density estimation map of the ROI image is obtained, constructing a pedestrian mask template by manual calibration, applying a dilation operation from morphological image processing between the pedestrian mask template and the final density estimation map to obtain a pedestrian map, and inverting the pixel values of the pedestrian map to obtain a background-update mask, wherein each highlighted point in the density map is convolved with the mask template to produce a dilated region indicating that pedestrians are present there; performing point-wise multiplication of the background-update mask with the initial image to obtain an updated background image, which replaces the background image participating in the background subtraction and realizes online updating of the background image;
preprocessing the captured information through step four, and detecting and counting pedestrians with the optimal model selected in step three, realizing efficient pedestrian counting and spatial-information feedback.
5. The method of claim 2, wherein β is 0.3.
CN202010430583.7A 2020-05-20 2020-05-20 Crowd counting method based on video image Active CN111709300B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010430583.7A CN111709300B (en) 2020-05-20 2020-05-20 Crowd counting method based on video image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010430583.7A CN111709300B (en) 2020-05-20 2020-05-20 Crowd counting method based on video image

Publications (2)

Publication Number Publication Date
CN111709300A CN111709300A (en) 2020-09-25
CN111709300B true CN111709300B (en) 2022-08-12

Family

ID=72538030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010430583.7A Active CN111709300B (en) 2020-05-20 2020-05-20 Crowd counting method based on video image

Country Status (1)

Country Link
CN (1) CN111709300B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515990A (en) * 2020-09-28 2021-10-19 阿里巴巴集团控股有限公司 Image processing and crowd density estimation method, device and storage medium
CN112632601B (en) * 2020-12-16 2024-03-12 苏州玖合智能科技有限公司 Crowd counting method for subway carriage scene
CN112767316A (en) * 2020-12-31 2021-05-07 山东师范大学 Crowd counting method and system based on multi-scale interactive network
CN112699848B (en) * 2021-01-15 2022-05-31 上海交通大学 Counting method and system for dense crowd of image
CN112597985B (en) * 2021-03-04 2021-07-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN116703904A (en) * 2023-08-04 2023-09-05 中建八局第一数字科技有限公司 Image-based steel bar quantity detection method, device, equipment and medium
CN117854191A (en) * 2024-01-10 2024-04-09 北京中航智信建设工程有限公司 Airport isolation remote self-help checking system and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130188A1 (en) * 2017-10-26 2019-05-02 Qualcomm Incorporated Object classification in a video analytics system
CN109101930B (en) * 2018-08-18 2020-08-18 华中科技大学 Crowd counting method and system
CN109815867A (en) * 2019-01-14 2019-05-28 东华大学 A kind of crowd density estimation and people flow rate statistical method
CN110781780B (en) * 2019-10-11 2023-04-07 浙江大华技术股份有限公司 Vacancy detection method and related device

Also Published As

Publication number Publication date
CN111709300A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111709300B (en) Crowd counting method based on video image
CN106874894B (en) Human body target detection method based on regional full convolution neural network
CN108427920B (en) Edge-sea defense target detection method based on deep learning
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN108805015B (en) Crowd abnormity detection method for weighted convolution self-coding long-short term memory network
CN110555368B (en) Fall-down behavior identification method based on three-dimensional convolutional neural network
CN108764085B (en) Crowd counting method based on generation of confrontation network
CN111723693B (en) Crowd counting method based on small sample learning
CN111401144B (en) Escalator passenger behavior identification method based on video monitoring
CN110956094A (en) RGB-D multi-mode fusion personnel detection method based on asymmetric double-current network
CN110135296A (en) Airfield runway FOD detection method based on convolutional neural networks
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN108038846A (en) Transmission line equipment image defect detection method and system based on multilayer convolutional neural networks
CN110427839A (en) Video object detection method based on multilayer feature fusion
CN107563345A (en) A kind of human body behavior analysis method based on time and space significance region detection
CN110765833A (en) Crowd density estimation method based on deep learning
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
CN101971190A (en) Real-time body segmentation system
CN110929593A (en) Real-time significance pedestrian detection method based on detail distinguishing and distinguishing
CN109255326B (en) Traffic scene smoke intelligent detection method based on multi-dimensional information feature fusion
CN105957356B (en) A kind of traffic control system and method based on pedestrian's quantity
CN108416780B (en) Object detection and matching method based on twin-region-of-interest pooling model
Patil et al. Motion saliency based generative adversarial network for underwater moving object segmentation
CN106650617A (en) Pedestrian abnormity identification method based on probabilistic latent semantic analysis
Wang et al. Removing background interference for crowd counting via de-background detail convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant