WO2016061724A1 - All-weather video monitoring method based on deep learning - Google Patents

All-weather video monitoring method based on deep learning

Info

Publication number
WO2016061724A1 (application PCT/CN2014/088901)
Authority
WO
WIPO (PCT)
Prior art keywords
model, sampling, map, velocity, statistical
Application number
PCT/CN2014/088901
Other languages
French (fr)
Chinese (zh)
Inventor
黄凯奇
康运锋
曹黎俊
张旭
Original Assignee
中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-10-20
Filing date
2014-10-20
Application filed by 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Priority to PCT/CN2014/088901 (WO2016061724A1)
Publication of WO2016061724A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis


Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An all-weather video monitoring method based on deep learning. The method comprises the following steps: collecting a video stream in real time, and obtaining multiple original sampling map samples and velocity sampling map samples by line sampling of the obtained video stream; performing spatio-temporal correction on the obtained velocity sampling map samples; training a deep learning model offline based on the original sampling maps and velocity sampling maps, the deep learning model comprising a classification model and a statistical model; and performing crowd state analysis on the real-time video stream with the obtained deep learning model. The method adapts well to different environments, illumination intensities, weather conditions and camera angles; it maintains high accuracy in crowded conditions such as a surge of high-volume crowds; and its computational cost is low enough to meet real-time video processing requirements, so it can be widely applied to the monitoring and management of public places with dense crowds, such as buses, subways and squares.

Description

[Title of invention established by the ISA under Rule 37.2] All-weather video monitoring method based on deep learning

Technical Field
The invention belongs to the field of pattern recognition, and in particular relates to an all-weather video monitoring method based on deep learning, which is especially suitable for analyzing the state of high-volume crowds.
Background Art
At present, the level of urbanization in China exceeds 50%. The influx of a large floating population has steadily increased urban population density, large-scale crowd activities have become more frequent, and major accidents caused by crowding and stampedes are not uncommon. How to monitor and manage crowds, actively identifying mass incidents at an early stage and issuing timely warnings, has therefore become a research focus in video surveillance worldwide. To better identify and warn of abnormal group events and thus reduce disasters, grasping changes in crowd size in real time is a key factor. Crowd analysis based on intelligent video surveillance analyzes the behavior of moving objects in a specific monitored scene and describes their behavioral patterns, enabling automatic detection of abnormal events by machine intelligence; the learned behavior models can also serve as references for public space design, intelligent environments, and so on. However, because of differences in monitoring scenes and camera installation angles, and changes in weather and sunlight intensity, intelligent monitoring systems have so far played only a minor role in all-weather monitoring.
A convolutional neural network, as a deep learning method, is a multilayer perceptron specially designed for two-dimensional image processing. It has advantages that traditional techniques lack: good fault tolerance, parallel processing and self-learning capability; it can cope with problems where environmental information is complex, background knowledge is unclear and inference rules are not explicit; it tolerates large defects and distortions; and it runs fast, adapts well and discriminates finely. A convolutional neural network can therefore address the problems of all-weather monitoring and ensure a high and stable accuracy of the intelligent monitoring system under various conditions.
Summary of the Invention
The object of the present invention is to provide an all-weather video monitoring method based on deep learning, which can analyze the state of crowds in video around the clock, in particular the number of people.
To achieve the above object, the deep-learning-based all-weather video monitoring method proposed by the present invention comprises the following steps:

Step 1: collect a video stream in real time, and obtain a plurality of original sampling map samples and velocity sampling map samples by line sampling of the obtained video stream;

Step 2: perform spatio-temporal correction on the obtained velocity sampling map samples;

Step 3: train a deep learning model offline based on the original sampling maps and velocity sampling maps, the deep learning model comprising a classification model and a statistical model;

Step 4: perform crowd state analysis on a real-time video stream using the deep learning model obtained in Step 3.
Compared with the latest methods at home and abroad, the invention has several clear advantages: 1) good adaptability to different environments, illumination intensities, weather conditions and camera angle settings; 2) high accuracy even in crowded conditions such as a surge of high-volume crowds; 3) low computational cost, meeting the requirements of real-time video processing.
Brief Description of the Drawings
Fig. 1 is a flowchart of the deep-learning-based all-weather video monitoring method of the present invention;

Fig. 2 is a schematic diagram of the geometric correction of the present invention.
Detailed Description
To make the object, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The key ideas of the present invention are: 1) the behavior of people passing through a door (or virtual door) can be converted from dynamic behavior into static pictures by sampling at a fixed position, which facilitates crowd analysis; 2) perspective correction and speed correction ensure high accuracy under different camera angle settings; 3) the deep learning model helps to automatically discover the most effective features, and concatenating multiple features keeps the accuracy of crowd state analysis stable across different scenes. The technical details involved in the present invention are explained below.
The flowchart of the deep-learning-based all-weather video monitoring method of the present invention is shown in Fig. 1. As shown in Fig. 1, the method comprises the following steps:

Step 1: collect a video stream in real time, and obtain a plurality of original sampling map samples and velocity sampling map samples by line sampling of the obtained video stream;
In an embodiment of the present invention, for convenience of counting, first, for each image frame of the video stream, a calibration line l_n with a fixed width of n pixels (n=3 in one embodiment) and a length covering the entire door is set at the position where pedestrians pass through the door, serving as the virtual door boundary. The position of the calibration line depends on where people are to be counted in the video scene; it may lie at any angle, preferably perpendicular to the direction of passage through the door. For example, if the door faces the camera, the calibration line may be placed horizontally; if the door is perpendicular to the camera's viewing direction, the calibration line may be placed vertically. Then, the pixels covered by the calibration line are extracted from the image F of every f-th frame of the video stream (f=2 in one embodiment). Since the calibration line is n pixels wide, each sampling yields n rows of pixel data; over a fixed time interval t (t=300 frames in one embodiment), all sampled pixels are accumulated into an original sampling image I, so that multiple original sampling map samples are obtained from the video stream. In one embodiment, the sampled rows of pixel data are stacked from top to bottom in sampling order to form the original sampling image I.
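For illustration, the line-sampling accumulation described above can be sketched in Python with OpenCV. This is a minimal sketch under stated assumptions, not the patent's implementation: the video path, the horizontal calibration line at row y0, and the parameter defaults are hypothetical placeholders.

```python
import cv2
import numpy as np

def line_sample(video_path, y0, n=3, f=2, t=300):
    """Accumulate the n-pixel-wide calibration line of every f-th frame
    into original sampling images, one per window of t frames (a sketch
    of Step 1; y0 marks a horizontal calibration line at the virtual door)."""
    cap = cv2.VideoCapture(video_path)
    rows, samples, frame_idx = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % f == 0:
            rows.append(frame[y0:y0 + n, :, :].copy())  # n rows under the line
        frame_idx += 1
        if frame_idx % t == 0 and rows:
            samples.append(np.vstack(rows))  # stack rows top-to-bottom in time order
            rows = []
    cap.release()
    return samples  # list of original sampling images I
```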
The velocity sampling map is a map of pedestrian motion directions. In the present invention, a pedestrian can move in two possible directions, namely toward either side of the calibration line, perpendicular to it. In the velocity sampling map, the invention therefore uses different RGB channels to represent the different motion directions: the R and G channels represent pixels moving in the two directions, and the B channel represents pixels with no motion. Specifically, while the video stream is sampled to obtain the original sampling image, the optical flow method is used to compute, for each pixel covered by the calibration line, its speed Speed(F_t(l_n)) and motion direction Orient(F_t(l_n)); accumulating the computed direction values over the same fixed time interval t yields the velocity sampling map I_s.
From the above, the crowd information in a period of the video stream can be obtained from the original sampling map and the velocity sampling map, namely:

I(n*t%3/3) = F_t(l_n),
I_s(n*t%3/3) = Orient(F_t(l_n)),

where F_t(l_n) denotes the pixels covered by the calibration line l_n in image frame F at time t, Orient(F_t(l_n)) denotes the motion directions of those pixels, and % denotes the remainder operation. The piecewise definition of Orient is rendered only as an embedded image in the source (PCTCN2014088901-appb-000001); per the channel convention above, it assigns the R, G or B channel according to whether a pixel moves in one direction, in the other direction, or not at all.
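A corresponding sketch of the velocity-sampling step follows; Farnebäck dense optical flow (cv2.calcOpticalFlowFarneback) stands in for the unspecified optical flow method, and the motion threshold, the horizontal calibration line, and the choice of the vertical flow component as the crossing direction are all assumptions.

```python
import cv2
import numpy as np

def velocity_rows(prev_gray, gray, y0, n=3, thresh=0.5):
    """Encode per-pixel motion on the calibration line as RGB rows:
    R = one crossing direction, G = the other, B = no motion (a sketch;
    rows are in RGB order, and Farneback flow replaces the unspecified
    optical flow method)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fy = flow[y0:y0 + n, :, 1]           # vertical component crosses the line
    speed = np.abs(fy)
    rows = np.zeros((n, fy.shape[1], 3), dtype=np.uint8)
    rows[fy > thresh] = (255, 0, 0)      # moving one way -> R channel
    rows[fy < -thresh] = (0, 255, 0)     # moving the other way -> G channel
    rows[speed <= thresh] = (0, 0, 255)  # no motion -> B channel
    return rows, speed                   # stack rows over t frames to build I_s
```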
Step 2: perform spatio-temporal correction on the obtained velocity sampling map samples, to ensure high accuracy of the final crowd state analysis;
(1) Spatial correction of the velocity sampling map samples.

Because camera installation angles differ, the projection of the scene onto the image plane exhibits significant perspective distortion: the same object looks large near the camera and small far from it, so the contributions of different pixels on the image plane must be weighted. In the present invention the ground is assumed to be a plane and people are assumed to stand perpendicular to the ground.
Fig. 2 is a schematic diagram of the geometric correction of the present invention. In Fig. 2, XOY is the image coordinate system and p_1 p_2 p_3 p_4 are four point coordinates in the world coordinate system. Suppose a 3D object of the same size stands at p_1p_2 and at p_3p_4; y and y_r are the reference lines of the two objects, y_v is the vanishing-point reference line, ΔW and ΔH are the width and height of the object at p_3p_4, and ΔW_r and ΔH_r are the width and height of the object at p_1p_2. As shown in Fig. 2, let the vanishing point P_v have coordinates (x_v, y_v) and let the reference line be y = y_r = H/2, where H is the height of the 3D object. The geometric contribution factor S_C(x, y) of any pixel I(x, y) on the image plane is then given by a formula rendered only as an embedded image in the source (PCTCN2014088901-appb-000002); intuitively, under the assumed pinhole geometry the image size of an object at row y scales with its distance y - y_v from the vanishing-point line, and S_C(x, y) compensates for this scaling.
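Because the exact contribution-factor formula survives only as an image in the source, the following sketch assumes the simple similar-triangles weight S_C(x, y) = (y_r - y_v)/(y - y_v) suggested by the geometry above; treat it as an illustrative assumption rather than the patented formula.

```python
import numpy as np

def geometric_weights(height, width, y_v, y_r):
    """Per-pixel geometric contribution factors S_C(x, y) (a sketch under
    an assumed formula: weight 1 on the reference row y_r, growing for
    rows closer to the vanishing-point row y_v)."""
    y = np.arange(height, dtype=np.float64).reshape(-1, 1)
    denom = np.where(np.abs(y - y_v) < 1e-6, 1e-6, y - y_v)  # avoid div by 0
    s_c = (y_r - y_v) / denom
    return np.broadcast_to(s_c, (height, width)).copy()
```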
(2) Temporal correction of the velocity sampling map samples.

Because people move at different speeds, pedestrians appear taller or shorter, fatter or thinner in the velocity sampling map, which affects the accuracy of the crowd analysis; the velocity sampling map therefore needs temporal correction.
In an embodiment of the present invention, the velocity sampling map is corrected in time using the pixel speeds along the calibration line computed by the optical flow method. The correction coefficient is expressed as:

S(F_t(l_n)) = Speed(F_t(l_n)) / N_s,

where N_s is the standard speed value, taken as 1 pixel/frame in an embodiment of the invention, and Speed(F_t(l_n)) denotes the speed of the pixels covered by the calibration line l_n in image frame F at time t.
The velocity sampling map I'_s after the above spatial and temporal correction is expressed as:

I'_s = I_s * S_C(x, y) * S(F_t(l_n)).
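Putting the two corrections together, a minimal sketch (assuming I_s is an H*W*3 velocity sampling map, S_C an H*W weight grid, and speed an H*W array of per-pixel speeds from the optical flow step):

```python
import numpy as np

def correct_velocity_map(I_s, S_C, speed, N_s=1.0):
    """Apply spatial and temporal correction, I'_s = I_s * S_C * S with
    S = speed / N_s (a sketch; N_s = 1 pixel/frame follows the embodiment
    described in the text)."""
    S = speed / N_s  # temporal correction coefficient
    return I_s.astype(np.float64) * S_C[..., None] * S[..., None]
```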
Step 3: train a deep learning model offline based on the original sampling maps and velocity sampling maps, the deep learning model comprising a classification model and a statistical model;
In the crowd state analysis model there are two kinds of deep learning models. One is the classification model, which is trained from velocity sampling map samples; for example, according to the walking directions of the people in a velocity sampling map sample, velocity sampling maps can be divided into four categories: only entering people, only leaving people, both entering and leaving people, and nobody entering or leaving, which makes it convenient to count the crowds passing through the virtual door. The other is the statistical model, which is trained from original sampling map samples and from velocity sampling map samples of the both-entering-and-leaving kind, and yields the total number of people in an original sampling map and the proportion of entering people in a velocity sampling map. The statistical model is further divided into two kinds: one counts the total number of people in an original sampling map and is called the crowd counting model; the other estimates the proportion of entering people in a both-entering-and-leaving velocity sampling map and is called the in/out ratio model. In an embodiment of the present invention the two statistical models use the same convolutional neural network and the same training procedure. Once the classification model and the statistical model have been obtained, combining the outputs of the two kinds of models yields the cumulative numbers of people entering and leaving within a given period.
(1) Training of the statistical model
The convolutional neural network of the statistical model constructed in an embodiment of the present invention adopts a 9-layer network structure, comprising an input layer, five convolutional layers C1 to C5, two fully connected layers F6 and F7, and an output layer O8. At the beginning of training, the network structure is built and the network weights are initialized with distinct small random numbers, generally in the range [-1, 1]; the biases are initialized to 0.
A) Forward propagation stage

The target images I at the input layer vary in size. Two images are fed to the first convolutional layer: the size-normalized version of the target image and its left-right mirrored copy; in an embodiment of the invention the normalized size is 224*224. A convolutional layer comprises a convolution operation and a downsampling operation, wherein:
The convolution operation performs two-dimensional convolution of the input images with multiple convolution kernels, adds a bias, and applies a nonlinear activation function to obtain the convolution result. The formula is rendered only as embedded images in the source (PCTCN2014088901-appb-000003/000004); from the surrounding definitions it has the usual form

x_j^n = f( sum_i x_i^(n-1) * w_ij + φ_j ),

where n denotes the layer index, S denotes the number of neurons of the n-th layer (the index i runs over the input maps), w_ij denotes the convolution kernel connecting the i-th input image and the j-th output image, and φ_j is the threshold (bias) of the j-th output image. The kernel size is 11*11 in layer C1, 5*5 in layer C2, and 3*3 in layers C3, C4 and C5. f(*) is the ReLU function: f(x) = max(x, 0);
The downsampling operation adopts the stochastic pooling method. Its formula is rendered only as embedded images in the source (PCTCN2014088901-appb-000005/000006); from the surrounding definitions it takes the standard stochastic-pooling form, in which each element of a sampling window is selected with probability proportional to its value:

p_j = I_j / sum_{i in R_t} I_i, with the output drawn from the window according to these probabilities,

where t denotes the t-th output image, R_t is the sampling window of the downsampling layer (in an embodiment of the invention the window size is set to 2*2 throughout), and I_j is an element value in the sampling window.
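A NumPy sketch of stochastic pooling over non-overlapping 2*2 windows, following the reconstructed form above (the input is assumed non-negative, e.g. post-ReLU):

```python
import numpy as np

def stochastic_pool(x, k=2, rng=None):
    """Stochastic pooling: within each k*k window, pick one activation with
    probability proportional to its value (uniform if the window is all zero)."""
    rng = rng or np.random.default_rng()
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    win = x[:h, :w].reshape(h // k, k, w // k, k).transpose(0, 2, 1, 3)
    win = win.reshape(h // k, w // k, k * k)
    sums = win.sum(axis=-1, keepdims=True)
    probs = np.where(sums > 0, win / np.maximum(sums, 1e-12), 1.0 / (k * k))
    out = np.empty((h // k, w // k), dtype=x.dtype)
    for i in range(h // k):
        for j in range(w // k):
            out[i, j] = rng.choice(win[i, j], p=probs[i, j])
    return out
```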
After the full-connection operations of the fully connected layers F6 and F7, the actual output O_k of the output layer O8 is computed. The formula is rendered only as an embedded image in the source (PCTCN2014088901-appb-000007); from the surrounding definitions it has the form

O_k = f( sum_{t=1}^{l} V_tk * x_t + θ_k ),

where k indexes the units of the output layer, θ_k is the threshold (bias) of an output unit, l is the number of units of F7, V_tk is the weight connecting the t-th output of the fully connected layer to the k-th output unit, and f(*) is the softmax function.
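As a numeric illustration of this output layer under the reconstructed form (the shapes are illustrative only):

```python
import numpy as np

def output_layer(x, V, theta):
    """O_k = softmax(sum_t V_tk * x_t + theta_k): x is the F7 activation
    vector of length l, V an (l, k) weight matrix, theta a length-k bias."""
    z = x @ V + theta
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```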
B) Back-propagation stage
The back-propagation stage uses gradient descent to adjust the weights and thresholds of every layer of the neural network in the reverse direction. The statistical error function used is rendered only as an embedded image in the source (PCTCN2014088901-appb-000008); from the surrounding definitions it has the squared-error form

E = (1/(2m)) * sum_{j=1}^{m} sum_k (d_k^(j) - O_k^(j))^2,

where d denotes the corresponding target vector, i.e. the label of a velocity sampling map or original sampling map sample, O_k is the output of the deep learning network, and m is the total number of samples.
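The reconstructed error can be checked numerically in a few lines (a sketch; D and O are assumed to be m*k arrays of target vectors and network outputs):

```python
import numpy as np

def statistical_error(O, D):
    """E = (1/(2m)) * sum_j sum_k (d_k - O_k)^2 over the m samples."""
    m = O.shape[0]
    return np.sum((D - O) ** 2) / (2.0 * m)
```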
When E < ε, where ε is a preset minimum error parameter, training ends and the obtained weights and thresholds of all layers are saved. At this point the parameters of the statistical model's convolutional neural network structure are stable.
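For concreteness, the 9-layer statistical network can be sketched in PyTorch as below. Only the kernel sizes (11, 5, 3, 3, 3), the 224*224 input and the layer inventory come from the text; the channel widths, strides, pooling placement, output width, and the substitution of max pooling for stochastic pooling are assumptions. The same architecture would be trained twice, once as the crowd counting model and once as the in/out ratio model.

```python
import torch
import torch.nn as nn

class StatNet(nn.Module):
    """Sketch of the 9-layer statistical network: input, C1-C5, F6-F7, O8."""
    def __init__(self, out_units=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 11, stride=4), nn.ReLU(), nn.MaxPool2d(2),    # C1
            nn.Conv2d(64, 128, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),  # C2
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),                  # C3
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),                  # C4
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # C5
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),    # F6
            nn.Linear(1024, 1024), nn.ReLU(),  # F7
            nn.Linear(1024, out_units),        # O8: count or in/out ratio
        )

    def forward(self, x):  # x: (batch, 3, 224, 224)
        return self.head(self.features(x))
```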
(2) Training of the classification model
The classification model likewise uses a convolutional neural network, again trained with velocity sampling maps as samples; see the sketch after this paragraph. In an embodiment of the present invention the classification model has 4 classes, so the network need not be very deep; in this embodiment it has 6 layers: an input layer, 3 convolutional layers, 1 fully connected layer and an output layer. The input layer performs no processing other than normalizing the RGB velocity sampling map samples to 96*96 before feeding them to the first convolutional layer. As with the statistical model, the classification model is initialized with random values, and the forward-propagation and back-propagation training procedures are the same as those of the statistical model and are not repeated here; the difference is that the kernels of all three convolutional layers of the classification model are 5*5. The trained classification model can then be used to classify velocity sampling maps.
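A matching sketch of the 6-layer classification network; the 96*96 RGB input, the three 5*5 convolutional layers, the single fully connected layer and the 4-way output come from the text, while the channel widths and pooling are assumptions.

```python
import torch
import torch.nn as nn

class DirectionNet(nn.Module):
    """Sketch of the 6-layer classifier: in / out / both / none."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(64 * 12 * 12, 256), nn.ReLU(),  # fully connected layer
            nn.Linear(256, 4),  # logits; softmax gives the 4 class probabilities
        )

    def forward(self, x):  # x: (batch, 3, 96, 96)
        return self.net(x)
```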
Step 4: perform crowd state analysis on a real-time video stream using the deep learning model obtained in Step 3.

Step 4 further comprises the following steps:
Step 41: similarly to Step 1, obtain a plurality of original sampling maps and velocity sampling maps from the real-time video stream;

As in Step 1, in this step the pixels at the virtual door in the image frames sampled from the real-time video stream are accumulated into an original sampling map; the optical flow method computes the speeds of the pixels at the corresponding virtual-door positions, and the computed speeds are accumulated into a velocity sampling map.
Step 42: similarly to Step 2, perform spatio-temporal correction on the velocity sampling maps obtained in Step 41, to ensure high accuracy of the crowd state analysis.
Step 43: classify each velocity sampling map with the classification model of the deep learning model, and determine the category to which it belongs;

Using the classification model of the deep learning model, each velocity sampling map is assigned to one of the categories: only entering people, only leaving people, both entering and leaving people, or nobody entering or leaving.
Step 44: according to the category of the velocity sampling map, analyze the crowd information in the original sampling map with the statistical models of the deep learning model;
Specifically, this step selects the appropriate statistical model according to the classification result. For a velocity sampling map of the nobody-entering-or-leaving category, the crowd count is zero. For the only-leaving and only-entering categories, the crowd counting model of the statistical model counts the number of people. For the both-entering-and-leaving category, the in/out ratio model of the statistical model estimates the proportion of entering people, which is combined with the count obtained from the crowd counting model to finally obtain the numbers of people entering and leaving.
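The routing logic of this step can be summarized as follows (a sketch; the model call signatures and the category labels IN, OUT, BOTH and NONE are hypothetical):

```python
def analyze_sample(cls_model, count_model, ratio_model, I, I_s_corrected):
    """Route one sampling-map pair through the models per Step 44 and
    return (entering, leaving) counts."""
    category = cls_model(I_s_corrected)    # one of "IN", "OUT", "BOTH", "NONE"
    if category == "NONE":
        return 0, 0
    total = count_model(I)                 # total people in the original map
    if category == "IN":
        return total, 0
    if category == "OUT":
        return 0, total
    ratio_in = ratio_model(I_s_corrected)  # proportion of entering people
    entering = round(total * ratio_in)
    return entering, total - entering
```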
Step 45: integrate the crowd information corresponding to the plurality of original sampling maps to obtain accurate crowd information for the corresponding period of the real-time video stream.
From the outputs of the statistical and classification models, the numbers of people entering and leaving during the corresponding period of the real-time video stream can be accumulated separately, giving the cumulative sizes of the entering and leaving crowds in that period. By detecting anomalies in crowd size, the purpose of video early warning is achieved.
The specific embodiments described above further illustrate the object, technical solution and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. An all-weather video monitoring method based on deep learning, characterized in that the method comprises the following steps:
    Step 1: collecting a video stream in real time, and obtaining a plurality of original sampling map samples and velocity sampling map samples by line sampling of the obtained video stream;
    Step 2: performing spatio-temporal correction on the obtained velocity sampling map samples;
    Step 3: training a deep learning model offline based on the original sampling maps and velocity sampling maps, the deep learning model comprising a classification model and a statistical model;
    Step 4: performing crowd state analysis on a real-time video stream using the deep learning model obtained in Step 3.
2. The method according to claim 1, characterized in that Step 1 further comprises the following steps:
    first, for each image frame of the video stream, setting, at the position where pedestrians pass through the door, a calibration line l_n with a fixed width of n pixels and a length covering the entire door, as the virtual door boundary for people entering and leaving;
    then, extracting the pixels covered by the calibration line in the image F of every f-th frame of the video stream, all the pixels sampled during each fixed time interval t forming an original sampling image I;
    while sampling the pixels covered by the calibration line, computing the speed and motion direction of each pixel by the optical flow method, the motion directions of all the pixels sampled during each fixed time interval t forming a velocity sampling map.
3. The method according to claim 1, characterized in that in the velocity sampling map, different RGB channels represent different pedestrian motion directions, wherein the R channel and the G channel represent pixels of two different motion directions and the B channel represents pixels with no motion.
4. The method according to claim 1, characterized in that in Step 2, the velocity sampling map samples are corrected spatially using the contributions of different pixels on the image plane, and corrected temporally using the speed values of different pixels.
5. The method according to claim 4, characterized in that the velocity sampling map I'_s after spatial and temporal correction is expressed as:
    I'_s = I_s * S_C(x, y) * S(F_t(l_n)),
    where I_s denotes the velocity sampling map before spatial and temporal correction, S_C(x, y) denotes the geometric contribution factor of any pixel I(x, y) on the image plane, and S(F_t(l_n)) denotes the temporal correction coefficient S(F_t(l_n)) = Speed(F_t(l_n)) / N_s, in which N_s is the standard speed value and Speed(F_t(l_n)) denotes the speed of the pixels covered by the calibration line l_n in image frame F at time t.
6. The method according to claim 1, characterized in that the classification model divides velocity sampling maps into 4 categories: only entering people, only leaving people, both entering and leaving people, and nobody entering or leaving.
7. The method according to claim 1, characterized in that the statistical model further comprises a crowd counting model and an in/out ratio model, wherein the crowd counting model counts the total number of people in the original sampling map, and the in/out ratio model estimates the proportion of entering people in velocity sampling maps of the both-entering-and-leaving category.
8. The method according to claim 1, characterized in that the statistical model is trained with a convolutional neural network, wherein the convolutional neural network for training the crowd counting model comprises an input layer, 5 convolutional layers, 2 fully connected layers and an output layer, and the convolutional neural network for training the in/out ratio model comprises an input layer, 3 convolutional layers, 1 fully connected layer and an output layer.
9. The method according to claim 1, characterized in that Step 4 further comprises the following steps:
    Step 41: similarly to Step 1, obtaining a plurality of original sampling maps and velocity sampling maps from the real-time video stream;
    Step 42: similarly to Step 2, performing spatio-temporal correction on the velocity sampling maps obtained in Step 41;
    Step 43: classifying each velocity sampling map with the classification model of the deep learning model, and determining the category to which it belongs;
    Step 44: according to the category of the velocity sampling map, analyzing the crowd information in the original sampling map with the statistical model of the deep learning model;
    Step 45: integrating the crowd information corresponding to the plurality of original sampling maps to obtain accurate crowd information for the corresponding period of the real-time video stream.
10. The method according to claim 9, characterized in that in Step 44, for the nobody-entering-or-leaving category, the crowd count is zero; for the only-leaving and only-entering categories, the crowd counting model of the statistical model counts the number of people; and for the both-entering-and-leaving category, the in/out ratio model of the statistical model estimates the proportion of entering people, which is combined with the count obtained from the crowd counting model to finally obtain the numbers of people entering and leaving, respectively.
PCT/CN2014/088901 2014-10-20 2014-10-20 All-weather video monitoring method based on deep learning WO2016061724A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/088901 WO2016061724A1 (en) 2014-10-20 2014-10-20 All-weather video monitoring method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/088901 WO2016061724A1 (en) 2014-10-20 2014-10-20 All-weather video monitoring method based on deep learning

Publications (1)

Publication Number Publication Date
WO2016061724A1

Family

ID=55760022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088901 WO2016061724A1 (en) 2014-10-20 2014-10-20 All-weather video monitoring method based on deep learning

Country Status (1)

Country Link
WO (1) WO2016061724A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803328A (en) * 2018-06-14 2018-11-13 广东惠禾科技发展有限公司 Camera self-adapting regulation method, device and camera
CN111079488A (en) * 2019-05-27 2020-04-28 陕西科技大学 Bus passenger flow detection system and method based on deep learning
CN111275592A (en) * 2020-01-16 2020-06-12 浙江工业大学 Classroom behavior analysis method based on video images
CN112668532A (en) * 2021-01-05 2021-04-16 重庆大学 Crowd counting method based on multi-stage mixed attention network
CN117574133A (en) * 2024-01-11 2024-02-20 湖南工商大学 Unsafe production behavior identification method and related equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635835A (en) * 2008-07-25 2010-01-27 深圳市信义科技有限公司 Intelligent video monitoring method and system thereof
CN101751553A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Method for analyzing and predicting large-scale crowd density
WO2012008176A1 (en) * 2010-07-12 2012-01-19 株式会社日立国際電気 Monitoring system and method of monitoring
CN102819764A (en) * 2012-07-18 2012-12-12 郑州金惠计算机系统工程有限公司 Method for counting pedestrian flow from multiple views under complex scene of traffic junction
CN102930248A (en) * 2012-10-22 2013-02-13 中国计量学院 Crowd abnormal behavior detection method based on machine learning
CN103218816A (en) * 2013-04-18 2013-07-24 中山大学 Crowd density estimation method and pedestrian volume statistical method based on video analysis
CN103984937A (en) * 2014-05-30 2014-08-13 无锡慧眼电子科技有限公司 Pedestrian counting method based on optical flow method
CN104320617A (en) * 2014-10-20 2015-01-28 中国科学院自动化研究所 All-weather video monitoring method based on deep learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101635835A (en) * 2008-07-25 2010-01-27 深圳市信义科技有限公司 Intelligent video monitoring method and system thereof
CN101751553A (en) * 2008-12-03 2010-06-23 中国科学院自动化研究所 Method for analyzing and predicting large-scale crowd density
WO2012008176A1 (en) * 2010-07-12 2012-01-19 株式会社日立国際電気 Monitoring system and method of monitoring
CN102819764A (en) * 2012-07-18 2012-12-12 郑州金惠计算机系统工程有限公司 Method for counting pedestrian flow from multiple views under complex scene of traffic junction
CN102930248A (en) * 2012-10-22 2013-02-13 中国计量学院 Crowd abnormal behavior detection method based on machine learning
CN103218816A (en) * 2013-04-18 2013-07-24 中山大学 Crowd density estimation method and pedestrian volume statistical method based on video analysis
CN103984937A (en) * 2014-05-30 2014-08-13 无锡慧眼电子科技有限公司 Pedestrian counting method based on optical flow method
CN104320617A (en) * 2014-10-20 2015-01-28 中国科学院自动化研究所 All-weather video monitoring method based on deep learning

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803328A (en) * 2018-06-14 2018-11-13 广东惠禾科技发展有限公司 Camera self-adapting regulation method, device and camera
CN108803328B (en) * 2018-06-14 2021-11-09 广东惠禾科技发展有限公司 Camera self-adaptive adjusting method and device and camera
CN111079488A (en) * 2019-05-27 2020-04-28 陕西科技大学 Bus passenger flow detection system and method based on deep learning
CN111079488B (en) * 2019-05-27 2023-09-26 广东快通信息科技有限公司 Deep learning-based bus passenger flow detection system and method
CN111275592A (en) * 2020-01-16 2020-06-12 浙江工业大学 Classroom behavior analysis method based on video images
CN111275592B (en) * 2020-01-16 2023-04-18 浙江工业大学 Classroom behavior analysis method based on video images
CN112668532A (en) * 2021-01-05 2021-04-16 重庆大学 Crowd counting method based on multi-stage mixed attention network
CN117574133A (en) * 2024-01-11 2024-02-20 湖南工商大学 Unsafe production behavior identification method and related equipment
CN117574133B (en) * 2024-01-11 2024-04-02 湖南工商大学 Unsafe production behavior identification method and related equipment

Similar Documents

Publication Publication Date Title
CN104320617B (en) A kind of round-the-clock video frequency monitoring method based on deep learning
CN112561146B (en) Large-scale real-time traffic flow prediction method based on fuzzy logic and depth LSTM
CN109376637B (en) People counting system based on video monitoring image processing
CN108615027B (en) Method for counting video crowd based on long-term and short-term memory-weighted neural network
WO2016061724A1 (en) All-weather video monitoring method based on deep learning
CN111209892A (en) Crowd density and quantity estimation method based on convolutional neural network
CN109657581B (en) Urban rail transit gate traffic control method based on binocular camera behavior detection
CN104751486B (en) A kind of moving target relay tracking algorithm of many ptz cameras
CN112257609B (en) Vehicle detection method and device based on self-adaptive key point heat map
CN108710875A (en) A kind of take photo by plane road vehicle method of counting and device based on deep learning
US10432896B2 (en) System and method for activity monitoring using video data
CN107729799A Crowd's abnormal behaviour vision-based detection and analyzing and alarming system based on depth convolutional neural networks
CN107818326A (en) A kind of ship detection method and system based on scene multidimensional characteristic
CN107229929A (en) A kind of license plate locating method based on R CNN
Kawano et al. Road marking blur detection with drive recorder
CN103425967A (en) Pedestrian flow monitoring method based on pedestrian detection and tracking
CN113536972B (en) Self-supervision cross-domain crowd counting method based on target domain pseudo label
CN115797873B (en) Crowd density detection method, system, equipment, storage medium and robot
CN103473554A (en) People flow statistical system and people flow statistical method
Li et al. A traffic state detection tool for freeway video surveillance system
CN114267082B (en) Bridge side falling behavior identification method based on depth understanding
CN116167625B (en) Trampling risk assessment method based on deep learning
CN111540203B (en) Method for adjusting green light passing time based on fast-RCNN
CN114092866A (en) Method for predicting space-time distribution of airport passenger flow
CN110688924A (en) RFCN-based vertical monocular passenger flow volume statistical method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14904565

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14904565

Country of ref document: EP

Kind code of ref document: A1