CN105791774A - Surveillance video transmission method based on video content analysis - Google Patents

Surveillance video transmission method based on video content analysis

Info

Publication number
CN105791774A
CN105791774A (application CN201610201613.0A)
Authority
CN
China
Prior art keywords
video
camera
module
crowd density
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610201613.0A
Other languages
Chinese (zh)
Inventor
王素玉
白艳涛
侯义斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201610201613.0A priority Critical patent/CN105791774A/en
Publication of CN105791774A publication Critical patent/CN105791774A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N 7/183 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/50 Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 23/00 Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N 23/60 Control of cameras or camera modules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30232 Surveillance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a surveillance video transmission method based on video content analysis, which is characterized by comprising an image acquisition module, an image preprocessing module, a pedestrian flow detection module, a crowd density estimation module, a camera parameter setting and warning module, and a video transmission module. The pedestrian flow detection module is used for detecting and counting pedestrians in a surveillance video image. The crowd density estimation module is used for estimating the density of pedestrian flow in the current video image by use of a crowd density estimation algorithm. The camera parameter setting and warning module is used for sending out a camera parameter modification command and crowd event warning information when pedestrian detection and crowd density detection reach a preset level. According to the invention, the crowd density in a video is estimated by use of a pedestrian detection algorithm and a crowd density estimation algorithm, and the configuration of the camera is modified and corresponding surveillance video transmission is carried out according to the result of estimation. Moreover, 'problematic' videos can be transmitted actively or passively, and warning information can be sent out.

Description

Monitoring video transmission method based on video content analysis
Technical Field
The invention relates to a monitoring video transmission method, in particular to a monitoring video transmission method based on video content analysis.
Background
With the rapid development of science and technology and the continuous progress of society, video monitoring has become an efficient monitoring technology and is receiving more and more attention across industries. Traditional video monitoring mostly just stores footage, and most of the stored information is useless redundancy. By contrast, intelligent video monitoring can run continuously for 24 hours without human intervention, automatically analyze and process the monitored video, and identify, detect, track and analyze the behavior of moving targets in the video.
The monitoring video transmission method based on video content analysis can dynamically adjust the camera configuration according to the results of video content analysis, adopting different shooting qualities for different crowd density levels. This reduces the bandwidth required for video transmission and the amount of redundant information stored by intelligent video monitoring. At the same time, early warning information can be sent out promptly, gaining more time to handle emergencies, improving the efficiency of video monitoring and saving manpower. The combined application of video transmission technology and intelligent content analysis gives intelligent video monitoring advantages such as strong mobility, low power consumption, wide monitoring range, diversified monitoring forms, convenient remote monitoring and high practical value.
So far, intelligent video monitoring remains a mid-to-high-end product, mainly deployed in large venues such as public places, large enterprises, world expos and the Olympic Games. However, with growing market demand and advancing technology, its application prospects are widening and it is gradually reaching ordinary users. Because of this huge application potential and research significance, intelligent video monitoring has become an important part of strategic development for countries around the world, which invest substantial funds and effort in its research.
Disclosure of Invention
The invention aims to overcome the defects of traditional video monitoring, such as its single function and redundant stored information. It provides a monitoring video transmission method based on video content analysis that dynamically adjusts the camera's parameter configuration according to analysis of the monitored video content, adopts different shooting qualities for different crowd density levels, reduces the bandwidth required for video transmission, and at the same time reduces the amount of redundant information stored by intelligent video monitoring.
The invention is realized by adopting the following technical means:
a monitoring video transmission method based on video content analysis is characterized by comprising an image acquisition module, an image preprocessing module, a people flow detection module, a crowd density estimation module, a camera parameter setting and early warning module and a video transmission module; wherein,
the image acquisition module is used for acquiring a monitoring video image of a monitoring scene and providing a data source for the subsequent image preprocessing;
the image preprocessing module is used for preprocessing and calculating the acquired monitoring video image and providing a data source for subsequent people flow detection and crowd density estimation;
the people flow detection module is used for detecting and counting pedestrians in the surveillance video picture, and the statistical result provides the basis for deciding whether to start crowd density estimation later;
the crowd density estimation module is used for estimating the crowd density in the current video picture with a crowd density estimation algorithm once the people flow in the image exceeds the preset activation level for density detection, providing a basis for camera parameter setting and early warning;
the camera parameter setting and early warning module is used for sending out group event early warning information after pedestrian detection and crowd density detection reach preset early warning levels, modifying camera parameters according to the crowd density levels, and improving the effective rate of video transmission through modification of the camera parameters;
the video transmission module automatically starts the video transmission function once the monitored video reaches the preset video transmission level, streaming the original video file or the camera's real-time video to a remote server, which can accept access from multiple clients simultaneously.
Furthermore, the image acquisition module acquires a video stream by accessing the RTSP server, and acquires a video image through the video stream.
Furthermore, the image preprocessing module preprocesses the acquired surveillance video image and provides a data source for subsequent people flow detection and crowd density estimation. Because the image captured by the camera is affected by factors such as illumination, noise and motion, which easily reduce detection accuracy, preprocessing of the surveillance video image is indispensable; it mainly applies median filtering to the image for denoising.
Further, the people flow detection module first performs moving-object detection on the surveillance video images using a three-frame difference method, and performs HOG-based pedestrian detection only when the detection result shows that the current video contains a moving object; otherwise it continues moving-object detection. If the monitored picture contains pedestrians, their number is counted and the pedestrian regions are marked.
The three-frame difference method computes the differences of two pairs of adjacent frames and then ANDs the results to locate the pedestrian. Inter-frame differencing is an algorithm that obtains the outer contour of a moving target by differencing two adjacent images in the captured surveillance video.
The HOG features are Histograms of Oriented Gradients: local features of the pedestrian are computed and accumulated, and finally combined into the gradient histogram of the whole pedestrian. The pedestrian local feature extraction comprises 5 steps:
● the first step is to complete the preparation of the whole extraction process, namely to normalize the color space and Gamma space of the positive and negative training samples.
● the second step is to calculate the gradient of the positive and negative training samples.
● the third step is to count the gradient values in each direction in the cell unit for the gradient values calculated in the second step.
● the fourth step is to normalize the histogram of the gradient for each block.
● the last step is to combine the normalized gradient histograms into the HOG feature vectors of the positive and negative training samples according to a certain rule.
Further, the crowd density is divided into 4 levels: level 1 (fewer than 2 persons per square meter), level 2 (at least 2 and fewer than 3 persons per square meter), level 3 (at least 3 and fewer than 4 persons per square meter) and level 4 (4 or more persons per square meter). The activation level for density detection is level 2. The estimation process of the crowd density estimation module is as follows: a background is first built for the current surveillance video; once the background is successfully built, the number of foreground pixels is obtained by inter-frame differencing, and the crowd density is then estimated via the relation function between crowd density and foreground pixel count.
The background is the current monitoring scene free of pedestrians and other occluding objects; it is built with a per-pixel statistical method over video frames. The foreground pixel count is the number of pixels occupied by the crowd, and the relation function between the crowd density and the foreground pixel count of the surveillance video is obtained in advance by least-squares training.
Further, the camera parameters are set via the ONVIF protocol; the early warning information is the group early warning automatically sent out when the crowd density reaches the preset level, the warning level being level 4. The correspondence between camera parameter settings and video crowd density levels is: level 1 (camera resolution 200 × 150, 5 frames saved and transmitted per second); level 2 (200 × 150, 10 frames per second); level 3 (400 × 300, 20 frames per second); level 4 (800 × 600, 30 frames per second).
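To make the mapping concrete, the following sketch encodes the level thresholds and the camera parameter table above (the dictionary layout and function names are illustrative, not part of the patent):

```python
# Illustrative encoding of the crowd density levels and camera parameter table.
# Resolutions are (width, height) in pixels; fps is frames saved/transmitted per second.
LEVEL_PARAMS = {
    1: {"resolution": (200, 150), "fps": 5},
    2: {"resolution": (200, 150), "fps": 10},
    3: {"resolution": (400, 300), "fps": 20},
    4: {"resolution": (800, 600), "fps": 30},
}

DETECTION_START_LEVEL = 2   # crowd density estimation switches on at level 2
WARNING_LEVEL = 4           # group-event early warning is raised at level 4

def density_level(persons_per_sqm: float) -> int:
    """Map persons per square meter to the 4 density levels defined above."""
    if persons_per_sqm < 2:
        return 1
    if persons_per_sqm < 3:
        return 2
    if persons_per_sqm < 4:
        return 3
    return 4
```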
The ONVIF (Open Network Video Interface Forum) protocol lets a developer modify the parameters of cameras of different brands and models through a uniform interface, shielding the differences in each camera's system and hardware; the parameters here are the camera's capture resolution and video frame rate.
Further, the video transmission module first acquires the camera's real-time video, then packages it and transmits it to the remote DSS server, and finally a user can view the current camera's real-time video by accessing the DSS server. Transmission quality over the whole process is guaranteed by the RTCP protocol, the video stream itself is carried by the RTP protocol, and the RTSP protocol controls functions such as starting, pausing and ending the surveillance video transmission; the preset video transmission level is level 2.
The camera's real-time video is acquired by a client accessing the RTSP server built into the camera. The real-time video is packaged and sent to the remote DSS server; sending over TCP guarantees that no video frames are lost, while sending over UDP guarantees real-time performance. By default the video is sent over the UDP protocol.
Drawings
Fig. 1 is a general configuration diagram of a transmission method;
FIG. 2, a flow chart of a transmission method;
FIG. 3, controller organization;
FIG. 4, method Module layout;
fig. 5, an interaction flow of the RTSP client and the server;
FIG. 6, a people flow detection algorithm flow;
FIG. 7, basic flowchart of HOG people stream detection;
FIG. 8, HOG feature extraction;
FIG. 9, cell, block and training sample relationship diagram;
FIG. 10, HOG gradient direction partitioning;
FIG. 11, population density estimation algorithm overall structure;
FIG. 12, a flow chart for modifying camera configuration;
FIG. 13 is a flow chart of video real-time transmission;
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings:
the overall constitution of the present invention is shown in FIG. 1. The video monitoring equipment monitors the monitored scene, automatically judges the abnormal condition and performs video transmission and early warning processing. The safety of the monitored environment is guaranteed. Firstly, acquiring a monitoring video image of a monitoring scene; preprocessing the acquired image; people flow detection and crowd density estimation are carried out on the image, and a result is obtained; analyzing the obtained result, adjusting the parameters of the camera and determining whether to send out early warning information or not; while analyzing whether there is a need to transmit video.
The general flow chart of the invention is shown in figure 2 and mainly comprises the following steps:
in fig. 2, the content of the command sent in step 1 is the specific number of the camera and the level of the video parameter, and after receiving the command, the surveillance video transmission system transmits the video with the corresponding code rate to the video server. This operation is optional and is not performed by default.
In the normal state, the surveillance video transmission workflow proceeds through steps 2, 3 and 4. The controller analyzes the video and sends a parameter modification command to the camera according to the analysis result. If video transmission is necessary, the corresponding surveillance video is transmitted to the DSS server by the video transmission module.
Wherein the controller comprises 3 functions (as in fig. 3): people flow detection, camera code rate setting and crowd density estimation. By analyzing the results of people stream detection and people density estimation, the system sets camera parameters and selects whether to transmit video.
Earlier surveillance video transmission schemes transmitted video under a uniform camera configuration regardless of the pedestrian situation in the monitored scene, easily wasting transmission bandwidth when there are few or no pedestrians. To address this, the invention adjusts the camera parameters according to the video quality level required by the monitoring, improving the utilization of transmission bandwidth. A controller composed of people flow detection and crowd density estimation first judges the crowd density level of the surveillance video, and the transmission system adjusts the camera's parameter configuration accordingly: when there are few people, the surveillance video is shot at relatively low definition and occupies little space; when there are many people, the definition is raised so that the captured video information is more meaningful. The transmission system selects videos of different qualities to transmit according to the results of people flow detection and crowd density estimation, minimizing the network traffic consumed by transmission while guaranteeing monitoring quality of service.
Referring to fig. 4, the method of the present invention mainly comprises the following 6 functional modules.
(1) And an image acquisition module.
The image acquisition module mainly has the functions of acquiring a monitoring video image of a monitoring scene and providing a data source for subsequent image processing and related algorithms. The invention adopts a mode of accessing the RTSP server to obtain the video stream and obtains the video image through the video stream.
An RTSP server is built into the camera, and video acquisition is the process of the client accessing this server to obtain the audio and video streams. Fig. 5 shows the information interaction between the RTSP server and the RTSP client. Standard RTSP commands are sent to the camera to acquire the corresponding audio and video streams.
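A minimal sketch of this acquisition step, assuming an OpenCV build with FFmpeg support (the URL, credentials and stream path are placeholders; real cameras differ):

```python
import cv2

# Hypothetical RTSP URL; address, credentials and path depend on the camera.
RTSP_URL = "rtsp://admin:password@192.168.1.64:554/stream1"

cap = cv2.VideoCapture(RTSP_URL)  # OpenCV performs the RTSP handshake of fig. 5 internally
if not cap.isOpened():
    raise RuntimeError("Failed to connect to the RTSP server")

ok, frame = cap.read()            # one BGR video image decoded from the stream
if ok:
    print("Acquired frame of size", frame.shape)
cap.release()
```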
(2) And the image preprocessing module.
The module preprocesses the acquired surveillance video images and provides a data source for subsequent pedestrian detection and crowd density estimation.
The purpose of image preprocessing is to eliminate noise in the surveillance video images. The invention uses median-filter denoising: a filtering operation is applied to the three RGB color components of the image, the basic idea being to replace the value of the central pixel in each channel with the channel's median. The specific process, taking the B channel as an example, is as follows.
First, a 3 × 3 template is slid over the whole image starting from the origin of the coordinate system, aligning the template center with each pixel in turn.
Second, the B values of the 9 pixels inside the template are sorted from small to large.
Third, the median of the sorted values from step (2) is taken, and the B component of the central pixel is replaced with it.
Median filtering of the R and G components proceeds in the same way. When all three components have been filtered, the denoising of the image is complete.
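A sketch of this per-channel median filter; cv2.medianBlur filters each channel of a color image independently, which matches the steps above:

```python
import cv2
import numpy as np

def denoise(frame_bgr: np.ndarray) -> np.ndarray:
    """3x3 median filter on each of the B, G, R channels."""
    # For every pixel, the 9 neighborhood values of a channel are sorted and
    # the center value is replaced by their median; cv2.medianBlur does this
    # for each channel of a 3-channel image independently.
    return cv2.medianBlur(frame_bgr, 3)

# Equivalent explicit per-channel form:
def denoise_per_channel(frame_bgr: np.ndarray) -> np.ndarray:
    channels = cv2.split(frame_bgr)                        # B, G, R
    return cv2.merge([cv2.medianBlur(c, 3) for c in channels])
```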
(3) People flow detection module
The purpose of this module is to detect and count pedestrians in the surveillance video picture. At the start of detection, a three-frame difference method performs moving-object detection on the surveillance video image; HOG-based pedestrian detection is carried out only when the result shows that the current video contains a moving object, otherwise moving-object detection continues. If the monitored picture contains pedestrians, their number is counted and the pedestrian regions are marked. The pedestrian detection flowchart is shown in fig. 6:
Because the two-frame (inter-frame) difference method suffers from ghosting and easily misses a moving target when the difference between two frames is small, the three-frame difference method is adopted: the differences of two pairs of adjacent frames are computed and then ANDed to locate the pedestrian.
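A sketch of the three-frame difference on grayscale frames (the binarization threshold of 25 is an assumed value, not from the patent):

```python
import cv2

def three_frame_difference(prev_gray, curr_gray, next_gray, thresh=25):
    """Moving-object mask from three consecutive grayscale frames."""
    d1 = cv2.absdiff(curr_gray, prev_gray)                 # |frame i - frame i-1|
    d2 = cv2.absdiff(next_gray, curr_gray)                 # |frame i+1 - frame i|
    _, m1 = cv2.threshold(d1, thresh, 255, cv2.THRESH_BINARY)
    _, m2 = cv2.threshold(d2, thresh, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(m1, m2)                         # AND the two difference masks
```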
HOG-based pedestrian detection first extracts the HOG features of positive and negative samples, then feeds the features into a support vector machine for training, and finally uses the trained detector to detect pedestrians in the image under test (a sketch follows the steps below). The flow chart is shown in fig. 7.
(1) Extract the HOG features of the positive and negative samples.
(2) Train an SVM classifier to obtain the model.
(3) Generate a detector from the model.
(4) Use the detector to detect pedestrians in the monitored-environment image.
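For illustration, the sketch below runs the same HOG-plus-SVM pipeline using the pedestrian detector that ships with OpenCV (trained on 64 × 128 windows) in place of the patent's own trained SVM; window stride, padding and scale are assumed values:

```python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_pedestrians(frame):
    """Return the pedestrian count and the frame with pedestrian regions marked."""
    rects, _ = hog.detectMultiScale(frame, winStride=(8, 8),
                                    padding=(8, 8), scale=1.05)
    for (x, y, w, h) in rects:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    return len(rects), frame
```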
The extraction of the HOG features can be roughly divided into the following five steps, as shown in fig. 8.
The first step is to complete the preparation work of the whole extraction process, namely standardizing the color space and Gamma space of the positive and negative training samples;
the standardization adopted by the invention is a Gamma orthogonal method, and the Gamma compression formula is as follows:
I(x,y)=I(x,y)gamma(1)
i.e. to indicate the value gamma for the image I. Wherein, I represents the current image, (x, y) represents the pixel point, and the value of gamma is selected according to the requirement.
Secondly, calculating the gradient of the positive and negative training samples;
the gradient corresponds to the first derivative of the image. The gradient of pixel point (x, y) in the image is:
Gx(x,y)=H(x+1,y)-H(x-1,y)(2)
Gy(x,y)=H(x,y+1)-H(x,y-1)(3)
wherein G isx、GyRespectively representing the horizontal direction gradient and the vertical direction gradient of the pixel point (x, y) in the extracted HOG characteristic image. H represents the pixel value at the pixel point (x, y) in the extracted HOG feature image. Equations (4) and (5) respectively represent a gradient amplitude calculation method and a gradient power calculation method at the pixel point (x, y), where G (x, y) represents the gradient amplitude at the pixel point (x, y), a (x, y) represents the gradient power at the pixel point (x, y), and G (x, y) represents the gradient power at the pixel point (x, y)x、GyAnd (3) representing the horizontal gradient and the vertical gradient at the pixel point (x, y).
G ( x , y ) = G x ( x , y ) 2 + G y ( x , y ) 2 - - - ( 4 )
α ( x , y ) = tan - 1 [ G x ( x , y ) G y ( x , y ) ] - - - ( 5 )
And performing convolution operation on the positive and negative training sample images to obtain the gradient component of the pixel point in the transverse direction, wherein the specific gradient operator is [ -1,0,1 ]. And then carrying out convolution operation on the positive and negative training sample images to obtain the gradient component of the pixel point in the longitudinal direction, wherein the specific gradient operator is [1,0, -1 ]. And finally, calculating the gradient size and gradient direction of the pixel point (x, y) by the formula (4) and the formula (5).
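A sketch of this gradient computation (np.arctan2 is used for the direction instead of the raw ratio in equation (5), an implementation choice that avoids division by zero; it is not from the patent):

```python
import cv2
import numpy as np

def hog_gradients(img_gray):
    """Per-pixel gradient magnitude and direction, equations (2)-(5)."""
    img = img_gray.astype(np.float32)
    kernel = np.array([[-1, 0, 1]], dtype=np.float32)
    gx = cv2.filter2D(img, -1, kernel)      # G_x: H(x+1,y) - H(x-1,y), eq. (2)
    gy = cv2.filter2D(img, -1, kernel.T)    # G_y: H(x,y+1) - H(x,y-1), eq. (3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)  # eq. (4)
    direction = np.arctan2(gy, gx)          # gradient direction, cf. eq. (5)
    return magnitude, direction
```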
Thirdly, counting the gradient values in each direction in the cell units according to the gradient values calculated in the second step;
the gradient direction histograms in the positive and negative samples are calculated by using a rectangular HOG form, and the calculation size of the HOG characteristic is in a cell unit. The training sample size in the present invention is 64 x 128, and the cell size per cell (cell) is 8 x 8, thus yielding 8 x 16-128 cell units per sample. Each block unit is composed of every adjacent 4 cell units. Therefore, it is known from the calculation that each positive and negative training sample with size 64 × 128 contains 105 blocks. The relationship between the cell units, the blocks and the training samples is shown in FIG. 9.
The invention divides the image into cells and divides the gradient direction of each cell into 9 direction bins over 360 degrees, as shown in FIG. 10. Each cell thus yields a 9-dimensional HOG feature vector, so a block forms a 4 × 9 = 36-dimensional feature vector. Each detection window contains 105 blocks, so a single positive or negative training sample contains a feature vector of total dimension 105 × 36 = 3780.
The fourth step is to normalize the gradient histogram of each block.
Normalization uses the L2-Hys norm, computed as:

$V_i^* = V_i \Big/ \sqrt{\sum_{i=1}^{k} V_i^2 + \epsilon^2}$    (6)

where $V_i$ is the gradient histogram of a block before normalization, $V_i^*$ is the normalized gradient histogram, and $\epsilon$ is a small standard constant used to prevent division by zero.
The last step is to combine the normalized gradient histograms into the HOG feature vectors of the positive and negative training samples according to a certain rule.
The HOG feature descriptions of all normalized blocks obtained in step 4 are combined to form the complete HOG feature vectors of the positive and negative samples; extracting this vector is the basis and key of pedestrian detection. As calculated in step 3, the HOG feature vectors are 3780-dimensional, and this 3780-dimensional vector is the feature vector ultimately fed to the classifier.
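The dimensionality can be checked against OpenCV's default HOGDescriptor, whose layout matches the one described above (64 × 128 window, 8 × 8 cells, 2 × 2-cell blocks, 8-pixel block stride, 9 bins, though OpenCV bins unsigned directions over 180° rather than 360°):

```python
import cv2
import numpy as np

hog = cv2.HOGDescriptor()                 # default: 64x128 window, 16x16 blocks,
                                          # 8x8 block stride, 8x8 cells, 9 bins
print(hog.getDescriptorSize())            # 3780 = 105 blocks x (4 cells x 9 bins)

patch = np.zeros((128, 64), dtype=np.uint8)   # one training-sample-sized window
features = hog.compute(patch)
print(features.size)                      # 3780, the vector fed to the classifier
```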
(4) Crowd density estimation module
The crowd density estimation module estimates the pedestrian situation in the current video picture with a crowd density estimation algorithm once the number of pedestrians in the image exceeds the specified threshold. Estimation flow: a background is first built for the current monitoring environment; once the background is successfully built, the number of foreground pixels is obtained by inter-frame differencing, and the crowd density is then estimated from the relation function between crowd density and foreground pixel count, trained in advance by least squares.
The invention estimates crowd density with a pixel-based approach: the processed video image is subtracted from the background image to obtain the foreground, the foreground pixels are counted, and least-squares fitting is used for training, finally yielding a linear relation function between the foreground pixel count and the crowd density; this function is the basis for estimating crowd density. The overall structure is shown in fig. 11.
The background generation of the invention adopts an interframe difference method, and the specific steps of generating the background are as follows.
① The first frame of the input video is $f_0$; the background image obtained at this point is $b_0$.
② Two consecutive video frames are obtained and differenced to separate the static and moving regions of the image. The result of the difference operation is compared with a preset threshold T: values below T are replaced by 0, values at or above T by 1. This step is defined as:
$bw_i = \begin{cases} 1, & \mathrm{abs}(f_i - f_{i-1}) \ge T \\ 0, & \mathrm{abs}(f_i - f_{i-1}) < T \end{cases}$    (7)

where $f_i$ and $f_{i-1}$ denote two consecutive video frames, abs() takes the absolute value of the difference, T is the preset difference threshold (which changes as the environment changes), and $bw_i$ denotes the current judgment result.
③ The background is updated with region-dependent coefficients: the update coefficient is large in static regions and small in moving regions. The update rule is defined in equation (8):

$b_i(x,y) = \begin{cases} b_{i-1}(x,y), & bw_i(x,y) = 1 \\ a\, f_i(x,y) + (1-a)\, b_{i-1}(x,y), & bw_i(x,y) = 0 \end{cases}$    (8)

where $b_i(x,y)$ is the background pixel value at (x, y) for the current video image; if the judgment of step ② gives $bw_i = 1$, the value of $b_i(x,y)$ is left unchanged. $f_i(x,y)$ is the pixel value of the current frame at (x, y), a is the update coefficient applied to the current frame and (1 − a) the weight of the previous background; a is usually small. The values of a and T differ between monitoring scenes.
After the current frame and background frame are obtained, both images are first converted to single-channel (grayscale) images, and the grayscale images are then differenced; removing the background yields the foreground. The pixel count occupied by the foreground after background subtraction is obtained, in preparation for the least-squares fitting that follows.
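A sketch of this background maintenance and foreground counting, following equations (7) and (8); the threshold T and update coefficient a are assumed values, since the patent notes they vary with the scene:

```python
import cv2
import numpy as np

def update_background(bg, prev_gray, curr_gray, T=15, a=0.05):
    """One background-update step following equations (7) and (8)."""
    diff = cv2.absdiff(curr_gray, prev_gray)
    bw = diff >= T                                       # eq. (7): 1 = changed, 0 = unchanged
    blended = a * curr_gray + (1 - a) * bg               # eq. (8), bw == 0 branch
    return np.where(bw, bg, blended).astype(np.float32)  # bw == 1 keeps the old value

def foreground_pixel_count(bg, curr_gray, T=15):
    """Foreground pixel count after subtracting the background."""
    fg = cv2.absdiff(curr_gray.astype(np.float32), bg) >= T
    return int(fg.sum())
```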
The invention uses straight-line fitting to find the functional relation between the foreground pixel count and the crowd density. In practice, however, no straight line can be found that passes through all observed data points, so residuals are used. The residual $|e_k|$ is the absolute value of the difference between the observed value $y_k$ and the value $(a x_k + b)$ computed from the fitted line $y = ax + b$. It is defined as:

$e_k = y_k - (a x_k + b)$    (9)
The physical meaning of the residual $|e_k|$ is the degree to which the point $(x_k, y_k)$ deviates from the fitted line; if the data point falls on the line, the residual is 0. In the fitted line $y = ax + b$, values of a and b must therefore be sought that make the residuals minimal in some sense, namely that minimize the sum of the squares of all residuals:

$S(a,b) \triangleq \sum_k \left[ y_k - (a x_k + b) \right]^2 = \min$    (10)
where S(a, b) is a non-negative quadratic polynomial in a and b, differentiable with respect to both, and min denotes the minimum attainable value. The values of a and b are obtained from the condition for an extremum of equation (10):

$\frac{\partial S(a,b)}{\partial a} = \frac{\partial S(a,b)}{\partial b} = 0$    (11)
This yields the system of linear equations:

$\frac{\partial S}{\partial a} = -2 \sum_{k=1}^{n} \left[ y_k - (a x_k + b) \right] x_k = 0$

$\frac{\partial S}{\partial b} = -2 \sum_{k=1}^{n} \left[ y_k - (a x_k + b) \right] = 0$    (12)
Since the solution of equation (12) is unique under mild conditions, a and b can be obtained from it as:

$a = \dfrac{m \sum_{k=1}^{m} x_k y_k - \left(\sum_{k=1}^{m} x_k\right)\left(\sum_{k=1}^{m} y_k\right)}{m \sum_{k=1}^{m} x_k^2 - \left(\sum_{k=1}^{m} x_k\right)^2}, \qquad b = \dfrac{\left(\sum_{k=1}^{m} x_k^2\right)\left(\sum_{k=1}^{m} y_k\right) - \left(\sum_{k=1}^{m} x_k\right)\left(\sum_{k=1}^{m} x_k y_k\right)}{m \sum_{k=1}^{m} x_k^2 - \left(\sum_{k=1}^{m} x_k\right)^2}$    (13)

These are the coefficients of the line $y = ax + b$, the linear equation used to estimate the crowd density.
(5) Camera parameter setting and early warning module
The early warning and parameter setting module mainly sends out group-event early warning information and camera parameter modification commands once pedestrian detection and crowd density detection reach the preset levels. The camera parameters are modified through the widely used ONVIF protocol, which shields the differences between products of the various network camera manufacturers and provides a uniform interface for modifying camera parameters.
All camera-modification interfaces that the ONVIF protocol provides to developers are Web Services, so any network camera supporting ONVIF necessarily supports Web Services. Camera parameters are modified by sending SOAP messages.
WebService primarily uses HTTP and SOAP protocols to transport data over the Web. The client may interact with the Web server using HTTP messages or SOAP messages. The working process is as follows:
firstly, a client initializes a SOAP message according to the self requirement, fills the message into a POST request of HTTP, and finally sends the request to a server of WebServices.
And secondly, after receiving the POST request, the server of the WebServices takes out the SOAP message in the server and analyzes the SOAP message.
The WebServices server processes the reasonable SOAP requirement, generates a corresponding SOAP request response message according to the processing information, and sends the message to the client through HTTP.
A specific flow of modifying the camera configuration is shown in fig. 12. The SOAP message is initialized from input parameters such as camera resolution, image quality, bit-rate ceiling and frame rate; the initialized SOAP message is sent to the camera's server, which parses the received message and modifies the camera configuration.
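As an illustration of this flow, the sketch below uses the third-party python-onvif-zeep package, which wraps the SOAP messaging described above; the camera address and credentials are placeholders, and the field names follow the ONVIF Media service schema:

```python
from onvif import ONVIFCamera  # third-party python-onvif-zeep package

# Placeholder address and credentials; a real deployment would use the
# camera's actual host, port and account.
cam = ONVIFCamera("192.168.1.64", 80, "admin", "password")
media = cam.create_media_service()

profile = media.GetProfiles()[0]
token = profile.VideoEncoderConfiguration.token
config = media.GetVideoEncoderConfiguration({"ConfigurationToken": token})

# Apply, e.g., the level-4 settings from the table above (800x600, 30 fps).
config.Resolution.Width = 800
config.Resolution.Height = 600
config.RateControl.FrameRateLimit = 30

# The library serializes this into the SOAP SetVideoEncoderConfiguration request.
media.SetVideoEncoderConfiguration({"Configuration": config,
                                    "ForcePersistence": True})
```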
(6) Video transmission module
The transmission module is mainly used for transmitting the original video file or the real-time video of the camera to the remote server in a streaming mode, and the remote server can simultaneously receive the access of a plurality of clients.
The transmission of the real-time video of the camera is mainly divided into two parts, wherein the first part is the acquisition of the camera video, and the second part is the transmission of the acquired video. The transmission process of the whole video uses RTSP/RTP/RTCP protocol for quality control.
First the real-time video of the network camera is obtained; once obtained, it is sent to the remote streaming media server. Forwarding over TCP guarantees that no video frames are lost, while forwarding over UDP guarantees real-time performance; UDP is the default. The whole transmission path runs from the camera to the transmission platform, and from the transmission platform to the streaming media server. The transmission flow of the surveillance video is shown in fig. 13.
The camera's real-time video is first acquired through steps 2 and 3, and the video is then transmitted in step 4; the DSS streaming media server is the server that receives the transmitted surveillance video. Step 1 is not executed by default; it runs only when a user wants to view the real-time video of a particular camera. The video-sending command contains the relevant DSS server information, which provides the destination address for the video transmission in step 4.
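As one possible realization of this forwarding step, the sketch below remuxes the camera stream to an RTSP-ingest streaming server with ffmpeg (the URLs are placeholders; a DSS deployment may instead use its own ingestion mechanism):

```python
import subprocess

CAMERA_URL = "rtsp://admin:password@192.168.1.64:554/stream1"   # placeholder
SERVER_URL = "rtsp://streaming.example.com:554/live/cam01"      # placeholder

subprocess.run([
    "ffmpeg",
    "-rtsp_transport", "udp",   # pull the camera leg over UDP (the default per the text)
    "-i", CAMERA_URL,
    "-c", "copy",               # no re-encoding, just repacketize into RTP
    "-f", "rtsp", SERVER_URL,   # push to the remote streaming server over RTSP
], check=True)
```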
To test the performance of the invention, two video clips were monitored. The test data for people flow detection comes from the PETS2009 data set; the test data for crowd density estimation was self-recorded.
After the people flow detection function is started, the system creates a window showing the people flow detection result for the current frame, marks detected pedestrians with rectangular boxes, and displays the detection result below the system interface. The crowd density is divided into four levels: level 1 (fewer than 2 persons per square meter), level 2 (at least 2 and fewer than 3), level 3 (at least 3 and fewer than 4) and level 4 (4 or more). When the people flow reaches 2 persons per square meter, the crowd density detection function starts automatically and the video transmission function starts at the same time; the early warning signal is raised when the crowd density reaches level 4, and the camera parameters are adjusted automatically for the corresponding crowd density level. The correspondence between crowd density level and camera parameters is: level 1 (camera resolution 200 × 150, 5 frames saved and transmitted per second); level 2 (200 × 150, 10 frames per second); level 3 (400 × 300, 20 frames per second); level 4 (800 × 600, 30 frames per second).
In the detection of the two video clips, crowd density detection started automatically once the pedestrian level in the video reached the preset level 2, video transmission started at the same time, and early warning information was sent out after the level reached the preset level 4; during detection the camera parameters were set automatically as the crowd density level changed. Counting the stored and transmitted video, material 1 saved 24.02% of space and material 2 saved 62.75%. It can be seen that, when there are few or no pedestrians, the video shot by the camera after adjustment by the controller saves space compared with the video shot without the controller. The method not only reduces the amount of redundant surveillance video information, but also reduces the bandwidth required for video transmission, lowers the power consumption of video monitoring, and improves monitoring efficiency.

Claims (7)

1. A monitoring video transmission method based on video content analysis is characterized by comprising an image acquisition module, an image preprocessing module, a people flow detection module, a crowd density estimation module, a camera parameter setting and early warning module and a video transmission module; wherein,
the image acquisition module is used for acquiring a monitoring video image of a monitoring scene and providing a data source for the subsequent image preprocessing;
the image preprocessing module is used for preprocessing and calculating the acquired monitoring video image and providing a data source for subsequent people flow detection and crowd density estimation;
the people flow detection module is used for detecting and counting pedestrians in the surveillance video picture, and the statistical result provides the basis for deciding whether to start crowd density estimation later;
the crowd density estimation module is used for estimating the crowd density in the current video picture with a crowd density estimation algorithm once the people flow in the image exceeds the preset activation level for density detection, providing a basis for camera parameter setting and early warning;
the camera parameter setting and early warning module is used for sending out group event early warning information after pedestrian detection and crowd density detection reach preset early warning levels, modifying camera parameters according to the crowd density levels, and improving the effective rate of video transmission through modification of the camera parameters;
the video transmission module automatically starts a video transmission function after the monitored video reaches a preset video transmission grade, transmits the original video file or the real-time video of the camera to the remote server in a streaming mode, and the remote server receives the access of a plurality of clients simultaneously.
2. The surveillance video delivery method based on video content analysis according to claim 1, wherein the image capture module obtains the video stream by accessing an RTSP server and captures the video image through the video stream.
3. The surveillance video transmission method based on video content analysis according to claim 1, wherein the image preprocessing module performs median filtering on the acquired video image to achieve the purpose of denoising.
4. The surveillance video transmission method based on video content analysis according to claim 1, wherein the people stream detection module performs moving object detection on the surveillance video image by using a three-frame difference method at the beginning of detection, performs pedestrian detection based on the HOG feature only when the detection result shows that the current video contains a moving object, and otherwise performs moving object detection all the time; and if the monitoring picture contains the pedestrians, counting the number of the pedestrians and marking the pedestrian area.
the three-frame difference method computes the differences of two pairs of adjacent frames and then ANDs the results to locate the pedestrian; inter-frame differencing is an algorithm that obtains the outer contour of a moving target by differencing two adjacent images in the captured surveillance video;
the HOG features are calculated and counted out through local features of the pedestrians, and finally a gradient histogram of the whole pedestrian is obtained through synthesis; the pedestrian local feature extraction comprises 5 steps:
● the first step is to complete the preparation of the whole extraction process, namely standardizing the color space and Gamma space of the positive and negative training samples;
● the second step is to calculate the gradient of positive and negative training samples;
● the third step is to count the gradient value in each direction in the cell unit for the gradient value calculated in the second step;
● the fourth step is to normalize the histogram of gradient of each block;
● the last step is to combine the normalized gradient histogram into positive and negative HOG feature vectors of training samples according to a certain rule.
5. The surveillance video transmission method based on video content analysis according to claim 1, wherein the crowd density is divided into 4 levels: level 1, fewer than 2 persons per square meter; level 2, at least 2 and fewer than 3; level 3, at least 3 and fewer than 4; level 4, 4 or more. The activation level for density detection is level 2, and the estimation process of the crowd density estimation module is as follows: a background is first built for the current surveillance video; once the background is successfully built, the number of foreground pixels is obtained by inter-frame differencing, and the crowd density is estimated via the relation function between crowd density and foreground pixel count;
the background is that no shielding object exists in the current monitoring scene, the background is established by adopting a video frame pixel point statistical method, the number of foreground pixels is the pixel value occupied by the crowd, and the relation function between the crowd density and the number of foreground pixels of the monitoring video is obtained by training and calculating by using a least square method in advance.
6. The surveillance video transmission method based on video content analysis according to claim 1, wherein the camera parameter setting and early warning module modifies the camera parameters via the ONVIF protocol; the early warning automatically sends out group early warning information when the crowd density reaches the preset warning level, which is level 4; the correspondence between camera parameter settings and video crowd density levels is: level 1, camera resolution 200 × 150, 5 frames saved and transmitted per second; level 2, 200 × 150, 10 frames per second; level 3, 400 × 300, 20 frames per second; level 4, 800 × 600, 30 frames per second;
the ONVIF protocol enables a developer to modify parameters of cameras of different brands and different models by using a uniform parameter interface, the differences of systems and hardware of each camera are shielded, and the parameters are the frequency of shooting pixels and video frames of the cameras; the preset early warning level is 4 levels.
7. The surveillance video transmission method based on video content analysis according to claim 1, wherein the video transmission module handles two kinds of content, existing video files and real-time video pictures; for real-time video it first acquires the camera's current picture, then packages and transmits the video to the remote DSS server, and finally the user views the current camera's real-time video by accessing the DSS server; the preset video transmission level is level 2;
the camera's current picture is obtained by the client accessing the RTSP server built into the camera to acquire the audio and video streams; the real-time video is packaged and sent to the remote DSS server, with TCP guaranteeing that no video frames are lost and UDP guaranteeing real-time performance; by default it is sent over UDP; transmission quality over the whole process is guaranteed by the RTCP protocol, the video stream is carried by the RTP protocol, and the RTSP protocol controls the start, pause and end of the surveillance video transmission.
CN201610201613.0A 2016-03-31 2016-03-31 Surveillance video transmission method based on video content analysis Pending CN105791774A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610201613.0A CN105791774A (en) 2016-03-31 2016-03-31 Surveillance video transmission method based on video content analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610201613.0A CN105791774A (en) 2016-03-31 2016-03-31 Surveillance video transmission method based on video content analysis

Publications (1)

Publication Number Publication Date
CN105791774A true CN105791774A (en) 2016-07-20

Family

ID=56394538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610201613.0A Pending CN105791774A (en) 2016-03-31 2016-03-31 Surveillance video transmission method based on video content analysis

Country Status (1)

Country Link
CN (1) CN105791774A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372620A (en) * 2016-09-29 2017-02-01 北京小米移动软件有限公司 Video information sharing method and device
CN106548158A (en) * 2016-11-07 2017-03-29 华南理工大学 Crowd density intelligent monitor system and method based on machine vision
CN107094244A (en) * 2017-05-27 2017-08-25 北方工业大学 Intelligent passenger flow monitoring device and method capable of being managed and controlled in centralized mode
CN107264797A (en) * 2016-04-06 2017-10-20 成都积格科技有限公司 Crowd massing early warning unmanned plane
CN107404620A (en) * 2017-08-25 2017-11-28 无锡北斗星通信息科技有限公司 A kind of passenger image data correcting method in real time
CN108289197A (en) * 2017-12-29 2018-07-17 深圳市朗诚科技股份有限公司 Buoy 4G wireless video monitorings monitoring method and system
CN109858338A (en) * 2018-12-23 2019-06-07 广东腾晟信息科技有限公司 A kind of identification and crowd behaviour parser of crowd density estimation
CN109977260A (en) * 2019-02-12 2019-07-05 深圳绿米联创科技有限公司 Video recording acquisition methods, device, system, electronic equipment and storage medium
CN110708518A (en) * 2019-11-05 2020-01-17 北京深测科技有限公司 People flow analysis early warning dispersion method and system
CN110781735A (en) * 2019-09-18 2020-02-11 重庆特斯联智慧科技股份有限公司 Alarm method and system for identifying on-duty state of personnel
CN110852208A (en) * 2019-10-29 2020-02-28 贵州民族大学 Crowd density estimation method and readable storage medium
CN111726532A (en) * 2020-06-30 2020-09-29 北京环境特性研究所 Windowing alarm detection system and method
CN112465781A (en) * 2020-11-26 2021-03-09 华能通辽风力发电有限公司 Method for identifying defects of main parts of wind turbine generator based on video
CN113096406A (en) * 2019-12-23 2021-07-09 深圳云天励飞技术有限公司 Vehicle information acquisition method and device and electronic equipment
CN113344965A (en) * 2021-08-05 2021-09-03 鑫安利中(北京)科技有限公司 Safety management system based on image acquisition
CN114627434A (en) * 2022-03-30 2022-06-14 今日汽车信息技术有限公司 Automobile sales exhibition room passenger flow identification system based on big data
CN116456061A (en) * 2023-06-15 2023-07-18 四川三思德科技有限公司 Intelligent community monitoring management method, system and medium based on dynamic target detection
CN116506664A (en) * 2023-06-25 2023-07-28 北京淳中科技股份有限公司 Video code stream generation method and device, electronic equipment and medium
CN116887487A (en) * 2023-07-04 2023-10-13 广东华辉煌光电科技有限公司 Intelligent control method and system for lamp beads
CN117177069A (en) * 2023-11-02 2023-12-05 百鸟数据科技(北京)有限责任公司 Camera automatic tracking method and system based on ONVIF protocol
CN118396783A (en) * 2024-06-25 2024-07-26 深圳市海宇科电子科技有限公司 Local information intelligent service platform based on set top box

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101159859A (en) * 2007-11-29 2008-04-09 北京中星微电子有限公司 Motion detection method, device and an intelligent monitoring system
CN101431664A (en) * 2007-11-06 2009-05-13 同济大学 Automatic detection method and system for intensity of passenger flow based on video image
CN101464944A (en) * 2007-12-19 2009-06-24 中国科学院自动化研究所 Crowd density analysis method based on statistical characteristics
CN101751668A (en) * 2009-11-13 2010-06-23 北京智安邦科技有限公司 Method and device for detecting crowd density
CN101795395A (en) * 2009-02-04 2010-08-04 深圳市先进智能技术研究所 System and method for monitoring crowd situation
CN102509151A (en) * 2011-11-08 2012-06-20 上海交通大学 Video-processing-based crowd density and distribution estimation method
WO2012111138A1 (en) * 2011-02-18 2012-08-23 株式会社日立製作所 Pedestrian movement information detection device
CN102982341A (en) * 2012-11-01 2013-03-20 南京师范大学 Self-intended crowd density estimation method for camera capable of straddling
US20140184795A1 (en) * 2013-01-03 2014-07-03 International Business Machines Corporation Detecting relative crowd density via client devices
US8812344B1 (en) * 2009-06-29 2014-08-19 Videomining Corporation Method and system for determining the impact of crowding on retail performance

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101431664A (en) * 2007-11-06 2009-05-13 同济大学 Automatic detection method and system for intensity of passenger flow based on video image
CN101159859A (en) * 2007-11-29 2008-04-09 北京中星微电子有限公司 Motion detection method, device and an intelligent monitoring system
CN101464944A (en) * 2007-12-19 2009-06-24 中国科学院自动化研究所 Crowd density analysis method based on statistical characteristics
CN101795395A (en) * 2009-02-04 2010-08-04 深圳市先进智能技术研究所 System and method for monitoring crowd situation
US8812344B1 (en) * 2009-06-29 2014-08-19 Videomining Corporation Method and system for determining the impact of crowding on retail performance
CN101751668A (en) * 2009-11-13 2010-06-23 北京智安邦科技有限公司 Method and device for detecting crowd density
WO2012111138A1 (en) * 2011-02-18 2012-08-23 株式会社日立製作所 Pedestrian movement information detection device
CN102509151A (en) * 2011-11-08 2012-06-20 上海交通大学 Video-processing-based crowd density and distribution estimation method
CN102982341A (en) * 2012-11-01 2013-03-20 南京师范大学 Self-intended crowd density estimation method for camera capable of straddling
US20140184795A1 (en) * 2013-01-03 2014-07-03 International Business Machines Corporation Detecting relative crowd density via client devices

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈炜 (Chen Wei): "Research and Application of a Surveillance Video Acquisition and Transmission System Based on Content Analysis", China Master's Theses Full-text Database, Information Science and Technology Series (Monthly) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107264797A (en) * 2016-04-06 2017-10-20 成都积格科技有限公司 Crowd massing early warning unmanned plane
CN107264797B (en) * 2016-04-06 2019-12-17 成都积格科技有限公司 Crowd gathers early warning unmanned aerial vehicle
CN106372620B (en) * 2016-09-29 2021-06-29 北京小米移动软件有限公司 Video information sharing method and device
CN106372620A (en) * 2016-09-29 2017-02-01 北京小米移动软件有限公司 Video information sharing method and device
CN106548158A (en) * 2016-11-07 2017-03-29 华南理工大学 Crowd density intelligent monitor system and method based on machine vision
CN107094244B (en) * 2017-05-27 2019-12-06 北方工业大学 Intelligent passenger flow monitoring device and method capable of being managed and controlled in centralized mode
CN107094244A (en) * 2017-05-27 2017-08-25 北方工业大学 Intelligent passenger flow monitoring device and method capable of being managed and controlled in centralized mode
CN107404620A (en) * 2017-08-25 2017-11-28 无锡北斗星通信息科技有限公司 A kind of passenger image data correcting method in real time
CN108289197A (en) * 2017-12-29 2018-07-17 深圳市朗诚科技股份有限公司 Buoy 4G wireless video monitorings monitoring method and system
CN109858338A (en) * 2018-12-23 2019-06-07 广东腾晟信息科技有限公司 A kind of identification and crowd behaviour parser of crowd density estimation
CN109977260A (en) * 2019-02-12 2019-07-05 深圳绿米联创科技有限公司 Video recording acquisition methods, device, system, electronic equipment and storage medium
CN110781735A (en) * 2019-09-18 2020-02-11 重庆特斯联智慧科技股份有限公司 Alarm method and system for identifying on-duty state of personnel
CN110852208A (en) * 2019-10-29 2020-02-28 贵州民族大学 Crowd density estimation method and readable storage medium
CN110852208B (en) * 2019-10-29 2023-06-02 贵州民族大学 Crowd density estimation method and readable storage medium
CN110708518A (en) * 2019-11-05 2020-01-17 北京深测科技有限公司 People flow analysis early warning dispersion method and system
CN113096406A (en) * 2019-12-23 2021-07-09 深圳云天励飞技术有限公司 Vehicle information acquisition method and device and electronic equipment
CN111726532A (en) * 2020-06-30 2020-09-29 北京环境特性研究所 Windowing alarm detection system and method
CN111726532B (en) * 2020-06-30 2021-08-27 北京环境特性研究所 Windowing alarm detection system and method
CN112465781A (en) * 2020-11-26 2021-03-09 华能通辽风力发电有限公司 Method for identifying defects of main parts of wind turbine generator based on video
CN113344965A (en) * 2021-08-05 2021-09-03 鑫安利中(北京)科技有限公司 Safety management system based on image acquisition
CN114627434A (en) * 2022-03-30 2022-06-14 今日汽车信息技术有限公司 Automobile sales exhibition room passenger flow identification system based on big data
CN116456061A (en) * 2023-06-15 2023-07-18 四川三思德科技有限公司 Intelligent community monitoring management method, system and medium based on dynamic target detection
CN116456061B (en) * 2023-06-15 2023-09-08 四川三思德科技有限公司 Intelligent community monitoring management method, system and medium based on dynamic target detection
CN116506664A (en) * 2023-06-25 2023-07-28 北京淳中科技股份有限公司 Video code stream generation method and device, electronic equipment and medium
CN116506664B (en) * 2023-06-25 2023-09-15 北京淳中科技股份有限公司 Video code stream generation method and device, electronic equipment and medium
CN116887487A (en) * 2023-07-04 2023-10-13 广东华辉煌光电科技有限公司 Intelligent control method and system for lamp beads
CN116887487B (en) * 2023-07-04 2024-03-29 广东华辉煌光电科技有限公司 Intelligent control method and system for lamp beads
CN117177069A (en) * 2023-11-02 2023-12-05 百鸟数据科技(北京)有限责任公司 Camera automatic tracking method and system based on ONVIF protocol
CN117177069B (en) * 2023-11-02 2024-01-30 百鸟数据科技(北京)有限责任公司 Camera automatic tracking method and system based on ONVIF protocol
CN118396783A (en) * 2024-06-25 2024-07-26 深圳市海宇科电子科技有限公司 Local information intelligent service platform based on set top box
CN118396783B (en) * 2024-06-25 2024-08-30 深圳市海宇科电子科技有限公司 Local information intelligent service platform based on set top box

Similar Documents

Publication Publication Date Title
CN105791774A (en) Surveillance video transmission method based on video content analysis
KR101942808B1 (en) Apparatus for CCTV Video Analytics Based on Object-Image Recognition DCNN
US10489660B2 (en) Video processing with object identification
CN104063883B (en) A kind of monitor video abstraction generating method being combined based on object and key frame
JP5213105B2 (en) Video network system and video data management method
US8675065B2 (en) Video monitoring system
US9681125B2 (en) Method and system for video coding with noise filtering
WO2020094088A1 (en) Image capturing method, monitoring camera, and monitoring system
US8922674B2 (en) Method and system for facilitating color balance synchronization between a plurality of video cameras and for obtaining object tracking between two or more video cameras
Abdullah et al. Traffic monitoring using video analytics in clouds
CN101431664A (en) Automatic detection method and system for intensity of passenger flow based on video image
CN111723656B (en) Smog detection method and device based on YOLO v3 and self-optimization
CN113038375B (en) Method and system for sensing and positioning hidden camera
TW201328359A (en) Moving object detection method and apparatus based on compressed domain
EP3975133A1 (en) Processing of images captured by vehicle mounted cameras
CN105405153B (en) Intelligent mobile terminal anti-noise jamming Extracting of Moving Object
CN116916049A (en) Video data online acquisition and storage system based on cloud computing technology
CA2420069A1 (en) Image processing apparatus and method, and image pickup apparatus
CN111708907B (en) Target person query method, device, equipment and storage medium
CN104125430B (en) Video moving object detection method, device and video monitoring system
CN111784750A (en) Method, device and equipment for tracking moving object in video image and storage medium
CN112580633A (en) Public transport passenger flow statistical device and method
TWI499291B (en) System and method for assessing and measuring mixed signals in video signal data
CN116489317B (en) Object detection method, system and storage medium based on image pickup device
CN101539988A (en) Internet and GPRS-based intelligent video monitoring system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160720