CN111325048B - Personnel gathering detection method and device - Google Patents


Info

Publication number
CN111325048B
CN111325048B (application number CN201811523516.9A)
Authority
CN
China
Prior art keywords
detected
target
gathering
tracking
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811523516.9A
Other languages
Chinese (zh)
Other versions
CN111325048A (en)
Inventor
曾钦清
童超
车军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811523516.9A priority Critical patent/CN111325048B/en
Publication of CN111325048A publication Critical patent/CN111325048A/en
Application granted granted Critical
Publication of CN111325048B publication Critical patent/CN111325048B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a personnel gathering detection method and device. The method comprises: extracting targets to be detected from an acquired video stream and tracking them to obtain a tracking linked list; determining the moving speed of each target to be detected according to the tracking linked list; judging whether each moving speed is smaller than a first preset threshold, and determining a static area of a frame image according to the areas corresponding to those targets in the frame whose moving speed is smaller than the first preset threshold; taking images of the video stream that contain a static area as candidate gathering images; when the number N of candidate gathering images reaches a second preset threshold, judging with a preset personnel gathering prediction model whether each candidate gathering image is a personnel gathering image, and counting the number M of personnel gathering images according to the judgment results; and when M is determined to be larger than a third preset threshold, determining that a personnel gathering event has occurred. The method can improve the accuracy of personnel gathering detection.

Description

Personnel gathering detection method and device
Technical Field
The invention relates to the technical field of monitoring, in particular to a personnel gathering detection method and device.
Background
In prior-art video monitoring, when people gather in a monitored scene, the management risk and control difficulty of the monitored area increase, and the scene must be managed with a scheme different from the normal one.
Because modern video monitoring systems are deployed at a huge scale with a large number of cameras, spotting people gathering across all monitored scenes by watching every camera for long periods requires a great deal of manpower, consumes labor cost, and easily leads to missed reports. Automatically detecting people gathering in monitored scenes by means of video analysis is therefore a requirement for an intelligent monitoring system.
A video-based personnel gathering detection method is provided in the prior art: learn the monitored area from continuous video images to obtain its current background image; segment the foreground image by thresholding to obtain a segmented image; run pixel statistics on the connected regions of the target image; and judge whether a people-gathering region exists from the area of each connected region in the target image and a preset area threshold.
Because this method obtains the gathering region by thresholding a static-region image, its gathering criterion is simple, it does not make full use of the information in multiple video frames, its scene applicability is poor, it produces false detections when targets merge in complex scenes, and its detection accuracy is poor.
Disclosure of Invention
In view of the above, the present application provides a method and an apparatus for detecting personnel aggregation, which can improve the accuracy of the personnel aggregation detection.
To solve the above technical problem, a first aspect of the present application provides a method for detecting personnel aggregation, including:
extracting a target to be detected from the acquired video stream, and tracking the target to be detected to obtain a tracking linked list;
determining the moving speed of each target to be detected according to the tracking linked list;
judging whether the moving speed is smaller than a first preset threshold value or not, and determining a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, the moving speed of which is smaller than the first preset threshold value, in the frame image;
taking an image with a static area in the video stream as a candidate gathering image; when the number N of the candidate gathering images reaches a second preset threshold, judging whether the candidate gathering images are the personnel gathering images or not through a preset personnel gathering prediction model, and counting the number M of the personnel gathering images according to a judgment result;
and when M is determined to be larger than a third preset threshold value, determining that a personnel gathering event occurs.
A second aspect of the present application provides a people gathering detection device, the device comprising: the device comprises an acquisition unit, a first determination unit, a second determination unit, a third determination unit, a fourth determination unit, a statistics unit and a fifth determination unit;
the acquisition unit is used for extracting a target to be detected from the acquired video stream, tracking the target to be detected and obtaining a tracking linked list;
the first determining unit is used for determining the moving speed of each target to be detected according to the tracking linked list acquired by the acquiring unit;
the second determining unit is configured to determine whether the moving speed determined by the first determining unit is less than a first preset threshold, and determine a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, the moving speed of which is less than the first preset threshold, in the frame image;
the third determining unit is configured to take, as a candidate aggregate image, an image in the video stream in which the static area determined by the second determining unit exists; determining whether the number N of candidate aggregated images reaches a second preset threshold;
the fourth determining unit is configured to determine, when the third determining unit determines that the number N of candidate gathering images reaches the second preset threshold, whether the candidate gathering images are person gathering images through a preset personnel gathering prediction model;
the statistics unit is used for counting the number M of person gathering images according to the judgment result of the fourth determining unit;
and the fifth determining unit is used for determining that a personnel gathering event occurs when the M counted by the counting unit is determined to be larger than a third preset threshold value.
A third aspect of the present application provides a non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the personnel gathering detection method described above.
An electronic device is also provided, comprising the non-transitory computer readable storage medium described above and a processor having access to that storage medium.
In the method of the present application, static-region identification is combined with a preset personnel gathering prediction model obtained through deep learning: the multi-frame images of the video are identified a second time, the process information of a personnel gathering event is fully utilized, and the accuracy of personnel gathering detection can be improved.
Drawings
FIG. 1 is a schematic diagram of a human gathering detection flow in an embodiment of the present application;
FIG. 2 is a schematic diagram of a location area where an aggregation event occurs in an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus applied to the above technology in the embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below by referring to the accompanying drawings and examples.
According to the personnel aggregation detection method, the preset personnel aggregation prediction model obtained through static area identification and deep learning is combined, the multi-frame images of the video images are identified secondarily, the process information of the occurrence of the personnel aggregation event is fully utilized, and the accuracy of the personnel aggregation detection can be improved.
The method and device of the present application are applied to the detection of personnel gathering events in public places and important areas. The personnel gathering detection process in the embodiments of the present application is described in detail below with reference to the accompanying drawings.
For convenience of description, the apparatus for realizing the person aggregation detection is hereinafter simply referred to as a detection apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of the personnel gathering detection flow in an embodiment of the present application. The flow comprises the following specific steps:
step 101, the detection device extracts a target to be detected from the acquired video stream, and tracks the target to be detected to obtain a tracking linked list.
The video stream can be acquired in real time through video monitoring equipment such as a camera: video images of the monitored scene are captured in real time and transmitted to the detection device, which receives and stores the video stream sent by the monitoring equipment, thereby acquiring the video stream in real time.
In this step, extracting the targets to be detected from the acquired video stream and tracking them to obtain the tracking linked list comprises the following two steps:
the method comprises the steps of firstly, obtaining a foreground image of each frame image of a video stream, and obtaining a target to be detected in the foreground image.
This step may be implemented in, but is not limited to, the following two ways:
first kind:
the foreground object may be extracted from the video stream by a foreground model for foreground detection, such that the foreground object is the detection target of the target person. The background modeling method may include a gaussian mixture model (Gaussian Mixture Model) and a vipe (visual background extractor, visual background extraction) algorithm, among others.
Second kind:
feature objects may be extracted from the video stream by a trained convolutional neural network (Convolutional Neural Network) to target the feature objects to a detection target of a person. The convolutional neural network needs to be trained by personnel features in advance, a feature target appearing in one frame of image in the video can be identified, and as an embodiment, the convolutional neural network can be trained by limbs of a person, so that the subsequently trained convolutional neural network can extract the limb target of the person from the video stream, and the target to be detected is obtained.
And secondly, tracking each target to be detected in the foreground image of each frame of image to obtain a tracking linked list.
After the targets to be detected are obtained, they can be tracked, and the tracking results recorded in a tracking linked list.
In specific implementation, the detection target can be tracked by means of Kalman filtering, particle filtering or multi-target tracking technology.
In particular implementations, one tracking linked list may hold a tracking entry for every detection target, or a separate tracking linked list may be generated for each detection target.
The tracking linked list in the embodiment of the present application comprises at least: a mapping relation between the identifier of the target to be detected, the video frame identifiers of the video frames in which the target appears, and the historical coordinates of the target.
The historical coordinates of a target to be detected are the coordinates of the center point of the circumscribed rectangular frame of the target's outline.
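A minimal sketch of one tracking-linked-list entry as described above (field and method names are assumptions for illustration, not the patent's data layout):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Track:
    """One entry: target identifier -> video frames and historical coordinates."""
    target_id: int
    frame_ids: List[int] = field(default_factory=list)
    history: List[Tuple[float, float, float]] = field(default_factory=list)  # (time, x, y)

    def update(self, frame_id: int, t: float, box: Tuple[int, int, int, int]) -> None:
        """Record the centre of the target's circumscribed rectangle (x, y, w, h)."""
        x, y, w, h = box
        self.frame_ids.append(frame_id)
        self.history.append((t, x + w / 2.0, y + h / 2.0))
```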
When the tracking linked list is specifically implemented, the tracking linked list can be implemented in the form of the following table. Referring to table 1, table 1 is the content of the tracking chain table in the embodiment of the present application.
Target identifier | Video frame identifiers | Historical coordinates
1                 | 2, 3, 8                 | (t1: (3, 5)), (t2: (6, 9)), ...
TABLE 1
Table 1 gives the historical coordinates of target 1 to be detected when it appears in video frames 2, 3 and 8, that is, the two-dimensional coordinate information corresponding to each time. The coordinates at time t1 may be (3, 5), and so on.
Step 102, the detection device determines the moving speed of each target to be detected according to the tracking linked list.
In this step, determining the moving speed of each target to be detected according to the tracking linked list includes:
and calculating the moving speed of each target to be detected in the preset time according to the historical coordinates of each target to be detected in the tracking table.
In specific implementation, the moving speed of a target to be detected may be calculated according to the following formula:

v[t] = √((x[t] − x[t−T])² + (y[t] − y[t−T])²) / T

wherein (x[t], y[t]) are the historical coordinates of the target to be detected in the tracking linked list at time t, (x[t−T], y[t−T]) are the historical coordinates of the target to be detected in the tracking linked list at time t−T, and T is the preset duration.
Taking Table 1 as an example, and assuming that the time interval between t1 and t2 is the preset duration T, the moving speed v[t2] of target 1 to be detected at time t2 is:

v[t2] = √((x[t2] − x[t1])² + (y[t2] − y[t1])²) / T

Assuming that the coordinates of target 1 to be detected at time t2 (13:45:38) are (6, 9) and its coordinates at time t1 (13:45:36) are (3, 5), the moving speed at time t2 is determined as 2.5 cm/s, where the displacement is in cm; in practical application, the unit of movement can be determined according to actual needs.
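The formula and the worked example above can be sketched as follows (an illustrative sketch; the function name is not from the patent):

```python
import math

def moving_speed(coord_now, coord_past, T):
    """v[t]: Euclidean displacement between two historical coordinates,
    divided by the preset duration T."""
    dx = coord_now[0] - coord_past[0]
    dy = coord_now[1] - coord_past[1]
    return math.hypot(dx, dy) / T

# Worked example from the text: (3, 5) at t1, (6, 9) at t2, T = 2 s
v = moving_speed((6, 9), (3, 5), 2.0)  # 2.5 cm/s
```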
In the embodiment of the present application, the moving speed of a detection target is first calculated at time T; before time T the moving speed is not calculated. T can be configured according to actual needs, and setting it smaller makes the speed calculation more accurate.
Step 103, the detecting device determines whether the moving speed is smaller than a first preset threshold, and determines a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, the moving speed of which is smaller than the first preset threshold, in the frame image.
In this step, according to a plurality of areas corresponding to a plurality of objects to be detected, the moving speed of which is smaller than a first preset threshold, in a frame of image, a static area of the frame of image is determined, including:
when K targets to be detected with the moving speed smaller than a first preset threshold exist in a frame image, taking the largest circumscribed rectangular frame corresponding to the union of the K targets to be detected in the frame image as a static area of the frame image, or directly taking the union of the K targets to be detected in the frame image as the static area of the frame image. Wherein K is greater than a fourth preset value, and the value of the fourth preset value is according to the actual requirement of the equipment, such as 5, 8, etc.
Referring to fig. 2, fig. 2 is a schematic view of a static area in an embodiment of the present application. In fig. 2, the circumscribed rectangular frame of the union of the target areas corresponding to 7 detection targets is taken as the static area; the area corresponding to each detection target is also marked by a rectangular frame, and an overlapping portion exists between the areas corresponding to detection targets 6 and 7.
In the embodiment of the present application, when there is a plurality of targets whose moving speed is smaller than the first preset threshold, the region corresponding to the union of the regions of these targets is taken as the static area, and only images containing a static area are taken as candidate gathering images for secondary identification; this prevents frequent secondary identification of person gathering images.
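The static-area rule of step 103 can be sketched as below (a non-authoritative illustration; the function name and the (x, y, w, h) box format are assumptions):

```python
def static_region(boxes, speeds, speed_thresh, k_min):
    """Return the maximal circumscribed rectangle of the union of all targets
    slower than speed_thresh, or None when at most k_min such targets exist."""
    slow = [b for b, v in zip(boxes, speeds) if v < speed_thresh]
    if len(slow) <= k_min:            # K must exceed the fourth preset value
        return None
    x1 = min(x for x, y, w, h in slow)
    y1 = min(y for x, y, w, h in slow)
    x2 = max(x + w for x, y, w, h in slow)
    y2 = max(y + h for x, y, w, h in slow)
    return (x1, y1, x2 - x1, y2 - y1)
```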
Step 104, the detection device takes the image with the static area in the video stream as a candidate gathering image; when the number N of the candidate gathering images reaches a second preset threshold, judging whether the candidate gathering images are the personnel gathering images or not through a preset personnel gathering prediction model, and counting the number M of the personnel gathering images according to a judgment result.
The personnel gathering prediction model in the embodiment of the present application is trained using a plurality of images of gathering events and images of non-gathering events as sample images. The specific training process is as follows:
The personnel gathering prediction model consists of a convolutional neural network model and a regression learning model. Given an input image, the convolutional neural network model outputs confidence levels for gathering and for non-gathering. A gathering confidence threshold is set in the regression learning model; the outputs of the convolutional neural network model, i.e. the gathering and non-gathering confidences, are fed into the regression learning model, which outputs whether the image is a person gathering image: when the gathering confidence is larger than the gathering confidence threshold, a person-gathering image identifier is output; otherwise, a non-person-gathering image identifier is output.
The aggregation confidence threshold may be set according to actual needs, which is not limited in the embodiments of the present application.
The convolutional neural network model is established as follows:
Images of A gathering events and images of B non-gathering events are taken as sample data and learned in a convolutional neural network, which thereby acquires the ability to identify gathering and non-gathering images; the convolutional neural network model is thus established. The convolutional neural network learns from the input data and the category labels. It can adopt a deep learning network such as GoogLeNet, ResNet, VGG or AlexNet.
The detection device continues to acquire candidate aggregate images when it is determined that the number N of candidate aggregate images does not reach a second preset threshold.
M and N are set according to the practical application environment and are not specifically limited; M is an integer no larger than N, and N is an integer greater than 0.
Given an input image, the trained preset personnel gathering prediction model outputs whether the image is a person gathering image.
Step 105, when it is determined that M is greater than the third preset threshold, the detection device determines that a person gathering event has occurred.
When M is not greater than the third preset threshold, the current candidate gathering images are cleared, and candidate gathering images are acquired again from the video stream acquired in real time.
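The counting logic of steps 104 and 105, including the clearing branch above, can be sketched as follows (the function name and the in-place list mutation are illustrative choices, not the patented implementation):

```python
def detect_gathering(candidate_labels, n_thresh, m_thresh):
    """candidate_labels: one boolean per candidate gathering image (True if the
    prediction model judged it a person gathering image).
    Returns True (event), False (cleared, re-acquire), or None (keep collecting)."""
    if len(candidate_labels) < n_thresh:
        return None                   # N has not reached the second threshold yet
    m = sum(1 for is_gathering in candidate_labels if is_gathering)
    if m > m_thresh:                  # M exceeds the third preset threshold
        return True
    candidate_labels.clear()          # discard candidates, re-acquire from the stream
    return False
```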
In this embodiment, when the detection device determines that a person gathering event occurs, the method further includes:
the detection device sends out alarm information, wherein the alarm information comprises a sign of occurrence of an aggregation event and a location area of occurrence of the aggregation event.
The gathering-event identifier shows that a gathering event is currently occurring; it can be implemented with characters, symbols, patterns such as red or yellow, and the like, without being limited to these implementations;
The location area where the gathering event occurs can be marked as the area in the image where the persons gather; the following expression is given in the specific implementation of the present application, without limitation:
the image with the highest confidence of aggregation is selected as the image to be displayed, and the static area in the image is displayed.
When an alarm is issued, an administrator can take corresponding early-warning measures according to the current actual environment, such as alerting, warning or dispersing;
A processing strategy can also be pre-configured on the equipment for the alarm information, i.e. a correspondence between alarm information and processing strategies is configured;
When an alarm occurs, its alarm information is matched against the configured alarm information; if the match succeeds, the alarm is handled with the processing strategy corresponding to that alarm information, e.g. alerting or dispersing by broadcast.
This timely alarm implementation can quickly notify the gathered persons and safeguard their safety.
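The alarm-to-strategy matching described above can be sketched as a lookup table (all keys and action names below are hypothetical examples, not from the patent):

```python
# Hypothetical pre-configured correspondence between alarm information and strategies
ALARM_STRATEGIES = {
    "gathering": ["broadcast_warning", "dispatch_guard"],
}

def handle_alarm(alarm_type, strategies=ALARM_STRATEGIES):
    """Match the raised alarm against the configured table; an unmatched alarm
    yields no pre-configured actions."""
    return strategies.get(alarm_type, [])
```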
Based on the same inventive concept, the application also provides a personnel gathering detection device. Referring to fig. 3, fig. 3 is a schematic structural diagram of an apparatus to which the above technology is applied in the embodiment of the present application. The device comprises: an acquisition unit 301, a first determination unit 302, a second determination unit 303, a third determination unit 304, a fourth determination unit 305, a statistics unit 306, and a fifth determination unit 307;
the acquisition unit 301 is configured to extract a target to be detected from the acquired video stream, and track the target to be detected to obtain a tracking linked list;
a first determining unit 302, configured to determine a moving speed of each target to be detected according to the tracking linked list acquired by the acquiring unit 301;
a second determining unit 303, configured to determine whether the moving speed determined by the first determining unit 302 is less than a first preset threshold, and determine a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, where the moving speed is less than the first preset threshold, in the frame image;
a third determining unit 304 configured to take, as a candidate aggregate image, an image in which the static area determined by the second determining unit 303 exists in the video stream; determining whether the number N of candidate aggregated images reaches a second preset threshold;
a fourth determining unit 305, configured to determine, when the third determining unit determines that the number N of candidate gathering images reaches the second preset threshold, whether the candidate gathering images are person gathering images through a preset personnel gathering prediction model;
a statistics unit 306, configured to count the number M of the person-gathering images determined by the fourth determination unit 305;
a fifth determining unit 307, configured to determine that a person gathering event occurs when it is determined that M counted by the counting unit 306 is greater than a third preset threshold.
Preferably,
the obtaining unit 301 is specifically configured to: acquire a foreground image of each frame image of the video stream; and track each target to be detected in the foreground image of each frame image to obtain the tracking linked list.
Preferably,
the first determining unit 302 is specifically configured to calculate, according to the historical coordinates of each target to be detected in the tracking linked list, the moving speed of each target within the preset duration; the tracking linked list comprises a mapping relation between the identifier of the target to be detected, the video frame identifiers of the video frames in which the target appears, and the historical coordinates of the target.
Preferably,
the first determining unit 302 is specifically configured to calculate the moving speed of each target to be detected within the preset duration according to the historical coordinates of each target in the tracking linked list, by calculating the moving speed v[t] of the target to be detected according to the following formula:

v[t] = √((x[t] − x[t−T])² + (y[t] − y[t−T])²) / T

wherein (x[t], y[t]) are the historical coordinates of the target to be detected in the tracking linked list at time t, (x[t−T], y[t−T]) are the historical coordinates of the target to be detected in the tracking linked list at time t−T, and T is the preset duration.
Preferably,
the fifth determining unit 307 is further configured to, when it is determined that M is not greater than the third preset threshold, clear the current candidate gathering images and trigger the acquiring unit 301 to acquire candidate gathering images again from the video stream acquired in real time.
Preferably,
the personnel aggregation prediction model is trained by taking a plurality of images of aggregation events and images of non-aggregation events as sample images.
The units of the above embodiments may be integrated or deployed separately; they may be combined into one unit or further split into a plurality of sub-units.
Further, a non-transitory computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the steps of the people gathering detection method is provided in embodiments of the present application.
In addition, an electronic device is provided that includes the non-transitory computer readable storage medium described above and a processor having access to the non-transitory computer readable storage medium.
In summary, by combining static-region identification and deep learning, the method and device of the present application perform secondary identification on multi-frame images of the video, fully utilize the process information of personnel gathering events, and improve the accuracy of personnel gathering detection.
In the embodiment of the present application, an offline-deep-learning, multi-frame secondary identification method is adopted, so that the process information of event occurrence is fully learned and event judgment is more accurate;
the deep learning method, combined with the auxiliary judgment of static-region identification, differs from judging with a basic-information method alone or a deep-learning method alone and can effectively reduce false alarms. Meanwhile, the position of the gathering area can be output, the output information is richer, and alarm post-processing is convenient.
The foregoing description of the preferred embodiments is not intended to limit the invention; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within its scope.

Claims (14)

1. A method of people gathering detection, the method comprising:
extracting a target to be detected from the acquired video stream, and tracking the target to be detected to obtain a tracking linked list;
determining the moving speed of each target to be detected according to the tracking linked list;
judging whether the moving speed is smaller than a first preset threshold value or not, and determining a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, the moving speed of which is smaller than the first preset threshold value, in the frame image;
taking an image with a static area in the video stream as a candidate gathering image; when the number N of the candidate gathering images reaches a second preset threshold, judging whether the candidate gathering images are the personnel gathering images or not through a preset personnel gathering prediction model, and counting the number M of the personnel gathering images according to a judgment result;
and when M is determined to be larger than a third preset threshold value, determining that a personnel gathering event occurs.
2. The method according to claim 1, wherein extracting the target to be detected from the collected video stream and tracking the target to be detected to obtain a tracking linked list comprises:
acquiring a foreground image of each frame image of a video stream;
and tracking each target to be detected in the foreground image of each frame image to obtain a tracking linked list.
3. The method of claim 1, wherein the tracking linked list includes a mapping relationship among the identification of the target to be detected, the video frame identification of the video frame in which the target to be detected is located, and the historical coordinates of the target to be detected;
the determining the moving speed of each target to be detected according to the tracking linked list comprises the following steps:
and calculating the moving speed of each target to be detected in the preset time according to the historical coordinates of each target to be detected in the tracking linked list.
4. The method of claim 3, wherein calculating the moving speed of each target to be detected in the preset duration according to the historical coordinates of each target to be detected in the tracking linked list comprises:
calculating the moving speed v[t] of the target to be detected according to the following formula:

v[t] = √((x[t] − x[t−T])² + (y[t] − y[t−T])²) / T

wherein (x[t], y[t]) is the historical coordinate of the target to be detected in the tracking linked list at time t, (x[t−T], y[t−T]) is the historical coordinate of the target to be detected in the tracking linked list at time t−T, and T is the preset duration.
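Under the assumption, reconstructed from the coordinate definitions above, that v[t] is the Euclidean displacement over the window divided by the preset duration T, the claim-4 formula can be computed as follows (the `track` mapping and function name are illustrative):

```python
import math

def moving_speed(track, t, T):
    # `track` maps a time to the (x, y) historical coordinate of one
    # target in the tracking linked list; names are illustrative.
    x_t, y_t = track[t]          # position at time t
    x_p, y_p = track[t - T]      # position at time t - T
    # Euclidean displacement over the preset duration T, divided by T
    return math.hypot(x_t - x_p, y_t - y_p) / T
```

For example, a target moving from (0, 0) to (30, 40) over T = 10 time units has speed 5.0.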
5. The method according to claim 1, wherein the method further comprises:
and when it is determined that M is not greater than the third preset threshold, clearing the current candidate gathering images and re-acquiring candidate gathering images from the video stream collected in real time.
6. The method of any of claims 1-5, wherein the people gathering prediction model is trained with a plurality of images of gathering events and images of non-gathering events as sample images.
7. A people gathering detection device, the device comprising:
the acquisition unit is used for extracting a target to be detected from the acquired video stream, tracking the target to be detected and obtaining a tracking linked list;
the first determining unit is used for determining the moving speed of each target to be detected according to the tracking linked list acquired by the acquiring unit;
the second determining unit is used for judging whether the moving speed determined by the first determining unit is smaller than a first preset threshold value or not, and determining a static area of a frame image according to a plurality of areas corresponding to a plurality of targets to be detected, the moving speed of which is smaller than the first preset threshold value, in the frame image;
a third determining unit configured to take, as a candidate gathering image, an image in the video stream in which the static area determined by the second determining unit exists, and to determine whether the number N of candidate gathering images reaches a second preset threshold;
a fourth determining unit, configured to judge, when the number N of candidate gathering images determined by the third determining unit reaches the second preset threshold, whether each candidate gathering image is a person gathering image according to a preset person gathering prediction model;
a statistics unit, configured to count the number M of person gathering images according to the judgment results of the fourth determining unit;
and a fifth determining unit, configured to determine that a person gathering event occurs when it is determined that M counted by the statistics unit is greater than a third preset threshold.
8. The apparatus of claim 7, wherein
the acquisition unit is specifically configured to: acquire a foreground image of each frame image of the video stream; and track each target to be detected in the foreground image of each frame image to obtain the tracking linked list.
9. The apparatus of claim 7, wherein
the first determining unit is specifically configured to calculate the moving speed of each target to be detected within a preset duration according to the historical coordinates of each target to be detected in the tracking linked list; the tracking linked list includes a mapping relationship among the identification of the target to be detected, the video frame identification of the video frame in which the target to be detected is located, and the historical coordinates of the target to be detected.
10. The apparatus of claim 9, wherein
the first determining unit is specifically configured to calculate the moving speed v[t] of each target to be detected within the preset duration according to the historical coordinates of each target to be detected in the tracking linked list, using the following formula:

v[t] = √((x[t] − x[t−T])² + (y[t] − y[t−T])²) / T

wherein (x[t], y[t]) is the historical coordinate of the target to be detected in the tracking linked list at time t, (x[t−T], y[t−T]) is the historical coordinate of the target to be detected in the tracking linked list at time t−T, and T is the preset duration.
11. The apparatus of claim 7, wherein
the fifth determining unit is further configured to, when it is determined that M is not greater than the third preset threshold, clear the current candidate gathering images and trigger the acquisition unit to re-acquire candidate gathering images from the video stream collected in real time.
12. The apparatus of any of claims 7-11, wherein the people gathering prediction model is trained with a plurality of images of gathering events and images of non-gathering events as sample images.
13. A non-transitory computer readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the steps of the people gathering detection method as recited in any one of claims 1 to 6.
14. An electronic device comprising the non-transitory computer-readable storage medium of claim 13, and a processor having access to the non-transitory computer-readable storage medium.
CN201811523516.9A 2018-12-13 2018-12-13 Personnel gathering detection method and device Active CN111325048B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811523516.9A CN111325048B (en) 2018-12-13 2018-12-13 Personnel gathering detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811523516.9A CN111325048B (en) 2018-12-13 2018-12-13 Personnel gathering detection method and device

Publications (2)

Publication Number Publication Date
CN111325048A CN111325048A (en) 2020-06-23
CN111325048B (en) 2023-05-26

Family

ID=71166516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811523516.9A Active CN111325048B (en) 2018-12-13 2018-12-13 Personnel gathering detection method and device

Country Status (1)

Country Link
CN (1) CN111325048B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111985385B (en) * 2020-08-14 2023-08-29 杭州海康威视数字技术股份有限公司 Behavior detection method, device and equipment
CN112270671B (en) * 2020-11-10 2023-06-02 杭州海康威视数字技术股份有限公司 Image detection method, device, electronic equipment and storage medium
CN113536932A (en) * 2021-06-16 2021-10-22 中科曙光国际信息产业有限公司 Crowd gathering prediction method and device, computer equipment and storage medium
CN113837034A (en) * 2021-09-08 2021-12-24 云从科技集团股份有限公司 Aggregated population monitoring method, device and computer storage medium
CN114494350B (en) * 2022-01-28 2022-10-14 北京中电兴发科技有限公司 Personnel gathering detection method and device
CN117079192B (en) * 2023-10-12 2024-01-02 东莞先知大数据有限公司 Method, device, equipment and medium for estimating number of rope skipping when personnel are shielded

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101325690A (en) * 2007-06-12 2008-12-17 上海正电科技发展有限公司 Method and system for detecting human flow analysis and crowd accumulation process of monitoring video flow
CN102164270A (en) * 2011-01-24 2011-08-24 浙江工业大学 Intelligent video monitoring method and system capable of exploring abnormal events
CN103473791A (en) * 2013-09-10 2013-12-25 惠州学院 Method for automatically recognizing abnormal velocity event in surveillance video
CN103839065A (en) * 2014-02-14 2014-06-04 南京航空航天大学 Extraction method for dynamic crowd gathering characteristics
WO2016014724A1 (en) * 2014-07-23 2016-01-28 Gopro, Inc. Scene and activity identification in video summary generation
CN105447458A (en) * 2015-11-17 2016-03-30 深圳市商汤科技有限公司 Large scale crowd video analysis system and method thereof
WO2018133666A1 (en) * 2017-01-17 2018-07-26 腾讯科技(深圳)有限公司 Method and apparatus for tracking video target
CN108810616A (en) * 2018-05-31 2018-11-13 广州虎牙信息科技有限公司 Object localization method, image display method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489567B2 (en) * 2011-04-11 2016-11-08 Intel Corporation Tracking and recognition of faces using selected region classification
US9087386B2 (en) * 2012-11-30 2015-07-21 Vidsys, Inc. Tracking people and objects using multiple live and recorded surveillance camera video feeds
JP6276519B2 (en) * 2013-05-22 2018-02-07 株式会社 日立産業制御ソリューションズ Person counting device and human flow line analyzing device
CN105872477B (en) * 2016-05-27 2018-11-23 北京旷视科技有限公司 video monitoring method and video monitoring system


Also Published As

Publication number Publication date
CN111325048A (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN111325048B (en) Personnel gathering detection method and device
CN108062349B (en) Video monitoring method and system based on video structured data and deep learning
CN106650620B (en) A kind of target person identification method for tracing using unmanned plane monitoring
CN111325089B (en) Method and apparatus for tracking object
CN104008371B (en) Regional suspicious target tracking and recognizing method based on multiple cameras
CN111091098B (en) Training method of detection model, detection method and related device
CN112084963B (en) Monitoring early warning method, system and storage medium
CN103986910A (en) Method and system for passenger flow statistics based on cameras with intelligent analysis function
CN108986472A (en) One kind turns around vehicle monitoring method and device
CN112149513A (en) Industrial manufacturing site safety helmet wearing identification system and method based on deep learning
CN102855508B (en) Opening type campus anti-following system
CN111027370A (en) Multi-target tracking and behavior analysis detection method
KR101472674B1 (en) Method and apparatus for video surveillance based on detecting abnormal behavior using extraction of trajectories from crowd in images
CN106981150A (en) A kind of supermarket's intelligent anti-theft system and method
CN112232211A (en) Intelligent video monitoring system based on deep learning
CN112287823A (en) Facial mask identification method based on video monitoring
CN114359976B (en) Intelligent security method and device based on person identification
CN113887445A (en) Method and system for identifying standing and loitering behaviors in video
CN108830204B (en) Method for detecting abnormality in target-oriented surveillance video
CN111489380B (en) Target object track analysis method
CN114648748A (en) Motor vehicle illegal parking intelligent identification method and system based on deep learning
CN109831634A (en) The density information of target object determines method and device
CN115661735A (en) Target detection method and device and computer readable storage medium
CN116311166A (en) Traffic obstacle recognition method and device and electronic equipment
CN110580708B (en) Rapid movement detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant