CN115880648A - Crowd gathering identification method and system under angle of unmanned aerial vehicle and application of crowd gathering identification system - Google Patents


Info

Publication number
CN115880648A
Authority
CN
China
Prior art keywords
crowd
center point
training
clustering
detected
Prior art date
Legal status
Granted
Application number
CN202310216569.0A
Other languages
Chinese (zh)
Other versions
CN115880648B (en)
Inventor
毛云青
王国梁
韩致远
陈思瑶
葛俊
Current Assignee
CCI China Co Ltd
Original Assignee
CCI China Co Ltd
Priority date
Filing date
Publication date
Application filed by CCI China Co Ltd filed Critical CCI China Co Ltd
Priority to CN202310216569.0A
Publication of CN115880648A
Application granted
Publication of CN115880648B
Status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a crowd gathering identification method and system under an unmanned aerial vehicle angle, and an application thereof. The method comprises the following steps: S00, collecting multiple frames of target images as training image data, and labeling the position of each human head in the training images to obtain head position labels; S10, clustering the crowd head position labels to generate a cluster center point in each frame of training image as label data; S20, training with the label data and the training image data to obtain a crowd center point algorithm network; S30, estimating the cluster center points of the crowd in the video image frames to be detected through the crowd center point algorithm network; and S40, calculating the coordinate differences of the cluster center points across different video image frames to be detected, then classifying and matching them to obtain the trajectory and movement direction of the crowd. Because it can provide better crowd behavior analysis for preventing disaster events, the application is highly practical.

Description

Crowd gathering identification method and system under angle of unmanned aerial vehicle and application of crowd gathering identification system
Technical Field
The application relates to the technical field of image recognition, in particular to a crowd gathering recognition method and system under an unmanned aerial vehicle angle and application thereof.
Background
Crowd gatherings such as parades, festival celebrations, and concerts pose challenges to urban safety management, and gatherings that lack management and control may cause serious adverse effects. Automated crowd analysis methods typically include crowd counting and the associated crowd density estimation, with the aim of preventing disasters caused by crowding, such as stampedes. One cost-effective way to monitor crowds is to use Unmanned Aerial Vehicles (UAVs, i.e. drones), which are now in common use. Equipped with a camera and a GPU device, a drone becomes a flying computer vision platform that can be rapidly deployed on site to perform crowd gathering analysis.
However, drone-based crowd surveillance also has some disadvantages. On the one hand, computer vision algorithms face more challenges from a high-altitude perspective, which differs completely from ground-based fixed monitoring; moreover, the number of gathered people can be very large, which greatly affects the analysis. On the other hand, the image algorithms generally applied in the field of computer vision cannot meet the strict real-time requirements imposed by a drone. In other words, it is crucial to choose a lightweight model that strikes a good balance between usability and efficiency.
Currently in the field of computer vision there are many methods for crowd counting and crowd density estimation, with density estimation being the more common choice. Another widely adopted scheme detects the human body or head with a target detection algorithm via a sliding window over the image; however, even the most advanced object detectors, such as the YOLO series, perform poorly when applied to very dense small objects. To address this, researchers introduced regression-based algorithms that directly learn the mapping from images to crowd counts, for example by learning density maps. While effective, these place high demands on computing devices and do not meet the strict constraints a drone usually imposes (limited battery capacity and the need to react in real time). How to fine-tune the deep neural architecture to achieve the best balance between accuracy and performance is a core issue. For example, in the VisDrone2022 challenge the TransCrowd algorithm achieved the highest accuracy, but because it is based on a Vision Transformer its demand for computing resources is high, making it essentially impossible to deploy on a drone; moreover, it only regresses the number of people and does not provide a density map, which is useful for detecting people flow.
In summary, although there have been some achievements in crowd analysis with drones, the methods proposed so far leave great room for improvement. A method, system, and application for identifying crowd gathering under an unmanned aerial vehicle angle that can solve the above problems is therefore urgently needed.
Disclosure of Invention
The embodiment of the application provides a crowd gathering identification method and system under an unmanned aerial vehicle angle, and an application thereof, aiming to solve the problems that existing technology places high demands on drone performance and has many shortcomings.
The core technology of the invention is mainly based on an FCN model. The model estimates a center point density map for consecutive frames of the same video sequence shot by a drone, then calculates the displacement of the detected center points between frames, from which the direction of crowd movement can be determined. The method combines density estimation with clustering and does not track the motion of individual persons within the crowd.
In a first aspect, the application provides a crowd gathering identification method under an unmanned plane angle, the method comprising the following steps:
s00, collecting multiple frames of target images as training image data, and labeling the position of each human head in the training images to obtain a human head position label;
s10, clustering the labels of the head positions of the crowd to generate a clustering center point in each frame of training image as label data;
s20, training by using the label data and the training image data to obtain a crowd center point algorithm network;
s30, estimating the clustering center point of the crowd of the video image frame to be detected through a crowd center point algorithm network;
and S40, calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
Further, in the step S10, a cluster center point of the crowd head position label is obtained through a Mean Shift clustering algorithm.
Further, in the step S20, the crowd center point algorithm network uses a fully convolutional neural network, replacing the fully connected layer with a convolutional layer.
Further, in step S20, the fully convolutional neural network includes an encoder that reduces the input image to a low-dimensional latent representation and a decoder that upsamples the latent representation to the required output resolution; each encoder block comprises a convolutional layer, a batch normalization layer, and a max pooling layer, and each decoder block comprises a transposed convolutional layer and a batch normalization layer.
Further, in step S40, before calculating the coordinate difference of the clustering center points in different video image frames to be tested, the background of each video image to be tested is filtered out by performing noise reduction processing.
Further, in the step S40, the coordinate difference is obtained by calculating the euclidean distance between each pair of cluster center points of each frame of video image to be detected and matching the minimum distance between the pair of center points.
Further, in step S40, the movement directions are classified into at least east, west, south, north, northeast, northwest, southeast, and southwest.
In a second aspect, the present application provides a crowd gathering identification system under unmanned aerial vehicle angle, include:
the acquisition module is used for acquiring multi-frame target images as training image data and marking the position of each human head in the training images to obtain a human head position label; inputting a video image to be detected;
the clustering module is used for clustering the labels of the positions of the heads of the crowd to generate a clustering center point in each frame of training image as label data;
the training module is used for training by using the label data and the training image data to obtain a crowd center point algorithm network;
the computing module is used for estimating the clustering center point of the crowd of the video image frame to be detected through a crowd center point algorithm network;
and the output module is used for calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
In a third aspect, the present application provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the computer program to perform the crowd gathering identification method under the unmanned aerial vehicle angle.
In a fourth aspect, the present application provides a readable storage medium having stored thereon a computer program comprising program code for controlling a process to execute a process comprising the method of crowd identification from the perspective of a drone according to the above.
The main contributions and innovation points of the invention are as follows: 1. compared with the prior art, this application does not merely consider crowd counting and density estimation in static image frames, but detects crowd flow by analyzing video, since our goal is not only to identify the presence of people in a single high-altitude scene but also to determine how the crowd is moving. This differs from person tracking algorithms (which track one person or a group of persons), so the application is more practical: it can provide better crowd behavior analysis for preventing disaster events and help various safety management applications in smart city scenarios;
2. compared with the prior art, the method and the device have the advantages that a good effect is achieved on crowd density estimation and clustering in videos shot by the unmanned aerial vehicle, and particularly from the perspective of calculation cost, the efficiency is high;
3. compared with the prior art, the method and the device have the advantage that the size of the data set is increased by integrating available scenes and collaborative data, so that the robustness of the deep learning model can be further improved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below; other features, objects, and advantages of the application will become apparent from the description, the drawings, and the claims.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flow of a crowd gathering identification method in an unmanned aerial vehicle perspective according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a crowd center point algorithm network according to an embodiment of the application;
figs. 3-5 are example diagrams of center points generated by clustering in various situations according to embodiments of the application;
fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
While crowd analysis with drones has achieved some success, the methods proposed at present still leave much room for improvement.
Based on this, the invention solves the problems existing in the prior art based on a full convolution network. The network identifies image information in each frame and learns crowd density estimation and crowd clustering simultaneously. Thus, people can be identified by their central points, and their movement can be tracked by tracking the central points to identify the trajectories of people.
Example one
The application aims to provide a crowd gathering identification method under an unmanned aerial vehicle angle. Specifically, referring to fig. 1, the method includes the following steps:
s00, collecting multiple frames of target images as training image data, and labeling the position of each human head in the training images to obtain a human head position label;
s10, clustering the labels of the positions of the heads of the crowd to generate a clustering center point in each frame of training image as label data;
The data sets normally provided do not contain a center point density map for the video as label data; they typically provide labels of where the human body or head is located. The generation of the C_GT(i) label data from these labels is described below.
Since it is assumed that the original video taken by the drone carries no labels for the positions of crowd center points, the Mean Shift clustering algorithm is first applied to the above head position labels to obtain the center points. Mean Shift is a centroid-based algorithm that updates each candidate center point to the mean of the points within a certain region. In a post-processing step, these candidates are filtered to eliminate near duplicates. In particular, if a center point at pixel x_i is represented as a delta function δ(x - x_i), convolving it with a Gaussian kernel yields a "center point density map" C(x); the ground-truth center point density map C_GT(i) is:
$C_{GT}(x) = \sum_{i=1}^{K} \delta(x - x_i) * G_{\sigma}(x)$
Wherein K is the number of center points; δ(x - x_i) equals 1 at x = x_i and 0 otherwise; and G_σ is a Gaussian kernel. Since the crowd sizes in each video sequence are similar, σ = 10 is set empirically, because this value produces a sufficiently large "confidence" activation region around each center point of the generated density map and thus better performance. Mean Shift is used to find the center points because the number of clusters in the crowd cannot be known in advance, so an algorithm that does not require the number of clusters to be specified beforehand is needed. In this way an unsupervised method (Mean Shift) locates the crowd center points, and once their positions are found they serve as generated labels to guide the supervised learning that follows (the later model training). While this strategy depends largely on the particular clustering algorithm chosen, it can automatically generate label data and allows quantitative evaluation of the clustering task performed by the neural network.
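The label-generation step above can be sketched as follows. This is a minimal NumPy illustration, not the patent's exact implementation: the flat-kernel Mean Shift, the bandwidth, and the duplicate-merging rule are assumptions, and a production pipeline would more likely use a library implementation such as sklearn.cluster.MeanShift.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50):
    """Naive flat-kernel Mean Shift: repeatedly move each candidate center
    to the mean of the labelled points within `bandwidth`, then filter the
    converged modes to eliminate near duplicates (the post-processing step)."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            nbrs = points[np.linalg.norm(points - modes[i], axis=1) <= bandwidth]
            modes[i] = nbrs.mean(axis=0)
    centers = []
    for m in modes:  # merge modes that converged to (nearly) the same spot
        if all(np.linalg.norm(m - c) > bandwidth / 2 for c in centers):
            centers.append(m)
    return np.array(centers)

def center_point_density_map(centers, h, w, sigma=10.0):
    """Place a delta at each cluster center and convolve with a Gaussian
    kernel G_sigma, giving the 'center point density map' C_GT."""
    yy, xx = np.mgrid[0:h, 0:w]
    c_map = np.zeros((h, w))
    for cx, cy in centers:
        c_map += np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
    return c_map

# two well-separated groups of head-position labels
heads = np.array([[10.0, 10], [12, 11], [11, 9], [60, 60], [62, 61], [61, 59]])
centers = mean_shift(heads, bandwidth=10.0)
print(len(centers))  # 2
```

Each group of head labels collapses to its centroid, and the density map peaks at those two centroids, mirroring the σ = 10 setting described above.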
S20, training by using the label data and the training image data to obtain a crowd center point algorithm network;
Previously, identification of heat map center points was completed by a classical clustering algorithm after the crowd density map had been generated; such a step-by-step approach slows the inference speed of the whole algorithm. In this embodiment, the task of finding the crowd center point is integrated directly into network training. As shown in fig. 2, the application employs a fully convolutional neural network (FCN), which performs only convolution and pooling operations and contains no fully connected layer. A fully connected layer is usually an integral part of a typical CNN. The method replaces it with a convolutional layer with a 1 × 1 kernel and stride 1, which on the one hand gives the model fewer parameters and on the other allows it to accept an image of any size as input. Nevertheless, for better evaluation, the input resolution of each frame is fixed to 640 × 512 × 3.
The structure of the FCN consists of two parts: an Encoder module that reduces the input to a low-dimensional latent representation, and a Decoder module that upsamples the latent representation to the desired output resolution. The application designs an FCN that is as simple as possible. As shown in fig. 2, a scaling layer first normalizes each pixel value to the range [0, 1]. The encoder part of the model then applies four blocks, each comprising a convolutional layer with a 3 × 3 kernel, batch normalization, and 2 × 2 max pooling, until the latent space Z ∈ R^{40 × 32 × 128} is reached. The decoder upsamples the feature map by transposed convolution and batch normalization. Finally, a 1 × 1 convolutional layer produces an output density map of size 640 × 512 × 1. The commonly used ReLU is chosen as the activation function.
Wherein each encoder block comprises a convolutional layer + a batch normalization layer + a max pooling layer, and each decoder layer comprises a transposed convolutional layer + a batch normalization layer.
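As a sanity check on the resolutions quoted above, the following sketch traces tensor shapes through the described encoder/decoder. The intermediate channel widths are illustrative assumptions; the text only fixes the input (640 × 512 × 3), the latent space (40 × 32 × 128), and the output (640 × 512 × 1).

```python
def fcn_shape_trace(h=640, w=512, c_in=3, enc_channels=(16, 32, 64, 128)):
    """Trace tensor shapes through the FCN described above: each encoder
    block (3x3 'same' conv + 2x2 max pool) halves the spatial resolution;
    each decoder block (stride-2 transposed conv) doubles it; a final 1x1
    conv produces the single-channel density map."""
    shapes = [(h, w, c_in)]
    for c in enc_channels:                 # encoder: four blocks
        h, w = h // 2, w // 2
        shapes.append((h, w, c))
    for c in reversed(enc_channels[:-1]):  # decoder: upsample back
        h, w = h * 2, w * 2
        shapes.append((h, w, c))
    h, w = h * 2, w * 2
    shapes.append((h, w, 1))               # final 1x1 conv -> density map
    return shapes

trace = fcn_shape_trace()
print(trace[0], trace[4], trace[-1])
# (640, 512, 3) (40, 32, 128) (640, 512, 1)
```

Four halvings take 640 × 512 down to 40 × 32 (a factor of 2^4 = 16 per axis), matching the stated latent space exactly.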
In this embodiment, the network of the present application directly estimates the center point density map through training and uses the following loss function, minimizing the mean squared error between the prediction and the ground-truth center point density map:
$L = \frac{1}{2N} \sum_{i=1}^{N} \left\| C_P(i) - C_{GT}(i) \right\|_2^2$

where N is the number of samples, C_P(i) and C_GT(i) are the predicted and ground-truth center point density maps respectively, and the operator $\|\cdot\|_2$ denotes the Euclidean distance.
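A direct NumPy transcription of this loss might look as follows; the 1/2N averaging factor is a common convention for MSE-style crowd-counting losses and is assumed here.

```python
import numpy as np

def center_map_loss(pred_maps, gt_maps):
    """Mean squared error between predicted (C_P) and ground-truth (C_GT)
    center point density maps, averaged over the N training samples."""
    n = len(pred_maps)
    return sum(np.sum((p - g) ** 2) for p, g in zip(pred_maps, gt_maps)) / (2 * n)

# a perfect prediction gives zero loss
gt = [np.zeros((4, 4))]
print(center_map_loss(gt, gt))  # 0.0
```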
S30, estimating the clustering center point of the crowd of the video image frame to be detected through a crowd center point algorithm network;
In this embodiment, as shown in figs. 3-5 (the left image is the video image frame to be detected, the right image the cluster center points), schematic diagrams of cluster-generated center points in several situations are shown. Fig. 5 contains only two points because the left side of the image forms one cluster and the right side another; there are more people in the upper-left corner, so the left cluster center is biased toward the upper left while the right cluster center lies nearer the middle.
And S40, calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
In this embodiment, noise may be present because the density map predicted by the FCN contains non-normalized values. For example, in the ground truth the background pixels have value 0, but there is no guarantee that their values remain 0 in the predicted heat map. To counter this effect on performance, min-max normalization is first applied to limit the range to [0, 1]; the pixel values are then thresholded (with an empirically chosen threshold τ = 10) so that any value below the threshold is treated as background. A higher threshold filters out the background, leaving essentially only the regions where the model is more confident that people are present.
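The normalization-plus-threshold step can be sketched as below. Note a detail left implicit in the text: a threshold of τ = 10 cannot apply directly to values in [0, 1], so this sketch assumes the normalized map is rescaled to [0, 255] before thresholding.

```python
import numpy as np

def filter_background(raw_map, tau=10):
    """Post-process a raw predicted density map: min-max normalise to
    [0, 1], rescale (assumed) to [0, 255], then zero out every pixel below
    the empirical threshold tau so only confident regions survive."""
    lo, hi = raw_map.min(), raw_map.max()
    norm = (raw_map - lo) / (hi - lo + 1e-12)  # -> [0, 1]
    scaled = norm * 255.0                      # assumed rescale
    scaled[scaled < tau] = 0.0                 # background suppressed
    return scaled

raw = np.array([[0.0, 5.0], [50.0, 100.0]])
out = filter_background(raw)
print(out[0, 0], round(out[1, 1]))  # 0.0 255
```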
After the center points are identified and the image is filtered with the empirical threshold τ, the coordinate differences of the predicted center points in different frames are calculated in order to obtain the actual displacement of the identified crowd and determine its direction of movement. In the end frame at time t_k, the displacement of each center point is computed relative to the corresponding center point detected in the starting frame at time t_0. Since the computation takes place on a two-dimensional plane, this displacement is simply the Euclidean distance between the (x, y) coordinates. It is assumed that a given group of people (i.e., a center point) may move between consecutive frames, but that its new center point lies closer to the original center point of the previous frame than all other center points in the current frame do.
Therefore, to determine whether newly predicted center points can be associated with existing ones and thereby establish their direction of movement, the Euclidean distance between each pair of center points in the two frames is calculated and pairs are matched by minimum distance. Each movement of the crowd can then be classified into several typical directions, i.e. east/west/south/north, plus the intermediate directions northeast, northwest, southeast, and southwest.
Let P_i be the set of center points predicted at time i; three scenarios can be distinguished:
• |P_0| = |P_k|: each center point in P_0 is associated with the nearest center point in P_k.
• |P_0| > |P_k|: there are fewer center points in the end frame. Let d = |P_0| - |P_k| denote the difference between t_0 and t_k; this may indicate a network prediction error, clusters present at t_0 merging by t_k, or a center point leaving the field of view at t_k.
• |P_0| < |P_k|: there are fewer center points in the starting frame, so at time t_k some center points remain unmatched. This may indicate a network prediction error, a new cluster forming at t_k, or a new center point entering the field of view.
Here P_0 denotes the set of center points of the first image frame to be detected and P_k the set of center points predicted at time t_k. This matching is mainly used for tracking the crowd: from time 0 to t_k, two associated center points indicate the direction of travel of that group of people.
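The matching and direction classification described above can be sketched as follows. The greedy nearest-neighbour matching and the 8-way angle binning are plausible readings of the text, not a verbatim implementation; image coordinates are assumed to grow eastward in x and southward in y.

```python
import math

DIRECTIONS = ["east", "northeast", "north", "northwest",
              "west", "southwest", "south", "southeast"]

def match_centers(p0, pk):
    """Greedily associate each center point in the start frame P_0 with its
    nearest unmatched center point in the end frame P_k (Euclidean
    distance); surplus points on either side stay unmatched."""
    pairs, used = [], set()
    for a in p0:
        best, best_d = None, float("inf")
        for j, b in enumerate(pk):
            if j in used:
                continue
            d = math.dist(a, b)
            if d < best_d:
                best, best_d = j, d
        if best is not None:
            used.add(best)
            pairs.append((a, pk[best]))
    return pairs

def direction(a, b):
    """Classify the displacement a -> b into one of 8 compass directions
    (y is negated because image y grows downward, i.e. southward)."""
    angle = math.degrees(math.atan2(-(b[1] - a[1]), b[0] - a[0])) % 360
    return DIRECTIONS[int(((angle + 22.5) % 360) // 45)]

pairs = match_centers([(0, 0), (100, 100)], [(10, 0), (100, 90)])
print([direction(a, b) for a, b in pairs])  # ['east', 'north']
```

With two groups, one shifted 10 px right and one 10 px up, the sketch reports east and north movement, which is the per-cluster direction signal the method uses.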
In summary, the method proposed in this application is based on an FCN model that estimates a "center point density map" (a heat map highlighting the crowd center points) for consecutive frames of the same video sequence taken by a drone. Since a crowd can be viewed as groups of people that do not necessarily move in the same direction, the predicted center points represent these groups. The displacement of the detected center points between frames is then calculated, from which the direction of crowd movement is determined. The method combines density estimation with clustering and does not track the motion of individual persons within the crowd, the motivation being the complexity and computational cost such tracking would entail.
While crowd density maps based on individual locations may retain more information, the cost of tracking each individual's motion trajectory in a scene is in fact enormous. Furthermore, individual tracking is not only impractical but also unimportant: in a crowd management scenario what matters is identifying the overall flow of people rather than the precise location of each person. The method of this application therefore focuses only on high-density areas, i.e. those corresponding to concentrated people flow, and is inherently robust against occlusion, which improves the effect of the people flow detector, especially when viewed from high altitude.
Example two
Based on the same design, this application has still provided a crowd's gathering identification system under unmanned aerial vehicle angle, includes:
the acquisition module is used for acquiring multi-frame target images as training image data and marking the position of each human head in the training images to obtain a human head position label; inputting a video image to be detected;
the clustering module is used for clustering the labels of the head positions of the crowd to generate clustering center points in each frame of training image as label data;
the training module is used for training by using the label data and the training image data to obtain a crowd center point algorithm network;
the computing module is used for estimating the clustering center point of the crowd of the video image frame to be detected through a crowd center point algorithm network;
and the output module is used for calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
EXAMPLE III
The present embodiment also provides an electronic device, referring to fig. 6, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to execute the computer program to perform the steps in any of the above method embodiments.
Specifically, the processor 402 may include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
Memory 404 may include mass storage for data or instructions. By way of example and not limitation, memory 404 may include a hard disk drive (HDD), a floppy disk drive, a solid state drive (SSD), flash memory, an optical disk, a magneto-optical disk, tape, a Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 404 may include removable or non-removable (or fixed) media, where appropriate, and may be internal or external to the data processing apparatus. In a particular embodiment, the memory 404 is non-volatile. In particular embodiments, memory 404 includes read-only memory (ROM) and random access memory (RAM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically rewritable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate. The RAM may be static random-access memory (SRAM) or dynamic random-access memory (DRAM), where the DRAM may be fast page mode DRAM (FPMDRAM), extended data output DRAM (EDODRAM), synchronous DRAM (SDRAM), or the like.
Memory 404 may be used to store or cache various data files for processing and/or communication use, as well as possibly computer program instructions for execution by processor 402.
The processor 402 reads and executes computer program instructions stored in the memory 404 to implement the crowd gathering identification method at any drone angle in the above embodiments.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.
The transmitting device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include wired or wireless networks provided by communication providers of the electronic devices. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmitting device 406 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The input and output devices 408 are used to input or output information. In this embodiment, the input information may be a video to be detected, and the output information may be a crowd track, a moving direction, and the like.
Example four
The present embodiment also provides a readable storage medium having stored thereon a computer program comprising program code for controlling a process to execute a process comprising the crowd gathering identification method from the perspective of a drone according to the first embodiment.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiment and optional implementation manners, and details of this embodiment are not described herein again.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, for example in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products), including software routines, applets and/or macros, can be stored in any device-readable data storage medium, and they include program instructions for performing particular tasks. The computer program product may include one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard, it should be noted that any block of the logic flow as in the figures may represent a program step, or interconnected logic circuits, blocks and functions, or a combination of a program step and logic circuits, blocks and functions. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as DVDs, CDs, and their data variants. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.
The above examples merely illustrate several embodiments of the present application, and their description is specific and detailed, but should not therefore be construed as limiting the scope of the application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A crowd gathering identification method under an unmanned aerial vehicle angle, characterized by comprising the following steps:
s00, collecting multiple frames of target images as training image data, and labeling the position of each human head in the training images to obtain a human head position label;
s10, clustering the labels of the positions of the heads of the crowd to generate a clustering center point in each frame of training image as label data;
s20, training by using the label data and the training image data to obtain a crowd center point algorithm network;
s30, estimating the clustering center point of the crowd of the video image frame to be detected through the crowd center point algorithm network;
and S40, calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
2. The crowd gathering identification method under the unmanned aerial vehicle angle according to claim 1, wherein in the step S10, the clustering center points of the crowd head position labels are obtained through a Mean Shift clustering algorithm.
3. The crowd gathering identification method under the unmanned aerial vehicle angle according to claim 1, wherein in the step S20, the crowd center point algorithm network uses a fully convolutional neural network, in which the fully connected layers are replaced with convolutional layers.
4. The crowd gathering identification method under the unmanned aerial vehicle angle according to claim 3, wherein in the step S20, the fully convolutional neural network comprises encoders for reducing an input image to a low-dimensional latent representation and decoders for upsampling the latent representation to the required output resolution; each encoder comprises a convolutional layer, a batch normalization layer and a max pooling layer, and each decoder comprises a transposed convolutional layer and a batch normalization layer.
5. The crowd gathering identification method under the unmanned aerial vehicle angle according to claim 1, wherein in the step S40, before calculating the coordinate difference of the clustering center points in different video image frames to be detected, each video image frame to be detected is subjected to noise reduction processing to filter out the background.
6. The crowd gathering identification method under the unmanned aerial vehicle angle according to claim 5, wherein in the step S40, the Euclidean distance between each pair of clustering center points of each video image frame to be detected is calculated, and the coordinate difference is obtained by matching the center point pairs with the minimum distance.
7. The crowd gathering identification method under the unmanned aerial vehicle angle according to any one of claims 1 to 6, wherein in the step S40, the classification comprises at least east, west, south, north, northeast, northwest, southeast and southwest.
8. A crowd gathering identification system under an unmanned aerial vehicle angle, characterized by comprising:
the acquisition module, used for acquiring multi-frame target images as training image data and labeling the position of each human head in the training images to obtain human head position labels, and for inputting a video image to be detected;
the clustering module is used for clustering the labels of the positions of the heads of the crowd to generate a clustering center point in each frame of training image as label data;
the training module is used for training by using the label data and the training image data to obtain a crowd center point algorithm network;
the computing module is used for estimating the clustering center point of the crowd of the video image frame to be detected through a crowd center point algorithm network;
and the output module is used for calculating the coordinate difference of the clustering center points in different video image frames to be detected, and classifying and matching to obtain the track and the motion direction of the crowd.
9. An electronic apparatus comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the crowd gathering identification method under the unmanned aerial vehicle angle according to any one of claims 1 to 7.
10. A readable storage medium, in which a computer program is stored, the computer program comprising program code for controlling a process to execute a process, the process comprising the crowd gathering identification method under the unmanned aerial vehicle angle according to any one of claims 1 to 7.
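For illustration only (not part of the claims), the core of the claimed pipeline can be sketched in code: estimating clustering center points of head positions (steps S10/S30, claim 2), matching center points across frames by minimum Euclidean distance (step S40, claim 6), and binning the resulting displacement into the eight directions of claim 7. The plain-NumPy Mean Shift, the greedy matching, and all function names below are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def mean_shift_centers(points, bandwidth=50.0, iters=30, merge_tol=1.0):
    """Steps S10/S30: estimate crowd clustering center points with a plain Mean Shift."""
    pts = np.asarray(points, dtype=float)
    modes = pts.copy()
    for _ in range(iters):
        for i, m in enumerate(modes):
            # shift each mode to the mean of the head positions within the bandwidth
            neighbours = pts[np.linalg.norm(pts - m, axis=1) <= bandwidth]
            modes[i] = neighbours.mean(axis=0)
    centers = []  # merge modes that converged to (almost) the same point
    for m in modes:
        if not any(np.linalg.norm(m - c) < merge_tol for c in centers):
            centers.append(m)
    return np.array(centers)

def match_centers(prev_centers, curr_centers):
    """Step S40 / claim 6: greedily pair center points across frames by minimum Euclidean distance."""
    prev = np.asarray(prev_centers, dtype=float)
    curr = np.asarray(curr_centers, dtype=float)
    pairs, used = [], set()
    for p in prev:
        dists = [np.linalg.norm(p - c) if i not in used else np.inf
                 for i, c in enumerate(curr)]
        j = int(np.argmin(dists))
        if np.isfinite(dists[j]):
            used.add(j)
            pairs.append((p, curr[j]))
    return pairs

DIRECTIONS = ["east", "northeast", "north", "northwest",
              "west", "southwest", "south", "southeast"]

def movement_direction(prev_pt, curr_pt):
    """Claim 7: bin the displacement into eight compass directions.
    Image y grows downward, so it is negated to recover map-style north."""
    dx = curr_pt[0] - prev_pt[0]
    dy = prev_pt[1] - curr_pt[1]
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    return DIRECTIONS[int(((angle + 22.5) % 360.0) // 45.0)]
```

In practice the center points of the frames to be detected would come from the trained crowd center point algorithm network rather than from clustering detections directly; the clustering here only mirrors the label-generation step of S10.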
CN202310216569.0A 2023-03-08 2023-03-08 Crowd gathering identification method and system under unmanned plane angle and application thereof Active CN115880648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310216569.0A CN115880648B (en) 2023-03-08 2023-03-08 Crowd gathering identification method and system under unmanned plane angle and application thereof


Publications (2)

Publication Number Publication Date
CN115880648A 2023-03-31
CN115880648B 2023-05-12

Family

ID=85762061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310216569.0A Active CN115880648B (en) 2023-03-08 2023-03-08 Crowd gathering identification method and system under unmanned plane angle and application thereof

Country Status (1)

Country Link
CN (1) CN115880648B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014207991A1 (en) * 2013-06-28 2014-12-31 NEC Corporation Teaching data generating device, method, and program, and crowd state recognition device, method, and program
US20160239758A1 (en) * 2015-02-17 2016-08-18 Microsoft Technology Licensing, Llc Training systems and methods for sequence taggers
CN106454233A (en) * 2016-09-30 2017-02-22 北京中星微电子有限公司 Intelligent crowd-gathering monitoring method and system
JP2019071050A (en) * 2017-10-06 2019-05-09 Canon Inc. Image processing device, image processing method, and program
CN110096979A (en) * 2019-04-19 2019-08-06 佳都新太科技股份有限公司 Construction method, crowd density estimation method, device, equipment and the medium of model
CN112232124A (en) * 2020-09-11 2021-01-15 浙江大华技术股份有限公司 Crowd situation analysis method, video processing device and device with storage function
US20230015773A1 (en) * 2021-06-30 2023-01-19 Dalian Maritime University Crowd motion simulation method based on real crowd motion videos

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Takashi Toriu: "General Coordinate Invariant Image Smoothing Using a Metric Tensor", ICIGP 2020: Proceedings of the 2020 3rd International Conference on Image and Graphics Processing *
Yongtuo Liu et al.: "Reducing Spatial Labeling Redundancy for Active Semi-Supervised Crowd Counting", IEEE Transactions on Pattern Analysis and Machine Intelligence (Early Access) *
Shi Ying: "Crowd Motion Pattern Learning and Anomaly Detection Based on Surveillance Data", China Master's Theses Full-text Database *
Tan Zhiyong; Yuan Jiazheng; Liu Hongzhe; Li Qing: "Crowd Density Estimation Method Based on Deep Convolutional Neural Networks", Computer Applications and Software *
Guo Yubin et al.: "Analysis of Campus Crowd Gathering and Movement Patterns Based on a Density Clustering Algorithm", Application Research of Computers *

Also Published As

Publication number Publication date
CN115880648B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
JP7430277B2 (en) Obstacle detection method and apparatus, computer device, and computer program
Hausler et al. Multi-process fusion: Visual place recognition using multiple image processing methods
CN108470332B (en) Multi-target tracking method and device
CN113506317B (en) Multi-target tracking method based on Mask R-CNN and apparent feature fusion
CN111666921B (en) Vehicle control method, apparatus, computer device, and computer-readable storage medium
CN109035304B (en) Target tracking method, medium, computing device and apparatus
CN107657226B (en) People number estimation method based on deep learning
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
CN110781836A (en) Human body recognition method and device, computer equipment and storage medium
CN110264495B (en) Target tracking method and device
CN109145836B (en) Ship target video detection method based on deep learning network and Kalman filtering
CN103824070A (en) Rapid pedestrian detection method based on computer vision
Su et al. Uncertainty quantification of collaborative detection for self-driving
CN111832413A (en) People flow density map estimation, positioning and tracking method based on space-time multi-scale network
Soleimanitaleb et al. Single object tracking: A survey of methods, datasets, and evaluation metrics
CN112947419A (en) Obstacle avoidance method, device and equipment
CN112465854A (en) Unmanned aerial vehicle tracking method based on anchor-free detection algorithm
CN112101195A (en) Crowd density estimation method and device, computer equipment and storage medium
KR102613887B1 (en) Method and apparatus for face image reconstruction using video identity clarification model
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Valappil et al. CNN-SVM based vehicle detection for UAV platform
CN111899279A (en) Method and device for detecting motion speed of target object
CN111768429A (en) Pedestrian target tracking method in tunnel environment based on Kalman filtering and pedestrian re-identification algorithm
Farhood et al. Counting people based on linear, weighted, and local random forests
Castellano et al. Crowd flow detection from drones with fully convolutional networks and clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant