CN112016518B - Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence

Info

Publication number
CN112016518B
Authority
CN
China
Prior art keywords
crowd
bounding boxes
tile
unmanned aerial vehicle
Prior art date
Legal status
Active
Application number
CN202010961541.6A
Other languages
Chinese (zh)
Other versions
CN112016518A (en)
Inventor
罗瑗鸿 (Luo Yuanhong)
曹再辉 (Cao Zaihui)
焦斌 (Jiao Bin)
Current Assignee
Zhengzhou University of Aeronautics
Original Assignee
Zhengzhou University of Aeronautics
Priority date
Filing date
Publication date
Application filed by Zhengzhou University of Aeronautics filed Critical Zhengzhou University of Aeronautics
Priority to CN202010961541.6A priority Critical patent/CN112016518B/en
Publication of CN112016518A publication Critical patent/CN112016518A/en
Priority to LU500512A priority patent/LU500512B1/en
Application granted granted Critical
Publication of CN112016518B publication Critical patent/CN112016518B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/17 - Terrestrial scenes taken from planes or by drones
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems


Abstract

The invention provides a crowd distribution form detection method based on an unmanned aerial vehicle and artificial intelligence, comprising the following steps: an unmanned aerial vehicle photographs areas such as urban streets or parks; head bounding boxes and the left and right ear keypoints of each person are detected in a top-down manner and projected onto the ground of a CIM (City Information Model), and the position and orientation of each human body are obtained from the ear keypoints; crowd information in the CIM is visualized with a tile map, the number of people in each tile is counted, and when the count exceeds a set threshold the distribution form of the crowd in the tile is analyzed from the people's positions and orientations, so as to determine what state the gathered crowd is in. The method can judge the overall behavior of a crowd, making it convenient for relevant personnel to take corresponding measures and avoid accidents.

Description

Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence
Technical Field
The invention relates to the field of artificial intelligence, in particular to a crowd distribution form detection method based on unmanned aerial vehicles and artificial intelligence.
Background
In the prior art, analysis of crowd gathering in a city merely counts the number of human bodies or estimates crowd density. Such an approach yields only rough quantitative information and loses information such as spatial distribution and gaze orientation.
Disclosure of Invention
In order to solve the above problems, the present invention provides a crowd distribution form detection method based on an unmanned aerial vehicle and artificial intelligence, the method comprising:
step one, acquiring crowd images with an unmanned aerial vehicle, feeding the acquired crowd images into a head detection network to obtain bounding boxes of human heads, and screening the obtained bounding boxes into retained bounding boxes and screened-out bounding boxes;
step two, adjusting the flight route of the unmanned aerial vehicle in time according to the density of the screened-out bounding boxes in the crowd images; cropping the crowd images with the retained bounding boxes to obtain head images of human bodies, and processing the head images with a keypoint detection network to obtain left and right ear keypoint heatmaps;
step three, processing the keypoint heatmaps to obtain the positions of the left and right ear keypoints, projecting the keypoints onto the ground of a pre-built CIM, visualizing the CIM, displaying the keypoints on a tile map, and counting the number of people in each tile;
step four, setting a population threshold; when the population in a tile is greater than the population threshold, applying a post-processing operation to the tile and feeding the result into a form classification network, which outputs the crowd distribution form.
The head detection network comprises a first encoder and a first decoder; the network processes an input crowd image and regresses the center point of each human head bounding box together with the box's length and width. The regressed length and width are compared with a preset length threshold and a preset width threshold to screen the boxes: boxes whose length exceeds the length threshold and whose width exceeds the width threshold are retained bounding boxes, and the remaining boxes are screened-out bounding boxes.
The keypoint detection network comprises a second encoder and a second decoder, and the left and right ear keypoint heatmaps output by the network comprise two channels.
The real-time altitude of the unmanned aerial vehicle is recorded when the crowd images are acquired, and the keypoints are projected into the CIM according to this altitude.
The post-processed tile image is a binarized single-channel image containing a number of objects each composed of a point and a line segment; each object represents the position and orientation of one person: the midpoint of the line connecting the left and right ear keypoints represents the person's position, and the line segment represents the person's orientation as judged from the two ear keypoints.
The form classification network comprises a third encoder and a fully connected layer. Its training method is: the training data set is generated with a simulator; the labels are category labels covering surrounding distribution, queue distribution and disordered distribution; training uses a cross-entropy loss function.
The invention has the beneficial effects that:
1. In the prior art, surrounding behavior is judged by face detection: the orientations of multiple faces are obtained to decide whether a crowd is encircling something. That approach places strict requirements on the camera's viewing angle; clear faces are difficult to collect because of occlusion, and at a drone's viewing angle face information is missing in most cases, so gaze direction is hard to obtain. The invention instead judges a person's orientation from the detected left and right ear keypoints, solving the occlusion problem that face-based detection suffers from.
2. In the prior art, queue behavior is judged only from the positions of human heads, ignoring the characteristic gaze orientation in such scenes. In practice, the positions of people in a queue are not ideally arranged along a regular line, and individual randomness makes a queue appear somewhat disordered; the invention therefore combines gaze orientation with position when judging queues.
3. The invention judges the distribution form from a top-down view and combines individual position and orientation information to determine the monitoring position and attention level for the gathered crowd. It can detect the overall behavior of a crowd accurately and rapidly, making it convenient for relevant personnel to take corresponding measures and avoid accidents.
Drawings
Fig. 1 is a flow chart of the invention.
Fig. 2 shows the division of a crowd image into regions.
Fig. 3 shows a tile image after post-processing.
Fig. 4 is a schematic diagram of a marked tile map.
Detailed Description
So that those skilled in the art can better understand the invention, it is described in detail below with reference to the embodiment and the drawings (Figs. 1-4).
The invention uses an unmanned aerial vehicle to photograph areas such as urban streets or parks, detects each person's head and then the two ear keypoints in a top-down manner, and projects them onto the ground of the CIM model; the position and orientation of each human body are obtained from the ear keypoints. Crowd information in the CIM is visualized with a tile map, the number of people in each tile is counted, and when the count exceeds a set threshold the distribution form of the crowd in the tile is analyzed from the people's positions and orientations, so as to determine what state the gathered crowd is in.
Examples:
A crowd distribution form detection method based on an unmanned aerial vehicle and artificial intelligence, whose flow chart is shown in Fig. 1, comprises the following steps:
the construction method for constructing the city information model CIM is well known and will not be described here.
Crowd images are collected with an unmanned aerial vehicle. The drone's gimbal angle is fixed so that the onboard camera points 45 degrees downward from the horizontal plane, and the drone follows an initial flight path. Note that the invention acquires the drone's real-time altitude while collecting crowd images, and the drone patrols: its flight path corresponds to outdoor areas in the CIM such as certain streets and parks. The number of patrol drones across a whole city and the division of their patrol areas are outside the scope of this invention and must be determined for the actual scene.
There are two common approaches to multi-target pose estimation: top-down and bottom-up. Top-down: detect the targets with a detector, then run single-person pose estimation on each detected target. Bottom-up: detect all keypoints first, then group them into individuals using Part Affinity Fields (PAFs).
The number of human bodies in an image captured by a drone in low-altitude flight is unknown, and the people may appear at any pose or scale. There are also complex interrelations between bodies, such as contact, occlusion and articulation, which make the relations between joints hard to establish, and the running speed of the general grouping methods depends strongly on the number of targets in the image, making real-time operation difficult. The invention therefore adopts the top-down approach: first detect human heads to obtain bounding boxes, then crop the image to obtain head patches, and feed each patch into the keypoint detection network to obtain that person's left and right ear keypoint heatmaps. Specifically:
This embodiment adopts CenterNet to obtain the bounding boxes of human heads, i.e. a DNN regresses the center point of each bounding box together with its width and height. The training process of the head detection network is as follows:
the training data set is built, the images in the data set can be real crowd images in cities shot by unmanned aerial vehicles, or crowd images generated by simulators, and an implementer can select which method to use to build the data set by himself; the real crowd image should include crowd situations in various scenes such as streets, communities, parks and the like.
The data labels are x, y, w and h, where (x, y) is the center point of the bounding box, w its width and h its height. When annotating heads, all viewing angles must be labeled: a head is boxed whether it shows the front, side or back, and regardless of its size in the image. In the labels, x, y, w and h are normalized.
The first encoder and first decoder are trained with the data set and label data. The first encoder extracts features from the normalized input image and outputs a first feature map; the first decoder upsamples the first feature map and generates the human head bounding boxes.
Training is performed by using a mean square error loss function.
The bounding boxes of human heads are thus obtained.
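The patent fixes only the first encoder / first decoder structure and the mean squared error loss. As a hedged illustration, the following PyTorch sketch shows what such a pair could look like with CenterNet-style outputs (a center-point heatmap plus a width/height map); the layer sizes, input resolution and output parameterization are assumptions, not taken from the patent.

    # Minimal sketch of the head detection network (first encoder + first decoder).
    # Layer widths, depths and the CenterNet-style output heads (1-channel center
    # heatmap + 2-channel width/height map) are illustrative assumptions.
    import torch
    import torch.nn as nn

    class HeadDetector(nn.Module):
        def __init__(self):
            super().__init__()
            # First encoder: downsampling feature extractor.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            )
            # First decoder: upsamples the first feature map.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            )
            self.center_head = nn.Conv2d(32, 1, 1)  # center-point heatmap
            self.size_head = nn.Conv2d(32, 2, 1)    # per-pixel (w, h) regression

        def forward(self, x):
            feat = self.decoder(self.encoder(x))
            return torch.sigmoid(self.center_head(feat)), self.size_head(feat)

    # One training step with the mean squared error loss named in the description;
    # the random tensors are stand-ins for real images and labels.
    model = HeadDetector()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    images = torch.rand(2, 3, 256, 256)       # normalized crowd images
    gt_center = torch.rand(2, 1, 128, 128)    # Gaussian center-point labels
    gt_size = torch.rand(2, 2, 128, 128)      # normalized (w, h) labels
    pred_center, pred_size = model(images)
    loss = nn.functional.mse_loss(pred_center, gt_center) \
         + nn.functional.mse_loss(pred_size, gt_size)
    opt.zero_grad(); loss.backward(); opt.step()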
Human body information detected by the drone is projected onto the CIM in real time, but people far from the drone may move while the drone flies toward them. To reduce errors caused by crowd movement and to reduce computation, making the method more efficient and accurate, the invention first screens the bounding boxes: ear keypoint detection is performed only for people within a certain distance of the drone, and temporarily skipped for people farther away. Specifically, a length threshold and a width threshold are set for the bounding boxes; a box output by the head detection network whose length exceeds the length threshold and whose width exceeds the width threshold is retained, the corresponding person being considered close enough to the drone; all other boxes are screened-out bounding boxes.
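Assuming boxes given as normalized (x, y, w, h) tuples as in the labels above, a minimal sketch of this screening step could look as follows; the threshold values are illustrative and must be tuned by the implementer.

    # Minimal sketch of bounding-box screening; threshold values are
    # illustrative assumptions, to be tuned by the implementer.
    def screen_boxes(boxes, len_thresh=0.04, wid_thresh=0.03):
        """Split head boxes (x, y, w, h, normalized) into retained / screened-out.

        A box is retained only if its height (length) and width both exceed
        the thresholds, i.e. the person is considered close enough to the drone.
        """
        retained, screened_out = [], []
        for box in boxes:
            x, y, w, h = box
            if h > len_thresh and w > wid_thresh:
                retained.append(box)
            else:
                screened_out.append(box)
        return retained, screened_out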
The flight route of the unmanned aerial vehicle is adjusted in real time according to the density of screened-out bounding boxes in the crowd images. The adjustment process is as follows:
Owing to the perspective of images shot by the drone, the acquired crowd image is divided into left, middle and right regions, as shown in Fig. 2; the drone's optional flight directions are the three directions OA, OB and OC in Fig. 2. The numbers (a, b, c) of screened-out bounding boxes in the three regions are counted respectively, and the maximum of the three values is denoted m, i.e. m = max(a, b, c);
when the three values a, b and c are mutually distinct and m is greater than a set threshold n, the drone flies in the direction corresponding to m;
if m = a = c, i.e. the left and right regions are tied for the maximum, and m is greater than the threshold n, the drone flies in direction OB and decides again from subsequent frames;
if m = a = b or m = b = c, i.e. the left and middle or the right and middle regions are tied for the maximum, and m is greater than the threshold n, the drone likewise flies in direction OB;
when m is smaller than the threshold n, the drone flies toward the next patrol target point along its initial flight path.
In this way the drone can make small changes of direction during its patrol, flying toward places where crowds are present, acquiring crowd information in time, and improving the quality and efficiency of information acquisition; a sketch of the decision rule is given below.
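In this sketch, the equal-thirds region boundaries and the threshold value n = 5 are illustrative assumptions, since the patent leaves both to the implementer.

    # Minimal sketch of the flight-direction decision over screened-out boxes.
    def choose_direction(screened_out_boxes, n=5):
        """Return 'OA' (left), 'OB' (middle), 'OC' (right) or 'PATROL'."""
        a = b = c = 0
        for x, y, w, h in screened_out_boxes:   # x: normalized box center in [0, 1]
            if x < 1 / 3:
                a += 1                           # left region
            elif x < 2 / 3:
                b += 1                           # middle region
            else:
                c += 1                           # right region
        m = max(a, b, c)
        if m < n:
            return "PATROL"                      # continue the initial flight path
        if a == c == m or a == b == m or b == c == m:
            return "OB"                          # tied maxima -> fly straight ahead
        return {a: "OA", b: "OB", c: "OC"}[m]    # unique maximum -> its direction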
The invention tracks multiple targets by matching bounding boxes across adjacent frames with the IOU, avoiding repeated detection of the same target in different frames: the intersection over union of two bounding boxes is computed, and when the IOU is greater than 0.7 the boxes are judged to be the same target. This part is well known and is not discussed in detail.
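The IOU computation referred to here is the standard intersection over union; a minimal sketch with the 0.7 same-target threshold from the description:

    # Standard IOU between two boxes given as (x1, y1, x2, y2) corners;
    # the 0.7 same-target threshold comes from the description above.
    def iou(box_a, box_b):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        ix1, iy1 = max(ax1, bx1), max(ay1, by1)
        ix2, iy2 = min(ax2, bx2), min(ay2, by2)
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        return inter / union if union > 0 else 0.0

    def same_target(box_a, box_b, thresh=0.7):
        return iou(box_a, box_b) > thresh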
The key point heat map is obtained by the following steps: cutting crowd images by using the reserved bounding boxes to obtain head images of the human body, sending the head images of the human body into a subsequent key point detection network, and processing the head images of the human body to obtain a left and right double-ear key point heat map. The training process of the key point detection network comprises the following steps:
the data set includes the human head images cut out, should contain the heads of various people under 360 degrees oblique overlook viewing angles.
The labels are the left and right ear keypoints of the head. The labeling process: each keypoint type corresponds to a single channel in which the keypoint's pixel location is marked, and a Gaussian kernel is then applied to form a hotspot at the marked point. This embodiment uses two keypoint types, so the label image contains two channels in total. Occluded keypoints should also be labeled.
The second encoder and second decoder are trained with the images and label data: the second encoder extracts features from the input image to obtain a second feature map, and the second decoder upsamples the second feature map and extracts features to obtain the left and right ear keypoint heatmaps.
The mean squared error loss function is used for training.
The left and right ear keypoint heatmaps are thus obtained.
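A minimal sketch of building such a two-channel label, assuming a 64 x 64 heatmap and a Gaussian sigma of 2 (both illustrative):

    # Minimal sketch of a two-channel ear-keypoint label heatmap built with a
    # Gaussian kernel; heatmap size and sigma are illustrative assumptions.
    import numpy as np

    def make_label_heatmap(keypoints, size=64, sigma=2.0):
        """keypoints: [(x_left, y_left), (x_right, y_right)] in pixel coordinates.

        Returns a (2, size, size) array: channel 0 = left ear, channel 1 = right ear.
        Occluded ears are still labeled, per the description above.
        """
        ys, xs = np.mgrid[0:size, 0:size]
        heat = np.zeros((2, size, size), dtype=np.float32)
        for ch, (kx, ky) in enumerate(keypoints):
            heat[ch] = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))
        return heat

    label = make_label_heatmap([(20, 30), (40, 30)])  # left ear, right ear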
Although there is a certain error between the actual position and the projected position, the method mainly detects the distribution form, so a projection error merely translates the whole distribution; it does not affect the distribution form and remains within the system's acceptable range.
Projection is performed with a projective transformation according to the drone's recorded real-time altitude and the imaging principle. This part is well known in the art and is therefore not discussed here.
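Although the patent treats the projection as well known and gives no formulas, a minimal flat-ground pinhole sketch under the embodiment's 45-degree downward camera pitch might look as follows; the focal length, principal point, zero-yaw pose and flat-ground model are all illustrative assumptions, and real use would require calibrated intrinsics and the drone's full pose.

    # Minimal flat-ground pinhole projection sketch: image point -> ground point.
    # Focal length, principal point and the flat-ground model are illustrative.
    import numpy as np

    def project_to_ground(u, v, altitude, pitch_deg=45.0, f=800.0, cx=640.0, cy=360.0):
        """Return (X, Y) ground coordinates relative to the drone (camera yaw = 0)."""
        # Ray in camera coordinates (x right, y down, z along the optical axis).
        ray = np.array([(u - cx) / f, (v - cy) / f, 1.0])
        # Rotate by the downward pitch to express the ray in world coordinates.
        p = np.deg2rad(pitch_deg)
        rot = np.array([[1.0, 0.0, 0.0],
                        [0.0, np.cos(p), np.sin(p)],
                        [0.0, -np.sin(p), np.cos(p)]])
        ray_w = rot @ ray
        if ray_w[1] <= 0:                    # ray never reaches the ground
            return None
        t = altitude / ray_w[1]              # scale the ray down to the ground plane
        return ray_w[0] * t, ray_w[2] * t    # (lateral X, forward Y)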
A map within a certain range is cut, at a given size and format and a given zoom level or scale, into square grid pictures of several rows and columns; the cut squares are vividly called tiles. Sometimes macroscopic map information is needed and sometimes microscopic, so the map is cut by observation scale: ordering the levels from highest to lowest according to tile pixel size forms a pyramid coordinate system. The tile division of urban outdoor areas, whose scale and levels are determined by actual conditions, is not specifically discussed in this invention.
The data in the CIM are visualized and the keypoints are displayed on the tile map; the number of people in each tile is counted from the left and right ear keypoints. To reduce computation, attention is focused on high-density gathering areas: a population threshold is set, and only when the number of people in a tile exceeds the threshold is the crowd distribution form judged. Otherwise the tile is judged to be an unmanned area or an area with a small number of scattered people, its crowd density is marked as 0, and it is considered not to reach the scale of a gathered crowd.
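A minimal sketch of the per-tile counting and thresholding, with an illustrative tile size (in ground units) and population threshold:

    # Minimal sketch of per-tile people counting; tile size and the population
    # threshold are illustrative assumptions.
    from collections import defaultdict

    def count_people_per_tile(positions, tile_size=10.0):
        """positions: list of (x, y) ground coordinates (ear-keypoint midpoints)."""
        counts = defaultdict(int)
        for x, y in positions:
            counts[(int(x // tile_size), int(y // tile_size))] += 1
        return counts

    def tiles_to_classify(positions, threshold=15, tile_size=10.0):
        counts = count_people_per_tile(positions, tile_size)
        # Tiles at or below the threshold are treated as density 0 (no gathered crowd).
        return [tile for tile, n in counts.items() if n > threshold]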
In a surrounding distribution, the people stand in a fan or ring shape and their gazes are concentrated, i.e. the head orientations converge toward one spot; in a queue distribution, the people stand in strips and their gazes are mostly concentrated toward the front.
The invention uses a DNN to learn the distribution characteristics of the various cases. Specifically, tiles whose population exceeds the threshold are post-processed and fed into the form classification network, which outputs the crowd distribution classification. The post-processed image, i.e. the input image of the form classification network, is shown in Fig. 3: it is a binarized single-channel image in which each black point is the midpoint of the line connecting the left and right ear keypoints and represents a person's position, while the line segment attached to the point represents the orientation judged from the two ears (for a human body, once the left-right direction is known the front-back direction can be determined). The length of the line segment is fixed; the specific value is left to the implementer. A rendering sketch follows.
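This is a minimal rendering sketch (using OpenCV) under an assumed image size, segment length and front-side convention; the patent fixes only that the midpoint of the ear line marks the position and a fixed-length segment marks the orientation.

    # Minimal sketch of the post-processing render: one point plus one fixed-length
    # orientation segment per person on a binary single-channel tile image.
    # Image size, segment length and the front-side convention are assumptions.
    import numpy as np
    import cv2

    def render_tile(people, size=128, seg_len=8):
        """people: list of ((xl, yl), (xr, yr)) left/right ear pixel coordinates."""
        img = np.zeros((size, size), dtype=np.uint8)
        for (xl, yl), (xr, yr) in people:
            px, py = (xl + xr) / 2.0, (yl + yr) / 2.0   # person position (midpoint)
            ex, ey = xr - xl, yr - yl                    # left -> right ear vector
            norm = max((ex * ex + ey * ey) ** 0.5, 1e-6)
            fx, fy = ey / norm, -ex / norm               # perpendicular = assumed front
            cv2.circle(img, (int(px), int(py)), 1, 255, -1)
            cv2.line(img, (int(px), int(py)),
                     (int(px + fx * seg_len), int(py + fy * seg_len)), 255, 1)
        return img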
The training process of the form classification network is as follows: the training data set can be constructed from the post-processed images of tiles containing keypoints, or training images can be obtained by generating crowd distribution maps with a simulator and sliding a viewing-angle window over them.
The labels are the category labels of the images: queue distribution is labeled 1, surrounding distribution 2, and disordered distribution 3.
The third encoder and a fully connected layer are trained with the training data set and label data: a training image is encoded by the third encoder, i.e. features are extracted with a CNN and flattened into a one-dimensional vector, which is fed into the fully connected (FC) layer; the FC output is the distribution form category corresponding to the training image.
The network is trained with a cross-entropy loss function.
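A minimal PyTorch sketch of the third encoder plus FC layer trained with cross entropy; the depths, widths and 128 x 128 input size are illustrative assumptions (class indices are 0-based here, versus the 1/2/3 labels above).

    # Minimal sketch of the form classification network (third encoder + FC layer).
    import torch
    import torch.nn as nn

    class FormClassifier(nn.Module):
        def __init__(self, num_classes=3):              # queue / surrounding / disordered
            super().__init__()
            self.encoder = nn.Sequential(               # third encoder (CNN)
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),
            )
            self.fc = nn.Linear(64 * 4 * 4, num_classes)  # FC on the flattened vector

        def forward(self, x):
            return self.fc(self.encoder(x).flatten(1))

    # One training step; random tensors stand in for post-processed tile images.
    model = FormClassifier()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    tiles = torch.rand(8, 1, 128, 128)     # binarized post-processed tile images
    labels = torch.randint(0, 3, (8,))     # 0 = queue, 1 = surrounding, 2 = disordered
    loss = nn.functional.cross_entropy(model(tiles), labels)
    opt.zero_grad(); loss.backward(); opt.step()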
The distribution form of the crowd in each tile can thus be obtained.
Following the above steps, a marked tile map can be obtained as shown in Fig. 4, where the number in each square denotes the crowd distribution form of that tile and the arrow is the average orientation obtained by counting the gaze orientations of all individuals in the tile. From this map, regions with surrounding events or queue events can be attended to in time. The method for focusing on a specific region is:
the focus area of the surrounding form is the dashed circular area shown in Fig. 4; the center of the focus area is the center point of the tile, or of the tile border, nearest to the center of the circle;
the center of a queue-form focus area is a selectable position at the nearest tile not identified as a queue, along the overall average orientation of the queue area.
Different forms of focus area receive different levels of attention, i.e. different monitoring levels: the surrounding form is highest, the queue form next, and disordered gathering is ordinary, requiring little or no human monitoring. The specific position and level of the focus area are used to monitor overall crowd gathering behavior, reducing the resources spent on monitoring every high-density area of randomly moving people and allowing areas of different levels to be monitored reasonably.
The foregoing is intended to help those skilled in the art better understand the invention and is not intended to limit it.

Claims (6)

1. A crowd distribution form detection method based on an unmanned aerial vehicle and artificial intelligence, characterized by comprising the following steps:
step one, acquiring crowd images with an unmanned aerial vehicle, feeding the acquired crowd images into a head detection network to obtain bounding boxes of human heads, and screening the obtained bounding boxes into retained bounding boxes and screened-out bounding boxes;
step two, adjusting the flight route of the unmanned aerial vehicle in time according to the density of the screened-out bounding boxes in the crowd images; cropping the crowd images with the retained bounding boxes to obtain head images of human bodies, and processing the head images with a keypoint detection network to obtain left and right ear keypoint heatmaps;
step three, processing the keypoint heatmaps to obtain the positions of the left and right ear keypoints, projecting the keypoints onto the ground of a pre-built CIM, visualizing the CIM, displaying the keypoints on a tile map, and counting the number of people in each tile;
step four, setting a population threshold; when the population in a tile is greater than the population threshold, applying a post-processing operation to the tile and feeding the result into a form classification network, which outputs the crowd distribution form.
2. The method of claim 1, wherein the head detection network comprises a first encoder and a first decoder; the network processes an input crowd image and regresses the center point of each human head bounding box together with the box's length and width; the regressed length and width are compared with a preset length threshold and a preset width threshold to screen the boxes: boxes whose length exceeds the length threshold and whose width exceeds the width threshold are retained bounding boxes, and the remaining boxes are screened-out bounding boxes.
3. The method of claim 1, wherein the keypoint detection network comprises a second encoder and a second decoder, and the left and right ear keypoint heatmaps output by the network comprise two channels.
4. The method of claim 1, wherein the real-time altitude of the unmanned aerial vehicle is recorded when the crowd images are acquired, and the keypoints are projected into the CIM according to this altitude.
5. The method of claim 1, wherein the post-processed tile image is a binarized single-channel image containing a number of objects each composed of a point and a line segment, each object representing the position and orientation of one person: the midpoint of the line connecting the left and right ear keypoints represents the person's position, and the line segment represents the person's orientation as judged from the two ear keypoints.
6. The method of claim 1, wherein the form classification network comprises a third encoder and a fully connected layer, and the training method of the network is: the training data set is generated with a simulator; the labels are category labels covering surrounding distribution, queue distribution and disordered distribution; and training uses a cross-entropy loss function.
CN202010961541.6A 2020-09-14 2020-09-14 Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence Active CN112016518B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010961541.6A CN112016518B (en) 2020-09-14 2020-09-14 Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence
LU500512A LU500512B1 (en) 2020-09-14 2021-08-05 Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010961541.6A CN112016518B (en) 2020-09-14 2020-09-14 Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence

Publications (2)

Publication Number Publication Date
CN112016518A CN112016518A (en) 2020-12-01
CN112016518B 2023-07-04

Family

ID=73521974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961541.6A Active CN112016518B (en) 2020-09-14 2020-09-14 Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence

Country Status (2)

Country Link
CN (1) CN112016518B (en)
LU (1) LU500512B1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033353A (en) * 2021-03-11 2021-06-25 北京文安智能技术股份有限公司 Pedestrian trajectory generation method based on overlook image, storage medium and electronic device
CN113392714B (en) * 2021-05-20 2024-06-14 上海可深信息科技有限公司 Crowd event detection method and system


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8195598B2 (en) * 2007-11-16 2012-06-05 Agilence, Inc. Method of and system for hierarchical human/crowd behavior detection
US9406135B2 (en) * 2012-10-29 2016-08-02 Samsung Electronics Co., Ltd. Device and method for estimating head pose

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777114A (en) * 2009-01-08 2010-07-14 北京中星微电子有限公司 Intelligent analysis system and intelligent analysis method for video monitoring, and system and method for detecting and tracking head and shoulder
CN103390172A (en) * 2013-07-24 2013-11-13 佳都新太科技股份有限公司 Estimating method of crowd density under high-density scene
CN104766059A (en) * 2015-04-01 2015-07-08 上海交通大学 Rapid and accurate human eye positioning method and sight estimation method based on human eye positioning
CN106296720A (en) * 2015-05-12 2017-01-04 株式会社理光 Human body based on binocular camera is towards recognition methods and system
CN110705500A (en) * 2019-10-12 2020-01-17 深圳创新奇智科技有限公司 Attention detection method and system for personnel working image based on deep learning
CN111563447A (en) * 2020-04-30 2020-08-21 南京邮电大学 Crowd density analysis and detection positioning method based on density map
CN111627103A (en) * 2020-05-26 2020-09-04 张仲靖 Smart city CIM imaging method based on pedestrian activity and density perception

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A people counting method based on head detection and tracking; Bin Li et al.; 2014 International Conference on Smart Computing; pp. 136-141.
Background Noise Filtering and Distribution Dividing for Crowd Counting; Hong Mo et al.; IEEE Transactions on Image Processing, vol. 29; pp. 8199-8212.
复杂场景行人检测技术研究 (Research on pedestrian detection technology in complex scenes); Zhao Wei et al.; 《电视技术》 (Video Engineering), vol. 44, no. 5; pp. 43-45.
深度图像下基于头部多特征的人数统计算法 (People counting algorithm based on multiple head features in depth images); Liu Lei et al.; 《深圳大学学报(理工版)》 (Journal of Shenzhen University, Science and Engineering), vol. 34, no. 6; pp. 584-590.
结合改进聚合通道特征和灰度共生矩阵的俯视行人检测算法 (Top-view pedestrian detection algorithm combining improved aggregated channel features and gray-level co-occurrence matrix); Li Lin et al.; 《计算机应用》 (Journal of Computer Applications), vol. 38, no. 12; pp. 3367-3371.

Also Published As

Publication number Publication date
LU500512A1 (en) 2022-02-07
LU500512B1 (en) 2022-03-31
CN112016518A (en) 2020-12-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant