CN112541424A - Real-time detection method for pedestrian falling under complex environment - Google Patents
- Publication number: CN112541424A (application number CN202011427824.9A)
- Authority: CN (China)
- Prior art keywords: pedestrian, target, frame, detection, falling
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06N3/045 — Neural networks; architecture, e.g. combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06V40/10 — Recognition of human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
Abstract
A real-time method for detecting pedestrian falls in complex environments, relating to methods for monitoring the state of pedestrians. The method comprises the following steps: preprocessing the acquired video, converting each frame of the video stream into a picture and normalizing it; detecting pedestrians, adjusting the size of the detection frame as the distance between pedestrian and camera changes; tracking targets with the SORT algorithm, extracting features and computing similarity; predicting the current position with a Kalman filter and associating detection boxes with target positions via the Hungarian algorithm; and judging whether a pedestrian in the target area has fallen, where the aspect ratio of a standing pedestrian is at most 0.4 and rises to 0.7-1.2 after a fall. The invention aims to solve the problems in the pedestrian fall detection and judgment task and constructs a real-time pedestrian fall detection method that is simple and clear in structure, applicable to varied scenes, accurate, and robust.
Description
Technical Field
The invention relates to a detection method for monitoring the state of pedestrians, and in particular to the technical field of real-time detection of pedestrian falls in complex environments.
Background
The probability of accidental death caused by falls among the elderly keeps rising in today's society, which has drawn wide public attention and prompted a great deal of research on fall detection for the elderly. Current automatic pedestrian fall detection systems fall mainly into three types: those based on scene-mounted devices, those based on wearable devices, and those based on computer vision.
A fall detection system based on scene-mounted devices analyzes human-motion data collected by sensor equipment installed in a specific scene; although it does not disturb normal life and is highly accurate, the equipment cost is high, making it difficult to popularize. A fall detection system based on wearable devices judges whether an elderly person has fallen through worn pressure sensors, acceleration sensors, and the like; such equipment is easily damaged or disturbed, producing false alarms, and wearing it constantly interferes with normal activities.
A computer-vision-based fall detection system transmits captured images to a computing terminal, applies purpose-built image processing algorithms to analyze the scene in real time, and detects fall events by analyzing human shape features and motion through processes such as pedestrian detection, target tracking, and image segmentation. This approach offers low cost, little interference, good real-time performance, high accuracy, and no impact on human activity.
However, current computer-vision-based fall detection algorithms still have shortcomings: 1. traditional pedestrian detection and tracking methods mainly rely on hand-crafted feature values for global feature detection; these features are easily affected by complex conditions such as illumination change, scene change, and occlusion, so the robustness of the detection and tracking algorithms is poor and the results are unsatisfactory. 2. A correlation-filtering target tracking algorithm can improve robustness to moving objects in complex environments, but scale changes of the target itself cause tracking drift.
Disclosure of Invention
The invention aims to solve the problems in the pedestrian fall detection and judgment task using recent deep-learning computer vision techniques, and to construct a real-time pedestrian fall detection method that is simple and clear in structure, applicable to varied scenes, accurate, and robust.
The real-time detection method for pedestrian falling in the complex environment comprises the following steps:
step one: preprocessing the acquired video, converting each frame of the video stream into a picture, normalizing it, and outputting pictures at a resolution of 416×416;
step two: pedestrian detection. Darknet-53 is used as the backbone network for feature extraction; following the feature-pyramid-network idea, FPN up-sampling is applied to the input picture to obtain and fuse feature values at three scales, so that the network can adjust the prior boxes of each receptive-field convolution layer according to the actual size of the object's ground truth. The intersection-over-union IoU between bounding boxes is computed as:

IoU = area(A ∩ B) / area(A ∪ B)

where area(A) is the area of the original annotation box and area(B) is the area of the candidate box; the higher their overlap, the closer the result is to 1, which yields the best-matching box. Finally, the size of the detection frame is adjusted to account for the varying apparent size of pedestrians as their distance from the camera changes;
step three: target tracking using the SORT algorithm, i.e. extracting the corresponding targets from all target frames detected in the pedestrian detection stage, extracting features, and then computing similarity; the current position is predicted by a Kalman filter, and the Hungarian algorithm associates detection boxes with target positions;
step four: fall judgment, i.e. a binary classification to decide whether the pedestrian in the target area has fallen: when the pedestrian stands, the aspect ratio of the identified ground-truth box is at most 0.4; when the pedestrian falls, the aspect ratio rises to about 0.7-1.2, the deflection angle drops below a set threshold, and the instantaneous angular acceleration increases.
Preferably, in step two of the invention, during training a subset of images with pronounced light-and-shadow effects in the COCO data set is selected for a dedicated training test of the YOLOv3 network; the data set is augmented by rotating suitable pedestrian images, randomly divided into training and test sets at a 7:3 ratio, and labeled as containing a pedestrian or not.
Preferably, in step three of the invention, the feature values of all target frames of the previous frame — including center-position coordinates, aspect ratio, height, and velocity — are first obtained through the Kalman filter; the error covariance matrix is then used to predict the position of the current target, and the prediction is corrected as:

x̂_k = A·x̂_{k−1} + K_k·(Z_k − H·A·x̂_{k−1})

where K_k is the Kalman gain and Z_k is the actual measurement, which together with the estimate propagated from the previous state corrects the prediction; x̂_k and x̂_{k−1} are the posterior state estimates at times k and k−1, i.e. the updated optimal estimates, and the final optimal estimate is taken as the true position of the target in the current frame;
second, association between consecutive frames is achieved through the Hungarian algorithm: the intersection-over-union IoU between the target prediction boxes and the detection boxes serves as the weight of the Hungarian algorithm to compute the similarity matrix of the two frames, thereby matching and tracking targets across frames.
Preferably, a neural-network-based feature extraction method is added to the SORT algorithm of the invention, combining surface features with moving-target information;
the degree of association of the moving target is represented by the Mahalanobis distance between the prediction box and the detection box:

d⁽¹⁾(i, j) = (d_j − y_i)ᵀ S_j⁻¹ (d_j − y_i)

where T denotes transposition, d_j is the j-th detection box, y_i is the i-th predicted target position, S_j is the covariance matrix between the j-th detected position and the mean tracking position, and S_j⁻¹ its inverse. The Mahalanobis distance preserves the spatial distribution; the degree of association of surface features is expressed by the minimum cosine distance between the i-th track and the j-th detection:

d⁽²⁾(i, j) = min{ 1 − r_jᵀ r_k⁽ⁱ⁾ : r_k⁽ⁱ⁾ ∈ R_i }
where for each detection box d_j a surface feature descriptor r_j with ‖r_j‖ = 1 is computed, and for the i-th tracking target a gallery R_i is constructed that stores the descriptors of the most recent 100 frames;
finally, the two measures are fused:

C_{i,j} = λ·d⁽¹⁾(i, j) + (1 − λ)·d⁽²⁾(i, j)
where λ is a hyperparameter that balances the two terms and C_{i,j} is the final cost: the smaller C_{i,j}, the greater the association between the detected and tracked target. The distance measure works well for short-term prediction and matching, while the appearance information recovers tracks lost for a long time, improving the algorithm's robustness to target loss and occlusion.
Preferably, step four of the invention adopts a sliding-window feature extraction method, storing feature data in a sliding window of fixed size; as time advances, new data are appended at the end of the window and the leftmost data are removed;
a support vector machine classifier is constructed from the feature data in the sliding window and trained for fall detection to judge whether the pedestrian has fallen. During training, a large number of fall-sample and non-fall-sample feature vectors are fed to the SVM, and the fall classifier is obtained by training on these samples. A Gaussian kernel is used to project the features into a high-dimensional space:

K(x, z) = exp(−γ‖x − z‖²)

where x and z are a training sample and a test sample respectively, γ is a hyperparameter that must be greater than 0 and tuned, and ‖·‖ denotes the norm operation, i.e. the "measure" of a vector.
The invention uses recent deep-learning computer vision techniques to solve problems in the pedestrian fall detection and judgment task, such as tracking or detection failure caused by complex environmental factors like scale change, occlusion, and illumination change, and constructs a pedestrian fall detection solution that is simple and clear in structure, applicable to varied scenes, accurate, and robust. Compared with traditional schemes, it not only improves the judgment accuracy for certain falling behaviors but is also more robust to influences such as illumination change, occlusion, and scale change.
Drawings
FIG. 1 is a flow chart of the overall scheme of the present invention.
Fig. 2 is a schematic diagram of a network structure of YOLOv 3.
Detailed Description
As shown in fig. 1, the real-time detection method for pedestrian falls in a complex environment performs preprocessing first: each frame of the video stream is converted into a picture and normalized, so that pictures are output at a resolution of 416×416.
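The preprocessing step can be sketched as follows. This is a minimal illustration using nearest-neighbour resampling; in practice a library resize (e.g. OpenCV) would be used, and the function name is an assumption, not from the patent.

```python
import numpy as np

def preprocess_frame(frame: np.ndarray, size: int = 416) -> np.ndarray:
    """Resize a frame to size x size (nearest neighbour, illustrative only)
    and normalize pixel values to the [0, 1] range."""
    h, w = frame.shape[:2]
    # index maps for nearest-neighbour resampling
    rows = (np.arange(size) * h // size).clip(0, h - 1)
    cols = (np.arange(size) * w // size).clip(0, w - 1)
    resized = frame[rows][:, cols]
    return resized.astype(np.float32) / 255.0
```

Applied to a 480×640 BGR frame, this yields a 416×416×3 float array with values in [0, 1], ready to feed to the detector.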
Pedestrian detection
In image target detection tasks, algorithms based on deep convolutional neural networks are widely applied owing to their advantages in feature extraction, and they clearly outperform traditional detection methods. Such algorithms fall into three categories: 1) target recognition algorithms based on region proposals; 2) detection algorithms based on learned search; 3) target detection algorithms based on regression. Because the first two categories are slow and less accurate in detection, pedestrian detection here adopts a regression-based target detection algorithm, meeting the real-time requirement while maintaining detection precision.
As shown in fig. 2, the structure diagram of the YOLOv3 network, the network is divided into two parts: feature extraction and output at three sampling scales. The invention adopts Darknet-53 as the backbone network for feature extraction. Unlike a conventional CNN structure, Darknet-53 abandons the usual pooling and fully connected layers; each convolutional layer is followed by a Leaky-ReLU activation, and no bias term is used at the activation input. This simplifies the model and reduces the dimensions and parameters of the convolution-kernel channels, strengthening the model's feature extraction capability and improving timeliness and sensitivity in pedestrian detection.
In order to detect small targets, the invention performs multi-scale fusion prediction on the features output by Darknet-53. Using the idea of Feature Pyramid Networks, FPN up-sampling is applied to the input picture to obtain feature values at three scales, which are fused; detection is then carried out on multiple feature maps, markedly improving the accuracy on small targets. Building on this multi-scale characteristic, the convolution layers of different receptive fields serve as independent outputs for classification, so the network can adjust the prior boxes of each receptive-field convolution layer according to the actual size of the object's ground truth, and the intersection-over-union IoU between bounding boxes is computed as:

IoU = area(A ∩ B) / area(A ∪ B)

where area(A) is the area of the original annotation box and area(B) is the area of the candidate box. The higher the overlap of the two, the closer the result is to 1, which yields the best-matching box; finally, the size of the detection frame can be adjusted to account for the varying apparent size of pedestrians as their distance from the camera changes.
Further, to counter the influence of light-and-shadow changes, during training the invention selects a subset of images with pronounced light-and-shadow effects from the COCO (Common Objects in Context) data set for a dedicated training test of the YOLOv3 network, augments the data set by rotating suitable pedestrian images, randomly divides it into training and test sets at a 7:3 ratio, and labels each image as containing a pedestrian or not. The trained network's resistance to light-and-shadow interference improves markedly, and its performance is clearly higher than that of an SSD network trained on only 600 images.
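The random 7:3 split can be sketched as below; the function name and fixed seed are illustrative choices, not part of the patent.

```python
import random

def split_dataset(samples, train_ratio=0.7, seed=0):
    """Randomly divide labelled samples into training and test sets (7:3)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

On a 600-image set this produces 420 training and 180 test images.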
Target tracking
In order to analyze the extracted pedestrian targets more efficiently and reduce the computation time of continuous detection, the invention adopts the SORT algorithm for target tracking: the corresponding targets in all target frames produced by the pedestrian detection stage are extracted, features (apparent or motion features) are computed, and similarity is then calculated. The current position is predicted by a Kalman filter, and the Hungarian algorithm associates detection boxes with target positions.
First, the feature values of all target frames of the previous frame — center-position coordinates, aspect ratio, height, and velocity — are obtained through the Kalman filter; the error covariance matrix is then used to predict the position of the current target, and the prediction is corrected as:

x̂_k = A·x̂_{k−1} + K_k·(Z_k − H·A·x̂_{k−1})

where K_k is the Kalman gain and Z_k is the actual measurement, which together with the estimate propagated from the previous state corrects the prediction; x̂_k and x̂_{k−1} are the posterior state estimates at times k and k−1, i.e. the updated optimal estimates, and the final optimal estimate is taken as the true position of the target in the current frame;
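One predict/correct cycle can be sketched numerically as below. The state layout and the specific A, H, Q, R matrices in the usage example are illustrative assumptions; the patent only names the quantities involved.

```python
import numpy as np

def kalman_step(x_prev, P_prev, z, A, H, Q, R):
    """One Kalman predict/correct cycle:
    x̂ₖ = A·x̂ₖ₋₁ + Kₖ·(Zₖ − H·A·x̂ₖ₋₁)."""
    # predict state and error covariance
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + Q
    # Kalman gain from the predicted covariance
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    # correct the prediction with the actual measurement z
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x_prev)) - K @ H) @ P_pred
    return x_new, P_new
```

For a toy constant-velocity state [position, velocity] with A = [[1, 1], [0, 1]] and H = [[1, 0]], a measurement that matches the prediction leaves the estimate unchanged, as expected from the correction formula.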
second, association between consecutive frames is achieved through the Hungarian algorithm: the intersection-over-union IoU between the target prediction boxes and the detection boxes serves as the weight of the Hungarian algorithm to compute the similarity matrix of the two frames, thereby matching and tracking targets across frames.
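The association step can be illustrated with a brute-force optimal assignment over the IoU weight matrix — a stand-in that returns the same optimum as the Hungarian algorithm for the small matrices of a single frame (the real algorithm, e.g. `scipy.optimize.linear_sum_assignment`, scales polynomially).

```python
from itertools import permutations

def match_tracks(iou_matrix):
    """Optimal (track, detection) assignment maximizing total IoU.
    Brute force over permutations; assumes len(tracks) <= len(detections)."""
    n_tracks = len(iou_matrix)
    n_dets = len(iou_matrix[0])
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(n_dets), n_tracks):
        score = sum(iou_matrix[t][d] for t, d in enumerate(perm))
        if score > best_score:
            best_score = score
            best_pairs = list(enumerate(perm))
    return best_pairs
```

With iou_matrix = [[0.9, 0.1], [0.2, 0.8]] the assignment pairs track 0 with detection 0 and track 1 with detection 1, the maximum-overlap matching.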
Compared with other algorithms, the SORT algorithm detects and tracks faster with high accuracy, but its accuracy drops when occlusion appears in the field of view. The invention therefore optimizes the SORT algorithm to a certain extent: a neural-network-based feature extraction method is added, a more reliable metric replaces the correlation measure, and surface features are combined with moving-target information.
The degree of association of the moving target is represented by the Mahalanobis distance between the prediction box and the detection box:

d⁽¹⁾(i, j) = (d_j − y_i)ᵀ S_j⁻¹ (d_j − y_i)

where T denotes transposition, d_j is the j-th detection box, y_i is the i-th predicted target position, S_j is the covariance matrix between the j-th detected position and the mean tracking position, and S_j⁻¹ its inverse. The Mahalanobis distance preserves the spatial distribution; the degree of association of surface features is expressed by the minimum cosine distance between the i-th track and the j-th detection:

d⁽²⁾(i, j) = min{ 1 − r_jᵀ r_k⁽ⁱ⁾ : r_k⁽ⁱ⁾ ∈ R_i }
where for each detection box d_j a surface feature descriptor r_j with ‖r_j‖ = 1 is computed, and for the i-th tracking target a gallery R_i is constructed that stores the descriptors of the most recent 100 frames;
Finally, the two measures are fused:

C_{i,j} = λ·d⁽¹⁾(i, j) + (1 − λ)·d⁽²⁾(i, j)
where λ is a hyperparameter that balances the two terms and C_{i,j} is the final cost: the smaller C_{i,j}, the greater the association between the detected and tracked target. The distance measure works well for short-term prediction and matching, while the appearance information recovers tracks lost for a long time, improving the algorithm's robustness to target loss and occlusion.
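The fused cost can be sketched as below. The function signature, the default λ, and the assumption of unit-norm appearance descriptors follow the text's definitions; everything else is illustrative.

```python
import numpy as np

def fused_cost(det, track_mean, S, det_feat, track_feats, lam=0.5):
    """Cᵢⱼ = λ·d⁽¹⁾ + (1−λ)·d⁽²⁾: Mahalanobis motion term plus the minimum
    cosine distance over the track's stored appearance descriptors."""
    # motion term: Mahalanobis distance between detection and prediction
    diff = det - track_mean
    d1 = float(diff.T @ np.linalg.inv(S) @ diff)
    # appearance term: minimum cosine distance over the gallery Rᵢ
    # (descriptors are assumed unit-norm, so rᵀr is the cosine similarity)
    d2 = min(1.0 - float(det_feat @ f) for f in track_feats)
    return lam * d1 + (1 - lam) * d2
```

When the detection coincides with the predicted position and its descriptor matches a stored one exactly, both terms vanish and the cost is zero, i.e. maximal association.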
Fall determination
After the preceding detection and extraction steps, this stage only requires a binary classification of whether the pedestrian in the target area has fallen. When a pedestrian stands, the aspect ratio of the identified ground-truth box is at most 0.4; when the pedestrian falls, the aspect ratio rises to roughly 0.7 to 1.2, the deflection angle drops below a threshold (set to 37° in this invention), and the instantaneous angular acceleration increases. The invention combines these three factors for the fall judgment: they are largely independent and suit a joint decision, and restricting the feature set avoids the high feature-vector dimensionality, complex classifier design, and poor real-time performance that too many selected features would cause.
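The three-factor judgment can be sketched as a simple predicate. The aspect-ratio and angle thresholds come from the text; the angular-acceleration threshold is an assumed placeholder, since the patent does not give a value.

```python
def is_fall(aspect_ratio, deflection_deg, angular_accel,
            ratio_thresh=0.7, angle_thresh=37.0, accel_thresh=5.0):
    """Three-factor fall test from the text: the width/height ratio jumps
    to ~0.7-1.2, the deflection angle drops below 37 degrees, and the
    instantaneous angular acceleration rises (accel_thresh is assumed)."""
    return (aspect_ratio >= ratio_thresh
            and deflection_deg < angle_thresh
            and angular_accel > accel_thresh)
```

A standing pedestrian (ratio ≤ 0.4, large deflection angle) fails the test, while a box whose ratio jumps to 0.9 with a 20° deflection and a sharp angular-acceleration spike passes.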
Considering that falling is a continuous action, the invention adopts a sliding-window feature extraction method in place of the traditional fall-judgment approach: feature data are stored in a sliding window of fixed size; as time advances, new data are appended at the end of the window and the leftmost data are removed.
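The sliding window maps naturally onto a bounded deque; the class and its method names are illustrative, not from the patent.

```python
from collections import deque

class SlidingWindow:
    """Fixed-size feature window: new frames are appended on the right,
    and the oldest frame falls off the left automatically."""
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def push(self, features):
        self.buf.append(features)

    def features(self):
        return list(self.buf)
```

Pushing five frames into a window of size 3 keeps only the last three, exactly the append-right / drop-left behavior described above.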
A support vector machine classifier is constructed from the feature data in the sliding window and trained for fall detection to judge whether the pedestrian has fallen. During training, a large number of fall-sample and non-fall-sample feature vectors are fed to the SVM, and the fall classifier is obtained by training on these samples. A Gaussian kernel is used to project the features into a high-dimensional space:

K(x, z) = exp(−γ‖x − z‖²)

where x and z are a training sample and a test sample respectively, γ is a hyperparameter that must be greater than 0 and tuned, and ‖·‖ denotes the norm operation, i.e. the "measure" of a vector. Compared with polynomial or string kernels, this kernel needs fewer parameters, is highly effective, and reduces computational complexity.
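A minimal sketch of the Gaussian (RBF) kernel defined above; the default γ is an illustrative value.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2), gamma > 0."""
    # squared Euclidean distance between the two feature vectors
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

In practice this is what an SVM library evaluates when configured with an RBF kernel (e.g. scikit-learn's `SVC(kernel="rbf", gamma=...)`); identical inputs give K = 1, and K decays toward 0 as the samples move apart.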
The pedestrian detection box essentially coincides with the ground-truth box, deviating only in details such as hands or feet, which does not affect the subsequent target tracking and fall judgment.
The results of the tests performed on this data set are shown in the following table:
Table 1. Comparison of results
As the table shows, the false-alarm rate, miss rate, and accuracy of the invention are greatly improved over traditional methods.
The system accurately detects and tracks pedestrians in real time under strong light and resists light-and-shadow interference well; when a person moves away from the camera, the scale of the detection frame adapts; and the problem of losing the target when an occluding object appears during tracking is effectively solved.
Claims (5)
1. A real-time detection method for pedestrian falls in a complex environment, characterized by comprising the following steps:
step one: preprocessing the acquired video, converting each frame of the video stream into a picture, normalizing it, and outputting pictures at a resolution of 416×416;
step two: pedestrian detection. Darknet-53 is used as the backbone network for feature extraction; following the feature-pyramid-network idea, FPN up-sampling is applied to the input picture to obtain and fuse feature values at three scales, so that the network can adjust the prior boxes of each receptive-field convolution layer according to the actual size of the object's ground truth. The intersection-over-union IoU between bounding boxes is computed as:

IoU = area(A ∩ B) / area(A ∪ B)

where area(A) is the area of the original annotation box and area(B) is the area of the candidate box; the higher their overlap, the closer the result is to 1, which yields the best-matching box. Finally, the size of the detection frame is adjusted to account for the varying apparent size of pedestrians as their distance from the camera changes;
step three: target tracking using the SORT algorithm, i.e. extracting the corresponding targets from all target frames detected in the pedestrian detection stage, extracting features, and then computing similarity; the current position is predicted by a Kalman filter, and the Hungarian algorithm associates detection boxes with target positions;
step four: fall judgment, i.e. a binary classification to decide whether the pedestrian in the target area has fallen: when the pedestrian stands, the aspect ratio of the identified ground-truth box is at most 0.4; when the pedestrian falls, the aspect ratio rises to 0.7-1.2, the deflection angle drops below a set threshold, and the instantaneous angular acceleration increases.
2. The real-time detection method for pedestrian falls in a complex environment according to claim 1, wherein in step two, during training a subset of images with pronounced light-and-shadow effects in the COCO data set is selected for a dedicated training test of the YOLOv3 network; the data set is augmented by rotating suitable pedestrian images, randomly divided into training and test sets at a 7:3 ratio, and labeled as containing a pedestrian or not.
3. The method according to claim 1, wherein in the third step, the feature values of all target frames in the previous frame, including the center-position coordinates, aspect ratio, height, and speed, are obtained through the Kalman filter; the position of the current target is then predicted by calculating the error covariance matrix, and correction is performed according to the following formula:
x̂_k = x̂_k⁻ + K_k (Z_k − H x̂_k⁻)

in the formula, K_k is the Kalman gain and Z_k is the actual measurement; Z_k corrects the signal estimate from the previous state via K_k; x̂_k and x̂_{k−1} represent the posterior state estimates at times k and k−1 respectively, which are part of the filtering result, i.e. the updated optimal estimate; the finally obtained optimal estimate is taken as the real position of the current frame;
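One correction step can be sketched as follows, assuming a linear measurement model H and measurement noise covariance R, neither of which the claim spells out:

```python
import numpy as np

def kalman_correct(x_prior, P_prior, z, H, R):
    """One Kalman correction step: blend the prior estimate x_prior with
    measurement z via the Kalman gain, returning the posterior estimate."""
    S = H @ P_prior @ H.T + R                  # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)       # Kalman gain K_k
    x_post = x_prior + K @ (z - H @ x_prior)   # corrected (posterior) state
    P_post = (np.eye(len(x_prior)) - K @ H) @ P_prior  # updated covariance
    return x_post, P_post
```

With an identity measurement model and equal prior/measurement uncertainty, the posterior lands halfway between prediction and measurement, which is the blending the claim describes.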
and secondly, association of consecutive frames is realized through the Hungarian algorithm: the intersection over union IoU of the target prediction box and the detection box is used as the weight in the Hungarian algorithm to compute the similarity matrix between the two frames, thereby realizing target matching and tracking across frames.
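A sketch of this IoU-weighted association, using SciPy's Hungarian solver; the `iou_min` gating threshold of 0.3 is an assumed value, not taken from the patent:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(predicted, detected, iou_min=0.3):
    """Match predicted track boxes to detections by maximizing total IoU
    (Hungarian assignment on cost = 1 - IoU); drop pairs below iou_min."""
    cost = np.array([[1.0 - iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_min]
```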
4. The method for detecting the falling of the pedestrian in the complex environment in real time according to claim 1 or 3, wherein a neural-network-based feature extraction method is added to the SORT algorithm, combining appearance features with moving-target information;
the degree of association of the moving object is represented by the Mahalanobis distance between the prediction frame and the detection frame, namely:

d_1(i, j) = (d_j − y_i)^T S_i^{−1} (d_j − y_i)
in the formula, T denotes transposition, d_j is the jth detection box, y_i is the ith predicted target position, S_i is the covariance matrix between the detected position and the mean tracking position, and S_i^{−1} is its inverse. The Mahalanobis distance preserves the spatial-domain distribution; the degree of association of the appearance features is expressed by the minimum cosine distance between the ith track and the jth detection, namely:

d_2(i, j) = min{ 1 − r_j^T r_k^(i) : r_k^(i) ∈ R_i }
wherein a surface feature descriptor r_j with ‖r_j‖ = 1 is calculated for each detection box d_j, and for the ith tracking target a gallery R_i is constructed to store the descriptors of the most recent 100 frames;
finally, the two dimensions are fused, namely:
C_i,j = λ·d_1(i, j) + (1 − λ)·d_2(i, j)
where λ is a hyperparameter adjusting the weights of the two terms and C_i,j is the final cost: the smaller C_i,j is, the greater the degree of association between the detected target and the tracked target. The distance measurement works well for short-term prediction and matching, while the appearance information is effective for tracks lost over a long period, which improves the robustness of the algorithm to target loss and occlusion.
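The three quantities of claim 4 can be sketched as below; the default λ = 0.5 is a placeholder, since the patent treats λ as a tunable hyperparameter:

```python
import numpy as np

def mahalanobis_sq(det_pos, track_mean, track_cov):
    """Squared Mahalanobis distance d_1(i, j) between a detection position
    and a track's predicted position with covariance track_cov."""
    diff = det_pos - track_mean
    return float(diff.T @ np.linalg.inv(track_cov) @ diff)

def min_cosine(det_feature, track_gallery):
    """d_2(i, j): minimum cosine distance between a unit-norm detection
    descriptor and the track's stored (unit-norm) descriptors."""
    return min(1.0 - float(det_feature @ r) for r in track_gallery)

def fused_cost(d1, d2, lam=0.5):
    """C_i,j = lam*d1 + (1 - lam)*d2; smaller means stronger association."""
    return lam * d1 + (1.0 - lam) * d2
```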
5. The method for detecting the fall of the pedestrian in the complex environment in real time as claimed in claim 1, wherein step four adopts a sliding-window-based feature extraction method: feature data are stored in a sliding window of fixed size; as time advances, new data are appended at the end of the window and the leftmost (oldest) data are removed;
a support vector machine classifier is constructed from the feature data in the sliding window to perform fall-detection training and judge whether the pedestrian falls; in the training process, a large amount of feature data from fall samples and non-fall samples is fed into the SVM, and the fall classifier is obtained by training on these samples; the features are projected into a high-dimensional space using a Gaussian kernel as the kernel function, with the formula:

K(x, z) = exp(−γ‖x − z‖²)
wherein x and z are a training sample and a test sample respectively, γ is a hyperparameter that must be greater than 0 and is set by parameter tuning, and ‖·‖ denotes the Euclidean norm.
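A sketch of such an RBF-kernel fall classifier using scikit-learn's `SVC`; the two-dimensional synthetic features (e.g. aspect ratio plus one angle-derived value) stand in for the real sliding-window features, which the claim does not enumerate:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-in for sliding-window feature vectors: fall samples
# (label 1) cluster at high aspect ratio, non-fall samples (label 0) low.
rng = np.random.default_rng(0)
falls = rng.normal(loc=[0.9, 0.8], scale=0.05, size=(50, 2))
stands = rng.normal(loc=[0.3, 0.1], scale=0.05, size=(50, 2))
X = np.vstack([falls, stands])
y = np.array([1] * 50 + [0] * 50)

# RBF (Gaussian) kernel K(x, z) = exp(-gamma * ||x - z||^2), gamma > 0.
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

pred_fall = clf.predict([[0.95, 0.85]])[0]   # near the fall cluster
pred_stand = clf.predict([[0.28, 0.12]])[0]  # near the standing cluster
```

On these well-separated synthetic clusters the classifier assigns each probe point to its nearby cluster's label.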
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011427824.9A CN112541424A (en) | 2020-12-07 | 2020-12-07 | Real-time detection method for pedestrian falling under complex environment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112541424A true CN112541424A (en) | 2021-03-23 |
Family
ID=75019655
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011427824.9A Pending CN112541424A (en) | 2020-12-07 | 2020-12-07 | Real-time detection method for pedestrian falling under complex environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112541424A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145696A (en) * | 2017-06-28 | 2019-01-04 | 安徽清新互联信息科技有限公司 | A kind of Falls Among Old People detection method and system based on deep learning |
CN109670396A (en) * | 2018-11-06 | 2019-04-23 | 华南理工大学 | A kind of interior Falls Among Old People detection method |
CN110532850A (en) * | 2019-07-02 | 2019-12-03 | 杭州电子科技大学 | A kind of fall detection method based on video artis and hybrid classifer |
CN110895671A (en) * | 2018-09-13 | 2020-03-20 | 纬创资通股份有限公司 | Fall detection method and electronic system using same |
CN111488795A (en) * | 2020-03-09 | 2020-08-04 | 天津大学 | Real-time pedestrian tracking method applied to unmanned vehicle |
CN111724566A (en) * | 2020-05-20 | 2020-09-29 | 同济大学 | Pedestrian falling detection method and device based on intelligent lamp pole video monitoring system |
CN111931582A (en) * | 2020-07-13 | 2020-11-13 | 中国矿业大学 | Image processing-based highway traffic incident detection method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076899A (en) * | 2021-04-12 | 2021-07-06 | 华南理工大学 | High-voltage transmission line foreign matter detection method based on target tracking algorithm |
CN113269033A (en) * | 2021-04-12 | 2021-08-17 | 南京瀚元科技有限公司 | Fall detection algorithm based on video sequence |
CN113076899B (en) * | 2021-04-12 | 2023-04-07 | 华南理工大学 | High-voltage transmission line foreign matter detection method based on target tracking algorithm |
CN113327244A (en) * | 2021-06-25 | 2021-08-31 | 南京爱奇艺智能科技有限公司 | Handle controller LED lamp positioning method and system based on computer vision |
CN114220119A (en) * | 2021-11-10 | 2022-03-22 | 深圳前海鹏影数字软件运营有限公司 | Human body posture detection method, terminal device and computer readable storage medium |
CN114220119B (en) * | 2021-11-10 | 2022-08-12 | 深圳前海鹏影数字软件运营有限公司 | Human body posture detection method, terminal device and computer readable storage medium |
CN114463788A (en) * | 2022-04-12 | 2022-05-10 | 深圳市爱深盈通信息技术有限公司 | Fall detection method, system, computer equipment and storage medium |
CN116824549A (en) * | 2023-08-29 | 2023-09-29 | 所托(山东)大数据服务有限责任公司 | Target detection method and device based on multi-detection network fusion and vehicle |
CN116824549B (en) * | 2023-08-29 | 2023-12-08 | 所托(山东)大数据服务有限责任公司 | Target detection method and device based on multi-detection network fusion and vehicle |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112541424A (en) | Real-time detection method for pedestrian falling under complex environment | |
Nadeem et al. | Automatic human posture estimation for sport activity recognition with robust body parts detection and entropy markov model | |
WO2019101220A1 (en) | Deep learning network and average drift-based automatic vessel tracking method and system | |
CN107016357B (en) | Video pedestrian detection method based on time domain convolutional neural network | |
CN109766796B (en) | Deep pedestrian detection method for dense crowd | |
CN106570490B (en) | A kind of pedestrian's method for real time tracking based on quick clustering | |
CN110782483A (en) | Multi-view multi-target tracking method and system based on distributed camera network | |
CN107194950B (en) | Multi-person tracking method based on slow feature analysis | |
CN103886325A (en) | Cyclic matrix video tracking method with partition | |
CN106575363A (en) | Method for tracking keypoints in scene | |
CN112541403B (en) | Indoor personnel falling detection method by utilizing infrared camera | |
CN116403139A (en) | Visual tracking and positioning method based on target detection | |
Yadav | Efficient method for moving object detection in cluttered background using Gaussian Mixture Model | |
Li et al. | The application of Yolov4 and a new pedestrian clustering algorithm to implement social distance monitoring during the COVID-19 pandemic | |
CN117036404A (en) | Monocular thermal imaging simultaneous positioning and mapping method and system | |
CN115527269A (en) | Intelligent human body posture image identification method and system | |
CN113971688B (en) | Anchor-free multi-target tracking method for enhancing ID re-identification | |
Xu-Wei et al. | Real-time hand tracking based on YOLOv4 model and Kalman filter | |
CN116862832A (en) | Three-dimensional live-action model-based operator positioning method | |
Wang et al. | Pmds-slam: Probability mesh enhanced semantic slam in dynamic environments | |
CN117037204A (en) | Tumble detection method, tumble detection device, electronic equipment and computer program product | |
CN113781521A (en) | Improved YOLO-Deepsort-based bionic robot fish detection and tracking method | |
Li et al. | Review of multi-object tracking based on deep learning | |
CN117372844B (en) | Skeleton detection and fall detection method based on improved space-time adaptive graph convolution | |
KR20070082078A (en) | Vehicle tracking system using compare background vector weight |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||