CN113095295B - Fall detection method based on improved key frame extraction - Google Patents

Fall detection method based on improved key frame extraction

Info

Publication number
CN113095295B
CN113095295B
Authority
CN
China
Prior art keywords
key frame
falling
frame
key
behaviors
Prior art date
Legal status
Active
Application number
CN202110502441.1A
Other languages
Chinese (zh)
Other versions
CN113095295A (en)
Inventor
胡佳佳
李伟彤
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110502441.1A priority Critical patent/CN113095295B/en
Publication of CN113095295A publication Critical patent/CN113095295A/en
Application granted granted Critical
Publication of CN113095295B publication Critical patent/CN113095295B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/56 - Extraction of image or video features relating to colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a fall detection method based on improved key frame extraction, which comprises the following steps. S1: acquire an unprocessed original video stream. S2: perform preliminary key frame extraction on the original video stream using the inter-frame difference method. S3: perform secondary optimization on the key frames generated in step S2 using a clustering algorithm to obtain the optimal key frames. S4: extract features from the optimal key frames and construct feature vectors. S5: use the extracted feature vectors as the input of a support vector machine (SVM) for an initial judgment, the SVM distinguishing non-falling behaviors, falling behaviors and falling-like behaviors. S6: perform secondary classification on the feature vectors whose initial result is falling-like behavior using a convolutional neural network, output the detection result, and complete the final fall detection. Compared with the traditional clustering method, the algorithm provided by the invention has lower redundancy, a higher recall ratio and higher accuracy, saving considerable time for subsequent fall detection and improving its accuracy.

Description

Fall detection method based on improved key frame extraction
Technical Field
The invention relates to the technical field of video monitoring, in particular to a fall detection method based on improved key frame extraction.
Background
As people age, their bodily functions gradually decline, and falls seriously threaten the life safety of the elderly; statistics show that falls have become a leading cause of accidental injury and death among the elderly, and the risk of death can be reduced by 80% if an elderly person who has fallen is rescued in time. Video-based fall detection is currently the mainstream approach; therefore, to judge the state of the elderly more quickly, useless frames in the video sequence can be removed, so that only key frames, which reflect the video content without losing the motion sequence, need to be examined.
How to retrieve valid, critical information from a large volume of video data within a prescribed time is currently a key problem that needs to be solved urgently. A key frame is one frame, or several frames, that reflects the main content of a shot, so it can concisely summarize the main visual content of a video; compared with the number of image frames contained in the original video, using key frames greatly reduces the data volume of the video index and thus provides good data preprocessing for later applications.
The four dominant methods for extracting key frames at present are: (1) shot-boundary-based methods, which typically extract frames at fixed positions of the shot as key frames and therefore cannot fully reflect the video content; (2) methods based on visual content analysis, which take the degree of change of the video content as the criterion for selecting key frames and treat frames with severe changes as key frames, but can produce a large number of redundant video frames while still expressing the video content incompletely; (3) methods based on motion analysis, which compute the amount of motion in the shot and select key frames where the motion reaches a local minimum; (4) clustering-based methods, whose drawbacks are that image data are high-dimensional, the amount of computation is large, the computation process is complex, memory overflow may occur, a large amount of redundancy can be generated, and efficiency is low.
The Chinese patent with publication number CN107220604A discloses a video-based fall detection method comprising the following steps: S1, processing the video image and identifying and locating the human body region in the image; S2, extracting joint points for the human body region based on a cascade regression network to obtain a set of human body joint points, where several regression networks with the same structure are cascaded behind a first-stage network to fine-tune the coordinate positions of the joint points; S3, taking the motion vector of each joint point as the feature of human motion and dynamically analyzing whether the human body falls by analyzing the changes of the joint points. That patent does not extract key frames from the video, so its overall decision process is not efficient enough.
Disclosure of Invention
The invention provides a fall detection method based on improved key frame extraction, which uses an improved key frame extraction technique to extract a small number of key frames that still fully reflect the video content, so that falling behavior can be discovered more quickly and elderly people who have fallen can be treated in time.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a fall detection method based on improved key frame extraction comprises the following steps:
s1: acquiring an unprocessed original video stream;
s2: performing key frame extraction on the original video stream preliminarily by using an inter-frame difference method;
s3: performing secondary optimization on the key frames generated in the step S2 by using a clustering algorithm to obtain optimal key frames;
s4: extracting features from the optimal key frames, and constructing feature vectors;
s5: the extracted feature vector is used as the input of a Support Vector Machine (SVM) for initial judgment, and the support vector machine is used for distinguishing non-falling behaviors, falling behaviors and falling-like behaviors;
s6: and performing secondary classification on the feature vector with the distinguishing result being the fall-like behavior by using a convolutional neural network, outputting a detection result, and finishing final detection of the fall-like behavior.
Preferably, in step S2, the key frame extraction is performed on the original video stream preliminarily by using an inter-frame difference method, which specifically includes the following steps:
s2.1: reading an original video stream, and calculating the frame difference between a current frame and a previous frame;
s2.2: obtaining the average inter-frame difference from the result of step S2.1; specifically, the corresponding pixel values of the two frames are differenced, the differences are summed, and the sum is divided by the total number of pixels to obtain the average pixel variation;
s2.3: all frames of the original video stream are ordered according to the value of the average inter-frame difference, and the first n frames are selected as key frames.
Preferably, in step S3, the key frames generated in step S2 are secondarily optimized using a K-means clustering algorithm.
Preferably, the obtaining the optimal key frame in step S3 specifically includes the following steps:
s3.1: calculating a color feature vector of each key frame;
s3.2: taking a first frame image in an image data set formed by all key frames as an initial cluster center;
s3.3: respectively measuring the similarity between each remaining key frame and all current cluster centers; if the similarity is smaller than a threshold value, creating a new cluster for that key frame; if the similarity is greater than the threshold, adding the key frame to the existing cluster;
s3.4: repeating the step S3.3 until all the key frames are taken out;
s3.5: and after the clustering is completed, selecting the key frame nearest to the cluster center as the optimal key frame of the cluster video frame.
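A sketch of this single-pass clustering (steps S3.2 to S3.5) follows, assuming each candidate key frame has already been reduced to a normalized 72-bin HSV feature vector as described in the next paragraphs; the histogram-intersection similarity and the Euclidean nearest-to-center selection are assumptions, since the text does not fix these formulas, and the 0.7 threshold is taken from Embodiment 1.

```python
import numpy as np

def histogram_similarity(a, b):
    # Assumed similarity measure: normalized histogram intersection in [0, 1].
    return float(np.minimum(a, b).sum() / max(a.sum(), 1e-12))

def secondary_optimization(features, frames, threshold=0.7):
    """Single-pass clustering of candidate key frames (steps S3.2-S3.5)."""
    centers = [np.asarray(features[0], dtype=np.float64)]  # S3.2: first frame starts the first cluster
    members = [[0]]
    for i in range(1, len(features)):
        f = np.asarray(features[i], dtype=np.float64)
        sims = [histogram_similarity(f, c) for c in centers]
        best = int(np.argmax(sims))
        if sims[best] < threshold:        # S3.3: too dissimilar, so open a new cluster
            centers.append(f)
            members.append([i])
        else:                             # S3.3: similar enough, so join and re-average the cluster center
            members[best].append(i)
            centers[best] = np.mean([features[j] for j in members[best]], axis=0)
    # S3.5: from each cluster, keep the frame closest to its center as the optimal key frame.
    optimal = []
    for c, idx in zip(centers, members):
        dists = [np.linalg.norm(np.asarray(features[j]) - c) for j in idx]
        optimal.append(frames[idx[int(np.argmin(dists))]])
    return optimal
```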
Preferably, in step S3.1, a color feature vector of each key frame is calculated, specifically:
s3.1.1: the color space of each key frame image is converted from RGB to HSV;
s3.1.2: H, S and V are non-uniformly quantized in the ratio 8:3:3 to form a 72-dimensional color feature vector, where H ∈ [0, 360], S ∈ [0, 1], V ∈ [0, 1].
The HSV color feature vector Fi of each frame is expressed as:
Fi = 9H + 3S + V, i = 1, 2, ..., n.
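A sketch of this quantization is shown below, assuming OpenCV's 8-bit HSV convention (H in [0, 180), S and V in [0, 256)) and uniform level boundaries; the exact non-uniform 8:3:3 boundaries are not spelled out in the text, so the binning here is only illustrative.

```python
import cv2
import numpy as np

def hsv_feature_vector(frame_bgr):
    """72-bin HSV color feature: F = 9*H + 3*S + V with H in {0..7}, S and V in {0..2}."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Quantize H into 8 levels, S and V into 3 levels each (uniform split shown here,
    # as an assumption in place of the unspecified non-uniform boundaries).
    hq = np.minimum(h.astype(np.int32) * 8 // 180, 7)
    sq = np.minimum(s.astype(np.int32) * 3 // 256, 2)
    vq = np.minimum(v.astype(np.int32) * 3 // 256, 2)
    f = 9 * hq + 3 * sq + vq                      # per-pixel bin index in [0, 71]
    hist = np.bincount(f.ravel(), minlength=72).astype(np.float64)
    return hist / hist.sum()                      # normalized 72-dimensional color feature vector
```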
preferably, in step S3.3, when the similarity is greater than the threshold and the key frame is added to the existing cluster, the cluster center is recalculated by averaging over the members of that cluster.
Preferably, in step S3.3, a similarity measure is computed between each remaining key frame and all current cluster centers, the similarity measure being based on the inter-frame distance d(Fi, Fj) between the i-th and j-th frames, computed from the elements Fi(k), Fj(k) of their color feature vectors; the number of inter-frame distances satisfying d(Fi, Fj) > m + 2σ² is taken as the number K of key frames to be extracted, where m and σ² are respectively the mean and variance of the feature vectors of all n frames.
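The adaptive choice of K can be sketched as below, interpreting the inter-frame distance as the Euclidean distance between consecutive 72-dimensional feature vectors and m, σ² as the mean and variance of those distances (a scalar threshold requires scalar statistics); both interpretations are assumptions where the text is ambiguous.

```python
import numpy as np

def adaptive_keyframe_count(features):
    """K = number of inter-frame distances exceeding m + 2 * sigma^2."""
    feats = np.asarray(features, dtype=np.float64)
    # Assumed distance d(Fi, Fj): Euclidean distance between consecutive feature vectors.
    dists = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    m, var = dists.mean(), dists.var()
    return int(np.sum(dists > m + 2 * var))
```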
Preferably, in step S4, features are extracted from the optimal key frame, and feature vectors are constructed, which specifically includes the following steps:
s4.1: extracting the aspect ratio Fr, the centroid Fcen, the width change rate Fcha and the longest intercept angle Fa of the human body contour from the optimal key frames, wherein the aspect ratio Fr is the aspect ratio of the circumscribed rectangle of the human body in the key frame, the centroid Fcen is the central position of the human body in the key frame, the width change rate Fcha describes how the width of the target person changes across key frames, and Fa is the longest intercept angle of the human body contour in the key frame;
s4.2: combining the above features, a feature vector F = [Fr, Fcen, Fcha, Fa] is constructed.
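A sketch of step S4 on a single silhouette is given below, assuming the human region has already been segmented into a binary mask (for example by background subtraction) and approximating the longest intercept angle by the orientation of an ellipse fitted to the contour; that approximation, and the use of both centroid coordinates in the vector, are assumptions rather than the exact definitions of the method.

```python
import cv2
import numpy as np

def body_features(mask, prev_width=None):
    """Build the feature vector of step S4 from a binary human-silhouette mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    c = max(contours, key=cv2.contourArea)              # largest contour is taken as the person
    x, y, w, h = cv2.boundingRect(c)
    fr = w / h                                          # Fr: aspect ratio of the circumscribed rectangle
    m = cv2.moments(c)
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]   # Fcen: centroid of the silhouette
    fcha = w / prev_width if prev_width else 1.0        # Fcha: width change rate vs. the previous key frame
    # Fa approximated by the orientation of an ellipse fitted to the contour
    # (an assumption; the method defines it via the longest intercept line of the contour).
    (_, _), (_, _), fa = cv2.fitEllipse(c)
    feature = np.array([fr, cx, cy, fcha, fa], dtype=np.float32)
    return feature, w                                   # w is carried forward to compute the next Fcha
```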
Preferably, the step S5 specifically includes the following steps:
s5.1: dividing a data set collected in advance into a training set and a testing set according to the ratio of 6:4, and training and testing the SVM, wherein the data set comprises characteristic vectors and label information of the current state;
s5.2: SVM model training part: the svm_train module of the Libsvm library is used; the SVM type is set to C_SVC and the kernel type to RBF, the gamma parameter of the RBF kernel is set to 2 and the loss (penalty) parameter of C_SVC is set to 1; the parameter prob is the training sample set and stores the total number of training samples, the fall labels of the samples and all feature vectors used for training;
s5.3: the SVM prediction part uses the svm_predict module of the Libsvm library, whose parameters include model and x, where model is the path of the trained model file and x is the sample to be detected; based on the training information in the model file, the prediction part takes samples from the test set in turn, classifies them and returns the classification result of each sample;
s5.4: the SVM carries out a primary classification of the data set into three results of non-falling behaviors, falling behaviors and falling-like behaviors, wherein the falling-like behaviors comprise: lying down, sitting down and standing up quickly.
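A sketch of this first-stage classifier follows, using scikit-learn's SVC (which wraps Libsvm) in place of direct svm_train/svm_predict calls, with the C_SVC/RBF configuration of step S5.2 (gamma = 2, penalty parameter C = 1); the 0/1/2 label encoding for non-falling, falling and falling-like behaviors is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# X: feature vectors F = [Fr, Fcen, Fcha, Fa]; y: 0 = non-falling, 1 = falling, 2 = falling-like.
def train_first_stage(X, y):
    # 6:4 split of the pre-collected data set (step S5.1).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, stratify=y, random_state=0)
    # C_SVC with an RBF kernel, gamma = 2 and penalty parameter C = 1 (step S5.2).
    clf = SVC(kernel="rbf", gamma=2.0, C=1.0)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)                  # step S5.3: classify each test sample in turn
    print("first-stage accuracy:", float(np.mean(pred == y_te)))
    return clf
```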
Preferably, the step S6 specifically includes the following steps:
s6.1: dividing the samples judged by the SVM in step S5 to be falling-like behavior into a training set and a test set at a ratio of 6:4;
s6.2: taking the training sample set in the step S6.1 as input of a convolutional neural network, and performing deep learning to obtain deep features capable of distinguishing normal behaviors and falling behaviors;
s6.3: the model classifier is a three-dimensional convolutional neural network with 16 weight layers: 13 convolutional layers and 3 fully connected layers, plus 5 pooling layers and a softmax classification layer, with a ReLU following every convolutional and fully connected layer; the picture resolution is 224 × 224, the initial learning rate of the model is set to 0.005, the learning-rate decay rate is 0.8, the weight decay is 0.0006, and the maximum number of iterations is 20K; all convolutional layers use 3D convolution kernels of size 3 × 3 × 3 with stride 1 × 1 × 1, and the numbers of convolution kernels are 64, 128, 256, 256, 512, 512, 512 in turn; the pooling layers use 3D max pooling with kernels of size 2 × 2 × 2 and strides of the same size;
s6.4: iterative training is continuously carried out to obtain a CNN model, a test set sample is input into the trained CNN model, classification is carried out by using softmax, a classification result is output, and final falling detection is completed.
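A compact PyTorch sketch in the spirit of step S6.3 is given below: a VGG-style 3D network with 13 convolutional layers, 5 max-pooling layers and 3 fully connected layers, a ReLU after every convolutional and fully connected layer, and 224 × 224 inputs. The per-block channel widths, the 32-frame clip length, the SGD optimizer and the adaptive pooling before the classifier are assumptions made to keep the sketch self-contained; softmax is applied by the loss function at training time.

```python
import torch
import torch.nn as nn

class Fall3DCNN(nn.Module):
    """VGG-style 3D CNN: 13 conv layers + 3 FC layers for the second-stage fall classification."""
    # Assumed (channels, number of conv layers) per block; the text lists 64, 128, 256, ..., 512.
    cfg = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]

    def __init__(self, num_classes=2):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch, n_convs in self.cfg:
            for _ in range(n_convs):
                # 3 x 3 x 3 kernels, stride 1 x 1 x 1, each followed by a ReLU.
                layers += [nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
                in_ch = out_ch
            # 3D max pooling with a 2 x 2 x 2 kernel and stride.
            layers.append(nn.MaxPool3d(kernel_size=2, stride=2))
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool3d((1, 7, 7))     # assumption: fixes the FC input size
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),               # class scores; softmax is applied by the loss
        )

    def forward(self, x):
        # x: (batch, 3, 32, 224, 224); clips of at least 32 frames are assumed so that
        # each of the five poolings can halve the temporal axis.
        return self.classifier(self.pool(self.features(x)))

model = Fall3DCNN()
# Hyperparameters from step S6.3: initial learning rate 0.005, weight decay 0.0006,
# learning-rate decay factor 0.8 applied via a scheduler; the 20K-iteration cap and
# the softmax classification belong to the training loop.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, weight_decay=0.0006)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.8)
criterion = nn.CrossEntropyLoss()
```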
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
after the video data are collected, preliminary key frame extraction is first performed on the original video by the inter-frame difference method, whose calculation steps are simple, and a large number of similar frames are deleted; a clustering algorithm is then used for secondary optimization, which overcomes the drawbacks of the traditional method such as heavy computation, a complex computation process and large redundancy. Compared with the traditional clustering method, the algorithm provided by the invention has lower redundancy, a higher recall ratio and higher accuracy, saving considerable time for subsequent fall detection and improving its accuracy.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of an optimal key frame result in an embodiment.
Fig. 3 is a schematic diagram of human body characteristics in an embodiment.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Example 1
The embodiment provides a fall detection method based on improved key frame extraction, as shown in fig. 1, comprising the following steps:
s1: acquiring an unprocessed original video stream;
s2: performing key frame extraction on the original video stream preliminarily by using an inter-frame difference method;
s3: performing secondary optimization on the key frames generated in the step S2 by using a clustering algorithm to obtain optimal key frames;
s4: extracting features from the optimal key frames, and constructing feature vectors;
s5: the extracted feature vector is used as the input of a Support Vector Machine (SVM) for initial judgment, and the support vector machine is used for distinguishing non-falling behaviors, falling behaviors and falling-like behaviors;
s6: and performing secondary classification on the feature vector with the distinguishing result being the fall-like behavior by using a convolutional neural network, outputting a detection result, and finishing final detection of the fall-like behavior.
In step S2, the key frame extraction is carried out on the original video stream preliminarily by utilizing an inter-frame difference method, and the method specifically comprises the following steps:
s2.1: reading an original video stream, and calculating the frame difference between a current frame and a previous frame;
s2.2: obtaining average interframe difference according to the result of the step S2.1;
s2.3: all frames of the original video stream are ordered according to the value of the average inter-frame difference, and the first n frames are selected as key frames.
And in the step S3, performing secondary optimization on the key frames generated in the step S2 by using a K-means clustering algorithm.
The obtaining of the optimal key frame in the step S3 specifically includes the following steps:
s3.1: calculating a color feature vector of each key frame;
s3.2: taking a first frame image in an image data set formed by all key frames as an initial cluster center;
s3.3: respectively measuring the similarity between each remaining key frame and all current cluster centers; if the similarity is smaller than a threshold value, creating a new cluster for that key frame; if the similarity is greater than the threshold, adding the key frame to the existing cluster; in this embodiment, the threshold is taken to be 0.7.
S3.4: repeating the step S3.3 until all the key frames are taken out;
s3.5: after the clustering is completed, the key frame nearest to the cluster center is selected as the optimal key frame of the cluster video frame, as shown in fig. 2.
In step S3.1, a color feature vector of each key frame is calculated, specifically:
s3.1.1: the color space of each key frame image is converted from RGB to HSV;
s3.1.2: H, S and V are non-uniformly quantized in the ratio 8:3:3 to form a 72-dimensional color feature vector, where H ∈ [0, 360], S ∈ [0, 1], V ∈ [0, 1].
The HSV color feature vector Fi of each frame is expressed as:
Fi = 9H + 3S + V, i = 1, 2, ..., n.
in step S3.3, when the similarity is greater than the threshold and the key frame is added to the existing cluster, the cluster center is recalculated by averaging over the members of that cluster.
In step S3.3, a similarity measure is computed between each remaining key frame and all current cluster centers, the similarity measure being based on the inter-frame distance d(Fi, Fj) between the i-th and j-th frames, computed from the elements Fi(k), Fj(k) of their color feature vectors; the number of inter-frame distances satisfying d(Fi, Fj) > m + 2σ² is taken as the number K of key frames to be extracted, where m and σ² are respectively the mean and variance of the feature vectors of all n frames.
In step S4, features are extracted from the optimal key frame, and feature vectors are constructed, which specifically includes the following steps:
s4.1: as shown in fig. 3, the aspect ratio Fr, the centroid Fcen, the width change rate Fcha, and the longest intercept angle Fa of the human body contour are extracted from the optimal key frame, wherein:
the aspect ratio Fr is the aspect ratio of the circumscribed rectangle of the human body in the key frame, and is smaller than 1 when the human body is upright;
the centroid Fcen is the central position of the human body in the key frame; movement of the human body inevitably causes the centroid to shift;
the width change rate Fcha describes how the width of the target person changes across key frames; during normal motion the width change rate does not exceed 1, whereas during an abnormal fall the width of the target person changes abruptly and the width change rate exceeds 1, so it can serve as a basis for distinguishing falls from other normal human motions;
the longest intercept angle Fa of the human body contour is the longest intercept-line angle feature in the key frame; in the standing posture the longest intercept angle is smaller than 90 degrees, so it can serve as a basis for distinguishing falling behavior from normal behavior;
s4.2: combining the above features, a feature vector F = [Fr, Fcen, Fcha, Fa] is constructed.
The step S5 specifically includes the following steps:
s5.1: dividing a data set collected in advance into a training set and a testing set according to the ratio of 6:4, and training and testing the SVM, wherein the data set comprises characteristic vectors and label information of the current state;
s5.2: SVM model training part: the svm_train module of the Libsvm library is used; the SVM type is set to C_SVC and the kernel type to RBF, the gamma parameter of the RBF kernel is set to 2 and the loss (penalty) parameter of C_SVC is set to 1; the parameter prob is the training sample set and stores the total number of training samples, the fall labels of the samples and all feature vectors used for training;
s5.3: the SVM prediction part uses the svm_predict module of the Libsvm library, whose parameters include model and x, where model is the path of the trained model file and x is the sample to be detected; based on the training information in the model file, the prediction part takes samples from the test set in turn, classifies them and returns the classification result of each sample;
s5.4: the SVM carries out a primary classification of the data set into three results of non-falling behaviors, falling behaviors and falling-like behaviors, wherein the falling-like behaviors comprise: lying down, sitting down and standing up quickly.
The step S6 specifically includes the following steps:
s6.1: dividing the samples judged by the SVM in step S5 to be falling-like behavior into a training set and a test set at a ratio of 6:4;
s6.2: taking the training sample set in the step S6.1 as input of a convolutional neural network, and performing deep learning to obtain deep features capable of distinguishing normal behaviors and falling behaviors;
s6.3: the model classifier is a three-dimensional convolutional neural network with 16 weight layers: 13 convolutional layers and 3 fully connected layers, plus 5 pooling layers and a softmax classification layer, with a ReLU following every convolutional and fully connected layer; the picture resolution is 224 × 224, the initial learning rate of the model is set to 0.005, the learning-rate decay rate is 0.8, the weight decay is 0.0006, and the maximum number of iterations is 20K; all convolutional layers use 3D convolution kernels of size 3 × 3 × 3 with stride 1 × 1 × 1, and the numbers of convolution kernels are 64, 128, 256, 256, 512, 512, 512 in turn; the pooling layers use 3D max pooling with kernels of size 2 × 2 × 2 and strides of the same size;
s6.4: iterative training is continuously carried out to obtain a CNN model, a test set sample is input into the trained CNN model, classification is carried out by using softmax, a classification result is output, and final falling detection is completed.
The same or similar reference numerals correspond to the same or similar components;
the terms describing the positional relationship in the drawings are merely illustrative, and are not to be construed as limiting the present patent;
it is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is not necessary here nor is it exhaustive of all embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (8)

1. The fall detection method based on the improved key frame extraction is characterized by comprising the following steps of:
s1: acquiring an unprocessed original video stream;
s2: performing key frame extraction on the original video stream preliminarily by using an inter-frame difference method;
s3: performing secondary optimization on the key frames generated in the step S2 by using a clustering algorithm to obtain optimal key frames;
s4: extracting features from the optimal key frames, and constructing feature vectors;
s5: the extracted feature vector is used as the input of a Support Vector Machine (SVM) for initial judgment, and the support vector machine is used for distinguishing non-falling behaviors, falling behaviors and falling-like behaviors;
s6: performing secondary classification on the feature vector of which the distinguishing result is the fall-like behavior by using a convolutional neural network, outputting a detection result, and finishing final detection of the fall-like behavior;
the obtaining of the optimal key frame in the step S3 specifically includes the following steps:
s3.1: calculating a color feature vector of each key frame;
s3.2: taking a first frame image in an image data set formed by all key frames as an initial cluster center;
s3.3: respectively measuring the similarity between each remaining key frame and all current cluster centers; if the similarity is smaller than a threshold value, creating a new cluster for that key frame; if the similarity is greater than the threshold, adding the key frame to the existing cluster;
s3.4: repeating the step S3.3 until all the key frames are taken out;
s3.5: after the clustering is completed, selecting a key frame nearest to the cluster center as an optimal key frame of the cluster video frame;
in step S3.1, a color feature vector of each key frame is calculated, specifically:
s3.1.1: converting the color space of the key frame image from RGB to HSV;
s3.1.2: non-uniformly quantizing H, S and V in the ratio 8:3:3 to form a 72-dimensional color feature vector, wherein H ∈ [0, 360], S ∈ [0, 1], V ∈ [0, 1];
the HSV color feature vector Fi of each frame is expressed as:
Fi = 9H + 3S + V, i = 1, 2, ..., n.
2. the fall detection method based on improved key frame extraction as claimed in claim 1, wherein in step S2, the key frame extraction is performed on the original video stream preliminarily by using an inter-frame difference method, specifically comprising the steps of:
s2.1: reading an original video stream, and calculating the frame difference between a current frame and a previous frame;
s2.2: obtaining average interframe difference according to the result of the step S2.1;
s2.3: all frames of the original video stream are ordered according to the value of the average inter-frame difference, and the first n frames are selected as key frames.
3. A fall detection method based on improved key frame extraction as claimed in claim 2, wherein in step S3 the key frames generated in step S2 are secondarily optimised using a K-means clustering algorithm.
4. A fall detection method based on improved key frame extraction as claimed in claim 3, wherein in step S3.3, when the similarity is greater than the threshold and the key frame is added to the existing cluster, the cluster center is recalculated by averaging over the members of that cluster.
5. The fall detection method based on improved key frame extraction as claimed in claim 4, wherein in step S3.3, a similarity measure is computed between each remaining key frame and all current cluster centers, the similarity measure being based on the inter-frame distance d(Fi, Fj) between the i-th and j-th frames, computed from the elements Fi(k), Fj(k) of their color feature vectors; the number of inter-frame distances satisfying d(Fi, Fj) > m + 2σ² is taken as the number K of key frames to be extracted, wherein m and σ² are respectively the mean and variance of the feature vectors of all n frames.
6. The fall detection method based on improved key frame extraction as claimed in claim 5, wherein in step S4, features are extracted from the optimal key frame, and feature vectors are constructed, specifically comprising the steps of:
s4.1: extracting the aspect ratio Fr, the centroid Fcen, the width change rate Fcha and the longest intercept angle Fa of the human body contour from the optimal key frames, wherein the aspect ratio Fr is the aspect ratio of the circumscribed rectangle of the human body in the key frame, the centroid Fcen is the central position of the human body in the key frame, the width change rate Fcha describes how the width of the target person changes across key frames, and Fa is the longest intercept angle of the human body contour in the key frame;
s4.2: combining the above features, constructing a feature vector F = [Fr, Fcen, Fcha, Fa].
7. The fall detection method based on improved key frame extraction as claimed in claim 6, wherein said step S5 specifically comprises the steps of:
s5.1: dividing a data set collected in advance into a training set and a testing set according to the ratio of 6:4, and training and testing the SVM, wherein the data set comprises characteristic vectors and label information of the current state;
s5.2: SVM model training part: the svm_train module of the Libsvm library is used; the SVM type is set to C_SVC and the kernel type to RBF, the gamma parameter of the RBF kernel is set to 2 and the loss (penalty) parameter of C_SVC is set to 1; the parameter prob is the training sample set and stores the total number of training samples, the fall labels of the samples and all feature vectors used for training;
s5.3: the SVM prediction part uses the svm_predict module of the Libsvm library, whose parameters include model and x, where model is the path of the trained model file and x is the sample to be detected; based on the training information in the model file, the prediction part takes samples from the test set in turn, classifies them and returns the classification result of each sample;
s5.4: the SVM carries out a primary classification of the data set into three results of non-falling behaviors, falling behaviors and falling-like behaviors, wherein the falling-like behaviors comprise: lying down, sitting down and standing up quickly.
8. The fall detection method based on improved key frame extraction as claimed in claim 7, wherein said step S6 specifically comprises the steps of:
s6.1: dividing the samples judged by the SVM in step S5 to be falling-like behavior into a training set and a test set at a ratio of 6:4;
s6.2: taking the training sample set in the step S6.1 as input of a convolutional neural network, and performing deep learning to obtain deep features for distinguishing normal behaviors and falling behaviors;
s6.3: the model classifier is a three-dimensional convolutional neural network with 16 weight layers: 13 convolutional layers and 3 fully connected layers, plus 5 pooling layers and a softmax classification layer, with a ReLU following every convolutional and fully connected layer; the picture resolution is 224 × 224, the initial learning rate of the model is set to 0.005, the learning-rate decay rate is 0.8, the weight decay is 0.0006, and the maximum number of iterations is 20K; all convolutional layers use 3D convolution kernels of size 3 × 3 × 3 with stride 1 × 1 × 1, and the numbers of convolution kernels are 64, 128, 256, 256, 512, 512, 512 in turn; the pooling layers use 3D max pooling with kernels of size 2 × 2 × 2 and strides of the same size;
s6.4: iterative training is continuously carried out to obtain a CNN model, a test set sample is input into the trained CNN model, classification is carried out by using softmax, a classification result is output, and final falling detection is completed.
CN202110502441.1A 2021-05-08 2021-05-08 Fall detection method based on improved key frame extraction Active CN113095295B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110502441.1A CN113095295B (en) 2021-05-08 2021-05-08 Fall detection method based on improved key frame extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110502441.1A CN113095295B (en) 2021-05-08 2021-05-08 Fall detection method based on improved key frame extraction

Publications (2)

Publication Number Publication Date
CN113095295A CN113095295A (en) 2021-07-09
CN113095295B true CN113095295B (en) 2023-08-18

Family

ID=76664785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110502441.1A Active CN113095295B (en) 2021-05-08 2021-05-08 Fall detection method based on improved key frame extraction

Country Status (1)

Country Link
CN (1) CN113095295B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627342B (en) * 2021-08-11 2024-04-12 人民中科(济南)智能技术有限公司 Method, system, equipment and storage medium for video depth feature extraction optimization


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017000465A1 (en) * 2015-07-01 2017-01-05 中国矿业大学 Method for real-time selection of key frames when mining wireless distributed video coding
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110427825A (en) * 2019-07-01 2019-11-08 上海宝钢工业技术服务有限公司 The video flame recognition methods merged based on key frame with quick support vector machines
CN110532850A (en) * 2019-07-02 2019-12-03 杭州电子科技大学 A kind of fall detection method based on video artis and hybrid classifer
CN110826491A (en) * 2019-11-07 2020-02-21 北京工业大学 Video key frame detection method based on cascading manual features and depth features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yong Chen et al.; "Vision-Based Fall Event Detection in Complex Background Using Attention Guided Bi-Directional LSTM"; IEEE Access; pp. 161337-161348 *

Also Published As

Publication number Publication date
CN113095295A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
US11809485B2 (en) Method for retrieving footprint images
CN110619369B (en) Fine-grained image classification method based on feature pyramid and global average pooling
CN108898137B (en) Natural image character recognition method and system based on deep neural network
CN110334765B (en) Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN111680614B (en) Abnormal behavior detection method based on video monitoring
CN111178208A (en) Pedestrian detection method, device and medium based on deep learning
CN115661943B (en) Fall detection method based on lightweight attitude assessment network
CN104504362A (en) Face detection method based on convolutional neural network
CN112070044B (en) Video object classification method and device
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN109903339B (en) Video group figure positioning detection method based on multi-dimensional fusion features
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN112861917B (en) Weak supervision target detection method based on image attribute learning
CN113920400A (en) Metal surface defect detection method based on improved YOLOv3
Islam et al. InceptB: a CNN based classification approach for recognizing traditional bengali games
CN111126240A (en) Three-channel feature fusion face recognition method
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN110381392A (en) A kind of video abstraction extraction method and its system, device, storage medium
CN113095295B (en) Fall detection method based on improved key frame extraction
CN111310787B (en) Brain function network multi-core fuzzy clustering method based on stacked encoder
CN114220143A (en) Face recognition method for wearing mask
CN115049952A (en) Juvenile fish limb identification method based on multi-scale cascade perception deep learning network
CN116385430A (en) Machine vision flaw detection method, device, medium and equipment
CN111340213A (en) Neural network training method, electronic device, and storage medium
CN113850182A (en) Action identification method based on DAMR-3 DNet

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant