CN110633736A - Human body falling detection method based on multi-source heterogeneous data fusion - Google Patents


Info

Publication number
CN110633736A
CN110633736A (application CN201910795220.0A)
Authority
CN
China
Prior art keywords
data
frame
human body
depth image
skeleton
Prior art date
Legal status
Pending
Application number
CN201910795220.0A
Other languages
Chinese (zh)
Inventor
李巧勤
刘勇国
杨尚明
姜珊
王志华
陶文元
傅翀
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910795220.0A
Publication of CN110633736A
Legal status: Pending

Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B 5/103 Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B 5/11 Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B 5/1116 Determining posture transitions
    • A61B 5/1117 Fall detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition


Abstract

The invention belongs to the field of human fall detection and provides a human fall detection method based on multi-source heterogeneous data fusion. Depth images of human behavior and skeleton information are acquired through a Kinect sensor, which first frees the choice of sensor from the constraints of wearable devices; second, it solves the problems that wearable sensors cannot be used in specific scenes such as bathrooms and toilets, and that monitoring with an ordinary camera invades personal privacy. Features are extracted from the multi-source heterogeneous data through deep learning models, and keyless attention fusion is introduced as the data-fusion mechanism, avoiding the data redundancy and computational complexity caused by data-level fusion. Compared with the prior art, the method significantly improves the accuracy of fall detection.

Description

Human body falling detection method based on multi-source heterogeneous data fusion
Technical Field
The invention belongs to the field of human body fall detection, and particularly relates to a human body fall detection method based on multi-source heterogeneous data fusion.
Background
With the aging of society, falls have become a common health problem among the elderly, and fall detection has become an important research direction. A fall is generally defined as an "involuntary movement to the ground or a lower level", and fall detection is mainly performed by monitoring body signals with various types of sensors for acquisition, processing and decision. The sensors fall roughly into wearable sensors and environment-perception sensors. In wearable systems, an accelerometer is usually used to sense changes in the wearer's orientation and a gyroscope to measure angular momentum, possibly together with other sensors such as barometers and magnetometers; these wearable sensors are generally placed on the chest, waist, wrist or thigh of the wearer. Environment-perception systems generally use infrared sensors, vibration sensors, microphone arrays, radar sensors, cameras and the like. However, wearable sensors require the elderly to wear the device at all times and to charge it frequently, which is inconvenient; among environment-perception sensors, camera-based systems achieve higher accuracy, but camera surveillance invades the privacy of the elderly.
In current fall detection, Microsoft Kinect has been used to obtain three-dimensional skeleton data; acceleration and skeleton features computed from the centroids of human joints serve as the main biomechanical features, and an LSTM model performs the fall detection. For example, the method in "Elders' fall detection based on biomechanical features using depth camera" can dynamically capture human motion in three-dimensional space, allowing researchers to analyse human posture more accurately and greatly improving the capability of behavior analysis; because three-dimensional skeleton data rather than raw image data is used, the user's privacy is effectively protected during analysis. In behavior recognition, three modalities, namely three-dimensional skeletons, human body-part images and motion history images (MHIs), have been integrated into a hybrid deep-learning framework for human behavior detection with good results. For example, "Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion" uses RGB data to capture human pose, part shape and body motion, and combines convolutional neural networks (CNNs), long short-term memory (LSTM) and a fine-tuned pre-trained architecture into a hybrid system called MCLP: multimodal CNN + LSTM + VGG16 pre-trained on the public ImageNet dataset; the three sub-models are fused at the decision level. Decision-level fusion can be regarded as a form of aggregation in which each modality is completely independent of the others and the independently obtained results are fused only in the last step; because the modalities interact only at the final outputs of the independent models, the model cannot capture inter-modal interactions, which lowers the fall recognition rate.
Disclosure of Invention
The invention aims to provide a human fall detection method based on multi-source heterogeneous data fusion, addressing the problems of unstable signals, possible data loss and lack of real-time monitoring in signals transmitted by wearable sensors in current fall detection. The method uses a Microsoft Kinect device to obtain depth images and skeleton data for fall detection: first, the choice of sensor is freed from the constraints of wearable devices; second, it solves the problems that wearable sensors cannot be used in specific scenes such as bathrooms and toilets, and that monitoring with an ordinary camera invades personal privacy; finally, the fusion of the multi-source heterogeneous data is algorithmically improved, raising the accuracy of fall detection.
In order to achieve the purpose, the invention adopts the technical scheme that:
a human body falling detection method based on multi-source heterogeneous data fusion comprises the following steps:
step 1, collecting human body skeleton node data and depth image data based on a Kinect v2 sensor system;
step 2, synchronous frame sampling is carried out on the collected human skeleton node data and the collected depth image data: setting sequence length, dividing an input sequence into N equal-length sections, and randomly selecting a frame from each equal-length section to obtain time sequence data with the length of N frames;
step 3, carrying out data processing on the sampled human skeleton node data and the depth image data;
step 3-1, aiming at human skeleton node data:
step 3-1-1, node processing: performing node processing on each frame of skeleton node data, selecting trunk skeleton nodes from 25 skeleton nodes provided by a Kinect v2 sensor, and arranging joints in a tree traversal sequence;
step 3-1-2. coordinate normalization: carrying out coordinate normalization on the three-dimensional coordinate points of each frame of skeleton nodes;
step 3-1-3, performing feature extraction on the preprocessed skeleton node data through a CNN-LSTM model to obtain the skeleton features $h^t$;
Step 3-2. for depth image data:
step 3-2-1, generating a Motion History Image (MHI) according to the motion dynamics:
step 3-2-2, normalizing the motion history image;
3-2-3, performing image enhancement on the normalized motion history image to obtain a training data set;
step 3-2-4, performing feature extraction on the training data set by adopting a VGG16 model to obtain a one-dimensional feature vector $h^o$;
step 4, carrying out keyless attention data fusion on the bone node data and the depth image data after feature extraction;
step 4-1, concatenating the skeleton features $h^t$ and the depth image features $h^o$ to obtain

$$h_n^B = [h_n^t, h^o]$$

where $h_n^t$ denotes the human skeleton node features extracted from the n-th frame of skeleton node data, n = 1, 2, ..., N;
step 4-2, taking $h^B$ as the annotation vectors; the attention mechanism gives the output vector c as the expectation over the annotation vectors:

$$c = \sum_{n=1}^{N} \lambda_n h_n^B$$

where $\lambda_n$ is the weight of $h_n^B$:

$$e_n = \omega^\top h_n^B, \qquad \lambda_n = \frac{\exp(e_n)}{\sum_{k=1}^{N} \exp(e_k)}$$

where ω is a learnable parameter of the same dimension as the annotation vector;
step 5, inputting the vector subjected to data fusion into a classifier to obtain a classification result.
Further, the specific process of step 3-2-1 is as follows: first, perform grayscale conversion on each frame of the depth image; then, taking the first frame as the background, compute the absolute difference between corresponding pixels of the second frame and the first frame; if the difference at a pixel exceeds a preset threshold, that pixel is judged dynamic and set to white, otherwise it is judged static and set to black, and the composite image is taken as the new background; proceeding in this way frame by frame, the motion history image is synthesized.
The invention has the beneficial effects that:
the invention provides a human body falling detection method based on multi-source heterogeneous data fusion, which is characterized in that a behavior depth image and skeleton information of a human body are obtained through a Kinect, so that the constraint of a wearable sensor is eliminated, and the problems that a common camera monitors and invades the privacy of the human body and the like are effectively avoided; meanwhile, features are extracted from multi-source heterogeneous data through a deep learning model, keyless attention fusion is introduced in a data fusion mode, and data redundancy and computational complexity caused by data level fusion are avoided; compared with the prior art, the method obviously improves the accuracy of fall detection.
Drawings
Fig. 1 is a flow chart of a human body fall detection method based on multi-source heterogeneous data fusion according to the present invention.
Fig. 2 is a specific flowchart of the human body fall detection method based on multi-source heterogeneous data fusion according to the present invention.
FIG. 3 is a human bone node traversal graph according to an embodiment of the present invention.
FIG. 4 is a flow chart of CNN-Maxpooling in an embodiment of the present invention.
FIG. 5 is a flow chart of an LSTM in an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The embodiment provides a human body fall detection method based on multi-source heterogeneous data fusion, the flow of which is shown in fig. 1 and 2, and the method specifically comprises the following steps:
s1, collecting human body skeleton node data and depth image data based on a Kinect v2 sensor system;
Kinect v2 is a 3D motion-sensing device from Microsoft equipped with an RGB camera, a depth camera, IR emitters and a microphone array. The Kinect sensor is connected to a PC through a data cable, and the Kinect SDK provided by Microsoft is called to obtain skeleton data and depth image data of the human body. In this embodiment, the Kinect sensor locks onto the target position; the mounting height is 1.2 m and the front-to-back distance is 1.5 m to 1.8 m;
Description of action types: the training-set actions are divided into fall actions and daily behaviors. The simulated fall actions are set as five types: falling forward ending lying on the ground, falling backward ending lying on the ground, falling backward ending sitting on the ground, falling to the left, and falling to the right. The daily behaviors are set as: walking, sitting, squatting, lying, sitting to standing, sitting to lying, lying to sitting, and lying to standing;
s2, carrying out frame sampling on the collected human body bone node data and the collected depth image data:
Synchronous frame sampling is performed on the human skeleton node data and the depth image data. The purpose of frame sampling is to fix the length of the input sequence so that it matches the training network, and the introduced randomness makes the model more robust. In this embodiment the sampling frequency is 30 Hz; since a fall generally takes no more than 2 seconds, the sequence length is set to 2 seconds. The input sequence is divided into N equal-length segments, and one frame is randomly selected from each segment, yielding time-series data of N frames;
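A minimal Python sketch of this segment-and-sample step; the NumPy implementation and names are illustrative, since the text fixes only the 30 Hz rate, the 2 s window, and the split into N equal segments:

```python
import numpy as np

def sample_frames(sequence, n_segments=8):
    """Split the input sequence into n_segments equal-length parts and
    randomly pick one frame from each, giving a fixed-length clip."""
    bounds = np.linspace(0, len(sequence), n_segments + 1, dtype=int)
    picks = [np.random.randint(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
             for i in range(n_segments)]
    return [sequence[i] for i in picks]

# Example: a 2 s recording at 30 Hz (60 frames) reduced to N = 8 frames.
clip = sample_frames(list(range(60)), n_segments=8)
```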
s3, carrying out data processing on the sampled human skeleton node data and the depth image data;
s3-1, aiming at human skeleton node data:
S3-1-1, node processing: node processing is performed on each frame of skeleton node data so that the system can detect spatial patterns between related joints in the skeleton coordinates. Sixteen of the 25 skeleton nodes provided by the Kinect v2 sensor are selected, namely, in order, the spine middle, shoulder center, head, left shoulder, left elbow, left hand, right shoulder, right elbow, right hand, spine base, left hip, left knee, left ankle, right hip, right knee and right ankle, as shown in FIG. 3. The joints are arranged in tree-traversal order, allowing joints to repeat during traversal; in this embodiment the total number of joints becomes 31. The joint coordinates are stacked so that joints on the same traversal path are placed adjacent to each other, making it easier for the network to learn correlation patterns between neighboring joint coordinates;
S3-1-2, coordinate normalization: coordinate normalization is performed on the three-dimensional coordinate points of each frame of skeleton nodes, to prevent the joint coordinates from being affected by the position and orientation of the subject and by the size of the subject's skeleton. The normalization formula is

$$x^* = \frac{x - \min}{\max - \min}$$

where max is the maximum value of the sample data, min is the minimum value of the sample data, x is the current sensor datum, and $x^*$ is the normalized result;
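The same min-max scaling as a one-line helper; it also serves the MHI normalization in step S3-2-2 below, and applying it per sample is an assumption:

```python
import numpy as np

def min_max_normalize(x):
    """Min-max normalization x* = (x - min) / (max - min); used for joint
    coordinates here and later to map MHI pixels from 0-255 to 0-1."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min())
```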
S3-1-3, after preprocessing, feature extraction is performed on the skeleton node data through a CNN-LSTM model;
In this embodiment, node processing yields 31 skeleton nodes, so each frame of skeleton data consists of 93 values (31 joints × 3 dimensions); with N set to 8 at the input layer, 8 × 93 joint coordinate values are used as input;
In the feature-extraction stage of this embodiment, the CNN-LSTM model uses three consecutive convolutional layers and two cascaded LSTMs. The three convolutional layers (with 32, 48 and 64 kernels in turn) form the CNN model for feature extraction, as shown in FIG. 4; the output of each convolutional layer is halved by a max-pooling layer. The features extracted by the CNN model are fed into the first LSTM; after the feature quantity is halved by a filter layer, the output of the first LSTM is fed into the second LSTM. The two LSTMs share the same architecture, with 128 neurons each. Each LSTM updates its input gate $i_t$, forget gate $f_t$, output gate $o_t$, cell state $C_t$ and output vector $h_t$ according to:

$$i_t = \mathrm{sigmoid}(W_{ui}U_t + W_{hi}h_{t-1} + W_{ci}C_{t-1} + b_i) \quad (1)$$

$$f_t = \mathrm{sigmoid}(W_{uf}U_t + W_{hf}h_{t-1} + W_{cf}C_{t-1} + b_f) \quad (2)$$

$$o_t = \mathrm{sigmoid}(W_{uo}U_t + W_{ho}h_{t-1} + W_{co}C_{t-1} + b_o) \quad (3)$$

$$C_t = f_t C_{t-1} + i_t \tanh(W_{uc}U_t + W_{hc}h_{t-1} + b_c) \quad (4)$$

$$h_t = o_t \tanh(C_t) \quad (5)$$

where $U_t$ and $C_t$ are the input vector and cell state at time t; $h_t$ and $h_{t-1}$ are the output vectors (hidden states) at times t and t-1; $b_c$ is a bias, and $b_i$, $b_f$ and $b_o$ are the biases of the input, forget and output gates respectively; $W_{ui}$, $W_{hi}$, $W_{ci}$, $W_{uf}$, $W_{hf}$, $W_{cf}$, $W_{uo}$, $W_{ho}$, $W_{co}$, $W_{uc}$ and $W_{hc}$ are weight matrices. The hidden states $h_t$ are the extracted human skeleton node features, as shown in FIG. 5.
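A hedged Keras sketch of this extractor. The kernel counts (32/48/64), the halving pools and the two 128-unit LSTMs follow the text; kernel size 3, treating each frame's 93 coordinates as a 1-D signal, and modelling the "filter layer" between the LSTMs as dropout are assumptions:

```python
from tensorflow.keras import layers, models

skeleton_in = layers.Input(shape=(8, 93, 1))                # N=8 frames x 93 coords
x = skeleton_in
for n_kernels in (32, 48, 64):                              # three consecutive conv layers
    x = layers.TimeDistributed(
        layers.Conv1D(n_kernels, 3, padding='same', activation='relu'))(x)
    x = layers.TimeDistributed(layers.MaxPooling1D(2))(x)   # halve each conv output
x = layers.TimeDistributed(layers.Flatten())(x)
x = layers.LSTM(128, return_sequences=True)(x)              # first LSTM
x = layers.Dropout(0.5)(x)                                  # assumed "filter layer"
h_t = layers.LSTM(128, return_sequences=True)(x)            # per-frame features h_n^t
skeleton_model = models.Model(skeleton_in, h_t)
```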
S3-2, for depth image data:
s3-2-1, generating a Motion History Image (MHI) according to the motion dynamics:
First, grayscale conversion is performed on each frame of the depth image. Then, taking the first frame as the background, the absolute difference between corresponding pixels of the second frame and the first frame is computed; if the difference at a pixel exceeds a preset threshold, the pixel is judged dynamic and set to white, otherwise it is judged static and set to black, and the composite image is taken as the new background. Proceeding in this way frame by frame, the motion history image is synthesized;
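A NumPy sketch of this synthesis, assuming the frames are already grayscale arrays; the threshold value and the exact background-update rule are assumptions the text leaves open:

```python
import numpy as np

def synthesize_mhi(gray_frames, threshold=30):
    """Mark pixels whose absolute difference from the running background
    exceeds the threshold as white (dynamic), the rest as black (static);
    the composite then serves as the new background for the next frame."""
    background = gray_frames[0].astype(np.int16)
    mhi = np.zeros(background.shape, dtype=np.uint8)
    for frame in gray_frames[1:]:
        diff = np.abs(frame.astype(np.int16) - background)
        moving = diff > threshold
        mhi[moving] = 255                                 # dynamic pixels -> white
        background = np.where(moving, frame, background)  # assumed background update
    return mhi
```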
S3-2-2, the motion history image is normalized with

$$x_i^* = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

where $x_i$ is an image pixel value and min(x) and max(x) are the minimum and maximum pixel values of the image. Normalization does not change the information content of the image; the pixel values are mapped from 0-255 to 0-1, which greatly benefits the subsequent convolutional-network processing and makes the image more robust to geometric transformations;
S3-2-3, image enhancement is applied to the normalized motion history images to obtain the training data set and strengthen the generalization ability of the model: with the image center as the rotation center, the maximum rotation angle is set to 45 degrees, the vertical and horizontal translation scales to 10, and the zoom to 0.2;
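One way to realize these settings, sketched with Keras' ImageDataGenerator; reading the translation scale of 10 as a shift of up to 10 pixels is an assumption:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotation up to 45 degrees about the image centre, up/down and left/right
# shifts of up to 10 pixels, and a 0.2 zoom range, per the stated settings.
augmenter = ImageDataGenerator(rotation_range=45,
                               width_shift_range=10,
                               height_shift_range=10,
                               zoom_range=0.2)
```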
S3-2-4, feature extraction is performed on the training data set with a VGG16 model to obtain a one-dimensional feature vector $h^o$;
The generated MHI is resized to 224 × 224 with three color channels, so a 224 × 224 × 3 image is input to the trained VGG16 model, which serves as the feature extractor for the MHI; the extracted features (7 × 7 × 512) are then reshaped into a vector of size 25088;
The VGG convolutional neural network was proposed by the University of Oxford in 2014 and achieves very good results in image classification and object detection tasks. VGG16 has 16 weight layers and is a deep CNN model: it takes a 224 × 224 × 3 image as input; every convolution kernel is 3 × 3 with stride 1; because convolution shrinks the spatial dimensions, zero-padding (padding = 1) is applied before each convolution; pooling uses 2 × 2 max pooling; the whole architecture stacks 13 convolutional layers and 3 fully connected layers;
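A short sketch of this extractor using the Keras VGG16 application; the weights and geometry follow the text, while the wiring is illustrative:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# ImageNet-pretrained VGG16 without its classifier head as the MHI feature
# extractor: a 224x224x3 input yields a 7x7x512 map, flattened to the
# 25088-dimensional vector h^o.
vgg = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
mhi_in = layers.Input(shape=(224, 224, 3))
h_o = layers.Flatten()(vgg(mhi_in))
mhi_model = models.Model(mhi_in, h_o)
```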
S4, keyless attention data fusion is performed on the feature-extracted skeleton node data and depth image data;
The three-dimensional skeleton processing yields the hidden-state skeleton features $h^t$, and the depth-image processing yields the one-dimensional feature vector $h^o$;
S4-1, the skeleton features $h^t$ and the depth image features $h^o$ are concatenated to obtain

$$h_n^B = [h_n^t, h^o]$$

where $h_n^t$ denotes the human skeleton node features extracted from the n-th frame of skeleton node data, n = 1, 2, ..., N, and [·] denotes concatenation of state vectors;
S4-2, data fusion is performed through a keyless attention mechanism, taking $h^B$ as the sequence of input vectors, called annotation vectors; the attention mechanism gives the output vector c as the expectation over the annotation vectors:

$$c = \sum_{n=1}^{N} \lambda_n h_n^B$$

where the weight $\lambda_n$ of each $h_n^B$ is computed as

$$e_n = \omega^\top h_n^B, \qquad \lambda_n = \frac{\exp(e_n)}{\sum_{k=1}^{N} \exp(e_k)}$$

where ω is a learnable parameter of the same dimension as the annotation vector;
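A compact Keras layer implementing exactly these formulas; the layer and variable names are mine:

```python
import tensorflow as tf
from tensorflow.keras import layers

class KeylessAttention(layers.Layer):
    """Keyless attention: e_n = omega^T h_n, lambda_n = softmax(e_n),
    c = sum_n lambda_n * h_n, with omega a learnable vector of the same
    dimension as the annotation vectors."""
    def build(self, input_shape):
        self.omega = self.add_weight(name='omega',
                                     shape=(int(input_shape[-1]),),
                                     initializer='glorot_uniform',
                                     trainable=True)
    def call(self, h):                                # h: (batch, N, d)
        e = tf.einsum('bnd,d->bn', h, self.omega)     # scores e_n
        lam = tf.nn.softmax(e, axis=1)                # weights lambda_n
        return tf.einsum('bn,bnd->bd', lam, h)        # fused output c
```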
S5, the fused vector c is input to a classifier, an artificial neural network consisting of a fully connected layer (dense), an intermediate filter layer (dropout) and a classification layer (softmax activation) connected in sequence, to obtain the classification result.
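Putting the pieces together, a hedged end-to-end assembly reusing skeleton_model, mhi_model and KeylessAttention from the sketches above; the dense width, dropout rate and class count are assumptions the patent does not state:

```python
from tensorflow.keras import layers, models

h_o_seq = layers.RepeatVector(8)(mhi_model.output)            # tile h^o over the N frames
h_b = layers.Concatenate()([skeleton_model.output, h_o_seq])  # h_n^B = [h_n^t, h^o]
c = KeylessAttention()(h_b)                                   # fused vector c
x = layers.Dense(256, activation='relu')(c)                   # fully connected layer
x = layers.Dropout(0.5)(x)                                    # intermediate filter layer
out = layers.Dense(13, activation='softmax')(x)               # e.g. 5 falls + 8 daily actions
fall_model = models.Model([skeleton_model.input, mhi_model.input], out)
fall_model.compile(optimizer='adam', loss='categorical_crossentropy',
                   metrics=['accuracy'])
```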
In this embodiment, the fall detection task adopts accuracy, specificity and sensitivity as evaluation metrics:

Accuracy = number of correctly classified samples / total number of samples

Specificity = TN / (TN + FP)

Sensitivity = TP / (TP + FN)

where TP (true positive) means a fall event occurred and the algorithm detected it; FP (false positive) means no fall event occurred but the algorithm reported one; TN (true negative) means no fall event occurred and the algorithm correctly reported none; FN (false negative) means a fall event occurred but the algorithm did not detect it. Ghojogh, B. et al. ("Fisherposes for Human Action Recognition Using Kinect Sensor Data") achieved 89% accuracy on the UTKinect dataset (http://cvrc.ece.utexas.edu/KinectDatasets/HOJ3D.html). Using the fall detection method of the invention, experiments on the public Kinect datasets UR and UTKinect achieved a recognition accuracy of 95%.
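The three metrics as a small helper for the binary fall/no-fall case, assuming the confusion-matrix counts are already available; the names are illustrative:

```python
def fall_metrics(tp, fp, tn, fn):
    """Accuracy, specificity and sensitivity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, specificity, sensitivity
```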
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (2)

1. A human body falling detection method based on multi-source heterogeneous data fusion comprises the following steps:
step 1, collecting human body skeleton node data and depth image data based on a Kinect v2 sensor system;
step 2, synchronous frame sampling is carried out on the collected human skeleton node data and the collected depth image data: setting sequence length, dividing an input sequence into N equal-length sections, and randomly selecting a frame from each equal-length section to obtain time sequence data with the length of N frames;
step 3, carrying out data processing on the sampled human skeleton node data and the depth image data;
step 3-1, aiming at human skeleton node data:
step 3-1-1, node processing: performing node processing on each frame of skeleton node data, selecting trunk skeleton nodes from 25 skeleton nodes provided by a Kinect v2 sensor, and arranging joints in a tree traversal sequence;
step 3-1-2. coordinate normalization: carrying out coordinate normalization on the three-dimensional coordinate points of each frame of skeleton nodes;
step 3-1-3, performing feature extraction on the preprocessed skeleton node data through a CNN-LSTM model to obtain the skeleton features $h^t$;
Step 3-2. for depth image data:
step 3-2-1, generating a Motion History Image (MHI) according to the motion dynamics:
step 3-2-2, normalizing the motion history image;
3-2-3, performing image enhancement on the normalized motion history image to obtain a training data set;
step 3-2-4, performing feature extraction on the training data set by adopting a VGG16 model to obtain a one-dimensional feature vector $h^o$;
Step 4, carrying out keyless attention data fusion on the bone node data and the depth image data after feature extraction;
step 4-1, concatenating the skeleton features $h^t$ and the depth image features $h^o$ to obtain

$$h_n^B = [h_n^t, h^o]$$

where $h_n^t$ denotes the human skeleton node features extracted from the n-th frame of skeleton node data, n = 1, 2, ..., N;
step 4-2, taking $h^B$ as the annotation vectors; the attention mechanism gives the output vector c as the expectation over the annotation vectors:

$$c = \sum_{n=1}^{N} \lambda_n h_n^B$$

where $\lambda_n$ is the weight of $h_n^B$:

$$e_n = \omega^\top h_n^B, \qquad \lambda_n = \frac{\exp(e_n)}{\sum_{k=1}^{N} \exp(e_k)}$$

where ω is a learnable parameter of the same dimension as the annotation vector;
step 5, inputting the vector subjected to data fusion into a classifier to obtain a classification result.
2. The human body fall detection method based on multi-source heterogeneous data fusion according to claim 1, wherein the specific process of step 3-2-1 is as follows: first, performing grayscale conversion on each frame of depth image; then, taking the first frame as the background, calculating the absolute difference between corresponding pixels of the second frame and the first frame; if the difference at any pixel exceeds a preset threshold, judging that pixel dynamic and setting it to white, otherwise judging it static and setting it to black, and taking the composite image as the new background; and proceeding in this way frame by frame, synthesizing the motion history image.
CN201910795220.0A 2019-08-27 2019-08-27 Human body falling detection method based on multi-source heterogeneous data fusion Pending CN110633736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910795220.0A CN110633736A (en) 2019-08-27 2019-08-27 Human body falling detection method based on multi-source heterogeneous data fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910795220.0A CN110633736A (en) 2019-08-27 2019-08-27 Human body falling detection method based on multi-source heterogeneous data fusion

Publications (1)

Publication Number Publication Date
CN110633736A true CN110633736A (en) 2019-12-31

Family

ID=68969142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910795220.0A Pending CN110633736A (en) 2019-08-27 2019-08-27 Human body falling detection method based on multi-source heterogeneous data fusion

Country Status (1)

Country Link
CN (1) CN110633736A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209848A (en) * 2020-01-03 2020-05-29 北京工业大学 Real-time fall detection method based on deep learning
CN111436944A (en) * 2020-04-20 2020-07-24 电子科技大学 Falling detection method based on intelligent mobile terminal
CN111488850A (en) * 2020-04-17 2020-08-04 电子科技大学 Neural network-based old people falling detection method
CN111507252A (en) * 2020-04-16 2020-08-07 上海眼控科技股份有限公司 Human body falling detection device and method, electronic terminal and storage medium
CN111681389A (en) * 2020-06-12 2020-09-18 电子科技大学 Old people falling behavior detection method based on blind source separation
CN112084899A (en) * 2020-08-25 2020-12-15 广东工业大学 Fall event detection method and system based on deep learning
CN112101235A (en) * 2020-09-16 2020-12-18 济南大学 Old people behavior identification and detection method based on old people behavior characteristics
CN112101201A (en) * 2020-09-14 2020-12-18 北京数衍科技有限公司 Pedestrian state detection method and device and electronic equipment
CN112270807A (en) * 2020-10-29 2021-01-26 怀化学院 Old man early warning system that tumbles
CN112613388A (en) * 2020-12-18 2021-04-06 燕山大学 Personnel falling detection method based on multi-dimensional feature fusion
CN112617813A (en) * 2020-12-15 2021-04-09 南京邮电大学 Multi-sensor-based non-invasive fall detection method and system
CN112801061A (en) * 2021-04-07 2021-05-14 南京百伦斯智能科技有限公司 Posture recognition method and system
CN112998697A (en) * 2021-02-22 2021-06-22 电子科技大学 Tumble injury degree prediction method and system based on skeleton data and terminal
CN113196289A (en) * 2020-07-02 2021-07-30 浙江大学 Human body action recognition method, human body action recognition system and device
CN114419842A (en) * 2021-12-31 2022-04-29 浙江大学台州研究院 Artificial intelligence-based falling alarm method and device for assisting user in moving to intelligent closestool

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776775A (en) * 2018-05-24 2018-11-09 常州大学 Fall detection method in a kind of the elderly room based on weight fusion depth and skeleton character
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point
CN109875565A (en) * 2019-01-25 2019-06-14 电子科技大学 A kind of cerebral apoplexy upper extremity exercise function method for automatically evaluating based on deep learning
CN109948459A (en) * 2019-02-25 2019-06-28 广东工业大学 A kind of football movement appraisal procedure and system based on deep learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108776775A (en) * 2018-05-24 2018-11-09 常州大学 Fall detection method in a kind of the elderly room based on weight fusion depth and skeleton character
CN109492612A (en) * 2018-11-28 2019-03-19 平安科技(深圳)有限公司 Fall detection method and its falling detection device based on skeleton point
CN109875565A (en) * 2019-01-25 2019-06-14 电子科技大学 A kind of cerebral apoplexy upper extremity exercise function method for automatically evaluating based on deep learning
CN109948459A (en) * 2019-02-25 2019-06-28 广东工业大学 A kind of football movement appraisal procedure and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HANY EL-GHAISH ET AL.: "Human Action Recognition Based on Integrating Body Pose, Part Shape, and Motion", 《IEEE ACCESS》 *
XIANG LONG ET AL.: "Multimodal Keyless Attention Fusion for Video Classification", 《THE THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209848A (en) * 2020-01-03 2020-05-29 北京工业大学 Real-time fall detection method based on deep learning
CN111507252A (en) * 2020-04-16 2020-08-07 上海眼控科技股份有限公司 Human body falling detection device and method, electronic terminal and storage medium
CN111488850A (en) * 2020-04-17 2020-08-04 电子科技大学 Neural network-based old people falling detection method
CN111488850B (en) * 2020-04-17 2022-07-12 电子科技大学 Neural network-based old people falling detection method
CN111436944A (en) * 2020-04-20 2020-07-24 电子科技大学 Falling detection method based on intelligent mobile terminal
CN111681389B (en) * 2020-06-12 2021-02-26 电子科技大学 Old people falling behavior detection method based on blind source separation
CN111681389A (en) * 2020-06-12 2020-09-18 电子科技大学 Old people falling behavior detection method based on blind source separation
WO2022000420A1 (en) * 2020-07-02 2022-01-06 浙江大学 Human body action recognition method, human body action recognition system, and device
CN113196289A (en) * 2020-07-02 2021-07-30 浙江大学 Human body action recognition method, human body action recognition system and device
CN112084899A (en) * 2020-08-25 2020-12-15 广东工业大学 Fall event detection method and system based on deep learning
CN112101201B (en) * 2020-09-14 2024-05-24 北京数衍科技有限公司 Pedestrian state detection method and device and electronic equipment
CN112101201A (en) * 2020-09-14 2020-12-18 北京数衍科技有限公司 Pedestrian state detection method and device and electronic equipment
CN112101235A (en) * 2020-09-16 2020-12-18 济南大学 Old people behavior identification and detection method based on old people behavior characteristics
CN112270807A (en) * 2020-10-29 2021-01-26 怀化学院 Old man early warning system that tumbles
CN112617813A (en) * 2020-12-15 2021-04-09 南京邮电大学 Multi-sensor-based non-invasive fall detection method and system
CN112617813B (en) * 2020-12-15 2023-02-14 南京邮电大学 Multi-sensor-based non-invasive fall detection method and system
CN112613388A (en) * 2020-12-18 2021-04-06 燕山大学 Personnel falling detection method based on multi-dimensional feature fusion
CN112613388B (en) * 2020-12-18 2022-08-30 燕山大学 Personnel falling detection method based on multi-dimensional feature fusion
CN112998697A (en) * 2021-02-22 2021-06-22 电子科技大学 Tumble injury degree prediction method and system based on skeleton data and terminal
CN112998697B (en) * 2021-02-22 2022-06-14 电子科技大学 Tumble injury degree prediction method and system based on skeleton data and terminal
CN112801061A (en) * 2021-04-07 2021-05-14 南京百伦斯智能科技有限公司 Posture recognition method and system
CN114419842A (en) * 2021-12-31 2022-04-29 浙江大学台州研究院 Artificial intelligence-based falling alarm method and device for assisting user in moving to intelligent closestool
CN114419842B (en) * 2021-12-31 2024-05-10 浙江大学台州研究院 Fall alarm method and device for assisting user to fall to closestool based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110633736A (en) Human body falling detection method based on multi-source heterogeneous data fusion
Dorschky et al. CNN-based estimation of sagittal plane walking and running biomechanics from measured and simulated inertial sensor data
CN106650687B (en) Posture correction method based on depth information and skeleton information
CN110490109B (en) Monocular vision-based online human body rehabilitation action recognition method
Zhou et al. Learning to estimate 3d human pose from point cloud
Amsaprabhaa Multimodal spatiotemporal skeletal kinematic gait feature fusion for vision-based fall detection
Chang et al. A pose estimation-based fall detection methodology using artificial intelligence edge computing
CN113111767A (en) Fall detection method based on deep learning 3D posture assessment
Pasinetti et al. Assisted gait phase estimation through an embedded depth camera using modified random forest algorithm classification
Xu et al. Elders’ fall detection based on biomechanical features using depth camera
Ahmad et al. Human action recognition using convolutional neural network and depth sensor data
CN112101235B (en) Old people behavior identification and detection method based on old people behavior characteristics
CN113239892A (en) Monocular human body three-dimensional attitude estimation method based on data enhancement architecture
JP2019003565A (en) Image processing apparatus, image processing method and image processing program
Yan et al. Human-object interaction recognition using multitask neural network
Docto et al. Third eye hand glove object detection for visually impaired using You Only Look Once (YOLO) v4-tiny algorithm
CN116895098A (en) Video human body action recognition system and method based on deep learning and privacy protection
US20230103112A1 (en) System and method for monitoring activity performed by subject
Zhen et al. Hybrid Deep‐Learning Framework Based on Gaussian Fusion of Multiple Spatiotemporal Networks for Walking Gait Phase Recognition
Lin et al. Abnormal event detection using microsoft kinect in a smart home
Shah et al. An efficient and lightweight multiperson activity recognition framework for robot-assisted healthcare applications
Upadhyay et al. Body Posture Detection Using Computer Vision
CN116030533A (en) High-speed motion capturing and identifying method and system for motion scene
Nevalainen et al. Image analysis and development of graphical user interface for pole vault action
CN112861699A (en) Method for estimating height of human body in any posture based on single depth image and multi-stage neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191231)