CN109886102B - Fall-down behavior time-space domain detection method based on depth image - Google Patents


Publication number
CN109886102B
CN109886102B (application CN201910032206.5A)
Authority
CN
China
Prior art keywords
depth image
frame
image sequence
tensor
sequence
Prior art date
Legal status
Active
Application number
CN201910032206.5A
Other languages
Chinese (zh)
Other versions
CN109886102A (en)
Inventor
肖阳
姜文祥
曹治国
王焱乘
朱子豪
李帅
张明阳
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910032206.5A
Publication of CN109886102A
Application granted
Publication of CN109886102B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a time-space domain detection method for fall behavior based on depth images, which comprises the following steps: acquiring depth images; selecting video sequences over multiple time windows; extracting normal vector features from each segment of depth video; fusing the features; encoding dynamic images; detecting with the behavior detection network; processing the detection results; recording the detection results; and training the behavior detection network. The method fully mines the characteristics of the depth image: normal vector features are fused with the depth image, the feature sequence is encoded into a dynamic image, the result is detected frame by frame with a target detection network trained on a large amount of labeled data, and non-maximum suppression is applied in the time domain and the space domain, which ensures that the method has high real-time performance, accuracy, robustness, privacy protection and practicability.

Description

Fall-down behavior time-space domain detection method based on depth image
Technical Field
The invention belongs to the field of digital image recognition, and particularly relates to a time-space domain detection method for falling behavior based on a depth image.
Background
Falls are the main cause of accidental injuries among the elderly (aged 65 and above). Statistics show that 60% of head injuries and 90% of hip and wrist injuries in the elderly are caused by falls, and that 30% of elderly people living alone and 50% of the elderly in long-term care facilities (such as nursing homes) fall at least once a year. Timely detection of falls is therefore very important in the long-term care of the elderly. On the other hand, as the aging of the world population becomes more and more serious, the cost of long-term care for the elderly keeps rising, especially in care institutions such as hospitals and nursing homes, so the demand for real-time fall detection systems for the elderly is very large.
Currently, there are three main types of methods for fall detection: wearable device based methods, environmental sensor based methods, and computer vision based methods.
Wearable-device-based methods detect falls by measuring acceleration with a body-worn sensor; they require little computation and are simple to use, but the device must be worn at all times, which interferes with normal life. Environmental-sensor-based methods, for example using pressure or sound sensors, also require little computation, but they are strongly affected by changes in environmental pressure and sound and therefore give many false alarms. Computer-vision-based methods mainly use monitored video image information and need no wearable equipment, but they are easily affected by illumination and cannot be used at night; in addition, their privacy protection is poor and their accuracy has not reached a high standard.
Disclosure of Invention
Aiming at the defects or improvement needs of the prior art, the invention provides a time-space domain detection method for fall behavior based on depth images, so as to solve the technical problems that existing fall detection methods are affected by illumination, have low accuracy and require wearing a sensor.
In order to achieve the above object, the present invention provides a time-space domain detection method for fall behavior based on depth images, comprising:
(1) acquiring depth image data of an indoor scene, and intercepting M depth image sequences of different lengths from the depth image data, wherein M is an integer;
(2) extracting the normal vector features of each depth image sequence, wherein normal vector features of size W × H × 3 × N are extracted from each depth image sequence, W being the width of the depth image, H its height, and N the number of frames of the corresponding depth image sequence;
(3) converting each obtained depth image sequence into a W × H × N gray-scale image sequence, and fusing the gray-scale images of each depth image sequence with the normal vector features of that sequence to obtain the W × H × 3 × N tensor corresponding to the sequence;
(4) performing dynamic image encoding on the W × H × 3 × N tensor features corresponding to each depth image sequence, so that each depth image sequence is encoded into a W × H × 3 tensor;
(5) taking the W × H × 3 tensor corresponding to each depth image sequence as the input of a target detection network to obtain the occurrence probability of fall behavior and the spatial position information of the fall behavior for each depth image sequence;
(6) performing spatial non-maximum suppression on the spatial position information of the fall behavior of each depth image sequence according to the fall probabilities obtained by the target detection network, performing temporal non-maximum suppression on the detection results of the M depth image sequences in the time domain according to the same fall probabilities, and merging the spatial positions and time windows corresponding to the M depth image sequences; if the fall probability of the target is greater than a first preset value and the length of the merged time window is greater than a second preset value, determining that a fall has occurred, recording the fall probability of the fall, the position in the image, the moment of the fall and the duration of the fall, and issuing an early warning; otherwise, determining that no fall has occurred.
Preferably, step (1) comprises:
(1.1) acquiring N frames of depth images, starting feature extraction, encoding and detection from the N-th frame of depth image, and, before processing the next frame of depth image, discarding the earliest depth image in the current depth image sequence so that the length of the depth image sequence used for fall detection remains N, wherein N denotes the length of the depth image sequence;
and (1.2) keeping M depth image sequences of different lengths, wherein the lengths N of the different depth image sequences differ from one another, and the N value of each depth image sequence is fixed while that sequence is processed.
Preferably, step (2) comprises:
(2.1) for each depth image sequence, extracting the normal vector S_n = S_xn × S_yn of each frame of depth image in the sequence, wherein S_n denotes the normal vector of the n-th frame depth image, S_xn = (1, 0, ∂d_n/∂x_n) and S_yn = (0, 1, ∂d_n/∂y_n) are the tangent vectors along the x and y directions respectively, the n-th frame depth image is p_n = (x_n, y_n, d_n(x_n, y_n)), (x_n, y_n) denotes the pixel coordinates, d_n(x_n, y_n) is the pixel value of the depth image at (x_n, y_n), and n = 1, 2, 3, ..., N;
and (2.2) fusing the normal vectors of each frame of depth image in the depth image sequence to obtain the normal vector characteristics of W x H x 3 x N of the depth image sequence.
Preferably, step (3) comprises:
(3.1) for each depth image sequence, converting each frame of depth image in the depth image sequence into a gray-scale map of W x H;
(3.2) calculating, pixel by pixel, the first dimension -∂d(x,y)/∂x and the second dimension -∂d(x,y)/∂y of the normal vector of each frame of depth image in the depth image sequence, obtaining two matrices of size W × H, wherein (x, y) denotes the pixel coordinates and d(x, y) is the pixel value of the depth image;
(3.3) merging the W × H gray-scale matrix and the W × H × 2 matrix formed by the first two dimensions of the normal vector into a W × H × 3 tensor, encoding each W × H frame of depth image into a W × H × 3 feature tensor, and thus obtaining the W × H × 3 × N tensor corresponding to the depth image sequence, wherein W and H are the width and height of the depth image.
Preferably, step (4) comprises:
(4.1) for each depth image sequence, recording the feature tensor sequence of its N frames of depth images as X = [x_1, x_2, ..., x_t, ..., x_N], wherein x_t, t = 1, 2, 3, ..., N, is the W × H × 3 tensor encoded from the t-th frame depth image;
(4.2) designing a mapping function ψ(·) such that, for the t-th frame depth image x_t, the output ψ(x_t) is the mapped feature vector of the t-th frame, wherein the mapping function converts the original depth image data to the range [0, 255] and vectorizes the matrix;
(4.3) obtaining the score S(v_t; u) = u^T · v_t of the t-th frame depth image from its average feature and the ranking function, wherein v_t = (1/t) · Σ_{τ=1}^{t} ψ(x_τ) denotes the average feature of the depth images up to the t-th frame and u^T is the transpose of the parameter vector obtained by optimizing the ranking function, the ranking function being designed so that frames later in the time series receive larger scores;
(4.4) optimizing the parameter u of the ranking function with a RankSVM, so that, between different frames of the depth image sequence, frames later in the time series receive larger scores, and reshaping the obtained optimal value of u into a W × H × 3 tensor, which serves as the W × H × 3 tensor encoded from the W × H × 3 × N tensor corresponding to the depth image sequence;
(4.5) using û = Σ_{i=1}^{N} α_i · ψ(x_i) as an approximation of the parameter u, wherein ψ(x_i) is the vectorized W × H × 3 feature tensor of the i-th frame image obtained through the mapping function of step (4.2), α_i = 2(N − i + 1), and N denotes the length of the corresponding depth image sequence.
Preferably, the M feature tensors of size W × H × 3 encoded from the M depth video sequences in step (5) are detected and the detection results are output, wherein the detection network comprises YOLOv1, YOLOv2, YOLOv3, Fast R-CNN, Faster R-CNN, MobileNet V1, MobileNet V2 or ShuffleNet; when the target detection network uses YOLOv2, the input of the target detection network is 413 × 413 × 3, the W × H × 3 tensor is resized to 413 × 413 × 3 by image size transformation, and the target detection network outputs the fall probability of the target to be detected, the horizontal and vertical coordinates of the target in the image, and the width and height of the target.
Preferably, the method further comprises:
after time-space domain labeling is performed on a preset fall detection data set, converting the depth images into gray-scale images and normal vector features of the depth images for feature fusion, and performing dynamic image encoding on the tensor features obtained by the feature fusion to produce dynamic image training samples;
pre-training a convolutional neural network with millions of ImageNet images, and then performing end-to-end multi-batch training on the dynamic image training samples for behavior detection to obtain the target detection network, wherein the output of the target detection network comprises: the fall probability of the target to be detected, the position of the target to be detected in the image, and the width and height of the target to be detected.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
the time-space domain detection method for the falling behavior based on the depth image fully excavates the characteristics of the depth image, encodes a characteristic sequence through the characteristic fusion of a normal vector and the depth image, carries out frame-by-frame detection on the falling behavior through a target detection network trained by a large amount of labeled data and the non-maximum inhibition of a time domain and a space domain in a detection result, and ensures that the method has higher real-time performance, accuracy, robustness, privacy protection and practicability.
Drawings
Fig. 1 is a schematic flow chart of a fall behavior detection method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a physical meaning corresponding to a normal vector feature extracted from a depth image according to an embodiment of the present invention;
FIG. 3 is a flow chart of a dynamic graph encoding algorithm provided by an embodiment of the present invention;
FIG. 4 is a visualization of the first dimension of the feature vector obtained by the simplified dynamic image encoding algorithm according to an embodiment of the present invention;
FIG. 5 is a network architecture of a behavior detection network used in accordance with an embodiment of the present invention;
fig. 6 is a flowchart of a complete fall detection method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides a time-space domain detection method for fall behavior based on a depth image. The method fully mines the characteristics of the depth image: it fuses normal vector features with the depth image, encodes the feature sequence with dynamic image encoding, detects fall behavior frame by frame with a target detection network trained on a large amount of labeled data, and applies non-maximum suppression in the time domain and space domain to the detection results. It therefore offers high real-time performance, accuracy, robustness, privacy protection and practicability, and solves the technical problems that existing fall detection methods are affected by illumination, have low accuracy and require wearing a sensor.
The invention provides a time-space domain detection method for fall behavior based on a depth image, which comprises: acquiring depth images, updating the depth image sequence, selecting video sequences over multiple time windows, performing normal vector feature extraction, feature fusion and dynamic image encoding on the depth image sequences, extracting features with the behavior detection network, outputting the detection result, and training the behavior detection network. The time-space domain detection method for fall behavior provided by the invention is described in detail below with a specific example.
The method for detecting the time-space domain of the falling behavior based on the depth image, provided by the embodiment of the invention, comprises the following specific steps, and the whole process is shown in fig. 1 and 6:
(1) Acquiring depth images: depth image data of the indoor scene is acquired with a depth sensor. The depth image is not affected by illumination, so detection can run around the clock; moreover, the characteristics of a depth image cannot be used to identify a person, so privacy protection is strong.
In addition, the invention performs real-time detection: after detection of fall behavior has been completed for the current frame, the next frame of depth image is read in.
(2) Selecting video sequences over multiple time windows: since fall behavior cannot be confirmed from a single depth image, a depth image sequence of length N is needed for judgment.
N frames of images are read first, and the subsequent feature extraction, encoding and detection start from the N-th frame. Thereafter, before each new frame of depth image is read in, the earliest image in the current image sequence is discarded, so that the length of the depth image sequence used for fall detection remains N.
Since the duration of a fall varies from person to person, depth image sequences of several lengths are kept: in step (2) there are M depth image sequences at the same time, and the length N of each sequence differs from the others, but the N value of a given sequence does not change while that sequence is processed; the processing of every sequence in the subsequent steps is identical and is not repeated. In addition, the video time window lengths in this embodiment of the invention are obtained by clustering the training samples.
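As an illustration of this sliding multi-window scheme, the following minimal sketch keeps one FIFO buffer per window length; the particular window lengths are placeholders, since the patent derives its window lengths by clustering training samples.

```python
from collections import deque

import numpy as np

# Illustrative window lengths (in frames); the patent obtains its window lengths
# by clustering the training samples, so these particular values are placeholders.
WINDOW_LENGTHS = [30, 45, 60]

# One FIFO buffer per time window; deque(maxlen=N) keeps only the latest N frames.
buffers = {n: deque(maxlen=n) for n in WINDOW_LENGTHS}

def update_buffers(depth_frame: np.ndarray) -> None:
    """Append the newest depth frame; the oldest frame is discarded automatically."""
    for buf in buffers.values():
        buf.append(depth_frame)

def ready_sequences() -> dict:
    """Return the sequences that already hold their full N frames and can be encoded."""
    return {n: list(buf) for n, buf in buffers.items() if len(buf) == n}
```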
(3) Extracting normal vector features from each depth image sequence. A normal vector is extracted for each frame of depth image. Let the n-th frame depth image be p_n = (x_n, y_n, d_n(x_n, y_n)), n = 1, 2, 3, ..., N, where (x_n, y_n) denotes the pixel coordinates and d_n(x_n, y_n) is the pixel value of the n-th frame depth image at (x_n, y_n); its physical meaning is the distance of that point from the camera (in millimeters), which differs from the physical meaning of the data collected by an RGB camera. The normal vector of the depth image is calculated according to the following formula:

S_n = S_xn × S_yn,

wherein

S_xn = (1, 0, ∂d_n/∂x_n),  S_yn = (0, 1, ∂d_n/∂y_n)

are the tangent vectors along the x and y directions, respectively. Taking the cross product, the normal vector of each point p_n = (x_n, y_n, d_n(x_n, y_n)) on the n-th frame depth map is:

S_n = (-∂d_n/∂x_n, -∂d_n/∂y_n, 1).

In general, ∂d_n/∂x_n and ∂d_n/∂y_n are calculated with the following approximation:

∂d_n/∂x_n ≈ d_n(x_n + 1, y_n) - d_n(x_n, y_n),
∂d_n/∂y_n ≈ d_n(x_n, y_n + 1) - d_n(x_n, y_n).
the physical meaning is shown in figure 2.
(4) Feature fusion. Each frame of depth image obtained in step (2) is converted into a W × H gray-scale map, and the first dimension -∂d/∂x and the second dimension -∂d/∂y of the normal vector of each frame of depth image from step (3) are calculated pixel by pixel, giving two further matrices of size W × H. The W × H gray-scale matrix from step (2) and the W × H × 2 matrix formed by the first two dimensions of the normal vector from step (3) are merged into a W × H × 3 tensor, so that each W × H frame of depth image is encoded into a W × H × 3 feature tensor, and the encoding result of the depth image sequence is a W × H × 3 × N tensor, where W and H are the width and height of the depth image.
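A minimal sketch of this per-frame fusion is given below; the linear gray-scale conversion and the assumed maximum depth range are illustrative choices, since the patent does not specify the exact conversion.

```python
import numpy as np

def fuse_frame(depth: np.ndarray, max_depth_mm: float = 8000.0) -> np.ndarray:
    """Encode one H x W depth frame into an H x W x 3 feature tensor.

    Channel 0: gray-scale version of the depth map (the scaling is an assumption).
    Channels 1-2: first two dimensions of the normal vector, i.e. -dd/dx and -dd/dy
    computed with forward differences.
    """
    d = depth.astype(np.float32)
    gray = np.clip(d / max_depth_mm, 0.0, 1.0) * 255.0
    ddx = np.zeros_like(d)
    ddy = np.zeros_like(d)
    ddx[:, :-1] = d[:, 1:] - d[:, :-1]
    ddy[:-1, :] = d[1:, :] - d[:-1, :]
    return np.stack([gray, -ddx, -ddy], axis=-1)

def fuse_sequence(frames) -> np.ndarray:
    """Stack the per-frame H x W x 3 tensors into the H x W x 3 x N sequence tensor."""
    return np.stack([fuse_frame(f) for f in frames], axis=-1)
```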
(5) Dynamic image encoding. The W × H × 3 × N feature tensor obtained in step (4) is encoded with dynamic image encoding to obtain a W × H × 3 tensor. As shown in fig. 3, the feature tensor sequence of the N frames is written as X = [x_1, x_2, ..., x_N], where the t-th frame x_t is the W × H × 3 tensor encoded from that frame in step (4). A mapping function ψ(·) is designed such that, for the t-th frame depth image x_t, the output ψ(x_t) is the mapped feature vector of the t-th frame; the mapping function converts the original depth image data to the range [0, 255] and vectorizes the matrix. Then the average feature up to the t-th frame is defined as

v_t = (1/t) · Σ_{τ=1}^{t} ψ(x_τ).

Since there is a clear temporal order between the frames of a fall video, a ranking function is defined on this average feature, and the score of the t-th frame is

S(v_t; u) = u^T · v_t.

The role of the ranking function is to give frames later in the time series larger scores, i.e. for all frames

q > t  ⇒  S(v_q; u) > S(v_t; u).

The final objective is to optimize the parameter u of the ranking function S with a RankSVM so that, between different frames, the later a frame lies in the time series, the larger its score. Using the structural risk minimization and max-margin optimization framework, the objective optimization problem can be expressed as

E(u) = (λ/2) · ||u||² + (2 / (N(N − 1))) · Σ_{q>t} max{0, 1 − S(v_q; u) + S(v_t; u)},

where the first term is a regularization term and the second term is a hinge-loss penalty term. This is provably a convex optimization problem that can be solved with a RankSVM, and the optimized parameter u* can serve as a new representation of the whole feature tensor sequence. After resizing, u* becomes a W × H × 3 tensor feature.

This formulation can be simplified. Let d denote the optimal parameter u sought by the above method, i.e.

d* = argmin_d E(d).

Starting from d = 0, the first step of gradient descent yields the approximate solution

d* ∝ −∇E(d)|_{d=0} ∝ Σ_{q>t} (v_q − v_t) = Σ_{t=1}^{N} (2t − N − 1) · v_t.

Substituting v_t = (1/t) · Σ_{τ=1}^{t} ψ(x_τ) and summing the resulting series of coefficients gives

d* ∝ Σ_{t=1}^{N} α_t · ψ(x_t),  with  α_t = 2(N − t + 1) − (N + 1)(H_N − H_{t−1}),

where H_t = Σ_{i=1}^{t} 1/i is the t-th harmonic number and H_0 = 0. The finally desired W × H × 3 tensor feature therefore becomes, as shown in fig. 4,

d* = Σ_{t=1}^{N} α_t · ψ(x_t).

In the present embodiment, α_t = 2(N − t + 1) is used to process the feature tensor sequence; the second term of the formula α_t = 2(N − t + 1) − (N + 1)(H_N − H_{t−1}) does not influence the encoding effect, and dropping it saves much computation time.
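A compact sketch of this simplified dynamic image encoding (approximate rank pooling with α_t = 2(N − t + 1)) follows; the per-frame min-max rescaling stands in for the mapping function ψ, whose exact form is an assumption here.

```python
import numpy as np

def psi(frame: np.ndarray) -> np.ndarray:
    """Map one feature frame to the range [0, 255]; the exact form of the patent's
    mapping function is not given, so this min-max rescaling is an assumption."""
    f = frame.astype(np.float32)
    f -= f.min()
    if f.max() > 0:
        f = f / f.max() * 255.0
    return f

def dynamic_image(seq: np.ndarray) -> np.ndarray:
    """Encode a W x H x 3 x N feature sequence into one W x H x 3 dynamic image
    using the simplified rank-pooling weights alpha_t = 2 * (N - t + 1)."""
    n = seq.shape[-1]
    alpha = 2.0 * (n - np.arange(1, n + 1) + 1)              # alpha_1 .. alpha_N
    d = sum(alpha[t] * psi(seq[..., t]) for t in range(n))   # weighted sum over frames
    d -= d.min()
    d = d / max(d.max(), 1e-6) * 255.0                       # rescale for the detector
    return d.astype(np.uint8)
```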
(6) Detection by the behavior detection network. The M feature tensors of size W × H × 3 obtained by encoding the M depth video sequences in step (5) are fed to the subsequent detection network, which outputs the detection results.
The target detection network in the embodiment of the invention may be any existing target detection network such as YOLOv1, YOLOv2, YOLOv3, Fast R-CNN, Faster R-CNN, MobileNet V1, MobileNet V2 or ShuffleNet. In the embodiment of the present invention, YOLOv2 (You Only Look Once: Unified, Real-Time Object Detection) is preferably used; the network structure is shown in FIG. 5.
The input of the convolutional neural network is 413 × 413 × 3, and the W × H × 3 tensor is directly resized to 413 × 413 × 3. The network output is 13 × 13 × 130, that is, 13 × 13 × B × (5 + C) with B = 5 and C = 21, where B is the number of output bounding boxes per cell and C is the number of detected target categories; the first 20 categories are non-fall categories and the 21st category is fall. The 5 values at the front of each detection result are the probability p of the target, the position (x, y) of the target in the image, and the width w and height h of the target.
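For illustration, the sketch below decodes an output tensor laid out as 13 × 13 × B × (5 + C) in the way described above; the exact ordering of the five box values and the class scores in a real YOLOv2 implementation may differ, so this layout is an assumption used only to show how the fall probability and box are read out.

```python
import numpy as np

B, C = 5, 21          # boxes per cell and number of classes (21st class = fall)
FALL_CLASS = 20       # 0-based index of the 21st category

def decode_fall_detections(output: np.ndarray, conf_thresh: float = 0.5):
    """Read fall probability and box (x, y, w, h) from a 13 x 13 x (B*(5+C)) tensor.

    Assumes each of the B box slots stores [p, x, y, w, h, class scores...];
    a real YOLOv2 head may order these values differently.
    """
    grid = output.reshape(13, 13, B, 5 + C)
    detections = []
    for row in range(13):
        for col in range(13):
            for b in range(B):
                p, x, y, w, h = grid[row, col, b, :5]
                cls_scores = grid[row, col, b, 5:]
                fall_prob = p * cls_scores[FALL_CLASS]
                if fall_prob > conf_thresh:
                    detections.append((fall_prob, x, y, w, h))
    return detections
```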
(7) Processing the detection results: spatial non-maximum suppression is applied to the spatial fall positions of the M depth video sequences according to the fall probabilities output by the detection network, and temporal non-maximum suppression is applied to the detection results of the M videos in the time domain according to the same probabilities; the processed spatial positions and time windows are then merged, and if the fall probability of the target and the length of the merged fall time window are both greater than the experimentally set thresholds, it is determined that a fall has occurred, otherwise it is determined that no fall has occurred, which yields the fall detection result at the current moment.
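A simplified sketch of the spatial suppression and time-window merging follows; the IoU threshold and the window-overlap rule are assumptions chosen for illustration rather than values specified by the patent.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatial_nms(dets, iou_thresh=0.5):
    """Keep the highest-probability box and drop boxes that overlap it too much.

    dets: list of (prob, box) with box = (x1, y1, x2, y2).
    """
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    kept = []
    for prob, box in dets:
        if all(iou(box, kb) < iou_thresh for _, kb in kept):
            kept.append((prob, box))
    return kept

def merge_time_windows(windows):
    """Merge overlapping (start, end) windows from the M sequences into larger spans."""
    if not windows:
        return []
    windows = sorted(windows)
    merged = [list(windows[0])]
    for start, end in windows[1:]:
        if start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(w) for w in merged]
```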
(8) Recording the detection result: the fall probability of the target, its position in the image, the moment of the fall and the duration of the fall are recorded, and an early warning is issued. When the detected fall probability and the length of the time window exceed the thresholds, the judged probability, spatial position and occurrence time of the fall behavior are output, and the depth images within the occurrence time window are archived so that the cause of the fall can be analyzed later.
(9) Training the behavior detection network. The feature extraction network and the detection network are obtained by labeling an existing fall detection data set to produce dynamic image training samples, pre-training on millions of ImageNet images, and finally performing end-to-end multi-batch training for behavior detection.
In the embodiment of the invention, the SDUFall fall recognition data set of Shandong University and the NTU RGBD data set of Nanyang Technological University in Singapore are used; temporal and spatial fall labels are added to these recognition data sets, thereby extending the original data sets.
After the detection result at the current moment has been recorded, the next frame of image is read in.
In the training process, the VOC 2007 and VOC 2012 competition data sets are used, so the output has 21 categories, the first 20 being the 20 categories of the VOC data sets. Transfer learning is used during training: the initial parameters of the backbone network (the layers numbered above 24 in the figure) are obtained by pre-training on the ImageNet million-scale image classification data set, and the final convolution kernel parameters are then obtained by training on the NTU RGBD data set and the SDUFall data set labeled by the invention, which greatly improves the robustness and stability of the method.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A time-space domain detection method for falling behavior based on a depth image is characterized by comprising the following steps:
(1) acquiring depth image data of an indoor scene, and intercepting M depth image sequences of different lengths from the depth image data, wherein M is an integer;
(2) extracting the normal vector features of each depth image sequence, wherein normal vector features of size W × H × 3 × N are extracted from each depth image sequence, W being the width of the depth image, H its height, and N the number of frames of the corresponding depth image sequence;
(3) converting each obtained depth image sequence into a W × H × N gray-scale image sequence, and fusing the gray-scale images of each depth image sequence with the normal vector features of that sequence to obtain the W × H × 3 × N tensor corresponding to the sequence, which specifically comprises:
(3.1) for each depth image sequence, converting each frame of depth image in the depth image sequence into a gray-scale map of W x H;
(3.2) calculating, pixel by pixel, the first dimension -∂d(x,y)/∂x and the second dimension -∂d(x,y)/∂y of the normal vector of each frame of depth image in the depth image sequence, obtaining two matrices of size W × H, wherein (x, y) denotes the pixel coordinates and d(x, y) is the pixel value of the depth image;
(3.3) merging the W × H gray-scale matrix and the W × H × 2 matrix formed by the first two dimensions of the normal vector into a W × H × 3 tensor, so as to encode each W × H frame of depth image into a W × H × 3 feature tensor and thus obtain the W × H × 3 × N tensor corresponding to the depth image sequence;
(4) performing dynamic image encoding on the W × H × 3 × N tensor features corresponding to each depth image sequence, so that each depth image sequence is encoded into a W × H × 3 tensor, which specifically comprises:
(4.1) for each depth image sequence, recording the feature tensor sequence of its N frames of depth images as X = [x_1, x_2, ..., x_t, ..., x_N], wherein x_t, t = 1, 2, 3, ..., N, is the W × H × 3 tensor encoded from the t-th frame depth image;
(4.2) designing a mapping function ψ(·) such that, for the t-th frame depth image x_t, the output ψ(x_t) is the mapped feature vector of the t-th frame, wherein the mapping function converts the original depth image data to the range [0, 255] and vectorizes the matrix;
(4.3) obtaining the score S(v_t; u) = u^T · v_t of the t-th frame depth image from its average feature and the ranking function, wherein v_t = (1/t) · Σ_{τ=1}^{t} ψ(x_τ) denotes the average feature of the depth images up to the t-th frame and u^T is the transpose of the parameter vector obtained by optimizing the ranking function, the ranking function being designed so that frames later in the time series receive larger scores;
(4.4) optimizing the parameter u of the ranking function with a RankSVM, so that, between different frames of the depth image sequence, frames later in the time series receive larger scores, and reshaping the obtained optimal value of u into a W × H × 3 tensor, which serves as the W × H × 3 tensor encoded from the W × H × 3 × N tensor corresponding to the depth image sequence;
(4.5) using û = Σ_{i=1}^{N} α_i · ψ(x_i) as an approximation of the parameter u, wherein ψ(x_i) is the vectorized W × H × 3 feature tensor of the i-th frame image obtained through the mapping function of step (4.2), α_i = 2(N − i + 1), and N denotes the length of the corresponding depth image sequence;
(5) taking the W × H × 3 tensor corresponding to each depth image sequence as the input of a target detection network to obtain the occurrence probability of fall behavior and the spatial position information of the fall behavior for each depth image sequence;
(6) performing spatial non-maximum suppression on the spatial position information of the fall behavior of each depth image sequence according to the fall probabilities obtained by the target detection network, performing temporal non-maximum suppression on the detection results of the M depth image sequences in the time domain according to the same fall probabilities, and merging the spatial positions and time windows corresponding to the M depth image sequences; if the fall probability of the target is greater than a first preset value and the length of the merged time window is greater than a second preset value, determining that a fall has occurred, recording the fall probability of the fall, the position in the image, the moment of the fall and the duration of the fall, and issuing an early warning; otherwise, determining that no fall has occurred.
2. The method of claim 1, wherein step (1) comprises:
(1.1) acquiring N frames of depth images, starting feature extraction, encoding and detection from the N-th frame of depth image, and, before processing the next frame of depth image, discarding the earliest depth image in the current depth image sequence so that the length of the depth image sequence used for fall detection remains N, wherein N denotes the length of the depth image sequence;
and (1.2) keeping M depth image sequences of different lengths, wherein the lengths N of the different depth image sequences differ from one another, and the N value of each depth image sequence is fixed while that sequence is processed.
3. The method of claim 1 or 2, wherein step (2) comprises:
(2.1) for each depth image sequence, extracting the normal vector S_n = S_xn × S_yn of each frame of depth image in the sequence, wherein S_n denotes the normal vector of the n-th frame depth image, S_xn = (1, 0, ∂d_n/∂x_n) and S_yn = (0, 1, ∂d_n/∂y_n) are the tangent vectors along the x and y directions respectively, the n-th frame depth image is p_n = (x_n, y_n, d_n(x_n, y_n)), (x_n, y_n) denotes the pixel coordinates, d_n(x_n, y_n) is the pixel value of the depth image at (x_n, y_n), and n = 1, 2, 3, ..., N;
and (2.2) fusing the normal vectors of each frame of depth image in the depth image sequence to obtain the normal vector characteristics of W x H x 3 x N of the depth image sequence.
4. The method according to claim 1, wherein the M feature tensors of size W × H × 3 encoded from the M depth video sequences in step (5) are detected and the detection results are output, wherein the detection network comprises YOLOv1, YOLOv2, YOLOv3, Fast R-CNN, Faster R-CNN, MobileNet V1, MobileNet V2 or ShuffleNet; and when the target detection network uses YOLOv2, the input of the target detection network is 413 × 413 × 3, the W × H × 3 tensor is resized to 413 × 413 × 3 by image size transformation, and the target detection network outputs the fall probability of the target to be detected, the horizontal and vertical coordinates of the target in the image, and the width and height of the target.
5. The method of claim 4, further comprising:
after time-space domain labeling is performed on a preset fall detection data set, converting the depth images into gray-scale images and normal vector features of the depth images for feature fusion, and performing dynamic image encoding on the tensor features obtained by the feature fusion to produce dynamic image training samples;
pre-training a convolutional neural network with millions of ImageNet images, and then performing end-to-end multi-batch training on the dynamic image training samples for behavior detection to obtain the target detection network, wherein the output of the target detection network comprises: the fall probability of the target to be detected, the position of the target to be detected in the image, and the width and height of the target to be detected.
CN201910032206.5A 2019-01-14 2019-01-14 Fall-down behavior time-space domain detection method based on depth image Active CN109886102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910032206.5A CN109886102B (en) 2019-01-14 2019-01-14 Fall-down behavior time-space domain detection method based on depth image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910032206.5A CN109886102B (en) 2019-01-14 2019-01-14 Fall-down behavior time-space domain detection method based on depth image

Publications (2)

Publication Number Publication Date
CN109886102A CN109886102A (en) 2019-06-14
CN109886102B true CN109886102B (en) 2020-11-17

Family

ID=66925930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910032206.5A Active CN109886102B (en) 2019-01-14 2019-01-14 Fall-down behavior time-space domain detection method based on depth image

Country Status (1)

Country Link
CN (1) CN109886102B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598606B (en) * 2019-09-02 2022-05-27 南京邮电大学 Indoor falling behavior detection method with visual privacy protection advantage
CN110765860B (en) * 2019-09-16 2023-06-23 平安科技(深圳)有限公司 Tumble judging method, tumble judging device, computer equipment and storage medium
CN111310647A (en) * 2020-02-12 2020-06-19 北京云住养科技有限公司 Generation method and device for automatic identification falling model
CN113077426B (en) * 2021-03-23 2022-08-23 成都国铁电气设备有限公司 Method for detecting defects of clamp plate bolt on line in real time

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318248A (en) * 2014-10-21 2015-01-28 北京智谷睿拓技术服务有限公司 Action recognition method and action recognition device
CN107480729A (en) * 2017-09-05 2017-12-15 江苏电力信息技术有限公司 A kind of transmission line forest fire detection method based on depth space-time characteristic of field
CN107506706A (en) * 2017-08-14 2017-12-22 南京邮电大学 A kind of tumble detection method for human body based on three-dimensional camera
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9597016B2 (en) * 2012-04-27 2017-03-21 The Curators Of The University Of Missouri Activity analysis, fall detection and risk assessment systems and methods
CN104361321B (en) * 2014-11-13 2018-02-09 侯振杰 A kind of method for judging the elderly and falling down behavior and balance ability
US20160349918A1 (en) * 2015-05-29 2016-12-01 Intel Corporation Calibration for touch detection on projected display surfaces
CN105279483B (en) * 2015-09-28 2018-08-21 华中科技大学 A kind of tumble behavior real-time detection method based on depth image
CN105868707B (en) * 2016-03-28 2019-03-08 华中科技大学 A kind of falling from bed behavior real-time detection method based on deep image information
CN107016350A (en) * 2017-04-26 2017-08-04 中科唯实科技(北京)有限公司 A kind of Falls Among Old People detection method based on depth camera
CN107220604A (en) * 2017-05-18 2017-09-29 清华大学深圳研究生院 A kind of fall detection method based on video
CN107657244B (en) * 2017-10-13 2020-12-01 河海大学 Human body falling behavior detection system based on multiple cameras and detection method thereof
CN108038420B (en) * 2017-11-21 2020-10-30 华中科技大学 Human behavior recognition method based on depth video
CN107944459A (en) * 2017-12-09 2018-04-20 天津大学 A kind of RGB D object identification methods
CN108229421B (en) * 2018-01-24 2021-07-02 华中科技大学 Depth video information-based method for detecting falling-off from bed in real time
CN108737785B (en) * 2018-05-21 2020-07-03 北京奇伦天佑创业投资有限公司 Indoor automatic detection system that tumbles based on TOF 3D camera

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318248A (en) * 2014-10-21 2015-01-28 北京智谷睿拓技术服务有限公司 Action recognition method and action recognition device
CN107506706A (en) * 2017-08-14 2017-12-22 南京邮电大学 A kind of tumble detection method for human body based on three-dimensional camera
CN107480729A (en) * 2017-09-05 2017-12-15 江苏电力信息技术有限公司 A kind of transmission line forest fire detection method based on depth space-time characteristic of field
CN108062753A (en) * 2017-12-29 2018-05-22 重庆理工大学 The adaptive brain tumor semantic segmentation method in unsupervised domain based on depth confrontation study

Also Published As

Publication number Publication date
CN109886102A (en) 2019-06-14

Similar Documents

Publication Publication Date Title
CN109886102B (en) Fall-down behavior time-space domain detection method based on depth image
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN111209810A (en) Bounding box segmentation supervision deep neural network architecture for accurately detecting pedestrians in real time in visible light and infrared images
CN111191667B (en) Crowd counting method based on multiscale generation countermeasure network
CN108960047B (en) Face duplication removing method in video monitoring based on depth secondary tree
CN110717389B (en) Driver fatigue detection method based on generation countermeasure and long-short term memory network
CN112580523A (en) Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN110097028B (en) Crowd abnormal event detection method based on three-dimensional pyramid image generation network
CN108960076B (en) Ear recognition and tracking method based on convolutional neural network
CN107767416B (en) Method for identifying pedestrian orientation in low-resolution image
CN112991269A (en) Identification and classification method for lung CT image
CN110598606B (en) Indoor falling behavior detection method with visual privacy protection advantage
CN106709419B (en) Video human behavior recognition method based on significant trajectory spatial information
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
CN111738054A (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN110084201A (en) A kind of human motion recognition method of convolutional neural networks based on specific objective tracking under monitoring scene
CN115527269B (en) Intelligent human body posture image recognition method and system
CN106056078A (en) Crowd density estimation method based on multi-feature regression ensemble learning
Huo et al. 3DVSD: An end-to-end 3D convolutional object detection network for video smoke detection
CN113688761A (en) Pedestrian behavior category detection method based on image sequence
CN113221812A (en) Training method of face key point detection model and face key point detection method
CN112633179A (en) Farmer market aisle object occupying channel detection method based on video analysis
CN112488213A (en) Fire picture classification method based on multi-scale feature learning network
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant