CN113837005A - Human body falling detection method and device, storage medium and terminal equipment - Google Patents

Human body falling detection method and device, storage medium and terminal equipment

Info

Publication number
CN113837005A
CN113837005A CN202110960372.9A
Authority
CN
China
Prior art keywords
image
frame
data
human body
fall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110960372.9A
Other languages
Chinese (zh)
Inventor
林凡 (Lin Fan)
高欣 (Gao Xin)
宋进 (Song Jin)
Current Assignee
GCI Science and Technology Co Ltd
Original Assignee
GCI Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by GCI Science and Technology Co Ltd filed Critical GCI Science and Technology Co Ltd
Priority to CN202110960372.9A priority Critical patent/CN113837005A/en
Publication of CN113837005A publication Critical patent/CN113837005A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body fall detection method and device, a storage medium and a terminal device. The method receives video data; extracts the skeletal joint points of the human body to be detected from the video data to obtain joint data; inputs the joint data into a pre-trained first fall recognition model to obtain a first fall probability matrix; inputs the video data into a pre-trained second fall recognition model to obtain a second fall probability matrix; and averages the first fall probability matrix and the second fall probability matrix to obtain the fall recognition result for the human body to be detected. The method fuses the contour information, color information and skeleton data information of the video image data, so that the neural networks learn rich action features and the accuracy of fall recognition is improved.

Description

Human body falling detection method and device, storage medium and terminal equipment
Technical Field
The invention relates to the technical field of health monitoring, and in particular to a human body fall detection method and device, a storage medium and a terminal device.
Background
Falls seriously threaten the health and lives of the elderly, so unobtrusive, real-time safety monitoring has great application value and research significance for ensuring their quality of life. The literature "Research on a skeleton-sequence-based fall action recognition method for the elderly" proposes a fall action recognition method in which the skeleton data set extracted from video is divided by the hold-out method into two mutually exclusive sets serving as training set and test set, with a division ratio of 4:1. Multiple random divisions are then adopted, and after repeated tests the average value is taken as the evaluation result. Next, the data are preprocessed by data cleaning and the valid joint point data are stored; the fall process is analyzed, and the spatial features and time-series features of the skeleton are extracted respectively. Finally, the training data and test data are loaded respectively; after a model is trained on the training set, the test error evaluated on the test set is used as an approximation of the generalization error. A model trained by this method can effectively recognize fall actions. However, an action recognition method based only on skeleton data cannot learn the contour and color information contained in the image data, and cannot effectively handle action classification problems such as interaction between a person and the scene.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device, a storage medium and a terminal device for detecting human body falling, which can fuse the outline information, the color information and the skeleton data information of video image data, enable a neural network to learn abundant action characteristics and improve the accuracy of falling recognition.
In order to achieve the above object, an embodiment of the present invention provides a method for detecting a human body fall, including:
receiving video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
extracting skeleton joint points of the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
inputting the joint data into a first fall recognition model trained in advance to obtain a first fall probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
inputting the video data into a second fall recognition model trained in advance to obtain a second fall probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and carrying out mean value processing on the first falling probability matrix and the second falling probability matrix to obtain a falling identification result of the human body to be detected.
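The final fusion step of the claimed method can be sketched as follows. The one-dimensional per-frame probability shape and the per-video decision threshold are illustrative assumptions; the text only specifies element-wise mean processing of the two probability matrices.

```python
import numpy as np

def fuse_fall_probabilities(p1, p2, threshold=0.5):
    """Average the first and second fall probability matrices.

    p1, p2: per-frame fall probabilities from the two recognition models
    (illustrative shape: one probability per frame). The threshold-based
    per-video decision is an assumption, not stated in the text.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    fused = (p1 + p2) / 2.0                  # element-wise mean processing
    fell = bool(np.any(fused > threshold))   # assumed decision rule
    return fused, fell

fused, fell = fuse_fall_probabilities([0.2, 0.9, 0.8], [0.4, 0.7, 0.6])
```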
As an improvement of the above scheme, the extracting of the skeleton joint points of the human body to be detected in the video data to obtain joint data specifically includes:
inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, and obtaining a bounding box containing the human body to be detected in each frame of image;
inputting the human body image in the bounding box of each frame of image into a single-person posture prediction network, and extracting the skeleton joint points of the human body to be detected in each frame of image to obtain the joint point two-dimensional coordinate data of each frame of image;
inputting the two-dimensional coordinate data of the joint points of each frame of image into a pre-trained network model consisting of a convolutional neural network and a time convolutional network to obtain the three-dimensional coordinate data of the joint points of each frame of image;
and fitting the three-dimensional coordinate data of the joint points of each frame of image to obtain joint data.
As an improvement of the above scheme, inputting the human body image in the bounding box of each frame of image into a single-person posture prediction network, extracting the skeleton joint points of the human body to be detected in each frame of image, and obtaining the joint point two-dimensional coordinate data of each frame of image specifically comprises:
inputting the human body image in the bounding box of each frame of image into a spatial transformation network to obtain a preprocessed human body image of each frame of image;
inputting the preprocessed human body image of each frame of image into a single-person posture prediction network, and extracting the skeleton joint points of the human body to be detected in each frame of image;
performing confidence comparison on all skeleton joint points at each human body joint point position in each frame of image to obtain a skeleton joint point with the maximum confidence of each human body joint point position in each frame of image;
and fitting all skeleton joint points with the maximum confidence coefficient in each frame of image to obtain joint point two-dimensional coordinate data of each frame of image.
As an improvement of the above solution, the first fall identification model comprises n space-time graph convolutional layers and a first classifier, where n is an integer greater than 1;
when i = 1, the input data of the i-th space-time graph convolutional layer are the joint data; when i > 1, the input data of the i-th space-time graph convolutional layer are the output data of the (i-1)-th space-time graph convolutional layer, and the input data of the first classifier are the output data of the n-th space-time graph convolutional layer;
each space-time graph convolutional layer comprises a graph convolution network for processing the spatial structure information of the skeleton joint points of the human body to be detected and a temporal convolution network for processing time-dimension features;
the first classifier is used for performing action classification on the output data of the n-th space-time graph convolutional layer to obtain the first fall probability matrix.
As an improvement of the above scheme, the feature output $\hat{x}_{ti}$ of the i-th skeletal joint point of the t-th frame image processed by the graph convolution network is specifically:

$$\hat{x}_{ti} = \sum_{j} \tilde{D}_{ii}^{-\frac{1}{2}} \, \tilde{A}_{ij} \, \tilde{D}_{jj}^{-\frac{1}{2}} \, x_{tj} \, \omega$$

where $x_{tj}$ is the feature of the j-th skeletal joint point in the t-th frame image, $\tilde{A}$ is the adjacency matrix of the skeletal joint points of the t-th frame image with added closed loops (self-connections), $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $\omega$ is a parameter of the graph convolution network.
As an improvement of the above, the second fall identification model comprises: a deep residual network, an attention module, a convolutional long short-term memory (ConvLSTM) network and a second classifier;
the deep residual network is used for extracting the features of each frame of image in the video data and outputting the image features of each frame of image;
the attention module is used for processing the output data of the deep residual network and outputting an attention heat map of each frame of image;
the convolutional long short-term memory network is used for extracting the time-series features of the attention heat map of each frame of image;
and the second classifier is used for performing action classification on the output data of the convolutional long short-term memory network to obtain the second fall probability matrix.
In order to achieve the above object, an embodiment of the present invention further provides a human body fall detection apparatus, including:
the data receiving module is used for receiving video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
the data preprocessing module is used for extracting skeleton joint points of the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
the first falling identification module is used for inputting the joint data into a first falling identification model trained in advance to carry out falling identification so as to obtain a first falling probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
the second falling identification module is used for inputting the video data into a second falling identification model for falling identification to obtain a second falling probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and the data operation module is used for carrying out mean value processing on the first falling probability matrix and the second falling probability matrix to obtain a falling identification result of the human body to be detected.
As an improvement of the above scheme, the data preprocessing module specifically includes:
the bounding box detection unit is used for inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, and obtaining a bounding box containing the human body to be detected in each frame of image;
the first coordinate unit is used for inputting the human body image in the bounding box of each frame of image into a single-person posture prediction network, extracting the skeleton joint points of the human body to be detected in each frame of image, and obtaining the joint point two-dimensional coordinate data of each frame of image;
the second coordinate unit is used for inputting the two-dimensional coordinate data of the joint points of each frame of image into a pre-trained network model formed by a convolutional neural network and a time convolutional network to obtain the three-dimensional coordinate data of the joint points of each frame of image;
and the fitting unit is used for fitting the three-dimensional coordinate data of the joint points of each frame of image to obtain joint data.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the method for detecting a human fall according to any one of the above embodiments.
To achieve the above object, an embodiment of the present invention further provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and the processor, when executing the computer program, implements the human fall detection method described in any one of the above.
Compared with the prior art, the human body fall detection method and device, storage medium and terminal device provided by the embodiments of the invention first receive video data and extract the skeletal joint points of the human body to be detected from the video data to obtain joint data; second, the joint data are input into a pre-trained first fall recognition model to obtain a first fall probability matrix; then, the video data are input into a pre-trained second fall recognition model to obtain a second fall probability matrix; and finally, the first fall probability matrix and the second fall probability matrix are averaged to obtain the fall recognition result of the human body to be detected. The method fuses the contour information, color information and skeleton data information of the video image data, so that the neural networks learn rich action features and the accuracy of fall recognition is improved.
Drawings
Fig. 1 is a flowchart of a method for detecting a human fall according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a first fall identification model according to a preferred embodiment of the invention;
fig. 3 is a schematic structural diagram of a human fall detection apparatus provided in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a method for detecting a human fall according to an embodiment of the present invention.
The human body falling detection method comprises the following steps:
s1, receiving video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
s2, extracting skeleton joint points of the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
s3, inputting the joint data into a first fall recognition model trained in advance to obtain a first fall probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
s4, inputting the video data into a pre-trained second fall recognition model to obtain a second fall probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and S5, carrying out mean value processing on the first falling probability matrix and the second falling probability matrix to obtain a falling identification result of the human body to be detected.
The dimension of the joint data is (17, 3, 300), that is, the number of human joint points is 17, the information dimension of the input joint points is 3, and the total frame number of the video data is 300.
In an optional embodiment, in step S2, the extracting skeleton joint points of the human body to be detected in the video data is performed to obtain joint data, which specifically includes:
s21, inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, and obtaining a bounding box containing the human body to be detected in each frame of image;
s22, inputting the human body image in the boundary frame of each frame of image into a single posture prediction network, extracting skeleton joint points of the human body to be detected in each frame of image, and obtaining two-dimensional coordinate data of the joint points of each frame of image;
s23, inputting the two-dimensional coordinate data of the joint points of each frame of image into a pre-trained network model consisting of a convolutional neural network and a time convolutional network to obtain the three-dimensional coordinate data of the joint points of each frame of image;
and S24, fitting the three-dimensional coordinate data of the joint points of each frame of image to obtain joint data.
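The data flow of steps S21 to S24 can be sketched with stubbed networks. The stub functions below (`detect_bounding_box`, `predict_2d_joints`, `lift_to_3d`) are placeholders for the pre-trained detector, single-person posture prediction network and 2D-to-3D lifting model, which the text names only by role; only the tensor shapes follow the description.

```python
import numpy as np

def detect_bounding_box(frame):
    """S21 stand-in: human detector -> one (x, y, w, h) box per frame."""
    return (0, 0, frame.shape[1], frame.shape[0])

def predict_2d_joints(crop, num_joints=17):
    """S22 stand-in: single-person posture network -> (num_joints, 2) coords."""
    return np.zeros((num_joints, 2))

def lift_to_3d(joints_2d):
    """S23 stand-in: CNN + temporal-convolution lifting -> (num_joints, 3)."""
    return np.concatenate([joints_2d, np.zeros((joints_2d.shape[0], 1))], axis=1)

def extract_joint_data(video):
    """S21-S24: per-frame extraction, then fitting into one joint tensor."""
    per_frame = []
    for frame in video:
        x, y, w, h = detect_bounding_box(frame)
        crop = frame[y:y + h, x:x + w]
        per_frame.append(lift_to_3d(predict_2d_joints(crop)))
    # S24: fit the per-frame 3D joints into (joints, coords, frames).
    return np.stack(per_frame, axis=-1)

video = np.zeros((300, 64, 64))          # 300 dummy frames, as in the text
joint_data = extract_joint_data(video)   # dimension (17, 3, 300)
```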
It should be noted that, in step S21, before the video data are input into the pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, the convolutional neural network needs to be trained with a labeled image data set, so that the trained convolutional neural network produces higher activation values for the image pixels in the region where the human body is located. The labeled image data set is an image data set in which the human bodies to be detected have been annotated with bounding boxes.
Preferably, the convolutional neural network adopted by the network model is a convolutional neural network based on residual modules.
In step S23, before inputting the two-dimensional coordinate data of the joint point of each frame of image into the pre-trained network model composed of the convolutional neural network and the time convolutional network, the network model needs to be trained using the coordinates of the skeleton joint point of the human body to be measured in the three-dimensional space as the label, so that the trained network model can realize the conversion from the two-dimensional coordinates to the three-dimensional coordinates.
In an optional embodiment, in step S22, inputting the human body image in the bounding box of each frame of image into a single-person posture prediction network, extracting the skeleton joint points of the human body to be detected in each frame of image, and obtaining the joint point two-dimensional coordinate data of each frame of image includes:
S221, inputting the human body image in the bounding box of each frame of image into a spatial transformation network to obtain a preprocessed human body image of each frame of image;
S222, inputting the preprocessed human body image of each frame of image into a single-person posture prediction network, and extracting the skeleton joint points of the human body to be detected in each frame of image;
s223, performing confidence comparison on all skeleton joint points of each human joint point position in each frame of image to obtain a skeleton joint point with the maximum confidence of each human joint point position in each frame of image;
and S224, fitting all skeleton joint points with the maximum confidence coefficient in each frame of image to obtain joint point two-dimensional coordinate data of each frame of image.
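Steps S223 and S224 amount to a per-location argmax over candidate confidences, keeping the highest-confidence skeleton joint point for each body location and fitting the survivors into one 2D pose. A minimal sketch, with the candidate count and values invented for illustration:

```python
import numpy as np

NUM_JOINTS = 17
candidates = np.zeros((3, 17, 2))          # 3 candidate skeletons, (x, y) each
candidates[1] += 1.0                       # make candidate 1 distinguishable

# Per-candidate, per-location confidences (fixed so candidate 1 always wins).
confidences = np.tile([[0.2], [0.9], [0.5]], (1, NUM_JOINTS))  # shape (3, 17)

best = np.argmax(confidences, axis=0)      # S223: max-confidence candidate
pose_2d = candidates[best, np.arange(NUM_JOINTS)]  # S224: fitted (17, 2) pose
```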
Preferably, the spatial transformation network is a single network layer.
In an alternative embodiment, the first fall identification model comprises n space-time graph convolutional layers and a first classifier, where n is an integer greater than 1;
when i = 1, the input data of the i-th space-time graph convolutional layer are the joint data; when i > 1, the input data of the i-th space-time graph convolutional layer are the output data of the (i-1)-th space-time graph convolutional layer, and the input data of the first classifier are the output data of the n-th space-time graph convolutional layer;
each space-time graph convolutional layer comprises a graph convolution network for processing the spatial structure information of the skeletal joint points of the human body to be detected and a temporal convolution network for processing time-dimension features;
the first classifier is used for performing action classification on the output data of the n-th space-time graph convolutional layer to obtain the first fall probability matrix.
Preferably, n is 9, and the first classifier is a SoftMax classifier.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a first fall identification model according to a preferred embodiment of the invention;
the first fall identification model comprises 9 space-time diagram convolutional layers and a SoftMax classifier;
the space-time graph convolution layer is composed of a graph convolution network layer and a time convolution network layer;
taking the output data of the graph convolution network as the input data of the time convolution network, and processing the characteristic information of the space-time graph convolution layer through dropout;
the number of channels of a single node of the space-time diagram convolutional layer of the 1 st layer, the 2 nd layer and the 3 rd layer is 64, the number of channels of a single node of the space-time diagram convolutional layer of the 4 th layer, the 5 th layer and the 6 th layer is 128, and the number of channels of a single node of the space-time diagram convolutional layer of the 7 th layer, the 8 th layer and the 9 th layer is 256;
performing global average pooling on output data of the space-time diagram convolutional layer of the layer 9 to obtain feature vectors with 256 channels;
and inputting the 256 feature vectors into a SoftMax classifier for action classification to obtain a first fall probability matrix.
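The nine-layer channel progression, global average pooling and SoftMax classification described above can be sketched as follows. The per-layer transform is a plain linear map plus ReLU standing in for the real graph-convolution and temporal-convolution pair, and the two-class output (fall / no-fall) is an assumption:

```python
import numpy as np

CHANNELS = [64, 64, 64, 128, 128, 128, 256, 256, 256]  # per-node channels

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal((17, 300, 3))    # (joints, frames, input channels)

c_in = 3
for c_out in CHANNELS:
    # Placeholder layer: linear map + ReLU instead of graph + temporal conv.
    w = rng.standard_normal((c_in, c_out)) * 0.1
    x = np.maximum(x @ w, 0.0)
    c_in = c_out

feature = x.mean(axis=(0, 1))            # global average pooling -> (256,)
w_cls = rng.standard_normal((256, 2)) * 0.1   # assumed fall / no-fall classes
probs = softmax(feature @ w_cls)         # SoftMax action classification
```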
Before the joint data are input into the pre-trained first fall recognition model in step S3, a human skeleton graph structure is constructed. It can be regarded as a graph in which the skeletal joint points are the nodes and the bones are the edges, represented as G = (V, E), where V is the set of graph nodes, containing all skeletal joint points, and E is the set of edges, comprising a first subset (the bone connections within each frame of image) and a second subset (the edges connecting the same skeletal joint point across successive frames of images). Preferably, the joint data are input into the pre-trained first fall recognition model as the node features of the human skeleton graph structure.
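A minimal sketch of constructing the graph G = (V, E) as a space-time adjacency matrix, with intra-frame bone edges (first subset) and inter-frame edges linking each joint to itself in the next frame (second subset). The 17-joint bone list is hypothetical, since the text does not enumerate the skeleton's bones:

```python
import numpy as np

NUM_JOINTS, NUM_FRAMES = 17, 3           # few frames, for illustration only
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
         (0, 7), (7, 8), (8, 9), (9, 10), (8, 11), (11, 12),
         (12, 13), (8, 14), (14, 15), (15, 16)]   # hypothetical skeleton

N = NUM_JOINTS * NUM_FRAMES
A = np.zeros((N, N))
for t in range(NUM_FRAMES):
    base = t * NUM_JOINTS
    for i, j in BONES:                   # first subset: per-frame bone edges
        A[base + i, base + j] = A[base + j, base + i] = 1.0
    if t + 1 < NUM_FRAMES:               # second subset: temporal self-edges
        for v in range(NUM_JOINTS):
            A[base + v, base + NUM_JOINTS + v] = 1.0
            A[base + NUM_JOINTS + v, base + v] = 1.0
```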
Specifically, the feature output $\hat{x}_{ti}$ of the i-th skeletal joint point of the t-th frame image processed by the graph convolution network is:

$$\hat{x}_{ti} = \sum_{j} \tilde{D}_{ii}^{-\frac{1}{2}} \, \tilde{A}_{ij} \, \tilde{D}_{jj}^{-\frac{1}{2}} \, x_{tj} \, \omega$$

where $x_{tj}$ is the feature of the j-th skeletal joint point in the t-th frame image, $\tilde{A}$ is the adjacency matrix of the skeletal joint points of the t-th frame image with added closed loops (self-connections), $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $\omega$ is a parameter of the graph convolution network.
Specifically, the adjacency matrix with added closed loops $\tilde{A}$ of the t-th frame image is:

$$\tilde{A} = A + I$$

where $A$ is the adjacency matrix of the skeletal joint points of the t-th frame image and $I$ is an identity matrix of the same dimension as $A$.
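A small numeric sketch of the symmetrically normalized graph convolution step defined by the two formulas above; the chain skeleton and random features stand in for real data:

```python
import numpy as np

rng = np.random.default_rng(1)
J = 17                                   # skeletal joint points in one frame

A = np.zeros((J, J))                     # bone adjacency; a simple chain here,
for i in range(J - 1):                   # purely for illustration
    A[i, i + 1] = A[i + 1, i] = 1.0

A_tilde = A + np.eye(J)                  # closed loops (self-connections) added
d = A_tilde.sum(axis=1)                  # node degrees of A_tilde
D_inv_sqrt = np.diag(d ** -0.5)          # D^{-1/2}
A_norm = D_inv_sqrt @ A_tilde @ D_inv_sqrt

X = rng.standard_normal((J, 3))          # input joint features x_tj (3 channels)
W = rng.standard_normal((3, 64))         # graph-convolution parameters omega

# x'_ti = sum_j D_ii^{-1/2} A~_ij D_jj^{-1/2} x_tj w
X_out = A_norm @ X @ W
```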
In an alternative embodiment, the second fall identification model comprises: a deep residual network, an attention module, a convolutional long short-term memory (ConvLSTM) network and a second classifier;
the deep residual network is used for extracting the features of each frame of image in the video data and outputting the image features of each frame of image;
the attention module is used for processing the output data of the deep residual network and outputting an attention heat map of each frame of image;
the convolutional long short-term memory network is used for extracting the time-series features of the attention heat map of each frame of image;
and the second classifier is used for performing action classification on the output data of the convolutional long short-term memory network to obtain the second fall probability matrix.
Preferably, the deep residual network is ResNet-34, and the second classifier is a SoftMax classifier.
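The data flow through the second model can be sketched with shape-preserving placeholders. Every component below (backbone, attention, temporal summary) is a stub that only mimics tensor shapes, not the real ResNet-34, attention module or ConvLSTM:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backbone(frame):
    """ResNet-34 stand-in -> (channels, positions) feature map."""
    return rng.standard_normal((512, 49))

def attend(feat):
    """Attention stand-in: softmax-weight the positions of a feature map."""
    alpha = softmax(feat.mean(axis=0))
    return feat * alpha

def second_model(video, w_cls):
    maps = [attend(backbone(f)) for f in video]       # per-frame heat maps
    seq = np.stack([m.mean(axis=1) for m in maps])    # (T, 512) ConvLSTM stand-in
    return np.stack([softmax(h @ w_cls) for h in seq])  # (T, 2) probabilities

video = np.zeros((5, 224, 224, 3))       # 5 dummy RGB frames
w_cls = rng.standard_normal((512, 2)) * 0.05
probs2 = second_model(video, w_cls)      # second fall probability matrix
```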
Specifically, processing the output data of the deep residual network and outputting the attention heat map of each frame of image comprises calculating the attention heat map of each frame of image by the following formulas:

$$M(P_i) = \sum_{c} \theta_c \, F_c(P_i)$$

$$\alpha_i = \frac{\exp\big(M(P_i)\big)}{\sum_{j} \exp\big(M(P_j)\big)}$$

$$\hat{F}(P_i) = \alpha_i \, F(P_i)$$

where $F_c(P_i)$ is the activation value at the i-th position of the c-th channel of the feature map output by the last convolutional layer of the ResNet-34 network, $\theta_c$ is the parameter corresponding to the c-th channel, $M(P_i)$ is the attention score, and $\alpha_i$ is the attention weight.
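A numeric sketch of the attention computation described above; the feature-map size (512 channels, 7×7 = 49 positions, as in ResNet-34's last convolutional stage for a 224×224 input) is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
C, P = 512, 49                           # channels / spatial positions (assumed)
F = rng.standard_normal((C, P))          # F_c(P_i): activations per position
theta = rng.standard_normal(C)           # theta_c: per-channel parameters

M = theta @ F                            # attention score M(P_i) per position
alpha = np.exp(M - M.max())
alpha /= alpha.sum()                     # softmax -> attention weights alpha_i
heat_map = alpha * F                     # re-weighted features (heat map)
```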
It should be noted that the attention heat map enables the second fall recognition model to learn the pixel and contour information of each single frame of image, while the convolutional long short-term memory network enables it to learn the feature information of the whole video along the time sequence, thereby improving the accuracy of fall recognition.
The embodiment of the present invention further provides a human body fall detection device, which can implement all the processes of the human body fall detection method provided in any of the above embodiments. The functions and technical effects of the modules and units of the device are the same as those of the method provided in the above embodiments, and are not described herein again.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a device for detecting a human fall according to an embodiment of the present invention.
The human body falling detection device comprises:
a data receiving module 11, configured to receive video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
the data preprocessing module 12 is configured to perform skeleton joint point extraction on the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
the first fall recognition module 13 is used for inputting the joint data into a first fall recognition model trained in advance to perform fall recognition, so as to obtain a first fall probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
the second fall identification module 14 is configured to input the video data into a second fall identification model for fall identification, so as to obtain a second fall probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and the data operation module 15 is configured to perform mean processing on the first fall probability matrix and the second fall probability matrix to obtain a fall identification result of the human body to be detected.
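The mean processing performed by the data operation module 15 can be sketched as follows; the per-frame layout of the probability matrices, the 0.5 decision threshold, and the "any frame" voting rule are illustrative assumptions, not fixed by the embodiment:

```python
import numpy as np

def fuse_fall_probabilities(p1, p2, threshold=0.5):
    """Average two per-frame fall probability matrices and decide.

    p1, p2: arrays of shape (T,) holding, for each of T frames, the
            probability that the frame shows a fall action.
    Returns the fused probabilities and a boolean fall decision.
    """
    fused = (np.asarray(p1) + np.asarray(p2)) / 2.0  # element-wise mean
    # Illustrative rule: report a fall if any fused frame probability
    # exceeds the threshold.
    return fused, bool((fused > threshold).any())

fused, fell = fuse_fall_probabilities([0.25, 0.875], [0.75, 0.625])
print(fused.tolist(), fell)  # → [0.5, 0.75] True
```

Averaging the two matrices lets the skeleton-based and appearance-based models compensate for each other's per-frame errors.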
The data preprocessing module 12 specifically includes:
the bounding box detection unit is used for inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, so as to obtain a bounding box containing the human body to be detected in each frame of image;
the first coordinate unit is used for inputting the human body image within the bounding box of each frame of image into a single-person pose estimation network and extracting the skeleton joint points of the human body to be detected in each frame of image, so as to obtain the two-dimensional joint point coordinate data of each frame of image;
the second coordinate unit is used for inputting the two-dimensional joint point coordinate data of each frame of image into a pre-trained network model composed of a convolutional neural network and a temporal convolutional network, so as to obtain the three-dimensional joint point coordinate data of each frame of image;
and the fitting unit is used for fitting the three-dimensional joint point coordinate data of each frame of image to obtain the joint data.
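The order in which the four units above cooperate can be sketched as follows. The stub functions merely stand in for the trained networks of the embodiment, and the joint count, shapes, and function names are assumptions:

```python
import numpy as np

J = 17  # assumed number of skeleton joint points

def detect_bbox(frame):                 # bounding box detection unit (stub)
    h, w = frame.shape[:2]
    return (0, 0, w, h)

def estimate_2d_joints(frame, bbox):    # first coordinate unit (stub)
    return np.zeros((J, 2))             # (x, y) per joint

def lift_to_3d(joints_2d_seq):          # second coordinate unit (stub)
    t = len(joints_2d_seq)
    return np.zeros((t, J, 3))          # (x, y, z) per joint per frame

def preprocess(video):
    """Run the four units in order and return the fitted joint data."""
    joints_2d = [estimate_2d_joints(f, detect_bbox(f)) for f in video]
    joints_3d = lift_to_3d(joints_2d)
    return joints_3d                    # fitting unit: stack into (T, J, 3)

video = [np.zeros((480, 640, 3)) for _ in range(8)]
print(preprocess(video).shape)  # → (8, 17, 3)
```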
The first coordinate unit is specifically configured to:
inputting the human body image within the bounding box of each frame of image into a spatial transformer network to obtain a preprocessed human body image of each frame of image;
inputting the preprocessed human body image of each frame of image into a single-person pose estimation network, and extracting the skeleton joint points of the human body to be detected in each frame of image;
performing confidence comparison on all candidate skeleton joint points at each human body joint position in each frame of image to obtain the skeleton joint point with the maximum confidence at each human body joint position in each frame of image;
and fitting all the skeleton joint points with the maximum confidence in each frame of image to obtain the two-dimensional joint point coordinate data of each frame of image.
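The confidence comparison described above — keeping, for each joint position, the candidate skeleton joint point with the highest confidence — can be sketched as follows; the data layout (candidate poses stacked along the first axis) is an assumption:

```python
import numpy as np

def select_best_joints(candidates, confidences):
    """Pick the highest-confidence candidate for every joint position.

    candidates:  (K, J, 2) — K candidate poses, J joints, (x, y) coords.
    confidences: (K, J)    — confidence of each candidate joint.
    Returns (J, 2) coordinates assembled from the winning candidates.
    """
    best = np.argmax(confidences, axis=0)  # (J,) winning pose per joint
    return candidates[best, np.arange(candidates.shape[1])]

cands = np.array([[[0, 0], [1, 1]],
                  [[9, 9], [2, 2]]], dtype=float)  # K=2 poses, J=2 joints
conf = np.array([[0.9, 0.1],
                 [0.4, 0.8]])
print(select_best_joints(cands, conf).tolist())  # → [[0.0, 0.0], [2.0, 2.0]]
```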
Preferably, the first fall identification model comprises n space-time graph convolutional layers and a first classifier, n being an integer greater than 1;
when i is equal to 1, the input data of the i-th space-time graph convolutional layer is the joint data; when i is greater than 1, the input data of the i-th space-time graph convolutional layer is the output data of the (i-1)-th space-time graph convolutional layer, and the input data of the first classifier is the output data of the n-th space-time graph convolutional layer;
each space-time graph convolutional layer comprises a graph convolution network used for processing the spatial structure information of the skeleton joint points of the human body to be detected and a temporal convolution network used for processing time-dimension features;
the first classifier is used for performing action classification on the output data of the n-th space-time graph convolutional layer to obtain the first fall probability matrix.
Preferably, the second fall identification model comprises a deep residual network, an attention module, a convolutional long short-term memory (ConvLSTM) network, and a second classifier;
the deep residual network is used for extracting features from each frame of image in the video data and outputting the image features of each frame of image;
the attention module is used for processing the output data of the deep residual network and outputting an attention heat map of each frame of image;
the ConvLSTM network is used for extracting time-sequence features from the attention heat map of each frame of image;
and the second classifier is used for performing action classification on the output data of the ConvLSTM network to obtain the second fall probability matrix.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program; wherein the computer program, when running, controls an apparatus on which the computer-readable storage medium is located to execute the method for detecting a human fall according to any of the above embodiments.
An embodiment of the present invention further provides a terminal device. Fig. 4 is a schematic structural diagram of a terminal device provided in an embodiment of the present invention. The terminal device includes a processor 10, a memory 20, and a computer program stored in the memory 20 and configured to be executed by the processor 10; when executing the computer program, the processor 10 implements the human body fall detection method according to any of the above embodiments.
Preferably, the computer program may be divided into one or more modules/units (e.g., computer program 1, computer program 2, … …) that are stored in the memory 20 and executed by the processor 10 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used for describing the execution process of the computer program in the terminal device.
The processor 10 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor 10 may be any conventional processor. The processor 10 is the control center of the terminal device and connects the various parts of the terminal device through various interfaces and lines.
The memory 20 mainly includes a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store related data and the like. In addition, the memory 20 may be a high-speed random access memory or a non-volatile memory, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, and the like, or the memory 20 may be another volatile solid-state memory device.
It should be noted that the terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural diagram of fig. 4 is only an example of the terminal device and does not constitute a limitation of the terminal device, which may include more or fewer components than those shown, combine some components, or have different components.
To sum up, the embodiments of the present invention provide a human body fall detection method and device, a storage medium, and a terminal device. The method includes: receiving video data, and performing skeleton joint point extraction on the human body to be detected in the video data to obtain joint data; inputting the joint data into a first fall recognition model trained in advance to obtain a first fall probability matrix; inputting the video data into a second fall recognition model trained in advance to obtain a second fall probability matrix; and finally, performing mean processing on the first fall probability matrix and the second fall probability matrix to obtain a fall recognition result of the human body to be detected. The method fuses the contour information and color information of the video image data with the skeleton data information, so that the neural networks learn rich action features, thereby improving the accuracy of fall recognition.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for detecting a human fall, comprising:
receiving video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
extracting skeleton joint points of the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
inputting the joint data into a first fall recognition model trained in advance to obtain a first fall probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
inputting the video data into a second fall recognition model trained in advance to obtain a second fall probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and carrying out mean value processing on the first falling probability matrix and the second falling probability matrix to obtain a falling identification result of the human body to be detected.
2. A method for detecting a human fall as claimed in claim 1, wherein the extracting of skeleton joint points from the human body to be detected in the video data to obtain joint data specifically comprises:
inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, and obtaining a bounding box containing the human body to be detected in each frame of image;
inputting the human body image within the bounding box of each frame of image into a single-person pose estimation network, and extracting the skeleton joint points of the human body to be detected in each frame of image to obtain the two-dimensional joint point coordinate data of each frame of image;
inputting the two-dimensional joint point coordinate data of each frame of image into a pre-trained network model composed of a convolutional neural network and a temporal convolutional network to obtain the three-dimensional joint point coordinate data of each frame of image;
and fitting the three-dimensional coordinate data of the joint points of each frame of image to obtain joint data.
3. A method for detecting a human body fall as claimed in claim 2, wherein the inputting of the human body image within the bounding box of each frame of image into a single-person pose estimation network and the extracting of the skeleton joint points of the human body to be detected in each frame of image, so as to obtain the two-dimensional joint point coordinate data of each frame of image, specifically comprises:
inputting the human body image within the bounding box of each frame of image into a spatial transformer network to obtain a preprocessed human body image of each frame of image;
inputting the preprocessed human body image of each frame of image into a single-person pose estimation network, and extracting the skeleton joint points of the human body to be detected in each frame of image;
performing confidence comparison on all candidate skeleton joint points at each human body joint position in each frame of image to obtain the skeleton joint point with the maximum confidence at each human body joint position in each frame of image;
and fitting all the skeleton joint points with the maximum confidence in each frame of image to obtain the two-dimensional joint point coordinate data of each frame of image.
4. A method of fall detection as claimed in claim 1, wherein the first fall identification model comprises n space-time graph convolutional layers and a first classifier, n being an integer greater than 1;
when i is equal to 1, the input data of the i-th space-time graph convolutional layer is the joint data; when i is greater than 1, the input data of the i-th space-time graph convolutional layer is the output data of the (i-1)-th space-time graph convolutional layer, and the input data of the first classifier is the output data of the n-th space-time graph convolutional layer;
each space-time graph convolutional layer comprises a graph convolution network used for processing the spatial structure information of the skeleton joint points of the human body to be detected and a temporal convolution network used for processing time-dimension features;
the first classifier is used for performing action classification on the output data of the n-th space-time graph convolutional layer to obtain the first fall probability matrix.
5. A method for detecting a human fall as claimed in claim 4, wherein the feature output f_out(v_ti) of the i-th skeletal joint point of the t-th frame of image processed by the graph convolution network is specifically:

f_out(v_ti) = Σ_j [ Ā_ij / √(Λ_ii Λ_jj) ] f_in(v_tj) ω

wherein f_in(v_tj) is the feature of the j-th skeletal joint point in the t-th frame of image, Ā is the adjacency matrix of the skeleton joint points of the t-th frame of image with self-loops added, Ā_ij is its element corresponding to the i-th and j-th skeletal joint points, Λ is the degree matrix of Ā, with Λ_ii and Λ_jj the degrees of the i-th and j-th skeletal joint points respectively, and ω is a parameter of the graph convolution network.
6. A method of detecting a personal fall as claimed in claim 1, wherein the second fall identification model comprises a deep residual network, an attention module, a convolutional long short-term memory (ConvLSTM) network, and a second classifier;
the deep residual network is used for extracting features from each frame of image in the video data and outputting the image features of each frame of image;
the attention module is used for processing the output data of the deep residual network and outputting an attention heat map of each frame of image;
the ConvLSTM network is used for extracting time-sequence features from the attention heat map of each frame of image;
and the second classifier is used for performing action classification on the output data of the ConvLSTM network to obtain the second fall probability matrix.
7. A device for detecting a fall of a human body, comprising:
the data receiving module is used for receiving video data; the video data consists of a plurality of frames of images containing human bodies to be detected;
the data preprocessing module is used for extracting skeleton joint points of the human body to be detected in the video data to obtain joint data; the joint data comprise joint point three-dimensional coordinate data of the human body to be detected in each frame of the image;
the first falling identification module is used for inputting the joint data into a first falling identification model trained in advance to carry out falling identification so as to obtain a first falling probability matrix; wherein the first fall identification model is a space-time graph convolutional network for fall identification; the first falling probability matrix comprises a first probability belonging to falling actions corresponding to each frame of the image;
the second falling identification module is used for inputting the video data into a second falling identification model for falling identification to obtain a second falling probability matrix; wherein the second fall identification model is a graph attention network for fall identification; the second falling probability matrix comprises a second probability belonging to falling actions corresponding to each frame of the image;
and the data operation module is used for carrying out mean value processing on the first falling probability matrix and the second falling probability matrix to obtain a falling identification result of the human body to be detected.
8. A human fall detection apparatus as claimed in claim 7, wherein the data preprocessing module specifically comprises:
the bounding box detection unit is used for inputting the video data into a pre-trained convolutional neural network to detect the position of the human body to be detected in each frame of image, so as to obtain a bounding box containing the human body to be detected in each frame of image;
the first coordinate unit is used for inputting the human body image within the bounding box of each frame of image into a single-person pose estimation network and extracting the skeleton joint points of the human body to be detected in each frame of image, so as to obtain the two-dimensional joint point coordinate data of each frame of image;
the second coordinate unit is used for inputting the two-dimensional joint point coordinate data of each frame of image into a pre-trained network model composed of a convolutional neural network and a temporal convolutional network, so as to obtain the three-dimensional joint point coordinate data of each frame of image;
and the fitting unit is used for fitting the three-dimensional coordinate data of the joint points of each frame of image to obtain joint data.
9. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method for detecting a human fall according to any one of claims 1 to 6.
10. A terminal device comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method of detecting a human fall according to any one of claims 1 to 6 when executing the computer program.
CN202110960372.9A 2021-08-20 2021-08-20 Human body falling detection method and device, storage medium and terminal equipment Withdrawn CN113837005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110960372.9A CN113837005A (en) 2021-08-20 2021-08-20 Human body falling detection method and device, storage medium and terminal equipment


Publications (1)

Publication Number Publication Date
CN113837005A true CN113837005A (en) 2021-12-24

Family

ID=78961021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110960372.9A Withdrawn CN113837005A (en) 2021-08-20 2021-08-20 Human body falling detection method and device, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113837005A (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074227A1 (en) * 2016-11-09 2020-03-05 Microsoft Technology Licensing, Llc Neural network-based action detection
CN111310707A (en) * 2020-02-28 2020-06-19 山东大学 Skeleton-based method and system for recognizing attention network actions
CN111539941A (en) * 2020-04-27 2020-08-14 上海交通大学 Parkinson's disease leg flexibility task evaluation method and system, storage medium and terminal
CN112052816A (en) * 2020-09-15 2020-12-08 山东大学 Human behavior prediction method and system based on adaptive graph convolution countermeasure network
CN112131908A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 Action identification method and device based on double-flow network, storage medium and equipment
CN112233222A (en) * 2020-09-29 2021-01-15 深圳市易尚展示股份有限公司 Human body parametric three-dimensional model deformation method based on neural network joint point estimation
WO2021114892A1 (en) * 2020-05-29 2021-06-17 平安科技(深圳)有限公司 Environmental semantic understanding-based body movement recognition method, apparatus, device, and storage medium
CN113111865A (en) * 2021-05-13 2021-07-13 广东工业大学 Fall behavior detection method and system based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YI CAO ET AL.: "Skeleton-based action recognition with temporal action graph and temporal adaptive graph convolution structure", Multimedia Tools and Applications, no. 80, p. 29139.
孙于成 (Sun Yucheng): "Table tennis basic technique action recognition based on spatio-temporal graph convolution", China Master's Theses Full-text Database, Social Sciences II, no. 12, pp. 12-15.
李炫烨 et al. (Li Xuanye et al.): "Human action recognition method combining multi-attention mechanisms and spatio-temporal graph convolutional networks", Journal of Computer-Aided Design & Computer Graphics, vol. 33, no. 7, pp. 1055-1063.

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114229646A (en) * 2021-12-28 2022-03-25 苏州汇川控制技术有限公司 Elevator control method, elevator and elevator detection system
CN114229646B (en) * 2021-12-28 2024-03-22 苏州汇川控制技术有限公司 Elevator control method, elevator and elevator detection system
CN116386087A (en) * 2023-03-31 2023-07-04 阿里巴巴(中国)有限公司 Target object processing method and device
CN116386087B (en) * 2023-03-31 2024-01-09 阿里巴巴(中国)有限公司 Target object processing method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211224