CN113963442A - Fall-down behavior identification method based on comprehensive body state features - Google Patents


Info

Publication number
CN113963442A
Authority
CN
China
Prior art keywords
human body
fall
height
skeleton
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111243199.7A
Other languages
Chinese (zh)
Inventor
雷亮
尹衍伟
李小兵
梁明辉
和圆圆
秦兰瑶
张文萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Science and Technology
Original Assignee
Chongqing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Science and Technology filed Critical Chongqing University of Science and Technology
Priority to CN202111243199.7A priority Critical patent/CN113963442A/en
Publication of CN113963442A publication Critical patent/CN113963442A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fall behavior identification method based on comprehensive body-state features, comprising the following steps: video data acquisition → extraction of a human skeleton image → preprocessing of the skeleton image data → construction of a circumscribed rectangle of the binary image → LSTM network unit processing → calculation of the effective area of the human body → calculation of the width-height ratio → calculation of the distance from the human body's center of mass to the ground → calculation of the height change rate → occurrence of a fall. On the basis of accurately extracted human skeleton information, the method combines multiple posture features of the human body to judge fall behavior comprehensively, which improves detection accuracy. The lightweight OpenPose keeps the data volume small, and an improved gradient descent method based on SGD (stochastic gradient descent) accelerates model convergence while preserving accuracy, so that the trained optimal model can conveniently be deployed to edge devices and applied in homes, public places, and similar settings.

Description

Fall-down behavior identification method based on comprehensive body state features
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a falling behavior identification method based on comprehensive body state features.
Background
The rapid aging of the population has heightened concern for the safety and health of the elderly. Factors threatening their health include not only internal factors such as disease and psychology but, more importantly, injuries from falls caused by a combination of internal and external factors. The elderly are the group most endangered, and at times killed, by falls: research indicates that about 20% of elderly people in China are seriously injured after a fall, and even among elderly people in very good everyday health the rate of serious injury after a fall reaches 17.7%. Especially in the scenario of elderly people living alone, how to accurately detect fall behavior in real time and raise an alarm has become one of the subjects of urgent research by scholars at home and abroad.
A fall is itself a rather complex activity under the dual influence of internal and external factors. The internal factors are mainly reflected in visually similar actions that are easily misjudged, such as squatting versus falling or lying down versus falling; the external factors arise because the body is easily affected by complex conditions in the application scene, such as illumination, shadow, and occlusion by furniture. Both reduce detection accuracy, which makes the choice of fall-detection method particularly important. Common approaches are based on wearable devices, environmental sensors, or computer vision. Wearable detection methods often cause serious discomfort to users, and heavy wearable devices are especially inconvenient for the elderly; environmental-sensor methods usually require expensive detection equipment installed on floors or walls, and suffer from poor flexibility and portability and a high false-detection rate. With the rapid development of deep learning, attention has naturally turned to computer vision, which avoids the drawbacks of the other two approaches and offers low cost, rapid improvement and maintenance, strong portability, and good flexibility. The present invention therefore provides a fall behavior identification method based on comprehensive body-state features.
Disclosure of Invention
The object of the invention is to provide a fall behavior identification method based on comprehensive body-state features that addresses the problems described above.
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention provides a falling behavior identification method based on a posture comprehensive characteristic, which comprises the following steps: video data acquisition → extraction of a human body skeleton image → preprocessing of skeleton image data → construction of a binary image circumscribed rectangle → LSTM network unit processing → calculation of human body effective area → calculation of aspect ratio → calculation of distance from human body center of mass to ground → calculation of height change rate → occurrence of a fall, wherein:
S1, video data acquisition: acquiring a video sequence from a video file or a camera and initializing it;
S2, extracting a human skeleton diagram: detecting the human body target in the key frame images provided by the video sequence read in S1, and extracting human skeleton information for the subsequent steps;
S3, skeleton image data preprocessing: eliminating irrelevant information from the extracted human skeleton, enhancing the detectability of the relevant information, and simplifying the data as far as possible;
S4, constructing a circumscribed rectangle of the binary image: converting the preprocessed human skeleton point information into a binary image;
S5, LSTM network unit processing: classification training on the images;
S6, calculating the effective area of the human body: the first morphological feature for fall detection;
S7, calculating the width-height ratio: the second morphological feature for fall detection;
S8, calculating the distance from the human body's center of mass to the ground: the third morphological feature for fall detection;
S9, calculating the height change rate: the fourth morphological feature for fall detection;
S10, occurrence of a fall: the actions of the elderly are analyzed and processed through S1-S5, and whether a fall has occurred is judged through S6-S9.
Preferably, in S1, video data collected from home video surveillance is used as the sample input, and the real-time surveillance video stream is sampled at a preset time interval using a key-frame extraction technique to obtain video frames that can be fed to the deep learning model for processing.
Preferably, in S2, the human skeleton is extracted using OpenPose: the Caffe model of OpenPose is called through OpenCV's dnn module, and the output is then post-processed. OpenPose is used in its 18-joint output mode. S2 proceeds as follows: (1) network structure analysis; (2) extraction of all key points; (3) grouping of the key points, where the relation between key points is computed from the part affinity field; the affinity score between two candidate joints d_j1 and d_j2 is the line integral

E = ∫_0^1 L_c(p(u)) · (d_j2 − d_j1) / ||d_j2 − d_j1||_2 du, with p(u) = (1 − u)·d_j1 + u·d_j2;

(4) separate storage of the resulting groupings.
Preferably, S3 uses a normalization method.
Preferably, the method of S4 is to convert the human skeleton point information into a binary image and then frame the processed skeleton binary image with the minimum circumscribed rectangle, found using an OpenCV function. The main steps are: (1) mathematical basis: given a point (x1, y1) in the plane, the point (x2, y2) obtained by rotating it counterclockwise by an angle θ around (x0, y0) is

x2 = (x1 − x0)·cosθ − (y1 − y0)·sinθ + x0
y2 = (x1 − x0)·sinθ + (y1 − y0)·cosθ + y0;

(2) rotate the original polygon in a loop over 90 degrees, with an angular step chosen to balance precision against computational efficiency; after each rotation step, compute a simple (axis-aligned) circumscribed rectangle of the polygon and record its area, vertex coordinates, and the degrees rotated; (3) compare all simple circumscribed rectangles obtained during rotation, take the one with the smallest area, and record its vertex coordinates and rotation angle; (4) rotate this smallest simple circumscribed rectangle back by the same angle in the direction opposite to the rotation in step (3) to obtain the minimum circumscribed rectangle.
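The point-rotation step above can be sketched in Python. This is only an illustration of the stated formula, not the patent's implementation; the function name is hypothetical.

```python
import math

def rotate_point(x1, y1, x0, y0, theta):
    """Rotate (x1, y1) counterclockwise by theta radians around (x0, y0)."""
    x2 = (x1 - x0) * math.cos(theta) - (y1 - y0) * math.sin(theta) + x0
    y2 = (x1 - x0) * math.sin(theta) + (y1 - y0) * math.cos(theta) + y0
    return x2, y2
```

For example, rotating (1, 0) by 90° around the origin yields (0, 1), confirming the counterclockwise convention.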
Preferably, in S5 the output at time t is merged into the input at time t + 1, so that fall behavior recognition integrates the temporal and spatial dimensions; this also alleviates the vanishing- and exploding-gradient problems caused by gradual shrinkage during gradient back-propagation, and the LSTM's learned output provides a reference for setting the decision thresholds.
Preferably, in S6 the effective area of the human body is calculated from the binarized minimum-circumscribed-rectangle image: after binarization, the effective area of the human body equals the number of pixels with value 1 inside the minimum circumscribed rectangle, while the total area of the rectangle is the count of all pixels (values 1 and 0). The specific formula is

effective area ratio = N1 / (N1 + N0),

where N1 and N0 are the numbers of pixels with value 1 and value 0 inside the rectangle.
preferably, the human body width-height ratio in S7 is the width-height ratio corresponding to the minimum circumscribed rectangular frame, and when the human body is walking or standing normally, the human body width-height ratio is much less than 1, and when the human body falls down or squats, the width-height ratio is much greater than 1.
Preferably, in S8 the centroid of the human body is the diagonal intersection O(X0, Y0) of the minimum circumscribed rectangle, and the distance from the centroid to the bottom edge reflects the motion state of the human body well. With the two bottom-edge endpoints of the circumscribed rectangle A(X1, Y1) and B(X2, Y2), let a = Y1 − Y2, b = X2 − X1, c = X1·Y2 − X2·Y1; the distance h from the centroid O to the bottom edge AB is

h = |a·X0 + b·Y0 + c| / sqrt(a² + b²).
preferably, in S9, the feature of height change rate is introduced, and the height is measured by a ratio of a real-time height to an average height of a human body in a video, where height represents the real-time height of the human body, height average represents the average height of the human body, and height average is defined as follows:
Figure BDA0003320192310000052
the invention has the beneficial effects that:
First, human skeleton information is extracted with OpenPose and the circumscribed rectangle of the skeleton is drawn with OpenCV. Human posture estimation is strengthened by fusing multi-modal features such as the width-height ratio, Hu-moment features, effective-area ratio, centroid change rate, contour eccentricity, and height change rate. Fall behavior is identified across the temporal-spatial and multi-modal feature dimensions by a long short-term memory (LSTM) neural network: several LSTM classifiers jointly process the temporal and spatial features, the extracted skeleton information is fed into multiple branch-loop judgments, the autoregressive LSTM with its temporal loop structure describes the time-space-associated sequence data, and combining the changes between preceding and following video frames reduces both the false-judgment and missed-judgment rates, achieving real-time and accurate fall detection. On the basis of accurately extracted skeleton information, the method combines multiple posture features to judge fall behavior comprehensively, improving detection accuracy; the lightweight OpenPose keeps the data volume small, and an improved gradient descent method based on SGD (stochastic gradient descent) accelerates model convergence while preserving accuracy, so that the trained optimal model can conveniently be deployed to edge devices and applied in homes, public places, and similar settings.
Drawings
FIG. 1 is an inventive flow chart;
FIG. 2 is a diagram of a human skeleton according to the present invention;
FIG. 3 is a network structure analysis diagram in the step of extracting the skeleton diagram of the human body in the invention.
Detailed Description
The following further describes embodiments of the present invention with reference to examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Referring to fig. 1 to 3, a method for recognizing a falling behavior based on a comprehensive body state feature includes the following steps: video data acquisition → extraction of a human body skeleton image → preprocessing of skeleton image data → construction of a binary image circumscribed rectangle → LSTM network unit processing → calculation of human body effective area → calculation of aspect ratio → calculation of distance from human body center of mass to ground → calculation of height change rate → occurrence of a fall, wherein:
S1, video data acquisition: acquiring a video sequence from a video file or a camera and initializing it;
S2, extracting a human skeleton diagram: detecting the human body target in the key frame images provided by the video sequence read in S1, and extracting human skeleton information for the subsequent steps;
S3, skeleton image data preprocessing: eliminating irrelevant information from the extracted human skeleton, enhancing the detectability of the relevant information, and simplifying the data as far as possible;
S4, constructing a circumscribed rectangle of the binary image: converting the preprocessed human skeleton point information into a binary image;
S5, LSTM network unit processing: classification training on the images;
S6, calculating the effective area of the human body: the first morphological feature for fall detection;
S7, calculating the width-height ratio: the second morphological feature for fall detection;
S8, calculating the distance from the human body's center of mass to the ground: the third morphological feature for fall detection;
S9, calculating the height change rate: the fourth morphological feature for fall detection;
S10, occurrence of a fall: the actions of the elderly are analyzed and processed through S1-S5, and whether a fall has occurred is judged through S6-S9.
In this embodiment, in S1 the video data collected from home video surveillance is used as the sample input. The video data collection module is connected to the live home video surveillance; real-time video is collected through increasingly popular home monitoring, and the real-time surveillance video stream is sampled at a preset time interval using a key-frame extraction technique to obtain video frames that can be fed to the deep learning model for processing.
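In practice the frames would be read with OpenCV's cv2.VideoCapture; the sampling arithmetic alone can be sketched as follows. This is a minimal illustration with a hypothetical function name, assuming a fixed frame rate.

```python
def keyframe_indices(fps, interval_s, total_frames):
    """Indices of the frames sampled every interval_s seconds from a
    stream running at fps frames per second."""
    step = max(1, int(round(fps * interval_s)))  # at least one frame apart
    return list(range(0, total_frames, step))
```

For a 25 fps stream sampled every 2 seconds, this grabs frames 0, 50, 100, and so on.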
In this embodiment, in S2 OpenPose is used to extract the human skeleton: the Caffe model is called directly through OpenCV's dnn module, and the output is then post-processed. The key point of the implementation is analyzing the affinity field, which expresses how closely joints are associated. OpenPose offers two output modes for the skeleton joints, with 18 or 25 joint points; the main difference between them is the foot joints, which appear to have no bearing on falls, so the invention adopts the 18-joint mode. The specific process is as follows: (1) network structure analysis: as shown in FIG. 3, the two concatenated layers carry 19 and 38 feature maps across two branches. The first branch outputs 19 feature maps, representing the 18 human key points plus the background; the second branch outputs 38 feature maps representing the part affinity fields (PAF), i.e., the associations between joints. (2) The limb key points are then extracted from the 18 key-point feature maps: the key-point code is called to extract the position and corresponding confidence of the current joint, traversing the feature map of each joint until the positions and confidences of all joints are extracted. (3) Once all key points are extracted, they must be grouped, since which key point belongs to which person has not yet been determined; at this point the relations between key points are computed from the affinity field. For example, with three people in the frame, the affinity field linking one person's forearm key points differs from those of the other two people, so key points are assigned to the correct person purely by their affinity values.
The affinity field is evaluated by directly connecting the two key points with a line segment and accumulating the field values along that line; the score is the line integral

E = ∫_0^1 L_c(p(u)) · (d_j2 − d_j1) / ||d_j2 − d_j1||_2 du, with p(u) = (1 − u)·d_j1 + u·d_j2,

where d_j1 and d_j2 are the positions of the two candidate joints and L_c is the affinity field of limb c.
(4) Finally, the groupings are stored separately. The principle is simple: edges with a connection are placed in a set, and each next node that can connect to the previous node is appended at the corresponding position.
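The line-integral association score described above can be approximated numerically by sampling points along the segment between two joint candidates and averaging the projection of the field onto the segment direction. This is a hedged sketch, not the patent's code; the function name and sample count are illustrative.

```python
import numpy as np

def paf_score(paf_x, paf_y, j1, j2, n_samples=10):
    """Approximate the part-affinity line integral between joint candidates
    j1 and j2.  paf_x, paf_y: 2-D affinity field maps (indexed [y, x]);
    j1, j2: (x, y) positions."""
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    norm = np.linalg.norm(v)
    if norm < 1e-8:
        return 0.0
    v = v / norm                         # unit direction of the limb
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        p = j1 + u * (j2 - j1)           # sample point along the segment
        x, y = int(round(p[0])), int(round(p[1]))
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]
    return score / n_samples
```

A field aligned with the segment yields a score near 1; a perpendicular or absent field yields a score near 0, which is what separates correct from incorrect joint pairings.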
In this embodiment, S3 performs mean-subtraction normalization on the video key-frame images one by one, ensuring that features can be extracted from the images well. Since the extracted skeleton information cannot be applied directly without coordinate normalization and a framed target, image data preprocessing is essential.
In this embodiment, S4 converts the preprocessed human skeleton point information into a binary image: the skeleton becomes white pixels and the irrelevant background becomes black pixels, and the processed skeleton binary image is then framed with the minimum circumscribed rectangle. The minimum-circumscribed-rectangle problem is: given the vertices of a convex polygon, find the circumscribing rectangle of minimum area; in practice this realizes human detection → skeleton information extraction → framing of the circumscribed rectangle. The circumscribed rectangle is found with an OpenCV function; the main steps are: (1) implement the algorithm that rotates a point in the plane around a fixed point by a given angle. The mathematical basis: given a point (x1, y1) in the plane, the point (x2, y2) obtained by rotating it counterclockwise by an angle θ around (x0, y0) is

x2 = (x1 − x0)·cosθ − (y1 − y0)·sinθ + x0
y2 = (x1 − x0)·sinθ + (y1 − y0)·cosθ + y0
(2) Rotate the original polygon in a loop, with the angular step adjustable to balance precision against computational efficiency; after each rotation step, compute a simple (axis-aligned) circumscribed rectangle of the polygon and record its area, vertex coordinates, and the degrees rotated.
(3) Compare all simple circumscribed rectangles obtained during rotation, take the one with the smallest area, and record its vertex coordinates and rotation angle.
(4) Rotate the circumscribed rectangle back: rotate the smallest simple circumscribed rectangle from the previous step by the same angle in the direction opposite to step (3) to obtain the minimum circumscribed rectangle.
This module converts the preprocessed human skeleton point information into a binary image, which both separates foreground from background more cleanly and reduces the dimensionality of the data. The processed skeleton binary image is then framed with the minimum circumscribed rectangle. Together these steps realize the human detection, skeleton extraction, and preprocessing functions, laying a good foundation for the posture recognition that follows.
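The rotate-and-compare procedure in steps (1)-(4) can be sketched in pure Python. In practice OpenCV's cv2.minAreaRect performs this search directly; the brute-force version below is only an illustration of the described algorithm, with hypothetical names.

```python
import math

def min_area_rect(points, step_deg=1.0):
    """Brute-force minimum-area bounding rectangle: rotate the point set in
    steps of step_deg over 90 degrees, take the axis-aligned bounding box at
    each step, and keep the smallest.  Returns (area, angle_deg, (w, h))."""
    best = None
    deg = 0.0
    while deg < 90.0:
        t = math.radians(deg)
        # rotate every point by -deg is equivalent to rotating the box by +deg
        xs = [x * math.cos(t) - y * math.sin(t) for x, y in points]
        ys = [x * math.sin(t) + y * math.cos(t) for x, y in points]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        if best is None or w * h < best[0]:
            best = (w * h, deg, (w, h))
        deg += step_deg
    return best
```

A smaller step_deg trades speed for precision, exactly the balance mentioned in step (2).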
In this embodiment, the LSTM network in S5 is a special kind of RNN capable of learning long-term dependencies. In a fall-detection scene, human behavior is closely tied to its timestamp, and the behavior of the previous second provides a reference for predicting the next second, so the invention introduces LSTM network units to process the temporal and spatial features. The LSTM avoids the long-term-dependency problem by design; retaining long-term information is its default behavior rather than an ability acquired at great cost. All RNNs take the form of a chain of repeating neural-network modules. In a standard RNN, the repeated module has a very simple structure, such as a single tanh layer. The LSTM shares this chain structure, but its repeated module is different: instead of a single neural-network layer there are four, interacting in a very particular way (in the usual diagram, merging lines denote concatenation of vectors, and forking lines denote content that is copied and routed to different destinations). Because fall behavior is strongly time-dependent, the LSTM module not only identifies falls by integrating the temporal and spatial dimensions but also resolves the vanishing- and exploding-gradient problems caused by gradual shrinkage during back-propagation, which afflict the standard RNN, making it more usable and faster to train. In addition, within a layer the output of each unit feeds the next unit, and the output of each layer feeds the next layer, connecting the temporal relations and the contextual information across preceding and following video frames.
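A single LSTM cell step, with its input, forget, and output gates, can be sketched in NumPy to make the four interacting layers explicit. This is a generic textbook LSTM cell, not the patent's trained network; the stacked weight layout is an assumption for illustration.

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell step.  W: (4H, D), U: (4H, H), b: (4H,), stacked in the
    order input gate, forget gate, output gate, candidate."""
    z = W @ x + U @ h_prev + b
    H = h_prev.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))   # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * H:3 * H]))  # output gate
    g = np.tanh(z[3 * H:])                  # candidate cell state
    c = f * c_prev + i * g                  # additive cell update
    h = o * np.tanh(c)                      # hidden state fed to time t+1
    return h, c
```

The additive update of c is what lets gradients flow across many timesteps, mitigating the vanishing-gradient problem mentioned above.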
In this embodiment, in S6 the effective area of the human body is calculated from the binarized minimum-circumscribed-rectangle image: after binarization, the effective area equals the number of pixels with value 1 inside the minimum circumscribed rectangle, while the total area of the rectangle is the count of all pixels (values 1 and 0). The specific formula is

effective area ratio = N1 / (N1 + N0),

where N1 and N0 are the numbers of pixels with value 1 and value 0 inside the rectangle.
For example, when an elderly person does stretching exercise, the minimum circumscribed rectangle grows as the limbs extend, so the effective area ratio is much smaller than when the person actually falls, which helps distinguish the two.
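The effective-area ratio reduces to a pixel count over the binarized crop of the minimum circumscribed rectangle; a minimal NumPy sketch (illustrative helper name, not the patent's code):

```python
import numpy as np

def effective_area_ratio(binary_roi):
    """Fraction of 1-pixels inside the minimum-bounding-rectangle crop,
    i.e. N1 / (N1 + N0)."""
    total = binary_roi.size
    return float(np.count_nonzero(binary_roi)) / total if total else 0.0
```

Since the denominator is simply the crop size, no separate count of 0-pixels is needed.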
In this embodiment, the width-height ratio in S7 is the ratio corresponding to the minimum circumscribed rectangle and is the second body-state feature used for fall detection: when the human body walks or stands normally, the width-height ratio is far less than 1, and when the body falls or squats it is far greater than 1, making it the most important feature module for posture identification.
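The width-height-ratio feature can be sketched as follows. The fixed threshold shown is purely illustrative: in the patent, the thresholds are referenced against the LSTM's learned output rather than hard-coded.

```python
def width_height_ratio(w, h):
    """Width-to-height ratio of the minimum circumscribed rectangle."""
    return w / h if h else float("inf")

def looks_fallen(w, h, threshold=1.0):
    """Illustrative rule only: a ratio above the threshold suggests a
    lying posture, a ratio below it an upright one."""
    return width_height_ratio(w, h) > threshold
```

A standing person's box might be 40x160 (ratio 0.25), a fallen person's 160x40 (ratio 4.0).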
In this embodiment, in S8 the centroid of the human body is the diagonal intersection O(X0, Y0) of the minimum circumscribed rectangle, and the distance from the centroid to the bottom edge reflects the motion state of the human body at that moment well. With the bottom-edge endpoints of the circumscribed rectangle A(X1, Y1) and B(X2, Y2), let a = Y1 − Y2, b = X2 − X1, c = X1·Y2 − X2·Y1; the distance h from the centroid O to the bottom edge AB is

h = |a·X0 + b·Y0 + c| / sqrt(a² + b²).
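The centroid-to-ground distance is the standard point-to-line distance. A sketch, using the conventional signs of the line coefficients (which make the line equation a·x + b·y + c = 0 pass exactly through both endpoints); the function name is illustrative:

```python
import math

def centroid_ground_distance(x0, y0, x1, y1, x2, y2):
    """Distance from the centroid O = (x0, y0) to the line through the
    bottom-edge endpoints A = (x1, y1) and B = (x2, y2)."""
    a = y1 - y2
    b = x2 - x1
    c = x1 * y2 - x2 * y1
    return abs(a * x0 + b * y0 + c) / math.hypot(a, b)
```

During a fall this distance drops rapidly toward zero as the centroid approaches the bottom edge of the rectangle.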
in this embodiment, before and after the fall action occurs in S9, the height of the human body is constantly changed, but the most easily neglected point is that the height change is still very obvious in many non-fall actions such as squat, stooping, lying down, creeping and the like, so that the feature of the height change rate is introduced, and the height change rate is measured by using the ratio of the real-time height to the average height of the human body in the video. height represents the real-time height of the human body, height average represents the average height of the human body, and height average is defined as follows:
Figure BDA0003320192310000112
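The height-change-rate feature can be sketched as a ratio against an average over recent frames. This is an illustrative helper; the choice of averaging window is an assumption not specified here.

```python
def height_change_rate(heights, current_height):
    """Ratio of the current bounding-box height to the average height
    observed over the given frames (height / height_average)."""
    avg = sum(heights) / len(heights)
    return current_height / avg
```

A value near 1 indicates an upright posture consistent with the recent past; a sudden drop well below 1 within a few frames is the signature of a fall, while slow actions such as sitting down lower the ratio gradually.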
because the position of the camera has non-fixity and the size of the human body at different positions of the camera can be changed, the invention introduces an image invariance characteristic-Hu moment characteristic. The low-order moment value of the Hu moment has definite geometric properties, wherein the zero-order moment m0 can represent the total area of a target, the first-order moment values m10 and m01 can represent the size of a circumscribed outline of the target, and the second-order moments m20, m11 and m02 are also called moments of inertia and represent the size and the direction of an image. The Hu moment translation invariance is realized by a mode of a centroid coordinate. Here, the first four Hu moments of the falling behaviors (M1-M4) are obviously different from the first four Hu moments of the squat, bending and sitting behaviors, so that the falling behaviors can be used as one of the characteristics for representing the four behaviors.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (10)

1. A falling behavior identification method based on a posture comprehensive characteristic is characterized in that: the method comprises the following steps: video data acquisition → extraction of a human body skeleton image → preprocessing of skeleton image data → construction of a binary image circumscribed rectangle → LSTM network unit processing → calculation of human body effective area → calculation of aspect ratio → calculation of distance from human body center of mass to ground → calculation of height change rate → occurrence of a fall, wherein:
S1, video data acquisition: acquiring a video sequence from a video file or a camera and initializing it;
S2, extracting a human skeleton diagram: detecting the human body target in the key frame images provided by the video sequence read in S1, and extracting human skeleton information for the subsequent steps;
S3, skeleton image data preprocessing: eliminating irrelevant information from the extracted human skeleton, enhancing the detectability of the relevant information, and simplifying the data as far as possible;
S4, constructing a circumscribed rectangle of the binary image: converting the preprocessed human skeleton point information into a binary image;
S5, LSTM network unit processing: classification training on the images;
S6, calculating the effective area of the human body: the first morphological feature for fall detection;
S7, calculating the width-height ratio: the second morphological feature for fall detection;
S8, calculating the distance from the human body's center of mass to the ground: the third morphological feature for fall detection;
S9, calculating the height change rate: the fourth morphological feature for fall detection;
S10, occurrence of a fall: the actions of the elderly are analyzed and processed through S1-S5, and whether a fall has occurred is judged through S6-S9.
2. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: in S1, video data collected by a household video surveillance camera is used as sample input, and a key-frame extraction technique samples the real-time surveillance video stream at a preset time interval to obtain video frames that can be fed into the deep learning model for processing.
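A minimal sketch of the key-frame sampling in S1 (the frame rate and interval values below are illustrative assumptions, not taken from the patent): the indices of the frames to extract follow directly from the stream's fps and the preset time interval.

```python
def keyframe_indices(total_frames: int, fps: float, interval_s: float) -> list:
    """Indices of frames sampled every `interval_s` seconds from a stream."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))

# e.g. a 10-second clip at 25 fps, sampled every 2 seconds
print(keyframe_indices(250, 25.0, 2.0))  # → [0, 50, 100, 150, 200]
```

With `cv2.VideoCapture` one would then seek to each index (e.g. via `CAP_PROP_POS_FRAMES`) and read that frame for the model.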
3. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: in S2, the human skeleton is extracted with OpenPose; the Caffe model of OpenPose is invoked through the dnn module of OpenCV and its output is post-processed, OpenPose adopting the 18-joint output format; S2 comprises the following steps: (1) network structure analysis; (2) extraction of all key points; (3) discrimination of the key points, where the relation between key points is computed from the part affinity field, whose calculation formula is:
E = ∫₀¹ L_c(p(u)) · (d_j2 − d_j1) / ‖d_j2 − d_j1‖₂ du, where p(u) = (1 − u)·d_j1 + u·d_j2 interpolates between the candidate joint positions d_j1 and d_j2, and L_c is the predicted part affinity field of limb c, per the part-affinity-field formulation of the cited Cao et al. reference;
and the distinguished key points are stored separately.
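A numpy sketch of the part-affinity-field association score in S2, under the assumption that the field is given as a discretized 2-channel unit-vector map (the grid size and joint positions below are illustrative, not from the patent):

```python
import numpy as np

def paf_score(paf, j1, j2, samples=10):
    """Approximate the line integral of the part affinity field `paf`
    (shape H x W x 2) along the segment from joint j1 to joint j2,
    both given as (x, y) pixel coordinates."""
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    d = j2 - j1
    norm = np.linalg.norm(d)
    if norm == 0:
        return 0.0
    u = d / norm                                   # unit limb direction
    total = 0.0
    for t in np.linspace(0.0, 1.0, samples):
        x, y = (j1 + t * d).round().astype(int)    # sample point p(u)
        total += paf[y, x] @ u                     # field vector . direction
    return total / samples

# Toy field pointing everywhere in +x: a horizontal limb scores ~1,
# a vertical limb (perpendicular to the field) scores ~0.
paf = np.zeros((20, 20, 2))
paf[..., 0] = 1.0
print(round(paf_score(paf, (2, 10), (15, 10)), 3))  # → 1.0
```

A joint pair whose connecting segment aligns with the predicted field receives a high score, which is how candidate key points are matched into one skeleton per person.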
4. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: S3 adopts a normalization method to preprocess the skeleton data.
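The claim does not specify which normalization is used; a common min-max scheme for skeleton key points (an assumption here) maps the coordinates into the unit square, so that subjects at different distances from the camera become comparable:

```python
import numpy as np

def normalize_keypoints(pts):
    """Min-max normalize an (N, 2) array of skeleton points into [0, 1]^2."""
    pts = np.asarray(pts, float)
    lo = pts.min(axis=0)
    span = pts.max(axis=0) - lo
    span[span == 0] = 1.0          # guard against a degenerate (flat) axis
    return (pts - lo) / span

# Three illustrative joints in pixel coordinates
pts = [(100, 50), (140, 50), (120, 210)]
print(normalize_keypoints(pts))
```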
5. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: the method of S4 is to convert the human skeleton point information into a binary image and then enclose the processed human skeleton binary image in a minimum circumscribed rectangle, the circumscribed rectangle being found with an OpenCV function; the main steps are as follows: (1) mathematical basis: given a point (x1, y1) in the plane, its image (x2, y2) after a counterclockwise rotation by an angle θ about (x0, y0) is
x2 = (x1 − x0)cosθ − (y1 − y0)sinθ + x0
y2 = (x1 − x0)sinθ + (y1 − y0)cosθ + y0;
(2) rotate the original polygon in a loop over the range −90° to 90°, a range that balances precision against computational efficiency, take the simple (axis-aligned) circumscribed rectangle of the polygon at each rotation angle, and record its area, vertex coordinates and rotation angle; (3) compare all the simple circumscribed rectangles obtained during the rotation, take the one with the smallest area, and record its vertex coordinates and rotation angle; (4) rotate this smallest simple circumscribed rectangle back by the same angle in the direction opposite to the rotation of step (2) to obtain the minimum circumscribed rectangle.
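Steps (1)-(4) can be sketched as a brute-force rotation search (in practice `cv2.minAreaRect` performs this job; the step size and test rectangle below are illustrative assumptions):

```python
import numpy as np

def min_area_rect(points, step_deg=1.0):
    """Brute-force minimum-area bounding rectangle: rotate the point set
    over [-90, 90) degrees and track the smallest axis-aligned bounding
    box, returning (area, rotation angle in degrees)."""
    pts = np.asarray(points, float)
    best = (np.inf, 0.0)
    for deg in np.arange(-90.0, 90.0, step_deg):
        t = np.radians(deg)
        rot = np.array([[np.cos(t), -np.sin(t)],
                        [np.sin(t),  np.cos(t)]])
        r = pts @ rot.T                      # rotate all points by t
        w, h = r.max(axis=0) - r.min(axis=0)
        if w * h < best[0]:
            best = (w * h, deg)
    return best

# A 4x2 rectangle tilted by 30 degrees: the search should recover area ~8
t = np.radians(30.0)
rect = np.array([[0, 0], [4, 0], [4, 2], [0, 2]], float)
tilted = rect @ np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]]).T
area, angle = min_area_rect(tilted)
print(round(area, 2))  # → 8.0
```

Step (4) of the claim corresponds to rotating the winning box back by `-angle` to obtain its corners in the original image frame.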
6. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: in S5, the output at time t is fused into the computation at time t+1, so that fall behavior can be identified jointly across the temporal and spatial dimensions; this also alleviates the vanishing and exploding gradient problems caused by gradients shrinking progressively during back-propagation, and the trained LSTM provides a reference for setting the detection threshold.
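An illustrative single LSTM cell step in numpy (a deep-learning framework would be used in practice; the sizes and random weights are assumptions) showing how the hidden state at time t enters the gate computations at time t+1:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell step: h_prev (the output at time t) is fused into
    the gate computations for the input x at time t+1."""
    z = W @ x + U @ h_prev + b            # stacked gate pre-activations
    n = len(h_prev)
    i, f, o = (1 / (1 + np.exp(-z[k*n:(k+1)*n])) for k in range(3))
    g = np.tanh(z[3*n:])                  # candidate cell state
    c = f * c_prev + i * g                # gated cell update
    h = o * np.tanh(c)                    # new hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = rng.normal(size=(4 * n_hid, n_in)) * 0.1
U = rng.normal(size=(4 * n_hid, n_hid)) * 0.1
b = np.zeros(4 * n_hid)

h = c = np.zeros(n_hid)
for frame_feat in rng.normal(size=(5, n_in)):   # 5 frames of pose features
    h, c = lstm_step(frame_feat, h, c, W, U, b)
print(h.shape)  # → (3,)
```

The additive cell-state update `c = f*c_prev + i*g` is what lets gradients flow across many frames without vanishing, which is the property the claim relies on.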
7. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: in S6, the effective human body area is calculated from the binarized image inside the minimum circumscribed rectangle; after binarization, the effective human body area equals the number of pixels with value 1 inside the minimum circumscribed rectangle, while the total area of the rectangle is the total number of pixels with value 1 or 0; the specific formula is:
S_human = Σ_(x,y)∈rect f(x, y), with effective area ratio R = S_human / (W × H), where f(x, y) ∈ {0, 1} is the binarized pixel value and W, H are the width and height of the minimum circumscribed rectangle.
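The pixel-counting in S6 reduces to one line over the binarized crop of the minimum bounding rectangle (the toy crop below is an illustration):

```python
import numpy as np

def effective_area_ratio(binary_rect):
    """Effective human body area over total rectangle area, where
    `binary_rect` is the binarized crop of the minimum bounding rectangle
    (1 = body pixel, 0 = background)."""
    binary_rect = np.asarray(binary_rect)
    return binary_rect.sum() / binary_rect.size

crop = np.zeros((4, 5), dtype=np.uint8)
crop[1:3, 1:4] = 1                  # 6 "body" pixels out of 20
print(effective_area_ratio(crop))   # → 0.3
```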
8. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: the human body aspect ratio in S7 is the width-to-height ratio of the minimum circumscribed rectangle; when the human body walks normally or stands, this ratio is far less than 1, and when the human body falls down or squats, the ratio is far greater than 1.
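The aspect-ratio feature of S7 in code (the rectangle sizes are illustrative assumptions, not values from the patent):

```python
def aspect_ratio(width: float, height: float) -> float:
    """Width-to-height ratio of the minimum bounding rectangle."""
    return width / height

standing = aspect_ratio(60, 170)    # tall, narrow box: ratio well below 1
fallen = aspect_ratio(170, 60)      # wide, flat box: ratio well above 1
print(standing < 1 < fallen)  # → True
```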
9. A fall behavior recognition method based on comprehensive body state features as claimed in claim 7, wherein: in S8, the human body centroid is the diagonal intersection O(x0, y0) of the minimum circumscribed rectangle, and the distance from the centroid to the bottom edge reflects the motion state of the human body at that moment; let A(x1, y1) and B(x2, y2) be the two end points of the bottom edge of the circumscribed rectangle; then a = y1 − y2, b = x2 − x1 and c = x1·y2 − x2·y1 define the line ax + by + c = 0 through A and B, and the distance h from the centroid O to the bottom edge AB is given by the formula:
h = |a·x0 + b·y0 + c| / √(a² + b²)
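The centroid-to-bottom-edge distance of S8 is the standard point-to-line distance (the coordinates in the example are illustrative):

```python
import math

def centroid_to_bottom(x0, y0, x1, y1, x2, y2):
    """Distance from the rectangle centroid O(x0, y0) to the bottom edge
    through A(x1, y1) and B(x2, y2), via the line ax + by + c = 0."""
    a = y1 - y2
    b = x2 - x1
    c = x1 * y2 - x2 * y1
    return abs(a * x0 + b * y0 + c) / math.hypot(a, b)

# Axis-aligned box with bottom edge on y = 0 and centroid at height 5
print(centroid_to_bottom(3, 5, 0, 0, 6, 0))  # → 5.0
```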
10. A fall behavior recognition method based on comprehensive body state features as claimed in claim 1, wherein: the height change rate feature is introduced in S9 and is measured by the ratio of the real-time height of the human body in the video to the average human body height, where height denotes the real-time height of the human body and height_average denotes the average human body height, defined as follows:
height_average = (1/N) · Σ_(i=1..N) height_i, with height change rate = height / height_average, where height_i is the human body height measured in the i-th of N video frames.
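The height-change-rate feature of S9 in code (the per-frame heights below are illustrative pixel values, not data from the patent); a ratio well below 1 indicates a sudden height drop characteristic of a fall:

```python
def height_change_rate(heights, current):
    """Ratio of the current body height to the average height over the
    previous frames; a value well below 1 suggests a fall."""
    avg = sum(heights) / len(heights)
    return current / avg

history = [168, 170, 172, 170]      # illustrative per-frame heights, pixels
print(round(height_change_rate(history, 85), 2))  # → 0.5
```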
CN202111243199.7A 2021-10-25 2021-10-25 Fall-down behavior identification method based on comprehensive body state features Pending CN113963442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111243199.7A CN113963442A (en) 2021-10-25 2021-10-25 Fall-down behavior identification method based on comprehensive body state features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111243199.7A CN113963442A (en) 2021-10-25 2021-10-25 Fall-down behavior identification method based on comprehensive body state features

Publications (1)

Publication Number Publication Date
CN113963442A true CN113963442A (en) 2022-01-21

Family

ID=79466894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111243199.7A Pending CN113963442A (en) 2021-10-25 2021-10-25 Fall-down behavior identification method based on comprehensive body state features

Country Status (1)

Country Link
CN (1) CN113963442A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729876A (en) * 2017-11-09 2018-02-23 重庆医科大学 Fall detection method in old man room based on computer vision
CN109820515A (en) * 2019-03-01 2019-05-31 中南大学 The method of more sensing fall detections on TensorFlow platform based on LSTM neural network
CN110490171A (en) * 2019-08-26 2019-11-22 睿云联(厦门)网络通讯技术有限公司 A kind of danger gesture recognition method, device, computer equipment and storage medium
CN110555394A (en) * 2019-08-19 2019-12-10 西安理工大学 Fall risk assessment method based on human body shape characteristics
CN111161320A (en) * 2019-12-30 2020-05-15 浙江大华技术股份有限公司 Target tracking method, target tracking device and computer readable medium
CN111274954A (en) * 2020-01-20 2020-06-12 河北工业大学 Embedded platform real-time falling detection method based on improved attitude estimation algorithm
CN113469116A (en) * 2021-07-20 2021-10-01 河南科技大学 Face expression recognition method combining LBP (local binary pattern) features and lightweight neural network


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ZHE CAO et al.: "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", Computer Vision and Pattern Recognition, 24 November 2016, page 10 *
小灰灰超: "Minimum circumscribed rectangle", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/97855964> *
YIN ZHICHENG et al.: "Research on fall detection technology based on OpenPose", Journal of Changchun University of Science and Technology (Natural Science Edition), vol. 44, no. 3, 15 June 2021, pages 1-2 *
PENG SHUAI; HUANG HONGBO; CHEN WEIJUN; HU ZHIYING; YUAN MINGYANG: "A survey of human pose estimation algorithms based on convolutional neural networks", Journal of Beijing Information Science and Technology University (Natural Science Edition), no. 03, 15 June 2020, pages 62-69 *
PENG YUQING; GAO QINGQING; LIU NANNAN; SONG CHUBAI; ZHANG YUANYUAN: "Fall behavior recognition based on multi-feature fusion", Journal of Data Acquisition and Processing, no. 05, 15 September 2016, pages 50-62 *
GAO MIAO et al.: "Design of a fall detection algorithm based on multi-feature analysis", Journal of Shanghai Normal University (Natural Science Edition), vol. 47, no. 2, 15 April 2018, pages 1-2 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463788A (en) * 2022-04-12 2022-05-10 深圳市爱深盈通信息技术有限公司 Fall detection method, system, computer equipment and storage medium
CN114973403A (en) * 2022-05-06 2022-08-30 广州紫为云科技有限公司 Efficient behavior prediction method based on space-time dual-dimension feature depth network
CN114973403B (en) * 2022-05-06 2023-11-03 广州紫为云科技有限公司 Behavior prediction method based on space-time double-dimension feature depth network
CN115376237A (en) * 2022-07-28 2022-11-22 北京航天名翰网络科技有限公司 Non-contact self-service registration system
CN115376237B (en) * 2022-07-28 2024-02-20 北京航天名翰网络科技有限公司 Non-contact self-service reporting system
CN117017276A (en) * 2023-10-08 2023-11-10 中国科学技术大学 Real-time human body tight boundary detection method based on millimeter wave radar
CN117017276B (en) * 2023-10-08 2024-01-12 中国科学技术大学 Real-time human body tight boundary detection method based on millimeter wave radar
CN117132949A (en) * 2023-10-27 2023-11-28 长春理工大学 All-weather fall detection method based on deep learning
CN117132949B (en) * 2023-10-27 2024-02-09 长春理工大学 All-weather fall detection method based on deep learning

Similar Documents

Publication Publication Date Title
CN113963442A (en) Fall-down behavior identification method based on comprehensive body state features
Rougier et al. Fall detection from human shape and motion history using video surveillance
Feng et al. Floor pressure imaging for fall detection with fiber-optic sensors
Kwolek et al. Improving fall detection by the use of depth sensor and accelerometer
Bobick et al. The recognition of human movement using temporal templates
Derpanis et al. Efficient action spotting based on a spacetime oriented structure representation
CN111368810A (en) Sit-up detection system and method based on human body and skeleton key point identification
US9183431B2 (en) Apparatus and method for providing activity recognition based application service
Arivazhagan et al. Human action recognition from RGB-D data using complete local binary pattern
CN113239797B (en) Human body action recognition method, device and system
Chang et al. The model-based human body motion analysis system
Rani et al. Kinematic joint descriptor and depth motion descriptor with convolutional neural networks for human action recognition
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
CN111783702A (en) Efficient pedestrian tumble detection method based on image enhancement algorithm and human body key point positioning
Cardile et al. A vision-based system for elderly patients monitoring
Yan et al. Human-object interaction recognition using multitask neural network
CN109784230A (en) A kind of facial video image quality optimization method, system and equipment
Zhang et al. Visual surveillance for human fall detection in healthcare IoT
Geng Research on athlete’s action recognition based on acceleration sensor and deep learning
CN114724251A (en) Old people behavior identification method based on skeleton sequence under infrared video
Nouredanesh et al. Chasing feet in the wild: a proposed egocentric motion-aware gait assessment tool
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model
Zhang Applications of Google MediaPipe Pose Estimation Using a Single Camera
Shu et al. The research and implementation of human posture recognition algorithm via OpenPose
Li et al. Application of thermal infrared imagery in human action recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination