CN111274954B - Embedded platform real-time fall detection method based on an improved pose estimation algorithm


Info

Publication number
CN111274954B
Authority
CN
China
Prior art keywords
joint
human body
acceleration
joint point
network
Prior art date
Legal status
Expired - Fee Related
Application number
CN202010062574.7A
Other languages
Chinese (zh)
Other versions
CN111274954A (en)
Inventor
郭欣
王红豆
孙连浩
Current Assignee
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010062574.7A
Publication of CN111274954A
Application granted
Publication of CN111274954B
Expired - Fee Related

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/23 - Recognition of whole body movements, e.g. for sport training
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 - Measuring for diagnostic purposes; Identification of persons
    • A61B5/103 - Detecting, measuring or recording devices for testing the shape, pattern, colour, size or movement of the body or parts thereof, for diagnostic purposes
    • A61B5/11 - Measuring movement of the entire body or parts thereof, e.g. head or hand tremor, mobility of a limb
    • A61B5/1116 - Determining posture transitions
    • A61B5/1117 - Fall detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 - Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Abstract

The invention relates to an embedded-platform real-time fall detection method based on an improved pose estimation algorithm. A pose estimation network is built from depthwise separable convolutions, an attention mechanism and inverted residual structures, and is applied to fall detection; this further improves the accuracy of the pose estimation network while greatly reducing its parameter count and computation. The distance of each human joint between different video frames is calculated to track each person, the acceleration of the joints is computed from consecutive frames, and whether a fall has occurred is judged from the acceleration, the relative positions of the joints and related cues. The network is therefore better suited to deployment on an embedded platform and achieves real-time performance when deployed on a TX2 embedded platform. The method tracks people using the multi-person joint coordinates and skeleton information obtained from consecutive frames; multi-person tracking makes the pose estimation more stable and better handles fall detection in multi-person scenes.

Description

Embedded platform real-time fall detection method based on an improved pose estimation algorithm
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an embedded platform real-time fall detection method based on an improved pose estimation algorithm.
Background
With improved medical conditions and rising living standards, average life expectancy has increased markedly and population aging has accelerated. An aging population brings many social problems, and reducing harm to the elderly from accidents is an urgent issue. According to a WHO report, an estimated 646,000 people die each year from falls that are not treated in time, with the middle-aged and elderly accounting for the largest share. Among current fall detection methods for the elderly, those based on wearable devices are highly constrained and costly. Computer-vision fall detection based on pose estimation does not require the elderly to carry equipment, but the network of a deep-learning pose estimation algorithm has some forty layers and a model size reaching 200 MB, demands a large amount of computation during forward inference, and places high requirements on deployment hardware. Patent CN201910631143.5 discloses a "fall detection method based on human posture estimation", which extracts human joints from video frames and judges whether a person has fallen from the angle formed by the hip joints, the center point of the knee joints and the neck joint; it neither tracks multiple people nor addresses the oversized pose estimation model. Patent CN201910477816.6 discloses a method for detecting falls of escalator passengers based on pose estimation, which extracts seven feature points of the human body, computes their variance in the vertical direction, and judges falls from that variance; in practice the number of people on an escalator is not limited to two, and since per-frame detection is unreliable when estimating the poses of several people at once and the method does not track multiple people, it cannot judge the fall behavior of multiple people well. The "elderly fall detection algorithm based on joint point extraction" from Nanjing University of Aeronautics and Astronautics uses the YOLO algorithm to detect the position of the human body, the OpenPose algorithm to obtain joint information, and an SVM classifier to classify human behavior; its fall detection accuracy is high in a fixed single-person scene but drops greatly for multiple people.
At present, fall detection algorithms based on pose estimation have not been optimized: they run slowly when deployed on an embedded platform and do not track people in multi-person scenes, so existing algorithms have a low fall recognition rate in such scenes and leave considerable room for improvement.
Disclosure of Invention
Aimed at the defects of current pose estimation algorithms in actual deployment, namely oversized models, too many parameters and the absence of human tracking, the invention provides an embedded-platform real-time fall detection method based on an improved pose estimation algorithm. The method mainly comprises: building a pose estimation network from depthwise separable convolutions, an attention mechanism and inverted residual structures and applying it to fall detection, which further improves the network's accuracy while greatly reducing its parameter count and computation; tracking each person by calculating the distance of each human joint between different video frames; computing joint accelerations from consecutive frames; and judging whether a fall has occurred from the acceleration, the relative positions of the joints and related cues.
The technical solution for achieving the purpose of the invention is as follows:
An embedded platform real-time fall detection method based on an improved pose estimation algorithm comprises the following steps:
Step one: building a pose estimation network using a lightweight structure
1-1. Building the feature extraction network: the feature extraction part of the OpenPose algorithm is improved; the feature extraction network is built using depthwise separable convolutions and inverted residual structures, and an attention mechanism is introduced:
(1) Structure of the basic module: the module comprises one depthwise separable convolution and two 1x1 convolutions and uses an inverted residual structure. The input of the basic module is split into two branches: the first branch expands the number of channels with a 1x1 convolution, applies a 3x3 depthwise separable convolution, and then reduces the number of channels with a 1x1 convolution; the second branch adds the input feature map of the basic module to the output feature map of the first branch, the sum forming the output of the basic module;
(2) Building the feature extraction network from the basic module of step (1), with the following structure: a picture of size 432 × 368 is used as input; a plain 3 × 3 convolution is applied first, and 9 basic modules of step (1) are then connected in sequence to form the feature extraction part of the pose estimation network; the output of the last basic module is superposed with the output of the sixth basic module along the channel dimension to form the output of the feature extraction network. Some of the 9 basic modules contain a channel attention module placed after the depthwise separable convolution, which assigns weights to the channels of the feature map at that point, i.e. judges the importance of the different channel feature maps;
1-2. Building the pose estimation network: the feature extraction network of step 1-1 produces feature maps of dimensions 54x46x120, which are fed into the first stage. Each stage contains two branches; each branch first passes through five 3x3 depth-convolution structures, each comprising a 3x3 depthwise separable convolution and a 1x1 convolution, and then through two 1x1 convolutions; the final output channel counts of the two branches of each stage are 19 and 38 respectively. The input of the next stage is the channel superposition of this stage's output with the feature map output by the feature extraction network, and there are five stages in total. In the output, the 19-channel feature maps each predict one part of the human body, 18 parts in total plus one background feature map, while the 38-channel output represents the vector maps of the connections between human joints. Except for the final stage, the 19-channel and 38-channel outputs of each stage are fused with the output feature map of the feature extraction network and then used as the input of the next stage;
1-3. Loss function and joint matching: the networks built in steps 1-1 and 1-2 are trained as a whole. The human joint loss function is the difference between the joint output feature map of the pose estimation network and the labeled positions of the data set, and the joint connection loss function is the difference between the joint-connection output feature map of the pose estimation network and the labeled positions of the data set; an L2 loss is used for each stage of step 1-2, and the overall loss is the sum of the losses of all parts. The detected joints of multiple people are assigned with the Hungarian matching algorithm to obtain the joint coordinates and confidence information of each person;
Step two: human pose tracking
Train the improved pose estimation network to obtain a pose estimation model, use the model to estimate the human pose in video frame pictures to obtain the joint coordinates of each person, and calculate the distance of each joint of the same person between different frames. The coordinate matrix of the j-th joint of the m-th person is L_{j,m} = (x_{j,m}, y_{j,m}, c_{j,m}), where x_{j,m} and y_{j,m} are the coordinates of the human joint and c_{j,m} is the confidence that it is a joint. The coordinate matrix of the m-th person is P_m = (L_{1,m}, L_{2,m}, ..., L_{18,m}). The average of the sums of the Euclidean distances of the corresponding joints of different people in adjacent frames is calculated; the pair with the minimum distance, provided it is below a threshold, is the same person;
step three: fall behavior detection
Track the people in different frames using the method of step two, and detect human falls from the coordinate changes of the joints of the same person between consecutive frames, the angle between the joint connecting line and the horizontal line, and the width-to-height ratio;
Step four: deploy the pose estimation model on a TX2 embedded platform, perform pose estimation on video frames, track the poses of different people, and perform fall detection in real time.
When the feature extraction network is built, channel attention modules are added only to the fourth, fifth and sixth of the 9 basic modules; the activation function used in the seventh, eighth and ninth basic modules is h-swish, and the other basic modules use the relu6 activation function.
The specific process of fall behavior detection in the third step is as follows:
3-1. Calculating joint acceleration: the accelerations of the joints close to the center point of the human body (the hip, neck, shoulder and knee joints) are calculated from the coordinate changes of the human joints between consecutive frames, and a fall is detected from the motion direction and acceleration of the hip, shoulder, neck or knee joints, which reflect how violent the body's motion is. From the joint coordinates of step two, the joint acceleration is

a = √((x_t − x_{t−1})² + (y_t − y_{t−1})²) / Δt²

where a is the acceleration of the joint motion; (x_{t−1}, y_{t−1}) is the position of the joint at the previous moment; (x_t, y_t) is the position of the joint at the current moment; and Δt is the interval between the two frames;
3-2. Calculating the relative positions of different joints of the same person, the angle between the joint connecting line and the horizontal line, and the aspect ratio: after the acceleration of the joints near the body's center point has been judged, whether the person has fallen is further determined from the relative positions of different joints, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the aspect ratio, namely:
3-2-1. Judge the intensity of the human motion from the joint acceleration: the larger the acceleration, the more violent the current motion. When the acceleration is below the intensity threshold, the person is motionless or moving slowly, and the acceleration of the next frame continues to be calculated; when the acceleration exceeds the intensity threshold, the person is in a state of violent motion: detection continues for 80 frames while the acceleration is calculated and counted, and the method proceeds to step 3-2-2;
3-2-2. After counting for a period in the violent-motion state, judge whether the acceleration is below the periodic threshold. If so, continue calculating the acceleration at the next moment and accumulate the count; once the accumulated count exceeds 8, proceed to step 3-2-3. If the acceleration is above the periodic threshold, a large periodic or sustained acceleration has been detected, the behavior is judged to be violent motion other than a fall, and the method returns to the initial acceleration calculation;
3-2-3. Squatting and sitting behaviors are then excluded according to the relative positions of the joints after the body comes to rest, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the change in the width-to-height ratio; the posture of the body is finally determined and it is judged whether the person is in a fallen state.
The detected neck joint of the person is taken as the origin; the direction parallel to the upper and lower edges of the video frame is set as the X direction, and the direction parallel to the left and right edges as the Y direction. The X-direction and Y-direction differences between the neck joint and the hip joints are calculated, together with the angle between the line from the neck joint to the center of the two hip joints and the X direction, and a human body detection box is calibrated from the detected coordinates of all joints. Whether the person has fallen is then judged by testing four conditions simultaneously: the X-direction difference between the neck joint and the hip center is greater than 2/3 of the length of the line joining them; the Y-direction difference is less than 1/3 of that line; the angle between the joint line and the X direction is not within [45°, 135°]; and the width-to-height ratio of the body is greater than 1:1.
Compared with the prior art, the invention has the beneficial effects that:
The method builds the network from depthwise separable convolutions and inverted residual structures, reducing the parameter count and computation of the pose estimation network several-fold, while a channel attention mechanism compensates for the accuracy lost to the reduced parameter count. Used together, these make the pose estimation network well suited to deployment on an embedded platform, where it achieves real-time performance on a TX2 embedded platform. In addition, among the several basic modules, an intermediate feature map of the feature extraction network is superposed channel-wise with its final feature map during feature extraction, so that applying different receptive fields extracts the features better.
The method uses the coordinates of the human body joint points of a plurality of people and skeleton information obtained from the front frame and the back frame to track the human body, the posture estimation is more stable due to the tracking of the plurality of people, and the falling detection problem under the scene of the plurality of people can be better solved; in the aspect of falling detection, video frame images are continuously read, the accelerated speeds of different joint points of the same person in front and back frames are calculated, the accelerated speeds of the joint points close to the central point of the human body are used for judging the intensity of the motion of the human body, the motion state and the static state (slow motion state) are distinguished, and according to the motion characteristic that the falling behavior is accelerated greatly and then is static, the relative position, the included angle and the human body width-height ratio information of the joint points are used for finally determining whether the human body falls or not.
Drawings
FIG. 1 is a schematic diagram of the structure of a basic module containing a channel attention module in the detection method of the present invention.
FIG. 2 is a schematic diagram of the structure of the channel attention module employed in the embodiments of the present invention.
FIG. 3 is a schematic structural diagram of the feature extraction network in the detection method of the present invention.
FIG. 4 is a schematic diagram of the depth convolution structure.
FIG. 5 is a schematic diagram of the branch network structure.
FIG. 6 shows the improved overall network structure in the detection method of the present invention.
FIG. 7 is a flow chart of fall detection.
FIG. 8 shows the effect of single-person and multi-person pose estimation.
FIG. 9 shows the detection of the falling process in the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention provides an embedded-platform real-time fall detection method based on an improved pose estimation algorithm, comprising the following steps:
Step one: building a pose estimation network using a lightweight structure
1-1. Building the feature extraction network: the use of VGG19 to extract features in the OpenPose algorithm makes the parameter count of the feature extraction part huge, so the feature extraction network is built using depthwise separable convolutions and inverted residual structures, and an attention mechanism is introduced:
(1) Structure of the basic module: replacing plain convolutions with a depthwise separable convolution plus 1x1 convolutions reduces the parameter count by a factor of 7 to 8. The network is rebuilt with depthwise separable convolutions, completely changing the network structure of the original OpenPose algorithm, and the inverted residual structure prevents vanishing gradients. The basic module is structured as follows: it comprises one depthwise separable convolution and two 1x1 convolutions; the input of the basic module is split into two branches; the first branch increases the number of channels with a 1x1 convolution, applies a 3x3 depthwise separable convolution, and then decreases the number of channels with a 1x1 convolution; the second branch adds the input feature map of the basic module to the output feature map of the first branch, the sum forming the output of the basic module. In this application a basic module has two 1 × 1 convolutions: the first expands the number of channels of the feature map in order to extract more features, while the second reduces the number of channels in order to cut the parameter count.
(2) Building the feature extraction network from the basic module of step (1), with the following structure: a picture of size 432 × 368 is used as input; a plain 3 × 3 convolution is applied first, and 9 basic modules of step (1) are then connected in sequence to form the feature extraction part of the pose estimation network. The feature extraction part and the subsequent pose estimation part form one large network for training, the feature extraction part being the backbone; the overall structure is shown in Fig. 6. The output of the last basic module is superposed with the output of the sixth basic module along the channel dimension (the feature map channels between the basic modules are concatenated) to form the feature extraction network, so that receptive fields of different sizes extract the features better. Some of the 9 basic modules contain a channel attention module placed after the depthwise separable convolution, which assigns weights to the channels of the feature map at that point, i.e. judges the importance of the different channel feature maps.
1-2. Building the pose estimation network: the feature extraction network of step 1-1 produces feature maps of dimensions 54x46x120, which are fed into the first stage. Each stage contains two branches; each branch first passes through five 3x3 depth-convolution structures, each comprising a 3x3 depthwise separable convolution and a 1x1 convolution, and then through two 1x1 convolutions; the final output channel counts of the two branches of each stage are 19 and 38 respectively. The input of the next stage is the channel fusion of this stage's output feature maps with the feature map output by the feature extraction network, and there are five stages in total. In the output, the 19-channel feature maps each predict one part of the human body, 18 parts in total plus one background feature map, i.e. they predict the positions of the human joints, while the 38-channel output represents the vector maps of the joint connections of the body parts, including magnitude and direction; the numbers 19 and 38 are determined by the number of joints in the data set. Except for the final stage, the 19-channel and 38-channel outputs of each stage are channel-superposed with the output feature map of the feature extraction network and then serve as the input of the next stage.
1-3. Loss function and joint matching: the networks built in steps 1-1 and 1-2 are trained; the human joint loss function is the difference between the joint output feature map of the pose estimation network and the labeled positions of the data set, and the joint connection loss function is the difference between the joint-connection output feature map of the pose estimation network and the labeled positions of the data set; an L2 loss is used for each stage of step 1-2, and the overall loss is the sum of the losses of all parts. The detected joints of multiple people are assigned with the Hungarian matching algorithm to obtain the joint coordinates and confidence information of each person;
Step two: human pose tracking
Train the improved pose estimation network on the PC side to obtain a pose estimation model, use the model to estimate the human pose in video frame pictures to obtain the joint coordinates of each person, and calculate the distance of each joint of the same person between different frames. The coordinate matrix of the j-th joint of the m-th person is L_{j,m} = (x_{j,m}, y_{j,m}, c_{j,m}), where x_{j,m} and y_{j,m} are the coordinates of the human joint and c_{j,m} is the confidence of the joint. The coordinate matrix of the m-th person is P_m = (L_{1,m}, L_{2,m}, ..., L_{18,m}). The average of the sums of the Euclidean distances of the corresponding joints of different people in adjacent frames is calculated, and the pair with the minimum distance, below the threshold, is the same person.
Step three: fall behavior detection
Track the people in different frames using the method of step two, so that human falls are detected from the coordinate changes of the joints of the same person between consecutive frames, the angle between the joint connecting line and the horizontal line, the aspect ratio, and so on.
3-1. Calculating joint acceleration: the accelerations of the joints close to the center point of the human body (the hip, neck, shoulder and knee joints) are calculated from the coordinate changes of the human joints between consecutive frames; because the body's center point moves downward rapidly during a fall, the fall is detected from the motion direction and acceleration of the hip, shoulder, neck or knee joints (chiefly the hip joints). The accelerations of these joints reflect how violent the body's motion is. From the joint coordinates of step two, the acceleration is

a = √((x_t − x_{t−1})² + (y_t − y_{t−1})²) / Δt²

where a is the acceleration of the joint motion; (x_{t−1}, y_{t−1}) is the position of the joint at the previous moment; (x_t, y_t) is the position of the joint at the current moment; and Δt is the interval between the two frames.
3-2. Calculating the relative positions of different joints of the same person, the angle between the joint connecting line and the horizontal line, and the aspect ratio: after the acceleration of the joints near the body's center point has been judged, whether the person has fallen is further determined from the relative positions of different joints, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the aspect ratio.
3-3. The fall detection process comprises:
(1) Judging the intensity of the human motion from the joint accelerations: the larger the acceleration, the more violent the current motion; when the acceleration is below the threshold, states of no motion or slow motion, such as standing and walking, can be excluded.
(2) For states of strenuous exercise (running, jumping, falling, etc.): a fall differs from other movements in that the other processes repeat periodically, whereas falling is a one-off behavior; thus, when a large periodic or sustained acceleration is detected, the behavior can be judged to be something else, such as running.
(3) Squatting, sitting and similar behaviors are excluded according to the relative positions of the joints after the body comes to rest, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the change in the width-to-height ratio; the posture is finally determined and it is judged whether the person is in a fallen state.
Step four: the method comprises the steps of deploying an attitude estimation model on a TX2 embedded platform, carrying out attitude estimation on video frames, carrying out attitude tracking on different people, calculating the acceleration of joint points, judging the falling behavior of a human body in an auxiliary mode by using the relative positions, included angles and aspect ratios of the joint points, and carrying out real-time falling detection.
Examples
The embedded-platform real-time fall detection method based on the improved pose estimation algorithm comprises the following steps:
Step one: building a pose estimation network using a lightweight structure
1-1. Building the feature extraction network: the feature extraction network is built using depthwise separable convolutions and inverted residual structures, and an attention mechanism is introduced:
(1) Structure of the basic module: as shown in Fig. 1, the basic module is structured as follows: the input of the basic module is split into two branches; the first branch first expands the number of channels with a 1 × 1 convolution, then applies a 3 × 3 depthwise separable convolution, and then reduces the number of channels with a 1 × 1 convolution; the second branch directly adds the input of the basic module to the output feature map of the first branch, the sum forming the output of the basic module.
A channel attention module may be added to the basic module; placed after the 3 × 3 depthwise separable convolution, it is used to assign weights to the channels of the feature map. The structure of the channel attention module is shown in Fig. 2 and is as follows: the F_sq(·) function performs global pooling to reduce the feature map over the spatial dimensions:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)

where u_c is the c-th channel of the input feature map, H is the height of the feature map and W is its width.
Then the F_ex(·, w) function is used to learn the associations between the channels:

s = F_ex(z, w) = γ(g(z, w)) = γ(w₂ δ(w₁ z))

where δ denotes the ReLU function and γ the Sigmoid function, so that the resulting channel weights lie in [0, 1]; z is the input; w₁ ∈ R^{(C/r)×C} and w₂ ∈ R^{C×(C/r)} are the weights of the two fully connected layers, with C the number of channels of the feature map and r the channel reduction ratio. F_scale(·) is then used to re-weight the feature map:

F_scale(u_c, s_c) = s_c · u_c
where s_c is a scalar value of the importance of the channel. The h-swish activation function is used in part of the basic modules: the swish activation function performs better than relu6 but is more expensive to compute, and the h-swish function, which approximates swish, improves performance while reducing memory overhead, making it suitable for mobile devices. The h-swish activation function is:

h-swish(x) = x · ReLU6(x + 3) / 6
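For concreteness, the following is a minimal PyTorch sketch of one basic module as described above: a 1x1 expansion, a 3x3 depthwise convolution with an optional channel attention module, a 1x1 projection, and the inverted residual shortcut. The expansion factor of 4, the reduction ratio of 4 and all names are illustrative assumptions; the patent does not state them.

```python
import torch
import torch.nn as nn

class HSwish(nn.Module):
    """h-swish(x) = x * ReLU6(x + 3) / 6, a cheap approximation of swish."""
    def forward(self, x):
        return x * nn.functional.relu6(x + 3.0) / 6.0

class ChannelAttention(nn.Module):
    """Channel attention: global average pool (F_sq), two fully connected
    layers (F_ex), then per-channel re-weighting (F_scale)."""
    def __init__(self, channels, reduction=4):  # reduction ratio r is assumed
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # w1
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # w2
            nn.Sigmoid(),                                # gamma
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))           # F_sq: squeeze the spatial dimensions
        s = self.fc(s).view(b, c, 1, 1)  # F_ex: per-channel weights in [0, 1]
        return x * s                     # F_scale: re-weight the feature map

class BasicModule(nn.Module):
    """Inverted-residual basic module: 1x1 expand -> 3x3 depthwise
    (-> optional channel attention) -> 1x1 project, plus identity shortcut."""
    def __init__(self, channels, expansion=4, attention=False, hswish=False):
        super().__init__()
        hidden = channels * expansion  # expansion factor is assumed
        act = HSwish() if hswish else nn.ReLU6(inplace=True)
        layers = [
            nn.Conv2d(channels, hidden, 1, bias=False),  # expand channels
            nn.BatchNorm2d(hidden), act,
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,
                      bias=False),                       # 3x3 depthwise conv
            nn.BatchNorm2d(hidden),
        ]
        if attention:
            layers.append(ChannelAttention(hidden))      # after depthwise conv
        layers += [
            act,
            nn.Conv2d(hidden, channels, 1, bias=False),  # project back down
            nn.BatchNorm2d(channels),
        ]
        self.branch = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.branch(x)  # second branch: the residual shortcut
```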
(2) Building the feature extraction network from the basic module of step (1), with the following structure: a picture of size 432 × 368 is used as input; a plain 3 × 3 convolution is applied first, and the 9 basic modules of step (1) then extract the features of the picture; the output of the last basic module is superposed with the output of the sixth basic module along the channel dimension, so that receptive fields of different sizes extract the features better. In the feature extraction network, the channel attention module is added to the fourth, fifth and sixth basic modules; the activation function used in the seventh, eighth and ninth basic modules is h-swish, and the other modules use the relu6 activation function. The structure of the feature extraction network is shown in Fig. 3. Channel attention modules are added only at these positions in this embodiment because the channel attention module uses fully connected layers: adding too many attention modules does not improve accuracy much but instead increases the parameter count, which is unfavorable for embedded deployment.
1-2. Building the pose estimation network: the feature extraction network of step 1-1 produces feature maps of dimensions 54x46x120, which are fed into the first stage; each stage contains two branches, and each branch first passes through five 3x3 depth-convolution structures and then through two 1x1 convolutions. The depth convolution structure is shown in Fig. 4: it comprises a 3 × 3 depthwise separable convolution and a 1 × 1 convolution; the 3 × 3 depthwise separable convolution is followed by a batch normalization (BN) operation and a ReLU function, after which the 1 × 1 convolution is applied, again followed by batch normalization and a ReLU function. The branch network structure is shown in Fig. 5; the final output channel counts of the two branches of each stage are 19 and 38 respectively, and the input of the next stage is the channel superposition of this stage's output feature maps with the feature map output by the feature extraction network. There are five stages in total; in the output, the 19-channel feature maps each predict one part of the human body, 18 parts in total plus one background feature map, while the 38-channel output represents the vector maps of the joint connections of the body parts. Except for the final stage, the 19-channel and 38-channel outputs of each stage are fused with the output feature map of the feature extraction network and then serve as the input of the next stage. The overall structure of the first stage of the pose estimation network is shown in Fig. 6.
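A hedged PyTorch sketch of one such stage follows. The five depth-convolution structures, the two 1x1 convolutions and the 19/38 output channels follow the description above; the internal channel width of 120 and the layer names are assumptions.

```python
import torch
import torch.nn as nn

def depth_conv(cin, cout):
    """Depth convolution structure of Fig. 4: 3x3 depthwise conv and 1x1
    conv, each followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),
        nn.BatchNorm2d(cin), nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
    )

class Stage(nn.Module):
    """One stage: two branches, each with five depth-convolution structures
    and two 1x1 convolutions, outputting 19 heatmap and 38 vector-map
    channels respectively."""
    def __init__(self, cin, mid=120):  # internal width is an assumption
        super().__init__()
        def branch(cout):
            layers = [depth_conv(cin, mid)]
            layers += [depth_conv(mid, mid) for _ in range(4)]
            layers += [nn.Conv2d(mid, mid, 1), nn.ReLU(inplace=True),
                       nn.Conv2d(mid, cout, 1)]
            return nn.Sequential(*layers)
        self.joints = branch(19)       # 18 body parts + 1 background map
        self.connections = branch(38)  # joint-connection vector maps

    def forward(self, x):
        return self.joints(x), self.connections(x)

# The first stage consumes the 120-channel backbone feature map F; each
# later stage consumes F channel-superposed with the previous outputs:
# x = torch.cat([F, S_prev, L_prev], dim=1)  # 120 + 19 + 38 = 177 channels
```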
1-3. Loss function and joint matching: the network built in steps 1-1 and 1-2 is trained on the PC side. The input of each branch network in the pose estimation network is:

S^1 = ρ^1(F), L^1 = φ^1(F), for t = 1
S^t = ρ^t(F, S^{t−1}, L^{t−1}), L^t = φ^t(F, S^{t−1}, L^{t−1}), for t ≥ 2

where F is the output feature map of the feature extraction network; ρ and φ denote the successive 3x3 and 1x1 convolution operations; S is the human joint heatmap; and L is the vector map of the connection relationships between the human joints. The human joint loss function is the difference between the joint output feature map of the pose estimation network and the labeled positions of the data set:
f_S^t = Σ_{j=1}^{J} Σ_p W(p) · ||S_j^t(p) − S_j*(p)||₂²

where S_j*(p) is the ground-truth value of the human joint position and S_j^t(p) is the output of the pose estimation network; in the actual annotation, W(p) = 1 if the joint is labeled and W(p) = 0 if it is not; p is a pixel in the feature map; J is the number of feature maps, namely 19; and t indexes the stages of the pose estimation network, taking values 1 to 5.
The loss function of the joint-connection vector maps is the difference between the joint-connection output feature map of the pose estimation network and the connections of the labeled positions of the data set:

f_L^t = Σ_{c=1}^{C} Σ_p W(p) · ||L_c^t(p) − L_c*(p)||₂²

where L_c*(p) is the ground-truth value of the vector map of the human joint connecting lines and L_c^t(p) is the output of the different stages of the network; p is a pixel in the feature map; C is the number of feature maps, namely 38; and t indexes the stages of the pose estimation network, taking values 1 to 5. The overall loss is the sum of the losses of all parts:
f = Σ_{t=1}^{5} (f_S^t + f_L^t)
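A compact sketch of these masked L2 losses; representing the stage outputs as lists of tensors is an assumption about the surrounding training code.

```python
import torch

def stage_loss(pred, target, mask):
    """Masked L2 loss of one stage's output against the labeled maps;
    `mask` is W(p): 1 where the annotation exists, 0 elsewhere."""
    return ((pred - target) ** 2 * mask).sum()

def total_loss(joint_preds, conn_preds, joint_gt, conn_gt, mask):
    """Overall loss f: sum of the heatmap and vector-map L2 losses over
    all five stages (one list entry per stage)."""
    f = 0.0
    for s_t, l_t in zip(joint_preds, conn_preds):
        f = f + stage_loss(s_t, joint_gt, mask)  # f_S^t
        f = f + stage_loss(l_t, conn_gt, mask)   # f_L^t
    return f
```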
for each joint point of each person in the actual image, the representation form is an extreme point on the thermodynamic diagram, the confidence coefficient of the joint point is determined by a Gaussian function, and the value of a certain position available for the jth joint point of the mth person is as follows:
Figure BDA0002374959890000089
wherein x isj,mRefers to the position corresponding to the jth joint point of the mth person in the image; σ means: because each pixel point in the output characteristic diagram has a value and is represented as a peak value at the joint point, only one or a plurality of pixel points in the actual label are labeled as the joint points, and the sigma represents the propagation range, namely the variance, of each peak value.
Then
Figure BDA00023749598900000810
The values are:
Figure BDA00023749598900000811
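A short NumPy sketch of building this ground truth; the σ value and the assumption that joint positions are pixel coordinates inside the map are illustrative.

```python
import numpy as np

def joint_heatmap(positions, height, width, sigma=7.0):
    """Ground-truth confidence map for one joint type: a Gaussian peak at
    each person's annotated position, merged by a pixel-wise maximum."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (jx, jy) in positions:  # x_{j,m} for every person m
        g = np.exp(-((xs - jx) ** 2 + (ys - jy) ** 2) / sigma ** 2)
        heatmap = np.maximum(heatmap, g)  # S_j*(p) = max_m S*_{j,m}(p)
    return heatmap
```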
The direction of the skeleton information between the joints in the actual picture is obtained in the same way:

L_c*(p) = (1 / n_c(p)) Σ_m L*_{c,m}(p)

where n_c(p) is the number of non-zero vectors at pixel p over all labeled people; that is, if the same position p is shared by several people, the vectors are averaged;
Once the joint vector map L_c(p) output by the pose estimation network is known, it is used to evaluate the correlation between two joints: the integral of the dot product between the vector connecting the two joints and the vector at each pixel along the connecting line serves as the correlation between the two joints:

E = ∫₀¹ L_c(p(u)) · (d_{j2} − d_{j1}) / ||d_{j2} − d_{j1}||₂ du

where d_{j1} and d_{j2} are the two joints and L_c(p(u)) is the vector on the connecting line; p(u) is the position between the two joints, i.e. p(u) = (1 − u)·d_{j1} + u·d_{j2}, so p(u) denotes the joint d_{j2} when u = 1 and the joint d_{j1} when u = 0; u selects the position along the line between the two joints and ranges from 0 to 1. With the correlations between the joints known, the joints are used as the vertices of a graph and the correlations as its edge weights, and the Hungarian matching algorithm performs the assignment to obtain the joint coordinates and skeleton information of each person;
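In practice the integral E is approximated by sampling along the connecting line. A sketch follows, assuming the vector map is stored as two 2D arrays indexed [y, x] and that both joints lie inside the map; the sample count and the helper name are assumptions.

```python
import numpy as np

def connection_score(d1, d2, vec_x, vec_y, samples=10):
    """Approximate E by sampling points p(u) along the candidate connection
    between joints d1 and d2 and averaging the dot products."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    v = d2 - d1
    norm = np.linalg.norm(v)
    if norm < 1e-6:
        return 0.0
    v = v / norm  # unit vector (d_j2 - d_j1) / ||d_j2 - d_j1||
    score = 0.0
    for u in np.linspace(0.0, 1.0, samples):
        px, py = (1 - u) * d1 + u * d2          # p(u)
        x, y = int(round(px)), int(round(py))
        score += vec_x[y, x] * v[0] + vec_y[y, x] * v[1]
    return score / samples  # edge weight for the Hungarian assignment
```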
Step two: human pose tracking
Train the improved pose estimation network to obtain a pose estimation model, use the model to estimate the human pose in video frame pictures to obtain the joint coordinates of each person, and calculate the distance of each joint of the same person between different frames. The coordinate matrix of the j-th joint of the m-th person is L_{j,m} = (x_{j,m}, y_{j,m}, c_{j,m}), where x_{j,m} and y_{j,m} are the coordinates of the human joint, j = 1, 2, ..., 18, and c_{j,m} is the confidence of the joint. The coordinate matrix of the m-th person is P_m = (L_{1,m}, L_{2,m}, ..., L_{18,m}). The average of the sums of the Euclidean distances of the corresponding joints of different people in the adjacent frames is calculated, and the person with the minimum distance, below a threshold, is the same person. (The threshold is set as follows: when the camera is fixed, so that the size of the human body changes little, a fixed value can be specified; 90 pixels is used in this embodiment. In a changing scene where the size of the human body varies considerably, half of the horizontal distance between a person's leftmost and rightmost joints can be chosen as the threshold instead: after the distances between each person in the previous frame and the person in the current frame are calculated, 1/2 of that person's width is selected as the threshold.)
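A minimal sketch of this frame-to-frame matching, assuming each person is an 18 x 3 NumPy array of (x, y, c) rows and using the fixed 90-pixel threshold of this embodiment; the helper name is hypothetical, and a full implementation would also enforce a one-to-one assignment.

```python
import numpy as np

def match_people(prev_people, curr_people, threshold=90.0):
    """Match people across adjacent frames by the mean Euclidean distance
    over their 18 joints; pairs above the threshold stay unmatched."""
    matches = {}  # index in current frame -> index in previous frame
    for i, curr in enumerate(curr_people):
        best_j, best_d = None, float("inf")
        for j, prev in enumerate(prev_people):
            d = np.linalg.norm(curr[:, :2] - prev[:, :2], axis=1).mean()
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None and best_d < threshold:
            matches[i] = best_j  # minimum distance below threshold: same person
    return matches
```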
Step three: fall behavior detection
Track the people in different frames using the method of step two, so that a human fall is detected from the coordinate changes and the aspect ratio of the joints of the same person in consecutive frames. The fall detection flow chart is shown in Fig. 7, where a denotes acceleration. Threshold 1 is the threshold for judging whether the body is in a state of violent motion; 200 pixels/s² is used in this example (pixel refers to pixel distance, s to seconds). Threshold 2, the periodic threshold, is used to judge whether the body is in a static state or in a periodic motion state; 80 pixels/s² is used in this example. Number is used for the counting judgment: repeated comparisons against threshold 2 determine whether the body is in the still state after a fall or in a periodic motion state (in this embodiment, when the accumulated Number exceeds 8, the post-fall still state is preliminarily judged), and other states are excluded according to the characteristics of the falling motion.
3-1. Calculating joint acceleration: the accelerations of the joints close to the center point of the human body (the hip, neck, shoulder and knee joints) are calculated from the coordinate changes of the human joints between consecutive frames; because the body's center point moves downward rapidly during a fall, the fall is detected from the motion direction and acceleration of the hip, shoulder, neck or knee joints. The accelerations of these joints reflect how violent the body's motion is. From the joint coordinate form of step two, the acceleration is:

a = √((x_t − x_{t−1})² + (y_t − y_{t−1})²) / Δt²

where a is the acceleration of the joint motion; (x_{t−1}, y_{t−1}) is the position of the joint at the previous moment; (x_t, y_t) is the position of the joint at the current moment; and Δt is the interval between the two frames.
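A direct transcription of this formula; the helper name is hypothetical, and treating the frame interval dt as known is an assumption.

```python
def joint_acceleration(prev_pos, curr_pos, dt):
    """Acceleration proxy of one joint from its pixel displacement between
    the previous frame (x_{t-1}, y_{t-1}) and the current frame (x_t, y_t)."""
    dx = curr_pos[0] - prev_pos[0]
    dy = curr_pos[1] - prev_pos[1]
    return (dx ** 2 + dy ** 2) ** 0.5 / dt ** 2  # pixels per second squared
```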
3-2. Calculating the relative positions of different joints of the same person, the angle between the joint connecting line and the horizontal line, and the aspect ratio: after the acceleration of the joints near the body's center point has been judged, whether the person has fallen is further determined from the relative positions of different joints, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the aspect ratio:
3-2-1. Judge the intensity of the human motion from the joint acceleration: the larger the acceleration, the more violent the current motion. When the acceleration is below the intensity threshold, the person is motionless or moving slowly (such as walking or standing), and the acceleration of the next frame continues to be calculated; when the acceleration exceeds the intensity threshold and the person is in a state of violent motion (such as running, jumping or falling), detection continues for 80 frames while the acceleration is calculated and counted, and the method proceeds to step 3-2-2;
3-2-2. After counting for a period in the violent-motion state, judge whether the acceleration is below the periodic threshold. If so, continue calculating the acceleration at the next moment and accumulate the count; once the accumulated count exceeds 8, proceed to step 3-2-3. If the acceleration is above the periodic threshold, a large periodic or sustained acceleration has been detected, the behavior is judged to be violent motion other than a fall, and the method returns to the initial acceleration calculation;
3-2-3. Squatting, sitting and similar behaviors are then excluded according to the relative positions of the joints after the body comes to rest, the angle between the line from the neck joint to the center of the two hip joints and the horizontal line, and the change in the width-to-height ratio; the posture of the body is finally determined and it is judged whether the person is in a fallen state. The method is as follows:
and calculating the relative positions of different joint points of the same person, the included angle between the connecting line of the neck joint and the centers of the two hip joints and the horizontal line and the aspect ratio. The detected human neck joint is used as an original point, the direction parallel to the upper edge line and the lower edge line of a video frame is set as an X direction, the direction parallel to the left edge line and the right edge line is set as a Y direction, the X direction difference and the Y direction difference of the neck joint and the hip joint are calculated, the included angle between the connecting line of the centers of the neck joint and the two hip joints and the X direction is calculated at the same time, a human body detection frame is calibrated according to the detected coordinates of all joint points, and the human body is determined to fall if four conditions that the difference between the X direction of the centers of the neck joint and the hip joint is greater than 2/3 of the connecting line of the two points, the difference between the Y direction is smaller than 1/3 of the connecting line, the included angle between the connecting line of the joint points and the X direction is not between [45 degrees ], 135 degrees ] and the human body width-height ratio is greater than 1:1 are met at the same time.
The fall process differs from other sports in that the other processes repeat periodically, whereas the falling behavior is a one-off event; therefore, when a large periodic or sustained acceleration is detected, the behavior can be judged to be something else, such as running. The complete decision flow can be summarized as a small state machine, as sketched below.
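This is a hedged sketch of the flow of Fig. 7; the constants are the example values of this embodiment (threshold 1 = 200 pixels/s², threshold 2 = 80 pixels/s², 80 watch frames, count of 8), and the exact reset behavior is an assumption.

```python
class FallDetector:
    """Threshold 1 detects violent motion, threshold 2 separates the
    post-impact still state from periodic motion, and the geometric check
    of the previous listing confirms the fall."""
    VIOLENT_THRESH = 200.0   # pixels/s^2 (threshold 1)
    PERIODIC_THRESH = 80.0   # pixels/s^2 (threshold 2)
    WATCH_FRAMES = 80        # frames observed after a violent spike
    STILL_COUNT = 8          # low-acceleration frames needed to confirm stillness

    def __init__(self):
        self.watching = 0    # frames left in the violent-motion state
        self.still = 0       # accumulated low-acceleration count

    def update(self, accel, geometry_ok):
        """Feed one frame's joint acceleration and the geometric check result;
        returns True when a fall is confirmed."""
        if self.watching == 0:
            if accel > self.VIOLENT_THRESH:       # enter violent-motion state
                self.watching = self.WATCH_FRAMES
                self.still = 0
            return False
        self.watching -= 1
        if accel < self.PERIODIC_THRESH:
            self.still += 1
            if self.still > self.STILL_COUNT and geometry_ok:
                self.watching = 0
                return True                       # fall: spike then stillness
        else:
            # periodic or sustained large acceleration: running/jumping etc.,
            # so return to the initial acceleration calculation
            self.watching = 0
            self.still = 0
        return False
```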
Step four: the method comprises the steps of deploying a posture estimation model on a TX2 embedded platform, carrying out posture estimation on a video frame, enabling a posture estimation effect graph to be shown in fig. 8 (posture estimation effect graphs of a single person, two persons and multiple persons are respectively shown in fig. 8), then carrying out posture tracking on different persons, calculating acceleration of joint points, using relative positions of the joint points, included angles of the joint points and an X direction and an aspect ratio to assist in judging the falling behavior of a human body, carrying out real-time falling detection, and enabling the falling detection process to be shown in fig. 9.
The pose estimation model reaches 81.7% accuracy on the MPII data set and 62.3% on the COCO data set, slightly higher than the existing OpenPose algorithm, while its speed is 54% higher; the average FPS (frames per second) on the TX2 platform is 16.7, achieving real-time performance. To rule out any dependence of the algorithm on the testers' body type and age, 4 testers of different builds were selected and performed standing, walking, running, squatting and falling actions in a laboratory and a hall; of 80 tests in total, 76 were detected successfully.
What is not described in detail in the present invention is the known art.

Claims (4)

1. An embedded platform real-time fall detection method based on an improved pose estimation algorithm, comprising the following steps:
Step one: building a pose estimation network using a lightweight structure
1-1. Building the feature extraction network: the feature extraction part of the OpenPose algorithm is improved; the feature extraction network is built using depthwise separable convolutions and inverted residual structures, and an attention mechanism is introduced:
(1) Structure of the basic module: the module comprises one depthwise separable convolution and two 1x1 convolutions and uses an inverted residual structure; the input of the basic module is split into two branches; the first branch first expands the number of channels with a 1 × 1 convolution, then applies a 3 × 3 depthwise separable convolution, and then reduces the number of channels with a 1 × 1 convolution; the second branch adds the input feature map of the basic module to the output feature map of the first branch, the sum forming the output of the basic module;
(2) Building the feature extraction network from the basic module of step (1), with the following structure: a picture of size 432 × 368 is used as input; a plain 3 × 3 convolution is applied first, and 9 basic modules of step (1) are then connected in sequence to form the feature extraction part of the pose estimation network; the output of the last basic module is superposed with the output of the sixth basic module along the channel dimension to form the output of the feature extraction network. Some of the 9 basic modules contain a channel attention module placed after the depthwise separable convolution, which assigns weights to the channels of the feature map at that point, i.e. judges the importance of the different channel feature maps;
1-2. Building the pose estimation network: the feature extraction network of step 1-1 produces feature maps of dimensions 54x46x120, which are fed into the first stage. Each stage contains two branches; each branch first passes through five 3x3 depth-convolution structures, each comprising a 3x3 depthwise separable convolution and a 1x1 convolution, and then through two 1x1 convolutions; the final output channel counts of the two branches of each stage are 19 and 38 respectively. The input of the next stage is the channel superposition of this stage's output with the feature map output by the feature extraction network, and there are five stages in total. In the output, the 19-channel feature maps each predict one part of the human body, 18 parts in total plus one background feature map, while the 38-channel output represents the vector maps of the connections between human joints. Except for the final stage, the 19-channel and 38-channel outputs of each stage are fused with the output feature map of the feature extraction network and then used as the input of the next stage;
1-3. Loss function and joint matching: the networks built in steps 1-1 and 1-2 are trained as a whole. The human joint loss function is the difference between the joint output feature map of the pose estimation network and the labeled positions of the data set, and the joint connection loss function is the difference between the joint-connection output feature map of the pose estimation network and the labeled positions of the data set; an L2 loss is used for each stage of step 1-2, and the overall loss is the sum of the losses of all parts. The detected joints of multiple people are assigned with the Hungarian matching algorithm to obtain the joint coordinates and confidence information of each person;
Step two: human pose tracking
Train the improved pose estimation network to obtain a pose estimation model, use the model to estimate the human pose in video frame pictures to obtain the joint coordinates of each person, and calculate the distance of each joint of the same person between different frames. The coordinate matrix of the j-th joint of the m-th person is L_{j,m} = (x_{j,m}, y_{j,m}, c_{j,m}), where x_{j,m} and y_{j,m} are the coordinates of the human joint and c_{j,m} is the confidence that it is a joint. The coordinate matrix of the m-th person is P_m = (L_{1,m}, L_{2,m}, ..., L_{18,m}). The average of the sums of the Euclidean distances of the corresponding joints of different people in adjacent frames is calculated; the pair with the minimum distance, provided it is below a threshold, is the same person;
step three: fall behavior detection
Tracking the human bodies in different frames by the method of step two, and detecting human falls according to the change of the joint point coordinates of the same person between consecutive frames, the included angle between the joint point connecting line and the horizontal line, and the width-to-height ratio;
step four: and deploying the attitude estimation model on the embedded platform, performing attitude estimation on the video frame, performing attitude tracking on different people, and performing real-time falling detection.
2. The detection method according to claim 1, wherein, when the feature extraction network is built, channel attention modules are added only to the fourth, fifth and sixth of the 9 basic modules used; the activation function used in the seventh, eighth and ninth basic modules is h-swish, and the remaining basic modules use the relu6 activation function.
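h-swish is the piecewise-linear approximation of swish popularized by MobileNetV3, h-swish(x) = x · relu6(x + 3) / 6; it avoids the exponential of swish, which matters on embedded CPUs. A one-function PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    """h-swish(x) = x * relu6(x + 3) / 6; maps -5 -> 0 and 5 -> 5, smooth between."""
    return x * F.relu6(x + 3.0) / 6.0

print(h_swish(torch.linspace(-5.0, 5.0, 5)))
```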
3. The detection method according to claim 1, wherein the specific process of fall behavior detection in step three is as follows:
3-1, calculating the joint acceleration: the acceleration of the joint points close to the center of the human body is calculated from the change of the human joint point coordinates between consecutive frames; the joint points close to the body center include the hip, neck, shoulder and knee joints, and a human fall is detected from the movement direction and acceleration of the hip, shoulder, neck or knee joint, which reflect how violent the motion of the human body is; from the joint point coordinates of step two, the acceleration of a joint point is obtained as
a = √((x_t − x_{t−1})² + (y_t − y_{t−1})²) / Δt²
where a is the acceleration of the joint movement, (x_{t−1}, y_{t−1}) is the position of the joint point at the previous moment, (x_t, y_t) is the position of the joint point at the current moment, and Δt is the interval between the two frames;
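The formula above is reconstructed from the surrounding variable definitions, since the original equation survives only as an image placeholder in the source; under that assumption, a direct sketch, with the frame interval Δt as a parameter (30 fps assumed by default):

```python
import math

def joint_acceleration(prev_xy, cur_xy, dt: float = 1 / 30) -> float:
    """Motion measure of one joint between consecutive frames: the magnitude
    of the displacement divided by the squared inter-frame interval dt."""
    dx = cur_xy[0] - prev_xy[0]
    dy = cur_xy[1] - prev_xy[1]
    return math.hypot(dx, dy) / dt ** 2
```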
3-2, calculating the relative positions of the different joint points of the same person, the included angle between the joint point connecting line and the horizontal line of the video frame, and the width-to-height ratio: after the acceleration of the joint points close to the body center has been evaluated, whether the human body has fallen is further determined from the relative positions of the different joint points, the included angle between the line connecting the neck joint and the center of the two hip joints and the horizontal line, and the width-to-height ratio, namely:
3-2-1, judging the intensity of the human motion from the joint acceleration, a larger acceleration meaning more violent current motion: when the acceleration is below an intensity threshold, the person is motionless or moving slowly, and the acceleration of the next frame is calculated; when the acceleration exceeds the intensity threshold, the person is in a state of violent motion, detection continues for 80 frames while the acceleration keeps being calculated and counted, and the process enters step 3-2-2;
3-2-2, after counting for a period in the violent-motion state, judging whether the acceleration is below a periodic threshold: if so, the acceleration at the next moment keeps being calculated and the count accumulated until it exceeds 8, whereupon the process enters step 3-2-3; if the acceleration is above the periodic threshold, periodic or continuous large accelerations have been detected, the behavior is judged to be violent motion other than a fall, and the process returns to the initial acceleration calculation;
and 3-2-3, then excluding squatting and sitting behaviors according to the relative positions of the joint points after the body has come down, the included angle between the line connecting the neck joint and the center of the two hip joints and the horizontal line, and the change of the width-to-height ratio, and finally determining the posture of the human body and judging whether it is in a fallen state; a sketch of the counting logic of steps 3-2-1 and 3-2-2 follows.
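A sketch of the counting logic of steps 3-2-1 and 3-2-2. The 80-frame window and the count of 8 come from the claim; the numeric intensity and periodic thresholds are placeholders, since the patent does not state their values:

```python
def fall_candidate(accels, intensity_thresh=3.0, periodic_thresh=3.0,
                   window=80, calm_needed=8):
    """`accels` holds per-frame accelerations of a joint near the body center.
    Returns the index of the motion burst if a candidate fall (burst followed
    by sustained stillness) is found, else None; the posture test of step
    3-2-3 would then run on the candidate."""
    t = 0
    while t < len(accels):
        if accels[t] <= intensity_thresh:      # 3-2-1: still or slow motion
            t += 1
            continue
        calm = 0                               # burst: watch the next `window` frames
        for a in accels[t + 1 : t + 1 + window]:
            if a < periodic_thresh:            # 3-2-2: quiet after the burst
                calm += 1
                if calm > calm_needed:
                    return t                   # candidate fall found
            else:
                calm = 0                       # repeated bursts: vigorous
                                               # exercise, not a fall
        t += window                            # back to the initial scan
    return None
```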
4. The detection method according to claim 3, wherein the detected neck joint of the human body is taken as the origin, the direction parallel to the upper and lower edge lines of the video frame is set as the X direction, and the direction parallel to the left and right edge lines as the Y direction; the X-direction and Y-direction differences between the neck joint and the hip joints are calculated, together with the included angle between the line connecting the neck joint and the center of the two hip joints and the X direction; a human body detection frame is calibrated from the coordinates of all detected joint points; and it is judged whether the X-direction difference between the neck joint and the hip-joint center is larger than 2/3 of the length of the line connecting the two points, whether the Y-direction difference is smaller than 1/3 of that length, whether the included angle between the connecting line and the X direction lies outside [45°, 135°], and whether the width-to-height ratio of the human body is larger than 1:1; when these conditions are met, the human body is determined to have fallen.
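A direct sketch of the geometric test in claim 4; the joint coordinates are assumed to be in image pixels, and the function and parameter names are hypothetical:

```python
import math

def fallen_posture(neck, left_hip, right_hip, box_w, box_h) -> bool:
    """Claim-4 test: neck-to-hip-center offsets, trunk angle to the horizontal
    X direction, and detection-frame aspect ratio. neck/left_hip/right_hip are
    (x, y) pixel coordinates; box_w, box_h are the detection-frame sizes."""
    hip = ((left_hip[0] + right_hip[0]) / 2, (left_hip[1] + right_hip[1]) / 2)
    dx, dy = hip[0] - neck[0], hip[1] - neck[1]
    length = math.hypot(dx, dy)
    if length == 0 or box_h == 0:
        return False
    angle = math.degrees(math.atan2(dy, dx)) % 180.0   # trunk angle in [0, 180)
    return (abs(dx) > 2 / 3 * length                   # large horizontal offset
            and abs(dy) < 1 / 3 * length               # small vertical offset
            and not (45.0 <= angle <= 135.0)           # trunk far from upright
            and box_w / box_h > 1.0)                   # frame wider than tall
```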
CN202010062574.7A 2020-01-20 2020-01-20 Embedded platform real-time falling detection method based on improved attitude estimation algorithm Expired - Fee Related CN111274954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010062574.7A CN111274954B (en) 2020-01-20 2020-01-20 Embedded platform real-time falling detection method based on improved attitude estimation algorithm


Publications (2)

Publication Number Publication Date
CN111274954A CN111274954A (en) 2020-06-12
CN111274954B true CN111274954B (en) 2022-03-15

Family

ID=71003333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010062574.7A Expired - Fee Related CN111274954B (en) 2020-01-20 2020-01-20 Embedded platform real-time falling detection method based on improved attitude estimation algorithm

Country Status (1)

Country Link
CN (1) CN111274954B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220315