CN114463844A - Fall detection method based on self-attention double-flow network - Google Patents

Fall detection method based on self-attention double-flow network

Info

Publication number
CN114463844A
CN114463844A
Authority
CN
China
Prior art keywords
attention
self
network
pooling
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210033684.XA
Other languages
Chinese (zh)
Inventor
陈小辉
孟登
陈凌俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202210033684.XA priority Critical patent/CN114463844A/en
Publication of CN114463844A publication Critical patent/CN114463844A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A fall detection method based on a self-attention double-flow network comprises the following steps: step 1: acquiring image data of human behavior actions to form a sample data set of the behavior actions; step 2: constructing a self-attention double-flow network for detecting behavior actions; step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples; step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached; step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred. The invention provides a fall detection method based on a self-attention double-flow network that detects and tracks the behavior of people in video using a computer-vision method.

Description

Fall detection method based on self-attention double-flow network
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a real-time fall detection and tracking method based on double-flow convolution.
Background
With the continuous progress of society, artificial intelligence has become ever more pervasive in people's lives. Related artificial intelligence technologies play an increasingly important role, and behavior recognition, as one of these technologies, is important for analyzing the behavior of pedestrians in video. For dangerous behavior actions, the video can be processed with a behavior recognition algorithm and an early warning issued at the same time, so that behavior in a hazardous area can be handled promptly.
In the field of behavior recognition, algorithms based on wearable devices, on pose estimation, and on computer vision are currently popular. Although the wearable-device method is simple and convenient, it can be problematic for some elderly people: it may cause discomfort, and the device may even be forgotten. For pose estimation algorithms, because many joint points must be computed during processing, real-time performance is somewhat reduced.
Fall detection based on video is not yet very mature. Although many public data sets are available, they differ greatly from practical application scenarios. For real-time detection in video, related methods currently include 3D convolution, 2D+3D convolution, and CNN+LSTM. Methods based on 3D convolution can directly obtain the spatio-temporal features of a video, but the excessive number of convolution kernel parameters makes the computation heavy and time-consuming. Methods based on 2D+3D convolution mainly obtain the spatial features of an image through 2D convolution and the motion information of the video through 3D convolution, and finally fuse the spatio-temporal information. The CNN+LSTM model obtains the relevant feature information of the video through the importance weighting of forget gates and finally predicts from the obtained result.
Among two-stream recognition and detection methods, document [1] proposes extracting spatial feature information with a CNN taking a single RGB frame as input, and extracting the temporal feature information of the video with the optical flow of multiple frames as input; its disadvantage is that the correspondence between the spatial and temporal features is not learned. For 3D network models, document [2] proposes the C3D network architecture, performing behavior recognition with 3D convolution kernels; 3D convolution is explored for extracting the behavior features of people in video and finding the optimal convolution kernel size. However, 3D convolution has too many parameters, and network training is very slow. Document [3] expands 2D convolution to 3D convolution, exploring the expansion mode and inflating 2D pre-trained models to 3D. Document [4] proposes the BSN network model for finding the temporal boundaries of an action in a video; its network structure is flexible, but as a multi-stage network it has some shortcomings in real-time performance.
In many existing behavior recognition methods, the network architecture often adopts 3D convolution, so training takes a long time owing to the large number of parameters. Multi-stage network models improve accuracy considerably, but because of their complexity they have serious problems in real-time prediction; that is, real-time detection of video lacks good real-time performance. Moreover, when deep information features are extracted, some important information is not extracted well and a large amount of important information is not highlighted, which reduces the accuracy of the model during prediction.
Disclosure of Invention
The invention provides a fall detection method based on a self-attention double-flow network, which detects and tracks the behavior of people in video using a computer-vision method. It aims to solve the technical problems that existing behavior recognition methods lack good real-time performance when detecting video in real time, fail to extract important information well when extracting deep information features, cannot highlight a large amount of important information, and therefore lose accuracy during prediction.
A fall detection method based on a self-attention double-flow network comprises the following steps:
step 1: acquiring image data of the human behavior action to form a sample data set of the behavior action;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
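For reference, the focal loss of step 3 can be sketched in a few lines. This is a minimal PyTorch illustration of a standard focal loss, not code from the patent; the function name and the α and γ values are illustrative assumptions:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights simple, easily classified samples so that training
    # focuses on hard examples (multi-class form).
    log_probs = F.log_softmax(logits, dim=1)                   # (batch, num_classes)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)      # probability of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()      # (1 - pt)^gamma suppresses easy samples

# usage: logits from the double-flow network; six behavior classes
logits = torch.randn(8, 6)
targets = torch.randint(0, 6, (8,))
print(focal_loss(logits, targets))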
In step 1, the behavioral and action classifications of a person include fall, run, jump, walk, stand, lie;
in step 1, the sample data set is proportionally divided into a training set and a test set.
In step 2, the backbone network of the constructed self-attention double-flow network is divided into two branches: the first branch comprises a Darknet19 neural network and the second branch comprises an S3D neural network. The specific structure is: input layer → Darknet19 neural network → self-attention model; input layer → S3D neural network → self-attention model. The self-attention model takes the outputs of the networks in the two branches as input, fuses the information on the channels using channel fusion, obtains an attention feature map through the attention mechanism, and finally outputs through one convolution layer to obtain the output layer.
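The wiring of the two branches can be sketched as follows. This is a structural illustration only, with the branch modules and channel counts assumed rather than taken from the patent:

import torch
import torch.nn as nn

class SelfAttentionDualStream(nn.Module):
    # Two-branch backbone: a 2D branch (Darknet19-style) for key-frame spatial
    # features, a 3D branch (S3D-style) for time-sequence features, channel
    # fusion, a self-attention model, and one final convolution layer.
    def __init__(self, branch2d, branch3d, attention, fused_channels, num_classes=6):
        super().__init__()
        self.branch2d = branch2d          # input layer -> Darknet19 neural network
        self.branch3d = branch3d          # input layer -> S3D neural network
        self.attention = attention        # self-attention model over fused channels
        self.head = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, key_frame, clip):
        f2d = self.branch2d(key_frame)             # (batch, c', H, W)
        f3d = self.branch3d(clip)                  # (batch, c'', H, W), time axis already collapsed
        fused = torch.cat([f2d, f3d], dim=1)       # channel fusion: A ∈ R^((c'+c'')×H×W)
        return self.head(self.attention(fused))    # attention feature map -> output layer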
In step 2, the structure of the S3D neural network is: convolution, pooling, convolution and pooling connected in sequence, followed by two Inception 1 blocks and a max-pooling layer, then five Inception 1 blocks and a max-pooling layer, and finally two Inception 2 blocks;
when the S3D neural network is used, a time sequence of video frames is obtained first; shallow time-sequence information is obtained through convolution, pooling, convolution and pooling; deeper feature information is extracted through several Inception 1 blocks; a pooling layer then reduces the number of features and the size of the combined feature map so that it can be sent to the next Inception 1 block; after passing through several Inception 1 blocks, the feature map passes through several Inception 2 blocks.
The specific structure of the Inception 1 block is: four branches, where the first branch contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence. Through the multiple branches, richer feature information in the video can be extracted.
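Such a four-branch block might look as follows in PyTorch; the channel counts, kernel sizes and exact dimension-reduction order are illustrative assumptions in the pseudo-3D spirit described above (c_out is assumed divisible by 4):

import torch
import torch.nn as nn

class Inception1Block(nn.Module):
    # Four parallel branches whose outputs are concatenated on the channel axis.
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        # branch 1: a single kernel operating on the time sequence
        self.b1 = nn.Conv3d(c_in, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # branches 2 and 3: spatial convolution first, then the time sequence
        self.b2 = nn.Sequential(
            nn.Conv3d(c_in, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))
        self.b3 = nn.Sequential(
            nn.Conv3d(c_in, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))
        # branch 4: max pooling in the spatial dimensions first, then the time sequence
        self.b4 = nn.Sequential(
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1)),
            nn.Conv3d(c_in, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))

    def forward(self, x):                         # x: (batch, C, T, H, W)
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)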
In step 2, the Darknet19 neural network consists of, connected in sequence: 2D convolution, pooling, Block1, pooling, Block1, pooling, Block2, pooling, Block2 and 2D convolution; both Block1 and Block2 are multilayer convolution structures.
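A condensed sketch of this 2D branch follows; the layer widths are illustrative, and block1/block2 are stand-ins for the multilayer convolution modules of Figs. 3 and 4 rather than their exact structure:

import torch.nn as nn

def conv_bn(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

def block1(c_in, c_out):   # three-layer module: 3x3 -> 1x1 -> 3x3
    return nn.Sequential(conv_bn(c_in, c_out, 3),
                         conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3))

def block2(c_in, c_out):   # five-layer module alternating 3x3 and 1x1
    return nn.Sequential(conv_bn(c_in, c_out, 3), conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3), conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3))

darknet19_branch = nn.Sequential(
    conv_bn(3, 32, 3), nn.MaxPool2d(2),     # 2D convolution, pooling
    block1(32, 64), nn.MaxPool2d(2),        # Block1, pooling
    block1(64, 128), nn.MaxPool2d(2),       # Block1, pooling
    block2(128, 256), nn.MaxPool2d(2),      # Block2, pooling
    block2(256, 512),                       # Block2
    conv_bn(512, 512, 3))                   # final 2D convolution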
In step 2, the self-attention double-flow network adopts a Darknet19 neural network and an S3D neural network. The feature information extracted by the 2D network and the time-sequence information extracted by the 3D network are spliced by channel fusion; the weight of each channel is then obtained through the self-attention mechanism, giving the importance of the feature map on each channel, and the relation between the feature maps of different channels is obtained through a Gram matrix. The self-attention mechanism effectively strengthens the feature information and has a good effect on behavior classification.
The backbone network of the self-attention double-flow network comprises the Darknet19 neural network, the S3D neural network and the self-attention model. The Darknet19 neural network is a combination of convolutional and pooling layers; the S3D neural network links its convolution and pooling layers through several Inception blocks; the self-attention model increases the importance of salient features.
The Inception blocks of the S3D neural network all adopt a pseudo-3D structure.
When the self-attention model is used, the fused 2D and 3D network feature map A is obtained first, where A ∈ R^((c'+c'')×H×W). After fusion, B is obtained, with B ∈ R^(c×H×W). Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
after the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation; F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information, according to the following formulas:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
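Equations (1)-(9) admit a direct reading in code. The sketch below is an illustrative, batch-free PyTorch rendering of the formulas as written above, not the patent's implementation:

import torch
import torch.nn.functional as F

def channel_self_attention(B):
    # B: fused feature map of shape (c, H, W); returns the new feature map C.
    c, H, W = B.shape
    Fm = B.reshape(c, H * W)                 # (1) dimension transform, N = H×W
    E = Fm @ Fm.t()                          # (2) Gram matrix between channels
    M = F.softmax(E, dim=1)                  # (3)-(4) attention map via softmax
    G = (M @ Fm).reshape(c, H, W)            # (5)-(6) attention applied to F, reshaped to G'
    K = B.mean(dim=(1, 2), keepdim=True)     # (7) spatial global pooling -> channel weights
    Q = K * B                                # (8) weighted feature maps
    return G + Q                             # (9) C = G' + Q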
When the self-attention model is used, the fused 2D and 3D network feature map A is obtained first, and B is obtained after fusion. Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F. F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information.
The formulas used to obtain F after dimension transformation of B are as follows:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
in the process of obtaining the weighted feature information, the formulas adopted are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
Compared with the prior art, the invention has the following technical effects:
compared with the existing neural network model, the new neural network model, namely the self-attention double-flow network, can find out the importance degree of the characteristics of different channels of the channel, and has more targeted training during network training, so that the final prediction effect is better; the self-attention double-flow network solves the problem of low relevance of characteristic information among channels, and meanwhile, useful characteristic information is strengthened according to a self-attention mechanism.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
fig. 1 is a schematic diagram of a self-attention dual-flow network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a Darknet19 neural network structure.
FIG. 3 is a Block1 schematic diagram of a Darknet19 neural network structure.
FIG. 4 is a Block2 schematic diagram of a Darknet19 neural network structure.
Fig. 5 is a schematic structural diagram of the S3D neural network.
Fig. 6 is a schematic structural diagram of the Inception 1 block of the S3D neural network structure.
Fig. 7 is a schematic structural diagram of the Inception 2 block of the S3D neural network structure.
Fig. 8 is a schematic diagram of a self-attention mechanism.
Detailed Description
A fall detection method based on a self-attention double-flow network comprises the following steps:
step 1: dividing the behavior and actions of a person into falling, running, jumping, walking, standing and lying, collecting image data of the behavior and actions of the person, and forming a sample data set of the behavior and actions;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
As shown in fig. 1, the backbone network of the self-attention dual-flow network includes two branches, the first branch is a Darknet19 neural network, and the second branch is an S3D neural network.
The S3D neural network connects convolution, pooling, convolution and pooling in sequence, then two Inception 1 blocks and one max-pooling layer, then five Inception 1 blocks and one max-pooling layer, and finally two Inception 2 blocks. The self-attention model takes the outputs of the two branch networks as input, fuses the information on the channels using channel fusion, and obtains an attention feature map through the attention mechanism. The final output passes through one convolution layer.
The neural network model of the invention uses the S3D neural network. It first obtains a time sequence of video frames, then obtains shallow time-sequence information through convolution, pooling, convolution and pooling, and extracts deeper feature information through several Inception blocks. The Inception 1 block structure is shown in FIG. 6; it comprises four branches, each of which is a pseudo-3D convolution. After feature extraction, the concat function splices the features of the four branches together, and a pooling layer then halves the number of features, reducing the size of the merged feature map so that it can be sent to the next Inception 1 block.
The first branch of the Inception 1 block contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence. Through the multiple branches, richer feature information in the video can be extracted. Fig. 7 shows the Inception 2 block, which also comprises four branches but differs from the Inception 1 block in its middle two branches, which reduce the processing of time-sequence information. The Inception block modules use pseudo-3D convolution kernels which, compared with full 3D kernels, greatly reduce the number of parameters; pseudo-3D convolution benefits single-stage behavior-prediction network models considerably and improves the real-time performance of network detection.
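The parameter saving of the pseudo-3D factorization can be checked numerically. The sketch below compares a full k×k×k 3D kernel with a 1×k×k spatial plus k×1×1 temporal pair; the channel counts are arbitrary:

import torch.nn as nn

c_in, c_out, k = 192, 192, 3

full_3d = nn.Conv3d(c_in, c_out, kernel_size=(k, k, k), padding=k // 2)
pseudo_3d = nn.Sequential(
    nn.Conv3d(c_in, c_out, kernel_size=(1, k, k), padding=(0, k // 2, k // 2)),  # spatial part
    nn.Conv3d(c_out, c_out, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0)))      # temporal part

n_full = sum(p.numel() for p in full_3d.parameters())
n_pseudo = sum(p.numel() for p in pseudo_3d.parameters())
print(n_full, n_pseudo)   # the factorized pair needs roughly (k² + k)/k³ of the full kernel's weights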
The neural network model adopts a Darknet19 neural network comprising 19 convolutional layers and 5 pooling layers; its structure is shown in fig. 2, and it serves as the 2D branch network of the neural network model of the invention. The Block1 and Block2 modules, shown in FIG. 3 and FIG. 4, are multilayer convolutions. The Darknet19 neural network acquires the spatial features of key frames in the video. Compared with performing behavior recognition with a 3D network alone, extracting key frames with a 2D network makes the features extracted by the whole network richer, benefits the spatio-temporal behavior detection problem greatly, allows the specific position of a person in the video to be localized well, and, owing to its simpler structure, benefits real-time detection in video.
After the branch network models of the double-flow network extract features separately, the spatial and time-sequence information is fused; however, much noise is introduced, and the fused feature information may not correctly represent the features required for training. Meanwhile, the feature information between channels is mutually independent and unrelated. Through the attention mechanism, the network model can therefore focus on and train useful features, obtaining more effective features and predicting the behavior of people in the video more accurately.
In step 2, the self-attention double-flow network adopts the Darknet19 neural network and S3D. The outputs of the 2D network and the 3D network are spliced together by channel fusion: the spatial feature information of video key frames is extracted by the Darknet19 neural network, the time-sequence feature information of the video is extracted by S3D, and channel fusion aggregates the spatial and time-sequence information at a deep level. Finally, the results are classified, and bounding-box regression is performed, through the convolution layers.
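The channel-fusion step itself is a concatenation along the channel axis; a minimal sketch with illustrative shapes:

import torch

spatial = torch.randn(1, 256, 14, 14)     # key-frame features from the Darknet19 branch
temporal = torch.randn(1, 256, 14, 14)    # S3D features with the time axis already collapsed
fused = torch.cat([spatial, temporal], dim=1)   # A ∈ R^((c'+c'')×H×W)
print(fused.shape)                        # torch.Size([1, 512, 14, 14])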
Preferably, the S3D neural network of the self-attention double-flow network comprises sequentially connected convolution, pooling, convolution and pooling, followed by two Inception 1 blocks, a max-pooling layer, five Inception 1 blocks, another max-pooling layer, and finally two Inception 2 blocks; the 3D convolutions inside the blocks are factorized into 2D-plus-1D (pseudo-3D) form, and each block is divided into four branches.
The self-attention network used by the neural network model of the invention is shown in fig. 8. The features obtained by channel fusion simply splice the spatial and time-sequence features along the channels; they cannot represent the relevance between the channels, nor can the importance of the feature information stored in each channel be known. Consequently, the fused features contain much noise, and the information in each channel is independent of the others. Both problems can be solved by the attention mechanism. First, the fused 2D and 3D network feature map A is obtained, where A ∈ R^((c'+c'')×H×W). After fusion, B is obtained, with B ∈ R^(c×H×W). Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
In the formulas, B represents the feature map obtained by fusing the 2D features and the 3D features, E represents the attention map of the feature maps between the channels, M represents the attention map after mapping, and G represents the effect of the attention map on the original feature map.
After the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation, and F is multiplied by F^T to obtain a Gram matrix; the attention map M is then obtained through a softmax function. M represents the relevance among the different feature maps across the channels and deepens the information fusion between the channels. In this way the relation between different features across channels is strengthened, and new feature information can be obtained at a deep level, so that the channels obtain better performance.
For the importance of the different feature maps within the channels, the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information. The formulas are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
In the formulas, K represents the channel weight attention map, Q represents the weighted feature maps, v represents a single feature map, and C represents the new feature map.
In an embodiment, the parameters of a self-attention dual-flow network for fall detection are shown in table 1.
TABLE 1 Parameter table of the deep-learning recognition unit
(Table 1 is reproduced as an image in the original publication.)
In the embodiment, double-flow networks without and with the self-attention mechanism were each used in fall detection experiments; the results are shown in table 2.
TABLE 2 evaluation index comparison table for common double-flow network and self-attention double-flow network
Network                              Precision   Recall
Common double-flow network           62.60%      92.80%
Self-attention double-flow network   64.02%      93.20%
The evaluation indexes in table 2 show that the self-attention double-flow network performs better than the double-flow network without self-attention: it strengthens the correlation among the channel feature maps, making the feature information richer, and the weighted feature maps reveal the importance of each feature map to the overall features. The precision of the self-attention double-flow network is 64.02% and its recall rate is 93.20%, improvements of 1.42% and 0.4% respectively over the common double-flow network. All indexes of the self-attention double-flow network improve on the common double-flow network, showing that it acquires richer spatio-temporal information from the video.
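For reference, the two reported indexes follow from detection counts in the usual way; the counts below are placeholders chosen only to roughly reproduce the reported figures, not the experimental data:

def precision_recall(tp, fp, fn):
    # precision = TP/(TP+FP); recall = TP/(TP+FN)
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=932, fp=524, fn=68)   # illustrative counts only
print(f"precision={p:.2%}, recall={r:.2%}")      # ≈64.0%, 93.2%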

Claims (10)

1. A fall detection method based on a self-attention double-flow network is characterized by comprising the following steps:
step 1: acquiring image data of the human behavior action to form a sample data set of the behavior action;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
2. The method according to claim 1, wherein in step 1, the behavioral-action classification of the person comprises falling, running, jumping, walking, standing, lying;
in step 1, the sample data set is proportionally divided into a training set and a test set.
3. The method according to claim 1, wherein in step 2, the backbone network of the constructed self-attention double-flow network is divided into two branches: the first branch comprises a Darknet19 neural network and the second branch comprises an S3D neural network, with the specific structure: input layer → Darknet19 neural network → self-attention model; input layer → S3D neural network → self-attention model; the self-attention model takes the outputs of the networks in the two branches as input, fuses the information on the channels using channel fusion, obtains an attention feature map through the attention mechanism, and finally outputs through one convolution layer to obtain the output layer.
4. The method of claim 3, wherein in step 2, the structure of the S3D neural network is: convolution, pooling, convolution and pooling connected in sequence, followed by two Inception 1 blocks and a max-pooling layer, then five Inception 1 blocks and a max-pooling layer, and finally two Inception 2 blocks.
5. The method of claim 4, wherein the S3D neural network is used by first acquiring a time sequence of video frames, then obtaining shallow time-sequence information through convolution, pooling, convolution and pooling, extracting deeper feature information through multiple Inception 1 blocks, then reducing the number of features by pooling so as to reduce the size of the combined feature map for transmission to the next Inception 1 block, and then passing through multiple Inception 2 blocks after the multiple Inception 1 blocks.
6. The method according to claim 4, wherein the Inception 1 block has the specific structure: four branches, where the first branch contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence; through the multiple branches, richer feature information in the video can be extracted.
7. The method according to claim 3, wherein in step 2, the Darknet19 neural network is structured by sequentially connecting 2D convolution, pooling, Block1, pooling, Block1, pooling, Block2, pooling, Block2, 2D convolution; wherein, both Block1 and Block2 are multilayer convolution structures.
8. A method according to one of the claims 3 to 7, characterized in that, when the self-attention model is in use,
the fused 2D and 3D network feature map A is obtained first, where A ∈ R^((c'+c'')×H×W); after fusion, B is obtained, with B ∈ R^(c×H×W); let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
after the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation, and F is multiplied by F^T to obtain a Gram matrix; the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information, according to the following formulas:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
9. A self-attention model, characterized in that, when it is used, the fused 2D and 3D network feature map A is obtained first, and B is obtained after fusion; let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F; F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information.
10. The self-attention model according to claim 9, wherein the formulas used to obtain F after dimension transformation of B are as follows:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
in the process of obtaining the weighted feature information, the formulas adopted are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
CN202210033684.XA 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network Pending CN114463844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210033684.XA CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210033684.XA CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Publications (1)

Publication Number Publication Date
CN114463844A true CN114463844A (en) 2022-05-10

Family

ID=81409804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210033684.XA Pending CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Country Status (1)

Country Link
CN (1) CN114463844A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103682A1 * 2022-11-14 2024-05-23 天地伟业技术有限公司 Fall behavior identification method based on video classification and electronic device

Similar Documents

Publication Publication Date Title
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
Liong et al. Shallow triple stream three-dimensional cnn (ststnet) for micro-expression recognition
Pan et al. Deepfake detection through deep learning
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
KR102309111B1 (en) Ststem and method for detecting abnomalous behavior based deep learning
CN114596520A (en) First visual angle video action identification method and device
CN113435432B (en) Video anomaly detection model training method, video anomaly detection method and device
CN113963304B (en) Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN114463844A (en) Fall detection method based on self-attention double-flow network
US20240177525A1 (en) Multi-view human action recognition method based on hypergraph learning
Hatay et al. Learning to detect phone-related pedestrian distracted behaviors with synthetic data
Zhang et al. Multi-scale spatiotemporal feature fusion network for video saliency prediction
CN117671787A (en) Rehabilitation action evaluation method based on transducer
CN117238034A (en) Human body posture estimation method based on space-time transducer
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN116543339A (en) Short video event detection method and device based on multi-scale attention fusion
Tang et al. A multi-task neural network for action recognition with 3D key-points
CN114419729A (en) Behavior identification method based on light-weight double-flow network
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN113824989A (en) Video processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination