CN114463844A - Fall detection method based on self-attention double-flow network - Google Patents

Fall detection method based on self-attention double-flow network

Info

Publication number
CN114463844A
CN114463844A
Authority
CN
China
Prior art keywords
attention
self
network
pooling
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210033684.XA
Other languages
Chinese (zh)
Inventor
陈小辉
孟登
陈凌俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Three Gorges University CTGU filed Critical China Three Gorges University CTGU
Priority to CN202210033684.XA priority Critical patent/CN114463844A/en
Publication of CN114463844A publication Critical patent/CN114463844A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A fall detection method based on a self-attention double-flow network comprises the following steps: step 1: acquiring image data of human behavior actions to form a sample data set of the behavior actions; step 2: constructing a self-attention double-flow network for detecting behavior actions; step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples; step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached; step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred. The invention provides a fall detection method based on a self-attention double-flow network that detects and tracks the behavior of people in video using a computer-vision method.

Description

Fall detection method based on self-attention double-flow network
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a real-time fall detection and tracking method based on double-flow convolution.
Background
With the continuous progress of society, artificial intelligence has become ever more pervasive in people's lives. Related artificial intelligence technologies play an increasingly important role, and behavior recognition, as one of these technologies, is important for analyzing the behavior of pedestrians in video. For dangerous behavior actions, the video can be processed with a behavior recognition algorithm and an early warning issued at the same time, so that behavior in a hazardous area can be handled promptly.
In the field of behavior recognition, algorithms based on wearable devices, on pose estimation, and on computer vision are currently popular. Although the wearable-device method is simple and convenient, it can be problematic for some elderly people: it may cause discomfort, and the device may even be forgotten. For pose estimation algorithms, because many joint points must be computed during processing, real-time performance is somewhat reduced.
Fall detection based on video is not yet very mature. Although many public data sets are available, they differ greatly from practical application scenarios. For real-time detection in video, related methods currently include 3D convolution, 2D+3D convolution, and CNN+LSTM. Methods based on 3D convolution can directly obtain the spatio-temporal features of a video, but the excessive number of convolution kernel parameters makes the computation heavy and time-consuming. Methods based on 2D+3D convolution mainly obtain the spatial features of an image through 2D convolution and the motion information of the video through 3D convolution, and finally fuse the spatio-temporal information. The CNN+LSTM model obtains the relevant feature information of the video through the importance weighting of forget gates and finally predicts from the obtained result.
Among two-stream recognition and detection methods, document [1] proposes extracting spatial feature information with a CNN taking a single RGB frame as input, and extracting the temporal feature information of the video with the optical flow of multiple frames as input; its disadvantage is that the correspondence between the spatial and temporal features is not learned. For 3D network models, document [2] proposes the C3D network architecture, performing behavior recognition with 3D convolution kernels; 3D convolution is explored for extracting the behavior features of people in video and finding the optimal convolution kernel size. However, 3D convolution has too many parameters, and network training is very slow. Document [3] expands 2D convolution to 3D convolution, exploring the expansion mode and inflating 2D pre-trained models to 3D. Document [4] proposes the BSN network model for finding the temporal boundaries of an action in a video; its network structure is flexible, but as a multi-stage network it has some shortcomings in real-time performance.
In many existing behavior recognition methods, the network architecture often adopts 3D convolution, so training takes a long time owing to the large number of parameters. Multi-stage network models improve accuracy considerably, but because of their complexity they have serious problems in real-time prediction; that is, real-time detection of video lacks good real-time performance. Moreover, when deep information features are extracted, some important information is not extracted well and a large amount of important information is not highlighted, which reduces the accuracy of the model during prediction.
Disclosure of Invention
The invention provides a fall detection method based on a self-attention double-flow network, which detects and tracks the behavior of people in video using a computer-vision method. It aims to solve the technical problems that existing behavior recognition methods lack good real-time performance when detecting video in real time, fail to extract important information well when extracting deep information features, cannot highlight a large amount of important information, and therefore lose accuracy during prediction.
A fall detection method based on a self-attention double-flow network comprises the following steps:
step 1: acquiring image data of the human behavior action to form a sample data set of the behavior action;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
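For reference, the focal loss of step 3 can be sketched in a few lines. This is a minimal PyTorch illustration of a standard focal loss, not code from the patent; the function name and the α and γ values are illustrative assumptions:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Down-weights simple, easily classified samples so that training
    # focuses on hard examples (multi-class form).
    log_probs = F.log_softmax(logits, dim=1)                   # (batch, num_classes)
    probs = log_probs.exp()
    pt = probs.gather(1, targets.unsqueeze(1)).squeeze(1)      # probability of the true class
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()      # (1 - pt)^gamma suppresses easy samples

# usage: logits from the double-flow network; six behavior classes
logits = torch.randn(8, 6)
targets = torch.randint(0, 6, (8,))
print(focal_loss(logits, targets))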
In step 1, the behavioral and action classifications of a person include fall, run, jump, walk, stand, lie;
in step 1, the sample data set is proportionally divided into a training set and a test set.
In step 2, the backbone network of the constructed self-attention double-flow network is divided into two branches: the first branch comprises a Darknet19 neural network and the second branch comprises an S3D neural network. The specific structure is: input layer → Darknet19 neural network → self-attention model; input layer → S3D neural network → self-attention model. The self-attention model takes the outputs of the networks in the two branches as input, fuses the information on the channels using channel fusion, obtains an attention feature map through the attention mechanism, and finally outputs through one convolution layer to obtain the output layer.
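The wiring of the two branches can be sketched as follows. This is a structural illustration only, with the branch modules and channel counts assumed rather than taken from the patent:

import torch
import torch.nn as nn

class SelfAttentionDualStream(nn.Module):
    # Two-branch backbone: a 2D branch (Darknet19-style) for key-frame spatial
    # features, a 3D branch (S3D-style) for time-sequence features, channel
    # fusion, a self-attention model, and one final convolution layer.
    def __init__(self, branch2d, branch3d, attention, fused_channels, num_classes=6):
        super().__init__()
        self.branch2d = branch2d          # input layer -> Darknet19 neural network
        self.branch3d = branch3d          # input layer -> S3D neural network
        self.attention = attention        # self-attention model over fused channels
        self.head = nn.Conv2d(fused_channels, num_classes, kernel_size=1)

    def forward(self, key_frame, clip):
        f2d = self.branch2d(key_frame)             # (batch, c', H, W)
        f3d = self.branch3d(clip)                  # (batch, c'', H, W), time axis already collapsed
        fused = torch.cat([f2d, f3d], dim=1)       # channel fusion: A ∈ R^((c'+c'')×H×W)
        return self.head(self.attention(fused))    # attention feature map -> output layer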
In step 2, the structure of the S3D neural network is: convolution, pooling, convolution and pooling connected in sequence, followed by two Inception 1 blocks and a max-pooling layer, then five Inception 1 blocks and a max-pooling layer, and finally two Inception 2 blocks;
when the S3D neural network is used, a time sequence of video frames is obtained first; shallow time-sequence information is obtained through convolution, pooling, convolution and pooling; deeper feature information is extracted through several Inception 1 blocks; a pooling layer then reduces the number of features and the size of the combined feature map so that it can be sent to the next Inception 1 block; after passing through several Inception 1 blocks, the feature map passes through several Inception 2 blocks.
The specific structure of the Inception 1 block is: four branches, where the first branch contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence. Through the multiple branches, richer feature information in the video can be extracted.
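Such a four-branch block might look as follows in PyTorch; the channel counts, kernel sizes and exact dimension-reduction order are illustrative assumptions in the pseudo-3D spirit described above (c_out is assumed divisible by 4):

import torch
import torch.nn as nn

class Inception1Block(nn.Module):
    # Four parallel branches whose outputs are concatenated on the channel axis.
    def __init__(self, c_in, c_out):
        super().__init__()
        c = c_out // 4
        # branch 1: a single kernel operating on the time sequence
        self.b1 = nn.Conv3d(c_in, c, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # branches 2 and 3: spatial convolution first, then the time sequence
        self.b2 = nn.Sequential(
            nn.Conv3d(c_in, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))
        self.b3 = nn.Sequential(
            nn.Conv3d(c_in, c, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.Conv3d(c, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))
        # branch 4: max pooling in the spatial dimensions first, then the time sequence
        self.b4 = nn.Sequential(
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1)),
            nn.Conv3d(c_in, c, kernel_size=(3, 1, 1), padding=(1, 0, 0)))

    def forward(self, x):                         # x: (batch, C, T, H, W)
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)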
In step 2, the Darknet19 neural network consists of, connected in sequence: 2D convolution, pooling, Block1, pooling, Block1, pooling, Block2, pooling, Block2 and 2D convolution; both Block1 and Block2 are multilayer convolution structures.
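A condensed sketch of this 2D branch follows; the layer widths are illustrative, and block1/block2 are stand-ins for the multilayer convolution modules of Figs. 3 and 4 rather than their exact structure:

import torch.nn as nn

def conv_bn(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

def block1(c_in, c_out):   # three-layer module: 3x3 -> 1x1 -> 3x3
    return nn.Sequential(conv_bn(c_in, c_out, 3),
                         conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3))

def block2(c_in, c_out):   # five-layer module alternating 3x3 and 1x1
    return nn.Sequential(conv_bn(c_in, c_out, 3), conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3), conv_bn(c_out, c_out // 2, 1),
                         conv_bn(c_out // 2, c_out, 3))

darknet19_branch = nn.Sequential(
    conv_bn(3, 32, 3), nn.MaxPool2d(2),     # 2D convolution, pooling
    block1(32, 64), nn.MaxPool2d(2),        # Block1, pooling
    block1(64, 128), nn.MaxPool2d(2),       # Block1, pooling
    block2(128, 256), nn.MaxPool2d(2),      # Block2, pooling
    block2(256, 512),                       # Block2
    conv_bn(512, 512, 3))                   # final 2D convolution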
In step 2, the self-attention double-flow network adopts a Darknet19 neural network and an S3D neural network. The feature information extracted by the 2D network and the time-sequence information extracted by the 3D network are spliced by channel fusion; the weight of each channel is then obtained through the self-attention mechanism, giving the importance of the feature map on each channel, and the relation between the feature maps of different channels is obtained through a Gram matrix. The self-attention mechanism effectively strengthens the feature information and has a good effect on behavior classification.
The backbone network of the self-attention double-flow network comprises the Darknet19 neural network, the S3D neural network and the self-attention model. The Darknet19 neural network is a combination of convolutional and pooling layers; the S3D neural network links its convolution and pooling layers through several Inception blocks; the self-attention model increases the importance of salient features.
The Inception blocks of the S3D neural network all adopt a pseudo-3D structure.
When the self-attention model is used, the fused 2D and 3D network feature map A is obtained first, where A ∈ R^((c'+c'')×H×W). After fusion, B is obtained, with B ∈ R^(c×H×W). Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
after the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation; F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information, according to the following formulas:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
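Equations (1)-(9) admit a direct reading in code. The sketch below is an illustrative, batch-free PyTorch rendering of the formulas as written above, not the patent's implementation:

import torch
import torch.nn.functional as F

def channel_self_attention(B):
    # B: fused feature map of shape (c, H, W); returns the new feature map C.
    c, H, W = B.shape
    Fm = B.reshape(c, H * W)                 # (1) dimension transform, N = H×W
    E = Fm @ Fm.t()                          # (2) Gram matrix between channels
    M = F.softmax(E, dim=1)                  # (3)-(4) attention map via softmax
    G = (M @ Fm).reshape(c, H, W)            # (5)-(6) attention applied to F, reshaped to G'
    K = B.mean(dim=(1, 2), keepdim=True)     # (7) spatial global pooling -> channel weights
    Q = K * B                                # (8) weighted feature maps
    return G + Q                             # (9) C = G' + Q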
When the self-attention model is used, the fused 2D and 3D network feature map A is obtained first, and B is obtained after fusion. Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F. F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information.
The formulas used to obtain F after dimension transformation of B are as follows:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
in the process of obtaining the weighted feature information, the formulas adopted are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
Compared with the prior art, the invention has the following technical effects:
compared with the existing neural network model, the new neural network model, namely the self-attention double-flow network, can find out the importance degree of the characteristics of different channels of the channel, and has more targeted training during network training, so that the final prediction effect is better; the self-attention double-flow network solves the problem of low relevance of characteristic information among channels, and meanwhile, useful characteristic information is strengthened according to a self-attention mechanism.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
fig. 1 is a schematic diagram of a self-attention dual-flow network according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a Darknet19 neural network structure.
FIG. 3 is a Block1 schematic diagram of a Darknet19 neural network structure.
FIG. 4 is a Block2 schematic diagram of a Darknet19 neural network structure.
Fig. 5 is a schematic structural diagram of the S3D neural network.
Fig. 6 is a schematic structural diagram of the Inception 1 block of the S3D neural network structure.
Fig. 7 is a schematic structural diagram of the Inception 2 block of the S3D neural network structure.
Fig. 8 is a schematic diagram of a self-attention mechanism.
Detailed Description
A fall detection method based on a self-attention double-flow network comprises the following steps:
step 1: dividing the behavior and actions of a person into falling, running, jumping, walking, standing and lying, collecting image data of the behavior and actions of the person, and forming a sample data set of the behavior and actions;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
As shown in fig. 1, the backbone network of the self-attention dual-flow network includes two branches, the first branch is a Darknet19 neural network, and the second branch is an S3D neural network.
The S3D neural network connects convolution, pooling, convolution and pooling in sequence, then two Inception 1 blocks and one max-pooling layer, then five Inception 1 blocks and one max-pooling layer, and finally two Inception 2 blocks. The self-attention model takes the outputs of the two branch networks as input, fuses the information on the channels using channel fusion, and obtains an attention feature map through the attention mechanism. The final output passes through one convolution layer.
The neural network model of the invention uses the S3D neural network. It first obtains a time sequence of video frames, then obtains shallow time-sequence information through convolution, pooling, convolution and pooling, and extracts deeper feature information through several Inception blocks. The Inception 1 block structure is shown in FIG. 6; it comprises four branches, each of which is a pseudo-3D convolution. After feature extraction, the concat function splices the features of the four branches together, and a pooling layer then halves the number of features, reducing the size of the merged feature map so that it can be sent to the next Inception 1 block.
The first branch of the Inception 1 block contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence. Through the multiple branches, richer feature information in the video can be extracted. Fig. 7 shows the Inception 2 block, which also comprises four branches but differs from the Inception 1 block in its middle two branches, which reduce the processing of time-sequence information. The Inception block modules use pseudo-3D convolution kernels which, compared with full 3D kernels, greatly reduce the number of parameters; pseudo-3D convolution benefits single-stage behavior-prediction network models considerably and improves the real-time performance of network detection.
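The parameter saving of the pseudo-3D factorization can be checked numerically. The sketch below compares a full k×k×k 3D kernel with a 1×k×k spatial plus k×1×1 temporal pair; the channel counts are arbitrary:

import torch.nn as nn

c_in, c_out, k = 192, 192, 3

full_3d = nn.Conv3d(c_in, c_out, kernel_size=(k, k, k), padding=k // 2)
pseudo_3d = nn.Sequential(
    nn.Conv3d(c_in, c_out, kernel_size=(1, k, k), padding=(0, k // 2, k // 2)),  # spatial part
    nn.Conv3d(c_out, c_out, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0)))      # temporal part

n_full = sum(p.numel() for p in full_3d.parameters())
n_pseudo = sum(p.numel() for p in pseudo_3d.parameters())
print(n_full, n_pseudo)   # the factorized pair needs roughly (k² + k)/k³ of the full kernel's weights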
The neural network model adopts a Darknet19 neural network comprising 19 convolutional layers and 5 pooling layers; its structure is shown in fig. 2, and it serves as the 2D branch network of the neural network model of the invention. The Block1 and Block2 modules, shown in FIG. 3 and FIG. 4, are multilayer convolutions. The Darknet19 neural network acquires the spatial features of key frames in the video. Compared with performing behavior recognition with a 3D network alone, extracting key frames with a 2D network makes the features extracted by the whole network richer, benefits the spatio-temporal behavior detection problem greatly, allows the specific position of a person in the video to be localized well, and, owing to its simpler structure, benefits real-time detection in video.
After the branch network models of the double-flow network extract features separately, the spatial and time-sequence information is fused; however, much noise is introduced, and the fused feature information may not correctly represent the features required for training. Meanwhile, the feature information between channels is mutually independent and unrelated. Through the attention mechanism, the network model can therefore focus on and train useful features, obtaining more effective features and predicting the behavior of people in the video more accurately.
In step 2, the self-attention double-flow network adopts the Darknet19 neural network and S3D. The outputs of the 2D network and the 3D network are spliced together by channel fusion: the spatial feature information of video key frames is extracted by the Darknet19 neural network, the time-sequence feature information of the video is extracted by S3D, and channel fusion aggregates the spatial and time-sequence information at a deep level. Finally, the results are classified, and bounding-box regression is performed, through the convolution layers.
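The channel-fusion step itself is a concatenation along the channel axis; a minimal sketch with illustrative shapes:

import torch

spatial = torch.randn(1, 256, 14, 14)     # key-frame features from the Darknet19 branch
temporal = torch.randn(1, 256, 14, 14)    # S3D features with the time axis already collapsed
fused = torch.cat([spatial, temporal], dim=1)   # A ∈ R^((c'+c'')×H×W)
print(fused.shape)                        # torch.Size([1, 512, 14, 14])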
Preferably, the S3D neural network of the self-attention double-flow network comprises sequentially connected convolution, pooling, convolution and pooling, followed by two Inception 1 blocks, a max-pooling layer, five Inception 1 blocks, another max-pooling layer, and finally two Inception 2 blocks; the 3D convolutions inside the blocks are factorized into 2D-plus-1D (pseudo-3D) form, and each block is divided into four branches.
The self-attention network used by the neural network model of the invention is shown in fig. 8. The features obtained by channel fusion simply splice the spatial and time-sequence features along the channels; they cannot represent the relevance between the channels, nor can the importance of the feature information stored in each channel be known. Consequently, the fused features contain much noise, and the information in each channel is independent of the others. Both problems can be solved by the attention mechanism. First, the fused 2D and 3D network feature map A is obtained, where A ∈ R^((c'+c'')×H×W). After fusion, B is obtained, with B ∈ R^(c×H×W). Let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
In the formulas, B represents the feature map obtained by fusing the 2D features and the 3D features, E represents the attention map of the feature maps between the channels, M represents the attention map after mapping, and G represents the effect of the attention map on the original feature map.
After the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation, and F is multiplied by F^T to obtain a Gram matrix; the attention map M is then obtained through a softmax function. M represents the relevance among the different feature maps across the channels and deepens the information fusion between the channels. In this way the relation between different features across channels is strengthened, and new feature information can be obtained at a deep level, so that the channels obtain better performance.
For the importance of the different feature maps within the channels, the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information. The formulas are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
In the formulas, K represents the channel weight attention map, Q represents the weighted feature maps, v represents a single feature map, and C represents the new feature map.
In an embodiment, the parameters of a self-attention dual-flow network for fall detection are shown in table 1.
TABLE 1 Parameter table of the deep-learning recognition unit
(Table 1 is reproduced as an image in the original publication.)
In the embodiment, double-flow networks without and with the self-attention mechanism were each used in fall detection experiments; the results are shown in table 2.
TABLE 2 evaluation index comparison table for common double-flow network and self-attention double-flow network
Network                              Precision   Recall
Common double-flow network           62.60%      92.80%
Self-attention double-flow network   64.02%      93.20%
The evaluation indexes in table 2 show that the self-attention double-flow network performs better than the double-flow network without self-attention: it strengthens the correlation among the channel feature maps, making the feature information richer, and the weighted feature maps reveal the importance of each feature map to the overall features. The precision of the self-attention double-flow network is 64.02% and its recall rate is 93.20%, improvements of 1.42% and 0.4% respectively over the common double-flow network. All indexes of the self-attention double-flow network improve on the common double-flow network, showing that it acquires richer spatio-temporal information from the video.
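For reference, the two reported indexes follow from detection counts in the usual way; the counts below are placeholders chosen only to roughly reproduce the reported figures, not the experimental data:

def precision_recall(tp, fp, fn):
    # precision = TP/(TP+FP); recall = TP/(TP+FN)
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=932, fp=524, fn=68)   # illustrative counts only
print(f"precision={p:.2%}, recall={r:.2%}")      # ≈64.0%, 93.2%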

Claims (10)

1. A fall detection method based on a self-attention double-flow network is characterized by comprising the following steps:
step 1: acquiring image data of the human behavior action to form a sample data set of the behavior action;
step 2: constructing a self-attention double-flow network for detecting behavior actions;
step 3: optimizing the self-attention double-flow network by adopting a focal loss function that balances simple, easily classified samples;
step 4: training and testing the self-attention double-flow network with the sample data set until the required detection precision is reached;
step 5: acquiring real-time images of people, inputting them into the trained self-attention double-flow network, and detecting whether a fall has occurred.
2. The method according to claim 1, wherein in step 1, the behavioral-action classification of the person comprises falling, running, jumping, walking, standing, lying;
in step 1, the sample data set is proportionally divided into a training set and a test set.
3. The method according to claim 1, wherein in step 2, the backbone network of the constructed self-attention double-flow network is divided into two branches: the first branch comprises a Darknet19 neural network and the second branch comprises an S3D neural network, with the specific structure: input layer → Darknet19 neural network → self-attention model; input layer → S3D neural network → self-attention model; the self-attention model takes the outputs of the networks in the two branches as input, fuses the information on the channels using channel fusion, obtains an attention feature map through the attention mechanism, and finally outputs through one convolution layer to obtain the output layer.
4. The method of claim 3, wherein in step 2, the structure of the S3D neural network is: convolution, pooling, convolution and pooling connected in sequence, followed by two Inception 1 blocks and a max-pooling layer, then five Inception 1 blocks and a max-pooling layer, and finally two Inception 2 blocks.
5. The method of claim 4, wherein the S3D neural network is used by first acquiring a time sequence of video frames, then obtaining shallow time-sequence information through convolution, pooling, convolution and pooling, extracting deeper feature information through multiple Inception 1 blocks, then reducing the number of features by pooling so as to reduce the size of the combined feature map for transmission to the next Inception 1 block, and then passing through multiple Inception 2 blocks after the multiple Inception 1 blocks.
6. The method according to claim 4, wherein the Inception 1 block has the specific structure: four branches, where the first branch contains only one convolution kernel operating on the time sequence; the second and third branches first reduce dimensionality at the spatial positions and then in the time sequence; the fourth branch first applies max pooling in the spatial dimensions and then processes the time sequence; through the multiple branches, richer feature information in the video can be extracted.
7. The method according to claim 3, wherein in step 2, the Darknet19 neural network is structured by sequentially connecting 2D convolution, pooling, Block1, pooling, Block1, pooling, Block2, pooling, Block2, 2D convolution; wherein, both Block1 and Block2 are multilayer convolution structures.
8. A method according to one of the claims 3 to 7, characterized in that, when the self-attention model is in use,
the fused 2D and 3D network feature map A is obtained first, where A ∈ R^((c'+c'')×H×W); after fusion, B is obtained, with B ∈ R^(c×H×W); let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F, according to the following formulas:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
after the fused feature map B is obtained, it is converted into a two-dimensional matrix F by dimension transformation, and F is multiplied by F^T to obtain a Gram matrix; the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information, according to the following formulas:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
9. A self-attention model, characterized in that, when it is used, the fused 2D and 3D network feature map A is obtained first, and B is obtained after fusion; let B = [v_1, v_2, ..., v_c] and N = H×W; B is then dimension-transformed to obtain F; F is multiplied by F^T to obtain a Gram matrix, and the attention map M is then obtained through a softmax function;
the weight of each channel is obtained by performing global pooling on the feature maps in space, so as to measure the importance of the feature maps on different channels and finally obtain weighted feature information.
10. The self-attention model according to claim 9, wherein the formulas used to obtain F after dimension transformation of B are as follows:
B ∈ R^(c×H×W) → F ∈ R^(c×N) (1)
E = F × F^T (2)
e_ji = f_j · f_i, where f_i denotes the i-th row of F (3)
m_ji = exp(e_ji) / Σ_{k=1}^{c} exp(e_jk) (4)
G = M × F (5)
G ∈ R^(c×N) → G' ∈ R^(c×H×W) (6)
wherein B represents: the feature map obtained by fusing the 2D features and the 3D features; E represents: the attention map of the feature maps between the channels; M represents: the attention map after mapping; G represents: the effect of the attention map on the original feature map;
in the process of obtaining the weighted feature information, the formulas adopted are as follows:
k_i = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i(h, w), K = [k_1, k_2, ..., k_c] (7)
Q = K × B (8)
C = G' + Q (9)
wherein K represents: the channel weight attention map; Q represents: the weighted feature maps; v represents: a single feature map; C represents: the new feature map.
CN202210033684.XA 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network Pending CN114463844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210033684.XA CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210033684.XA CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Publications (1)

Publication Number Publication Date
CN114463844A true CN114463844A (en) 2022-05-10

Family

ID=81409804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210033684.XA Pending CN114463844A (en) 2022-01-12 2022-01-12 Fall detection method based on self-attention double-flow network

Country Status (1)

Country Link
CN (1) CN114463844A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024103682A1 * 2022-11-14 2024-05-23 天地伟业技术有限公司 Fall behavior identification method based on video classification and electronic device

Similar Documents

Publication Publication Date Title
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
Liong et al. Shallow triple stream three-dimensional cnn (ststnet) for micro-expression recognition
Pan et al. Deepfake detection through deep learning
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
CN111079646A (en) Method and system for positioning weak surveillance video time sequence action based on deep learning
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
KR102309111B1 (en) Ststem and method for detecting abnomalous behavior based deep learning
CN114596520A (en) First visual angle video action identification method and device
CN113435432B (en) Video anomaly detection model training method, video anomaly detection method and device
CN113963304B (en) Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN116824625A (en) Target re-identification method based on generation type multi-mode image fusion
CN114170537A (en) Multi-mode three-dimensional visual attention prediction method and application thereof
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN114463844A (en) Fall detection method based on self-attention double-flow network
US20240177525A1 (en) Multi-view human action recognition method based on hypergraph learning
Hatay et al. Learning to detect phone-related pedestrian distracted behaviors with synthetic data
Zhang et al. Multi-scale spatiotemporal feature fusion network for video saliency prediction
CN117671787A (en) Rehabilitation action evaluation method based on transducer
CN117238034A (en) Human body posture estimation method based on space-time transducer
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN116543339A (en) Short video event detection method and device based on multi-scale attention fusion
Tang et al. A multi-task neural network for action recognition with 3D key-points
CN114419729A (en) Behavior identification method based on light-weight double-flow network
CN113361475A (en) Multi-spectral pedestrian detection method based on multi-stage feature fusion information multiplexing
CN113824989A (en) Video processing method and device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination