CN113033276A - Behavior recognition method based on conversion module
- Publication number: CN113033276A (application CN202011383635.6A, China)
- Legal status: Granted
Classifications
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/048: Activation functions
- G06N3/08: Learning methods
- G06V10/32: Normalisation of the pattern dimensions
- G06V10/40: Extraction of image or video features
Abstract
The invention discloses a behavior recognition method based on a conversion module, and relates to the field of human action recognition. The behavior recognition method based on a conversion module first reads consecutive frame images and constructs a mask; it then constructs the input data of the conversion module, including acquiring the input data of the conversion module and a position mask matrix mask operation; action recognition by the conversion module then comprises a data preprocessing operation and, after the continuous coding module, data processing that yields the action detection result; finally, the cross-entropy loss between the class detection result and the class label target is calculated and the network parameters are optimized. The method uses the conversion module employed in natural language understanding to extract the spatio-temporal features of consecutive frame images, and only the conversion module is used in the whole recognition process, which reduces the number of parameters of the method, lowers the overall computational cost, and increases the action recognition speed.
Description
Technical Field
The invention relates to the field of human body action recognition, in particular to a behavior recognition method based on a conversion module.
Background
Action recognition analyzes the action content of a video and classifies the action type by extracting action features from consecutive video frames; it can therefore help improve the monitoring of dangerous behaviors in key areas and prevent potentially dangerous behaviors from occurring.
The Chinese patent CN202010708119.X provides an efficient unsupervised cross-domain action recognition method (CAFCCN) based on channel fusion and classifier adversarial training, which addresses the problem that the training set of the target data set is unlabeled and achieves accurate recognition on the target-domain test set by using the information of the source-domain data set and of the unlabeled target-domain training set. Its disadvantage is that four deep residual network models are needed to extract the source-domain and target-domain optical-flow map features separately, and several fusion modules are needed to fuse those features, so the whole algorithm has many model parameters and a large overall computational cost.
The patent 201810431650.X discloses a temporal action recognition method based on deep learning. Aiming at the insufficient effectiveness of long-action feature expression during boundary detection, inter-frame and intra-frame information are extracted simultaneously by a two-stream network to obtain the feature sequence of a video unit; a multi-scale short-action-segment extraction scheme that combines context information is proposed, which effectively improves the subsequent regression accuracy; and a temporal boundary model is trained with the feature sequence, which shortens model training time and improves computational efficiency. Its disadvantage is that interval frame images are fed directly into the action recognition network: when the device operates in a complex environment with multiple targets, the different actions of different targets affect the action detection result for the whole image, and action recognition cannot be performed for each individual target. Moreover, the method adopts a two-stream network to extract inter-frame and intra-frame information simultaneously, and 3D convolution is inevitably used to obtain the consecutive-frame image features along the time axis, which increases the computational cost of the model, lengthens its training period, and increases the amount of sample searching.
Chinese patent CN202010840934.1 discloses a behavior recognition method for strongly dynamic videos, which, according to the data distribution characteristics of the data set, uses the optical-flow branch of a traditional two-stream model as a teacher model to assist the training of the RGB branch; the RGB branch takes the RGB image frames of the whole source video as input, the optical-flow branch takes the optical-flow image frames of the whole source video as input, and the optical-flow image frames are obtained from the RGB image frames by optical-flow computation; joint inference of the RGB branch and the optical-flow branch then realizes behavior recognition in the video. In that patent the RGB branch and the optical-flow branch are trained with different configurations, and compared with the traditional two-stream algorithm the dynamic recognition is configurable, so the adaptability is strong. Through reinforced optical-flow feature learning the method takes the characteristics of strongly dynamic behavior videos into account, propagates the optical-flow information in multiple stages, obtains sufficient motion features, and improves recognition accuracy. However, that patent still uses inflated 3D convolution to obtain the temporal features of the optical flow and 2D convolution to obtain the spatial features, needs two different networks to complete the action recognition task, and does not solve the problems of the large computational cost and the poor portability of 3D convolution networks.
Chinese patent 201910064529.2 discloses a behavior recognition system based on an attention mechanism, which uses a channel attention module to extract inter-channel feature codes for action prediction. However, the attention module of that patent is still built from a combination of three-dimensional and two-dimensional convolutions, and does not solve the problem that 3D convolution models have a large number of parameters and a high computational cost.
Classic action recognition methods are based on 3D convolution and optical-flow approaches, which are used to extract consecutive-frame features along the time axis, obtain the dependencies between consecutive frames over time, and improve action recognition accuracy.
Compared with 2D convolution, 3D convolution must extract consecutive-frame features in three dimensions, so the number of parameters of a 3D convolution model increases, the computational cost grows, and the training period becomes longer. Moreover, as a newer kind of operation, 3D convolution is poorly supported under some deep-learning frameworks, which affects the practical applicability of 3D-convolution-based action recognition algorithms.
Optical-flow methods require several cooperating 2D convolution models to extract temporal and spatial features, so the model parameters and the computational cost become excessive; this places high demands on hardware in practical applications and reduces the practical applicability of such methods.
Disclosure of Invention
The invention aims to overcome the above defects and provides a behavior recognition method that uses the conversion module employed in natural language understanding to extract the spatio-temporal features of consecutive frame images and uses only the conversion module in the whole recognition process.
The invention specifically adopts the following technical scheme:
A behavior recognition method based on a conversion module comprises the following steps:
step one, reading consecutive frame images and constructing a mask;
step two, constructing the input data of the conversion module, including acquiring the input data of the conversion module and a position mask matrix mask operation;
step three, action recognition by the conversion module, comprising a data preprocessing operation and, after the continuous coding module, data processing to obtain the action detection result;
and step four, calculating the cross-entropy loss between the class detection result and the class label target, and optimizing the network parameters.
Preferably, reading the consecutive frame images and constructing the mask comprises the following processes:
in temporal order, constructing the input data input from the image data of a continuous clip of 16 frames, where the consecutive-frame image data input is a four-dimensional matrix of dimension input ∈ R^(16×3×H×W), and H, W are the original height and width of a picture;
for each picture of the consecutive-frame input data input, performing a proportional scaling to resize the picture, the data dimension after this operation being as shown in formula (1):
input ∈ R^(16×3×h×w) (1)
where h and w are the height and width of the scaled picture;
reading the key-frame target label information target, which contains the action label;
and constructing a position mask matrix mask, a two-dimensional all-ones matrix of dimension mask ∈ R^(4×4), which marks the positions of the real pictures in the input data.
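As an illustration only, the following minimal sketch (assuming PyTorch; the frame source, the sizes h = w = 64, and the plain resize standing in for the proportional scaling are placeholders not fixed by the description) shows how this step could be realized:

```python
import torch
import torch.nn.functional as F

def read_clip_and_mask(frames, h=64, w=64):
    """frames: a list of 16 consecutive frame tensors shaped (3, H, W), in temporal order."""
    assert len(frames) == 16, "the method uses a clip of 16 consecutive frames"
    clip = torch.stack(frames, dim=0).float()   # input ∈ R^(16×3×H×W)
    # resize every frame to (h, w) -> input ∈ R^(16×3×h×w); the description calls
    # for proportional scaling, approximated here by a plain bilinear resize
    clip = F.interpolate(clip, size=(h, w), mode="bilinear", align_corners=False)
    mask = torch.ones(4, 4)                     # position mask: 4×4 all-ones matrix
    return clip, mask

frames = [torch.rand(3, 240, 320) for _ in range(16)]   # stand-in for decoded video frames
clip, mask = read_clip_and_mask(frames)
print(clip.shape, mask.shape)   # torch.Size([16, 3, 64, 64]) torch.Size([4, 4])
```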
Preferably, acquiring the input data of the conversion module in step two comprises the following processes:
flattening the consecutive-frame image data input (clip = 16) into a two-dimensional matrix, the dimension becoming input ∈ R^(16×d), where d = 3 × h × w;
compressing the flattened consecutive-frame data input with a linear link layer whose number of input channels is d and whose number of output channels is 1024, the compressed consecutive-frame data being given by formula (2):
clip_frame = Linear(input) (2)
where Linear(·) is the linear-link-layer operation and the result clip_frame ∈ R^(16×1024) is a two-dimensional matrix;
constructing a randomly initialized trainable parameter matrix cls_token of dimension cls_token ∈ R^(1×1024);
concatenating the data cls_token and the data clip_frame along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^(1×1024), clip_frame ∈ R^(16×1024) (3)
where Cat(·) denotes the matrix concatenation operation and the result in_data ∈ R^(17×1024) is a two-dimensional matrix;
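A minimal sketch of this construction, assuming PyTorch and the illustrative sizes h = w = 64 used above:

```python
import torch
import torch.nn as nn

h, w = 64, 64
d = 3 * h * w                                   # d = 3 × h × w
embed = nn.Linear(d, 1024)                      # linear link layer: d input channels, 1024 output channels
cls_token = nn.Parameter(torch.randn(1, 1024))  # trainable cls_token ∈ R^(1×1024)

clip = torch.rand(16, 3, h, w)                  # input ∈ R^(16×3×h×w)
flat = clip.reshape(16, d)                      # flatten to a two-dimensional matrix, input ∈ R^(16×d)
clip_frame = embed(flat)                        # clip_frame ∈ R^(16×1024), formula (2)
in_data = torch.cat([cls_token, clip_frame], dim=0)  # in_data ∈ R^(17×1024), formula (3)
print(in_data.shape)                            # torch.Size([17, 1024])
```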
the position mask matrix mask operation comprises the following processes:
flattening the mask matrix mask into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^(4×4) → mask ∈ R^(1×16) (4)
padding the mask matrix to obtain the transformed mask matrix of formula (5):
mask = Pad(mask, (1,0), value=1) (5)
where Pad(·) denotes the padding operation and (1,0) means that one element with value 1 is added at the first position; the output mask dimension transformation is given by formula (6):
mask ∈ R^(1×16) → mask ∈ R^(1×17) (6)
applying a dimension transformation to the data mask to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^(17×1), mask2 ∈ R^(1×17) (7)
and obtaining the new mask input matrix of formula (8):
in_mask = mask1 × mask2 (8)
whose dimension is in_mask ∈ R^(17×17), a two-dimensional matrix.
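The mask construction can be sketched as follows (assuming PyTorch; the reading of formulas (4), (6) and (7) as a flatten, a one-element pad and an outer product is inferred from the stated dimensions):

```python
import torch
import torch.nn.functional as F

mask = torch.ones(4, 4)                 # mask ∈ R^(4×4), all ones
mask = mask.reshape(1, 16)              # flatten to R^(1×16), formula (4)
mask = F.pad(mask, (1, 0), value=1)     # prepend one element of value 1 -> R^(1×17), formulas (5)-(6)
mask1 = mask.reshape(17, 1)             # mask1 ∈ R^(17×1)
mask2 = mask.reshape(1, 17)             # mask2 ∈ R^(1×17)
in_mask = mask1 * mask2                 # in_mask = mask1 × mask2 ∈ R^(17×17), formula (8)
print(in_mask.shape)                    # torch.Size([17, 17])
```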
Preferably, the data preprocessing operation in step three comprises the following processes:
constructing a randomly initialized trainable parameter matrix pos_embedding of dimension pos_embedding ∈ R^(17×1024), adding it to the input data in_data, and applying a dropout layer, the output result x being given by formula (9):
x = Dropout(pos_embedding + in_data, dropout=0.1), x ∈ R^(17×1024) (9)
where Dropout(·) denotes the dropout-layer operation and the dropout factor is 0.1;
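A sketch of this preprocessing step, assuming PyTorch:

```python
import torch
import torch.nn as nn

pos_embedding = nn.Parameter(torch.randn(17, 1024))  # trainable pos_embedding ∈ R^(17×1024)
drop = nn.Dropout(p=0.1)                             # dropout factor 0.1

in_data = torch.rand(17, 1024)                       # conversion-module input from step two
x = drop(pos_embedding + in_data)                    # x ∈ R^(17×1024), formula (9)
```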
the continuous coding module is composed of depth = 6 structurally identical basic coding modules connected in series, and the calculation process of a basic coding module is as follows:
the basic design parameters of the basic coding module are: number of input data channels dim = 1024, number of intermediate-layer data channels mlp_dim = 2048, number of parallel attention heads heads = 8, and dropout coefficient dropout = 0.1;
1) data normalization processing
normalizing the input data x, the new data obtained being expressed as formula (10):
x_out = Norml(x_in), x_out ∈ R^(17×1024) (10)
where Norml(·) denotes the normalization processing; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. linear link layer channel expansion:
the input data channel is dim = 1024 and the expanded data channel is out_dim = dim × 3 = 3072, the transformation process being expressed as formula (11):
x_out1 = Linear(x_in, dim=1024, out_dim=3072) (11)
where Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after the processing, the data dimension change being expressed as formula (12):
x_in ∈ R^(17×1024) → x_out1 ∈ R^(17×3072) (12)
b. constructing the q, k, v data:
splitting x_out1 into the three matrices q, k and v and reshaping each of them for the heads = 8 parallel attention heads, as shown in formula (13):
q, k, v ∈ R^(8×17×128) (13)
multiplying the matrices q and k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^(8×17×17) (14)
where T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^(17×17), the positions of the product of the matrices q and k, x_out4 ∈ R^(8×17×17), where the mask result is 0 are replaced with value = 1e-9, the calculation process being expressed as formula (15):
x_out5 = softmax(Mask(x_out4, value=1e-9)), x_out5 ∈ R^(8×17×17) (15)
where Mask(·) denotes the masking operation and softmax(·) is the softmax activation layer of the neural network;
multiplying the output result x_out5 by the data v and reshaping to obtain the output expressed by formula (16):
x_out6 = Tranf(x_out5 · v), x_out5 ∈ R^(8×17×17), v ∈ R^(8×17×128), x_out6 ∈ R^(17×1024) (16)
where Tranf(·) denotes the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^(17×1024)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel dim = 1024, and Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the parallel attention operation a residual operation is performed, the module output obtained being formula (17):
x_out = x_in + x_out7, x_in ∈ R^(17×1024), x_out7 ∈ R^(17×1024), x_out ∈ R^(17×1024) (17);
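As a non-authoritative illustration, the parallel attention operation of formulas (10)-(17) could be sketched as follows (assuming PyTorch; the class name MaskedAttention, the use of LayerNorm for Norml(·), the 1/sqrt(head_dim) scaling and the large negative fill value are choices made for this sketch rather than details stated in the description):

```python
import torch
import torch.nn as nn

class MaskedAttention(nn.Module):
    def __init__(self, dim=1024, heads=8, dropout=0.1):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads   # 8 parallel heads of 128 channels
        self.norm = nn.LayerNorm(dim)                     # Norml(·), formula (10)
        self.to_qkv = nn.Linear(dim, dim * 3)             # channel expansion 1024 -> 3072, formulas (11)-(12)
        self.proj = nn.Linear(dim, dim)                   # output linear transformation, 1024 -> 1024
        self.drop = nn.Dropout(dropout)

    def forward(self, x_in, in_mask):                     # x_in ∈ R^(17×1024), in_mask ∈ R^(17×17)
        n, dim = x_in.shape
        q, k, v = self.to_qkv(self.norm(x_in)).chunk(3, dim=-1)
        # reshape each of q, k, v to R^(8×17×128), formula (13)
        q, k, v = (t.reshape(n, self.heads, self.head_dim).transpose(0, 1) for t in (q, k, v))
        attn = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # q·k^T ∈ R^(8×17×17); the scaling is an assumption
        # mask replacement: positions where the mask is 0 are overwritten before the
        # softmax (the description writes value = 1e-9; a large negative value is used here)
        attn = torch.softmax(attn.masked_fill(in_mask == 0, -1e9), dim=-1)   # formula (15)
        out = (attn @ v).transpose(0, 1).reshape(n, dim)  # Tranf(x_out5 · v) ∈ R^(17×1024), formula (16)
        out = self.drop(self.proj(out))                   # linear 1024 -> 1024 + dropout
        return x_in + out                                 # residual operation, formula (17)
```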
3) feed-forward network data processing
the feed-forward network data processing performs the following operations on the data obtained after the parallel attention operation, this part of the input data being x_in ∈ R^(17×1024), with the following sequence of processing steps:
linear processing as in formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^(17×2048) (18)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^(17×2048) (19)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^(17×2048) (20)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing is shown in formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^(17×1024) (21)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel dim = 1024;
the dropout layer operates as shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^(17×1024) (22)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the feed-forward network data processing a residual operation is adopted, the final output data obtained being shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^(17×1024), x_out5 ∈ R^(17×1024), x_out ∈ R^(17×1024) (23);
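A sketch of the feed-forward processing of formulas (18)-(23) and of a basic coding module combining it with the attention sketch above (assuming PyTorch; FeedForward and EncoderBlock are hypothetical names, and MaskedAttention refers to the earlier sketch):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim=1024, mlp_dim=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, mlp_dim),   # 1024 -> 2048, formula (18)
            nn.GELU(),                 # GELU activation, formula (19)
            nn.Dropout(dropout),       # formula (20)
            nn.Linear(mlp_dim, dim),   # 2048 -> 1024, formula (21)
            nn.Dropout(dropout),       # formula (22)
        )

    def forward(self, x_in):
        return x_in + self.net(x_in)   # residual operation, formula (23)

class EncoderBlock(nn.Module):
    """One basic coding module: parallel attention followed by the feed-forward network."""
    def __init__(self, dim=1024, mlp_dim=2048, heads=8, dropout=0.1):
        super().__init__()
        self.attn = MaskedAttention(dim, heads, dropout)
        self.ff = FeedForward(dim, mlp_dim, dropout)

    def forward(self, x, in_mask):
        return self.ff(self.attn(x, in_mask))
```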
and carrying out data processing on the data after it passes through the continuous coding module to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^(17×1024), x_out ∈ R^(1×1024) (24)
the output data then undergoes the following sequence of operations, beginning with formula (25):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^(1×1024) (25)
where Norml(·) denotes the normalization processing;
linear processing as in formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^(1×2048) (26)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^(1×2048) (27)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^(1×2048) (28)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing as in formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^(1×num_class) (29)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel num_class, the number of classes; the activation function layer is formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^(1×num_class) (30)
where softmax(·) denotes the softmax activation function, which yields the final action recognition result.
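A minimal sketch of this post-coding processing and classification head (assuming PyTorch; ClassificationHead is a hypothetical name and num_class = 10 is a placeholder):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, dim=1024, mlp_dim=2048, num_class=10, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)             # Norml(·), formula (25)
        self.fc1 = nn.Linear(dim, mlp_dim)        # 1024 -> 2048, formula (26)
        self.act = nn.GELU()                      # formula (27)
        self.drop = nn.Dropout(dropout)           # formula (28)
        self.fc2 = nn.Linear(mlp_dim, num_class)  # 2048 -> num_class, formula (29)

    def forward(self, x_in):                      # x_in ∈ R^(17×1024): continuous-coding-module output
        x = x_in[0:1]                             # keep the cls_token row, x ∈ R^(1×1024), formula (24)
        x = self.drop(self.act(self.fc1(self.norm(x))))
        return torch.softmax(self.fc2(x), dim=-1) # formula (30): class probabilities

head = ClassificationHead(num_class=10)
print(head(torch.rand(17, 1024)).shape)           # torch.Size([1, 10])
```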
The invention has the following beneficial effects:
the method realizes continuous frame image action identification based on continuous feature extraction.
In the method the conversion-module feature extractor replaces the 3D convolution network, which removes the large computational cost of 3D convolution network models and improves the parallel computing capability of the model on a GPU; at the same time, the conversion module is composed of the most basic operators, which improves the portability of the model for migration and deployment and avoids the compatibility problems that arise when the model is converted or deployed.
Drawings
FIG. 1 is a block flow diagram of a translation module-based behavior recognition method;
FIG. 2 is a conversion module diagram;
fig. 3 is a diagram of a basic coding module structure.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings:
With reference to figures 1-3, the behavior recognition method based on the conversion module comprises the following steps:
step one, reading the consecutive frame images and constructing the mask, which comprises the following processes:
in temporal order, constructing the input data input from the image data of a continuous clip of 16 frames, where the consecutive-frame image data input is a four-dimensional matrix of dimension input ∈ R^(16×3×H×W), and H, W are the original height and width of a picture;
for each picture of the consecutive-frame input data input, performing a proportional scaling to resize the picture, the data dimension after this operation being as shown in formula (1):
input ∈ R^(16×3×h×w) (1)
where h and w are the height and width of the scaled picture;
reading the key-frame target label information target, which contains the action label;
and constructing a position mask matrix mask, a two-dimensional all-ones matrix of dimension mask ∈ R^(4×4), which marks the positions of the real pictures in the input data.
Step two, constructing the input data of the conversion module, including acquiring the input data of the conversion module and the position mask matrix mask operation, where acquiring the input data of the conversion module comprises the following processes:
flattening the consecutive-frame image data input (clip = 16) into a two-dimensional matrix, the dimension becoming input ∈ R^(16×d), where d = 3 × h × w;
compressing the flattened consecutive-frame data input with a linear link layer whose number of input channels is d and whose number of output channels is 1024, the compressed consecutive-frame data being given by formula (2):
clip_frame = Linear(input) (2)
where Linear(·) is the linear-link-layer operation and the result clip_frame ∈ R^(16×1024) is a two-dimensional matrix;
constructing a randomly initialized trainable parameter matrix cls_token of dimension cls_token ∈ R^(1×1024);
concatenating the data cls_token and the data clip_frame along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^(1×1024), clip_frame ∈ R^(16×1024) (3)
where Cat(·) denotes the matrix concatenation operation and the result in_data ∈ R^(17×1024) is a two-dimensional matrix;
the position mask matrix mask operation comprises the following processes:
flattening the mask matrix mask into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^(4×4) → mask ∈ R^(1×16) (4)
padding the mask matrix to obtain the transformed mask matrix of formula (5):
mask = Pad(mask, (1,0), value=1) (5)
where Pad(·) denotes the padding operation and (1,0) means that one element with value 1 is added at the first position; the output mask dimension transformation is given by formula (6):
mask ∈ R^(1×16) → mask ∈ R^(1×17) (6)
applying a dimension transformation to the data mask to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^(17×1), mask2 ∈ R^(1×17) (7)
and obtaining the new mask input matrix of formula (8):
in_mask = mask1 × mask2 (8)
whose dimension is in_mask ∈ R^(17×17), a two-dimensional matrix.
Step three, action recognition by the conversion module, comprising a data preprocessing operation and, after the continuous coding module, data processing that yields the action detection result; the data preprocessing operation comprises the following processes:
constructing a randomly initialized trainable parameter matrix pos_embedding of dimension pos_embedding ∈ R^(17×1024), adding it to the input data in_data, and applying a dropout layer, the output result x being given by formula (9):
x = Dropout(pos_embedding + in_data, dropout=0.1), x ∈ R^(17×1024) (9)
where Dropout(·) denotes the dropout-layer operation and the dropout factor is 0.1;
the continuous coding module is composed of depth = 6 structurally identical basic coding modules connected in series, and the calculation process of a basic coding module is as follows:
the basic design parameters of the basic coding module are: number of input data channels dim = 1024, number of intermediate-layer data channels mlp_dim = 2048, number of parallel attention heads heads = 8, and dropout coefficient dropout = 0.1;
1) data normalization processing
normalizing the input data x, the new data obtained being expressed as formula (10):
x_out = Norml(x_in), x_out ∈ R^(17×1024) (10)
where Norml(·) denotes the normalization processing; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. linear link layer channel expansion:
the input data channel is dim = 1024 and the expanded data channel is out_dim = dim × 3 = 3072, the transformation process being expressed as formula (11):
x_out1 = Linear(x_in, dim=1024, out_dim=3072) (11)
where Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after the processing, the data dimension change being expressed as formula (12):
x_in ∈ R^(17×1024) → x_out1 ∈ R^(17×3072) (12)
b. constructing the q, k, v data:
splitting x_out1 into the three matrices q, k and v and reshaping each of them for the heads = 8 parallel attention heads, as shown in formula (13):
q, k, v ∈ R^(8×17×128) (13)
multiplying the matrices q and k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^(8×17×17) (14)
where T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^(17×17), the positions of the product of the matrices q and k, x_out4 ∈ R^(8×17×17), where the mask result is 0 are replaced with value = 1e-9, the calculation process being expressed as formula (15):
x_out5 = softmax(Mask(x_out4, value=1e-9)), x_out5 ∈ R^(8×17×17) (15)
where Mask(·) denotes the masking operation and softmax(·) is the softmax activation layer of the neural network;
multiplying the output result x_out5 by the data v and reshaping to obtain the output expressed by formula (16):
x_out6 = Tranf(x_out5 · v), x_out5 ∈ R^(8×17×17), v ∈ R^(8×17×128), x_out6 ∈ R^(17×1024) (16)
where Tranf(·) denotes the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^(17×1024)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel dim = 1024, and Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the parallel attention operation a residual operation is performed, the module output obtained being formula (17):
x_out = x_in + x_out7, x_in ∈ R^(17×1024), x_out7 ∈ R^(17×1024), x_out ∈ R^(17×1024) (17);
3) feed-forward network data processing
the feed-forward network data processing performs the following operations on the data obtained after the parallel attention operation, this part of the input data being x_in ∈ R^(17×1024), with the following sequence of processing steps:
linear processing as in formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^(17×2048) (18)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^(17×2048) (19)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^(17×2048) (20)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing is shown in formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^(17×1024) (21)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel dim = 1024;
the dropout layer operates as shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^(17×1024) (22)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the feed-forward network data processing a residual operation is adopted, the final output data obtained being shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^(17×1024), x_out5 ∈ R^(17×1024), x_out ∈ R^(17×1024) (23);
and carrying out data processing on the data after it passes through the continuous coding module to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^(17×1024), x_out ∈ R^(1×1024) (24)
the output data then undergoes the following sequence of operations, beginning with formula (25):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^(1×1024) (25)
where Norml(·) denotes the normalization processing;
linear processing as in formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^(1×2048) (26)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^(1×2048) (27)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^(1×2048) (28)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing as in formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^(1×num_class) (29)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel num_class, the number of classes; the activation function layer is formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^(1×num_class) (30)
where softmax(·) denotes the softmax activation function, which yields the final action recognition result.
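Putting the pieces together, a minimal end-to-end sketch of the recognizer described above could look as follows (assuming PyTorch and reusing the hypothetical MaskedAttention, FeedForward, EncoderBlock and ClassificationHead classes from the earlier sketches; the sizes h = w = 64 and num_class = 10 remain placeholders):

```python
import torch
import torch.nn as nn

class ConversionModuleRecognizer(nn.Module):
    def __init__(self, h=64, w=64, dim=1024, depth=6, num_class=10):
        super().__init__()
        d = 3 * h * w
        self.embed = nn.Linear(d, dim)                           # flatten + linear compression, formula (2)
        self.cls_token = nn.Parameter(torch.randn(1, dim))       # cls_token ∈ R^(1×dim)
        self.pos_embedding = nn.Parameter(torch.randn(17, dim))  # pos_embedding ∈ R^(17×dim)
        self.drop = nn.Dropout(0.1)
        self.blocks = nn.ModuleList(EncoderBlock(dim) for _ in range(depth))  # 6 basic coding modules in series
        self.head = ClassificationHead(dim, num_class=num_class)

    def forward(self, clip, in_mask):                            # clip ∈ R^(16×3×h×w), in_mask ∈ R^(17×17)
        x = self.embed(clip.reshape(16, -1))                     # clip_frame ∈ R^(16×dim)
        x = torch.cat([self.cls_token, x], dim=0)                # in_data ∈ R^(17×dim), formula (3)
        x = self.drop(self.pos_embedding + x)                    # formula (9)
        for block in self.blocks:
            x = block(x, in_mask)
        return self.head(x)                                      # class probabilities ∈ R^(1×num_class)
```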
And step four, calculating the cross entropy loss of the class detection result and the class label target, and optimizing the network parameters.
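As an illustration of step four only, a single training step could be sketched as follows (assuming PyTorch; the Adam optimizer, the learning rate and the example label are illustrative choices not specified by the description; because the head already applies softmax, the cross-entropy loss is computed as the negative log-likelihood of the log-probabilities):

```python
import torch
import torch.nn.functional as F

model = ConversionModuleRecognizer(num_class=10)       # hypothetical model from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

clip = torch.rand(16, 3, 64, 64)                       # preprocessed consecutive frames (step one)
in_mask = torch.ones(17, 17)                           # position mask input matrix (step two)
target = torch.tensor([3])                             # action class label

probs = model(clip, in_mask)                           # class detection result ∈ R^(1×num_class)
loss = F.nll_loss(torch.log(probs + 1e-12), target)    # cross-entropy between detection result and label
optimizer.zero_grad()
loss.backward()
optimizer.step()                                       # optimize the network parameters
print(float(loss))
```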
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (4)
1. A behavior recognition method based on a conversion module, characterized by comprising the following steps:
step one, reading consecutive frame images and constructing a mask;
step two, constructing the input data of the conversion module, including acquiring the input data of the conversion module and a position mask matrix mask operation;
step three, action recognition by the conversion module, comprising a data preprocessing operation and, after the continuous coding module, data processing to obtain the action detection result;
and step four, calculating the cross-entropy loss between the class detection result and the class label target, and optimizing the network parameters.
2. The behavior recognition method based on a conversion module according to claim 1, characterized in that reading the consecutive frame images and constructing the mask comprises the following processes:
in temporal order, constructing the input data input from the image data of a continuous clip of 16 frames, where the image data input is a four-dimensional matrix of dimension input ∈ R^(16×3×H×W), and H, W are the original height and width of a picture;
for each picture of the consecutive-frame input data input, performing a proportional scaling to resize the picture, the data dimension after this operation being as shown in formula (1):
input ∈ R^(16×3×h×w) (1)
where h and w are the height and width of the scaled picture;
reading the key-frame target label information target, which contains the action label;
and constructing a position mask matrix mask, a two-dimensional all-ones matrix of dimension mask ∈ R^(4×4), which marks the positions of the real pictures in the input data.
3. The behavior recognition method based on a conversion module according to claim 1, characterized in that acquiring the input data of the conversion module in step two comprises the following processes:
flattening the consecutive-frame image data input (clip = 16) into a two-dimensional matrix, the dimension becoming input ∈ R^(16×d), where d = 3 × h × w;
compressing the flattened consecutive-frame data input with a linear link layer whose number of input channels is d and whose number of output channels is 1024, the compressed consecutive-frame data being given by formula (2):
clip_frame = Linear(input) (2)
where Linear(·) is the linear-link-layer operation and the result clip_frame ∈ R^(16×1024) is a two-dimensional matrix;
constructing a randomly initialized trainable parameter matrix cls_token of dimension cls_token ∈ R^(1×1024);
concatenating the data cls_token and the data clip_frame along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^(1×1024), clip_frame ∈ R^(16×1024) (3)
where Cat(·) denotes the matrix concatenation operation and the result in_data ∈ R^(17×1024) is a two-dimensional matrix;
the position mask matrix mask operation comprises the following processes:
flattening the mask matrix mask into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^(4×4) → mask ∈ R^(1×16) (4)
padding the mask matrix to obtain the transformed mask matrix of formula (5):
mask = Pad(mask, (1,0), value=1) (5)
where Pad(·) denotes the padding operation and (1,0) means that one element with value 1 is added at the first position; the output mask dimension transformation is given by formula (6):
mask ∈ R^(1×16) → mask ∈ R^(1×17) (6)
applying a dimension transformation to the data mask to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^(17×1), mask2 ∈ R^(1×17) (7)
and obtaining the new mask input matrix of formula (8):
in_mask = mask1 × mask2 (8)
whose dimension is in_mask ∈ R^(17×17), a two-dimensional matrix.
4. The behavior recognition method based on a conversion module according to claim 1, characterized in that the data preprocessing operation in step three comprises the following processes:
constructing a randomly initialized trainable parameter matrix pos_embedding of dimension pos_embedding ∈ R^(17×1024), adding it to the input data in_data, and applying a dropout layer, the output result x being given by formula (9):
x = Dropout(pos_embedding + in_data, dropout=0.1), x ∈ R^(17×1024) (9)
where Dropout(·) denotes the dropout-layer operation and the dropout factor is 0.1;
the continuous coding module is composed of depth = 6 structurally identical basic coding modules connected in series, and the calculation process of a basic coding module is as follows:
the basic design parameters of the basic coding module are: number of input data channels dim = 1024, number of intermediate-layer data channels mlp_dim = 2048, number of parallel attention heads heads = 8, and dropout coefficient dropout = 0.1;
1) data normalization processing
normalizing the input data x, the new data obtained being expressed as formula (10):
x_out = Norml(x_in), x_out ∈ R^(17×1024) (10)
where Norml(·) denotes the normalization processing; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. linear link layer channel expansion:
the input data channel is dim = 1024 and the expanded data channel is out_dim = dim × 3 = 3072, the transformation process being expressed as formula (11):
x_out1 = Linear(x_in, dim=1024, out_dim=3072) (11)
where Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after the processing, the data dimension change being expressed as formula (12):
x_in ∈ R^(17×1024) → x_out1 ∈ R^(17×3072) (12)
b. constructing the q, k, v data:
splitting x_out1 into the three matrices q, k and v and reshaping each of them for the heads = 8 parallel attention heads, as shown in formula (13):
q, k, v ∈ R^(8×17×128) (13)
multiplying the matrices q and k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^(8×17×17) (14)
where T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^(17×17), the positions of the product of the matrices q and k, x_out4 ∈ R^(8×17×17), where the mask result is 0 are replaced with value = 1e-9, the calculation process being expressed as formula (15):
x_out5 = softmax(Mask(x_out4, value=1e-9)), x_out5 ∈ R^(8×17×17) (15)
where Mask(·) denotes the masking operation and softmax(·) is the softmax activation layer of the neural network;
multiplying the output result x_out5 by the data v and reshaping to obtain the output expressed by formula (16):
x_out6 = Tranf(x_out5 · v), x_out5 ∈ R^(8×17×17), v ∈ R^(8×17×128), x_out6 ∈ R^(17×1024) (16)
where Tranf(·) denotes the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^(17×1024)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel dim = 1024, and Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the parallel attention operation a residual operation is performed, the module output obtained being formula (17):
x_out = x_in + x_out7, x_in ∈ R^(17×1024), x_out7 ∈ R^(17×1024), x_out ∈ R^(17×1024) (17);
3) feed-forward network data processing
the feed-forward network data processing performs the following operations on the data obtained after the parallel attention operation, this part of the input data being x_in ∈ R^(17×1024), with the following sequence of processing steps:
linear processing as in formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^(17×2048) (18)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^(17×2048) (19)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^(17×2048) (20)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing is shown in formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^(17×1024) (21)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel dim = 1024;
the dropout layer operates as shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^(17×1024) (22)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
after the feed-forward network data processing a residual operation is adopted, the final output data obtained being shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^(17×1024), x_out5 ∈ R^(17×1024), x_out ∈ R^(17×1024) (23);
and carrying out data processing on the data after it passes through the continuous coding module to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^(17×1024), x_out ∈ R^(1×1024) (24)
the output data then undergoes the following sequence of operations, beginning with formula (25):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^(1×1024) (25)
where Norml(·) denotes the normalization processing;
linear processing as in formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^(1×2048) (26)
where Linear(·) denotes a linear transformation with input channel dim = 1024 and output channel mlp_dim = 2048;
the activation function layer is expressed by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^(1×2048) (27)
where GELU(·) denotes the GELU activation function;
the dropout layer operates as in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^(1×2048) (28)
where Dropout(·) denotes the dropout-layer processing with dropout factor 0.1;
linear processing as in formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^(1×num_class) (29)
where Linear(·) denotes a linear transformation with input channel mlp_dim = 2048 and output channel num_class, the number of classes;
the activation function layer is formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^(1×num_class) (30)
where softmax(·) denotes the softmax activation function, which yields the final action recognition result.
Priority Applications
- CN202011383635.6A (CN113033276A / CN113033276B), filed 2020-12-01, priority date 2020-12-01; CN113033276A published 2021-06-25; granted as CN113033276B on 2022-05-17.
- PCT/CN2021/116770 (WO2022116616A1), filed 2021-09-06; published 2022-06-09.