CN113033276B - Behavior recognition method based on conversion module - Google Patents

Behavior recognition method based on conversion module

Info

Publication number
CN113033276B
Authority
CN
China
Prior art keywords
data
mask
linear
dim
formula
Prior art date
Legal status
Active
Application number
CN202011383635.6A
Other languages
Chinese (zh)
Other versions
CN113033276A
Inventor
高朋
刘辰飞
陈英鹏
于鹏
Current Assignee
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd
Priority to CN202011383635.6A
Publication of CN113033276A
Priority to PCT/CN2021/116770
Application granted
Publication of CN113033276B

Links

Images

Classifications

    • G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods
    • G06V10/32: Image preprocessing; normalisation of the pattern dimensions
    • G06V10/40: Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on a conversion module (the transformer module used in natural language understanding) and relates to the field of human action recognition. The method first reads consecutive frame images and constructs a mask; it then constructs the input data of the conversion module, including obtaining the input data of the conversion module and performing the position mask matrix operation; action recognition by the conversion module follows, comprising a data preprocessing operation and data processing after the continuous coding module to obtain the action detection result; finally, the cross-entropy loss between the class detection result and the class label target is calculated and the network parameters are optimized. The method uses the conversion module employed in natural language understanding to extract spatio-temporal features from consecutive frame images and uses only the conversion module in the whole recognition process, which reduces the number of parameters of the method, lowers the overall amount of computation, and increases the speed of action recognition.

Description

Behavior recognition method based on conversion module
Technical Field
The invention relates to the field of human action recognition, and in particular to a behavior recognition method based on a conversion module.
Background
Action recognition analyzes the action content of a video and classifies the action category by extracting action features from consecutive video frames. It therefore helps to improve the ability to monitor dangerous behaviors in key areas and to prevent potentially dangerous behaviors before they occur.
The Chinese patent with application number CN202010708119.X provides an efficient unsupervised cross-domain action recognition method based on channel fusion and classifier adversarial training (CAFCCN), which addresses the problem of training on an unlabeled target data set and achieves accurate recognition of the target-domain test set by using the information of the source-domain data set and of the unlabeled target-domain training set. Its disadvantage is that four deep residual network models are needed to extract the source-domain and target-domain optical-flow-graph features separately, and several fusion modules are needed to fuse these features, so the whole algorithm has many model parameters and a large overall amount of computation.
The patent with application number 201810431650.X discloses a temporal action recognition method based on deep learning. To address the limited effectiveness of long-action feature representation during boundary detection, it extracts inter-frame and intra-frame information simultaneously through a two-stream network to obtain the feature sequence of each video unit, proposes a multi-scale short-action-segment extraction scheme that incorporates context information, effectively improves the accuracy of the subsequent regression, trains a temporal boundary model with the feature sequence, reduces the model training time, and improves computational efficiency. Its disadvantage is that interval frame images are fed directly into the action recognition network; when the device operates in a complex environment with multiple targets, the different actions of different targets influence the action detection result of the whole image, and action recognition cannot be performed for each individual target. In addition, because the two-stream network extracts inter-frame and intra-frame information simultaneously, 3D convolution is inevitably used to obtain the features of consecutive frames over time, which increases the amount of model computation, lengthens the model training period, and increases the amount of sample searching.
The Chinese patent with application number CN202010840934.1 discloses a behavior recognition method for strongly dynamic videos, in which, according to the data distribution characteristics of the data set, the optical flow branch of a traditional two-stream model is used as a teacher model to assist in training the RGB branch. The RGB branch takes as input the RGB image frames of the whole source video, while the optical flow branch takes the optical flow image frames of the whole source video, obtained from the RGB image frames by optical flow calculation; joint inference over the RGB branch and the optical flow branch then recognizes the behaviors in the video. In that patent the RGB branch and the optical flow branch are trained separately under different configurations, and compared with the traditional two-stream algorithm the recognition is configured dynamically, so the adaptability is strong. Through reinforced optical-flow feature learning the method takes the characteristics of strongly dynamic behavior videos into account, propagates the optical flow information over multiple stages, obtains sufficient motion features, and improves recognition accuracy. However, that patent still uses inflated 3D convolution to obtain the temporal features of the optical flow and 2D convolution to obtain the spatial features, requires two different networks to complete the action recognition task, and does not solve the problems of the large computational load of the model and the poor portability of 3D convolutional networks.
The Chinese patent with application number 201910064529.2 discloses a behavior recognition system based on an attention mechanism, which uses a channel attention module to extract inter-channel feature codes for action prediction. However, the attention module of that patent still relies on combinations of three-dimensional and two-dimensional convolutions and does not solve the problem of the large parameter count and computational load of 3D convolution models.
Classic action recognition methods are based on 3D convolution and optical flow; they extract the features of consecutive frames over a temporal sequence, capture the dependency between consecutive frames along the time axis, and thereby improve action recognition accuracy.
Compared with 2D convolution, 3D convolution must extract consecutive-frame features in three dimensions, which increases the parameter count of the 3D convolution model, increases the amount of model computation, and lengthens the model training period. At the same time, as a relatively new type of operation, 3D convolution is poorly supported by different deep learning frameworks, which hinders the practical application of action recognition algorithms based on 3D convolution.
The optical flow method requires several 2D convolution models working together to extract the temporal and spatial features, so the model parameters and the amount of computation become very large, which places high demands on hardware in practical applications and reduces the practical applicability of the method.
Disclosure of Invention
The aim of the invention is to overcome the above defects by providing a behavior recognition method that uses the conversion module employed in natural language understanding to extract spatio-temporal features from consecutive frame images and uses only the conversion module throughout the recognition process.
The invention specifically adopts the following technical scheme:
a behavior recognition method based on a conversion module comprises the following steps:
reading continuous frame images and constructing a mask;
step two, constructing input data of the conversion module, including obtaining the input data of the conversion module and a position mask matrix mask operation;
step three, the motion recognition of the conversion module comprises data preprocessing operation, and the motion detection result is obtained through data processing after the continuous coding module;
and step four, calculating the cross entropy loss of the class detection result and the class label target, and optimizing the network parameters.
Preferably, reading the consecutive frame images and constructing the mask comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
Preferably, the step two of acquiring the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension becoming input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask = Pad(mask, (1,0), value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
the new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
Preferably, the data preprocessing operation in step three includes the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x = Dropout(pos_embedding + in_data, dropout=0.1), x ∈ R^{17×1024} (9)
wherein Dropout(·) denotes the activation layer operation, and the activation layer factor dropout is 0.1;
the continuous coding module is composed of 6 basic coding modules of identical structure connected in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. linear link layer data channel expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1 = Linear(x_in, dim=1024, out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing the q, k, v data:
matrix deformation:
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
then formula (13):
q, k, v ∈ R^{8×17×128} (13)
the matrices q and k are multiplied to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5 = softmax(Mask(x_out4, value=1e-9)), x_out5 ∈ R^{8×17×17} (15)
wherein Mask(·) denotes the masking operation, and softmax(·) is the softmax activation layer of the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6 = Tranf(x_out5 · v), x_out5 ∈ R^{8×17×17}, v ∈ R^{8×17×128}, x_out6 ∈ R^{17×1024} (16)
wherein Tranf(·) denotes the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^{17×1024}
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
3) feed-forward network data processing
The feed-forward network data processing applies the following operations to the data obtained after the parallel attention operation; the input data of this part is x_in ∈ R^{17×1024}, and the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result.
The invention has the following beneficial effects:
the method realizes continuous frame image action identification based on continuous feature extraction.
In the method, the conversion model extraction module replaces a 3D convolution network, the problem that the 3D convolution network model is large in calculation amount is solved, the parallel calculation capacity of the model on a GPU is improved, meanwhile, the conversion models are composed of the most basic operators, the migration deployment performance of the model is improved, and the problem that the compatibility is weak when the model is converted or deployed is solved.
Drawings
FIG. 1 is a block flow diagram of a translation module-based behavior recognition method;
FIG. 2 is a conversion module diagram;
fig. 3 is a diagram of a basic coding module structure.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings:
with reference to fig. 1-3, the behavior recognition method based on the conversion module comprises the following steps:
step one, reading the consecutive frame images and constructing the mask, which comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
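As an illustration of step one, a minimal PyTorch sketch follows (PyTorch itself, the resize target h = w = 224, the bilinear resize standing in for the equal-ratio scaling, and the random tensor standing in for decoded video frames are all assumptions made for illustration, not details fixed by this description):

import torch
import torch.nn.functional as F

def build_clip(frames, h=224, w=224):
    """frames: a (16, 3, H, W) tensor holding the 16 consecutive RGB frames of one clip."""
    # The equal-ratio scaling of the text is approximated here by a plain bilinear resize to (h, w).
    return F.interpolate(frames, size=(h, w), mode="bilinear", align_corners=False)

# Position mask: a 4x4 all-ones matrix marking which of the 16 frame slots hold real pictures.
mask = torch.ones(4, 4)

frames = torch.rand(16, 3, 360, 640)   # stand-in for decoded video frames
clip = build_clip(frames)              # input of dimension 16x3xhxw, formula (1)
print(clip.shape, mask.shape)          # torch.Size([16, 3, 224, 224]) torch.Size([4, 4])
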
Step two, constructing the input data of the conversion module, including obtaining the input data of the conversion module and the mask operation of the position mask matrix, wherein the obtaining of the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension changing to input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask=Pad(mask,(1,0),value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
the new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
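The following sketch walks through step two under stated assumptions (untrained weights, h = w = 224 chosen only to fix d for illustration); it reproduces formulas (2) to (8) one line at a time:

import torch
import torch.nn as nn
import torch.nn.functional as F

h = w = 224
d = 3 * h * w

clip = torch.rand(16, 3, h, w)                       # continuous-frame input from step one
inp = clip.reshape(16, d)                            # tile to a 16xd two-dimensional matrix

to_embed = nn.Linear(d, 1024)                        # linear link layer, d -> 1024 channels
clip_frame = to_embed(inp)                           # clip_frame, 16x1024, formula (2)

cls_token = nn.Parameter(torch.randn(1, 1024))       # trainable cls_token, 1x1024
in_data = torch.cat([cls_token, clip_frame], dim=0)  # in_data, 17x1024, formula (3)

# Position mask pipeline, formulas (4)-(8).
mask = torch.ones(4, 4).reshape(1, 16)               # tile to 1x16, formula (4)
mask = F.pad(mask, (1, 0), value=1)                  # prepend one element of value 1, formulas (5)-(6)
mask1 = mask.reshape(17, 1)                          # 17x1
mask2 = mask.reshape(1, 17)                          # 1x17, formula (7)
in_mask = mask1 * mask2                              # outer product, 17x17, formula (8)
print(in_data.shape, in_mask.shape)                  # torch.Size([17, 1024]) torch.Size([17, 17])
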
Step three, the motion recognition of the conversion module comprises data preprocessing operation, and the motion detection result is obtained through data processing after the continuous coding module; the data preprocessing operation includes the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x=Dropout(pos_embedding+in_data,dropout=0.1),x∈R17×1024 (9)
wherein Dropout (·) denotes an active layer operation, and an active layer factor Dropout is 0.1;
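A short sketch of the preprocessing of formula (9); note that nn.Dropout only drops values while the module is in training mode:

import torch
import torch.nn as nn

in_data = torch.rand(17, 1024)                       # in_data from step two
pos_embedding = nn.Parameter(torch.randn(17, 1024))  # trainable position embedding, 17x1024
x = nn.Dropout(p=0.1)(pos_embedding + in_data)       # x, 17x1024, formula (9)
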
the continuous coding module is composed of 6 basic coding modules with the same structure in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. Linear link layer data path expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1=Linear(x_in,dim=1024,out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing q, k, v data:
matrix deformation
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
Then formula (13):
q, k, v ∈ R^{8×17×128} (13)
multiplying the matrix q, k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5=softmax(Mask(x_out4,value=1e-9)),x_out5∈R8×17×17 (15)
wherein, Mask (·) represents a masking operation, and softmax (·) is a softmax activation layer in the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6=Tranf(x_out5·v),x_out5∈R8×17×17,v∈R8×17×128,x_out6∈R17×1024 (16)
wherein Tranf (·) represents the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^{17×1024}
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
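The masked parallel attention of formulas (10) to (17) can be sketched as the PyTorch module below. The class name is mine; following the text as written, no scaling factor is applied to q·k^T, LayerNorm is assumed for Norml, and masked positions are filled with 1e-9 (other transformer implementations usually use a large negative value instead):

import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Basic coding module, steps 1)-2): normalization, masked multi-head attention, residual."""
    def __init__(self, dim=1024, heads=8, dropout=0.1):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads   # 8 heads of width 128
        self.norm = nn.LayerNorm(dim)                     # formula (10); LayerNorm assumed for Norml
        self.to_qkv = nn.Linear(dim, dim * 3)             # formula (11): 1024 -> 3072
        self.proj = nn.Linear(dim, dim)                   # linear transformation of step c
        self.drop = nn.Dropout(dropout)

    def forward(self, x, in_mask):                        # x: (17, 1024), in_mask: (17, 17)
        n, dim = x.shape
        qkv = self.to_qkv(self.norm(x))                                             # (17, 3072)
        q, k, v = qkv.reshape(n, 3, self.heads, self.head_dim).permute(1, 2, 0, 3)  # each (8, 17, 128), formula (13)
        attn = q @ k.transpose(-2, -1)                    # formula (14): (8, 17, 17)
        attn = attn.masked_fill(in_mask == 0, 1e-9)       # mask replacement, formula (15)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(1, 0, 2).reshape(n, dim) # formula (16): back to (17, 1024)
        return x + self.drop(self.proj(out))              # step c and residual, formula (17)

attn_block = ParallelAttention()
y = attn_block(torch.rand(17, 1024), torch.ones(17, 17))
print(y.shape)                                            # torch.Size([17, 1024])
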
3) feed-forward network data processing
The feed-forward network data processing applies the following operations to the data obtained after the parallel attention operation; the input data of this part is x_in ∈ R^{17×1024}, and the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
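The feed-forward part of formulas (18) to (23) and the series connection of six basic coding modules into the continuous coding module can be sketched as follows; ParallelAttention refers to the sketch given after the attention step above and is assumed to be in scope:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim=1024, mlp_dim=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, mlp_dim),   # formula (18)
            nn.GELU(),                 # formula (19)
            nn.Dropout(dropout),       # formula (20)
            nn.Linear(mlp_dim, dim),   # formula (21)
            nn.Dropout(dropout),       # formula (22)
        )

    def forward(self, x):
        return x + self.net(x)         # residual, formula (23)

class BasicCodingModule(nn.Module):
    """One basic coding module: masked parallel attention followed by the feed-forward part."""
    def __init__(self, dim=1024, mlp_dim=2048, heads=8, dropout=0.1):
        super().__init__()
        self.attn = ParallelAttention(dim, heads, dropout)   # sketch from the attention step above
        self.ff = FeedForward(dim, mlp_dim, dropout)

    def forward(self, x, in_mask):
        return self.ff(self.attn(x, in_mask))

# Continuous coding module: six basic coding modules of identical structure in series.
encoder = nn.ModuleList([BasicCodingModule() for _ in range(6)])
x, in_mask = torch.rand(17, 1024), torch.ones(17, 17)
for block in encoder:
    x = block(x, in_mask)
print(x.shape)                         # torch.Size([17, 1024])
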
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result.
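Formulas (24) to (30) amount to a small classification head on the cls position, and step four below computes a cross-entropy loss on its output. A sketch under stated assumptions follows (num_class = 10 is an arbitrary example, and the optimizer named in the comment is an assumption, not specified in the text); note that nn.CrossEntropyLoss applies log-softmax internally, so it is fed the scores before the softmax of formula (30):

import torch
import torch.nn as nn

num_class = 10
head = nn.Sequential(
    nn.LayerNorm(1024),                  # formula (25); LayerNorm assumed for Norml
    nn.Linear(1024, 2048),               # formula (26)
    nn.GELU(),                           # formula (27)
    nn.Dropout(0.1),                     # formula (28)
    nn.Linear(2048, num_class),          # formula (29)
)

x = torch.rand(17, 1024)                 # output of the continuous coding module
cls_out = x[0:1]                         # formula (24): keep the cls position, shape (1, 1024)
scores = head(cls_out)                   # class scores, shape (1, num_class)
probs = scores.softmax(dim=-1)           # formula (30): final action recognition result

# Step four: cross-entropy loss between the detection result and the class label target.
target = torch.tensor([3])               # action class label of the clip
loss = nn.CrossEntropyLoss()(scores, target)
loss.backward()                          # gradients used to optimize the network parameters,
                                         # e.g. with torch.optim.Adam(head.parameters(), lr=1e-4)
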
And step four, calculating the cross entropy loss of the class detection result and the class label target, and optimizing the network parameters.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (3)

1. A behavior recognition method based on a conversion module is characterized by comprising the following steps:
reading continuous frame images and constructing a mask;
step two, constructing input data of the conversion module, including obtaining the input data of the conversion module and a position mask matrix mask operation;
step three, the action recognition of the conversion module comprises data preprocessing operation, the data processing after the continuous coding module is carried out to obtain an action detection result, and the data preprocessing operation comprises the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x=Dropout(pos_embedding+in_data,dropout=0.1),x∈R17×1024 (9)
wherein Dropout (·) denotes an active layer operation, and an active layer factor Dropout is 0.1;
the continuous coding module is composed of 6 basic coding modules with the same structure in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. Linear link layer data path expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1=Linear(x_in,dim=1024,out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing q, k, v data:
matrix deformation
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
Then formula (13):
q, k, v ∈ R^{8×17×128} (13)
multiplying the matrix q, k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5=softmax(Mask(x_out4,value=1e-9)),x_out5∈R8×17×17 (15)
wherein, Mask (·) represents a masking operation, and softmax (·) is a softmax activation layer in the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6=Tranf(x_out5·v),x_out5∈R8×17×17,v∈R8×17×128,x_out6∈R17×1024 (16)
wherein Tranf (·) represents the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7=Dropout(Linear(x_out6,dim=1024,dim=1024),dropout=0.1),x_out7∈R17×1024
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
3) feed-forward network data processing
Feed-forward network data processing, namely applying the following operations to the data obtained after the parallel attention operation, the input data of this part being x_in ∈ R^{17×1024}; the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result;
and step four, calculating the cross entropy loss of the action detection result and the class label target1, and optimizing the network parameters.
2. The behavior recognition method based on the conversion module as claimed in claim 1, wherein reading the consecutive frame images and constructing the mask comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target2, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
3. The behavior recognition method based on the conversion module as claimed in claim 1, wherein the step two of obtaining the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension becoming input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask=Pad(mask,(1,0),value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
a new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
CN202011383635.6A 2020-12-01 2020-12-01 Behavior recognition method based on conversion module Active CN113033276B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011383635.6A CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module
PCT/CN2021/116770 WO2022116616A1 (en) 2020-12-01 2021-09-06 Behavior recognition method based on conversion module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011383635.6A CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module

Publications (2)

Publication Number Publication Date
CN113033276A CN113033276A (en) 2021-06-25
CN113033276B true CN113033276B (en) 2022-05-17

Family

ID=76459191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011383635.6A Active CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module

Country Status (2)

Country Link
CN (1) CN113033276B (en)
WO (1) WO2022116616A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033276B (en) * 2020-12-01 2022-05-17 神思电子技术股份有限公司 Behavior recognition method based on conversion module
CN115065567B (en) * 2022-08-19 2022-11-11 北京金睛云华科技有限公司 Plug-in execution method for DGA domain name study and judgment inference machine
CN116246338B (en) * 2022-12-20 2023-10-03 西南交通大学 Behavior recognition method based on graph convolution and transducer composite neural network


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474988B2 (en) * 2017-08-07 2019-11-12 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
CN109543627B (en) * 2018-11-27 2023-08-01 西安电子科技大学 Method and device for judging driving behavior category and computer equipment
US10977355B2 (en) * 2019-09-11 2021-04-13 Lg Electronics Inc. Authentication method and device through face recognition
CN111008567B (en) * 2019-11-07 2023-03-24 郑州大学 Driver behavior identification method
CN113033276B (en) * 2020-12-01 2022-05-17 神思电子技术股份有限公司 Behavior recognition method based on conversion module

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909005A (en) * 2017-10-26 2018-04-13 西安电子科技大学 Personage's gesture recognition method under monitoring scene based on deep learning
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN109726671A (en) * 2018-12-27 2019-05-07 上海交通大学 The action identification method and system of expression study from the overall situation to category feature
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pedestrian detection method based on the YOLO algorithm; Dai Shu et al.; Radio Communications Technology; 2020-05-18 (No. 03); full text *

Also Published As

Publication number Publication date
CN113033276A (en) 2021-06-25
WO2022116616A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
CN113033276B (en) Behavior recognition method based on conversion module
Zhang et al. Context encoding for semantic segmentation
CN111079532B (en) Video content description method based on text self-encoder
Kim et al. Fully deep blind image quality predictor
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN110378208B (en) Behavior identification method based on deep residual error network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN113065451A (en) Multi-mode fused action recognition device and method and storage medium
CN113408343A (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN111489803A (en) Report coding model generation method, system and equipment based on autoregressive model
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN117197727B (en) Global space-time feature learning-based behavior detection method and system
CN117058595B (en) Video semantic feature and extensible granularity perception time sequence action detection method and device
CN116258914B (en) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN117576724A (en) Unmanned plane bird detection method, system, equipment and medium
CN112508121A (en) Method and system for sensing outside by industrial robot
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN115965836A (en) Human behavior posture video data amplification system and method with controllable semantics
CN112613405B (en) Method for recognizing actions at any visual angle
CN114283301A (en) Self-adaptive medical image classification method and system based on Transformer
CN114881098A (en) Label noise estimation method based on manifold regularization transfer matrix
CN116071825B (en) Action behavior recognition method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant