CN113033276B - Behavior recognition method based on conversion module - Google Patents

Behavior recognition method based on conversion module

Info

Publication number
CN113033276B
Authority
CN
China
Prior art keywords
data
mask
linear
dim
formula
Prior art date
Legal status
Active
Application number
CN202011383635.6A
Other languages
Chinese (zh)
Other versions
CN113033276A
Inventor
高朋
刘辰飞
陈英鹏
于鹏
Current Assignee
Synthesis Electronic Technology Co Ltd
Original Assignee
Synthesis Electronic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Synthesis Electronic Technology Co Ltd
Priority to CN202011383635.6A
Publication of CN113033276A
Priority to PCT/CN2021/116770
Application granted
Publication of CN113033276B

Links

Images

Classifications

    • G06V40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods
    • G06V10/32: Image preprocessing; normalisation of the pattern dimensions
    • G06V10/40: Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on a conversion module (the transformer module used in natural language understanding) and relates to the field of human action recognition. The method first reads consecutive frame images and constructs a mask; it then constructs the input data of the conversion module, including obtaining the input data of the conversion module and performing the position mask matrix operation; action recognition by the conversion module follows, comprising a data preprocessing operation and data processing after the continuous coding module to obtain the action detection result; finally, the cross-entropy loss between the class detection result and the class label target is calculated and the network parameters are optimized. The method uses the conversion module employed in natural language understanding to extract spatio-temporal features from consecutive frame images and uses only the conversion module in the whole recognition process, which reduces the number of parameters of the method, lowers the overall amount of computation, and increases the speed of action recognition.

Description

Behavior recognition method based on conversion module
Technical Field
The invention relates to the field of human action recognition, and in particular to a behavior recognition method based on a conversion module.
Background
Action recognition analyzes the action content of a video and classifies the action category by extracting action features from consecutive video frames. It therefore helps to improve the ability to monitor dangerous behaviors in key areas and to prevent potentially dangerous behaviors before they occur.
The Chinese patent with application number CN202010708119.X provides an efficient unsupervised cross-domain action recognition method based on channel fusion and classifier adversarial training (CAFCCN), which addresses the problem of training on an unlabeled target data set and achieves accurate recognition of the target-domain test set by using the information of the source-domain data set and of the unlabeled target-domain training set. Its disadvantage is that four deep residual network models are needed to extract the source-domain and target-domain optical-flow-graph features separately, and several fusion modules are needed to fuse these features, so the whole algorithm has many model parameters and a large overall amount of computation.
The patent with application number 201810431650.X discloses a temporal action recognition method based on deep learning. To address the limited effectiveness of long-action feature representation during boundary detection, it extracts inter-frame and intra-frame information simultaneously through a two-stream network to obtain the feature sequence of each video unit, proposes a multi-scale short-action-segment extraction scheme that incorporates context information, effectively improves the accuracy of the subsequent regression, trains a temporal boundary model with the feature sequence, reduces the model training time, and improves computational efficiency. Its disadvantage is that interval frame images are fed directly into the action recognition network; when the device operates in a complex environment with multiple targets, the different actions of different targets influence the action detection result of the whole image, and action recognition cannot be performed for each individual target. In addition, because the two-stream network extracts inter-frame and intra-frame information simultaneously, 3D convolution is inevitably used to obtain the features of consecutive frames over time, which increases the amount of model computation, lengthens the model training period, and increases the amount of sample searching.
The Chinese patent with application number CN202010840934.1 discloses a behavior recognition method for strongly dynamic videos, in which, according to the data distribution characteristics of the data set, the optical flow branch of a traditional two-stream model is used as a teacher model to assist in training the RGB branch. The RGB branch takes as input the RGB image frames of the whole source video, while the optical flow branch takes the optical flow image frames of the whole source video, obtained from the RGB image frames by optical flow calculation; joint inference over the RGB branch and the optical flow branch then recognizes the behaviors in the video. In that patent the RGB branch and the optical flow branch are trained separately under different configurations, and compared with the traditional two-stream algorithm the recognition is configured dynamically, so the adaptability is strong. Through reinforced optical-flow feature learning the method takes the characteristics of strongly dynamic behavior videos into account, propagates the optical flow information over multiple stages, obtains sufficient motion features, and improves recognition accuracy. However, that patent still uses inflated 3D convolution to obtain the temporal features of the optical flow and 2D convolution to obtain the spatial features, requires two different networks to complete the action recognition task, and does not solve the problems of the large computational load of the model and the poor portability of 3D convolutional networks.
The Chinese patent with application number 201910064529.2 discloses a behavior recognition system based on an attention mechanism, which uses a channel attention module to extract inter-channel feature codes for action prediction. However, the attention module of that patent still relies on combinations of three-dimensional and two-dimensional convolutions and does not solve the problem of the large parameter count and computational load of 3D convolution models.
Classic action recognition methods are based on 3D convolution and optical flow; they extract the features of consecutive frames over a temporal sequence, capture the dependency between consecutive frames along the time axis, and thereby improve action recognition accuracy.
Compared with 2D convolution, 3D convolution must extract consecutive-frame features in three dimensions, which increases the parameter count of the 3D convolution model, increases the amount of model computation, and lengthens the model training period. At the same time, as a relatively new type of operation, 3D convolution is poorly supported by different deep learning frameworks, which hinders the practical application of action recognition algorithms based on 3D convolution.
The optical flow method requires several 2D convolution models working together to extract the temporal and spatial features, so the model parameters and the amount of computation become very large, which places high demands on hardware in practical applications and reduces the practical applicability of the method.
Disclosure of Invention
The aim of the invention is to overcome the above defects by providing a behavior recognition method that uses the conversion module employed in natural language understanding to extract spatio-temporal features from consecutive frame images and uses only the conversion module throughout the recognition process.
The invention specifically adopts the following technical scheme:
a behavior recognition method based on a conversion module comprises the following steps:
reading continuous frame images and constructing a mask;
step two, constructing input data of the conversion module, including obtaining the input data of the conversion module and a position mask matrix mask operation;
step three, the motion recognition of the conversion module comprises data preprocessing operation, and the motion detection result is obtained through data processing after the continuous coding module;
and step four, calculating the cross entropy loss of the class detection result and the class label target, and optimizing the network parameters.
Preferably, reading the consecutive frame images and constructing the mask comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
Preferably, the step two of acquiring the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension becoming input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask = Pad(mask, (1,0), value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
the new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
Preferably, the data preprocessing operation in step three includes the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x = Dropout(pos_embedding + in_data, dropout=0.1), x ∈ R^{17×1024} (9)
wherein Dropout(·) denotes the activation layer operation, and the activation layer factor dropout is 0.1;
the continuous coding module is composed of 6 basic coding modules of identical structure connected in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. linear link layer data channel expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1 = Linear(x_in, dim=1024, out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing the q, k, v data:
matrix deformation:
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
then formula (13):
q, k, v ∈ R^{8×17×128} (13)
the matrices q and k are multiplied to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5 = softmax(Mask(x_out4, value=1e-9)), x_out5 ∈ R^{8×17×17} (15)
wherein Mask(·) denotes the masking operation, and softmax(·) is the softmax activation layer of the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6 = Tranf(x_out5 · v), x_out5 ∈ R^{8×17×17}, v ∈ R^{8×17×128}, x_out6 ∈ R^{17×1024} (16)
wherein Tranf(·) denotes the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^{17×1024}
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
3) feed-forward network data processing
The feed-forward network data processing applies the following operations to the data obtained after the parallel attention operation; the input data of this part is x_in ∈ R^{17×1024}, and the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result.
The invention has the following beneficial effects:
the method realizes continuous frame image action identification based on continuous feature extraction.
In the method, the conversion model extraction module replaces a 3D convolution network, the problem that the 3D convolution network model is large in calculation amount is solved, the parallel calculation capacity of the model on a GPU is improved, meanwhile, the conversion models are composed of the most basic operators, the migration deployment performance of the model is improved, and the problem that the compatibility is weak when the model is converted or deployed is solved.
Drawings
FIG. 1 is a block flow diagram of a translation module-based behavior recognition method;
FIG. 2 is a conversion module diagram;
fig. 3 is a diagram of a basic coding module structure.
Detailed Description
The embodiments of the present invention are described below with reference to the accompanying drawings:
with reference to fig. 1-3, the behavior recognition method based on the conversion module comprises the following steps:
step one, reading the consecutive frame images and constructing the mask, which comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
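As an illustration of step one, a minimal PyTorch sketch follows (PyTorch itself, the resize target h = w = 224, the bilinear resize standing in for the equal-ratio scaling, and the random tensor standing in for decoded video frames are all assumptions made for illustration, not details fixed by this description):

import torch
import torch.nn.functional as F

def build_clip(frames, h=224, w=224):
    """frames: a (16, 3, H, W) tensor holding the 16 consecutive RGB frames of one clip."""
    # The equal-ratio scaling of the text is approximated here by a plain bilinear resize to (h, w).
    return F.interpolate(frames, size=(h, w), mode="bilinear", align_corners=False)

# Position mask: a 4x4 all-ones matrix marking which of the 16 frame slots hold real pictures.
mask = torch.ones(4, 4)

frames = torch.rand(16, 3, 360, 640)   # stand-in for decoded video frames
clip = build_clip(frames)              # input of dimension 16x3xhxw, formula (1)
print(clip.shape, mask.shape)          # torch.Size([16, 3, 224, 224]) torch.Size([4, 4])
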
Step two, constructing the input data of the conversion module, including obtaining the input data of the conversion module and the mask operation of the position mask matrix, wherein the obtaining of the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension changing to input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask=Pad(mask,(1,0),value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
the new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
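The following sketch walks through step two under stated assumptions (untrained weights, h = w = 224 chosen only to fix d for illustration); it reproduces formulas (2) to (8) one line at a time:

import torch
import torch.nn as nn
import torch.nn.functional as F

h = w = 224
d = 3 * h * w

clip = torch.rand(16, 3, h, w)                       # continuous-frame input from step one
inp = clip.reshape(16, d)                            # tile to a 16xd two-dimensional matrix

to_embed = nn.Linear(d, 1024)                        # linear link layer, d -> 1024 channels
clip_frame = to_embed(inp)                           # clip_frame, 16x1024, formula (2)

cls_token = nn.Parameter(torch.randn(1, 1024))       # trainable cls_token, 1x1024
in_data = torch.cat([cls_token, clip_frame], dim=0)  # in_data, 17x1024, formula (3)

# Position mask pipeline, formulas (4)-(8).
mask = torch.ones(4, 4).reshape(1, 16)               # tile to 1x16, formula (4)
mask = F.pad(mask, (1, 0), value=1)                  # prepend one element of value 1, formulas (5)-(6)
mask1 = mask.reshape(17, 1)                          # 17x1
mask2 = mask.reshape(1, 17)                          # 1x17, formula (7)
in_mask = mask1 * mask2                              # outer product, 17x17, formula (8)
print(in_data.shape, in_mask.shape)                  # torch.Size([17, 1024]) torch.Size([17, 17])
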
Step three, the motion recognition of the conversion module comprises data preprocessing operation, and the motion detection result is obtained through data processing after the continuous coding module; the data preprocessing operation includes the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x=Dropout(pos_embedding+in_data,dropout=0.1),x∈R17×1024 (9)
wherein Dropout (·) denotes an active layer operation, and an active layer factor Dropout is 0.1;
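A short sketch of the preprocessing of formula (9); note that nn.Dropout only drops values while the module is in training mode:

import torch
import torch.nn as nn

in_data = torch.rand(17, 1024)                       # in_data from step two
pos_embedding = nn.Parameter(torch.randn(17, 1024))  # trainable position embedding, 17x1024
x = nn.Dropout(p=0.1)(pos_embedding + in_data)       # x, 17x1024, formula (9)
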
the continuous coding module is composed of 6 basic coding modules with the same structure in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. Linear link layer data path expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1=Linear(x_in,dim=1024,out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing q, k, v data:
matrix deformation
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
Then formula (13):
q, k, v ∈ R^{8×17×128} (13)
multiplying the matrix q, k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5=softmax(Mask(x_out4,value=1e-9)),x_out5∈R8×17×17 (15)
wherein, Mask (·) represents a masking operation, and softmax (·) is a softmax activation layer in the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6=Tranf(x_out5·v),x_out5∈R8×17×17,v∈R8×17×128,x_out6∈R17×1024 (16)
wherein Tranf (·) represents the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7 = Dropout(Linear(x_out6, dim=1024, dim=1024), dropout=0.1), x_out7 ∈ R^{17×1024}
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
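The masked parallel attention of formulas (10) to (17) can be sketched as the PyTorch module below. The class name is mine; following the text as written, no scaling factor is applied to q·k^T, LayerNorm is assumed for Norml, and masked positions are filled with 1e-9 (other transformer implementations usually use a large negative value instead):

import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    """Basic coding module, steps 1)-2): normalization, masked multi-head attention, residual."""
    def __init__(self, dim=1024, heads=8, dropout=0.1):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads   # 8 heads of width 128
        self.norm = nn.LayerNorm(dim)                     # formula (10); LayerNorm assumed for Norml
        self.to_qkv = nn.Linear(dim, dim * 3)             # formula (11): 1024 -> 3072
        self.proj = nn.Linear(dim, dim)                   # linear transformation of step c
        self.drop = nn.Dropout(dropout)

    def forward(self, x, in_mask):                        # x: (17, 1024), in_mask: (17, 17)
        n, dim = x.shape
        qkv = self.to_qkv(self.norm(x))                                             # (17, 3072)
        q, k, v = qkv.reshape(n, 3, self.heads, self.head_dim).permute(1, 2, 0, 3)  # each (8, 17, 128), formula (13)
        attn = q @ k.transpose(-2, -1)                    # formula (14): (8, 17, 17)
        attn = attn.masked_fill(in_mask == 0, 1e-9)       # mask replacement, formula (15)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(1, 0, 2).reshape(n, dim) # formula (16): back to (17, 1024)
        return x + self.drop(self.proj(out))              # step c and residual, formula (17)

attn_block = ParallelAttention()
y = attn_block(torch.rand(17, 1024), torch.ones(17, 17))
print(y.shape)                                            # torch.Size([17, 1024])
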
3) feed-forward network data processing
The feed-forward network data processing applies the following operations to the data obtained after the parallel attention operation; the input data of this part is x_in ∈ R^{17×1024}, and the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
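The feed-forward part of formulas (18) to (23) and the series connection of six basic coding modules into the continuous coding module can be sketched as follows; ParallelAttention refers to the sketch given after the attention step above and is assumed to be in scope:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, dim=1024, mlp_dim=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, mlp_dim),   # formula (18)
            nn.GELU(),                 # formula (19)
            nn.Dropout(dropout),       # formula (20)
            nn.Linear(mlp_dim, dim),   # formula (21)
            nn.Dropout(dropout),       # formula (22)
        )

    def forward(self, x):
        return x + self.net(x)         # residual, formula (23)

class BasicCodingModule(nn.Module):
    """One basic coding module: masked parallel attention followed by the feed-forward part."""
    def __init__(self, dim=1024, mlp_dim=2048, heads=8, dropout=0.1):
        super().__init__()
        self.attn = ParallelAttention(dim, heads, dropout)   # sketch from the attention step above
        self.ff = FeedForward(dim, mlp_dim, dropout)

    def forward(self, x, in_mask):
        return self.ff(self.attn(x, in_mask))

# Continuous coding module: six basic coding modules of identical structure in series.
encoder = nn.ModuleList([BasicCodingModule() for _ in range(6)])
x, in_mask = torch.rand(17, 1024), torch.ones(17, 17)
for block in encoder:
    x = block(x, in_mask)
print(x.shape)                         # torch.Size([17, 1024])
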
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result.
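Formulas (24) to (30) amount to a small classification head on the cls position, and step four below computes a cross-entropy loss on its output. A sketch under stated assumptions follows (num_class = 10 is an arbitrary example, and the optimizer named in the comment is an assumption, not specified in the text); note that nn.CrossEntropyLoss applies log-softmax internally, so it is fed the scores before the softmax of formula (30):

import torch
import torch.nn as nn

num_class = 10
head = nn.Sequential(
    nn.LayerNorm(1024),                  # formula (25); LayerNorm assumed for Norml
    nn.Linear(1024, 2048),               # formula (26)
    nn.GELU(),                           # formula (27)
    nn.Dropout(0.1),                     # formula (28)
    nn.Linear(2048, num_class),          # formula (29)
)

x = torch.rand(17, 1024)                 # output of the continuous coding module
cls_out = x[0:1]                         # formula (24): keep the cls position, shape (1, 1024)
scores = head(cls_out)                   # class scores, shape (1, num_class)
probs = scores.softmax(dim=-1)           # formula (30): final action recognition result

# Step four: cross-entropy loss between the detection result and the class label target.
target = torch.tensor([3])               # action class label of the clip
loss = nn.CrossEntropyLoss()(scores, target)
loss.backward()                          # gradients used to optimize the network parameters,
                                         # e.g. with torch.optim.Adam(head.parameters(), lr=1e-4)
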
And step four, calculating the cross entropy loss of the class detection result and the class label target, and optimizing the network parameters.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (3)

1. A behavior recognition method based on a conversion module is characterized by comprising the following steps:
reading continuous frame images and constructing a mask;
step two, constructing input data of the conversion module, including obtaining the input data of the conversion module and a position mask matrix mask operation;
step three, the action recognition of the conversion module comprises data preprocessing operation, the data processing after the continuous coding module is carried out to obtain an action detection result, and the data preprocessing operation comprises the following processes:
constructing a random trainable parameter matrix pos_embedding with dimension pos_embedding ∈ R^{17×1024}, adding it to the input data in_data, and performing a neuron activation layer operation, the output result x being expressed by equation (9):
x=Dropout(pos_embedding+in_data,dropout=0.1),x∈R17×1024 (9)
wherein Dropout (·) denotes an active layer operation, and an active layer factor Dropout is 0.1;
the continuous coding module is composed of 6 basic coding modules with the same structure in series, and the calculation process of the basic coding modules is as follows:
the basic design parameters of the basic coding module are: the number dim of input data channels is 1024, the number mlp_dim of intermediate-layer data channels is 2048, the number of parallel heads heads is 8, and the coefficient dropout of the activation layer is 0.1;
1) data normalization processing
Normalizing the input data x, and expressing the obtained new data as an expression (10):
x_out=Norml(x_in),x_out∈R17×1024 (10)
wherein Norml(·) denotes the normalization operation; for convenience of notation, x_in and x_out denote the input and output data before and after each processing step;
2) parallel attention operation
a. Linear link layer data path expansion:
the input data channel dim is 1024 and the expanded data channel out_dim = dim × 3 = 3072; the transformation process is expressed as formula (11):
x_out1=Linear(x_in,dim=1024,out_dim=3072) (11)
wherein Linear(·) is the linear link operation and x_in, x_out1 denote the input and output data before and after processing; the data dimension change is expressed as formula (12):
x_in ∈ R^{17×1024} → x_out1 ∈ R^{17×3072} (12)
b. constructing q, k, v data:
matrix deformation
the matrix x_out1 ∈ R^{17×3072} is split into three parts and reshaped to dimension R^{8×17×128},
Then formula (13):
q, k, v ∈ R^{8×17×128} (13)
multiplying the matrix q, k to obtain formula (14):
x_out4 = q · k^T, x_out4 ∈ R^{8×17×17} (14)
wherein T denotes the matrix transpose operation;
mask replacement operation:
according to the input mask matrix in_mask ∈ R^{17×17}, the positions in the result x_out4 ∈ R^{8×17×17} of the multiplication of q and k where the mask is 0 are replaced with the value 1e-9; the calculation process is represented by equation (15):
x_out5=softmax(Mask(x_out4,value=1e-9)),x_out5∈R8×17×17 (15)
wherein, Mask (·) represents a masking operation, and softmax (·) is a softmax activation layer in the neural network;
the output result x_out5 is multiplied by the data v, and after deformation the output is expressed by formula (16):
x_out6=Tranf(x_out5·v),x_out5∈R8×17×17,v∈R8×17×128,x_out6∈R17×1024 (16)
wherein Tranf (·) represents the matrix dimension transformation;
c. data linear transformation and activation processing:
x_out7=Dropout(Linear(x_out6,dim=1024,dim=1024),dropout=0.1),x_out7∈R17×1024
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel dim is 1024; Dropout(·) denotes the neuron activation layer processing, with activation factor dropout = 0.1;
after parallel attention operation, residual operation is performed, and the obtained module output is formula (17):
x_out=x_in+x_out7,x_in∈R17×1024,x_out7∈R17×1024,x_out∈R17×1024 (17);
3) feed-forward network data processing
Feed-forward network data processing, namely applying the following operations to the data obtained after the parallel attention operation, the input data of this part being x_in ∈ R^{17×1024}; the following sequence of processing steps is carried out:
linear processing, formula (18):
x_out1 = Linear(x_in, dim=1024, mlp_dim=2048), x_out1 ∈ R^{17×2048} (18)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (19):
x_out2 = GELU(x_out1), x_out2 ∈ R^{17×2048} (19)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (20):
x_out3 = Dropout(x_out2, dropout=0.1), x_out3 ∈ R^{17×2048} (20)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (21):
x_out4 = Linear(x_out3, mlp_dim=2048, dim=1024), x_out4 ∈ R^{17×1024} (21)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel dim is 1024;
the neuron activation layer operation is shown in formula (22):
x_out5 = Dropout(x_out4, dropout=0.1), x_out5 ∈ R^{17×1024} (22)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
after the feed-forward network data processing, a residual operation is applied, and the final output data obtained is shown in formula (23):
x_out = x_in + x_out5, x_in ∈ R^{17×1024}, x_out5 ∈ R^{17×1024}, x_out ∈ R^{17×1024} (23);
the data output after the continuous coding module is then processed to obtain the action detection result, the process being expressed as formula (24):
x_out = x_in[0], x_in ∈ R^{17×1024}, x_out ∈ R^{1×1024} (24)
the output data then undergoes the following sequence of operations, formulas (25) to (30):
normalization:
x_out1 = Norml(x_out), x_out1 ∈ R^{1×1024} (25)
wherein Norml(·) denotes the normalization operation;
linear processing, formula (26):
x_out2 = Linear(x_out1, dim=1024, mlp_dim=2048), x_out2 ∈ R^{1×2048} (26)
wherein Linear(·) denotes a linear transformation, the input channel dim is 1024 and the output channel mlp_dim is 2048;
the activation function layer is represented by formula (27):
x_out3 = GELU(x_out2), x_out3 ∈ R^{1×2048} (27)
wherein GELU(·) denotes the GELU activation function;
the neuron activation layer operation is shown in formula (28):
x_out4 = Dropout(x_out3, dropout=0.1), x_out4 ∈ R^{1×2048} (28)
wherein Dropout(·) denotes the activation layer processing, and the activation factor dropout is 0.1;
linear processing, formula (29):
x_out5 = Linear(x_out4, mlp_dim=2048, num_class), x_out5 ∈ R^{1×num_class} (29)
wherein Linear(·) denotes a linear transformation, the input channel mlp_dim is 2048 and the output channel num_class is the number of classes;
the activation function layer is given by formula (30):
x_out6 = softmax(x_out5), x_out6 ∈ R^{1×num_class} (30)
wherein softmax(·) denotes the softmax activation function, which yields the final action recognition result;
and step four, calculating the cross entropy loss of the action detection result and the class label target1, and optimizing the network parameters.
2. The behavior recognition method based on the conversion module as claimed in claim 1, wherein reading the consecutive frame images and constructing the mask comprises the following processes:
according to the time sequence, input data input is constructed from the image data of 16 frames forming one continuous clip, the dimension of the continuous-frame image data being input ∈ R^{16×3×H×W}, wherein H and W are the original height and width of the picture;
for each picture of the continuous-frame input data input, picture size conversion is performed by equal-ratio scaling, after which the data dimension is as shown in formula (1):
input ∈ R^{16×3×h×w} (1)
wherein h and w are the height and width of the scaled picture;
key-frame target label information target2, containing the action label, is obtained,
and a position mask matrix mask is constructed with dimension mask ∈ R^{4×4}, a two-dimensional all-ones matrix used to mark the positions of real pictures in the input data.
3. The behavior recognition method based on the conversion module as claimed in claim 1, wherein the step two of obtaining the input data of the conversion module comprises the following processes:
the continuous-frame image data input, with clip = 16, is tiled into a two-dimensional matrix, its dimension becoming input ∈ R^{16×d}, wherein d = 3 × h × w;
the flattened continuous-frame data input is compressed with a linear link layer, the number of input channels of the linear link layer being d and the number of output channels being 1024, the compressed continuous-frame data being given by formula (2):
clip_frame = Linear(input) (2)
wherein Linear(·) is the linear link layer operation, yielding a two-dimensional matrix clip_frame ∈ R^{16×1024};
a random trainable parameter matrix cls_token is constructed with dimension cls_token ∈ R^{1×1024};
the data cls_token and the data clip_frame are concatenated along the first dimension to obtain the input data in_data of the conversion module, as shown in formula (3):
in_data = Cat(cls_token, clip_frame), cls_token ∈ R^{1×1024}, clip_frame ∈ R^{16×1024} (3)
wherein Cat(·) denotes the matrix concatenation operation, yielding a two-dimensional matrix in_data ∈ R^{17×1024};
the position mask matrix mask operation includes the following processes:
the mask matrix mask is tiled into a one-dimensional vector, its dimension changing as shown in formula (4):
mask ∈ R^{4×4} → mask ∈ R^{1×16} (4)
the mask matrix is then padded to obtain a transformed mask matrix represented by formula (5):
mask=Pad(mask,(1,0),value=1) (5)
wherein Pad(·) denotes the padding operation and (1,0) denotes that one element with value 1 is prepended at the first position; the output mask dimension transformation relation is expressed as formula (6):
mask ∈ R^{1×16} → mask ∈ R^{1×17} (6)
the data mask is dimension-transformed to obtain two new matrices, as shown in formula (7):
mask1 ∈ R^{17×1}, mask2 ∈ R^{1×17} (7)
a new mask input matrix is obtained by formula (8):
in_mask = mask1 × mask2 (8)
yielding a two-dimensional matrix of dimension in_mask ∈ R^{17×17}.
CN202011383635.6A 2020-12-01 2020-12-01 Behavior recognition method based on conversion module Active CN113033276B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011383635.6A CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module
PCT/CN2021/116770 WO2022116616A1 (en) 2020-12-01 2021-09-06 Behavior recognition method based on conversion module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011383635.6A CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module

Publications (2)

Publication Number Publication Date
CN113033276A CN113033276A (en) 2021-06-25
CN113033276B true CN113033276B (en) 2022-05-17

Family

ID=76459191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011383635.6A Active CN113033276B (en) 2020-12-01 2020-12-01 Behavior recognition method based on conversion module

Country Status (2)

Country Link
CN (1) CN113033276B (en)
WO (1) WO2022116616A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033276B (en) * 2020-12-01 2022-05-17 神思电子技术股份有限公司 Behavior recognition method based on conversion module
CN115065567B (en) * 2022-08-19 2022-11-11 北京金睛云华科技有限公司 Plug-in execution method for DGA domain name study and judgment inference machine
CN116246338B (en) * 2022-12-20 2023-10-03 西南交通大学 Behavior recognition method based on graph convolution and transducer composite neural network


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10474988B2 (en) * 2017-08-07 2019-11-12 Standard Cognition, Corp. Predicting inventory events using foreground/background processing
CN109543627B (en) * 2018-11-27 2023-08-01 西安电子科技大学 Method and device for judging driving behavior category and computer equipment
US10977355B2 (en) * 2019-09-11 2021-04-13 Lg Electronics Inc. Authentication method and device through face recognition
CN111008567B (en) * 2019-11-07 2023-03-24 郑州大学 Driver behavior identification method
CN113033276B (en) * 2020-12-01 2022-05-17 神思电子技术股份有限公司 Behavior recognition method based on conversion module

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909005A (en) * 2017-10-26 2018-04-13 西安电子科技大学 Personage's gesture recognition method under monitoring scene based on deep learning
CN108830157A (en) * 2018-05-15 2018-11-16 华北电力大学(保定) Human bodys' response method based on attention mechanism and 3D convolutional neural networks
CN109726671A (en) * 2018-12-27 2019-05-07 上海交通大学 The action identification method and system of expression study from the overall situation to category feature
CN109829443A (en) * 2019-02-23 2019-05-31 重庆邮电大学 Video behavior recognition methods based on image enhancement Yu 3D convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pedestrian detection method based on the YOLO algorithm; Dai Shu et al.; Radio Communications Technology; 2020-05-18 (No. 03); full text *

Also Published As

Publication number Publication date
CN113033276A (en) 2021-06-25
WO2022116616A1 (en) 2022-06-09

Similar Documents

Publication Publication Date Title
CN113033276B (en) Behavior recognition method based on conversion module
Zhang et al. Context encoding for semantic segmentation
CN111079532B (en) Video content description method based on text self-encoder
Kim et al. Fully deep blind image quality predictor
CN110929622A (en) Video classification method, model training method, device, equipment and storage medium
CN110378208B (en) Behavior identification method based on deep residual error network
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN113065451A (en) Multi-mode fused action recognition device and method and storage medium
CN113408343A (en) Classroom action recognition method based on double-scale space-time block mutual attention
CN111489803A (en) Report coding model generation method, system and equipment based on autoregressive model
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN117197727B (en) Global space-time feature learning-based behavior detection method and system
CN117058595B (en) Video semantic feature and extensible granularity perception time sequence action detection method and device
CN116258914B (en) Remote Sensing Image Classification Method Based on Machine Learning and Local and Global Feature Fusion
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN117576724A (en) Unmanned plane bird detection method, system, equipment and medium
CN112508121A (en) Method and system for sensing outside by industrial robot
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN116543338A (en) Student classroom behavior detection method based on gaze target estimation
CN115965836A (en) Human behavior posture video data amplification system and method with controllable semantics
CN112613405B (en) Method for recognizing actions at any visual angle
CN114283301A (en) Self-adaptive medical image classification method and system based on Transformer
CN114881098A (en) Label noise estimation method based on manifold regularization transfer matrix
CN116071825B (en) Action behavior recognition method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant