CN113989940A - Method, system, equipment and storage medium for recognizing actions in video data

Method, system, equipment and storage medium for recognizing actions in video data

Info

Publication number
CN113989940A
Authority
CN
China
Prior art keywords
video data
feature tensor
dependence
dependent
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111363930.XA
Other languages
Chinese (zh)
Other versions
CN113989940B (en)
Inventor
郝艳宾
谭懿
何向南
杨勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202111363930.XA priority Critical patent/CN113989940B/en
Publication of CN113989940A publication Critical patent/CN113989940A/en
Application granted granted Critical
Publication of CN113989940B publication Critical patent/CN113989940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 17/153 - Multidimensional correlation or convolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for recognizing actions in video data. The method comprises: pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner, and performing dependency activation with a convolution layer to obtain the corresponding dependency representations; aggregating the dependency representations with a query-structure attention mechanism, optimizing the original video feature tensor, and recognizing the action with the optimized result. The scheme can be directly inserted into convolution-based action recognition models, introduces almost no additional parameters or computation, and experiments show that it can significantly improve the classification performance of the action recognition model.

Description

Method, system, equipment and storage medium for recognizing actions in video data
Technical Field
The invention relates to the field of computer vision, in particular to a method, a system, equipment and a storage medium for recognizing actions in video data.
Background
In the multimedia age, terminal devices of all kinds, such as mobile phones, cameras and surveillance cameras, continuously generate massive amounts of video data, and action classification is an effective way of analyzing such data. However, compared with image data, video data has an additional time dimension, which introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependent modeling problem of motion recognition, the following methods exist:
1) Methods based on implicit dependency modeling. These methods directly expand the two-dimensional convolution kernels of existing image classification networks, such as ResNet, into three-dimensional convolution kernels and rely solely on the optimized three-dimensional kernels to learn features from video data implicitly. Such methods depend entirely on the number of stacked layers to model long-distance dependence, so only the last few layers of the network can perceive long-distance dependence. Meanwhile, this crude dimension expansion sharply increases the amount of computation and the model size, making such methods difficult to train.
2) Methods based on explicit temporal dependency modeling. These methods focus on the time dimension that video data adds relative to image data and explicitly exploit temporal information to capture the dynamic characteristics of video data. Compared with implicit dependency modeling, the dedicated design for the time dimension avoids heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, such methods ignore the other content dependencies that are widely present in video data, so their performance is limited.
3) Methods based on global spatio-temporal point attention. These methods add a global attention mechanism to the action classification model and capture long-distance content dependencies through the pairwise matching relationships between spatio-temporal points in the video data. However, computing the relationships between spatio-temporal points pair by pair makes the model bloated and slow.
In general, the above methods fail to resolve the trade-off between modeling multiple content dependencies and keeping the model efficient, and the performance and computational overhead of action recognition models still need to be optimized.
Disclosure of Invention
The invention aims to provide a method, a system, a device and a storage medium for recognizing actions in video data which, for the action recognition task, model and aggregate the multiple content dependencies in video data while adding almost no parameters or computation to the action recognition model, thereby improving its classification performance.
The purpose of the invention is realized by the following technical scheme:
a method for recognizing actions in video data comprises the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency representations;
introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A system for recognizing motion in video data, for implementing the method, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program, characterized in that the computer program realizes the aforementioned method when executed by a processor.
It can be seen from the technical scheme provided by the invention that the method is plug-and-play: it can be directly inserted into convolution-based action recognition models, introduces almost no extra parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for recognizing actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for recognizing actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dependency aggregation based on an attention mechanism of an interrogation structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a system for recognizing motion in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
The following describes a method for recognizing actions in video data according to the present invention in detail. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer. The reagents or instruments used in the examples of the present invention are not specified by manufacturers, and are all conventional products available by commercial purchase.
As shown in fig. 1, a method for recognizing actions in video data mainly includes the following steps:
step 1, acquiring an original video feature tensor extracted from video data by an action recognition model.
Step 2, pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and performing dependency activation with the convolution layer to obtain the corresponding dependency representations.
Step 3, introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor.
Step 4, inputting the optimized video data feature tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance video processing and video content analysis. Taking the widely used Temporal Segment Network (TSN) and Temporal Shift Module (TSM) as baselines, specific experiments given later verify the effectiveness of the method in improving the performance of an action recognition system; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is directed to various aspects of the invention.
Firstly, acquiring an original video feature tensor.
In the embodiment of the invention, the input of the action recognition model is video data, and the output is an original video characteristic tensor; the motion recognition model can be any existing motion recognition model with any structure.
Secondly, video data multi-content dependency modeling (MDM for short).
As shown in fig. 2, a model structure unit of a classical video data action recognition method includes three convolutional layers (conv1, conv2, conv3 shown on the left side of fig. 2), and the method proposed by the present invention (SDA) acts between the second and third convolutional layers. For the video feature tensor Y ∈ ℝ^(T×H×W×C) output by the action recognition model, the MDM mines a variety of spatio-temporal content dependencies from it. Firstly, in order to keep the model lightweight, the MDM uses a convolution layer and a ReLU activation function to compress the video feature tensor Y from C channels to C/r_c channels; the generated feature tensor is recorded as Y′ ∈ ℝ^(T×H×W×C/r_c), where ℝ is the set of real numbers, T, H and W respectively denote the temporal length, height and width of the video feature tensor, and r_c denotes the compression factor.
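To make the channel-compression step concrete, the following sketch (an illustrative assumption using PyTorch with an N×C×T×H×W layout and example sizes, not the patented implementation itself) reduces the C channels of the feature tensor to C/r_c with a 1×1×1 convolution followed by a ReLU:

```python
# Illustrative sketch of the lightweight channel compression (assumed PyTorch, NCTHW layout).
import torch
import torch.nn as nn

C, r_c = 64, 16                       # example channel count and compression factor
reduce = nn.Sequential(
    nn.Conv3d(C, C // r_c, kernel_size=1, bias=False),  # 1x1x1 convolution: C -> C/r_c channels
    nn.ReLU(inplace=True),
)

Y = torch.randn(2, C, 8, 56, 56)      # (N, C, T, H, W) original video feature tensor
Y_prime = reduce(Y)                   # (N, C/r_c, T, H, W) compressed feature tensor Y'
print(Y_prime.shape)                  # torch.Size([2, 4, 8, 56, 56])
```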
The MDM takes the generated feature tensor Y′ as input and outputs a series of dependency representations, denoted {R_1, R_2, ..., R_M} = MDM(Y′). The different dependency representations are computed with a uniform flow: feature compression → dependency activation.
1) Feature compression.
In the embodiment of the invention, the feature compression is realized by adopting pooling operation. For the spatiotemporal characteristics of the feature tensor Y', the MDM pools it from different directions (e.g. spatial direction, temporal direction) at different scales (e.g. global scale, local scale).
The pooling kernel is recorded as W_pool ∈ ℝ^(p_t×p_h×p_w), where p_t, p_h, p_w indicate the size of the pooling kernel's receptive field. In order to obtain the overall data characteristics of the video data in each direction, the invention specifically selects average pooling as the pooling operation. Denoting the average pooling operation as pool_avg(), the compressed dependency feature tensor A is calculated from the generated feature tensor Y′ as:

A = pool_avg(Y′; p_t×p_h×p_w)

where different p_t, p_h, p_w sizes correspond to different directions and different scales.
The compressed dependency feature tensor provides the data characteristics within the pooling kernel's receptive field for the subsequent dependency modeling.
2) Dependency activation.
After the dependency feature tensor A is obtained, the MDM uses a convolution operator and a ReLU operator to realize dependency activation, and a reshape operator then restores the pooled dependency feature tensor to its size before pooling for subsequent computation; the invention uses a convolution operation to implement this operator. Similar to the pooling operation, the convolution kernel is recorded as W_conv ∈ ℝ^(c_t×c_h×c_w) and the convolution operation as Conv3d(). Performing the convolution operation on the dependency feature tensor A yields the corresponding dependency representation R:

R = Conv3d(A; c_t×c_h×c_w)

where c_t, c_h, c_w represent the size of the convolution kernel.
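As an illustration of the feature compression → dependency activation flow, the sketch below implements one dependency branch: average pooling with a given kernel, a small three-dimensional convolution with ReLU, and expansion back to the original T×H×W grid. The branch shown (pooling over the whole time axis followed by a 1×3×3 convolution) is one plausible reading of the long-distance time dependence; the kernel sizes and the use of nearest-neighbour up-sampling for the later element-replication step are assumptions.

```python
# Sketch of a single dependency branch: pooling (compression), conv + ReLU (activation),
# then expansion back to T x H x W. Kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dependency_branch(y, pool_kernel, conv):
    """y: (N, C', T, H, W); pool_kernel: (p_t, p_h, p_w); conv: an nn.Conv3d layer."""
    t, h, w = y.shape[2:]
    a = F.avg_pool3d(y, kernel_size=pool_kernel)              # compressed dependency tensor A
    r = F.relu(conv(a))                                       # dependency activation
    return F.interpolate(r, size=(t, h, w), mode="nearest")   # replicate back to T x H x W

y_prime = torch.randn(2, 4, 8, 56, 56)                        # compressed feature tensor Y'
conv_lt = nn.Conv3d(4, 4, kernel_size=(1, 3, 3), padding=(0, 1, 1))
r_lt = dependency_branch(y_prime, pool_kernel=(8, 1, 1), conv=conv_lt)
print(r_lt.shape)                                             # torch.Size([2, 4, 8, 56, 56])
```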
Based on the above principle, in the embodiment of the present invention, two groups of content dependencies are set in the video data multi-content dependency modeling, as shown in parts (a) and (b) on the right side of fig. 2:
The first group is long-range content dependencies, which reflect the relationship between video contents from three perspectives: temporal, spatial and spatio-temporal. When the pooling kernel is W_pool^LST ∈ ℝ^(T×H×W), reflecting long-distance space-time dependence (LST for short), the convolution kernel of the corresponding convolution layer is W_conv^LST ∈ ℝ^(1×1×1); when the pooling kernel is W_pool^LT ∈ ℝ^(T×1×1), reflecting long-distance time dependence (LT for short), the convolution kernel of the corresponding convolution layer is W_conv^LT ∈ ℝ^(1×c_h×c_w); when the pooling kernel is W_pool^LS ∈ ℝ^(1×H×W), reflecting long-distance spatial dependence (LS for short), the convolution kernel of the corresponding convolution layer is W_conv^LS ∈ ℝ^(c_t×1×1).
Based on these three pooling kernels, three dependency feature tensors A^LST, A^LT and A^LS can be obtained. Similarly, based on the above three convolution kernels, the information between channels is mixed using three convolution operations, and the corresponding dependency representations are obtained as follows:

R_1 = Conv3d(A^LST; 1×1×1)
R_2 = Conv3d(A^LT; 1×c_h×c_w)
R_3 = Conv3d(A^LS; c_t×1×1)

The second group is short-range content dependence, which focuses on compressing information in a local spatio-temporal receptive field, using a local pooling kernel to compress the dynamic information presented within the local receptive field. The corresponding pooling kernel is W_pool^S ∈ ℝ^(d×c×c), and the convolution kernel of the corresponding convolution layer is W_conv^S ∈ ℝ^(b×a×a). The corresponding dependency feature tensor A^S and dependency representation R_4 can be calculated in the manner described above.
In the embodiment of the invention, 1 < a, c < min(H, W) and 1 < b, d < T. As an example, it may be set that a = 3, b = 3, c = 2 and d = 2.
After the various dependency representations are obtained, each of them is scaled to T×H×W×C/r_c using element replication.
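Putting the four branches together, the following sketch of the MDM stage uses the kernel choices assumed above (global T×H×W pooling for LST, T×1×1 for LT, 1×H×W for LS, and a small local kernel such as (1, 2, 2) for the short-range branch). The concrete kernel shapes, the class name MDM and the nearest-neighbour up-sampling are illustrative assumptions rather than the patent's exact configuration.

```python
# Sketch of multi-content dependency modeling (MDM) with four assumed branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDM(nn.Module):
    def __init__(self, channels, short_pool=(1, 2, 2)):
        super().__init__()
        self.short_pool = short_pool
        self.conv_lst = nn.Conv3d(channels, channels, kernel_size=1)                 # after global pooling
        self.conv_lt = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))   # after pooling away time
        self.conv_ls = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))   # after pooling away space
        self.conv_s = nn.Conv3d(channels, channels, (3, 3, 3), padding=1)            # after local pooling

    def _branch(self, y, pool_kernel, conv):
        size = y.shape[2:]
        a = F.avg_pool3d(y, kernel_size=pool_kernel)          # feature compression
        r = F.relu(conv(a))                                   # dependency activation
        return F.interpolate(r, size=size, mode="nearest")    # element replication to T x H x W

    def forward(self, y):
        t, h, w = y.shape[2:]
        return [
            self._branch(y, (t, h, w), self.conv_lst),        # R1: long-range spatio-temporal
            self._branch(y, (t, 1, 1), self.conv_lt),         # R2: long-range temporal
            self._branch(y, (1, h, w), self.conv_ls),         # R3: long-range spatial
            self._branch(y, self.short_pool, self.conv_s),    # R4: short-range
        ]

reps = MDM(channels=4)(torch.randn(2, 4, 8, 56, 56))
print([tuple(r.shape) for r in reps])                         # four tensors of shape (2, 4, 8, 56, 56)
```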
Thirdly, selective dependency aggregation (SEC for short).
The most intuitive dependency aggregation method is to average the obtained dependency representations; however, different videos have different dependency preferences, and simple average aggregation ignores important dependencies while emphasizing irrelevant ones.
As shown in FIG. 3, the present invention utilizes a query-structure attention mechanism (abbreviated as QSA) to aggregate the dependency representations, automatically emphasizing important dependencies by assigning weights to the different dependency representations. Specifically, the method comprises the following steps:
A learnable query vector q ∈ ℝ^(1×C/r_c) is introduced, and all the dependency representations {R_1, R_2, ..., R_M} are compressed by a global average pooling layer into an M×C/r_c matrix K that serves as the key in the attention mechanism:

K = [pool_avg(R_1); pool_avg(R_2); ...; pool_avg(R_M)]

The dependency representations are spliced together as the value:

V = [R_1; R_2; ...; R_M]

where M is the number of dependency representations (e.g. M = 4 in the previous example), C represents the number of channels of the original video feature tensor, and r_c represents the compression factor.
The vector inner product of the query vector and the dependency representations is calculated by the following formula to obtain the matching response strength of each dependency representation, which serves as the weight in the subsequent weighted summation (matrix multiplication is used for convenience of representation):

Attention(q, K) = softmax(q × K^T)

The final dependency representation is obtained by weighted summation of the various dependency representations:

R_sec = Attention(q, K) × V

where softmax() represents the softmax function, the superscript T denotes matrix transposition, and × represents matrix multiplication.
The above operation is performed by the dependency aggregation Block in fig. 2, and the SEC hardly adds extra parameters and calculation amount based on the above design.
In the embodiment of the invention, a 1 × 1 × 1 three-dimensional convolution kernel is used to restore the number of channels of the final dependency representation R_sec to that of the original video feature tensor Y, a Sigmoid activation function is used to map the result to the interval (0.0, 1.0), and finally the result is multiplied element by element with the original video feature tensor Y to obtain the optimized video data feature tensor Z, which is expressed as:

Z = Sigmoid(Conv3d(R_sec; 1×1×1)) ⊙ Y

where ⊙ indicates element-by-element multiplication and Conv3d() represents the convolution operation.
And taking the optimized video data characteristic tensor Z as the output of the SEC.
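The selective aggregation and gating described above can be sketched as follows: a learnable query is matched against globally pooled dependency representations (the keys), the softmax-weighted sum of the representations gives R_sec, and a 1×1×1 convolution followed by a Sigmoid gates the original feature tensor Y element by element. The class name SEC, the zero initialisation of the query and the exact tensor shapes are assumptions for illustration only.

```python
# Sketch of selective dependency aggregation (SEC): query-key attention over the
# dependency representations followed by Sigmoid gating of the original tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEC(nn.Module):
    def __init__(self, reduced_channels, out_channels):
        super().__init__()
        self.q = nn.Parameter(torch.zeros(1, reduced_channels))       # learnable query vector
        self.restore = nn.Conv3d(reduced_channels, out_channels, 1)   # 1x1x1 conv restores channel count

    def forward(self, deps, y):
        # deps: list of M tensors, each (N, C/r_c, T, H, W); y: (N, C, T, H, W)
        r = torch.stack(deps, dim=1)                                  # (N, M, C/r_c, T, H, W)
        k = r.mean(dim=(3, 4, 5))                                     # keys K via global average pooling: (N, M, C/r_c)
        attn = F.softmax(k @ self.q.t(), dim=1)                       # matching response strengths: (N, M, 1)
        r_sec = (attn.view(*attn.shape[:2], 1, 1, 1, 1) * r).sum(1)   # weighted sum -> (N, C/r_c, T, H, W)
        return torch.sigmoid(self.restore(r_sec)) * y                 # gate the original tensor: optimized Z

sec = SEC(reduced_channels=4, out_channels=64)
deps = [torch.randn(2, 4, 8, 56, 56) for _ in range(4)]
Y = torch.randn(2, 64, 8, 56, 56)
print(sec(deps, Y).shape)                                             # torch.Size([2, 64, 8, 56, 56])
```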
Fourthly, recognizing the action.
The optimized video data feature tensor Z is input into the action recognition model to obtain the action recognition result; the related recognition principle can refer to conventional technology and is not repeated here.
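For completeness, a usage sketch of the overall plug-in follows, assuming the MDM and SEC classes sketched above are available in the same scope: the SDA block compresses channels, models and selectively aggregates the dependencies, gates the incoming tensor, and is inserted between two convolutional layers of a hypothetical convolution-based action recognition block. Names such as SDA, conv2 and conv3 are illustrative, not taken from the patent.

```python
# Usage sketch: wrapping the steps above into one plug-and-play SDA block
# (assumes the MDM and SEC sketches above have been defined in this scope).
import torch
import torch.nn as nn

class SDA(nn.Module):
    def __init__(self, channels, r_c=16):
        super().__init__()
        reduced = channels // r_c
        self.reduce = nn.Sequential(nn.Conv3d(channels, reduced, 1, bias=False), nn.ReLU(inplace=True))
        self.mdm = MDM(reduced)             # multi-content dependency modeling (sketched earlier)
        self.sec = SEC(reduced, channels)   # selective dependency aggregation (sketched earlier)

    def forward(self, y):
        return self.sec(self.mdm(self.reduce(y)), y)   # optimized tensor Z, same shape as y

conv2 = nn.Conv3d(64, 64, 3, padding=1)   # hypothetical stand-ins for the block's conv layers
conv3 = nn.Conv3d(64, 64, 3, padding=1)
sda = SDA(channels=64)
x = torch.randn(2, 64, 8, 56, 56)
z = conv3(sda(conv2(x)))                  # SDA inserted between the second and third convolution
print(z.shape)                            # torch.Size([2, 64, 8, 56, 56])
```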
To demonstrate the effectiveness of the present invention, it was verified by performing the following experiments.
Experiments were performed on four real datasets, Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation index. The experiments use the widely used Temporal Segment Network (TSN) and Temporal Shift Module (TSM) as baselines. The experiment is divided into three parts:
1) The effects of the various dependency modelings proposed by the present invention were verified on Something-Something V1 based on TSN; the results are shown in Table 1.
Table 1: Enhancement of action recognition models by content dependency modeling
Here, #P represents the overall number of model parameters, and FLOPs is the number of floating-point operations, which measures the amount of computation required to classify the actions in one video. As can be seen from Table 1, the dependency modeling proposed by the invention adds only a small amount of parameters and computation. Secondly, each of the three long-distance dependency modelings proposed by the invention effectively improves the action recognition performance of the baseline model TSN, and using the three long-distance dependencies together gives a better result than using any single one alone. Table 1 also compares various short-range dependency settings, where Sxyz denotes the short-range dependency model whose pooling kernel W_pool is (x, y, z); S122 achieves the best performance, and using three short-range dependencies simultaneously does not make the action classification performance stronger. Finally, the invention achieves the largest improvement over TSN when the three long-range dependencies and S122 are used simultaneously.
2) The two dependency aggregation schemes proposed by the invention, selective aggregation and average aggregation, were compared on Something-Something V1; the results are shown in Table 2, where AVG stands for average aggregation and SEC for selective aggregation.
Table 2: Comparison of selective aggregation with average aggregation
As can be seen from Table 2, under different settings, using selective aggregation (SEC) significantly improves the accuracy of action recognition compared with average aggregation (AVG), while the extra parameters and computation it introduces are almost zero.
3) Using TSN and TSM as baselines, the action recognition accuracy of the invention was compared with other state-of-the-art action recognition models on Something-Something V1 and V2, Diving48 and EPIC-KITCHENS; the results are shown in Table 3.
Table 3: Performance comparison of the model based on selective dependency aggregation with other state-of-the-art models
In Table 3, SDA-TSN and SDA-TSM denote the combination of the scheme of the present invention with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM far exceed the original baseline models on all datasets, as well as the current state-of-the-art models.
Another embodiment of the present invention further provides a system for recognizing actions in video data, which is mainly used to implement the method provided in the foregoing embodiment, as shown in fig. 4, the system mainly includes:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
In addition, the main technical details related to the above system parts are introduced in detail in the previous method embodiment, and therefore are not described again.
Another embodiment of the present invention further provides a processing apparatus, as shown in the figure, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing actions in video data is characterized by comprising the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency representations;
introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
2. The method of claim 1, wherein the step of pooling the original video feature tensor at different scales from different directions comprises:
the original video feature tensor Y ∈ ℝ^(T×H×W×C) is compressed from C channels to C/r_c channels using a convolution layer and a ReLU activation function, and the generated video feature tensor is recorded as Y′ ∈ ℝ^(T×H×W×C/r_c), wherein ℝ is the set of real numbers, T, H and W sequentially represent the length, height and width of the video feature tensor, and r_c represents the compression factor.
3. The method of claim 2, wherein pooling at different scales from different directions comprises:
the pooling kernel is recorded as W_pool ∈ ℝ^(p_t×p_h×p_w), and the average pooling operation is recorded as pool_avg(); the compressed dependency feature tensor A is calculated from the generated video feature tensor Y′ as:

A = pool_avg(Y′; p_t×p_h×p_w)

wherein p_t, p_h, p_w indicate the size of the receptive field of the pooling kernel, and different p_t, p_h, p_w sizes correspond to different directions and different scales.
4. The method of claim 2, wherein the obtaining of the corresponding dependency representation comprises:
the convolution kernel is recorded as W_conv ∈ ℝ^(c_t×c_h×c_w), and the convolution operation is recorded as Conv3d(); the convolution operation is performed on the dependency feature tensor A to obtain the corresponding dependency representation R:

R = Conv3d(A; c_t×c_h×c_w)

wherein c_t, c_h, c_w represent the size of the convolution kernel.
5. The method according to claim 3 or 4, wherein two groups of content dependencies are set in the video data multiple content dependency modeling:
the first group is long-range content dependencies, which reflect the relationship between video contents from three perspectives: temporal, spatial and spatio-temporal; when the pooling kernel is W_pool^LST ∈ ℝ^(T×H×W), reflecting long-distance space-time dependence, the convolution kernel of the corresponding convolution layer is W_conv^LST ∈ ℝ^(1×1×1); when the pooling kernel is W_pool^LT ∈ ℝ^(T×1×1), reflecting long-distance time dependence, the convolution kernel of the corresponding convolution layer is W_conv^LT ∈ ℝ^(1×c_h×c_w); when the pooling kernel is W_pool^LS ∈ ℝ^(1×H×W), reflecting long-distance spatial dependence, the convolution kernel of the corresponding convolution layer is W_conv^LS ∈ ℝ^(c_t×1×1);
the second group is short-range content dependence, which focuses on compressing the information in a local spatio-temporal receptive field, the corresponding pooling kernel being W_pool^S ∈ ℝ^(d×c×c) and the convolution kernel of the corresponding convolution layer being W_conv^S ∈ ℝ^(b×a×a);
wherein 1 < a, c < min(H, W) and 1 < b, d < T.
6. The method of claim 1, wherein matching the query vector with all the dependency representations by using a query-structure attention mechanism, calculating the weights of the various dependency representations according to the matching response strengths, and performing weighted summation to obtain the final dependency representation comprises:
a learnable query vector q ∈ ℝ^(1×C/r_c) is introduced, and all the dependency representations {R_1, R_2, ..., R_M} are compressed by a global average pooling layer into an M×C/r_c matrix K serving as the key in the attention mechanism:

K = [pool_avg(R_1); pool_avg(R_2); ...; pool_avg(R_M)]

the dependency representations {R_1, R_2, ..., R_M} are spliced together as the value:

V = [R_1; R_2; ...; R_M]

wherein M is the number of dependency representations, C represents the number of channels of the video feature tensor, and r_c represents the compression factor;
the vector inner product of the query vector and the dependency representations is calculated by the following formula to obtain the matching response strength of each dependency representation as the weight of the subsequent weighted summation:

Attention(q, K) = softmax(q × K^T)

the final dependency representation is obtained by weighted summation:

R_sec = Attention(q, K) × V

wherein softmax() represents the softmax function, the superscript T is the matrix transposition symbol, and × represents matrix multiplication.
7. The method of claim 1, wherein performing the threshold operation on the original video data feature tensor with the final dependency representation to obtain the optimized video data feature tensor comprises:
a 1 × 1 × 1 three-dimensional convolution kernel is used to restore the number of channels of the final dependency representation R_sec to that of the original video feature tensor Y, a Sigmoid activation function is used to map the result to the interval (0.0, 1.0), and finally the result is multiplied element by element with the original video feature tensor Y to obtain the optimized video data feature tensor Z, expressed as:

Z = Sigmoid(Conv3d(R_sec; 1×1×1)) ⊙ Y

wherein ⊙ indicates element-by-element multiplication and Conv3d() represents the convolution operation.
8. A system for recognizing motion in video data, the system being configured to implement the method of any one of claims 1 to 7, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1 to 7.
CN202111363930.XA 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data Active CN113989940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Publications (2)

Publication Number Publication Date
CN113989940A true CN113989940A (en) 2022-01-28
CN113989940B CN113989940B (en) 2024-03-29

Family

ID=79749106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111363930.XA Active CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Country Status (1)

Country Link
CN (1) CN113989940B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN112926396B (en) * 2021-01-28 2022-05-13 杭州电子科技大学 Action identification method based on double-current convolution attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341052A1 (en) * 2018-05-02 2019-11-07 Simon Says, Inc. Machine Learning-Based Speech-To-Text Transcription Cloud Intermediary
WO2020233010A1 (en) * 2019-05-23 2020-11-26 平安科技(深圳)有限公司 Image recognition method and apparatus based on segmentable convolutional network, and computer device
CN111325145A (en) * 2020-02-19 2020-06-23 中山大学 Behavior identification method based on combination of time domain channel correlation blocks
CN113297964A (en) * 2021-05-25 2021-08-24 周口师范学院 Video target recognition model and method based on deep migration learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王辉涛; 胡燕: "Efficient video classification method based on global spatio-temporal receptive field", Journal of Chinese Computer Systems (小型微型计算机系统), no. 08
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism", Electronic Technology & Software Engineering (电子技术与软件工程), no. 04

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium
CN114926770B (en) * 2022-05-31 2024-06-07 上海人工智能创新中心 Video motion recognition method, apparatus, device and computer readable storage medium
CN115861901A (en) * 2022-12-30 2023-03-28 深圳大学 Video classification method, device, equipment and storage medium
CN115861901B (en) * 2022-12-30 2023-06-30 深圳大学 Video classification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113989940B (en) 2024-03-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant