CN113989940A - Method, system, equipment and storage medium for recognizing actions in video data

Method, system, equipment and storage medium for recognizing actions in video data

Info

Publication number
CN113989940A
Authority
CN
China
Prior art keywords
video data
feature tensor
dependence
dependent
pooling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111363930.XA
Other languages
Chinese (zh)
Other versions
CN113989940B (en)
Inventor
郝艳宾
谭懿
何向南
杨勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202111363930.XA priority Critical patent/CN113989940B/en
Publication of CN113989940A publication Critical patent/CN113989940A/en
Application granted granted Critical
Publication of CN113989940B publication Critical patent/CN113989940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/15 - Correlation function computation including computation of convolution operations
    • G06F 17/153 - Multidimensional correlation or convolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for recognizing actions in video data. The method comprises: pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner, and performing dependency activation with a convolution layer to obtain the corresponding dependency representations; aggregating the dependency representations with a query-structure attention mechanism, optimizing the original video feature tensor, and recognizing the action with the optimized result. The scheme can be directly inserted into convolution-based action recognition models, introduces almost no additional parameters or computation, and experiments show that it can significantly improve the classification performance of the action recognition model.

Description

Method, system, equipment and storage medium for recognizing actions in video data
Technical Field
The invention relates to the field of computer vision, in particular to a method, a system, equipment and a storage medium for recognizing actions in video data.
Background
In the multimedia age, terminal devices of all kinds, such as mobile phones, cameras and surveillance cameras, continuously generate massive amounts of video data, and action classification is an effective way of analyzing such data. However, compared with image data, video data has an additional time dimension, which introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependent modeling problem of motion recognition, the following methods exist:
1) Methods based on implicit dependency modeling. These methods directly expand the two-dimensional convolution kernels of existing image classification networks, such as ResNet, into three-dimensional convolution kernels and rely solely on the optimized three-dimensional kernels to learn features from video data implicitly. Such methods depend entirely on the number of stacked layers to model long-distance dependence, so only the last few layers of the network can perceive long-distance dependence. Meanwhile, this crude dimension expansion sharply increases the amount of computation and the model size, making such methods difficult to train.
2) Methods based on explicit temporal dependency modeling. These methods focus on the time dimension that video data adds relative to image data and explicitly exploit temporal information to capture the dynamic characteristics of video data. Compared with implicit dependency modeling, the dedicated design for the time dimension avoids heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, such methods ignore the other content dependencies that are widely present in video data, so their performance is limited.
3) Methods based on global spatio-temporal point attention. These methods add a global attention mechanism to the action classification model and capture long-distance content dependencies through the pairwise matching relationships between spatio-temporal points in the video data. However, computing the relationships between spatio-temporal points pair by pair makes the model bloated and slow.
In general, the above methods fail to resolve the trade-off between modeling multiple content dependencies and keeping the model efficient, and the performance and computational overhead of action recognition models still need to be optimized.
Disclosure of Invention
The invention aims to provide a method, a system, a device and a storage medium for recognizing actions in video data which, for the action recognition task, model and aggregate the multiple content dependencies in video data while adding almost no parameters or computation to the action recognition model, thereby improving its classification performance.
The purpose of the invention is realized by the following technical scheme:
a method for recognizing actions in video data comprises the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency representations;
introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A system for recognizing motion in video data, for implementing the method, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program, characterized in that the computer program realizes the aforementioned method when executed by a processor.
It can be seen from the technical scheme provided by the invention that the method is plug-and-play: it can be directly inserted into convolution-based action recognition models, introduces almost no extra parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for recognizing actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for recognizing actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dependency aggregation based on an attention mechanism of an interrogation structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a system for recognizing motion in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
The following describes a method for recognizing actions in video data according to the present invention in detail. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer. The reagents or instruments used in the examples of the present invention are not specified by manufacturers, and are all conventional products available by commercial purchase.
As shown in fig. 1, a method for recognizing actions in video data mainly includes the following steps:
step 1, acquiring an original video feature tensor extracted from video data by an action recognition model.
Step 2, pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and performing dependency activation with the convolution layer to obtain the corresponding dependency representations.
Step 3, introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor.
Step 4, inputting the optimized video data feature tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance video processing and video content analysis. Taking the widely used Temporal Segment Network (TSN) and Temporal Shift Module (TSM) as baselines, specific experiments given later verify the effectiveness of the method in improving the performance of an action recognition system; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is directed to various aspects of the invention.
Firstly, acquiring an original video feature tensor.
In the embodiment of the invention, the input of the action recognition model is video data, and the output is an original video characteristic tensor; the motion recognition model can be any existing motion recognition model with any structure.
Secondly, video data multi-content dependency modeling (MDM for short).
As shown in fig. 2, a model structure unit of a classical video data action recognition method includes three convolutional layers (conv1, conv2, conv3 shown on the left side of fig. 2), and the method proposed by the present invention (SDA) acts between the second and third convolutional layers. For the video feature tensor Y ∈ ℝ^(T×H×W×C) output by the action recognition model, the MDM mines a variety of spatio-temporal content dependencies from it. Firstly, in order to keep the model lightweight, the MDM uses a convolution layer and a ReLU activation function to compress the video feature tensor Y from C channels to C/r_c channels; the generated feature tensor is recorded as Y′ ∈ ℝ^(T×H×W×C/r_c), where ℝ is the set of real numbers, T, H and W respectively denote the temporal length, height and width of the video feature tensor, and r_c denotes the compression factor.
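To make the channel-compression step concrete, the following sketch (an illustrative assumption using PyTorch with an N×C×T×H×W layout and example sizes, not the patented implementation itself) reduces the C channels of the feature tensor to C/r_c with a 1×1×1 convolution followed by a ReLU:

```python
# Illustrative sketch of the lightweight channel compression (assumed PyTorch, NCTHW layout).
import torch
import torch.nn as nn

C, r_c = 64, 16                       # example channel count and compression factor
reduce = nn.Sequential(
    nn.Conv3d(C, C // r_c, kernel_size=1, bias=False),  # 1x1x1 convolution: C -> C/r_c channels
    nn.ReLU(inplace=True),
)

Y = torch.randn(2, C, 8, 56, 56)      # (N, C, T, H, W) original video feature tensor
Y_prime = reduce(Y)                   # (N, C/r_c, T, H, W) compressed feature tensor Y'
print(Y_prime.shape)                  # torch.Size([2, 4, 8, 56, 56])
```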
The MDM takes the generated feature tensor Y′ as input and outputs a series of dependency representations, denoted {R_1, R_2, ..., R_M} = MDM(Y′). The different dependency representations are computed with a uniform flow: feature compression → dependency activation.
1) Feature compression.
In the embodiment of the invention, the feature compression is realized by adopting pooling operation. For the spatiotemporal characteristics of the feature tensor Y', the MDM pools it from different directions (e.g. spatial direction, temporal direction) at different scales (e.g. global scale, local scale).
The pooling kernel is recorded as W_pool ∈ ℝ^(p_t×p_h×p_w), where p_t, p_h, p_w indicate the size of the pooling kernel's receptive field. In order to obtain the overall data characteristics of the video data in each direction, the invention specifically selects average pooling as the pooling operation. Denoting the average pooling operation as pool_avg(), the compressed dependency feature tensor A is calculated from the generated feature tensor Y′ as:

A = pool_avg(Y′; p_t×p_h×p_w)

where different p_t, p_h, p_w sizes correspond to different directions and different scales.
The compressed dependency feature tensor provides the data characteristics within the pooling kernel's receptive field for the subsequent dependency modeling.
2) Dependency activation.
After the dependency feature tensor A is obtained, the MDM uses a convolution operator and a ReLU operator to realize dependency activation, and a reshape operator then restores the pooled dependency feature tensor to its size before pooling for subsequent computation; the invention uses a convolution operation to implement this operator. Similar to the pooling operation, the convolution kernel is recorded as W_conv ∈ ℝ^(c_t×c_h×c_w) and the convolution operation as Conv3d(). Performing the convolution operation on the dependency feature tensor A yields the corresponding dependency representation R:

R = Conv3d(A; c_t×c_h×c_w)

where c_t, c_h, c_w represent the size of the convolution kernel.
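As an illustration of the feature compression → dependency activation flow, the sketch below implements one dependency branch: average pooling with a given kernel, a small three-dimensional convolution with ReLU, and expansion back to the original T×H×W grid. The branch shown (pooling over the whole time axis followed by a 1×3×3 convolution) is one plausible reading of the long-distance time dependence; the kernel sizes and the use of nearest-neighbour up-sampling for the later element-replication step are assumptions.

```python
# Sketch of a single dependency branch: pooling (compression), conv + ReLU (activation),
# then expansion back to T x H x W. Kernel sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dependency_branch(y, pool_kernel, conv):
    """y: (N, C', T, H, W); pool_kernel: (p_t, p_h, p_w); conv: an nn.Conv3d layer."""
    t, h, w = y.shape[2:]
    a = F.avg_pool3d(y, kernel_size=pool_kernel)              # compressed dependency tensor A
    r = F.relu(conv(a))                                       # dependency activation
    return F.interpolate(r, size=(t, h, w), mode="nearest")   # replicate back to T x H x W

y_prime = torch.randn(2, 4, 8, 56, 56)                        # compressed feature tensor Y'
conv_lt = nn.Conv3d(4, 4, kernel_size=(1, 3, 3), padding=(0, 1, 1))
r_lt = dependency_branch(y_prime, pool_kernel=(8, 1, 1), conv=conv_lt)
print(r_lt.shape)                                             # torch.Size([2, 4, 8, 56, 56])
```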
Based on the above principle, in the embodiment of the present invention, two groups of content dependencies are set in the video data multi-content dependency modeling, as shown in parts (a) and (b) on the right side of fig. 2:
The first group is long-range content dependencies, which reflect the relationship between video contents from three perspectives: temporal, spatial and spatio-temporal. When the pooling kernel is W_pool^LST ∈ ℝ^(T×H×W), reflecting long-distance space-time dependence (LST for short), the convolution kernel of the corresponding convolution layer is W_conv^LST ∈ ℝ^(1×1×1); when the pooling kernel is W_pool^LT ∈ ℝ^(T×1×1), reflecting long-distance time dependence (LT for short), the convolution kernel of the corresponding convolution layer is W_conv^LT ∈ ℝ^(1×c_h×c_w); when the pooling kernel is W_pool^LS ∈ ℝ^(1×H×W), reflecting long-distance spatial dependence (LS for short), the convolution kernel of the corresponding convolution layer is W_conv^LS ∈ ℝ^(c_t×1×1).
Based on these three pooling kernels, three dependency feature tensors A^LST, A^LT and A^LS can be obtained. Similarly, based on the above three convolution kernels, the information between channels is mixed using three convolution operations, and the corresponding dependency representations are obtained as follows:

R_1 = Conv3d(A^LST; 1×1×1)
R_2 = Conv3d(A^LT; 1×c_h×c_w)
R_3 = Conv3d(A^LS; c_t×1×1)

The second group is short-range content dependence, which focuses on compressing information in a local spatio-temporal receptive field, using a local pooling kernel to compress the dynamic information presented within the local receptive field. The corresponding pooling kernel is W_pool^S ∈ ℝ^(d×c×c), and the convolution kernel of the corresponding convolution layer is W_conv^S ∈ ℝ^(b×a×a). The corresponding dependency feature tensor A^S and dependency representation R_4 can be calculated in the manner described above.
In the embodiment of the invention, 1 < a, c < min(H, W) and 1 < b, d < T. As an example, it may be set that a = 3, b = 3, c = 2 and d = 2.
After the various dependency representations are obtained, each of them is scaled to T×H×W×C/r_c using element replication.
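Putting the four branches together, the following sketch of the MDM stage uses the kernel choices assumed above (global T×H×W pooling for LST, T×1×1 for LT, 1×H×W for LS, and a small local kernel such as (1, 2, 2) for the short-range branch). The concrete kernel shapes, the class name MDM and the nearest-neighbour up-sampling are illustrative assumptions rather than the patent's exact configuration.

```python
# Sketch of multi-content dependency modeling (MDM) with four assumed branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDM(nn.Module):
    def __init__(self, channels, short_pool=(1, 2, 2)):
        super().__init__()
        self.short_pool = short_pool
        self.conv_lst = nn.Conv3d(channels, channels, kernel_size=1)                 # after global pooling
        self.conv_lt = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))   # after pooling away time
        self.conv_ls = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))   # after pooling away space
        self.conv_s = nn.Conv3d(channels, channels, (3, 3, 3), padding=1)            # after local pooling

    def _branch(self, y, pool_kernel, conv):
        size = y.shape[2:]
        a = F.avg_pool3d(y, kernel_size=pool_kernel)          # feature compression
        r = F.relu(conv(a))                                   # dependency activation
        return F.interpolate(r, size=size, mode="nearest")    # element replication to T x H x W

    def forward(self, y):
        t, h, w = y.shape[2:]
        return [
            self._branch(y, (t, h, w), self.conv_lst),        # R1: long-range spatio-temporal
            self._branch(y, (t, 1, 1), self.conv_lt),         # R2: long-range temporal
            self._branch(y, (1, h, w), self.conv_ls),         # R3: long-range spatial
            self._branch(y, self.short_pool, self.conv_s),    # R4: short-range
        ]

reps = MDM(channels=4)(torch.randn(2, 4, 8, 56, 56))
print([tuple(r.shape) for r in reps])                         # four tensors of shape (2, 4, 8, 56, 56)
```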
Thirdly, selective dependency aggregation (SEC for short).
The most intuitive dependency aggregation method is to average the obtained dependency representations; however, different videos have different dependency preferences, and simple average aggregation ignores important dependencies while emphasizing irrelevant ones.
As shown in FIG. 3, the present invention utilizes a query-structure attention mechanism (abbreviated as QSA) to aggregate the dependency representations, automatically emphasizing important dependencies by assigning weights to the different dependency representations. Specifically, the method comprises the following steps:
A learnable query vector q ∈ ℝ^(1×C/r_c) is introduced, and all the dependency representations {R_1, R_2, ..., R_M} are compressed by a global average pooling layer into an M×C/r_c matrix K that serves as the key in the attention mechanism:

K = [pool_avg(R_1); pool_avg(R_2); ...; pool_avg(R_M)]

The dependency representations are spliced together as the value:

V = [R_1; R_2; ...; R_M]

where M is the number of dependency representations (e.g. M = 4 in the previous example), C represents the number of channels of the original video feature tensor, and r_c represents the compression factor.
The vector inner product of the query vector and the dependency representations is calculated by the following formula to obtain the matching response strength of each dependency representation, which serves as the weight in the subsequent weighted summation (matrix multiplication is used for convenience of representation):

Attention(q, K) = softmax(q × K^T)

The final dependency representation is obtained by weighted summation of the various dependency representations:

R_sec = Attention(q, K) × V

where softmax() represents the softmax function, the superscript T denotes matrix transposition, and × represents matrix multiplication.
The above operation is performed by the dependency aggregation Block in fig. 2, and the SEC hardly adds extra parameters and calculation amount based on the above design.
In the embodiment of the invention, a 1 × 1 × 1 three-dimensional convolution kernel is used to restore the number of channels of the final dependency representation R_sec to that of the original video feature tensor Y, a Sigmoid activation function is used to map the result to the interval (0.0, 1.0), and finally the result is multiplied element by element with the original video feature tensor Y to obtain the optimized video data feature tensor Z, which is expressed as:

Z = Sigmoid(Conv3d(R_sec; 1×1×1)) ⊙ Y

where ⊙ indicates element-by-element multiplication and Conv3d() represents the convolution operation.
And taking the optimized video data characteristic tensor Z as the output of the SEC.
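The selective aggregation and gating described above can be sketched as follows: a learnable query is matched against globally pooled dependency representations (the keys), the softmax-weighted sum of the representations gives R_sec, and a 1×1×1 convolution followed by a Sigmoid gates the original feature tensor Y element by element. The class name SEC, the zero initialisation of the query and the exact tensor shapes are assumptions for illustration only.

```python
# Sketch of selective dependency aggregation (SEC): query-key attention over the
# dependency representations followed by Sigmoid gating of the original tensor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEC(nn.Module):
    def __init__(self, reduced_channels, out_channels):
        super().__init__()
        self.q = nn.Parameter(torch.zeros(1, reduced_channels))       # learnable query vector
        self.restore = nn.Conv3d(reduced_channels, out_channels, 1)   # 1x1x1 conv restores channel count

    def forward(self, deps, y):
        # deps: list of M tensors, each (N, C/r_c, T, H, W); y: (N, C, T, H, W)
        r = torch.stack(deps, dim=1)                                  # (N, M, C/r_c, T, H, W)
        k = r.mean(dim=(3, 4, 5))                                     # keys K via global average pooling: (N, M, C/r_c)
        attn = F.softmax(k @ self.q.t(), dim=1)                       # matching response strengths: (N, M, 1)
        r_sec = (attn.view(*attn.shape[:2], 1, 1, 1, 1) * r).sum(1)   # weighted sum -> (N, C/r_c, T, H, W)
        return torch.sigmoid(self.restore(r_sec)) * y                 # gate the original tensor: optimized Z

sec = SEC(reduced_channels=4, out_channels=64)
deps = [torch.randn(2, 4, 8, 56, 56) for _ in range(4)]
Y = torch.randn(2, 64, 8, 56, 56)
print(sec(deps, Y).shape)                                             # torch.Size([2, 64, 8, 56, 56])
```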
Fourthly, recognizing the action.
The optimized video data feature tensor Z is input into the action recognition model to obtain the action recognition result; the related recognition principle can refer to conventional technology and is not repeated here.
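For completeness, a usage sketch of the overall plug-in follows, assuming the MDM and SEC classes sketched above are available in the same scope: the SDA block compresses channels, models and selectively aggregates the dependencies, gates the incoming tensor, and is inserted between two convolutional layers of a hypothetical convolution-based action recognition block. Names such as SDA, conv2 and conv3 are illustrative, not taken from the patent.

```python
# Usage sketch: wrapping the steps above into one plug-and-play SDA block
# (assumes the MDM and SEC sketches above have been defined in this scope).
import torch
import torch.nn as nn

class SDA(nn.Module):
    def __init__(self, channels, r_c=16):
        super().__init__()
        reduced = channels // r_c
        self.reduce = nn.Sequential(nn.Conv3d(channels, reduced, 1, bias=False), nn.ReLU(inplace=True))
        self.mdm = MDM(reduced)             # multi-content dependency modeling (sketched earlier)
        self.sec = SEC(reduced, channels)   # selective dependency aggregation (sketched earlier)

    def forward(self, y):
        return self.sec(self.mdm(self.reduce(y)), y)   # optimized tensor Z, same shape as y

conv2 = nn.Conv3d(64, 64, 3, padding=1)   # hypothetical stand-ins for the block's conv layers
conv3 = nn.Conv3d(64, 64, 3, padding=1)
sda = SDA(channels=64)
x = torch.randn(2, 64, 8, 56, 56)
z = conv3(sda(conv2(x)))                  # SDA inserted between the second and third convolution
print(z.shape)                            # torch.Size([2, 64, 8, 56, 56])
```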
To demonstrate the effectiveness of the present invention, it was verified by performing the following experiments.
Experiments were performed on four real datasets, Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation index. The experiments use the widely used Temporal Segment Network (TSN) and Temporal Shift Module (TSM) as baselines. The experiment is divided into three parts:
1) The effects of the various dependency modelings proposed by the present invention were verified on Something-Something V1 based on TSN; the results are shown in Table 1.
Table 1: Enhancement of action recognition models by content dependency modeling
Here, #P represents the overall number of model parameters, and FLOPs is the number of floating-point operations, which measures the amount of computation required to classify the actions in one video. As can be seen from Table 1, the dependency modeling proposed by the invention adds only a small amount of parameters and computation. Secondly, each of the three long-distance dependency modelings proposed by the invention effectively improves the action recognition performance of the baseline model TSN, and using the three long-distance dependencies together gives a better result than using any single one alone. Table 1 also compares various short-range dependency settings, where Sxyz denotes the short-range dependency model whose pooling kernel W_pool is (x, y, z); S122 achieves the best performance, and using three short-range dependencies simultaneously does not make the action classification performance stronger. Finally, the invention achieves the largest improvement over TSN when the three long-range dependencies and S122 are used simultaneously.
2) The two dependency aggregation schemes proposed by the invention, selective aggregation and average aggregation, were compared on Something-Something V1; the results are shown in Table 2, where AVG stands for average aggregation and SEC for selective aggregation.
Table 2: Comparison of selective aggregation with average aggregation
As can be seen from Table 2, under different settings, using selective aggregation (SEC) significantly improves the accuracy of action recognition compared with average aggregation (AVG), while the extra parameters and computation it introduces are almost zero.
3) Using TSN and TSM as baselines, the action recognition accuracy of the invention was compared with other state-of-the-art action recognition models on Something-Something V1 and V2, Diving48 and EPIC-KITCHENS; the results are shown in Table 3.
Table 3: Performance comparison of the model based on selective dependency aggregation with other state-of-the-art models
In Table 3, SDA-TSN and SDA-TSM denote the combination of the scheme of the present invention with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM far exceed the original baseline models on all datasets, as well as the current state-of-the-art models.
Another embodiment of the present invention further provides a system for recognizing actions in video data, which is mainly used to implement the method provided in the foregoing embodiment, as shown in fig. 4, the system mainly includes:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
In addition, the main technical details related to the above system parts are introduced in detail in the previous method embodiment, and therefore are not described again.
Another embodiment of the present invention further provides a processing apparatus, as shown in the figure, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for recognizing actions in video data is characterized by comprising the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with a convolution layer to obtain the corresponding dependency representations;
introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
2. The method of claim 1, wherein the step of pooling the original video feature tensor at different scales from different directions comprises:
the original video feature tensor Y ∈ ℝ^(T×H×W×C) is compressed from C channels to C/r_c channels using a convolution layer and a ReLU activation function, and the generated video feature tensor is recorded as Y′ ∈ ℝ^(T×H×W×C/r_c), wherein ℝ is the set of real numbers, T, H and W sequentially represent the length, height and width of the video feature tensor, and r_c represents the compression factor.
3. The method of claim 2, wherein pooling at different scales from different directions comprises:
the pooling kernel is recorded as W_pool ∈ ℝ^(p_t×p_h×p_w), and the average pooling operation is recorded as pool_avg(); the compressed dependency feature tensor A is calculated from the generated video feature tensor Y′ as:

A = pool_avg(Y′; p_t×p_h×p_w)

wherein p_t, p_h, p_w indicate the size of the receptive field of the pooling kernel, and different p_t, p_h, p_w sizes correspond to different directions and different scales.
4. The method of claim 2, wherein the obtaining of the corresponding dependency representation comprises:
the convolution kernel is recorded as W_conv ∈ ℝ^(c_t×c_h×c_w), and the convolution operation is recorded as Conv3d(); the convolution operation is performed on the dependency feature tensor A to obtain the corresponding dependency representation R:

R = Conv3d(A; c_t×c_h×c_w)

wherein c_t, c_h, c_w represent the size of the convolution kernel.
5. The method according to claim 3 or 4, wherein two groups of content dependencies are set in the video data multiple content dependency modeling:
the first group is long-range content dependencies, which reflect the relationship between video contents from three perspectives: temporal, spatial and spatio-temporal; when the pooling kernel is W_pool^LST ∈ ℝ^(T×H×W), reflecting long-distance space-time dependence, the convolution kernel of the corresponding convolution layer is W_conv^LST ∈ ℝ^(1×1×1); when the pooling kernel is W_pool^LT ∈ ℝ^(T×1×1), reflecting long-distance time dependence, the convolution kernel of the corresponding convolution layer is W_conv^LT ∈ ℝ^(1×c_h×c_w); when the pooling kernel is W_pool^LS ∈ ℝ^(1×H×W), reflecting long-distance spatial dependence, the convolution kernel of the corresponding convolution layer is W_conv^LS ∈ ℝ^(c_t×1×1);
the second group is short-range content dependence, which focuses on compressing the information in a local spatio-temporal receptive field, the corresponding pooling kernel being W_pool^S ∈ ℝ^(d×c×c) and the convolution kernel of the corresponding convolution layer being W_conv^S ∈ ℝ^(b×a×a);
wherein 1 < a, c < min(H, W) and 1 < b, d < T.
6. The method of claim 1, wherein matching the query vector with all the dependency representations by using a query-structure attention mechanism, calculating the weights of the various dependency representations according to the matching response strengths, and performing weighted summation to obtain the final dependency representation comprises:
a learnable query vector q ∈ ℝ^(1×C/r_c) is introduced, and all the dependency representations {R_1, R_2, ..., R_M} are compressed by a global average pooling layer into an M×C/r_c matrix K serving as the key in the attention mechanism:

K = [pool_avg(R_1); pool_avg(R_2); ...; pool_avg(R_M)]

the dependency representations {R_1, R_2, ..., R_M} are spliced together as the value:

V = [R_1; R_2; ...; R_M]

wherein M is the number of dependency representations, C represents the number of channels of the video feature tensor, and r_c represents the compression factor;
the vector inner product of the query vector and the dependency representations is calculated by the following formula to obtain the matching response strength of each dependency representation as the weight of the subsequent weighted summation:

Attention(q, K) = softmax(q × K^T)

the final dependency representation is obtained by weighted summation:

R_sec = Attention(q, K) × V

wherein softmax() represents the softmax function, the superscript T is the matrix transposition symbol, and × represents matrix multiplication.
7. The method of claim 1, wherein performing the threshold operation on the original video data feature tensor with the final dependency representation to obtain the optimized video data feature tensor comprises:
a 1 × 1 × 1 three-dimensional convolution kernel is used to restore the number of channels of the final dependency representation R_sec to that of the original video feature tensor Y, a Sigmoid activation function is used to map the result to the interval (0.0, 1.0), and finally the result is multiplied element by element with the original video feature tensor Y to obtain the optimized video data feature tensor Z, expressed as:

Z = Sigmoid(Conv3d(R_sec; 1×1×1)) ⊙ Y

wherein ⊙ indicates element-by-element multiplication and Conv3d() represents the convolution operation.
8. A system for recognizing motion in video data, the system being configured to implement the method of any one of claims 1 to 7, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain multiple groups of compressed dependency feature tensors, and then performing dependency activation with the convolution layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing a query vector to be matched with all the dependency representations using a query-structure attention mechanism, calculating the weight of each dependency representation according to its matching response strength, weighting and summing to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1 to 7.
CN202111363930.XA 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data Active CN113989940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111363930.XA CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Publications (2)

Publication Number Publication Date
CN113989940A true CN113989940A (en) 2022-01-28
CN113989940B CN113989940B (en) 2024-03-29

Family

ID=79749106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111363930.XA Active CN113989940B (en) 2021-11-17 2021-11-17 Method, system, device and storage medium for identifying actions in video data

Country Status (1)

Country Link
CN (1) CN113989940B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN112926396B (en) * 2021-01-28 2022-05-13 杭州电子科技大学 Action identification method based on double-current convolution attention

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190341052A1 (en) * 2018-05-02 2019-11-07 Simon Says, Inc. Machine Learning-Based Speech-To-Text Transcription Cloud Intermediary
WO2020233010A1 (en) * 2019-05-23 2020-11-26 平安科技(深圳)有限公司 Image recognition method and apparatus based on segmentable convolutional network, and computer device
CN111325145A (en) * 2020-02-19 2020-06-23 中山大学 Behavior identification method based on combination of time domain channel correlation blocks
CN113297964A (en) * 2021-05-25 2021-08-24 周口师范学院 Video target recognition model and method based on deep migration learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王辉涛; 胡燕: "Efficient video classification method based on global spatio-temporal receptive field", Journal of Chinese Computer Systems (小型微型计算机系统), no. 08
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism", Electronic Technology & Software Engineering (电子技术与软件工程), no. 04

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926770A (en) * 2022-05-31 2022-08-19 上海人工智能创新中心 Video motion recognition method, device, equipment and computer readable storage medium
CN114926770B (en) * 2022-05-31 2024-06-07 上海人工智能创新中心 Video motion recognition method, apparatus, device and computer readable storage medium
CN115861901A (en) * 2022-12-30 2023-03-28 深圳大学 Video classification method, device, equipment and storage medium
CN115861901B (en) * 2022-12-30 2023-06-30 深圳大学 Video classification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN113989940B (en) 2024-03-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant