CN113989940A - Method, system, equipment and storage medium for recognizing actions in video data - Google Patents
- Publication number
- CN113989940A (Application No. CN202111363930.XA)
- Authority
- CN
- China
- Prior art keywords
- video data
- feature tensor
- dependence
- dependent
- pooling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
- G06F17/153—Multidimensional correlation or convolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention discloses a method, a system, equipment and a storage medium for recognizing actions in video data. The related method comprises the following steps: pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner, and performing dependency activation with a convolutional layer to obtain the corresponding dependency representations; aggregating the dependency representations with an attention mechanism of a query structure, optimizing the original video feature tensor with the aggregation result, and recognizing the action using the optimized result. The scheme can be directly inserted into convolution-based action recognition models, introduces almost no extra parameters or computation, and experiments show that it can significantly improve the classification performance of the action recognition model.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a method, a system, equipment and a storage medium for recognizing actions in video data.
Background
In the multimedia age, terminal devices such as mobile phones, cameras and surveillance cameras continuously generate massive amounts of video data, and action classification is an effective way to analyze such data. However, compared with image data, video data has an additional time dimension, which introduces more content dependencies and greatly increases the difficulty of video action recognition.
For the content-dependency modeling problem in action recognition, the following classes of methods exist:
1) Methods based on implicit dependency modeling, which directly expand the two-dimensional convolution kernels of an existing image classification network (such as ResNet) into three-dimensional convolution kernels and implicitly learn the features in video data using only the three-dimensional kernels to be optimized. Such methods rely entirely on the number of stacked layers to model long-range dependencies, so only the last few layers of the network can perceive long-range dependencies. Meanwhile, this crude dimension expansion sharply increases the amount of computation and the model size, making such methods difficult to train.
2) Methods based on temporal dependency modeling, which focus on the time dimension that video data adds over image data and explicitly use temporal information to capture the dynamic characteristics of the video. Compared with implicit dependency modeling, the dedicated design for the time dimension allows these methods to avoid heavy three-dimensional convolution kernels, reducing model complexity and improving performance. However, such methods ignore the other content dependencies that are widely present in video data, and their performance is therefore limited.
3) Methods based on global spatio-temporal point attention, which add a global attention mechanism to the action classification model and capture long-range content dependencies through the pairwise matching relationships between spatio-temporal points in the video data. However, computing the relationships between all pairs of spatio-temporal points makes the model bloated and the computation slow.
In general, the above methods fail to reconcile the modeling of multiple content dependencies with keeping the model efficient, and both the performance and the computational overhead of action recognition models still need to be improved.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a storage medium for recognizing actions in video data which, for the action recognition task, model and aggregate the multiple content dependencies in video data while adding almost no parameters or computation to the action recognition model, thereby improving the classification performance of the action recognition model.
The purpose of the invention is realized by the following technical scheme:
a method for recognizing actions in video data comprises the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with a convolutional layer to obtain the corresponding dependency representations;
introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A system for recognizing motion in video data, for implementing the method, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with the convolutional layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program, characterized in that the computer program realizes the aforementioned method when executed by a processor.
It can be seen from the technical scheme provided above that the method is plug-and-play: it can be directly inserted into convolution-based action recognition models, introduces almost no extra parameters or computation, and experiments show that it significantly improves the classification performance of the action recognition model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for recognizing actions in video data according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a model structure of a method for recognizing actions in video data according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a dependency aggregation based on an attention mechanism of an interrogation structure according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a system for recognizing motion in video data according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
The following describes a method for recognizing actions in video data according to the present invention in detail. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer. The reagents or instruments used in the examples of the present invention are not specified by manufacturers, and are all conventional products available by commercial purchase.
As shown in fig. 1, a method for recognizing actions in video data mainly includes the following steps:
Step 1, acquiring an original video feature tensor extracted from the video data by an action recognition model.
Step 2, pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and performing dependency activation with a convolutional layer to obtain the corresponding dependency representations.
Step 3, introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor.
Step 4, inputting the optimized video data feature tensor into the action recognition model to obtain an action recognition result.
The scheme of the embodiment of the invention can be applied to systems such as surveillance-video processing and video content analysis. The widely used temporal segment network (TSN) and temporal shift module (TSM) are taken as baselines, and specific experiments are given later, which verify the effectiveness of the method in improving the performance of an action recognition system; of course, other action recognition models may also be used with the present invention.
For ease of understanding, the following detailed description is directed to various aspects of the invention.
Firstly, acquiring an original video feature tensor.
In the embodiment of the invention, the input of the action recognition model is video data and the output is the original video feature tensor; the action recognition model can be any existing action recognition model of any structure.
Secondly, video data multi-content dependency modeling (MDM for short).
As shown in fig. 2, the model structure unit of a classical video action recognition method contains three convolutional layers (conv1, conv2, conv3 on the left side of fig. 2), and the proposed method (SDA) acts between the second and third convolutional layers. For the video feature tensor $Y \in \mathbb{R}^{T \times H \times W \times C}$ output by the action recognition model, the MDM mines multiple kinds of spatio-temporal content dependencies from it. First, to keep the model lightweight, the MDM uses a convolutional layer and a ReLU activation function to compress the video feature tensor $Y$ from $C$ channels to $C/r_c$ channels; the generated feature tensor is recorded as $Y' \in \mathbb{R}^{T \times H \times W \times C/r_c}$, where $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ respectively denote the length, height and width of the video feature tensor, and $r_c$ denotes the compression factor.
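For illustration, a minimal PyTorch sketch of this channel-compression step follows. The tensor layout (batch, channels, T, H, W), the feature sizes and the value r_c = 16 are assumptions made for the example only, not values fixed by the patent.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: batch B, channels C, frames T, height H, width W.
B, C, T, H, W = 2, 64, 8, 14, 14
r_c = 16  # compression factor (illustrative value)

Y = torch.randn(B, C, T, H, W)  # original video feature tensor from the backbone

# A 1x1x1 3D convolution followed by ReLU reduces the channel count from C to C / r_c.
reduce = nn.Sequential(
    nn.Conv3d(C, C // r_c, kernel_size=1),
    nn.ReLU(inplace=True),
)
Y_prime = reduce(Y)
print(Y_prime.shape)  # torch.Size([2, 4, 8, 14, 14])
```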
Taking the generated feature tensor $Y'$ as input, the MDM outputs a series of dependency representations, recorded as $\{R_1, R_2, \ldots, R_M\} = \mathrm{MDM}(Y')$. The computation of the different dependency representations follows a uniform flow: feature compression → dependency activation.
1) Feature compression.
In the embodiment of the invention, the feature compression is realized by adopting pooling operation. For the spatiotemporal characteristics of the feature tensor Y', the MDM pools it from different directions (e.g. spatial direction, temporal direction) at different scales (e.g. global scale, local scale).
Record the pooling kernel as $W_{pool} = (p_t, p_h, p_w)$, where $p_t$, $p_h$ and $p_w$ indicate the size of the pooling kernel's receptive field. To obtain the overall characteristics of the video data in each direction, the invention specifically selects average pooling as the pooling operation, recorded as $\mathrm{pool}_{avg}(\cdot)$. The process of calculating the compressed dependency feature tensor $A$ from the video feature tensor $Y'$ is represented as:
$$A = \mathrm{pool}_{avg}(Y'; p_t, p_h, p_w)$$
where different sizes of $p_t$, $p_h$ and $p_w$ correspond to different directions and different scales.
The compressed dependency feature tensor provides the data characteristics within the pooling kernel's receptive field for the subsequent dependency modeling.
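A brief sketch of this compression step, assuming the PyTorch layout used above; the kernel (1, H, W), which keeps the temporal axis and collapses the spatial one, is just one example of a pooling direction and scale.

```python
import torch
import torch.nn.functional as F

B, C_r, T, H, W = 2, 4, 8, 14, 14
Y_prime = torch.randn(B, C_r, T, H, W)  # channel-compressed feature tensor

# Average pooling with kernel (p_t, p_h, p_w) compresses the tensor along the chosen
# directions; here (1, H, W) keeps time and collapses space (a global spatial scale).
p_t, p_h, p_w = 1, H, W
A = F.avg_pool3d(Y_prime, kernel_size=(p_t, p_h, p_w))
print(A.shape)  # torch.Size([2, 4, 8, 1, 1])
```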
2) Dependency activation.
After the dependency feature tensor $A$ is obtained, the MDM applies an activation operator and a ReLU operator to realize dependency activation, and then a reshape (expansion) operator restores the pooled dependency feature tensor to its size before pooling to facilitate subsequent calculation; the invention realizes the activation operator with a convolution operation.
Similar to the pooling operation, record the convolution kernel as $W_{conv} = (c_t, c_h, c_w)$ and the convolution operation as $\mathrm{Conv3d}(\cdot)$. Performing the convolution operation on the dependency feature tensor $A$ yields the corresponding dependency representation $R$:
$$R = \mathrm{ReLU}(\mathrm{Conv3d}(A; c_t, c_h, c_w))$$
where $c_t$, $c_h$ and $c_w$ represent the size of the convolution kernel.
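A corresponding sketch of the activation step; the kernel (3, 1, 1) is an assumed choice that matches the spatially pooled tensor from the previous sketch.

```python
import torch
import torch.nn as nn

B, C_r, T = 2, 4, 8
A = torch.randn(B, C_r, T, 1, 1)  # spatially pooled dependency feature tensor

# A small 3D convolution plus ReLU activates the dependency; kernel (c_t, c_h, c_w)
# = (3, 1, 1) here, padded so the pooled shape is preserved.
activate = nn.Sequential(
    nn.Conv3d(C_r, C_r, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
    nn.ReLU(inplace=True),
)
R = activate(A)  # dependency representation at the pooled resolution
print(R.shape)   # torch.Size([2, 4, 8, 1, 1])
```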
Based on the above principle, in the embodiment of the invention two groups of content dependencies are set in the video data multi-content dependency modeling, as shown in parts (a) and (b) on the right side of fig. 2:
The first group is long-range content dependencies, which reflect the relationships between video contents from three perspectives, temporal, spatial and spatio-temporal: one pooling kernel, together with the convolution kernel of its corresponding convolutional layer, reflects the long-range spatio-temporal dependency (LST for short); a second pooling kernel and corresponding convolution kernel reflect the long-range temporal dependency (LT for short); and a third pooling kernel and corresponding convolution kernel reflect the long-range spatial dependency (LS for short), with the corresponding kernel sizes shown in fig. 2.
Based on the three pooling results, three convolution operations mix the information between the channels and produce the corresponding dependency representations $R_1$, $R_2$ and $R_3$ in the manner described above.
The second group is the short-range content dependency, which focuses on compressing information in a local spatio-temporal receptive field: a local pooling kernel compresses the dynamic information presented within the local receptive field, and the corresponding convolutional layer activates it. The corresponding dependency feature tensor $A_S$ and dependency representation $R_4$ are calculated in the manner described above.
In the embodiment of the invention, the kernel parameters $a$ and $c$ lie between 1 and $\min(H, W)$, and $b$ and $d$ lie between 1 and $T$. As an example, one may set $a = 3$, $b = 3$, $c = 2$ and $d = 2$.
After the various dependency representations are obtained, each is scaled to $T \times H \times W \times C/r_c$ by element replication. A sketch of two such branches is given below.
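To make the two groups concrete, here is a sketch of one long-range branch and one short-range branch followed by the element-replication step. The pooling and convolution kernel sizes are illustrative assumptions, not the patent's exact configuration (that is given in fig. 2); the short-range pooling kernel (1, 2, 2) corresponds to the S122 setting mentioned in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C_r, T, H, W = 2, 4, 8, 14, 14
Y_prime = torch.randn(B, C_r, T, H, W)

conv_lt = nn.Conv3d(C_r, C_r, kernel_size=(3, 1, 1), padding=(1, 0, 0))  # temporal mixing
conv_s = nn.Conv3d(C_r, C_r, kernel_size=(3, 1, 1), padding=(1, 0, 0))   # local branch

# Long-range temporal branch: pool away space, then activate along time.
a_lt = F.avg_pool3d(Y_prime, kernel_size=(1, H, W))
r_lt = F.relu(conv_lt(a_lt))

# Short-range branch: local pooling kernel (1, 2, 2), i.e. the S122 setting.
a_s = F.avg_pool3d(Y_prime, kernel_size=(1, 2, 2))
r_s = F.relu(conv_s(a_s))

# Element replication scales each representation back to T x H x W.
r_lt = F.interpolate(r_lt, size=(T, H, W), mode='nearest')
r_s = F.interpolate(r_s, size=(T, H, W), mode='nearest')
print(r_lt.shape, r_s.shape)  # both torch.Size([2, 4, 8, 14, 14])
```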
Thirdly, selective dependency aggregation (SEC for short).
The most intuitive dependency aggregation method is to average the obtained dependency representations; however, different videos have different dependency preferences, and simple average aggregation ignores important dependencies while emphasizing irrelevant ones.
As shown in fig. 3, the invention uses an attention mechanism of a query structure (QSA for short) to realize the aggregation of the dependency representations, automatically emphasizing important dependencies by assigning weights to the different dependency representations. Specifically:
A learnable query vector $q \in \mathbb{R}^{C/r_c}$ is introduced, and all dependency representations $\{R_1, \ldots, R_M\}$ are compressed by a global average pooling layer into a matrix $K$ of size $M \times C/r_c$, which serves as the keys in the attention mechanism:
$$K = [\mathrm{GAP}(R_1); \mathrm{GAP}(R_2); \ldots; \mathrm{GAP}(R_M)]$$
where $M$ is the number of dependency representations (e.g., $M = 4$ in the previous example), $C$ represents the number of channels of the original video feature tensor, $r_c$ represents the compression factor, and $\mathrm{GAP}(\cdot)$ denotes global average pooling.
The vector inner products between the query vector and the dependency representations are computed by the following formula to obtain the matching response strength of each dependency representation, which serves as the weight for the subsequent weighted summation (for convenience this is written as a matrix multiplication):
$$\mathrm{Attention}(q, K) = \mathrm{softmax}(q \times K^{T})$$
The final dependency representation is obtained by a weighted summation of the dependency representations according to the following formula:
$$R_{sec} = \mathrm{Attention}(q, K) \times V$$
where $\mathrm{softmax}(\cdot)$ represents the softmax function, the superscript $T$ is the transpose symbol, $\times$ represents matrix multiplication, and $V$ denotes the dependency representations stacked together as the values of the attention mechanism.
The above operations are performed by the dependency aggregation block in fig. 2, and with the above design the SEC adds almost no extra parameters or computation.
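A small sketch of this query-structured aggregation, following the sketches above; the batched handling and the random initialization of the query vector are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C_r, T, H, W, M = 2, 4, 8, 14, 14, 4
# Stack of M dependency representations, already replicated back to T x H x W.
V = torch.stack([torch.randn(B, C_r, T, H, W) for _ in range(M)], dim=1)  # (B, M, C_r, T, H, W)

# Keys: each representation is squeezed to a C_r-dimensional descriptor by global average pooling.
K = V.mean(dim=(3, 4, 5))  # (B, M, C_r)

# Learnable query vector shared across samples (assumed parameterisation).
q = nn.Parameter(torch.randn(C_r))

# Matching response strength: inner product of the query with each key, softmax over the M dependencies.
attn = F.softmax(torch.einsum('c,bmc->bm', q, K), dim=1)  # (B, M)

# Weighted sum of the dependency representations gives the final representation R_sec.
R_sec = torch.einsum('bm,bmcthw->bcthw', attn, V)  # (B, C_r, T, H, W)
print(attn.shape, R_sec.shape)
```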
In the embodiment of the invention, a $1 \times 1 \times 1$ three-dimensional convolution kernel restores the number of channels of the final dependency representation $R_{sec}$ to that of the original video feature tensor $Y$, a Sigmoid activation function maps the result to the interval $(0, 1)$, and finally the original video feature tensor $Y$ is multiplied element by element to obtain the optimized video data feature tensor $Z$, expressed as:
$$Z = \mathrm{Sigmoid}(\mathrm{Conv3d}(R_{sec}; 1 \times 1 \times 1)) \odot Y$$
where $\odot$ indicates element-by-element multiplication and $\mathrm{Conv3d}(\cdot)$ represents the convolution operation.
The optimized video data feature tensor $Z$ is taken as the output of the SEC.
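The gating (threshold) step can be sketched as follows, again under the assumed layout; the 1x1x1 convolution restores the channels and the Sigmoid output multiplies the original tensor element by element.

```python
import torch
import torch.nn as nn

B, C, C_r, T, H, W = 2, 64, 4, 8, 14, 14
Y = torch.randn(B, C, T, H, W)        # original video feature tensor
R_sec = torch.randn(B, C_r, T, H, W)  # aggregated final dependency representation

# A 1x1x1 3D convolution restores the channel count from C / r_c back to C,
# Sigmoid squashes the result into (0, 1), and it then gates Y element by element.
expand = nn.Conv3d(C_r, C, kernel_size=1)
Z = torch.sigmoid(expand(R_sec)) * Y  # optimized video feature tensor, same shape as Y
print(Z.shape)  # torch.Size([2, 64, 8, 14, 14])
```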
Fourthly, action recognition.
The optimized video data feature tensor $Z$ is input into the action recognition model to obtain the action recognition result; the recognition principle involved can refer to conventional technology and is not repeated here.
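For orientation, the sketches above can be composed into one plug-and-play module and dropped between two convolutions of a backbone block, in the spirit of fig. 2. All branch kernel sizes and the compression factor below are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDABlock(nn.Module):
    """Sketch of a selective-dependency-aggregation style block (assumed kernels)."""
    def __init__(self, channels, r_c=16):
        super().__init__()
        c_r = max(channels // r_c, 1)
        self.reduce = nn.Sequential(nn.Conv3d(channels, c_r, 1), nn.ReLU(inplace=True))
        # (pooling kernel, convolution kernel) per dependency branch -- assumed values.
        self.branch_cfg = [
            (('g', 'g', 'g'), (1, 1, 1)),  # long-range spatio-temporal
            ((1, 'g', 'g'), (3, 1, 1)),    # long-range temporal
            (('g', 1, 1), (1, 3, 3)),      # long-range spatial
            ((1, 2, 2), (3, 1, 1)),        # short-range local (S122)
        ]
        self.convs = nn.ModuleList([
            nn.Conv3d(c_r, c_r, k, padding=tuple(s // 2 for s in k))
            for _, k in self.branch_cfg])
        self.q = nn.Parameter(torch.randn(c_r))
        self.expand = nn.Conv3d(c_r, channels, 1)

    def forward(self, y):
        t, h, w = y.shape[2:]
        y_r = self.reduce(y)
        reps = []
        for (pk, _), conv in zip(self.branch_cfg, self.convs):
            k = tuple(dim if s == 'g' else s for s, dim in zip(pk, (t, h, w)))
            a = F.avg_pool3d(y_r, kernel_size=k, ceil_mode=True)            # feature compression
            r = F.relu(conv(a))                                             # dependency activation
            reps.append(F.interpolate(r, size=(t, h, w), mode='nearest'))   # element replication
        v = torch.stack(reps, dim=1)                 # (B, M, C_r, T, H, W)
        keys = v.mean(dim=(3, 4, 5))                 # (B, M, C_r)
        attn = F.softmax(torch.einsum('c,bmc->bm', self.q, keys), dim=1)
        r_sec = torch.einsum('bm,bmcthw->bcthw', attn, v)
        return torch.sigmoid(self.expand(r_sec)) * y  # optimized tensor Z

# Plug-and-play usage: insert between two convolutions of a backbone block.
x = torch.randn(2, 64, 8, 14, 14)
block = nn.Sequential(nn.Conv3d(64, 64, 3, padding=1), SDABlock(64), nn.Conv3d(64, 64, 3, padding=1))
print(block(x).shape)  # torch.Size([2, 64, 8, 14, 14])
```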
To demonstrate the effectiveness of the invention, the following experiments were performed for verification.
Experiments were performed on four real datasets, Something-Something V1 and V2, Diving48 and EPIC-KITCHENS, with action classification accuracy (Acc) as the evaluation metric. The widely used temporal segment network (TSN) and temporal shift module (TSM) were taken as baselines. The experiments are divided into three parts:
1) The effects of the various dependency modelings proposed by the invention were verified on Something-Something V1 with TSN as the baseline; the results are shown in Table 1.
TABLE 1 Improvement of the action recognition model by content-dependency modeling
Here, #P denotes the overall number of model parameters, and FLOPs denotes the number of floating-point operations, which measures the amount of computation required to classify the actions in a video. As can be seen from Table 1, the dependency modeling proposed by the invention adds only a small number of parameters and a small amount of computation. Second, each of the three long-range dependency modelings proposed by the invention effectively improves the action recognition performance of the baseline model TSN, and using the three long-range dependencies together gives a better result than using any one of them alone. Table 1 also compares several short-range dependency variants, where Sxyz denotes the short-range dependency model with pooling kernel $W_{pool} = (x, y, z)$; S122 achieves the best performance, and using three short-range dependencies simultaneously does not make the action classification performance stronger. Finally, the invention achieves the largest improvement over TSN when the three long-range dependencies and S122 are used simultaneously.
2) The two dependency aggregation methods, the proposed selective aggregation and average aggregation, were compared on Something-Something V1; the results are shown in Table 2, where AVG denotes average aggregation and SEC denotes selective aggregation.
TABLE 2 Comparison of selective aggregation with average aggregation
As can be seen from Table 2, under different settings, selective aggregation (SEC) significantly improves the accuracy of action recognition compared with average aggregation (AVG), while the extra parameters and computation it introduces are almost zero.
3) With TSN and TSM as baselines, the action recognition accuracy of the invention was compared with other state-of-the-art action recognition models on Something-Something V1 and V2, Diving48 and EPIC-KITCHENS; the results are shown in Table 3.
TABLE 3 Performance comparison of the model based on selective dependency aggregation with other state-of-the-art models
In Table 3, SDA-TSN and SDA-TSM denote the combination of the scheme of the invention with the two baseline models TSN and TSM; C3D, GST and TAM are current state-of-the-art action recognition models. As can be seen from Table 3, SDA-TSN and SDA-TSM considerably exceed both the original baseline models and the current state-of-the-art models on all datasets.
Another embodiment of the present invention further provides a system for recognizing actions in video data, which is mainly used to implement the method provided in the foregoing embodiment, as shown in fig. 4, the system mainly includes:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with the convolutional layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to perform all or part of the above described functions.
In addition, the main technical details related to the above system parts are introduced in detail in the previous method embodiment, and therefore are not described again.
Another embodiment of the present invention further provides a processing apparatus, as shown in the figure, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for recognizing actions in video data is characterized by comprising the following steps:
acquiring an original video feature tensor extracted from video data by an action recognition model;
pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with a convolutional layer to obtain the corresponding dependency representations;
introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
2. The method of claim 1, wherein the step of pooling the original video feature tensor at different scales from different directions comprises:
compressing the original video feature tensor from $C$ channels to $C/r_c$ channels using a convolutional layer and a ReLU activation function, the generated video feature tensor being recorded as $Y' \in \mathbb{R}^{T \times H \times W \times C/r_c}$, wherein $\mathbb{R}$ is the set of real numbers, $T$, $H$ and $W$ respectively represent the length, height and width of the video feature tensor, and $r_c$ represents the compression factor.
3. The method of claim 2, wherein pooling at different scales from different directions comprises:
recording the pooling kernel as $W_{pool} = (p_t, p_h, p_w)$ and the average pooling operation as $\mathrm{pool}_{avg}(\cdot)$, the process of calculating the compressed dependency feature tensor $A$ from the video feature tensor $Y'$ being represented as:
$$A = \mathrm{pool}_{avg}(Y'; p_t, p_h, p_w)$$
wherein $p_t$, $p_h$ and $p_w$ indicate the size of the pooling kernel's receptive field, and different sizes of $p_t$, $p_h$ and $p_w$ correspond to different directions and different scales.
4. The method of claim 2, wherein the obtaining of the corresponding dependency representation comprises:
letting the convolution kernel be $W_{conv} = (c_t, c_h, c_w)$ and recording the convolution operation as $\mathrm{Conv3d}(\cdot)$, the process of performing the convolution operation on the dependency feature tensor $A$ to obtain the corresponding dependency representation $R$ being:
$$R = \mathrm{ReLU}(\mathrm{Conv3d}(A; c_t, c_h, c_w))$$
wherein $c_t$, $c_h$ and $c_w$ represent the size of the convolution kernel.
5. The method according to claim 3 or 4, wherein two groups of content dependencies are set in the video data multiple content dependency modeling:
the first group is the long-range content dependency, which reflects the relationship between video content from three perspectives, temporal, spatial and spatiotemporal: when the pooling nucleus isTime, reaction to long distance spatio-temporal dependence, convolution kernel of corresponding convolution layer asWhen the pooling nucleus isTime, response to long distance time dependence, the convolution kernel of the corresponding convolutional layer isWhen the pooling nucleus isWhen, reflecting the long-distance spatial dependence, the convolution kernel of the corresponding convolution layer is
The second group is short-range content dependence, which focuses on compressing the information in a local spatio-temporal receptive field, with the corresponding pooling kernel beingThe convolution kernel of the corresponding convolution layer is
Wherein, a is more than 1, c is more than min (H, W), b is more than 1, and d is more than T.
6. The method of claim 1, wherein using an attention mechanism of a query structure to introduce a query vector to be matched against all the dependency representations, computing the weights of the dependency representations from the matching response strengths, and performing a weighted summation to obtain the final dependency representation comprises:
introducing a learnable query vector $q$ and compressing all dependency representations $\{R_1, \ldots, R_M\}$ with a global average pooling layer into a matrix $K$ of size $M \times C/r_c$, which serves as the keys in the attention mechanism:
$$K = [\mathrm{GAP}(R_1); \mathrm{GAP}(R_2); \ldots; \mathrm{GAP}(R_M)]$$
wherein $M$ is the number of dependency representations, $C$ represents the number of channels of the video feature tensor, and $r_c$ represents the compression factor;
calculating the vector inner products between the query vector and the dependency representations by the following formula to obtain the matching response strength of each dependency representation as the weight for the subsequent weighted summation:
$$\mathrm{Attention}(q, K) = \mathrm{softmax}(q \times K^{T})$$
obtaining the final dependency representation by weighted summation:
$$R_{sec} = \mathrm{Attention}(q, K) \times V$$
wherein $\mathrm{softmax}(\cdot)$ represents the softmax function, the superscript $T$ is the matrix transposition symbol, $\times$ represents matrix multiplication, and $V$ represents the dependency representations stacked together as the values.
7. The method of claim 1, wherein performing the threshold operation on the original video data feature tensor with the final dependency representation to obtain the optimized video data feature tensor comprises:
restoring the number of channels of the final dependency representation $R_{sec}$ to that of the original video feature tensor $Y$ using a $1 \times 1 \times 1$ three-dimensional convolution kernel, mapping the result to the interval $(0, 1)$ with a Sigmoid activation function, and finally multiplying element by element with the original video feature tensor $Y$ to obtain the optimized video data feature tensor $Z$, expressed as:
$$Z = \mathrm{Sigmoid}(\mathrm{Conv3d}(R_{sec}; 1 \times 1 \times 1)) \odot Y$$
wherein $\odot$ indicates element-by-element multiplication and $\mathrm{Conv3d}(\cdot)$ represents the convolution operation.
8. A system for recognizing motion in video data, the system being configured to implement the method of any one of claims 1 to 7, the system comprising:
the original video feature tensor acquisition unit is used for acquiring an original video feature tensor extracted from video data by the action recognition model;
the video data multi-content dependency modeling unit is used for pooling the original video feature tensor from different directions at different scales in a video data multi-content dependency modeling manner to obtain several groups of compressed dependency feature tensors, and then performing dependency activation with the convolutional layer to obtain the corresponding dependency representations;
the dependency aggregation unit is used for introducing, with an attention mechanism of a query structure, a query vector to be matched against all the dependency representations, computing the weight of each dependency representation from its matching response strength, summing the weighted representations to obtain the final dependency representation, and performing a threshold operation on the original video data feature tensor with the final dependency representation to obtain an optimized video data feature tensor;
and the action recognition unit is used for inputting the optimized video data feature tensor to the action recognition model to obtain an action recognition result.
9. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111363930.XA CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111363930.XA CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113989940A true CN113989940A (en) | 2022-01-28 |
CN113989940B CN113989940B (en) | 2024-03-29 |
Family
ID=79749106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111363930.XA Active CN113989940B (en) | 2021-11-17 | 2021-11-17 | Method, system, device and storage medium for identifying actions in video data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113989940B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112131943B (en) * | 2020-08-20 | 2023-07-11 | 深圳大学 | Dual-attention model-based video behavior recognition method and system |
CN112926396B (en) * | 2021-01-28 | 2022-05-13 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
- 2021-11-17 CN CN202111363930.XA patent/CN113989940B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190341052A1 (en) * | 2018-05-02 | 2019-11-07 | Simon Says, Inc. | Machine Learning-Based Speech-To-Text Transcription Cloud Intermediary |
WO2020233010A1 (en) * | 2019-05-23 | 2020-11-26 | 平安科技(深圳)有限公司 | Image recognition method and apparatus based on segmentable convolutional network, and computer device |
CN111325145A (en) * | 2020-02-19 | 2020-06-23 | 中山大学 | Behavior identification method based on combination of time domain channel correlation blocks |
CN113297964A (en) * | 2021-05-25 | 2021-08-24 | 周口师范学院 | Video target recognition model and method based on deep migration learning |
Non-Patent Citations (2)
Title |
---|
王辉涛; 胡燕: "An efficient video classification method based on a global spatio-temporal receptive field", 小型微型计算机系统 (Journal of Chinese Computer Systems), no. 08 *
解怀奇; 乐红兵: "Video human action recognition based on a channel attention mechanism", 电子技术与软件工程 (Electronic Technology & Software Engineering), no. 04 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926770A (en) * | 2022-05-31 | 2022-08-19 | 上海人工智能创新中心 | Video motion recognition method, device, equipment and computer readable storage medium |
CN114926770B (en) * | 2022-05-31 | 2024-06-07 | 上海人工智能创新中心 | Video motion recognition method, apparatus, device and computer readable storage medium |
CN115861901A (en) * | 2022-12-30 | 2023-03-28 | 深圳大学 | Video classification method, device, equipment and storage medium |
CN115861901B (en) * | 2022-12-30 | 2023-06-30 | 深圳大学 | Video classification method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113989940B (en) | 2024-03-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||