CN115527152A - Small sample video motion analysis method, system and device - Google Patents

Small sample video motion analysis method, system and device

Info

Publication number
CN115527152A
Authority
CN
China
Prior art keywords
video frame
small sample
video
module
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211402385.5A
Other languages
Chinese (zh)
Inventor
封晓强
汤庆飞
曹毅超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING ENBO TECHNOLOGY CO LTD
Original Assignee
NANJING ENBO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING ENBO TECHNOLOGY CO LTD filed Critical NANJING ENBO TECHNOLOGY CO LTD
Priority to CN202211402385.5A priority Critical patent/CN115527152A/en
Publication of CN115527152A publication Critical patent/CN115527152A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample video motion analysis method, system and device, belonging to the field of neural architecture search. In prior-art small sample video motion analysis, an insufficient number of sparsely sampled video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the method, system and device of the invention take a query video frame and a support video frame as input, output a feature map through a dimension reduction operation, input the feature map into a neural network formed by temporal attention modules and spatial attention modules for training, match the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyze the motion feature information in the small sample video. This effectively enhances the ability to capture motion information in small sample videos and significantly improves the accuracy of small sample video motion recognition.

Description

Small sample video motion analysis method, system and device
Technical Field
The invention relates to the field of neural architecture search, in particular to a small sample video motion analysis method, system and device.
Background
Understanding human behavior is one of the most important tasks in video understanding and one of its representative tasks. In recent years, with the advent of high-quality large-scale video datasets, video understanding and action recognition have made significant progress. However, such success relies heavily on large numbers of manually labeled samples. The dataset labeling process is tedious and time-consuming, which limits the practical application of these algorithms. Therefore, how to classify unusual action classes with only a few labeled samples has attracted extensive research. The introduction of the temporal dimension makes video tasks more complex than image tasks. For example, actions in video are typically performed at different speeds and occur at different points in time. In addition, action recognition typically requires integrating information from multiple different sub-actions to generate corresponding temporal representations for subsequent spatio-temporal feature matching.
For the problem of small sample video action classification, the prior art mostly adopts a metric-based meta-learning framework to compare the similarity between a query video and a support video: the input video is mapped to an embedding space through representation learning, and a distance measure is then used to compare video similarity within an episodic task. Although such methods learn a backbone network or a temporal relation module through episodic training, they ignore the importance of the spatio-temporal representation, which is fundamental to small sample video action classification. Therefore, a network architecture formed by temporal attention modules and spatial attention modules needs to be designed to analyze small sample video actions.
Disclosure of Invention
1. Technical problem to be solved
In prior-art small sample video motion analysis, an insufficient number of sparsely sampled video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the invention provides a small sample video motion analysis method, system and device, which can effectively enhance the ability to capture motion information in small sample videos and significantly improve the accuracy of small sample video motion recognition.
2. Technical scheme
The purpose of the invention is realized by the following technical scheme.
A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the feature map into a neural network formed by temporal attention modules and spatial attention modules for training;
and matching the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyzing the motion feature information in the small sample video.
Further, the neural network includes 3 stages, each stage having 8 layers, each layer including 3 temporal attention modules and 3 spatial attention modules.
Further, the 3 stages of the neural network are connected through a connection module; the neural network is connected to the spatio-temporal alignment module through a Norm layer.
Further, in the temporal attention module, temporal dependency information between frames is captured by calculating attention between different frames at the same position.
Further, in the spatial attention module, spatial dependency information within the same frame is captured by calculating attention between different positions of the same frame.
Further, in the neural network, network training is accelerated by narrowing a search space range.
Further, the step of narrowing the search space range is:
selecting a group of sub-networks for training by adopting a uniform sampling mode from all defined operands;
during the training of the super network, score statistics are carried out on all operands once every P iterations, and the score is calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
wherein S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
wherein W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
sorting the scores of the operands, deleting the K% of operands with the lowest scores, and retaining the high-scoring, i.e. well-performing, operands;
during the next iteration, a group of sub-networks is again selected for training by uniform sampling from the remaining operands.
Further, the query video frame and the support video frame are matched in a spatio-temporal alignment module, and the distance between each query video frame and the support video frame is calculated according to the following formulas:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
wherein a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
wherein f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
wherein D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
A small sample video motion analysis system performs feature learning on the query video frame and the support video frame using the small sample video motion analysis method; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
A small sample video motion analysis device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the small sample video motion analysis method.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
according to the small sample video motion analysis method, the small sample video motion analysis system and the small sample video motion analysis device, the neural network is constructed on the basis of the time attention module and the space attention module, the small sample video is trained in the neural network, and the model can spontaneously select and pay attention to different types of information at different stages, so that better time and space information representation can be obtained; meanwhile, the spatial range is narrowed, the neural network training is accelerated, and the time cost of video analysis is greatly saved; the time-space alignment module can be used for matching videos with any length, breaks through the limitation on the number of frames in small sample video motion recognition, enhances the capturing capability of motion information in small sample videos, and obviously improves the precision of motion recognition in small sample videos.
Drawings
FIG. 1 is a block diagram of a neural network architecture of the present invention;
FIG. 2 is a schematic diagram of the temporal attention module and the spatial attention module according to the present invention;
FIG. 3 is a block diagram of the temporal and spatial alignment at video frame level according to the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
Examples
As shown in FIG. 1, in the small sample video motion analysis method, system, and apparatus provided in this embodiment, a query video frame and a support video frame are input, and a feature map is output through a dimension reduction operation; the feature map is input into a neural network formed by temporal attention modules and spatial attention modules for training; and the trained query video frame is matched with the support video frame in a spatio-temporal alignment module, and the motion feature information in the small sample video is analyzed.
Specifically, in this embodiment, two sets of video frames are input: one set of m query video frames (Query Video) and one set of n support video frames (Support Video), where the classes of the support video frames are known. The distance between each query video frame and the n support video frames is calculated, and the query video frame is assigned the class of the support video frame closest to it, so that the class of the video is determined. For the input query video frame and support video frame sequences, a 16× downsampled feature map is output through a dimension reduction operation consisting of four 3×3 convolutional layers. The dimension reduction lowers the dimensionality of the data and thus reduces the computational cost of the model; four 3×3 convolutional layers are used because the channel number and spatial size of the feature maps fed into the modules of the neural network need to match. After the query video frame and the support video frame have been reduced by the four 3×3 convolutional layers, they are input into the neural network.
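As a rough illustration of this dimension reduction step, the following PyTorch-style sketch builds a stem of four 3×3 convolutional layers producing a 16× spatially downsampled feature map; the stride-2 convolutions, channel widths, and batch-norm/ReLU choices are assumptions, since the embodiment only specifies the number of layers, the kernel size and the downsampling factor.

    import torch
    import torch.nn as nn

    class ConvStem(nn.Module):
        """Hypothetical dimension-reduction stem: four 3x3 convolutions, each
        with stride 2, giving a 16x spatially downsampled feature map."""
        def __init__(self, in_channels=3, out_channels=96):
            super().__init__()
            chans = [in_channels, 24, 48, 64, out_channels]  # assumed widths
            layers = []
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out),
                           nn.ReLU(inplace=True)]
            self.stem = nn.Sequential(*layers)

        def forward(self, x):        # x: (B*T, 3, H, W) stacked video frames
            return self.stem(x)      # -> (B*T, out_channels, H/16, W/16)

    frames = torch.randn(8, 3, 224, 224)   # e.g. T = 8 sampled frames
    feat = ConvStem()(frames)              # (8, 96, 14, 14)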
In this embodiment, the neural network is composed of temporal attention modules (TimeAttn) and spatial attention modules (SpaceAttn). The whole neural network is divided into 3 stages, each stage has 8 layers, each layer includes 3 temporal attention modules and 3 spatial attention modules, and each module is built on a Transformer structure. The 3 stages of the neural network are connected through connection modules (SDB), which downsample the input feature map 2× spatially; the SDB is also a dimension reduction module used to reduce the computational cost of the model. For the temporal attention module and the spatial attention module, the input dimension is Twh × c, where w, h and c are the width, height and channel number of the embedded vector respectively, and T is the number of frames. The temporal attention module has 3 variants with 6, 12 and 16 heads respectively; it computes attention between different frames at the same position to obtain a wh × T × T temporal attention mask, which captures temporal dependency information between frames. The spatial attention module likewise has 3 variants with 6, 12 and 16 heads; it computes attention between different positions within the same frame to obtain a T × wh × wh spatial attention map, which captures spatial dependency information within a frame. FIG. 2 shows the specific structure of the temporal attention module and the spatial attention module, which is a standard Transformer structure. For the temporal attention module, the input data (Twh, c) passes through a Linear layer to obtain three groups of feature vectors q, k and v; q and k are multiplied (MatMul), and the result passes through a Softmax layer to obtain a (wh, T, T) attention map; this map is matrix-multiplied with v to highlight the temporal positions the model attends to, i.e. the key frames; the result passes through a Linear layer, is added to the original input data, and then passes through a two-layer MLP to obtain the temporal attention output. In this embodiment, the structure of the spatial attention module is similar to that of the temporal attention module. By constructing the neural network from temporal attention modules and spatial attention modules and training the small sample video in this network, the model can autonomously select and attend to different types of information at different stages, thereby obtaining better temporal and spatial information representations.
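The following is a minimal PyTorch-style sketch of the temporal attention computation described above; the head count, MLP ratio, and normalization placement are assumptions of a standard Transformer block, and only the reshaping that restricts attention to the T frame positions at each spatial location reflects the embodiment. The spatial attention module would be the analogous block with attention taken over the w×h positions within each frame.

    import torch
    import torch.nn as nn

    class TimeAttn(nn.Module):
        """Sketch of a temporal attention block: attention runs across the T
        frames at each spatial position, i.e. a (wh, T, T) attention map."""
        def __init__(self, dim, heads=6, mlp_ratio=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio),
                                     nn.GELU(),
                                     nn.Linear(dim * mlp_ratio, dim))

        def forward(self, x, T):
            # x: (B, T*w*h, c) token sequence; regroup so attention runs over time
            B, Twh, c = x.shape
            wh = Twh // T
            t = x.view(B, T, wh, c).permute(0, 2, 1, 3).reshape(B * wh, T, c)
            n = self.norm1(t)
            t = t + self.attn(n, n, n)[0]
            t = t + self.mlp(self.norm2(t))
            return t.view(B, wh, T, c).permute(0, 2, 1, 3).reshape(B, Twh, c)

    x = torch.randn(2, 8 * 14 * 14, 96)    # B=2, T=8, w=h=14, c=96
    y = TimeAttn(96, heads=6)(x, T=8)      # same shape as the input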
During neural network training, training is accelerated by narrowing the search space range. In this embodiment, the search space is the set of all operand combinations: the neural network to be searched has n layers and each layer has m candidate operands, so the size of the search space is m to the power of n. During training, operands with low scores are removed, thereby reducing the search range.
A group of sub-networks is selected for training by uniform sampling from all defined operands; the defined operands are the candidates in the temporal attention and spatial attention modules shown in FIG. 1, such as TimeAttn H6 and SpaceAttn H6, and an operand can be understood as a computation module that realizes a certain function.
During the training of the super network, score statistics are carried out on all operands once every P iterations, and the score is calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
where S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
where W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
the scores of each operand are sorted, the operands with low K% scores are deleted, and the operands with high scores, namely good performance, are reserved, so that the search space range can be reduced. In this embodiment, K% is a specified parameter, for example, 100 operands are defined, and if K is 10, 10% of the operands with low scores are deleted, and 90 operands with high scores are reserved.
During the next iteration, a group of sub-networks is again selected for training by uniform sampling from the remaining operands.
In this way, the search space range is reduced, neural network training is accelerated, and the time cost of video analysis is greatly reduced.
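A rough sketch of this progressive search-space narrowing is given below; the helpers train_supernet_step and path_loss, the iteration counts, and the use of the average training loss as an operand's score are placeholders and assumptions consistent with the definitions above, not the exact procedure of the embodiment.

    import random
    from collections import defaultdict

    def narrow_search_space(layers_ops, train_supernet_step, path_loss,
                            iterations=1000, P=100, K=10):
        """layers_ops: per-layer lists of candidate operands, e.g. "TimeAttn H6".
        train_supernet_step(path): one supernet weight update on a sampled path.
        path_loss(path): statistic used for scoring, e.g. cross entropy on D_tr."""
        records = defaultdict(list)                 # (layer, operand) -> recorded losses
        for it in range(1, iterations + 1):
            path = [random.choice(ops) for ops in layers_ops]   # uniform sampling
            train_supernet_step(path)
            loss = path_loss(path)
            for layer, op in enumerate(path):
                records[(layer, op)].append(loss)
            if it % P == 0:                         # every P iterations: score and prune
                scored = {k: sum(v) / len(v) for k, v in records.items() if v}
                worst_first = sorted(scored, key=scored.get, reverse=True)
                for layer, op in worst_first[: len(worst_first) * K // 100]:
                    if op in layers_ops[layer] and len(layers_ops[layer]) > 1:
                        layers_ops[layer].remove(op)            # drop low-scoring operand
                        records.pop((layer, op), None)
        return layers_ops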
It should be noted that, in the training phase, for each iteration one module is randomly selected from the 6 candidate modules (3 temporal attention modules and 3 spatial attention modules) of each layer, so that a total of 24 modules, one per layer across the 3 stages of 8 layers, are selected to form a path for training.
The neural network is connected to the spatio-temporal alignment module (Spatio-Temporal Alignment) through a Norm layer; the Norm layer normalizes the features, which helps the model converge. After the query video frame and the support video frame have been trained in the neural network, the obtained features are normalized by the Norm layer and input into the spatio-temporal alignment module, where the query video frame is matched with the support video frame so as to analyze the motion feature information in the small sample video. As shown in FIG. 3, the query video (Query Video x_q) has four pictures: picture a is the pose of the person standing, picture b is the pose of the person running, picture c is the pose of the person crossing the pole, and picture d is the pose of the person lying down; the support video (Support Video x_s) has four pictures: pictures a', b' and c' are poses of the person running, and picture d' is the pose of the person crossing the pole. It should be noted that matching is generally performed by sequential frame alignment, for example the first frame of the query video is aligned with the first frame of the support video, the second frame with the second frame, and so on; however, this method does not necessarily produce a correct match: with sequential frame alignment, the standing pose in query picture a cannot be aligned with the running pose in support picture a', which reduces matching accuracy. In this embodiment, key-frame alignment is adopted instead, which greatly improves matching accuracy: the running pose in query picture b can be aligned with the running pose in support picture a', and the pose of crossing the pole in query picture c can be aligned with the pose of crossing the pole in support picture d'. In this way, the ability to capture motion information in the video is significantly enhanced, thereby improving the accuracy of small sample video motion recognition. The distance between each query video frame and the support video frames is calculated as follows:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
where a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
where f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame.
The distance between the embedded vectors corresponding to the query video frame and the support video frame is then calculated as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
where D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
Therefore, the spatio-temporal alignment module can match videos of any length, breaking through the limitation on the number of frames in small sample video motion recognition, enhancing the ability to capture motion information in small sample videos, and significantly improving the accuracy of motion recognition in small sample videos.
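A minimal sketch of this frame-level alignment and nearest-support classification is shown below, under the assumption that the frame distance is a squared L2 distance normalized by T and C_o; the tensor shapes and the classify_query helper are illustrative.

    import torch
    import torch.nn.functional as F

    def frame_level_distance(f_q: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        """f_q: (T_q, C_o) embedded query frames; f_s: (T_s, C_o) embedded support
        frames. T_q and T_s may differ, so videos of any length can be matched."""
        attn = F.softmax(f_q @ f_s.t(), dim=-1)   # a(x_q, x_s): (T_q, T_s) frame-level attention
        f_s_aligned = attn @ f_s                  # temporally aligned support representation
        T, C_o = f_q.shape
        return ((f_q - f_s_aligned) ** 2).sum() / (T * C_o)   # assumed normalized squared-L2 distance

    def classify_query(f_q, support_feats, support_labels):
        """Assign the query the label of the nearest support video."""
        dists = torch.stack([frame_level_distance(f_q, f_s) for f_s in support_feats])
        return support_labels[int(dists.argmin())]

    f_q = torch.randn(8, 96)                      # e.g. 8 query frames, C_o = 96
    supports = [torch.randn(6, 96), torch.randn(10, 96)]
    label = classify_query(f_q, supports, ["run", "high jump"])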
In addition, the small sample video motion analysis system provided in this embodiment performs feature learning on the query video frame and the support video frame based on the small sample video motion analysis method; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
The present embodiment further provides a small sample video motion analysis apparatus, which is a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the program to implement the steps of the small sample video motion analysis method as in the present embodiment. The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs, and the like. The computer device of the embodiment at least includes but is not limited to: a memory, a processor communicatively coupled to each other via a system bus. The memory (i.e., readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. The memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device, or an external storage unit of the computer device, such as a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Of course, the memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output. The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device and, in this embodiment, is used to execute program code stored in memory or to process data.
The invention and its embodiments have been described above schematically, without limitation, and the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiment shown in the drawings is only one of the embodiments of the invention, the actual structure is not limited to the embodiment, and any reference signs in the claims shall not limit the claims. Therefore, if a person skilled in the art receives the teachings of the present invention, without inventive design, a similar structure and an embodiment to the above technical solution should be covered by the protection scope of the present patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the inclusion of a plurality of such elements. Several of the elements recited in the product claims may also be implemented by one element in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the feature map into a neural network formed by temporal attention modules and spatial attention modules for training;
and matching the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyzing the motion feature information in the small sample video.
2. The method according to claim 1, wherein the neural network comprises 3 stages, each stage has 8 layers, and each layer comprises 3 temporal attention modules and 3 spatial attention modules.
3. The small sample video motion analysis method according to claim 2, wherein the 3 stages of the neural network are connected through a connection module; the neural network is connected with the spatio-temporal alignment module through a Norm layer.
4. The method according to claim 3, wherein the temporal attention module captures temporal dependency information between frames by calculating attention between different frames at the same position.
5. The method according to claim 3 or 4, wherein in the spatial attention module, spatial dependency information within the same frame is captured by calculating attention between different positions of the same frame.
6. The method according to claim 5, wherein in the neural network, network training is accelerated by narrowing a search space.
7. The small sample video motion analysis method according to claim 6, wherein the step of narrowing the search space comprises:
selecting a group of sub-networks for training by adopting a uniform sampling mode from all defined operands;
during the training of the super network, performing score statistics on all operands once every P iterations, the score being calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
wherein S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
wherein W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
sorting the scores of the operands, deleting the K% of operands with the lowest scores, and retaining the high-scoring, i.e. well-performing, operands;
during the next iteration, a set of subnetworks will be selected for training from among the remaining operands using a uniform sampling approach.
8. The small sample video motion analysis method according to claim 7, wherein the query video frames and the support video frames are matched in a spatio-temporal alignment module, and the distance between each query video frame and the support video frame is calculated according to the following formula:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
wherein a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
wherein f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
wherein D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
9. A small sample video motion analysis system, characterized in that it adopts the small sample video motion analysis method as claimed in any one of claims 1-8 to perform feature learning on the query video frame and the support video frame; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training, wherein in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video, wherein the spatio-temporal alignment module can match video frames of any length.
10. A small sample video motion analysis apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the small sample video motion analysis method according to any one of claims 1 to 8.
CN202211402385.5A 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device Withdrawn CN115527152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211402385.5A CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211402385.5A CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Publications (1)

Publication Number Publication Date
CN115527152A true CN115527152A (en) 2022-12-27

Family

ID=84704837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211402385.5A Withdrawn CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Country Status (1)

Country Link
CN (1) CN115527152A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024186378A1 (en) * 2023-03-08 2024-09-12 Qualcomm Incorporated Common action localization

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019804A1 (en) * 2020-07-01 2022-01-20 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis
CN114282047A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Small sample action recognition model training method and device, electronic equipment and storage medium
CN114581819A (en) * 2022-02-22 2022-06-03 华中科技大学 Video behavior identification method and system
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115131580A (en) * 2022-08-31 2022-09-30 中国科学院空天信息创新研究院 Space target small sample identification method based on attention mechanism
CN115240271A (en) * 2022-07-08 2022-10-25 北方工业大学 Video behavior identification method and system based on space-time modeling

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019804A1 (en) * 2020-07-01 2022-01-20 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis
CN114282047A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Small sample action recognition model training method and device, electronic equipment and storage medium
CN114581819A (en) * 2022-02-22 2022-06-03 华中科技大学 Video behavior identification method and system
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN115240271A (en) * 2022-07-08 2022-10-25 北方工业大学 Video behavior identification method and system based on space-time modeling
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115131580A (en) * 2022-08-31 2022-09-30 中国科学院空天信息创新研究院 Space target small sample identification method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Y. Cao et al.: "Searching for Better Spatio-temporal Alignment in Few-Shot Action Recognition", NeurIPS 2022 Conference Blind Submission, pages 1-3 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024186378A1 (en) * 2023-03-08 2024-09-12 Qualcomm Incorporated Common action localization

Similar Documents

Publication Publication Date Title
CN109165573B (en) Method and device for extracting video feature vector
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111401474B (en) Training method, device, equipment and storage medium for video classification model
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112507912B (en) Method and device for identifying illegal pictures
Choi et al. Face video retrieval based on the deep CNN with RBF loss
CN112733590A (en) Pedestrian re-identification method based on second-order mixed attention
WO2024060684A1 (en) Model training method, image processing method, device, and storage medium
Wang et al. S 3 D: Scalable pedestrian detection via score scale surface discrimination
CN113822264A (en) Text recognition method and device, computer equipment and storage medium
CN111310743B (en) Face recognition method and device, electronic equipment and readable storage medium
CN114663798A (en) Single-step video content identification method based on reinforcement learning
Xu et al. Graphical modeling for multi-source domain adaptation
CN114357200A (en) Cross-modal Hash retrieval method based on supervision graph embedding
Zhang et al. Robust visual tracking using multi-frame multi-feature joint modeling
CN115527152A (en) Small sample video motion analysis method, system and device
CN111898418A (en) Human body abnormal behavior detection method based on T-TINY-YOLO network
CN113255394A (en) Pedestrian re-identification method and system based on unsupervised learning
CN108830302B (en) Image classification method, training method, classification prediction method and related device
Canévet et al. Large scale hard sample mining with monte carlo tree search
CN113590898A (en) Data retrieval method and device, electronic equipment, storage medium and computer product
CN111709473B (en) Clustering method and device for object features
CN113869398A (en) Unbalanced text classification method, device, equipment and storage medium
CN112949778A (en) Intelligent contract classification method and system based on locality sensitive hashing and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20221227)