CN115527152A - Small sample video motion analysis method, system and device - Google Patents
- Publication number: CN115527152A
- Application number: CN202211402385.5A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06V20/42—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
Abstract
The invention discloses a small sample video motion analysis method, system and device, belonging to the field of neural architecture search. In prior-art small sample video motion analysis, sparsely sampling too few video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the invention inputs a query video frame and a support video frame, outputs a feature map through a dimension reduction operation, inputs the feature map into a neural network composed of temporal attention modules and spatial attention modules for training, matches the trained query video frames with the support video frames in a spatio-temporal alignment module, and analyzes the motion feature information in the small sample video. This effectively enhances the ability to capture motion information in small sample videos and markedly improves the accuracy of small sample video motion recognition.
Description
Technical Field
The invention relates to the field of neural architecture search, in particular to a small sample video motion analysis method, system and device.
Background
Understanding human behavior is one of the most important and most representative tasks in video understanding. In recent years, with the advent of high-quality large-scale video datasets, video understanding and motion recognition have made significant progress. However, such success relies heavily on large numbers of manually labeled samples. The dataset labeling process is tedious and time-consuming, which limits practical application of the algorithms. How to classify unusual action classes with few labeled samples has therefore attracted extensive research. The introduction of the temporal dimension makes video tasks more complex than image tasks: for example, actions in video are typically performed at different speeds and occur at different points in time. In addition, action recognition typically requires integrating information from multiple different sub-actions to generate corresponding temporal representations for subsequent spatio-temporal feature matching.
For the problem of small sample video motion classification, the prior art mostly adopts a metric-based meta-learning framework to compare the similarity between a query video and a support video: the input video is mapped to an embedding space through representation learning, and a distance metric is then used to compare video similarity within each episodic task. Although such methods learn a backbone network or a temporal relation module through episodic training, they ignore the importance of spatio-temporal representations, which are fundamental to small sample video motion classification. A network architecture composed of temporal attention modules and spatial attention modules therefore needs to be designed to analyze small sample video motion.
Disclosure of Invention
1. Technical problem to be solved
In prior-art small sample video motion analysis, sparsely sampling too few video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the invention provides a small sample video motion analysis method, system and device that effectively enhance the ability to capture motion information in small sample videos and markedly improve the accuracy of small sample video motion recognition.
2. Technical scheme
The purpose of the invention is realized by the following technical scheme.
A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the characteristic diagram into a neural network formed by a time attention module and a space attention module for training;
and matching the trained inquiry video frame with the support video frame in a space-time alignment module, and analyzing the action characteristic information in the small sample video.
Further, the neural network includes 3 stages, each stage having 8 layers, each layer including 3 temporal attention modules and 3 spatial attention modules.
Further, the 3 stages of the neural network are connected through a connection module; the neural network is connected to the spatio-temporal alignment module through a Norm layer.
Further, in the temporal attention module, temporal dependency information between frames is captured by calculating attention between different frames at the same position.
Further, in the spatial attention module, spatial dependency information within a frame is captured by calculating attention between different positions of the same frame.
Further, in the neural network, network training is accelerated by narrowing a search space range.
Further, the step of narrowing the search space range is:
selecting a group of sub-networks for training by adopting a uniform sampling mode from all defined operands;
during the training of the super network, score statistics are collected for all operands once every P iterations, and the score calculation formula is as follows:
where S(i, j) represents the score of the jth operand of the ith layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross-entropy loss function, whose calculation formula is:
where W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
sorting the operands by score, deleting the bottom K% of operands, and keeping the high-scoring, i.e. well-performing, operands;
during the next iteration, a set of sub-networks is again selected for training by uniform sampling from the remaining operands.
Further, the query video frame and the support video frame are matched in a space-time alignment module, and the distance between each query video frame and the support video frame is calculated according to the following formula:
a(x_q, x_s) = softmax[f(x_q) f^T(x_s)]
where a(x_q, x_s) is the attention at the video frame level, f(x_q) denotes the embedded vector of the query video frame, and f^T(x_s) denotes the transposed embedded vector of the support video frame; with this attention the aligned embedded vector is further calculated:
where f(x_s) denotes the embedded vector of a support video frame, and the result denotes the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame, wherein the calculation formula is as follows:
where the left-hand side represents the distance between the embedded vectors, T represents the length of the video, and C_o represents the feature dimension.
A small sample video motion analysis system adopts a small sample video motion analysis method to perform feature learning on a query video frame and a support video frame; the small sample video motion analysis system comprises:
the input module, which inputs a query video frame and a support video frame sequence and outputs a feature map through a dimension reduction operation;
the training module, which inputs the feature map into a neural network composed of temporal attention modules and spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and 3 spatial attention modules of each layer for training;
and the spatio-temporal alignment module, which matches the trained query video frames with the support video frames and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
A small sample video motion analysis device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the small sample video motion analysis method.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
according to the small sample video motion analysis method, the small sample video motion analysis system and the small sample video motion analysis device, the neural network is constructed on the basis of the time attention module and the space attention module, the small sample video is trained in the neural network, and the model can spontaneously select and pay attention to different types of information at different stages, so that better time and space information representation can be obtained; meanwhile, the spatial range is narrowed, the neural network training is accelerated, and the time cost of video analysis is greatly saved; the time-space alignment module can be used for matching videos with any length, breaks through the limitation on the number of frames in small sample video motion recognition, enhances the capturing capability of motion information in small sample videos, and obviously improves the precision of motion recognition in small sample videos.
Drawings
FIG. 1 is a block diagram of a neural network architecture of the present invention;
FIG. 2 is a schematic diagram of the temporal attention module and the spatial attention module according to the present invention;
FIG. 3 is a block diagram of the temporal and spatial alignment at video frame level according to the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
Examples
As shown in fig. 1, in the small sample video motion analysis method, system, and device provided in this embodiment, a query video frame and a support video frame are input, and a feature map is output through a dimension reduction operation; the feature map is input into a neural network composed of temporal attention modules and spatial attention modules for training; the trained query video frames are matched with the support video frames in a spatio-temporal alignment module, and the motion feature information in the small sample video is analyzed.
Specifically, in this embodiment, two sets of video frames are input: a set of m query video frames (Query Video) and a set of n support video frames (Support Video), where the support video frame classes are known. The distance between each query video frame and the n support video frames is calculated, and the query video frame is assigned the class of the nearest support video frame, thereby determining the class of the video. For the input query video frame and support video frame sequences, a 16× downsampled feature map is output through a dimension reduction operation of 4 layers of 3×3 convolutions. The dimension reduction lowers the dimensionality of the data and hence the computational load of the model; 4 layers of 3×3 convolutions are used because the channel and spatial dimensions of the feature maps input to the modules in the neural network must match. After passing through the 4 layers of 3×3 convolutional dimension reduction, the query video frame and the support video frame are input into the neural network.
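As a shape-level sketch of the stem described above (hypothetical hyperparameters: stride 2 and padding 1 per layer are assumed, since the text only states "4 layers of 3×3 convolutions" producing 16× downsampling):

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of one convolutional layer."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_out(size, layers=4):
    """Output size after a 4-layer stack of stride-2 3x3 convolutions."""
    for _ in range(layers):
        size = conv_out(size)
    return size

# e.g. a 224x224 input frame -> 112 -> 56 -> 28 -> 14, i.e. 16x downsampling
```

Each stride-2 layer halves the spatial dimensions, so four layers give the stated 16× reduction.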
In this embodiment, the neural network is composed of temporal attention modules (TimeAttn) and spatial attention modules (SpaceAttn). The whole network is divided into 3 stages, each stage has 8 layers, each layer includes 3 temporal attention modules and 3 spatial attention modules, and each module is a Transformer structure. The 3 stages of the neural network are connected through a connection module (SDB), which spatially downsamples the input feature map by a factor of 2; the SDB is thus also a dimension reduction module that reduces the computational load of the model. For the temporal and spatial attention modules, the input dimension is Twh × c, where w, h and c are the width, height and number of channels of the embedded vector, and T is the number of frames. The temporal attention module has 3 variants, with 6, 12 and 16 heads respectively; a wh × T × T temporal attention mask is obtained by calculating attention between different frames at the same position and is used to capture temporal dependency information between frames. The spatial attention module also has 3 variants, with 6, 12 and 16 heads respectively; T × wh × wh spatial attention maps are obtained by calculating attention between different positions of the same frame and are used to capture spatial dependency information within a frame. Fig. 2 shows the specific structure of the temporal attention module and the spatial attention module, which is a standard Transformer structure.
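As a rough illustration (this calculation is implied by, but not stated in, the text), the size of the resulting architecture search space follows from the numbers above: 3 stages × 8 layers gives 24 layers in a sampled path, each choosing among 6 module variants:

```python
# Path count implied by the architecture above: 3 stages x 8 layers,
# each layer choosing one of 6 attention-module variants
# (TimeAttn / SpaceAttn with 6, 12 or 16 heads).
stages = 3
layers_per_stage = 8
variants_per_layer = 6

total_layers = stages * layers_per_stage        # 24 layers per sampled path
num_paths = variants_per_layer ** total_layers  # 6**24 candidate paths
```

The exponential path count is what motivates the search-space pruning described later in the embodiment.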
For the temporal attention module, the input data (Twh, c) passes through a Linear layer to obtain 3 groups of feature vectors q, k and v; q and k are matrix-multiplied (MatMul), the result passes through a Softmax layer to obtain a (wh, T, T) feature attention map, and this map is matrix-multiplied with v to highlight the temporal positions the model attends to, i.e. the key frames; the result passes through another Linear layer, is added to the original input data, and then passes through a two-layer MLP to obtain the temporal attention output. In this embodiment, the structure of the spatial attention module is similar to that of the temporal attention module. A neural network is constructed from the temporal attention modules and spatial attention modules, and the small sample video is trained in this network, so that the model can autonomously select and attend to different types of information at different stages and thereby obtain better temporal and spatial information representations.
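A minimal NumPy sketch of this temporal attention data flow (random matrices stand in for the learned Linear layers; the residual connection and the two-layer MLP of the full Transformer block are omitted for brevity):

```python
import numpy as np

def temporal_attention(x, Wq, Wk, Wv):
    """x: (T, wh, c) features. Attention is computed across frames (T)
    independently at each spatial position, yielding a (wh, T, T) map."""
    T, wh, c = x.shape
    xt = x.transpose(1, 0, 2)                       # per-position time sequences: (wh, T, c)
    q, k, v = xt @ Wq, xt @ Wk, xt @ Wv             # linear projections
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(c)  # (wh, T, T) time-attention logits
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)        # softmax over frames
    out = attn @ v                                  # highlight key frames
    return out.transpose(1, 0, 2), attn             # back to (T, wh, c)

rng = np.random.default_rng(0)
T, wh, c = 4, 9, 8
x = rng.normal(size=(T, wh, c))
Wq, Wk, Wv = (rng.normal(size=(c, c)) * 0.1 for _ in range(3))
out, attn = temporal_attention(x, Wq, Wk, Wv)
```

The spatial attention module is analogous with the roles of T and wh swapped, giving a (T, wh, wh) attention map per frame.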
During neural network training, training is accelerated by narrowing the search space. In this embodiment, the search space is the set of all operand combinations: the neural network to be searched has n layers with m operands per layer, so the search space contains m to the power n combinations. During training, operands with low scores are removed, thereby reducing the search range. The steps for narrowing the search space are as follows:
selecting a group of sub-networks for training by uniform sampling from all defined operands, where the defined operands are the TimeAttn H6, SpaceAttn H6, etc. modules shown in FIG. 1; an operand can be understood as a calculation module realizing a certain function;
during the training of the super network, score statistics are collected for all operands once every P iterations, and the score calculation formula is as follows:
where S(i, j) represents the score of the jth operand of the ith layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross-entropy loss function, whose calculation formula is:
where W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
the scores of each operand are sorted, the operands with low K% scores are deleted, and the operands with high scores, namely good performance, are reserved, so that the search space range can be reduced. In this embodiment, K% is a specified parameter, for example, 100 operands are defined, and if K is 10, 10% of the operands with low scores are deleted, and 90 operands with high scores are reserved.
During the next iteration, a set of sub-networks is again selected for training by uniform sampling from the remaining operands.
In this way, the search space is reduced, neural network training is accelerated, and the time cost of video analysis is greatly reduced.
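The operand-pruning procedure described above can be sketched as follows (the toy scores below are hypothetical stand-ins for the L_CE-based statistics collected every P iterations; operand names follow FIG. 1):

```python
import random

def prune_operands(ops, scores, k_percent):
    """Drop the bottom k_percent of operands by score, keep the rest.

    ops:    list of operand names for one layer
    scores: dict name -> score (higher = better), e.g. averaged from
            sub-network statistics collected every P iterations
    """
    ranked = sorted(ops, key=lambda o: scores[o], reverse=True)
    keep = max(1, round(len(ops) * (1 - k_percent / 100)))
    return ranked[:keep]

def sample_subnetwork(layer_ops):
    """Uniformly sample one operand per layer from the remaining candidates."""
    return [random.choice(ops) for ops in layer_ops]

# toy example: one layer with 6 candidate modules, dropping the bottom 50%
ops = ["TimeAttnH6", "TimeAttnH12", "TimeAttnH16",
       "SpaceAttnH6", "SpaceAttnH12", "SpaceAttnH16"]
scores = dict(zip(ops, [0.9, 0.4, 0.7, 0.2, 0.8, 0.1]))
remaining = prune_operands(ops, scores, k_percent=50)
```

Subsequent iterations then sample sub-networks only from `remaining`, shrinking the search range each time pruning runs.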
It should be noted that, in the training phase, for each iteration one module is randomly selected from the 6 candidate modules (3 temporal attention and 3 spatial attention) of each layer, so that 24 modules in total form a path for training.
The neural network is connected to the spatio-temporal alignment module (Spatio-Temporal Alignment) through a Norm layer; the Norm layer normalizes the features, which aids model convergence. After the query video frames and support video frames have been trained in the neural network, the obtained features are normalized by the Norm layer and input into the spatio-temporal alignment module, where the query video frames are matched with the support video frames to analyze the motion feature information in the small sample video. As shown in FIG. 3, the query video frames (Query Video x_q) consist of four pictures: in diagram a the person is standing, in diagram b the person is running, in diagram c the person is crossing the pole, and in diagram d the person is lying down. The support video frames (Support Video x_s) also consist of four pictures: in diagrams a', b' and c' the person is running, and in diagram d' the person is crossing the pole. Note that matching is usually performed by sequential frame alignment: the first query video frame is aligned with the first support video frame, the second with the second, and so on. However, this method does not necessarily match correctly; with sequential alignment, the standing pose in query frame a cannot be aligned with the running pose in support frame a', which reduces matching accuracy.
In this embodiment, matching accuracy can be greatly improved by adopting key-frame alignment: the running pose in query video frame b can be aligned with the running pose in support video frame a', and the pole-crossing pose in query video frame c can be aligned with the pole-crossing pose in support video frame d'. This method markedly enhances the ability to capture motion information in the video and thus improves the accuracy of small sample video motion recognition. The distance between each query video frame and the support video frames is calculated as follows:
a(x_q, x_s) = softmax[f(x_q) f^T(x_s)]
where a(x_q, x_s) is the attention at the video frame level, f(x_q) denotes the embedded vector of the query video frame, and f^T(x_s) denotes the transposed embedded vector of the support video frame; with this attention the aligned embedded vector is further calculated:
where f(x_s) denotes the embedded vector of a support video frame, and the result denotes the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame, wherein the calculation formula is as follows:
where the left-hand side represents the distance between the embedded vectors, T represents the length of the video, and C_o represents the feature dimension.
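A NumPy sketch of the frame-level alignment and distance computation described above, under the assumption (the exact formulas are omitted images in the source) that the aligned support representation is the attention-weighted combination of support frame embeddings and the distance is a norm over the T × C_o aligned embeddings, normalized by T·C_o:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def align_and_distance(f_q, f_s):
    """f_q, f_s: (T, Co) embedded vectors of query / support video frames."""
    a = softmax(f_q @ f_s.T)   # (T, T) frame-level attention a(x_q, x_s)
    f_s_aligned = a @ f_s      # temporally aligned support frames (assumption)
    # distance between the aligned embedding sequences (hypothetical norm)
    dist = np.linalg.norm(f_q - f_s_aligned) / (f_q.shape[0] * f_q.shape[1])
    return a, f_s_aligned, dist

rng = np.random.default_rng(1)
T, Co = 4, 16
f_q, f_s = rng.normal(size=(T, Co)), rng.normal(size=(T, Co))
a, f_s_aligned, dist = align_and_distance(f_q, f_s)
```

Because the attention is computed per frame rather than by position index, the same function applies to query and support videos of different lengths, which is what lets the module match videos of any length.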
Therefore, the space-time alignment module can be used for matching videos with any length, breaks through the limitation of small sample video motion recognition on the number of frames, enhances the capturing capability of motion information in the small sample video, and obviously improves the precision of motion recognition in the small sample video.
In addition, the small sample video motion analysis system provided in this embodiment performs feature learning on the query video frames and support video frames based on the small sample video motion analysis method described above; the small sample video motion analysis system comprises:
the input module, which inputs a query video frame and a support video frame sequence and outputs a feature map through a dimension reduction operation;
the training module, which inputs the feature map into a neural network composed of temporal attention modules and spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and 3 spatial attention modules of each layer for training;
and the spatio-temporal alignment module, which matches the trained query video frames with the support video frames and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
The present embodiment further provides a small sample video motion analysis apparatus, which is a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the program to implement the steps of the small sample video motion analysis method as in the present embodiment. The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs, and the like. The computer device of the embodiment at least includes but is not limited to: a memory, a processor communicatively coupled to each other via a system bus. The memory (i.e., readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. The memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device, or an external storage unit of the computer device, such as a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Of course, the memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output. 
The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device and, in this embodiment, is used to execute program code stored in memory or to process data.
The invention and its embodiments have been described above schematically, without limitation, and the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiment shown in the drawings is only one of the embodiments of the invention, the actual structure is not limited to the embodiment, and any reference signs in the claims shall not limit the claims. Therefore, if a person skilled in the art receives the teachings of the present invention, without inventive design, a similar structure and an embodiment to the above technical solution should be covered by the protection scope of the present patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the inclusion of a plurality of such elements. Several of the elements recited in the product claims may also be implemented by one element in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Claims (10)
1. A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the feature map into a neural network formed by temporal attention modules and spatial attention modules for training;
and matching the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyzing the action feature information in the small sample video.
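The three claimed steps can be illustrated with a minimal numerical sketch. All shapes, the linear channel projection standing in for the dimension-reduction operation, and the average-pooling frame embedding are illustrative assumptions, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: T frames of H x W positions, C input channels reduced to D.
T, H, W, C, D = 4, 8, 8, 16, 8
w_reduce = rng.standard_normal((C, D)) * 0.1   # assumed linear reduction weights

def to_feature_map(frames):
    """Dimension-reduction step: project channels C -> D at every position."""
    return frames @ w_reduce                    # (T, H, W, C) -> (T, H, W, D)

query_map = to_feature_map(rng.standard_normal((T, H, W, C)))
support_map = to_feature_map(rng.standard_normal((T, H, W, C)))

# After the attention-based feature learning (claims 2-5), each frame is
# summarized into an embedded vector; plain average pooling stands in here.
f_q = query_map.mean(axis=(1, 2))               # (T, D) query frame embeddings
f_s = support_map.mean(axis=(1, 2))             # (T, D) support frame embeddings

# The spatio-temporal alignment module (claim 8) then matches the two
# sequences; a naive frame-by-frame distance stands in for it here.
distance = np.linalg.norm(f_q - f_s, axis=1).mean()
```

The point of the sketch is only the data flow: raw frames become a reduced feature map, the feature map becomes per-frame embedding vectors, and the embeddings of query and support clips are compared by a distance.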
2. The method according to claim 1, wherein the neural network comprises 3 stages, each stage having 8 layers, and each layer comprising 3 temporal attention modules and 3 spatial attention modules.
3. The small sample video motion analysis method according to claim 2, wherein the 3 stages of the neural network are connected through a connection module, and the neural network is connected with the spatio-temporal alignment module through a Norm layer.
4. The method according to claim 3, wherein the temporal attention module captures temporal dependency information between frames by calculating attention between different frames at the same position.
5. The method according to claim 3 or 4, wherein in the spatial attention module, spatial dependency information within a frame is captured by calculating attention between different positions of the same frame.
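Claims 4 and 5 can be sketched as follows, assuming a standard scaled dot-product self-attention; the tensor layout (frames T, positions P per frame, channels D) is an illustrative assumption, not taken from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    """Scaled dot-product self-attention over the second-to-last axis."""
    d = x.shape[-1]
    a = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d))
    return a @ x

T, P, D = 4, 9, 8   # frames, spatial positions per frame, channel dimension
feat = np.random.default_rng(1).standard_normal((T, P, D))

# Temporal attention (claim 4): for each spatial position, attend across the
# T frames at that same position -> captures inter-frame dependencies.
temporal_out = attention(np.swapaxes(feat, 0, 1))   # (P, T, D)
temporal_out = np.swapaxes(temporal_out, 0, 1)      # back to (T, P, D)

# Spatial attention (claim 5): within each frame, attend across the P
# positions of that same frame -> captures intra-frame dependencies.
spatial_out = attention(feat)                       # (T, P, D)
```

The two modules differ only in which axis the attention runs over: time at a fixed position, or position within a fixed frame.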
6. The method according to claim 5, wherein in the neural network, network training is accelerated by narrowing a search space.
7. The small sample video motion analysis method according to claim 6, wherein the step of narrowing the search space comprises:
selecting, from all defined operands, a group of sub-networks for training by uniform sampling;
during training of the super network, performing score statistics on all operands once every P iterations, the score being calculated as:
S(i,j) = E_{a∈U(A), O_{i,j}∈a} [ L_CE(N_s(a, W); D_tr) ]
where S(i,j) represents the score of the jth operand of the ith layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results over the sub-networks a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross-entropy loss function, calculated as:
L_CE(N_A(a, W_a); D_train) = -Σ_{(x,y)∈D_train} log p(y | x; N_A(a, W_a))
where W_a represents the weights of sub-network a, N_A represents the transform search space of the super network, and D_train represents the training set;
sorting the operands by score, deleting the bottom K% of operands, and retaining the high-scoring, i.e., well-performing, operands;
during the next iteration, a group of sub-networks is again selected for training from the remaining operands by uniform sampling.
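The pruning schedule of claim 7 can be sketched as follows. The toy score table stands in for the cross-entropy statistics of the claim, and the operand names, K = 25, and P = 3 are illustrative values, not taken from the patent:

```python
import random

def sample_subnet(search_space):
    """Uniform sampling: choose one operand per layer to form a sub-network."""
    return [random.choice(ops) for ops in search_space]

def prune(search_space, scores, k_percent):
    """Per layer, drop the bottom k_percent of operands by score and keep the
    high-scoring (i.e., well-performing) ones, as in claim 7."""
    pruned = []
    for layer, ops in enumerate(search_space):
        ranked = sorted(ops, key=lambda o: scores[(layer, o)], reverse=True)
        keep = max(1, int(round(len(ranked) * (100 - k_percent) / 100)))
        pruned.append(ranked[:keep])
    return pruned

random.seed(0)
search_space = [["conv3", "conv5", "skip", "pool"] for _ in range(3)]
# Toy scores standing in for the averaged loss statistics of claim 7.
scores = {(l, o): random.random() for l in range(3) for o in search_space[l]}

for step in range(1, 7):
    subnet = sample_subnet(search_space)   # uniform sampling over remaining operands
    # ... train the super-network weights along this sub-network ...
    if step % 3 == 0:                      # every P = 3 iterations
        search_space = prune(search_space, scores, k_percent=25)
```

Because later sampling draws only from the surviving operands, the search space shrinks monotonically and super-network training concentrates on the promising operands, which is the acceleration claimed in claim 6.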
8. The small sample video motion analysis method according to claim 7, wherein the query video frames and the support video frames are matched in the spatio-temporal alignment module, and the distance between each query video frame and each support video frame is calculated according to the following formula:
a(x_q, x_s) = softmax[f(x_q) f^T(x_s)]
where a(x_q, x_s) represents the attention at the video frame level, f(x_q) represents the embedded vector of the query video frame, and f^T(x_s) represents the transposed embedded vector of the support video frame; with this attention, the aligned embedded vector is further calculated as:
f̃(x_s) = a(x_q, x_s) f(x_s)
where f(x_s) represents the embedded vector of the support video frame and f̃(x_s) represents the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame.
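A numerical sketch of the frame-level alignment of claim 8: the attention formula follows the claim, while the Euclidean distance at the end is an assumed choice, since the claim's final distance formula is not reproduced here:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
Tq, Ts, D = 5, 4, 8                     # query/support frame counts, embedding dim
f_q = rng.standard_normal((Tq, D))      # f(x_q): query frame embeddings
f_s = rng.standard_normal((Ts, D))      # f(x_s): support frame embeddings

# a(x_q, x_s) = softmax[f(x_q) f^T(x_s)]  -- frame-level attention
a = softmax(f_q @ f_s.T, axis=-1)       # (Tq, Ts); each row sums to 1

# Temporally aligned support frames: each query frame receives a weighted
# combination of support frames, so clips of different lengths can be matched.
f_s_aligned = a @ f_s                   # (Tq, D)

# Assumed distance: mean Euclidean distance between matched frame pairs.
dist = np.linalg.norm(f_q - f_s_aligned, axis=1).mean()
```

Note that the attention matrix has one row per query frame regardless of the support clip length, which is why the alignment module can match video sequences of any length, as stated for the system of claim 9.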
9. A small sample video motion analysis system, characterized in that it adopts the small sample video motion analysis method as claimed in any one of claims 1-8 to perform feature learning on the query video frame and the support video frame; the small sample video motion analysis system comprises:
an input module, which inputs the query video frame and support video frame sequences and outputs a feature map through a dimension-reduction operation;
a training module, which inputs the feature map into a neural network formed by the temporal attention modules and the spatial attention modules for training, wherein in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frames with the support video frames and analyzes the action feature information in the small sample video, and which can match video frames of any length.
10. A small sample video motion analysis apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the small sample video motion analysis method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402385.5A CN115527152A (en) | 2022-11-10 | 2022-11-10 | Small sample video motion analysis method, system and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115527152A true CN115527152A (en) | 2022-12-27 |
Family
ID=84704837
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211402385.5A Withdrawn CN115527152A (en) | 2022-11-10 | 2022-11-10 | Small sample video motion analysis method, system and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115527152A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024186378A1 (en) * | 2023-03-08 | 2024-09-12 | Qualcomm Incorporated | Common action localization |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220019804A1 (en) * | 2020-07-01 | 2022-01-20 | Tata Consultancy Services Limited | System and method to capture spatio-temporal representation for video reconstruction and analysis |
CN114282047A (en) * | 2021-09-16 | 2022-04-05 | 腾讯科技(深圳)有限公司 | Small sample action recognition model training method and device, electronic equipment and storage medium |
CN114581819A (en) * | 2022-02-22 | 2022-06-03 | 华中科技大学 | Video behavior identification method and system |
CN114783053A (en) * | 2022-03-24 | 2022-07-22 | 武汉工程大学 | Behavior identification method and system based on space attention and grouping convolution |
CN115035605A (en) * | 2022-08-10 | 2022-09-09 | 广东履安实业有限公司 | Action recognition method, device and equipment based on deep learning and storage medium |
CN115131580A (en) * | 2022-08-31 | 2022-09-30 | 中国科学院空天信息创新研究院 | Space target small sample identification method based on attention mechanism |
CN115240271A (en) * | 2022-07-08 | 2022-10-25 | 北方工业大学 | Video behavior identification method and system based on space-time modeling |
Non-Patent Citations (1)
Title |
---|
Y. CAO,ET AL: "Searching for Better Spatio-temporal Alignment in Few-Shot Action Recognition", NEURIPS 2022 CONFERENCE BLIND SUBMISSION, pages 1 - 3 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109165573B (en) | Method and device for extracting video feature vector | |
WO2021022521A1 (en) | Method for processing data, and method and device for training neural network model | |
CN111401474B (en) | Training method, device, equipment and storage medium for video classification model | |
CN111898703B (en) | Multi-label video classification method, model training method, device and medium | |
CN111783712A (en) | Video processing method, device, equipment and medium | |
CN112507912B (en) | Method and device for identifying illegal pictures | |
Choi et al. | Face video retrieval based on the deep CNN with RBF loss | |
CN112733590A (en) | Pedestrian re-identification method based on second-order mixed attention | |
WO2024060684A1 (en) | Model training method, image processing method, device, and storage medium | |
Wang et al. | S 3 D: Scalable pedestrian detection via score scale surface discrimination | |
CN113822264A (en) | Text recognition method and device, computer equipment and storage medium | |
CN111310743B (en) | Face recognition method and device, electronic equipment and readable storage medium | |
CN114663798A (en) | Single-step video content identification method based on reinforcement learning | |
Xu et al. | Graphical modeling for multi-source domain adaptation | |
CN114357200A (en) | Cross-modal Hash retrieval method based on supervision graph embedding | |
Zhang et al. | Robust visual tracking using multi-frame multi-feature joint modeling | |
CN115527152A (en) | Small sample video motion analysis method, system and device | |
CN111898418A (en) | Human body abnormal behavior detection method based on T-TINY-YOLO network | |
CN113255394A (en) | Pedestrian re-identification method and system based on unsupervised learning | |
CN108830302B (en) | Image classification method, training method, classification prediction method and related device | |
Canévet et al. | Large scale hard sample mining with monte carlo tree search | |
CN113590898A (en) | Data retrieval method and device, electronic equipment, storage medium and computer product | |
CN111709473B (en) | Clustering method and device for object features | |
CN113869398A (en) | Unbalanced text classification method, device, equipment and storage medium | |
CN112949778A (en) | Intelligent contract classification method and system based on locality sensitive hashing and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication |
Application publication date: 20221227 |