CN115527152A - Small sample video motion analysis method, system and device - Google Patents

Small sample video motion analysis method, system and device

Info

Publication number
CN115527152A
Authority
CN
China
Prior art keywords
video frame
small sample
video
module
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211402385.5A
Other languages
Chinese (zh)
Inventor
封晓强
汤庆飞
曹毅超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING ENBO TECHNOLOGY CO LTD
Original Assignee
NANJING ENBO TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING ENBO TECHNOLOGY CO LTD filed Critical NANJING ENBO TECHNOLOGY CO LTD
Priority to CN202211402385.5A priority Critical patent/CN115527152A/en
Publication of CN115527152A publication Critical patent/CN115527152A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/23: Recognition of whole body movements, e.g. for sport training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a small sample video motion analysis method, system and device, belonging to the field of neural architecture search. In prior-art small sample video motion analysis, an insufficient number of sparsely sampled video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the method, system and device of the invention take a query video frame and a support video frame as input, output a feature map through a dimension reduction operation, input the feature map into a neural network formed by temporal attention modules and spatial attention modules for training, match the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyze the motion feature information in the small sample video. This effectively enhances the ability to capture motion information in small sample videos and significantly improves the accuracy of small sample video motion recognition.

Description

Small sample video motion analysis method, system and device
Technical Field
The invention relates to the field of neural architecture search, in particular to a small sample video motion analysis method, system and device.
Background
Understanding human behavior is one of the most important tasks in video understanding and one of its representative tasks. In recent years, with the advent of high-quality large-scale video datasets, video understanding and action recognition have made significant progress. However, such success relies heavily on large numbers of manually labeled samples. The dataset labeling process is tedious and time-consuming, which limits the practical application of these algorithms. Therefore, how to classify unusual action classes with only a few labeled samples has attracted extensive research. The introduction of the temporal dimension makes video tasks more complex than image tasks. For example, actions in video are typically performed at different speeds and occur at different points in time. In addition, action recognition typically requires integrating information from multiple different sub-actions to generate corresponding temporal representations for subsequent spatio-temporal feature matching.
For the problem of small sample video action classification, the prior art mostly adopts a metric-based meta-learning framework to compare the similarity between a query video and a support video: the input video is mapped to an embedding space through representation learning, and a distance measure is then used to compare video similarity within an episodic task. Although such methods learn a backbone network or a temporal relation module through episodic training, they ignore the importance of the spatio-temporal representation, which is fundamental to small sample video action classification. Therefore, a network architecture formed by temporal attention modules and spatial attention modules needs to be designed to analyze small sample video actions.
Disclosure of Invention
1. Technical problem to be solved
In prior-art small sample video motion analysis, an insufficient number of sparsely sampled video frames limits the capacity of long-term temporal models, while increasing the number of sampled frames leads to complex combination and matching strategies. To address these problems, the invention provides a small sample video motion analysis method, system and device, which can effectively enhance the ability to capture motion information in small sample videos and significantly improve the accuracy of small sample video motion recognition.
2. Technical scheme
The purpose of the invention is realized by the following technical scheme.
A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the feature map into a neural network formed by temporal attention modules and spatial attention modules for training;
and matching the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyzing the motion feature information in the small sample video.
Further, the neural network includes 3 stages, each stage having 8 layers, each layer including 3 temporal attention modules and 3 spatial attention modules.
Further, the 3 stages of the neural network are connected through a connection module; the neural network is connected to the spatio-temporal alignment module through a Norm layer.
Further, in the temporal attention module, temporal dependency information between frames is captured by calculating attention between different frames at the same position.
Further, in the spatial attention module, spatial dependency information within the same frame is captured by calculating attention between different positions of the same frame.
Further, in the neural network, network training is accelerated by narrowing a search space range.
Further, the step of narrowing the search space range is:
selecting a group of sub-networks for training by adopting a uniform sampling mode from all defined operands;
during the training of the super network, score statistics are carried out on all operands once every P iterations, and the score is calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
wherein S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
wherein W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
sorting the scores of the operands, deleting the K% of operands with the lowest scores, and retaining the high-scoring, i.e. well-performing, operands;
during the next iteration, a group of sub-networks is again selected for training by uniform sampling from the remaining operands.
Further, the query video frame and the support video frame are matched in a spatio-temporal alignment module, and the distance between each query video frame and the support video frame is calculated according to the following formulas:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
wherein a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
wherein f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
wherein D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
A small sample video motion analysis system performs feature learning on the query video frame and the support video frame using the small sample video motion analysis method; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
A small sample video motion analysis device comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the small sample video motion analysis method.
3. Advantageous effects
Compared with the prior art, the invention has the advantages that:
according to the small sample video motion analysis method, the small sample video motion analysis system and the small sample video motion analysis device, the neural network is constructed on the basis of the time attention module and the space attention module, the small sample video is trained in the neural network, and the model can spontaneously select and pay attention to different types of information at different stages, so that better time and space information representation can be obtained; meanwhile, the spatial range is narrowed, the neural network training is accelerated, and the time cost of video analysis is greatly saved; the time-space alignment module can be used for matching videos with any length, breaks through the limitation on the number of frames in small sample video motion recognition, enhances the capturing capability of motion information in small sample videos, and obviously improves the precision of motion recognition in small sample videos.
Drawings
FIG. 1 is a block diagram of a neural network architecture of the present invention;
FIG. 2 is a schematic diagram of the temporal attention module and the spatial attention module according to the present invention;
FIG. 3 is a block diagram of the temporal and spatial alignment at video frame level according to the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific examples.
Examples
As shown in FIG. 1, in the small sample video motion analysis method, system, and apparatus provided in this embodiment, a query video frame and a support video frame are input, and a feature map is output through a dimension reduction operation; the feature map is input into a neural network formed by temporal attention modules and spatial attention modules for training; and the trained query video frame is matched with the support video frame in a spatio-temporal alignment module, and the motion feature information in the small sample video is analyzed.
Specifically, in this embodiment, two sets of video frames are input: one set of m query video frames (Query Video) and one set of n support video frames (Support Video), where the classes of the support video frames are known. The distance between each query video frame and the n support video frames is calculated, and the query video frame is assigned the class of the support video frame closest to it, so that the class of the video is determined. For the input query video frame and support video frame sequences, a 16× downsampled feature map is output through a dimension reduction operation consisting of four 3×3 convolutional layers. The dimension reduction lowers the dimensionality of the data and thus reduces the computational cost of the model; four 3×3 convolutional layers are used because the channel number and spatial size of the feature maps fed into the modules of the neural network need to match. After the query video frame and the support video frame have been reduced by the four 3×3 convolutional layers, they are input into the neural network.
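As a rough illustration of this dimension reduction step, the following PyTorch-style sketch builds a stem of four 3×3 convolutional layers producing a 16× spatially downsampled feature map; the stride-2 convolutions, channel widths, and batch-norm/ReLU choices are assumptions, since the embodiment only specifies the number of layers, the kernel size and the downsampling factor.

    import torch
    import torch.nn as nn

    class ConvStem(nn.Module):
        """Hypothetical dimension-reduction stem: four 3x3 convolutions, each
        with stride 2, giving a 16x spatially downsampled feature map."""
        def __init__(self, in_channels=3, out_channels=96):
            super().__init__()
            chans = [in_channels, 24, 48, 64, out_channels]  # assumed widths
            layers = []
            for c_in, c_out in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out),
                           nn.ReLU(inplace=True)]
            self.stem = nn.Sequential(*layers)

        def forward(self, x):        # x: (B*T, 3, H, W) stacked video frames
            return self.stem(x)      # -> (B*T, out_channels, H/16, W/16)

    frames = torch.randn(8, 3, 224, 224)   # e.g. T = 8 sampled frames
    feat = ConvStem()(frames)              # (8, 96, 14, 14)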
In this embodiment, the neural network is composed of temporal attention modules (TimeAttn) and spatial attention modules (SpaceAttn). The whole neural network is divided into 3 stages, each stage has 8 layers, each layer includes 3 temporal attention modules and 3 spatial attention modules, and each module is built on a Transformer structure. The 3 stages of the neural network are connected through connection modules (SDB), which downsample the input feature map 2× spatially; the SDB is also a dimension reduction module used to reduce the computational cost of the model. For the temporal attention module and the spatial attention module, the input dimension is Twh × c, where w, h and c are the width, height and channel number of the embedded vector respectively, and T is the number of frames. The temporal attention module has 3 variants with 6, 12 and 16 heads respectively; it computes attention between different frames at the same position to obtain a wh × T × T temporal attention mask, which captures temporal dependency information between frames. The spatial attention module likewise has 3 variants with 6, 12 and 16 heads; it computes attention between different positions within the same frame to obtain a T × wh × wh spatial attention map, which captures spatial dependency information within a frame. FIG. 2 shows the specific structure of the temporal attention module and the spatial attention module, which is a standard Transformer structure. For the temporal attention module, the input data (Twh, c) passes through a Linear layer to obtain three groups of feature vectors q, k and v; q and k are multiplied (MatMul), and the result passes through a Softmax layer to obtain a (wh, T, T) attention map; this map is matrix-multiplied with v to highlight the temporal positions the model attends to, i.e. the key frames; the result passes through a Linear layer, is added to the original input data, and then passes through a two-layer MLP to obtain the temporal attention output. In this embodiment, the structure of the spatial attention module is similar to that of the temporal attention module. By constructing the neural network from temporal attention modules and spatial attention modules and training the small sample video in this network, the model can autonomously select and attend to different types of information at different stages, thereby obtaining better temporal and spatial information representations.
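The following is a minimal PyTorch-style sketch of the temporal attention computation described above; the head count, MLP ratio, and normalization placement are assumptions of a standard Transformer block, and only the reshaping that restricts attention to the T frame positions at each spatial location reflects the embodiment. The spatial attention module would be the analogous block with attention taken over the w×h positions within each frame.

    import torch
    import torch.nn as nn

    class TimeAttn(nn.Module):
        """Sketch of a temporal attention block: attention runs across the T
        frames at each spatial position, i.e. a (wh, T, T) attention map."""
        def __init__(self, dim, heads=6, mlp_ratio=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio),
                                     nn.GELU(),
                                     nn.Linear(dim * mlp_ratio, dim))

        def forward(self, x, T):
            # x: (B, T*w*h, c) token sequence; regroup so attention runs over time
            B, Twh, c = x.shape
            wh = Twh // T
            t = x.view(B, T, wh, c).permute(0, 2, 1, 3).reshape(B * wh, T, c)
            n = self.norm1(t)
            t = t + self.attn(n, n, n)[0]
            t = t + self.mlp(self.norm2(t))
            return t.view(B, wh, T, c).permute(0, 2, 1, 3).reshape(B, Twh, c)

    x = torch.randn(2, 8 * 14 * 14, 96)    # B=2, T=8, w=h=14, c=96
    y = TimeAttn(96, heads=6)(x, T=8)      # same shape as the input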
During neural network training, training is accelerated by narrowing the search space range. In this embodiment, the search space is the set of all operand combinations: the neural network to be searched has n layers and each layer has m candidate operands, so the size of the search space is m to the power of n. During training, operands with low scores are removed, thereby reducing the search range.
A group of sub-networks is selected for training by uniform sampling from all defined operands; the defined operands are the candidates in the temporal attention and spatial attention modules shown in FIG. 1, such as TimeAttn H6 and SpaceAttn H6, and an operand can be understood as a computation module that realizes a certain function.
During the training of the super network, score statistics are carried out on all operands once every P iterations, and the score is calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
where S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
where W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
the scores of each operand are sorted, the operands with low K% scores are deleted, and the operands with high scores, namely good performance, are reserved, so that the search space range can be reduced. In this embodiment, K% is a specified parameter, for example, 100 operands are defined, and if K is 10, 10% of the operands with low scores are deleted, and 90 operands with high scores are reserved.
During the next iteration, a group of sub-networks is again selected for training by uniform sampling from the remaining operands.
In this way, the search space range is reduced, neural network training is accelerated, and the time cost of video analysis is greatly reduced.
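A rough sketch of this progressive search-space narrowing is given below; the helpers train_supernet_step and path_loss, the iteration counts, and the use of the average training loss as an operand's score are placeholders and assumptions consistent with the definitions above, not the exact procedure of the embodiment.

    import random
    from collections import defaultdict

    def narrow_search_space(layers_ops, train_supernet_step, path_loss,
                            iterations=1000, P=100, K=10):
        """layers_ops: per-layer lists of candidate operands, e.g. "TimeAttn H6".
        train_supernet_step(path): one supernet weight update on a sampled path.
        path_loss(path): statistic used for scoring, e.g. cross entropy on D_tr."""
        records = defaultdict(list)                 # (layer, operand) -> recorded losses
        for it in range(1, iterations + 1):
            path = [random.choice(ops) for ops in layers_ops]   # uniform sampling
            train_supernet_step(path)
            loss = path_loss(path)
            for layer, op in enumerate(path):
                records[(layer, op)].append(loss)
            if it % P == 0:                         # every P iterations: score and prune
                scored = {k: sum(v) / len(v) for k, v in records.items() if v}
                worst_first = sorted(scored, key=scored.get, reverse=True)
                for layer, op in worst_first[: len(worst_first) * K // 100]:
                    if op in layers_ops[layer] and len(layers_ops[layer]) > 1:
                        layers_ops[layer].remove(op)            # drop low-scoring operand
                        records.pop((layer, op), None)
        return layers_ops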
It should be noted that, in the training phase, for each iteration one module is randomly selected from the 6 candidate modules (3 temporal attention modules and 3 spatial attention modules) of each layer, so that a total of 24 modules, one per layer across the 3 stages of 8 layers, are selected to form a path for training.
The neural network is connected to the spatio-temporal alignment module (Spatio-Temporal Alignment) through a Norm layer; the Norm layer normalizes the features, which helps the model converge. After the query video frame and the support video frame have been trained in the neural network, the obtained features are normalized by the Norm layer and input into the spatio-temporal alignment module, where the query video frame is matched with the support video frame so as to analyze the motion feature information in the small sample video. As shown in FIG. 3, the query video (Query Video x_q) has four pictures: picture a is the pose of the person standing, picture b is the pose of the person running, picture c is the pose of the person crossing the pole, and picture d is the pose of the person lying down; the support video (Support Video x_s) has four pictures: pictures a', b' and c' are poses of the person running, and picture d' is the pose of the person crossing the pole. It should be noted that matching is generally performed by sequential frame alignment, for example the first frame of the query video is aligned with the first frame of the support video, the second frame with the second frame, and so on; however, this method does not necessarily produce a correct match: with sequential frame alignment, the standing pose in query picture a cannot be aligned with the running pose in support picture a', which reduces matching accuracy. In this embodiment, key-frame alignment is adopted instead, which greatly improves matching accuracy: the running pose in query picture b can be aligned with the running pose in support picture a', and the pose of crossing the pole in query picture c can be aligned with the pose of crossing the pole in support picture d'. In this way, the ability to capture motion information in the video is significantly enhanced, thereby improving the accuracy of small sample video motion recognition. The distance between each query video frame and the support video frames is calculated as follows:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
where a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
where f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame.
The distance between the embedded vectors corresponding to the query video frame and the support video frame is then calculated as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
where D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
Therefore, the spatio-temporal alignment module can match videos of any length, breaking through the limitation on the number of frames in small sample video motion recognition, enhancing the ability to capture motion information in small sample videos, and significantly improving the accuracy of motion recognition in small sample videos.
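A minimal sketch of this frame-level alignment and nearest-support classification is shown below, under the assumption that the frame distance is a squared L2 distance normalized by T and C_o; the tensor shapes and the classify_query helper are illustrative.

    import torch
    import torch.nn.functional as F

    def frame_level_distance(f_q: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        """f_q: (T_q, C_o) embedded query frames; f_s: (T_s, C_o) embedded support
        frames. T_q and T_s may differ, so videos of any length can be matched."""
        attn = F.softmax(f_q @ f_s.t(), dim=-1)   # a(x_q, x_s): (T_q, T_s) frame-level attention
        f_s_aligned = attn @ f_s                  # temporally aligned support representation
        T, C_o = f_q.shape
        return ((f_q - f_s_aligned) ** 2).sum() / (T * C_o)   # assumed normalized squared-L2 distance

    def classify_query(f_q, support_feats, support_labels):
        """Assign the query the label of the nearest support video."""
        dists = torch.stack([frame_level_distance(f_q, f_s) for f_s in support_feats])
        return support_labels[int(dists.argmin())]

    f_q = torch.randn(8, 96)                      # e.g. 8 query frames, C_o = 96
    supports = [torch.randn(6, 96), torch.randn(10, 96)]
    label = classify_query(f_q, supports, ["run", "high jump"])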
In addition, the small sample video motion analysis system provided in this embodiment performs feature learning on the query video frame and the support video frame based on the small sample video motion analysis method; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training; in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video; the spatio-temporal alignment module can match video frames of any length.
The present embodiment further provides a small sample video motion analysis apparatus, which is a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor. The processor executes the program to implement the steps of the small sample video motion analysis method as in the present embodiment. The computer device may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers) capable of executing programs, and the like. The computer device of the embodiment at least includes but is not limited to: a memory, a processor communicatively coupled to each other via a system bus. The memory (i.e., readable storage medium) includes flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. The memory may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device, or an external storage unit of the computer device, such as a plug-in hard disk provided on the computer device, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Of course, the memory may also include both internal and external storage units of the computer device. In this embodiment, the memory is generally used for storing an operating system, various types of application software, and the like installed in the computer device. In addition, the memory may also be used to temporarily store various types of data that have been output or are to be output. The processor may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor is typically used to control the overall operation of the computer device and, in this embodiment, is used to execute program code stored in memory or to process data.
The invention and its embodiments have been described above schematically, without limitation, and the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiment shown in the drawings is only one of the embodiments of the invention, the actual structure is not limited to the embodiment, and any reference signs in the claims shall not limit the claims. Therefore, if a person skilled in the art receives the teachings of the present invention, without inventive design, a similar structure and an embodiment to the above technical solution should be covered by the protection scope of the present patent. Furthermore, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" preceding an element does not exclude the inclusion of a plurality of such elements. Several of the elements recited in the product claims may also be implemented by one element in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (10)

1. A small sample video motion analysis method comprises the following steps:
inputting a query video frame and a support video frame, and outputting a feature map through dimension reduction operation;
inputting the feature map into a neural network formed by temporal attention modules and spatial attention modules for training;
and matching the trained query video frame with the support video frame in a spatio-temporal alignment module, and analyzing the motion feature information in the small sample video.
2. The method according to claim 1, wherein the neural network comprises 3 stages, each stage has 8 layers, and each layer comprises 3 temporal attention modules and 3 spatial attention modules.
3. The small sample video motion analysis method according to claim 2, wherein the 3 stages of the neural network are connected through a connection module; the neural network is connected with the spatio-temporal alignment module through a Norm layer.
4. The method according to claim 3, wherein the temporal attention module captures temporal dependency information between frames by calculating attention between different frames at the same position.
5. The method according to claim 3 or 4, wherein in the spatial attention module, spatial dependency information within the same frame is captured by calculating attention between different positions of the same frame.
6. The method according to claim 5, wherein in the neural network, network training is accelerated by narrowing a search space.
7. The small sample video motion analysis method according to claim 6, wherein the step of narrowing the search space comprises:
selecting a group of sub-networks for training by adopting a uniform sampling mode from all defined operands;
during the training of the super network, performing score statistics on all operands once every P iterations, the score being calculated as:
S(i,j) = E_{a ∈ U(A), O_{i,j} ∈ a} [ L_CE(N_s(a, W), D_tr) ]
wherein S(i,j) represents the score of the j-th operand of the i-th layer, O_{i,j} denotes that operand, N_s represents the super network, a represents a sub-network of the super network, W represents the weights, E_a represents the average of the statistical results of sub-network a, U(A) represents the set of all models in the search space, D_tr represents the training set, and L_CE represents the cross entropy loss function, which is calculated as:
L_CE = L_CE( N_A(a, W_a), D_train )
wherein W_a represents the weights of sub-network a, N_A represents the Transformer search space of the super network, and D_train represents the training set;
sorting the scores of the operands, deleting the K% of operands with the lowest scores, and retaining the high-scoring, i.e. well-performing, operands;
during the next iteration, a set of subnetworks will be selected for training from among the remaining operands using a uniform sampling approach.
8. The small sample video motion analysis method according to claim 7, wherein the query video frames and the support video frames are matched in a spatio-temporal alignment module, and the distance between each query video frame and the support video frame is calculated according to the following formula:
a(x_q, x_s) = softmax[ f(x_q) f^T(x_s) ]
wherein a(x_q, x_s) is the attention at the video frame level, f(x_q) is the embedded vector of the query video frame, and f^T(x_s) is the transposed embedded vector of the support video frame; with this attention, the embedded vector is further calculated:
f̃(x_s) = a(x_q, x_s) · f(x_s)
wherein f(x_s) is the embedded vector of the support video frame and f̃(x_s) is the representation of the temporally aligned support video frame;
calculating the distance between the embedded vectors corresponding to the query video frame and the support video frame as:
D(x_q, x_s) = (1 / (T · C_o)) · ‖ f(x_q) − f̃(x_s) ‖²
wherein D(x_q, x_s) is the distance between the embedded vectors, T is the length of the video, and C_o is the feature dimension.
9. A small sample video motion analysis system, characterized in that it adopts the small sample video motion analysis method as claimed in any one of claims 1-8 to perform feature learning on the query video frame and the support video frame; the small sample video motion analysis system comprises:
an input module, which inputs a query video frame and a support video frame sequence, and outputs a feature map through a dimension reduction operation;
a training module, which inputs the feature map into the neural network formed by the temporal attention modules and the spatial attention modules for training, wherein in the training phase, for each iteration, one module is randomly selected from the 3 temporal attention modules and the 3 spatial attention modules of each layer for training; and
a spatio-temporal alignment module, which matches the trained query video frame with the support video frame and analyzes the motion feature information in the small sample video, wherein the spatio-temporal alignment module can match video frames of any length.
10. A small sample video motion analysis apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the small sample video motion analysis method according to any one of claims 1 to 8.
CN202211402385.5A 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device Withdrawn CN115527152A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211402385.5A CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211402385.5A CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Publications (1)

Publication Number Publication Date
CN115527152A true CN115527152A (en) 2022-12-27

Family

ID=84704837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211402385.5A Withdrawn CN115527152A (en) 2022-11-10 2022-11-10 Small sample video motion analysis method, system and device

Country Status (1)

Country Link
CN (1) CN115527152A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024186378A1 (en) * 2023-03-08 2024-09-12 Qualcomm Incorporated Common action localization

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019804A1 (en) * 2020-07-01 2022-01-20 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis
CN114282047A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Small sample action recognition model training method and device, electronic equipment and storage medium
CN114581819A (en) * 2022-02-22 2022-06-03 华中科技大学 Video behavior identification method and system
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115131580A (en) * 2022-08-31 2022-09-30 中国科学院空天信息创新研究院 Space target small sample identification method based on attention mechanism
CN115240271A (en) * 2022-07-08 2022-10-25 北方工业大学 Video behavior identification method and system based on space-time modeling

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220019804A1 (en) * 2020-07-01 2022-01-20 Tata Consultancy Services Limited System and method to capture spatio-temporal representation for video reconstruction and analysis
CN114282047A (en) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 Small sample action recognition model training method and device, electronic equipment and storage medium
CN114581819A (en) * 2022-02-22 2022-06-03 华中科技大学 Video behavior identification method and system
CN114783053A (en) * 2022-03-24 2022-07-22 武汉工程大学 Behavior identification method and system based on space attention and grouping convolution
CN115240271A (en) * 2022-07-08 2022-10-25 北方工业大学 Video behavior identification method and system based on space-time modeling
CN115035605A (en) * 2022-08-10 2022-09-09 广东履安实业有限公司 Action recognition method, device and equipment based on deep learning and storage medium
CN115131580A (en) * 2022-08-31 2022-09-30 中国科学院空天信息创新研究院 Space target small sample identification method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Y. Cao et al.: "Searching for Better Spatio-temporal Alignment in Few-Shot Action Recognition", NeurIPS 2022 Conference Blind Submission, pages 1-3 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024186378A1 (en) * 2023-03-08 2024-09-12 Qualcomm Incorporated Common action localization

Similar Documents

Publication Publication Date Title
CN109165573B (en) Method and device for extracting video feature vector
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111401474B (en) Training method, device, equipment and storage medium for video classification model
CN111898703B (en) Multi-label video classification method, model training method, device and medium
CN111783712A (en) Video processing method, device, equipment and medium
CN112507912B (en) Method and device for identifying illegal pictures
Choi et al. Face video retrieval based on the deep CNN with RBF loss
CN112733590A (en) Pedestrian re-identification method based on second-order mixed attention
WO2024060684A1 (en) Model training method, image processing method, device, and storage medium
Wang et al. S 3 D: Scalable pedestrian detection via score scale surface discrimination
CN113822264A (en) Text recognition method and device, computer equipment and storage medium
CN111310743B (en) Face recognition method and device, electronic equipment and readable storage medium
CN114663798A (en) Single-step video content identification method based on reinforcement learning
Xu et al. Graphical modeling for multi-source domain adaptation
CN114357200A (en) Cross-modal Hash retrieval method based on supervision graph embedding
Zhang et al. Robust visual tracking using multi-frame multi-feature joint modeling
CN115527152A (en) Small sample video motion analysis method, system and device
CN111898418A (en) Human body abnormal behavior detection method based on T-TINY-YOLO network
CN113255394A (en) Pedestrian re-identification method and system based on unsupervised learning
CN108830302B (en) Image classification method, training method, classification prediction method and related device
Canévet et al. Large scale hard sample mining with monte carlo tree search
CN113590898A (en) Data retrieval method and device, electronic equipment, storage medium and computer product
CN111709473B (en) Clustering method and device for object features
CN113869398A (en) Unbalanced text classification method, device, equipment and storage medium
CN112949778A (en) Intelligent contract classification method and system based on locality sensitive hashing and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20221227)