CN115240106B - Task self-adaptive small sample behavior recognition method and system - Google Patents

Task self-adaptive small sample behavior recognition method and system

Info

Publication number
CN115240106B
CN115240106B CN202210815080.0A
Authority
CN
China
Prior art keywords
feature
features
map
attention
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210815080.0A
Other languages
Chinese (zh)
Other versions
CN115240106A (en
Inventor
金一
王佳艺
冯松鹤
郎丛妍
王涛
李浥东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210815080.0A priority Critical patent/CN115240106B/en
Publication of CN115240106A publication Critical patent/CN115240106A/en
Application granted granted Critical
Publication of CN115240106B publication Critical patent/CN115240106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a task self-adaptive small sample behavior recognition method and system, belonging to the technical field of computer vision. Video data to be recognized are acquired and processed with a pre-trained recognition model to obtain an action category result. An attention layer is added to the recognition model to extract the position information and image content information of the behavior main body in each picture frame, the extracted features are modulated by an attention mechanism, and intra-class feature commonalities of actions of the same class and differences among actions of different classes are obtained. By adding the attention layer during feature extraction, a more discriminative feature representation is generated; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.

Description

Task self-adaptive small sample behavior recognition method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a task self-adaptive small sample behavior recognition method and system.
Background
In recent years, intelligent devices with shooting functions, such as smart phones and surveillance cameras, have become common in daily life, and large numbers of videos are uploaded and watched every minute on video platforms such as Weibo, Douyin and Kuaishou. Because of the massive scale of video data, identifying behaviors manually is time-consuming and labor-intensive, which makes research on behavior recognition increasingly important; recognizing abnormal behaviors such as fighting and pornographic content is particularly critical for maintaining a healthy environment on network platforms. Given that samples of some behaviors are difficult to collect, small sample behavior recognition has even greater research significance.
The purpose of the small sample behavior recognition task is to train a robust model with a small number of samples, so that it shows good classification ability when the test phase presents new action categories never seen in the training set. Small sample behavior recognition aims at learning how to learn to classify. Formally, the behavior recognition video training set contains many samples of many action categories. In the training stage, C categories are randomly drawn from the training set and K samples of each category are used to construct a task serving as the support set input of the model; a batch of samples is drawn from the remaining videos of the same C categories as the prediction objects of the model, i.e. the query set, and the model must learn to distinguish the C categories from the C×K support data. During training, a different task is randomly sampled in each iteration, generally containing a different combination of categories, and this mechanism lets the model learn the commonality among different tasks so as to classify new tasks well at test time.
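For illustration only, a minimal Python sketch of this episodic sampling is given below; the `dataset` dictionary, the function name and the variable names are assumptions for illustration and are not part of the original disclosure.

```python
import random

def sample_episode(dataset, c_way, k_shot, n_query):
    """Build one C-way K-shot task from a {class_label: [video, ...]} mapping.

    Returns a support set of C*K labelled videos and a query set drawn from
    the remaining videos of the same C classes.
    """
    classes = random.sample(list(dataset.keys()), c_way)
    support, query = [], []
    for label in classes:
        videos = random.sample(dataset[label], k_shot + n_query)
        support += [(v, label) for v in videos[:k_shot]]
        query += [(v, label) for v in videos[k_shot:]]
    return support, query

# e.g. a 5-way 1-shot task with 5 query videos per class:
# support_set, query_set = sample_episode(train_videos, c_way=5, k_shot=1, n_query=5)
```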
The prior knowledge for small sample behavior recognition comes from three aspects: data, models and algorithms, and research is mainly conducted along these three lines. The main idea of data-enhancement-based methods is to expand the sample set, for example by learning transformations between samples for data set expansion or by generating new samples with a GAN network, but such methods lack generality and are difficult to transfer. Model-improvement-based methods obtain approximate solutions through iterative models and reduce the hypothesis space of the solution; typical approaches include multi-task learning, embedding learning, external-memory-based learning and generative models. Algorithm-optimization-based methods aim to improve the optimization algorithm so that the search process is more efficient, mainly by refining existing parameters, refining meta-learned parameters, or learning optimizers.
The existing small sample behavior recognition methods have the following problems: (1) The feature extraction part adopts a ResNet-50 network and feeds the features directly into the classifier after fusion by the DGAdaIN module; attention is paid only to the gain that scene information brings to action recognition, while the behavior main body is ignored. (2) For different tasks, the features that distinguish the involved classes differ greatly; merely adding attention modules such as the SE module or the CBAM module to the feature extraction part adapts poorly to new tasks. (3) Some common actions of concern, such as falling and fighting, are highly diverse and may occur indoors or in many outdoor scenes; applying temporally asynchronous enhancement only to a single video sample is of limited significance, so the data expansion effect is weak.
Disclosure of Invention
The invention aims to provide a task self-adaptive small sample behavior recognition method and system, which are used for solving at least one technical problem in the background technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in one aspect, the invention provides a task-adaptive small sample behavior recognition method, which comprises the following steps:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Preferably, training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
and calculating the similarity between the paired support set prototype representation and the query set features by taking Euclidean distance as a distance measure to obtain a predicted action category label of the video to be queried.
Preferably, sending into the attention layer to generate an attention map includes: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map.
Preferably, the multi-modal fusion comprises: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
Preferably, the task adaptation is performed as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
Preferably, N modulated class prototype representations and N query sample features are obtained, and similarity calculation is respectively carried out according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
In a second aspect, the present invention provides a task-adaptive small sample behavior recognition system, comprising:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
In a third aspect, the present invention provides a non-transitory computer readable storage medium for storing computer instructions which, when executed by a processor, implement a task-adaptive small sample behavior recognition method as described above.
In a fourth aspect, the present invention provides a computer program product comprising a computer program for implementing a task-adaptive small sample behavior recognition method as described above when run on one or more processors.
In a fifth aspect, the present invention provides an electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, said processor executing the computer program stored in said memory when the electronic device is running, to cause the electronic device to execute instructions implementing the task-adaptive small sample behavior recognition method as described above.
The invention has the beneficial effects that: an attention layer is added in the feature extraction stage so that the model can select where to focus and generate a more discriminative feature representation; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the attention mechanism is better utilized, so that the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.
The advantages of additional aspects of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a task-adaptive small sample behavior recognition method according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a task-adaptive small sample behavior recognition system according to an embodiment of the present invention.
FIG. 3 is a functional block diagram of a Task-specific module according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout or elements having like or similar functionality. The embodiments described below by way of the drawings are exemplary only and should not be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or groups thereof.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
In order that the invention may be readily understood, a further description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings and are not to be construed as limiting embodiments of the invention.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of examples and that the elements of the drawings are not necessarily required to practice the invention.
Example 1
The present embodiment 1 provides a task-adaptive small sample behavior recognition system, including:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
In this embodiment 1, by using the above system, a task-adaptive small sample behavior recognition method is implemented, including:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
and calculating the similarity between the paired support set prototype representation and the query set features by taking Euclidean distance as a distance measure to obtain a predicted action category label of the video to be queried.
Sending into the attention layer to generate an attention map includes: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map.
The multi-modal fusion includes: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
The task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain n class-specific query features, strengthening the differences between classes so that the features are more discriminative.
The method comprises the steps of obtaining n modulated class prototype representations and n query sample features, and respectively performing similarity calculation according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Example 2
In this embodiment 2, a task-adaptive small sample behavior recognition method is provided. Firstly, an attention layer is added in the feature extraction stage to emphasize the position information and image content information of the behavior main body in each picture frame; secondly, the designed DGAdaIN module fuses RGB features and Depth features of different samples of the same class to achieve data expansion and enrich the support set samples; then the support set prototype representations and the query set sample features are modulated through an attention module using the relationships between the sample features of the current task's support set, emphasizing the intra-class feature commonality and the inter-class differences of the current task, so as to improve the adaptability of the features to the current task and finally achieve higher accuracy of action classification prediction.
The task self-adaptive small sample behavior recognition method specifically comprises the following steps:
s1: data preprocessing stage
Step 1-1: and framing the video samples in the video data set, obtaining a corresponding group of continuous image frames RGB frames by each video, and calculating the number of frames of the image frames.
Step 1-2: depth estimation is performed on RGB frames using a monosde 2 module as a Depth estimator, resulting in Depth frames.
Step 1-3: dividing the acquired RGB frames and Depth frames into num seg Fragments of equal length, and then randomly extracting num from each fragment f And forming new RGB clips and Depth clips, wherein the Depth clips are divided into clips matched with the RGB clips and non-matched clips.
S2: attention-added feature extraction stage
Step 2-1: and respectively sending the RGB clip and the Depth clip into an RGB feature extraction network and a Depth feature extraction network to obtain RGB features and Depth features. The adopted feature extraction networks are all trained ResNet-50, so that the parameters of the feature extraction networks are fixed and are not transformed in the training process.
Step 2-2: and sending the feature map of RGBclip obtained by ResNet-50 processing to a CBAM module to obtain RGB features containing attention.
S3: multimodal fusion stage
Step 3-1: at this time, there is a set of matched RGB feature and Depth feature pairs from S2, each pair is from the same sample, and a set of randomly paired RGB feature and Depth feature pairs is added to be sent together as sample features to the DGAdaIN module.
Step 3-2: in the DGAdaIN module, the input is a pair of RGB features and Depth features (matched or unmatched) and the output is a fused feature. The module comprises two parameters for self-adaptive learning Depth feature graphs, and fine adjustment is carried out in the training process.
S4: task-adaptive feature modulation phase
Step 4-1: the support set sample features are integrated by class.
Step 4-2: the channel attention map for each class is calculated.
Step 4-3: and multiplying each class of samples of the support set by the map to perform characteristic enhancement, and strengthening common characteristics in the class.
Step 4-4: and carrying out feature class distinction on the query set sample multiplied by map to obtain n query features, and strengthening the difference between classes to ensure that the features are more discriminant.
S5: classifier classification prediction stage
Step 5-1: and obtaining n modulated class prototype representations and n query sample features from the third stage, and respectively performing similarity calculation according to class matching.
Step 5-2: and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Step 5-3: knowing the correct class labels, a negative log-loss function is defined, optimizing the parameters in the DGAdaIN module by minimizing the loss function.
In summary, in the method described in embodiment 2, an attention layer is added in the feature extraction stage so that the model can select where to focus and generate a more discriminative feature representation; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the attention mechanism is better utilized, so that the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.
Example 3
Referring to fig. 1, fig. 2, and fig. 3, the task-adaptive small sample behavior recognition method provided in embodiment 3 of the present invention includes the following steps:
s1: data preprocessing stage
Step 1-1: and framing the video samples in the video data set, obtaining a corresponding group of continuous image frames RGB frames by each video, and calculating the number of frames of the image frames.
In this embodiment, the adopted data set is the behavior recognition data set UCF101, which consists of real action videos collected from YouTube and contains 13320 videos from 101 behavior categories; the behaviors fall into five broad groups: human-object interaction, simple body motion, human-human interaction, playing musical instruments, and sports. The data set is partitioned so that the training set contains 70 classes, the validation set 10 classes, and the test set 21 classes. Each video sample is divided into a set of RGB image frames and the number of frames n_frames of each set is counted. Each frame is first resized to 256x256 and a 224x224 region is then cropped randomly. In the test stage, since the behavior information is usually in the middle of the image rather than at its edges, the edge information is discarded and the 224x224 region is obtained by center cropping.
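A sketch of this resize-and-crop policy with standard torchvision transforms is shown below for illustration; any augmentation beyond the stated resizing and cropping is not specified by the patent and is omitted here.

```python
from torchvision import transforms

# Training: resize to 256x256, then take a random 224x224 crop.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Testing: center crop, since the behavior usually occupies the middle of the frame.
test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```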
Step 1-2: depth estimation is performed on RGB frames using a monosde 2 module as a Depth estimator, resulting in Depth frames.
In this embodiment, depth estimation is performed on all RGB frames obtained in step 1-1 using a Monodepth2 module, whose overall input is all RGB clips in the data set. For a group of RGB frames containing n_frames frames, three consecutive frames I_(t-1), I_t and I_(t+1) are input into the Depth Network, where the t-th frame is the frame whose depth is to be predicted and the (t-1)-th and (t+1)-th frames are the frames immediately before and after it. The Depth Network is implemented with a U-Net structure consisting of an encoder module and a decoder module, and it outputs the depth map D_t of the t-th frame. Each RGB frame is processed in turn and its corresponding depth is recovered, yielding the Depth frames.
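Conceptually, the per-frame depth recovery can be sketched as follows; `depth_net` is a placeholder for the Monodepth2-style encoder-decoder rather than an actual library call, and the handling of the first and last frames is an assumption.

```python
def estimate_depth_frames(rgb_frames, depth_net):
    """Predict a depth map for every frame using its temporal neighbours.

    rgb_frames: ordered list of frame tensors I_0 ... I_{T-1}
    depth_net: hypothetical network taking (I_{t-1}, I_t, I_{t+1}) and
               returning the depth map D_t of the middle frame.
    """
    depth_frames = []
    for t in range(len(rgb_frames)):
        prev_frame = rgb_frames[max(t - 1, 0)]                    # clamp at sequence borders
        next_frame = rgb_frames[min(t + 1, len(rgb_frames) - 1)]
        depth_frames.append(depth_net(prev_frame, rgb_frames[t], next_frame))
    return depth_frames
```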
Step 1-3: dividing the acquired RGB frames and Depth frames into num seg Fragments of equal length and then fromRandomly decimating num in each segment f And forming new RGB clips and Depth clips, wherein the Depth clips are divided into clips matched with the RGB clips and non-matched clips.
S2: attention-added feature extraction stage
Step 2-1: and respectively sending the RGB clip and the Depth clip into an RGB feature extraction network and a Depth feature extraction network to obtain RGB features and Depth features. The adopted feature extraction networks are all trained ResNet-50, so that the parameters of the feature extraction networks are fixed and are not transformed in the training process.
In this embodiment, since the difference between the RGB modality and the depth modality is large, a two-stage training method is adopted to acquire the features. First, an RGB sub-model and a depth sub-model are trained; the feature extraction networks for the RGB clip and the Depth clip both take ResNet-50 as the backbone, replacing its last fully connected layer with a new fully connected layer that serves as a classifier. A task is generated as in step S1, and the support set and query set samples are taken as the input of step 2-1. The RGB images and Depth images are sent into the feature extraction network, where the feature information extraction layer is a feature encoder built from a convolutional neural network; it extracts from the input images the rich image features required by the subsequent processing layers, yielding an RGB feature map and a Depth feature map. The RGB sub-model is fine-tuned for 6 epochs of the training phase with the learning rate of ResNet-50 set to lr_1 = 0.00001 and the learning rate of the fully connected layer set to lr_2 = 0.001. For the depth sub-model, the features of the Depth frames are harder to extract, so it is fine-tuned for 60 epochs with both lr_1 and lr_2 set to 0.00001 and reduced by 10% after 30 epochs. The pre-trained sub-models are then used as feature extractors to extract the feature vectors of the RGB clip and the Depth clip respectively.
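A sketch of the described fine-tuning setup is given below; the learning rates and the 10% decay after 30 epochs follow the text, while the choice of the Adam optimizer and the torchvision weight enum are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 70  # training-split classes of UCF101 in this embodiment

# RGB sub-model: pretrained ResNet-50 backbone with a new fully connected classifier.
rgb_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
rgb_model.fc = nn.Linear(rgb_model.fc.in_features, num_classes)

backbone_params = [p for n, p in rgb_model.named_parameters() if not n.startswith("fc")]
rgb_optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},            # lr_1 for the ResNet-50 backbone
    {"params": rgb_model.fc.parameters(), "lr": 1e-3},  # lr_2 for the new classifier
])

# Depth sub-model (built the same way): both groups use lr = 1e-5 and are
# reduced by 10% after 30 of the 60 fine-tuning epochs, e.g.
# depth_scheduler = torch.optim.lr_scheduler.StepLR(depth_optimizer, step_size=30, gamma=0.9)
```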
Step 2-2: and sending the feature map of the RGB clip obtained by ResNet-50 processing to a CBAM module to obtain the RGB feature containing attention.
In this embodiment, the feature map is sent to the CAM module to generate a channel attention map; the feature map is multiplied by the channel attention map, the result is sent to the SAM module to generate a spatial attention map, and the spatial attention map is in turn multiplied with the weighted feature map to obtain the final attention-weighted feature.
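A sketch of this channel-then-spatial attention, following the published CBAM design, is shown below; the reduction ratio and spatial kernel size are assumptions not stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAMSketch(nn.Module):
    """Sketch of the channel-then-spatial attention applied to the layer4 feature map."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP of the channel attention (CAM)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)  # SAM

    def forward(self, x):
        # Channel attention map, then channel-weighted feature map.
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention map, then spatially weighted feature map.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)))
        return x * sa
```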
S3: multimodal fusion stage
Step 3-1: at this time, there is a set of matched RGB feature and Depth feature pairs from S2, each pair is from the same sample, and a set of randomly paired RGB feature and Depth feature pairs is added to be sent together as sample features to the DGAdaIN module.
Step 3-2: in the DGAdaIN module, the input is a pair of RGB features and Depth features (matched or unmatched) and the output is a fused feature. The module comprises two parameters for self-adaptive learning Depth feature graphs, and fine adjustment is carried out in the training process.
In this example, since some behaviors such as riding a mountain bike or playing basketball are closely related to the scene, a method that adaptively fuses the depth modality features into the RGB modality features is adopted to strengthen the background information of the features. The DGAdaIN fusion module takes the RGB feature vectors and depth feature vectors extracted in S2 and outputs the fused feature vectors after processing. A batch of module inputs is denoted as x ∈ R^(B×D×L), where B is the batch size, D is the number of frames into which a single video sample is divided, and L is the feature dimension of each frame; γ and β are used to adaptively learn from the depth feature map.
S4: Attention-based feature modulation stage
Step 4-1: the support set sample features are integrated by class.
In this embodiment, the support set sample sets are integrated separately by class labels.
Step 4-2: the channel attention map for each class is calculated.
In this embodiment, the channel attention of each type of sample feature is calculated, and the attention map of each type is obtained to strengthen the support set feature and the query feature.
Step 4-3: and multiplying each class of samples of the support set by the map to perform characteristic enhancement, and strengthening common characteristics in the class.
In this embodiment, the common features within the class are highlighted by attention weighting.
Step 4-4: and carrying out feature class distinction on the query set sample multiplied by map to obtain n query features, and strengthening the difference between classes to ensure that the features are more discriminant.
In this embodiment, the attention weighting is used to obtain the query features corresponding to each class, so as to highlight the differences between classes.
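A sketch of this task-adaptive modulation is given below; `attention_net` stands for whatever module produces the per-class channel attention map, whose architecture the patent does not fix, and the pooled feature shapes are assumptions.

```python
import torch

def task_adaptive_modulation(support_feats, support_labels, query_feats, attention_net):
    """One channel attention map per class, used to re-weight that class's
    support features (intra-class commonality) and every query feature
    (inter-class difference).

    support_feats: (S, L) pooled support features; query_feats: (Q, L);
    attention_net: module mapping an (L,)-dim class feature to an (L,)-dim map.
    """
    prototypes, class_queries = [], []
    for label in sorted(set(support_labels.tolist())):
        mask = support_labels == label
        class_feat = support_feats[mask].mean(dim=0)                  # integrate the class's samples
        attn = torch.sigmoid(attention_net(class_feat))               # channel attention map of the class
        prototypes.append((support_feats[mask] * attn).mean(dim=0))   # enhanced class prototype
        class_queries.append(query_feats * attn)                      # class-specific query features
    return torch.stack(prototypes), torch.stack(class_queries)
```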
S4: classifier classification prediction stage
Step 5-1: and obtaining n modulated class prototype representations and n query sample features from the third stage, and respectively performing similarity calculation according to class matching.
Most current approaches use a similarity measure between the query features and the support set class prototype feature representations. In this embodiment, each support set class prototype feature has its corresponding class-specific query feature, and a distance measure is computed for each such pair.
Step 5-2: and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Step 5-3: knowing the correct class labels, negative log-loss functions are defined, and parameters in the DGAdaIN module and the Task-specific module are optimized by minimizing the loss functions.
Example 4
Embodiment 4 of the present invention provides a non-transitory computer readable storage medium for storing computer instructions that, when executed by a processor, implement a task-adaptive small sample behavior recognition method, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Example 5
Embodiment 5 of the present invention provides a computer program (product) comprising a computer program for implementing a task-adaptive small sample behavior recognition method when run on one or more processors, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Example 6
Embodiment 6 of the present invention provides an electronic device, including: a processor, a memory, and a computer program; wherein the processor is coupled to the memory and the computer program is stored in the memory, the processor executing the computer program stored in the memory when the electronic device is running to cause the electronic device to execute instructions for implementing a task-adaptive small sample behavior recognition method, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it should be understood that various changes and modifications could be made by one skilled in the art without the need for inventive faculty, which would fall within the scope of the invention.

Claims (6)

1. A task-adaptive small sample behavior recognition method, comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of the same type of actions and differences among different types of actions;
training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
the Euclidean distance is used as a distance measure to calculate the similarity between paired support set prototype representations and query set features so as to obtain a predicted action category label of the video to be queried;
the sending into the attention layer to generate an attention map comprises: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map;
the task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
2. The task-adaptive small sample behavior recognition method of claim 1, wherein the multi-modal fusion comprises: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
3. The task-adaptive small sample behavior recognition method according to claim 1, wherein the method comprises the steps of obtaining N modulated class prototype representations and N query sample features, and performing similarity calculation according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
4. A task-adaptive small sample behavior recognition system, comprising:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of the same type of actions and differences among different types of actions;
training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
the Euclidean distance is used as a distance measure to calculate the similarity between paired support set prototype representations and query set features so as to obtain a predicted action category label of the video to be queried;
the sending into the attention layer to generate an attention map comprises: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map;
the task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
5. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement the task-adaptive small sample behavior recognition method of any one of claims 1-3.
6. An electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, said processor executing the computer program stored in said memory when the electronic device is running, to cause the electronic device to execute instructions implementing the task-adaptive small sample behavior recognition method according to any of claims 1-3.
CN202210815080.0A 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system Active CN115240106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210815080.0A CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210815080.0A CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Publications (2)

Publication Number Publication Date
CN115240106A CN115240106A (en) 2022-10-25
CN115240106B true CN115240106B (en) 2023-06-20

Family

ID=83673337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210815080.0A Active CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Country Status (1)

Country Link
CN (1) CN115240106B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596256B (en) * 2018-04-26 2022-04-01 北京航空航天大学青岛研究院 Object recognition classifier construction method based on RGB-D
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN113807176B (en) * 2021-08-13 2024-02-20 句容市紫薇草堂文化科技有限公司 Small sample video behavior recognition method based on multi-knowledge fusion
CN114187546A (en) * 2021-12-01 2022-03-15 山东大学 Combined action recognition method and system

Also Published As

Publication number Publication date
CN115240106A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
WO2021232969A1 (en) Action recognition method and apparatus, and device and storage medium
EP2568429A1 (en) Method and system for pushing individual advertisement based on user interest learning
CN110839173A (en) Music matching method, device, terminal and storage medium
CN111046821B (en) Video behavior recognition method and system and electronic equipment
US20230353828A1 (en) Model-based data processing method and apparatus
CN109871736B (en) Method and device for generating natural language description information
CN111783712A (en) Video processing method, device, equipment and medium
CN110198482B (en) Video key bridge segment marking method, terminal and storage medium
CN111143617A (en) Automatic generation method and system for picture or video text description
CN110619284A (en) Video scene division method, device, equipment and medium
CN113766299A (en) Video data playing method, device, equipment and medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
KR20210011707A (en) A CNN-based Scene classifier with attention model for scene recognition in video
CN115240106B (en) Task self-adaptive small sample behavior recognition method and system
CN116977457A (en) Data processing method, device and computer readable storage medium
CN115222838A (en) Video generation method, device, electronic equipment and medium
CN114863249A (en) Video target detection and domain adaptation method based on motion characteristics and appearance characteristics
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113704544A (en) Video classification method and device, electronic equipment and storage medium
CN111291602A (en) Video detection method and device, electronic equipment and computer readable storage medium
Dai et al. Video Based Action Recognition Using Spatial and Temporal Feature
CN111325068A (en) Video description method and device based on convolutional neural network
CN117575894B (en) Image generation method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant