CN115240106B - Task self-adaptive small sample behavior recognition method and system - Google Patents

Task self-adaptive small sample behavior recognition method and system

Info

Publication number
CN115240106B
CN115240106B CN202210815080.0A
Authority
CN
China
Prior art keywords
feature
features
map
attention
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210815080.0A
Other languages
Chinese (zh)
Other versions
CN115240106A (en
Inventor
金一
王佳艺
冯松鹤
郎丛妍
王涛
李浥东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202210815080.0A priority Critical patent/CN115240106B/en
Publication of CN115240106A publication Critical patent/CN115240106A/en
Application granted granted Critical
Publication of CN115240106B publication Critical patent/CN115240106B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a task self-adaptive small sample behavior recognition method and system, belonging to the technical field of computer vision. Video data to be recognized are acquired and processed with a pre-trained recognition model to obtain an action category result. An attention layer is added to the recognition model to extract the position information and image content information of the behavior main body in each picture frame, the extracted features are modulated by an attention mechanism, and intra-class feature commonalities of actions of the same class and differences among actions of different classes are obtained. By adding the attention layer during feature extraction, a more discriminative feature representation is generated; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.

Description

Task self-adaptive small sample behavior recognition method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to a task self-adaptive small sample behavior recognition method and system.
Background
In recent years, intelligent devices with shooting functions, such as smart phones and surveillance cameras, have become common in daily life, and large numbers of videos are uploaded and watched every minute on video platforms such as Weibo, Douyin and Kuaishou. Because of the massive scale of video data, identifying behaviors manually is time-consuming and labor-intensive, which makes research on behavior recognition increasingly important; recognizing abnormal behaviors such as fighting and pornographic content is particularly critical for maintaining a healthy environment on network platforms. Given that samples of some behaviors are difficult to collect, small sample behavior recognition has even greater research significance.
The purpose of the small sample behavior recognition task is to train a robust model with a small number of samples, so that it shows good classification ability when the test phase presents new action categories never seen in the training set. Small sample behavior recognition aims at learning how to learn to classify. Formally, the behavior recognition video training set contains many samples of many action categories. In the training stage, C categories are randomly drawn from the training set and K samples of each category are used to construct a task serving as the support set input of the model; a batch of samples is drawn from the remaining videos of the same C categories as the prediction objects of the model, i.e. the query set, and the model must learn to distinguish the C categories from the C×K support data. During training, a different task is randomly sampled in each iteration, generally containing a different combination of categories, and this mechanism lets the model learn the commonality among different tasks so as to classify new tasks well at test time.
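For illustration only, a minimal Python sketch of this episodic sampling is given below; the `dataset` dictionary, the function name and the variable names are assumptions for illustration and are not part of the original disclosure.

```python
import random

def sample_episode(dataset, c_way, k_shot, n_query):
    """Build one C-way K-shot task from a {class_label: [video, ...]} mapping.

    Returns a support set of C*K labelled videos and a query set drawn from
    the remaining videos of the same C classes.
    """
    classes = random.sample(list(dataset.keys()), c_way)
    support, query = [], []
    for label in classes:
        videos = random.sample(dataset[label], k_shot + n_query)
        support += [(v, label) for v in videos[:k_shot]]
        query += [(v, label) for v in videos[k_shot:]]
    return support, query

# e.g. a 5-way 1-shot task with 5 query videos per class:
# support_set, query_set = sample_episode(train_videos, c_way=5, k_shot=1, n_query=5)
```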
The prior knowledge for small sample behavior recognition comes from three aspects: data, models and algorithms, and research is mainly conducted along these three lines. The main idea of data-enhancement-based methods is to expand the sample set, for example by learning transformations between samples for data set expansion or by generating new samples with a GAN network, but such methods lack generality and are difficult to transfer. Model-improvement-based methods obtain approximate solutions through iterative models and reduce the hypothesis space of the solution; typical approaches include multi-task learning, embedding learning, external-memory-based learning and generative models. Algorithm-optimization-based methods aim to improve the optimization algorithm so that the search process is more efficient, mainly by refining existing parameters, refining meta-learned parameters, or learning optimizers.
The existing small sample behavior recognition methods have the following problems: (1) The feature extraction part adopts a ResNet-50 network and feeds the features directly into the classifier after fusion by the DGAdaIN module; attention is paid only to the gain that scene information brings to action recognition, while the behavior main body is ignored. (2) For different tasks, the features that distinguish the involved classes differ greatly; merely adding attention modules such as the SE module or the CBAM module to the feature extraction part adapts poorly to new tasks. (3) Some common actions of concern, such as falling and fighting, are highly diverse and may occur indoors or in many outdoor scenes; applying temporally asynchronous enhancement only to a single video sample is of limited significance, so the data expansion effect is weak.
Disclosure of Invention
The invention aims to provide a task self-adaptive small sample behavior recognition method and system, which are used for solving at least one technical problem in the background technology.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in one aspect, the invention provides a task-adaptive small sample behavior recognition method, which comprises the following steps:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Preferably, training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
and calculating the similarity between the paired support set prototype representation and the query set features by taking Euclidean distance as a distance measure to obtain a predicted action category label of the video to be queried.
Preferably, sending into the attention layer to generate an attention map includes: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map.
Preferably, the multi-modal fusion comprises: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
Preferably, the task adaptation is performed as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
Preferably, N modulated class prototype representations and N query sample features are obtained, and similarity calculation is respectively carried out according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
In a second aspect, the present invention provides a task-adaptive small sample behavior recognition system, comprising:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
In a third aspect, the present invention provides a non-transitory computer readable storage medium for storing computer instructions which, when executed by a processor, implement a task-adaptive small sample behavior recognition method as described above.
In a fourth aspect, the present invention provides a computer program product comprising a computer program for implementing a task-adaptive small sample behavior recognition method as described above when run on one or more processors.
In a fifth aspect, the present invention provides an electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, said processor executing the computer program stored in said memory when the electronic device is running, to cause the electronic device to execute instructions implementing the task-adaptive small sample behavior recognition method as described above.
The invention has the beneficial effects that: an attention layer is added in the feature extraction stage so that the model can select where to focus and generate a more discriminative feature representation; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the attention mechanism is better utilized, so that the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.
The advantages of additional aspects of the invention will be set forth in part in the description which follows, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a task-adaptive small sample behavior recognition method according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a task-adaptive small sample behavior recognition system according to an embodiment of the present invention.
FIG. 3 is a functional block diagram of a Task-specific module according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements throughout or elements having like or similar functionality. The embodiments described below by way of the drawings are exemplary only and should not be construed as limiting the invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, and/or groups thereof.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
In order that the invention may be readily understood, a further description of the invention will be rendered by reference to specific embodiments that are illustrated in the appended drawings and are not to be construed as limiting embodiments of the invention.
It will be appreciated by those skilled in the art that the drawings are merely schematic representations of examples and that the elements of the drawings are not necessarily required to practice the invention.
Example 1
The present embodiment 1 provides a task-adaptive small sample behavior recognition system, including:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
In this embodiment 1, by using the above system, a task-adaptive small sample behavior recognition method is implemented, including:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
and calculating the similarity between the paired support set prototype representation and the query set features by taking Euclidean distance as a distance measure to obtain a predicted action category label of the video to be queried.
Sending into the attention layer to generate an attention map includes: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map.
The multi-modal fusion includes: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
The task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain n class-specific query features, strengthening the differences between classes so that the features are more discriminative.
The method comprises the steps of obtaining n modulated class prototype representations and n query sample features, and respectively performing similarity calculation according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Example 2
In this embodiment 2, a task-adaptive small sample behavior recognition method is provided. Firstly, an attention layer is added in the feature extraction stage to emphasize the position information and image content information of the behavior main body in each picture frame; secondly, the designed DGAdaIN module fuses RGB features and Depth features of different samples of the same class to achieve data expansion and enrich the support set samples; then the support set prototype representations and the query set sample features are modulated through an attention module using the relationships between the sample features of the current task's support set, emphasizing the intra-class feature commonality and the inter-class differences of the current task, so as to improve the adaptability of the features to the current task and finally achieve higher accuracy of action classification prediction.
The task self-adaptive small sample behavior recognition method specifically comprises the following steps:
s1: data preprocessing stage
Step 1-1: and framing the video samples in the video data set, obtaining a corresponding group of continuous image frames RGB frames by each video, and calculating the number of frames of the image frames.
Step 1-2: depth estimation is performed on RGB frames using a monosde 2 module as a Depth estimator, resulting in Depth frames.
Step 1-3: dividing the acquired RGB frames and Depth frames into num seg Fragments of equal length, and then randomly extracting num from each fragment f And forming new RGB clips and Depth clips, wherein the Depth clips are divided into clips matched with the RGB clips and non-matched clips.
S2: attention-added feature extraction stage
Step 2-1: and respectively sending the RGB clip and the Depth clip into an RGB feature extraction network and a Depth feature extraction network to obtain RGB features and Depth features. The adopted feature extraction networks are all trained ResNet-50, so that the parameters of the feature extraction networks are fixed and are not transformed in the training process.
Step 2-2: and sending the feature map of RGBclip obtained by ResNet-50 processing to a CBAM module to obtain RGB features containing attention.
S3: multimodal fusion stage
Step 3-1: at this time, there is a set of matched RGB feature and Depth feature pairs from S2, each pair is from the same sample, and a set of randomly paired RGB feature and Depth feature pairs is added to be sent together as sample features to the DGAdaIN module.
Step 3-2: in the DGAdaIN module, the input is a pair of RGB features and Depth features (matched or unmatched) and the output is a fused feature. The module comprises two parameters for self-adaptive learning Depth feature graphs, and fine adjustment is carried out in the training process.
S4: task-adaptive feature modulation phase
Step 4-1: the support set sample features are integrated by class.
Step 4-2: the channel attention map for each class is calculated.
Step 4-3: and multiplying each class of samples of the support set by the map to perform characteristic enhancement, and strengthening common characteristics in the class.
Step 4-4: and carrying out feature class distinction on the query set sample multiplied by map to obtain n query features, and strengthening the difference between classes to ensure that the features are more discriminant.
S5: classifier classification prediction stage
Step 5-1: and obtaining n modulated class prototype representations and n query sample features from the third stage, and respectively performing similarity calculation according to class matching.
Step 5-2: and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Step 5-3: knowing the correct class labels, a negative log-loss function is defined, optimizing the parameters in the DGAdaIN module by minimizing the loss function.
In summary, in the method described in embodiment 2, an attention layer is added in the feature extraction stage so that the model can select where to focus and generate a more discriminative feature representation; random multi-modal fusion of different samples within the same behavior class expands the support set data and makes the model more robust to changes in the environment of the behavior main body; and through task-level feature modulation, the attention mechanism is better utilized, so that the features better match the requirements of the current task and focus on the behavior main body, improving classification accuracy.
Example 3
Referring to fig. 1, fig. 2, and fig. 3, the task-adaptive small sample behavior recognition method provided in embodiment 3 of the present invention includes the following steps:
s1: data preprocessing stage
Step 1-1: and framing the video samples in the video data set, obtaining a corresponding group of continuous image frames RGB frames by each video, and calculating the number of frames of the image frames.
In this embodiment, the adopted data set is the behavior recognition data set UCF101, which consists of real action videos collected from YouTube and contains 13320 videos from 101 behavior categories; the behaviors fall into five broad groups: human-object interaction, simple body motion, human-human interaction, playing musical instruments, and sports. The data set is partitioned so that the training set contains 70 classes, the validation set 10 classes, and the test set 21 classes. Each video sample is divided into a set of RGB image frames and the number of frames n_frames of each set is counted. Each frame is first resized to 256x256 and a 224x224 region is then cropped randomly. In the test stage, since the behavior information is usually in the middle of the image rather than at its edges, the edge information is discarded and the 224x224 region is obtained by center cropping.
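A sketch of this resize-and-crop policy with standard torchvision transforms is shown below for illustration; any augmentation beyond the stated resizing and cropping is not specified by the patent and is omitted here.

```python
from torchvision import transforms

# Training: resize to 256x256, then take a random 224x224 crop.
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ToTensor(),
])

# Testing: center crop, since the behavior usually occupies the middle of the frame.
test_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```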
Step 1-2: depth estimation is performed on RGB frames using a monosde 2 module as a Depth estimator, resulting in Depth frames.
In this embodiment, depth estimation is performed on all RGB frames obtained in step 1-1 using a Monodepth2 module, whose overall input is all RGB clips in the data set. For a group of RGB frames containing n_frames frames, three consecutive frames I_(t-1), I_t and I_(t+1) are input into the Depth Network, where the t-th frame is the frame whose depth is to be predicted and the (t-1)-th and (t+1)-th frames are the frames immediately before and after it. The Depth Network is implemented with a U-Net structure consisting of an encoder module and a decoder module, and it outputs the depth map D_t of the t-th frame. Each RGB frame is processed in turn and its corresponding depth is recovered, yielding the Depth frames.
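Conceptually, the per-frame depth recovery can be sketched as follows; `depth_net` is a placeholder for the Monodepth2-style encoder-decoder rather than an actual library call, and the handling of the first and last frames is an assumption.

```python
def estimate_depth_frames(rgb_frames, depth_net):
    """Predict a depth map for every frame using its temporal neighbours.

    rgb_frames: ordered list of frame tensors I_0 ... I_{T-1}
    depth_net: hypothetical network taking (I_{t-1}, I_t, I_{t+1}) and
               returning the depth map D_t of the middle frame.
    """
    depth_frames = []
    for t in range(len(rgb_frames)):
        prev_frame = rgb_frames[max(t - 1, 0)]                    # clamp at sequence borders
        next_frame = rgb_frames[min(t + 1, len(rgb_frames) - 1)]
        depth_frames.append(depth_net(prev_frame, rgb_frames[t], next_frame))
    return depth_frames
```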
Step 1-3: dividing the acquired RGB frames and Depth frames into num seg Fragments of equal length and then fromRandomly decimating num in each segment f And forming new RGB clips and Depth clips, wherein the Depth clips are divided into clips matched with the RGB clips and non-matched clips.
S2: attention-added feature extraction stage
Step 2-1: and respectively sending the RGB clip and the Depth clip into an RGB feature extraction network and a Depth feature extraction network to obtain RGB features and Depth features. The adopted feature extraction networks are all trained ResNet-50, so that the parameters of the feature extraction networks are fixed and are not transformed in the training process.
In this embodiment, since the difference between the RGB modality and the depth modality is large, a two-stage training method is adopted to acquire the features. First, an RGB sub-model and a depth sub-model are trained; the feature extraction networks for the RGB clip and the Depth clip both take ResNet-50 as the backbone, replacing its last fully connected layer with a new fully connected layer that serves as a classifier. A task is generated as in step S1, and the support set and query set samples are taken as the input of step 2-1. The RGB images and Depth images are sent into the feature extraction network, where the feature information extraction layer is a feature encoder built from a convolutional neural network; it extracts from the input images the rich image features required by the subsequent processing layers, yielding an RGB feature map and a Depth feature map. The RGB sub-model is fine-tuned for 6 epochs of the training phase with the learning rate of ResNet-50 set to lr_1 = 0.00001 and the learning rate of the fully connected layer set to lr_2 = 0.001. For the depth sub-model, the features of the Depth frames are harder to extract, so it is fine-tuned for 60 epochs with both lr_1 and lr_2 set to 0.00001 and reduced by 10% after 30 epochs. The pre-trained sub-models are then used as feature extractors to extract the feature vectors of the RGB clip and the Depth clip respectively.
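A sketch of the described fine-tuning setup is given below; the learning rates and the 10% decay after 30 epochs follow the text, while the choice of the Adam optimizer and the torchvision weight enum are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 70  # training-split classes of UCF101 in this embodiment

# RGB sub-model: pretrained ResNet-50 backbone with a new fully connected classifier.
rgb_model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
rgb_model.fc = nn.Linear(rgb_model.fc.in_features, num_classes)

backbone_params = [p for n, p in rgb_model.named_parameters() if not n.startswith("fc")]
rgb_optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},            # lr_1 for the ResNet-50 backbone
    {"params": rgb_model.fc.parameters(), "lr": 1e-3},  # lr_2 for the new classifier
])

# Depth sub-model (built the same way): both groups use lr = 1e-5 and are
# reduced by 10% after 30 of the 60 fine-tuning epochs, e.g.
# depth_scheduler = torch.optim.lr_scheduler.StepLR(depth_optimizer, step_size=30, gamma=0.9)
```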
Step 2-2: and sending the feature map of the RGB clip obtained by ResNet-50 processing to a CBAM module to obtain the RGB feature containing attention.
In this embodiment, the feature map is sent to the CAM module to generate a channel attention map; the feature map is multiplied by the channel attention map, the result is sent to the SAM module to generate a spatial attention map, and the spatial attention map is in turn multiplied with the weighted feature map to obtain the final attention-weighted feature.
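A sketch of this channel-then-spatial attention, following the published CBAM design, is shown below; the reduction ratio and spatial kernel size are assumptions not stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAMSketch(nn.Module):
    """Sketch of the channel-then-spatial attention applied to the layer4 feature map."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP of the channel attention (CAM)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)  # SAM

    def forward(self, x):
        # Channel attention map, then channel-weighted feature map.
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention map, then spatially weighted feature map.
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)))
        return x * sa
```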
S3: multimodal fusion stage
Step 3-1: at this time, there is a set of matched RGB feature and Depth feature pairs from S2, each pair is from the same sample, and a set of randomly paired RGB feature and Depth feature pairs is added to be sent together as sample features to the DGAdaIN module.
Step 3-2: in the DGAdaIN module, the input is a pair of RGB features and Depth features (matched or unmatched) and the output is a fused feature. The module comprises two parameters for self-adaptive learning Depth feature graphs, and fine adjustment is carried out in the training process.
In this example, since some behaviors such as riding a mountain bike or playing basketball are closely related to the scene, a method that adaptively fuses the depth modality features into the RGB modality features is adopted to strengthen the background information of the features. The DGAdaIN fusion module takes the RGB feature vectors and depth feature vectors extracted in S2 and outputs the fused feature vectors after processing. A batch of module inputs is denoted as x ∈ R^(B×D×L), where B is the batch size, D is the number of frames into which a single video sample is divided, and L is the feature dimension of each frame; γ and β are used to adaptively learn from the depth feature map.
S4: Attention-based feature modulation stage
Step 4-1: the support set sample features are integrated by class.
In this embodiment, the support set sample sets are integrated separately by class labels.
Step 4-2: the channel attention map for each class is calculated.
In this embodiment, the channel attention of each type of sample feature is calculated, and the attention map of each type is obtained to strengthen the support set feature and the query feature.
Step 4-3: and multiplying each class of samples of the support set by the map to perform characteristic enhancement, and strengthening common characteristics in the class.
In this embodiment, the common features within the class are highlighted by attention weighting.
Step 4-4: and carrying out feature class distinction on the query set sample multiplied by map to obtain n query features, and strengthening the difference between classes to ensure that the features are more discriminant.
In this embodiment, the attention weighting is used to obtain the query features corresponding to each class, so as to highlight the differences between classes.
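A sketch of this task-adaptive modulation is given below; `attention_net` stands for whatever module produces the per-class channel attention map, whose architecture the patent does not fix, and the pooled feature shapes are assumptions.

```python
import torch

def task_adaptive_modulation(support_feats, support_labels, query_feats, attention_net):
    """One channel attention map per class, used to re-weight that class's
    support features (intra-class commonality) and every query feature
    (inter-class difference).

    support_feats: (S, L) pooled support features; query_feats: (Q, L);
    attention_net: module mapping an (L,)-dim class feature to an (L,)-dim map.
    """
    prototypes, class_queries = [], []
    for label in sorted(set(support_labels.tolist())):
        mask = support_labels == label
        class_feat = support_feats[mask].mean(dim=0)                  # integrate the class's samples
        attn = torch.sigmoid(attention_net(class_feat))               # channel attention map of the class
        prototypes.append((support_feats[mask] * attn).mean(dim=0))   # enhanced class prototype
        class_queries.append(query_feats * attn)                      # class-specific query features
    return torch.stack(prototypes), torch.stack(class_queries)
```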
S4: classifier classification prediction stage
Step 5-1: and obtaining n modulated class prototype representations and n query sample features from the third stage, and respectively performing similarity calculation according to class matching.
Most current approaches use a similarity measure between the query features and the support set class prototype feature representations. In this embodiment, each support set class prototype feature has its corresponding class-specific query feature, and a distance measure is computed for each such pair.
Step 5-2: and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
Step 5-3: knowing the correct class labels, negative log-loss functions are defined, and parameters in the DGAdaIN module and the Task-specific module are optimized by minimizing the loss functions.
Example 4
Embodiment 4 of the present invention provides a non-transitory computer readable storage medium for storing computer instructions that, when executed by a processor, implement a task-adaptive small sample behavior recognition method, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Example 5
Embodiment 5 of the present invention provides a computer program (product) comprising a computer program for implementing a task-adaptive small sample behavior recognition method when run on one or more processors, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
Example 6
Embodiment 6 of the present invention provides an electronic device, including: a processor, a memory, and a computer program; wherein the processor is coupled to the memory and the computer program is stored in the memory, the processor executing the computer program stored in the memory when the electronic device is running to cause the electronic device to execute instructions for implementing a task-adaptive small sample behavior recognition method, the method comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; and adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of actions of the same class and differences among actions of different classes.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it should be understood that various changes and modifications could be made by one skilled in the art without the need for inventive faculty, which would fall within the scope of the invention.

Claims (6)

1. A task-adaptive small sample behavior recognition method, comprising:
acquiring video data to be identified;
processing the acquired video data to be identified by utilizing a pre-trained identification model to obtain an action category result; adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of the same type of actions and differences among different types of actions;
training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
the Euclidean distance is used as a distance measure to calculate the similarity between paired support set prototype representations and query set features so as to obtain a predicted action category label of the video to be queried;
the sending into the attention layer to generate an attention map comprises: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map;
the task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
2. The task-adaptive small sample behavior recognition method of claim 1, wherein the multi-modal fusion comprises: each matched pair of RGB and Depth features comes from the same sample; a group of randomly paired RGB feature and Depth feature pairs is added, and both are jointly sent as sample features to the DGAdaIN module; in the DGAdaIN module, the input is a pair of RGB features and depth features, and the output is a fused feature.
3. The task-adaptive small sample behavior recognition method according to claim 1, wherein the method comprises the steps of obtaining N modulated class prototype representations and N query sample features, and performing similarity calculation according to class matching; and for each pair of prototype representations of the classes and the query feature representations, calculating the similarity between the feature vector of the video to be queried and the prototype representations of all the classes by adopting Euclidean distance as a distance measure so as to obtain the predicted action class label of the video to be queried.
4. A task-adaptive small sample behavior recognition system, comprising:
the acquisition module is used for acquiring video data to be identified;
the recognition module is used for processing the acquired video data to be recognized by utilizing a pre-trained recognition model to obtain an action type result; adding an attention layer into the pre-trained recognition model, extracting position information and image content information of a behavior main body in a picture frame, fusing RGB features and Depth features, modulating the extracted features through an attention mechanism, and obtaining intra-class feature commonalities of the same type of actions and differences among different types of actions;
training the recognition model includes:
framing the prepared video data set to obtain an RGB image set and a depth image set;
sending the RGB image set and the depth image set into ResNet-50 to extract features, removing the final average pooling layer and fully connected layer of the ResNet-50, and sending the extracted feature maps into an attention layer to generate attention maps;
carrying out multi-mode fusion on a pair of RGB features and depth features to obtain a reinforced feature representation;
performing feature enhancement on the support set class features and the query set features by utilizing the support set class features;
the Euclidean distance is used as a distance measure to calculate the similarity between paired support set prototype representations and query set features so as to obtain a predicted action category label of the video to be queried;
the sending into the attention layer to generate an attention map comprises: removing the last average pooling layer and fully connected layer of the ResNet-50; sending the feature map output by layer4 of ResNet-50 to a CAM module to generate a channel attention map; multiplying the channel attention map with the feature map to obtain a channel-attention-weighted feature map; sending the weighted feature map to a SAM module to generate a spatial attention map; and multiplying the spatial attention map with the weighted feature map to obtain a spatial-attention-weighted feature map;
the task adaptation process is as follows: integrating the sample features of the support set by class; calculating the channel attention map of each class; multiplying each class of support set samples by its map for feature enhancement, strengthening the common features within the class; and multiplying the query set samples by each class map to obtain N class-specific query features, strengthening the differences between classes so that the features are more discriminative.
5. A non-transitory computer readable storage medium storing computer instructions which, when executed by a processor, implement the task-adaptive small sample behavior recognition method of any one of claims 1-3.
6. An electronic device, comprising: a processor, a memory, and a computer program; wherein the processor is connected to the memory, and wherein the computer program is stored in the memory, said processor executing the computer program stored in said memory when the electronic device is running, to cause the electronic device to execute instructions implementing the task-adaptive small sample behavior recognition method according to any of claims 1-3.
CN202210815080.0A 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system Active CN115240106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210815080.0A CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210815080.0A CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Publications (2)

Publication Number Publication Date
CN115240106A CN115240106A (en) 2022-10-25
CN115240106B true CN115240106B (en) 2023-06-20

Family

ID=83673337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210815080.0A Active CN115240106B (en) 2022-07-12 2022-07-12 Task self-adaptive small sample behavior recognition method and system

Country Status (1)

Country Link
CN (1) CN115240106B (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596256B (en) * 2018-04-26 2022-04-01 北京航空航天大学青岛研究院 Object recognition classifier construction method based on RGB-D
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
CN113807176B (en) * 2021-08-13 2024-02-20 句容市紫薇草堂文化科技有限公司 Small sample video behavior recognition method based on multi-knowledge fusion
CN114187546A (en) * 2021-12-01 2022-03-15 山东大学 Combined action recognition method and system

Also Published As

Publication number Publication date
CN115240106A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
WO2021232969A1 (en) Action recognition method and apparatus, and device and storage medium
EP2568429A1 (en) Method and system for pushing individual advertisement based on user interest learning
CN110839173A (en) Music matching method, device, terminal and storage medium
CN111046821B (en) Video behavior recognition method and system and electronic equipment
US20230353828A1 (en) Model-based data processing method and apparatus
CN109871736B (en) Method and device for generating natural language description information
CN111783712A (en) Video processing method, device, equipment and medium
CN110198482B (en) Video key bridge segment marking method, terminal and storage medium
CN111143617A (en) Automatic generation method and system for picture or video text description
CN110619284A (en) Video scene division method, device, equipment and medium
CN113766299A (en) Video data playing method, device, equipment and medium
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN114996495A (en) Single-sample image segmentation method and device based on multiple prototypes and iterative enhancement
KR20210047467A (en) Method and System for Auto Multiple Image Captioning
KR20210011707A (en) A CNN-based Scene classifier with attention model for scene recognition in video
CN115240106B (en) Task self-adaptive small sample behavior recognition method and system
CN116977457A (en) Data processing method, device and computer readable storage medium
CN115222838A (en) Video generation method, device, electronic equipment and medium
CN114863249A (en) Video target detection and domain adaptation method based on motion characteristics and appearance characteristics
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113704544A (en) Video classification method and device, electronic equipment and storage medium
CN111291602A (en) Video detection method and device, electronic equipment and computer readable storage medium
Dai et al. Video Based Action Recognition Using Spatial and Temporal Feature
CN111325068A (en) Video description method and device based on convolutional neural network
CN117575894B (en) Image generation method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant