CN117274656B - Multi-mode model countermeasure training method based on self-adaptive depth supervision module - Google Patents

Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Info

Publication number
CN117274656B
CN117274656B
Authority
CN
China
Prior art keywords
representing
model
data
modal
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310660598.6A
Other languages
Chinese (zh)
Other versions
CN117274656A (en)
Inventor
侯永宏
刘超
刘鑫
岳焕景
杨敬钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310660598.6A priority Critical patent/CN117274656B/en
Publication of CN117274656A publication Critical patent/CN117274656A/en
Application granted granted Critical
Publication of CN117274656B publication Critical patent/CN117274656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-mode model countermeasure training method based on a self-adaptive depth supervision module, belonging to the technical field of model countermeasure training methods. The method mainly comprises the following three parts: 1) I3D network: the backbone network for processing video data, composed mainly of 3D convolutions; 2) HCN network: the backbone network for extracting features from human skeleton sequence data, composed mainly of 2D convolution layers; 3) Adaptive Deep Supervision Module (ADSM): RGB video and human skeleton key points are used as multi-modal inputs, and the attention layers in the ADSM autonomously learn the more robust modal features, thereby enhancing the robustness of the multi-modal network.

Description

Multi-mode model countermeasure training method based on self-adaptive depth supervision module
Technical Field
The invention relates to the technical field of model countermeasure training methods, in particular to a multimode model countermeasure training method based on a self-adaptive depth supervision module.
Background
The widespread use of deep learning in the real world places extremely high demands on the robustness of the model. For example, in the fields of medical diagnosis, automatic driving, and the like, the deep learning model must be able to properly cope with various situations and cannot cause erroneous prediction due to a slight disturbance or malicious attack of input data. Thus, studying the robustness of deep learning models has become an important direction of study. However, previous work has focused mainly on robustness studies on a single visual or video classification task. As multimodal tasks develop and apply, robustness research of multimodal models has become particularly urgent. Therefore, further investigation of the robustness of the multi-modal models is needed to ensure that they can cope correctly with various uncertainties, and thus are more reliable and safer in practical applications.
At present, one of the main methods for improving model robustness is adversarial training (AT). The core idea of adversarial training is to introduce adversarial samples into the model training process so that the deep learning model becomes more robust and can better cope with various disturbances and attacks. The adversarial samples are generated from the original data by an adversarial attack algorithm such as FGSM; by incorporating them into the training process, more robust feature representations and decision rules can be learned. AT was first proposed by Madry et al. Increasing the robustness of the model can make it more reliable and safe, but it may reduce the classification accuracy of the model on the original data. Therefore, a balance point needs to be found between robustness and accuracy, and Zhang et al. propose the TRADES method. TRADES uses the KL divergence to regularize the outputs of the adversarial samples and the original data; in this way, the model keeps high classification accuracy while obtaining higher robustness. Bai et al. propose a novel adversarial defense method known as CAS. Unlike conventional adversarial training, CAS starts from the perspective of channel activation and dynamically learns channel importance weights, which are used to suppress the activation of redundant channels when the model is attacked by adversarial samples. Such methods are expected to provide a more effective and reliable guarantee for the robustness of deep learning models.
In the field of video understanding and human motion recognition, there are many mature models and algorithms. From the original Two-Stream CNN to the currently popular Transformer, various methods have emerged and been continuously optimized. In recent years, human motion recognition based on human skeleton data has made rapid progress. Skeleton data has some advantages over RGB data, such as: 1. Low dimensionality and high efficiency: skeleton data only contains the position information of key points of the human body, so the data size is small and learning and processing are efficient. 2. Insensitivity to the background: compared with RGB data, skeleton data does not contain environmental and background information, so the features of the action itself are more easily captured without being affected by the background. Thus, more and more methods have begun to explore how to combine the two modalities of RGB and skeleton data for human motion recognition. However, in past studies, multi-modal models often focused only on performance on the original clean dataset and paid little attention to the robustness of the multi-modal model. In fact, the robustness problem in this field still faces great challenges and remains to be solved.
With the continuous development of machine learning and artificial intelligence technology, multi-modal techniques have become a widely studied research field and have been successfully applied in intelligent interaction, medical care, video understanding and other fields. Multi-modal techniques have great advantages over single-modal techniques. However, most current research does not take into account the robustness of the multi-modal model, for example how the overall performance differs from that of a single-modal model when one modality suffers an adversarial attack, and how the robustness of the multi-modal model can be improved. In addition, most previous research improves the overall robustness of the model based on the robustness of single-modal models, and few studies target multi-modal models.
In order to solve the above problems, the present invention proposes a multimodal model countermeasure training method based on an adaptive deep supervision module.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to provide a multi-mode model countermeasure training method based on a self-adaptive deep supervision module so as to solve the following problems:
1) How does the robustness of a multi-modal model differ from that of a single-modal model? Because of the interactions between the modalities of a multi-modal model, does the model behave differently from a single-modal model after one modality is attacked, and where does the difference lie? Can multi-modal fusion improve the overall robustness?
2) How robust are the different modalities against adversarial attacks? The robustness of the RGB video modality and the skeleton modality is not the same, and the skeleton modality is better from the perspective of its data characteristics; how can this difference be demonstrated experimentally?
3) How can the robustness of the multi-modal model be improved more effectively? Given that few current approaches address multi-modal model robustness, how can an improvement in multi-modal robustness be achieved efficiently?
2. Technical proposal
In order to achieve the above purpose, the present invention provides the following technical solutions:
the multi-modal model countermeasure training method based on the self-adaptive depth supervision module comprises the following steps:
s1, generating a skeleton data set according to video data;
s2, carrying out normalization operation on the skeleton data set obtained in the S1, and sampling a certain number of framesTThe defects areTSupplementing 0 of the frame;
s3, uniformly sampling the video frames to obtain RGB pictures for training;
s4, cutting the RGB picture obtained in the S3, and adjusting the size to obtain an RGB data set;
s5, the RGB data set obtained in the S4 is in one-to-one correspondence with the skeleton data set in the S2, and the RGB data set is stored in a dictionary form to obtain a final integrated multi-mode data set;
s6, training a single-mode model HCN for processing skeleton data on a skeleton data set to obtain pre-training parameters of the model;
s7, training a single-mode model I3D for processing the picture data on the RGB data set to obtain pre-training parameters of the model;
s8, designing a self-adaptive depth supervision module (Adaptive Deep Supervision Module, ADSM), integrating the self-adaptive depth supervision module and the single-mode model obtained in S6 and S7 into a multi-mode model, and loading the pre-training parameters obtained in S6 and S7;
s9, taking the multi-modal data set obtained in the S5 as clean data input, and obtaining a multi-modal countermeasure sample by utilizing a PGD algorithm;
s10, inputting the countermeasure sample obtained in the S9 into the multi-modal model obtained in the S8 as training data, and performing forward propagation;
s11, obtaining a prediction result of the model through forward propagation, and obtaining a new objective function based on calculation of the prediction result so as to finish updating of the weight parameters of the model once.
Preferably, S1 specifically includes the following:
detecting and extracting the key points of the human skeleton in the video data by using the open-source human pose estimation library OpenPose to generate a skeleton data set; the information of each key point is a three-dimensional coordinate point (x, y, z), where x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1]; there are at most two people simultaneously in each video.
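For illustration only, the following sketch (not part of the claimed method) shows one way the per-frame OpenPose output could be collected into the (x, y, z) key-point format described above. It assumes OpenPose's per-frame JSON files containing a "people" list with a flat "pose_keypoints_2d" array of (x, y, confidence) triples; the file-naming pattern and the helper name load_skeleton_sequence are hypothetical.

```python
# Illustrative sketch: collect per-frame OpenPose JSON output into an array of
# shape (T, 3, V, M), where channels are (x, y, confidence).
import json
from pathlib import Path

import numpy as np


def load_skeleton_sequence(frame_dir: str, num_joints: int = 15, max_people: int = 2) -> np.ndarray:
    """Return an array of shape (T, 3, V, M) for one video."""
    files = sorted(Path(frame_dir).glob("*_keypoints.json"))  # hypothetical naming
    seq = np.zeros((len(files), 3, num_joints, max_people), dtype=np.float32)
    for t, f in enumerate(files):
        people = json.loads(f.read_text()).get("people", [])[:max_people]
        for m, person in enumerate(people):
            kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
            kp = kp[:num_joints]                 # keep the first V joints used by the method
            seq[t, :, : kp.shape[0], m] = kp.T   # rows: x, y, confidence
    return seq
```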
Preferably, the normalization operation in S2 specifically includes the following:
for the x and y dimensions, taking the maximum absolute values |x|_max and |y|_max, and dividing each value in the x and y dimensions by the corresponding maximum, so that the x and y values of the resulting skeleton data are all distributed in [-1, 1].
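A minimal sketch of this preprocessing, under the assumption that a skeleton sample is stored as an array of shape (T0, 3, V, M) with channels ordered (x, y, confidence); the helper name and the default target length T = 300 (used in Example 1) are illustrative.

```python
# Illustrative sketch: max-abs normalization of x and y, followed by zero-padding
# (or truncation) of the sequence to a fixed length T.
import numpy as np


def normalize_and_pad(skeleton: np.ndarray, target_frames: int = 300) -> np.ndarray:
    out = skeleton.astype(np.float32).copy()
    for c in range(2):                       # channel 0 = x, channel 1 = y
        max_abs = np.abs(out[:, c]).max()
        if max_abs > 0:
            out[:, c] /= max_abs             # values now fall in [-1, 1]
    t0 = out.shape[0]
    if t0 >= target_frames:
        return out[:target_frames]
    pad = np.zeros((target_frames - t0, *out.shape[1:]), dtype=np.float32)
    return np.concatenate([out, pad], axis=0)
```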
Preferably, the RGB input of the multi-modal dataset in S5 is a tensor x_RGB of size B×T×C×H×W, where B represents the batch size; T represents the number of frames; C represents the number of RGB channels; H and W represent the height and width of the image, respectively;
the skeleton input of the multi-modal dataset in S5 is a tensor x_ske of size B×T×C×V×M, where B represents the batch size; T represents the number of frames; C represents the coordinate dimension of the skeleton; V represents the number of skeleton key points; M represents the number of people in the video.
Preferably, the single-mode model HCN in S6 is used for extracting features of data, and specifically includes the following contents:
s6.1, subtracting the front frame and the rear frame of the skeleton key point data to obtain motion information on the time domain, wherein the size isWherein, the method comprises the steps of, wherein,Brepresenting the size of the batch size;Trepresenting the number of frames;Crepresenting the coordinate dimensions of the bone;Vrepresenting bone key points;Mrepresenting the number of people in the video;
s6.2, extracting characteristics of motion data and key point data through respective branch networks, wherein each branch network comprises four layers of convolution layers;
s6.3, splicing the two features extracted in the S6.2 together, and then inputting the two features into the two convolution layers and the two full-connection layers to obtain a final prediction result with the size ofB×NWherein, the method comprises the steps of, wherein,Nrepresenting the number of categories; and saving model parameters after training is completed.
Preferably, the single-mode model I3D in S7 is used for extracting video features, and specifically includes the following:
the I3D comprises 4 blocks, and each block is composed of several 3D convolution modules; the output of the last convolution layer is processed by a 3D pooling layer, and the resulting vector is input into a fully connected layer to obtain a final prediction result of size B×N, where N represents the number of categories; and the model parameters are saved after training is completed.
Preferably, the design of the adaptive depth supervision module in S8 specifically includes the following:
S8.1, taking out the intermediate features of the video and the skeleton from the hidden layers of the two single-mode networks, denoted respectively as:
F_RGB, F_ske;
S8.2, creating two attention weight matrices according to the sizes of the features, specifically expressed as:
W_RGB, W_ske;
S8.3, repeating the two attention weight matrices obtained in S8.2 B times in the batch dimension, and multiplying them with the original features to obtain:
F'_RGB = W_RGB ⊙ F_RGB, F'_ske = W_ske ⊙ F_ske, where ⊙ denotes element-wise multiplication; the features after attention weight adjustment are denoted F'_RGB and F'_ske;
S8.4, respectively inputting the attention-weight-adjusted features of S8.3 into a 3D pooling layer and a 2D pooling layer, and obtaining, after dimension adjustment, the vectors v_RGB and v_ske;
S8.5, splicing the dimension-adjusted features obtained in S8.4 together to obtain v = [v_RGB; v_ske];
S8.6, passing the feature obtained in S8.5 through a linear layer (Linear) to obtain the auxiliary classification prediction result y_aux.
The above is the structural design of the adaptive deep supervision module. The module finally returns three results, namely the weight-adjusted features of the two modalities, F'_RGB and F'_ske, and the auxiliary classification prediction y_aux. The adjusted features continue to propagate forward, while y_aux is used to calculate the loss function. In the invention, an adaptive deep supervision module is inserted between the 2nd, 3rd and 4th blocks of the single-mode model I3D and the 5th, 6th and 7th convolution (or linear) layers of the single-mode model HCN, so that the required multi-modal network is obtained. Different numbers of modules can be added flexibly as needed. The pre-training parameters obtained in S6 and S7 are then loaded.
Preferably, S9 specifically includes the following:
f_θ denotes the multi-modal model in S8, where θ denotes the model parameters; {x_RGB, x_ske} denotes the multi-modal input, {x_RGB + δ_RGB, x_ske + δ_ske} the corresponding adversarial sample, and y the true label; the goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input;
under the multi-modal input, the objective function of performing a multi-modal attack on f_θ is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation;
a PGD algorithm is adopted to obtain the adversarial sample; taking single-mode data as an example, the update formula is:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations.
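A minimal sketch of generating the multi-modal adversarial sample with PGD as in S9. It assumes model(x_rgb, x_ske) returns classification logits, an L∞ constraint, and step size, radii and number of steps supplied by the caller; none of these names come from the patent.

```python
# Illustrative sketch of multi-modal PGD (PGD10 when steps=10).
import torch
import torch.nn.functional as F


def pgd_multimodal(model, x_rgb, x_ske, y, eps_rgb, eps_ske, alpha, steps=10):
    adv_rgb, adv_ske = x_rgb.clone().detach(), x_ske.clone().detach()
    for _ in range(steps):
        adv_rgb.requires_grad_(True)
        adv_ske.requires_grad_(True)
        loss = F.cross_entropy(model(adv_rgb, adv_ske), y)
        g_rgb, g_ske = torch.autograd.grad(loss, [adv_rgb, adv_ske])
        with torch.no_grad():
            # Ascend the loss, then project back into the epsilon-ball around x^(0).
            adv_rgb = adv_rgb + alpha * g_rgb.sign()
            adv_rgb = x_rgb + torch.clamp(adv_rgb - x_rgb, -eps_rgb, eps_rgb)
            adv_ske = adv_ske + alpha * g_ske.sign()
            adv_ske = x_ske + torch.clamp(adv_ske - x_ske, -eps_ske, eps_ske)
    return adv_rgb.detach(), adv_ske.detach()
```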
Preferably, S11 specifically includes the following:
based on S10, the auxiliary classification prediction results obtained in S8 and the final prediction result obtained by forward propagation in S10 are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms;
after the final loss function is obtained, back propagation is carried out using it, and the model parameters are optimized.
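A minimal sketch of the combined objective, following the description above: a cross-entropy term on the adversarial final output, the auxiliary losses of the J modules computed on the adversarial sample, and a clean-data term weighted by λ. The equal weighting of the auxiliary terms is an assumption.

```python
# Illustrative sketch of the combined robust + clean objective.
import torch.nn.functional as F


def total_loss(adv_logits, aux_logits_list, clean_logits, y, lam=1.0):
    robust = F.cross_entropy(adv_logits, y)
    robust = robust + sum(F.cross_entropy(aux, y) for aux in aux_logits_list)
    clean = F.cross_entropy(clean_logits, y)
    return robust + lam * clean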
3. Advantageous effects
(1) The invention provides an Adaptive Deep Supervision Module (ADSM), which adopts RGB video and human skeleton key points as multi-modal inputs and uses the attention layers in the ADSM to autonomously learn the more robust modal features, thereby enhancing the robustness of the multi-modal network. To our knowledge there is little research on multi-modal model robustness, and this is the first method to improve multi-modal model robustness on the video action recognition task.
(2) The invention designs a new module, namely a self-adaptive depth supervision module, which comprises two attention weight layers, wherein the modal features with larger classification contributions can obtain larger weights, and the model can autonomously learn more robust data features during training, so that the overall robustness is improved.
(3) The invention designs a new loss function to optimize the parameters of the multi-modal model. The new loss function contains the auxiliary classification outputs of several intermediate ADSMs, forcing the model to learn more robust features at its intermediate levels. Meanwhile, in order not to reduce the accuracy of the model on the clean data set, the loss function also adds a classification loss on the clean data, and a hyper-parameter is used to adjust the balance between the two loss terms.
(4) The Adaptive Deep Supervision Module (ADSM) designed by the invention is a plug-and-play module that can be conveniently combined with other multi-modal models to improve their robustness.
Drawings
FIG. 1 is a flow chart of a design framework of a multi-modal model countermeasure training method based on an adaptive deep supervision module;
fig. 2 is a schematic structural diagram of an Adaptive Deep Supervision Module (ADSM) according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
In recent years, improving model robustness has become a hotspot in the field of deep learning, and many methods have been proposed on conventional image classification tasks, such as adversarial training (AT), TRADES and FAT. Today, with the great success of multi-modal techniques in medical health, human-computer interaction, video understanding and other fields, multi-modal methods exhibit advantages that single-modal models do not possess. For example, in a video action recognition task, using only RGB video data performs worse than a multi-modal method that uses RGB video together with human skeleton data. However, in earlier studies, methods for improving model robustness were developed on single-modal models, and few studies targeted multi-modal models. The invention provides a multi-mode model countermeasure training method based on a self-adaptive depth supervision module, supported by the National Natural Science Foundation of China project "Human micro-gesture recognition and emotion analysis based on self-supervised learning" (No. 62171309), and is described below with reference to specific examples and the drawings.
Example 1:
Referring to FIG. 1, the invention comprehensively considers the influence of the two modalities, RGB video and skeleton, on robustness, and provides a new multi-mode model countermeasure training method based on a self-adaptive depth supervision module, which specifically comprises the following steps:
Step 1: generating a skeleton dataset from the video data: the key points of the human skeleton in the video data are detected and extracted by using the open-source human pose estimation library OpenPose. For the iMiGUE dataset used in the invention, 15 skeleton key points are generated for each person; the information of each key point is a three-dimensional coordinate point (x, y, z), where x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1], and there are at most two people simultaneously in each video.
Step 2: normalizing the skeleton dataset and sampling a fixed number of frames T = 300; sequences with fewer than 300 frames are padded with 0.
The normalization step is: for the x and y dimensions, take the maximum absolute values |x|_max and |y|_max and divide each value in the x and y dimensions by the corresponding maximum. The x and y values of the skeleton data thus obtained are distributed in [-1, 1]. This step may be omitted if the skeleton data obtained by other methods has already been normalized.
Step 3: sampling all video data uniformly, i.e. sampling T = 8 frames per video at equal intervals, to obtain the RGB pictures used for training.
Step 4: the RGB pictures are cropped and scaled to a size of 256×256.
Step 5: the RGB data obtained above are put in one-to-one correspondence with the skeletons and stored in dictionary form, so that the final integrated multi-modal dataset is obtained.
The RGB input of the multi-modal dataset is a tensor of size B×T×C×H×W, where B is the batch size, T = 8 is the number of frames, C = 3 is the number of RGB channels, and H = W = 256 is the height and width of the image.
The skeleton input of the multi-modal dataset is a tensor of size B×T×C×V×M, where B is the batch size, T = 300 is the number of frames, C = 3 is the coordinate dimension of the skeleton, V = 15 is the number of skeleton key points, and M = 1 is the number of people in the video.
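As an illustration of step 5, a sketch of a dataset that pairs each RGB clip with its skeleton sequence and returns them as a dictionary, using the tensor shapes of this embodiment (8 RGB frames of 3×256×256 and 300 skeleton frames with V = 15, M = 1). The class name and the in-memory storage are assumptions, not part of the patent.

```python
# Illustrative sketch: dictionary-style multi-modal dataset (step 5).
import numpy as np
import torch
from torch.utils.data import Dataset


class MultiModalDataset(Dataset):
    def __init__(self, rgb_clips, skeletons, labels):
        # rgb_clips: list of arrays (8, 3, 256, 256); skeletons: list of arrays
        # (300, 3, 15, 1); labels: list of ints. The lists are index-aligned.
        self.rgb_clips, self.skeletons, self.labels = rgb_clips, skeletons, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "rgb": torch.from_numpy(self.rgb_clips[idx]).float(),
            "skeleton": torch.from_numpy(self.skeletons[idx]).float(),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```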
Step 6: training the single-mode model HCN for processing skeleton data on the skeleton dataset to obtain the pre-training parameters of the model.
The HCN is used to extract the features of the skeleton data. First, adjacent frames of the skeleton key-point data are subtracted to obtain the motion information in the time domain, with size B×T×C×V×M. The motion data and the key-point data are then passed through their respective branch networks to extract features, each branch having four convolution layers. The two extracted features are spliced together and fed into two further convolution layers and two fully connected layers to obtain the final prediction result of size B×N, where N is the number of categories. The model parameters are saved after training is completed.
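A rough, illustrative sketch of an HCN-style skeleton branch as described above: the temporal difference of the key points forms the motion stream, each stream passes through four 2D convolution layers over the (frame, joint) plane, the two streams are concatenated and processed by two more convolution layers, and fully connected layers produce the class scores. Channel widths and kernel sizes are assumptions, not the patented configuration.

```python
# Illustrative HCN-style skeleton classifier (not the patented network).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_branch(in_ch):
    # Four 2D convolution layers per branch, as described in S6.2 / step 6.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    )


class HCNLike(nn.Module):
    def __init__(self, num_classes, coord_dim=3):
        super().__init__()
        self.skeleton_branch = conv_branch(coord_dim)
        self.motion_branch = conv_branch(coord_dim)
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x):
        # x: (B, T, C, V, M); keep a single person (M=1) and treat (T, V) as a plane.
        x = x[..., 0].permute(0, 2, 1, 3)                    # (B, C, T, V)
        motion = x[:, :, 1:] - x[:, :, :-1]                  # frame difference
        motion = F.pad(motion, (0, 0, 0, 1))                 # pad back to T frames
        feats = torch.cat([self.skeleton_branch(x), self.motion_branch(motion)], dim=1)
        return self.classifier(self.fuse(feats).flatten(1))  # (B, N)
```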
Step 7: training the single-mode model I3D for processing picture data on the RGB dataset to obtain the pre-training parameters of the model.
The I3D contains 4 blocks, each block being composed of several 3D convolution modules, with 48 3D convolution layers in total for extracting video features. The output of the final convolution layer has size B×2048×8×8×8; it passes through a 3D pooling layer to obtain a vector of size B×2048, which is input into the fully connected layer to obtain the final prediction result of size B×N, where N is the number of categories. The model parameters are saved after training is completed.
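The sketch below is a drastically simplified stand-in for the I3D branch, shown only to make the data flow concrete: four 3D-convolution stages that reduce the spatial resolution, a 3D pooling layer yielding a (B, 2048) vector, and a fully connected classifier producing a (B, N) output. The real I3D uses many more Bottleneck3D layers; only the channel widths follow the feature sizes quoted in this description.

```python
# Illustrative, heavily simplified 3D-convolutional video branch (not I3D itself).
import torch
import torch.nn as nn


class TinyI3D(nn.Module):
    def __init__(self, num_classes, in_ch=3):
        super().__init__()
        widths = [256, 512, 1024, 2048]
        blocks, ch = [], in_ch
        for w in widths:
            blocks += [
                nn.Conv3d(ch, w, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(w),
                nn.ReLU(inplace=True),
            ]
            ch = w
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        # x: (B, T, C, H, W) -> (B, C, T, H, W) for 3D convolution.
        x = x.permute(0, 2, 1, 3, 4)
        feats = self.blocks(x)                        # (B, 2048, T, H/16, W/16)
        return self.fc(self.pool(feats).flatten(1))   # (B, N)
```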
Step 8: integrating the single-mode model in the steps 6 and 7 and the self-adaptive depth supervision module (Adaptive Deep Supervision Module, ADSM) designed by the invention into a multi-mode model, and loading the pre-training parameters obtained in the steps 6 and 7;
the Adaptive Depth Supervision Module (ADSM) described above is the core design of the present invention. We find in experiments that bone modality data is more robust against attacks (conforms to the characteristics of bone data itself), but the data volume of bone data is smaller than that of video data. Therefore, the weight of the two data is adjusted as much as possible in the training process, so that the model learns more robust data distribution.
Specifically, we take the intermediate features of video and bone from the hidden layers of two single-mode networks, respectively,/>. Two attention weight matrices are created according to the feature size>,/>. Two attention weight matrices repeat in the batch size dimensionBAnd multiplying the original characteristics to obtain:,/>wherein->Representing multiplication by element. NowThe characteristics after the attention weight adjustment are respectively input into a 3D pooling layer and a 2D pooling layer, and the dimension is adjusted to obtainThen spliced together, i.e.)>. Finally, a Linear layer (Linear) is added to obtain the auxiliary classification prediction result +.>. The above is the structure of the adaptive depth supervision module. In this module, three results are finally returned, namely the characteristics of the two modes after weight adjustment +.>And predictive outcome of auxiliary classification +.>. The adjusted feature continues to propagate forward, < >>For calculating the loss function.
In the invention, the modules are inserted between the 2 nd, 3 rd, 4 th blocks and the 5 th, 6 th and 7 th convolution (or linear) layers of the HCN of the I3D, so that the needed multi-mode network can be obtained. Different numbers of the modules can be flexibly added according to the needs. The pre-training parameters obtained in steps 6 and 7 are then loaded.
Step 9: taking the multi-modal dataset obtained in step 5 as the clean data input, a multi-modal adversarial sample is obtained by using the PGD10 algorithm; the specific contents are as follows:
f_θ denotes the multi-modal network in step 8, where θ are the model parameters; {x_RGB, x_ske} is the multi-modal input and y is the true label. The goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input, in the present invention the RGB data and the skeleton data {x_RGB, x_ske}. In order to make the trained multi-modal model f_θ produce incorrect predictions while keeping the corresponding perturbations as imperceptible as possible, the objective function of a multi-modal attack on f_θ under the multi-modal input is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation. This formula describes what an adversarial sample is; to obtain the adversarial sample we use the common PGD algorithm, taking single-mode data as an example:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations, which is 10 in the present invention, i.e. PGD10. The adversarial sample is obtained with the above formula.
Step 10: the adversarial samples are input into the multi-modal model as training data for forward propagation.
Step 11: the prediction result of the model is obtained through forward propagation, and the new objective function designed by the invention is calculated to complete one update of the model weight parameters.
The new objective function specifically refers to combining the auxiliary classification predictions obtained in step 8 with the final prediction of the network as the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms, which can be tuned for different datasets.
After the final loss function is obtained, the method can be used for back propagation and optimization of model parameters.
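Putting steps 9-11 together, the following sketch shows one training update, reusing the pgd_multimodal and total_loss sketches given earlier. It assumes the multi-modal model returns the final logits together with a list of auxiliary ADSM predictions; the optimizer, batch layout and hyper-parameters are illustrative.

```python
# Illustrative sketch of one adversarial-training update (steps 9-11).
def train_step(model, optimizer, batch, eps_rgb, eps_ske, alpha, lam):
    x_rgb, x_ske, y = batch["rgb"], batch["skeleton"], batch["label"]
    # Step 9: build the multi-modal adversarial sample with PGD10.
    adv_rgb, adv_ske = pgd_multimodal(
        lambda a, b: model(a, b)[0], x_rgb, x_ske, y, eps_rgb, eps_ske, alpha, steps=10
    )
    model.train()
    optimizer.zero_grad()
    # Step 10: forward propagation of the adversarial sample (and of the clean data).
    adv_logits, aux_list = model(adv_rgb, adv_ske)
    clean_logits, _ = model(x_rgb, x_ske)
    # Step 11: combined objective and one parameter update.
    loss = total_loss(adv_logits, aux_list, clean_logits, y, lam=lam)
    loss.backward()
    optimizer.step()
    return loss.item()
```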
Example 2:
Example 2 is based on Example 1 but differs in the details below. Referring to FIG. 1, the overall flow proposed by the present invention can be divided into three major parts:
1) I3D network. As the backbone network for processing video data, I3D is mainly composed of 3D convolutions. The I3D used in the invention contains four 3D convolution blocks, and the blocks contain 3, 4, 6 and 3 basic 3D convolution modules (Bottleneck3D), respectively. A Bottleneck3D module contains three 3D convolution layers: the first and third layers have a convolution kernel size of 1×1×1 and a stride of 1×1×1, and the second layer has a convolution kernel size of 3×3×3, a stride of 1×2×2 and a padding of 1×1×1. After the first Bottleneck3D of each block there is a downsampling layer, which consists of a 3D convolution layer with a kernel size of 1×1×1 and a stride of 1×2×2, followed by a 3D BatchNorm layer. In total there are 53 convolution layers, including one 2D convolution layer at the beginning of the I3D network. The input video data has size B×T×C×H×W, where B is the batch size, T = 8 is the number of frames, C = 3 is the number of RGB channels, and H = W = 256 is the height and width of the image. After each block, the output feature sizes are B×256×8×64×64, B×512×8×32×32, B×1024×8×16×16 and B×2048×8×8×8, respectively. Finally, the feature passes through a 3D pooling layer and is flattened into a vector of size B×2048, and the final classification prediction is obtained through a fully connected (FC) layer.
2) HCN network. The HCN network is the backbone network for extracting features from human skeleton sequence data and mainly consists of 2D convolution layers. The input human skeleton data X_ske has size B×T×C×V×M, where B is the batch size, T = 300 is the number of frames, C = 3 is the coordinate dimension of the skeleton, V = 15 is the number of skeleton key points, and M = 1 is the number of people in the video (for the NTU dataset, M = 2 and V = 17). Because 2D convolution cannot extract temporal information by itself, motion features representing the temporal information are extracted in advance: the skeleton-point information of the previous frame is subtracted from that of the next frame to obtain the motion information of the skeleton key points between two frames, X_motion. The difference calculation reduces the temporal length by 1, i.e. to T-1, so the motion tensor is readjusted to size B×T×C×V×M by an interpolation function. Given X_ske and X_motion, the M dimension is used as an index to traverse each person; for each person, X_ske and X_motion are passed through four 2D convolution layers respectively to extract spatial and temporal features, which are then spliced in the C dimension to obtain features with spatio-temporal information, and the features are further extracted by two more convolution layers. Then, for the spatio-temporal features of each person, the maximum value is taken as the classification input, which is finally fed into two fully connected layers to obtain the final prediction result.
3) ADSM: this module is one of the cores of the invention. Specifically, we take the intermediate features of the video and the skeleton from the hidden layers of the two single-mode networks, denoted F_RGB and F_ske. Two attention weight matrices W_RGB and W_ske are created according to the feature sizes. The two attention weight matrices are repeated B times in the batch dimension and multiplied with the original features, giving F'_RGB = W_RGB ⊙ F_RGB and F'_ske = W_ske ⊙ F_ske, where ⊙ denotes element-wise multiplication. The attention-weight-adjusted features are then input into a 3D pooling layer and a 2D pooling layer respectively and, after dimension adjustment, yield the vectors v_RGB and v_ske, which are spliced together into v = [v_RGB; v_ske]. Finally, a linear layer (Linear) produces the auxiliary classification prediction y_aux. The above is the structure of the adaptive deep supervision module. The module returns three results, namely the weight-adjusted features of the two modalities, F'_RGB and F'_ske, and the auxiliary classification prediction y_aux. The adjusted features continue to propagate forward, while y_aux is used to calculate the loss function.
The invention provides a new loss function for optimizing the parameters of the multi-modal model, thereby balancing robustness and accuracy. Specifically, the auxiliary classification predictions produced by the adversarial sample through the network are combined with the final prediction of the network as the robust loss, and the final prediction of the clean data through the network is used as the clean loss; the two are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms, which can be tuned for different datasets.
After the final loss function is obtained, the method can be used for back propagation and optimization of model parameters.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution of the present invention and its inventive concept, within the technical scope disclosed by the present invention, shall fall within the scope of protection of the present invention.

Claims (8)

1. The multi-mode model countermeasure training method based on the self-adaptive depth supervision module is characterized by comprising the following steps of:
s1, generating a skeleton data set according to video data;
s2, carrying out normalization operation on the skeleton data set obtained in the S1, and sampling a certain number of framesTThe defects areTSupplementing 0 of the frame;
s3, uniformly sampling the video frames to obtain RGB pictures for training;
s4, cutting the RGB picture obtained in the S3, and adjusting the size to obtain an RGB data set;
s5, the RGB data set obtained in the S4 is in one-to-one correspondence with the skeleton data set in the S2, and the RGB data set is stored in a dictionary form to obtain a final integrated multi-mode data set;
s6, training a single-mode model HCN for processing skeleton data on a skeleton data set to obtain pre-training parameters of the model;
s7, training a single-mode model I3D for processing the picture data on the RGB data set to obtain pre-training parameters of the model;
s8, designing a self-adaptive depth supervision module, integrating the self-adaptive depth supervision module with the single-mode models obtained in the S6 and the S7 to form a multi-mode model, and loading the pre-training parameters obtained in the S6 and the S7;
the method specifically comprises the following steps:
s8.1, taking out intermediate features of the video and bones from hidden layers of two single-mode networks, and respectively marking the intermediate features as:
,/>
s8.2, two attention weight matrixes are created according to the size of the feature, and the attention weight matrixes are specifically expressed as follows:
,/>
s8.3, repeating the two attention weight matrixes obtained in the S8.2 in the batch size dimensionBAnd multiplying the original characteristics to obtain:
,/>wherein->Representing multiplication by element; the features after the attention weighting adjustment are noted as:
s8.4, respectively inputting the features subjected to attention weight adjustment in the step S8.3 into a 3D pooling layer and a 2D pooling layer, and obtaining after dimension adjustment:
s8.5, splicing the characteristics with the dimension adjusted obtained in the S8.4 together to obtain:
s8.6, the features obtained in the S8.5 are followed by a linear layer, and an auxiliary classification prediction result is obtained:
s9, taking the multi-modal data set obtained in the S5 as clean data input, and obtaining a multi-modal countermeasure sample by utilizing a PGD algorithm;
s10, inputting the countermeasure sample obtained in the S9 into the multi-modal model obtained in the S8 as training data, and performing forward propagation;
s11, obtaining a prediction result of the model through forward propagation, and obtaining a new objective function based on calculation of the prediction result so as to finish updating of the weight parameters of the model once.
2. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein S1 specifically includes the following:
detecting and extracting the key points of the human skeleton in the video data by using the open-source human pose estimation library OpenPose to generate a skeleton data set; wherein the information of each key point is a three-dimensional coordinate point (x, y, z), in which x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1], and there are at most two people simultaneously in each video.
3. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the normalization operation in S2 specifically includes the following:
for the x and y dimensions, taking the maximum absolute values |x|_max and |y|_max, and dividing each value in the x and y dimensions by the corresponding maximum, so that the x and y values of the resulting skeleton data are all distributed in [-1, 1].
4. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the RGB input of the multi-modal dataset in S5 is a tensor x_RGB of size B×T×C×H×W, where B represents the batch size; T represents the number of frames; C represents the number of RGB channels; H and W represent the height and width of the image, respectively;
the skeleton input of the multi-modal dataset in S5 is a tensor x_ske of size B×T×C×V×M, where B represents the batch size; T represents the number of frames; C represents the coordinate dimension of the skeleton; V represents the number of skeleton key points; M represents the number of people in the video.
5. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the single-modal model HCN in S6 is used for extracting features of data, and specifically includes the following contents:
s6.1, subtracting the front frame and the rear frame of the skeleton key point data to obtain motion information on the time domain, wherein the size isWherein, the method comprises the steps of, wherein,Brepresenting the size of the batch size;Trepresenting the number of frames;Crepresenting the coordinate dimensions of the bone;Vrepresenting bone key points;Mrepresenting the number of people in the video;
s6.2, extracting characteristics of motion data and key point data through respective branch networks, wherein each branch network comprises four layers of convolution layers;
s6.3, splicing the two features extracted in the S6.2 together, and then inputting the two features into the two convolution layers and the two full-connection layers to obtain a final prediction result with the size ofB×NWherein, the method comprises the steps of, wherein,Nrepresenting the number of categories; and saving model parameters after training is completed.
6. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the single-mode model I3D in S7 is used for extracting video features, and specifically includes the following:
the I3D comprises 4 blocks, and each block consists of several 3D convolution modules; the output of the last convolution layer is processed by a 3D pooling layer, and the resulting vector is input into a fully connected layer to obtain a final prediction result of size B×N, where N represents the number of categories; and the model parameters are saved after training is completed.
7. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein S9 specifically includes the following:
f_θ denotes the multi-modal model in S8, where θ denotes the model parameters; {x_RGB, x_ske} denotes the multi-modal input, {x_RGB + δ_RGB, x_ske + δ_ske} the corresponding adversarial sample, and y the true label; the goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input;
under the multi-modal input, the objective function of performing a multi-modal attack on f_θ is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation;
a PGD algorithm is adopted to obtain the adversarial sample; taking single-mode data as an example, the update formula is:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations.
8. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 6, wherein S11 specifically comprises the following:
based on S10, the auxiliary classification prediction results obtained in S8 and the final prediction result obtained by forward propagation in S10 are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms;
and after the final loss function is obtained, back propagation is carried out by using the final loss function, and model parameters are optimized.
CN202310660598.6A 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module Active CN117274656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310660598.6A CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310660598.6A CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Publications (2)

Publication Number Publication Date
CN117274656A CN117274656A (en) 2023-12-22
CN117274656B true CN117274656B (en) 2024-04-05

Family

ID=89213111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310660598.6A Active CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Country Status (1)

Country Link
CN (1) CN117274656B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651916A (en) * 2020-12-25 2021-04-13 上海交通大学 Method, system and medium for pre-training of self-monitoring model
CN112668492A (en) * 2020-12-30 2021-04-16 中山大学 Behavior identification method for self-supervised learning and skeletal information
CN112905822A (en) * 2021-02-02 2021-06-04 华侨大学 Deep supervision cross-modal counterwork learning method based on attention mechanism
CN114612511A (en) * 2022-03-09 2022-06-10 齐齐哈尔大学 Exercise training assistant decision support system based on improved domain confrontation neural network algorithm
CN114722812A (en) * 2022-04-02 2022-07-08 尚蝉(浙江)科技有限公司 Method and system for analyzing vulnerability of multi-mode deep learning model
CN114821014A (en) * 2022-05-17 2022-07-29 湖南大学 Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN116129174A (en) * 2022-12-08 2023-05-16 河北工业大学 Generalized zero sample image classification method based on feature refinement self-supervision learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739078B (en) * 2020-06-15 2022-11-18 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
US20220051437A1 (en) * 2020-08-17 2022-02-17 Northeastern University 3D Human Pose Estimation System
EP4323940A2 (en) * 2021-04-16 2024-02-21 Strong Force VCN Portfolio 2019, LLC Systems, methods, kits, and apparatuses for digital product network systems and biology-based value chain networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651916A (en) * 2020-12-25 2021-04-13 上海交通大学 Method, system and medium for pre-training of self-monitoring model
CN112668492A (en) * 2020-12-30 2021-04-16 中山大学 Behavior identification method for self-supervised learning and skeletal information
CN112905822A (en) * 2021-02-02 2021-06-04 华侨大学 Deep supervision cross-modal counterwork learning method based on attention mechanism
CN114612511A (en) * 2022-03-09 2022-06-10 齐齐哈尔大学 Exercise training assistant decision support system based on improved domain confrontation neural network algorithm
CN114722812A (en) * 2022-04-02 2022-07-08 尚蝉(浙江)科技有限公司 Method and system for analyzing vulnerability of multi-mode deep learning model
CN114821014A (en) * 2022-05-17 2022-07-29 湖南大学 Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN116129174A (en) * 2022-12-08 2023-05-16 河北工业大学 Generalized zero sample image classification method based on feature refinement self-supervision learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deepak Kumar et al. Finding Achilles' Heel: Adversarial Attack on Multi-modal Action Recognition. Multimedia Analysis and Description & Multimedia Fusion and Embedding. 2020, 3839-3837. *
Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition; Xiaoguang Zhu et al.; ACM Transactions on Multimedia Computing, Communications, and Applications; 2022-03-04; Vol. 18, No. 8; 1-24 *
Spatially and Temporally Structured Global to Local Aggregation of Dynamic Depth Information for Action Recognition; Yonghong Hou et al.; IEEE; 2017-12-11; 2206-2219 *
Research on Video Anomaly Event Detection Algorithm Based on Deep Learning; Tang Wei; China Master's Theses Full-text Database, Information Science and Technology; 2022-06-15 (No. 06); I138-438 *

Also Published As

Publication number Publication date
CN117274656A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
AU2018236433B2 (en) Room layout estimation methods and techniques
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
Mo et al. Human physical activity recognition based on computer vision with deep learning model
KR20200078531A (en) Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
EP3710990A1 (en) Meta-learning for multi-task learning for neural networks
WO2019227479A1 (en) Method and apparatus for generating face rotation image
Glauner Deep convolutional neural networks for smile recognition
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
CN111767786B (en) Anti-attack method and device based on three-dimensional dynamic interaction scene
Garg et al. Facial expression recognition & classification using hybridization of ICA, GA, and neural network for human-computer interaction
Tolba et al. Image signature improving by PCNN for Arabic sign language recognition
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN117274656B (en) Multi-mode model countermeasure training method based on self-adaptive depth supervision module
CN117333753A (en) Fire detection method based on PD-YOLO
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN115829962B (en) Medical image segmentation device, training method, and medical image segmentation method
CN116665300A (en) Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network
Mocanu et al. Multimodal convolutional neural network for object detection using rgb-d images
Venkatesh Object tracking in games using convolutional neural networks
CN116542292B (en) Training method, device, equipment and storage medium of image generation model
CN115984652B (en) Training method and device for symbol generation system, electronic equipment and storage medium
CN117992800B (en) Image-text data matching detection method, device, equipment and medium
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
Yiqiao Deep Learning Notes
Ding A detachable lstm with residual-autoencoder features method for motion recognition in video sequences

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant