CN117274656B - Multi-mode model countermeasure training method based on self-adaptive depth supervision module - Google Patents

Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Info

Publication number
CN117274656B
CN117274656B
Authority
CN
China
Prior art keywords
representing
model
data
modal
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310660598.6A
Other languages
Chinese (zh)
Other versions
CN117274656A (en)
Inventor
侯永宏
刘超
刘鑫
岳焕景
杨敬钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310660598.6A priority Critical patent/CN117274656B/en
Publication of CN117274656A publication Critical patent/CN117274656A/en
Application granted granted Critical
Publication of CN117274656B publication Critical patent/CN117274656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-mode model countermeasure training method based on a self-adaptive depth supervision module, belonging to the technical field of model countermeasure training methods. The method mainly comprises the following three parts: 1) I3D network: the backbone network for processing video data, composed mainly of 3D convolutions; 2) HCN network: the backbone network for extracting features from human skeleton sequence data, composed mainly of 2D convolution layers; 3) Adaptive Deep Supervision Module (ADSM): RGB video and human skeleton key points are used as multi-modal inputs, and the attention layers in the ADSM autonomously learn the more robust modal features, thereby enhancing the robustness of the multi-modal network.

Description

Multi-mode model countermeasure training method based on self-adaptive depth supervision module
Technical Field
The invention relates to the technical field of model countermeasure training methods, in particular to a multimode model countermeasure training method based on a self-adaptive depth supervision module.
Background
The widespread use of deep learning in the real world places extremely high demands on the robustness of the model. For example, in the fields of medical diagnosis, automatic driving, and the like, the deep learning model must be able to properly cope with various situations and cannot cause erroneous prediction due to a slight disturbance or malicious attack of input data. Thus, studying the robustness of deep learning models has become an important direction of study. However, previous work has focused mainly on robustness studies on a single visual or video classification task. As multimodal tasks develop and apply, robustness research of multimodal models has become particularly urgent. Therefore, further investigation of the robustness of the multi-modal models is needed to ensure that they can cope correctly with various uncertainties, and thus are more reliable and safer in practical applications.
At present, one of the main methods for improving model robustness is adversarial training (AT). The core idea of adversarial training is to introduce adversarial samples into the model training process so that the deep learning model becomes more robust and can better cope with various disturbances and attacks. The adversarial samples are generated from the original data by an adversarial attack algorithm such as FGSM; by incorporating them into the training process, more robust feature representations and decision rules can be learned. AT was first proposed by Madry et al. Increasing the robustness of the model can make it more reliable and safe, but it may reduce the classification accuracy of the model on the original data. Therefore, a balance point needs to be found between robustness and accuracy, and Zhang et al. propose the TRADES method. TRADES uses the KL divergence to regularize the outputs of the adversarial samples and the original data; in this way, the model keeps high classification accuracy while obtaining higher robustness. Bai et al. propose a novel adversarial defense method known as CAS. Unlike conventional adversarial training, CAS starts from the perspective of channel activation and dynamically learns channel importance weights, which are used to suppress the activation of redundant channels when the model is attacked by adversarial samples. Such methods are expected to provide a more effective and reliable guarantee for the robustness of deep learning models.
In the field of video understanding and human motion recognition, there are many mature models and algorithms. From the original Two-Stream CNN to the currently popular Transformer, various methods have emerged and been continuously optimized. In recent years, human motion recognition based on human skeleton data has made rapid progress. Skeleton data has some advantages over RGB data, such as: 1. Low dimensionality and high efficiency: skeleton data only contains the position information of key points of the human body, so the data size is small and learning and processing are efficient. 2. Insensitivity to the background: compared with RGB data, skeleton data does not contain environmental and background information, so the features of the action itself are more easily captured without being affected by the background. Thus, more and more methods have begun to explore how to combine the two modalities of RGB and skeleton data for human motion recognition. However, in past studies, multi-modal models often focused only on performance on the original clean dataset and paid little attention to the robustness of the multi-modal model. In fact, the robustness problem in this field still faces great challenges and remains to be solved.
With the continuous development of machine learning and artificial intelligence technology, multi-modal techniques have become a widely studied research field and have been successfully applied in intelligent interaction, medical care, video understanding and other fields. Multi-modal techniques have great advantages over single-modal techniques. However, most current research does not take into account the robustness of the multi-modal model, for example how the overall performance differs from that of a single-modal model when one modality suffers an adversarial attack, and how the robustness of the multi-modal model can be improved. In addition, most previous research improves the overall robustness of the model based on the robustness of single-modal models, and few studies target multi-modal models.
In order to solve the above problems, the present invention proposes a multimodal model countermeasure training method based on an adaptive deep supervision module.
Disclosure of Invention
1. Technical problem to be solved by the invention
The invention aims to provide a multi-mode model countermeasure training method based on a self-adaptive deep supervision module so as to solve the following problems:
1) How does the robustness of a multi-modal model differ from that of a single-modal model? Because of the interactions between the modalities of a multi-modal model, does the model behave differently from a single-modal model after one modality is attacked, and where does the difference lie? Can multi-modal fusion improve the overall robustness?
2) How robust are the different modalities against adversarial attacks? The robustness of the RGB video modality and the skeleton modality is not the same, and the skeleton modality is better from the perspective of its data characteristics; how can this difference be demonstrated experimentally?
3) How can the robustness of the multi-modal model be improved more effectively? Given that few current approaches address multi-modal model robustness, how can an improvement in multi-modal robustness be achieved efficiently?
2. Technical proposal
In order to achieve the above purpose, the present invention provides the following technical solutions:
the multi-modal model countermeasure training method based on the self-adaptive depth supervision module comprises the following steps:
s1, generating a skeleton data set according to video data;
s2, carrying out normalization operation on the skeleton data set obtained in the S1, and sampling a certain number of framesTThe defects areTSupplementing 0 of the frame;
s3, uniformly sampling the video frames to obtain RGB pictures for training;
s4, cutting the RGB picture obtained in the S3, and adjusting the size to obtain an RGB data set;
s5, the RGB data set obtained in the S4 is in one-to-one correspondence with the skeleton data set in the S2, and the RGB data set is stored in a dictionary form to obtain a final integrated multi-mode data set;
s6, training a single-mode model HCN for processing skeleton data on a skeleton data set to obtain pre-training parameters of the model;
s7, training a single-mode model I3D for processing the picture data on the RGB data set to obtain pre-training parameters of the model;
s8, designing a self-adaptive depth supervision module (Adaptive Deep Supervision Module, ADSM), integrating the self-adaptive depth supervision module and the single-mode model obtained in S6 and S7 into a multi-mode model, and loading the pre-training parameters obtained in S6 and S7;
s9, taking the multi-modal data set obtained in the S5 as clean data input, and obtaining a multi-modal countermeasure sample by utilizing a PGD algorithm;
s10, inputting the countermeasure sample obtained in the S9 into the multi-modal model obtained in the S8 as training data, and performing forward propagation;
s11, obtaining a prediction result of the model through forward propagation, and obtaining a new objective function based on calculation of the prediction result so as to finish updating of the weight parameters of the model once.
Preferably, S1 specifically includes the following:
detecting and extracting the key points of the human skeleton in the video data by using the open-source human pose estimation library OpenPose to generate a skeleton data set; the information of each key point is a three-dimensional coordinate point (x, y, z), where x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1]; there are at most two people simultaneously in each video.
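For illustration only, the following sketch (not part of the claimed method) shows one way the per-frame OpenPose output could be collected into the (x, y, z) key-point format described above. It assumes OpenPose's per-frame JSON files containing a "people" list with a flat "pose_keypoints_2d" array of (x, y, confidence) triples; the file-naming pattern and the helper name load_skeleton_sequence are hypothetical.

```python
# Illustrative sketch: collect per-frame OpenPose JSON output into an array of
# shape (T, 3, V, M), where channels are (x, y, confidence).
import json
from pathlib import Path

import numpy as np


def load_skeleton_sequence(frame_dir: str, num_joints: int = 15, max_people: int = 2) -> np.ndarray:
    """Return an array of shape (T, 3, V, M) for one video."""
    files = sorted(Path(frame_dir).glob("*_keypoints.json"))  # hypothetical naming
    seq = np.zeros((len(files), 3, num_joints, max_people), dtype=np.float32)
    for t, f in enumerate(files):
        people = json.loads(f.read_text()).get("people", [])[:max_people]
        for m, person in enumerate(people):
            kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32).reshape(-1, 3)
            kp = kp[:num_joints]                 # keep the first V joints used by the method
            seq[t, :, : kp.shape[0], m] = kp.T   # rows: x, y, confidence
    return seq
```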
Preferably, the normalization operation in S2 specifically includes the following:
for the x and y dimensions, taking the maximum absolute values |x|_max and |y|_max, and dividing each value in the x and y dimensions by the corresponding maximum, so that the x and y values of the resulting skeleton data are all distributed in [-1, 1].
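A minimal sketch of this preprocessing, under the assumption that a skeleton sample is stored as an array of shape (T0, 3, V, M) with channels ordered (x, y, confidence); the helper name and the default target length T = 300 (used in Example 1) are illustrative.

```python
# Illustrative sketch: max-abs normalization of x and y, followed by zero-padding
# (or truncation) of the sequence to a fixed length T.
import numpy as np


def normalize_and_pad(skeleton: np.ndarray, target_frames: int = 300) -> np.ndarray:
    out = skeleton.astype(np.float32).copy()
    for c in range(2):                       # channel 0 = x, channel 1 = y
        max_abs = np.abs(out[:, c]).max()
        if max_abs > 0:
            out[:, c] /= max_abs             # values now fall in [-1, 1]
    t0 = out.shape[0]
    if t0 >= target_frames:
        return out[:target_frames]
    pad = np.zeros((target_frames - t0, *out.shape[1:]), dtype=np.float32)
    return np.concatenate([out, pad], axis=0)
```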
Preferably, the RGB input of the multi-modal dataset in S5 is a tensor x_RGB of size B×T×C×H×W, where B represents the batch size; T represents the number of frames; C represents the number of RGB channels; H and W represent the height and width of the image, respectively;
the skeleton input of the multi-modal dataset in S5 is a tensor x_ske of size B×T×C×V×M, where B represents the batch size; T represents the number of frames; C represents the coordinate dimension of the skeleton; V represents the number of skeleton key points; M represents the number of people in the video.
Preferably, the single-mode model HCN in S6 is used for extracting features of data, and specifically includes the following contents:
s6.1, subtracting the front frame and the rear frame of the skeleton key point data to obtain motion information on the time domain, wherein the size isWherein, the method comprises the steps of, wherein,Brepresenting the size of the batch size;Trepresenting the number of frames;Crepresenting the coordinate dimensions of the bone;Vrepresenting bone key points;Mrepresenting the number of people in the video;
s6.2, extracting characteristics of motion data and key point data through respective branch networks, wherein each branch network comprises four layers of convolution layers;
s6.3, splicing the two features extracted in the S6.2 together, and then inputting the two features into the two convolution layers and the two full-connection layers to obtain a final prediction result with the size ofB×NWherein, the method comprises the steps of, wherein,Nrepresenting the number of categories; and saving model parameters after training is completed.
Preferably, the single-mode model I3D in S7 is used for extracting video features, and specifically includes the following:
the I3D comprises 4 blocks, and each block is composed of several 3D convolution modules; the output of the last convolution layer is processed by a 3D pooling layer, and the resulting vector is input into a fully connected layer to obtain a final prediction result of size B×N, where N represents the number of categories; and the model parameters are saved after training is completed.
Preferably, the design of the adaptive depth supervision module in S8 specifically includes the following:
S8.1, taking out the intermediate features of the video and the skeleton from the hidden layers of the two single-mode networks, denoted respectively as:
F_RGB, F_ske;
S8.2, creating two attention weight matrices according to the sizes of the features, specifically expressed as:
W_RGB, W_ske;
S8.3, repeating the two attention weight matrices obtained in S8.2 B times in the batch dimension, and multiplying them with the original features to obtain:
F'_RGB = W_RGB ⊙ F_RGB, F'_ske = W_ske ⊙ F_ske, where ⊙ denotes element-wise multiplication; the features after attention weight adjustment are denoted F'_RGB and F'_ske;
S8.4, respectively inputting the attention-weight-adjusted features of S8.3 into a 3D pooling layer and a 2D pooling layer, and obtaining, after dimension adjustment, the vectors v_RGB and v_ske;
S8.5, splicing the dimension-adjusted features obtained in S8.4 together to obtain v = [v_RGB; v_ske];
S8.6, passing the feature obtained in S8.5 through a linear layer (Linear) to obtain the auxiliary classification prediction result y_aux.
The above is the structural design of the adaptive deep supervision module. The module finally returns three results, namely the weight-adjusted features of the two modalities, F'_RGB and F'_ske, and the auxiliary classification prediction y_aux. The adjusted features continue to propagate forward, while y_aux is used to calculate the loss function. In the invention, an adaptive deep supervision module is inserted between the 2nd, 3rd and 4th blocks of the single-mode model I3D and the 5th, 6th and 7th convolution (or linear) layers of the single-mode model HCN, so that the required multi-modal network is obtained. Different numbers of modules can be added flexibly as needed. The pre-training parameters obtained in S6 and S7 are then loaded.
Preferably, S9 specifically includes the following:
f_θ denotes the multi-modal model in S8, where θ denotes the model parameters; {x_RGB, x_ske} denotes the multi-modal input, {x_RGB + δ_RGB, x_ske + δ_ske} the corresponding adversarial sample, and y the true label; the goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input;
under the multi-modal input, the objective function of performing a multi-modal attack on f_θ is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation;
a PGD algorithm is adopted to obtain the adversarial sample; taking single-mode data as an example, the update formula is:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations.
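A minimal sketch of generating the multi-modal adversarial sample with PGD as in S9. It assumes model(x_rgb, x_ske) returns classification logits, an L∞ constraint, and step size, radii and number of steps supplied by the caller; none of these names come from the patent.

```python
# Illustrative sketch of multi-modal PGD (PGD10 when steps=10).
import torch
import torch.nn.functional as F


def pgd_multimodal(model, x_rgb, x_ske, y, eps_rgb, eps_ske, alpha, steps=10):
    adv_rgb, adv_ske = x_rgb.clone().detach(), x_ske.clone().detach()
    for _ in range(steps):
        adv_rgb.requires_grad_(True)
        adv_ske.requires_grad_(True)
        loss = F.cross_entropy(model(adv_rgb, adv_ske), y)
        g_rgb, g_ske = torch.autograd.grad(loss, [adv_rgb, adv_ske])
        with torch.no_grad():
            # Ascend the loss, then project back into the epsilon-ball around x^(0).
            adv_rgb = adv_rgb + alpha * g_rgb.sign()
            adv_rgb = x_rgb + torch.clamp(adv_rgb - x_rgb, -eps_rgb, eps_rgb)
            adv_ske = adv_ske + alpha * g_ske.sign()
            adv_ske = x_ske + torch.clamp(adv_ske - x_ske, -eps_ske, eps_ske)
    return adv_rgb.detach(), adv_ske.detach()
```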
Preferably, S11 specifically includes the following:
based on S10, the auxiliary classification prediction results obtained in S8 and the final prediction result obtained by forward propagation in S10 are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms;
after the final loss function is obtained, back propagation is carried out using it, and the model parameters are optimized.
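A minimal sketch of the combined objective, following the description above: a cross-entropy term on the adversarial final output, the auxiliary losses of the J modules computed on the adversarial sample, and a clean-data term weighted by λ. The equal weighting of the auxiliary terms is an assumption.

```python
# Illustrative sketch of the combined robust + clean objective.
import torch.nn.functional as F


def total_loss(adv_logits, aux_logits_list, clean_logits, y, lam=1.0):
    robust = F.cross_entropy(adv_logits, y)
    robust = robust + sum(F.cross_entropy(aux, y) for aux in aux_logits_list)
    clean = F.cross_entropy(clean_logits, y)
    return robust + lam * clean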
3. Advantageous effects
(1) The invention provides an Adaptive Deep Supervision Module (ADSM), which adopts RGB video and human skeleton key points as multi-modal inputs and uses the attention layers in the ADSM to autonomously learn the more robust modal features, thereby enhancing the robustness of the multi-modal network. To our knowledge there is little research on multi-modal model robustness, and this is the first method to improve multi-modal model robustness on the video action recognition task.
(2) The invention designs a new module, namely a self-adaptive depth supervision module, which comprises two attention weight layers, wherein the modal features with larger classification contributions can obtain larger weights, and the model can autonomously learn more robust data features during training, so that the overall robustness is improved.
(3) The invention designs a new loss function to optimize the parameters of the multi-modal model. The new loss function contains the auxiliary classification outputs of several intermediate ADSMs, forcing the model to learn more robust features at its intermediate levels. Meanwhile, in order not to reduce the accuracy of the model on the clean data set, the loss function also adds a classification loss on the clean data, and a hyper-parameter is used to adjust the balance between the two loss terms.
(4) The Adaptive Deep Supervision Module (ADSM) designed by the invention is a plug-and-play module that can be conveniently combined with other multi-modal models to improve their robustness.
Drawings
FIG. 1 is a flow chart of a design framework of a multi-modal model countermeasure training method based on an adaptive deep supervision module;
fig. 2 is a schematic structural diagram of an Adaptive Deep Supervision Module (ADSM) according to the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
In recent years, improving model robustness has become a hotspot in the field of deep learning, and many methods have been proposed on conventional image classification tasks, such as adversarial training (AT), TRADES and FAT. Today, with the great success of multi-modal techniques in medical health, human-computer interaction, video understanding and other fields, multi-modal methods exhibit advantages that single-modal models do not possess. For example, in a video action recognition task, using only RGB video data performs worse than a multi-modal method that uses RGB video together with human skeleton data. However, in earlier studies, methods for improving model robustness were developed on single-modal models, and few studies targeted multi-modal models. The invention provides a multi-mode model countermeasure training method based on a self-adaptive depth supervision module, supported by the National Natural Science Foundation of China project "Human micro-gesture recognition and emotion analysis based on self-supervised learning" (No. 62171309), and is described below with reference to specific examples and the drawings.
Example 1:
Referring to FIG. 1, the invention comprehensively considers the influence of the two modalities, RGB video and skeleton, on robustness, and provides a new multi-mode model countermeasure training method based on a self-adaptive depth supervision module, which specifically comprises the following steps:
Step 1: generating a skeleton dataset from the video data: the key points of the human skeleton in the video data are detected and extracted by using the open-source human pose estimation library OpenPose. For the iMiGUE dataset used in the invention, 15 skeleton key points are generated for each person; the information of each key point is a three-dimensional coordinate point (x, y, z), where x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1], and there are at most two people simultaneously in each video.
Step 2: normalizing the skeleton dataset and sampling a fixed number of frames T = 300; sequences with fewer than 300 frames are padded with 0.
The normalization step is: for the x and y dimensions, take the maximum absolute values |x|_max and |y|_max and divide each value in the x and y dimensions by the corresponding maximum. The x and y values of the skeleton data thus obtained are distributed in [-1, 1]. This step may be omitted if the skeleton data obtained by other methods has already been normalized.
Step 3: sampling all video data uniformly, i.e. sampling T = 8 frames per video at equal intervals, to obtain the RGB pictures used for training.
Step 4: the RGB pictures are cropped and scaled to a size of 256×256.
Step 5: the RGB data obtained above are put in one-to-one correspondence with the skeletons and stored in dictionary form, so that the final integrated multi-modal dataset is obtained.
The RGB input of the multi-modal dataset is a tensor of size B×T×C×H×W, where B is the batch size, T = 8 is the number of frames, C = 3 is the number of RGB channels, and H = W = 256 is the height and width of the image.
The skeleton input of the multi-modal dataset is a tensor of size B×T×C×V×M, where B is the batch size, T = 300 is the number of frames, C = 3 is the coordinate dimension of the skeleton, V = 15 is the number of skeleton key points, and M = 1 is the number of people in the video.
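As an illustration of step 5, a sketch of a dataset that pairs each RGB clip with its skeleton sequence and returns them as a dictionary, using the tensor shapes of this embodiment (8 RGB frames of 3×256×256 and 300 skeleton frames with V = 15, M = 1). The class name and the in-memory storage are assumptions, not part of the patent.

```python
# Illustrative sketch: dictionary-style multi-modal dataset (step 5).
import numpy as np
import torch
from torch.utils.data import Dataset


class MultiModalDataset(Dataset):
    def __init__(self, rgb_clips, skeletons, labels):
        # rgb_clips: list of arrays (8, 3, 256, 256); skeletons: list of arrays
        # (300, 3, 15, 1); labels: list of ints. The lists are index-aligned.
        self.rgb_clips, self.skeletons, self.labels = rgb_clips, skeletons, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            "rgb": torch.from_numpy(self.rgb_clips[idx]).float(),
            "skeleton": torch.from_numpy(self.skeletons[idx]).float(),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```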
Step 6: training the single-mode model HCN for processing skeleton data on the skeleton dataset to obtain the pre-training parameters of the model.
The HCN is used to extract the features of the skeleton data. First, adjacent frames of the skeleton key-point data are subtracted to obtain the motion information in the time domain, with size B×T×C×V×M. The motion data and the key-point data are then passed through their respective branch networks to extract features, each branch having four convolution layers. The two extracted features are spliced together and fed into two further convolution layers and two fully connected layers to obtain the final prediction result of size B×N, where N is the number of categories. The model parameters are saved after training is completed.
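A rough, illustrative sketch of an HCN-style skeleton branch as described above: the temporal difference of the key points forms the motion stream, each stream passes through four 2D convolution layers over the (frame, joint) plane, the two streams are concatenated and processed by two more convolution layers, and fully connected layers produce the class scores. Channel widths and kernel sizes are assumptions, not the patented configuration.

```python
# Illustrative HCN-style skeleton classifier (not the patented network).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_branch(in_ch):
    # Four 2D convolution layers per branch, as described in S6.2 / step 6.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    )


class HCNLike(nn.Module):
    def __init__(self, num_classes, coord_dim=3):
        super().__init__()
        self.skeleton_branch = conv_branch(coord_dim)
        self.motion_branch = conv_branch(coord_dim)
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x):
        # x: (B, T, C, V, M); keep a single person (M=1) and treat (T, V) as a plane.
        x = x[..., 0].permute(0, 2, 1, 3)                    # (B, C, T, V)
        motion = x[:, :, 1:] - x[:, :, :-1]                  # frame difference
        motion = F.pad(motion, (0, 0, 0, 1))                 # pad back to T frames
        feats = torch.cat([self.skeleton_branch(x), self.motion_branch(motion)], dim=1)
        return self.classifier(self.fuse(feats).flatten(1))  # (B, N)
```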
Step 7: training the single-mode model I3D for processing picture data on the RGB dataset to obtain the pre-training parameters of the model.
The I3D contains 4 blocks, each block being composed of several 3D convolution modules, with 48 3D convolution layers in total for extracting video features. The output of the final convolution layer has size B×2048×8×8×8; it passes through a 3D pooling layer to obtain a vector of size B×2048, which is input into the fully connected layer to obtain the final prediction result of size B×N, where N is the number of categories. The model parameters are saved after training is completed.
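The sketch below is a drastically simplified stand-in for the I3D branch, shown only to make the data flow concrete: four 3D-convolution stages that reduce the spatial resolution, a 3D pooling layer yielding a (B, 2048) vector, and a fully connected classifier producing a (B, N) output. The real I3D uses many more Bottleneck3D layers; only the channel widths follow the feature sizes quoted in this description.

```python
# Illustrative, heavily simplified 3D-convolutional video branch (not I3D itself).
import torch
import torch.nn as nn


class TinyI3D(nn.Module):
    def __init__(self, num_classes, in_ch=3):
        super().__init__()
        widths = [256, 512, 1024, 2048]
        blocks, ch = [], in_ch
        for w in widths:
            blocks += [
                nn.Conv3d(ch, w, kernel_size=3, stride=(1, 2, 2), padding=1),
                nn.BatchNorm3d(w),
                nn.ReLU(inplace=True),
            ]
            ch = w
        self.blocks = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        # x: (B, T, C, H, W) -> (B, C, T, H, W) for 3D convolution.
        x = x.permute(0, 2, 1, 3, 4)
        feats = self.blocks(x)                        # (B, 2048, T, H/16, W/16)
        return self.fc(self.pool(feats).flatten(1))   # (B, N)
```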
Step 8: integrating the single-mode model in the steps 6 and 7 and the self-adaptive depth supervision module (Adaptive Deep Supervision Module, ADSM) designed by the invention into a multi-mode model, and loading the pre-training parameters obtained in the steps 6 and 7;
the Adaptive Depth Supervision Module (ADSM) described above is the core design of the present invention. We find in experiments that bone modality data is more robust against attacks (conforms to the characteristics of bone data itself), but the data volume of bone data is smaller than that of video data. Therefore, the weight of the two data is adjusted as much as possible in the training process, so that the model learns more robust data distribution.
Specifically, we take the intermediate features of video and bone from the hidden layers of two single-mode networks, respectively,/>. Two attention weight matrices are created according to the feature size>,/>. Two attention weight matrices repeat in the batch size dimensionBAnd multiplying the original characteristics to obtain:,/>wherein->Representing multiplication by element. NowThe characteristics after the attention weight adjustment are respectively input into a 3D pooling layer and a 2D pooling layer, and the dimension is adjusted to obtainThen spliced together, i.e.)>. Finally, a Linear layer (Linear) is added to obtain the auxiliary classification prediction result +.>. The above is the structure of the adaptive depth supervision module. In this module, three results are finally returned, namely the characteristics of the two modes after weight adjustment +.>And predictive outcome of auxiliary classification +.>. The adjusted feature continues to propagate forward, < >>For calculating the loss function.
In the invention, the modules are inserted between the 2 nd, 3 rd, 4 th blocks and the 5 th, 6 th and 7 th convolution (or linear) layers of the HCN of the I3D, so that the needed multi-mode network can be obtained. Different numbers of the modules can be flexibly added according to the needs. The pre-training parameters obtained in steps 6 and 7 are then loaded.
Step 9: taking the multi-modal dataset obtained in step 5 as the clean data input, a multi-modal adversarial sample is obtained by using the PGD10 algorithm; the specific contents are as follows:
f_θ denotes the multi-modal network in step 8, where θ are the model parameters; {x_RGB, x_ske} is the multi-modal input and y is the true label. The goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input, in the present invention the RGB data and the skeleton data {x_RGB, x_ske}. In order to make the trained multi-modal model f_θ produce incorrect predictions while keeping the corresponding perturbations as imperceptible as possible, the objective function of a multi-modal attack on f_θ under the multi-modal input is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation. This formula describes what an adversarial sample is; to obtain the adversarial sample we use the common PGD algorithm, taking single-mode data as an example:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations, which is 10 in the present invention, i.e. PGD10. The adversarial sample is obtained with the above formula.
Step 10: the adversarial samples are input into the multi-modal model as training data for forward propagation.
Step 11: the prediction result of the model is obtained through forward propagation, and the new objective function designed by the invention is calculated to complete one update of the model weight parameters.
The new objective function specifically refers to combining the auxiliary classification predictions obtained in step 8 with the final prediction of the network as the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms, which can be tuned for different datasets.
After the final loss function is obtained, the method can be used for back propagation and optimization of model parameters.
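Putting steps 9-11 together, the following sketch shows one training update, reusing the pgd_multimodal and total_loss sketches given earlier. It assumes the multi-modal model returns the final logits together with a list of auxiliary ADSM predictions; the optimizer, batch layout and hyper-parameters are illustrative.

```python
# Illustrative sketch of one adversarial-training update (steps 9-11).
def train_step(model, optimizer, batch, eps_rgb, eps_ske, alpha, lam):
    x_rgb, x_ske, y = batch["rgb"], batch["skeleton"], batch["label"]
    # Step 9: build the multi-modal adversarial sample with PGD10.
    adv_rgb, adv_ske = pgd_multimodal(
        lambda a, b: model(a, b)[0], x_rgb, x_ske, y, eps_rgb, eps_ske, alpha, steps=10
    )
    model.train()
    optimizer.zero_grad()
    # Step 10: forward propagation of the adversarial sample (and of the clean data).
    adv_logits, aux_list = model(adv_rgb, adv_ske)
    clean_logits, _ = model(x_rgb, x_ske)
    # Step 11: combined objective and one parameter update.
    loss = total_loss(adv_logits, aux_list, clean_logits, y, lam=lam)
    loss.backward()
    optimizer.step()
    return loss.item()
```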
Example 2:
Example 2 is based on Example 1 but differs in the details below. Referring to FIG. 1, the overall flow proposed by the present invention can be divided into three major parts:
1) I3D network. As the backbone network for processing video data, I3D is mainly composed of 3D convolutions. The I3D used in the invention contains four 3D convolution blocks, and the blocks contain 3, 4, 6 and 3 basic 3D convolution modules (Bottleneck3D), respectively. A Bottleneck3D module contains three 3D convolution layers: the first and third layers have a convolution kernel size of 1×1×1 and a stride of 1×1×1, and the second layer has a convolution kernel size of 3×3×3, a stride of 1×2×2 and a padding of 1×1×1. After the first Bottleneck3D of each block there is a downsampling layer, which consists of a 3D convolution layer with a kernel size of 1×1×1 and a stride of 1×2×2, followed by a 3D BatchNorm layer. In total there are 53 convolution layers, including one 2D convolution layer at the beginning of the I3D network. The input video data has size B×T×C×H×W, where B is the batch size, T = 8 is the number of frames, C = 3 is the number of RGB channels, and H = W = 256 is the height and width of the image. After each block, the output feature sizes are B×256×8×64×64, B×512×8×32×32, B×1024×8×16×16 and B×2048×8×8×8, respectively. Finally, the feature passes through a 3D pooling layer and is flattened into a vector of size B×2048, and the final classification prediction is obtained through a fully connected (FC) layer.
2) HCN network. The HCN network is the backbone network for extracting features from human skeleton sequence data and mainly consists of 2D convolution layers. The input human skeleton data X_ske has size B×T×C×V×M, where B is the batch size, T = 300 is the number of frames, C = 3 is the coordinate dimension of the skeleton, V = 15 is the number of skeleton key points, and M = 1 is the number of people in the video (for the NTU dataset, M = 2 and V = 17). Because 2D convolution cannot extract temporal information by itself, motion features representing the temporal information are extracted in advance: the skeleton-point information of the previous frame is subtracted from that of the next frame to obtain the motion information of the skeleton key points between two frames, X_motion. The difference calculation reduces the temporal length by 1, i.e. to T-1, so the motion tensor is readjusted to size B×T×C×V×M by an interpolation function. Given X_ske and X_motion, the M dimension is used as an index to traverse each person; for each person, X_ske and X_motion are passed through four 2D convolution layers respectively to extract spatial and temporal features, which are then spliced in the C dimension to obtain features with spatio-temporal information, and the features are further extracted by two more convolution layers. Then, for the spatio-temporal features of each person, the maximum value is taken as the classification input, which is finally fed into two fully connected layers to obtain the final prediction result.
3) ADSM: this module is one of the cores of the invention. Specifically, we take the intermediate features of the video and the skeleton from the hidden layers of the two single-mode networks, denoted F_RGB and F_ske. Two attention weight matrices W_RGB and W_ske are created according to the feature sizes. The two attention weight matrices are repeated B times in the batch dimension and multiplied with the original features, giving F'_RGB = W_RGB ⊙ F_RGB and F'_ske = W_ske ⊙ F_ske, where ⊙ denotes element-wise multiplication. The attention-weight-adjusted features are then input into a 3D pooling layer and a 2D pooling layer respectively and, after dimension adjustment, yield the vectors v_RGB and v_ske, which are spliced together into v = [v_RGB; v_ske]. Finally, a linear layer (Linear) produces the auxiliary classification prediction y_aux. The above is the structure of the adaptive deep supervision module. The module returns three results, namely the weight-adjusted features of the two modalities, F'_RGB and F'_ske, and the auxiliary classification prediction y_aux. The adjusted features continue to propagate forward, while y_aux is used to calculate the loss function.
The invention provides a new loss function for optimizing the parameters of the multi-modal model, thereby balancing robustness and accuracy. Specifically, the auxiliary classification predictions produced by the adversarial sample through the network are combined with the final prediction of the network as the robust loss, and the final prediction of the clean data through the network is used as the clean loss; the two are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms, which can be tuned for different datasets.
After the final loss function is obtained, the method can be used for back propagation and optimization of model parameters.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical solution of the present invention and its inventive concept, within the technical scope disclosed by the present invention, shall fall within the scope of protection of the present invention.

Claims (8)

1. The multi-mode model countermeasure training method based on the self-adaptive depth supervision module is characterized by comprising the following steps of:
s1, generating a skeleton data set according to video data;
s2, carrying out normalization operation on the skeleton data set obtained in the S1, and sampling a certain number of framesTThe defects areTSupplementing 0 of the frame;
s3, uniformly sampling the video frames to obtain RGB pictures for training;
s4, cutting the RGB picture obtained in the S3, and adjusting the size to obtain an RGB data set;
s5, the RGB data set obtained in the S4 is in one-to-one correspondence with the skeleton data set in the S2, and the RGB data set is stored in a dictionary form to obtain a final integrated multi-mode data set;
s6, training a single-mode model HCN for processing skeleton data on a skeleton data set to obtain pre-training parameters of the model;
s7, training a single-mode model I3D for processing the picture data on the RGB data set to obtain pre-training parameters of the model;
s8, designing a self-adaptive depth supervision module, integrating the self-adaptive depth supervision module with the single-mode models obtained in the S6 and the S7 to form a multi-mode model, and loading the pre-training parameters obtained in the S6 and the S7;
the method specifically comprises the following steps:
s8.1, taking out intermediate features of the video and bones from hidden layers of two single-mode networks, and respectively marking the intermediate features as:
,/>
s8.2, two attention weight matrixes are created according to the size of the feature, and the attention weight matrixes are specifically expressed as follows:
,/>
s8.3, repeating the two attention weight matrixes obtained in the S8.2 in the batch size dimensionBAnd multiplying the original characteristics to obtain:
,/>wherein->Representing multiplication by element; the features after the attention weighting adjustment are noted as:
s8.4, respectively inputting the features subjected to attention weight adjustment in the step S8.3 into a 3D pooling layer and a 2D pooling layer, and obtaining after dimension adjustment:
s8.5, splicing the characteristics with the dimension adjusted obtained in the S8.4 together to obtain:
s8.6, the features obtained in the S8.5 are followed by a linear layer, and an auxiliary classification prediction result is obtained:
s9, taking the multi-modal data set obtained in the S5 as clean data input, and obtaining a multi-modal countermeasure sample by utilizing a PGD algorithm;
s10, inputting the countermeasure sample obtained in the S9 into the multi-modal model obtained in the S8 as training data, and performing forward propagation;
s11, obtaining a prediction result of the model through forward propagation, and obtaining a new objective function based on calculation of the prediction result so as to finish updating of the weight parameters of the model once.
2. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein S1 specifically includes the following:
detecting and extracting the key points of the human skeleton in the video data by using the open-source human pose estimation library OpenPose to generate a skeleton data set; wherein the information of each key point is a three-dimensional coordinate point (x, y, z), in which x and y represent the two-dimensional position coordinates of the point and z represents the confidence score, z ∈ [0, 1], and there are at most two people simultaneously in each video.
3. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the normalization operation in S2 specifically includes the following:
for the x and y dimensions, taking the maximum absolute values |x|_max and |y|_max, and dividing each value in the x and y dimensions by the corresponding maximum, so that the x and y values of the resulting skeleton data are all distributed in [-1, 1].
4. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the RGB input of the multi-modal dataset in S5 is a tensor x_RGB of size B×T×C×H×W, where B represents the batch size; T represents the number of frames; C represents the number of RGB channels; H and W represent the height and width of the image, respectively;
the skeleton input of the multi-modal dataset in S5 is a tensor x_ske of size B×T×C×V×M, where B represents the batch size; T represents the number of frames; C represents the coordinate dimension of the skeleton; V represents the number of skeleton key points; M represents the number of people in the video.
5. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the single-modal model HCN in S6 is used for extracting features of data, and specifically includes the following contents:
s6.1, subtracting the front frame and the rear frame of the skeleton key point data to obtain motion information on the time domain, wherein the size isWherein, the method comprises the steps of, wherein,Brepresenting the size of the batch size;Trepresenting the number of frames;Crepresenting the coordinate dimensions of the bone;Vrepresenting bone key points;Mrepresenting the number of people in the video;
s6.2, extracting characteristics of motion data and key point data through respective branch networks, wherein each branch network comprises four layers of convolution layers;
s6.3, splicing the two features extracted in the S6.2 together, and then inputting the two features into the two convolution layers and the two full-connection layers to obtain a final prediction result with the size ofB×NWherein, the method comprises the steps of, wherein,Nrepresenting the number of categories; and saving model parameters after training is completed.
6. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein the single-mode model I3D in S7 is used for extracting video features, and specifically includes the following:
the I3D comprises 4 blocks, and each block consists of several 3D convolution modules; the output of the last convolution layer is processed by a 3D pooling layer, and the resulting vector is input into a fully connected layer to obtain a final prediction result of size B×N, where N represents the number of categories; and the model parameters are saved after training is completed.
7. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 1, wherein S9 specifically includes the following:
f_θ denotes the multi-modal model in S8, where θ denotes the model parameters; {x_RGB, x_ske} denotes the multi-modal input, {x_RGB + δ_RGB, x_ske + δ_ske} the corresponding adversarial sample, and y the true label; the goal of a multi-modal attack is to fool the target multi-modal model f_θ by adding human-imperceptible perturbations to the multi-modal input;
under the multi-modal input, the objective function of performing a multi-modal attack on f_θ is:
max_{δ_RGB, δ_ske} L(f_θ(x_RGB + δ_RGB, x_ske + δ_ske), y),  s.t. ‖δ_RGB‖_p ≤ ε_RGB, ‖δ_ske‖_p ≤ ε_ske,
where δ_RGB denotes the adversarial perturbation of the RGB data; δ_ske denotes the adversarial perturbation of the skeleton; L(·) denotes the loss function used to optimize f_θ; ‖·‖_p denotes the p-norm; ε_RGB and ε_ske denote the maximum range of the adversarial perturbation;
a PGD algorithm is adopted to obtain the adversarial sample; taking single-mode data as an example, the update formula is:
x^(t+1) = Π_ε( x^(t) + α · sign(∇_x L(f_θ(x^(t)), y)) ),
where L(·) denotes the loss function; α denotes the attack step size; x^(0) denotes the natural data, or the natural data perturbed by Gaussian noise or uniform random noise; x^(t) denotes the adversarial sample at step t; Π_ε(·) denotes the projection function, which projects the adversarial data back into the sphere of radius ε centered at x^(0) when the perturbation exceeds the maximum range; t denotes the number of iterations.
8. The multi-modal model countermeasure training method based on the adaptive deep supervision module according to claim 6, wherein S11 specifically comprises the following:
based on S10, the auxiliary classification prediction results obtained in S8 and the final prediction result obtained by forward propagation in S10 are combined into the final loss function:
L_total = L(f_θ(x_adv), y) + Σ_{j=1,…,J} L(y_aux^(j), y) + λ · L(f_θ(x), y),
where f_θ(x_adv) denotes the final output for the adversarial sample; f_θ(x) denotes the output for the original clean data; y_aux^(j) denotes the auxiliary classification prediction of the j-th adaptive deep supervision module for the adversarial sample; L(·) denotes the cross-entropy loss function; J denotes the number of adaptive deep supervision modules; λ denotes a hyper-parameter used to adjust the balance between the two loss terms;
and after the final loss function is obtained, back propagation is carried out by using the final loss function, and model parameters are optimized.
CN202310660598.6A 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module Active CN117274656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310660598.6A CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310660598.6A CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Publications (2)

Publication Number Publication Date
CN117274656A CN117274656A (en) 2023-12-22
CN117274656B true CN117274656B (en) 2024-04-05

Family

ID=89213111

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310660598.6A Active CN117274656B (en) 2023-06-06 2023-06-06 Multi-mode model countermeasure training method based on self-adaptive depth supervision module

Country Status (1)

Country Link
CN (1) CN117274656B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651916A (en) * 2020-12-25 2021-04-13 上海交通大学 Method, system and medium for pre-training of self-monitoring model
CN112668492A (en) * 2020-12-30 2021-04-16 中山大学 Behavior identification method for self-supervised learning and skeletal information
CN112905822A (en) * 2021-02-02 2021-06-04 华侨大学 Deep supervision cross-modal counterwork learning method based on attention mechanism
CN114612511A (en) * 2022-03-09 2022-06-10 齐齐哈尔大学 Exercise training assistant decision support system based on improved domain confrontation neural network algorithm
CN114722812A (en) * 2022-04-02 2022-07-08 尚蝉(浙江)科技有限公司 Method and system for analyzing vulnerability of multi-mode deep learning model
CN114821014A (en) * 2022-05-17 2022-07-29 湖南大学 Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN116129174A (en) * 2022-12-08 2023-05-16 河北工业大学 Generalized zero sample image classification method based on feature refinement self-supervision learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739078B (en) * 2020-06-15 2022-11-18 大连理工大学 Monocular unsupervised depth estimation method based on context attention mechanism
US20220051437A1 (en) * 2020-08-17 2022-02-17 Northeastern University 3D Human Pose Estimation System
EP4323940A2 (en) * 2021-04-16 2024-02-21 Strong Force VCN Portfolio 2019, LLC Systems, methods, kits, and apparatuses for digital product network systems and biology-based value chain networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651916A (en) * 2020-12-25 2021-04-13 上海交通大学 Method, system and medium for pre-training of self-monitoring model
CN112668492A (en) * 2020-12-30 2021-04-16 中山大学 Behavior identification method for self-supervised learning and skeletal information
CN112905822A (en) * 2021-02-02 2021-06-04 华侨大学 Deep supervision cross-modal counterwork learning method based on attention mechanism
CN114612511A (en) * 2022-03-09 2022-06-10 齐齐哈尔大学 Exercise training assistant decision support system based on improved domain confrontation neural network algorithm
CN114722812A (en) * 2022-04-02 2022-07-08 尚蝉(浙江)科技有限公司 Method and system for analyzing vulnerability of multi-mode deep learning model
CN114821014A (en) * 2022-05-17 2022-07-29 湖南大学 Multi-mode and counterstudy-based multi-task target detection and identification method and device
CN116129174A (en) * 2022-12-08 2023-05-16 河北工业大学 Generalized zero sample image classification method based on feature refinement self-supervision learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Deepak Kumar et al. Finding Achilles' Heel: Adversarial Attack on Multi-modal Action Recognition. Multimedia Analysis and Description & Multimedia Fusion and Embedding. 2020, 3839-3837. *
Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition; Xiaoguang Zhu et al.; ACM Transactions on Multimedia Computing, Communications, and Applications; 2022-03-04; Vol. 18, No. 8; 1-24 *
Spatially and Temporally Structured Global to Local Aggregation of Dynamic Depth Information for Action Recognition; Yonghong Hou et al.; IEEE; 2017-12-11; 2206-2219 *
Research on Video Anomaly Event Detection Algorithm Based on Deep Learning; Tang Wei; China Master's Theses Full-text Database, Information Science and Technology; 2022-06-15 (No. 06); I138-438 *

Also Published As

Publication number Publication date
CN117274656A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
AU2018236433B2 (en) Room layout estimation methods and techniques
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
Mo et al. Human physical activity recognition based on computer vision with deep learning model
KR20200078531A (en) Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
EP3710990A1 (en) Meta-learning for multi-task learning for neural networks
WO2019227479A1 (en) Method and apparatus for generating face rotation image
Glauner Deep convolutional neural networks for smile recognition
CN113989890A (en) Face expression recognition method based on multi-channel fusion and lightweight neural network
CN111767786B (en) Anti-attack method and device based on three-dimensional dynamic interaction scene
Garg et al. Facial expression recognition & classification using hybridization of ICA, GA, and neural network for human-computer interaction
Tolba et al. Image signature improving by PCNN for Arabic sign language recognition
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN117274656B (en) Multi-mode model countermeasure training method based on self-adaptive depth supervision module
CN117333753A (en) Fire detection method based on PD-YOLO
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN115829962B (en) Medical image segmentation device, training method, and medical image segmentation method
CN116665300A (en) Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network
Mocanu et al. Multimodal convolutional neural network for object detection using rgb-d images
Venkatesh Object tracking in games using convolutional neural networks
CN116542292B (en) Training method, device, equipment and storage medium of image generation model
CN115984652B (en) Training method and device for symbol generation system, electronic equipment and storage medium
CN117992800B (en) Image-text data matching detection method, device, equipment and medium
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
Yiqiao Deep Learning Notes
Ding A detachable lstm with residual-autoencoder features method for motion recognition in video sequences

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant