CN117037037A - Abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes - Google Patents
- Publication number
- CN117037037A (application CN202311033761.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- igan
- vit
- privacy
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V 20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/084 — Backpropagation, e.g. using gradient descent
- G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V 10/764, G06V 10/765 — Recognition using classification, e.g. of video objects
- G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V 10/82 — Recognition using neural networks
- G06V 40/20 — Movements or behaviour, e.g. gesture recognition
- Y02T 10/40 — Engine management systems
Abstract
The invention discloses an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes, belonging to the fields of computer vision and behavior recognition, and comprising the following specific steps: performing dimension-reduction encoding on the monitoring video with a multilayer compressed sensing technique to obtain video data in a visually privacy-protected state; marking the target regions of interest in the original video with the ViT visual self-attention mechanism for use in network modeling; inputting the visually hidden video and the foreground-marked video respectively into an improved generative adversarial network (iGAN) model, which realizes spatio-temporal feature compensation of the visually hidden video data through iterative training; and, for typical abuse behaviors, performing behavior detection on the hidden-state spatio-temporal feature-compensated data with the ViT-iGAN classification network. The invention maintains the visual privacy-protection effect while achieving high detection performance for typical elder-abuse behaviors; by balancing privacy protection and intelligent monitoring, it has strong practical application value.
Description
Technical Field
The invention relates to the fields of computer vision and behavior recognition, and in particular to an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes.
Background
World Health Organization reports show that the population aged 60 and over worldwide will double between 2015 and 2050, growing from 900 million to about 2 billion people and accounting for 22% of the world population. Abuse can cause serious physical injury and long-term psychological consequences for the elderly, so timely and accurate detection of abusive behavior is an urgent need.
Existing abuse-detection techniques fall broadly into two categories. The first is wearable sensors; however, such devices are inconvenient to put on and take off, are easily forgotten, and place an additional burden on the elderly — in practice, their effectiveness depends largely on the user's ability and willingness. The second is computer-vision-based detection; with rich monitoring information, a non-contact monitoring mode, and easy later maintenance, it is better suited to elder-care scenarios.
With the continuing decline of camera costs and the spread of public-space monitoring, many elder-care institutions and even individual families are gradually willing to accept cameras installed in living areas to ensure the safety of persons and property. Monitoring the living area with cameras allows timely attention to the behavior of the elderly and reduces the likelihood of abusive behavior. However, because the camera monitors its subjects in real time, the video is uploaded and processed over a network or other communication channel. In this process there is a risk of personal-privacy disclosure, including exposure of private information about furnishings or the living environment. A computer-vision-based behavior detection method must therefore reconcile security monitoring with privacy protection. Moreover, while protecting the privacy of the monitoring video, subsequent intelligent detection must also remain feasible; otherwise, an over-privacy-protected video loses its practical application value. Developing a reliable typical-abuse-behavior detection system for privacy-protected monitoring video in elder-care privacy scenes therefore has clear market prospects for home care and elder-care institutions alike.
Disclosure of Invention
Addressing the privacy-invasion problem of existing detection techniques, the invention provides an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes. The method uses the image irrecoverability conferred by the random observation matrices of a multilayer compressed sensing technique to guarantee visual occlusion of the video, and proposes combining a ViT visual-attention mechanism with an improved generative adversarial network (iGAN) to automatically compensate the spatio-temporal features of the foreground regions of the hidden-state video and effectively detect typical abuse behaviors.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes comprises the following steps:
Step 1: based on publicly available datasets, collect privacy-scene monitoring video data and divide it into a training set and a test set.
Step 2: apply dimension-reduction encoding to the video monitoring data of step 1 using a multilayer compressed sensing technique (Compressed Sensing, CS), so that the content reaches a visually privacy-protected state and a visually hidden video is obtained.
Step 3: use a visual self-attention model (Vision Transformer, ViT) to mark the target regions of interest in the original video of the model-training process, obtaining a foreground-marked video for subsequent network modeling.
Step 4: input the hidden-state video of step 2 and the foreground-marked video of step 3 respectively into an improved generative adversarial network (improved Generative Adversarial Network, iGAN), and perform information compensation through iterative training to obtain spatio-temporal feature-compensated data.
Step 5: for typical abuse behaviors in the elder-care process, combine ViT and iGAN into a ViT-iGAN model, and use the added ResNet18 classification network to perform behavior detection on the spatio-temporal feature-compensated data of step 4.
Further, in step 2, the multilayer compressed sensing technique applies dimension-reduction encoding directly to the videos in the dataset using random observation matrices, defined as:
Y^(n) = Φ^(n)(Φ^(n−1)(…Φ^(1)(X)…))
where X is the input video, Φ^(n) is the observation matrix of the n-th layer, and Y^(n) is the observation obtained after X passes through n CS layers. As the number of dimension-reduction layers increases, the image quality degrades continuously and the video content gradually reaches a visually privacy-protected state.
Further, in step 3, the ViT model uses a Multi-Head Self-Attention (MHSA) mechanism, which has good robustness and stability, to mark the target regions of interest in the video; the foreground-marking effect is controlled through the number of MHSA heads. MHSA is defined as follows:
head_h = softmax(Q_h · K_h^T / √d_k) · V_h,  MHSA(X) = Concat(head_1, …, head_n) · W_out
where X is the input video, n is the number of heads; Q_h, K_h and V_h are the three key vectors obtained by multiplying X with each group's distinct parameter matrices W_Q^h, W_K^h and W_V^h; d_k is the dimension of the K_h vector; W_out is the weight applied to the concatenated heads; Concat is the splicing operation; and MHSA(X) is the output video.
Further, in step 4, the applied iGAN network is obtained by improving the generative adversarial network (GAN) so as to compensate the spatio-temporal features of the video. The improvements are as follows:
the numbers of CNN layers of the generator and discriminator in the iGAN network are changed according to the differing sizes of the input hidden-state video and the foreground-marked video; the discriminator labels are redefined according to the iGAN network's information compensation for the various behavior classes;
the generator and the discriminator are both CNN structures, and the loss function is back-propagated to the generator and discriminator according to the chain rule, continuously updating the weights and other parameters;
the generator and arbiter loss functions of the iGAN network are as follows:
L D (o,l,c)=BCE(D(o),l)+BCE(D(G(c)),l)
L G (l,c)=BCE(D(G(c)),l)+CE(C(G(c),l)
wherein L is D Is the loss function of the discriminator, L G Is the loss function of the generator, o is the original video data, l is the original video data label, C is the video data in the hidden state, BCE is the binary cross entropy, CE is the cross entropy, D is the discriminator, G is the generator, C is the classifier, and G (C) is the compensated video data generated by the generator.
Further, in step 5, the ResNet18 network consists of 17 convolutional layers and 1 fully connected layer. In model training, the classifier loss function is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
where L_C is the classifier loss function, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is cross entropy, G is the generator, C is the classifier, G(c) is the compensated video data produced by the generator, argmax is the pseudo-label-generating function, and t is the pseudo-label threshold.
Compared with the prior art, the invention has the following beneficial effects:
1. Addressing the visual privacy-leakage risk of monitoring video, the invention samples the video with random observation matrices, achieving a visual privacy-protection effect.
2. Addressing the reduced useful information, introduced noise, and background interference of the visually privacy-protected video, while also meeting the behavior-detection accuracy required by practical application scenarios, the invention uses the ViT-iGAN network model to compensate the spatio-temporal features of the visually hidden video and to detect typical abuse behaviors within it.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of image privacy protection based on multi-layer compressed sensing in an embodiment of the present invention;
FIG. 3 is a schematic diagram of image foreground marking based on the ViT attention mechanism in an embodiment of the invention;
FIG. 4 shows the detection results (confusion matrix) of the NTU RGB+D60 data set according to the present invention;
FIG. 5 shows the detection result (confusion matrix) of the ISR-UoL 3D society data set.
Detailed Description
The present invention will be further elucidated with reference to the drawings and the specific embodiments, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Referring to fig. 1, the embodiment of the invention provides an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes, comprising the following steps.
Step 1: based on publicly available datasets, collect privacy-scene monitoring video data and divide it into a training set and a test set:
Because publicly available datasets of elder-abuse behavior are almost nonexistent, public indoor-scene two-person violent-behavior video data were selected to verify the effectiveness of the method, and divided into training and test datasets. The public video datasets contain many kinds of behavior (such as walking, mopping, fighting, etc.), so the training data must be preprocessed: the violent behaviors in each video are collected and cut into short clips consisting of a limited number of frames, each clip containing only one class of behavior. The data are divided into sample sets by the action category in the clips, and the corresponding class labels are marked for subsequent network-model training. To further verify the validity of the method, large-sample and small-sample datasets were acquired respectively.
The large-sample dataset of this example is the NTU RGB+D 60 dataset, containing 60 human action categories in total, each captured simultaneously from multiple angles by three cameras at a resolution of 1920×1080. A violent-behavior training set was collected from it, comprising three categories: beating, kicking and pushing. For training, each class was divided into 300 valid video clips — 900 in total — each about 1.2 seconds long at 30 frames per second.
The small-sample dataset of this example is the ISR-UoL 3D Social Activity dataset, a human-interaction video dataset performed by 6 participants and covering 8 social activities: handshaking, greeting, walking assistance, standing assistance, cradling, pushing, talking, and drawing attention, each recorded for roughly 30 to 60 seconds. The dataset consists of 10 sessions, each providing RGB and skeleton images of the 8 activities performed by two persons. A violent-behavior training set was collected from the RGB images, comprising three categories: beating, kicking and pushing, with 1000 valid consecutive video frames per class at a resolution of 640×480.
Step 2: apply dimension-reduction encoding to the video monitoring data of step 1 using the multilayer compressed sensing technique CS, so that the content reaches a visually privacy-protected state and a visually hidden video is obtained:
A video frame X containing M×N pixel values is divided into n image blocks x_n of 2×2 pixels each, i.e. X = [x_1, x_2, …, x_n]; likewise, the observation matrix Φ containing M×N elements is divided into n blocks Φ_n of size 2×2, i.e. Φ = [Φ_1, Φ_2, …, Φ_n]. The inner product of each matrix block with the image block at the same position is then computed — that is, each image block is observed separately — to obtain the observation Y:
Y = ΦX
The above is a single compressed-sensing pass. Using the same observation matrix at every layer of the multilayer dimension reduction would solidify the inter-layer relationship, enlarge the degree of discreteness, and weaken the video's detail information; therefore, in the invention each layer uses a different observation matrix for multilayer compressed sensing, defined as:
Y^(n) = Φ^(n)(Φ^(n−1)(…Φ^(1)(X)…))
where Φ^(n) is the observation matrix of the n-th layer and Y^(n) is the observation obtained after the original video X passes through n compressed-sensing layers.
Referring to fig. 2, as the number of compression layers increases, information loss grows and the image becomes progressively blurred. After three CS layers, the frame details of the monitored subject are difficult to identify and the visual-privacy-protection effect is achieved; after four CS layers, the overall details, background and noise of the video frame mix together and become even harder to distinguish.
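The block-wise multilayer dimension reduction described above can be sketched as follows. This is a minimal illustration in NumPy, assuming Gaussian random observation matrices and the 2×2 block size stated in the text; the function names and seed handling are illustrative, not part of the patent:

```python
import numpy as np

def cs_layer(frame, rng):
    """One CS pass: inner product of every 2x2 image block with the
    co-located 2x2 block of a random observation matrix, so both
    spatial dimensions are halved."""
    m, n = frame.shape
    phi = rng.standard_normal((m, n))        # random observation matrix Phi
    bx = frame.reshape(m // 2, 2, n // 2, 2)
    bp = phi.reshape(m // 2, 2, n // 2, 2)
    return (bx * bp).sum(axis=(1, 3))        # one observed value per block

def multilayer_cs(frame, layers, seed=0):
    """Y^(n): apply `layers` CS passes, each with a fresh observation
    matrix, as the multilayer definition above requires."""
    rng = np.random.default_rng(seed)
    y = frame.astype(float)
    for _ in range(layers):
        y = cs_layer(y, rng)
    return y

frame = np.arange(64 * 64, dtype=float).reshape(64, 64)
print(multilayer_cs(frame, layers=3).shape)   # (8, 8)
```

Each pass halves both dimensions, so three CS layers reduce a 64×64 frame to 8×8 — the resolution loss that produces the visual-privacy effect of fig. 2.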
Step 3: using the visual self-attention model ViT, mark the target regions of interest in the original video of the model-training process, obtaining a foreground-marked video for subsequent network modeling:
Since the key information for violent-behavior recognition is the foreground object, while the background information in the video is interference, this example processes the original violent-behavior video with the MHSA mechanism of a ViT model to obtain a foreground-marked video.
The conventional ViT model is used for image classification; this example instead focuses on marking the target regions of interest in the video to achieve a foreground-marking effect. The Self-Attention (SA) mechanism of ViT works as follows: the input matrix X is multiplied by three trainable parameter matrices W_Q, W_K and W_V to obtain three key vectors Q (Query), K (Key) and V (Value). Q is multiplied with the transpose of K to produce a similarity matrix; for more stable gradient updates during training, every element of the similarity matrix is divided by √d_k, where d_k is the dimension of the K vector, and the result is normalized with the softmax function to obtain a weight matrix. Finally, the normalized weight matrix is multiplied with the vector V and the weighted sum is computed, yielding the output matrix SA(X):
SA(X) = softmax(Q · K^T / √d_k) · V
To enhance the fitting performance of the network, this example marks the target regions of the video with the MHSA mechanism, controlling the foreground-marking effect through the number of heads. The MHSA mechanism works as follows: for the input matrix X, multiple groups of distinct W_Q, W_K and W_V are defined; each group computes its own Q_h, K_h and V_h and learns a different weight matrix, with each subsequent layer taking the previous layer's output directly as input. Finally, the results of the several heads are spliced together (Concat) and multiplied by the weight W_out to obtain the final output matrix MHSA(X):
MHSA(X) = Concat(head_1, …, head_n) · W_out,  head_h = softmax(Q_h · K_h^T / √d_k) · V_h
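The SA and MHSA computations above can be sketched numerically. This is a minimal NumPy rendition under assumed toy dimensions (4 tokens, model width 8, 2 heads of width 4); the variable names are illustrative and no trained ViT weights are involved:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(X, W_q, W_k, W_v, W_out):
    """Per head h: Q_h = X W_q[h], K_h = X W_k[h], V_h = X W_v[h];
    head_h = softmax(Q_h K_h^T / sqrt(d_k)) V_h;
    MHSA(X) = Concat(head_1..head_n) W_out."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # normalized similarity matrix
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_out

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_k = 4, 8, 2, 4
X = rng.standard_normal((n_tokens, d_model))
W_q = rng.standard_normal((n_heads, d_model, d_k))
W_k = rng.standard_normal((n_heads, d_model, d_k))
W_v = rng.standard_normal((n_heads, d_model, d_k))
W_out = rng.standard_normal((n_heads * d_k, d_model))
print(mhsa(X, W_q, W_k, W_v, W_out).shape)   # (4, 8)
```

In the patent's use, the attention weights A over patch tokens are what highlight the foreground region; the head count n trades off how sharply the marking concentrates on the target.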
referring to fig. 3, the first line is a few frames in the original video, and the second line is obtained through the MHSA mechanism, it can be seen that the MHSA mechanism is similar to the human eye observing things, preferentially noticing the target area of interest (foreground) in the video frame, and disregarding the non-important area (background). The effect of the MHSA mechanism visualization is characterized by marking the foreground on the image.
Step 4: input the hidden-state video of step 2 and the foreground-marked video of step 3 respectively into the improved generative adversarial network, and perform information compensation through iterative training to obtain spatio-temporal feature-compensated data:
Conventional generative adversarial networks focus mainly on generating samples from noise to expand a dataset; this example instead focuses on compensating image features, adopting the improved generative adversarial network iGAN model to compensate the spatio-temporal features of the visually privacy-protected video. Both the generator network G and the discriminator network D are CNNs; given the size difference between the hidden-state video and the original video, the generator is changed to a two-layer CNN and the discriminator to a six-layer CNN. A conventional discriminator uses true/false labels to judge the authenticity of the generator's output; since this embodiment performs feature-transfer compensation for 3 classes of violent behavior, the labels are redefined as the behavior classes 0, 1 and 2. In the overall training structure, the labels are fed back to the generator and discriminator after computation and comparison, and the video data produced by the generator is purposefully matched, for training, with the foreground-marked video data bearing the same label; in this process the missing foreground information is compensated, strengthening the spatio-temporal features of the hidden-state video data.
Both the generator and the discriminator are CNN structures, and the loss function is back-propagated to them according to the chain rule, continuously updating the weights and other parameters. The generator and discriminator loss functions of the iGAN network are, respectively:
L_D(o, l, c) = BCE(D(o), l) + BCE(D(G(c)), l)
L_G(l, c) = BCE(D(G(c)), l) + CE(C(G(c)), l)
where L_D is the discriminator loss function, L_G is the generator loss function, o is the original video data, l is the original video data label, c is the hidden-state video data, BCE is binary cross entropy, CE is cross entropy, D is the discriminator, G is the generator, C is the classifier, and G(c) is the compensated video data produced by the generator.
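A minimal numerical sketch of these two losses, using NumPy with toy discriminator scores and classifier logits (the values, and the convention that D outputs a probability in (0, 1), are illustrative assumptions; the formulas follow the definitions above):

```python
import numpy as np

def bce(p, y):
    """Binary cross entropy over discriminator scores p in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ce(logits, labels):
    """Cross entropy over class logits with integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

# toy batch (illustrative values, not from the patent)
d_real = np.array([0.9, 0.8])          # D(o): discriminator on real videos
d_fake = np.array([0.2, 0.3])          # D(G(c)): discriminator on compensated videos
l = np.array([1.0, 1.0])               # label l
c_logits = np.array([[2.0, 0.1, 0.1],
                     [0.1, 2.0, 0.1]])  # C(G(c)): classifier logits over 3 behaviors
cls = np.array([0, 1])

L_D = bce(d_real, l) + bce(d_fake, l)      # discriminator loss
L_G = bce(d_fake, l) + ce(c_logits, cls)   # generator loss
print(L_D > 0.0 and L_G > 0.0)             # True
```

Note that the generator loss couples the adversarial BCE term with the classifier's CE term, which is what steers the compensated video toward class-discriminative features rather than mere realism.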
Step 5: for typical abuse behaviors in the elder-care process, combine ViT and iGAN into the ViT-iGAN model, and use the added ResNet18 classification network to perform behavior detection on the spatio-temporal feature-compensated data of step 4:
The ResNet18 network serves as the classifier of the ViT-iGAN model and consists of 17 convolutional layers and 1 fully connected layer. During training, all available real data carry labels, whereas the compensated-state video data produced by the generator carry none. A pseudo-labeling scheme is therefore used for the compensated video data: based on the classifier's current state, the most probable class is assumed as the label. A generated image and its label are retained only when the model predicts the sample's class with sufficiently high confidence, i.e. with probability above a given threshold. The classifier loss function is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
where L_C is the classifier loss function, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is cross entropy, G is the generator, C is the classifier, G(c) is the compensated video data produced by the generator, argmax is the pseudo-label-generating function, and t is the pseudo-label threshold.
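The pseudo-label mechanism described above can be sketched as follows. This is a hedged NumPy illustration, assuming a softmax-probability reading of "confidence" and a default threshold t = 0.8 (the threshold value and function names are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, labels):
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def classifier_loss(logits_real, labels, logits_fake, t=0.8):
    """L_C: CE on labelled real data, plus CE on generated samples whose
    top predicted probability exceeds the pseudo-label threshold t; the
    argmax class of each retained sample serves as its pseudo-label."""
    loss = ce(logits_real, labels)
    p = softmax(logits_fake)
    keep = p.max(axis=1) > t                 # keep only confident predictions
    if keep.any():
        pseudo = p[keep].argmax(axis=1)
        loss += ce(logits_fake[keep], pseudo)
    return loss

logits_real = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = np.array([0, 1])
confident = np.array([[6.0, 0.0, 0.0]])      # max prob ~0.995 > t: retained
unsure = np.array([[0.1, 0.0, 0.0]])         # max prob ~0.36: discarded
print(classifier_loss(logits_real, labels, unsure) ==
      ce(logits_real, labels))               # True: no pseudo-label term added
```

Discarding low-confidence generated samples keeps unreliable pseudo-labels from contaminating the classifier early in training, when the generator's compensated videos are still poor.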
The behavior-detection performance of the present method can be further illustrated by the following simulation experiments:
The previously acquired NTU RGB+D 60 and ISR-UoL 3D Social Activity datasets were each encoded through several CS layers to obtain the corresponding CS1-, CS2-, CS3- and CS4-layer video sets. As shown in fig. 2, at the CS3 layer the detail information of the example image can no longer be discerned and the visually privacy-protected state is reached, while at the CS4 layer the image's details, background and noise merge together. Considering that the datasets are intended for intelligent detection — an over-privacy-protected video loses practical application value — the CS3-layer video sets were selected for the subsequent behavior-recognition experiments.
In addition, the target regions of interest in the two datasets were marked through the MHSA mechanism of ViT. As shown in fig. 3, after the original video passes through the MHSA mechanism, the background information in the image is weakened and the target persons are prominently marked, yielding the foreground-marked data used for subsequent model training.
To improve the generalization ability of the model, 5-fold cross-validation was adopted in the experiments. To preserve the temporal information within videos, all frames of one video clip are selected as a group for training, so that the spatio-temporal information is better compensated.
In the training stage, the CS3 layer training video set is input into the iGAN model, preprocessed to the fixed resolution of 32 multiplied by 32 and then sent into the generator, the foreground marking video set input model is preprocessed to the fixed resolution of 128 multiplied by 128 and then sent into the discriminator and the classifier together with the class label, and the training model is stored after 300 times of iterative training.
In the test stage, the model saved in the training stage is loaded; the CS3 layer test video set is preprocessed to a fixed resolution of 32×32 and input into the generator to produce a 128×128 compensated video set, which is then fed into the classification network of the model for detection, outputting the violence detection result.
Referring to figs. 4 and 5, the ViT-iGAN model is used for violence detection on the CS3 layer NTU RGB+D 60 dataset and the CS3 layer ISR-UoL 3D Social Activity dataset respectively; to characterize the detection performance more intuitively, a confusion matrix is computed. The vertical axis gives the true labels of the video behaviors (kicking, beating and pushing) and the horizontal axis gives the model's detection results, i.e. the classifier's predicted labels for the CS3 layer videos. Taking the first row of fig. 4 as an example: of 1000 kicking actions in the input video, 997 are detected as kicks, 0 as beats and 3 as pushes; the other rows are read the same way. This demonstrates the effectiveness of the method for behavior detection on visually hidden video.
Claims (5)
1. A privacy protection ViT-iGAN video behavior detection method for preventing abuse in pension privacy scenes, characterized by comprising the following steps:
step 1: based on public datasets, collecting private-scene surveillance video data and dividing it into a training set and a test set;
step 2: performing dimension-reduction encoding on the surveillance video data of step 1 with the multi-layer compressed sensing technology (CS), so that the content reaches a visual privacy protection state, obtaining visually hidden videos;
step 3: using the visual self-attention model ViT to mark regions of interest in the original videos during model training, obtaining foreground-marked videos for subsequent network modeling;
step 4: inputting the hidden-state videos of step 2 and the foreground-marked videos of step 3 into the improved generative adversarial network (iGAN) respectively, and performing information compensation through iterative training to obtain spatio-temporal feature compensation data;
step 5: for typical abuse behaviors in the pension process, combining ViT and iGAN into a ViT-iGAN model, and using the added classification network ResNet18 to perform behavior detection on the spatio-temporal feature compensation data of step 4.
2. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 2, the multi-layer compressed sensing technology directly performs dimension-reduction encoding on the videos in the dataset using a random observation matrix, defined as:

Y^(n) = Φ^(n) Y^(n-1), with Y^(0) = X,

wherein X is the input video, Φ^(n) is the observation matrix of the n-th layer, and Y^(n) is the observation obtained by processing X through n CS layers.
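A minimal NumPy sketch of the layered encoding above. The patent does not specify the observation-matrix distribution or the per-layer reduction ratio; the Gaussian matrix and `ratio=0.5` below are assumptions for illustration.

```python
import numpy as np

def cs_encode(x, n_layers, ratio=0.5, seed=0):
    """Multi-layer compressed-sensing encoding (illustrative sketch).
    Each layer multiplies the flattened frame by a random Gaussian
    observation matrix Phi^(n), shrinking the dimension by `ratio`."""
    rng = np.random.default_rng(seed)
    y = x.astype(float)
    for _ in range(n_layers):
        m = max(1, int(y.shape[0] * ratio))            # reduced dimension
        phi = rng.standard_normal((m, y.shape[0])) / np.sqrt(m)
        y = phi @ y                                    # Y^(n) = Phi^(n) Y^(n-1)
    return y

frame = np.arange(1024, dtype=float)   # a flattened 32x32 frame
y3 = cs_encode(frame, n_layers=3)      # analogous to the CS3 layer
print(y3.shape)                        # (128,) after three halvings
```

Deeper layers discard progressively more detail, which is what drives the CS3/CS4 trade-off between privacy protection and usable information described in the experiments.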
3. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 3, the ViT model uses the multi-head self-attention mechanism (MHSA) to mark regions of interest in the video, and the effect of the foreground marking is controlled by the number of MHSA heads; MHSA is defined as:

head_h = softmax(Q_h K_h^T / √d_k) V_h,
MHSA(X) = Concat(head_1, …, head_n) W_out,

wherein X is the input video, n is the number of heads, Q_h, K_h and V_h are the three key vectors obtained by multiplying X with each head's parameter matrices W_h^Q, W_h^K and W_h^V, d_k is the dimension of the vector K_h, W_out is the output weight matrix over the heads, Concat is the concatenation operation, and MHSA(X) is the output video.
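The MHSA definition of claim 3 can be sketched in NumPy as follows; the token count, embedding size and per-head dimension are illustrative choices, not values from the patent.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, w_out):
    """Multi-head self-attention over a token matrix x of shape (T, d).
    wq/wk/wv are lists of per-head projections (d, d_k); w_out maps the
    concatenated heads (n*d_k) back to d."""
    heads = []
    for q_p, k_p, v_p in zip(wq, wk, wv):
        q, k, v = x @ q_p, x @ k_p, x @ v_p
        d_k = q.shape[-1]
        attn = softmax(q @ k.T / np.sqrt(d_k))  # (T, T) attention weights
        heads.append(attn @ v)                  # (T, d_k) per-head output
    return np.concatenate(heads, axis=-1) @ w_out

rng = np.random.default_rng(0)
T, d, d_k, H = 16, 32, 8, 4                     # illustrative sizes
x = rng.standard_normal((T, d))
wq = [rng.standard_normal((d, d_k)) for _ in range(H)]
wk = [rng.standard_normal((d, d_k)) for _ in range(H)]
wv = [rng.standard_normal((d, d_k)) for _ in range(H)]
w_out = rng.standard_normal((H * d_k, d))
print(mhsa(x, wq, wk, wv, w_out).shape)         # (16, 32)
```

The attention map `attn` is what concentrates weight on the foreground person; varying `H` changes how finely the marking is distributed, which is the head-count control mentioned in the claim.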
4. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 4, the applied iGAN network is obtained by improving the generative adversarial network (GAN) and is used to compensate the spatio-temporal features of the video; the improvements are as follows:
the numbers of CNN layers of the generator and discriminator in the iGAN network are changed according to the different sizes of the input hidden-state videos and foreground-marked videos; the discriminator labels are redefined according to the iGAN network's information compensation for the various behaviors;
both the generator and the discriminator are CNN structures; the loss is back-propagated to the generator and discriminator according to the chain rule, continuously updating the weights and other parameters;
the generator and arbiter loss functions of the iGAN network are as follows:
L_D(o, l, c) = BCE(D(o), l) + BCE(D(G(c)), l)
L_G(l, c) = BCE(D(G(c)), l) + CE(C(G(c)), l)
wherein L_D is the loss function of the discriminator, L_G is the loss function of the generator, o is the original video data, l is the original video data label, c is the hidden-state video data, BCE is the binary cross-entropy, CE is the cross-entropy, D is the discriminator, G is the generator, C is the classifier, and G(c) is the compensated video data generated by the generator.
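The two losses can be illustrated numerically. The sketch below uses the conventional GAN targets, real samples labeled 1 and generated samples labeled 0 for the discriminator; the claim's redefined discriminator labels are not specified in enough detail to reproduce, so this is an assumed standard reading on a toy batch.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for discriminator scores in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)).mean())

def ce(logits, labels):
    """Cross-entropy over class logits; labels are integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

# toy batch: D(o) = scores on real videos, D(G(c)) = scores on generated ones
d_real, d_fake = np.array([0.9, 0.8]), np.array([0.2, 0.3])
c_logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]])  # classifier logits on G(c)
labels = np.array([0, 1])

L_D = bce(d_real, 1.0) + bce(d_fake, 0.0)  # discriminator: real -> 1, fake -> 0
L_G = bce(d_fake, 1.0) + ce(c_logits, labels)  # generator fools D and helps C
print(round(L_D, 3), round(L_G, 3))  # L_D ≈ 0.454, L_G ≈ 1.668
```

The opposed BCE terms are the adversarial game; the extra CE term pulls the generator toward compensated videos that the classifier can label correctly.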
5. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 5, the ResNet18 network consists of 17 convolutional layers and 1 fully connected layer; during model training, the loss function of the classifier is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
wherein L_C is the loss function of the classifier, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is the cross-entropy, C is the classifier, G is the generator, G(c) is the compensated video data generated by the generator, argmax is the function generating the pseudo-labels, and t is the pseudo-label threshold.
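One plausible reading of "argmax(C(G(c))) > t" is confidence-thresholded pseudo-labeling: the generated video's own argmax prediction is used as its label, but only where the softmax confidence exceeds t. The sketch below implements that assumed reading on toy logits; it is not a verbatim reproduction of the patent's training code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, labels):
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def classifier_loss(logits_o, labels, logits_gc, t=0.8):
    """L_C sketch: supervised CE on the original videos, plus CE on the
    generated videos against their own argmax pseudo-labels, retained
    only where the softmax confidence exceeds the threshold t."""
    loss = ce(logits_o, labels)
    probs = softmax(logits_gc)
    pseudo = probs.argmax(axis=1)
    keep = probs.max(axis=1) > t       # confident pseudo-labels only
    if keep.any():
        loss += ce(logits_gc[keep], pseudo[keep])
    return loss

logits_o = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = np.array([0, 1])
logits_gc = np.array([[4.0, 0.0, 0.0],      # confident -> kept
                      [0.5, 0.6, 0.4]])     # low confidence -> dropped
print(round(classifier_loss(logits_o, labels, logits_gc, t=0.8), 3))  # ≈ 0.131
```

Dropping low-confidence pseudo-labels keeps noisy generator outputs from corrupting the classifier early in the 300-iteration training described in the experiments.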
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311033761.2A CN117037037A (en) | 2023-08-16 | 2023-08-16 | Anti-abuse privacy protection ViT-iGAN video behavior detection method for endowment privacy scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117037037A true CN117037037A (en) | 2023-11-10 |
Family
ID=88601999
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||