CN117037037A - Abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes - Google Patents
- Publication number
- CN117037037A (application CN202311033761.2A)
- Authority
- CN
- China
- Prior art keywords
- video
- igan
- vit
- privacy
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V 20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/084 — Backpropagation, e.g. using gradient descent
- G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V 10/764, G06V 10/765 — Recognition using classification, e.g. of video objects
- G06V 10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V 10/82 — Recognition using neural networks
- G06V 40/20 — Movements or behaviour, e.g. gesture recognition
- Y02T 10/40 — Engine management systems
Abstract
The invention discloses an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes, belonging to the fields of computer vision and behavior recognition, and comprising the following specific steps: performing dimension-reduction encoding on the monitoring video with a multilayer compressed sensing technique to obtain video data in a visually privacy-protected state; marking the target regions of interest in the original video with the ViT visual self-attention mechanism for use in network modeling; inputting the visually hidden video and the foreground-marked video respectively into an improved generative adversarial network (iGAN) model, which realizes spatio-temporal feature compensation of the visually hidden video data through iterative training; and, for typical abuse behaviors, performing behavior detection on the hidden-state spatio-temporal feature-compensated data with the ViT-iGAN classification network. The invention maintains the visual privacy-protection effect while achieving high detection performance for typical elder-abuse behaviors; by balancing privacy protection and intelligent monitoring, it has strong practical application value.
Description
Technical Field
The invention relates to the fields of computer vision and behavior recognition, and in particular to an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes.
Background
World Health Organization reports show that the population aged 60 and over worldwide will double between 2015 and 2050, growing from 900 million to about 2 billion people and accounting for 22% of the world population. Abuse can cause serious physical injury and long-term psychological consequences for the elderly, so timely and accurate detection of abusive behavior is an urgent need.
Existing abuse-detection techniques fall broadly into two categories. The first is wearable sensors; however, such devices are inconvenient to put on and take off, are easily forgotten, and place an additional burden on the elderly — in practice, their effectiveness depends largely on the user's ability and willingness. The second is computer-vision-based detection; with rich monitoring information, a non-contact monitoring mode, and easy later maintenance, it is better suited to elder-care scenarios.
With the continuing decline of camera costs and the spread of public-space monitoring, many elder-care institutions and even individual families are gradually willing to accept cameras installed in living areas to ensure the safety of persons and property. Monitoring the living area with cameras allows timely attention to the behavior of the elderly and reduces the likelihood of abusive behavior. However, because the camera monitors its subjects in real time, the video is uploaded and processed over a network or other communication channel. In this process there is a risk of personal-privacy disclosure, including exposure of private information about furnishings or the living environment. A computer-vision-based behavior detection method must therefore reconcile security monitoring with privacy protection. Moreover, while protecting the privacy of the monitoring video, subsequent intelligent detection must also remain feasible; otherwise, an over-privacy-protected video loses its practical application value. Developing a reliable typical-abuse-behavior detection system for privacy-protected monitoring video in elder-care privacy scenes therefore has clear market prospects for home care and elder-care institutions alike.
Disclosure of Invention
Addressing the privacy-invasion problem of existing detection techniques, the invention provides an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes. The method uses the image irrecoverability conferred by the random observation matrices of a multilayer compressed sensing technique to guarantee visual occlusion of the video, and proposes combining a ViT visual-attention mechanism with an improved generative adversarial network (iGAN) to automatically compensate the spatio-temporal features of the foreground regions of the hidden-state video and effectively detect typical abuse behaviors.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes comprises the following steps:
Step 1: based on publicly available datasets, collect privacy-scene monitoring video data and divide it into a training set and a test set.
Step 2: apply dimension-reduction encoding to the video monitoring data of step 1 using a multilayer compressed sensing technique (Compressed Sensing, CS), so that the content reaches a visually privacy-protected state and a visually hidden video is obtained.
Step 3: use a visual self-attention model (Vision Transformer, ViT) to mark the target regions of interest in the original video of the model-training process, obtaining a foreground-marked video for subsequent network modeling.
Step 4: input the hidden-state video of step 2 and the foreground-marked video of step 3 respectively into an improved generative adversarial network (improved Generative Adversarial Network, iGAN), and perform information compensation through iterative training to obtain spatio-temporal feature-compensated data.
Step 5: for typical abuse behaviors in the elder-care process, combine ViT and iGAN into a ViT-iGAN model, and use the added ResNet18 classification network to perform behavior detection on the spatio-temporal feature-compensated data of step 4.
Further, in step 2, the multilayer compressed sensing technique applies dimension-reduction encoding directly to the videos in the dataset using random observation matrices, defined as:
Y^(n) = Φ^(n)(Φ^(n−1)(…Φ^(1)(X)…))
where X is the input video, Φ^(n) is the observation matrix of the n-th layer, and Y^(n) is the observation obtained after X passes through n CS layers. As the number of dimension-reduction layers increases, the image quality degrades continuously and the video content gradually reaches a visually privacy-protected state.
Further, in step 3, the ViT model uses a Multi-Head Self-Attention (MHSA) mechanism, which has good robustness and stability, to mark the target regions of interest in the video; the foreground-marking effect is controlled through the number of MHSA heads. MHSA is defined as follows:
head_h = softmax(Q_h · K_h^T / √d_k) · V_h,  MHSA(X) = Concat(head_1, …, head_n) · W_out
where X is the input video, n is the number of heads; Q_h, K_h and V_h are the three key vectors obtained by multiplying X with each group's distinct parameter matrices W_Q^h, W_K^h and W_V^h; d_k is the dimension of the K_h vector; W_out is the weight applied to the concatenated heads; Concat is the splicing operation; and MHSA(X) is the output video.
Further, in step 4, the applied iGAN network is obtained by improving the generative adversarial network (GAN) so as to compensate the spatio-temporal features of the video. The improvements are as follows:
the numbers of CNN layers of the generator and discriminator in the iGAN network are changed according to the differing sizes of the input hidden-state video and the foreground-marked video; the discriminator labels are redefined according to the iGAN network's information compensation for the various behavior classes;
the generator and the discriminator are both CNN structures, and the loss function is back-propagated to the generator and discriminator according to the chain rule, continuously updating the weights and other parameters;
the generator and arbiter loss functions of the iGAN network are as follows:
L D (o,l,c)=BCE(D(o),l)+BCE(D(G(c)),l)
L G (l,c)=BCE(D(G(c)),l)+CE(C(G(c),l)
wherein L is D Is the loss function of the discriminator, L G Is the loss function of the generator, o is the original video data, l is the original video data label, C is the video data in the hidden state, BCE is the binary cross entropy, CE is the cross entropy, D is the discriminator, G is the generator, C is the classifier, and G (C) is the compensated video data generated by the generator.
Further, in step 5, the ResNet18 network consists of 17 convolutional layers and 1 fully connected layer. In model training, the classifier loss function is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
where L_C is the classifier loss function, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is cross entropy, G is the generator, C is the classifier, G(c) is the compensated video data produced by the generator, argmax is the pseudo-label-generating function, and t is the pseudo-label threshold.
Compared with the prior art, the invention has the following beneficial effects:
1. Addressing the visual privacy-leakage risk of monitoring video, the invention samples the video with random observation matrices, achieving a visual privacy-protection effect.
2. Addressing the reduced useful information, introduced noise, and background interference of the visually privacy-protected video, while also meeting the behavior-detection accuracy required by practical application scenarios, the invention uses the ViT-iGAN network model to compensate the spatio-temporal features of the visually hidden video and to detect typical abuse behaviors within it.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of image privacy protection based on multi-layer compressed sensing in an embodiment of the present invention;
FIG. 3 is a schematic diagram of image foreground marking based on the ViT attention mechanism in an embodiment of the invention;
FIG. 4 shows the detection results (confusion matrix) of the NTU RGB+D60 data set according to the present invention;
FIG. 5 shows the detection result (confusion matrix) of the ISR-UoL 3D society data set.
Detailed Description
The present invention will be further elucidated with reference to the drawings and the specific embodiments, in order to make the objects, technical solutions and advantages of the present invention more apparent.
Referring to fig. 1, the embodiment of the invention provides an abuse-resistant privacy-protecting ViT-iGAN video behavior detection method for elder-care privacy scenes, comprising the following steps.
Step 1: based on publicly available datasets, collect privacy-scene monitoring video data and divide it into a training set and a test set:
Because publicly available datasets of elder-abuse behavior are almost nonexistent, public indoor-scene two-person violent-behavior video data were selected to verify the effectiveness of the method, and divided into training and test datasets. The public video datasets contain many kinds of behavior (such as walking, mopping, fighting, etc.), so the training data must be preprocessed: the violent behaviors in each video are collected and cut into short clips consisting of a limited number of frames, each clip containing only one class of behavior. The data are divided into sample sets by the action category in the clips, and the corresponding class labels are marked for subsequent network-model training. To further verify the validity of the method, large-sample and small-sample datasets were acquired respectively.
The large-sample dataset of this example is the NTU RGB+D 60 dataset, containing 60 human action categories in total, each captured simultaneously from multiple angles by three cameras at a resolution of 1920×1080. A violent-behavior training set was collected from it, comprising three categories: beating, kicking and pushing. For training, each class was divided into 300 valid video clips — 900 in total — each about 1.2 seconds long at 30 frames per second.
The small-sample dataset of this example is the ISR-UoL 3D Social Activity dataset, a human-interaction video dataset performed by 6 participants and covering 8 social activities: handshaking, greeting, walking assistance, standing assistance, cradling, pushing, talking, and drawing attention, each recorded for roughly 30 to 60 seconds. The dataset consists of 10 sessions, each providing RGB and skeleton images of the 8 activities performed by two persons. A violent-behavior training set was collected from the RGB images, comprising three categories: beating, kicking and pushing, with 1000 valid consecutive video frames per class at a resolution of 640×480.
Step 2: apply dimension-reduction encoding to the video monitoring data of step 1 using the multilayer compressed sensing technique CS, so that the content reaches a visually privacy-protected state and a visually hidden video is obtained:
A video frame X containing M×N pixel values is divided into n image blocks x_n of 2×2 pixels each, i.e. X = [x_1, x_2, …, x_n]; likewise, the observation matrix Φ containing M×N elements is divided into n blocks Φ_n of size 2×2, i.e. Φ = [Φ_1, Φ_2, …, Φ_n]. The inner product of each matrix block with the image block at the same position is then computed — that is, each image block is observed separately — to obtain the observation Y:
Y = ΦX
The above is a single compressed-sensing pass. Using the same observation matrix at every layer of the multilayer dimension reduction would solidify the inter-layer relationship, enlarge the degree of discreteness, and weaken the video's detail information; therefore, in the invention each layer uses a different observation matrix for multilayer compressed sensing, defined as:
Y^(n) = Φ^(n)(Φ^(n−1)(…Φ^(1)(X)…))
where Φ^(n) is the observation matrix of the n-th layer and Y^(n) is the observation obtained after the original video X passes through n compressed-sensing layers.
Referring to fig. 2, as the number of compression layers increases, information loss grows and the image becomes progressively blurred. After three CS layers, the frame details of the monitored subject are difficult to identify and the visual-privacy-protection effect is achieved; after four CS layers, the overall details, background and noise of the video frame mix together and become even harder to distinguish.
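The block-wise multilayer dimension reduction described above can be sketched as follows. This is a minimal illustration in NumPy, assuming Gaussian random observation matrices and the 2×2 block size stated in the text; the function names and seed handling are illustrative, not part of the patent:

```python
import numpy as np

def cs_layer(frame, rng):
    """One CS pass: inner product of every 2x2 image block with the
    co-located 2x2 block of a random observation matrix, so both
    spatial dimensions are halved."""
    m, n = frame.shape
    phi = rng.standard_normal((m, n))        # random observation matrix Phi
    bx = frame.reshape(m // 2, 2, n // 2, 2)
    bp = phi.reshape(m // 2, 2, n // 2, 2)
    return (bx * bp).sum(axis=(1, 3))        # one observed value per block

def multilayer_cs(frame, layers, seed=0):
    """Y^(n): apply `layers` CS passes, each with a fresh observation
    matrix, as the multilayer definition above requires."""
    rng = np.random.default_rng(seed)
    y = frame.astype(float)
    for _ in range(layers):
        y = cs_layer(y, rng)
    return y

frame = np.arange(64 * 64, dtype=float).reshape(64, 64)
print(multilayer_cs(frame, layers=3).shape)   # (8, 8)
```

Each pass halves both dimensions, so three CS layers reduce a 64×64 frame to 8×8 — the resolution loss that produces the visual-privacy effect of fig. 2.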
Step 3: using the visual self-attention model ViT, mark the target regions of interest in the original video of the model-training process, obtaining a foreground-marked video for subsequent network modeling:
Since the key information for violent-behavior recognition is the foreground object, while the background information in the video is interference, this example processes the original violent-behavior video with the MHSA mechanism of a ViT model to obtain a foreground-marked video.
The conventional ViT model is used for image classification; this example instead focuses on marking the target regions of interest in the video to achieve a foreground-marking effect. The Self-Attention (SA) mechanism of ViT works as follows: the input matrix X is multiplied by three trainable parameter matrices W_Q, W_K and W_V to obtain three key vectors Q (Query), K (Key) and V (Value). Q is multiplied with the transpose of K to produce a similarity matrix; for more stable gradient updates during training, every element of the similarity matrix is divided by √d_k, where d_k is the dimension of the K vector, and the result is normalized with the softmax function to obtain a weight matrix. Finally, the normalized weight matrix is multiplied with the vector V and the weighted sum is computed, yielding the output matrix SA(X):
SA(X) = softmax(Q · K^T / √d_k) · V
To enhance the fitting performance of the network, this example marks the target regions of the video with the MHSA mechanism, controlling the foreground-marking effect through the number of heads. The MHSA mechanism works as follows: for the input matrix X, multiple groups of distinct W_Q, W_K and W_V are defined; each group computes its own Q_h, K_h and V_h and learns a different weight matrix, with each subsequent layer taking the previous layer's output directly as input. Finally, the results of the several heads are spliced together (Concat) and multiplied by the weight W_out to obtain the final output matrix MHSA(X):
MHSA(X) = Concat(head_1, …, head_n) · W_out,  head_h = softmax(Q_h · K_h^T / √d_k) · V_h
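The SA and MHSA computations above can be sketched numerically. This is a minimal NumPy rendition under assumed toy dimensions (4 tokens, model width 8, 2 heads of width 4); the variable names are illustrative and no trained ViT weights are involved:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(X, W_q, W_k, W_v, W_out):
    """Per head h: Q_h = X W_q[h], K_h = X W_k[h], V_h = X W_v[h];
    head_h = softmax(Q_h K_h^T / sqrt(d_k)) V_h;
    MHSA(X) = Concat(head_1..head_n) W_out."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # normalized similarity matrix
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ W_out

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, d_k = 4, 8, 2, 4
X = rng.standard_normal((n_tokens, d_model))
W_q = rng.standard_normal((n_heads, d_model, d_k))
W_k = rng.standard_normal((n_heads, d_model, d_k))
W_v = rng.standard_normal((n_heads, d_model, d_k))
W_out = rng.standard_normal((n_heads * d_k, d_model))
print(mhsa(X, W_q, W_k, W_v, W_out).shape)   # (4, 8)
```

In the patent's use, the attention weights A over patch tokens are what highlight the foreground region; the head count n trades off how sharply the marking concentrates on the target.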
referring to fig. 3, the first line is a few frames in the original video, and the second line is obtained through the MHSA mechanism, it can be seen that the MHSA mechanism is similar to the human eye observing things, preferentially noticing the target area of interest (foreground) in the video frame, and disregarding the non-important area (background). The effect of the MHSA mechanism visualization is characterized by marking the foreground on the image.
Step 4: input the hidden-state video of step 2 and the foreground-marked video of step 3 respectively into the improved generative adversarial network, and perform information compensation through iterative training to obtain spatio-temporal feature-compensated data:
Conventional generative adversarial networks focus mainly on generating samples from noise to expand a dataset; this example instead focuses on compensating image features, adopting the improved generative adversarial network iGAN model to compensate the spatio-temporal features of the visually privacy-protected video. Both the generator network G and the discriminator network D are CNNs; given the size difference between the hidden-state video and the original video, the generator is changed to a two-layer CNN and the discriminator to a six-layer CNN. A conventional discriminator uses true/false labels to judge the authenticity of the generator's output; since this embodiment performs feature-transfer compensation for 3 classes of violent behavior, the labels are redefined as the behavior classes 0, 1 and 2. In the overall training structure, the labels are fed back to the generator and discriminator after computation and comparison, and the video data produced by the generator is purposefully matched, for training, with the foreground-marked video data bearing the same label; in this process the missing foreground information is compensated, strengthening the spatio-temporal features of the hidden-state video data.
Both the generator and the discriminator are CNN structures, and the loss function is back-propagated to them according to the chain rule, continuously updating the weights and other parameters. The generator and discriminator loss functions of the iGAN network are, respectively:
L_D(o, l, c) = BCE(D(o), l) + BCE(D(G(c)), l)
L_G(l, c) = BCE(D(G(c)), l) + CE(C(G(c)), l)
where L_D is the discriminator loss function, L_G is the generator loss function, o is the original video data, l is the original video data label, c is the hidden-state video data, BCE is binary cross entropy, CE is cross entropy, D is the discriminator, G is the generator, C is the classifier, and G(c) is the compensated video data produced by the generator.
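A minimal numerical sketch of these two losses, using NumPy with toy discriminator scores and classifier logits (the values, and the convention that D outputs a probability in (0, 1), are illustrative assumptions; the formulas follow the definitions above):

```python
import numpy as np

def bce(p, y):
    """Binary cross entropy over discriminator scores p in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ce(logits, labels):
    """Cross entropy over class logits with integer labels."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

# toy batch (illustrative values, not from the patent)
d_real = np.array([0.9, 0.8])          # D(o): discriminator on real videos
d_fake = np.array([0.2, 0.3])          # D(G(c)): discriminator on compensated videos
l = np.array([1.0, 1.0])               # label l
c_logits = np.array([[2.0, 0.1, 0.1],
                     [0.1, 2.0, 0.1]])  # C(G(c)): classifier logits over 3 behaviors
cls = np.array([0, 1])

L_D = bce(d_real, l) + bce(d_fake, l)      # discriminator loss
L_G = bce(d_fake, l) + ce(c_logits, cls)   # generator loss
print(L_D > 0.0 and L_G > 0.0)             # True
```

Note that the generator loss couples the adversarial BCE term with the classifier's CE term, which is what steers the compensated video toward class-discriminative features rather than mere realism.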
Step 5: for typical abuse behaviors in the elder-care process, combine ViT and iGAN into the ViT-iGAN model, and use the added ResNet18 classification network to perform behavior detection on the spatio-temporal feature-compensated data of step 4:
The ResNet18 network serves as the classifier of the ViT-iGAN model and consists of 17 convolutional layers and 1 fully connected layer. During training, all available real data carry labels, whereas the compensated-state video data produced by the generator carry none. A pseudo-labeling scheme is therefore used for the compensated video data: based on the classifier's current state, the most probable class is assumed as the label. A generated image and its label are retained only when the model predicts the sample's class with sufficiently high confidence, i.e. with probability above a given threshold. The classifier loss function is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
where L_C is the classifier loss function, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is cross entropy, G is the generator, C is the classifier, G(c) is the compensated video data produced by the generator, argmax is the pseudo-label-generating function, and t is the pseudo-label threshold.
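The pseudo-label mechanism described above can be sketched as follows. This is a hedged NumPy illustration, assuming a softmax-probability reading of "confidence" and a default threshold t = 0.8 (the threshold value and function names are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, labels):
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def classifier_loss(logits_real, labels, logits_fake, t=0.8):
    """L_C: CE on labelled real data, plus CE on generated samples whose
    top predicted probability exceeds the pseudo-label threshold t; the
    argmax class of each retained sample serves as its pseudo-label."""
    loss = ce(logits_real, labels)
    p = softmax(logits_fake)
    keep = p.max(axis=1) > t                 # keep only confident predictions
    if keep.any():
        pseudo = p[keep].argmax(axis=1)
        loss += ce(logits_fake[keep], pseudo)
    return loss

logits_real = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = np.array([0, 1])
confident = np.array([[6.0, 0.0, 0.0]])      # max prob ~0.995 > t: retained
unsure = np.array([[0.1, 0.0, 0.0]])         # max prob ~0.36: discarded
print(classifier_loss(logits_real, labels, unsure) ==
      ce(logits_real, labels))               # True: no pseudo-label term added
```

Discarding low-confidence generated samples keeps unreliable pseudo-labels from contaminating the classifier early in training, when the generator's compensated videos are still poor.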
The behavior-detection performance of the present method can be further illustrated by the following simulation experiments:
The previously acquired NTU RGB+D 60 and ISR-UoL 3D Social Activity datasets were each encoded through several CS layers to obtain the corresponding CS1-, CS2-, CS3- and CS4-layer video sets. As shown in fig. 2, at the CS3 layer the detail information of the example image can no longer be discerned and the visually privacy-protected state is reached, while at the CS4 layer the image's details, background and noise merge together. Considering that the datasets are intended for intelligent detection — an over-privacy-protected video loses practical application value — the CS3-layer video sets were selected for the subsequent behavior-recognition experiments.
In addition, the target regions of interest in the two datasets were marked through the MHSA mechanism of ViT. As shown in fig. 3, after the original video passes through the MHSA mechanism, the background information in the image is weakened and the target persons are prominently marked, yielding the foreground-marked data used for subsequent model training.
To improve the generalization ability of the model, 5-fold cross-validation was adopted in the experiments. To preserve the temporal information within videos, all frames of one video clip are selected as a group for training, so that the spatio-temporal information is better compensated.
In the training stage, the CS3 layer training video set is input into the iGAN model, preprocessed to the fixed resolution of 32 multiplied by 32 and then sent into the generator, the foreground marking video set input model is preprocessed to the fixed resolution of 128 multiplied by 128 and then sent into the discriminator and the classifier together with the class label, and the training model is stored after 300 times of iterative training.
In the test stage, the model saved in the training stage is loaded; the CS3 layer test video set is preprocessed to a fixed resolution of 32×32 and input into the generator to produce a 128×128 compensated video set, which is then fed into the classification network of the model for detection, outputting the violence detection result.
Referring to figs. 4 and 5, the ViT-iGAN model is used for violence detection on the CS3 layer NTU RGB+D 60 dataset and the CS3 layer ISR-UoL 3D Social Activity dataset respectively; to characterize the detection performance more intuitively, a confusion matrix is computed. The vertical axis gives the true labels of the video behaviors (kicking, beating and pushing) and the horizontal axis gives the model's detection results, i.e. the classifier's predicted labels for the CS3 layer videos. Taking the first row of fig. 4 as an example: of 1000 kicking actions in the input video, 997 are detected as kicks, 0 as beats and 3 as pushes; the other rows are read the same way. This demonstrates the effectiveness of the method for behavior detection on visually hidden video.
Claims (5)
1. A privacy protection ViT-iGAN video behavior detection method for preventing abuse in pension privacy scenes, characterized by comprising the following steps:
step 1: based on public datasets, collecting private-scene surveillance video data and dividing it into a training set and a test set;
step 2: performing dimension-reduction encoding on the surveillance video data of step 1 with the multi-layer compressed sensing technology (CS), so that the content reaches a visual privacy protection state, obtaining visually hidden videos;
step 3: using the visual self-attention model ViT to mark regions of interest in the original videos during model training, obtaining foreground-marked videos for subsequent network modeling;
step 4: inputting the hidden-state videos of step 2 and the foreground-marked videos of step 3 into the improved generative adversarial network (iGAN) respectively, and performing information compensation through iterative training to obtain spatio-temporal feature compensation data;
step 5: for typical abuse behaviors in the pension process, combining ViT and iGAN into a ViT-iGAN model, and using the added classification network ResNet18 to perform behavior detection on the spatio-temporal feature compensation data of step 4.
2. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 2, the multi-layer compressed sensing technology directly performs dimension-reduction encoding on the videos in the dataset using a random observation matrix, defined as:

Y^(n) = Φ^(n) Y^(n-1), with Y^(0) = X,

wherein X is the input video, Φ^(n) is the observation matrix of the n-th layer, and Y^(n) is the observation obtained by processing X through n CS layers.
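A minimal NumPy sketch of the layered encoding above. The patent does not specify the observation-matrix distribution or the per-layer reduction ratio; the Gaussian matrix and `ratio=0.5` below are assumptions for illustration.

```python
import numpy as np

def cs_encode(x, n_layers, ratio=0.5, seed=0):
    """Multi-layer compressed-sensing encoding (illustrative sketch).
    Each layer multiplies the flattened frame by a random Gaussian
    observation matrix Phi^(n), shrinking the dimension by `ratio`."""
    rng = np.random.default_rng(seed)
    y = x.astype(float)
    for _ in range(n_layers):
        m = max(1, int(y.shape[0] * ratio))            # reduced dimension
        phi = rng.standard_normal((m, y.shape[0])) / np.sqrt(m)
        y = phi @ y                                    # Y^(n) = Phi^(n) Y^(n-1)
    return y

frame = np.arange(1024, dtype=float)   # a flattened 32x32 frame
y3 = cs_encode(frame, n_layers=3)      # analogous to the CS3 layer
print(y3.shape)                        # (128,) after three halvings
```

Deeper layers discard progressively more detail, which is what drives the CS3/CS4 trade-off between privacy protection and usable information described in the experiments.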
3. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 3, the ViT model uses the multi-head self-attention mechanism (MHSA) to mark regions of interest in the video, and the effect of the foreground marking is controlled by the number of MHSA heads; MHSA is defined as:

head_h = softmax(Q_h K_h^T / √d_k) V_h,
MHSA(X) = Concat(head_1, …, head_n) W_out,

wherein X is the input video, n is the number of heads, Q_h, K_h and V_h are the three key vectors obtained by multiplying X with each head's parameter matrices W_h^Q, W_h^K and W_h^V, d_k is the dimension of the vector K_h, W_out is the output weight matrix over the heads, Concat is the concatenation operation, and MHSA(X) is the output video.
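The MHSA definition of claim 3 can be sketched in NumPy as follows; the token count, embedding size and per-head dimension are illustrative choices, not values from the patent.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, wq, wk, wv, w_out):
    """Multi-head self-attention over a token matrix x of shape (T, d).
    wq/wk/wv are lists of per-head projections (d, d_k); w_out maps the
    concatenated heads (n*d_k) back to d."""
    heads = []
    for q_p, k_p, v_p in zip(wq, wk, wv):
        q, k, v = x @ q_p, x @ k_p, x @ v_p
        d_k = q.shape[-1]
        attn = softmax(q @ k.T / np.sqrt(d_k))  # (T, T) attention weights
        heads.append(attn @ v)                  # (T, d_k) per-head output
    return np.concatenate(heads, axis=-1) @ w_out

rng = np.random.default_rng(0)
T, d, d_k, H = 16, 32, 8, 4                     # illustrative sizes
x = rng.standard_normal((T, d))
wq = [rng.standard_normal((d, d_k)) for _ in range(H)]
wk = [rng.standard_normal((d, d_k)) for _ in range(H)]
wv = [rng.standard_normal((d, d_k)) for _ in range(H)]
w_out = rng.standard_normal((H * d_k, d))
print(mhsa(x, wq, wk, wv, w_out).shape)         # (16, 32)
```

The attention map `attn` is what concentrates weight on the foreground person; varying `H` changes how finely the marking is distributed, which is the head-count control mentioned in the claim.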
4. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 4, the applied iGAN network is obtained by improving the generative adversarial network (GAN) and is used to compensate the spatio-temporal features of the video; the improvements are as follows:
the numbers of CNN layers of the generator and discriminator in the iGAN network are changed according to the different sizes of the input hidden-state videos and foreground-marked videos; the discriminator labels are redefined according to the iGAN network's information compensation for the various behaviors;
both the generator and the discriminator are CNN structures; the loss is back-propagated to the generator and discriminator according to the chain rule, continuously updating the weights and other parameters;
the generator and arbiter loss functions of the iGAN network are as follows:
L_D(o, l, c) = BCE(D(o), l) + BCE(D(G(c)), l)
L_G(l, c) = BCE(D(G(c)), l) + CE(C(G(c)), l)
wherein L_D is the loss function of the discriminator, L_G is the loss function of the generator, o is the original video data, l is the original video data label, c is the hidden-state video data, BCE is the binary cross-entropy, CE is the cross-entropy, D is the discriminator, G is the generator, C is the classifier, and G(c) is the compensated video data generated by the generator.
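The two losses can be illustrated numerically. The sketch below uses the conventional GAN targets, real samples labeled 1 and generated samples labeled 0 for the discriminator; the claim's redefined discriminator labels are not specified in enough detail to reproduce, so this is an assumed standard reading on a toy batch.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for discriminator scores in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)).mean())

def ce(logits, labels):
    """Cross-entropy over class logits; labels are integer class indices."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_p[np.arange(len(labels)), labels].mean())

# toy batch: D(o) = scores on real videos, D(G(c)) = scores on generated ones
d_real, d_fake = np.array([0.9, 0.8]), np.array([0.2, 0.3])
c_logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1]])  # classifier logits on G(c)
labels = np.array([0, 1])

L_D = bce(d_real, 1.0) + bce(d_fake, 0.0)  # discriminator: real -> 1, fake -> 0
L_G = bce(d_fake, 1.0) + ce(c_logits, labels)  # generator fools D and helps C
print(round(L_D, 3), round(L_G, 3))  # L_D ≈ 0.454, L_G ≈ 1.668
```

The opposed BCE terms are the adversarial game; the extra CE term pulls the generator toward compensated videos that the classifier can label correctly.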
5. The privacy protection ViT-iGAN video behavior detection method for preventing abuse of pension privacy scenes according to claim 1, characterized in that: in step 5, the ResNet18 network consists of 17 convolutional layers and 1 fully connected layer; during model training, the loss function of the classifier is:
L_C(o, l, c) = CE(C(o), l) + CE(C(G(c)), argmax(C(G(c))) > t)
wherein L_C is the loss function of the classifier, o is the original video data, l is the original video data label, c is the visually hidden video data, CE is the cross-entropy, C is the classifier, G is the generator, G(c) is the compensated video data generated by the generator, argmax is the function generating the pseudo-labels, and t is the pseudo-label threshold.
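One plausible reading of "argmax(C(G(c))) > t" is confidence-thresholded pseudo-labeling: the generated video's own argmax prediction is used as its label, but only where the softmax confidence exceeds t. The sketch below implements that assumed reading on toy logits; it is not a verbatim reproduction of the patent's training code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, labels):
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels]).mean())

def classifier_loss(logits_o, labels, logits_gc, t=0.8):
    """L_C sketch: supervised CE on the original videos, plus CE on the
    generated videos against their own argmax pseudo-labels, retained
    only where the softmax confidence exceeds the threshold t."""
    loss = ce(logits_o, labels)
    probs = softmax(logits_gc)
    pseudo = probs.argmax(axis=1)
    keep = probs.max(axis=1) > t       # confident pseudo-labels only
    if keep.any():
        loss += ce(logits_gc[keep], pseudo[keep])
    return loss

logits_o = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
labels = np.array([0, 1])
logits_gc = np.array([[4.0, 0.0, 0.0],      # confident -> kept
                      [0.5, 0.6, 0.4]])     # low confidence -> dropped
print(round(classifier_loss(logits_o, labels, logits_gc, t=0.8), 3))  # ≈ 0.131
```

Dropping low-confidence pseudo-labels keeps noisy generator outputs from corrupting the classifier early in the 300-iteration training described in the experiments.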
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311033761.2A CN117037037A (en) | 2023-08-16 | 2023-08-16 | Anti-abuse privacy protection ViT-iGAN video behavior detection method for endowment privacy scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117037037A true CN117037037A (en) | 2023-11-10 |
Family
ID=88601999
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||