CN112653886B

CN112653886B - Monitoring video stream forgery detection method and positioning method based on wireless signals

Info

Publication number: CN112653886B
Application number: CN202011471045.9A
Authority: CN
Inventors: 王巍; 黄勇
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2020-12-14
Filing date: 2020-12-14
Publication date: 2021-12-03
Anticipated expiration: 2040-12-14
Also published as: CN112653886A

Abstract

The invention discloses a wireless-signal-based method for detecting forgery in a surveillance video stream and a method for positioning the forgery, and belongs to the field of video stream forgery detection. The detection method comprises: collecting wireless signals of the video monitoring area in real time at equal time intervals, so that each surveillance video frame corresponds to the same number of wireless signal samples; extracting a joint heat map (JHM) tensor and a part affinity field (PAF) tensor from the surveillance video frame and from the wireless signals, respectively; and, if the difference between the JHM of the video frame and the JHM of the wireless signals is large, judging the current video frame to be forged. When the surveillance camera and the wireless receiver are in the same space, the visual signal and the wireless signal sense the same human semantic information simultaneously, so human semantic information can be extracted from the wireless signals. Because the JHM tensors extracted from the video frame and from the wireless signals are similar when there is no forgery attack and dissimilar otherwise, comparing them frame by frame yields a forgery decision that is both real-time and fine-grained.

Description

Monitoring video stream forgery detection method and positioning method based on wireless signals

Technical Field

The invention belongs to the field of monitoring video stream forgery detection, and particularly relates to a monitoring video stream forgery detection method and a positioning method based on wireless signals.

Background

With the growing demand for safety in daily life, video surveillance systems are used ever more widely indoors and outdoors, for example for crime prevention in banks and customer monitoring in retail stores. Because of this rapid growth in popularity and real-world importance, video surveillance systems have inevitably become attractive targets for network attacks. Recent research has shown that a malicious attacker can infiltrate a surveillance system by exploiting a security vulnerability of the surveillance camera or by hijacking the Ethernet cable connected to it, then tamper with the real video stream content and thereby conceal illegal activities in the monitored area without leaving any perceptible clues. Under such threats, it is important for a surveillance system to raise an alert on an ongoing network attack quickly and to track potential intruders in real time.

To address camera replay attacks, Nitya Lakshmanan et al. proposed a video forgery detection method in 2019 in 'SurFi: Detecting Surveillance Camera Looping Attacks with Wi-Fi Channel State Information'. The main idea is to compare event-level time-frequency features (the start and stop times and the dominant frequency components of an event) extracted from the video and from the WiFi CSI signal in the same space, so as to judge whether the current video stream matches the current wireless signal; a mismatch is taken to indicate an attack. However, because this approach relies on event-level features of the visual and wireless signals, it has the following problems. First, detection takes a long time: an original signal sequence of 10-30 s is required as one detection unit. Second, accuracy is low: if only the data corresponding to one event are used, the system accuracy is below 40 percent. Third, it assumes that only one person is present in the region of interest, so it cannot handle multiple people within the coverage of the camera. Fourth, even when an attack is detected, the system cannot determine where the intruder is in the monitored area or what the intruder is doing. Therefore, it cannot meet the real-time, high-accuracy and fine-grained requirements of intrusion detection for surveillance video streams.

Cao Zhe et al. proposed a positioning method in 2016. The main idea is to design a fast association algorithm for human body joint points based on the PAF tensor, a form of human semantic information, and thereby estimate the 2D posture of a human body from a single picture. This scheme could be applied directly: recover the human body postures separately from the visual JHM and PAF tensors and from the wireless JHM and PAF tensors, then compare the two sets of postures one by one to locate the intruder. However, this direct approach runs the joint association algorithm twice for each picture, so its computational complexity is high.

Disclosure of Invention

In view of the defects of the prior art and the need for improvement, the invention provides a wireless-signal-based surveillance video stream forgery detection method and positioning method, aiming to solve the problem that existing video surveillance systems cannot satisfy real-time performance and fine-grained detection capability at the same time when detecting video stream forgery attacks.

To achieve the above object, according to a first aspect of the present invention, there is provided a surveillance video stream forgery detection method based on a wireless signal, the method including:

S1, collecting wireless signals of the video monitoring area in real time at equal time intervals, so as to obtain the same number of wireless signals for each monitoring video frame;

S2, extracting a human body joint heat map (JHM) and a part affinity field (PAF) tensor from the monitoring video frame and from the wireless signals, respectively;

S3, if the difference between the human body joint heat map of the monitoring video frame and the human body joint heat map of the wireless signals is large, judging that the current monitoring video frame is forged.

Preferably, in step S2, a pre-trained OpenPose network is used to extract the human body joint heat map and the PAF tensor from the monitoring video frame, and a pre-trained RF2Pose network is used to extract the human body joint heat map and the PAF tensor from the wireless signals;

the RF2Pose network includes:

the wireless signal transformation module is used for converting all wireless signals corresponding to each monitoring video frame into intermediate features with the same spatial dimensions as the monitoring video frame, and outputting the intermediate features to the human body joint heat map generation module and the PAF tensor generation module;

the human body joint heat map generation module is used for generating a human body joint heat map according to the intermediate characteristics;

the PAF tensor generation module is used for generating the PAF tensor according to the intermediate features;

the training sample of the RF2Pose network is a wireless signal, and the label is a corresponding human joint heat map and a PAF tensor output by the OpenPose network.

Beneficial effects: the invention extracts human semantic information from the wireless signals corresponding to each video frame through a deep neural network. Because wireless signals are complex and variable, the relationship between the wireless signals and the human body is difficult to model with traditional methods, whereas deep learning has strong feature-mapping capability; a deep neural network can therefore effectively recover human semantic information from the wireless signals corresponding to each video frame, which enables frame-by-frame forgery detection. In addition, the RF2Pose network is trained with a cross-modal training method: since manually labelling wireless signals would require a large amount of time and labor, the outputs of the trained OpenPose network are used as the labels for the corresponding wireless signals, which saves substantial labor cost.

Preferably, the wireless signal transformation module comprises a deconvolution network, a convolution network and a residual network connected in series in that order; the human body joint heat map generation module comprises a decoding network and a fully convolutional network connected in series in that order, wherein the decoding network is a repeated structure of one deconvolution layer and two convolution layers; the PAF tensor generation module comprises a decoding network and a fully convolutional network connected in series in that order, wherein the decoding network is a repeated structure of one deconvolution layer and two convolution layers.

Beneficial effects: the invention designs a novel deep neural network structure comprising a wireless signal transformation module, a human body joint heat map generation module and a PAF tensor generation module, so that the human body joint heat map and the PAF tensor can be extracted from complex and variable wireless signals. Because the spatial resolution of the wireless signal is low, the wireless signal transformation module first up-samples the input wireless signal with a deconvolution network to raise its spatial resolution, and then passes it through a convolution network and a residual network to obtain intermediate features that relate human body information to the spatial information of the video frame. Because the human semantic information in the intermediate features is coarse and its mapping to the human body joint heat map needs to be further refined, the human body joint heat map generation module first refines the intermediate features step by step with a decoding network (a repeated structure of one deconvolution layer and two convolution layers), and then converts the refined feature map into the human body joint heat map with a fully convolutional network. Similarly, because the human semantic information in the intermediate features is coarse and its mapping to the PAF tensor needs to be further refined, the PAF tensor generation module first refines the intermediate features step by step with a decoding network of the same repeated structure, and then converts the refined feature map into the PAF tensor with a fully convolutional network.

Preferably, the loss function of the RF2Pose network in the training phase combines a JHM loss $L_{JHM}$ and a PAF loss $L_{PAF}$, each a pixel-wise weighted squared error:

$$L_{JHM} = \sum_{y=1}^{Y}\sum_{j=1}^{J}\sum_{h,w} \omega^{y}_{j}(h,w)\,\bigl(\mathbf{J}^{y}_{I,j}(h,w) - \mathbf{J}^{y}_{R,j}(h,w)\bigr)^{2}$$

$$L_{PAF} = \sum_{y=1}^{Y}\sum_{c=1}^{C}\sum_{h,w} \nu^{y}_{c}(h,w)\,\bigl\lVert \mathbf{P}^{y}_{I,c}(h,w) - \mathbf{P}^{y}_{R,c}(h,w) \bigr\rVert^{2}$$

wherein $L_{JHM}$ and $L_{PAF}$ respectively represent the loss functions corresponding to the JHM and PAF tensors; $\mathbf{J}^{y}_{I}$ and $\mathbf{P}^{y}_{I}$ respectively represent the JHM and PAF tensors obtained by inputting the $y$-th training sample video frame into the OpenPose network; $\mathbf{J}^{y}_{R}$ and $\mathbf{P}^{y}_{R}$ respectively represent the JHM and PAF tensors obtained by inputting the $y$-th training sample wireless signal into the RF2Pose network; $Y$ represents the number of training samples, $J$ the number of connected joint points into which the human body is abstracted, $C$ the number of human body parts, and $h$ and $w$ the pixel coordinates; $\omega^{y}_{j}(h,w)$ and $\nu^{y}_{c}(h,w)$ respectively represent the pixel-wise weight factors of the JHM and PAF tensors; $\lambda_{1}$, $\lambda_{2}$, $\beta_{1}$ and $\beta_{2}$ are the coefficients that balance the influence of $L_{JHM}$ and $L_{PAF}$ in the overall objective function; $\mathbf{J}^{y}_{I,j}(h,w)$ represents the confidence of the $j$-th joint point at pixel $(h, w)$ of the visual JHM; $\mathbf{J}^{y}_{R,j}(h,w)$ represents the confidence of the $j$-th joint point at pixel $(h, w)$ of the wireless JHM; $\mathbf{P}^{y}_{I,c}(h,w)$ represents the position and direction information of the $c$-th human body part at pixel $(h, w)$ of the visual PAF; and $\mathbf{P}^{y}_{R,c}(h,w)$ represents the position and direction information of the $c$-th human body part at pixel $(h, w)$ of the wireless PAF.

Beneficial effects: the invention proposes a new loss function. Since the background often occupies most of the pixels in a video frame, most of the elements in the JHM (PAF) tensor are equal to zero, while a conventional L2 loss function tends to reduce the overall error over all elements, which leaves a large gap between the visual JHM (PAF) tensor and the wireless JHM (PAF) tensor. The invention therefore sets the pixel-wise weight factors of the JHM and PAF tensors to be linearly related to the absolute value of the (h, w)-th element, which increases the weight of the non-zero elements, so that the RF2Pose network pays more attention to regions where a human body is present and less to the background, greatly reducing the difference between the visual and wireless JHM (PAF) tensors. In addition, since the JHM and PAF tensors carry different human semantic information and have different numerical scales, the invention sets different coefficients λ₁, λ₂, β₁ and β₂ to balance the influence of the JHM loss and the PAF loss in the overall objective function. Through this training process, the deep neural network can extract more effective PAF information from the wireless signals, and the JHM tensor recovered from the wireless signals is more accurate.
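The weighting idea can be sketched in NumPy as follows. This is a minimal sketch and not the patented loss: the formula itself is not reproduced in this text, so the weight form `slope * |label| + offset`, the plain sum of the two losses, the mapping of λ₁, β₁ to the JHM term and λ₂, β₂ to the PAF term, and the default values are all assumptions made for illustration.

```python
import numpy as np


def weighted_l2_loss(label, pred, slope, offset):
    """Squared error with per-element weights that grow with |label|.

    Assumed weight form (not given in the source): slope * |label| + offset.
    """
    label = np.asarray(label, dtype=float)
    pred = np.asarray(pred, dtype=float)
    weights = slope * np.abs(label) + offset
    return float(np.sum(weights * (label - pred) ** 2))


def rf2pose_loss(jhm_label, jhm_pred, paf_label, paf_pred,
                 lam1=1.0, beta1=1.0, lam2=1.0, beta2=1.0):
    """Assumed overall objective: JHM loss plus PAF loss, each weighted separately."""
    loss_jhm = weighted_l2_loss(jhm_label, jhm_pred, lam1, beta1)
    loss_paf = weighted_l2_loss(paf_label, paf_pred, lam2, beta2)
    return loss_jhm + loss_paf
```

With `slope=1` and `offset=1`, missing a joint pixel whose label is 1 costs twice as much as predicting a spurious 1 on a background pixel whose label is 0, which is the bias toward human regions that the paragraph describes.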

Preferably, in step S3, the difference between the human body joint heat map of the monitoring video frame and that of the wireless signals is judged by applying a threshold to, or constructing a binary classifier on, the similarity or difference of the two human body joint heat maps.

Beneficial effects: the invention uses the similarity or difference of the human body joint heat maps as the basis for judgment. If no attack exists, the similarity is high (the difference is small); otherwise the similarity is low (the difference is large). Using the similarity or difference therefore reduces the redundant information in the two human body joint heat maps and enables simple and fast forgery detection.
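The threshold variant can be illustrated with a short NumPy sketch. The difference measure (peak absolute element-wise difference), the threshold value and the function names are assumptions chosen for illustration; the text does not specify them.

```python
import numpy as np


def jhm_difference_score(jhm_video, jhm_wireless):
    """Peak absolute element-wise difference between two JHM tensors of shape (J, H, W)."""
    jhm_video = np.asarray(jhm_video, dtype=float)
    jhm_wireless = np.asarray(jhm_wireless, dtype=float)
    if jhm_video.shape != jhm_wireless.shape:
        raise ValueError("JHM tensors must have the same shape")
    return float(np.max(np.abs(jhm_video - jhm_wireless)))


def is_forged(jhm_video, jhm_wireless, threshold=0.5):
    """Flag the frame when the difference score exceeds a hypothetical threshold."""
    return jhm_difference_score(jhm_video, jhm_wireless) > threshold
```

A frame from which a person has been erased gives a visual JHM near zero where the wireless JHM still has a confident joint peak, so the score is large; for a genuine frame the two maps nearly cancel.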

To achieve the above object, according to a second aspect of the present invention, there is provided a surveillance video stream forgery positioning method based on wireless signals, the method comprising:

(T1) performing forgery detection on the surveillance video stream using the detection method according to the first aspect;

(T2) for each surveillance video frame whose detection result is forged, calculating the absolute value of the difference between the human body joint heat map of the current surveillance video frame and the human body joint heat map of the wireless signals, and calculating the sum of the PAF tensor of the current surveillance video frame and the PAF tensor of the wireless signals;

(T3) selecting the abnormal human body joint point set corresponding to the current surveillance video frame based on the absolute difference of the human body joint heat maps;

(T4) performing a joint point association operation on the PAF tensor sum and the abnormal human body joint point set of the current surveillance video frame to obtain the association state between the abnormal human body joint points, thereby determining the position of the forged human body object in the current surveillance video frame.

Preferably, in the step (T3), the abnormal human body joint point set corresponding to the current monitoring video frame is selected by a non-maximum suppression method.

Beneficial effects: the abnormal human body joint points corresponding to the current monitoring video frame are selected by non-maximum suppression. Because the elements of the JHM difference tensor take both positive and negative values, while non-maximum suppression is meaningful only for positive values, the suppression is applied to the absolute value of the difference tensor.
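A minimal NumPy sketch of this selection step is given below. The 3x3 neighbourhood, the confidence threshold and the function name are assumptions for illustration; the text only states that non-maximum suppression is applied to the absolute difference.

```python
import numpy as np


def select_abnormal_joints(diff_jhm, threshold=0.3):
    """Return, for each joint channel, the (h, w) local maxima of |diff_jhm|.

    diff_jhm has shape (J, H, W) and may contain negative values, so the
    absolute value is taken before non-maximum suppression.
    """
    abs_diff = np.abs(np.asarray(diff_jhm, dtype=float))
    num_joints, height, width = abs_diff.shape
    candidates = []
    for j in range(num_joints):
        # Pad with -inf so that border pixels can still be local maxima.
        padded = np.pad(abs_diff[j], 1, mode="constant", constant_values=-np.inf)
        centre = padded[1:-1, 1:-1]
        is_peak = centre > threshold
        for dh in (-1, 0, 1):
            for dw in (-1, 0, 1):
                if dh == 0 and dw == 0:
                    continue
                neighbour = padded[1 + dh:1 + dh + height, 1 + dw:1 + dw + width]
                is_peak &= centre >= neighbour
        rows, cols = np.nonzero(is_peak)
        candidates.append([(int(h), int(w)) for h, w in zip(rows, cols)])
    return candidates
```

Taking the absolute value first means a joint that is present in the wireless JHM but missing from the video (or the reverse) produces a peak regardless of the sign of the difference.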

To achieve the above object, according to a third aspect of the present invention, there is provided a computer readable storage medium storing one or more first programs, the one or more first programs being executed by one or more processors to implement the steps of the wireless signal based surveillance video stream forgery detection method according to the first aspect; or, the computer readable storage medium stores one or more second programs, which are executed by one or more processors to implement the steps of the method for monitoring video stream forgery location based on wireless signals according to the second aspect.

Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:

(1) The JHM tensor extracted from the monitoring video frame is compared with the JHM tensor extracted from the wireless signals in the same space to judge whether the video frame is forged. Because the dielectric constant of the human body is large, wireless electromagnetic signals are strongly reflected when they encounter the human body. Therefore, when the surveillance camera and the wireless receiver are in the same space, the visual signal and the wireless signal sense the same human semantic information simultaneously, and human semantic information can be extracted from widely available wireless signals. Because the JHM tensors extracted from the video frame and from the wireless signals at the same moment are similar when there is no forgery attack and dissimilar when there is one, comparing the JHM tensors of each monitoring video frame and the corresponding wireless signals frame by frame yields the forgery decision, meeting the requirements of real-time performance and fine-grained detection at the same time.

(2) In the positioning stage, non-maximum suppression is applied to the absolute value of the difference tensor between the visual and wireless JHM tensors to select a set of candidate abnormal human body joint points. The visual PAF tensor and the wireless PAF tensor are then summed to obtain a PAF sum tensor. Finally, a human body joint point association algorithm locates the position and posture of the abnormal human body targets in the picture. Selecting the abnormal joint point set directly from the absolute value of the JHM difference tensor avoids selecting separate joint point sets from the visual and wireless JHM tensors; summing the visual and wireless PAF tensors into a single PAF sum tensor has little effect on the subsequent joint point association and reduces the amount of computation; and the joint point association algorithm needs to run only once, which reduces the computational cost.

Drawings

Fig. 1 is a schematic view of a video frame forgery attack scenario in a video monitoring system according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for detecting and positioning a surveillance video stream forgery attack based on wireless signals according to an embodiment of the present invention;

Fig. 3(a) is a schematic diagram of CSI resampling provided by an embodiment of the present invention;

Fig. 3(b) is a schematic diagram of an outlier distribution provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of human semantic feature extraction based on video frames and wireless signals according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of video frame forgery detection and positioning based on human semantic features according to an embodiment of the present invention;

Fig. 6 shows a forgery positioning result, based on human semantic features, under an inter-frame attack according to an embodiment of the present invention;

Fig. 7 shows a forgery positioning result, based on human semantic features, under an intra-frame attack according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Because of their rapid growth in popularity and real-world importance, video surveillance systems have inevitably become attractive targets for network attacks. As shown in Fig. 1, a malicious attacker can sneak into a surveillance system by hijacking its cable, then tamper with the real video stream content and thereby conceal illegal activities in the monitored area. Under such threats, it is important for video surveillance systems to detect ongoing forgery attacks quickly and to track potential intruders in real time. However, none of the existing methods can meet the real-time and fine-grained requirements of surveillance video stream forgery detection at the same time.

Therefore, the present invention provides a wireless-signal-based method for detecting and positioning forgery attacks on a surveillance video stream, as shown in Fig. 2. This embodiment takes WiFi signals as an example to verify the feasibility of the invention. Today, many areas covered by surveillance cameras, such as shops and homes, are also covered by WiFi hotspots. Since the human body has a high reflection coefficient for WiFi signals, the synchronized surveillance video signal and the received WiFi signal in these areas carry consistent human semantic information. Once an attacker forges the surveillance video stream, this cross-modal correspondence is broken. The correspondence can therefore be used to detect and locate video frame forgery in real time. As shown in Fig. 2, the system includes three functional modules: multi-modal signal processing, human semantic information extraction, and forgery detection and positioning.

In the multi-modal signal processing module, the real-time video stream from the monitoring camera is first decompressed and denoted as

$$\mathbb{I} = (I^{1}, \dots, I^{m}, \dots, I^{M})$$

where $M$ is the number of video frames contained in one GOP (group of pictures) of the video stream and $I^{m}$ is the $m$-th decompressed picture. Meanwhile, the system receives a CSI amplitude data stream from a WiFi receiver in the same area, denoted as

$$\mathbb{A} = (A^{1}, \dots, A^{n}, \dots, A^{N})$$

where $A^{n}$ is a CSI amplitude matrix containing all amplitude features of the $n$-th CSI sampling point and $N$ is the number of CSI sampling points.

In the wireless signal processing stage, the received wireless signal measurement value is resampled at equal time intervals based on the timestamp of the video stream, and after resampling, the information correspondence between the vision and the wireless signal is more accurate.

As shown in fig. 3(a), due to a random access mechanism and the existence of a packet loss phenomenon, a time interval between two adjacent original CSI sampling points is not fixed. Therefore, the number of CSI sampling points in unit time is not fixed, the periodicity of CSI data is reduced, and accurate correspondence between visual information and wireless information is not facilitated.

To solve this problem, and considering that the timestamps of the video stream $\mathbb{I}$ are more accurate, the system resamples the CSI amplitude stream $\mathbb{A}$ based on the timestamps of $\mathbb{I}$ and $\mathbb{A}$. Specifically, let $t_{m-1}$ and $t_{m}$ be the timestamps of two consecutive video frames $I^{m-1}$ and $I^{m}$. Because the sampling rate $F_{w}$ of the WiFi signal is usually greater than the frame rate $F_{I}$ of the video stream, for each video frame $I^{m}$ the system resamples $F \le F_{w}/F_{I}$ CSI sampling points from $\mathbb{A}$, with a fixed time interval of $(t_{m} - t_{m-1})/F$ between adjacent sampling points. Linear interpolation, which has low computational complexity, can be used for this purpose.
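The resampling step can be sketched with `numpy.interp`. The function name, the array layout (one row per CSI sample, one column per amplitude feature) and the choice to place the F target instants at the end of each interval within (t_{m-1}, t_m] are assumptions for illustration.

```python
import numpy as np


def resample_csi(csi_times, csi_amplitudes, t_prev, t_curr, num_points):
    """Linearly resample CSI amplitudes at num_points equally spaced instants.

    csi_times:      (N,) increasing timestamps of the raw CSI samples
    csi_amplitudes: (N, K) amplitude features per raw sample
    t_prev, t_curr: timestamps of two consecutive video frames
    Returns an array of shape (num_points, K).
    """
    csi_times = np.asarray(csi_times, dtype=float)
    csi_amplitudes = np.asarray(csi_amplitudes, dtype=float)
    step = (t_curr - t_prev) / num_points  # fixed interval (t_m - t_{m-1}) / F
    target_times = t_prev + step * np.arange(1, num_points + 1)
    columns = [
        np.interp(target_times, csi_times, csi_amplitudes[:, k])
        for k in range(csi_amplitudes.shape[1])
    ]
    return np.stack(columns, axis=1)
```

Because the target instants are derived from the video timestamps, every video frame ends up with exactly `num_points` CSI rows even when the raw packets arrive irregularly or some are lost.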

In addition, because of environmental noise, the system removes outliers from the resampled points of the CSI amplitude stream. The value of an outlier differs greatly from the surrounding values and therefore degrades the validity of the resampled values. Resampling followed by outlier removal allows the deep neural network to better extract the JHM and PAF tensors from the wireless signals.

In this embodiment, outliers in the resampled CSI values are removed using Hampel filtering. After Hampel filtering, outliers in the CSI are effectively detected and rejected, as shown in Fig. 3(b). Alternatively, outliers may be replaced with the mean of their neighborhood. Through the above operations, a set $R^{m}$ containing $F$ CSI amplitude features is obtained for each video frame $I^{m}$ and is called an RF frame.
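A Hampel filter on a one-dimensional amplitude sequence can be sketched as follows. The window half-width, the 3-sigma rule and the replacement of outliers by the window median are common defaults assumed here; the text does not state the parameters used.

```python
import numpy as np


def hampel_filter(values, half_window=3, n_sigmas=3.0):
    """Detect outliers with a sliding median and MAD, and replace them by the median.

    Returns (filtered_values, outlier_mask).
    """
    values = np.asarray(values, dtype=float)
    filtered = values.copy()
    outlier_mask = np.zeros(values.shape, dtype=bool)
    scale = 1.4826  # scales the MAD to estimate the standard deviation of Gaussian data
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        median = np.median(window)
        mad = scale * np.median(np.abs(window - median))
        if np.abs(values[i] - median) > n_sigmas * mad:
            filtered[i] = median
            outlier_mask[i] = True
    return filtered, outlier_mask
```

In practice the filter would be applied independently to each amplitude feature (for example each subcarrier) of the resampled CSI.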

In the human semantic information extraction module, the system extracts the corresponding human semantic information from the video frame $I^{m}$ and the RF frame $R^{m}$, respectively. The system represents human semantic information as JHM features (Joint Heat Maps) and PAF features (Part Affinity Fields). Specifically, the human body is abstracted into $J$ connected joint points, and the JHM represents the confidence of the different joint points at different positions in $I^{m}$. For a video frame $I^{m}$ of height $H$ and width $W$, the JHM is a three-dimensional tensor $\mathbf{J}^{m} \in \mathbb{R}^{J \times H \times W}$, denoted as

$$\mathbf{J}^{m} = (\mathbf{J}^{m}_{1}, \dots, \mathbf{J}^{m}_{j}, \dots, \mathbf{J}^{m}_{J})$$

where $\mathbf{J}^{m}_{j} \in \mathbb{R}^{H \times W}$ is the confidence of the $j$-th joint point in image space. In addition, the PAF indicates the spatial information of $C$ human body parts (one part is determined by two connected joint points). Specifically, the PAF can be expressed as a four-dimensional tensor $\mathbf{P}^{m} \in \mathbb{R}^{C \times 2 \times H \times W}$, denoted as

$$\mathbf{P}^{m} = (\mathbf{P}^{m}_{1}, \dots, \mathbf{P}^{m}_{c}, \dots, \mathbf{P}^{m}_{C})$$

where $\mathbf{P}^{m}_{c} \in \mathbb{R}^{2 \times H \times W}$ represents the position and orientation information of the $c$-th body part in image space.

As shown in Fig. 4, the system uses an OpenPose neural network and an RF2Pose neural network to extract the corresponding JHM and PAF tensors from the video frame $I^{m}$ and the RF frame $R^{m}$, respectively. For the video frame $I^{m}$, the OpenPose neural network first uses an FPN (Feature Pyramid Network) to extract multi-scale visual features from the video frame, and then uses a two-stage CNN (convolutional neural network) to extract the JHM and PAF features. These two steps can be described as

$$(\mathbf{J}^{m}_{I}, \mathbf{P}^{m}_{I}) = \mathcal{F}_{O}(I^{m}; \theta_{O})$$

where $\theta_{O}$ represents the parameters of the OpenPose neural network $\mathcal{F}_{O}$. For the RF frame $R^{m}$, the RF2Pose neural network includes a CSI converter, a JHM generator and a PAF generator. The CSI converter converts the input $R^{m}$ into intermediate features containing human semantic information; the JHM generator extracts a JHM tensor from the intermediate features, and the PAF generator extracts a PAF tensor from them. Thus, given the input $R^{m}$, the output of the RF2Pose neural network can be written as

$$(\mathbf{J}^{m}_{R}, \mathbf{P}^{m}_{R}) = \mathcal{F}_{R}(R^{m}; \theta_{R})$$

where $\theta_{R}$ represents the parameters of the RF2Pose neural network $\mathcal{F}_{R}$.

For $\theta_{O}$, the OpenPose network can be trained on existing public data sets. For $\theta_{R}$, no corresponding public data set exists, so a cross-modal learning method is needed. Specifically, in the training phase, given a training set $\{(I^{y}, R^{y})\}_{y=1:Y}$, the frames $\{I^{y}\}_{y=1:Y}$ are first input into the OpenPose neural network to obtain the corresponding visual semantic feature set $\{(\mathbf{J}^{y}_{I}, \mathbf{P}^{y}_{I})\}_{y=1:Y}$. Subsequently, $\{R^{y}\}_{y=1:Y}$ is used as the input of the RF2Pose neural network $\mathcal{F}_{R}$, with $\{(\mathbf{J}^{y}_{I}, \mathbf{P}^{y}_{I})\}_{y=1:Y}$ as the corresponding training labels. The training goal of $\mathcal{F}_{R}$ is therefore to minimize the difference between its output and the output of $\mathcal{F}_{O}$:

$$\theta_{R}^{*} = \arg\min_{\theta_{R}} L$$

where the overall objective $L$ combines $L_{JHM}$ and $L_{PAF}$, the loss functions corresponding to JHM and PAF, respectively, which can be further expressed as

$$L_{JHM} = \sum_{y=1}^{Y}\sum_{j=1}^{J}\sum_{h,w} \omega^{y}_{j}(h,w)\,\bigl(\mathbf{J}^{y}_{I,j}(h,w) - \mathbf{J}^{y}_{R,j}(h,w)\bigr)^{2}$$

$$L_{PAF} = \sum_{y=1}^{Y}\sum_{c=1}^{C}\sum_{h,w} \nu^{y}_{c}(h,w)\,\bigl\lVert \mathbf{P}^{y}_{I,c}(h,w) - \mathbf{P}^{y}_{R,c}(h,w) \bigr\rVert^{2}$$

In the above two formulas, $\omega^{y}_{j}(h,w)$ and $\nu^{y}_{c}(h,w)$ are the pixel-wise weight factors of the JHM and PAF tensors, respectively. Since most of the element values in the JHM and PAF tensors are close to zero, $\omega^{y}_{j}(h,w)$ and $\nu^{y}_{c}(h,w)$ are set to be linearly related to the absolute value of the $(h, w)$-th element, so as to give greater weight to the non-zero elements. In addition, considering that the JHM and PAF tensors indicate different human semantic information and have different numerical scales, this embodiment sets different coefficients $\lambda_{1}$, $\lambda_{2}$, $\beta_{1}$ and $\beta_{2}$ to balance the influence of $L_{JHM}$ and $L_{PAF}$ in the overall objective function. The trained parameters $\theta_{R}^{*}$ are thereby obtained.
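As a hedged illustration only, one concrete form consistent with this description is written out below. The text states only that the weight factors are linear in the absolute value of the element and that four coefficients balance the two losses; the assignment of $\lambda_{1}, \beta_{1}$ to the JHM weights and $\lambda_{2}, \beta_{2}$ to the PAF weights, the use of the OpenPose label (subscript $I$) inside the absolute value, the unweighted sum of the two losses, and the symbols $\omega$, $\nu$ are all assumptions.

```latex
\begin{aligned}
L &= L_{JHM} + L_{PAF} \\
L_{JHM} &= \sum_{y=1}^{Y}\sum_{j=1}^{J}\sum_{h,w}
  \omega^{y}_{j}(h,w)\,\bigl(\mathbf{J}^{y}_{I,j}(h,w)-\mathbf{J}^{y}_{R,j}(h,w)\bigr)^{2},
  \qquad \omega^{y}_{j}(h,w) = \lambda_{1}\,\bigl|\mathbf{J}^{y}_{I,j}(h,w)\bigr| + \beta_{1} \\
L_{PAF} &= \sum_{y=1}^{Y}\sum_{c=1}^{C}\sum_{h,w}
  \nu^{y}_{c}(h,w)\,\bigl\lVert\mathbf{P}^{y}_{I,c}(h,w)-\mathbf{P}^{y}_{R,c}(h,w)\bigr\rVert^{2},
  \qquad \nu^{y}_{c}(h,w) = \lambda_{2}\,\bigl\lVert\mathbf{P}^{y}_{I,c}(h,w)\bigr\rVert + \beta_{2}
\end{aligned}
```

Under this form a background pixel (label zero) still contributes with weight $\beta$, while a pixel on a joint or limb contributes with the larger weight $\lambda\,|\cdot| + \beta$.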

In the forgery detection and positioning module, the system performs forgery detection and positioning on the video frame $I^{m}$ based on the obtained visual and wireless JHM tensors $\mathbf{J}^{m}_{I}$ and $\mathbf{J}^{m}_{R}$ and the visual and wireless PAF tensors $\mathbf{P}^{m}_{I}$ and $\mathbf{P}^{m}_{R}$. In the video frame forgery detection stage, the two JHM tensors $\mathbf{J}^{m}_{I}$ and $\mathbf{J}^{m}_{R}$ are subtracted to obtain the JHM difference tensor $D^{m}$:

$$D^{m} = \mathbf{J}^{m}_{I} - \mathbf{J}^{m}_{R}$$

This difference operation effectively retains the forgery information and removes redundant, irrelevant information. This is because, if the video frame $I^{m}$ is not forged, $\mathbf{J}^{m}_{I}$ and $\mathbf{J}^{m}_{R}$ should be very similar and their difference should be small; otherwise, they should differ greatly. The JHM difference tensor is then input into a binary classification detection network to estimate the probability that $I^{m}$ has been attacked. Specifically, the binary classification detection network is denoted $\mathcal{F}_{D}$ with parameters $\theta_{D}$, and its output, a probability vector $(p^{m}_{0}, p^{m}_{1})$, is obtained by

$$(p^{m}_{0}, p^{m}_{1}) = \mathcal{F}_{D}(D^{m}; \theta_{D})$$

where $p^{m}_{0}$ and $p^{m}_{1}$ are the estimated probabilities that $I^{m}$ is genuine and forged, respectively. The system then makes a decision based on the values of $p^{m}_{0}$ and $p^{m}_{1}$:

$$I^{m} \text{ is } \begin{cases} \text{forged}, & p^{m}_{1} > p^{m}_{0} \\ \text{genuine}, & \text{otherwise} \end{cases}$$

In this embodiment, a frame whose forgery probability exceeds the threshold of 0.5 is judged to be forged. Once a forgery attack is detected, the video surveillance system responds in a timely manner, for example by issuing an early warning, and further locates the abnormal human body targets in the forged area so as to accurately track the trajectory and behavior of the intruder.
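The decision step can be sketched as follows. The classifier itself is not specified in the text, so this sketch assumes a hypothetical network that emits two raw scores (genuine, forged), converts them to probabilities with a softmax, and applies the 0.5 threshold of this embodiment.

```python
import numpy as np


def forgery_decision(logits, threshold=0.5):
    """Map two classifier scores (genuine, forged) to a decision.

    Returns (is_forged, p_forged). The softmax is an assumption; any
    classifier that outputs a two-element probability vector would do.
    """
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    p_forged = float(probs[1])
    return p_forged > threshold, p_forged
```

Since the two probabilities sum to one, thresholding the forgery probability at 0.5 is the same as picking the larger of the two entries.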

As shown in Fig. 5, the positioning method provided by the invention includes: performing forgery detection on the surveillance video stream using the above detection method; for each surveillance video frame detected as forged, calculating the absolute value of the difference between the human body joint heat map of the current video frame and that of the wireless signals, and the sum of the two PAF tensors; selecting the abnormal human body joint point set corresponding to the current video frame based on the absolute difference of the human body joint heat maps; and performing a joint point association operation on the PAF tensor sum and the abnormal human body joint point set of the current video frame to obtain the association state between the abnormal human body joint points, thereby determining the position of the forged human body object in the current video frame.

To do this, candidate human body joint points are first selected from the JHM difference tensor D^m. Because D^m is the difference between the visual JHM and the wireless JHM, its elements can be positive or negative. The system therefore applies a non-maximum suppression operation to the absolute value of each element of D^m and selects a set of candidate joint points, in which N_j denotes the number of candidates for the j-th joint point and each element is the n-th candidate point for the j-th joint point.

Then, the candidate joint points in this set are associated on the basis of the two PAF tensors, one extracted from the video frame and one extracted from the wireless signals, so as to select the abnormal human body targets. For this purpose, the system first sums the two PAF tensors to obtain the PAF sum tensor, and then estimates the region and posture of the abnormal human body targets based on the candidate joint point set and the PAF sum tensor.

Specifically, a human body posture can be characterized as a set of associated joint points, and the best association among the candidate joint points is obtained by optimizing an association function over a set of association variables. Each association variable is binary and indicates whether the k_1-th joint point and the k_2-th joint point are associated. Based on this operation, the system obtains the position and state of the abnormal human body targets, thereby locating the forged region.
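The positioning steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the 3x3 suppression window, the `threshold` and `min_score` values, the (x, y) channel layout of the PAF sum tensor, the line-integral association score (in the style of OpenPose), and the greedy matching used in place of the optimal association are all choices made for this example.

```python
import numpy as np


def candidate_joints(diff, threshold=0.3):
    """Non-maximum suppression on |D^m|.

    diff has shape (J, H, W). A pixel is kept if it is the maximum of its
    3x3 neighbourhood and exceeds `threshold` (hypothetical value).
    Returns one list of (row, col) candidates per joint type.
    """
    mag = np.abs(diff).astype(float)
    J, H, W = mag.shape
    padded = np.pad(mag, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    # Maximum over the nine shifted views gives the 3x3 neighbourhood maximum.
    neigh = np.stack([padded[:, dy:dy + H, dx:dx + W]
                      for dy in range(3) for dx in range(3)])
    keep = (mag >= neigh.max(axis=0)) & (mag > threshold)
    return [[(int(r), int(c)) for r, c in zip(*np.nonzero(keep[j]))]
            for j in range(J)]


def association_score(paf_sum, a, b, samples=10):
    """Average alignment of the PAF sum with the segment from a to b.

    paf_sum has shape (2, H, W); channel 0 is assumed to hold the x (column)
    component and channel 1 the y (row) component for one body part.
    """
    (r1, c1), (r2, c2) = a, b
    vec = np.array([c2 - c1, r2 - r1], dtype=float)
    norm = np.linalg.norm(vec)
    if norm == 0:
        return 0.0
    rows = np.linspace(r1, r2, samples).round().astype(int)
    cols = np.linspace(c1, c2, samples).round().astype(int)
    field = paf_sum[:, rows, cols]          # shape (2, samples)
    return float(np.mean((vec / norm) @ field))


def associate(paf_sum, cands_a, cands_b, min_score=0.1):
    """Greedy one-to-one matching by descending score (a simplification)."""
    scored = sorted(((association_score(paf_sum, a, b), i, k)
                     for i, a in enumerate(cands_a)
                     for k, b in enumerate(cands_b)), reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for score, i, k in scored:
        if score > min_score and i not in used_a and k not in used_b:
            pairs.append((cands_a[i], cands_b[k]))
            used_a.add(i)
            used_b.add(k)
    return pairs


if __name__ == "__main__":
    diff = np.zeros((2, 8, 8))
    diff[0, 2, 2], diff[1, 2, 6] = -0.9, 0.8   # sign is ignored by the NMS
    paf_video = np.zeros((2, 8, 8))
    paf_wireless = np.zeros((2, 8, 8))
    paf_wireless[0, 2, 2:7] = 1.0              # field pointing along +x
    paf_sum = paf_video + paf_wireless         # PAF sum tensor
    cands = candidate_joints(diff)
    print(associate(paf_sum, cands[0], cands[1]))
```

The pairs returned by `associate` correspond to association variables equal to 1; linking such pairs across all body parts yields the posture and region of each abnormal human body target.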

Fig. 6 shows the forgery positioning result under an inter-frame attack, where the upper half shows the original video frame, the lower half shows the forged video frame, and the positioning result (the associated human body joint points) is superimposed on the forged frame. In Fig. 6, the original video frame is replaced with a video frame that contains no human body target. The method provided by the invention can effectively locate all the forged human body targets under the inter-frame attack.

Fig. 7 shows the forgery positioning result under an intra-frame attack, where the upper half shows the original video frame, the lower half shows the forged video frame, and the positioning result (the associated human body joint points) is superimposed on the forged frame. In Fig. 7, the human body target on the left of the original video frame is replaced with the background. The method provided by the invention can effectively locate all the forged human body targets under the intra-frame attack.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A surveillance video stream forgery detection method based on wireless signals is characterized by comprising the following steps:

S2, extracting the heatmap JHM and the PAF tensors of the human body joint points from the monitoring video frames and the wireless signals through a deep neural network;

S3, if the difference between the heatmap of the human body joint points extracted from the monitoring video frame and the heatmap of the human body joint points extracted from the wireless signal is large, judging that the current monitoring video frame is forged.

2. The detection method as claimed in claim 1, wherein in step S2, a pre-trained OpenPose network is used to extract the heatmap JHM and PAF tensors of human body joint points from the monitoring video frames, and a pre-trained RF2Pose network is used to extract the heatmap JHM and PAF tensors of human body joint points from the wireless signals;

the RF2Pose network includes:

the wireless signal conversion module is used for converting all wireless signals corresponding to each monitoring video frame into intermediate characteristics with the same length and width dimensions as those of the monitoring video frame and outputting the intermediate characteristics to a heatmap generation module and a PAF tensor generation module of human body joint points;

the heatmap generation module of the human body joint point is used for generating JHM of the human body joint point according to the intermediate features;

the training samples of the RF2Pose network are wireless signals, and the labels are the JHM and PAF tensors of the corresponding human body joint points output by the OpenPose network.

3. The detection method according to claim 2, wherein the wireless signal conversion module comprises a deconvolution network, a convolution network and a residual network connected in series in sequence; the heatmap generation module of the human body joint points comprises a decoding network and a fully convolutional network connected in series in sequence, wherein the decoding network is a repeated structure of one deconvolution layer and two convolution layers; the PAF tensor generation module comprises a decoding network and a fully convolutional network connected in series in sequence, wherein the decoding network is a repeated structure of one deconvolution layer and two convolution layers.

4. The detection method according to claim 2 or 3, wherein the Loss function Loss of the RF2Pose network in the training phase is as follows:

wherein the symbols in the formula respectively represent: the parameters of the RF2Pose neural network; the loss functions of the JHM and the PAF corresponding to human body joint points; the balancing coefficients that weight the influence of these two loss functions in the overall objective function; the confidence of the j-th joint point at pixel point (g, w) of the visual JHM; the confidence of the j-th joint point at pixel point (g, w) of the wireless JHM; the position and direction information of the c-th human body part at pixel point (g, w) of the visual PAF; and the position and direction information of the c-th human body part at pixel point (g, w) of the wireless PAF.

5. The detection method according to claim 1, wherein in step S3, based on the similarity or difference of the heatmaps of the human body joint points, a threshold is set or a binary classifier is constructed to determine the difference between the heatmap of the human body joint points extracted from the monitoring video frame and the heatmap of the human body joint points extracted from the wireless signal.

6. A monitoring video stream forgery location method based on wireless signals is characterized by comprising the following steps:

(T1) performing a forgery detection of the surveillance video stream using the detection method according to any one of claims 1 to 5;

(T2) for each surveillance video frame whose detection result is forgery, calculating the absolute value of the difference between the heatmap of the human body joint points extracted from the current surveillance video frame and the heatmap of the human body joint points extracted from the wireless signal, and calculating the sum of the PAF tensor of the human body joint points extracted from the current surveillance video frame and the PAF tensor of the human body joint points extracted from the wireless signal;

(T3) selecting an abnormal human body joint point set corresponding to the current monitoring video frame based on the difference absolute value of the heatmap of the human body joint points;

(T4) performing a joint point association operation on the PAF tensor sum calculated for the current monitoring video frame and the abnormal human body joint point set to obtain the association state between the abnormal human body joint points, thereby determining the position of the forged human body object in the current monitoring video frame.

7. The method according to claim 6, wherein in step (T3), the abnormal human joint set corresponding to the current surveillance video frame is selected by a non-maximum suppression method.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more first programs, which are executed by one or more processors to implement the steps of the wireless signal-based surveillance video stream forgery detection method according to any one of claims 1 to 5; or, the computer readable storage medium stores one or more second programs, which are executed by one or more processors to implement the steps of the wireless signal based surveillance video stream forgery location method according to claim 6 or 7.