CN113850284B - Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction - Google Patents


Info

Publication number
CN113850284B
CN113850284B (application CN202110751853.9A)
Authority
CN
China
Prior art keywords
network
layer
feature
branch prediction
fusion
Prior art date
Legal status
Active
Application number
CN202110751853.9A
Other languages
Chinese (zh)
Other versions
CN113850284A
Inventor
Gan Yongdong
Zhu Xinshan
Wang Jiayu
Sun Hao
Zhang Yun
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110751853.9A priority Critical patent/CN113850284B/en
Publication of CN113850284A publication Critical patent/CN113850284A/en
Application granted granted Critical
Publication of CN113850284B publication Critical patent/CN113850284B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/253 Pattern recognition: fusion techniques of extracted features
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • Y02T10/40 Engine management systems


Abstract

The invention relates to a multi-operation detection method based on multi-scale feature fusion and multi-branch prediction, and belongs to the technical field of multimedia forensics. The prior art generally detects and locates only a single type of operation in an image. The method constructs a multi-operation detection network that extracts composite operation features with a residual-block convolutional stream, performs multi-scale feature fusion, and realizes multi-operation detection through a multi-branch prediction module. A model trained with this detection network can detect and locate multiple types of operations, and shows a degree of robustness to post-processing operations such as noise addition, scaling, blurring, and secondary compression.

Description

Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
Technical Field
The invention belongs to the technical field of multimedia forensics, and in particular relates to a multi-operation detection method based on multi-scale feature fusion and multi-branch prediction.
Background
With the rapid development of computer and Internet technologies, digital multimedia information is widely used across social production and daily life, such as broadcasting, television, news, games, tickets, physical evidence, books, and documents, greatly enriching social life. However, with editing software such as Photoshop, CorelDRAW, and Meitu, multimedia files can be easily edited and modified, leading to serious information-security problems (integrity, confidentiality, availability) and even serious threats to social stability. As a promising countermeasure, the basic idea of digital forensics is to extract the unique trace left by an operation from a multimedia file in order to determine whether the file has undergone that operation. Much research effort has been devoted to detecting JPEG compression, histogram equalization, noise addition, blurring, median filtering, resampling, copy-move, and similar operations.
Traditional methods detect tampering operations based on statistical features. JPEG compression is a common image-processing step applied by almost all digital imaging devices. Li et al. exploit the fact that a spliced JPEG composite often exhibits inconsistent quality factors or inconsistent block positions to detect whether an image is a JPEG composite [Li, Zhang Xinpeng. Detecting composite images using JPEG compression characteristics [J]. Journal of Applied Sciences, 2008(03): 281-287]. However, this method only applies to spliced JPEG composites and cannot handle full-image JPEG compression. Lin et al. observe that the relationship between co-located discrete cosine transform (DCT) coefficients in different blocks of an image is invariant before and after JPEG compression [Lin, C.Y. and Chang, S.F. A robust image authentication method distinguishing JPEG compression from malicious manipulation. IEEE Transactions on Circuits and Systems for Video Technology, 11(2) (2001), 153-168]. This method can distinguish whether a given image block has been maliciously tampered with, and is robust to JPEG compression. Fan et al. statistically model images of different operation types with a Gaussian mixture model (GMM) and extract generic features to detect different types of image operations [W. Fan, K. Wang, and F. Cayre. General-purpose image forensics using patch likelihood under image statistical models. In IEEE International Workshop on Information Forensics and Security (WIFS), pages 1-6, Nov. 2015]. That approach requires constructing multiple binary classifiers for detection, making it cumbersome and not very robust. Gallagher proposed a method for detecting image resampling [A.C. Gallagher. Detection of linear and cubic interpolation in JPEG compressed images. In The 2nd Canadian Conference on Computer and Robot Vision (CRV'05), pp. 65-72, Victoria, BC, Canada, 2005]. It first computes the second-order difference of the image, then judges whether the image has been resampled from the peaks in the Fourier-transform spectrum of each row of the second-order difference matrix. Because the second-order difference matrix of a downsampled image is not periodic, the method performs poorly on downsampling.
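Gallagher's idea above (row-wise second-order difference followed by a Fourier transform of each row) can be sketched in a few lines. This is an illustrative reconstruction, not code from the patent or the cited paper; the function name and the nearest-neighbour upsampling used in the demo are assumptions.

```python
import numpy as np

def resampling_spectrum(image):
    """Average magnitude spectrum of the row-wise second-order difference.

    For a resampled (interpolated) image the second difference is
    periodically correlated, which shows up as peaks in this spectrum.
    """
    d2 = np.diff(image.astype(np.float64), n=2, axis=1)  # second-order difference per row
    spectrum = np.abs(np.fft.fft(d2, axis=1))            # DFT of every row
    return spectrum.mean(axis=0)                          # average over all rows

rng = np.random.default_rng(0)
original = rng.random((64, 64))
upsampled = np.repeat(original, 2, axis=1)  # 2x horizontal resampling (nearest neighbour)
spec = resampling_spectrum(upsampled)
```

A forensic detector would then threshold the prominence of peaks in `spec`; as the text notes, downsampled images lack this periodicity, so the cue disappears.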
Traditional methods suffer from the following problems. First, because operation traces are not prominent, effective operation features are difficult to extract by hand. Second, feature extraction and the classifier are designed separately, so the two cannot be optimized jointly. Third, after malicious tampering a multimedia file may undergo further post-processing, which can erase or mask the operation traces and makes traditional forensics very difficult. Finally, for multi-operation forensics, traditional methods are complex to implement and their performance is extremely limited.
In recent years, deep learning networks (Deep Learning Network, DLN) have achieved great success in many areas, such as image classification, generation and segmentation, object detection and localization, natural language processing, and document analysis. The DLN departs from traditional hand-crafted pipelines and adopts an entirely data-driven optimization workflow: one only needs to build a suitable neural network (NN) for the problem at hand, train it on a sample set, and optimize the NN parameters through training so that it outputs correct predictions. The NN organically combines feature extraction and classification in one framework, obtaining an optimized feature representation and classifier in a data-driven manner. Given the excellent performance of DLN, academia has begun to study DLN-based forensics.
Chen et al. proposed a median-filtering forensics scheme based on convolutional neural networks (CNNs) [Jiansheng Chen, Xiangui Kang, Ye Liu, and Z. Jane Wang. 2015. Median filtering forensics based on convolutional neural networks. IEEE Signal Processing Letters 22, 11 (2015), 1849-1853]. Targeting the median-filtering operation, a preprocessing layer is designed to extract the median-filtering residual image, which is then fed into the CNN. Bayar et al. use a constrained convolution layer to suppress image content and extract operation features, and employ CNNs for multi-operation tamper detection [Belhassen Bayar and Matthew C. Stamm. A deep learning approach to universal image manipulation detection using a new convolutional layer. In The 4th ACM Workshop on Information Hiding and Multimedia Security. ACM, 5-10, 2016]. This method can detect only one operation per image. Cozzolino et al. show that residual-based local descriptors can be regarded as a simple constrained CNN for forgery detection [Davide Cozzolino, Giovanni Poggi, and Luisa Verdoliva. 2017. Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In The 5th ACM Workshop on Information Hiding and Multimedia Security. ACM, 159-164]. The residual unit adds the input directly to the output before re-activation, which not only alleviates network degradation but can also be viewed as a compact constraint. Rao et al. propose using SRM kernels in the first CNN layer to obtain local noise information of an image for tamper detection [Yuan Rao and Jiangqun Ni. 2016. A deep learning approach to detection of splicing and copy-move forgeries in images. In IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 1-6].
These methods can indicate whether an image has been tampered with, but they cannot accurately locate the operated region.
Digital forensics must not only detect whether an operation occurred but also locate where it occurred. Li et al. designed a set of tamper detectors based on multi-scale CNNs [Haodong Li, Weiqi Luo, Xiaoqing Qiu, and Jiwu Huang. 2017. Image forgery localization via integrating tampering possibility maps. IEEE Transactions on Information Forensics and Security 12, 5 (2017), 1240-1252]. The scheme generates a series of complementary tamper-confidence heat maps and locates tampered regions in a digital image using multi-scale features; fusing multi-scale features allows targets of different sizes to be detected well. Zhou et al. proposed a dual-stream network based on Faster R-CNN to locate tampered regions [P. Zhou, X. Han, V. I. Morariu, and L. S. Davis. Learning rich features for image manipulation detection [C]. International Conference on Computer Vision and Pattern Recognition, 2018: 1053-1061]. The method uses the RGB stream for bounding-box regression and combines the noise stream with the RGB stream for classification. A common problem of these two approaches is weak robustness. Wang Changmeng et al. add an attention mechanism to a semantic segmentation network to increase the attention paid to tampered edges, and use a maximum-entropy Markov model to model the correlation between adjacent regions in the attention map [Wang Changmeng. An attention-CNN-based method for tamper detection in document and certificate images: China, CN112907598A [P]. 2021-06-04]. That patent targets tamper detection of qualification certificates and documents.
Forensics based on convolutional neural networks can extract effective features automatically, train on large-scale datasets, generalize well, and achieve detection performance clearly superior to traditional schemes. However, existing forensic techniques generally detect and locate only a single operation type, and because the feature extractor or preprocessing layer is designed for a fixed operation type, they are hard to extend. Localization methods usually split the image into small blocks for detection; the block partition is fixed, inflexible, and yields low localization accuracy. Moreover, existing methods are weakly robust to post-processing operations.
Disclosure of Invention
In view of the drawbacks of the prior art, the object of the present invention is to propose a forensic method capable of detecting and locating multiple types of operations simultaneously while improving robustness to post-processing operations.
In order to achieve the above purpose, the invention adopts the technical scheme that: a multi-operation detection method based on multi-scale feature fusion and multi-branch prediction comprises the following steps:
(1) Selecting a multimedia operation type, and constructing a multimedia data set processed by various operations;
(2) The residual block convolution flow is used as a main network for extracting composite operation characteristics, and a multi-operation detection depth convolution neural network is constructed by combining multi-scale characteristic fusion and multi-branch prediction links;
(3) Training the detection network by using the constructed data set to obtain an optimized detection network model.
Further, in step (1), each sample of the dataset undergoes more than one operation, such as filtering, noise addition, or sharpening, and the area and shape of the operated region may be arbitrary.
In step (2), the backbone network for extracting composite operation features is formed by connecting a group of residual blocks in series, so that the resolution of the feature map output by each residual block decreases progressively while its channel count increases. Multi-scale feature fusion is then performed on the feature maps of different resolutions produced by the backbone, and the result is passed to the multi-branch prediction stage for operation-type classification and bounding-box regression.
Still further, the backbone feature-extraction network is built from more than 5 residual blocks, each consisting of convolution, pooling, and BN layers.
Still further, the pooling layers in the residual blocks use a stride to reduce the feature-map resolution, and the convolution outputs are non-linearly activated.
Further, multi-scale feature fusion proceeds as follows: the highest-layer operation feature map produced by the backbone is upsampled and superimposed on the next lower-layer operation feature map to obtain a fused feature map, and this process is repeated to obtain fused feature maps at the other resolutions. The highest-layer output features are also downsampled to obtain feature maps at two or more further resolutions. Together with the previously generated fused feature maps, these form the multi-scale features that are fed into the respective multi-branch prediction stages.
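The fusion scheme just described can be sketched with plain arrays. This is a minimal illustration under stated assumptions: all feature maps share a channel count (a real network would match channels with 1x1 convolutions), upsampling is nearest-neighbour, and the function names are invented here.

```python
import numpy as np

def upsample2(x):
    # nearest-neighbour 2x upsampling of a (C, H, W) feature map
    return np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)

def downsample2(x):
    # stride-2 subsampling of a (C, H, W) feature map
    return x[:, ::2, ::2]

def fuse(c3, c4, c5):
    """c3..c5: backbone outputs from lower to highest layer.
    Returns the multi-scale feature set described in the text."""
    f4 = c4 + upsample2(c5)   # top map upsampled, superimposed on the next lower map
    f3 = c3 + upsample2(f4)   # repeat to obtain the next fused map
    d6 = downsample2(c5)      # extra scales from downsampling the top map
    d7 = downsample2(d6)
    return [f3, f4, c5, d6, d7]

c3 = np.ones((16, 32, 32))
c4 = np.ones((16, 16, 16))
c5 = np.ones((16, 8, 8))
pyramid = fuse(c3, c4, c5)
```

Each element of `pyramid` would then feed one group of prediction branches, one per resolution.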
Furthermore, in the multi-branch prediction stage, several anchor boxes of different sizes and aspect ratios are placed at each pixel position of the feature map at every resolution and sent to a classification branch module and a box-regression branch module. Each branch applies convolution operations to further extract features and produce its prediction; the box regression predicts the offset of the operated region relative to the anchor box.
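The text above says the regression branch predicts offsets of the operated region relative to an anchor box. The patent does not give the exact parameterisation, so the sketch below uses the common detection encoding (centre offsets normalised by anchor size, log-scale width and height) purely as an assumption.

```python
import numpy as np

def encode(anchor, box):
    """Offsets of a ground-truth box relative to an anchor; both are
    (cx, cy, w, h). Standard detection parameterisation, assumed here."""
    ax, ay, aw, ah = anchor
    bx, by, bw, bh = box
    return np.array([(bx - ax) / aw, (by - ay) / ah,
                     np.log(bw / aw), np.log(bh / ah)])

def decode(anchor, offsets):
    """Inverse of encode: recover the predicted box from the offsets."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return np.array([ax + dx * aw, ay + dy * ah,
                     aw * np.exp(dw), ah * np.exp(dh)])

anchor = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.48, 0.25, 0.18)
offsets = encode(anchor, gt_box)
recovered = decode(anchor, offsets)
```

During training the network regresses `offsets`; at inference `decode` turns the predicted offsets back into a box in image coordinates.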
Further, in step (3), training of the detection network uses data augmentation, random dropout, and L2 regularization to reduce overfitting of the model.
The invention has the following effects: operation features are extracted automatically and adaptively, with no need for a preprocessing layer tailored to a particular operation; multiple types of operations can be detected and located, with localization accuracy far higher than block-wise detection; robustness to post-processing operations such as noise addition, scaling, blurring, and secondary compression is improved; detection is end-to-end, taking an image as input and directly outputting the detection result; detection is fast; and good performance is achievable on large-scale datasets.
Drawings
FIG. 1 is a basic flow of a multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
FIG. 2 is a network block diagram of one embodiment of the invention
FIG. 3 is an original image without operations
FIG. 4 is a pseudo-color map of the generated operation regions
FIG. 5 is an image after tampering operations
FIG. 6 shows the detection results for tampering operations in this embodiment
Detailed Description
A specific embodiment of the present invention is described below with reference to the accompanying drawings to further illustrate its effects.
Taking the image signal as the multimedia representation, the multi-operation detection method based on multi-scale feature fusion and multi-branch prediction is implemented as follows; the overall flow is shown in FIG. 1:
Step 1, constructing the dataset: 17125 three-channel pictures are taken from the PASCAL VOC 2012 dataset (examples are shown in FIG. 3), and eight operation types are selected: homomorphic filtering, median filtering, additive white Gaussian noise, local histogram equalization, Gaussian blur, edge sharpening, local resampling, and gamma transformation. One or more irregular operation regions are randomly generated per picture using a region random-growth algorithm, as shown in FIG. 4. Each region is processed with one of the eight operation types chosen at random, yielding the manipulated image shown in FIG. 5. During training and testing, each picture is one sample. To supervise the training process and provide a reference for computing the detector's evaluation metrics, label information is recorded for each image sample: the image width and height and, for each region, the operation type and its left, top, right, and bottom boundaries. During training, an image sample is fed into the network, the type and position of each operation region are predicted and compared with the label; learning drives the model's predicted distribution ever closer to the true label distribution, improving the model's performance. Evaluating the detector objectively measures the gap between the network's predicted distribution and the label distribution. The dataset consists of the image samples and their corresponding labels and, once generated, is split into a training set and a test set at a ratio of 9:1.
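The patent does not specify the region random-growth algorithm, so the following is a minimal sketch of the idea: start from a random seed pixel and repeatedly add a random free neighbour of the current region until the target area is reached, then derive the boundary labels from the mask. All names and the growth rule are assumptions.

```python
import numpy as np

def grow_region(h, w, area, rng):
    """Grow an irregular binary mask from a random seed until it covers
    `area` pixels (assumed interpretation of 'region random growth')."""
    mask = np.zeros((h, w), dtype=bool)
    seed = (int(rng.integers(h)), int(rng.integers(w)))
    mask[seed] = True
    frontier = [seed]
    while mask.sum() < area and frontier:
        y, x = frontier[rng.integers(len(frontier))]
        free = [(y + dy, x + dx) for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= y + dy < h and 0 <= x + dx < w and not mask[y + dy, x + dx]]
        if not free:                     # this cell is saturated, drop it
            frontier.remove((y, x))
            continue
        ny, nx = free[rng.integers(len(free))]
        mask[ny, nx] = True
        frontier.append((ny, nx))
    return mask

rng = np.random.default_rng(7)
mask = grow_region(64, 64, 200, rng)

# Boundary labels of the region, as recorded for each sample in the text.
ys, xs = np.nonzero(mask)
label = {"left": int(xs.min()), "top": int(ys.min()),
         "right": int(xs.max()), "bottom": int(ys.max())}
```

One such mask is generated per operation region; the selected operation is then applied only inside the mask, and `label` plus the operation type forms the sample annotation.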
Step 2: a residual-block convolutional stream serves as the backbone network for extracting composite operation features, combined with multi-scale feature fusion and multi-branch prediction stages to build the multi-operation detection deep convolutional neural network. Specifically, the backbone is formed by connecting a group of residual blocks in series; the network structure and parameter configuration are given in Table 1, and the resolution of each residual block's output feature map decreases progressively. Each residual block consists of two residual units; each residual unit applies a 3x3 convolution, ReLU nonlinear activation, BN batch normalization, and Dropout with probability 0.1, followed by max pooling. The residual unit adds its input to its output to form the total output, preventing network degradation while extracting composite operation features; convolution or pooling with stride 2 successively reduces the feature-map resolution and increases the channel count. Multi-scale feature fusion is then applied to the feature maps of different resolutions from the backbone: the highest-layer operation feature map is upsampled and superimposed on the lower-layer operation feature map to obtain a fused feature map, and the process is repeated to obtain two high-resolution fused feature maps; the highest-layer output features are also downsampled to obtain feature maps at two further resolutions. These five feature maps of different resolutions are combined into the multi-scale features and fed into the respective multi-branch prediction stages.
The multi-branch prediction stage performs operation-type classification and bounding-box regression. At each pixel position of the feature map at every resolution, 4 anchor boxes of different sizes and aspect ratios are placed, with width and height equal to 0.1x0.1, 0.2x0.2, 0.2x0.3, and 0.3x0.2 of the original image width and height respectively. They are fed into 5 groups of classification and box-regression branch modules, each branch applying 4 consecutive 3x3 convolution operations to further extract features and obtain the prediction. The classification branch predicts the category of each pixel's anchor boxes; the box regression predicts the offsets of the operation region's center coordinates, width, and height relative to the anchor box. The full network structure is shown in FIG. 2.
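Anchor placement as described above can be sketched directly. Note the garbled size list in the source is read here as the four (width, height) pairs (0.1, 0.1), (0.2, 0.2), (0.2, 0.3), (0.3, 0.2), expressed as fractions of the original image size; that reading, and the cell-centre placement, are assumptions.

```python
import numpy as np

# Four anchor shapes, (w, h) as fractions of image width and height.
ANCHOR_SHAPES = [(0.1, 0.1), (0.2, 0.2), (0.2, 0.3), (0.3, 0.2)]

def anchor_grid(fh, fw):
    """Place the four anchors at every pixel of an fh x fw feature map.
    Returns boxes as (cx, cy, w, h) in normalised image coordinates."""
    anchors = []
    for y in range(fh):
        for x in range(fw):
            cx, cy = (x + 0.5) / fw, (y + 0.5) / fh  # cell centre
            for w, h in ANCHOR_SHAPES:
                anchors.append((cx, cy, w, h))
    return np.array(anchors)

anchors = anchor_grid(8, 8)  # one of the five pyramid resolutions
```

Repeating this for all five resolutions gives the complete anchor set that the classification and regression branches score.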
Table 1 backbone network parameter configuration of one embodiment of the invention
[Table 1 is provided as an image in the original document.]
The symbols in Table 1 are as follows: Conv denotes convolution, with its five parameters being the number of input channels, number of output channels, convolution kernel size, padding size, and stride (s). BN denotes batch normalization (Batch Normalization), ReLU the nonlinear activation function, MaxPool the max-pooling operation, and Dropout the random-deactivation operation, whose parameter is the dropout probability. In layers 1-4, every two consecutive convolution blocks perform one residual operation: the input is added directly to the output to form the total output.
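The residual operation described for Table 1 can be illustrated with a small numpy sketch: two 3x3 convolution stages with ReLU and batch normalization, the input added directly to the output, then stride-2 max pooling. This is a simplified inference-time illustration (Dropout omitted, random weights); shapes and helper names are assumptions, not the patent's exact configuration.

```python
import numpy as np

def conv3x3(x, w):
    """'same' 3x3 convolution; x: (C, H, W), w: (Cout, C, 3, 3)."""
    c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # accumulate one kernel tap over the whole shifted map
            out += np.einsum('oc,chw->ohw', w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def max_pool2(x):
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def residual_block(x, w1, w2):
    """Two conv->ReLU->BN stages plus the identity shortcut,
    followed by stride-2 max pooling to halve the resolution."""
    y = batch_norm(np.maximum(conv3x3(x, w1), 0))
    y = batch_norm(np.maximum(conv3x3(y, w2), 0))
    return max_pool2(x + y)   # input added directly to the output

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((8, 8, 3, 3)) * 0.1
w2 = rng.standard_normal((8, 8, 3, 3)) * 0.1
out = residual_block(x, w1, w2)
```

Stacking several such blocks reproduces the backbone behaviour stated in the text: resolution halves at each block while channels can grow.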
Step 3: train the detection network with the constructed dataset to obtain an optimized detection network model. Training uses the SGD optimizer with an initial learning rate of 5x10^-4 and a batch size of 32; from iteration 30000 onward the learning rate is reduced by 30% every 20000 iterations, and training runs for a total of 1x10^6 iterations. Training uses data augmentation, random dropout, and L2 regularization to reduce overfitting of the model. Data augmentation uses random mirror flips, which barely affect the operation regions, in both the X and Y directions. Dropout is applied after the convolutions in the residual units of the backbone network. The L2 regularization coefficient is 0.005.
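The stated learning-rate schedule can be written out explicitly. Interpreting "reduced by 30%" as multiplication by 0.7 at each decay point is an assumption; the step boundaries follow the text (constant until iteration 30000, then a decay every 20000 iterations).

```python
def learning_rate(step, base=5e-4):
    """SGD learning-rate schedule from the embodiment: constant 5e-4,
    then multiplied by 0.7 every 20000 steps starting at step 30000
    (assumed reading of 'reduced by 30%')."""
    if step < 30000:
        return base
    decays = 1 + (step - 30000) // 20000
    return base * (0.7 ** decays)

lrs = [learning_rate(s) for s in (0, 29999, 30000, 50000)]
```

Under this reading the rate drops to 3.5x10^-4 at iteration 30000 and to 2.45x10^-4 at iteration 50000.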
The detection model obtained in this embodiment is tested on 1712 manipulated pictures. The test hardware platform has an i5-9400 CPU at 2.9 GHz and an NVIDIA RTX 2060 GPU. The average precision (Average Precision, AP) for each operation type and the overall mean average precision (mean Average Precision, mAP) are recorded in Table 2. Detection results on some test pictures are shown in FIG. 6.
Table 2 test results of model on test set
[Table 2 is provided as an image in the original document.]
As the data in Table 2 show, the method of the invention can detect multiple types of operations and locate the operated regions simultaneously. Tested on a dataset without post-processing, the detection model of this embodiment reaches a mean average precision of 0.6969, and the AP for homomorphic filtering and Gaussian white noise exceeds 0.85, indicating good detection performance.
To verify robustness under various post-processing conditions, the following experiments were carried out: the test-set images were subjected to four post-processing operations, namely JPEG double compression with quality factor 75% followed by quality factor 95% (Jpeg75 then Jpeg95), scaling (Zoom), salt-and-pepper noise, and bilateral filtering. The AP of each operation type and the mAP under the detection model of this embodiment are recorded in Table 3:
TABLE 3 robustness verification experiment results
[Table 3 is provided as an image in the original document.]
As the data in Table 3 show, when the detection model of this embodiment is tested on the post-processed datasets, the mAP is 0.6574 under the Jpeg75 then Jpeg95 double compression, a drop of 0.0395; 0.6435 under the Zoom resampling operation, a drop of 0.0534; 0.6810 under the salt-and-pepper noise operation, a drop of 0.0159; and 0.6251 under the bilateral filtering operation, a drop of 0.0718. None of these post-processing operations lowers detection accuracy by more than 8%, meaning the detector can still effectively detect and locate tampering operations after post-processing and is therefore strongly robust.
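The quoted robustness deltas follow directly from the per-condition mAP values and the clean-test mAP of 0.6969; a short check confirms the arithmetic, including the "no more than 8%" claim.

```python
# mAP under each post-processing condition, from Table 3's summary,
# compared against the clean-test mAP of 0.6969 from Table 2.
clean_map = 0.6969
post = {
    "Jpeg75 then Jpeg95": 0.6574,
    "Zoom": 0.6435,
    "Salt and pepper noise": 0.6810,
    "Bilateral filters": 0.6251,
}
drops = {name: round(clean_map - v, 4) for name, v in post.items()}
worst = max(drops.values())  # largest accuracy loss across conditions
```

`drops` reproduces the four deltas quoted in the text, and `worst` stays below 0.08.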
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects as illustrative and not restrictive; for example:
1) The multimedia file type is not limited to images and may include audio, video, etc.;
2) The operation type is not limited to the eight types mentioned in the embodiments;
3) The network structure provided by the invention can also be applied to object detection in images, not only to operation detection;
4) The selection of various data set construction parameters and network configuration parameters is not limited to the configuration in the embodiment.
The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (6)

1. A multi-operation detection method based on multi-scale feature fusion and multi-branch prediction comprises the following steps:
(1) Selecting a multimedia operation type, and constructing a multimedia data set processed by various operations; the operation processing comprises filtering, noise adding, sharpening, JPEG compression, histogram equalization, blurring, median filtering, resampling, copy-move, homomorphic filtering, adding Gaussian white noise, local histogram equalization, gaussian blurring, edge sharpening and gamma transformation; performing more than one operation on each sample of the multimedia data set;
(2) The residual block convolution flow is used as a main network for extracting composite operation characteristics, and a multi-operation detection depth convolution neural network is constructed by combining multi-scale characteristic fusion and multi-branch prediction links;
(3) Training the multi-operation detection depth convolutional neural network by using the constructed multimedia data set to obtain an optimized detection network model for classifying operation types and locating the specific position of the operation;
the multi-scale feature fusion comprises: up-sampling the highest-layer operation feature map produced by the backbone network and superimposing it on a lower-layer operation feature map to obtain a fused feature; repeating this process to obtain fused feature maps at the other resolutions; down-sampling the highest-layer output feature to obtain feature maps at two or more additional resolutions; and combining these with the previously generated fused feature maps to form the multi-scale features, each of which is fed into the multi-branch prediction links;
a backbone network for extracting composite operation features is formed by connecting a plurality of residual blocks in series, the serially connected residual blocks being numbered 1, 2, 3, …, n in order from the network input to the network output;
the output feature map of the n-th-layer residual block of the backbone network is fused with the output feature map of the (n-1)-th-layer residual block to obtain a first fused feature map;
the output feature map of the (n-1)-th-layer residual block of the backbone network is fused with the output feature map of the (n-2)-th-layer residual block to obtain a second fused feature map;
the output feature map of the n-th-layer residual block of the backbone network is down-sampled to obtain at least two feature maps of lower resolution, recorded as the highest-layer down-sampling result;
and the final output feature map of the backbone network, namely the output feature map of the n-th-layer residual block, the first fused feature map, the second fused feature map and the highest-layer down-sampling result are combined into the multi-scale features.
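As an illustrative sketch only (not part of the claimed method), the fusion scheme of claim 1 can be expressed with NumPy, using nearest-neighbour up-sampling and stride-2 sub-sampling as stand-ins for the learned layers; the layer names c3–c5 and p3–p7 are hypothetical:

```python
import numpy as np

def upsample2x(f):
    # nearest-neighbour up-sampling: double each spatial dimension
    return f.repeat(2, axis=0).repeat(2, axis=1)

def downsample2x(f):
    # strided sub-sampling: halve each spatial dimension
    return f[::2, ::2]

# hypothetical backbone outputs; resolution halves at each residual block
c3 = np.random.rand(32, 32)   # layer n-2
c4 = np.random.rand(16, 16)   # layer n-1
c5 = np.random.rand(8, 8)     # layer n (highest-layer feature map)

# first fused map: up-sample the highest layer and superimpose it on layer n-1
p4 = c4 + upsample2x(c5)
# repeat the process for the next lower layer to obtain the second fused map
p3 = c3 + upsample2x(p4)
# down-sample the highest-layer output to obtain two extra resolutions
p6 = downsample2x(c5)
p7 = downsample2x(p6)

# the multi-scale feature set fed to the multi-branch prediction links
pyramid = [p3, p4, c5, p6, p7]
```

A real implementation would apply 1x1 convolutions to align channel counts before the addition; the sketch keeps single-channel maps so only the spatial bookkeeping is shown.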
2. The multi-operation detection method based on multi-scale feature fusion and multi-branch prediction according to claim 1, wherein: a backbone network for extracting composite operation features is formed by connecting a group of residual blocks in series; the resolution of the feature map output by each residual block decreases successively while its number of channels increases; the feature maps of different resolutions obtained from the backbone network undergo multi-scale feature fusion and are then passed to the multi-branch prediction links for operation-type classification and bounding-box regression prediction.
3. A multi-operation detection method based on multi-scale feature fusion and multi-branch prediction as claimed in claim 2, wherein: the backbone feature extraction network is constructed from more than 5 residual blocks, each residual block consisting of a convolution layer, a pooling layer and a BN layer.
4. A multi-operation detection method based on multi-scale feature fusion and multi-branch prediction as claimed in claim 3, wherein: the pooling layer in each residual block uses a stride to reduce the feature-map resolution, and the convolution output is non-linearly activated.
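To illustrate claims 3 and 4 (again only as a sketch, not the claimed implementation), a residual block with a 3x3 convolution, simplified batch normalization, non-linear activation and strided pooling can be written in NumPy; `conv3x3` is a hypothetical single-channel helper:

```python
import numpy as np

def relu(x):
    # non-linear activation applied to the convolution output (claim 4)
    return np.maximum(x, 0.0)

def conv3x3(x, w):
    # naive 'same' 3x3 convolution over a single-channel map (hypothetical helper)
    h, wd = x.shape
    pad = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w, stride=2):
    y = conv3x3(x, w)
    # BN layer simplified to per-map standardization
    y = (y - y.mean()) / (y.std() + 1e-5)
    y = relu(y + x)            # residual (skip) connection + activation
    # strided pooling reduces the feature-map resolution (claim 4)
    return y[::stride, ::stride]

x = np.random.rand(8, 8)
w = np.random.rand(3, 3) * 0.1
out = residual_block(x, w)     # 8x8 input -> 4x4 output
```

Stacking n such blocks halves the resolution n times, which is what produces the different scales fused in claim 1.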
5. A multi-operation detection method based on multi-scale feature fusion and multi-branch prediction as claimed in claim 2, wherein: in the multi-branch prediction links, a plurality of anchor boxes of different sizes and aspect ratios are placed at each pixel position of the feature map at each resolution and sent to a classification branch module and a bounding-box regression branch module respectively; convolution operations further extract features to produce the prediction results, the bounding-box regression predicting the offset of the operated region relative to the anchor boxes.
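The anchor placement and offset parameterization of claim 5 can be sketched as follows; the sizes, ratios and the standard (dx, dy, dw, dh) encoding are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def make_anchors(fh, fw, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Place len(sizes) * len(ratios) anchor boxes (cx, cy, w, h) at each
    pixel of an (fh, fw) feature map; stride maps pixels back to the image."""
    anchors = []
    for i in range(fh):
        for j in range(fw):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

def box_offsets(gt, anchor):
    """Regression target: offset of the ground-truth operated region
    relative to an anchor, in a common (dx, dy, dw, dh) parameterization."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

anchors = make_anchors(4, 4, stride=16)   # 4x4 map -> 96 anchors
```

An anchor that coincides exactly with the ground-truth box yields a zero offset vector, which is the fixed point the regression branch is trained toward.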
6. A multi-operation detection method based on multi-scale feature fusion and multi-branch prediction as claimed in claim 2, wherein: the detection network is trained with data augmentation, dropout (random inactivation) and L2 regularization to reduce model overfitting.
CN202110751853.9A 2021-07-04 2021-07-04 Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction Active CN113850284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110751853.9A CN113850284B (en) 2021-07-04 2021-07-04 Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction


Publications (2)

Publication Number Publication Date
CN113850284A CN113850284A (en) 2021-12-28
CN113850284B (en) 2023-06-23

Family

ID=78975060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110751853.9A Active CN113850284B (en) 2021-07-04 2021-07-04 Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction

Country Status (1)

Country Link
CN (1) CN113850284B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795370B (en) * 2023-02-10 2023-05-30 Nanchang University Electronic digital information forensics method and system based on resampling traces
CN118135641A (en) * 2024-05-07 2024-06-04 Qilu University of Technology (Shandong Academy of Sciences) Face forgery detection method based on local forged region detection

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063728A (en) * 2018-06-20 2018-12-21 Yanshan University Fire image pattern recognition method based on deep learning
CN111191736A (en) * 2020-01-05 2020-05-22 Xidian University Hyperspectral image classification method based on deep feature cross fusion
WO2020199593A1 (en) * 2019-04-04 2020-10-08 Ping An Technology (Shenzhen) Co., Ltd. Image segmentation model training method and apparatus, image segmentation method and apparatus, and device and medium
CN112464930A (en) * 2019-09-09 2021-03-09 Huawei Technologies Co., Ltd. Target detection network construction method, target detection method, device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494937B2 (en) * 2018-11-16 2022-11-08 Uatc, Llc Multi-task multi-sensor fusion for three-dimensional object detection
CN110443143B (en) * 2019-07-09 2020-12-18 Wuhan University of Science and Technology Remote sensing image scene classification method fusing multi-branch convolutional neural networks
CN110706242B (en) * 2019-08-26 2022-05-03 Zhejiang University of Technology Object-level edge detection method based on deep residual network
CN110490174A (en) * 2019-08-27 2019-11-22 University of Electronic Science and Technology of China Multi-scale pedestrian detection method based on feature fusion
CN111368754B (en) * 2020-03-08 2023-11-28 Beijing University of Technology Airport runway foreign matter detection method based on global context information
CN111768372B (en) * 2020-06-12 2024-03-12 State Grid Intelligent Technology Co., Ltd. Method and system for detecting foreign matter in a GIS (gas insulated switchgear) cavity
CN112712528B (en) * 2020-12-24 2024-03-26 Zhejiang University of Technology Intestinal lesion segmentation method combining a multi-scale U-shaped residual encoder and an integral reverse attention mechanism

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063728A (en) * 2018-06-20 2018-12-21 Yanshan University Fire image pattern recognition method based on deep learning
WO2020199593A1 (en) * 2019-04-04 2020-10-08 Ping An Technology (Shenzhen) Co., Ltd. Image segmentation model training method and apparatus, image segmentation method and apparatus, and device and medium
CN112464930A (en) * 2019-09-09 2021-03-09 Huawei Technologies Co., Ltd. Target detection network construction method, target detection method, device and storage medium
CN111191736A (en) * 2020-01-05 2020-05-22 Xidian University Hyperspectral image classification method based on deep feature cross fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Multi-scale feature fusion residual network for Single Image Super-Resolution; Jinghui Qin et al.; Neurocomputing; Vol. 379; 334-342 *
Object detection in optical remote sensing images based on multi-scale deconvolution feature fusion network; Chen Jing; China Masters' Theses Full-text Database (Engineering Science and Technology II) (No. 2); C028-151 *
Tampered image recognition based on improved three-stream Faster R-CNN; Xu Dai et al.; Journal of Computer Applications (No. 5); 79-85 *


Similar Documents

Publication Publication Date Title
CN111311563B (en) Image tampering detection method based on multi-domain feature fusion
Zhong et al. An end-to-end dense-inceptionnet for image copy-move forgery detection
Chen et al. A serial image copy-move forgery localization scheme with source/target distinguishment
Li et al. Identification of deep network generated images using disparities in color components
Wang et al. Detection and localization of image forgeries using improved mask regional convolutional neural network
CN110349136A (en) Tampered image detection method based on deep learning
CN113850284B (en) Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction
Yang et al. Spatiotemporal trident networks: detection and localization of object removal tampering in video passive forensics
Gan et al. Video object forgery detection algorithm based on VGG-11 convolutional neural network
CN110457996B (en) Video moving object tampering evidence obtaining method based on VGG-11 convolutional neural network
AlSawadi et al. Copy-move image forgery detection using local binary pattern and neighborhood clustering
Yu et al. Manipulation classification for jpeg images using multi-domain features
CN111476727B (en) Video motion enhancement method for face-changing video detection
CN115393698A (en) Digital image tampering detection method based on improved DPN network
Huang et al. DS-UNet: a dual streams UNet for refined image forgery localization
CN111259792A (en) Face living body detection method based on DWT-LBP-DCT characteristics
Gu et al. FBI-Net: Frequency-based image forgery localization via multitask learning With self-attention
Dixit et al. Utilization of edge operators for localization of copy-move image forgery using WLD-HOG features with connected component labeling
Dixit et al. Copy-move image forgery detection a review
Jin et al. Object-based video forgery detection via dual-stream networks
Xia et al. Abnormal event detection method in surveillance video based on temporal CNN and sparse optical flow
Gan et al. Highly accurate end-to-end image steganalysis based on auxiliary information and attention mechanism
CN111814543B (en) Depth video object repairing and tampering detection method
CN115100128A (en) Depth forgery detection method based on artifact noise
CN108364256A (en) Image splicing detection method based on quaternion wavelet transform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant