CN110852316A - Image tampering detection and positioning method adopting convolution network with dense structure - Google Patents


Info

Publication number: CN110852316A; application number CN201911081464.9A; granted as CN110852316B
Authority: CN (China)
Prior art keywords: image, layer, network, dense, tampering
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 张榕瑜 (Zhang Rongyu), 倪江群 (Ni Jiangqun)
Original and current assignee: Sun Yat-sen University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Sun Yat-sen University


Classifications

    • G06V 10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention provides an image tampering detection and positioning method using a dense-structure convolutional network. The method comprises: inputting an image to be detected and preprocessing it with spatial rich model (SRM) convolution to obtain a preprocessed image; constructing a densely connected convolutional network to extract tampered-image features from the preprocessed image, obtaining the binary classification information of the image to be detected and completing the detection of image tampering; constructing a deconvolution network symmetric in structure to the convolutional network and taking the binary classification information as input; and outputting, from the deconvolution network, the positioned image according to the obtained image tampering region. The method applies deep learning to image tampering detection and positioning, handles a variety of tampering operations, and has good robustness and practicality. It provides a unified framework for detection and positioning that can both predict whether an image has been tampered with and predict the tampered region, giving accurate pixel-by-pixel labels and detailed object contour boundaries.

Description

Image tampering detection and positioning method adopting convolution network with dense structure
Technical Field
The invention relates to the technical field of blind image forensics, and in particular to an image tampering detection and positioning method using a dense-structure convolutional network.
Background
In the information age, images have become one of the main carriers of information, fully integrated into human life because of their intuitive representation of objects and ideas. However, image tampering technology has also advanced dramatically, and its serious threat to the security of multimedia content cannot be ignored. Current techniques for identifying image tampering fall mainly into methods based on hand-crafted feature extraction and methods based on deep learning.
Methods based on hand-crafted feature extraction apply various transforms to the image, extract features, and then classify them with a threshold or a machine learning method. They depend, however, on the researcher's modeling of the image features and are usually suited to only one type of tampering identification: although they achieve good results on one tampering operation, they generalize poorly to others and lack robustness and extensibility. Deep learning methods usually focus on only one of detection and positioning; although they can reach high accuracy in detection, they fail to exploit deep learning's strength in target detection and do not make full use of the relationship between detection and positioning.
Disclosure of Invention
To overcome the technical defect that existing image tampering identification techniques cannot perform detection and positioning of image tampering at the same time, the invention provides an image tampering detection and positioning method using a dense-structure convolutional network.
In order to solve the technical problems, the technical scheme of the invention is as follows:
An image tampering detection and positioning method using a dense-structure convolutional network comprises the following steps:
S1: inputting an image to be detected, and preprocessing it with spatial rich model (SRM) convolution to obtain a preprocessed image;
S2: constructing a densely connected convolutional network to extract tampered-image features from the preprocessed image, obtaining the binary classification information of the image to be detected and completing the detection of image tampering;
S3: constructing a deconvolution network symmetric in structure to the densely connected convolutional network, taking the binary classification information of the image to be detected as input, and positioning the image tampering region;
S4: outputting, by the deconvolution network, the positioned image according to the obtained image tampering region, completing the positioning of image tampering.
In step S2, the densely connected convolutional network includes a pooling layer, dense layers, transition layers, a global average pooling layer and a fully connected layer; wherein:
the pooling layer performs one convolution and one max-pooling operation on the preprocessed image and feeds the result into the first dense layer;
there are several dense layers and transition layers; the output of each dense layer is fed into its corresponding transition layer, and the last transition layer feeds the resulting tampered-image feature map into the global average pooling layer;
the global average pooling layer average-pools the tampered-image feature map, and the fully connected layer computes and outputs two probability values representing the probability of tampering and of non-tampering respectively, yielding the binary classification information of the image to be detected.
The dense layer comprises several basic structure layers, each consisting of two successive convolution layers; the input of each basic structure layer is the merge (concatenation) of the outputs of all preceding layers, making the structure a locally dense version of the residual structure.
The densely connected convolutional network has four dense layers, containing 5, 10, 20 and 12 basic structure layers respectively.
The transition layer comprises a convolution layer, which convolves the feature map output by the dense layer once and then applies average pooling to reduce the image size.
Wherein, the fully connected layer computes and outputs the two probability values through a softmax function; the calculation formula is:

a_i = exp(z_i) / Σ_j exp(z_j)

where i indexes the two categories (tampered / non-tampered), z_i represents the output value of the network for category i, y_i represents the true value of the sample for category i (the training target), and a_i represents the softmax weight (probability) of category i.
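A small numeric check of the softmax step: two network output values are mapped to probabilities that sum to one, and the larger one decides tampered versus non-tampered (the logit values below are illustrative, not from the patent):

```python
import math

# Softmax over the two category logits z_i: a_i = exp(z_i) / sum_j exp(z_j)
def softmax(z):
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, -1.0])     # [tampered, non-tampered] logits
print(probs)                     # two probabilities summing to 1
print(probs.index(max(probs)))   # 0 -> classified as tampered
```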
In this scheme, in order to better capture the tampering-noise characteristics of the image, one SRM convolution is applied to the three RGB channels of the input image; the convolution kernels are initialized with normalized SRM filters, the three channels of each kernel being assigned the same filter, giving 30 filters, and the convolution output is concatenated with the three RGB channels.
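As a rough sketch of this preprocessing step, the channel bookkeeping works out as follows. The 5x5 "KV" kernel below is one classic normalized SRM filter used here as a stand-in for all 30; the patent's actual filter bank is not reproduced:

```python
import numpy as np

# One well-known SRM residual kernel (the 5x5 "KV" filter), normalized by 12.
KV = np.array([[-1,  2,  -2,  2, -1],
               [ 2, -6,   8, -6,  2],
               [-2,  8, -12,  8, -2],
               [ 2, -6,   8, -6,  2],
               [-1,  2,  -2,  2, -1]], dtype=np.float32) / 12.0

def srm_preprocess(image, kernels):
    """Apply each kernel to the RGB image (same filter on all three
    channels, summed), then concatenate the residuals with the RGB input."""
    h, w, _ = image.shape
    pad = kernels[0].shape[0] // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    residuals = []
    for k in kernels:
        out = np.zeros((h, w), dtype=np.float32)
        for dy in range(k.shape[0]):
            for dx in range(k.shape[1]):
                # same filter weight applied to all three channels, summed
                out += k[dy, dx] * padded[dy:dy + h, dx:dx + w, :].sum(axis=2)
        residuals.append(out)
    return np.concatenate([image] + [r[..., None] for r in residuals], axis=2)

img = np.random.rand(16, 16, 3).astype(np.float32)
kernels = [KV] * 30              # placeholder for 30 normalized SRM filters
features = srm_preprocess(img, kernels)
print(features.shape)            # (16, 16, 33): 3 RGB + 30 residual maps
```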
In this scheme, after the pooling operation of the pooling layer, a deep network is built from the dense layers and transition layers of the densely connected convolutional network to extract the characteristics of the tampered image. Two successive convolution layers form a basic structure layer; a dense layer may contain several basic structure layers, and the input of each basic structure layer within a dense layer is the merge (concatenation) of all outputs of the preceding layers. Such a structure is a locally dense version of the residual structure, which helps train deeper networks without overfitting. The convolutional network uses four dense layers containing 5, 10, 20 and 12 basic structures respectively; a transition layer is a convolution layer that convolves the input feature map once to reduce its depth and then average-pools it to reduce its size. All pooling in the network is 2x2, so the feature map after the last dense layer is one thirty-second (1/32) of the original size. The global average pooling layer averages the feature map, keeping only the depth dimension; the fully connected layer then outputs two values, which a softmax function converts into probability values representing the probability of tampering and of non-tampering, the larger of which is taken as the final decision.
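The dense-connectivity and size bookkeeping described above can be sketched as follows. The growth rate and channel counts are illustrative, a random projection stands in for the two convolutions, and the count of five 2x2 poolings (one initial max pooling plus four transition poolings) is inferred from the text rather than taken verbatim from the patent:

```python
import numpy as np

def dense_block(x, num_layers, growth=8, rng=None):
    """Each basic structure layer sees the concatenation of the block
    input and every previous layer's output (DenseNet-style)."""
    rng = rng or np.random.default_rng(0)
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=-1)    # merge all prior outputs
        w = rng.standard_normal((inp.shape[-1], growth)).astype(np.float32)
        features.append(np.maximum(inp @ w, 0.0))  # conv + ReLU stand-in
    return np.concatenate(features, axis=-1)

y = dense_block(np.zeros((4, 4, 16), dtype=np.float32), num_layers=5)
print(y.shape[-1])   # 16 + 5*8 = 56 output channels

# Spatial-size bookkeeping: five 2x2 poolings halve the size five times,
# i.e. the final feature map is 1/32 of the original.
size = 128
for _ in range(5):
    size //= 2
print(size)          # 128 -> 4
```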
In this scheme, each convolution layer is followed by batch normalization and a ReLU activation layer, preventing gradient explosion or vanishing and introducing nonlinearity into the model.
In step S3, the deconvolution network includes a fully connected layer, dense layers and corresponding deconvolution transition layers; the fully connected layer first computes point by point on the tampered-image features, and the dense layers with their deconvolution transition layers then restore the image layer by layer, positioning the tampered region of the image.
In step S4, according to the image tampering region, the deconvolution network outputs a binary image positioning the image to be detected, completing the positioning of image tampering.
In the above scheme, the invention builds the deconvolution network with a structure as symmetric to the convolutional network as possible. First, the global pooling layer of the convolutional network is removed so that the complete feature map is retained; the corresponding fully connected layer then operates on the feature map point by point, i.e. it is equivalent to a 1x1 convolution. This is followed by three dense layers containing 12, 6 and 3 basic structure layers and deconvolution transition layers; the deconvolution transition layer is a modification of the transition layer in which average pooling is replaced by a 2x deconvolution layer, doubling the size of the feature map.
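What the 2x deconvolution (transposed-convolution) transition achieves spatially can be sketched as follows; a zero-insertion upsampling stands in for the learned transposed-convolution kernel, so this is illustrative only:

```python
import numpy as np

def deconv2x(feat):
    """Stride-2 'bed of nails' upsampling: doubles height and width,
    keeps depth. A learned transposed-convolution kernel would then
    smooth and fill the inserted positions."""
    h, w, c = feat.shape
    out = np.zeros((2 * h, 2 * w, c), dtype=feat.dtype)
    out[::2, ::2, :] = feat
    return out

f = np.ones((8, 8, 12), dtype=np.float32)
g = deconv2x(f)
print(g.shape)    # (16, 16, 12): size doubled, depth kept
```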
In this scheme, to better restore the details of the output image, the outputs of the convolutional network are fed into the corresponding layers of the deconvolution network after direct connection, 2x deconvolution and 4x deconvolution operations, forming multi-scale feature concatenation; the concatenated feature maps let the network exploit multi-scale context, helping it learn to accurately predict the boundary, contour and size of the tampered region. In addition, the invention raises the importance of the fully connected layer's output in the network decision by connecting it, after stage-by-stage 2x deconvolution, to the later layers. Because the fully connected layer is trained for the binary tampering classification in the detection task, its output carries effective spatial decision information, which is thus additionally exploited.
The training process of the dense connection convolution network and the deconvolution network specifically comprises the following steps:
collecting training image data and preprocessing the training image data;
dividing the preprocessed image data into a training set and a testing set;
pre-training on 128x128 images using the training set, computing gradients and updating parameters;
training on full-size images with these gradient updates to obtain the weights of the densely connected convolutional network;
pre-training the deconvolution network on 128x128 images starting from the weights of the densely connected convolutional network, computing gradients and updating parameters;
training on full-size images with the computed gradient updates to complete the training of the deconvolution network;
and evaluating and adjusting the deconvolution network by using the test set, and finally outputting the densely connected convolution network and the deconvolution network with corresponding weights.
In the training and tuning of the densely connected convolutional network and the deconvolution network, five-fold cross-validation is adopted: one fifth of the preprocessed image data serves as the test set and four fifths as the training set, and after five rounds of training and evaluation the averaged result is taken as the final result.
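The five-fold scheme can be sketched as follows (the index layout is illustrative):

```python
def five_fold(indices):
    """Yield (train, test) index lists: each round holds out one fifth."""
    n = len(indices)
    fold = n // 5
    for k in range(5):
        test = indices[k * fold:(k + 1) * fold]
        train = indices[:k * fold] + indices[(k + 1) * fold:]
        yield train, test

splits = list(five_fold(list(range(10))))
print(len(splits))    # 5 rounds
print(splits[0][1])   # first test fold: [0, 1]
```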
In this scheme, the invention slides a small 128x128 window over the image, saves windows containing the tampered region as new images, and screens reasonable samples based on the size of the tampered region. First, only windows in which the tampered area does not exceed 40% are retained, so that the tampered region is not too large; second, to avoid a tampered region that is too small, windows whose tampered region covers fewer than 150 pixels are discarded. This prevents samples with unreasonable tampered-region areas and lets the network learn image tampering detection. Meanwhile, images are rotated at multiple angles as data enhancement to strengthen the rotation invariance of the model.
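The window-screening strategy above can be sketched as follows. The stride and mask layout are illustrative assumptions; the thresholds of 150 pixels and 40% come from the text:

```python
import numpy as np

WIN, MIN_PIX, MAX_FRAC = 128, 150, 0.40

def screen_windows(mask, stride=64):
    """mask: binary ground-truth map, 1 = tampered pixel. Keep windows
    whose tampered area is >= 150 pixels and <= 40% of the window."""
    kept = []
    h, w = mask.shape
    for y in range(0, h - WIN + 1, stride):
        for x in range(0, w - WIN + 1, stride):
            area = int(mask[y:y + WIN, x:x + WIN].sum())
            if MIN_PIX <= area <= MAX_FRAC * WIN * WIN:
                kept.append((y, x))
    return kept

mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:140, 100:140] = 1            # a 40x40 = 1600-pixel tampered patch
print(len(screen_windows(mask)) > 0)  # at least one valid window survives
```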
In this scheme, training proceeds detection-first, then positioning: a binary-classification convolutional network for detecting whether an image has been tampered with is trained first; its weights are then retained, training continues with positioning of the tampered region as the objective, and only tampered training samples are used to update the convolutional and deconvolution networks.
In this scheme, gradients are computed on small 128x128 images during pre-training, making full use of the graphics processor's memory, so one forward pass can compute the gradients of many samples; when training on the full-size data set, because of the size disparity, one forward pass can compute only a single sample's gradient. To make the loss decrease stably, the invention implements a gradient accumulator in the program and updates the parameters after averaging multiple gradients.
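The gradient accumulator described above can be sketched in plain Python; this is a stand-in for the actual TensorFlow mechanism, with gradients represented as flat lists:

```python
class GradientAccumulator:
    """Collect per-sample gradients; release their average every `every`
    samples so full-size training (one sample per forward pass) still
    performs stable, batch-like parameter updates."""
    def __init__(self, every):
        self.every = every
        self.buffer = []

    def add(self, grad):
        """Store one sample's gradient; return the averaged gradient
        once `every` samples have accumulated, else None."""
        self.buffer.append(grad)
        if len(self.buffer) < self.every:
            return None
        avg = [sum(g) / self.every for g in zip(*self.buffer)]
        self.buffer.clear()
        return avg

acc = GradientAccumulator(every=4)
updates = [acc.add([float(i), 2.0 * i]) for i in range(4)]
print(updates[-1])   # [1.5, 3.0]: mean of the four per-sample gradients
```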
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the image tampering detection and positioning method adopting the dense structure convolutional network provided by the invention has the advantages that the deep learning technology is applied to the image tampering detection and positioning, the network is trained to learn the characteristics of the tampered image, the method is suitable for dealing with various tampering means, the parameters can be continuously updated on the premise of a new data set, the performance is improved, and the robustness and the practicability are good; compared with other deep learning methods, the method realizes a unified framework for detection and positioning, can predict whether a plurality of images are tampered or not, can predict a tampered area, and gives accurate pixel-by-pixel labeling to obtain a detailed object contour boundary.
Drawings
FIG. 1 is a flow chart of the method steps of the present invention;
FIG. 2 is a schematic diagram of the structure of a convolutional network and a deconvolution network;
FIG. 3 is a flow chart of convolutional network and deconvolution network training;
FIG. 4 is a diagram illustrating the results of a positioning test sample.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, an image tampering detection and positioning method using a dense-structure convolutional network includes the following steps:
S1: inputting an image to be detected, and preprocessing it with spatial rich model (SRM) convolution to obtain a preprocessed image;
S2: constructing a densely connected convolutional network to extract tampered-image features from the preprocessed image, obtaining the binary classification information of the image to be detected and completing the detection of image tampering;
S3: constructing a deconvolution network symmetric in structure to the densely connected convolutional network, taking the binary classification information of the image to be detected as input, and positioning the image tampering region;
S4: outputting, by the deconvolution network, the positioned image according to the obtained image tampering region, completing the positioning of image tampering.
More specifically, as shown in fig. 2, in step S2 the densely connected convolutional network includes a pooling layer, dense layers, transition layers, a global average pooling layer and a fully connected layer; wherein:
the pooling layer performs one convolution and one max-pooling operation on the preprocessed image and feeds the result into the first dense layer;
there are several dense layers and transition layers; the output of each dense layer is fed into its corresponding transition layer, and the last transition layer feeds the resulting tampered-image feature map into the global average pooling layer;
the global average pooling layer average-pools the tampered-image feature map, and the fully connected layer computes and outputs two probability values representing the probability of tampering and of non-tampering respectively, yielding the binary classification information of the image to be detected.
More specifically, the dense layer comprises several basic structure layers, each consisting of two successive convolution layers; the input of each basic structure layer is the merge (concatenation) of the outputs of all preceding layers, making the structure a locally dense version of the residual structure.
More specifically, the densely connected convolutional network has four dense layers, containing 5, 10, 20 and 12 basic structure layers respectively.
More specifically, the transition layer includes a convolution layer, which convolves the feature map output by the dense layer once and then applies average pooling to reduce the image size.
More specifically, the fully connected layer computes and outputs the two probability values through a softmax function; the calculation formula is:

a_i = exp(z_i) / Σ_j exp(z_j)

where i indexes the two categories (tampered / non-tampered), z_i represents the output value of the network for category i, y_i represents the true value of the sample for category i (the training target), and a_i represents the softmax weight (probability) of category i.
In the specific implementation process, in order to better capture the tampering-noise characteristics of the image, one SRM convolution is applied to the three RGB channels of the input image; the convolution kernels are initialized with normalized SRM filters, the three channels of each kernel being assigned the same filter, giving 30 filters, and the convolution output is concatenated with the three RGB channels.
In the specific implementation process, after the pooling operation of the pooling layer, a deep network is built from the dense layers and transition layers of the densely connected convolutional network to extract the characteristics of the tampered image. Two successive convolution layers form a basic structure layer; a dense layer may contain several basic structure layers, and the input of each basic structure layer within a dense layer is the merge (concatenation) of all outputs of the preceding layers. Such a structure is a locally dense version of the residual structure, which helps train deeper networks without overfitting. The convolutional network uses four dense layers containing 5, 10, 20 and 12 basic structures respectively; a transition layer is a convolution layer that convolves the input feature map once to reduce its depth and then average-pools it to reduce its size. All pooling in the network is 2x2, so the feature map after the last dense layer is one thirty-second (1/32) of the original size. The global average pooling layer averages the feature map, keeping only the depth dimension; the fully connected layer then outputs two values, which a softmax function converts into probability values representing the probability of tampering and of non-tampering, the larger of which is taken as the final decision.
In the implementation, each convolution layer is followed by batch normalization and a ReLU activation layer, preventing gradient explosion or vanishing and introducing nonlinearity into the model.
More specifically, as shown in fig. 2, in step S3 the deconvolution network includes a fully connected layer, dense layers and corresponding deconvolution transition layers; the fully connected layer first computes point by point on the tampered-image features, and the dense layers with their deconvolution transition layers then restore the image layer by layer, positioning the tampered region of the image.
More specifically, in step S4, according to the image tampering region, the deconvolution network outputs a binary image positioning the image to be detected, completing the positioning of image tampering.
In the specific implementation process, the invention builds the deconvolution network with a structure as symmetric to the convolutional network as possible. First, the global pooling layer of the convolutional network is removed so that the complete feature map is retained; the corresponding fully connected layer then operates on the feature map point by point, i.e. it is equivalent to a 1x1 convolution. This is followed by three dense layers containing 12, 6 and 3 basic structure layers and deconvolution transition layers; the deconvolution transition layer is a modification of the transition layer in which average pooling is replaced by a 2x deconvolution layer, doubling the size of the feature map.
In the specific implementation process, to better restore the details of the output image, the outputs of the convolutional network are fed into the corresponding layers of the deconvolution network after direct connection, 2x deconvolution and 4x deconvolution operations, forming multi-scale feature concatenation; the concatenated feature maps let the network exploit multi-scale context, helping it learn to accurately predict the boundary, contour and size of the tampered region. In addition, the invention raises the importance of the fully connected layer's output in the network decision by connecting it, after stage-by-stage 2x deconvolution, to the later layers. Because the fully connected layer is trained for the binary tampering classification in the detection task, its output carries effective spatial decision information, which is thus additionally exploited.
Example 2
More specifically, on the basis of embodiment 1, as shown in fig. 3, the training process of the dense connection convolutional network and the deconvolution network specifically includes:
collecting training image data and preprocessing the training image data;
dividing the preprocessed image data into a training set and a testing set;
pre-training on 128x128 images using the training set, computing gradients and updating parameters;
training on full-size images with these gradient updates to obtain the weights of the densely connected convolutional network;
pre-training the deconvolution network on 128x128 images starting from the weights of the densely connected convolutional network, computing gradients and updating parameters;
training on full-size images with the computed gradient updates to complete the training of the deconvolution network;
and evaluating and adjusting the deconvolution network by using the test set, and finally outputting the densely connected convolution network and the deconvolution network with corresponding weights.
More specifically, in the training and tuning of the densely connected convolutional network and the deconvolution network, five-fold cross-validation is adopted: one fifth of the preprocessed image data serves as the test set and four fifths as the training set, and after five rounds of training and evaluation the averaged result is taken as the final result.
In the specific implementation process, the invention slides a small 128x128 window over the image, saves windows containing the tampered region as new images, and screens reasonable samples based on the size of the tampered region. First, only windows in which the tampered area does not exceed 40% are retained, so that the tampered region is not too large; second, to avoid a tampered region that is too small, windows whose tampered region covers fewer than 150 pixels are discarded. This prevents samples with unreasonable tampered-region areas and lets the network learn image tampering detection. Meanwhile, images are rotated at multiple angles as data enhancement to strengthen the rotation invariance of the model.
In the specific implementation process, training proceeds detection-first, then positioning: a binary-classification convolutional network for detecting whether an image has been tampered with is trained first; its weights are then retained, training continues with positioning of the tampered region as the objective, and only tampered training samples are used to update the convolutional and deconvolution networks.
In the specific implementation process, gradients are computed on small 128x128 images during pre-training, making full use of the graphics processor's memory, so one forward pass can compute the gradients of many samples; when training on the full-size data set, because of the size disparity, one forward pass can compute only a single sample's gradient. To make the loss decrease stably, the invention implements a gradient accumulator in the program and updates the parameters after averaging multiple gradients.
Example 3
In a specific implementation, the network provided by the invention is built with the TensorFlow deep learning framework and can be trained on a GeForce GTX 1080 Ti GPU. With 128×128 samples, one iteration can update the parameters using 128 images. On a test set whose image sizes vary from 240×160 to 1000×1000 pixels, detecting whether an image has been tampered with takes 17.75 milliseconds on average, and locating the tampering in an image takes 99.84 milliseconds on average.
In a specific implementation, the invention trains and tests on several public data sets, including four common ones: CASIA v1.0, CASIA v2.0, NC2016 and Columbia Uncompressed. The model is trained and tested five times and the results averaged; the average classification accuracy, average pixel classification accuracy and average intersection-over-union on the test sets are shown in Table 1. Accuracy is the ratio of correctly classified samples to the total number of samples. The intersection-over-union is the ratio of the intersection to the union of the real and predicted tampered regions; it lies between 0 and 1, and a larger value indicates greater overlap, i.e. better model performance.
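The two evaluation metrics just defined, pixel accuracy and intersection-over-union, can be computed directly from binary masks:

```python
import numpy as np

def pixel_accuracy(pred, truth):
    """Fraction of pixels classified correctly (both masks binary)."""
    return float((pred == truth).mean())

def intersection_over_union(pred, truth):
    """Ratio of intersection to union of the predicted and true tampered
    regions; lies in [0, 1], larger means greater overlap."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return float(inter) / float(union) if union else 1.0
```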
TABLE 1
(Table 1 is rendered as an image in the original publication; it reports the average classification accuracy, average pixel classification accuracy and average intersection-over-union on each test set.)
In a specific implementation, the localization results of some test samples are shown in fig. 4, with white pixels representing tampered regions. Because the global pooling layer of the convolutional network is removed, the fully connected layer outputs a 2-channel feature map, visualized in the fourth column; this layer outputs effective spatial decision information and gives a rough predicted position. The deconvolution network uses this information and, through dense connections to the shallow layers, refines the prediction further. The dense-structure convolutional neural network provided by the invention can effectively identify splicing, copy-move and removal tampering, output a pixel-wise classification result, and accurately predict the position, size and shape of the tampered object, close to the ground-truth annotation.
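Applying a fully connected layer at every spatial position of a feature map, as described above, is equivalent to a 1×1 convolution; a minimal NumPy sketch (the shapes and weight names are illustrative assumptions):

```python
import numpy as np

def fc_as_pointwise_conv(feature_map, weights, bias):
    """Apply a fully connected layer (C_in -> 2) at every spatial
    position, i.e. a 1x1 convolution: an (H, W, C_in) feature map
    becomes an (H, W, 2) coarse tampered/untampered decision map."""
    return np.tensordot(feature_map, weights, axes=([2], [0])) + bias
```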
It should be understood that the above-described embodiments are merely examples provided to illustrate the invention clearly and are not intended to limit its embodiments; they are neither required for, nor an exhaustive enumeration of, all embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention falls within the protection scope of its claims.

Claims (10)

1. An image tampering detection and localization method using a dense-structure convolutional network, characterized by comprising the following steps:
S1: inputting an image to be detected, and preprocessing it with spatial rich model (SRM) convolution to obtain a preprocessed image;
S2: constructing a densely connected convolutional network to extract tampered-image features from the preprocessed image, obtaining the binary classification information of the image to be detected and completing the detection of image tampering;
S3: constructing a deconvolution network symmetric in structure to the densely connected convolutional network, taking the binary classification information of the image to be detected as input, and locating the image tampering region;
S4: outputting, by the deconvolution network, the localized image according to the obtained image tampering region, completing the localization of image tampering.
2. The image tamper detection and localization method according to claim 1, wherein in step S2 the densely connected convolutional network comprises a pooling layer, dense layers, transition layers, a global average pooling layer and a fully connected layer, wherein:
the pooling layer performs one convolution and max-pooling operation on the preprocessed image and inputs the result to the first dense layer;
several dense layers and transition layers are provided, the output of each dense layer is input to its corresponding transition layer, and the last transition layer inputs the resulting tampered-image feature map to the global average pooling layer;
the global average pooling layer average-pools the tampered-image feature map, and the fully connected layer computes and outputs two probability values representing, respectively, the probability of tampering and of non-tampering, thereby obtaining the binary classification information of the image to be detected.
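The classifier head described in claim 2, global average pooling followed by a fully connected layer and softmax producing the two probabilities, can be sketched framework-agnostically with NumPy; the weight shapes are illustrative assumptions:

```python
import numpy as np

def classifier_head(feature_map, weights, bias):
    """Global average pooling over an (H, W, C) tampered-image feature
    map, then a fully connected layer and softmax producing the two
    probabilities (tampered, untampered)."""
    pooled = feature_map.mean(axis=(0, 1))   # global average pooling
    logits = pooled @ weights + bias         # fully connected layer
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()
```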
3. The method of claim 2, wherein each dense layer comprises a plurality of basic structure layers, each consisting of two consecutive convolutional layers, and the input of each basic structure layer is formed by combining the outputs of the preceding layers, a locally dense variant of the residual structure.
4. The image tamper detection and localization method using the convolution network with dense structure as claimed in claim 3, wherein the convolution network with dense connection is provided with four dense layers, which respectively include 5, 10, 20, and 12 basic structure layers.
5. The method according to claim 4, wherein the transition layer comprises a convolutional layer that performs one convolution on the feature map output by the dense layer, followed by average pooling to reduce the image size.
6. The image tamper detection and localization method using the dense-structure convolutional network as claimed in claim 5, wherein the fully connected layer outputs two probability values through softmax calculation; the specific calculation formula is:
L = -Σ_i a_i · y_i · log( e^{ŷ_i} / Σ_j e^{ŷ_j} )
where i ranges over the two categories, tampered/non-tampered, ŷ_i represents the output value of the network for category i, y_i represents the true value of the sample for category i, and a_i represents the weight of category i.
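A numerical sketch of the class-weighted softmax cross-entropy described in claim 6; the symbol names are illustrative:

```python
import numpy as np

def weighted_softmax_cross_entropy(outputs, y_true, class_weights):
    """Softmax over the two network outputs, then class-weighted
    cross-entropy against the one-hot ground truth y_true."""
    exp = np.exp(outputs - outputs.max())  # numerically stable softmax
    probs = exp / exp.sum()
    return float(-np.sum(class_weights * y_true * np.log(probs)))
```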
7. The image tamper detection and localization method using the dense-structure convolutional network according to claim 2, wherein in step S3 the deconvolution network comprises a fully connected layer, dense layers and corresponding deconvolution transition layers; the tampered-image features are first computed point by point by the fully connected layer, and the image is then restored layer by layer through the dense layers and the corresponding deconvolution transition layers, locating the image tampering region.
8. The image tampering detection and localization method using the dense-structure convolutional network as claimed in claim 7, wherein in step S4, according to the image tampering region, the deconvolution network outputs the localized binary image of the image to be detected, completing the localization of image tampering.
9. The image tamper detection and localization method using the dense structure convolutional network as claimed in claim 8, wherein the training process of the dense connection convolutional network and the deconvolution network specifically is:
collecting training image data and preprocessing the training image data;
dividing the preprocessed image data into a training set and a testing set;
pre-training on 128×128 images using the training set, and computing gradients to update the parameters;
training on full-size images from the updated parameters to obtain the weights of the densely connected convolutional network;
pre-training the deconvolution network on 128×128 images starting from the weights of the densely connected convolutional network, and computing gradients to update the parameters;
training on full-size images from the computed parameter updates to complete the training of the deconvolution network;
and evaluating and adjusting the deconvolution network by using the test set, and finally outputting the densely connected convolution network and the deconvolution network with corresponding weights.
10. The method for detecting and localizing image tampering using the dense-structure convolutional network as claimed in claim 9, wherein during the training and tuning of the densely connected convolutional network and the deconvolution network, five-fold cross-validation is adopted: one fifth of the preprocessed image data serves as the test set and the remaining four fifths as the training set, and the average over five training evaluations is taken as the final result.
CN201911081464.9A 2019-11-07 2019-11-07 Image tampering detection and positioning method adopting convolution network with dense structure Active CN110852316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911081464.9A CN110852316B (en) 2019-11-07 2019-11-07 Image tampering detection and positioning method adopting convolution network with dense structure

Publications (2)

Publication Number Publication Date
CN110852316A true CN110852316A (en) 2020-02-28
CN110852316B CN110852316B (en) 2023-04-18

Family

ID=69598598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911081464.9A Active CN110852316B (en) 2019-11-07 2019-11-07 Image tampering detection and positioning method adopting convolution network with dense structure

Country Status (1)

Country Link
CN (1) CN110852316B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445454A (en) * 2020-03-26 2020-07-24 江南大学 Image authenticity identification method and application thereof in license identification
CN111814543A (en) * 2020-06-01 2020-10-23 湖南科技大学 Detection method for repairing and tampering depth video object
CN111915568A (en) * 2020-07-08 2020-11-10 深圳大学 Image tampering positioning model generation method, image tampering positioning method and device
CN112115912A (en) * 2020-09-28 2020-12-22 腾讯科技(深圳)有限公司 Image recognition method and device, computer equipment and storage medium
CN112233077A (en) * 2020-10-10 2021-01-15 北京三快在线科技有限公司 Image analysis method, device, equipment and storage medium
CN112365515A (en) * 2020-10-30 2021-02-12 深圳点猫科技有限公司 Edge detection method, device and equipment based on dense sensing network
CN112529835A (en) * 2020-10-22 2021-03-19 浙江大学 Image splicing tampering detection and positioning method based on source camera identification
CN112991239A (en) * 2021-03-17 2021-06-18 广东工业大学 Image reverse recovery method based on deep learning
CN113807392A (en) * 2021-08-05 2021-12-17 厦门市美亚柏科信息股份有限公司 Tampered image identification method based on multi-preprocessing-feature fusion
CN113920094A (en) * 2021-10-14 2022-01-11 厦门大学 Image tampering detection technology based on gradient residual U-shaped convolution neural network
CN114612476A (en) * 2022-05-13 2022-06-10 南京信息工程大学 Image tampering detection method based on full-resolution hybrid attention mechanism
CN114677670A (en) * 2022-03-30 2022-06-28 浙江康旭科技有限公司 Automatic identification and positioning method for identity card tampering
CN110852316B (en) * 2019-11-07 2023-04-18 中山大学 Image tampering detection and positioning method adopting convolution network with dense structure
CN117496225A (en) * 2023-10-17 2024-02-02 南昌大学 Image data evidence obtaining method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564025A (en) * 2018-04-10 2018-09-21 广东电网有限责任公司 A kind of infrared image object identification method based on deformable convolutional neural networks
CN109191476A (en) * 2018-09-10 2019-01-11 重庆邮电大学 The automatic segmentation of Biomedical Image based on U-net network structure
CN110334805A (en) * 2019-05-05 2019-10-15 中山大学 A kind of JPEG domain image latent writing method and system based on generation confrontation network
CN110414670A (en) * 2019-07-03 2019-11-05 南京信息工程大学 A kind of image mosaic tampering location method based on full convolutional neural networks
CN113920094A (en) * 2021-10-14 2022-01-11 厦门大学 Image tampering detection technology based on gradient residual U-shaped convolution neural network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852316B (en) * 2019-11-07 2023-04-18 中山大学 Image tampering detection and positioning method adopting convolution network with dense structure

Also Published As

Publication number Publication date
CN110852316B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110852316B (en) Image tampering detection and positioning method adopting convolution network with dense structure
JP7458328B2 (en) Multi-sample whole-slide image processing via multi-resolution registration
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN110728330A (en) Object identification method, device, equipment and storage medium based on artificial intelligence
CN110853022B (en) Pathological section image processing method, device and system and storage medium
CN110060237A (en) A kind of fault detection method, device, equipment and system
CN105574550A (en) Vehicle identification method and device
CN114998220B (en) Tongue image detection and positioning method based on improved Tiny-YOLO v4 natural environment
CN112529005B (en) Target detection method based on semantic feature consistency supervision pyramid network
CN111257341A (en) Underwater building crack detection method based on multi-scale features and stacked full convolution network
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN107622280B (en) Modularized processing mode image saliency detection method based on scene classification
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
CN108710893A (en) A kind of digital image cameras source model sorting technique of feature based fusion
CN109272060A (en) A kind of method and system carrying out target detection based on improved darknet neural network
CN111242026A (en) Remote sensing image target detection method based on spatial hierarchy perception module and metric learning
CN111539456B (en) Target identification method and device
CN113628297A (en) COVID-19 deep learning diagnosis system based on attention mechanism and transfer learning
CN111582057B (en) Face verification method based on local receptive field
CN116091524B (en) Detection and segmentation method for target in complex background
CN114565918A (en) Face silence living body detection method and system based on multi-feature extraction module
CN113837015A (en) Face detection method and system based on feature pyramid
KR101899729B1 (en) Method for detecting cacer based on nuclei and learning method for cacer detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant