CN114418840A - Image splicing positioning detection method based on attention mechanism - Google Patents

Image splicing positioning detection method based on attention mechanism

Info

Publication number
CN114418840A
CN114418840A (application CN202111532297.2A)
Authority
CN
China
Prior art keywords
image
splicing
edge
attention mechanism
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111532297.2A
Other languages
Chinese (zh)
Inventor
张玉兰
朱国普
杨建权
刘祖权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Shenzhen Technology University
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Shenzhen Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS, Shenzhen Technology University filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111532297.2A priority Critical patent/CN114418840A/en
Publication of CN114418840A publication Critical patent/CN114418840A/en
Priority to PCT/CN2022/138200 priority patent/WO2023109709A1/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 3/00 Geometric image transformations in the plane of the image
            • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
              • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
          • G06T 7/00 Image analysis
            • G06T 7/10 Segmentation; Edge detection
              • G06T 7/13 Edge detection
            • G06T 7/40 Analysis of texture
              • G06T 7/41 Analysis of texture based on statistical description of texture
                • G06T 7/44 Analysis of texture based on statistical description of texture using image operators, e.g. filters, edge density metrics or local histograms
          • G06T 2200/00 Indexing scheme for image data processing or generation, in general
            • G06T 2200/32 Indexing scheme for image data processing or generation, in general, involving image mosaicing
          • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/20 Special algorithmic details
              • G06T 2207/20081 Training; Learning
              • G06T 2207/20084 Artificial neural networks [ANN]
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00 Pattern recognition
            • G06F 18/20 Analysing
              • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/25 Fusion techniques
                • G06F 18/253 Fusion techniques of extracted features
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/045 Combinations of networks
              • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T 10/00 Road transport of goods or passengers
            • Y02T 10/10 Internal combustion engine [ICE] based vehicles
              • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image splicing positioning detection method based on an attention mechanism, which comprises the following steps: step one: prepare an image splicing data set and divide it into a training set, a validation set and a test set; step two: design a dual-stream multi-task learning neural network structure; step three: design a multi-task loss function; step four: optimize the training to obtain a splicing region localization model; step five: input the image to be detected into the model trained in step four to obtain the splicing localization result. Compared with the prior art, the invention has the following advantages: shallow low-level features are introduced, which provide more detail information and improve the feature representation capability of the network; the edges of the image and the edges of the splicing region are introduced as supervision information and a multi-task loss function is designed, so that the splicing region can be located more accurately; a squeeze-excitation attention mechanism is introduced to recalibrate the fused features, so that the model pays more attention to the features that contribute most to localization and a more accurate splicing localization result is obtained.

Description

Image splicing positioning detection method based on attention mechanism
Technical Field
The invention relates to the technical field of image splicing localization detection, and in particular to an image splicing positioning detection method based on an attention mechanism.
Background
Digital images, as an important information carrier, are widely distributed and spread over the Internet; at the same time, the ease of image tampering operations gives rise to a series of image security problems. Image splicing is a commonly used image tampering method: typically, a region of one image (called the donor image) is copied, pasted into a region of another image (called the host image) after geometric operations such as scaling and rotation, and finally the composite image is subjected to post-processing operations such as Gaussian filtering and image enhancement so that the spliced region remains consistent with the host image. Post-processing of the edges of the spliced region makes splicing localization more challenging. For entertainment, people can use image splicing techniques to splice a blue sea and blue sky into casually taken pictures and forge a beautiful travel scene. Therefore, analysing whether an image has undergone an image splicing operation and locating the spliced region is of great practical significance.
Existing image splicing localization techniques mainly comprise traditional feature-based methods and deep-learning-based methods. Traditional image splicing localization methods locate the spliced region mainly based on characteristics such as sensor pattern noise, the interpolation pattern of the colour filter array, or JPEG compression traces. However, these traditional methods target a specific image attribute and are not applicable to all splicing types. Deep-learning-based methods mainly exploit the data-driven capability of big data to learn the characteristics of the spliced region and then locate it. However, most existing deep learning methods use only the spliced image and the corresponding ground-truth mask for learning, ignore the effect of the image edges on the edges of the spliced region, and obtain unsatisfactory edges for the localized region. In addition, existing deep-learning-based methods focus only on the high-level features of the deeper layers of the convolutional network and ignore the low-level features of the shallow layers, so the accuracy of splicing localization is not high.
Disadvantage 1 of the prior art: the prior art uses only the deep features of the convolutional network and does not use the low-level features output by the shallow layers, so the splicing localization result needs further improvement. The low-level features output by the shallow layers contain local features of the image and some image detail information; this information can improve the feature expression capability of the network and can further improve the localization effect.
Disadvantage 2 of the prior art: the prior art uses only the spliced image and does not use the edge information of the image or the edge information of the spliced region. The edge information of the image and of the spliced region can guide the localization of the edges of the spliced region and improve the accuracy of splicing edge localization.
Disadvantage 3 of the prior art: in the prior art, features are simply fused without being recalibrated by an attention mechanism, so the discriminability of the output features is poor and the localization result needs further improvement.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the above technical defects and provide an image splicing positioning detection method based on an attention mechanism, which designs a multi-task loss function to learn the edge information of the image, the edge information of the splicing region and the splicing region simultaneously, improving the localization of the splicing edges; extracts low-level texture features with a shallow network, enhancing the feature expression capability of the network; and finally recalibrates the fused features with a squeeze-excitation attention mechanism, so that the model pays more attention to the features useful for locating the splicing region and gives them larger weights.
In order to solve the technical problems, the technical scheme provided by the invention is as follows: an attention mechanism-based image stitching positioning detection method comprises the following steps:
step one: preparing an image splicing data set, and dividing the image splicing data set into a training set, a verification set and a test set;
step two: designing a double-flow multitask learning neural network structure;
step three: designing a multitask loss function;
step four: optimizing training to obtain a splicing area positioning model;
step five: and inputting the image to be detected into the model trained in the fourth step to obtain a splicing positioning result.
Preferably, in step one, four benchmark image splicing data sets, CASIA 1.0 (461 images), CASIA 2.0 (5123 images), the Carvalho data set (100 images) and the Columbia data set (180 images), and two synthetic splicing data sets, spliced_NIST (13575 images) and spliced_Dresden (35712 images), are used, and each data set is divided into training, validation and test sets in a ratio of 7:2:1.
Preferably, step two comprises an edge-guided path and a label mask path, wherein the edge-guided path is an encoder-decoder path composed of a U-Net and is supervised with the edges of the image, the label mask path is an encoder-decoder path composed of a U-Net, and the ground-truth mask of the splicing region and the edge of the splicing region are used to supervise the label mask path.
Preferably, the multitask loss function in the third step includes three aspects, the first is label mask loss, the second is mask edge loss, and the third is image edge loss.
Preferably, the experiments in step four are implemented with the PyTorch framework on an Ubuntu 16.04 system, the graphics card is a GeForce GTX 1080Ti GPU, adaptive moment estimation is adopted as the optimizer, the learning rate is set to 1 × 10^-3 and reduced to 1 × 10^-4 after 30 epochs, and a total of 300 epochs are trained with the batch size set to 8.
Compared with the prior art, the invention has the following advantages: (1) shallow low-level features are introduced, which provide more detail information and improve the feature representation capability of the network;
(2) the edges of the image and the edges of the splicing region are introduced as supervision information, and a multi-task loss function is designed, so that the splicing region can be located more accurately;
(3) a squeeze-excitation attention mechanism is introduced to recalibrate the fused features, so that the model pays more attention to the features that contribute most to localization and a more accurate splicing localization result is obtained.
Drawings
FIG. 1 is a schematic structural diagram of an image stitching positioning detection method based on an attention mechanism according to the present invention.
FIG. 2 is a structural diagram of the Feature Adaptation Layer (FAL) of the image stitching positioning detection method based on an attention mechanism.
FIG. 3 is a schematic diagram of the squeeze-excitation attention module (SEAM) of the image stitching positioning detection method based on an attention mechanism according to the present invention.
FIG. 4 shows splicing localization results of the image stitching positioning detection method based on an attention mechanism on part of the test set.
FIG. 5 shows localization results of the image stitching positioning detection method based on an attention mechanism on different splicing data sets.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
An attention mechanism-based image stitching positioning detection method comprises the following steps:
step one: four benchmark image splicing data sets, CASIA 1.0 (461 images), CASIA 2.0 (5123 images), the Carvalho data set (100 images) and the Columbia data set (180 images), and two synthetic splicing data sets, spliced_NIST (13575 images) and spliced_Dresden (35712 images), are used; each data set is divided into training, validation and test sets in a ratio of 7:2:1.
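As an illustrative, non-limiting sketch, the 7:2:1 division of each data set can be performed as follows (Python; the function name, the random seed and the assumption that each data set is given as a list of image paths are conventions of this sketch, not requirements of the invention):

    import random

    def split_dataset(image_paths, seed=42):
        # Shuffle one data set's image list and split it into
        # training / validation / test subsets in a 7:2:1 ratio.
        paths = list(image_paths)
        random.Random(seed).shuffle(paths)
        n = len(paths)
        n_train = int(0.7 * n)
        n_val = int(0.2 * n)
        return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]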
Step two: design the dual-stream multi-task learning neural network, which comprises an edge-guided path and a label mask path. The edge-guided path is an encoder-decoder path built from a U-Net and is supervised with the edges of the image. The encoder extracts discriminative features from the input spliced image, and the decoder further processes the extracted features to obtain a pixel-wise prediction of the image edges. The encoder of the edge-guided path is composed of four consecutive groups of convolution modules and downsampling layers; each convolution module consists of a convolution layer, a batch normalization (BN) layer and a rectified linear unit (ReLU), where all convolution layers use 3 × 3 kernels with a stride of 1. Downsampling is implemented by a convolution with a 4 × 4 kernel and a stride of 2. The decoder of the edge-guided path is composed of four consecutive groups of upsampling layers and convolution modules. The upsampling layers are realized by bilinear interpolation, and the width and height of the feature map are doubled after each upsampling. The encoder and the decoder are connected by a convolution module. Finally, a convolution layer with a 1 × 1 kernel and a stride of 1 is used, following common practice, to refine the upsampled features. Nevertheless, upsampling causes feature loss, so skip connections between the contracting and expanding paths of the U-Net are needed to reuse the early features and compensate for this loss.
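The building blocks described above can be sketched in PyTorch as follows (a minimal, non-limiting illustration; the channel widths, padding values and class names are assumptions of this sketch and are not specified in the description):

    import torch
    import torch.nn as nn

    class ConvModule(nn.Module):
        # Convolution module: 3x3 convolution (stride 1) + batch normalization + ReLU.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    class Down(nn.Module):
        # Downsampling implemented as a 4x4 convolution with stride 2.
        def __init__(self, ch):
            super().__init__()
            self.down = nn.Conv2d(ch, ch, kernel_size=4, stride=2, padding=1)

        def forward(self, x):
            return self.down(x)

    class Up(nn.Module):
        # Upsampling by bilinear interpolation (factor 2) followed by a convolution
        # module; the skip feature from the contracting path is concatenated to
        # compensate for the feature loss caused by upsampling.
        def __init__(self, in_ch, skip_ch, out_ch):
            super().__init__()
            self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
            self.conv = ConvModule(in_ch + skip_ch, out_ch)

        def forward(self, x, skip):
            return self.conv(torch.cat([self.up(x), skip], dim=1))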
The label mask path of the proposed network has, as a whole, a structure similar to that of the edge-guided path, and is also an encoder-decoder path composed of a U-Net. The ground-truth mask of the splicing region and the edge of the splicing region are used to supervise the label mask path. It differs from the edge-guided path in the following respects:
1) The features in the edge-guided path are filtered by Feature Adaptation Layers (FALs) and then input into the label mask path, where they are fused with the features of the label mask path. The FAL is built from a Res-block; its structure is shown in FIG. 2. The FAL contains a convolution path and an identity path, where the convolution path consists of a convolution layer with a kernel size of 1 × 1 and a stride of 1 followed by a ReLU layer. Let the input feature of the FAL be y; the output of the FAL, ŷ, can then be expressed as

ŷ = y ⊕ C_{1×1}(y),    (1)

where ⊕ represents pixel-by-pixel addition and C_{1×1} represents the normalized 1 × 1 convolution. To reduce the loss caused by fusion, the filtered features are fused with the features in the label mask path by concatenation.
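A minimal PyTorch sketch of the FAL consistent with equation (1) is given below (whether the 1 × 1 convolution additionally carries a normalization layer is left open in the description, so it is omitted here as an assumption of the sketch):

    import torch.nn as nn

    class FeatureAdaptationLayer(nn.Module):
        # FAL: identity path plus a convolution path (1x1 convolution, stride 1,
        # followed by ReLU); the two paths are added pixel by pixel as in eq. (1).
        def __init__(self, ch):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(ch, ch, kernel_size=1, stride=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, y):
            return y + self.conv(y)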
2) The low-level features extracted from the spliced image by a shallow network are input into the label mask path and fused with the features output by the upsampling layers of the decoder. Low-level features generally refer to local features of image detail, such as edges, corners or gradients. The red dotted path in FIG. 4 indicates the extraction of the low-level features, which can provide more discriminative information. From left to right there are four downsampling layers, with downsampling factors of 8, 4, 2 and 1, implemented with convolution kernel sizes/strides of 8/8, 4/4, 2/2 and 1/1, respectively. Fusing the low-level features with the high-level features in the label mask path enhances the high-resolution representation.
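The shallow low-level feature extraction can be sketched as four parallel strided convolutions (the output channel count is an assumption of this illustration):

    import torch.nn as nn

    class ShallowFeatureExtractor(nn.Module):
        # Four strided convolutions with kernel size / stride 8/8, 4/4, 2/2 and 1/1,
        # giving low-level features downsampled by factors of 8, 4, 2 and 1 that can
        # be fused with decoder features of matching resolution.
        def __init__(self, in_ch=3, out_ch=32):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k) for k in (8, 4, 2, 1)
            ])

        def forward(self, x):
            return [branch(x) for branch in self.branches]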
3) The directly fused features only give a coarse localization, so the proposed network feeds the fused features into a Squeeze-Excitation Attention Module (SEAM). SEAM can be viewed as a simple channel attention mechanism, whose structure is shown in FIG. 3. Let X denote the input of the SEAM; a feature U = [u_1, u_2, ..., u_C] with C channels is obtained through convolution and other common operations, u_c = v_c * X, where * denotes convolution and V = [v_1, v_2, ..., v_C] represents the convolution kernels.
Unlike a conventional CNN, the resulting features are then recalibrated using three operations. First, the features obtained by convolution are subjected to the squeeze operation, i.e.

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j),    (2)

where F_sq denotes the squeeze operation and H and W are the height and width of the feature map. The features are compressed along the spatial dimension, each two-dimensional feature channel is turned into a real number z_c, and a channel-level global feature z = [z_1, z_2, ..., z_C] is obtained. Each such real number has, to some extent, a global receptive field, and the output dimension matches the number of input feature channels.
Then, the excitation operation is performed to learn the relationship among the channels and obtain the weights of the different channels, namely

e = F_ex(z, W) = σ(G(z, W)) = σ(W_2 ReLU(W_1 z)),    (3)

where F_ex denotes the excitation operation, σ denotes the sigmoid activation function, G denotes the gating mechanism implemented with ReLU, W_1 ∈ R^{(C/r)×C}, W_2 ∈ R^{C×(C/r)}, and r represents the dimensional compression ratio. A mechanism similar to the gates in recurrent neural networks is used to generate a weight for each feature channel through the parameters W.
Finally, the weights output by the excitation are regarded as the importance of each feature channel after feature selection and are applied to the previous features by channel-wise multiplication, namely

x̃_c = F_scale(u_c, e_c) = e_c · u_c,    (4)

where F_scale(u_c, e_c) denotes the channel-by-channel multiplication of u_c and e_c. This completes the recalibration of the original features in the channel dimension.
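A minimal PyTorch sketch of the SEAM, implementing the squeeze (eq. 2), excitation (eq. 3) and rescaling (eq. 4) steps, is given below (the default reduction ratio r = 16 is an assumption of this sketch):

    import torch.nn as nn

    class SEAM(nn.Module):
        # Squeeze-Excitation Attention Module.
        def __init__(self, channels, r=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // r),   # W1: dimension reduction by r
                nn.ReLU(inplace=True),
                nn.Linear(channels // r, channels),   # W2: dimension restoration
                nn.Sigmoid(),                         # sigma
            )

        def forward(self, u):
            b, c, _, _ = u.shape
            z = u.mean(dim=(2, 3))             # squeeze: global average pooling, eq. (2)
            e = self.fc(z).view(b, c, 1, 1)    # excitation: per-channel weights, eq. (3)
            return u * e                       # scale: channel-by-channel reweighting, eq. (4)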
Step three: design the multi-task loss function. The multi-task loss function of the invention mainly comprises three parts: the label mask loss, the mask edge loss and the image edge loss. The overall loss function can be expressed as

L_total = L_label_mask + λ_1 L_label_edge + λ_2 L_image_edge.    (5)
In order to address the imbalance between positive and negative samples and the varying difficulty of distinguishing samples, the focal loss is adopted as the label mask loss, namely

L_label_mask = −Σ_{i,j} [ α (1 − P_{i,j})^γ Ŷ_{i,j} log(P_{i,j}) + (1 − α) (P_{i,j})^γ (1 − Ŷ_{i,j}) log(1 − P_{i,j}) ],    (6)

where Ŷ_{i,j} and P_{i,j} denote, respectively, the label at pixel point (i, j) and the probability that the pixel is predicted to be a spliced pixel, α is used to balance the proportion of positive and negative samples, and γ is used to balance the proportion of hard and easy samples. In this experiment, α is empirically set to 0.25 and γ to 2.
For the mask edge, the standard binary cross entropy (BCE) is adopted as the loss function, namely

L_label_edge = −Σ_{i,j} [ M_{i,j} log(Q_{i,j}) + (1 − M_{i,j}) log(1 − Q_{i,j}) ],    (7)

where M_{i,j} and Q_{i,j} denote, respectively, the mask edge label and the probability of being predicted as a mask edge at pixel point (i, j).
For the image edges, the mean square error (MSE) is used as the loss function, i.e.

L_image_edge = (1 / N) Σ_{i,j} ( S_{i,j} − Ŝ_{i,j} )²,    (8)

where N is the number of pixels, and S_{i,j} and Ŝ_{i,j} denote, respectively, the true and estimated values of the image edge at pixel point (i, j).
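Under the formulas above, the multi-task loss can be sketched as follows (the mean rather than sum reduction and the exact tensor shapes are assumptions of this sketch; predictions are assumed to be probabilities in [0, 1]):

    import torch
    import torch.nn.functional as F

    def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
        # Pixel-wise focal loss for the label mask, cf. eq. (6).
        pred = pred.clamp(eps, 1 - eps)
        pos = -alpha * (1 - pred) ** gamma * target * torch.log(pred)
        neg = -(1 - alpha) * pred ** gamma * (1 - target) * torch.log(1 - pred)
        return (pos + neg).mean()

    def total_loss(mask_pred, mask_gt, mask_edge_pred, mask_edge_gt,
                   img_edge_pred, img_edge_gt, lam1=1.0, lam2=1.0):
        # Overall multi-task loss, cf. eq. (5): focal label-mask loss plus
        # lam1 * BCE mask-edge loss (eq. 7) plus lam2 * MSE image-edge loss (eq. 8).
        l_mask = focal_loss(mask_pred, mask_gt)
        l_mask_edge = F.binary_cross_entropy(mask_edge_pred, mask_edge_gt)
        l_img_edge = F.mse_loss(img_edge_pred, img_edge_gt)
        return l_mask + lam1 * l_mask_edge + lam2 * l_img_edge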
Step four: optimization training. The experiments of the invention are implemented with the PyTorch framework on an Ubuntu 16.04 system, and the graphics card is a GeForce GTX 1080Ti GPU. The scheme of the invention adopts adaptive moment estimation (Adam) as the optimizer; the learning rate is set to 1 × 10^-3 and reduced to 1 × 10^-4 after 30 epochs, and a total of 300 epochs are trained with the batch size set to 8. The adjustment factors λ_1 and λ_2 of the loss function have little influence on the final splicing localization result, and the best detection result is obtained when λ_1 = λ_2 = 1. Therefore, the experiments set λ_1 = λ_2 = 1. Finally, the model with the highest localization result on the test data is selected as the final model.
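The optimizer and learning-rate schedule described above can be configured as sketched below (the dummy module and the use of MultiStepLR are assumptions of this illustration; any schedule that drops the rate from 1 × 10^-3 to 1 × 10^-4 after 30 epochs would serve):

    import torch

    # A dummy module stands in for the dual-stream multi-task network described above.
    model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, initial lr 1e-3
    # Reduce the learning rate to 1e-4 after 30 epochs; train 300 epochs in total, batch size 8.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

    for epoch in range(300):
        # ... iterate over the training loader (batch size 8), compute the
        # multi-task loss, and call loss.backward() and optimizer.step() ...
        scheduler.step()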
Step five: input the image to be detected into the model saved in step four to obtain the splicing localization result.
The invention has been verified on a number of commonly used image splicing data sets, and the experimental results show that the scheme is feasible. With the F1-score as the criterion, the localization results on the different splicing data sets are shown in FIG. 5, and the splicing localization results on part of the test set are shown in FIG. 4.
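As an illustrative sketch, the pixel-level F1-score used as the criterion can be computed from a binarized predicted mask and the ground-truth mask as follows (the binarization step and the array layout are assumptions of this sketch):

    import numpy as np

    def f1_score(pred_mask, gt_mask, eps=1e-8):
        # Pixel-level F1-score between a binary predicted splicing mask and the ground truth.
        pred = pred_mask.astype(bool)
        gt = gt_mask.astype(bool)
        tp = np.logical_and(pred, gt).sum()
        precision = tp / (pred.sum() + eps)
        recall = tp / (gt.sum() + eps)
        return 2 * precision * recall / (precision + recall + eps)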
The attention mechanism adopted here is the squeeze-excitation attention mechanism; other attention mechanisms, such as the convolutional block attention module (CBAM), can also be adopted in practice and can likewise achieve good splicing localization results.
When the method is specifically implemented, a multi-task loss function is designed to learn the edge information of the image, the edge information of the splicing region and the splicing region simultaneously, which improves the localization of the splicing edges; low-level texture features are extracted with a shallow network, enhancing the feature expression capability of the network; and finally the fused features are recalibrated with a squeeze-excitation attention mechanism, so that the model pays more attention to the features useful for locating the splicing region and gives them larger weights.
A dual-stream network (comprising an edge-guided path and a label mask path) is designed, and a multi-task loss function is adopted to learn the image edges, the mask edges and the label mask. Features in the edge-guided path are input into the label mask path through a Feature Adaptation Layer (FAL). The fused features in the label mask path are recalibrated with a channel attention mechanism, and the discriminative features are given larger weights, thereby improving the expression capability of the features.
The innovations of the invention include: 1) the designed multi-task loss function, which introduces the edge loss of the image and the loss of the edge of the splicing region; 2) the fusion of low-level features from the shallow network; 3) the introduction of a feature adaptation layer between the edge-guided path and the label mask path; 4) the introduction of the squeeze-excitation attention mechanism.
(1) The invention introduces shallow low-level features, can provide more detailed information and improves the feature representation capability of the network.
(2) The invention introduces the edge of the image and the edge of the splicing area as the supervision information, designs the multi-task loss function and can more accurately position the splicing area.
(3) According to the invention, a squeeze-excitation attention mechanism is introduced to recalibrate the fused features, so that the model pays more attention to the features that contribute to localization and a more accurate splicing localization result is obtained.
The present invention and its embodiments have been described above, and the description is not intended to be limiting, and the drawings are only one embodiment of the present invention, and the actual structure is not limited thereto. In summary, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. An image stitching positioning detection method based on an attention mechanism, characterized by comprising the following steps:
step one: preparing an image splicing data set, and dividing the image splicing data set into a training set, a verification set and a test set;
step two: designing a double-flow multitask learning neural network structure;
step three: designing a multitask loss function;
step four: optimizing training to obtain a splicing area positioning model;
step five: and inputting the image to be detected into the model trained in the fourth step to obtain a splicing positioning result.
2. The image stitching positioning detection method based on the attention mechanism as claimed in claim 1, wherein: in step one, four benchmark image splicing data sets, CASIA 1.0 (461 images), CASIA 2.0 (5123 images), the Carvalho data set (100 images) and the Columbia data set (180 images), and two synthetic splicing data sets, spliced_NIST (13575 images) and spliced_Dresden (35712 images), are used, and each data set is divided into training, validation and test sets in a ratio of 7:2:1.
3. The image stitching positioning detection method based on the attention mechanism as claimed in claim 1, wherein: step two comprises an edge-guided path and a label mask path, wherein the edge-guided path is an encoder-decoder path composed of a U-Net and is supervised with the edges of the image, the label mask path is an encoder-decoder path composed of a U-Net, and the ground-truth mask of the splicing region and the edge of the splicing region are used to supervise the label mask path.
4. The image stitching positioning detection method based on the attention mechanism as claimed in claim 1, wherein: the multitask loss function in the third step comprises three aspects, wherein the first aspect is the loss of a label mask, the second aspect is the loss of a mask edge, and the third aspect is the loss of an image edge.
5. The image stitching positioning detection method based on the attention mechanism as claimed in claim 1, wherein: the experiments in step four are implemented with the PyTorch framework on an Ubuntu 16.04 system, the graphics card is a GeForce GTX 1080Ti GPU, adaptive moment estimation is adopted as the optimizer, the learning rate is set to 1 × 10^-3 and reduced to 1 × 10^-4 after 30 epochs, and a total of 300 epochs are trained with the batch size set to 8.
CN202111532297.2A 2021-12-15 2021-12-15 Image splicing positioning detection method based on attention mechanism Pending CN114418840A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111532297.2A CN114418840A (en) 2021-12-15 2021-12-15 Image splicing positioning detection method based on attention mechanism
PCT/CN2022/138200 WO2023109709A1 (en) 2021-12-15 2022-12-09 Image stiching positioning detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532297.2A CN114418840A (en) 2021-12-15 2021-12-15 Image splicing positioning detection method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN114418840A true CN114418840A (en) 2022-04-29

Family

ID=81268034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532297.2A Pending CN114418840A (en) 2021-12-15 2021-12-15 Image splicing positioning detection method based on attention mechanism

Country Status (2)

Country Link
CN (1) CN114418840A (en)
WO (1) WO2023109709A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114764858A (en) * 2022-06-15 2022-07-19 深圳大学 Copy-paste image recognition method, device, computer device and storage medium
WO2023109709A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Image stiching positioning detection method based on attention mechanism
CN116912184A (en) * 2023-06-30 2023-10-20 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912183B (en) * 2023-06-30 2024-02-20 哈尔滨工业大学 Method and system for tampering and positioning depth repair image based on edge guiding and contrast loss
CN117291809B (en) * 2023-11-27 2024-03-15 山东大学 Integrated circuit image stitching method and system based on deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101493927B (en) * 2009-02-27 2011-04-27 西北工业大学 Image reliability detecting method based on edge direction characteristic
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN110414670B (en) * 2019-07-03 2021-09-28 南京信息工程大学 Image splicing tampering positioning method based on full convolution neural network
CN111080629B (en) * 2019-12-20 2021-10-22 河北工业大学 Method for detecting image splicing tampering
CN112465700B (en) * 2020-11-26 2022-04-26 北京航空航天大学 Image splicing positioning device and method based on depth clustering
CN114418840A (en) * 2021-12-15 2022-04-29 深圳先进技术研究院 Image splicing positioning detection method based on attention mechanism

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109709A1 (en) * 2021-12-15 2023-06-22 深圳先进技术研究院 Image stiching positioning detection method based on attention mechanism
CN114764858A (en) * 2022-06-15 2022-07-19 深圳大学 Copy-paste image recognition method, device, computer device and storage medium
CN116912184A (en) * 2023-06-30 2023-10-20 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss
CN116912184B (en) * 2023-06-30 2024-02-23 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss

Also Published As

Publication number Publication date
WO2023109709A1 (en) 2023-06-22

Similar Documents

Publication Publication Date Title
CN114418840A (en) Image splicing positioning detection method based on attention mechanism
Zhu et al. A deep learning approach to patch-based image inpainting forensics
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN111008633B (en) License plate character segmentation method based on attention mechanism
JP7246104B2 (en) License plate identification method based on text line identification
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN113052210A (en) Fast low-illumination target detection method based on convolutional neural network
CN110796026A (en) Pedestrian re-identification method based on global feature stitching
CN112287941B (en) License plate recognition method based on automatic character region perception
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN111882620A (en) Road drivable area segmentation method based on multi-scale information
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN113920468B (en) Multi-branch pedestrian detection method based on cross-scale feature enhancement
CN111209858A (en) Real-time license plate detection method based on deep convolutional neural network
CN114881871A (en) Attention-fused single image rain removing method
CN113205103A (en) Lightweight tattoo detection method
CN117372898A (en) Unmanned aerial vehicle aerial image target detection method based on improved yolov8
CN113537110A (en) False video detection method fusing intra-frame and inter-frame differences
CN113066074A (en) Visual saliency prediction method based on binocular parallax offset fusion
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
CN113724153A (en) Method for eliminating redundant images based on machine learning
He et al. Enhanced features in image manipulation detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination