CN113763373A - Domain-generalized scale-aligned recaptured-picture detection system - Google Patents

Domain-generalized scale-aligned recaptured-picture detection system

Info

Publication number
CN113763373A
CN113763373A (application CN202111091084.0A)
Authority
CN
China
Prior art keywords
scale
loss
feature
pictures
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111091084.0A
Other languages
Chinese (zh)
Other versions
CN113763373B (en)
Inventor
回红
罗吉年
郭捷
甘唯嘉
邱卫东
黄征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111091084.0A
Publication of CN113763373A
Application granted
Publication of CN113763373B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. likelihood ratio
    • G06N 3/045: Neural networks; architectures; combinations of networks
    • G06N 3/047: Neural networks; probabilistic or stochastic networks
    • G06N 3/084: Neural networks; learning methods; backpropagation, e.g. using gradient descent
    • G06T 2207/20081: Indexing scheme for image analysis; training, learning
    • G06T 2207/20084: Indexing scheme for image analysis; artificial neural networks [ANN]

Abstract

A domain-generalized scale-aligned recaptured-picture detection system comprises a preprocessing module, a symmetric adversarial learning module, a task module, a global scale relation alignment module and a local feature triplet loss mining module. By training on an ensemble of several small data sets, the invention improves the recapture-detection accuracy of the algorithm on data unknown at training time, and greatly improves accuracy in practical application scenarios where the data sources are heterogeneous and the application scene and image scale are unknown.

Description

Domain-generalized scale-aligned recaptured-picture detection system
Technical Field
The invention relates to a technology in the field of image processing, and in particular to a domain-generalized scale-aligned recaptured-picture detection system.
Background
One branch of digital image forensics is the identification of recaptured images. When a directly captured image is reproduced in some manner (e.g., shown on a display) and then photographed again with an imaging device, the resulting image is called a recaptured image. Existing classification methods address data scarcity by pooling multiple sources or by data augmentation. However, data sets from different sources have different marginal feature distributions and therefore constitute multiple independent domains in the sense of transfer learning. Testing on data from domains not used in training, i.e., under domain switching, is closer to real production scenarios. Simply merging several different domains into a single training set biases the learned features, so the algorithm is easily affected by domain switching. In practical application scenarios, the distribution of the content to be examined is usually unknown at training time, which poses a realistic challenge to the generalization ability of recapture-detection algorithms.
Disclosure of Invention
Aiming at the distribution deviation between different data sources and the inconsistency of the decision function across scales in existing recaptured-picture detection technology, the invention provides a domain-generalized scale-aligned recaptured-picture detection system. By training on several small data sets, it improves the recapture-detection accuracy of the algorithm on data unknown at training time, and greatly improves accuracy in practical application scenarios where the data sources are heterogeneous and the application scene and picture scale are unknown.
The invention is realized by the following technical scheme:
the invention relates to a domain generalization scale alignment copying detection system, which comprises: the system comprises a preprocessing module, a symmetrical confrontation learning module, a task module, a global scale relation alignment module and a local feature ternary loss mining module, wherein: the preprocessing module extracts a window of an image to be detected and generates a scale pyramid, the countermeasure learning module receives input of two different scale levels with two levels of scales corresponding to each other in the scale pyramid in a large scale and a small scale in a training mode and embeds the input into feature spaces respectively, the task module obtains final probability values of reproduction and direct reproduction according to feature vector information in the embedded feature spaces, the global scale relation alignment module performs symmetrical KL divergence calculation according to vector values calculated by the task module on scale pyramid levels on two levels of a large scale and a small scale of the same image generated by the preprocessing module, namely classification scoring, obtains KL divergence values, the local feature ternary loss mining module randomly selects a feature vector as an anchor point each time according to the feature vector information in the embedded feature spaces, and then selects a feature vector from the category to which the anchor point belongs as a positive sample, selecting a feature vector from another class as a negative sample, respectively calculating the sum of Euclidean distances between an anchor point and the feature space of the positive sample and the negative sample as the loss of the triples, exhausting all the triples in the input data, averaging the loss of all the triples to obtain the total loss, minimizing the loss in the training process so as to carry out local ternary loss mining in the feature space, improving the compactness of feature embedding and ensuring a clear decision boundary.
The scale pyramid has two layers, corresponding respectively to the large and small scale levels generated for each input image.
The symmetry refers to the following: the same image is fed into the preprocessing module to generate two scale levels, one large and one small, and these scale levels are not distinguished during adversarial learning; likewise, the recaptured and directly-captured categories are not distinguished. The large/small scales and the recaptured/directly-captured categories are therefore interchangeable in the computation of this module, hence the symmetry.
The adversarial learning module comprises: a feature generation unit formed by two parallel ResNet18 backbone networks, and a domain discriminator, wherein: the two ResNet18 backbone networks form two symmetric feature embedding networks, the two parts are connected through a gradient reversal layer, and finally a classification loss unit computes the loss of the domain discriminator.
The two symmetric feature embedding networks work as follows: the two backbone networks share parameters and extract features from the image, and are trained adversarially against the domain discriminator connected through the gradient reversal layer, so that the data distributions of the different domains are aligned and the features are embedded into a feature space shared by the networks.
The task module comprises a linear classification layer with two nodes, wherein: the linear layer nodes map the feature vector information in the embedded feature space to vector values serving as classification scores, which are then normalized with a Softmax function to obtain the final probability values of recapture and direct capture.
During testing or practical application, the class, recaptured or directly captured, with the larger probability value is taken directly as the decision; during training, the two probability values and the true class are used to compute the cross entropy, so that the cross entropy guides the network to learn the concepts of the recaptured and directly-captured categories.
The triplet comprises an anchor, a positive sample and a negative sample, wherein: when the anchor is the feature vector of a recaptured image, the positive sample is the feature vector of another recaptured image.
The KL divergence value is not computed during testing or practical use. During training, back-propagating the KL divergence value aligns, across domains and within each domain, the classification distributions formed by features of different scales, reducing the impact of scale differences on performance and further improving generalization.
Technical effects
The invention receives, through two ResNet18 backbone networks, the image window information at the two scale levels, one large and one small, generated by the preprocessing module. During adversarial learning, the features embedded from each image at both scales participate in training the domain classifier, as do both the recaptured and directly-captured data in the data set, which gives the adversarial learning module its symmetry.
Compared with the prior art, the invention can be trained on a training set composed of several data sets from different sources and achieves a lower error rate (measured by HTER) and stronger generalization (measured by AUC) in recapture-detection tests or practical tasks where the distribution is unknown; it also resolves the increase in detection error rate caused by differences in image scale distribution.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the system of the present invention;
FIG. 2 is a schematic diagram of an embodiment training process;
FIG. 3 is a schematic diagram of the testing and deployment process of the embodiment.
Detailed Description
As shown in FIG. 1, the present embodiment relates to a domain-generalized scale-aligned recaptured-picture detection system, which comprises: a preprocessing module, a symmetric adversarial learning module, a task module, a global scale relation alignment module and a local feature triplet loss mining module, wherein: the preprocessing module extracts a window from the image and generates a scale pyramid, which serves as the input of the subsequent symmetric adversarial learning module; in training mode the symmetric adversarial learning module receives the inputs at the two scale levels and embeds each into the feature space (during testing and practical use only one side network receives input); the two symmetric feature embedding networks share parameters for feature extraction and are trained adversarially against a domain discriminator connected through a gradient reversal layer, so that the data distributions of different domains are aligned and the features are embedded into a feature space shared by the networks; the task module, i.e., conventional classification linear layer nodes with a cross-entropy loss, guides the network to learn the concepts of the recaptured and directly-captured categories; the global scale relation alignment module aligns, across domains and within domains, the classification distributions formed by features of different scales, reducing the impact of scale differences on performance and further improving generalization; and the local feature triplet loss mining module performs local triplet-loss mining in the feature space, improving the compactness of the feature embedding and ensuring a clear decision boundary.
The preprocessing module comprises a sorting unit, a small-scale level unit and a large-scale level unit, wherein: the sorting unit sorts all pictures in each domain of the data set, recaptured, directly-captured, test and training pictures, into a json file in which each record stores the absolute path of a picture and its recaptured/directly-captured label; the small-scale level unit reads a picture as PIL Image data, performs aligned 1:2 down-sampling, cuts a 256 × 256 × 3 window from the middle position as the small-scale picture, and normalizes the window with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the final small-scale input of the symmetric adversarial learning module; the large-scale level unit reads the picture as PIL Image data, cuts a 256 × 256 × 3 window from the middle position as the uniformly formatted large-scale input picture, and normalizes it with the same parameters [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] to obtain the final large-scale input of the symmetric adversarial learning module.
The sorting is as follows: all pictures in each domain of the data set, recaptured, directly-captured, test and training pictures, are stored by category into a json file in which each record stores the absolute path of the picture and its recaptured/directly-captured label. During training or validation, the preprocessing module reads the json records of the source domains, i.e., the domains whose recaptured and directly-captured data participate in training, and randomly selects the same number of pictures as the test set, i.e., the target domain.
The inputs at the two scale levels and the corresponding labels are provided to all subsequent modules for computation. During testing, the small-scale inputs are masked and only the large-scale inputs are kept for classification.
The scale refers to the relative size of a target object in the picture: the same object occupies a larger proportion of the picture at the large scale level and a smaller proportion at the small scale level.
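For illustration, the two-scale preprocessing can be sketched in PyTorch as follows; this is a minimal reading of the description, not code from the patent, and the function names and the assumption that pictures are large enough for both crops are ours:

    from PIL import Image
    import torchvision.transforms as T
    import torchvision.transforms.functional as TF

    # Normalization parameters named in the description (the ImageNet mean/std).
    NORMALIZE = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    def center_crop_256(img):
        """Cut a 256 x 256 window from the middle of the picture."""
        w, h = img.size
        left, top = (w - 256) // 2, (h - 256) // 2
        return img.crop((left, top, left + 256, top + 256))

    def two_scale_inputs(path):
        """Return the (large, small) scale-pyramid levels for one image."""
        img = Image.open(path).convert("RGB")

        # Large scale level: crop directly, without any resize operation.
        large = NORMALIZE(TF.to_tensor(center_crop_256(img)))

        # Small scale level: 1:2 down-sampling by keeping every other pixel
        # horizontally and vertically, then crop and normalize.
        t = TF.to_tensor(img)                       # (3, H, W)
        small_img = TF.to_pil_image(t[:, ::2, ::2])
        small = NORMALIZE(TF.to_tensor(center_crop_256(small_img)))
        return large, small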
The symmetric adversarial learning module, which uses only one side network to receive input during testing and practical use, comprises: a feature generation unit formed by two parallel ResNet18 backbone networks, a gradient reversal layer, a domain identification unit and a classification loss unit, wherein: the two input ports of the feature generation unit are connected to the two groups of ResNet-18 backbone networks, which receive the two scale levels and extract features of the same category and content at different scale levels; the features are concatenated in order along the batch dimension and output to the task module, the global scale relation alignment module, the local feature triplet loss mining module and the gradient reversal layer of the symmetric adversarial learning module; the gradient reversal layer passes its input straight to its output without any operation or modification during forward propagation, while during backward propagation it multiplies the gradient by -1 and then by a coefficient that reduces the influence of noise at the start of training; the domain identification unit classifies the domain of the features embedded by the feature generation unit, with the number of output nodes equal to the number of data sets used; and the classification loss unit computes a cross-entropy loss.
The reversal coefficient is λ = 2/(1 + exp(-10k)) - 1, where exp is the exponential function and k = current_iteration / total_iteration is the ratio of the number of back-propagation rounds so far to the total number of rounds, wherein: current_iteration is the current back-propagation round, increasing from 1 to 4000 in this embodiment, and total_iteration is the set total number of back-propagation rounds, set to 4000 in this embodiment.
The task module comprises an averaging unit, a classification unit consisting of a linear layer with two nodes, and a loss unit, wherein: the averaging unit averages the embedded features of the recaptured and directly-captured pictures separately to obtain two groups of mean feature values; the classification unit classifies the embedded features into recaptured and directly captured; and the loss unit compares the classes output by the classification unit with the true classes during training and computes a loss value that is back-propagated as part of the final combined loss.
The linear layer nodes receive the 512-dimensional embedded features, output 2-dimensional floating-point numbers, and process them with a softmax function into normalized probability confidences of recapture and direct capture. The class with the higher confidence is output as the final class.
The task module classifies the 512-dimensional features embedded by the feature generator into two categories, recaptured picture and directly-captured picture, with the 2-node linear layer torch.nn.Linear(in_features=512, out_features=2), specifically: the 2-dimensional output of the linear layer is probability-normalized with a softmax function, the category with the higher probability among recaptured and directly captured is taken as the classification, and the task module does not distinguish the domain of the data, i.e., the pair formed by the input data and its marginal distribution, wherein: a recaptured picture is a picture taken by an imaging device of a display, and a directly-captured picture is a picture taken by an imaging device of an object in a natural scene.
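A minimal sketch of this classification head, using the torch.nn.Linear(512, 2) layer named above (the helper function and the label convention 1 = recaptured are illustrative):

    import torch
    import torch.nn as nn

    # 2-node linear classifier over the 512-dimensional embedded features.
    classifier = nn.Linear(in_features=512, out_features=2)

    def classify(embeddings):
        """embeddings: (B, 512) -> scores (B, 2), probabilities (B, 2) and
        the predicted class (0 = directly captured, 1 = recaptured)."""
        scores = classifier(embeddings)
        probs = torch.softmax(scores, dim=1)   # normalized confidences
        return scores, probs, probs.argmax(dim=1)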
The source domain refers to a domain participating in training, and the target domain refers to a domain used for testing.
The scale relation alignment module comprises a KL divergence calculation unit whose input is the classification confidence distributions at the large and small scale levels, i.e., a tensor of dimensions (B, 2, 2). For each pair of pictures of the same content at different scales, recaptured or directly captured, the symmetric KL divergence of the task module's classification-unit outputs measures the feature difference and is taken as the scale loss of that pair; the mean of all scale losses in the batch is the final output value of the scale relation alignment module, wherein: B is the batch size set during training or testing, and the output is a loss value of dimension 1.
The global scale relation alignment module aligns the two distributions by minimizing the differences in the final classification-unit output caused by scale differences irrelevant to recapture and direct capture, enhancing the generalization ability of the embedded feature space and making it insensitive to global scale differences.
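A minimal sketch of this symmetric KL computation, assuming the (B, 2, 2) score layout described above (function and variable names are illustrative):

    import torch

    def scale_alignment_loss(scores):
        """scores: (B, 2, 2) -- for each of B picture pairs, the classifier
        scores at the large and the small scale level. Returns the
        batch-averaged symmetric KL divergence (a dimension-1 loss value)."""
        p = torch.softmax(scores[:, 0, :], dim=1)   # large-scale distribution
        q = torch.softmax(scores[:, 1, :], dim=1)   # small-scale distribution
        kl_pq = (p * (p.log() - q.log())).sum(dim=1)
        kl_qp = (q * (q.log() - p.log())).sum(dim=1)
        return (kl_pq + kl_qp).mean()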
The local feature triplet loss mining module further constrains the embedded feature space locally. The module comprises a triplet loss calculation unit, which enumerates all triplets in the data set, computes the loss of each, and averages the triplet losses of all triplets to obtain the final result.
The loss is computed as follows: a feature is first selected as the anchor; one feature of the same category is selected as the positive sample and its distance to the anchor is computed; then one feature of the other category is selected as the negative sample and its distance to the anchor is computed. The distance from the anchor to the positive sample minus the distance from the anchor to the negative sample is the triplet loss on that triplet. The final triplet loss is obtained by exhaustively summing and averaging all triplet losses. By constraining the feature space locally, the local feature triplet loss mining module improves the compactness of the feature representation, which helps optimize the decision boundary in feature space and improves generalization.
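A minimal sketch of this exhaustive triplet mining (names are illustrative; no margin term is used, since the description specifies none, and both classes are assumed present in the batch):

    import torch

    def local_triplet_loss(features, labels):
        """features: (N, 512); labels: (N,), 1 = recaptured, 0 = directly
        captured. Enumerates all (anchor, positive, negative) triplets and
        averages d(anchor, positive) - d(anchor, negative)."""
        dist = torch.cdist(features, features)   # pairwise Euclidean distances
        losses = []
        for a in range(len(labels)):
            positives = (labels == labels[a]).nonzero().flatten()
            negatives = (labels != labels[a]).nonzero().flatten()
            for p in positives:
                if p == a:
                    continue                     # the anchor is not its own positive
                for n in negatives:
                    losses.append(dist[a, p] - dist[a, n])
        return torch.stack(losses).mean()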
This embodiment relates to a domain-generalized scale-aligned recapture detection method based on the above system, comprising the following specific steps:
S1) training the domain-generalized scale-aligned recapture-detection model.
S1.1) preprocessing the 3 selected data sets and generating json list files of all training data and corresponding labels with general_label.py, specifically: processing the 3 training data sets one by one, extracting the absolute paths of all pictures and storing them with their labels (recaptured: 1, directly captured: 0) into a json file;
S1.2) randomly selecting an equal amount of training data from each of the 3 data sets: for each data set and each category, 608 pictures are selected with random seed 666, and their paths and labels are written into the training json file for later reference;
S1.3) reading the pictures of the 3 data sets in batches of size B, one batch per round, with B = 8, so that 24 pictures from the 3 data sets form one batch; each picture is read into PIL format with the PIL Image library, and one copy is kept while another is duplicated;
S1.4) for the kept picture, directly cutting a 256 × 256 window at the center without any resize operation, converting it to Tensor format, and normalizing it with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the large-scale picture input;
S1.5) converting the duplicated picture to Tensor format, down-sampling it by keeping every other pixel horizontally and vertically to obtain a 1:2 down-sampled Tensor, converting it back to PIL format, cutting a 256 × 256 window at the center, converting it to Tensor format, and normalizing it with [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] to obtain the small-scale picture input;
S1.6) merging the pictures obtained in S1.4 and S1.5 along the batch dimension to obtain an input of batch size 48, each member a normalized 256 × 256 × 3 color picture, and feeding the merged batch to the feature generation unit of the symmetric adversarial learning module;
S1.7) initializing the feature generation unit with the public model ResNet18-5c106cde, pre-trained on ImageNet;
S1.8) modifying the pooling layer after layer4 of ResNet to nn.AdaptiveAvgPool2d(output_size=1) and the bottleneck_layer to nn.Linear(512, 512), so that the output embedded feature dimension is 512;
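For illustration, S1.7-S1.8 can be sketched with torchvision, assuming the checkpoint name in S1.7 refers to the torchvision ResNet-18 weights (resnet18-5c106cde.pth) and that the bottleneck_layer maps to ResNet's final fully-connected layer:

    import torch
    import torch.nn as nn
    from torchvision import models

    # ImageNet-pretrained ResNet-18 (the torchvision checkpoint whose hash
    # is 5c106cde, matching the model name in S1.7).
    backbone = models.resnet18(pretrained=True)
    backbone.avgpool = nn.AdaptiveAvgPool2d(output_size=1)
    backbone.fc = nn.Linear(512, 512)   # bottleneck layer -> 512-d embedding

    features = backbone(torch.randn(48, 3, 256, 256))   # -> shape (48, 512)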
S1.9) feeding the merged batch of pictures into the feature generation unit to obtain features of dimensions (48, 512);
S1.10) replicating the (48, 512)-dimensional features: the first copy is fed into the gradient reversal unit of the symmetric adversarial learning module, the second into the task module, and the third into the local feature triplet loss mining module;
S1.11) in the gradient reversal unit, counting the training rounds during forward propagation and, during backward propagation, multiplying the gradient by the coefficient λ = 2/(1 + exp(-10k)) - 1, wherein: k is the ratio of the counted forward-propagation rounds to the total rounds, with the total set to 4000;
S1.12) the domain identification unit is arranged as a cascade of two linear layer nodes and one Dropout layer, wherein: the linear layers have 512 and num_domains output nodes respectively, with the number of domains being 3, the number of data sets used in training; the Dropout layer, with probability 0.5, is inserted between the two linear layers;
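A minimal sketch of this domain identification unit (the module and variable names are illustrative):

    import torch.nn as nn

    num_domains = 3   # number of data sets used in training

    domain_discriminator = nn.Sequential(
        nn.Linear(512, 512),         # first linear layer: 512 nodes
        nn.Dropout(p=0.5),           # inserted between the two linear layers
        nn.Linear(512, num_domains)  # one output node per domain
    )
    adversarial_criterion = nn.CrossEntropyLoss()  # yields L1 against labels 0, 1, 2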
S1.13) during forward propagation, encoding the label as 0, 1 or 2 according to the data set a picture comes from, and passing it into the classification loss unit to compute the cross entropy, yielding the adversarial learning loss L1;
S1.14) in the task module, using one linear layer with 2 nodes as the classifier, the nodes corresponding to the recaptured and directly-captured categories; it outputs two scores per picture estimating the probability of each category, with output dimensions (48, 2); the feature values of the recaptured and directly-captured pictures are also averaged separately and scored in the same way, with output dimensions (2, 2); the first, (48, 2)-dimensional output is fed directly into a softmax function to normalize the scores and then into the cross-entropy loss to obtain the task loss L2;
S1.15) passing the second, (2, 2)-dimensional output of the task module into the global scale relation alignment module and splitting the 2 pairs of probabilities by scale level into 2 parts to obtain a score tensor of dimensions (1, 2, 2); computing the symmetric KL divergence over the second dimension, pair by pair, yields 1 KL divergence value, namely the global scale relation alignment loss L3;
S1.16) in the local feature triplet loss mining module, dividing the features by category, exhaustively taking every member as an anchor with a same-category feature as positive sample and a different-category feature as negative sample, computing the difference of distances to the anchor as the loss on one triplet, and taking the mean of the triplet losses over all triplets as the local feature triplet loss L4;
S1.17) at the end of each round, linearly combining L1-L4 into the total loss L, taking L = 0.1·L1 + L2 + 0.1·L3 + 0.2·L4;
S1.18) using the autograd function to differentiate automatically and back-propagate, completing one round of training;
S1.19) freezing all parameters of the model and running steps S1.2 to S1.17 on the remaining data set without backward propagation; if the computed L2 has decreased, updating the checkpoint and saving all current parameters;
S1.20) repeating steps S1.2 to S1.19 4000 times.
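A minimal sketch of the loss combination and update in S1.17-S1.18; the optimizer is not specified in the description and is left to the caller:

    def training_round(L1, L2, L3, L4, optimizer):
        """Linearly combine the four losses (S1.17) and back-propagate
        with autograd (S1.18)."""
        loss = 0.1 * L1 + L2 + 0.1 * L3 + 0.2 * L4
        optimizer.zero_grad()
        loss.backward()    # autograd differentiates through all modules
        optimizer.step()
        return loss.item()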
S2) performing recapture detection with the domain-generalized scale-aligned recapture-detection model.
S2.1) preprocessing the selected test data set and generating json list files of all test data and corresponding labels with general_label.py, specifically: extracting the absolute paths of all pictures and storing them with their labels (recaptured: 1, directly captured: 0) into a json file;
S2.2) randomly selecting 608 pictures with random seed 666 and writing their paths and labels into a json file for later reference;
S2.3) reading the pictures of the data set in batches of size B, one batch per round, with B = 8; each picture is read into PIL format with the PIL Image library, and one copy is kept while another is duplicated;
S2.4) for the kept picture, directly cutting a 256 × 256 window at the center without any resize operation, converting it to Tensor format, and normalizing it with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the large-scale picture input;
S2.5) converting the duplicated picture to Tensor format, down-sampling it by keeping every other pixel horizontally and vertically to obtain a 1:2 down-sampled Tensor, converting it back to PIL format, cutting a 256 × 256 window at the center, converting it to Tensor format, and normalizing it with [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] to obtain the small-scale picture input;
S2.6) merging the pictures obtained in S2.4 and S2.5 along the batch dimension to obtain an input of batch size 48, each member a normalized 256 × 256 × 3 color picture, and feeding the merged batch to the feature generation unit of the symmetric adversarial learning module;
S2.7) initializing the feature generation unit with the public model ResNet18-5c106cde, pre-trained on ImageNet;
S2.8) modifying the pooling layer after layer4 of ResNet to nn.AdaptiveAvgPool2d(output_size=1) and the bottleneck_layer to nn.Linear(512, 512), so that the output embedded feature dimension is 512;
S2.9) feeding the merged batch of pictures into the feature generation unit to obtain features of dimensions (48, 512);
S2.10) feeding the (48, 512)-dimensional features into the task module;
S2.11) in the task module, using one linear layer with 2 nodes as the classifier, the nodes corresponding to the recaptured and directly-captured categories; it outputs two scores per picture estimating the probability of each category, with total output dimensions (48, 2); the output is passed through a softmax function to normalize the scores and then into the cross-entropy loss to obtain the task loss L2;
S2.12) collecting the accuracy and the classification confidences from S2.11;
S2.13) computing HTER, the mean of the false-acceptance and false-rejection rates, from the confidences collected in S2.12, plotting the ROC curve, and computing the AUC from it.
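A minimal sketch of S2.12-S2.13 using scikit-learn for the AUC; the 0.5 decision threshold for the error rates is an assumption:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def hter_and_auc(confidences, labels):
        """confidences: predicted recapture probabilities in [0, 1];
        labels: 1 = recaptured, 0 = directly captured (NumPy arrays).
        HTER is the mean of the false-acceptance and false-rejection rates."""
        pred = (confidences >= 0.5).astype(int)
        far = np.mean(pred[labels == 0] == 1)   # false acceptance rate
        frr = np.mean(pred[labels == 1] == 0)   # false rejection rate
        hter = (far + frr) / 2.0
        auc = roc_auc_score(labels, confidences)
        return hter, auc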
This example performed four groups of training and testing on the 4 public data sets ICL-COMMSP (abbreviated I), NTU-ROSE (abbreviated N), mturk (abbreviated M) and BJTU-IIS (abbreviated B), where the data-scale difference between M and N is the largest and that between I and B the smallest. Each time one data set was chosen as the test set and the remaining three as the training sets; the four groups of experiments are denoted B & I & M to N, B & I & N to M, B & M & N to I, and I & M & N to B, and training followed the method described above with L = 0.1·L1 + L2 + 0.1·L3 + 0.2·L4.
By minimizing the distribution difference between domains, aligning the global scale relation, and mining triplet losses locally, this embodiment significantly improves the generalization ability of the algorithm and gives the recapture-detection algorithm practical value.
In the four groups of generalization experiments on the four public data sets (B & I & M to N, B & I & N to M, B & M & N to I, and I & M & N to B), the HTER of this example was 18.43%, 23.68%, 16.93% and 15.12% respectively, an error rate significantly better than the prior art's 32.81%, 35.16%, 36.88% and 18.25%. The AUC of this example in the four experiments was 85.98%, 83.16%, 91.44% and 91.87%, significantly better than the prior art's 74.41%, 69.32%, 73.32% and 85.88%.
Compared with the prior art, this embodiment handles joint training over multiple data sets, simplifying the work of data collection; it also remedies the inconsistent detection results for pictures of the same content at different scales, enabling the recapture-detection technique to exploit features of different scales for identification; in addition, it maintains the highest generalization ability even when the number of source domains is extremely limited, so the system adapts better to practical application scenarios on the target domain. The model of each module can be retrained centrally at regular intervals according to usage, further improving system performance.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (10)

1. A domain-generalized scale-aligned recapture detection system, comprising: a preprocessing module, a symmetric adversarial learning module, a task module, a global scale relation alignment module and a local feature triplet loss mining module, wherein: the preprocessing module extracts a window from the image to be examined and generates a scale pyramid; in training mode the adversarial learning module receives inputs at the two scale levels of the pyramid, one large and one small, and embeds each into the feature space; the task module obtains the final probability values of recapture and direct capture from the feature vector information in the embedded feature space; the global scale relation alignment module computes the symmetric KL divergence between the vector values computed by the task module, i.e., the classification scores, on the large-scale and small-scale pyramid levels of the same image to obtain the KL divergence value; the local feature triplet loss mining module randomly selects, from the feature vector information in the embedded feature space, a feature vector as an anchor, then selects a feature vector from the anchor's class as a positive sample and a feature vector from the other class as a negative sample, computes the difference of the Euclidean distances in feature space between the anchor and the positive and negative samples as the loss of that triplet, enumerates all triplets in the input data, and averages the losses of all triplets to obtain the total loss; minimizing this loss during training performs local triplet-loss mining in the feature space, improving the compactness of the feature embedding and ensuring a clear decision boundary;
the scale pyramid has two layers, corresponding respectively to the large and small scale levels generated for each input image;
the triplet comprises an anchor, a positive sample and a negative sample, wherein: when the anchor is the feature vector of a recaptured image, the positive sample is the feature vector of another recaptured image.
2. The domain-generalized scale-aligned recapture detection system of claim 1, wherein said adversarial learning module comprises: a feature generation unit formed by two parallel ResNet18 backbone networks, and a domain discriminator, wherein: the two ResNet18 backbone networks form two symmetric feature embedding networks, the two parts are connected through a gradient reversal layer, and finally a classification loss unit computes the loss of the domain discriminator.
3. The domain-generalized scale-aligned recapture detection system of claim 2, wherein the two symmetric feature embedding networks work as follows: the two backbone networks share parameters and extract features from the image, and are trained adversarially against the domain discriminator connected through the gradient reversal layer, so that the data distributions of different domains are aligned and the features are embedded into a feature space shared by the networks.
4. The domain-generalized scale-aligned recapture detection system of claim 1, 2 or 3, wherein said adversarial learning module comprises: a feature generation unit formed by two parallel ResNet18 backbone networks, a gradient reversal layer, a domain identification unit and a classification loss unit, wherein: the two input ports of the feature generation unit are connected to the two groups of ResNet-18 backbone networks, which receive the two scale levels and extract features of the same category and content at different scale levels; the features are concatenated in order along the batch dimension and output to the task module, the global scale relation alignment module, the local feature triplet loss mining module and the gradient reversal layer of the symmetric adversarial learning module; the gradient reversal layer passes its input straight to its output without any operation or modification during forward propagation, while during backward propagation it multiplies the gradient by -1 and then by a coefficient that reduces the influence of noise at the start of training; the domain identification unit classifies the domain of the features embedded by the feature generation unit, with the number of output nodes equal to the number of data sets used; and the classification loss unit computes a cross-entropy loss.
5. The domain-generalized scale-aligned recapture detection system of claim 1, wherein said task module comprises: an averaging unit, a classification unit consisting of a linear layer with two nodes, and a loss unit, wherein: the averaging unit averages the embedded features of the recaptured and directly-captured pictures separately to obtain two groups of mean feature values; the classification unit performs the recaptured/directly-captured classification task on the embedded features; the loss unit compares the classes output by the classification unit with the true classes during training and computes a loss value that is back-propagated as part of the final combined loss;
the linear layer nodes map the feature vector information in the embedded feature space to vector values serving as classification scores, which are then normalized with a Softmax function to obtain the final probability values of recapture and direct capture.
6. The domain-generalized scale-aligned recapture detection system of claim 5, wherein during testing or practical application the class, recaptured or directly captured, with the higher probability value is taken directly as the decision, and during training the two probability values and the true class are used to compute the cross entropy, so that the cross entropy guides the network to learn the concepts of the recaptured and directly-captured categories.
7. The domain-generalized scale-aligned recapture detection system of claim 1 or 2, wherein said preprocessing module comprises: a sorting unit, a small-scale level unit and a large-scale level unit, wherein: the sorting unit sorts all pictures in each domain of the data set, recaptured, directly-captured, test and training pictures, into a json file in which each record stores the absolute path of a picture and its recaptured/directly-captured label; the small-scale level unit reads a picture as PIL Image data, performs aligned 1:2 down-sampling, cuts a 256 × 256 × 3 window from the middle position as the small-scale picture, and normalizes the window with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the final small-scale input of the symmetric adversarial learning module; the large-scale level unit reads the picture as PIL Image data, cuts a 256 × 256 × 3 window from the middle position as the uniformly formatted large-scale input picture, and normalizes it with the same parameters to obtain the final large-scale input of the symmetric adversarial learning module.
8. The domain-generalized scale-aligned recapture detection system of claim 1, wherein said scale relation alignment module comprises: a KL divergence calculation unit whose input is the classification confidence distributions at the large and small scale levels, i.e., a tensor of dimensions (B, 2, 2); for each pair of pictures of the same content at different scales, recaptured or directly captured, the symmetric KL divergence of the task module's classification-unit outputs measures the feature difference and is taken as the scale loss of that pair; the mean of all scale losses in the batch is the final output value of the scale relation alignment module, wherein: B is the batch size set during training or testing, and the output is a loss value of dimension 1.
9. The domain-generalized scale-aligned recapture detection system of claim 1, wherein said local feature triplet loss mining module further constrains the embedded feature space locally, the module comprising: a triplet loss calculation unit that enumerates all triplets in the data set, computes the loss of each, and averages the triplet losses of all triplets as the final result;
the loss is computed as follows: a feature is selected as the anchor; one feature of the same category is selected as the positive sample and its distance to the anchor is computed; one feature of the other category is selected as the negative sample and its distance to the anchor is computed; the distance from the anchor to the positive sample minus the distance from the anchor to the negative sample is the triplet loss on that triplet, and all triplet losses are exhaustively summed and averaged to obtain the final triplet loss.
10. A domain-generalized scale-aligned recapture detection method based on the system of any one of claims 1 to 9, comprising:
S1) training a domain-generalized scale-aligned recapture-detection model;
S1.1) preprocessing the 3 selected data sets and generating json list files of all training data and corresponding labels with general_label.py, specifically: processing the 3 training data sets one by one, extracting the absolute paths of all pictures and storing them with their labels into a json file;
S1.2) randomly selecting an equal amount of training data from each of the 3 data sets: for each data set and each category, 608 pictures are selected with random seed 666, and their paths and labels are written into the training json file for later reference;
S1.3) reading the pictures of the 3 data sets in batches of size B, one batch per round, with B = 8, so that 24 pictures from the 3 data sets form one batch; each picture is read into PIL format with the PIL Image library, and one copy is kept while another is duplicated;
S1.4) for the kept picture, directly cutting a 256 × 256 window at the center without any resize operation, converting it to Tensor format, and normalizing it with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the large-scale picture input;
S1.5) converting the duplicated picture to Tensor format, down-sampling it by keeping every other pixel horizontally and vertically to obtain a 1:2 down-sampled Tensor, converting it back to PIL format, cutting a 256 × 256 window at the center, converting it to Tensor format, and normalizing it with [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] to obtain the small-scale picture input;
S1.6) merging the pictures obtained in S1.4 and S1.5 along the batch dimension to obtain an input of batch size 48, each member a normalized 256 × 256 × 3 color picture, and feeding the merged batch to the feature generation unit of the symmetric adversarial learning module;
S1.7) initializing the feature generation unit with the public model ResNet18-5c106cde, pre-trained on ImageNet;
S1.8) modifying the pooling layer after layer4 of ResNet to nn.AdaptiveAvgPool2d(output_size=1) and the bottleneck_layer to nn.Linear(512, 512), so that the output embedded feature dimension is 512;
S1.9) feeding the merged batch of pictures into the feature generation unit to obtain features of dimensions (48, 512);
S1.10) replicating the (48, 512)-dimensional features: the first copy is fed into the gradient reversal unit of the symmetric adversarial learning module, the second into the task module, and the third into the local feature triplet loss mining module;
S1.11) in the gradient reversal unit, counting the training rounds during forward propagation and, during backward propagation, multiplying the gradient by the coefficient λ = 2/(1 + exp(-10k)) - 1, wherein: k is the ratio of the counted forward-propagation rounds to the total rounds, with the total set to 4000;
S1.12) the domain identification unit is arranged as a cascade of two linear layer nodes and one Dropout layer, wherein: the linear layers have 512 and num_domains output nodes respectively, with the number of domains being 3, the number of data sets used in training; the Dropout layer, with probability 0.5, is inserted between the two linear layers;
S1.13) during forward propagation, encoding the label as 0, 1 or 2 according to the data set a picture comes from, and passing it into the classification loss unit to compute the cross entropy, yielding the adversarial learning loss L1;
S1.14) in the task module, using one linear layer with 2 nodes as the classifier, the nodes corresponding to the recaptured and directly-captured categories; it outputs two scores per picture estimating the probability of each category, with output dimensions (48, 2); the feature values of the recaptured and directly-captured pictures are also averaged separately and scored in the same way, with output dimensions (2, 2); the first, (48, 2)-dimensional output is fed directly into a softmax function to normalize the scores and then into the cross-entropy loss to obtain the task loss L2;
S1.15) passing the second, (2, 2)-dimensional output of the task module into the global scale relation alignment module and splitting the 2 pairs of probabilities by scale level into 2 parts to obtain a score tensor of dimensions (1, 2, 2); computing the symmetric KL divergence over the second dimension, pair by pair, yields 1 KL divergence value, namely the global scale relation alignment loss L3;
S1.16) in the local feature triplet loss mining module, dividing the features by category, exhaustively taking every member as an anchor with a same-category feature as positive sample and a different-category feature as negative sample, computing the difference of distances to the anchor as the loss on one triplet, and taking the mean of the triplet losses over all triplets as the local feature triplet loss L4;
S1.17) at the end of each round, linearly combining L1-L4 into the total loss L, taking L = 0.1·L1 + L2 + 0.1·L3 + 0.2·L4;
S1.18) using the autograd function to differentiate automatically and back-propagate, completing one round of training;
S1.19) freezing all parameters of the model and running steps S1.2 to S1.17 on the remaining data set without backward propagation; if the computed L2 has decreased, updating the checkpoint and saving all current parameters;
S1.20) repeating steps S1.2 to S1.19 4000 times;
S2) performing recapture detection with the domain-generalized scale-aligned recapture-detection model;
S2.1) preprocessing the selected test data set and generating json list files of all test data and corresponding labels with general_label.py, specifically: extracting the absolute paths of all pictures and storing them with their labels (recaptured: 1, directly captured: 0) into a json file;
S2.2) randomly selecting 608 pictures with random seed 666 and writing their paths and labels into a json file for later reference;
S2.3) reading the pictures of the data set in batches of size B, one batch per round, with B = 8; each picture is read into PIL format with the PIL Image library, and one copy is kept while another is duplicated;
S2.4) for the kept picture, directly cutting a 256 × 256 window at the center without any resize operation, converting it to Tensor format, and normalizing it with mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225] to obtain the large-scale picture input;
S2.5) converting the duplicated picture to Tensor format, down-sampling it by keeping every other pixel horizontally and vertically to obtain a 1:2 down-sampled Tensor, converting it back to PIL format, cutting a 256 × 256 window at the center, converting it to Tensor format, and normalizing it with [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225] to obtain the small-scale picture input;
S2.6) merging the pictures obtained in S2.4 and S2.5 along the batch dimension to obtain an input of batch size 48, each member a normalized 256 × 256 × 3 color picture, and feeding the merged batch to the feature generation unit of the symmetric adversarial learning module;
S2.7) initializing the feature generation unit with the public model ResNet18-5c106cde, pre-trained on ImageNet;
S2.8) modifying the pooling layer after layer4 of ResNet to nn.AdaptiveAvgPool2d(output_size=1) and the bottleneck_layer to nn.Linear(512, 512), so that the output embedded feature dimension is 512;
S2.9) feeding the merged batch of pictures into the feature generation unit to obtain features of dimensions (48, 512);
S2.10) feeding the (48, 512)-dimensional features into the task module;
S2.11) in the task module, using one linear layer with 2 nodes as the classifier, the nodes corresponding to the recaptured and directly-captured categories; it outputs two scores per picture estimating the probability of each category, with total output dimensions (48, 2); the output is passed through a softmax function to normalize the scores and then into the cross-entropy loss to obtain the task loss L2;
S2.12) collecting the accuracy and the classification confidences from S2.11;
S2.13) computing HTER, the mean of the false-acceptance and false-rejection rates, from the confidences collected in S2.12, plotting the ROC curve, and computing the AUC from it.
CN202111091084.0A 2021-09-17 2021-09-17 Domain-generalized scale-aligned recaptured-picture detection system Active CN113763373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111091084.0A CN113763373B (en) Domain-generalized scale-aligned recaptured-picture detection system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111091084.0A CN113763373B (en) Domain-generalized scale-aligned recaptured-picture detection system

Publications (2)

Publication Number Publication Date
CN113763373A true CN113763373A (en) 2021-12-07
CN113763373B CN113763373B (en) 2023-10-13

Family

ID=78796123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111091084.0A Active CN113763373B (en) Domain-generalized scale-aligned recaptured-picture detection system

Country Status (1)

Country Link
CN (1) CN113763373B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674714A (en) * 2019-09-13 2020-01-10 东南大学 Human face and human face key point joint detection method based on transfer learning
CN110728263A (en) * 2019-10-24 2020-01-24 中国石油大学(华东) Pedestrian re-identification method based on strong discrimination feature learning of distance selection
CN112070714A (en) * 2020-07-29 2020-12-11 西安工业大学 Method for detecting copied image based on local ternary counting characteristics
CN112307937A (en) * 2020-10-28 2021-02-02 广发证券股份有限公司 Deep learning-based identity card quality inspection method and system
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
CN113269224A (en) * 2021-03-24 2021-08-17 华南理工大学 Scene image classification method, system and storage medium

Also Published As

Publication number Publication date
CN113763373B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN108564129B (en) Trajectory data classification method based on generation countermeasure network
CN111814661B (en) Human body behavior recognition method based on residual error-circulating neural network
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN110532920A (en) Smallest number data set face identification method based on FaceNet method
Rehman et al. Deep learning for face anti-spoofing: An end-to-end approach
Li et al. Detection-friendly dehazing: Object detection in real-world hazy scenes
CN114119966A (en) Small sample target detection method based on multi-view learning and meta-learning
CN115909002A (en) Image translation method based on contrast learning
CN108510068A (en) A kind of ultra-deep regression analysis learning method
Li et al. Image manipulation localization using attentional cross-domain CNN features
CN110796182A (en) Bill classification method and system for small amount of samples
CN114021704A (en) AI neural network model training method and related device
CN111582057B (en) Face verification method based on local receptive field
CN111242114B (en) Character recognition method and device
CN112668532A (en) Crowd counting method based on multi-stage mixed attention network
CN115424275B (en) Fishing boat license plate identification method and system based on deep learning technology
CN113763373B (en) Domain-generalized scale-aligned recaptured-picture detection system
CN113642520B (en) Double-task pedestrian detection method with head information
CN115329821A (en) Ship noise identification method based on pairing coding network and comparison learning
Thacker et al. Statistical analysis of a stereo matching algorithm
CN115797684A (en) Infrared small target detection method and system based on context information
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device
CN108510070A (en) A kind of preparation method for the fuzzy event probability measure value passing through different spaces
CN111340111B (en) Method for recognizing face image set based on wavelet kernel extreme learning machine
CN113963022B (en) Multi-outlet full convolution network target tracking method based on knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant