CN117788906B - Large model generation image identification method and system

Info

Publication number: CN117788906B
Application number: CN202311804911.5A
Authority: CN
Inventors: 郑威; 云剑; 郑晓玲; 凌霞
Original assignee: China Academy of Information and Communications Technology CAICT
Current assignee: China Academy of Information and Communications Technology CAICT
Priority date: 2023-12-26
Filing date: 2023-12-26
Publication date: 2024-07-05
Anticipated expiration: 2043-12-26
Also published as: CN117788906A

Abstract

The invention provides a method and a system for identifying images generated by large models. The method comprises the following steps: inputting the generated image into a first processing module based on residual filtering to obtain original features; inputting the original features into a second processing module based on a self-attention mechanism and a residual structure to obtain classification features; and inputting the classification features into a classification network, which outputs only a "true" or "false" result. The scheme provided by the invention addresses the problems that the prior art cannot use the shallow texture information of the input picture and that its loss function is too simple to change dynamically with the input data.

Description

Large model generation image identification method and system

Technical Field

The invention belongs to the field of image identification, and particularly relates to a large model generation image identification method and system.

Background

With the development of artificial intelligence technology, large models have gradually matured and come to play their respective roles in people's lives. AI-generated content (AIGC) is a popular direction for large models. A large number of image-generation large models built on diffusion models have come into public view, for example Stable Diffusion, DreamBooth, and Midjourney. With these services, a user only needs to enter prompt words (a prompt) for the large model to generate a corresponding image automatically, which lets people who are not good at drawing turn the ideas in their minds into images.

However, image-generation large models are a double-edged sword and bring a number of drawbacks. For example, in ordinary commercial activities, investors or purchasers often wish to buy images personally designed and drawn by a painter rather than images generated by a large model, and the indiscriminate use of large models to generate images can also lead to copyright disputes. As another example, an image-generation large model can render a scene in great detail from prompt words, so a person with malicious intent may use it to fabricate a story and spread it, with immeasurable consequences.

Society therefore urgently needs technology capable of distinguishing large-model-generated images from real images.

Although earlier researchers have studied how to tell real pictures from fake ones, existing research on distinguishing computer-generated pictures from real pictures tends to focus on images generated by conventional neural networks such as the generative adversarial network (GAN) or the variational autoencoder (VAE). Methods for distinguishing such images from real pictures have been proposed from various perspectives, such as the spatial domain and the frequency domain.

However, the principle by which an image-generation large model produces an image using a diffusion model differs greatly from that of conventional image generation networks, so conventional techniques are difficult to apply directly to the task of detecting pictures generated by large models. Some scholars have used existing image identification techniques to distinguish large-model images from real images; the resulting models performed very poorly and could not meet current demands and expectations. With the rapid development of diffusion-based image-generation large models, existing identification techniques will find it increasingly difficult to distinguish large-model-generated images from real images.

Prior Art

The DIRE technique comes from the paper "DIRE for Diffusion-Generated Image Detection"; DIRE is an abbreviation for DIffusion Reconstruction Error. DIRE measures the error between an input image and its reconstruction by a pre-trained diffusion model. The authors of the paper found that images generated by diffusion models are more easily reconstructed by a pre-trained diffusion model than real images, which are harder to reconstruct because of the various complexities of reality. The method reconstructs an input image through DDIM, computes the difference between the reconstructed image and the original image, and finally performs binary classification with the difference as the feature to judge whether the image was forged by a large model.
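The difference step of this prior-art approach can be sketched as follows. This is only an illustration under stated assumptions: `reconstruction_error` and `toy_reconstruct` are hypothetical names, the error is taken as a per-pixel absolute difference, and `toy_reconstruct` is a trivial stand-in for the DDIM reconstruction, not a diffusion model.

```python
import numpy as np

def reconstruction_error(image, reconstruct):
    """Per-pixel absolute difference between an image and its reconstruction."""
    image = np.asarray(image, dtype=np.float64)
    reconstruction = np.asarray(reconstruct(image), dtype=np.float64)
    return np.abs(image - reconstruction)

def toy_reconstruct(image):
    # Placeholder only: clips pixel values to [0.2, 0.8].
    # A real pipeline would run DDIM inversion and reconstruction here.
    return np.clip(image, 0.2, 0.8)

image = np.array([[0.0, 0.5], [0.9, 1.0]])
error_map = reconstruction_error(image, toy_reconstruct)
# error_map would then be the input feature of a binary classifier.
```

Because the classifier sees only the error map, texture that the reconstruction preserves is absent from the feature, which is the loss of information discussed in the next section.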

Defects of the prior art

The first disadvantage of the prior-art method DIRE is that taking the difference between the original image and the reconstructed image discards the shallow texture characteristics of the original image, so that sufficient information cannot be extracted from the original image.

A second disadvantage of DIRE is that it pays no attention to the relationships between individual pixels within a large-model-generated image.

A third disadvantage of DIRE is that its loss function is too simple to adjust the learning stride dynamically according to the input data.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a technical scheme for a large model generation image identification method.

The first aspect of the invention discloses a large model generation image identification method, which comprises the following steps:

S1, inputting a generated image into a first processing module based on residual filtering to obtain original characteristics;

S2, inputting the original features into a second processing module based on a self-attention mechanism and a residual error structure to obtain classification features;

S3, inputting the classification features into a classification network, and outputting only a "true" or "false" result.

According to the method of the first aspect of the present invention, in the step S1, the method for inputting the generated image into the first processing module based on residual filtering to obtain the original feature includes:

And respectively inputting the generated image into a residual filter and a convolution kernel, combining the processing results of the residual filter and the convolution kernel, and finally inputting the combined result into a first convolution pooling layer to obtain the original characteristics.

According to the method of the first aspect of the present invention, in the step S1, there are seventeen residual filters and eight convolution kernels; the values of the seventeen residual filters are fixed and do not change during learning, whereas the parameters of the eight convolution kernels are learned during training.

According to the method of the first aspect of the present invention, in the step S1, the first convolution pooling layer is a convolution pooling layer with a convolution layer and a pooling layer to which a residual mechanism is applied.

According to the method of the first aspect of the present invention, in the step S2, the method for inputting the original feature into a second processing module based on a self-attention mechanism and a residual structure to obtain a classification feature includes:

and inputting the original features into a second convolution pooling layer to obtain processing features, inputting the processing features into self-attention operation, and carrying out numerical addition on the self-attention operation result and the processing features to obtain classification features.

According to the method of the first aspect of the present invention, in said step S2, V of the self-attention operation is the processing feature; the weights assigned to V are obtained by applying a softmax layer to the result computed from Q and K, and the weighted average of V then yields the shallow texture features captured by the attention mechanism.

According to the method of the first aspect of the present invention, in said step S3, classification is optimized using a high-dimensional spherical boundary objective function in said classification network.

In a second aspect, the present invention discloses a large model generation image authentication system, the system comprising:

the first processing module is configured to input the generated image into the first processing module based on residual filtering to obtain original characteristics;

The second processing module is configured to input the original features into the second processing module based on a self-attention mechanism and a residual error structure to obtain classification features;

and the third processing module is configured to input the classification characteristic into a classification network and output a result of only true or false.

A third aspect of the invention discloses an electronic device. The electronic device comprises a memory storing a computer program and a processor implementing the steps in a large model generation image authentication method of any one of the first aspects of the present disclosure when the processor executes the computer program.

A fourth aspect of the invention discloses a computer-readable storage medium. A computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps in a large model generation image authentication method of any of the first aspects of the present disclosure.

In summary, the scheme provided by the application adopts a network structure that combines a self-attention mechanism with a residual structure: the self-attention mechanism enhances the network's ability to refine and analyze shallow texture features invisible to the naked eye, and the residual connection supplements lost feature information, further enriching the basis for identification. In classification based on the high-dimensional spherical boundary objective function, training with an increased intra-group similarity stride helps the model pay more attention to the intra-group similarity threshold. The application addresses the problems that the prior art cannot use the shallow texture information of the input picture and that its loss function is too simple to change dynamically with the input data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a large model generation image authentication method according to an embodiment of the present invention;

FIG. 2 is a block flow diagram according to an embodiment of the invention;

FIG. 3 is a block diagram of residual filtering according to an embodiment of the present invention;

FIG. 4 is an exemplary value of the initialization of seventeen residual filters according to an embodiment of the present invention;

FIG. 5 is a self-attention module diagram according to an embodiment of the present invention;

FIG. 6 is a block diagram of a large model generation image authentication system according to an embodiment of the present invention;

FIG. 7 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The first aspect of the invention discloses a large model generation image identification method. Fig. 1 is a flowchart of a large model generation image authentication method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

In step S1, as shown in fig. 2, the generated image is input to a first processing module (residual filtering module in fig. 2) based on residual filtering, to obtain an original feature.

In some embodiments, in the step S1, the method for obtaining the original feature by inputting the generated image into a first processing module based on residual filtering includes:

As shown in fig. 3, the generated image is input into a residual filter and a convolution kernel respectively, then the processing results of the residual filter and the convolution kernel are combined, and finally the combined result is input into a first convolution pooling layer to obtain the original feature.

There are seventeen residual filters and eight convolution kernels; the values of the seventeen residual filters are fixed and do not change during learning, whereas the parameters of the eight convolution kernels are learned during training.
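The two-branch front end described above can be sketched as follows. This is a minimal sketch, assuming a single-channel image and assuming that "combined" means stacking the outputs along the channel axis. The function names are hypothetical, and the filter values are placeholders (each subtracts one neighbor from the center pixel); the actual seventeen values are those of FIG. 4 and are not reproduced here.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Cross-correlate a 2-D image with an odd-sized kernel, zero-padded to keep the size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def placeholder_residual_filters(count=17, size=5):
    """Fixed high-pass placeholders; NOT the values shown in FIG. 4."""
    c = size // 2
    offsets = [(r, k) for r in range(size) for k in range(size) if (r, k) != (c, c)]
    filters = []
    for r, k in offsets[:count]:
        f = np.zeros((size, size))
        f[c, c] = 1.0    # center pixel
        f[r, k] = -1.0   # minus one neighbor, so each filter sums to zero
        filters.append(f)
    return np.stack(filters)

def combine_branches(image, fixed_filters, learned_kernels):
    """Apply both branches to the image and stack the results as channels."""
    fixed_maps = [conv2d_same(image, f) for f in fixed_filters]      # never updated
    learned_maps = [conv2d_same(image, k) for k in learned_kernels]  # updated in training
    return np.stack(fixed_maps + learned_maps)

rng = np.random.default_rng(0)
fixed_filters = placeholder_residual_filters()           # 17 fixed filters
learned_kernels = rng.normal(scale=0.1, size=(8, 3, 3))  # 8 trainable kernels
image = np.ones((8, 8))
features = combine_branches(image, fixed_filters, learned_kernels)  # (25, 8, 8)
```

In a training framework, only `learned_kernels` would be registered as trainable parameters, while `fixed_filters` would be held constant.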

The first convolutional pooling layer is a convolutional pooling layer with a convolutional layer and a pooling layer, wherein a residual mechanism is applied to the convolutional layer.

Specifically, the convolution pooling layers herein include a 3*3 convolution layer, a regularization layer, a ReLU layer, and a max pooling layer.
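A convolution pooling layer of this kind could look roughly like the sketch below. It assumes that the "regularization layer" is a per-channel normalization and that pooling uses a 2×2 window, neither of which is stated in the text. The residual connection is omitted for brevity, and all names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv3x3(x, weights):
    """x: (C, H, W); weights: (O, C, 3, 3); zero padding keeps H and W."""
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3), axis=(1, 2))  # (C, H, W, 3, 3)
    return np.einsum("chwij,ocij->ohw", windows, weights)

def normalize(x, eps=1e-5):
    """Standardize each channel over its spatial positions (assumed normalization)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def conv_pool_layer(x, weights):
    # 3x3 convolution -> normalization -> ReLU -> max pooling
    return max_pool2x2(relu(normalize(conv3x3(x, weights))))

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 8, 8))                       # e.g. 17 + 8 input channels
weights = rng.normal(scale=0.1, size=(16, 25, 3, 3))  # 16 output channels (arbitrary)
pooled = conv_pool_layer(x, weights)                  # shape (16, 4, 4)
```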

FIG. 4 shows the specific initialization values of the seventeen residual filters; these filters can efficiently extract the residual information of an image.

In step S2, as shown in fig. 2, the original feature is input to a second processing module (self-attention module in fig. 2) based on a self-attention mechanism and a residual structure, resulting in a classification feature. The self-attention mechanism enhances the network's ability to refine and analyze shallow texture features invisible to the naked eye, and the residual connection supplements lost feature information, further enriching the basis for identification.

In some embodiments, in the step S2, the method for inputting the original feature into a second processing module based on a self-attention mechanism and a residual structure to obtain the classification feature includes:

As shown in fig. 5, the original feature is input into a second convolution pooling layer to obtain a processing feature, the processing feature is input into a self-attention operation, and the result of the self-attention operation is added element-wise to the processing feature to obtain the classification feature. The residual mechanism supplements the information lost during the operation, which enhances the identification effect.

V of the self-attention operation is the processing feature; the weights assigned to V are obtained by applying a softmax layer to the result computed from Q and K, and the weighted average of V then yields the shallow texture features captured by the attention mechanism.

Specifically, the second convolution pooling layer comprises, in order, a 3*3 convolution layer, a regularization layer, a ReLU layer, a 3*3 convolution layer, a regularization layer, a ReLU layer, and a max pooling layer.

These processed features use a self-attention mechanism to capture the correlation of shallow texture features at the pixel level.

Before self-attention is performed, the form of the data needs to be adjusted so that the data is in the form of (number of pixels, number of information channels). Then, a self-attention operation is performed on the data.
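The reshaping, attention, and residual addition described above can be sketched as follows. The text does not say how Q and K are derived, so this sketch assumes linear projections `w_q` and `w_k` of the processing feature and the common scaling by the square root of the projection width; both are assumptions, and the names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_residual(features, w_q, w_k):
    """features: (C, H, W). Returns features of the same shape and the attention weights."""
    c, h, w = features.shape
    tokens = features.reshape(c, h * w).T  # (number of pixels, number of channels)
    q = tokens @ w_q                       # assumed linear projection
    k = tokens @ w_k                       # assumed linear projection
    v = tokens                             # V is the processing feature itself
    weights = softmax(q @ k.T / np.sqrt(q.shape[1]))  # (pixels, pixels), rows sum to 1
    attended = weights @ v                 # weighted average of V for each pixel
    out = attended + tokens                # residual: add the processing feature back
    return out.T.reshape(c, h, w), weights

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4, 4))
w_q = rng.normal(scale=0.1, size=(16, 8))
w_k = rng.normal(scale=0.1, size=(16, 8))
classification_features, attention_weights = self_attention_with_residual(features, w_q, w_k)
```

Each row of `attention_weights` describes how strongly one pixel attends to every other pixel, which is how the pixel-level correlation of the texture features enters the result.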

In step S3, the classification feature is input into a classification network (the ball-type loss classification module in fig. 2), and only the result of true or false is output.

In some embodiments, in said step S3, classification is optimized using a high-dimensional spherical boundary objective function in said classification network.

Specifically, image forgery detection can be regarded as a binary classification problem in which only a "true" or "false" result is output. In order to improve the accuracy and effectiveness of classification, the invention introduces a high-dimensional spherical boundary objective function in the classification layer. The high-dimensional spherical boundary objective function is built mainly around an intra-group similarity stride and an inter-group similarity stride. The invention emphasizes intra-group similarity and designs a dedicated weight w for it.

The high-dimensional spherical boundary objective function is an objective function for adaptively changing the optimization stride in the training process. The stride update rule is as follows:

At the beginning, respective thresholds are set for the intra-group similarity and the inter-group similarity.

When a sample is input, the intra-group similarity and the inter-group similarity of the current sample are calculated first. The intra-group similarity stride and the inter-group similarity stride are then calculated. The intra-group similarity stride is the product of the weight w and the difference obtained by subtracting the calculated intra-group similarity from the intra-group similarity threshold. The inter-group similarity stride is the difference obtained by subtracting the inter-group similarity threshold from the calculated inter-group similarity. If either stride is less than 0, it is set to 0, so that both strides always remain non-negative.

When the metric is computed group by group, the intra-group similarity loss and the inter-group similarity loss are each multiplied by their respective stride and then accumulated. Thus, during gradient descent, different data are updated with different strides.
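The stride update rule can be illustrated with a short sketch. The function names, the threshold values, and the weight below are hypothetical; the form of the two loss terms is not specified in the text, so they are passed in as plain numbers, and "accumulated" is assumed to mean summed.

```python
def similarity_strides(intra_sim, inter_sim, intra_threshold, inter_threshold, w):
    """Return (intra-group stride, inter-group stride), both clipped at zero."""
    # Intra-group stride: weight w times (threshold minus calculated similarity).
    intra_stride = max(0.0, w * (intra_threshold - intra_sim))
    # Inter-group stride: calculated similarity minus threshold.
    inter_stride = max(0.0, inter_sim - inter_threshold)
    return intra_stride, inter_stride

def weighted_objective(intra_loss, inter_loss, intra_stride, inter_stride):
    """Scale each loss term by its stride and sum them (assumed accumulation)."""
    return intra_stride * intra_loss + inter_stride * inter_loss

# Illustrative values only: thresholds 0.8 and 0.2, weight w = 2.0.
hard = similarity_strides(intra_sim=0.5, inter_sim=0.6,
                          intra_threshold=0.8, inter_threshold=0.2, w=2.0)
easy = similarity_strides(intra_sim=0.95, inter_sim=0.1,
                          intra_threshold=0.8, inter_threshold=0.2, w=2.0)
hard_objective = weighted_objective(1.0, 1.0, *hard)
easy_objective = weighted_objective(1.0, 1.0, *easy)
```

In this example the first sample is far from both thresholds and receives strides of about 0.6 and 0.4, while the second sample already satisfies both thresholds, so both of its strides are clipped to 0 and it contributes nothing to the update.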

After the model is trained with the high-dimensional spherical boundary objective function, it can perform forgery detection on an input picture to identify whether the picture was generated by a large model.

In summary, the scheme provided by the invention adopts a network structure that combines a self-attention mechanism with a residual structure, which enhances the network's ability to refine and analyze shallow texture features invisible to the naked eye, and it further enriches the basis for identification by supplementing lost feature information through the residual connection.

In classification based on the high-dimensional spherical boundary objective function, training with an increased intra-group similarity stride helps the model pay more attention to the intra-group similarity threshold. The method addresses the problems that the prior art cannot use the shallow texture information of the input picture and that its loss function is too simple to change dynamically with the input data.

A second aspect of the invention discloses a large model generation image authentication system. FIG. 6 is a block diagram of a large model generation image authentication system according to an embodiment of the present invention; as shown in fig. 6, the system 100 includes:

A first processing module 101 configured to input the generated image into a first processing module based on residual filtering, resulting in an original feature;

A second processing module 102 configured to input the original features into a second processing module based on a self-attention mechanism and a residual structure, resulting in classification features;

The third processing module 103 is configured to input the classification feature into a classification network and output only a result of "true" or "false".

According to the system of the second aspect of the present invention, the first processing module 101 is specifically configured such that the method for inputting the generated image into the first processing module based on residual filtering to obtain the original feature includes:

According to the system of the second aspect of the present invention, the second processing module 102 is specifically configured to input the original feature into the second processing module based on a self-attention mechanism and a residual structure, and the method for obtaining the classification feature includes:

According to the system of the second aspect of the present invention, the third processing module 103 is specifically configured to optimize classification using a high-dimensional spherical boundary objective function in the classification network.

A third aspect of the invention discloses an electronic device. The electronic device comprises a memory and a processor, the memory stores a computer program, and the processor implements the steps in a large model generation image authentication method according to any one of the first aspect of the disclosure when executing the computer program.

FIG. 7 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the electronic device includes a processor, a memory, a communication interface, a display screen, and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless communication can be achieved through Wi-Fi, an operator network, Near Field Communication (NFC), or other technologies. The display screen of the electronic device can be a liquid crystal display screen or an electronic ink display screen. The input device of the electronic device can be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the housing of the electronic device, or an external keyboard, touch pad, or mouse.

It will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely a block diagram of a portion related to the technical solution of the present disclosure, and does not constitute a limitation of the electronic device to which the technical solution of the present disclosure is applied, and that a specific electronic device may include more or less components than those shown in the drawings, or may combine some components, or have different component arrangements.

A fourth aspect of the invention discloses a computer-readable storage medium. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a large model generation image authentication method of any one of the first aspects of the present disclosure.

Note that the technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be regarded as the scope of the description. The foregoing examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for large model generation image authentication, the method comprising:

Inputting the original features into a second convolution pooling layer to obtain processing features, inputting the processing features into self-attention operation, and carrying out numerical addition on the self-attention operation result and the processing features to obtain classification features;

S3, inputting the classification characteristics into a classification network, and outputting a result with only true or false;

optimizing classification by using a high-dimensional spherical boundary objective function in the two-class network;

The high-dimensional spherical boundary objective function is an objective function for adaptively changing the optimization stride in the training process;

the stride updating rule is as follows:

At the beginning, setting respective thresholds for the intra-group similarity and the inter-group similarity; when one sample is input, firstly calculating the intra-group similarity and the inter-group similarity of the current sample; then calculating the intra-group similarity stride and the inter-group similarity stride; the intra-group similarity stride is the product of the weight w and the difference obtained by subtracting the calculated intra-group similarity from the intra-group similarity threshold; the inter-group similarity stride is the difference obtained by subtracting the inter-group similarity threshold from the calculated inter-group similarity;

if the intra-group similarity stride or the inter-group similarity stride is smaller than 0, that stride is set to 0, so that the two strides are always kept non-negative; when the metric is computed group by group, the intra-group similarity loss and the inter-group similarity loss are each multiplied by their respective stride and then accumulated; during gradient descent, different data are thus updated with different strides.

2. The method for discriminating a large model generated image according to claim 1 wherein in step S1, the method for inputting the generated image to a first processing module based on residual filtering to obtain an original feature includes:

3. The large model generation image discrimination method according to claim 2, wherein in said step S1, there are seventeen of said residual filters; the convolution kernel has eight; the values of seventeen residual filters are fixed and not changed in learning; whereas the parameters of the eight convolution kernels are learned during training.

4. The large model generation image discrimination method according to claim 2, wherein in said step S1, said first convolution pooling layer is a convolution pooling layer with a residual mechanism of convolution layer and pooling layer.

5. A large model generation image discrimination method according to claim 1, wherein in said step S2, V of said attention operation is said processing feature; the weights assigned to the V are obtained by calculation of Q and K using a softmax layer, and then the weighted average results in shallow texture features captured from the attention mechanism.

6. A system for large model generation image authentication, the system comprising:

A third processing module configured to input the classification feature into a classification network and output only a "true" or "false" result;

the stride updating rule is as follows:

7. An electronic device comprising a memory storing a computer program and a processor implementing the steps of a large model generation image authentication method according to any one of claims 1 to 5 when the computer program is executed by the processor.

8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a large model generation image authentication method according to any of claims 1 to 5.