CN117854073A

CN117854073A - Multimode media tampering detection and positioning system based on unified reconstruction

Info

Publication number: CN117854073A
Application number: CN202410028426.1A
Authority: CN
Inventors: 焦铬; 赵炜辰; 陈纪友; 吴芸
Original assignee: Hengyang Normal University
Current assignee: Hengyang Normal University
Priority date: 2024-01-09
Filing date: 2024-01-09
Publication date: 2024-04-09

Abstract

The invention discloses a multi-modal media tamper detection and localization system based on unified reconstruction, which comprises: an image encoder configured to receive an input comprising the image data to be detected and a random mask, and to output visual features and masked visual features; a text encoder configured to receive an input comprising the text data to be detected and a random mask, and to output text features and masked text features; a visual-text reconstruction module that aligns and reconstructs vision-guided text features through self-attention and cross-attention; a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention; a reconstruction coordinator module that performs centralized reasoning over the multi-modal features using self-attention and cross-attention and fuses the tampered feature representations; and a detection and classification module configured to predict and locate tampered instances in the multi-modal media data based on an output of the reconstruction coordinator module.

Description

Multimode media tampering detection and positioning system based on unified reconstruction

Technical Field

The invention belongs to the technical field at the intersection of computer vision and natural language processing, and particularly relates to a multi-modal media tamper detection and localization system based on unified reconstruction.

Background

Detecting forged media is an important problem as digital media continue to proliferate. On the Internet in particular, false information circulates in large quantities in both visual and textual form, which has a profound effect on society. Although many methods for detecting deep forgeries and textual fake news have been developed, most of them apply only to a single modality and rely on binary classification, which limits their ability to analyze and reason about subtle forgery traces across different modalities. Multi-modal media, which consist of images and text, convey richer information in daily life and have a greater impact than single-modality media. Forged multi-modal media therefore tend to be more harmful. Existing single-modality forgery detection methods struggle with this new challenge because they cannot detect forgery in the image and text modalities at the same time while also locating the tampered image bounding boxes and text tokens. In view of this problem, it is desirable to provide a multi-modal media tamper detection and localization system based on unified reconstruction.

Disclosure of Invention

To solve the above technical problem, the invention provides a multi-modal media tamper detection and localization system based on unified reconstruction, which improves the overall accuracy of tamper detection.

To achieve the above object, the present invention provides a multi-modal media tamper detection and localization system based on unified reconstruction, including: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;

the image encoder is configured to receive an input comprising the image data to be detected and a random mask, and to output visual features and masked visual features;

the text encoder is configured to receive an input comprising the text data to be detected and a random mask, and to output text features and masked text features;

the visual-text reconstruction module is used for aligning and reconstructing vision-guided text features through self-attention and cross-attention;

the text-visual reconstruction module is used for aligning and reconstructing text-guided visual features through self-attention and cross-attention;

the reconstruction coordinator module performs centralized reasoning over the multi-modal features using self-attention and cross-attention and is used for further fusing the tampered feature representations;

the detection and classification module is configured to predict and locate a tampered instance in the multi-modal media data based on an output of the reconstruction coordinator module.
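By way of non-limiting illustration only, the random masking applied to the encoder inputs can be sketched as follows. The source states only that each encoder receives the data to be detected together with a random mask; the function name `random_mask`, the placeholder `MASK_TOKEN`, the masking ratio, and the seeding are all assumptions introduced for this sketch and are not taken from the invention.

```python
import random
from typing import List, Sequence, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the source does not name one


def random_mask(tokens: Sequence[str], ratio: float, seed: int) -> Tuple[List[str], List[int]]:
    """Return a masked copy of `tokens` and the sorted list of masked positions.

    `ratio` is the assumed fraction of positions to hide (rounded down).
    The input sequence itself is left unchanged.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = random.Random(seed)              # local generator, reproducible per seed
    n_mask = int(len(tokens) * ratio)      # number of positions to hide
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK_TOKEN
    return masked, positions


# The same helper can stand in for text tokens or for image patch identifiers.
text = ["a", "man", "holds", "a", "red", "flag"]
masked_text, masked_positions = random_mask(text, ratio=0.5, seed=0)
```

In this sketch the unmasked sequence and the masked copy would both be passed to the corresponding encoder, giving the ordinary features and the masked features respectively.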

Optionally, the detection and classification module includes at least one binary classifier, a bounding box detector, a multi-label classifier, and a token detector;

the binary classifier is used for predicting whether the multi-modal media data is real or tampered;

the bounding box detector is used for locating a tampered area in the image;

the multi-label classifier is used for identifying a plurality of tampered labels in the image;

the token detector is used for detecting and locating the tampered tokens in the text.

Optionally, the image encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 6 times, and to generate class and pattern vectors for subsequent processing.

Optionally, the visual-text reconstruction module and the text-visual reconstruction module further comprise a combination of multi-head self-attention, cross-attention, and feed-forward neural networks, stacked 6 times, which facilitates fine-grained alignment between features of different modalities by reconstructing the features.

Optionally, the reconstruction coordinator module further includes a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.
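By way of non-limiting illustration only, the stack depths recited above can be collected in a small configuration object. The depths (12, 6, 6, 6, 4) are taken from the preceding paragraphs; the class name `StackDepths`, its field names, and its helper methods are hypothetical and serve only to make the arithmetic explicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackDepths:
    """Number of stacked blocks per module, as recited in the description."""
    image_encoder: int = 12               # self-attention + FFN
    text_encoder: int = 6                 # self-attention + FFN
    visual_text_reconstruction: int = 6   # self-attention + cross-attention + FFN
    text_visual_reconstruction: int = 6   # self-attention + cross-attention + FFN
    reconstruction_coordinator: int = 4   # self-attention + cross-attention + FFN

    def cross_attention_blocks(self) -> int:
        """Blocks that contain a cross-attention sub-layer."""
        return (self.visual_text_reconstruction
                + self.text_visual_reconstruction
                + self.reconstruction_coordinator)

    def total_blocks(self) -> int:
        """All stacked blocks across the five modules."""
        return self.image_encoder + self.text_encoder + self.cross_attention_blocks()


depths = StackDepths()
print(depths.total_blocks(), depths.cross_attention_blocks())  # prints "34 16"
```

Keeping the depths in one immutable object is merely a convenience for experimentation; the invention does not prescribe any particular configuration mechanism.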

The technical effects of the invention are as follows: the invention discloses a multi-modal media tamper detection and localization system based on unified reconstruction, which integrates image and text analysis in a unified framework so as to process and identify tampering in multi-modal data simultaneously, and can accurately identify and locate tampered areas in the image and misleading expressions in the text, thereby improving the overall accuracy of tamper detection; the invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:

FIG. 1 is a schematic diagram of a system for detecting and locating multi-modal media tampering based on unified reconstruction according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a system for locating an image tampering area according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating text tamper detection and attention heat map according to an embodiment of the present invention.

Detailed Description

It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.

As shown in FIG. 1, this embodiment provides a multi-modal media tamper detection and localization system based on unified reconstruction, which comprises: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;

an image encoder configured to receive an input comprising the image data to be detected and a random mask, and to output visual features and masked visual features;

a text encoder configured to receive an input comprising the text data to be detected and a random mask, and to output text features and masked text features;

a visual-text reconstruction module that aligns and reconstructs vision-guided text features through self-attention and cross-attention;

a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention;

a reconstruction coordinator module that performs centralized reasoning over the multi-modal features using self-attention and cross-attention and further fuses the tampered feature representations;

a detection and classification module configured to predict and locate tampered instances in the multi-modal media data based on an output of the reconstruction coordinator module.

The image encoder is further configured to extract features through a combination of a self-attention mechanism and a feed-forward neural network (FFN) stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of a self-attention mechanism and a feed-forward neural network (FFN) stacked 6 times, and to generate class and pattern vectors for subsequent processing.

The visual-text reconstruction module (VGS) and the text-visual reconstruction module (SGV) further include a combination of multi-head self-attention, cross-attention mechanisms, and feed-forward neural networks (FFNs), stacked 6 times, to facilitate fine-grained alignment between features of different modalities by reconstructing the features.

The reconstruction coordinator module further includes a combination of multi-head self-attention, a cross-attention mechanism, and a feed-forward neural network (FFN), stacked 4 times, to improve the quality and accuracy of the reconstructed features.
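By way of non-limiting illustration only, the cross-attention operation used by the reconstruction modules and the reconstruction coordinator module can be sketched as single-head scaled dot-product attention. This sketch omits the learned projections, the multiple heads, and the feed-forward network; the function names are hypothetical, and the choice of text features as queries with visual features as keys and values (for vision-guided text features) is an assumption rather than a detail stated in the source.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention without learned projections.

    Each query attends over all keys; the output for a query is the
    attention-weighted average of the value vectors.
    """
    d = len(keys[0])
    dim_out = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim_out)])
    return outputs


# Assumed direction for vision-guided text features: text queries, visual keys/values.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
guided_text = cross_attention(text_feats, visual_feats, visual_feats)
```

Self-attention is the special case in which the queries, keys, and values all come from the same feature sequence; swapping the roles of the two modalities gives the text-guided visual direction.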

The detection and classification module comprises at least one binary classifier, a bounding box detector, a multi-label classifier, and a token detector;

a binary classifier for predicting whether the multi-modal media data is authentic or tampered with;

a bounding box detector for locating a tampered region in the image;

a multi-label classifier for identifying a plurality of tampered labels in the image;

and a token detector for detecting and locating the tampered tokens in the text.
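By way of non-limiting illustration only, the outputs of the four heads listed above can be gathered into one prediction record. The source does not specify the output encoding of any head; the names `TamperPrediction` and `decode_heads`, the use of logits with a sigmoid and a 0.5 threshold, the box format, and the label names in the usage lines are all assumptions made for this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


@dataclass
class TamperPrediction:
    is_tampered: bool                       # binary classifier
    box: Tuple[float, float, float, float]  # bounding box detector (format assumed)
    labels: List[str]                       # multi-label classifier
    tampered_token_indices: List[int]       # token detector


def decode_heads(binary_logit: float,
                 box: Tuple[float, float, float, float],
                 label_logits: Sequence[float],
                 label_names: Sequence[str],
                 token_logits: Sequence[float],
                 threshold: float = 0.5) -> TamperPrediction:
    """Turn raw head outputs into a record by thresholding sigmoid scores."""
    labels = [name for name, z in zip(label_names, label_logits) if sigmoid(z) > threshold]
    tokens = [i for i, z in enumerate(token_logits) if sigmoid(z) > threshold]
    return TamperPrediction(sigmoid(binary_logit) > threshold, box, labels, tokens)


# Hypothetical head outputs for one image-text pair.
pred = decode_heads(2.0, (0.5, 0.5, 0.2, 0.3),
                    [1.5, -1.0], ["label_a", "label_b"],
                    [-3.0, 0.2, 4.0, -0.1])
```

Here `pred.tampered_token_indices` lists the positions of the text tokens flagged as tampered, and `pred.box` is passed through unchanged from the bounding box detector.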

Fig. 2 shows the effect of the present invention in predicting the tampered area of an image, wherein red marks the true tampered area and blue marks the tampered area predicted by the present invention; the two are almost identical.

Fig. 3 illustrates the effect of the present invention in predicting text tampering, wherein the tampered text is shown in red and is reflected on the image by an attention heat map.

As shown in Table 1, in experiments using different pre-trained models (CLIP, ViLT, ALBEF), the present invention exceeds the prior HAMMER method on all performance metrics.
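By way of non-limiting illustration only, the agreement between a true tampered area and a predicted one can be quantified with intersection over union (IoU). The source does not state which overlap measure, if any, was used for Fig. 2; IoU is introduced here as a commonly used choice, and the function name `box_iou` and the corner-coordinate box format `(x1, y1, x2, y2)` are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if both are empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


true_box = (0.0, 0.0, 2.0, 2.0)
pred_box = (1.0, 1.0, 3.0, 3.0)
print(box_iou(true_box, true_box))  # prints 1.0 (identical boxes)
print(box_iou(true_box, pred_box))  # overlap of 1 over a union of 7
```

A value close to 1 corresponds to the "almost identical" boxes described for Fig. 2, while disjoint boxes give 0.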

TABLE 1

The invention discloses a multi-modal media tamper detection and localization system based on unified reconstruction, which integrates image and text analysis in a unified framework so as to process and identify tampering in multi-modal data simultaneously, and can accurately identify and locate tampered areas in the image and misleading expressions in the text, thereby improving the overall accuracy of tamper detection; the invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.

The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A multi-modal media tamper detection and localization system based on unified reconstruction, comprising:

an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;

2. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the detection and classification module comprises at least one binary classifier, one bounding box detector, one multi-label classifier, and one token detector;

the bounding box detector is used for locating a tampered area in the image;

3. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the image encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 6 times, and to generate class and pattern vectors for subsequent processing.

4. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the visual-text reconstruction module and the text-visual reconstruction module further comprise a combination of multi-head self-attention, cross-attention, and feed-forward neural networks, stacked 6 times, to facilitate fine-grained alignment between features of different modalities by reconstructing the features.

5. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the reconstruction coordinator module further comprises a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.