CN117854073A - Multimode media tampering detection and positioning system based on unified reconstruction - Google Patents

Info

Publication number
CN117854073A
Authority
CN
China
Prior art keywords
text
reconstruction
attention
visual
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410028426.1A
Other languages
Chinese (zh)
Inventor
焦铬
赵炜辰
陈纪友
吴芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hengyang Normal University
Original Assignee
Hengyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hengyang Normal University
Priority to CN202410028426.1A
Publication of CN117854073A

Abstract

The invention discloses a multimodal media tampering detection and localization system based on unified reconstruction, comprising: an image encoder configured to receive the image data to be detected together with a random mask and to output visual features and masked visual features; a text encoder configured to receive the text data to be detected together with a random mask and to output text features and masked text features; a visual-text reconstruction module that aligns and reconstructs visually guided text features through self-attention and cross-attention; a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention; a reconstruction coordinator module that uses self-attention and cross-attention to perform centralized reasoning over the multimodal features and to fuse the tampered feature representations; and a detection and classification module configured to predict and locate tampered instances in the multimodal media data based on the output of the reconstruction coordinator module.

Description

Multimode media tampering detection and positioning system based on unified reconstruction
Technical Field
The invention belongs to the technical field at the intersection of computer vision and natural language processing, and in particular relates to a multimodal media tampering detection and localization system based on unified reconstruction.
Background
Detecting forged media is an important problem as digital media continue to develop. On the internet in particular, false information circulates in large volumes in both visual and textual form, with a profound effect on society. Although many methods for detecting deepfakes and fake text news have been developed, most of them apply to a single modality only and, being based on binary classification, have limited ability to analyze and reason about subtle forgery traces across modalities. Multimodal media, which combine images and text, convey a broader range of information in daily life and have greater influence than any single modality, so forged multimodal media tend to be more harmful. Existing single-modality forgery detection methods struggle with this new challenge because they cannot detect forgeries in the image and text modalities simultaneously while also locating the forged image bounding boxes and text tokens. To address this problem, a multimodal media tampering detection and localization system based on unified reconstruction is needed.
Disclosure of Invention
To solve the above technical problems, the invention provides a multimodal media tampering detection and localization system based on unified reconstruction, which improves the overall accuracy of tampering detection.
To achieve the above object, the invention provides a multimodal media tampering detection and localization system based on unified reconstruction, comprising: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
the image encoder is configured to receive the image data to be detected together with a random mask, and to output visual features and masked visual features;
the text encoder is configured to receive the text data to be detected together with a random mask, and to output text features and masked text features;
the visual-text reconstruction module aligns and reconstructs visually guided text features through self-attention and cross-attention;
the text-visual reconstruction module aligns and reconstructs text-guided visual features through self-attention and cross-attention;
the reconstruction coordinator module uses self-attention and cross-attention to perform centralized reasoning over the multimodal features and to further fuse the tampered feature representations;
the detection and classification module is configured to predict and locate tampered instances in the multimodal media data based on the output of the reconstruction coordinator module.
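The data flow between the six modules listed above can be sketched as follows. This is a minimal illustration and not the patented implementation: the class name `TamperDetectionPipeline`, the callable signatures, and in particular the choice of which features are passed to which reconstruction module are assumptions, since the text does not specify them. The stub functions stand in for trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Features = List[List[float]]  # one row per image patch or text token
Encoder = Callable[[Features, List[bool]], Tuple[Features, Features]]
Reconstructor = Callable[[Features, Features], Features]


@dataclass
class TamperDetectionPipeline:
    """Hypothetical wiring of the six modules; all names are illustrative."""
    image_encoder: Encoder
    text_encoder: Encoder
    visual_text_reconstruction: Reconstructor
    text_visual_reconstruction: Reconstructor
    coordinator: Reconstructor
    heads: Callable[[Features], Dict[str, object]]

    def run(self, image: Features, image_mask: List[bool],
            text: Features, text_mask: List[bool]) -> Dict[str, object]:
        visual, masked_visual = self.image_encoder(image, image_mask)
        textual, masked_textual = self.text_encoder(text, text_mask)
        # Assumption: each reconstruction module takes (guiding features,
        # masked features of the other modality) and returns reconstructed
        # features of that other modality.
        rec_text = self.visual_text_reconstruction(visual, masked_textual)
        rec_visual = self.text_visual_reconstruction(textual, masked_visual)
        fused = self.coordinator(rec_visual, rec_text)
        return self.heads(fused)


# Stub components so that the sketch runs end to end.
def stub_encoder(x: Features, mask: List[bool]) -> Tuple[Features, Features]:
    masked = [[0.0] * len(row) if m else list(row) for row, m in zip(x, mask)]
    return [list(row) for row in x], masked


def stub_reconstruct(guide: Features, target: Features) -> Features:
    return [list(row) for row in target]


def stub_coordinator(a: Features, b: Features) -> Features:
    return a + b  # simple concatenation of the two feature sequences


def stub_heads(fused: Features) -> Dict[str, object]:
    return {"num_fused_rows": len(fused)}


pipeline = TamperDetectionPipeline(stub_encoder, stub_encoder,
                                   stub_reconstruct, stub_reconstruct,
                                   stub_coordinator, stub_heads)
result = pipeline.run([[1.0], [2.0]], [False, True], [[3.0]], [False])
```

Keeping each module behind a plain callable makes the order of operations explicit while leaving the internals of every module open.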
Optionally, the detection and classification module includes at least a binary classifier, a bounding box detector, a multi-label classifier, and a token detector;
the binary classifier predicts whether the multimodal media data is real or tampered;
the bounding box detector locates the tampered region in the image;
the multi-label classifier identifies multiple tampering labels for the image;
the token detector detects and locates the tampered tokens in the text.
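How the outputs of the four heads might be decoded into a single prediction can be sketched as follows. This is only an illustration under stated assumptions: the `TamperPrediction` container, the `decode_heads` function, the 0.5 threshold, the normalized box format, and the label names are all hypothetical and do not appear in the source.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class TamperPrediction:
    is_tampered: bool                  # binary classifier
    box: List[float]                   # bounding box detector, values in [0, 1]
    tamper_labels: List[str]           # multi-label classifier
    tampered_token_idx: List[int]      # per-token detector for the text


def decode_heads(binary_logit: float,
                 box_raw: Sequence[float],
                 label_logits: Sequence[float],
                 label_names: Sequence[str],
                 token_logits: Sequence[float],
                 threshold: float = 0.5) -> TamperPrediction:
    """Turn raw head outputs into a prediction (threshold is an assumption)."""
    return TamperPrediction(
        is_tampered=sigmoid(binary_logit) > threshold,
        # Assumption: box coordinates are squashed to a normalized range.
        box=[sigmoid(v) for v in box_raw],
        # Multi-label: each label is thresholded independently.
        tamper_labels=[name for name, logit in zip(label_names, label_logits)
                       if sigmoid(logit) > threshold],
        tampered_token_idx=[i for i, logit in enumerate(token_logits)
                            if sigmoid(logit) > threshold],
    )


prediction = decode_heads(
    binary_logit=2.0,
    box_raw=[0.0, 0.0, 0.0, 0.0],
    label_logits=[1.0, -1.0],
    label_names=["type_a", "type_b"],   # hypothetical label names
    token_logits=[-3.0, 0.5, 4.0],
)
```

The multi-label head is thresholded per label rather than passed through a softmax, so several tampering labels can be active at once.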
Optionally, the image encoder is further configured to extract features with 12 stacked blocks, each combining self-attention and a feed-forward neural network, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features with 6 stacked blocks, each combining self-attention and a feed-forward neural network, and to generate class and pattern vectors for subsequent processing.
Optionally, the visual-text reconstruction module and the text-visual reconstruction module each further comprise a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 6 times, which promotes fine-grained alignment between features of different modalities by reconstructing the features.
Optionally, the reconstruction coordinator module further comprises a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.
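The stacking depths stated above (12, 6, 6, and 4) can be collected in a small configuration object. The sketch below is illustrative only: the `DepthConfig` class and its field names are hypothetical, and only the numeric depths and the sub-layer composition of each block come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DepthConfig:
    """Stacking depths from the text; class and field names are illustrative."""
    image_encoder_blocks: int = 12
    text_encoder_blocks: int = 6
    reconstruction_blocks: int = 6   # used by each of the two reconstruction modules
    coordinator_blocks: int = 4

    def block_layout(self) -> Dict[str, Tuple[str, ...]]:
        # Sub-layers inside one block of each module, as described in the text.
        encoder = ("self_attention", "feed_forward")
        fusion = ("self_attention", "cross_attention", "feed_forward")
        return {
            "image_encoder": encoder,
            "text_encoder": encoder,
            "visual_text_reconstruction": fusion,
            "text_visual_reconstruction": fusion,
            "reconstruction_coordinator": fusion,
        }

    def total_blocks(self) -> int:
        return (self.image_encoder_blocks
                + self.text_encoder_blocks
                + 2 * self.reconstruction_blocks
                + self.coordinator_blocks)


config = DepthConfig()
```

With these depths the whole system contains 34 stacked blocks, counting the two reconstruction modules separately.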
The technical effects of the invention are as follows: the invention discloses a multimodal media tampering detection and localization system based on unified reconstruction, which integrates image and text analysis in a unified framework so that tampering in multimodal data is processed and identified simultaneously, and which can accurately identify and locate tampered regions in the image and misleading expressions in the text, thereby improving the overall accuracy of tampering detection. The invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a schematic diagram of a system for detecting and locating multi-modal media tampering based on unified reconstruction according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for locating an image tampering area according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating text tamper detection and attention heat map according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in Fig. 1, this embodiment provides a multimodal media tampering detection and localization system based on unified reconstruction, comprising: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
an image encoder configured to receive the image data to be detected together with a random mask, and to output visual features and masked visual features;
a text encoder configured to receive the text data to be detected together with a random mask, and to output text features and masked text features;
a visual-text reconstruction module that aligns and reconstructs visually guided text features through self-attention and cross-attention;
a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention;
a reconstruction coordinator module that uses self-attention and cross-attention to perform centralized reasoning over the multimodal features and to further fuse the tampered feature representations;
a detection and classification module configured to predict and locate tampered instances in the multimodal media data based on the output of the reconstruction coordinator module.
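The random mask supplied to each encoder can be illustrated with a short sketch for the text side. The masking ratio, the `[MASK]` placeholder, and the function name `random_mask` are assumptions made for illustration; the source states only that a random mask accompanies the input.

```python
import random
from typing import List, Sequence, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the source does not name one


def random_mask(tokens: Sequence[str], ratio: float,
                seed: int) -> Tuple[List[str], List[bool]]:
    """Return a masked copy of `tokens` and the boolean mask that was applied."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the mask is reproducible
    n_masked = round(len(tokens) * ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    mask = [i in masked_positions for i in range(len(tokens))]
    masked = [MASK_TOKEN if m else t for t, m in zip(tokens, mask)]
    return masked, mask


tokens = ["a", "man", "waves", "to", "the", "crowd"]
masked_tokens, mask = random_mask(tokens, ratio=0.5, seed=0)
```

The same idea would apply to image patches, with the mask selecting patch positions instead of token positions.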
The image encoder is further configured to extract features with 12 stacked blocks, each combining a self-attention mechanism and a feed-forward neural network (FFN), and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features with 6 stacked blocks, each combining a self-attention mechanism and a feed-forward neural network (FFN), and to generate class and pattern vectors for subsequent processing.
The visual-text reconstruction module (VGS) and the text-visual reconstruction module (SGV) each further include a combination of multi-head self-attention, a cross-attention mechanism, and a feed-forward neural network (FFN), stacked 6 times, which promotes fine-grained alignment between features of different modalities by reconstructing the features.
The reconstruction coordinator module further includes a combination of multi-head self-attention, a cross-attention mechanism, and a feed-forward neural network (FFN), stacked 4 times, to improve the quality and accuracy of the reconstructed features.
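The cross-attention step used by the reconstruction modules and the coordinator can be illustrated with a single-head, scaled dot-product sketch in plain Python. It is a simplification under stated assumptions: the learned query, key, and value projections and the multi-head split of a real block are omitted, and the function names are illustrative. Here the queries come from one modality (for example text) and the keys and values from the other (for example image patches).

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def softmax(row: Sequence[float]) -> List[float]:
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix,
                    values: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product attention without learned projections.

    Returns (outputs, weights): one output row and one weight row per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs: Matrix = []
    weights: Matrix = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output row is the attention-weighted average of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


text_queries = [[1.0, 0.0], [0.0, 1.0]]              # two text tokens
visual_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three image patches
attended, attn_weights = cross_attention(text_queries, visual_keys, visual_keys)
```

Each row of `attn_weights` sums to 1, so every text position receives a convex combination of the visual features, which is the sense in which one modality guides the reconstruction of the other.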
The detection and classification module comprises at least a binary classifier, a bounding box detector, a multi-label classifier, and a token detector;
a binary classifier for predicting whether the multimodal media data is authentic or tampered;
a bounding box detector for locating the tampered region in the image;
a multi-label classifier for identifying multiple tampering labels for the image;
and a token detector for detecting and locating the tampered tokens in the text.
Fig. 2 shows the effect of the present invention in predicting the tampered region of an image, where red marks the true tampered region and blue marks the region predicted by the present method; the two are almost identical.
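One common way to quantify how closely a predicted region matches the true region is intersection over union (IoU). The source does not state which metric was used for Fig. 2, so the sketch below is an assumption offered only as an illustration; the corner-format boxes `(x1, y1, x2, y2)` and the function name `iou` are likewise illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


true_box = (0.0, 0.0, 2.0, 2.0)
predicted_box = (1.0, 1.0, 3.0, 3.0)
overlap = iou(true_box, predicted_box)
```

An IoU of 1 means the two boxes coincide exactly, and 0 means they do not overlap at all.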
Fig. 3 illustrates the effect of the present invention in predicting text tampering, where the red text is the tampered text, which is reflected on the image by an attention heat map.
As shown in Table 1, in experiments with different pre-trained models (CLIP, ViLT, ALBEF), the present invention exceeds the prior HAMMER method on all performance metrics.
TABLE 1
The invention discloses a multimodal media tampering detection and localization system based on unified reconstruction, which integrates image and text analysis in a unified framework so that tampering in multimodal data is processed and identified simultaneously, and which can accurately identify and locate tampered regions in the image and misleading expressions in the text, thereby improving the overall accuracy of tampering detection. The invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (5)

1. A multimodal media tampering detection and localization system based on unified reconstruction, comprising:
an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
the image encoder is configured to receive the image data to be detected together with a random mask, and to output visual features and masked visual features;
the text encoder is configured to receive the text data to be detected together with a random mask, and to output text features and masked text features;
the visual-text reconstruction module aligns and reconstructs visually guided text features through self-attention and cross-attention;
the text-visual reconstruction module aligns and reconstructs text-guided visual features through self-attention and cross-attention;
the reconstruction coordinator module uses self-attention and cross-attention to perform centralized reasoning over the multimodal features and to further fuse the tampered feature representations;
the detection and classification module is configured to predict and locate tampered instances in the multimodal media data based on the output of the reconstruction coordinator module.
2. The multimodal media tampering detection and localization system based on unified reconstruction of claim 1, wherein the detection and classification module comprises at least a binary classifier, a bounding box detector, a multi-label classifier, and a token detector;
the binary classifier predicts whether the multimodal media data is real or tampered;
the bounding box detector locates the tampered region in the image;
the multi-label classifier identifies multiple tampering labels for the image;
the token detector detects and locates the tampered tokens in the text.
3. The multimodal media tampering detection and localization system based on unified reconstruction of claim 1, wherein the image encoder is further configured to extract features with 12 stacked blocks, each combining self-attention and a feed-forward neural network, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features with 6 stacked blocks, each combining self-attention and a feed-forward neural network, and to generate class and pattern vectors for subsequent processing.
4. The multimodal media tampering detection and localization system based on unified reconstruction of claim 1, wherein the visual-text reconstruction module and the text-visual reconstruction module each further comprise a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 6 times, to promote fine-grained alignment between features of different modalities by reconstructing the features.
5. The multimodal media tampering detection and localization system based on unified reconstruction of claim 1, wherein the reconstruction coordinator module further comprises a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.
CN202410028426.1A 2024-01-09 2024-01-09 Multimode media tampering detection and positioning system based on unified reconstruction Pending CN117854073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410028426.1A CN117854073A (en) 2024-01-09 2024-01-09 Multimode media tampering detection and positioning system based on unified reconstruction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410028426.1A CN117854073A (en) 2024-01-09 2024-01-09 Multimode media tampering detection and positioning system based on unified reconstruction

Publications (1)

Publication Number Publication Date
CN117854073A (en) 2024-04-09

Family

ID=90545758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410028426.1A Pending CN117854073A (en) 2024-01-09 2024-01-09 Multimode media tampering detection and positioning system based on unified reconstruction

Country Status (1)

Country Link
CN (1) CN117854073A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination