CN117854073A - Multimode media tampering detection and positioning system based on unified reconstruction - Google Patents
Multimode media tampering detection and positioning system based on unified reconstruction
- Publication number
- CN117854073A
- Authority
- CN
- China
- Prior art keywords
- text
- reconstruction
- attention
- visual
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a multi-modal media tampering detection and localization system based on unified reconstruction, comprising: an image encoder configured to receive an input consisting of the image data to be detected and a random mask, and to output visual features and masked visual features; a text encoder configured to receive an input consisting of the text data to be detected and a random mask, and to output text features and masked text features; a visual-text reconstruction module that aligns and reconstructs visually guided text features through self-attention and cross-attention; a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention; a reconstruction coordinator module that performs centralized reasoning over the multi-modal features using self-attention and cross-attention and fuses the tampered feature representations; and a detection and classification module configured to predict and locate tampered instances in the multi-modal media data based on the output of the reconstruction coordinator module.
Description
Technical Field
The invention belongs to the technical field at the intersection of computer vision and natural language processing, and particularly relates to a multi-modal media tampering detection and localization system based on unified reconstruction.
Background
Detecting forged media is an important problem as digital media continues to proliferate. On the internet in particular, false information circulates widely in both visual and textual form and has a profound effect on society. Although many methods for detecting deepfakes and fake text news have been developed, most of them apply to a single modality only and rely on binary classification, which limits their ability to analyze and infer subtle forgery traces across modalities. Multi-modal media composed of images and text conveys richer information in daily life and has greater influence than any single modality, so forgery of multi-modal media tends to be more harmful. Existing single-modality forgery detection methods struggle with this new challenge because they cannot simultaneously detect forgery in both the image and text modalities and locate the forged image bounding boxes and text tokens. In view of this problem, it is desirable to provide a multi-modal media tampering detection and localization system based on unified reconstruction.
Disclosure of Invention
In order to solve the technical problems, the invention provides a multimode media tampering detection and positioning system based on unified reconstruction, which improves the overall accuracy of tampering detection.
In order to achieve the above object, the present invention provides a multimode media tamper detection and positioning system based on unified reconstruction, including: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
the image encoder is configured to receive an input consisting of the image data to be detected and a random mask, and to output visual features and masked visual features;
the text encoder is configured to receive an input consisting of the text data to be detected and a random mask, and to output text features and masked text features;
the visual-text reconstruction module aligns and reconstructs visually guided text features through self-attention and cross-attention;
the text-visual reconstruction module aligns and reconstructs text-guided visual features through self-attention and cross-attention;
the reconstruction coordinator module performs centralized reasoning over the multi-modal features using self-attention and cross-attention, and further fuses the tampered feature representations;
the detection and classification module is configured to predict and locate a tampered instance in the multi-modal media data based on an output of the reconstruction coordinator module.
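The pairing of each input with a random mask can be illustrated with a minimal sketch. The source does not specify the masking ratio, the mask placeholder, or whether masking is applied to raw tokens or to embeddings, so `MASK`, `mask_ratio`, and `apply_random_mask` below are hypothetical names chosen only for illustration; the same idea would apply to image patches.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder; the source does not name a mask token


def apply_random_mask(tokens, mask_ratio=0.15, seed=None):
    """Hide a random subset of positions.

    Returns (masked_tokens, mask), where mask[i] is True if position i was hidden.
    The ratio is an assumption; the source only says a random mask is used.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * mask_ratio)) if tokens else 0
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    mask = [i in masked_positions for i in range(len(tokens))]
    masked_tokens = [MASK if hidden else tok for tok, hidden in zip(tokens, mask)]
    return masked_tokens, mask


tokens = ["the", "president", "signed", "the", "bill", "today"]
masked_tokens, mask = apply_random_mask(tokens, mask_ratio=0.3, seed=0)
```

An encoder would then be run on both `tokens` and `masked_tokens`, giving the plain features and the masked features that the reconstruction modules consume.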
Optionally, the detection and classification module includes at least one binary classifier, one bounding box detector, one multi-label classifier, and one token detector;
the binary classifier is used for predicting whether the multi-modal media data is real or tampered;
the bounding box detector is used for locating the tampered region in the image;
the multi-label classifier is used for identifying multiple tampering labels in the image;
the token detector is used for detecting and locating the tampered tokens in the text.
Optionally, the image encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 6 times, and to generate class and pattern vectors for subsequent processing.
Optionally, the visual-text reconstruction module and the text-visual reconstruction module each further comprise a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 6 times, which facilitates fine-grained alignment between the features of different modalities by reconstructing the features.
Optionally, the reconstruction coordinator module further includes a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.
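The stacking depths stated above can be collected in one place. The following is only a configuration sketch: the class name `StackDepths`, the field names, and the `SUBLAYERS` table are hypothetical, while the numbers (12, 6, 6, 6, 4) and the sub-layer types come from the text.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class StackDepths:
    """Number of stacked blocks per module, as stated in the description."""
    image_encoder: int = 12
    text_encoder: int = 6
    visual_text_reconstruction: int = 6
    text_visual_reconstruction: int = 6
    reconstruction_coordinator: int = 4

    def total_blocks(self) -> int:
        return sum(asdict(self).values())


# Sub-layers inside one block of each module, per the description.
SUBLAYERS = {
    "image_encoder": ("self_attention", "ffn"),
    "text_encoder": ("self_attention", "ffn"),
    "visual_text_reconstruction": ("self_attention", "cross_attention", "ffn"),
    "text_visual_reconstruction": ("self_attention", "cross_attention", "ffn"),
    "reconstruction_coordinator": ("self_attention", "cross_attention", "ffn"),
}

depths = StackDepths()
for name, n in asdict(depths).items():
    print(f"{name}: {n} x ({' + '.join(SUBLAYERS[name])})")
```

Only the encoders omit cross-attention; every module that combines the two modalities includes it.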
The technical effects of the invention are as follows: the invention discloses a multi-modal media tampering detection and localization system based on unified reconstruction that integrates image and text analysis in a unified framework, so that tampering in multi-modal data is processed and identified simultaneously; the system can accurately identify and locate tampered regions in the image and misleading expressions in the text, thereby improving the overall accuracy of tampering detection. The invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a schematic diagram of a system for detecting and locating multi-modal media tampering based on unified reconstruction according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system for locating an image tampering area according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating text tamper detection and attention heat map according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
As shown in fig. 1, this embodiment provides a multi-modal media tampering detection and localization system based on unified reconstruction, which includes: an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
an image encoder configured to receive an input consisting of the image data to be detected and a random mask, and to output visual features and masked visual features;
a text encoder configured to receive an input consisting of the text data to be detected and a random mask, and to output text features and masked text features;
a visual-text reconstruction module that aligns and reconstructs visually guided text features through self-attention and cross-attention;
a text-visual reconstruction module that aligns and reconstructs text-guided visual features through self-attention and cross-attention;
a reconstruction coordinator module that performs centralized reasoning over the multi-modal features using self-attention and cross-attention, and further fuses the tampered feature representations;
a detection and classification module configured to predict and locate tampered instances in the multi-modal media data based on the output of the reconstruction coordinator module.
The image encoder is further configured to extract features through a combination of a self-attention mechanism and a feed-forward neural network (FFN) stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of a self-attention mechanism and a feed-forward neural network (FFN) stacked 6 times, and to generate class and pattern vectors for subsequent processing.
The visual-text reconstruction module (VGS) and the text-visual reconstruction module (SGV) each further include a combination of multi-head self-attention, a cross-attention mechanism, and a feed-forward neural network (FFN), stacked 6 times, which facilitates fine-grained alignment between the features of different modalities by reconstructing the features.
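The structure of one reconstruction block (self-attention, then cross-attention to the guiding modality, then an FFN, repeated 6 times) can be sketched in plain Python. This is a simplified illustration under strong assumptions, not the patented implementation: it uses a single attention head, no learned projection matrices, no layer normalization, and a bare ReLU as a stand-in for the FFN. All function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise residual addition of two equally shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def ffn(x):
    # Stand-in for a learned feed-forward network: ReLU only (assumption).
    return [[max(0.0, a) for a in row] for row in x]


def reconstruction_block(target, guide):
    h = add(target, attend(target, target, target))  # self-attention + residual
    h = add(h, attend(h, guide, guide))              # cross-attention to the guide
    return add(h, ffn(h))                            # FFN + residual


def reconstruct(target, guide, depth=6):
    for _ in range(depth):
        target = reconstruction_block(target, guide)
    return target


# Visually guided text reconstruction: text features query the visual features.
text_feats = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
visual_feats = [[0.2, 0.1], [-0.3, 0.4]]
reconstructed = reconstruct(text_feats, visual_feats, depth=6)
```

Swapping the roles of `text_feats` and `visual_feats` gives the text-guided visual direction; the output always keeps the shape of the target modality.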
The reconstruction coordinator module further includes a combination of multi-head self-attention, a cross-attention mechanism, and a feed-forward neural network (FFN), stacked 4 times, to improve the quality and accuracy of the reconstructed features.
The detection and classification module comprises at least one binary classifier, one bounding box detector, one multi-label classifier, and one token detector;
a binary classifier for predicting whether the multi-modal media data is real or tampered;
a bounding box detector for locating the tampered region in the image;
a multi-label classifier for identifying multiple tampering labels in the image;
and a token detector for detecting and locating the tampered tokens in the text.
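The four heads listed above can be pictured as small output layers reading the coordinator's features. The sketch below is illustrative only: the source does not describe the head architectures, so the single linear layer per head, the sigmoid outputs, the normalized four-number box format, the number of labels, and all parameter values are assumptions, and `detect`, `Detection`, and `params` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """One output per weight row: dot(row, x) + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


@dataclass
class Detection:
    tampered_prob: float      # binary classifier: real vs. tampered
    box: List[float]          # bounding box detector: 4 normalized values (assumed format)
    label_probs: List[float]  # multi-label classifier: one probability per label
    token_probs: List[float]  # text token (mark) detector: one probability per token


def detect(fused, token_feats, p) -> Detection:
    return Detection(
        tampered_prob=sigmoid(linear(p["binary_w"], p["binary_b"], fused)[0]),
        box=[sigmoid(v) for v in linear(p["box_w"], p["box_b"], fused)],
        label_probs=[sigmoid(v) for v in linear(p["label_w"], p["label_b"], fused)],
        token_probs=[sigmoid(linear(p["token_w"], p["token_b"], t)[0])
                     for t in token_feats],
    )


# Placeholder parameters; a trained system would learn these.
params = {
    "binary_w": [[0.5, -0.5]], "binary_b": [0.0],
    "box_w": [[0.1, 0.2], [0.3, -0.1], [0.2, 0.2], [-0.1, 0.4]], "box_b": [0.0] * 4,
    "label_w": [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]], "label_b": [0.0] * 3,
    "token_w": [[1.0, -1.0]], "token_b": [0.0],
}
result = detect([0.2, 0.4], [[0.9, 0.1], [-0.5, 0.5], [0.0, 0.0]], params)
tampered_tokens = [i for i, pr in enumerate(result.token_probs) if pr > 0.5]
```

The image-level heads share one fused vector while the token head runs once per text position, which is what allows it to locate tampering within the text rather than only flag it.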
Fig. 2 shows the effect of the present invention in predicting the tampered region of an image, where red marks the true tampered region and blue marks the tampered region predicted by the method of the invention; the two are almost identical.
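One common way to quantify how closely a predicted box matches the true box is intersection over union (IoU). The source does not state which metric it uses, so the function below is offered only as a generic sketch; the corner-coordinate box format and the example coordinates are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


true_box = (10.0, 10.0, 50.0, 50.0)       # hypothetical ground-truth region
predicted_box = (12.0, 12.0, 50.0, 50.0)  # hypothetical prediction
overlap = iou(true_box, predicted_box)    # about 0.90 for these boxes
```

An IoU of 1.0 means the boxes coincide exactly and 0.0 means they do not overlap at all.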
Fig. 3 illustrates the effect of the present invention in predicting text tampering, where the red text is the tampered text and its correspondence to the image is shown by an attention heat map.
As shown in Table 1, in experiments using different pre-trained models (CLIP, ViLT, ALBEF), the present invention exceeds the prior HAMMER method on all performance metrics.
TABLE 1
The invention discloses a multi-modal media tampering detection and localization system based on unified reconstruction that integrates image and text analysis in a unified framework, so that tampering in multi-modal data is processed and identified simultaneously; the system can accurately identify and locate tampered regions in the image and misleading expressions in the text, thereby improving the overall accuracy of tampering detection. The invention also simplifies the hierarchical architecture of existing methods by extracting features directly from the reconstruction coordinator module for detection and localization.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (5)
1. A multi-modal media tamper detection and localization system based on unified reconstruction, comprising:
an image encoder, a text encoder, a visual-text reconstruction module, a text-visual reconstruction module, a reconstruction coordinator module, and a detection and classification module;
the image encoder is configured to receive an input consisting of the image data to be detected and a random mask, and to output visual features and masked visual features;
the text encoder is configured to receive an input consisting of the text data to be detected and a random mask, and to output text features and masked text features;
the visual-text reconstruction module aligns and reconstructs visually guided text features through self-attention and cross-attention;
the text-visual reconstruction module aligns and reconstructs text-guided visual features through self-attention and cross-attention;
the reconstruction coordinator module performs centralized reasoning over the multi-modal features using self-attention and cross-attention, and further fuses the tampered feature representations;
the detection and classification module is configured to predict and locate a tampered instance in the multi-modal media data based on an output of the reconstruction coordinator module.
2. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the detection and classification module comprises at least one binary classifier, one bounding box detector, one multi-label classifier, and one token detector;
the binary classifier is used for predicting whether the multi-modal media data is real or tampered;
the bounding box detector is used for locating the tampered region in the image;
the multi-label classifier is used for identifying multiple tampering labels in the image;
the token detector is used for detecting and locating the tampered tokens in the text.
3. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the image encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 12 times, and to generate class and pattern vectors for subsequent processing; the text encoder is further configured to extract features through a combination of self-attention and a feed-forward neural network stacked 6 times, and to generate class and pattern vectors for subsequent processing.
4. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the visual-text reconstruction module and the text-visual reconstruction module each further comprise a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 6 times, which facilitates fine-grained alignment between the features of different modalities by reconstructing the features.
5. The unified reconstruction-based multi-modal media tamper detection and localization system of claim 1, wherein the reconstruction coordinator module further comprises a combination of multi-head self-attention, cross-attention, and a feed-forward neural network, stacked 4 times, to improve the quality and accuracy of the reconstructed features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410028426.1A CN117854073A (en) | 2024-01-09 | 2024-01-09 | Multimode media tampering detection and positioning system based on unified reconstruction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117854073A (en) | 2024-04-09 |
Family
ID=90545758
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410028426.1A (Pending) CN117854073A (en) | 2024-01-09 | 2024-01-09 | Multimode media tampering detection and positioning system based on unified reconstruction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117854073A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||