CN114662596A - False information detection model training method and false information detection method - Google Patents

False information detection model training method and false information detection method

Info

Publication number
CN114662596A
CN114662596A
Authority
CN
China
Prior art keywords
text
oriented
modal
loss value
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210301579.XA
Other languages
Chinese (zh)
Inventor
胡琳梅
赵紫望
孟涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210301579.XA priority Critical patent/CN114662596A/en
Publication of CN114662596A publication Critical patent/CN114662596A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a false information detection model training method and a false information detection method. The false information detection model training method comprises the following steps: acquiring training information, wherein the training information comprises a text and an image; inputting the text and the image into respective encoders to obtain a text representation and an image representation; interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features; respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value; and determining an objective function of the false information detection model according to the text-oriented loss value and the visual-oriented loss value, and training the false information detection model according to the objective function. The invention improves the accuracy and efficiency of false information detection.

Description

False information detection model training method and false information detection method
Technical Field
The invention discloses a false information detection model training method and a false information detection method, and belongs to the technical field of false information detection.
Background
Online social media has become an indispensable platform for people to share and obtain information in daily life. Because restrictions on user-generated content on online social media are loose, a large amount of false news that distorts and fabricates facts may circulate alongside real news. The widespread dissemination of false news can mislead readers and even cause serious social consequences. Therefore, automatic detection of false news on social media is particularly urgent and important.
Early research on false news detection focused primarily on the text modality of user posts. Although such methods can recognize false information, they ignore the fact that a large amount of news involves images. On the one hand, images may contain key clues for false news detection. As shown in fig. 1 (a), it is difficult to determine the authenticity of the news from the text alone, but by considering the visual image of the tweet it becomes clear that the news is fake. On the other hand, an image can strengthen the judgment of news authenticity even when the image itself contains no suspicious clues. As shown in fig. 1 (b), the text suggests that the news may be false while the image appears normal; however, when the text and the image are considered jointly, it can easily be inferred that the news is false.
Therefore, in recent years, much effort has been devoted to multimodal false news detection. Multimodal false news detection in the prior art mainly focuses on designing more advanced feature extractors for each modality and fuses the extracted multimodal features by simple concatenation. However, the prior art fails to model the interaction between the two modalities, so detection accuracy is low and detection efficiency is poor when the two modalities are used for false news detection.
Disclosure of Invention
The application aims to provide a false information detection model training method and a false information detection method so as to solve the technical problems of low detection accuracy and poor detection efficiency of the conventional false information detection method.
The invention provides a false information detection model training method in a first aspect, which comprises the following steps:
acquiring training information, wherein the training information comprises texts and images related to the texts;
respectively inputting the text and the image into respective encoders to obtain text representation and image representation;
interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features;
respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value;
determining an objective function of a false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function.
Preferably, the inputting the text and the image into respective encoders to obtain a text representation and an image representation, specifically includes:
acquiring embedded information of a text, and inputting the embedded information of the text into a text encoder to obtain text representation;
acquiring embedded information of an image, and inputting the embedded information of the image into an image encoder to obtain image representation;
the text encoder and the image encoder are both Transformer encoders;
the text representation and the image representation each include a query vector, a key vector, and a value vector.
Preferably, interacting the text representation and the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a text-oriented multi-modal fusion block to obtain text-oriented multi-modal features;
and inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features.
Preferably, the text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
inputting the query vector in the text representation and the key vector and value vector in the image representation into a multi-modal fusion block facing the text to obtain multi-modal features facing the text, which specifically comprises:
inputting the query vector in the text representation and the key vector and value vector in the image representation into a first modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts;
the mechanism used in the first modal interaction unit is a cooperative attention mechanism;
the mechanism used by the first key information selection unit is a self-attention mechanism.
Preferably, the visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit;
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the image representation and the key vector and value vector in the text representation into a second modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a second key information selection unit to obtain visual-oriented multi-modal features;
the mechanism used in the second modal interaction unit is a cooperative attention mechanism;
the mechanism used by the second key information selection unit is a self-attention mechanism.
Preferably, the text-oriented multi-modal features and the visual-oriented multi-modal features are respectively input into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and the method specifically includes:
inputting the text-oriented multi-modal features into a text-oriented classifier to obtain a first prediction probability of the training information, and determining a text-oriented loss value according to the first prediction probability and a true value of the training information;
and inputting the visual-oriented multi-modal features into a visual-oriented classifier to obtain a second prediction probability of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
Preferably, the determining an objective function of the false information detection model according to the text-oriented loss value and the visual-oriented loss value specifically includes:
determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability;
and determining an objective function of the false information detection model according to the text-oriented loss value, the vision-oriented loss value, the first interaction loss value and the second interaction loss value.
Preferably, determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability specifically comprises:
and determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier by using information divergence according to the first prediction probability and the second prediction probability.
Preferably, determining an objective function of the false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value and the second interaction loss value specifically includes:
determining an objective function of a false information detection model by using a first formula, wherein the first formula is as follows:
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )
in the formula, L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
A second aspect of the present invention provides a method for detecting false information, including:
acquiring information to be detected;
and inputting the information to be detected into the false information detection model obtained by training with the above false information detection model training method, to obtain a false information detection result.
Compared with the prior art, the false information detection model training method and the false information detection method have the following beneficial effects:
the invention relates to a multi-mode fusion method and a multi-mode fusion system based on a mutual learning network, wherein a multi-mode fusion block can jointly capture interaction among multi-mode features and extract key information for false news detection, and on the basis, a mutual learning strategy is further combined, so that information can be transmitted between a text-oriented classifier and a visual-oriented classifier, and multi-mode fusion is promoted to enhance the efficiency and accuracy of false news detection.
Drawings
FIGS. 1 (a) and (b) are false news on online social media;
FIG. 2 is a schematic flow chart of a method for training a false information detection model according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart of a method for training a false information detection model according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for detecting false information according to an embodiment of the present invention;
FIG. 5 is a graph of (a) ablation results of different models on a microblog and (b) ablation results of different models on Twitter in an embodiment of the present invention;
fig. 6 (a) is a visualization result of attention weight of a news word on an image block according to an embodiment of the present invention, and (b) is a visualization result of attention weight of another news word on an image block;
FIG. 7 shows the impact of different λ_KL values on both data sets in an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The first aspect of the embodiment of the present invention discloses a method for training a false information detection model, as shown in fig. 2 and 3, including:
step 1, training information is obtained, wherein the training information comprises texts and images related to the texts.
The training information in the embodiment of the invention is news information.
Step 2, inputting the text and the image into a text encoder and a visual encoder respectively to obtain a text representation and an image representation, and the method specifically comprises the following steps:
and step 21, acquiring embedded information of the text, and inputting the embedded information into a text encoder to obtain text representation.
In order to accurately model the semantic information of the text T in the training information and avoid word ambiguity problems, the embodiment of the present invention uses pre-trained BERT (Bidirectional Encoder Representations from Transformers) to obtain word embeddings. Specifically, the text T is segmented into m+1 tokens {t_0, t_1, t_2, …, t_m}, where t_0 is the special token [CLS] inserted at the beginning of the sentence. After the BERT encoding layer, the embedded information corresponding to the text is obtained as E_T = [e_0, e_1, …, e_m], where e_i ∈ R^(d_t) is the word embedding output by BERT and d_t is the dimension of the word embedding.
To capture intra-modal interactions between words in the text, the embodiment of the present invention employs a standard Transformer layer consisting of multi-head self-attention and a feed-forward network (FFN) to learn the text representation H_T; that is, the text encoder of the embodiment of the present invention is a Transformer encoder. The text representation H_T is obtained as shown in equation (1):
H_T = Transformer( E_T + E_T^pos )    (1)
where E_T^pos is the parameter-free position embedding corresponding to the text representation, and Transformer(·) denotes inputting E_T and E_T^pos into a Transformer layer.
The text representation H_T obtained by the invention includes a query vector (Q), a key vector (K), and a value vector (V).
And step 22, acquiring the embedded information of the image, and inputting the embedded information into an image encoder to obtain image representation.
Unlike existing methods that stack multiple convolutional layers to extract visual features, the embodiment of the present invention employs a Vision Transformer (ViT) model. Since the standard Transformer takes a one-dimensional sequence of token embeddings as input, the embodiment of the present invention first divides the image into a plurality of image blocks, obtains the embedding information of the image blocks, and then inputs the embedding information into the image encoder to obtain the image representation.
The embodiment of the invention reshapes the image I ∈ R^(H×W×C) into a sequence of flattened image blocks I_p ∈ R^(n×(P²·C)), where (H, W) is the resolution of the original image, H is the height of the original image, W is the width of the original image, C is the number of channels, (P, P) is the resolution of each image block, and n = HW/P² is the number of image blocks.
Then, the embodiment of the invention uses a trainable linear projection to flatten and map the image blocks to D dimensions, and the result is used as the input of a pre-trained Vision Transformer (ViT). Specifically, the embodiment of the present invention uses ViT-B/16 weights pre-trained on ImageNet to obtain the embedded information of the image blocks, as shown in formula (2):
E_I = [ e_cls; e_1; e_2; …; e_n ]    (2)
where e_cls is the embedding of the special token '[CLS]', e_i ∈ R^(d_I) is the embedding of the i-th image block, and d_I is the dimension of the image block embedding.
In the embodiment of the present invention, a standard Transformer encoder is used to model the internal interactions of the visual modality; that is, the image encoder in the embodiment of the present invention is also a Transformer encoder, so the following image representation H_I can be obtained:
H_I = Transformer( E_I + E_I^pos )    (3)
where E_I^pos is the parameter-free position embedding corresponding to the image representation, and Transformer(·) denotes inputting E_I and E_I^pos into a Transformer layer.
The image representation H_I obtained by the invention includes a query vector (Q), a key vector (K), and a value vector (V).
The sequence of step 21 and step 22 in the embodiment of the present invention is interchangeable.
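To make the two encoders concrete, the following is a minimal PyTorch sketch of steps 21 and 22. It is an illustrative example rather than the implementation of the patent: it assumes the Hugging Face transformers library, and the checkpoint names (bert-base-chinese, google/vit-base-patch16-224) are assumptions chosen to match the 768-dimensional embeddings described above.

```python
import torch.nn as nn
from transformers import BertModel, ViTModel

class UnimodalEncoders(nn.Module):
    """Minimal sketch of the text/image encoders of steps 21-22; not the official implementation.
    Checkpoint names are assumptions; BERT/ViT already add their own position embeddings, so the
    parameter-free position embeddings of eqs. (1) and (3) are omitted here for brevity."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")          # word embeddings E_T (d_t = 768)
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")  # patch embeddings E_I (d_I = 768)
        # One standard Transformer layer per modality models intra-modal interaction, eqs. (1) and (3).
        self.text_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, input_ids, attention_mask, pixel_values):
        # E_T: embeddings of the m+1 tokens {t_0 = [CLS], t_1, ..., t_m}.
        e_t = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # E_I: [CLS] embedding followed by the n image-block embeddings.
        e_i = self.vit(pixel_values=pixel_values).last_hidden_state
        h_t = self.text_layer(e_t)   # text representation H_T, eq. (1)
        h_i = self.image_layer(e_i)  # image representation H_I, eq. (3)
        return h_t, h_i
```

In practice, H_T and H_I would then be handed to the two multi-modal fusion blocks described in step 3 below.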
Step 3, interacting the text representation with the image representation to determine the text-oriented multi-modal features and the visual-oriented multi-modal features, which specifically comprises the following steps:
Step 31, inputting the query vector in the text representation and the key vector and the value vector in the image representation into the text-oriented multi-modal fusion block to obtain the text-oriented multi-modal features.
The text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
step 31 specifically includes:
Step 311, inputting the query vector in the text representation and the key vector and the value vector in the image representation into the first modal interaction unit to obtain a preliminary multi-modal fusion feature; wherein the mechanism used in the first modal interaction unit is a cooperative attention mechanism.
In the embodiment of the invention, the input of the i-th cooperative attention head (i = 1, 2, …, M) is converted from the text representation H_T and the image representation H_I, as shown in equation (4):
Q_i = H_T · W_i^Q,  K_i = H_I · W_i^K,  V_i = H_I · W_i^V    (4)
where W_i^Q is the projection matrix of the query vectors in the text representation, W_i^K is the projection matrix of the key vectors in the image representation, and W_i^V is the projection matrix of the value vectors in the image representation.
The cooperative attention mechanism in the cooperative attention model is calculated as in equation (5):
Att_i = softmax( Q_i · K_i^T / √d ) · V_i    (5)
where softmax(·) is the softmax function, Att_i is the i-th head of the multi-head attention (consisting of M heads in total), and d is the dimension of the key vectors. From equation (5), equation (6) can be derived:
MH_Att(H_T, H_I) = [Att_1; Att_2; …; Att_M] · W^O    (6)
where MH_Att(·) is the multi-head attention function, W^O is a weight matrix, and [ ; ] denotes a concatenation operation.
The residual connection, the feed-forward network (FFN) and the normalization layer (LayerNorm, LN) are then wrapped around this output to obtain the preliminary multi-modal fusion feature, as shown in formula (7):
C_T = LN( H'_T + FFN(H'_T) )    (7)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the first modal interaction unit, FFN(·) denotes inputting its argument into the feed-forward network of the first modal interaction unit, and H'_T is given by equation (8):
H'_T = LN( H_T + MH_Att(H_T, H_I) )    (8)
where LN(·) again denotes the normalization layer (LayerNorm) of the first modal interaction unit.
Step 312, inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts; wherein the mechanism used by the first key information selection unit is a self-attention mechanism.
The embodiment of the invention determines the text-oriented multi-modal feature according to equation (9):
O_T = LN( S_T + FFN(S_T) )    (9)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the first key information selection unit, FFN(·) denotes inputting its argument into the feed-forward network of the first key information selection unit, and S_T is given by equation (10):
S_T = LN( C_T + MH_Att(C_T, C_T) )    (10)
where the multi-head attention is applied to the preliminary fusion feature C_T itself (self-attention) and LN(·) denotes the normalization layer (LayerNorm) of the first key information selection unit.
Step 32, inputting the query vector in the image representation and the key vector and the value vector in the text representation into the visual-oriented multi-modal fusion block to obtain the visual-oriented multi-modal features.
The visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit.
Step 32 specifically includes:
step 321, inputting the query vector in the image representation and the key vector and the value vector in the text representation to a second modality interaction unit to obtain a preliminary multi-modality fusion feature, wherein a mechanism used in the second modality interaction unit is a cooperative attention mechanism.
In the embodiment of the invention, the input of the i-th cooperative attention head (i = 1, 2, …, M) is converted from the text representation H_T and the image representation H_I, as shown in equation (11):
Q_i = H_I · W_i^Q,  K_i = H_T · W_i^K,  V_i = H_T · W_i^V    (11)
where W_i^Q is the projection matrix of the query vectors in the image representation, W_i^K is the projection matrix of the key vectors in the text representation, and W_i^V is the projection matrix of the value vectors in the text representation.
The cooperative attention mechanism in the cooperative attention model is calculated as in equation (12):
Att_i = softmax( Q_i · K_i^T / √d ) · V_i    (12)
where softmax(·) is the softmax function, Att_i is the i-th head of the multi-head attention (consisting of M heads in total), and d is the dimension of the key vectors. From equation (12), equation (13) can be derived:
MH_Att(H_I, H_T) = [Att_1; Att_2; …; Att_M] · W^O    (13)
where W^O is a weight matrix and [ ; ] denotes a concatenation operation.
The residual connection, the feed-forward network (FFN) and the normalization layer (LayerNorm, LN) are then wrapped around this output to obtain the preliminary multi-modal fusion feature, as shown in formula (14):
C_I = LN( H'_I + FFN(H'_I) )    (14)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the second modal interaction unit, FFN(·) denotes inputting its argument into the feed-forward network of the second modal interaction unit, and H'_I is given by equation (15):
H'_I = LN( H_I + MH_Att(H_I, H_T) )    (15)
where LN(·) again denotes the normalization layer (LayerNorm) of the second modal interaction unit.
Step 322, inputting the preliminary multi-modal fusion feature to the second key information selection unit to obtain the visual-oriented multi-modal features, wherein the mechanism used by the second key information selection unit is a self-attention mechanism.
The embodiment of the invention determines the visual-oriented multi-modal feature according to equation (16):
O_I = LN( S_I + FFN(S_I) )    (16)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the second key information selection unit, FFN(·) denotes inputting its argument into the feed-forward network of the second key information selection unit, and S_I is given by equation (17):
S_I = LN( C_I + MH_Att(C_I, C_I) )    (17)
where the multi-head attention is applied to the preliminary fusion feature C_I itself (self-attention) and LN(·) denotes the normalization layer (LayerNorm) of the second key information selection unit.
The order of steps 31 and 32 of the embodiment of the present invention is interchangeable.
The multi-modal fusion block of the present invention serves to model multi-modal interactions and to select key information for false information detection. Considering that people judge false information both from a text-oriented perspective and from a visual-oriented perspective, the invention constructs corresponding text-oriented and visual-oriented multi-modal fusion blocks. As shown on the right side of fig. 3, the two fusion blocks share the same architecture, each composed of two units responsible for inter-modal interaction and key information selection, respectively.
Step 4, respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and specifically comprising:
and 41, inputting the multi-modal characteristics facing the text into a classifier facing the text to obtain a first prediction probability of the authenticity of the training information, and determining a loss value facing the text according to the first prediction probability and a true value of the training information.
Embodiments of the present invention utilize a fully connected layer and then predict the authenticity of the training information using a softmax function.
The first prediction probability is given by equation (18):
P_T = softmax( W_T · O_T + b_T )    (18)
where P_T is the first prediction probability, W_T is the weight of the fully connected layer in the text-oriented classifier, O_T is the text-oriented multi-modal feature, and b_T is the bias of the text-oriented classifier.
The cross entropy used in the text-oriented classifier in the embodiment of the present invention is given by formula (19):
L_T = -(1/N) Σ_{i=1}^{N} [ y_i · log P_T^(i) + (1 - y_i) · log(1 - P_T^(i)) ]    (19)
where L_T is the text-oriented loss value, N is the number of training samples, and y_i and P_T^(i) are respectively the true value and the first prediction probability of the i-th training sample for the text-oriented classifier.
Step 42, inputting the visual-oriented multi-modal features into the visual-oriented classifier to obtain a second prediction probability of the authenticity of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
The second prediction probability is given by equation (20):
P_I = softmax( W_I · O_I + b_I )    (20)
where P_I is the second prediction probability, W_I is the weight of the fully connected layer in the visual-oriented classifier, O_I is the visual-oriented multi-modal feature, and b_I is the bias of the visual-oriented classifier.
The cross entropy used in the visual-oriented classifier is given by formula (21):
L_I = -(1/N) Σ_{i=1}^{N} [ y_i · log P_I^(i) + (1 - y_i) · log(1 - P_I^(i)) ]    (21)
where L_I is the visual-oriented loss value, N is the number of training samples, and y_i and P_I^(i) are respectively the true value and the second prediction probability of the i-th training sample for the visual-oriented classifier.
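Under the same assumptions as the earlier sketches, the two view-specific classifiers of equations (18)-(21) reduce to a fully connected layer with softmax plus a cross-entropy loss, for example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewClassifier(nn.Module):
    """One fully connected layer followed by softmax, used for both the text-oriented and
    visual-oriented classifiers (eqs. (18) and (20)); illustrative sketch only."""
    def __init__(self, d_model=768, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, o):
        return F.softmax(self.fc(o), dim=-1)   # P_T or P_I

def view_loss(p, y):
    """Cross-entropy of eqs. (19)/(21), averaged over the N training samples.
    `p` holds probabilities (softmax output), so log is taken before nll_loss."""
    return F.nll_loss(torch.log(p + 1e-12), y)

# p_t = text_classifier(o_t);  loss_t = view_loss(p_t, labels)
# p_i = image_classifier(o_i); loss_i = view_loss(p_i, labels)
```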
And 5, determining an objective function of the false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function, as shown in the upper half part of the figure 3.
Inspired by the dual perspectives from which people judge false information, the method adopts a mutual learning strategy so that information can be transferred between the text-oriented classifier and the visual-oriented classifier, thereby promoting multi-modal fusion during false news detection.
For the mutual learning network, the present invention forces the two classifiers to mimic each other's final prediction probabilities.
The step 5 specifically comprises the following steps:
and step 51, determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability.
The embodiment of the invention quantifies how well the predictions P_T and P_I of the two networks match by using the Kullback-Leibler (KL) divergence. The KL distance from the text-oriented classifier to the visual-oriented classifier is calculated as in equation (22):
D_KL(P_T ‖ P_I) = Σ_c P_T(c) · log( P_T(c) / P_I(c) )    (22)
Then, in the mutual learning strategy, the first interaction loss value from the text-oriented classifier to the visual-oriented classifier is given by formula (23):
L_{T→I} = (1/N) Σ_{i=1}^{N} D_KL(P_T^(i) ‖ P_I^(i))    (23)
Similarly, the KL distance from the visual-oriented classifier to the text-oriented classifier is calculated as in equation (24):
D_KL(P_I ‖ P_T) = Σ_c P_I(c) · log( P_I(c) / P_T(c) )    (24)
Then, in the mutual learning strategy, the second interaction loss value from the visual-oriented classifier to the text-oriented classifier is given by formula (25):
L_{I→T} = (1/N) Σ_{i=1}^{N} D_KL(P_I^(i) ‖ P_T^(i))    (25)
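A minimal sketch of the interaction losses of equations (22)-(25) is given below, assuming the predicted distributions P_T and P_I from the classifiers above; the KL directions follow the wording of the patent and are otherwise an assumption.

```python
import torch.nn.functional as F

def interaction_losses(p_t, p_i, eps=1e-12):
    """KL-divergence interaction losses of eqs. (22)-(25), averaged over the batch;
    an illustrative sketch, not the official code."""
    # First interaction loss value (text-oriented -> visual-oriented classifier), eqs. (22)-(23).
    # F.kl_div(log_q, p) computes D_KL(p || q), so this is D_KL(P_T || P_I).
    kl_t2i = F.kl_div((p_i + eps).log(), p_t, reduction="batchmean")
    # Second interaction loss value (visual-oriented -> text-oriented classifier), eqs. (24)-(25).
    kl_i2t = F.kl_div((p_t + eps).log(), p_i, reduction="batchmean")
    return kl_t2i, kl_i2t
```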
and step 52, determining an objective function of the false information detection model according to the text-oriented loss value, the vision-oriented loss value, the first interaction loss value and the second interaction loss value, and training the false information detection model according to the objective function.
The objective function of the false information detection model of the invention is shown in formula (26):
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )    (26)
where L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
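Combining the pieces, one training step under the objective of formula (26) might look like the following sketch, which reuses the encoder, fusion-block, classifier and loss helpers from the earlier examples; it is illustrative only.

```python
lambda_kl = 0.01  # weight of the mutual learning loss; the experiments reported below suggest 0.01

def training_step(batch, encoders, text_block, image_block,
                  text_classifier, image_classifier, optimizer):
    """One optimisation step with the objective of formula (26); illustrative sketch only."""
    h_t, h_i = encoders(batch["input_ids"], batch["attention_mask"], batch["pixel_values"])
    o_t = text_block(h_t, h_i)                      # text-oriented multi-modal feature O_T
    o_i = image_block(h_i, h_t)                     # visual-oriented multi-modal feature O_I
    p_t, p_i = text_classifier(o_t), image_classifier(o_i)
    loss_t = view_loss(p_t, batch["labels"])        # text-oriented loss value, eq. (19)
    loss_i = view_loss(p_i, batch["labels"])        # visual-oriented loss value, eq. (21)
    kl_t2i, kl_i2t = interaction_losses(p_t, p_i)   # interaction loss values, eqs. (23) and (25)
    loss = loss_t + loss_i + lambda_kl * (kl_t2i + kl_i2t)   # objective of formula (26)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```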
The second aspect of the present invention discloses a method for detecting false information, as shown in fig. 4, including:
step 101, information to be detected is obtained.
The information to be detected in the embodiment of the invention comprises texts and pictures related to the texts.
And 102, inputting the information to be detected into the false information detection model obtained by training by using the training method of the false information detection model to obtain a false information detection result.
The false information detection result obtained by the embodiment of the invention is the prediction probability of the authenticity of the information to be detected.
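A hedged sketch of the detection step is given below. How the two view-specific predictions are combined at inference time is not spelled out in this section, so averaging them is an assumption; the helpers are those defined in the earlier sketches.

```python
import torch

@torch.no_grad()
def detect(text_inputs, pixel_values, encoders, text_block, image_block,
           text_classifier, image_classifier):
    """Apply the trained false information detection model to one text/image pair.
    Averaging the two view-specific probabilities is an assumption made for this sketch."""
    h_t, h_i = encoders(text_inputs["input_ids"], text_inputs["attention_mask"], pixel_values)
    p_t = text_classifier(text_block(h_t, h_i))    # text-oriented prediction
    p_i = image_classifier(image_block(h_i, h_t))  # visual-oriented prediction
    return (p_t + p_i) / 2                         # predicted probability of authenticity
```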
Intuitively, when people judge the authenticity of multimodal news, they often need to consider it from both a text-oriented perspective and a visual-oriented perspective. Specifically, the text-oriented perspective focuses on judging the text content while taking the visual content into account. In contrast, the visual-oriented perspective focuses on judging the visual content while taking the text content into account. As the example in fig. 1 shows, focusing on both text and images helps to detect false news. This is consistent with the way regional networks of the human brain interact when performing cognitive tasks.
In view of this dual-perspective learning process, the invention provides a new false information detection method: a mutual-learning-based multi-modal detection method (MMNet). The mutual learning mechanism of MMNet enables information transfer between the different perspectives so that multi-modal information can be fused better. Specifically, the model of the present invention consists of two key modules: a text-oriented classifier and a visual-oriented classifier. In each module, a new multi-modal fusion block is designed to extract text-oriented/visual-oriented multi-modal features, in which the interaction between the two modalities is well characterized. In particular, the multi-modal fusion block first captures cross-modal interactions through cooperative attention, and then extracts key information for further false news classification through self-attention.
The effectiveness of the method of the invention is verified below with more specific examples.
1.1 Experimental setup
1.1.1 data set.
To evaluate the performance of the proposed MMNet, the embodiment of the invention uses two widely used datasets: a microblog dataset and a Twitter dataset. The microblog dataset was collected from news agencies and the microblog platform. Each post contains text, an attached picture, and social context information. Fake news was collected from May 2012 to January 2016 and verified by the official microblog rumor debunking system. The Twitter dataset was released for the Verifying Multimedia Use task, which aims to detect false multimedia content on social media. Each tweet in the dataset includes text, a picture, and the associated social context information. The microblog dataset contains 9528 unique pictures and the Twitter dataset contains 514 unique pictures. The embodiment of the invention splits the microblog dataset into a training set and a test set. For the Twitter dataset, the invention uses the existing preprocessing method to obtain its training and test sets. Table 1 shows the statistics of these two datasets.
TABLE 1 Statistics of the microblog and Twitter datasets
News         Microblog   Twitter
Real news    4779        6026
False news   4749        7898
1.1.2 implementation details.
For word embedding, the embodiment of the present invention uses pre-trained BERT to extract text features with a dimensionality of 768; a Chinese pre-trained BERT model is used to obtain word embeddings on the microblog dataset and an English pre-trained BERT model is used on the Twitter dataset. For image block embedding, the embodiment of the present invention employs vit_base_patch16_224 to extract visual features of dimension 768, where the image is resized to 224×224 and the block size is 16. The embodiment of the present invention uses the adaptive moment estimation (Adam) optimizer to find the best parameters, with an initial learning rate of 0.0001.
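The preprocessing described above can be sketched as follows; the tokenizer and image-processor checkpoints, the example inputs, and the file path are assumptions for illustration only.

```python
import torch
from transformers import BertTokenizer, ViTImageProcessor
from PIL import Image

# Text side: a pre-trained BERT tokenizer (checkpoint chosen by language; an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_inputs = tokenizer(["一条待检测的新闻文本"], padding=True, truncation=True,
                        max_length=128, return_tensors="pt")

# Image side: resize to 224x224 for ViT-B/16 (patch size 16), as described above.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
pixel_values = processor(images=Image.open("news_image.jpg"), return_tensors="pt").pixel_values

# Optimizer: Adam with the stated initial learning rate of 1e-4; `model` stands for the
# assembled MMNet built from the sketches above.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```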
In addition to the accuracy metric, the embodiment of the invention also reports the precision, recall, and F1 score for false news and real news obtained by the different methods.
1.2 base line
Embodiments of the present invention compare the MMNet model with the single-modality and multi-modality models, as follows.
A single mode model. Embodiments of the present invention compare MMNet with the following model that uses text only for false news detection.
SVM-TS: SVM-TS uses a linear SVM classifier and heuristic rules to detect false news.
GRU: the GRU models textual semantic information for false news detection using a multi-layer GRU network.
A multi-modal model. The present example compares MMNet with the following 7 multimodal models.
att_RNN: att_RNN applies an attention-based RNN to fuse text, visual, and social context features for false news detection. In this experiment, the part that handles social context features was removed.
EANN: EANN employs an event adversarial neural network with an auxiliary event discriminator to remove event-specific features and retain the features shared across events for detecting false news. The embodiment of the invention uses a simplified version of EANN in the experiments, excluding the event discriminator.
MVAE: MVAE combines a variational auto-encoder with a binary classifier for false news detection.
SpotFake: SpotFake learns text features using pre-trained BERT and image features using VGG-19 pre-trained on ImageNet to detect false news.
SpotFake +: SpotFake + upgrades the pre-trained language model in SpotFake to XLNET to effectively detect false news.
HMCAN: HMCAN adopts a hierarchical multi-modal contextual attention network and performs false news detection by jointly modeling multi-modal contextual information and the hierarchical semantics of the text.
MCAN: MCAN employs multiple layers of collaborative attention to fuse multimodal features to complete the task of false news detection.
1.3 results and analysis
Table 2 shows the overall performance comparison of the different methods on the two data sets. The best results are bolded and the second best results are underlined.
Table 2: comparison of different models on microblog and Twitter data sets
From table 2, the following conclusions can be drawn:
1) the MMNet provided by the invention is superior to other models in accuracy index and F1 score. Compared with the optimal baseline model, the accuracy of MMNet on microblog and Twitter datasets was improved by about 0.78% and 3.7%, respectively.
2) A model that considers multi-modal information is superior to a model that considers only single-modal information. This demonstrates that integrating multiple modalities is advantageous for the task of false news detection.
3) Models based on cooperative attention perform better than methods that simply concatenate modal features or rely on auxiliary tasks (e.g., SpotFake, EANN), indicating that they fuse multimodal information better.
4) The MMNet of the embodiment of the present invention further outperforms the cooperative-attention-based approaches (i.e., HMCAN and MCAN). We believe this is because the mutual learning network facilitates multi-modal fusion for false news detection by transferring information between the text-oriented and visual-oriented perspectives. In addition, the multi-modal fusion block designed in the embodiment of the present invention not only captures the interactions among multi-modal information but also selects key information to identify false news.
1.4 ablation study
To verify the importance of each module in MMNet, embodiments of the invention compare MMNet with the following variants.
MMNet-T: a variant of MMNet, based only on text-oriented classifiers.
MMNet-V: a variation of MMNet, using only vision-oriented classification
MMNet-Co: a variant of MMNet replaces the multimodal fusion block with a traditional collaborative attention network.
MMNet-Avg: a variant of MMNet combines MMNet-T and MMNet-V by averaging the prediction probabilities.
FIG. 5 shows the results of an ablation study, with the models for each set of indices from left to right in FIGS. 5 (a) and (b) being MMNet-T, MMNet-V, MMNet-Co, MMNet-Avg, and MMNet of the present invention, respectively. From fig. 5 the following observations can be made:
1) the MMNet model of the invention outperforms all variants. The main reason is that mutual learning based on text-oriented classifiers and visual-oriented classifiers can convey information between each other, facilitating multi-modal fusion.
2) Both MMNet-T and MMNet-V achieved lower performance, indicating that using only text-oriented or visual-oriented classifiers is a suboptimal option for false news detection. On the microblog data set, MMNet-T performs much better than MMNet-V. The reason may be that the text on the micro-blog is relatively long, containing more false news detection information.
3) The performance of MMNet on two data sets is superior to that of MMNet-Co, which shows that the multi-mode fusion block designed by the invention is effective, can capture interaction between modes, and can select key information for false news detection.
4) Compared with MMNet-Avg, which combines the text-oriented classifier and the visual-oriented classifier by simple averaging, the performance of MMNet is significantly improved. This further illustrates the effectiveness of mutual learning between the different perspectives for false news detection.
1.5 case study
Furthermore, in order to have an intuitive understanding of the mutual learning strategy, the embodiment of the present invention visualizes the word attention weights on the image blocks calculated by formula (5) and formula (6). For convenience of explanation, the embodiments of the present invention reflect the attention weight on the opacity of the image block. If the attention value is greater than the average attention weight, the opacity is set to 255; otherwise, opacity is set to 76. The embodiment of the invention shows the visualization results of MMNet-Avg and MMNet in figure 6. FIG. 6 is a visualization of attention weights of words on an image block. Each example consists of text (at the top, the words for attention visualization are "dog" and "explosion") and three images, including the original image (on the left), the attention image derived by MMNet-Avg without mutual learning (in the middle), and the attention image derived by MMNet of an embodiment of the invention (on the right).
As can be seen from fig. 6 (a), for the word "dog", MMNet attends to the corresponding object, while in MMNet-Avg, which lacks mutual learning, the attention is scattered across the corners of the image. Likewise, in fig. 6 (b), the word "explosion" gives more attention to the corresponding region in the MMNet of the embodiment of the present invention. These cases show that mutual learning between the text-oriented and visual-oriented perspectives can better capture the interdependencies of multimodal information.
1.6 Influence of the λ_KL value
To explore the influence of the mutual learning loss weight λ_KL on model performance, the embodiment of the invention varies λ_KL from 1e-5 to 0.1 and reports the accuracy on the two datasets in fig. 7. It can be observed that the accuracy of false news detection generally increases first on both datasets and reaches its maximum at λ_KL = 0.01; when λ_KL exceeds 0.01, the accuracy begins to decline. In general, when λ_KL is 0.01, MMNet best balances the individual classifiers and their mutual learning.
The embodiment of the invention provides a novel MMNet model based on a mutual learning network for multi-modal false news detection. The model can transfer information between the text-oriented and visual-oriented perspectives and promote multi-modal fusion to enhance false news detection. In addition, the embodiment of the invention designs a new multi-modal fusion block, which can not only capture the interaction between the modalities but also select the key information for false news detection. A large number of experiments on two public real-world datasets verified the effectiveness of the MMNet model of the embodiment of the invention.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for training a false information detection model is characterized by comprising the following steps:
acquiring training information, wherein the training information comprises a text and an image related to the text;
inputting the text and the image into respective encoders to obtain a text representation and an image representation;
interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features;
respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value;
determining an objective function of a false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function.
2. The method of training the false information detection model according to claim 1, wherein the inputting the text and the image into respective encoders to obtain a text representation and an image representation comprises:
acquiring the embedded information of the text, and inputting the embedded information of the text into a text encoder to obtain text representation;
acquiring the embedded information of the image, and inputting the embedded information of the image into an image encoder to obtain image representation;
the text encoder and the image encoder are both Transformer encoders;
the text representation and the image representation each include a query vector, a key vector, and a value vector.
3. The method for training the false information detection model according to claim 2, wherein the interacting the text representation and the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features comprises:
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a text-oriented multi-modal fusion block to obtain text-oriented multi-modal features;
and inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features.
4. The training method of the false information detection model according to claim 3, wherein the text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a multi-modal fusion block facing the text to obtain multi-modal features facing the text, which specifically comprises:
inputting the query vector in the text representation and the key vector and value vector in the image representation into a first modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts;
the mechanism used in the first modal interaction unit is a cooperative attention mechanism;
the mechanism used by the first key information selection unit is a self-attention mechanism.
5. The training method of the false information detection model according to claim 3, wherein the visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit;
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a second modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion feature to a second key information selection unit to obtain the visual-oriented multi-modal features;
the mechanism used in the second modal interaction unit is a cooperative attention mechanism;
the mechanism used by the second key information selection unit is a self-attention mechanism.
6. The method for training the false information detection model according to any one of claims 1-5, wherein the text-oriented multi-modal features and the visual-oriented multi-modal features are respectively input into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and specifically comprises:
inputting the text-oriented multi-modal features into a text-oriented classifier to obtain a first prediction probability of the training information, and determining a text-oriented loss value according to the first prediction probability and a true value of the training information;
and inputting the visual-oriented multi-modal features into a visual-oriented classifier to obtain a second prediction probability of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
7. The method for training the dummy information detection model according to claim 6, wherein determining the objective function of the dummy information detection model according to the text-oriented loss value and the visual-oriented loss value comprises:
determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability;
and determining an objective function of a false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value and the second interaction loss value.
8. The method of claim 7, wherein determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability comprises:
and determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier by using information divergence according to the first prediction probability and the second prediction probability.
9. The method for training the false information detection model according to claim 7, wherein determining the objective function of the false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value, and the second interaction loss value specifically includes:
determining an objective function of a false information detection model by using a first formula, wherein the first formula is as follows:
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )
in the formula, L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
10. A false information detection method, comprising:
acquiring information to be detected;
inputting the information to be detected into a false information detection model obtained by training by using the training method of the false information detection model according to any one of claims 1 to 9, and obtaining a false information detection result.
CN202210301579.XA 2022-03-25 2022-03-25 False information detection model training method and false information detection method Pending CN114662596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210301579.XA CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210301579.XA CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Publications (1)

Publication Number Publication Date
CN114662596A true CN114662596A (en) 2022-06-24

Family

ID=82030827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210301579.XA Pending CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Country Status (1)

Country Link
CN (1) CN114662596A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496140A (en) * 2022-09-19 2022-12-20 北京邮电大学 Multi-mode false news detection method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination