CN114662596A - False information detection model training method and false information detection method - Google Patents

False information detection model training method and false information detection method

Info

Publication number
CN114662596A
CN114662596A
Authority
CN
China
Prior art keywords
text
oriented
modal
loss value
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210301579.XA
Other languages
Chinese (zh)
Inventor
胡琳梅
赵紫望
孟涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210301579.XA priority Critical patent/CN114662596A/en
Publication of CN114662596A publication Critical patent/CN114662596A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a false information detection model training method and a false information detection method. The false information detection model training method comprises the following steps: acquiring training information, wherein the training information comprises a text and an image; inputting the text and the image into respective encoders to obtain a text representation and an image representation; interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features; respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value; and determining an objective function of the false information detection model according to the text-oriented loss value and the visual-oriented loss value, and training the false information detection model according to the objective function. The invention improves the accuracy and efficiency of false information detection.

Description

False information detection model training method and false information detection method
Technical Field
The invention discloses a false information detection model training method and a false information detection method, and belongs to the technical field of false information detection.
Background
Online social media has become an indispensable platform for people to share and obtain information in daily life. Because restrictions on user-generated content on online social media are loose, a large amount of false news that distorts and fabricates facts may circulate alongside real news. The widespread dissemination of false news can mislead readers and even cause serious social consequences. Therefore, automatic detection of false news on social media is particularly urgent and important.
Early research on false news detection focused primarily on the text modality of user posts. Although such methods can recognize false information, they ignore the fact that a large amount of news involves images. On the one hand, images may contain key clues for false news detection. As shown in fig. 1 (a), it is difficult to determine the authenticity of the news from the text alone, but by considering the visual image of the tweet it becomes clear that the news is fake. On the other hand, an image can strengthen the judgment of news authenticity even when the image itself contains no suspicious clues. As shown in fig. 1 (b), the text suggests that the news may be false while the image appears normal; however, when the text and the image are considered jointly, it can easily be inferred that the news is false.
Therefore, in recent years, much effort has been devoted to multimodal false news detection. Multimodal false news detection in the prior art mainly focuses on designing more advanced feature extractors for each modality and fuses the extracted multimodal features by simple concatenation. However, the prior art fails to model the interaction between the two modalities, so detection accuracy is low and detection efficiency is poor when the two modalities are used for false news detection.
Disclosure of Invention
The application aims to provide a false information detection model training method and a false information detection method so as to solve the technical problems of low detection accuracy and poor detection efficiency of the conventional false information detection method.
The invention provides a false information detection model training method in a first aspect, which comprises the following steps:
acquiring training information, wherein the training information comprises texts and images related to the texts;
respectively inputting the text and the image into respective encoders to obtain text representation and image representation;
interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features;
respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value;
determining an objective function of a false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function.
Preferably, the inputting the text and the image into respective encoders to obtain a text representation and an image representation, specifically includes:
acquiring embedded information of a text, and inputting the embedded information of the text into a text encoder to obtain text representation;
acquiring embedded information of an image, and inputting the embedded information of the image into an image encoder to obtain image representation;
the text encoder and the image encoder are both Transformer encoders;
the text representation and the image representation each include a query vector, a key vector, and a value vector.
Preferably, interacting the text representation and the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a text-oriented multi-modal fusion block to obtain text-oriented multi-modal features;
and inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features.
Preferably, the text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
inputting the query vector in the text representation and the key vector and value vector in the image representation into a multi-modal fusion block facing the text to obtain multi-modal features facing the text, which specifically comprises:
inputting the query vector in the text representation and the key vector and value vector in the image representation into a first modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts;
the mechanism used in the first modal interaction unit is a cooperative attention mechanism;
the mechanism used by the first key information selection unit is a self-attention mechanism.
Preferably, the visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit;
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the image representation and the key vector and value vector in the text representation into a second modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a second key information selection unit to obtain visual-oriented multi-modal features;
the mechanism used in the second modal interaction unit is a cooperative attention mechanism;
the mechanism used by the second key information selection unit is a self-attention mechanism.
Preferably, the text-oriented multi-modal features and the visual-oriented multi-modal features are respectively input into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and the method specifically includes:
inputting the text-oriented multi-modal features into a text-oriented classifier to obtain a first prediction probability of the training information, and determining a text-oriented loss value according to the first prediction probability and a true value of the training information;
and inputting the visual-oriented multi-modal features into a visual-oriented classifier to obtain a second prediction probability of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
Preferably, the determining an objective function of the false information detection model according to the text-oriented loss value and the visual-oriented loss value specifically includes:
determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability;
and determining an objective function of the false information detection model according to the text-oriented loss value, the vision-oriented loss value, the first interaction loss value and the second interaction loss value.
Preferably, determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability specifically comprises:
and determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier by using information divergence according to the first prediction probability and the second prediction probability.
Preferably, determining an objective function of the false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value and the second interaction loss value specifically includes:
determining an objective function of a false information detection model by using a first formula, wherein the first formula is as follows:
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )
in the formula, L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
A second aspect of the present invention provides a method for detecting false information, including:
acquiring information to be detected;
and inputting the information to be detected into the false information detection model obtained by training with the above false information detection model training method, to obtain a false information detection result.
Compared with the prior art, the false information detection model training method and the false information detection method have the following beneficial effects:
the invention relates to a multi-mode fusion method and a multi-mode fusion system based on a mutual learning network, wherein a multi-mode fusion block can jointly capture interaction among multi-mode features and extract key information for false news detection, and on the basis, a mutual learning strategy is further combined, so that information can be transmitted between a text-oriented classifier and a visual-oriented classifier, and multi-mode fusion is promoted to enhance the efficiency and accuracy of false news detection.
Drawings
FIGS. 1 (a) and (b) are false news on online social media;
FIG. 2 is a schematic flow chart of a method for training a false information detection model according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart of a method for training a false information detection model according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for detecting false information according to an embodiment of the present invention;
FIG. 5 is a graph of (a) ablation results of different models on a microblog and (b) ablation results of different models on Twitter in an embodiment of the present invention;
fig. 6 (a) is a visualization result of attention weight of a news word on an image block according to an embodiment of the present invention, and (b) is a visualization result of attention weight of another news word on an image block;
FIG. 7 shows the impact of different λ_KL values on both data sets in an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The first aspect of the embodiment of the present invention discloses a method for training a false information detection model, as shown in fig. 2 and 3, including:
step 1, training information is obtained, wherein the training information comprises texts and images related to the texts.
The training information in the embodiment of the invention is news information.
Step 2, inputting the text and the image into a text encoder and a visual encoder respectively to obtain a text representation and an image representation, and the method specifically comprises the following steps:
and step 21, acquiring embedded information of the text, and inputting the embedded information into a text encoder to obtain text representation.
In order to accurately model the semantic information of the text T in the training information and avoid word ambiguity problems, the embodiment of the present invention uses pre-trained BERT (Bidirectional Encoder Representations from Transformers) to obtain word embeddings. Specifically, the text T is segmented into m+1 tokens {t_0, t_1, t_2, …, t_m}, where t_0 is the special token [CLS] inserted at the beginning of the sentence. After the BERT encoding layer, the embedded information corresponding to the text is obtained as E_T = [e_0, e_1, …, e_m], where e_i ∈ R^(d_t) is the word embedding output by BERT and d_t is the dimension of the word embedding.
To capture intra-modal interactions between words in the text, the embodiment of the present invention employs a standard Transformer layer consisting of multi-head self-attention and a feed-forward network (FFN) to learn the text representation H_T; that is, the text encoder of the embodiment of the present invention is a Transformer encoder. The text representation H_T is obtained as shown in equation (1):
H_T = Transformer( E_T + E_T^pos )    (1)
where E_T^pos is the parameter-free position embedding corresponding to the text representation, and Transformer(·) denotes inputting E_T and E_T^pos into a Transformer layer.
The text representation H_T obtained by the invention includes a query vector (Q), a key vector (K), and a value vector (V).
And step 22, acquiring the embedded information of the image, and inputting the embedded information into an image encoder to obtain image representation.
Unlike existing methods that stack multiple convolutional layers to extract visual features, the embodiment of the present invention employs a Vision Transformer (ViT) model. Since the standard Transformer takes a one-dimensional sequence of token embeddings as input, the embodiment of the present invention first divides the image into a plurality of image blocks, obtains the embedding information of the image blocks, and then inputs the embedding information into the image encoder to obtain the image representation.
The embodiment of the invention reshapes the image I ∈ R^(H×W×C) into a sequence of flattened image blocks I_p ∈ R^(n×(P²·C)), where (H, W) is the resolution of the original image, H is the height of the original image, W is the width of the original image, C is the number of channels, (P, P) is the resolution of each image block, and n = HW/P² is the number of image blocks.
Then, the embodiment of the invention uses a trainable linear projection to flatten and map the image blocks to D dimensions, and the result is used as the input of a pre-trained Vision Transformer (ViT). Specifically, the embodiment of the present invention uses ViT-B/16 weights pre-trained on ImageNet to obtain the embedded information of the image blocks, as shown in formula (2):
E_I = [ e_cls; e_1; e_2; …; e_n ]    (2)
where e_cls is the embedding of the special token '[CLS]', e_i ∈ R^(d_I) is the embedding of the i-th image block, and d_I is the dimension of the image block embedding.
In the embodiment of the present invention, a standard Transformer encoder is used to model the internal interactions of the visual modality; that is, the image encoder in the embodiment of the present invention is also a Transformer encoder, so the following image representation H_I can be obtained:
H_I = Transformer( E_I + E_I^pos )    (3)
where E_I^pos is the parameter-free position embedding corresponding to the image representation, and Transformer(·) denotes inputting E_I and E_I^pos into a Transformer layer.
The image representation H_I obtained by the invention includes a query vector (Q), a key vector (K), and a value vector (V).
The sequence of step 21 and step 22 in the embodiment of the present invention is interchangeable.
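To make the two encoders concrete, the following is a minimal PyTorch sketch of steps 21 and 22. It is an illustrative example rather than the implementation of the patent: it assumes the Hugging Face transformers library, and the checkpoint names (bert-base-chinese, google/vit-base-patch16-224) are assumptions chosen to match the 768-dimensional embeddings described above.

```python
import torch.nn as nn
from transformers import BertModel, ViTModel

class UnimodalEncoders(nn.Module):
    """Minimal sketch of the text/image encoders of steps 21-22; not the official implementation.
    Checkpoint names are assumptions; BERT/ViT already add their own position embeddings, so the
    parameter-free position embeddings of eqs. (1) and (3) are omitted here for brevity."""
    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")          # word embeddings E_T (d_t = 768)
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224")  # patch embeddings E_I (d_I = 768)
        # One standard Transformer layer per modality models intra-modal interaction, eqs. (1) and (3).
        self.text_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.image_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, input_ids, attention_mask, pixel_values):
        # E_T: embeddings of the m+1 tokens {t_0 = [CLS], t_1, ..., t_m}.
        e_t = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # E_I: [CLS] embedding followed by the n image-block embeddings.
        e_i = self.vit(pixel_values=pixel_values).last_hidden_state
        h_t = self.text_layer(e_t)   # text representation H_T, eq. (1)
        h_i = self.image_layer(e_i)  # image representation H_I, eq. (3)
        return h_t, h_i
```

In practice, H_T and H_I would then be handed to the two multi-modal fusion blocks described in step 3 below.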
Step 3, interacting the text representation with the image representation to determine the text-oriented multi-modal features and the visual-oriented multi-modal features, which specifically comprises the following steps:
Step 31, inputting the query vector in the text representation and the key vector and the value vector in the image representation into the text-oriented multi-modal fusion block to obtain the text-oriented multi-modal features.
The text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
step 31 specifically includes:
Step 311, inputting the query vector in the text representation and the key vector and the value vector in the image representation into the first modal interaction unit to obtain a preliminary multi-modal fusion feature; wherein the mechanism used in the first modal interaction unit is a cooperative attention mechanism.
In the embodiment of the invention, the input of the i-th cooperative attention head (i = 1, 2, …, M) is converted from the text representation H_T and the image representation H_I, as shown in equation (4):
Q_i = H_T · W_i^Q,  K_i = H_I · W_i^K,  V_i = H_I · W_i^V    (4)
where W_i^Q is the projection matrix of the query vectors in the text representation, W_i^K is the projection matrix of the key vectors in the image representation, and W_i^V is the projection matrix of the value vectors in the image representation.
The cooperative attention mechanism in the cooperative attention model is calculated as in equation (5):
Att_i = softmax( Q_i · K_i^T / √d ) · V_i    (5)
where softmax(·) is the softmax function, Att_i is the i-th head of the multi-head attention (consisting of M heads in total), and d is the dimension of the key vectors. From equation (5), equation (6) can be derived:
MH_Att(H_T, H_I) = [Att_1; Att_2; …; Att_M] · W^O    (6)
where MH_Att(·) is the multi-head attention function, W^O is a weight matrix, and [ ; ] denotes a concatenation operation.
The residual connection, the feed-forward network (FFN) and the normalization layer (LayerNorm, LN) are then wrapped around this output to obtain the preliminary multi-modal fusion feature, as shown in formula (7):
C_T = LN( H'_T + FFN(H'_T) )    (7)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the first modal interaction unit, FFN(·) denotes inputting its argument into the feed-forward network of the first modal interaction unit, and H'_T is given by equation (8):
H'_T = LN( H_T + MH_Att(H_T, H_I) )    (8)
where LN(·) again denotes the normalization layer (LayerNorm) of the first modal interaction unit.
Step 312, inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts; wherein the mechanism used by the first key information selection unit is a self-attention mechanism.
The embodiment of the invention determines the text-oriented multi-modal feature according to equation (9):
O_T = LN( S_T + FFN(S_T) )    (9)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the first key information selection unit, FFN(·) denotes inputting its argument into the feed-forward network of the first key information selection unit, and S_T is given by equation (10):
S_T = LN( C_T + MH_Att(C_T, C_T) )    (10)
where the multi-head attention is applied to the preliminary fusion feature C_T itself (self-attention) and LN(·) denotes the normalization layer (LayerNorm) of the first key information selection unit.
Step 32, inputting the query vector in the image representation and the key vector and the value vector in the text representation into the visual-oriented multi-modal fusion block to obtain the visual-oriented multi-modal features.
The visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit.
Step 32 specifically includes:
step 321, inputting the query vector in the image representation and the key vector and the value vector in the text representation to a second modality interaction unit to obtain a preliminary multi-modality fusion feature, wherein a mechanism used in the second modality interaction unit is a cooperative attention mechanism.
In the embodiment of the invention, the input of the i-th cooperative attention head (i = 1, 2, …, M) is converted from the text representation H_T and the image representation H_I, as shown in equation (11):
Q_i = H_I · W_i^Q,  K_i = H_T · W_i^K,  V_i = H_T · W_i^V    (11)
where W_i^Q is the projection matrix of the query vectors in the image representation, W_i^K is the projection matrix of the key vectors in the text representation, and W_i^V is the projection matrix of the value vectors in the text representation.
The cooperative attention mechanism in the cooperative attention model is calculated as in equation (12):
Att_i = softmax( Q_i · K_i^T / √d ) · V_i    (12)
where softmax(·) is the softmax function, Att_i is the i-th head of the multi-head attention (consisting of M heads in total), and d is the dimension of the key vectors. From equation (12), equation (13) can be derived:
MH_Att(H_I, H_T) = [Att_1; Att_2; …; Att_M] · W^O    (13)
where W^O is a weight matrix and [ ; ] denotes a concatenation operation.
The residual connection, the feed-forward network (FFN) and the normalization layer (LayerNorm, LN) are then wrapped around this output to obtain the preliminary multi-modal fusion feature, as shown in formula (14):
C_I = LN( H'_I + FFN(H'_I) )    (14)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the second modal interaction unit, FFN(·) denotes inputting its argument into the feed-forward network of the second modal interaction unit, and H'_I is given by equation (15):
H'_I = LN( H_I + MH_Att(H_I, H_T) )    (15)
where LN(·) again denotes the normalization layer (LayerNorm) of the second modal interaction unit.
Step 322, inputting the preliminary multi-modal fusion feature to the second key information selection unit to obtain the visual-oriented multi-modal features, wherein the mechanism used by the second key information selection unit is a self-attention mechanism.
The embodiment of the invention determines the visual-oriented multi-modal feature according to equation (16):
O_I = LN( S_I + FFN(S_I) )    (16)
where LN(·) denotes inputting its argument into the normalization layer (LayerNorm) of the second key information selection unit, FFN(·) denotes inputting its argument into the feed-forward network of the second key information selection unit, and S_I is given by equation (17):
S_I = LN( C_I + MH_Att(C_I, C_I) )    (17)
where the multi-head attention is applied to the preliminary fusion feature C_I itself (self-attention) and LN(·) denotes the normalization layer (LayerNorm) of the second key information selection unit.
The order of steps 31 and 32 of the embodiment of the present invention is interchangeable.
The multi-modal fusion block of the present invention serves to model multi-modal interactions and to select key information for false information detection. Considering that people judge false information both from a text-oriented perspective and from a visual-oriented perspective, the invention constructs corresponding text-oriented and visual-oriented multi-modal fusion blocks. As shown on the right side of fig. 3, the two fusion blocks share the same architecture, each composed of two units responsible for inter-modal interaction and key information selection, respectively.
Step 4, respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and specifically comprising:
and 41, inputting the multi-modal characteristics facing the text into a classifier facing the text to obtain a first prediction probability of the authenticity of the training information, and determining a loss value facing the text according to the first prediction probability and a true value of the training information.
Embodiments of the present invention utilize a fully connected layer and then predict the authenticity of the training information using a softmax function.
The first prediction probability is given by equation (18):
P_T = softmax( W_T · O_T + b_T )    (18)
where P_T is the first prediction probability, W_T is the weight of the fully connected layer in the text-oriented classifier, O_T is the text-oriented multi-modal feature, and b_T is the bias of the text-oriented classifier.
The cross entropy used in the text-oriented classifier in the embodiment of the present invention is given by formula (19):
L_T = -(1/N) Σ_{i=1}^{N} [ y_i · log P_T^(i) + (1 - y_i) · log(1 - P_T^(i)) ]    (19)
where L_T is the text-oriented loss value, N is the number of training samples, and y_i and P_T^(i) are respectively the true value and the first prediction probability of the i-th training sample for the text-oriented classifier.
Step 42, inputting the visual-oriented multi-modal features into the visual-oriented classifier to obtain a second prediction probability of the authenticity of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
The second prediction probability is given by equation (20):
P_I = softmax( W_I · O_I + b_I )    (20)
where P_I is the second prediction probability, W_I is the weight of the fully connected layer in the visual-oriented classifier, O_I is the visual-oriented multi-modal feature, and b_I is the bias of the visual-oriented classifier.
The cross entropy used in the visual-oriented classifier is given by formula (21):
L_I = -(1/N) Σ_{i=1}^{N} [ y_i · log P_I^(i) + (1 - y_i) · log(1 - P_I^(i)) ]    (21)
where L_I is the visual-oriented loss value, N is the number of training samples, and y_i and P_I^(i) are respectively the true value and the second prediction probability of the i-th training sample for the visual-oriented classifier.
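Under the same assumptions as the earlier sketches, the two view-specific classifiers of equations (18)-(21) reduce to a fully connected layer with softmax plus a cross-entropy loss, for example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewClassifier(nn.Module):
    """One fully connected layer followed by softmax, used for both the text-oriented and
    visual-oriented classifiers (eqs. (18) and (20)); illustrative sketch only."""
    def __init__(self, d_model=768, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(d_model, n_classes)

    def forward(self, o):
        return F.softmax(self.fc(o), dim=-1)   # P_T or P_I

def view_loss(p, y):
    """Cross-entropy of eqs. (19)/(21), averaged over the N training samples.
    `p` holds probabilities (softmax output), so log is taken before nll_loss."""
    return F.nll_loss(torch.log(p + 1e-12), y)

# p_t = text_classifier(o_t);  loss_t = view_loss(p_t, labels)
# p_i = image_classifier(o_i); loss_i = view_loss(p_i, labels)
```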
And 5, determining an objective function of the false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function, as shown in the upper half part of the figure 3.
Inspired by the dual perspectives from which people judge false information, the method adopts a mutual learning strategy so that information can be transferred between the text-oriented classifier and the visual-oriented classifier, thereby promoting multi-modal fusion during false news detection.
For the mutual learning network, the present invention forces the two classifiers to mimic each other's final prediction probabilities.
The step 5 specifically comprises the following steps:
and step 51, determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability.
The embodiment of the invention quantifies how well the predictions P_T and P_I of the two networks match by using the Kullback-Leibler (KL) divergence. The KL distance from the text-oriented classifier to the visual-oriented classifier is calculated as in equation (22):
D_KL(P_T ‖ P_I) = Σ_c P_T(c) · log( P_T(c) / P_I(c) )    (22)
Then, in the mutual learning strategy, the first interaction loss value from the text-oriented classifier to the visual-oriented classifier is given by formula (23):
L_{T→I} = (1/N) Σ_{i=1}^{N} D_KL(P_T^(i) ‖ P_I^(i))    (23)
Similarly, the KL distance from the visual-oriented classifier to the text-oriented classifier is calculated as in equation (24):
D_KL(P_I ‖ P_T) = Σ_c P_I(c) · log( P_I(c) / P_T(c) )    (24)
Then, in the mutual learning strategy, the second interaction loss value from the visual-oriented classifier to the text-oriented classifier is given by formula (25):
L_{I→T} = (1/N) Σ_{i=1}^{N} D_KL(P_I^(i) ‖ P_T^(i))    (25)
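A minimal sketch of the interaction losses of equations (22)-(25) is given below, assuming the predicted distributions P_T and P_I from the classifiers above; the KL directions follow the wording of the patent and are otherwise an assumption.

```python
import torch.nn.functional as F

def interaction_losses(p_t, p_i, eps=1e-12):
    """KL-divergence interaction losses of eqs. (22)-(25), averaged over the batch;
    an illustrative sketch, not the official code."""
    # First interaction loss value (text-oriented -> visual-oriented classifier), eqs. (22)-(23).
    # F.kl_div(log_q, p) computes D_KL(p || q), so this is D_KL(P_T || P_I).
    kl_t2i = F.kl_div((p_i + eps).log(), p_t, reduction="batchmean")
    # Second interaction loss value (visual-oriented -> text-oriented classifier), eqs. (24)-(25).
    kl_i2t = F.kl_div((p_t + eps).log(), p_i, reduction="batchmean")
    return kl_t2i, kl_i2t
```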
and step 52, determining an objective function of the false information detection model according to the text-oriented loss value, the vision-oriented loss value, the first interaction loss value and the second interaction loss value, and training the false information detection model according to the objective function.
The objective function of the false information detection model of the invention is shown in formula (26):
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )    (26)
where L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
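Combining the pieces, one training step under the objective of formula (26) might look like the following sketch, which reuses the encoder, fusion-block, classifier and loss helpers from the earlier examples; it is illustrative only.

```python
lambda_kl = 0.01  # weight of the mutual learning loss; the experiments reported below suggest 0.01

def training_step(batch, encoders, text_block, image_block,
                  text_classifier, image_classifier, optimizer):
    """One optimisation step with the objective of formula (26); illustrative sketch only."""
    h_t, h_i = encoders(batch["input_ids"], batch["attention_mask"], batch["pixel_values"])
    o_t = text_block(h_t, h_i)                      # text-oriented multi-modal feature O_T
    o_i = image_block(h_i, h_t)                     # visual-oriented multi-modal feature O_I
    p_t, p_i = text_classifier(o_t), image_classifier(o_i)
    loss_t = view_loss(p_t, batch["labels"])        # text-oriented loss value, eq. (19)
    loss_i = view_loss(p_i, batch["labels"])        # visual-oriented loss value, eq. (21)
    kl_t2i, kl_i2t = interaction_losses(p_t, p_i)   # interaction loss values, eqs. (23) and (25)
    loss = loss_t + loss_i + lambda_kl * (kl_t2i + kl_i2t)   # objective of formula (26)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```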
The second aspect of the present invention discloses a method for detecting false information, as shown in fig. 4, including:
step 101, information to be detected is obtained.
The information to be detected in the embodiment of the invention comprises texts and pictures related to the texts.
And 102, inputting the information to be detected into the false information detection model obtained by training by using the training method of the false information detection model to obtain a false information detection result.
The false information detection result obtained by the embodiment of the invention is the prediction probability of the authenticity of the information to be detected.
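A hedged sketch of the detection step is given below. How the two view-specific predictions are combined at inference time is not spelled out in this section, so averaging them is an assumption; the helpers are those defined in the earlier sketches.

```python
import torch

@torch.no_grad()
def detect(text_inputs, pixel_values, encoders, text_block, image_block,
           text_classifier, image_classifier):
    """Apply the trained false information detection model to one text/image pair.
    Averaging the two view-specific probabilities is an assumption made for this sketch."""
    h_t, h_i = encoders(text_inputs["input_ids"], text_inputs["attention_mask"], pixel_values)
    p_t = text_classifier(text_block(h_t, h_i))    # text-oriented prediction
    p_i = image_classifier(image_block(h_i, h_t))  # visual-oriented prediction
    return (p_t + p_i) / 2                         # predicted probability of authenticity
```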
Intuitively, when people judge the authenticity of multimodal news, they often need to consider it from both a text-oriented perspective and a visual-oriented perspective. Specifically, the text-oriented perspective focuses on judging the text content while taking the visual content into account. In contrast, the visual-oriented perspective focuses on judging the visual content while taking the text content into account. As the example in fig. 1 shows, focusing on both text and images helps to detect false news. This is consistent with the way regional networks of the human brain interact when performing cognitive tasks.
In view of this dual-perspective learning process, the invention provides a new false information detection method: a mutual-learning-based multi-modal detection method (MMNet). The mutual learning mechanism of MMNet enables information transfer between the different perspectives so that multi-modal information can be fused better. Specifically, the model of the present invention consists of two key modules: a text-oriented classifier and a visual-oriented classifier. In each module, a new multi-modal fusion block is designed to extract text-oriented/visual-oriented multi-modal features, in which the interaction between the two modalities is well characterized. In particular, the multi-modal fusion block first captures cross-modal interactions through cooperative attention, and then extracts key information for further false news classification through self-attention.
The effectiveness of the method of the invention is verified below with more specific examples.
1.1 Experimental setup
1.1.1 data set.
To evaluate the performance of the proposed MMNet, the embodiment of the invention uses two widely used datasets: a microblog dataset and a Twitter dataset. The microblog dataset was collected from news agencies and the microblog platform. Each post contains text, an attached picture, and social context information. Fake news was collected from May 2012 to January 2016 and verified by the official microblog rumor debunking system. The Twitter dataset was released for the Verifying Multimedia Use task, which aims to detect false multimedia content on social media. Each tweet in the dataset includes text, a picture, and the associated social context information. The microblog dataset contains 9528 unique pictures and the Twitter dataset contains 514 unique pictures. The embodiment of the invention splits the microblog dataset into a training set and a test set. For the Twitter dataset, the invention uses the existing preprocessing method to obtain its training and test sets. Table 1 shows the statistics of these two datasets.
TABLE 1 Statistics of the microblog and Twitter datasets
News         Microblog   Twitter
Real news    4779        6026
False news   4749        7898
1.1.2 implementation details.
For word embedding, the embodiment of the present invention uses pre-trained BERT to extract text features with a dimensionality of 768; a Chinese pre-trained BERT model is used to obtain word embeddings on the microblog dataset and an English pre-trained BERT model is used on the Twitter dataset. For image block embedding, the embodiment of the present invention employs vit_base_patch16_224 to extract visual features of dimension 768, where the image is resized to 224×224 and the block size is 16. The embodiment of the present invention uses the adaptive moment estimation (Adam) optimizer to find the best parameters, with an initial learning rate of 0.0001.
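The preprocessing described above can be sketched as follows; the tokenizer and image-processor checkpoints, the example inputs, and the file path are assumptions for illustration only.

```python
import torch
from transformers import BertTokenizer, ViTImageProcessor
from PIL import Image

# Text side: a pre-trained BERT tokenizer (checkpoint chosen by language; an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_inputs = tokenizer(["一条待检测的新闻文本"], padding=True, truncation=True,
                        max_length=128, return_tensors="pt")

# Image side: resize to 224x224 for ViT-B/16 (patch size 16), as described above.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
pixel_values = processor(images=Image.open("news_image.jpg"), return_tensors="pt").pixel_values

# Optimizer: Adam with the stated initial learning rate of 1e-4; `model` stands for the
# assembled MMNet built from the sketches above.
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```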
In addition to the accuracy metric, the embodiment of the invention also reports the precision, recall, and F1 score for false news and real news obtained by the different methods.
1.2 base line
Embodiments of the present invention compare the MMNet model with the single-modality and multi-modality models, as follows.
A single mode model. Embodiments of the present invention compare MMNet with the following model that uses text only for false news detection.
SVM-TS: SVM-TS uses a linear SVM classifier and heuristic rules to detect false news.
GRU: the GRU models textual semantic information for false news detection using a multi-layer GRU network.
A multi-modal model. The present example compares MMNet with the following 7 multimodal models.
att_RNN: att_RNN applies an attention-based RNN to fuse text, visual, and social context features for false news detection. In this experiment, the part that handles social context features was removed.
EANN: EANN employs an event adversarial neural network with an auxiliary event discriminator to remove event-specific features and retain the features shared across events for detecting false news. The embodiment of the invention uses a simplified version of EANN in the experiments, excluding the event discriminator.
MVAE: MVAE combines a variational auto-encoder with a binary classifier for false news detection.
SpotFake: SpotFake learns text features using pre-trained BERT and image features using VGG-19 pre-trained on ImageNet to detect false news.
SpotFake +: SpotFake + upgrades the pre-trained language model in SpotFake to XLNET to effectively detect false news.
HMCAN: HMCAN adopts a hierarchical multi-modal contextual attention network and performs false news detection by jointly modeling multi-modal contextual information and the hierarchical semantics of the text.
MCAN: MCAN employs multiple layers of collaborative attention to fuse multimodal features to complete the task of false news detection.
1.3 results and analysis
Table 2 shows the overall performance comparison of the different methods on the two data sets. The best results are bolded and the second best results are underlined.
Table 2: comparison of different models on microblog and Twitter data sets
From table 2, the following conclusions can be drawn:
1) the MMNet provided by the invention is superior to other models in accuracy index and F1 score. Compared with the optimal baseline model, the accuracy of MMNet on microblog and Twitter datasets was improved by about 0.78% and 3.7%, respectively.
2) A model that considers multi-modal information is superior to a model that considers only single-modal information. This demonstrates that integrating multiple modalities is advantageous for the task of false news detection.
3) Models based on cooperative attention perform better than methods that simply concatenate modal features or rely on auxiliary tasks (e.g., SpotFake, EANN), indicating that they fuse multimodal information better.
4) The MMNet of the embodiment of the present invention further outperforms the cooperative-attention-based approaches (i.e., HMCAN and MCAN). We believe this is because the mutual learning network facilitates multi-modal fusion for false news detection by transferring information between the text-oriented and visual-oriented perspectives. In addition, the multi-modal fusion block designed in the embodiment of the present invention not only captures the interactions among multi-modal information but also selects key information to identify false news.
1.4 ablation study
To verify the importance of each module in MMNet, embodiments of the invention compare MMNet with the following variants.
MMNet-T: a variant of MMNet, based only on text-oriented classifiers.
MMNet-V: a variation of MMNet, using only vision-oriented classification
MMNet-Co: a variant of MMNet replaces the multimodal fusion block with a traditional collaborative attention network.
MMNet-Avg: a variant of MMNet combines MMNet-T and MMNet-V by averaging the prediction probabilities.
FIG. 5 shows the results of an ablation study, with the models for each set of indices from left to right in FIGS. 5 (a) and (b) being MMNet-T, MMNet-V, MMNet-Co, MMNet-Avg, and MMNet of the present invention, respectively. From fig. 5 the following observations can be made:
1) the MMNet model of the invention outperforms all variants. The main reason is that mutual learning based on text-oriented classifiers and visual-oriented classifiers can convey information between each other, facilitating multi-modal fusion.
2) Both MMNet-T and MMNet-V achieved lower performance, indicating that using only text-oriented or visual-oriented classifiers is a suboptimal option for false news detection. On the microblog data set, MMNet-T performs much better than MMNet-V. The reason may be that the text on the micro-blog is relatively long, containing more false news detection information.
3) The performance of MMNet on two data sets is superior to that of MMNet-Co, which shows that the multi-mode fusion block designed by the invention is effective, can capture interaction between modes, and can select key information for false news detection.
4) Compared with MMNet-Avg, which combines the text-oriented classifier and the visual-oriented classifier by simple averaging, the performance of MMNet is significantly improved. This further illustrates the effectiveness of mutual learning between the different perspectives for false news detection.
1.5 case study
Furthermore, in order to have an intuitive understanding of the mutual learning strategy, the embodiment of the present invention visualizes the word attention weights on the image blocks calculated by formula (5) and formula (6). For convenience of explanation, the embodiments of the present invention reflect the attention weight on the opacity of the image block. If the attention value is greater than the average attention weight, the opacity is set to 255; otherwise, opacity is set to 76. The embodiment of the invention shows the visualization results of MMNet-Avg and MMNet in figure 6. FIG. 6 is a visualization of attention weights of words on an image block. Each example consists of text (at the top, the words for attention visualization are "dog" and "explosion") and three images, including the original image (on the left), the attention image derived by MMNet-Avg without mutual learning (in the middle), and the attention image derived by MMNet of an embodiment of the invention (on the right).
As can be seen from fig. 6 (a), for the word "dog", MMNet attends to the corresponding object, while in MMNet-Avg, which lacks mutual learning, the attention is scattered across the corners of the image. Likewise, in fig. 6 (b), the word "explosion" gives more attention to the corresponding region in the MMNet of the embodiment of the present invention. These cases show that mutual learning between the text-oriented and visual-oriented perspectives can better capture the interdependencies of multimodal information.
1.6 Influence of the λ_KL value
To explore the influence of the mutual learning loss weight λ_KL on model performance, the embodiment of the invention varies λ_KL from 1e-5 to 0.1 and reports the accuracy on the two datasets in fig. 7. It can be observed that the accuracy of false news detection generally increases first on both datasets and reaches its maximum at λ_KL = 0.01; when λ_KL exceeds 0.01, the accuracy begins to decline. In general, when λ_KL is 0.01, MMNet best balances the individual classifiers and their mutual learning.
The embodiment of the invention provides a novel MMNet model based on a mutual learning network for multi-modal false news detection. The model can transfer information between the text-oriented and visual-oriented perspectives and promote multi-modal fusion to enhance false news detection. In addition, the embodiment of the invention designs a new multi-modal fusion block, which can not only capture the interaction between the modalities but also select the key information for false news detection. A large number of experiments on two public real-world datasets verified the effectiveness of the MMNet model of the embodiment of the invention.
Although the present application has been described with reference to a few embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for training a false information detection model is characterized by comprising the following steps:
acquiring training information, wherein the training information comprises a text and an image related to the text;
inputting the text and the image into respective encoders to obtain a text representation and an image representation;
interacting the text representation with the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features;
respectively inputting the text-oriented multi-modal features and the visual-oriented multi-modal features into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value;
determining an objective function of a false information detection model according to the text-oriented loss value and the vision-oriented loss value, and training the false information detection model according to the objective function.
2. The method of training the false information detection model according to claim 1, wherein the inputting the text and the image into respective encoders to obtain a text representation and an image representation comprises:
acquiring the embedded information of the text, and inputting the embedded information of the text into a text encoder to obtain text representation;
acquiring the embedded information of the image, and inputting the embedded information of the image into an image encoder to obtain image representation;
the text encoder and the image encoder are both Transformer encoders;
the text representation and the image representation each include a query vector, a key vector, and a value vector.
3. The method for training the false information detection model according to claim 2, wherein the interacting the text representation and the image representation to determine text-oriented multi-modal features and visual-oriented multi-modal features comprises:
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a text-oriented multi-modal fusion block to obtain text-oriented multi-modal features;
and inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features.
4. The training method of the false information detection model according to claim 3, wherein the text-oriented multi-modal fusion block comprises a first modal interaction unit and a first key information selection unit;
inputting the query vector in the text representation and the key vector and the value vector in the image representation into a multi-modal fusion block facing the text to obtain multi-modal features facing the text, which specifically comprises:
inputting the query vector in the text representation and the key vector and value vector in the image representation into a first modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion features to a first key information selection unit to obtain multi-modal features oriented to texts;
the mechanism used in the first modal interaction unit is a cooperative attention mechanism;
the mechanism used by the first key information selection unit is a self-attention mechanism.
5. The training method of the false information detection model according to claim 3, wherein the visual-oriented multi-modal fusion block comprises a second modal interaction unit and a second key information selection unit;
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a visual-oriented multi-modal fusion block to obtain visual-oriented multi-modal features, specifically comprising:
inputting the query vector in the image representation and the key vector and the value vector in the text representation into a second modal interaction unit to obtain a preliminary multi-modal fusion feature;
inputting the preliminary multi-modal fusion feature to a second key information selection unit to obtain the visual-oriented multi-modal features;
the mechanism used in the second modal interaction unit is a cooperative attention mechanism;
the mechanism used by the second key information selection unit is a self-attention mechanism.
6. The method for training the false information detection model according to any one of claims 1-5, wherein the text-oriented multi-modal features and the visual-oriented multi-modal features are respectively input into respective classifiers to obtain a text-oriented loss value and a visual-oriented loss value, and specifically comprises:
inputting the text-oriented multi-modal features into a text-oriented classifier to obtain a first prediction probability of the training information, and determining a text-oriented loss value according to the first prediction probability and a true value of the training information;
and inputting the visual-oriented multi-modal features into a visual-oriented classifier to obtain a second prediction probability of the training information, and determining a visual-oriented loss value according to the second prediction probability and the true value of the training information.
7. The method for training the dummy information detection model according to claim 6, wherein determining the objective function of the dummy information detection model according to the text-oriented loss value and the visual-oriented loss value comprises:
determining a first interaction loss value of the text-oriented classifier to the visual-oriented classifier and a second interaction loss value of the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability;
and determining an objective function of a false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value and the second interaction loss value.
8. The method of claim 7, wherein determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier according to the first prediction probability and the second prediction probability comprises:
and determining a first interaction loss value from the text-oriented classifier to the visual-oriented classifier and a second interaction loss value from the visual-oriented classifier to the text-oriented classifier by using information divergence according to the first prediction probability and the second prediction probability.
9. The method for training the false information detection model according to claim 7, wherein determining the objective function of the false information detection model according to the text-oriented loss value, the visual-oriented loss value, the first interaction loss value, and the second interaction loss value specifically includes:
determining an objective function of a false information detection model by using a first formula, wherein the first formula is as follows:
L = L_T + L_I + λ_KL · ( L_{T→I} + L_{I→T} )
in the formula, L_T is the text-oriented loss value, L_I is the visual-oriented loss value, L_{T→I} is the first interaction loss value, L_{I→T} is the second interaction loss value, and λ_KL is the weight of the mutual learning loss.
10. A false information detection method, comprising:
acquiring information to be detected;
inputting the information to be detected into a false information detection model obtained by training by using the training method of the false information detection model according to any one of claims 1 to 9, and obtaining a false information detection result.
CN202210301579.XA 2022-03-25 2022-03-25 False information detection model training method and false information detection method Pending CN114662596A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210301579.XA CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210301579.XA CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Publications (1)

Publication Number Publication Date
CN114662596A true CN114662596A (en) 2022-06-24

Family

ID=82030827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210301579.XA Pending CN114662596A (en) 2022-03-25 2022-03-25 False information detection model training method and false information detection method

Country Status (1)

Country Link
CN (1) CN114662596A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496140A (en) * 2022-09-19 2022-12-20 北京邮电大学 Multi-mode false news detection method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination