CN114528912A - False news detection method and system based on progressive multimodal fusion network - Google Patents


Info

Publication number
CN114528912A
CN114528912A (application CN202210021501.2A)
Authority
CN
China
Prior art keywords
feature
level
fusion
encoder
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210021501.2A
Other languages
Chinese (zh)
Inventor
敬静
吴泓辰
孙杰
房晓畅
张化祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Normal University
Original Assignee
Shandong Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Normal University filed Critical Shandong Normal University
Priority to CN202210021501.2A priority Critical patent/CN114528912A/en
Publication of CN114528912A publication Critical patent/CN114528912A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The invention discloses a false news detection method and system based on a progressive multimodal fusion network. The method comprises: acquiring news data to be detected, the news data comprising image information and text information; and detecting the news data to be detected with a pre-trained false news detection model. The false news detection model comprises a visual feature encoder with n sequentially connected visual feature extraction blocks, a feature fusion device with n sequentially connected feature fusion blocks, and a text feature encoder whose output is connected to the level-1 feature fusion block. The output of the i-th visual feature extraction block is connected to the i-th feature fusion block, where i < n, and the outputs of the n-th visual feature extraction block and the (n-1)-th feature fusion block are connected to the n-th feature fusion block. Through this progressive fusion method, the invention achieves fine-grained multimodal information fusion and improves detection accuracy.

Description

False news detection method and system based on progressive multimodal fusion network
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a false news detection method and system based on a progressive multimodal fusion network.
Background
With the rapid development of mobile internet technology, social media platforms such as Twitter and Weibo have become important channels through which people obtain massive amounts of information, and they also make it easy to publish and spread false news. Moreover, articles with pictures are increasingly popular on social media; compared with text-only articles, pictures carry richer information and more readily attract readers' attention. False news often combines misleading or tampered pictures with text. Visual content has therefore become a non-negligible component of false news detection, and there is a need for a method that automatically detects the authenticity of articles with pictures, so as to alleviate the serious negative effects caused by false news.
In recent years, methods for detecting false information have diversified. One approach is manual fact checking, which includes expert fact checking and crowd-sourced fact checking. Expert fact checking is accurate but time-consuming and labor-intensive; crowd-sourced fact checking scales well but is less accurate. Owing to the limitations of manual fact checking, some researchers manually extract features from news text using expert knowledge and then train a false news classifier with traditional machine learning algorithms, but this approach lacks comprehensiveness and flexibility. Deep learning models have stronger feature extraction capability, can automatically extract features from news content, and achieve better performance.
As false news becomes more diverse, verifying the authenticity of articles with pictures places higher demands on false information detection technology, and several deep-learning-based methods have been successfully applied to multimodal false news detection. First, models such as that of Khattar et al. simply extract and fuse text and picture features with a multimodal variational autoencoder, but their feature extraction and fusion are not fine-grained enough. Second, Jin et al. created an end-to-end RNN-based false news detection model that uses a local attention mechanism to combine text, images and social context features, and Wang et al. proposed the Event Adversarial Neural Network (EANN), which uses an event discriminator to learn feature representations of text and images in articles; however, the additional auxiliary features increase the cost of detection. Moreover, these methods only consider the spatial domain of the picture and ignore its frequency domain, so they do not capture picture information sufficiently. Third, Wu et al. proposed Multimodal Co-Attention Networks (MCAN) for false information detection; MCAN learns the interdependence among multimodal features and performs well, but it only focuses on the fusion of deep-level features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a false news detection method and system based on a progressive multimodal fusion network. Through a progressive fusion method, fine-grained multimodal information fusion is achieved and detection accuracy is improved.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
A false news detection method based on a progressive multimodal fusion network comprises the following steps:
acquiring news data to be detected, wherein the news data comprises image information and text information;
detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier;
the visual feature encoder comprises n sequentially connected visual feature extraction blocks, the feature fusion device comprises n sequentially connected feature fusion blocks, and the output of the text feature encoder is connected to the level-1 feature fusion block; the output of the i-th visual feature extraction block is connected to the i-th feature fusion block, where i < n; and the outputs of the n-th visual feature extraction block and the (n-1)-th feature fusion block are connected to the n-th feature fusion block.
Further, the visual feature encoder includes a spatial domain feature encoder and a frequency domain feature encoder.
Further, after the news data to be detected is acquired, image segmentation is performed on the image information therein to obtain a plurality of non-overlapping patches of size k×k; each patch is unfolded and its R, G, B components extracted to obtain a feature vector of size k×k×3, which is input to the spatial domain feature encoder through a linear embedding layer;
in the spatial domain feature encoder, each next-level visual feature extraction block downsamples and channel-expands the feature map obtained by the previous-level visual feature extraction block.
Further, after the news data to be detected is acquired, discrete Fourier transform is performed on the image information therein to obtain frequency domain information; the imaginary and real parts of the frequency domain information are separated and concatenated as the input of the frequency domain feature encoder.
Further, the text feature encoder adopts a bidirectional Transformer pre-training model to extract features.
Further, the level-1 feature fusion block fuses the obtained spatial domain visual features, frequency domain visual features and text features using a multilayer perceptron; the fused features are then combined with the text features T as the input of the next-level feature fusion block.
Further, the classifier comprises a fully connected layer, whose output is passed through a softmax function to generate the distribution over classification labels.
One or more embodiments provide a false news detection system based on a progressive multimodal fusion network, comprising:
the data acquisition module is used for acquiring news data to be detected, and the news data comprises image information and text information;
the false detection module is used for detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier;
the visual feature encoder comprises n sequentially connected visual feature extraction blocks, the feature fusion device comprises n sequentially connected feature fusion blocks, and the output of the text feature encoder is connected to the level-1 feature fusion block; the output of the i-th visual feature extraction block is connected to the i-th feature fusion block, where i < n; and the outputs of the n-th visual feature extraction block and the (n-1)-th feature fusion block are connected to the n-th feature fusion block.
One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the progressive multimodal fusion network based false news detection method when executing the program.
One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the progressive multimodal fusion network based false news detection method.
One or more of the technical schemes have the following beneficial effects:
In the feature extraction stage, a progressive fusion strategy captures representation information at different levels of the image and the text, so that the features of each modality are fused at a finer granularity; the information contained in images and text is thus fully mined and the model's detection accuracy is improved.
For image features, considering that images in false news are often tampered with, image features are extracted at two levels, the spatial domain and the frequency domain, improving the model's sensitivity to false news.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow diagram of a false news detection method based on a progressive multimodal fusion network according to one or more embodiments of the present invention;
FIG. 2 is a block diagram of progressive multimodal feature extraction and fusion in one or more embodiments of the invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example one
The embodiment discloses a false news detection method based on a progressive multi-modal fusion network, which comprises the following steps as shown in fig. 1:
step 1: acquiring news data to be detected, wherein the news data comprises image information and text information;
step 2: performing discrete Fourier transform on the image information to obtain frequency domain information;
and step 3: detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier.
The visual feature encoder comprises a spatial domain feature encoder and a frequency domain feature encoder; each comprises n levels of visual feature extraction blocks, where n is a natural number greater than 2.
The feature fusion device comprises n levels of feature fusion blocks, where the output of the i-th feature fusion block is connected to the input of the (i+1)-th feature fusion block, i < n.
The text feature encoder comprises a text feature extraction block whose output is connected to the level-1 feature fusion block.
The output of the i-th visual feature extraction block of the spatial domain feature encoder is split into two paths, one connected to the (i+1)-th visual feature extraction block and the other to the i-th feature fusion block.
The output of the i-th visual feature extraction block of the frequency domain feature encoder is likewise split into two paths, one connected to the (i+1)-th visual feature extraction block and the other to the i-th feature fusion block.
The output of the text feature encoder is connected to the level-1 feature fusion block, which fuses the two paths of visual features with the text features.
The outputs of the n-th visual feature extraction blocks of the spatial domain and frequency domain feature encoders, together with the output of the (n-1)-th feature fusion block, are connected to the n-th feature fusion block.
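As a toy illustration of this wiring, the sketch below shows how each level's visual features and the carried fusion result feed the next fusion block. The function names and string "features" are hypothetical stand-ins (the real blocks are Swin Transformer/VGG19 stages and MLP-based fusion blocks), and the two visual paths are collapsed into one for brevity:

```python
def extract_block(level, x):
    # stand-in for the level-th visual feature extraction block
    return f"visual{level}({x})"

def fuse_block(level, visual_feat, carry):
    # stand-in for the level-th feature fusion block
    return f"fuse{level}({visual_feat},{carry})"

def progressive_fusion(image, text_feat, n=4):
    """Level 1 fuses visual features with the text features; every later
    level fuses the current visual features with the previous fusion output."""
    carry = text_feat
    x = image
    for i in range(1, n + 1):
        x = extract_block(i, x)          # i-th visual extraction block
        carry = fuse_block(i, x, carry)  # i-th fusion block
    return carry

out = progressive_fusion("img", "T", n=2)
# → "fuse2(visual2(visual1(img)),fuse1(visual1(img),T))"
```

The carried value makes the progressive structure explicit: the text features enter only at level 1, after which each fusion block consumes its predecessor's output.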
In this embodiment, the process of feature extraction and fusion is specifically described with n = 4 as an example.
(I) Text feature encoder
The multi-mode false news detection mainly comprises information of two modes, namely text and image. The text is a main expression mode of news events, and provides an important clue for judging the credibility of news. Most of the existing methods use a recurrent neural network to model the context information of the input text and capture the surface features of the text, but the fact knowledge extracted by the methods is very limited, and the semantic features of false news are difficult to capture. In order to better extract context information and semantic information of text information, a pre-trained BERT model is adopted for text feature extraction. BERT is trained on a large-scale data set, has strong modeling capability, and has a large amount of common knowledge and semantic knowledge learned therein. Moreover, BERT consists of stacked self-attention layers, which can better capture the connection between contexts.
Specifically, the input to the text feature encoder is the word sequence of the sentences in the text, with each sentence embedded as a vector. We denote the k-dimensional vector of the i-th word of the f-th sentence as T_i^f.
Denoting the bidirectional Transformer pre-training model containing a 12-layer encoder as BERT, the sentence T is input into BERT to obtain the feature vectors of the sentence:
V_f = BERT(T_f)
where V_f represents the feature vector of the f-th sentence encoded by the BERT pre-training model, and V_f^n is the k-dimensional feature vector of the word at the n-th position in the f-th sentence. From the per-word feature vectors, the feature F_t of the entire text is obtained by mean pooling over all words, capturing the contextual and semantic information contained in the text.
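The mean-pooling step can be sketched as follows; the word vectors here are random stand-ins for BERT outputs, and only the pooling itself is shown:

```python
import numpy as np

# Toy word-level feature vectors for one sentence: n_words words, k dims each.
# In the method these would come from BERT; random values stand in here.
rng = np.random.default_rng(0)
n_words, k = 5, 8
V_f = rng.standard_normal((n_words, k))

# Mean pooling over word positions yields the text feature F_t.
F_t = V_f.mean(axis=0)
assert F_t.shape == (k,)
```

Averaging over the word axis produces a single fixed-size vector regardless of sentence length, which is what lets F_t feed the level-1 fusion block.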
(II) Visual feature encoder
The images contained in news are important for judging the authenticity of an article; articles whose images are inconsistent with the text or have been maliciously tampered with are not credible. We therefore extract features from both the spatial domain information and the frequency domain information of the image: the spatial domain emphasizes semantic extraction, while the frequency domain emphasizes whether the image has been modified, since modified images are easier to detect in frequency space.
Spatial domain of the image: in recent work, Transformers have been widely used and successful in many image understanding tasks. Here we use a Swin Transformer, pre-trained on the ImageNet dataset, to extract visual spatial semantic features. We use four Swin Transformer blocks to extract visual features at different depths.
Specifically, an image is first divided into non-overlapping patches by a patch partition module. Each patch is treated as a token; we set the patch size to 4×4, so unrolling each RGB patch yields a 4×4×3 = 48-dimensional feature vector, which is mapped into a feature space of dimension 96 by a linear embedding layer. The hierarchical representation then passes through 4 stages; after each stage the feature map is downsampled by a factor of 2 and the number of channels is doubled before being input to the next stage. This is expressed as:
Stage_i = SwinB(σ(W × Stage_{i-1}))
where Stage_i and Stage_{i-1} are the output and input of the i-th stage, and SwinB is a Swin Transformer block composed of stacked self-attention layers; the numbers of layers and attention heads in the 4 stages are [2, 2, 6, 2] and [3, 6, 12, 24] respectively, and W is the learnable downsampling parameter. The feature vector output by stage 4 is mapped to a linear vector by a linear layer.
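A minimal numpy sketch of the patch partition, linear embedding, and stage-wise shape progression follows. The 32×32 toy image and random embedding weights are stand-ins (a real input would be larger and the embedding learned); only the tensor bookkeeping is illustrated:

```python
import numpy as np

H = W = 32
patch = 4
img = np.arange(H * W * 3, dtype=float).reshape(H, W, 3)

# split into non-overlapping 4x4 patches, each flattened to 4*4*3 = 48 dims
patches = (img.reshape(H // patch, patch, W // patch, patch, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, patch * patch * 3))
assert patches.shape == (64, 48)          # (32/4) * (32/4) patches

rng = np.random.default_rng(0)
W_embed = rng.standard_normal((48, 96))   # stand-in for the linear embedding
tokens = patches @ W_embed                # embed each patch to dim 96
assert tokens.shape == (64, 96)

# each later stage halves spatial resolution and doubles channels
shape = (H // patch, W // patch, 96)      # stage-1 feature map: (8, 8, 96)
for _ in range(3):                        # stages 2-4
    shape = (shape[0] // 2, shape[1] // 2, shape[2] * 2)
assert shape == (1, 1, 768)
```

With a 4×4 patch and 96 starting channels, the four stages produce channel widths 96, 192, 384, 768, matching the hierarchical design described above.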
Frequency domain of the image: research has shown that tampered images are more easily detected in frequency space. Considering that false news often contains tampered images, we extract features from the image frequency domain information to guide false news detection. The image is first converted from the spatial domain to the frequency domain using the discrete Fourier transform (DFT); to obtain deeper features, VGG19 is adopted as the feature extractor, and the imaginary and real parts of the frequency domain information are separated, concatenated, and input to VGG19 to obtain a deeper semantic vector:
f_F = VGG19(concat(IF_imag, IF_real))
where IF_imag represents the imaginary part of the image frequency domain information and IF_real represents the real part. The features after the discrete Fourier transform contain more information than the discrete cosine transform used in previous work.
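The frequency-domain preprocessing can be sketched with numpy's FFT. A toy single-channel image stands in for the news image, and the VGG19 extractor itself is omitted:

```python
import numpy as np

# 2-D discrete Fourier transform of a toy grayscale image
img = np.arange(16.0).reshape(4, 4)
freq = np.fft.fft2(img)

# separate the imaginary and real parts and concatenate them channel-wise,
# as in the method's input to the frequency domain feature extractor
IF_imag = np.imag(freq)
IF_real = np.real(freq)
freq_input = np.stack([IF_imag, IF_real], axis=0)
assert freq_input.shape == (2, 4, 4)

# sanity check: the inverse transform recovers the original image
assert np.allclose(np.fft.ifft2(freq).real, img)
```

Keeping both real and imaginary parts preserves the full complex spectrum (magnitude and phase), which is why they are concatenated rather than discarded.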
(III) Feature fusion device
Image information and text information in news are complementary, and readers often compare images against the text, so fusing textual and visual information is a crucial part of false news detection. We design a progressive fusion scheme that processes the shallow information of the image and the text information in stages, making full use of both. An MLP-Mixer block is used as the fusion module, fusing feature information between different modalities at a finer granularity.
On the image side, the spatial domain feature extractor obtains features of different depths at its different stages; in extractor order we label these 4 features stage1, stage2, stage3, stage4. On the frequency domain side, we label the outputs of the 2nd, 4th, 8th and 16th convolutional layers of VGG19 as v1, v2, v3, v4. The text features extracted by the text feature encoder are denoted T. Taking the fusion of the shallow feature stage1 as an example, the level-1 feature fusion block performs the following operations:
(1) The channel number c of stage1 and v1 is expanded to 512 by a convolution layer with kernel size 3; the expanded feature maps are average-pooled to size (B, 512, 1, 1), then flattened and linearly mapped to 1000-dimensional feature vectors;
(2) The three vectors stage1, v1 and T are each expanded on dim 1 to (B, 3, 1000) to balance the distributions of the different modalities, and the 3 feature vectors are then concatenated on dimension 1 into a feature F of shape (B, 9, 1000);
(3) Two MLP layers perform feature fusion on F along dimension 2; the fused features are transposed and fused again by an MLP, realizing fusion of the original features along dimension 1; finally the inverse transposition and feature compression recover a feature vector of the same size as T;
(4) The fused features are added to T as a residual connection, reducing model risk and improving feature extraction capability.
F_i = MlpMixer(cat(stage_i, v_i, T)) + T
where MlpMixer denotes the linear-layer-based feature fusion module; we use ReLU and LayerNorm to improve fusion capability. The image and text features are fused progressively from shallow to deep, improving the degree of association between features of different modalities.
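A hedged numpy sketch of one fusion step in the spirit of an MLP-Mixer: the three modality features are stacked into a token matrix, mixed across tokens and then across channels by small MLPs, compressed back to the size of T, and added to T as a residual. All weights are random stand-ins and the shapes are toy-sized (the real block also uses ReLU/LayerNorm and larger dimensions):

```python
import numpy as np

def mlp(x, W1, W2):
    # two-layer perceptron with ReLU, applied along the last axis
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 16
stage_i, v_i, T = (rng.standard_normal(d) for _ in range(3))

F = np.stack([stage_i, v_i, T])                 # token matrix, shape (3, d)

# token mixing: transpose so the 3 tokens lie on the last axis
W1t, W2t = rng.standard_normal((3, 8)), rng.standard_normal((8, 3))
F = mlp(F.T, W1t, W2t).T                        # back to (3, d)

# channel mixing along the feature dimension
W1c, W2c = rng.standard_normal((d, 32)), rng.standard_normal((32, d))
F = mlp(F, W1c, W2c)

# compress to the size of T and add T as the residual connection
fused = F.mean(axis=0) + T
assert fused.shape == T.shape
```

The residual add of T mirrors step (4) above: even if the mixer output is uninformative, the carried feature survives, which stabilizes the progressive chain.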
The level-2 feature fusion block performs fusion on the output features of the previous fusion block together with stage2 and v2; the implementation follows level 1, except that the text feature T is replaced by the output of the previous fusion block. The level-3 feature fusion block likewise fuses the previous output with stage3 and v3.
The level-4 feature fusion block performs the final fusion, combining the output of the level-3 block with stage4 and v4 to obtain the final fused feature.
(IV) News classifier
We input the fused multimodal feature representation into a fully connected layer, whose output passes through a softmax function to generate the distribution over classification labels:
p = softmax(W_C x + b_C)
where W_C and b_C are the parameters of the fully connected layer. We use the cross-entropy loss function:
L = -Σ [ y_f log p_f + (1 - y_f) log(1 - p_f) ]
where y_f is the true label of the sample (0 indicates false news, 1 indicates real news) and p_f is the probability predicted for the sample.
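The classifier head and loss can be illustrated numerically as follows; W_C, b_C and the input feature x are random stand-ins for learned values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(8)            # fused multimodal feature
W_C = rng.standard_normal((2, 8))     # fully connected layer weights
b_C = rng.standard_normal(2)

p = softmax(W_C @ x + b_C)            # distribution over {false, real}
assert np.isclose(p.sum(), 1.0)

# binary cross-entropy for one sample, as in the loss above
y_f = 1                               # ground-truth label (1 = real news)
p_f = p[1]                            # predicted probability of "real"
loss = -(y_f * np.log(p_f) + (1 - y_f) * np.log(1 - p_f))
assert loss >= 0.0
```

The loss reduces to -log p_f for real-news samples and -log(1 - p_f) for false-news samples, so minimizing it pushes the predicted probability toward the true label.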
The method is evaluated on the Weibo and Twitter datasets, and its detection accuracy is superior to existing models.
Example two
This embodiment provides a false news detection system based on a progressive multimodal fusion network, comprising:
the data acquisition module is used for acquiring news data to be detected, and the news data comprises image information and text information;
the false detection module is used for detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier;
the visual feature encoder comprises n sequentially connected visual feature extraction blocks, the feature fusion device comprises n sequentially connected feature fusion blocks, and the output of the text feature encoder is connected to the level-1 feature fusion block; the output of the i-th visual feature extraction block is connected to the i-th feature fusion block, where i < n; and the outputs of the n-th visual feature extraction block and the (n-1)-th feature fusion block are connected to the n-th feature fusion block.
EXAMPLE III
The embodiment aims to provide an electronic device.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the progressive multimodal fusion network based false news detection method as described in embodiment one.
Example four
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a false news detection method based on a progressive multimodal fusion network as described in the first embodiment.
The steps involved in the second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A false news detection method based on a progressive multimodal fusion network, characterized by comprising the following steps:
acquiring news data to be detected, wherein the news data comprises image information and text information;
detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier;
the visual feature encoder comprises n levels of visual feature extraction blocks connected in sequence, and the feature fusion device comprises n levels of feature fusion blocks connected in sequence; the output of the text feature encoder is connected to the level-1 feature fusion block; the output of the level-i visual feature extraction block is connected to the level-i feature fusion block, where i < n; and the outputs of the level-n visual feature extraction block and the level-(n-1) feature fusion block are both connected to the level-n feature fusion block.
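The wiring recited in claim 1 can be sketched as follows; this is a minimal NumPy illustration only, in which a simple averaging step stands in for the actual (learned) fusion blocks and all feature dimensions are assumed:

```python
import numpy as np

def progressive_fusion(text_feat, visual_feats):
    """Sketch of the claim-1 wiring: the level-1 fusion block receives the
    text features and the level-1 visual features; each level-i fusion
    output (i < n) feeds the next fusion block together with the next
    level's visual features; the level-n block combines the level-n visual
    features with the level-(n-1) fusion output. Averaging is a stand-in
    for the real fusion blocks."""
    fused = None
    for i, v in enumerate(visual_feats):
        if i == 0:
            fused = (text_feat + v) / 2.0   # level-1 fusion block
        else:
            fused = (fused + v) / 2.0       # level-i fusion block, i > 1
    return fused

# toy example: n = 3 levels of 4-dimensional features
text = np.ones(4)
visuals = [np.full(4, 2.0), np.full(4, 4.0), np.full(4, 6.0)]
out = progressive_fusion(text, visuals)
```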
2. The false news detection method of claim 1, wherein the visual feature encoder comprises a spatial-domain feature encoder and a frequency-domain feature encoder.
3. The false news detection method of claim 2, wherein after the news data to be detected is acquired, the image information therein is segmented into a plurality of non-overlapping k × k patches; each patch is unfolded and its R, G and B components are extracted to obtain a feature vector of size k × k × 3, which is fed into the spatial-domain feature encoder through a linear embedding layer;
in the spatial-domain feature encoder, each subsequent visual feature extraction block downsamples the feature map produced by the previous block and expands its channels.
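The patch-unfolding step of claim 3 can be sketched as follows (a minimal NumPy illustration; image size and patch size are assumed, and the linear embedding layer is omitted):

```python
import numpy as np

def image_to_patch_vectors(img, k):
    """Split an H x W x 3 image into non-overlapping k x k patches and
    unfold each into a vector of length k * k * 3 (the R, G, B components
    of claim 3). Assumes H and W are divisible by k."""
    h, w, c = img.shape
    patches = (img.reshape(h // k, k, w // k, k, c)
                  .transpose(0, 2, 1, 3, 4)   # group by (row-block, col-block)
                  .reshape(-1, k * k * c))    # one flat vector per patch
    return patches

# toy 4 x 4 RGB "image" -> four 2 x 2 patches, each a 12-dim vector
img = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
vecs = image_to_patch_vectors(img, 2)
```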
4. The false news detection method of claim 2, wherein after the news data to be detected is acquired, a discrete Fourier transform is applied to the image information therein to obtain frequency-domain information; and the imaginary and real parts of the frequency-domain information are separated and then concatenated as the input of the frequency-domain feature encoder.
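The transform-and-separate step of claim 4 can be sketched as follows; the stacking order of the real and imaginary parts is an illustrative assumption:

```python
import numpy as np

def frequency_domain_input(img_channel):
    """Claim-4 sketch: apply the 2-D discrete Fourier transform to an
    image channel, separate the real and imaginary parts, and stack them
    along a new leading axis as the frequency-domain encoder input."""
    spec = np.fft.fft2(img_channel)
    return np.stack([spec.real, spec.imag], axis=0)

x = np.eye(4)                      # toy 4 x 4 single-channel "image"
freq = frequency_domain_input(x)   # shape (2, 4, 4): [real; imag]
```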
5. The false news detection method of claim 1, wherein the text feature encoder performs feature extraction using a bidirectional Transformer pre-training model.
6. The false news detection method of claim 2, wherein the level-1 feature fusion block fuses the obtained spatial-domain visual features, frequency-domain visual features and text features using a multi-layer perceptron; the fused features are then combined with the features T to serve as the input of the next-level feature fusion block.
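The fusion step of claim 6 can be sketched as follows; this assumes that the features T denote the text features, and the perceptron weights, hidden size and ReLU activation are all illustrative choices not specified by the claim:

```python
import numpy as np

def mlp_fuse(spatial, freq, text, w1, w2):
    """Claim-6 sketch: concatenate the spatial-domain, frequency-domain
    and text features, pass them through a one-hidden-layer perceptron,
    then concatenate the fused result with the text features (assumed to
    be the features T) for the next-level fusion block."""
    x = np.concatenate([spatial, freq, text])
    hidden = np.maximum(0.0, w1 @ x)        # ReLU hidden layer
    fused = w2 @ hidden                     # fused d-dim output
    return np.concatenate([fused, text])    # [fused features ; T]

d = 4
rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 3 * d))        # illustrative weights
w2 = rng.standard_normal((d, 8))
out = mlp_fuse(np.ones(d), np.ones(d), np.ones(d), w1, w2)
```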
7. The false news detection method of claim 1, wherein the classifier comprises a fully-connected layer, the output of which is converted into a distribution over classification labels by a softmax function.
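The classifier of claim 7 can be sketched as follows (a minimal NumPy illustration; the two-class weights and bias are illustrative, not taken from the patent):

```python
import numpy as np

def classify(features, w, b):
    """Claim-7 sketch: one fully-connected layer followed by softmax,
    producing a probability distribution over the classification labels
    (e.g. real vs. fake news)."""
    logits = w @ features + b
    exp = np.exp(logits - logits.max())   # shift for numerical stability
    return exp / exp.sum()

w = np.array([[1.0, -1.0], [-1.0, 1.0]])  # illustrative 2-class weights
p = classify(np.array([2.0, 0.0]), w, np.zeros(2))
```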
8. A false news detection system based on a progressive multimodal fusion network, characterized by comprising:
the data acquisition module is used for acquiring news data to be detected, and the news data comprises image information and text information;
the false detection module is used for detecting the news data to be detected based on a pre-trained false news detection model; the false news detection model comprises a text feature encoder, a visual feature encoder, a feature fusion device and a classifier;
the visual feature encoder comprises n levels of visual feature extraction blocks connected in sequence, and the feature fusion device comprises n levels of feature fusion blocks connected in sequence; the output of the text feature encoder is connected to the level-1 feature fusion block; the output of the level-i visual feature extraction block is connected to the level-i feature fusion block, where i < n; and the outputs of the level-n visual feature extraction block and the level-(n-1) feature fusion block are both connected to the level-n feature fusion block.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for false news detection based on a progressive multimodal fusion network according to any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the false news detection method based on progressive multimodal fusion network as claimed in any one of claims 1 to 7.
CN202210021501.2A 2022-01-10 2022-01-10 False news detection method and system based on progressive multi-mode converged network Pending CN114528912A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210021501.2A CN114528912A (en) 2022-01-10 2022-01-10 False news detection method and system based on progressive multi-mode converged network

Publications (1)

Publication Number Publication Date
CN114528912A true CN114528912A (en) 2022-05-24

Family

ID=81620277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210021501.2A Pending CN114528912A (en) 2022-01-10 2022-01-10 False news detection method and system based on progressive multi-mode converged network

Country Status (1)

Country Link
CN (1) CN114528912A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100014A (en) * 2022-06-24 2022-09-23 山东省人工智能研究院 Multi-level perception-based social network image copying and moving counterfeiting detection method
CN115100014B (en) * 2022-06-24 2023-03-24 山东省人工智能研究院 Multi-level perception-based social network image copying and moving counterfeiting detection method
CN115330898A (en) * 2022-08-24 2022-11-11 晋城市大锐金马工程设计咨询有限公司 Improved Swin transform-based magazine, book and periodical advertisement embedding method
CN116091709A (en) * 2023-04-10 2023-05-09 北京百度网讯科技有限公司 Three-dimensional reconstruction method and device for building, electronic equipment and storage medium
CN117370679A (en) * 2023-12-06 2024-01-09 之江实验室 Method and device for verifying false messages of multi-mode bidirectional implication social network
CN117370679B (en) * 2023-12-06 2024-03-26 之江实验室 Method and device for verifying false messages of multi-mode bidirectional implication social network

Similar Documents

Publication Publication Date Title
Rao et al. A deep learning approach to detection of splicing and copy-move forgeries in images
CN114528912A (en) False news detection method and system based on progressive multi-mode converged network
CN108427738B (en) Rapid image retrieval method based on deep learning
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
Zhang et al. Patch strategy for deep face recognition
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
Yu et al. Stratified pooling based deep convolutional neural networks for human action recognition
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
Huang et al. A novel method for detecting image forgery based on convolutional neural network
CN112667841B (en) Weak supervision depth context-aware image characterization method and system
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN114282013A (en) Data processing method, device and storage medium
CN114691864A (en) Text classification model training method and device and text classification method and device
CN114299304B (en) Image processing method and related equipment
Peng et al. Detection of double JPEG compression with the same quantization matrix based on convolutional neural networks
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
CN116561272A (en) Open domain visual language question-answering method and device, electronic equipment and storage medium
CN116257609A (en) Cross-modal retrieval method and system based on multi-scale text alignment
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN111783734B (en) Original edition video recognition method and device
CN115861605A (en) Image data processing method, computer equipment and readable storage medium
Sui et al. Creating visual vocabulary based on SIFT descriptor in compressed domain
CN113792703B (en) Image question-answering method and device based on Co-Attention depth modular network
Zhan et al. Image orientation detection using convolutional neural network
Wei Research on chinese text classification algorithm based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination