CN117521012A

CN117521012A - False information detection method based on multi-mode context hierarchical step alignment

Info

Publication number: CN117521012A
Application number: CN202311569509.3A
Authority: CN
Inventors: 潘祯祥; 毛莺池; 熊力; 陈秉睿; 曹一凡; 戚荣志; 禹跃美; 贾璐瑶; 祖立辉; 吴波
Original assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd
Current assignee: Hohai University HHU; Huaneng Lancang River Hydropower Co Ltd
Priority date: 2023-11-23
Filing date: 2023-11-23
Publication date: 2024-02-06

Abstract

The invention discloses a false information detection method and a system based on multi-mode context layering step-by-step alignment, which acquire images and texts in social media blogs; establishing a local alignment module with bidirectional cross modal attention, and deducing a matching relation of fragment levels; the method comprises the steps of establishing a global context fusion entity feature extraction module, helping local objects learn more context semantics by utilizing a multi-head self-attention mechanism, and enhancing context representation of image and text features; designing an adaptive weight filtering module to align image-text entity characteristics, integrating all weighted characteristics to realize global matching according to the similarity of two modal enhancement characteristics, and inhibiting overall semantic deviation; and inputting the overall similarity of the two different levels of fusion entity characteristics into a binary classifier to obtain a false information detection result. By integrating the context information, the invention can reserve complete semantic information of texts and images and improve the accuracy of false information detection.

Description

False information detection method based on multi-mode context hierarchical step alignment

Technical Field

The invention relates to a false information detection method and system based on multi-mode context layering step-by-step alignment, in particular to a detection method and system for matching of images and text information in social media blogs, and belongs to the technical field of false news detection.

Background

With the increasing importance of social media, many bloggers have resorted to exaggerating their words to increase the transmissibility of their blogs. However, this type of blogging tends to be fraught with misleading information. The prior art detects false information in social media by using various algorithms, such as keyword detection, user behavior analysis, and machine learning models, but these methods have certain drawbacks. Firstly, the keyword detection method may generate false alarm or missing report due to different contexts; user behavior analysis is limited by data acquisition and is subject to privacy protection policies; although machine learning models have proven to be effective methods in recent years, traditional machine learning methods tend to not handle fine inconsistencies between images and text information well, which results in limited ability to detect fine-grained deceptive information.

For example, one example of a PHEME dataset shows a social media blog that distorts the fact information. In this example, the image shows a scenario of "a woman holding a child", and the accompanying text is described as "this woman has born 14 children". It can be found that there is a significant inconsistency in the two different forms of information in the presentation of the same subject "child", and further that from the exaggerated textual description it can be inferred that this piece of blog contains false information. Thus, performing a fine image-text context matching work is critical, so that proper alignment of information can be ensured, and thus the accuracy of identifying false information is improved.

Disclosure of Invention

The invention aims to: aiming at the two problems that in the existing fine granularity matching method, certain information words in text cannot strictly correspond to a certain area in an image, and the context information is ignored and the matching network excessively emphasizes that fine granularity alignment is misled by the overall understanding of the image and the alignment of the context is ignored, the invention provides a false information detection method and a false information detection system based on multi-mode context hierarchical stepwise alignment. Firstly, respectively and independently extracting images and text features in a social media blog, extracting image region features by a Faster R-CNN network based on ResNet-101, and extracting text vocabulary features by a Bi-GRU network. And secondly, calculating the similarity of the image area and the text area by using a bidirectional cross attention mechanism to strictly correspond to the image area and the text vocabulary, and obtaining a local fragment level matching result between the visual area and the text vocabulary. And then, in order to avoid the situation that the matching network excessively emphasizes the fine granularity and ignores the alignment of the contexts, the basic characteristics of the image and the text are enhanced by utilizing a multi-head self-attention mechanism and combining the context information, the enhanced characteristics of the image and the text are obtained, and the global context matching result of the image and the text is obtained through an adaptive weight filtering module. Finally, the full connection layer with softmax function is used to sum the result vectors of the local and global stages and classify the result vectors into two results of real information or false information.

The technical scheme is as follows: a false information detection method based on multi-mode context hierarchical step-by-step alignment comprises the following steps:

(1) Extracting basic features of images and texts, and respectively extracting features of images and texts in social media blogs by using a target detection model and a language characterization model to respectively obtain basic feature vectors of visual areas and text words;

(2) Extracting image and text weighted characteristics, namely aligning the same segment in different modes for the image and text basic characteristic vectors extracted in the step (1) by utilizing a bidirectional cross attention mechanism, and then taking text vocabulary as a benchmark to obtain an image area basic characteristic vector weighted sum for describing the vocabulary as a weighted characteristic vector of the vocabulary; similarly, taking an image area as a reference to obtain a text vocabulary basic feature vector weighted sum describing the area as a weighted feature vector of the area;

(3) Image-text feature local matching, namely matching the image and text basic feature vector extracted in the step (1) with the image and text weighted feature vector extracted in the step (2) by utilizing similarity to obtain a local fragment level matching result;

(4) Extracting image and text enhancement features, namely exploring the context relation among features by utilizing a multi-head self-attention mechanism for the basic feature vectors of the image and the text extracted in the step (1) to respectively obtain the image and the text enhancement feature vectors fused with the context information;

(5) Matching the image-text enhancement features, and obtaining a global correlation matching result by utilizing a self-adaptive weight filtering module for the image and text enhancement features extracted in the step (4);

(6) And (3) detecting social media blogs, namely inputting the local segment level matching result obtained in the step (3) and the global relevance matching result obtained in the step (5) into a binary classifier, projecting the sum vector of two levels of result vectors to two target spaces of real information and false information by using a full-connection layer with a softmax function in the binary classifier, and obtaining the detection result of the social media blogs.

In the step (1), an image X and a text Y are defined, first, the fast R-CNN based on Resnet-101 is used for extracting the basic characteristics of an image area, and then the Bi-GRU network is used for extracting the basic characteristics of text vocabulary.

Further, in the step (1), the specific steps of extracting the image region feature and the text vocabulary feature are as follows:

(1.1) extraction of basic features of an image area: for one image X, selecting a Faster R-CNN to extract image region characteristics, and selecting the first m regions to represent X according to the scores of the image regions. Then a feature vector f is extracted for each image region using a pre-trained ResNet-101 _i ，i∈[1,m]. Finally, the full connection layer is utilized to convert the feature vector into 1024-dimensional feature vectors, and finally the feature vectors are expressed as a group of regional features { x } ₁ ,x ₂ ,...,x _m X E R as an essential feature of an image region ^d ^×m . As shown in formula (1):

x _i ＝W _x f _i +b _x (1)

(1.2) extraction of basic characteristics of text vocabulary: for a sentence of text Y, each vocabulary is first encoded into one-hot vector g _t ，t∈[1,n]And through a parameter matrix W _g And offset vector b _g Embedded in 300-dimensional vector o _t As shown in formula (2):

o _t ＝W _g g _t +b _g (2)

next, bi-GRU is used to integrate context information in sentences captured from both front and back directions into text embedding, and then the final word feature y is obtained by averaging _t As basic characteristics of text vocabulary, the following formula (3) is shown:

in the step (2), the basic feature vector of the image area extracted in the step (1.1) and the basic feature vector of the text vocabulary extracted in the step (1.2) are utilized to obtain multi-mode weighted feature vectors of the image area and the text vocabulary by a bidirectional cross attention mechanism.

Further, in the step (2), the specific steps of extracting the weighted feature vectors of the image area and the text vocabulary are as follows:

(2.1) calculating a similarity matrix of the image area and the text vocabulary: as shown in formula (4):

A＝(W _x X)(W _y Y) ^T (4)

wherein W is _x And W is _y Is a parameter that can be learned, x= { X ₁ ,x ₂ ,...,x _m Sum y= { Y ₁ ,y ₂ ,...,y _n The basic features of the image area and the basic features of the text vocabulary extracted in the step (1.1) and the step (1.2) are respectively, A epsilon R ^m×n Is the similarity matrix of image area-text vocabulary, A _it Meaning semantic similarity of the ith region and the t vocabulary.

(2.2) image region weighted feature extraction: taking a certain image area as a reference, distributing weights to n vocabularies, and then obtaining an image area x based on the n vocabularies by weighting and combining the n vocabularies _i Text lexical feature y _i ^* The process is shown as a formula (5):

where lambda is the temperature parameter as a function of softmax.

(2.3) text vocabulary weighted feature extraction: similarly, a text-based vocabulary y is obtained by taking a certain text vocabulary as a reference _t Corresponding image region feature x _t ^* The process is shown as a formula (6):

in the step (3), the image and text basic feature vectors extracted in the step (1) and the image and text weighted feature vectors extracted in the step (2) are subjected to segment level matching, and the segment level matching scores of the complete image and the complete text are obtained on the basis of the segment level matching scores.

Further, in the step (3), the specific steps of calculating the fragment level similarity are as follows:

(3.1) calculating a fragment-level matching score for the image region correlation: for image region x _i And (2.2) the region extracted corresponds to the text vocabulary feature y _i ^* Calculating to obtain similarityAs shown in formula (7):

(3.2) calculating a text vocabulary related segment level matching score: similarly, for the text vocabulary y _t And (2.3) the vocabulary corresponding image area feature x extracted in _t ^* Calculating to obtain similarityAs shown in formula (8):

(3.3) calculating a segment-level matching score S of the complete image and the complete text _segment (X, Y) as shown in formula (9):

where η is a superparameter for balancing the region-dependent segment level matching score S _x (x _i ,y _i ^* ) Vocabulary-related segment-level matching score S _y (y _t ,x _t ^* ) Is a contribution of (a).

In the step (4), the basic feature vectors of the image and the text extracted in the step (1) are explored by utilizing a multi-head self-attention mechanism to explore the context relation among the features, and the enhanced feature vectors of the image and the text fused with the context information are obtained, and the specific steps are as follows:

(4.1) image enhancement feature extraction: extracting a set of region features x= { X in step (1.1) for an image X ₁ ,x ₂ ,...,x _m A multi-headed self-attention mechanism is used to explore the contextual relationship between regional features. Specifically, query Q is first defined _X Key K _X Sum value Val _X As shown in formula (10):

Q _X ＝XW ^Q ,K _X ＝XW ^K ,Val _X ＝XW ^Val (10)

wherein W is ^Q ,W ^K ,W ^Val Is a learnable feature mapping matrix. Further, the value Val is calculated by the formula (11) _X Is a weighted sum of:

wherein d is _k Is a scale factor. Weight matrix Q _X K _X ^T Is a matrix that encodes the relationship of each visual feature to all other features by inner lamination. The relationships between the different objects are then encoded with equation (12):

where j represents the j-th head,w in (2) represents a learnable feature mapping matrix, superscript Q _X This matrix is represented for image queries, where Q represents the query, x represents for the image, and subscript j represents the j-th header. Thus (S)>Refers to the processing of image input in the j th head in the multi-head attention mechanismA matrix of query weights on the table. Similarly, the number of the devices to be used in the system,and->Representing a key weight matrix and a value weight matrix, respectively, dedicated to processing the image input at the j-th head.

Finally, the i-th original visual zone feature x _i Is converted into enhanced features x by means of a combination of global objects and local guided objects _ip As shown in formula (13):

wherein W is _p Is a learnable weight matrix, h is the number of heads. Then for an image the set x= { X based on region features ₁ ,x ₂ ,...,x _m The enhanced feature of } can be represented as X _{reinforcement} ＝{x _1p ,x _2p ,...,x _mp }。

(4.2) text enhancement feature extraction: similarly, the text feature y= { Y extracted in step (1.2) for a piece of text Y ₁ ,y ₂ ,...,y _n Performing the same procedure as in step (4.1) to obtain text enhancement feature Y _{reinforcement} ＝{y _1p ,y _2p ,...,y _np }。

(4.3) weighted enhancement feature extraction: similarly, for the weighted text feature X extracted in step (2) ^* ＝{x ₁ ^* ,x ₂ ^* ,...,x _n ^* Sum image feature Y ^* ＝{y ₁ ^* ,y ₂ ^* ,...,y _m ^* The same process as in step (4.1) is performed to obtain weighted features after enhancement, denoted as X respectively _{reinforcement} ^* ＝{x _1p ^* ,x _2p ^* ,...,x _np ^* }，Y _{reinforcement} ^* ＝{y _1p ^* ,y _2p ^* ,...,y _mp ^* }。

In the step (5), the image and text enhancement features extracted in the step (4) are filtered by an adaptive weight filtering module to obtain a global correlation matching result, and the specific steps are as follows:

(5.1) calculating an image-to-text direction global matching score: combining the enhanced basic feature pair (X) obtained in step (4) _{reinforcement} ,Y _{reinforcement} ) And enhancing the weighted feature pair (Y _{reinforcement} ^* ,X _{reinforcement} ^* ) As input, the global semantics are generated by weighting and fusing the regional features, and then the global semantic similarity between two modalities is calculated to obtain the global matching score s of the image to the text direction _x→y As shown in formula (14):

wherein w is _x ＝cos(x _lp ,y _lp ^* ) For adaptive weights, the enhanced base feature x is represented _lp Enhancement weighting feature y _lp ^* Is of importance.

(5.2) calculating a text-to-image direction global matching score: similarly, the basic feature pair (X _{reinforcement} ,Y _{reinforcement} ) And enhancing the weighted feature pair (Y _{reinforcement} ^* ,X _{reinforcement} ^* ) As input, the global semantics are generated by weighting and fusing the regional features, and then the global semantic similarity between two modalities is calculated to obtain the global matching score s of the image to the text direction _y→x As shown in formula (15):

similar to (5.1), w _y ＝cos(y _lp ,x _lp ^* ) For adaptive weights, the representation isEnhancement of basic feature y _lp Enhancement weighting feature x _lp ^* Is of importance.

(5.3) calculating a global matching score S of the complete image and the complete text _global (X, Y) as shown in formula (16):

S _global (X,Y)＝s _x→y +s _y→x (16)

in the step (6), the local segment level matching result obtained in the step (3) and the global matching result obtained in the step (5) are input into a binary classifier, and the sum vector of the result vectors of two levels is projected to two target spaces of real information and false information by using a full-connection layer with a softmax function in the binary classifier, so that a detection result of the social media blog is obtained, wherein the specific steps are as follows:

(6.1) after obtaining the local region-vocabulary matching result and the global matching result, using a fully connected layer with a softmax function to match S _segment (X, Y) and S _global (X, Y) projects into the target space of only two categories (real or fake) and gets its probability distribution as shown in equation (17):

wherein p= [ p ] ₀ ,p ₁ ]Representing a predictive probability vector, p ₀ And p ₁ The probabilities of the prediction results being 0 (real) and 1 (false), respectively. W is the weight matrix and b is the bias term.

(6.2) for each blog, the goal is to minimize a binary cross entropy loss function to distinguish whether the blog is real information or false information. The loss function is shown in equation (18):

L _p ＝-[ylnp ₀ +(1-y)lnp ₁ ] (18)

where y e {0,1} is the authenticity label for each blog, y=0 indicates that the blog is authentic, and y=1 indicates that the blog contains false information.

A false information detection system based on multi-mode context layering step-by-step alignment comprises an image region feature extraction module, a text vocabulary feature extraction module, a local alignment module, a global context alignment module and a social media blog detection module;

the image region feature extraction module firstly detects a significant region by using Faster R-CNN, then extracts a feature vector for each image region by using pretrained ResNet-101, and finally converts the feature vector into a feature vector by using a full connection layer to obtain an image region feature vector;

the text vocabulary feature extraction module firstly codes each vocabulary into one-hot vectors, and embeds the one-hot vectors into the vector o through parameter matrixes and offset vectors _t And then the Bi-GRU is used for capturing context information in sentences from front and back directions and integrating the context information into text embedding, so that text vocabulary feature vectors are obtained.

The local alignment module firstly obtains a segment level cross-modal weighted feature vector through a bidirectional cross-attention mechanism, then calculates to obtain a segment level matching score related to a region and a segment level matching score related to a vocabulary, and finally averages all segment level matching scores based on the region and segment level matching scores based on the vocabulary to obtain a final segment level matching score of a complete image and a text.

The global context alignment module firstly acquires the context relation between fragment level features by using a multi-head self-attention mechanism, acquires an enhanced feature vector fused with context information, then calculates global matching scores of two directions from image to text and text to image respectively, and finally takes the sum of the global matching scores of the two directions as the global matching score of the complete image and the text.

The social media blog detection module inputs the segment-level similarity and the global similarity between the images and the texts into a full-connection layer with a softmax function to obtain a detection result that the social media blog is real information or false information.

The implementation process of the system is the same as that of the method, and is not repeated.

A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a method of false information detection based on multi-modal context hierarchical step-wise alignment as described above when executing the computer program.

A computer readable storage medium storing a computer program for performing a method of false information detection based on multi-modal context hierarchical step-wise alignment as described above.

The beneficial effects are that: aiming at the problem that the importance of neglecting context information when modeling the multi-modal fine granularity corresponding relation by using the model and the alignment problem that the model cannot capture different modal contexts are solved in false information of an exaggeration manipulation imaginary event. The invention adopts a two-stage strategy to detect false information. The first stage establishes a local alignment module with bi-directional cross-modal attention, and deduces segment-level matching relationships by adding region-vocabulary similarities in different directions. The second stage establishes a global context alignment module, utilizes a multi-head self-attention mechanism to help local objects learn more context semantics, enhances visual and text context representation, and then utilizes an adaptive weight filtering module to enhance the similarity of features according to two modes, realizes global matching by integrating all weighted features, and inhibits overall semantic deviation. Finally, integrating the result vectors of the two different levels into a full-connection-layer classifier with a softmax function, and classifying the blog as two results of real information or false information. The false information detection model obtained by the method can effectively obtain accurate detection results.

Drawings

FIG. 1 is a block diagram of a method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a bi-directional cross-attention mechanism according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated below in conjunction with specific embodiments, it being understood that these embodiments are meant to be illustrative of the invention only and not limiting the scope of the invention, and that modifications of the invention, which are equivalent to those skilled in the art to which the invention pertains, will fall within the scope of the invention as defined in the claims appended hereto.

As shown in fig. 1, the method for detecting false information based on multi-mode context hierarchical step alignment disclosed in the embodiment of the invention specifically comprises the following steps:

(1) Extracting basic features of images and texts: defining an image X and a text Y, firstly extracting basic characteristics of an image area by using a fast R-CNN based on Resnet-101, and then extracting basic characteristics of text words by using a Bi-GRU network. The specific steps of extracting the image area characteristics and the text vocabulary characteristics are as follows:

(1.1) extraction of basic features of an image area: for one image X, selecting Faster R-CNN to extract image region characteristics, and selecting the first m regions to represent X according to the scores of the Faster R-CNN. Then a feature vector f is extracted for each image region using a pre-trained ResNet-101 _i ，i∈[1,m]. Finally, the full connection layer is utilized to convert the region feature into 1024-dimensional feature vectors, and finally the 1024-dimensional feature vectors are expressed as a group of region features { x } ₁ ,x ₂ ,...,x _m }，X∈R ^d×m . As shown in formula (1):

x _i ＝W _x f _i +b _x (1)

o _t ＝W _g g _t +b _g (2)

next, bi-GRU is used to integrate context information in sentences captured from both front and back directions into text embedding, and then the final word feature y is obtained by averaging _t As shown in formula (3):

(2) Extracting weighted features of images and texts: and (2) obtaining the multi-modal weighted feature vectors of the image region and the text vocabulary by utilizing a bi-directional cross attention mechanism by using the basic feature vectors of the image region extracted in the step (1.1) and the basic feature vectors of the text vocabulary extracted in the step (1.2), wherein the bi-directional cross attention mechanism is shown in figure 2. The specific steps of extracting the weighted feature vectors of the image area and the text vocabulary are as follows:

A＝(W _x X)(W _y Y) ^T (4)

wherein W is _x And W is _y Is a parameter which can be learned, A is E R ^m×n Is the similarity matrix of image area-text vocabulary, A _it Meaning semantic similarity of the ith region and the t vocabulary.

where lambda is the temperature parameter as a function of softmax.

(3) Image-text feature local matching: and (3) performing segment level matching on the image and text basic feature vectors extracted in the step (1) and the image and text weighted feature vectors extracted in the step (2), and obtaining segment level matching scores for the complete image and the complete text on the basis. The specific steps for calculating the fragment level similarity are as follows:

(4) Extracting image and text enhancement features: the basic feature vectors of the images and the texts extracted in the step (1) are explored by utilizing a multi-head self-attention mechanism to explore the context relation among the features, and the images and the text enhancement feature vectors fused with the context information are obtained, wherein the specific steps are as follows:

Q _X ＝XW ^Q ,K _X ＝XW ^K ,Val _X ＝XW ^Val (10)

where j represents the j-th header. Finally, the i-th original visual zone feature x _i Is converted into enhanced features by means of a combination of global objects and local boot objects, e.g. x _ip Formula (13):

x _ip ＝Concat(R ₁ +R ₂ +...+R _h )W _p +x _i (13)

(4.2) text enhancement feature extraction: similarly, for a piece of text Y, the text feature extracted in step (1.2), y= { Y ₁ ,y ₂ ,...,y _n The same procedure in the step (4.1) is carried out to obtain text enhancement feature Y _{reinforcement} ＝{y _1p ,y _2p ,...,y _np }。

(5) Image-text enhancement feature matching: and (3) obtaining a global correlation matching result by utilizing the image and text enhancement features extracted in the step (4) and utilizing a self-adaptive weight filtering module, wherein the specific steps are as follows:

S _global (X,Y)＝s _x→y +s _y→x (16)

(6) Social media blog detection: inputting the local segment level matching result obtained in the step (3) and the global matching result obtained in the step (5) into a binary classifier, projecting the sum vector of the result vectors of two levels to a target space of real information and false information by using a full-connection layer with a softmax function in the binary classifier, and obtaining a detection result of the social media blog, wherein the detection result comprises the following specific steps:

(6.2) for each blog, the goal is to minimize the binary cross entropy loss function, as shown in equation (18):

L _p ＝-[ylogp ₀ +(1-y)logp ₁ ] (18)

where y ε {0,1} represents the true value.

the image region feature extraction module firstly detects a significant region by using a fast R-CNN, then extracts a feature vector for each image region by using a pretrained ResNet-101, and finally converts the feature vector into 1024-dimensional feature vectors by using a full connection layer to obtain the feature vector of the image region;

and the text vocabulary feature extraction module is used for firstly encoding each vocabulary into one-hot vectors, embedding the one-hot vectors into 300-dimensional vectors through parameter matrixes and offset vectors, and integrating context information in sentences captured in front and back directions into text embedding by using Bi-GRU to obtain text vocabulary feature vectors.

The global context alignment module firstly uses a multi-head self-attention mechanism to acquire the context relation between fragment level features, acquires an enhanced feature vector fused with context information, then calculates global matching scores of two directions from image to text and text to image respectively, and finally takes the sum of the global matching scores of the two directions as the global matching score of the complete image and the text.

And the social media blog detection module is used for inputting the segment level similarity and the global similarity between the images and the texts into a full-connection layer with a softmax function to obtain a detection result that the social media blog is real information or false information.

It will be apparent to those skilled in the art that the steps of the method for detecting false information based on hierarchical alignment of multi-modal contexts or the modules of the system for detecting false information based on hierarchical alignment of multi-modal contexts of the embodiments of the present invention may be implemented by a general purpose computing device, they may be centralized on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by a computing device, so that they may be stored in a memory device for execution by the computing device, and in some cases, the steps shown or described may be performed in a different order than what is shown or described herein, or they may be fabricated separately as individual integrated circuit modules, or a plurality of modules or steps within them may be fabricated as a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.

The parameters were set and experimental evaluation criteria were as follows:

a, setting parameters:

the following are several parameters affecting the present model: iteration round (Epoch), batch Size (Batch Size), net Learning rate (Learning rate), learnable linear projection function hidden layer parameter η.

Table 1 model training parameter settings

And B, evaluation indexes:

the related evaluation indexes are index combinations uniformly used by the existing method, and the method comprises the following steps: accuracy, precision, recall and F1-Score.

According to the above embodiments, the invention aims at the importance of neglecting context information when modeling multi-modal fine granularity corresponding relation in false information of imaginary events by using exaggeration technique and the alignment problem that the model can not capture different modal contexts

A two-step detection strategy is presented. The invention uses the local alignment module to learn the complementary information among multiple modes, and can solve the problem that certain information in the text cannot strictly correspond to the image area. The invention uses the global context alignment module to realize the global matching of the image and the text, can solve the problem that the matching network is misled by the image semantics and ignores the context alignment, and improves the accuracy of false information detection.

Claims

1. The false information detection method based on multi-mode context hierarchical step-by-step alignment is characterized by comprising the following steps:

2. The method for detecting false information based on multi-modal context hierarchical step-by-step alignment according to claim 1, wherein in the step (1), an image X and a text Y are defined, first image region basic features are extracted by using a fast R-CNN based on Resnet-101, and then text vocabulary basic features are extracted by using a Bi-GRU network;

the specific steps of extracting the image area characteristics and the text vocabulary characteristics are as follows:

(1.1) extraction of basic features of an image area: for an image X, selecting a Faster R-CNN to extract image region characteristics, and selecting the first m regions to represent X according to the scores of the image regions; then a feature vector f is extracted for each image region using a pre-trained ResNet-101 _i ，i∈[1,m]The method comprises the steps of carrying out a first treatment on the surface of the Finally, the full connection layer is utilized to convert the feature vector into 1024-dimensional feature vectors, and finally the feature vectors are expressed as a group of regional features { x } ₁ ,x ₂ ,...,x _m X E R as an essential feature of an image region ^d×m As shown in formula (1):

x _i ＝W _x f _i +b _x (1)

o _t ＝W _g g _t +b _g (2)

3. the method for detecting false information based on hierarchical alignment of multi-modal contexts according to claim 2, wherein in (2), the multi-modal weighted feature vectors of the image region and the text vocabulary are obtained by using a bi-directional cross-attention mechanism from the image region basic feature vector and the text vocabulary basic feature vector;

the specific steps of extracting the weighted feature vectors of the image area and the text vocabulary are as follows:

A＝(W _x X)(W _y Y) ^T (4)

wherein W is _x And W is _y Is a parameter that can be learned, x= { X ₁ ,x ₂ ,...,x _m Sum y= { Y ₁ ,y ₂ ,...,y _n The basic features of the image area and the basic features of the text vocabulary extracted in the step (1.1) and the step (1.2) are respectively, A epsilon R ^m×n Is the similarity matrix of image area-text vocabulary, A _it Representing semantic similarity of the ith region and the t vocabulary;

(2.2) image region weighted feature extraction: will be certainThe image areas are used as the standard, weights are distributed to n vocabularies, and then the n vocabularies are weighted and combined to obtain an image area x _i Text lexical feature y _i ^* The process is shown as a formula (5):

wherein λ is a temperature parameter that is a function of softmax;

4. the method for detecting false information based on multi-modal context hierarchical step-by-step alignment according to claim 1, wherein in the step (3), the image and text basic feature vectors extracted in the step (1) and the image and text weighted feature vectors extracted in the step (2) are subjected to segment level matching, and the segment level matching scores of the complete image and the complete text are obtained on the basis of the segment level matching;

the specific steps for calculating the fragment level similarity are as follows:

(3.1) calculating a fragment-level matching score for the image region correlation: for image region x _i Text vocabulary feature y corresponding to the region _i ^* Calculating to obtain similarityAs shown in formula (7):

(3.2) calculating a text vocabulary related segment level matching score: for text vocabulary y _t Image region feature x corresponding to the vocabulary _t ^* Calculating to obtain similarityAs shown in formula (8):

where η is a superparameter for balancing the region-dependent segment-level matching scoresVocabulary-related fragment level matching score +.>Is a contribution of (a).

5. The method for detecting false information based on multi-modal context hierarchical step alignment according to claim 1, wherein in the step (4), the image and text basic feature vectors extracted in the step (1) are explored by using a multi-head self-attention mechanism to explore the context relation between features, and the image and text enhancement feature vectors fused with the context information are obtained, and the specific steps are as follows:

(4.1) image enhancement feature extraction: a set of regional features x= { X for one image X ₁ ,x ₂ ,...,x _m Exploring regions using multi-headed self-attention mechanismsContextual relationships between features; first define query Q _X Key K _X Sum value Val _X As shown in formula (10):

Q _X ＝XW ^Q ,K _X ＝XW ^K ,Val _X ＝XW ^Val (10)

wherein W is ^Q ,W ^K ,W ^Val Is a learnable feature mapping matrix; calculated value Val by (11) _X Is a weighted sum of:

wherein d is _k Is a scale factor, a weight matrix Q _X K _X ^T Is a matrix that encodes the relationship of each visual feature to all other features by inner lamination; the relationships between the different objects are then encoded with equation (12):

where j represents the j-th head,w in (2) represents a learnable feature mapping matrix, superscript Q _X Representing this matrix for an image query, where Q represents the query, x represents for the image, and subscript j represents the j-th header; thus (S)>Refers to a query weight matrix specifically used to handle image input at the j-th head in a multi-head attention mechanism; similarly, a->And->Representing a key weight matrix and a value weight matrix, respectively, dedicated to processing the image input at the j-th head.

wherein W is _p Is a learnable weight matrix, h is the number of heads; then for an image the set x= { X based on region features ₁ ,x ₂ ,...,x _m The enhanced feature of } can be represented as X _{reinforcement} ＝{x _1p ,x _2p ,...,x _mp }；

(4.2) text enhancement feature extraction: text feature y= { Y for a piece of text Y ₁ ,y ₂ ,...,y _n Performing the same procedure as in step (4.1) to obtain text enhancement feature Y _{reinforcement} ＝{y _1p ,y _2p ,...,y _np }；

(4.3) weighted enhancement feature extraction: for the weighted text feature X extracted in step (2) ^* ＝{x ₁ ^* ,x ₂ ^* ,...,x _n ^* Sum image feature Y ^* ＝{y ₁ ^* ,y ₂ ^* ,...,y _m ^* The same process as in step (4.1) is performed to obtain weighted features after enhancement, denoted as X respectively _{reinforcement} ^* ＝{x _1p ^* ,x _2p ^* ,...,x _np ^* }，Y _{reinforcement} ^* ＝{y _1p ^* ,y _2p ^* ,...,y _mp ^* }。

6. The method for detecting false information based on multi-mode context hierarchical step alignment according to claim 1, wherein in the step (5), the image and text enhancement features extracted in the step (4) are filtered by using an adaptive weight filtering module to obtain a global relevance matching result, and the specific steps are as follows:

wherein w is _x ＝cos(x _lp ,y _lp ^* ) For adaptive weights, the enhanced base feature x is represented _lp Enhancement weighting feature y _lp ^* Is of importance of (2);

(5.2) calculating a text-to-image direction global matching score: will enhance the basic feature pair (X _{reinforcement} ,Y _{reinforcement} ) And enhancing the weighted feature pair (Y _{reinforcement} ^* ,X _{reinforcement} ^* ) As input, the global semantics are generated by weighting and fusing the regional features, and then the global semantic similarity between two modalities is calculated to obtain the global matching score s of the image to the text direction _y→x As shown in formula (15):

similar to (5.1), w _y ＝cos(y _lp ,x _lp ^* ) For adaptive weights, the enhanced base features y are represented _lp Enhancement weighting feature x _lp ^* Is of importance of (2);

S _global (X,Y)＝s _x→y +s _y→x (16)。

7. the method for detecting false information based on hierarchical alignment of multi-modal context according to claim 1, wherein in the step (6), the local segment level matching result obtained in the step (3) and the global matching result obtained in the step (5) are input into a binary classifier, and the sum vector of the result vectors of two levels is projected to the target space of both real information and false information by using the full connection layer with softmax function in the binary classifier, and the detection result of the social media blog is obtained, which comprises the following specific steps:

(6.1) after obtaining the local region-vocabulary matching result and the global matching result, using a fully connected layer with a softmax function to match S _segment (X, Y) and S _global (X, Y) projects into the target space of only two categories, real or fake, and gets its probability distribution as shown in equation (17):

wherein p= [ p ] ₀ ,p ₁ ]Representing a predictive probability vector, p ₀ And p ₁ Respectively representing the probabilities of the prediction results of 0 and 1, wherein 0 represents real and 1 represents fake; w is the weight matrix and b is the bias term.

(6.2) for each blog, the goal is to minimize a binary cross entropy loss function to distinguish whether the blog is real information or false information; the loss function is shown in equation (18):

L _p ＝-[ylnp ₀ +(1-y)lnp ₁ ] (18)

8. The false information detection system based on multi-mode context hierarchical step-by-step alignment is characterized by comprising an image region feature extraction module, a text vocabulary feature extraction module, a local alignment module, a global context alignment module and a social media blog detection module;

the text vocabulary feature extraction module firstly codes each vocabulary into one-hot vectors, and embeds the one-hot vectors into the vector o through parameter matrixes and offset vectors _t Then, capturing context information in sentences from front and back directions by using Bi-GRU, and integrating the context information into text embedding to obtain text vocabulary feature vectors;

the local alignment module firstly acquires a segment level cross-modal weighted feature vector through a bidirectional cross-attention mechanism, then calculates to obtain a segment level matching score related to a region and a segment level matching score related to a vocabulary, and finally averages all segment level matching scores based on the region and segment level matching scores based on the vocabulary to obtain a final segment level matching score of a complete image and a text;

the global context alignment module firstly uses a multi-head self-attention mechanism to acquire a context relation between fragment-level features, acquires an enhanced feature vector fused with context information, then calculates global matching scores of an image to a text and the text to the image respectively, and finally takes the sum of the global matching scores of the two directions as a global matching score of a complete image and the text;

9. A computer device, characterized by: the computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method for false information detection based on multi-modal context hierarchical step-wise alignment as claimed in any one of claims 1-7 when executing the computer program.

10. A computer-readable storage medium, characterized by: the computer readable storage medium stores a computer program for performing a method of false information detection based on multi-modal context hierarchical step-wise alignment as claimed in any one of claims 1-7.