CN111079444B - Network rumor detection method based on multi-modal relationship - Google Patents


Info

Publication number
CN111079444B
Authority
CN
China
Prior art keywords
vector
visual feature
feature vector
information
semantic
Prior art date
Legal status
Active
Application number
CN201911379313.1A
Other languages
Chinese (zh)
Other versions
CN111079444A (en)
Inventor
Zhang Yongdong
Mao Zhendong
Deng Xuran
Zhao Bowen
Current Assignee
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Original Assignee
Beijing Zhongke Research Institute
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by Beijing Zhongke Research Institute and University of Science and Technology of China (USTC)
Publication of CN111079444A (application publication, 2020-04-28)
Application granted; publication of CN111079444B (granted publication, 2020-09-29)
Legal status: Active

Classifications

    • G06F18/2433 — Pattern recognition; classification techniques; single-class perspective, e.g. one-against-all classification; novelty detection; outlier detection
    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06Q50/01 — ICT specially adapted for specific business sectors; social networking
    • G06V10/464 — Image or video recognition; salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations

Abstract

The invention discloses a network rumor detection method based on multi-modal relationships, which comprises the following steps: acquiring an image to be detected and its related text published on a network platform; extracting visual feature vectors containing objects of different classes in the image through a pre-trained Faster R-CNN model; after preprocessing the text, extracting semantic vectors through a GRU; capturing the importance degrees of the visual feature vectors and the semantic vectors through an attention mechanism and realizing cross-modal association between the image and the text, so as to update the visual feature vectors and the semantic vectors; for the visual feature vectors and the semantic vectors, modeling the relationships of internal dynamic information respectively through an attention mechanism, so as to update them again; and concatenating the re-updated visual feature vectors with the semantic vectors, and obtaining the probabilities that the information to be detected is of the rumor class and of the real class through a binary classifier. The method can automatically judge whether the information to be detected belongs to a network rumor or not, and has high detection accuracy.

Description

Network rumor detection method based on multi-modal relationship
Technical Field
The invention relates to the technical field of network space security, in particular to a network rumor detection method based on multi-modal relations.
Background
The rise of the network society brings opportunities and challenges together. In particular, the low threshold for Internet access and the freedom of information dissemination seriously affect the stability of cyberspace, and the unchecked spread of network rumors is one of the problems that must be taken seriously. Today's social networking platforms have well over hundreds of millions of highly active users; information spreads widely and rapidly, free of time and space constraints, and the platforms' magnifier effect multiplies the impact of information. Sensitive topics, focus events, hot-spot issues, major public events and emergencies can become widely known within days, and may cause loss of trust, damage to government and corporate images, and boiling public grievances. Automatic and rapid detection of network rumors is therefore of great importance to cyberspace security.
With the development of multimedia technology, both self-media and professional media have shifted to multimedia news forms based on pictures, text and short videos. Multimedia content carries richer and more intuitive information, describes news events better, and is more easily and widely disseminated. Studies have shown that the average number of reposts of media posts containing images is 11 times that of plain text. As such, false news and rumors often use highly provocative pictures to attract and mislead readers, spreading quickly and widely; this has made detection of visual-modality content a non-negligible part of dealing with the network rumor challenge.
Traditional false-content detection based on visual-modality content mainly uses hand-crafted features, such as visual clarity, visual similarity histograms and double JPEG (Joint Photographic Experts Group) compression traces. These features usually work well for crude picture tampering, but with the continuous improvement of image generation technology, such methods can no longer guarantee precision, and their resource costs rise significantly.
In recent years, with the rapid development of neural networks and deep learning models, corresponding detection technologies have emerged and achieved great success. In false-information detection, multi-modal methods that use text and visual modality information simultaneously to distinguish the authenticity of news have also appeared; representative prior work includes attRNN, EANN and MVAE. Although these methods provide heuristic approaches to detecting false information in multi-modal form, they still have significant drawbacks. First, the extraction of image and text information is still coarse, especially the semantic features of the picture; second, in the feature fusion stage, the features of the two modalities are simply concatenated, making it difficult to express the interaction and association between the modalities.
Disclosure of Invention
The invention aims to provide a network rumor detection method based on multi-modal relationships, which can automatically judge whether information to be detected belongs to a network rumor and has high detection accuracy.
The purpose of the invention is realized by the following technical scheme:
a network rumor detection method based on multi-modal relations comprises the following steps:
acquiring information to be detected, including images and related texts, issued on a network platform;
for the image, extracting visual feature vectors containing objects of different classes in the image through a pre-trained Faster R-CNN model;
for the text, after preprocessing, extracting semantic vectors through a gated recurrent unit (GRU);
capturing the importance degrees of the visual feature vectors and the semantic vectors through an attention mechanism and realizing cross-modal association between the image and the text, so as to update the visual feature vectors and the semantic vectors; based on the updated visual feature vectors and semantic vectors, modeling the relationships of internal dynamic information respectively through an attention mechanism, so as to update the visual feature vectors and semantic vectors again; and concatenating the re-updated visual feature vectors with the semantic vectors, and obtaining the probabilities that the information to be detected is of the rumor class and of the real class through a binary classifier.
According to the technical scheme provided by the invention, multi-modal feature fusion examines the text information and the image information jointly, yielding higher accuracy. Meanwhile, unlike other multi-modal methods that use attention mechanisms, the method also attends to information within each modality, so the model can integrate richer information relationships. The method can obtain accurate detection results with only a single post as input, enabling rapid detection and handling at the initial stage of rumor propagation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of a model structure of a network rumor detection method based on a multi-modal relationship according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a network rumor detection method based on multi-modal relationships. In the feature extraction stage, image features are extracted with a Faster R-CNN-based object detection model, which attends to specific targets and salient regions in the image. In the feature fusion stage, unlike prior art that attends only to the relationship between images and texts, the method also applies an attention mechanism to information within the same modality; the advantage is that intra-modal related information can supplement the inter-modal information. The method achieves good results on the Weibo RumorSet data set and can identify false-information cases that are difficult to distinguish with the single-modality traditional schemes.
As shown in fig. 1, a schematic model structure diagram of a network rumor detection method based on multi-modal relationships according to an embodiment of the present invention mainly includes the following five parts:
1. multimodal data acquisition.
In the embodiment of the invention, the information to be detected, including images and related texts, issued on the network platform is acquired.
Illustratively, the information may be acquired from a social media platform, e.g., a microblogging platform.
In the embodiment of the invention, the related text includes the text contained in the information to be detected and the text attached when other users forward the information to be detected. For example, microblog information acquired from the microblog platform includes, in addition to the text of the microblog itself, the text attached when other users forward it.
2. And (5) extracting visual features.
In the embodiment of the invention, for the image, the visual feature vectors containing objects of different classes in the image are extracted through a Faster R-CNN model pre-trained on Visual Genome.
The Faster R-CNN model is a classical model commonly used in object detection. For a given picture I, the model outputs object-level information in the picture, namely the visual feature vectors of objects of different classes, V = {v_1, v_2, ..., v_K}, where v_i (i = 1, 2, ..., K) represents the visual feature vector of one object and K is the total number of feature vectors (36 here). Illustratively, the visual feature vectors V may form a K × 2048-dimensional visual feature matrix. Compared with extracting features from the whole image, this method focuses more on objects and other salient regions of the image.
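For illustration only, the object-level extraction can be sketched as follows in PyTorch. The sketch assumes torchvision's COCO-pretrained Faster R-CNN in place of the Visual-Genome-pretrained detector used in the embodiment, and obtains 2048-dimensional region features by encoding each detected region with a headless ResNet-50; the function name extract_visual_features and the crop-and-encode strategy are illustrative assumptions, not the patent's exact pipeline:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                          FasterRCNN_ResNet50_FPN_Weights)
from torchvision.transforms.functional import normalize, resized_crop

K = 36  # number of object regions kept, as in the embodiment

# COCO-pretrained detector standing in for the Visual-Genome-pretrained model.
detector = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT).eval()
# ResNet-50 with its classifier removed yields one 2048-d vector per region crop.
encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder.eval()

@torch.no_grad()
def extract_visual_features(image):            # image: 3 x H x W float tensor in [0, 1]
    boxes = detector([image])[0]["boxes"][:K]  # top-K detections, sorted by score
    crops = [                                  # crop, resize and normalize each region
        normalize(resized_crop(image, int(y1), int(x1),
                               max(int(y2 - y1), 1), max(int(x2 - x1), 1), [224, 224]),
                  mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        for x1, y1, x2, y2 in boxes.tolist()
    ]
    return encoder(torch.stack(crops))         # at most K x 2048 feature matrix V
```

The sketch assumes at least one detection; a production pipeline would pad the matrix or fall back to whole-image features when fewer than K regions are found.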
3. And preprocessing the text and extracting the features.
In the embodiment of the invention, after the text is preprocessed, the semantic vector is extracted through a gated recurrent unit (GRU). It should be noted that the text in the dashed box at the lower left corner of fig. 1 is only schematic.
1) And (4) preprocessing.
For text, the complexity and disorder of social media information introduce much useless redundant information, such as emoticons, special characters and URLs (uniform resource locators), so preprocessing is required. Specifically, all redundant information such as URLs, special characters and emoticons is discarded, only the remaining character information is retained, and the remaining character information is then spliced into a text sequence with separators used as identifiers at the splicing gaps. For example, after the text in a microblog is preprocessed, only the remaining text information is retained, and the remaining text of the source microblog and of the subsequent forwarded microblogs is spliced in order into a sequence L.
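A minimal preprocessing sketch is given below; the separator token [SEP] and the character classes kept are assumptions, since the patent names the separator only as an identifier:

```python
import re

SEP = "[SEP]"  # illustrative separator; the patent does not name a specific token

URL_RE = re.compile(r"https?://\S+")
# Keep CJK characters, letters, digits and basic CJK punctuation; drop
# emoticons, special symbols and everything else (an assumed character set).
KEEP_RE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff，。！？、]")

def preprocess(texts):
    """Clean the source post and its forwarded posts, then splice the
    remaining character information into one sequence L with separators."""
    cleaned = []
    for t in texts:
        t = URL_RE.sub("", t)        # remove URLs first
        t = KEEP_RE.sub("", t)       # then remove special characters/emoticons
        if t:
            cleaned.append(t)
    return f" {SEP} ".join(cleaned)  # sequence L
```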
2) And (5) feature extraction.
Statistically, 98% of the texts in the data set are no longer than 150 characters after preprocessing, so to limit computational cost a text sequence L contains at most 150 words; longer texts are truncated and shorter ones are padded. Word features are then vectorized with pre-trained GloVe vectors (pre-trained on Chinese Wikipedia for Chinese), the preprocessed text is expressed in matrix form, and a gated recurrent unit (GRU) performs feature extraction to obtain the semantic vector E.
Illustratively, the word-feature vectorization yields a 150 × 300 matrix, and the hidden state size of the GRU is 512.
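A sketch of this encoding step, assuming the 150 × 300 GloVe matrix and the 512-dimensional GRU hidden state stated above (the random input merely stands in for real GloVe vectors):

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, HIDDEN = 150, 300, 512    # values stated in the embodiment

gru = nn.GRU(input_size=EMB_DIM, hidden_size=HIDDEN, batch_first=True)

def encode_text(token_vectors):             # token_vectors: B x 150 x 300 GloVe matrix
    outputs, _ = gru(token_vectors)         # B x 150 x 512, one state per word
    return outputs                          # semantic vectors E

E = encode_text(torch.randn(1, MAX_LEN, EMB_DIM))  # toy stand-in for GloVe input
```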
4. And (5) feature fusion.
1) Information circulation and interaction among the modalities.
In the embodiment of the invention, the importance degrees of the visual feature vectors and the semantic vectors are captured through an attention mechanism, and cross-modal association between the image and the text is realized, so as to update the visual feature vectors and the semantic vectors. Specifically, the visual feature vectors and the semantic vectors are each taken as one modality's information; for the inter-modal information, the importance degree of each (visual feature vector, semantic vector) pair is extracted through an attention mechanism, information flows between the modalities according to these importance degrees so as to update each modality's information, and the information-flow process realizes the cross-modal association between the image and the text. The operation process is as follows:
The visual feature vector V and the semantic vector E are each linearly transformed to obtain the k, q and v values required by the attention mechanism, and the inter-modal attention weights are then obtained through vector inner products:
$$\mathrm{InterAtt}_{E \to V} = \mathrm{softmax}\!\left(\frac{q_V k_E^{\top}}{\sqrt{\dim}}\right)$$
$$\mathrm{InterAtt}_{V \to E} = \mathrm{softmax}\!\left(\frac{q_E k_V^{\top}}{\sqrt{\dim}}\right)$$
wherein E represents the semantic vector and V represents the visual feature vector; q_V, k_V represent the q and k values of the visual feature vector V, q_E, k_E represent the q and k values of the semantic vector E, and dim represents the vector dimension; InterAtt_{E→V} and InterAtt_{V→E} denote, in turn, the attention weight from the semantic vector to the visual feature vector and the attention weight from the visual feature vector to the semantic vector. These two bidirectional matrices contain the important information between paired image regions and words.
As will be understood by those skilled in the art, the k, q and v values are the inherent variables of the attention mechanism, namely the key, the query and the value. In brief, the attention mechanism computes the similarity between the query of the modality being updated and the key of the other modality, normalizes it to obtain the attention weight, and then multiplies by the value of the other modality, yielding an attention update of the first modality's vectors; in this way information flows between the modalities:
$$V' = \mathrm{InterAtt}_{E \to V} \times v_E$$
$$E' = \mathrm{InterAtt}_{V \to E} \times v_V$$
wherein v_E and v_V represent the v values of the semantic vector E and the visual feature vector V, respectively.
Then, the updated visual feature vector V′ and semantic vector E′ are concatenated with the original visual feature vector V and semantic vector E and passed through a fully connected layer, giving the visual feature vector V* and semantic vector E*, which are input to the subsequent intra-modality association module to further learn information flow within each modality.
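The cross-modal step can be sketched as the following PyTorch module; the common projection dimension and the class name are assumptions (the patent fixes only the GRU size), and the softmax scaling follows the formulas reconstructed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512  # assumed common projection dimension for both modalities

class InterModalAttention(nn.Module):
    """Sketch of inter-modal information flow: q/k/v projections, bidirectional
    attention weights, value-weighted updates, and fusion with the originals."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.q_v, self.k_v, self.v_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.q_e, self.k_e, self.v_e = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.fuse_v = nn.Linear(2 * dim, dim)   # concat(V, V') -> V*
        self.fuse_e = nn.Linear(2 * dim, dim)   # concat(E, E') -> E*

    def forward(self, V, E):                    # V: B x K x dim, E: B x L x dim
        scale = V.size(-1) ** 0.5
        att_ev = F.softmax(self.q_v(V) @ self.k_e(E).transpose(1, 2) / scale, dim=-1)
        att_ve = F.softmax(self.q_e(E) @ self.k_v(V).transpose(1, 2) / scale, dim=-1)
        V_upd = att_ev @ self.v_e(E)            # V': text-to-image update
        E_upd = att_ve @ self.v_v(V)            # E': image-to-text update
        V_star = self.fuse_v(torch.cat([V, V_upd], dim=-1))
        E_star = self.fuse_e(torch.cat([E, E_upd], dim=-1))
        return V_star, E_star
```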
2) Dynamic information relationship modeling within a modality.
For the input visual feature vector V* and semantic vector E*, the relationships of internal dynamic information are modeled separately through an attention mechanism, serving as supplementary information to the cross-modal association, and the visual feature vector and the semantic vector are updated again.
In addition, the association of information within a modality should also be conditioned on the other modality; for example, image regions should be associated differently according to different words and phrases. To this end, the visual feature vector and the semantic vector are first pooled and affine-transformed to the same dimension as the k, q and v values, and then channel-wise conditional gate vectors M_{V→E}, M_{E→V} are computed to introduce the other modality's information:
$$M_{V \to E} = \mathrm{Sigmoid}(\mathrm{Linear}(V^*_{\mathrm{pool}}))$$
$$M_{E \to V} = \mathrm{Sigmoid}(\mathrm{Linear}(E^*_{\mathrm{pool}}))$$
wherein Linear(V*_pool) and Linear(E*_pool) are the results of pooling and affine-transforming the visual feature vector V* and the semantic vector E*, respectively, and Sigmoid denotes the Sigmoid function.
Next, the two channel-wise conditional gate vectors modulate the k and q values of the two modalities, each modality's k and q values being activated or suppressed by the conditional gate of the other modality. The updated k and q values are:
$$\hat{q}_V = (1 + M_{E \to V}) \odot q_{V^*},\qquad \hat{k}_V = (1 + M_{E \to V}) \odot k_{V^*}$$
$$\hat{q}_E = (1 + M_{V \to E}) \odot q_{E^*},\qquad \hat{k}_E = (1 + M_{V \to E}) \odot k_{E^*}$$
In the above formulas, \hat{q}_V and \hat{k}_V represent the re-updated q and k values of the visual feature vector, \hat{q}_E and \hat{k}_E represent the re-updated q and k values of the semantic vector, and q_{V^*}, k_{V^*} and q_{E^*}, k_{E^*} represent the q and k values of the input visual feature vector V* and semantic vector E*, respectively.
Those skilled in the art will understand that the result M of the Sigmoid function lies in the (0, 1) interval, so 1 + M lies in the (1, 2) interval; multiplying it element-wise with the original q and k values acts as a scaling, i.e., the "activation or deactivation" above. The updated q and k values are the q and k values conditioned on the other modality's information, which is equivalent to introducing the other modality's information.
After the re-updated k and q values are obtained, the attention mechanism generates the weights and the visual feature vector and semantic vector are updated, defined as:
$$\mathrm{IntraAtt}_{V \to V} = \mathrm{softmax}\!\left(\frac{\hat{q}_V \hat{k}_V^{\top}}{\sqrt{\dim}}\right),\qquad \mathrm{IntraAtt}_{E \to E} = \mathrm{softmax}\!\left(\frac{\hat{q}_E \hat{k}_E^{\top}}{\sqrt{\dim}}\right)$$
$$\hat{V} = \mathrm{IntraAtt}_{V \to V} \times v_{V^*},\qquad \hat{E} = \mathrm{IntraAtt}_{E \to E} \times v_{E^*}$$
wherein IntraAtt_{V→V} and IntraAtt_{E→E} denote, in turn, the attention weight inside the visual feature vector and the attention weight inside the semantic vector, v_{V^*} and v_{E^*} are the v values of the input visual feature vector V* and semantic vector E*, and the re-updated visual feature vector and semantic vector are \hat{V} and \hat{E}, respectively.
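Correspondingly, the gated intra-modal step can be sketched as follows; mean pooling and the class name are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512  # assumed common dimension, as in the inter-modal sketch above

class GatedIntraModalAttention(nn.Module):
    """Sketch of self-attention within one modality whose q and k values are
    modulated by a channel-wise conditional gate from the other modality."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)  # affine transform of the pooled other modality

    def forward(self, X, other):         # X: B x N x dim, other: B x M x dim
        m = torch.sigmoid(self.gate(other.mean(dim=1)))   # gate M in (0, 1), B x dim
        q = (1 + m).unsqueeze(1) * self.q(X)              # modulated q values
        k = (1 + m).unsqueeze(1) * self.k(X)              # modulated k values
        att = F.softmax(q @ k.transpose(1, 2) / X.size(-1) ** 0.5, dim=-1)
        return att @ self.v(X)           # re-updated vectors of this modality
```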
in the specific implementation process, information circulation and interaction among the modes and dynamic information relation modeling in the modes can be realized by one sub-module respectively, the two sub-modules form one basic module, the three basic modules are stacked to obtain final visual and semantic vectors, and finally, the visual feature vectors and the semantic vectors are point-multiplied together to obtain final fusion feature vectors (multi-mode feature information).
5. And outputting the classification.
The network rumor detection problem is treated as a classification problem: the finally fused multi-modal feature information is input into a multi-layer perceptron serving as a binary classifier, and the probabilities that the information to be detected is of the rumor class and of the real class are obtained through a Softmax function.
In the embodiment of the invention, the whole method is regarded as one model; the loss function in the training process may be the cross-entropy loss function, and through training the classifier learns to distinguish the rumor class from the real class from the multi-modal feature information.
After the probabilities of the rumor class and the real class are obtained, the final detection result can be determined in a conventional manner, for example by a set threshold: since there are only two classes, when the probability of a class exceeds 0.5, the result can be judged to belong to that class. A higher threshold may be set to obtain greater confidence: for example, if the probabilities of the rumor and real classes are (0.99, 0.01), i.e., the rumor probability is 99% and the real probability is 1%, and the rumor probability exceeds the set threshold (e.g., 90%), then the information to be detected can be judged a rumor with high confidence. The specific threshold value can be set by the skilled person according to actual conditions or experience.
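A sketch of the classification head and the threshold rule described above; the hidden size and the class ordering (rumor first) are illustrative assumptions:

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(            # multi-layer perceptron as binary classifier
    nn.Linear(512, 128), nn.ReLU(),
    nn.Linear(128, 2),                 # two classes: rumor, real (assumed order)
)

def detect(fused, threshold=0.5):      # fused: B x 512 multi-modal feature
    probs = torch.softmax(classifier(fused), dim=-1)
    p_rumor = probs[:, 0]
    return p_rumor > threshold, p_rumor  # decision and rumor probability

# Training would minimize nn.CrossEntropyLoss() over rumor/real labels,
# matching the cross-entropy loss mentioned for the model.
```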
In the model shown in fig. 1, the loss function during training may be the cross-entropy loss function. The Weibo RumorSet data set may be used; its data were collected on the microblog platform, with the following distribution:

             Number of samples   Number of pictures
Real data    4779                5318
Rumor data   4748                7954

TABLE 1 Data set distribution
The scheme of the embodiment of the invention performs well on the data set shown in Table 1 and can identify false-information cases that are difficult to distinguish using a single modality in traditional schemes.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A network rumor detection method based on multi-modal relations is characterized by comprising the following steps:
acquiring information to be detected, including images and related texts, issued on a network platform;
for the image, extracting visual feature vectors containing objects of different classes in the image through a pre-trained Faster R-CNN model;
for the text, after preprocessing, extracting semantic vectors through a gated recurrent unit;
capturing the importance degrees of the visual feature vectors and the semantic vectors through an attention mechanism and realizing cross-modal association between the image and the text, so as to update the visual feature vectors and the semantic vectors; based on the updated visual feature vectors and semantic vectors, modeling the relationships of internal dynamic information respectively through an attention mechanism, so as to update the visual feature vectors and semantic vectors again; concatenating the re-updated visual feature vectors with the semantic vectors, and obtaining the probabilities that the information to be detected is of the rumor class and of the real class through a binary classifier;
wherein capturing the importance degrees of the visual feature vectors and the semantic vectors through the attention mechanism and realizing cross-modal association between the image and the text, so as to update the visual feature vectors and the semantic vectors, comprises:
taking the visual feature vectors and the semantic vectors each as one modality's information, extracting the importance degree of each (visual feature vector, semantic vector) pair through the attention mechanism, realizing information flow between the modalities according to the importance degrees so as to update each modality's information, and realizing the cross-modal association between the image and the text through the information-flow process; the operation process is as follows:
linearly transforming the visual feature vectors and the semantic vectors respectively to obtain the k, q and v values required by the attention mechanism, and then obtaining the inter-modal attention weights through vector inner products:
$$\mathrm{InterAtt}_{E \to V} = \mathrm{softmax}\!\left(\frac{q_V k_E^{\top}}{\sqrt{\dim}}\right)$$
$$\mathrm{InterAtt}_{V \to E} = \mathrm{softmax}\!\left(\frac{q_E k_V^{\top}}{\sqrt{\dim}}\right)$$
wherein E represents the semantic vector and V represents the visual feature vector; the k, q and v values are the inherent variables of the attention mechanism, namely the key, the query and the value; q_V, k_V represent the q and k values of the visual feature vector V, q_E, k_E represent the q and k values of the semantic vector E, and dim represents the vector dimension; InterAtt_{E→V} and InterAtt_{V→E} denote, in turn, the attention weight from the semantic vector to the visual feature vector and the attention weight from the visual feature vector to the semantic vector;
and then updating each modality's feature vectors with the other modality's information according to the attention weights, realizing information flow between the different modalities:
$$V' = \mathrm{InterAtt}_{E \to V} \times v_E$$
$$E' = \mathrm{InterAtt}_{V \to E} \times v_V$$
wherein v_E and v_V represent the v values of the semantic vector E and the visual feature vector V, respectively;
then connecting the updated visual feature vector V′ and semantic vector E′ in series with the original visual feature vector V and semantic vector E through a fully connected layer to obtain the visual feature vector V* and the semantic vector E*.
2. The method of claim 1, wherein the visual feature vectors containing objects of different classes are expressed as V = {v_1, v_2, ..., v_K}, wherein v_i represents the visual feature vector of one object, i = 1, 2, ..., K, and K represents the total number of feature vectors.
3. The method of claim 1, wherein the associated text comprises: the text contained in the information to be detected and the text attached when other users forward the information to be detected.
4. The method of claim 2, wherein preprocessing the text comprises: removing redundant information from the text, retaining only the character information, splicing the character information into a text sequence, and using separators as identifiers at the splicing gaps; the redundant information comprises one or more of the following: symbolic emoticons, special characters, uniform resource locators.
5. The method of claim 1, wherein, before the semantic vector is extracted through the gated recurrent unit, word features are vectorized with pre-trained GloVe vectors, the preprocessed text is expressed in matrix form, and feature extraction is performed with the gated recurrent unit to obtain the semantic vector.
6. The method of claim 1, wherein, based on the updated visual feature vector and semantic vector, modeling the relationships of internal dynamic information respectively through an attention mechanism so as to update the visual feature vector and the semantic vector again comprises:
pooling and affine-transforming the visual feature vector V* and the semantic vector E* respectively to the same dimension as the k, q and v values, and then computing channel-wise conditional gate vectors M_{V→E}, M_{E→V} to introduce the other modality's information:
$$M_{V \to E} = \mathrm{Sigmoid}(\mathrm{Linear}(V^*_{\mathrm{pool}}))$$
$$M_{E \to V} = \mathrm{Sigmoid}(\mathrm{Linear}(E^*_{\mathrm{pool}}))$$
wherein Linear(V*_pool) and Linear(E*_pool) are the results of pooling and affine-transforming the visual feature vector V* and the semantic vector E*, respectively, and Sigmoid represents the Sigmoid function;
the two channel-wise conditional gate vectors modulate the k and q values of the two modalities, each modality's k and q values being activated or suppressed by the conditional gate of the other modality; the updated k and q values are:
$$\hat{q}_V = (1 + M_{E \to V}) \odot q_{V^*},\qquad \hat{k}_V = (1 + M_{E \to V}) \odot k_{V^*}$$
$$\hat{q}_E = (1 + M_{V \to E}) \odot q_{E^*},\qquad \hat{k}_E = (1 + M_{V \to E}) \odot k_{E^*}$$
in the above formulas, \hat{q}_V and \hat{k}_V represent the re-updated q and k values of the visual feature vector, \hat{q}_E and \hat{k}_E represent the re-updated q and k values of the semantic vector, and q_{V^*}, k_{V^*} and q_{E^*}, k_{E^*} represent the q and k values of the input visual feature vector V* and semantic vector E*, respectively;
after the re-updated k and q values are obtained, generating weights with the attention mechanism and updating the respective internal dynamic information of the visual feature vector and the semantic vector:
$$\mathrm{IntraAtt}_{V \to V} = \mathrm{softmax}\!\left(\frac{\hat{q}_V \hat{k}_V^{\top}}{\sqrt{\dim}}\right),\qquad \mathrm{IntraAtt}_{E \to E} = \mathrm{softmax}\!\left(\frac{\hat{q}_E \hat{k}_E^{\top}}{\sqrt{\dim}}\right)$$
$$\hat{V} = \mathrm{IntraAtt}_{V \to V} \times v_{V^*},\qquad \hat{E} = \mathrm{IntraAtt}_{E \to E} \times v_{E^*}$$
wherein IntraAtt_{V→V} and IntraAtt_{E→E} denote, in turn, the attention weight inside the visual feature vector and the attention weight inside the semantic vector, and v_{V^*} and v_{E^*} are the v values of the input visual feature vector V* and semantic vector E*; the re-updated visual feature vector and semantic vector are \hat{V} and \hat{E}, respectively.
CN201911379313.1A 2019-12-25 2019-12-27 Network rumor detection method based on multi-modal relationship Active CN111079444B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911357589X 2019-12-25
CN201911357589 2019-12-25

Publications (2)

Publication Number Publication Date
CN111079444A CN111079444A (en) 2020-04-28
CN111079444B (en) 2020-09-29

Family

ID: 70318707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911379313.1A Active CN111079444B (en) 2019-12-25 2019-12-27 Network rumor detection method based on multi-modal relationship

Country Status (1)

Country Link
CN (1) CN111079444B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582587B (en) * 2020-05-11 2021-06-04 深圳赋乐科技有限公司 Prediction method and prediction system for video public sentiment
CN111797326B (en) * 2020-05-27 2023-05-12 中国科学院计算技术研究所 False news detection method and system integrating multi-scale visual information
CN111611981A (en) * 2020-06-28 2020-09-01 腾讯科技(深圳)有限公司 Information identification method and device and information identification neural network training method and device
CN111985369B (en) * 2020-08-07 2021-09-17 西北工业大学 Course field multi-modal document classification method based on cross-modal attention convolution neural network
CN111967277B (en) * 2020-08-14 2022-07-19 厦门大学 Translation method based on multi-modal machine translation model
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 Multi-mode data association method and device
CN112035669B (en) * 2020-09-09 2021-05-14 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112035670B (en) * 2020-09-09 2021-05-14 中国科学技术大学 Multi-modal rumor detection method based on image emotional tendency
CN112528015B (en) * 2020-10-26 2022-11-18 复旦大学 Method and device for judging rumor in message interactive transmission
CN112199606B (en) * 2020-10-30 2022-06-03 福州大学 Social media-oriented rumor detection system based on hierarchical user representation
CN112200197A (en) * 2020-11-10 2021-01-08 天津大学 Rumor detection method based on deep learning and multi-mode
CN112926569B (en) * 2021-03-16 2022-10-18 重庆邮电大学 Method for detecting natural scene image text in social network
CN113239730B (en) * 2021-04-09 2022-04-05 哈尔滨工业大学 Method for automatically eliminating structural false modal parameters based on computer vision
CN113469214A (en) * 2021-05-20 2021-10-01 中国科学院自动化研究所 False news detection method and device, electronic equipment and storage medium
CN113221872B (en) * 2021-05-28 2022-09-20 北京理工大学 False news detection method for generating convergence of countermeasure network and multi-mode
CN113239926B (en) * 2021-06-17 2022-10-25 北京邮电大学 Multi-modal false information detection model system based on countermeasure
CN113434684B (en) * 2021-07-01 2022-03-08 北京中科研究院 Rumor detection method, system, equipment and storage medium for self-supervision learning
CN113743522A (en) * 2021-09-13 2021-12-03 五八同城信息技术有限公司 Detection method and device for illegal behavior and electronic equipment
CN113822224B (en) * 2021-10-12 2023-12-26 中国人民解放军国防科技大学 Rumor detection method and device integrating multi-mode learning and multi-granularity structure learning
CN113688955B (en) * 2021-10-25 2022-02-15 北京世纪好未来教育科技有限公司 Text recognition method, device, equipment and medium
CN114417001B (en) * 2022-03-29 2022-07-01 山东大学 Chinese writing intelligent analysis method, system and medium based on multi-mode
CN115809327B (en) * 2023-02-08 2023-05-05 四川大学 Real-time social network rumor detection method based on multimode fusion and topics

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902621A (en) * 2012-12-28 2014-07-02 深圳先进技术研究院 Method and device for identifying network rumor
CN105045857A (en) * 2015-07-09 2015-11-11 中国科学院计算技术研究所 Social network rumor recognition method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109241379A (en) * 2017-07-11 2019-01-18 北京交通大学 A method of across Modal detection network navy
CN110019812B (en) * 2018-02-27 2021-08-20 中国科学院计算技术研究所 User self-production content detection method and system


Also Published As

Publication number Publication date
CN111079444A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079444B (en) Network rumor detection method based on multi-modal relationship
Li et al. Visual to text: Survey of image and video captioning
Rohrbach et al. Grounding of textual phrases in images by reconstruction
Kumar et al. Identifying clickbait: A multi-strategy approach using neural networks
Ortis et al. Exploiting objective text description of images for visual sentiment analysis
CN111783903A (en) Text processing method, text model processing method and device and computer equipment
Liu et al. Fact-based visual question answering via dual-process system
Peng et al. An effective strategy for multi-modal fake news detection
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
Lin et al. Detecting multimedia generated by large ai models: A survey
Illendula et al. Which emoji talks best for my picture?
Maynard et al. Entity-based opinion mining from text and multimedia
CN110895656A (en) Text similarity calculation method and device, electronic equipment and storage medium
Demidova et al. Semantic image-based profiling of users’ interests with neural networks
Kim et al. A deep learning approach for identifying user interest from targeted advertising
Li et al. Multi-modal fusion network for rumor detection with texts and images
CN114443916A (en) Supply and demand matching method and system for test data
Kumari et al. Emotion aided multi-task framework for video embedded misinformation detection
CN110765108A (en) False message early detection method based on crowd-sourcing data fusion
Li Multimodal visual pattern mining with convolutional neural networks
Garg et al. On-Device Document Classification using multimodal features
CN116522895B (en) Text content authenticity assessment method and device based on writing style
Shetty et al. Deep Learning Photograph Caption Generator
Srivastava et al. Improving scene text image captioning using transformer-based multilevel attention
CN113283535B (en) False message detection method and device integrating multi-mode characteristics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant