CN112035670B - Multi-modal rumor detection method based on image emotional tendency - Google Patents
Multi-modal rumor detection method based on image emotional tendency
- Publication number
- CN112035670B CN112035670B CN202010940956.5A CN202010940956A CN112035670B CN 112035670 B CN112035670 B CN 112035670B CN 202010940956 A CN202010940956 A CN 202010940956A CN 112035670 B CN112035670 B CN 112035670B
- Authority
- CN
- China
- Prior art keywords
- image
- text
- emotional tendency
- features
- rumor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention discloses a multi-modal rumor detection method based on image emotional tendency. A method for extracting the emotional tendency of an image is proposed based on a conditional variational autoencoder (CVAE); it differs from conventional sentiment-analysis approaches, and its effectiveness is confirmed by experiments. The method can obtain accurate detection results using only a single image as input, enabling fast detection and handling at the initial stage of rumor propagation.
Description
Technical Field
The invention relates to the technical field of network space security, in particular to a multi-modal rumor detection method based on image emotional tendency.
Background
The development of social media has accelerated information dissemination, but it has also brought a flood of false rumor information, which often introduces instability and has a great influence on the economy and society. Social network platforms today have hundreds of millions of users; information on them spreads widely and rapidly, is not limited by time or space, and, like a magnifier, amplifies its own influence. Unrealistic rumors "manipulate" public sentiment, mislead public judgment, and affect social stability, so the automatic and rapid detection of network rumors is of great significance for cyberspace security.
Social media rumors often carry distinctly inflammatory wording, and from this perspective text-based sentiment analysis has substantially advanced rumor detection. However, with the development of multimedia production technology, rumors increasingly attract and mislead readers through combined pictures and text; pictures often have strong visual impact and contain abundant latent information that can be mined. Moreover, in massive social media data, image and text information are not presented in completely separated form: a portion of image data still contains a large amount of text, and this embedded text often carries semantic information closely related to the topic, which helps establish the relationship between an image and its emotional tendency. Conventional multi-modal detection methods fail to exploit this auxiliary information well.
Disclosure of Invention
The invention aims to provide a multi-modal rumor detection method based on image emotional tendency, which can obtain accurate detection results using only a single image as input and can quickly detect and handle rumors at the initial stage of their propagation.
The purpose of the invention is realized by the following technical scheme:
a multi-modal rumor detection method based on image emotion tendencies comprises the following steps:
in the training stage, texts and images containing text information are used as training data; for each training sample group consisting of a text and an image, three kinds of multi-modal features are extracted: text features, image features, and features of the text embedded in the image; based on a conditional variational autoencoder, the prior distribution and the classifier are updated by combining the image features, the embedded-text features, the text features, a hidden variable of the semantic space, and the given emotional-tendency label, wherein the hidden variable represents the semantics of the image;
and in the testing stage, embedded-text features are extracted from the image to be detected together with its corresponding text; the emotional tendency is generated by decoding a hidden variable sampled from the updated prior distribution together with these features; the result is then concatenated with the text features, and a classifier outputs the probability that the sample is a rumor.
According to the technical scheme provided by the invention, on one hand, the method is well targeted at samples that contain text within their images. At the same time, a method for extracting the emotional tendency of an image is proposed based on a conditional variational autoencoder (CVAE); it differs from conventional sentiment-analysis approaches, and its effectiveness is confirmed by experiments. The method can obtain accurate detection results using only a single image as input, enabling fast detection and handling at the initial stage of rumor propagation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a schematic diagram of a multi-modal rumor detection method based on image emotional tendency according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a relationship between an image, a text, a hidden variable and an emotional tendency provided in the embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a multi-modal rumor detection method based on image emotional tendency. Sentiment analysis of text can often start from certain keywords, but the emotional tendency of an image generally cannot be extracted from, or explicitly represented by, any specific region. To this end, the invention trains an emotional-tendency judgment model based on a conditional variational autoencoder (CVAE), using emotional tendency as the label, so that the emotional "features" of an image are learned implicitly; at the same time, Optical Character Recognition (OCR) is used to obtain the text within the image as additional auxiliary information. In the testing stage, the social media post to be tested is input into the model, and whether it is a rumor is judged according to the learned emotional-tendency features. The proposed method performs well on social media rumors containing text-bearing pictures, showing a degree of specificity and effectiveness.
The general technical framework of the method is shown in fig. 1, mainly as follows:
I. Training stage.
In the training stage, texts and images containing text information are used as training data (they can be collected directly from a social platform). For each training sample group consisting of a text and an image, three kinds of multi-modal features are extracted: text features, image features, and features of the text embedded in the image. Based on a conditional variational autoencoder, the prior distribution and the classifier are updated by combining the image features, the embedded-text features, the text features, a hidden variable of the image semantic space, and the given emotional-tendency label, where the hidden variable represents the semantics of the image.
The training stage mainly comprises the following parts:
1. Data preprocessing.
1) Symbol expressions, special characters, URLs, and the like in the original text content are redundant. All such information is discarded in a redundancy-removal step; only the textual content is kept and concatenated into a single text sequence, with a separator inserted at each concatenation gap.
2) To facilitate extraction of the embedded text, the image is denoised.
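The text side of this preprocessing can be sketched in plain Python. The patent does not specify its cleaning rules or separator, so the regular expressions and the `[SEP]` token below are illustrative assumptions only:

```python
import re

SEP = "[SEP]"  # assumed separator token; the patent does not name one

def clean_text(fragments):
    """Drop URLs, emoticon codes, and special symbols; keep plain text
    and join the surviving fragments with a separator."""
    cleaned = []
    for frag in fragments:
        frag = re.sub(r"https?://\S+", "", frag)      # drop URLs
        frag = re.sub(r"\[[^\]]{1,8}\]", "", frag)    # drop emoticon codes like [smile]
        frag = re.sub(r"[#@~^*_=+<>|\\]", "", frag)   # drop special symbols
        frag = re.sub(r"\s+", " ", frag).strip()      # collapse whitespace
        if frag:
            cleaned.append(frag)
    return f" {SEP} ".join(cleaned)

print(clean_text(["Breaking news!! http://t.cn/abc", "[shock] see photo @user"]))
# -> Breaking news!! [SEP] see photo user
```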
2. Multi-modal feature extraction.
1) Text feature extraction.
Statistics show that, after preprocessing, 98% of the texts in the data set are no more than 150 characters long. To bound the computation, a text is therefore limited to at most 150 words: excess words are discarded and short texts are padded. The value 150 here is only an example; in practical applications the length limit can be set as circumstances require.
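The length-limiting step described above can be sketched as follows; `max_len=150` comes from the text, while the padding token is an assumption:

```python
PAD = "<pad>"  # assumed padding token; the patent does not name one

def pad_or_truncate(tokens, max_len=150):
    """Discard words beyond max_len; pad short sequences up to max_len."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    return tokens + [PAD] * (max_len - len(tokens))

seq = pad_or_truncate(["网络", "谣言"])          # short text -> padded to 150
assert len(seq) == 150 and seq[2] == PAD
assert len(pad_or_truncate(["w"] * 300)) == 150  # long text -> truncated
```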
Word-level feature vectorization is performed on the text using GloVe embeddings pre-trained on Chinese Wikipedia, and the result is fed into a GRU (Gated Recurrent Unit) with a hidden-state size of 512 for feature extraction; the resulting semantic vector is the text feature E.
2) Image feature extraction.
Since object-level features are not needed, the method differs from prior approaches by adopting the pre-trained model ResNeXt to extract a general feature representation. ResNeXt performs well in many computer vision tasks; its distinctive structure combines grouped convolutions with residual connections.

In the embodiment of the invention, only the feature-extraction part of ResNeXt is retained, and the global feature vector of the image obtained after the last pooling layer is used as the image feature I.
3) Embedded-text feature extraction.
In the embodiment of the invention, the set of OCR tokens in an image is obtained through the open-source Chinese optical character recognition toolkit CNOCR; this set carries the semantic information of the text in the image. The tokens are then vectorized using GloVe pre-trained on Chinese Wikipedia, and the embedded-text feature O is finally obtained through a linear transformation.
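A minimal sketch of this token-to-feature pipeline, with a toy two-dimensional embedding table and linear map standing in for the 300-dimensional GloVe vectors and the learned transformation (both stand-ins are illustrative assumptions; averaging the token vectors is also an assumed pooling choice):

```python
# toy stand-ins: in the method, CNOCR supplies the tokens and GloVe the vectors
EMBED = {"紧急": [1.0, 0.0], "转发": [0.0, 1.0]}   # token -> embedding vector
W = [[0.5, 0.5], [1.0, -1.0]]                      # assumed 2x2 linear layer, no bias

def text_info_feature(ocr_tokens):
    """Average the token embeddings, then apply the linear transformation."""
    vecs = [EMBED[t] for t in ocr_tokens if t in EMBED]
    if not vecs:
        return [0.0, 0.0]
    avg = [sum(col) / len(vecs) for col in zip(*vecs)]
    return [sum(w * x for w, x in zip(row, avg)) for row in W]

print(text_info_feature(["紧急", "转发"]))  # -> [0.5, 0.0]
```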
3. Image emotional-tendency feature extraction based on CVAE.
The pictures attached to rumors usually have strong visual impact, but the emotional tendency of a whole picture cannot be read off from any single local region, unlike text, where analysis can start from clearly inflammatory words. How to extract the emotional tendency of an image is therefore the key point, and the difficulty, of this invention's research.
It is assumed that the emotional tendency Y of an image and the image feature I follow some distribution, but one that cannot be expressed by an explicit formula. Based on the design of the conditional variational autoencoder, it can therefore be reasonably assumed that the factors determining the generated emotional tendency Y are the given image feature I and a hidden variable Z in a semantic space (Z can be understood as the semantics of the attached picture, modeled as a multivariate Gaussian with a diagonal, isotropic covariance matrix), where Z and I satisfy a certain prior distribution. Meanwhile, the embedded text O extracted from the picture is itself generated from the image and contributes to the generation of Y. These relationships are shown in fig. 2.
Given the image feature I as the condition, the embedded-text feature O can be extracted with CNOCR, and the prior distribution pθ(Z|I) of the hidden variable can be determined. Each possible emotional tendency Y can then be generated by sampling a hidden variable Z from the prior distribution and passing it, together with the embedded-text feature O, through a decoder pθ(Y|I,Z,O). The currently available inputs are the image feature I and the given emotional-tendency label Y (positive or negative, corresponding to truth and rumor respectively). Following the principle of the conditional variational autoencoder, the prior distribution pθ(Z|I) is not straightforward to fit directly, so the invention approximates it with a posterior distribution qφ(Z|I,Y) in the form of a deep neural network; the optimization target is the KL divergence between the prior and posterior distributions. The training strategy is as follows:
1) Obtain the posterior distribution qφ(Z|I,Y) from the image feature I and the given emotional-tendency label Y, and initialize the prior distribution pθ(Z|I) of the hidden variable. Sample a hidden variable, decode it together with the embedded-text feature O to predict the emotional tendency, and compute the reconstruction error of the emotional-tendency label.
2) Minimize, via the KL divergence, the distance between the posterior distribution qφ(Z|I,Y) and the prior distribution pθ(Z|I), thereby correcting the prior distribution pθ(Z|I); and minimize the reconstruction error of the emotional-tendency label Y. The loss for the two processes follows the CVAE evidence lower bound:

log pθ(Y|I) >= E_{qφ(Z|I,Y)}[log pθ(Y|I,Z,O)] - KL(qφ(Z|I,Y) || pθ(Z|I))

where θ and φ denote the adjustable parameters of the prior and posterior distributions respectively, the expectation term corresponds to the reconstruction error of the emotional-tendency label Y, and p(Y|I) takes the fixed form of the CVAE loss function. Following Stochastic Gradient Variational Bayes (SGVB), the lower-bound function on the right-hand side of the inequality is maximized during training.
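The KL divergence between the two diagonal Gaussians used in this objective has a closed form; a stdlib-only sketch (the one- and two-dimensional parameter values are illustrative):

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ),
    summed over dimensions."""
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

# identical distributions -> zero divergence
assert abs(kl_diag_gaussians([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])) < 1e-12
print(kl_diag_gaussians([1.0], [0.0], [0.0], [0.0]))  # -> 0.5
```

Minimizing this quantity between qφ(Z|I,Y) and pθ(Z|I) is the step that corrects the prior during training.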
3) Sample a hidden variable Ẑ from the trained prior distribution pθ(Z|I), decode it together with the embedded-text feature O to generate the emotional tendency Ŷ, concatenate Ŷ with the text feature E, and input the result into a classifier to judge whether the sample is a rumor; the classifier is trained with a chosen loss function (e.g., cross-entropy).
II. Testing stage.
The flow of the testing stage is similar to that of the training stage, as shown in fig. 1. The input is an image containing text information together with its corresponding text. For the image to be detected, embedded-text features are extracted; a hidden variable sampled from the updated prior distribution is decoded together with them to generate the emotional tendency; the emotional tendency is concatenated with the text features; and the updated classifier outputs the probability that the sample is a rumor.
The final detection result can then be determined in the conventional way: since there are only two classes, the sample is assigned to whichever class has the higher probability.
Of course, a higher threshold may be set to obtain greater confidence, and the specific value can be chosen by the practitioner according to actual conditions or experience. For example, if the probabilities of the rumor and real classes are (0.99, 0.01), that is, 99% rumor and 1% real, and the rumor probability exceeds a set threshold (e.g., 90%), then the information under test is considered a rumor with high confidence.
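The decision rule described above can be sketched as follows; the 0.90 threshold mirrors the example in the text, and the label names are assumptions:

```python
def decide(p_rumor, p_real, threshold=0.90):
    """Argmax decision over the two classes, flagged as confident
    only when the winning probability passes the threshold."""
    label = "rumor" if p_rumor >= p_real else "real"
    confident = max(p_rumor, p_real) >= threshold
    return label, confident

assert decide(0.99, 0.01) == ("rumor", True)    # the worked example: confident rumor
assert decide(0.55, 0.45) == ("rumor", False)   # majority class, but below threshold
```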
According to the scheme provided by the embodiment of the invention, on one hand, the method is well targeted at samples that contain text within their images. At the same time, a method for extracting the emotional tendency of an image is proposed based on a conditional variational autoencoder (CVAE); it differs from conventional sentiment-analysis approaches, and its effectiveness is confirmed by experiments. The method can obtain accurate detection results using only a single image as input, enabling fast detection and handling at the initial stage of rumor propagation.
In addition, in order to illustrate the effects of the above-described aspects of the present invention, related experiments were also performed.
In the experiment, the data set consists of microblogs that contain both pictures and text, where the pictures contain embedded text. The data were collected from the Weibo platform, mainly as a supplement built on the Weibo RumorSet data set; each microblog is matched with 1.47 pictures on average, and every picture is annotated with an emotional-tendency label. The distribution is shown in Table 1:
| | Number of samples | Number of pictures |
|---|---|---|
| Real data | 2729 | 4262 |
| Rumor data | 2555 | 3517 |

TABLE 1 Data set distribution
The training set and the test set are divided in a 9:1 ratio. The final model reaches 77.8% accuracy and 76.3% recall on the test set, while the best conventional multi-modal method reaches only 54.6% accuracy, showing that the proposed method has a degree of specificity and effectiveness for multi-modal microblog information containing text within images.
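The reported metrics are the standard accuracy and positive-class recall; a stdlib sketch over toy predictions (the toy labels are illustrative, not the experiment's data):

```python
def accuracy_and_recall(y_true, y_pred, positive="rumor"):
    """Accuracy over all samples; recall of the positive (rumor) class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    return correct / len(y_true), tp / pos

acc, rec = accuracy_and_recall(
    ["rumor", "rumor", "real", "real"],
    ["rumor", "real", "real", "real"],
)
assert (acc, rec) == (0.75, 0.5)  # 3 of 4 correct; 1 of 2 rumors recovered
```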
Through the above description of the embodiments, it will be clear to those skilled in the art that the embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, a USB disk, a removable hard disk, etc.) and includes instructions that enable a computer device (a personal computer, a server, a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (4)
1. A multi-modal rumor detection method based on image emotion tendencies is characterized by comprising the following steps:
in the training stage, texts and images containing text information are used as training data; for each training sample group consisting of a text and an image, multi-modal features are extracted: text features, image features, and features of the text embedded in the image; based on a conditional variational autoencoder, the prior distribution and classifier are updated by combining the image features, embedded-text features, text features, hidden variables of the semantic space, and the given emotional-tendency label, comprising: obtaining the posterior distribution qφ(Z|I,Y) from the image feature I and the given emotional-tendency label Y, and initializing the prior distribution pθ(Z|I) of the hidden variable; sampling a hidden variable, decoding it together with the embedded-text feature O to predict the emotional tendency, and calculating the reconstruction error of the emotional-tendency label; obtaining the updated prior distribution pθ(Z|I) by minimizing the KL divergence between the posterior and prior distributions and minimizing the reconstruction error of the emotional-tendency label Y; sampling a hidden variable Ẑ from the trained prior distribution pθ(Z|I), decoding it together with the embedded-text feature O to generate the emotional tendency Ŷ, concatenating Ŷ with the text feature E, and inputting the result into a classifier to judge whether the sample is a rumor; training the classifier with a set loss function; the hidden variable being the semantics of the image;
and in the testing stage, extracting embedded-text features from the image to be detected and its corresponding text, generating the emotional tendency by decoding a hidden variable sampled from the updated prior distribution together with these features, concatenating the result with the text features, and obtaining the probability that the sample is a rumor through the classifier.
2. The multi-modal rumor detection method based on image emotional tendency according to claim 1, wherein data preprocessing performed before the multi-modal feature extraction comprises:
performing a redundancy-removal operation on the text, retaining only the textual content and concatenating it into a text sequence;

and denoising the image.
3. The multi-modal rumor detection method based on image emotional tendency according to claim 1, wherein the multi-modal feature extraction comprises:

vectorizing the word features of the text with pre-trained GloVe and feeding them into a GRU for feature extraction to obtain a semantic vector as the text feature;

extracting a general feature representation of the image with the pre-trained model ResNeXt, taking the features output by its last pooling layer as the image features;

acquiring the set of OCR tokens in the image through the open-source Chinese optical character recognition toolkit CNOCR, the set comprising the semantic information of the text in the image; then vectorizing these tokens with the pre-trained GloVe, and finally obtaining the embedded-text features through a linear transformation.
4. The multi-modal rumor detection method based on image emotional tendency according to claim 1, wherein the distance between the posterior distribution qφ(Z|I,Y) and the prior distribution pθ(Z|I) is minimized via the KL divergence, thereby correcting the prior distribution pθ(Z|I); and the reconstruction error of the emotional-tendency label Y is minimized; the loss of this process being:

log pθ(Y|I) >= E_{qφ(Z|I,Y)}[log pθ(Y|I,Z,O)] - KL(qφ(Z|I,Y) || pθ(Z|I)).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010940956.5A CN112035670B (en) | 2020-09-09 | 2020-09-09 | Multi-modal rumor detection method based on image emotional tendency |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112035670A CN112035670A (en) | 2020-12-04 |
CN112035670B true CN112035670B (en) | 2021-05-14 |
Family
ID=73584556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010940956.5A Active CN112035670B (en) | 2020-09-09 | 2020-09-09 | Multi-modal rumor detection method based on image emotional tendency |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112035670B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116502092A (en) * | 2023-06-26 | 2023-07-28 | 国网智能电网研究院有限公司 | Semantic alignment method, device, equipment and storage medium for multi-source heterogeneous data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829499A (en) * | 2019-01-31 | 2019-05-31 | 中国科学院信息工程研究所 | Image, text and data fusion sensibility classification method and device based on same feature space |
CN110580501A (en) * | 2019-08-20 | 2019-12-17 | 天津大学 | Zero sample image classification method based on variational self-coding countermeasure network |
CN111079444A (en) * | 2019-12-25 | 2020-04-28 | 北京中科研究院 | Network rumor detection method based on multi-modal relationship |
CN111160452A (en) * | 2019-12-25 | 2020-05-15 | 北京中科研究院 | Multi-modal network rumor detection method based on pre-training language model |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5008024B2 (en) * | 2006-12-28 | 2012-08-22 | 独立行政法人情報通信研究機構 | Reputation information extraction device and reputation information extraction method |
US9959365B2 (en) * | 2015-01-16 | 2018-05-01 | The Trustees Of The Stevens Institute Of Technology | Method and apparatus to identify the source of information or misinformation in large-scale social media networks |
Non-Patent Citations (2)
Title |
---|
MVAE: Multimodal Variational Autoencoder for Fake News Detection; Dhruv Khattar et al.; The Web Conference 2019; 2019-05-31; pp. 1-8 *
Network rumor identification method based on sentiment analysis; Shou Huanrong et al.; Data Analysis and Knowledge Discovery; 2017-07-25 (No. 7); pp. 44-51 *
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||