CN115438274A

CN115438274A - False news identification method based on heterogeneous graph convolutional network

Info

Publication number: CN115438274A
Application number: CN202210911726.5A
Authority: CN
Inventors: 尚学群; 高莉; 宋凌云; 谭亚聪; 刘杰; 杨琛
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2022-12-06

Abstract

The invention relates to a false news identification method based on a heterogeneous graph convolutional network, comprising the following steps: 1) acquiring news data and constructing a heterogeneous news propagation graph; 2) extracting news text features: extracting the contextual interaction information of the text with a natural language processing method; 3) designing a heterogeneous graph convolutional network model: (1) designing a topology smoothing mechanism to obtain a topological position weight for each node; (2) designing a hierarchical graph attention mechanism to learn the structural features of different types of nodes; 4) feature fusion and classification: first combining the acquired text features with the acquired structural features, then training with a cross-entropy loss function to obtain the node weights and node feature vectors used for false news classification. The method achieves excellent results in false news detection.

Description

False news identification method based on heterogeneous graph convolutional network

Technical Field

The invention belongs to the field of graph neural network applications, and in particular relates to a false news identification method based on a heterogeneous graph convolutional network.

Background

False news refers to messages that are intentionally posted on social media and can be verified as false. The widespread use of social media lets false news spread faster and further, and its spread not only affects network security and the social economy but also damages the public credibility of governments and the media. Identifying false news as early as possible is therefore a crucial task. Current false news detection methods fall into two categories: methods based on text content and methods based on social network interaction information.

Text content-based methods focus on extracting lexical, grammatical and writing-style features from news texts and identifying false news with a feature classifier. However, these methods generally analyze each news text in isolation and ignore the deep structural relationships among news items, and between news items and users, that arise during news propagation.

To address these problems, methods based on social network interaction information build on the text and additionally fuse the relationships in the social network between users and news, between news items, and between users and comments, improving false news identification through these deeper relationships. Bian and Ma et al. formalize a tree-shaped propagation graph from the relationship between source news and comments, and then classify it through graph representation. Yuan and Yang et al. model users, source news and comments jointly as a heterogeneous news propagation graph, and then perform node feature learning and classification with a graph representation learning model. Although these methods achieve excellent results in false news detection, they ignore the authenticity of the edges in the news propagation graph and the topological imbalance present in the graph during graph learning, which limits how well they learn news features.

Disclosure of Invention

Technical problem to be solved

In order to avoid the defects of the prior art, the invention provides a false news identification method based on a heterogeneous graph convolutional network.

Technical scheme

A false news identification method based on heterogeneous graph convolutional network is characterized by comprising the following steps:

step 1: obtaining news data from a social platform, wherein the news data comprises source news m, related comments c and corresponding users u, and constructing a heterogeneous news propagation graph HNG according to the connection among the source news m, the related comments c and the corresponding users u;

step 2: acquiring text characteristic information of source news content and comment content by using a natural language processing model;

step 2.1: acquiring initial characteristics of a text by using a natural language processing model;

step 2.2: to further obtain the contextual semantic features between the source news and the comments, the relevance between the comments and the source news is obtained through a multi-head self-attention model, so that new features carrying contextual semantics are obtained for the news and the comments; these features serve as the initial feature vectors of the source news nodes and comment nodes in heterogeneous graph learning;

step 3: designing a hierarchical graph convolution model to learn the HNG structure and obtain the structural features of the nodes;

step 3.1: designing a topological smoothing strategy to obtain the topological position weight of each node in the news propagation network;

step 3.2: training the constructed HNG by designing a hierarchical graph attention mechanism, and learning the characteristics of each node in the network;

step 4: fusing the network structure features obtained in step 3 with the text features obtained in step 2 to generate new vectors for the subsequent classification, thereby achieving false news detection.

The further technical scheme of the invention is as follows: in step 1, the social platforms are Weibo and Twitter, from which three datasets are obtained, namely Weibo, Twitter15 and Twitter16.

The further technical scheme of the invention is as follows: the construction mode of the heterogeneous news propagation graph HNG in the step 1 is specifically as follows:

(1) if two users have a follow relationship, or both comment on or forward the same news item, connecting the two users;

(2) if a user comments on or publishes a news item, connecting the user with the comment node and connecting the user with the news node;

(3) if two news items are published at the same time or share common users, connecting the two news items;

(4) if one comment is a reply to another comment, connecting the two comments.
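The four connection rules can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `build_hng`, the input layout (`posts`, `comments`, `follows`) and the `(type, id)` node keys are hypothetical, and the "published at the same time" part of rule (3) is omitted because it would require timestamps and a threshold that the text does not specify.

```python
from collections import defaultdict
from itertools import combinations


def build_hng(posts, comments, follows):
    """Build an undirected heterogeneous graph as {node: set(neighbors)}.

    posts:    {news_id: author_user_id}
    comments: {comment_id: (user_id, news_id, parent_comment_id or None)}
    follows:  iterable of (user_id, user_id) pairs
    Node keys are (type, id) tuples with type in {"news", "comment", "user"}.
    All of these names and layouts are assumptions made for illustration.
    """
    adj = defaultdict(set)

    def link(a, b):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)

    users_of_news = defaultdict(set)  # news_id -> users who published or commented

    # Rule (2): user-news and user-comment edges.
    for news_id, author in posts.items():
        link(("user", author), ("news", news_id))
        users_of_news[news_id].add(author)
    for cid, (user, news_id, parent) in comments.items():
        link(("user", user), ("comment", cid))
        link(("user", user), ("news", news_id))
        users_of_news[news_id].add(user)
        # Rule (4): a reply is connected to the comment it answers.
        if parent is not None:
            link(("comment", cid), ("comment", parent))

    # Rule (1): follow relation, or both users interacted with the same news.
    for u, v in follows:
        link(("user", u), ("user", v))
    for users in users_of_news.values():
        for u, v in combinations(sorted(users), 2):
            link(("user", u), ("user", v))

    # Rule (3), common-user part only: news items sharing at least one user.
    for n1, n2 in combinations(sorted(users_of_news), 2):
        if users_of_news[n1] & users_of_news[n2]:
            link(("news", n1), ("news", n2))

    return dict(adj)
```

A call such as `build_hng({"n1": "alice"}, {"c1": ("bob", "n1", None)}, [])` yields user, news and comment nodes for a single post with one comment.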

The further technical scheme of the invention is as follows: the natural language processing model used in step 2.1 is a CNN model, whose purpose is to learn a feature vector representing each news item and each comment.

The further technical scheme of the invention is as follows: the input of the multi-head self-attention model used in step 2.2 is the feature vectors of each news item and each comment obtained in step 2.1; the model cross-learns the semantic relations between the sentences of the news and the comments, and finally yields a context-aware semantic feature vector for each news item and each comment.

The further technical scheme of the invention is as follows: the topology weight calculation of each node in the topology smoothing strategy in step 3.1 specifically includes:

firstly, measuring the node influence distribution of each labeled node through the personalized PageRank algorithm to obtain a probability matrix P, wherein a ∈ (0,1) is the random walk probability;

$$P = a\,\bigl(I - (1-a)A'\bigr)^{-1} \tag{1}$$

next, assume that if a labeled news node m_i is strongly influenced by neighbor nodes carrying other labels, then m_i encounters substantial conflict during message passing and lies close to a topological class boundary; based on this assumption, the invention designs a topology imbalance quantification index T_m, based on node information conflict detection, to capture the degree of topological imbalance of the graph, and re-weights the labeled nodes by reducing the training weight of nodes close to class boundaries and increasing the training weight of nodes close to class centers; the weight calculation formula is as follows:

in the formula, w_max and w_min are hyperparameters, T_m denotes the topological value, rank(T_m) denotes the rank of T_m when the topological values are sorted in ascending order, and Y denotes the set of labeled news nodes; finally, a topology weight is obtained for each node in the network, and only the weights w_m of the news nodes are used in subsequent calculations.

The further technical scheme of the invention is as follows: the feature vector learning for each type of node in the hierarchical graph attention mechanism of step 3.2 is specifically as follows:

firstly, capturing the importance of other types of neighbor nodes of a target node through node-level attention; then acquiring the weight of the neighbor node of the same type as the target node through type-level attention, wherein the formula is shown as (3) and (4);

in the formula, σ (·) represents a LeakyReLU function; and tau represents node types, namely news, comments and users.

The further technical scheme of the invention is as follows: the feature fusion and classification module in the step 4 specifically comprises the following steps:

first, for any news node m_i, its text features are obtained in step 2.2 and its structural features are obtained in step 3.2; to process these features more effectively, the invention fuses the two to obtain the final features, and then trains the node weights of the last layer with cross-entropy to classify false news; the calculation formula is as follows:

in the formula, W is a parameter matrix, b is a bias parameter, and l denotes the number of categories.

Advantageous effects

The invention provides a false news identification method based on a heterogeneous graph convolutional network. First, a new topology smoothing strategy is designed to measure the topology weight of each node: the weight of nodes close to a class center is increased and the weight of nodes far from a class center is reduced. Second, a hierarchical attention mechanism adaptively learns the weight of each edge in the news propagation network, so that the importance of each edge is measured and the negative influence of unreliable edges is mitigated.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention designs a topology smoothing strategy that measures the topology weight of the labeled nodes, alleviating the topology imbalance problem.

2. On this basis, the invention provides a hierarchical attention mechanism to learn the features of the HNG; by appropriately measuring the weight of each relation it assesses the authenticity of that relation, effectively reducing the influence of unreliable relations on the HNG.

3. The experimental results on the standard data set prove that the technical model related to the invention achieves more excellent performance than the prior method.

Drawings

The drawings, in which like reference numerals refer to like parts throughout, are for the purpose of illustrating particular embodiments only and are not to be considered limiting of the invention.

FIG. 1 is a general model framework diagram of the method in an example of the invention.

Fig. 2 is a diagram of the heterogeneous news propagation graph (HNG) according to an embodiment of the present invention.

Fig. 3 is a block diagram of the multi-head self-attention mechanism used in the method according to an embodiment of the present invention.

Fig. 4 is a graph comparing the early false news detection performance of the method according to an embodiment of the present invention with that of prior-art methods.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The invention provides a false news identification method based on a heterogeneous graph convolutional network, which consists of four sub-modules: a text data acquisition and heterogeneous news propagation graph construction module, a text feature acquisition module, a hierarchical graph convolution module, and a node classification task training module. The overall model framework is shown in Fig. 1 and is described as follows:

1. Text data acquisition and heterogeneous news propagation graph construction

1.1 text data acquisition

The data used by the method are obtained from the Weibo and Twitter social platforms, yielding the Weibo, Twitter15 and Twitter16 datasets, all of which are public datasets that have been validated for use. The data contain news items M = [m_1, m_2, ..., m_n], the comments R = [r_1, r_2, ..., r_j] on each news item, and users U = [u_1, u_2, ..., u_r].

1.2 heterogeneous News dissemination graph (HNG) construction

Based on three types of nodes (news text, comments and users), five relations are modeled: <user-publish-news>, <source news-similar time/similar-source news>, <comment-source news>, <comment opinion-approval/question-news> and <user-follow-user>. A heterogeneous false news network HNG is constructed from them to enrich the information about false news; the resulting HNG is shown in Fig. 2. For convenience, the HNG is denoted G = (V, E), A denotes its adjacency matrix, A' = A + I denotes the adjacency matrix with self-loops added, and D denotes the degree matrix.

2. Text feature acquisition

2.1 initial text feature acquisition

For a source news item m_i and its comments R = [r_1, r_2, ..., r_j], the initial sequence features of m_i are first obtained using a CNN. The CNN feature acquisition formula is as follows:

in the formula, W denotes the convolution kernel parameter matrix and σ(·) denotes a nonlinear activation function. The features of each reply r_j are extracted in the same way.
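As a rough illustration of CNN-based text feature extraction, the sketch below slides each convolution kernel over a sequence of word vectors, applies a nonlinearity, and keeps the maximum response per kernel. The choice of ReLU for σ(·), the max-over-time pooling, and all function and parameter names are assumptions; the text does not fix these details.

```python
def relu(x):
    """Nonlinear activation; ReLU is an assumed choice for sigma."""
    return x if x > 0.0 else 0.0


def cnn_text_feature(embeddings, kernels, biases):
    """Return one feature per kernel for a single sentence.

    embeddings: list of T word vectors, each of dimension d
    kernels:    list of filters; each filter is a list of h weight vectors
                of dimension d, where h is the window size
    biases:     one scalar per filter
    """
    features = []
    for kernel, bias in zip(kernels, biases):
        h = len(kernel)
        activations = []
        for start in range(len(embeddings) - h + 1):
            window = embeddings[start:start + h]
            # Dot product between the kernel and the current window of words.
            s = bias + sum(
                w * x
                for w_vec, x_vec in zip(kernel, window)
                for w, x in zip(w_vec, x_vec)
            )
            activations.append(relu(s))
        # Max-over-time pooling; 0.0 if the sentence is shorter than the window.
        features.append(max(activations) if activations else 0.0)
    return features
```

Applying the same function to the source news and to every reply gives one fixed-length vector per text, which is the shape the later attention step expects.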

2.2 text context semantic feature acquisition

To further refine the semantic representation between the comments and the source news, a multi-head self-attention mechanism is used to capture the correlation between the news content and the comments. Specifically, the attention mechanism lets every sentence attend to all the others, capturing the consistency between them. Through this semantic consistency encoding, the text features of each news item and the features of its comments are obtained.

The multi-head self-attention model is shown in Fig. 3.
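A minimal sketch of multi-head scaled dot-product self-attention over the sentence vectors of one news item and its comments follows. It is not taken from the patent: the projection matrices `wq`, `wk`, `wv`, the head-splitting scheme and the omission of a final output projection are assumptions made to keep the example short.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def multi_head_self_attention(x, wq, wk, wv, num_heads):
    """x: n sentence vectors of dimension d (row 0 = source news, rest = comments).

    wq, wk, wv: d x d projection matrices (hypothetical learned parameters).
    Returns n context-aware vectors of dimension d (heads concatenated).
    """
    d = len(x[0])
    assert d % num_heads == 0, "d must be divisible by num_heads"
    dh = d // num_heads
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    out = [[] for _ in x]
    for h in range(num_heads):
        lo, hi = h * dh, (h + 1) * dh
        for i, qi in enumerate(q):
            # Scaled dot-product scores of sentence i against every sentence.
            scores = [
                sum(a * b for a, b in zip(qi[lo:hi], kj[lo:hi])) / math.sqrt(dh)
                for kj in k
            ]
            attn = softmax(scores)
            head = [sum(w * vj[lo + c] for w, vj in zip(attn, v)) for c in range(dh)]
            out[i].extend(head)
    return out
```

Because every sentence attends to every other sentence, the output row for the source news mixes in comment information and vice versa, which is the cross-learning effect the section describes.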

3. Hierarchical graph convolution model

3.1 topology smoothing strategy

In the graph structure HNG, training samples of different classes differ not only in quantity but also in positional structure. In the node classification task in particular, the labeled (training) nodes are unevenly distributed over the graph, which gives rise to the topology imbalance problem. To alleviate the resulting degradation in model training, the influence distribution of each labeled node is first measured with the personalized PageRank algorithm, yielding a probability matrix P, where a ∈ (0,1) is the random walk probability; the formula is given in (8).

$$P = a\,\bigl(I - (1-a)A'\bigr)^{-1} \tag{8}$$
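The closed form in (8) can be approximated without an explicit matrix inverse by iterating the fixed point P ← aI + (1 − a)A′P, which converges to a(I − (1 − a)A′)^{-1} whenever A′ is row-stochastic. The sketch below assumes that A′ = A + I is row-normalized before use; the text does not state a normalization, so that step, the default a = 0.15 and the function names are assumptions.

```python
def row_normalize(adj):
    """Add self-loops (A' = A + I) and scale each row to sum to 1 (assumed step)."""
    n = len(adj)
    out = []
    for i in range(n):
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        total = sum(row)
        out.append([v / total for v in row])
    return out


def personalized_pagerank(adj, a=0.15, iters=200):
    """Approximate P = a (I - (1 - a) A')^{-1} by fixed-point iteration."""
    n = len(adj)
    a_norm = row_normalize(adj)
    # Start from the identity; each sweep applies P <- a I + (1 - a) A' P.
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        p = [
            [
                (a if i == j else 0.0)
                + (1.0 - a) * sum(a_norm[i][k] * p[k][j] for k in range(n))
                for j in range(n)
            ]
            for i in range(n)
        ]
    return p
```

Row i of the result describes how node i's influence is distributed over the graph, which is the quantity the topology smoothing strategy inspects for each labeled node.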

Next, assume that if a labeled news node m_i is strongly influenced by neighbor nodes carrying other labels, then m_i encounters substantial conflict during message passing and lies close to a topological class boundary. Based on this assumption, the invention designs a topology imbalance quantification index T_m, based on node information conflict detection, to capture the degree of topological imbalance of the graph. The labeled nodes are then re-weighted: the training weight of nodes close to class boundaries is reduced and the training weight of nodes close to class centers is increased. The weight calculation formula is as follows:

in the formula, w_max and w_min are hyperparameters, T_m denotes the topological value, rank(T_m) denotes the rank of T_m when the topological values are sorted in ascending order, and Y denotes the set of labeled news nodes. Finally, a topology weight is obtained for each node in the network, and only the weights w_m of the news nodes are used in subsequent calculations.
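One way to turn ranked topological values into training weights is sketched below. The weight formula itself is not reproduced in this text, so the cosine schedule used here, w = w_min + ½(w_max − w_min)(1 + cos(π · rank / (|Y| − 1))), is an assumption borrowed from common rank-based re-weighting practice. It only honors the stated behavior: nodes with a low T_m (near class centers) receive weights near w_max, and nodes with a high T_m (near class boundaries) receive weights near w_min. The function name and default hyperparameters are hypothetical.

```python
import math


def topology_weights(t_values, w_min=0.5, w_max=1.5):
    """Map topological values to training weights.

    t_values: {node_id: T_m} for the labeled news nodes (larger = more conflict)
    Returns {node_id: w_m}, decreasing from w_max to w_min as the rank of T_m grows.
    The cosine schedule is an assumed stand-in for the patent's formula.
    """
    ranked = sorted(t_values, key=lambda node: t_values[node])  # ascending T_m
    n = len(ranked)
    weights = {}
    for rank, node in enumerate(ranked):
        frac = rank / (n - 1) if n > 1 else 0.0
        weights[node] = w_min + 0.5 * (w_max - w_min) * (1.0 + math.cos(frac * math.pi))
    return weights
```

Using ranks rather than raw T_m values keeps the weights inside [w_min, w_max] regardless of the scale of the index.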

3.2 Hierarchical graph attention mechanism

In the heterogeneous news propagation graph HNG, for a given node, neighbor nodes of different types may influence it differently, and neighbor nodes of the same type may also differ in importance. Therefore, to capture importance at both the node level and the type level, a two-level attention mechanism is used to distinguish false news. Specifically, the importance of the target node's neighbors of other types is captured through node-level attention; the weights of neighbors of the same type as the target node are then obtained through type-level attention. The formulas are given in (10) and (11), where σ(·) denotes the LeakyReLU function and τ denotes the node type, namely news, comment or user.
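The two-level idea can be sketched as two nested softmax attentions: one over the individual neighbors inside each type group, and one over the resulting per-type summaries. Formulas (10) and (11) are not reproduced in this text, so the scoring function used here (LeakyReLU of a learned vector applied to the concatenation of target and candidate), the parameter names `node_att` and `type_att`, and the order in which the two levels are applied are assumptions for illustration only.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0.0 else slope * x


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def attend(target, candidates, att_vec):
    """Softmax attention of `target` over `candidates`.

    Each score is LeakyReLU(att_vec . [target || candidate]).
    Returns the attention-weighted sum of candidates and the weights.
    """
    scores = [
        leaky_relu(sum(a * x for a, x in zip(att_vec, target + c)))
        for c in candidates
    ]
    alphas = softmax(scores)
    dim = len(candidates[0])
    pooled = [sum(al * c[k] for al, c in zip(alphas, candidates)) for k in range(dim)]
    return pooled, alphas


def hierarchical_attention(target, neighbors_by_type, node_att, type_att):
    """neighbors_by_type: {type: [neighbor vectors]} with types such as
    "news", "comment", "user"; node_att: {type: vector of length 2d};
    type_att: vector of length 2d. Returns (new vector, {type: weight}).
    """
    types = sorted(neighbors_by_type)
    # Level 1: weigh individual neighbors inside each type group.
    type_vecs = [attend(target, neighbors_by_type[t], node_att[t])[0] for t in types]
    # Level 2: weigh the per-type summaries against each other.
    fused, betas = attend(target, type_vecs, type_att)
    return fused, dict(zip(types, betas))
```

Low attention weights on a neighbor or on a whole relation type shrink its contribution, which is how unreliable edges can be down-weighted during aggregation.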

4. False news classification

The present invention treats false news detection as a classification problem. For any news node m_i, its structural features in the HNG are combined with its text features. Finally, the node weights of the last layer are trained with cross-entropy to classify false news. The calculation formula is as follows:

where W is a parameter matrix, b is a bias parameter, and l denotes the number of categories; for example, the Weibo dataset has only two categories (true news, false news), whereas the Twitter15 and Twitter16 datasets have four.
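The classification step can be illustrated with a short sketch: the two feature vectors are fused, passed through a linear layer with parameters W and b, normalized with softmax over the l categories, and scored with cross-entropy. Two details are assumptions rather than statements from the text: fusion is done by simple concatenation, and the topology weight w_m of each labeled news node is applied as a per-sample multiplier on the loss. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def classify(text_feat, struct_feat, weight, bias):
    """Fuse by concatenation (assumed), then apply softmax(W z + b).

    weight: l rows, one per category, each as long as the fused vector
    bias:   l scalars
    """
    z = text_feat + struct_feat
    logits = [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


def weighted_cross_entropy(probs_list, labels, node_weights):
    """Mean cross-entropy where each sample is scaled by its topology weight."""
    total = 0.0
    for probs, y, w in zip(probs_list, labels, node_weights):
        total -= w * math.log(probs[y])
    return total / len(labels)
```

With l = 2 this covers the Weibo setting, and with l = 4 the Twitter15 and Twitter16 settings; only the number of rows in `weight` changes.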

5. Experiment and results

5.1 Classification Effect

Table 1 shows the classification performance of the present invention, referred to as TRHAN, on the Twitter15 and Twitter16 datasets. The results show that TRHAN outperforms the state-of-the-art graph-based method GLAN on all datasets. Specifically, TRHAN improves on the accuracy of the best baseline model by 2.5% and 1.7% on the Twitter15 and Twitter16 datasets, respectively. There are two main reasons for this. First, TRHAN takes into account the unreliable relations and the rich structural features inherent in news propagation graphs. Second, unlike CGAT and GLAN, TRHAN pays more attention to the node topology imbalance problem on news graphs, which helps improve model performance.

Table 1. Detection performance of the TRHAN method on the Twitter15 and Twitter16 datasets

5.2 Early detection performance

Detecting false news at an early stage is particularly important for limiting its spread. The earlier the detection deadline, the less propagation information (comments, users and so on) is available. To evaluate early detection performance, the invention sets a series of detection deadlines [0 h, 2 h, 4 h, 6 h, 8 h, 12 h, 24 h]. Fig. 4 shows the early detection performance. As the figure shows, the TRHAN method reaches high accuracy very early. Specifically, within 2 hours TRHAN reaches an accuracy of 94% on the Weibo dataset, and 87.2% and 84.9% on the Twitter15 and Twitter16 datasets respectively, well above the results of the other methods.
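An early-detection evaluation of this kind typically restricts the propagation data to what was visible before each deadline. The helper below is a minimal sketch of that filtering step; the input layout (comment id and Unix timestamp pairs) and the function name are assumptions, since the text does not describe how the truncation is implemented.

```python
DEADLINES_HOURS = [0, 2, 4, 6, 8, 12, 24]  # detection deadlines listed in the text


def truncate_by_deadline(comments, post_time, deadline_hours):
    """Keep only the comments posted within `deadline_hours` of the source news.

    comments:  list of (comment_id, timestamp_in_seconds) pairs (assumed layout)
    post_time: timestamp of the source news, in seconds
    """
    cutoff = post_time + deadline_hours * 3600
    return [cid for cid, ts in comments if ts <= cutoff]
```

Running the graph construction and classification on the truncated data for each deadline would give one accuracy value per deadline, which is the kind of curve an early-detection figure reports.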

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims

1. A false news identification method based on heterogeneous graph convolutional network is characterized by comprising the following steps:

step 2.2: to further obtain the contextual semantic features between the source news and the comments, the relevance between the comments and the source news is obtained through a multi-head self-attention model, so that new features carrying contextual semantics are obtained for the news and the comments; these features serve as the initial feature vectors of the source news nodes and comment nodes in heterogeneous graph learning;

2. The method of claim 1, wherein the social platforms in step 1 are Weibo and Twitter, from which three datasets are obtained, namely Weibo, Twitter15 and Twitter16.

3. The false news identification method based on the heterogeneous graph convolutional network as claimed in claim 2, wherein the heterogeneous news propagation graph HNG in the step 1 is constructed in a specific manner:

(3) if two news items are published at the same time or share a common user, connecting the two news items;

4. A method for false news identification based on heterogeneous graph convolutional network as claimed in claim 3, wherein the natural language processing model used in step 2.1 is CNN model, aiming at learning a feature vector representing each news and each comment information.

5. The method of claim 4, wherein the multi-head self-attention model used in step 2.2 is input as the feature vector of each news and each comment obtained in step 2.1, and the multi-head self-attention mechanism is used to cross-learn the semantic relationship of the sentences between the news and the comment, so as to obtain a semantic feature vector representing the context for each news and each comment.

6. The false news identification method based on heterogeneous graph convolution network according to claim 5, wherein the topology weight calculation of each node in the topology smoothing policy in step 3.1 specifically includes:

$$P = a\,\bigl(I - (1-a)A'\bigr)^{-1} \tag{1}$$

next, assume that if a labeled news node m_i is strongly influenced by neighbor nodes carrying other labels, then m_i encounters substantial conflict during message passing and lies close to a topological class boundary; based on this assumption, the invention designs a topology imbalance quantification index T_m, based on node information conflict detection, to capture the degree of topological imbalance of the graph, and re-weights the labeled nodes by reducing the training weight of nodes close to class boundaries and increasing the training weight of nodes close to class centers; the weight calculation formula is as follows:

in the formula, w_min and w_max are hyperparameters, T_m denotes the topological value, rank(T_m) denotes the rank of T_m when the topological values are sorted in ascending order, and Y denotes the set of labeled news nodes; finally, a topology weight is obtained for each node in the network, and only the weights w_m of the news nodes are used in subsequent calculations.

7. The false news identification method based on the heterogeneous graph convolutional network as claimed in claim 6, wherein the feature vector learning for each type of node in the hierarchical graph attention mechanism of step 3.2 is specifically as follows:

firstly, capturing the importance of other types of neighbor nodes of a target node through node level attention; then acquiring the weight of the neighbor node of the same type as the target node through type-level attention, wherein the formula is shown as (3) and (4);

8. The false news identification method based on the heterogeneous graph convolution network according to claim 7, wherein the feature fusion and classification module in the step 4 specifically comprises:

first, for any news node m_i, its text features are obtained in step 2.2 and its structural features are obtained in step 3.2; to process these features more effectively, the invention fuses the two to obtain the final features, and then trains the node weights of the last layer with cross-entropy to classify false news; the calculation formula is as follows: