CN114444516B

CN114444516B - Cantonese rumor detection method based on deep semantic-aware graph convolutional network

Info

Publication number: CN114444516B
Application number: CN202210371266.1A
Authority: CN
Inventors: 王海舟; 陈欣雨; 柯亮; 方怡萱; 王森; 蔡易成; 王文贤
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-04-08
Filing date: 2022-04-08
Publication date: 2022-07-05
Anticipated expiration: 2042-04-08
Also published as: CN114444516A

Abstract

The invention relates to the technical field of rumor detection and discloses a Cantonese rumor detection method based on a deep semantic-aware graph convolutional network. First, several groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to collect the relevant tweets together with user, retweet, and comment information, and the data set Net-CR-Dataset is constructed once data annotation is complete. Second, a deep semantic-aware graph convolutional network model, SA-GCN, is designed. The BERT Chinese pre-trained model is adapted to the distinctive linguistic features of Cantonese, then further pre-trained and fine-tuned on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets. An improved GCN extracts the structural features of the tweets and generates structural feature vectors. Finally, the SA-GCN model fuses the structural feature vectors and the semantic feature vectors to obtain the final classification result. The invention outperforms other common detection methods in both detection performance and early detection capability.

Description

Cantonese rumor detection method based on deep semantic-aware graph convolutional network

Technical Field

The invention relates to the technical field of rumor detection, and in particular to a Cantonese rumor detection method based on a deep semantic-aware graph convolutional network.

Background

Social media provide a platform for people to follow trending events, express opinions, and make friends, and play an indispensable role in daily life. The "Digital 2021" report shows that by January 2021 the number of active social media users worldwide had reached 4.2 billion, about 53.6% of the world's population. Because social networks strongly influence public opinion, rumors emerge there endlessly; they not only disturb network order and cause social panic, but also bring economic losses in the real world and endanger national security. Besides the common English and Chinese rumors, Cantonese rumors are also a persistent problem in social networks. Cantonese, a branch of Chinese, is prevalent not only in Guangdong, Hong Kong, and Macau in China, but also among overseas Chinese communities. There are more than 120 million Cantonese speakers spread across the world, and Cantonese rumors that spread widely in social networks also have a considerable adverse effect on the real world. Therefore, an effective method for automatically detecting Cantonese rumors in social networks is needed.

Traditional rumor detection methods mainly adopt a supervised learning strategy and train a machine learning classifier, such as an SVM (Support Vector Machine), RF (Random Forest), Bayes classifier, or DT (Decision Tree), on features manually extracted from text content, user profiles, and propagation patterns. Further research has extracted more effective features, such as time-series features and topic features. Detection methods based on traditional machine learning rely mainly on feature engineering, which requires a large investment of time. Moreover, it is difficult to manually extract effective high-order feature representations, which limits the performance gains of such methods.

In recent years, powerful deep learning models have been widely applied to this task to solve the problems of detection methods based on machine learning. RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network), and their variants and combinations have all made significant progress in the field of rumor detection. In addition, some studies further improve detection performance by introducing techniques such as attention mechanisms and generative adversarial learning. However, most of the above methods treat the microblog texts to be detected as independent individuals and ignore the connections between them. A social network is built on a graph: it includes not only entities such as users, posts, and hashtags, but also relationships such as friendship, retweeting, and commenting. The graph structure contains rich information and thus provides new features for rumor detection. For example, if a user frequently interacts (e.g., retweets, comments, or follows) with multiple rumor-spreading users, the probability that a post published by this user is a rumor increases greatly.

To address this problem, recent research has focused on rumor detection using the propagation structure information contained in the graph structure. Some studies model the propagation behavior of posts as a propagation tree and design models such as PTK (Propagation Tree Kernel) and RvNN (Recursive Neural Network) to learn the structural features of the propagation tree. Meanwhile, methods such as GCN (Graph Convolutional Network), GAT (Graph Attention Network), and PGNN (Propagation Graph Neural Network) have been proposed to extract the global structural features of posts from the propagation graph, thereby improving detection performance. However, most of these methods focus only on obtaining propagation and structure information from the way posts spread and develop over time, and ignore features from text content, user profiles, and the like, which may cause important information (e.g., text features) to be lost and affect the final detection result. In addition, a shallow GCN cannot learn the features of distant nodes. Studies have shown that, although deeper GCNs can capture richer neighbor information, 2-layer GCNs perform best.

Research on Cantonese rumor detection was first carried out in [KE L, CHEN X, LU Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural Networks. 33rd IEEE International Conference on Systems, Man, and Cybernetics [C], Toronto, Canada, 2020], which collected Cantonese rumors on Twitter extensively and thereby constructed the first relatively complete Cantonese rumor data set, CR-Dataset. That work also proposed 27 statistical features of Cantonese rumors and designed a deep-learning-based detection model that fuses the semantic features and statistical features of rumors for detection. Experiments showed that the method achieves excellent detection performance. However, the method does not take propagation structure features into account, although they are important for identifying rumors in practice. In addition, the constructed CR-Dataset lacks the structural information of the tweets.

On the one hand, traditional rumor detection methods cannot be applied directly to the Cantonese scenario. The methods described above mainly target English and Mandarin rumors in social networks, and solutions for Cantonese rumors are lacking. Because Cantonese contains newly coined words, distinctive colloquial words, and a mixed Chinese-English grammatical structure, traditional methods cannot achieve their best detection performance on it. On the other hand, the rumors in Cantonese that spread widely in social networks can have serious negative effects on the real world. As one of the most popular and influential dialects of Chinese, Cantonese has more than 120 million speakers worldwide, distributed not only in Guangdong Province, Hong Kong, and Macau in China, but also in 12 other countries such as Singapore, Thailand, the United States, and Canada.

Therefore, a new graph-structure-based method is needed for detecting Cantonese rumors in social networks.

The distinctive linguistic features of Cantonese pose a serious challenge to the detection of Cantonese rumors in social networks. Unlike standard Chinese, Cantonese includes many variant characters (e.g., - ) and rare characters (e.g., ). Although Modern Standard Chinese and Cantonese share etymologically related syllables with the same (or ultimately related) meaning, some Cantonese users write them with variant characters. It is difficult for a Chinese language model to learn the semantics of these distinctive characters automatically. Meanwhile, the mixed Chinese-English grammatical structure of Cantonese makes the extraction of semantic features difficult. Over time, Cantonese users have gradually incorporated English words into the Cantonese language system. For example, a sentence rendered literally as "to do oh work coffee" embeds the English word "work" (meaning that doing so is not feasible). This usage makes word segmentation, and the acquisition of the contextual semantics of words, difficult.

In recent years, researchers have conducted extensive research on the automated detection of rumors in social networks, mainly covering detection methods based on traditional machine learning and detection methods based on deep learning.

(1) Detection method based on traditional machine learning

Most rumor detection research focuses on training rumor classifiers on features manually extracted from microblog text content, user profiles, propagation paths, and similar sources. Castillo et al extracted user-based, message-based, topic-based, and propagation-based features from "tweet" and "retweet" behaviors on the Twitter platform to evaluate the credibility of a given tweet. Yang et al extended the feature set proposed by Castillo et al: on top of the earlier content-based, account-based, and propagation-based features, they proposed client-based and location-based features and realized rumor detection on the Sina Weibo platform. Kwon et al explored conventional structural and linguistic features as well as novel temporal features that arise from the sudden fluctuations of rumors over time. Ma et al designed a time-series model to capture how the social context of a rumor changes over time, rather than relying only on the tweet volume feature. Detection methods of this type rely mainly on feature engineering and therefore require a great deal of research time and human labor. Moreover, some implicit features are difficult to find through feature engineering, and manually extracted features are not strongly robust, which makes it particularly difficult to improve the performance of such methods.

(2) Detection method based on deep learning

To solve the above problems of traditional machine learning methods, many studies use deep learning for feature learning, so as to capture high-order feature representations and achieve better classification performance. Ma et al were the first to identify rumors by using an RNN model to learn temporal and textual representations of rumor posts. On this basis, Chen et al proposed incorporating an attention mechanism into the RNN model to capture the text features that matter most to the detection task. Jin et al extracted multimodal features from rumor text, images, and social context for the rumor detection task. These RNN-based methods can detect rumors effectively, but are not suitable for early rumor detection. To address this problem, Yu et al devised a CNN-based model that identifies misinformation effectively and enables early detection. Some recent studies also combine RNN and CNN for detection. Liu et al used RNN and CNN networks to learn global and local changes in user characteristics in order to identify fake news. Ma et al introduced the idea of generative adversarial learning, generating adversarial noise so that the classifier learns a stronger rumor representation and detects rumors more robustly and effectively.

Furthermore, Ke et al were the first to study Cantonese rumor detection in social networks. They extracted 27 statistical features of Cantonese rumors in four categories (content, user, propagation, and comment) and designed a Cantonese rumor detection model, BLA (BERT-based Bi-LSTM network with Attention mechanism). The model extracts the semantic features of tweets with a BERT (Bidirectional Encoder Representations from Transformers) model, a Bi-LSTM (Bi-directional Long Short-Term Memory) network, and an attention mechanism, and then concatenates them with the extracted statistical features, thereby recognizing Cantonese rumors effectively. However, classical deep learning rumor detection techniques mostly focus on extracting features from text, images, and similar sources, and ignore the structural features of rumor propagation.

In addition, in the graph structure of a social network, the propagation process of a rumor carries rich information. Some studies have used the propagation relationships in the graph structure for rumor detection. Sicilia et al combined content-based features with fine-grained features inspired by graph theory to detect rumors in single-topic posts related to health news. Ma et al proposed a kernel-based approach that obtains high-order rumor representations by comparing the similarity of the propagation tree structures of microblog posts. This method achieves good detection performance, but cannot automatically learn high-order features from flat features that contain noise. To solve this problem, Ma et al proposed learning the propagation tree of microblog posts with an RvNN network, thereby extracting effective semantic and propagation features. Yuan et al additionally considered the connections between different propagation trees and proposed a novel model that encodes local context information and global structure information. Bian et al proposed a bi-directional graph model, Bi-GCN, to learn high-order feature representations from both the propagation and dispersion directions. Yang et al designed a graph adversarial learning approach to enhance the robustness and generalization of rumor detection models. Most graph-structure-based detection methods do not consider fusing multiple types of features, which may cause important information, such as text content and user information, to be lost. Meanwhile, a common shallow GCN cannot capture the features of distant neighbors.

Moreover, the existing studies have not explored Cantonese rumors. Cantonese is a major branch of Chinese with a widely distributed population of speakers, and Cantonese rumors in social networks emerge endlessly; the invention therefore studies the detection of Cantonese rumors on Twitter with a method based on graph structure and feature fusion.

However, most of the existing detection methods based on graph structures ignore the fusion of multiple features, resulting in the loss of some important information (such as text content). Also, common shallow GCNs may not be able to capture the features of distant neighbors.

Disclosure of Invention

In view of the above problems, an object of the present invention is to provide a Cantonese rumor detection method based on a deep semantic-aware graph convolutional network. The method learns the structural features of tweets with an improved GCN, captures the semantic features of tweets with a BERT Cantonese pre-trained model that has been further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural features and the semantic features, so that its detection performance and early detection capability are superior to those of other common detection methods. The technical scheme is as follows:

A Cantonese rumor detection method based on a deep semantic-aware graph convolutional network comprises the following steps:

Step 1: constructing several groups of health-related Cantonese rumor keywords, collecting the relevant tweets together with user, retweet, and comment information, and constructing a Cantonese rumor data set with graph structure information, Net-CR-Dataset; that is, modeling the entities in the social network and the relationships between them as a graph $G=\langle V,E\rangle$;

Step 2: fusing a BERT model, a GCN, and an attention mechanism to provide a social network Cantonese rumor detection model, SA-GCN (Semantic-Aware Graph Convolutional Network): extracting structural feature vectors of the tweets with an improved GCN; adapting the BERT Chinese pre-trained model to the distinctive linguistic features of Cantonese, and further pre-training and fine-tuning it on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets; and finally fusing the two types of features to obtain the final classification result.

Further, modeling the entities in the social network and the relationships between them as a graph $G=\langle V,E\rangle$ specifically comprises:

$T=\{t_1,t_2,\dots,t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets; $R_i=\{r_{i1},r_{i2},\dots,r_{in}\}$ denotes the set of retweets and comments of the original tweet $t_i$, where $r_{ij}$ is a retweet/comment of $t_i$ and $n$ is the number of retweets and comments;

$V=\{V_1,V_2,\dots,V_m\}$, where $V_i=\{t_i,R_i\}$ is the node set of the original tweet $t_i$, containing the node of $t_i$ and the nodes of its retweet and comment set $R_i$;

$E=\{E_1,E_2,\dots,E_m\}$, where $E_i$ is the edge set of the original tweet $t_i$, representing the retweet/comment relationships between nodes;

$X=\{x_1,x_2,\dots,x_m\}$ denotes the feature matrix of the original tweet set $T$, with $x_i\in\mathbb{R}^k$, where $k$ is the dimension of the feature $x_i$ and $x_i$ is the feature vector of node $t_i$;

$A$ denotes the adjacency matrix of graph $G$, a matrix that represents the adjacency relationship between nodes and indicates whether any two nodes in the graph are connected by an edge;

assuming that there is an edge $e_{jk}$ between the retweet/comment nodes $r_{cj}$ and $r_{ck}$, the adjacency matrix $A$ is expressed as follows:

$$A_{jk}=\begin{cases}1, & e_{jk}\in E_c\\ 0, & \text{otherwise}\end{cases}\tag{1}$$

where $E_c$ is the edge set of the original tweet $t_c$;

the rumor detection task is regarded as a binary classification problem in which the original tweet $t_i$ has the label $y_i\in\{0,1\}$, 0 for non-rumors and 1 for rumors; the rumor detection target is then to learn a classifier $f$:

$$f: T\rightarrow Y\tag{2}$$

where $T$ and $Y$ are the original tweet set and the label set, respectively.
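The adjacency matrix of formula (1) can be illustrated with a minimal Python sketch. The node numbering and the example edge list below are hypothetical and are not taken from Net-CR-Dataset; the matrix is made symmetric because the description treats the graph as undirected.

```python
from typing import List, Tuple


def build_adjacency(num_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Return the symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for j, k in edges:
        adj[j][k] = 1
        adj[k][j] = 1  # undirected graph: the edge direction is ignored
    return adj


# Hypothetical propagation tree of one original tweet t_c:
# node 0 is the original tweet, nodes 1-3 are its retweets/comments.
edges_c = [(0, 1), (0, 2), (2, 3)]
A = build_adjacency(4, edges_c)
for row in A:
    print(row)
```

An entry of 1 marks a retweet/comment relationship contained in the edge set of the tweet; every other entry stays 0.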

Further, the step 2 specifically includes:

Step 2.1: extracting structural features: the original tweets, retweets, and comments in Net-CR-Dataset are taken as nodes and the retweet and comment relationships as edges; the propagation paths of tweets in the social network are thereby converted into graph-structured data, and the improved GCN (Graph Convolutional Network) aggregates the information along the propagation paths of the tweets to generate high-level structural feature vectors of the tweets;

Step 2.2: extracting semantic features: a mapping table is constructed to convert variant characters in Cantonese into the corresponding Mandarin characters, and rare characters are split; the vocabulary of the BERT Chinese pre-trained model is extended; the BERT-Base-Chinese model is further pre-trained on the collected Cantonese corpus so that it learns more features of Cantonese, and is then fine-tuned on Net-CR-Dataset to obtain the BERT Cantonese pre-trained model, from which semantic feature vectors of the tweets are extracted;

Step 2.3: the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result.
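The mapping-table preprocessing of step 2.2 can be sketched as a simple character substitution. The three table entries below are illustrative variant/standard pairs chosen for this sketch and are not the mapping table of the invention; the splitting of rare characters is not shown, since the text does not describe how it is done.

```python
from typing import Dict

# Hypothetical mapping table: variant form -> standard form (illustrative entries only).
VARIANT_TO_STANDARD: Dict[str, str] = {
    "裏": "裡",
    "羣": "群",
    "牀": "床",
}


def normalize_variants(text: str, table: Dict[str, str]) -> str:
    """Replace every variant character that has an entry in the table."""
    return text.translate(str.maketrans(table))


print(normalize_variants("牀裏", VARIANT_TO_STANDARD))
```

Characters without an entry in the table pass through unchanged, so the substitution can be applied to mixed Chinese-English text.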

Further, extracting the structural features in step 2.1 specifically comprises:

Step 2.1.1: a multi-head attention mechanism is used to mine latent structural correlations between vertices, especially between nodes that are not directly connected and are separated by multiple hops; the specific process is as follows:

the node features $U=\{u_1,u_2,\dots,u_N\}$ are generated with the Cantonese pre-trained word vectors provided by fastText, where $N$ is the number of all nodes;

an attention adjacency matrix is constructed, converting the propagation tree of the original tweet into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are considered comprehensively; the $m$-th attention adjacency matrix, associated with the $m$-th head, is calculated as follows:

$$A^{(m)}=\mathrm{softmax}\left(\frac{QW_m^{Q}\left(KW_m^{K}\right)^{\top}}{\sqrt{d}}\right)\tag{3}$$

where $Q$ and $K$ are both equal to the node features, i.e. the extracted node features $U$; $d$ is the dimension of the feature vector; $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively;

Step 2.1.2: densely connected layers are used to capture local and distant node features, which solves the problem that a shallow GCN cannot learn the information of deeply associated nodes and produces a better node representation;

each densely connected layer contains $L$ sub-layers; for node $i$, the output of the $l$-th sub-layer is as follows:

$$h_i^{(l)}=\rho\left(\sum_{j=1}^{N}A_{ij}^{(m)}W_m^{(l)}g_j^{(l)}+b_m^{(l)}\right)\tag{4}$$

where $\rho$ is the ReLU function; the weight matrix $W_m^{(l)}$ and the bias $b_m^{(l)}$ depend on $A^{(m)}$, the attention adjacency matrix of the $m$-th head; $A_{ij}^{(m)}$, an element of the matrix $A^{(m)}$, relates node $i$ to node $j$; $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating the initial feature $h_j^{(0)}$ with the node features $h_j^{(1)},\dots,h_j^{(l-1)}$ produced by sub-layers $\{1,2,\dots,l-1\}$, as shown in the following formula:

$$g_j^{(l)}=\left[h_j^{(0)};h_j^{(1)};\dots;h_j^{(l-1)}\right]\tag{5}$$

Step 2.1.3: a linear combination layer is introduced to integrate the representations from the different densely connected layers; the output $S$ of the linear combination layer is defined as follows:

$$S=W_{comb}h_{out}+b_{comb}\tag{6}$$

where $h_{out}=[h^{(1)};\dots;h^{(M)}]$, $h^{(M)}$ denotes the feature vector output by the $M$-th densely connected layer, $W_{comb}$ is a weight matrix, and $b_{comb}$ is a bias vector.
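The densely connected layers and the linear combination layer of steps 2.1.2 and 2.1.3 can be sketched in plain Python as follows. This is only a shape-level illustration under stated assumptions: all weights are random and untrained, the attention adjacency matrices are uniform placeholders, and the choice that each sub-layer outputs $d/L$ features (so that a block returns $d$ features again) follows common practice for densely connected GCNs rather than anything specified in the text.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(ms: List[Matrix]) -> Matrix:
    """Concatenate matrices along the feature (column) axis."""
    return [sum((m[i] for m in ms), []) for i in range(len(ms[0]))]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense_block(att: Matrix, h0: Matrix, num_sublayers: int, rng: random.Random) -> Matrix:
    """One densely connected layer: sub-layer l receives [h0; h1; ...; h_{l-1}]."""
    d_sub = len(h0[0]) // num_sublayers       # assumption: each sub-layer outputs d / L features
    outputs = [h0]
    for _ in range(num_sublayers):
        g = hconcat(outputs)                  # formula (5): concatenate all earlier features
        w = rand_matrix(len(g[0]), d_sub, rng)  # stands in for a trained weight matrix
        z = matmul(matmul(att, g), w)         # attention-weighted sum over all nodes j
        outputs.append([[max(0.0, v) for v in row] for row in z])  # formula (4), zero bias
    return hconcat(outputs[1:])               # concatenated sub-layer outputs, width d


rng = random.Random(0)
N, d, L, M = 4, 6, 2, 3                       # hypothetical sizes
h0 = rand_matrix(N, d, rng)
# Placeholder attention adjacency matrices, one per head (uniform rows).
att_heads = [[[1.0 / N] * N for _ in range(N)] for _ in range(M)]
block_outputs = [dense_block(att, h0, L, rng) for att in att_heads]
h_out = hconcat(block_outputs)                # N x (M * d)
w_comb = rand_matrix(M * d, d, rng)
b_comb = [0.0] * d
S = [[v + b for v, b in zip(row, b_comb)] for row in matmul(h_out, w_comb)]  # formula (6)
print(len(S), len(S[0]))
```

Because every sub-layer sees the concatenation of all earlier outputs, features aggregated over several hops remain available to later sub-layers, which is the purpose of the dense connections described above.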

Further, in step 2.2, extending the vocabulary of the BERT Chinese pre-trained model comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, English words commonly used in Cantonese are added to the vocabulary and their weights are initialized randomly, which prevents these English words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
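The effect of the vocabulary extension can be illustrated with a toy greedy longest-match WordPiece splitter. The three-entry vocabulary is hypothetical and the function is a simplified stand-in for the BERT tokenizer, not the tokenizer itself. With a toolkit such as Hugging Face Transformers, the corresponding calls would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; the text does not say which toolkit was used.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand        # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]              # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (hypothetical): the whole word "work" is missing at first.
vocab = {"wo", "##rk", "##ing"}
before = wordpiece("work", vocab)
vocab.add("work")                         # vocabulary extension: add the whole English word
after = wordpiece("work", vocab)
print(before, after)
```

Before the extension the word is broken into sub-word pieces; afterwards it is kept as a single token whose embedding can be learned during further pre-training.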

Furthermore, in step 2.2, fine-tuning the BERT Chinese pre-trained model with Net-CR-Dataset comprises: the original tweet and retweet/comment data $V=\{V_1,V_2,\dots,V_m\}$ are tokenized to obtain $\tilde{V}$; $\tilde{V}$ is then input into the further pre-trained and fine-tuned BERT model, and sentence features are extracted with the Transformer to obtain the sentence vectors $W=\{w_1,w_2,\dots,w_m\}$, as shown below:

$$\tilde{V}=\mathrm{Tokenize}(V)\tag{7}$$

$$W=\mathrm{BERT}(\tilde{V})\tag{8}$$

where $\mathrm{Tokenize}$ is the tokenization operation of the BERT model.

Further, step 2.3 specifically comprises:

Step 2.3.1: the SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in the following formula:

$$F=S\oplus W\tag{9}$$

Step 2.3.2: the feature vector $F$ of the tweet is passed through the $\mathrm{Softmax}$ function to obtain the final classification result, as shown in the following formula:

$$\hat{y}_d=\mathrm{Softmax}(F)\tag{10}$$

where $\hat{y}_d$ is the predicted probability that the tweet sample $d$ is a rumor;

Step 2.3.3: the optimization objective of the model is to minimize the cross-entropy loss function, as shown in the following formula:

$$\mathcal{L}=-\sum_{d\in D}\left[y_d\log\hat{y}_d+(1-y_d)\log(1-\hat{y}_d)\right]\tag{11}$$

where $D$ denotes the set of samples and $y_d$ denotes the true label of the tweet sample $d$.
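The fusion, classification, and loss of step 2.3 can be sketched as follows. The linear layer (`weights`, `bias`) that maps the fused vector to two class scores before Softmax is an assumption of this sketch, since the text only states that the fused vector passes through Softmax; all numeric values are made up for illustration.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Softmax shifted by the maximum for numerical stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(s: Sequence[float], w: Sequence[float],
             weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Concatenate structural and semantic vectors, then return [P(non-rumor), P(rumor)]."""
    f = list(s) + list(w)                                   # concatenation of S and W
    logits = [sum(a * x for a, x in zip(row, f)) + b        # assumed linear layer
              for row, b in zip(weights, bias)]
    return softmax(logits)


def cross_entropy(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Sum over samples of -log(probability assigned to the true label)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))


s = [0.2, -0.1]                    # hypothetical structural feature vector
w = [0.5, 0.3, -0.4]               # hypothetical semantic feature vector
weights = [[0.1, -0.2, 0.3, 0.0, 0.5], [-0.3, 0.4, 0.1, 0.2, -0.1]]
bias = [0.0, 0.0]
probs = classify(s, w, weights, bias)
loss = cross_entropy([probs], [1])  # the true label of this sample is 1 (rumor)
print(probs, loss)
```

For a two-class Softmax output, the negative log-probability of the true label is equal to the binary cross-entropy of the rumor probability, so minimizing it matches the stated optimization objective.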

The invention has the beneficial effects that:

1. To carry out graph-structure-based rumor detection research, the invention constructs a Cantonese rumor data set with graph structure, which contains 2,419 original tweets, 112,539 retweets, and 92,260 comments, forming 207,218 nodes and 202,437 edges in total.

2. The invention further pre-trains and fine-tunes the BERT Chinese pre-trained model on a large collected Cantonese corpus, so that the model learns the semantic information of the vocabulary specific to Cantonese. Meanwhile, the preprocessing flow before training is modified and the vocabulary of the BERT Chinese pre-trained model is extended, so that variant characters and rare characters in Cantonese can be handled, the model is better suited to the mixed Chinese-English grammatical structure of Cantonese, and richer semantic information is captured.

3. The invention designs a Cantonese rumor detection model, SA-GCN, based on a deep semantic-aware graph convolutional network. The model extracts the structural features of tweets with an improved graph convolutional network, captures the semantic features of tweets with a BERT pre-trained model that has been further pre-trained and fine-tuned on Cantonese rumor data, and finally fuses the two types of features.

Drawings

FIG. 1 is a diagram of the SA-GCN model structure of the present invention.

FIG. 2 is a comparison chart of the feature ablation experiments.

FIG. 3 is a ROC curve.

FIG. 4 is a diagram illustrating early detection capabilities of a model.

Detailed Description

The invention is described in further detail below with reference to the figures and specific embodiments.

The invention provides a social network Cantonese rumor detection method based on a deep semantic-aware graph convolutional network. First, several groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to collect the relevant tweets together with user, retweet, and comment information, and the data set Net-CR-Dataset is constructed once data annotation is complete. Second, the invention designs a deep semantic-aware graph convolutional network model, SA-GCN. The BERT Chinese pre-trained model is adapted to the distinctive linguistic features of Cantonese, then further pre-trained and fine-tuned on a large collected Cantonese corpus, so that semantic feature vectors of the tweets are extracted. An improved GCN is applied to extract the structural features of the tweets and generate structural feature vectors. Finally, the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result.

The method comprises the following specific processes:

Step 1: constructing several groups of health-related Cantonese rumor keywords, collecting the relevant tweets together with user, retweet, and comment information, and constructing a Cantonese rumor data set with graph structure information, Net-CR-Dataset; that is, modeling the entities in the social network and the relationships between them as a graph $G=\langle V,E\rangle$. A social network contains entities such as original tweets, retweets, and comments, as well as behaviors such as posting, retweeting, and commenting. The invention models these entities and the relationships between them as the graph $G=\langle V,E\rangle$. $T=\{t_1,t_2,\dots,t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets. $R_i=\{r_{i1},r_{i2},\dots,r_{in}\}$ denotes the set of retweets and comments of the original tweet $t_i$, where $r_{ij}$ is a retweet/comment of $t_i$ and $n$ is the number of retweets and comments. $V=\{V_1,V_2,\dots,V_m\}$, where $V_i=\{t_i,R_i\}$ is the node set of the original tweet $t_i$, containing the node of $t_i$ and the nodes of its retweet and comment set $R_i$. $E=\{E_1,E_2,\dots,E_m\}$, where $E_i$ is the edge set of the original tweet $t_i$ and represents the retweet/comment relationships between nodes. For example, if $r_{ij}$ is a retweet/comment of $t_i$, then there is an edge between the two nodes, and that edge belongs to $E_i$. The invention works on an undirected graph, so the direction of edges is not considered. $X=\{x_1,x_2,\dots,x_m\}$ is the feature matrix of the original tweet set $T$, with $x_i\in\mathbb{R}^k$, where $k$ is the dimension of the feature $x_i$ and $x_i$ is the feature vector of node $t_i$.

$A$ denotes the adjacency matrix of graph $G$. The adjacency matrix represents the adjacency relationship between nodes and indicates whether any two nodes in the graph are connected by an edge. Assuming that there is an edge $e_{jk}$ between the nodes $r_{cj}$ and $r_{ck}$, the adjacency matrix $A$ is expressed as shown in formula (1):

$$A_{jk}=\begin{cases}1, & e_{jk}\in E_c\\ 0, & \text{otherwise}\end{cases}\tag{1}$$

where $E_c$ is the edge set of the original tweet $t_c$.

The invention regards the rumor detection task as a binary classification problem in which the original tweet $t_i$ has the label $y_i\in\{0,1\}$, 0 for non-rumors and 1 for rumors. Therefore, the rumor detection target of the invention is to learn a classifier $f$, as shown in formula (2):

$$f: T\rightarrow Y\tag{2}$$

where $T$ and $Y$ are the original tweet set and the label set, respectively. The invention predicts the label of a tweet based on its structural features and semantic features.

Step 2: fusing a BERT model, a GCN, and an attention mechanism to provide a social network Cantonese rumor detection model, SA-GCN: extracting structural feature vectors of the tweets with an improved GCN; adapting the BERT Chinese pre-trained model to the distinctive linguistic features of Cantonese, and further pre-training and fine-tuning it on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets; and finally fusing the two types of features to obtain the final classification result.

The invention provides a novel social network Cantonese rumor detection model, SA-GCN, which integrates a BERT model, a GCN, and an attention mechanism and detects Cantonese rumors effectively. The structure of the SA-GCN model is shown in FIG. 1.

Step 2.1: structural feature extraction

The GCN is a multi-layer neural network that operates directly on a graph and updates the representation of a node based on the attributes of its neighborhood. Work by Kipf et al has demonstrated the effectiveness of graph convolutional networks in the node classification task. An $L$-layer GCN can capture information from neighbors up to $L$ hops away. Therefore, a shallow GCN cannot aggregate the features of distant nodes. Moreover, studies have shown that deep GCNs do not perform as well as 2-layer networks. To address this problem, Guo et al introduced dense connections into the GCN and proposed an attention-guided graph convolutional network for the relation extraction task. The SA-GCN model proposed by the invention is inspired by that work.
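The relationship between the number of layers and the reachable neighborhood can be checked with a small sketch. The five-node path graph is a hypothetical example, and the function only tracks which nodes can contribute to a node's representation after a given number of neighbor-aggregation rounds; it does not implement a GCN.

```python
from typing import List, Set


def receptive_field(adj: List[List[int]], source: int, num_layers: int) -> Set[int]:
    """Nodes whose features can reach `source` after num_layers aggregation rounds."""
    reached = {source}
    for _ in range(num_layers):
        # Each round adds the direct neighbors of every node reached so far.
        reached |= {k for j in reached for k, a in enumerate(adj[j]) if a}
    return reached


# Hypothetical path graph: 0 - 1 - 2 - 3 - 4
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
print(sorted(receptive_field(adj, 0, 2)))  # [0, 1, 2]
print(sorted(receptive_field(adj, 0, 4)))  # [0, 1, 2, 3, 4]
```

With two rounds, node 0 only receives information from nodes at most two hops away, which is why a shallow network misses distant retweets and comments on a long propagation path.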

Because the distribution of retweets and comments along the propagation path of a tweet is closely related to whether the original tweet is a rumor, the invention takes the original tweets, retweets and comments in Net-CR-Dataset as nodes and the retweet and comment relationships as edges, converts the propagation path of each tweet in the social network into graph-structured data, and uses the improved GCN network to aggregate information along the propagation path, thereby generating a high-level structural feature representation of the tweet.
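As a minimal sketch of this modeling step, the following Python snippet turns the retweet/comment relations of one source tweet into a node list and a 0/1 adjacency matrix. The function name `build_propagation_graph`, the `(child_id, parent_id)` input format and the choice to treat edges as undirected are assumptions made for illustration; the text does not specify them.

```python
def build_propagation_graph(source_id, interactions):
    """Return (nodes, adjacency) for one source tweet.

    interactions: list of (child_id, parent_id) pairs, where the child is a
    retweet or comment and the parent is the post it responds to.
    """
    nodes = [source_id]
    for child, parent in interactions:
        for node in (parent, child):
            if node not in nodes:
                nodes.append(node)
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    for child, parent in interactions:
        i, j = index[child], index[parent]
        # Assumption: edges are stored symmetrically (undirected graph).
        adjacency[i][j] = adjacency[j][i] = 1
    return nodes, adjacency


# Toy propagation tree: t1 is retweeted (r1) and commented on (c1); r1 is retweeted (r2).
nodes, adjacency = build_propagation_graph(
    "t1", [("r1", "t1"), ("c1", "t1"), ("r2", "r1")]
)
```

Each source tweet yields one such matrix; the attention mechanism described next replaces these hard 0/1 entries with learned weights over all node pairs.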

(1) Multi-head attention mechanism

The improved GCN network consists of $M$ blocks, each containing three modules: a multi-head attention mechanism, dense connections and a linear combination. In this part, the multi-head attention mechanism is used to mine potential structural dependencies between vertices, especially between nodes that are not directly connected and are separated by multiple hops. Specifically, the node features $U = \{u_1, u_2, \ldots, u_N\}$ are first generated using the Cantonese pre-trained word vectors provided by fastText, where $N$ is the number of all nodes. Next, by constructing an attention adjacency matrix $A$, the propagation tree of the original tweet is converted into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account. The $m$-th attention adjacency matrix, associated with the $m$-th head, is computed as shown in equation (3):

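$$A^{(m)} = \mathrm{softmax}\left(\frac{Q W_m^{Q} \left(K W_m^{K}\right)^{\top}}{\sqrt{d}}\right) \tag{3}$$

where $Q$ and $K$ are both equal to the extracted node features $U$, $d$ is the dimension of the feature vector, and $W_m^{Q}$ and $W_m^{K}$ are the projection matrices of $Q$ and $K$, respectively. $A^{(m)}$ is used in the subsequent graph convolution.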
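The construction of the attention adjacency matrices can be sketched as follows, assuming equation (3) takes the usual scaled dot-product form, softmax of $Q W^{Q} (K W^{K})^{\top} / \sqrt{d}$, as in attention-guided GCNs. The helper names, the randomly initialized projection matrices and the tiny dimensions are illustrative assumptions, and plain Python lists stand in for tensors.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def attention_adjacency(U, W_q, W_k):
    """One head: softmax((U W_q)(U W_k)^T / sqrt(d)), with Q = K = U."""
    d = len(U[0])
    q = matmul(U, W_q)
    k = matmul(U, W_k)
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, d, heads = 4, 3, 2                      # toy sizes, not from the text
U = random_matrix(N, d, rng)               # stand-in for fastText node features
A_heads = [
    attention_adjacency(U, random_matrix(d, d, rng), random_matrix(d, d, rng))
    for _ in range(heads)
]
```

Every row of each resulting matrix is a probability distribution over all nodes, which is what makes the graph fully connected with weighted edges.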

(2) Densely connected layer

In this part, densely connected layers are used to capture both local and distant node features, which addresses the inability of shallow GCNs to learn information from deeply associated nodes and yields better node representations. Each densely connected layer contains $L$ sub-layers. For node $i$, the output of the $l$-th sub-layer is shown in equation (4):

$$h_i^{(l)} = \rho\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right) \tag{4}$$

where $\rho$ is the ReLU function, the weight matrix $W_m^{(l)}$ and bias $b_m^{(l)}$ depend on $A^{(m)}$, and $A_{ij}^{(m)}$, an element of $A^{(m)}$, represents the connection between node $i$ and node $j$. $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, as shown in equation (5):

$$g_j^{(l)} = \left[h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}\right] \tag{5}$$
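A minimal sketch of one densely connected layer follows, assuming each sub-layer computes a ReLU of the attention-weighted sum of the concatenated inputs, in the manner of attention-guided GCNs. The function name `dense_block`, the weight shapes (one row per output unit) and the toy values are assumptions for illustration.

```python
def dense_block(A, H0, weights, biases):
    """Run the sub-layers of one densely connected layer.

    A:       N x N attention adjacency matrix for one head
    H0:      N x d initial node features h^(0)
    weights: one matrix per sub-layer, each row producing one output unit
    biases:  one bias vector per sub-layer
    Returns the list [h^(1), ..., h^(L)], each an N x d_hidden matrix.
    """
    n = len(H0)
    g = [list(row) for row in H0]          # g^(1) = h^(0)
    outputs = []
    for W, b in zip(weights, biases):
        h = []
        for i in range(n):
            # Aggregate the inputs of all nodes, weighted by A[i][j].
            agg = [sum(A[i][j] * g[j][k] for j in range(n)) for k in range(len(g[0]))]
            # Linear map followed by ReLU.
            h.append([
                max(0.0, sum(w * a for w, a in zip(w_row, agg)) + b_k)
                for w_row, b_k in zip(W, b)
            ])
        outputs.append(h)
        # Dense connection: the next input concatenates all earlier outputs.
        g = [g[i] + h[i] for i in range(n)]
    return outputs


A = [[0.5, 0.5], [0.5, 0.5]]               # toy 2-node attention matrix
H0 = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[1.0, 0.0]], [[0.0, 0.0, 1.0]]]  # input widths 2 and 3
biases = [[0.0], [-1.0]]
layer_outputs = dense_block(A, H0, weights, biases)
```

Note how the second weight matrix has three input columns: the input width grows by the hidden size after every sub-layer because of the concatenation.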

(3) Linear combination

This section introduces a linear combination layer to integrate the representations from different densely connected layers. The output of the linear combination layer is defined as shown in equation (6):

$$S = W_{comb} h_{out} + b_{comb} \tag{6}$$

where $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$ and $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer. $W_{comb}$ is a weight matrix and $b_{comb}$ is a bias vector.
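Equation (6) can be illustrated with a short sketch that concatenates the per-node outputs of the $M$ densely connected layers and applies one linear map. The function name `linear_combination` and all numeric values are hypothetical.

```python
def linear_combination(block_outputs, W_comb, b_comb):
    """Compute S = W_comb h_out + b_comb for every node.

    block_outputs: list of M matrices, each N x d_m (one per dense layer)
    W_comb:        d_out x (sum of d_m) weight matrix, given as rows
    b_comb:        bias vector of length d_out
    """
    n = len(block_outputs[0])
    # h_out for node i is the concatenation [h^(1)_i; ...; h^(M)_i].
    h_out = [sum((block[i] for block in block_outputs), []) for i in range(n)]
    return [
        [sum(w * x for w, x in zip(w_row, row)) + b for w_row, b in zip(W_comb, b_comb)]
        for row in h_out
    ]


block_outputs = [[[1.0], [2.0]], [[3.0], [4.0]]]   # M = 2 layers, N = 2 nodes
W_comb = [[1.0, 1.0], [1.0, -1.0]]
b_comb = [0.0, 0.5]
S = linear_combination(block_outputs, W_comb, b_comb)
```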

Step 2.2: semantic feature extraction

Because the semantic information of a tweet plays an important role in rumor detection, and the context-dependent word embeddings generated by the BERT model capture information along multiple dimensions and yield more accurate and effective feature representations, the invention uses a Cantonese BERT pre-training model as the word embedding extractor for Chinese text.

The invention builds a mapping table from data provided by the Hong Kong University of Science and Technology, converts Cantonese variant characters into the corresponding Mandarin characters, and splits rare characters, which mitigates the problem that a Chinese pre-training model may fail to learn the semantics of variant and rare characters. In addition, so that the model can better handle the mixed Chinese-English grammatical structure of Cantonese, the vocabulary of the BERT Chinese pre-training model is expanded using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors. Specifically, English words commonly used in Cantonese are added to the vocabulary with randomly initialized weights, which prevents those words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
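The two preprocessing ideas above can be sketched in plain Python. The mapping entries, the toy vocabulary and the simplified greedy tokenizer below are illustrative assumptions, not the actual HKUST-derived table or the BERT tokenizer; they only show why normalizing variant characters and adding whole English words to the vocabulary changes what the model sees.

```python
# Hypothetical mapping-table entries (variant form -> standard form).
VARIANT_MAP = {"羣": "群", "裏": "裡"}


def normalize(text, mapping=VARIANT_MAP):
    """Replace every variant character by its mapped standard form."""
    return text.translate(str.maketrans(mapping))


def wordpiece(word, vocab):
    """Greedy longest-match-first sub-word split, in the style of WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


normalized = normalize("羣組裏面")

vocab = {"ch", "##eck", "##ing"}           # toy vocabulary
before = wordpiece("checking", vocab)      # split into sub-word pieces
vocab.add("checking")                      # vocabulary expansion
after = wordpiece("checking", vocab)       # kept as one token
```

With the Hugging Face `transformers` library, the corresponding step would typically be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the invention uses that library is not stated in the text.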

The invention further pre-trains the BERT-Base-Chinese model on a Cantonese corpus widely collected from the Twitter platform, so that the model learns more of the characteristics of Cantonese. On this basis, Net-CR-Dataset is used to fine-tune the BERT Chinese pre-training model so that it better fits the rumor detection task of the invention. Specifically, the original tweets and retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$ are tokenized to obtain $V' = \{V_1', V_2', \ldots, V_m'\}$. Then $V'$ is input into the further pre-trained and fine-tuned BERT model, and sentence features are extracted by the Transformer to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as shown in equations (7) and (8):

$$V' = \mathrm{Tokenize}(V) \tag{7}$$

$$W = \mathrm{BERT}(V') \tag{8}$$

where $\mathrm{Tokenize}$ is the word segmentation operation of the BERT model.

Step 2.3: feature fusion

The SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in equation (9):

$$F = S \oplus W \tag{9}$$

The feature vector $F$ of the tweet is passed through the $\mathrm{Softmax}$ function to obtain the final classification result, as shown in equation (10):

$$p_d = \mathrm{Softmax}(F) \tag{10}$$

where $p_d$ is the predicted probability that the tweet sample $d$ is a rumor.
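The fusion and classification steps can be sketched as below. The text only states that Softmax is applied to $F$; the linear classification layer (`w_cls`, `b_cls`) that maps $F$ to two class scores is an assumption added here so that the output is a two-class distribution, and all numbers are toy values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(s_vec, w_vec, w_cls, b_cls):
    """Concatenate structural and semantic features, then classify."""
    f_vec = s_vec + w_vec                   # F = S concatenated with W
    # Assumed linear layer producing one score per class.
    logits = [sum(w * f for w, f in zip(row, f_vec)) + b for row, b in zip(w_cls, b_cls)]
    return softmax(logits)


s_vec = [0.2, -0.1]                         # toy structural features
w_vec = [0.5, 0.3, 0.0]                     # toy semantic features
w_cls = [[0.0] * 5, [1.0] * 5]              # hypothetical classifier weights
b_cls = [0.0, 0.0]
p_non_rumor, p_rumor = fuse_and_classify(s_vec, w_vec, w_cls, b_cls)
```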

The optimization goal of the model is to minimize the cross-entropy loss function, as shown in equation (11):

（11）

where $D$ denotes the sample data set and $y_d$ denotes the ground-truth label of sample $d$.
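As an illustration of the training objective, the sketch below computes a binary cross-entropy over a small batch. It assumes the standard two-class form, $-[y_d \log p_d + (1 - y_d)\log(1 - p_d)]$, averaged over the samples; the exact form and normalization of equation (11) are not reproduced in this text, so both choices are assumptions.

```python
import math


def cross_entropy(p_rumor, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted rumor probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(p_rumor, labels):
        p = min(max(p, eps), 1.0 - eps)     # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


uninformed = cross_entropy([0.5, 0.5], [1, 0])   # equals log(2)
confident = cross_entropy([0.9, 0.1], [1, 0])
hesitant = cross_entropy([0.6, 0.4], [1, 0])
```

Predictions that are both correct and confident give the lowest loss, which is the behavior the optimizer rewards.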

The experimental process comprises the following steps:

(1) Data set

Current research lacks a public, authoritative benchmark data set for Cantonese rumors, and the Cantonese rumor data set CR-Dataset constructed in earlier work [KE L, CHEN X, LU Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural Network: 33rd IEEE International Conference on Systems, Man, and Cybernetics [C], Toronto, Canada, 2020] lacks rich graph structure information and cannot provide detection models with the propagation structure features of rumors. The invention therefore constructs an entirely new Cantonese rumor data set.

The Web crawler is developed on the Scrapy framework and collects Cantonese tweets on the Twitter platform together with their multi-level retweets and comments; data labeling is then completed according to strict standards to construct the Net-CR-Dataset data set. Because rumors generally concentrate on sensitive topics such as health problems, the invention takes health-related Cantonese rumors on Twitter as its main research object, which makes it easier to obtain a large amount of relevant data, provides strong support for the research and gives the research practical significance.

Taking content released by authoritative official media as the factual basis, the invention labels the collected Cantonese tweets strictly according to the definition of rumor used herein (information generated and spread among the public whose truth value cannot be confirmed or which is intentionally false; it generally arises in emergencies and easily causes public panic, disrupts social order, damages government credibility and even endangers national security), and filters out tweets that lack a factual basis and whose authenticity cannot be judged. The invention compares the objective facts stated in the source tweet with those reported by authoritative media for the same event: if they are consistent, the tweet is labeled 0; otherwise it is labeled 1.

Finally, a Cantonese rumor data set, Net-CR-Dataset, containing rich graph structure information is constructed. The data set contains 2,419 original tweets, 112,539 retweets and 92,260 comments, for a total of 207,218 nodes and 202,437 edges. Table 1 shows the details of the data set.

(2) Experimental setup and data set

The experimental environment is an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz and a Tesla V100 32G GPU server. All experiments use the Net-CR-Dataset constructed by the invention, whose details are shown in Table 1. In the experiments, 80% of the rumor data set is used as the training set, 10% as the validation set and 10% as the test set. Each experiment is run 10 times and the average is taken as the final result. The training, validation and test sets used in the 10 runs are all randomly split.
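The evaluation protocol can be sketched as follows. The sketch assumes the split is made over the 2,419 original tweets and that any remainder from rounding goes to the test set; the function names and the `evaluate` callback (which would train and score a model) are hypothetical.

```python
import random


def split_indices(n, seed):
    """Randomly split range(n) into 80% train, 10% validation, 10% test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 8 // 10
    n_val = n // 10
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],          # remainder goes to the test set
    )


def average_over_runs(evaluate, n, runs=10):
    """Repeat the random split `runs` times and average the returned metric."""
    scores = []
    for seed in range(runs):
        train, val, test = split_indices(n, seed)
        scores.append(evaluate(train, val, test))
    return sum(scores) / len(scores)


train, val, test = split_indices(2419, seed=0)
# Placeholder metric: a real `evaluate` would train SA-GCN and return, e.g., F1.
mean_score = average_over_runs(lambda tr, va, te: len(te), 2419)
```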

(3) Experiment one: feature ablation

To demonstrate the effectiveness of each part of the SA-GCN model, the invention compares the detection performance of the SA-GCN model and its variants on Net-CR-Dataset. The models are as follows:

1) SA-GCN \ Str: structural features are not introduced into the SA-GCN model, and only semantic features are utilized;

2) SA-GCN \ Sem: semantic features are not introduced into the SA-GCN model, and only structural features are utilized;

3) SA-GCN \ Att: an attention mechanism is not introduced into the SA-GCN model;

4) SA-GCN \ BERT: the BERT model is not used to generate word embeddings in the SA-GCN model. Instead, a word embedding matrix is generated from the Chinese and Cantonese pre-trained word vectors provided by fastText and fed to an Embedding layer in front of a Bi-LSTM network;

5) SA-GCN: the complete model proposed by the invention.

The experimental results are shown in FIG. 2. The SA-GCN model proposed by the invention performs best on every metric. Comparing SA-GCN with SA-GCN \ Sem shows that the semantic features of tweets play an important role in rumor detection. Meanwhile, because users in a social network interact with one another, nodes are closely related and the features of a node are affected by its neighborhood, so the SA-GCN model, which takes the structural features of nodes into account, outperforms the SA-GCN \ Str model. Comparing SA-GCN \ Str with SA-GCN \ Sem further shows that, in this task, semantic features contribute more to detection than structural features; a probable reason is that the numbers of retweets and comments in Net-CR-Dataset are unevenly distributed, so the model learns insufficient structural features for tweets with little propagation. In addition, the SA-GCN \ BERT model obtains an F1 score of 0.8692, about 12% lower than the SA-GCN model, which shows that, compared with common pre-trained models such as fastText, the BERT Chinese pre-training model adopted by the invention, together with the further pre-training and fine-tuning on the Cantonese corpus, enables the model to learn the characteristics of Cantonese data better and therefore to detect rumors effectively. Finally, comparing SA-GCN \ Att with SA-GCN shows that the attention mechanism automatically discovers the words and features that matter for the detection task and helps improve the detection result.
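The size of the drop for the SA-GCN \ BERT variant can be checked with a few lines of arithmetic, using the F1 score of 0.9845 reported for the full SA-GCN model in Experiment two. Whether the stated figure is meant as an absolute or a relative difference is not specified, so the sketch computes both.

```python
sa_gcn_f1 = 0.9845          # full model, as reported in Experiment two
without_bert_f1 = 0.8692    # SA-GCN \ BERT variant

absolute_drop = sa_gcn_f1 - without_bert_f1      # difference in F1 points
relative_drop = absolute_drop / sa_gcn_f1        # fraction of the full model's F1

print(f"absolute drop: {absolute_drop:.4f}")
print(f"relative drop: {relative_drop:.1%}")
```

Both readings round to roughly 12%, consistent with the figure stated in the text.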

(4) Experiment two: detecting model effects

The invention compares the provided SA-GCN model with other common methods in rumor detection, and the related model specific information is as follows:

1) SVM-RBF: an SVM model with an RBF kernel that uses hand-crafted features based on the overall statistics of the posts;

2) DTC: a rumor detection method that uses a decision tree classifier over various hand-crafted features to assess the credibility of information;

3) RFC: a random forest classifier using user features, language features and structural features;

4) TextCNN: capturing text semantics for the classification task using a convolutional neural network;

5) Bi-GCN: a rumor detection model for graph structure data, capable of capturing bidirectional propagation structural features of rumors;

6) GLAN: a global-local attention network that fuses local semantic features and global structural features;

7) SA-GCN: the detection model proposed by the invention.

In the experiment, the input of the SVM-RBF, DTC and RFC models is a vector generated by the TF-IDF algorithm, and the input of the TextCNN model is the word embedding generated from the fastText Chinese and Cantonese pre-trained word vectors. As can be seen from Table 2 and FIG. 3, the SA-GCN model proposed by the invention achieves an F1 score of 0.9845 and an AUC of 0.9677, the best detection result. The deep-learning-based detection models (TextCNN, Bi-GCN, GLAN, SA-GCN) generally perform better than the models based on traditional machine learning (SVM-RBF, DTC, RFC), because deep learning models can learn high-order representations of rumors and thereby capture effective features. In addition, TextCNN performs worse than GLAN and SA-GCN, which indirectly shows that adding propagation structure features improves detection and underlines the value of combining structural and semantic features in the invention. Meanwhile, because the Transformer extracts semantic features better than a CNN, the SA-GCN model, which uses the Transformer as its feature extractor, outperforms TextCNN and GLAN, which use a CNN as the feature extractor. Moreover, to build a word embedding extractor better suited to Cantonese and to the rumor detection task, the invention further pre-trains and fine-tunes the BERT Chinese pre-training model on the Cantonese corpus and data set, so that the SA-GCN model learns more of the characteristics of Cantonese data and achieves a better detection result. Finally, comparing Bi-GCN with SA-GCN, SA-GCN improves the F1 score by nearly 9% over Bi-GCN, because the SA-GCN model fuses the semantic features of rumors on top of the structural features, and semantic features fundamentally reflect the meaning expressed by a tweet, which markedly improves detection performance.

(5) Experiment three: early detection capability

Early detection capability refers to how well a model detects rumors in the initial period of their propagation, and it is one of the important indicators of the performance of a rumor detection model. The experiment sets different cut-off times and, for each, feeds the model only the data posted before the cut-off time. Accuracy is used as the metric to compare the early detection capability of different models.
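The cut-off procedure can be sketched as a simple filter over timestamped interactions. The function name, the tuple layout and the use of seconds since the source tweet as the timestamp are assumptions for illustration; the text does not describe how timestamps are stored.

```python
def snapshot_before_deadline(source_time, interactions, deadline_hours):
    """Keep only retweets/comments posted within `deadline_hours` of the source tweet.

    interactions: list of (child_id, parent_id, timestamp_in_seconds)
    Returns the surviving (child_id, parent_id) edges.
    """
    limit = source_time + deadline_hours * 3600
    return [(child, parent) for child, parent, ts in interactions if ts <= limit]


interactions = [
    ("r1", "t1", 600),       # retweet after 10 minutes
    ("c1", "t1", 7200),      # comment after 2 hours
    ("r2", "r1", 18000),     # retweet of r1 after 5 hours
]
early_edges = snapshot_before_deadline(0, interactions, deadline_hours=3)
all_edges = snapshot_before_deadline(0, interactions, deadline_hours=24)
```

Evaluating the same trained pipeline on the snapshots for a range of deadlines gives one accuracy value per cut-off time.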

The experimental results are shown in FIG. 4. At every cut-off time, the accuracy of the SA-GCN model proposed by the invention is higher than that of the other models: it exceeds 0.8944 when the tweet has just started to propagate and reaches 0.9425 within 3 hours of propagation. This shows that the semantic and structural features introduced in the SA-GCN model are effective not only for long-term rumor detection but also for early detection. In addition, as the cut-off time is extended, the semantic and structural information of a tweet becomes richer while the data also becomes noisier; nevertheless, the accuracy curve of the SA-GCN model fluctuates less than those of the other models, which also demonstrates the stability of the proposed model.

In conclusion, the invention provides a method based on a deep semantic perception graph convolutional network for detecting Cantonese rumors in social networks. First, the invention builds a Web crawler and obtains relevant data from Twitter based on multiple groups of rumor keywords. The numbers of retweets and comments of each tweet are constrained so that the data carry rich graph structure information. After manual data labeling is completed, the data set Net-CR-Dataset is constructed. Second, the invention designs a new Cantonese rumor detection model, SA-GCN, which learns the structural features of tweets with an improved GCN network, captures their semantic features with a BERT pre-training model further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural and semantic features. The experimental results show that the SA-GCN model outperforms classical detection methods in Cantonese rumor detection and has strong early rumor detection capability.

Claims

1. A Cantonese rumor detection method based on a deep semantic perception graph convolutional network, characterized by comprising the following steps:

step 1: constructing multiple groups of health-related Cantonese rumor keywords, acquiring relevant tweet, user, retweet and comment information, and constructing a Cantonese rumor data set Net-CR-Dataset with graph structure information, namely modeling a graph $G = \langle V, E \rangle$ according to the entities in the social network and the relationships between them;

step 2: fusing a BERT model, a GCN network and an attention mechanism, and providing a social network Cantonese rumor detection model SA-GCN: extracting a structural feature vector of the tweet by using an improved GCN network;

optimizing the BERT Chinese pre-training model according to the distinctive linguistic features of Cantonese, and further pre-training and fine-tuning the BERT Chinese pre-training model on a large collected Cantonese corpus so as to extract the semantic feature vector of the tweet; finally, fusing the two types of features to obtain the final classification result;

the step 2 comprises the following steps:

the step 2.1 of extracting the structural features specifically comprises:

step 2.1.1: mining potential structural correlation between vertexes by using a multi-head attention mechanism, wherein the potential structural correlation comprises nodes which are not directly connected and nodes which are multi-hop; the specific process is as follows:

firstly, the node features $U = \{u_1, u_2, \ldots, u_N\}$ are generated using the Cantonese pre-trained word vectors provided by fastText, wherein $N$ is the number of all nodes;

then, by constructing an attention adjacency matrix $A$, the propagation tree of the original tweet is transformed into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account; the $m$-th attention adjacency matrix, associated with the $m$-th head, is calculated as follows:

$$A^{(m)} = \mathrm{softmax}\left(\frac{Q W_m^{Q} \left(K W_m^{K}\right)^{\top}}{\sqrt{d}}\right) \tag{3}$$

wherein $Q$ and $K$ are both equal to the extracted node features $U$; $d$ is the dimension of the feature vector; $W_m^{Q}$ and $W_m^{K}$ are the projection matrices of $Q$ and $K$, respectively;

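step 2.1.2: densely connected layers are used to capture local and distant node features; each densely connected layer comprises $L$ sub-layers; for node $i$, its output through the $l$-th sub-layer is as follows:

$$h_i^{(l)} = \rho\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right) \tag{4}$$

wherein $\rho$ is the ReLU function, and the weight matrix $W_m^{(l)}$ and bias $b_m^{(l)}$ depend on $A^{(m)}$; $A^{(m)}$ is the $m$-th attention adjacency matrix, associated with the $m$-th head; $A_{ij}^{(m)}$, an element of the matrix $A^{(m)}$, represents the connection between node $i$ and node $j$; $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, calculated as follows:

$$g_j^{(l)} = \left[h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}\right] \tag{5}$$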

step 2.1.3: a linear combination layer is introduced to integrate the representations from the different densely connected layers, the output of the linear combination layer being defined as follows:

$$S = W_{comb} h_{out} + b_{comb} \tag{6}$$

wherein $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$, and $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer; $W_{comb}$ is a weight matrix and $b_{comb}$ is a bias vector.

2. The Cantonese rumor detection method based on the deep semantic perception graph convolutional network of claim 1, wherein modeling the graph $G = \langle V, E \rangle$ according to the entities in the social network and the relationships between them specifically comprises:

$T = \{t_1, t_2, \ldots, t_m\}$ denotes the set of original tweets, wherein $m$ is the number of original tweets; $R_i = \{r_1^i, r_2^i, \ldots, r_n^i\}$ denotes the set of retweets and comments of the original tweet $t_i$, wherein $r_j^i$ is the $j$-th retweet or comment of $t_i$ and $n$ is the number of retweets and comments;

$V = \{V_1, V_2, \ldots, V_m\}$, wherein $V_i = \{t_i, R_i\}$ is the node set of the original tweet $t_i$, comprising the node of $t_i$ and the nodes of its retweet and comment set $R_i$;

$E = \{E_1, E_2, \ldots, E_m\}$, wherein $E_i$ is the edge set of tweet $t_i$, representing the retweet/comment relationships between nodes;

$X = \{x_1, x_2, \ldots, x_m\}$ denotes the feature matrix of the original tweet set $T$, wherein $x_i \in \mathbb{R}^k$ is the feature vector of node $t_i$ and $k$ is its dimension;

$A \in \{0,1\}^{|V| \times |V|}$ is the adjacency matrix of the graph $G$; it represents the adjacency relations between nodes and indicates whether any two nodes in the graph are connected by an edge;

suppose there is an edge $e_{ij}^c$ between the retweet/comment nodes $r_i^c$ and $r_j^c$; the adjacency matrix $A$ is then given by:

$$A_{ij} = \begin{cases} 1, & e_{ij}^c \in E_c \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

wherein $E_c$ is the edge set of tweet $t_c$;

the rumor detection task is treated as a binary classification problem: the label corresponding to the original tweet $t_i$ is $y_i \in \{0, 1\}$, wherein 0 denotes a non-rumor and 1 denotes a rumor; the goal of rumor detection is then to learn a classifier $f$:

$$f: T \rightarrow Y \tag{2}$$

wherein Y is a set of labels.

3. The Cantonese rumor detection method based on the deep semantic perception graph convolutional network of claim 1, wherein the step 2 further comprises:

4. The Cantonese rumor detection method based on the deep semantic perception graph convolutional network of claim 3, wherein expanding the vocabulary of the BERT Chinese pre-training model in step 2.2 comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, adding English words commonly used in Cantonese to the vocabulary and randomly initializing their weights, so that the English words are not split into roots and affixes and the model's ability to learn the semantics of English words in Cantonese is improved.

5. The method of claim 3, wherein fine-tuning the BERT Chinese pre-training model with Net-CR-Dataset in step 2.2 comprises: tokenizing the original tweets and retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$ to obtain $V' = \{V_1', V_2', \ldots, V_m'\}$; then inputting $V'$ into the further pre-trained and fine-tuned BERT model and extracting sentence features with the Transformer to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as shown below:

$$V' = \mathrm{Tokenize}(V) \tag{7}$$

$$W = \mathrm{BERT}(V') \tag{8}$$

wherein, Tokenize is a word segmentation operation in the BERT model.

6. The Cantonese rumor detection method based on the deep semantic perception graph convolutional network of claim 3, wherein the step 2.3 specifically comprises:

step 2.3.1: the SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown below:

$$F = S \oplus W \tag{9}$$

step 2.3.2: the feature vector $F$ of the tweet is passed through the Softmax function to obtain the final classification result, as shown below:

$$p_d = \mathrm{Softmax}(F) \tag{10}$$

wherein $p_d$ is the predicted probability that the tweet sample $d$ is a rumor;

step 2.3.3: the optimization objective of the model is to minimize the cross-entropy loss function, as shown in the following equation:

where D represents a sample data set, y_dRepresenting the true value of the tweet sample d to be predicted.