CN114444516B - Cantonese rumor detection method based on deep semantic perception graph convolutional network - Google Patents

Info

Publication number
CN114444516B
CN114444516B CN202210371266.1A CN202210371266A CN114444516B CN 114444516 B CN114444516 B CN 114444516B CN 202210371266 A CN202210371266 A CN 202210371266A CN 114444516 B CN114444516 B CN 114444516B
Authority
CN
China
Prior art keywords
model
cantonese
gcn
node
rumor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210371266.1A
Other languages
Chinese (zh)
Other versions
CN114444516A (en)
Inventor
王海舟
陈欣雨
柯亮
方怡萱
王森
蔡易成
王文贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202210371266.1A
Publication of CN114444516A
Application granted
Publication of CN114444516B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention relates to the technical field of rumor detection and discloses a Cantonese rumor detection method based on a deep semantic perception graph convolutional network. First, multiple groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to collect the relevant tweets, users, retweets and comments, and the data set Net-CR-Dataset is built once data annotation is complete. Second, a deep semantic perception graph convolutional network model, SA-GCN, is designed. The BERT Chinese pre-training model is optimized for the unique linguistic features of Cantonese and is further pre-trained and fine-tuned on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets. An improved GCN extracts the structural features of the tweets to generate structural feature vectors. Finally, the SA-GCN model fuses the structural feature vectors and the semantic feature vectors to obtain the final classification result. The invention outperforms other common detection methods in both detection performance and early detection capability.

Description

Cantonese rumor detection method based on deep semantic perception graph convolutional network
Technical Field
The invention relates to the technical field of rumor detection, in particular to a Cantonese rumor detection method based on a deep semantic perception graph convolutional network.
Background
Social media provide a platform for people to follow trending events, share opinions and make friends, and play an indispensable role in people's lives. The "Digital 2021" report shows that, as of January 2021, the number of active social media users worldwide had reached 4.2 billion, about 53.6% of the world's population. Because social networks exert great influence on public opinion, rumors emerge on them in an endless stream; they not only disturb order online and cause social panic, but also bring economic losses in the real world and endanger national security. Besides the common English and Chinese rumors, Cantonese rumors are also a persistent problem in social networks. Cantonese, a branch of Chinese, is prevalent not only in Guangdong, Hong Kong and Macao in China, but also among overseas Chinese communities. There are more than 120 million Cantonese speakers around the world, and Cantonese rumors that spread widely in social networks likewise have a severe adverse impact on the offline world. Therefore, an effective method is needed for automatically detecting Cantonese rumors in social networks.
Traditional rumor detection methods mainly adopt a supervised learning strategy, training machine learning classifiers such as SVM (Support Vector Machine), RF (Random Forest), Bayes and DT (Decision Tree) on features manually extracted from text content, user profiles and propagation patterns. Further research has extracted more effective features, such as time-series features and topic features. Detection methods based on traditional machine learning rely mainly on feature engineering, which requires a great deal of time. Moreover, effective high-order feature representations are difficult to extract by hand, which limits the performance of such methods.
In recent years, powerful deep learning models have been widely applied to this task to overcome the problems of machine-learning-based detection methods. RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network) and their variants and combinations have all made significant progress in rumor detection. In addition, some studies further improve detection performance by introducing techniques such as attention mechanisms and generative adversarial learning. However, most of these methods treat the microblog posts to be detected as independent individuals and ignore the connections between them. A social network has a graph as its underlying structure: it includes not only entities such as users, posts and hashtags, but also relationships such as friendship, retweeting and commenting. The graph structure contains rich information and thus provides new features for rumor detection. For example, if a user frequently interacts (e.g., by retweeting, commenting or following) with multiple rumor-spreading users, the probability that a post published by that user is a rumor increases greatly.
To address this problem, recent research has focused on rumor detection using the propagation structure information in the graph. Some studies model the propagation behavior of posts as a propagation tree and design models such as PTK (Propagation Tree Kernel) and RvNN (Recursive Neural Network) to learn the structural features of the propagation tree. Methods such as GCN (Graph Convolutional Network), GAT (Graph Attention Network) and PGNN (Propagation Graph Neural Network) have also been proposed to extract the global structural features of posts from the propagation graph, thereby improving the detection model. However, most of these methods only obtain propagation and structure information from the way posts spread and develop over time, and ignore features from text content, user profiles and so on, so some important information (e.g., text features) may be lost and the final detection performance may suffer. In addition, a shallow GCN cannot learn the features of distant nodes. Studies have shown that although deeper GCNs can capture richer neighbor information, a 2-layer GCN performs best.
Research on the detection of Cantonese rumors was first conducted in the literature [KE L, CHEN X, LU Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural networks: 33rd IEEE International Conference on Systems, Man, and Cybernetics [C], Toronto, Canada, 2020.], which widely collected Cantonese rumors on Twitter and thereby constructed CR-Dataset, the first relatively complete Cantonese rumor data set. That work also proposed 27 statistical features of Cantonese rumors and designed a detection model based on deep learning that fuses the semantic features and statistical features of rumors for detection. Experiments showed that the method achieves excellent detection performance. However, it does not take into account propagation structure features, which are important for identifying rumors in practice. In addition, the constructed CR-Dataset lacks the structural information of the tweets.
On the one hand, traditional rumor detection methods cannot be applied directly to Cantonese. The methods above were mainly developed for English and Mandarin rumors in social networks, and solutions for Cantonese rumors are lacking. The new words, distinctive colloquial words and mixed Chinese-English grammatical structures of Cantonese prevent traditional methods from reaching their best detection performance. On the other hand, Cantonese rumors that spread widely in social networks can have serious negative effects on the real world. As one of the most popular and influential dialects of Chinese, Cantonese has more than 120 million speakers worldwide, who are widely distributed not only in Guangdong Province, Hong Kong and Macao in China, but also in 12 other countries such as Singapore, Thailand, the United States and Canada.
Therefore, a new graph-structure-based method is needed for detecting Cantonese rumors in social networks.
The unique linguistic features of Cantonese pose a serious challenge to the detection of Cantonese rumors in social networks. Unlike standard Chinese, Cantonese includes many variant characters and rare characters. Although some etymologically related syllables have the same (or closely related) meaning in Modern Standard Chinese and in Cantonese, some Cantonese users write them with different characters, i.e., variant characters. It is difficult for a Chinese language model to learn the semantics of these unique characters automatically. Meanwhile, the mixed Chinese-English syntax of Cantonese makes the extraction of semantic features difficult. Over time, Cantonese speakers have gradually absorbed English words into the Cantonese language system. For example, a Cantonese sentence may embed the English word "work" to express that doing something is not feasible. This usage makes word segmentation and the acquisition of the contextual semantics of words difficult.
In recent years, researchers have conducted extensive research on the automated detection of rumors in social networks, mainly comprising detection methods based on traditional machine learning and detection methods based on deep learning.
(1) Detection method based on traditional machine learning
Most rumor detection research focuses on training rumor classifiers with features manually extracted from microblog text content, user profiles, propagation paths and similar sources. Castillo et al. extracted user-based, message-based, topic-based and propagation-based features from tweeting and retweeting behavior on the Twitter platform to evaluate the credibility of a given tweet. Yang et al. extended the feature set proposed by Castillo et al.: in addition to the earlier content-based, account-based and propagation-based features, they proposed client-based and location-based features and achieved rumor detection on the Sina Weibo platform. Kwon et al. explored conventional structural and linguistic features together with novel temporal features arising from the sudden fluctuations of rumors over time. Ma et al. designed a time series model to capture how the social context of a rumor changes over time, rather than only the tweet volume. Detection methods of this type rely mainly on feature engineering and therefore require a great deal of research time and human labor. Moreover, some implicit features are hard to discover through feature engineering, and manually extracted features are not very robust, which makes it particularly difficult to improve the performance of such methods.
(2) Detection method based on deep learning
To solve the problems faced by traditional machine learning methods, many studies use deep learning for feature learning, so as to capture high-order feature representations and achieve better classification. Ma et al. were the first to identify rumors by using an RNN model to learn the temporal and textual representations of rumor posts. On this basis, Chen et al. proposed incorporating an attention mechanism into the RNN model to capture the text features that matter most to the detection task. Jin et al. extracted multimodal features from rumor text, images and social context for rumor detection. These RNN-based methods can detect rumors effectively but are not suitable for early rumor detection. To address this problem, Yu et al. devised a CNN-based model that effectively identifies misinformation and enables detection early in the propagation. Some recent studies have also used a hybrid of RNN and CNN for detection. Liu et al. learned global and local changes in user characteristics with RNN and CNN networks to identify fake news. Ma et al. introduced the idea of generative adversarial learning, generating adversarial noise so that the classifier learns a stronger rumor representation, which enables more robust and effective detection.
Furthermore, Ke et al. were the first to study Cantonese rumor detection in social networks. They extracted 27 statistical features of Cantonese rumors in four categories (content, user, propagation and comment) and designed the Cantonese rumor detection model BLA (a BERT-based Bi-LSTM network fused with an attention mechanism). The model extracts the semantic features of tweets with the BERT (Bidirectional Encoder Representations from Transformers) model, a Bi-LSTM (Bi-directional Long Short-Term Memory) network and an attention mechanism, and then concatenates them with the extracted statistical features, thereby recognizing Cantonese rumors effectively. However, classical deep learning rumor detection techniques mostly focus on extracting features from text, images and similar sources, and ignore the structural features of rumor propagation.
In addition, in the graph structure of a social network, the propagation process of rumors carries rich information, and some studies have used the propagation relationships in the graph structure for rumor detection. Sicilia et al. combined content-based features with fine-grained features inspired by graph theory to detect rumors in posts from a single topic domain related to health news. Ma et al. proposed a kernel-based approach that obtains high-order rumor representations by comparing the similarity of the propagation tree structures of microblog posts. This method achieves good detection performance, but it cannot automatically learn high-order features from noisy surface features. To solve this problem, Ma et al. proposed learning the propagation tree of microblog posts with an RvNN network, thereby extracting effective semantic and propagation features. Building on these methods, Yuan et al. also considered the connections between different propagation trees and proposed a novel model that encodes local context information and global structure information. Bian et al. proposed a bi-directional graph model, Bi-GCN, which learns high-order feature representations from both the propagation and the dispersion directions. Yang et al. designed a graph adversarial learning approach to enhance the robustness and generalization of rumor detection models. Most detection methods based on graph structures do not consider the fusion of multiple features, which may result in the loss of important information such as text content and user information. Meanwhile, the commonly used shallow GCN cannot capture the features of distant neighbors.
Moreover, existing studies have hardly explored the field of Cantonese rumors. Cantonese is a major branch of Chinese with a widely distributed population of speakers, and Cantonese rumors emerge endlessly in social networks. The invention therefore studies the detection of Cantonese rumors on Twitter using a method based on graph structure and feature fusion.
However, most existing detection methods based on graph structures ignore the fusion of multiple features, resulting in the loss of some important information (such as text content). In addition, common shallow GCNs may not be able to capture the features of distant neighbors.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a Cantonese rumor detection method based on a deep semantic perception graph convolutional network, which learns the structural features of tweets with an improved GCN, captures the semantic features of tweets with a BERT Cantonese pre-training model that has been further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural features and the semantic features, so that its detection performance and early detection capability are superior to those of other common detection methods. The technical scheme is as follows:
a Cantonese rumor detection method based on a deep semantic perception graph convolutional network comprises the following steps:
step 1: constructing multiple groups of health-related Cantonese rumor keywords, collecting the related tweets, users, retweets and comments, and building a Cantonese rumor data set with graph structure information, Net-CR-Dataset; that is, modeling the data set as a graph $G=\langle V,E\rangle$ according to the entities in the social network and the relationships between them;
step 2: fusing a BERT model, a GCN and an attention mechanism to provide a social network Cantonese rumor detection model, SA-GCN (Semantic-Aware Graph Convolutional Network): extracting structural feature vectors of the tweets with an improved GCN;
optimizing the BERT Chinese pre-training model for the unique linguistic features of Cantonese, and further pre-training and fine-tuning it on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets; and finally fusing the two types of features to obtain the final classification result.
Further, modeling the data set as the graph $G=\langle V,E\rangle$ according to the entities in the social network and the relationships between them specifically comprises:
$T=\{t_1,t_2,\dots,t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets; $R_i=\{r^i_1,r^i_2,\dots,r^i_n\}$ denotes the set of retweets and comments of the original tweet $t_i$, where $r^i_j$ is the $j$-th retweet/comment of $t_i$ and $n$ is the number of retweets and comments;
$V=\{V_1,V_2,\dots,V_m\}$, where $V_i=\{t_i,R_i\}$ is the node set of the original tweet $t_i$, containing the node of $t_i$ and the nodes of its retweet and comment set $R_i$;
$E=\{E_1,E_2,\dots,E_m\}$, where $E_i=\{e^i_{jk}\}$ is the edge set of the original tweet $t_i$, representing the retweet/comment relationships between nodes;
$X=\{x_1,x_2,\dots,x_m\}$ denotes the feature matrix of the original tweet set $T$, where $x_i\in\mathbb{R}^{k}$, $k$ is the dimension of the feature $x_i$, and $x_i$ is the feature vector of node $t_i$;
$A$ is the adjacency matrix of graph $G$; its entries indicate whether any two nodes in the graph are connected by an edge;
assuming that there is an edge $e^c_{ij}$ between the retweet/comment nodes $r^c_i$ and $r^c_j$, the adjacency matrix $A$ is given by:
$$A_{ij}=\begin{cases}1, & e^c_{ij}\in E_c\\ 0, & \text{otherwise}\end{cases}\tag{1}$$
where $E_c$ is the edge set of the original tweet $t_c$;
the rumor detection task is treated as a binary classification problem: the label of the original tweet $t_i$ is $y_i\in\{0,1\}$, where 0 denotes a non-rumor and 1 denotes a rumor; the goal of rumor detection is then to learn a classifier $f$:
$$f: T\rightarrow Y\tag{2}$$
where $T$ and $Y$ are the set of original tweets and the set of labels, respectively.
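The adjacency matrix of Equation (1) can be illustrated with a minimal Python sketch. The toy cascade, the integer node ids and the function name `build_adjacency` are assumptions made for illustration and do not come from the patent; the matrix is made symmetric because the description treats the graph as undirected.

```python
from typing import List, Tuple


def build_adjacency(num_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Return the 0/1 adjacency matrix of an undirected propagation graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in edges:
        adj[src][dst] = 1
        adj[dst][src] = 1  # undirected graph: the edge direction is ignored
    return adj


# Hypothetical cascade: node 0 is the original tweet, nodes 1-3 are
# retweets/comments; node 3 replies to node 2.
toy_edges = [(0, 1), (0, 2), (2, 3)]
A = build_adjacency(4, toy_edges)
for row in A:
    print(row)
```

Each original tweet yields one such matrix over its own node set; entries stay 0 for pairs of nodes with no retweet/comment relationship.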
Further, step 2 specifically comprises:
step 2.1: extracting structural features: the original tweets, retweets and comments in Net-CR-Dataset are modeled as nodes and the retweet and comment relationships as edges, so that the propagation paths of tweets in the social network are converted into graph-structured data; an improved GCN (Graph Convolutional Network) then aggregates the information along the propagation paths of the tweets to generate high-level structural feature vectors of the tweets;
step 2.2: extracting semantic features: a mapping table is constructed to convert variant characters in Cantonese into the corresponding Mandarin characters, and rare characters are split; the vocabulary of the BERT Chinese pre-training model is extended; the collected Cantonese corpus is used to further pre-train the BERT-Base-Chinese model so that it learns more features of Cantonese, and Net-CR-Dataset is used to fine-tune the BERT Chinese pre-training model, yielding the BERT Cantonese pre-training model, from which the semantic feature vectors of the tweets are extracted;
step 2.3: the SA-GCN model fuses the structural feature vectors and the semantic feature vectors to obtain the final classification result.
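The character preprocessing mentioned in step 2.2 can be sketched as a simple table lookup. The table entries below are illustrative assumptions only: the patent's actual mapping table is not reproduced in this text, and the component split shown for the rare character is hypothetical.

```python
# Hypothetical mapping entries for illustration; not the patent's table.
VARIANT_TO_STANDARD = {"羣": "群", "牀": "床", "裏": "裡"}
# Hypothetical split of a rare character into more common components.
RARE_CHAR_SPLITS = {"𠵱": "口依"}


def normalize_cantonese(text: str) -> str:
    """Replace variant characters and split rare characters, one by one."""
    out = []
    for ch in text:
        if ch in VARIANT_TO_STANDARD:
            out.append(VARIANT_TO_STANDARD[ch])
        elif ch in RARE_CHAR_SPLITS:
            out.append(RARE_CHAR_SPLITS[ch])
        else:
            out.append(ch)  # characters outside both tables pass through
    return "".join(out)


normalized = normalize_cantonese("𠵱家羣組")
print(normalized)
```

A per-character lookup keeps the step cheap and lets the tables grow without changing the code; whether the patent's preprocessing works character by character is an assumption of this sketch.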
Further, extracting the structural features in step 2.1 specifically comprises:
step 2.1.1: a multi-head attention mechanism is used to mine latent structural correlations between vertices, in particular between nodes that are not directly connected and nodes that are multiple hops apart; the specific process is as follows:
the node features $U=\{u_1,u_2,\dots,u_N\}$ are generated with the Cantonese pre-trained word vectors provided by fastText, where $N$ is the number of all nodes;
an attention adjacency matrix is constructed, which converts the propagation tree of the original tweet into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are considered together; the $m$-th attention adjacency matrix $A^{(m)}$, corresponding to the $m$-th head, is computed as follows:
$$A^{(m)}=\mathrm{softmax}\left(\frac{QW_m^{Q}\left(KW_m^{K}\right)^{\top}}{\sqrt{d}}\right)\tag{3}$$
where $Q$ and $K$ are both equal to the node features, i.e., the extracted node features $U$; $d$ is the dimension of the feature vectors; $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively;
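As a rough illustration of Equation (3), the sketch below computes one attention adjacency matrix as a row-wise softmax over scaled dot products of projected node features. It uses only the standard library; the toy features, the identity projection matrices and all function names are assumptions for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_adjacency(u: Matrix, w_q: Matrix, w_k: Matrix) -> Matrix:
    """One head: softmax((U W_q)(U W_k)^T / sqrt(d)), with Q = K = U."""
    d = len(u[0])
    q = matmul(u, w_q)
    k = matmul(u, w_k)
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return softmax_rows(scaled)


# Three toy nodes with 2-dimensional features (hypothetical values).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
A_att = attention_adjacency(U, IDENTITY, IDENTITY)
for row in A_att:
    print([round(v, 3) for v in row])
```

Every entry of the result is strictly positive, which is what turns the sparse propagation tree into a fully connected graph with weighted edges; in a trained model the projection matrices would be learned, one pair per head.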
step 2.1.2: densely connected layers are used to capture both local and distant node features, which overcomes the inability of a shallow GCN to learn information from deeply associated nodes and yields better node representations;
each densely connected layer contains $L$ sublayers; for node $i$, the output of the $l$-th sublayer is as follows:
$$h_i^{(l)}=\sigma\left(\sum_{j=1}^{N}A_{ij}^{(m)}W_m^{(l)}g_j^{(l)}+b_m^{(l)}\right)\tag{4}$$
where $\sigma$ is the ReLU function; the weight matrix $W_m^{(l)}$ and the bias $b_m^{(l)}$ depend on $A^{(m)}$, the attention adjacency matrix of the $m$-th head; $A_{ij}^{(m)}$, an element of the matrix $A^{(m)}$, is the weight of the edge between node $i$ and node $j$; $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sublayer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)},\dots,h^{(l-1)}$ produced by sublayers $\{1,2,\dots,l-1\}$, as shown in the following formula:
$$g_j^{(l)}=\left[h_j^{(0)};h_j^{(1)};\dots;h_j^{(l-1)}\right]\tag{5}$$
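The dense connectivity of Equations (4) and (5) can be sketched as follows: every sublayer receives the initial node features concatenated with the outputs of all earlier sublayers, aggregates them with the attention weights and applies ReLU. The weights, shapes and the function name `dense_gcn_block` are assumptions chosen to keep the toy example easy to check by hand.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_gcn_block(adj: Matrix, h0: Matrix, weights: List[Matrix],
                    biases: List[List[float]]) -> Tuple[List[Matrix], Matrix]:
    """Run the sublayers of one densely connected layer.

    weights[l] has shape (hidden_dim, input_dim_l), where input_dim_l grows
    by hidden_dim with every sublayer because of the concatenation.
    """
    g = [list(row) for row in h0]  # per-node concatenated input
    outputs = []
    for w, b in zip(weights, biases):
        # Project the concatenated input of every node j.
        projected = [[sum(wi * gi for wi, gi in zip(w_row, g_j)) for w_row in w]
                     for g_j in g]
        h = []
        for i in range(len(adj)):
            agg = [sum(adj[i][j] * projected[j][c] for j in range(len(adj))) + b[c]
                   for c in range(len(b))]
            h.append([max(0.0, v) for v in agg])  # ReLU
        outputs.append(h)
        # Concatenate the new output onto the input of the next sublayer.
        g = [g_j + h_j for g_j, h_j in zip(g, h)]
    return outputs, g


# Two nodes, uniform attention weights, two sublayers with hidden size 1.
adj = [[0.5, 0.5], [0.5, 0.5]]
h0 = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[0.1, 0.1]], [[0.1, 0.1, 0.1]]]
biases = [[0.0], [0.0]]
outputs, final_g = dense_gcn_block(adj, h0, weights, biases)
print(outputs)
print(final_g)
```

Because each sublayer sees all earlier outputs, stacking more sublayers widens the receptive field without discarding the features of nearby nodes, which is the motivation the text gives for the densely connected layer.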
step 2.1.3: a linear combination layer is introduced to integrate the representations from the different densely connected layers; the output $S$ of the linear combination layer is defined as follows:
$$S=W_{comb}h_{out}+b_{comb}\tag{6}$$
where $h_{out}=[h^{(1)};\dots;h^{(M)}]$ and $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer; $W_{comb}$ is a weight matrix and $b_{comb}$ is a bias vector.
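Equation (6) amounts to concatenating the outputs of the densely connected layers and applying one affine map. The sketch below shows this for a single node; the averaging weights and the name `linear_combination` are hypothetical choices that make the result easy to verify.

```python
from typing import List


def linear_combination(head_outputs: List[List[float]],
                       w_comb: List[List[float]],
                       b_comb: List[float]) -> List[float]:
    """Concatenate the per-layer feature vectors and apply W_comb, b_comb."""
    h_out = [v for head in head_outputs for v in head]  # [h(1); ...; h(M)]
    return [sum(w * h for w, h in zip(row, h_out)) + b
            for row, b in zip(w_comb, b_comb)]


# Two densely connected layers, each giving a 2-dimensional vector.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
# Hypothetical weights that average the two layers dimension by dimension.
w_comb = [[0.5, 0.0, 0.5, 0.0],
          [0.0, 0.5, 0.0, 0.5]]
b_comb = [0.0, 0.0]
S = linear_combination(head_outputs, w_comb, b_comb)
print(S)
```

In a trained model the entries of the weight matrix and bias would be learned, so the combination need not be a plain average.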
Further, in step 2.2, extending the vocabulary of the BERT Chinese pre-training model comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, adding the English words commonly used in Cantonese to the vocabulary and initializing their weights randomly, which prevents these English words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
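The effect of the vocabulary extension can be seen with a toy greedy longest-match subword tokenizer in the style of WordPiece. This is a simplified sketch rather than BERT's actual tokenizer: the tiny vocabulary, the embedding size and the initialization scale are all assumptions.

```python
import random
from typing import Dict, List, Set


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first subword split; '##' marks a continuation."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary in which "work" only exists as two pieces.
vocab = {"wo", "##rk", "[UNK]"}
embeddings: Dict[str, List[float]] = {tok: [0.0] * 4 for tok in vocab}
rng = random.Random(0)


def add_token(token: str) -> None:
    """Add a whole word to the vocabulary with a randomly initialized vector."""
    vocab.add(token)
    embeddings[token] = [rng.gauss(0.0, 0.02) for _ in range(4)]


before = wordpiece("work", vocab)
add_token("work")
after = wordpiece("work", vocab)
print(before, after)
```

With a real BERT checkpoint, the Hugging Face Transformers library offers `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` for the same purpose; whether the inventors used that library is not stated in the text.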
Furthermore, in step 2.2, fine-tuning the BERT Chinese pre-training model with Net-CR-Dataset comprises: the original tweets and retweet/comment data $V=\{V_1,V_2,\dots,V_m\}$ are tokenized to obtain $\tilde{V}$. Then $\tilde{V}$ is input into the further pre-trained and fine-tuned BERT model, and sentence features are extracted with the Transformer to obtain the sentence vectors $W=\{w_1,w_2,\dots,w_m\}$, as shown below:
$$\tilde{V}=\mathrm{Tokenize}(V)\tag{7}$$
$$W=\mathrm{Transformer}(\tilde{V})\tag{8}$$
where $\mathrm{Tokenize}$ is the tokenization operation in the BERT model.
Further, step 2.3 specifically comprises:
step 2.3.1: the SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in the following formula:
$$F=S\oplus W\tag{9}$$
where $\oplus$ denotes concatenation;
step 2.3.2: the feature vector $F$ of the tweet is passed through the $\mathrm{Softmax}$ function to obtain the final classification result, as shown in the following formula:
$$\hat{y}_d=\mathrm{Softmax}(F)\tag{10}$$
where $\hat{y}_d$ is the predicted probability that the tweet sample $d$ is a rumor;
step 2.3.3: the optimization goal of the model is to minimize the cross-entropy loss function, as shown in the following formula:
$$\mathcal{L}=-\sum_{d\in D}\left[y_d\log\hat{y}_d+(1-y_d)\log(1-\hat{y}_d)\right]\tag{11}$$
where $D$ denotes the set of sample data and $y_d$ is the true label of the tweet sample $d$.
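Steps 2.3.1 to 2.3.3 can be sketched end to end: concatenate the two feature vectors, map them to two class scores, apply Softmax and evaluate the cross-entropy loss. The linear projection to two classes is an assumption of this sketch (the text only names the Softmax function), and all values and function names are hypothetical.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(s: List[float], w: List[float],
             weight: List[List[float]], bias: List[float]) -> List[float]:
    """Concatenate S and W, project to two classes, return class probabilities."""
    f = s + w  # concatenation of structural and semantic features
    logits = [sum(a * b for a, b in zip(row, f)) + b0
              for row, b0 in zip(weight, bias)]
    return softmax(logits)


def cross_entropy(p_rumor: List[float], labels: List[int]) -> float:
    """Binary cross-entropy summed over the samples."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for p, y in zip(p_rumor, labels))


# One hypothetical sample with 2-dimensional S and W; class 1 means rumor.
S_vec, W_vec = [1.0, 0.0], [0.0, 1.0]
weight = [[0.0, 0.0, 0.0, 0.0],   # non-rumor score
          [1.0, 0.0, 0.0, 1.0]]   # rumor score
bias = [0.0, 0.0]
probs = classify(S_vec, W_vec, weight, bias)
loss = cross_entropy([probs[1]], [1])
print(probs, loss)
```

Concatenation keeps the two feature types separate until the final projection, so the classifier can weight structural and semantic evidence independently.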
The beneficial effects of the invention are as follows:
1. To support rumor detection research based on graph structure, the invention constructs a Cantonese rumor data set with graph structure; the data set comprises 2,419 original tweets, 112,539 retweets and 92,260 comments, forming 207,218 nodes and 202,437 edges in total.
2. The invention further pre-trains and fine-tunes the BERT Chinese pre-training model on a large collected Cantonese corpus, so that the model learns the semantic information of the vocabulary unique to Cantonese. Meanwhile, the preprocessing flow before training is modified and the vocabulary of the BERT Chinese pre-training model is extended, so that the model can handle the variant characters and rare characters of Cantonese, is better suited to the mixed Chinese-English grammatical structure unique to Cantonese, and captures richer semantic information.
3. The invention designs a Cantonese rumor detection model, SA-GCN, based on a deep semantic perception graph convolutional network, which extracts the structural features of tweets with an improved graph convolutional network, captures the semantic features of tweets with a BERT pre-training model that has been further pre-trained and fine-tuned on Cantonese rumor data, and finally fuses the two types of features.
Drawings
FIG. 1 is a diagram of the SA-GCN model structure of the present invention.
FIG. 2 is a comparison chart of the feature ablation results.
FIG. 3 shows the ROC curves.
FIG. 4 is a diagram illustrating the early detection capability of the model.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
The invention provides a social network Cantonese rumor detection method based on a deep semantic perception graph convolutional network. First, multiple groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to collect the relevant tweets, users, retweets and comments, and the data set Net-CR-Dataset is built once data annotation is complete. Second, the invention designs a deep semantic perception graph convolutional network model, SA-GCN. The BERT Chinese pre-training model is optimized for the unique linguistic features of Cantonese and is further pre-trained and fine-tuned on a large collected Cantonese corpus, so as to extract semantic feature vectors of the tweets. An improved GCN is applied to extract the structural features of the tweets and generate structural feature vectors. Finally, the SA-GCN model fuses the structural feature vectors and the semantic feature vectors to obtain the final classification result.
The method comprises the following specific processes:
Step 1: Compile several groups of keywords for health-related Cantonese rumors, collect the related tweets, users, retweets and comments, and construct Net-CR-Dataset, a Cantonese rumor data set with graph structure information. That is, the entities in the social network and the relationships between them are modeled as a graph $G = \langle V, E \rangle$. The social network contains entities such as original tweets, retweets and comments, as well as behaviors such as posting, retweeting and commenting.
$T = \{t_1, t_2, \ldots, t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets. $R_i = \{r_1^i, r_2^i, \ldots, r_n^i\}$ denotes the set of retweets and comments of the original tweet $t_i$, where $r_j^i$ is a retweet/comment of $t_i$ and $n$ is the number of retweets and comments. $V = \{V_1, V_2, \ldots, V_m\}$, where $V_i = \{t_i, R_i\}$ is the node set of the original tweet $t_i$; it contains the node of $t_i$ and the nodes of its retweet and comment set $R_i$. $E = \{E_1, E_2, \ldots, E_m\}$, where $E_i = \{e_{jk}^i\}$ is the edge set of the original tweet $t_i$ and represents the retweet/comment relationships between nodes. For example, if $r_k^i$ is a retweet/comment of $r_j^i$, there is an edge between the two nodes, i.e., $e_{jk}^i \in E_i$. The graph studied by the invention is undirected, so the direction of an edge is not considered.
$X = \{x_1, x_2, \ldots, x_m\}$ is the feature matrix of the original tweet set $T$, with $x_i \in \mathbb{R}^k$, where $k$ is the dimension of the feature $x_i$ and $x_i$ is the feature vector of node $t_i$. $A \in \{0,1\}^{|V| \times |V|}$ is the adjacency matrix of graph $G$. The adjacency matrix represents the adjacency relationships between nodes and indicates whether any two nodes in the graph are connected by an edge. Suppose there is an edge $e_{ij}^c$ between nodes $r_i^c$ and $r_j^c$; the adjacency matrix $A$ then takes the form shown in equation (1):
$$a_{ij} = \begin{cases} 1, & e_{ij}^c \in E_c \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
The invention treats rumor detection as a binary classification problem: each original tweet $t_i$ has a label $y_i \in \{0, 1\}$, where 0 denotes a non-rumor and 1 denotes a rumor. The rumor detection goal of the invention is therefore to learn a classifier $f$, as shown in equation (2):
$$f: T \to Y \qquad (2)$$
where $T$ and $Y$ are the original tweet set and the label set, respectively. The invention predicts the label of a tweet from its structural features and semantic features.
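As an illustration of the graph model in Step 1, the following minimal Python sketch builds the undirected 0/1 adjacency matrix of equation (1) from a list of retweet/comment relations. The node identifiers and the helper name `build_adjacency` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Tuple


def build_adjacency(nodes: List[str], edges: List[Tuple[str, str]]) -> List[List[int]]:
    """Return the |V| x |V| 0/1 adjacency matrix of an undirected graph."""
    index: Dict[str, int] = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    adjacency = [[0] * size for _ in range(size)]
    for source, target in edges:
        i, j = index[source], index[target]
        # Edge direction is ignored, so both entries are set.
        adjacency[i][j] = 1
        adjacency[j][i] = 1
    return adjacency


# Hypothetical propagation tree: t1 is an original tweet, r1 and r2 respond
# to it, and r3 responds to r1.
nodes = ["t1", "r1", "r2", "r3"]
edges = [("t1", "r1"), ("t1", "r2"), ("r1", "r3")]
A = build_adjacency(nodes, edges)
```

A dense list-of-lists matrix is used only to keep the sketch readable; for a graph with roughly 200,000 nodes a sparse representation would be the more practical choice.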
Step 2: Fuse the BERT model, the GCN and an attention mechanism to obtain the social network Cantonese rumor detection model SA-GCN: extract the structural feature vector of a tweet with an improved GCN; optimize the BERT Chinese pre-trained model according to the distinctive linguistic features of Cantonese, and further pre-train and fine-tune it on a large collected Cantonese corpus so as to extract the semantic feature vector of the tweet; finally, fuse the two types of features to obtain the final classification result.
The invention provides a new social network Cantonese rumor detection model, SA-GCN, which integrates the BERT model, the GCN and an attention mechanism to detect Cantonese rumors effectively. The structure of the SA-GCN model is shown in FIG. 1.
Step 2.1: structural feature extraction
The GCN is a multi-layer neural network that operates directly on a graph and updates the representation of a node from the attributes of its neighborhood. The work of Kipf et al. demonstrated the effectiveness of graph convolutional networks for node classification. An $L$-layer GCN can capture information from neighbors up to $L$ hops away, so a shallow GCN cannot aggregate the features of distant nodes. However, studies have also shown that deep GCNs do not perform as well as 2-layer networks. To address this problem, Guo et al. introduced dense connections into the GCN and proposed an attention-guided graph convolutional network for relation extraction. The SA-GCN model proposed by the invention is inspired by that work.
Because the distribution of retweets and comments along the propagation path of a tweet is closely related to whether the original tweet is judged to be a rumor, the invention takes the original tweets, retweets and comments in Net-CR-Dataset as nodes and the retweet and comment relationships as edges, converts the propagation paths of tweets in the social network into graph-structured data, and uses the improved GCN to aggregate the information along the propagation paths, thereby generating a high-level structural feature representation of each tweet.
(1) Multi-head attention mechanism
The improved GCN consists of $M$ blocks, each containing three modules: multi-head attention, dense connections and linear combination. In this part, a multi-head attention mechanism is applied to mine latent structural dependencies between vertices, especially between nodes that are not directly connected and are separated by multiple hops. Specifically, the node features $U = \{u_1, u_2, \ldots, u_N\}$ are first generated using the Cantonese pre-trained word vectors provided by fastText, where $N$ is the total number of nodes. Next, by constructing an attention adjacency matrix $A$, the propagation tree of the original tweet is converted into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account. The $m$-th attention adjacency matrix, associated with the $m$-th head, is calculated as shown in equation (3):
$$A^{(m)} = \mathrm{softmax}\left(\frac{Q W_m^{Q} \left(K W_m^{K}\right)^{T}}{\sqrt{d}}\right) \qquad (3)$$
where $Q$ and $K$ are both equal to the node features, i.e., the extracted node features $U$; $d$ is the dimension of the feature vectors; and $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively. $A^{(m)}$ is used in the graph convolution described below.
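The attention adjacency matrix of equation (3) can be sketched in plain Python as scaled dot-product attention over the node features. The toy sizes, the random initialization and the square shape of the two transformation matrices are assumptions made for illustration only; the patent does not specify them.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def softmax_rows(scores: Matrix) -> Matrix:
    """Apply a numerically stable softmax to every row."""
    result = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([value / total for value in exps])
    return result


def attention_adjacency(u: Matrix, w_q: Matrix, w_k: Matrix) -> Matrix:
    """softmax((U W_q)(U W_k)^T / sqrt(d)) for a single head, with Q = K = U."""
    d = len(u[0])
    q = matmul(u, w_q)
    k = matmul(u, w_k)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)


random.seed(0)
N, d, heads = 4, 3, 2  # toy sizes, chosen only for illustration
U = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
adjacencies = []
for _ in range(heads):
    W_q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    W_k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
    adjacencies.append(attention_adjacency(U, W_q, W_k))
```

Each row of a resulting matrix sums to 1, so every node distributes a unit of attention over all nodes, including those it is not directly connected to in the propagation tree.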
(2) Densely connected layer
In this part, densely connected layers are used to capture the features of both local and distant nodes, which addresses the inability of a shallow GCN to learn information from deeply associated nodes and yields better node representations. Each densely connected layer contains $L$ sub-layers. For node $i$, its output after the $l$-th sub-layer is given by equation (4):
$$h_i^{(l)} = \rho\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right) \qquad (4)$$
where $\rho$ is the ReLU function; the weight matrix $W_m^{(l)}$ and the bias $b_m^{(l)}$ depend on $A^{(m)}$; and $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, as shown in equation (5):
$$g_j^{(l)} = \left[h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}\right] \qquad (5)$$
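To make the dense connections of equations (4) and (5) concrete, the following plain-Python sketch runs two sub-layers for a single head on a toy graph. The adjacency values, weights and biases are hypothetical numbers chosen so that the result can be checked by hand, and row vectors are used, so the product is written as g W rather than W g.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def dense_block(adjacency: Matrix, h0: Matrix,
                weights: List[Matrix], biases: List[List[float]]) -> List[Matrix]:
    """Run the densely connected sub-layers for one head; return each h^(l)."""
    outputs: List[Matrix] = []
    g = [row[:] for row in h0]  # input of the first sub-layer is h^(0)
    for w, b in zip(weights, biases):
        aggregated = matmul(adjacency, matmul(g, w))  # sum_j A_ij * (W g_j)
        h = [[max(0.0, value + bias) for value, bias in zip(row, b)]
             for row in aggregated]  # add the bias, then apply ReLU
        outputs.append(h)
        # Dense connection: the next input concatenates all earlier outputs.
        g = [g_row + h_row for g_row, h_row in zip(g, h)]
    return outputs


adjacency = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[[1.0], [1.0]], [[1.0], [0.0], [1.0]]]
biases = [[0.0], [-1.0]]
outputs = dense_block(adjacency, h0, weights, biases)
```

Note how the weight matrix of the second sub-layer has three rows: its input is the two original features plus the one feature produced by the first sub-layer.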
(3) Linear combination
This part introduces a linear combination layer to integrate the representations from the different densely connected layers. The output of the linear combination layer is defined in equation (6):
$$S = W_{comb} h_{out} + b_{comb} \qquad (6)$$
where $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$, and $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer. $W_{comb}$ is a weight matrix and $b_{comb}$ is a bias vector.
Step 2.2: semantic feature extraction
Because the semantic information of tweets plays an important role in rumor detection, and the context-dependent word embeddings generated by the BERT model can capture information along many dimensions and produce more accurate and effective feature representations, the invention uses the BERT Cantonese pre-trained model as the word embedding extractor.
The invention builds a mapping table from data provided by the Hong Kong University of Science and Technology, converts variant characters in Cantonese into the corresponding Mandarin characters, and splits rare characters into components, which mitigates the problem that the Chinese pre-trained model may not have learned the semantics of variant and rare characters. In addition, so that the model can better handle the mixed Chinese-English grammar of Cantonese, the vocabulary of the BERT Chinese pre-trained model is expanded using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors. Specifically, English words commonly used in Cantonese are added to the vocabulary and their weights are initialized randomly. This prevents English words from being split into roots and affixes and improves the ability of the model to learn the semantics of English words in Cantonese.
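The two preprocessing ideas in this paragraph, character normalization and vocabulary expansion, can be sketched in plain Python as follows. The mapping entries, the toy vocabulary, the embedding size and the initialization scale are all hypothetical stand-ins; the actual mapping table and word list are not reproduced in the text.

```python
import random
from typing import Dict, List

# Toy stand-ins for the mapping table; the real entries are not given in the text.
VARIANT_TO_STANDARD: Dict[str, str] = {"衞": "衛", "綫": "線"}
RARE_TO_COMPONENTS: Dict[str, str] = {"嚟": "口黎"}


def normalize(text: str) -> str:
    """Replace variant characters and split rare characters into components."""
    out = []
    for ch in text:
        ch = VARIANT_TO_STANDARD.get(ch, ch)
        out.append(RARE_TO_COMPONENTS.get(ch, ch))
    return "".join(out)


def extend_vocab(vocab: List[str], embeddings: List[List[float]],
                 new_words: List[str], seed: int = 0) -> int:
    """Append unseen words with randomly initialized vectors; return the count added."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    known = set(vocab)
    added = 0
    for word in new_words:
        if word not in known:
            vocab.append(word)
            embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
            known.add(word)
            added += 1
    return added


vocab = ["[UNK]", "我", "好"]
embeddings = [[0.0] * 4 for _ in vocab]
added = extend_vocab(vocab, embeddings, ["sorry", "book", "好"])
```

With the Hugging Face transformers library, the analogous operations would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the inventors used that library is not stated in the text.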
The invention further pre-trains the BERT-Base-Chinese model on a Cantonese corpus collected widely from the Twitter platform, so that the model learns more of the characteristics of Cantonese. On this basis, Net-CR-Dataset is used to fine-tune the BERT Chinese pre-trained model so that it better suits the rumor detection task addressed by the invention. Specifically, the original tweets and the retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$ are tokenized to obtain $V' = \{V_1', V_2', \ldots, V_m'\}$. Then $V'$ is input into the further pre-trained and fine-tuned BERT model, and sentence features are extracted by the Transformer to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as shown in equations (7) and (8):
$$V' = \mathrm{Tokenize}(V) \qquad (7)$$
$$W = \mathrm{BERT}(V') \qquad (8)$$
where $\mathrm{Tokenize}$ is the tokenization operation of the BERT model.
Step 2.3: Feature fusion
The SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in equation (9), where $\oplus$ denotes concatenation:
$$F = S \oplus W \qquad (9)$$
The feature vector $F$ of the tweet is passed through the Softmax function to obtain the final classification result, as shown in equation (10):
$$p_d = \mathrm{Softmax}(F) \qquad (10)$$
where $p_d$ is the predicted probability that sample $d$ is a rumor;
The optimization goal of the model is to minimize the cross-entropy loss function, as shown in equation (11):
$$\mathcal{L} = -\sum_{d \in D} \left[ y_d \log p_d + (1 - y_d) \log (1 - p_d) \right] \qquad (11)$$
where $D$ denotes the sample data set and $y_d$ denotes the true label of sample $d$.
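The fusion and classification steps of equations (9) to (11) can be sketched as follows. The text applies Softmax to the fused vector directly; this sketch assumes a linear layer that maps the fused vector to two class scores first, and it uses the binary cross-entropy as one common reading of equation (11). The weights and inputs are hypothetical toy values.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(s: Sequence[float], w: Sequence[float],
             weight: List[List[float]], bias: List[float]) -> List[float]:
    """Concatenate S and W, apply an assumed linear layer, then softmax."""
    f = list(s) + list(w)  # F: concatenation of structural and semantic features
    logits = [sum(x * c for x, c in zip(f, row)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


def cross_entropy(rumor_probs: Sequence[float], labels: Sequence[int]) -> float:
    """Binary cross-entropy summed over the samples."""
    eps = 1e-12
    loss = 0.0
    for p, y in zip(rumor_probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


S = [1.0, 0.0]  # toy structural feature vector
W = [0.0, 1.0]  # toy semantic feature vector
weight = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]  # row 0: non-rumor, row 1: rumor
bias = [0.0, 0.0]
probs = classify(S, W, weight, bias)
loss = cross_entropy([probs[1]], [1])
```

Here `probs[1]` plays the role of the rumor probability for a single sample, and the loss shrinks toward zero as that probability approaches the true label.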
The experimental process comprises the following steps:
(1) data set
Current research lacks a public, authoritative benchmark data set for Cantonese rumors, and the Cantonese rumor data set CR-Dataset constructed in earlier work [Ke L, Chen X, Lu Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural Network. 33rd IEEE International Conference on Systems, Man, and Cybernetics, Toronto, Canada, 2020] lacks rich graph structure information and cannot provide the propagation structure features of rumors to a detection model. The invention therefore constructs a completely new Cantonese rumor data set.
The Web crawler is developed on the Scrapy framework to collect Cantonese tweets on the Twitter platform together with their multi-level retweets and comments, and data labeling is carried out according to strict criteria to construct Net-CR-Dataset. Because rumors generally concentrate on sensitive topics such as health issues, health-related Cantonese rumors on Twitter are taken as the main research object. A large amount of related data can be readily collected for them, which supports the research well and gives it practical significance.
The invention takes content published by authoritative official media as the factual basis and labels the collected Cantonese tweets strictly according to the definition of a rumor used in the invention (information that arises and spreads among the public and whose truth value cannot be confirmed or that is intentionally false; it usually arises in emergencies and can easily cause public panic, disrupt social order, damage government credibility and even endanger national security). Tweets that lack a factual basis and whose authenticity cannot be judged are filtered out. The invention compares the source tweet with the objective facts reported by authoritative media about the same event: if they are consistent, the tweet is labeled 0; otherwise, it is labeled 1.
Finally, the Cantonese rumor data set Net-CR-Dataset, which contains rich graph structure information, is constructed. The data set contains 2,419 original tweets, 112,539 retweets and 92,260 comments, for a total of 207,218 nodes and 202,437 edges. Table 1 shows the details of the data set.
[Table 1: details of the Net-CR-Dataset data set (table image not reproduced)]
(2) Experimental setup and data set
The experiments were run on a server with an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz and a Tesla V100 32 GB GPU. The data set used in all experiments is the Net-CR-Dataset constructed by the invention; its details are given in Table 1. In the experiments, 80% of the rumor data set is used as the training set, 10% as the validation set and 10% as the test set. Each experiment is run 10 times and the average is taken as the final result. The training, validation and test sets for each of the 10 runs are split at random.
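The evaluation protocol described above (a random 80/10/10 split, repeated 10 times and averaged) can be sketched as follows. Splitting at the level of original tweets, the per-run seeds and the placeholder `evaluate` callback are assumptions of this sketch; the text does not say how the split was implemented.

```python
import random
from typing import Callable, List, Sequence, Tuple

Split = Tuple[List[int], List[int], List[int]]


def split_80_10_10(items: Sequence[int], seed: int) -> Split:
    """Shuffle the items and cut them into train, validation and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def repeated_runs(items: Sequence[int], evaluate: Callable[[Split], float],
                  runs: int = 10) -> float:
    """Average a score over several independently drawn random splits."""
    scores = [evaluate(split_80_10_10(items, seed=run)) for run in range(runs)]
    return sum(scores) / len(scores)


tweet_ids = list(range(2419))  # one id per original tweet in the data set
train, val, test = split_80_10_10(tweet_ids, seed=0)
# Placeholder metric: a real callback would train and score a model.
mean_score = repeated_runs(tweet_ids, evaluate=lambda split: len(split[2]) / 2419)
```

Because the two cut points are rounded down, the remainder goes to the test part, which is why the three parts are not exactly 80%, 10% and 10% of 2,419.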
(3) Experiment one: feature ablation
To demonstrate the effectiveness of each component of the SA-GCN model, the invention compares the detection performance of SA-GCN and its variants on Net-CR-Dataset. The models are as follows:
1) SA-GCN \ Str: the SA-GCN model without structural features, using only semantic features;
2) SA-GCN \ Sem: the SA-GCN model without semantic features, using only structural features;
3) SA-GCN \ Att: the SA-GCN model without the attention mechanism;
4) SA-GCN \ BERT: the SA-GCN model without BERT-generated word embeddings. Instead, a word embedding matrix is generated from the Chinese and Cantonese pre-trained word vectors provided by fastText and fed into an Embedding layer placed before a Bi-LSTM network;
5) SA-GCN: the complete model proposed by the invention.
The experimental results are shown in FIG. 2. The SA-GCN model proposed by the invention performs best on every metric. Comparing SA-GCN with SA-GCN \ Sem shows that the semantic features of tweets play an important role in rumor detection. At the same time, because users in a social network interact with one another, nodes are closely related and the features of a node are influenced by its neighborhood; accordingly, SA-GCN, which takes the structural features of nodes into account, detects rumors better than SA-GCN \ Str. Comparing SA-GCN \ Str with SA-GCN \ Sem further shows that, in this task, semantic features contribute more to detection performance than structural features. A likely reason is that the numbers of retweets and comments in Net-CR-Dataset are unevenly distributed, so the model learns insufficient structural features for tweets with little propagation. In addition, SA-GCN \ BERT obtains an F1 score of 0.8692, about 12% lower than SA-GCN. This shows that, compared with common pre-trained models such as fastText, the BERT Chinese pre-trained model adopted by the invention, together with the further pre-training and fine-tuning on the Cantonese corpus, enables the model to learn the characteristics of Cantonese data better and thus detect rumors effectively. Finally, comparing SA-GCN \ Att with SA-GCN shows that the attention mechanism lets the model automatically identify the words and features that matter most for detection, which improves detection performance.
(4) Experiment two: detecting model effects
The invention compares the proposed SA-GCN model with other methods commonly used for rumor detection. The models are as follows:
1) SVM-RBF: an SVM model with an RBF kernel that uses handcrafted features based on the statistics of all posts;
2) DTC: a rumor detection method that uses a decision tree classifier over various handcrafted features to assess the credibility of information;
3) RFC: a random forest classifier using user features, linguistic features and structural features;
4) TextCNN: a convolutional neural network that captures text semantics for classification;
5) Bi-GCN: a rumor detection model for graph-structured data that captures the bidirectional propagation structure of rumors;
6) GLAN: a global-local attention network that fuses local semantic features and global structural features;
7) SA-GCN: the detection model proposed by the invention.
[Table 2: detection results of the compared models (table image not reproduced)]
In the experiment, the inputs of the SVM-RBF, DTC and RFC models are vectors generated by the TF-IDF algorithm, and the input of the TextCNN model is word embeddings generated from the fastText Chinese and Cantonese pre-trained word vectors. As Table 2 and FIG. 3 show, the SA-GCN model proposed by the invention reaches an F1 score of 0.9845 and an AUC of 0.9677, the best detection performance among the compared models. The deep learning detection models (TextCNN, Bi-GCN, GLAN, SA-GCN) generally outperform the traditional machine learning models (SVM-RBF, DTC, RFC), because deep learning models can learn high-order representations of rumors and thereby capture effective features. In addition, TextCNN performs worse than GLAN and SA-GCN, which indirectly confirms that adding propagation structure features improves detection and underlines the value of combining structural and semantic features in the invention. Meanwhile, because the Transformer extracts semantic features better than a CNN, SA-GCN, which uses the Transformer as its feature extractor, outperforms TextCNN and GLAN, which use a CNN. Moreover, to build a word embedding extractor better suited to Cantonese and to the rumor detection task, the invention further pre-trains and fine-tunes the BERT Chinese pre-trained model on the Cantonese corpus and data set, so that SA-GCN learns more of the characteristics of Cantonese data and achieves better detection performance.
Comparing Bi-GCN with SA-GCN, SA-GCN improves the F1 score by nearly 9% over Bi-GCN. The reason is that SA-GCN fuses the semantic features of rumors on top of the structural features, and semantic features directly reflect the meaning a tweet expresses, which markedly improves the detection performance of the model.
(5) Experiment three: early detection capability
Early detection capability refers to how well a model detects rumors in the initial period of their propagation, and it is one of the important indicators of the performance of a rumor detection model. The experiment sets different cutoff times and, for each, feeds the models only the data published before that cutoff. Accuracy is used as the metric to compare the early detection capability of the different models.
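The cutoff procedure can be illustrated with a short Python sketch that keeps only the retweets and comments observed up to each cutoff. The record layout, the cutoff values and the helper name `truncate_at` are hypothetical; the text does not list the cutoff times used in the experiment.

```python
from typing import Dict, List, Tuple

# (node id, hours elapsed since the original tweet was posted)
Response = Tuple[str, float]


def truncate_at(responses: List[Response], cutoff_hours: float) -> List[str]:
    """Keep only the retweets/comments observed up to the cutoff."""
    return [node for node, elapsed in responses if elapsed <= cutoff_hours]


responses = [("r1", 0.2), ("r2", 1.5), ("r3", 2.9), ("r4", 7.0)]
visible: Dict[int, List[str]] = {
    cutoff: truncate_at(responses, cutoff) for cutoff in (1, 3, 12)
}
```

A model evaluated at a given cutoff would then be given only the propagation graph induced by the visible nodes, so earlier cutoffs correspond to smaller graphs with less structural information.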
The experimental results are shown in FIG. 4. At every cutoff time, the accuracy of the SA-GCN model proposed by the invention is higher than that of the other models: it exceeds 0.8944 when the tweet has just begun to spread and reaches 0.9425 within 3 hours of propagation. This shows that the semantic and structural features introduced in SA-GCN are effective not only for long-term rumor detection but also for early detection. In addition, as the cutoff time is extended, the semantic and structural information of a tweet becomes richer while the data also becomes noisier; nevertheless, the accuracy curve of SA-GCN fluctuates less than those of the other models, which also demonstrates the stability of the proposed model.
In conclusion, the invention provides a method based on a deep semantic perception graph convolutional network for detecting Cantonese rumors in social networks. First, the invention builds a Web crawler and collects the relevant data from Twitter using several groups of rumor keywords. The numbers of retweets and comments of each tweet are constrained so that the data carries rich graph structure information. After manual data labeling is complete, the data set Net-CR-Dataset is constructed. Second, the invention designs a new model, SA-GCN, for detecting Cantonese rumors. The model learns the structural features of tweets with an improved GCN, captures their semantic features with a BERT pre-trained model further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural and semantic features. The experimental results show that the proposed SA-GCN model outperforms classical detection methods on the Cantonese rumor detection task and has strong early rumor detection capability.

Claims (6)

1. A Cantonese rumor detection method based on a deep semantic perception graph convolutional network, characterized by comprising the following steps:
step 1: compiling several groups of keywords for health-related Cantonese rumors, collecting the related tweets, users, retweets and comments, and constructing Net-CR-Dataset, a Cantonese rumor data set with graph structure information, that is, modeling the entities in the social network and the relationships between them as a graph $G = \langle V, E \rangle$;
step 2: fusing a BERT model, a GCN and an attention mechanism to provide the social network Cantonese rumor detection model SA-GCN: extracting the structural feature vector of a tweet using an improved GCN;
optimizing the BERT Chinese pre-trained model according to the distinctive linguistic features of Cantonese, and further pre-training and fine-tuning it on a large collected Cantonese corpus so as to extract the semantic feature vector of the tweet; finally, fusing the two types of features to obtain the final classification result;
step 2 comprises:
step 2.1: structural feature extraction: the original tweets, retweets and comments in Net-CR-Dataset are taken as nodes and the retweet and comment relationships as edges, the propagation paths of tweets in the social network are converted into graph-structured data, and the improved GCN (graph convolutional network) aggregates the information along the propagation paths to generate high-level structural feature vectors of the tweets;
the structural feature extraction of step 2.1 specifically comprises:
step 2.1.1: mining latent structural dependencies between vertices using a multi-head attention mechanism, including between nodes that are not directly connected and nodes separated by multiple hops; the specific process is as follows:
first, the Cantonese pre-trained word vectors provided by fastText are used to generate the node features $U = \{u_1, u_2, \ldots, u_N\}$, where $N$ is the total number of nodes;
then, by constructing an attention adjacency matrix $A$, the propagation tree of the original tweet is converted into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account; the $m$-th attention adjacency matrix, associated with the $m$-th head, is calculated as follows:
$$A^{(m)} = \mathrm{softmax}\left(\frac{Q W_m^{Q} \left(K W_m^{K}\right)^{T}}{\sqrt{d}}\right) \qquad (3)$$
where $Q$ and $K$ are both equal to the node features, i.e., the extracted node features $U$; $d$ is the dimension of the feature vectors; and $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively;
step 2.1.2: densely connected layers are used to capture the features of both local and distant nodes, which addresses the inability of a shallow GCN to learn information from deeply associated nodes and yields better node representations;
each densely connected layer comprises $L$ sub-layers; for node $i$, its output after the $l$-th sub-layer is as follows:
$$h_i^{(l)} = \rho\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right) \qquad (4)$$
where $\rho$ is the ReLU function; the weight matrix $W_m^{(l)}$ and the bias $b_m^{(l)}$ depend on $A^{(m)}$; $A^{(m)}$ is the $m$-th attention adjacency matrix, associated with the $m$-th head; $A_{ij}^{(m)}$, an element of the matrix $A^{(m)}$, describes the connection between node $i$ and node $j$; and $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, calculated as follows:
$$g_j^{(l)} = \left[h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}\right] \qquad (5)$$
step 2.1.3: a linear combination layer is introduced to integrate the representations from the different densely connected layers, the output of the linear combination layer being defined as follows:
$$S = W_{comb} h_{out} + b_{comb} \qquad (6)$$
where $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$, and $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer; $W_{comb}$ is a weight matrix and $b_{comb}$ is a bias vector.
2. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 1, characterized in that modeling the entities in the social network and the relationships between them as the graph $G = \langle V, E \rangle$ specifically comprises:
$T = \{t_1, t_2, \ldots, t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets; $R_i = \{r_1^i, r_2^i, \ldots, r_n^i\}$ denotes the set of retweets and comments of the original tweet $t_i$, where $r_j^i$ is a retweet/comment of $t_i$ and $n$ is the number of retweets and comments;
$V = \{V_1, V_2, \ldots, V_m\}$, where $V_i = \{t_i, R_i\}$ is the node set of the original tweet $t_i$, containing the node of $t_i$ and the nodes of its retweet and comment set $R_i$;
$E = \{E_1, E_2, \ldots, E_m\}$, where $E_i = \{e_{jk}^i\}$ is the edge set of tweet $t_i$, representing the retweet/comment relationships between nodes;
$X = \{x_1, x_2, \ldots, x_m\}$ denotes the feature matrix of the original tweet set $T$, with $x_i \in \mathbb{R}^k$, where $k$ is the dimension of the feature $x_i$ and $x_i$ is the feature vector of node $t_i$;
$A \in \{0,1\}^{|V| \times |V|}$ is the adjacency matrix of graph $G$, which represents the adjacency relationships between nodes and indicates whether any two nodes in the graph are connected by an edge;
suppose there is an edge $e_{ij}^c$ between the retweet/comment nodes $r_i^c$ and $r_j^c$; the adjacency matrix $A$ then takes the following form:
$$a_{ij} = \begin{cases} 1, & e_{ij}^c \in E_c \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$
where $E_c$ is the edge set of tweet $t_c$;
the rumor detection task is treated as a binary classification problem, in which each original tweet $t_i$ has a label $y_i \in \{0, 1\}$, 0 denoting a non-rumor and 1 denoting a rumor; the rumor detection goal is then to learn a classifier $f$:
$$f: T \to Y \qquad (2)$$
where $Y$ is the label set.
3. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 1, characterized in that step 2 further comprises:
step 2.2: semantic feature extraction: constructing a mapping table, converting variant characters in Cantonese into the corresponding Mandarin characters, and splitting rare characters; expanding the vocabulary of the BERT Chinese pre-trained model; further pre-training the BERT-Base-Chinese model on the collected Cantonese corpus so that the model learns more of the characteristics of Cantonese, and fine-tuning the BERT Chinese pre-trained model on Net-CR-Dataset to obtain the BERT Cantonese pre-trained model, thereby extracting the semantic feature vectors of the tweets;
step 2.3: the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result.
4. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 3, characterized in that expanding the vocabulary of the BERT Chinese pre-trained model in step 2.2 comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, adding English words commonly used in Cantonese to the vocabulary and initializing their weights randomly, which prevents English words from being split into roots and affixes and improves the ability of the model to learn the semantics of English words in Cantonese.
5. The method according to claim 3, characterized in that fine-tuning the BERT Chinese pre-trained model on Net-CR-Dataset in step 2.2 comprises: tokenizing the original tweets and the retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$ to obtain $V' = \{V_1', V_2', \ldots, V_m'\}$; then inputting $V'$ into the further pre-trained and fine-tuned BERT model and extracting sentence features with the Transformer to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as follows:
$$V' = \mathrm{Tokenize}(V) \qquad (7)$$
$$W = \mathrm{BERT}(V') \qquad (8)$$
where $\mathrm{Tokenize}$ is the tokenization operation of the BERT model.
6. The Cantonese rumor detection method based on the deep semantic perception graph convolutional network of claim 3, wherein step 2.3 specifically comprises:
step 2.3.1: the SA-GCN model concatenates the structural feature vector S and the semantic feature vector W to obtain a feature vector F, as shown in the following formula:
[Formula (9), original image FDA0003668667710000041: F as the concatenation of S and W]
step 2.3.2: the feature vector F of the tweet is passed through a Softmax function to obtain the final classification result, as shown in the following formula:
p_d = Softmax(F) (10)
wherein p_d is the probability that the tweet sample d to be predicted is a rumor;
step 2.3.3: the optimization objective of the model is to minimize the cross-entropy loss function, as shown in the following equation:
[Formula (11), original image FDA0003668667710000042: cross-entropy loss over the samples d in D]
where D represents the sample data set and y_d represents the true label of the tweet sample d to be predicted.
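Steps 2.3.1 to 2.3.3 can be sketched end to end in plain Python. Several details here are assumptions rather than statements of the claim: the linear projection from F to two class scores (the claim applies Softmax to F directly), the hypothetical weights `proj_weights`, the toy feature values, and the use of the mean rather than the sum of per-sample losses.

```python
import math


def concat(s, w):
    """Step 2.3.1: concatenate structural and semantic feature vectors."""
    return list(s) + list(w)


def linear(x, weights, bias):
    """Assumed projection of F to one score per class (not in the claim)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    """Step 2.3.2: numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, labels, eps=1e-12):
    """Step 2.3.3: mean negative log-likelihood of the true class."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


# Toy feature vectors for one tweet; class 1 is taken to mean "rumor".
S = [0.2, -0.1]          # structural features
W = [0.5, 0.3, -0.4]     # semantic features
F = concat(S, W)

# Hypothetical projection weights: 2 classes x 5 features.
proj_weights = [[0.1, -0.2, 0.3, 0.0, 0.5],
                [-0.1, 0.2, -0.3, 0.1, -0.5]]
proj_bias = [0.0, 0.0]

p = softmax(linear(F, proj_weights, proj_bias))
loss = cross_entropy([p], [1])
```

Subtracting the maximum score before exponentiating leaves the softmax output unchanged while avoiding overflow for large scores.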
CN202210371266.1A 2022-04-08 2022-04-08 Cantonese rumor detection method based on deep semantic perception map convolutional network Active CN114444516B (en)

Publications (2)

Publication Number Publication Date
CN114444516A (en) 2022-05-06
CN114444516B (en) 2022-07-05

Family

ID=81359641

Country Status (1)

Country Link
CN (1) CN114444516B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116432644B (en) * 2023-06-12 2023-08-15 南京邮电大学 News text classification method based on feature fusion and double classification
CN117573988A (en) * 2023-10-17 2024-02-20 广东工业大学 Offensive comment identification method based on multi-modal deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019183191A1 (en) * 2018-03-22 2019-09-26 Michael Bronstein Method of news evaluation in social media networks
CN112035669A (en) * 2020-09-09 2020-12-04 中国科学技术大学 Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN112256945A (en) * 2020-11-06 2021-01-22 四川大学 Social network Cantonese rumor detection method based on deep neural network
CN113343126A (en) * 2021-08-06 2021-09-03 四川大学 Rumor detection method based on event and propagation structure
CN113919440A (en) * 2021-10-22 2022-01-11 重庆理工大学 Social network rumor detection system integrating dual attention mechanism and graph convolution

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112231562B (en) * 2020-10-15 2023-07-14 北京工商大学 Network rumor recognition method and system
CN113094596A (en) * 2021-04-26 2021-07-09 东南大学 Multitask rumor detection method based on bidirectional propagation diagram
CN113705099B (en) * 2021-05-09 2023-06-13 电子科技大学 Social platform rumor detection model construction method and detection method based on contrast learning
CN113268675B (en) * 2021-05-19 2022-07-08 湖南大学 Social media rumor detection method and system based on graph attention network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Novel Approach for Cantonese Rumor Detection based on Deep Neural Network; Liang Ke et al.; 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC); 20201014; pp. 1610-1615 *
Integrating Semantic and Structural Information with Graph Convolutional Network for Controversy Detection; Lei Zhong et al.; arXiv:2005.07886v1 [cs.CL]; 20200516; pp. 1-13 *
A Sina Weibo rumor detection method based on a weighted graph convolutional neural network; Wang Xinyan et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 20210831; Vol. 42, No. 8; pp. 1780-1786 *

Also Published As

Publication number Publication date
CN114444516A (en) 2022-05-06

Similar Documents

Publication Publication Date Title
Gong et al. Hashtag recommendation using attention-based convolutional neural network.
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
CN108595708A (en) A kind of exception information file classification method of knowledge based collection of illustrative plates
CN111382575A (en) Event extraction method based on joint labeling and entity semantic information
CN110532328B (en) Text concept graph construction method
CN114444516B (en) Cantonese rumor detection method based on deep semantic perception map convolutional network
CN113051916B (en) Interactive microblog text emotion mining method based on emotion offset perception in social network
Sivakumar et al. Review on word2vec word embedding neural net
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN110909529B (en) User emotion analysis and prejudgment system of company image promotion system
Uppal et al. Fake news detection using discourse segment structure analysis
Kirchknopf et al. Multimodal detection of information disorder from social media
CN112329444A (en) Early rumor detection method fusing text and propagation structure
Zhang et al. Exploring deep recurrent convolution neural networks for subjectivity classification
CN115017887A (en) Chinese rumor detection method based on graph convolution
CN117112786A (en) Rumor detection method based on graph attention network
Cai et al. Deep learning approaches on multimodal sentiment analysis
CN116910238A (en) Knowledge perception false news detection method based on twin network
Pang et al. Domain relation extraction from noisy Chinese texts
CN115631504A (en) Emotion identification method based on bimodal graph network information bottleneck
CN115329073A (en) Attention mechanism-based aspect level text emotion analysis method and system
Xiang et al. Aggregating local and global text features for linguistic steganalysis
Lan et al. Mining semantic variation in time series for rumor detection via recurrent neural networks
Wang et al. Using ALBERT and Multi-modal Circulant Fusion for Fake News Detection
Wang et al. Hierarchical network emotional assistance mechanism for emotion cause extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant