CN114444516A - Cantonese rumor detection method based on deep semantic perception map convolutional network - Google Patents

Cantonese rumor detection method based on deep semantic perception graph convolutional network

Info

Publication number
CN114444516A
CN114444516A CN202210371266.1A CN202210371266A CN114444516A CN 114444516 A CN114444516 A CN 114444516A CN 202210371266 A CN202210371266 A CN 202210371266A CN 114444516 A CN114444516 A CN 114444516A
Authority
CN
China
Prior art keywords
model
cantonese
gcn
rumor
bert
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210371266.1A
Other languages
Chinese (zh)
Other versions
CN114444516B (en)
Inventor
王海舟
陈欣雨
柯亮
方怡萱
王森
蔡易成
王文贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202210371266.1A priority Critical patent/CN114444516B/en
Publication of CN114444516A publication Critical patent/CN114444516A/en
Application granted granted Critical
Publication of CN114444516B publication Critical patent/CN114444516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of rumor detection, and in particular discloses a Cantonese rumor detection method based on a deep semantic perception graph convolutional network. First, several groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to acquire the relevant tweets, users, retweets and comments, and the data set Net-CR-Dataset is constructed after data annotation is completed. Second, a deep semantic perception graph convolutional neural network model, SA-GCN, is designed. The BERT Chinese pre-trained model is optimized according to the unique linguistic features of Cantonese, and is further pre-trained and fine-tuned on a large collected Cantonese corpus in order to extract the semantic feature vectors of tweets. An improved GCN extracts the structural features of tweets and generates structural feature vectors. Finally, the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result. The invention outperforms other common detection methods in both detection effect and early detection capability.

Description

Cantonese rumor detection method based on deep semantic perception graph convolutional network
Technical Field
The invention relates to the technical field of rumor detection, and in particular to a Cantonese rumor detection method based on a deep semantic perception graph convolutional network.
Background
Social media gives people a platform for following trending events, voicing opinions and making friends, and it plays an indispensable role in daily life. The "Digital 2021" report shows that by January 2021 the number of active social media users worldwide had reached 4.2 billion, about 53.6% of the world's population. Because social networks strongly influence public opinion, rumors keep emerging on them; these rumors not only disturb network order and cause social panic, but also bring economic losses in the real world and endanger national security. Besides the common English and Chinese rumors, Cantonese rumors are a persistent problem in social networks. Cantonese, a branch of Chinese, is prevalent not only in Guangdong, Hong Kong and Macau in China, but also among overseas Chinese communities. More than 120 million people around the world use Cantonese, and the Cantonese rumors that spread widely in social networks also have a serious adverse effect on the offline world. An effective method for automatically detecting Cantonese rumors in social networks is therefore needed.
Traditional rumor detection methods mainly adopt a supervised learning strategy and train a machine learning classifier on features manually extracted from text content, user profiles and propagation patterns, for example SVM (Support Vector Machine), RF (Random Forest), Bayesian classifiers and DT (Decision Tree). Further research has extracted more effective features, such as time series features and topic features. Detection methods based on traditional machine learning rely mainly on feature engineering, which requires a large investment of time. Moreover, effective high-order feature representations are difficult to extract manually, which limits the performance of such methods.
In recent years, powerful deep learning models have been widely applied to this task to solve the problems of detection methods based on machine learning. RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network) and their variants and combinations have all made significant progress in the field of rumor detection. In addition, some studies further improve detection by introducing techniques such as attention mechanisms and generative adversarial learning. However, most of these methods treat the microblog posts to be detected as independent individuals and neglect the connections between them. A social network is fundamentally a graph: it contains not only entities such as users, posts and hashtags, but also relationships such as friendship, retweeting and commenting. The graph structure carries rich information and thus provides new features for rumor detection. For example, if a user interacts frequently (e.g., by retweeting, commenting or following) with multiple rumor-spreading users, the probability that a post published by that user is a rumor increases greatly.
To address this problem, recent research has focused on rumor detection using the propagation structure information in the graph. Some studies model the propagation behavior of posts as a propagation tree and design models such as PTK (Propagation Tree Kernel) and RvNN (Recursive Neural Network) to learn the structural features of the propagation tree. Methods such as GCN (Graph Convolutional Network), GAT (Graph Attention Network) and PGNN (Propagation Graph Neural Network) have also been proposed to extract the global structural features of posts from the propagation graph, thereby improving the detection model. However, most of these methods focus only on obtaining propagation and structure information from the way posts spread and develop over time, and ignore features from text content, user profiles and so on, which may cause important information (e.g., text features) to be lost and degrade the final detection effect. In addition, a shallow GCN cannot learn the features of distant nodes. Studies have shown that although deeper GCNs can capture richer neighbor information, 2-layer GCNs perform best.
Research on the detection of Cantonese rumors was first carried out in the literature [KE L, CHEN X, LU Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural networks: 33rd IEEE International Conference on Systems, Man, and Cybernetics [C], Toronto, Canada, 2020.], which widely collected Cantonese rumors on Twitter and constructed the first relatively complete Cantonese rumor data set, CR-Dataset. That work also proposed 27 statistical features of Cantonese rumors and designed a detection model based on deep learning that fuses the semantic features and statistical features of rumors for detection. Experiments showed that the method achieves an excellent detection effect. However, the method does not take propagation structure features into account, although these are important for identifying rumors in practice. In addition, the constructed CR-Dataset lacks the structural information of tweets.
On the one hand, traditional rumor detection methods cannot be applied directly to Cantonese. The methods above were mainly developed for English and Mandarin rumors in social networks, and solutions for Cantonese rumors are lacking. Because Cantonese contains new words, unique colloquial words and mixed Chinese-English grammatical structures, traditional methods cannot achieve their best detection effect on it. On the other hand, Cantonese rumors that spread widely in social networks can have serious negative effects on the real world. As one of the most popular and influential dialects of Chinese, Cantonese has more than 120 million users worldwide, who are widely distributed not only in Guangdong Province, Hong Kong and Macau in China, but also in 12 other countries such as Singapore, Thailand, the USA and Canada.
Therefore, a new graph-structure-based method is needed for detecting Cantonese rumors in social networks.
The unique linguistic features of Cantonese pose a serious challenge for detecting Cantonese rumors in social networks. Unlike standard Chinese, Cantonese contains many variant characters (e.g., - ) and rare characters (e.g., ). Although etymologically related syllables carry the same (or closely related) meaning in Modern Standard Chinese and Cantonese, some Cantonese users write them with variant characters. It is difficult for a Chinese language model to learn the semantics of these unique characters automatically. Meanwhile, the mixed Chinese-English syntax of Cantonese makes semantic feature extraction difficult. Over time, Cantonese users have gradually absorbed English words into the Cantonese language system. For example: "To do oh work coffee" (meaning that doing so is not feasible). This usage makes it difficult to segment words and to obtain their contextual semantics.
In recent years, researchers have studied automated rumor detection in social networks extensively. The work falls mainly into detection methods based on traditional machine learning and detection methods based on deep learning.
(1) Detection method based on traditional machine learning
Most rumor detection research trains rumor classifiers on features manually extracted from the text content of microblog posts, user profiles, propagation paths and so on. Castillo et al. extracted user-, message-, topic- and propagation-based features from "tweet" and "retweet" behaviors on the Twitter platform to assess the credibility of a given tweet. Yang et al. extended the feature set proposed by Castillo et al., adding client-based and location-based features to the earlier content-, account- and propagation-based features, and achieved rumor detection on the Sina Weibo platform. Kwon et al. explored conventional structural and linguistic features together with novel temporal features that capture the sudden fluctuations of rumors over time. Ma et al. designed a time series model that captures how the social context of a rumor changes over time, rather than only the tweet volume feature. Detection methods of this kind rely mainly on feature engineering and therefore require a great deal of research time and human labor. Moreover, some implicit features are hard to find through feature engineering, and manually extracted features are not robust, which makes it particularly difficult to improve the performance of such methods.
(2) Detection method based on deep learning
To solve the problems faced by traditional machine learning methods, many studies use deep learning for feature learning, so as to capture high-order feature representations and achieve better classification. Ma et al. were the first to use an RNN model to identify rumors by learning the temporal and textual representations of rumor posts. On this basis, Chen et al. proposed incorporating an attention mechanism into the RNN model to capture the text features that matter most to the detection task. Jin et al. extracted multi-modal features from rumor text, images and social context for rumor detection. These RNN-based methods detect rumors effectively but are not suitable for early rumor detection. To address this problem, Yu et al. devised a CNN-based model that identifies misinformation efficiently and enables early detection during propagation. Some recent studies also combine RNN and CNN for detection. Liu et al. used RNN and CNN networks to learn the global and local changes in user characteristics in order to identify fake news. Ma et al. introduced the idea of generative adversarial learning, generating adversarial noise so that the classifier learns a stronger rumor representation and achieves more robust and effective detection.
Furthermore, Ke et al. were the first to study Cantonese rumor detection in social networks. They extracted 27 statistical features of Cantonese rumors in four categories (content, user, propagation and comment) and designed a Cantonese rumor detection model, BLA (a BERT-based Bi-LSTM network with an attention mechanism). The model extracts the semantic features of tweets with a BERT (Bidirectional Encoder Representations from Transformers) model, a Bi-LSTM (Bi-directional Long Short-Term Memory) network and an attention mechanism, and then concatenates them with the extracted statistical features, thereby recognizing Cantonese rumors effectively. However, classical deep learning rumor detection techniques mostly focus on extracting features from text, images and similar sources, and ignore the structural features of rumor propagation.
In addition, in the graph structure of a social network, the propagation process of a rumor carries rich information. Some studies have used the propagation relationships in the graph structure for rumor detection. Sicilia et al. combined content-based features with fine-grained features inspired by graph theory to detect rumors in posts from a single topic domain related to health news. Ma et al. proposed a kernel-based approach that obtains high-order rumor representations by comparing the similarity of the propagation tree structures of microblog posts. This method achieves a good detection effect, but it cannot automatically learn high-order features from flat features that contain noise. To solve this problem, Ma et al. proposed learning the propagation tree of microblog posts with an RvNN network, thereby extracting effective semantic and propagation features. Building on these methods, Yuan et al. also considered the connections between different propagation trees and proposed a novel model that encodes local context information and global structure information. Bian et al. proposed a bi-directional graph model, Bi-GCN, that learns high-order feature representations from both the propagation and dispersion directions. Yang et al. designed a graph adversarial learning approach to enhance the robustness and generalizability of rumor detection models. Most graph-structure-based detection methods do not consider the fusion of multiple features, which may lose important information such as text content and user information. Meanwhile, the common shallow GCN cannot capture the features of distant neighbors.
Moreover, existing studies have rarely explored Cantonese rumors, even though Cantonese is a major branch of Chinese with a widely distributed population of speakers and Cantonese rumors keep emerging in social networks. The invention therefore studies the detection of Cantonese rumors on Twitter based on graph structure and feature fusion.
However, most existing graph-structure-based detection methods ignore the fusion of multiple features, resulting in the loss of important information (such as text content). In addition, common shallow GCNs may be unable to capture the features of distant neighbors.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a Cantonese rumor detection method based on a deep semantic perception graph convolutional network, which learns the structural features of tweets with an improved GCN, captures the semantic features of tweets with a BERT Cantonese pre-trained model that has been further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural features and the semantic features, so that its detection effect and early detection capability are superior to those of other common detection methods. The technical scheme is as follows:
a Cantonese rumor detection method based on a deep semantic perception graph convolutional network comprises the following steps:
step 1: constructing several groups of health-related Cantonese rumor keywords, acquiring the related tweets, users, retweets and comments, and constructing a Cantonese rumor data set with graph structure information, Net-CR-Dataset; that is, the data set is modeled as a graph $G=\langle V,E\rangle$ according to the entities in the social network and the relationships between them;
Step 2: fusing a BERT model, a GCN and an attention mechanism to provide a social network Cantonese rumor detection model, SA-GCN (Semantic-Aware Graph Convolutional Network): extracting the structural feature vector of a tweet with an improved GCN;
optimizing the BERT Chinese pre-trained model according to the unique linguistic features of Cantonese, and further pre-training and fine-tuning it on a large collected Cantonese corpus so as to extract the semantic feature vectors of tweets; and finally fusing the two types of features to obtain the final classification result.
Further, modeling the data set as a graph $G=\langle V,E\rangle$ according to the entities in the social network and the relationships between them specifically comprises:
$T=\{t_1,t_2,\ldots,t_m\}$ denotes the set of original tweets, where $m$ is the number of original tweets; $R_i=\{r_{i1},r_{i2},\ldots,r_{in}\}$ denotes the set of retweets and comments of original tweet $t_i$, where $r_{ij}$ is a retweet/comment of $t_i$ and $n$ is the number of retweets and comments;
$V=\{V_1,V_2,\ldots,V_m\}$, where $V_i=\{t_i,R_i\}$ is the node set of original tweet $t_i$, comprising node $t_i$ and the retweet and comment node set $R_i$;
$E=\{E_1,E_2,\ldots,E_m\}$, where $E_i=\{e^{i}_{st}\mid s,t=0,1,\ldots,n\}$ is the edge set of original tweet $t_i$, representing the retweet/comment relationships between nodes;
$X=\{x_1,x_2,\ldots,x_m\}$ denotes the feature matrix of the original tweet set $T$, where $x_i\in\mathbb{R}^{k}$ is the feature vector of node $t_i$ and $k$ is the dimension of $x_i$;
$A$ denotes the adjacency matrix of graph $G$, which records the adjacency relationships between nodes, i.e., whether any two nodes in the graph are connected by an edge;
suppose there is an edge $e^{c}_{st}$ between retweet/comment nodes $r_{cs}$ and $r_{ct}$; the entries of the adjacency matrix $A$ are then given by:
$$A_{st}=\begin{cases}1, & e^{c}_{st}\in E_c\\ 0, & \text{otherwise}\end{cases} \tag{1}$$
where $E_c$ is the edge set of original tweet $t_c$;
the rumor detection task is treated as a binary classification problem in which each original tweet $t_i$ has a label $y_i\in\{F,T\}$; the goal of rumor detection is then to learn a classifier $f$:
$$f: T \rightarrow Y \tag{2}$$
where $T$ and $Y$ are the original tweet set and the label set, respectively.
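The graph modeling above can be illustrated with a minimal sketch that builds the 0/1 adjacency matrix of one propagation graph from an edge list. The function name `build_adjacency`, the convention that node 0 is the original tweet, and the toy edge list are assumptions introduced for illustration only; the symmetric fill reflects the description's statement that the graph is treated as undirected.

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Return a symmetric 0/1 adjacency matrix for one propagation graph.

    Assumption: node 0 is the original tweet and nodes 1..n are its
    retweets/comments; `edges` is a list of (s, t) index pairs.
    """
    A = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for s, t in edges:
        A[s, t] = 1
        A[t, s] = 1  # undirected graph: edge direction is ignored
    return A

# Hypothetical toy cascade: tweet 0 has retweets 1 and 2,
# and node 3 is a comment on retweet 1.
toy_edges = [(0, 1), (0, 2), (1, 3)]
A = build_adjacency(4, toy_edges)
```

Each edge contributes two symmetric entries, so the matrix can be passed unchanged to any graph convolution that expects an undirected adjacency matrix.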
Further, step 2 specifically comprises:
step 2.1: extracting structural features: the original tweets, retweets and comments in Net-CR-Dataset are modeled as nodes and the retweet and comment relationships as edges, so that the propagation paths of tweets in the social network are converted into graph-structured data; an improved GCN (Graph Convolutional Network) then aggregates the information along the propagation paths of the tweets to generate their high-level structural feature vectors;
step 2.2: extracting semantic features: a mapping table is constructed to convert the variant characters of Cantonese into the corresponding Mandarin characters and to split rare characters, and the vocabulary of the BERT Chinese pre-trained model is extended; the collected Cantonese corpus is used to further pre-train the BERT-Base-Chinese model so that it learns more characteristics of Cantonese, and Net-CR-Dataset is used to fine-tune the model, yielding the BERT Cantonese pre-trained model from which the semantic feature vectors of tweets are extracted;
step 2.3: the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result.
Further, extracting the structural features in step 2.1 specifically comprises:
step 2.1.1: a multi-head attention mechanism is used to mine the latent structural correlations between vertices, including nodes that are not directly connected but are reachable over multiple hops; the specific process is as follows:
the node features $U=\{u_1,u_2,\ldots,u_N\}$ are generated using the Cantonese pre-trained word vectors provided by fastText, where $N$ is the total number of nodes;
an attention adjacency matrix $A$ is constructed, converting the propagation tree of an original tweet into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are considered together; the attention adjacency matrix of the $m$-th head, $A^{(m)}$, is calculated as follows:
$$A^{(m)}=\mathrm{softmax}\left(\frac{Q W_m^{Q}\,(K W_m^{K})^{\top}}{\sqrt{d}}\right) \tag{3}$$
where $Q$ and $K$ are both equal to the node features, i.e., the extracted node features $U$; $d$ is the dimension of the feature vector; $W_m^{Q}$ and $W_m^{K}$ are the projection matrices of $Q$ and $K$, respectively;
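As a rough illustration of step 2.1.1, the sketch below computes one attention adjacency matrix per head, assuming formula (3) takes the usual scaled dot-product form with a row-wise softmax. The function names, the random toy features and the 0.1 weight scale are hypothetical and are not taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_adjacency(U, W_q, W_k):
    """Compute M attention adjacency matrices from node features.

    U: (N, d) node features (Q and K are both set to U).
    W_q, W_k: lists of M projection matrices, each of shape (d, d).
    Returns an array of shape (M, N, N); each row sums to 1.
    """
    d = U.shape[1]
    heads = []
    for Wq, Wk in zip(W_q, W_k):
        scores = (U @ Wq) @ (U @ Wk).T / np.sqrt(d)
        heads.append(softmax(scores, axis=-1))
    return np.stack(heads)

# Toy example with random stand-ins for the fastText node features.
rng = np.random.default_rng(0)
N, d, M = 5, 8, 2
U = rng.normal(size=(N, d))
W_q = [rng.normal(scale=0.1, size=(d, d)) for _ in range(M)]
W_k = [rng.normal(scale=0.1, size=(d, d)) for _ in range(M)]
A_att = attention_adjacency(U, W_q, W_k)
```

Because every pair of nodes receives a weight, each head behaves like a fully connected weighted graph, which is how nodes several hops apart in the propagation tree can still exchange information.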
step 2.1.2: densely connected layers are used to capture both local and distant node features, which overcomes the inability of a shallow GCN to learn information from deeply associated nodes and yields a better node representation;
each densely connected layer contains $L$ sub-layers; for node $i$, the output of the $l$-th sub-layer is:
$$h_i^{(l)}=\sigma\Big(\sum_{j=1}^{N}A_{ij}^{(m)}W_m^{(l)}g_j^{(l)}+b_m^{(l)}\Big) \tag{4}$$
where $\sigma$ is the ReLU function; the weight matrix $W_m^{(l)}$ and the bias $b_m^{(l)}$ depend on $A^{(m)}$, the attention adjacency matrix of the $m$-th head; $A_{ij}^{(m)}$, an element of $A^{(m)}$, relates node $i$ and node $j$; $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating the initial feature $h^{(0)}$ with the node features $h^{(1)},\ldots,h^{(l-1)}$ produced by sub-layers $\{1,2,\ldots,l-1\}$:
$$g_j^{(l)}=[h_j^{(0)};h_j^{(1)};\ldots;h_j^{(l-1)}] \tag{5}$$
step 2.1.3: a linear combination layer is introduced to integrate the representations from the different densely connected layers; the output $S$ of the linear combination layer is defined as:
$$S=W_{comb}h_{out}+b_{comb} \tag{6}$$
where $h_{out}=[h^{(1)};\ldots;h^{(M)}]$, $h^{(M)}$ is the feature vector output by the $M$-th densely connected layer, $W_{comb}$ is a weight matrix, and $b_{comb}$ is a bias vector.
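Steps 2.1.2 and 2.1.3 can be sketched together as follows. The sketch assumes each sub-layer applies a ReLU to the attention-weighted aggregation of the concatenated earlier features, and that a densely connected layer returns the concatenation of its sub-layer outputs; that return convention, the layer sizes and all variable names are assumptions made for illustration. Features are stored as rows, so the linear combination appears as `h_out @ W_comb`, the row-vector form of formula (6).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_layer(A_m, h0, weights, biases):
    """One densely connected layer with L = len(weights) sub-layers.

    A_m: (N, N) attention adjacency matrix of one head.
    h0: (N, d0) initial node features.
    weights[l]: (d0 + l * d_sub, d_sub); biases[l]: (d_sub,).
    Returns the concatenated sub-layer outputs, shape (N, L * d_sub).
    """
    feats = [h0]
    for W, b in zip(weights, biases):
        g = np.concatenate(feats, axis=1)  # concatenate h^(0) .. h^(l-1)
        h = relu(A_m @ g @ W + b)          # aggregate neighbors, then ReLU
        feats.append(h)
    return np.concatenate(feats[1:], axis=1)

rng = np.random.default_rng(0)
N, d0, d_sub, L, M = 4, 6, 3, 2, 2
h0 = rng.normal(size=(N, d0))
A_heads = rng.random(size=(M, N, N))
A_heads /= A_heads.sum(axis=-1, keepdims=True)  # row-normalized toy attention

outs = []
for m in range(M):
    weights = [rng.normal(scale=0.1, size=(d0 + l * d_sub, d_sub)) for l in range(L)]
    biases = [np.zeros(d_sub) for _ in range(L)]
    outs.append(dense_layer(A_heads[m], h0, weights, biases))

# Linear combination layer over the M densely connected layers.
h_out = np.concatenate(outs, axis=1)             # (N, M * L * d_sub)
W_comb = rng.normal(scale=0.1, size=(h_out.shape[1], d0))
b_comb = np.zeros(d0)
S = h_out @ W_comb + b_comb                      # structural features, (N, d0)
```

Feeding every sub-layer the concatenation of all earlier outputs is what lets a deeper stack keep local information while also reaching distant nodes.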
Further, in step 2.2, extending the vocabulary of the BERT Chinese pre-trained model comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, the English words commonly used in Cantonese are added to the vocabulary and their weights are randomly initialized, which prevents English words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
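The vocabulary extension can be sketched independently of any particular BERT implementation: new words are appended to the vocabulary and receive randomly initialized embedding rows. The helper `extend_vocab`, the toy vocabulary and the initialization scale of 0.02 are assumptions for illustration; in the Hugging Face Transformers library the analogous operations are `tokenizer.add_tokens` followed by `model.resize_token_embeddings`.

```python
import numpy as np

def extend_vocab(vocab, embeddings, new_words, rng, scale=0.02):
    """Append unseen words to a vocabulary with random embedding rows.

    vocab: dict mapping token -> row index (assumed contiguous 0..V-1).
    embeddings: (V, d) embedding matrix.
    Returns the extended vocab and the extended embedding matrix.
    """
    vocab = dict(vocab)
    # Deduplicate while keeping order, and skip words already present.
    added = [w for w in dict.fromkeys(new_words) if w not in vocab]
    for w in added:
        vocab[w] = len(vocab)
    new_rows = rng.normal(scale=scale, size=(len(added), embeddings.shape[1]))
    return vocab, np.vstack([embeddings, new_rows])

# Hypothetical toy vocabulary and English words used in Cantonese text.
rng = np.random.default_rng(0)
vocab = {"[UNK]": 0, "我": 1, "work": 2}
embeddings = np.zeros((3, 4))
new_vocab, new_embeddings = extend_vocab(
    vocab, embeddings, ["work", "call", "book", "call"], rng
)
```

Keeping each English word as a single vocabulary entry means the tokenizer no longer breaks it into sub-word pieces, so its embedding can be learned directly during further pre-training.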
Furthermore, in step 2.2, fine-tuning the BERT Chinese pre-trained model with Net-CR-Dataset comprises: the original tweets and retweet/comment data $V=\{V_1,V_2,\ldots,V_m\}$ are tokenized to obtain $V'$. Then $V'$ is input into the further pre-trained and fine-tuned BERT model, and sentence features are extracted with the Transformer to obtain the sentence vectors $W=\{w_1,w_2,\ldots,w_m\}$, as shown below:
$$V'=\mathrm{Tokenize}(V) \tag{7}$$
$$W=\mathrm{BERT}(V') \tag{8}$$
where $\mathrm{Tokenize}$ is the word segmentation operation in the BERT model.
Further, step 2.3 specifically comprises:
step 2.3.1: the SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$:
$$F=S\oplus W \tag{9}$$
step 2.3.2: the feature vector $F$ of the tweet is passed through the $\mathrm{Softmax}$ function to obtain the final classification result:
$$P_d=\mathrm{Softmax}(F) \tag{10}$$
where $P_d$ is the predicted probability that tweet sample $d$ is a rumor;
step 2.3.3: the optimization objective of the model is to minimize the cross-entropy loss function:
$$Loss=-\sum_{d\in D}\big[y_d\log P_d+(1-y_d)\log(1-P_d)\big] \tag{11}$$
where $D$ is the set of sample data, $y_d$ is the true label of tweet sample $d$, and $P_d$ is the probability that tweet sample $d$ is predicted as the positive class.
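The fusion and classification steps can be sketched as below. The text applies Softmax to $F$ directly; the sketch inserts a linear projection to two classes (`W_out`, `b_out`) before the softmax, which is an assumption added so that the output is a two-class distribution. The loss is written as a sum of binary cross-entropy terms over the samples, which is also an assumed form of formula (11). All names and toy values are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_proba(S, W, W_out, b_out):
    """Concatenate structural and semantic features and classify.

    S: (B, ds) structural feature vectors; W: (B, dw) semantic vectors.
    W_out, b_out: assumed linear projection to 2 classes.
    Returns the probability of the positive (rumor) class, shape (B,).
    """
    F = np.concatenate([S, W], axis=-1)   # feature fusion by concatenation
    probs = softmax(F @ W_out + b_out)    # class distribution per sample
    return probs[:, 1]

def cross_entropy(y, p, eps=1e-12):
    """Summed binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 4))               # toy structural features
W = rng.normal(size=(3, 5))               # toy semantic features
W_out = rng.normal(scale=0.1, size=(9, 2))
b_out = np.zeros(2)
P = predict_proba(S, W, W_out, b_out)
y = np.array([1, 0, 1])
loss = cross_entropy(y, P)
```

Minimizing this loss with any gradient-based optimizer pushes the predicted rumor probability toward 1 for rumor samples and toward 0 for non-rumor samples.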
The beneficial effects of the invention are as follows:
1. To support graph-structure-based rumor detection research, the invention constructs a Cantonese rumor data set with graph structure, comprising 2,419 original tweets, 112,539 retweets and 92,260 comments, which together form 207,218 nodes and 202,437 edges.
2. The invention further pre-trains and fine-tunes the BERT Chinese pre-trained model on a large collected Cantonese corpus, so that the model learns the semantic information of the vocabulary unique to Cantonese. The preprocessing flow before training is also modified and the vocabulary of the BERT Chinese pre-trained model is extended, so that the variant characters and rare characters of Cantonese can be processed and the model is better suited to the mixed Chinese-English grammatical structure unique to Cantonese, capturing richer semantic information.
3. The invention designs a Cantonese rumor detection model, SA-GCN, based on a deep semantic perception graph convolutional network. The model extracts the structural features of tweets with an improved graph convolutional neural network, captures the semantic features of tweets with a BERT pre-trained model that has been further pre-trained and fine-tuned on Cantonese data, and finally fuses the two types of features.
Drawings
FIG. 1 is a diagram of the SA-GCN model structure of the present invention.
FIG. 2 is a comparison graph of feature ablation.
FIG. 3 is a ROC curve.
FIG. 4 is a diagram illustrating early detection capability of a model.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
The invention provides a social network Cantonese rumor detection method based on a deep semantic perception graph convolutional network. First, several groups of health-related Cantonese rumor keywords are constructed, a Web crawler is built to acquire the relevant tweets, users, retweets and comments, and the data set Net-CR-Dataset is constructed after data annotation is completed. Second, the invention designs a deep semantic perception graph convolutional neural network model, SA-GCN. The BERT Chinese pre-trained model is optimized according to the unique linguistic features of Cantonese, and is further pre-trained and fine-tuned on a large collected Cantonese corpus in order to extract the semantic feature vectors of tweets. An improved GCN is applied to extract the structural features of tweets and generate structural feature vectors. Finally, the SA-GCN model fuses the structural feature vector and the semantic feature vector to obtain the final classification result.
The method comprises the following specific processes:
Step 1: Construct multiple groups of keywords for health-related Cantonese rumors, collect the related tweets, users, retweets, and comments, and build the Cantonese rumor data set Net-CR-Dataset with graph structure information; that is, model the data set as a graph $G = \langle V, E \rangle$ according to the entities in the social network and the relationships between them.
The social network contains entities such as source tweets, retweets, and comments, as well as behaviors such as posting, retweeting, and commenting. The invention models these entities and the relationships between them as a graph $G = \langle V, E \rangle$.
$T = \{t_1, t_2, \ldots, t_m\}$ denotes the set of source tweets, where $m$ is the number of source tweets. $R_i = \{r_{i1}, r_{i2}, \ldots, r_{in}\}$ denotes the set of retweets and comments of source tweet $t_i$, where $r_{ij}$ is a retweet or comment of $t_i$ and $n$ is the number of retweets and comments.
$V = \{V_1, V_2, \ldots, V_m\}$, where $V_i = \{t_i, R_i\}$ is the node set of source tweet $t_i$, containing the node $t_i$ and the retweet and comment node set $R_i$.
$E = \{E_1, E_2, \ldots, E_m\}$, where $E_i$ is the edge set derived from source tweet $t_i$ and represents the retweet/comment relationships between nodes. For example, if $r_{ik}$ is a retweet or comment of $r_{ij}$, there is an edge between $r_{ij}$ and $r_{ik}$ in $E_i$. The graph considered by the invention is undirected, so the direction of edges is ignored.
$X = \{x_1, x_2, \ldots, x_m\}$ is the feature matrix of the source tweet set $T$, with $X \in \mathbb{R}^{m \times k}$, where $k$ is the dimension of feature $x_i$ and $x_i$ is the feature vector of node $t_i$.
$A$ denotes the adjacency matrix of graph $G$. The adjacency matrix represents the adjacency relationships between nodes and indicates whether any two nodes in the graph are connected by an edge. Suppose an edge $e_{jk}$ exists between nodes $v_j$ and $v_k$; then the adjacency matrix $A$ takes the form shown in formula (1):
$A_{jk} = \begin{cases} 1, & \text{if } e_{jk} \in E_i \\ 0, & \text{otherwise} \end{cases}$ (1)
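As a minimal sketch of formula (1), assuming nodes are indexed by integers and retweet/comment relationships are given as index pairs, the undirected adjacency matrix could be built as below. The function name `build_adjacency` and the toy propagation tree are illustrative assumptions, not part of the original text.

```python
def build_adjacency(num_nodes, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected graph."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for j, k in edges:
        adj[j][k] = 1
        adj[k][j] = 1  # undirected graph: edge direction is ignored
    return adj


# Hypothetical propagation tree: node 0 is the source tweet,
# nodes 1 and 2 respond to it, and node 3 responds to node 1.
edges = [(0, 1), (0, 2), (1, 3)]
A = build_adjacency(4, edges)
```

Setting both `adj[j][k]` and `adj[k][j]` reflects the statement that the graph is undirected.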
The invention treats rumor detection as a binary classification problem: each source tweet $t_i$ has a corresponding label $y_i \in \{F, T\}$. The goal of rumor detection is therefore to learn a classifier $f$, as shown in formula (2):
$f: T \rightarrow Y$ (2)
where $T$ and $Y$ are the source tweet set and the label set, respectively. The invention predicts the label of a tweet from its structural and semantic features.
Step 2: Fuse a BERT model, a GCN, and an attention mechanism to obtain the social network Cantonese rumor detection model SA-GCN: extract the structural feature vectors of tweets with an improved GCN; optimize the BERT Chinese pre-trained model for the distinctive linguistic features of Cantonese and further pre-train and fine-tune it on a large collected Cantonese corpus, so as to extract the semantic feature vectors of tweets; and finally fuse the two kinds of features to obtain the final classification result.
The invention provides a new social network Cantonese rumor detection model, SA-GCN, which integrates a BERT model, a GCN, and an attention mechanism to detect Cantonese rumors effectively. The structure of the SA-GCN model is shown in FIG. 1.
Step 2.1: structural feature extraction
A GCN is a multi-layer neural network that operates directly on a graph and updates the representation of each node from the attributes of its neighborhood. The work of Kipf et al. demonstrated the effectiveness of graph convolutional networks for node classification. An $L$-layer GCN can capture information from neighbors up to $L$ hops away, so a shallow GCN cannot aggregate the features of distant nodes. However, studies have shown that deeper GCNs do not perform as well as 2-layer networks. To address this problem, Guo et al. introduced dense connections into the GCN and proposed an attention-guided graph convolutional network for relation extraction. The SA-GCN model proposed by the invention is inspired by this work.
Because the distribution of retweets and comments along a tweet's propagation path is closely related to whether the source tweet is judged to be a rumor, the invention takes the source tweets, retweets, and comments in Net-CR-Dataset as nodes and the retweet and comment relationships as edges, converts the propagation paths of tweets in the social network into graph-structured data, and uses the improved GCN to aggregate information along the propagation paths, thereby generating high-level structural feature representations of tweets.
(1) Multi-head attention mechanism
The improved GCN consists of $M$ blocks, each containing three modules: a multi-head attention mechanism, dense connections, and a linear combination. In this part, a multi-head attention mechanism is used to mine latent structural dependencies between vertices, especially between nodes that are not directly connected and are separated by multiple hops. Specifically, node features $U = \{u_1, u_2, \ldots, u_N\}$ are first generated with the Cantonese pre-trained word vectors provided by fastText, where $N$ is the number of all nodes. Next, by constructing an attention adjacency matrix $A$, the propagation tree of a source tweet is converted into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account. The $m$-th attention adjacency matrix, corresponding to the $m$-th head, is computed as shown in formula (3):
$A^{(m)} = \mathrm{softmax}\left(\dfrac{Q W_m^{Q} \left(K W_m^{K}\right)^{\top}}{\sqrt{d}}\right)$ (3)
where $Q$ and $K$ are both equal to the node features, that is, the extracted node features $U$; $d$ is the dimension of the feature vectors; and $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively. $A^{(m)}$ is used in the graph convolution described below.
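The attention adjacency matrix of formula (3) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the scaled dot-product form follows the attention-guided GCN of Guo et al. that the text cites, the helper names and the toy matrices `U`, `Wq`, and `Wk` are hypothetical, and a real model would learn the transformation matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_adjacency(U, Wq, Wk):
    """Scaled dot-product scores between all node pairs, normalized per row."""
    d = len(U[0])  # dimension of the node feature vectors
    q = matmul(U, Wq)
    k = matmul(U, Wk)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    scaled = [[v / math.sqrt(d) for v in row] for row in scores]
    return softmax_rows(scaled)


# Toy input: three nodes with two-dimensional features (illustrative values).
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wq = [[0.5, 0.1], [0.2, 0.3]]
Wk = [[0.4, 0.0], [0.1, 0.6]]
A_m = attention_adjacency(U, Wq, Wk)
```

Every row of `A_m` sums to 1 and every entry is positive, which matches the description of a fully connected graph with weighted edges.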
(2) Densely connected layer
In this part, densely connected layers are used to capture both local and distant node features, which addresses the inability of a shallow GCN to learn information from deeply associated nodes and yields better node representations. Each densely connected layer contains $L$ sub-layers. For node $i$, the output of the $l$-th sub-layer is shown in formula (4):
$h_i^{(l)} = \sigma\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right)$ (4)
where $\sigma$ is the ReLU function; the weight matrix $W_m^{(l)}$ and bias $b_m^{(l)}$ depend on $A^{(m)}$; $A_{ij}^{(m)}$ is the element of $A^{(m)}$ relating node $i$ and node $j$; and $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, as shown in formula (5):
$g_j^{(l)} = [h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}]$ (5)
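The dense connection of formulas (4) and (5) can be sketched as below. This is a hedged illustration only: the function `dense_block`, the weight shapes, and the toy values are assumptions, and each sub-layer simply takes the concatenation of the initial features and all earlier sub-layer outputs as its input.

```python
def relu(x):
    return max(0.0, x)


def dense_block(A, h0, weights, biases):
    """Run len(weights) densely connected sub-layers.

    A: N x N attention adjacency matrix; h0: N x d0 initial node features.
    weights[l]: (input dim of sub-layer l) x d_out matrix; biases[l]: d_out vector.
    Returns the list of sub-layer outputs h^(1), ..., h^(L).
    """
    n = len(h0)
    history = [h0]
    for W, b in zip(weights, biases):
        # g_j concatenates h_j^(0), ..., h_j^(l-1) for every node j.
        g = [sum((h[j] for h in history), []) for j in range(n)]
        # Project each concatenated input with the sub-layer weight matrix.
        proj = [[sum(x * W[r][c] for r, x in enumerate(gj)) for c in range(len(b))]
                for gj in g]
        # Aggregate over all nodes j with the attention weights A[i][j].
        h_new = [[relu(sum(A[i][j] * proj[j][c] for j in range(n)) + b[c])
                  for c in range(len(b))] for i in range(n)]
        history.append(h_new)
    return history[1:]


# Toy example: two nodes, two sub-layers with one output unit each.
A = [[0.5, 0.5], [0.5, 0.5]]
h0 = [[1.0, 2.0], [3.0, 4.0]]
weights = [[[1.0], [1.0]], [[1.0], [0.0], [-1.0]]]
biases = [[0.0], [0.0]]
outs = dense_block(A, h0, weights, biases)
```

The second weight matrix has three rows because its input is the two initial features plus the one feature produced by the first sub-layer.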
(3) Linear combination
This part introduces a linear combination layer to integrate the representations from the different densely connected layers. The output of the linear combination layer is defined in formula (6):
$S = W_{comb} h_{out} + b_{comb}$ (6)
where $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$, $h^{(M)}$ denotes the feature vector output by the $M$-th densely connected layer, $W_{comb}$ is a weight matrix, and $b_{comb}$ is a bias vector.
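Formula (6) amounts to concatenating the outputs of the densely connected layers and applying one affine map. The following sketch assumes each output is a flat feature vector; the name `linear_combination` and all numeric values are illustrative.

```python
def linear_combination(block_outputs, W_comb, b_comb):
    """Concatenate the block outputs into h_out, then compute W_comb h_out + b_comb."""
    h_out = [x for h in block_outputs for x in h]
    return [sum(w * x for w, x in zip(row, h_out)) + b
            for row, b in zip(W_comb, b_comb)]


# Toy example with M = 2 blocks, each producing a two-dimensional vector.
block_outputs = [[1.0, 2.0], [3.0, 4.0]]
W_comb = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
b_comb = [0.5, -0.5]
S = linear_combination(block_outputs, W_comb, b_comb)
```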
Step 2.2: semantic feature extraction
Because the semantic information of tweets plays an important role in rumor detection, and the context-dependent word embeddings generated by the BERT model capture information along many dimensions and yield more accurate and effective feature representations, the invention uses the BERT Cantonese pre-trained model as a Chinese word embedding extractor.
The invention builds a mapping table from data provided by the Hong Kong University of Science and Technology, converts variant characters in Cantonese into the corresponding Mandarin characters, and splits rare characters, which alleviates the problem that the Chinese pre-trained model may fail to learn the semantics of variant and rare characters. In addition, so that the model can better handle the mixed Chinese-English grammar of Cantonese, the vocabulary of the BERT Chinese pre-trained model is extended with the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors. Specifically, English words commonly used in Cantonese are added to the vocabulary with randomly initialized weights, which prevents these English words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
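The preprocessing and vocabulary extension described above might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the two small tables stand in for the mapping table that the text does not reproduce, the word list is hypothetical, and the standard deviation of 0.02 for random initialization is a common choice rather than a value given in the source.

```python
import random

# Illustrative entries only; the actual mapping table is not given in the text.
VARIANT_TO_STANDARD = {"綫": "線", "裏": "裡"}
# Illustrative split of a rare Cantonese character into its components.
RARE_CHAR_PARTS = {"𨋢": "車立"}


def normalize(text):
    """Map variant characters to standard ones and split rare characters."""
    out = []
    for ch in text:
        ch = VARIANT_TO_STANDARD.get(ch, ch)
        out.append(RARE_CHAR_PARTS.get(ch, ch))
    return "".join(out)


def extend_vocab(vocab, embeddings, new_words, dim, seed=0):
    """Append unseen words to the vocabulary with randomly initialized vectors."""
    rng = random.Random(seed)
    for word in new_words:
        if word not in vocab:
            vocab[word] = len(vocab)
            embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


# Toy vocabulary and embedding table (hypothetical values).
vocab = {"[UNK]": 0, "我": 1}
embeddings = [[0.0] * 4, [0.1] * 4]
vocab, embeddings = extend_vocab(vocab, embeddings, ["check", "book", "我"], dim=4)
```

Adding whole English words as single vocabulary entries is what keeps a subword tokenizer from breaking them into roots and affixes.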
The invention further pre-trains the BERT-Base-Chinese model on a Cantonese corpus collected widely from the Twitter platform, so that the model learns more characteristics of Cantonese. On this basis, Net-CR-Dataset is used to fine-tune the BERT Chinese pre-trained model so that it better fits the technical problem addressed by the invention. Specifically, the source tweets and their retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$ are labeled, and the labeled data are then input into the further pre-trained and fine-tuned BERT model, where the Transformer extracts sentence features to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as shown in formulas (7) and (8):
[formula image] (7)
[formula image] (8)
where $\mathit{Tokenize}$ is the word segmentation operation in the BERT model.
Step 2.3: Feature fusion
The SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in formula (9):
$F = S \oplus W$ (9)
The feature vector $F$ of the tweet is passed through the $\mathrm{Softmax}$ function to obtain the final classification result, as shown in formula (10):
$P_d = \mathrm{Softmax}(F)$ (10)
where $P_d$ is the predicted probability that sample $d$ is a rumor.
The optimization goal of the model is to minimize the cross-entropy loss function, as shown in formula (11):
$\mathcal{L} = -\sum_{d \in D} \left[ y_d \log P_d + (1 - y_d) \log (1 - P_d) \right]$ (11)
where $D$ denotes the set of sample data, $y_d$ is the true label of sample $d$, and $P_d$ is the probability that sample $d$ is predicted as the positive class.
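Formulas (9) to (11) can be illustrated with the sketch below. It makes assumptions that the text does not state: a linear layer (`W_cls`, `b_cls`) maps the fused vector to two class scores before the softmax, index 1 is taken as the rumor class, and the loss is the standard binary cross-entropy summed over samples. All names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(S, W_sem, W_cls, b_cls):
    """Concatenate structural and semantic features, then score two classes."""
    F = S + W_sem  # list concatenation plays the role of formula (9)
    logits = [sum(w * x for w, x in zip(row, F)) + b
              for row, b in zip(W_cls, b_cls)]
    return softmax(logits)


def cross_entropy(y_true, p_pos, eps=1e-12):
    """Binary cross-entropy summed over samples; eps guards against log(0)."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for y, p in zip(y_true, p_pos)
    )


# Toy example: two-dimensional structural and semantic vectors.
probs = classify([1.0, 0.0], [0.0, 1.0],
                 [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]], [0.0, 0.0])
loss = cross_entropy([1, 0], [0.5, 0.5])
```

Here `probs[1]` plays the role of the rumor probability of one sample, and `loss` is the loss over two samples that are each predicted with probability 0.5.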
The experimental process comprises the following steps:
(1) data set
Current research lacks a public, authoritative benchmark data set for Cantonese rumors, and the Cantonese rumor data set CR-Dataset constructed in the earlier literature [KE L, CHEN X, LU Z, et al. A Novel Approach for Cantonese Rumor Detection based on Deep Neural Network: 33rd IEEE International Conference on Systems, Man, and Cybernetics [C], Toronto, Canada, 2020] lacks rich graph structure information and cannot provide the propagation structure features of rumors to a detection model. The invention therefore constructs a completely new Cantonese rumor data set.
A Web crawler developed on the Scrapy framework collects Cantonese tweets on the Twitter platform together with their multi-level retweets and comments, and data labeling is completed according to strict standards to construct the Net-CR-Dataset data set. Because rumors generally concentrate on sensitive topics such as health, the invention takes health-related Cantonese rumors on Twitter as the main research object; a large amount of related data can be readily collected for such topics, which supports the research and gives it practical significance.
The invention takes content published by authoritative official media as the factual basis and labels the collected Cantonese tweets strictly according to the definition of a rumor used in the invention (information that arises and spreads among the public and whose truth value is unverified or deliberately false; it tends to arise in emergencies, easily causes public panic, disrupts social order, damages government credibility, and may even endanger national security). Tweets that lack a factual basis and whose authenticity cannot be judged are filtered out. The invention compares how a source tweet and authoritative media report the objective facts of the same event: if they are consistent, the tweet is labeled 0; otherwise, it is labeled 1.
Finally, a Cantonese rumor data set Net-CR-Dataset containing rich graph structure information is constructed. The data set contains 2,419 source tweets, 112,539 retweets, and 92,260 comments, for a total of 207,218 nodes and 202,437 edges. Table 1 shows the details of the data set.
[Table 1: details of the Net-CR-Dataset data set]
(2) Experimental setup and data set
The experimental environment is an Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz and a GPU server with a Tesla V100 32G. All experiments use the Net-CR-Dataset constructed by the invention; information about the data set is given in Table 1. In the experiments, 80% of the rumor data set is used as the training set, 10% as the validation set, and 10% as the test set. Each experiment is run 10 times and the average is taken as the final result. The training, validation, and test sets for each of the 10 runs are split randomly.
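The evaluation protocol above (a random 80/10/10 split, repeated 10 times and averaged) can be sketched as follows. The function names, the use of the run index as the random seed, and the placeholder `evaluate` callback are assumptions made for illustration; the callback stands in for training and scoring a model.

```python
import random


def split_dataset(items, seed):
    """Shuffle the items and split them 80% / 10% / 10%."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def run_repeated(items, evaluate, runs=10):
    """Average the metric returned by evaluate over independently split runs."""
    scores = []
    for seed in range(runs):
        train, val, test = split_dataset(items, seed)
        scores.append(evaluate(train, val, test))
    return sum(scores) / len(scores)


# Toy example: 100 hypothetical source-tweet identifiers.
tweet_ids = list(range(100))
train, val, test = split_dataset(tweet_ids, seed=0)
mean_test_size = run_repeated(tweet_ids, lambda tr, va, te: len(te))
```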
(3) Experiment one: feature ablation
To demonstrate the effectiveness of each part of the SA-GCN model, the invention compares the detection performance of SA-GCN and its variants on Net-CR-Dataset. The models are as follows:
1) SA-GCN \ Str: the SA-GCN model without structural features, using only semantic features;
2) SA-GCN \ Sem: the SA-GCN model without semantic features, using only structural features;
3) SA-GCN \ Att: the SA-GCN model without the attention mechanism;
4) SA-GCN \ BERT: the SA-GCN model without BERT-generated word embeddings. Instead, a word embedding matrix is generated from the Chinese and Cantonese pre-trained word vectors provided by fastText and fed to the Embedding layer in front of a Bi-LSTM network;
5) SA-GCN: the complete model proposed by the invention.
The results are shown in FIG. 2. The SA-GCN model proposed by the invention performs best on all metrics. Comparing SA-GCN with SA-GCN \ Sem shows that the semantic features of tweets play an important role in rumor detection. Meanwhile, because users in a social network interact with one another, nodes are closely related and their features are influenced by their neighborhoods, so SA-GCN, which takes the structural features of nodes into account, outperforms SA-GCN \ Str. Moreover, comparing SA-GCN \ Str with SA-GCN \ Sem shows that, in this task, semantic features contribute more to detection performance than structural features. A probable reason is that the numbers of retweets and comments in Net-CR-Dataset are unevenly distributed, so the model learns insufficient structural features for tweets with little propagation. In addition, SA-GCN \ BERT obtains an F1 score of 0.8692, about 12% lower than SA-GCN. This shows that, compared with common pre-trained word vectors such as fastText, the BERT Chinese pre-trained model adopted by the invention, together with the further pre-training and fine-tuning on the Cantonese corpus, enables the model to learn the characteristics of Cantonese data better and thus detect rumors effectively. Finally, comparing SA-GCN \ Att with SA-GCN shows that the attention mechanism introduced into the model automatically identifies the words and features that matter most to the detection task, which also improves detection performance.
(4) Experiment two: detecting model effects
The invention compares the proposed SA-GCN model with other methods commonly used in rumor detection. The models are as follows:
1) SVM-RBF: an SVM model with an RBF kernel that uses handcrafted features based on the overall statistics of posts;
2) DTC: a rumor detection method that uses a decision tree classifier over various handcrafted features to assess the credibility of information;
3) RFC: a random forest classifier using user features, linguistic features, and structural features;
4) TextCNN: a convolutional neural network that captures text semantics for the classification task;
5) Bi-GCN: a rumor detection model for graph-structured data that captures the bidirectional propagation structure features of rumors;
6) GLAN: a global-local attention network that fuses local semantic features and global structural features;
7) SA-GCN: the detection model proposed by the invention.
[Table 2: detection results of the compared models]
In the experiments, the inputs to the SVM-RBF, DTC, and RFC models are vectors generated by the TF-IDF algorithm, and the input to the TextCNN model is word embeddings generated from the fastText Chinese and Cantonese pre-trained word vectors. As Table 2 and FIG. 3 show, the SA-GCN model proposed by the invention achieves an F1 score of 0.9845 and an AUC of 0.9677, the best detection performance. The deep-learning-based detection models (TextCNN, Bi-GCN, GLAN, SA-GCN) generally outperform the models based on traditional machine learning (SVM-RBF, DTC, RFC), because deep learning models can learn higher-order representations of rumors and thereby capture effective features. In addition, TextCNN performs worse than GLAN and SA-GCN, which indirectly confirms that adding propagation structure features improves detection and underlines the value of combining structural and semantic features as the invention does. Meanwhile, because the Transformer is better than a CNN at semantic feature extraction, SA-GCN, which uses a Transformer as the feature extractor, outperforms TextCNN and GLAN, which use a CNN. Moreover, to build a word embedding extractor better suited to Cantonese and to rumor detection, the invention further pre-trains and optimizes the BERT Chinese pre-trained model on the Cantonese corpus and data set, so that SA-GCN learns more characteristics of Cantonese data and achieves better detection performance.
Comparing Bi-GCN with SA-GCN, SA-GCN improves the F1 score by nearly 9% over Bi-GCN. This is because SA-GCN fuses the semantic features of rumors with the structural features it acquires, and semantic features fundamentally reflect the meaning expressed by a tweet, so the detection performance of the model improves markedly.
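For reference, the two metrics quoted above, the F1 score and the AUC, can be computed as in the sketch below. This is the standard textbook definition rather than code from the invention, the AUC uses the pairwise ranking formulation, and the labels and scores are toy values, not experimental data.

```python
def f1_score(y_true, y_pred):
    """F1 of the positive class (label 1) from hard predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels, hard predictions, and scores (illustrative only).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 1]
scores = [0.9, 0.4, 0.3, 0.6]
f1 = f1_score(y_true, y_pred)
area = auc(y_true, scores)
```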
(5) Experiment three: early detection capability
Early detection capability refers to how well a model detects rumors in the initial period of their propagation, and it is one of the important indicators for evaluating a rumor detection model. The experiment sets different cut-off times and, for each, feeds the model only the data available before that cut-off. Accuracy is used as the metric to compare the early detection capability of different models.
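Selecting the data available before a cut-off time can be sketched as follows, assuming each node carries the number of hours elapsed since the source tweet was posted. The function name, the data layout, and the inclusive comparison are illustrative assumptions.

```python
def truncate_by_deadline(node_times, edges, deadline_hours):
    """Keep only the nodes posted by the deadline and the edges between them.

    node_times: dict mapping node id to hours since the source tweet.
    edges: list of (parent, child) pairs in the propagation tree.
    """
    kept = {n for n, t in node_times.items() if t <= deadline_hours}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


# Toy propagation tree: node 0 is the source tweet (hypothetical timestamps).
node_times = {0: 0.0, 1: 0.5, 2: 2.0, 3: 5.0}
edges = [(0, 1), (0, 2), (1, 3)]
kept, kept_edges = truncate_by_deadline(node_times, edges, deadline_hours=3.0)
```

Repeating this for several deadlines and evaluating the model on each truncated graph yields an accuracy-versus-time curve of the kind shown in FIG. 4.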
The results are shown in FIG. 4. At every cut-off time, the accuracy of the proposed SA-GCN model is higher than that of the other models. It exceeds 0.8944 when the tweet has just started to propagate and reaches 0.9425 within 3 hours of propagation, which shows that the semantic and structural features introduced in SA-GCN are effective not only for long-term rumor detection but also for early detection. In addition, as the cut-off time is extended, the semantic and structural information of a tweet becomes richer while the noise in the data also grows, yet the accuracy curve of SA-GCN fluctuates less than those of the other models, which demonstrates the stability of the proposed model.
In conclusion, the invention provides a method based on a deep semantic perception graph convolutional network for detecting Cantonese rumors in social networks. First, the invention builds a Web crawler and collects related Twitter data based on multiple groups of rumor keywords. The numbers of retweets and comments of the tweets are constrained to ensure that the data carry rich graph structure information. After manual labeling is complete, the data set Net-CR-Dataset is constructed. Second, the invention designs a new model, SA-GCN, for detecting Cantonese rumors. The model learns the structural features of tweets with an improved GCN, captures the semantic features of tweets with a BERT pre-trained model that is further pre-trained and fine-tuned on Cantonese data, and finally concatenates the structural and semantic features. The experimental results show that the proposed SA-GCN model outperforms classical detection methods on the Cantonese rumor detection task and has strong early rumor detection capability.

Claims (7)

1. A Cantonese rumor detection method based on a deep semantic perception graph convolutional network, characterized by comprising the following steps:
Step 1: constructing multiple groups of keywords for health-related Cantonese rumors, collecting the related tweets, users, retweets, and comments, and building the Cantonese rumor data set Net-CR-Dataset with graph structure information, that is, modeling the data set as a graph $G = \langle V, E \rangle$ according to the entities in the social network and the relationships between them;
Step 2: fusing a BERT model, a GCN, and an attention mechanism to obtain the social network Cantonese rumor detection model SA-GCN: extracting the structural feature vectors of tweets with an improved GCN;
optimizing the BERT Chinese pre-trained model for the distinctive linguistic features of Cantonese, and further pre-training and fine-tuning the BERT Chinese pre-trained model on a large collected Cantonese corpus, so as to extract the semantic feature vectors of tweets; and finally fusing the two kinds of features to obtain the final classification result.
2. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 1, wherein modeling the data set as a graph $G = \langle V, E \rangle$ according to the entities in the social network and the relationships between them specifically comprises:
using $T = \{t_1, t_2, \ldots, t_m\}$ to denote the set of source tweets, where $m$ is the number of source tweets; using $R_i = \{r_{i1}, r_{i2}, \ldots, r_{in}\}$ to denote the set of retweets and comments of source tweet $t_i$, where $r_{ij}$ is a retweet or comment of $t_i$ and $n$ is the number of retweets and comments;
$V = \{V_1, V_2, \ldots, V_m\}$, where $V_i = \{t_i, R_i\}$ is the node set of source tweet $t_i$, containing the node $t_i$ and the retweet and comment node set $R_i$;
$E = \{E_1, E_2, \ldots, E_m\}$, where $E_i$ is the edge set derived from source tweet $t_i$, representing the retweet/comment relationships between nodes;
$X = \{x_1, x_2, \ldots, x_m\}$ denotes the feature matrix of the source tweet set $T$, with $X \in \mathbb{R}^{m \times k}$, where $k$ is the dimension of feature $x_i$ and $x_i$ is the feature vector of node $t_i$;
$A$ denotes the adjacency matrix of graph $G$, which represents the adjacency relationships between nodes and indicates whether any two nodes in the graph are connected by an edge;
supposing that an edge $e_{jk}$ exists between retweet and comment nodes $v_j$ and $v_k$, the adjacency matrix $A$ takes the following form:
$A_{jk} = \begin{cases} 1, & \text{if } e_{jk} \in E_c \\ 0, & \text{otherwise} \end{cases}$ (1)
where $E_c$ is the edge set derived from source tweet $t_c$;
treating rumor detection as a binary classification problem in which each source tweet $t_i$ has a corresponding label $y_i \in \{F, T\}$, the goal of rumor detection is to learn a classifier $f$:
$f: T \rightarrow Y$ (2)
where $T$ and $Y$ are the source tweet set and the label set, respectively.
3. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 1, wherein Step 2 specifically comprises:
Step 2.1: structural feature extraction: taking the source tweets, retweets, and comments in Net-CR-Dataset as nodes and the retweet and comment relationships as edges, converting the propagation paths of tweets in the social network into graph-structured data, and aggregating information along the propagation paths with the improved GCN, thereby generating high-level structural feature vectors of tweets;
Step 2.2: semantic feature extraction: constructing a mapping table, converting variant characters in Cantonese into the corresponding Mandarin characters, and splitting rare characters; extending the vocabulary of the BERT Chinese pre-trained model; further pre-training the BERT-Base-Chinese model on the collected Cantonese corpus so that the model learns more characteristics of Cantonese, and fine-tuning the BERT Chinese pre-trained model with Net-CR-Dataset to obtain the BERT Cantonese pre-trained model, which extracts the semantic feature vectors of tweets;
Step 2.3: fusing, by the SA-GCN model, the structural feature vector and the semantic feature vector to obtain the final classification result.
4. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 3, wherein the structural feature extraction of Step 2.1 specifically comprises:
Step 2.1.1: using a multi-head attention mechanism to mine latent structural dependencies between vertices, including between nodes that are not directly connected and are separated by multiple hops; the specific process is as follows:
generating node features $U = \{u_1, u_2, \ldots, u_N\}$ with the Cantonese pre-trained word vectors provided by fastText, where $N$ is the number of all nodes;
constructing an attention adjacency matrix $A$ to convert the propagation tree of a source tweet into a fully connected graph with weighted edges, so that the structural relationships among all tweet nodes are taken into account; the $m$-th attention adjacency matrix, corresponding to the $m$-th head, is computed as follows:
$A^{(m)} = \mathrm{softmax}\left(\dfrac{Q W_m^{Q} \left(K W_m^{K}\right)^{\top}}{\sqrt{d}}\right)$ (3)
where $Q$ and $K$ are both equal to the node features, that is, the extracted node features $U$; $d$ is the dimension of the feature vectors; and $W_m^{Q}$ and $W_m^{K}$ are the transformation matrices of $Q$ and $K$, respectively;
Step 2.1.2: using densely connected layers to capture local and distant node features, which addresses the inability of a shallow GCN to learn information from deeply associated nodes and generates better node representations;
each densely connected layer contains $L$ sub-layers; for node $i$, the output of the $l$-th sub-layer is as follows:
$h_i^{(l)} = \sigma\left(\sum_{j=1}^{N} A_{ij}^{(m)} W_m^{(l)} g_j^{(l)} + b_m^{(l)}\right)$ (4)
where $\sigma$ is the ReLU function; the weight matrix $W_m^{(l)}$ and bias $b_m^{(l)}$ depend on $A^{(m)}$, and $A^{(m)}$ is the $m$-th attention adjacency matrix, corresponding to the $m$-th head; $A_{ij}^{(m)}$ is the element of matrix $A^{(m)}$ relating node $i$ and node $j$; $g_j^{(l)}$ is the input feature of node $j$ at the $l$-th sub-layer, formed by concatenating $h^{(0)}$ with the node features $h^{(1)}, \ldots, h^{(l-1)}$ produced by sub-layers $\{1, 2, \ldots, l-1\}$, computed as follows:
$g_j^{(l)} = [h_j^{(0)}; h_j^{(1)}; \ldots; h_j^{(l-1)}]$ (5)
Step 2.1.3: introducing a linear combination layer to integrate the representations from the different densely connected layers, the output of the linear combination layer being defined as follows:
$S = W_{comb} h_{out} + b_{comb}$ (6)
where $h_{out} = [h^{(1)}; \ldots; h^{(M)}]$, $h^{(M)}$ denotes the feature vector output by the $M$-th densely connected layer, $W_{comb}$ is a weight matrix, and $b_{comb}$ is a bias vector.
5. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 3, wherein in Step 2.2, extending the vocabulary of the BERT Chinese pre-trained model comprises: using the word list provided by the PyCantonese library and the fastText Cantonese pre-trained word vectors, adding English words commonly used in Cantonese to the vocabulary with randomly initialized weights, which prevents the English words from being split into roots and affixes and improves the model's ability to learn the semantics of English words in Cantonese.
6. The Cantonese rumor detection method based on a deep semantic perception graph convolutional network according to claim 3, wherein in Step 2.2, fine-tuning the BERT Chinese pre-trained model with Net-CR-Dataset comprises: labeling the source tweets and their retweet/comment data $V = \{V_1, V_2, \ldots, V_m\}$; then inputting the labeled data into the further pre-trained and fine-tuned BERT model, and extracting sentence features with the Transformer to obtain the sentence vectors $W = \{w_1, w_2, \ldots, w_m\}$, as shown in the following formulas:
[formula image] (7)
[formula image] (8)
where $\mathit{Tokenize}$ is the word segmentation operation in the BERT model.
7. The Cantonese rumor detection method based on a deep semantic-aware graph convolutional network according to claim 3, wherein step 2.3 specifically comprises:
step 2.3.1: the SA-GCN model concatenates the structural feature vector $S$ and the semantic feature vector $W$ to obtain the feature vector $F$, as shown in the following formula:
$F = S \oplus W$ (9)
step 2.3.2: the feature vector $F$ of the tweet is passed through the $Softmax$ function to obtain the final classification result, as shown in the following formula:
$P_d = Softmax(F)$ (10)
where $P_d$ is the predicted probability that tweet sample $d$ is a rumor;
step 2.3.3: the optimization objective of the model is to minimize the cross-entropy loss function, as shown in the following formula:
Figure DEST_PATH_IMAGE024
(11)
where $D$ denotes the set of sample data, $y_d$ denotes the true label of the tweet sample $d$ to be predicted, and $P_d$ is the probability that tweet sample $d$ is predicted as the positive class.
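Steps 2.3.1 to 2.3.3 can be sketched as follows. Two details are assumptions rather than statements from the claim: a hypothetical linear projection (`W_out`, `b_out`) maps the concatenated vector to two class logits before the Softmax, and the loss in equation (11), whose formula image is not reproduced here, is taken to be the summed binary cross-entropy over the sample set.

```python
import math

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_rumor_prob(S, W, W_out, b_out):
    """Eq. (9)-(10): concatenate S and W, then apply Softmax.

    W_out and b_out are a hypothetical projection to two logits
    (index 0: non-rumor, index 1: rumor); the claim does not state them.
    """
    F = S + W  # list concatenation plays the role of F = S (+) W
    logits = [sum(w * f for w, f in zip(row, F)) + b
              for row, b in zip(W_out, b_out)]
    return softmax(logits)[1]

def cross_entropy(probs, labels, eps=1e-12):
    """Assumed eq. (11): summed binary cross-entropy over the samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

S_vec = [0.2, -0.1]        # toy structural features
W_vec = [0.5, 0.3, -0.4]   # toy semantic features
W_out = [[0.1, 0.1, 0.1, 0.1, 0.1],
         [-0.2, 0.4, 0.3, 0.1, -0.5]]
b_out = [0.0, 0.0]
p = predict_rumor_prob(S_vec, W_vec, W_out, b_out)
loss = cross_entropy([p], [1])
print(p, loss)
```

Minimizing this loss pushes $P_d$ toward 1 for samples labeled as rumors and toward 0 for the rest, which is the behavior the optimization objective in step 2.3.3 describes.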
CN202210371266.1A 2022-04-08 2022-04-08 Cantonese rumor detection method based on deep semantic perception map convolutional network Active CN114444516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210371266.1A CN114444516B (en) 2022-04-08 2022-04-08 Cantonese rumor detection method based on deep semantic perception map convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210371266.1A CN114444516B (en) 2022-04-08 2022-04-08 Cantonese rumor detection method based on deep semantic perception map convolutional network

Publications (2)

Publication Number Publication Date
CN114444516A true CN114444516A (en) 2022-05-06
CN114444516B CN114444516B (en) 2022-07-05

Family

ID=81359641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210371266.1A Active CN114444516B (en) 2022-04-08 2022-04-08 Cantonese rumor detection method based on deep semantic perception map convolutional network

Country Status (1)

Country Link
CN (1) CN114444516B (en)

Also Published As

Publication number Publication date
CN114444516B (en) 2022-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant