CN112766359A

CN112766359A - Character-word dual-dimension microblog rumor recognition method for food safety public opinion

Info

Publication number: CN112766359A
Application number: CN202110050517.1A
Authority: CN
Inventors: 左敏; 何思宇; 张青川; 颜文婧
Original assignee: Beijing Technology and Business University
Current assignee: Beijing Technology and Business University
Priority date: 2021-01-14
Filing date: 2021-01-14
Publication date: 2021-05-07
Anticipated expiration: 2041-01-14
Also published as: CN112766359B

Abstract

The invention relates to a character-word dual-dimension microblog rumor recognition method for food safety public opinion. The method comprises the following steps: preprocessing data crawled from the Internet; constructing a word embedding resource library for the food safety field on the basis of an open-domain word embedding resource library; crawling multi-level encyclopedia corpora to incrementally train the word embedding resource library; extracting character-dimension text features with a BERT network; extracting word-dimension text features with a BLSTM network augmented with a position attention mechanism; and finally obtaining character-word dual-dimension text feature vectors and classifying whether a microblog text is a rumor. The method addresses the highly colloquial language, weak structure, strong domain specificity and difficult vectorization of microblog text corpora in the field of food safety public opinion. By constructing a domain lexicon and using a multi-granularity vectorization method, it extracts corpus features more fully and improves the accuracy of rumor recognition.

Description

Character-word dual-dimension microblog rumor recognition method for food safety public opinion

Technical Field

The invention relates to the field of artificial intelligence, and in particular to a character-word dual-dimension microblog rumor recognition method for food safety public opinion.

Background

Microblogs are popular because they are convenient, open, timely and anonymous, and more and more people use them to publish opinions and share stories. However, because the threshold for registering a microblog account is low and the user base is diverse, the quality of the information that users publish is difficult to monitor and control. Microblogs have therefore become a hotbed for the growth and spread of online rumors, which not only seriously interfere with people's lives but also disturb social order.

Food bears on the national economy and people's livelihood, so microblog rumors related to food safety have a particularly serious impact. Building a rumor recognition model with natural language processing technology is therefore of great significance for identifying food safety microblog rumors.

Text classification is an important and practical research direction in natural language processing. Before the rise of deep learning, conventional machine learning methods such as naive Bayes models and support vector machines were applied to text classification. However, traditional machine learning models depend on manual corpus labeling, which consumes a large amount of manpower and material resources, and their text feature extraction results are unsatisfactory.

With the development of technologies such as deep learning, cloud computing and artificial intelligence in recent years, deep neural networks have been applied in many fields with good results. In natural language processing, multi-level network models mine text feature information automatically when large-scale corpora are available; deep neural networks have become one of the key technologies of the field and also perform well in text semantic classification tasks. The development and use of long short-term memory (LSTM) networks and attention mechanisms in natural language processing lay the foundation for the invention.

In addition, in text semantic classification, many researchers have studied whether the two embedding granularities, character level and word level, affect the classification result. Kim proposed a model that extracts text semantic information with a character-level CNN, and Liu Longfei et al. demonstrated the superiority of character-level feature representation in Chinese text processing.

Because microblog texts are mostly unstructured and lack standardized corpora, they are difficult to vectorize. Extracting the semantic features of the text from the character dimension alone or the word dimension alone leaves feature extraction incomplete and reduces classification accuracy, and existing language models have difficulty processing texts in the food safety field accurately. The invention therefore processes microblog texts in the food safety field by combining a character-word dual-dimension neural network model with a purpose-built lexicon for the food field.

Disclosure of Invention

Problem solved by the invention: the method overcomes the shortcomings of the prior art and provides a character-word dual-dimension microblog rumor recognition method for food safety public opinion. It meets the need to identify and supervise food-safety-related rumors on existing microblogs, can identify rumors quickly and accurately, greatly improves the working efficiency of supervisors, and assists them in making judgments.

The invention provides a character-word dual-dimension microblog rumor recognition method for food safety public opinion, comprising the following steps:

Step 1, preprocessing the original text data acquired from the Internet by a web crawler, wherein the preprocessing comprises removing the special symbols, stop words and the like contained in the original text data;

Step 2, on the basis of an open-domain word embedding resource library, constructing a word embedding resource library for the food safety field and performing incremental training;

Step 3, constructing a bidirectional long short-term memory (BLSTM) network fused with a position-aware attention mechanism as the branch of the neural network model that obtains word-dimension text features. First, the semantic roles and positions of the domain keywords are determined with the domain lexicon constructed in step 2, and position-aware attention is generated. Then the word vectors produced by word embedding of the text corpus are input into the BLSTM model, where they take part in the computation of the intermediate hidden layer; the hidden-layer vectors are then weighted by the attention mechanism to obtain word-level text semantic features;

Step 4, independently of the BLSTM model constructed in step 3, constructing a BERT neural network model as the branch that obtains character-dimension text features, wherein the BERT model converts each character in the text into a vector by looking it up in a character vector table and uses it as the model input; the model output is the vector representation of each input character after full-text semantic information has been fused into it;

Step 5, using SoftMax as the classifier: after the corpus has been processed by the two branches, BERT and BLSTM, the word-dimension text features obtained in step 3 and the character-dimension text features obtained in step 4 are merged at a connection layer and then input into the classifier for classification, yielding the final rumor recognition result.
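The preprocessing of step 1 can be illustrated with a minimal Python sketch. It assumes the microblog text has already been segmented into tokens; the stop-word set and the function name `preprocess` are hypothetical and are not specified by the invention.

```python
import re

# Hypothetical stop-word set; a real system would load a full stop-word list.
STOP_WORDS = {"的", "了", "是"}

def preprocess(tokens):
    """Strip special symbols from each token and drop stop words and empty tokens."""
    cleaned = []
    for tok in tokens:
        # Keep only CJK characters, ASCII letters and digits.
        tok = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", tok)
        if tok and tok not in STOP_WORDS:
            cleaned.append(tok)
    return cleaned

print(preprocess(["#食品安全#", "的", "谣言", "！！", "添加剂"]))
```

Tokens that consist only of symbols (such as the repeated exclamation marks) vanish entirely, while hashtag markers are stripped from otherwise meaningful tokens.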

Further, in step 2, on the basis of the open-domain word embedding resource library, the word embedding resource library for the food safety field is constructed by combining a skip-gram model with word semantic representation. The corpus is then expanded: the open Baidu Baike (Baidu Encyclopedia) corpus is added, and food-field encyclopedia entries and news corpora are crawled from the web to train the word vector model. After a period of time, once a certain amount of food safety public opinion corpus has accumulated, the word vector model is trained incrementally.
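As a rough illustration of the two ideas in this paragraph, the sketch below generates skip-gram (center, context) training pairs and extends an existing vocabulary when new corpus arrives. It is not the invention's training code; the function names and the window size are assumptions made for the example.

```python
def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs that a skip-gram model is trained on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def extend_vocab(vocab, tokens):
    """Incrementally add unseen words to a word -> index vocabulary."""
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

vocab = extend_vocab({}, ["食品", "添加剂", "超标"])
vocab = extend_vocab(vocab, ["添加剂", "致癌"])  # newly accumulated corpus
print(vocab)
print(skipgram_pairs(["食品", "添加剂", "超标"], window=1))
```

In incremental training, the existing vectors would be kept and only the new vocabulary entries initialized before training continues on the new pairs.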

Further, in step 3, a BLSTM network model fused with a position-aware attention mechanism is trained as the word-dimension text feature extraction model. The microblog text corpus is converted into vector representations and used as the network input to train the neural network model. This BLSTM network with position-aware attention forms one of the two branches of the overall model, and training on the existing microblog text corpus yields its output: the word-dimension text feature vector representation.

Further, in step 4, a BERT network model is trained as the character-dimension text feature extraction model. In addition to the character vector (Token Embedding), the model input contains two further parts. The first is the Segment Embedding: the value of this vector is learned automatically during model training, describes the global semantic information of the text, and is fused with the semantic information of the individual characters. The second is the Position Embedding: because characters appearing at different positions of a text carry different semantic information, the BERT model adds a different vector to the character at each position to distinguish them. Finally, the BERT model takes the sum of the Token Embedding, Segment Embedding and Position Embedding as the sentence vector, producing the output of the other branch of the overall model: the character-dimension text feature vector representation.
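The element-wise sum of the three embeddings can be sketched in plain Python. The lookup tables here are tiny hand-written lists chosen only for illustration; in a real BERT model they are learned parameter matrices, and the function name is hypothetical.

```python
def sum_embeddings(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """For each position, add the token, segment and position embedding vectors."""
    summed = []
    for pos, (tok_id, seg_id) in enumerate(zip(token_ids, segment_ids)):
        vec = [t + s + p for t, s, p in
               zip(tok_table[tok_id], seg_table[seg_id], pos_table[pos])]
        summed.append(vec)
    return summed

# Toy 2-dimensional tables (illustrative values only).
tok_table = [[1.0, 0.0], [0.0, 1.0]]
seg_table = [[0.5, 0.5]]
pos_table = [[0.0, 0.0], [1.0, 1.0]]

print(sum_embeddings([1, 0], [0, 0], tok_table, seg_table, pos_table))
```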

Further, the BERT network is used as a pre-trained model. In a text classification task, the Token Embedding layer of the BERT network requires the input to mark the head of a sentence with [CLS] and the boundary between sentences with [SEP]. The Segment Embedding and Position Embedding layers take part in the computation with pre-trained model parameters.
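The marking convention described here can be sketched as follows, treating each Chinese character as one token. The function name is hypothetical, and the sketch follows the wording above literally; common BERT tokenizers additionally append a final [SEP] after the last sentence.

```python
def mark_sentences(sentences):
    """Prefix the input with [CLS] and place [SEP] between consecutive sentences."""
    tokens = ["[CLS]"]
    for idx, sentence in enumerate(sentences):
        if idx > 0:
            tokens.append("[SEP]")
        tokens.extend(list(sentence))  # one token per character
    return tokens

print(mark_sentences(["食品", "安全"]))
```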

Further, in step 5, two neural network models are trained: the BLSTM network model fused with a position-aware attention mechanism, which extracts the word-dimension text feature vectors, and the BERT model, which extracts the character-dimension text feature vectors. When training starts, the weights are initialized randomly. After the two branches have produced their results, the results are joined at a connection layer, and a SoftMax function, used as the loss function, converts the numerical output of the neural network into class probabilities. To avoid overfitting during training, Dropout with a certain probability is applied, that is, part of the weights or hidden-layer outputs are randomly set to zero during training, which reduces the interdependence among nodes and improves the generalization of the model.
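A minimal sketch of the fusion step, under stated assumptions: the two feature vectors are concatenated, passed through a single linear layer (the weights and bias are hypothetical), and normalized with SoftMax. The dropout helper uses the common "inverted dropout" scaling, which is an assumption and not something the text specifies.

```python
import math
import random

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(vec, rate, rng):
    """Zero each element with probability `rate`; rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def classify(char_feat, word_feat, weights, bias):
    """Concatenate both feature vectors, apply a linear layer, then SoftMax."""
    fused = char_feat + word_feat  # connection layer: list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

rng = random.Random(0)
char_feat = dropout([0.2, 0.4], 0.5, rng)  # dropout is applied during training only
word_feat = [0.1, 0.3]
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # rumor / non-rumor
print(classify(char_feat, word_feat, weights, [0.0, 0.0]))
```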

Compared with the prior art, the invention has the following advantages. A character-word dual-branch text semantic classification model, built from a BLSTM network fused with a position-aware attention mechanism and a BERT network, can quickly determine whether a food-safety-related microblog is a rumor. A more comprehensive and more targeted embedding resource library for food safety public opinion is constructed for rumors in this field, the two embedding granularities of character level and word level are used as model input, and the text is finally classified by combining the feature extraction results of the two branches. The proposed model makes full use of the characteristics of the BLSTM to mine the semantic features of the text at the word vector level. Combined with the position attention mechanism, training the BLSTM captures detailed feature information in the microblog text, and the position attention computation lets words related to the food safety field play a decisive role in the whole text. Meanwhile, the BERT network further mines text semantics at the character vector level, which avoids the loss of classification accuracy caused by incomplete feature extraction from unstructured, non-standardized text corpora and effectively improves text semantic classification.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

FIG. 1 is a schematic flow chart of a character-word dual-dimension microblog rumor recognition method for food safety public opinion according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the bidirectional long short-term memory network fused with a position attention mechanism at the word vector end;

FIG. 3 is a schematic diagram of the BERT network at the character vector end;

FIG. 4 is a schematic diagram of the connection layer network.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by a person skilled in the art based on the embodiments of the present invention belong to the protection scope of the present invention without creative efforts.

As shown in FIG. 1, the invention provides a character-word dual-dimension microblog rumor recognition method for food safety public opinion, which comprises the following steps:

step 1, preprocessing the original text data acquired from the Internet by a web crawler, wherein the preprocessing comprises removing the special symbols, stop words and the like contained in the original text data;

step 2, on the basis of an open-domain word embedding resource library, constructing a word embedding resource library for the food safety field and performing incremental training;

step 3, constructing a bidirectional long short-term memory (BLSTM) network fused with a position-aware attention mechanism as the branch of the neural network model that obtains word-dimension text features: first, the semantic roles and positions of the domain keywords are determined with the domain lexicon constructed in step 2, and position-aware attention is generated; then the word vectors produced by word embedding of the text corpus are input into the BLSTM model, where they take part in the computation of the intermediate hidden layer, and the hidden-layer vectors are weighted by the attention mechanism to obtain word-level text semantic features;

step 4, independently of the BLSTM model constructed in step 3, constructing a BERT neural network model as the branch that obtains character-dimension text features, wherein the BERT model converts each character in the text into a vector by looking it up in a character vector table and uses it as the model input; the model output is the vector representation of each input character after full-text semantic information has been fused into it;

step 5, using SoftMax as the classifier: the word-dimension text features obtained in step 3 and the character-dimension text features obtained in step 4 are merged at a connection layer and then input into the classifier for classification, yielding the final rumor recognition result.

FIG. 1 shows the overall schematic of the method provided by the invention. The crawled food safety public opinion microblog data are preprocessed; a word embedding resource library for the food safety field is constructed on the basis of an open-domain word embedding resource library; multi-level encyclopedia corpora are then crawled to incrementally train the word embedding resource library; character-dimension text features are obtained with a BERT network and word-dimension text features with a BLSTM network augmented with a position attention mechanism; finally, character-word dual-dimension text feature vectors are obtained and the microblog text is classified as a rumor or not.

In the embodiment shown in FIG. 2, the model first determines the semantic roles and positions of the domain keywords with the domain lexicon and generates position-aware attention. The microblog text is converted by word embedding into word-dimension vectors, which are input into the BLSTM network. The word vectors pass through the intermediate hidden layer, which outputs hidden-layer vectors, and these are combined with the position attention to compute the word-dimension text semantic features.

For the microblog rumor problem in the field of food safety public opinion studied here, keywords in the food safety field are very important, and the words adjacent to the keywords also have a non-negligible effect. In a text classification task, each word in the text influences the final classification result differently, and paying more attention to the keywords lets them exert their effect more fully. Therefore, the positions of the keywords are located with the domain lexicon so that the model learns more position information, and a position-based attention mechanism is introduced into the model. The influence of a keyword at a given distance, in each hidden-layer dimension, is assumed to follow a Gaussian distribution. An influence basis matrix K is defined, each column of which is the influence basis vector corresponding to a particular distance. K is defined in formula (1):

K(i,u)～N(Kernel(u),σ) (1)

where K(i, u) is the influence, in the i-th dimension, of a food safety domain keyword at distance u, and N denotes a normal distribution with expectation Kernel(u) and standard deviation σ. Kernel(u) is a Gaussian kernel function used to model position-aware influence propagation, defined in formula (2):
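The image of formula (2) is not reproduced in this text. Assuming the usual Gaussian kernel, which is consistent with the behavior described next (largest at u = 0 and decaying with distance), it would read:

```latex
\mathrm{Kernel}(u) = \exp\!\left(-\frac{u^{2}}{2\sigma^{2}}\right) \tag{2}
```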

When u = 0, the current word is itself a keyword in the food safety field and receives the largest propagated influence; the influence weakens as the distance increases.

Using the influence basis matrix and the positions of the food safety domain keywords, the influence vector at each position is obtained by accumulation:

p_j＝Kc_j (3)

where p_j is the accumulated influence vector of the word at position j, and c_j is a distance count vector whose entry c_j(u) counts the keywords at distance u from the word at position j. c_j(u) is calculated as follows:

c_j(u)＝∑_{q∈Q}([(j-u)∈pos(q)]+[(j+u)∈pos(q)]) (4)

where Q is the set of all keywords contained in a microblog text related to food safety public opinion, q is one of these keywords, pos(q) is the set of positions at which keyword q appears in the sentence, and [·] is the indicator function, which equals 1 if the condition holds and 0 otherwise.
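Formulas (3) and (4) can be traced with a small Python sketch. Two simplifications are assumptions of the example, not of the invention: the kernel is taken to be the standard Gaussian kernel, and every entry of K is set to its expectation Kernel(u) rather than sampled from the normal distribution of formula (1). All function names are hypothetical.

```python
import math

def gaussian_kernel(u, sigma):
    """Assumed Gaussian kernel: 1 at distance 0, decaying with distance."""
    return math.exp(-(u * u) / (2.0 * sigma * sigma))

def distance_counts(j, keyword_positions, max_dist):
    """c_j(u) for u = 0..max_dist, following formula (4) literally."""
    counts = []
    for u in range(max_dist + 1):
        total = 0
        for positions in keyword_positions.values():
            total += int((j - u) in positions) + int((j + u) in positions)
        counts.append(total)
    return counts

def influence_vector(counts, dim, sigma):
    """p_j = K c_j, with each row of K set to the kernel values (the mean of K)."""
    K = [[gaussian_kernel(u, sigma) for u in range(len(counts))]
         for _ in range(dim)]
    return [sum(k * c for k, c in zip(row, counts)) for row in K]

positions = {"添加剂": {2}}  # keyword -> set of positions in the sentence
c1 = distance_counts(1, positions, max_dist=2)
print(c1, influence_vector(c1, dim=3, sigma=1.0))
```

Note that, read literally, formula (4) counts a keyword located exactly at position j twice when u = 0, because both indicator terms are satisfied.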

The attention calculation method of words at the j position in the microblog text related to the food safety public sentiment is shown as a formula (5):
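The image of formula (5) is not reproduced in this text. Assuming the usual normalization of the scores a(·) over the sentence, and writing the attention weight of position j as α_j (a symbol introduced here for illustration), a form consistent with the symbol definitions that follow is:

```latex
\alpha_j = \frac{\exp\bigl(a(h_j, p_j)\bigr)}{\sum_{k=1}^{len} \exp\bigl(a(h_k, p_k)\bigr)} \tag{5}
```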

where h_j is the hidden-layer vector of the word at position j, p_j is the accumulated position-aware influence vector, len is the number of word vectors in a sentence of microblog text related to food safety public opinion, and a(·) is a function that measures the importance of a word from its hidden-layer vector and its position-aware influence vector. The specific form of a(·) is given in formula (6):
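The image of formula (6) is likewise missing. Assuming a two-layer scoring function built from the parameters listed next (two weight matrices, a first-layer bias, a ReLU activation, a global vector and a second-layer bias), one consistent form is:

```latex
a(h_j, p_j) = v^{\top}\,\mathrm{ReLU}\bigl(W_H h_j + W_p p_j + b_1\bigr) + b_2 \tag{6}
```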

where W_H and W_p are the weight matrices of h_j and p_j, b_1 is a bias vector belonging to the first-layer parameters, the activation function is ReLU, v is a global vector, and b_2 is a bias vector belonging to the second-layer parameters. After the weight of the word at each position has been calculated, all hidden-layer vectors in the sentence are weighted to obtain the final attention value:
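The formula image that followed is not reproduced in this text. Assuming the final attention value is the attention-weighted sum of the hidden-layer vectors, r = Σ_j α_j h_j, with SoftMax-normalized weights α_j and the two-layer ReLU scoring function described above, the whole computation can be sketched in Python. The names r and α_j, the helper functions and the toy parameter values are all assumptions made for illustration.

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def score(h, p, W_H, W_p, b1, v, b2):
    """a(h_j, p_j): a ReLU layer over both inputs, then a dot product with v."""
    pre = [a + b + c for a, b, c in zip(matvec(W_H, h), matvec(W_p, p), b1)]
    hidden = [max(0.0, z) for z in pre]  # ReLU
    return sum(vi * zi for vi, zi in zip(v, hidden)) + b2

def attend(H, P, W_H, W_p, b1, v, b2):
    """Return the attention weights and the weighted sum of hidden vectors."""
    scores = [score(h, p, W_H, W_p, b1, v, b2) for h, p in zip(H, P)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    r = [sum(a * h[d] for a, h in zip(alphas, H)) for d in range(len(H[0]))]
    return alphas, r

H = [[1.0, 0.0], [0.0, 1.0]]   # hidden-layer vectors of two words
P = [[2.0], [0.0]]             # the first word is close to a keyword
alphas, r = attend(H, P, W_H=[[0.0, 0.0]], W_p=[[1.0]], b1=[0.0], v=[1.0], b2=0.0)
print(alphas, r)
```

With these toy parameters the word with the larger position influence receives the larger attention weight, so the result leans toward its hidden vector.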

In another embodiment shown in FIG. 3, a BERT network is used at the character-dimension text feature extraction end. For the text classification task, the Token Embedding layer in BERT requires the input to mark the head of a sentence with [CLS] and the boundary between sentences with [SEP]. The character vectors pass through the Token Embedding, Segment Embedding and Position Embedding layers, and the Segment Embedding and Position Embedding layers take part in the computation with pre-trained model parameters. This finally yields the character-level feature representation of the microblog text related to food safety public opinion. In FIG. 3, Tok denotes the different tokens, E denotes an embedding vector, and T_i denotes the feature vector obtained for the i-th token after BERT processing. For text classification in general, BERT directly takes the final hidden-layer vector C of the first token, [CLS], applies a weight layer W, and then uses the SoftMax function as the activation function, where b_c is a bias vector, as in formula (8):

P＝SoftMax(CW^T+b_c) (8)

Model fine-tuning is achieved by adjusting the model parameters on the specific task.

In the embodiment shown in FIG. 4, after the character-vector-level and word-vector-level text vectors have been obtained, they are joined at the connection layer, and the probability that a microblog text related to food safety public opinion is a rumor is finally obtained through a SoftMax function, whose formula is as follows:
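The formula image is not reproduced in this text. The standard SoftMax function, written with the symbols defined in the sentence that follows, is:

```latex
P(s_i) = \frac{e^{g_i}}{\sum_{j=1}^{n} e^{g_j}}
```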

The function maps the outputs of the neurons into the interval (0, 1), where n is the number of classes, i denotes one of the n classes, g_i is the value computed for class i, and P(s_i) is the probability of the i-th class.

Although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited to the scope of the embodiments, but various changes may be apparent to those skilled in the art, and it is intended that all inventive concepts utilizing the inventive concepts set forth herein be protected without departing from the spirit and scope of the present invention as defined and limited by the appended claims.

Claims

1. A character-word dual-dimension microblog rumor recognition method for food safety public opinion, characterized by comprising the following steps:

step 1, preprocessing original text data acquired from a web crawler on the Internet, wherein the preprocessing comprises the step of removing special symbols and stop words contained in the original text data;

step 2, constructing a word embedding resource library in the food safety field on the basis of the open domain word embedding resource library, and performing incremental training;

step 3, constructing a bidirectional long short-term memory (BLSTM) network fused with a position-aware attention mechanism as the branch of the neural network model that obtains word-dimension text features, specifically implemented as follows: first, determining the semantic roles and positions of the domain keywords with the domain lexicon constructed in step 2 to generate position-aware attention; then inputting the word vectors produced by word embedding of the text corpus into the BLSTM model, where they take part in the computation of the intermediate hidden layer, and weighting the hidden-layer vectors with the attention mechanism to obtain word-level text semantic features;

2. The character-word dual-dimension microblog rumor recognition method for food safety public opinion according to claim 1, wherein: in step 2, on the basis of the open-domain word embedding resource library, the word embedding resource library for the food safety field is constructed by combining a skip-gram model with word semantic representation; the corpus is then expanded by adding the open encyclopedia corpus and crawling food-field encyclopedia entries and news corpora from the web to train the word vector model; and after a period of time, once a certain amount of food safety public opinion corpus has accumulated, the word vector model is trained incrementally.

3. The character-word dual-dimension microblog rumor recognition method for food safety public opinion according to claim 1, wherein: in step 3, a BLSTM network model fused with a position-aware attention mechanism is trained as the word-dimension text feature extraction model; the microblog text corpus is converted into vector representations and used as the network input to train the neural network model; the BLSTM network fused with the position-aware attention mechanism forms one of the two branches of the overall model; and the word-dimension text feature vector representation is obtained by training on the existing microblog text corpus.

4. The character-word dual-dimension microblog rumor recognition method for food safety public opinion according to claim 1, wherein: in step 4, a BERT network model is trained as the character-dimension text feature extraction model, and the model input contains two further parts in addition to the character vector (Token Embedding); the first is the Segment Embedding, whose value is learned automatically during model training, describes the global semantic information of the text, and is fused with the semantic information of the individual characters; the second is the Position Embedding: because characters appearing at different positions of a text carry different semantic information, the BERT model adds a different vector to the character at each position to distinguish them; and finally, the BERT model takes the sum of the Token Embedding, Segment Embedding and Position Embedding as the sentence vector to obtain the output of one of the two branches of the overall model, namely the character-dimension text feature vector representation.

5. The character-word dual-dimension microblog rumor recognition method for food safety public opinion according to claim 1, wherein: the BERT network is used as a pre-trained model; in a text classification task, the Token Embedding layer of the BERT network requires the input to mark the head of a sentence with [CLS] and the boundary between sentences with [SEP]; and the Segment Embedding and Position Embedding layers take part in the computation with pre-trained model parameters.

6. The character-word dual-dimension microblog rumor recognition method for food safety public opinion according to claim 1, wherein: in step 5, two neural network models are trained, namely the BLSTM network model fused with a position-aware attention mechanism, which extracts the word-dimension text feature vectors, and the BERT model, which extracts the character-dimension text feature vectors; when training starts, the weights are initialized randomly; after the two branches have produced their results, the results are joined at a connection layer, and a SoftMax function, used as the loss function, converts the numerical output of the neural network into class probabilities; and to avoid overfitting during training, Dropout with a certain probability is applied, that is, part of the weights or hidden-layer outputs are randomly set to zero during training, which reduces the interdependence among nodes and improves the generalization of the model.