CN112766359B - Word double-dimension microblog rumor identification method for food safety public opinion - Google Patents

Word double-dimension microblog rumor identification method for food safety public opinion

Info

Publication number
CN112766359B
CN112766359B CN202110050517.1A
Authority
CN
China
Prior art keywords
word
model
text
dimension
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110050517.1A
Other languages
Chinese (zh)
Other versions
CN112766359A (en
Inventor
左敏
何思宇
张青川
颜文婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202110050517.1A priority Critical patent/CN112766359B/en
Publication of CN112766359A publication Critical patent/CN112766359A/en
Application granted granted Critical
Publication of CN112766359B publication Critical patent/CN112766359B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to a character-word dual-dimension microblog rumor identification method for food safety public opinion, which comprises the following steps: preprocessing data crawled from the Internet; constructing a word embedding library for the food safety field by combining an open-domain word embedding library; crawling multi-level Baidu Baike encyclopedia corpus to perform incremental training of the embedding library; extracting character-dimension text features with a BERT network; extracting word-dimension text features with a BLSTM network augmented by a position attention mechanism; finally obtaining character-word dual-dimension text feature vectors and classifying whether a microblog text is a rumor or not. The method addresses the heavily colloquial style, weak structure, strong domain specificity and difficult vectorization of microblog text in the field of food safety public opinion, and improves rumor recognition accuracy by constructing a domain word embedding library and a multi-granularity vectorization method that extracts corpus features more fully.

Description

Word double-dimension microblog rumor identification method for food safety public opinion
Technical Field
The invention relates to the field of artificial intelligence, in particular to a character-word dual-dimension microblog rumor identification method for food safety public opinion.
Background
Microblogs are popular with the public because of their convenience, openness, timeliness and anonymity, and more and more people choose microblogs to publish views and share stories. However, because of the low threshold for microblog user registration and the diversity of user groups, the quality of information released on microblogs is difficult to monitor and control, so microblogs have become a hotbed for the propagation of network rumors, which not only seriously interfere with people's lives but also disturb social order.
The food field concerns everyone's daily life, so the influence of food safety related microblog rumors is particularly severe. Establishing a rumor identification model with natural language processing technology is therefore of great significance for identifying food safety microblog rumors.
Text classification recognition is an important and practical research direction for natural language processing. Prior to the advent of deep learning, traditional machine learning methods were applied in the field of text classification, such as naive bayes models and support vector machine models. However, the traditional machine learning model depends on manual corpus labeling, so that a large amount of manpower and material resources are consumed, and the text feature extraction result is not satisfactory.
With the development of deep learning, cloud computing, artificial intelligence and other technologies in recent years, the deep neural network is applied to various fields and achieves good results. In the field of natural language processing, under the condition of large-scale corpus, a multi-level network model realizes automatic mining of text characteristic information, and a deep neural network becomes one of key technologies in the field of natural language processing, and achieves good effects in text semantic classification tasks. The development and the use of long-short-term memory networks and attention mechanisms in the field of natural language processing lay a foundation for the invention.
In addition, in text semantic classification, many researchers have studied whether the character-level and word-level embedding granularities each influence the classification effect. Kim proposed a model that extracts text semantic information through a character-level CNN, and Liu Longfei et al. demonstrated the superiority of character-level feature representation in Chinese text processing.
Because microblog texts are mostly unstructured corpora lacking in specification, vectorization is difficult; extracting text semantic features with the character dimension or the word dimension alone leaves feature extraction incomplete and loses classification precision, and existing language models struggle to process text in the food safety field accurately. Therefore, the invention adopts a character-word dual-dimension neural network model combined with a constructed food-field word embedding library to process microblog text in the food safety field.
Disclosure of Invention
The invention solves the following technical problem: aiming at the current need to identify and supervise food safety related rumors on microblogs, the character-word dual-dimension microblog rumor identification method for food safety public opinion can quickly and accurately identify and judge rumors, greatly improving the working efficiency of supervisors and assisting them in making judgments.
The invention discloses a character-word dual-dimension microblog rumor identification method for food safety public opinion, which comprises the following steps:
step 1, preprocessing original text data acquired from web crawlers on the Internet, wherein the preprocessing comprises removing a large number of special symbols, stop words and the like contained in the original text data;
step 2, constructing a word embedding resource library for the food safety field on the basis of an open-domain word embedding resource library, and performing incremental training;
step 3, constructing a bidirectional long short-term memory network with a fused position-aware attention mechanism as the neural-network end that obtains word-vector-dimension text features. First, the semantic roles and positions of field keywords are determined by combining the field word library constructed in step 2, generating position-aware attention. Then, word vectors generated by word embedding of the text corpus are input into the BLSTM model and participate in the computation of the intermediate hidden layer; the vectors computed by the hidden layer are further computed under the influence of the attention mechanism to obtain word-level text semantic features.
Step 4, independently of the BLSTM model constructed in step 3, constructing a BERT neural network model as the neural-network end that obtains character-vector-dimension text features; the BERT model converts each character in the text into a vector by querying a character vector table and uses it as model input; the model output is the vector representation of each input character after fusing full-text semantic information.
And 5, using Softmax as the classifier: the corpus is processed and output through the BERT and BLSTM two-branch neural networks, the word-dimension text feature information obtained in step 3 is combined with the character-dimension text feature information obtained in step 4 in a connection layer, and the result is input into the classifier for classification to obtain the final rumor identification result.
Further, in step 2, on the basis of an open-domain word embedding resource library, a skip-gram model is combined with word semantic representation to construct a word embedding resource library for the food safety field; corpus expansion is performed on this basis by crawling published Baidu Baike encyclopedia entries and news corpus in the food field from the network for word vector model training. Thereafter, at intervals, when a certain amount of food safety public opinion corpus has accumulated, incremental training is performed on the word vector model.
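The skip-gram extension described above can be illustrated with a minimal sketch of how (center, context) training pairs are generated from a tokenized sentence; the tokens and the window size below are hypothetical placeholders, not values taken from the patent:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs for a skip-gram model.

    For each center word, every word within `window` positions on either
    side is emitted as a context word; such pairs would then be used to
    update word vectors, including in later incremental training runs.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized food-safety sentence (placeholder tokens).
sentence = ["food", "additive", "causes", "illness"]
pairs = skipgram_pairs(sentence, window=1)
```

In an incremental-training setting, newly accumulated public opinion corpus would simply contribute additional pairs on top of the existing model's vocabulary.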
Further, in step 3, a bidirectional long short-term memory network model with a fused position-aware attention mechanism is trained as the word-dimension text feature extraction model. The microblog text corpus is converted into vector representations used as network input; the neural network model is trained, with the position-attention BLSTM forming one of the two branch networks of the overall model; training on the existing microblog text corpus yields this branch's output: the word-dimension text feature vector representation.
Further, in step 4, the BERT network model is trained as the character-dimension text feature extraction model. Besides the character vector (Token Embedding), the model input contains two parts. One is the segmentation embedding (Segment Embedding): the value of this vector is learned automatically during model training and is used to describe the global semantic information of the text and fuse it with the semantic information of individual characters. The other is the position embedding (Position Embedding): since the semantic information carried by characters differs with their positions in the text, the BERT model attaches a different vector to characters at different positions to distinguish them. Finally, the BERT model takes the sum of Token Embedding, Segment Embedding and Position Embedding as the sentence vector, obtaining one of the two branch outputs of the overall model: the character-dimension text feature vector representation.
Further, the BERT network is used as a pre-training model; in the text classification task, the Token Embedding layer in the BERT network marks the head of each input sentence with [CLS] and places [SEP] marks between multiple sentences. The Segment Embedding and Position Embedding layers participate in the computation using pre-trained model parameters.
Further, in step 5, two neural network models are trained: the position-attention BLSTM model for extracting word-dimension text feature vectors, and the BERT model for extracting character-dimension text feature vectors. At the start of training, weights are randomly initialized; after the two branch results are obtained through neural network computation, they are connected through a connection layer, and a Softmax function converts the numerical output of the neural network into classified probability output on which the loss is computed. To avoid overfitting during training, dropout with a certain probability is applied, i.e., part of the hidden-layer weights or outputs are randomly zeroed during model training, reducing interdependence among nodes and improving model generalization.
Compared with the prior art, the invention has the following advantages. Through a character-word two-branch text semantic classification model based on a BLSTM network with a fused position-aware attention mechanism and a BERT network, whether food safety related microblogs are rumors can be judged rapidly. For rumor identification in the food safety public opinion field, a more comprehensive and specific food safety field embedding resource library is built, both character-level and word-level embedding granularities are used as model input, and the feature extraction results of the two branch networks are combined to classify the text. The proposed model makes full use of the characteristics of the BLSTM, mining text semantic features at the word-vector level; combined with the position attention mechanism, detailed feature information in the microblog text is acquired through BLSTM training, and the position attention computation lets words related to the food safety field play a decisive role in the whole text. Meanwhile, the BERT network further mines text semantics at the character-vector level, avoiding the loss of classification accuracy caused by incomplete feature extraction on unstructured, poorly specified text corpora, and effectively improving the text semantic classification effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a schematic flow chart of the character-word dual-dimension microblog rumor identification method for food safety public opinion provided by the embodiment of the invention;
FIG. 2 is a schematic diagram of the bidirectional long short-term memory network with fused position attention mechanism at the word-vector end;
FIG. 3 is a diagram of the BERT network at the character-vector end;
fig. 4 is a schematic diagram of a connection layer network.
Detailed Description
The technical solutions of the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, embodiments of the present invention; all other embodiments obtained by those skilled in the art without inventive effort based on these embodiments fall within the scope of protection of the present invention.
As shown in fig. 1, the invention provides a character-word dual-dimension microblog rumor identification method for food safety public opinion, comprising the following steps:
step 1, preprocessing original text data acquired from web crawlers on the Internet, wherein the preprocessing comprises removing a large number of special symbols, stop words and the like contained in the original text data;
step 2, constructing a word embedding resource library for the food safety field on the basis of an open-domain word embedding resource library, and performing incremental training;
step 3, constructing a bidirectional long short-term memory network with a fused position-aware attention mechanism as the neural-network end that obtains word-vector-dimension text features: first, the semantic roles and positions of field keywords are determined by combining the field word library constructed in step 2, generating position-aware attention; then, word vectors generated by word embedding of the text corpus are input into the BLSTM model and participate in the computation of the intermediate hidden layer; the vectors computed by the hidden layer are further computed under the influence of the attention mechanism to obtain word-level text semantic features;
step 4, independently of the BLSTM model constructed in step 3, constructing a BERT neural network model as the neural-network end that obtains character-vector-dimension text features; the BERT model converts each character in the text into a vector by querying a character vector table and uses it as model input; the model output is the vector representation of each input character after fusing full-text semantic information;
and step 5, using Softmax as the classifier: the corpus is processed and output through the BERT and BLSTM two-branch neural networks, the word-dimension text feature information obtained in step 3 is combined with the character-dimension text feature information obtained in step 4 in a connection layer, and the result is input into the classifier for classification to obtain the final rumor identification result.
Referring to fig. 1, which shows an overall schematic diagram of the method: the crawled food safety public opinion microblog data is preprocessed; an open-domain word embedding resource library is combined to construct a word embedding resource library for the food safety field; then multi-level Baidu Baike encyclopedia corpus is crawled to perform incremental training of the embedding library; character-dimension text features based on the BERT network and word-dimension text features based on the BLSTM network with an added position attention mechanism are obtained; finally character-word dual-dimension text feature vectors are obtained and classification of whether the microblog text is a rumor is performed.
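A minimal sketch of the preprocessing in step 1; the regular expression and the stop-word list are illustrative assumptions (the patent does not specify them), and Chinese word segmentation is assumed to happen separately:

```python
import re

# Hypothetical stop-word list; a real system would load a full Chinese list.
STOP_WORDS = {"的", "了", "是"}

def preprocess(text):
    """Strip special symbols, then drop stop words from the remaining tokens."""
    # Keep word characters (including CJK); every other run becomes a space.
    cleaned = re.sub(r"[^\w\u4e00-\u9fff]+", " ", text)
    tokens = cleaned.split()
    return [t for t in tokens if t not in STOP_WORDS]
```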
In the embodiment shown in fig. 2, the model first generates position-aware attention by determining the semantic roles and positions of field keywords in conjunction with the field word library. Word embedding is performed on the microblog text to generate word-dimension vectors; the word vectors are input into the bidirectional long short-term memory network and computed through the intermediate hidden layer, which outputs hidden-layer vectors; the word-dimension text semantic features are then obtained by computation with the position attention.
For the problem of identifying microblog rumors in the food safety public opinion field studied here, keywords of the food safety field are very important, and the words adjacent to keywords also have a non-negligible effect. This is because, in a text classification task, each word in the text influences the final classification result differently, and increasing attention to keywords lets them exert their full effect. Therefore, the invention locates keywords according to the field word library so that the model learns more position information, and a position-based attention mechanism is introduced into the model. Assume that the influence of a keyword at a particular distance, in each hidden-layer dimension, follows a Gaussian distribution. A basis matrix K of influences is defined, each column of which is an influence basis vector corresponding to a specific distance. K is defined as formula (1):
K(i,u)~N(Kernel(u),σ) (1)
where K(i, u) represents the influence, in dimension i, of a food safety field keyword at distance u, and N denotes a normal distribution with mean Kernel(u) and standard deviation σ. Kernel(u) is a Gaussian kernel function used to model position-aware influence propagation, defined as formula (2):

Kernel(u) = exp(−u² / (2σ²)) (2)
when u=0, the current word is a keyword in the food safety domain, and the obtained propagation influence is maximum, and the propagation influence is weakened along with the increase of the distance.
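Under the Gaussian-distribution assumption above, the kernel and the influence basis matrix K of formula (1) could be sketched as follows; the value of σ and the matrix dimensions are hypothetical settings:

```python
import math
import random

SIGMA = 1.0  # standard deviation; a hypothetical setting

def kernel(u, sigma=SIGMA):
    """Gaussian kernel: influence is maximal at distance 0 and decays with |u|."""
    return math.exp(-(u * u) / (2 * sigma * sigma))

def influence_basis(dims, max_dist, sigma=SIGMA):
    """Basis matrix K with K[i][u] ~ N(kernel(u), sigma):
    one row per hidden dimension, one column per keyword distance."""
    return [[random.gauss(kernel(u), sigma) for u in range(max_dist + 1)]
            for _ in range(dims)]
```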
And obtaining the influence vector of the keywords at each specific position by using the influence basic matrix and according to the position relation of the keywords in the food safety field through cumulative calculation:
p_j = K c_j (3)
where p_j is the cumulative influence vector of the word at position j and c_j is the distance count vector, whose entry c_j(u) records the count of all keywords at distance u from position j; c_j(u) is calculated as follows:
c_j(u) = Σ_{q∈Q} [ (j−u) ∈ pos(q) ] + [ (j+u) ∈ pos(q) ] (4)
where Q is the set of all keywords contained in a food safety public opinion related microblog text, q is one of those keywords, pos(q) is the set of positions at which keyword q appears in the sentence, and [·] is an indicator function equal to 1 when its condition is satisfied and 0 otherwise.
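The distance count vector c_j of formula (4) can be computed as in this sketch; the keyword position sets are toy assumptions:

```python
def count_vector(j, keyword_positions, max_dist):
    """c_j(u): number of keyword occurrences at distance u from position j,
    counting both the left (j - u) and right (j + u) sides.
    Note: at u = 0 both terms refer to the same position, as in formula (4)."""
    c = [0] * (max_dist + 1)
    for pos_set in keyword_positions:   # one position set per keyword q in Q
        for u in range(max_dist + 1):
            if (j - u) in pos_set:
                c[u] += 1
            if (j + u) in pos_set:
                c[u] += 1
    return c
```

For example, with a single keyword occurring at positions {0, 4}, the word at position 2 sees two keywords at distance 2 and none closer.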
The attention weight of the word at position j in a food safety public opinion related microblog text is computed as formula (5):

α_j = exp(a(h_j, p_j)) / Σ_{k=1}^{len} exp(a(h_k, p_k)) (5)
in the formula, h j Is the hidden layer vector of the j-position word, p j The method is an accumulated position perception influence vector, len is the number of word vectors in a sentence of food safety public opinion related microblog text, and a (-) is the importance of words used for measuring based on hidden layer vectors and position perception influence vectors. The specific form of a (.) is as shown in formula (6):
a(h_j, p_j) = vᵀ tanh(W_H h_j + W_p p_j + b_1) + b_2 (6)

where W_H and W_p are the weight matrices of h_j and p_j, b_1 is the bias vector belonging to the first-layer parameters, v is a global vector, and b_2 is the bias vector belonging to the second-layer parameters. After the weight of the word at each position is calculated, all hidden-layer vectors in the sentence are weighted to obtain the final attention value:

Value = Σ_{j=1}^{len} α_j h_j (7)
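The attention weighting pipeline, a softmax over the importance scores followed by a weighted sum of the hidden vectors, can be sketched as follows; the score list stands in for the values of a(h_j, p_j) and all numbers are illustrative:

```python
import math

def attention_value(scores, hidden):
    """Softmax the importance scores, then return the weighted sum of the
    hidden-layer vectors (the final attention value)."""
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]        # attention weights, summing to 1
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(alphas, hidden)) for d in range(dim)]

# Toy example: two words with 2-dimensional hidden vectors and equal scores.
value = attention_value([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```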
in another embodiment shown in FIG. 3, a BERT network is employed at the word dimension text feature extraction end, and for text classification tasks, the Token Embedding layer in BERT is labeled [ CLS ] for the input requirement sentence header]Inter-sentence label [ SEP ]]. The word vectors are calculated through Token embedded-bands, segment Embeddings, position Embedding, segment Embedding, and Position Embedding, respectively, using pre-trained model parameters. Finally obtaining the character-level food safety public opinion related microblog textAnd (5) sign representation. In FIG. 3, tok represents a different Token, E represents an embedded vector, T i Representing the feature vector of the i-th Token after the BERT processing. For text classification in general, BERT takes the first [ CLS directly]Adding a layer of weight W to the final hidden layer vector C in the model, and then using a softMax function as an activation function, b c Is a bias vector, as in equation (8):
P=SoftMax(CW T +b c ) (8)
model fine tuning is achieved by adjusting model parameters in specific tasks.
In the embodiment shown in fig. 4, the obtained word-vector-level and character-vector-level text vectors are connected at the connection layer, and finally the probability of whether the food safety public opinion related microblog text is a rumor is obtained through the Softmax function, defined as formula (9):

P(s_i) = e^{g_i} / Σ_{k=1}^{n} e^{g_k} (9)

The function maps the outputs of the neurons into the interval (0, 1), where n represents the number of classes, i indexes a class, g_i represents the value for class i, and P(s_i) represents the probability of the i-th class.
While the foregoing describes illustrative embodiments to facilitate understanding of the present invention by those skilled in the art, it should be understood that the invention is not limited in scope to those embodiments; all changes that are apparent to those skilled in the art fall within the scope of the invention as defined by the appended claims.

Claims (4)

1. A character-word dual-dimension microblog rumor identification method for food safety public opinion, characterized by comprising the following steps:
step 1, preprocessing original text data acquired from web crawlers on the Internet, wherein the preprocessing comprises the steps of removing special symbols and stop words contained in the original text data;
step 2, constructing a word embedding resource library for the food safety field on the basis of an open-domain word embedding resource library, and performing incremental training;
step 3, constructing a bidirectional long short-term memory network with a fused position-aware attention mechanism as the neural-network end that obtains word-vector-dimension text features, specifically realized as follows: first, the semantic roles and positions of field keywords are determined by combining the field word library constructed in step 2, generating position-aware attention; word vectors generated by word embedding of the text corpus are input into the BLSTM model and participate in the computation of the intermediate hidden layer; the vectors computed by the hidden layer are further computed under the influence of the attention mechanism to obtain word-level text semantic features;
step 4, independently of the BLSTM model constructed in step 3, constructing a BERT neural network model as the neural-network end that obtains character-vector-dimension text features; the BERT model converts each character in the text into a vector by querying a character vector table and uses it as model input; the model output is the vector representation of each input character after fusing full-text semantic information;
step 5, using Softmax as the classifier: the corpus is processed and output through the BERT and BLSTM two-branch neural networks, the word-dimension text feature information obtained in step 3 is combined with the character-dimension text feature information obtained in step 4 in a connection layer, and the result is input into the classifier for classification to obtain the final rumor identification result;
in step 3, a bidirectional long short-term memory network model with a fused position-aware attention mechanism is trained as the word-dimension text feature extraction model; the microblog text corpus is converted into vector representations used as network input; the neural network model is trained, with the position-attention BLSTM forming one of the two branch networks of the overall model; training on the existing microblog text corpus yields this branch's output, namely the word-dimension text feature vector representation;
in step 4, the BERT network model is trained as the character-dimension text feature extraction model; besides the character vector (Token Embedding), the model input contains two parts: first, the segmentation embedding (Segment Embedding), whose value is learned automatically during model training and is used to describe the global semantic information of the text and fuse it with the semantic information of individual characters; second, the position embedding (Position Embedding): because the semantic information carried by characters differs with their positions in the text, the BERT model attaches a different vector to characters at different positions to distinguish them; finally, the BERT model takes the sum of Token Embedding, Segment Embedding and Position Embedding as the sentence vector, obtaining one of the two branch outputs of the overall model, namely the character-dimension text feature vector representation.
2. The character-word dual-dimension microblog rumor identification method for food safety public opinion according to claim 1, characterized in that: in step 2, on the basis of an open-domain word embedding library, a skip-gram model is combined with word semantic representation to construct a word embedding library for the food safety field; corpus expansion is performed on this basis by crawling published Baidu Baike encyclopedia entries and news corpus in the food safety field from the network for word vector model training; thereafter, at intervals, when a certain amount of food safety public opinion corpus has accumulated, incremental training is performed on the word vector model.
3. The character-word dual-dimension microblog rumor identification method for food safety public opinion according to claim 1, characterized in that: the BERT network is used as a pre-training model; in the text classification task, the Token Embedding layer in the BERT network marks the head of each input sentence with [CLS] and places [SEP] marks between multiple sentences, and the Segment Embedding and Position Embedding layers participate in the computation using pre-trained model parameters.
4. The method for identifying word bi-dimensional microblog rumors for food safety public opinion according to claim 1, characterized in that: in the step 5, two neural network models are trained, namely a bidirectional long short-term memory network model with a fused position-aware attention mechanism for extracting the word-dimension text feature vector, and a BERT model for extracting the character-dimension text feature vector. When training starts, the weights are randomly initialized; after the results of the two parallel networks are obtained through neural network calculation, they are joined through a connecting layer, and the softmax function converts the numerical output of the neural network into a classification probability output for computing the loss. To avoid overfitting during training, dropout with a certain probability is set, i.e., part of the weights or outputs of the hidden layers are randomly zeroed during model training, which reduces the interdependence among nodes and improves model generalization.
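The fusion, dropout and softmax steps in this claim can be sketched numerically. The feature sizes, random feature vectors and two-class output here are illustrative assumptions, not the patented network:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    """Convert raw scores into a probability distribution (numerically stable)."""
    e = np.exp(z - z.max())
    return e / e.sum()

def dropout(x, p=0.5):
    """Inverted dropout: zero each element with probability p, rescale survivors."""
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

# Hypothetical feature vectors from the two branches (sizes are illustrative).
word_feat = rng.normal(size=64)   # word dimension: BiLSTM + position-aware attention
char_feat = rng.normal(size=64)   # character dimension: BERT
fused = np.concatenate([word_feat, char_feat])   # connecting-layer input

hidden = dropout(fused, p=0.5)                   # applied during training only
W, b = rng.normal(size=(2, 128)), np.zeros(2)    # randomly initialized weights
probs = softmax(W @ hidden + b)                  # e.g. P(rumor), P(non-rumor)
print(probs)
```

At inference time the dropout step is skipped (or equivalently applied with p = 0), so all nodes contribute to the prediction.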
CN202110050517.1A 2021-01-14 2021-01-14 Word double-dimension microblog rumor identification method for food safety public opinion Active CN112766359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110050517.1A CN112766359B (en) 2021-01-14 2021-01-14 Word double-dimension microblog rumor identification method for food safety public opinion


Publications (2)

Publication Number Publication Date
CN112766359A CN112766359A (en) 2021-05-07
CN112766359B true CN112766359B (en) 2023-07-25

Family

ID=75700739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110050517.1A Active CN112766359B (en) 2021-01-14 2021-01-14 Word double-dimension microblog rumor identification method for food safety public opinion

Country Status (1)

Country Link
CN (1) CN112766359B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378024B (en) * 2021-05-24 2023-09-01 哈尔滨工业大学 Deep learning-oriented public inspection method field-based related event identification method
CN113592338B (en) * 2021-08-09 2023-09-12 新疆大学 Food quality management safety risk pre-screening model
CN113946680B (en) * 2021-10-20 2024-04-16 河南师范大学 Online network rumor identification method based on graph embedding and information flow analysis
CN115082947B (en) * 2022-07-12 2023-08-15 江苏楚淮软件科技开发有限公司 Paper letter quick collecting, sorting and reading system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090046A (en) * 2017-12-29 2018-05-29 武汉大学 A microblog rumor identification method based on LDA and random forest
CN108614855A (en) * 2018-03-19 2018-10-02 众安信息技术服务有限公司 A rumor identification method
CN112069397A (en) * 2020-08-21 2020-12-11 三峡大学 Rumor detection method combining a self-attention mechanism with a generative adversarial network
CN112200197A (en) * 2020-11-10 2021-01-08 天津大学 Rumor detection method based on deep learning and multi-modality

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580970B2 (en) * 2019-04-05 2023-02-14 Samsung Electronics Co., Ltd. System and method for context-enriched attentive memory network with global and local encoding for dialogue breakdown detection
CN110377686B (en) * 2019-07-04 2021-09-17 浙江大学 Address information feature extraction method based on deep neural network model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于BERT和双向LSTM的微博评论倾向性分析研究 [Research on sentiment analysis of microblog comments based on BERT and bidirectional LSTM]; 谌志群; 鞠婷; 情报理论与实践 (Issue 08); full text *

Also Published As

Publication number Publication date
CN112766359A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN112766359B (en) Word double-dimension microblog rumor identification method for food safety public opinion
CN109871451B (en) Method and system for extracting relation of dynamic word vectors
CN110633409B (en) Automobile news event extraction method integrating rules and deep learning
CN108959270A (en) A kind of entity link method based on deep learning
CN103226580A (en) Interactive-text-oriented topic detection method
CN111639252A (en) False news identification method based on news-comment relevance analysis
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN111914553B (en) Financial information negative main body judging method based on machine learning
Zhang et al. A BERT fine-tuning model for targeted sentiment analysis of Chinese online course reviews
Chen et al. Clause sentiment identification based on convolutional neural network with context embedding
CN114490991A (en) Dialog structure perception dialog method and system based on fine-grained local information enhancement
Zhi et al. Financial fake news detection with multi fact CNN-LSTM model
Sadr et al. Unified topic-based semantic models: A study in computing the semantic relatedness of geographic terms
El Desouki et al. Exploring the recent trends of paraphrase detection
CN113051922A (en) Triple extraction method and system based on deep learning
Yong et al. A new emotion analysis fusion and complementary model based on online food reviews
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN116304064A (en) Text classification method based on extraction
KR20210053539A (en) Apparatus and method for estimation of patent novelty
CN115906816A (en) Text emotion analysis method of two-channel Attention model based on Bert
CN113408289B (en) Multi-feature fusion supply chain management entity knowledge extraction method and system
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN113204971B (en) Scene self-adaptive Attention multi-intention recognition method based on deep learning
Shan Social network text sentiment analysis method based on CNN-BiGRU in big data environment
Chen Semantic matching efficiency of supply and demand text on cross-border E-commerce online technology trading platforms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant