CN113312532B - Public opinion grade prediction method based on deep learning and oriented to public inspection field - Google Patents

Public opinion grade prediction method based on deep learning and oriented to public inspection field

Info

Publication number
CN113312532B
CN113312532B (application CN202110608376.0A)
Authority
CN
China
Prior art keywords
public
public opinion
text
database
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110608376.0A
Other languages
Chinese (zh)
Other versions
CN113312532A (en)
Inventor
赵铁军
杨沐昀
徐冰
郭常江
曹海龙
朱聪慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110608376.0A priority Critical patent/CN113312532B/en
Publication of CN113312532A publication Critical patent/CN113312532A/en
Application granted granted Critical
Publication of CN113312532B publication Critical patent/CN113312532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a public opinion grade prediction method facing the public inspection field based on deep learning. Step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information it contains and storing it in a database; step 2: predicting on the collected text information with a deep learning model to obtain a public opinion grade prediction result; step 3: storing the public opinion grade prediction result of step 2 in the system database; step 4: making a corresponding identification for the public opinion grade in the database; step 5: providing a data interface through which the public opinion information of the identified grades can be accessed; step 6: displaying the public opinion grade prediction result in the system through the data interface. The method addresses the lack of domain focus in existing public opinion systems and moves beyond the limitations of non-learning algorithms.

Description

Public opinion grade prediction method based on deep learning and oriented to public inspection field
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a public opinion grade prediction method facing the public inspection field based on deep learning.
Background
Public opinion monitoring and early warning is a complex technology that crosses social science and data science: at the initial stage of a public opinion event, a preliminary grade prediction is needed so that a response can be fully prepared. Automating this task requires large amounts of data and a reliable model for support.
Descriptions of public opinion events come mainly from news texts on network media and from social platforms such as Sina Weibo; people learn about such events directly, or indirectly from others, through reading, forwarding and commenting. A public opinion system can extract features from this event information (text features, which describe the specific situation of the event, and data features, which describe how the event propagates) to further analyze the current fermentation and propagation of the public opinion. A key point is analyzing the early-warning grade of the public opinion, which is an indispensable link in a public opinion system.
However, the public opinion grade prediction function in existing public opinion systems has the following problems:
1. Public opinion is currently monitored, and its grade analyzed, mainly by manual work. Although this is effective for a single public opinion event, the collection of public opinion is limited and events are not discovered in time, so an automatic public opinion analysis system is needed to assist manual prediction;
2. Existing public opinion systems lack domain focus. Because of their generality they cover all fields, yet the points of concern differ from field to field, so public opinion cannot be analyzed in a one-size-fits-all way; at present there is almost no public opinion event grade prediction targeted at the public inspection field;
3. The existing technology mainly relies on data mining algorithms, whose effect is limited because such algorithms do not learn; deep learning methods, on the other hand, require a uniformly constructed corpus to learn from, which is why existing methods still mainly analyze with mining algorithms.
Disclosure of Invention
The invention provides a public opinion grade prediction method facing the public inspection field based on deep learning, which addresses the above problems and assists manual public opinion monitoring.
The invention is realized by the following technical scheme:
a public opinion grade prediction method facing the public inspection field based on deep learning comprises the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface.
Further, the step 1 specifically includes the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
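By way of illustration only, steps J1.1 to J1.4 can be sketched in Python as follows; the JSON field names (text, comments, reposts_count), the fallback encoding and the exact regular expressions are assumptions for this sketch, not part of the claimed method.

```python
import json
import re

def load_utf8_json(path):
    """Step J1.1: read a crawled JSON file, re-decoding only if it is not UTF-8."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except UnicodeDecodeError:
        with open(path, encoding="gbk") as f:   # assumed fallback encoding
            return json.load(f)

def clean_text(text):
    """Step J1.3: keep Chinese, English and numeric characters; drop links, tags, emoticons."""
    text = re.sub(r"https?://\S+", "", text)                  # web links
    text = re.sub(r"<[^>]+>|#[^#]*#|\[[^\]]*\]", "", text)    # tags and emoticon markup
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、]", " ", text).strip()

record = load_utf8_json("weibo_item.json")                    # hypothetical file name
doc = {                                                        # step J1.2: extracted fields
    "reposts": record.get("reposts_count", 0),                # forwarding amount
    "comments": [clean_text(c) for c in record.get("comments", [])],
    "body": clean_text(record.get("text", "")),
}
# Step J1.4 would insert `doc` as a public opinion record and store the cleaned text
# in a separate collection linked to it (see the MongoDB sketch further below).
```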
Further, the step 1 specifically includes the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
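Similarly, a minimal sketch of the HTML branch (steps H1.2 to H1.3); the regular expressions are illustrative and assume a simple page layout.

```python
import re

def extract_html_body(html):
    """Steps H1.2-H1.3: retrieve the body tag, strip markup, keep Chinese/English/numeric text."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html, flags=re.S | re.I)
    body = match.group(1) if match else html
    body = re.sub(r"<script.*?</script>|<style.*?</style>", "", body, flags=re.S | re.I)
    body = re.sub(r"<[^>]+>", "", body)                        # remaining tags
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、]", " ", body).strip()
```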
Further, the step 2 of constructing a corpus by using the database includes the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
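A single corpus file produced by steps 2.1-2.2 might look like the following; the field names and the grade scale are illustrative assumptions, since the patent does not fix a particular schema.

```python
import json

# Hypothetical structure of one labeled public opinion event.
event = {
    "event_name": "某公检法舆情事件",
    "news": ["清洗后的新闻正文……"],
    "weibo": ["相关微博文本……"],
    "comments": ["相关评论……"],
    "reposts_count": 3200,
    "comments_count": 1500,
    "level": 3,                      # assumed grade label scale, e.g. 1 (low) to 5 (high)
}
with open("event_0001.json", "w", encoding="utf-8") as f:
    json.dump(event, f, ensure_ascii=False, indent=2)
```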
Further, a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model; the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
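A minimal sketch of this Word2Vec training, assuming Jieba for word segmentation and gensim for Word2Vec; the hyperparameters are illustrative.

```python
import re
import jieba
from gensim.models import Word2Vec

def split_sentences(text):
    """Step 2.2: split a cleaned news text into sentences on end-of-sentence punctuation."""
    return [s for s in re.split(r"[。！？!?]", text) if s.strip()]

# news_texts: cleaned news bodies read from the system database (assumed available here).
news_texts = ["……"]
sentences = [list(jieba.cut(s)) for t in news_texts for s in split_sentences(t)]

# Step 2.3: train Word2Vec on the segmented sentences (sg=0 selects CBOW, as in Fig. 3).
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=2, sg=0)
w2v.save("word2vec.model")
```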
Further, the keyword extraction algorithm applied to the collected text information in step 2 is an improved TextRank algorithm, in which the edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

where w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively. The initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that determines whether sentence length or semantic relevance has more influence on the initial node value.
The algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

where Rank(V_i) is the keyword score of node i; w_ij is the weight of the edge between node i and node j; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes pointed to by V_j; Score(V_j) is the node weight; and β is the harmonic coefficient with value in [0,1]. The iteration ends when Rank(V_i) converges.
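The improved TextRank above can be sketched as follows, assuming cosine similarity as the word-vector similarity and the iteration formula reconstructed above; the window size, β and the uniform initialisation are illustrative choices (the patent's exact initial-weight formula is given only as an image).

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def improved_textrank(words, w2v, window=3, beta=0.85, iters=50, topk=5):
    """Improved TextRank: edge weight w_ij = co-occurrence count c_ij + word-vector similarity.

    `words` is a segmented sentence/text; `w2v` is a word-vector lookup (e.g. gensim KeyedVectors)."""
    vocab = [w for w in dict.fromkeys(words) if w in w2v]
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    C = np.zeros((n, n))                     # co-occurrence counts within the window
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            a, b = words[i], words[j]
            if a in idx and b in idx and a != b:
                C[idx[a], idx[b]] += 1
                C[idx[b], idx[a]] += 1
    W = np.zeros((n, n))
    for a in vocab:
        for b in vocab:
            i, j = idx[a], idx[b]
            if i != j and C[i, j] > 0:
                W[i, j] = C[i, j] + cos(w2v[a], w2v[b])
    # Node scores start uniform here (the patent's length/similarity-based initial weight
    # is not reproduced) and are updated with the weighted TextRank iteration.
    score = np.ones(n) / max(n, 1)
    for _ in range(iters):
        out_sum = W.sum(axis=1) + 1e-8       # sum of edge weights leaving each node j
        score = (1 - beta) + beta * W.T.dot(score / out_sum)
    return [vocab[i] for i in np.argsort(-score)[:topk]]
```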
Further, the deep learning model in step 2 converts the news title into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network then extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification.
After the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database, and the prediction result is stored as a field of each public opinion record.
Further, the deep learning model is an RNN recognition model, a Bi-LSTM network structure is used as a model core, and the construction steps are as follows:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

where W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
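A minimal PyTorch sketch of the Bi-LSTM classifier of steps RB2.1-RB2.5; the layer sizes, the number of numeric features and the number of grade classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OpinionLevelModel(nn.Module):
    """Sketch of the Bi-LSTM classifier: title branch with attention, keyword branch,
    numeric features, concatenation, linear layer and softmax. Sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=128, hidden=64, num_feats=3, num_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                  # Embedding Layer (RB2.1)
        self.title_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.kw_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att_proj = nn.Linear(2 * hidden, 2 * hidden)             # W_w, b_w
        self.att_u = nn.Linear(2 * hidden, 1, bias=False)             # u_w
        self.fc = nn.Linear(2 * hidden + 2 * hidden + num_feats, num_classes)

    def forward(self, title_ids, kw_ids, numbers):
        # RB2.2: title context from the first Bi-LSTM, then attention pooling
        h_t, _ = self.title_lstm(self.emb(title_ids))                 # [fwOutput_1, bwOutput_1]
        e = torch.tanh(self.att_proj(h_t))                            # e_ij = tanh(W_w h_ij + b_w)
        a = torch.softmax(self.att_u(e), dim=1)                       # alpha_ij
        title = (a * h_t).sum(dim=1)                                  # title vector
        # RB2.3: keyword semantics from the second Bi-LSTM (final states of both directions)
        _, (h_k, _) = self.kw_lstm(self.emb(kw_ids))
        keyword = torch.cat([h_k[0], h_k[1]], dim=-1)                 # [fwOutput_2, bwOutput_2]
        # RB2.4-RB2.5: concatenate with normalised numeric features, map and classify
        x = torch.cat([title, keyword, numbers], dim=-1)
        return torch.softmax(self.fc(x), dim=-1)
```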
Further, for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set; the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

where y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i. The advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
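A sketch of this training setup: 8:2 split, Adam, and a weighted cross-entropy over the softmax outputs. The scikit-learn split, the inverse-frequency form of weight_i and the loader load_labeled_events are assumptions (the patent gives its exact weight formula only as an image); classes are assumed to be encoded as 0..K-1.

```python
import torch
import torch.nn as nn
from collections import Counter
from sklearn.model_selection import train_test_split

dataset = load_labeled_events("corpus/")        # hypothetical loader for the labeled JSON files
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)  # 8:2 split

# Class weights: one assumed inverse-frequency form, so under-represented grades weigh more.
counts = Counter(example["level"] for example in train_data)
total = sum(counts.values())
weight = torch.tensor([total / (len(counts) * counts[c]) for c in sorted(counts)])

def weighted_cross_entropy(probs, targets):
    """Loss = -sum_i weight_i * t_i * log(y_i), applied to the model's softmax outputs."""
    log_p = torch.log(probs.clamp_min(1e-8))
    return nn.functional.nll_loss(log_p, targets, weight=weight)

model = OpinionLevelModel(vocab_size=50000)      # the Bi-LSTM sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```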
Further, the database used in step 3 is a MongoDB database, and data records are stored in the form of a dictionary. In a database, according to the prediction result of the model, identifying each public sentiment event record; and designing a data reading interface to read one public sentiment event record from the database every time, so that the public sentiment event record can be called by other systems.
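A minimal pymongo sketch of steps 3-5; the connection string, database and collection names, and the predicted_level field are assumptions for this sketch.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")            # assumed connection string
events = client["opinion_system"]["public_opinion_events"]   # assumed database/collection names

def mark_prediction(event_id, level):
    """Steps 3-4: store the predicted grade back on the event record as a field."""
    events.update_one({"_id": event_id}, {"$set": {"predicted_level": int(level)}})

def read_one_event(level=None):
    """Step 5: data reading interface, returning one event record per call."""
    query = {"predicted_level": level} if level is not None else {}
    return events.find_one(query)
```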
The invention has the beneficial effects that:
the invention extracts the text characteristic and the data characteristic of the public sentiment event by capturing the public sentiment event information related to the public inspection field on the network and combining the deep learning and natural language processing technology to complete the prediction of the grade of the public sentiment event.
The method of the invention enables the staff not to pay attention to the public sentiment event by manpower, realizes automatic extraction of text characteristics and data characteristics to finish public sentiment grade prediction, and reduces the situations of incomplete information collection of the public sentiment event, untimely public sentiment discovery and insufficient public sentiment response.
The invention provides a reading interface by storing the data in the system database, thereby facilitating the data retrieval of managers and developers.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of training data set construction in the present invention.
FIG. 3 is a CBOW model diagram of the Word2vec model in the present invention.
FIG. 4 is a diagram of a Bi-LSTM-based classification model in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A public opinion grade prediction method facing the public inspection field based on deep learning comprises the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface.
Further, the step 1 specifically includes the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
Further, the step 1 specifically includes the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
Further, the step 2 of constructing a corpus by using the database includes the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
The corpus is used in deep learning.
Further, a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model; the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
Further, the keyword extraction algorithm applied to the collected text information in step 2 is an improved TextRank algorithm. TextRank is a random-walk graph model: a text (usually a sentence) is segmented into words, the words are treated as nodes, and an edge exists between two nodes if and only if the words they represent appear together (co-occur) in the original text. The edges and nodes are given initial weights; after the iterative computation of the graph model converges, the node weight expresses how key the node is, and sorting the nodes by this value yields the keyword result.
The edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

where w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively.
The initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that determines whether sentence length or semantic relevance has more influence on the initial node value.
The algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

where Rank(V_i) is the keyword score of node i; w_ij is the weight of the edge between node i and node j; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes pointed to by V_j; Score(V_j) is the node weight; and β is the harmonic coefficient with value in [0,1]. The iteration ends when Rank(V_i) converges.
Further, the deep learning model in step 2 learns semantic information by modeling the text title and the text keywords, and makes the prediction by combining this semantic information with the data information (forwarding amount, comment amount, and the like); this is a classification problem. The news title is converted into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network then extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification.
After the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database (excluding the records used for training), and the prediction result is stored as a field of each public opinion record.
Further, the deep learning model is an RNN (Recurrent Neural Network) recognition model with a Bi-LSTM (Bidirectional Long Short-Term Memory) network structure as its core, constructed through the following steps:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

where W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
Further, for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set; the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

where y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i. The advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
Further, the database used in step 3 is a MongoDB database, and data records are stored in the form of a dictionary. Marking each public opinion event record in a database according to the prediction result of the model; and designing a data reading interface to read one public opinion event record from the database every time, wherein the public opinion event record can be called by other systems.
This example was carried out according to the scheme shown in FIG. 1. The system built with the invention is divided into two parts: an algorithm part and a data storage part. The algorithm part mainly comprises public opinion information acquisition, public opinion text extraction and cleaning, feature extraction and keyword extraction, and model prediction; the data storage part mainly stores the crawled public opinion related information and updates the identification in the database after model prediction.
After the system is started, the trained model is loaded into the memory, and meanwhile, a crawler module crawls news texts, comments, forwarding data and the like related to public sentiment events from the network and stores the data in a system database.
Public opinion related data are extracted from the database and arranged as the numeric features of the model input; the news texts are cleaned with regular expressions and the like, keywords are extracted from them with the improved TextRank algorithm as the keyword feature of the model, and the public opinion news title serves as the title feature of the model; the three features are combined and input into the model to complete the prediction.
The prediction result is stored in the system database. When an abnormality occurs for a public opinion event, that event is skipped and the next one is predicted; after all public opinion events have been processed, the abnormal ones are predicted again, and the cycle repeats.
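The prediction cycle of this embodiment can be sketched as follows; build_numeric_features, segment and predict_level are hypothetical helpers standing in for the feature assembly and model inference described above.

```python
def run_prediction_cycle(events, model, w2v):
    """Predict every unprocessed event; skip abnormal ones and retry them on the next pass."""
    failed = []
    for record in events.find({"predicted_level": {"$exists": False}}):
        try:
            numbers = build_numeric_features(record)                       # hypothetical helper
            keywords = improved_textrank(segment(record["body"]), w2v.wv)  # keyword feature
            level = predict_level(model, record["title"], keywords, numbers)
            events.update_one({"_id": record["_id"]}, {"$set": {"predicted_level": level}})
        except Exception:
            failed.append(record["_id"])    # abnormal public opinion event: skip for now
    return failed                           # retried in the next cycle
```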

Claims (8)

1. A public opinion grade prediction method facing the public inspection field based on deep learning, characterized by comprising the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface;
the algorithm for extracting keywords from the collected text information in step 2 is an improved TextRank algorithm, in which the edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

wherein w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively; the initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that indicates whether sentence length or semantic relevance has more influence on the initial node value;
the algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

wherein Rank(V_i) represents the keyword score of node i; w_ij represents the weight of the edge between node i and node j; In(V_i) represents the set of nodes pointing to V_i; Out(V_j) represents the set of nodes pointed to by V_j; Score(V_j) represents the node weight; and β represents the harmonic coefficient with value in [0,1]; the iteration ends when Rank(V_i) converges;
the deep learning model in step 2 converts the news title into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification;
after the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database, and the prediction result is stored as a field of each public opinion record.
2. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 1 specifically comprises the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
3. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 1 specifically comprises the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
4. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 2 of constructing a corpus by using a database comprises the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
5. The public opinion grade prediction method facing the public inspection field based on deep learning according to claim 1, wherein a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model, and the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
6. The public opinion grade prediction method facing to the public inspection field based on deep learning according to claim 1, wherein the deep learning model is an RNN recognition model, a Bi-LSTM network structure is used as a model core, and the construction steps are as follows:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

wherein W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
7. The public opinion grade prediction method facing the public inspection field based on deep learning according to claim 1, wherein for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set, the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

wherein y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i; the advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better; the class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
8. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the database used in step 3 is a MongoDB database, storing data records in the form of dictionary; in a database, according to the prediction result of the model, identifying each public sentiment event record; and designing a data reading interface to read one public sentiment event record from the database every time, so that the public sentiment event record can be called by other systems.
CN202110608376.0A 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field Active CN113312532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608376.0A CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608376.0A CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Publications (2)

Publication Number Publication Date
CN113312532A CN113312532A (en) 2021-08-27
CN113312532B true CN113312532B (en) 2022-10-21

Family

ID=77376787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608376.0A Active CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Country Status (1)

Country Link
CN (1) CN113312532B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium
KR20200137924A (en) * 2019-05-29 2020-12-09 경희대학교 산학협력단 Real-time keyword extraction method and device in text streaming environment
CN112733538A (en) * 2021-01-19 2021-04-30 广东工业大学 Ontology construction method and device based on text

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220352B (en) * 2017-05-31 2020-12-08 北京百度网讯科技有限公司 Method and device for constructing comment map based on artificial intelligence
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN111008274B (en) * 2019-12-10 2021-04-06 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN112131863B (en) * 2020-08-04 2022-07-19 中科天玑数据科技股份有限公司 Comment opinion theme extraction method, electronic equipment and storage medium
CN112800211A (en) * 2020-12-31 2021-05-14 江苏网进科技股份有限公司 Method for extracting critical information of criminal process in legal document based on TextRank algorithm
CN112860906B (en) * 2021-04-23 2021-07-16 南京汇宁桀信息科技有限公司 Market leader hot line and public opinion decision support method and system based on natural language processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
KR20200137924A (en) * 2019-05-29 2020-12-09 경희대학교 산학협력단 Real-time keyword extraction method and device in text streaming environment
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium
CN112733538A (en) * 2021-01-19 2021-04-30 广东工业大学 Ontology construction method and device based on text

Also Published As

Publication number Publication date
CN113312532A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN111581396B (en) Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN110135457A (en) Event trigger word abstracting method and system based on self-encoding encoder fusion document information
CN111639171A (en) Knowledge graph question-answering method and device
CN110532398B (en) Automatic family map construction method based on multi-task joint neural network model
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN113806563A (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN114970508A (en) Power text knowledge discovery method and device based on data multi-source fusion
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN115564393A (en) Recruitment requirement similarity-based job recommendation method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115599899A (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN113378024B (en) Deep learning-oriented public inspection method field-based related event identification method
CN115455202A (en) Emergency event affair map construction method
CN113343701B (en) Extraction method and device for text named entities of power equipment fault defects
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112329442A (en) Multi-task reading system and method for heterogeneous legal data
CN113312532B (en) Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN111898034A (en) News content pushing method and device, storage medium and computer equipment
CN116975161A (en) Entity relation joint extraction method, equipment and medium of power equipment partial discharge text
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
CN111309933B (en) Automatic labeling system for cultural resource data
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
CN112579666A (en) Intelligent question-answering system and method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant