CN113312532B - Public opinion grade prediction method based on deep learning and oriented to public inspection field - Google Patents
- Publication number
- CN113312532B (application CN202110608376.0A)
- Authority
- CN
- China
- Prior art keywords
- public
- public opinion
- text
- database
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a deep-learning-based public opinion grade prediction method oriented to the public inspection field. Step 1: crawl public opinion information related to the public inspection field from the network, extract the text information it contains, and store it in a database; Step 2: predict a public opinion grade for the collected text information with a deep learning model; Step 3: store the prediction result of step 2 in the system database; Step 4: attach a corresponding identification to the public opinion grade in the database; Step 5: provide a data interface through which the identified public opinion grades and their information can be accessed; Step 6: display the public opinion grade prediction result in the system through the data interface. The method addresses the lack of domain focus in existing public opinion systems and escapes the limitations of non-learning algorithms.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a deep-learning-based public opinion grade prediction method oriented to the public inspection field.
Background
Public opinion monitoring, or public opinion early warning, is a complex technology that crosses social science and data science: at the initial stage of a public opinion event, a preliminary grade prediction is needed so that a response can be fully prepared. Automating this task requires extensive data and a reliable model for support.
Descriptions of public opinion events mainly come from news texts on network media and on social platforms such as the Sina microblog; by reading, forwarding, and commenting, people inform others directly or learn about an event indirectly from others. A public opinion system can extract features from this event information (text features, describing the specifics of an event; data features, describing how it propagates) to analyze how the public opinion is currently fermenting and spreading. The key point is to analyze the early-warning grade of the public opinion, an indispensable link in a public opinion system.
However, the public opinion grade prediction function in existing public opinion systems has the following problems:
1. Public opinion is currently supervised and graded mainly by hand. Although this is effective for a single event, manual collection of public opinion is limited and events are often discovered late, so an automatic public opinion analysis system is needed to assist manual prediction;
2. Existing public opinion systems lack domain focus. Because of their generality they cover all fields, yet what people care about differs from field to field, so public opinion cannot be analyzed in one universal way; grade prediction targeted at the public inspection field is at present almost nonexistent;
3. The existing technology mainly relies on data mining algorithms for analysis, but their non-learnable nature limits the effect; a deep learning method instead requires a uniformly constructed corpus to learn from, which is why existing methods still mainly use mining algorithms.
Disclosure of Invention
The invention provides a deep-learning-based public opinion grade prediction method oriented to the public inspection field, which solves the above problems and assists manual public opinion monitoring.
The invention is realized by the following technical scheme:
a public opinion grade prediction method facing the public inspection field based on deep learning comprises the following steps:
step 1: crawling public opinion information related to the public inspection method field from the network, extracting text information in the public inspection method information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
and step 3: storing the public opinion grade prediction result in the step 2 into a system database;
and 4, step 4: making corresponding identification on the public opinion grade in a database;
and 5: providing a data interface capable of accessing public opinion information for the public opinion grade in the identified database;
step 6: and displaying the public opinion grade prediction result in the system through a data interface.
Further, step 1 specifically comprises the following steps:
Step J1.1: crawl an original JSON file and judge whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step J1.2: search the UTF-8 encoded JSON file and extract the forwarding amount, the comment text, and the body text;
Step J1.3: clean the extracted text with regular expressions, keeping Chinese, English, and numeric characters, removing web links, tags, and emoticons, and filtering out the remaining characters;
Step J1.4: create the related public opinion record in the database from the extracted text, store the cleaned public opinion text in a separate database, and link it to the corresponding public opinion record.
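The cleaning of step J1.3 can be given as a minimal sketch; the concrete regular expressions below are illustrative assumptions, not the patent's exact patterns:

```python
import re

def clean_text(raw: str) -> str:
    """Clean crawled text per step J1.3: keep Chinese, English and numeric
    characters; strip links, tags, and emoticons (patterns are illustrative)."""
    text = re.sub(r'https?://\S+', '', raw)          # web links
    text = re.sub(r'<[^>]+>', '', text)              # HTML-style tags
    text = re.sub(r'#[^#]*#', '', text)              # microblog topic tags
    text = re.sub(r'\[[^\[\]]{1,8}\]', '', text)     # emoticons like [doge]
    # keep only Chinese characters, ASCII letters, digits, and spaces
    text = re.sub(r'[^\u4e00-\u9fa5A-Za-z0-9 ]', '', text)
    return re.sub(r'\s+', ' ', text).strip()
```

For example, `clean_text('转发了 http://t.cn/abc [doge] <b>新闻</b>2021!')` keeps only the Chinese words and the digits.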
Further, when the crawled source is an HTML file, step 1 specifically comprises the following steps:
Step H1.1: crawl an original HTML file and judge whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step H1.2: retrieve the body tags in the HTML text and extract the body text with a regular expression;
Step H1.3: clean the text of step H1.2 with regular expressions again, keeping Chinese, English, and numeric characters, removing useless information, and filtering out the remaining characters;
Step H1.4: store the cleaned public opinion text in a separate database and link it to the corresponding public opinion record.
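Steps H1.2 and H1.3 might look like the following sketch (the regex patterns are assumptions; a production system may prefer an HTML parser, but the patent describes regular expressions):

```python
import re

def extract_body(html: str) -> str:
    """Step H1.2 sketch: pull the text between <body>...</body>, then drop
    scripts, styles, and residual tags with regular expressions."""
    m = re.search(r'<body[^>]*>(.*?)</body>', html, re.S | re.I)
    body = m.group(1) if m else html
    body = re.sub(r'<script.*?</script>|<style.*?</style>', '', body,
                  flags=re.S | re.I)
    body = re.sub(r'<[^>]+>', ' ', body)   # strip remaining tags
    return re.sub(r'\s+', ' ', body).strip()
```

The returned string would then go through the same character-level cleaning as the JSON branch.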
Further, constructing a corpus from the database for step 2 comprises the following steps:
Step 2.1: extract a number of public opinion data records in the public inspection field from the database, including but not limited to public opinion event names, related news, related microblogs, and related comments;
Step 2.2: organize the records of step 2.1 by public opinion event, arrange each event into a JSON file, and label its grade according to the news content of the event, the media participating in its propagation, the forwarding amount, and the comment amount;
Step 2.3: check whether the grade labels of step 2.2 contain obvious errors; if so, go to step 2.4; if not, go to step 2.5;
Step 2.4: re-label the erroneous files;
Step 2.5: stop labeling; corpus construction is finished.
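The per-event JSON file of step 2.2 could, under assumed field names (the patent does not fix a schema), look like:

```python
import json

# One public opinion event arranged into a JSON record with a grade label.
# All field names here are illustrative assumptions.
event = {
    "event_name": "某公检法舆情事件",
    "news": ["新闻正文……"],
    "weibo": ["相关微博……"],
    "comments": ["相关评论……"],
    "forward_count": 1520,
    "comment_count": 433,
    "grade": 3,          # manually labelled public opinion grade
}
record = json.dumps(event, ensure_ascii=False)
```

Each such record would be written to its own JSON file, forming the corpus.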
Further, the technical premise for natural language processing of the collected text in step 2 is a Word2Vec model that maps words into vector form; the Word2Vec model is trained through the following steps:
Step 2.1: extract the news texts crawled by the crawler module from the system database and clean them with regular expressions;
Step 2.2: split each news text into sentences with natural language sentence-splitting techniques, and split each sentence into words with Jieba;
Step 2.3: feed the segmented words, sentence by sentence, into a Word2Vec model for training to obtain vector representations of the words.
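The sentence splitting of step 2.2 can be sketched with the standard library; the resulting token lists would then be handed to a word-segmentation tool and a Word2Vec implementation (the gensim call in the comment is an assumption about tooling, not mandated by the patent):

```python
import re

def split_sentences(text: str) -> list:
    """Step 2.2 sketch: split a cleaned news text into sentences on
    Chinese and Western sentence-final punctuation."""
    parts = re.split(r'[。！？!?]+', text)
    return [p.strip() for p in parts if p.strip()]

# Each sentence would then be cut into words (e.g. jieba.lcut(sentence))
# and the token lists passed to a Word2Vec trainer, roughly:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=token_lists, vector_size=100, min_count=1)
```

The trained model then maps each vocabulary word to a dense vector used throughout the rest of the method.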
Further, the keyword extraction algorithm applied to the collected text in step 2 is an improved TextRank algorithm, whose edge weight is calculated as follows:
w_ij = c_ij + sim(vec(V_i), vec(V_j))
wherein w_ij represents the weight of the edge between node i and node j, i.e. the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively. The initial weight of a node is calculated as follows:
Score(V_i) = α·(1/n) + (1-α)·(1/n)·Σ_j sim(vec(V_i), vec(V_j))
wherein n represents the number of words in the sentence, vec(V_i) and vec(V_j) are as indicated previously, and α represents an offset coefficient taking a value in [0,1], indicating which of sentence length and semantic relevance has more influence on the initial value of the node.
The algorithm iteration formula is as follows:
Rank(V_i) = (1-β)·Score(V_i) + β·Σ_{V_j∈In(V_i)} ( w_ji / Σ_{V_k∈Out(V_j)} w_jk )·Rank(V_j)
wherein Rank(V_i) represents the keyword score of node i; w_ij represents the weight of the edge between node i and node j; In(V_i) represents the set of nodes pointing to node V_i; Out(V_j) represents the set of nodes that node V_j points to; Score(V_j) represents the node weight; β represents a harmonic coefficient taking a value in [0,1]. After iteration, the algorithm ends when Rank(V_i) converges.
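The edge weight just described can be computed directly, assuming cosine similarity as the word-vector similarity measure (the patent names a similarity but does not fix which one):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edge_weight(cooccur_count, vec_i, vec_j):
    """w_ij = co-occurrence count plus word-vector similarity, as the
    improved TextRank edge weight is described."""
    return cooccur_count + cosine(vec_i, vec_j)
```

For identical word vectors the similarity term contributes 1, so two words that co-occur twice get an edge weight of 3.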
Further, the deep learning model of step 2 converts news headlines into vector form through word segmentation, with word vectors obtained from the pretrained Word2vec model; keywords are extracted from the news text by the improved TextRank algorithm and likewise expressed as word vectors; a deep neural network then extracts the text features, i.e. obtains a further semantic representation of them; finally, the data information is concatenated with this semantic representation to complete the classification.
After the model is built, it is trained on the constructed data set; the model then performs grade prediction for the other public opinion records in the database, and the prediction result is stored as a field of each record.
Further, the deep learning model is an RNN model with a Bi-LSTM network structure at its core, constructed through the following steps:
Step RB2.1: map the titles and keywords input to the network into vectors using the word vectors (the Embedding Layer);
Step RB2.2: obtain the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1]; then compute a semantic representation vector of the title, denoted title, with an attention mechanism:
e_ij = tanh(W_w·h_ij + b_w)
α_ij = exp(u_w^T·e_ij) / Σ_k exp(u_w^T·e_ik)
title = Σ_j α_ij·h_ij
wherein h_ij is the Bi-LSTM output for the j-th word of the i-th title, W_w, b_w, u_w are the parameters to be learned, and α_ij represents the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
Step RB2.3: obtain the semantic information of the keywords with a second bidirectional LSTM layer, concatenating the forward output fwOutput_2 and backward output bwOutput_2 into the vector [fwOutput_2, bwOutput_2], denoted keyword;
Step RB2.4: normalize the numeric features so that every value lies between -1 and 1, and concatenate them into a vector denoted number;
Step RB2.5: concatenate the title, keyword, and number vectors into [title, keyword, number], map the result through a linear layer, and finally feed it to the output layer, obtaining the output with a softmax function.
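The normalization of step RB2.4 is, in the simplest reading, a min-max scaling into [-1, 1], for example:

```python
def normalize_features(values):
    """Step RB2.4 sketch: min-max scale numeric features (forwarding
    count, comment count, ...) into [-1, 1] before concatenation."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

The scaled values are then concatenated into the number vector of step RB2.5.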
Further, in the training of the deep learning model of step 2, the data set of step 1 is divided 8:2 into a training set and a test set. The optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as follows:
Loss = -Σ_i weight_i · t_i · log(y_i)
wherein y_i represents the probability that the input is predicted as class i, t_i is the corresponding true label, and weight_i represents the weight value of class i. The advantage of using weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The weight_i are calculated as follows:
weight_i = max_j(num_j) / num_i
wherein num_i indicates the number of samples of class i.
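One plausible realization of the class weights is inverse-frequency weighting; scaling by the largest class count is an assumption here, since the text only states that the weights derive from the per-class counts num_i:

```python
def class_weights(counts):
    """Inverse-frequency weights for a weighted cross-entropy loss:
    rarer public opinion grades get proportionally larger weights.
    counts[i] is num_i, the number of samples of class i."""
    biggest = max(counts)
    return [biggest / c for c in counts]
```

With counts [100, 50, 25], the rarest grade is weighted four times as heavily as the most common one.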
Further, the database used in step 3 is a MongoDB database, with data records stored as dictionaries. In the database, each public opinion event record is tagged according to the model's prediction result, and a data reading interface is designed that reads one public opinion event record from the database at a time, so that the record can be called by other systems.
The invention has the following beneficial effects:
By capturing public opinion event information related to the public inspection field on the network and combining deep learning with natural language processing, the invention extracts the text features and data features of a public opinion event and completes the prediction of its grade.
With the method, staff no longer need to follow public opinion events manually: text features and data features are extracted automatically to complete the grade prediction, reducing incomplete event information collection, late discovery of public opinion, and insufficient response.
By storing the data in the system database and providing a reading interface, the invention makes data retrieval convenient for managers and developers.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of training data set construction in the present invention.
FIG. 3 is a CBOW model diagram of the Word2vec model in the present invention.
FIG. 4 is a diagram of a Bi-LSTM-based classification model in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A deep-learning-based public opinion grade prediction method oriented to the public inspection field comprises the following steps:
Step 1: crawl public opinion information related to the public inspection field from the network, extract the text information it contains, and store it in a database;
Step 2: predict a public opinion grade for the collected text information with a deep learning model;
Step 3: store the prediction result of step 2 in the system database;
Step 4: attach a corresponding identification to the public opinion grade in the database;
Step 5: provide a data interface through which the identified public opinion grades and their information can be accessed;
Step 6: display the public opinion grade prediction result in the system through the data interface.
Further, step 1 specifically comprises the following steps:
Step J1.1: crawl an original JSON file and judge whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step J1.2: search the UTF-8 encoded JSON file and extract the forwarding amount, the comment text, and the body text;
Step J1.3: clean the extracted text with regular expressions, keeping Chinese, English, and numeric characters, removing web links, tags, and emoticons, and filtering out the remaining characters;
Step J1.4: create the related public opinion record in the database from the extracted text, store the cleaned public opinion text in a separate database, and link it to the corresponding public opinion record.
Further, when the crawled source is an HTML file, step 1 specifically comprises the following steps:
Step H1.1: crawl an original HTML file and judge whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step H1.2: retrieve the body tags in the HTML text and extract the body text with a regular expression;
Step H1.3: clean the text of step H1.2 with regular expressions again, keeping Chinese, English, and numeric characters, removing useless information, and filtering out the remaining characters;
Step H1.4: store the cleaned public opinion text in a separate database and link it to the corresponding public opinion record.
Further, constructing a corpus from the database for step 2 comprises the following steps:
Step 2.1: extract a number of public opinion data records in the public inspection field from the database, including but not limited to public opinion event names, related news, related microblogs, and related comments;
Step 2.2: organize the records of step 2.1 by public opinion event, arrange each event into a JSON file, and label its grade according to the news content of the event, the media participating in its propagation, the forwarding amount, and the comment amount;
Step 2.3: check whether the grade labels of step 2.2 contain obvious errors; if so, go to step 2.4; if not, go to step 2.5;
Step 2.4: re-label the erroneous files;
Step 2.5: stop labeling; corpus construction is finished.
The corpus is used in deep learning.
Further, the technical premise for natural language processing of the collected text in step 2 is a Word2Vec model that maps words into vector form; the Word2Vec model is trained through the following steps:
Step 2.1: extract the news texts crawled by the crawler module from the system database and clean them with regular expressions;
Step 2.2: split each news text into sentences with natural language sentence-splitting techniques, and split each sentence into words with Jieba;
Step 2.3: feed the segmented words, sentence by sentence, into a Word2Vec model for training to obtain vector representations of the words.
Further, the keyword extraction algorithm applied to the collected text in step 2 is an improved TextRank algorithm. TextRank is a random-walk graph model: a text (usually a sentence) is divided into words, the words are regarded as nodes, and an edge exists between two nodes if and only if the words they represent appear together in the original text (called co-occurrence). Edges and nodes are given initial weights; after the iteratively computed graph model converges, the node weights represent the importance of the nodes, and sorting the nodes by this value yields the keyword result.
The edge weight is calculated as follows:
w_ij = c_ij + sim(vec(V_i), vec(V_j))
wherein w_ij represents the weight of the edge between node i and node j, i.e. the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively.
The initial weight of a node is calculated as follows:
Score(V_i) = α·(1/n) + (1-α)·(1/n)·Σ_j sim(vec(V_i), vec(V_j))
wherein n represents the number of words in the sentence, vec(V_i) and vec(V_j) are as indicated previously, and α represents an offset coefficient taking a value in [0,1], indicating which of sentence length and semantic relevance has more influence on the initial value of the node.
The algorithm iteration formula is as follows:
Rank(V_i) = (1-β)·Score(V_i) + β·Σ_{V_j∈In(V_i)} ( w_ji / Σ_{V_k∈Out(V_j)} w_jk )·Rank(V_j)
wherein Rank(V_i) represents the keyword score of node i; w_ij represents the weight of the edge between node i and node j; In(V_i) represents the set of nodes pointing to node V_i; Out(V_j) represents the set of nodes that node V_j points to; Score(V_j) represents the node weight; β represents a harmonic coefficient taking a value in [0,1]. After iteration, the algorithm ends when Rank(V_i) converges.
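The iteration can be sketched in a few lines, assuming the standard weighted-TextRank recurrence over the symbols named in the text (edge weights w_ij, initial scores Score(V_i), harmonic coefficient β):

```python
def textrank(weights, init_score, beta=0.85, tol=1e-6, max_iter=200):
    """Iterate Rank(V_i) = (1-beta)*Score(V_i)
                           + beta * sum_j (w_ji / sum_k w_jk) * Rank(V_j)
    over a dense weight matrix until the scores converge."""
    n = len(weights)
    rank = list(init_score)
    out_sum = [sum(row) for row in weights]   # sum_k w_jk for each node j
    for _ in range(max_iter):
        new = []
        for i in range(n):
            s = sum(weights[j][i] / out_sum[j] * rank[j]
                    for j in range(n)
                    if weights[j][i] > 0 and out_sum[j] > 0)
            new.append((1 - beta) * init_score[i] + beta * s)
        if max(abs(a - b) for a, b in zip(new, rank)) < tol:
            rank = new
            break
        rank = new
    return rank
```

On a symmetric two-node graph with equal initial scores, the scores stay balanced at their fixed point, as expected.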
Further, the deep learning model of step 2 learns semantic information by modeling the text title and text keywords, and makes its prediction by combining this semantic information with data information (forwarding amount, comment amount, and the like); the task is a classification problem. News titles are converted into vector form through word segmentation, with word vectors obtained from the pretrained Word2vec model; keywords are extracted from the news text by the improved TextRank algorithm and likewise expressed as word vectors; a deep neural network then extracts the text features, i.e. obtains a further semantic representation of them; finally, the data information is concatenated with this semantic representation to complete the classification.
After the model is built, it is trained on the constructed data set; the model then performs grade prediction for the other public opinion records in the database (excluding the records used for training), and the prediction result is stored as a field of each record.
Further, the deep learning model is an RNN (Recurrent Neural Network) model with a Bi-LSTM (Bidirectional Long Short-Term Memory) network structure at its core, constructed through the following steps:
Step RB2.1: map the titles and keywords input to the network into vectors using the word vectors (the Embedding Layer);
Step RB2.2: obtain the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1]; then compute a semantic representation vector of the title, denoted title, with an attention mechanism:
e_ij = tanh(W_w·h_ij + b_w)
α_ij = exp(u_w^T·e_ij) / Σ_k exp(u_w^T·e_ik)
title = Σ_j α_ij·h_ij
wherein h_ij is the Bi-LSTM output for the j-th word of the i-th title, W_w, b_w, u_w are the parameters to be learned, and α_ij represents the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
Step RB2.3: obtain the semantic information of the keywords with a second bidirectional LSTM layer, concatenating the forward output fwOutput_2 and backward output bwOutput_2 into the vector [fwOutput_2, bwOutput_2], denoted keyword;
Step RB2.4: normalize the numeric features so that every value lies between -1 and 1, and concatenate them into a vector denoted number;
Step RB2.5: concatenate the title, keyword, and number vectors into [title, keyword, number], map the result through a linear layer, and finally feed it to the output layer, obtaining the output with a softmax function.
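The attention pooling of step RB2.2 can be sketched in plain Python, assuming the standard additive-attention form (a tanh projection followed by a softmax against a learned context vector u_w):

```python
import math

def attention_pool(hidden, W, b, u):
    """Additive attention over Bi-LSTM outputs: e_j = tanh(W·h_j + b),
    alpha = softmax(u·e_j), pooled = sum_j alpha_j * h_j.
    hidden is a list of length-d vectors; W is d×d; b and u are length-d."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    e = [[math.tanh(x + bb) for x, bb in zip(matvec(W, h), b)] for h in hidden]
    scores = [sum(uu * ee for uu, ee in zip(u, ej)) for ej in e]
    mx = max(scores)                       # stabilize the softmax
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    alphas = [x / total for x in exps]
    d = len(hidden[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden)) for k in range(d)]
    return pooled, alphas
```

With zero parameters all words receive equal attention, so the pooled vector is the plain average of the hidden states.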
Further, in the training of the deep learning model of step 2, the data set of step 1 is divided 8:2 into a training set and a test set. The optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as follows:
Loss = -Σ_i weight_i · t_i · log(y_i)
wherein y_i represents the probability that the input is predicted as class i, t_i is the corresponding true label, and weight_i represents the weight value of class i. The advantage of using weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The weight_i are calculated as follows:
weight_i = max_j(num_j) / num_i
wherein num_i indicates the number of samples of class i.
Further, the database used in step 3 is a MongoDB database, with data records stored as dictionaries. Each public opinion event record in the database is marked according to the model's prediction result, and a data reading interface is designed that reads one public opinion event record from the database at a time, so that the record can be called by other systems.
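A sketch of the dictionary-record storage and the one-record-at-a-time reading interface; a plain Python list stands in for the MongoDB collection here (with pymongo, read_one would wrap a query such as find_one — the class and method names are assumptions):

```python
class OpinionStore:
    """Stand-in for the MongoDB-backed store: dictionary records plus a
    reading interface that yields one public opinion record per call."""

    def __init__(self):
        self.records = []
        self._cursor = 0

    def save(self, event_name, grade):
        # each record is a dictionary, mirroring MongoDB's document model
        self.records.append({"event": event_name, "grade": grade})

    def read_one(self):
        """Return the next public opinion event record, or None when
        every record has been handed out."""
        if self._cursor >= len(self.records):
            return None
        rec = self.records[self._cursor]
        self._cursor += 1
        return rec
```

Other systems would call read_one repeatedly until it returns None.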
This embodiment is carried out according to the scheme shown in FIG. 1. A system built with the invention is divided into two parts: an algorithm part and a data storage part. The algorithm part mainly comprises public opinion information acquisition, public opinion text extraction and cleaning, feature extraction, keyword extraction, and model prediction; the data storage part mainly stores the crawled public opinion information and updates the identifications in the database after model prediction.
After the system starts, the trained model is loaded into memory; meanwhile, a crawler module crawls news texts, comments, forwarding data and other material related to public opinion events from the network and stores it in the system database.
The public opinion data are then extracted from the database and sorted into the numerical features of the model input; the news texts are cleaned with regular expressions, and keywords are extracted from them with the improved TextRank algorithm to serve as the keyword features of the model; finally, the public opinion title, serving as the title feature of the model, is input into the model together with the first two features to complete the prediction.
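The regular-expression cleaning step can be sketched as follows; the exact patterns are assumptions in the spirit of "keep Chinese, English and digits; drop links, tags and emoticons":

```python
import re

def clean_text(raw):
    # drop web links and HTML-style tags first
    text = re.sub(r"https?://\S+", "", raw)
    text = re.sub(r"<[^>]+>", "", text)
    # keep Chinese characters, English letters, digits and spaces only
    text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9 ]", "", text)
    # collapse runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("转发<b>微博</b> http://t.cn/abc 热点事件2021!!")
```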
The prediction result is stored in the system database. If an exception occurs, the current public opinion event is skipped and the next one is predicted; once all public opinion events have been processed, the events that raised exceptions are predicted again, and the cycle repeats.
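The skip-and-retry loop described above might look like this sketch, where predict stands in for the model-prediction call:

```python
def predict_all(events, predict):
    # first pass: predict each event, skipping any that raises an exception
    results, failed = {}, []
    for ev in events:
        try:
            results[ev] = predict(ev)
        except Exception:
            failed.append(ev)
    # second pass: retry the failed events once the rest are done
    for ev in failed:
        try:
            results[ev] = predict(ev)
        except Exception:
            pass  # still failing: leave it for the next cycle
    return results

calls = {"a": 0, "b": 0}
def flaky(ev):
    # toy predictor that fails once on event "b", then succeeds
    calls[ev] += 1
    if ev == "b" and calls[ev] == 1:
        raise RuntimeError("transient failure")
    return len(ev)

out = predict_all(["a", "b"], flaky)
```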
Claims (8)
1. A public opinion grade prediction method for the public inspection field based on deep learning, characterized by comprising the following steps:
Step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information it contains, and storing it in a database;
Step 2: predicting the collected text information with a deep learning model to obtain a public opinion grade prediction result;
Step 3: storing the public opinion grade prediction result of step 2 in a system database;
Step 4: marking the public opinion grade accordingly in the database;
Step 5: providing a data interface through which the public opinion information of the marked grades can be accessed;
Step 6: displaying the public opinion grade prediction result in the system through the data interface;
the algorithm for extracting the keywords after the text information is collected in the step 2 is an improved algorithm of TextRank, and the side weight is calculated as follows:
wherein w ij Representing the weight of the edge between the node i and the node j, namely the sum of the co-occurrence frequency and the similarity of the word vector; c. C ij Representing the co-occurrence times of the word pairs represented by the node i and the node j; vec (V) i ) Vector representation, vec (V), representing the word corresponding to node i j ) Representing a word vector corresponding to the node j; the initial weight of the node is calculated as follows:
where n represents the number of words in the sentence, vec (V) i )、vec(V j ) The meaning is the same as that in the above, alpha represents the offset coefficient and takes the value of [0,1]The method represents that which of sentence length and semantic relevance has more influence on the initial value of the node;
the algorithm iteration formula is as follows:
wherein Rank (V) i ) A keyword score representing node i; w is a ij Representing the weight of the edge between node i and node j; in (V) i ) Indicating a pointing node V i A set of nodes of (a); out (V) j ) Represents a node V j A set of pointed-to nodes; score (V) j ) Represents the node weight, beta represents the harmonic coefficient, and the value is [0,1]To (c) to (d); after iteration, when the Rank (V) is i ) When the convergence is reached, the algorithm is ended;
the deep learning model in the step 2 converts the news headlines into a vector representation form through Word segmentation, and Word vectors are obtained through preprocessing by using a Word2vec model; extracting keywords from the news text through an improved TextRank algorithm, and expressing the keywords into word vectors; extracting text features by using a deep neural network model, namely acquiring further semantic representation of the text features; splicing data information after the semantic representation to finish classification;
after the model is built, training by using the built data set; and performing grade prediction on other public opinion records in the database by using the model, and storing the prediction result as a field of the public opinion record.
2. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein step 1 specifically comprises the following steps:
Step J1.1: crawl the original JSON file and check whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step J1.2: search the UTF-8 encoded JSON file and extract the forwarding amount, the comment text and the body text;
Step J1.3: clean the extracted text with regular expressions, keeping Chinese, English and numeric characters, removing web links, labels and emoticons from the text, and filtering out non-Chinese characters;
Step J1.4: create the public opinion record in the database from the extracted text, store the cleaned public opinion text in a separate database, and link it to the corresponding public opinion record.
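Steps J1.1–J1.3 can be sketched with the standard library; GBK as the fallback source encoding and the JSON field names are assumptions:

```python
import json
import re

def load_utf8_json(raw_bytes):
    # step J1.1: use the bytes directly if they are valid UTF-8,
    # otherwise convert from the source encoding (GBK assumed here)
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        text = raw_bytes.decode("gbk")
    return json.loads(text)

def extract_fields(record):
    # step J1.2: pull the forwarding amount, comment text and body text;
    # step J1.3 (partially): strip web links from the body text
    return {
        "reposts": record.get("reposts_count", 0),
        "comments": record.get("comments", []),
        "text": re.sub(r"https?://\S+", "", record.get("text", "")).strip(),
    }

raw = '{"reposts_count": 12, "comments": ["支持"], "text": "正文 http://t.cn/x"}'.encode("utf-8")
fields = extract_fields(load_utf8_json(raw))
```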
3. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein step 1 specifically comprises the following steps:
Step H1.1: crawl the original HTML file and check whether it is UTF-8 encoded; if so, no conversion is needed; if not, convert its encoding format to UTF-8;
Step H1.2: locate the body tags in the HTML text and extract the body text with a regular expression;
Step H1.3: clean the text of step H1.2 with regular expressions again, keeping Chinese, English and numeric characters, removing useless information, and filtering out non-Chinese characters;
Step H1.4: store the cleaned public opinion text in a separate database and link it to the corresponding public opinion record.
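A stdlib sketch of steps H1.2–H1.3; the regular expressions are illustrative, and a production crawler would also need to handle malformed markup:

```python
import re

def extract_body(html):
    # step H1.2: retrieve the contents of the body tag
    m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
    body = m.group(1) if m else html
    # step H1.3: strip remaining tags, then keep Chinese/English/digit characters
    body = re.sub(r"<[^>]+>", " ", body)
    body = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9 ]", "", body)
    return re.sub(r"\s+", " ", body).strip()

page = "<html><head><title>t</title></head><body><p>案件通报2021</p></body></html>"
body_text = extract_body(page)
```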
4. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein constructing a corpus from the database in step 2 comprises the following steps:
Step 2.1: extract public opinion data records of several public inspection cases from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
Step 2.2: organize the public opinion data records of step 2.1 by public opinion event, arranging each event into a JSON file, and label the grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
Step 2.3: check whether the grade labels of step 2.2 contain obvious errors; if so, go to step 2.4; if not, go to step 2.5;
Step 2.4: re-label the erroneous files;
Step 2.5: stop labeling; the corpus construction is finished.
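One JSON file per public opinion event (step 2.2) might be assembled like this; the field names and the numeric grade value are assumptions:

```python
import json

def label_event(event, grade):
    # attach a manually assigned grade to the event record (step 2.2)
    record = dict(event, grade=grade)
    # one JSON document per event; ensure_ascii=False keeps Chinese readable
    return json.dumps(record, ensure_ascii=False)

doc = label_event({"name": "事件A", "reposts": 1200, "comments": 300}, 2)
parsed = json.loads(doc)
```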
5. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein the technical premise of collecting the text information in step 2 and processing it with natural language techniques is that words are mapped into vector form with a Word2Vec model, which is trained as follows:
Step 2.1: extract the news texts crawled by the crawler module from the system database and clean them with regular expressions;
Step 2.2: split each news text into sentences with a natural language sentence-splitting technique, and split each sentence into words with Jieba;
Step 2.3: feed the split words, sentence by sentence, into a Word2Vec model for training to obtain the vector representation of each word.
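Steps 2.2–2.3 can be sketched as below; the per-character tokenizer is only a stand-in for Jieba, and the resulting corpus would then be passed to a Word2Vec implementation such as gensim's (not invoked here):

```python
import re

def split_sentences(text):
    # split on Chinese/Western sentence-ending punctuation (step 2.2)
    parts = re.split(r"[。！？!?]", text)
    return [p for p in (s.strip() for s in parts) if p]

def tokenize(sentence):
    # stand-in for jieba.lcut(sentence): one token per Chinese character
    # or per run of Latin letters/digits
    return re.findall(r"[\u4e00-\u9fa5]|[A-Za-z0-9]+", sentence)

corpus = [tokenize(s) for s in split_sentences("舆情事件发酵。网友评论2021条！")]
# step 2.3 would train on this, e.g. Word2Vec(sentences=corpus, ...)
```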
6. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein the deep learning model is an RNN recognition model with a Bi-LSTM network structure as its core, constructed as follows:
Step RB2.1: map the title and the keywords input into the network into vectors using the word vectors, i.e. the Embedding Layer;
Step RB2.2: obtain the context information of the title with the first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1]; using the attention mechanism, compute the semantic representation vector of the title, denoted title; the attention is computed as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(e_ij^T · u_w) / Σ_k exp(e_ik^T · u_w)

where W_w, b_w and u_w are the parameters to be learned, and α_ij represents the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
Step RB2.3: obtain the semantic information of the keywords with the second bidirectional LSTM layer, concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
Step RB2.4: normalize the numeric features so that every value lies between −1 and 1, and concatenate them into a vector denoted number;
Step RB2.5: concatenate the three vectors title, keyword and number into [title, keyword, number], map the result through a linear layer, feed it to the output layer, and obtain the output with a softmax function.
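The attention computation of step RB2.2 (with the score formula reconstructed as a standard word-level attention, α_ij = softmax_j(e_ij · u_w)) can be sketched in plain Python; the toy hidden states stand in for Bi-LSTM outputs and the parameter values are arbitrary:

```python
import math

def attention(h, W_w, b_w, u_w):
    # e_j = tanh(W_w · h_j + b_w), scalar scores via u_w,
    # softmax over the words, then an attention-weighted sum of hidden states
    def mat_vec(W, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in W]
    e = [[math.tanh(x + b) for x, b in zip(mat_vec(W_w, hj), b_w)] for hj in h]
    scores = [sum(ej * u for ej, u in zip(e_j, u_w)) for e_j in e]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # numerically stable softmax
    alphas = [x / sum(exps) for x in exps]
    dim = len(h[0])
    context = [sum(a * hj[d] for a, hj in zip(alphas, h)) for d in range(dim)]
    return alphas, context

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy Bi-LSTM outputs for 3 words
W = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical learned parameters
alphas, title_vec = attention(h, W, [0.0, 0.0], [1.0, 1.0])
```

The word whose hidden state best matches the context vector u_w receives the largest attention weight, and the returned context is the title representation of step RB2.2.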
7. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein to train the deep learning model of step 2, the data set of step 1 is split in the ratio 8:2 into a training set and a test set; the optimizer used during training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = − Σ_i weight_i · ŷ_i · log(y_i)

where y_i is the probability that the input y is predicted as class i, ŷ_i is the ground-truth indicator of class i, and weight_i represents the weight value of class i; the advantage of the weighting is that it effectively suppresses the error caused by the unbalanced class distribution of the data set, so that the model better learns the features of classes with fewer samples, where the weight is computed as:

weight_i = ( Σ_j num_j ) / num_i

where num_i indicates the number of samples of class i.
8. The public opinion grade prediction method for the public inspection field based on deep learning according to claim 1, wherein the database used in step 3 is a MongoDB database storing data records in dictionary form; each public opinion event record in the database is marked according to the prediction result of the model; and a data reading interface is designed that reads one public opinion event record from the database per call, so that the records can be called by other systems.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110608376.0A CN113312532B (en) | 2021-06-01 | 2021-06-01 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113312532A CN113312532A (en) | 2021-08-27 |
CN113312532B true CN113312532B (en) | 2022-10-21 |
Family
ID=77376787
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110608376.0A Active CN113312532B (en) | 2021-06-01 | 2021-06-01 | Public opinion grade prediction method based on deep learning and oriented to public inspection field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113312532B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807084A (en) * | 2019-05-15 | 2020-02-18 | 北京信息科技大学 | Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy |
CN111191442A (en) * | 2019-12-30 | 2020-05-22 | 杭州远传新业科技有限公司 | Similar problem generation method, device, equipment and medium |
KR20200137924A (en) * | 2019-05-29 | 2020-12-09 | 경희대학교 산학협력단 | Real-time keyword extraction method and device in text streaming environment |
CN112733538A (en) * | 2021-01-19 | 2021-04-30 | 广东工业大学 | Ontology construction method and device based on text |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107220352B (en) * | 2017-05-31 | 2020-12-08 | 北京百度网讯科技有限公司 | Method and device for constructing comment map based on artificial intelligence |
CN110263343B (en) * | 2019-06-24 | 2021-06-15 | 北京理工大学 | Phrase vector-based keyword extraction method and system |
CN111008274B (en) * | 2019-12-10 | 2021-04-06 | 昆明理工大学 | Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network |
CN112131863B (en) * | 2020-08-04 | 2022-07-19 | 中科天玑数据科技股份有限公司 | Comment opinion theme extraction method, electronic equipment and storage medium |
CN112800211A (en) * | 2020-12-31 | 2021-05-14 | 江苏网进科技股份有限公司 | Method for extracting critical information of criminal process in legal document based on TextRank algorithm |
CN112860906B (en) * | 2021-04-23 | 2021-07-16 | 南京汇宁桀信息科技有限公司 | Market leader hot line and public opinion decision support method and system based on natural language processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||