CN113312532B - Public opinion grade prediction method based on deep learning and oriented to public inspection field - Google Patents

Public opinion grade prediction method based on deep learning and oriented to public inspection field

Info

Publication number
CN113312532B
CN113312532B (application CN202110608376.0A)
Authority
CN
China
Prior art keywords
public
public opinion
text
database
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110608376.0A
Other languages
Chinese (zh)
Other versions
CN113312532A (en)
Inventor
赵铁军
杨沐昀
徐冰
郭常江
曹海龙
朱聪慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110608376.0A priority Critical patent/CN113312532B/en
Publication of CN113312532A publication Critical patent/CN113312532A/en
Application granted granted Critical
Publication of CN113312532B publication Critical patent/CN113312532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a public opinion grade prediction method facing the public inspection field based on deep learning. Step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information it contains and storing it in a database; step 2: predicting on the collected text information with a deep learning model to obtain a public opinion grade prediction result; step 3: storing the public opinion grade prediction result of step 2 in the system database; step 4: making a corresponding identification for the public opinion grade in the database; step 5: providing a data interface through which the public opinion information of the identified grades can be accessed; step 6: displaying the public opinion grade prediction result in the system through the data interface. The method addresses the lack of domain focus in existing public opinion systems and moves beyond the limitations of non-learning algorithms.

Description

Public opinion grade prediction method based on deep learning and oriented to public inspection field
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a public opinion grade prediction method facing the public inspection field based on deep learning.
Background
Public opinion monitoring and early warning is a complex technology that crosses social science and data science: at the initial stage of a public opinion event, a preliminary grade prediction is needed so that a response can be fully prepared. Automating this task requires large amounts of data and a reliable model for support.
Descriptions of public opinion events come mainly from news texts on network media and from social platforms such as Sina Weibo; people learn about such events directly, or indirectly from others, through reading, forwarding and commenting. A public opinion system can extract features from this event information (text features, which describe the specific situation of the event, and data features, which describe how the event propagates) to further analyze the current fermentation and propagation of the public opinion. A key point is analyzing the early-warning grade of the public opinion, which is an indispensable link in a public opinion system.
However, the public opinion grade prediction function in existing public opinion systems has the following problems:
1. Public opinion is currently monitored, and its grade analyzed, mainly by manual work. Although this is effective for a single public opinion event, the collection of public opinion is limited and events are not discovered in time, so an automatic public opinion analysis system is needed to assist manual prediction;
2. Existing public opinion systems lack domain focus. Because of their generality they cover all fields, yet the points of concern differ from field to field, so public opinion cannot be analyzed in a one-size-fits-all way; at present there is almost no public opinion event grade prediction targeted at the public inspection field;
3. The existing technology mainly relies on data mining algorithms, whose effect is limited because such algorithms do not learn; deep learning methods, on the other hand, require a uniformly constructed corpus to learn from, which is why existing methods still mainly analyze with mining algorithms.
Disclosure of Invention
The invention provides a public opinion grade prediction method facing the public inspection field based on deep learning, which addresses the above problems and assists manual public opinion monitoring.
The invention is realized by the following technical scheme:
a public opinion grade prediction method facing the public inspection field based on deep learning comprises the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface.
Further, the step 1 specifically includes the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
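By way of illustration only, steps J1.1 to J1.4 can be sketched in Python as follows; the JSON field names (text, comments, reposts_count), the fallback encoding and the exact regular expressions are assumptions for this sketch, not part of the claimed method.

```python
import json
import re

def load_utf8_json(path):
    """Step J1.1: read a crawled JSON file, re-decoding only if it is not UTF-8."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except UnicodeDecodeError:
        with open(path, encoding="gbk") as f:   # assumed fallback encoding
            return json.load(f)

def clean_text(text):
    """Step J1.3: keep Chinese, English and numeric characters; drop links, tags, emoticons."""
    text = re.sub(r"https?://\S+", "", text)                  # web links
    text = re.sub(r"<[^>]+>|#[^#]*#|\[[^\]]*\]", "", text)    # tags and emoticon markup
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、]", " ", text).strip()

record = load_utf8_json("weibo_item.json")                    # hypothetical file name
doc = {                                                        # step J1.2: extracted fields
    "reposts": record.get("reposts_count", 0),                # forwarding amount
    "comments": [clean_text(c) for c in record.get("comments", [])],
    "body": clean_text(record.get("text", "")),
}
# Step J1.4 would insert `doc` as a public opinion record and store the cleaned text
# in a separate collection linked to it (see the MongoDB sketch further below).
```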
Further, the step 1 specifically includes the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
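Similarly, a minimal sketch of the HTML branch (steps H1.2 to H1.3); the regular expressions are illustrative and assume a simple page layout.

```python
import re

def extract_html_body(html):
    """Steps H1.2-H1.3: retrieve the body tag, strip markup, keep Chinese/English/numeric text."""
    match = re.search(r"<body[^>]*>(.*?)</body>", html, flags=re.S | re.I)
    body = match.group(1) if match else html
    body = re.sub(r"<script.*?</script>|<style.*?</style>", "", body, flags=re.S | re.I)
    body = re.sub(r"<[^>]+>", "", body)                        # remaining tags
    return re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、]", " ", body).strip()
```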
Further, the step 2 of constructing a corpus by using the database includes the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
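A single corpus file produced by steps 2.1-2.2 might look like the following; the field names and the grade scale are illustrative assumptions, since the patent does not fix a particular schema.

```python
import json

# Hypothetical structure of one labeled public opinion event.
event = {
    "event_name": "某公检法舆情事件",
    "news": ["清洗后的新闻正文……"],
    "weibo": ["相关微博文本……"],
    "comments": ["相关评论……"],
    "reposts_count": 3200,
    "comments_count": 1500,
    "level": 3,                      # assumed grade label scale, e.g. 1 (low) to 5 (high)
}
with open("event_0001.json", "w", encoding="utf-8") as f:
    json.dump(event, f, ensure_ascii=False, indent=2)
```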
Further, a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model; the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
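A minimal sketch of this Word2Vec training, assuming Jieba for word segmentation and gensim for Word2Vec; the hyperparameters are illustrative.

```python
import re
import jieba
from gensim.models import Word2Vec

def split_sentences(text):
    """Step 2.2: split a cleaned news text into sentences on end-of-sentence punctuation."""
    return [s for s in re.split(r"[。！？!?]", text) if s.strip()]

# news_texts: cleaned news bodies read from the system database (assumed available here).
news_texts = ["……"]
sentences = [list(jieba.cut(s)) for t in news_texts for s in split_sentences(t)]

# Step 2.3: train Word2Vec on the segmented sentences (sg=0 selects CBOW, as in Fig. 3).
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=2, sg=0)
w2v.save("word2vec.model")
```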
Further, the keyword extraction algorithm applied to the collected text information in step 2 is an improved TextRank algorithm, in which the edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

where w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively. The initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that determines whether sentence length or semantic relevance has more influence on the initial node value.
The algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

where Rank(V_i) is the keyword score of node i; w_ij is the weight of the edge between node i and node j; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes pointed to by V_j; Score(V_j) is the node weight; and β is the harmonic coefficient with value in [0,1]. The iteration ends when Rank(V_i) converges.
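The improved TextRank above can be sketched as follows, assuming cosine similarity as the word-vector similarity and the iteration formula reconstructed above; the window size, β and the uniform initialisation are illustrative choices (the patent's exact initial-weight formula is given only as an image).

```python
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def improved_textrank(words, w2v, window=3, beta=0.85, iters=50, topk=5):
    """Improved TextRank: edge weight w_ij = co-occurrence count c_ij + word-vector similarity.

    `words` is a segmented sentence/text; `w2v` is a word-vector lookup (e.g. gensim KeyedVectors)."""
    vocab = [w for w in dict.fromkeys(words) if w in w2v]
    idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    C = np.zeros((n, n))                     # co-occurrence counts within the window
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            a, b = words[i], words[j]
            if a in idx and b in idx and a != b:
                C[idx[a], idx[b]] += 1
                C[idx[b], idx[a]] += 1
    W = np.zeros((n, n))
    for a in vocab:
        for b in vocab:
            i, j = idx[a], idx[b]
            if i != j and C[i, j] > 0:
                W[i, j] = C[i, j] + cos(w2v[a], w2v[b])
    # Node scores start uniform here (the patent's length/similarity-based initial weight
    # is not reproduced) and are updated with the weighted TextRank iteration.
    score = np.ones(n) / max(n, 1)
    for _ in range(iters):
        out_sum = W.sum(axis=1) + 1e-8       # sum of edge weights leaving each node j
        score = (1 - beta) + beta * W.T.dot(score / out_sum)
    return [vocab[i] for i in np.argsort(-score)[:topk]]
```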
Further, the deep learning model in step 2 converts the news title into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network then extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification.
After the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database, and the prediction result is stored as a field of each public opinion record.
Further, the deep learning model is an RNN recognition model, a Bi-LSTM network structure is used as a model core, and the construction steps are as follows:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

where W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
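A minimal PyTorch sketch of the Bi-LSTM classifier of steps RB2.1-RB2.5; the layer sizes, the number of numeric features and the number of grade classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class OpinionLevelModel(nn.Module):
    """Sketch of the Bi-LSTM classifier: title branch with attention, keyword branch,
    numeric features, concatenation, linear layer and softmax. Sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=128, hidden=64, num_feats=3, num_classes=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                  # Embedding Layer (RB2.1)
        self.title_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.kw_lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att_proj = nn.Linear(2 * hidden, 2 * hidden)             # W_w, b_w
        self.att_u = nn.Linear(2 * hidden, 1, bias=False)             # u_w
        self.fc = nn.Linear(2 * hidden + 2 * hidden + num_feats, num_classes)

    def forward(self, title_ids, kw_ids, numbers):
        # RB2.2: title context from the first Bi-LSTM, then attention pooling
        h_t, _ = self.title_lstm(self.emb(title_ids))                 # [fwOutput_1, bwOutput_1]
        e = torch.tanh(self.att_proj(h_t))                            # e_ij = tanh(W_w h_ij + b_w)
        a = torch.softmax(self.att_u(e), dim=1)                       # alpha_ij
        title = (a * h_t).sum(dim=1)                                  # title vector
        # RB2.3: keyword semantics from the second Bi-LSTM (final states of both directions)
        _, (h_k, _) = self.kw_lstm(self.emb(kw_ids))
        keyword = torch.cat([h_k[0], h_k[1]], dim=-1)                 # [fwOutput_2, bwOutput_2]
        # RB2.4-RB2.5: concatenate with normalised numeric features, map and classify
        x = torch.cat([title, keyword, numbers], dim=-1)
        return torch.softmax(self.fc(x), dim=-1)
```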
Further, for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set; the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

where y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i. The advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
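A sketch of this training setup: 8:2 split, Adam, and a weighted cross-entropy over the softmax outputs. The scikit-learn split, the inverse-frequency form of weight_i and the loader load_labeled_events are assumptions (the patent gives its exact weight formula only as an image); classes are assumed to be encoded as 0..K-1.

```python
import torch
import torch.nn as nn
from collections import Counter
from sklearn.model_selection import train_test_split

dataset = load_labeled_events("corpus/")        # hypothetical loader for the labeled JSON files
train_data, test_data = train_test_split(dataset, test_size=0.2, random_state=42)  # 8:2 split

# Class weights: one assumed inverse-frequency form, so under-represented grades weigh more.
counts = Counter(example["level"] for example in train_data)
total = sum(counts.values())
weight = torch.tensor([total / (len(counts) * counts[c]) for c in sorted(counts)])

def weighted_cross_entropy(probs, targets):
    """Loss = -sum_i weight_i * t_i * log(y_i), applied to the model's softmax outputs."""
    log_p = torch.log(probs.clamp_min(1e-8))
    return nn.functional.nll_loss(log_p, targets, weight=weight)

model = OpinionLevelModel(vocab_size=50000)      # the Bi-LSTM sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```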
Further, the database used in step 3 is a MongoDB database, and data records are stored in the form of a dictionary. In a database, according to the prediction result of the model, identifying each public sentiment event record; and designing a data reading interface to read one public sentiment event record from the database every time, so that the public sentiment event record can be called by other systems.
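A minimal pymongo sketch of steps 3-5; the connection string, database and collection names, and the predicted_level field are assumptions for this sketch.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")            # assumed connection string
events = client["opinion_system"]["public_opinion_events"]   # assumed database/collection names

def mark_prediction(event_id, level):
    """Steps 3-4: store the predicted grade back on the event record as a field."""
    events.update_one({"_id": event_id}, {"$set": {"predicted_level": int(level)}})

def read_one_event(level=None):
    """Step 5: data reading interface, returning one event record per call."""
    query = {"predicted_level": level} if level is not None else {}
    return events.find_one(query)
```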
The invention has the beneficial effects that:
the invention extracts the text characteristic and the data characteristic of the public sentiment event by capturing the public sentiment event information related to the public inspection field on the network and combining the deep learning and natural language processing technology to complete the prediction of the grade of the public sentiment event.
The method of the invention enables the staff not to pay attention to the public sentiment event by manpower, realizes automatic extraction of text characteristics and data characteristics to finish public sentiment grade prediction, and reduces the situations of incomplete information collection of the public sentiment event, untimely public sentiment discovery and insufficient public sentiment response.
The invention provides a reading interface by storing the data in the system database, thereby facilitating the data retrieval of managers and developers.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of training data set construction in the present invention.
FIG. 3 is a CBOW model diagram of the Word2vec model in the present invention.
FIG. 4 is a diagram of a Bi-LSTM-based classification model in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A public opinion grade prediction method facing the public inspection field based on deep learning comprises the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface.
Further, the step 1 specifically includes the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
Further, the step 1 specifically includes the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
Further, the step 2 of constructing a corpus by using the database includes the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
The corpus is used in deep learning.
Further, a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model; the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
Further, the keyword extraction algorithm applied to the collected text information in step 2 is an improved TextRank algorithm. TextRank is a random-walk graph model: a text (usually a sentence) is segmented into words, the words are treated as nodes, and an edge exists between two nodes if and only if the words they represent appear together (co-occur) in the original text. The edges and nodes are given initial weights; after the iterative computation of the graph model converges, the node weight expresses how key the node is, and sorting the nodes by this value yields the keyword result.
The edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

where w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively.
The initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that determines whether sentence length or semantic relevance has more influence on the initial node value.
The algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

where Rank(V_i) is the keyword score of node i; w_ij is the weight of the edge between node i and node j; In(V_i) is the set of nodes pointing to V_i; Out(V_j) is the set of nodes pointed to by V_j; Score(V_j) is the node weight; and β is the harmonic coefficient with value in [0,1]. The iteration ends when Rank(V_i) converges.
Further, the deep learning model in step 2 learns semantic information by modeling the text title and the text keywords, and makes the prediction by combining this semantic information with the data information (forwarding amount, comment amount, and the like); this is a classification problem. The news title is converted into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network then extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification.
After the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database (excluding the records used for training), and the prediction result is stored as a field of each public opinion record.
Further, the deep learning model is an RNN (Recurrent Neural Network) recognition model with a Bi-LSTM (Bidirectional Long Short-Term Memory) network structure as its core, constructed through the following steps:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

where W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
Further, for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set; the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

where y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i. The advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better. The class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
Further, the database used in step 3 is a MongoDB database, and data records are stored in the form of a dictionary. Marking each public opinion event record in a database according to the prediction result of the model; and designing a data reading interface to read one public opinion event record from the database every time, wherein the public opinion event record can be called by other systems.
This example was carried out according to the scheme shown in FIG. 1. The system built with the invention is divided into two parts: an algorithm part and a data storage part. The algorithm part mainly comprises public opinion information acquisition, public opinion text extraction and cleaning, feature extraction and keyword extraction, and model prediction; the data storage part mainly stores the crawled public opinion related information and updates the identification in the database after model prediction.
After the system is started, the trained model is loaded into the memory, and meanwhile, a crawler module crawls news texts, comments, forwarding data and the like related to public sentiment events from the network and stores the data in a system database.
Public opinion related data are extracted from the database and arranged as the numeric features of the model input; the news texts are cleaned with regular expressions and the like, keywords are extracted from them with the improved TextRank algorithm as the keyword feature of the model, and the public opinion news title serves as the title feature of the model; the three features are combined and input into the model to complete the prediction.
The prediction result is stored in the system database. When an abnormality occurs for a public opinion event, that event is skipped and the next one is predicted; after all public opinion events have been processed, the abnormal ones are predicted again, and the cycle repeats.
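The prediction cycle of this embodiment can be sketched as follows; build_numeric_features, segment and predict_level are hypothetical helpers standing in for the feature assembly and model inference described above.

```python
def run_prediction_cycle(events, model, w2v):
    """Predict every unprocessed event; skip abnormal ones and retry them on the next pass."""
    failed = []
    for record in events.find({"predicted_level": {"$exists": False}}):
        try:
            numbers = build_numeric_features(record)                       # hypothetical helper
            keywords = improved_textrank(segment(record["body"]), w2v.wv)  # keyword feature
            level = predict_level(model, record["title"], keywords, numbers)
            events.update_one({"_id": record["_id"]}, {"$set": {"predicted_level": level}})
        except Exception:
            failed.append(record["_id"])    # abnormal public opinion event: skip for now
    return failed                           # retried in the next cycle
```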

Claims (8)

1. A public opinion grade prediction method facing the public inspection field based on deep learning, characterized by comprising the following steps:
step 1: crawling public opinion information related to the public inspection field from the network, extracting the text information in the public opinion information and storing the text information in a database;
step 2: predicting the collected text information by using a deep learning model to obtain a public opinion grade prediction result;
step 3: storing the public opinion grade prediction result in the step 2 into a system database;
step 4: making a corresponding identification for the public opinion grade in the database;
step 5: providing a data interface capable of accessing public opinion information for the identified public opinion grades in the database;
step 6: displaying the public opinion grade prediction result in the system through the data interface;
the algorithm for extracting keywords from the collected text information in step 2 is an improved TextRank algorithm, in which the edge weight is calculated as follows:

w_ij = c_ij + sim(vec(V_i), vec(V_j))

wherein w_ij represents the weight of the edge between node i and node j, namely the sum of the co-occurrence count and the word-vector similarity; c_ij represents the number of co-occurrences of the word pair represented by node i and node j; and vec(V_i) and vec(V_j) represent the word vectors corresponding to node i and node j respectively; the initial weight Score(V_i) of a node is computed from the number of words n in the sentence, the word vectors vec(V_i) and vec(V_j), and an offset coefficient α with value in [0,1] that indicates whether sentence length or semantic relevance has more influence on the initial node value;
the algorithm iteration formula is as follows:

Rank(V_i) = (1 - β) + β · Σ_{V_j ∈ In(V_i)} ( w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ) · Score(V_j)

wherein Rank(V_i) represents the keyword score of node i; w_ij represents the weight of the edge between node i and node j; In(V_i) represents the set of nodes pointing to V_i; Out(V_j) represents the set of nodes pointed to by V_j; Score(V_j) represents the node weight; and β represents the harmonic coefficient with value in [0,1]; the iteration ends when Rank(V_i) converges;
the deep learning model in step 2 converts the news title into a vector representation through word segmentation, the word vectors being obtained by preprocessing with the Word2vec model; keywords are extracted from the news text with the improved TextRank algorithm and likewise represented as word vectors; a deep neural network extracts features from the text, i.e. obtains a further semantic representation of it; and the data information is concatenated to this semantic representation to complete the classification;
after the model is built, it is trained with the constructed data set; the model then performs grade prediction for the other public opinion records in the database, and the prediction result is stored as a field of each public opinion record.
2. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 1 specifically comprises the following steps:
step J1.1: crawling to obtain an original JSON file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the JSON file into UTF-8;
step J1.2: retrieving from the UTF-8 encoded JSON file and extracting the forwarding amount, the comment text and the body text;
step J1.3: cleaning the extracted text with regular expressions, retaining Chinese, English and numeric characters, removing webpage links, tags and emoticons in the text, and filtering out non-Chinese characters;
step J1.4: establishing the related public opinion record in the database with the extracted text, storing the cleaned public opinion text in a separate database, and establishing a relation with the corresponding public opinion record in the database.
3. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 1 specifically comprises the following steps:
step H1.1: crawling to obtain an original HTML file and judging whether it is UTF-8 encoded; if so, no conversion is needed, and if not, converting the encoding format of the HTML file into UTF-8;
step H1.2: retrieving the body tags in the HTML text and extracting the body text from the HTML with regular expressions;
step H1.3: cleaning the text obtained in step H1.2 with regular expressions again, retaining Chinese, English and numeric characters, removing useless information in the text, and filtering out non-Chinese characters;
step H1.4: storing the cleaned public opinion text in a separate database and establishing a relation with the corresponding public opinion record in the database.
4. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the step 2 of constructing a corpus by using a database comprises the following steps:
step 2.1: extracting public opinion data records of a number of public inspection field events from the database, including but not limited to public opinion event names, related news, related microblogs and related comments;
step 2.2: organizing the public opinion data records of step 2.1 by public opinion event, arranging each public opinion event into a JSON file, and labeling its grade according to the news content of the event, the media participating in its propagation, the forwarding amount and the comment amount;
step 2.3: checking whether the grade labels of step 2.2 contain obvious errors; if so, performing step 2.4, and if not, performing step 2.5;
step 2.4: re-labeling the erroneous files;
step 2.5: stopping labeling; corpus construction is complete.
5. The public opinion grade prediction method facing the public inspection field based on deep learning according to claim 1, wherein a technical prerequisite for natural language processing of the collected text information in step 2 is mapping words into vector form with a Word2Vec model, and the Word2Vec model is trained through the following steps:
step 2.1: extracting the news texts crawled by the crawler module from the system database and cleaning them with regular expressions;
step 2.2: splitting each news text into sentences with a natural-language sentence-splitting technique, and splitting each sentence into words with Jieba;
step 2.3: feeding the segmented words, sentence by sentence, into the Word2Vec model for training to obtain the vector representation of each word.
6. The public opinion grade prediction method facing to the public inspection field based on deep learning according to claim 1, wherein the deep learning model is an RNN recognition model, a Bi-LSTM network structure is used as a model core, and the construction steps are as follows:
step RB2.1: using the word vectors to map the title and the keywords input to the network into vectors, i.e. the Embedding Layer;
step RB2.2: obtaining the context information of the title with a first bidirectional LSTM layer, concatenating the forward output fwOutput_1 and the backward output bwOutput_1 of the LSTM into the vector [fwOutput_1, bwOutput_1], and computing a semantic representation vector of the title, denoted title, with the attention mechanism; attention is calculated as follows:

e_ij = tanh(W_w · h_ij + b_w)
α_ij = exp(u_w^T · e_ij) / Σ_k exp(u_w^T · e_ik)
title = Σ_j α_ij · h_ij

wherein W_w, b_w and u_w are parameters to be learned, h_ij is the hidden state of the j-th word in the i-th title, and α_ij is the final attention distribution, i.e. the attention value of the j-th word in the i-th title;
step RB2.3: obtaining the semantic information of the keywords with a second bidirectional LSTM layer, and concatenating the forward output fwOutput_2 and the backward output bwOutput_2 of the LSTM into the vector [fwOutput_2, bwOutput_2], denoted keyword;
step RB2.4: normalizing the numeric features so that each value lies between -1 and 1, and concatenating the feature values into a vector, denoted number;
step RB2.5: concatenating the title, keyword and number vectors into [title, keyword, number], mapping the result through a linear layer, feeding it to the output layer, and obtaining the output with a softmax function.
7. The public opinion grade prediction method facing the public inspection field based on deep learning according to claim 1, wherein for training the deep learning model in step 2, the data set of step 1 is divided into two parts in an 8:2 ratio, used respectively as the training set and the test set, the optimizer used in training is Adam, and the loss function of the neural network is a weighted cross-entropy loss, defined as:

Loss = - Σ_i weight_i · t_i · log(y_i)

wherein y_i is the probability that the input y is predicted as class i, t_i is 1 if the true class of the input is i and 0 otherwise, and weight_i is the weight value of class i; the advantage of using the weights is that the error caused by the unbalanced class distribution of the data set is effectively suppressed, so the model learns the features of under-represented classes better; the class weight weight_i is computed from num_i, the number of samples of class i, and decreases as num_i grows, so that classes with fewer samples receive larger weights.
8. The public opinion grade prediction method facing public inspection field based on deep learning according to claim 1, wherein the database used in step 3 is a MongoDB database, storing data records in the form of dictionary; in a database, according to the prediction result of the model, identifying each public sentiment event record; and designing a data reading interface to read one public sentiment event record from the database every time, so that the public sentiment event record can be called by other systems.
CN202110608376.0A 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field Active CN113312532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608376.0A CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608376.0A CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Publications (2)

Publication Number Publication Date
CN113312532A CN113312532A (en) 2021-08-27
CN113312532B true CN113312532B (en) 2022-10-21

Family

ID=77376787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608376.0A Active CN113312532B (en) 2021-06-01 2021-06-01 Public opinion grade prediction method based on deep learning and oriented to public inspection field

Country Status (1)

Country Link
CN (1) CN113312532B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium
KR20200137924A (en) * 2019-05-29 2020-12-09 경희대학교 산학협력단 Real-time keyword extraction method and device in text streaming environment
CN112733538A (en) * 2021-01-19 2021-04-30 广东工业大学 Ontology construction method and device based on text

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107220352B (en) * 2017-05-31 2020-12-08 北京百度网讯科技有限公司 Method and device for constructing comment map based on artificial intelligence
CN110263343B (en) * 2019-06-24 2021-06-15 北京理工大学 Phrase vector-based keyword extraction method and system
CN111008274B (en) * 2019-12-10 2021-04-06 昆明理工大学 Case microblog viewpoint sentence identification and construction method of feature extended convolutional neural network
CN112131863B (en) * 2020-08-04 2022-07-19 中科天玑数据科技股份有限公司 Comment opinion theme extraction method, electronic equipment and storage medium
CN112800211A (en) * 2020-12-31 2021-05-14 江苏网进科技股份有限公司 Method for extracting critical information of criminal process in legal document based on TextRank algorithm
CN112860906B (en) * 2021-04-23 2021-07-16 南京汇宁桀信息科技有限公司 Market leader hot line and public opinion decision support method and system based on natural language processing

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110807084A (en) * 2019-05-15 2020-02-18 北京信息科技大学 Attention mechanism-based patent term relationship extraction method for Bi-LSTM and keyword strategy
KR20200137924A (en) * 2019-05-29 2020-12-09 경희대학교 산학협력단 Real-time keyword extraction method and device in text streaming environment
CN111191442A (en) * 2019-12-30 2020-05-22 杭州远传新业科技有限公司 Similar problem generation method, device, equipment and medium
CN112733538A (en) * 2021-01-19 2021-04-30 广东工业大学 Ontology construction method and device based on text

Also Published As

Publication number Publication date
CN113312532A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN111581396B (en) Event graph construction system and method based on multi-dimensional feature fusion and dependency syntax
CN110135457A (en) Event trigger word abstracting method and system based on self-encoding encoder fusion document information
CN111639171A (en) Knowledge graph question-answering method and device
CN110532398B (en) Automatic family map construction method based on multi-task joint neural network model
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN113806563A (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
CN112749562A (en) Named entity identification method, device, storage medium and electronic equipment
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN114970508A (en) Power text knowledge discovery method and device based on data multi-source fusion
CN111191051B (en) Method and system for constructing emergency knowledge map based on Chinese word segmentation technology
CN115564393A (en) Recruitment requirement similarity-based job recommendation method
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN115599899A (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN113378024B (en) Deep learning-oriented public inspection method field-based related event identification method
CN115455202A (en) Emergency event affair map construction method
CN113343701B (en) Extraction method and device for text named entities of power equipment fault defects
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN112329442A (en) Multi-task reading system and method for heterogeneous legal data
CN113312532B (en) Public opinion grade prediction method based on deep learning and oriented to public inspection field
CN111898034A (en) News content pushing method and device, storage medium and computer equipment
CN116975161A (en) Entity relation joint extraction method, equipment and medium of power equipment partial discharge text
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
CN111309933B (en) Automatic labeling system for cultural resource data
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
CN112579666A (en) Intelligent question-answering system and method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant