CN111708886A - Public opinion analysis terminal and public opinion text analysis method based on data driving

Info

Publication number: CN111708886A
Application number: CN202010527263.3A
Authority: CN
Inventors: 贾晓亮; 刘伟; 张志杰; 陈雪; 孟吉凯; 代志称; 郑爱华; 张自达
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2020-09-25

Abstract

The invention belongs to the technical field of databases, relates to public opinion analysis, and in particular relates to a data-driven public opinion analysis terminal and public opinion text analysis method. The public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that a computer program is installed in the terminal. The computer program comprises a crawler module for collecting public opinion data, a text preprocessing module for preprocessing character strings, and an emotion judging module for performing emotion analysis on texts. Based on this analysis terminal, a matching public opinion text analysis method is designed that processes web text data through the combined use of algorithms such as Chinese word segmentation, stop word removal, unbalanced corpus processing and feature selection, and finally realizes public opinion identification.

Description

Public opinion analysis terminal and public opinion text analysis method based on data driving

Technical Field

The invention belongs to the technical field of databases, relates to the technical field of public opinion analysis, and particularly relates to a public opinion analysis terminal and a public opinion text analysis method based on data driving.

Background

With the development of network technology and the popularization of network applications, public opinion now spreads far faster than in any past period. When a group event occurs, the rapid spread of negative public opinion can cause the event to escalate and break out within a very short time.

Therefore, early discovery, early judgment and early prevention of public opinion information have become important prerequisites for public service departments to guide public opinion correctly. Using computers to help power grid enterprises acquire and organize public opinion text information quickly and completely is a basic requirement for these enterprises to seize the opportunity for public opinion management and control, maintain their corporate image and improve their service level.

In the spread of public opinion, positive public opinion promotes the spread of true information about an event, whereas negative public opinion can provoke adverse reactions to the event, destabilize the public opinion environment and trigger a public opinion crisis. How to effectively analyze the emotion expressed in public opinion information, especially in text information, is therefore very important, and emotion analysis of public opinion text is required.

Emotion analysis, also known as opinion mining, is the process of analyzing, processing, generalizing, and reasoning about subjective text with emotional color. Currently there are two methods for emotion analysis of a text: one based on semantic understanding and the other based on machine learning. The first is severely limited when processing text with complex modes of expression and irregular text information, while the second is constrained by feature selection and corpus scale and is not suitable for real-time processing of large volumes of text.

Therefore, a public opinion analysis terminal and a public opinion text analysis method should be designed that process web text data through the combined use of algorithms such as Chinese word segmentation, stop word removal, unbalanced corpus processing and feature selection, and finally realize public opinion identification.

Disclosure of Invention

The invention aims to remedy the defects of the prior art by providing a public opinion analysis terminal and a public opinion text analysis method that combine algorithms such as Chinese word segmentation, stop word removal, unbalanced corpus processing and feature selection, and finally realize public opinion identification.

The technical scheme adopted by the invention is as follows:

A data-driven public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that: a computer program is installed in the terminal, the computer program comprises a crawler module, a text preprocessing module and an emotion judging module, the crawler module is used for collecting public opinion data, the text preprocessing module is used for preprocessing character strings, and the emotion judging module is used for performing emotion analysis on texts.

Further, the method comprises the following steps:

step 1: designing a topic, and analyzing page topics with a topic crawler;

step 2: carrying out data cleaning on the collected public opinion data;

step 3: performing Chinese word segmentation, including preprocessing with a dictionary matching method and then achieving accurate segmentation with a statistical word segmentation method;

step 4: removing stop words, and removing from the stop word list some internet-habit usages that express an intensified degree;

step 5: generating text feature vectors for the processed text information;

step 6: applying a classifier to the set of text feature vectors;

step 7: generating a classification result.
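The dictionary-matching pre-pass of step 3 is not spelled out in the text. Forward maximum matching is one common dictionary-based scheme, and the sketch below uses it purely as an assumption; the function name, the `max_len` parameter and the toy dictionary are illustrative and do not come from the patent. The statistical refinement pass is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy dictionary matching: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a
        # dictionary hit or only one character is left.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, for illustration only.
toy_dict = {"电网", "企业", "舆情", "分析", "电网企业"}
tokens = forward_max_match("电网企业舆情分析", toy_dict)
```

With this toy dictionary the longest entry wins at each position, so "电网企业" is kept as one token rather than being split into "电网" and "企业".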

Further, step 1 includes initializing the seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the related topic of the page.
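The score-ordered list of URLs to crawl can be sketched with a priority queue. This is a minimal sketch under stated assumptions: the `Frontier` class, the `topic_score` keyword rule, the seed score of 1.0 and the example URLs are all hypothetical, since the text does not say how scores are computed. No network access is performed.

```python
import heapq
from itertools import count


class Frontier:
    """To-be-crawled URL list, ordered by topic score (highest first)."""

    def __init__(self, seeds):
        self._heap = []
        self._seen = set()
        self._order = count()  # tie-breaker: equal scores pop in insertion order
        for url in seeds:
            self.add(url, score=1.0)  # assumption: seeds get the top score

    def add(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, next(self._order), url))

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def topic_score(text, keywords):
    """Fraction of topic keywords found in the text (hypothetical rule)."""
    if not keywords:
        return 0.0
    return sum(1 for k in keywords if k in text) / len(keywords)


keywords = ["power", "outage"]
frontier = Frontier(["https://example.com/seed"])
frontier.add("https://example.com/a", topic_score("power complaint", keywords))
frontier.add("https://example.com/b", topic_score("weather report", keywords))
order = [frontier.pop()[0] for _ in range(len(frontier))]
```

In a real crawler, each popped page would be fetched and parsed, and its outgoing links scored and pushed back onto the frontier.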

Further, step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.

Further, in step 5, a CBOW model is used. For a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, where $\mathrm{text}(w)$, the context of $w$, consists of the $c$ words before and after $w$, the input layer contains the $2c$ word vectors $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100). The projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$.

The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and the frequency of each word in the corpus as its weight, and the corresponding word vectors are obtained through successive binary classifications along the tree.
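The construction of the Huffman tree over the vocabulary can be sketched as follows. The word frequencies are made up for illustration, and the function name `huffman_codes` is an assumption. Each bit of a word's code corresponds to one binary decision on the path from the root to that word's leaf. Training the classifier at each inner node is not shown.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a Huffman tree with words as leaves weighted by frequency and
    return each word's root-to-leaf path as a string of 0/1 decisions."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(f, next(tie), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two lightest subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # inner node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a word
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


# Illustrative frequencies only.
codes = huffman_codes({"停电": 15, "抢修": 8, "投诉": 6, "感谢": 5, "电费": 3})
```

Frequent words end up close to the root, so they need fewer binary decisions than rare words.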

Further, in step 5, an information gain method is adopted for feature selection,

wherein $n$ represents the total number of classes, $\bar{t}$ indicates that the feature item $t$ is absent, $P(c_i)$ represents the proportion of texts belonging to class $c_i$, $P(t)$ represents the proportion of texts containing the feature item $t$ in the total text, $P(t \mid c_i)$ represents the proportion, in the total text, of texts that belong to class $c_i$ and contain the feature item $t$, and $P(\bar{t} \mid c_i)$ represents the proportion, in the total text, of texts that belong to class $c_i$ but do not contain the feature item $t$.
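The information gain formula itself is not preserved in this text. For orientation, the textbook form used for text feature selection is given below; it is an assumption that the patent uses this form or an equivalent one. The name $IG(t)$ and the quantities $P(c_i \mid t)$ and $P(c_i \mid \bar{t})$ (the share of class $c_i$ among texts that do, or do not, contain $t$) are not defined in the source, but they can be computed from the proportions defined above, with $P(\bar{t}) = 1 - P(t)$.

```latex
IG(t) = -\sum_{i=1}^{n} P(c_i)\log P(c_i)
        + P(t)\sum_{i=1}^{n} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{n} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
```

The first term is the class entropy, and the remaining two subtract the entropy that is left once the presence or absence of $t$ is known, so features with a higher value are more informative for classification.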

Further, in step 6, the classifier adopts a logistic regression model,

wherein the feature vector is $X = \{x_1, x_2, \ldots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector is $W = \{w_1, w_2, \ldots, w_n, b\} \in \mathbb{R}^{n+1}$.
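The model formula is not preserved in this text. The standard logistic regression form consistent with the augmented vectors $X$ and $W$ is shown below as an assumption; the label variable $y$ is introduced here for illustration and does not appear in the source.

```latex
P(y = 1 \mid X) = \frac{1}{1 + e^{-W^{\top} X}}, \qquad
W^{\top} X = \sum_{j=1}^{n} w_j x_j + b
```

Appending the constant 1 to $X$ and the bias $b$ to $W$ lets the bias be handled as an ordinary weight.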

Furthermore, to handle the minority-class samples of the unbalanced corpus, the SMOTE algorithm is adopted, in which new samples are synthesized between a minority-class sample and its adjacent samples and are added to the minority-class sample set to achieve the oversampling effect.
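The SMOTE formula is not preserved in this text. The sketch below follows the commonly published form of the algorithm, in which a new sample is placed at a random point on the segment between a minority-class sample and one of its k nearest minority-class neighbors. The function name, the default `k=5`, the fixed seed and the toy data are assumptions for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority-class
    samples and their k nearest minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x within the minority class (x excluded).
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


# Toy 2-D minority-class feature vectors, for illustration only.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_samples = smote(minority, n_new=6, k=2)
balanced_minority = minority + new_samples
```

Because every synthetic point lies between two existing minority samples, the oversampled set stays inside the region already occupied by the minority class.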

The invention has the advantages and positive effects that:

In the invention, the public opinion analysis terminal is formed by combining an existing device with a preset computer program; the terminal can be purpose-built, or an existing computer or other mobile terminal can be used instead.

The invention processes public opinion data based on a preset computer program, in which a crawler module collects the public opinion data, a text preprocessing module preprocesses character strings, and an emotion judging module performs emotion analysis on the text, forming a complete processing system.

In the invention, web text data is acquired by means of crawler technology and the corresponding pages are analyzed. During data cleaning, meaningless characters are removed, and information such as replies, topic references, titles, URL references and times is ignored. Chinese word segmentation is performed by preprocessing with a dictionary matching method and then achieving accurate segmentation with a statistical word segmentation method. In further processing, some internet-habit usages that express an intensified degree are removed from the stop word list. Text features are then extracted with a CBOW model and selected with an information gain method. Finally, a classification result is obtained with a logistic regression model and the SMOTE algorithm, thereby realizing public opinion identification.

Drawings

Fig. 1 is a block diagram of a public opinion analysis terminal according to the present invention;

Fig. 2 is a flowchart of a public opinion text analysis method according to the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are illustrative and are not intended to limit the scope of the invention.

In this embodiment, the method includes the following steps:

step 1: designing a topic, and analyzing page topics with a topic crawler;

step 2: carrying out data cleaning on the collected public opinion data;

step 5: generating text feature vectors for the processed text information;

step 6: applying a classifier to the set of text feature vectors;

step 7: generating a classification result.

In this embodiment, step 1 includes initializing the seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the related topic of the page.

In this embodiment, the upper limit of the URL length is set to 50.

In this embodiment, the data cleaning in step 2 is cleaning of the corpus. It includes removing meaningless characters such as "#" and ignoring replies, topic references, titles, URL references, times and the like. This step uses manual labeling, with the labeling results cross-validated.
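The cleaning step can be sketched with regular expressions. The concrete patterns below are assumptions modeled on microblog-style text (replies as `回复@user:`, topics as `#topic#`, titles in `【】` brackets, timestamps as `YYYY-MM-DD HH:MM`); the patent does not specify the exact formats, and the function name `clean` is illustrative.

```python
import re

# Each pattern covers one kind of content that the cleaning step ignores.
PATTERNS = [
    re.compile(r"https?://\S+"),                                      # URL references
    re.compile(r"#[^#\s]+#"),                                         # topic references
    re.compile(r"【[^】]*】"),                                          # bracketed titles
    re.compile(r"(回复)?@[\w\-]+[:：]?"),                               # replies and mentions
    re.compile(r"\d{4}-\d{1,2}-\d{1,2}( \d{1,2}:\d{2}(:\d{2})?)?"),   # timestamps
    re.compile(r"[#*~^|\\]+"),                                        # leftover meaningless characters
]


def clean(text):
    """Strip ignored content, then collapse the remaining whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "回复@小明: #停电# 今天又停电了 http://t.cn/abc 2020-06-11 08:30 ##"
cleaned = clean(raw)
```

The patterns are applied in order, so topic markers are removed as a unit before stray "#" characters are stripped.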

In this embodiment, in step 4, some internet-habit usages that express an intensified degree are removed from the stop word list. For example, the word "to" often follows a positive emotion word, in which case the whole context is positive, so this word is taken out of the stop word list.
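Reading the paragraph above as meaning that degree-intensifying words are taken out of the stop word list and therefore stay in the text, the filter can be sketched as follows. The function name, the `keep` parameter and the English example tokens are assumptions for illustration.

```python
def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except those listed in `keep`, which are treated as
    sentiment-bearing and taken out of the stop word list."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t not in effective]


tokens = ["service", "is", "so", "good"]
stop_words = {"is", "so", "the"}
plain = remove_stop_words(tokens, stop_words)
with_intensifier = remove_stop_words(tokens, stop_words, keep={"so"})
```

Keeping the intensifier preserves a signal that the later emotion classifier can use.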

In this embodiment, in step 5, a CBOW model is used. For a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, where $\mathrm{text}(w)$, the context of $w$, consists of the $c$ words before and after $w$, the input layer contains the $2c$ word vectors $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100). The projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$.
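The input and projection layers can be sketched in a few lines. The helper names and the toy 3-dimensional vectors are assumptions chosen for readability; the text gives a default vector length of 100.

```python
def context_window(tokens, index, c):
    """Up to c words before and c words after the word at `index`."""
    return tokens[max(0, index - c):index] + tokens[index + 1:index + 1 + c]


def project(context, vectors):
    """Projection layer: element-wise sum of the context word vectors."""
    m = len(next(iter(vectors.values())))
    total = [0.0] * m
    for word in context:
        for d, value in enumerate(vectors[word]):
            total[d] += value
    return total


tokens = ["a", "b", "c", "d", "e"]
vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
    "e": [1.0, 1.0, 1.0],
}
context = context_window(tokens, index=2, c=2)   # context of the word "c"
projected = project(context, vectors)
```

Near the start or end of a text the window holds fewer than 2c words, which this sketch simply sums as they are.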

In this embodiment, in step 5, an information gain method is adopted for feature selection,

wherein $n$ represents the total number of classes, $\bar{t}$ indicates that the feature item $t$ is absent, $P(c_i)$ represents the proportion of texts belonging to class $c_i$, $P(t)$ represents the proportion of texts containing the feature item $t$ in the total text, and $P(t \mid c_i)$ represents the proportion, in the total text, of texts that belong to class $c_i$ and contain the feature item $t$.
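As a worked example of information gain for feature selection, the sketch below computes the gain of a term directly from document counts, using the usual definition: the class entropy minus the entropy remaining once the presence or absence of the term is known. The function name, the representation of documents as token sets, and the toy corpus are assumptions. The patent's own formula is not preserved in this text.

```python
import math


def information_gain(docs, labels, term):
    """Class entropy minus the entropy left after splitting the documents
    by presence or absence of `term`."""
    n = len(docs)
    classes = sorted(set(labels))

    def entropy(subset):
        total = len(subset)
        if total == 0:
            return 0.0
        h = 0.0
        for c in classes:
            p = subset.count(c) / total
            if p > 0:
                h -= p * math.log2(p)
        return h

    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    p_t = len(with_t) / n
    return entropy(labels) - p_t * entropy(with_t) - (1 - p_t) * entropy(without_t)


# Toy corpus: each document is a set of tokens, with one label per document.
docs = [{"outage", "angry"}, {"outage", "slow"}, {"thanks", "fast"}, {"thanks", "repair"}]
labels = ["neg", "neg", "pos", "pos"]
ig_outage = information_gain(docs, labels, "outage")
ig_fast = information_gain(docs, labels, "fast")
```

Here "outage" separates the two classes perfectly and receives the maximum gain of one bit, while "fast" appears in only one document and scores lower. Terms would be ranked by this value and the top ones kept.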

In this embodiment, in step 6, the classifier uses a logistic regression model.
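A minimal logistic regression classifier over feature vectors can be sketched as below. It appends a constant 1 to each feature vector so that the bias is learned as the last weight. The training loop (plain stochastic gradient descent), the learning rate, the epoch count and the toy one-dimensional data are assumptions; the text does not describe how the model is fitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w):
    """x holds n features; w holds n weights followed by the bias b."""
    augmented = list(x) + [1.0]
    return sigmoid(sum(wi * xi for wi, xi in zip(w, augmented)))


def train(samples, labels, lr=0.1, epochs=500):
    """Fit the weights by stochastic gradient descent on the log-loss."""
    n = len(samples[0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(x, w) - y      # gradient of log-loss w.r.t. z
            augmented = list(x) + [1.0]
            for j in range(n + 1):
                w[j] -= lr * error * augmented[j]
    return w


# Toy data: one feature, label 1 for large values.
samples = [[0.0], [1.0], [3.0], [4.0]]
labels = [0, 0, 1, 1]
w = train(samples, labels)
p_low = predict_proba([0.0], w)
p_high = predict_proba([4.0], w)
```

In the pipeline described here, the inputs would be the text feature vectors from step 5 and the output the emotion class of each text.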

In this embodiment, to handle the minority-class samples of the unbalanced corpus, the SMOTE algorithm is adopted.

Claims

1. A data-driven public opinion analysis terminal, comprising a terminal body in which a memory and a processor are installed, characterized in that: a computer program is installed in the terminal, the computer program comprises a crawler module, a text preprocessing module and an emotion judging module, the crawler module is used for collecting public opinion data, the text preprocessing module is used for preprocessing character strings, and the emotion judging module is used for performing emotion analysis on texts.

2. The public opinion text analysis method based on the data-driven public opinion analysis terminal according to claim 1, characterized in that: the method comprises the following steps:

step 1: designing a topic, and analyzing page topics with a topic crawler;

step 2: carrying out data cleaning on the collected public opinion data;

step 5: generating text feature vectors for the processed text information;

step 6: applying a classifier to the set of text feature vectors;

step 7: generating a classification result.

3. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 1 includes initializing the seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the related topic of the page.

4. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.

5. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 5, a CBOW model is used; for a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, where $\mathrm{text}(w)$, the context of $w$, consists of the $c$ words before and after $w$, the input layer contains the $2c$ word vectors $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100); the projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$.

6. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 5, characterized in that: in step 5, an information gain method is adopted, wherein $n$ represents the total number of classes.

7. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 6, the classifier adopts a logistic regression model.

8. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 7, characterized in that: to handle the minority-class samples of the unbalanced corpus, the SMOTE algorithm is adopted.