CN111708886A - Public opinion analysis terminal and public opinion text analysis method based on data driving - Google Patents

Info

Publication number
CN111708886A
Authority
CN
China
Prior art keywords
text
public opinion
analysis
data
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010527263.3A
Other languages
Chinese (zh)
Inventor
贾晓亮
刘伟
张志杰
陈雪
孟吉凯
代志称
郑爱华
张自达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-06-11
Filing date
2020-06-11
Publication date
2020-09-25
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd
Priority to CN202010527263.3A
Publication of CN111708886A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention belongs to the technical field of databases and relates to public opinion analysis, in particular to a data-driven public opinion analysis terminal and a public opinion text analysis method. The public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that a computer program is installed in the terminal. The computer program comprises a crawler module, a text preprocessing module and an emotion judging module: the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on the text. Based on this terminal, a matching public opinion text analysis method is designed that processes web text data by combining algorithms such as Chinese word segmentation, stop-word removal, unbalanced corpus processing and feature selection, and finally realizes public opinion identification.

Description

Public opinion analysis terminal and public opinion text analysis method based on data driving
Technical Field
The invention belongs to the technical field of databases, relates to the technical field of public opinion analysis, and particularly relates to a public opinion analysis terminal and a public opinion text analysis method based on data driving.
Background
With the development of network technology and the spread of network applications, public opinion now propagates far faster than at any time in the past. When a group event occurs, the rapid spread of negative public opinion can cause the event to escalate and erupt within a very short time.
Early discovery, early judgment and early prevention of public opinion information have therefore become important prerequisites for public service departments to guide public opinion correctly. Using computers to help power grid enterprises acquire and organize public opinion text quickly and completely is a basic requirement for those enterprises to seize opportunities for public opinion management and control, maintain their corporate image and improve their service level.
In the course of public opinion propagation, positive public opinion promotes the spread of true information about an event, whereas negative public opinion can provoke adverse reactions to the event, undermine the stability of the public opinion environment and trigger a public opinion crisis. Effectively analyzing the emotion expressed in public opinion information, especially in text, is therefore very important, and emotion analysis of public opinion text is required.
Emotion analysis, also known as opinion mining, is the process of analyzing, processing, generalizing and reasoning about subjective text that carries emotional coloring. Currently, there are two approaches to emotion analysis of text: one based on semantic understanding and one based on machine learning. The first is severely limited when processing text with complex modes of expression and irregular wording, while the second is constrained by feature selection and corpus scale and is not suited to real-time processing of large volumes of text.
Therefore, a public opinion analysis terminal and a public opinion text analysis method are needed that process web text data by combining algorithms such as Chinese word segmentation, stop-word removal, unbalanced corpus processing and feature selection, and finally realize public opinion identification.
Disclosure of Invention
The invention aims to remedy the defects of the prior art by providing a public opinion analysis terminal and a public opinion text analysis method that combine algorithms such as Chinese word segmentation, stop-word removal, unbalanced corpus processing and feature selection, and finally realize public opinion identification.
The technical scheme adopted by the invention is as follows:
A data-driven public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module is used for collecting public opinion data, the text preprocessing module is used for preprocessing character strings, and the emotion judging module is used for performing emotion analysis on texts.
Further, the method comprises the following steps:
Step 1: design a topic, and have a topic crawler analyze the topic of each page;
Step 2: perform data cleaning on the collected public opinion data;
Step 3: perform Chinese word segmentation, first preprocessing with a dictionary matching method and then achieving accurate segmentation with a statistical word segmentation method;
Step 4: remove stop words, and remove certain internet colloquialisms that intensify the degree of expression;
Step 5: generate text feature vectors from the processed text;
Step 6: apply a classifier to the text feature vectors;
Step 7: generate the classification result.
Further, step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the page's related topic.
Further, step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.
Further, in step 5 a CBOW model is used. For a sample (text(w), w) in the known corpus T, the context text(w) consists of the c words before and the c words after w. The input layer contains the 2c word vectors of text(w), $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where m denotes the word vector length (default value 100). The projection layer sums the 2c word vectors from the input layer, i.e.
$$\sum_{i=1}^{2c} V(\mathrm{text}(w)_i) \in \mathbb{R}^m$$
The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and each word's frequency in the corpus as its weight, and the corresponding word vectors are obtained by successive binary classifications along the tree.
Further, in step 5, the information gain method is adopted
Figure BDA0002534041520000031
where n denotes the total number of classes, $\bar{t}$ denotes that the feature item t is absent, $P(c_i)$ denotes the probability of belonging to class $c_i$, $P(t)$ denotes the proportion of all texts that contain the feature item t, $P(t \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ and contain the feature item t, and $P(\bar{t} \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ but do not contain the feature item t.
Further, in step 6 the classifier adopts a logistic regression model
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-W^{\mathsf{T}} X}}$$
where the feature vector $X = \{x_1, x_2, \ldots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector $W = \{w_1, w_2, \ldots, w_n, b\} \in \mathbb{R}^{n+1}$.
Furthermore, for the minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted:
$$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x)$$
where $\tilde{x}$ is a neighboring sample of x; the new samples are added to the minority-class sample set to achieve the oversampling effect.
The invention has the advantages and positive effects that:
In the invention, the public opinion analysis terminal is formed by combining an existing device with a preset computer program; the terminal can be specially customized, or an existing computer or other mobile terminal can be used instead.
The invention processes public opinion data with a preset computer program, in which the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on the text, forming a complete processing system.
In the invention, web text data is acquired with crawler technology and the corresponding pages are parsed. During data cleaning, meaningless characters are removed and information such as replies, topic references, titles, URL references and times is ignored. Chinese word segmentation first preprocesses the text with a dictionary matching method and then achieves accurate segmentation with a statistical word segmentation method. Further processing then removes certain internet colloquialisms that intensify the degree of expression. Text features are then extracted with the CBOW model and selected with the information gain method. Finally, a classification result is obtained with the logistic regression model and the SMOTE algorithm, realizing public opinion identification.
Drawings
Fig. 1 is a block diagram of a public opinion analysis terminal according to the present invention;
Fig. 2 is a flowchart of a public opinion text analysis method according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are illustrative and are not intended to limit the scope of the invention.
A data-driven public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module is used for collecting public opinion data, the text preprocessing module is used for preprocessing character strings, and the emotion judging module is used for performing emotion analysis on texts.
In this embodiment, the method includes the following steps:
Step 1: design a topic, and have a topic crawler analyze the topic of each page;
Step 2: perform data cleaning on the collected public opinion data;
Step 3: perform Chinese word segmentation, first preprocessing with a dictionary matching method and then achieving accurate segmentation with a statistical word segmentation method;
Step 4: remove stop words, and remove certain internet colloquialisms that intensify the degree of expression;
Step 5: generate text feature vectors from the processed text;
Step 6: apply a classifier to the text feature vectors;
Step 7: generate the classification result.
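The dictionary-matching pre-pass of step 3 can be illustrated with forward maximum matching, one common dictionary-based segmentation technique. The patent does not say which matching variant it uses, so this choice, the function name `forward_max_match`, the `max_word_len` parameter and the toy dictionary are all assumptions made for illustration; the statistical refinement stage is not shown.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary for a power-grid corpus.
toy_dictionary = {"电网", "停电", "通知", "电网公司", "公司"}
segments = forward_max_match("电网公司停电通知", toy_dictionary)
```

Because the longest match wins, "电网公司" is kept as one token instead of being split into "电网" and "公司"; a statistical segmenter would then resolve the ambiguities that a purely greedy pass cannot.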
In this embodiment, step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the page's related topic.
In this embodiment, the upper limit on URL length is set to 50.
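A minimal sketch of the score-ordered crawl list described for step 1 follows. It covers only the queueing logic and performs no network access. The class name `Frontier`, the `topic_score` heuristic and the example URLs are hypothetical, and reading the limit of 50 as a maximum URL string length in characters is an assumption, since the text does not specify the unit.

```python
import heapq


class Frontier:
    """To-be-crawled list that always yields the highest-scoring URL first."""

    def __init__(self, max_url_length=50):
        # Assumption: the limit of 50 applies to the URL string length.
        self.max_url_length = max_url_length
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        """Queue a URL; return False if it is too long or already queued."""
        if len(url) > self.max_url_length or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


def topic_score(page_text, topic_terms):
    """Hypothetical relevance score: fraction of topic terms found in the page."""
    if not topic_terms:
        return 0.0
    return sum(term in page_text for term in topic_terms) / len(topic_terms)


frontier = Frontier()
frontier.add("http://example.com/a", 0.2)
frontier.add("http://example.com/b", 0.9)
first_url, first_score = frontier.pop()
```

In a full crawler, each popped page would be fetched, scored with something like `topic_score`, and its outgoing links added back to the frontier with that score.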
In this embodiment, the data cleaning in step 2 is cleaning of the corpus. It includes removing meaningless characters such as "#" and ignoring replies, topic references, titles, URL references, times and the like. This step uses manual labeling, with cross-validation of the labeling results.
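The cleaning rules of step 2 can be sketched with regular expressions. The patterns below assume microblog-style input ("#topic#" hashtags, a "回复@user:" reply prefix, ISO-like timestamps); these formats, the pattern names and `clean_text` are illustrative assumptions, not details taken from the patent. Titles are assumed to be dropped upstream as a separate field.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TOPIC_RE = re.compile(r"#[^#\s]+#")              # assumed "#topic#" form
REPLY_RE = re.compile(r"回复@[^:：\s]+[:：]?")     # assumed "回复@user:" prefix
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")
NOISE_RE = re.compile(r"[#@*~^]+")               # leftover meaningless symbols


def clean_text(raw):
    """Drop URLs, topic references, reply prefixes, timestamps and stray symbols."""
    text = raw
    for pattern in (URL_RE, TOPIC_RE, REPLY_RE, TIME_RE):
        text = pattern.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("回复@小明: #停电通知# 今天又停电了 http://t.cn/abc 2020-06-11 08:30")
```

Structured patterns are removed before the single-character noise pass so that a "#topic#" reference is dropped as a whole rather than leaving its inner text behind.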
In this embodiment, step 4 handles certain internet colloquialisms that intensify the degree of expression. For example, the word "to" often follows a positive emotion word so that the whole context is positive, and this word is therefore removed from the stop-word lexicon.
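One way to read this step is that degree-carrying words are taken out of the stop-word lexicon before filtering, so they survive into the feature stage. The sketch below follows that reading; the English tokens, the word lists and the function names are hypothetical stand-ins.

```python
def build_stop_list(base_stop_words, degree_words):
    """Return the stop-word lexicon with degree-carrying words taken out."""
    return set(base_stop_words) - set(degree_words)


def remove_stop_words(tokens, stop_words):
    """Keep every token that is not in the stop-word lexicon."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical lists: "to" is ordinarily a stop word but is kept here
# because it carries sentiment degree in this corpus.
stop_words = build_stop_list({"the", "a", "to"}, {"to"})
kept = remove_stop_words(["happy", "to", "the", "core"], stop_words)
```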
In this embodiment, step 5 uses a CBOW model. For a sample (text(w), w) in the known corpus T, the context text(w) consists of the c words before and the c words after w. The input layer contains the 2c word vectors of text(w), $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where m denotes the word vector length (default value 100). The projection layer sums the 2c word vectors from the input layer, i.e.
$$\sum_{i=1}^{2c} V(\mathrm{text}(w)_i) \in \mathbb{R}^m$$
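The input and projection layers described above can be sketched in plain Python. The helper names `context_window` and `project` and the toy three-dimensional vectors are assumptions for illustration; a real model would use m = 100 and learn the vectors during training.

```python
def context_window(tokens, index, c):
    """Return up to c words before and c words after tokens[index]."""
    left = tokens[max(0, index - c):index]
    right = tokens[index + 1:index + 1 + c]
    return left + right


def project(context, vectors, m):
    """Projection layer: element-wise sum of the context word vectors."""
    total = [0.0] * m
    for word in context:
        for k, value in enumerate(vectors[word]):
            total[k] += value
    return total


# Toy vocabulary with m = 3 instead of the default 100.
toy_vectors = {"b": [1.0, 2.0, 3.0], "d": [0.5, 0.5, 0.5]}
window = context_window(["a", "b", "c", "d", "e"], index=2, c=1)
projected = project(window, toy_vectors, m=3)
```

Near the start or end of a sentence the window simply contains fewer than 2c words, which is one common way to handle boundaries.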
The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and each word's frequency in the corpus as its weight, and the corresponding word vectors are obtained by successive binary classifications along the tree.
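The Huffman tree construction can be sketched with a priority queue. Each leaf's code is the sequence of left/right decisions on the path from the root, which corresponds to the chain of binary classifications mentioned above. The function name `huffman_codes` and the toy frequencies are hypothetical, and training the per-node classifiers is not shown.

```python
import heapq
import itertools


def huffman_codes(frequencies):
    """Build a Huffman tree from word frequencies and return each word's code."""
    counter = itertools.count()  # tie-breaker so tree nodes are never compared
    heap = [(freq, next(counter), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two least frequent nodes into one internal node.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):   # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                         # leaf: a word
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


codes = huffman_codes({"停电": 5, "电网": 3, "通知": 1, "服务": 1})
```

Frequent words end up close to the root, so they need fewer binary decisions than rare words.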
In this embodiment, step 5 adopts the information gain method
Figure BDA0002534041520000052
where n denotes the total number of classes, $\bar{t}$ denotes that the feature item t is absent, $P(c_i)$ denotes the probability of belonging to class $c_i$, $P(t)$ denotes the proportion of all texts that contain the feature item t, $P(t \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ and contain the feature item t, and $P(\bar{t} \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ but do not contain the feature item t.
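Information gain for a single feature item can be sketched as follows. The code uses the standard textbook form IG(t) = H(C) - H(C | t), that is, class entropy minus the entropy remaining once the presence or absence of t is known; the patent's own formula may be written in a different notation. The function names, the toy documents and the labels are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(documents, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    `documents` is a non-empty list of token sets, `labels` the class of each.
    """
    with_term = Counter(lab for doc, lab in zip(documents, labels) if term in doc)
    without_term = Counter(lab for doc, lab in zip(documents, labels) if term not in doc)
    n_docs = len(documents)
    n_with = sum(with_term.values())
    n_without = n_docs - n_with
    conditional = ((n_with / n_docs) * entropy(with_term.values())
                   + (n_without / n_docs) * entropy(without_term.values()))
    return entropy(Counter(labels).values()) - conditional


docs = [{"停电", "投诉"}, {"停电", "不满"}, {"抢修", "及时"}, {"服务", "满意"}]
labels = ["neg", "neg", "pos", "pos"]
gain_outage = information_gain(docs, labels, "停电")
gain_complaint = information_gain(docs, labels, "投诉")
```

In this toy corpus "停电" separates the two classes perfectly and therefore scores higher than "投诉", which occurs in only one document; feature selection would keep the highest-scoring terms.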
In this embodiment, the classifier in step 6 uses a logistic regression model
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-W^{\mathsf{T}} X}}$$
where the feature vector $X = \{x_1, x_2, \ldots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector $W = \{w_1, w_2, \ldots, w_n, b\} \in \mathbb{R}^{n+1}$.
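The prediction side of a logistic regression classifier with the augmented vectors defined above can be sketched as follows: the constant 1 is appended to the features so that the bias b is simply the last weight. The function names and the example weights are hypothetical, and fitting the weights (for example by gradient descent) is not shown.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights):
    """Return sigmoid(W . X) with X = (x_1, ..., x_n, 1) and W = (w_1, ..., w_n, b)."""
    if len(weights) != len(features) + 1:
        raise ValueError("weights must hold one entry per feature plus the bias")
    x_aug = list(features) + [1.0]
    z = sum(w * x for w, x in zip(weights, x_aug))
    return sigmoid(z)


# Hypothetical weights for a two-feature model: w1 = 1.0, w2 = -1.0, b = 0.5.
prob = predict_proba([2.0, 1.0], [1.0, -1.0, 0.5])
label = "positive" if prob >= 0.5 else "negative"
```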
In this embodiment, for the minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted:
$$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x)$$
where $\tilde{x}$ is a neighboring sample of x; the new samples are added to the minority-class sample set to achieve the oversampling effect.
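A minimal sketch of SMOTE-style oversampling follows: each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors. The function names, the parameters `k` and `seed`, and the toy points are assumptions for illustration; production code would more likely rely on an existing library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples; needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbors of x, excluding x itself.
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: sq_dist(x, p))[:k]
        x_tilde = rng.choice(neighbours)
        gap = rng.random()  # rand(0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_tilde)])
    return synthetic


minority_samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(minority_samples, n_new=5)
augmented_minority = minority_samples + new_samples
```

Because every synthetic point is an interpolation between two real minority samples, the augmented set stays inside the region the minority class already occupies.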
In the invention, the public opinion analysis terminal is formed by combining an existing device with a preset computer program; the terminal can be specially customized, or an existing computer or other mobile terminal can be used instead.
The invention processes public opinion data with a preset computer program, in which the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on the text, forming a complete processing system.
In the invention, web text data is acquired with crawler technology and the corresponding pages are parsed. During data cleaning, meaningless characters are removed and information such as replies, topic references, titles, URL references and times is ignored. Chinese word segmentation first preprocesses the text with a dictionary matching method and then achieves accurate segmentation with a statistical word segmentation method. Further processing then removes certain internet colloquialisms that intensify the degree of expression. Text features are then extracted with the CBOW model and selected with the information gain method. Finally, a classification result is obtained with the logistic regression model and the SMOTE algorithm, realizing public opinion identification.

Claims (8)

1. A data-driven public opinion analysis terminal, comprising a terminal body in which a memory and a processor are installed, characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module is used for collecting public opinion data, the text preprocessing module is used for preprocessing character strings, and the emotion judging module is used for performing emotion analysis on texts.
2. A public opinion text analysis method based on the data-driven public opinion analysis terminal according to claim 1, characterized in that the method comprises the following steps:
Step 1: design a topic, and have a topic crawler analyze the topic of each page;
Step 2: perform data cleaning on the collected public opinion data;
Step 3: perform Chinese word segmentation, first preprocessing with a dictionary matching method and then achieving accurate segmentation with a statistical word segmentation method;
Step 4: remove stop words, and remove certain internet colloquialisms that intensify the degree of expression;
Step 5: generate text feature vectors from the processed text;
Step 6: apply a classifier to the text feature vectors;
Step 7: generate the classification result.
3. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the page's related topic.
4. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.
5. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 5 a CBOW model is used; for a sample (text(w), w) in the known corpus T, the context text(w) consists of the c words before and the c words after w; the input layer contains the 2c word vectors of text(w), $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \ldots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where m denotes the word vector length (default value 100); the projection layer sums the 2c word vectors from the input layer, i.e.
$$\sum_{i=1}^{2c} V(\mathrm{text}(w)_i) \in \mathbb{R}^m$$
The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and each word's frequency in the corpus as its weight, and the corresponding word vectors are obtained by successive binary classifications along the tree.
6. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 5, characterized in that: in step 5, the information gain method is adopted
Figure FDA0002534041510000022
where n denotes the total number of classes, $\bar{t}$ denotes that the feature item t is absent, $P(c_i)$ denotes the probability of belonging to class $c_i$, $P(t)$ denotes the proportion of all texts that contain the feature item t, $P(t \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ and contain the feature item t, and $P(\bar{t} \mid c_i)$ denotes the proportion of all texts that belong to class $c_i$ but do not contain the feature item t.
7. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 6 the classifier adopts a logistic regression model
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-W^{\mathsf{T}} X}}$$
where the feature vector $X = \{x_1, x_2, \ldots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector $W = \{w_1, w_2, \ldots, w_n, b\} \in \mathbb{R}^{n+1}$.
8. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 7, characterized in that: for the minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted:
$$x_{\mathrm{new}} = x + \mathrm{rand}(0, 1) \times (\tilde{x} - x)$$
where $\tilde{x}$ is a neighboring sample of x; the new samples are added to the minority-class sample set to achieve the oversampling effect.
CN202010527263.3A 2020-06-11 2020-06-11 Public opinion analysis terminal and public opinion text analysis method based on data driving Pending CN111708886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010527263.3A CN111708886A (en) 2020-06-11 2020-06-11 Public opinion analysis terminal and public opinion text analysis method based on data driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010527263.3A CN111708886A (en) 2020-06-11 2020-06-11 Public opinion analysis terminal and public opinion text analysis method based on data driving

Publications (1)

Publication Number Publication Date
CN111708886A true CN111708886A (en) 2020-09-25

Family

ID=72540334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010527263.3A Pending CN111708886A (en) 2020-06-11 2020-06-11 Public opinion analysis terminal and public opinion text analysis method based on data driving

Country Status (1)

Country Link
CN (1) CN111708886A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4930077A (en) * 1987-04-06 1990-05-29 Fan David P Information processing expert system for text analysis and predicting public opinion based information available to the public
CN103544255A (en) * 2013-10-15 2014-01-29 常州大学 Text semantic relativity based network public opinion information analysis method
CN107291693A (en) * 2017-06-15 2017-10-24 广州赫炎大数据科技有限公司 A kind of semantic computation method for improving term vector model
WO2019080863A1 (en) * 2017-10-26 2019-05-02 福建亿榕信息技术有限公司 Text sentiment classification method, storage medium and computer
KR20190093757A (en) * 2018-01-11 2019-08-12 주식회사 와이즈인컴퍼니 Analyzing and Reporting System for Survey and Poll Data

Similar Documents

Publication Publication Date Title
CN107609132B (en) Semantic ontology base based Chinese text sentiment analysis method
CN108090070B (en) Chinese entity attribute extraction method
CN111767403B (en) Text classification method and device
CN106776574B (en) User comment text mining method and device
Soliman et al. Sentiment analysis of Arabic slang comments on facebook
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN108363748B (en) Topic portrait system and topic portrait method based on knowledge
CN102270212A (en) User interest feature extraction method based on hidden semi-Markov model
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN111767725A (en) Data processing method and device based on emotion polarity analysis model
Rashid et al. Feature level opinion mining of educational student feedback data using sequential pattern mining and association rule mining
CN110134788B (en) Microblog release optimization method and system based on text mining
CN109446299B (en) Method and system for searching e-mail content based on event recognition
CN111626050A (en) Microblog emotion analysis method based on expression dictionary and emotion common sense
CN110910175A (en) Tourist ticket product portrait generation method
CN107451116B (en) Statistical analysis method for mobile application endogenous big data
Filho et al. Gender classification of twitter data based on textual meta-attributes extraction
CN113157860A (en) Electric power equipment maintenance knowledge graph construction method based on small-scale data
TW202111569A (en) Text classification method with high scalability and multi-tag and apparatus thereof also providing a method and a device for constructing topic classification templates
CN115329085A (en) Social robot classification method and system
CN110728144A (en) Extraction type document automatic summarization method based on context semantic perception
CN112632272B (en) Microblog emotion classification method and system based on syntactic analysis
CN112115712B (en) Topic-based group emotion analysis method
CN108829806A (en) Across the evental news text emotion analysis methods of one kind
CN111191413B (en) Method, device and system for automatically marking event core content based on graph sequencing model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination