CN111708886A - Public opinion analysis terminal and public opinion text analysis method based on data driving - Google Patents
- Publication number
- CN111708886A
- Authority
- CN
- China
- Prior art keywords
- text
- public opinion
- analysis
- data
- terminal
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention belongs to the technical field of databases, relates to public opinion analysis, and in particular concerns a data-driven public opinion analysis terminal and public opinion text analysis method. The terminal comprises a terminal body in which a memory and a processor are installed, and is characterized in that a computer program is installed in the terminal. The program comprises a crawler module, which collects public opinion data; a text preprocessing module, which preprocesses character strings; and an emotion judging module, which performs emotion analysis on texts. Based on this terminal, a matching public opinion text analysis method is designed that processes network text data through Chinese word segmentation, stop-word removal, unbalanced corpus processing, feature selection and similar algorithms, and finally achieves public opinion identification.
Description
Technical Field
The invention belongs to the technical field of databases, relates to public opinion analysis, and in particular concerns a data-driven public opinion analysis terminal and public opinion text analysis method.
Background
With the development of network technology and the spread of network applications, public opinion now propagates far faster than in any past period. When a group event occurs, the rapid spread of negative public opinion can cause the event to escalate and erupt within a very short time.
Therefore, early discovery, early judgment and early prevention of public opinion information have become important prerequisites for public service departments to guide public opinion correctly. Using computers to help power grid enterprises acquire and organize public opinion text quickly and completely is a basic requirement for those enterprises to seize the opportunity for public opinion management and control, maintain their corporate image, and improve their service level.
As public opinion spreads, positive public opinion promotes the spread of accurate information about an event, while negative public opinion can provoke adverse responses to the event, destabilize the public opinion environment and trigger a public opinion crisis. How to analyze effectively the emotion expressed in public opinion information, especially in text, is therefore very important, and emotion analysis of public opinion text is required.
Emotion analysis, also known as opinion mining, is the process of analyzing, processing, summarizing and reasoning about subjective texts with emotional coloring. Currently there are two approaches to emotion analysis of text: one based on semantic understanding and the other based on machine learning. The first is severely limited when processing texts with complex modes of expression and irregular text information, while the second is constrained by feature selection and corpus scale and is not suitable for real-time processing of large volumes of text.
Therefore, a public opinion analysis terminal and a public opinion text analysis method should be designed that process web text data through Chinese word segmentation, stop-word removal, unbalanced corpus processing, feature selection and similar algorithms, and finally achieve public opinion identification.
Disclosure of Invention
The invention aims to remedy the defects of the prior art by providing a public opinion analysis terminal and a public opinion text analysis method that combine Chinese word segmentation, stop-word removal, unbalanced corpus processing, feature selection and similar algorithms, and finally achieve public opinion identification.
The technical scheme adopted by the invention is as follows:
A data-driven public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on texts.
Further, the method comprises the following steps:
step 1: designing a topic, and analyzing page topics with a topic crawler;
step 2: cleaning the collected public opinion data;
step 3: performing Chinese word segmentation, which comprises preprocessing by a dictionary matching method and then achieving accurate segmentation by a statistical word segmentation method;
step 4: removing stop words, and removing some internet usages that intensify the degree of expression;
step 5: generating text feature vectors for the processed text information;
step 6: applying a classifier to the text feature vectors;
step 7: generating a classification result.
Further, step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the topics related to the page.
Further, step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.
Further, in step 5, the CBOW model is used. For a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, the context $\mathrm{text}(w)$ consists of the $c$ words before and the $c$ words after $w$. The input layer contains the $2c$ word vectors of $\mathrm{text}(w)$, namely $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \dots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100). The projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$. The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and the frequency of each word in the corpus as its weight, and the corresponding word vectors are obtained by repeated binary classification along the tree.
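As an illustration of the output-layer construction described above, the following minimal Python sketch builds a Huffman tree from word frequencies and derives each word's binary path from the root. The function name `huffman_codes` and the toy frequencies are hypothetical; the patent does not specify an implementation, and the training of the binary classifier at each inner node is not shown.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {word: bit string} for a Huffman tree built from word frequencies."""
    if not freqs:
        return {}
    tie = count()  # tie-breaker so heapq never has to compare tree payloads
    heap = [(f, next(tie), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    # Repeatedly merge the two lightest subtrees; inner nodes are (left, right) tuples.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # inner node: one binary decision per level
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf node: a word from the corpus
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


# Toy frequencies (hypothetical): frequent words end up closer to the root.
codes = huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1})
```

Frequent words receive short paths, so fewer binary classifications are needed to reach them.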
Further, in step 5, features are selected by the information gain method, where $n$ denotes the total number of classes, $\bar{t}$ indicates that the feature item $t$ is absent, $P(c_i)$ is the proportion of texts belonging to class $c_i$, $P(t)$ is the proportion of texts in the total text that contain the feature item $t$, $P(t \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ and contain the feature item $t$, and $P(\bar{t} \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ but do not contain the feature item $t$.
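The information gain of a feature item can be sketched as follows. This is a minimal sketch that assumes the standard entropy-based formulation (class entropy minus the class entropy conditioned on the presence or absence of the feature item); the patent's own formula is not reproduced in this text, so the exact form, the function name `information_gain`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def entropy(counter, total):
    """Shannon entropy (base 2) of a class-count table."""
    return -sum((k / total) * math.log2(k / total) for k in counter.values() if k)


def information_gain(docs, labels, term):
    """docs: list of token sets; labels: class label of each document."""
    n = len(docs)
    class_counts = Counter(labels)
    # Class counts among documents that contain / do not contain the term.
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n_t = sum(with_t.values())
    n_not = n - n_t
    conditional = 0.0
    if n_t:
        conditional += (n_t / n) * entropy(with_t, n_t)
    if n_not:
        conditional += (n_not / n) * entropy(without_t, n_not)
    return entropy(class_counts, n) - conditional


# Toy corpus (hypothetical): "good" separates the classes, "x" does not.
docs = [{"good"}, {"good", "x"}, {"bad"}, {"bad", "x"}]
labels = ["pos", "pos", "neg", "neg"]
gain_good = information_gain(docs, labels, "good")
gain_x = information_gain(docs, labels, "x")
```

Feature items would then be ranked by their gain and only the highest-scoring ones kept.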
Further, in step 6, the classifier is a logistic regression model, where the feature vector is $X = \{x_1, x_2, \dots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector is $W = \{w_1, w_2, \dots, w_n, b\} \in \mathbb{R}^{n+1}$.
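A minimal sketch of such a classifier follows. It assumes the standard sigmoid form of logistic regression and plain stochastic gradient updates for training, neither of which is spelled out in this text; the function names, learning rate and toy data are hypothetical. The constant 1 appended to the features matches the augmented vector X, so the last weight plays the role of b.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """P(y = 1 | x) for weights w = (w_1..w_n, b) and features x = (x_1..x_n)."""
    x_aug = list(x) + [1.0]  # augmented vector X = {x_1, ..., x_n, 1}
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x_aug)))


def train(samples, labels, lr=0.5, epochs=500):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            x_aug = list(x) + [1.0]
            err = y - predict_proba(w, x)
            for j, xj in enumerate(x_aug):
                w[j] += lr * err * xj
    return w


# Toy one-dimensional data (hypothetical): small values negative, large positive.
w = train([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```

A text would be assigned to the positive class when `predict_proba` exceeds a chosen threshold such as 0.5.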
Furthermore, for minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted: new samples are synthesized from adjacent samples and added to the minority-class sample set to achieve the oversampling effect.
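The oversampling step can be sketched as below, assuming the standard SMOTE interpolation in which a synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The formula is not reproduced in this text, so this form, the function name `smote`, and the parameters `k` and `seed` are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples from a list of minority-class vectors.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbors of x by squared Euclidean distance.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        x_hat = rng.choice(others[:k])
        gap = rng.random()  # random position between x and its neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_hat)])
    return synthetic


# Toy minority class (hypothetical) with three two-dimensional samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
augmented = minority + smote(minority, n_new=5)
```

The synthetic samples are appended to the minority set before the classifier is trained, so the two classes are closer in size.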
The invention has the advantages and positive effects that:
In the invention, a public opinion analysis terminal is formed by combining an existing device with a preset computer program; the terminal can be purpose-built, or it can be realized with an existing computer or other mobile terminal.
The invention processes public opinion data with a preset computer program, in which a crawler module collects public opinion data, a text preprocessing module preprocesses character strings, and an emotion judging module performs emotion analysis on the text, forming a complete processing system.
In the invention, web text data is acquired by crawler technology and the corresponding pages are analyzed. During data cleaning, meaningless characters are removed and information such as replies, topic references, titles, URL references and times is ignored. Chinese word segmentation first preprocesses the text by dictionary matching and then achieves accurate segmentation by a statistical word segmentation method. Further processing then removes some internet usages that intensify the degree of expression. Text features are then extracted with the CBOW model and selected by the information gain method. Finally, a classification result is obtained with the logistic regression model and the SMOTE algorithm, thereby achieving public opinion identification.
Drawings
Fig. 1 is a block diagram of a public opinion analysis terminal according to the present invention;
Fig. 2 is a flowchart of a public opinion text analysis method according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are illustrative and are not intended to limit the scope of the invention.
A data-driven public opinion analysis terminal comprises a terminal body in which a memory and a processor are installed, characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on texts.
In this embodiment, the method includes the following steps:
step 1: designing a topic, and analyzing page topics with a topic crawler;
step 2: cleaning the collected public opinion data;
step 3: performing Chinese word segmentation, which comprises preprocessing by a dictionary matching method and then achieving accurate segmentation by a statistical word segmentation method;
step 4: removing stop words, and removing some internet usages that intensify the degree of expression;
step 5: generating text feature vectors for the processed text information;
step 6: applying a classifier to the text feature vectors;
step 7: generating a classification result.
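The dictionary-matching preprocessing in step 3 can be illustrated with forward maximum matching, a common dictionary-based segmentation technique. The patent names only "dictionary matching", so the choice of forward maximum matching, the function name, the `max_len` parameter and the toy dictionary are assumptions; the subsequent statistical refinement is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, preferring the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                # Unknown single characters are emitted as they are.
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical).
dictionary = {"电网", "企业", "舆情", "分析"}
tokens = forward_max_match("电网企业舆情分析", dictionary)
```

A statistical segmenter would then resolve the ambiguities that greedy matching cannot, for example overlapping dictionary words.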
In this embodiment, step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the topics related to the page.
In this embodiment, the upper limit of the URL length is set to 50.
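The score-ordered crawl list can be sketched with a priority queue. This is a minimal sketch under stated assumptions: the limit of 50 is read as the maximum number of characters in a URL, and `fetch_links` and `score` are hypothetical callables supplied by the caller (no network access is performed here); the patent does not describe the scoring function.

```python
import heapq

MAX_URL_LEN = 50  # upper limit from the embodiment, read here as characters per URL


def crawl(seeds, fetch_links, score, max_pages=10):
    """Visit URLs in descending topic score, starting from the seed URLs.

    fetch_links(url) -> list of URLs found on the page (hypothetical callable).
    score(url) -> topic relevance, higher is better (hypothetical callable).
    """
    frontier = []
    seen = set()
    for url in seeds:
        if url not in seen and len(url) <= MAX_URL_LEN:
            seen.add(url)
            heapq.heappush(frontier, (-score(url), url))  # negate: heapq is a min-heap
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and len(link) <= MAX_URL_LEN:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return visited


# Toy link graph and scoring rule (both hypothetical).
graph = {
    "http://a.example/": [
        "http://a.example/grid",
        "http://a.example/sports",
        "http://a.example/" + "x" * 60,  # too long, skipped
    ],
}
order = crawl(
    ["http://a.example/"],
    fetch_links=lambda u: graph.get(u, []),
    score=lambda u: 1.0 if "grid" in u else 0.1,
)
```

Pages judged more relevant to the designed topic are therefore fetched before less relevant ones.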
In this embodiment, the data cleaning in step 2 is applied to the corpus. It includes removing meaningless characters such as "#" and ignoring replies, topic references, titles, URL references, times and the like. This step uses manual labeling with cross-validation of the labels.
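A cleaning pass of this kind can be sketched with regular expressions. The concrete patterns below assume microblog-style conventions (`#topic#` for topic references, `【...】` for titles, `@name` for replies, `YYYY-MM-DD HH:MM` for times); these conventions and the function name `clean` are assumptions, since the text does not specify the source format.

```python
import re

# Each pattern marks one kind of content that the cleaning step ignores.
PATTERNS = [
    re.compile(r"https?://\S+"),              # URL references
    re.compile(r"#[^#\s]+#"),                 # topic references written as #topic#
    re.compile(r"【[^】]*】"),                  # titles in full-width brackets
    re.compile(r"(?:回复)?@[\w\-]+[::]?"),     # replies and mentions, ending at a colon or space
    re.compile(r"\d{4}-\d{1,2}-\d{1,2}(?: \d{1,2}:\d{2}(?::\d{2})?)?"),  # dates and times
    re.compile(r"[#*~^]+"),                   # leftover meaningless characters such as "#"
]


def clean(text):
    """Strip ignored content and collapse the remaining whitespace."""
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy post (hypothetical).
cleaned = clean("#停电# 回复@user1: 服务很好 http://t.cn/abc 2020-06-11 10:30")
```

The order matters: topic references are removed before stray "#" characters, so a complete `#topic#` is dropped as a unit.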
In this embodiment, step 4 gives special handling to some internet usages that intensify the degree of expression. For example, the word "to" often follows a positive emotion word, in which case the whole context is positive, so this word is removed from the stop-word list.
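One reading of this step is that degree-intensifying words are exempted from stop-word removal so that they survive into the feature vectors. The sketch below follows that reading, which is an assumption; the function name, the `keep` parameter and the English placeholder tokens are hypothetical.

```python
def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop stop words, except those explicitly exempted via keep."""
    effective = set(stop_words) - set(keep)
    return [t for t in tokens if t not in effective]


# Placeholder tokens (hypothetical): "to" is exempted because it carries sentiment context.
tokens = ["service", "is", "good", "to", "the", "max"]
filtered = remove_stop_words(tokens, stop_words={"is", "the", "to"}, keep={"to"})
```

Keeping the exemption list separate from the stop-word list lets it be tuned without editing a shared stop-word resource.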
In this embodiment, in step 5, the CBOW model is used. For a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, the context $\mathrm{text}(w)$ consists of the $c$ words before and the $c$ words after $w$. The input layer contains the $2c$ word vectors of $\mathrm{text}(w)$, namely $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \dots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100). The projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$. The output layer is a binary tree: a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and the frequency of each word in the corpus as its weight, and the corresponding word vectors are obtained by repeated binary classification along the tree.
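The input and projection layers can be sketched as follows: the context window is gathered around the center word and its word vectors are summed element by element. The function names and the two-dimensional toy vectors are hypothetical (the text uses a default length of 100), and near the ends of a sentence the window simply holds fewer than 2c words.

```python
def context(tokens, pos, c):
    """The c words before and the c words after the word at index pos."""
    return tokens[max(0, pos - c):pos] + tokens[pos + 1:pos + 1 + c]


def project(context_words, embeddings):
    """Projection layer: element-wise sum of the context word vectors."""
    m = len(next(iter(embeddings.values())))  # word vector length
    total = [0.0] * m
    for word in context_words:
        for j, value in enumerate(embeddings[word]):
            total[j] += value
    return total


# Toy sentence and word vectors (hypothetical), with c = 1 and m = 2.
tokens = ["a", "b", "c", "d", "e"]
embeddings = {"a": [0.0, 0.0], "b": [1.0, 2.0], "c": [9.0, 9.0],
              "d": [0.5, -1.0], "e": [0.0, 0.0]}
summed = project(context(tokens, pos=2, c=1), embeddings)
```

The summed vector is what the output-layer tree consumes when predicting the center word.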
In this embodiment, in step 5, features are selected by the information gain method, where $n$ denotes the total number of classes, $\bar{t}$ indicates that the feature item $t$ is absent, $P(c_i)$ is the proportion of texts belonging to class $c_i$, $P(t)$ is the proportion of texts in the total text that contain the feature item $t$, $P(t \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ and contain the feature item $t$, and $P(\bar{t} \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ but do not contain the feature item $t$.
In this embodiment, in step 6, the classifier is a logistic regression model, where the feature vector is $X = \{x_1, x_2, \dots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector is $W = \{w_1, w_2, \dots, w_n, b\} \in \mathbb{R}^{n+1}$.
In this embodiment, for minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted: new samples are synthesized from adjacent samples and added to the minority-class sample set to achieve the oversampling effect.
Claims (8)
1. A data-driven public opinion analysis terminal, comprising a terminal body in which a memory and a processor are installed, characterized in that: a computer program is installed in the terminal; the computer program comprises a crawler module, a text preprocessing module and an emotion judging module; the crawler module collects public opinion data, the text preprocessing module preprocesses character strings, and the emotion judging module performs emotion analysis on texts.
2. A public opinion text analysis method based on the data-driven public opinion analysis terminal according to claim 1, characterized in that the method comprises the following steps:
step 1: designing a topic, and analyzing page topics with a topic crawler;
step 2: cleaning the collected public opinion data;
step 3: performing Chinese word segmentation, which comprises preprocessing by a dictionary matching method and then achieving accurate segmentation by a statistical word segmentation method;
step 4: removing stop words, and removing some internet usages that intensify the degree of expression;
step 5: generating text feature vectors for the processed text information;
step 6: applying a classifier to the text feature vectors;
step 7: generating a classification result.
3. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 1 includes initializing seed URLs, adding URLs to the to-be-crawled list according to their scores, taking the first seed from the URL list, and analyzing the topics related to the page.
4. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: step 2 includes removing meaningless characters and ignoring replies, topic references, titles, URL references, times and the like.
5. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 5, the CBOW model is used; for a sample $(\mathrm{text}(w), w)$ in the known corpus $T$, the context $\mathrm{text}(w)$ consists of the $c$ words before and the $c$ words after $w$; the input layer contains the $2c$ word vectors of $\mathrm{text}(w)$, namely $V(\mathrm{text}(w)_1), V(\mathrm{text}(w)_2), \dots, V(\mathrm{text}(w)_{2c}) \in \mathbb{R}^m$, where $m$ denotes the word vector length (default value 100); the projection layer sums the $2c$ word vectors from the input layer, i.e. the sum is $\sum_{i=1}^{2c} V(\mathrm{text}(w)_i)$; the output layer is a binary tree, in which a Huffman tree is constructed with the words appearing in the corpus as leaf nodes and the frequency of each word in the corpus as its weight, and the corresponding word vectors are obtained by repeated binary classification along the tree.
6. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 5, characterized in that: in step 5, features are selected by the information gain method, where $n$ denotes the total number of classes, $\bar{t}$ indicates that the feature item $t$ is absent, $P(c_i)$ is the proportion of texts belonging to class $c_i$, $P(t)$ is the proportion of texts in the total text that contain the feature item $t$, $P(t \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ and contain the feature item $t$, and $P(\bar{t} \mid c_i)$ is the proportion of texts in the total text that belong to class $c_i$ but do not contain the feature item $t$.
7. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 2, characterized in that: in step 6, the classifier is a logistic regression model, where the feature vector is $X = \{x_1, x_2, \dots, x_n, 1\} \in \mathbb{R}^{n+1}$ and the corresponding weight vector is $W = \{w_1, w_2, \dots, w_n, b\} \in \mathbb{R}^{n+1}$.
8. The public opinion text analysis method based on a data-driven public opinion analysis terminal according to claim 7, characterized in that: for minority-class samples in an unbalanced sample distribution, the SMOTE algorithm is adopted, in which new samples are synthesized from adjacent samples and added to the minority-class sample set to achieve the oversampling effect.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010527263.3A CN111708886A (en) | 2020-06-11 | 2020-06-11 | Public opinion analysis terminal and public opinion text analysis method based on data driving |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010527263.3A CN111708886A (en) | 2020-06-11 | 2020-06-11 | Public opinion analysis terminal and public opinion text analysis method based on data driving |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111708886A (en) | 2020-09-25 |
Family
ID=72540334
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010527263.3A (Pending), CN111708886A (en) | 2020-06-11 | 2020-06-11 | Public opinion analysis terminal and public opinion text analysis method based on data driving |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111708886A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4930077A (en) * | 1987-04-06 | 1990-05-29 | Fan David P | Information processing expert system for text analysis and predicting public opinion based information available to the public |
CN103544255A (en) * | 2013-10-15 | 2014-01-29 | 常州大学 | Text semantic relativity based network public opinion information analysis method |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
WO2019080863A1 (en) * | 2017-10-26 | 2019-05-02 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer |
KR20190093757A (en) * | 2018-01-11 | 2019-08-12 | 주식회사 와이즈인컴퍼니 | Analyzing and Reporting System for Survey and Poll Data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||