CN114118069A - Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm - Google Patents
- Publication number
- CN114118069A (application CN202111027932.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- emotion
- sopmi
- words
- pmi
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- Business, Economics & Management (AREA)
- Computing Systems (AREA)
- Human Resources & Organizations (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Tourism & Hospitality (AREA)
- Strategic Management (AREA)
- Primary Health Care (AREA)
- Marketing (AREA)
- Probability & Statistics with Applications (AREA)
- General Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Economics (AREA)
- Mathematical Physics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an emotion dictionary expansion method and an emotion polarity analysis method based on the SOPMI algorithm, comprising the following steps: collecting public opinion text data; collecting positive and negative words as seed words; calculating the emotion scores of words with the SOPMI algorithm and generating an emotion dictionary; segmenting the text data with jieba and extracting the nouns and verbs in it; and scoring each text by weighting the scores of its nouns and verbs. This solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to a project.
Description
Technical Field
The invention relates to the technical field of machine learning and text analysis, in particular to an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm.
Background
With the development of society, the diversification of media, the rapid growth of networks, the spread of mobile terminals, and the wide use of microblogs and forums, a large amount of information flows through networks in many forms every day. Even after junk information is filtered out, the remaining information cannot be evaluated effectively. Text scoring makes it possible to distinguish positive from negative information and high-quality from low-quality information. On the one hand, this facilitates public opinion analysis by enterprises or governments; on the other hand, it gives a broader picture of a product's reputation among users.
A common method is to take an emotion dictionary with wide coverage, match the words of a text against the entries of the dictionary, and weight the score of each word to obtain a final score. This score depends entirely on how well the word segmentation result matches the emotion dictionary. When a word has no corresponding entry in the dictionary, a null value is often returned, so the resulting emotion score is inaccurate, and sometimes the emotion polarity cannot be judged from the score at all.
At present, the most widely used emotion dictionary is the one based on the Bosen database. It can produce a final score, but it cannot be customized for each industry, so it does not achieve the desired effect. Manual labeling, for its part, costs a great deal of labor and cannot assign accurate emotion values.
The main reasons are as follows:
1. The emotion dictionary of the Bosen database is a general-purpose database. It is reasonably well suited to mass comments such as most news and microblog posts, but it performs poorly for a specific industry.
2. Scoring based on an emotion dictionary can handle simple emotional tendencies, but it cannot handle complex emotional expressions in Chinese, such as double negation and rhetorical questions.
Disclosure of Invention
The invention provides an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm, aiming at solving the problems in the background technology.
In order to achieve the purpose, the invention adopts the following technical scheme:
An emotion dictionary expansion method and an emotion polarity analysis method based on the SOPMI algorithm comprise the following steps:
collecting public opinion text data;
collecting positive and negative words as seed words;
calculating the emotion scores of the words by using the SOPMI algorithm and generating an emotion dictionary; performing word segmentation on the text data with jieba, and extracting the nouns and verbs in the text data;
scoring the text information by weighting the scores of the nouns and verbs.
In some embodiments, the public opinion text data is derived from forums, posts, and network data.
In some embodiments, calculating the emotion scores of the words and generating the emotion dictionary by using the SOPMI algorithm includes: after the data is segmented by jieba, the relevance (PMI) of two words is calculated:
$$\mathrm{PMI}(\text{word1}, \text{word2}) = \log \frac{P(\text{word1} \,\&\, \text{word2})}{P(\text{word1})\,P(\text{word2})} \qquad (1)$$
wherein P(word1 & word2) is the probability that the two words appear at the same time, P(word1) is the probability that word1 appears, and P(word2) is the probability that word2 appears;
wherein the larger the value of PMI, the higher the degree of association between the two words; then, for each word, the PMI values with respect to the seed words are summed, the PMI sum for the derogatory seed words is subtracted from the PMI sum for the commendatory seed words, and the absolute value is taken to obtain the final SOPMI value, namely the difference between the positive PMI sum and the negative PMI sum:
$$\mathrm{SOPMI}(\text{word1}) = \left| \sum_{\text{Pword}} \mathrm{PMI}(\text{word1}, \text{Pword}) - \sum_{\text{Nword}} \mathrm{PMI}(\text{word1}, \text{Nword}) \right| \qquad (2)$$
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula (1); the difference between the two sums is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with an emotional tendency.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
On the contrary, this application is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a better understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.
The emotion dictionary expansion method and emotion polarity analysis method based on the SOPMI algorithm according to the embodiments of the present application will be described in detail. It is to be noted that the following examples are only for explaining the present application and do not constitute a limitation to the present application.
In an embodiment of the application, an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm comprise the following steps:
Step 1: collecting public opinion text data;
the data to be collected in the method comes from data sources such as forums, posts, the web, and mobile phones; some positive and negative words are also collected as seed words for the SOPMI algorithm. For example, Table 1:
TABLE 1 public sentiment data sampling (partial data, data source from Skyline forum of Changhong)
Step 2: collecting positive and negative words as seed words; examples of the seed words are as follows:
Table 2 Partial seed words (10040 seed words in total, from manual labeling)
Commendatory words | Derogatory words |
---|---|
Like (Xi Huan) | Poor score |
Good comment | Damage |
Satisfaction | Cheater |
Give power | Disappointing |
Support | Worry |
Convenience of use | False goods |
First class | Deficiency |
Step 3: performing word segmentation on each text in Table 1 with jieba, and extracting the nouns and verbs in the text to be used for training the emotion dictionary;
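The noun and verb extraction in this step can be sketched as a filter over part-of-speech-tagged tokens. This is only an illustration: the function name `extract_nouns_and_verbs` and the sample tokens are hypothetical, and the sketch assumes that noun flags begin with `n` and verb flags begin with `v`, as in the tag set used by jieba's part-of-speech tagger. The tagger itself is left out so that the sketch runs without jieba installed.

```python
from typing import Iterable, List, Tuple


def extract_nouns_and_verbs(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep the words whose part-of-speech flag marks a noun ('n...') or a verb ('v...')."""
    return [word for word, flag in tagged if flag.startswith(("n", "v"))]


# Hypothetical tagged tokens for one short text. With jieba installed, pairs of
# this shape could be obtained from jieba.posseg.cut(text), which yields
# (word, flag) pairs.
sample = [
    ("television", "n"),
    ("very", "d"),      # adverb, dropped
    ("like", "v"),
    ("of", "uj"),       # particle, dropped
    ("Changhong", "nz"),
]
kept = extract_nouns_and_verbs(sample)
print(kept)
```

Filtering on the first letter of the flag keeps subtypes such as proper nouns; whether the patent's implementation keeps those subtypes is not stated.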
Step 4: calculating the PMI value between each word in the text and each commendatory and derogatory seed word, using the word segmentation result obtained in step 3 and formula 1;
Step 5: calculating the SOPMI value of each word using formula 2.
Calculating the emotion scores of the words and generating the emotion dictionary by using the SOPMI algorithm comprises the following steps: after the data is segmented by jieba, the relevance (PMI) of two words is calculated:
$$\mathrm{PMI}(\text{word1}, \text{word2}) = \log \frac{P(\text{word1} \,\&\, \text{word2})}{P(\text{word1})\,P(\text{word2})} \qquad (1)$$
wherein P(word1 & word2) is the probability that the two words appear at the same time, P(word1) is the probability that word1 appears, and P(word2) is the probability that word2 appears;
wherein, if PMI > 0, the two words are correlated, and the higher the PMI value, the stronger the correlation; if PMI = 0, the two words are independent, neither correlated nor mutually exclusive; if PMI < 0, the two words are mutually exclusive. The larger the value of PMI, the higher the degree of association between the two words (for example, the stronger the association between the word "favorable" and the word "television", the higher the PMI value, and the more likely it is that the television is viewed favorably).
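As a rough illustration of the three cases above, PMI can be estimated from how often two words appear in the same text. The sketch below makes several assumptions that the source does not state: probabilities are estimated as the fraction of texts containing a word, the logarithm is base 2, and a pair that never co-occurs is given negative infinity. The function name `pmi` and the toy data are hypothetical.

```python
import math
from typing import Sequence


def pmi(word1: str, word2: str, docs: Sequence[Sequence[str]]) -> float:
    """Estimate PMI from text-level co-occurrence (base-2 log is an assumption)."""
    n = len(docs)
    c1 = sum(1 for d in docs if word1 in d)                   # texts containing word1
    c2 = sum(1 for d in docs if word2 in d)                   # texts containing word2
    c12 = sum(1 for d in docs if word1 in d and word2 in d)   # texts containing both
    if c12 == 0:
        return float("-inf")  # the pair never co-occurs in this corpus
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


# Toy corpus of segmented texts (hypothetical data).
docs = [
    ["television", "good"],
    ["television", "good"],
    ["television", "bad"],
    ["phone", "bad"],
]
print(pmi("television", "good", docs))  # positive: the words tend to co-occur
print(pmi("television", "bad", docs))   # negative: they co-occur less than chance

# Two words that co-occur exactly as often as chance predicts give PMI = 0.
independent = [["a", "b"], ["a"], ["b"], []]
print(pmi("a", "b", independent))
```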
Then, for each word, the PMI values with respect to the seed words are summed, the PMI sum for the derogatory seed words is subtracted from the PMI sum for the commendatory seed words, and the absolute value is taken to obtain the final SOPMI value, namely the difference between the positive PMI sum and the negative PMI sum:
$$\mathrm{SOPMI}(\text{word1}) = \left| \sum_{\text{Pword}} \mathrm{PMI}(\text{word1}, \text{Pword}) - \sum_{\text{Nword}} \mathrm{PMI}(\text{word1}, \text{Nword}) \right| \qquad (2)$$
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula 1; the difference between the two sums is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with an emotional tendency.
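The SOPMI computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: PMI is estimated from text-level co-occurrence with a base-2 logarithm, seed words that never co-occur with the target word are skipped rather than counted as negative infinity, and the absolute value of the difference is returned as the text describes. The names `pmi`, `so_pmi`, and the toy corpus are hypothetical.

```python
import math
from typing import Iterable, Sequence


def pmi(word1: str, word2: str, docs: Sequence[Sequence[str]]) -> float:
    """PMI from text-level co-occurrence; -inf when the pair never co-occurs."""
    n = len(docs)
    c1 = sum(1 for d in docs if word1 in d)
    c2 = sum(1 for d in docs if word2 in d)
    c12 = sum(1 for d in docs if word1 in d and word2 in d)
    if c12 == 0:
        return float("-inf")
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


def so_pmi(word: str, pos_seeds: Iterable[str], neg_seeds: Iterable[str],
           docs: Sequence[Sequence[str]]) -> float:
    """|sum of PMI with positive seeds - sum of PMI with negative seeds|."""
    def finite_sum(seeds: Iterable[str]) -> float:
        values = (pmi(word, s, docs) for s in seeds)
        # Assumption: seeds that never co-occur with the word are skipped.
        return sum(v for v in values if math.isfinite(v))
    return abs(finite_sum(pos_seeds) - finite_sum(neg_seeds))


# Toy corpus of segmented texts and seed words (hypothetical data).
docs = [
    ["television", "like"],
    ["television", "like", "satisfaction"],
    ["merchant", "cheater"],
    ["merchant", "cheater", "disappointing"],
]
pos_seeds = ["like", "satisfaction"]
neg_seeds = ["cheater", "disappointing"]

emotion_dict = {w: so_pmi(w, pos_seeds, neg_seeds, docs)
                for w in ("television", "merchant")}
print(emotion_dict)
```

Because the absolute value discards the sign, a word tied to negative seeds and a word tied to positive seeds can receive the same score; the polarity has to come from elsewhere, which is consistent with the separate polarity judgment later in the method.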
Step 6: taking the SOPMI value as the emotion score in the emotion dictionary; partial results of the emotion dictionary for the public opinion data sampling in Table 3 are shown below:
Positive dictionary:
Word | SOPMI value |
---|---|
Changhong | 195.276138438 |
Television | 181.5876993905827 |
Intelligence | 143.11796751989684 |
Technology | 135.83785489256738 |
Experience | 132.00637932593963 |
Product | 129.4934796412743 |
Negative dictionary:
Word | SOPMI value |
---|---|
Result in | 12.0651507 |
Unscrupulous merchant | 10.7752902 |
Spoofing | 9.959563051 |
Masking | 9.0626603 |
Cheating | 8.44954610281 |
Questions | 8.12146552 |
Step 7: performing word segmentation on all data of the embodiment with jieba, and looking up each text's word segmentation result in the obtained emotion dictionary to obtain the emotion score of each corresponding word.
Step 8: summing the SOPMI values of the words in each text's word segmentation result to obtain the text score. (For example, if a text segments into the two words "unscrupulous merchant" and "masking", its score is the sum of the SOPMI values of the two words, i.e. 10.77 + 9.06 = 19.83, taking two decimal places for convenient calculation.) The data obtained by applying this step to all texts are shown in Table 3.
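The dictionary lookup and summation of steps 7 and 8 can be sketched as below. The function name `text_score` is hypothetical, the keys `word_a` and `word_b` are stand-ins for the two segmented words of the worked example, and treating a word that is missing from the dictionary as contributing 0 is an assumption, since the source does not say how such words are handled.

```python
from typing import Dict, Iterable


def text_score(tokens: Iterable[str], emotion_dict: Dict[str, float]) -> float:
    """Sum the SOPMI scores of the tokens found in the emotion dictionary."""
    # Assumption: tokens without a dictionary entry contribute 0.
    return sum(emotion_dict.get(token, 0.0) for token in tokens)


# Stand-in dictionary holding the two rounded scores from the worked example.
emotion_dict = {"word_a": 10.77, "word_b": 9.06}

score = text_score(["word_a", "word_b"], emotion_dict)
print(round(score, 2))  # 19.83

# A token that is not in the dictionary leaves the score unchanged.
score_with_unknown = text_score(["word_a", "word_b", "unknown"], emotion_dict)
print(round(score_with_unknown, 2))  # 19.83
```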
Step 9: judging the polarity of each text with a trained emotion analysis model (that is, judging whether the text is positive or negative; this patent uses a model trained with the TF-IDF algorithm and an artificial neural network; this technology is mature, is not the focus of this patent, and is not described in detail), and multiplying the polarity by the SOPMI text score obtained in step 8 to obtain the final score, where the positive parameter is 1 and the negative parameter is -1. (The score of the example in step 8 is -19.83.)
Step 10: if the absolute value of the score exceeds 100, the emotion score of the comment is set to 100; otherwise, the emotion score is the original score. (The final results of the embodiment are shown in Table 4.)
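Steps 9 and 10 can be sketched as a sign multiplication followed by a cap. The polarity model itself is outside this sketch, so the polarity is passed in as +1 or -1. The text says only that an out-of-range score is set to 100; keeping the sign of the score when capping, so that a strongly negative text becomes -100, is an interpretation made here and not something the source confirms. The function name `final_score` is hypothetical.

```python
import math


def final_score(text_score: float, polarity: int, cap: float = 100.0) -> float:
    """Multiply the text score by the polarity (+1 or -1) and cap its magnitude."""
    if polarity not in (1, -1):
        raise ValueError("polarity must be +1 (positive) or -1 (negative)")
    signed = polarity * text_score
    if abs(signed) > cap:
        # Assumption: the sign is preserved when the score is capped.
        return math.copysign(cap, signed)
    return signed


print(final_score(19.83, -1))   # -19.83, the worked example from step 8
print(final_score(250.0, 1))    # 100.0
print(final_score(250.0, -1))   # -100.0 under the sign-preserving assumption
```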
TABLE 3
TABLE 4
The sentiment scoring results are ranked to obtain the following content in table 5:
TABLE 5 Emotion scoring results
It follows that the higher the quality of a text and the more clearly it describes something as good, the higher its score; the higher the quality of a text and the more clearly it describes something as bad, the lower its score. By means of the score, it can be judged which information is more valuable and meaningful for public opinion monitoring.
The beneficial effects that the emotion dictionary expansion method and the emotion polarity analysis method based on the SOPMI algorithm may bring include but are not limited to:
The method uses machine learning to determine the emotional tendency of comments, which addresses the problem that the emotional tendency cannot be determined in complex Chinese contexts. It also solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to a project.
The distinctive technique is the use of the SOPMI algorithm: the probability that a word and a seed word appear at the same time is calculated and used as the basis for emotion scoring; after part-of-speech tagging, the nouns and verbs in the word segmentation result are extracted as the corresponding emotion words; and manually labeled degree adverbs are then used as weights and multiplied by the result to give the final emotion score of each text.
The SOPMI algorithm, together with the technique of calculating the text score from verb and noun scores, overcomes the defect that most existing text scoring can only rely on a manually labeled emotion dictionary.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (3)
1. An emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm are characterized by comprising the following steps:
collecting public opinion text data;
collecting positive and negative words as seed words;
calculating the emotion scores of the words by using an SOPMI algorithm and generating an emotion dictionary; utilizing jieba to carry out word segmentation on the text data and extracting nouns and verbs in the text data;
the textual information is scored using the fractional weighting of nouns and verbs.
2. The method as claimed in claim 1, wherein the public opinion text data is from forums, posts and network data.
3. The emotion dictionary expansion method and emotion polarity analysis method based on the SOPMI algorithm according to claim 1, wherein calculating the emotion scores of the words and generating the emotion dictionary by using the SOPMI algorithm, and performing word segmentation on the text data with jieba and extracting the nouns and verbs in the text data, comprise the following steps: after the data is segmented by jieba, the relevance (PMI) of two words is calculated:
$$\mathrm{PMI}(\text{word1}, \text{word2}) = \log \frac{P(\text{word1} \,\&\, \text{word2})}{P(\text{word1})\,P(\text{word2})} \qquad (1)$$
wherein P(word1 & word2) is the probability that the two words appear at the same time, P(word1) is the probability that word1 appears, and P(word2) is the probability that word2 appears;
wherein the larger the value of PMI, the higher the degree of association between the two words; then, for each word, the PMI values with respect to the seed words are summed, the PMI sum for the derogatory seed words is subtracted from the PMI sum for the commendatory seed words, and the absolute value is taken to obtain the final SOPMI value, namely the difference between the positive PMI sum and the negative PMI sum:
$$\mathrm{SOPMI}(\text{word1}) = \left| \sum_{\text{Pword}} \mathrm{PMI}(\text{word1}, \text{Pword}) - \sum_{\text{Nword}} \mathrm{PMI}(\text{word1}, \text{Nword}) \right| \qquad (2)$$
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula 1; the difference between the two sums is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with an emotional tendency.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111027932.1A CN114118069A (en) | 2021-09-02 | 2021-09-02 | Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111027932.1A CN114118069A (en) | 2021-09-02 | 2021-09-02 | Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114118069A (en) | 2022-03-01 |
Family
ID=80441173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111027932.1A Pending CN114118069A (en) | 2021-09-02 | 2021-09-02 | Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114118069A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115796158A (en) * | 2023-02-07 | 2023-03-14 | 中国传媒大学 | Emotion dictionary construction method and device, electronic equipment and computer readable medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460158A (en) * | 2020-04-01 | 2020-07-28 | 安徽理工大学 | Microblog topic public emotion prediction method based on emotion analysis |
-
2021
- 2021-09-02 CN CN202111027932.1A patent/CN114118069A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609132B (en) | Semantic ontology base based Chinese text sentiment analysis method | |
EP2947581B1 (en) | Interactive searching method and apparatus | |
CN104050160B (en) | Interpreter's method and apparatus that a kind of machine is blended with human translation | |
CN107092596A (en) | Text emotion analysis method based on attention CNNs and CCR | |
CN109255027B (en) | E-commerce comment sentiment analysis noise reduction method and device | |
CN105975478A (en) | Word vector analysis-based online article belonging event detection method and device | |
CN104331451A (en) | Recommendation level scoring method for theme-based network user comments | |
CN106202584A (en) | A kind of microblog emotional based on standard dictionary and semantic rule analyzes method | |
CN102929861A (en) | Method and system for calculating text emotion index | |
CN107688576B (en) | Construction and tendency classification method of CNN-SVM model | |
CN107577665B (en) | Text emotional tendency judging method | |
CN111626050B (en) | Microblog emotion analysis method based on expression dictionary and emotion general knowledge | |
CN105809186A (en) | Emotion classification method and system | |
CN107818173B (en) | Vector space model-based Chinese false comment filtering method | |
CN113076423A (en) | Data processing method and device and data query method and device | |
CN114970523B (en) | Topic prompting type keyword extraction method based on text semantic enhancement | |
CN114547293A (en) | Cross-platform false news detection method and system | |
CN114118069A (en) | Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm | |
CN107451116A (en) | Raw big data statistical analysis technique in a kind of Mobile solution | |
CN107783958A (en) | A kind of object statement recognition methods and device | |
CN110377706B (en) | Search sentence mining method and device based on deep learning | |
CN112749257A (en) | Intelligent marking system based on machine learning algorithm | |
CN112699831A (en) | Video hotspot segment detection method and device based on barrage emotion and storage medium | |
CN116089578A (en) | Automatic labeling method, system and storage medium for intelligent question-answering data | |
CN113468176B (en) | Information input method and device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||