CN114118069A - Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm - Google Patents


Info

Publication number
CN114118069A
Authority
CN
China
Prior art keywords
word
emotion
sopmi
words
pmi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111027932.1A
Other languages
Chinese (zh)
Inventor
彭乙庭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Qiruike Technology Co Ltd
Original Assignee
Sichuan Qiruike Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Qiruike Technology Co Ltd filed Critical Sichuan Qiruike Technology Co Ltd
Priority to CN202111027932.1A priority Critical patent/CN114118069A/en
Publication of CN114118069A publication Critical patent/CN114118069A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm, which comprise the following steps: collecting public opinion text data; collecting positive and negative words as seed words; calculating the emotion scores of the words by using the SOPMI algorithm and generating an emotion dictionary; using jieba to perform word segmentation on the text data and extracting the nouns and verbs in the text data; and scoring the text information by weighting the scores of the nouns and verbs. This solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to the project.

Description

Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm
Technical Field
The invention relates to the technical field of machine learning and text analysis, in particular to an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm.
Background
With the development of society, the diversification of media, the rapid growth of networks, the popularization of mobile terminals, and the wide use of microblogs and forums, a large amount of information flows through networks in various ways every day, and even after junk information is filtered out, the remaining information cannot be evaluated effectively. Text scoring can distinguish positive information from negative information and high-quality information from low-quality information. On the one hand, this facilitates public opinion analysis by enterprises or governments; on the other hand, it makes the reputation of products among users more widely known.
A commonly used method is to take an emotion dictionary with wide coverage, match the words of the text against the entries in the emotion dictionary, and weight the score of each word to obtain a final score. This score depends entirely on how well the word segmentation result matches the emotion dictionary. Often a word has no corresponding entry in the emotion dictionary and a null value is returned, so the returned emotion score contains errors, and the emotion polarity may not even be determinable from the emotion score.
At present, a relatively widely used and popular emotion dictionary is the one based on the Bosen database. Although it can produce a final score, it cannot be customized for each industry and does not achieve the desired effect. Manual labeling, for its part, costs a great deal of labor and cannot assign accurate emotion values.
The main reasons are as follows:
1. The emotion dictionary of the Bosen database is a general-purpose database; it is relatively well suited to mass comments such as most news and microblogs, and performs poorly for a specific industry.
2. Emotion scoring can handle simple emotional tendencies, but cannot handle complex emotional constructions in Chinese, such as double negation and rhetorical questions.
Disclosure of Invention
The invention provides an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm, aiming at solving the problems in the background technology.
In order to achieve the purpose, the invention adopts the following technical scheme:
an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm comprise the following steps:
collecting public opinion text data;
collecting positive and negative words as seed words;
calculating the emotion scores of the words by using an SOPMI algorithm and generating an emotion dictionary; performing word segmentation on the text data by using jieba, and extracting the nouns and verbs in the text data;
scoring the text information by weighting the scores of the nouns and verbs.
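The noun and verb extraction in the steps above can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the function name `keep_nouns_and_verbs` and the sample comment are hypothetical, and the part-of-speech flags are assumed to follow the jieba convention in which noun flags start with `n` and verb flags start with `v`.

```python
from typing import Iterable, List, Tuple


def keep_nouns_and_verbs(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep words whose part-of-speech flag starts with 'n' (noun) or 'v' (verb)."""
    return [word for word, flag in tagged if flag[:1] in ("n", "v")]


# Hypothetical pre-tagged comment. With jieba installed, (word, flag) pairs of
# this kind can be obtained from jieba.posseg.cut(text), whose items expose
# .word and .flag attributes.
tagged_comment = [("电视", "n"), ("非常", "d"), ("好", "a"), ("支持", "v"), ("长虹", "nz")]
kept = keep_nouns_and_verbs(tagged_comment)
print(kept)
```

Working on pre-tagged pairs keeps the sketch independent of any particular segmenter; only the flag convention is assumed.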
In some embodiments, the public opinion text data is derived from forums, posts, network data.
In some embodiments, the calculating the emotion score of the word and generating the emotion dictionary by using the SOPMI algorithm includes: after the data is segmented by jieba, the relevance PMI of two words is calculated:
$$\mathrm{PMI}(word1, word2) = \log \frac{P(word1 \,\&\, word2)}{P(word1)\,P(word2)} \tag{1}$$
wherein, P (word1& word2) is the probability of two words appearing at the same time, P (word1) is the probability of word1, and P (word2) is the probability of word 2;
wherein the larger the value of PMI is, the higher the association degree between two words is; then, the PMI of each word is weighted with respect to each seed word, the PMI weighting for the commendatory words and the PMI weighting for the derogatory words are subtracted from each other and the absolute value is taken, obtaining the final SOPMI value, namely the difference between the weighting of the positive PMI and the weighting of the negative PMI:
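$$\mathrm{SOPMI}(word1) = \left| \sum_{Pword} \mathrm{PMI}(word1, Pword) - \sum_{Nword} \mathrm{PMI}(word1, Nword) \right| \tag{2}$$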
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula 1; the difference between the two is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with the emotional tendency.
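The PMI and SOPMI calculation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patent's implementation: probabilities are estimated from document-level co-occurrence, the logarithm is taken to base 2, the "weighting" over seed words is a plain sum, and a pair of words that never co-occur contributes 0. The function names and the toy corpus are hypothetical.

```python
import math
from typing import Iterable, Sequence, Set


def pmi(word1: str, word2: str, docs: Sequence[Set[str]]) -> float:
    """PMI estimated from how many documents contain each word and both words."""
    n = len(docs)
    c1 = sum(1 for d in docs if word1 in d)
    c2 = sum(1 for d in docs if word2 in d)
    c12 = sum(1 for d in docs if word1 in d and word2 in d)
    if n == 0 or c12 == 0:
        # Assumption: pairs that never co-occur contribute nothing.
        return 0.0
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))


def sopmi(word: str, pos_seeds: Iterable[str], neg_seeds: Iterable[str],
          docs: Sequence[Set[str]]) -> float:
    """Absolute difference between the summed PMI over positive and negative seeds."""
    positive = sum(pmi(word, seed, docs) for seed in pos_seeds)
    negative = sum(pmi(word, seed, docs) for seed in neg_seeds)
    return abs(positive - negative)


# Toy corpus: each document is the set of words left after segmentation.
docs = [{"tv", "good"}, {"tv", "good"}, {"tv", "bad"}, {"phone", "bad"}]
print(sopmi("tv", ["good"], ["bad"], docs))
```

In the toy corpus, "tv" co-occurs with "good" more often than chance and with "bad" less often than chance, so the two PMI terms have opposite signs and the difference is large.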
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
On the contrary, this application is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a better understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.
The emotion dictionary expansion method and emotion polarity analysis method based on the SOPMI algorithm according to the embodiments of the present application will be described in detail. It is to be noted that the following examples are only for explaining the present application and do not constitute a limitation to the present application.
In an embodiment of the application, an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm comprise the following steps:
step 1: collecting public opinion text data;
the data collected in this method comes from data sources such as forums, posts, and the network, as well as mobile phones; some positive and negative words are used as seed words for the SOPMI algorithm. For example, see Table 1:
TABLE 1 public sentiment data sampling (partial data, data source from Skyline forum of Changhong)
Figure BDA0003244028310000031
Figure BDA0003244028310000041
Step 2: collecting positive and negative words as seed words; examples of the seed words are shown in Table 2:
TABLE 2 Partial seed words (10040 seed words in total, from manual labeling)
Commendatory words | Derogatory words
Xi Huan | Poor score
Good comment | Damage of
Satisfaction | Cheater
Give power | Disappointing of vision
Support for | Worry about
Convenience of use | False goods
First class | Deficiency of
Step 3: Segment each text in Table 1 by using jieba word segmentation, and extract the nouns and verbs in the text to be used for training the emotion dictionary;
Step 4: Using the word segmentation result obtained in step 3 and formula 1, calculate the PMI value between each word in the text and each commendatory word and derogatory word;
Step 5: Calculate the SOPMI value of each word using formula 2.
The calculating the emotion scores of the words and generating the emotion dictionary by utilizing the SOPMI algorithm comprises the following steps: after the data is segmented by jieba, the relevance PMI of two words is calculated:
$$\mathrm{PMI}(word1, word2) = \log \frac{P(word1 \,\&\, word2)}{P(word1)\,P(word2)} \tag{1}$$
wherein, P (word1& word2) is the probability of two words appearing at the same time, P (word1) is the probability of word1, and P (word2) is the probability of word 2;
wherein, if PMI > 0, the two words are correlated, and the larger the PMI value, the stronger the correlation; if PMI = 0, the two words are statistically independent, neither correlated nor mutually exclusive; if PMI < 0, the two words are mutually exclusive. The larger the value of the PMI is, the higher the association degree between two words is (for example, the more strongly the word "like" is associated with "television", the higher the PMI value, and the more likely it is that comments about the television are favorable).
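As a quick check of the three cases, the following worked values use made-up probabilities (not taken from the patent data) and assume a base-2 logarithm, which the text does not specify. In all three cases P(word1) = P(word2) = 0.1, and only the joint probability changes:

```latex
\begin{aligned}
P(word1 \,\&\, word2) = 0.02 &: \quad \mathrm{PMI} = \log_2 \frac{0.02}{0.1 \times 0.1} = \log_2 2 = 1 > 0 \quad \text{(correlated)} \\
P(word1 \,\&\, word2) = 0.01 &: \quad \mathrm{PMI} = \log_2 \frac{0.01}{0.1 \times 0.1} = \log_2 1 = 0 \quad \text{(independent)} \\
P(word1 \,\&\, word2) = 0.005 &: \quad \mathrm{PMI} = \log_2 \frac{0.005}{0.1 \times 0.1} = \log_2 0.5 = -1 < 0 \quad \text{(mutually exclusive)}
\end{aligned}
```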
Then, the PMI of each word is weighted with respect to each seed word, the PMI weighting for the commendatory words and the PMI weighting for the derogatory words are subtracted from each other and the absolute value is taken, obtaining the final SOPMI value, namely the difference between the weighting of the positive PMI and the weighting of the negative PMI:
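$$\mathrm{SOPMI}(word1) = \left| \sum_{Pword} \mathrm{PMI}(word1, Pword) - \sum_{Nword} \mathrm{PMI}(word1, Nword) \right| \tag{2}$$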
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula 1; the difference between the two is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with the emotional tendency.
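A small worked instance may make the subtraction concrete. The PMI values below are invented for illustration (they are not from the patent data), and the weighting over seed words is assumed to be a plain sum: a word has PMI 1.2 and 0.8 with two positive seed words, and PMI -0.5 and 0.3 with two negative seed words.

```latex
\begin{aligned}
\sum_{Pword} \mathrm{PMI} &= 1.2 + 0.8 = 2.0 \\
\sum_{Nword} \mathrm{PMI} &= -0.5 + 0.3 = -0.2 \\
\mathrm{SOPMI} &= \left| 2.0 - (-0.2) \right| = 2.2
\end{aligned}
```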
Step 6: Take the SOPMI value as the emotion score in the emotion dictionary. Partial emotion dictionary results for the public opinion data sample in Table 3 are shown below:
Positive dictionary:
Changhong 195.276138438
Television 181.5876993905827
Intelligence 143.11796751989684
Technology 135.83785489256738
Experience 132.00637932593963
Product 129.4934796412743
Negative dictionary:
Result in 12.0651507
Profiteer 10.7752902
Spoofing 9.959563051
Masking 9.0626603
Cheating 8.44954610281
Questions asked 8.12146552
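One way such positive and negative dictionaries could be produced is sketched below. The text does not say how a word is assigned to one dictionary or the other, so this sketch assumes the sign of the positive-minus-negative PMI difference decides it, with the magnitude stored as the score. The function names are hypothetical, and the sample values are rounded scores from the listing above with assumed signs and English glosses chosen for illustration.

```python
from typing import Dict, List, Tuple


def split_by_polarity(signed: Dict[str, float]) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split signed differences into positive and negative dictionaries of magnitudes."""
    positive = {word: value for word, value in signed.items() if value > 0}
    negative = {word: -value for word, value in signed.items() if value < 0}
    return positive, negative


def ranked(dictionary: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort dictionary entries from the highest score to the lowest."""
    return sorted(dictionary.items(), key=lambda item: item[1], reverse=True)


# Assumed signed differences (positive minus negative PMI weighting).
signed_differences = {"television": 181.59, "spoofing": -9.96,
                      "product": 129.49, "masking": -9.06}
positive, negative = split_by_polarity(signed_differences)
print(ranked(positive))
print(ranked(negative))
```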
Step 7: Segment all data of the embodiment by using jieba word segmentation, and look up each text's word segmentation result in the obtained emotion dictionary to obtain the emotion score of each corresponding word.
Step 8: Weight the SOPMI values of the words in each text's word segmentation result to obtain the final text score. (For example, if a piece of data contains the two words scored 10.7752902 and 9.0626603 in the negative dictionary above, the score of the text is the weighting of the SOPMI values of those two words, i.e. 10.77 + 9.06 = 19.83 (two decimal places are kept for convenient calculation).) All data obtained by applying this step are shown in Table 3.
Step 9: Use a trained emotion analysis model to determine whether the text is positive or negative. (This patent uses a model trained with the TF-IDF algorithm and an artificial neural network; this technique is already mature, is not the focus of this patent, and is not described in detail.) Multiply the polarity of the text by the SOPMI text score obtained in step 8 to obtain the final score, where the parameter is 1 for positive text and -1 for negative text. (The score for the example in step 8 is -19.83.)
Step 10: If the absolute value of the score exceeds 100, the emotion score of the comment is set to 100; otherwise, the original score is kept. (The final results for the example are shown in Table 4.)
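Steps 7 to 10 can be sketched together as follows. This is an illustrative sketch, not the patent's code: the polarity is passed in as a boolean because the classification model is not described, words missing from the dictionary are assumed to contribute 0, and the cap is assumed to keep the sign of the score (the text only says the score is set to 100). The function names and the two dictionary keys, which are English glosses for the entries scored 10.77 and 9.06, are hypothetical.

```python
from typing import Dict, Iterable


def text_score(words: Iterable[str], dictionary: Dict[str, float]) -> float:
    """Step 8: add up the SOPMI scores of the words found in the dictionary."""
    # Assumption: words missing from the dictionary contribute 0.
    return sum(dictionary.get(word, 0.0) for word in words)


def final_score(score: float, is_positive: bool, cap: float = 100.0) -> float:
    """Steps 9 and 10: apply the polarity sign, then limit the magnitude to cap."""
    signed = score if is_positive else -score
    if abs(signed) > cap:
        # Assumption: the sign is kept when the score is capped.
        return cap if signed > 0 else -cap
    return signed


emotion_dictionary = {"profiteer": 10.77, "masking": 9.06}
raw = text_score(["profiteer", "masking", "today"], emotion_dictionary)
result = final_score(raw, is_positive=False)
print(round(raw, 2), round(result, 2))
```

With the negative polarity from step 9, the example text from step 8 ends at -19.83, matching the value given in the description.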
TABLE 3
Figure BDA0003244028310000081
TABLE 4
Figure BDA0003244028310000091
The sentiment scoring results are ranked to obtain the following content in table 5:
TABLE 5 Emotion scoring results
Figure BDA0003244028310000101
It follows that the higher the quality of a text and the more clearly it describes something favorable, the higher its score; and the higher the quality of a text and the more clearly it describes something unfavorable, the lower its score. By means of the score, it can be judged which information is more valuable and meaningful for public opinion monitoring.
The beneficial effects that the emotion dictionary expansion method and the emotion polarity analysis method based on the SOPMI algorithm may bring include but are not limited to:
The method uses a machine learning method to determine the emotional tendency of comments, and fundamentally solves the problem that emotional tendency cannot be determined in complex Chinese contexts. It also solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to the project.
The distinctive technique is the use of the SOPMI algorithm: the probability that a word and a seed word appear at the same time is calculated and used as the basis of emotion scoring; after part-of-speech tagging, the nouns and verbs in the word segmentation result are extracted as the corresponding emotion words; and manually labeled degree adverbs are then used as weights and multiplied by the result to give the final emotion score of each text.
The SOPMI algorithm and the technique of calculating the text score from the verb and noun scores overcome the defect that most existing text scoring can only rely on a manually labeled emotion dictionary.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (3)

1. An emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm are characterized by comprising the following steps:
collecting public opinion text data;
collecting positive and negative words as seed words;
calculating the emotion scores of the words by using an SOPMI algorithm and generating an emotion dictionary; performing word segmentation on the text data by using jieba, and extracting the nouns and verbs in the text data;
scoring the text information by weighting the scores of the nouns and verbs.
2. The method as claimed in claim 1, wherein the public opinion text data is from forums, posts and network data.
3. The method for expanding emotion dictionary and analyzing emotion polarity based on SOPMI algorithm according to claim 1, wherein calculating the emotion scores of the words by using the SOPMI algorithm and generating the emotion dictionary, and performing word segmentation on the text data and extracting the nouns and verbs in the text data, comprise: after the data is segmented by jieba, the relevance PMI of two words is calculated:
$$\mathrm{PMI}(word1, word2) = \log \frac{P(word1 \,\&\, word2)}{P(word1)\,P(word2)} \tag{1}$$
wherein, P (word1& word2) is the probability of two words appearing at the same time, P (word1) is the probability of word1, and P (word2) is the probability of word 2;
wherein the larger the value of PMI is, the higher the association degree between two words is; then, the PMI of each word is weighted with respect to each seed word, the PMI weighting for the commendatory words and the PMI weighting for the derogatory words are subtracted from each other and the absolute value is taken, obtaining the final SOPMI value, namely the difference between the weighting of the positive PMI and the weighting of the negative PMI:
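$$\mathrm{SOPMI}(word1) = \left| \sum_{Pword} \mathrm{PMI}(word1, Pword) - \sum_{Nword} \mathrm{PMI}(word1, Nword) \right| \tag{2}$$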
wherein Pword is a positive seed word in the corpus, and Nword is a negative seed word in the corpus; the PMI value between each word and each positive seed word and each negative seed word is calculated according to formula 1; the difference between the two is taken as the SOPMI value, namely the emotion score of the current word word1 in the current text;
the higher the SOPMI value, the more strongly the word is associated with the emotional tendency.
CN202111027932.1A 2021-09-02 2021-09-02 Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm Pending CN114118069A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027932.1A CN114118069A (en) 2021-09-02 2021-09-02 Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111027932.1A CN114118069A (en) 2021-09-02 2021-09-02 Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm

Publications (1)

Publication Number Publication Date
CN114118069A true CN114118069A (en) 2022-03-01

Family

ID=80441173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027932.1A Pending CN114118069A (en) 2021-09-02 2021-09-02 Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm

Country Status (1)

Country Link
CN (1) CN114118069A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796158A (en) * 2023-02-07 2023-03-14 中国传媒大学 Emotion dictionary construction method and device, electronic equipment and computer readable medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460158A (en) * 2020-04-01 2020-07-28 安徽理工大学 Microblog topic public emotion prediction method based on emotion analysis

Similar Documents

Publication Publication Date Title
CN107609132B (en) Semantic ontology base based Chinese text sentiment analysis method
EP2947581B1 (en) Interactive searching method and apparatus
CN104050160B Interpreter's method and apparatus that a kind of machine is blended with human translation
CN109255027B (en) E-commerce comment sentiment analysis noise reduction method and device
CN104331451A (en) Recommendation level scoring method for theme-based network user comments
CN106202584A (en) A kind of microblog emotional based on standard dictionary and semantic rule analyzes method
CN102033950A (en) Construction method and identification method of automatic electronic product named entity identification system
CN102929861A (en) Method and system for calculating text emotion index
CN107688576B (en) Construction and tendency classification method of CNN-SVM model
CN107577665B (en) Text emotional tendency judging method
CN111626050B (en) Microblog emotion analysis method based on expression dictionary and emotion general knowledge
CN107133282B (en) Improved evaluation object identification method based on bidirectional propagation
CN113076423A (en) Data processing method and device and data query method and device
CN107818173B (en) Vector space model-based Chinese false comment filtering method
CN114970523B (en) Topic prompting type keyword extraction method based on text semantic enhancement
CN114547293A (en) Cross-platform false news detection method and system
CN114118069A (en) Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm
CN107451116A (en) Raw big data statistical analysis technique in a kind of Mobile solution
CN110377706B (en) Search sentence mining method and device based on deep learning
CN112749257A (en) Intelligent marking system based on machine learning algorithm
CN107783958A (en) A kind of object statement recognition methods and device
CN116089578A (en) Automatic labeling method, system and storage medium for intelligent question-answering data
CN116070620A (en) Information processing method and system based on big data
CN113468176B (en) Information input method and device, electronic equipment and computer readable storage medium
CN115618092A (en) Information recommendation method and information recommendation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination