CN114118069A - Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm

Info

Publication number: CN114118069A
Application number: CN202111027932.1A
Authority: CN
Inventors: 彭乙庭
Original assignee: Sichuan Cric Technology Co ltd
Current assignee: Sichuan Cric Technology Co ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2022-03-01

Abstract

The invention discloses an emotion dictionary expansion method and an emotion polarity analysis method based on the SOPMI algorithm, comprising the following steps: collecting public opinion text data; collecting positive and negative words as seed words; calculating the emotion scores of words with the SOPMI algorithm and generating an emotion dictionary; segmenting the text data with jieba and extracting the nouns and verbs in it; and scoring the text using the weighted scores of the nouns and verbs. The invention solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to a project.

Description

Emotion dictionary expansion method and emotion polarity analysis method based on SOPMI algorithm

Technical Field

The invention relates to the technical field of machine learning and text analysis, in particular to an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm.

Background

With the development of society, the diversification of media, the rapid growth of networks, the popularization of mobile terminals, and the wide use of microblogs and forums, a large amount of information flows through networks in various ways every day. Even after junk information is filtered out, the remaining information cannot be evaluated effectively. Text scoring makes it possible to distinguish positive from negative information and high-quality from low-quality information. On the one hand, this facilitates public opinion analysis by enterprises or governments; on the other hand, it gives a broader view of a product's reputation among users.

A commonly used method is to take an emotion dictionary with wide coverage, match the words of the text against the entries in the dictionary, and weight the score of each matched word to obtain a final score. This score depends entirely on how well the segmentation result matches the emotion dictionary. Words with no corresponding entry in the dictionary often return a null value, so the returned emotion score contains errors, and sometimes even the emotion polarity cannot be judged from it.

At present, the most widely used emotion dictionary is the one based on the Bosen database. It cannot be customized for each industry, so although a final score can be obtained, the desired effect cannot be achieved. Manual labeling, in turn, costs a great deal of labor and cannot produce accurate emotion values.

The main reasons are as follows:

1. The emotion dictionary of the Bosen database is a general-purpose dictionary. It is reasonably suitable for mass comments such as most news and microblogs, but performs poorly in a specific industry.

2. Emotion scoring can handle simple emotional tendencies, but not complex emotional constructions in Chinese, such as double negation and rhetorical questions.

Disclosure of Invention

The invention provides an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm, aiming at solving the problems in the background technology.

In order to achieve the purpose, the invention adopts the following technical scheme:

An emotion dictionary expansion method and an emotion polarity analysis method based on the SOPMI algorithm, comprising the following steps:

collecting public opinion text data;

collecting positive and negative words as seed words;

calculating the emotion scores of words with the SOPMI algorithm and generating an emotion dictionary; segmenting the text data with jieba and extracting the nouns and verbs in the text data;

scoring the text using the weighted scores of the nouns and verbs.

In some embodiments, the public opinion text data comes from forums, posts, and other network data.

In some embodiments, calculating the emotion scores of words and generating the emotion dictionary with the SOPMI algorithm includes: after the data is segmented by jieba, calculating the degree of association (PMI) of two words:

$$\mathrm{PMI}(word1, word2) = \log_2 \frac{P(word1\ \&\ word2)}{P(word1)\,P(word2)} \tag{1}$$

where P(word1 & word2) is the probability that the two words appear together, P(word1) is the probability that word1 appears, and P(word2) is the probability that word2 appears;

The larger the PMI value, the higher the degree of association between the two words. Next, for each word, the PMI values with the seed words are accumulated: the sum of its PMI values with the derogatory seed words is subtracted from the sum of its PMI values with the commendatory seed words, and the absolute value of the result is recorded. The final SOPMI value is thus the difference between the positive PMI sum and the negative PMI sum:

$$\mathrm{SOPMI}(word1) = \sum_{Pword} \mathrm{PMI}(word1, Pword) - \sum_{Nword} \mathrm{PMI}(word1, Nword) \tag{2}$$

where Pword ranges over the positive seed words in the corpus and Nword over the negative seed words in the corpus. The PMI value of each word with each positive seed word and each negative seed word is calculated according to formula 1, and the difference of the two sums is taken as the SOPMI value, i.e., the emotion score of the current word word1 in the current corpus.

The higher the SOPMI value, the more strongly the word is associated with the emotional tendency.
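The PMI and SOPMI calculation described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes that probabilities are estimated as the fraction of documents containing a word (or both words), that the logarithm is base 2, and that a seed word that never co-occurs with the target word contributes 0. The function names `build_counts`, `pmi`, and `sopmi` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(docs):
    """Count how many segmented documents contain each word and each word pair."""
    word_df = Counter()
    pair_df = Counter()
    for tokens in docs:
        unique = sorted(set(tokens))
        word_df.update(unique)
        # Pairs come out in sorted order, so (a, b) always has a < b.
        pair_df.update(combinations(unique, 2))
    return word_df, pair_df, len(docs)


def pmi(w1, w2, word_df, pair_df, n_docs):
    """PMI = log2(P(w1 & w2) / (P(w1) * P(w2))); None if the pair never co-occurs."""
    joint = pair_df[tuple(sorted((w1, w2)))]
    if joint == 0:
        return None
    p_joint = joint / n_docs
    p1 = word_df[w1] / n_docs
    p2 = word_df[w2] / n_docs
    return math.log2(p_joint / (p1 * p2))


def sopmi(word, pos_seeds, neg_seeds, word_df, pair_df, n_docs):
    """Sum of PMI with positive seeds minus sum of PMI with negative seeds."""
    def total(seeds):
        values = (pmi(word, s, word_df, pair_df, n_docs) for s in seeds)
        # Assumption: seeds that never co-occur with the word contribute 0.
        return sum(v for v in values if v is not None)
    return total(pos_seeds) - total(neg_seeds)


# Toy corpus of already-segmented texts (invented for illustration).
docs = [["tv", "like"], ["tv", "like"], ["tv", "fake"], ["remote", "fake"]]
word_df, pair_df, n_docs = build_counts(docs)
tv_score = sopmi("tv", ["like"], ["fake"], word_df, pair_df, n_docs)
remote_score = sopmi("remote", ["like"], ["fake"], word_df, pair_df, n_docs)
```

In this toy corpus "tv" co-occurs more often with the positive seed than with the negative one, so its score is positive, while "remote" only ever appears with the negative seed and gets a negative score.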

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

On the contrary, this application is intended to cover any alternatives, modifications, and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a better understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.

The emotion dictionary expansion method and emotion polarity analysis method based on the SOPMI algorithm according to the embodiments of the present application will be described in detail. It is to be noted that the following examples are only for explaining the present application and do not constitute a limitation to the present application.

In an embodiment of the application, an emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm comprise the following steps:

Step 1: Collect public opinion text data.

The data collected in this method comes from sources such as forums, posts, and other network data, as well as mobile phones; some positive and negative words are used as seed words for the SOPMI algorithm. For example, see Table 1:

TABLE 1 public sentiment data sampling (partial data, data source from Skyline forum of Changhong)

Step 2: Collect positive and negative words as seed words. Examples of seed words are as follows:

TABLE 2 Partial seed words (10040 seed words in total, from manual labeling)

Commendatory words	Derogatory words
Like	Bad review
Good review	Damage
Satisfaction	Cheater
Awesome	Disappointing
Support	Worry
Convenient to use	Fake goods
First class	Deficiency

Step 3: Segment each text in Table 1 with jieba and extract the nouns and verbs, which serve as the training data for the emotion dictionary.

Step 4: Using the segmentation result from step 3 and formula 1, calculate the PMI value between each word in the text and each commendatory and derogatory seed word.

Step 5: Calculate the SOPMI value of each word using formula 2.
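The segmentation and part-of-speech filtering of step 3 can be sketched as follows. This is a hedged illustration: it assumes that "nouns and verbs" means every jieba part-of-speech flag beginning with `n` or `v` (which also includes sub-classes such as proper nouns), and the helper names are hypothetical. jieba is a third-party package and is imported lazily, so the filtering helper can be used on its own.

```python
def keep_nouns_and_verbs(tagged_words):
    """Keep words whose part-of-speech flag starts with 'n' (noun) or 'v' (verb)."""
    return [word for word, flag in tagged_words if flag.startswith(("n", "v"))]


def segment_nouns_and_verbs(text):
    """Segment text with jieba's part-of-speech tagger and keep nouns and verbs."""
    import jieba.posseg as pseg  # third-party dependency: pip install jieba
    return keep_nouns_and_verbs((p.word, p.flag) for p in pseg.cut(text))


# Hand-tagged sample (invented), so the filter can be checked without jieba.
sample = [("电视", "n"), ("很", "d"), ("喜欢", "v"), ("好", "a")]
filtered = keep_nouns_and_verbs(sample)
```

Calling `segment_nouns_and_verbs` on a raw comment returns the token list that steps 4 and 5 would then score.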

Calculating the emotion scores of words and generating the emotion dictionary with the SOPMI algorithm comprises the following: after the data is segmented by jieba, the degree of association (PMI) of two words is calculated:

$$\mathrm{PMI}(word1, word2) = \log_2 \frac{P(word1\ \&\ word2)}{P(word1)\,P(word2)} \tag{1}$$

where PMI > 0 means the two words are correlated, and the higher the PMI value, the stronger the correlation; PMI = 0 means the two words are independent, neither correlated nor mutually exclusive; and PMI < 0 means the two words are mutually exclusive. The larger the PMI value, the higher the degree of association between the two words (for example, the stronger the association between the words "like" and "television", the higher their PMI value, and the more likely it is that the text expresses liking for the television).
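As a worked illustration of the three cases, assume a base-2 logarithm and hypothetical probabilities P(word1) = 0.1 and P(word2) = 0.2, so that P(word1)P(word2) = 0.02. These numbers are invented for illustration and are not taken from the patent's data.

```latex
\begin{align*}
P(word1\ \&\ word2) = 0.04 &:\quad \mathrm{PMI} = \log_2\frac{0.04}{0.02} = \log_2 2 = 1 > 0 && \text{(correlated)}\\
P(word1\ \&\ word2) = 0.02 &:\quad \mathrm{PMI} = \log_2\frac{0.02}{0.02} = \log_2 1 = 0 && \text{(independent)}\\
P(word1\ \&\ word2) = 0.01 &:\quad \mathrm{PMI} = \log_2\frac{0.01}{0.02} = \log_2 \tfrac{1}{2} = -1 < 0 && \text{(mutually exclusive)}
\end{align*}
```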

Then, for each word, the PMI values with the seed words are accumulated: the sum of its PMI values with the derogatory seed words is subtracted from the sum of its PMI values with the commendatory seed words, and the absolute value of the result is recorded. The final SOPMI value is thus the difference between the positive PMI sum and the negative PMI sum:

$$\mathrm{SOPMI}(word1) = \sum_{Pword} \mathrm{PMI}(word1, Pword) - \sum_{Nword} \mathrm{PMI}(word1, Nword) \tag{2}$$

Step 6: Take the SOPMI value as the emotion score in the emotion dictionary. Partial emotion dictionary results for the public opinion data sampled in Table 3 are shown below:

Forward (positive) dictionary:

Changhong	195.276138438
Television	181.5876993905827
Intelligence	143.11796751989684
Technology	135.83785489256738
Experience	132.00637932593963
Product	129.4934796412743

Negative dictionary:

Cause	12.0651507
Unscrupulous merchant	10.7752902
Deception	9.959563051
Masking	9.0626603
Cheating	8.44954610281
Problem	8.12146552
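The split into a forward and a negative dictionary can be sketched as follows. This is an assumption-laden illustration: the text only states that an absolute value is taken and lists positive scores in both dictionaries, so the sketch assumes the sign of the raw SOPMI difference decides which dictionary a word belongs to and the magnitude is stored as its score. The function name and the sample values are hypothetical.

```python
def split_dictionary(sopmi_scores):
    """Split raw SOPMI differences into positive and negative dictionaries.

    Assumption: the sign selects the dictionary and the absolute value is
    stored as the score; words with a difference of exactly 0 are dropped.
    """
    positive = {w: s for w, s in sopmi_scores.items() if s > 0}
    negative = {w: abs(s) for w, s in sopmi_scores.items() if s < 0}

    def by_score(d):
        # Order entries from highest to lowest score, as in the listings above.
        return dict(sorted(d.items(), key=lambda kv: kv[1], reverse=True))

    return by_score(positive), by_score(negative)


# Invented raw differences, not the patent's data.
raw = {"product": 3.5, "television": 7.25, "problem": -2.0, "cheating": -4.5, "the": 0.0}
forward_dict, negative_dict = split_dictionary(raw)
```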

Step 7: Segment all data of this embodiment with jieba, and look up each text's segmentation result in the obtained emotion dictionary to get the emotion score of each word.

Step 8: Sum the SOPMI values of the words in each text's segmentation result to obtain the text score. (For example, if the segmentation result of a text contains the two negative-dictionary words scored 10.7752902 ("unscrupulous merchant") and 9.0626603 ("masking"), the score of the text is the sum of their SOPMI values, i.e. 10.77 + 9.06 = 19.83, rounded to two decimal places for convenience.) All data obtained by applying this step are shown in Table 3.

Step 9: Judge the polarity of the text with a trained emotion analysis model, i.e., decide whether the text is positive or negative. (This patent uses a model trained with the TF-IDF algorithm and an artificial neural network; this technology is mature, is not the focus of this patent, and is not described in detail.) Multiply the polarity of the text by the SOPMI text score obtained in step 8 to obtain the final score, where the polarity parameter is 1 for positive and -1 for negative. (For the example in step 8, the score is -19.83.)

Step 10: If the absolute value of the score exceeds 100, set the emotion score of the comment to 100; otherwise, keep the original score. (The final results for the examples are shown in Table 4.)
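Steps 7 to 10 can be sketched together as one function. This is a minimal illustration under stated assumptions: words missing from the dictionary contribute 0, the polarity (+1 or -1) is supplied by an external classifier that is not shown, and the cap in step 10 keeps the sign of the score (the text only says the score is set to 100). The function name `score_text` and the dictionary keys are hypothetical.

```python
def score_text(tokens, emotion_dict, polarity, cap=100.0):
    """Look up tokens, sum their scores, apply polarity, and cap the magnitude."""
    if polarity not in (1, -1):
        raise ValueError("polarity must be 1 (positive) or -1 (negative)")
    # Steps 7-8: tokens absent from the dictionary contribute 0 (assumption).
    raw = sum(emotion_dict.get(token, 0.0) for token in tokens)
    # Step 9: multiply by the polarity predicted by the external classifier.
    signed = polarity * raw
    # Step 10: cap the magnitude at `cap`; keeping the sign is an assumption.
    if abs(signed) > cap:
        return cap if signed > 0 else -cap
    return signed


# The two-word example from step 8, with scores rounded to two decimals.
example_dict = {"unscrupulous merchant": 10.77, "masking": 9.06}
example_tokens = ["no", "unscrupulous merchant", "masking"]
example_score = score_text(example_tokens, example_dict, polarity=-1)
```

With a negative polarity the example reproduces the score of about -19.83 given in step 9, up to floating-point rounding.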

TABLE 3

TABLE 4

The sentiment scoring results are ranked to obtain the following content in table 5:

TABLE 5 Emotion scoring results

It follows that the more clearly a text describes something as good, the higher its score, and the more clearly a text describes something as bad, the lower its score. By means of the score, it can be judged which information is more valuable and meaningful for public opinion monitoring.

The beneficial effects that the emotion dictionary expansion method and the emotion polarity analysis method based on the SOPMI algorithm may bring include but are not limited to:

The method uses machine learning to determine the emotional tendency of comments, fundamentally solving the problem that emotional tendency cannot be determined in complex Chinese contexts. It also solves the problem that most emotion polarity analysis can only rely on an existing emotion dictionary and cannot generate an emotion dictionary specific to a project.

The distinctive technique is the use of the SOPMI algorithm: the probability that a word and a seed word appear together is calculated and used as the basis of emotion scoring. After part-of-speech tagging, the nouns and verbs in the segmentation result are extracted as the corresponding emotion words, and manually labeled degree adverbs are then used as weights to multiply the result, giving the final emotion score of each text.

The SOPMI algorithm, together with the technique of calculating the text score from the verb and noun scores, overcomes the limitation that most existing text scoring can only be based on a manually labeled emotion dictionary.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An emotion dictionary expansion method and an emotion polarity analysis method based on an SOPMI algorithm are characterized by comprising the following steps:

collecting public opinion text data;

collecting positive and negative words as seed words;

calculating the emotion scores of the words by using an SOPMI algorithm and generating an emotion dictionary; utilizing jieba to carry out word segmentation on the text data and extracting nouns and verbs in the text data;

2. The method as claimed in claim 1, wherein the public opinion text data is from forums, posts and network data.

3. The emotion dictionary expansion method and emotion polarity analysis method based on the SOPMI algorithm according to claim 1, wherein calculating the emotion scores of words and generating the emotion dictionary with the SOPMI algorithm, and segmenting the text data and extracting the nouns and verbs in the text data, comprise the following steps: after the data is segmented by jieba, the degree of association (PMI) of two words is calculated: