CN108763214B - Automatic construction method of emotion dictionary for commodity comments - Google Patents

Automatic construction method of emotion dictionary for commodity comments

Info

Publication number
CN108763214B
Authority
CN
China
Prior art keywords
emotion
words
matrix
evaluation
evaluation objects
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810539447.4A
Other languages
Chinese (zh)
Other versions
CN108763214A (en)
Inventor
冯钧
贡诚
李晓东
邹希
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201810539447.4A
Publication of CN108763214A
Application granted
Publication of CN108763214B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282: Rating or review of business operators or products

Abstract

The invention discloses a method for automatically constructing an emotion dictionary for commodity comments, comprising text preprocessing, semantic relation mining, and emotion word clustering. Text preprocessing cleans the commodity comments and extracts the emotion words and evaluation objects contained in the comments on a given type of commodity. Semantic relation mining uncovers the semantic relations between the emotion words and the evaluation objects and represents them in matrix form. Emotion word clustering performs unsupervised clustering of the emotion words according to their mutual distances in the emotion matrix space, dividing them into k classes. The invention constructs a domain emotion dictionary tailored to the characteristics of commodity comment text. The dictionary can divide emotion words into multiple classes rather than only the two classes of commendatory and derogatory, and in the commodity comment field it has clear advantages over existing general-purpose emotion dictionaries in tasks such as emotion classification.

Description

Automatic construction method of emotion dictionary for commodity comments
Technical Field
The invention relates to a method for automatically constructing an emotion dictionary for the commodity comment field, and belongs to the technical field of computer information processing.
Background
With the growth of shopping websites, a large number of comments on all kinds of commodities have appeared online, and people can consult them anytime and anywhere. Identifying the emotional tendency these comments imply is important for both merchants and consumers, and a good emotion dictionary is the basis for analyzing the emotional tendency of text. Emotion analysis of a text must take into account the industry to which the text belongs. Existing emotion dictionaries, however, are general-purpose; there is no domain-specific emotion dictionary for commodity comments, so using a conventional emotion dictionary to analyze commodity comment text is inappropriate. Methods for automatically constructing emotion dictionaries, especially for a specific field, have therefore attracted increasing attention from researchers.
Existing emotion dictionary construction methods, for both Chinese and English, fall into two categories: corpus-based and knowledge-base-based. In the corpus-based approach, the most common method is to select seed words and determine the polarity of an emotion word of unknown polarity by computing its relation to the seed words, namely the PMI value. For Chinese, the available general knowledge bases are very limited, so research that constructs Chinese emotion dictionaries from a knowledge base is rare. Moreover, when constructing an emotion dictionary for the commodity comment field, the evaluation object must be given particular consideration. An evaluation object is a feature of the commodity that the user evaluates; for a mobile phone, for example, the evaluation object may be the screen or the battery.
In addition, existing emotion dictionaries usually contain only a set of emotion words divided into two categories, positive and negative. Some scholars instead classify emotions into six major classes: joy, sadness, fear, surprise, anger, and jealousy. In either case, the classes into which emotion words can fall are determined from human experience.
Because many emotion words show different emotional tendencies in different fields, it is important to identify these emotion words and the evaluation objects of a field accurately, especially in the field of commodity comments. Fast found it difficult to construct a domain emotion dictionary by relying on experts or on crowdsourcing services. Shi et al. extracted key information from domain text using an association rule algorithm combined with supervised machine learning. Zhang et al. extracted the evaluation objects of a product using pointwise mutual information (PMI) and an association rule algorithm. Considering the ordering of evaluation objects, Qiu et al. proposed a bidirectional propagation algorithm based on computing the positive relationship between emotion words and the product. Mishne selected evaluation objects using the part of speech and the frequency of words.
PMI is a common measure of the degree of association between two words. Turney and Littman used PMI and LSA to calculate the degree of association between two words; calculating the relation between a word and the seed words by PMI in this way is generally called SO-PMI. Islam and Inkpen improved PMI and proposed SOC-PMI. Emotion classification is a basic task of emotion analysis, and the quality of the classification result directly reflects the performance of the emotion dictionary used. Pang treated emotion classification as a text classification task and tested three classifiers: naive Bayes, support vector machine, and maximum entropy. Li and Hao expanded the evaluation objects using spectral clustering. Yang et al. used word2vec to calculate the cosine similarity between a word and the seed words.
Most existing emotion dictionary construction methods produce general dictionaries, which are not suited to analyzing text from a specific field such as commodity comments. It is therefore important to construct an emotion dictionary suited to a specific field.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the defects in the prior art and provides an emotion dictionary construction method aiming at the field of commodity comments.
The technical scheme is as follows: an automatic construction method of an emotion dictionary for commodity comments sequentially comprises the following steps:
(1) Preprocess the original commodity comment text. Chinese word segmentation, stop word filtering, and similar measures are applied to determine the emotion words and evaluation objects contained in the text of the specified field. They are determined by part of speech: nouns in the comment text are selected as evaluation objects, and adjectives, adverbs, and verbs are selected as emotion words.
(2) Mine the relation between the emotion words and the evaluation objects obtained in step (1), and generate an emotion matrix representing the relation.
(3) Screen the evaluation objects obtained in step (1), retaining a subset of key evaluation objects.
(4) As in step (2), consider the relation between the emotion words and the key evaluation objects, and generate an emotion matrix representing that relation.
(5) Mine the correlation between the key evaluation objects screened in step (3) and the original evaluation objects obtained in step (1), and generate a correlation matrix representing it.
(6) Using the two emotion matrices obtained in steps (2) and (4) and the correlation matrix obtained in step (5), generate a new emotion matrix representing the relation between the emotion words and the key evaluation objects.
(7) Cluster the emotion words according to the distances between them in the emotion matrix of step (6), dividing them into several classes to obtain a domain emotion dictionary.
(8) Apply the emotion dictionary to the emotion classification task, determine an optimal k value for each field by a method such as cross-validation, and divide the emotion words into k classes.
Further, in step (2), the relation between an emotion word and an evaluation object directly reflects the degree to which the emotion word modifies the evaluation object:
(2.1) The relation between the emotion words and the evaluation objects is quantified by their co-occurrence, and PMI is used to calculate it.
(2.2) The relation is represented by a matrix (the emotion matrix): each row corresponds to an emotion word, each column corresponds to an evaluation object, and each cell holds the PMI value of the corresponding emotion word and evaluation object.
Further, in step (3), the evaluation objects are screened using the idea of tf-idf, as follows:
(3.1) The comments on the same type of product are merged into one document, and the tf-idf value of each word is calculated from the number of times it occurs in the different documents (that is, the different product comment sets) and its inverse document frequency.
(3.2) The tf-idf values of all words are calculated, the tf-idf values of the evaluation objects are sorted, a threshold is set, and the evaluation objects that reach the threshold are taken as the final evaluation objects.
Further, step (4) constructs an emotion matrix in the same way as step (2); the only difference is that the emotion matrix of step (4) contains the relations between the emotion words and the evaluation objects retained after screening, rather than all evaluation objects.
Further, in step (5), the relations between all evaluation objects and the screened evaluation objects are mined, and a correlation matrix between the key evaluation objects and all evaluation objects is generated, as follows:
(5.1) The key evaluation objects are those obtained by the screening in step (3); all evaluation objects are all the nouns contained in the original product comments.
(5.2) The degree of association between a key evaluation object and any evaluation object can be understood as a synonymy relation within the corpus. It is expressed as a value in [0,1], where 0 means no relation and 1 means the highest degree of association; the closer the value is to 1, the higher the degree of association.
Further, in step (6), the two emotion matrices and the correlation matrix constructed above are used to generate the final emotion matrix, based on EPMI, an improved PMI algorithm proposed herein, as follows:
(6.1) EPMI algorithm:
$$\mathrm{EPMI}(e_i, m_j) = \mathrm{PMI}(e_i, m_j) + \sum_{k=1}^{p} u_{jk}\,\mathrm{PMI}(e_i, m_k)$$
That is, when computing the relation between the emotion word $e_i$ and the evaluation object $m_j$, the evaluation objects $m_k$ related to $m_j$ must be considered in addition to the direct relation between the two; the degree of that relation is denoted $u_{jk}$ in the formula.
(6.2) The EPMI algorithm is used to mine the relation between the emotion words and the evaluation objects and to construct a new emotion matrix, which is obtained directly from the two emotion matrices and the correlation matrix computed above.
Further, in step (7), the emotion words are clustered: each emotion word is represented as a vector in the emotion matrix, so the emotion words can be clustered without supervision according to the distances between their vectors.
Further, in step (8), because the clustering is unsupervised, the final number of clusters k is not fixed in advance. The optimal k differs across product comments and text analysis tasks, and a k value with stable performance can be selected by cross-validation.
Beneficial effects: compared with the prior art, the domain emotion dictionary produced by the method outperforms a general emotion dictionary in the specific field of commodity comments. The invention provides a method for constructing an emotion dictionary from a comment corpus. Unlike a traditional emotion dictionary, the constructed dictionary divides emotion words into a variable number k of classes rather than fixed classes such as commendatory and derogatory, and this higher-dimensional domain emotion dictionary performs better in tasks such as emotion classification.
Drawings
FIG. 1 is a flow diagram of a text pre-processing module;
FIG. 2 is a schematic diagram of semantic relationship mining;
FIG. 3 is a flow chart of k value selection.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
First, in order to facilitate understanding of the present invention, the following description is made:
1. Text preprocessing
The text preprocessing module preprocesses the original comment text, as shown in FIG. 1. Chinese text requires word segmentation, for which many open-source tools are available, such as the jieba word segmenter and the IK word segmenter. To achieve the best segmentation, a custom dictionary is often added so that rarely used domain words are recognized. English text needs no segmentation because words are separated by spaces, but English words are inflected for tense and other features, so they must be lemmatized and stemmed. Many open-source tools also support stemming of English vocabulary, such as the widely used natural language processing toolkit NLTK.
Beyond this basic preprocessing, the most important function of the module is to obtain the preliminary emotion words and evaluation objects, which are extracted by part-of-speech analysis. Emotion words express a user's strongly subjective attitude toward something; adjectives, adverbs, and verbs are selected as emotion words. An evaluation object, as the name implies, is the object modified by an emotion word, and may be a product or a feature of a product; nouns are selected as evaluation objects.
Whether the text is English or Chinese, a large number of words carry no real meaning, such as "yes" and "do" in Chinese and "is" and "the" in English. Stop word filtering is therefore required to remove these words and reduce the negative influence of meaningless high-frequency words on text analysis.
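The part-of-speech rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the comment has already been segmented and tagged by some external tool, and the single-letter tag prefixes (n, a, d, v) and the function name `extract_candidates` are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

# Assumed tag prefixes: n = noun, a = adjective, d = adverb, v = verb.
OBJECT_PREFIXES = ("n",)
EMOTION_PREFIXES = ("a", "d", "v")


def extract_candidates(
    tagged_tokens: Iterable[Tuple[str, str]],
    stop_words: Set[str],
) -> Tuple[List[str], List[str]]:
    """Split POS-tagged tokens into emotion words and evaluation objects."""
    emotion_words: List[str] = []
    evaluation_objects: List[str] = []
    for word, tag in tagged_tokens:
        if word in stop_words:
            continue  # drop meaningless high-frequency words first
        if tag.startswith(OBJECT_PREFIXES):
            evaluation_objects.append(word)  # nouns become evaluation objects
        elif tag.startswith(EMOTION_PREFIXES):
            emotion_words.append(word)  # adjectives, adverbs, verbs
    return emotion_words, evaluation_objects


# Hypothetical, already-tagged comment.
review = [("screen", "n"), ("is", "v"), ("very", "d"), ("clear", "a"), ("the", "x")]
emotions, objects = extract_candidates(review, stop_words={"is", "the"})
print(emotions, objects)
```

Filtering stop words before the part-of-speech test keeps words such as "is" from being collected as emotion words merely because they are tagged as verbs.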
2. Semantic relationship mining
The semantic relation mining module is the core module of the invention: only by fully mining the relation between the emotion words and the evaluation objects can a good foundation be laid for the subsequent emotion word clustering. The process of obtaining the final emotion matrix is divided into three stages, as shown in FIG. 2.
The relation between an emotion word and an evaluation object is one of modifier and modified. It shows up as co-occurrence in the comment text: the more frequently the two co-occur, the closer their relation is considered to be. Because PMI measures the relation between two words from their co-occurrence, PMI is first used to calculate the initial emotion matrix.
This raises a problem of excessive dimensionality. The number of evaluation objects in a comment corpus is often far greater than the number of emotion words, so each emotion word is represented in the emotion matrix as a very high-dimensional vector, and clustering the emotion words by vector distance suffers greatly in accuracy and performance. The evaluation objects are therefore screened by borrowing the idea of tf-idf. An evaluation object is a product, or a feature of a product, on which the user comments, and these features are domain-related. In mobile phone reviews, for example, users care about features such as "mobile phone", "screen", and "battery"; in hotel reviews, they care about features such as "hotel", "toilet", and "air conditioner". Features such as "mobile phone", "screen", and "battery" appear in a large number of mobile phone reviews but in few hotel reviews, whereas "hotel", "toilet", and "air conditioner" appear in many hotel reviews but not in mobile phone reviews. The idea of tf-idf can therefore be used to screen for the real evaluation objects.
tf-idf is defined over documents, whereas the objects processed here are individual comments. The comments on the same type of product are therefore merged into one document, so that the number of documents equals the number of product types for which comments were collected. From these documents the tf-idf value of each word in the comment text can be calculated, and the nouns with higher tf-idf values are usually the evaluation objects of interest.
A threshold α is set, and only nouns whose tf-idf value reaches the threshold are retained as evaluation objects, called key evaluation objects. As with the original emotion matrix, the PMI values between the emotion words and the key evaluation objects are calculated to construct a new emotion matrix in which each emotion word is again represented as a vector. Because the evaluation objects have been screened by tf-idf, the dimension of this vector is far smaller than in the original emotion matrix.
The PMI value considers only the co-occurrence of an emotion word and an evaluation object, which causes some semantic loss. When the relation between an emotion word and an evaluation object is calculated, not only their direct relation but also the relations between the emotion word and other evaluation objects related to that evaluation object must be considered. The correlation between each key evaluation object and all evaluation objects is therefore calculated to obtain a correlation matrix, as follows:
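The tf-idf screening can be illustrated with a short sketch. The text does not fix the exact tf-idf variant, so the choices here are assumptions: tf is the relative frequency within the merged document of the target product type, idf is log(N / df), and the names `tfidf_scores` and `key_objects` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_scores(documents: Dict[str, List[str]], target: str) -> Dict[str, float]:
    """tf-idf of every word in documents[target]; one merged document per product type."""
    n_docs = len(documents)
    doc_freq: Counter = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))  # count each word once per document
    counts = Counter(documents[target])
    total = sum(counts.values())
    return {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


def key_objects(scores: Dict[str, float], candidates: List[str], alpha: float) -> List[str]:
    """Keep the candidate nouns whose tf-idf value reaches the threshold alpha."""
    return [w for w in candidates if scores.get(w, 0.0) >= alpha]


# Hypothetical merged documents, one per product type.
documents = {
    "phone": ["screen", "battery", "screen", "price"],
    "hotel": ["toilet", "air_conditioner", "price"],
}
scores = tfidf_scores(documents, target="phone")
keys = key_objects(scores, ["screen", "battery", "price"], alpha=0.1)
print(keys)
```

In this toy corpus "price" occurs in both documents, so its idf and therefore its tf-idf value is zero and it is screened out, while the domain-specific nouns are kept.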
$$u_{ij} = \mathrm{normal}\Big(\big|\{\, d \in D : (m'_i, m_j)\ \mathrm{in}\ d \,\}\big|\Big), \quad m'_i \in M',\ m_j \in M$$
where M is the set of all product features (evaluation objects), M' is the set of screened key evaluation objects, D is the set of comments, and normal() is a simple normalization function that maps the value of each dimension of a vector into the interval [0,1]. Furthermore, $(m'_i, m_j)$ in $d$ means that the key evaluation object $m'_i$ and the evaluation object $m_j$ appear in the same comment $d$; the more often two evaluation objects appear in the same comment, the higher their degree of association.
From this calculation the correlation matrix is obtained. The final emotion matrix is then obtained from the two emotion matrices computed earlier and the correlation matrix, according to the EPMI algorithm, namely formula (5).
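The co-occurrence counting behind the correlation matrix might be sketched as below. The form of normal() is not specified in the text, so dividing each row by its maximum is an assumption, as are the function name `relevance_matrix` and the representation of each comment as a set of words.

```python
from typing import List, Sequence, Set


def relevance_matrix(
    reviews: Sequence[Set[str]],
    key_objects: Sequence[str],
    all_objects: Sequence[str],
) -> List[List[float]]:
    """Rows: key evaluation objects; columns: all evaluation objects; values in [0, 1]."""
    matrix: List[List[float]] = []
    for key in key_objects:
        # Number of comments in which the key object and each object appear together.
        row = [
            float(sum(1 for review in reviews if key in review and obj in review))
            for obj in all_objects
        ]
        peak = max(row, default=0.0)
        # Assumed normalization: scale each row by its largest count.
        matrix.append([v / peak if peak > 0 else 0.0 for v in row])
    return matrix


# Hypothetical comments, each reduced to the set of nouns it contains.
reviews = [
    {"screen", "display", "battery"},
    {"screen", "display"},
    {"battery", "charger"},
]
C = relevance_matrix(
    reviews,
    key_objects=["screen", "battery"],
    all_objects=["screen", "display", "battery", "charger"],
)
print(C)
```

Under this normalization a key evaluation object always has relevance 1 with itself, since it co-occurs with itself in every comment that mentions it.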
3. Emotional word clustering
Once the final emotion matrix is obtained, the emotion words can be clustered: each emotion word is represented as a vector in the matrix, and clustering is performed according to the distances between the vectors. Because the clustering is unsupervised, a suitable k is selected by cross-validation, taking emotion classification as the example task. The selection process is shown in FIG. 3.
The comment data set is divided into m parts; m-1 parts are used as the training set and the remaining part as the test set, and the classification accuracy of each candidate k value is calculated on the test set. Rotating the test part over m rounds yields m accuracy values for each k, and the k value with the highest average accuracy is selected as the final k.
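The m-fold selection of k can be sketched as follows. The classifier itself is outside the scope of this example: `evaluate(k, train, test)` is a hypothetical caller-supplied function that builds a k-class dictionary from the training part and returns the classification accuracy on the test part, and `select_k` is likewise an illustrative name.

```python
from typing import Callable, Dict, List, Sequence


def select_k(
    data: Sequence,
    candidate_ks: Sequence[int],
    m: int,
    evaluate: Callable[[int, List, List], float],
) -> int:
    """Return the k with the highest mean accuracy over m cross-validation rounds."""
    folds = [list(data[i::m]) for i in range(m)]  # split the data into m parts
    mean_accuracy: Dict[int, float] = {}
    for k in candidate_ks:
        scores = []
        for i in range(m):
            test = folds[i]  # one part held out for testing
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(evaluate(k, train, test))
        mean_accuracy[k] = sum(scores) / m
    return max(mean_accuracy, key=mean_accuracy.get)


# Toy stand-in for a real classifier: accuracy peaks at k = 4.
def toy_evaluate(k: int, train: List, test: List) -> float:
    return 1.0 - abs(k - 4) / 10


best_k = select_k(list(range(10)), candidate_ks=[2, 4, 6], m=5, evaluate=toy_evaluate)
print(best_k)
```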
The method for automatically constructing an emotion dictionary for commodity comments comprises the following steps in sequence:
(1) Preprocess the original commodity comment text. Chinese word segmentation, stop word filtering, and similar measures are applied to determine the emotion words and evaluation objects contained in the text of the specified field.
(2) Mine the relation between the emotion words and the evaluation objects obtained in step (1), and generate an emotion matrix representing the relation.
The relation between an emotion word and an evaluation object directly reflects the degree to which the emotion word modifies the evaluation object:
(2.1) The relation between the emotion words and the evaluation objects is quantified by their co-occurrence, and PMI is used to calculate it.
The relation between an emotion word and an evaluation object is calculated using pointwise mutual information (PMI), defined as follows:
$$\mathrm{PMI}(word_1, word_2) = \log \frac{p(word_1, word_2)}{p(word_1)\,p(word_2)}$$
$$p(word_1, word_2) = \frac{\mathrm{count}(word_1, word_2)}{N}$$
$$p(word) = \frac{\mathrm{count}(word)}{N}$$
where $p(word_1, word_2)$ is the probability that $word_1$ and $word_2$ co-occur in the same window in the commodity comment text, N is the number of different words contained in the commodity comments under consideration, $\mathrm{count}(word_1, word_2)$ is the number of times $word_1$ and $word_2$ co-occur in the same window in the commodity comments, and $\mathrm{count}(word)$ is the number of times the word appears in the commodity comment text.
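A minimal sketch of the PMI calculation from raw counts follows. The logarithm base is not stated in the text, so base 2 is an assumption, as is returning 0 when any count is zero (where PMI is undefined); the function name `pmi` is illustrative.

```python
import math


def pmi(count_pair: int, count_w1: int, count_w2: int, n: int) -> float:
    """PMI of two words from their co-occurrence count, individual counts, and N."""
    if count_pair == 0 or count_w1 == 0 or count_w2 == 0:
        return 0.0  # assumption: treat an undefined PMI as no association
    p_pair = count_pair / n
    p_w1 = count_w1 / n
    p_w2 = count_w2 / n
    return math.log2(p_pair / (p_w1 * p_w2))  # assumption: base-2 logarithm


# Hypothetical counts: the pair co-occurs 4 times, each word occurs 8 times, N = 64.
value = pmi(count_pair=4, count_w1=8, count_w2=8, n=64)
print(value)
```

With these counts the joint probability is four times what independence would predict, so the PMI is positive; a pair that never co-occurs is mapped to 0 under the stated assumption.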
(2.2) The relation is represented by a matrix (the emotion matrix): each row corresponds to an emotion word, each column corresponds to an evaluation object, and each cell holds the PMI value of the corresponding emotion word and evaluation object.
The emotion matrix between the emotion words and the evaluation objects is defined as the matrix A:
$$A = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1p} \\ w_{21} & w_{22} & \cdots & w_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{np} \end{pmatrix}$$
The emotion matrix A has n rows and p columns. The n rows represent the n emotion words $e_1, \dots, e_n$, and the p columns represent the p evaluation objects $m_1, \dots, m_p$, where p is much larger than n. The entry $w_{ij}$ is the PMI value between the emotion word $e_i$ and the evaluation object $m_j$, that is, $w_{ij} = \mathrm{PMI}(e_i, m_j)$.
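Assembling the matrix A from tokenized comments might look like the sketch below. The window size of 5, the base-2 logarithm, the value 0 for pairs that never co-occur, and the name `emotion_matrix` are all assumptions made for illustration; N is taken as the number of distinct words, following the definition given above.

```python
import math
from collections import Counter
from typing import List, Sequence


def emotion_matrix(
    reviews: Sequence[Sequence[str]],
    emotion_words: Sequence[str],
    objects: Sequence[str],
    window: int = 5,
) -> List[List[float]]:
    """n x p matrix whose (i, j) cell is PMI(emotion_words[i], objects[j])."""
    word_count: Counter = Counter()
    pair_count: Counter = Counter()
    emotion_set, object_set = set(emotion_words), set(objects)
    for tokens in reviews:
        word_count.update(tokens)
        for i, word in enumerate(tokens):
            if word not in emotion_set:
                continue
            # Look at the neighbours within `window` positions on either side.
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in object_set:
                    pair_count[(word, tokens[j])] += 1
    n = len(word_count)  # N: number of distinct words

    def pmi(e: str, m: str) -> float:
        c = pair_count[(e, m)]
        if c == 0:
            return 0.0  # assumption: no co-occurrence means no association
        return math.log2((c / n) / ((word_count[e] / n) * (word_count[m] / n)))

    return [[pmi(e, m) for m in objects] for e in emotion_words]


# Hypothetical tokenized comments.
reviews = [["screen", "clear"], ["battery", "weak"], ["screen", "clear", "bright"]]
A = emotion_matrix(reviews, emotion_words=["clear", "weak"], objects=["screen", "battery"])
print(A)
```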
(3) Screen the evaluation objects obtained in step (1), retaining a subset of key evaluation objects.
The evaluation objects are screened using the idea of tf-idf, as follows:
(3.1) The comments on the same type of product are merged into one document, and the tf-idf value of each word is calculated from the number of times it occurs in the different documents (that is, the different product comment sets) and its inverse document frequency.
(3.2) The tf-idf values of all words are calculated, the tf-idf values of the evaluation objects are sorted, a threshold is set, and the t evaluation objects that reach the threshold are taken as the final evaluation objects.
(4) As in step (2), consider the relation between the emotion words and the key evaluation objects, and generate an emotion matrix representing that relation. The only difference from step (2) is that the emotion matrix of step (4) contains the relations between the emotion words and the evaluation objects retained after screening, rather than all evaluation objects.
The resulting emotion matrix B has n rows and t columns: the n rows again represent the n emotion words, and the t columns represent the t key evaluation objects. Writing $m'_j$ for the j-th key evaluation object:
$$B = \big(\mathrm{PMI}(e_i, m'_j)\big)_{n \times t}$$
(5) Mine the correlation between the key evaluation objects screened in step (3) and the original evaluation objects obtained in step (1), and generate a correlation matrix representing it.
In step (5), the relations between all evaluation objects and the screened evaluation objects are mined, and a correlation matrix between the key evaluation objects and all evaluation objects is generated, as follows:
(5.1) The key evaluation objects are those obtained by the screening in step (3); all evaluation objects are all the nouns contained in the original product comments.
(5.2) The degree of association between a key evaluation object and any evaluation object can be understood as a synonymy relation within the corpus. It is expressed as a value in [0,1], where 0 means no relation and 1 means the highest degree of association.
The constructed correlation matrix C is as follows:
$$C = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1p} \\ u_{21} & u_{22} & \cdots & u_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{t1} & u_{t2} & \cdots & u_{tp} \end{pmatrix}$$
The correlation matrix C has t rows and p columns: the t rows represent the t screened key evaluation objects, and the p columns represent all p evaluation objects. The entry $u_{ij}$ is the correlation between the key evaluation object $m_i$ and the evaluation object $m_j$, obtained by counting the comments in the product comment text in which $m_i$ and $m_j$ appear together.
(6) Using the two emotion matrices obtained in steps (2) and (4) and the correlation matrix obtained in step (5), generate a new emotion matrix representing the relation between the emotion words and the key evaluation objects.
The final emotion matrix is generated by the EPMI algorithm from the two emotion matrices and the correlation matrix constructed above. EPMI algorithm:
$$\mathrm{EPMI}(e_i, m_j) = \mathrm{PMI}(e_i, m_j) + \sum_{k=1}^{p} u_{jk}\,\mathrm{PMI}(e_i, m_k)$$
That is, when computing the relation between the emotion word $e_i$ and the evaluation object $m_j$, the evaluation objects $m_k$ related to $m_j$ must be considered in addition to the direct relation between the two; the degree of that relation is denoted $u_{jk}$ in the formula.
The emotion matrix D is calculated as follows:
$$D_{n \times t} = B_{n \times t} + A_{n \times p}\,\big(C_{t \times p}\big)^{\mathsf{T}} \qquad (5)$$
Like the matrices A and B, the matrix D represents the relation between the emotion words and the evaluation objects; the difference is that A and B are calculated with the traditional PMI algorithm, whereas D is calculated with the improved PMI algorithm, EPMI.
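Formula (5) reduces to one matrix product and one addition. The sketch below uses plain nested lists so that it runs without third-party libraries; the function name `epmi_matrix` and the toy values are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def epmi_matrix(A: Matrix, B: Matrix, C: Matrix) -> Matrix:
    """Compute D = B + A * C^T for A (n x p), B (n x t), C (t x p)."""
    n, t, p = len(A), len(C), len(C[0])
    return [
        [
            # D[i][j] = B[i][j] + sum over k of A[i][k] * C[j][k]
            B[i][j] + sum(A[i][k] * C[j][k] for k in range(p))
            for j in range(t)
        ]
        for i in range(n)
    ]


# Toy example: n = 1 emotion word, p = 3 evaluation objects, t = 2 key objects.
A = [[1.0, 2.0, 0.0]]
B = [[1.0, 0.0]]
C = [[1.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
D = epmi_matrix(A, B, C)
print(D)
```

Each cell of D adds to the direct PMI value in B the PMI values of all evaluation objects in A, weighted by their relevance to the key evaluation object in C.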
(7) And (4) clustering the emotion words according to the distance between the emotion words in the emotion matrix in the step (6), and dividing the emotion words into several types to obtain a domain emotion dictionary.
In the emotion matrix D, the emotion words can be expressed into vectors by each row in the matrix, and the emotion words can be grouped into several types by adopting a clustering method such as k-means and the like according to the distance between the vectors in the matrix space. Finally, a domain emotion dictionary which divides emotion words into some classes can be obtained.
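The clustering step can be illustrated with a small k-means written in plain Python. The seeding with the first k rows, the fixed iteration count, squared Euclidean distance, and the names `kmeans` and `build_dictionary` are assumptions for this sketch; a library implementation could be substituted.

```python
from typing import Dict, List, Sequence


def kmeans(vectors: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Assign each vector to one of k clusters using squared Euclidean distance."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: seed with the first k rows
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_dictionary(words: Sequence[str], matrix: Sequence[Sequence[float]], k: int) -> Dict[int, List[str]]:
    """Group emotion words by the cluster of their row in the emotion matrix."""
    dictionary: Dict[int, List[str]] = {c: [] for c in range(k)}
    for word, label in zip(words, kmeans(matrix, k)):
        dictionary[label].append(word)
    return dictionary


# Hypothetical emotion matrix D: 4 emotion words x 2 key evaluation objects.
words = ["clear", "bright", "weak", "slow"]
D = [[3.0, 0.1], [2.8, 0.0], [0.1, 2.5], [0.0, 2.9]]
dictionary = build_dictionary(words, D, k=2)
print(dictionary)
```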
(8) Apply the emotion dictionary to the emotion classification task; for each field, determine the optimal value of k by a method such as cross-validation, and divide the emotion words into k classes.

Claims (7)

1. An automatic construction method of an emotion dictionary for commodity comments is characterized by sequentially comprising the following steps:
(1) preprocessing an original commodity comment text, and determining emotion words and evaluation objects contained in a specified field text;
(2) mining the relation between the emotion words obtained in the step (1) and the evaluation object, and generating an emotion matrix representing the relation;
(3) screening the evaluation objects obtained in the step (1) and leaving key evaluation objects;
(4) considering the relation between the emotional words and the key evaluation objects, and generating an emotional matrix representing the relation between the emotional words and the key evaluation objects;
(5) mining the correlation between the key evaluation object screened in the step (3) and the original evaluation object obtained in the step (1), and generating a correlation matrix representing the correlation between the key evaluation object and the original evaluation object;
(6) using the two emotion matrices obtained in steps (2) and (4) and the correlation matrix obtained in step (5) to generate a new emotion matrix representing the relationship between the emotion words and the key evaluation objects;
(7) clustering the emotion words according to the distance between the emotion words in the emotion matrix in the step (6), and dividing the emotion words into several types to obtain a domain emotion dictionary;
(8) applying the emotion dictionary to an emotion classification task, determining an optimal value of k for each field by a method such as cross-validation, and dividing the emotion words into k classes;
in step (2), the relationship between an emotion word and an evaluation object directly reflects the degree to which the emotion word modifies the evaluation object:
(2.1) quantifying the relation between the emotion words and the evaluation objects by using the co-occurrence of the emotion words and the evaluation objects, wherein the PMI is adopted to calculate the relation between the emotion words and the evaluation objects;
the PMI calculation formula is defined as follows:
Figure FDA0003210381060000011
Figure FDA0003210381060000012
Figure FDA0003210381060000013
wherein p(word_1, word_2) is the probability that the two words word_1 and word_2 co-occur in the same window in the commodity comment text; n is the number of different words contained in the commodity comments under consideration; count(word_1, word_2) is the number of times the two words word_1 and word_2 co-occur in the same window in the commodity comments; and count(word) is the number of times the word appears in the commodity comment text;
(2.2) a matrix (the emotion matrix) is used to represent the relationship: each row of the matrix corresponds to an emotion word, each column corresponds to an evaluation object, and each cell holds the PMI value of the corresponding emotion word and evaluation object;
the emotion matrix between the emotion words and the evaluation objects is defined as a matrix A as follows:
Figure FDA0003210381060000021
the emotion matrix A has n rows and p columns, wherein the n rows represent the n emotion words e_1 to e_n, the p columns represent the p evaluation objects m_1 to m_p, and w_ij represents the PMI value between emotion word e_i and evaluation object m_j, that is, w_ij = PMI(e_i, m_j).
2. The method as claimed in claim 1, wherein in step (1) the emotion words and the evaluation objects are determined according to part of speech: nouns contained in the comment text are selected as evaluation objects, and adjectives, adverbs and verbs in the comment text are taken as emotion words.
3. The method for automatically constructing an emotion dictionary for commodity comments as claimed in claim 1, wherein in step (3) the evaluation objects are screened using the tf-idf approach, specifically as follows:
(3.1) combining the comments on the same type of product into one document, and calculating the tf-idf value of each word from the number of times the word appears in the different documents and its inverse document frequency;
(3.2) after the tf-idf values of all words have been calculated, sorting the evaluation objects by tf-idf value, setting a threshold, and selecting the t evaluation objects that reach the threshold as the final evaluation objects.
4. The automatic construction method of an emotion dictionary for commodity comments according to claim 1, wherein the emotion matrix in step (4) is constructed in the same way as in step (2), the only difference being that the emotion matrix in step (4) contains the relationships between the emotion words and the evaluation objects remaining after screening, rather than between the emotion words and all evaluation objects;
the constructed emotion matrix B has n rows and t columns; the n rows represent the n emotion words, and the t columns represent the t key evaluation objects;
Figure FDA0003210381060000031
5. The method for automatically constructing an emotion dictionary for commodity comments as set forth in claim 1, wherein in step (5) the relationships between all evaluation objects and the screened evaluation objects are mined to generate a correlation matrix between the key evaluation objects and all evaluation objects, the correlation matrix C being as follows:
Figure FDA0003210381060000032
the correlation matrix C has t rows and p columns, wherein the t rows represent the t screened key evaluation objects, the p columns represent the p original evaluation objects, and u_ij represents the correlation between evaluation object m_i and evaluation object m_j, the correlation being obtained by counting the number of times m_i and m_j appear together in a product comment text.
6. The automatic construction method of an emotion dictionary for commodity comments as set forth in claim 1, wherein in step (6) the final emotion matrix is generated by the EPMI algorithm using the two emotion matrices and the correlation matrix constructed previously;
EPMI algorithm:
Figure FDA0003210381060000033
when calculating the relationship between emotion word e_i and evaluation object m_j, it is necessary to consider not only the relationship between the two but also the evaluation objects related to m_j, the degree of that relatedness being denoted u_jk in the formula;
the new emotion matrix D is calculated as follows:
D[n][t] = B[n][t] + A[n][p] * (C[t][p])^T (5);
wherein D[n][t] and B[n][t] are both emotion matrices representing the correlation between the n emotion words and the t key evaluation objects, A[n][p] is the emotion matrix representing the correlation between the n emotion words and all p evaluation objects, (C[t][p])^T is the transpose of the correlation matrix between the key evaluation objects and all evaluation objects, and formula (5) is the matrix form of formula (4).
7. The method as claimed in claim 1, wherein in step (7) the emotion words are clustered as follows: each row of the emotion matrix D represents an emotion word as a vector, and the emotion words are grouped into several classes by the k-means clustering method according to the distances between these vectors, finally yielding a domain emotion dictionary that divides the emotion words into several classes.
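Returning to the PMI computation of step (2.1) in claim 1, the following Python is a minimal sketch under stated assumptions. The formula images are not reproduced in this text, so the sketch assumes the standard definition PMI = log2(p(word_1, word_2) / (p(word_1) p(word_2))) with each probability taken as a count divided by the number of distinct words, as the claim's wording suggests. Treating a whole comment as the co-occurrence window and assigning 0 to pairs that never co-occur are further assumptions, and the function name `pmi_table` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(comments):
    """Return a function pmi(w1, w2) built from tokenized comments."""
    word_count = Counter()
    pair_count = Counter()
    for tokens in comments:
        word_count.update(tokens)
        # Assumption: the window is the whole comment, counted once per pair.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_count[(a, b)] += 1
    n = len(word_count)  # number of distinct words, following the claim's wording

    def pmi(w1, w2):
        joint = pair_count[tuple(sorted((w1, w2)))]
        if joint == 0:
            return 0.0  # assumption: unseen pairs get 0 rather than -infinity
        p12 = joint / n
        p1 = word_count[w1] / n
        p2 = word_count[w2] / n
        return math.log2(p12 / (p1 * p2))

    return pmi


# Hypothetical toy comments, already tokenized.
comments = [
    ["screen", "clear"],
    ["screen", "clear"],
    ["battery", "weak"],
    ["screen", "weak"],
]
pmi = pmi_table(comments)

# Emotion matrix A: one row per emotion word, one column per evaluation object.
emotion_words = ["clear", "weak"]
objects = ["screen", "battery"]
A = [[pmi(e, m) for m in objects] for e in emotion_words]
print(A)
```

Matrix B of claim 4 would be built the same way, with `objects` restricted to the screened key evaluation objects.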
CN201810539447.4A 2018-05-30 2018-05-30 Automatic construction method of emotion dictionary for commodity comments Active CN108763214B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810539447.4A CN108763214B (en) 2018-05-30 2018-05-30 Automatic construction method of emotion dictionary for commodity comments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810539447.4A CN108763214B (en) 2018-05-30 2018-05-30 Automatic construction method of emotion dictionary for commodity comments

Publications (2)

Publication Number Publication Date
CN108763214A CN108763214A (en) 2018-11-06
CN108763214B true CN108763214B (en) 2021-09-24

Family

ID=64004195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810539447.4A Active CN108763214B (en) 2018-05-30 2018-05-30 Automatic construction method of emotion dictionary for commodity comments

Country Status (1)

Country Link
CN (1) CN108763214B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800418B (en) * 2018-12-17 2023-05-05 北京百度网讯科技有限公司 Text processing method, device and storage medium
CN109933793B (en) * 2019-03-15 2023-01-06 腾讯科技(深圳)有限公司 Text polarity identification method, device and equipment and readable storage medium
CN110147552B (en) * 2019-05-22 2022-12-06 南京邮电大学 Education resource quality evaluation mining method and system based on natural language processing
CN110413780B (en) * 2019-07-16 2022-02-22 合肥工业大学 Text emotion analysis method and electronic equipment
CN111080055A (en) * 2019-11-06 2020-04-28 邱素容 Hotel scoring method, hotel recommendation method, electronic device and storage medium
CN112818682B (en) * 2021-01-22 2023-01-03 深圳大学 E-commerce data analysis method, equipment, device and computer-readable storage medium
CN114254083A (en) * 2021-08-12 2022-03-29 北京好欣晴移动医疗科技有限公司 Medical special term unsupervised clustering method, device and system
CN116320626B (en) * 2023-05-11 2023-11-14 深圳市兴意腾科技电子有限公司 Method and system for calculating live broadcast heat of electronic commerce
CN117217218B (en) * 2023-11-08 2024-01-23 中国科学技术信息研究所 Emotion dictionary construction method and device for science and technology risk event related public opinion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20130092342A (en) * 2012-02-09 2013-08-20 고민수 System and method for creating emotional word dictionary and computing emotional degrees of documents
CN103646097A (en) * 2013-12-18 2014-03-19 北京理工大学 Constraint relationship based opinion objective and emotion word united clustering method
CN105117428A (en) * 2015-08-04 2015-12-02 电子科技大学 Web comment sentiment analysis method based on word alignment model
CN105718446A (en) * 2016-03-08 2016-06-29 徐勇 UGC fuzzy comprehensive evaluation method based on sentiment analysis
CN106407177A (en) * 2016-08-26 2017-02-15 西南大学 Emergency online group behavior detection method based on clustering analysis
CN107369066A (en) * 2017-06-28 2017-11-21 东软集团股份有限公司 A kind of feature between comment object compares method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
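Evaluation object extraction based on three-layer filtering; Niu Zhendong; Transactions of Beijing Institute of Technology; 20161115; Vol. 36 (No. 11); pp. 1154-1159 *
Research on product evaluation object extraction for e-commerce websites; Liu Sha; China Master's Theses Full-text Database, Information Science and Technology; 20160315 (No. 3); pp. I138-7661, p. 31 *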

Also Published As

Publication number Publication date
CN108763214A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108763214B (en) Automatic construction method of emotion dictionary for commodity comments
CN109740148B (en) Text emotion analysis method combining BiLSTM with Attention mechanism
Devika et al. Sentiment analysis: a comparative study on different approaches
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN107491531A (en) Chinese network comment sensibility classification method based on integrated study framework
CN108269125B (en) Comment information quality evaluation method and system and comment information processing method and system
CN103116637A (en) Text sentiment classification method facing Chinese Web comments
Atia et al. Increasing the accuracy of opinion mining in Arabic
Probierz et al. Rapid detection of fake news based on machine learning methods
Kwaik et al. An Arabic tweets sentiment analysis dataset (ATSAD) using distant supervision and self training
CN108804595B (en) Short text representation method based on word2vec
Suleiman et al. Comparative study of word embeddings models and their usage in Arabic language applications
Li et al. Accurate recommendation based on opinion mining
Reddy et al. Profile specific document weighted approach using a new term weighting measure for author profiling
KR101593371B1 (en) Propensity classification device for text data and Decision support systems using the same
CN115795030A (en) Text classification method and device, computer equipment and storage medium
Mozafari et al. Emotion detection by using similarity techniques
CN106681986A (en) Multi-dimensional sentiment analysis system
Najibullah Indonesian text summarization based on naïve bayes method
Ruposh et al. A computational approach of recognizing emotion from Bengali texts
López-Santillán et al. Custom document embeddings via the centroids method: Gender classification in an author profiling task
Gwad et al. Twitter sentiment analysis classification in the Arabic language using long short-term memory neural networks
Reddy et al. Author profile prediction using pivoted unique term normalization
Popova et al. Ranking in keyphrase extraction problem: is it suitable to use statistics of words occurrences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant