CN108763214B - Automatic construction method of emotion dictionary for commodity comments - Google Patents
- Publication number
- CN108763214B (application number CN201810539447.4A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- words
- matrix
- evaluation
- evaluation objects
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0282—Rating or review of business operators or products
Abstract
The invention discloses an automatic construction method of an emotion dictionary for commodity comments, comprising text preprocessing, semantic relation mining and emotion word clustering. Text preprocessing preprocesses the commodity comments and extracts the emotion words and evaluation objects contained in comments on a given type of commodity. Semantic relation mining mines the semantic relations between the emotion words and the evaluation objects and represents them in matrix form. Emotion word clustering performs unsupervised clustering of the emotion words according to their mutual distances in the emotion matrix space, so that the emotion words are divided into k classes. The invention constructs a domain emotion dictionary tailored to the characteristics of text in the commodity comment field. The dictionary can divide emotion words into multiple classes rather than only the two classes of commendatory and derogatory, and in the commodity comment field it offers advantages over existing general-purpose emotion dictionaries in tasks such as emotion classification.
Description
Technical Field
The invention relates to an automatic construction method of an emotion dictionary for the commodity comment field, and belongs to the technical field of computer information processing.
Background
With the growth of shopping websites, a large number of comments on all kinds of commodities have appeared online, and people can read them anytime and anywhere. Identifying the emotional tendency implied by these comments is important for both merchants and consumers, and a good emotion dictionary is the basis for analyzing the emotional tendency of text. Sentiment analysis of a text must take into account the field to which the text belongs. Existing emotion dictionaries are general-purpose, and none is dedicated to the commodity comment field, so applying them to commodity comment text is not appropriate. For this reason, methods for automatically constructing emotion dictionaries, especially domain-specific ones, are attracting increasing research attention.
Existing methods for constructing emotion dictionaries, for both Chinese and English, fall into two categories: corpus-based and knowledge-base-based. The most common corpus-based method selects seed words and determines the polarity of an emotion word of unknown polarity by calculating its relation to the seed words, namely the PMI value. For Chinese, the available general knowledge bases are very limited, so research on constructing Chinese emotion dictionaries from knowledge bases is rare. Moreover, when constructing an emotion dictionary for the commodity comment field, the evaluation object must be given particular consideration. The evaluation object is a feature of the commodity that the user evaluates; for a mobile phone, for example, it may be the screen or the battery.
On the other hand, existing emotion dictionaries usually contain only a list of emotion words divided into two categories, positive and negative. Some scholars instead classify emotions into six major classes: joy, sadness, fear, surprise, anger and jealousy. In short, existing approaches rely on human empirical knowledge to decide the classes into which emotion words are divided.
Because many emotion words show different emotional tendencies in different fields, it is important to accurately identify these emotion words and the evaluation objects in each field, especially in the commodity comment field. Fast et al. found it difficult to construct a domain emotion dictionary by relying on domain experts or on crowdsourcing services. Shi et al. extracted key information from domain text using association rule algorithms combined with supervised machine learning. Zhang et al. extracted the evaluation objects of a product using pointwise mutual information (PMI) and an association rule algorithm. Considering the ordering of evaluation objects, Qiu et al. proposed a double propagation algorithm based on calculating the relationship between emotion words and the product. Mishne selected evaluation objects using the part of speech and the frequency of words.
PMI is a common indicator of the degree of association between two words. Turney and Littman used PMI and LSA to calculate the degree of association between two words; calculating the relationship between a word and a seed word by PMI in this way is generally called SO-PMI. Islam and Inkpen improved PMI and proposed SOC-PMI. Emotion classification is a basic task of emotion analysis, and the quality of the classification result directly reflects the performance of the emotion dictionary used. Pang treated emotion classification as a text classification task and tested three classifiers: naive Bayes, support vector machines and maximum entropy. Li and Hao expanded the set of evaluation objects using spectral clustering. Yang et al. used word2vec to calculate the cosine similarity between a word and the seed words.
Most existing construction methods produce general-purpose emotion dictionaries, which are not suitable for analyzing text in a specific field such as commodity comment text. It is therefore important to construct an emotion dictionary suited to the specific field.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the defects in the prior art and provides an emotion dictionary construction method aiming at the field of commodity comments.
The technical scheme is as follows: an automatic construction method of an emotion dictionary for commodity comments sequentially comprises the following steps:
(1) Preprocess the original commodity comment text. Measures such as Chinese word segmentation and stop-word filtering are typically applied, and the emotion words and evaluation objects contained in the text of the specified field are determined. They are determined by part of speech: nouns in the comment text are selected as evaluation objects, and adjectives, adverbs and verbs are selected as emotion words.
(2) Mine the relationship between the emotion words and the evaluation objects obtained in step (1), and generate an emotion matrix representing this relationship.
(3) Screen the evaluation objects obtained in step (1), retaining a subset of key evaluation objects.
(4) As in step (2), consider the relationship between the emotion words and the key evaluation objects, and generate an emotion matrix representing it.
(5) Mine the correlation between the key evaluation objects screened in step (3) and the original evaluation objects obtained in step (1), and generate a correlation matrix representing it.
(6) Using the two emotion matrices obtained in steps (2) and (4) and the correlation matrix obtained in step (5), generate a new emotion matrix representing the relationship between the emotion words and the key evaluation objects.
(7) Cluster the emotion words according to the distances between them in the emotion matrix of step (6), dividing them into several classes to obtain a domain emotion dictionary.
(8) Apply the emotion dictionary to an emotion classification task, determine the optimal k value for each field by methods such as cross-validation, and divide the emotion words into k classes.
Further, in step (2), the relationship between an emotion word and an evaluation object directly reflects the degree to which the emotion word modifies the evaluation object:
(2.1) The relationship between emotion words and evaluation objects is quantified by their co-occurrence, and PMI is used to calculate it.
(2.2) A matrix (the emotion matrix) represents the relationship: each row corresponds to an emotion word, each column to an evaluation object, and each cell holds the PMI value of the corresponding emotion word and evaluation object.
Further, in step (3), the tf-idf idea is used to screen the evaluation objects, as follows:
(3.1) Comments on the same type of product are merged into one document, and the tf-idf value of each word is calculated from its number of occurrences in the different documents (that is, the different product comment sets) and its inverse document frequency.
(3.2) The tf-idf values of all words are calculated, the tf-idf values of the evaluation objects are sorted, and a threshold is set; the evaluation objects that reach the threshold are taken as the final evaluation objects.
Further, the emotion matrix in step (4) is constructed in the same way as in step (2). The only difference is that it contains the relationship between the emotion words and the evaluation objects retained after screening, rather than all evaluation objects.
Further, in step (5), the relationships between all evaluation objects and the screened evaluation objects are mined, and a correlation matrix between the key evaluation objects and all evaluation objects is generated, as follows:
(5.1) The key evaluation objects are the evaluation objects obtained by screening in step (3); all evaluation objects are all nouns contained in the initial product comments.
(5.2) The degree of correlation between a key evaluation object and any evaluation object can be understood as a synonymy relationship within the corpus and is represented by a value in [0,1], where 0 means no relation and 1 means the highest degree of correlation. Values closer to 1 indicate a higher degree of correlation.
Further, in step (6), the final emotion matrix is generated from the two emotion matrices and the correlation matrix constructed above, using EPMI, an improved PMI algorithm proposed here:
(6.1) EPMI algorithm. In element-wise form, consistent with formula (5):
$$\mathrm{EPMI}(e_i, m_j) = \mathrm{PMI}(e_i, m_j) + \sum_{k=1}^{p} u_{jk}\,\mathrm{PMI}(e_i, m_k)$$
That is, when computing the relationship between the emotion word $e_i$ and the evaluation object $m_j$, one must consider, in addition to their direct relationship, the evaluation objects $m_k$ that are related to $m_j$; the degree of that relation is denoted $u_{jk}$ in the formula.
(6.2) The new EPMI algorithm is used to mine the relationship between emotion words and evaluation objects and to construct a new emotion matrix. The new emotion matrix is obtained directly by applying EPMI to the two emotion matrices and the correlation matrix obtained previously.
Further, in step (7), the emotion words are clustered. In the emotion matrix each emotion word is represented as a vector, so the emotion words can be clustered without supervision according to the distances between these vectors.
Further, in step (8), because unsupervised clustering is used, the final number of clusters k is not fixed in advance. The optimal k differs across product comment sets and text analysis tasks, and a k value with stable performance can be selected by cross-validation.
Advantageous effects: compared with the prior art, the method provided by the invention automatically constructs an emotion dictionary for the commodity comment field, and within that field the domain emotion dictionary performs better than a general-purpose emotion dictionary. The invention builds the emotion dictionary from a corpus of comments, and the resulting dictionary differs from traditional emotion dictionaries in that it divides emotion words into a variable number k of classes rather than fixed classes such as commendatory and derogatory. This higher-dimensional domain emotion dictionary performs better in tasks such as emotion classification.
Drawings
FIG. 1 is a flow diagram of a text pre-processing module;
FIG. 2 is a schematic diagram of semantic relationship mining;
FIG. 3 is a flow chart of k value selection.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
First, in order to facilitate understanding of the present invention, the following description is made:
1. Text preprocessing
The text preprocessing module preprocesses the original comment text, as shown in fig. 1. Chinese text requires word segmentation, for which many open-source tools are available, such as the jieba segmenter and the IK segmenter. To achieve the best segmentation results, a user-defined dictionary often has to be added so that rare domain words are recognized. English words are separated by spaces, so no segmentation step is needed, but English involves inflection such as tense, so words need to be lemmatized and stemmed. Many open-source tools also support stemming of English vocabulary, such as NLTK, a widely used natural language processing toolkit.
Beyond basic preprocessing of the comment text, the most important output of this module is a preliminary set of emotion words and evaluation objects, which are extracted by part-of-speech analysis. Emotion words express a user's strongly subjective attitude toward something, and adjectives, adverbs and verbs are selected as emotion words. The evaluation object, as the name implies, is the object modified by the emotion words; it may be a product or a feature of the product. Nouns are selected as evaluation objects.
Whether the text is English or Chinese, many words carry no meaning of their own, such as "yes" and "do" in Chinese and "is" and "the" in English. Stop-word processing filters these words out, reducing the negative influence of meaningless high-frequency words on the text analysis.
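The part-of-speech rule described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the text has already been segmented and tagged by some external tool, and the coarse tag names (`adj`, `adv`, `verb`, `noun`) and the function name `split_emotion_and_objects` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical coarse part-of-speech labels; a real tagger uses its own tag set.
EMOTION_POS = {"adj", "adv", "verb"}
OBJECT_POS = {"noun"}


def split_emotion_and_objects(
    tagged_tokens: Iterable[Tuple[str, str]],
    stop_words: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (emotion_words, evaluation_objects) from (word, pos) pairs."""
    emotion_words: List[str] = []
    evaluation_objects: List[str] = []
    for word, pos in tagged_tokens:
        word = word.lower()
        if word in stop_words:
            continue  # drop meaningless high-frequency words
        if pos in EMOTION_POS and word not in emotion_words:
            emotion_words.append(word)
        elif pos in OBJECT_POS and word not in evaluation_objects:
            evaluation_objects.append(word)
    return emotion_words, evaluation_objects


review = [("The", "det"), ("screen", "noun"), ("is", "verb"),
          ("very", "adv"), ("bright", "adj")]
emotions, objects = split_emotion_and_objects(review, {"the", "is"})
```

Stop words are removed before the part-of-speech test, so a verb such as "is" never reaches the emotion word list.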
2. Semantic relationship mining
The semantic relation mining module is the core module of the invention: only by fully mining the relationship between emotion words and evaluation objects can a good foundation be laid for the subsequent emotion word clustering. The final emotion matrix is obtained in three stages, as shown in fig. 2.
The relationship between an emotion word and an evaluation object is one of modifier and modified. In the comment text this relationship shows up as co-occurrence, and the more frequently the two co-occur, the closer their relationship is considered to be. Since PMI measures the relationship between two words through their co-occurrence, PMI is used first to calculate the initial emotion matrix.
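A minimal sketch of building the initial emotion matrix from co-occurrence follows. It makes several assumptions that the text does not fix: each review is treated as one co-occurrence window, probabilities are estimated by dividing window counts by the number of windows, the logarithm is base 2, and a pair that never co-occurs is given the value 0. The function name `pmi_matrix` is hypothetical.

```python
import math
from typing import List, Sequence


def pmi_matrix(
    reviews: Sequence[Sequence[str]],
    emotion_words: Sequence[str],
    evaluation_objects: Sequence[str],
) -> List[List[float]]:
    """Rows are emotion words, columns are evaluation objects, cells are PMI."""
    n_windows = len(reviews)
    windows = [set(tokens) for tokens in reviews]

    def count(*words: str) -> int:
        # Number of windows that contain every given word.
        return sum(1 for w in windows if all(x in w for x in words))

    matrix: List[List[float]] = []
    for e in emotion_words:
        row = []
        for m in evaluation_objects:
            joint = count(e, m)
            if joint == 0:
                row.append(0.0)  # assumption: no co-occurrence maps to 0
            else:
                # log2( p(e, m) / (p(e) * p(m)) ) with p = count / n_windows
                row.append(math.log2(joint * n_windows / (count(e) * count(m))))
        matrix.append(row)
    return matrix


reviews = [["screen", "bright"], ["battery", "weak"],
           ["screen", "bright"], ["battery", "bright"]]
A = pmi_matrix(reviews, ["bright", "weak"], ["screen", "battery"])
```

In this toy corpus "bright" co-occurs with "screen" more often than chance and with "battery" less often, so the first row has one positive and one negative entry.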
At this point the dimensionality becomes a problem. The number of evaluation objects in the comment text is often far greater than the number of emotion words, so each emotion word is represented in the emotion matrix by a very high-dimensional vector, which greatly harms both the accuracy and the performance of clustering by vector distance. The evaluation objects are therefore screened using the idea of tf-idf. An evaluation object is a product, or a feature of a product, that the user comments on, and these features are domain-related. In mobile phone reviews, for example, users care about features such as "mobile phone", "screen" and "battery", while in hotel reviews they care about features such as "hotel", "toilet" and "air conditioner". The mobile phone features appear in a large number of mobile phone reviews but in few hotel reviews, and the hotel features appear in a large number of hotel reviews but not in mobile phone reviews. The tf-idf idea can therefore be used to screen out the real evaluation objects.
tf-idf is defined over documents, whereas the objects processed here are individual comments. Comments on the same type of product are therefore merged into one document, so that each product type has one document and the number of documents equals the number of product types collected. From these documents the tf-idf value of every word in the comment text can be calculated, and nouns with high tf-idf values are usually the evaluation objects of interest.
A threshold alpha is set, and only nouns whose tf-idf value reaches the threshold are kept as evaluation objects; these are called key evaluation objects. As with the original emotion matrix, the relationship between emotion words and key evaluation objects is then considered, that is, their PMI values are calculated, and a new emotion matrix is constructed. Each emotion word is again represented as a vector in this matrix, and because the evaluation objects have been screened by tf-idf, the dimension of the vector is far smaller than in the original emotion matrix.
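The screening step can be sketched as below. The exact tf-idf variant is not given in the text, so this sketch assumes relative term frequency multiplied by the natural logarithm of (number of documents / number of documents containing the word); the function name `key_evaluation_objects` and the toy threshold are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def key_evaluation_objects(
    category_docs: Dict[str, Sequence[str]],
    target: str,
    nouns: Sequence[str],
    alpha: float,
) -> List[str]:
    """Keep nouns whose tf-idf in the target category document reaches alpha."""
    n_docs = len(category_docs)
    tokens = category_docs[target]
    tf = Counter(tokens)
    kept = []
    for noun in nouns:
        # Document frequency: how many category documents mention the noun.
        doc_freq = sum(1 for doc in category_docs.values() if noun in doc)
        if doc_freq == 0 or not tokens:
            continue
        score = (tf[noun] / len(tokens)) * math.log(n_docs / doc_freq)
        if score >= alpha:
            kept.append(noun)
    return kept


# One merged document per product type, as described above.
docs = {
    "phone": ["screen", "battery", "screen", "price", "good"],
    "hotel": ["toilet", "air", "price", "good", "toilet"],
}
keys = key_evaluation_objects(docs, "phone", ["screen", "battery", "price"], 0.1)
```

"price" occurs in both documents, so its inverse document frequency is zero and it is screened out, while the phone-specific nouns are kept.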
The PMI value considers only the co-occurrence between an emotion word and an evaluation object, which loses some semantics. When calculating the relationship between an emotion word and an evaluation object, both their direct relationship and the relationship between the emotion word and the other evaluation objects related to that evaluation object should be considered. A degree of correlation between each key evaluation object and every evaluation object is therefore calculated, giving a correlation matrix. The correlation is computed as follows: for each key evaluation object $m'_i$ in $M'$ and each evaluation object $m_j$ in $M$, count the reviews $d$ in $D$ in which both appear, then apply normal() to the resulting counts.
Here $M$ is the set of all product features, $M'$ is the set of screened key evaluation objects, $D$ is the set of reviews, and normal() is a simple normalization function that maps the value of each dimension of a vector into the interval [0,1]. In addition, $(m'_i, m_j)$ in $d$ means that the key evaluation object $m'_i$ and the evaluation object $m_j$ appear in the same review $d$; the more often two evaluation objects appear in the same review, the higher their degree of correlation.
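The counting and normalization described above might look like the following sketch. The text only calls normal() "a simple normalization function", so min-max scaling of each row is an assumption here, as is the function name `correlation_matrix`.

```python
from typing import List, Sequence


def correlation_matrix(
    reviews: Sequence[Sequence[str]],
    key_objects: Sequence[str],
    all_objects: Sequence[str],
) -> List[List[float]]:
    """Rows: key evaluation objects; columns: all evaluation objects."""
    review_sets = [set(tokens) for tokens in reviews]
    matrix = []
    for key in key_objects:
        # Number of reviews in which the key object and each object both appear.
        counts = [
            sum(1 for d in review_sets if key in d and obj in d)
            for obj in all_objects
        ]
        lo, hi = min(counts), max(counts)
        span = hi - lo
        # Assumption: normal() is min-max scaling of each row into [0, 1].
        matrix.append([(c - lo) / span if span else 0.0 for c in counts])
    return matrix


reviews = [["screen", "display", "bright"], ["screen", "display"],
           ["screen", "battery"], ["battery", "charger"]]
C = correlation_matrix(reviews, ["screen"],
                       ["screen", "display", "battery", "charger"])
```

With these reviews "display" co-occurs with "screen" more often than "battery" does, so it receives the higher correlation value.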
This calculation yields the correlation matrix. The final emotion matrix is then obtained by the EPMI algorithm, that is, formula (5), from the two emotion matrices and the correlation matrix obtained above.
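Formula (5) combines the three matrices as D = B + A * C^T. A minimal pure-Python sketch is given below, assuming A is n x p, B is n x t and C is t x p as the description states; the function name `epmi_matrix` and the toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def epmi_matrix(A: Matrix, B: Matrix, C: Matrix) -> Matrix:
    """Return D = B + A * C^T for A (n x p), B (n x t), C (t x p)."""
    n, t, p = len(B), len(C), len(C[0])
    assert len(A) == n and all(len(row) == p for row in A)
    assert all(len(row) == t for row in B)
    return [
        # D[i][j] = B[i][j] + sum over k of A[i][k] * C[j][k]
        [B[i][j] + sum(A[i][k] * C[j][k] for k in range(p)) for j in range(t)]
        for i in range(n)
    ]


A = [[1.0, 0.5, 0.0], [0.0, 2.0, 1.0]]   # 2 emotion words x 3 evaluation objects
B = [[1.0], [0.0]]                        # 2 emotion words x 1 key evaluation object
C = [[1.0, 0.5, 0.0]]                     # 1 key evaluation object x 3 evaluation objects
D = epmi_matrix(A, B, C)
```

Each cell of D adds, to the direct PMI in B, the PMI values of the emotion word with every evaluation object, weighted by how strongly that object is correlated with the key evaluation object.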
3. Emotional word clustering
Once the final emotion matrix is obtained, the emotion words can be clustered. Each emotion word is represented as a vector in the matrix, and clustering is performed according to the distances between the vectors. Because unsupervised clustering is used, the number of clusters is not known in advance; a suitable k is selected by cross-validation, taking the emotion classification task as an example. The selection process is shown in fig. 3.
Divide the comment data set into m parts; select m-1 parts as the training set and take the remaining part as the test set. Compute the classification accuracy of each k value on the test set, and repeat with different training and test sets for m rounds in total, so that each k value yields m accuracy figures. The k value with the highest average accuracy is selected as the final k.
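The m-fold selection of k can be sketched as follows. The `evaluate` callback stands in for "build a k-class dictionary on the training folds and measure classification accuracy on the test fold"; it is hypothetical, and the toy version below only exists to make the sketch runnable.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def select_k(
    data: Sequence[T],
    m: int,
    candidate_ks: Sequence[int],
    evaluate: Callable[[int, List[T], List[T]], float],
) -> int:
    """Return the k with the highest mean accuracy over m folds."""
    folds = [list(data[i::m]) for i in range(m)]
    mean_accuracy: Dict[int, float] = {}
    for k in candidate_ks:
        scores = []
        for i in range(m):
            test = folds[i]  # one part held out for testing
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            scores.append(evaluate(k, train, test))
        mean_accuracy[k] = sum(scores) / m
    return max(candidate_ks, key=lambda k: mean_accuracy[k])


# Toy stand-in for a real train-and-classify routine (hypothetical).
def toy_evaluate(k: int, train: List[int], test: List[int]) -> float:
    return 1.0 - abs(k - 3) * 0.1


best_k = select_k(list(range(10)), m=5, candidate_ks=[2, 3, 4, 5],
                  evaluate=toy_evaluate)
```

Passing the evaluation routine as a parameter keeps the fold logic independent of the classifier and of the clustering method used to build the dictionary.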
The automatic construction method of the emotion dictionary for commodity comments comprises the following steps in sequence:
(1) Preprocess the original commodity comment text. Measures such as Chinese word segmentation and stop-word filtering are typically applied, and the emotion words and evaluation objects contained in the text of the specified field are determined.
(2) Mine the relationship between the emotion words and the evaluation objects obtained in step (1), and generate an emotion matrix representing this relationship.
The relationship between an emotion word and an evaluation object directly reflects the degree to which the emotion word modifies the evaluation object:
(2.1) The relationship between emotion words and evaluation objects is quantified by their co-occurrence, and PMI is used to calculate it.
The relationship between an emotion word and an evaluation object is calculated using pointwise mutual information (PMI), defined as follows:
$$\mathrm{PMI}(word_1, word_2) = \log \frac{p(word_1, word_2)}{p(word_1)\,p(word_2)}$$
$$p(word_1, word_2) = \frac{\mathrm{count}(word_1, word_2)}{N}, \qquad p(word) = \frac{\mathrm{count}(word)}{N}$$
where $p(word_1, word_2)$ is the probability that $word_1$ and $word_2$ co-occur in the same window in the commodity comment text, $N$ is the number of different words contained in the commodity comments under consideration, $\mathrm{count}(word_1, word_2)$ is the number of times $word_1$ and $word_2$ co-occur in the same window in the commodity comments, and $\mathrm{count}(word)$ is the number of times the word appears in the commodity comment text.
(2.2) A matrix (the emotion matrix) represents the relationship: each row corresponds to an emotion word, each column to an evaluation object, and each cell holds the PMI value of the corresponding emotion word and evaluation object.
The emotion matrix between the emotion words and the evaluation objects is defined as the matrix A:
$$A = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1p} \\ w_{21} & w_{22} & \cdots & w_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{np} \end{pmatrix}$$
The emotion matrix A has n rows and p columns. The n rows represent the n emotion words $e_1, \ldots, e_n$, and the p columns represent the p evaluation objects $m_1, \ldots, m_p$, where p is much larger than n. The entry $w_{ij}$ is the PMI value between emotion word $e_i$ and evaluation object $m_j$, that is, $w_{ij} = \mathrm{PMI}(e_i, m_j)$.
(3) Screen the evaluation objects obtained in step (1), retaining a subset of key evaluation objects.
The tf-idf idea is used to screen the evaluation objects, as follows:
(3.1) Comments on the same type of product are merged into one document, and the tf-idf value of each word is calculated from its number of occurrences in the different documents (that is, the different product comment sets) and its inverse document frequency.
(3.2) The tf-idf values of all words are calculated, the tf-idf values of the evaluation objects are sorted, and a threshold is set; the t evaluation objects that reach the threshold are taken as the final evaluation objects.
(4) As in step (2), consider the relationship between the emotion words and the key evaluation objects, and generate an emotion matrix representing it. The only difference from step (2) is that the emotion matrix in step (4) contains the relationship between the emotion words and the evaluation objects retained after screening, rather than all evaluation objects.
The resulting emotion matrix B has n rows and t columns. The n rows again represent the n emotion words, and the t columns represent the t key evaluation objects.
(5) Mine the correlation between the key evaluation objects screened in step (3) and the original evaluation objects obtained in step (1), and generate a correlation matrix representing it.
In step (5), the relationships between all evaluation objects and the screened evaluation objects are mined, and a correlation matrix between the key evaluation objects and all evaluation objects is generated, as follows:
(5.1) The key evaluation objects are the evaluation objects obtained by screening in step (3); all evaluation objects are all nouns contained in the initial product comments.
(5.2) The degree of correlation between a key evaluation object and any evaluation object can be understood as a synonymy relationship within the corpus and is represented by a value in [0,1], where 0 means no relation and 1 means the highest degree of correlation.
The constructed correlation matrix C is shown below:
$$C = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1p} \\ u_{21} & u_{22} & \cdots & u_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{t1} & u_{t2} & \cdots & u_{tp} \end{pmatrix}$$
The correlation matrix C has t rows and p columns. The t rows represent the t screened key evaluation objects, and the p columns represent the p evaluation objects in total. The entry $u_{ij}$ is the degree of correlation between key evaluation object $m'_i$ and evaluation object $m_j$, which can be obtained by counting the number of comments in the product comment text in which both appear.
(6) Using the two emotion matrices obtained in steps (2) and (4) and the correlation matrix obtained in step (5), generate a new emotion matrix representing the relationship between the emotion words and the key evaluation objects.
The final emotion matrix is generated by the EPMI algorithm from the two emotion matrices and the correlation matrix constructed above. EPMI algorithm, in element-wise form consistent with formula (5):
$$\mathrm{EPMI}(e_i, m_j) = \mathrm{PMI}(e_i, m_j) + \sum_{k=1}^{p} u_{jk}\,\mathrm{PMI}(e_i, m_k)$$
That is, when computing the relationship between the emotion word $e_i$ and the evaluation object $m_j$, one must consider, in addition to their direct relationship, the evaluation objects $m_k$ that are related to $m_j$; the degree of that relation is denoted $u_{jk}$ in the formula.
The emotion matrix D is calculated as follows:
$$D_{n \times t} = B_{n \times t} + A_{n \times p}\,\left(C_{t \times p}\right)^{T} \qquad (5)$$
Like the matrices A and B, the matrix D represents the relationship between emotion words and evaluation objects. The difference is that A and B are calculated with the traditional PMI algorithm, whereas D is calculated with EPMI, the improved PMI algorithm.
(7) Cluster the emotion words according to the distances between them in the emotion matrix of step (6), dividing them into several classes to obtain a domain emotion dictionary.
In the emotion matrix D, each row represents an emotion word as a vector. According to the distances between these vectors in the matrix space, the emotion words can be grouped into several classes by a clustering method such as k-means. The result is a domain emotion dictionary that divides the emotion words into classes.
(8) Apply the emotion dictionary to an emotion classification task, determine the optimal k value for each field by methods such as cross-validation, and divide the emotion words into k classes.
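Step (7) names k-means as one possible clustering method. The sketch below is a plain k-means over the rows of the final emotion matrix, written without external libraries. Seeding the centroids with the first k rows and using squared Euclidean distance are assumptions made to keep the example deterministic, and the names `kmeans` and `build_dictionary` are hypothetical.

```python
from typing import Dict, List, Sequence


def kmeans(rows: Sequence[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Cluster row vectors with plain k-means; returns a cluster id per row."""
    # Assumption: the first k rows seed the centroids, for determinism.
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centroids[c])))
            for r in rows
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def build_dictionary(words: Sequence[str], D: Sequence[Sequence[float]],
                     k: int) -> Dict[str, int]:
    """Map each emotion word to the cluster id of its row in D."""
    return dict(zip(words, kmeans(D, k)))


words = ["bright", "clear", "weak", "slow"]
D = [[2.0, 0.1], [1.8, 0.0], [0.1, 1.9], [0.0, 2.1]]
lexicon = build_dictionary(words, D, k=2)
```

The resulting mapping from emotion word to cluster id is the domain emotion dictionary; in practice k would be chosen by the cross-validation procedure of step (8).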
Claims (7)
1. An automatic construction method of an emotion dictionary for commodity comments is characterized by sequentially comprising the following steps:
(1) preprocessing an original commodity comment text, and determining emotion words and evaluation objects contained in a specified field text;
(2) mining the relation between the emotion words obtained in the step (1) and the evaluation object, and generating an emotion matrix representing the relation;
(3) screening the evaluation objects obtained in the step (1) and leaving key evaluation objects;
(4) considering the relation between the emotional words and the key evaluation objects, and generating an emotional matrix representing the relation between the emotional words and the key evaluation objects;
(5) mining the correlation between the key evaluation object screened in the step (3) and the original evaluation object obtained in the step (1), and generating a correlation matrix representing the correlation between the key evaluation object and the original evaluation object;
(6) obtaining a correlation matrix by using the two emotion matrixes obtained in the steps (2) and (4) and the correlation matrix obtained in the step (5), and generating a new emotion matrix for representing the relationship between the emotion words and the key evaluation object;
(7) clustering the emotion words according to the distance between the emotion words in the emotion matrix in the step (6), and dividing the emotion words into several types to obtain a domain emotion dictionary;
(8) applying the emotion dictionary to an emotion classification task, determining an optimal value of k for each field by a method such as cross-validation, and dividing the emotion words into k classes;
in the step (2), the relationship between the emotion words and the evaluation objects directly reflects the degree to which the emotion words modify the evaluation objects:
(2.1) quantifying the relationship between the emotion words and the evaluation objects by their co-occurrence, wherein pointwise mutual information (PMI) is used to calculate the relationship;
the PMI calculation formula is defined as follows:
$$PMI(word_1, word_2) = \log \frac{p(word_1, word_2)}{p(word_1)\, p(word_2)}$$
$$p(word_1, word_2) = \frac{count(word_1, word_2)}{N}$$
$$p(word) = \frac{count(word)}{N}$$
wherein $p(word_1, word_2)$ is the probability that $word_1$ and $word_2$ co-occur in the same window in the commodity comment text; $N$ is the number of different words contained in the commodity comments under consideration; $count(word_1, word_2)$ is the number of times $word_1$ and $word_2$ co-occur in the same window in the commodity comments; and $count(word)$ is the number of times the word appears in the commodity comment text;
(2.2) a matrix (the emotion matrix) is used to represent the relationship, wherein each row of the matrix corresponds to an emotion word, each column corresponds to an evaluation object, and each element is the PMI value of the corresponding emotion word and evaluation object;
the emotion matrix between the emotion words and the evaluation objects is defined as a matrix A as follows:
$$A = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1p} \\ w_{21} & w_{22} & \cdots & w_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{np} \end{bmatrix}$$
the emotion matrix A has n rows and p columns, wherein the n rows represent n emotion words $e_1$ to $e_n$, the p columns represent p evaluation objects $m_1$ to $m_p$, and $w_{ij}$ is the PMI value between emotion word $e_i$ and evaluation object $m_j$, i.e. $w_{ij} = PMI(e_i, m_j)$.
2. The method as claimed in claim 1, wherein in the step (1), the emotion words and the evaluation objects are determined according to the parts of speech of the words: nouns contained in the comment text are selected as the evaluation objects, and adjectives, adverbs and verbs in the comment text are used as the emotion words.
3. The method for automatically constructing an emotion dictionary for commodity comments as claimed in claim 1, wherein in the step (3), the tf-idf measure is used for screening the evaluation objects, specifically as follows:
(3.1) combining the comments on the same type of product into one document, and calculating the tf-idf value of each word from its term frequency in the different documents and its inverse document frequency;
(3.2) after the tf-idf values of all words are calculated, sorting the tf-idf values of the evaluation objects, setting a threshold, and taking the t evaluation objects that reach the threshold as the final evaluation objects.
4. The automatic construction method of an emotion dictionary for commodity comments according to claim 1, wherein in the step (4), the emotion matrix is constructed in the same way as in the step (2), the only difference being that the emotion matrix of the step (4) contains the relationship between the emotion words and the evaluation objects remaining after the screening, rather than the relationship between the emotion words and all of the evaluation objects;
the constructed emotion matrix B has n rows and t columns, wherein the n rows represent n emotion words and the t columns represent t key evaluation objects.
5. The method for automatically constructing an emotion dictionary for commodity comments as set forth in claim 1, wherein in the step (5), the relationship between all the evaluation objects and the screened evaluation objects is mined, and a correlation matrix between the key evaluation objects and all the evaluation objects is generated, wherein the correlation matrix C is as follows:
$$C = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1p} \\ u_{21} & u_{22} & \cdots & u_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ u_{t1} & u_{t2} & \cdots & u_{tp} \end{bmatrix}$$
the correlation matrix C has t rows and p columns, wherein the t rows represent the t screened key evaluation objects, the p columns represent the p original evaluation objects, and $u_{ij}$ denotes the correlation between evaluation object $m_i$ and evaluation object $m_j$, the correlation being measured by the number of times $m_i$ and $m_j$ appear together in the same comment in the commodity comment text.
6. The automatic construction method of an emotion dictionary for commodity comments as set forth in claim 1, wherein in the step (6), a final emotion matrix is generated by the EPMI algorithm from the two emotion matrices and the correlation matrix constructed previously;
the EPMI algorithm:
$$EPMI(e_i, m_j) = PMI(e_i, m_j) + \sum_{k=1}^{p} u_{jk}\, PMI(e_i, m_k) \qquad (4)$$
when calculating the relationship between emotion word $e_i$ and evaluation object $m_j$, not only the direct relationship between the two is considered, but also the evaluation objects related to $m_j$, the degree of that relation being represented in the formula by $u_{jk}$;
the new emotion matrix D is calculated as follows:
$$D_{n \times t} = B_{n \times t} + A_{n \times p} \left( C_{t \times p} \right)^{T} \qquad (5)$$
wherein $D_{n \times t}$ and $B_{n \times t}$ are both emotion matrices representing the correlation between the n emotion words and the t key evaluation objects, $A_{n \times p}$ is the emotion matrix representing the correlation between the n emotion words and all p evaluation objects, $C^{T}$ is the transpose of the correlation matrix $C_{t \times p}$ between the key evaluation objects and all the evaluation objects, and formula (5) is the matrix form of formula (4).
7. The method as claimed in claim 1, wherein in the step (7), each row of the emotion matrix D represents one emotion word as a vector, and the emotion words are clustered into several classes by the k-means clustering method according to the distances between the vectors in the matrix space, thereby obtaining a domain emotion dictionary which divides the emotion words into several classes.
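The matrix construction recited in claims 1 and 6 can be sketched as follows. This is an illustrative reading only, not the patented implementation: the helper names `pmi` and `epmi_matrix` are hypothetical, the natural logarithm is an assumed base, returning 0 for pairs that never co-occur is an assumed convention, and the toy matrices are made up.

```python
import math


def pmi(count_xy, count_x, count_y, n_total):
    """PMI from raw counts, with p(x, y) = count_xy / N and p(x) = count_x / N."""
    if count_xy == 0:
        return 0.0  # assumption: pairs that never co-occur get a PMI of 0
    # log( p(x, y) / (p(x) p(y)) ) simplifies to log( count_xy * N / (count_x * count_y) ).
    return math.log((count_xy * n_total) / (count_x * count_y))


def epmi_matrix(A, B, C):
    """Return D = B + A * C^T, where A is n x p, B is n x t and C is t x p."""
    n, t = len(B), len(C)
    return [
        [B[i][j] + sum(a * c for a, c in zip(A[i], C[j])) for j in range(t)]
        for i in range(n)
    ]


# Toy data: n = 2 emotion words, p = 3 evaluation objects, t = 2 key objects (m1, m3).
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]  # PMI of each emotion word with m1..m3
B = [[1.0, 0.0], [0.0, 3.0]]            # columns of A for the key objects m1 and m3
C = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]]  # co-occurrence counts: key objects x all objects
D_new = epmi_matrix(A, B, C)
print(D_new)
```

In this toy case both key objects co-occur with m2, so the PMI values that each emotion word has with m2 are added, weighted by the counts in C, to the corresponding entries of B.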
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810539447.4A CN108763214B (en) | 2018-05-30 | 2018-05-30 | Automatic construction method of emotion dictionary for commodity comments |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108763214A (en) | 2018-11-06 |
CN108763214B (en) | 2021-09-24 |
Family
ID=64004195
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810539447.4A Active CN108763214B (en) | 2018-05-30 | 2018-05-30 | Automatic construction method of emotion dictionary for commodity comments |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108763214B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800418B (en) * | 2018-12-17 | 2023-05-05 | 北京百度网讯科技有限公司 | Text processing method, device and storage medium |
CN109933793B (en) * | 2019-03-15 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Text polarity identification method, device and equipment and readable storage medium |
CN110147552B (en) * | 2019-05-22 | 2022-12-06 | 南京邮电大学 | Education resource quality evaluation mining method and system based on natural language processing |
CN110413780B (en) * | 2019-07-16 | 2022-02-22 | 合肥工业大学 | Text emotion analysis method and electronic equipment |
CN111080055A (en) * | 2019-11-06 | 2020-04-28 | 邱素容 | Hotel scoring method, hotel recommendation method, electronic device and storage medium |
CN112818682B (en) * | 2021-01-22 | 2023-01-03 | 深圳大学 | E-commerce data analysis method, equipment, device and computer-readable storage medium |
CN114254083A (en) * | 2021-08-12 | 2022-03-29 | 北京好欣晴移动医疗科技有限公司 | Medical special term unsupervised clustering method, device and system |
CN116320626B (en) * | 2023-05-11 | 2023-11-14 | 深圳市兴意腾科技电子有限公司 | Method and system for calculating live broadcast heat of electronic commerce |
CN117217218B (en) * | 2023-11-08 | 2024-01-23 | 中国科学技术信息研究所 | Emotion dictionary construction method and device for science and technology risk event related public opinion |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130092342A (en) * | 2012-02-09 | 2013-08-20 | 고민수 | System and method for creating emotional word dictionary and computing emotional degrees of documents |
CN103646097A (en) * | 2013-12-18 | 2014-03-19 | 北京理工大学 | Constraint relationship based opinion objective and emotion word united clustering method |
CN105117428A (en) * | 2015-08-04 | 2015-12-02 | 电子科技大学 | Web comment sentiment analysis method based on word alignment model |
CN105718446A (en) * | 2016-03-08 | 2016-06-29 | 徐勇 | UGC fuzzy comprehensive evaluation method based on sentiment analysis |
CN106407177A (en) * | 2016-08-26 | 2017-02-15 | 西南大学 | Emergency online group behavior detection method based on clustering analysis |
CN107369066A (en) * | 2017-06-28 | 2017-11-21 | 东软集团股份有限公司 | A kind of feature between comment object compares method and device |
Non-Patent Citations (2)
Title |
---|
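Evaluation object extraction based on three-layer filtering; 牛振东 (Niu Zhendong); Transactions of Beijing Institute of Technology (《北京理工大学学报》); 2016-11-15; Vol. 36 (No. 11); pp. 1154-1159 * |
Research on product evaluation object extraction for e-commerce websites; 刘沙 (Liu Sha); China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》); 2016-03-15 (No. 3); pp. I138-7661, p. 31 * |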
Also Published As
Publication number | Publication date |
---|---|
CN108763214A (en) | 2018-11-06 |
Similar Documents
Publication | Title |
---|---|
CN108763214B (en) | Automatic construction method of emotion dictionary for commodity comments |
CN109740148B (en) | Text emotion analysis method combining BiLSTM with Attention mechanism |
Devika et al. | Sentiment analysis: a comparative study on different approaches |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer |
CN111125349A (en) | Graph model text abstract generation method based on word frequency and semantics |
CN107491531A (en) | Chinese network comment sensibility classification method based on integrated study framework |
CN108269125B (en) | Comment information quality evaluation method and system and comment information processing method and system |
CN103116637A (en) | Text sentiment classification method facing Chinese Web comments |
Atia et al. | Increasing the accuracy of opinion mining in Arabic |
Probierz et al. | Rapid detection of fake news based on machine learning methods |
Kwaik et al. | An Arabic tweets sentiment analysis dataset (ATSAD) using distant supervision and self training |
CN108804595B (en) | Short text representation method based on word2vec |
Suleiman et al. | Comparative study of word embeddings models and their usage in Arabic language applications |
Li et al. | Accurate recommendation based on opinion mining |
Reddy et al. | Profile specific document weighted approach using a new term weighting measure for author profiling |
KR101593371B1 (en) | Propensity classification device for text data and Decision support systems using the same |
CN115795030A (en) | Text classification method and device, computer equipment and storage medium |
Mozafari et al. | Emotion detection by using similarity techniques |
CN106681986A (en) | Multi-dimensional sentiment analysis system |
Najibullah | Indonesian text summarization based on naïve bayes method |
Ruposh et al. | A computational approach of recognizing emotion from Bengali texts |
López-Santillán et al. | Custom document embeddings via the centroids method: Gender classification in an author profiling task |
Gwad et al. | Twitter sentiment analysis classification in the Arabic language using long short-term memory neural networks |
Reddy et al. | Author profile prediction using pivoted unique term normalization |
Popova et al. | Ranking in keyphrase extraction problem: is it suitable to use statistics of words occurrences |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |