CN103995853A - Multi-language emotional data processing and classifying method and system based on key sentences - Google Patents
- Publication number
- CN103995853A (CN201410198519.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a multi-language emotional data processing and classifying method and system based on key sentences. The method includes the steps that first, an emotional dictionary data packet is automatically extracted from an unlabelled emotional data set, and the polarity of emotional words is finally judged through a K nearest neighbor algorithm and a voting rule; second, the extracted emotional dictionary data packet is used for calculating the score of the emotion attribute, then, the position attribute and the key word attribute are comprehensively considered, and a plurality of emotional key sentences are extracted for each text; third, the extracted emotional key sentences are directly applied to supervised emotional data classification and unsupervised emotional data classification. Therefore, the double-difficulty problem caused by language migration and emotional data analysis in the multi-language translation process can be solved, and emotional data analysis accuracy can be improved.
Description
Technical Field
The invention relates to text emotion data analysis, in particular to a multilingual emotion data processing and classifying method and system based on key sentences.
Background
With the continuous emergence of online communication platforms such as forums, blogs, review sites and microblogs, people are increasingly accustomed to publishing subjective comments on the Internet to express their views and opinions on daily events, products, policies and the like. Meanwhile, as globalization accelerates, the information resources available on the network have become multilingual. Emotion classification is the task of dividing texts into positive and negative classes according to the emotion polarity they express; multilingual emotion classification refers to classifying the emotion of texts in other languages with the help of a source language. Multilingual emotion classification aims to study the viewpoints, opinions and attitudes contained in multilingual emotion texts with minimal resources. It not only helps users make reasonable purchasing decisions by referring to the evaluations of commodities by users worldwide, but also makes it possible to follow online public opinion in countries around the world in a more timely manner.
At present, multilingual emotion data analysis mainly faces two difficult problems: language migration in the cross-language translation process, and emotion data analysis itself.
For language migration, the following two methods are mainly adopted:
Migrating the cross-language emotion data classifier by means of a statistical machine translation system. On the one hand, a labeled source language data set can be translated into the target language, and a classifier is then trained on the translated training corpus to judge the test set; alternatively, the target language test set can be translated into the source language and the classifier trained on the source language applied to it directly. However, methods based on machine translation lose accuracy in cross-language emotion analysis. On the one hand, a machine translation system outputs only a single translation, which is not necessarily correct; on the other hand, machine translation systems depend on their training sets and perform poorly when the domain of the target language differs significantly from the training set.
Migrating the cross-language emotion data classifier by means of a bilingual dictionary. In supervised learning, an emotion data classifier can be learned in the source language, and a bilingual dictionary is then used to translate the feature space into the target language; in unsupervised learning, the emotion dictionary in the source language can be translated into the target language through a bilingual dictionary. However, most bilingual-dictionary-based work does not consider the contextual dependency of emotion words when selecting translated words. In addition, the polarity (support or opposition) of emotion words is domain dependent and differs from entity to entity, so a general emotion dictionary tends to perform poorly in a specific domain.
For emotion data analysis, the following three methods are mainly used:
In the supervised learning method, the emotion tendency analysis of a text can be regarded as a text classification process, and the text tendency is judged by means of machine learning methods such as naive Bayes, maximum entropy and support vector machines. On top of these machine learning methods, feature fusion or feature reduction can be carried out to further improve the performance of emotion data classification.
In the unsupervised learning method, emotion data analysis is performed without any labeled data. The classical approach is as follows: first, part-of-speech tagging is carried out on a text and some collocations of adjectives and adverbs are selected according to predefined rules; then, for each collocation, the difference between its mutual information with each of a pair of opposite-polarity emotion words, such as "excellent" and "poor", is calculated; finally, the mutual information differences of all collocations in a text are summed to judge its emotion category.
In the semi-supervised learning method, a large amount of unlabelled data is combined with a small amount of labeled data. The semi-supervised learning can reduce the dependence of the supervised learning on the labeled samples, can obtain better performance than the unsupervised learning, and is a compromise method.
However, the conventional emotion analysis method does not solve the problem of interference of emotion ambiguity in the comment text on emotion data classification. The emotion data classification is somewhat similar to the plain text classification, but more complex than the plain text classification. In topic-based text classification, because word usage is different between texts with different topics, the domain relevance of words enables the texts with different topics to be well distinguished. However, the emotion data classification is much less accurate than the topic-based text classification, which is mainly caused by the complex emotion expression and the large amount of emotional ambiguity in the emotion text. In addition, in an article, objective sentences and subjective sentences may be interlaced, or a subjective sentence has more than two emotions at the same time, so text emotion data classification is a very complicated task. Here, taking a book review on a network as an example:
"many people say this is a sad, overflowing story, perhaps it is this comment that I have not been courage to read seriously. Though people who fall into a popular set are refused to shake and are extremely easy to deepen, the people are willing to see a beautiful large-volume ending in emotion, and the communication is so fragile and impatient in display.
… … I read this book in one sitting, and I like it very much. "
The author uses a large number of negative words, such as "sadness" and "fragility", to describe his feelings before reading, but at the end of the article he states with a very positive attitude that he likes the book. In this example, the polarity of the entire text is positive, but it is easily judged as negative because of the large number of negative words. When the polarity of a whole article is judged, the sentences in the article contribute to its emotion in different degrees; if the key sentences that express emotion are distinguished from the sentences that describe details, text emotion data classification performance improves.
In summary, the following three problems mainly exist in multilingual emotional orientation analysis:
(1) multi-language emotion analysis over-depends on external resources
Most multilingual emotion analysis techniques rely on machine translation or bilingual dictionaries. Without a machine translation system or a compiled bilingual dictionary, the multilingual emotion analysis work is difficult.
(2) Multilingual emotion analysis is susceptible to interference from emotional ambiguities
In an article, objective sentences and subjective sentences may be interlaced, or a subjective sentence has more than two emotions, so text emotion data classification is a very complicated task.
(3) The performance of multilingual emotion analysis is unsatisfactory
Emotional expressions in different languages vary widely, and there is a loss of information when a model derived from the original space is converted to the target language space. For example, machine translation systems only generate unique solutions, and methods based on machine translation lose the accuracy of cross-language emotion analysis.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and a system for classifying language-independent multi-language emotion data, so as to solve the problem of dual difficulties of language migration and emotion analysis in the cross-language translation process. The method has less resource dependence, can be easily transplanted to a multi-language scene, and can grasp the most main view of an author through a key sentence extraction module so as to improve the accuracy of multi-language emotion data classification.
In order to achieve the purpose, the invention provides a multilingual emotion data processing and classifying method based on key sentences, which is characterized by comprising the following steps of:
step 1, automatically extracting a part of emotion dictionary data packet from an unlabeled emotion data set, and finally judging the polarity of emotion words through a K neighbor algorithm and a voting rule;
step 2, calculating scores of emotion attributes by using the extracted emotion dictionary data packets, then comprehensively considering position attributes and keyword attributes, and automatically extracting a plurality of emotion key sentences for each text;
step 3, directly applying the extracted emotion key sentences to supervised emotion data classification and unsupervised emotion data classification.
The invention discloses a multilingual emotion data processing and classifying method based on key sentences, which is characterized in that the step 1 comprises the following steps:
step 21, taking Chinese as an example, extracting emotional words XX from the whole data set according to pattern matching of 'very XX' and 'very XX';
step 22, taking mutual information as similarity measurement, and assigning an emotion polarity for each emotion word according to a K neighbor algorithm;
and step 23, optimizing the designated emotion polarity through a voting principle.
The invention discloses a multilingual emotion data processing and classifying method based on key sentences, which is characterized in that the step 2 comprises the following steps:
step 31, calculating the emotion score of each sentence according to the extracted emotion dictionary data packet;
step 32, calculating the position score of each sentence according to Gaussian distribution;
step 33, calculating the keyword score of each sentence according to the keyword list;
and step 34, carrying out weighted summation on the emotion scores, the position scores and the keyword scores, and determining the N sentences with the highest scores as emotion key sentences.
The invention discloses a multilingual emotion data processing and classifying method based on key sentences, which is characterized in that the step 3 comprises the following steps:
unsupervised sentiment data classification: each text is replaced by a plurality of emotion key sentences, and then the polarity of each text is judged on the key sentences by using extracted emotion dictionary data packets;
supervised emotion data classification: selecting the most confident samples from the unlabeled samples as a labeled set according to the scores of the positive-class and negative-class emotion words respectively, then training an emotion data classifier, and finally judging the polarity of each article on its key sentences.
The invention also relates to a multilingual emotion data processing and classifying system based on the key sentences, which is characterized by comprising the following steps:
the polarity judgment module is used for automatically extracting a part of emotion dictionary data packet from the unlabeled emotion data set and finally judging the polarity of the emotion words through a K neighbor algorithm and a voting rule;
the key sentence extraction module is used for calculating the score of the emotion attribute through the extracted emotion dictionary data packet, then comprehensively considering the position attribute and the keyword attribute and automatically extracting a plurality of emotion key sentences for each text;
and the emotion data classification module is used for directly applying the extracted emotion key sentences to supervised emotion data classification and unsupervised emotion data classification.
The multilingual emotion data processing and classifying system based on key sentences is characterized in that the polarity judgment module comprises:
the emotion word extraction module is used for extracting the emotion words XX from the whole data set according to pattern matching of 'very XX' and 'very XX' by taking Chinese as an example;
the polarity endowing module is used for taking mutual information as similarity measurement and appointing an emotion polarity for each emotion word according to a K neighbor algorithm;
and the polarity optimization module is used for optimizing the designated emotion polarity through a voting principle.
The multilingual emotion data processing and classifying system based on key sentences is characterized in that the key sentence extraction module comprises:
the emotion score calculation module is used for calculating the emotion score of each sentence according to the extracted emotion dictionary data packet;
the position score calculating module is used for calculating the position score of each sentence according to Gaussian distribution;
the keyword score calculation module is used for calculating the keyword score of each sentence according to the keyword list;
and the key sentence determining module is used for carrying out weighted summation on the emotion scores, the position scores and the keyword scores and determining the N sentences with the highest scores as the emotion key sentences.
The invention discloses a multilingual emotion data processing and classifying system based on key sentences, which is characterized in that an emotion data classifying module comprises:
the unsupervised emotion data classification module is used for replacing each text with a plurality of emotion key sentences and judging the polarity of each text on the key sentences by using extracted emotion dictionary data packets;
and the supervised emotion data classification module is used for selecting the most confident samples from the unlabeled samples as a labeled set according to the scores of the positive-class and negative-class emotion words respectively, then training an emotion data classifier, and finally judging the polarity of each article on its key sentences.
The invention has the following beneficial effects: the multilingual orientation analysis method provided by the invention is language-independent, needs neither a machine translation system nor a large-scale bilingual dictionary data packet, learns the emotion data classifier directly on the target language, and has little resource dependence. Moreover, the invention addresses the problem that emotion data classification is easily disturbed by emotional ambiguity: the key sentence extraction module captures the author's main viewpoints and ignores unimportant ones, thereby improving the performance of emotion data classification. The present invention is superior to other unsupervised methods. Classifying key sentences with the extracted emotion dictionary data packet works better than classifying the full text with it, so key-sentence-based emotion data classification is more accurate than full-text-based classification, which demonstrates the effectiveness of the key sentence extraction algorithm provided by the invention.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
FIG. 2 is a graph of a standard Gaussian distribution.
Detailed Description
The invention relates to a multilingual emotion data processing and classifying method based on key sentences, which comprises the following steps:
step 1, automatically extracting an emotion dictionary data packet (two-tuple data such as ("good", positive class) and ("poor", negative class)) from an unlabeled emotion corpus database. The polarity (positive or negative) of the emotion words is determined by the K-nearest neighbor algorithm and voting rules. In the voting rule, the invention also introduces a suspension mechanism to prevent the polarity from being over-corrected;
step 2, calculating scores of emotion attributes by using the extracted emotion dictionary data packets, then comprehensively considering position attributes and keyword attributes, and automatically extracting a plurality of emotion key sentences for each text as a representative of each text;
step 3, directly applying the extracted emotion key sentences to supervised emotion data classification and unsupervised emotion data classification to obtain the emotion polarity of each text.
Taking the book review above as an example, the emotion key sentence extraction module can extract the key sentence "I read this book in one sitting, and I like it very much" to stand in for the overall viewpoint of the whole review. Then, by querying the previously acquired emotion dictionary data packet, it is found that the key sentence contains the emotion word "like" and that the polarity of "like" is positive, so the emotion polarity of the book review is determined to be positive.
The step 1 comprises the following steps:
first, taking Chinese as an example, the emotion words XX are extracted from the entire data set according to pattern matching "very XX" and "very XX".
And secondly, taking mutual information as similarity measurement, and assigning an emotion polarity for each emotion word according to a K-nearest neighbor algorithm.
And finally, optimizing the designated emotion polarity through a voting principle.
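The pattern-matching step above can be illustrated with a minimal sketch. It is not the patented implementation: the English intensifier "very" stands in for the Chinese patterns, and the name `extract_emotion_words` and the sample corpus are hypothetical.

```python
import re
from collections import Counter

# Assumption: the English pattern "very XX" stands in for the patent's patterns.
PATTERN = re.compile(r"\bvery\s+(\w+)", re.IGNORECASE)

def extract_emotion_words(texts):
    """Collect candidate emotion words XX that directly follow the intensifier."""
    counts = Counter()
    for text in texts:
        for match in PATTERN.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts

# Hypothetical unlabeled corpus.
corpus = [
    "The plot is very good.",
    "Very boring, although the ending is very good too.",
]
candidates = extract_emotion_words(corpus)
```

The returned counter holds only candidate words; their polarity is still unknown at this point and is assigned by the later steps.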
The step 2 comprises the following steps:
first, an emotion score is calculated for each sentence from the extracted emotion dictionary data packet.
Next, a position score of each sentence is calculated from the gaussian distribution.
Again, a keyword score is calculated for each sentence from the keyword list.
And finally, carrying out weighted summation on the emotion scores, the position scores and the keyword scores, and determining the N sentences with the highest scores as emotion key sentences.
The step 3 comprises the following steps:
unsupervised sentiment data classification: each text is replaced by a plurality of emotion key sentences, and then the polarity of each text is judged on the key sentences by using the extracted emotion dictionary data packet.
Supervised emotion data classification: and selecting the most confident sample from the unlabeled data set as the labeled set according to the scores of the positive-type emotion words and the negative-type emotion words respectively, then training an emotion data classifier, and finally judging the polarity of each article on the key sentence.
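A minimal sketch of the unsupervised variant follows. It assumes that the polarity of a text is taken as the sign of the summed lexicon polarities over its key sentences; the function name, the toy lexicon and the tokenized review are hypothetical.

```python
def classify_unsupervised(sentences, lexicon):
    """Return "POS" or "NEG" from the summed word polarities, or None on a tie.

    Assumption: lexicon maps a word to +1 (positive) or -1 (negative).
    """
    total = sum(lexicon.get(word, 0) for sent in sentences for word in sent)
    if total > 0:
        return "POS"
    if total < 0:
        return "NEG"
    return None

# Hypothetical lexicon and tokenized review, loosely following the book review example.
lexicon = {"sad": -1, "fragile": -1, "like": 1}
full_text = [["sad", "story"], ["fragile", "ending"], ["i", "like", "this", "book"]]
key_sentences = [full_text[-1]]  # pretend the extractor picked the last sentence

on_full_text = classify_unsupervised(full_text, lexicon)
on_key_sentences = classify_unsupervised(key_sentences, lexicon)
```

In this toy case the full text is pulled toward the negative class by the descriptive sentences, while the key sentence alone yields the positive class, which is the effect the method aims for.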
To prove the effectiveness of the proposed method, experiments were conducted on multi-domain (books, movies, music) review data sets in multiple languages (English, French, German).
In order to verify the validity of the voting rule, the emotion dictionary correctness before and after applying the voting rule was manually checked, and the result is shown in table 1.
TABLE 1 Polarity determination accuracy of English emotion words
English | Before voting | After voting |
Books | 0.6931 | 0.8053 |
Film | 0.7263 | 0.7835 |
Music | 0.7512 | 0.7708 |
Average | 0.7235 | 0.7865 |
As can be seen from Table 1, after the voting rule is applied, the accuracy of the English emotion dictionary data packet improves by 6.3 percentage points on average. For general emotion words, the voting rule raises the polarity judgment accuracy through majority rule; for domain-dependent emotion words, the suspension mechanism prevents the emotion polarity from being over-corrected.
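The averages and the 6.3-point gain can be re-derived from the per-domain rows of Table 1. The short check below uses only the figures in the table; the dictionary keys are shorthand for the three domains.

```python
# Per-domain accuracies copied from Table 1.
before = {"books": 0.6931, "film": 0.7263, "music": 0.7512}
after = {"books": 0.8053, "film": 0.7835, "music": 0.7708}

avg_before = sum(before.values()) / len(before)
avg_after = sum(after.values()) / len(after)

# Improvement expressed in percentage points.
gain_points = (avg_after - avg_before) * 100
```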
In order to verify the effectiveness of the key sentence extraction algorithm, the emotion data classification method based on the key sentences is compared with other reference methods respectively, and experiments are carried out on data sets of different languages, and the results are shown in tables 2-4.
TABLE 2 English sentiment data classification accuracy
TABLE 3 French sentiment data Classification accuracy
TABLE 4 German Emotion data Classification accuracy
From Tables 2-4, it can be seen that the method of the present invention is superior to other unsupervised methods across multiple languages and multiple domains. Classifying key sentences with the extracted emotion dictionary data packet works better than classifying the full text with it, so key-sentence-based emotion data classification is more accurate than full-text-based classification, which demonstrates that the key sentence extraction algorithm provided by the invention is effective. The core idea of the invention is to analyze the tendentiousness of a completely unknown language with the least resources (prior knowledge), to learn the emotion data classifier automatically on the target language data set, and to capture the author's most important viewpoints through the key sentence extraction module while ignoring the interference of unimportant viewpoints.
FIG. 1 is a flowchart of a sentiment data classification method. As shown in fig. 1, the method includes:
step 1, automatically extracting a part of emotion dictionary data packets from an unlabeled emotion data set, and finally judging the polarity (positive type or negative type) of each emotion word through a K neighbor algorithm and a voting rule.
In the polarity judgment based on the K-nearest neighbor algorithm, the emotion polarity of a word is determined by the polarities of the K words most closely associated with it, that is, the K words with which its similarity is greatest. The similarity between two words $(o_i, o_j)$ is measured by mutual information, which in its standard pointwise form is:
$$MI(o_i, o_j) = \log \frac{p(o_i, o_j)}{p(o_i)\,p(o_j)}$$
where $o_i$ and $o_j$ represent two different emotion words, and $p$ denotes probability.
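The K-nearest-neighbor polarity assignment can be sketched as follows. Several points are assumptions rather than details from the patent: probabilities are estimated from document-level co-occurrence, similarity is the standard pointwise mutual information log(p(a, b) / (p(a) p(b))), and a small set of seed words with known polarity (`seeds`) serves as the labeled neighbors. All names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(docs):
    """Return a function pmi(a, b) estimated from document co-occurrence counts."""
    n = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = set(doc)
        word_df.update(words)
        pair_df.update(frozenset(p) for p in combinations(sorted(words), 2))

    def pmi(a, b):
        joint = pair_df[frozenset((a, b))]
        if joint == 0:
            return float("-inf")  # never co-occur: least similar
        return math.log((joint / n) / ((word_df[a] / n) * (word_df[b] / n)))

    return pmi

def knn_polarity(word, seeds, pmi, k=3):
    """Assign +1 or -1 by majority over the k seed words with the highest PMI."""
    neighbours = sorted(seeds, key=lambda s: pmi(word, s), reverse=True)[:k]
    total = sum(seeds[s] for s in neighbours)
    return 1 if total >= 0 else -1  # ties default to positive (arbitrary choice)

# Hypothetical tokenized corpus and seed lexicon.
docs = [["great", "good", "nice"], ["great", "good"], ["awful", "bad", "poor"],
        ["awful", "bad"], ["great", "nice"]]
seeds = {"good": 1, "nice": 1, "bad": -1, "poor": -1}
pmi = pmi_table(docs)
great_polarity = knn_polarity("great", seeds, pmi, k=2)
awful_polarity = knn_polarity("awful", seeds, pmi, k=2)
```

Choosing an odd `k` avoids the tie case; the voting rule described next then refines these first-pass polarities.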
In order to further optimize the polarity judgment result based on the K-nearest neighbor algorithm, a voting rule is adopted to judge the polarity a second time. In the voting rule, a suspension mechanism is introduced to make use of the results generated separately by the three domains. One domain is selected as the main domain and the other two as auxiliary domains, and the voting rules are as follows:
(1) if the polarity results of the emotion words generated by the three fields are consistent, the polarity is determined.
(2) If there is one auxiliary domain that generates emotion words with the same polarity as the main domain, the polarity is determined.
(3) If the emotion words generated by the two auxiliary domains have the same polarity and the result generated by the main domain is different, the polarity is suspended.
The suspension mechanism is introduced to prevent the polarity of emotion words from being over-corrected, because emotion word polarity is domain dependent: for example, "big" may be positive in the hotel domain and negative in the electronics domain. The judgment of the main domain therefore remains credible even when it differs from the judgments of the other domains. For a suspended emotion word, the polarity is finally assigned by comparing the emotion word score in the main domain with the sum of the emotion scores in the two auxiliary domains.
Step 2, comprehensively considering the emotion attributes, the position attributes and the keyword attributes, and automatically extracting the emotion key sentences.
Given an article, the scores of the 3 attributes are calculated for each sentence and then summed with weights, and the highest-scoring sentences are selected as the emotion key sentences.
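The three voting rules and the suspension mechanism can be sketched as below. The representation is an assumption: each domain reports a pair (polarity, score) with polarity in {+1, -1} and a non-negative score, and a suspended word keeps the main-domain polarity when the main score is at least the sum of the two auxiliary scores. The patent does not specify how ties are handled.

```python
def vote(main, aux1, aux2):
    """Combine per-domain results; each argument is (polarity, score).

    Returns (final_polarity, was_suspended).
    """
    main_pol, main_score = main
    # Rules (1) and (2): at least one auxiliary domain agrees with the main domain.
    if main_pol in (aux1[0], aux2[0]):
        return main_pol, False
    # Rule (3): both auxiliary domains disagree with the main domain, so the
    # polarity is suspended and resolved by comparing scores (assumed form).
    aux_total = aux1[1] + aux2[1]
    if main_score >= aux_total:
        return main_pol, True
    return aux1[0], True

# Hypothetical per-domain results for one emotion word.
agree = vote((1, 0.9), (1, 0.2), (-1, 0.5))
kept = vote((1, 2.0), (-1, 0.5), (-1, 0.5))
flipped = vote((1, 0.3), (-1, 0.5), (-1, 0.5))
```

The second call shows the point of the suspension mechanism: a strongly supported main-domain polarity survives even when both auxiliary domains disagree.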
Any text $d$ consists of a series of sentences, $d = \{s_1, s_2, \dots, s_m\}$, where $m$ is the number of sentences, and each sentence $s_i$ consists of a series of words, $s_i = \{w_{i1}, w_{i2}, \dots, w_{in}\}$, where $n$ is the number of words. The final score of each sentence can be expressed as a weighted sum of the 3 attributes:
$$f(s_i) = \lambda_1 \, f_{\mathrm{sentiment}}(s_i) + \lambda_2 \, f_{\mathrm{position}}(s_i) + \lambda_3 \, f_{\mathrm{keyword}}(s_i)$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ are the weights of the attributes, obtained by maximizing the precision of the classifier, $f_{\mathrm{sentiment}}(s_i)$ is the emotion score of sentence $s_i$, $f_{\mathrm{position}}(s_i)$ is the position score of sentence $s_i$, and $f_{\mathrm{keyword}}(s_i)$ is the keyword score of sentence $s_i$.
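The weighted combination and the selection of the N best sentences can be sketched as follows. The three scoring functions are passed in as parameters; the toy scorers and the equal default weights are placeholders only, since the patent obtains the weights by maximizing classifier precision.

```python
def top_key_sentences(sentences, f_sentiment, f_position, f_keyword,
                      weights=(1.0, 1.0, 1.0), n=1):
    """Return the n sentences with the highest weighted score."""
    lam1, lam2, lam3 = weights
    m = len(sentences)
    scored = []
    for i, s in enumerate(sentences, start=1):
        score = lam1 * f_sentiment(s) + lam2 * f_position(i, m) + lam3 * f_keyword(s)
        scored.append((score, i, s))
    # Highest score first; earlier sentences win ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [s for _, _, s in scored[:n]]

# Toy scorers (hypothetical stand-ins for the three attribute functions).
def toy_sentiment(s):
    return float(any(w in {"like", "sad"} for w in s))

def toy_position(i, m):
    return 1.0 if i in (1, m) else 0.0

def toy_keyword(s):
    return float("overall" in s)

review = [["sad", "story"], ["plot", "detail"], ["overall", "like", "book"]]
key = top_key_sentences(review, toy_sentiment, toy_position, toy_keyword, n=1)
```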
Emotional characteristics: the emotion key sentence mainly expresses the author's overall view or preference, which is usually conveyed by emotion words. The emotion attribute examines whether a sentence carries emotional coloring and measures its emotional importance. The emotion score function $f_{\mathrm{sentiment}}(s)$ is as follows:
wherein opinion_lexicon(t) not only identifies whether the word t in the sentence s is an emotion word but also marks its polarity. If t is a commendatory (positive) word, then opinion_lexicon(t) is 1; if t is a derogatory (negative) word, then opinion_lexicon(t) is -1. As can be seen from the formula, the score is high only when the emotion words in a sentence share the same polarity; it is lower if the sentence contains emotion words of different polarities at the same time.
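One function with the behavior just described is the absolute value of the summed word polarities. This form is an assumption chosen to match the description (same-polarity words reinforce each other, mixed polarities cancel); the exact formula, including any normalization, may differ.

```python
def f_sentiment(sentence, opinion_lexicon):
    """Absolute value of the summed polarities of the words in a sentence.

    Assumption: opinion_lexicon maps positive words to 1 and negative words
    to -1; words missing from the lexicon contribute 0.
    """
    return abs(sum(opinion_lexicon.get(t, 0) for t in sentence))

# Hypothetical lexicon and sentences.
opinion_lexicon = {"like": 1, "good": 1, "sad": -1}
same_polarity = f_sentiment(["i", "like", "this", "good", "book"], opinion_lexicon)
mixed_polarity = f_sentiment(["sad", "but", "good"], opinion_lexicon)
```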
Position characteristics: to extract the main viewpoints from Internet user comments effectively, the ending part of the article deserves particular emphasis. The present invention treats the beginning and ending sentences of an article as important. The position attribute ensures that sentences at the beginning and the end of the article receive higher key-sentence scores than sentences in the middle. The position score function $f_{\mathrm{position}}(s)$ is defined as follows:
The position score function is in fact the negative of a Gaussian probability density function, where μ is the mean, σ is the standard deviation (σ² being the variance), and len is the text length (i.e., the number of sentences in the text). The curve of $f_{\mathrm{position}}(s)$ opens upward like a parabola: the abscissa is the position of a sentence, ranging from 1 to len, and the ordinate is the score of the sentence at that position as an emotion key sentence. In this negated Gaussian (used here purely for the shape of its curve, with no further connection to the invention), a sentence in the middle of the article lies at the lowest point of the curve and so scores lowest as an emotion key sentence, while sentences at the beginning and the end score higher. The standard Gaussian distribution is shown in FIG. 2.
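A negated Gaussian density can be sketched as below. The parameter choices are assumptions, not values given in the text: the mean is placed at the middle sentence, (len + 1) / 2, and the standard deviation defaults to len / 4.

```python
import math

def f_position(i, length, sigma=None):
    """Negated Gaussian density at sentence position i (1-based).

    Assumptions: mu is the middle position and sigma defaults to length / 4.
    """
    mu = (length + 1) / 2
    if sigma is None:
        sigma = length / 4
    density = math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    return -density

# Scores for a five-sentence text: lowest in the middle, highest at both ends.
scores = [f_position(i, 5) for i in range(1, 6)]
```

All scores are negative, so only their ordering matters when they enter the weighted sum.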
Keyword characteristics: emotion key sentences often contain summarizing words or phrases, such as "in summary", and these summarizing keywords provide good heuristic information for extracting the emotion key sentences. The invention performs word-frequency statistics on the last sentence of every text in the corpus and sorts the results to obtain a keyword list. If a keyword appears in a sentence, the probability that the sentence is a key sentence is higher. The keyword score function $f_{\mathrm{keyword}}(s)$ is therefore defined as follows:
wherein, <math>
<mrow>
<mi>keyword</mi>
<mrow>
<mo>(</mo>
<msub>
<mi>w</mi>
<mi>j</mi>
</msub>
<mo>)</mo>
</mrow>
<mo>=</mo>
<mfenced open='{' close=''>
<mtable>
<mtr>
<mtd>
<mn>1</mn>
</mtd>
<mtd>
<msub>
<mi>w</mi>
<mi>j</mi>
</msub>
<mo>∈</mo>
<mi>keywords</mi>
</mtd>
</mtr>
<mtr>
<mtd>
<mn>0</mn>
</mtd>
<mtd>
<msub>
<mi>w</mi>
<mi>j</mi>
</msub>
<mo>∉</mo>
<mi>keywords</mi>
</mtd>
</mtr>
</mtable>
</mfenced>
<mo>,</mo>
</mrow>
</math> w_j are the words or phrases that make up the sentence.
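As a hedged illustration of the keyword feature, the sketch below builds a keyword list from the last sentence of each text and scores a sentence by counting its keyword hits. The equation for f_keyword(s) is not reproduced above, so summing the indicator over the words of a sentence is an assumption, as are the function names and the `top_k` cut-off. The indicator is taken to be 1 for a word in the keyword list and 0 otherwise.

```python
from collections import Counter


def build_keyword_list(texts, top_k=10):
    """texts: list of documents, each a list of tokenised sentences.

    Counts word frequencies over the last sentence of every document and
    keeps the top_k most frequent words (top_k is a hypothetical parameter).
    """
    counts = Counter()
    for sentences in texts:
        if sentences:
            counts.update(sentences[-1])
    return {word for word, _ in counts.most_common(top_k)}


def keyword_indicator(word, keywords):
    """keyword(w_j): 1 if the word is in the keyword list, else 0."""
    return 1 if word in keywords else 0


def keyword_score(sentence, keywords):
    """Assumed form of f_keyword(s): the sum of the indicator over the words."""
    return sum(keyword_indicator(word, keywords) for word in sentence)


corpus = [
    [["the", "plot", "drags"], ["overall", "a", "weak", "film"]],
    [["great", "cast"], ["overall", "worth", "watching"]],
]
keywords = build_keyword_list(corpus, top_k=1)
```

Here "overall" is the only word that occurs in both final sentences, so it is the single keyword kept when `top_k=1`.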
Step 3: the extracted emotion key sentences are applied directly to supervised emotion data classification and unsupervised emotion data classification.
The training samples are chosen from the unlabeled samples according to the following equation:
where T_P denotes the number of positive emotion words in a text, T_N denotes the number of negative emotion words in a text, POS denotes the positive class, and NEG denotes the negative class. The invention holds that the greater the difference between the numbers of positive and negative emotion words in a text, the more definite the emotional tendency of the text. To offset the influence of text length on T_P and T_N, the difference is normalized by the denominator.
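The selection rule can be sketched as follows. The equation itself is not reproduced above, so this sketch assumes the normalised difference |T_P - T_N| / (T_P + T_N) as the confidence measure and a fixed threshold for picking samples; the threshold value, the function names, and the toy word lists are all hypothetical.

```python
def polarity_confidence(tokens, positive_words, negative_words):
    """Return (label, confidence) for one tokenised text.

    Assumed confidence: |T_P - T_N| / (T_P + T_N), which normalises the
    difference so that long texts are not favoured.
    """
    t_p = sum(1 for word in tokens if word in positive_words)
    t_n = sum(1 for word in tokens if word in negative_words)
    total = t_p + t_n
    if total == 0 or t_p == t_n:
        return None, 0.0  # no emotion words, or no clear tendency
    label = "POS" if t_p > t_n else "NEG"
    return label, abs(t_p - t_n) / total


def select_confident_samples(texts, positive_words, negative_words, threshold=0.5):
    """Keep only the texts whose confidence reaches the assumed threshold."""
    selected = []
    for tokens in texts:
        label, confidence = polarity_confidence(tokens, positive_words, negative_words)
        if label is not None and confidence >= threshold:
            selected.append((tokens, label, confidence))
    return selected


positive = {"good", "great"}
negative = {"bad", "poor"}
unlabeled = [
    ["good", "great", "bad"],   # T_P = 2, T_N = 1
    ["bad", "poor", "plot"],    # T_P = 0, T_N = 2
    ["good", "bad"],            # tie, no clear tendency
]
training_set = select_confident_samples(unlabeled, positive, negative)
```

The selected pairs of text and pseudo-label could then serve as the labeled set on which an emotion data classifier is trained.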
Claims (8)
1. A multilingual emotion data processing and classifying method based on key sentences is characterized by comprising the following steps:
step 1, automatically extracting a partial emotion dictionary data packet from an unlabeled emotion data set, and determining the polarity of the emotion words through a K-nearest-neighbor algorithm and a voting rule;
step 2, calculating scores of emotion attributes by using the extracted emotion dictionary data packets, then comprehensively considering position attributes and keyword attributes, and automatically extracting a plurality of emotion key sentences for each text;
and step 3, directly applying the extracted emotion key sentences to supervised emotion data classification and unsupervised emotion data classification.
2. The multi-lingual emotion data processing and classification method based on key sentences of claim 1, wherein step 1 comprises:
step 21, taking Chinese as an example, extracting emotional words XX from the whole data set according to pattern matching of 'very XX' and 'very XX';
step 22, taking mutual information as the similarity measure, and assigning an emotion polarity to each emotion word according to a K-nearest-neighbor algorithm;
and step 23, optimizing the designated emotion polarity through a voting principle.
3. The multi-lingual emotion data processing classification method based on key sentences of claim 1, wherein step 2 includes:
step 31, calculating the emotion score of each sentence according to the extracted emotion dictionary data packet;
step 32, calculating the position score of each sentence according to Gaussian distribution;
step 33, calculating the keyword score of each sentence according to the keyword list;
and step 34, carrying out weighted summation on the emotion scores, the position scores and the keyword scores, and determining the N sentences with the highest scores as emotion key sentences.
4. The multi-lingual emotion data processing classification method based on key sentences as recited in claim 1, wherein step 3 comprises:
unsupervised sentiment data classification: each text is replaced by a plurality of emotion key sentences, and then the polarity of each text is judged on the key sentences by using extracted emotion dictionary data packets;
supervised emotion data classification: selecting the most confident samples from the unlabeled samples as a labeled set according to the scores of the positive-class and negative-class emotion words respectively, then training an emotion data classifier, and finally judging the polarity of each article on the key sentences.
5. A multilingual emotion data processing and classifying system based on key sentences is characterized by comprising the following components:
the polarity judgment module is used for automatically extracting a partial emotion dictionary data packet from the unlabeled emotion data set and determining the polarity of the emotion words through a K-nearest-neighbor algorithm and a voting rule;
the key sentence extraction module is used for calculating the score of the emotion attribute through the extracted emotion dictionary data packet, then comprehensively considering the position attribute and the keyword attribute and automatically extracting a plurality of emotion key sentences for each text;
and the emotion data classification module is used for directly applying the extracted emotion key sentences to supervised emotion data classification and unsupervised emotion data classification.
6. The multi-lingual emotion data processing classification system based on key sentences of claim 5, wherein the polarity judgment module comprises:
the emotion word extraction module is used for extracting the emotion words XX from the whole data set according to pattern matching of 'very XX' and 'very XX' by taking Chinese as an example;
the polarity assignment module is used for taking mutual information as the similarity measure and assigning an emotion polarity to each emotion word according to a K-nearest-neighbor algorithm;
and the polarity optimization module is used for optimizing the designated emotion polarity through a voting principle.
7. The multi-lingual emotion data processing classification system based on key sentences of claim 5, wherein the key sentence extraction module includes:
the emotion score calculation module is used for calculating the emotion score of each sentence according to the extracted emotion dictionary data packet;
the position score calculating module is used for calculating the position score of each sentence according to Gaussian distribution;
the keyword score calculation module is used for calculating the keyword score of each sentence according to the keyword list;
and the key sentence determining module is used for carrying out weighted summation on the emotion scores, the position scores and the keyword scores and determining the N sentences with the highest scores as the emotion key sentences.
8. The multi-lingual emotion data processing classification system based on key sentences of claim 5, wherein the emotion data classification module includes:
the unsupervised emotion data classification module is used for replacing each text with a plurality of emotion key sentences and judging the polarity of each text on the key sentences by using extracted emotion dictionary data packets;
and the supervised emotion data classification module is used for selecting the most confident samples from the unlabeled samples as a labeled set according to the scores of the positive-class and negative-class emotion words respectively, then training an emotion data classifier, and finally judging the polarity of each article on the key sentences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410198519.5A CN103995853A (en) | 2014-05-12 | 2014-05-12 | Multi-language emotional data processing and classifying method and system based on key sentences |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103995853A (en) | 2014-08-20 |
Family
ID=51310018
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20140820 |