CN104199980A

CN104199980A - Sentiment information compression method and system for comment corpus

Info

Publication number: CN104199980A
Application number: CN201410494394.0A
Authority: CN
Inventors: 李寿山; 高伟; 周国栋; 王红玲
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2014-09-24
Filing date: 2014-09-24
Publication date: 2014-12-10

Abstract

The invention provides a sentiment information compression method and system for comment corpus. The method comprises the following steps: S1, dividing the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples; S2, training a classifier with a machine learning method, classifying the test sample with it, and taking the maximum posterior probability of the classification result as the representative sentiment score of each sample; S3, sorting all samples in descending order of representative sentiment score and, according to the compression scale N, taking the N top-ranked samples as the compressed sample set. The method and system compress a comment corpus effectively while retaining as much of the original corpus's sentiment classification information as possible, so that sentiment classification can run on mobile devices with small storage capacity.

Description

Sentiment information compression method and system for comment corpus

Technical field

The present invention relates to the fields of natural language processing and pattern recognition, and in particular to a sentiment information compression method and system for comment corpus.

Background

With the rapid development of the internet, people increasingly express their opinions online, so large amounts of sentiment-bearing text have appeared on the web. Such opinionated text typically takes the form of product reviews, forum comments and blog posts. These texts are often important, or are the texts users care about. Extracting them from massive text collections and analyzing their sentiment orientation has considerable practical value. For example, users can learn about products from reviews and choose a suitable brand; merchants can improve product quality based on user comments and win a larger market; and public opinion trends can be tracked to find topics of social concern. Sentiment analysis is the emerging research topic that addresses these applications.

Text orientation analysis analyzes the attitude (or opinion, sentiment) of the speaker, that is, the subjective information in a text. Sentiment classification is a basic task in sentiment analysis. It classifies text by the polarity of its sentiment orientation, specifically into positive text or negative text, and is considered more challenging than traditional topic-based text classification. For example, sentiment classification labels "I am delighted with this film" as positive text and "This film is very poor" as negative text.

At present, mainstream sentiment classification methods fall roughly into two kinds.

The first is an unsupervised learning method based on a sentiment lexicon, which relies mainly on word counting. A sentiment lexicon is used to count the positive and negative sentiment words in a sample; if the positive words outnumber the negative words, the sample is judged positive, otherwise negative. The method is simple to implement, efficient to execute and applicable to any domain, but its classification performance still falls well short of practical requirements.
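As an illustration only, the word-counting rule described above could be sketched in Python as follows. The function name `lexicon_classify` and the toy lexicon are hypothetical and do not come from the original text.

```python
def lexicon_classify(tokens, positive_words, negative_words):
    """Label a tokenized review by counting sentiment-lexicon hits."""
    pos_hits = sum(1 for t in tokens if t in positive_words)
    neg_hits = sum(1 for t in tokens if t in negative_words)
    # The rule in the text is strict: a sample is positive only when positive
    # words outnumber negative words, so ties fall to "negative".
    return "positive" if pos_hits > neg_hits else "negative"


# Hypothetical toy lexicon, for demonstration only.
POSITIVE = {"good", "great", "love", "delighted"}
NEGATIVE = {"bad", "poor", "boring"}

label = lexicon_classify("i am delighted with this film".split(), POSITIVE, NEGATIVE)
```

No training data is needed, which is why the approach carries over to any domain, but the decision ignores negation and context.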

The second is a supervised classification method based on machine learning, which consists of two processes: training and classification. Training requires a certain number of manually labeled positive and negative samples. This method achieves higher classification accuracy, but as the number of training samples grows, the number of features also rises substantially, and classification then occupies a large amount of memory. Mobile terminal devices are often limited by memory size and therefore struggle to perform the text classification task.

In addition, for some special tasks such as imbalanced sentiment classification, where the samples of one class far outnumber those of the other, the imbalance in sample counts often leads to very poor classification performance.

In view of the above, the present invention provides a sentiment information compression method and system for comment corpus. The comment corpus is compressed so that it retains as much sentiment classification information as possible, which makes it suitable for sentiment classification on mobile devices. The method can also serve some special tasks (such as imbalanced sentiment classification) by enabling compression of multi-class corpora.

Summary of the invention

For a better understanding of the present invention, the common terms and notation it involves are first described below.

Machine learning classification method (Classification Methods Based on Machine Learning): a statistical learning method for building a classifier, whose input is a vector representing a sample and whose output is the class label of the sample. Common machine learning classification methods include naive Bayes, maximum entropy and support vector machines. Comment corpus: text that comments on a product. Sentiment classification: the task of dividing text into positive text or negative text by analyzing its subjective information.

The present invention provides a sentiment information compression method for comment corpus, comprising the following steps.

S1, dividing the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples.

S2, training a classifier with a machine learning method, classifying the test sample with it, and taking the maximum posterior probability of the classification result as the representative sentiment score of each sample.

S3, sorting all samples in descending order of representative sentiment score and, according to the compression scale N, extracting the N top-ranked samples as the compressed sample set.

Preferably, in step S1, the to-be-used data is split sequentially or sampled at random to form K equal-sized sample sets.

Preferably, in step S1, one of the K parts is taken as the test sample each time and the remaining K-1 parts as training samples, iterating K times in total.

Preferably, in step S2, the machine learning method used is the maximum entropy method.

Preferably, in step S2, the posterior probability is obtained when the classifier trained with the machine learning method classifies a sample.

Preferably, in step S2, the machine learning classification method is trained on the training samples and used to classify the test sample, yielding the posterior probability that the test sample belongs to each class.

Preferably, in step S3, the N top-ranked samples form the compressed sample set, which is the final compression result.

The present invention also provides a sentiment information compression system for comment corpus, comprising a sentiment representativeness scoring module and a compression module, the scoring module being connected to the compression module. The sentiment representativeness scoring module comprises a preprocessing unit and a classifier, the preprocessing unit being connected to the classifier. The preprocessing unit divides the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples. The classifier is trained with a machine learning method, classifies the test sample, and takes the maximum posterior probability of the classification result as the representative sentiment score of each sample. The compression module comprises a sorting unit and an output unit, the sorting unit being connected to the output unit. The sorting unit sorts all samples in descending order of representative sentiment score. The output unit extracts, according to the compression scale N, the N top-ranked samples as the compressed sample set.

In the sentiment information compression method and system for comment corpus provided by the present invention, a classifier trained with a machine learning method classifies the test sample, and the maximum posterior probability of the classification result is taken as the representative sentiment score of each sample. All samples are then sorted in descending order of representative sentiment score, and the N top-ranked samples are extracted as the compressed sample set. In this way the comment corpus is compressed effectively while the sentiment classification information of the original corpus is preserved as far as possible, so that sentiment classification can be performed on mobile devices with small storage capacity.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of the sentiment information compression method for comment corpus provided by the preferred embodiment of the present invention;

Fig. 2 is a flow chart of the algorithm that assigns sentiment representativeness scores to samples, provided by the preferred embodiment of the present invention;

Fig. 3 is a flow chart of the compression algorithm provided by the preferred embodiment of the present invention;

Fig. 4 is a schematic diagram of the sentiment information compression system for comment corpus provided by the preferred embodiment of the present invention.

Detailed description

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. Note that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.

Fig. 1 is a flow chart of the sentiment information compression method for comment corpus provided by the preferred embodiment of the present invention. As shown in Fig. 1, the method provided by the preferred embodiment comprises steps S1 to S3.

Step S1: divide the to-be-used data into K parts, take one part as the test sample and the remaining K-1 parts as training samples.

Specifically, in this embodiment, the to-be-used data is split sequentially or sampled at random to form K equal-sized sample sets. One of the K parts is taken as the test sample each time and the remaining K-1 parts as training samples, iterating K times in total.
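Step S1 can be sketched as follows, purely for illustration. The helper names `split_into_k_parts` and `k_fold_rounds`, the `shuffle` and `seed` parameters, and the choice to let part sizes differ by one when the data does not divide evenly are assumptions, not details from the text.

```python
import random


def split_into_k_parts(samples, k, shuffle=False, seed=0):
    """Split samples into k parts; sizes differ by at most one.

    shuffle=False corresponds to sequential splitting,
    shuffle=True to random sampling.
    """
    items = list(samples)
    if shuffle:
        random.Random(seed).shuffle(items)
    base, extra = divmod(len(items), k)
    parts, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


def k_fold_rounds(parts):
    """Yield (training, test) pairs, holding out each part exactly once."""
    for i, test in enumerate(parts):
        training = [s for j, part in enumerate(parts) if j != i for s in part]
        yield training, test
```

Because every part is held out once over the K rounds, each sample is classified by a classifier that never saw it during training.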

Step S2: train a classifier with a machine learning method, classify the test sample with it, and take the maximum posterior probability of the classification result as the representative sentiment score of each sample.

Specifically, the posterior probability is obtained when the classifier trained with the machine learning method classifies a sample. The machine learning classification method is trained on the training samples and used to classify the test sample, yielding the posterior probability that the test sample belongs to each class.

Fig. 2 is a flow chart of the algorithm that assigns sentiment representativeness scores to samples, provided by the preferred embodiment of the present invention. In this embodiment, a document is represented as a TF (term frequency) vector: each component of the document vector is the frequency with which the corresponding word occurs in the document. The document vector is the input to the classifier implemented by the machine learning classification method.
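A minimal sketch of the TF representation described above is given here for illustration. The function name `tf_vector` and the fixed, ordered `vocabulary` argument are assumptions; the text does not say how the vocabulary is built or how documents are tokenized.

```python
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Return the term-frequency vector of a tokenized document.

    Component i is the number of times vocabulary[i] occurs in the document;
    words outside the vocabulary are ignored.
    """
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["film", "good", "bad"]                      # hypothetical vocabulary
vector = tf_vector(["good", "film", "good"], vocabulary)  # one count per vocabulary word
```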

The machine learning methods usable in this step include k-nearest neighbors, Bayes, maximum entropy, SVM and others; this embodiment uses the maximum entropy method. Maximum entropy classification is based on maximum entropy information theory. Its basic idea is to build a model for all known factors and to exclude all unknown factors. In other words, it finds a probability distribution that satisfies all known facts while leaving the unknown factors random. Compared with naive Bayes, the main characteristic of the method is that it does not require conditional independence between features. It is therefore suitable for combining various kinds of features without considering how they influence one another.

Under the maximum entropy model, the conditional probability P(c_{i} | D) is predicted as follows:

P(c_{i} | D) = \frac{1}{Z(D)} \exp\left( \sum_{k} \lambda_{k,c} F_{k,c}(D, c_{i}) \right)

where Z(D) is the normalization factor and F_{k,c} is a feature function, defined as:

F_{k,c}(D, c') = \begin{cases} 1, & n_{k}(D) > 0 \text{ and } c' = c \\ 0, & \text{otherwise} \end{cases}
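The two formulas above can be evaluated as in the following sketch, which assumes the weights λ_{k,c} have already been learned (the training procedure is not shown). The function name `maxent_posteriors`, the dictionary layout of `weights`, and the reading that the sum for class c_i runs over the features present in D with c = c_i are assumptions made for illustration.

```python
import math


def maxent_posteriors(present_features, weights, classes):
    """Compute P(c | D) for every class c under the formula above.

    present_features: the features k with n_k(D) > 0.
    weights: dict mapping (feature, class) to lambda_{k,c}; missing pairs count as 0.
    """
    # F_{k,c}(D, c_i) equals 1 only when feature k occurs in D and c == c_i,
    # so the exponent for class c_i is the sum of lambda_{k,c_i} over present features.
    exponents = {
        c: sum(weights.get((k, c), 0.0) for k in present_features)
        for c in classes
    }
    # Subtracting the maximum exponent avoids overflow and cancels in the ratio.
    top = max(exponents.values())
    unnormalized = {c: math.exp(e - top) for c, e in exponents.items()}
    z = sum(unnormalized.values())  # plays the role of Z(D)
    return {c: u / z for c, u in unnormalized.items()}


# Hypothetical weights for a two-class example.
weights = {("good", "pos"): 1.0, ("bad", "neg"): 1.5}
posteriors = maxent_posteriors({"good"}, weights, ["pos", "neg"])
```

With two classes the result reduces to a logistic function of the difference between the two exponents.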

The sentiment orientation class of a sample is decided by the posterior probabilities P(c_{+} | D) and P(c_{-} | D). The decision rule is: if P(c_{+} | D) > P(c_{-} | D), the sample is positive; otherwise the sample is negative.

The maximum posterior probability of a sample is obtained by comparing P(c_{+} | D) and P(c_{-} | D): if P(c_{+} | D) > P(c_{-} | D), the maximum posterior probability is P(c_{+} | D); otherwise it is P(c_{-} | D).
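The decision rule and the score can be expressed together in a few lines. This is an illustrative sketch; the function name `label_and_score` is hypothetical.

```python
def label_and_score(p_pos, p_neg):
    """Return (predicted label, representative sentiment score) for one sample.

    The score is the larger of the two posteriors; following the rule in the
    text, a tie is resolved as negative.
    """
    if p_pos > p_neg:
        return "positive", p_pos
    return "negative", p_neg


confident = label_and_score(0.95, 0.05)   # far from the decision boundary
borderline = label_and_score(0.52, 0.48)  # close to the decision boundary
```

For two classes the score always lies between 0.5 and 1, so a higher score means the classifier placed the sample further from the decision boundary.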

Step S3: sort all samples in descending order of representative sentiment score and, according to the compression scale N, extract the N top-ranked samples as the compressed sample set.

Specifically, in this embodiment the algorithm used to sort the samples may be chosen freely, for example bubble sort, selection sort, quick sort or merge sort; the present invention places no limitation on this. The N top-ranked samples form the compressed sample set, which is the final compression result.
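Step S3 amounts to a descending sort followed by truncation, as the sketch below illustrates. The function name `compress` and the (sample, score) pair layout are assumptions; Python's built-in `sorted` is used in place of the sorting algorithms named above, which the text says may be chosen freely.

```python
def compress(scored_samples, n):
    """Keep the n samples with the highest representative sentiment scores.

    scored_samples: iterable of (sample, score) pairs.
    """
    # sorted() is stable, so samples with equal scores keep their input order.
    ranked = sorted(scored_samples, key=lambda pair: pair[1], reverse=True)
    return [sample for sample, _ in ranked[:n]]


scored = [("review a", 0.61), ("review b", 0.97), ("review c", 0.74)]
kept = compress(scored, 2)
```

When N is much smaller than the corpus, `heapq.nlargest(n, scored_samples, key=...)` from the standard library is an alternative that avoids sorting everything.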

Fig. 3 is a flow chart of the compression algorithm provided by the preferred embodiment of the present invention. The corpus used in this embodiment is a multi-domain product review corpus covering four domains: books (Book), DVD, electronics (Electronic) and kitchen (Kitchen). Each domain has 1000 positive and 1000 negative reviews. 500 positive and 500 negative reviews are selected as the test corpus, and the remaining 7000 reviews serve as the training corpus. The evaluation criteria chosen for the experiment are compression ratio (CR) and loss rate (LR):

CR = Size_C / Size_O

LR = (Acc_O - Acc_C) / Acc_O

where Size_C is the size of the compressed corpus, Size_O is the size of the original corpus, Acc_C is the classification accuracy of the classifier trained on the compressed corpus, and Acc_O is the classification accuracy of the classifier trained on the original corpus.
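The two criteria translate directly into code. In the sketch below the function names are hypothetical, and the accuracy values passed to `loss_rate` are made-up numbers used only to show the arithmetic; they are not results from the experiment.

```python
def compression_ratio(size_c, size_o):
    """CR = Size_C / Size_O: the fraction of the original corpus that is kept."""
    return size_c / size_o


def loss_rate(acc_o, acc_c):
    """LR = (Acc_O - Acc_C) / Acc_O: the relative drop in classification accuracy."""
    return (acc_o - acc_c) / acc_o


cr = compression_ratio(1300, 7000)     # 1300 samples kept out of 7000
lr = loss_rate(acc_o=0.80, acc_c=0.78) # hypothetical accuracies
```

A lower CR means stronger compression, and an LR near zero means the compressed corpus trains a classifier almost as accurate as the original one.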

Table 1 shows the experimental results of compressing the above 7000 training reviews with the sentiment information compression method provided by the present invention. The experimental procedure is as follows: first, the training samples are divided into K parts; each part in turn is taken as the test sample, with the other K-1 parts as training samples, and the test sample is classified with the classifier; next, the maximum posterior probability of the classification result is taken as the representative sentiment score of each sample; finally, according to the compression scale N, the N samples with the highest representative sentiment scores are selected as the compressed sample set. K is set to 10 in this experiment.

As Table 1 shows, the method of the present invention compresses the training corpus effectively: at a compression ratio of 0.185 the loss rate is only 0.026. That is, 1300 training reviews achieve classification performance close to that of the original 7000.

Compression scale N	100	500	900	1300	1700
Compression ratio (CR)	0.014	0.071	0.128	0.185	0.242
Loss rate (LR)	0.145	0.080	0.056	0.026	0.028

Table 1
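As a consistency check on Table 1, the compression ratios can be recomputed from the compression scale and the 7000-sample training corpus. The sketch below assumes the tabulated values were truncated to three decimals rather than rounded (900/7000 is about 0.1286 but is listed as 0.128); the names used are hypothetical.

```python
ORIGINAL_SIZE = 7000
SCALES = [100, 500, 900, 1300, 1700]
REPORTED_CR = ["0.014", "0.071", "0.128", "0.185", "0.242"]  # copied from Table 1


def truncated_ratio(n, total, digits=3):
    """Return n / total truncated (not rounded) to `digits` decimals, as text."""
    scaled = n * 10**digits // total  # integer arithmetic avoids float error
    return f"{scaled // 10**digits}.{scaled % 10**digits:0{digits}d}"


computed = [truncated_ratio(n, ORIGINAL_SIZE) for n in SCALES]
matches = computed == REPORTED_CR
```

The loss rates cannot be recomputed in the same way, because the underlying accuracies Acc_O and Acc_C are not given in the text.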

In sentiment classification, the number of features keeps growing as training samples are added, so traditional sentiment classification methods are hard to run on mobile devices with little storage. The method of the present invention compresses the training samples effectively, avoids the heavy storage demand of sentiment classification, and enables high-accuracy sentiment classification on mobile devices. The present invention can also assist other tasks that need a compressed corpus and is applicable to any environment in which a corpus must be compressed.

Fig. 4 is a schematic diagram of the sentiment information compression system for comment corpus provided by the preferred embodiment of the present invention. As shown in Fig. 4, the system comprises a sentiment representativeness scoring module 1 and a compression module 2, the scoring module 1 being connected to the compression module 2. The sentiment representativeness scoring module 1 comprises a preprocessing unit 11 and a classifier 12, the preprocessing unit 11 being connected to the classifier 12. The preprocessing unit 11 divides the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples. The classifier 12 is trained with a machine learning method, classifies the test sample, and takes the maximum posterior probability of the classification result as the representative sentiment score of each sample. The compression module 2 comprises a sorting unit 21 and an output unit 22, the sorting unit 21 being connected to the output unit 22. The sorting unit 21 sorts all samples in descending order of representative sentiment score, and the output unit 22 extracts, according to the compression scale N, the N top-ranked samples as the compressed sample set. The operation of the system is similar to that of the method of the present invention and is therefore not repeated here.
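The module structure of Fig. 4 can be mirrored in code roughly as follows. This is a sketch under stated assumptions: the class names, the `trainer` callback interface and `toy_trainer` are all hypothetical, and `toy_trainer` is a smoothed word-count stand-in used only to make the example run, not the maximum entropy classifier of the embodiment.

```python
from collections import Counter


class ScoringModule:
    """Module 1: preprocessing unit (K-way split) plus classifier."""

    def __init__(self, trainer, k):
        self.trainer = trainer  # callable: training samples -> predict function
        self.k = k

    def split(self, samples):
        # Sequential split into k contiguous parts of near-equal size.
        n = len(samples)
        return [samples[n * i // self.k: n * (i + 1) // self.k] for i in range(self.k)]

    def score(self, samples):
        parts = self.split(samples)
        scored = []
        for i, test in enumerate(parts):
            training = [s for j, p in enumerate(parts) if j != i for s in p]
            predict = self.trainer(training)
            for tokens, label in test:
                # Representative sentiment score = maximum posterior probability.
                scored.append(((tokens, label), max(predict(tokens).values())))
        return scored


class CompressionModule:
    """Module 2: sorting unit plus output unit."""

    def compress(self, scored, n):
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        return [sample for sample, _ in ranked[:n]]


def toy_trainer(training):
    """Hypothetical stand-in classifier based on add-one smoothed word counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in training:
        counts[label].update(tokens)

    def predict(tokens):
        pos = 1 + sum(counts["pos"][t] for t in tokens)
        neg = 1 + sum(counts["neg"][t] for t in tokens)
        return {"pos": pos / (pos + neg), "neg": neg / (pos + neg)}

    return predict


corpus = [
    (["great", "film"], "pos"), (["bad", "plot"], "neg"),
    (["great", "cast"], "pos"), (["bad", "film"], "neg"),
    (["fine"], "pos"), (["dull"], "neg"),
]
scored = ScoringModule(toy_trainer, k=3).score(corpus)
compressed = CompressionModule().compress(scored, n=2)
```

Passing the trainer in as a callback keeps the scoring module independent of the classification method, in line with the text's statement that several machine learning methods may be used.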

In the sentiment information compression method and system for comment corpus provided by the preferred embodiment of the present invention, the whole training corpus is divided into K parts, a classifier trained on K-1 of them classifies the remaining part, and the maximum posterior probability of the classification result is taken as the sentiment representativeness score. This makes full use of the existing samples, so no additional samples are needed to train the classifier. Moreover, the samples selected for high sentiment representativeness carry richer sentiment classification information and can help achieve better classification performance. Finally, once the comment corpus is compressed it no longer occupies too much memory and can be ported to mobile devices.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. The present invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A sentiment information compression method for comment corpus, characterized by comprising the following steps:

S1, dividing the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples;

S2, training a classifier with a machine learning method, classifying the test sample with it, and taking the maximum posterior probability of the classification result as the representative sentiment score of each sample;

S3, sorting all samples in descending order of representative sentiment score and, according to the compression scale N, extracting the N top-ranked samples as the compressed sample set.

2. The method according to claim 1, characterized in that, in step S1, the to-be-used data is split sequentially or sampled at random to form K equal-sized sample sets.

3. The method according to claim 1, characterized in that, in step S1, one of the K parts is taken as the test sample each time and the remaining K-1 parts as training samples, iterating K times in total.

4. The method according to claim 1, characterized in that, in step S2, the machine learning method used is the maximum entropy method.

5. The method according to claim 1, characterized in that, in step S2, the posterior probability is obtained when the classifier trained with the machine learning method classifies a sample.

6. The method according to claim 1, characterized in that, in step S2, the machine learning classification method is trained on the training samples and used to classify the test sample, yielding the posterior probability that the test sample belongs to each class.

7. The method according to claim 1, characterized in that, in step S3, the N top-ranked samples form the compressed sample set, which is the final compression result.

8. A sentiment information compression system for comment corpus, characterized by comprising a sentiment representativeness scoring module and a compression module, the sentiment representativeness scoring module being connected to the compression module,

the sentiment representativeness scoring module comprising a preprocessing unit and a classifier, the preprocessing unit being connected to the classifier,

the preprocessing unit being configured to divide the to-be-used data into K parts, taking one part as the test sample and the remaining K-1 parts as training samples;

the classifier being trained with a machine learning method and configured to classify the test sample and to take the maximum posterior probability of the classification result as the representative sentiment score of each sample;

the compression module comprising a sorting unit and an output unit, the sorting unit being connected to the output unit,

the sorting unit being configured to sort all samples in descending order of representative sentiment score;

the output unit being configured to extract, according to the compression scale N, the N top-ranked samples as the compressed sample set.