CN103617245A

CN103617245A - Bilingual sentiment classification method and device

Info

Publication number: CN103617245A
Application number: CN201310616753.0A
Authority: CN
Inventors: 李寿山; 苏艳; 周国栋
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2013-11-27
Filing date: 2013-11-27
Publication date: 2014-03-05

Abstract

The invention provides a bilingual sentiment classification method and device. The method comprises: translating the source-language documents to be classified and the source-language documents of a training sample set to obtain translated documents to be classified and translated documents of the training sample set; combining the source-language documents to be classified with the translated documents to be classified to obtain bilingual documents to be classified, and combining the source-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set; building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set; training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and performing sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier. The method and device combine the features of two languages, which provides additional information for sentiment classification and improves classification accuracy; they also extract the important feature items from the bilingual feature vector spaces, which improves classification efficiency.

Description

Bilingual sentiment classification method and device

Technical field

The present invention relates to the technical field of information processing, and in particular to a bilingual sentiment classification method and device.

Background

In recent years, sentiment classification technology has shown great demand and promise in fields such as e-commerce, public opinion analysis, and information security. It can help to understand users' consumption habits and the strengths and weaknesses of products by automatically analysing product reviews; to gauge public satisfaction and needs and detect emerging social problems in time; and to analyse current hot topics in public opinion, giving users, enterprises, and governments an important basis for decisions. Sentiment classification methods in the prior art mainly target a single language, chiefly English.

In the course of realising the invention, the inventors found that prior-art sentiment classification methods can produce classification errors that lower classification accuracy. For example, in the English sentence "It looks like a book", the word "like" may be taken as a positive word (a synonym of "enjoy"); if it is, the classification result will be wrong.

Summary of the invention

In view of this, the invention provides a bilingual sentiment classification method and device to solve the problem that prior-art sentiment classification methods produce classification errors and reduce classification accuracy. The technical solution is as follows:

A bilingual sentiment classification method, comprising:

translating the source document to be classified and the source documents of a training sample set to obtain a translated document to be classified and translated documents of the training sample set;

combining the source document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combining the source documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set;

building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;

training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and

performing sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.

Training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model comprises:

determining the weight of each feature item in the bilingual feature vector space of the training sample set; and

training the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.

Determining the weight of each feature item in the bilingual feature vector space of the training sample set comprises:

calculating the weight of each feature item in the bilingual feature vector space of the training sample set using the CHI feature extraction method.

Building the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set comprises:

performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and

selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.

Translating the source document to be classified and the source documents of the training sample set comprises:

translating the source document to be classified and the source documents of the training sample set using the machine translation system Google Translate.

A bilingual sentiment classification device, comprising:

a translation unit for translating the source document to be classified and the source documents of a training sample set to obtain a translated document to be classified and translated documents of the training sample set;

a combination unit for combining the source document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combining the source documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set;

a construction unit for building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;

a training unit for training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and

a classification unit for performing sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.

The training unit comprises:

a determining subunit for determining the weight of each feature item in the bilingual feature vector space of the training sample set; and

a training subunit for training the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.

The determining subunit comprises:

a computation subunit for calculating the weight of each feature item in the bilingual feature vector space of the training sample set using the CHI feature extraction method.

The construction unit comprises:

a word segmentation subunit for performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and

a building subunit for selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.

The translation unit comprises:

a translation subunit for translating the source document to be classified and the source documents of the training sample set using the machine translation system Google Translate.

The above technical solution has the following beneficial effects:

The bilingual sentiment classification method and device provided by the invention combine the source document and the translated document into a bilingual document, form a bilingual feature vector space through feature expansion, train a classifier on that space using the maximum entropy method, and perform sentiment classification according to the posterior probability. Adding bilingual features to sentiment classification compensates for the insufficient classification information of a single language, and combining the two languages can resolve ambiguity, which improves the accuracy of sentiment classification. In addition, extracting the more important feature items from the bilingual feature vector space reduces its dimension, shortens the classification time, and improves classification efficiency.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the bilingual sentiment classification method provided by Embodiment 1 of the invention;

Fig. 2 is a schematic flowchart of the bilingual sentiment classification method provided by Embodiment 2 of the invention;

Fig. 3 shows the experimental results of sentiment classification of reviews in four domains using the bilingual sentiment classification method provided by an embodiment of the invention;

Fig. 4 shows the experimental results of sentiment classification of documents in four domains using the bilingual sentiment classification method provided by an embodiment of the invention;

Fig. 5 is a schematic structural diagram of the bilingual sentiment classification device provided by Embodiment 3 of the invention;

Fig. 6 is a schematic structural diagram of the bilingual sentiment classification device provided by Embodiment 4 of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

Embodiment 1

Referring to Fig. 1, a schematic flowchart of a bilingual sentiment classification method provided by Embodiment 1 of the invention, the method comprises:

Step S101: translate the source document to be classified and the source documents of the training sample set to obtain a translated document to be classified and translated documents of the training sample set.

In this embodiment, a machine translation system such as Google Translate can be used to translate the source document to be classified and the source documents of the training sample set. For example, if the source documents are Chinese documents, Google Translate can be used to translate them into English documents.

Step S102: combine the source document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combine the source documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.

Step S103: build a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set.

In this embodiment, building the two bilingual feature vector spaces can comprise: performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.

The bilingual feature vector space can be expressed as F = (e_1, e_2, ..., e_n, c_1, c_2, ..., c_n), where e_1, e_2, ..., e_n are the feature items of the source document and c_1, c_2, ..., c_n are the feature items of the corresponding translated document.
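Steps S102 and S103 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents are taken to be word-segmented already (tokens separated by spaces), the `src:` and `trans:` prefixes are a hypothetical convention for keeping the features of the two languages apart, and all function names are invented for the example.

```python
from collections import Counter

def tokenize(text):
    """Whitespace tokenizer; assumes word segmentation was done beforehand."""
    return text.lower().split()

def bilingual_features(source_text, translated_text):
    """Unigram counts of the source document plus those of its translation."""
    features = Counter()
    for token in tokenize(source_text):
        features["src:" + token] += 1      # e_1 ... e_n in the notation of F
    for token in tokenize(translated_text):
        features["trans:" + token] += 1    # c_1 ... c_n in the notation of F
    return features

def build_vocabulary(feature_dicts):
    """Map every feature seen in the training set to a column index."""
    vocab = {}
    for features in feature_dicts:
        for name in features:
            vocab.setdefault(name, len(vocab))
    return vocab

def vectorize(features, vocab):
    """Dense count vector; features outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for name, count in features.items():
        if name in vocab:
            vector[vocab[name]] = count
    return vector

# Toy training pairs: (segmented Chinese review, its English translation).
train = [("质量 很 好", "the quality is very good"),
         ("质量 很 差", "the quality is very poor")]
train_features = [bilingual_features(s, t) for s, t in train]
vocab = build_vocabulary(train_features)
vectors = [vectorize(f, vocab) for f in train_features]
```

A document to be classified would be passed through the same `bilingual_features` and `vectorize` calls with the training vocabulary, so that both vector spaces share one set of columns.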

Step S104: train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model.

Step S105: perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.

After a bilingual feature vector is input to the classifier, the sentiment polarity is judged from the posterior probabilities it returns: the category with the largest posterior probability is taken as the final classification result.
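The decision rule of step S105 amounts to an argmax over the returned posteriors. A minimal sketch follows, in which the label names and the probability values are made up for illustration:

```python
def classify(posteriors):
    """Return the label whose posterior probability is largest."""
    return max(posteriors, key=posteriors.get)

# Hypothetical posteriors returned by a trained classifier for one document.
posteriors = {"positive": 0.73, "negative": 0.27}
label = classify(posteriors)
```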

The bilingual sentiment classification method provided by Embodiment 1 of the invention combines the source document and the translated document into a bilingual document, forms a bilingual feature vector space through feature expansion, trains a classifier on that space using the maximum entropy method, and performs sentiment classification according to the posterior probability. Adding bilingual features to sentiment classification compensates for the insufficient classification information of a single language, and combining the two languages can resolve ambiguity, which improves the accuracy of sentiment classification.

Embodiment 2

Referring to Fig. 2, a schematic flowchart of a bilingual sentiment classification method provided by Embodiment 2 of the invention, the method comprises:

Step S201: translate the source document to be classified and the source documents of the training sample set to obtain a translated document to be classified and translated documents of the training sample set.

In this embodiment, a machine translation system such as Google Translate can be used to translate the source document to be classified and the source documents of the training sample set.

Step S202: combine the source document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combine the source documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.

Step S203: build a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set.

For example, the source documents are Chinese documents and the translated documents are English documents. The bilingual feature vector space can be expressed as F = (e_1, e_2, ..., e_n, c_1, c_2, ..., c_n), where e_1, e_2, ..., e_n are the features of the Chinese document and c_1, c_2, ..., c_n are the features of the corresponding English document.

Step S204: determine the weight of each feature item in the bilingual feature vector space of the training sample set, and train the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.

Because the bilingual feature vector space has a large dimension, which lengthens the classification time and lowers classification efficiency, this embodiment applies a feature extraction step: the feature items whose weights are greater than a preset value are selected from the bilingual feature vector space of the training sample set to form the bilingual feature vector, and the classifier is trained on this vector. Feature extraction reduces the dimension of the feature vector while preserving classification quality, which shortens the classification time and improves classification efficiency.

In this embodiment, the CHI feature extraction method can be used to calculate the weight of each feature item in the bilingual feature vector space of the training sample set; the larger the weight, the more important the corresponding feature. Once the weights are determined, the feature items can be sorted by weight in descending order, and the top N feature items whose weights are greater than the preset value are selected in turn to form the bilingual feature vector used to train the classifier.
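The selection rule just described (sort by weight in descending order, keep the top N items above the preset value) can be sketched as follows. The weights, the threshold, and the function name are hypothetical values chosen for the example, not figures from the experiments.

```python
def select_features(weights, threshold, top_n):
    """Keep at most top_n features whose weight exceeds threshold, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in ranked if weight > threshold][:top_n]

# Hypothetical CHI weights for five bilingual feature items.
weights = {"src:好": 9.1, "trans:good": 8.4, "trans:the": 0.2,
           "src:差": 7.7, "trans:is": 0.1}
selected = select_features(weights, threshold=1.0, top_n=2)
```

Only the selected feature names would then be used as columns when the training vectors are built.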

In addition, some of the samples in the training sample set of this embodiment are positive and some are negative, and the maximum entropy model learns a binary classification model from the training set.
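For two classes, a conditional maximum entropy model has the same form as logistic regression, so its training can be illustrated with plain gradient ascent on the log-likelihood. The sketch below is an assumption-laden toy: the optimiser, the learning rate, the epoch count, and the data are all chosen for illustration, and the text does not say which training algorithm was actually used.

```python
import math

def train_maxent(vectors, labels, epochs=200, learning_rate=0.5):
    """Fit a binary maximum entropy (logistic) model by batch gradient ascent.

    labels are 1 (positive) or 0 (negative); returns (weights, bias).
    """
    dim = len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            p = predict_positive(weights, bias, x)
            error = y - p                      # gradient of the log-likelihood
            grad_b += error
            for j, value in enumerate(x):
                grad_w[j] += error * value
        bias += learning_rate * grad_b / n
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias

def predict_positive(weights, bias, x):
    """Posterior probability of the positive class for feature vector x."""
    score = bias + sum(w * v for w, v in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-score))

# Toy data: column 0 appears in positive samples, column 1 in negative ones.
vectors = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [1, 1, 0, 0]
weights, bias = train_maxent(vectors, labels)
```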

It should be noted that the CHI statistic measures the independence between a feature item and a category. It is based on the following assumption: terms that occur frequently in the texts of a specified category and terms that occur frequently in the texts of other categories are both helpful for judging whether a document belongs to that category. The CHI method is defined as follows:

χ^{2} (t) = \mathrm{avg}_{i = 1}^{m} \, χ^{2} (t, c_{i})

where

χ^{2} (t, c_{i}) = \frac{N \left( p (t, c_{i}) \cdot p (\bar{t}, \bar{c_{i}}) - p (t, \bar{c_{i}}) \cdot p (\bar{t}, c_{i}) \right)^{2}}{p (t) \cdot p (\bar{t}) \cdot p (c_{i}) \cdot p (\bar{c_{i}})}

Here, p (t) is the probability that a document x contains feature t; p (\bar{c_{i}}) is the probability that x does not belong to category c_{i}; p (t, c_{i}) is the joint probability that x contains feature t and belongs to category c_{i}; p (c_{i} | t) is the probability that x belongs to category c_{i} given that it contains feature t; and p (\bar{t} | c_{i}) is the probability that x does not contain feature t given that it belongs to category c_{i}. p (c_{i}), p (t | c_{i}), and the remaining probabilities are defined similarly. The estimates given by the CHI statistic are comparatively reliable and stable.
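The formula for χ^{2} (t, c_{i}) can be evaluated directly from document counts. The sketch below assumes that N is the total number of training documents (the text does not define N) and estimates every probability as a relative frequency; the function names and the counts are illustrative.

```python
def chi_square(n_t_c, n_t_notc, n_nott_c, n_nott_notc):
    """CHI statistic of a feature t for one category c, from four document counts.

    n_t_c:        documents that contain t and belong to c
    n_t_notc:     documents that contain t and do not belong to c
    n_nott_c:     documents that lack t and belong to c
    n_nott_notc:  documents that lack t and do not belong to c
    """
    n = n_t_c + n_t_notc + n_nott_c + n_nott_notc   # assumed meaning of N
    p_t_c, p_t_notc = n_t_c / n, n_t_notc / n
    p_nott_c, p_nott_notc = n_nott_c / n, n_nott_notc / n
    p_t, p_nott = p_t_c + p_t_notc, p_nott_c + p_nott_notc
    p_c, p_notc = p_t_c + p_nott_c, p_t_notc + p_nott_notc
    denominator = p_t * p_nott * p_c * p_notc
    if denominator == 0:
        return 0.0                                  # degenerate table: no evidence
    return n * (p_t_c * p_nott_notc - p_t_notc * p_nott_c) ** 2 / denominator

def chi_square_avg(tables):
    """Average of the per-category statistics, as in the definition of chi^2(t)."""
    scores = [chi_square(*table) for table in tables]
    return sum(scores) / len(scores)

# Toy counts: 100 documents, 50 positive and 50 negative;
# the term occurs in 40 positive and 10 negative documents.
score_positive = chi_square(40, 10, 10, 40)
score_term = chi_square_avg([(40, 10, 10, 40), (10, 40, 40, 10)])
```

With these counts the statistic is 36 for either category, whereas a term spread evenly over both categories (25 documents in each cell) scores 0.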

The maximum entropy model is the theoretical foundation of the maximum entropy classifier. Its basic idea is to build a model for all known factors and to exclude all unknown factors; that is, to find a probability distribution that satisfies all known facts and is not influenced by any unknown factor.

Suppose x is a feature vector and y is the output category of a sample, so that p (y | x) is the probability that the sample is predicted to belong to a given category. The maximum entropy model requires that p (y | x), subject to certain constraints, maximise the entropy defined below; that is, it outputs the most uniform distribution allowed by the constraint set:

H (p) = - \underset{x, y}{Σ} \tilde{p} (x) p (y | x) \log p (y | x)

Here H (p) is written in place of H (Y | X). The conditional entropy H (Y | X) is a mathematical measure of the uniformity of the conditional probability p (y | x), and the notation H (p) emphasises its dependence on the probability distribution p. For any given constraint set C, the task is to find, among all models that satisfy C, the model p^{*} for which H (p) is largest:

p^{*} = \underset{p \in C}{\arg\max} \, H (p)

where p is a statistical model that satisfies the constraint set C.
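A small numerical check can make the entropy criterion concrete. The sketch below evaluates H (p) for two hand-written conditional distributions over two hypothetical documents; with no constraints beyond normalisation, the uniform distribution attains the larger value. All names and numbers are illustrative.

```python
import math

def conditional_entropy(p_x, p_y_given_x):
    """H(p) = -sum over x, y of p~(x) * p(y|x) * log p(y|x).

    Terms with p(y|x) = 0 contribute nothing, by the usual convention.
    """
    total = 0.0
    for x, weight in p_x.items():
        for prob in p_y_given_x[x].values():
            if prob > 0:
                total -= weight * prob * math.log(prob)
    return total

# Empirical distribution over two hypothetical documents.
p_x = {"doc_a": 0.5, "doc_b": 0.5}
uniform = {"doc_a": {"pos": 0.5, "neg": 0.5}, "doc_b": {"pos": 0.5, "neg": 0.5}}
peaked = {"doc_a": {"pos": 0.9, "neg": 0.1}, "doc_b": {"pos": 0.2, "neg": 0.8}}

h_uniform = conditional_entropy(p_x, uniform)
h_peaked = conditional_entropy(p_x, peaked)
```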

The weight of feature f_{i} is denoted by the parameter λ_{i}, and the final probability output of the maximum entropy model is:

p_{λ} (y | x) = \frac{1}{Z_{λ} (x)} \exp (\underset{i}{Σ} λ_{i} f_{i} (x, y))

where Z_{λ} (x) is the normalisation factor:

Z_{λ} (x) = \underset{y}{Σ} \exp (\underset{i}{Σ} λ_{i} f_{i} (x, y))
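The probability output and its normalisation factor can be computed as in the sketch below. It assumes, for illustration only, that each f_{i} (x, y) is an indicator that fires when a given feature item is present in x and the label is y, and that the weights λ_{i} are stored in a dictionary keyed by (feature, label); the weight values are invented.

```python
import math

def maxent_posterior(active_features, labels, lambdas):
    """p(y|x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x) for every label y."""
    scores = {}
    for y in labels:
        # Sum of the weights of the indicator features that fire for (x, y).
        scores[y] = sum(lambdas.get((name, y), 0.0) for name in active_features)
    top = max(scores.values())                 # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exp_scores.values())               # normalisation factor Z(x), rescaled
    return {y: value / z for y, value in exp_scores.items()}

# Hypothetical weights for three (feature, label) indicators.
lambdas = {("src:好", "positive"): 1.2,
           ("trans:good", "positive"): 0.9,
           ("src:差", "negative"): 1.5}
posterior = maxent_posterior(["src:好", "trans:good"],
                             ["positive", "negative"], lambdas)
```

Subtracting the largest score before exponentiating leaves the ratio unchanged and only guards against overflow.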

Step S205: perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.

To demonstrate the effectiveness of the method provided by this embodiment for sentiment classification, experiments were carried out separately in four Chinese product review domains: bags, electronic products, cosmetics, and hotels. Each domain contains 1000 positive and 1000 negative reviews. In each experiment, 80% of the samples were selected as the initial training samples and the remaining 20% as the test samples, with accuracy as the evaluation metric.
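The evaluation protocol (an 80/20 split scored by accuracy) can be sketched as follows. The text does not say how the 80% were chosen, so the seeded random shuffle here is an assumption, and the review strings are placeholders rather than experimental data.

```python
import random

def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed, then split into training and test samples."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, gold):
    """Fraction of test samples whose predicted label equals the gold label."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Placeholder domain: 1000 positive and 1000 negative reviews.
samples = [("review %d" % i, "positive" if i < 1000 else "negative")
           for i in range(2000)]
train, test = split_samples(samples)
demo_accuracy = accuracy(["positive", "negative", "positive", "negative"],
                         ["positive", "negative", "negative", "negative"])
```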

Fig. 3 shows the experimental results of sentiment classification of the reviews in the four domains using the classification method provided by the embodiment of the invention. "CN" denotes the sentiment classification result obtained in the prior art by training the classifier on source-language text only. "CN+Trans" denotes the result obtained with the bilingual sentiment classification method provided by the embodiment of the invention: the classifier is trained on the feature vector space formed by the feature items of the source language and the feature items of the translation language, and is then used for sentiment classification.

The comparison in Fig. 3 shows that the sentiment classification method provided by the embodiment of the invention clearly improves classification accuracy over the prior-art method: by 3.4% on average across the four domains, with an improvement of 5% in the bag domain and 4% in the hotel domain.

There are two reasons for the clear improvement in classification accuracy. First, multiple languages in themselves provide additional helpful information for sentiment classification. Second, sentiment words in source-language text can be ambiguous, and prior-art sentiment classification methods cannot avoid the resulting classification errors, which lowers classification accuracy; once the features of the translated document are added, these sentiment words are likely to be unambiguous in the context of the translation language.

Fig. 4 shows the experimental results of sentiment classification of the documents in the four domains using the sentiment classification method provided by the embodiment of the invention. The four curves show the sentiment classification results in the four domains after a subset of features has been selected; the horizontal axis is the number of features extracted during feature extraction, and the vertical axis is the sentiment classification accuracy.

As Fig. 4 shows, the classification accuracy obtained with the CHI feature extraction method generally rises as the number of features increases. When the number of features is small, sentiment classification accuracy is relatively low. Once the number of features reaches 1500 or more, the classification performance in all four domains approaches its peak and remains essentially constant: selecting 1500 features achieves the same sentiment classification performance as using all features. Feature extraction can therefore greatly reduce the dimension of the feature vector without losing sentiment classification accuracy, which improves classification efficiency.

The bilingual sentiment classification method provided by Embodiment 2 of the invention combines the source document and the translated document into a bilingual document, forms a bilingual feature vector space through feature expansion, trains a classifier on that space using the maximum entropy method, and performs sentiment classification according to the posterior probability. Adding bilingual features to sentiment classification compensates for the insufficient classification information of a single language, and combining the two languages can resolve ambiguity, which improves the accuracy of sentiment classification. In addition, extracting the more important feature items from the bilingual feature vector space reduces its dimension, shortens the classification time, and improves classification efficiency.

Embodiment 3

Referring to Fig. 5, a schematic structural diagram of a bilingual sentiment classification device provided by Embodiment 3 of the invention, the device can comprise: a translation unit 101, a combination unit 102, a construction unit 103, a training unit 104, and a classification unit 105, wherein:

The translation unit 101 is used to translate the source document to be classified and the source documents of the training sample set to obtain a translated document to be classified and translated documents of the training sample set.

The combination unit 102 is used to combine the source document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and to combine the source documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.

The construction unit 103 is used to build a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set.

The training unit 104 is used to train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model.

The classification unit 105 is used to perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.
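One way to picture how the five units fit together is as pluggable callables on a single object. The sketch below is purely illustrative: the class name, the toy lexicon standing in for machine translation, and the vote-counting trainer standing in for the maximum entropy trainer are all hypothetical and are not part of the described device.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Sequence

Classifier = Callable[[List[str]], str]

@dataclass
class BilingualSentimentDevice:
    translate: Callable[[str], str]                                     # unit 101
    combine: Callable[[str, str], str]                                  # unit 102
    build_features: Callable[[str], List[str]]                          # unit 103
    train: Callable[[Sequence[List[str]], Sequence[str]], Classifier]   # unit 104

    def _features(self, document: str) -> List[str]:
        bilingual = self.combine(document, self.translate(document))
        return self.build_features(bilingual)

    def fit(self, documents: Sequence[str], labels: Sequence[str]):
        self._classifier = self.train([self._features(d) for d in documents], labels)
        return self

    def classify(self, document: str) -> str:                           # unit 105
        return self._classifier(self._features(document))

def toy_translate(text):
    """Stand-in for a machine translation system (two-word lexicon)."""
    lexicon = {"好": "good", "差": "poor"}
    return " ".join(lexicon.get(token, token) for token in text.split())

def toy_train(features, labels):
    """Stand-in trainer: each token votes for the labels it was seen with."""
    votes = {}
    for tokens, label in zip(features, labels):
        for token in tokens:
            votes.setdefault(token, Counter())[label] += 1
    def classifier(tokens):
        tally = Counter()
        for token in tokens:
            tally.update(votes.get(token, {}))
        return tally.most_common(1)[0][0]
    return classifier

device = BilingualSentimentDevice(
    translate=toy_translate,
    combine=lambda source, translated: source + " " + translated,
    build_features=str.split,
    train=toy_train,
).fit(["好", "差"], ["positive", "negative"])
```

Keeping the units as separate callables mirrors the remark later in the description that the functions of the units may be realised in one or several pieces of software.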

The bilingual sentiment classification device provided by Embodiment 3 of the invention combines the source document and the translated document into a bilingual document, forms a bilingual feature vector space through feature expansion, trains a classifier on that space using the maximum entropy method, and performs sentiment classification according to the posterior probability. Adding bilingual features to sentiment classification compensates for the insufficient classification information of a single language, and combining the two languages can resolve ambiguity, which improves the accuracy of sentiment classification.

Embodiment 4

Referring to Fig. 6, a schematic structural diagram of a bilingual sentiment classification device provided by Embodiment 4 of the invention, the device can comprise: a translation unit 101, a combination unit 102, a construction unit 103, a training unit 104, and a classification unit 105.

Further, the translation unit 101 comprises a translation subunit 1011. The translation subunit 1011 is used to translate the source document to be classified and the source documents of the training sample set using the machine translation system Google Translate.

Further, the construction unit 103 comprises a word segmentation subunit 1031 and a building subunit 1032. The word segmentation subunit 1031 is used to perform word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; the building subunit 1032 is used to select the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.

Further, the training unit 104 comprises a determining subunit 1041 and a training subunit 1042. The determining subunit 1041 is used to determine the weight of each feature item in the bilingual feature vector space of the training sample set; the training subunit 1042 is used to train the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.

The classification unit 105 is used to perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier to obtain the sentiment classification result of the source document to be classified.

The bilingual sentiment classification device provided by Embodiment 4 of the invention combines the source document and the translated document into a bilingual document, forms a bilingual feature vector space through feature expansion, trains a classifier on that space using the maximum entropy method, and performs sentiment classification according to the posterior probability. Adding bilingual features to sentiment classification compensates for the insufficient classification information of a single language, and combining the two languages can resolve ambiguity, which improves the accuracy of sentiment classification. In addition, extracting the more important feature items from the bilingual feature vector space reduces its dimension, shortens the classification time, and improves classification efficiency.

For convenience of description, the above device is described with its functions divided into various units. Of course, when the invention is implemented, the functions of the units can be realised in one or more pieces of software and/or hardware.

The embodiments in this specification are described progressively: identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, refer to the description of the method embodiments. The system embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations.

The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A bilingual sentiment classification method, characterized by comprising:

2. The method according to claim 1, characterized in that training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model comprises:

3. The method according to claim 2, characterized in that determining the weight of each feature item in the bilingual feature vector space of the training sample set comprises:

4. The method according to claim 1, characterized in that building the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set comprises:

5. The method according to claim 1, characterized in that translating the source document to be classified and the source documents of the training sample set comprises:

6. A bilingual sentiment classification device, characterized by comprising:

7. The device according to claim 6, characterized in that the training unit comprises:

8. The device according to claim 7, characterized in that the determining subunit comprises:

9. The device according to claim 6, characterized in that the construction unit comprises:

10. The device according to claim 6, characterized in that the translation unit comprises: