CN103617245A - Bilingual sentiment classification method and device - Google Patents


Publication number
CN103617245A
Authority
CN
China
Prior art keywords
bilingual
sample set
sorted
document
vector space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310616753.0A
Other languages
Chinese (zh)
Inventor
李寿山
苏艳
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN201310616753.0A priority Critical patent/CN103617245A/en
Publication of CN103617245A publication Critical patent/CN103617245A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a bilingual sentiment classification method and device. The method comprises: translating the original-language documents to be classified and the original-language documents of a training sample set to obtain translated documents to be classified and translated documents of the training sample set; combining the original-language documents to be classified with the translated documents to be classified to obtain bilingual documents to be classified, and combining the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set; building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set; training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and performing sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier. By combining the features of two languages, the method and device supply additional classification information for sentiment classification and improve classification accuracy; by extracting the more important feature items from the bilingual feature vector space, they also improve classification efficiency.

Description

Bilingual sentiment classification method and device
Technical field
The present invention relates to the technical field of information processing, and in particular to a bilingual sentiment classification method and device.
Background art
In recent years, sentiment classification technology has shown great application demand and prospects in fields such as e-commerce, public opinion analysis, and information security. Sentiment classification can help to understand users' consumption habits and the strengths and weaknesses of products by automatically analyzing product reviews and supporting decisions; to understand the public's satisfaction and needs and to detect emerging social problems in time; and to analyze the hot topics of current public opinion, providing an important reference for decisions by users, enterprises, governments, and others. Sentiment classification methods in the prior art mainly target a single language, chiefly English.
In the course of making the invention, the inventors found that prior-art sentiment classification methods can produce erroneous classification results, which lowers classification accuracy. For example, in the English sentence "It looks like a book", the word "like" may be taken for a positive word (a synonym of "enjoy"); if it is, the classification result will be wrong.
Summary of the invention
In view of this, the invention provides a bilingual sentiment classification method and device to solve the problem that prior-art sentiment classification methods produce erroneous classification results and thereby reduce classification accuracy. The technical solution is as follows:
A bilingual sentiment classification method, comprising:
translating an original-language document to be classified and the original-language documents of a training sample set, to obtain a translated document to be classified and translated documents of the training sample set;
combining the original-language document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combining the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set;
building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;
training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and
performing sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier, to obtain the sentiment classification result of the original-language document to be classified.
Wherein training a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model comprises:
determining the weight of each feature item in the bilingual feature vector space of the training sample set; and
training the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
Wherein determining the weight of each feature item in the bilingual feature vector space of the training sample set comprises:
calculating the weight of each feature item in the bilingual feature vector space of the training sample set using the CHI feature extraction method.
Wherein building the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set comprises:
performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and
selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
Wherein translating the original-language document to be classified and the original-language documents of the training sample set comprises:
translating the original-language document to be classified and the original-language documents of the training sample set using the machine translation system Google Translate.
A bilingual sentiment classification device, comprising:
a translation unit, configured to translate an original-language document to be classified and the original-language documents of a training sample set, to obtain a translated document to be classified and translated documents of the training sample set;
a combination unit, configured to combine the original-language document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and to combine the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set;
a construction unit, configured to build a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;
a training unit, configured to train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model; and
a classification unit, configured to perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier, to obtain the sentiment classification result of the original-language document to be classified.
Wherein the training unit comprises:
a determination subunit, configured to determine the weight of each feature item in the bilingual feature vector space of the training sample set; and
a training subunit, configured to train the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
Wherein the determination subunit comprises:
a calculation subunit, configured to calculate the weight of each feature item in the bilingual feature vector space of the training sample set using the CHI feature extraction method.
Wherein the construction unit comprises:
a word segmentation subunit, configured to perform word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and
a building subunit, configured to select the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
Wherein the translation unit comprises:
a translation subunit, configured to translate the original-language document to be classified and the original-language documents of the training sample set using the machine translation system Google Translate.
The above technical solution has the following beneficial effects:
In the bilingual sentiment classification method and device provided by the invention, the original-language document and its translated document are combined into a bilingual document, a bilingual feature vector space is formed by feature expansion, a classifier is trained on the bilingual feature vector space using the maximum entropy method, and sentiment classification is performed according to the posterior probability. By adding bilingual features to sentiment classification, the application compensates for the insufficient classification information available in a single language; combining two languages helps to resolve ambiguity and improves the accuracy of sentiment classification. In addition, the more important feature items are extracted from the bilingual feature vector space, which reduces its dimension, shortens the sentiment classification time, and improves classification efficiency.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the bilingual sentiment classification method provided by Embodiment 1 of the invention;
Fig. 2 is a schematic flowchart of the bilingual sentiment classification method provided by Embodiment 2 of the invention;
Fig. 3 shows the experimental results of sentiment classification of reviews in four domains using the bilingual sentiment classification method provided by the embodiments of the invention;
Fig. 4 shows the experimental results of sentiment classification of documents in four domains using the bilingual sentiment classification method provided by the embodiments of the invention;
Fig. 5 is a schematic structural diagram of the bilingual sentiment classification device provided by Embodiment 3 of the invention;
Fig. 6 is a schematic structural diagram of the bilingual sentiment classification device provided by Embodiment 4 of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
Embodiment 1
Refer to Fig. 1, a schematic flowchart of a bilingual sentiment classification method provided by Embodiment 1 of the invention. The method comprises:
Step S101: translate the original-language document to be classified and the original-language documents of the training sample set, to obtain a translated document to be classified and translated documents of the training sample set.
In this embodiment, a machine translation system such as Google Translate can be used to translate the original-language document to be classified and the original-language documents of the training sample set. For example, if the original-language documents are Chinese, Google Translate can be used to translate them into English.
Step S102: combine the original-language document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combine the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.
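Steps S101 and S102 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, and `toy_translate` is a dictionary lookup standing in for a real machine translation system such as Google Translate, whose API is not specified in this document.

```python
from typing import Callable, Dict, List, Tuple

# A bilingual document is modelled here as an (original, translation) pair.
BilingualDocument = Tuple[str, str]


def build_bilingual_documents(
    originals: List[str],
    translate: Callable[[str], str],
) -> List[BilingualDocument]:
    """Pair every original-language document with its translation."""
    return [(document, translate(document)) for document in originals]


# Hypothetical stand-in for a machine translation system (assumption).
_TOY_DICTIONARY: Dict[str, str] = {
    "质量 很 好": "the quality is very good",
    "质量 很 差": "the quality is very poor",
}


def toy_translate(text: str) -> str:
    """Look the text up in a toy dictionary; unknown text is returned as-is."""
    return _TOY_DICTIONARY.get(text, text)


bilingual_documents = build_bilingual_documents(
    ["质量 很 好", "质量 很 差"], toy_translate
)
```

The same function would be applied both to the document to be classified and to the documents of the training sample set, so that both sides end up in the same bilingual form.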
Step S103: build the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
In this embodiment, building the two bilingual feature vector spaces can comprise: performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
The bilingual feature vector space can be expressed as $F = (e_1, e_2, \ldots, e_n, c_1, c_2, \ldots, c_n)$, where $e_1, e_2, \ldots, e_n$ are the feature items of the original-language document and $c_1, c_2, \ldots, c_n$ are the feature items of the corresponding translated document.
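A minimal sketch of building such a vector space from unigram features is given below. It assumes that word segmentation has already been performed, so that tokens are separated by spaces; the `src:` and `trans:` prefixes, the binary (presence or absence) feature values, and all function names are illustrative assumptions rather than details stated in this document.

```python
from typing import Dict, List, Tuple

BilingualDocument = Tuple[str, str]  # (segmented original, segmented translation)


def unigram_features(document: BilingualDocument) -> Dict[str, int]:
    """Collect binary unigram features from both halves of a bilingual document."""
    original, translated = document
    features: Dict[str, int] = {}
    # Prefixes keep the two languages' feature items apart (assumption).
    for prefix, text in (("src", original), ("trans", translated)):
        for token in text.split():
            features[f"{prefix}:{token.lower()}"] = 1
    return features


def build_vocabulary(documents: List[BilingualDocument]) -> Dict[str, int]:
    """Assign one vector dimension to every feature item seen in the documents."""
    vocabulary: Dict[str, int] = {}
    for document in documents:
        for name in unigram_features(document):
            vocabulary.setdefault(name, len(vocabulary))
    return vocabulary


def vectorize(document: BilingualDocument, vocabulary: Dict[str, int]) -> List[int]:
    """Map a bilingual document onto the dimensions of the vocabulary."""
    vector = [0] * len(vocabulary)
    for name in unigram_features(document):
        index = vocabulary.get(name)
        if index is not None:  # feature items unseen in training are ignored
            vector[index] = 1
    return vector


training_documents = [("质量 很 好", "quality is good"), ("质量 差", "quality is bad")]
vocabulary = build_vocabulary(training_documents)
```

Building the vocabulary from the training sample set and reusing it for the document to be classified keeps both vector spaces on the same dimensions.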
Step S104: train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model.
Step S105: perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier, to obtain the sentiment classification result of the original-language document to be classified.
After a bilingual feature vector is input to the classifier, the sentiment polarity is judged from the posterior probabilities it returns: the class with the largest posterior probability is taken as the final classification result.
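The decision rule in step S105 amounts to an arg-max over the posterior probabilities. The sketch below assumes the classifier returns a mapping from class label to posterior probability; the function name and the probability values are purely illustrative.

```python
from typing import Dict


def decide_polarity(posteriors: Dict[str, float]) -> str:
    """Return the class label with the largest posterior probability."""
    if not posteriors:
        raise ValueError("at least one class posterior is required")
    return max(posteriors, key=posteriors.get)


# Illustrative posterior values (assumption), not experimental output.
label = decide_polarity({"positive": 0.73, "negative": 0.27})
```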
In the bilingual sentiment classification method provided by Embodiment 1 of the invention, the original-language document and its translated document are combined into a bilingual document, a bilingual feature vector space is formed by feature expansion, a classifier is trained on the bilingual feature vector space using the maximum entropy method, and sentiment classification is performed according to the posterior probability. By adding bilingual features to sentiment classification, this embodiment compensates for the insufficient classification information available in a single language; combining two languages helps to resolve ambiguity and improves the accuracy of sentiment classification.
Embodiment 2
Refer to Fig. 2, a schematic flowchart of a bilingual sentiment classification method provided by Embodiment 2 of the invention. The method comprises:
Step S201: translate the original-language document to be classified and the original-language documents of the training sample set, to obtain a translated document to be classified and translated documents of the training sample set.
In this embodiment, a machine translation system such as Google Translate can be used to translate the original-language document to be classified and the original-language documents of the training sample set.
Step S202: combine the original-language document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and combine the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.
Step S203: build the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
In this embodiment, building the two bilingual feature vector spaces can comprise: performing word segmentation on the bilingual document to be classified and the bilingual documents of the training sample set; and selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
For example, suppose the original-language document is Chinese and the translated document is English. The bilingual feature vector space can be expressed as $F = (e_1, e_2, \ldots, e_n, c_1, c_2, \ldots, c_n)$, where $e_1, e_2, \ldots, e_n$ are the features of the Chinese document and $c_1, c_2, \ldots, c_n$ are the features of the corresponding English document.
Step S204: determine the weight of each feature item in the bilingual feature vector space of the training sample set, and train a classifier, using a maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
Because the dimension of the bilingual feature vector space is large, classification can take a long time and classification efficiency can be low. This embodiment therefore applies a feature extraction step: the feature items whose weights are greater than the preset value are selected from the bilingual feature vector space of the training sample set to form the bilingual feature vector, and the classifier is trained on that vector. Feature extraction reduces the dimension of the feature vector while preserving classification performance, which shortens the classification time and improves classification efficiency.
In this embodiment, the CHI feature extraction method can be used to calculate the weight of each feature item in the bilingual feature vector space of the training sample set; the larger the weight, the more important the corresponding feature. After the weights are determined, the feature items can be sorted by weight in descending order, and the top N feature items whose weights are greater than the preset value are selected in turn to form the bilingual feature vector used to train the classifier.
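The selection rule described above (sort by weight in descending order, keep at most N items whose weight exceeds the preset value) can be sketched as follows. The function name, the parameter names, and the example weights are hypothetical.

```python
from typing import Dict, List


def select_features(weights: Dict[str, float], preset_value: float, top_n: int) -> List[str]:
    """Keep at most top_n feature items whose weight exceeds preset_value."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, weight in ranked if weight > preset_value]
    return kept[:top_n]


# Illustrative weights (assumption), not values from the experiments.
example_weights = {"src:好": 5.0, "trans:the": 0.5, "trans:good": 3.0, "src:差": 2.0}
selected = select_features(example_weights, preset_value=1.0, top_n=2)
```

Only the selected feature items are then used as dimensions when the training and test documents are vectorized.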
In addition, some of the samples in the training sample set of this embodiment are positive and some are negative, and the maximum entropy model learns a binary classification model from the training set.
It should be noted that the CHI statistic measures the degree of dependence between a feature item and a class, that is, how far the two are from being independent. It rests on the following assumption: terms that occur frequently in the texts of a given class and terms that occur frequently in the texts of the other classes are both helpful for judging whether a document belongs to that class. The CHI method is defined as follows:
$$\chi^2(t) = \operatorname*{avg}_{i=1}^{m} \left( \chi^2(t, c_i) \right)$$
where
$$\chi^2(t, c_i) = \frac{N \left( p(t, c_i) \cdot p(\bar{t}, \bar{c}_i) - p(t, \bar{c}_i) \cdot p(\bar{t}, c_i) \right)^2}{p(t) \cdot p(\bar{t}) \cdot p(c_i) \cdot p(\bar{c}_i)}$$
Here, $p(t)$ is the probability that a document $x$ contains feature $t$; $p(\bar{c}_i)$ is the probability that $x$ does not belong to class $c_i$; $p(t, c_i)$ is the joint probability that $x$ contains $t$ and belongs to $c_i$; $p(c_i \mid t)$ is the probability that $x$ belongs to $c_i$ given that it contains $t$; and $p(\bar{t} \mid c_i)$ is the probability that $x$ does not contain $t$ given that it belongs to $c_i$. The remaining quantities, such as $p(c_i)$ and $p(t \mid c_i)$ and the terms involving $\bar{t}$ or $\bar{c}_i$, are defined analogously. The estimate given by the CHI statistic is comparatively reliable and stable.
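The CHI weight can be computed directly from document counts. The sketch below is an illustration under two assumptions that the text does not state: $N$ is taken to be the number of training documents, and every probability is estimated as a relative frequency over those documents. The function names are hypothetical, and a zero denominator (a term that occurs in all or in no documents) is mapped to a weight of 0.

```python
from typing import List, Set


def chi_square(documents: List[Set[str]], labels: List[str], term: str, cls: str) -> float:
    """chi^2(t, c_i) with probabilities estimated as relative frequencies."""
    n = len(documents)  # N: assumed to be the number of training documents
    a = b = c = d = 0
    for features, label in zip(documents, labels):
        has_term, in_class = term in features, label == cls
        if has_term and in_class:
            a += 1  # contains t, belongs to c_i
        elif has_term:
            b += 1  # contains t, does not belong to c_i
        elif in_class:
            c += 1  # lacks t, belongs to c_i
        else:
            d += 1  # lacks t, does not belong to c_i
    p_t, p_not_t = (a + b) / n, (c + d) / n
    p_c, p_not_c = (a + c) / n, (b + d) / n
    denominator = p_t * p_not_t * p_c * p_not_c
    if denominator == 0:
        return 0.0
    numerator = (a / n) * (d / n) - (b / n) * (c / n)
    return n * numerator ** 2 / denominator


def chi_weight(documents: List[Set[str]], labels: List[str], term: str) -> float:
    """chi^2(t): the average of chi^2(t, c_i) over all classes."""
    classes = sorted(set(labels))
    return sum(chi_square(documents, labels, term, cls) for cls in classes) / len(classes)


toy_documents = [{"good", "item"}, {"good"}, {"bad", "item"}, {"bad"}]
toy_labels = ["positive", "positive", "negative", "negative"]
```

In this toy data, "good" occurs only in positive documents and receives a high weight, while "item" is spread evenly across both classes and receives a weight of zero.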
The maximum entropy model is the theoretical basis of the maximum entropy classifier. Its basic idea is to build a model for all known factors while excluding all unknown factors; that is, to find a probability distribution that satisfies all known facts and is not influenced by any unknown factor.
Let $x$ be a feature vector and $y$ the class output for a sample, so that $p(y \mid x)$ is the probability that the sample is predicted to belong to a given class. The maximum entropy model requires that, subject to certain constraints, $p(y \mid x)$ maximize the entropy defined below, i.e. that the model output the most uniform distribution consistent with the constraint set:
$$H(p) = -\sum_{x, y} \tilde{p}(x)\, p(y \mid x) \log p(y \mid x)$$
Here $H(p)$ is written in place of $H(Y \mid X)$ to emphasize the dependence on the probability distribution $p$; the conditional entropy $H(Y \mid X)$ is a mathematical measure of the uniformity of the conditional probability $p(y \mid x)$. For any given constraint set $C$, the goal is to find, among all models that satisfy $C$, the model $p^*$ that maximizes $H(p)$:
$$p^* = \arg\max_{p \in C} H(p)$$
where $p$ ranges over the statistical models that satisfy the constraint set $C$.
The weight of feature $f_i$ is denoted by the parameter $\lambda_i$, and the final probability output of the maximum entropy model is:
$$p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)$$
where $Z_\lambda(x)$ is the normalization factor:
$$Z_\lambda(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)$$
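The probability output above can be sketched in a few lines. The example assumes indicator features of the form "document contains feature item k and the label is y", and it estimates the parameters $\lambda_i$ by plain batch gradient ascent on the conditional log-likelihood; the document does not say which estimation procedure was used, so the optimizer, the learning rate, the epoch count, and all names are assumptions.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

Weights = Dict[Tuple[str, str], float]  # lambda_i, keyed by (feature item, class)


def posterior(features: Set[str], classes: Sequence[str], weights: Weights) -> Dict[str, float]:
    """p_lambda(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x)."""
    scores = {y: sum(weights.get((f, y), 0.0) for f in features) for y in classes}
    highest = max(scores.values())  # subtracted for numerical stability
    exponentials = {y: math.exp(score - highest) for y, score in scores.items()}
    z = sum(exponentials.values())  # normalization factor Z_lambda(x)
    return {y: value / z for y, value in exponentials.items()}


def train_maxent(
    documents: List[Set[str]],
    labels: List[str],
    classes: Sequence[str],
    epochs: int = 200,
    rate: float = 0.5,
) -> Weights:
    """Batch gradient ascent: observed minus expected feature counts (assumed optimizer)."""
    weights: Weights = {}
    for _ in range(epochs):
        gradient: Weights = {}
        for features, label in zip(documents, labels):
            probabilities = posterior(features, classes, weights)
            for f in features:
                for y in classes:
                    observed = 1.0 if y == label else 0.0
                    gradient[(f, y)] = gradient.get((f, y), 0.0) + observed - probabilities[y]
        for key, value in gradient.items():
            weights[key] = weights.get(key, 0.0) + rate * value / len(documents)
    return weights


toy_classes = ("positive", "negative")
toy_weights = train_maxent(
    [{"src:好", "trans:good"}, {"src:差", "trans:bad"}],
    ["positive", "negative"],
    toy_classes,
)
toy_posterior = posterior({"trans:good"}, toy_classes, toy_weights)
```

With all weights at zero the posterior is uniform over the classes, which is the maximum entropy distribution when nothing is known; training moves it away from uniform only as far as the observed feature counts require.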
Step S205: perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier, to obtain the sentiment classification result of the original-language document to be classified.
After a bilingual feature vector is input to the classifier, the sentiment polarity is judged from the posterior probabilities it returns: the class with the largest posterior probability is taken as the final classification result.
To demonstrate the effectiveness of the method of this embodiment for sentiment classification, experiments were carried out in four Chinese product review domains: luggage, electronics, cosmetics, and hotels. Each domain contains 1000 positive and 1000 negative reviews. In the experiments, 80% of the samples were selected as the initial training samples and the remaining 20% as test samples, and accuracy was used as the evaluation metric.
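The evaluation protocol can be sketched as an 80/20 split followed by an accuracy computation. How the samples were actually partitioned is not described, so the seeded random shuffle below is an assumption, as are the function names.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    samples: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the samples (assumed) and split them into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of test samples whose predicted label matches the gold label."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    return sum(1 for p, g in zip(predicted, gold) if p == g) / len(gold)


# One domain holds 1000 positive and 1000 negative reviews, i.e. 2000 samples.
train_part, test_part = split_train_test(range(2000))
```

For one domain this yields 1600 training samples and 400 test samples.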
Refer to Fig. 3, which shows the experimental results of sentiment classification of the reviews in the four domains using the classification method provided by the embodiment of the invention. "CN" denotes the prior-art result obtained with a classifier trained on source-language text only. "CN+Trans" denotes the result obtained with the bilingual sentiment classification method provided by the embodiment, in which the classifier is trained on the feature vector space formed by the feature items of the source language together with those of the translation language and is then used for sentiment classification.
As the comparison in Fig. 3 shows, the sentiment classification method provided by the embodiment clearly improves classification accuracy over the prior-art method: the average improvement across the four domains is 3.4%, with an improvement of 5% in the luggage domain and 4% in the hotel domain.
There are two reasons for this improvement. First, multiple languages in themselves provide additional helpful information for sentiment classification. Second, in text classification on the source language alone, sentiment words can be ambiguous; prior-art methods cannot avoid the resulting misclassifications, which lowers accuracy. Once the features of the translated documents are added, these sentiment words are likely to be unambiguous in the context of the translation language.
Refer to Fig. 4, which shows the experimental results of sentiment classification of the documents in the four domains using the sentiment classification method provided by the embodiment of the invention. The four curves show the sentiment classification results in the four domains after a subset of the features has been selected; the horizontal axis is the number of features extracted during feature extraction, and the vertical axis is the accuracy of sentiment classification.
As Fig. 4 shows, classification accuracy with the CHI feature extraction method generally rises as the number of features increases. When the number of features is small, accuracy is relatively low; once the number of features reaches 1500 or more, performance in all four domains is close to its peak and remains essentially unchanged, so that selecting 1500 features achieves the same sentiment classification performance as using all features. Feature extraction can therefore greatly reduce the dimension of the feature vector without sacrificing sentiment classification accuracy, which improves classification efficiency.
In the bilingual sentiment classification method provided by Embodiment 2 of the invention, the original-language document and its translated document are combined into a bilingual document, a bilingual feature vector space is formed by feature expansion, a classifier is trained on the bilingual feature vector space using the maximum entropy method, and sentiment classification is performed according to the posterior probability. By adding bilingual features to sentiment classification, this embodiment compensates for the insufficient classification information available in a single language; combining two languages helps to resolve ambiguity and improves the accuracy of sentiment classification. In addition, the more important feature items are extracted from the bilingual feature vector space, which reduces its dimension, shortens the sentiment classification time, and improves classification efficiency.
Embodiment 3
Refer to Fig. 5, a schematic structural diagram of a bilingual sentiment classification device provided by Embodiment 3 of the invention. The device can comprise a translation unit 101, a combination unit 102, a construction unit 103, a training unit 104, and a classification unit 105, wherein:
The translation unit 101 is configured to translate the original-language document to be classified and the original-language documents of the training sample set, to obtain a translated document to be classified and translated documents of the training sample set.
The combination unit 102 is configured to combine the original-language document to be classified with the translated document to be classified to obtain a bilingual document to be classified, and to combine the original-language documents of the training sample set with the translated documents of the training sample set to obtain bilingual documents of the training sample set.
The construction unit 103 is configured to build the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
The training unit 104 is configured to train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model.
The classification unit 105 is configured to perform sentiment polarity classification on the bilingual feature vector space to be classified with the trained classifier, to obtain the sentiment classification result of the original-language document to be classified.
In the bilingual sentiment classification device provided by Embodiment 3 of the invention, the original-language document and its translated document are combined into a bilingual document, a bilingual feature vector space is formed by feature expansion, a classifier is trained on the bilingual feature vector space using the maximum entropy method, and sentiment classification is performed according to the posterior probability. By adding bilingual features to sentiment classification, this embodiment compensates for the insufficient classification information available in a single language; combining two languages helps to resolve ambiguity and improves the accuracy of sentiment classification.
Embodiment tetra-
Refer to Fig. 6, the structural representation of a kind of bilingual emotional semantic classification device providing for the embodiment of the present invention three, this device can comprise: translation unit 101, assembled unit 102, construction unit 103, training unit 104 and taxon 105.Wherein:
Translation unit 101, for translating source document to be sorted and the source document of training sample set, obtains translation document to be sorted and the translation document of training sample set.
Further, translation unit 101 comprises translation subelement 1011.Translation subelement 1011, for utilizing machine translation system Google Translate to translate source document to be sorted and the source document of training sample set.
Assembled unit 102, for combining source document to be sorted and the combination of described translation document to be sorted, obtain bilingual document to be sorted, the source document of combined training sample set and the translation document of described training sample set, obtain the bilingual document of training sample set.
Construction unit 103, for building bilingual characteristic vector space to be sorted and the bilingual characteristic vector space of training sample set.
Further, construction unit 103 comprises: participle subelement 1031 and structure subelement 1032.Participle subelement 1031, carries out word segmentation processing for the bilingual document of the bilingual document to be sorted and training sample set; Build subelement 1032, for choosing the monobasic feature of word, form bilingual characteristic vector space to be sorted and the bilingual characteristic vector space of training sample set.
The training unit 104 is configured to train a classifier on the bilingual feature vector space of the training sample set using a maximum entropy model.
Further, the training unit 104 comprises a determination subunit 1041 and a training subunit 1042. The determination subunit 1041 is configured to determine the weight of each feature item in the bilingual feature vector space of the training sample set. The training subunit 1042 is configured to train the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
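The weighting and thresholding performed by the two subunits might look like the sketch below. It assumes the weight is the standard chi-square (CHI) statistic, which the claims name as the feature extraction method; the exact formula used by the inventors is not reproduced in this section, so the textbook form is used here. The function names and the representation of each document as a set of feature names are illustrative assumptions.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Textbook chi-square statistic for one feature and one class.
    a: in-class docs with the feature, b: other docs with the feature,
    c: in-class docs without it,       d: other docs without it."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def chi_weights(documents, labels, target_label):
    """Weight every feature by its chi-square score against target_label.
    documents: one collection of feature names per training sample."""
    n_target = sum(1 for y in labels if y == target_label)
    n_other = len(labels) - n_target
    in_target, in_other = Counter(), Counter()
    for features, y in zip(documents, labels):
        bucket = in_target if y == target_label else in_other
        bucket.update(set(features))      # document frequency, not term count
    weights = {}
    for name in set(in_target) | set(in_other):
        a, b = in_target[name], in_other[name]
        weights[name] = chi_square(a, b, n_target - a, n_other - b)
    return weights


def select_features(weights, preset_value):
    """Keep only the feature items whose weight exceeds the preset value."""
    return {name for name, w in weights.items() if w > preset_value}
```

With two sentiment classes the chi-square score of a feature is the same whichever class is taken as `target_label`, so a single pass is enough. The preset value is a tuning parameter that the text leaves unspecified.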
The classification unit 105 is configured to perform sentiment polarity classification on the bilingual feature vector space to be classified using the trained classifier, obtaining the sentiment classification result of the source document to be classified.
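To make the training and classification units concrete, the following is a small sketch of a maximum entropy classifier that labels a document by its highest posterior probability. It is not the inventors' implementation: the optimizer (plain stochastic gradient ascent on the conditional log-likelihood), the learning rate, the epoch count, and all function names are assumptions chosen to keep the example short and dependency-free.

```python
import math


def posterior(weights, features, labels):
    """p(label | features) under a maximum entropy (softmax) model."""
    scores = {
        y: sum(weights.get((y, f), 0.0) * v for f, v in features.items())
        for y in labels
    }
    top = max(scores.values())            # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}


def train(samples, labels, epochs=200, rate=0.5):
    """Fit weights by stochastic gradient ascent (assumed optimizer).
    samples: list of (feature map, gold label) pairs."""
    weights = {}
    for _ in range(epochs):
        for features, gold in samples:
            probs = posterior(weights, features, labels)
            for y in labels:
                # Gradient of log p(gold | x): observed minus expected feature value.
                error = (1.0 if y == gold else 0.0) - probs[y]
                for f, v in features.items():
                    weights[(y, f)] = weights.get((y, f), 0.0) + rate * error * v
    return weights


def classify(weights, features, labels):
    """Return the label with the highest posterior and that posterior."""
    probs = posterior(weights, features, labels)
    best = max(probs, key=probs.get)
    return best, probs[best]
```

A usage example: after `weights = train(samples, ("pos", "neg"))` on bilingual feature maps, `classify(weights, features, ("pos", "neg"))` returns the predicted sentiment polarity of a document together with its posterior probability.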
In the bilingual sentiment classification device provided by Embodiment 4 of the present invention, the source document and its translation are combined into a bilingual document, a bilingual feature vector space is formed by feature expansion, a classifier is trained on the bilingual feature vector space using the maximum entropy method, and sentiment classification is performed according to the posterior probability. This embodiment adds bilingual features to sentiment classification, which compensates for the insufficient classification information available in a single language. Combining the two languages can also resolve ambiguity, which improves the accuracy of sentiment classification. In addition, the more important feature items are extracted from the bilingual feature vector space, which reduces its dimensionality, shortens the sentiment classification time, and improves classification efficiency.
For convenience of description, the above device is described in terms of units divided by function. Of course, when the present invention is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts may be found in the description of the method embodiments. The system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations.
The above description of the disclosed embodiments enables a person skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A bilingual sentiment classification method, characterized by comprising:
translating a source document to be classified and source documents of a training sample set to obtain a translated document to be classified and translated documents of the training sample set;
combining said source document to be classified and said translated document to be classified to obtain a bilingual document to be classified, and combining the source documents of said training sample set and the translated documents of said training sample set to obtain bilingual documents of the training sample set;
building a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;
training a classifier on the bilingual feature vector space of said training sample set using a maximum entropy model;
performing sentiment polarity classification on said bilingual feature vector space to be classified using the trained classifier to obtain a sentiment classification result of said source document to be classified.
2. The method according to claim 1, characterized in that training a classifier on the bilingual feature vector space of said training sample set using a maximum entropy model comprises:
determining the weight of each feature item in the bilingual feature vector space of said training sample set;
training the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
3. The method according to claim 2, characterized in that determining the weight of each feature item in the bilingual feature vector space of said training sample set comprises:
calculating the weight of each feature item in the bilingual feature vector space of said training sample set using the CHI feature extraction method.
4. The method according to claim 1, characterized in that building the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set comprises:
performing word segmentation on said bilingual document to be classified and the bilingual documents of the training sample set;
selecting the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
5. The method according to claim 1, characterized in that translating the source document to be classified and the source documents of the training sample set comprises:
translating the source document to be classified and the source documents of the training sample set using the machine translation system Google Translate.
6. A bilingual sentiment classification device, characterized by comprising:
a translation unit, configured to translate a source document to be classified and source documents of a training sample set to obtain a translated document to be classified and translated documents of the training sample set;
a combination unit, configured to combine said source document to be classified and said translated document to be classified to obtain a bilingual document to be classified, and to combine the source documents of said training sample set and the translated documents of said training sample set to obtain bilingual documents of the training sample set;
a construction unit, configured to build a bilingual feature vector space to be classified and a bilingual feature vector space of the training sample set;
a training unit, configured to train a classifier on the bilingual feature vector space of said training sample set using a maximum entropy model;
a classification unit, configured to perform sentiment polarity classification on said bilingual feature vector space to be classified using the trained classifier to obtain a sentiment classification result of said source document to be classified.
7. The device according to claim 6, characterized in that said training unit comprises:
a determination subunit, configured to determine the weight of each feature item in the bilingual feature vector space of said training sample set;
a training subunit, configured to train the classifier, using the maximum entropy model, on the bilingual feature vector space formed by the feature items whose weights are greater than a preset value.
8. The device according to claim 7, characterized in that said determination subunit comprises:
a calculation subunit, configured to calculate the weight of each feature item in the bilingual feature vector space of said training sample set using the CHI feature extraction method.
9. The device according to claim 6, characterized in that said construction unit comprises:
a word segmentation subunit, configured to perform word segmentation on said bilingual document to be classified and the bilingual documents of the training sample set;
a building subunit, configured to select the unigram features of the words to form the bilingual feature vector space to be classified and the bilingual feature vector space of the training sample set.
10. The device according to claim 6, characterized in that said translation unit comprises:
a translation subunit, configured to translate the source document to be classified and the source documents of the training sample set using the machine translation system Google Translate.
CN201310616753.0A 2013-11-27 2013-11-27 Bilingual sentiment classification method and device Pending CN103617245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310616753.0A CN103617245A (en) 2013-11-27 2013-11-27 Bilingual sentiment classification method and device

Publications (1)

Publication Number Publication Date
CN103617245A true CN103617245A (en) 2014-03-05

Family

ID=50167948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310616753.0A Pending CN103617245A (en) 2013-11-27 2013-11-27 Bilingual sentiment classification method and device

Country Status (1)

Country Link
CN (1) CN103617245A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090216740A1 (en) * 2008-02-25 2009-08-27 Bhiksha Ramakrishnan Method for Indexing for Retrieving Documents Using Particles
CN101876985A (en) * 2009-11-26 2010-11-03 西北工业大学 WEB text sentiment theme recognizing method based on mixed model
CN102662923A (en) * 2012-04-23 2012-09-12 天津大学 Entity instance leading method based on machine learning
CN103020249A (en) * 2012-12-19 2013-04-03 苏州大学 Classifier construction method and device as well as Chinese text sentiment classification method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
苏艳: "双语情感分类方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462229A (en) * 2014-11-13 2015-03-25 苏州大学 Event classification method and device
CN104462409A (en) * 2014-12-12 2015-03-25 重庆理工大学 Cross-language emotional resource data identification method based on AdaBoost
CN104462409B (en) * 2014-12-12 2017-08-25 重庆理工大学 Across language affection resources data identification method based on AdaBoost
CN104536953A (en) * 2015-01-22 2015-04-22 苏州大学 Method and device for recognizing textual emotion polarity
CN104536953B (en) * 2015-01-22 2017-12-26 苏州大学 A kind of recognition methods of text emotional valence and device
CN105117428A (en) * 2015-08-04 2015-12-02 电子科技大学 Web comment sentiment analysis method based on word alignment model
CN105117428B (en) * 2015-08-04 2018-12-04 电子科技大学 A kind of web comment sentiment analysis method based on word alignment model
CN107220293A (en) * 2017-04-26 2017-09-29 天津大学 File classification method based on mood
CN107220293B (en) * 2017-04-26 2020-08-18 天津大学 Emotion-based text classification method
CN109522554A (en) * 2018-11-06 2019-03-26 中国人民解放军战略支援部队信息工程大学 A kind of low-resource Document Classification Method and categorizing system

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20140305)