CN105069021A - Chinese short text sentiment classification method based on fields

Info

Publication number: CN105069021A
Application number: CN201510415825.4A
Authority: CN
Inventors: 舒磊; 牛建伟; 毛凯莉; 傅树霞; 赵晓轲
Original assignee: Guangdong University of Petrochemical Technology
Current assignee: Guangdong University of Petrochemical Technology
Priority date: 2015-07-15
Filing date: 2015-07-15
Publication date: 2015-11-18
Anticipated expiration: 2035-07-15
Also published as: CN105069021B

Abstract

The present invention discloses a domain-based Chinese short text sentiment classification method, which includes: preprocessing a short text by sentence segmentation, word segmentation, stop word filtering, and domain division; constructing a domain-specific sentiment dictionary; extracting and matching sentiment paths, extracting candidate sentiment words and determining their polarity, and calculating TF-IDF weights of sentiment words, using the domain-specific sentiment dictionary with a corpus as the data set; extracting sentiment features of the short text; and training on the corpus, or classifying short texts of unknown sentiment type, with a random forest algorithm. Experiments show that the proposed scheme achieves a high accuracy rate.

Description

Domain-based Chinese short text sentiment classification method

Technical field

The present invention relates to the field of machine learning, and in particular to a domain-based Chinese short text sentiment classification method.

Background art

The rapid development of the Internet has made social networks and e-commerce shopping platforms increasingly popular with users, for example domestic and international platforms such as Facebook, Twitter, Sina Weibo, Douban, JD.com, and Taobao. Data on these platforms is growing explosively and includes product reviews, opinions on current events, and records of everyday anecdotes or mood swings. Short text is a common form of this data and often carries emotional color or subjective opinion. Mining the sentiment that users express in such short texts helps different parties make better decisions or provide better services, for example giving users more relevant recommendations when they choose products, giving e-commerce vendors more effective support when they promote products, and giving government or news media departments reliable predictions or alerts about potential hot events.

Text sentiment analysis is a popular research direction in natural language processing (NLP) and has been studied extensively. Many techniques have been proposed, but they fall into two main kinds: methods based on sentiment dictionaries and methods based on machine learning. Dictionary-based methods use sentiment words (divided into positive and negative) as the main basis for judgment, that is, they decide the sentiment of a text from the sentiment words it contains. Machine-learning-based methods classify the sentiment of a text with a trained classifier. Both approaches have advantages and drawbacks. The former is usually simpler, has lower algorithmic complexity, and does not need a large labeled corpus; however, sentiment dictionaries are prone to omissions, ambiguity, or extreme entries, and they usually cannot capture how the sentiment of a word changes across scenarios. The latter is often more accurate than the former, but training a sentiment feature classifier requires a large labeled corpus, and the corpus must be chosen appropriately.

Summary of the invention

The technical problem to be solved by the present invention is how to combine a sentiment dictionary with machine learning to classify the sentiment of Chinese short texts automatically and efficiently, so as to improve the efficiency of automatic text labeling and training and to give the final classifier high accuracy.

To solve the above technical problem, the present invention provides a domain-based Chinese short text sentiment classification method, comprising:

preprocessing the short text, including sentence segmentation, word segmentation, stop word filtering, and domain division;

building a domain-specific sentiment dictionary for each domain;

calculating the sentiment value of the short text using the domain-specific sentiment dictionary and the preprocessed data;

extracting the sentiment features of the short text;

using a random forest as the classification tool to train on the corpus, or to classify short texts of unknown sentiment type, according to the extracted sentiment features.

Further, preprocessing the short text, including sentence segmentation, word segmentation, stop word filtering, and domain division, specifically comprises:

dividing the short text into multiple sentences at punctuation marks;

cutting the sentences into independent words with the ICTCLAS word segmentation tool;

filtering the segmented words with a stop word list;

determining the domain to which the short text belongs from the short text and its context, in combination with a domain lexicon.
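The preprocessing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the punctuation set, the `segment` callable (a stand-in for a real segmenter such as ICTCLAS), and the rule of choosing the domain whose lexicon overlaps the text most are all hypothetical choices made for the example.

```python
import re

# Assumed set of sentence delimiters (Chinese and ASCII punctuation).
SENTENCE_DELIMS = r"[。！？；!?;\n]+"

def split_sentences(text):
    """Split a short text into sentences at punctuation marks."""
    return [s.strip() for s in re.split(SENTENCE_DELIMS, text) if s.strip()]

def preprocess(text, segment, stop_words, domain_lexicons):
    """Return (sentences as token lists, best-matching domain or None).

    segment: callable mapping a sentence to a list of words; it stands in
             for a real word segmenter (hypothetical interface).
    domain_lexicons: dict mapping a domain name to a set of domain words.
    """
    sentences = []
    for sentence in split_sentences(text):
        # Word segmentation followed by stop word filtering.
        tokens = [w for w in segment(sentence) if w and w not in stop_words]
        if tokens:
            sentences.append(tokens)
    # Domain division: pick the lexicon that shares the most words with the text.
    all_tokens = [w for tokens in sentences for w in tokens]
    scores = {d: sum(w in lex for w in all_tokens)
              for d, lex in domain_lexicons.items()}
    best = max(scores, key=scores.get) if scores else None
    domain = best if best is not None and scores[best] > 0 else None
    return sentences, domain
```

For a quick trial, a whitespace-delimited text can be passed with `str.split` as the segmenter, for example `preprocess("酒店 的 房间 很 干净。服务 也 不错！", str.split, {"的", "也"}, {"hotel": {"酒店", "房间", "服务"}})`.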

Further, building the domain-specific sentiment dictionaries specifically comprises:

selecting domain-independent sentiment words from an existing sentiment dictionary and deleting ambiguous and rarely used words from them to form a basic sentiment dictionary;

extracting all nouns in the corpus, sorting them by word frequency, and using a threshold to select the higher-frequency nouns as evaluation objects;

using dependency grammar analysis to extract all sentiment paths between the evaluation objects and the sentiment words in the basic sentiment dictionary that modify them;

according to these sentiment paths, matching the words that lie on a sentiment path conforming to an evaluation object, excluding words already in the basic sentiment dictionary, and taking the remaining words whose part of speech is adjective, adverb, or verb as candidate sentiment words;

classifying the sentiment polarity of the candidate sentiment words with a word similarity discrimination algorithm and merging them with the basic sentiment dictionary to form the domain-specific sentiment dictionary.
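Parts of the dictionary construction above can be sketched as follows. The sketch covers only the frequency-threshold selection of evaluation objects and the polarity labeling of candidates; dependency-path extraction is omitted. The noun tag convention, the nearest-neighbor labeling rule, and the `similarity` callable are assumptions for illustration, since the text does not specify which word similarity measure is used.

```python
from collections import Counter

def select_evaluation_objects(tagged_corpus, min_count):
    """Pick nouns whose corpus frequency reaches min_count, most frequent first.

    tagged_corpus: iterable of (word, pos) pairs; nouns are assumed to carry
    a tag starting with "n" (a convention assumed for this example).
    """
    counts = Counter(w for w, pos in tagged_corpus if pos.startswith("n"))
    return [w for w, c in counts.most_common() if c >= min_count]

def classify_candidates(candidates, base_dict, similarity):
    """Label each candidate +1 or -1 by its most similar base-dictionary word.

    similarity: hypothetical callable (word_a, word_b) -> float.
    """
    labelled = {}
    for cand in candidates:
        if cand in base_dict:
            continue  # words already in the basic dictionary are excluded
        nearest = max(base_dict, key=lambda seed: similarity(cand, seed))
        labelled[cand] = base_dict[nearest]
    return labelled

def build_domain_dictionary(base_dict, candidates, similarity):
    """Merge the basic dictionary with the newly labelled candidates."""
    merged = dict(base_dict)
    merged.update(classify_candidates(candidates, base_dict, similarity))
    return merged
```

Any similarity function can be plugged in; a toy one based on shared characters, `lambda a, b: len(set(a) & set(b))`, is enough to exercise the code.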

Further, calculating the sentiment value of the short text using the domain-specific sentiment dictionary and the preprocessed data specifically comprises:

calculating the TF-IDF value of each word in the domain-specific sentiment dictionary, where TF-IDF = TF * IDF, TF denotes term frequency, and IDF denotes inverse document frequency;

for the words obtained by segmenting the short text, calculating the sentiment value of each word, that is, assigning each word a different weight according to its TF-IDF value;

calculating the weighted sum of the sentiment values of all words to obtain the sentiment value of the short text.
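The calculation above can be illustrated with a short sketch. The text gives only TF-IDF = TF * IDF, so the concrete definitions here are assumptions: TF is taken as the relative frequency of a word in one tokenized document, and IDF as log(N / df) over a corpus of N documents. The `polarity` and `weight` dictionaries are hypothetical inputs.

```python
import math

def tf_idf(word, document, corpus):
    """TF-IDF = TF * IDF for one word in one tokenized document.

    Assumed definitions: TF = count / document length,
    IDF = log(N / df), with IDF = 0 when the word never occurs.
    """
    tf = document.count(word) / len(document) if document else 0.0
    df = sum(1 for doc in corpus if word in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

def text_sentiment(tokens, polarity, weight):
    """Weighted sum of the sentiment values of the words in one short text.

    polarity: dict word -> +1 or -1 (domain-specific sentiment dictionary)
    weight:   dict word -> non-negative weight derived from TF-IDF
    """
    return sum(polarity[w] * weight.get(w, 0.0)
               for w in tokens if w in polarity)
```

Words that are absent from the sentiment dictionary contribute nothing to the sum, so the sign of the result reflects the balance of weighted positive and negative words.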

Further, calculating the sentiment value of each word obtained by segmenting the short text, that is, assigning each word a different weight according to its TF-IDF value, specifically comprises:

for the words obtained by segmenting the short text, recording the position at which each word occurs and its tendency value p, where p is initialized to f(TF-IDF) if the word is positive and to -f(TF-IDF) if the word is negative, f(TF-IDF) being the default initial sentiment value of the word;

judging from the word positions whether negation words occur between words and, if so, counting them; when the number of negation words is odd, the tendency value p of the word following the negation words is reversed, otherwise p is unchanged; the final tendency value p is the sentiment value of the word;

assigning different weights to different words according to their TF-IDF values.
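The negation rule above can be sketched as follows. This is one reading of the text, offered as an assumption: negation words are counted between the previous sentiment word (or the start of the text) and the current one, and an odd count flips the sign of p. The callable `f` stands in for f(TF-IDF), whose exact form is not given.

```python
def word_sentiments(tokens, polarity, negations, f):
    """Return [(position, p)] for each sentiment word in tokens.

    polarity:  dict word -> +1 (positive) or -1 (negative)
    negations: set of negation words
    f:         callable giving the initial sentiment magnitude of a word,
               a hypothetical stand-in for f(TF-IDF)
    """
    results = []
    negation_count = 0  # negation words seen since the previous sentiment word
    for position, word in enumerate(tokens):
        if word in negations:
            negation_count += 1
        elif word in polarity:
            p = polarity[word] * f(word)
            if negation_count % 2 == 1:
                p = -p  # an odd number of negations reverses the tendency
            results.append((position, p))
            negation_count = 0
    return results
```

With this reading, a double negation leaves the tendency value unchanged, while a single negation before a positive word makes its contribution negative.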

Further, using a random forest as the classification tool to train on the corpus, or to classify short texts of unknown sentiment type, according to the extracted sentiment features specifically comprises:

formatting the sentiment feature file with an ARFF feature template;

calling the random forest algorithm in Weka as the classification tool to train on the sentiment features extracted from the corpus, or to predict the sentiment class of short texts of unknown sentiment type.
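The formatting step above can be illustrated by rendering feature rows as ARFF text, the file format Weka reads. This is a minimal sketch: the relation name, the feature names, and the class labels are hypothetical, and the call into Weka's random forest is not shown.

```python
def to_arff(relation, feature_names, rows, classes=("positive", "negative")):
    """Render numeric feature rows as ARFF text.

    rows: list of (feature_values, label); a label of None marks a text of
    unknown sentiment type and is written as "?", the ARFF missing value.
    """
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in feature_names]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for values, label in rows:
        if len(values) != len(feature_names):
            raise ValueError("row length does not match the feature list")
        cells = [str(v) for v in values]
        cells.append(label if label is not None else "?")
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"
```

Labeled rows would form the training file, and rows whose label is None would form the file to be classified.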

Implementing the present invention has the following beneficial effects:

1) the domain-based short text sentiment discrimination method proposed by the present invention improves the accuracy of sentiment classification of text data;

2) the accuracy obtained with the proposed domain-specific sentiment dictionary is clearly higher than that reached with the basic sentiment dictionary.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of an embodiment of the domain-based Chinese short text sentiment classification method provided by the present invention;

Fig. 2 is a flow diagram of the detailed steps of step S101 in Fig. 1;

Fig. 3 shows the experimental comparison between the sentiment dictionary of the proposed method and a traditional sentiment dictionary;

Fig. 4 shows example test results in four domains.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

Fig. 1 is a flow diagram of an embodiment of the domain-based Chinese short text sentiment classification method provided by the present invention, which comprises the following steps:

S101: preprocessing the short text, including sentence segmentation, word segmentation, stop word filtering, and domain division.

Specifically, as shown in Fig. 2, step S101 comprises:

S1011: dividing the short text into multiple sentences at punctuation marks;

S1012: cutting the sentences into independent words with the ICTCLAS word segmentation tool;

S1013: filtering the segmented words with a stop word list;

S1014: determining the domain to which the short text belongs from the short text and its context, in combination with a domain lexicon.

S102: building a domain-specific sentiment dictionary for each domain.

Specifically, step S102 comprises:

S1021: selecting domain-independent sentiment words from an existing sentiment dictionary and deleting ambiguous and rarely used words from them to form a basic sentiment dictionary;

S1022: extracting all nouns in the corpus, sorting them by word frequency, and using a threshold to select the higher-frequency nouns as evaluation objects;

S1023: using dependency grammar analysis to extract all sentiment paths between the evaluation objects and the sentiment words in the basic sentiment dictionary that modify them;

S1024: according to these sentiment paths, matching the words that lie on a sentiment path conforming to an evaluation object, excluding words already in the basic sentiment dictionary, and taking the remaining words whose part of speech is adjective, adverb, or verb as candidate sentiment words;

S1025: classifying the sentiment polarity of the candidate sentiment words with a word similarity discrimination algorithm and merging them with the basic sentiment dictionary to form the domain-specific sentiment dictionary.

S103: calculating the sentiment value of the short text using the domain-specific sentiment dictionary and the preprocessed data.

Specifically, step S103 comprises:

S1031: calculating the TF-IDF value of each word in the domain-specific sentiment dictionary, where TF-IDF = TF * IDF, TF denotes term frequency, and IDF denotes inverse document frequency;

S1032: for the words obtained by segmenting the short text, calculating the sentiment value of each word, that is, assigning each word a different weight according to its TF-IDF value.

Specifically, step S1032 comprises:

assigning different weights to different words according to their TF-IDF values.

S1033: calculating the weighted sum of the sentiment values of all words to obtain the sentiment value of the short text.

S104: extracting the sentiment features of the short text.

The sentiment features comprise nine features, as shown in Table 1.

Table 1

S105: using a random forest as the classification tool to train on the corpus, or to classify short texts of unknown sentiment type, according to the extracted sentiment features.

Specifically, step S105 comprises:

S1051: formatting the sentiment features with an ARFF feature template;

S1052: calling the random forest in Weka as the classification tool to train on the corpus or to classify short texts of unknown sentiment type.

An embodiment of the present invention was simulated, and its accuracy compared with the algorithm of Tan et al. is shown in Table 2. In the hotel and book domains, the proposed algorithm is considerably more accurate than the algorithm of Tan et al., but in the electronics domain its accuracy is slightly lower.

Table 2

Fig. 3 shows the experimental comparison between the domain-specific sentiment dictionary of the proposed method and the basic sentiment dictionary. The results show that the domain-specific sentiment dictionary classifies clearly better than the basic sentiment dictionary: average accuracy over the four domains improves by 5.3%, with improvements of 4%, 5.2%, 2.9%, and 8.8% on the book, hotel, electronic product, and film data sets respectively.

Fig. 4 shows example test results in four domains, where the horizontal axis is the proportion of data used for training and the vertical axis is classification accuracy and F-Measure. The results show that accuracy and F-Measure are best when 80% of the data is used for training and 20% for testing.
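The evaluation described above can be illustrated with a short sketch of an 80/20 split and the two reported metrics. The text does not define F-Measure precisely, so the balanced F1 score of the positive class is assumed here; the function names and the label strings are hypothetical.

```python
def split_corpus(samples, train_ratio=0.8):
    """Split samples into a training part and a test part by proportion."""
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

def accuracy_and_f_measure(true_labels, predicted, positive="positive"):
    """Return (accuracy, F-Measure), with F-Measure assumed to be F1 of `positive`."""
    pairs = list(zip(true_labels, predicted))
    if not pairs:
        return 0.0, 0.0
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return correct / len(pairs), 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return correct / len(pairs), f_measure
```

In practice the samples would be shuffled before splitting so that both parts contain a similar mix of classes.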

It should be noted that, in this document, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.

The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.

In the embodiments provided in this application, the described systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in an actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A domain-based Chinese short text sentiment classification method, characterized by comprising:

building a domain-specific sentiment dictionary for each domain;

extracting the sentiment features of the short text;

2. The domain-based Chinese short text sentiment classification method according to claim 1, characterized in that preprocessing the short text, including sentence segmentation, word segmentation, stop word filtering, and domain division, specifically comprises:

dividing the short text into multiple sentences at punctuation marks;

filtering the segmented words with a stop word list;

3. The domain-based Chinese short text sentiment classification method according to claim 1, characterized in that building the domain-specific sentiment dictionaries specifically comprises:

4. The domain-based Chinese short text sentiment classification method according to claim 1, characterized in that calculating the sentiment value of the short text using the domain-specific sentiment dictionary and the preprocessed data specifically comprises:

5. The domain-based Chinese short text sentiment classification method according to claim 4, characterized in that calculating the sentiment value of each word obtained by segmenting the short text, that is, assigning each word a different weight according to its TF-IDF value, specifically comprises:

assigning different weights to different words according to their TF-IDF values.

6. The domain-based Chinese short text sentiment classification method according to claim 1, characterized in that using a random forest as the classification tool to train on the corpus, or to classify short texts of unknown sentiment type, according to the extracted sentiment features specifically comprises: