CN106126502A - Sentiment classification system and method based on a support vector machine

Info

Publication number: CN106126502A
Application number: CN201610529672.0A
Authority: CN
Inventors: 王欣; 钟吉英; 赵亮; 谭斌; 于成业; 郝妙; 赵海臣
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2016-07-07
Filing date: 2016-07-07
Publication date: 2016-11-16
Anticipated expiration: 2036-07-07
Also published as: CN106126502B

Abstract

The present invention relates to public opinion analysis technology and discloses a sentiment classification system and method based on a support vector machine, for finding public opinion quickly and accurately in user comments. A crawler module obtains the comments that users post on forums. Preprocessing such as word segmentation yields the feature words of the comment text and a set of representative training data. The training data are then annotated with sentiment labels and fed to a support vector machine to obtain a classification model. The model is applied to the review texts to be classified to estimate their sentiment. Finally, a visualization module displays the classification results, helping users quickly understand user sentiment toward different entities (keywords) and, in turn, public opinion on the Internet. The invention is suitable for public opinion analysis of websites and forums.

Description

A sentiment classification system and method based on a support vector machine

Technical field

The present invention relates to public opinion analysis technology, and in particular to a sentiment classification system and method based on a support vector machine.

Background art

With the rapid development of the Internet, the volume of data on the Internet is growing explosively. According to incomplete statistics, about 100,000 new posts appear on Twitter every minute. In China, Sina Weibo has 650 million users and 46 million daily active users, while Tencent Weibo has 620 million users and about 100 million daily active users; in addition, traditional forum websites produce about 100 million valuable items of information per year. Behind this huge active user base and the content-rich, emotionally distinct comments it publishes lies a large amount of valuable information. Analyzing this information helps reveal commenters' sentiment toward a particular subject, for example whether microblog or forum users evaluate an enterprise "positively" or "negatively", or what views they hold on a social event, and thus helps people follow the direction of public opinion and analyze the causes of problems.

However, classifying comment text and discovering users' sentiment preferences is challenging. For example, user A publishes a post titled "Beware of female swindlers posing as telecom workers", and user B replies "Old people's money is easy to swindle." If the context of the text is ignored and the sentence alone is judged, the results are often inconsistent. For this reason, we have developed a sentiment classification method based on a support vector machine that classifies the text users post on microblogs and forums and then analyzes the state of public opinion on a particular subject.

Summary of the invention

The technical problem to be solved by the present invention is to provide a sentiment classification system and method based on a support vector machine, for finding public opinion quickly and accurately in user comments.

The technical solution adopted by the present invention to solve this technical problem is as follows:

A sentiment classification system based on a support vector machine, comprising:

a data acquisition and preprocessing module, responsible for crawling data with a web crawler, obtaining the comments posted by users, and preprocessing the comments;

a feature word and training sample generation module, responsible for taking the preprocessed comment text as input, selecting high-frequency words with specific parts of speech as feature words and adding them to a feature dictionary, selecting review texts that contain feature words as training samples, and having the sentiment of the training samples manually annotated;

an SVM classification module, responsible for extracting feature vectors from the training samples based on the feature dictionary and inputting them to a support vector machine to generate a classification model, and for using the classification model to compute the sentiment value of a review text to be classified and analyze the sentiment orientation of the text;

a visualization module, responsible for presenting the analysis results on the web side.

In addition, the present invention provides a sentiment classification method based on a support vector machine, comprising the following steps:

A. crawling data with a web crawler, obtaining the comments posted by users, and preprocessing the comments;

B. taking the preprocessed comment text as input, selecting high-frequency words with specific parts of speech as feature words and adding them to a feature dictionary; selecting review texts that contain feature words as training samples, and manually annotating the sentiment of the training samples;

C. extracting feature vectors from the training samples based on the feature dictionary and inputting them to a support vector machine to generate a classification model; using the classification model to compute the sentiment value of a review text to be classified and analyzing the sentiment orientation of the text;

D. presenting the analysis results on the web side.

As a further optimization, in step A, crawling data with a web crawler and obtaining the comments posted by users specifically includes:

Starting from a specified website, crawling web pages in breadth-first order; for each page fetched, parsing its source code to obtain the user comments in the page; and writing the obtained comments to a database.

As a further optimization, in step A, preprocessing the comments specifically includes:

Using a Chinese word segmentation toolkit to segment the users' review texts into words and tag each word with its part of speech.

As a further optimization, in step B, selecting high-frequency words with specific parts of speech as feature words specifically includes: using the FindCover algorithm to select frequent words whose part of speech is noun, verb, or adjective as feature words.

As a further optimization, the specific method of using the FindCover algorithm to select frequent words whose part of speech is noun, verb, or adjective as feature words is as follows:

Determine the input of the FindCover algorithm: the set U of review texts that have been segmented and tagged with parts of speech, the number of feature words n, the feature word length L, and the part-of-speech set P;

Determine the output of the FindCover algorithm: the feature word set S;

The selection process includes:

Step 1: initialize the sets S and A;

Step 2: compute the mapping M, which maps each word to the set of ids of the texts containing that word, M(word);

Step 3: while the set S contains fewer than n words, find a word that satisfies three conditions:

(i) its part of speech meets the requirement of P;

(ii) its length meets the requirement of L;

(iii) its current coverage, coverage = |M(word) - A|, is maximal;

Step 4: if the coverage of the word found is 0, terminate the loop; otherwise add the word to S, add M(word) to A, and return to step 3, continuing until S contains n words or the coverage of the word found is 0;

Step 5: return the set S as the feature word set.
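The selection rule in steps 2 to 4 can be written compactly as follows. This is only a restatement under one assumption: the expression M(word) - A is read as a set difference, so that coverage counts the texts containing the word that no previously selected feature word covers. The symbols pos(w) and len(w) are shorthand introduced here, not notation from the original text.

```latex
\begin{aligned}
M(w) &= \{\, t \in U : w \text{ occurs in } t \,\}, \\
w^{*} &= \operatorname*{arg\,max}_{\substack{w \,:\, \mathrm{pos}(w) \in P \\ \mathrm{len}(w) \text{ satisfies } L}} \bigl|\, M(w) \setminus A \,\bigr|, \\
S &\leftarrow S \cup \{ w^{*} \}, \qquad A \leftarrow A \cup M(w^{*}),
\end{aligned}
```

repeated while |S| < n and |M(w*) \ A| > 0. Read this way, the procedure has the shape of a greedy covering heuristic: each round picks the admissible word that adds the most not-yet-covered texts.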

As a further optimization, the values of n, P, and L can be adjusted according to actual conditions.

As a further optimization, in step B, selecting review texts that contain feature words as training samples specifically includes:

Based on the feature word set S returned by the FindCover algorithm, training samples are chosen with the following strategy: first, output the set U_f of all review texts that contain feature words; if |U_f| > 1%·|U|, randomly select 1%·|U| review texts from U_f as training samples; otherwise output U_f as the training samples.

As a further optimization, in step C, extracting feature vectors from the training samples and inputting them to the support vector machine to generate a classification model specifically includes:

First, based on the feature words, each text in the sample data is converted to the form "<label> feature 1: count feature 2: count ... feature n: count". With three-way classification, <label> takes the value positive, negative, or neutral; with two-way classification, <label> takes the value positive or negative. The converted training data are then input to the LIBSVM library for classification training.

As a further optimization, in step D, the analysis results are presented on the web side, and the content presented includes: the proportions of "positive", "negative", and "neutral" texts for a particular keyword; the original texts associated with each sentiment; and the change in text sentiment over time.

The beneficial effects of the present invention are as follows: a crawler module obtains the comments that users post on forums; preprocessing such as word segmentation yields the feature words of the comment text and a set of representative training data; the training data are annotated with sentiment labels and fed to a support vector machine to obtain a classification model; the model is applied to the review texts to be classified to estimate their sentiment; and finally a visualization module displays the classification results, helping users quickly understand user sentiment toward different entities (keywords) and, in turn, public opinion on the Internet.

Brief description of the drawings

Fig. 1 is an architecture diagram of the sentiment classification system based on a support vector machine according to the present invention.

Detailed description of the embodiments

As shown in Fig. 1, in one embodiment of the present invention, the sentiment classification system based on a support vector machine includes a data acquisition and preprocessing module, a feature word and training sample generation module, an SVM classification module, and a visualization module.

The implementation of each functional module is described below:

(1) Data acquisition and preprocessing module (Data Collection and Preprocessing Module, CPM for short)

The main flow of data acquisition is as follows:

(1) starting from a specified website (the initial website), crawl web pages in breadth-first order;

(2) for each page fetched, parse its source code to obtain the relevant information in the page, such as user comments;

(3) write the data to the database.
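The three acquisition steps above can be sketched as follows. This is a minimal illustration, not the crawler of the embodiment: the `fetch` callable is injected so the sketch runs without network access, the returned list stands in for the database write, and the assumption that comments sit in `div` elements with the class `comment` is hypothetical (real forum markup differs per site, and nested `div` elements inside a comment are not handled).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and the text of elements marked as comments."""

    def __init__(self, base_url, comment_class="comment"):
        super().__init__()
        self.base_url = base_url
        self.comment_class = comment_class  # hypothetical CSS class name
        self.links = []
        self.comments = []
        self._in_comment = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        if tag == "div" and attrs.get("class") == self.comment_class:
            self._in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_comment = False

    def handle_data(self, data):
        if self._in_comment:
            self.comments[-1] += data.strip()


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue, seen, store = deque([start_url]), {start_url}, []
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        html = fetch(url)
        if html is None:
            continue
        pages += 1
        parser = PageParser(url)
        parser.feed(html)
        parser.close()
        # Stand-in for step (3), writing the comments to a database.
        store.extend((url, comment) for comment in parser.comments)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store
```

In a real deployment `fetch` would wrap an HTTP client and `store` would be replaced by inserts into a database table; both are left abstract here because the text does not specify them.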

The main flow of data preprocessing is to use the Chinese word segmentation toolkit developed by the Chinese Academy of Sciences to segment users' review texts into words and tag each word with its part of speech.
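To make the output of this step concrete, the sketch below parses one line of segmented, tagged text into (word, part-of-speech) pairs. It does not call the segmentation toolkit itself; it assumes the toolkit's output has been saved in a space-separated "word/tag" form, which is an assumption about the data layout rather than something the text states. The function name `parse_tagged` is likewise introduced here for illustration.

```python
def parse_tagged(line):
    """Split 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last '/', so words containing '/' survive.
        word, sep, tag = token.rpartition("/")
        if sep and word:  # skip tokens that carry no tag
            pairs.append((word, tag))
    return pairs
```

The later steps only need this (word, tag) structure, so any segmenter that can produce it could be substituted.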

(2) Feature word and training sample generation module (Training Data Generation Module, TGM for short)

Since the present invention uses a support vector machine (SVM) to classify comment text, extracting a representative set of feature words and, on that basis, choosing high-quality training samples is the key to ensuring classification quality. To this end, we select feature words and training samples as follows. The main steps are:

(A) Selection of feature words

TGM uses the FindCover algorithm to select typical feature words. Based on practical observation, TGM selects words whose part of speech is noun (n), verb (v), or adjective (a) as feature words; that is, the input P of the FindCover algorithm is the set {n, v, a}. In addition, in actual computation TGM selects words of length L > 1 as feature words. Note that the values of n, P, and L can be adjusted according to actual needs.

Algorithm FindCover

Input: the set U of review texts that have been segmented and tagged with parts of speech, the number of feature words n, the feature word length L, the part-of-speech set P

Output: the feature word set S

1. Initialize the sets S and A. S stores the selected feature words; A is a subset of the review text set U and holds the ids of the texts covered by the feature words in S.

2. Compute the mapping M, which maps each word to the set of ids of the texts containing that word, M(word);

3. While S contains fewer than n words, find a word that satisfies three conditions:

(i) its part of speech meets the requirement of P;

(ii) its length meets the requirement of L;

(iii) its current coverage, coverage = |M(word) - A|, is maximal;

4. If the coverage of the word found is 0, terminate the loop; otherwise add the word to S, add M(word) to A, and return to step 3, continuing until S contains n words or the coverage of the word found is 0;

5. Return the set S as the feature word set.
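The five steps of FindCover can be sketched in Python as below. Several details are assumptions made for the sketch: the input is a dict from text id to a list of (word, part-of-speech) pairs, the length requirement L > 1 is expressed as a minimum length of 2 characters, M(word) - A is read as a set difference, and ties between equally good words are broken alphabetically (the text does not specify a tie-breaking rule).

```python
def find_cover(texts, n, min_len=2, pos_tags=frozenset({"n", "v", "a"})):
    """Greedy feature-word selection.

    texts: dict mapping text id -> list of (word, pos) pairs.
    Returns the selected feature words in the order they were chosen.
    """
    # Step 2: map each admissible word to the ids of the texts containing it.
    # Conditions (i) and (ii) are applied here, once, instead of in every round.
    m = {}
    for text_id, tokens in texts.items():
        for word, pos in tokens:
            if pos in pos_tags and len(word) >= min_len:
                m.setdefault(word, set()).add(text_id)

    selected, covered = [], set()  # Step 1: the sets S and A
    while len(selected) < n:       # Step 3
        best, best_gain = None, 0
        for word in sorted(m):     # sorted only to make tie-breaking deterministic
            gain = len(m[word] - covered)  # condition (iii): newly covered texts
            if gain > best_gain:
                best, best_gain = word, gain
        if best is None:           # Step 4: best coverage is 0, stop
            break
        selected.append(best)
        covered |= m[best]
    return selected                # Step 5
```

Each round scans every candidate word, so the sketch runs in roughly O(n · |vocabulary|) set operations; for a large corpus a priority queue with lazy re-evaluation would be a natural refinement, though nothing in the text requires it.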

(B) Selection of training samples

Based on the feature word set S returned by FindCover, TGM chooses training samples with the following strategy: first, output the set U_f of all review texts that contain feature words. If |U_f| > 1%·|U|, randomly select 1%·|U| review texts from U_f as training samples; otherwise output U_f as the training samples. The selected training samples are then manually annotated with sentiment labels. In practice, texts can be divided by sentiment into two classes (positive, negative) or three classes (positive, neutral, negative).
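The 1% sampling rule can be sketched as follows. The input layout (a dict from text id to (word, part-of-speech) pairs) and the `percent` and `seed` parameters are assumptions made for illustration; the text fixes the threshold at 1% and does not say how the sample size is rounded, so the sketch rounds down.

```python
import random


def select_training_samples(texts, feature_words, percent=1, seed=None):
    """Return the ids of the texts chosen as training samples."""
    features = set(feature_words)
    # U_f: all texts that contain at least one feature word.
    u_f = [text_id for text_id, tokens in texts.items()
           if any(word in features for word, _ in tokens)]
    # 1% of |U|, rounded down (rounding is an assumption of this sketch).
    limit = len(texts) * percent // 100
    # Integer form of the comparison |U_f| > 1% * |U|.
    if len(u_f) * 100 > len(texts) * percent:
        return random.Random(seed).sample(u_f, limit)
    return u_f
```

With a corpus of fewer than 100 texts, rounding down makes the sample size 0 whenever U_f is non-empty, so a practical implementation would probably enforce a minimum sample size; the text does not address this case.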

(3) SVM classification module (SVM Training Module, STM for short)

STM first converts each text in the sample data, based on the feature words, to the form "<label> feature 1: count feature 2: count ... feature n: count". With three-way classification, <label> can take the value positive, negative, or neutral; with two-way classification, <label> can take the value positive or negative. STM then inputs the converted training data to the LIBSVM library for classification training. Once the training result is obtained, STM applies the learned classification rules to the texts to be classified and analyzes their sentiment orientation.
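The conversion to the "<label> feature: count" form can be sketched as below. LIBSVM's text input format expects a numeric label followed by `index:value` pairs with integer indices in ascending order, so the sketch maps the three sentiment names to numbers and feature words to 1-based indices. The particular encoding (1, -1, 0) and the helper names are assumptions of this sketch, not values given in the text, and the sketch stops at producing the lines; it does not call LIBSVM.

```python
from collections import Counter

# Assumed numeric encoding of the sentiment labels; the text only names the classes.
LABELS = {"positive": 1, "negative": -1, "neutral": 0}


def build_feature_index(feature_words):
    """Assign each feature word a 1-based index, as LIBSVM's format expects."""
    return {word: i for i, word in enumerate(feature_words, start=1)}


def to_libsvm_line(label, tokens, feature_index):
    """Convert one tagged text into a '<label> index:count ...' line.

    tokens: list of (word, pos) pairs; words outside the feature index are ignored.
    """
    counts = Counter(word for word, _ in tokens if word in feature_index)
    # Indices must appear in ascending order; zero counts are simply omitted.
    pairs = sorted((feature_index[word], count) for word, count in counts.items())
    body = " ".join(f"{index}:{count}" for index, count in pairs)
    return f"{LABELS[label]} {body}".rstrip()
```

Writing one such line per annotated training text yields a file that LIBSVM's training tool can read; the same conversion, with a placeholder label, applies to the texts to be classified.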

(4) Visualization module (Visualization Module, VM for short)

VM presents the analysis results on the Web side. The main content displayed includes: (1) the proportions of "positive", "negative", and "neutral" texts for a particular keyword; (2) the original texts associated with each sentiment; (3) the change in text sentiment over time.
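The figures behind views (1) and (3) can be computed with a few lines of aggregation before they are handed to whatever charting front end is used. The sketch below assumes each classified text is available as a (date, text, label) record and that "based on a particular keyword" means the keyword occurs in the text; both are assumptions, as is every function name.

```python
from collections import Counter, defaultdict


def sentiment_ratios(records, keyword):
    """Share of each sentiment label among the texts that mention the keyword."""
    counts = Counter(label for _, text, label in records if keyword in text)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()} if total else {}


def sentiment_by_day(records, keyword):
    """Per-day label counts for the texts that mention the keyword."""
    trend = defaultdict(Counter)
    for day, text, label in records:
        if keyword in text:
            trend[day][label] += 1
    # Sort by date string so the series is in chronological order (ISO dates assumed).
    return {day: dict(counts) for day, counts in sorted(trend.items())}
```

View (2), the original texts associated with each sentiment, needs no aggregation: it is a filter of the same records by keyword and label.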

Claims

1. A sentiment classification system based on a support vector machine, characterized by comprising:

a data acquisition and preprocessing module, responsible for crawling data with a web crawler, obtaining the comments posted by users, and preprocessing the comments;

a feature word and training sample generation module, responsible for taking the preprocessed comment text as input, selecting high-frequency words with specific parts of speech as feature words and adding them to a feature dictionary, selecting review texts that contain feature words as training samples, and having the sentiment of the training samples manually annotated;

an SVM classification module, responsible for extracting feature vectors from the training samples based on the feature dictionary and inputting them to a support vector machine to generate a classification model, and for using the classification model to compute the sentiment value of a review text to be classified and analyze the sentiment orientation of the text;

2. A sentiment classification method based on a support vector machine, characterized by comprising the following steps:

A. crawling data with a web crawler, obtaining the comments posted by users, and preprocessing the comments;

B. taking the preprocessed comment text as input, selecting high-frequency words with specific parts of speech as feature words and adding them to a feature dictionary; selecting review texts that contain feature words as training samples, and manually annotating the sentiment of the training samples;

C. extracting feature vectors from the training samples based on the feature dictionary and inputting them to a support vector machine to generate a classification model; using the classification model to compute the sentiment value of a review text to be classified and analyzing the sentiment orientation of the text;

D. presenting the analysis results on the web side.

A sentiment classification method based on a support vector machine, characterized in that, in step A, crawling data with a web crawler and obtaining the comments posted by users specifically includes:

Starting from a specified website, crawling web pages in breadth-first order; for each page fetched, parsing its source code to obtain the user comments in the page; and writing the obtained comments to a database.

A sentiment classification method based on a support vector machine, characterized in that, in step A, preprocessing the comments specifically includes:

A sentiment classification method based on a support vector machine, characterized in that, in step B, selecting high-frequency words with specific parts of speech as feature words specifically includes:

Using the FindCover algorithm to select frequent words whose part of speech is noun, verb, or adjective as feature words.

A sentiment classification method based on a support vector machine, characterized in that the specific method of using the FindCover algorithm to select frequent words whose part of speech is noun, verb, or adjective as feature words is as follows:

Determine the input of the FindCover algorithm: the set U of review texts that have been segmented and tagged with parts of speech, the number of feature words n, the feature word length L, and the part-of-speech set P;

Determine the output of the FindCover algorithm: the feature word set S;

The selection process includes:

Step 1: initialize the sets S and A;

(i) the part of speech meets the requirement of P;

(ii) the length meets the requirement of L;

(iii) the current coverage, coverage = |M(word) - A|, is maximal;

Step 4: if the coverage of the word found is 0, terminate the loop; otherwise add the word to S, add M(word) to A, and return to step 3, continuing until S contains n words or the coverage of the word found is 0;

Step 5: return the set S as the feature word set.

A sentiment classification method based on a support vector machine, characterized in that the values of n, P, and L can be adjusted according to actual conditions.

A sentiment classification method based on a support vector machine, characterized in that, in step B, selecting review texts that contain feature words as training samples specifically includes:

Based on the feature word set S returned by the FindCover algorithm, training samples are chosen with the following strategy: first, output the set U_f of all review texts that contain feature words; if |U_f| > 1%·|U|, randomly select 1%·|U| review texts from U_f as training samples; otherwise output U_f as the training samples.

A sentiment classification method based on a support vector machine, characterized in that, in step C, extracting feature vectors from the training samples and inputting them to the support vector machine to generate a classification model specifically includes:

First, based on the feature words, each text in the sample data is converted to the form "<label> feature 1: count feature 2: count ... feature n: count". With three-way classification, <label> takes the value positive, negative, or neutral; with two-way classification, <label> takes the value positive or negative. The converted training data are then input to the LIBSVM library for classification training.

A sentiment classification method based on a support vector machine, characterized in that, in step D, the analysis results are presented on the web side, and the content presented includes: the proportions of "positive", "negative", and "neutral" texts for a particular keyword; the original texts associated with each sentiment; and the change in text sentiment over time.