CN101876985A

CN101876985A - WEB text sentiment theme recognizing method based on mixed model

Info

Publication number: CN101876985A
Application number: CN2009102191619A
Authority: CN
Inventors: 蔡皖东; 樊娜
Original assignee: Northwestern Polytechnical University
Current assignee: Nanchang Changrong Mechanical and Electrical Co., Ltd.; Northwestern Polytechnical University
Priority date: 2009-11-26
Filing date: 2009-11-26
Publication date: 2010-11-03
Anticipated expiration: 2029-11-26
Also published as: CN101876985B

Abstract

The invention discloses a method for recognizing the sentiment and topic of WEB text based on a mixture model, belonging to the field of network information security. The method trains models on a text collection so as to capture the language expression patterns of texts with different sentiment orientations and different topics, and models the language patterns of sentiment expression and topic expression to generate a sentiment language model and a topic language model respectively. For a text to be analyzed, the language model of the text itself is compared with the two kinds of models to estimate its similarity to each, so that the topic and the sentiment orientation of the text can be recognized and determined simultaneously. The invention introduces linguistic knowledge into statistical modeling, captures the characteristics and regularities of sentiment and topic expression, and makes full use of the characteristics and habits of language expression to establish a mixture model that can analyze and recognize topic and sentiment simultaneously; the average accuracy of sentiment recognition is improved from 67.81 percent in the prior art to 81.36 percent.

Description

Method for recognizing the sentiment and topic of WEB text based on a mixture model

Technical field

The present invention relates to a method for recognizing sentiment and topic, and in particular to a method for recognizing the sentiment and topic of WEB text based on a mixture model. It belongs to the field of network information security.

Background technology

Topic extraction and sentiment orientation analysis of WEB text are important research topics in the field of network information security.

The document "Sorting technique of Chinese emotion tendency under the network environment, Spoken and Written Languages Application, 2008, Vol. 2 (5), p. 139-144" discloses a text sentiment classification method based on semantic orientation. The method combines semantics with data mining theory and uses the sentiment of the phrases in a Chinese text to determine the sentiment orientation of the whole text. However, the method analyzes only the sentiment of network text and cannot recognize the topic and the sentiment orientation of network text simultaneously, so it cannot meet users' needs in network information processing. Its sentiment recognition accuracy is also low, with an average accuracy of 67.81%.

Summary of the invention

To overcome the low sentiment recognition accuracy of prior-art methods, the present invention provides a method for recognizing the sentiment and topic of WEB text based on a mixture model. The method trains models on a text collection so as to faithfully capture the language expression patterns of texts with different sentiment orientations and different topics, and models the language forms of sentiment expression and topic expression to produce two kinds of language model, sentiment and topic. For a text to be analyzed, its own model is compared with these two kinds of model and its similarity to each is assessed, so that the topic and the sentiment orientation of the text can finally be recognized and determined simultaneously. Linguistic knowledge is introduced into the statistical modeling, the characteristics and regularities of sentiment and topic expression are captured, and the characteristics and habits of language expression are fully exploited to establish a mixture model that can analyze and recognize topic and sentiment simultaneously, which improves the accuracy of sentiment recognition.

The technical solution adopted by the present invention to solve its technical problem is a method for recognizing the sentiment and topic of WEB text based on a mixture model, characterized by comprising the following steps:

(a) The texts in the training set are manually annotated, marking the sentiment orientation and the topic category of each text. According to the differences in how different sentiments are expressed in language, two sentiment models are estimated: a "commendatory" model and a "derogatory" model. At the same time, a topic language model is estimated for each topic according to the language expression patterns of texts on that topic;

(b) Parameter estimation is carried out separately for the sentiment models and topic models established in step (a). The parameters of each model are first estimated by maximum likelihood estimation (MLE). Because maximum likelihood estimation inevitably leads to the zero-probability problem, the Jelinek-Mercer smoothing method is also applied to smooth the data and adjust the values of the probability distribution;

(c) For a text to be processed, the distances between its language model and the two sentiment models are calculated, and the sentiment orientation of the nearest sentiment model is assigned to the text; the distances to each topic model are likewise calculated, and the topic of the nearest topic model is taken as the topic of the text.

The beneficial effects of the present invention are as follows. Because models are trained on a text collection, the language expression patterns of texts with different sentiment orientations and different topics are faithfully captured, and the language forms of sentiment expression and topic expression are modeled to produce two kinds of language model, sentiment and topic. For a text to be analyzed, its own model is compared with these two kinds of model and its similarity to each is assessed, so that the topic and the sentiment orientation of the text can finally be recognized and determined simultaneously. Linguistic knowledge is introduced into the statistical modeling, the characteristics and regularities of sentiment and topic expression are captured, and the characteristics and habits of language expression are fully exploited to establish a mixture model that can analyze and recognize topic and sentiment simultaneously. The average accuracy of sentiment recognition rises from 67.81% in the prior art to 81.36%.

The present invention is described in detail below with reference to the drawing and an embodiment.

Description of drawings

The accompanying drawing is a flow chart of the method of the present invention for recognizing the sentiment and topic of WEB text based on a mixture model.

Embodiment

For a text under test, the sentiment orientation and the topic of the text are analyzed according to this method. The specific steps are as follows:

Step 1: manually annotate the sentiment and topic of the training-set texts and establish the topic and sentiment models. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of documents, and let $C$ denote the set of classes, which is a partition of $X$: $C = \{c_1, c_2, \ldots, c_K\}$, with $c_i \cap c_j = \emptyset$ for $i \neq j$. The density function of $x$ is:

$$p(x) = \sum_{i=1}^{K} p(x \mid c_i)\, p(c_i) \tag{1}$$
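Formula (1) can be checked numerically on a toy example. The sketch below is only an illustration: the function name `mixture_probability`, the two class labels, and all probability values are hypothetical and do not come from the patent.

```python
def mixture_probability(x, class_conditionals, class_priors):
    """Formula (1): p(x) = sum over classes c_i of p(x | c_i) * p(c_i)."""
    return sum(
        class_conditionals[c].get(x, 0.0) * class_priors[c]
        for c in class_priors
    )


# Hypothetical toy numbers chosen for illustration only.
class_priors = {"positive": 0.5, "negative": 0.5}
class_conditionals = {
    "positive": {"good": 0.75, "bad": 0.25},
    "negative": {"good": 0.25, "bad": 0.75},
}

# 0.75 * 0.5 + 0.25 * 0.5
p_good = mixture_probability("good", class_conditionals, class_priors)
```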

To compute the distance between a model and the text to be processed, the Kullback-Leibler (KL) measure is adopted as the criterion for the difference between classes. The KL distance between two probability distributions $q(x)$ and $p(x)$ is usually defined as:

$$KL(q(x) \,\|\, p(x)) = \int q(x) \ln\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$

When $q(x) = p(x)$, the KL distance equals 0. That is, the greater the difference between two classes, the greater the KL distance; when the two class probability distributions are identical, the KL distance reaches its minimum, 0.
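For word-based language models the distributions are discrete, so the integral in formula (2) becomes a sum over the vocabulary. The following is a minimal sketch of that discrete analogue; the function name `kl_distance` and the example distributions are assumptions made for illustration, and the code assumes $p(x) > 0$ wherever $q(x) > 0$.

```python
import math


def kl_distance(q, p):
    """Discrete analogue of formula (2): sum_x q(x) * ln(q(x) / p(x)).

    q and p map each outcome to its probability. Terms with q(x) == 0
    contribute nothing; p(x) must be positive wherever q(x) is positive.
    """
    total = 0.0
    for x, qx in q.items():
        if qx > 0.0:
            total += qx * math.log(qx / p[x])
    return total


# Hypothetical distributions over two outcomes.
q = {"a": 0.5, "b": 0.5}
p = {"a": 0.9, "b": 0.1}

same = kl_distance(q, q)       # identical distributions
different = kl_distance(q, p)  # differing distributions
```

With identical arguments every logarithm is `ln(1)`, so the result is zero, which matches the property stated above; for differing distributions the result is positive.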

The probability density function of data $x$ on the $i$-th class is $q(x) = p(x \mid c_i)$, and the KL distance between the density functions $p(x)$ and $q(x)$ is defined as:

$$\psi = -KL(p(x \mid c_i) \,\|\, p(x)) \tag{3}$$

For the sentiment models, $i$ ranges over two models, the "commendatory" model and the "derogatory" model; for the topic models, $i$ ranges over $s$ models, where $s$ is the number of topic models estimated from the training set.

When building a language model, the model order is a key factor affecting model performance. With the same modeling unit, a higher-order model performs better than a lower-order model, but is harder to build. In theory, a higher-order n-gram describes the language more accurately and brings the model closer to real language phenomena; in practice, using higher-order language units with existing corpora causes a serious data sparsity problem, which degrades the model. Therefore, the language unit in formula (1) is the commonly used word bigram, which serves as the model parameter.
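As an illustration of the word bigram unit, the sketch below extracts adjacent word pairs from a tokenized text and counts them. It assumes the text has already been split into words (Chinese text would first need word segmentation, which is not shown); the function names and the sample sentence are hypothetical.

```python
from collections import Counter


def word_bigrams(tokens):
    """Return the adjacent word pairs (bigrams) of a tokenized text."""
    return list(zip(tokens, tokens[1:]))


def bigram_counts(tokens):
    """Count how often each word bigram occurs in the text."""
    return Counter(word_bigrams(tokens))


# Hypothetical, already-segmented sample sentence.
tokens = ["the", "screen", "is", "very", "clear"]
counts = bigram_counts(tokens)
```

A text of n words yields n - 1 bigrams, so short texts produce few observations per bigram, which is the sparsity concern that motivates stopping at order two.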

Step 2: model parameter estimation. The commonly used maximum likelihood estimation (MLE) method is applied to estimate the model parameters. The preliminary MLE estimate of the model parameters is as follows:

$$P_M(w_i \mid T) = \frac{\mathrm{count}(w_i)}{\mathrm{count}(r)} \tag{4}$$

In formula (4), $T$ may represent the text to be processed, the commendatory text set, the derogatory text set, or a topic text set. $\mathrm{count}(w_i)$ is the number of times $w_i$ occurs in $T$, and $\mathrm{count}(r)$ is the number of occurrences of any word in $T$, that is, the total word count of $T$. Because the data are sparse, maximum likelihood estimation inevitably leads to the zero-probability problem: for a term $w$ that does not appear in a document $t$, MLE gives $P(w \mid t) = 0$. The zero-probability problem greatly weakens the descriptive power and processing capability of the model. Data smoothing adjusts the values of the probability distribution, raising low probabilities (including zero probabilities) and lowering high probabilities, so that zero probabilities are avoided. This effectively addresses data sparsity, and also makes the probability distribution of the model parameters more uniform and the probability calculation more accurate. The present invention adopts the Jelinek-Mercer smoothing method based on linear interpolation, which is commonly used to correct the bias in parameter estimates caused by a small training sample set. Following the idea of Jelinek-Mercer smoothing, the smoothed model parameters are defined as follows:

$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$

In formula (5), $\lambda$ is the smoothing parameter, $0 < \lambda < 1$. The value of $\lambda$ must be determined experimentally and directly affects the performance of the model. Parameter estimation and smoothing for the sentiment models and topic models are completed by formulas (4) and (5).
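Formulas (4) and (5) can be sketched in a few lines. The text does not spell out how $P(w_i \mid C)$ is obtained; the sketch assumes, as is usual for Jelinek-Mercer smoothing, that it is a background model estimated by MLE from a larger collection whose vocabulary covers the text. The function names, the sample data, and the value of `lam` are all hypothetical.

```python
from collections import Counter


def mle_model(units):
    """Formula (4): P_M(w | T) = count(w) / total number of units in T."""
    counts = Counter(units)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jm_smooth(p_mle, p_background, lam):
    """Formula (5): P_s(w | T) = lam * P_M(w | T) + (1 - lam) * P(w | C).

    Assumes p_background covers every unit that appears in p_mle.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return {
        w: lam * p_mle.get(w, 0.0) + (1.0 - lam) * pc
        for w, pc in p_background.items()
    }


# Hypothetical data: a short text and a background collection.
text_units = ["good", "good", "fast"]
background_units = ["good", "fast", "slow", "bad"]

p_text = mle_model(text_units)              # "slow" and "bad" are absent
p_background = mle_model(background_units)  # assumed stand-in for P(w | C)
p_smoothed = jm_smooth(p_text, p_background, lam=0.5)
```

After smoothing, units that never occur in the text receive a nonzero probability from the background model, while the smoothed values still sum to one. The value `lam=0.5` is arbitrary here; the text states that λ has to be chosen experimentally.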

Step 3: definition of the model distance function. To accurately assess the degree of similarity between the text to be processed and each model, a distance function is introduced. The similarity between models is judged by calculating the distance between the model of the text to be processed and each model.

The distance function for the sentiment models is defined as follows:

$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$

Here $t$ denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" model and the "derogatory" model respectively, $d_1$ is the KL distance between text $t$ and the "commendatory" model, and $d_2$ is the KL distance between text $t$ and the "derogatory" model. When $\theta$ is greater than 0, the text is closer to the "derogatory" model, and the sentiment it expresses is judged to be derogatory; conversely, when $\theta$ is less than 0, it is judged to be commendatory. When $\theta$ equals 0, the sentiment expressed by the text is neutral.
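The decision rule of formula (6) can be sketched as follows. This is an illustration under stated assumptions: the KL distance is computed from the text model to each sentiment model (the direction is not specified in the text), all models are already smoothed so that no probability is zero, and the function names, labels, and toy distributions are hypothetical.

```python
import math


def kl_distance(q, p):
    """Discrete KL distance: sum_x q(x) * ln(q(x) / p(x))."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)


def classify_sentiment(text_model, positive_model, negative_model):
    """Formula (6): theta = d1 - d2, then decide by the sign of theta."""
    d1 = kl_distance(text_model, positive_model)  # distance to positive model
    d2 = kl_distance(text_model, negative_model)  # distance to negative model
    theta = d1 - d2
    if theta > 0:
        return "negative", theta  # farther from the positive model
    if theta < 0:
        return "positive", theta  # farther from the negative model
    return "neutral", theta


# Hypothetical smoothed models over a two-word vocabulary.
positive_model = {"good": 0.8, "bad": 0.2}
negative_model = {"good": 0.2, "bad": 0.8}

label_a, theta_a = classify_sentiment(
    {"good": 0.7, "bad": 0.3}, positive_model, negative_model
)
label_b, theta_b = classify_sentiment(
    {"good": 0.3, "bad": 0.7}, positive_model, negative_model
)
```

Here "positive" and "negative" stand for the two sentiment classes of the method. A text dominated by "good" lies nearer the positive model, so θ is negative; swapping the word proportions flips the sign.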

To establish the topic models, the topics of the texts in the training data set are first manually annotated and a language model is estimated for each topic; the degree of similarity between the language model of the text to be processed and each of these topic models is then assessed. If the language model of the text to be processed is most similar to a certain topic model, the topic of the text is considered to be the topic of that model. The distance function for the topic models is defined as follows:

$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$

Here $\gamma_i$ denotes the $i$-th topic model, and $d_{\min}(t, \gamma_i)$ denotes the minimum KL distance between the model of the text to be processed and the topic models. If the KL distance between the text and the $i$-th topic model is the smallest, the $i$-th topic is taken as the topic of the text.
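Selecting the topic by formula (7) amounts to taking the topic model at minimum KL distance from the text model. The sketch below illustrates this under the same assumptions as before: the distance is computed from the text model to each topic model, the models are smoothed so no probability is zero, and the topic names and distributions are hypothetical.

```python
import math


def kl_distance(q, p):
    """Discrete KL distance: sum_x q(x) * ln(q(x) / p(x))."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)


def classify_topic(text_model, topic_models):
    """Formula (7): return the topic whose model is nearest to the text."""
    distances = {
        name: kl_distance(text_model, model)
        for name, model in topic_models.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical smoothed topic models over a two-word vocabulary.
topic_models = {
    "sports": {"match": 0.7, "price": 0.3},
    "finance": {"match": 0.1, "price": 0.9},
}
text_model = {"match": 0.6, "price": 0.4}

topic, distance = classify_topic(text_model, topic_models)
```

If two topic models are equally near, `min` returns the first one encountered; the text does not say how such ties should be resolved.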

In testing, the method of the present invention achieves an average sentiment recognition accuracy of 81.36%.

Claims

1. A method for recognizing the sentiment and topic of WEB text based on a mixture model, characterized by comprising the following steps:

(a) the texts in the training set are manually annotated, marking the sentiment orientation and the topic category of each text; according to the differences in how different sentiments are expressed in language, two sentiment models are estimated: a "commendatory" model and a "derogatory" model; at the same time, a topic language model is estimated for each topic according to the language expression patterns of texts on that topic;

(b) parameter estimation is carried out separately for the sentiment models and topic models established in step (a); the parameters of each model are first estimated by maximum likelihood estimation (MLE); because maximum likelihood estimation inevitably leads to the zero-probability problem, the Jelinek-Mercer smoothing method is also applied to smooth the data and adjust the values of the probability distribution;