CN101876985B

CN101876985B - WEB text sentiment theme recognizing method based on mixed model

Info

Publication number: CN101876985B
Application number: CN200910219161A
Authority: CN
Inventors: 蔡皖东; 樊娜
Original assignee: Northwestern Polytechnical University
Current assignee: Nanchang Changrong Mechanical and Electrical Co., Ltd.; Northwestern Polytechnical University
Priority date: 2009-11-26
Filing date: 2009-11-26
Publication date: 2012-08-29
Anticipated expiration: 2029-11-26
Also published as: CN101876985A

Abstract

The invention discloses a WEB text sentiment and theme recognition method based on a mixed model, belonging to the field of network information security. In the method, models are trained on a text set so as to faithfully simulate the language expression patterns of texts with different sentiment inclinations and different themes; the language patterns of sentiment expression and theme expression are modeled to generate a sentiment language model and a theme language model, respectively. For a text to be analyzed, the text's own model is compared with the two kinds of models to estimate its similarity to each, so that the theme and the sentiment inclination of the text can be recognized simultaneously. The invention introduces linguistic knowledge into statistical modeling, captures the characteristics and regularities of sentiment and theme expression, and makes full use of the characteristics and habits of language expression to establish a mixed model that can analyze and recognize theme and sentiment simultaneously; the average accuracy of sentiment recognition is improved from 67.81 percent in the prior art to 81.36 percent.

Description

WEB text emotion and theme recognition method based on a mixture model

Technical field

The present invention relates to an emotion and theme recognition method, in particular a WEB text emotion and theme recognition method based on a mixture model, and belongs to the field of network information security.

Background technology

WEB text theme extraction and emotion tendency analysis are important research topics in the field of network information security.

The document "sorting technique of Chinese emotion tendency under the network environment, spoken and written languages application, 2008, Vol.2 (5), p139-144" discloses a text emotion classification method based on semantic tendency. The method combines semantics with data mining theory and uses the emotion of phrases in Chinese text to study the emotion tendency of the whole text. However, the method analyzes only the emotion of network text and cannot recognize the theme and the emotion tendency of network text at the same time, so it cannot meet users' needs in network information processing. Its emotion recognition accuracy is also low, with an average accuracy of 67.81%.

Summary of the invention

To overcome the low emotion recognition accuracy of prior-art methods, the present invention provides a WEB text emotion and theme recognition method based on a mixture model. The method trains models on a text collection so as to faithfully simulate the language expression patterns of texts with different emotion tendencies and different themes, models the linguistic forms of emotion expression and theme expression, and produces two classes of language models: emotion models and topic models. For a text to be processed, its own model is compared with these two classes of models and its similarity to each is assessed, so that the theme and the emotion tendency of the text can finally be recognized at the same time. Linguistic knowledge is introduced into the statistical modeling; the characteristics and regularities of emotion and theme expression are captured; and the characteristics and habits of language expression are fully exploited to build a mixture model that can analyze and recognize theme and emotion simultaneously, which can improve the accuracy of emotion recognition.

The technical solution adopted by the present invention to solve its technical problem is a WEB text emotion and theme recognition method based on a mixture model, characterized by comprising the following steps:

(a) The texts in the training set are labeled manually, marking the emotion tendency and the theme category of each text. According to the differences in how different emotions are expressed in language, two emotion models are estimated: a "commendatory" model and a "derogatory" model. At the same time, according to the language expression patterns of texts on different themes, a topic language model is estimated for each theme;

(b) Parameter estimation is carried out separately for the emotion models and topic models established in step (a). The parameters of each model are first estimated by maximum likelihood estimation (MLE). Because maximum likelihood estimation inevitably causes the zero-probability problem, Jelinek-Mercer smoothing is then applied to smooth the data and adjust the values of the probability distribution;

(c) For a text to be processed, the distances between its language model and the two emotion models are calculated, and the emotion tendency of the nearest emotion model is assigned to the text; the distances to each topic model are likewise calculated, and the theme of the nearest topic model is taken as the theme of the text.

The beneficial effects of the invention are as follows. Models are trained on a text collection so as to faithfully simulate the language expression patterns of texts with different emotion tendencies and different themes; the linguistic forms of emotion expression and theme expression are modeled, producing two classes of language models: emotion models and topic models. For a text to be processed, its own model is compared with these two classes of models and its similarity to each is assessed, so that the theme and the emotion tendency of the text can finally be recognized at the same time. Linguistic knowledge is introduced into the statistical modeling; the characteristics and regularities of emotion and theme expression are captured; and the characteristics and habits of language expression are fully exploited to build a mixture model that can analyze and recognize theme and emotion simultaneously. The average accuracy of emotion recognition rises from 67.81% in the prior art to 81.36%.

The present invention is described in detail below with reference to the accompanying drawing and an embodiment.

Description of drawings

The accompanying drawing is a flow chart of the WEB text emotion and theme recognition method based on a mixture model according to the present invention.

Embodiment

For a text to be tested, its emotion tendency and theme are analyzed according to this method. The specific steps are as follows:

The first step: manually label the emotion and theme of the training-set texts and establish the topic and emotion models. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of documents, and let $C$ denote the set of classes, which is a partition of $X$: $C = \{c_1, c_2, \ldots, c_K\}$, with $c_i \cap c_j = \varnothing$ for all $i \neq j$.

The density function of x is:

$$p(x) = \sum_{i=1}^{K} p(x \mid c_i)\, p(c_i) \tag{1}$$
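As a purely illustrative worked example of formula (1), with made-up numbers that do not come from the patent, take $K = 2$ classes with priors $p(c_1) = 0.6$ and $p(c_2) = 0.4$, and suppose a document $x$ has class-conditional densities $p(x \mid c_1) = 0.1$ and $p(x \mid c_2) = 0.3$. Then:

$$p(x) = 0.1 \times 0.6 + 0.3 \times 0.4 = 0.06 + 0.12 = 0.18$$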

To compute the distance between a model and the text to be processed, the Kullback-Leibler (KL) measure is adopted as the criterion for the difference between classes. The KL distance between two probability distributions $q(x)$ and $p(x)$ is usually defined as:

$$\mathrm{KL}\big(q(x) \,\|\, p(x)\big) = \int q(x) \ln\!\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$

When $q(x) = p(x)$, the KL distance equals 0. In other words, the greater the difference between the two classes, the larger the KL distance; when the two probability distributions are identical, the KL distance takes its minimum value, 0.
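The behavior described above can be checked numerically. The following is a minimal sketch, assuming a discrete vocabulary so that the integral in formula (2) becomes a sum; the function name `kl_divergence` and the toy distributions are illustrative assumptions, not part of the patent.

```python
import math

def kl_divergence(q, p):
    """Discrete form of formula (2): sum over x of q(x) * ln(q(x) / p(x)).

    q and p are dicts mapping outcomes to probabilities (an assumption of
    this sketch). Terms with q(x) == 0 contribute nothing; p(x) must be
    positive wherever q(x) is positive.
    """
    total = 0.0
    for x, qx in q.items():
        if qx > 0.0:
            total += qx * math.log(qx / p[x])
    return total

# Toy distributions, invented for illustration only.
q = {"good": 0.5, "bad": 0.5}
p_same = {"good": 0.5, "bad": 0.5}
p_diff = {"good": 0.9, "bad": 0.1}

print(kl_divergence(q, p_same))  # identical distributions give 0.0
print(kl_divergence(q, p_diff))  # a differing distribution gives a positive value
```

Note that the KL measure is not symmetric: `kl_divergence(q, p)` and `kl_divergence(p, q)` generally differ, so an implementation has to fix one direction.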

The probability density function of data $x$ on the $i$-th class is $q(x) = p(x \mid c_i)$, and the KL distance between the density functions $p(x)$ and $q(x)$ is defined as:

$$\psi = -\mathrm{KL}\big(p(x \mid c_i) \,\|\, p(x)\big) \tag{3}$$

For the emotion model, i = 2, meaning that there are two models: the "commendatory" model and the "derogatory" model. For the topic model, i = s, where s is the number of topic models estimated from the training set.
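Formulas (1) and (3) can be combined in a short numerical sketch. It assumes discrete distributions stored as dicts, two emotion classes with equal priors, and invented probabilities; the names `mixture` and `psi` are hypothetical helpers, not names used in the patent.

```python
import math

def mixture(cond, prior):
    """Formula (1): p(x) = sum_i p(x | c_i) * p(c_i), over a discrete vocabulary."""
    vocab = set().union(*[d.keys() for d in cond.values()])
    return {x: sum(cond[c].get(x, 0.0) * prior[c] for c in cond) for x in vocab}

def psi(cond_i, p):
    """Formula (3): psi = -KL(p(x | c_i) || p(x)), in discrete form."""
    return -sum(v * math.log(v / p[x]) for x, v in cond_i.items() if v > 0.0)

# Toy class-conditional distributions and priors (illustrative only).
cond = {
    "commendatory": {"good": 0.8, "bad": 0.2},
    "derogatory": {"good": 0.2, "bad": 0.8},
}
prior = {"commendatory": 0.5, "derogatory": 0.5}

p = mixture(cond, prior)
for name, dist in cond.items():
    print(name, psi(dist, p))
```

Because the KL measure is never negative, psi as defined in formula (3) is never positive, and it equals 0 only when the class-conditional distribution coincides with the mixture.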

When building a language model, the model order is a key factor affecting model performance. With the same modeling unit, a higher-order model performs better than a lower-order model, but is harder to construct. In theory, a higher-order n-gram describes the language more accurately and brings the model closer to real language phenomena; in practice, however, using higher-order linguistic units with existing corpora causes a serious data sparseness problem and degrades the model. Therefore, the commonly used word bigram is adopted as the linguistic unit in formula (1) and serves as the model parameter.
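To make the word-bigram unit concrete, here is a small sketch that extracts and counts adjacent word pairs. It assumes the text has already been split into words; whitespace splitting of an English sentence is used only for illustration, whereas Chinese WEB text would first need a word segmentation step that the patent does not detail. The helper name `word_bigrams` is hypothetical.

```python
from collections import Counter

def word_bigrams(tokens):
    """Return the adjacent word pairs (bigrams) of a token sequence."""
    return list(zip(tokens, tokens[1:]))

# Illustrative sentence, tokenized by whitespace (an assumption of this sketch).
tokens = "the phone is good and the phone is cheap".split()
counts = Counter(word_bigrams(tokens))

for bigram, n in counts.most_common():
    print(bigram, n)
```

A sequence of n words yields n - 1 bigrams, so the number of distinct units grows quickly with vocabulary size, which is the data sparseness concern raised above for higher orders.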

The second step: model parameter estimation. The model parameters are estimated with the commonly used maximum likelihood estimation (MLE) method. The preliminary MLE estimate of the model parameters is:

$$P_M(w_i \mid T) = \frac{\operatorname{count}(w_i)}{\operatorname{count}(r)} \tag{4}$$

In formula (4), $T$ can denote either the text to be processed or the commendatory text collection, the derogatory text collection, or a theme text collection. $\operatorname{count}(w_i)$ denotes the number of times $w_i$ occurs in $T$; correspondingly, $\operatorname{count}(r)$ denotes the number of times any word occurs in $T$. Because of data sparseness, maximum likelihood estimation inevitably causes the zero-probability problem: for a term $w$ that does not appear in document $t$, MLE gives $P(w \mid t) = 0$. The zero-probability problem greatly weakens the descriptive power and processing capability of the model. Data smoothing adjusts the values of the probability distribution, raising low probabilities (including zero probabilities) and lowering high ones, thereby avoiding zero probabilities. It effectively alleviates data sparseness, makes the probability distribution of the model parameters more uniform, and makes the probability calculation more accurate. The present invention adopts the Jelinek-Mercer smoothing method, which is based on linear interpolation and is commonly used to correct the bias in parameter estimates caused by a small training sample set. Following the idea of Jelinek-Mercer smoothing, the smoothed model parameters are defined as follows:

$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$

In formula (5), $\lambda$ is the smoothing parameter, $0 < \lambda < 1$. $\lambda$ must be determined experimentally and directly affects the performance of the model. Parameter estimation and smoothing for the emotion models and topic models are completed by formulas (4) and (5).
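Formulas (4) and (5) can be sketched together as follows. This is a minimal illustration under stated assumptions: single words are used instead of bigrams for brevity, $P(w \mid C)$ is taken to be an MLE estimate over a background collection $C$ (the patent does not spell out how $P(w \mid C)$ is obtained), every word of $T$ is assumed to occur in $C$, and the value 0.5 for lambda is arbitrary. The function names `mle` and `jelinek_mercer` are hypothetical.

```python
from collections import Counter

def mle(tokens):
    """Formula (4): count(w) divided by the total number of word occurrences in T."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jelinek_mercer(p_text, p_background, lam):
    """Formula (5): lam * P_M(w | T) + (1 - lam) * P(w | C), for each w in C."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    return {w: lam * p_text.get(w, 0.0) + (1.0 - lam) * p_bg
            for w, p_bg in p_background.items()}

# Toy data, invented for illustration only.
text_tokens = ["good", "good", "nice"]
background_tokens = ["good", "bad", "nice", "bad"]

p_m = mle(text_tokens)
p_c = mle(background_tokens)
p_s = jelinek_mercer(p_m, p_c, lam=0.5)

print(p_m.get("bad", 0.0))  # unsmoothed MLE: the unseen word has zero probability
print(p_s["bad"])           # smoothed estimate: the unseen word gets positive mass
```

Since both inputs are probability distributions over the same vocabulary, the interpolated values still sum to 1, and no word that occurs in the background collection ends up with zero probability.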

The third step: definition of the model distance function. To assess accurately the degree of similarity between the text to be processed and each model, a distance function is introduced. The similarity between models is judged by calculating the distance between the model of the text to be processed and each model.

The distance function of the emotion model is defined as follows:

$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$

where $t$ denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" model and the "derogatory" model respectively, $d_1$ is the KL distance between text $t$ and the "commendatory" model, and $d_2$ is the KL distance between text $t$ and the "derogatory" model. When $\theta$ is greater than 0, the text is closer to the "derogatory" model and the emotion it expresses is judged to be derogatory; conversely, when $\theta$ is less than 0, it is judged to be commendatory. When $\theta$ equals 0, the emotion expressed by the text is neutral.
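The decision rule of formula (6) can be sketched as below. The sketch assumes smoothed discrete models stored as dicts and computes the KL distance in the direction KL(text || model); the patent does not state which direction is used, so that choice, the helper names `kl` and `classify_emotion`, and the toy probabilities are all assumptions.

```python
import math

def kl(q, p):
    """Discrete KL(q || p); p is assumed smoothed, so p[w] > 0 for every w in q."""
    return sum(v * math.log(v / p[w]) for w, v in q.items() if v > 0.0)

def classify_emotion(text_model, pos_model, neg_model):
    """Formula (6): theta = d1 - d2, followed by the sign test from the text."""
    d1 = kl(text_model, pos_model)  # distance to the "commendatory" model
    d2 = kl(text_model, neg_model)  # distance to the "derogatory" model
    theta = d1 - d2
    if theta > 0:
        return "derogatory", theta
    if theta < 0:
        return "commendatory", theta
    return "neutral", theta

# Toy smoothed models (illustrative numbers only).
pos_model = {"good": 0.7, "bad": 0.3}
neg_model = {"good": 0.3, "bad": 0.7}

print(classify_emotion({"good": 0.6, "bad": 0.4}, pos_model, neg_model))
print(classify_emotion({"good": 0.2, "bad": 0.8}, pos_model, neg_model))
```

With floating-point arithmetic, theta is rarely exactly 0, so a practical implementation might treat a small band around 0 as neutral; that tolerance would be a design choice outside what the patent specifies.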

To establish the topic models, the themes of the texts in the training data set are first labeled manually and a language model is estimated for each theme; the degree of similarity between the language model of the text to be processed and each of these models is then assessed. If the language model of the text to be processed is most similar to a certain topic model, the theme of the text is considered to be the theme of that model. The distance function of the topic model is defined as follows:

$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$

where $\gamma_i$ denotes the $i$-th topic model and $d_{\min}(t, \gamma_i)$ denotes the minimum KL distance between the model of the text to be processed and the topic models. If the KL distance between the text and the $i$-th topic model is the smallest, the $i$-th theme is taken as the theme of the text.
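The nearest-model rule of formula (7) can be sketched in the same style. As before, the discrete dict representation, the direction KL(text || model), the helper names `kl` and `classify_topic`, and the three toy topic models are assumptions made for illustration.

```python
import math

def kl(q, p):
    """Discrete KL(q || p); p is assumed smoothed, so p[w] > 0 for every w in q."""
    return sum(v * math.log(v / p[w]) for w, v in q.items() if v > 0.0)

def classify_topic(text_model, topic_models):
    """Formula (7): return the topic whose model has the minimum KL distance."""
    distances = {name: kl(text_model, model) for name, model in topic_models.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]

# Toy smoothed topic models (illustrative numbers only).
topic_models = {
    "sports": {"ball": 0.7, "vote": 0.1, "stock": 0.2},
    "politics": {"ball": 0.1, "vote": 0.7, "stock": 0.2},
    "finance": {"ball": 0.1, "vote": 0.2, "stock": 0.7},
}
text_model = {"ball": 0.2, "vote": 0.6, "stock": 0.2}

print(classify_topic(text_model, topic_models))
```

The same text model can be passed to both the emotion rule and this topic rule, which is how the method recognizes theme and emotion tendency at the same time.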

In testing, the method of the invention achieved an average emotion recognition accuracy of 81.36%.

Claims

1. A WEB text emotion and theme recognition method based on a mixture model, characterized by comprising the following steps:

The first step: let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of documents, and let $C$ denote the set of classes, $C = \{c_1, c_2, \ldots, c_K\}$, with $c_i \cap c_j = \varnothing$ for all $i \neq j$;

The density function of x is:

$$p(x) = \sum_{i=1}^{K} p(x \mid c_i)\, p(c_i) \tag{1}$$

The Kullback-Leibler (KL) measure is adopted as the criterion for the difference between classes; the KL distance between two probability distributions $q(x)$ and $p(x)$ is defined as:

$$\mathrm{KL}\big(q(x) \,\|\, p(x)\big) = \int q(x) \ln\!\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$

When $q(x) = p(x)$, the KL distance equals 0;

$$\psi = -\mathrm{KL}\big(p(x \mid c_i) \,\|\, p(x)\big) \tag{3}$$

For the emotion model, i = 2, meaning that there are two models, the "commendatory" model and the "derogatory" model; for the topic model, i = s, where s is the number of topic models estimated from the training set;

The second step: the preliminary MLE estimate of the model parameters is:

$$P_M(w_i \mid T) = \frac{\operatorname{count}(w_i)}{\operatorname{count}(r)} \tag{4}$$

In formula (4), $T$ denotes the commendatory text collection, the derogatory text collection, or a theme text collection; $\operatorname{count}(w_i)$ denotes the number of times $w_i$ occurs in $T$, and correspondingly $\operatorname{count}(r)$ denotes the number of times any word occurs in $T$; the smoothed model parameters are defined as follows:

$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$

In formula (5), $\lambda$ is the smoothing parameter, $0 < \lambda < 1$;

The third step: the distance function of the emotion model is defined according to formula (6):

$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$

where $t$ denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" model and the "derogatory" model respectively, $d_1$ is the KL distance between text $t$ and the "commendatory" model, and $d_2$ is the KL distance between text $t$ and the "derogatory" model; when $\theta$ is greater than 0, the text is closer to the "derogatory" model and the emotion it expresses is judged to be derogatory; conversely, when $\theta$ is less than 0, it is judged to be commendatory; when $\theta$ equals 0, the emotion expressed by the text is neutral;

The distance function of the topic model is defined as follows:

$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$

where $\gamma_i$ denotes the $i$-th topic model and $d_{\min}(t, \gamma_i)$ denotes the minimum KL distance between the model of the text to be processed and the topic models; if the KL distance between the text and the $i$-th topic model is the smallest, the $i$-th theme is taken as the theme of the text.