CN101876985B - WEB text sentiment theme recognizing method based on mixed model - Google Patents

WEB text sentiment theme recognizing method based on mixed model

Info

Publication number
CN101876985B
Authority
CN
China
Prior art keywords
model
text
theme
expression
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910219161A
Other languages
Chinese (zh)
Other versions
CN101876985A (en)
Inventor
蔡皖东
樊娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Changrong Mechanical and Electrical Co., Ltd.
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN200910219161A
Publication of CN101876985A
Application granted
Publication of CN101876985B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a WEB text sentiment theme recognizing method based on a mixed model, which belongs to the field of network information security. In the method, model training is carried out in a text set; text language expression patterns of different sentiment inclinations and different themes are truly simulated; and language modes of sentiment expression and theme expression are modeled to respectively generate a sentiment language model and a theme language model. Aiming at a text to be processed, which needs to be analyzed, the similarity between the text and the two kinds of models is estimated by comparing a model of the text with the two kinds of models, and a theme and the sentiment inclination of the text can be simultaneously recognized and confirmed. The invention introduces language information knowledge in statistical modeling, captures and explores the characteristics and a rule of sentiment and theme expression and fully utilizes the characteristics and habit of language expression to establish a mixed model which can simultaneously analyze and recognize the theme and sentiment; and the average accuracy of sentiment recognition is improved from 67.81 percent in the prior art to 81.36 percent.

Description

WEB text sentiment and topic recognition method based on a mixture model
Technical field
The present invention relates to a sentiment and topic recognition method, and in particular to a WEB text sentiment and topic recognition method based on a mixture model. It belongs to the field of network information security.
Background art
Topic extraction and sentiment orientation analysis of WEB text are important research topics in the field of network information security.
The document "A classification technique for Chinese sentiment tendency in the network environment, Spoken and Written Languages Application, 2008, Vol. 2(5), pp. 139-144" discloses a text sentiment classification method based on semantic orientation. The method combines semantics with data mining theory and uses the sentiment of the phrases in a Chinese text to infer the sentiment orientation of the whole text. However, the method analyzes only the sentiment of network text and cannot recognize the topic and the sentiment orientation of a network text at the same time, so it cannot meet users' needs in network information processing. Its sentiment recognition accuracy is also low, with an average accuracy of 67.81%.
Summary of the invention
To overcome the low sentiment recognition accuracy of prior-art methods, the present invention provides a WEB text sentiment and topic recognition method based on a mixture model. The method performs model training on a text collection, realistically simulates the language expression patterns of texts with different sentiment orientations and different topics, and models the language forms of sentiment expression and topic expression, producing two classes of language models: sentiment models and topic models. For a text to be analyzed, the method compares the text's own model with these two classes of models and evaluates its similarity to each, so that the topic and the sentiment orientation of the text are determined at the same time. Linguistic knowledge is introduced into statistical modeling; the characteristics and regularities of sentiment and topic expression are captured; and the characteristics and habits of language expression are fully exploited to build a mixture model that analyzes and recognizes topic and sentiment simultaneously, which improves the accuracy of sentiment recognition.
The technical solution adopted by the present invention to solve this technical problem is a WEB text sentiment and topic recognition method based on a mixture model, characterized by comprising the following steps:
(a) The texts in the training set are labeled manually, marking the sentiment orientation and the topic category of each text. According to the differences in how different sentiments are expressed, two sentiment models are estimated: a "commendatory" model and a "derogatory" model. At the same time, according to the language expression patterns of texts on different topics, a language model is estimated for each topic;
(b) Parameters are estimated separately for the sentiment models and the topic models established in step (a). The parameters of each model are first estimated by maximum likelihood estimation (MLE). Because maximum likelihood estimation inevitably causes the zero-probability problem, the Jelinek-Mercer smoothing method is then applied to smooth the data and adjust the values of the probability distribution;
(c) For a text to be processed, the distance between its language model and each of the two sentiment models is computed, and the sentiment orientation of the nearest sentiment model is assigned to the text; the distance to each topic model is also computed, and the topic of the nearest topic model is taken as the topic of the text.
The beneficial effects of the invention are as follows. Model training is performed on a text collection; the language expression patterns of texts with different sentiment orientations and different topics are realistically simulated; and the language forms of sentiment expression and topic expression are modeled, producing two classes of language models: sentiment models and topic models. For a text to be analyzed, the text's own model is compared with these two classes of models and its similarity to each is evaluated, so that the topic and the sentiment orientation of the text are determined at the same time. Linguistic knowledge is introduced into statistical modeling; the characteristics and regularities of sentiment and topic expression are captured; and the characteristics and habits of language expression are fully exploited to build a mixture model that analyzes and recognizes topic and sentiment simultaneously. The average accuracy of sentiment recognition rises from 67.81% in the prior art to 81.36%.
The present invention is described in detail below with reference to the accompanying drawing and an embodiment.
Description of drawings
The accompanying drawing is a flow chart of the WEB text sentiment and topic recognition method based on a mixture model according to the present invention.
Embodiment
For a text under test, the sentiment orientation and the topic of the text are analyzed according to this method. The concrete steps are as follows:
Step 1: manually label the sentiment and the topic of the training-set texts and establish the topic and sentiment models. Let X be the set of documents, $X = \{x_1, x_2, \ldots, x_n\}$, and let C denote the set of classes, which is a partition of X: $C = \{c_1, c_2, \ldots, c_K\}$, $c_i \cap c_j = \varnothing$, $\forall i \neq j$. The density function of x is:
$$p(x) = \sum_{i=1}^{K} p(x \mid c_i)\, p(c_i) \tag{1}$$
To compute the distance between a model and the text to be processed, the Kullback-Leibler (KL) measure is adopted as the criterion for the difference between classes. The KL distance between two probability distributions q(x) and p(x) is usually defined as:
$$\mathrm{KL}\big(q(x) \,\|\, p(x)\big) = \int q(x) \ln\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$
When q(x) = p(x), the KL distance equals 0. In other words, the greater the difference between the two distributions, the larger the KL distance; when the two probability distributions are identical, the KL distance takes its minimum value, 0.
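As an illustration of formula (2), the following minimal sketch computes the KL distance for discrete distributions, where the integral becomes a sum over outcomes. The function name `kl_distance` and the toy distributions are assumptions made for this example and do not come from the patent.

```python
import math

def kl_distance(q, p):
    """Discrete form of formula (2): sum over x of q(x) * ln(q(x) / p(x)).

    q and p are dicts mapping outcomes to probabilities. p must be
    non-zero wherever q is non-zero (smoothing normally ensures this).
    """
    total = 0.0
    for x, qx in q.items():
        if qx > 0.0:  # terms with q(x) = 0 contribute nothing
            total += qx * math.log(qx / p[x])
    return total

# Toy distributions over two words (illustrative values only).
q = {"good": 0.5, "bad": 0.5}
p = {"good": 0.9, "bad": 0.1}

same = kl_distance(q, q)       # identical distributions give 0
different = kl_distance(q, p)  # differing distributions give a positive value
```

Note that the KL distance is not symmetric: `kl_distance(q, p)` and `kl_distance(p, q)` generally differ, so the order of the arguments matters.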
The probability density function of data x on the i-th class is $q(x) = p(x \mid c_i)$, and the KL distance between the density functions p(x) and q(x) is defined as:
$$\psi = -\mathrm{KL}\big(p(x \mid c_i) \,\|\, p(x)\big) \tag{3}$$
For the sentiment models, i takes 2 values, corresponding to two models: the "commendatory" model and the "derogatory" model. For the topic models, i takes s values, where s is the number of topic models estimated from the training set.
When a language model is built, the model order is a key factor affecting its performance. With the same modeling unit, a higher-order model performs better than a lower-order one, but it is also harder to construct. In theory, a higher-order n-gram describes the language more accurately and brings the model closer to real language phenomena. In practice, however, using higher-order linguistic units on existing corpora causes a serious data sparsity problem and degrades the model. Therefore, the commonly used word bigram is adopted as the linguistic unit in formula (1) and serves as the model parameter.
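The choice of word bigrams as the modeling unit can be sketched as follows. This is only an illustration: it assumes the text has already been segmented into words (Chinese word segmentation is outside the scope of the sketch), and the helper name `word_bigrams` and the sample tokens are hypothetical.

```python
from collections import Counter

def word_bigrams(tokens):
    """Return the adjacent word pairs of an already segmented token list."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical segmented text; real input would come from a word segmenter.
tokens = ["the", "film", "is", "good", "the", "film", "is", "long"]

# Count how often each bigram occurs; these counts feed the parameter estimates.
bigram_counts = Counter(word_bigrams(tokens))
```

With a fixed corpus, the number of distinct units grows quickly with the n-gram order, while the count per unit shrinks, which is the sparsity trade-off described above.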
Step 2: model parameter estimation. The commonly used maximum likelihood estimation (MLE) method is applied to obtain a preliminary estimate of the model parameters:
$$P_M(w_i \mid T) = \frac{\mathrm{count}(w_i)}{\mathrm{count}(r)} \tag{4}$$
In formula (4), T may denote the text to be processed, the commendatory text collection, the derogatory text collection or a topic text collection. $\mathrm{count}(w_i)$ is the number of times $w_i$ occurs in T, and $\mathrm{count}(r)$ is correspondingly the number of times any word occurs in T, that is, the total number of word occurrences in T. Because the data are sparse, maximum likelihood estimation inevitably causes the zero-probability problem: for a term w that does not appear in a document t, MLE gives P(w|t) = 0. The zero-probability problem greatly weakens the descriptive power of the model and its ability in later processing. Data smoothing adjusts the values of the probability distribution, raising low probabilities (including zero probabilities) and lowering high ones, and thereby avoids zero probabilities. It effectively alleviates data sparsity and also makes the probability distribution of the model parameters more uniform and the probability calculation more accurate. The present invention adopts the Jelinek-Mercer smoothing method, which is based on linear interpolation and is commonly used to correct the bias in parameter estimates caused by a small training sample set. Following the idea of Jelinek-Mercer smoothing, the smoothed model parameters can be defined as:
$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$
In formula (5), λ is the smoothing parameter, 0 < λ < 1. The value of λ must be determined experimentally and directly affects the performance of the model. Parameter estimation and smoothing for the sentiment models and the topic models are completed with formulas (4) and (5).
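Formulas (4) and (5) can be sketched together as below. Several points are assumptions of this example rather than statements from the patent: $P(w_i \mid C)$ is taken to be the relative frequency of the unit in the whole training collection (the usual reading in Jelinek-Mercer smoothing), single words stand in for bigram units to keep the example short, the value λ = 0.7 is arbitrary, and the function names are hypothetical.

```python
from collections import Counter

def mle(tokens):
    """Formula (4): relative frequency of each unit in T."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def jelinek_mercer(tokens, collection_tokens, lam=0.7):
    """Formula (5): lam * P_M(w|T) + (1 - lam) * P(w|C).

    Computed over the collection vocabulary, so units unseen in T still
    receive a non-zero probability. Assumes every unit of T also occurs
    in the collection.
    """
    p_t = mle(tokens)
    p_c = mle(collection_tokens)
    return {w: lam * p_t.get(w, 0.0) + (1.0 - lam) * p_c[w] for w in p_c}

# Toy data: T is a small commendatory set, C is the whole training collection.
positive = ["good", "good", "great"]
collection = ["good", "good", "great", "bad", "poor", "bad"]

smoothed = jelinek_mercer(positive, collection, lam=0.7)
```

Here "bad" never occurs in the commendatory set, so its MLE probability is 0, but after interpolation it receives (1 - 0.7) × 2/6 = 0.1, which removes the zero-probability problem while the smoothed values still sum to 1.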
Step 3: definition of the model distance functions. To assess accurately how similar the text to be processed is to each model, distance functions are introduced. The similarity between models is judged by computing the distance between the model of the text to be processed and each model.
The distance function for the sentiment models is defined as follows:
$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$
where t denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" model and the "derogatory" model respectively, $d_1$ is the KL distance between the text t and the "commendatory" model, and $d_2$ is the KL distance between the text t and the "derogatory" model. When θ is greater than 0, the text is closer to the "derogatory" model, and the sentiment it expresses is judged to be derogatory; conversely, when θ is less than 0, it is judged to be commendatory. When θ equals 0, the sentiment expressed by the text is neutral.
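The decision rule of formula (6) can be sketched as follows, using a discrete KL distance between already smoothed word distributions. The function names, labels and probability values are illustrative assumptions, not part of the patent.

```python
import math

def kl_distance(q, p):
    """Discrete KL distance: sum over x of q(x) * ln(q(x) / p(x))."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)

def classify_sentiment(text_model, commendatory_model, derogatory_model):
    """Formula (6): theta = d1 - d2, then decide by the sign of theta."""
    d1 = kl_distance(text_model, commendatory_model)  # distance to commendatory model
    d2 = kl_distance(text_model, derogatory_model)    # distance to derogatory model
    theta = d1 - d2
    if theta > 0:
        return "derogatory"    # farther from the commendatory model
    if theta < 0:
        return "commendatory"  # farther from the derogatory model
    return "neutral"

# Toy smoothed models over a two-word vocabulary (illustrative values only).
commendatory_model = {"good": 0.7, "bad": 0.3}
derogatory_model = {"good": 0.2, "bad": 0.8}

label = classify_sentiment({"good": 0.6, "bad": 0.4},
                           commendatory_model, derogatory_model)
```

Because θ is a difference of floating-point values, an exact zero is rare in practice; an implementation might treat a small interval around 0 as neutral, although the source does not specify such a tolerance.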
To establish the topic models, the topics of the texts in the training data set are first labeled manually and a language model is estimated for each topic. The similarity between the language model of the text to be processed and each of these topic models is then assessed. If the language model of the text is most similar to a certain topic model, the topic of the text is considered to be the topic of that model. The distance function for the topic models is defined as follows:
$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$
where $\gamma_i$ denotes the i-th topic model and $d_{\min}(t, \gamma_i)$ denotes the minimum KL distance between the model of the text to be processed and the topic models. If the KL distance between the text and the i-th topic model is the smallest, the i-th topic is taken as the topic of the text.
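Formula (7) amounts to choosing the topic model at minimum KL distance from the text model. A minimal sketch, with hypothetical topic names and toy probabilities that are assumptions of this example:

```python
import math

def kl_distance(q, p):
    """Discrete KL distance: sum over x of q(x) * ln(q(x) / p(x))."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)

def nearest_topic(text_model, topic_models):
    """Formula (7): return the topic whose model is closest to the text model."""
    distances = {name: kl_distance(text_model, model)
                 for name, model in topic_models.items()}
    best = min(distances, key=distances.get)  # topic with the minimum distance
    return best, distances

# Toy smoothed topic models over a three-word vocabulary (illustrative only).
topic_models = {
    "sports": {"match": 0.6, "price": 0.1, "film": 0.3},
    "finance": {"match": 0.1, "price": 0.7, "film": 0.2},
}

topic, distances = nearest_topic({"match": 0.5, "price": 0.2, "film": 0.3},
                                 topic_models)
```

Returning the full `distances` dict alongside the winner makes it easy to inspect how close the runner-up topics were.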
In testing, the method of the invention achieves an average sentiment recognition accuracy of 81.36%.

Claims (1)

1. A WEB text sentiment and topic recognition method based on a mixture model, characterized by comprising the following steps:
Step 1: let X be the set of documents, $X = \{x_1, x_2, \ldots, x_n\}$, and let C denote the set of classes, $C = \{c_1, c_2, \ldots, c_K\}$, $c_i \cap c_j = \varnothing$, $\forall i \neq j$.
The density function of x is:
$$p(x) = \sum_{i=1}^{K} p(x \mid c_i)\, p(c_i) \tag{1}$$
The Kullback-Leibler (KL) measure is adopted as the criterion for the difference between classes; the KL distance between two probability distributions q(x) and p(x) is defined as:
$$\mathrm{KL}\big(q(x) \,\|\, p(x)\big) = \int q(x) \ln\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$
When q(x) = p(x), the KL distance equals 0;
The probability density function of data x on the i-th class is $q(x) = p(x \mid c_i)$, and the KL distance between the density functions p(x) and q(x) is defined as:
$$\psi = -\mathrm{KL}\big(p(x \mid c_i) \,\|\, p(x)\big) \tag{3}$$
For the sentiment models, i takes 2 values, corresponding to two models, the "commendatory" model and the "derogatory" model; for the topic models, i takes s values, where s is the number of topic models estimated from the training set;
Step 2: the MLE method is applied to obtain a preliminary estimate of the model parameters:
$$P_M(w_i \mid T) = \frac{\mathrm{count}(w_i)}{\mathrm{count}(r)} \tag{4}$$
In formula (4), T denotes the commendatory text collection, the derogatory text collection or a topic text collection; $\mathrm{count}(w_i)$ is the number of times $w_i$ occurs in T, and $\mathrm{count}(r)$ is correspondingly the number of times any word occurs in T; the smoothed model parameters are defined as:
$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$
In formula (5), λ is the smoothing parameter, 0 < λ < 1;
Step 3: the distance is defined by the sentiment-model distance function of formula (6):
$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$
where t denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" model and the "derogatory" model respectively, $d_1$ is the KL distance between the text t and the "commendatory" model, and $d_2$ is the KL distance between the text t and the "derogatory" model; when θ is greater than 0, the text is closer to the "derogatory" model, and the sentiment it expresses is judged to be derogatory; conversely, when θ is less than 0, it is judged to be commendatory; when θ equals 0, the sentiment expressed by the text is neutral;
The distance function for the topic models is defined as follows:
$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$
where $\gamma_i$ denotes the i-th topic model and $d_{\min}(t, \gamma_i)$ denotes the minimum KL distance between the model of the text to be processed and the topic models; if the KL distance between the text and the i-th topic model is the smallest, the i-th topic is taken as the topic of the text.
CN200910219161A 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model Active CN101876985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910219161A CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910219161A CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Publications (2)

Publication Number Publication Date
CN101876985A CN101876985A (en) 2010-11-03
CN101876985B true CN101876985B (en) 2012-08-29

Family

ID=43019543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910219161A Active CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Country Status (1)

Country Link
CN (1) CN101876985B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855B (en) * 2013-04-12 2019-04-26 广东工业大学 For the fine granularity sentiment analysis system and method for product review information
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN105005552B (en) * 2014-04-22 2019-01-08 北京四维图新科技股份有限公司 A kind of information processing method and device
CN105335347A (en) * 2014-05-30 2016-02-17 富士通株式会社 Method and device for determining emotion and reason thereof for specific topic
CN111859979A (en) * 2020-06-16 2020-10-30 中国科学院自动化研究所 Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN116738298B (en) * 2023-08-16 2023-11-24 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101141456A (en) * 2007-10-09 2008-03-12 南京财经大学 Vertical search based network data excavation method
CN101201980A (en) * 2007-12-19 2008-06-18 北京交通大学 Remote Chinese language teaching system based on voice affection identification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
周立柱等.情感分析研究综述.《计算机应用》.2008,第28卷(第11期),2725-2728. *
樊娜等.中文文本情感主题句分析与提取研究.《计算机应用》.2009,第29卷(第4期),1171-1176. *

Also Published As

Publication number Publication date
CN101876985A (en) 2010-11-03

Similar Documents

Publication Publication Date Title
US20220147836A1 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN103400577B (en) The acoustic model method for building up of multilingual speech recognition and device
CN101876985B (en) WEB text sentiment theme recognizing method based on mixed model
CN106326212B (en) A kind of implicit chapter relationship analysis method based on level deep semantic
CN103154936A (en) Methods and systems for automated text correction
CN106372061A (en) Short text similarity calculation method based on semantics
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN108959250A (en) A kind of error correction method and its system based on language model and word feature
CN103077720B (en) Speaker identification method and system
CN101127042A (en) Sensibility classification method based on language model
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN103207860A (en) Method and device for extracting entity relationships of public sentiment events
CN102033964A (en) Text classification method based on block partition and position weight
CN106570180A (en) Artificial intelligence based voice searching method and device
CN106529525A (en) Chinese and Japanese handwritten character recognition method
CN103474061A (en) Automatic distinguishing method based on integration of classifier for Chinese dialects
CN102063424A (en) Method for Chinese word segmentation
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN111222330B (en) Chinese event detection method and system
CN101452701B (en) Confidence degree estimation method and device based on inverse model
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN104572631A (en) Training method and system for language model
CN110321434A (en) A kind of file classification method based on word sense disambiguation convolutional neural networks
CN110134950A (en) A kind of text auto-collation that words combines
CN105183716B (en) A kind of intelligent interactive method based on abstract semantics

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NANTONG CHANGRONG MECHANICAL +ELECTRICAL CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140813

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140813

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140813

Address after: 226600 Hai'an County Development Zone, Nantong City, Jiangsu Province

Patentee after: Nantong Changrong Mechanical and Electrical Co., Ltd.

Patentee after: Northwestern Polytechnical University

Address before: 710072 No. 127 Friendship West Road, Xi'an, Shaanxi Province

Patentee before: Northwestern Polytechnical University