CN101876985A - WEB text sentiment theme recognizing method based on mixed model - Google Patents

Publication number
CN101876985A
Authority
CN
China
Prior art keywords
model
text
emotion
language
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009102191619A
Other languages
Chinese (zh)
Other versions
CN101876985B (en)
Inventor
蔡皖东
樊娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Changrong Mechanical and Electrical Co., Ltd.
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN200910219161A
Publication of CN101876985A
Application granted
Publication of CN101876985B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a WEB text sentiment theme recognizing method based on a mixed model, which belongs to the field of network information security. In the method, model training is carried out in a text set; text language expression patterns of different sentiment inclinations and different themes are truly simulated; and language modes of sentiment expression and theme expression are modeled to respectively generate a sentiment language model and a theme language model. Aiming at a text to be processed, which needs to be analyzed, the similarity between the text and the two kinds of models is estimated by comparing a model of the text with the two kinds of models, and a theme and the sentiment inclination of the text can be simultaneously recognized and confirmed. The invention introduces language information knowledge in statistical modeling, captures and explores the characteristics and a rule of sentiment and theme expression and fully utilizes the characteristics and habit of language expression to establish a mixed model which can simultaneously analyze and recognize the theme and sentiment; and the average accuracy of sentiment recognition is improved from 67.81 percent in the prior art to 81.36 percent.

Description

WEB text sentiment theme recognizing method based on mixed model
Technical field
The present invention relates to a method for recognizing sentiment and theme, and in particular to a WEB text sentiment theme recognizing method based on a mixed model. It belongs to the field of network information security.
Background art
Extracting the theme of WEB text and analyzing its sentiment inclination are important research topics in the field of network information security.
The document "Sorting technique of Chinese emotion tendency under the network environment, Spoken and Written Languages Application, 2008, Vol. 2(5), pp. 139-144" discloses a text sentiment classification method based on semantic orientation. That method combines semantics with data mining theory and uses the sentiment of phrases in a Chinese text to determine the sentiment inclination of the whole text. However, it analyzes only the sentiment of network text and cannot recognize the theme and the sentiment inclination of a network text at the same time, so it cannot meet users' needs in network information processing. Its sentiment recognition accuracy is also low, with an average accuracy of 67.81%.
Summary of the invention
To overcome the low sentiment recognition accuracy of the prior-art method, the invention provides a WEB text sentiment theme recognizing method based on a mixed model. The method trains models on a text set, realistically simulates the language expression patterns of texts with different sentiment inclinations and different themes, and models the language patterns of sentiment expression and theme expression to produce two kinds of language models: sentiment language models and theme language models. For a text to be processed, the method compares the text's own model with these two kinds of models and estimates its similarity to each, so that the theme and the sentiment inclination of the text are recognized at the same time. The method introduces linguistic knowledge into statistical modeling, captures the characteristics and regularities of sentiment and theme expression, and makes full use of the characteristics and habits of language expression to build a mixed model that can analyze and recognize theme and sentiment simultaneously, which improves the accuracy of sentiment recognition.
The technical solution adopted by the invention to solve its technical problem is a WEB text sentiment theme recognizing method based on a mixed model, characterized by comprising the following steps:
(a) The texts in the training set are labeled manually, marking the sentiment inclination and the theme category of each text. According to the differences in how different sentiments are expressed in language, two sentiment models are estimated: a "commendatory" model and a "derogatory" model. At the same time, according to the language expression patterns of texts on different themes, a language model is estimated for each theme.
(b) The parameters of the sentiment models and theme models built in step (a) are estimated separately. The parameters of each model are first estimated with the maximum likelihood estimation (MLE) method. Because maximum likelihood estimation inevitably leads to the zero-probability problem, the Jelinek-Mercer smoothing method is then applied to smooth the data and adjust the values of the probability distribution.
(c) For a text to be processed, the distance between its language model and each of the two sentiment models is calculated, and the sentiment inclination of the nearest sentiment model is assigned to the text. The distance to each theme model is also calculated, and the theme of the nearest theme model is taken as the theme of the text.
The beneficial effects of the invention are as follows. Models are trained on a text set so as to realistically simulate the language expression patterns of texts with different sentiment inclinations and different themes, and the language patterns of sentiment expression and theme expression are modeled to produce sentiment language models and theme language models. For a text to be processed, its own model is compared with these two kinds of models and its similarity to each is estimated, so that the theme and the sentiment inclination of the text are recognized at the same time. Linguistic knowledge is introduced into statistical modeling, the characteristics and regularities of sentiment and theme expression are captured, and the characteristics and habits of language expression are fully used to build a mixed model that can analyze and recognize theme and sentiment simultaneously. The average accuracy of sentiment recognition rises from 67.81% in the prior art to 81.36%.
The invention is described in detail below with reference to the drawing and an embodiment.
Description of drawings
The accompanying drawing is a flow chart of the WEB text sentiment theme recognizing method based on a mixed model according to the invention.
Embodiment
For a text to be analyzed, the sentiment inclination and the theme of the text are determined by this method in the following steps:
Step 1: manually label the sentiment and theme of the training-set texts and build the theme and sentiment models. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of documents, and let $C = \{c_1, c_2, \ldots, c_k\}$ denote the set of classes, which is a partition of $X$, so that $c_i \cap c_j = \emptyset$ for $i \neq j$. The density function of $x$ is:
$$p(x) = \sum_{i=1}^{k} p(x \mid c_i)\, p(c_i) \tag{1}$$
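As a small illustration of formula (1), the following Python sketch evaluates the mixture density for a discrete outcome. The function name `mixture_density`, the class labels, and all numbers are hypothetical and chosen only to make the arithmetic easy to follow; they do not come from the patent.

```python
def mixture_density(x, class_priors, class_likelihoods):
    """Return p(x) = sum_i p(x | c_i) * p(c_i) for a discrete outcome x."""
    return sum(
        class_likelihoods[c].get(x, 0.0) * prior
        for c, prior in class_priors.items()
    )


# Hypothetical toy numbers: two classes with their priors p(c_i)
# and per-class word probabilities p(x | c_i).
priors = {"c1": 0.6, "c2": 0.4}
likelihoods = {
    "c1": {"good": 0.7, "bad": 0.3},
    "c2": {"good": 0.2, "bad": 0.8},
}

p_good = mixture_density("good", priors, likelihoods)  # 0.7*0.6 + 0.2*0.4
p_bad = mixture_density("bad", priors, likelihoods)    # 0.3*0.6 + 0.8*0.4
```

Because the class priors sum to 1 and each class distribution sums to 1, the mixture values over all outcomes also sum to 1.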
To compute the distance between a model and a text to be processed, the Kullback-Leibler (KL) measure is used as the criterion for the difference between classes. The KL distance between two probability distributions $q(x)$ and $p(x)$ is usually defined as:
$$\mathrm{KL}\big(q(x) \,\|\, p(x)\big) = \int q(x) \ln\!\left[\frac{q(x)}{p(x)}\right] dx \tag{2}$$
When $q(x) = p(x)$, the KL distance equals 0. In other words, the more two classes differ, the larger the KL distance; when the two class probability distributions are identical, the KL distance takes its minimum value, 0.
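For word distributions the integral in formula (2) becomes a sum over the vocabulary. The following is a minimal Python sketch of that discrete form; the function name `kl_divergence` and the toy distributions are illustrative assumptions, not part of the patent.

```python
import math


def kl_divergence(q, p):
    """Discrete KL(q || p) = sum_x q(x) * ln(q(x) / p(x))."""
    total = 0.0
    for x, qx in q.items():
        if qx == 0.0:
            continue  # a term with q(x) = 0 contributes nothing
        px = p.get(x, 0.0)
        if px == 0.0:
            return math.inf  # q puts mass where p has none
        total += qx * math.log(qx / px)
    return total


# Hypothetical toy distributions over a two-word vocabulary.
q = {"a": 0.5, "b": 0.5}
p = {"a": 0.9, "b": 0.1}

same = kl_divergence(q, q)       # identical distributions give 0.0
different = kl_divergence(q, p)  # differing distributions give a positive value
```

The infinite result for a zero in `p` is one reason the zero-probability problem matters for distance-based classification: an unsmoothed model can make every distance infinite.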
The probability density function of data $x$ on the $i$-th class is $q(x) = p(x \mid c_i)$, and the KL-based measure between the density functions $p(x)$ and $q(x)$ is defined as:
$$\psi = -\mathrm{KL}\big(p(x \mid c_i) \,\|\, p(x)\big) \tag{3}$$
For the sentiment models, $i$ takes two values, corresponding to the two models: the "commendatory" model and the "derogatory" model. For the theme models, $i$ ranges from 1 to $s$, where $s$ is the number of theme models estimated from the training set.
When building a language model, the model order is a key factor in model performance. With the same modeling unit, a higher-order model performs better than a lower-order one, but it is also harder to construct. In theory, a higher-order n-gram describes the language more accurately and brings the model closer to real language phenomena. In practice, using higher-order language units with existing corpora causes a severe data sparseness problem and degrades the model. Therefore, the language unit in formula (1) is the commonly used word bigram, which serves as the model parameter.
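To make the modeling unit concrete, the sketch below counts word bigrams in a token sequence. It assumes the text has already been split into words (Chinese text would first need word segmentation, which is not shown); the function name `bigram_counts` and the sample tokens are hypothetical.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical pre-tokenized sentence pair.
tokens = ["the", "film", "is", "good", "the", "film", "is", "long"]
counts = bigram_counts(tokens)
```

A sequence of n tokens yields n - 1 bigrams, so the number of distinct units grows quickly with the order, which is the sparseness trade-off described above.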
Step 2: model parameter estimation. The commonly used maximum likelihood estimation (MLE) method gives the following preliminary estimate of the model parameters:
$$P_M(w_i \mid T) = \frac{\mathrm{count}(w_i)}{\mathrm{count}(r)} \tag{4}$$
In formula (4), $T$ can denote the text to be processed, the commendatory text set, the derogatory text set, or a theme text set. $\mathrm{count}(w_i)$ is the number of times $w_i$ occurs in $T$, and $\mathrm{count}(r)$ is the total number of word occurrences in $T$. Because the data are sparse, maximum likelihood estimation inevitably leads to the zero-probability problem: for a term $w$ that does not occur in a document $t$, MLE gives $P(w \mid t) = 0$. The zero-probability problem greatly weakens the descriptive power and processing capability of the model. Data smoothing adjusts the values of the probability distribution by raising low probabilities (including zero probabilities) and lowering high probabilities. This avoids zero probabilities, effectively addresses data sparseness, makes the distribution of model parameters more uniform, and makes the probability calculation more accurate. The invention adopts the Jelinek-Mercer smoothing method, which is based on linear interpolation and is often used to correct the biased parameter estimates that result from a small training sample set. Following the Jelinek-Mercer smoothing method, the smoothed model parameters are defined as:
$$P_s(w_i \mid T) = \lambda P_M(w_i \mid T) + (1 - \lambda) P(w_i \mid C) \tag{5}$$
In formula (5), $\lambda$ is the smoothing parameter, with $0 < \lambda < 1$. $\lambda$ must be determined experimentally and directly affects the performance of the model. Formulas (4) and (5) complete the parameter estimation and smoothing of the sentiment models and theme models.
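Formulas (4) and (5) can be sketched in a few lines of Python. The text does not spell out what $C$ is in $P(w_i \mid C)$; this sketch assumes it is a background model estimated by MLE over the whole collection, which is the usual reading of Jelinek-Mercer smoothing. The function names, the toy tokens, and the value `lam=0.5` are illustrative assumptions, and single words are used instead of bigrams to keep the example short (any hashable unit, such as a bigram tuple, works the same way).

```python
from collections import Counter


def mle_model(tokens):
    """Formula (4): relative frequency of each unit in T."""
    counts = Counter(tokens)
    total = sum(counts.values())  # occurrences of all units in T
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(model, background, lam):
    """Formula (5): lam * P_M(w | T) + (1 - lam) * P(w | C)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    vocab = set(model) | set(background)
    return {
        w: lam * model.get(w, 0.0) + (1.0 - lam) * background.get(w, 0.0)
        for w in vocab
    }


# Hypothetical toy data; lam = 0.5 is an arbitrary illustrative value.
text_tokens = ["good", "good", "film"]
collection_tokens = ["good", "bad", "film", "bad"]

text_model = mle_model(text_tokens)              # "bad" is unseen here
background_model = mle_model(collection_tokens)  # assumed model for C
smoothed = jelinek_mercer(text_model, background_model, lam=0.5)
```

After smoothing, the unseen word "bad" receives a nonzero probability taken from the background model, and the smoothed values still sum to 1 because both inputs are proper distributions.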
Step 3: definition of the model distance functions. Distance functions are introduced to assess accurately how similar the text to be processed is to each model. The similarity between models is judged by calculating the distance between the model of the text to be processed and each trained model.
The distance function for the sentiment models is defined as:
$$\theta(t, \delta_P, \delta_N) = d_1 - d_2 \tag{6}$$
Here $t$ denotes the text to be processed, $\delta_P$ and $\delta_N$ denote the "commendatory" (positive) model and the "derogatory" (negative) model respectively, $d_1$ is the KL distance between text $t$ and the commendatory model, and $d_2$ is the KL distance between text $t$ and the derogatory model. When $\theta > 0$, the text is closer to the derogatory model, and the sentiment expressed by the text is judged to be derogatory. When $\theta < 0$, the text is judged to be commendatory. When $\theta = 0$, the sentiment expressed by the text is neutral.
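The decision rule of formula (6) can be sketched as follows. The text does not state the argument order of the KL distance, so this sketch assumes $d = \mathrm{KL}(\text{text model} \,\|\, \text{class model})$; it also assumes every model has been smoothed so that no class probability is zero. All names and numbers are hypothetical.

```python
import math


def kl_distance(q, p):
    """Discrete KL(q || p); assumes p is smoothed (no zero entries)."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)


def classify_sentiment(text_model, commendatory_model, derogatory_model):
    """Formula (6): theta = d1 - d2, then decide by the sign of theta."""
    d1 = kl_distance(text_model, commendatory_model)
    d2 = kl_distance(text_model, derogatory_model)
    theta = d1 - d2
    if theta > 0:
        return "derogatory"    # closer to the derogatory model
    if theta < 0:
        return "commendatory"  # closer to the commendatory model
    return "neutral"


# Hypothetical smoothed models over a two-word vocabulary.
commendatory = {"good": 0.7, "bad": 0.3}
derogatory = {"good": 0.2, "bad": 0.8}
text = {"good": 0.6, "bad": 0.4}

label = classify_sentiment(text, commendatory, derogatory)
```

In practice an exact tie ($\theta = 0$) is rare with real-valued distances, so an implementation might treat a small band around zero as neutral; that threshold would be an additional design choice not described in the text.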
To build the theme models, the themes of the texts in the training data set are first labeled manually and a language model is estimated for each theme. The similarity between the language model of the text to be processed and each theme model is then assessed. If the language model of the text is most similar to a particular theme model, the theme of the text is considered to be the theme of that model. The distance function for the theme models is defined as:
$$\theta(t, \gamma_1, \ldots, \gamma_s) = d_{\min}(t, \gamma_i) \tag{7}$$
Here $\gamma_i$ denotes the $i$-th theme model, and $d_{\min}(t, \gamma_i)$ is the minimum KL distance between the model of the text to be processed and the theme models. If the KL distance between the text and the $i$-th theme model is the smallest, the $i$-th theme is taken as the theme of the text.
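Formula (7) amounts to a nearest-model search. The sketch below assumes the same conventions as before: the distance is $\mathrm{KL}(\text{text model} \,\|\, \text{theme model})$ and all theme models are smoothed so that no probability is zero. The theme names and numbers are hypothetical.

```python
import math


def kl_distance(q, p):
    """Discrete KL(q || p); assumes p is smoothed (no zero entries)."""
    return sum(qx * math.log(qx / p[x]) for x, qx in q.items() if qx > 0.0)


def nearest_theme(text_model, theme_models):
    """Formula (7): return the theme with the minimum KL distance."""
    distances = {
        name: kl_distance(text_model, model)
        for name, model in theme_models.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical smoothed theme models over a three-word vocabulary.
themes = {
    "sports": {"match": 0.6, "stock": 0.1, "film": 0.3},
    "finance": {"match": 0.1, "stock": 0.7, "film": 0.2},
}
text = {"match": 0.5, "stock": 0.2, "film": 0.3}

theme, distance = nearest_theme(text, themes)
```

The same text model can be passed both to this function and to a sentiment decision rule, which is how a single pass over a document can yield a theme and a sentiment inclination together.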
In testing, the method of the invention achieved an average sentiment recognition accuracy of 81.36%.

Claims (1)

1. A WEB text sentiment theme recognizing method based on a mixed model, characterized by comprising the following steps:
(a) manually labeling the texts in the training set, marking the sentiment inclination and the theme category of each text; estimating two sentiment models, a "commendatory" model and a "derogatory" model, according to the differences in how different sentiments are expressed in language; and at the same time estimating a language model for each theme according to the language expression patterns of texts on different themes;
(b) estimating the parameters of the sentiment models and theme models built in step (a) separately, first estimating the parameters of each model with the maximum likelihood estimation (MLE) method and, because maximum likelihood estimation inevitably leads to the zero-probability problem, then applying the Jelinek-Mercer smoothing method to smooth the data and adjust the values of the probability distribution;
(c) for a text to be processed, calculating the distance between its language model and each of the two sentiment models and assigning the sentiment inclination of the nearest sentiment model to the text; and calculating the distance to each theme model and taking the theme of the nearest theme model as the theme of the text.
CN200910219161A 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model Active CN101876985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910219161A CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910219161A CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Publications (2)

Publication Number Publication Date
CN101876985A true CN101876985A (en) 2010-11-03
CN101876985B CN101876985B (en) 2012-08-29

Family

ID=43019543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910219161A Active CN101876985B (en) 2009-11-26 2009-11-26 WEB text sentiment theme recognizing method based on mixed model

Country Status (1)

Country Link
CN (1) CN101876985B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN105005552A (en) * 2014-04-22 2015-10-28 北京四维图新科技股份有限公司 Information processing method and apparatus
CN105335347A (en) * 2014-05-30 2016-02-17 富士通株式会社 Method and device for determining emotion and reason thereof for specific topic
CN111859979A (en) * 2020-06-16 2020-10-30 中国科学院自动化研究所 Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN116738298A (en) * 2023-08-16 2023-09-12 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101141456A (en) * 2007-10-09 2008-03-12 南京财经大学 Vertical search based network data excavation method
CN101201980B (en) * 2007-12-19 2010-06-02 北京交通大学 Remote Chinese language teaching system based on voice affection identification

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207855A (en) * 2013-04-12 2013-07-17 广东工业大学 Fine-grained sentiment analysis system and method specific to product comment information
CN103207855B (en) * 2013-04-12 2019-04-26 广东工业大学 For the fine granularity sentiment analysis system and method for product review information
CN103617245A (en) * 2013-11-27 2014-03-05 苏州大学 Bilingual sentiment classification method and device
CN105005552A (en) * 2014-04-22 2015-10-28 北京四维图新科技股份有限公司 Information processing method and apparatus
CN105005552B (en) * 2014-04-22 2019-01-08 北京四维图新科技股份有限公司 A kind of information processing method and device
CN105335347A (en) * 2014-05-30 2016-02-17 富士通株式会社 Method and device for determining emotion and reason thereof for specific topic
CN111859979A (en) * 2020-06-16 2020-10-30 中国科学院自动化研究所 Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium
CN116738298A (en) * 2023-08-16 2023-09-12 杭州同花顺数据开发有限公司 Text classification method, system and storage medium
CN116738298B (en) * 2023-08-16 2023-11-24 杭州同花顺数据开发有限公司 Text classification method, system and storage medium

Also Published As

Publication number Publication date
CN101876985B (en) 2012-08-29

Similar Documents

Publication Publication Date Title
US11631007B2 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN103400577B (en) The acoustic model method for building up of multilingual speech recognition and device
CN106326212B (en) A kind of implicit chapter relationship analysis method based on level deep semantic
CN101876985B (en) WEB text sentiment theme recognizing method based on mixed model
CN106372061A (en) Short text similarity calculation method based on semantics
CN103177733B (en) Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN107330011A (en) The recognition methods of the name entity of many strategy fusions and device
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN103154936A (en) Methods and systems for automated text correction
CN103077720B (en) Speaker identification method and system
CN107239439A (en) Public sentiment sentiment classification method based on word2vec
CN110427609B (en) Automatic evaluation method for reasonability of discourse structure of writer composition
CN101127042A (en) Sensibility classification method based on language model
CN105912625A (en) Linked data oriented entity classification method and system
CN106570180A (en) Artificial intelligence based voice searching method and device
CN111597328B (en) New event theme extraction method
CN107832290B (en) Method and device for identifying Chinese semantic relation
CN106529525A (en) Chinese and Japanese handwritten character recognition method
CN101452701B (en) Confidence degree estimation method and device based on inverse model
CN104572614A (en) Training method and system for language model
CN104572631A (en) Training method and system for language model
CN110321434A (en) A kind of file classification method based on word sense disambiguation convolutional neural networks
CN105609116A (en) Speech emotional dimensions region automatic recognition method
CN110992959A (en) Voice recognition method and system
CN111177402A (en) Evaluation method and device based on word segmentation processing, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NANTONG CHANGRONG MECHANICAL +ELECTRICAL CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140813

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140813

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140813

Address after: 226600 Hai'an County Development Zone, Nantong City, Jiangsu Province

Patentee after: Nantong Changrong Mechanical and Electrical Co., Ltd.

Patentee after: Northwestern Polytechnical University

Address before: 710072 No. 127 Friendship West Road, Xi'an, Shaanxi Province

Patentee before: Northwestern Polytechnical University