CN110020436A

CN110020436A - Microblog sentiment analysis method combining ontology and syntactic dependency

Info

Publication number: CN110020436A
Application number: CN201910276686.XA
Authority: CN
Inventors: 朱群雄; 罗敏; 徐圆; 贺彦林
Original assignee: Beijing University of Chemical Technology
Current assignee: Beijing University of Chemical Technology
Priority date: 2019-04-08
Filing date: 2019-04-08
Publication date: 2019-07-16

Abstract

The invention discloses a microblog sentiment analysis method that combines an ontology with syntactic dependency, comprising the following steps: semi-automatically building a topic-related ontology and persisting it to a database; expanding and updating the ontology along two aspects, ontology dimensions and emotion vocabulary, using syntactic dependency relations; and computing emotion weights for microblog messages using the ontology to determine their sentiment orientation. Compared with conventional machine learning classification algorithms, the invention is feasible and performs better on a Chinese microblog data set.

Description

Microblog sentiment analysis method combining ontology and syntactic dependency

Technical field

The invention belongs to the technical field of text sentiment analysis, and in particular relates to a microblog sentiment analysis method combining ontology and syntactic dependency.

Technical background

With the spread of the mobile Internet, microblogs, as a social platform with a large user base, have become the fastest source of information on breaking news events. Because user stickiness is high, microblogs contain a massive amount of everyday information from netizens, including usage reviews of all kinds of products. For various reasons, the review data in a product's own online store is not objective enough, whereas, because of the everyday nature of microblogs, user reviews there are more objective and more worth mining. For an enterprise, therefore, obtaining users' product reviews from microblogs and subjecting them to sentiment analysis provides an information basis essential to business decisions.

Microblog data consists mainly of text, and sentiment orientation analysis of text data has been a research hotspot in recent years. It falls broadly into two approaches: machine learning and ontology-based analysis. In machine learning methods the classifier mostly relies on manual construction; for large data sets the modeling process is overly complex and lengthy, and manual operation is difficult. To solve this problem, ontology construction methods were proposed. An ontology is a formal, explicit and detailed specification of a shared conceptual system, and it can describe concepts at the semantic level. However, existing ontology-based sentiment analysis does not update the ontology after its initial construction, which places excessive demands on the accuracy of the initial construction; in practice, the dimensions of an ontology grow as the data expands.

Summary of the invention

The invention proposes a microblog sentiment analysis method combining ontology and syntactic dependency, with the aim of obtaining relevant sentiment information from microblogs more accurately. An initial ontology is built semi-automatically for the microblog information; then, based on the relevant text data, the ontology is automatically updated and optimized along two aspects, product dimensions and emotion vocabulary, using syntactic dependency parsing, yielding a mature ontology. Using the mature ontology and the new emotion weight calculation method proposed by the invention, the emotion weight and orientation of the text data are then measured, so that sentiment analysis is performed accurately.

The technical solution of the invention is as follows:

A microblog sentiment analysis method combining ontology and syntactic dependency, comprising the following steps:

Step (1): semi-automatically build a topic-related ontology and persist the ontology to a database;

Step (2): expand and update the ontology along two aspects, ontology dimensions and emotion vocabulary, using syntactic dependency relations;

Step (3): compute emotion weights for the microblog messages using the ontology and determine the sentiment orientation.

Further, step (1) specifically comprises:

Step (1.1): build the ontology in Protégé using the conventional seven-step method: determine the domain and scope of the ontology; consider reusing existing ontologies; enumerate the important terms of the domain; define the classes and the class hierarchy; define the properties of the classes; define the facets of the properties; create instances;

Step (1.2): convert the ontology into a database using the Jena package, which extracts data at the semantic level and converts it into a model; the data source is a database or a file.

Further, the conversion in step (1.2) proceeds as follows:

1. Install the necessary software and configure the development environment: Eclipse + MySQL Server 5.5 (win32) + Jena 2.6.4 + Protégé 5.1.0 + mysql-connector-java-5.1.35 (the MySQL JDBC driver);

2. Build the product ontology with Protégé 5.1.0 and generate an OWL ontology file;

3. Create a database in MySQL;

4. Open Eclipse and create a Java project;

5. In the new project, import the Jena package and the MySQL JDBC driver;

6. Create a Java class named military_ontology.java under the project directory;

7. Write the code in military_ontology.java and run it;

8. The ontology is successfully converted into a database;

After the initial ontology is successfully converted with Jena, 7 tables are generated; jena_g1t1_stmt is the table that stores the ontology content.

Further, step (2) specifically comprises:

Step (2.1): extend the ontology dimensions of the product ontology by syntactic dependency parsing. In a sentence, the predicate verb is the center that governs the other constituents and is not itself governed by any other constituent; the governed constituents depend on their governor through some dependency relation. A dependency syntactic structure takes the dependency relation as its basic element, that is, a binary relation between a pair of words; in the binary relation, the governor is called the head word and the subordinate is called the dependent word. Syntactic analysis is performed with the Stanford Parser dependency parser:

From the Stanford Parser output, typed dependencies are selected for the extension. When extending dimensions, attention is paid to two relations involving keywords, nn and assmod: nn denotes a noun compound, and assmod denotes an associative modifier, a dependency between two noun phrases. A newly obtained subordinate dimension and its discovered relation are stored in the ontology database as follows: first set the type of the new dimension to class, then make it a subclass of the corresponding parent dimension;

Step (2.2): the emotion vocabulary is expanded as follows: using the Stanford Parser, on the basis of the dependency analysis, attention is paid to two other relations, amod and nsubj. amod denotes an adjectival modifier, typically an adjective preceding a noun; nsubj denotes a nominal subject and expresses the link between a subject and its predicate. An emotion word is an instance and is inserted into the ontology database as follows: first set its type to NamedIndividual, then assign it to the emotion vocabulary of the category it describes, and finally insert its emotion weight into the database; the emotion weight is obtained from the sentiment dictionary.

Further, step (3) specifically comprises:

The formula used to compute the emotion weight of each sentence is:

where n is the number of emotion words in the sentence; Pri_i is the negation weight: if word i is modified by a negation word when the emotion weight is computed, its weight is multiplied by the weight of the negation word, which is generally negative and defaults to -1 if the negation word is not in the negation weight dictionary; Value_i is the emotion weight of the word itself, taken from the emotion weight dictionary;

Dimen_i is the weight of the dimension to which the i-th word belongs, computed as follows:

Dimen_i=Per_{class_i}*Per_{words_i}

where Per_{class_i} is the proportion of all classes accounted for by the subclasses of the dimension to which the i-th word belongs, and Per_{words_i} is the proportion of all evaluation words accounted for by the emotion evaluation words of that dimension;

Dimension classes and emotion words are queried with SPARQL, the query language that comes with the ontology: through the interface of the Jena package, SPARQL statements extract class- and instance-related data from the ontology that has been converted into a database;

TI_i is the TF*IDF weight of the word, computed as follows:

Tf_ij is the tf value of the word and expresses how often a word occurs in the current document: the numerator is the number of times word t_i occurs in document j, and the denominator is the total number of word occurrences in document j. Idf_ij is the idf value of the word, called the inverse document frequency: the total number of documents is divided by the number of documents containing the keyword and the logarithm of the quotient is taken; the numerator is the total number of documents and the denominator is the number of documents containing word t_i. To keep the denominator always positive, 1 is added to it;

Using the SPARQL query statements that come with the ontology, word matching is performed directly on each sentence to find each word's dimension, emotion category and weight in the ontology, after which the emotion weight is computed with the formula above.

Brief description of the drawings

Fig. 1 is a flow chart of microblog sentiment analysis combining ontology and syntactic dependency.

Fig. 2 shows part of the Hongmao medicinal liquor ontology.

Fig. 3 compares the classification performance of the sentiment index computed by the invention with that of the SVM and naive Bayes classifiers.

Specific embodiment

To help those skilled in the art better understand the technical solution of the invention, the microblog sentiment analysis method combining ontology and syntactic dependency provided by the invention is described in detail below. The following embodiment serves only to illustrate the invention and not to limit its scope.

Embodiment

A microblog sentiment analysis method combining ontology and syntactic dependency, comprising the following steps:

1. Preprocessing of microblog data

The crawled microblog data must be preprocessed, mainly as follows:

(1) unify Chinese and English punctuation marks, and unify full-width and half-width characters;

(2) convert emoticons directly into the corresponding Chinese words;

(3) remove redundant reply information such as "reply:", "reply Good Weather:" and "reply @Good Weather:", where "Good Weather" stands for a user name;

(4) remove special characters such as " | (|) | $ | Shu | " | " | △ | ▲ | ▼ | ▍ | ■;

(5) remove all punctuation marks other than those that delimit sentences, such as the comma, full stop and exclamation mark;

(6) segment the text into words using jieba word segmentation;

(7) remove stop words.
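
The preprocessing rules above can be sketched roughly as follows. This is only an illustration under stated assumptions: the regular expressions, the function name `preprocess` and the whitespace tokenizer are hypothetical, the emoticon conversion of rule (2) is left out, and a real pipeline would call jieba for rule (6) instead of splitting on whitespace.

```python
import re
import unicodedata

# Hypothetical patterns; the patent lists the rules but not their implementation.
REPLY_PREFIX = re.compile(r"(?:回复|reply)\s*@?[^:\s]*\s*:", re.IGNORECASE)
SPECIAL_CHARS = re.compile(r'[|()$△▲▼▍■"]')
SENTENCE_DELIMS = ",.!"


def preprocess(text, stop_words):
    """Return the cleaned token list for one microblog message."""
    # Rule (1): NFKC folds full-width forms (e.g. full-width "!" and ":")
    # into half-width ones; the ideographic full stop is mapped by hand.
    text = unicodedata.normalize("NFKC", text).replace("。", ".")
    # Rule (3): drop reply markers such as "reply @user:".
    text = REPLY_PREFIX.sub("", text)
    # Rule (4): drop special characters.
    text = SPECIAL_CHARS.sub("", text)
    # Rule (5): drop punctuation except the sentence delimiters.
    text = "".join(
        ch for ch in text
        if ch in SENTENCE_DELIMS or not unicodedata.category(ch).startswith("P")
    )
    # Rule (6): whitespace tokenization as a stand-in for jieba segmentation.
    tokens = re.findall(r"[^\s,.!]+|[,.!]", text)
    # Rule (7): remove stop words.
    return [t for t in tokens if t not in stop_words]


tokens = preprocess("reply @bob： The ad is false！ (really) #bad", {"The", "is"})
```

Sentence delimiters are kept as separate tokens so that later steps can still split a message into sentences.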

2. Creation and persistence of the initial ontology

(1) Semi-automatic creation of the ontology

The invention builds the ontology with the conventional seven-step method. Compared with other methods, the seven-step method has clear steps and simple logic and is easy to carry out. It was developed at the Stanford University School of Medicine and is a commonly used ontology construction method. Its seven steps are: determine the domain and scope of the ontology; consider reusing existing ontologies; enumerate the important terms of the domain; define the classes and the class hierarchy; define the properties of the classes; define the facets of the properties; create instances.

The invention builds the ontology semi-automatically with the tool Protégé (Stanford University, 1999). Protégé is an ontology construction tool developed by the bioinformatics research center of the Stanford University School of Medicine. It is written in Java and is open-source software. Protégé hides the specific ontology description language from the user: the user does not need to learn an ontology authoring language and only needs to describe the ontology with the shortcuts the software provides.

(2) Persistence of the ontology

An ontology built semi-automatically must be continually expanded and modified during data processing, so it has to be persisted in a form in which changes are easy to handle. The invention converts the ontology into a database using the Jena package (HP Labs, 2009). Jena is an open-source Java framework used mainly to extract data at the semantic level and convert it into a model; the data source can be a database, a file, and so on. To query data in the semantic model, Jena also provides a query language, SPARQL.

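The invention converts the ontology into a database using the Jena package. The conversion proceeds as follows:

1. Install the necessary software and configure the development environment: Eclipse + MySQL Server 5.5 (win32) + Jena 2.6.4 + Protégé 5.1.0 + mysql-connector-java-5.1.35 (the MySQL JDBC driver);

2. Build the product ontology with Protégé 5.1.0 and generate an OWL ontology file;

3. Create a database in MySQL;

4. Open Eclipse and create a Java project;

5. In the new project, import the Jena package and the MySQL JDBC driver;

6. Create a Java class named military_ontology.java under the project directory;

7. Write the code in military_ontology.java and run it;

8. The ontology is successfully converted into a database.

After the initial ontology is successfully converted with Jena, 7 tables are generated. The ontology information is stored in the jena_g1t0_reif table; the other tables need no attention.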
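
The idea of persisting an ontology as rows of statements can be illustrated with a small sketch. It does not reproduce Jena's API or its table layout: the table name `stmt`, its columns and the helper functions are hypothetical, and an in-memory SQLite database stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def persist_triples(conn, triples):
    """Store (subject, predicate, object) statements, skipping duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stmt ("
        "subj TEXT, prop TEXT, obj TEXT, UNIQUE (subj, prop, obj))"
    )
    conn.executemany("INSERT OR IGNORE INTO stmt VALUES (?, ?, ?)", triples)
    conn.commit()


def subclasses_of(conn, parent):
    """Return the direct subclasses recorded for a parent class."""
    rows = conn.execute(
        "SELECT subj FROM stmt WHERE prop = 'rdfs:subClassOf' AND obj = ? "
        "ORDER BY subj",
        (parent,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
persist_triples(conn, [
    ("Advertisement", "rdf:type", "owl:Class"),
    ("Advertisement", "rdfs:subClassOf", "Product"),
    ("Price", "rdf:type", "owl:Class"),
    ("Price", "rdfs:subClassOf", "Product"),
])
children = subclasses_of(conn, "Product")
```

Keeping the ontology as rows makes later expansion a matter of inserting further statements, which is the property the persistence step relies on.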

3. Ontology expansion based on syntactic dependency

The invention extends the product ontology automatically using dependency relations of Chinese syntax. The focus is the syntax of user comments: syntactic relations are used to discover descriptive words or lower-level attributes of the product itself or of its dimensions. Because the ontology in question is a product ontology, automatic extension only needs to attend to the product's evaluation indicators and product dimensions and can ignore unrelated vocabulary. Automatic extension of the ontology covers two aspects.

(1) Extending the dimensions of the ontology

The ontology dimensions of the product ontology are extended. Because the ontology cannot be guaranteed to be complete when first built, it must be extended gradually as data is processed. The main technique used to extend ontology dimensions is syntactic dependency parsing.

Dependency syntax was proposed by the French linguist Tesnière in 1959. Its core idea is that, in a sentence, the predicate verb is the center that governs the other constituents and is not itself governed by any other constituent, while the governed constituents depend on their governor through some dependency relation. A dependency syntactic structure takes the dependency relation as its basic element, that is, a binary relation between a pair of words. In the binary relation, the governor is called the head word and the subordinate is called the dependent word. The dependency relation reflects the semantic dependency between the head word and the dependent word.

The invention performs syntactic analysis with the Stanford Parser (Stanford University, 2002). The Stanford Parser is a syntactic parser developed by the Stanford University natural language processing group; it is a highly optimized probabilistic context-free grammar and lexicalized dependency parser whose principle derives from probability and statistics. An open-source Java implementation is available as a software package and supports several languages, including English, Chinese and German.

The Stanford Parser outputs syntactic relations in several forms, such as parse trees and typed dependencies; the invention uses typed dependencies for the extension. The software provides many dependency relations; when extending dimensions, the invention attends mainly to two relations involving keywords, nn and assmod. nn denotes a noun compound: for example, from the nn relation "advertising cost" it follows that "cost" is a subordinate dimension of "advertisement". assmod denotes an associative modifier, which is mainly a dependency between two noun phrases: for example, from the assmod relation "medicinal liquor advertisement" it follows that "advertisement" is a subordinate dimension of "medicinal liquor".

A newly obtained subordinate dimension and its discovered relation are stored in the ontology database as follows: first set the type of the new dimension to class, then make it a subclass of the corresponding parent dimension. Table 1 shows the entries that must be added to the database when a new dimension is inserted.

Table 1. Entries inserted into the database for a new dimension
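
The dimension extension described above can be sketched as follows. The sketch assumes that the parser output has already been reduced to `(relation, governor, dependent)` tuples and that the modifier (the dependent) names the parent dimension, which matches the "advertising cost" example; the function name, the dictionary-based hierarchy and the triple notation are hypothetical and only illustrate the class and subclass entries.

```python
DIMENSION_RELATIONS = {"nn", "assmod"}


def extend_dimensions(dependencies, hierarchy):
    """Add new sub-dimensions to `hierarchy` (child -> parent) in place.

    Returns the statements that would be inserted into the ontology
    database: one class declaration and one subclass link per dimension.
    """
    added = []
    for relation, governor, dependent in dependencies:
        if relation not in DIMENSION_RELATIONS:
            continue
        parent, child = dependent, governor
        parent_known = parent in hierarchy or parent in hierarchy.values()
        if parent_known and child not in hierarchy:
            hierarchy[child] = parent
            added.append((child, "rdf:type", "owl:Class"))
            added.append((child, "rdfs:subClassOf", parent))
    return added


hierarchy = {"medicinal liquor": "product"}
dependencies = [
    ("assmod", "advertisement", "medicinal liquor"),
    ("nn", "cost", "advertisement"),
    ("amod", "publicity", "false"),  # ignored here: not a dimension relation
]
new_statements = extend_dimensions(dependencies, hierarchy)
```

Only relations whose modifier is already a known dimension extend the hierarchy, which keeps unrelated noun compounds out of the ontology.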

(2) Expanding the emotion vocabulary

General-purpose emotion vocabulary can be added when the ontology is first built. When a specific product is analyzed, however, different dimensions need different dimension-specific emotion words; most of these have to be obtained from real data and are hard to collect comprehensively during initial preparation.

The emotion vocabulary is expanded in the same way as in the previous section, using the Stanford Parser. On the basis of the dependency analysis, expanding the emotion vocabulary attends to two other relations, amod and nsubj. amod denotes an adjectival modifier, typically an adjective preceding a noun: for example, from the amod relation "sham publicity" it follows that "false" is an emotion word for "advertisement". nsubj denotes a nominal subject and mainly expresses the link between a subject and its predicate: for example, from the nsubj relation "statement is shameless" it follows that "shameless" is an emotion word for "statement".

An emotion word is an instance and is inserted into the ontology database as follows: first set its type to NamedIndividual, then assign it to the emotion vocabulary of the category it describes, and finally insert its emotion weight into the database; the emotion weight is obtained from the sentiment dictionary. Table 2 shows the entries that must be added to the database when a new dimension emotion word is inserted.

Table 2. Entries inserted into the database for a dimension emotion word
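
A comparable sketch for the emotion vocabulary is given below. It again assumes `(relation, governor, dependent)` tuples; for amod the governor is taken to be the noun and the dependent the adjective, and for nsubj the governor is taken to be the predicate and the dependent the subject. The function name, the record layout and the weights in the example dictionary are hypothetical and are not values from any real sentiment dictionary.

```python
def extract_emotion_words(dependencies, dimensions, sentiment_dict):
    """Collect emotion words that describe a known ontology dimension."""
    found = []
    for relation, governor, dependent in dependencies:
        if relation == "amod":
            dimension, word = governor, dependent
        elif relation == "nsubj":
            dimension, word = dependent, governor
        else:
            continue
        # Keep the word only if its dimension exists in the ontology and
        # the sentiment dictionary supplies a weight for it.
        if dimension in dimensions and word in sentiment_dict:
            found.append({
                "word": word,
                "type": "NamedIndividual",
                "dimension": dimension,
                "weight": sentiment_dict[word],
            })
    return found


entries = extract_emotion_words(
    [
        ("amod", "publicity", "false"),
        ("nsubj", "shameless", "statement"),
        ("nn", "cost", "advertisement"),
    ],
    dimensions={"publicity", "statement"},
    sentiment_dict={"false": -0.8, "shameless": -0.9},  # made-up weights
)
```

Each returned record mirrors the three insertion steps: the instance type, the dimension whose emotion vocabulary it joins, and its emotion weight.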

4. Emotion weight calculation

After the ontology has been expanded and updated, sentiment analysis is performed on the data. Traditional sentiment analysis mostly adds the emotion weights directly, which gives too large an error. To avoid this, the invention uses an emotion weight analysis method based on ontology dimensions. When computing the emotion weight, the method takes into account the dimension index of each emotion word in the ontology and so reflects the effect of the emotion word more fully.

Before the sentiment computation, preparatory work is carried out, chiefly the introduction of several dictionaries: an emotion weight dictionary, a negation dictionary and a synonym dictionary. The emotion weight dictionary is the sentiment dictionary of the Department of Chinese Language, Tsinghua University, which contains the emotion weight of each word itself. The negation dictionary is the one used by Jiangsu University of Science and Technology; its purpose is to negate weights: if a negation word precedes a word, the weight of the following word should be reversed. The synonym dictionary is the Chinese thesaurus of the Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology. It serves first to expand the ontology: the synonyms of each word can be added to the ontology as similar dimensions, which makes looking up or judging dimensions more accurate. Second, when the emotion dictionary does not contain the weight of a word, the weights of all of the word's synonyms can be used and their average taken as the word's emotion weight.
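
The synonym fallback described above can be sketched as a small lookup. The dictionaries here are plain Python mappings with made-up entries, and returning `None` when neither the word nor any synonym has a weight is an assumption, since the text does not say what happens in that case.

```python
def emotion_weight(word, emotion_dict, synonyms):
    """Look up a word's emotion weight, falling back to its synonyms."""
    if word in emotion_dict:
        return emotion_dict[word]
    # Average the weights of those synonyms that the dictionary covers.
    weights = [emotion_dict[s] for s in synonyms.get(word, []) if s in emotion_dict]
    if not weights:
        return None  # assumption: no weight can be derived
    return sum(weights) / len(weights)


emotion_dict = {"bad": -1.0, "awful": -2.0}          # made-up weights
synonyms = {"terrible": ["bad", "awful", "dire"]}    # made-up thesaurus entry
fallback = emotion_weight("terrible", emotion_dict, synonyms)
```

Synonyms that are themselves missing from the emotion dictionary ("dire" in the example) are simply left out of the average.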

The formula used to compute the emotion weight of each sentence is:

where n is the number of emotion words in the sentence and Dimen_i is the weight of the dimension to which the i-th word belongs, computed as follows:

Dimen_i=Per_{class_i}*Per_{words_i}

where Per_{class_i} is the proportion of all classes accounted for by the subclasses of the dimension to which the i-th word belongs, and Per_{words_i} is the proportion of all evaluation words accounted for by the emotion evaluation words of that dimension.
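
As a worked illustration of Dimen_i = Per_{class_i} * Per_{words_i}, the sketch below computes the dimension weight from four counts. The function and parameter names are hypothetical, and the counts in the example are invented; in the method itself they would come from queries against the ontology.

```python
def dimension_weight(subclass_count, total_classes,
                     dimension_eval_words, total_eval_words):
    """Dimen_i = Per_class_i * Per_words_i for one dimension."""
    per_class = subclass_count / total_classes           # Per_{class_i}
    per_words = dimension_eval_words / total_eval_words  # Per_{words_i}
    return per_class * per_words


# Invented counts: the dimension has 5 of the 20 classes as subclasses
# and 25 of the 50 evaluation words.
dimen = dimension_weight(5, 20, 25, 50)
```

With these counts Per_{class_i} is 0.25 and Per_{words_i} is 0.5, so the dimension weight is 0.125; a dimension with more subclasses and more evaluation words contributes more to the sentence score.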

Dimension classes and emotion words are queried simply with SPARQL, the query language that comes with the ontology: through the interface provided by the Jena package, SPARQL statements can extract class- and instance-related data from the ontology that has been converted into a database.

TI_i is the TF*IDF weight of the word, computed as follows:

Tf_ij is the tf value of the word and expresses how often a word occurs in the current document: the numerator is the number of times word t_i occurs in document j, and the denominator is the total number of word occurrences in document j. Idf_ij is the idf value of the word, called the inverse document frequency: the total number of documents is divided by the number of documents containing the keyword and the logarithm of the quotient is taken; the numerator is the total number of documents and the denominator is the number of documents containing word t_i. To keep the denominator always positive, 1 is added to it.
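
Read from the verbal description above, the TF*IDF weight appears to be the product of tf (occurrences of the word in document j divided by the total number of word occurrences in document j) and idf (the logarithm of the total number of documents divided by one plus the number of documents containing the word). The sketch below follows that reading; the natural logarithm and the function names are assumptions, since the text does not state the logarithm base.

```python
import math


def tf(term, doc):
    """Share of the tokens in `doc` that are `term`."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / (1 + documents containing `term`))."""
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + containing))


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


docs = [
    ["ad", "false", "ad"],
    ["price", "high"],
    ["taste", "good"],
    ["price", "low"],
]
weight = tf_idf("ad", docs[0], docs)
```

Because of the added 1, a word that occurs in all or nearly all documents gets an idf of zero or below, so very common words contribute little or even negatively under this reading.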

Pri_i is the negation weight: if word i is modified by a negation word when the emotion weight is computed, its weight is multiplied by the weight of the negation word, which is generally negative and defaults to -1 if the negation word is not in the negation weight dictionary. Value_i is the emotion weight of the word itself, taken from the emotion weight dictionary.

Using the SPARQL query statements that come with the ontology, word matching can be performed directly on each sentence to find each word's dimension, emotion category and weight in the ontology, after which the emotion weight can be computed with the formula above.
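
The sentence-level formula itself is not reproduced in this text, so the following sketch rests on an explicit assumption: that the sentence score is the sum, over the n emotion words, of the product Pri_i * Value_i * Dimen_i * TI_i. The class and function names, the default Pri_i of 1 for words that are not negated, and the sign-based polarity rule are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmotionWord:
    value: float       # Value_i: weight from the emotion weight dictionary
    dimen: float       # Dimen_i: weight of the word's dimension
    ti: float          # TI_i: TF*IDF weight of the word
    pri: float = 1.0   # Pri_i: negation weight (assumed 1 when not negated)


def sentence_emotion_weight(words):
    """Assumed combination: sum of the per-word products."""
    return sum(w.pri * w.value * w.dimen * w.ti for w in words)


def polarity(score):
    """Assumed rule: the sign of the score gives the orientation."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


words = [
    EmotionWord(value=2.0, dimen=0.5, ti=0.5),
    EmotionWord(value=1.0, dimen=0.25, ti=1.0, pri=-1.0),  # negated word
]
score = sentence_emotion_weight(words)
```

In the example the first word contributes 0.5 and the negated word contributes -0.25, giving a sentence score of 0.25 and a positive orientation under the assumed rule.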

The specific steps are shown below using real Chinese comments about Hongmao medicinal liquor crawled from microblogs:

1. Crawl microblogs related to Hongmao medicinal liquor with a crawler, preprocess them, segment them into words with jieba, and then remove stop words.

2. Build the Hongmao medicinal liquor ontology from consumer evaluation indicators, official Hongmao medicinal liquor documents, microblog comments and similar content. Fig. 2 shows part of the resulting ontology.

3. After building the initial ontology with Protégé, convert it into a database with the Jena package.

4. Once the ontology has been converted into a database, expand it. Following the method above, the expansion covers two aspects: ontology dimensions and emotion vocabulary.

5. Select representative microblogs from the microblog data and annotate their sentiment manually, finally obtaining 1000 positive reviews and 1000 negative reviews.

6. Apply the formula of the invention to evaluate the emotion weight of the selected microblogs and determine their sentiment polarity.

7. Label the sentiment of the same microblogs with conventional SVM and naive Bayes classifiers, and compare the results using precision, recall and F-score. Fig. 3 compares the method of the invention with the conventional machine learning methods SVM and naive Bayes; the horizontal axis represents the methods tested and the vertical axis the metric values. The comparison shows that the invention analyzes the sentiment orientation of product microblogs more accurately and in finer detail.
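
The evaluation metrics named in step 7 can be computed as in the sketch below, which uses the standard definitions of precision, recall and the balanced F-score for one target class. The function name and the toy labels are hypothetical and do not reproduce the experimental data.

```python
def precision_recall_f(gold, predicted, positive="positive"):
    """Precision, recall and F-score for the `positive` class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "positive", "negative"]
metrics = precision_recall_f(gold, predicted)
```

Running the same function on the labels produced by each method gives directly comparable figures of the kind plotted in Fig. 3.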

This embodiment provides a sentiment analysis method for Chinese microblogs based on the combination of an emotion ontology and syntactic dependency. It comprises the following steps: for the topic to be analyzed, microblog information is collected with a crawler; after collection, the data is cleaned and its dimensionality reduced; an initial ontology is then built semi-automatically from the topic-related microblog information and official documents; the ontology is then updated automatically from microblog data along two aspects, product dimensions and emotion vocabulary, yielding a mature ontology. The information carried by the ontology is then used to compute the emotion weight of the microblog messages, thereby analyzing the sentiment orientation of the microblog data. Finally, with precision, recall and F-score as evaluation criteria, comparison with conventional machine learning classification algorithms shows that the invention is feasible and performs better on a Chinese microblog data set.

The invention has been explained in detail above with reference to the embodiment, but it is not limited to the example given. Within the knowledge of a person skilled in the art, various changes can be made without departing from the purpose of the invention, and such changes shall also be regarded as falling within the scope of protection of the invention.

Claims

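1. A microblog sentiment analysis method combining ontology and syntactic dependency, characterized by comprising the following steps:

Step (1): semi-automatically build a topic-related ontology and persist the ontology to a database;

Step (2): expand and update the ontology along two aspects, ontology dimensions and emotion vocabulary, using syntactic dependency relations;

Step (3): compute emotion weights for the microblog messages using the ontology and determine the sentiment orientation.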

2. The microblog sentiment analysis method combining ontology and syntactic dependency according to claim 1, characterized in that step (1) specifically comprises:

Step (1.1): build the ontology in Protégé using the conventional seven-step method: determine the domain and scope of the ontology; consider reusing existing ontologies; enumerate the important terms of the domain; define the classes and the class hierarchy; define the properties of the classes; define the facets of the properties; create instances;

Step (1.2): convert the ontology into a database using the Jena package, which extracts data at the semantic level and converts it into a model; the data source is a database or a file.

3. The microblog sentiment analysis method combining ontology and syntactic dependency according to claim 2, characterized in that the conversion in step (1.2) proceeds as follows:

1. Install the necessary software and configure the development environment: Eclipse + MySQL Server 5.5 (win32) + Jena 2.6.4 + Protégé 5.1.0 + mysql-connector-java-5.1.35 (the MySQL JDBC driver);

2. Build the product ontology with Protégé 5.1.0 and generate an OWL ontology file;

3. Create a database in MySQL;

4. Open Eclipse and create a Java project;

5. In the new project, import the Jena package and the MySQL JDBC driver;

6. Create a Java class named military_ontology.java under the project directory;

7. Write the code in military_ontology.java and run it;

8. The ontology is successfully converted into a database;

After the initial ontology is successfully converted with Jena, 7 tables are generated; jena_g1t1_stmt is the table that stores the ontology content.

4. The microblog sentiment analysis method combining ontology and syntactic dependency according to claim 3, characterized in that step (2) specifically comprises:

Step (2.1): extend the ontology dimensions of the product ontology by syntactic dependency parsing. In a sentence, the predicate verb is the center that governs the other constituents and is not itself governed by any other constituent; the governed constituents depend on their governor through some dependency relation. A dependency syntactic structure takes the dependency relation as its basic element, that is, a binary relation between a pair of words; in the binary relation, the governor is called the head word and the subordinate is called the dependent word. Syntactic analysis is performed with the Stanford Parser dependency parser:

From the Stanford Parser output, typed dependencies are selected for the extension. When extending dimensions, attention is paid to two relations involving keywords, nn and assmod: nn denotes a noun compound, and assmod denotes an associative modifier, a dependency between two noun phrases. A newly obtained subordinate dimension and its discovered relation are stored in the ontology database as follows: first set the type of the new dimension to class, then make it a subclass of the corresponding parent dimension;

Step (2.2): the emotion vocabulary is expanded as follows: using the Stanford Parser, on the basis of the dependency analysis, attention is paid to two other relations, amod and nsubj. amod denotes an adjectival modifier, typically an adjective preceding a noun; nsubj denotes a nominal subject and expresses the link between a subject and its predicate. An emotion word is an instance and is inserted into the ontology database as follows: first set its type to NamedIndividual, then assign it to the emotion vocabulary of the category it describes, and finally insert its emotion weight into the database; the emotion weight is obtained from the sentiment dictionary.

5. The microblog sentiment analysis method combining ontology and syntactic dependency according to claim 4, characterized in that step (3) specifically comprises:

The formula used to compute the emotion weight of each sentence is:

where n is the number of emotion words in the sentence; Pri_i is the negation weight: if word i is modified by a negation word when the emotion weight is computed, its weight is multiplied by the weight of the negation word, which is generally negative and defaults to -1 if the negation word is not in the negation weight dictionary; Value_i is the emotion weight of the word itself, taken from the emotion weight dictionary;

Dimen_i is the weight of the dimension to which the i-th word belongs, computed as follows:

Dimen_i=Per_{class_i}*Per_{words_i}

where Per_{class_i} is the proportion of all classes accounted for by the subclasses of the dimension to which the i-th word belongs, and Per_{words_i} is the proportion of all evaluation words accounted for by the emotion evaluation words of that dimension;

Dimension classes and emotion words are queried with SPARQL, the query language that comes with the ontology: through the interface of the Jena package, SPARQL statements extract class- and instance-related data from the ontology that has been converted into a database;

TI_i is the TF*IDF weight of the word, computed as follows:

Tf_ij is the tf value of the word and expresses how often a word occurs in the current document: the numerator is the number of times word t_i occurs in document j, and the denominator is the total number of word occurrences in document j. Idf_ij is the idf value of the word, called the inverse document frequency: the total number of documents is divided by the number of documents containing the keyword and the logarithm of the quotient is taken; the numerator is the total number of documents and the denominator is the number of documents containing word t_i. To keep the denominator always positive, 1 is added to it;

Using the SPARQL query statements that come with the ontology, word matching is performed directly on each sentence to find each word's dimension, emotion category and weight in the ontology, after which the emotion weight is computed with the formula above.