CN103049435B

CN103049435B - Text fine granularity sentiment analysis method and device

Info

Publication number: CN103049435B
Application number: CN201310000734.5A
Authority: CN
Inventors: 施寒潇; 厉小军
Original assignee: Zhejiang Gongshang University
Current assignee: Hangzhou Brain Top Technology Co ltd
Priority date: 2013-01-04
Filing date: 2013-01-04
Publication date: 2015-10-14
Anticipated expiration: 2033-01-04
Also published as: CN103049435A

Abstract

The invention discloses a text fine-grained emotion analysis method comprising the following steps: quantitative calculation of the polarity intensity of emotion words; joint identification of evaluation object attributes and their emotion expression elements; and fine-grained attribute classification with the associated emotion calculation. The invention also discloses a text fine-grained emotion analysis device comprising a comment data acquisition and preprocessing module, a data processing module, a data analysis module and an information display module. The three steps of the method have the following advantages: (1) the quantitative calculation of emotion word polarity intensity improves accuracy by nearly 30%; (2) the joint identification of evaluation object attributes and their emotion expression elements reaches a joint recognition accuracy above 80% in a specific emotion analysis application domain; (3) the fine-grained attribute classification improves the overall performance of the emotion calculation by more than 2.5%.

Description

Text fine-grained emotion analysis method and device

Technical Field

The invention belongs to the technical field of computer application, and particularly relates to a fine-grained sentiment analysis method and device for subjective texts, which can be applied to commodity comments of commercial websites and network public opinion analysis of enterprises or government departments.

Background

With the rapid development of the Internet, and especially the gradual popularization of Web 2.0 technology, network users have changed from mere consumers of information into the main producers of network content. According to the 30th Statistical Report on Internet Development in China (CNNIC, 2012), issued by the China Internet Network Information Center, by the end of June 2012 the number of Internet users in China had reached 538 million, an increase of 24.5 million over the end of 2011, and the Internet penetration rate was 39.9%. This large and rapidly growing user group, together with Web 2.0 applications, is increasing the volume of network content and the amount of information accessed at an unprecedented speed, and the Internet has become an important channel through which people express opinions and obtain information. Information on the Internet currently takes many forms, such as news, blog articles, product reviews and forum posts.

Analysis of emotional orientation in product reviews has become a sustained focus of research. This research aims to use the abundant customer comments available on the network to analyze market feedback on a product, and to provide manufacturers and consumers with visual evaluation reports covering the product's individual characteristics. On the one hand, emotional information on the Internet is growing explosively; on the other hand, this information matters to users at every level, including ordinary consumers, companies and governments. How to help users find the emotional information they need conveniently and quickly has therefore become an urgent problem. The emotion analysis task addresses exactly this need and is expected to build a bridge between users and emotional information, so that users can acquire it effectively. By analyzing information on the network, particularly the orientation of subjective texts, the consumption habits of users can be better understood and public opinion on hot events can be analyzed, providing an important basis for decisions by enterprises, governments and other organizations. When reading product reviews, users want above all to know the emotional tendency toward each individual aspect of a product, since this best supports an overall judgment and choice. Traditional emotion analysis is usually a coarse-grained method operating on whole chapters or sentences and cannot meet this need effectively, so a fine-grained emotion analysis method is required.

Emotion analysis methods fall into two general categories. The first is the rule-based approach: the emotion words appearing in the text are found using an emotion dictionary, simple emotion polarity statistics are computed, and a polarity conclusion is reached by comparing the final score with a preset threshold. This approach is generally used for chapter-level emotion analysis. The second is the machine learning based approach: an emotion classifier is trained on a large labeled corpus and then used to classify test texts.

(1) Rule-based methods. Current methods mainly extract emotion words and judge their polarity according to designed rules, and then perform simple polarity statistics over all emotion words to obtain the overall emotional polarity of a text. In addition, some methods calculate the semantic orientation of words and take the distribution, density and semantic intensity of polar elements into account in order to expand the set of emotion words and further correct the overall emotional polarity of the text.
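The rule-based procedure described above can be sketched as follows. This is a minimal illustration rather than the method of any particular prior system: the toy lexicon, the integer polarity values and the `threshold` parameter are all assumptions made for the example.

```python
def document_polarity(tokens, lexicon, threshold=0):
    """Sum the polarity of every lexicon word found in the text and
    compare the total with a preset threshold."""
    score = sum(lexicon.get(tok, 0) for tok in tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Hypothetical lexicon: +1 for positive words, -1 for negative words.
toy_lexicon = {"good": 1, "cheap": 1, "blurry": -1}
review = ["the", "lens", "is", "good", "and", "cheap", "but", "blurry"]
label = document_polarity(review, toy_lexicon)
```

Because every emotion word contributes the same unit weight, a sketch like this cannot distinguish strong from weak expressions, which is the limitation the quantified polarity intensity is meant to address.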

(2) Machine learning based methods. These methods mainly use emotion words, word co-occurrence pairs, syntactic templates, topic-related features and the like as classification features, and apply a machine learning classifier to perform emotion or orientation analysis. Commonly used classification methods include centroid (center vector) classification, KNN classification, perceptron classification, Bayesian classification, maximum entropy classification and support vector machine classification. The general process is to train a model on manually labeled training documents and then use it to make predictions on test documents. This approach is widely used for sentence-level emotion analysis.

When either of these methods is used for text emotion analysis, at the sentence level or the chapter level, the result depends strongly on the emotion dictionary, so the correctness of the dictionary directly affects the result. Most emotion dictionaries are currently constructed manually, which involves a huge workload, and new emotion words emerge continuously as the Internet develops, so constructing a dictionary once is far from sufficient. Moreover, current emotion dictionaries often lack quantified polarity intensity and can hardly meet the needs of emotion calculation. On the other hand, in the actual analysis process, existing methods generally consider only word-level features: after word segmentation and part-of-speech tagging, part-of-speech features are used to identify attributes and emotion words, but an overall semantic understanding of the sentence is lacking, so recognition performance is poor.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a text fine-grained emotion analysis method, which comprises the following steps: quantitative calculation of the polarity intensity of emotion words; joint identification of evaluation object attributes and their emotion expression elements; and fine-grained attribute classification with the associated emotion calculation.

Further, the emotion word polarity strength quantitative calculation comprises polarity strength quantitative calculation of the basic emotion words and polarity strength quantitative calculation of the compound emotion words.

Further, the polarity intensity quantitative calculation of the basic emotion words comprises calculating the emotion value of each character, using the following formulas:

$$P_{c_i} = \frac{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j}}{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}} \tag{1}$$

$$N_{c_i} = \frac{fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}}{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}} \tag{2}$$

where $P_{c_i}$ is the weight of character $c_i$ as a positive (commendatory) character and $N_{c_i}$ is its weight as a negative (derogatory) character; $fp_{c_i}$ is the frequency with which $c_i$ occurs in the positive word list and $fn_{c_i}$ is the frequency with which it occurs in the negative word list; $n$ is the number of distinct characters appearing in the positive word list and $m$ is the number appearing in the negative word list. Formulas (1) and (2) give the weight of each character as a positive and as a negative character. To balance the difference between the numbers of positive and negative words in the emotion dictionary, formulas (1) and (2) normalize the frequency of each character within the positive and negative word lists respectively.

Finally, the emotional tendency value $S_{c_i}$ of character $c_i$ is calculated by formula (3):

$$S_{c_i} = P_{c_i} - N_{c_i} \tag{3}$$

If $S_{c_i}$ is positive, $c_i$ is a positive character; if it is negative, $c_i$ is a negative character; and if it is close to 0, $c_i$ tends to be neutral.

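Further, the polarity intensity quantitative calculation of the basic emotion words also comprises calculating the emotion value of the basic word, using the following formula:

$$S_w = \operatorname{sign}(S_{c^*}) \cdot \max_{c_i \in w} \left| S_{c_i} \right| \tag{4}$$

where $\max_{c_i \in w} |S_{c_i}|$ is the largest absolute emotion value among all characters of the word $w$, $c^*$ is the character that attains this maximum, and $\operatorname{sign}(S_{c^*})$ is the sign of that character's emotion value: +1 if the value is greater than 0 and -1 if it is less than 0.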

Further, for the polarity intensity quantitative calculation, compound emotion words are divided into:

1) reduplicated forms of basic emotion words;

2) basic emotion word + basic emotion word;

3) negation word + basic emotion word;

4) degree modifier + basic emotion word;

5) negation word + degree modifier + basic emotion word, or degree modifier + negation word + basic emotion word, calculated using the following formula:

（5）

where the quantities in formula (5) are: the emotion value of the base word; the action coefficient of the degree word (one of 0.5, 0.7, 0.9, 1.1, 1.3, 1.5); the degree-word reaction coefficient, which is the sum of the two extremes of the range of the action coefficient and therefore has the value 2; the sign of the emotion value of the base word, which is +1 if the emotion value is greater than 0 and -1 if it is less than 0; and the absolute value of the emotion value of the base word.

Further, the joint identification of the evaluation object attribute and the emotion expression element thereof comprises the following steps: extracting semantic features and constructing based on a serialization joint identification model.

Further, the extraction of the semantic features comprises extracting word segmentation information, part-of-speech tagging information and semantic role information.

Further, the fine-grained attribute classification and the emotion calculation thereof comprise attribute classification and fine-grained emotion summarizing calculation based on bootstrap learning.

Further, the fine-grained emotion summary calculation uses the following formula (6):

$$\bar{S}_{c(i)} = \frac{1}{n(c(i))} \sum_{j=1}^{n(c(i))} S_j(c(i)) \tag{6}$$

where $c(i)$ is attribute class $i$; $n(c(i))$ is the total number of times attribute class $c(i)$ occurs in the comments; $S_j(c(i))$ is the emotional tendency value corresponding to the $j$-th occurrence of attribute class $c(i)$ in the comments; and $\bar{S}_{c(i)}$ is the average emotional tendency value of attribute class $c(i)$ over all comments.
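The summary step described for formula (6) amounts to a per-class mean, which can be sketched as below. The function name `summarize`, the input shape (a list of `(attribute_class, value)` pairs) and the sample values are assumptions chosen for illustration; the source does not prescribe a data structure.

```python
from collections import defaultdict


def summarize(mentions):
    """Average the emotional tendency values of each attribute class.

    mentions: iterable of (attribute_class, emotion_value) pairs, one pair
    per occurrence of an attribute class in the comments.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for attr_class, value in mentions:
        totals[attr_class] += value
        counts[attr_class] += 1
    return {c: totals[c] / counts[c] for c in totals}


# Hypothetical occurrences extracted from several comments.
sample = [("appearance", 0.8), ("appearance", 0.4), ("price", -0.5)]
summary = summarize(sample)
```

Averaging keeps the summary in the same [-1, +1] range as the individual values, so classes mentioned often and classes mentioned rarely remain directly comparable.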

The invention also provides a text fine-grained emotion analysis device which comprises a comment data acquisition and preprocessing module, a data processing module, a data analysis module and an information display module, wherein the comment data acquisition and preprocessing module is used for acquiring and storing comment data; the data processing module carries out corresponding processing on the collected comment data and predicts new comment information; the data analysis module is used for carrying out sentiment analysis on the information processed by the data processing module and carrying out fine-grained sentiment intensity quantitative statistics and calculation by utilizing the correlation information between the object attributes and the sentiment words and the relation between the sentiment words and the modifiers; and the information display module is used for carrying out friendly visual display on the processed and analyzed comment information.

The three steps of the fine-grained emotion analysis method have the following advantages:

(1) the designed method for quantitatively calculating the polarity intensity of emotion words improves accuracy by nearly 30% compared with other methods, such as that of Ku (2006);

(2) the joint identification of evaluation object attributes and their emotion expression elements reaches a joint recognition accuracy above 80% in a specific emotion analysis application domain, far exceeding rule-based and statistical methods;

(3) the fine-grained attribute classification improves the overall performance of the emotion calculation by more than 2.5%.

Drawings

FIG. 1 is a graph of the serialization structure relationships of words in a sentence;

FIG. 2 is a flow chart of a bootstrap learning algorithm;

FIG. 3 is a schematic diagram of a text fine-grained emotion analysis device.

Detailed Description

The invention will be further explained with reference to the drawings.

The invention provides a text fine-grained emotion analysis method and device aiming at the problems of the existing emotion analysis method. The method and the device establish an extensible emotion dictionary with quantitative polarity intensity by designing a corresponding algorithm, thereby solving the difficulty of quantifying the polarity intensity of the emotion words; the method reasonably adopts the natural language technology and the machine learning method to carry out fine-grained emotion analysis on the text, and improves the accuracy of an analysis result.

The invention adopts the following technical means: quantitative calculation of the polarity intensity of emotion words; joint identification of evaluation object attributes and their emotion expression elements; and fine-grained attribute classification with the associated emotion calculation.

1. Quantitative calculation of emotional word polarity intensity

A polarity intensity quantization method based on emotion word classification is provided. Emotion words are divided into two categories: basic emotion words and compound emotion words. For basic emotion words, the emotion value of each character is calculated first, and rules are then designed to calculate the emotion value of the word from its characters. For compound emotion words, rules are designed on the basis of the relevant linguistic knowledge, and the value is computed compositionally from the way the component words combine.

(1) Polarity intensity quantitative calculation of basic emotion words

Basic emotion words are defined as emotion words that do not begin with a negation word or a degree modifier and that contain no more than 2 characters.

1) Computation of sentiment values for characters

First, the emotional tendency value of each character is calculated from an existing emotion dictionary by character frequency statistics; a formula is then designed to calculate the emotional tendency value of a word from the values of its characters. The detailed procedure is as follows.

The weight of each character in the emotion dictionary as a positive and as a negative character is first computed, as shown in formulas (1) and (2).

$$P_{c_i} = \frac{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j}}{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}} \tag{1}$$

$$N_{c_i} = \frac{fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}}{fp_{c_i} \big/ \sum_{j=1}^{n} fp_{c_j} + fn_{c_i} \big/ \sum_{j=1}^{m} fn_{c_j}} \tag{2}$$

Here $P_{c_i}$ is the weight of character $c_i$ as a positive (commendatory) character and $N_{c_i}$ is its weight as a negative (derogatory) character; $fp_{c_i}$ is the frequency with which $c_i$ occurs in the positive word list and $fn_{c_i}$ is the frequency with which it occurs in the negative word list; $n$ is the number of distinct characters appearing in the positive word list and $m$ is the number appearing in the negative word list. To balance the difference between the numbers of positive and negative words in the emotion dictionary, formulas (1) and (2) normalize the frequency of each character within the positive and negative word lists respectively. Finally, the emotional tendency value $S_{c_i}$ of character $c_i$ is calculated by formula (3).

$$S_{c_i} = P_{c_i} - N_{c_i} \tag{3}$$
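The character-level calculation can be sketched in a few lines of Python. The sketch assumes that formulas (1) and (2) take the normalized-ratio form (each character's relative frequency in one list divided by the sum of its relative frequencies in both lists) and that formula (3) is the difference of the two weights; the function name `char_sentiment` and the two-letter toy "words" are hypothetical stand-ins for a real emotion dictionary.

```python
from collections import Counter


def char_sentiment(positive_words, negative_words):
    """Return {character: S} with S = P - N, a value in [-1, 1]."""
    fp = Counter(ch for word in positive_words for ch in word)
    fn = Counter(ch for word in negative_words for ch in word)
    total_p = sum(fp.values())  # all character occurrences, positive list
    total_n = sum(fn.values())  # all character occurrences, negative list
    scores = {}
    for ch in set(fp) | set(fn):
        # Relative frequency within each list balances unequal list sizes.
        rel_p = fp[ch] / total_p if total_p else 0.0
        rel_n = fn[ch] / total_n if total_n else 0.0
        p_weight = rel_p / (rel_p + rel_n)
        n_weight = rel_n / (rel_p + rel_n)
        scores[ch] = p_weight - n_weight
    return scores


# Toy dictionary: letters stand in for characters.
scores = char_sentiment(["ab", "ac"], ["cd"])
```

In this toy dictionary "a" occurs only in positive words and "d" only in negative words, while "c" occurs in both lists and so receives an intermediate value.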

2) Computation of sentiment values for base words

Analysis of the structure of basic emotion words shows that the emotional tendency value of a basic emotion word is approximately equal to the largest of the emotional tendency values of its characters. For example, if the two characters of the word "beautiful" have emotion values of 0.5 and 0.8, the emotional tendency value of "beautiful" is taken to be 0.8; it should not simply be computed as the mean of the two. Therefore, formula (4) is used to calculate the emotional tendency value of a basic word:

$$S_w = \operatorname{sign}(S_{c^*}) \cdot \max_{c_i \in w} \left| S_{c_i} \right| \tag{4}$$

where $c^*$ is the character of the word $w$ whose emotion value has the largest absolute value, and $\operatorname{sign}(S_{c^*})$ is +1 if that value is greater than 0 and -1 if it is less than 0.
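A minimal sketch of this rule, assuming that formula (4) selects the character value with the largest magnitude and keeps its sign. The function name `base_word_sentiment` and the placeholder characters "x", "y", "p", "q" are hypothetical; unknown characters are treated as neutral, which is a choice made here and not stated in the source.

```python
def base_word_sentiment(word, char_scores):
    """Emotion value of a basic word: the signed value of its strongest character."""
    values = [char_scores.get(ch, 0.0) for ch in word]
    # max with key=abs returns the value of largest magnitude, sign included,
    # which equals sign(S) * max|S| for the character attaining the maximum.
    return max(values, key=abs, default=0.0)


# Placeholder characters with the values from the example in the text.
toy_scores = {"x": 0.5, "y": 0.8, "p": -0.9, "q": 0.2}
positive_word_value = base_word_sentiment("xy", toy_scores)
negative_word_value = base_word_sentiment("pq", toy_scores)
```

Taking the strongest character rather than the mean reproduces the 0.8 of the example, where averaging would have given 0.65.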

(2) Polarity intensity quantitative calculation of composite emotional words

Compound emotion words are defined as emotion words that begin with a negation word or a degree modifier, or that contain more than 2 characters. Quantifying their polarity intensity is relatively complex, because they are formed by combining several kinds of vocabulary, such as basic emotion words, negation words and degree modifiers. The invention uses a method based on a phrase classification model to solve the problem of quantifying the polarity intensity of compound emotion words.

According to how they are composed, compound emotion words are divided into the following types:

1) Reduplicated forms of basic emotion words, such as the reduplicated forms of "beautiful" and "happy". The emotional tendency value of the underlying basic emotion word is found by looking up the root word. Because reduplication generally has little effect on the emotion value of the original word, the emotion value of the basic word is used directly to simplify the problem.

2) Basic emotion word + basic emotion word, such as "caution". The value of such a compound is calculated by averaging.

3) Negation word + basic emotion word, such as "not beautiful". The value of such a combination is obtained by inverting the emotional tendency value of the basic emotion word.

4) Degree modifier + basic emotion word, such as "very beautiful". For such a combination, the emotion value of the basic word is obtained first, and the action coefficient of the modifier is then looked up from predefined action intensities (one of 0.5, 0.7, 0.9, 1.1, 1.3 and 1.5); for example, the coefficient of "very" may be defined as 1.3 and that of "relatively" as 0.7. The final emotional tendency value of the combination is the product of the two; if the product falls outside the range [-1, +1], the nearest extreme value is taken.

5) Negation word + degree modifier + basic emotion word, or degree modifier + negation word + basic emotion word, such as "not too beautiful" or "too unsightly". Calculating these combinations is relatively complex, because the relative position of the negation word and the degree modifier directly affects the emotional tendency value. The method uses linguistic knowledge and adopts formula (5) to obtain the emotional tendency value of such compounds.

（5）

The quantities in formula (5) are: the emotion value of the base word; the action coefficient of the degree word (one of 0.5, 0.7, 0.9, 1.1, 1.3, 1.5); the degree-word reaction coefficient, which is the sum of the two extremes of the range of the action coefficient and therefore has the value 2; the sign of the emotion value of the base word, which is +1 if the emotion value is greater than 0 and -1 if it is less than 0; and the absolute value of the emotion value of the base word. Taking "not too beautiful" as an example, the emotional tendency value of "beautiful" is 0.8 and the action coefficient of "too" is 1.3, so formula (5) gives "not too beautiful" an emotional tendency value of 0.56. Applying the same formula to "too unsightly" gives -1. These results are broadly consistent with subjective judgment.
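The five composition rules can be sketched as small functions. Rules 1) to 4) follow the text directly. For rule 5) the body of formula (5) is not available here, so the two functions below are an assumption chosen only because they reproduce the two worked values (0.56 and -1): negation in front of the modifier replaces the action coefficient by the reaction coefficient minus the action coefficient, and negation behind the modifier inverts the modified value. All function names are hypothetical.

```python
THETA = 2.0  # reaction coefficient: sum of the range extremes 0.5 and 1.5


def clip(value, low=-1.0, high=1.0):
    """Keep a value inside the emotional tendency range [-1, +1]."""
    return max(low, min(high, value))


def reduplicated(base):          # rule 1: same value as the basic word
    return base


def base_plus_base(first, second):  # rule 2: average of the two words
    return (first + second) / 2


def negated(base):               # rule 3: invert the value
    return -base


def degree_modified(base, mu):   # rule 4: scale, then clip to the range
    return clip(mu * base)


def neg_degree_base(base, mu):   # rule 5a (assumed form): "not too ..."
    return clip((THETA - mu) * base)


def degree_neg_base(base, mu):   # rule 5b (assumed form): "too not ..."
    return clip(-mu * base)


not_too_beautiful = neg_degree_base(0.8, 1.3)
too_unsightly = degree_neg_base(0.8, 1.3)
```

With a base value of 0.8 and an action coefficient of 1.3, the first function yields approximately 0.56 and the second is clipped to -1, matching the examples in the text; other inputs should be checked against the original formula before relying on this form.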

2. Joint identification of evaluation object attribute and emotion expression element thereof

The most important work of the fine-grained sentiment analysis is the identification of the attributes of the evaluation objects and sentiment expression elements thereof.

(1) Extraction of semantic features

1) Word segmentation information

In text emotion analysis based on supervised learning, lexical features play an important role. The word is the smallest meaningful unit in natural language, but in Chinese there are no explicit boundaries between words, so word segmentation is the first task of Chinese information processing.

Conventional word segmentation methods, whether rule-based or statistical, typically rely on a pre-compiled word list (dictionary), and automatic segmentation makes its decisions from the word list and related information. In contrast, segmentation based on character tagging is in effect a word-formation method: the segmentation process is treated as the problem of labeling each character in a character string. Because each character occupies a definite word-formation position (its position within the word) when it forms part of a particular word, a training set of a certain scale is built by extracting relevant feature information and context information, and a machine learning tool is then used to segment the target sentence. Many existing word segmentation systems rely mainly on this method.
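The idea of treating segmentation as character labeling can be illustrated by converting between a segmented sentence and per-character position tags. The B/M/E/S tag set (begin, middle, end, single) is a common convention assumed here; the source does not name a tag set, and the Chinese example sentence is an illustrative rendering of "the geographical location of the hotel is very good".

```python
def words_to_char_tags(words):
    """Turn a segmented sentence into (character, position tag) pairs."""
    pairs = []
    for word in words:
        if not word:
            continue
        if len(word) == 1:
            pairs.append((word, "S"))          # single-character word
        else:
            pairs.append((word[0], "B"))       # first character
            pairs.extend((ch, "M") for ch in word[1:-1])  # inner characters
            pairs.append((word[-1], "E"))      # last character
    return pairs


def char_tags_to_words(pairs):
    """Rebuild the words from tagged characters (the decoding step)."""
    words, current = [], ""
    for ch, tag in pairs:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


segmented = ["酒店", "的", "地理位置", "很", "好"]
tagged = words_to_char_tags(segmented)
tag_string = "".join(tag for _, tag in tagged)
```

A trained sequence labeler predicts the tag for each character of an unsegmented sentence, and the decoding step then recovers the words, so no word list is needed at prediction time.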

2) Part-of-speech tagging information

Part-of-speech tagging (POS tagging) is the process of assigning an appropriate part of speech to each word in a sentence, that is, determining whether each word is a noun, a verb, an adjective or some other part of speech; the assigned category is also called a part-of-speech tag, or simply a tag. Part-of-speech tagging is a fundamental task in natural language processing and plays an important role in speech recognition, information retrieval and many other areas of natural language processing.

Part-of-speech tagging would be easy if each word had only one possible tag. However, because of the complexity of language, some words can take several tags. For example, "encouragement" can be a verb ("the teacher encourages us to study hard") or a noun ("this is an encouragement to us"). The key problem of part-of-speech tagging is therefore disambiguation, that is, selecting for each word in a sentence the tag that is correct in its context. Most tagging algorithms fall into three categories: rule-based taggers, stochastic taggers and hybrid taggers. A rule-based tagger generally relies on a manually built base of disambiguation rules. A stochastic tagger typically uses a training corpus to calculate the probability that a given word has a given tag in a given context; HMM-based taggers are an example. A hybrid tagger combines characteristics of both; the transformation-based learning (TBL) tagger is an example. Existing part-of-speech tagging tools can tag Chinese text.

3) Semantic role information

Semantic role labeling analyzes, for each predicate (verb, noun, etc.) in a given sentence, the corresponding semantic components of the sentence and assigns them semantic labels, such as agent, instrument or adjunct. Specifically, certain components of the sentence are labeled as semantic roles of a given verb predicate, and each is assigned a semantic meaning as part of the predicate's frame.

The method uses semantic role labeling to obtain an overall semantic understanding of the sentence. Take the sentence "The Canon A530P's lens is better than it, and the price is also cheaper than it" as an example. After Semantic Role Labeling (SRL), the sentence becomes:

[Canon A530P's lens]_Arg0 [than it]_ArgM-ADV [better]_V, and [price]_Arg0 [also]_ArgM-ADV [than it]_ArgM-ADV [cheaper]_V.

At present, semantic role sets are not defined uniformly across corpora and NLP tasks; only two roles, the agent (Arg0) and the patient (Arg1), are stable. Considering the generality of the emotion analysis system and its dependence on SRL results, and given that the system otherwise analyzes each sentence only through the word segmentation and part-of-speech tagging subtasks, the invention mainly considers the two semantic roles Arg0 and Arg1 together with predicate information. Because emotion modifiers are also identified when extracting emotion expression elements, ArgM-ADV is additionally considered when extracting SRL semantic information.

(2) Serialization-based joint recognition model construction

The extraction of evaluation object attributes, emotion words and emotion modifiers can be treated as a simple classification task: each word is regarded as an instance, and a classifier such as a support vector machine assigns a class label to each word independently. However, this approach assumes that the class labels of different words are independent, whereas in reality they are strongly correlated. The class labels of the surrounding context play an important role in determining the class label of the target word. For example, the sequential structure of a sentence helps determine the class labels of evaluation object attributes and their emotion expression elements. Given two consecutive words in a sentence, if the first is an adverb with an emotion-modifying effect, the adjective that follows it is very likely an emotion word, as with "very" and "good" in "the geographical location of the hotel is very good". Similarly, if a noun is followed in succession by an emotion-modifying adverb and an emotional adjective, the noun is very likely an attribute of the evaluation object, as with the noun "geographical location", the adverb "very" and the adjective "good" in the same sentence. The sequential structure in which words occur in a sentence therefore contributes greatly to recognizing emotion words and attribute words.

Besides words and parts of speech, semantic role labeling information also contributes greatly to determining the class label of the target word. The semantic roles of evaluation object attributes and their emotion expression elements in a sentence are often relatively fixed; for example, the evaluation object attribute, the emotion modifier and the emotion word in a sentence are often labeled with the semantic roles "Arg0", "ArgM-ADV" and "V" respectively. Therefore, under the sequential structure, semantic role information is fully used in the feature set of each word.

The invention uses a linear-chain conditional random field to describe the sequential structure in which words occur in a sentence, as shown in FIG. 1. The conditional random field model comprises two groups of nodes. The solid circles represent the set of observable variables, denoted X, which is the feature set corresponding to each word; the open circles represent the set of hidden variables, denoted Y, which is the set of class labels to be predicted. The class labels used in FIG. 1 are described in Table 1. In the linear-chain conditional random field model, the class labels of the words are connected linearly according to their positions in the sentence, so the class labels of adjacent words are taken into account in a single joint prediction.

TABLE 1 Label set and descriptions

Label	Description
<TP>	Evaluation object attribute
<SO>	Emotion word
<ADV>	Emotion modifier
<BG>	Other background word
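
As an illustration of the feature set described above, the following minimal sketch builds per-word features (word, part of speech, semantic role, plus the same for neighbouring words) that a linear-chain CRF toolkit could consume. The names `Token` and `token_features`, the window size, the tag values such as "n", "d", "a" and "O", and the labeling of the example sentence are assumptions made for illustration; they are not taken from the patent.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos: str   # part-of-speech tag
    role: str  # semantic role label, e.g. "Arg0", "ArgM-ADV", "V"


def token_features(sentence: List[Token], i: int, window: int = 1) -> Dict[str, str]:
    """Features for position i: the token itself plus neighbours within `window`."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(sentence):
            tok = sentence[j]
            feats[f"word[{offset}]"] = tok.word
            feats[f"pos[{offset}]"] = tok.pos
            feats[f"role[{offset}]"] = tok.role
        else:
            # Mark positions that fall before or after the sentence.
            feats[f"pad[{offset}]"] = "BOS" if j < 0 else "EOS"
    return feats


# Hypothetical analysis of the example sentence; tags and roles are illustrative.
sentence = [
    Token("hotel", "n", "O"),
    Token("geographical location", "n", "Arg0"),
    Token("very", "d", "ArgM-ADV"),
    Token("good", "a", "V"),
]
labels = ["BG", "TP", "ADV", "SO"]  # Table 1 labels, one per token
X = [token_features(sentence, i) for i in range(len(sentence))]
```

Because each feature dictionary contains the neighbouring words' parts of speech and roles, a sequence model trained on pairs such as `(X, labels)` can pick up the noun, adverb, adjective pattern discussed above.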

3. Fine-grained attribute classification and emotion calculation thereof

(1) Attribute classification based on bootstrap learning

As with emotion words, evaluation object attributes are described in diverse ways: the same kind of attribute can be expressed with different wording. For example, "appearance" may also be expressed by similar terms such as "exterior" and "surface". Although these terms differ, the meanings and concepts they describe are substantially the same. Before fine-grained emotion calculation, the attribute class of each evaluation object attribute must be determined so that emotion values can be summarized per class. Attribute classification is therefore very important for fine-grained emotion analysis. Although resources such as WordNet (English) and the HIT Tongyici Cilin synonym thesaurus (Chinese) can help with attribute classification to some extent, domain dependence and limited resource coverage make it difficult to achieve effective attribute classification in practical applications. How to classify attributes effectively and correctly is thus the first task of fine-grained emotion calculation and emotion summarization.

Compared with traditional coarse-grained emotion analysis, labeling an emotion corpus for fine-grained emotion analysis costs more labor and time. For the attribute classification problem, the invention therefore uses bootstrap learning (bootstrapping) to expand the fine-grained emotion corpus automatically and so reduce the dependence on labeled corpora.

Fig. 2 is a flow chart of the bootstrap learning algorithm.

The bootstrap learning algorithm first selects a small set of representative instances from the corpus and labels them; this set is called the labeled seed set L, and the large number of remaining instances form the unlabeled data set U. The most common way to select the seed set is random selection, so that its instances are reasonably representative of the corpus. The labeled seed set is then used as training data to train a supervised classifier (such as a CRF classifier or an SVM classifier) and obtain a classification model. The model is used to predict the unlabeled data set U, and the S most reliable instances are added to the labeled data set. This process is repeated until all unlabeled data have been added or a termination condition is met.
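
The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `bootstrap` and `train_word_overlap`, the confidence threshold `min_conf`, the round limit and the toy word-overlap classifier (a stand-in for the CRF or SVM classifier mentioned in the text) are all hypothetical, and a non-empty seed set is assumed.

```python
from typing import Callable, List, Sequence, Tuple

Labeled = Tuple[str, str]                        # (attribute term, attribute class)
Predictor = Callable[[str], Tuple[str, float]]   # term -> (class, confidence)
Trainer = Callable[[Sequence[Labeled]], Predictor]


def bootstrap(seed: Sequence[Labeled], unlabeled: Sequence[str], train: Trainer,
              s: int = 2, min_conf: float = 0.3, max_rounds: int = 50
              ) -> Tuple[List[Labeled], List[str]]:
    """Grow the labeled set L from the unlabeled set U, at most S instances per round."""
    labeled, pool = list(seed), list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break                                # all unlabeled data consumed
        predict = train(labeled)                 # retrain on the current labeled set
        scored = [(i, *predict(term)) for i, term in enumerate(pool)]
        scored.sort(key=lambda t: t[2], reverse=True)   # most confident first
        picked = [(i, label) for i, label, conf in scored[:s] if conf >= min_conf]
        if not picked:
            break                                # termination: nothing reliable left
        labeled.extend((pool[i], label) for i, label in picked)
        taken = {i for i, _ in picked}
        pool = [term for i, term in enumerate(pool) if i not in taken]
    return labeled, pool


def train_word_overlap(labeled: Sequence[Labeled]) -> Predictor:
    """Toy classifier: confidence is the share of a term's words already seen in a class."""
    vocab: dict = {}
    for term, label in labeled:
        vocab.setdefault(label, set()).update(term.split())

    def predict(term: str) -> Tuple[str, float]:
        words = set(term.split())
        scores = {lab: len(words & seen) / max(1, len(words)) for lab, seen in vocab.items()}
        best = max(sorted(scores), key=scores.get)
        return best, scores[best]

    return predict


# Hypothetical seed set and unlabeled attribute terms.
seed = [("room price", "price"), ("front desk service", "service")]
unlabeled = ["price level", "desk staff", "service attitude", "staff attitude"]
labeled, leftover = bootstrap(seed, unlabeled, train_word_overlap)
```

In this toy run the term "staff attitude" shares no word with the seed set, so it can only be classified after "desk staff" has been added in an earlier round, which is the effect the iterative expansion is meant to achieve.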

(2) Fine-grained sentiment summary computation

Emotion summarization aggregates the emotion values of the same evaluation object attribute class across all comments. Because emotional information is described in diverse ways and various modifiers may be present, calculating the emotion value is more demanding. The specific algorithm is similar to the quantitative calculation of polarity intensity for compound emotion words: when solving the emotional tendency value with the sentence as the unit, the compound-word polarity quantification method is applied to the identified emotion words and their modifiers.

The weights of individual comment publishers are not considered; all comments are treated as equally weighted. The emotion summary calculation thus reduces to computing the average emotion value for a given attribute class, as shown in formula (6).

（6）
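
The equal-weight averaging described above can be sketched as follows. The function name `summarize` and the sample values are hypothetical, and the sketch assumes that sentence-level emotion values have already been computed and assigned to attribute classes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize(values: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Mean emotion value per attribute class, with every comment weighted equally."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for attribute_class, value in values:
        totals[attribute_class] += value
        counts[attribute_class] += 1
    return {c: totals[c] / counts[c] for c in totals}


# Hypothetical emotion values, one per identified attribute mention.
scores = summarize([("service", 3.0), ("service", 2.0), ("price", 1.5)])
```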

As shown in FIG. 3, the text fine-grained sentiment analysis device of the invention mainly comprises a comment data acquisition and preprocessing module, a data processing module, a data analysis module and an information display module. The comment data acquisition and preprocessing module acquires and stores comment data by means of crawler software designed for a target website; for example, the crawler mainly targets the hotel review website Lvping (驴评网). Before the data are stored, each web page is filtered and its information is extracted in a structured form, and only the posting time, publisher, comment title and comment content of each comment are stored. The data processing module processes the comment data: natural language processing techniques are applied to extract semantic features such as word segmentation, part-of-speech tags and semantic role labels, and a machine learning method is used to build a learning model over the extracted features, which is then used to make predictions on new comment information. The data analysis module performs sentiment analysis on the output of the data processing module, carrying out quantitative fine-grained statistics and calculation of sentiment intensity using the association between object attributes and emotion words and the relation between emotion words and modifiers. The information display module presents the processed and analyzed comment information in a user-friendly visual form and provides a query interface, helping the user choose a hotel according to the emotion value of each attribute in the comments.

1. Comment data acquisition and preprocessing module

This module mainly implements the web crawler for the target review website and the structured extraction of comment content from its pages.

With the rapid growth of the Internet, the World Wide Web has become a carrier of a vast amount of information, and extracting and using that information efficiently has become a major challenge. A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. The main goal of this system is to acquire comment information from a review website, so crawling is restricted to the pages in the comment-related areas of the site. Conventional approaches analyze the page structure manually and hand-write identifiers to locate the target information; this system instead uses XPath together with lxml, an extension library for Python, which greatly improves the efficiency of writing the crawler, its running speed and its readability. XPath is a language for finding information in XML documents: it can traverse the elements and attributes of a document and locate nodes in its tree structure. The lxml library parses XML documents quickly and correctly. In the implementation, HTML page data can be treated as a special form of XML data, so an XPath expression can denote the specific location of a comment within the HTML document. XPath is also convenient to use, and expressions can be generated automatically with tools. Combined with the methods provided by lxml, target information such as the comment content, user name, posting date and user score can be extracted efficiently, so that the comment information is extracted in a structured form and stored in a pre-designed database.
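
The extraction step can be illustrated with a minimal sketch. The page fragment, its tag and class names, and the function name `extract_comments` are hypothetical; real review pages have their own structure. The sketch restricts itself to the path subset that lxml and the standard library's ElementTree both accept through `findall` and `findtext`, and it falls back to ElementTree only so that it runs where lxml is not installed. Real pages are rarely well-formed XML, so in practice a tolerant parser such as `lxml.html` would be needed, and lxml's `xpath()` method gives access to full XPath 1.0.

```python
try:
    from lxml import etree                   # the library named in the text
except ImportError:                          # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html><body>
  <div class="comment">
    <span class="user">guest01</span>
    <span class="date">2012-10-01</span>
    <h3 class="title">Great location</h3>
    <p class="content">The geographical location of the hotel is very good.</p>
  </div>
  <div class="comment">
    <span class="user">guest02</span>
    <span class="date">2012-10-03</span>
    <h3 class="title">Average</h3>
    <p class="content">The room was small.</p>
  </div>
</body></html>
"""


def extract_comments(page: str) -> list:
    """Return one record per comment block: publisher, posting date, title, content."""
    root = etree.fromstring(page.strip())
    records = []
    for node in root.findall(".//div[@class='comment']"):
        records.append({
            "user": node.findtext("span[@class='user']"),
            "date": node.findtext("span[@class='date']"),
            "title": node.findtext("h3[@class='title']"),
            "content": node.findtext("p[@class='content']"),
        })
    return records


comments = extract_comments(PAGE)
```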

Because comments are user-generated data, their formatting is often irregular. To reduce the impact on subsequent text analysis and natural language understanding, the comments are first preprocessed, for example by removing empty lines, redundant spaces and repeated punctuation marks. The preprocessed comment text is then stored in the corresponding field of the original comment record.
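
The three preprocessing operations named above can be sketched with regular expressions. The function name `clean_comment` and the exact set of punctuation marks are assumptions; note that this simple rule also collapses an ellipsis written as repeated full stops into a single full stop.

```python
import re

# Punctuation marks whose immediate repetition is collapsed (ASCII and full-width).
_REPEATED_PUNCT = re.compile(r"([!?,.;:。，！？；：、])\1+")
# Runs of spaces, tabs or full-width spaces.
_SPACES = re.compile(r"[ \t\u3000]+")


def clean_comment(text: str) -> str:
    """Remove empty lines, redundant spaces and repeated punctuation marks."""
    lines = [_SPACES.sub(" ", line).strip() for line in text.splitlines()]
    lines = [line for line in lines if line]           # drop empty lines
    return _REPEATED_PUNCT.sub(r"\1", "\n".join(lines))


cleaned = clean_comment("Great   hotel!!!\n\n\n  Room was clean。。。 ")
```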

2. Data processing and analyzing module

This module is the core of the device, since it directly determines the processing performance of the system. It mainly applies the text fine-grained sentiment analysis method to the semantic analysis of comment data and comprises three parts:

(1) Semantic feature extraction from comment sentences using natural language processing

This part uses natural language processing techniques such as word segmentation, part-of-speech (POS) tagging and semantic role labeling (SRL) to analyze and process comment sentences semantically, converting them into the corresponding feature representations and laying the foundation for subsequent machine learning.

(2) Machine learning method for joint identification of evaluation object attribute and emotion element

This part uses the acquired semantic feature information to build a learning model that jointly identifies evaluation object attributes and their emotion elements. While the model is being built, the feature template is adjusted through repeated experiments and the context information is fully exploited in order to improve model performance. Following the analysis in the preceding section, the best-performing model is used to label all comments, and all identified attribute descriptions of the evaluation objects are then assigned to attribute classes. This lays the foundation for the subsequent quantitative emotion calculation.

(3) Sentiment quantification calculation based on classification attributes

This part uses the extracted "attribute-emotion-modifier" word triples, the attribute classification information and the contextual semantic information to find the various relations between emotion words and their modifiers, and designs different emotion calculation methods for them to further improve calculation accuracy. Specifically, linguistic rules are studied on the basis of experimental data and results, and the different calculation methods are consolidated to produce the final quantitative emotion summary per attribute class.

3. Information display module

The information display module is mainly oriented to users, and commodity inquiry and recommendation are carried out in a convenient and friendly display mode.

The following specific example illustrates the application of the text fine-grained emotion analysis method and device of the present invention.

On opening the website homepage and clicking through to a specific hotel, the user can view all of that hotel's comments together with the emotion scores of its classified attributes.

Clicking the hyperlink of any hotel on the homepage opens the detailed comment page for that hotel.

The hotel in this example has 82 comments in total. These 82 comments are summarized with the method described above, and the fine-grained scores of the hotel's attribute classes are displayed to the user: environment: 2.8; facility: 2.4; catering: 2.6; price: 2.9; traffic: 2.8; service: 2.9; total score: 2.7. A score is also shown above each comment; it is the result of the sentiment analysis calculation for that comment, obtained by averaging all the emotion values that appear in the comment.

Claims

1. A text fine-grained emotion analysis method, comprising the following steps: quantitative calculation of emotion word polarity intensity; joint identification of evaluation object attributes and their emotion expression elements; and fine-grained attribute classification and its emotion calculation; wherein the quantitative calculation of emotion word polarity intensity comprises the quantitative calculation of polarity intensity for basic emotion words and the quantitative calculation of polarity intensity for compound emotion words; and the quantitative calculation of polarity intensity for compound emotion words covers the following forms:

reduplicated basic emotion words;

basic emotion word + basic emotion word;

negation word + basic emotion word;

degree modifier + basic emotion word;

wherein the form negation word + degree modifier + basic emotion word, or degree modifier + negation word + basic emotion word, is calculated with the following formula:

（5）

wherein the symbols in formula (5) denote, respectively: the emotion value of the basic word; the action coefficient of the degree word, which takes a value from 0.5, 0.7, 0.9, 1.1, 1.3 and 1.5; the degree word reaction coefficient, which is determined by the extreme values of the action coefficient's range, so that its value is 2; the sign of the word's emotion value, which is +1 if the emotion value of the word is greater than 0 and -1 if it is less than 0; and the absolute value of the word's emotion value.

2. The method for fine-grained emotion analysis of text according to claim 1, wherein: the polarity strength quantitative calculation of the basic emotional words comprises the calculation of emotional values of the words, and the following formula is adopted:

（1）

（2）

wherein P_ci is the weight of character ci as a commendatory (positive) word and N_ci is the weight of character ci as a derogatory (negative) word; fp_ci is the frequency with which character ci appears in the commendatory word list, and fn_ci is the frequency with which character ci appears in the derogatory word list; n is the number of all characters appearing in the commendatory word list, and m is the number of all characters appearing in the derogatory word list. Using formula (1) and formula (2), the weight of each character as a commendatory word and as a derogatory word can be calculated. To balance the difference in the number of entries between commendatory and derogatory words in the emotion dictionary, formulas (1) and (2) normalize the frequency with which each character appears in the commendatory and derogatory word lists.

Finally, the emotional tendency value S_ci of character ci can be calculated by formula (3):

（3）

3. The method for fine-grained emotion analysis of text according to claim 2, wherein: the polarity intensity quantitative calculation of the basic emotion words further comprises the emotion value calculation of the basic words, and the following formula is adopted:

（4）

4. The method for fine-grained emotion analysis of text according to claim 1, wherein: the joint identification of the evaluation object attribute and the emotion expression element thereof comprises the following steps: extracting semantic features and constructing a serialization-based joint identification model.

5. The method for fine-grained emotion analysis of text according to claim 4, wherein: the extraction of the semantic features comprises extracting word segmentation information, part of speech tagging information and semantic role information.

6. The method for fine-grained emotion analysis of text according to claim 1, wherein: the fine-grained attribute classification and the emotion calculation thereof comprise attribute classification based on bootstrap learning and fine-grained emotion summary calculation.

7. The method for fine-grained emotion analysis of text according to claim 6, wherein: the fine-grained emotion summary calculation adopts the following formula (6):

（6）