WO2015043075A1

WO2015043075A1 - Microblog-oriented emotional entity search system

Info

Publication number: WO2015043075A1
Application number: PCT/CN2013/088772
Authority: WO
Inventors: 郝志峰; 温雯; 蔡瑞初; 杜慎芝; 陆印章; 程杰
Original assignee: 广东工业大学
Priority date: 2013-09-29
Filing date: 2013-12-06
Publication date: 2015-04-02
Also published as: CN103544242B; CN103544242A; DE112013004082T5

Abstract

The present invention relates to a microblog-oriented emotional entity search system. The system comprises the following six modules: 1) a user interface used for the interaction between the system and a user, so that the user can submit a query request via the module and obtain a feedback result; 2) a query expansion module used for conducting word relationship mining on microblog corpus data and establishing a weighted word relationship diagram in combination with a WordNet ontology base; 3) a query processing module used for converting a query request of a user into a query keyword and a query sentence which can be accepted by an index database, and for conducting query expansion based on the word relationship diagram constructed by module 2); 4) an emotional information mining module used for conducting emotional mining on a microblog corpus and generating a determination rule for an emotional entity and an emotional polarity; 5) an emotional information decision and index establishment module used for determining the emotional entity and emotional polarity of microblog data, establishing an emotional information index and storing same; and 6) an inverted index establishment module used for establishing an inverted index for microblog text information and storing same. The present invention solves the difficult problems of microblog emotional entity extraction, emotional polarity analysis and emotional entity search, thereby providing an intelligent search product for analyzing and monitoring public opinion on a social network.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the field of text emotion mining and information retrieval, and in particular to an emotional entity search system for microblogs; it constitutes an innovation in emotional entity search systems for microblogs.

Background Art

In recent years, with the development of the Internet and social networks, social network data, including microblog data, has been growing exponentially. The growing popularity of microblogs has made the information people can search for more and more abundant, but the massive amount of microblog data also makes it difficult to find the needed information quickly and accurately. At the same time, because microblogs are written freely, extracting emotional information from them is more difficult than from traditional text. In the field of microblog emotional information retrieval, which is of great significance to public opinion monitoring and product research, there is as yet no mature technology or system. An emotional entity search method and system for microblogs mainly involves three key background technologies: query expansion, emotional entity extraction, and emotional polarity discrimination. These three technologies are described and analyzed separately below.

1 Query expansion technology

A traditional retrieval system or search engine that queries directly by keyword can return some related results, but simple matching is mechanical: it cannot truly understand the user's query intention, and the returned results are often unsatisfactory. Finding a method that understands the user's query intention well and improves the precision and recall of the search has therefore become a research focus. Query expansion is one such method. It allows a more accurate understanding of the user's query requirements and helps users get the information they need faster and more accurately. The classic query expansion methods fall into four types: those based on global analysis, on local analysis, on user query logs, and on association rules. In recent years, some scholars have also proposed query expansion methods based on ontologies (or domain ontologies) and semantic networks.

The query expansion method based on global analysis expands queries by mining the relevance of words across the documents of the whole data set or the entire database. Its advantage is that the entire data set is analyzed, so every aspect of the documents is taken into account. Its disadvantage is that, because data sets are usually very large, the analysis places high demands on time and equipment and cannot be completed online. Existing retrieval systems all perform global word analysis offline, which makes this method difficult to use in real-time search engines.

The methods based on local analysis include relevance feedback and pseudo-relevance feedback. In relevance feedback, search results are first obtained through the user's initial query; the user then manually judges whether each result document is relevant or irrelevant, and the documents are placed in two different document sets accordingly. In this way a set of documents marked as relevant is obtained, and only the words of these documents need to be analyzed before the query is expanded. The advantage is that only the relevant documents are processed, so the number of documents is reduced and relevance is improved. The disadvantage is that a large amount of manual feedback is required, which consumes a lot of manpower, and many experiments are still needed for tuning. For this reason existing retrieval systems and search engines rarely use this method.

The pseudo-relevance feedback method analyzes the first n results returned by the user's initial query. Its underlying assumption is that the documents related to the query words appear at the top of the results, that is, these documents are considered the most relevant; expansion words are obtained by analyzing them and are then used to expand the query. Patent application No. CN20091032193.5, entitled "Query Expansion Method and Query Expansion System", is an example of a patent using pseudo-relevance feedback. Its main idea is to cluster the first part of the results returned by the user's initial query, rank the resulting clusters, extract expansion words from a certain number of the top-ranked clusters, add these expansion words to the original query to form an expanded word combination, and then perform a second search. The disadvantage of this method is that it cannot guarantee that the top documents of the initial query are relevant. If they are not, the resulting expansion words may make the results of the second search less relevant and reduce retrieval performance.

The method based on user query logs is an expansion method used by current search engines. It performs word analysis on users' query logs and uses co-occurring words as expansion words. Patent application No. CN200710097501.6, entitled "query expansion method and device and related search term library", and patent application No. CN200810115470.7, entitled "an extended query method, device and search engine system", both analyze the query words entered by users to obtain related words and then use these words as expansion words. This expansion method requires a large number of query logs to begin with, which takes time to accumulate.

The association-rule-based method is a classic data mining method that is often used to mine correlations between items in transactions. In query expansion it can be applied to various resources, such as document sets and query logs, to mine the relevance of the words they contain. Patent application No. CN201010605956.6, entitled "Method and Server for Extending User Search Results", is an example of query expansion using association rules. The patent uses an association rule database to store the established rules. The rules can be created manually, or mined from specific documents using the support-confidence framework, and the resulting rules are saved in the association rule database. When the user enters a query term, the words associated with that term are first looked up in the rule database. The original query words, the related words obtained, and combinations of the two then form new query words, and the database is searched a second time. The disadvantage of this method is that it does not understand a word at the level of meaning but stays at the level of word frequency. Such an expansion cannot understand the user's query intention well.

A query expansion method based on an ontology or semantic network expands words by using or constructing a word network. The semantic network can be an existing one, such as WordNet or HowNet, or one built for the purpose, such as a domain knowledge base or domain ontology. The semantic network or ontology library organizes multi-layered relationships between words, such as collocations, context words, concept words and whole-part words, into a network of words. Patent application No. CN200810116729.X, entitled "a semantic query extension method based on domain knowledge", first constructs a domain knowledge base by analyzing domain knowledge and the features of user sentences, then uses the content of the domain knowledge base to analyze the original query words semantically and obtain a list of semantic items, then obtains expandable items through semantic calculation, and finally adds the expansion items to the query set for a second retrieval from the database. Patent application No. CN20101084725.2, entitled "A text-based query expansion and sorting method in image retrieval", uses WordNet and HowNet to analyze words semantically and obtain semantically expanded words, and describes an algorithm for optimally ranking the results returned by an image retrieval system. Semantic expansion can recognize the user's query intent well, but this kind of expansion does not analyze the database being queried, so retrieval performance is usually limited; moreover, establishing a domain ontology library is laborious and time consuming.

2 Emotional entity extraction technology

An emotional object is the object of an emotional expression, usually a noun or a noun phrase. Analyzing emotional orientation alone, without knowing the emotional object, is usually of little value. The extraction of emotional objects is an important and challenging task in sentiment analysis and opinion mining, and it has attracted the attention of researchers. Although there are many studies on emotional expression and emotional objects, most of them are based on product reviews or news.

Unlike traditional text, microblog data is subject to a word limit and is written freely, so it contains many abbreviations, typos, special symbols (such as emoticons and links), and other forms of expression that differ from traditional norms, all of which increase the difficulty of data analysis. Because domestic research on sentiment analysis and opinion mining started late, because Chinese differs from English, and because the related technologies are immature, there are few studies on the identification of emotional objects in microblogs.

One existing emotional object recognition technique is the patent application filed by Beijing University of Aeronautics and Astronautics, No. CN201210317183.0, entitled "Method of Extracting Views Based on Word Dependence". The method extracts the evaluation object with a matching algorithm based on word dependency chains and does not use other available auxiliary information to improve accuracy. In addition, the method is not suited to the special text of microblogs.

The emotional object extraction commonly described in the existing literature is mainly aimed at product reviews. Because the product information and domain are specified, the problem is specific and clear, so extraction from topic-related text can often achieve good results. However, it does not work well on topic-independent text, mainly because the comments in such text are mixed and the emotional words are diverse. At present there are few emotional object recognition techniques for topic-independent microblogs. Most existing methods obtain paired (emotional word, emotional object) relationships directly through syntactic dependency analysis of the microblog. The recognition performance of this approach is not ideal, and it has the following deficiencies: (1) The extraction process relies too heavily on the sentiment dictionary and on certain syntactic dependencies. On the one hand, dictionary-based judgment is limited and strongly influenced by domain knowledge, so there are many misjudgments; on the other hand, because of the particular way microblogs are written, the relationship between emotional words and emotional objects is not necessarily limited to a specific set of dependencies. (2) In microblogs, emotional words and emotional objects often do not appear in the text as explicit pairs. Only the emotional words express the emotion and its tendency, while the emotional object does not appear explicitly in the sentence. In that case the extraction process cannot extract an emotional object that does not appear directly in the sentence text.

3 Emotional polarity discrimination technology

In terms of granularity, current sentiment analysis systems and techniques mainly perform chapter-level (document-level) and sentence-level sentiment analysis, while the very few entity-level sentiment analysis techniques carry out entity recognition and sentiment analysis as two separate tasks. In terms of the objects analyzed, current systems and techniques focus on social public opinion analysis of news, microblogs and other commentary.

Current chapter-level and sentence-level sentiment analysis techniques mainly include: the patent application of Northwestern Polytechnical University, No. CN200910219161.9, entitled "mixed model-based WEB text emotional theme recognition method"; the patent application of the Institute of Computing Technology, Chinese Academy of Sciences, No. CN200910083522.1, entitled "text sentiment analysis method"; the patent application of the Institute of Automation, Chinese Academy of Sciences, No. CN201210088366.X, entitled "A sentiment analysis method for microblog short text"; and patent application No. CN201010157784.0 by Fujitsu Co., Ltd., entitled "Emotional Tendency Analysis Method and Apparatus".

The above-mentioned sentiment analysis techniques mainly include two steps, training and emotional judgment. These steps are illustrated below using the "hybrid model-based WEB text sentiment theme recognition method" of Northwestern Polytechnical University as an example; the other related techniques are basically similar. The method mainly includes the following steps: 1. Manually label the texts in the training set and estimate two types of emotional models, a "commendatory" model and a "derogatory" model; at the same time, estimate a language model for each topic according to the language expression of texts on different subjects; 2. Use maximum likelihood estimation (MLE) to estimate the parameters of the emotional models and topic models established in step 1; 3. For the text to be processed, calculate the distance between its language model and the two emotional models, so as to judge the sentiment orientation and the topic of the text.

Current sentiment orientation techniques mainly focus on the chapter level and sentence level. Machine-learning-based methods are popular, while sentiment analysis techniques based on emotional drop points are rare. The existing sentiment analysis techniques based on sentiment words have three main shortcomings: A) The extraction of sentiment phrases does not consider modification by adverbs, although adverbs generally qualify the degree of emotional words such as adjectives; ignoring them easily causes deviations in emotional intensity. B) Negation words are hard to recognize and handle: the usual approach is to search for negation words, which makes it difficult to determine what is being negated. C) Some automatically generated dictionaries of emotional word intensity are unreliable, because intensity is a basic attribute of an emotional word and is determined mainly by its connotation.

SUMMARY OF THE INVENTION

The object of the present invention is to overcome the above-mentioned deficiencies of existing emotional entity search technology and to propose a microblog-oriented emotional entity search system that improves the accuracy of emotional polarity determination. The present invention is implemented by the following technical solution: an emotional entity search system for microblogs, including the following six modules:

1) a user interface, used for interaction between the system and the user, through which the user can submit a query request and obtain a feedback result;

2) a query expansion module, used for word relationship mining on the microblog corpus data and for establishing a weighted word relationship diagram in combination with the WordNet ontology library;

3) a query processing module, configured to convert the user query request into a query keyword and a query statement acceptable by the index library, and perform query expansion based on the word relationship diagram constructed by the module 2);

4) An emotional information mining module, which is used for emotional mining of the microblog corpus, and generates a determination rule of the emotional entity and the emotional polarity;

5) an emotional information determination and indexing module for determining the emotional entity and emotional polarity of the microblog data, establishing an emotional information index, and storing;

6) An inverted index building module is configured to create an inverted index for the microblog text information and store it.

The following steps are used in module 2) above to implement query expansion:

11) mining association rules from the data in the microblog corpus, and outputting the related word sets obtained from the mined rules;

12) combining the frequent items obtained in step 11) with the WordNet ontology library to construct a weighted word relationship diagram.

In the above step 11), the Eclat algorithm is used to mine the frequent itemsets of the microblog corpus and generate related word sets, and the related word sets and the WordNet ontology graph are formed into a weighted word relationship diagram by mapping or insertion.

When constructing the weighted word relationship diagram above, the node weight is calculated as:

$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d),$$

where $\deg(d)$, $\deg^{+}(d)$ and $\deg^{-}(d)$ represent the degree, out-degree, and in-degree of the node $d$, respectively. The edge weight is calculated as:

$$g(d_i \to d_j) = \begin{cases} 1, & \text{if } d_i, d_j \text{ are nodes of the original WordNet graph} \\ \mathrm{lift}(d_i \to d_j), & \text{if the nodes are formed only by rule words} \\ \mathrm{lift}(d_i \to d_j) + 1, & \text{if the nodes are both WordNet graph nodes and nodes formed by rule words} \end{cases}$$

The following steps are used in the above module 3) to implement the query processing:

31) receiving a query word or statement entered by the user;

32) performing word segmentation, stop-word removal, and central word determination on the user's input to obtain one or more central words;

33) selecting appropriate expansion words from the weighted word relationship library constructed from the ontology and the rule words, and calculating the weight of each expansion word;

34) adding the top p words by weight to the query word set, and passing the expanded word set to the query interface.

The above step 33) uses the following method to calculate the weight of an expansion word:

Suppose the original query is $Q = (q_1, q_2, \ldots, q_m)$, where the item $q_i$ has nearest neighbors $D = (d_1, \ldots, d_n)$. The correlation between the original query term $q_i$ and a nearest neighbor $d_j$ is calculated by

$$W(q_i, d_j) = \ldots \left\{ g(q_i \to d_j) \times \log_2\left[ f(d_j) + 1 \right] \right\}$$

where $W(q_i, d_j)$ is the relevance of the word $q_i$ to the word $d_j$, $g(q_i \to d_j)$ is the weight of the edge between the two words, and $f(d_j)$ is the degree of the word $d_j$. The weight of each nearest word over the whole query is calculated as

$$W(d_k) = \sum_{i=1}^{m} W(q_i, d_k) / m$$

The following steps are used in module 4) above to implement the identification and determination of emotional entities:

41) collecting representative microblog data;

42) pre-processing the collected microblog data, including cleaning, transforming, segmenting, word segmentation, part-of-speech tagging, and syntactic parsing;

43) performing feature extraction on the microblog data and expressing it as a feature vector;

44) training the emotional entity recognition model to obtain model parameters;

45) Output the emotional entity decision model and store it.

In the above step 43), feature extraction is implemented as follows: taking the word context into account, a custom dictionary including global features is designed, features of the microblog data are extracted according to the custom dictionary, and the microblog data is converted into the input data format processed by the emotional entity recognition model.

In the above step 44), the following method is used to realize the emotional entity recognition model: the global feature node is introduced in the conditional random field (CRF) model, the GLCRF model combined with the global feature is established, and the model parameters are obtained by training using the L-BFGS algorithm.

The following steps are used in the above module 5) to determine the emotional polarity of the microblog:

51) microblog data noise removal and semantic form transformation;

52) word segmentation, part-of-speech tagging and Chinese syntactic analysis;

53) extracting emotional phrases in combination with an emotional dictionary;

54) emotional phrase filtering;

55) Emotional polarity determination and result output.

In the above step 53), the sentiPY method is used to extract the emotional phrases, and the form of an emotional phrase is uniformly expressed as phrase = modifier* sentiment, that is, a phrase includes one central emotional word, to which zero or more modifying adverbs may be attached.
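The phrase form "modifier* sentiment" can be illustrated with a minimal sketch. This is not the sentiPY method itself, whose details are not given here; it only shows how a run of modifying adverbs followed by a central emotional word could be collected from a token sequence. The lexicons `SENTIMENT_WORDS` and `MODIFIERS` and the function `extract_phrases` are hypothetical names introduced for the illustration.

```python
from typing import List, Tuple

# Hypothetical toy lexicons (assumptions for illustration only);
# a real system would load a full emotional dictionary.
SENTIMENT_WORDS = {"good", "bad", "happy"}
MODIFIERS = {"very", "not", "extremely"}


def extract_phrases(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Return (modifiers, sentiment_word) pairs matching `modifier* sentiment`."""
    phrases = []
    pending: List[str] = []  # modifiers seen since the last non-modifier token
    for tok in tokens:
        if tok in MODIFIERS:
            pending.append(tok)
        elif tok in SENTIMENT_WORDS:
            phrases.append((pending, tok))
            pending = []
        else:
            pending = []  # the modifier run is broken by an unrelated token
    return phrases
```

For example, the tokens of "the food is not very good" would yield one phrase whose modifiers are "not" and "very" and whose central emotional word is "good".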

In the above step 55), the emotional polarity of the microblog is determined by a mixed decision algorithm based on emotional drop points, and the determination process includes the following steps:

551) Determine whether there is a generalizing (summarizing) word in the sentence. If not, go to step 552); if so, use the statement after the generalizing word as the emotional drop point and output the polarity of that drop point as the emotional polarity of the microblog;

552) Use the first sentence and the last sentence of the microblog as emotional drop points and compare their emotional polarities. If the emotional polarities of the two sentences cancel each other, go to step 553); otherwise, output the stronger of the two as the emotional polarity of the microblog;

553) Calculate the intensity of the emotional words of the entire microblog, sum and average them, and output the average intensity as the emotional polarity of the microblog.

The invention is directed to a query expansion scheme for microblog emotional entity search, characterized in that word relationship mining is performed on microblog corpus data, a weighted word relationship diagram is established using the WordNet ontology library, and query expansion is performed according to the constructed word relationship diagram, so as to better understand the user's query intent. The invention solves the problem of effectively combining the semantic ontology with corpus word relationships in query expansion, can better understand the purpose of the user's query, and converts the query statement into more suitable query expansion words. In terms of emotional entity extraction and emotional polarity analysis, it solves the problem of extracting emotional objects from, and judging the emotional polarity of, freely written text such as microblogs; it solves the problem of entity extraction when the emotional object is implicit, optimizes the extraction of emotional entities, and improves the accuracy of emotional polarity judgment. It provides a solution for network public opinion monitoring and product public opinion analysis. The invention solves difficult problems such as microblog emotional entity extraction, emotional polarity analysis and emotional entity search, and provides an intelligent search product for social network public opinion analysis and monitoring.
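The decision order of steps 551) to 553) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the intensity lexicon `LEXICON`, the marker list `SUMMARY_WORDS`, whitespace tokenization, and the reading of "cancel each other" as "the two sentence scores sum to zero" are all assumptions made for the example.

```python
from typing import List

# Hypothetical signed intensity lexicon and summarizing markers (assumptions).
LEXICON = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}
SUMMARY_WORDS = ("overall", "anyway")


def word_scores(text: str) -> List[float]:
    """Signed intensities of the emotional words found in the text."""
    return [LEXICON[w] for w in text.lower().split() if w in LEXICON]


def sentence_score(text: str) -> float:
    return sum(word_scores(text))


def microblog_polarity(sentences: List[str]) -> float:
    # Step 551: a summarizing word makes the text after it the drop point.
    for sentence in sentences:
        words = sentence.lower().split()
        for marker in SUMMARY_WORDS:
            if marker in words:
                tail = " ".join(words[words.index(marker) + 1:])
                return sentence_score(tail)
    # Step 552: compare the first and last sentences.
    first = sentence_score(sentences[0])
    last = sentence_score(sentences[-1])
    if first + last != 0:  # assumption: "cancel" means the scores sum to zero
        return first if abs(first) > abs(last) else last
    # Step 553: average the intensity of all emotional words.
    scores = [s for sentence in sentences for s in word_scores(sentence)]
    return sum(scores) / len(scores) if scores else 0.0
```

The sign of the returned value stands for the polarity and its magnitude for the intensity; a real system would score sentences from the extracted emotional phrases rather than from single words.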

DRAWINGS

Figure 1 is an overall structural view of the present invention;

Figure 2 is a flow chart showing the implementation of the present invention;

Figure 3 is a structural diagram of the system construction of the present invention;

Figure 4 is a flow chart of an emotional polarity analysis method of the present invention;

Figure 5 is an example of a graph structure based on an adjacency relationship in emotional intensity optimization;

Figure 6 is a flow chart of the emotional drop algorithm;

Figure 7 is a flow chart of the microblog emotional object extraction work;

Figure 8 is a flow chart of data preprocessing;

Figure 9 is a schematic diagram of the implementation of the emotional object model training;

Figure 10 is a diagram structure of the GLCRF model;

Figure 11 shows the model diagram structure after the GLCRF model extends multiple global nodes.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be further described with reference to the drawings, but the implementation of the present invention is not limited thereto.

Figure 1 shows the overall structure of the present invention. The emotional entity search system for microblogs comprises: a user interface module, through which the user can submit a query request and obtain a feedback result; a query expansion module, which performs word relationship mining on the microblog corpus data and establishes a weighted word relationship diagram in combination with the WordNet ontology library; a query processing module, which converts the user's query request into query keywords and query statements acceptable to the index library and performs query expansion based on the word relationship diagram constructed by the query expansion module; an emotional information mining module, which performs emotional mining on the microblog corpus and generates determination rules for emotional entities and emotional polarity; an emotional information determination and index establishment module, which determines the emotional entity and emotional polarity of the microblog data, establishes an emotional information index and stores it; and an inverted index establishment module, which creates an inverted index for the microblog text information and stores it.

Figure 2 is a flow chart showing the operation of the query processing module of the present invention.

Referring to FIG. 2, the process includes the following steps:

1. The query interface receives a query word or sentence input by a user;

2. Query processing performs word segmentation, stop-word removal, and central word determination to obtain one or more central words; a central word can be a keyword, a modifier, and the like;

3. For each central word, candidate expansion words are selected from the weighted word relationship library constructed from the ontology and the rule words; the selected words are at distance 1, that is, they are the nearest neighbors of the central word;

4. Because step 3 may produce many expansion words, the weight of each word is calculated to measure its importance, and the top p words by weight are added to the query word set;

5. Step 4 yields the required expansion words, but a mechanism is introduced that lets the user see these expansion words and operate on them, that is, modify the expanded query word set, so that the expansion words are consistent with the user's query intent;

6. The expanded word set is returned to the query entry, and an expanded search is performed on the rich media database;

7. The results of the search are returned to the user.
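Steps 3 and 4 above can be sketched as follows. This is an illustration under stated assumptions rather than the method as claimed: the graph is given as a dictionary of directed edges with precomputed weights g, nearest neighbors are taken to be the out-neighbors of each central word, and the per-pair score uses only the legible part of the weight formula, g × log2(f + 1), with any leading factor treated as 1. The names `node_degrees` and `expand_query` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source word, target word)


def node_degrees(edges: Dict[Edge, float]) -> Dict[str, int]:
    """f(d): out-degree plus in-degree of every node."""
    degree: Dict[str, int] = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1  # contributes to the out-degree of src
        degree[dst] += 1  # contributes to the in-degree of dst
    return dict(degree)


def expand_query(query: List[str], edges: Dict[Edge, float],
                 p: int) -> List[Tuple[str, float]]:
    """Return the top-p distance-1 neighbors of the query words by weight."""
    degree = node_degrees(edges)
    totals: Dict[str, float] = defaultdict(float)
    for (src, dst), g in edges.items():
        if src in query and dst not in query:
            # Assumed pair score: edge weight times log2(degree + 1).
            totals[dst] += g * math.log2(degree[dst] + 1)
    m = len(query)
    weights = {word: total / m for word, total in totals.items()}  # W(d_k)
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:p]
```

Averaging over m, the number of query words, favors neighbors that are linked to several central words at once, which is the apparent purpose of the summation in the weight formula.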

Figure 3 shows the integration details of the query processing and query extension modules of the present invention.

Referring to FIG. 3, the query processing and query expansion of the present invention comprise two parts, a background information processing process and a retrieval process, and can also be divided into a microblog information extraction module, an indexing module, a word relationship diagram module, a user retrieval module, and an administrator operation and user operation module. The microblog information extraction module organizes the initial microblog data and performs appropriate cleaning, sentence splitting, word segmentation and syntactic analysis. The indexing module mainly establishes an inverted index over the microblog data set for quick retrieval. We use Lucene to build the inverted index. Lucene is an open-source full-text search engine library that provides a complete query engine and indexing engine and supports Boolean operations, fuzzy queries, grouped queries, and more. It is used to build the inverted index and save it.
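To make the idea of an inverted index concrete, the following is a minimal in-memory sketch. It does not use Lucene and does not reflect Lucene's API or storage format; the functions `build_inverted_index` and `search_all` are hypothetical, and whitespace tokenization stands in for the word segmentation step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search_all(index: Dict[str, Set[int]], terms: Iterable[str]) -> List[int]:
    """Boolean AND query: ids of documents containing every term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

Looking up a term is a single dictionary access followed by set intersection, which is why an inverted index supports quick retrieval over a large microblog data set.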

Building the word relationship library is the core part of the invention and one of its innovations. This part is divided into a word segmentation process, an Eclat association rule mining process, a rule word generation process, and a process of generating a weighted word relationship diagram in combination with WordNet. The word segmentation process divides a text resource into words. We use the ICTCLAS software, which segments Chinese with high accuracy; it is a system developed by the Chinese Academy of Sciences specifically for Chinese word segmentation. We first segment the documents of the data set one by one, and then combine the documents of the various classes to form the document set used for mining association rules. For the mining itself we use the Eclat algorithm, a depth-first algorithm with high mining efficiency. For large document sets, the related words can be mined in parts and the results merged at the end. The present invention uses a support-interest association rule framework, which uses two evaluation formulas:

(1) Support formula:

$$\mathrm{supp}(X \to Y) = \frac{|X \cup Y|}{|D|}$$

(2) Interest degree formula:

$$\mathrm{lift}(X \to Y) = \frac{\mathrm{supp}(X \to Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$$

where $|X \cup Y|$ is the number of transactions containing both $X$ and $Y$, and $|D|$ is the total number of transactions in the database; $\mathrm{supp}(X \to Y)$ is the percentage of transactions in the database containing both $X$ and $Y$, and $\mathrm{supp}(X)$ and $\mathrm{supp}(Y)$ are the percentages of transactions containing $X$ and containing $Y$, respectively.

In the mining process, different support thresholds are set for different document sets, and a mined frequent itemset produces an association rule only when its interest degree is greater than 1, because the present invention considers two words to be positively correlated whenever their interest degree is greater than 1. The mining process also introduces the concept of compound words: when the interest value of two words is greater than 4, the antecedent and consequent words of the rule are combined into a compound word, which forms a new rule with the antecedent word and with the consequent word respectively. The interest value of each new rule is the same as that of the original rule, so that the compound word can also be selected as an expansion word. After the related words are mined, the rule words are generated and saved. The storage format is "X." At this point, the mining and analysis of the rule words is complete.
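The mining step can be sketched as a small depth-first Eclat over transaction-id sets, followed by the interest filter and the compound-word rule. This is an illustrative sketch, not the implementation of the invention: the names `eclat` and `mine_rules`, the alphabetical ordering of each word pair, the direction of the compound-word rules, and the sample data are all assumptions.

```python
def eclat(transactions, min_count):
    """Depth-first Eclat over tidsets; returns {frozenset(words): support count}."""
    tidsets = {}
    for tid, words in enumerate(transactions):
        for word in words:
            tidsets.setdefault(word, set()).add(tid)
    frequent = {}

    def extend(prefix, candidates):
        for i, (word, tids) in enumerate(candidates):
            itemset = prefix | {word}
            frequent[itemset] = len(tids)
            # Intersect tidsets with the remaining words to go one level deeper.
            narrower = [(other, tids & other_tids)
                        for other, other_tids in candidates[i + 1:]
                        if len(tids & other_tids) >= min_count]
            if narrower:
                extend(itemset, narrower)

    start = sorted(((w, t) for w, t in tidsets.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    extend(frozenset(), start)
    return frequent

def mine_rules(transactions, min_count, min_lift=1.0, compound_lift=4.0):
    """Keep word pairs with lift > 1; add compound-word rules when lift > 4."""
    n = len(transactions)
    frequent = eclat(transactions, min_count)
    rules = {}
    for itemset, count in frequent.items():
        if len(itemset) != 2:
            continue
        x, y = sorted(itemset)
        # Same value as supp(XY) / (supp(X) * supp(Y)), computed from counts.
        value = count * n / (frequent[frozenset([x])] * frequent[frozenset([y])])
        if value > min_lift:
            rules[(x, y)] = value
        if value > compound_lift:
            compound = f"{x} {y}"          # assumed way of joining the two words
            rules[(compound, x)] = value   # new rules inherit the original lift
            rules[(compound, y)] = value
    return rules

transactions = [
    {"intellectual", "property"}, {"intellectual", "property", "law"},
    {"law", "court"}, {"law", "court"}, {"phone", "screen"},
    {"phone", "screen"}, {"phone", "battery"}, {"phone"},
    {"battery"}, {"court"},
]
rules = mine_rules(transactions, min_count=2)
```

With this data, "intellectual" and "property" reach an interest value of 5, so the compound word "intellectual property" is generated alongside the pair itself, while "phone" and "battery" co-occur only once and fall below the support threshold.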

The remaining step is to combine these rule words with the WordNet ontology library into a weighted word relationship graph. WordNet is a vocabulary-based semantic network. It not only organizes vocabulary into concepts, but also defines various semantic relationships between concepts and words (synonyms, hypernyms/hyponyms, antonyms, whole-part relations, entailment, etc.). The relationships between words form a directed graph (as in the example of Figure 3). In this process, the rule words are mapped onto or added to the WordNet ontology library in a certain order. The construction principle of the weighted word relationship graph is: add a directed edge from the node of a rule's antecedent word to the node of its consequent word. The addition of rule words is completely automated and falls into two cases: first, if the word exists in the original WordNet ontology graph, simply map the word onto the graph and then update the node data; second, if the word does not exist in the original WordNet ontology graph, add the word first, then add the edge and update the data. All node data are computed one by one after the graph is completed. The resulting graph can be represented by a four-tuple G = <V, E, f, g>, where V is the set of nodes, E is the set of edges, f is a function from V to the non-negative real numbers that gives the degree of a node, and g is a function from E to the non-negative real numbers that gives the weight of the edge between two nodes. Let deg(d) denote the degree of node d (that is, the sum of its out-degree and in-degree) and let lift(d_i → d_j) denote the interest value between nodes d_i and d_j. Then:

$$f(d) = \deg(d) \qquad (1)$$

$$g(d_i, d_j) = \begin{cases} 1, & \text{if } d_i, d_j \text{ are nodes of the original WordNet graph} \\ \mathrm{lift}(d_i \rightarrow d_j), & \text{if } d_i, d_j \text{ are nodes formed only by rule words} \\ \mathrm{lift}(d_i \rightarrow d_j) + 1, & \text{if } d_i, d_j \text{ are both WordNet nodes and nodes formed by rule words} \end{cases} \qquad (2)$$

In the weighted word relationship graph (as in the example of Figure 4), the importance of a word in the whole graph is measured by the degree of the node where the word is located, that is, the sum of the node's out-degree and in-degree (the integer value next to the node in Figure 4). The value of an edge is its weight: the weight between ontology words of the original WordNet graph is set to 1 (the blue edges in Figure 4), and the weight between words inserted by a rule is set to the interest value of the two words (the blue edges in Figure 4). If two words are related both in WordNet and as rule words, the weight is the interest value plus 1. The words indicated by the black edges in Figure 4 are compound words (such as "intellectual property"), whose weight is the same as that of the two rule words. At this point, the construction of the weighted word relationship graph is complete.
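The construction of the node function f and the edge function g can be sketched as follows. This is a minimal illustration, assuming the ontology relations are available as (source, target) word pairs and the mined rules as a dictionary of interest values; the function name `build_weighted_graph` and the sample edges are hypothetical.

```python
from collections import defaultdict

def build_weighted_graph(wordnet_edges, rule_lifts):
    """Return (g, f): edge weights and node degrees of the merged graph.

    wordnet_edges: iterable of (source, target) pairs from the ontology.
    rule_lifts: {(antecedent, consequent): interest value} from rule mining.
    """
    g = {}
    for edge in wordnet_edges:
        g[edge] = 1.0                      # edge known only from the ontology
    for edge, value in rule_lifts.items():
        # Edge in both sources: interest value plus 1; rule-only edge: interest value.
        g[edge] = value + 1.0 if edge in g else value
    f = defaultdict(int)
    for source, target in g:
        f[source] += 1                     # contributes to the out-degree
        f[target] += 1                     # contributes to the in-degree
    return g, dict(f)

# Hypothetical inputs.
wordnet_edges = [("phone", "device"), ("phone", "screen")]
rule_lifts = {("phone", "screen"): 2.5, ("screen", "clear"): 1.5}
g, f = build_weighted_graph(wordnet_edges, rule_lifts)
```

The degrees are counted once, after all edges have been merged, which mirrors the statement that node data are computed after the graph is completed.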

The user retrieval module comprises query input, query analysis, expansion word matching, generation of the expanded query word set, index retrieval, and result processing and display. Query input receives the query words or sentences that the user enters in the query interface. Query analysis performs word segmentation, stop-word removal and central word determination on the user input to obtain one or more central words. Expansion word matching feeds the central words from the previous step into the weighted word relationship library to select suitable expansion words: the words closest to the original query word in the graph (that is, words at distance 1) are selected as candidate expansion words. Generation of the expanded query word set computes the weight of each candidate word from its correlation with the original query words and selects the top p words as the final expansion words. The invention provides a formula for calculating the weight of each word based on the structure of the weighted word relationship graph: the larger the weight of the edge between two nodes, the greater the correlation between them; and the larger the degree of a node, the greater the importance of that node.

Suppose the original query is Q = (q_1, q_2, ..., q_m), where the term q_i has the nearest-neighbor words d_i = (d_{i1}, d_{i2}, ..., d_{in_i}).

Then the correlation between the original query term q_i and its nearest-neighbor term d_{ij} is calculated by

$$w(q_i, d_{ij}) = \log_2\{[g(q_i, d_{ij}) + 1] \times \log_2[f(d_{ij}) + 1]\}$$

where w(q_i, d_{ij}) is the correlation of the word q_i with the word d_{ij}, g(q_i, d_{ij}) is the weight of the edge between the two words, and f(d_{ij}) is the degree of the word d_{ij}. The weight of each nearest-neighbor word is calculated as

$$W(d_k) = \sum_{i=1}^{m} w(q_i, d_k) / m$$

where W(d_k) is the weight of the word d_k and m is the number of original query words. After the weight of each candidate expansion word has been calculated, the candidates are arranged in descending order of weight, and the top p words are added to the original query to form the expanded word set, in which the weight of each original query term is 1.

The previous step yields the expanded word set, in the following form:

$$Q = (q_1, q_2, \ldots, q_m, d_1, d_2, \ldots, d_p) \qquad (4)$$

The retrieval process passes the expanded word set back to the query entry and performs an expanded search against the rich media database. The result processing and display process returns the sorted search results and displays them to the user.
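The selection of the top p expansion words can be sketched as follows. The per-pair relevance function is passed in as a parameter; the default `default_relevance` is an assumed form that grows with both the edge weight and the node degree, and should be replaced by whatever correlation formula is actually adopted. Treating only outgoing edges as distance-1 neighbors is likewise an assumption, as are the names `expand_query`, `g` and `f` and the sample graph.

```python
import math

def default_relevance(edge_weight, degree):
    """Assumed relevance of a neighbor: grows with edge weight and node degree."""
    return math.log2((edge_weight + 1) * math.log2(degree + 1))

def expand_query(query_terms, g, f, p, relevance=default_relevance):
    """Return the original terms (weight 1) plus the p best distance-1 words.

    g: {(source, target): edge weight}; f: {word: node degree}.
    """
    m = len(query_terms)
    totals = {}
    for q in query_terms:
        for (source, target), weight in g.items():
            if source == q and target not in query_terms:
                totals[target] = totals.get(target, 0.0) + relevance(weight, f[target])
    # Rank by summed relevance; ties are broken alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:p]
    expanded = {q: 1.0 for q in query_terms}
    expanded.update({word: total / m for word, total in ranked})  # W(d_k)
    return expanded

# Hypothetical weighted graph.
g = {("phone", "screen"): 3.5, ("phone", "device"): 1.0, ("screen", "clear"): 1.5}
f = {"phone": 2, "device": 1, "screen": 2, "clear": 1}
expanded = expand_query(["phone"], g, f, p=1)
```

With p = 1 only "screen" is added, because its edge from "phone" carries more weight and its node has a higher degree than "device".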

FIG. 4 is a flow chart of the emotional polarity analysis method proposed by the present invention.

(1) Noise removal and semantic form conversion of the commentary corpus:

Noise removal mainly removes interfering clauses from the commentary corpus, such as clauses in the subjunctive mood. These clauses are not real, objective evaluations and would interfere with the later stages of the analysis. Emoji are replaced with the corresponding text, which converts the semantic form into one that is easier to process.

(2) Natural language processing: the Stanford NLP software is mainly used to perform word segmentation, part-of-speech tagging and Chinese grammar analysis on the commentary corpus.

(3) Combine emotional dictionary to extract emotional phrases:

Because the part-of-speech tags of the emotional words in the commentary corpus are concentrated on a few tags, we combine these part-of-speech tags with the sentiment lexicon to extract emotional phrases. Emotional phrases are extracted using the sentiPY method we developed, and the form of emotional phrases in this system is unified as:

phrase: modifier* sentiment

that is, a phrase includes a central emotional word, possibly with multiple modifying adverbs.
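The unified phrase form can be illustrated with a short scan over part-of-speech-tagged tokens. This sketch is not the sentiPY method itself, whose details are not given here; it assumes that modifiers directly precede the sentiment word and that adverbs carry the tag "AD", and the function name `extract_phrases` and the sample lexicon are hypothetical.

```python
def extract_phrases(tagged_tokens, sentiment_lexicon, modifier_tags=("AD",)):
    """Collect `modifier* sentiment` phrases from (word, POS tag) pairs."""
    phrases, modifiers = [], []
    for word, tag in tagged_tokens:
        if word in sentiment_lexicon:
            phrases.append(modifiers + [word])  # zero or more adverbs, then the head
            modifiers = []
        elif tag in modifier_tags:
            modifiers.append(word)              # keep collecting adverbs
        else:
            modifiers = []                      # any other word breaks the run
    return phrases

# Hypothetical tagged sentence and lexicon.
tokens = [("screen", "NN"), ("very", "AD"), ("clear", "VA"),
          ("battery", "NN"), ("not", "AD"), ("too", "AD"), ("good", "VA"),
          ("bad", "VA")]
lexicon = {"clear", "good", "bad"}
phrases = extract_phrases(tokens, lexicon)
```

Keeping negators such as "not" inside the phrase matters for the later polarity decision, since they can reverse the polarity of the central word.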

(4) Emotional phrase filtering: Filter the coarse-grained emotional phrases extracted in step 3 to make the form of the emotional phrase more pure, which can improve the accuracy of the final polarity classification.

(5) Sentiment analysis and output of results:

We have designed a hybrid decision algorithm based on the emotional landing point, which can effectively analyze corpora from different fields.

Figure 5 is an example of a graph structure based on adjacency relationships, used in emotional intensity optimization. Referring to Figure 5, the sentiment words in the commentary corpus are regarded as nodes in the graph, and a propagation-based algorithm calculates the emotional strength in context. Based on the sentiment dictionary, the relationships between the sentiment words are extracted and the weights of the sentiment nodes are calculated by NGD, thus forming a directed graph. Figure 3 shows the structure of a comment.

Figure 6 is a flow chart of the emotional landing point algorithm. Referring to Figure 4, the goal of this step is to find the emotional landing point of a comment. The emotional landing point is the emotion that the author mainly wants to express in a comment. To find it, we mainly rely on general (summarizing) words (such as "overall"), a comparison of the emotional intensity at the beginning and at the end, and the strongest emotional phrase in the sentence.
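The decision order spelled out in steps 551) to 553) of claim 10 can be sketched as follows. The sketch assumes each clause already carries a signed emotional intensity produced by an upstream scorer, that the clause containing a general word serves as the landing point, and that two polarities "cancel out" when their sum is zero; the name `landing_point_polarity` and the list of general words are hypothetical.

```python
def landing_point_polarity(clauses, general_words=("overall", "in short")):
    """clauses: non-empty list of (text, signed intensity); returns a signed score."""
    # Step 551: a general (summarizing) word marks the landing point.
    for text, score in clauses:
        if any(word in text.lower() for word in general_words):
            return score
    # Step 552: otherwise compare the beginning and the end of the microblog.
    first, last = clauses[0][1], clauses[-1][1]
    if first + last != 0:
        return first if abs(first) > abs(last) else last
    # Step 553: the two ends cancel out, so average over the whole microblog.
    return sum(score for _, score in clauses) / len(clauses)

with_summary = [("the screen is great", 2), ("overall it is disappointing", -1)]
ends_differ = [("love the camera", 3), ("battery is so-so", -1)]
ends_cancel = [("nice design", 2), ("fine speaker", 1), ("awful battery", -2)]
results = [landing_point_polarity(c) for c in (with_summary, ends_differ, ends_cancel)]
```

The first example is decided by the general word, the second by the stronger of the two ends, and the third falls through to the average because the two ends cancel out.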

FIG. 7 shows a workflow diagram of the present invention for microblog emotional entity extraction.

Referring to FIG. 1, the emotional entity extraction of the present invention includes the steps of microblog data collection, data preprocessing, feature extraction, dictionary loading, marking and correction, model training, and emotion object extraction. The microblog data crawled from the Internet by the microblog data collection step is saved in the form of files. The emotional object extraction model obtained by model training is also saved for use in object extraction. The results obtained by emotion object extraction are saved in the form of files so that users can view and correct the prediction results.

Microblog data collection is used to crawl microblog data from microblogging systems on the Internet (such as Sina Weibo, Twitter and Tencent Weibo), and to save the collected raw microblog data in a certain organized form, providing data support for the later processing of the system.

Data preprocessing performs some pre-processing on the original microblog data to facilitate later feature extraction. The module includes data cleaning, data conversion, sentence splitting, word segmentation, part-of-speech tagging, and syntactic parsing. The details are shown in Figure 2.

Dictionary loading is used to load the dictionaries required by the data preprocessing and feature extraction steps, including the sentiment dictionary, the stop word dictionary, and the common network term dictionary.

Feature extraction uses the dictionary data loaded by the dictionary loading module to extract the pre-defined features from the preprocessed data, vectorizing the text and converting it into a format that the object extraction module can process.

Emotional object model training is used to train the emotional object extraction model at the core of the system. The training data, converted into the required format, is obtained from the marking and correction module, and the CRF model constructed from the training data is trained using the L-BFGS algorithm. The CRF model used in the present invention evolved from the linear CRF (linear conditional random field) model, and applies the CRF (conditional random field) model to the field of emotional object recognition for the first time. By adding global variables to the traditional CRF model, it becomes possible to recognize emotional objects that do not appear in the label sequence.

Emotion object extraction is used to extract emotion objects from the microblog data. This step mainly uses the model trained by the model training module to perform prediction and thereby extract the objects.

Marking and correction: the CRF model used in the present invention is a supervised statistical learning method, so the data needs to be labeled. At the same time, a feedback mechanism is introduced to learn from error analysis information. Existing methods generally do not deal with misclassified results, but this feedback contains a lot of useful information, and making full use of it is the key to enabling the system to learn by itself. The feedback mechanism enables the model to re-learn from the results of the error analysis, making the system more accurate.

FIG. 8 is a schematic diagram showing the implementation of the data preprocessing step of the present invention. The data preprocessing step includes the following steps:

(1) Data cleaning step: reads the original microblog data collected by the data acquisition module and performs the data cleaning process of data preprocessing, filtering out empty and invalid dirty microblog data.

(2) Data conversion step: processes the data passed on from step (1) and transforms some contents of the microblog data to facilitate steps (3), (4), (5) and (6). There are several common situations: (a) microblogs often contain some information that is invalid for this work, so it needs to be removed; (b) some links that are useless for this work (such as image links and web links) and special strings need to be removed; (c) microblogs often contain topics marked with the "#" symbol and contacts marked with the "@" symbol; topics and contacts at the head and tail of a microblog are deleted directly, while inside a microblog sentence only the "#" and "@" symbols are deleted; (d) microblogs often contain emoji, which carry a strong emotional inclination and are helpful to this work, but these symbols affect the accuracy of word segmentation, part-of-speech tagging (POS tagging) and syntactic parsing, so they need to be extracted in this process; (e) some network terms in microblogs need to be converted, for example converting the network expression "V5" into a normative expression, which also helps to improve the accuracy of word segmentation, part-of-speech tagging (POS tagging) and syntactic parsing.
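Situations (b) to (e) of the data conversion step can be sketched with a few regular expressions. This is an illustration under stated assumptions: topics are written as "#topic#", contacts as "@name", emoticons as bracketed text such as "[good]", and the slang dictionary entry mapping "V5" to "awesome" is a hypothetical stand-in for the normative expression. The function name `convert` is also illustrative.

```python
import re

URL = re.compile(r"https?://\S+")
EMOTICON = re.compile(r"\[[^\[\]]+\]")                   # assumed form, e.g. "[good]"
LEADING_TAG = re.compile(r"^\s*(?:#[^#]+#|@\w+)\s*")
TRAILING_TAG = re.compile(r"\s*(?:#[^#]+#|@\w+)\s*$")

def convert(text, slang=None):
    """Return (converted text, extracted emoticons) for one microblog."""
    emoticons = EMOTICON.findall(text)                   # (d) set emoticons aside
    text = EMOTICON.sub(" ", text)
    text = URL.sub(" ", text).strip()                    # (b) drop links
    while LEADING_TAG.search(text):                      # (c) head topics and contacts
        text = LEADING_TAG.sub("", text, count=1)
    while TRAILING_TAG.search(text):                     # (c) tail topics and contacts
        text = TRAILING_TAG.sub("", text, count=1)
    text = text.replace("#", "").replace("@", "")        # (c) inside: symbols only
    for informal, formal in (slang or {}).items():       # (e) normalize network terms
        text = text.replace(informal, formal)
    return " ".join(text.split()), emoticons

raw = "#review# The #phone# screen is V5 [good] http://t.cn/abc @alice"
cleaned, emoticons = convert(raw, slang={"V5": "awesome"})
```

The emoticons are returned separately rather than discarded, so that their emotional inclination remains available to later steps while the text passed to segmentation and parsing stays clean.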

(3) Microblog sentence splitting step: the conditional random field model of the emotion object recognition method of the present invention is a sequence labeler constructed at the sentence level for information extraction, but a microblog can contain more than one sentence, so it needs to be split into sentences. Splitting is mainly based on punctuation. However, due to the particularity of microblogs, splitting on punctuation alone is not enough: many microblog users are accustomed to using spaces or special symbols (such as "~") to separate clauses for convenience, so these cases are also handled in this process.
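Splitting on punctuation, spaces and special symbols can be sketched with a single regular expression. The separator set below is an assumption chosen for Chinese microblog text, where words are not separated by spaces; the name `split_clauses` is illustrative.

```python
import re

# Sentence-final punctuation, tildes and whitespace all act as clause breaks.
CLAUSE_BREAK = re.compile(r"[。！？!?；;~～\s]+")

def split_clauses(text):
    """Split one microblog into clauses, dropping empty pieces."""
    return [part for part in CLAUSE_BREAK.split(text) if part]

clauses = split_clauses("今天好开心~屏幕很清晰 电池不行！")
```

The ASCII period is deliberately left out of the separator set so that decimals and abbreviations are not broken apart; for text in a language that separates words with spaces, whitespace would also have to be removed from the set.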

(4) Sentence word segmentation step: the conditional random field model of the emotion object recognition method of the present invention labels each word in the sentence-level sequence, and therefore word segmentation is required. The word segmentation process uses dictionaries of common network terms (such as "crazy", "crowd", etc.) to improve the accuracy of word segmentation.

(5) Part-of-speech tagging step: this step performs part-of-speech tagging on each word after word segmentation, providing the part-of-speech feature of each word for the feature extraction of the present invention.

(6) Syntactic parsing step: uses a syntactic parsing tool to parse the syntactic dependencies between the words of a sentence, in order to provide the dependency features of each word for the feature extraction of the present invention.

FIG. 9 is a schematic diagram showing the implementation of the training step of the emotion object recognition model of the present invention. Referring to FIG. 9, in this step, the labeled training data set is derived from the microblog data crawled from the Internet by the data acquisition module and processed by the data preprocessing module. Since the present invention uses a conditional random field (CRF) model for emotional object extraction, and the CRF model is a supervised learning method, the training data set also needs to be labeled manually. In the process of training the model, the user dictionaries, including the emotional word dictionary and the stop word dictionary, are first loaded by the dictionary loading module; next, the feature extraction module, combined with the loaded dictionaries, extracts features from the training data set and normalizes them; finally, the model training module trains the model on the normalized data from the previous step, using the L-BFGS algorithm to learn the parameters of the model.

The conditional random field model used in the present invention has the form shown in Fig. 10, and the emotional object recognition process is treated as a sequence labeling problem. The first layer of the model, X, represents the input microblog sentence, where x_i represents the word at the i-th position of the sentence; the second layer, y_i, and the third layer, g_1 and g_2, are the output states. The labels of these states can take values from the following five tags:

{N-B, N-I, P-B, P-I, O}

These five tags form the label space for each position in the sequence labeling process, where the N-B tag marks the start position of a negative emotion object, the N-I tag marks the continuation of a negative emotion object (that is, the previous tag must be either N-B or N-I), the P-B tag marks the start position of a positive emotion object, the P-I tag marks the continuation of a positive emotion object (likewise, the previous tag must be either P-B or P-I), and the O tag represents all other positions. For example, for the sequence {"phone", "screen", "very", "clear"}, "phone screen" is a positive emotional object, and the labeling result is {"P-B", "P-I", "O", "O"}.
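How a predicted label sequence is turned back into emotion objects can be sketched as follows, writing the five labels as N-B, N-I, P-B, P-I and O. The function name `extract_objects` is illustrative, and rejecting an invalid continuation label with an exception is a design choice of this sketch rather than something the description prescribes.

```python
def extract_objects(words, labels):
    """Group start/continuation labels into (polarity, phrase) pairs."""
    objects, current = [], None
    for word, label in zip(words, labels):
        if label in ("P-B", "N-B"):
            if current:
                objects.append(current)
            current = (label[0], [word])         # start a new emotion object
        elif label in ("P-I", "N-I"):
            # A continuation must follow a start or continuation of the same polarity.
            if current is None or current[0] != label[0]:
                raise ValueError(f"{label} must follow {label[0]}-B or {label[0]}-I")
            current[1].append(word)
        elif label == "O":
            if current:
                objects.append(current)
            current = None
        else:
            raise ValueError(f"unknown label: {label}")
    if current:
        objects.append(current)
    return [("positive" if polarity == "P" else "negative", " ".join(ws))
            for polarity, ws in objects]

found = extract_objects(["phone", "screen", "very", "clear"],
                        ["P-B", "P-I", "O", "O"])
```

For the example sentence this yields the single positive emotion object "phone screen".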

The model uses two global nodes, g_1 and g_2, to represent two independent single emotion objects, so each of them can only take one of the three labels {N-B, P-B, O}: P-B for a positive emotion object, N-B for a negative emotion object, or O when there is no emotion object. They cannot take the continuation labels N-I and P-I.

In order to improve the flexibility and extensibility of emotion object recognition, the conditional random field model adopted by the present invention is not limited to the graph structure shown in FIG. 9: the hidden global nodes are not limited to the two nodes g_1 and g_2 and can be extended to g_1, ..., g_n (n >= 1), as shown in Fig. 11.

The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention. It is to be understood that the foregoing description only illustrates specific embodiments of the present invention and is not intended to limit the invention; any modifications, equivalents, improvements and the like should be included in the scope of the present invention.

Claims

1. An emotional entity search system for Weibo, which is characterized by including the following 5 modules:

1) User interface, used for interaction between the system and users. Users can submit query requests and obtain feedback results through this module;

2) The query expansion module is used to mine word relationships in Weibo corpus data, and combines with the WordNet ontology library to establish a weighted word relationship graph;

3) Query processing module, used to convert user query requests into query keywords and query statements acceptable to the index database, and perform query expansion based on the word relationship graph constructed in module 2);

4) Emotional information mining module, used to conduct emotion mining on Weibo corpus and generate determination rules for emotional entities and emotional polarity;

5) Emotional information determination and index establishment module, used to determine the emotional entities and emotional polarity of Weibo data, establish an emotional information index, and store it;

6) The inverted index creation module is used to create an inverted index for Weibo text information and store it.

2. The emotional entity search system for Weibo according to claim 1, characterized in that the following steps are adopted in the above module 1) to implement query expansion:

11) Mining relevant rules for the data in the Weibo corpus, and outputting the relevant word sets obtained by mining the relevant rules;

12) Combine the frequent items and WordNet ontology library obtained in 11) to construct a weighted word relationship graph.

3. The emotional entity search system for Weibo according to claim 1, characterized in that in the above step 11), the Eclat algorithm is used to mine frequent item sets of the Weibo corpus and generate related word sets, and combine the related word sets with WordNet The ontology graph forms a weighted word relationship graph through mapping or insertion;

When constructing the weighted word relationship graph above, the calculation method of node weight is:

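$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d)$$

where deg(d), deg⁺(d) and deg⁻(d) represent the degree, out-degree and in-degree of the node respectively; the calculation method of edge weight is:

$$g(d_i, d_j) = \begin{cases} 1, & \text{if } d_i, d_j \text{ are nodes of the original WordNet graph} \\ \mathrm{lift}(d_i \rightarrow d_j), & \text{if } d_i, d_j \text{ are nodes formed only by rule words} \\ \mathrm{lift}(d_i \rightarrow d_j) + 1, & \text{if } d_i, d_j \text{ are both WordNet nodes and nodes formed by rule words} \end{cases}$$

where lift(d_i → d_j) is obtained according to the Eclat algorithm and is the correlation degree of d_i and d_j.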

4. The emotional entity search system for Weibo according to claim 1, characterized in that the following steps are used in the above module 3) to implement query processing:

31) Receive query words or sentences entered by users;

32) Process the user's input into word segmentation, remove stop words, and determine center words to obtain one or more center words;

33) Select appropriate expansion words for the central word in the weighted word relationship library constructed from ontology and rule words, and calculate the weight of the expansion words;

34) Then select the top p words with heavy weights and add them to the query word set, and input the expanded word set into the query interface.

5. The emotional entity search system for Weibo according to claim 4, characterized in that the above step 33) uses the following method to calculate the weight of the expanded words:

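Assume that the original query is Q = (q_1, q_2, ..., q_m), and the term q_i has the nearest-neighbor words d_i = (d_{i1}, d_{i2}, ..., d_{in_i}). Then the correlation between the original query term q_i and the nearest-neighbor term d_{ij} is calculated as

$$w(q_i, d_{ij}) = \log_2\{[g(q_i, d_{ij}) + 1] \times \log_2[f(d_{ij}) + 1]\}$$

where w(q_i, d_{ij}) is the correlation between the word q_i and the word d_{ij}, g(q_i, d_{ij}) is the weight of the two words, and f(d_{ij}) is the degree of the word d_{ij}; the weight calculation method of all nearest-neighbor words is

$$W(d_k) = \sum_{i=1}^{m} w(q_i, d_k) / m.$$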

6. The emotional entity search system for Weibo according to claim 1, characterized in that the following steps are adopted in the above module 4) to realize the identification and determination of emotional entities:

41) Collect representative Weibo data;

42) Preprocess the collected Weibo data, including data cleaning, transformation, sentence segmentation, word segmentation, part-of-speech tagging and syntax analysis, etc.;

43) Extract features from Weibo data and express them into feature vectors;

44) Train the emotional entity recognition model and obtain the model parameters;

45) Output the emotional entity determination model and store it.

7. The emotional entity search system for Weibo according to claim 6, characterized in that the following method is used to implement feature extraction in the above-mentioned step 43): combined with the word context, a custom dictionary including global features is designed, and according to the custom The definition dictionary extracts features from Weibo data and converts the Weibo data into an input data format that can be processed by the emotional entity recognition model.

8. The emotional entity search system for Weibo according to claim 6, characterized in that in the above step 44), the following method is used to implement the emotional entity recognition model: introducing global feature nodes into the conditional random field (CRF) model, and establishing The GLCRF model (global conditional random field model) combined with global features is trained using the L-BFGS algorithm to obtain model parameters.

9. The emotional entity search system for Weibo according to claim 1, characterized in that the following steps are used in the above-mentioned module 5) to realize the determination of the emotional polarity of Weibo:

51) Weibo data noise removal and semantic form conversion;

52) Word segmentation, part-of-speech tagging and Chinese grammar analysis;

53) Combined with the emotional dictionary to extract emotional phrases;

54) Emotional phrase filtering;

55) Emotional polarity determination and result output.

10. The emotional entity search system for Weibo according to claim 9, characterized in that the sentiPY method is used to extract emotional phrases in the above step 53), and the unified form of the emotional phrase is phrase: modifier* sentiment, that is, a phrase includes a central emotional word (sentiment), which may be accompanied by multiple modifying adverbs (modifier); in the above step 55), a hybrid decision-making algorithm based on the emotional landing point is used to determine the emotional polarity of Weibo. The judgment process includes the following steps:

551) Determine whether there is a general word in the sentence, if not, go to step 552); if there is, use the sentence after the general word as the emotional landing point, and output the polarity of the emotional landing point as the Weibo emotional polarity;

552) Use the beginning and end of the Weibo sentence as the emotional landing points, and compare the emotional polarity of the beginning and the end of the sentence. If the emotional polarities of the two cancel each other out, go to 553); otherwise, output the one with the stronger emotional polarity as the Weibo emotional polarity;

553) Calculate the emotional word intensity of the entire Weibo, sum and average it, and output the average intensity as the emotional polarity of the Weibo.