DE112013004082T5

DE112013004082T5 - Search system of the emotion entity for the microblog

Info

Publication number: DE112013004082T5
Application number: DE112013004082.4T
Authority: DE
Inventors: Zhifeng Hao; Ruichu Cai; Shenzhi Du; Jie Cheng; Wen Wen; Yinzhang Lu
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2013-09-29
Filing date: 2013-12-06
Publication date: 2015-07-23
Also published as: CN103544242B; WO2015043075A1; CN103544242A

Abstract

The present invention relates to an emotion entity search system for microblogs. The search system comprises the following six modules: (1) a user interface, through which the user can submit a query and receive feedback; (2) a query expansion module, which mines word relationships from the microblog corpus and, in combination with the WordNet ontology base, builds a weighted word relationship graph; (3) a query processing module, which converts the user's query into query keywords and query terms that the index base can accept, and which performs query expansion based on the word relationship graph built by module (2); (4) an emotion information mining module, which mines emotions in the microblog corpus and generates the judgment rules for the emotion entity and the emotional polarity; (5) an emotion information judgment and index construction module, which judges the emotion entity and the emotional polarity of the microblog data in order to build and store the emotion information index; (6) an inverted index construction module, which builds and stores the inverted index for the microblog text information. The present invention solves the difficult problems of extracting emotion entities from microblogs, analyzing emotional polarity, and searching for emotion entities, thereby providing an intelligent search product for online opinion analysis and monitoring.

Description

Technical field

The present invention relates to the field of text emotion mining and information retrieval, and in particular to an emotion entity search system for microblogs, which is an innovative technique for searching emotion entities in microblogs.

State of the art

As the Internet and social networks develop, social network data, including microblogs, are growing exponentially. As microblogs grow, more and more information becomes available for retrieval. However, it is difficult to find the required information quickly and accurately in the enormous volume of microblog data. Because microblog text is written freely, retrieving emotion information from it is harder than from conventional texts. In the field of microblog emotion information retrieval, which plays a major role in the opinion monitoring and product research industries, there are currently no proven technologies or systems.

The emotion entity search method and search system for microblogs mainly involve three kinds of relevant prior art: first, the query expansion technique; second, the emotion entity extraction technique; third, the emotional polarity discrimination technique. These three kinds of prior art are explained and analyzed below.

1. Query expansion technique

Traditional retrieval systems or search engines, which run a direct query on the keywords, can return some relevant search results. However, results found by simple matching are mechanical: the user's query intent is not well understood, so the results are unsatisfactory. To solve this problem, a method is needed that understands the user's query intent well and improves the accuracy and completeness of retrieval. Query expansion is exactly such a method. Through query expansion, the user's query needs can be understood more precisely, which helps the user obtain the required information more accurately. Classic query expansion methods fall into four main types, based respectively on global analysis, on local analysis, on the user's query log, and on association rules. In recent years, some researchers have also proposed query expansion methods based on ontologies (or domain ontologies) and semantic networks.

The query expansion method based on global analysis performs the expansion by determining the degree of relevance between words across the entire data set, or across all texts in the database. Its advantage is that the whole data set is analyzed completely, so all aspects of the documents can be captured. Its disadvantage is that, because a typical data set is very large, the analysis places high demands on time and hardware and cannot be run online. In existing retrieval systems, the analysis of the full vocabulary is carried out offline. The method is therefore hard to use in search engines that require real-time operation.

The method based on local analysis comprises the relevance feedback method and the pseudo-relevance feedback method. In relevance feedback, the search results are first obtained from the user's primary query; the user then manually judges each result document as relevant or irrelevant, and the documents are assigned to two different document sets accordingly. This yields a set of documents labeled as relevant. Before query expansion, word analysis only needs to be performed on these documents. The advantage is that only the relevant documents are processed, so the number of documents is reduced and the degree of relevance is improved. The disadvantage is that a large amount of manual feedback is required, which costs a great deal of labor, and a large number of experiments are still needed for tuning. The method is therefore rarely used in existing retrieval systems or search engines.

The pseudo-relevance feedback method performs an analysis on the first n results returned by the user's primary query. It assumes that the documents relevant to the search word appear at the top of the result list, so these documents are treated as the documents with the highest degree of relevance. Expansion words are found by analyzing these documents, and the query expansion is then carried out. The invention "Query expansion method and query expansion system" with patent no. CN 20091032193.5 is an example of a patent that uses pseudo-relevance feedback. Its main idea is to run a cluster analysis on a portion of the top-ranked documents among the user's primary search results, so that clusters are generated. After the clusters have been ranked, expansion words are extracted from a certain number of top clusters. The expansion words obtained are added to the original query to form the expanded query, and a secondary retrieval is executed. The method has the disadvantage that there is no guarantee that the documents returned by the primary query are relevant. If they are irrelevant, the expansion words obtained can make the results of the secondary retrieval even less relevant, which degrades retrieval performance.
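The basic pseudo-relevance feedback loop described above can be illustrated with a minimal sketch. It is not the method of the cited patent (no clustering is performed); the term-count ranking function, the whitespace tokenizer, and all names (`search`, `expand_query`, `top_n`, `num_expansion`) are assumptions chosen for illustration only.

```python
from collections import Counter


def search(query_terms, docs):
    """Toy primary retrieval: rank documents by the number of query-term hits.

    `docs` maps a document id to its text. Query terms are assumed lowercase.
    """
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def expand_query(query_terms, docs, top_n=2, num_expansion=2,
                 stopwords=frozenset()):
    """Assume the top_n results are relevant and take their most frequent
    non-query words as expansion words."""
    top_docs = search(query_terms, docs)[:top_n]
    counts = Counter()
    for doc_id in top_docs:
        for token in docs[doc_id].lower().split():
            if token not in query_terms and token not in stopwords:
                counts[token] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    expansion = [term for term, _ in ranked[:num_expansion]]
    # The expanded query would then drive the secondary retrieval.
    return list(query_terms) + expansion
```

If the top-ranked documents happen to be irrelevant, the expansion words drawn from them are irrelevant too, which is exactly the drawback noted in the text.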

The method based on the user's query log is currently a common method in search engines. In this method, a word analysis is performed on the user's query log, and words that occur together are used as expansion words. In the invention "Query expansion method and device, and related retrieval thesaurus" with patent no. CN 200710097501.6 and the invention "Query expansion method, device, and search engine system" with patent no. CN 200810115470.7, the search words entered by users are analyzed to obtain related words, which are then used as expansion words. This expansion method first requires a large number of query logs, which in turn requires a collection process.
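A minimal sketch of the co-occurrence idea behind log-based expansion follows. The log format (one query string per entry), the `min_count` threshold, and the function names are assumptions for illustration and are not taken from the cited patents.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(query_log):
    """Count how often two distinct words appear in the same logged query."""
    pair_counts = Counter()
    for query in query_log:
        terms = sorted(set(query.lower().split()))
        for a, b in combinations(terms, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def related_terms(term, pair_counts, min_count=2):
    """Return words that co-occur with `term` at least `min_count` times,
    most frequent first."""
    related = {}
    for (a, b), count in pair_counts.items():
        if count < min_count:
            continue
        if a == term:
            related[b] = count
        elif b == term:
            related[a] = count
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked]
```

The sketch also makes the stated prerequisite visible: with only a handful of logged queries, few pairs reach the threshold, so a sizable log must be collected first.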

Association rule mining is a classic data mining method and is often used to discover associations between items. In query expansion, it can be applied to various resources, for example to find associations between words in document collections, query logs, and so on. The invention "Method and server for expanding the user's search results" with patent no. CN 201010605956.6 is an example of query expansion using the association rule technique. In that patent, the established rules are stored in an association rule database. The rules can be created manually; it is also possible to mine particular documents using association rules under the support-confidence framework and then store the generated rules in the association rule database. When the user enters a search word, related words are first obtained from the rule database. The original search word, the related words obtained, and the combination of the two then form a new search word, and a secondary retrieval is performed on the database. The method has the disadvantage that a word is not understood through its meaning; only its frequency is taken into account. The expansion therefore cannot understand the user's query intent very well.
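The support-confidence framework mentioned above can be sketched for single-word rules of the form a -> b. This is an illustrative simplification, not the procedure of the cited patent: the thresholds, the in-memory dictionary used as the "rule database", and the names `mine_rules` and `expand_with_rules` are assumptions.

```python
from collections import Counter
from itertools import permutations


def mine_rules(documents, min_support=0.5, min_confidence=0.6):
    """Mine rules a -> b where
    support    = (documents containing both a and b) / (all documents)
    confidence = (documents containing both a and b) / (documents containing a).
    """
    n = len(documents)
    term_sets = [set(doc.lower().split()) for doc in documents]
    item_counts = Counter()
    pair_counts = Counter()
    for terms in term_sets:
        item_counts.update(terms)
        # Ordered pairs, so a -> b and b -> a are evaluated separately.
        pair_counts.update(permutations(sorted(terms), 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.setdefault(a, []).append((b, confidence))
    return rules


def expand_with_rules(query_terms, rules):
    """Append the consequents of all rules whose antecedent is a query word."""
    expanded = list(query_terms)
    for term in query_terms:
        ranked = sorted(rules.get(term, []), key=lambda r: (-r[1], r[0]))
        for consequent, _ in ranked:
            if consequent not in expanded:
                expanded.append(consequent)
    return expanded
```

Both measures are pure frequency ratios, which is why this kind of expansion says nothing about what the words mean.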

The query expansion method based on ontologies or semantic networks performs the expansion by using or building a semantic network. The semantic network can be a ready-made network, such as WordNet and HowNet, or it can be built by the implementer, for example from domain knowledge or a domain ontology. The semantic network or ontology base organizes multi-layered relationships between words, such as parity words, context words, concept words, and whole-part words, so that a network over the words is formed. The patent "A semantic query expansion method based on domain knowledge" with patent no. CN 200810116729.X first builds a domain knowledge base using domain knowledge and an analysis of the characteristics of the user's sentences. It then uses the content of the domain knowledge base to perform a semantic analysis of the original search word and obtain a list of semantic entries, and derives the expandable entries by semantic computation. Finally, the expansion entries are placed back into the search set to perform a secondary retrieval of the database. The patent "A query expansion and ranking method for text-based image retrieval" with patent no. CN 20101084725.2 performs a semantic analysis of the words using the WordNet and HowNet networks and obtains semantically expanded words, which are used in the text analysis of an image retrieval system. It also devises an algorithm that optimally ranks the returned results. Semantic expansion allows the user's query intent to be recognized very well. However, the expansion words obtained by this method do not take the database to be searched into account, so retrieval performance is often very limited. In addition, building a domain ontology base is labor-intensive and time-consuming.
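Expansion over a semantic network can be sketched as a neighbor lookup in a weighted word graph. The graph below is a hand-written toy structure, not data from WordNet or HowNet, and the weights, the `min_weight` cutoff, and the function name `expand_by_graph` are assumptions for illustration.

```python
def expand_by_graph(query_terms, graph, min_weight=0.5, max_terms=3):
    """Expand a query with the strongest neighbors of each query word.

    `graph` maps a word to a dict {neighbor: weight}, where a higher weight
    means a closer semantic relationship (hypothetical toy format).
    """
    candidates = {}
    for term in query_terms:
        for neighbor, weight in graph.get(term, {}).items():
            if neighbor in query_terms or weight < min_weight:
                continue
            # Keep the strongest link if a neighbor is reachable twice.
            candidates[neighbor] = max(weight, candidates.get(neighbor, 0.0))
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [word for word, _ in ranked[:max_terms]]
```

Because the graph is built independently of the collection being searched, an expansion word may never occur in that collection, which corresponds to the limitation described in the text.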

2. Emotion entity extraction technique

The emotion object is the target of an emotional expression and is usually a noun or a noun phrase. As a rule, it is pointless to analyze the emotional tendency alone without knowing the emotion object. Researchers attach great importance to the extraction of the emotion object, which is a very important and at the same time very challenging task in emotion analysis and opinion mining. Although there is currently a great deal of research on emotional expressions and emotion objects, most of it analyzes product reviews or news.

Compared with traditional information, microblog data, owing to the system-imposed limit on the number of words and the freedom of Internet text, contain a large number of abbreviated expressions, typos, special characters (such as emoticons and links), and various other textual expressions that deviate from conventional rules. This undoubtedly increases the difficulty of data analysis. Because emotion analysis and opinion mining started late in China, because Chinese and English differ, and because the corresponding techniques are not yet mature, there is currently relatively little research on identifying emotion objects in microblogs.

One emotion object identification technique currently exists: a patent from the Beijing University of Aeronautics and Astronautics, "An opinion extraction method based on the dependency relationships of words", with patent no. CN 201210317183.0. In this method, a matching algorithm based on the dependency relationship chain of the words extracts the comment object. First, the method does not use any other available auxiliary information to improve its accuracy. Second, the method is not necessarily suitable for the particular kind of text found in microblogs.

The emotion object extraction commonly found in the literature is aimed mainly at product reviews. Because specific product information and domains are defined there, the problem is more clearly delimited. Extraction from texts related to a specific topic therefore usually performs better, whereas extraction from texts not tied to a topic does not perform well. The main reason is that the comment objects in such texts vary widely, and the emotion words are also diverse. At present, emotion object identification techniques for topic-independent microblogs are rare. Most existing methods perform a syntactic dependency analysis on the microblog and, in combination with an emotion dictionary, obtain a pair <emotion word, emotion object>, from which the emotion object is extracted. This approach does not identify emotion objects well and has the following disadvantages: (1) The extraction process depends too heavily on the emotion dictionary and on particular syntactic dependency relationships. First, many misjudgments occur, because the dictionary-based judgment method is limited and strongly influenced by domain knowledge; second, because of the particular way microblogs are written, emotion words and emotion objects are not necessarily confined to a few particular dependency relationships. (2) In microblogs, emotion words and their emotion objects often do not appear as a pair in the text: only the emotion word expresses the emotional tendency, while the emotion object does not appear explicitly in the sentence. Emotion objects that do not occur directly in the sentence cannot be extracted by this process.
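The dictionary-plus-dependency approach criticized above can be sketched as follows. The sketch assumes the sentence has already been parsed into (head, relation, dependent) triples by some dependency parser; the triple format, the two relation labels checked (`nsubj` and `amod`, as used in Universal Dependencies), the toy lexicon, and the function name `extract_pairs` are all assumptions for illustration.

```python
# Hypothetical toy emotion dictionary: word -> polarity (+1 positive, -1 negative).
EMOTION_LEXICON = {"great": 1, "awful": -1}


def extract_pairs(edges, lexicon=EMOTION_LEXICON):
    """Extract <emotion word, emotion object> pairs from dependency triples.

    Only two fixed patterns are matched:
      nsubj(emotion_word, object)   e.g. "the screen is great"
      amod(object, emotion_word)    e.g. "great screen"
    """
    pairs = []
    for head, relation, dependent in edges:
        if relation == "nsubj" and head in lexicon:
            pairs.append((head, dependent))
        elif relation == "amod" and dependent in lexicon:
            pairs.append((dependent, head))
    return pairs
```

For a microblog post that consists only of an emotion word, such as "Awful!", no pair is produced because the object is absent from the sentence; likewise a word missing from the lexicon or a relation outside the two patterns is silently skipped. These are the two drawbacks listed in the text.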

3. Emotional polarity discrimination technique

In terms of the granularity of analysis, existing emotion analysis systems and techniques concentrate mainly on emotion analysis at the document level and the sentence level. Emotion analysis techniques at the entity level are very few, and they carry out entity identification and emotion analysis as two separate tasks. In terms of the objects of analysis, existing systems and techniques are aimed at news, microblogs, and other commentary, with a focus on analyzing public opinion.

The following are existing emotion analysis techniques at the article level and the sentence level: a patent of the Northwest Technical University, "Hybrid Model-Based Identification Method for WEB Text Emotion Topics", patent no. CN 200910219161.9; a patent of the Computer Technology Research Institute of the Chinese Academy of Sciences, "Analysis Method for the Tendency of Text Emotions", patent no. CN 200910083522.1; a patent of the Automation Research Institute of the Chinese Academy of Sciences, "Emotion Analysis Method for Short Microblog Texts", patent no. CN 201210088366.X; and a patent of the company Fujitsu, "Analysis Method and Device for the Tendency of Emotions", patent no. CN 201010157784.0.

The emotion analysis techniques above mainly comprise two stages, training and emotion judgment. In the following, the patent "Hybrid Model-Based Identification Method for WEB Text Emotion Topics" of the Northwest Technical University is taken as an example to introduce the main steps of training and emotion judgment. The other related techniques are essentially similar. The method mainly has the following steps:

1. The texts of the training set are labeled manually in order to estimate two kinds of emotion models, a "positive" model and a "negative" model; at the same time, language models for the different topics are estimated according to the modes of expression of the different texts;
2. The parameters of the emotion models and topic models built in step 1 are estimated with the maximum likelihood estimation (MLE) method;
3. For each text to be processed, the distance between its language model and the two kinds of language models is calculated, and the emotional tendency and the topic of the text are judged from this distance.

At present, emotional tendency techniques concentrate mainly on the article level and the sentence level. Machine learning-based methods are widespread, whereas emotion analysis techniques based on the emotional landing point are rare.

The existing emotion analysis techniques based on emotion words have three main disadvantages: (A) The extraction of emotion phrases does not take modification by adverbs into account, although adverbs usually qualify emotion words such as adjectives to a certain degree. Ignoring them can distort the emotional intensity; (B) Negation words are commonly identified and handled by searching for them with a fixed strategy, but the object of the negation is very hard to determine; (C) Some automatically built emotion word intensity dictionaries are unreliable, because intensity is an intrinsic property of an emotion word and depends above all on its inherent meaning.

Content of the invention

It is the object of the present invention to overcome the above technical deficiencies of the existing emotion entity search techniques and to provide an emotion entity search system for the microblog that improves the accuracy of the emotional polarity judgment.

The present invention is realized by the following technical solution. A search system of the emotion entity for the microblog according to the present invention has the following modules:

(1) a user interface for interaction between the system and the user, through which the user can submit a query and receive feedback;
(2) a query expansion module for mining the word relationships in the microblog corpus, wherein a relationship graph of weighted words is built in combination with the WordNet Essence Bank;
(3) a query processing module for converting the user's query into query keywords and query terms that the index base accepts, wherein query expansion is performed on the basis of the word relationship graph built by module (2);
(4) an emotion information mining module for mining the emotions in the microblog corpus, wherein the judgment rules for the emotion entity and the emotional polarity are generated;
(5) an emotion information judgment and index building module for judging the emotion entity and the emotional polarity of the microblog data in order to build and store the emotion information index;
(6) an inverted index building module for building and storing the inverted index for the microblog text information.

In the query expansion module, query expansion is realized by the following steps:

(11) mining association rules from the data in the microblog corpus and outputting the related word groups obtained by the mining;
(12) building the relationship graph of weighted words from the frequent itemsets obtained in step (11) in combination with the WordNet Essence Bank.
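Step (11) relies on frequent itemset mining, for which the description names the Eclat algorithm. The following is a minimal sketch of Eclat's vertical, depth-first search, not the patent's implementation. It assumes each microblog post is already segmented into a list of words; the function name `eclat` and the absolute `min_support` threshold are illustrative choices.

```python
def eclat(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    # Vertical layout: each item maps to the ids of the posts that contain it.
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Depth-first: grow the prefix by one item, then recurse on the rest.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids)
            remaining = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids  # posts containing the larger itemset
                if len(shared) >= min_support:
                    remaining.append((other, shared))
            if remaining:
                extend(itemset, remaining)

    singles = sorted(
        ((item, tids) for item, tids in tidsets.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), singles)
    return frequent


# Hypothetical toy corpus: four segmented posts.
posts = [
    ["phone", "battery", "bad"],
    ["phone", "battery"],
    ["phone", "screen"],
    ["battery", "bad"],
]
itemsets = eclat(posts, min_support=2)
```

Word pairs that come out as frequent itemsets, such as phone and battery in the toy corpus, are the kind of related word groups that step (12) would then merge into the word relationship graph.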

In step (11), the frequent itemsets of the microblog corpus are found with the Eclat algorithm and the related word groups are generated. The related word groups and the WordNet Essence Bank are combined, by mapping or insertion, into a relationship graph of weighted words. When the graph is built, the weight of a node is calculated as

$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d)$$

where $\deg(d)$, $\deg^{+}(d)$ and $\deg^{-}(d)$ are the degree, the out-degree and the in-degree of the node $d$, respectively. The edge weight is calculated as follows:

[edge-weight formula not reproduced in the source text]
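The node weight defined above, the sum of out-degree and in-degree, can be sketched in a few lines. This is an illustration only: it assumes the word relationship graph is available as a list of directed (source, target) word pairs, and the name `node_weights` is hypothetical. The edge weight is left out because its formula is not available in this text.

```python
from collections import Counter


def node_weights(edges):
    """Compute f(d) = deg+(d) + deg-(d) for every word d in a directed graph."""
    out_degree = Counter(source for source, _ in edges)
    in_degree = Counter(target for _, target in edges)
    words = set(out_degree) | set(in_degree)
    # Counter returns 0 for words that never occur as source or as target.
    return {word: out_degree[word] + in_degree[word] for word in words}


# Hypothetical toy graph of related words.
edges = [("phone", "battery"), ("phone", "screen"), ("battery", "phone")]
weights = node_weights(edges)
```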

In module (3), query expansion is realized by the following steps:

(31) receiving the query words or sentences entered by the user;
(32) performing word segmentation, stop word removal and keyword determination on the user's input in order to obtain one or more keywords;
(33) selecting, for each keyword, suitable expansion words from the relationship graph of weighted words formed from the essence bank and the rule words, wherein a weight is calculated for each expansion word;
(34) selecting the p words with the greatest weight and adding them to the search word group, wherein the expanded word group is passed to the query interface.
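Steps (31) to (34) can be sketched as a small pipeline. This is a sketch under stated assumptions, not the patent's implementation: whitespace splitting stands in for a real word segmenter, the graph is a hypothetical dictionary mapping each word to its distance-1 neighbors and edge weights, and the default scoring simply reuses the edge weight because the weight formula of step (33) is not available in this text. A caller can pass any other `score` function.

```python
def expand_query(query, graph, stop_words, p=3, score=None):
    """Return the keywords of `query` followed by the p best expansion words."""
    if score is None:
        # Stand-in scoring (assumption): use the edge weight directly.
        score = lambda keyword, neighbour, edge_weight: edge_weight

    # (32) segmentation and stop word removal, simplified to whitespace splitting
    keywords = [w for w in query.lower().split() if w not in stop_words]

    # (33) candidates are the distance-1 neighbours of each keyword
    weights = {}
    for keyword in keywords:
        for neighbour, edge_weight in graph.get(keyword, {}).items():
            if neighbour in keywords:
                continue
            value = score(keyword, neighbour, edge_weight)
            if neighbour not in weights or value > weights[neighbour]:
                weights[neighbour] = value

    # (34) keep the p heaviest expansion words (ties broken alphabetically)
    best = sorted(weights, key=lambda word: (-weights[word], word))[:p]
    return keywords + best


# Hypothetical toy graph and stop word list.
graph = {
    "phone": {"battery": 0.8, "screen": 0.5, "case": 0.1},
    "price": {"cheap": 0.6, "battery": 0.2},
}
stop_words = {"the", "of"}
expanded = expand_query("the price of the phone", graph, stop_words, p=2)
```

Keeping the scoring function pluggable means the relevance measure of step (33) can be substituted without touching the rest of the pipeline.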

In step (33), the weight of an expansion word is calculated as follows. Let the original search words be $q = (q_1, q_2, \ldots, q_m)$, and let the term $q_i$ have $n_i$ neighboring words $d_i = (d_{i1}, d_{i2}, \ldots, d_{in_i})$. The degree of relevance between the original search word $q_i$ and the neighboring word $d_{ij}$ is calculated as:

[formula not reproduced in the source text]

where $W(q_i, d_{ij})$ is the degree of relevance between the word $q_i$ and the word $d_{ij}$, $g(q_i, d_{ij})$ is the weight between the two words, and $f(d_{ij})$ is the degree of the word $d_{ij}$. The weight of all neighboring words is calculated as:

[formula not reproduced in the source text]

In step (4), the identification and judgment of the emotion entity are realized by the following steps:

(41) collecting representative microblog data;
(42) preprocessing the collected microblog data, including cleaning, transformation, conversion, sentence segmentation, word segmentation, part-of-speech tagging, syntactic analysis, etc.;
(43) performing feature extraction on the microblog data, which are then expressed as feature vectors;
(44) training the emotion entity recognition model in order to obtain the model parameters;
(45) outputting and storing the emotion entity judgment model.

In step (43), feature extraction is realized as follows: based on the context of the words, a user-defined dictionary that includes global features is designed. Feature extraction is performed on the microblog data according to this dictionary, and the microblog data are converted into the input data format that the emotion entity recognition model can process.
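The idea of step (43), building a feature dictionary from word context and encoding each token against it, can be sketched as follows. Everything concrete here is an assumption made for illustration: the feature templates, the part-of-speech tag names, and in particular the sentence-length flag that stands in for a sentence-level feature, since this section does not specify the actual features.

```python
def token_features(tokens, tags, i):
    """Context features for token i, plus one hypothetical sentence-level feature."""
    features = [f"w={tokens[i]}", f"pos={tags[i]}"]
    features.append(f"w-1={tokens[i - 1]}" if i > 0 else "w-1=<BOS>")
    features.append(f"w+1={tokens[i + 1]}" if i < len(tokens) - 1 else "w+1=<EOS>")
    # Hypothetical sentence-level feature: a coarse length bucket.
    features.append("global:len=" + ("short" if len(tokens) <= 5 else "long"))
    return features


def build_dictionary(sentences):
    """Map every feature string seen in the training sentences to an integer id."""
    dictionary = {}
    for tokens, tags in sentences:
        for i in range(len(tokens)):
            for feature in token_features(tokens, tags, i):
                dictionary.setdefault(feature, len(dictionary))
    return dictionary


def encode(tokens, tags, dictionary):
    """Encode a sentence as one sorted list of known feature ids per token."""
    return [
        sorted(dictionary[f] for f in token_features(tokens, tags, i) if f in dictionary)
        for i in range(len(tokens))
    ]


# Hypothetical tagged training sentence.
training = [(["battery", "is", "bad"], ["NN", "VB", "JJ"])]
feature_dictionary = build_dictionary(training)
encoded = encode(["battery", "is", "bad"], ["NN", "VB", "JJ"], feature_dictionary)
```

Features that were never seen during dictionary construction are silently dropped at encoding time, which is one common way to handle unknown words in a sequence labeling pipeline.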

In step (44), the emotion entity recognition model is realized as follows: nodes for the global features are introduced into the conditional random field (CRF) model in order to build a GLCRF model to which the global features are added. The model is trained with the L-BFGS algorithm in order to obtain the model parameters.

In step (5), the emotional polarity of the microblog is judged by the following steps:

(51) removing noise from the microblog and converting the semantic form;
(52) word segmentation, part-of-speech tagging and analysis of the Chinese grammar;
(53) extracting the emotion phrases with the help of the emotion dictionary;
(54) filtering the emotion phrases;
(55) judging the emotional polarity and outputting the results.

In step (53), the emotion phrases are extracted with the sentiPY method, and every emotion phrase is expressed uniformly in the form phrase: modifier·sentiment. That is, a phrase contains one central emotion word and may additionally contain several adverbs that modify it.
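The phrase form described for step (53), one central emotion word with optional modifying adverbs, can be illustrated with a short sketch. This is not the sentiPY method, whose details are not given here. It assumes the text is already segmented and tagged, that adverbs carry the hypothetical tag "ADV", and that the emotion dictionary is a plain mapping from word to signed strength.

```python
def extract_phrases(tagged_tokens, lexicon):
    """Return (modifiers, sentiment_word) pairs in the form modifier·sentiment."""
    phrases = []
    for i, (word, _) in enumerate(tagged_tokens):
        if word not in lexicon:
            continue
        # Walk left and collect the unbroken run of adverbs before the emotion word.
        modifiers = []
        j = i - 1
        while j >= 0 and tagged_tokens[j][1] == "ADV":
            modifiers.insert(0, tagged_tokens[j][0])
            j -= 1
        phrases.append((tuple(modifiers), word))
    return phrases


# Hypothetical tagged sentence and emotion dictionary.
tagged = [
    ("the", "DET"), ("screen", "NOUN"), ("is", "VERB"),
    ("really", "ADV"), ("very", "ADV"), ("good", "ADJ"),
    ("but", "CONJ"), ("battery", "NOUN"), ("bad", "ADJ"),
]
lexicon = {"good": 1.0, "bad": -1.0}
phrases = extract_phrases(tagged, lexicon)
```

Keeping the adverbs attached to their emotion word is what allows a later step to adjust the intensity of the phrase instead of scoring the emotion word alone.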

In step (55), the emotional polarity of the microblog is judged with a hybrid decision algorithm based on the emotional landing point. The judgment process comprises the following steps:

(551) It is determined whether a sentence contains a summary word. If not, the process goes to step (552). If so, the words after the summary word are used as the emotional landing point, and the polarity of the landing point is output as the emotional polarity of the microblog;
(552) The beginning and the end of the microblog are used as emotional landing points, and their emotional polarities are compared. If the two polarities neutralize each other, the process goes to step (553); otherwise the stronger polarity is output as the emotional polarity of the microblog;
(553) The strengths of the emotion words in the whole microblog are calculated, summed and averaged, and the mean strength is output as the emotional polarity of the microblog.
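The three-stage decision in steps (551) to (553) can be sketched as below. Several points are assumptions made for the sketch: a microblog is a list of already segmented sentences, the emotion dictionary maps words to signed strengths, the beginning and end of the microblog are read as its first and last sentence, and two polarities "neutralize each other" when their scores sum to zero. The summary word list is hypothetical.

```python
def landing_point_polarity(microblog, lexicon, summary_words):
    """Return a signed score; its sign is the judged polarity of the microblog."""
    tokens = [token for sentence in microblog for token in sentence]
    if not tokens:
        return 0.0

    def strength(words):
        return sum(lexicon.get(word, 0.0) for word in words)

    # (551) a summary word makes everything after it the landing point
    for i, token in enumerate(tokens):
        if token in summary_words:
            return strength(tokens[i + 1:])

    # (552) otherwise compare the first and the last sentence
    first, last = strength(microblog[0]), strength(microblog[-1])
    if abs(first + last) > 1e-9:  # the two do not cancel out
        return first if abs(first) > abs(last) else last

    # (553) fall back to the mean strength of all emotion words
    strengths = [lexicon[token] for token in tokens if token in lexicon]
    return sum(strengths) / len(strengths) if strengths else 0.0


# Hypothetical emotion dictionary and summary words.
lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
summary_words = {"overall"}
score = landing_point_polarity(
    [["screen", "bad"], ["overall", "great"]], lexicon, summary_words
)
```

In the example the summary word "overall" decides the result, so the negative opening sentence does not influence the score.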

The present invention relates to a query expansion solution for searching emotion entities in microblogs, characterized in that the word relationships in the microblog corpus are mined, a relationship graph of weighted words is built in combination with the WordNet Essence Bank, and query expansion is performed on the basis of this graph in order to better understand the user's query intent. With regard to query expansion, the invention solves the problem of effectively combining the semantic essence with the word relationships, so that the user's query purpose is better understood and the query sentence is converted into better query expansion words. With regard to the extraction of the emotion entity and the analysis of emotional coloring, the invention addresses emotion object extraction and emotional polarity judgment for microblog texts, which are written with considerable freedom. It solves the problem of entity extraction for implicit emotion objects, which improves the extraction of emotion entities, and at the same time increases the accuracy of the emotional polarity judgment. This provides a good technical solution for network opinion monitoring and product opinion analysis. The invention solves the difficult problems of extracting emotion entities from microblogs, analyzing emotional polarity and searching for emotion entities, thereby providing an intelligent search product for network opinion analysis and monitoring.

Short description of the drawings

FIG. 1 shows an overall structural view of the present invention.

FIG. 2 shows a flow chart of the implementation and use of the present invention.

FIG. 3 shows a system architecture diagram of the present invention.

FIG. 4 shows a flow chart of the emotional polarity analysis method according to the present invention.

FIG. 5 shows an example of the graph structure used when optimizing the emotional strength on the basis of the neighbor relationship.

FIG. 6 shows a flow chart of the emotional landing point algorithm.

FIG. 7 shows a flow chart of the extraction of the emotion object of the microblog.

FIG. 8 shows a flow chart of the preprocessing of the data.

FIG. 9 shows a schematic diagram of the model training for the emotion object.

FIG. 10 shows the graph structure of the GLCRF model.

FIG. 11 shows the graph structure of the GLCRF model after extension with several global nodes.

Detailed embodiments

The embodiment of the present invention is explained in more detail below with reference to the figures. The embodiment of the present invention is, however, not limited thereto.

FIG. 1 shows an overall structural view of the present invention. A search system of the emotion entity for the microblog comprises: a user interface module, through which the user can submit a query and receive feedback; a query expansion module that mines the word relationships in the microblog corpus, wherein a relationship graph of weighted words is built in combination with the WordNet Essence Bank; a query processing module for converting the user's query into query keywords and query terms that the index base accepts, wherein query expansion is performed on the basis of the word relationship graph built by the query expansion module; an emotion information mining module for mining the emotions in the microblog corpus, wherein the judgment rules for the emotion entity and the emotional polarity are generated; an emotion information judgment and index building module for judging the emotion entity and the emotional polarity of the microblog data in order to build and store the emotion information index; and an inverted index building module for building and storing the inverted index for the microblog text information.

FIG. 2 shows a flow chart of the operation of the query processing module of the present invention.

Referring to FIG. 2, the process comprises the following steps:

1. The query interface receives the query words or sentences entered by the user;
2. The query process performs word segmentation, stop word removal and keyword determination on the user's input in order to obtain one or more keywords; a keyword can be a key term, a modifier, etc.;
3. For each keyword, suitable expansion words are selected from the relationship graph of weighted words formed from the essence bank and the rule words; the selected words are at distance 1, that is, they are the direct neighbors of the keyword;
4. Since step 3 can yield a large number of expansion words, the importance of each word is measured by calculating a weight for it; the p words with the greatest weight are then selected and added to the search word group;
5. Step 4 already yields the necessary expansion words, but a mechanism is provided so that the user can see them; the user confirms the words, that is, the modified and expanded search word group, so that the expansion words match the user's query intent;
6. The expanded word group is returned to the query entry point, and an expanded retrieval is performed on the rich media database;
7. The retrieval results are returned and displayed to the user.

FIG. 3 shows the implementation details of the query processing and query expansion modules of the present invention.

Referring to FIG. 3, the query processing and query expansion of the present invention comprise two parts, background information processing and the retrieval process, and can be subdivided into 5 submodules: the microblog information extraction module, the index building module, the word relationship graph building module, the user retrieval module, and the administrator and user operation module.

The microblog information extraction module organizes the raw microblog data and performs appropriate cleaning, sentence segmentation, word segmentation, and grammatical analysis. The index construction module builds an index over the microblog data set to enable fast retrieval. Lucene is used to build the inverted index. Lucene is an open-source full-text search engine framework that provides a complete query engine and index engine and supports Boolean operations, fuzzy queries, grouped queries, and other operations. The inverted index is built and stored with Lucene.
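The idea behind the inverted index can be illustrated with a minimal sketch. This is plain Python and not Lucene's API; the function names `build_inverted_index` and `search_all` and the sample posts are assumptions made for illustration only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search_all(index, terms):
    """Boolean AND query: ids of documents that contain every term."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical, already segmented microblog posts.
posts = {
    1: ["phone", "battery", "great"],
    2: ["phone", "screen", "bad"],
    3: ["battery", "bad"],
}
index = build_inverted_index(posts)
```

A lookup such as `search_all(index, ["battery", "bad"])` intersects the posting lists of both terms, which is the basic operation a Boolean query relies on.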

The word relationship graph construction module is the core part of the present invention and also its innovative part. It is divided into a word segmentation process, an Eclat association rule mining process, an association rule word generation process, and a process that generates the weighted word relationship graph in combination with WordNet. In the word segmentation process, the text is segmented into individual words. The ICTCLAS software, which achieves comparatively high accuracy for Chinese word segmentation, is used for this purpose; the system was developed by the Chinese Academy of Sciences specifically for Chinese word segmentation. First, word segmentation is performed on the documents in the data set one after another; the documents of the various categories are then combined into a document set that is supplied to association rule mining. Association rule mining uses the Eclat algorithm, a depth-first algorithm with comparatively high mining efficiency. For a larger document, the mining of associated words can be carried out in separate sections and the results then combined. The present invention uses the support-lift association rule framework, which applies two evaluation formulas:

(1) Support formula: supp(X → Y) = |X ∪ Y| / |D|

(2) Lift (degree of interest) formula: lift(X → Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

Here |X ∪ Y| is the number of transactions that contain both X and Y, and |D| is the total number of transactions in the database; supp(X ∪ Y) is the fraction of transactions in the database that contain both X and Y, and supp(X) and supp(Y) are the fractions of transactions that contain X and Y, respectively.
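The two measures can be computed on the vertical (item to transaction-id set) layout that Eclat works on. The sketch below shows only the tidset intersection idea, not the full depth-first search; the function names and the toy transactions are assumptions for illustration.

```python
def tidsets(transactions):
    """Vertical layout used by Eclat: item -> set of transaction ids."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in set(items):
            vertical.setdefault(item, set()).add(tid)
    return vertical

def support(vertical, items, n):
    """Fraction of the n transactions that contain every item in `items`."""
    common = set.intersection(*(vertical[item] for item in items))
    return len(common) / n

def lift(vertical, x, y, n):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    return support(vertical, [x, y], n) / (
        support(vertical, [x], n) * support(vertical, [y], n)
    )

# Hypothetical segmented documents, one transaction per document.
transactions = [
    ["intellectual", "property", "law"],
    ["intellectual", "property"],
    ["law", "court"],
    ["court", "ruling"],
]
vertical = tidsets(transactions)
n = len(transactions)
```

In this toy data `intellectual` and `property` always occur together, so their lift exceeds 1, whereas `law` and `court` co-occur exactly as often as independence would predict.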

During mining, different support thresholds are set for different document sets. A frequent itemset obtained by mining generates an association rule only if its lift is greater than 1, because the present invention regards two words as positively correlated only when their lift is greater than 1. The mining process further introduces the notion of a compound word: if the lift of two words is greater than 4, the antecedent word and the consequent word of the rule are joined to form a compound word. The compound word forms a new rule with the antecedent and with the consequent of the original rule, respectively. The lift of each new rule is identical to that of the original rule, so the compound word can also be selected as an expansion word. After the associated words have been mined, the association rules are generated and stored in the format "X Y". This completes the mining and analysis of the association rule words.
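The two thresholds can be sketched as a small filter over precomputed lift values. The function name `generate_rules`, joining the two words with a space, and the direction of the compound-word rules (from each part to the compound) are assumptions; the thresholds 1 and 4 come from the text.

```python
def generate_rules(pair_lifts, rule_threshold=1.0, compound_threshold=4.0):
    """Keep pairs with lift > 1; add compound-word rules when lift > 4."""
    rules = []
    for (x, y), value in pair_lifts.items():
        if value <= rule_threshold:
            continue  # not positively correlated, so no rule is generated
        rules.append((x, y, value))
        if value > compound_threshold:
            compound = f"{x} {y}"  # assumed way of joining the two words
            # The compound word forms a rule with each part, with the same lift.
            rules.append((x, compound, value))
            rules.append((y, compound, value))
    return rules

# Hypothetical lift values for three word pairs.
pair_lifts = {
    ("intellectual", "property"): 5.2,
    ("law", "court"): 1.5,
    ("court", "weather"): 0.8,
}
rules = generate_rules(pair_lifts)
```

Only the first pair is strong enough to produce a compound word, and the third pair produces no rule at all.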

In the final step, the rule words and the WordNet ontology base are combined into a weighted word relationship graph. WordNet is a vocabulary-based semantic network. WordNet not only organizes vocabulary into concepts but also defines the concepts and various semantic relations between words (such as appositional words, hypernyms/hyponyms, antonyms, whole-part relations, entailment, etc.). The relations between words form a directed graph (as shown in FIG. 3). In this process, the rule words are mapped onto or added to the WordNet ontology base in a defined order. The construction principle of the weighted word relationship graph is that a directed edge from the antecedent to the consequent is added between the nodes of the two rule words. Rule words are added fully automatically, and two cases arise: 1. If the word already exists in the original WordNet graph, it only needs to be mapped onto the graph, after which the node data are updated; 2. If the word does not exist in the original WordNet graph, the word is added first, then the edge is added and the data are updated. After the graph is complete, the data of all nodes are tallied in turn. The resulting relationship graph can be expressed as a 4-tuple G = <V, E, f, g>, where V is the set of nodes, E is the set of edges, f is a function from V to the non-negative real numbers, defined as the degree of a node, and g is a function from E to the non-negative real numbers, defined as the weight of the edge between two nodes. Let d, d_i, d_j ∈ V, let deg(d) denote the degree of a node (the sum of its out-degree and in-degree), and let lift(d_i → d_j) denote the lift of the node words d_i and d_j. Then:

f(d) = deg(d) (1)

In the weighted word relationship graph (as shown in FIG. 4), the importance of a word within the whole graph depends on the degree of its node, that is, the sum of the node's out-degree and in-degree (the integer next to the node in FIG. 4). The value of an edge is its weight: the weight between WordNet words in the original WordNet graph is set to 1 (blue edge in FIG. 4), and the weight between words inserted by a rule is the lift of the two words (blue edge in FIG. 4). If two words are both WordNet-related words and rule words, the weight is the lift plus 1. In FIG. 4, a black edge points to a compound word (e.g. "intellectual property"), which has the same weight as the two rule words. This completes the construction of the weighted word relationship graph.
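The weighting scheme can be sketched with two dictionaries, one for the edge weights g and one for the node degrees f. The function name, the input shapes, and the sample edges are assumptions; the weights 1, lift, and lift plus 1 follow the description.

```python
from collections import defaultdict

def build_weighted_graph(wordnet_edges, rules):
    """Return (g, f): edge weights and node degrees of the combined graph.

    wordnet_edges: iterable of (src, dst) pairs taken from WordNet.
    rules: iterable of unique (antecedent, consequent, lift) triples.
    """
    g = {}
    for src, dst in wordnet_edges:
        g[(src, dst)] = 1.0  # WordNet relation: weight 1
    for src, dst, lift_value in rules:
        # Rule edge: lift; an edge that also exists in WordNet: lift + 1.
        g[(src, dst)] = g.get((src, dst), 0.0) + lift_value
    f = defaultdict(int)
    for src, dst in g:
        f[src] += 1  # contributes to the out-degree of src
        f[dst] += 1  # contributes to the in-degree of dst
    return g, dict(f)

# Hypothetical inputs.
wordnet_edges = [("law", "statute"), ("law", "court")]
rules = [("law", "court", 1.5), ("intellectual", "property", 5.2)]
g, f = build_weighted_graph(wordnet_edges, rules)
```

Here the edge from `law` to `court` is both a WordNet relation and a mined rule, so it receives the weight 1.5 + 1.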

The user retrieval module comprises a query input and query analysis process, an expansion word matching process, an expansion search word group generation process, an index retrieval process, and a result processing and display process. During query input, the query interface receives the query words or sentences entered by the user. During query analysis, word segmentation, stop-word removal, and keyword determination are performed on the user's input to obtain one or more keywords. In the expansion word matching process, the keywords from the previous step are looked up in the weighted word relationship graph to select suitable candidate expansion words: the words with the shortest distance to the original search word (that is, words at distance 1) are selected from the graph as candidate expansion words. In the expansion search word group generation process, the weight of each word is computed according to its relevance to the original search word, and the top p words are selected as the final expansion words. The present invention establishes formulas for computing the weight of each word. From the structure of the weighted word relationship graph it follows that the larger the weight of the edge between two nodes, the more relevant the two nodes are to each other, and the higher the degree of a node, the more important the node is.

The original search query is taken to be q = (q₁, q₂, ..., q_m), where each term q_i has n_i nearest words. The relevance degree between the original search term q_i and one of its nearest words d_ij is computed as:

[equation not reproduced]

where W(q_i, d_ij) is the relevance degree between the word q_i and the word d_ij, g(q_i, d_ij) is the weight of the edge between the two words, and f(d_ij) is the degree of the word d_ij. The weight of each nearest word is computed as:

[equation not reproduced]

Here W(d_k) is the weight of the word d_k and m is the number of original search words. After the weights of the candidate expansion words have been computed, they are sorted in descending order. The first p words are then selected and added to the original query, forming the expansion word group. The original query terms all have a weight of 1. The last step yields the expansion word group, which has, for example, the following form:

Q = (q₁, q₂, ..., q_m, d₁, d₂, ..., d_p) (4)
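The selection of the top p expansion words can be sketched as follows. The scoring used here is a stand-in and not the formula of the invention: it simply multiplies the edge weight g by the node degree f and averages over the m query terms, which respects the stated monotonicity (larger edge weight and larger degree both raise the score). Treating only outgoing edges as distance 1, and the function name `expand_query`, are likewise assumptions.

```python
def expand_query(query_terms, g, f, p):
    """Append the p distance-1 neighbours with the largest assumed weight."""
    m = len(query_terms)
    scores = {}
    for (src, dst), weight in g.items():
        if src in query_terms and dst not in query_terms:
            # Stand-in relevance: edge weight times node degree, averaged over m.
            scores[dst] = scores.get(dst, 0.0) + weight * f.get(dst, 0) / m
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return list(query_terms) + ranked[:p]

# Hypothetical edge weights and node degrees.
g = {("law", "court"): 2.5, ("law", "statute"): 1.0, ("law", "ruling"): 1.2}
f = {"law": 3, "court": 2, "statute": 1, "ruling": 1}
expanded = expand_query(["law"], g, f, p=2)
```

The result keeps the original query terms first and appends the selected expansion words, mirroring the form Q = (q₁, ..., q_m, d₁, ..., d_p).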

In the retrieval process, the expansion word group is returned to the query entry point, and an expanded retrieval is performed on the rich media database. In the result processing and display process, the ranked retrieval results are returned and displayed to the user.

Referring to FIG. 4, the method includes the following steps:

(1) Noise removal in the comment text data and conversion of the semantic form: Noise removal mainly removes interfering sentences, such as sentences in the subjunctive mood. Such sentences are not genuine objective comments and would interfere with the analysis in later phases. Emoticons are replaced by corresponding text so that the semantic form is converted into a form that is easy to process.
(2) Natural language processing: Word segmentation, part-of-speech tagging, and Chinese syntactic analysis are performed on the comment text data using the Stanford NLP software.
(3) Extraction of emotion phrases in combination with the emotion dictionary: Because the emotion words in the comment text data are concentrated on a small number of POS tagger labels, emotion phrases are extracted using the part-of-speech tags together with the emotion dictionary. The extraction uses the sentiPY method developed by the inventors. In the present system, emotion phrases have a uniform form, phrase: modifier · sentiment; that is, a phrase contains one central emotion word and may additionally contain several adverbs that modify it.
(4) Filtering of emotion phrases: The coarse-grained emotion phrases extracted in step (3) are filtered so that their form becomes cleaner, which improves the accuracy of the final polarity classification.
(5) Emotion analysis and output of results: A hybrid decision algorithm based on the emotional landing point is designed. The algorithm can effectively analyze comment text data from various domains.
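The phrase form modifier · sentiment from step (3) can be illustrated with a small sketch. This is not the sentiPY method, whose details the text does not give; the function name, the adverb tag "AD" (a placeholder for whatever adverb label the tagger emits), and the toy lexicon are assumptions.

```python
def extract_emotion_phrases(tagged_tokens, sentiment_lexicon, adverb_tag="AD"):
    """Return (modifiers, sentiment_word) pairs from a POS-tagged sentence."""
    phrases = []
    modifiers = []
    for word, tag in tagged_tokens:
        if tag == adverb_tag:
            modifiers.append(word)  # candidate modifier of a following emotion word
        elif word in sentiment_lexicon:
            phrases.append((tuple(modifiers), word))
            modifiers = []
        else:
            modifiers = []  # adverbs not directly before an emotion word are dropped
    return phrases

# Hypothetical tagged comment and emotion dictionary.
tagged = [
    ("screen", "NN"), ("very", "AD"), ("good", "VA"), ("but", "CC"),
    ("battery", "NN"), ("not", "AD"), ("too", "AD"), ("durable", "VA"),
]
lexicon = {"good", "durable"}
phrases = extract_emotion_phrases(tagged, lexicon)
```

Each result pairs one central emotion word with the adverbs immediately preceding it, which is the unit the later filtering and polarity steps operate on.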

FIG. 5 shows an example of the graph structure used when optimizing emotional strength on the basis of neighbor relations. Referring to FIG. 5, the emotion words in the comment text data are treated as nodes in the graph. The emotional strength can be computed with a propagation-based algorithm. The neighbor relations of the emotion words are computed from the emotion dictionary, and NGD is used to compute the weight between the nodes of two emotion words, so that a directed graph is formed. FIG. 3 shows the graph structure of a comment.
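Assuming that NGD stands for the Normalized Google Distance (the text does not expand the abbreviation), the distance between two emotion words can be computed from occurrence counts as sketched below. The function name and parameters are illustrative, and how the distance is turned into an edge weight is not specified in the text.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance.

    fx, fy: number of documents containing each word (both > 0),
    fxy: number of documents containing both words,
    n: total number of documents (larger than fx and fy).
    """
    if fxy == 0:
        return math.inf  # the words never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))
```

Words that always occur together get a distance of 0, and the distance grows as co-occurrence becomes rarer relative to the individual frequencies.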

FIG. 6 shows a flow chart of the emotional landing point algorithm. Referring to FIG. 4, the goal of this step is to find the emotional landing point of a comment. The emotional landing point is the part of a comment that carries the emotion the author wants to express. It is determined mainly from summarizing words (such as "overall") and by comparing the emotional strengths at the beginning and end of the comment with the strongest emotion phrases in the sentence.

FIG. 7 shows a flow chart of the extraction of the microblog emotion entity of the present invention.

Referring to FIG. 1, the extraction of the emotion entity in the present invention comprises microblog data collection, data preprocessing, feature extraction, dictionary loading, labeling and correction, model training, emotion object extraction, and other steps. During microblog data collection, microblog data from the Internet are stored in the form of files. The emotion object extraction model obtained by model training is also stored for use in object extraction. The results of emotion object extraction are stored in the form of files so that the user can review and correct the predicted results.

Microblog data collection gathers microblog data from microblog systems on the Internet (such as Sina Microblog, Twitter, and Tencent Microblog) and stores the collected raw microblog data as files organized in a defined way, in order to support later processing by the system.

Data preprocessing applies several preprocessing steps to the raw microblog data to facilitate the later feature extraction. The module comprises data cleaning, data conversion, sentence segmentation, word segmentation, part-of-speech tagging, and syntactic parsing. Details are shown in FIG. 2.
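The cleaning and sentence segmentation parts of this preprocessing can be sketched with the standard library. The regular expressions, the function names, and the whitespace-based handling of "#" topics and "@" contacts are assumptions; emoticon extraction and slang conversion are omitted, and splitting on spaces is left out because the sample text is English.

```python
import re

URL_RE = re.compile(r"https?://\S+")
SENT_RE = re.compile(r"[.!?。！？~]+")

def clean_post(text):
    """Drop links, drop leading/trailing #topic and @user tokens, and
    strip only the '#' and '@' symbols from tokens inside the post."""
    tokens = URL_RE.sub(" ", text).split()

    def is_tag(token):
        return token.startswith(("#", "@"))

    while tokens and is_tag(tokens[0]):
        tokens.pop(0)
    while tokens and is_tag(tokens[-1]):
        tokens.pop()
    return " ".join(t.replace("#", "").replace("@", "") for t in tokens)

def split_sentences(text):
    """Split on sentence punctuation and on '~', dropping empty pieces."""
    return [part.strip() for part in SENT_RE.split(text) if part.strip()]

# Hypothetical raw post.
raw = "#launch @alice the new #phone is great~ battery lasts long! http://example.com/x @bob"
cleaned = clean_post(raw)
sentences = split_sentences(cleaned)
```

The topic and contacts at the edges of the post disappear entirely, while `#phone` inside the sentence keeps its word and loses only the symbol.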

Dictionary loading loads the dictionaries required for data preprocessing and feature extraction. These include an emotion dictionary, a stop-word dictionary, a dictionary of common Internet expressions, and other dictionary data.

Feature extraction uses the loaded dictionary data and extracts the predefined features from the preprocessed data in order to vectorize the text and convert it into a format that the object extraction module can process.

Emotion object model training trains the emotion object extraction model, which is the core of the system. The training data, converted into the required format, are obtained from the labeling and correction module. The CRF model built from the training data is trained with the L-BFGS algorithm. The CRF model used in the present invention is built on the linear-chain CRF (conditional random field) model and is the first application of the CRF model to the identification of emotion objects. Global variables are added to the conventional CRF model so that situations can be identified in which the emotion object does not appear explicitly in the label sequence.
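For orientation, the conventional linear-chain CRF that the described model extends defines the conditional probability of a label sequence y given an observation sequence x as shown below. The notation (feature functions f_k, weights λ_k, sequence length T) is the usual textbook notation and is not taken from the source; the global variables added by the invention are not shown.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Training with L-BFGS then amounts to choosing the weights λ_k that maximize the (typically regularized) log-likelihood of the labeled training sequences.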

Emotion object extraction extracts the emotion object from the microblog data. This step mainly performs inference with the model produced by model training in order to extract the objects.

Labeling and correction: The CRF model used in the present invention is a supervised statistical learning method, so the data must be labeled. In addition, a feedback mechanism is introduced to learn from error analysis information. Existing methods generally do not process misclassification results, although this feedback contains a large amount of useful information. Making full use of this information is the key to the system's self-learning. With the feedback mechanism, the model can relearn from the results of error analysis, so that the system retains good accuracy over the course of its use.

FIG. 8 shows a schematic diagram of the implementation of the data preprocessing step of the present invention. The data preprocessing step comprises the following steps:

(1) Data cleansing step: the data are read from the raw microblog data collected by the data acquisition module, and the data cleansing part of the preprocessing is performed to filter out empty, invalid microblog data.
(2) Data conversion step: this step processes the data passed on from step (1). Some contents of the microblog data are converted in order to facilitate the processing in steps (3), (4), (5) and (6). The following situations are common: (a) the microblog often contains information that is invalid for the task and should be deleted; (b) links (such as image links and website links) that are useless for the task, and some special strings, should be deleted; (c) the microblog often contains a topic marked with the symbol "#" and a contact marked with the symbol "@"; topics and contacts that appear at the beginning or end of the microblog are deleted entirely, whereas inside a microblog sentence only the symbols "#" and "@" are deleted; (d) the microblog often contains emoticons, which carry strong emotional tendencies and are helpful information for the task; however, emoticons can impair the accuracy of word segmentation, part-of-speech (POS) tagging and syntax analysis, so they should be extracted during this process; (e) some Internet slang in the microblog should be converted; for example, the Internet expression "VS" is converted into the standard expression "mighty", which helps improve the accuracy of word segmentation, POS tagging and syntax analysis.
(3) Microblog sentence segmentation step: the conditional random field model in the emotion object identification method of the present invention performs information extraction by sequence labeling at the sentence level. However, a microblog may contain more than one sentence, so sentence segmentation must be performed. Sentence segmentation is mainly carried out according to punctuation marks. Because of the particular nature of microblogs, however, segmentation by punctuation alone is not sufficient. Many people habitually use spaces or special characters (such as "~") to separate sentences in a microblog, so these cases are also handled during sentence segmentation.
(4) Word segmentation step: in the conditional random field model of the emotion object identification method of the present invention, each word in the sentence-level sequence is labeled, so word segmentation must be performed. During word segmentation, some common Internet words (such as "crazy" and "crowd peeps") are used to improve the accuracy of word segmentation.
(5) POS tagging step: in this step, each word obtained by word segmentation is given a part-of-speech tag, so that the corresponding POS features are available to the feature extraction model of the present invention during feature extraction.
(6) Syntax analysis step: in this step, the syntactic dependencies between the words of a sentence are analyzed with syntax analysis tools, so that the corresponding dependency features of the words are available to the feature extraction model of the present invention during feature extraction.
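Steps (1) to (3) above can be illustrated with a minimal Python sketch. It is not the patent's implementation: the regular expressions, the bracketed emoticon format such as "[smile]", the rule that two or more spaces count as a sentence break, and all function names are assumptions chosen for the demonstration. Only the "VS" to "mighty" conversion is taken from the text.

```python
import re

# All patterns below are illustrative assumptions, not taken from the patent.
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"\[[^\[\]\s]+\]")          # e.g. "[smile]"
TAG = r"(?:#[^#\s]+#?|@\w+)"                         # "#topic#" or "@contact"
LEADING_TAG_RE = re.compile(rf"^\s*(?:{TAG}\s*)+")
TRAILING_TAG_RE = re.compile(rf"(?:\s*{TAG})+\s*$")
SLANG = {"VS": "mighty"}                             # example from the text
SENTENCE_SPLIT_RE = re.compile(r"[.!?;。！？；~]+|\s{2,}")


def clean(posts):
    """Step (1): drop empty or whitespace-only posts."""
    return [p for p in posts if p and p.strip()]


def convert(text):
    """Step (2): return (converted_text, extracted_emoticons)."""
    text = URL_RE.sub(" ", text)                     # (b) remove links
    emoticons = EMOTICON_RE.findall(text)            # (d) keep emoticons aside
    text = EMOTICON_RE.sub(" ", text)
    text = LEADING_TAG_RE.sub("", text)              # (c) tags at the start
    text = TRAILING_TAG_RE.sub("", text)             # (c) tags at the end
    text = text.replace("#", "").replace("@", "")    # (c) symbols inside
    for slang, standard in SLANG.items():            # (e) normalize slang
        text = re.sub(rf"\b{re.escape(slang)}\b", standard, text)
    return text.strip(), emoticons


def split_sentences(text):
    """Step (3): split on punctuation, '~' and runs of spaces."""
    return [s.strip() for s in SENTENCE_SPLIT_RE.split(text) if s.strip()]


if __name__ == "__main__":
    raw = "#phones# @bob the #screen# is VS~ love it [smile] http://example.com/x @alice"
    converted, emos = convert(raw)
    print(converted, emos, split_sentences(converted))
```

The emoticons are returned separately rather than discarded, so that a later stage can still use them as sentiment cues while the segmentation, tagging and parsing tools see text without them.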

FIG. 9 shows a schematic diagram of the implementation of the training steps of the emotion object identification model of the present invention. Referring to FIG. 9, the labeled training data sets in this step come from the microblog data that were collected from the Internet by the data acquisition module and processed by the preprocessing module. Since the present invention uses the conditional random field (CRF) model for emotion object extraction and the CRF model is a supervised learning method, the training data set must be labeled manually for the training process. During model training, the user dictionaries, including the emotion dictionary and the stop word dictionary, are first loaded with the dictionary loading module. In the next step, the feature extraction module performs feature extraction and data normalization on the training data sets, using the dictionaries loaded in the previous step. In the final step, the model training module trains the model parameters on the data normalized in the second step, and the model parameters are obtained by training with the L-BFGS algorithm.
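For orientation, the objective that L-BFGS minimizes can be written out for a standard linear-chain CRF. This is the textbook form, not the patent's extended model with global nodes; the feature functions $f_k$, the weights $\lambda_k$ and the training set notation $(x^{(j)}, y^{(j)})$, $j = 1, \ldots, N$, are notation assumed here for illustration.

```latex
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)}
  \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z_\lambda(x) = \sum_{y'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k \, f_k(y'_{i-1}, y'_i, x, i) \Big)

\mathcal{L}(\lambda) = - \sum_{j=1}^{N} \log p_\lambda\big(y^{(j)} \mid x^{(j)}\big),
\qquad
\frac{\partial \mathcal{L}}{\partial \lambda_k}
  = \sum_{j=1}^{N} \Big( \mathbb{E}_{y \sim p_\lambda(\cdot \mid x^{(j)})}\big[F_k(y, x^{(j)})\big] - F_k\big(y^{(j)}, x^{(j)}\big) \Big),
\quad
F_k(y, x) = \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x, i)
```

L-BFGS only needs the value $\mathcal{L}(\lambda)$ and this gradient at each iterate; in practice a regularization term is often added to $\mathcal{L}$, which the text does not specify.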

The form of the conditional random field model used in the present invention is shown in FIG. 10. The emotion object identification process is treated as a sequence labeling problem. X in the first layer of the model stands for the input microblog sentence, and x_i stands for the word at position i in the sentence. y_i in the second layer and g1, g2 in the third layer are the output states. The labels of these states can take one of 5 tags: L = {"N-B", "N-I", "P-B", "P-I", "O"}, which is the value space of the tag at each position of the sequence during the sequence labeling process. The tag N-B marks the starting position of a negative emotion object. N-I marks a continuation of a negative emotion object (that is, the preceding tag must be N-B or N-I). The tag P-B marks the starting position of a positive emotion object. P-I marks a continuation of a positive emotion object (analogously, the preceding tag must be P-B or P-I). The tag O stands for everything else. Thus y_i ∈ L. For example, for the sequence {"cell phone", "screen", "very", "clear"}, "cell phone screen" is a positive emotion object, and the corresponding labeling result is {"P-B", "P-I", "O", "O"}.
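The tag scheme can be made concrete with a short Python sketch that checks the continuation constraint and reads emotion objects out of a labeled sentence. The function names and the choice to join words with a space are assumptions for illustration; the tag set and the example sentence come from the text above.

```python
TAGS = {"N-B", "N-I", "P-B", "P-I", "O"}


def is_valid(tags):
    """Check that every tag is in L and every *-I tag continues an object
    of the same polarity (preceding tag is *-B or *-I)."""
    prev = "O"
    for tag in tags:
        if tag not in TAGS:
            return False
        if tag == "N-I" and prev not in ("N-B", "N-I"):
            return False
        if tag == "P-I" and prev not in ("P-B", "P-I"):
            return False
        prev = tag
    return True


def extract_objects(words, tags):
    """Return a list of (emotion_object, polarity) pairs."""
    if len(words) != len(tags) or not is_valid(tags):
        raise ValueError("invalid tag sequence")
    objects, current, polarity = [], [], None
    for word, tag in zip(words, tags):
        if tag.endswith("-B"):                      # a new object starts
            if current:
                objects.append((" ".join(current), polarity))
            current = [word]
            polarity = "positive" if tag.startswith("P") else "negative"
        elif tag.endswith("-I"):                    # the object continues
            current.append(word)
        else:                                       # tag "O" closes any object
            if current:
                objects.append((" ".join(current), polarity))
            current, polarity = [], None
    if current:
        objects.append((" ".join(current), polarity))
    return objects


if __name__ == "__main__":
    words = ["cell phone", "screen", "very", "clear"]
    print(extract_objects(words, ["P-B", "P-I", "O", "O"]))
```

Because the polarity is encoded in the tag prefix, one decoding pass yields both the object span and its polarity.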

In the model, the two global nodes g1 and g2 stand for two separate emotion objects, so each can take only one of three tags: {"N-B", "P-B", "O"}. A global node can be a positive emotion object, in which case its tag is P-B; a negative emotion object, in which case its tag is N-B; or no emotion object, in which case its tag is "O". It cannot take the continuation tags N-I and P-I.

To improve the flexibility and extensibility of emotion object identification, the conditional random field model of the present invention is not limited to the graph structure shown in FIG. 9, and the representation of the non-explicit property is not limited to the two hidden nodes g1 and g2; it can be extended to g1...gn (n >= 1) as shown in FIG. 11.

The foregoing detailed embodiments further explain the object, the technical solution and the advantages of the present invention. It should be understood that the foregoing is merely a detailed embodiment of the present invention, and the present invention is not limited to it. All changes, equivalent substitutions and improvements made on the basis of the concept and principle of the present invention are to be regarded as falling within the scope of protection of the present invention.

Claims

Search system of the emotion entity for the microblog, characterized in that it comprises the following modules: a user interface for interaction between the system and the user, through which the user can submit a query and receive feedback; a query extension module for mining the word relationships of the microblog corpus data, wherein a relationship diagram of the weighted words is established in connection with the WordNet essence bank; a query processing module for converting the user's query into query keywords and query words that are acceptable to the index bank, wherein a query extension is executed on the basis of the relationship diagram of the words established by the module (2); a mining module of the emotion information for mining the emotions in the microblog corpus, wherein the judgment rules for the emotion entity and the emotional polarity are generated; a judgment and index establishment module of the emotion information for judging the emotion entity and the emotional polarity of the microblog data in order to establish and store the index of the emotion information; an inverted index establishment module for establishing and storing the inverted index for the microblog text information.

Search system of the emotion entity for the microblog according to claim 1, characterized in that in the module (1) the query extension is realized by the following steps: mining association rules from the data in the microblog corpus, whereby the associated word groups are obtained as the output of the association rule mining; establishing the relationship diagram of the weighted words from the frequent itemsets obtained in step (11) and the WordNet essence bank.

Search system of the emotion entity for the microblog according to claim 1, characterized in that in step (11) the frequent itemsets of the microblog corpus are found using the Eclat algorithm, whereby the associated word groups are generated, and wherein the associated word groups and the WordNet essence bank are combined by mapping or insertion into a relationship diagram of the weighted words; and wherein, in constructing the relationship diagram of the weighted words, the weight of a node is calculated as follows:

$$f(d) = \deg(d) = \deg^{+}(d) + \deg^{-}(d)$$

and where $\mathrm{lift}(d_i \to d_j)$ is the degree of association between $d_i$ and $d_j$, obtained with the aid of the Eclat algorithm.
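The quantities named in this claim can be sketched in Python. The claim does not spell out how lift is computed or how the graph is stored, so the sketch assumes the standard definition lift(a → b) = P(a, b) / (P(a) · P(b)), represents each microblog as a list of words, stores the graph as a list of directed edges, and restricts the Eclat-style tidset intersection to word pairs. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tidsets(transactions):
    """Vertical layout used by Eclat: word -> set of transaction ids."""
    index = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in set(items):
            index[item].add(tid)
    return index


def frequent_pairs(index, min_support):
    """Frequent 2-itemsets found by intersecting tidsets
    (the core Eclat operation, limited here to pairs)."""
    frequent = {w: t for w, t in index.items() if len(t) >= min_support}
    pairs = {}
    for a, b in combinations(sorted(frequent), 2):
        common = frequent[a] & frequent[b]
        if len(common) >= min_support:
            pairs[(a, b)] = common
    return pairs


def lift(index, n, a, b):
    """Standard lift: P(a, b) / (P(a) * P(b)) over n transactions."""
    joint = len(index[a] & index[b]) / n
    return joint / ((len(index[a]) / n) * (len(index[b]) / n))


def node_weight(edges, d):
    """f(d) = deg+(d) + deg-(d) for directed edges (source, target)."""
    out_degree = sum(1 for source, _ in edges if source == d)
    in_degree = sum(1 for _, target in edges if target == d)
    return out_degree + in_degree


if __name__ == "__main__":
    posts = [["phone", "screen"], ["phone", "screen", "battery"],
             ["battery"], ["phone"]]
    idx = tidsets(posts)
    print(frequent_pairs(idx, min_support=2))
    print(lift(idx, len(posts), "phone", "screen"))
```

A lift above 1 means the two words co-occur more often than independence would predict, which is why it is a natural candidate for an edge weight between associated words.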

Search system of the emotion entity for the microblog according to claim 1, characterized in that in the module (3) the query processing is realized by the following steps: receiving the query word or words entered by the user; performing word segmentation, stop word removal and keyword determination on the user's input to obtain one or more keywords; selecting suitable extension words for the keywords from the relationship diagram of the weighted words, which is formed from the essence bank and the rule words, wherein a weight is calculated for each extension word; selecting the p words with the greatest weight and adding them to the search word group, wherein the extended word group is passed to the query interface.
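The query processing steps can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization stands in for real word segmentation, the stop word list is made up, and the relationship diagram is assumed to be available as a dictionary that already holds a precomputed weight for each neighboring word. The names `STOPWORDS`, `keywords` and `expand` are hypothetical.

```python
import heapq

STOPWORDS = {"the", "a", "of", "is"}  # hypothetical stop word list


def keywords(query):
    """Tokenize on whitespace (a stand-in for word segmentation)
    and drop stop words."""
    return [w for w in query.lower().split() if w not in STOPWORDS]


def expand(query, graph, p):
    """Return the keywords plus the p heaviest extension words.

    graph maps a word to {neighboring_word: weight}; the weights are
    assumed to be precomputed elsewhere.
    """
    keys = keywords(query)
    candidates = {}
    for key in keys:
        for word, weight in graph.get(key, {}).items():
            if word not in keys:
                # keep the best weight if several keywords share a neighbor
                candidates[word] = max(weight, candidates.get(word, 0.0))
    best = heapq.nlargest(p, candidates.items(), key=lambda kv: (kv[1], kv[0]))
    return keys + [word for word, _ in best]


if __name__ == "__main__":
    graph = {
        "phone": {"handset": 0.9, "screen": 0.6, "battery": 0.4},
        "screen": {"display": 0.8, "phone": 0.7},
    }
    print(expand("the screen of the phone", graph, p=2))
```

Limiting the expansion to the p heaviest neighbors keeps the extended query focused; a larger p trades precision for recall.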

Search system of the emotion entity for the microblog according to claim 4, characterized in that in step (33) the weight of an extension word is calculated as follows: the actual search query is assumed to be $q = (q_1, q_2, \ldots, q_m)$, where the item $q_i$ has $n_i$ neighboring words

[equation missing in the source text]

and where the degree of relevance between the actual search term $q_i$ and the neighboring word $d_{ij}$ is calculated as:

[equation missing in the source text]

and where $W(q_i, d_{ij})$ is the degree of relevance between the word $q_i$ and the word $d_{ij}$, $g(q_i, d_{ij})$ is the weight of the two words, and $f(d_{ij})$ is the degree of the word $d_{ij}$, and where the weight of all the neighboring words is calculated as:

[equation missing in the source text]

Search system of the emotion entity for the microblog according to claim 1, characterized in that in the module (4) the identification and the judgment of the emotion entity are realized by the following steps: collecting representative microblog data; preprocessing the collected microblog data, including cleansing, conversion, sentence segmentation, word segmentation, part-of-speech tagging and syntax analysis; performing feature extraction on the microblog data, so that the data are expressed as feature vectors; training the recognition model of the emotion entity to obtain the model parameters; outputting and storing the judgment model of the emotion entity.

Search system of the emotion entity for the microblog according to claim 6, characterized in that in step (43) the feature extraction is realized as follows: a user-defined dictionary and global features are designed in connection with the context of the words, the feature extraction of the microblog data is performed in accordance with the user-defined dictionary, and the microblog data are converted into the input data format that the recognition model of the emotion entity can process.

Search system of the emotion entity for the microblog according to claim 6, characterized in that in step (44) the recognition model of the emotion entity is realized as follows: nodes for the global features are introduced into the conditional random field (CRF) model to form a GLCRF model in which the global features are added, and the training is performed using the L-BFGS algorithm to obtain the model parameters.

Search system of the emotion entity for the microblog according to claim 1, characterized in that in the module (5) the judgment of the emotional polarity of the microblog is realized by the following steps: removing the noise from the microblog and converting the semantic form; word segmentation, part-of-speech tagging and analysis of the Chinese grammar; extracting the emotion word groups in connection with the emotion dictionary; filtering the emotion word groups; judging the emotional polarity and outputting the results.

Search system of the emotion entity for the microblog according to claim 9, characterized in that in step (53) the emotion word group is extracted with the sentiPY method, wherein the form of the emotion word group is uniformly expressed as the phrase modifier · sentiment, wherein a word group includes one central emotion word, and wherein the word group can simultaneously include a plurality of adverbs as modifiers; and wherein in step (55) the emotional polarity of the microblog is judged by means of the hybrid decision algorithm based on the emotional landing point, and wherein the judging process comprises the steps of: judging whether a sentence includes a summary word; if not, proceeding to step (552); if so, the words after the summary word are used as the emotional landing point, and the polarity of the emotional landing point is output as the emotional polarity of the microblog; using the beginning and end sentences of the microblog as emotional landing points and comparing the emotional polarities of the beginning and end sentences, wherein step (553) follows when the two emotional polarities neutralize each other, and otherwise the stronger emotional polarity is output as the emotional polarity of the microblog; calculating the strengths of the emotion words of the whole microblog, wherein the strengths are summed and averaged, and the average strength is output as the emotional polarity of the microblog.
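The three-stage decision in this claim can be sketched in Python. The sketch makes several assumptions that the claim does not state: a microblog is given as a list of already segmented sentences, each emotion word has a signed strength in a small made-up lexicon, the polarity of a landing point is the sign of its summed strengths, and two polarities "neutralize each other" when their strengths sum to zero. The names `SUMMARY_WORDS`, `LEXICON` and `judge` are hypothetical.

```python
SUMMARY_WORDS = {"overall", "anyway"}                  # hypothetical
LEXICON = {"great": 2.0, "good": 1.0,                  # hypothetical signed
           "bad": -1.0, "terrible": -2.0}              # emotion strengths


def score(words):
    """Summed signed strength of the emotion words in a word list."""
    return sum(LEXICON.get(w, 0.0) for w in words)


def sign(value):
    return "positive" if value > 0 else "negative" if value < 0 else "neutral"


def judge(sentences):
    """sentences: list of word lists, one list per sentence."""
    if not sentences:
        return "neutral"
    # First stage: the words after a summary word are the landing point.
    for words in sentences:
        for i, word in enumerate(words):
            if word in SUMMARY_WORDS:
                return sign(score(words[i + 1:]))
    # Step (552): first and last sentence are the landing points.
    first, last = score(sentences[0]), score(sentences[-1])
    if first + last != 0:
        return sign(max((first, last), key=abs))       # stronger one wins
    # Step (553): average strength over all emotion words of the microblog.
    strengths = [LEXICON[w] for words in sentences for w in words if w in LEXICON]
    if not strengths:
        return "neutral"
    return sign(sum(strengths) / len(strengths))


if __name__ == "__main__":
    print(judge([["screen", "bad"], ["overall", "great", "phone"]]))
    print(judge([["good"], ["terrible", "terrible"], ["bad"]]))
```

The ordering reflects the idea that a concluding remark or the opening and closing sentences usually carry the author's overall stance, with the averaged strength as a fallback when those cues cancel out.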