JP2013525868A

JP2013525868A - System and method for determining sentiment expressed in a document

Info

Publication number: JP2013525868A
Application number: JP2012546247A
Authority: JP
Inventors: ズオン−バンミン
Original assignee: ズオン−バンミン
Priority date: 2009-12-24
Filing date: 2010-12-23
Publication date: 2013-06-20

Abstract

A system for determining sentiment expressed in a document, a computer-readable storage medium storing instructions therefor, and a computer-implemented method therefor are disclosed. A document is received from a plurality of documents. A sentence is identified in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords, wherein the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation, and wherein the at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence. At least one category corresponding to the at least one keyword of the sentence is determined, the at least one category being included in a list of categories generated using the list of keywords. Based on the at least one sentiment signature, at least one sentiment corresponding to the at least one category is determined.
[Selected drawing] FIG. 1

Description

(Cross-Reference to Related Applications)
This application claims priority to US Patent Application No. 12/977,513, filed December 23, 2010, entitled "System and Method for Determining Sentiment Expressed in a Document." This application further claims priority to US Provisional Patent Application No. 61/284,820, filed December 24, 2009, entitled "Sentiment Platform." This application further claims priority to US Provisional Patent Application No. 61/284,819, filed December 24, 2009, entitled "Mobile Sentiment Platform." This application further claims priority to US Provisional Patent Application No. 61/393,813, filed October 15, 2010, entitled "Sentiment Engine." Each of the four applications above is incorporated herein by reference in its entirety.
The disclosed embodiments relate generally to determining sentiment expressed in a document.

The Internet contains information on a wide variety of subjects. This information is written by experts in particular fields or by casual users (e.g., bloggers, reviewers, etc.). Search engines allow users to identify documents that contain information on subjects of interest to them. However, it is currently difficult to identify the sentiment that these users express about a particular subject.

US Patent Application No. 12/977,513
US Provisional Patent Application No. 61/284,820
US Provisional Patent Application No. 61/284,819
US Provisional Patent Application No. 61/393,813

A system for determining sentiment expressed in a document, a computer-readable storage medium storing instructions therefor, and a computer-implemented method therefor are disclosed. A document is received from a plurality of documents. A sentence is identified in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords, wherein the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation, and wherein the at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence. At least one category corresponding to the at least one keyword of the sentence is determined, the at least one category being included in a list of categories generated using the list of keywords. Based on the at least one sentiment signature, at least one sentiment corresponding to the at least one category is determined.

Like reference numerals refer to corresponding parts throughout the drawings.

FIG. 1 is a block diagram illustrating a network, according to some embodiments.
FIG. 2 is a block diagram illustrating a sentiment server, according to some embodiments.
FIG. 3 is a flowchart of a method for determining sentiment expressed in a document, according to some embodiments.
FIG. 4 is a flowchart of a method for extracting a list of keywords, according to some embodiments.
FIG. 5 is a flowchart of a method for generating a list of categories, according to some embodiments.
FIG. 6 is a flowchart of another method for generating a list of categories, according to some embodiments.
FIG. 7 is a flowchart of another method for determining at least one category corresponding to at least one keyword of a sentence, according to some embodiments.
FIG. 8 is a flowchart of a method for determining a category spectrum for a category, according to some embodiments.
FIG. 9 is a flowchart of a method for selecting a plurality of documents from a corpus of documents, according to some embodiments.
FIG. 10 is a block diagram of a machine, according to some embodiments.

The description that follows includes exemplary systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the inventive subject matter. It will be evident, however, to those skilled in the art that embodiments of the inventive subject matter may be practiced without these specific details. In general, well-known instruction instances, protocols, structures, and techniques have not been shown in detail.

The embodiments described herein provide techniques for determining sentiment expressed in a document.

FIG. 1 is a block diagram illustrating a network 120, according to some embodiments. The network 120 can generally include any type of wired or wireless communication channel capable of coupling computing nodes together. This includes, but is not limited to, a local area network, a wide area network, or a combination of networks. In some embodiments, the network 120 includes the Internet.

In some embodiments, a server 100 is coupled to the network 120. The server 100 includes documents 102. The documents 102 may be any type of document, including, but not limited to, web documents (e.g., Hypertext Markup Language (HTML) documents, Extensible Markup Language (XML) documents, etc.), text documents, spreadsheets, presentations, scanned documents (e.g., scanned text), images, and the like.

In some embodiments, an aggregator 104 is coupled to the network 120. The aggregator 104 includes documents 106. In some embodiments, the aggregator 104 obtains at least a subset of the documents 102 from the server 100. For example, the aggregator 104 crawls the server 100 and retrieves at least a subset of the documents 102 from the server 100.

In some embodiments, a sentiment server 108 is coupled to the network 120. The sentiment server 108 is configured to determine sentiment expressed in documents, as described herein. The documents used by the sentiment server 108 include documents obtained from the server 100 (e.g., by crawling the server 100), documents obtained from the aggregator 104 (e.g., by purchasing documents from the aggregator 104), or a combination thereof.

FIG. 2 is a block diagram illustrating the sentiment server 108, according to some embodiments. The sentiment server 108 includes a sentiment module 202 configured to determine sentiment expressed in documents, an optional crawling module 204 configured to crawl the server 100 to obtain at least a subset of the documents 102, a keyword module 206 configured to extract keywords from documents, a filtering module 208 configured to filter keywords and documents, and a classification module 210 configured to classify documents, sentences, and/or keywords. Note that the functionality of these modules may be combined. For example, the sentiment module 202 may include the functionality of the keyword module 206 and the filtering module 208. These modules are described in more detail below with reference to FIGS. 3 to 9.

Determining Sentiment Expressed in a Document
FIG. 3 is a flowchart of a method 300 for determining sentiment expressed in a document, according to some embodiments. The sentiment module 202 receives a document from a plurality of documents (302). For example, the plurality of documents includes at least a subset of the documents 102, at least a subset of the documents 106, or a combination thereof. The process of selecting the plurality of documents is described in more detail below with reference to FIG. 9.

The sentiment module 202 then identifies a sentence in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords (304). The at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence. In some embodiments, the at least one sentiment signature includes at least one word (e.g., an adjective) indicating that an expression of the at least one sentiment is present in the sentence. In some embodiments, a polarity is associated with the sentiment signature. For example, the polarity indicates whether the sentiment signature reflects a positive sentiment, a negative sentiment, or a neutral sentiment.

Note that the sentiment module 202 can also identify grammatical units that are larger or smaller than a sentence. For example, the sentiment module 202 may identify a paragraph or a phrase that includes at least one sentiment signature within a predetermined distance of at least one keyword.

In some embodiments, the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation. These embodiments are described in more detail below with reference to FIG. 4.

In some embodiments, the at least one sentiment signature is included in a list of sentiment signatures. The list of sentiment signatures may be generated manually.

The classification module 210 then determines at least one category corresponding to the at least one keyword of the sentence (306). In some embodiments, the at least one category is associated with a product, a service, or a combination thereof. The process of determining at least one category corresponding to at least one keyword of a sentence is described in more detail below with reference to FIGS. 7 and 8. In some embodiments, the at least one category is included in a list of categories generated using the list of keywords. These embodiments are described in more detail below with reference to FIGS. 5 and 6.

The sentiment module 202 then determines at least one sentiment corresponding to the at least one category based on the at least one sentiment signature (308). In some embodiments, the at least one sentiment is an expression of an opinion related to the at least one category.

To clarify the process described with reference to FIG. 3, consider an example document that includes the example sentence "The room was stinky and the carpets were dirty." Assume that the words "stinky" and "dirty" are sentiment signatures expressing a negative sentiment (e.g., a negative polarity), that the words "room" and "carpets" are keywords, and that the predetermined distance is 3. The sentiment module 202 identifies this example sentence (304) because the sentiment signature "stinky" is two words away from the keyword "room" and the sentiment signature "dirty" is two words away from the keyword "carpets." The classification module 210 then determines a category corresponding to the keywords of the sentence (306). In this example, the classification module 210 determines that "hotel room" is the category for the keywords of the sentence. The sentiment module 202 then determines that the sentiment expressed about the hotel room (i.e., the category) is a negative sentiment (308).
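By way of illustration only, the example above might be sketched in Python as follows. The function names, the regular-expression tokenizer, and the rule of summing signature polarities are assumptions made for this sketch and are not taken from the disclosure; the keyword and signature lists are the ones assumed in the example.

```python
import re

def find_signature_keyword_pairs(sentence, keywords, signatures, max_distance=3):
    """Return (keyword, signature, distance) triples found within max_distance words."""
    # Assumed tokenizer: lowercase alphabetic words; the disclosure does not specify one.
    tokens = re.findall(r"[a-z']+", sentence.lower())
    keyword_positions = [(i, t) for i, t in enumerate(tokens) if t in keywords]
    signature_positions = [(i, t) for i, t in enumerate(tokens) if t in signatures]
    matches = []
    for ki, kw in keyword_positions:
        for si, sig in signature_positions:
            distance = abs(si - ki)
            if distance <= max_distance:
                matches.append((kw, sig, distance))
    return matches

def sentence_sentiment(sentence, keywords, signatures, max_distance=3):
    """Label the sentence from the polarities of signatures near a keyword.

    Returns None when no signature lies within max_distance of any keyword.
    Summing polarities is an assumption of this sketch.
    """
    matches = find_signature_keyword_pairs(sentence, keywords, signatures, max_distance)
    if not matches:
        return None
    matched_signatures = {sig for _, sig, _ in matches}  # count each signature once
    total = sum(signatures[sig] for sig in matched_signatures)
    if total < 0:
        return "negative"
    if total > 0:
        return "positive"
    return "neutral"

KEYWORDS = {"room", "carpets"}
SIGNATURES = {"stinky": -1, "dirty": -1}  # signature -> polarity (hypothetical values)
sentence = "The room was stinky and the carpets were dirty."

print(find_signature_keyword_pairs(sentence, KEYWORDS, SIGNATURES, 3))
print(sentence_sentiment(sentence, KEYWORDS, SIGNATURES, 3))
```

In this sketch the distance is measured as the difference in word positions, which reproduces the two-word separations stated in the example; mapping the keywords to the category "hotel room" (306) is left to the classification techniques described later.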

Extracting the List of Keywords
In some embodiments, prior to identifying a sentence in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from the list of keywords (304), the keyword module 206 extracts the list of keywords from the plurality of documents. FIG. 4 is a flowchart of a method 400 for extracting the list of keywords, according to some embodiments.

The keyword module 206 extracts keywords from each document in the plurality of documents (402).

For each keyword, the keyword module 206 performs the following operations. The keyword module 206 calculates the frequency of occurrence f of the keyword in the plurality of documents and the number N of documents that include the keyword (404). Next, the keyword module 206 uses the phase transition equation to calculate a relevance of the keyword based on the frequency of occurrence of the keyword in the plurality of documents and the number of documents that include the keyword (406). In some embodiments, the phase transition equation is

[equation not reproduced in this text]

where x ≥ 1. In some embodiments, x is 3. The keyword module 206 then adds the keyword to the list of keywords if the relevance of the keyword exceeds a predetermined threshold (408).
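The bookkeeping of steps 404 to 408 might be sketched as follows. Because the phase transition equation itself is not reproduced in this text, the sketch takes the relevance function as a caller-supplied argument; `placeholder_relevance` is a hypothetical stand-in used only to make the example run and is not the equation of the disclosure. The function names and the token-list document representation are likewise assumptions.

```python
from collections import Counter

def keyword_statistics(documents):
    """For each term, return (f, N): total occurrences and number of documents containing it."""
    frequency = Counter()
    document_count = Counter()
    for tokens in documents:
        frequency.update(tokens)            # contributes to f
        document_count.update(set(tokens))  # contributes to N (once per document)
    return {term: (frequency[term], document_count[term]) for term in frequency}

def filter_keywords(documents, relevance, threshold):
    """Keep the terms whose relevance(f, N) exceeds the threshold (step 408)."""
    stats = keyword_statistics(documents)
    return sorted(term for term, (f, n) in stats.items() if relevance(f, n) > threshold)

def placeholder_relevance(f, n):
    # Hypothetical stand-in: average occurrences per containing document.
    # This is NOT the phase transition equation of the disclosure.
    return f / n

documents = [
    ["room", "room", "clean"],
    ["room", "view"],
    ["view"],
]
print(filter_keywords(documents, placeholder_relevance, threshold=1.2))
```

Passing the relevance function in as a parameter keeps the counting logic independent of whichever equation is actually used.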

Generating the List of Categories
In some embodiments, prior to determining at least one category corresponding to the at least one keyword of the sentence (306), the classification module 210 generates the list of categories.

FIG. 5 is a flowchart of a method 500 for generating a list of categories, according to some embodiments. The classification module 210 identifies, among the plurality of documents, a first set of documents that include at least one keyword from the list of keywords (502). Next, the classification module 210 identifies sets of keywords that are included in at least a predetermined number of documents in the first set of documents (504). The classification module 210 then adds the sets of keywords to a list of categories, in which each category includes a respective set of keywords (506). Thus, in these embodiments, the classification module 210 determines categories by identifying keywords that appear in at least a predetermined number of documents.

FIG. 6 is a flowchart of a method 600 for generating a list of categories, according to some embodiments. The classification module 210 determines pairs of keywords in the list of keywords that are related to each other, each pair being a unique pair of keywords (602). Next, the classification module 210 identifies sets of keyword pairs, each set including at least one keyword that is common to all of the keyword pairs in the set (604). Until a predetermined termination condition is met, the classification module 210 iteratively combines the sets of keyword pairs such that each combined set includes at least one keyword that is common to all of the keyword pairs in the combined set (606).
Thus, in these embodiments, the classification module 210 determines sets of keywords that are related to each other and iteratively combines pairs (or larger groups of keywords) to form categories. For example, the classification module 210 identifies the following keyword pairs from the list of keywords: {Paris, romance}, {Paris, city of love}, {Paris, France}, {dog, beagle}, and {cat, Siamese}. Because the word "Paris" is common to the pairs {Paris, romance}, {Paris, city of love}, and {Paris, France}, the classification module 210 then determines that {Paris, romance, city of love, France} is a set of related keywords (e.g., a category). Note that the classification module 210 may instead determine that {Paris, romance, city of love} is the set of related keywords. The number of keywords associated with a particular category depends on several factors, including, but not limited to, the desired degree of specificity (e.g., a category that includes four keywords is more specific than a category that includes three keywords), the number of documents associated with the particular category, and the number of sentences associated with the particular category. In some embodiments, the predetermined termination condition depends on the degree of specificity desired for the categories. The more keywords are used to describe a category, the more specific the category is (e.g., {Paris, romance, city of love, France} is more specific than {Paris, romance, city of love}).
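One possible reading of the iterative combination of step 606 is sketched below. The greedy merge order and the use of a maximum category size (`max_keywords`) as the termination condition are assumptions of this sketch; the disclosure only states that the condition depends on the desired specificity.

```python
def build_categories(pairs, max_keywords=4):
    """Iteratively merge keyword-pair groups that share a keyword common to all their pairs."""
    # Each group tracks its member keywords (in insertion order) and the
    # keywords common to every pair merged into it so far.
    groups = [{"members": list(pair), "common": set(pair)} for pair in pairs]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                common = groups[i]["common"] & groups[j]["common"]
                members = groups[i]["members"] + [
                    k for k in groups[j]["members"] if k not in groups[i]["members"]
                ]
                # Merge only if a common keyword survives and the assumed
                # termination condition (category size) is not exceeded.
                if common and len(members) <= max_keywords:
                    groups[i] = {"members": members, "common": common}
                    del groups[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return [g["members"] for g in groups]

PAIRS = [
    ("Paris", "romance"),
    ("Paris", "city of love"),
    ("Paris", "France"),
    ("dog", "beagle"),
    ("cat", "Siamese"),
]
print(build_categories(PAIRS))                  # four-keyword Paris category
print(build_categories(PAIRS, max_keywords=3))  # less specific three-keyword category
```

With `max_keywords=4` the three Paris pairs collapse into one category, and with `max_keywords=3` the merge stops one step earlier, mirroring the two alternatives given in the example.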

Determining the Category Corresponding to a Keyword
Several techniques can be used to determine the category corresponding to a keyword.

In some embodiments, the classification module 210 uses a support vector machine to determine at least one category corresponding to the at least one keyword of the sentence (306).

In some embodiments, the classification module 210 uses a neural network to determine at least one category corresponding to the at least one keyword of the sentence (306).

FIG. 7 is a flowchart of a method 700 for determining at least one category corresponding to at least one keyword of a sentence, according to some embodiments. The classification module 210 obtains a plurality of category spectra, each category spectrum including the frequencies of occurrence of keywords in the list of keywords that correspond to a respective category (702).

Next, the classification module 210 determines a category spectrum for the sentence based on the at least one keyword (704). In some embodiments, the classification module 210 normalizes the category spectrum of the sentence. The classification module 210 then calculates the dot product of the category spectrum of the sentence and each category spectrum in the plurality of category spectra (706). The classification module 210 then determines the at least one category to be the category corresponding to at least one dot product that exceeds a predetermined threshold (708). Note that a category spectrum can be represented by {word ID, frequency} pairs, in which the value of the word ID corresponds to a unique keyword and the frequency corresponds to the frequency of occurrence of the keyword corresponding to the word ID. For example, assume that the keyword "Paris" has a word ID of 8 and a frequency of occurrence of 1002. The category spectrum then includes the pair {8, 1002}. Note further that a category spectrum can also be represented visually. For example, on a 2D plot, the x-axis can be the word ID and the y-axis can be the frequency. Note further that the dot product of two category spectra is the sum, over the word IDs, of the products of the frequencies of occurrence of the keywords (e.g., $\sum_i \mathrm{frequency}_{1i} \times \mathrm{frequency}_{2i}$, where i corresponds to the word ID, $\mathrm{frequency}_{1i}$ is the frequency of occurrence of the keyword corresponding to word ID "i" for the first category, and $\mathrm{frequency}_{2i}$ is the frequency of occurrence of the keyword corresponding to word ID "i" for the second category).
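Steps 706 and 708 might be sketched as follows, with each spectrum stored as a dictionary from word ID to frequency. The word IDs, frequencies, category names, and threshold below are hypothetical values chosen for illustration; word IDs missing from a spectrum are treated as having frequency zero.

```python
def dot_product(spectrum_a, spectrum_b):
    """Sum, over word IDs, of the products of the two frequencies."""
    return sum(freq * spectrum_b.get(word_id, 0.0) for word_id, freq in spectrum_a.items())

def classify(sentence_spectrum, category_spectra, threshold):
    """Return the categories whose dot product with the sentence exceeds the threshold."""
    scores = {
        name: dot_product(sentence_spectrum, spectrum)
        for name, spectrum in category_spectra.items()
    }
    return {name: score for name, score in scores.items() if score > threshold}

# Hypothetical word IDs: 1 = "room", 2 = "carpets", 8 = "Paris".
category_spectra = {
    "hotel room": {1: 0.6, 2: 0.4},
    "travel": {8: 0.9, 1: 0.1},
}
sentence_spectrum = {1: 0.5, 2: 0.5}  # normalized spectrum of the example sentence

print(classify(sentence_spectrum, category_spectra, threshold=0.2))
```

A sparse dictionary representation is convenient here because most keywords do not occur in any given sentence, so only the word IDs present in the sentence spectrum need to be visited.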

In some embodiments, prior to obtaining the plurality of category spectra, the classification module 210 determines a category spectrum for each category. FIG. 8 is a flowchart of a method 800 for determining a category spectrum for a category, according to some embodiments. The classification module 210 obtains a corpus of documents corresponding to the category (802). Next, the classification module 210 extracts keywords from each document in the corpus of documents (804). The classification module 210 then filters the keywords using the phase transition equation to derive filtered keywords (806). Next, the classification module 210 determines the frequencies of occurrence of the filtered keywords in the corpus of documents (808). In some embodiments, the frequency of occurrence of each keyword is divided by the total count of occurrences of that keyword in the entire corpus across all categories. In some embodiments, a threshold is then applied to the resulting category spectrum. In these embodiments, if the amplitude obtained by dividing the total count of a keyword within a category by the total count of the keyword across all categories is greater than a predetermined threshold, the amplitude is reset to a new value by an equation or algorithm. In some embodiments, a threshold is then applied to the resulting spectrum. In these embodiments, if the amplitude obtained by dividing the count of a keyword within a category by the total count of pages in which the keyword appears across all categories is greater than a predetermined threshold, the amplitude is reset to a new value by an equation or algorithm. The new value is set to zero in order to eliminate common words that are used very frequently across all categories.

The classification module 210 then normalizes the frequencies of occurrence of the filtered keywords to derive the category spectrum for the category (810). In some embodiments, the category spectra are normalized so that the area under each category spectrum is the same. Doing so reduces comparison bias between categories.
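Steps 808 and 810 might be sketched as follows. The sketch interprets "area under the spectrum" as the sum of the amplitudes and normalizes that sum to 1; this interpretation, the function name, and the token-list document representation are assumptions, and the filtered keywords are taken as given.

```python
from collections import Counter

def category_spectrum(documents, filtered_keywords):
    """Count filtered keywords over the category's corpus and normalize to unit area."""
    counts = Counter(
        token
        for tokens in documents
        for token in tokens
        if token in filtered_keywords
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # no filtered keyword occurs in this corpus
    # Dividing by the total makes every category spectrum sum to 1, so that
    # categories with larger corpora do not dominate later comparisons.
    return {keyword: count / total for keyword, count in counts.items()}

hotel_documents = [
    ["room", "bed", "room"],
    ["bed", "lobby"],
]
spectrum = category_spectrum(hotel_documents, {"room", "bed"})
print(spectrum)
print(sum(spectrum.values()))
```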

文書を選択する
Documents 102 and/or documents 106 may include documents that have little or no value. For example, documents 102 and/or documents 106 may include machine-generated documents created to boost search engine rankings. Machine-generated documents typically do not contain expressions of sentiment, and it is usually desirable to filter out these types of documents. Thus, in some embodiments, prior to receiving the document from the plurality of documents, the filtering module 208 selects the plurality of documents from a corpus of documents. FIG. 9 is a flowchart of a method 900 for selecting a plurality of documents from a corpus of documents, according to some embodiments. For each document in the corpus, the filtering module 208 performs the following operations. The filtering module 208 extracts n-grams from the document (902). Next, the filtering module 208 determines, based on the extracted n-grams, an n-gram spectrum for the document that indicates the frequency of occurrence of n-grams as a function of n-gram size (904). The filtering module 208 then determines whether the n-gram spectrum of the document matches, within a predetermined threshold, a reference n-gram spectrum defined by a predetermined equation (906). In some embodiments, the predetermined equation is:

[equation not reproduced in the source text]

where x is the size of the n-gram, and a, b, and c are predetermined values chosen so that the peak of the predetermined equation lies between n-grams of size 2 and n-grams of size 3. In some embodiments, the value of b is between 1 and 2 and the value of c is between 1 and 2. If the n-gram spectrum of the document matches the reference n-gram spectrum within the predetermined threshold, the filtering module 208 adds the document to the plurality of documents (908). If the n-gram spectrum of the document does not match the reference n-gram spectrum within the predetermined threshold, the filtering module 208 discards the document (910).
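Operations 902 through 910 can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration: the "frequency of occurrence" at each n-gram size is taken to be the normalized number of distinct n-grams that occur more than once, a match means that every point of the spectrum lies within the threshold of the reference, and the reference spectrum is supplied by the caller as a list because the predetermined equation is not available in this text. The function names are hypothetical.

```python
from collections import Counter


def ngram_spectrum(tokens, max_n=5):
    """Normalized count of distinct repeated n-grams for each size 1..max_n.

    Assumption: 'frequency of occurrence' is read as the number of distinct
    n-grams that occur more than once in the document.
    """
    counts = []
    for n in range(1, max_n + 1):
        grams = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        counts.append(sum(1 for c in grams.values() if c > 1))
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def matches_reference(spectrum, reference, threshold):
    """True if every point of the spectrum is within `threshold` of the reference (906)."""
    return all(abs(s - r) <= threshold for s, r in zip(spectrum, reference))


def select_documents(corpus, reference, threshold, max_n=5):
    """Keep documents whose spectrum matches (908); discard the others (910)."""
    selected = []
    for text in corpus:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        spectrum = ngram_spectrum(tokens, max_n)  # operations 902 and 904
        if matches_reference(spectrum, reference, threshold):
            selected.append(text)
    return selected
```

In practice the `reference` list would be obtained by evaluating the predetermined equation at x = 1, 2, ..., max_n with the chosen values of a, b, and c.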

Exemplary Computer System
FIG. 10 depicts a block diagram of a machine, in the example form of a computer system 1000, within which a set of instructions may be executed to cause the machine to perform any one or more of the methodologies discussed herein. In alternative embodiments, the machine operates as a standalone device or is connected (networked) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.

The machine is capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The example computer system 1000 includes a processor 1002 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both) and a memory 1004, which communicate with each other via a bus 1008. The memory 1004 includes volatile memory devices (e.g., DRAM, SRAM, DDR RAM, or other volatile solid state memory devices), non-volatile memory devices (e.g., magnetic disk memory devices, optical disk memory devices, flash memory devices, tape drives, or other non-volatile solid state memory devices), or a combination thereof. The memory 1004 may optionally include one or more storage devices located remotely from the computer system 1000. The computer system 1000 may further include a video display unit 1006 (e.g., a plasma display, a liquid crystal display (LCD), or a cathode ray tube (CRT)). The computer system 1000 may also include an input device 1010 (e.g., a keyboard, mouse, trackball, or touch screen display), an output device 1012 (e.g., speakers), and a network interface device 1016. The above components of the computer system 1000 may be located within a single housing or case (depicted by the dotted line in FIG. 10). Alternatively, a subset of the components may be located outside the housing. For example, the video display unit 1006, the input device 1010, and the output device 1012 may reside outside the housing and be coupled to the bus 1008 via external ports or connectors accessible on the outside of the housing.

The memory 1004 includes a machine-readable medium 1020 on which are stored one or more sets of data structures and instructions 1022 (e.g., software) that embody, or are utilized by, any one or more of the methodologies or functions described herein. The one or more sets of data structures store data. Note that a machine-readable medium refers to a storage medium that is readable by a machine (e.g., a computer-readable storage medium). The data structures and instructions 1022 may also reside, completely or at least partially, within the memory 1004 and/or within the processor 1002 during execution by the computer system 1000, with the memory 1004 and the processor 1002 also constituting machine-readable, tangible media.

The data structures and instructions 1022 may further be transmitted or received over a network 120 via the network interface device 1016 using any one of a number of well-known transfer protocols (e.g., Hypertext Transfer Protocol (HTTP)). The network 120 can generally include any type of wired or wireless communication channel capable of coupling together computing nodes (e.g., the computer system 1000). This includes, but is not limited to, a local area network, a wide area network, or a combination of networks. In some embodiments, the network 120 includes the Internet.

Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code and/or instructions embodied on a machine-readable medium or in a transmission signal) or hardware modules. A hardware module is a tangible unit capable of performing certain operations and may be configured or arranged in a certain manner. In example embodiments, one or more computer systems (e.g., the computer system 1000) or one or more hardware modules of a computer system (e.g., a processor 1002 or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations described herein.

In various embodiments, a hardware module may be implemented mechanically or electronically. For example, a hardware module may comprise dedicated circuitry or logic that is permanently configured (e.g., as a special-purpose processor, such as a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) to perform certain operations. A hardware module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor 1002 or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.

Accordingly, the term "hardware" should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each piece of hardware need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a general-purpose processor 1002 configured using software, the general-purpose processor 1002 may be configured as respective different hardware modules at different times. Software may accordingly configure a processor 1002, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.

Modules can provide information to, and receive information from, other modules. For example, the described modules may be regarded as being communicatively coupled. Where multiple such hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) that connects the modules. In embodiments in which multiple modules are configured or instantiated at different times, communications between such modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple modules have access. For example, one module may perform an operation and store the output of that operation in a memory device to which it is communicatively coupled. A further module may then, at a later time, access the memory device to retrieve and process the stored output. Modules may also initiate communications with input or output devices and can operate on a resource (e.g., a collection of information).

The various operations of the example methods described herein may be performed, at least partially, by one or more processors 1002 that are temporarily configured (e.g., by software, code, and/or instructions stored in a machine-readable medium) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors 1002 may constitute processor-implemented (or computer-implemented) modules that operate to perform one or more operations or functions. The modules referred to herein may, in some example embodiments, comprise processor-implemented (or computer-implemented) modules.

Moreover, the methods described herein may be at least partially processor-implemented (or computer-implemented) and/or processor-executable (or computer-executable). For example, at least some of the operations of a method may be performed by one or more processors 1002 or processor-implemented (or computer-implemented) modules. Similarly, at least some of the operations of a method may be governed by instructions that are stored in a computer-readable storage medium and executed by one or more processors 1002 or processor-implemented (or computer-implemented) modules. The performance of certain of the operations may be distributed among the one or more processors 1002, not only residing within a single machine but deployed across a number of machines. In some example embodiments, the processors 1002 are located in a single location (e.g., within a home environment, within an office environment, or as a server farm), while in other embodiments the processors 1002 are distributed across a number of locations.

While the embodiment(s) are described with reference to various implementations and exploitations, these embodiments are illustrative, and the scope of the embodiment(s) is not limited to those implementations and exploitations. In general, the embodiments described herein may be implemented with facilities consistent with any hardware system or hardware systems defined herein. Many variations, modifications, additions, and improvements are possible.

Plural instances may be provided for components, operations, and/or structures described herein as a single instance. Finally, boundaries between the various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the embodiment(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the embodiment(s).

Exemplary Embodiments
The following discussion includes non-limiting exemplary embodiments for determining sentiment expressed in a document.

A. An initial list of seed words (the seed word list) is formed to help identify relevant corpus text. The list of seed keywords for a particular subject is obtained from source materials, including but not limited to subject glossaries, and from subject matter experts.

B. Source text is obtained from the documents in the corpus. For example, the corpus may include documents from websites (e.g., blogs, review websites, etc.). The words in each document are compared with the seed word list. If the number and/or ratio of words in the document that match members of the seed word list meets or exceeds a predetermined threshold, the document is added to an accepted document list. Otherwise, the document is discarded.
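The acceptance test in step B can be sketched as follows. This is a minimal illustration, not the patented implementation: the tokenizer, the function name `passes_seed_filter`, and the two threshold parameters `min_count` and `min_ratio` are assumptions introduced here to model the "number and/or ratio" criterion.

```python
import re


def passes_seed_filter(text, seed_words, min_count=1, min_ratio=0.0):
    """Accept a document if enough of its words match the seed word list.

    min_count and min_ratio are hypothetical thresholds on the number and
    the ratio of matching words, respectively.
    """
    words = re.findall(r"[a-z']+", text.lower())  # simple tokenizer (assumption)
    if not words:
        return False
    hits = sum(1 for w in words if w in seed_words)
    return hits >= min_count and hits / len(words) >= min_ratio


def accepted_documents(corpus, seed_words, min_count=1, min_ratio=0.0):
    """Build the accepted document list; non-matching documents are dropped."""
    return [d for d in corpus if passes_seed_filter(d, seed_words, min_count, min_ratio)]
```

For example, with the seed words `{"hotel", "room"}`, the text "The hotel room was clean" has two matching words out of five, so it is accepted with `min_count=2` and `min_ratio=0.3`.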

C. The initial seed word list is expanded by combining the words from the seed word list with a list of words obtained by extracting keywords and N-grams from the documents in the accepted document list (e.g., using standard keyword extraction and N-gram extraction tools). Before being combined with the seed word list, the keywords and N-grams are further refined using the process described below for computing the node structure of the selected documents.

D. A node structure of the documents is created to improve the initial set of keywords and N-grams. This process involves extracting information from words (e.g., in text format) to create a multifractal data structure. A multifractal data structure is a data structure that exhibits scales within scales: the pattern at one scale differs slightly from the pattern seen at another scale. Once the data has been extracted and its multifractal structure determined, the data structure is stored in memory so that a computer can access it and perform operations using it. This data structure represents the natural structure of the information and is useful for efficiently identifying relevant information. The data structure is assembled as follows.

1. Create a list of k seed category-name words (the seed word list).
2. Feed each of the k category-name words to a crawler and retrieve N(k) web pages for each of the k category-name words.
3. For each of the N(k) pages, build a list of unique words and count the number of times each word appears on the page. Stop words (e.g., "the", "his", "and") are removed at this stage. The set of words is further reduced to nouns and adjacent adjectives.
4. Take the M (e.g., 10) most frequent words from each of the N(k) pages. Assign indexes (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10) to the most frequent words of the first page, assign indexes (e.g., 11, 13, ..., 21) to the most frequent words of the second page, and so on until the N(1) pages of the first category are exhausted. A sequence of words may be identified by a single index (referred to as a subject group). Repeat for the N(2) pages of the second category name and continue until all k categories are exhausted.
5. Compute the sum of M over all of the N(k) pages for all k categories, and call this value Nwords.
6. Create a graph with a horizontal axis x and a vertical axis y.
7. Create an Nwords × Nwords matrix filled with zeros, and call it W.
8. Take the word data from the first web page and calculate the word link coefficients as follows.
a. For each of the listed most frequent words, create unique pairs (x, y), called word links, where x is not the same word as y.
b. Combine the word links with a page link coefficient, so that pairs with high word link and page link coefficients receive a higher selection score than pairs with low word link and page link coefficients. The page link coefficient is based on an analysis of the link structure and content of the page links coming into the web page. The selection score is the value used to pick tags from the data. If the page link coefficient is unavailable, it is set to a default value such as 1.0.
9. For computation purposes, do the following.
a. Store the word link coefficients in the matrix W at column x and row y for (x, y) = (1, 3), (1, 4), ..., (1, M+1), where x and y are word indexes from the list defined above.
b. Repeat the process for (x, y) = (2, 3), ..., (2, M+1), and finally for the Mth and (M+1)th words, i.e., (x, y) = (M, M+1). For the second category of pages, the first coefficient entry is (x, y) = (M+2, M+3). Each coefficient entry represents the connectivity of two words from a document that are found together or are associated with each other.
c. Repeat step 9b for all pages of all subjects, with the following exceptions.
i. Whenever the current word can be found earlier in the x word index, all of its coefficients are combined with the coefficients found at the earlier index, in such a way that a combination of large coefficients tends to be larger than a combination of small coefficients. For example, suppose the word "dog" corresponds to word index 4 and has an associated word coefficient W1 at (4, 8), and the word "dog" also appears on another page with index 25 and a word coefficient W2 at (25, 8). All of the connectivity coefficients of index 25 are then merged into the x value of index 4 for each corresponding y index. In this case, W(4, 8) is combined with W(25, 8) and the result is assigned to the new W(4, 8). After the coefficients have been transferred and combined, W(25, 8) is set to a small value such as 0.0. This rule applies to every instance of "dog" after the first.
ii. Duplicate words found on the y word index may be merged into the original y word index in the same or a similar manner as for the x word index. If they are not merged, the influence of each word as a function of subject remains visible, because each subject occupies a certain range of the y index.
10. Rank the words by calculating a word ranking score from the word link coefficients, in such a way that words with the most and highest word link coefficients rank higher than words with the fewest and lowest word link coefficients.
a. Let R(i) be the ranking of a given word node.
b. Let W(i, j) be the connectivity coefficient between words i and j.
c. For each i, compute R(i) = R(W(i, j)); that is, the ranking is a function of the word link weights (which already include the page ranking weights) for each j.
d. Select the words with the larger weights as tags and groups.
11. To analyze the data visually in graphical form, do the following.
a. Place dots, sized in proportion to the score based on the word and page link coefficients, at the coordinates (x, y) = (1, 3), (1, 4), ..., (1, M+1). Do the same for (x, y) = (2, 3), ..., (2, M+1), and finally for the Mth and (M+1)th words, (x, y) = (M, M+1). For the second category of pages, the first dot is at (x, y) = (M+2, M+3). Each dot represents the connectivity of two words from a document that are found together or are associated with each other. The x axis and y axis are word indexes from the list defined above.
b. Repeat step 11a for all documents of all subjects, with the following exceptions.
i. Whenever the current word can be found earlier in the x word index, all of its dots are recorded at the earlier index. For example, if the word "dog" has word index 4 and also appears at index 25, all of the connectivity dots of index 25 are moved to the x value of index 4. After the dots are moved, they are deleted from their original positions on the graph. This rule applies to every instance of "dog" after the first.
ii. Duplicate words found on the y word index may be merged into the original y word index in the same or a similar manner as for the x word index. If they are not merged, the influence of each word as a function of subject remains visible, because each subject occupies a certain range of the y index.
c. Examine the vertical stripes on the graph; the most prominent stripes correspond to the W1 indexes of the words to be selected as tag words or subject groups. The W1 words with the most links or dots are selected as tag words or subject groups. Additional filtering may be applied to these words to narrow the list further.
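Steps 3 through 10 above can be approximated with a short Python sketch. It departs from the text in ways that should be read as assumptions: coefficients are stored in a dictionary keyed by word pairs instead of an index matrix W, so repeated words such as "dog" are merged automatically; the word link coefficient of a pair is taken to be the product of the two word counts times the page link coefficient; coefficients from different pages are combined by addition; and the ranking R(i) is the sum of the coefficients of word i. The reduction to nouns and adjacent adjectives is omitted, and the stop word list is a small placeholder.

```python
from collections import Counter, defaultdict
from itertools import combinations

STOP_WORDS = {"the", "his", "and", "a", "of", "is", "was"}  # placeholder list


def top_words(text, m=10):
    """Return the m most frequent non-stop words of a page with their counts."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words).most_common(m)


def build_word_links(pages, m=10):
    """Accumulate word link coefficients over (text, page_link_coefficient) pairs.

    Assumption: coefficient = count(x) * count(y) * page link coefficient,
    summed over pages, so large contributions yield large combined values.
    """
    links = defaultdict(float)
    for text, page_coeff in pages:
        for (x, cx), (y, cy) in combinations(top_words(text, m), 2):
            links[tuple(sorted((x, y)))] += cx * cy * page_coeff
    return links


def rank_words(links):
    """Rank words by the sum of their link coefficients (assumed form of R(i))."""
    scores = defaultdict(float)
    for (x, y), weight in links.items():
        scores[x] += weight
        scores[y] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

A page link coefficient of 1.0 can be passed when no link analysis is available, matching the default mentioned in step 8b. The top entries returned by `rank_words` play the role of the candidate tag words.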

E. The "tag words" or "subject groups" are combined with the seed word list to create a working word list.

F. The working word list is augmented by obtaining further documents (e.g., via crawling, by obtaining documents from an aggregator, etc.) and applying steps C and D above to a predetermined number of additional documents (the additional list expansion number).

G. Step F is repeated until the growth rate of the harvested N-grams falls to a satisfactory rate.
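The stopping rule of steps F and G can be sketched as a loop that processes batches of additional documents until the set of harvested N-grams stops growing quickly. The growth measure (relative increase per batch), the `min_growth` threshold, and the bigram extractor are assumptions chosen for illustration; the text does not specify what counts as a satisfactory rate.

```python
def bigrams(text):
    """Hypothetical N-gram extractor: the set of word bigrams in a text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def harvest_until_saturated(batches, extract=bigrams, min_growth=0.2):
    """Process batches of documents until relative N-gram growth drops below min_growth."""
    harvested = set()
    for batch in batches:
        before = len(harvested)
        for doc in batch:
            harvested |= extract(doc)
        # Relative growth contributed by this batch; the first batch never stops the loop.
        if before and (len(harvested) - before) / before < min_growth:
            break
    return harvested
```

Each element of `batches` stands for one round of step F, that is, one additional group of documents of the predetermined size.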

H. For the sentiment analysis engine, lists of topics, adjectives, and proper nouns relating to the subject are created. For example, if the subject is "hotels", the topics/keywords may include concierge, bathroom, bed, bedroom, front desk, television, room service, maid service, cleanliness, elevator, restaurant, reservation, and checkout. The adjectives may include prompt, pleasant, slow, rude, clean, dirty, friendly, impressive, terrible, wonderful, confusing, inattentive, filthy, and unfriendly. The proper nouns may include Marriott, Hyatt, Four Seasons, Motel 6, InterContinental, Quality Inn, Howard Johnson, and so on.

I. Sentences that contain both a keyword and an adjective are identified for further analysis.

J. Each keyword and adjective group is correlated with the subject of interest (e.g., the name of a hotel chain) by looking for proper nouns before and after the keyword and adjective group. The proper nouns and the keyword/adjective groups are saved for further analysis and presentation.
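Steps I and J can be illustrated together with a small Python sketch. The sentence splitter, the single-word matching (multi-word keywords such as "front desk" would need phrase matching), and the choice to search for proper nouns in the matching sentence plus its immediate neighbors are all simplifying assumptions; the function name is hypothetical.

```python
import re


def find_opinion_sentences(text, keywords, adjectives, proper_nouns):
    """Find sentences with both a keyword and an adjective, plus nearby proper nouns.

    keywords and adjectives are sets of lowercase single words; proper_nouns
    is a list of names searched for in the sentence and its neighbors.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    results = []
    for i, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        found_keywords = words & keywords
        found_adjectives = words & adjectives
        if not (found_keywords and found_adjectives):
            continue  # step I: both a keyword and an adjective are required
        # Step J: look before and after the group (previous, current, next sentence).
        context = " ".join(sentences[max(0, i - 1):i + 2]).lower()
        names = [n for n in proper_nouns if n.lower() in context]
        results.append({
            "sentence": sentence,
            "keywords": sorted(found_keywords),
            "adjectives": sorted(found_adjectives),
            "proper_nouns": names,
        })
    return results
```

Each returned record corresponds to one saved keyword/adjective group together with the proper nouns associated with it.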

K. For review sites, dates are obtained and the sites are revisited periodically. The revisit period may be determined by identifying the dates of blog and/or review entries, computing the time period between one blog entry and the next, and deciding how often to revisit the review site based on the time periods between blog entries. A fixed time period (e.g., the minimum time period, one third of the average time period, etc.) may be chosen. Alternatively, a variable time period that depends on predetermined factors (e.g., the season) may be chosen. For example, in the case of hotels in the United States, more entries are posted around holidays such as the Fourth of July, so the period between crawls is shorter around that time. The variable period may also be chosen based on local events (e.g., the Indianapolis 500). For example, in the case of hotels, the period between crawls for hotels serving the Indianapolis 500 decreases around the time of the event. Once the sampling time period has been determined, the site is crawled for the next text.
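The fixed revisit period in step K can be computed directly from entry timestamps. The sketch below covers the two fixed options mentioned in the text, the minimum gap and one third of the average gap; the function name and the `mode` strings are hypothetical, and the seasonal or event-driven variable period is not modeled.

```python
from datetime import datetime, timedelta


def revisit_interval(entry_dates, mode="min"):
    """Derive a crawl interval from the dates of blog or review entries.

    mode="min" returns the shortest gap between consecutive entries;
    any other value returns one third of the average gap.
    """
    dates = sorted(entry_dates)
    gaps = [later - earlier for earlier, later in zip(dates, dates[1:])]
    if not gaps:
        return None  # fewer than two entries: no period can be derived
    if mode == "min":
        return min(gaps)
    average = sum(gaps, timedelta()) / len(gaps)
    return average / 3
```

A variable period could be layered on top of this, for instance by shortening the returned interval around known holidays or local events.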

L. A time dependency may be built for each category from each website.

M. A list of categories is obtained for a subject. A list of seed categories is obtained by looking at the headlines on a subject website. The frequency of occurrence of those words on other subject websites is then identified.

N. The importance of a review site is determined as follows.
a. The incoming and outgoing links of the review site are analyzed.
i. The legitimacy of the linked sites is checked to confirm that they are not spam links.
ii. The vocabulary distribution of the words on each linked site is analyzed.
iii. If the vocabulary spectrum of a linked site is not within a predetermined threshold of a reference vocabulary spectrum, the link is discarded.
b. For competing sites on a given subject, the update frequency of blog entries or of comments/ratings on the site is identified against the timestamp delta.
i. A short timestamp delta means that many people in the community are actively reading and updating that particular website on a regular basis. The site that receives legitimate comments most quickly is considered the most important site in the community.
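The two checks in item N can be sketched as below. The text does not say how a vocabulary spectrum is compared with the reference spectrum, so cosine similarity, the `threshold` of 0.3, and every function name here are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def vocabulary_spectrum(text):
    """Relative frequency of each word in the text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Similarity of two spectra (an assumed comparison measure)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keep_link(linked_text, reference_spectrum, threshold=0.3):
    """Keep a link only if the linked site's vocabulary resembles the reference."""
    similarity = cosine_similarity(vocabulary_spectrum(linked_text),
                                   reference_spectrum)
    return similarity >= threshold


def rank_sites(comment_times):
    """Order sites by mean time between comments, shortest delta first.

    comment_times maps a site name to comment timestamps in seconds.
    """
    mean_delta = {}
    for site, times in comment_times.items():
        ordered = sorted(times)
        if len(ordered) < 2:
            continue  # no delta can be computed from a single comment
        mean_delta[site] = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return sorted(mean_delta, key=mean_delta.get)


reference = vocabulary_spectrum(
    "the hotel room was clean and the staff was friendly")
ranking = rank_sites({"site_a": [0, 100, 200], "site_b": [0, 10, 20]})
```

In this sketch the site with the shortest mean timestamp delta is ranked first, matching the statement that the site receiving legitimate comments most quickly is the most important.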

O. Suggestions from community members are identified as follows.
a. A list of patterns that identify recommendations is created. For example, the patterns might include "I recommend", "room for improvement", "I'll tell all my friends", and so on.
b. Subject keywords or n-grams in these sentences are identified.
c. Once the keyword patterns are found, these sentences are saved for presentation as suggestions for their respective subjects.
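The steps of item O can be sketched as follows. The pattern list, the punctuation-based sentence splitter, and the function names are illustrative assumptions; a real system would likely use a larger pattern list and a proper sentence tokenizer.

```python
import re

# Hypothetical recommendation patterns, modeled on the examples in item O.
RECOMMENDATION_PATTERNS = [
    "i recommend",
    "room for improvement",
    "tell all my friends",
]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_suggestions(text, subject_keywords, patterns=RECOMMENDATION_PATTERNS):
    """Map each subject keyword to the recommendation sentences that mention it."""
    suggestions = {}
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if not any(pattern in lowered for pattern in patterns):
            continue  # step a: sentence contains no recommendation pattern
        for keyword in subject_keywords:
            # step b: look for the subject keyword as a whole word
            if re.search(r"\b" + re.escape(keyword.lower()) + r"\b", lowered):
                # step c: save the sentence as a suggestion for that subject
                suggestions.setdefault(keyword, []).append(sentence)
    return suggestions


review = ("I recommend the breakfast. The pool was cold. "
          "There is room for improvement in the parking.")
found = find_suggestions(review, ["breakfast", "parking", "pool"])
```

Here the sentence about the pool is skipped because it contains no recommendation pattern, even though it mentions a subject keyword.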

P. User interface for mobile devices.
a. A user interface for viewing consumer sentiment ratings of various products and services, such as hotel services, must be formatted so that browsing, data entry, and searching of scores are easy to use. Graphs of various parameters versus time in the form of X-Y plots, or ranked problems in the form of Pareto charts, need to fit within a smartphone screen. Horizontal and vertical scrolling for viewing the various data may be operated by swiping a finger on the screen, by a mini trackball or thumb joystick, or by toggling user keys. The data may be list snippets containing source text from the blogs that express the sentiment, time-dependent plots, and/or Pareto charts.

The foregoing description has, for purposes of explanation, been given with reference to specific embodiments. However, the illustrative discussion above is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles and their practical applications, thereby enabling others skilled in the art to best utilize the embodiments, and various embodiments with various modifications, as suited to the particular use contemplated.

100 Server
102 Documents contained in the server
104 Aggregator
106 Documents contained in the aggregator
108 Sentiment server
120 Network
202 Sentiment module
204 Crawling module
206 Keyword module
208 Filtering module
210 Classification module, classification engine
1000 Computer system
1002 Processor
1004 Memory
1006 Video display unit
1008 Bus
1010 Input device
1012 Output device
1016 Network interface device
1020 Machine-readable medium
1022 Data structures and instructions

Claims

A computer-implemented method for determining sentiment expressed in a document, comprising:
receiving a document from a plurality of documents;
identifying, using at least one processor, a sentence in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords, wherein the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation, and the at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence;
determining at least one category corresponding to the at least one keyword of the sentence, wherein the at least one category is included in a list of categories generated using the list of keywords; and
determining at least one sentiment corresponding to the at least one category based on the at least one sentiment signature.
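The sentence-identification step of this claim can be illustrated with a minimal sketch. The claim does not define how the predetermined distance is measured, so counting word positions, the default `max_distance` of 3, the simple tokenizer, and the function name are all assumptions for illustration.

```python
import re


def find_sentiment_sentences(document, keywords, signatures, max_distance=3):
    """Return sentences in which a sentiment signature lies near a keyword.

    Distance is measured in word positions (an assumption).
    """
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", document):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        keyword_hits = [(i, t) for i, t in enumerate(tokens) if t in keywords]
        signature_hits = [(i, t) for i, t in enumerate(tokens) if t in signatures]
        # Pair each keyword with every signature close enough to it.
        pairs = [(kw, sig)
                 for ki, kw in keyword_hits
                 for si, sig in signature_hits
                 if abs(ki - si) <= max_distance]
        if pairs:
            results.append((sentence.strip(), pairs))
    return results


text = ("The room was great. We arrived on Monday. "
        "The staff at the front desk near the lobby was rude.")
matches = find_sentiment_sentences(text, {"room", "staff"}, {"great", "rude"})
```

In the sample text only the first sentence qualifies: "great" is two words from "room", whereas "rude" is nine words from "staff" and so falls outside the assumed distance.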

The computer-implemented method of claim 1, further comprising, prior to identifying the sentence that includes the at least one sentiment signature within the predetermined distance of the at least one keyword from the list of keywords, extracting the list of keywords by:
extracting keywords from each of the plurality of documents; and
for each keyword:
calculating a frequency f of the keyword in the plurality of documents and a number N of documents that include the keyword;
calculating a relevance of the keyword using the phase transition equation, based on the frequency of the keyword in the plurality of documents and the number of documents that include the keyword; and
adding the keyword to the list of keywords if the relevance of the keyword exceeds a predetermined threshold.

The phase transition equation is

The computer-implemented method of claim 2, wherein x ≧ 1.

The computer-implemented method of claim 3, wherein x is three.

The computer-implemented method of claim 1, further comprising, prior to determining the at least one category corresponding to the at least one keyword of the sentence, generating the list of categories by:
identifying a first set of documents among the plurality of documents that include at least one keyword from the list of keywords;
identifying sets of keywords that are each contained in at least a predetermined number of documents in the first set of documents; and
adding the sets of keywords to the list of categories, wherein each category includes a respective set of keywords.

The computer-implemented method of claim 1, further comprising, prior to determining the at least one category corresponding to the at least one keyword of the sentence, generating the list of categories by:
identifying keyword pairs in the list of keywords, each keyword pair being a unique pair of keywords that are related to each other;
identifying sets of keyword pairs, each set including at least one keyword that is common to all of the keyword pairs in the set; and
iteratively combining the sets of keyword pairs until a predetermined termination condition is met, wherein each combined set includes at least one keyword that is common to all of the keyword pairs in the combined set.

The computer-implemented method of claim 1, wherein determining the at least one category corresponding to the at least one keyword of the sentence includes using a support vector machine to determine the at least one category.

The computer-implemented method of claim 1, wherein determining the at least one category corresponding to the at least one keyword of the sentence includes using a neural network to determine the at least one category.

The computer-implemented method of claim 1, wherein determining the at least one category corresponding to the at least one keyword of the sentence comprises:
obtaining a plurality of category spectra, each category spectrum including frequencies of occurrence of keywords in the list of keywords that correspond to a respective category;
determining a category spectrum for the sentence based on the at least one keyword;
calculating a dot product of the category spectrum of the sentence and each category spectrum in the plurality of category spectra; and
determining the at least one category as a category corresponding to at least one dot product that exceeds a predetermined threshold.
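The dot-product comparison recited in this claim can be sketched as follows. The fixed `vocabulary` ordering, the normalization of the sentence spectrum, the `threshold` of 0.2, and the sample spectra are assumptions made only to give a concrete instance.

```python
def sentence_spectrum(sentence_keywords, vocabulary):
    """Normalized keyword frequencies of a sentence, ordered by vocabulary."""
    counts = [sentence_keywords.count(word) for word in vocabulary]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def dot(a, b):
    """Dot product of two equal-length spectra."""
    return sum(x * y for x, y in zip(a, b))


def categories_for_sentence(sentence_keywords, category_spectra,
                            vocabulary, threshold=0.2):
    """Return categories whose spectrum overlaps the sentence spectrum enough."""
    spectrum = sentence_spectrum(sentence_keywords, vocabulary)
    return [category
            for category, category_spectrum in category_spectra.items()
            if dot(spectrum, category_spectrum) > threshold]


vocabulary = ["bed", "pillow", "waiter", "menu"]
category_spectra = {
    "room": [0.6, 0.4, 0.0, 0.0],        # hypothetical normalized spectra
    "restaurant": [0.0, 0.0, 0.5, 0.5],
}
chosen = categories_for_sentence(["bed", "pillow"], category_spectra, vocabulary)
```

For the sample sentence the dot product with the "room" spectrum is 0.5 and with the "restaurant" spectrum is 0, so only "room" exceeds the assumed threshold.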

The computer-implemented method of claim 9, further comprising, prior to obtaining the plurality of category spectra, for each category:
obtaining a collection of documents corresponding to the category;
extracting keywords from each document in the collection of documents;
filtering the keywords using the phase transition equation to produce filtered keywords;
determining frequencies of occurrence of the filtered keywords in the collection of documents; and
normalizing the frequencies of occurrence of the filtered keywords to produce the category spectrum for the category.

The computer-implemented method of claim 1, further comprising, prior to receiving the document from the plurality of documents, selecting the plurality of documents from a collection of documents by, for each document in the collection of documents:
extracting n-grams from the document;
determining, based on the extracted n-grams, an n-gram spectrum for the document that indicates the frequency of occurrence of n-grams as a function of n-gram size;
determining whether the n-gram spectrum of the document matches, within a predetermined threshold, a reference n-gram spectrum defined by a predetermined equation;
adding the document to the plurality of documents if the n-gram spectrum of the document matches the reference n-gram spectrum within the predetermined threshold; and
discarding the document if the n-gram spectrum of the document does not match the reference n-gram spectrum within the predetermined threshold.

The predetermined equation is

Where x is the size of the n-gram, and a, b, and c are predetermined values where the peak of the predetermined equation is between an n-gram of size 2 and an n-gram of size 3. The computer-implemented method of claim 11, wherein

The computer-implemented method of claim 1, wherein the at least one sentiment signature includes at least one word that indicates that an expression of the at least one sentiment is present in the sentence.

The computer-implemented method of claim 1, wherein the at least one sentiment is a representation of an opinion related to at least one category.

The computer-implemented method of claim 1, wherein the at least one category is associated with a product.

The computer-implemented method of claim 1, wherein the at least one category is associated with a service.

A system for determining sentiment expressed in a document, comprising:
at least one processor;
memory; and
at least one program stored in the memory, the at least one program comprising:
instructions for receiving a document from a plurality of documents;
instructions for identifying a sentence in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords, wherein the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation, and the at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence;
instructions for determining at least one category corresponding to the at least one keyword of the sentence, wherein the at least one category is included in a list of categories generated using the list of keywords; and
instructions for determining at least one sentiment corresponding to the at least one category based on the at least one sentiment signature.

The system of claim 17, further comprising instructions for extracting the list of keywords from the plurality of documents, including:
instructions for extracting keywords from each of the plurality of documents; and
instructions for, for each keyword:
calculating a frequency f of the keyword in the plurality of documents and a number N of documents that include the keyword;
calculating a relevance of the keyword using the phase transition equation, based on the frequency of the keyword in the plurality of documents and the number of documents that include the keyword; and
adding the keyword to the list of keywords if the relevance of the keyword exceeds a predetermined threshold.

The phase transition equation is

The system of claim 18, wherein x ≧ 1.

A computer-readable storage medium storing at least one program configured for execution by a computer, the at least one program comprising:
instructions for receiving a document from a plurality of documents;
instructions for identifying a sentence in the document that includes at least one sentiment signature within a predetermined distance of at least one keyword from a list of keywords, wherein the list of keywords is extracted from the plurality of documents and filtered using a phase transition equation, and the at least one sentiment signature corresponds to an expression of at least one sentiment in the sentence;
instructions for determining at least one category corresponding to the at least one keyword of the sentence, wherein the at least one category is included in a list of categories generated using the list of keywords; and
instructions for determining at least one sentiment corresponding to the at least one category based on the at least one sentiment signature.