JP2018180866A

JP2018180866A - Determination method, determination program and determination device

Info

Publication number: JP2018180866A
Application number: JP2017078509A
Authority: JP
Inventors: 和吉川; Kazu Yoshikawa; 友哉岩倉; Tomoya Iwakura
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-04-11
Filing date: 2017-04-11
Publication date: 2018-11-15
Anticipated expiration: 2037-04-11
Also published as: JP6816621B2

Abstract

PROBLEM TO BE SOLVED: To provide a determination method and the like for determining a similarity index appropriate to the category of a document. SOLUTION: A computer performs processing of: obtaining documents in which mentions are associated with entities that have category information in a knowledge base; generating document groups by classifying the obtained documents by associated category; generating, for each generated document group, document pairs consisting of documents associated with the same mention; giving each generated document pair a label indicating whether or not the entities match; creating a similarity index on the basis of the labeled document pairs; and outputting the created similarity index in association with the category corresponding to the document group. SELECTED DRAWING: Figure 6

Description

The present invention relates to a determination method and the like for determining document contents for clustering.

Techniques for obtaining the degree of similarity between two documents have been proposed (Patent Document 1, etc.). The degree of similarity is determined using a similarity index that expresses, as a numerical value, how similar the words contained in the documents are.

Patent Document 1: JP 2000-155762 A

However, the similarity indexes of Patent Document 1 and the like cannot be changed according to the category of the document. Consequently, the accuracy of the similarity index decreases for some document categories.

In one aspect, an object is to provide a determination method and the like for determining a similarity index suited to the category of a document.

In the determination method disclosed in the present application, a computer performs processing of: acquiring documents in which a mention in each document is associated with an entity that has category information in a knowledge base; generating document groups by classifying the acquired documents by associated category; generating, for each generated document group, document pairs consisting of documents associated with the same mention; assigning to each generated document pair a label indicating whether or not the entities match; creating a similarity index based on the labeled document pairs; and outputting the created similarity index in association with the category corresponding to the document group.

According to one aspect of the present application, a similarity index suited to the category of a document can be determined.

FIG. 1 is a block diagram showing a configuration example of the similarity score calculation device.
FIG. 2 is an explanatory diagram showing an example record layout of the entity information DB.
FIG. 3 is an explanatory diagram showing an example record layout of the data set DB.
FIG. 4 is an explanatory diagram showing an example record layout of the type-category correspondence DB.
FIG. 5 is an explanatory diagram showing example records of the similarity index DB.
FIG. 6 is a flowchart showing the procedure of the similarity index creation process.
FIG. 7 is a flowchart showing the procedure of the similarity score calculation process.
FIG. 8 is a flowchart showing the procedure of the clustering process.
FIG. 9 is an explanatory diagram showing an example of document grouping using the similarity index.
FIG. 10 is a flowchart showing another procedure of the clustering process.
FIG. 11 is an explanatory diagram showing a method of creating the type-category correspondence DB.
FIG. 12 is an explanatory diagram showing another method of creating the type-category correspondence DB.
FIG. 13 is a block diagram showing an example of the functional configuration of the similarity score calculation device.

Embodiments will be described below with reference to the drawings.

Embodiment 1
FIG. 1 is a block diagram showing a configuration example of the similarity score calculation device 1. The similarity score calculation device (determination device) 1 includes a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a large-capacity storage unit 14, a communication unit 15, an input/output unit 16, and a reading unit 17. These components are connected by a bus B.

The CPU 11 controls each hardware unit in accordance with a control program (determination program) 1P stored in the ROM 12. The RAM 13 is, for example, an SRAM (Static RAM), a DRAM (Dynamic RAM), or a flash memory. The RAM 13 temporarily stores data generated while the CPU 11 executes a program.

The large-capacity storage unit 14 is, for example, a hard disk or an SSD (Solid State Drive). The large-capacity storage unit 14 stores the various data needed for the similarity index determination process and the similarity score calculation process. Specifically, it stores a document DB (database) 141, an entity information DB 142, a data set DB 143, a type-category correspondence DB 144, and a similarity index DB 145. The control program 1P may also be stored in the large-capacity storage unit 14.

The communication unit 15 communicates with other computers via a network. The input/output unit 16 receives operation signals from a keyboard and a mouse, and outputs display images to a display device such as a liquid crystal display.

The reading unit 17 reads a portable storage medium 1a, including a CD (Compact Disc)-ROM and a DVD (Digital Versatile Disc)-ROM. The CPU 11 may read the control program 1P from the portable storage medium 1a via the reading unit 17 and store it in the large-capacity storage unit 14. The CPU 11 may also download the control program 1P from another computer via a network or the like and store it in the large-capacity storage unit 14. Furthermore, the CPU 11 may read the control program 1P from a semiconductor memory 1b.

Next, the operation of the similarity score calculation device 1 will be described. The similarity score calculation device 1 has two operation modes: a similarity index creation mode and a similarity score calculation mode. In the similarity index creation mode, the similarity score calculation device 1 obtains a similarity index for each category. The similarity indexes are obtained by learning, and the training data used for learning is obtained from a large-scale knowledge base. One example of a large-scale knowledge base is Wikipedia.

In the similarity score calculation mode, the similarity score calculation device 1 calculates the similarity score of a document pair using the category-specific similarity indexes obtained in the similarity index creation mode.

The terms used in the following description are defined here. A mention is a character string in a document that refers to a specific entity. An entity is the thing itself. For example, in the sentence "Today is the autograph session of the writer Ichiro Suzuki.", "Ichiro Suzuki" is the mention, and the person who is the writer Ichiro Suzuki is the entity. A category is a classification by subject. Examples of categories are person, company, and municipality.

As described above, a large-scale knowledge base is used to obtain the similarity indexes. Its use presupposes that the following hypotheses hold. (1) The criterion for judging whether entities appearing in documents are the same differs depending on the attribute (category) of the entity. (2) The attribute (category) can be mapped to category information in the large-scale knowledge base. When (1) and (2) hold, the large-scale knowledge base can be used to obtain training data for learning category-specific similarity indexes.

Next, the databases stored in the large-capacity storage unit 14 will be described. The document DB 141 stores various document data, for example documents acquired from the large-scale knowledge base for use in the similarity index creation mode, and document groups that are the targets of similarity score calculation in the similarity score calculation mode. Documents used in the similarity index creation mode need only be tied to entities of the knowledge base (for example, by links); they need not have been acquired from the knowledge base itself.

FIG. 2 is an explanatory diagram showing an example record layout of the entity information DB 142. The entity information DB 142 stores information about the entities (persons, companies, etc.) contained in documents, for example the documents stored in the document DB 141. The information stored in the entity information DB 142 is one of the training corpora for creating similarity indexes. The entity information DB 142 includes a document column, a mention column, an entity column, and a category column. The document column stores the content of the document. The mention column stores the mention contained in the document. The entity column stores the entity contained in the document. The category column stores the category to which the entity belongs in the large-scale knowledge base.

FIG. 3 is an explanatory diagram showing an example record layout of the data set DB 143. The data set DB 143 stores, for each category, whether or not the documents of each document pair match. The information stored in the data set DB 143 is one of the training corpora for creating similarity indexes. The data set DB 143 includes a category column, a first document column, a second document column, and a label column. The category column stores the category of the documents. The first document column and the second document column each store the content of a document. The label column stores whether or not the entity of the first document matches the entity of the second document. The data set DB 143 is created from the contents of the entity information DB 142.

FIG. 4 is an explanatory diagram showing an example record layout of the type-category correspondence DB 144. The type-category correspondence DB 144 includes a named entity type column and a knowledge base category column. The named entity type column stores a named entity type obtained by named entity extraction. The knowledge base category column stores the category of the large-scale knowledge base that corresponds to the named entity type. The type-category correspondence DB 144 is created manually in advance. Alternatively, it may be generated by machine learning.

FIG. 5 is an explanatory diagram showing example records of the similarity index DB 145. The similarity index DB 145 stores, for each category, the coefficients used in calculating the similarity index. The similarity index is calculated, for example, by the following formula (1).

similarity index = a × (word similarity) + b × (proper-noun similarity) + c × (document-URL similarity) + … + b_w × I(word w matches) + …   (1)
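Formula (1) can be read as a weighted sum whose coefficients are looked up per category. The following Python sketch illustrates that reading only. It assumes that I(...) is an indicator that equals 1 when word w appears in both documents and 0 otherwise; the function name, the feature names and all coefficient values are hypothetical and do not come from the publication.

```python
from typing import Dict, Set


def similarity_index(features: Dict[str, float],
                     shared_words: Set[str],
                     coeffs: Dict[str, float],
                     word_coeffs: Dict[str, float]) -> float:
    """Weighted sum in the shape of formula (1).

    features     -- precomputed similarities (word, proper noun, URL, ...)
    shared_words -- words that occur in both documents (indicator terms)
    coeffs       -- coefficients a, b, c, ... for one category
    word_coeffs  -- per-word coefficients b_w for the same category
    """
    score = sum(coeffs.get(name, 0.0) * value for name, value in features.items())
    # Each shared word w contributes b_w * 1; unshared words contribute 0.
    score += sum(word_coeffs.get(w, 0.0) for w in shared_words)
    return score


# Hypothetical coefficients for the "person" category.
person_coeffs = {"word": 0.5, "proper_noun": 2.0, "url": 1.0}
person_word_coeffs = {"novelist": 1.5}

features = {"word": 0.4, "proper_noun": 0.5, "url": 0.0}
score = similarity_index(features, {"novelist", "today"},
                         person_coeffs, person_word_coeffs)
print(round(score, 3))
```

Storing one such coefficient set per category corresponds to the per-category records of the similarity index DB 145.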

FIG. 5 shows a person similarity index 1451, a company similarity index 1452, and a sports similarity index 1453.

Next, the processing performed by the similarity score calculation device 1 will be described. FIG. 6 is a flowchart showing the procedure of the similarity index creation process, which is the operation in the similarity index creation mode. The CPU 11 of the similarity score calculation device 1 acquires documents for creating similarity indexes and stores them in the document DB 141 (step S1). The documents to be acquired are documents in which links to the large-scale knowledge base are embedded. The documents are acquired from another computer via the communication unit 15. They may instead be read via the reading unit 17 from a portable storage medium 1a storing them, or acquired from a semiconductor memory 1b storing them.

The CPU 11 acquires entity information from each document stored in the document DB 141 and stores it in the entity information DB 142 (step S2). As described above, each document stored in the document DB 141 has links to the large-scale knowledge base embedded in it. Here, the portion in which a link is embedded is the mention, and the link destination is the entity. The category is the category to which the entity belongs in the large-scale knowledge base. The CPU 11 stores the document, mention, entity, and category in association with one another in the entity information DB 142.

The CPU 11 selects one category to be processed (step S3). The CPU 11 acquires the entity information for the selected category from the entity information DB 142 (step S4). The CPU 11 creates a data set from the acquired entity information and stores it in the data set DB 143 (step S5). Specifically, from the acquired entity information, the CPU 11 creates document pairs each consisting of two documents that have the same mention. For each document pair created, the CPU 11 compares the entities of the two documents and determines the label to assign on the basis of the comparison result. If the two entities match, the CPU 11 assigns the label "match" to the document pair; if they differ, it assigns the label "mismatch". The CPU 11 stores the labeled document pairs, that is, the data set, in the data set DB 143.
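The pair generation and labeling of steps S3 to S5 can be sketched as follows. This is a minimal illustration, not the publication's implementation: the `EntityRecord` type, the `build_dataset` function and the sample records are assumptions made for the example, with the record fields mirroring the four columns of the entity information DB 142.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class EntityRecord(NamedTuple):
    """One row shaped like the entity information DB (hypothetical type)."""
    document: str
    mention: str
    entity: str
    category: str


def build_dataset(records: List[EntityRecord]) -> Dict[str, List[Tuple[str, str, bool]]]:
    """Per category, pair documents that share a mention and label each pair
    True (entities match) or False (entities differ)."""
    by_key = defaultdict(list)
    for rec in records:
        by_key[(rec.category, rec.mention)].append(rec)

    dataset = defaultdict(list)
    for (category, _mention), recs in by_key.items():
        for a, b in combinations(recs, 2):
            dataset[category].append((a.document, b.document, a.entity == b.entity))
    return dict(dataset)


# Hypothetical records: two people share the surface form "Ichiro Suzuki".
records = [
    EntityRecord("d1", "Ichiro Suzuki", "Ichiro Suzuki (writer)", "person"),
    EntityRecord("d2", "Ichiro Suzuki", "Ichiro Suzuki (writer)", "person"),
    EntityRecord("d3", "Ichiro Suzuki", "Ichiro Suzuki (athlete)", "person"),
    EntityRecord("d4", "OX Pharmaceutical", "OX Pharmaceutical", "company"),
]
dataset = build_dataset(records)
for row in dataset["person"]:
    print(row)
```

Because pairs are formed only among documents with the same mention, the negative examples are exactly the ambiguous cases (same surface string, different entity) that the similarity index has to separate.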

The CPU 11 creates a similarity index based on the created data set (step S6). The similarity index is obtained, for example, by learning similarity scores through machine learning using an SVM (Support Vector Machine) or logistic regression. Creating a similarity index by machine learning is a known technique, so the details are omitted.
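The publication leaves the learning step to known techniques. As one possible illustration, the sketch below fits a logistic regression by plain batch gradient descent on labeled document pairs that have already been turned into feature vectors. The feature choice (word similarity, proper-noun similarity), the sample values, the hyperparameters and the function names are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def train_logistic(samples: Sequence[Tuple[Sequence[float], bool]],
                   epochs: int = 500, lr: float = 0.5) -> List[float]:
    """Return weights [bias, w1, w2, ...] fitted by batch gradient descent."""
    n_features = len(samples[0][0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, label in samples:
            z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))       # predicted match probability
            err = p - (1.0 if label else 0.0)    # gradient of the log loss w.r.t. z
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad)]
    return weights


def linear_score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Signed score: positive leans toward 'match', negative toward 'mismatch'."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


# Hypothetical features per pair: (word similarity, proper-noun similarity).
samples = [((0.9, 0.8), True), ((0.8, 0.9), True),
           ((0.1, 0.2), False), ((0.2, 0.1), False)]
weights = train_logistic(samples)
print(linear_score(weights, (0.9, 0.9)) > 0, linear_score(weights, (0.1, 0.1)) < 0)
```

Running this once per category on that category's labeled pairs would yield one coefficient set per category, which is the kind of record the similarity index DB 145 is described as holding.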

The CPU 11 stores the created similarity index in the similarity index DB 145 in association with the category (step S7). The CPU 11 determines whether there is any unprocessed category (step S8). Here, "unprocessed" means that a similarity index has not yet been created for the category. If the CPU 11 determines that there is an unprocessed category (YES in step S8), it returns the process to step S3 and processes that category. If the CPU 11 determines that there is no unprocessed category (NO in step S8), it ends the process.

Next, the operation of the similarity score calculation device 1 in the similarity score calculation mode will be described. In this mode, the similarity score calculation device 1 uses the category-specific similarity indexes. It is therefore a precondition that the category-specific similarity indexes have been created in the similarity index creation mode before the device operates in the similarity score calculation mode.

Before describing similarity score calculation, named entity extraction will be described. Named entity extraction is a known technique, so only a brief description is given. Named entity extraction is a technique for extracting proper nouns such as persons and companies, numerical expressions, and the like from documents. The named entities obtained by named entity extraction come in several kinds (here called types). Named entity extraction makes it possible to extract the locations at which named entities appear in a document, together with their types. For example, suppose named entity extraction is performed on the sentence "Taro Tanaka is a researcher at OX Pharmaceutical in Shiodome." The result is "<person>Taro Tanaka</person> is a researcher at <company>OX Pharmaceutical</company> in <place>Shiodome</place>." Here, the underlined portions, that is, the portions enclosed by tags <…></…>, are named entities, and the … inside a tag indicates the type. In this example, "Taro Tanaka" is a named entity of type person, "Shiodome" is a named entity of type place, and "OX Pharmaceutical" is a named entity of type company.
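The tagged output above is easy to consume programmatically. The sketch below does not perform named entity extraction itself; it only parses text that has already been tagged in the <type>…</type> style of the example. The regular expression and the function name are assumptions made for the illustration, and nested tags are not handled.

```python
import re
from typing import List, Tuple

# Matches <type>surface</type>, requiring the closing tag to repeat the type.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Return (surface string, type) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in TAG_PATTERN.finditer(text)]


tagged = ("<person>Taro Tanaka</person> is a researcher at "
          "<company>OX Pharmaceutical</company> in <place>Shiodome</place>.")
entities = parse_tagged(tagged)
print(entities)
```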

FIG. 7 is a flowchart showing the procedure of the similarity score calculation process. The CPU 11 acquires a keyword and a document pair (step S11). The keyword is the word that serves as the basis for judging similarity, and it is assumed to be a mention. For example, to calculate the similarity score of a pair of documents written about Taro Tanaka, the keyword is "Taro Tanaka". The document pair is the pair of documents for which the similarity score is to be calculated. The CPU 11 performs named entity extraction on each document in the document pair (step S12). This yields, for each document, the named entity type of the mention corresponding to the keyword. The CPU 11 determines whether the named entity types of the mentions in the two documents match (step S13). If the CPU 11 determines that the named entity types do not match (NO in step S13), it judges that the document pair does not match, outputs a predetermined minimum score (step S14), and ends the process. If the CPU 11 determines that the named entity types obtained from the two documents match (YES in step S13), it acquires the category corresponding to the matched named entity type from the type-category correspondence DB 144 (step S15). The CPU 11 acquires the similarity index corresponding to the acquired category from the similarity index DB 145 and calculates the similarity score (step S16). The CPU 11 outputs the calculated score (step S17).
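The control flow of steps S11 to S17 can be sketched as below. The named entity extractor, the type-category table and the per-category similarity indexes are passed in as plain Python callables and dictionaries; these stand-ins, the value of `MIN_SCORE`, and the Jaccard word overlap used as a toy index are all assumptions for the example, not details from the publication.

```python
from typing import Callable, Dict

MIN_SCORE = -1.0  # hypothetical "predetermined minimum score"


def similarity_score(keyword: str, doc1: str, doc2: str,
                     ner_type: Callable[[str, str], str],
                     type_to_category: Dict[str, str],
                     indexes: Dict[str, Callable[[str, str], float]]) -> float:
    """Steps S12 to S17: compare types, map type to category, apply its index."""
    type1 = ner_type(doc1, keyword)           # S12: type of the mention in doc1
    type2 = ner_type(doc2, keyword)           #      and in doc2
    if type1 != type2:                        # S13: types differ
        return MIN_SCORE                      # S14
    category = type_to_category[type1]        # S15
    return indexes[category](doc1, doc2)      # S16 (the caller outputs it, S17)


def jaccard(a: str, b: str) -> float:
    """Toy similarity index: word-set overlap."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


# Toy stand-ins for the extractor and the two databases.
doc_a = "Taro Tanaka researcher pharmaceutical"
doc_b = "Taro Tanaka researcher chemistry"
doc_c = "Taro Tanaka ramen shop"
toy_types = {doc_a: "PERSON", doc_b: "PERSON", doc_c: "COMPANY"}

def toy_ner(doc: str, keyword: str) -> str:
    return toy_types[doc]

type_to_category = {"PERSON": "person", "COMPANY": "company"}
indexes = {"person": jaccard, "company": jaccard}

same_type = similarity_score("Taro Tanaka", doc_a, doc_b, toy_ner, type_to_category, indexes)
diff_type = similarity_score("Taro Tanaka", doc_a, doc_c, toy_ner, type_to_category, indexes)
print(same_type, diff_type)
```

The early return on a type mismatch means that the category lookup and the index evaluation are reached only for pairs whose mentions have the same named entity type.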

When one document contains multiple mentions, the first of them is used as the representative for score calculation. Alternatively, a score is calculated for each mention, and the average of all the calculated scores is used as the final score.
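The two options for handling multiple mentions can be written as a small helper. This is only a sketch: the function name, the `strategy` parameter and the per-mention scoring callable are hypothetical.

```python
from statistics import mean
from typing import Callable, List


def combine_mention_scores(mentions: List[str],
                           score_for: Callable[[str], float],
                           strategy: str = "first") -> float:
    """Use the first mention as representative, or average over all mentions."""
    if not mentions:
        raise ValueError("at least one mention is required")
    if strategy == "first":
        return score_for(mentions[0])
    if strategy == "average":
        return mean(score_for(m) for m in mentions)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical per-mention scores.
toy_scores = {"mention A": 2.0, "mention B": 4.0}
first = combine_mention_scores(list(toy_scores), toy_scores.__getitem__, "first")
average = combine_mention_scores(list(toy_scores), toy_scores.__getitem__, "average")
print(first, average)
```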

This embodiment has the following effect. Because a different document similarity index is selected for each kind of target whose similarity is judged (person, company, municipality) and the similarity score is calculated with it, a highly accurate similarity score can be obtained.

Next, clustering of a document group using similarity scores will be described. As an example, the following describes clustering a document group by person entity, for instance classifying a large number of documents written about people, such as critical biographies, biographies, and memoirs, by the person they discuss. FIG. 8 is a flowchart showing the procedure of the clustering process. The CPU 11 acquires a category and a document group (step S21). Here, the category is person. The CPU 11 performs named entity extraction on each document in the document group (step S22). The CPU 11 divides the document group by the personal names extracted by named entity extraction and stores the result in the document DB 141 or the like (step S23). The CPU 11 acquires the similarity index corresponding to the category "person" from the similarity index DB 145 (step S24). The CPU 11 calculates similarity scores for a document group divided by personal name (step S25). The CPU 11 groups the documents using the similarity scores (step S26). The CPU 11 performs steps S25 and S26 for each of the document groups divided by personal name. The CPU 11 outputs the result (step S27) and ends the process.

FIG. 9 is an explanatory diagram showing an example of document grouping using the similarity index. For document grouping, the similarity index is defined not only for "document-document" pairs but also for "document-document group" pairs and "document group-document group" pairs. FIG. 9A shows the processing for a "document-document" pair. A similarity score is calculated for the pair consisting of document 1 and document 2. If the score is greater than a predetermined threshold, the two documents are placed in the same document group. If the score is less than or equal to the threshold, the two documents are placed in different document groups.

FIGS. 9B and 9C show the processing for a "document-document group" pair. In these examples, two document groups, group 1 and group 2, have already been created, and the process determines the group to which a document that has not yet been grouped (a new document) belongs. FIG. 9B shows an example in which the new document is assigned to group 1. The similarity score between the new document and group 1 and the similarity score between the new document and group 2 are calculated. Suppose the former is 3.5 and the latter is 0.5. The new document is then assigned to the group with the higher score, that is, group 1. FIG. 9C shows a case where the new document is not assigned to any existing group and a new group is created. When all of the calculated similarity scores are less than or equal to a predetermined threshold, the new document is not assigned to an existing group and a new group is created instead. In the example of FIG. 9C, the threshold is defined as 0, the similarity score between the new document and group 1 is -1.0, and the similarity score between the new document and group 2 is -0.5. Since both similarity scores are 0 or less, the new document is placed in a newly created group 3. The processing of FIG. 9 makes it possible to group a document group.
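The "document-document group" rule of FIGS. 9B and 9C can be sketched as follows. The publication does not say how a document-group score is computed from the similarity index, so the sketch takes that score as a caller-supplied function; the function names and the lookup table are hypothetical, while the scores 3.5, 0.5, -1.0 and -0.5 and the threshold 0 are the values from the examples above.

```python
from typing import Callable, List


def assign_to_group(new_doc: str,
                    groups: List[List[str]],
                    group_score: Callable[[str, List[str]], float],
                    threshold: float = 0.0) -> int:
    """Add new_doc to the best-scoring group, or open a new group when every
    score is at or below the threshold. Returns the index of the group used."""
    scores = [group_score(new_doc, g) for g in groups]
    if not scores or max(scores) <= threshold:
        groups.append([new_doc])            # FIG. 9C: create a new group
        return len(groups) - 1
    best = scores.index(max(scores))        # FIG. 9B: highest score wins
    groups[best].append(new_doc)
    return best


# Scores keyed by (new document, first document of the group).
table = {("new A", "g1-doc"): 3.5, ("new A", "g2-doc"): 0.5,
         ("new B", "g1-doc"): -1.0, ("new B", "g2-doc"): -0.5}

def lookup_score(doc: str, group: List[str]) -> float:
    return table[(doc, group[0])]

groups = [["g1-doc"], ["g2-doc"]]
index_a = assign_to_group("new A", groups, lookup_score)   # joins group 1
index_b = assign_to_group("new B", groups, lookup_score)   # opens group 3
print(index_a, index_b, groups)
```

Applying this rule to each ungrouped document in turn gives an incremental grouping procedure in which the number of groups does not have to be fixed in advance.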

Next, another example of the clustering process will be described. As above, it is an example of clustering by person entity. Here, the process clusters the results of a full-text search of a document group for a target personal name. It is effective when other people with the same family and given name as the target person exist and documents about them are also included in the search results.

FIG. 10 is a flowchart showing another procedure of the clustering process. The CPU 11 acquires a category, a keyword, and a document group (step S31). Here, the category is person, and the keyword is a personal name such as Ichiro Suzuki. The CPU 11 performs a keyword search on the document group (step S32). The CPU 11 stores the documents hit by the keyword search in the document DB 141 or the like (step S33). The CPU 11 acquires the similarity index corresponding to the category "person" from the similarity index DB 145 (step S34). The CPU 11 calculates similarity scores for the document group stored in step S33 (step S35). The CPU 11 groups the documents using the calculated scores (step S36). The CPU 11 outputs the grouping result (step S37) and ends the process. The calculation of similarity scores and the grouping of documents are the same as described above, so their description is omitted.

Next, the type-category correspondence DB 144 is described in detail. The similarity score calculation device 1 improves the accuracy of the similarity score by using a different similarity index for each document category. The category of a document is obtained by converting the named entity type produced by named entity extraction into a category using the type-category correspondence DB 144, and the similarity index corresponding to that category is then selected. The correctness of the type-category correspondence DB 144 therefore affects the accuracy of the similarity score.

Two methods of creating the type-category correspondence DB 144 are described. In the first method, named entity types and categories are in one-to-one correspondence, as described above. In the second method, they are in one-to-many correspondence. The following description uses geographical location information as an example. Among named entity types, geographical location information is LOCATION. Among the categories of the large-scale knowledge base, geographical location information is assumed to fall into three categories: Japanese municipalities, Japanese wards, and Japanese geography.

FIG. 11 is an explanatory diagram showing a method of creating the type-category correspondence DB 144, for the case where named entity types and categories are in one-to-one correspondence. A similarity index is created for each category. The similarity index creation process therefore produces a similarity index 145a for the category Japanese municipalities, a similarity index 145b for the category Japanese wards, and a similarity index 145c for the category Japanese geography.

After the three similarity indexes are created, they are evaluated, and the category whose index is evaluated as the most accurate is taken as the category corresponding to the named entity type LOCATION. For the evaluation, a data set is generated from a document group different from the documents used to create the similarity indexes. The data set is of the same kind as described above: it contains document pairs and, for each pair, a label indicating whether the entities of the pair match. The similarity score of each document pair in the generated data set is calculated with each similarity index. Comparing the similarity scores with the label values yields the accuracy of each similarity index. The method of calculating the accuracy is a known technique, so its description is omitted. In the example of FIG. 11, the similarity index for the category Japanese municipalities has the highest accuracy, so a record stating that the category corresponding to the named entity type LOCATION is Japanese municipalities is stored in the type-category correspondence DB 144.
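
The one-to-one selection can be illustrated with a short sketch. The specification leaves the accuracy calculation to known techniques, so the thresholded agreement rate used here is an assumption, as are the toy vocabulary-based indexes, the sample pairs, and every function and variable name.

```python
def make_scorer(vocabulary):
    """Toy similarity index (assumed form): Jaccard overlap restricted to a vocabulary."""
    def score(doc_a, doc_b):
        a = set(doc_a.split()) & vocabulary
        b = set(doc_b.split()) & vocabulary
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return score

def accuracy(score_fn, dataset, threshold=0.5):
    """Share of pairs whose thresholded score agrees with the entity-match label."""
    correct = sum((score_fn(a, b) >= threshold) == label for a, b, label in dataset)
    return correct / len(dataset)

# Hypothetical indexes, one per knowledge-base category.
indexes = {
    "Japanese municipalities": {"Tokyo", "Hiroshima"},
    "Japanese wards": {"Fuchu"},
    "Japanese geography": {"racecourse", "factory"},
}

# Held-out labeled pairs (made up): (document, document, entities match?).
held_out_pairs = [
    ("Fuchu Tokyo racecourse", "Fuchu Tokyo prison", True),
    ("Fuchu Tokyo racecourse", "Fuchu Hiroshima factory", False),
    ("Fuchu Hiroshima factory", "Fuchu Hiroshima furniture", True),
    ("Fuchu Tokyo prison", "Fuchu Hiroshima furniture", False),
]

accuracies = {
    category: accuracy(make_scorer(vocabulary), held_out_pairs)
    for category, vocabulary in indexes.items()
}

# Store the most accurate category for the named entity type LOCATION.
best_category = max(accuracies, key=accuracies.get)
type_category_db = {"LOCATION": best_category}
print(accuracies)
print(type_category_db)
```

The point of the sketch is the final two statements: each candidate category is scored on pairs that were not used to build its index, and only the winner is recorded for LOCATION.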

FIG. 12 is an explanatory diagram showing another method of creating the type-category correspondence DB 144, for the case where named entity types and categories are in one-to-many correspondence. As in the one-to-one case, the three similarity indexes 145a, 145b, and 145c are created and each of them is evaluated.

In the one-to-many case, the scores from the similarity indexes of multiple categories are weighted and combined for each named entity type. The weights are determined from the evaluation result of each index; in the example shown in FIG. 12, the weight of each index is its accuracy. The type-category correspondence DB 144 indicates that, when the named entity type is LOCATION, the similarity index 145a for Japanese municipalities, the similarity index 145b for Japanese wards, and the similarity index 145c for Japanese geography are used. First, a score is calculated with each of the similarity indexes 145a, 145b, and 145c. When the respective scores are SC1, SC2, and SC3, the final score S is calculated by the following equation (2).

S = 0.8 × SC1 + 0.6 × SC2 + 0.3 × SC3 … (2)
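
Equation (2) is a plain weighted sum, which the following sketch reproduces. The weights 0.8, 0.6, and 0.3 come from the equation; the dictionary layout, the function name, and the sample values for SC1 to SC3 are assumptions made for illustration.

```python
def combined_score(scores, weights):
    """Weighted sum of per-category scores, as in equation (2)."""
    return sum(weights[category] * scores[category] for category in weights)

# Weights taken from equation (2); per the text, each is the accuracy of its index.
weights = {
    "Japanese municipalities": 0.8,  # index 145a
    "Japanese wards": 0.6,           # index 145b
    "Japanese geography": 0.3,       # index 145c
}

# Hypothetical scores SC1, SC2, SC3 for one document pair.
scores = {
    "Japanese municipalities": 0.5,
    "Japanese wards": 0.5,
    "Japanese geography": 1.0,
}

S = combined_score(scores, weights)
print(S)
```

Note that the weights in equation (2) sum to 1.7, so if each SC lies between 0 and 1, S can range up to 1.7; any threshold applied to S would need to account for that.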

As described above, creating the type-category correspondence DB 144 by machine learning rather than by hand makes it possible to improve the accuracy of the similarity score.

FIG. 13 is a block diagram showing an example of the functional configuration of the similarity score calculation device 1. The similarity score calculation device 1 includes an acquisition unit 11a, a group generation unit 11b, a document pair generation unit 11c, an assignment unit 11d, a creation unit 11e, and an output unit 11f. These functional units are realized by the CPU 11 operating in accordance with the control program 1P.

The acquisition unit 11a acquires documents in which a mention in the document is associated with an entity that has category information in the knowledge base. The group generation unit 11b generates document groups by classifying the acquired documents by the associated category. The document pair generation unit 11c generates, for each generated document group, document pairs including documents associated with the same mention. The assignment unit 11d assigns to each generated document pair a label indicating whether the entities match. The creation unit 11e creates a similarity index based on the labeled document pairs. The output unit 11f outputs the created similarity index in association with the category corresponding to the document group.
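
The data flow through units 11b to 11d can be sketched as below. The record layout, field names, and sample data are assumptions made for illustration. The work of the creation unit 11e is deliberately left out, because this passage does not say how the index is learned from the labeled pairs.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class LinkedDocument:
    """A document whose mention is linked to a knowledge-base entity (assumed layout)."""
    text: str
    mention: str
    entity: str
    category: str

def group_by_category(documents):
    """Group generation unit 11b: classify documents by the linked category."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc.category].append(doc)
    return dict(groups)

def labeled_pairs(group):
    """Units 11c and 11d: pair documents with the same mention, label by entity match."""
    return [
        (a, b, a.entity == b.entity)
        for a, b in combinations(group, 2)
        if a.mention == b.mention
    ]

# Made-up input standing in for the output of the acquisition unit 11a.
documents = [
    LinkedDocument("doc about a baseball game", "Ichiro Suzuki", "person/1", "person"),
    LinkedDocument("doc about a batting record", "Ichiro Suzuki", "person/1", "person"),
    LinkedDocument("doc about an election", "Ichiro Suzuki", "person/2", "person"),
    LinkedDocument("doc about a racecourse", "Fuchu", "place/1", "Japanese municipalities"),
    LinkedDocument("doc about a furniture maker", "Fuchu", "place/2", "Japanese municipalities"),
]

dataset = {
    category: labeled_pairs(group)
    for category, group in group_by_category(documents).items()
}
for category, pairs in dataset.items():
    print(category, [label for _, _, label in pairs])
```

Each category ends up with its own list of labeled pairs, which is the per-category training material from which a separate similarity index would then be created and output.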

The technical features (constituent elements) described in the respective embodiments can be combined with one another, and new technical features can be formed by combining them.
The embodiments disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated not by the above description but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.

With regard to the above embodiments, the following supplementary notes are further disclosed.

(Supplementary Note 1)
A determination method in which a computer performs processing of:
acquiring documents in which a mention in the document is associated with an entity having category information in a knowledge base;
generating document groups by classifying the acquired documents by the associated category;
generating, for each generated document group, document pairs including the documents associated with the same mention;
assigning to each generated document pair a label indicating whether the entities match;
creating a similarity index based on the labeled document pairs; and
outputting the created similarity index in association with the category corresponding to the document group.

(Supplementary Note 2)
The determination method according to Supplementary Note 1, wherein the category is defined in the knowledge base and is assigned to each of the entities.

(Supplementary Note 3)
The determination method according to Supplementary Note 1 or 2, further comprising:
accepting a keyword and a plurality of documents hit by a search using the keyword;
performing named entity extraction on the accepted documents;
comparing, document by document, the named entity type of the mention corresponding to the keyword;
when the named entity types match among the documents, acquiring the category corresponding to the named entity type;
acquiring the similarity index associated with the acquired category;
calculating a similarity score between the documents using the acquired similarity index; and
outputting the calculated similarity score.

(Supplementary Note 4)
The determination method according to Supplementary Note 1 or 2, further comprising:
accepting the category and a plurality of documents;
performing named entity extraction on the accepted documents;
selecting one of the named entity types corresponding to the extracted named entities;
dividing the plurality of documents by each named entity of the selected named entity type;
acquiring the similarity index associated with the category;
calculating a similarity score for each document group obtained by the division, using the acquired similarity index; and
outputting the calculated similarity score.

(Supplementary Note 5)
The determination method according to Supplementary Note 3 or 4, further comprising:
clustering the plurality of documents or the document groups according to the similarity score; and
outputting the clustering result.

(Supplementary Note 6)
A determination program that causes a computer to execute processing of:
acquiring documents in which a mention in the document is associated with an entity having category information in a knowledge base;
generating document groups by classifying the acquired documents by the associated category;
generating, for each generated document group, document pairs including the documents associated with the same mention;
assigning to each generated document pair a label indicating whether the entities match;
creating a similarity index based on the labeled document pairs; and
outputting the created similarity index in association with the category corresponding to the document group.

(Supplementary Note 7)
A determination device comprising:
an acquisition unit that acquires documents in which a mention in the document is associated with an entity having category information in a knowledge base;
a group generation unit that generates document groups by classifying the acquired documents by the associated category;
a document pair generation unit that generates, for each generated document group, document pairs including the documents associated with the same mention;
an assignment unit that assigns to each generated document pair a label indicating whether the entities match;
a creation unit (11e) that creates a similarity index based on the labeled document pairs; and
an output unit that outputs the created similarity index in association with the category corresponding to the document group.

1 Similarity score calculation device
11 CPU
11a Acquisition unit
11b Group generation unit
11c Document pair generation unit
11d Assignment unit
11e Creation unit
11f Output unit
12 ROM
13 RAM
14 Mass storage unit
141 Document DB
142 Entity information DB
143 Data set DB
144 Category correspondence DB
145 Similarity index DB
15 Communication unit
16 Input/output unit
17 Reading unit
1P Control program
1a Portable storage medium
1b Semiconductor memory
B Bus

Claims

A determination method in which a computer performs processing of:
acquiring documents in which a mention in the document is associated with an entity having category information in a knowledge base;
generating document groups by classifying the acquired documents by the associated category;
generating, for each generated document group, document pairs including the documents associated with the same mention;
assigning to each generated document pair a label indicating whether the entities match;
creating a similarity index based on the labeled document pairs; and
outputting the created similarity index in association with the category corresponding to the document group.

The method according to claim 1, wherein the category is defined in the knowledge base and is given to each of the entities.

The determination method according to claim 1 or claim 2, further comprising:
accepting a keyword and a plurality of documents hit by a search using the keyword;
performing named entity extraction on the accepted documents;
comparing, document by document, the named entity type of the mention corresponding to the keyword;
when the named entity types match among the documents, acquiring the category corresponding to the named entity type;
acquiring the similarity index associated with the acquired category;
calculating a similarity score between the documents using the acquired similarity index; and
outputting the calculated similarity score.

A determination program that causes a computer to execute processing of:
acquiring documents in which a mention in the document is associated with an entity having category information in a knowledge base;
generating document groups by classifying the acquired documents by the associated category;
generating, for each generated document group, document pairs including the documents associated with the same mention;
assigning to each generated document pair a label indicating whether the entities match;
creating a similarity index based on the labeled document pairs; and
outputting the created similarity index in association with the category corresponding to the document group.

A determination device comprising:
an acquisition unit that acquires documents in which a mention in the document is associated with an entity having category information in a knowledge base;
a group generation unit that generates document groups by classifying the acquired documents by the associated category;
a document pair generation unit that generates, for each generated document group, document pairs including the documents associated with the same mention;
an assignment unit that assigns to each generated document pair a label indicating whether the entities match;
a creation unit that creates a similarity index based on the labeled document pairs; and
an output unit that outputs the created similarity index in association with the category corresponding to the document group.