JP2009140263A

JP2009140263A - Term co-occurrence degree extractor

Info

Publication number: JP2009140263A
Application number: JP2007316422A
Authority: JP
Inventors: Hidenori Kawai; 英紀河合
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-12-06
Filing date: 2007-12-06
Publication date: 2009-06-25
Anticipated expiration: 2027-12-06
Also published as: JP5251099B2

Abstract

PROBLEM TO BE SOLVED: To provide a term co-occurrence degree extractor capable of extracting a large-scale co-occurrence degree graph with a high degree of approximation by approximately obtaining the co-occurrence degree between terms with a small number of searches.

SOLUTION: The term co-occurrence degree extractor 100 extracts a co-occurrence degree graph in which each search target term is a node and the co-occurrence degree of any two search target terms is the edge between the nodes corresponding to those two terms. It comprises: a co-occurrence degree detection accuracy determination part 20 for determining, for unsearched terms, the likelihood that co-occurrence degrees between terms will be obtained; a search strategy decision part 21 for deciding the search order of the terms on the basis of the likelihood determined by the co-occurrence degree detection accuracy determination part 20; a data retrieval part 22 for retrieving document data using each search target term as a keyword in the order decided by the search strategy decision part 21; and a co-occurrence degree calculation part 24 for approximately obtaining, from the search result documents containing the terms retrieved by the data retrieval part 22, the co-occurrence degrees between the search target terms contained in those documents.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a term co-occurrence degree extraction device, a term co-occurrence degree extraction method, and a term co-occurrence degree extraction program for extracting a co-occurrence degree graph in which each search target term is a node and, for any two search target terms, the co-occurrence degree indicating how often the two terms appear in the same document is the edge between the nodes corresponding to those two terms.

In recent years, with the spread of the Internet and the WWW (World Wide Web; hereinafter, the Web), the volume of information in circulation has grown explosively, and research on information extraction that mines the Web has been actively pursued. In particular, attention has focused on methods that obtain the co-occurrence degree between terms by entering terms such as person names, organization names, facility names, and place names into a Web search engine as search queries and using the search results as a corpus (language material). The co-occurrence degree is an index of the degree (frequency or ratio) to which two specific terms appear in the same document.

For example, Patent Document 1 describes a technique for estimating human relationships by searching for person names with a Web search engine. According to that technique, when a list of person names is input, each combination of two person names is submitted to a Web search engine as a search query, and the relationship between the two people is obtained as their co-occurrence degree within documents.

Regarding co-occurrence between terms, Patent Document 2 describes a technique that generates snapshot data by dividing a huge amount of time-series data, input as natural-language text, into arbitrary intervals, applies natural language analysis to the data in each snapshot, obtains co-occurrence relations from the resulting node pairs, and draws a network diagram. It also describes calculating the co-occurrence relation of a node pair using mutual information. The mutual information I(x, y) is the ratio of the probability P(x, y) that the words "x" and "y" co-occur to the product P(x)P(y) of the probabilities that each occurs in the text.

Patent Document 3 describes a technique for storing sets of co-occurring word combinations between homophone groups, for which syntactic processing is performed collectively, in association with the combination of homophone groups. The technique takes the leading rear-side word from a candidate buffer and searches the co-occurrence dictionary index with it, thereby limiting the search range of the co-occurrence dictionary body. It then takes the leading front-side word as a representative word and searches the co-occurrence dictionary body with it, which determines whether there is a word combination that should be given priority.
JP 2004-348179 A
JP 2005-352817 A
JP H08-115318 A (Japanese Patent Laid-Open No. 08-115318)

There are various ways to calculate the co-occurrence degree, including the co-occurrence frequency, mutual information, the Dice coefficient, the Jaccard coefficient, the Simpson coefficient, and the Cosine coefficient. Let N be the total number of Web pages, |K1| and |K2| the numbers of hits for the terms K1 and K2 in a Web search engine, |K1 AND K2| the number of hits when K1 and K2 are searched under a logical product (AND) condition, and |K1 OR K2| the number of hits when they are searched under a logical sum (OR) condition. The co-occurrence frequency, mutual information, Dice coefficient, Jaccard coefficient, Simpson coefficient, and Cosine coefficient are then defined as follows. Because the co-occurrence degree quantifies how strongly two terms co-occur in documents, the term |K1 AND K2| is essential in every definition.
Co-occurrence frequency = |K1 AND K2|
Mutual information = -log{N × |K1 AND K2| / (|K1| × |K2|)}
Dice coefficient = |K1 AND K2| / (|K1| + |K2|)
Jaccard coefficient = |K1 AND K2| / |K1 OR K2|
Simpson coefficient = |K1 AND K2| / min(|K1|, |K2|)
Cosine coefficient = |K1 AND K2| / √(|K1| × |K2|)
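The six definitions can be sketched directly from hit counts. The following is a minimal illustration, not part of the disclosure: the function name `cooccurrence_measures` and its parameter names are hypothetical, and the formulas are transcribed exactly as written above (many references define the Dice coefficient with a factor of 2 in the numerator and mutual information without the leading minus sign).

```python
import math

def cooccurrence_measures(n_total, hits_k1, hits_k2, hits_and, hits_or):
    """Compute the six measures from search-engine hit counts.

    n_total  : N, total number of Web pages
    hits_k1  : |K1|, hits_k2 : |K2|
    hits_and : |K1 AND K2|, hits_or : |K1 OR K2|
    """
    # Mutual information is undefined when the terms never co-occur (log of 0).
    if hits_and > 0:
        mutual_information = -math.log(n_total * hits_and / (hits_k1 * hits_k2))
    else:
        mutual_information = None
    return {
        "frequency": hits_and,
        "mutual_information": mutual_information,
        "dice": hits_and / (hits_k1 + hits_k2),
        "jaccard": hits_and / hits_or,
        "simpson": hits_and / min(hits_k1, hits_k2),
        "cosine": hits_and / math.sqrt(hits_k1 * hits_k2),
    }
```

Every entry depends on `hits_and`, which is why each pair of terms needs its own AND query in the related techniques.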

Patent Document 1 also uses, as an example, a Simpson coefficient with a threshold, to prevent the co-occurrence degree of person names with few hits from being rated unreasonably high. In this method, the ordinary Simpson coefficient is used as the co-occurrence degree when the smaller hit count min(|K1|, |K2|) is greater than a threshold k, and the co-occurrence degree is taken to be 0 when min(|K1|, |K2|) is less than or equal to k.
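The thresholded variant can be sketched in a few lines. The function name `thresholded_simpson` and its parameter names are hypothetical; only the rule itself comes from the description above.

```python
def thresholded_simpson(hits_k1, hits_k2, hits_and, k):
    """Simpson coefficient that returns 0 when the rarer term has at most k hits."""
    smaller = min(hits_k1, hits_k2)
    if smaller <= k:
        # Too few hits to trust the ratio, so the pair is treated as unrelated.
        return 0.0
    return hits_and / smaller
```

For instance, a name with only 3 hits that always appears alongside a popular name would score 1.0 under the plain Simpson coefficient, but scores 0 here when k is 5.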

Although Patent Document 1 targets only person names, relationships between terms other than person names can be obtained by replacing the input data with a list of terms such as organization names or place names.

The first problem with the related techniques for obtaining the co-occurrence degree is that the number of searches required grows dramatically as the input term list becomes large. For example, if the input term list contains 100 terms, there are 100 × 99 / 2! = 4,950 combinations of any two terms. If the Simpson coefficient is used to obtain the co-occurrence degree between terms, 4,950 searches are needed to obtain |K1 AND K2| for every combination and 100 searches to obtain min(|K1|, |K2|), for a total of 5,050 searches against the Web search engine.

Similarly, a term list of 10,000 terms requires 10,000 × 9,999 / 2! + 10,000 = 50,005,000 searches. A large number of queries cannot be issued to a Web search engine in a short time, but even at a rate of one search per second, obtaining all the relationships among 10,000 terms would take 50,005,000 searches / (3,600 seconds × 24 hours) ≈ 579 days. In general, when the number of terms in the list is multiplied by n, the number of searches grows in proportion to the square of n. The cause is that calculating the co-occurrence degree requires a search under the AND condition of two terms.
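The arithmetic in the two examples above can be checked with a short sketch. The function names are hypothetical; the one-query-per-second rate is the one assumed in the text.

```python
def searches_required(n_terms):
    """Pairwise AND queries plus one single-term query per term."""
    pair_queries = n_terms * (n_terms - 1) // 2
    return pair_queries + n_terms

def days_at_one_query_per_second(n_searches):
    """Elapsed days when one search is issued every second."""
    return n_searches / (3600 * 24)

for n in (100, 10_000):
    total = searches_required(n)
    print(n, total, round(days_at_one_query_per_second(total), 1))
```

Because the pair count is n(n-1)/2, multiplying the list size by 100 multiplies the number of searches by roughly 10,000.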

The second problem is that the co-occurrence degree between terms cannot be calculated approximately. For example, suppose that a search for the term K1 returns documents in which the term K2 appears 100 times while the term K3 appears only 10 times. Even without searching for K2 and K3, it can be inferred that the K1-K2 co-occurrence degree is likely to be stronger than the K1-K3 co-occurrence degree. In the invention of Patent Document 1, however, the co-occurrence degree of two terms cannot be calculated unless the Web search engine is queried for that pair of terms.

The third problem is that the co-occurrence degree cannot be calculated recursively while extracting new terms that are not included in the input term list, because Patent Document 1 has no means for extracting new terms. Even if such a means existed, growth of the term list through new-term extraction would trigger the first problem and cause the number of searches to increase geometrically.

An object of the present invention is to provide a term co-occurrence degree extraction device that can extract a large-scale co-occurrence degree graph with a high degree of approximation by approximately obtaining, with a small number of searches, the co-occurrence degrees between the terms of a term list given as input data.

The term co-occurrence degree extraction device according to the first aspect of the present invention is
a term co-occurrence degree extraction device for extracting a co-occurrence degree graph in which each search target term is a node and, for any two search target terms, the co-occurrence degree indicating how often the two terms appear in the same document is the edge between the nodes corresponding to those two terms, the device comprising:
co-occurrence degree detection accuracy determination means for determining, for unsearched terms, the likelihood that co-occurrence degrees between terms will be obtained;
search strategy determination means for determining the search order of the terms based on the likelihood determined by the co-occurrence degree detection accuracy determination means;
search means for searching document data using the search target terms one at a time as the keyword, in the order determined by the search strategy determination means; and
co-occurrence degree calculation means for approximately obtaining, from the search result documents containing the term searched by the search means, the co-occurrence degrees between the search target terms contained in those search result documents.

Preferably, the device includes term extraction means for extracting, from the search result documents and based on a predetermined rule, terms that are not included in the search target terms.

More preferably, the device includes extraction rule learning means for dynamically generating rules for extracting terms from the appearance tendencies of terms in the search result documents.

The extraction rule learning means may generate the rules for extracting terms by:
enumerating the character strings that appear around terms in the search result documents;
generating a set of rule candidates from those surrounding character strings using the word attributes of the terms registered as search target terms and regular expressions that generalize those word attributes; and
narrowing down the rule candidates by comparing the appearance frequency and/or the term extraction rate of each rule candidate with the respective predetermined thresholds.

Preferably, the term extraction means extracts terms based on predetermined rules described by word attributes and regular expressions over word attributes.

Preferably, the co-occurrence degree detection accuracy determination means calculates, for each unsearched term, the likelihood that co-occurrence degrees between terms will be obtained as an approximate graph score consisting of any one of, or a combination of: the expected number of newly extracted terms; the number of edges in the co-occurrence degree graph that change from both-sides-unsearched to one-side-searched; the number of edges that change from one-side-searched to both-sides-searched; and the number of edges that remain one-side-searched but can be expected to gain information, and
the search strategy determination means selects the one or more terms with the highest approximate graph scores as search candidate terms.
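One possible reading of this scoring step is sketched below. It is an illustration under stated assumptions, not the disclosed method: the text does not say how the components are combined, so the weighted sum, the weights `w_one_sided` and `w_two_sided`, the function names, and the alphabetical tie-break are all hypothetical, and the expected number of newly extracted terms and the information-gain component are left out.

```python
def approximate_graph_score(term, edges, searched, w_one_sided=1.0, w_two_sided=1.0):
    """Score an unsearched term by how many known edges its search would upgrade."""
    to_one_sided = 0  # edges that go from both-sides-unsearched to one-side-searched
    to_two_sided = 0  # edges that go from one-side-searched to both-sides-searched
    for a, b in edges:
        if term not in (a, b):
            continue
        other = b if a == term else a
        if other in searched:
            to_two_sided += 1
        else:
            to_one_sided += 1
    return w_one_sided * to_one_sided + w_two_sided * to_two_sided

def pick_search_candidates(terms, edges, searched, top_n=1):
    """Return the top_n unsearched terms by score (ties broken alphabetically)."""
    unsearched = [t for t in terms if t not in searched]
    ranked = sorted(
        unsearched,
        key=lambda t: (-approximate_graph_score(t, edges, searched), t),
    )
    return ranked[:top_n]
```

With this sketch, a term that already touches many tentative edges is searched first, since a single query for it firms up the most edges at once.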

Preferably, in the co-occurrence degree graph, the co-occurrence degree calculation means
calculates the co-occurrence degree of each edge between nodes corresponding to two searched terms, and
calculates the co-occurrence degree of each edge for which the term of one or both nodes is unsearched as an approximate co-occurrence degree based on the search result documents.

The term co-occurrence degree extraction method according to the second aspect of the present invention is
a term co-occurrence degree extraction method for extracting a co-occurrence degree graph in which each search target term is a node and, for any two search target terms, the co-occurrence degree indicating how often the two terms appear in the same document is the edge between the nodes corresponding to those two terms, the method comprising:
a co-occurrence degree detection accuracy determination step of determining, for unsearched terms, the likelihood that co-occurrence degrees between terms will be obtained;
a search strategy determination step of determining the search order of the terms based on the likelihood determined in the co-occurrence degree detection accuracy determination step;
a search step of searching document data using the search target terms one at a time as the keyword, in the order determined in the search strategy determination step; and
a co-occurrence degree calculation step of approximately obtaining, from the search result documents containing the term searched in the search step, the co-occurrence degrees between the search target terms contained in those search result documents.

Preferably, the method includes a term extraction step of extracting, from the search result documents and based on a predetermined rule, terms that are not included in the search target terms.

More preferably, the method includes an extraction rule learning step of dynamically generating rules for extracting terms from the appearance tendencies of terms in the search result documents.

The extraction rule learning step may generate the rules for extracting terms by:
enumerating the character strings that appear around terms in the search result documents;
generating a set of rule candidates from those surrounding character strings using the word attributes of the terms registered as search target terms and regular expressions that generalize those word attributes; and
narrowing down the rule candidates by comparing the appearance frequency and/or the term extraction rate of each rule candidate with the respective predetermined thresholds.

Preferably, the term extraction step extracts terms based on predetermined rules described by word attributes and regular expressions over word attributes.

Preferably, the co-occurrence degree detection accuracy determination step calculates, for each unsearched term, the likelihood that co-occurrence degrees between terms will be obtained as an approximate graph score consisting of any one of, or a combination of: the expected number of newly extracted terms; the number of edges in the co-occurrence degree graph that change from both-sides-unsearched to one-side-searched; the number of edges that change from one-side-searched to both-sides-searched; and the number of edges that remain one-side-searched but can be expected to gain information, and
the search strategy determination step selects the one or more terms with the highest approximate graph scores as search candidate terms.

Preferably, in the co-occurrence degree graph, the co-occurrence degree calculation step
calculates the co-occurrence degree of each edge between nodes corresponding to two searched terms, and
calculates the co-occurrence degree of each edge for which the term of one or both nodes is unsearched as an approximate co-occurrence degree based on the search result documents.

The term co-occurrence degree extraction program according to the third aspect of the present invention is
a term co-occurrence degree extraction program for extracting a co-occurrence degree graph in which each search target term is a node and, for any two search target terms, the co-occurrence degree indicating how often the two terms appear in the same document is the edge between the nodes corresponding to those two terms,
the program causing a computer to function as:
co-occurrence degree detection accuracy determination means for determining, for unsearched terms, the likelihood that co-occurrence degrees between terms will be obtained;
search strategy determination means for determining the search order of the terms based on the likelihood determined by the co-occurrence degree detection accuracy determination means;
search means for searching document data using the search target terms one at a time as the keyword, in the order determined by the search strategy determination means; and
co-occurrence degree calculation means for approximately obtaining, from the search result documents containing the term searched by the search means, the co-occurrence degrees between the search target terms contained in those search result documents.

Preferably, the program provides a function as term extraction means for extracting, from the search result documents and based on a predetermined rule, terms that are not included in the search target terms.

More preferably, the program provides a function as extraction rule learning means for dynamically generating rules for extracting terms from the appearance tendencies of terms in the search result documents.

According to the present invention, the number of searches can be prevented from increasing geometrically with the number of search target terms. In addition, the relationships among a larger number of terms can be obtained approximately even with a small number of searches. Furthermore, a co-occurrence degree graph closer to the true values can be obtained even with a small number of searches.

In the present invention, using the terminology of graph theory, the relationships among the search target terms are represented as a graph (the co-occurrence degree graph) in which each search target term is a node and the co-occurrence degree between terms is an edge. The co-occurrence degree graph is a weighted graph whose edges carry values (co-occurrence degrees), and it is normally a simple graph containing no loops or multiple edges. If the co-occurrence degree between two terms is 0 or is at or below a predetermined threshold, no edge is assumed to exist between them.
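As a minimal sketch of this representation, a weighted simple graph can be held as a dictionary keyed by unordered term pairs. The function name `build_cooccurrence_graph` and the input format are hypothetical choices for illustration only.

```python
def build_cooccurrence_graph(pair_scores, threshold=0.0):
    """Build a weighted simple graph from {(term_a, term_b): co-occurrence degree}.

    Edges are keyed by frozenset so that (a, b) and (b, a) are the same edge.
    """
    graph = {}
    for (term_a, term_b), score in pair_scores.items():
        if term_a == term_b:
            continue  # a simple graph has no loops
        if score <= threshold:
            continue  # 0 or at/below the threshold means "no edge"
        graph[frozenset((term_a, term_b))] = score
    return graph
```

Keying edges by an unordered pair also rules out multiple edges between the same two terms, since a later entry for the same pair simply overwrites the earlier one.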

(Embodiment 1)
FIG. 1 is a block diagram showing a configuration example of the term co-occurrence degree extraction device 100 according to Embodiment 1 of the present invention. The term co-occurrence degree extraction device 100 includes a storage device 1, a processing device 2, an input unit 3 such as a keyboard, and an output unit 4 such as a display or a printer. The processing device 2 is configured so that it can access public data 6, such as a Web search engine, via a network 5 such as the Internet or an intranet.

The storage device 1 includes a term storage unit 11 and a co-occurrence degree data storage unit 13. The processing device 2 includes a search strategy determination unit 21, a co-occurrence degree detection accuracy determination unit 20, a data search unit 22, and a co-occurrence degree calculation unit 24.

The term storage unit 11 stores the term list that is the target of co-occurrence degree extraction. FIG. 2 shows an example of the data stored in the term storage unit 11. In FIG. 2, a list of person names is stored as a table of term IDs, terms, search flags, and appearing document IDs. In FIG. 2, the term "Ichiro Tanaka" with term ID K01 has the search flag "not yet" and the appearing document IDs "none". This means that no search has been performed with the keyword "Ichiro Tanaka" and that no document in which the term appears has been found.

The term "Jiro Takahashi" with term ID K02 has the search flag "done" and the appearing document IDs "D01, D02, D04, D05, D10, D13, D15, D18". This means that a search has been performed with the keyword "Jiro Takahashi" and that the search hit the eight documents with document IDs "D01, D02, D04, D05, D10, D13, D15, D18".

In FIG. 2, the term "Hanako Sato" with term ID K03 has the search flag "not yet" and the appearing document IDs "D02, D05, D10, D18". This means that no search has been performed with the keyword "Hanako Sato", but that the four documents "D02, D05, D10, D18" have been obtained as documents in which "Hanako Sato" appears. Appearing document IDs exist even though "Hanako Sato" is unsearched because the term was detected in documents returned by searches for other terms. In FIG. 2, for example, document D02 was obtained by searching for "Jiro Takahashi", and it can be interpreted that "Hanako Sato" also appeared in it.

The same interpretation applies to documents D05, D10, and D18. Likewise, the term "Saburo Suzuki" with term ID K04 has been searched, and the eight documents "D01, D03, D05, D07, D10, D15, D17, D20" have been obtained as documents in which it appears. The term "Taro Tanaka" with term ID K05 is unsearched, but the five documents "D03, D05, D07, D11, D18" have been obtained as documents in which it appears.

For brevity, the term list stored in the term storage unit 11 has been described here as a table of term IDs, terms, search flags, and appearing document IDs. Other arrangements are also conceivable, such as using the term itself as the primary key instead of a term ID, using a URL (Uniform Resource Locator) or a file address instead of an appearing document ID, or storing the last update date of each appearing document as well; the invention is not limited to the method described in this embodiment.

The co-occurrence degree data storage unit 13 stores the relationships between terms as a weighted graph structure. FIG. 3 shows an example of the co-occurrence degree graph stored in the co-occurrence degree data storage unit 13. Referring to FIG. 3, the co-occurrence degree between terms K01 and K02 is 0.1, and the co-occurrence degree between terms K01 and K05 is 0.5. Searched terms are drawn as hatched nodes and unsearched terms as white nodes, so the figure shows that the co-occurrence degree 0.1 between K01 and K02 was calculated after both terms had been searched. It also shows that the co-occurrence degree 0.1 between K01 and K11 was calculated from the search results of only one of the two terms, K01. Furthermore, terms K15 and K16 are both unsearched, but a co-occurrence degree of 0.5 has been calculated from how often they appear in the search result documents of other terms.

When calculating the co-occurrence degree graph, there are three combinations of searched and unsearched states for the nodes at the two ends of an edge: (a) both sides searched, (b) one side searched, and (c) both sides unsearched. FIG. 4 is a conceptual diagram explaining the approximate co-occurrence degree calculation for these three combinations.
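The three cases can be told apart from the search flags alone. The sketch below uses the flags described for FIG. 2 (K02 and K04 searched; K01, K03, and K05 unsearched); the function name `edge_state` and the dictionary layout are hypothetical.

```python
# Search flags as described for FIG. 2 (True = searched).
SEARCH_FLAGS = {"K01": False, "K02": True, "K03": False, "K04": True, "K05": False}

def edge_state(search_flags, term_a, term_b):
    """Classify an edge as case (a), (b), or (c) by counting searched endpoints."""
    searched_ends = sum(1 for term in (term_a, term_b) if search_flags[term])
    return {
        2: "(a) both sides searched",
        1: "(b) one side searched",
        0: "(c) both sides unsearched",
    }[searched_ends]
```

For example, the K02-K04 edge falls under case (a), the K02-K03 edge under case (b), and the K03-K05 edge under case (c).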

FIG. 4(a) is a conceptual diagram of the co-occurrence degree of terms that have both been searched. The left circle K1 represents the set of documents in which the term K1 appears, and the right circle K2 represents the set of documents in which the term K2 appears. In this case, because both K1 and K2 have been searched, the co-occurrence degree can be calculated without error under any of the definitions: co-occurrence frequency, mutual information, Dice coefficient, Jaccard coefficient, Simpson coefficient, or Cosine coefficient. For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, a both-sides-searched co-occurrence degree can be calculated for terms K02 and K04. From FIG. 2, the term K02 appears in the eight documents "D01, D02, D04, D05, D10, D13, D15, D18", the term K04 appears in the eight documents "D01, D03, D05, D07, D10, D15, D17, D20", and both K02 and K04 appear in the four documents "D01, D05, D10, D15". If the Simpson coefficient is used, the co-occurrence degree is therefore |K02 AND K04| / min(|K02|, |K04|) = 4/8 = 0.5.
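The K02-K04 calculation can be reproduced with set operations on the document IDs listed for FIG. 2. This is a minimal sketch; the names `DOCS` and `simpson_from_sets` are hypothetical.

```python
# Appearing document IDs for the two searched terms, as listed for FIG. 2.
DOCS = {
    "K02": {"D01", "D02", "D04", "D05", "D10", "D13", "D15", "D18"},
    "K04": {"D01", "D03", "D05", "D07", "D10", "D15", "D17", "D20"},
}

def simpson_from_sets(docs_a, docs_b):
    """Simpson coefficient computed from two sets of document IDs."""
    shared = docs_a & docs_b  # documents in which both terms appear
    return len(shared) / min(len(docs_a), len(docs_b))

shared_docs = sorted(DOCS["K02"] & DOCS["K04"])
score = simpson_from_sets(DOCS["K02"], DOCS["K04"])
```

Here `shared_docs` holds the four common documents and `score` is 4/8, matching the value in the text, without issuing a separate AND query for the pair.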

FIG. 4(b) is a conceptual diagram of the co-occurrence degree when only one of the terms has been searched. The left circle K1 represents the set of documents in which term K1 appears, the dotted circle K2 on the right represents the true set of documents in which term K2 appears, and the ellipse K2' inside it represents the set of documents from which term K2 was extracted as a result of searching for other terms. Because term K1 has been searched, the set of documents in which K1 appears is already known. For term K2, on the other hand, the document set obtained from the search results of other terms is only a subset of the true set of documents in which K2 appears. Even so, the number of documents in the intersection of the K1 document set and the K2-extracted document set equals |K1 AND K2|. This is because the set of documents in which K1 and K2 co-occur can be obtained as those documents, among the documents in which K1 appears, from which K2 was extracted. In this case, the approximate co-occurrence degree between the terms can be calculated as follows.

When the co-occurrence frequency is used as the co-occurrence measure, |K1 AND K2| is already known, so the co-occurrence degree of terms K1 and K2 can be calculated without error. For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, terms K02 and K03 both appear in the 4 documents "D02, D05, D10, D18", so the co-occurrence frequency is 4.

When mutual information is used as the co-occurrence measure, an approximate value can be calculated by using |K2|', the number of documents from which term K2 was extracted, in place of |K2|, the number of documents in which term K2 appears:
-log{N × |K1 AND K2| / (|K1| × |K2|')}
As is apparent from FIG. 4(b), |K2| > |K2|', so the approximate mutual information in the one-side-searched case gives a lower bound on the true mutual information obtained once both sides are searched.

For example, suppose that 1,000,000 pages are registered in the Web search engine and that the data stored in the term storage unit 11 is as shown in FIG. 2. The co-occurrence degree of terms K02 and K03 is then obtained as an approximate mutual information value as follows. Both terms appear in the 4 documents "D02, D05, D10, D18", term K02 appears in the 8 documents "D01, D02, D04, D05, D10, D13, D15, D18", and term K03 has been extracted from the 4 documents "D02, D05, D10, D18". The approximate mutual information is therefore -log(1,000,000 × 4 / (8 × 4)) = -log(125,000) ≈ -5.1, taking the logarithm to base 10. This value may increase later as term K2 (here K03) is extracted from more documents, but it cannot decrease.
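The lower-bound behaviour can be checked numerically. The sketch below assumes a base-10 logarithm (the text does not state the base) and uses an illustrative function name; the later count of 10 documents for K03 is a hypothetical value chosen only to show the direction of change.

```python
import math


def approx_mutual_information(n_docs, n_both, n_k1, n_k2_extracted):
    """-log10(N * |K1 AND K2| / (|K1| * |K2|')), the one-side-searched form in the text."""
    return -math.log10(n_docs * n_both / (n_k1 * n_k2_extracted))


# Values from the example: N = 1,000,000, |K02 AND K03| = 4, |K02| = 8, |K03|' = 4.
approx = approx_mutual_information(1_000_000, 4, 8, 4)

# Hypothetical later state: K03 is found in 10 documents in total.
# |K1 AND K2| is already exact, so only the denominator grows and the value rises.
later = approx_mutual_information(1_000_000, 4, 8, 10)
```

Because |K2|' can only grow toward |K2| while |K1 AND K2| stays fixed, the argument of the logarithm can only shrink, which is why the approximate value is a lower bound.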

When the Dice coefficient is used as the co-occurrence measure, an approximate value can be calculated as |K1 AND K2| / (|K1| + |K2|') by using |K2|', the number of documents from which term K2 was extracted, in place of |K2|, the number of documents in which term K2 appears. Because |K2|' does not exceed |K2|, the approximate Dice coefficient in the one-side-searched case is an upper bound on the true Dice coefficient obtained once both sides are searched.

For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, the co-occurrence degree of terms K02 and K03 is obtained as an approximate Dice coefficient as follows. Both terms appear in the 4 documents "D02, D05, D10, D18", term K02 appears in the 8 documents "D01, D02, D04, D05, D10, D13, D15, D18", and term K03 has been extracted from the 4 documents "D02, D05, D10, D18". The approximate Dice coefficient is therefore 4 / (4 + 8) ≈ 0.33. This value may decrease later as term K2 (here K03) is extracted from more documents, but it cannot increase.

When the Jaccard coefficient is used as the co-occurrence measure, an approximate value can be calculated as |K1 AND K2| / |K1 OR K2|' by using |K1 OR K2|', the number of documents in the union of the set of documents in which term K1 appears and the set of documents from which term K2 was extracted, in place of |K1 OR K2|. In this case, the approximate Jaccard coefficient in the one-side-searched case is an upper bound on the true Jaccard coefficient obtained once both sides are searched.

For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, the co-occurrence degree of terms K02 and K03 is obtained as an approximate Jaccard coefficient as follows. Both terms appear in the 4 documents "D02, D05, D10, D18", and the union of the set of documents in which K02 appears and the set of documents from which K03 has been extracted is the 8 documents "D01, D02, D04, D05, D10, D13, D15, D18". The approximate Jaccard coefficient is therefore 4/8 = 0.5. This value may decrease later as term K2 (here K03) is extracted from more documents, but it cannot increase.

When the Simpson coefficient is used as the co-occurrence measure, an approximate value can be calculated by using |K2|', the number of documents from which term K2 was extracted, in place of |K2|, the number of documents in which term K2 appears:
|K1 AND K2| / min(|K1|, |K2|')
In this case, the approximate Simpson coefficient in the one-side-searched case is an upper bound on the true Simpson coefficient obtained once both sides are searched.

For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, the co-occurrence degree of terms K02 and K03 is obtained as an approximate Simpson coefficient as follows. Both terms appear in the 4 documents "D02, D05, D10, D18", term K02 appears in the 8 documents "D01, D02, D04, D05, D10, D13, D15, D18", and term K03 has been extracted from the 4 documents "D02, D05, D10, D18". The approximate Simpson coefficient is therefore 4 / min(8, 4) = 1. This value may decrease later as term K2 (here K03) is extracted from more documents, but it cannot increase.

Looking at the cases in finer detail, compare |K1|, the number of documents in which the searched term K1 appears, with |K2|', the number of documents from which the unsearched term K2 has been extracted. If |K1| < |K2|', then min(|K1|, |K2|') = min(|K1|, |K2|) = |K1|, so the approximate value equals the true Simpson coefficient of the both-sides-searched case even though only one side has been searched.
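The special case above can be illustrated with a few numbers. This is a sketch with hypothetical counts (|K1| = 5, |K1 AND K2| = 3, |K2|' = 7, true |K2| = 12); none of these values appear in the source, and the function name is an assumption.

```python
def simpson_from_counts(n_both: int, n_k1: int, n_k2: int) -> float:
    """Simpson coefficient computed from document counts."""
    return n_both / min(n_k1, n_k2)


# Hypothetical counts: K1 searched (5 documents), K2 extracted from 7 documents so far,
# K2 truly appearing in 12 documents. |K1 AND K2| = 3 is exact in both cases.
approx_value = simpson_from_counts(3, 5, 7)    # uses |K2|'
true_value = simpson_from_counts(3, 5, 12)     # uses the true |K2|
```

Since |K2| is at least |K2|', which already exceeds |K1|, the minimum is |K1| in both calls and the two values coincide.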

When the Cosine coefficient is used as the co-occurrence measure, an approximate value can be calculated by using |K2|', the number of documents from which term K2 was extracted, in place of |K2|, the number of documents in which term K2 appears:
|K1 AND K2| / √(|K1| × |K2|')
In this case, the approximate Cosine coefficient in the one-side-searched case is an upper bound on the true Cosine coefficient obtained once both sides are searched.

For example, if the data stored in the term storage unit 11 is as shown in FIG. 2, the co-occurrence degree of terms K02 and K03 is obtained as an approximate Cosine coefficient as follows. Both terms appear in the 4 documents "D02, D05, D10, D18", term K02 appears in the 8 documents "D01, D02, D04, D05, D10, D13, D15, D18", and term K03 has been extracted from the 4 documents "D02, D05, D10, D18". The approximate Cosine coefficient is therefore 4 / √(8 × 4) ≈ 0.71. This value may decrease later as term K2 (here K03) is extracted from more documents, but it cannot increase.
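The one-side-searched approximations for the K02/K03 example can be computed side by side. The sketch below follows the formulas as written in the text (including the Dice form without a factor of 2); the variable and dictionary key names are illustrative assumptions.

```python
import math

# K02 has been searched; K03 has only been extracted from other terms' results (FIG. 2 example).
k02 = {"D01", "D02", "D04", "D05", "D10", "D13", "D15", "D18"}
k03_extracted = {"D02", "D05", "D10", "D18"}

both = len(k02 & k03_extracted)          # exact |K1 AND K2|
n1, n2 = len(k02), len(k03_extracted)    # |K1| and |K2|'

approx = {
    "dice": both / (n1 + n2),                    # form used in the text, no factor 2
    "jaccard": both / len(k02 | k03_extracted),  # |K1 AND K2| / |K1 OR K2|'
    "simpson": both / min(n1, n2),
    "cosine": both / math.sqrt(n1 * n2),
}
```

Each of these four values is an upper bound on the corresponding true coefficient, because replacing |K2|' by the larger |K2| can only enlarge the denominator.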

FIG. 4(c) is a conceptual diagram of the co-occurrence degree when neither term has been searched. The dotted circle K1 on the left represents the true set of documents in which term K1 appears, and the circle K1' inside it represents the set of documents from which K1 was extracted as a result of searching for other terms. Likewise, the dotted circle K2 on the right represents the true set of documents in which term K2 appears, and the circle K2' inside it represents the set of documents from which K2 was extracted as a result of searching for other terms. In this case, only a subset of the true document set is available for both K1 and K2. Even so, an approximate co-occurrence degree between the terms can be calculated by using |K1|', the number of documents from which K1 was extracted, |K2|', the number of documents from which K2 was extracted, and |K1 AND K2|', the number of documents from which both K1 and K2 were extracted.

Note, however, that in the one-side-searched case |K1 AND K2| is known exactly, so the approximate co-occurrence degree is clearly an upper or lower bound. In the both-sides-unsearched case, |K1 AND K2|' is itself an approximation. When other terms are searched in later processing and further documents from which K1 and K2 are extracted are added, the approximate co-occurrence degree may therefore either increase or decrease.

The search strategy determination unit 21 in FIG. 1 refers to the term list stored in the term storage unit 11 and the co-occurrence graph stored in the co-occurrence degree data storage unit 13. For each unsearched term, it calculates the likelihood that searching the term will improve the approximation of the co-occurrence graph, expressed as an approximate graph score AGS (Approximate Graph Score), and passes the k terms with the highest AGS to the data search unit 22 as search candidate words.

The approximate graph score AGS(Ki) for a term Ki can be defined, for example, as follows.
AGS(Ki) = ΔN × (α|E01| + β|E12| + γ|E11|)
Here, ΔN is the expected number of terms newly extracted by searching for term Ki. In general, a term that co-occurs with more already-extracted terms can be presumed to co-occur with more not-yet-extracted terms as well, so the number of edges around term Ki in the co-occurrence graph of FIG. 3 can be used as an estimate of ΔN. For example, in FIG. 3 there are six edges around term K16 (K16-K07, K16-K12, K16-K13, K16-K14, K16-K15, K16-K17), so ΔN for K16 is 6.

|E01| is the number of edges that change from both sides unsearched to one side searched when term Ki is searched. In FIG. 3, if term K16 is newly searched, the five edges K16-K12, K16-K13, K16-K14, K16-K15, and K16-K17 change from both sides unsearched to one side searched, so |E01| for K16 is 5.

|E12| is the number of edges that change from one side searched to both sides searched when term Ki is searched. For example, in FIG. 3, if term K16 is newly searched, the single edge K16-K07 changes from one side searched to both sides searched, so |E12| for K16 is 1.

|E11| is the number of edges that remain one side searched after term Ki is searched, but for which a closer approximation of the co-occurrence degree can be expected because more information becomes available. For example, in FIG. 3, if term K16 is newly searched, the seven edges K12-K10, K13-K08, K14-K08, K15-K08, K15-K07, K17-K07, and K17-K09 remain one side searched. However, K12, K13, K14, K15, and K17 may be newly extracted from the documents in the search results, so a closer approximation of their co-occurrence degrees can be expected. |E11| for K16 is therefore 7. The values α, β, and γ are weights applied to the edge counts |E01|, |E12|, and |E11|, respectively.

As discussed in the description of FIG. 4, in the one-side-searched case the co-occurrence degree either equals the both-sides-searched value or is bounded above or below, whereas in the both-sides-unsearched case only a rough estimate of the co-occurrence degree is available. In terms of how closely the graph approximates the co-occurrence graph built from true co-occurrence degrees, an edge that changes from both sides unsearched to one side searched is therefore more important than an edge that changes from one side searched to both sides searched. That edge in turn is more important than an edge that remains one side searched. From this discussion, the weights α, β, and γ are preferably set so that α > β > γ.
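The score definition can be written as a small function. In this sketch the function and parameter names are illustrative assumptions; the default weights α = 100, β = 10, γ = 1 are the values used in the worked example of this description, and the edge counts for K16 are those given for FIG. 3.

```python
def approximate_graph_score(delta_n, e01, e12, e11, alpha=100, beta=10, gamma=1):
    """AGS(Ki) = dN * (alpha * |E01| + beta * |E12| + gamma * |E11|)."""
    if not alpha > beta > gamma:
        # The text recommends alpha > beta > gamma; reject other orderings in this sketch.
        raise ValueError("weights should satisfy alpha > beta > gamma")
    return delta_n * (alpha * e01 + beta * e12 + gamma * e11)


# K16 in FIG. 3: six surrounding edges, |E01| = 5, |E12| = 1, |E11| = 7.
ags_k16 = approximate_graph_score(delta_n=6, e01=5, e12=1, e11=7)
```

Enforcing the weight ordering in code is a design choice of this sketch, not a requirement stated by the source, which only calls the ordering preferable.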

The data search unit 22 in FIG. 1 searches the public data 6 via the network 5, one word at a time, for each of the k search candidate words passed from the search strategy determination unit 21, and obtains as the search result a list of the IDs of the documents in which the term appears. It then adds the obtained list of document IDs to the term list stored in the term storage unit 11. It also acquires the bodies of the documents indicated by the document IDs via the network 5 and passes them to the co-occurrence degree calculation unit 24.

The co-occurrence degree calculation unit 24 calculates the co-occurrence degree between each pair of terms from the term list stored in the term storage unit 11 and stores the result in the co-occurrence degree data storage unit 13 as a weighted graph.

Next, the operation of the present embodiment will be described in detail with reference to FIG. 1 and FIGS. 2 to 8. FIG. 5 is a flowchart showing an example of the operation of the term co-occurrence degree extraction device 100 in the present embodiment.

The search strategy determination unit 21 refers to the term list stored in the term storage unit 11 and the co-occurrence graph stored in the co-occurrence degree data storage unit 13, and calculates, for each unsearched term, the likelihood of improving the approximation of the co-occurrence graph as the approximate graph score AGS. It then selects the k terms with the highest AGS as search candidate words (step S201 in FIG. 5).

The data search unit 22 searches the public data 6 via the network 5, one word at a time, for each of the k search candidate words passed from the search strategy determination unit 21, and adds the list of document IDs obtained as the search result to the term list stored in the term storage unit 11. It acquires the documents indicated by the document IDs via the network 5 and passes them to the co-occurrence degree calculation unit 24 (step S202 in FIG. 5).

If the update degree of the co-occurrence degree data storage unit 13 is equal to or greater than a threshold (step S205 in FIG. 5; Yes), repeating the processing is expected to further improve the approximation of the graph, so the process returns to step S201 in FIG. 5 and is repeated. The update degree of the co-occurrence degree data storage unit 13 can be defined as ΔK × ΔE, where (1) ΔK is the number of terms newly added to the co-occurrence graph and (2) ΔE is the total change in edge weights. If the update degree is less than the threshold (step S205 in FIG. 5; No), a co-occurrence graph with a sufficiently high degree of approximation has been obtained, and the process ends.
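The termination test of step S205 can be sketched as follows. The source defines the update degree only as ΔK × ΔE; measuring ΔE as the sum of absolute weight differences, and counting a newly created edge as a change from 0, are assumptions of this sketch, as are all names and sample values.

```python
def update_degree(old_terms, new_terms, old_weights, new_weights):
    """Update degree dK * dE between two snapshots of the co-occurrence graph."""
    delta_k = len(set(new_terms) - set(old_terms))          # newly added terms
    delta_e = sum(abs(w - old_weights.get(edge, 0.0))       # assumed: absolute change,
                  for edge, w in new_weights.items())       # new edges start from 0
    return delta_k * delta_e


def should_continue(degree, threshold):
    """Step S205: repeat from S201 while the update degree is at least the threshold."""
    return degree >= threshold


# Hypothetical snapshots before and after one pass.
old_w = {("K01", "K02"): 0.5}
new_w = {("K01", "K02"): 0.25, ("K02", "K03"): 1.0}
degree = update_degree({"K01", "K02"}, {"K01", "K02", "K03"}, old_w, new_w)
```

One consequence of the product form is that a pass that adds no new terms yields an update degree of 0 regardless of how much the edge weights moved.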

For simplicity, the search candidate words that the search strategy determination unit 21 passes to the data search unit 22 in step S201 of FIG. 5 have been described as the k terms with the highest approximate graph score AGS. Other approaches are also possible, such as taking the top x% of terms by AGS or taking all terms whose AGS is equal to or greater than a threshold ρ, and the method is not limited to the one described in this embodiment. Likewise, measuring the update degree of the co-occurrence degree data storage unit 13 has been described as the termination condition of step S205 in FIG. 5. The iteration may instead be stopped according to whether the total processing time or the number of searches has reached a threshold, and the method is not limited to the one described in this embodiment.

FIG. 6 is a flowchart showing an example of the operation of the search strategy determination unit 21. The search strategy determination unit 21 initializes the search candidate set T to the empty set (step S211 in FIG. 6). Next, it scans the co-occurrence graph stored in the co-occurrence degree data storage unit 13 to find an unsearched term Ki (step S212 in FIG. 6). If an unsearched term Ki is found (step S212; Yes), the search strategy determination unit 21 calculates the approximate graph score AGS(Ki) of that term (step S213 in FIG. 6) and adds the pair of the term Ki and its approximate graph score AGS(Ki) to the search candidate set T (step S214 in FIG. 6).

For example, if the co-occurrence graph stored in the co-occurrence degree data storage unit 13 is as shown in FIG. 3, there are seven unsearched terms: K11, K12, K13, K14, K15, K16, and K17. With α = 100, β = 10, and γ = 1, the approximate graph score of each unsearched term is obtained as follows.

Term K11 has six edges around its node: K11-K01, K11-K02, K11-K03, K11-K04, K11-K05, and K11-K07. Searching for K11 changes no edge from both sides unsearched to one side searched (|E01| = 0), changes the six edges K11-K01, K11-K02, K11-K03, K11-K04, K11-K05, and K11-K07 from one side searched to both sides searched (|E12| = 6), and leaves no edge that remains one side searched while gaining information (|E11| = 0). Therefore,
AGS(K11) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 6 × (100×0 + 10×6 + 1×0)
= 360

Term K12 has two edges around its node: K12-K10 and K12-K16. Searching for K12 changes the one edge K12-K16 from both sides unsearched to one side searched (|E01| = 1), changes the one edge K12-K10 from one side searched to both sides searched (|E12| = 1), and leaves the one edge K16-K07 one side searched while gaining information (|E11| = 1). Therefore,
AGS(K12) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 2 × (100×1 + 10×1 + 1×1)
= 222

Term K13 has two edges around its node: K13-K16 and K13-K08. Searching for K13 changes the one edge K13-K16 from both sides unsearched to one side searched (|E01| = 1), changes the one edge K13-K08 from one side searched to both sides searched (|E12| = 1), and leaves the one edge K16-K07 one side searched while gaining information (|E11| = 1). Therefore,
AGS(K13) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 2 × (100×1 + 10×1 + 1×1)
= 222

Term K14 has two edges around its node: K14-K16 and K14-K08. Searching for K14 changes the one edge K14-K16 from both sides unsearched to one side searched (|E01| = 1), changes the one edge K14-K08 from one side searched to both sides searched (|E12| = 1), and leaves the one edge K08-K15 one side searched while gaining information (|E11| = 1). Therefore,
AGS(K14) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 2 × (100×1 + 10×1 + 1×1)
= 222

Term K15 has three edges around its node: K15-K16, K15-K07, and K15-K08. Searching for K15 changes the one edge K15-K16 from both sides unsearched to one side searched (|E01| = 1), changes the two edges K15-K07 and K15-K08 from one side searched to both sides searched (|E12| = 2), and leaves the one edge K16-K07 one side searched while gaining information (|E11| = 1). Therefore,
AGS(K15) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 3 × (100×1 + 10×2 + 1×1)
= 363

Term K16 has six edges around its node: K16-K07, K16-K12, K16-K13, K16-K14, K16-K15, and K16-K17. Searching for K16 changes the five edges K16-K12, K16-K13, K16-K14, K16-K15, and K16-K17 from both sides unsearched to one side searched (|E01| = 5), changes the one edge K16-K07 from one side searched to both sides searched (|E12| = 1), and leaves the seven edges K12-K10, K13-K08, K14-K08, K15-K07, K15-K08, K17-K07, and K17-K09 one side searched while gaining information (|E11| = 7). Therefore,
AGS(K16) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 6 × (100×5 + 10×1 + 1×7)
= 3,102

Term K17 has three edges around its node: K17-K07, K17-K09, and K17-K16. Searching for K17 changes the one edge K17-K16 from both sides unsearched to one side searched (|E01| = 1), changes the two edges K17-K07 and K17-K09 from one side searched to both sides searched (|E12| = 2), and leaves the one edge K16-K07 one side searched while gaining information (|E11| = 1). Therefore,
AGS(K17) = ΔN × (α|E01| + β|E12| + γ|E11|)
= 3 × (100×1 + 10×2 + 1×1)
= 363

Next, when no unsearched term Ki remains for which the approximate graph score AGS(Ki) should be calculated (step S212 in FIG. 6; No), the search strategy determination unit 21 sorts the search candidate set T in order of approximate graph score AGS (step S215 in FIG. 6) and passes the top n unsearched terms to the data search unit 22 as its output (step S216 in FIG. 6). For example, in the approximate graph score calculation for terms K11 to K17 above, if the top three unsearched terms are output, the three terms K16, K15, and K17 are passed to the data search unit 22 as the terms to be searched next.
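The scoring and ranking steps (S212 to S216) can be sketched end to end on the edges of FIG. 3 that the worked example mentions. The edge list is reconstructed from the text only, and the rule used for |E11| (edges joining an unsearched neighbour of Ki to a searched term) is an assumed reading that happens to reproduce the counts given above; all identifiers are illustrative.

```python
# Edges of FIG. 3 named in the worked example (edges between two searched
# terms are not listed in the text and do not affect these scores).
EDGES = [
    ("K11", "K01"), ("K11", "K02"), ("K11", "K03"),
    ("K11", "K04"), ("K11", "K05"), ("K11", "K07"),
    ("K12", "K10"), ("K12", "K16"),
    ("K13", "K16"), ("K13", "K08"),
    ("K14", "K16"), ("K14", "K08"),
    ("K15", "K16"), ("K15", "K07"), ("K15", "K08"),
    ("K16", "K07"), ("K16", "K17"),
    ("K17", "K07"), ("K17", "K09"),
]
UNSEARCHED = {f"K{i}" for i in range(11, 18)}   # K11 .. K17

neighbors = {}
for a, b in EDGES:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)


def ags(term, alpha=100, beta=10, gamma=1):
    """Approximate graph score of an unsearched term, using edge count as dN."""
    nbrs = neighbors[term]
    unsearched_nbrs = nbrs & UNSEARCHED
    e01 = len(unsearched_nbrs)        # both unsearched -> one side searched
    e12 = len(nbrs - UNSEARCHED)      # one side searched -> both sides searched
    # Assumed reading of |E11|: edges joining an unsearched neighbour to a searched term.
    e11 = sum(len(neighbors[u] - UNSEARCHED) for u in unsearched_nbrs)
    return len(nbrs) * (alpha * e01 + beta * e12 + gamma * e11)


scores = {t: ags(t) for t in sorted(UNSEARCHED)}
# Sort by descending score; ties are broken by term name (an arbitrary choice).
top3 = sorted(scores, key=lambda t: (-scores[t], t))[:3]
```

K15 and K17 tie at 363, so the order between them depends on the tie-breaking rule, which the source does not specify.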

For simplicity, the processing has been described for an intermediate state in which the co-occurrence graph stored in the co-occurrence degree data storage unit 13 has already been built to some extent. In the initial state, no co-occurrence graph has been built in the co-occurrence degree data storage unit 13, and the term list is merely stored in the term storage unit 11 with every term unsearched. In the initial state, therefore, the search candidate words may be chosen, for example, by taking the first k terms of the term list stored in the term storage unit 11 or by selecting k terms at random, and the method is not limited to the one described in this embodiment.

FIG. 7 is a flowchart showing an example of the operation of the data search unit 22. The data search unit 22 takes one word at a time from the set of search candidate words passed from the search strategy determination unit 21 (step S221 in FIG. 7; Yes) and searches the public data 6 via the network 5 using the search candidate word as the query (step S222 in FIG. 7). Next, in the term list stored in the term storage unit 11, it adds the list of document IDs obtained as the search result to the appearance-document-ID field of the term used as the query (step S223 in FIG. 7). It also acquires the document bodies indicated by the list of document IDs obtained as the search result (step S224 in FIG. 7).

When all search candidate words have been searched (step S221 in FIG. 7; No), the acquired set of document bodies is passed to the co-occurrence degree calculation unit 24. Because the data search unit 22 searches one word at a time from the set of search candidate words, the number of searches is at most the number of terms in the term list, which prevents the number of searches from growing geometrically.

For brevity, the data search unit 22 has been described as acquiring every document body identified by the list of document IDs obtained as a search result. Other approaches are possible, such as keeping already acquired documents in a cache so that the same document is not fetched again; the invention is not limited to the method described in this embodiment.

FIG. 8 is a flowchart showing an example of the operation of the co-occurrence degree calculation unit 24. The co-occurrence degree calculation unit 24 generates pairs of terms one at a time from the term list stored in the term storage unit 11 (step S241 in FIG. 8; Yes) and calculates the co-occurrence degree of each pair from the lists of appearance document IDs recorded in the term list, using the Simpson coefficient (step S242 in FIG. 8). If the calculated co-occurrence degree is higher than a threshold β specified in advance (step S243 in FIG. 8; Yes), the pair of terms is added to the co-occurrence degree graph stored in the co-occurrence degree data storage unit 13, and the co-occurrence degree is set as the edge weight (step S244 in FIG. 8). If the pair is already registered in the co-occurrence degree graph, the weight of its edge is updated. This is repeated until the co-occurrence degree has been calculated for all pairs of terms (step S241 in FIG. 8; No).

For brevity, the Simpson coefficient has been used here as the example method of calculating the co-occurrence degree, but various other measures are possible, such as co-occurrence frequency, mutual information, the Dice coefficient, the Jaccard coefficient, the thresholded Simpson coefficient, and the cosine coefficient; the invention is not limited to the method described in this embodiment. Likewise, the co-occurrence degree calculation unit 24 has been described as calculating the co-occurrence degree for every combination of terms stored in the term storage unit 11, but processing can be made more efficient by calculating it only for pairs that combine a term updated by the data search unit 22 with another term; the invention is not limited to the method described in this embodiment.
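As a minimal sketch of steps S241 to S244, the following Python computes the Simpson coefficient, |A ∩ B| / min(|A|, |B|), from sets of appearance document IDs and records an edge when the value exceeds β. The data layout (a dict from term to a set of document IDs, and a dict keyed by term pairs for the graph), the function names, and the sample document IDs are assumptions made for illustration, not the storage format of the device.

```python
from itertools import combinations


def simpson(docs_a, docs_b):
    """Simpson (overlap) coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not docs_a or not docs_b:
        return 0.0  # assumption: treat an empty document list as no co-occurrence
    return len(docs_a & docs_b) / min(len(docs_a), len(docs_b))


def update_graph(term_docs, graph, beta):
    """Add or update an edge for every term pair whose co-occurrence exceeds beta."""
    for a, b in combinations(sorted(term_docs), 2):   # step S241
        score = simpson(term_docs[a], term_docs[b])   # step S242
        if score > beta:                              # step S243
            graph[(a, b)] = score                     # step S244 (insert or overwrite)
    return graph


# Hypothetical term list: term -> set of appearance document IDs.
term_docs = {
    "田中一郎": {"D1", "D2", "D3", "D4"},
    "高橋二郎": {"D2", "D3"},
    "佐藤太郎": {"D9"},
}
graph = update_graph(term_docs, {}, beta=0.5)
```

Storing the graph as a dict keyed by the term pair makes the "update the weight if the edge already exists" case the same operation as adding a new edge.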

Next, the effects of this embodiment are described.
In this embodiment, the public data 6 is searched one term at a time rather than by pairs of terms. The number of searches is therefore at most the number of terms in the term list, which prevents the number of searches from growing geometrically.
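To put the saving in concrete terms, the short Python sketch below compares one query per term with a pairwise strategy. It assumes, for illustration only, that the pairwise strategy issues exactly one query per unordered pair of terms; `query_counts` is a hypothetical name.

```python
from math import comb


def query_counts(num_terms):
    """Return (queries when searching per term, queries when searching per pair)."""
    per_term = num_terms            # one search per term, as in this embodiment
    per_pair = comb(num_terms, 2)   # one search per unordered pair of terms
    return per_term, per_pair


counts = query_counts(1000)
```

For 1000 terms this gives 1000 per-term queries against 499500 pairwise queries.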

In this embodiment, an approximate co-occurrence degree can be obtained even for an unsearched term, provided the term appears in a document included in the search results of a searched term. Relationships among more terms can therefore be obtained approximately even with a small number of searches.

In this embodiment, an index of which unsearched term, if searched, would yield a better approximation of the co-occurrence degree graph is calculated as the approximate graph score, and terms are searched in descending order of that score. A co-occurrence degree graph closer to the true values can therefore be obtained even with a small number of searches.

(Embodiment 2)
FIG. 10 is a block diagram showing a configuration example of the term co-occurrence degree extraction device 100 according to Embodiment 2 of the present invention. Embodiment 2 differs from Embodiment 1 in that a term extraction unit 23 is added to the processing device 2 and an extraction rule storage unit 12 is added to the storage device 1.

The extraction rule storage unit 12 stores pairs consisting of an extraction rule, which describes a character string to be extracted as a term, and its score. An extraction rule is expressed as a combination of word attributes. Word attributes include the terms stored in the term storage unit 11; the notation, that is, the surface character string; the base form of conjugated verbs and adjectives; the part of speech; the reading (furigana or kana notation); a representative notation that absorbs differences in synonymous expressions, okurigana, and hiragana or katakana spelling; and semantic classes such as "place name" or "color name".

図１１は、抽出ルール記憶部１２に格納されている抽出ルールの例を示す。ダブルクォーテーション“”で囲まれた抽出ルールに一致する文字列を用語として抽出する。図１１における「｜」「＋」「（）」などの演算子の意味は、一般的な正規表現演算子の意味と同じである。図１１は、例として人名を抽出するためのルールである。 FIG. 11 shows an example of extraction rules stored in the extraction rule storage unit 12. A character string that matches the extraction rule enclosed in double quotations “” is extracted as a term. The meanings of operators such as “|”, “+”, and “()” in FIG. 11 are the same as the meanings of general regular expression operators. FIG. 11 shows a rule for extracting a person name as an example.

抽出ルールR01は、用語記憶部１１に記憶されている用語と完全一致する文字列を人名として抽出するルールである。例えば、用語記憶部１１の内容が図２のようであった場合、「田中一郎」や「高橋二郎」などの文字列が文書に出現すると、それは人名と判断され、スコア1.0が加算される。 The extraction rule R01 is a rule for extracting a character string that completely matches a term stored in the term storage unit 11 as a person name. For example, if the contents of the term storage unit 11 are as shown in FIG. 2, if a character string such as “Ichiro Tanaka” or “Jiro Takahashi” appears in the document, it is determined as a person name, and a score of 1.0 is added.

Extraction rule R02 extracts as a person name a character string in which, when the document is morphologically analyzed, the parts of speech "noun-proper noun-person name-surname" and "noun-proper noun-person name-given name" appear in that order. For example, even if the person name "田中五郎" is not registered in the term storage unit 11, if the morphological analysis of "田中五郎" yields "田中/noun-proper noun-person name-surname 五郎/noun-proper noun-person name-given name", then "田中五郎" is extracted as a new person name and a score of 1.0 is added.

Extraction rule R03 applies when, in the morphological analysis of the document, one or more words whose part of speech is "noun" appear in succession, followed by a "noun-proper noun-person name-given name", followed by a suffix commonly attached to person names, such as "氏", "様", "さん", or "先生"; the character string preceding the suffix is extracted as a person name. For example, even if the person name "笹間一郎" is not registered in the term storage unit 11, if the morphological analysis of "笹間一郎さん" yields "笹/noun-general 間/noun-general 一郎/noun-proper noun-person name-given name さん/noun-suffix-person name", then "笹間一郎" is extracted as a new person name and a score of 0.5 is added. With such a rule, a character string that looks like a person name can be extracted even when the surname "笹間" is not registered in the morphological analyzer.

Extraction rule R04 applies when, in the morphological analysis of the document, a word whose part of speech is "noun-proper noun-person name-surname" appears, followed by one or more "noun" words in succession, followed by a suffix commonly attached to person names, such as "氏", "様", "さん", or "先生"; the character string preceding the suffix is extracted as a person name. For example, even if the person name "田中仙太郎" is not registered in the term storage unit 11, if the morphological analysis of "田中仙太郎先生" yields "田中/noun-proper noun-person name-surname 仙/noun-proper noun-person name-given name 太郎/noun-proper noun-person name-given name 先生/noun-general", then "田中仙太郎" is extracted as a new person name and a score of 0.4 is added. With such a rule, a character string that looks like a person name can be extracted even when the given name "仙太郎" is not registered in the morphological analyzer.

Extraction rule R05 extracts as a person name a character string composed of the first two characters and the last two characters of terms stored in the term storage unit 11. For example, if the contents of the term storage unit 11 are as shown in FIG. 2 and a character string such as "高橋一郎" or "佐藤太郎" appears in a document, it is judged to be a person name and a score of 0.7 is added. The extraction rules above are not necessarily mutually exclusive, and several rules may apply to one character string. For example, if the person name "田中一郎" is registered in the term storage unit 11 and the morphological analysis result is "田中/noun-proper noun-person name-surname 一郎/noun-proper noun-person name-given name", the character string matches extraction rules R01, R02, and R05. In that case the scores of all matching rules are summed, giving 1.0 + 1.0 + 0.7 = 2.7. A character string with a higher score can thus be judged more likely to be a person name.
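The score summation for a string matched by several rules can be sketched as below. The per-rule scores are the ones quoted in the text for R01 to R05; the dict layout and the name `total_score` are assumptions for illustration.

```python
# Scores per extraction rule as quoted in the text for R01 to R05.
RULE_SCORES = {"R01": 1.0, "R02": 1.0, "R03": 0.5, "R04": 0.4, "R05": 0.7}


def total_score(matched_rules):
    """Sum the scores of every extraction rule that matched one string."""
    return sum(RULE_SCORES[rule] for rule in matched_rules)


# "田中一郎" matches R01, R02, and R05 in the example above.
score = total_score(["R01", "R02", "R05"])  # about 2.7, up to float rounding
```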

用語抽出部２３は、データ検索部２２から渡された文書本体に対して、抽出ルール記憶部１２に記述されている抽出ルールに該当する文字列を用語として抽出し、用語記憶部１１に格納されている用語リストの該当する用語の出現文書ＩＤを追加する。抽出した用語が用語記憶部１１に未登録の場合、新しい行を作成し、検索フラグを「未」に設定して、出現文書ＩＤを記録する。 The term extraction unit 23 extracts a character string corresponding to the extraction rule described in the extraction rule storage unit 12 as a term from the document main body passed from the data search unit 22 and stores it in the term storage unit 11. Appearing document ID of the corresponding term in the term list is added. If the extracted term is not registered in the term storage unit 11, a new line is created, the search flag is set to “not yet”, and the appearance document ID is recorded.

図１２は、実施の形態２に係る用語共起度抽出装置１００の動作の一例を示す流れ図である。実施の形態２の用語共起度抽出処理は、図５に示す実施の形態１の処理の動作に、用語抽出処理が追加されている。すなわち、ステップＳ２０１、ステップＳ２０２は実施の形態１と同様である。用語抽出部２３は、データ検索部２２と共起度計算部２４の間に置かれている。データ検索部２２は、公開データ６から検索した文書データを用語抽出部２３に渡す（図１２のステップＳ２０２）。 FIG. 12 is a flowchart showing an example of the operation of the term co-occurrence degree extraction device 100 according to the second embodiment. In the term co-occurrence degree extraction processing of the second embodiment, the term extraction processing is added to the operation of the processing of the first embodiment shown in FIG. That is, step S201 and step S202 are the same as in the first embodiment. The term extraction unit 23 is placed between the data search unit 22 and the co-occurrence degree calculation unit 24. The data search unit 22 passes the document data searched from the public data 6 to the term extraction unit 23 (step S202 in FIG. 12).

用語抽出部２３は、データ検索部２２から渡された文書群に対して、抽出ルール記憶部１２に記述されている抽出ルールに該当する文字列を用語として抽出する。そして、用語記憶部１１に格納されている用語リストの該当する用語の出現文書ＩＤを追加する（図１２のステップＳ２０３）。以降の処理は、実施の形態１と同様である。 The term extraction unit 23 extracts a character string corresponding to the extraction rule described in the extraction rule storage unit 12 as a term from the document group passed from the data search unit 22. Then, the appearance document ID of the corresponding term in the term list stored in the term storage unit 11 is added (step S203 in FIG. 12). The subsequent processing is the same as in the first embodiment.

図１３は、用語抽出部２３の動作の一例を示す流れ図である。用語抽出部２３は、最初に、初期化処理として、抽出候補集合Ｅを空集合として設定する（図１３のステップＳ２３１）。次に、データ検索部２２から渡された文書集合から1文書ずつ取り出しながら（図１３のステップＳ２３２；Ｙｅｓ）、文書の形態素解析を行い、文書内に抽出ルール記憶部１２に格納されている抽出ルールにマッチする文字列がないか調べる（図１３のステップＳ２３３）。 FIG. 13 is a flowchart illustrating an example of the operation of the term extraction unit 23. The term extraction unit 23 first sets the extraction candidate set E as an empty set as an initialization process (step S231 in FIG. 13). Next, while taking out one document at a time from the document set passed from the data search unit 22 (step S232 in FIG. 13; Yes), the morphological analysis of the document is performed, and the extraction rule stored in the extraction rule storage unit 12 is stored in the document. It is checked whether there is a character string that matches the rule (step S233 in FIG. 13).

文書中に抽出ルールにマッチする文字列があれば（図１３のステップＳ２３３；Ｙｅｓ）、その文字列ESと出現文書ＩＤ、およびその抽出スコアRSの組を抽出候補集合Ｅに追加する（図１３のステップＳ２３４）。このとき、既に文字列ESが抽出候補集合Ｅに登録済みであれば、出現文書ＩＤをリストとして追加し、抽出スコアRSの合計を計算する。文書中に抽出ルールにマッチする文字列が出てこなくなれば（図１３のステップＳ２３３；Ｎｏ）、次の文書に対して繰り返し処理を行う（図１３のステップＳ２３２）。 If there is a character string that matches the extraction rule in the document (step S233 in FIG. 13; Yes), the combination of the character string ES, the appearance document ID, and the extraction score RS is added to the extraction candidate set E (FIG. 13). Step S234). At this time, if the character string ES has already been registered in the extraction candidate set E, the appearance document ID is added as a list, and the total extraction score RS is calculated. If a character string matching the extraction rule does not appear in the document (step S233 in FIG. 13; No), the next document is repeatedly processed (step S232 in FIG. 13).

全ての文書に対して処理が終わったら（図１３のステップＳ２３２；Ｎｏ）、抽出候補集合Ｅの中から、抽出スコアの合計が閾値以上になっている用語について、出現文書ＩＤのリストを用語記憶部１１に格納されている用語リストに追加する。このように、用語抽出部２３は、抽出ルールに従って文書中に含まれる用語を抽出できるため、初期の入力データの用語リストに含まれていない新語であっても、再帰的に共起度を計算することができるようになる。 When the processing is completed for all the documents (step S232 in FIG. 13; No), a list of appearance document IDs is stored for the terms whose extraction scores are equal to or greater than the threshold from the extraction candidate set E. This is added to the term list stored in the section 11. In this way, the term extraction unit 23 can extract the terms included in the document according to the extraction rule, so even if it is a new word that is not included in the term list of the initial input data, the co-occurrence degree is calculated recursively. Will be able to.
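The accumulation of the extraction candidate set E and the final threshold test might look like the following sketch. Rule matching and morphological analysis are assumed to have already produced (string, document ID, score) triples; the sample matches, the threshold value of 1.0, and the function names are hypothetical.

```python
from collections import defaultdict


def collect_candidates(matches):
    """Build extraction candidate set E from (string, doc_id, score) matches."""
    candidates = defaultdict(lambda: {"doc_ids": [], "score": 0.0})  # step S231
    for string, doc_id, score in matches:                            # step S234
        entry = candidates[string]
        if doc_id not in entry["doc_ids"]:
            entry["doc_ids"].append(doc_id)   # extend the appearance document ID list
        entry["score"] += score               # running total of extraction scores
    return candidates


def accepted_terms(candidates, threshold):
    """Keep candidates whose total extraction score reaches the threshold."""
    return {s: e["doc_ids"] for s, e in candidates.items() if e["score"] >= threshold}


# Hypothetical rule matches: (matched string, document ID, rule score).
matches = [
    ("田中五郎", "D1", 1.0),
    ("田中五郎", "D2", 1.0),
    ("笹間一郎", "D2", 0.5),
]
accepted = accepted_terms(collect_candidates(matches), threshold=1.0)
```

Here only "田中五郎" reaches the assumed threshold, so only its document ID list would be written to the term list.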

なお、共起度計算部２４は、データ検索部２２と用語抽出部２３によって更新が起こった用語とその他の用語のペアの組み合わせだけに限って共起度の計算を行うことにより処理の効率化を図る方法も考えられる。 The co-occurrence degree calculation unit 24 increases the efficiency of processing by calculating the co-occurrence degree only for the combination of a term and another term pair updated by the data search unit 22 and the term extraction unit 23. A method for achieving this is also conceivable.

また、ここでは説明を簡潔にするため、収集対象の用語を人名に限定した例について述べたが、他にも、例えば図９に示すような組織名リストを用語記憶部１１に格納し、図１４に示すような抽出ルールを抽出ルール記憶部１２に与えることによって、組織名の共起度も抽出することができるようになり、本実施の形態に述べた方法に限定されない。 Further, here, for the sake of brevity, an example in which terms to be collected are limited to personal names has been described. However, for example, an organization name list as shown in FIG. By providing the extraction rule as shown in FIG. 14 to the extraction rule storage unit 12, the co-occurrence degree of the organization name can be extracted, and the present invention is not limited to the method described in the present embodiment.

Furthermore, by attaching domain label data to the term list stored in the term storage unit 11 and to the extraction rules stored in the extraction rule storage unit 12, new terms belonging to several different domains, such as people and organizations or organizations and place names, can be extracted.

本実施の形態２では、検索の結果得られた文書に対して、抽出ルールを用いて用語リストに未登録の新語を抽出して追加する。そのため、入力データである用語リストに含まれていない新語を抽出しながら再帰的に共起度を計算することができる。 In the second embodiment, unregistered new words are extracted and added to the term list using the extraction rule for the document obtained as a result of the search. Therefore, the co-occurrence degree can be recursively calculated while extracting new words that are not included in the term list that is input data.

(Embodiment 3)
FIG. 15 is a block diagram showing a configuration example of the term co-occurrence degree extraction device 100 according to Embodiment 3 of the present invention. Referring to FIG. 15, Embodiment 3 differs from the configuration of Embodiment 2 shown in FIG. 10 in that an extraction rule learning unit 25 is added to the processing device 2.

抽出ルール学習部２５が用語記憶部１１に格納されている用語リストの文書中での出現傾向の統計量を計算することにより、抽出ルール記憶部１２に格納されている抽出ルールを増やす。 The extraction rule learning unit 25 increases the number of extraction rules stored in the extraction rule storage unit 12 by calculating the statistics of the appearance tendency in the document of the term list stored in the term storage unit 11.

The operation of this embodiment is described in detail with reference to FIGS. 15 to 17.
FIG. 16 is a flowchart showing an example of the operation of Embodiment 3 of the present invention. The operations of the search strategy determination unit 21, the data search unit 22, and the co-occurrence degree calculation unit 24 in steps S201 to S205 of FIG. 16 are the same as those of the search strategy determination unit 21 through the co-occurrence degree calculation unit 24 in Embodiment 1 shown in FIG. 5, so their description is omitted. After step S203 of FIG. 16, the term extraction unit 23 passes the document group of search results received from the data search unit 22 unchanged to the extraction rule learning unit 25.

抽出ルール学習部２５は、用語記憶部１１に格納されている用語リストについて、用語抽出部２３から渡された文書群中での出現パタンを計測し、出現頻度が高く、かつ、用語を抽出する可能性の高いパタンを抽出ルールとして抽出ルール記憶部１２に追加する。 The extraction rule learning unit 25 measures the appearance pattern in the document group passed from the term extraction unit 23 with respect to the term list stored in the term storage unit 11, and extracts the terms with high appearance frequency. A highly likely pattern is added to the extraction rule storage unit 12 as an extraction rule.

図１７は、抽出ルール学習部２５の動作の一例を示す流れ図である。抽出ルール学習部２５は、初期化処理として、周辺文字列集合Ｃとルール候補集合Ｒを空集合に設定する（図１７のステップＳ２５０）。次に、用語記憶部１１に格納されている用語リスト中の用語を１語ずつ取り出して（図１７のステップＳ２５１；Ｙｅｓ）、取り出された用語が、用語抽出部２３から渡された文書群中に出現している前後w語以内の周辺文字列を全て列挙し、周辺文字列集合Ｃに追加する（図１７のステップＳ２５２）。 FIG. 17 is a flowchart showing an example of the operation of the extraction rule learning unit 25. The extraction rule learning unit 25 sets the peripheral character string set C and the rule candidate set R as empty sets as initialization processing (step S250 in FIG. 17). Next, the terms in the term list stored in the term storage unit 11 are extracted one by one (step S251 in FIG. 17; Yes), and the extracted terms are included in the document group passed from the term extraction unit 23. All the surrounding character strings within w words before and after appearing are listed and added to the surrounding character string set C (step S252 in FIG. 17).

For example, suppose that w = 4, the extracted term is "田中一郎", and the document group contains the passage "凸凹株式会社の田中一郎社長が語る". Morphological analysis of this passage yields "凸凹/noun-general 株式会社/noun-general の/particle-adnominal 田中/noun-proper noun-person name-surname 一郎/noun-proper noun-person name-given name 社長/noun-general が/particle-case particle-general 語る/verb-independent". The surrounding character strings of at most four words that contain "田中一郎" are therefore the following six: "株式会社/の/田中/一郎", "の/田中/一郎/社長", "田中/一郎/社長/が", "の/田中/一郎", "田中/一郎/社長", and "田中/一郎".

For brevity, all surrounding character strings within w words before and after each occurrence in the document group have been enumerated here. Other approaches are possible, such as restricting the enumeration to surrounding strings that begin with an independent word (content word), that end with an independent word, or that both begin and end with an independent word; the invention is not limited to the method described in this embodiment. For example, under the restriction that a surrounding string must both begin and end with an independent word, the surrounding strings of "田中一郎" for w = 4 in the passage "凸凹株式会社の田中一郎社長が語る" are the three strings "株式会社/の/田中/一郎", "田中/一郎/社長", and "田中/一郎".
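The enumeration of surrounding character strings can be sketched in Python as a sliding window over the word sequence. The sketch assumes the morphological analysis has already split the passage into words and that the position of the term is known; `surrounding_strings` and the `is_independent` predicate are hypothetical names, and treating only "の" and "が" as non-independent words is a simplification that holds for this one sentence.

```python
def surrounding_strings(tokens, start, length, w, is_independent=None):
    """Enumerate token windows of at most w words that contain the term.

    tokens: list of words from morphological analysis (assumed given).
    start, length: position and word count of the term inside tokens.
    is_independent: optional predicate; when given, a window is kept only
        if both its first and last word satisfy it.
    """
    end = start + length  # exclusive end index of the term
    windows = []
    for size in range(w, length - 1, -1):          # longest windows first
        for left in range(end - size, start + 1):  # slide while covering the term
            right = left + size
            if left < 0 or right > len(tokens):
                continue
            window = tokens[left:right]
            if is_independent and not (
                is_independent(window[0]) and is_independent(window[-1])
            ):
                continue
            windows.append("/".join(window))
    return windows


tokens = ["凸凹", "株式会社", "の", "田中", "一郎", "社長", "が", "語る"]
particles = {"の", "が"}  # the only non-independent words in this sentence

all_windows = surrounding_strings(tokens, start=3, length=2, w=4)
content_windows = surrounding_strings(
    tokens, start=3, length=2, w=4, is_independent=lambda t: t not in particles
)
```

For this passage `all_windows` holds the six strings listed in the text and `content_windows` holds the three that begin and end with an independent word.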

Next, the extraction rule learning unit 25 generates, for each enumerated surrounding character string, rules in which the term is generalized into word attributes such as parts of speech, and adds them to the rule candidate set R (step S253 in FIG. 17). For example, if the surrounding character string is "株式会社の田中一郎", the following nine rules are added to the rule candidate set R:
株式会社の"[POS: noun-proper noun-surname][POS: noun-proper noun-given name]"
株式会社の"[POS: noun-proper noun-surname][POS: noun-proper noun]"
株式会社の"[POS: noun-proper noun-surname][POS: noun]"
株式会社の"[POS: noun-proper noun][POS: noun-proper noun-given name]"
株式会社の"[POS: noun-proper noun][POS: noun-proper noun]"
株式会社の"[POS: noun-proper noun][POS: noun]"
株式会社の"[POS: noun][POS: noun-proper noun-given name]"
株式会社の"[POS: noun][POS: noun-proper noun]"
株式会社の"[POS: noun][POS: noun]"
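The nine candidates are every combination of three generalization levels for the surname with three levels for the given name, which can be sketched as a Cartesian product. The English tag names, the `[POS: ...]` rendering, and the function name are assumptions made for illustration; the patent does not specify how rules are serialized.

```python
from itertools import product

# Generalization chains from most to least specific, as in the example.
SURNAME_CHAIN = ["noun-proper noun-surname", "noun-proper noun", "noun"]
GIVEN_NAME_CHAIN = ["noun-proper noun-given name", "noun-proper noun", "noun"]


def generalized_rules(prefix, chains):
    """Combine every generalization level of each word of the term into a rule."""
    rules = []
    for tags in product(*chains):
        pattern = "".join(f"[POS: {tag}]" for tag in tags)
        rules.append(f'{prefix}"{pattern}"')  # quoted part marks the span to extract
    return rules


rules = generalized_rules("株式会社の", [SURNAME_CHAIN, GIVEN_NAME_CHAIN])
```

A term of k words with three levels each would yield 3^k candidates, which is one reason the later frequency and extraction-rate filters matter.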

Next, for each rule candidate in the rule candidate set R, the extraction rule learning unit 25 counts how often the candidate matches in the document group passed from the term extraction unit 23 and deletes from R every candidate whose frequency does not exceed a threshold f (step S254 in FIG. 17). For example, with f = 10, if the rule 株式会社の"[POS: noun-proper noun-surname][POS: noun-proper noun-given name]" matches character strings with a frequency of 5, it is deleted from the rule candidate set R.

次に、抽出ルール学習部２５は、ルール候補集合Ｒに含まれる各ルール候補について、用語抽出部２３から渡された文書群中でマッチする文字列を抽出し、その文字列が用語記憶部１１に格納されている用語リストに登録されている割合を、用語抽出率として計算する。用語抽出率が低いルール候補は、多くの語を抽出できる可能性があるが、一方で、ノイズとなる語を抽出しやすいことを意味している。そのため、用語抽出率が閾値rを超えていないルール候補はルール候補集合Ｒから削除する（図１７のステップＳ２５５）。 Next, the extraction rule learning unit 25 extracts a matching character string in the document group passed from the term extraction unit 23 for each rule candidate included in the rule candidate set R, and the character string is the term storage unit 11. The ratio registered in the term list stored in is calculated as the term extraction rate. A rule candidate with a low term extraction rate may extract many words, but on the other hand, it means that it is easy to extract words that cause noise. Therefore, rule candidates whose term extraction rate does not exceed the threshold value r are deleted from the rule candidate set R (step S255 in FIG. 17).

For example, let the term extraction rate threshold be r = 50%. Suppose the rule candidate "[POS: noun-proper noun][POS: noun-proper noun]"社長 extracts 10 character strings, 7 of which are registered in the term list stored in the term storage unit 11. Its term extraction rate is 7/10 = 70%, which exceeds the threshold r = 50%, so it is not deleted from the rule candidate set R. By contrast, suppose the rule candidate 株式会社の"[POS: noun][POS: noun]" extracts 100 character strings, 20 of which are registered in the term list stored in the term storage unit 11. Its term extraction rate is 20/100 = 20%, which is below the threshold r = 50%, so it is deleted from the rule candidate set R.
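The extraction-rate test of step S255 can be sketched as follows. The strings are synthetic placeholders chosen only to reproduce the two ratios in the example (7/10 and 20/100), the function names are hypothetical, and the frequency filter of step S254 is deliberately left out.

```python
def extraction_rate(extracted, term_list):
    """Fraction of extracted strings already registered in the term list."""
    if not extracted:
        return 0.0
    registered = sum(1 for s in extracted if s in term_list)
    return registered / len(extracted)


def keep_rule(extracted, term_list, r):
    """A rule candidate survives step S255 only if its rate exceeds r."""
    return extraction_rate(extracted, term_list) > r


# Synthetic data reproducing the two ratios in the text.
term_list = {f"known{i}" for i in range(20)}
rule_a = [f"known{i}" for i in range(7)] + [f"noise{i}" for i in range(3)]
rule_b = [f"known{i}" for i in range(20)] + [f"noise{i}" for i in range(80)]

kept_a = keep_rule(rule_a, term_list, r=0.5)  # rate 0.7 exceeds 0.5
kept_b = keep_rule(rule_b, term_list, r=0.5)  # rate 0.2 does not
```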

Next, the extraction rule learning unit 25 adds the rule candidates remaining in the rule candidate set R to the extraction rule storage unit 12 as extraction rules (step S256 in FIG. 17).

なお、ここでは説明を簡潔にするため、抽出ルール学習部２５は、用語抽出部２３から渡された文書群のみを用いて周辺文字列の抽出とルール候補の生成を行う方法について説明を行ったが、他にも、データ検索部２２が取得した文書群全てを記憶装置１に格納しておき、それら文書群全体を使って周辺文字列の抽出とルール候補の生成を行う方法もあり、本実施の形態に述べた方法に限定されない。 Here, for the sake of brevity, the extraction rule learning unit 25 has described a method of extracting neighboring character strings and generating rule candidates using only the document group passed from the term extraction unit 23. However, there is also a method in which all the document groups acquired by the data search unit 22 are stored in the storage device 1 and the surrounding character strings are extracted and rule candidates are generated using the entire document groups. The method is not limited to the method described in the embodiment.

本実施の形態では、検索結果の文書群に含まれる用語周辺の文字列の出現傾向を求めることにより、動的に新しい抽出ルールを生成する。そのため、初期の抽出ルールが少なくても、より多くの用語を再帰的に抽出することができる。 In the present embodiment, a new extraction rule is dynamically generated by obtaining the appearance tendency of a character string around a term included in a document group as a search result. Therefore, even if there are few initial extraction rules, more terms can be extracted recursively.

FIG. 18 is a block diagram showing an example of the hardware configuration of the term co-occurrence degree extraction device 100 shown in FIG. 1, FIG. 10, or FIG. 15. As shown in FIG. 18, the term co-occurrence degree extraction device 100 includes a control unit 31, a main storage unit 32, an external storage unit 33, an operation unit 34, a display unit 35, and a transmission/reception unit 36. The main storage unit 32, the external storage unit 33, the operation unit 34, the display unit 35, and the transmission/reception unit 36 are all connected to the control unit 31 via an internal bus 30.

制御部３１はＣＰＵ（Central Processing Unit）等から構成され、外部記憶部３３に記憶されている用語共起度抽出用プログラム５００に従って、前述の用語共起度抽出装置１００の処理を実行する。 The control unit 31 includes a CPU (Central Processing Unit) and the like, and executes the processing of the term co-occurrence degree extraction device 100 described above according to the term co-occurrence degree extraction program 500 stored in the external storage unit 33.

主記憶部３２はＲＡＭ（Random-Access Memory）等から構成され、外部記憶部３３に記憶されている用語共起度抽出用プログラム５００をロードし、制御部３１の作業領域として用いられる。 The main storage unit 32 is composed of a RAM (Random-Access Memory) or the like, loads the term co-occurrence degree extraction program 500 stored in the external storage unit 33, and is used as a work area of the control unit 31.

The external storage unit 33 consists of non-volatile memory such as flash memory, a hard disk, a DVD-RAM (Digital Versatile Disc Random-Access Memory), or a DVD-RW (Digital Versatile Disc ReWritable). It stores in advance the term co-occurrence degree extraction program 500 that causes the control unit 31 to perform the processing described above, supplies the data stored by this program to the control unit 31 in accordance with instructions from the control unit 31, and stores data supplied from the control unit 31. The term storage unit 11, the extraction rule storage unit 12, and the co-occurrence degree data storage unit 13 of FIG. 1, FIG. 10, or FIG. 15 are configured in the external storage unit 33. While the term co-occurrence degree extraction process is running, part of that data is held in the main storage unit 32 and used for the work of the control unit 31.

The operation unit 34 is composed of a keyboard, a pointing device such as a mouse, and an interface device that connects the keyboard, the pointing device, and the like to the internal bus 30. Condition settings for narrowing down participants and the like are input via the operation unit 34 and supplied to the control unit 31. The operation unit 34 corresponds to the input unit 3 in FIG. 1, FIG. 10, or FIG. 15.

The display unit 35 is composed of a CRT (Cathode Ray Tube), an LCD (Liquid Crystal Display), or the like, and displays the search target terms, search results, search result documents, term extraction rules, the co-occurrence degree graph, and so on. The display unit 35 is an example of the output unit 4 in FIG. 1, FIG. 10, or FIG. 15. A printer or the like may also be provided as the output unit 4.

The transmission/reception unit 36 is composed of a network termination device or wireless communication device that connects to the network 5, and a serial interface or LAN (Local Area Network) interface that connects to that device. The transmission/reception unit 36 connects via the network 5 to a server (not shown) that provides a search engine, and accesses the information in the public data 6.

The processing of the search strategy determination unit 21, the data search unit 22, the term extraction unit 23, the co-occurrence degree calculation unit 24, and the extraction rule learning unit 25 in FIG. 1, FIG. 10, or FIG. 15 is executed by the term co-occurrence degree extraction program 500, which uses the control unit 31, the main storage unit 32, the external storage unit 33, the operation unit 34, the display unit 35, the transmission/reception unit 36, and the like as resources.

As described above, the first effect of the present invention is that the number of searches can be prevented from increasing geometrically. This is because searches against the public data 6 are performed one term at a time rather than for pairs of terms.
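To make the difference in search counts concrete, the short Python sketch below compares the two approaches. It assumes that a pairwise approach would issue one query per unordered pair of terms, whereas the single-term approach issues one query per term; the function names are hypothetical.

```python
def pairwise_query_count(n_terms):
    """Queries needed if every unordered pair of terms is searched once (assumption)."""
    return n_terms * (n_terms - 1) // 2

def single_term_query_count(n_terms):
    """Queries needed if each term is searched once on its own."""
    return n_terms

for n in (10, 100, 1000):
    print(n, pairwise_query_count(n), single_term_query_count(n))
```

For 100 terms this gives 4950 pairwise queries against 100 single-term queries, and the gap widens as the term list grows.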

The second effect is that relationships among more terms can be obtained approximately even with a small number of searches. This is because an approximate co-occurrence degree can be obtained even for an unsearched term, as long as it appears in the documents included in the search results for a searched term.
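The following Python sketch illustrates the idea of estimating edges for unsearched terms from the result documents of a searched term. It assumes that the co-occurrence degree is simply the number of distinct result documents containing both terms and that a term is detected by substring matching; both are simplifications chosen for this example, and `approximate_cooccurrence` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations

def approximate_cooccurrence(results, terms):
    """results maps each searched term to its list of result documents."""
    seen = set()
    counts = Counter()
    for docs in results.values():
        for doc in docs:
            if doc in seen:  # count each distinct document once
                continue
            seen.add(doc)
            present = sorted(t for t in terms if t in doc)
            for a, b in combinations(present, 2):
                counts[(a, b)] += 1
    return counts

# Only "Alice" has been searched; "Bob" and "Carol" are still unsearched.
results = {"Alice": ["Alice met Bob.", "Alice and Carol spoke with Bob."]}
edges = approximate_cooccurrence(results, ["Alice", "Bob", "Carol"])
print(edges)
```

Although neither "Bob" nor "Carol" was used as a query, the edge between them already receives an approximate value from the documents returned for "Alice".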

The third effect is that a co-occurrence degree graph closer to the true values can be obtained even with a small number of searches. This is because an index of which unsearched term, if searched, would yield a better-approximated co-occurrence degree graph is calculated as an approximate graph score, and searches are performed in descending order of approximate graph score.
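A minimal Python sketch of score-driven search ordering follows. It assumes a simplified approximate graph score that counts the edges touching a candidate term, giving weight 1 to an edge that would change from both sides unsearched to one side searched and weight 2 to an edge that would change from one side searched to both sides searched. The weights, the function names, and the omission of the other score components named in the claims are assumptions made for illustration.

```python
def approximate_graph_score(term, searched, edges):
    """Score a candidate by how many edges would improve if it were searched."""
    score = 0
    for a, b in edges:
        if term not in (a, b):
            continue
        other = b if a == term else a
        # weight 2: one side searched -> both sides searched
        # weight 1: both sides unsearched -> one side searched
        score += 2 if other in searched else 1
    return score

def next_term(terms, searched, edges):
    """Pick the unsearched term with the highest score."""
    candidates = [t for t in terms if t not in searched]
    return max(candidates, key=lambda t: approximate_graph_score(t, searched, edges))

terms = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]
choice = next_term(terms, {"A"}, edges)
print(choice)
```

With "A" already searched, "B" touches the most edges and is the only candidate that completes a half-searched edge, so it is searched next.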

The fourth effect is that co-occurrence degrees can be calculated recursively while extracting new words that are not included in the term list given as input data. This is because new words not registered in the term list are extracted from the documents obtained by searching, using the extraction rules, and added to the list.
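The recursive growth of the term list can be sketched in Python as a loop that searches one term at a time, applies an extraction rule to the result documents, and queues any newly found term for a later search. The in-memory corpus, the `toy_search` stand-in for a search engine, the single regular-expression rule, and the first-in-first-out search order are all assumptions made for this example.

```python
import re
from collections import deque

def crawl_terms(seed_terms, search, rule, max_searches=10):
    """Search terms one by one, adding new terms matched by `rule` to the list."""
    known = list(seed_terms)
    queue = deque(seed_terms)
    searches = 0
    while queue and searches < max_searches:
        term = queue.popleft()
        searches += 1
        for doc in search(term):
            for new in re.findall(rule, doc):
                if new not in known:
                    known.append(new)
                    queue.append(new)
    return known

corpus = ["Mr. Sato met Mr. Kato.", "Mr. Kato called Mr. Mori."]

def toy_search(term):
    """Hypothetical stand-in for a search engine query."""
    return [d for d in corpus if term in d]

terms = crawl_terms(["Sato"], toy_search, r"Mr\. (\w+)")
print(terms)
```

Starting from a single seed term, the loop reaches a name that never appears in the same document as the seed, because the intermediate term is extracted and searched in turn.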

The hardware configuration and flowcharts described above are examples, and can be changed and modified as desired.

The central part that performs the processing of the term co-occurrence degree extraction device 100, consisting of the control unit 31, the main storage unit 32, the external storage unit 33, the transmission/reception unit 36, the internal bus 30, and the like, can be realized using an ordinary computer system rather than a dedicated system. For example, the term co-occurrence degree extraction program 500 for executing the operations described above may be stored on a computer-readable recording medium (flexible disk, CD-ROM, DVD-ROM, etc.) and distributed, and the term co-occurrence degree extraction device 100 that executes the processing described above may be configured by installing the computer program on a computer. Alternatively, the computer program may be stored in a storage device of a server device on a communication network such as the Internet, and the term co-occurrence degree extraction device 100 may be configured by an ordinary computer system downloading it.

When the functions of the term co-occurrence degree extraction device 100 are realized by dividing them between an OS (operating system) and an application program, or by cooperation between the OS and an application program, only the application program portion may be stored on the recording medium or in the storage device.

It is also possible to superimpose the computer program on a carrier wave and distribute it via a communication network. For example, the term co-occurrence degree extraction program 500 may be posted on a bulletin board system (BBS) on a communication network and distributed via the network. The processing described above may then be executed by starting the term co-occurrence degree extraction program 500 and running it under the control of the OS in the same way as other application programs.

The present invention can be applied to extracting information representing human relationships, relationships between organizations, relationships between organizations and people, relationships between products and companies, and the like from various information sources such as newspaper articles, sports news, papers, diaries, bulletin boards, blogs, mailing lists, and e-mail magazines.

FIG. 1 is a block diagram showing a configuration example of the term co-occurrence degree extraction device according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of data stored in the term storage unit in Embodiment 1.
FIG. 3 is a diagram showing an example of data stored in the co-occurrence degree data storage unit in Embodiment 1.
FIG. 4 is a diagram explaining the approximate co-occurrence degree calculation in Embodiment 1.
FIG. 5 is a flowchart showing an example of the operation of the term co-occurrence degree extraction device according to Embodiment 1.
FIG. 6 is a flowchart showing an example of the operation of the search strategy determination unit in Embodiment 1.
FIG. 7 is a flowchart showing an example of the operation of the data search unit in Embodiment 1.
FIG. 8 is a flowchart showing an example of the operation of the co-occurrence degree calculation unit in Embodiment 1.
FIG. 9 is a diagram showing an example of data stored in the term storage unit in Embodiment 1.
FIG. 10 is a block diagram showing a configuration example of the term co-occurrence degree extraction device according to Embodiment 2 of the present invention.
FIG. 11 is a diagram showing an example of data stored in the extraction rule storage unit in Embodiment 2.
FIG. 12 is a flowchart showing an example of the operation of the term co-occurrence degree extraction device according to Embodiment 2.
FIG. 13 is a flowchart showing an example of the operation of the term extraction unit in Embodiment 2.
FIG. 14 is a diagram showing an example of data stored in the extraction rule storage unit in Embodiment 2.
FIG. 15 is a block diagram showing a configuration example of the term co-occurrence degree extraction device according to Embodiment 3 of the present invention.
FIG. 16 is a flowchart showing an example of the operation of the term co-occurrence degree extraction device according to Embodiment 3.
FIG. 17 is a flowchart showing an example of the operation of the extraction rule learning unit in Embodiment 3.
FIG. 18 is a block diagram showing an example of the hardware configuration of the term co-occurrence degree extraction device.

Explanation of symbols

1 Storage device
2 Processing device
3 Input unit
4 Output unit
5 Network
6 Public data
11 Term storage unit
12 Extraction rule storage unit
13 Co-occurrence degree data storage unit
21 Search strategy determination unit
22 Data search unit
23 Term extraction unit
24 Co-occurrence degree calculation unit
25 Extraction rule learning unit
31 Control unit
32 Main storage unit
33 External storage unit
34 Operation unit
35 Display unit
36 Transmission/reception unit
100 Term co-occurrence degree extraction device
500 Term co-occurrence degree extraction program

Claims

1. A term co-occurrence degree extraction device that extracts a co-occurrence degree graph in which search target terms are nodes and, for any two of the search target terms, a co-occurrence degree indicating the degree to which the two terms appear in the same document is an edge between the nodes corresponding to the two terms, the device comprising:
co-occurrence degree detection accuracy determination means for determining, for unsearched terms, the possibility that co-occurrence degrees between terms can be obtained;
search strategy determination means for determining the search order of terms based on the possibility determined by the co-occurrence degree detection accuracy determination means;
search means for searching document data using each search target term as a keyword, in the order determined by the search strategy determination means; and
co-occurrence degree calculation means for approximately obtaining, from search result documents containing the term searched by the search means, co-occurrence degrees between terms for the search target terms contained in the search result documents.

2. The term co-occurrence degree extraction device according to claim 1, further comprising term extraction means for extracting, from the search result document, terms not included in the search target terms, based on a predetermined rule.

3. The term co-occurrence degree extraction device according to claim 2, further comprising extraction rule learning means for dynamically generating a rule for extracting a term from the appearance tendency of the term in the search result document.

4. The term co-occurrence degree extraction device according to claim 3, wherein the extraction rule learning means generates the rule for extracting terms by:
listing character strings that appear around terms in the search result document;
generating a set of rule candidates from the surrounding character strings by using word attributes of the terms registered as search target terms and regular expressions that generalize the word attributes; and
narrowing down the rule candidates by comparing the appearance frequency of each rule candidate and/or its term extraction rate with respective predetermined thresholds.

5. The term co-occurrence degree extraction device according to claim 2, wherein the term extraction means extracts terms based on a predetermined rule described by word attributes and regular expressions over the word attributes.

6. The term co-occurrence degree extraction device according to any one of claims 1 to 5, wherein
the co-occurrence degree detection accuracy determination means determines the possibility that co-occurrence degrees between terms can be obtained for an unsearched term as an approximate graph score consisting of any one of, or a combination of: the expected number of newly extracted terms; the number of edges in the co-occurrence degree graph that change from both sides unsearched to one side searched; the number of edges that change from one side searched to both sides searched; and the number of edges that are already one side searched but for which an increase in the amount of information can be expected; and
the search strategy determination means takes the one or more terms ranked highest by the approximate graph score as search candidate words.

7. The term co-occurrence degree extraction device according to any one of claims 1 to 6, wherein the co-occurrence degree calculation means, in the co-occurrence degree graph,
calculates the co-occurrence degree of each edge between nodes corresponding to two searched terms, and
calculates, as an approximate co-occurrence degree based on the search result documents, the co-occurrence degree of each edge for which the term of one or both nodes is unsearched.

8. A term co-occurrence degree extraction method for extracting a co-occurrence degree graph in which search target terms are nodes and, for any two of the search target terms, a co-occurrence degree indicating the degree to which the two terms appear in the same document is an edge between the nodes corresponding to the two terms, the method comprising:
a co-occurrence degree detection accuracy determination step of determining, for unsearched terms, the possibility that co-occurrence degrees between terms can be obtained;
a search strategy determination step of determining the search order of terms based on the possibility determined in the co-occurrence degree detection accuracy determination step;
a search step of searching document data using each search target term as a keyword, in the order determined in the search strategy determination step; and
a co-occurrence degree calculation step of approximately obtaining, from search result documents containing the term searched in the search step, co-occurrence degrees between terms for the search target terms contained in the search result documents.

9. The term co-occurrence degree extraction method according to claim 8, further comprising a term extraction step of extracting, from the search result document, terms not included in the search target terms, based on a predetermined rule.

10. The term co-occurrence degree extraction method according to claim 9, further comprising an extraction rule learning step of dynamically generating a rule for extracting a term from the appearance tendency of the term in the search result document.

11. The term co-occurrence degree extraction method according to claim 10, wherein the extraction rule learning step generates the rule for extracting terms by:
listing character strings that appear around terms in the search result document;
generating a set of rule candidates from the surrounding character strings by using word attributes of the terms registered as search target terms and regular expressions that generalize the word attributes; and
narrowing down the rule candidates by comparing the appearance frequency of each rule candidate and/or its term extraction rate with respective predetermined thresholds.

12. The term co-occurrence degree extraction method according to any one of claims 9 to 11, wherein the term extraction step extracts terms based on a predetermined rule described by word attributes and regular expressions over the word attributes.

13. The term co-occurrence degree extraction method according to any one of claims 8 to 12, wherein
in the co-occurrence degree detection accuracy determination step, the possibility that co-occurrence degrees between terms can be obtained for an unsearched term is determined as an approximate graph score consisting of any one of, or a combination of: the expected number of newly extracted terms; the number of edges in the co-occurrence degree graph that change from both sides unsearched to one side searched; the number of edges that change from one side searched to both sides searched; and the number of edges that are already one side searched but for which an increase in the amount of information can be expected; and
in the search strategy determination step, the one or more terms ranked highest by the approximate graph score are taken as search candidate words.

14. The term co-occurrence degree extraction method according to any one of claims 8 to 13, wherein the co-occurrence degree calculation step, in the co-occurrence degree graph,
calculates the co-occurrence degree of each edge between nodes corresponding to two searched terms, and
calculates, as an approximate co-occurrence degree based on the search result documents, the co-occurrence degree of each edge for which the term of one or both nodes is unsearched.

15. A term co-occurrence degree extraction program for extracting a co-occurrence degree graph in which search target terms are nodes and, for any two of the search target terms, a co-occurrence degree indicating the degree to which the two terms appear in the same document is an edge between the nodes corresponding to the two terms, the program causing a computer to function as:
co-occurrence degree detection accuracy determination means for determining, for unsearched terms, the possibility that co-occurrence degrees between terms can be obtained;
search strategy determination means for determining the search order of terms based on the possibility determined by the co-occurrence degree detection accuracy determination means;
search means for searching document data using each search target term as a keyword, in the order determined by the search strategy determination means; and
co-occurrence degree calculation means for approximately obtaining, from search result documents containing the term searched by the search means, co-occurrence degrees between terms for the search target terms contained in the search result documents.

16. The term co-occurrence degree extraction program according to claim 15, further causing the computer to function as term extraction means for extracting, from the search result document, terms not included in the search target terms, based on a predetermined rule.

17. The term co-occurrence degree extraction program according to claim 16, further causing the computer to function as extraction rule learning means for dynamically generating a rule for extracting a term from the appearance tendency of the term in the search result document.

18. The term co-occurrence degree extraction program according to claim 17, wherein the extraction rule learning means generates the rule for extracting terms by:
listing character strings that appear around terms in the search result document;
generating a set of rule candidates from the surrounding character strings by using word attributes of the terms registered as search target terms and regular expressions that generalize the word attributes; and
narrowing down the rule candidates by comparing the appearance frequency of each rule candidate and/or its term extraction rate with respective predetermined thresholds.

19. The term co-occurrence degree extraction program according to any one of claims 16 to 18, wherein the term extraction means extracts terms based on a predetermined rule described by word attributes and regular expressions over the word attributes.

20. The term co-occurrence degree extraction program according to any one of claims 15 to 19, wherein
the co-occurrence degree detection accuracy determination means determines the possibility that co-occurrence degrees between terms can be obtained for an unsearched term as an approximate graph score consisting of any one of, or a combination of: the expected number of newly extracted terms; the number of edges in the co-occurrence degree graph that change from both sides unsearched to one side searched; the number of edges that change from one side searched to both sides searched; and the number of edges that are already one side searched but for which an increase in the amount of information can be expected; and
the search strategy determination means takes the one or more terms ranked highest by the approximate graph score as search candidate words.

21. The term co-occurrence degree extraction program according to any one of claims 15 to 20, wherein the co-occurrence degree calculation means, in the co-occurrence degree graph,
calculates the co-occurrence degree of each edge between nodes corresponding to two searched terms, and
calculates, as an approximate co-occurrence degree based on the search result documents, the co-occurrence degree of each edge for which the term of one or both nodes is unsearched.