JP2006215850A

JP2006215850A - Apparatus and method for creating concept information database, program, and recording medium

Info

Publication number: JP2006215850A
Application number: JP2005028555A
Authority: JP
Inventors: Naoki Asanoma; 直樹麻野間; Masahiro Oku; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-02-04
Filing date: 2005-02-04
Publication date: 2006-08-17

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus and a method for creating a concept information database that make it easy to obtain a concept information database that ultimately has sufficient accuracy, together with a program and a recording medium therefor.

SOLUTION: A given document set is analyzed, and the words present in the document set are extracted. The word chains present in the document set are also extracted, and the number of co-occurrences between each word and each word chain is detected. A degree of co-occurrence is derived from the number of co-occurrences, the concept information of each word is quantified based on that degree, and the resulting concept information is stored in a database.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an apparatus and a method that quantify the concept information carried by words or word chains and store it in a database. In particular, it relates to a concept information database creation apparatus and method that quantify the concept information of a word or word chain appearing in a given document, based on the degree of co-occurrence between that word or word chain and the words or word chains appearing in the given document.

Apparatuses and methods that store the concept information of words in a database have been proposed for purposes such as judging the similarity between words and improving the accuracy of document retrieval. Specifically, for judging the similarity between words, database creation methods have been proposed in which a multidimensional space whose dimensions are words is prepared and each word is placed in that space as a vector (see, for example, Non-Patent Document 1 and Non-Patent Document 2).

For quantifying the concept of a “word” in a sentence, a method has also been proposed that uses syntactic analysis to extract the “related words” that form a grammatical pair with the word, and then uses the “degree of association” between them (see, for example, Patent Document 1).

The conventional examples described in Non-Patent Documents 1 and 2 represent the concept of a word in a multidimensional space made up of the words that co-occur with it. The conventional example described in Patent Document 1 differs substantially in that it relies not only on co-occurrence but on a multidimensional space made up of words that stand in a grammatical relationship with the word (for example, the relationship between subject and predicate).

Non-Patent Document 1: Schuetze, H., “Dimensions of Meaning”, in Proceedings of Supercomputing '92, pp. 787-796, 1992
Non-Patent Document 2: Kasahara, Matsuzawa, Ishikawa, “Similarity discrimination of everyday words using a Japanese dictionary”, Transactions of Information Processing Society of Japan, Vol. 38, No. 7, pp. 1272-1284, 1997
Patent Document 1: JP-A-9-134360

However, the conventional examples described in Non-Patent Documents 1 and 2 calculate co-occurrence frequency without considering grammatical or semantic relationships, so their results are not sufficient for judging the similarity between words or for improving the accuracy of document retrieval.

The conventional example described in Patent Document 1 performs syntactic analysis in order to capture grammatical and semantic relationships, but with current technology it is difficult to capture such relationships completely.

In other words, the conventional examples have the problem that it is difficult to end up with a concept information database of sufficient accuracy.

An object of the present invention is to provide a concept information database creation apparatus, a concept information database creation method, a program, and a recording medium with which a concept information database of sufficient accuracy can easily be obtained as the end result.

The present invention is a concept information database creation apparatus comprising: document analysis means for analyzing a given document set; word extraction means for extracting the words present in the given document set and storing them in a storage device; word chain extraction means for extracting the word chains present in the given document set and storing them in a storage device; co-occurrence count detection means for detecting the number of co-occurrences between each of the words and each of the word chains and storing it in a storage device; concept information quantification means for deriving a degree of co-occurrence from the number of co-occurrences, quantifying the concept information of the words based on the derived degree of co-occurrence, and storing it in a storage device; and concept information database creation means for turning the concept information of the words obtained by the concept information quantification means into a database.

The present invention is also a concept information database creation apparatus comprising: document analysis means for analyzing a given document set; extraction means for extracting the word chains, or the word chains and the words, present in the given document set and storing them in a storage device; co-occurrence count detection means for detecting the number of co-occurrences between each of the word chains and each of the word chains or words and storing it in a storage device; concept information quantification means for deriving a degree of co-occurrence from the number of co-occurrences, quantifying the concept information of the word chains based on the derived degree of co-occurrence, and storing it in a storage device; and concept information database creation means for turning the concept information of the word chains obtained by the concept information quantification means into a database.

Because the present invention uses co-occurrence with word chains, it can capture grammatical and semantic relationships indirectly without performing syntactic analysis. It can therefore create a concept information database that is accurate enough for judging the similarity between words or for improving the accuracy of document retrieval.

The best mode for carrying out the invention is given by the following embodiments.

FIG. 1 is a block diagram showing the basic configuration of a concept information database creation apparatus 10 according to Embodiment 1 of the present invention.

The concept information database creation apparatus 10 takes as input a large document set 20, from which the concept information database is created, and outputs a concept information database 30. It has a document analysis unit 11, a word extraction unit 12, a word chain extraction unit 13, a co-occurrence count detection unit 14, a concept information quantification unit 15, and a concept information database creation unit 16.

The document analysis unit 11 performs morphological analysis on every sentence in the document set 20, splitting each sentence into words and assigning a part of speech to each word.

The word extraction unit 12 extracts the words contained in the document set 20 and stores them in a storage device.

The word chain extraction unit 13 extracts the word chains contained in the document set 20 and stores them in a storage device.

Based on the analysis results from the document analysis unit 11, the co-occurrence count detection unit 14 extracts the words or word chains that co-occur with a given word or word chain, counts the co-occurrences, and stores the counts in a storage device.

A “word chain” here is a chain of n consecutive words in a document (n is an integer of 2 or more).

Based on the co-occurrence counts produced by the co-occurrence count detection unit 14, the concept information quantification unit 15 calculates the degree of co-occurrence between the word or word chain of interest and each word or word chain, and quantifies the concept information for the word or word chain of interest. The quantification is described later.

The concept information database creation unit 16 builds a database in which the concept information quantified by the concept information quantification unit 15 can be looked up using a word or word chain as the key.

FIG. 2 is a flowchart outlining the operation of the concept information database creation apparatus 10.

In S1, the document analysis unit 11 takes one document from the document set 20.

In S2, the document analysis unit 11 takes one sentence from the document taken in S1.

In S3, the document analysis unit 11 performs morphological analysis on the sentence taken in S2, splitting it into words and assigning a part of speech to each word.

In S4, it is determined whether every sentence in the current document has been processed. If an unprocessed sentence remains, the process proceeds to S5; if none remains (all sentences have been processed), it proceeds to S6.

In S5, the next sentence becomes the processing target and the processing of S2 to S5 is repeated.

In S6, it is determined whether every document in the document set 20 has been processed. If an unprocessed document remains, the process proceeds to S7; if none remains (all documents have been processed), the morphological analysis results for all documents in the document set 20 are sent to the word extraction unit 12 and the word chain extraction unit 13, and the process proceeds to S8.

In S7, the next document in the document set 20 becomes the processing target and the processing of S1 to S6 is repeated.

In S8, the word extraction unit 12 and the word chain extraction unit 13 extract all words or word chains (chains of two or more words) from the morphological analysis results and store them in a storage device.

In S9, for each extracted independent word (noun, pronoun, verb, adjective, or adverb) or word chain, the co-occurrence count detection unit 14 extracts the co-occurring independent words or word chains, counts the occurrences, and sends the counts to the concept information quantification unit 15.

In the embodiments, the occurrences are counted in one of the following three ways:
(1) Words (independent words) and word chains are extracted from the document set 20, and the number of co-occurrences is counted between each extracted word present in a predetermined document range and each word chain present in the same range.
(2) Word chains are extracted from the document set 20, and the number of co-occurrences is counted between each extracted first word chain present in a predetermined document range and each second word chain present in the same range.
(3) Word chains and words (independent words) are extracted from the document set 20, and the number of co-occurrences is counted between each extracted word chain present in a predetermined document range and each word present in the same range.

The “predetermined document range” is the range of text over which co-occurrences are counted. For example, it is one of: a subset of the given document set, at least one paragraph contained in a document, or at least one sentence contained in a paragraph.
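The choice of counting range can be illustrated with a short sketch. The nested-list data model, the function name `iter_ranges`, and the three granularity labels are assumptions made for this illustration; the source only states that the range may be a subset of the document set, one or more paragraphs, or one or more sentences.

```python
from typing import Iterator, List

# Hypothetical data model: a document is a list of paragraphs, a paragraph is a
# list of sentences, and a sentence is a list of already segmented words.
Sentence = List[str]
Paragraph = List[Sentence]
Document = List[Paragraph]


def iter_ranges(documents: List[Document], level: str) -> Iterator[List[str]]:
    """Yield the words of each counting range at the chosen granularity."""
    if level not in ("document", "paragraph", "sentence"):
        raise ValueError(f"unknown level: {level}")
    for document in documents:
        if level == "document":
            # One range per document: flatten all paragraphs and sentences.
            yield [w for paragraph in document for sentence in paragraph for w in sentence]
            continue
        for paragraph in document:
            if level == "paragraph":
                # One range per paragraph: flatten its sentences.
                yield [w for sentence in paragraph for w in sentence]
            else:
                # One range per sentence.
                for sentence in paragraph:
                    yield list(sentence)


docs = [
    [[["a", "b"], ["c"]], [["d"]]],  # document 1: two paragraphs
    [[["e", "f"]]],                  # document 2: one paragraph
]
document_ranges = list(iter_ranges(docs, "document"))
paragraph_ranges = list(iter_ranges(docs, "paragraph"))
sentence_ranges = list(iter_ranges(docs, "sentence"))
```

A narrower range yields fewer co-occurring pairs per range, so the granularity directly affects how sparse the resulting counts are.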

In S10, based on the counts from the co-occurrence count detection unit 14, the concept information quantification unit 15 calculates, for each extracted word or word chain, its degree of co-occurrence with each word or word chain.

In S11, based on the calculated degrees of co-occurrence, the concept information quantification unit 15 quantifies the concept information of each word or word chain and sends the result to the concept information database creation unit 16.

Here, the “concept information” is quantified as a matrix whose rows are words or word chains, whose columns are the words or word chains examined as co-occurrence targets, and whose values are degrees of co-occurrence. That is, the concept information of a word or word chain is expressed as a row vector whose elements are its degrees of co-occurrence with those words or word chains.
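As a minimal sketch of this matrix representation, the following arranges degrees of co-occurrence into rows and columns. The function name, the dictionary input format, the alphabetical ordering of rows and columns, and the sample values are all assumptions for illustration, not details from the source.

```python
from typing import Dict, List, Tuple


def build_concept_matrix(
    degrees: Dict[Tuple[str, str], float],
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Arrange (row item, column item) -> degree of co-occurrence into a dense matrix."""
    rows = sorted({r for r, _ in degrees})
    cols = sorted({c for _, c in degrees})
    # Pairs that never co-occurred are absent from the input and become 0.0.
    matrix = [[degrees.get((r, c), 0.0) for c in cols] for r in rows]
    return rows, cols, matrix


# Illustrative values only; they are not taken from the figures of the source.
sample_degrees = {
    ("word_a", "chain_x"): 0.5,
    ("word_b", "chain_x"): 0.25,
    ("word_b", "chain_y"): 0.75,
}
rows, cols, matrix = build_concept_matrix(sample_degrees)
# The concept information of one item is its row vector.
vector_for_word_b = matrix[rows.index("word_b")]
```

With this layout, comparing two words reduces to comparing two row vectors of equal length.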

In this case, the numbers of rows and columns of the matrix may be set according to the frequency of the words or word chains.

Further, when many of the degrees of co-occurrence are 0, the number of word chains extracted from the document set 20 may be selectively reduced, as shown in FIG. 8. Alternatively, in that situation, the co-occurrence frequency may be obtained for word chains made up of fewer words (for example, two-word chains (word bigrams) or one-word chains (word unigrams) in place of three-word chains), and the resulting degrees of co-occurrence may be used to supplement the concept information. The number of columns may also be reduced by singular value decomposition. The method of supplementing the degrees of co-occurrence is not particularly limited here.
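One way to limit the number of columns can be sketched as follows. The source does not state which selection criterion FIG. 8 uses, so ranking columns by their total co-occurrence count, the function name `keep_frequent_columns`, and the parameter `max_columns` are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, Tuple


def keep_frequent_columns(
    counts: Dict[Tuple[str, str], int], max_columns: int
) -> Dict[Tuple[str, str], int]:
    """Keep only the max_columns column items with the largest total counts."""
    column_totals: Counter = Counter()
    for (_, col), n in counts.items():
        column_totals[col] += n
    # Sort by descending total, then by name so that ties are broken deterministically.
    ranked = sorted(column_totals, key=lambda c: (-column_totals[c], c))
    kept = set(ranked[:max_columns])
    return {(r, c): n for (r, c), n in counts.items() if c in kept}


# Illustrative counts only: column totals are c1 = 6, c3 = 4, c2 = 2.
sample_counts = {("w1", "c1"): 5, ("w2", "c1"): 1, ("w1", "c2"): 2, ("w2", "c3"): 4}
reduced = keep_frequent_columns(sample_counts, max_columns=2)
```

Dropping rarely seen word chains removes columns that would be almost entirely zero, at the cost of discarding whatever information those chains carried.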

In S12, the concept information database creation unit 16 creates the concept information database 30 so that the concept information of a word or word chain can be looked up using that word or word chain as the key. The concept information database creation process then ends.
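A keyed store of this kind could look like the following sketch. The source does not name a storage technology, so the use of SQLite, the table and column names, and the JSON encoding of each vector are assumptions for illustration. The value 0.056 corresponds to the count of 56 out of an assumed total of 1000 used in the worked example of the source.

```python
import json
import sqlite3
from typing import Dict, Optional


def create_concept_db(conn: sqlite3.Connection, vectors: Dict[str, Dict[str, float]]) -> None:
    """Store each item's concept vector under the item itself as the key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS concept (item TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO concept (item, vector) VALUES (?, ?)",
        [(item, json.dumps(vec, ensure_ascii=False)) for item, vec in vectors.items()],
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, item: str) -> Optional[Dict[str, float]]:
    """Return the concept vector stored for item, or None if the item is unknown."""
    row = conn.execute("SELECT vector FROM concept WHERE item = ?", (item,)).fetchone()
    return json.loads(row[0]) if row else None


conn = sqlite3.connect(":memory:")
create_concept_db(conn, {"我々": {"検索システムの研究": 0.056}})
found = lookup(conn, "我々")
missing = lookup(conn, "未登録")
```

Storing each vector as a sparse mapping keeps rows small when most degrees of co-occurrence are zero.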

Next, the operation of the concept information database creation apparatus 10 is described using specific examples.

FIG. 3 shows an example of the contents of the document set 20 used in the specific examples of the embodiment.

The document set 20 consists of n documents (n is an integer).

The first document 21, the second document 22, ..., and the n-th document 2n are the documents contained in the document set 20. They make up the document set in that order, and the n-th document 2n is the last document in the document set 20.

[Specific Example 1] (word versus word trigram)

In Specific Example 1, the concept information database 30 is created from the document set 20 shown in FIG. 3. The words extracted from the first document 21 are independent words, and the word chains whose co-occurrence with those independent words is counted are three-word chains (word trigrams).

In this example, the document range over which co-occurrences are counted is a single document. That is, for the independent words contained in the first document 21, for example, only co-occurrences with the three-word chains contained in the first document 21 are counted.

The document analysis unit 11 takes the first document 21 from the document set 20 shown in FIG. 3 (S1). Next, the document analysis unit 11 takes the first sentence from the first document 21 taken in S1 (S2). The first document 21 is the document shown in FIG. 3, and the first sentence taken is “我々は検索システムの研究開発を進めている。” (“We are carrying out research and development of a search system.”). The document analysis unit 11 then performs morphological analysis on this first sentence, splits it into words, assigns each word an identifier representing its part of speech, and stores the result in a storage device (S3).

With word boundaries shown as “/”, the result of the morphological analysis is as follows. The brackets [ ] indicate the part of speech and similar information.

Document 1: Paragraph 1: Sentence 1: 我々 [pronoun] / は [adverbial particle] / 検索システム [compound noun] / の [case particle] / 研究 [sa-variable noun] / 開発 [sa-variable noun] / を [case particle] / 進める [verb] / て [conjunctive particle] / いる [auxiliary verb] / 。 [symbol] /

Next, it is determined whether every sentence in the first document 21 has been processed (S4). Because unprocessed sentences remain, the second sentence, “ＰＢ電話機からの入力を簡単なものとするために、新しい日本語入力方式を採用している。” (“To make input from a PB telephone simple, a new Japanese input method has been adopted.”), becomes the processing target (S5). As with the first sentence, morphological analysis and identifier assignment are performed on the second sentence, and the result is stored in a storage device (S3).

When all sentences in the first document 21 have been processed (S4), it is determined whether all documents in the document set 20 have been processed (S6). Since the document set 20 is the set shown in FIG. 3, the second document 22 becomes the next processing target (S7).

All sentences of the second document 22 are processed in the same way (S4). When every document in the document set 20 has been processed, that is, when processing has finished through the n-th document 2n (S6), the document analysis unit 11 sends the morphological analysis results for all documents in the document set 20 to the word extraction unit 12 and the word chain extraction unit 13 (S7).

The word extraction unit 12 and the word chain extraction unit 13 extract all words or word chains from the morphological analysis results and store them in a storage device (S8). In Specific Example 1 the extracted words are independent words, so five independent words are extracted from the morphological analysis result of the first sentence of the first document 21: “我々” (we), “検索システム” (search system), “研究” (research), “開発” (development), and “進める” (advance). All independent words are extracted in the same way from the morphological analysis results of the second and subsequent sentences, through every sentence in the document set 20.
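The extraction of independent words from the tagged first sentence can be sketched as a simple filter. The English tag names are translations chosen for this sketch, and the exact set of tags treated as independent words (`INDEPENDENT_TAGS`) is an assumption based on the parts of speech the source lists (noun, pronoun, verb, adjective, adverb); the source does not show the analyzer's actual tag inventory.

```python
from typing import List, Tuple

# Morphological analysis result of the first sentence as (surface form, part of speech).
FIRST_SENTENCE: List[Tuple[str, str]] = [
    ("我々", "pronoun"),
    ("は", "adverbial particle"),
    ("検索システム", "compound noun"),
    ("の", "case particle"),
    ("研究", "sa-variable noun"),
    ("開発", "sa-variable noun"),
    ("を", "case particle"),
    ("進める", "verb"),
    ("て", "conjunctive particle"),
    ("いる", "auxiliary verb"),
    ("。", "symbol"),
]

# Assumption: noun subtypes count as nouns, and auxiliary verbs are excluded.
INDEPENDENT_TAGS = {
    "noun", "compound noun", "sa-variable noun",
    "pronoun", "verb", "adjective", "adverb",
}


def independent_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return the surface forms whose part of speech marks an independent word."""
    return [surface for surface, tag in tagged if tag in INDEPENDENT_TAGS]


words = independent_words(FIRST_SENTENCE)
```

For the first sentence this filter returns the same five independent words that the text lists.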

Further, the co-occurrence count detection unit 14 extracts the word chains (three-word chains) for each extracted independent word, counts the number of co-occurrences between the independent word and each three-word chain, and sends the counts to the concept information quantification unit 15 (S9).

FIG. 4 shows an example, for Specific Example 1, of the number of co-occurrences between each word (independent word) extracted from the document set 20 and each word chain (three-word chain) extracted from the document set 20.

In FIG. 4, for example, the co-occurrences of the independent word “我々” and the word chain “検索システムの研究” are first counted within the first document 21, then within the second document 22, and so on, and finally within the n-th document 2n. The sum of these counts is 56, as shown in FIG. 4.

When the co-occurrences of the independent word “我々” and the word chain “検索システムの研究” are counted, a co-occurrence is counted as long as both appear in the same document, regardless of which words lie between them and regardless of how many words lie between them. It is also counted when the word chain “検索システムの研究” appears before the independent word “我々” in the same document.
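The per-document counting and summation can be sketched as follows. The source does not say how repeated occurrences inside one document are tallied, so counting every (word occurrence, chain occurrence) pair once is an assumption, as are the function name and the input format. The sample data is invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_cooccurrences(
    documents: List[Tuple[List[str], List[str]]],
) -> Dict[Tuple[str, str], int]:
    """Sum, over all documents, the co-occurrences of each (word, chain) pair.

    Each document is given as (independent words, word chains), with repetitions.
    Order and distance inside a document play no role in the count.
    """
    totals: Counter = Counter()
    for words, chains in documents:
        word_freq = Counter(words)
        chain_freq = Counter(chains)
        for word, wn in word_freq.items():
            for chain, cn in chain_freq.items():
                # Assumption: every (word occurrence, chain occurrence) pair counts once.
                totals[(word, chain)] += wn * cn
    return dict(totals)


# Invented sample: two documents, each with its independent words and three-word chains.
sample_documents = [
    (["我々", "研究"], ["検索システムの研究"]),
    (["研究", "我々", "我々"], ["検索システムの研究", "ている。"]),
]
totals = count_cooccurrences(sample_documents)
```

Because only frequencies per document are used, the result is the same whichever of the word or the chain appears first.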

Similarly, for the independent word “研究”, for example, its co-occurrences with the word chain “ている。” are counted document by document, and the sum of these counts is 76, as shown in FIG. 4.

In Specific Example 1 the co-occurring word chains are three-word chains (word trigrams), so the following are extracted from the morphological analysis result of the first sentence of the first document 21: “★★我々”, “★我々は”, “我々は検索システム”, “は検索システムの”, “検索システムの研究”, “の研究開発”, “研究開発を”, “開発を進める”, “を進めるて”, “進めるている”, “ている。”, “いる。本”, and “。本システム”. The ★ marks in the first two word chains represent empty words. The last two three-word chains contain “本” and “システム”, the words at the beginning of the second sentence.
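The list of three-word chains above can be reproduced with a short sliding-window sketch. The function name `word_chains` and the decision to join the words without a separator are assumptions for illustration; the two words of the following sentence are taken from the list in the text.

```python
from typing import List

PAD = "★"  # empty word used to pad the start of the text


def word_chains(words: List[str], n: int = 3) -> List[str]:
    """Return every chain of n consecutive words, padding the start with n-1 empty words."""
    if n < 2:
        raise ValueError("a word chain needs n >= 2")
    padded = [PAD] * (n - 1) + list(words)
    # Words are joined without a separator, as they are written in the text.
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


first_sentence = ["我々", "は", "検索システム", "の", "研究", "開発", "を", "進める", "て", "いる", "。"]
next_sentence_start = ["本", "システム"]  # first two words of the following sentence
chains = word_chains(first_sentence + next_sentence_start)
```

The 11 words of the first sentence, two leading empty words, and two words of the next sentence give 15 positions and therefore 13 three-word chains.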

In the same way, all three-word chains (word trigrams) are extracted from every sentence in the document set 20.

In FIG. 4 each word chain consists of three words, but a two-word chain or a chain of four or more words may be used instead.

Next, based on the counts from the co-occurrence count detection unit 14, the concept information quantification unit 15 calculates, for each independent word extracted from the first document 21, its degree of co-occurrence with each three-word chain (word trigram) (S10).

A normalized value is used as the degree of co-occurrence: the ratio of the number of occurrences of an individual three-word chain to the total number of occurrences of all three-word chains for the independent word of interest.

For example, if the total number of occurrences of three-word chains for the independent word “我々” in FIG. 4 is 1000, the normalized value for each three-word chain is the corresponding count for “我々” in FIG. 4 divided by 1000.
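The normalization can be written as a small function. The count of 56 and the total of 1000 come from the example in the text; the function name and the placeholder entry that lumps together all remaining chains are assumptions added so that the sketch is self-contained.

```python
from typing import Dict


def cooccurrence_degrees(counts: Dict[str, int]) -> Dict[str, float]:
    """Divide each chain's count by the total count over all chains for one word."""
    total = sum(counts.values())
    if total == 0:
        return {chain: 0.0 for chain in counts}
    return {chain: n / total for chain, n in counts.items()}


# Counts for the independent word 我々. The value 56 is the count quoted from FIG. 4;
# the placeholder entry stands for all remaining chains so that the total is 1000.
counts_for_word = {"検索システムの研究": 56, "<all other chains>": 944}
degrees = cooccurrence_degrees(counts_for_word)
```

Here the degree of co-occurrence of “我々” with “検索システムの研究” is 56 / 1000 = 0.056, and the degrees for one word sum to 1.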

FIG. 5 shows an example of the degrees of co-occurrence between independent words and word chains in Specific Example 1, for the case shown in FIG. 4.

Further, the concept information quantification unit 15 quantifies the concept information of each independent word based on the degrees of co-occurrence and then sends the result to the concept information database creation unit 16 (S11).

If, in the case shown in FIG. 5, a matrix is formed with independent words as rows, three-word chains as columns, and the degrees of co-occurrence used directly as values, and this matrix is taken as the quantified concept information, then the co-occurrence degree calculation results shown in FIG. 5 are themselves the concept information quantification results.

On receiving the quantified concept information, the concept information database creation unit 16 creates the concept information database 30 so that the concept information of an independent word can be looked up using that independent word as the key, and the concept information database creation process ends (S12).

Through the above operation, a concept information database 30 in which the concept information of each independent word is expressed as a row vector of its degrees of co-occurrence with three-word chains (word trigrams) can be created from the document set 20.

[Specific Example 2] (word trigram versus word trigram)

Specific Example 2 also creates the concept information database 30 from the document set 20 shown in FIG. 3. Three-word chains (word trigrams) are used in place of the independent words of Specific Example 1, and the word chains of Specific Example 1 remain word chains, so co-occurrences are counted between word chains; three-word chains (word trigrams) are used as these word chains. The document range over which co-occurrences are counted is again a single document (for example, for the three-word chains contained in the first document 21, only the three-word chains contained in the first document 21 are counted).

The operations corresponding to S1 to S7 in FIG. 2 are the same as in Specific Example 1, so their description is omitted.

次に、語連鎖抽出部１３は、形態素解析結果から、全ての語連鎖を抽出する（Ｓ８）。ここで、具体例２では、抽出する語連鎖は、３単語連鎖（単語ｔｒｉｇｒａｍ）であるので、第１文書２１の第１文の形態素解析結果からは、「★★我々」、「★我々は」、「我々は検索システム」、「は検索システムの」、「検索システムの研究」、「の研究開発」、「研究開発を」、「開発を進める」、「を進めるて」、「進めるている」、「ている。」、「いる。本」、「。本システム」の１３個の３単語連鎖（単語ｔｒｉｇｒａｍ）が抽出される。なお、最初の２つに含まれる★印は空単語を表す。また、最後の２つに３単語連鎖に含まれる「本」と「システム」は、第２文の単語である。 Next, the word chain extraction unit 13 extracts all word chains from the morphological analysis result (S8). Here, in specific example 2, since the word chain to be extracted is a three-word chain (word trigram), from the morphological analysis result of the first sentence of the first document 21, “★★ we”, “★ we are ”,“ We are a search system ”,“ Has a search system ”,“ Research system research ”,“ Research and development ”,“ Research and development ”,“ Proceed development ”,“ Proceed ”,“ Proceed ” Thirteen three-word chains (word trigrams) of “Yes”, “Yes.”, “Yes. Book”, “.Book System” are extracted. Note that the ★ marks included in the first two represent empty words. “Book” and “system” included in the last two words in the three-word chain are words in the second sentence.

In the same way, for the second sentence onward, all three-word chains (word trigrams) are extracted from the morphological analysis results of every sentence in the document set 20.
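The extraction in S8 can be sketched as a sliding window of three tokens. The following is a minimal illustration, not the patent's implementation: the function name `first_sentence_trigrams` is hypothetical, and the tokenization of the first sentence is inferred from the 13 chains listed above (in particular, "検索システム" is treated as a single word).

```python
from typing import List, Tuple

PAD = "★"  # empty-word marker used in the description

def first_sentence_trigrams(tokens: List[str], next_tokens: List[str]) -> List[Tuple[str, ...]]:
    """Return every 3-token window over: two empty words, the sentence,
    and the first two tokens of the following sentence."""
    padded = [PAD, PAD] + tokens + next_tokens[:2]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Assumed morphological analysis result, inferred from the listed chains.
first = ["我々", "は", "検索システム", "の", "研究", "開発", "を", "進める", "て", "いる", "。"]
second_head = ["本", "システム"]  # first two words of the second sentence

trigrams = first_sentence_trigrams(first, second_head)
print(len(trigrams))
print(["".join(t) for t in trigrams])
```

With 11 tokens in the sentence, two empty words in front and two look-ahead tokens behind, the window yields the 13 chains given in the text.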

Next, for each extracted three-word chain (word trigram), the co-occurrence count detection unit 14 extracts the word chains that co-occur with it, counts the co-occurrences, and sends the counts to the concept information quantification unit 15 (S9).
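A rough sketch of the counting in S9 with the document as the co-occurrence range follows. The exact counting rule is not spelled out in this passage, so the rule used here is an assumption: for every chain `a` present in a document, the number of occurrences of every other chain `b` in that document is added to `counts[a][b]`. The function name and the placeholder chain labels `t1` to `t3` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List

def count_cooccurrence(docs: List[List[Hashable]]) -> Dict[Hashable, Counter]:
    """docs[i] is the list of word chains extracted from document i.

    Assumed rule: chains co-occur when they appear in the same document;
    a chain is not counted as co-occurring with itself.
    """
    counts: Dict[Hashable, Counter] = defaultdict(Counter)
    for chains in docs:
        freq = Counter(chains)          # occurrences of each chain in this document
        for a in freq:
            for b, n in freq.items():
                if a != b:
                    counts[a][b] += n
    return counts

# Two tiny placeholder documents (hypothetical data).
docs = [["t1", "t2", "t2", "t3"], ["t1", "t3"]]
counts = count_cooccurrence(docs)
print(dict(counts["t1"]))
```

Narrowing the range to a paragraph or a sentence only changes what each inner list in `docs` represents.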

In Specific Example 2 the extracted word chains are three-word chains (word trigrams), so the 13 three-word chains listed above are obtained from the morphological analysis result of the first sentence of the first document 21. All three-word chains are extracted in the same way from every sentence in the document set 20.

FIG. 6 shows an example of the co-occurrence counts between word chains extracted from the document set 20 in Specific Example 2.

Next, based on the counts from the co-occurrence count detection unit 14, the concept information quantification unit 15 calculates, for each three-word chain (word trigram), its co-occurrence degree with the three-word chains that co-occur with it in the same document (S10).

Here, a normalized value is used as the co-occurrence degree: the ratio of the number of occurrences of each individual three-word chain to the total number of occurrences of all three-word chains for the three-word chain of interest.

For example, suppose that in FIG. 6 the total number of occurrences of the three-word chains for the three-word chain (word trigram) "★★我々" (that is, the sum of the co-occurrence counts of "★★我々" with all three-word chains) is 200. The normalized values are then obtained by dividing each of the counts for "★★我々" in FIG. 6 by 200.
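The normalization in S10 amounts to dividing each count in a row by the row total. A minimal sketch follows. The three counts are made-up values chosen to sum to 200 and are not taken from FIG. 6, and the function name is hypothetical.

```python
from typing import Dict

def normalize_by_total(row: Dict[str, int]) -> Dict[str, float]:
    """Divide each co-occurrence count by the sum of the row."""
    total = sum(row.values())
    if total == 0:
        return {}
    return {chain: n / total for chain, n in row.items()}

# Hypothetical counts for the chain of interest "★★我々" (total = 200).
row = {"我々は検索システム": 50, "の研究開発": 30, "研究開発を": 120}
degrees = normalize_by_total(row)
print(degrees)
```

Each row of co-occurrence degrees sums to 1 under this choice, so rows for frequent and rare chains become directly comparable.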

FIG. 7 shows an example of the co-occurrence degrees between word chains for the case shown in FIG. 6 in Specific Example 2.

The concept information quantification unit 15 then quantifies the concept information of each three-word chain (word trigram) based on the co-occurrence degrees, and sends the result to the concept information database creation unit 16 (S11).

In the quantified result, each three-word chain (word trigram) forms a row, and each co-occurring three-word chain (word trigram) forms a column.

FIG. 8 shows an example of the co-occurrence degrees between word chains in Specific Example 2 when the number of word chains on one side of the two sets being compared is reduced.

In the example of FIG. 7, it is not necessary to use all the word chains extracted from the document set 20; a subset may be selected. FIG. 8 shows an example of the concept information quantification result when the three-word chains on one side are limited to the following seven: "我々は検索システム", "は検索システムの", "検索システムの研究", "の研究開発", "研究開発を", "やＷｅｂを", and "を利用するた".

The three-word chains on that side may also be limited to a number other than seven.

Limiting that side to a small number of three-word chains in this way makes the calculation easier.

After that side has been limited, the co-occurrence degrees may be recalculated from the co-occurrence counts with the limited three-word chains only.
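Restricting one side and recalculating can be sketched as filtering the columns and renormalizing each row over the columns that remain. This is an illustrative sketch only: the function name and the counts are hypothetical, and the sum-based normalization described earlier is assumed.

```python
from typing import Dict, Set

def restrict_and_renormalize(
    counts: Dict[str, Dict[str, int]], keep: Set[str]
) -> Dict[str, Dict[str, float]]:
    """Keep only the selected column chains, then renormalize each row by its new total."""
    result: Dict[str, Dict[str, float]] = {}
    for row_chain, row in counts.items():
        kept = {c: n for c, n in row.items() if c in keep}
        total = sum(kept.values())
        result[row_chain] = {c: n / total for c, n in kept.items()} if total else {}
    return result

# Hypothetical counts; only two column chains are retained.
counts = {"★★我々": {"の研究開発": 30, "研究開発を": 10, "ている。": 160}}
keep = {"の研究開発", "研究開発を"}
limited = restrict_and_renormalize(counts, keep)
print(limited)
```

The row vectors then have one element per retained chain, which is what keeps the later computation small.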

On receiving the quantified concept information, the concept information database creation unit 16 creates the concept information database 30 so that the concept information of each three-word chain (word trigram) can be retrieved using that three-word chain as the key, and the concept information database creation process ends (S12).

The key need not be the three-word chain (word trigram) alone; the words contained in each three-word chain may also be used as secondary keys. That is, a word may serve as a secondary key so that the three-word chains containing it can be retrieved from that word. Independent words may also be used as secondary keys. No restriction is placed on the secondary keys here.
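One way to picture the primary and secondary keys is a dictionary keyed by trigram plus an inverted index from each word to the trigrams that contain it. The sketch below is an assumption about a possible layout, not the storage format of the concept information database 30. All names and values are hypothetical, and trigrams are represented as tuples of tokens.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Trigram = Tuple[str, str, str]
Vector = Dict[str, float]

def build_database(vectors: Dict[Trigram, Vector]):
    """Return (primary, secondary): trigram -> vector, and word -> set of trigrams."""
    primary = dict(vectors)
    secondary: Dict[str, Set[Trigram]] = defaultdict(set)
    for trigram in primary:
        for word in trigram:
            secondary[word].add(trigram)
    return primary, secondary

def lookup_by_word(primary, secondary, word: str) -> Dict[Trigram, Vector]:
    """Retrieve the concept information of every trigram containing the word."""
    return {t: primary[t] for t in secondary.get(word, ())}

# Hypothetical co-occurrence-degree rows.
vectors = {
    ("の", "研究", "開発"): {"我々は検索システム": 0.5},
    ("研究", "開発", "を"): {"我々は検索システム": 0.2},
    ("★", "★", "我々"): {"我々は検索システム": 0.3},
}
primary, secondary = build_database(vectors)
hits = lookup_by_word(primary, secondary, "研究")
print(sorted(hits))
```

To use only independent words as secondary keys, the inner loop would skip particles and other function words, given part-of-speech tags from the document analysis.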

That is, in Specific Example 2, the concept information quantification means obtains the co-occurrence degree based on the co-occurrence counts between each of the first word chains and each of the second word chains, quantifies the concept information of the word chains based on the co-occurrence degree thus obtained, and stores it in a storage device. The first word chain is a chain of n consecutive words in a document (n is an integer of 2 or more), and the second word chain is a chain of m consecutive words in a document (m is an integer of 2 or more).

Through the above operations, a concept information database 30 can be created from the document set 20 in which the concept information of each three-word chain (word trigram) is expressed as a row vector whose elements are its co-occurrence degrees with three-word chains.

[Specific Example 3]
Specific Example 3 counts the co-occurrences between each "word chain" extracted from the first document 21 and each "word", obtains the co-occurrence degrees from those counts, and quantifies the concept information of the word chains in the same manner as above.

In other words, Specific Example 3 obtains the co-occurrence counts and the co-occurrence degrees with the rows and columns of FIG. 4 transposed.
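The transposition can be sketched by swapping the two levels of a nested count table and then normalizing each new row. This is a minimal illustration. It assumes that FIG. 4 has words as rows and word chains as columns, and the placeholder chain labels `c1` and `c2` and the counts are hypothetical.

```python
from typing import Dict

def transpose(counts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Swap rows and columns of a nested count table."""
    out: Dict[str, Dict[str, int]] = {}
    for row_key, row in counts.items():
        for col_key, n in row.items():
            out.setdefault(col_key, {})[row_key] = n
    return out

def normalize_by_total(row: Dict[str, int]) -> Dict[str, float]:
    total = sum(row.values())
    return {k: v / total for k, v in row.items()} if total else {}

# Hypothetical word-by-chain counts in the layout assumed for FIG. 4.
word_by_chain = {"検索システム": {"c1": 3, "c2": 1}, "研究": {"c1": 1}}
chain_by_word = transpose(word_by_chain)
chain_degrees = {c: normalize_by_total(r) for c, r in chain_by_word.items()}
print(chain_degrees)
```

After the swap, each word chain is described by a row vector over words rather than the other way round.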

According to the above embodiment, a given document is divided into words and annotated with grammatical information such as parts of speech (document analysis); the words or word chains appearing in the document, and the words or word chains appearing within a fixed document range (the same document, the same paragraph, the same sentence, and so on), are extracted; the concept information of each word or word chain is quantified based on its co-occurrence degrees; and the concept information obtained for all the words or word chains is built into a database. The concept information of a word or word chain appearing in a given document can therefore be quantified based on its co-occurrence degree with the words or word chains that co-occur with it.

That is, the above embodiment creates a database from combinations of words and word chains. It is an example of a concept information database creation device comprising: document analysis means for analyzing a given document set; word extraction means for extracting the words existing in the given document set and storing them in a storage device; word chain extraction means for extracting the word chains existing in the given document set and storing them in a storage device; co-occurrence count detection means for detecting the co-occurrence count between each of the words and each of the word chains and storing it in a storage device; concept information quantification means for detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the words based on the detected co-occurrence degree, and storing it in a storage device; and concept information database creation means for building the concept information of the words obtained by the concept information quantification means into a database.

In this case, the co-occurrence degree is a value obtained by normalizing the number of occurrences of each of the word chains for the word of interest among the words. The normalized value is the ratio of the number of occurrences of each individual word chain to the total number of occurrences of the word chains for the word of interest. Alternatively, the normalized value is the ratio of the number of occurrences of each individual word chain to the largest number of occurrences among the word chains for the word of interest. The words are independent words, and the word chain is a chain of n consecutive words in a document (n is an integer of 2 or more). The concept information quantification means quantifies the concept information of the words based on the co-occurrence degree between the words and the word chains existing in the document range over which co-occurrences in the document set are counted, and stores it in a storage device. The document range over which co-occurrences are counted is one of the following: a subset of the given document set, at least one paragraph contained in a document, or at least one sentence contained in a paragraph.
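The second normalization mentioned above, dividing by the largest count rather than by the total, can be sketched as follows. The function name and the counts are hypothetical.

```python
from typing import Dict

def normalize_by_max(row: Dict[str, int]) -> Dict[str, float]:
    """Divide each count by the largest count in the row."""
    peak = max(row.values(), default=0)
    if peak == 0:
        return {}
    return {chain: n / peak for chain, n in row.items()}

# Hypothetical counts for one word of interest.
row_counts = {"の研究開発": 40, "研究開発を": 20, "開発を進める": 80}
by_max = normalize_by_max(row_counts)
print(by_max)
```

With this variant the most frequent co-occurring chain always receives the value 1, whereas the total-based variant makes each row sum to 1.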

The above embodiment also creates a database from combinations of word chains with word chains or words. It is an example of a concept information database creation device comprising: document analysis means for analyzing a given document set; extraction means for extracting the word chains, or the word chains and the words, existing in the given document set and storing them in a storage device; co-occurrence count detection means for detecting the co-occurrence count between each of the word chains and each of the word chains or words and storing it in a storage device; concept information quantification means for detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the word chains based on the detected co-occurrence degree, and storing it in a storage device; and concept information database creation means for building the concept information of the word chains obtained by the concept information quantification means into a database.

In this case, the co-occurrence degree is a value obtained by normalizing the number of occurrences of each of the word chains or words for the word chain of interest among the word chains. The normalized value is the ratio of the number of occurrences of each individual word chain or word to the total number of occurrences of the word chains or words for the word chain of interest. Alternatively, the normalized value is the ratio of the number of occurrences of each individual word chain or word to the largest number of occurrences among the word chains or words for the word chain of interest. The words are independent words, and the word chain is a chain of n consecutive words in a document (n is an integer of 2 or more). The concept information quantification means quantifies the concept information of the words based on the co-occurrence degree between the word chains and the words or word chains existing in the document range over which co-occurrences in the document set are counted, and stores it in a storage device. The document range over which co-occurrences are counted is one of the following: a subset of the given document set, at least one paragraph contained in a document, or at least one sentence contained in a paragraph.
Further, the concept information quantification means quantifies the concept information of the word chains based on the co-occurrence degree between each of the first word chains and each of the second word chains, and stores it in a storage device. The first word chain is a chain of n consecutive words in a document (n is an integer of 2 or more), and the second word chain is a chain of m consecutive words in a document (m is an integer of 2 or more).

The above embodiment can also be understood as an embodiment of a method. It is an example of a concept information database creation method comprising: a document analysis stage of analyzing a given document set; a word extraction stage of extracting the words existing in the given document set and storing them in a storage device; a word chain extraction stage of extracting the word chains existing in the given document set and storing them in a storage device; a co-occurrence count detection stage of detecting the co-occurrence count between each of the words and each of the word chains and storing it in a storage device; a concept information quantification stage of detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the words based on the detected co-occurrence degree, and storing it in a storage device; and a concept information database creation stage of building the concept information of the words obtained in the concept information quantification stage into a database.

The above embodiment can further be understood as another embodiment of a method. It is an example of a concept information database creation method comprising: a document analysis stage of analyzing a given document set; an extraction stage of extracting the word chains, or the word chains and the words, existing in the given document set and storing them in a storage device; a co-occurrence count detection stage of detecting the co-occurrence count between each of the word chains and each of the word chains or words and storing it in a storage device; a concept information quantification stage of detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the word chains based on the detected co-occurrence degree, and storing it in a storage device; and a concept information database creation stage of building the concept information of the word chains obtained in the concept information quantification stage into a database.

The above embodiment is also an example of a program that causes a computer to execute each of the stages of each of the two concept information database creation methods.

The program may be recorded on a recording medium such as a CD, a DVD, or a semiconductor memory. That is, the above embodiment is also an example of a computer-readable recording medium on which is recorded a program that causes a computer to execute each of the stages of each of the two concept information database creation methods.

FIG. 1 is a block diagram showing the basic configuration of the concept information database creation device 10 according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the general operation of the concept information database creation device 10.
FIG. 3 shows an example of the contents of the document set 20 used in the specific examples of the embodiment.
FIG. 4 shows, for Specific Example 1, an example of the co-occurrence counts between each of the words (independent words) extracted from the document set 20 and each of the word chains (three-word chains) extracted from the document set 20.
FIG. 5 shows, for Specific Example 1, an example of the co-occurrence degrees between the independent words and the word chains for the case shown in FIG. 4.
FIG. 6 shows, for Specific Example 2, an example of the co-occurrence counts between word chains extracted from the document set 20.
FIG. 7 shows, for Specific Example 2, an example of the co-occurrence degrees between word chains for the case shown in FIG. 6.
FIG. 8 shows, for Specific Example 2, an example of the co-occurrence degrees between word chains when the number of word chains on one side of the two sets being compared is reduced.

Explanation of symbols

10 ... Concept information database creation device,
11 ... Document analysis unit,
12 ... Word extraction unit,
13 ... Word chain extraction unit,
14 ... Co-occurrence count detection unit,
15 ... Concept information quantification unit,
16 ... Concept information database creation unit,
20 ... Document set,
30 ... Concept information database,
21 ... First document,
22 ... Second document,
2n ... n-th document (final document).

Claims

Document analysis means for analyzing a given document set;
Word extraction means for extracting words existing in the given document set and storing them in a storage device;
Word chain extraction means for extracting word chains existing in the given document set and storing them in a storage device;
Co-occurrence count detection means for detecting the co-occurrence count between each of the words and each of the word chains and storing it in a storage device;
Concept information quantification means for detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the words based on the detected co-occurrence degree, and storing it in a storage device; and
Concept information database creation means for building the concept information of the words obtained by the concept information quantification means into a database;
A concept information database creation device characterized by comprising the above.

In claim 1,
The concept information database creation device characterized in that the co-occurrence degree is a value obtained by normalizing the number of occurrences of each of the word chains for the word of interest among the words.

In claim 2,
The concept information database creation device characterized in that the normalized value is the ratio of the number of occurrences of each individual word chain to the total number of occurrences of the word chains for the word of interest.

In claim 2,
The concept information database creation device characterized in that the normalized value is the ratio of the number of occurrences of each individual word chain to the largest number of occurrences among the word chains for the word of interest.

In any one of claims 1 to 4,
The words are independent words, and
The word chain is a chain of n consecutive words in a document (n is an integer of 2 or more);
The concept information database creation device characterized by the above.

In claim 1,
The concept information quantification means is means for quantifying the concept information of the words based on the co-occurrence degree between the words and the word chains existing in the document range over which co-occurrences in the document set are counted, and storing it in a storage device, and
The document range over which co-occurrences are counted is one of the following: a subset of the given document set, at least one paragraph contained in a document, or at least one sentence contained in a paragraph;
The concept information database creation device characterized by the above.

Document analysis means for analyzing a given document set;
Extraction means for extracting the word chains, or the word chains and the words, existing in the given document set and storing them in a storage device;
Co-occurrence count detection means for detecting the co-occurrence count between each of the word chains and each of the word chains or words and storing it in a storage device;
Concept information quantification means for detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the word chains based on the detected co-occurrence degree, and storing it in a storage device; and
Concept information database creation means for building the concept information of the word chains obtained by the concept information quantification means into a database;
A concept information database creation device characterized by comprising the above.

In claim 7,
The concept information database creation device characterized in that the co-occurrence degree is a value obtained by normalizing the number of occurrences of each of the word chains or words for the word chain of interest among the word chains.

In claim 8,
The concept information database creation device characterized in that the normalized value is the ratio of the number of occurrences of each individual word chain or word to the total number of occurrences of the word chains or words for the word chain of interest.

In claim 8,
The concept information database creation device characterized in that the normalized value is the ratio of the number of occurrences of each individual word chain or word to the largest number of occurrences among the word chains or words for the word chain of interest.

In any one of claims 7 to 10,
The words are independent words, and
The word chain is a chain of n consecutive words in a document (n is an integer of 2 or more);
The concept information database creation device characterized by the above.

In claim 7,
The concept information quantification means is means for quantifying the concept information of the words based on the co-occurrence degree between the word chains and the words or word chains existing in the document range over which co-occurrences in the document set are counted, and storing it in a storage device, and
The document range over which co-occurrences are counted is one of the following: a subset of the given document set, at least one paragraph contained in a document, or at least one sentence contained in a paragraph;
The concept information database creation device characterized by the above.

In any one of claims 7 to 10,
The concept information quantification means is means for quantifying the concept information of the word chains based on the co-occurrence degree between each of the first word chains and each of the second word chains, and storing it in a storage device,
The first word chain is a chain of n consecutive words in a document (n is an integer of 2 or more), and
The second word chain is a chain of m consecutive words in a document (m is an integer of 2 or more);
The concept information database creation device characterized by the above.

A document analysis stage of analyzing a given document set;
A word extraction stage of extracting words existing in the given document set and storing them in a storage device;
A word chain extraction stage of extracting word chains existing in the given document set and storing them in a storage device;
A co-occurrence count detection stage of detecting the co-occurrence count between each of the words and each of the word chains and storing it in a storage device;
A concept information quantification stage of detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the words based on the detected co-occurrence degree, and storing it in a storage device; and
A concept information database creation stage of building the concept information of the words obtained in the concept information quantification stage into a database;
A concept information database creation method characterized by comprising the above.

A document analysis stage of analyzing a given document set;
An extraction stage of extracting the word chains, or the word chains and the words, existing in the given document set and storing them in a storage device;
A co-occurrence count detection stage of detecting the co-occurrence count between each of the word chains and each of the word chains or words and storing it in a storage device;
A concept information quantification stage of detecting a co-occurrence degree according to the co-occurrence count, quantifying the concept information of the word chains based on the detected co-occurrence degree, and storing it in a storage device; and
A concept information database creation stage of building the concept information of the word chains obtained in the concept information quantification stage into a database;
A concept information database creation method characterized by comprising the above.

A program that causes a computer to execute each of the stages of the concept information database creation method according to claim 14 or claim 15.

A computer-readable recording medium on which is recorded a program that causes a computer to execute each of the stages of the concept information database creation method according to claim 14 or claim 15.