JP2009193219A

JP2009193219A - Indexing apparatus, method thereof, program, and recording medium

Info

Publication number: JP2009193219A
Application number: JP2008031699A
Authority: JP
Inventors: Takaaki Hasegawa (長谷川隆明); Kenji Imamura (今村賢治); Genichiro Kikui (菊井玄一郎)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-02-13
Filing date: 2008-02-13
Publication date: 2009-08-27

Abstract

PROBLEM TO BE SOLVED: To enable document retrieval with high ranking accuracy, at a processing speed suitable for a search engine, even for a document set made up of documents with miscellaneous contents.

SOLUTION: A keyword identification section 4 identifies the positions at which a word or word sequence in each analyzed retrieval target document matches a keyword. A neighborhood word acquisition section 5 extracts neighborhood words of the keyword from each analyzed retrieval target document and counts the co-occurrence frequency of the keyword and each neighborhood word. A relevance calculation section 7 calculates the relevance between the keyword and each neighborhood word from their co-occurrence frequency over the entire set of analyzed retrieval target documents, the appearance frequency of the keyword, the appearance frequency of the neighborhood word, and the total word count. A document score calculation section 9 uses this relevance to calculate a score of the keyword for each analyzed retrieval target document and outputs the score as an index.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a technique for creating an index used to retrieve, from a set of documents, the documents that match a query (a processing request expressed as a character string) and to rank them in order of relevance.

Various methods have been proposed for retrieving documents that match a query from a set of documents and ranking them in order of relevance.

For example, the vector space model is a representative way of expressing the content of a document: similarity is computed as the cosine measure between a document vector and a query vector, and documents are ranked in descending order of similarity. As a way of improving retrieval accuracy, relevance feedback has been proposed, in which, after an initial search, words contained in documents that the user judged relevant or not relevant are added to the query, weighted according to the numbers of relevant and non-relevant documents, and the search is run again (see Non-Patent Documents 1 and 2). A method has also been proposed that requires no user feedback: the mutual information between the query and the words contained in the top-ranked documents of the initial search result is calculated, and words related to the query are detected automatically and added to the query (see Non-Patent Document 3).
Non-Patent Document 1: Takenobu Tokunaga, "Language and Computation 5: Information Retrieval and Language Processing", University of Tokyo Press, 1999, pp. 154-159
Non-Patent Document 2: Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms", Kyoritsu Shuppan Co., Ltd., 2002, pp. 60-65
Non-Patent Document 3: Atsushi Kanaya, Kyoji Umemura, "Analysis of Empirical Weights Using Correlation Coefficients and Search Query Expansion", IPSJ SIG Technical Report, 2003-FI-13, 2003, pp. 17-24

However, applying these conventional techniques to a search engine raises several problems.

First, relevance feedback relies on the top-ranked documents of the initial search result, but users of a search engine usually enter very few keywords, so the initial search often fails to return top-ranked documents that adequately reflect the user's search request.

Second, the documents handled by a search engine include documents written freely by users. A single such document may contain miscellaneous contents, which makes it difficult to extract the words related to the query from that document, so applying conventional relevance feedback or automatic query expansion as they are does not improve accuracy.

Third, a search engine requires real-time processing, but conventional relevance feedback and automatic query expansion compute the related words from the initial search result each time and then search again with the updated query, which is a problem in terms of processing speed.

An object of the present invention is to enable highly accurate ranking, at a processing speed applicable to a search engine, when the user's query is given as a small number of keywords, even for a set of documents that may include documents with miscellaneous contents.

To achieve this object, the present invention is characterized in that:
・the degree of relevance between a keyword and its neighborhood words is calculated, so that attention is paid to nearby words strongly related to the keyword and a set of documents with miscellaneous contents can be handled; and
・on the premise that a document having words strongly correlated with the keyword in the keyword's neighborhood matches the keyword well, the document score is calculated by applying the degree of relevance with the keyword to the keyword's neighborhood words.

Specifically, the index creation apparatus of the present invention creates an index for retrieving documents that match a query from a set of search target documents and ranking them in order of relevance, and comprises at least:
・an analyzed document database that stores a set of morphologically analyzed search target documents;
・a keyword identification unit that identifies the positions at which a word or word string in each analyzed search target document matches a keyword given as a query;
・a neighborhood word acquisition unit that extracts, from each analyzed search target document, the content words located near the position of the word or word string matching the keyword as neighborhood words, and counts the co-occurrence frequency of that word or word string and each neighborhood word;
・a word frequency measurement unit that counts, over the entire set of analyzed search target documents, the appearance frequency of the word or word string corresponding to the keyword, the appearance frequency of each neighborhood word, and the total number of words;
・a relevance calculation unit that calculates the degree of relevance between the keyword and each neighborhood word from the co-occurrence frequency of the matching word or word string and the neighborhood word over the entire set, together with the appearance frequency of the word or word string corresponding to the keyword, the appearance frequency of the neighborhood word, and the total number of words; and
・a document score calculation unit that uses the degree of relevance between the keyword and its neighborhood words to calculate a score for the keyword in each analyzed search target document and outputs the score as an index.

Furthermore, the present invention is characterized in that:
・to avoid repeated searches and reduce the amount of computation at search time, the document score of each document in the document set is calculated in advance for each keyword and stored in the index.

That is, in addition to the above, the index creation apparatus of the present invention includes a keyword database that stores a set of keywords, repeats the processing of the keyword identification unit, neighborhood word acquisition unit, word frequency measurement unit, relevance calculation unit, and document score calculation unit for every keyword contained in the keyword database, and generates an index whose elements are the scores corresponding to each combination of a keyword and an analyzed search target document.

As a result, the present invention can reflect information about words strongly related to the keyword in the search ranking. In particular, (1) because it uses information about the words surrounding the keyword, it is effective even for document sets containing miscellaneous information; (2) it does not require feedback from the user; and (3) by using an index that stores precomputed document scores, it can output a ranking of retrieved documents quickly at search time.

According to the present invention, even when the query is given as only a small number of keywords and the target document set has miscellaneous contents, a high score is given to documents that contain, around the keyword, many words highly correlated with it. Consequently, documents with many keyword-related words in the keyword's neighborhood can be shown at the top of the search ranking. In addition, because an index of documents is created in advance for the keywords expected as queries, there is no need to collect user feedback or run computations on search results at search time, so results are output faster than with conventional methods. The present invention thus achieves fast and accurate document search ranking.

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 shows an example of an embodiment of the index creation apparatus of the present invention (here also including the part related to document search). The index creation apparatus of this embodiment comprises an analyzed document database 1, a keyword database 2, a keyword input unit 3, a keyword identification unit 4, a neighborhood word acquisition unit 5, a word frequency measurement unit 6, a relevance calculation unit 7, a relevance table 8, a document score calculation unit 9, a keyword document index 10, and a document search unit 11.

The analyzed document database 1 stores a set of analyzed search target documents, in which each word of a search target document is annotated with word information such as its surface form, reading, and part of speech, and with position information indicating where the word occurs in the document. The analyzed search target documents are obtained by applying well-known morphological analysis to each search target document (written in natural language) in a search target document database that stores the set of search target documents in advance.

FIG. 2 shows an example of an analyzed search target document for the search target document 「横須賀百貨店の長谷川隆明店長ってほんとにいい人かもね」 ("Store manager Takaaki Hasegawa of Yokosuka Department Store might be a really nice person"); position information is omitted.

The keyword database 2 stores the set of keywords used for document search (given as queries). A keyword here is a single word, a compound word, or a phrase. A compound word is a sequence of consecutive nouns, and a phrase may contain function words as well as content words. Keywords may be chosen from words that appear frequently in the set of search target documents, or from named entities extracted with an existing named entity extraction technique. Alternatively, an information source other than the document set may be used, such as a query log that aggregates the search keywords entered by users.

The keyword input unit 3 acquires a keyword from the keyword database 2 and inputs it to the keyword identification unit 4.

The keyword identification unit 4 compares each word (morpheme), or each word string (morpheme string) formed by a sequence of words, in every analyzed search target document stored in the analyzed document database 1 with the input keyword. When they match, it acquires the start and end positions and outputs them to the neighborhood word acquisition unit 5.

For example, when the keyword is 「長谷川隆明」 (Takaaki Hasegawa) and the document is the example shown in FIG. 3, the surface form of each word is compared with the keyword by prefix matching, starting from the head of the document, and consecutive words are concatenated, shifting the word position, until the surface form of the concatenated word string exactly matches the keyword. In this way 「長谷川」 (Hasegawa) at position 4 through 「隆明」 (Takaaki) at position 5 is found to match the keyword, giving start position 4 and end position 5 for the matching word string. When a document contains several matching positions, all of them are detected. Keyword detection may also use other methods, such as character-string matching, and is not restricted to words or word strings.
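The prefix-matching procedure described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `find_keyword_spans` is hypothetical, and because FIG. 3 is not reproduced in this text, the segmentation of the example sentence into 13 words is an assumption chosen to agree with the positions and word count quoted in the description.

```python
def find_keyword_spans(surfaces, keyword):
    """Return 1-based (start, end) positions of word strings whose
    concatenated surface forms are exactly equal to the keyword."""
    spans = []
    for start in range(len(surfaces)):
        joined = ""
        for end in range(start, len(surfaces)):
            joined += surfaces[end]
            if not keyword.startswith(joined):
                break  # prefix match failed; move on to the next start position
            if joined == keyword:
                spans.append((start + 1, end + 1))
                break
    return spans


# Assumed segmentation of the example document (13 words); FIG. 3 may differ.
EXAMPLE_SURFACES = ["横須賀", "百貨店", "の", "長谷川", "隆明", "店長",
                    "って", "ほんとに", "いい", "人", "か", "も", "ね"]

print(find_keyword_spans(EXAMPLE_SURFACES, "長谷川隆明"))  # [(4, 5)]
```

Every start position is tried, so all matching positions in a document are collected, as the description requires.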

Based on the start and end positions of the word or word string that matches the keyword, the neighborhood word acquisition unit 5 extracts, from each analyzed search target document stored in the analyzed document database 1, the content words located near that position as neighborhood words. It stores each neighborhood word in the relevance table 8 paired with the matching word or word string, and counts their co-occurrence frequency, that is, stores and updates it in the relevance table 8. Two co-occurrence frequencies are counted: one restricted to the analyzed search target document in which the matching word or word string and the neighborhood word were detected, and one over the entire set of analyzed search target documents.

Content words are defined in advance, for example by the part-of-speech information attached to each word. As shown in FIG. 3, the parts of speech of content words are specified as, for example, "noun" or "noun: proper". Content words may also be restricted to words with large values of a measure such as IDF (inverse document frequency) or residual IDF, or such a measure may be combined with part-of-speech information. For example, residual IDF takes a large value for a word whose total frequency over the whole document set is high relative to the number of documents in which it appears, and such a word is regarded as an important keyword. Any measure of this kind may be used, not only these. Content words may further be restricted to named entities such as person names, place names, and organization names, or named entities may be used together with ordinary words. Named entities can be extracted with an existing named entity extraction technique.

The neighborhood is, for example, the set of words within a pre-specified number of words before and after the keyword. FIG. 3 shows the case where the three words before and after the keyword are specified as the neighborhood. In this example, the three words before start position 4 of the keyword 「長谷川隆明」, that is, the words at positions 1 to 3, and the three words after its end position 5, that is, the words at positions 6 to 8, form the neighborhood. Applying the part-of-speech specification for content words described above yields 「横須賀」 (Yokosuka), 「百貨店」 (department store), and 「店長」 (store manager) as the neighborhood words.
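The window-based neighborhood with a part-of-speech filter can be sketched as below. The function name, the tag strings, and the tagging of the example sentence are assumptions made for illustration (FIG. 3 is not reproduced here); only the window size of 3 and the resulting three neighborhood words come from the description.

```python
# Part-of-speech tags treated as content words (assumed tag spellings).
CONTENT_POS = {"noun", "noun:proper"}


def neighborhood_words(tokens, start, end, window=3):
    """Return content words within `window` words before `start` and
    after `end`; positions are 1-based and inclusive."""
    before = tokens[max(0, start - 1 - window):start - 1]
    after = tokens[end:end + window]
    return [surface for surface, pos in before + after if pos in CONTENT_POS]


# Assumed analysis of the example document as (surface form, part of speech).
EXAMPLE_TOKENS = [
    ("横須賀", "noun:proper"), ("百貨店", "noun"), ("の", "particle"),
    ("長谷川", "noun:proper"), ("隆明", "noun:proper"), ("店長", "noun"),
    ("って", "particle"), ("ほんとに", "adverb"), ("いい", "adjective"),
    ("人", "noun"), ("か", "particle"), ("も", "particle"), ("ね", "particle"),
]

# Keyword 「長谷川隆明」 occupies positions 4 to 5.
print(neighborhood_words(EXAMPLE_TOKENS, 4, 5))  # ['横須賀', '百貨店', '店長']
```

The noun 「人」 at position 10 is excluded because it lies outside the three-word window, which agrees with the three neighborhood words given in the text.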

The neighborhood is not limited to a number of words before and after the keyword. It may instead consist of the words that stand in a dependency relation with the keyword according to dependency parsing, or of the words contained in the same bunsetsu (phrase unit) as the keyword.

When named entities are targeted, neither the keyword nor the named entity necessarily consists of a single word, so a keyword and a named entity may be treated as co-occurring when the difference between their start positions is within a pre-specified window size. As shown in the example of FIG. 4, they may also be regarded as co-occurring when the named entity is not completely contained in the window, or when the keyword is contained inside the named entity.

FIG. 5 shows the flow of processing in the keyword identification unit and the neighborhood word acquisition unit.

First, a pointer is set at the head of the analyzed search target document (s1), and a word and its position information are acquired (s2). Next, the acquired word is compared with the input keyword (s3). If they match, the position of the acquired word is stored as the position of the word matching the keyword (s4) and the pointer is then advanced by one; if they do not match, the pointer is simply advanced by one (s5). This processing is repeated until the end of the document (s6).

Next, one word is taken from the analyzed search target document together with its position information (s7), and it is determined whether the difference between the position of this word and the stored position of the word matching the keyword is within a pre-specified value (s8). If it is, the word is stored in the relevance table 8 as a neighborhood word of the keyword, and the co-occurrence frequency of the matching word and this word is stored and updated (counted) in the relevance table 8 (s9); if the difference exceeds the specified value, nothing is done. This processing is repeated for all words in the analyzed search target document (s10).

FIG. 5 shows the processing for a single analyzed search target document; in practice the same processing is performed for all analyzed search target documents stored in the analyzed document database 1.

The word frequency measurement unit 6 searches all analyzed search target documents stored in the analyzed document database 1 for the keyword and for the neighborhood words stored in the relevance table 8. It measures the appearance frequency of the word or word string corresponding to the keyword and the appearance frequency of each neighborhood word over the entire set of analyzed search target documents, as well as the number of words in each analyzed search target document and the total number of words in the entire set, and stores these values in the relevance table 8.

The relevance calculation unit 7 calculates the degree of relevance between the keyword and each neighborhood word according to a pre-specified formula, using the values stored in the relevance table 8: the co-occurrence frequency of the matching word or word string and the neighborhood word over the entire set of analyzed search target documents, the appearance frequency of the word or word string corresponding to the keyword, the appearance frequency of the neighborhood word, and the total number of words. Any formula that gives the strength of the correlation between the keyword and the neighborhood word may be used; for example, it may be a formula for mutual information. An example of the formula is shown below.

where
Q: keyword
W: neighborhood word of the keyword
C(Q): appearance frequency of keyword Q in the document set
C(W): appearance frequency of neighborhood word W in the document set
C(Q, W): co-occurrence frequency of keyword Q and neighborhood word W in the document set
N: total number of words in the document set.
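Formula (1) itself is not reproduced in this text. The symbols defined above, together with the later references to mutual information and to the relevance "PMI", are consistent with pointwise mutual information, PMI(Q, W) = log( N · C(Q, W) / ( C(Q) · C(W) ) ). The sketch below assumes that standard form and a natural logarithm; the base of the logarithm, the function name, and the counts used are all assumptions, so the values need not match those in FIG. 6.

```python
import math


def pmi(c_q, c_w, c_qw, n):
    """Pointwise mutual information from corpus counts (assumed form).

    c_q  : appearance frequency C(Q) of the keyword
    c_w  : appearance frequency C(W) of the neighborhood word
    c_qw : co-occurrence frequency C(Q, W)
    n    : total number of words N in the document set
    """
    if min(c_q, c_w, c_qw, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log((c_qw * n) / (c_q * c_w))


# Hypothetical counts: the pair co-occurs far more often than chance predicts.
print(pmi(c_q=50, c_w=200, c_qw=10, n=1_000_000) > 0)  # True
```

Under this form the value is zero when the keyword and the word co-occur exactly as often as independence would predict, and it grows as the pair co-occurs more often than that.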

The items for which the degree of relevance is calculated may be restricted, among the words and named entities co-occurring with the keyword, to those meeting a pre-specified count condition based on co-occurrence frequency, to those whose residual IDF or similar measure exceeds a threshold, or to a combination of these restrictions.

The relevance table 8 stores the word or word string corresponding to the keyword and its appearance frequency over the entire set of analyzed search target documents, the neighborhood words of that word or word string and their appearance frequencies over the entire set, the co-occurrence frequency of the matching word or word string and each neighborhood word (only the frequency over the entire set is shown here), and the calculated degree of relevance between the keyword and each neighborhood word.

FIG. 6 shows an example of the relevance table 8, for the case where the degree of relevance between the keyword and each neighborhood word is calculated by formula (1). The relevance values in FIG. 6 are those obtained when the total number of words N in the entire set is 1,000,000.

The document score calculation unit 9 uses the degree of relevance between the keyword and its neighborhood words stored in the relevance table 8 to calculate the document score for the keyword in each analyzed search target document, and outputs it as an index to the keyword document index 10. Any document score formula may be used as long as it gives a high score to documents that have many neighborhood words strongly related to the keyword. The definitions of neighborhood and content word may be the same as or different from those used in calculating the degree of relevance; the case where they are the same is described below.

For example, when the mutual information above is used, the sum of the mutual information with the neighborhood words is obtained for each occurrence of the word or word string corresponding to the keyword in the analyzed search target document, and the total of these sums over all occurrences is taken as the document score of the search target document for the keyword. The document score may also be this value normalized by, for example, the total number of words in the document or the number of distinct content words. The content words used in calculating the document score may be limited to those selected from the top in descending order of relevance to the keyword.

An example of the formula for the document score of document D for keyword Q, using the degree of relevance PMI between the keyword and each co-occurring word, is shown below.

$$\mathrm{Score}(D, Q) = \frac{1}{N_D} \sum_i \mathrm{Co}(D, Q, W_i) \cdot \mathrm{PMI}(Q, W_i) \qquad (2)$$

where
D: document
Q: keyword
W_i: each word co-occurring with keyword Q
Co(D, Q, W_i): co-occurrence frequency in document D of word W_i with keyword Q
N_D: number of words in document D.

For example, consider the document score of the document 「横須賀百貨店の長谷川隆明店長ってほんとにいい人かもね」 for the keyword 「長谷川隆明」 above. If the words whose part of speech is "noun" or "noun: proper" within three morphemes before and after the keyword 「長谷川隆明」 are taken as neighborhood words, three neighborhood words are obtained, as shown in FIG. 3, and the co-occurrence frequency of each in this document is 1. As shown in FIG. 6, the degrees of relevance PMI between the keyword 「長谷川隆明」 and the neighborhood words 「横須賀」, 「百貨店」, and 「店長」 are 3.51, 3.22, and 6.21, respectively, and the document has 13 words, so by formula (2) the document score of this document for the keyword is calculated as
Score = (1 * 3.51 + 1 * 3.22 + 1 * 6.21) / 13
= 0.995
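The worked example can be checked with a short sketch. The function name `document_score` and the dictionary layout are hypothetical; the relevance values, co-occurrence counts, and word count are the ones quoted in the example above.

```python
def document_score(cooccurrence, relevance, n_words):
    """Sum of (co-occurrence count x relevance) over the neighborhood
    words, normalized by the number of words in the document."""
    total = sum(count * relevance[word] for word, count in cooccurrence.items())
    return total / n_words


# Values quoted in the description for the keyword 「長谷川隆明」.
relevance = {"横須賀": 3.51, "百貨店": 3.22, "店長": 6.21}
cooccurrence = {"横須賀": 1, "百貨店": 1, "店長": 1}

print(round(document_score(cooccurrence, relevance, n_words=13), 3))  # 0.995
```

Dividing by the word count keeps long documents from scoring highly merely because they contain more words.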

The document score may also be obtained from three component scores: a score based on the occurrence frequency of the word or word string corresponding to the keyword, normalized by the number of words in the document (Score1); the score described above, based on the mutual information of the words that co-occur with the word or word string corresponding to the keyword (Score2); and a score based on the mutual information of the named entities that co-occur with the word or word string corresponding to the keyword (Score3). After each score is normalized by an appropriate constant α, β, or γ, the linear sum
Score = α * Score1 + β * Score2 + γ * Score3
is computed, and documents containing the keyword may be scored according to this value.
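
A minimal sketch of this linear combination follows. The text does not give values for α, β, and γ, so the default weights and the sample component scores below are purely hypothetical.

```python
def combined_score(score1, score2, score3, alpha=1.0, beta=1.0, gamma=1.0):
    """Linear sum of the three component scores.

    score1: keyword-frequency score, score2: co-occurring-word score,
    score3: co-occurring-named-entity score. The weights are hypothetical
    placeholders, not values from the source.
    """
    return alpha * score1 + beta * score2 + gamma * score3


# Hypothetical component scores and weights, for illustration only.
example = combined_score(0.1, 0.995, 0.5, alpha=0.2, beta=0.5, gamma=0.3)
```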

All keywords in the keyword database 2 are acquired one after another by the keyword input unit 3, and for each keyword the series of processes in the keyword identification unit 4, the neighborhood word acquisition unit 5, the word frequency measurement unit 6, the relevance calculation unit 7, and the document score calculation unit 9 described above is performed. Document scores are thereby calculated for every combination of a keyword in the keyword database 2 and an analyzed search target document in the analyzed document database 1, and an index whose elements are the document scores for these keyword and document combinations is stored in the keyword document index 10.

FIG. 7 shows an example of the keyword document index. If the number of analyzed search target documents stored in the analyzed document database 1 is N and the number of keywords stored in the keyword database 2 is M, the index is stored as a matrix of document scores for keyword 1 through keyword M against document 1 through document N.
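
The N-by-M layout described here can be sketched in Python as a nested list with one row per document and one column per keyword. The function `build_keyword_document_index` and the stand-in scorer `toy_score` are hypothetical; a real scorer would use the degree-of-association based document score described earlier.

```python
def build_keyword_document_index(documents, keywords, score_fn):
    """Return an N x M matrix: row d holds document d's score for each keyword."""
    return [[score_fn(doc, kw) for kw in keywords] for doc in documents]


def toy_score(doc_tokens, keyword):
    # Hypothetical stand-in scorer: keyword frequency divided by document length.
    return doc_tokens.count(keyword) / len(doc_tokens)


documents = [["a", "b", "a", "c"], ["b", "c"]]  # N = 2 tokenized documents
keywords = ["a", "b"]                           # M = 2 keywords
index = build_keyword_document_index(documents, keywords, toy_score)
# index[d][k] is the score of keyword k in document d.
```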

The document search unit 11 ranks the search target documents according to the document scores stored in the keyword document index 10 and outputs the ranking. For example, when the query at search time is a single keyword, the search target documents may be ranked in descending order of the document score stored for that keyword. When the query consists of multiple keywords, the documents may be ranked in descending order of the sum of the document scores for the individual keywords. Alternatively, the similarity between the query and each document may be calculated using the keyword's document score and the keyword's inverse document frequency in place of the within-document term frequency and inverse document frequency used in the vector space model, and the documents ranked in descending order of similarity. When a keyword included in the query does not exist in the keyword document index, the existing vector space model may be applied to that keyword.
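
The single-keyword and sum-of-scores rankings can be sketched as follows. The index is assumed here to be a dictionary from keyword to a list of per-document scores, which is one hypothetical representation. Keywords missing from the index are simply skipped in this sketch; the vector space model fallback mentioned above is not implemented.

```python
def rank_documents(index, query_keywords):
    """Return document ids sorted by descending sum of stored document scores."""
    known = [kw for kw in query_keywords if kw in index]
    if not known:
        return []  # no indexed keyword; a fallback model would be needed here
    num_docs = len(index[known[0]])
    totals = [sum(index[kw][d] for kw in known) for d in range(num_docs)]
    return sorted(range(num_docs), key=lambda d: totals[d], reverse=True)


# Hypothetical scores for three documents and two keywords.
index = {"kw1": [0.2, 0.9, 0.5], "kw2": [0.7, 0.1, 0.6]}

single = rank_documents(index, ["kw1"])        # ranks by the kw1 column only
multi = rank_documents(index, ["kw1", "kw2"])  # ranks by the summed scores
```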

The present invention can also be realized by installing, on a well-known computer via a recording medium or a communication line, a program that implements the functions shown in the configuration diagram of FIG. 1.

FIG. 1: Block diagram showing an example of an embodiment of the index creation apparatus of the present invention
FIG. 2: Explanatory diagram showing an example of an analyzed search target document
FIG. 3: Explanatory diagram showing keyword identification and neighborhood word acquisition processing for an analyzed search target document
FIG. 4: Explanatory diagram showing an example of the co-occurrence relationship between a keyword and named entities
FIG. 5: Flowchart of the processing in the keyword identification unit and the neighborhood word acquisition unit
FIG. 6: Explanatory diagram showing an example of the relevance table
FIG. 7: Explanatory diagram showing an example of the keyword document index

Explanation of symbols

1: analyzed document database, 2: keyword database, 3: keyword input unit, 4: keyword identification unit, 5: neighborhood word acquisition unit, 6: word frequency measurement unit, 7: relevance calculation unit, 8: relevance table, 9: document score calculation unit, 10: keyword document index, 11: document search unit.

Claims

An index creation apparatus that creates an index for retrieving documents that match a query from a set of search target documents and ranking them in order of match, the apparatus comprising at least:
an analyzed document database that stores a set of search target documents that have undergone morphological analysis;
a keyword identification unit that identifies the positions at which a word or word string in each analyzed search target document matches a keyword given as a query;
a neighborhood word acquisition unit that extracts, from each analyzed search target document, content words located near the position of the word or word string matching the keyword as neighboring words, and counts the co-occurrence frequency of the word or word string matching the keyword and each neighboring word;
a word frequency measurement unit that counts the appearance frequency of the word or word string corresponding to the keyword in the entire set of analyzed search target documents, the appearance frequency of each neighboring word, and the total number of words;
a relevance calculation unit that calculates the degree of association between the keyword and each neighboring word from the co-occurrence frequency of the word or word string matching the keyword and the neighboring word in the entire set of analyzed search target documents, together with the appearance frequency of the word or word string corresponding to the keyword in the entire set, the appearance frequency of the neighboring word, and the total number of words; and
a document score calculation unit that calculates a score for the keyword in each analyzed search target document using the degree of association between the keyword and the neighboring words, and outputs the score as an index.

The index creation apparatus according to claim 1, further comprising
a keyword database that stores a set of keywords,
wherein the processing of the keyword identification unit, the neighborhood word acquisition unit, the word frequency measurement unit, the relevance calculation unit, and the document score calculation unit is repeated for every keyword included in the keyword database, thereby generating an index whose elements are the scores corresponding to the combinations of each keyword and each analyzed search target document.

An index creation method for creating an index for retrieving documents that match a query from a set of search target documents and ranking them in order of match,
the method using an analyzed document database that stores a set of search target documents that have undergone morphological analysis, and comprising at least:
a step in which a keyword identification unit identifies the positions at which a word or word string in each analyzed search target document matches a keyword given as a query;
a step in which a neighborhood word acquisition unit extracts, from each analyzed search target document, content words located near the position of the word or word string matching the keyword as neighboring words, and counts the co-occurrence frequency of the word or word string matching the keyword and each neighboring word;
a step in which a word frequency measurement unit counts the appearance frequency of the word or word string corresponding to the keyword in the entire set of analyzed search target documents, the appearance frequency of each neighboring word, and the total number of words;
a step in which a relevance calculation unit calculates the degree of association between the keyword and each neighboring word from the co-occurrence frequency of the word or word string matching the keyword and the neighboring word in the entire set of analyzed search target documents, together with the appearance frequency of the word or word string corresponding to the keyword in the entire set, the appearance frequency of the neighboring word, and the total number of words; and
a step in which a document score calculation unit calculates a score for the keyword in each analyzed search target document using the degree of association between the keyword and the neighboring words, and outputs the score as an index.

The index creation method according to claim 3, further
using a keyword database that stores a set of keywords,
wherein the keyword identification step, the neighborhood word acquisition step, the word frequency measurement step, the relevance calculation step, and the document score calculation step are repeated for every keyword included in the keyword database, thereby generating an index whose elements are the scores corresponding to the combinations of each keyword and each analyzed search target document.

A program for causing a computer to function as each means of the index creation apparatus according to claim 1 or 2.

A computer-readable recording medium on which the program according to claim 5 is recorded.