JP5601121B2

JP5601121B2 - Inverted index generation method and generation apparatus for N-gram search, search method and search apparatus using the inverted index, and computer program

Info

Publication number: JP5601121B2
Application number: JP2010215611A
Authority: JP
Inventors: 倫治山口
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2010-09-27
Filing date: 2010-09-27
Publication date: 2014-10-08
Anticipated expiration: 2030-09-27
Also published as: JP2012069071A

Description

The present invention relates to N-gram search, and in particular to a method and apparatus for generating an inverted index for N-gram search, a search method and search apparatus that use the inverted index, and a computer program.

As more documents are digitized, search technology for finding a desired document in a large accumulated collection of documents is becoming increasingly important.

In many languages, such as English, it is common to build an index file with words as the index unit and to use it for high-speed search processing. In Japanese, however, word boundaries are not marked explicitly by spaces or the like, so N-grams are often used as the index unit instead.

An N-gram is a substring made up of N consecutive characters. Because an N-gram index file (hereinafter called an inverted index) is built from the character string alone, no word recognition is needed. However, a search term is split into multiple N-grams for processing, so search time increases when searching with a long search term.
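The definition of an N-gram can be illustrated with a minimal Python sketch. The function name `extract_ngrams` is hypothetical and does not come from the patent; the sketch assumes that one Python character (code point) corresponds to one character of the text.

```python
def extract_ngrams(text: str, n: int) -> list[str]:
    """Return every substring of n consecutive characters, in order of appearance."""
    if n < 1:
        raise ValueError("n must be a natural number (>= 1)")
    # A text of length L yields L - n + 1 N-grams; a shorter text yields none.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Splitting a six-character headword into 2-grams (bigrams) gives five N-grams.
bigrams = extract_ngrams("ダイヤグラム", 2)
```

No word boundaries are consulted anywhere, which is why the approach suits text without explicit word separators. A longer search term produces more N-grams, which is the cost the description refers to.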

To address this problem, Non-Patent Document 1 discloses a technique for speeding up search processing. Specifically, Non-Patent Document 1 calculates the sum of the document frequencies of N-grams as an estimate for speeding up processing and uses it to select the N-grams actually used in the document search, thereby making the search faster.

Yasushi Ogawa (小川泰嗣) and Toru Matsuda (松田透), "Efficient Document Retrieval Method Using n-gram Index", IEICE Transactions (D-I), Vol. J82-D-I, No. 1, pp. 121-129, January 1999.

In search processing that uses N-grams, there is a demand for achieving higher speed with simpler processing. However, the configuration disclosed in Patent Document 1 has the problem that search time grows as the amount of data in the words and documents to be searched increases. In environments with limited resources, such as small electronic devices like mobile phones and electronic dictionaries, device performance is constrained, so searches may take a long time. A new method that achieves high-speed search by processing searches efficiently is therefore needed.

The present invention has been made to solve the above problems. Its object is to provide an inverted index generation method and generation apparatus suited to efficiently narrowing down the words and the like to be searched, a search method and search apparatus using the inverted index, and a computer program.

To achieve the above object, an inverted index generation method according to a first aspect of the present invention is
a method for generating an inverted index in an information processing apparatus, wherein
the information processing apparatus has storage means that stores a plurality of document data, each composed of a headword and a corresponding explanatory text, and control means that generates an inverted index for the plurality of document data stored in the storage means, and
the control means executes:
an order changing step of counting, for each of the plurality of document data stored in the storage means, the number of characters in the headword and the explanatory text, and rearranging the plurality of document data in ascending order of character count; and
a generation step of generating an inverted index by associating each N-gram, that is, each character string of N characters (N being a natural number), with its appearance positions in the rearranged plurality of document data.

Preferably, the control means further executes
an association step of extracting document data that contains the character strings of the headword and explanatory text of each of the rearranged plurality of document data, and associating, as an inclusion relation, the extracted document data with the document data whose character strings are contained in it, and
in the generation step executed by the control means, the inclusion relation is further stored when the inverted index is generated.

Preferably, in the generation step executed by the control means, the correspondence between the rearranged plurality of document data and the plurality of document data before rearrangement is further stored when the inverted index is generated.
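One way to picture the stored correspondence between the rearranged order and the original order is as a permutation table. The following Python sketch is illustrative only: the function name `reorder_with_mapping`, the use of a plain list as the table, and treating each document as a single string are assumptions, not details given in the patent.

```python
def reorder_with_mapping(texts: list[str]) -> tuple[list[str], list[int]]:
    """Sort documents by ascending character count and remember where each came from.

    Returns (reordered, original_position), where original_position[new] is the
    index the document at position `new` had before rearrangement.
    """
    # sorted() is stable, so documents of equal length keep their relative order.
    original_position = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    reordered = [texts[i] for i in original_position]
    return reordered, original_position


texts = ["diagram", "dia", "memory"]
reordered, original_position = reorder_with_mapping(texts)
```

With such a table, a hit found at a position in the rearranged data can be reported as a document of the original collection by a single list lookup.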

To achieve the above object, a search method according to a second aspect of the present invention is
a search method in an information processing apparatus having:
replacement document data storage means that stores the plurality of document data rearranged in the order changing step of the above inverted index generation method;
inverted index storage means that stores the inverted index generated in the generation step of the above inverted index generation method; and
control means that searches the document data using the inverted index, wherein
the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document identification step of acquiring, from the inverted index, the appearance positions associated with the N-grams extracted in the N-gram extraction step and, based on those appearance positions, identifying document data containing the search character string from among the rearranged plurality of document data.

To achieve the above object, a search method according to a third aspect of the present invention is
a search method in an information processing apparatus having:
replacement document data storage means that stores the plurality of document data rearranged in the order changing step of the above inverted index generation method;
inverted index storage means that stores the inverted index generated in the generation step of the above inverted index generation method; and
control means that searches the document data using the inverted index, wherein
the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document identification step of acquiring, from the inverted index, the appearance positions associated with the N-grams extracted in the N-gram extraction step, identifying, based on those appearance positions, document data containing the search character string from among the rearranged plurality of document data, and further identifying, based on the stored inclusion relation, the document data associated with the identified document data.

To achieve the above object, a search method according to a fourth aspect of the present invention is
a search method in an information processing apparatus having:
replacement document data storage means that stores the plurality of document data rearranged in the order changing step of the above inverted index generation method;
inverted index storage means that stores the inverted index generated in the generation step of the above inverted index generation method; and
control means that searches the document data using the inverted index, wherein
the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document identification step of acquiring, from the inverted index, the appearance positions associated with the N-grams extracted in the N-gram extraction step, identifying, based on those appearance positions and on the correspondence between the rearranged plurality of document data and the plurality of document data before rearrangement, document data containing the search character string from among the plurality of document data before rearrangement, and further identifying, based on the stored inclusion relation, the document data associated with the identified document data.

To achieve the above object, an inverted index generation apparatus according to a fifth aspect of the present invention comprises:
order changing means that counts, for each of a plurality of document data each composed of a headword and a corresponding explanatory text, the number of characters in the headword and the explanatory text, and rearranges the plurality of document data in ascending order of character count; and
generation means that generates an inverted index by associating each N-gram, that is, each character string of N characters (N being a natural number), with its appearance positions in the rearranged plurality of document data.

To achieve the above object, a search apparatus according to a sixth aspect of the present invention comprises:
N-gram extraction means that extracts N-grams from a search character string; and
document identification means that acquires, from the inverted index generated by the above generation method, the appearance positions associated with the N-grams extracted in the N-gram extraction step and, based on those appearance positions, identifies document data containing the search character string from among the rearranged plurality of document data.

To achieve the above object, a computer program according to a seventh aspect of the present invention causes a computer to function as:
order changing means that counts, for each of a plurality of document data each composed of a headword and a corresponding explanatory text, the number of characters in the headword and the explanatory text, and rearranges the plurality of document data in ascending order of character count; and
generation means that generates an inverted index by associating each N-gram, that is, each character string of N characters (N being a natural number), with its appearance positions in the rearranged plurality of document data.

To achieve the above object, a computer program according to an eighth aspect of the present invention causes a computer to function as:
N-gram extraction means that extracts N-grams from a search character string; and
document identification means that acquires, from the inverted index generated by the above generation method, the appearance positions associated with the N-grams extracted in the N-gram extraction step and, based on those appearance positions, identifies document data containing the search character string from among the rearranged plurality of document data.

According to the present invention, it is possible to provide an inverted index generation method and generation apparatus suited to efficiently narrowing down the words and the like to be searched, a search method and search apparatus using the inverted index, and a computer program.

A diagram showing an example of the schematic configuration of a generation apparatus that generates an inverted index according to the present invention.
A diagram showing an example of the schematic configuration of a search apparatus equipped with the inverted index according to the present invention.
A flowchart showing the flow of the inverted index generation process.
A diagram showing an example of document data rearranged in ascending order of character count.
A diagram showing an example of document data in which replacement document data with matching character strings are associated with each other.
A diagram showing a specific structure of the inverted index.
A flowchart showing the flow of the search process of the search apparatus.
A diagram showing an example of document data to which position pointers and number pointers have been assigned.
A diagram showing an example of document data in which replacement document data with matching character strings are associated in multiple layers.
A diagram showing another example of the schematic configuration of a generation apparatus that generates an inverted index according to the present invention.
A diagram showing another example of the schematic configuration of a search apparatus equipped with the inverted index according to the present invention.

An inverted index generation method and generation apparatus according to an embodiment of the present invention, and a search method and search apparatus using the inverted index, are described below. The embodiment described below is for illustration and does not limit the scope of the present invention.

In this embodiment, a computer is configured as the inverted index generation apparatus shown in FIG. 1. The generation apparatus 10 shown in FIG. 1 implements the inverted index generation method according to this embodiment.

The generation apparatus 10 comprises a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, an HDD (Hard Disk Drive) 14, an input device 15, an output device 16, and a communication control device 17. These components are interconnected by a system bus, which is a transmission path for transferring commands and data.

The CPU 11 controls the overall operation of the generation apparatus 10; it is connected to each component and exchanges control signals and data with it.
The ROM 12 stores the computer programs and various data needed to control the overall operation of the generation apparatus 10. The CPU 11 operates according to the computer programs stored in the ROM 12 and performs various kinds of control.
The RAM 13 temporarily stores data and computer programs; it holds the computer programs and data read from the ROM 12 and other data needed as processing proceeds.
The HDD 14 stores data and the like needed for the inverted index generation process. A plurality of document data 18 is stored on the HDD 14, and the generation apparatus 10 generates the inverted index from this plurality of document data 18.
The input device 15 consists of, for example, a keyboard or a touch panel and accepts various inputs from the user.
The output device 16 consists of, for example, a display and outputs the various processing results of the generation apparatus 10.
The communication control device 17 connects the generation apparatus 10 to a computer communication network such as the Internet and is needed when exchanging data over such a network.

In this embodiment, the generation apparatus 10 includes order changing means, association means, and generation means. These are realized by the CPU 11 described above working together with the ROM 12 and the RAM 13, accessing the data stored on the HDD 14, and communicating with the outside through the input device 15, the output device 16, and the communication control device 17.

Specifically, the order changing means of the generation apparatus 10 counts, for each of the plurality of document data 18, each composed of a headword and a corresponding explanatory text, the number of characters in the headword and the explanatory text, and rearranges the plurality of document data 18 in ascending order of character count.

Here, a headword is a word or document to be searched for, and an explanatory text is a text that explains the meaning of the headword. For example, when the headword is the word "記憶" ("memory"), the explanatory text might be "(1) Remembering things without forgetting them, or keeping them in mind. (2) The persistence of past influences in a living organism. (3) Retaining the content of past experiences and recalling it later." The headword and the explanatory text are associated with each other and stored as one piece of document data 18, for example on the HDD 14.

Using the functions of the CPU 11 and the like, the order changing means counts the number of characters in the character string of each piece of document data 18 composed of such a headword and explanatory text. It then rearranges the plurality of document data 18 stored on the HDD 14 in ascending order of character count and stores the resulting plurality of replacement document data 19 in, for example, the RAM 13.
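The order changing step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Document` class and the `reorder_by_length` function are hypothetical names, and the character count is taken to be the length of the headword plus the length of the explanatory text, which is one reading of the description. The sample explanatory texts are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One piece of document data: a headword and its explanatory text."""
    headword: str
    explanation: str

    def char_count(self) -> int:
        # Assumption: the count covers both the headword and the explanatory text.
        return len(self.headword) + len(self.explanation)


def reorder_by_length(documents: list[Document]) -> list[Document]:
    """Return the documents in ascending order of character count (stable sort)."""
    return sorted(documents, key=Document.char_count)


documents = [
    Document("diagram", "a timetable for trains"),  # 7 + 22 = 29 characters
    Document("dia", "dia"),                         # 3 + 3 = 6 characters
    Document("memory", "keeping in mind"),          # 6 + 15 = 21 characters
]
reordered = reorder_by_length(documents)
```

The input list is left untouched, which mirrors the description: the original document data 18 stays where it is stored, and the rearranged copy is held separately as the replacement document data 19.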

The association means of the generation apparatus 10 extracts the replacement document data 19 that contains the character strings of the headword and explanatory text of each of the rearranged plurality of replacement document data 19, and associates, as an inclusion relation, the extracted replacement document data 19 with the replacement document data 19 whose character strings are contained in it.

Using the functions of the CPU 11 and the like, the association means extracts the replacement document data 19 that contain the character string of a given headword or explanatory text represented by a piece of replacement document data 19. As a concrete example, suppose the plurality of replacement document data 19 consists of three headwords and the three explanatory texts corresponding to them: document data A (headword "ダイヤ" (daiya); explanatory text "ダイヤ"), document data B (headword "ダイヤグラム" (daiyaguramu, "diagram"); explanatory text "列車運行表、また、バスや航空機などの運行予定、ダイヤ、列車ダイヤ", roughly "train timetable; also the operating schedule of buses, aircraft, and the like; daiya; train daiya"), and document data C (headword "記憶" ("memory"); explanatory text "物事を忘れずに覚えている、または覚えておくこと", roughly "remembering things without forgetting them, or keeping them in mind"). Replacement document data A contains the character string "ダイヤ", and replacement document data B also contains it, within "ダイヤグラム", "ダイヤ", and "列車ダイヤ", so replacement document data B includes replacement document data A. Replacement document data C, by contrast, does not contain the character string "ダイヤ", so it does not include replacement document data A or B.

In other words, the association means identifies the replacement document data 19 that contain a character string by determining whether any other headword or explanatory text contains a character string that matches the character string of a given headword or explanatory text. When such replacement document data 19 exist, the association means associates the pieces of replacement document data 19 with each other and stores the resulting inclusion relation in, for example, the RAM 13.
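The inclusion relation for documents A, B, and C of the example can be sketched in Python as below. The patent does not fix which string is matched or how the relation is stored; this sketch assumes that a document includes another when the other's headword occurs as a substring of its own headword or explanatory text, and it stores the relation as a dictionary. The function name `build_inclusion_map` is hypothetical.

```python
def build_inclusion_map(docs: list[tuple[str, str]]) -> dict[int, list[int]]:
    """Map each document index to the indices of the documents that include it.

    Each document is a (headword, explanation) pair. Assumption: document j
    includes document i when i's headword occurs in j's headword or explanation.
    """
    inclusion: dict[int, list[int]] = {}
    for i, (headword, _) in enumerate(docs):
        containers = [
            j
            for j, (other_headword, other_explanation) in enumerate(docs)
            if j != i and (headword in other_headword or headword in other_explanation)
        ]
        if containers:
            inclusion[i] = containers
    return inclusion


docs = [
    ("ダイヤ", "ダイヤ"),                                                   # document A
    ("ダイヤグラム", "列車運行表、また、バスや航空機などの運行予定、ダイヤ、列車ダイヤ"),  # document B
    ("記憶", "物事を忘れずに覚えている、または覚えておくこと"),                  # document C
]
inclusion = build_inclusion_map(docs)
```

For this input only document A (index 0) is included by another document, namely B (index 1), which matches the relation described in the text. The pairwise scan is quadratic in the number of documents; it is chosen here for clarity, not efficiency.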

The generation means of the generation apparatus 10 generates the inverted index by associating each N-gram, that is, each character string of N characters (N being a natural number), with its appearance positions in the plurality of replacement document data 19, and by additionally storing the inclusion relation established by the association means. The generated inverted index is output to the HDD 14, or via the output device 16 or the communication control device 17.

Specifically, when one piece of document data consists of a character string of N_doc characters, the generation means extracts N_doc - N + 1 N-grams (strings of N characters) from it. It extracts N-grams from the other document data in the same way and generates an inverted index that records, for each distinct N-gram pattern, every position at which it appears. The generation means likewise extracts N-grams from the document data associated by the association means and generates an inverted index that records the appearance positions of each distinct N-gram pattern. The generated inverted index is stored, for example, on the HDD 14.
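The generation step can be sketched in Python as follows. The representation is an assumption made for illustration: each document is treated as a single string, and an appearance position is recorded as a pair (document number, character offset). The function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(texts: list[str], n: int) -> dict[str, list[tuple[int, int]]]:
    """Map each N-gram to the list of (document number, character offset) pairs
    at which it appears. `texts` is assumed to be already rearranged by length."""
    index: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for doc_id, text in enumerate(texts):
        # A document of N_doc characters contributes N_doc - n + 1 N-grams.
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].append((doc_id, offset))
    return dict(index)


index = build_inverted_index(["abc", "bcab"], 2)
total_postings = sum(len(positions) for positions in index.values())
```

Here the first document (3 characters) contributes 3 - 2 + 1 = 2 entries and the second (4 characters) contributes 4 - 2 + 1 = 3, so the index holds 5 positions spread over the three distinct patterns "ab", "bc", and "ca".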

The inverted index generated by the generation apparatus 10 in this way is installed in a search apparatus and used for search processing. In this embodiment, a computer is configured as the search apparatus shown in FIG. 2. The search apparatus 20 shown in FIG. 2 implements the search method using the inverted index according to this embodiment.

The search apparatus 20 comprises a CPU 21, a ROM 22, a RAM 23, an HDD 24, an input device 25, an output device 26, and a communication control device 27. These components are interconnected by a system bus, which is a transmission path for transferring commands and data.

These components are basically equivalent to the components of the generation apparatus 10 shown in FIG. 1. That is, the components that in FIG. 1 serve to generate the inverted index from the document data 18 here serve to perform search processing using the generated inverted index.

That is, the CPU 21 controls the overall operation of the search apparatus 20; it is connected to each component and exchanges control signals and data with it.
The ROM 22 stores the computer programs and various data needed to control the overall operation of the search apparatus 20. The CPU 21 operates according to the computer programs stored in the ROM 22 and performs various kinds of control.
The RAM 23 temporarily stores data and computer programs; it holds the computer programs and data read from the ROM 22 and other data needed as processing proceeds.
The HDD 24 stores data and the like needed for the search process. The HDD 24 stores the inverted index 30 generated by the generation apparatus 10 and the plurality of replacement document data 19 obtained at that time by rearranging the plurality of document data 18. Based on the inverted index 30, the search apparatus 20 identifies which of the plurality of replacement document data 19 contains the search character string specified by the user.
The input device 25 consists of, for example, a keyboard or a touch panel and accepts various inputs from the user.
The output device 26 consists of, for example, a display and outputs the various processing results of the search apparatus 20.
The communication control device 27 connects the search apparatus 20 to a computer communication network such as the Internet and is needed when exchanging data over such a network.

In this embodiment, the search apparatus 20 includes N-gram extraction means and document identification means. These are realized by the CPU 21 described above working together with the ROM 22 and the RAM 23, accessing the data stored on the HDD 24, and communicating with the outside through the input device 25, the output device 26, and the communication control device 27.

Specifically, the N-gram extraction means of the search device 20 extracts N-grams from the search character string. For example, the input device 25 of the search device 20 receives the search character string input by the user. The N-gram extraction means then uses the CPU 21 and other components of the search device 20 to extract every N-gram that can be extracted from the search character string. That is, when the user inputs a search character string of M characters, the N-gram extraction means extracts all N-grams (strings of N characters) contained in it. When M is larger than N, M − N + 1 N-grams are extracted.

The document specifying means of the search device 20 then acquires, from the transposed index 30, the appearance positions associated with the N-grams extracted by the N-gram extraction means and, based on those appearance positions, specifies the replacement document data 19 that includes the search character string from among the plurality of replacement document data 19. The document specifying means further specifies the replacement document data 19 associated with the replacement document data 19 thus specified. The specified replacement document data 19 is output via the output device 26 or the communication control device 27.

That is, because the transposed index 30 generated by the generation device 10 associates each document data with the document data that contain its character string, once the document specifying means specifies one replacement document data 19 among the plurality of replacement document data 19, it also specifies the replacement document data 19 associated with it.

The flow of processing in the generation device 10 and the search device 20 configured as described above is now explained in detail. First, the generation process for the transposed index 30 is described with reference to the flowchart of FIG. 3.

The generation device 10 starts the generation process for the transposed index 30 when it receives a start instruction from the user, for example via the input device 15. When the process starts, the order changing means of the generation device 10, through the function of the CPU 11, counts the number of characters in the character string of each of the plurality of document data 18 stored, for example, in the HDD 14 (step S101). Here, each document data 18 consists of a headword and an explanatory text for that headword. Because the order changing means reorders the plurality of document data 18 by character count, it counts the characters of both the headword string and the explanatory text string. For example, when the headword has 5 characters and the explanatory text has 15 characters, the order changing means counts the document data 18 as having 20 (= 5 + 15) characters. Counting is performed for each of the plurality of document data 18, which yields a character count for every document data 18.

Next, the order changing means of the generation device 10, through the function of the CPU 11, reorders the plurality of document data 18 in ascending order of the character counts obtained in step S101 (step S102). Specifically, as shown in FIG. 4, the plurality of document data 18 are initially arranged in headword order (headword 1, headword 2, headword 3, and so on), regardless of their character counts. For example, the document data 18 of headword 1 (30 characters) is positioned ahead of the document data 18 of headword 2 (15 characters), even though it has more characters.

In the present embodiment, the order changing means starts from this state and sorts the plurality of document data 18 in ascending order of character count to obtain the plurality of replacement document data 19. Specifically, among the plurality of document data 18, the document data 18 of headword Z, which has the fewest characters, is placed at the head, and the document data 18 of headword 3, which has the most characters, is placed at the end. The document data 18 of headword 1, which was initially at the head, has the fifth-smallest character count among the plurality of document data 18 and is therefore placed fifth in the plurality of replacement document data 19.

In this way, the order changing means generates the replacement document data 19 by reordering the plurality of document data 18, whose character counts have been obtained, in ascending order of character count. The order changing means then stores the replacement document data 19 in the RAM 13 or the HDD 14.
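Steps S101 and S102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `Document` class, the function name, and the sample entries are hypothetical, and the tie-breaking rule for documents of equal length (original order preserved) is an assumption, since the description does not specify one.

```python
from dataclasses import dataclass


@dataclass
class Document:
    headword: str
    explanation: str

    def char_count(self) -> int:
        # Step S101: headword length plus explanatory-text length.
        return len(self.headword) + len(self.explanation)


def reorder_by_length(documents):
    # Step S102: ascending order of character count. sorted() is stable,
    # so documents of equal length keep their original relative order
    # (an assumption; the description does not say how ties are broken).
    return sorted(documents, key=Document.char_count)


docs = [
    Document("apple", "x" * 25),  # 30 characters
    Document("fig", "y" * 12),    # 15 characters
    Document("kiwi", "z" * 16),   # 20 characters
]
reordered = reorder_by_length(docs)
print([d.headword for d in reordered])  # ['fig', 'kiwi', 'apple']
```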

Next, the associating means of the generation device 10, through the function of the CPU 11, focuses on the first replacement document data 19 (step S103). The associating means then determines whether the plurality of replacement document data 19 include any other replacement document data 19 that contains the character strings of the headword and the explanatory text constituting the focused replacement document data 19 (step S104). For example, when the headword of the focused replacement document data 19 is the character string "AB" and the corresponding explanatory text is the character string "CDE", the associating means determines whether there is any other replacement document data 19 that contains both the character string "AB" and the character string "CDE". Replacement document data 19 containing such character strings include, for example, replacement document data 19 whose headword or explanatory text contains the character string "ABCDE", "ABXYZCDE", or "CDEAB".

When it is determined that the plurality of replacement document data 19 include one or more that contain the character strings of the focused replacement document data 19 (step S104; YES), the associating means associates the focused replacement document data 19 with the replacement document data 19 that contain its character strings, and holds the resulting inclusion relationship (step S105). Specifically, as shown in FIG. 5, for example, the replacement document data 19a having headword 20 is associated with the four replacement document data 19c to 19f, and the replacement document data 19b having headword 57 is associated with the two replacement document data 19g and 19h. The associating means holds the inclusion relationships thus established in the RAM 13 or the like.

On the other hand, when it is not determined that the plurality of replacement document data 19 include any that contain the character strings of the focused replacement document data 19 (step S104; NO), the processing of the generation device 10 skips step S105 described above. That is, no inclusion relationship is held.

The associating means of the generation device 10 then determines whether there is a next replacement document data 19 (step S106). In other words, it determines whether the currently focused replacement document data 19 is the last one. If there is a next replacement document data 19 (step S106; YES), the associating means focuses on that next replacement document data 19 (step S107), and the process returns to step S104.

The processing of steps S104 to S107 is performed for each replacement document data 19. Until no next replacement document data 19 remains, the associating means determines which replacement document data 19 contain the character strings of the focused replacement document data 19 and, if any do, holds the inclusion relationship.

Here, because the plurality of replacement document data 19 have been reordered in ascending order of character count, the replacement document data 19 focused on one after another in steps S104 to S107 have more characters than the replacement document data 19 focused on in step S103. Consequently, simply scanning the plurality of replacement document data 19 in order from the head is enough to identify the other replacement document data 19 that contain the character strings of the replacement document data 19 focused on in step S103.
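The loop of steps S103 to S107 can be sketched as below. This is a sketch under stated assumptions, not the patent's implementation: the function name is hypothetical, each document is a `(headword, explanation)` tuple already sorted by length, and a document is treated as "containing" another when both the other's headword and its explanation occur as substrings of the headword or of the explanation (the newline separator prevents matches that straddle the two fields).

```python
def build_inclusion_relations(docs):
    """docs: list of (headword, explanation) tuples sorted by total length.

    Returns {index: [indices of later documents that contain it]}.
    """
    relations = {}
    for i, (head, expl) in enumerate(docs):      # S103 / S107: focus on each document
        containing = []
        # Because docs are sorted by length, only later documents can
        # contain the focused one, so the scan starts at i + 1.
        for j in range(i + 1, len(docs)):
            text = docs[j][0] + "\n" + docs[j][1]
            if head in text and expl in text:    # S104: inclusion test
                containing.append(j)
        if containing:                           # S105: hold the relationship
            relations[i] = containing
    return relations


docs = [
    ("AB", "CDE"),
    ("Q", "ABCDE"),
    ("XY", "CDEAB!"),
    ("ZZ", "no match here"),
]
print(build_inclusion_relations(docs))  # {0: [1, 2]}
```

The nested scan is quadratic in the number of documents; the description does not discuss this cost, and the sketch makes no attempt to reduce it.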

Next, the generation means of the generation device 10, through the function of the CPU 11, generates the transposed index 30 (step S108). For each N-gram extracted from the replacement document data 19, which are stored in ascending order of character count, the transposed index 30 records the appearance positions of that N-gram in the replacement document data 19, and it additionally stores the inclusion relationships held in step S105. The generated transposed index 30 is stored in the HDD 14 or output via the output device 16 or the communication control device 17. The generation process then ends.

The specific structure of the transposed index 30 according to the present embodiment is described below with reference to FIG. 6. As shown in the figure, the transposed index 30 consists of a file listing the N-gram character string patterns and the storage addresses of their appearance position information (pattern.idx), a file listing the appearance positions for each N-gram character string pattern (position.idx), a file listing the document numbers and the first character position of each document (number.idx), and a file describing the inclusion relationships (relation.idx).

Here, an appearance position is a position measured from the first character of the text obtained by concatenating the documents to be searched in document number order. Likewise, the first character position of each document number in the figure is measured from the first character of that same concatenated text.

The file describing the inclusion relationships (relation.idx) records the inclusion relationships held in step S105 of the generation process described above. Specifically, in FIG. 5 the replacement document data 19a of headword 20 was associated with the replacement document data 19c to 19f of headwords 45, 85, 456, and 775, so in FIG. 6 the four containing document numbers 45, 85, 456, and 775 are associated with document number 20 (headword 20). Similarly, the two containing document numbers 203 and 360 are associated with document number 57 (headword 57).
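As a rough illustration of how the four parts fit together, the sketch below builds in-memory stand-ins for pattern.idx, position.idx, number.idx, and relation.idx. The on-disk layout of the real files is not given in the description, so the dictionary and list shapes, the function name, 0-based document numbers, and the choice to extract N-grams per document (so that none straddles a document boundary) are all assumptions.

```python
def build_index(texts, n, relations):
    """texts: document strings in document-number order; relations: {doc: [docs]}."""
    # number.idx stand-in: document number -> first-character position
    # in the concatenated text.
    number = {}
    offset = 0
    for doc_no, text in enumerate(texts):
        number[doc_no] = offset
        offset += len(text)

    # pattern.idx stand-in: N-gram -> "address" (here a list index).
    # position.idx stand-in: list of appearance-position lists.
    pattern, position = {}, []
    for doc_no, text in enumerate(texts):
        for k in range(len(text) - n + 1):
            gram = text[k:k + n]
            if gram not in pattern:
                pattern[gram] = len(position)
                position.append([])
            position[pattern[gram]].append(number[doc_no] + k)

    return {"pattern": pattern, "position": position,
            "number": number, "relation": relations}


index = build_index(["ABC", "BCD"], n=2, relations={0: [1]})
print(index["pattern"])   # {'AB': 0, 'BC': 1, 'CD': 2}
print(index["position"])  # [[0], [1, 3], [4]]
print(index["number"])    # {0: 0, 1: 3}
```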

The transposed index 30 created in step S108 is used in the search process performed by the search device 20, described later.

Through the above processing, the generation device 10 for the transposed index 30 in the present embodiment creates the replacement document data 19 by reordering the plurality of document data 18 in ascending order of character count, and generates the transposed index 30 by associating each N-gram in the replacement document data 19 with its appearance positions in the replacement document data 19. It also associates replacement document data 19 whose character strings stand in an inclusion relationship and stores that relationship in the transposed index 30. Because replacement document data 19 in an inclusion relationship are associated (linked) with one another, once a replacement document data 19 containing the search character string is specified, the replacement document data 19 linked to it are specified as well. This makes the search process described later more efficient.

Next, the search process performed by the search device 20 according to the present embodiment is described. FIG. 7 is a flowchart showing the flow of the search process.

First, when the processing of the search device 20 starts, the input device 25 of the search device 20, for example, accepts a search character string from the user (step S201).

Next, the N-gram extraction means, through the function of the CPU 21, extracts N-grams from the search character string accepted in step S201 (step S202). The value of N is predetermined in the search device 20 and may be N = 2, N = 3, or any other natural number. The explanation below uses N = 2 or N = 3 as needed.

As a concrete example, suppose the user inputs the nine-character search character string "高速化全文検索処理" ("accelerated full-text search processing"). For a search with N = 2, the extracted N-grams (bigrams) are, in order from the front, "高速", "速化", "化全", "全文", "文検", "検索", "索処", and "処理", for a total of 8 (= 9 − 2 + 1). For a search with N = 3, the extracted N-grams (trigrams) are, in order from the front, "高速化", "速化全", "化全文", "全文検", "文検索", "検索処", and "索処理", for a total of 7 (= 9 − 3 + 1).
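The extraction in step S202 amounts to sliding a window of N characters over the search string. A minimal sketch (the function name is hypothetical) that reproduces the counts in the example above:

```python
def extract_ngrams(query, n):
    # A string of M characters yields M - N + 1 N-grams when M >= N,
    # and none when M < N (the range is then empty).
    return [query[i:i + n] for i in range(len(query) - n + 1)]


query = "高速化全文検索処理"  # 9 characters
bigrams = extract_ngrams(query, 2)
trigrams = extract_ngrams(query, 3)
print(len(bigrams), bigrams[0], bigrams[-1])     # 8 高速 処理
print(len(trigrams), trigrams[0], trigrams[-1])  # 7 高速化 索処理
```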

Next, the document specifying means, through the function of the CPU 21, focuses on the first replacement document data 19 (step S203). The document specifying means then determines whether the focused replacement document data 19 includes the search character string (step S204). The transposed index 30 is used to make this determination. Specifically, for each N-gram extracted from the search character string in step S202, the appearance positions associated with that N-gram are acquired from the transposed index 30.

In step S204, the document specifying means determines whether the acquired N-gram appearance positions include a run of consecutive positions that spells out the search character string, and thereby judges whether the search character string is included in the focused replacement document data 19. When the focused replacement document data 19 includes the search character string (step S204; YES), the document specifying means temporarily holds that replacement document data 19 in the RAM 23 or the like (step S205) so that it can be output to the user in later processing.
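One common way to test for such a run of consecutive positions is to intersect the position sets after shifting each by the N-gram's offset within the query. The sketch below takes that approach as an assumption; the description does not state how the check is carried out. The function names and the `postings` shape (`{ngram: set of positions}`) are hypothetical.

```python
def occurrences(query, n, postings):
    """Return the sorted start positions at which query occurs."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        return []
    starts = set(postings.get(grams[0], ()))
    for offset, gram in enumerate(grams[1:], start=1):
        # The gram at offset k must appear exactly k characters after the start.
        starts &= {p - offset for p in postings.get(gram, ())}
    return sorted(starts)


def document_contains(query, n, postings, doc_start, doc_end):
    # The match must lie entirely inside [doc_start, doc_end).
    return any(doc_start <= p and p + len(query) <= doc_end
               for p in occurrences(query, n, postings))


# Two documents laid end to end: "ABCD" at positions 0-3, "BCDE" at 4-7.
postings = {"AB": {0}, "BC": {1, 4}, "CD": {2, 5}, "DE": {6}}
print(occurrences("BCD", 2, postings))              # [1, 4]
print(document_contains("CDE", 2, postings, 0, 4))  # False
print(document_contains("CDE", 2, postings, 4, 8))  # True
```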

On the other hand, when the focused replacement document data 19 does not include the search character string (step S204; NO), the document specifying means determines whether there is a replacement document data 19 following the focused one (step S208). Because the plurality of replacement document data 19 have been reordered in ascending order of character count, the document specifying means focuses on the replacement document data 19 in order of gradually increasing character count. In this step it determines whether the currently focused replacement document data 19 is the last one.

Next, the document specifying means, through the function of the CPU 21, determines whether any replacement document data 19 is associated with the replacement document data 19 determined to include the search character string (step S206).

In the file describing the inclusion relationships (relation.idx) of the transposed index 30, each document number is associated with the numbers of the documents that contain its character string. The document specifying means therefore determines whether any containing document number is associated with the document number of the replacement document data 19 determined to include the search character string. If there is one, the document specifying means identifies the replacement document data 19 bearing that containing document number as document data associated with the replacement document data 19 determined to include the search character string.

When there is associated replacement document data 19 (step S206; YES), the document specifying means temporarily holds, in the RAM 23 or the like, all the replacement document data 19 associated with the replacement document data 19 determined in step S204 to include the search character string (step S207).

On the other hand, when there is no associated replacement document data 19 (step S206; NO), the document specifying means determines whether there is a replacement document data 19 following the focused one (step S208). If there is a next replacement document data 19 (step S208; YES), the document specifying means determines whether that next replacement document data 19 is one that was determined to be associated in step S204 (step S209).

When the next replacement document data 19 is one that was determined to be associated (step S209; YES), the document specifying means does not focus on it (step S210) and instead determines whether there is a replacement document data 19 after it (step S208). This is because the replacement document data 19 determined to be associated in step S204 contains the character string of a replacement document data already determined to include the search character string, so there is no need to test it again for the search character string. Performing steps S208 to S210 thus reduces the number of times the inclusion test for the search character string is carried out on the replacement document data 19.

When the next replacement document data 19 is not one that was determined to be associated (step S209; NO), the document specifying means focuses on that next replacement document data 19 (step S211), and the process returns to step S204.

When there is no next replacement document data 19 (step S208; NO), the document specifying means regards the inclusion test as complete for all the replacement document data 19 and exits the loop of steps S204 to S211. The document specifying means then outputs to the user the replacement document data 19 containing the search character string that were held in steps S205 and S207 (step S212). That is, the output consists of the replacement document data 19 identified in step S204 as including the search character string and all the replacement document data 19 determined in step S206 to be associated with them. The search process then ends.

The processing of steps S204 to S211 is performed for each replacement document data 19. When replacement document data 19 are associated with a replacement document data 19 determined to include the search character string, one replacement document data 19 including the search character string is specified first, and the replacement document data 19 associated with it are then specified as well. The replacement document data 19 that include the search character string can therefore be specified together in a single pass.
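The loop of steps S203 to S211 can be sketched as follows. This is an illustration only: the function name is hypothetical, documents are plain strings, and Python's substring operator stands in for the index-based test of step S204. The returned `checks` counter is added here purely to show how many inclusion tests the skip in steps S209 and S210 saves.

```python
def search(docs, relations, query):
    """docs: strings sorted by length; relations: {doc_no: [containing doc_nos]}."""
    hits, skip, checks = set(), set(), 0
    for doc_no, text in enumerate(docs):        # S203, S208, S211
        if doc_no in skip:                      # S209-S210: already known to match
            continue
        checks += 1
        if query in text:                       # S204 (stand-in for the index lookup)
            hits.add(doc_no)                    # S205
            linked = relations.get(doc_no, [])  # S206
            hits.update(linked)                 # S207
            skip.update(linked)
    return sorted(hits), checks


docs = ["ABCDE", "ABCDEF", "XABCDE", "QRSTUVW"]
relations = {0: [1, 2]}  # documents 1 and 2 contain document 0's text
print(search(docs, relations, "BCD"))  # ([0, 1, 2], 2)
```

In this example four documents are searched with only two inclusion tests, because documents 1 and 2 are reported through their link to document 0.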

In step S212, if no replacement document data 19 was identified as including the search character string, the document specifying means outputs no replacement document data 19, typically outputs a message such as "The search character string was not found." to the user, and ends the process.

As described above, when one replacement document data 19 that includes the search character string is specified among the plurality of replacement document data 19, the search device 20 in the present embodiment can simultaneously specify the replacement document data 19 associated with it, that is, those that contain its character string. This reduces the number of inclusion tests for the search character string across the plurality of replacement document data 19 and allows the search to be performed efficiently. The present embodiment is therefore particularly useful in environments where available resources are limited, such as small electronic devices like mobile phones and electronic dictionaries.

The present invention is not limited to the above embodiment, and various modifications and applications are possible.

For example, when the order changing means creates the replacement document data 19, numbers may be assigned so that the correspondence with the state before reordering can be recognized. FIG. 8 shows heading numbers assigned in ascending order to each of the plurality of document data 18 before reordering, after which the document data 18 are reordered in descending order of character count to create the replacement document data 19. The heading numbers, which were initially in ascending order, end up scattered in the replacement document data 19. In FIG. 8, new post-reordering heading numbers are therefore assigned in ascending order.

When the generation means associates the post-reordering heading numbers assigned in this way with the pre-reordering heading numbers and stores them, for example, in the transposed index 30, it becomes possible to tell where each replacement document data 19 stood in the order before reordering.

As a result, even when the document specifying means of the search device 20 specifies the replacement document data 19 that include the search character string based on the transposed index 30 generated from the plurality of replacement document data 19, it can determine not only which replacement document data 19 was specified among the plurality of replacement document data 19, but also which document data 18 it corresponds to among the plurality of document data 18 in their original order.
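A minimal sketch of this numbering: the sort is performed on indices so that the mapping from post-reordering number to pre-reordering number falls out directly. The function name, the 0-based numbering, and the `descending` flag (ascending as in the main embodiment by default, descending as in the FIG. 8 variation) are assumptions made for illustration.

```python
def reorder_with_mapping(docs, descending=False):
    # Sort document indices by character count instead of sorting the
    # documents themselves, so the original positions are not lost.
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]),
                   reverse=descending)
    reordered = [docs[i] for i in order]
    new_to_old = dict(enumerate(order))  # post-reordering no. -> original no.
    return reordered, new_to_old


docs = ["ccc", "a", "bb"]
reordered, new_to_old = reorder_with_mapping(docs)
print(reordered)   # ['a', 'bb', 'ccc']
print(new_to_old)  # {0: 1, 1: 2, 2: 0}
```

A hit on reordered document 0 can then be reported as original document `new_to_old[0]`, which is 1 here.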

The associating means can also associate the replacement document data 19 with one another so that the associations form a multi-layer structure. FIG. 9 shows the replacement document data 19c (headword 45) associated with the replacement document data 19a (headword 20), and three replacement document data 19i to 19k in turn associated with the replacement document data 19c (headword 45). In addition, two replacement document data 19l and 19m are associated with the replacement document data 19f (headword 775), and replacement document data 19n is associated with the replacement document data 19l. In this way, the associating means can associate replacement document data 19 whose character strings match across multiple layers. Once one replacement document data 19 is specified, the multiple replacement document data 19 associated with it can be specified as well, so the search can be performed efficiently.
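With a multi-layer structure, the documents linked to a hit can be collected by walking the relation graph: if document B contains A and C contains B, then C also contains A. The sketch below assumes the relation is stored as a mapping from a document number to its directly associated document numbers; the function name is hypothetical, and the numbers 901 to 906 are placeholders for documents 19i to 19n, whose headword numbers the description does not give.

```python
def linked_documents(relations, doc_no):
    """Collect every document reachable from doc_no through the relation."""
    found, stack = set(), list(relations.get(doc_no, []))
    while stack:
        d = stack.pop()
        if d not in found:  # guards against revisiting a document
            found.add(d)
            stack.extend(relations.get(d, []))
    return sorted(found)


relations = {
    20: [45, 85, 456, 775],  # first layer, as in FIG. 5
    45: [901, 902, 903],     # placeholders for 19i to 19k
    775: [904, 905],         # placeholders for 19l and 19m
    904: [906],              # placeholder for 19n
}
print(linked_documents(relations, 775))  # [904, 905, 906]
```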

Further, in step S108, the transposed index 30 may be generated with symbols and special characters such as "&", "*", and "+" removed. Because users rarely enter such symbols as part of a search character string, generating the transposed index 30 without them and searching on that index allows the words and the like to be searched to be narrowed down efficiently. In addition, in a transposed index 30 from which the symbols have been removed, the constituent elements that represent the association between replacement document data 19 whose character strings are in an inclusion relation make up a relatively larger share of the index, which also makes the search efficient.
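
A minimal sketch of symbol removal before N-gram extraction is shown below. It is an illustration, not the procedure of step S108 itself: the text names only "&", "*", "+" and "special characters", so treating every Unicode punctuation (P*) and symbol (S*) character as removable is an assumption, as are the function names `strip_symbols` and `build_index`.

```python
import unicodedata
from collections import defaultdict


def strip_symbols(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S")
    )


def build_index(documents, n=2):
    """Map each N-gram to a list of (document number, offset) pairs.

    Offsets refer to positions in the symbol-stripped text.
    """
    index = defaultdict(list)
    for doc_no, text in enumerate(documents):
        cleaned = strip_symbols(text)
        for pos in range(len(cleaned) - n + 1):
            index[cleaned[pos:pos + n]].append((doc_no, pos))
    return index


index = build_index(["R&D", "C++"])
```

Note that in this sketch characters on either side of a removed symbol become adjacent, so "R&D" contributes the bigram "RD"; a search string would need to be normalized the same way.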

The search character string accepted from the user may be a single word or may be a plurality of search character strings. When a plurality of search character strings is accepted, they can be combined by operations such as logical product (AND), logical sum (OR), or negated logical product (NAND), and the search may use any of these operations.
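
Assuming each search character string has already been resolved to a set of matching document numbers, the three operations named above reduce to set operations. The function `combine` and its parameters are hypothetical names used only for this Python sketch.

```python
def combine(result_sets, operation, all_docs):
    """Combine per-search-string result sets (sets of document numbers).

    result_sets: list of sets, one per search character string.
    all_docs: set of every document number, needed to evaluate NAND.
    """
    if not result_sets:
        return set()
    if operation == "and":
        return set.intersection(*result_sets)
    if operation == "or":
        return set.union(*result_sets)
    if operation == "nand":
        # NAND: every document except those containing all search strings.
        return all_docs - set.intersection(*result_sets)
    raise ValueError(f"unknown operation: {operation}")


hits = [{0, 1, 2}, {1, 2, 3}]
everything = {0, 1, 2, 3, 4}
both = combine(hits, "and", everything)
```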

The constituent elements of the document data 18 are not limited to a headword and an explanatory text. For example, the document data 18 may be composed of a headword, an explanatory text, a drawing that illustrates the headword, a headword with the opposite meaning, and the like.

Furthermore, the constituent elements of the transposed index 30 are not limited to those shown in FIG. 5 above. For example, the appearance frequency of each extracted N-gram in the plurality of document data 18 to be searched may be added as a further constituent element. In this case, the document specifying means can use the appearance frequency information to specify the document data 18 containing the search character string still more efficiently.
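
The text does not say how the appearance frequency is used. One plausible use, shown here purely as an assumption, is to look up the rarest N-grams of the search character string first so that the candidate set shrinks quickly. The helper names `ngrams`, `ngram_frequencies`, and `rarest_first` are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_frequencies(documents, n=2):
    """Count how many times each N-gram occurs across all documents."""
    freq = Counter()
    for text in documents:
        freq.update(ngrams(text, n))
    return freq


def rarest_first(query, freq, n=2):
    """Order the query's distinct N-grams from least to most frequent."""
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(set(ngrams(query, n)), key=lambda g: (freq[g], g))


freq = ngram_frequencies(["abcab", "abd"])
order = rarest_first("abd", freq)
```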

Further, when changing the order of the document data 18, the order changing means is not limited to counting the number of characters; it may instead measure the data amount of each document data 18 composed of a headword and an explanatory text. The order changing means can then create the replacement document data 19 by rearranging the plurality of document data 18 in ascending order of the measured data amount.
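
Ordering by data amount can differ from ordering by character count when characters occupy different numbers of bytes. The Python sketch below assumes that "data amount" means the encoded byte size and that the encoding is UTF-8; both are assumptions, as are the function names.

```python
def data_amount(document, encoding="utf-8"):
    """Return the encoded size in bytes of a (headword, explanation) pair."""
    headword, explanation = document
    return len(headword.encode(encoding)) + len(explanation.encode(encoding))


def reorder_by_data_amount(documents, encoding="utf-8"):
    """Return the documents sorted in ascending order of encoded byte size."""
    return sorted(documents, key=lambda d: data_amount(d, encoding))


japanese = ("辞書", "言葉の本")      # 6 characters, 18 bytes in UTF-8
english = ("index", "a table")      # 12 characters, 12 bytes in UTF-8
ordered = reorder_by_data_amount([japanese, english])
```

Here the Japanese entry has fewer characters but more bytes, so it sorts first by character count and second by data amount.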

In the generation device 10 of the present embodiment, the document data 18 need not reside in the generation device 10, for example by being stored in the HDD 14 as shown in FIG. 1. That is, as shown in FIG. 11 for example, the document data 18 may exist on the Internet rather than in the generation device 10 and be acquired via the communication control device 17.

Similarly, in the search device 20 of the present embodiment, as with the generation device 10 above, the document data 18 need not reside in the search device 20, for example by being stored in the HDD 14 as shown in FIG. 2. That is, as shown in FIG. 11 for example, the document data 18 may exist on the Internet rather than in the search device 20 and be acquired via the communication control device 17.

With this configuration, the embodiment of FIG. 11, unlike that of FIG. 2, does not need to store the document data 18 in the search device 20. In an environment with a suitable Internet connection, it is therefore easier to implement even on a device with limited storage capacity, such as a small electronic dictionary.

In addition to the embodiments described above, an embodiment of the present invention may be a computer program for causing a computer device to function as the generation device 10, or a computer program for causing a computer device to function as the search device 20.

The computer program can be stored in a computer-readable information storage medium such as a compact disk, flexible disk, hard disk, magneto-optical disk, digital video disk, magnetic tape, or semiconductor memory.

The computer program can be distributed and sold via a computer communication network independently of the computer device on which it is executed. The information storage medium can likewise be distributed and sold independently of the computer device.

DESCRIPTION OF SYMBOLS: 10 ... generation device, 11 ... CPU, 12 ... ROM, 13 ... RAM, 14 ... HDD, 15 ... input device, 16 ... output device, 17 ... communication control device, 18 ... document data, 19 ... replacement document data, 20 ... search device, 21 ... CPU, 22 ... ROM, 23 ... RAM, 24 ... HDD, 25 ... input device, 26 ... output device, 27 ... communication control device, 30 ... transposed index

Claims

A method for generating a transposed index in an information processing device,
the information processing device comprising storage means that stores a plurality of document data each composed of a headword and a corresponding explanatory text, and control means that generates a transposed index for the plurality of document data stored in the storage means,
wherein the control means executes:
an order changing step of counting, for each of the plurality of document data stored in the storage means, the number of characters of the headword and the explanatory text, and changing the order of the plurality of document data; and
a generation step of generating the transposed index by associating each "N-gram (N is a natural number)", which is a character string of N characters, with its appearance positions in the plurality of document data whose order has been changed.

The method for generating a transposed index according to claim 1,
wherein the control means further executes an association step of extracting, for each of the plurality of document data whose order has been changed, document data that contains the character strings of its headword and explanatory text, and associating the extracted document data with the document data whose character strings it contains as an inclusion relation, and
wherein, in the generation step executed by the control means, the transposed index is generated with the inclusion relation further stored.

The method for generating a transposed index according to claim 1 or 2,
wherein, in the generation step executed by the control means, the transposed index is generated by further storing the correspondence relationship between the plurality of document data whose order has been changed and the plurality of document data before the order was changed.

A search method in an information processing device comprising:
replacement document data storage means that stores the plurality of document data whose order has been changed in the order changing step of the transposed index generation method according to claim 1;
transposed index storage means that stores the transposed index generated in the generation step of the transposed index generation method according to claim 1; and
control means that searches for the document data using the transposed index,
wherein the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document specifying step of acquiring, from the transposed index, the appearance positions associated with the N-grams extracted in the N-gram extraction step and, based on the appearance positions, specifying document data containing the search character string from among the plurality of document data whose order has been changed.

A search method in an information processing device comprising:
replacement document data storage means that stores the plurality of document data whose order has been changed in the order changing step of the transposed index generation method according to claim 2;
transposed index storage means that stores the transposed index generated in the generation step of the transposed index generation method according to claim 2; and
control means that searches for the document data using the transposed index,
wherein the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document specifying step of acquiring, from the transposed index, the appearance positions associated with the N-grams extracted in the N-gram extraction step, specifying, based on the appearance positions, document data containing the search character string from among the plurality of document data whose order has been changed, and further specifying, based on the stored inclusion relation, document data associated with the specified document data.

A search method in an information processing device comprising:
replacement document data storage means that stores the plurality of document data whose order has been changed in the order changing step of the transposed index generation method according to claim 3;
transposed index storage means that stores the transposed index generated in the generation step of the transposed index generation method according to claim 3; and
control means that searches for the document data using the transposed index,
wherein the control means executes:
an N-gram extraction step of extracting N-grams from a search character string; and
a document specifying step of acquiring, from the transposed index, the appearance positions associated with the N-grams extracted in the N-gram extraction step, specifying document data containing the search character string from among the plurality of document data before the order was changed, based on the appearance positions and on the correspondence relationship between the plurality of document data whose order has been changed and the plurality of document data before the order was changed, and further specifying, based on the stored inclusion relation, document data associated with the specified document data.

A transposed index generation device comprising:
order changing means that counts, for each of a plurality of document data each composed of a headword and a corresponding explanatory text, the number of characters of the headword and the explanatory text, and changes the order of the plurality of document data in ascending order of the number of characters; and
generation means that generates a transposed index by associating each "N-gram (N is a natural number)", which is a character string of N characters, with its appearance positions in the plurality of document data whose order has been changed.

A search device comprising:
N-gram extracting means that extracts N-grams from a search character string; and
document specifying means that acquires, from the transposed index generated by the generation method according to claim 1, the appearance positions associated with the N-grams extracted by the N-gram extracting means and, based on the appearance positions, specifies document data containing the search character string from among the plurality of document data whose order has been changed.

A computer program for causing a computer to function as:
order changing means that counts, for each of a plurality of document data each composed of a headword and a corresponding explanatory text, the number of characters of the headword and the explanatory text, and changes the order of the plurality of document data in ascending order of the number of characters; and
generation means that generates a transposed index by associating each "N-gram (N is a natural number)", which is a character string of N characters, with its appearance positions in the plurality of document data whose order has been changed.

A computer program for causing a computer to function as:
N-gram extracting means that extracts N-grams from a search character string; and
document specifying means that acquires, from the transposed index generated by the generation method according to claim 1, the appearance positions associated with the N-grams extracted by the N-gram extracting means and, based on the appearance positions, specifies document data containing the search character string from among the plurality of document data whose order has been changed.