JP2006209399A

JP2006209399A - Device and method for retrieving document

Info

Publication number: JP2006209399A
Application number: JP2005019589A
Authority: JP
Inventors: Suefumi Yamada; 季史山田; Shigehisa Kawabe; 惠久川邉
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2005-01-27
Filing date: 2005-01-27
Publication date: 2006-08-10
Anticipated expiration: 2025-01-27
Also published as: JP4682627B2

Abstract

PROBLEM TO BE SOLVED: To retrieve documents for hiragana and katakana keywords using a small N-gram index.
SOLUTION: A character string extraction unit 11 extracts hiragana character strings and katakana character strings from a document. A character string concatenation unit 12 concatenates the extracted character strings so that their boundaries remain distinguishable and forms them into a pseudo-document. A bit vector generation unit 13 generates, for each N-gram that can occur in a hiragana or katakana word, a bit vector whose flag bits mark the positions at which that N-gram appears in the pseudo-document. An index registration unit 14 registers the bit vectors in an index storage unit 15 as an index for document search. A keyword input unit 16 inputs a keyword consisting of a hiragana or katakana character string. A search unit 17 decomposes the keyword into N-grams and determines, from the position information of those N-grams in the pseudo-document, whether the keyword is contained in the pseudo-document, and a search result output unit 18 outputs the search result.
COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a document search technique using the N-gram method, and is particularly well suited to use in combination with document search based on morphological analysis.

Morphological analysis and the N-gram method are known document search methods. In the morphological analysis method, a document is decomposed into morphemes by morphological analysis and registered in an index. For each morpheme entry, the index stores the identifiers of the documents containing that morpheme, or those identifiers together with the positions at which the morpheme appears. Searching the index with an input keyword quickly selects the documents containing that keyword. The drawback of this method is that morpheme boundaries can be misjudged, for example because of unknown words, so that documents that should have been selected are missed.

In the N-gram method, by contrast, each N-gram in a document is registered in an index. "N-gram" is a term from language modeling and refers to sequences of two adjacent characters (bi-grams) or three adjacent characters (tri-grams) cut mechanically from a document. Usually the position of each occurrence in the document and the document identifier are registered in the index as well, and character strings of arbitrary length can be searched by checking adjacency from the positions. For example, with N = 2, the phrase 「文字列の検索処理」 ("character string search processing") is decomposed into overlapping two-character units and their positions are stored. In this case it is decomposed into 「(1) 文字」, 「(2) 字列」, 「(3) 列の」, 「(4) の検」, 「(5) 検索」, 「(6) 索処」, and 「(7) 処理」 and registered in the index (the numbers in parentheses are the positions). To search for the word 「検索」 ("search"), one looks in the index for documents containing the N-gram 「検索」. To search for 「検索処理」 ("search processing"), one looks for documents that contain both 「検索」 and 「処理」 and in which the position of 「処理」 is 2 greater than the position of 「検索」 (here the match is found at (5) and (7)).
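The bi-gram example above can be sketched in Python as follows. This is a minimal illustration only, not an implementation taken from any of the cited documents; the function names `build_bigram_index` and `contains` are hypothetical, and the keyword is assumed to have an even number of characters so that it splits into whole bi-grams.

```python
from collections import defaultdict


def build_bigram_index(text):
    """Map each bi-gram to the 1-based positions at which it starts."""
    index = defaultdict(list)
    for i in range(len(text) - 1):
        index[text[i:i + 2]].append(i + 1)
    return index


def contains(index, keyword):
    """Return True if the keyword occurs, judged from bi-gram positions only.

    Assumes len(keyword) is even and at least 2 (an illustrative restriction).
    """
    grams = [keyword[i:i + 2] for i in range(0, len(keyword), 2)]
    # Candidate start positions come from the first bi-gram.
    candidates = set(index.get(grams[0], []))
    for k, gram in enumerate(grams[1:], start=1):
        positions = set(index.get(gram, []))
        # The k-th following bi-gram must start 2*k characters later.
        candidates = {p for p in candidates if p + 2 * k in positions}
    return bool(candidates)


index = build_bigram_index("文字列の検索処理")
print(index["検索"], index["処理"])   # positions of the two bi-grams
print(contains(index, "検索処理"))    # adjacency check between them
```

Storing an explicit position list per bi-gram is what makes this kind of index grow quickly, which is the size problem the following paragraphs address.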

However, because a search using the N-gram method ignores word boundaries and retrieves documents purely by character string matching, it also retrieves documents that should not be hit, and the documents actually wanted can be buried among documents that do not fit the search. In addition, the index tends to grow large because it stores appearance positions and similar information.

The following prior art documents relate to the present invention.

Patent Document 1 makes the morpheme set as compact as possible in order to reduce the index of the morphological analysis method (morphemes that contain other morphemes, called extended words, are not included in the dictionary). It then uses a dictionary (extended word dictionary) that associates each morpheme in the compact set, as an entry, with the morphemes that contain it as a part. When an input keyword is not in the compact morpheme set, the constituent words of the keyword and the extended word dictionary are used so that the search misses nothing. The search mode can also be switched between a search restricted to the compact morpheme set and an exhaustive search using the extended word dictionary.

Patent Document 2 discloses segmenting character strings by character type, using hiragana and katakana strings as feature words as they are, extracting N-grams from kanji strings as feature words, and retrieving similar documents based on the frequency information of the feature words.

Patent Document 3 discloses adding word start position and word end position information, obtained from morphological analysis results, to an N-gram index so that prefix-match, suffix-match, and similar searches that respect word boundaries can be performed.
Japanese Patent Laid-Open No. 11-73429
Japanese Patent Laid-Open No. 11-143902
Japanese Patent Laid-Open No. 2000-23156

The present invention has been made in view of the above circumstances, and its object is to allow N-gram search to be performed simply while keeping the index size small. In a specific aspect, its object is to provide a technique based on the N-gram method that is well suited to use in combination with search based on morphological analysis.

In a specific configuration example of the present invention, an index of the morphological analysis type is used as the basis of the search, while an N-gram index is built that is typically limited to hiragana and katakana, which compensates for the search omissions of the morphological analysis method.

In addition, the appearance positions in a pseudo-document, in which the distinct words (words that are not identical) are separated by blank characters, are converted into bit vectors, which reduces the index size and speeds up the adjacency computation.

In this configuration example, partial-match search by the N-gram method, typically limited to hiragana and katakana, is available while the morphological analysis method is still used.

Because a pseudo-document consisting only of distinct words is created and the appearance positions in it are held as bit vectors, the N-gram index can be made small.

A bit vector can also be folded at a fixed length and thereby reduced to that fixed length.

Furthermore, a bit vector can be divided into a plurality of sequences, and the sequences in which no flag bit indicating an appearance position is set can be omitted to reduce the bit vector size.

The present invention is described further below. Reference numerals of the parts of the embodiments are sometimes attached in the following description for ease of understanding, but this is not intended to limit the invention to the embodiments.

According to one aspect of the present invention, to achieve the above object, a document search apparatus (100) is provided with: pseudo-document generation means (11, 12) for extracting, from each document to be searched, character strings of at least one preselected type among hiragana character strings, katakana character strings, alphabet character strings, and mixed character strings of hiragana, katakana, and alphabet, and concatenating them to generate a pseudo-document; appearance position storage means (13, 14, 15) for storing, for each document to be searched and for each N-gram entry, appearance position information representing the positions at which that N-gram appears in the pseudo-document; and document specifying means (17) for matching a search keyword, composed of character strings of the at least one preselected type, against the appearance position information to identify the documents containing the search keyword.

With this configuration, the N-gram index is small, so search processing can be performed at high speed with few computing resources.

In this configuration, the appearance position storage means preferably represents the appearance position information of an N-gram in the pseudo-document as a bit vector in which a flag bit is set at the bit position corresponding to each appearance position.

With bit vectors, adjacency can be determined simply by a shift operation and an AND operation, and search processing is correspondingly simple.

The pseudo-document is preferably generated by keeping only one of any identical character strings within the same document and deleting the others, which reduces the index size further. The duplicates may of course be left in.

A blank character can be used to mark the boundary between adjacent character strings in the pseudo-document, but the delimiter is not limited to this.

The at least one preselected type of character string is typically hiragana character strings and katakana character strings.

When the bit length of a bit vector exceeds a predetermined length, the bit vector may be folded at the predetermined length and the flag bits ORed together, so that the bit vector is reduced to the predetermined length.

Alternatively, when the bit length of a bit vector exceeds a predetermined length, the bit vector may be divided into sequences of the predetermined length, registered with the sequences that contain no flag bit omitted, and the omitted sequences restored at search time. A bit vector in this format may be stored on, for example, a hard disk, retrieved per N-gram when needed, and expanded in memory into the normal format.

According to another aspect of the present invention, a document search apparatus (100) is provided with: appearance position storage means (15) for storing, for each document to be searched and for each N-gram entry, appearance position information representing the positions at which that N-gram appears, as a bit vector in which a flag bit is set at the bit position corresponding to each position; and document specifying means (17) for matching a search keyword against the appearance position information to identify the documents containing the search keyword. When the search keyword consists of two or more N-grams, the documents containing the search keyword are identified by determining that the flag bit positions of the bit vectors of those N-grams are in the corresponding adjacency relationship.

In this configuration, because the appearance positions of the N-grams are expressed as bit vectors, the adjacency of N-grams can be processed with shift and AND operations.

According to yet another aspect of the present invention, a document search apparatus (200) is provided with first search means for searching documents using a search dictionary generated from morphological analysis results, and second search means for searching documents using an N-gram dictionary. The second search means is provided with: pseudo-document generation means for extracting, from each document to be searched, character strings of at least one preselected type among hiragana character strings, katakana character strings, alphabet character strings, and mixed character strings of hiragana, katakana, and alphabet, and concatenating them to generate a pseudo-document; appearance position storage means for storing, for each document to be searched and for each N-gram entry, appearance position information representing the positions at which that N-gram appears in the pseudo-document; and document specifying means for matching a search keyword, composed of character strings of the at least one preselected type, against the appearance position information to identify the documents containing the search keyword.

With this configuration, by using search by the morphological analysis method and search by the N-gram method in combination, the N-gram search can be limited to character strings of at least one preselected type among hiragana character strings, katakana character strings, alphabet character strings, and mixed character strings of hiragana, katakana, and alphabet, and the N-gram index can therefore be made small. Moreover, character strings that are not among the morphemes can still be searched reliably.

The present invention can be realized not only as an apparatus or a system but also as a method. Part of such an invention can of course be implemented as software, and software products used to cause a computer to execute such software naturally fall within the technical scope of the invention.

The above and other aspects of the present invention are set forth in the claims and are described in detail below with reference to embodiments.

According to the present invention, N-gram search can be performed simply while the index size is kept small.

Embodiments of the present invention are described below.

First, a document search apparatus 100 of Embodiment 1, which implements the basic configuration of the present invention, is described. This embodiment receives hiragana words and katakana words as keywords and searches documents by the N-gram method. The document search apparatus 100 is realized by installing software on a computer, for example a personal computer 1000, using for example a recording medium 1001. As is well known, the personal computer 1000 comprises a CPU, main memory, external memory, a bus, and various I/O devices, and each part (functional block) of the document search apparatus 100 is formed by the cooperation of the hardware and software resources of the personal computer 1000.

FIG. 1 shows the document search apparatus 100 of Embodiment 1. In this figure, the document search apparatus 100 includes a document input unit 10, a character string extraction unit 11, a character string concatenation unit 12, a bit vector generation unit 13, an index registration unit 14, an index storage unit 15, a keyword input unit 16, a search unit 17, a search result output unit 18, and so on. The document input unit 10 inputs the documents (electronic data) to be searched. Documents can be input in various ways: one document file or a group of document files on a file management system may be designated, or documents may be input by file transfer or message transfer. Input documents may also be selected by their attributes, word vectors, and the like. The character string extraction unit 11 extracts the hiragana character strings and katakana character strings in a document. Strings that, according to morpheme information, form a single morpheme together with kanji or other characters may be excluded. This example targets only hiragana and katakana character strings, but mixed hiragana/katakana strings, alphabet strings, mixed hiragana/alphabet strings, mixed katakana/alphabet strings, and mixed hiragana/katakana/alphabet strings may also be extracted as appropriate. The character string concatenation unit 12 concatenates the extracted character strings so that their boundaries remain distinguishable and forms them into a pseudo-document. The extracted character strings may be divided by type into a plurality of pseudo-documents. Words that are in the original document but are not extraction targets are left out of the pseudo-document, which keeps it compact. The bit vector generation unit 13 generates, for each N-gram that can occur in a hiragana character string (hiragana word) or katakana character string (katakana word), a bit vector whose flag bits indicate the positions at which that N-gram appears in each pseudo-document. Bit vectors are explained later with an example. The index registration unit 14 registers the bit vectors in the index storage unit 15 as an index for document search. The keyword input unit 16 inputs a keyword consisting of a hiragana or katakana character string. The keyword may be entered directly by the searching user, or a search condition entered by the user may be processed by a front end and the parts corresponding to hiragana and katakana words passed in through the keyword input unit 16. The search unit 17 decomposes the input keyword into N-grams and determines, from the position information of those N-grams in a pseudo-document, whether the keyword is contained in that pseudo-document. This is also explained later with an example. The search result output unit 18 outputs whether the keyword is contained in the pseudo-document, that is, in the input document. A list of matching documents may be output instead.

FIG. 2 shows the flow of the index registration process for an input document, and FIG. 6 shows the flow of the search process.

First, the index registration process for an input document is described with reference to FIG. 2, using the document shown in FIG. 3 as an example. The document of FIG. 3 consists of sentences 1 to 5: sentence 1 contains the hiragana character string 「あいうえお」, sentence 2 contains 「あいう」, sentence 3 contains 「えおかきくけこ」, sentence 4 contains 「あいう」, and sentence 5 contains 「えおかき」. This example shows hiragana character strings, but a document may of course contain katakana character strings as well.

An example of the index registration process of FIG. 2 is as follows.

[Step S10]: The document input unit 10 inputs a document to be searched. The document is assumed to be the one shown in FIG. 3.

[Step S11]: The character string extraction unit 11 extracts the hiragana and katakana character strings from the document. The extraction can be performed on the basis of character type.

[Step S12]: The character string concatenation unit 12 concatenates the extracted character strings in order, as shown in FIG. 4, to generate a pseudo-document. A delimiter code is inserted between character strings; this example uses a blank character. When the same character string appears again (for example, 「あいう」 in sentence 2 and 「あいう」 in sentence 4), it is not concatenated a second time, because registering an identical character string once is enough to make its occurrence searchable. The same character string may of course be registered repeatedly, at the cost of a larger pseudo-document. Conversely, when a character string is a substring of another character string, it may also be left out to make the pseudo-document smaller still.

[Step S13]: Each character position of the pseudo-document is represented by one bit, and for each N-gram a flag bit (for example "1") is set at the position where the N-gram starts. Here N is 3. In the example of FIG. 4, the N-gram 「あいう」 starts at bit 0 and bit 6, so a bit vector with flag bits (shown in black) set at bit 0 and bit 6 is generated. Similarly, for the N-gram 「いうえ」, which starts one character after the first 「あいう」, a bit vector with the flag bit set at bit 1 is generated. Bit vectors are generated in the same way for every N-gram that can occur.

[Step S14]: The index registration unit 14 registers the bit vectors of the input document in the index storage unit 15 as an index. In this way, for each N-gram entry, the ID of each document containing it and the corresponding bit vector are generated and registered as an index record.
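Steps S11 to S14 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the apparatus itself: the regular expression `KANA_RUN` is one possible definition of "character type", the names `build_pseudo_document`, `build_bit_vectors`, and `register` are hypothetical, the sample text merely stands in for the document of FIG. 3, and each bit vector is held in a Python integer whose bit p corresponds to character position p of the pseudo-document.

```python
import re

N = 3  # tri-grams, as in the embodiment
# Assumed definition of the character types: runs of hiragana, or runs of
# katakana including the prolonged sound mark (U+30FC).
KANA_RUN = re.compile(r"[\u3041-\u3096]+|[\u30A1-\u30FA\u30FC]+")


def build_pseudo_document(text, sep=" "):
    """Steps S11-S12: extract kana runs, drop exact duplicates, join them."""
    runs = []
    for run in KANA_RUN.findall(text):
        if run not in runs:
            runs.append(run)
    return sep.join(runs)


def build_bit_vectors(pseudo, sep=" "):
    """Step S13: bit p of vectors[g] is set if N-gram g starts at position p."""
    vectors = {}
    for p in range(len(pseudo) - N + 1):
        gram = pseudo[p:p + N]
        if sep in gram:  # an N-gram may not span a string boundary
            continue
        vectors[gram] = vectors.get(gram, 0) | (1 << p)
    return vectors


def register(index, doc_id, vectors):
    """Step S14: store one (document ID, bit vector) record per N-gram."""
    for gram, vector in vectors.items():
        index.setdefault(gram, {})[doc_id] = vector


# Hypothetical stand-in for the document of FIG. 3 (kanji and punctuation
# separate the hiragana strings of sentences 1 to 5).
document = "文一あいうえお。文二あいう。文三えおかきくけこ。文四あいう。文五えおかき。"
pseudo = build_pseudo_document(document)
vectors = build_bit_vectors(pseudo)
index = {}
register(index, "doc-1", vectors)
print(pseudo)
print(bin(vectors["あいう"]))
```

With this sample text the vector for 「あいう」 has bits 0 and 6 set, as in Step S13. Skipping any N-gram that contains the delimiter is what lets the search later distinguish strings that merely sit next to each other in the pseudo-document from strings that are contiguous in the original text.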

An example of the search process of FIG. 6 is as follows.

[Step S20]: The keyword input unit 16 inputs a keyword consisting of a hiragana or katakana character string.

[Step S21]: The search unit 17 decomposes the character string of the input keyword into N-grams. N is 3 in this example. For example, the input keyword 「おかきくけこ」 is decomposed into 「おかき」 and 「くけこ」.

[Step S22]: The search unit 17 retrieves the document IDs and bit vectors corresponding to the N-grams from the index storage unit 15. In this example, the bit vector of 「おかき」 and the bit vector of 「くけこ」 are retrieved, as shown in FIG. 7.

[Step S23]: The search unit 17 examines the spacing between the flag bits of the bit vectors to determine whether the character string 「おかきくけこ」 is present. In this example, as shown in FIG. 7, the difference between the bit positions is 3, so 「おかきくけこ」 is determined to be present. The determination process is described in detail later. By contrast, when the keyword is 「うえおかきく」 and the N-grams 「うえお」 and 「かきく」 are used, FIG. 8 shows that they are not adjacent, so the keyword is determined to be absent.

The search result is output by the search result output unit 18.

A detailed example of the N-gram adjacency determination process using bit vectors is now described.

FIG. 9 shows the flow of a detailed example of the adjacency determination process (keyword search process), which proceeds as follows.

[Step S30]: The index is searched for all N-grams (also called search terms) that make up the keyword, and for N-grams registered under the same document ID, the respective bit vectors are retrieved. The description here takes as an example the keyword 「おかきくけこさしす」, for which the index is searched for the N-grams 「おかき」, 「くけこ」, and 「さしす」 and the bit vectors shown in FIG. 10 are retrieved for a given document ID.

[Step S31]: The i-th bit vector is shifted 3 bits to the right and ANDed with the (i+1)-th bit vector, and the result becomes the new (i+1)-th bit vector (the initial value of i is 0). Then i is incremented by 1.

[Step S32]: It is determined whether there is a next bit vector. If there is, the process returns to step S31 and repeats. If there is not, the process proceeds to step S33.

[Step S33]: It is determined whether any bit in the resulting bit vector is set to "1". If so, the N-grams are in the corresponding adjacency relationship in the document and the search keyword is present (a hit); if not, the N-grams are not in that relationship and the search keyword is absent.

This example is further illustrated in FIG. 11. As the figure makes clear, the processing reduces to simple shift and AND operations on bit vectors.
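Steps S30 to S33 can be sketched as follows for a single document. This is an illustration under assumptions, not the claimed implementation: the names `bit_vectors` and `keyword_in_document` are hypothetical, the keyword length is assumed to be a multiple of N, and each bit vector is a Python integer whose bit p stands for character position p. Under that convention the "3-bit right shift" of the figures, which moves flags toward later positions, is written `<< N`.

```python
N = 3  # tri-grams, as in the embodiment


def bit_vectors(pseudo, sep=" "):
    """Bit p of the vector for N-gram g is set if g starts at position p."""
    vectors = {}
    for p in range(len(pseudo) - N + 1):
        gram = pseudo[p:p + N]
        if sep not in gram:
            vectors[gram] = vectors.get(gram, 0) | (1 << p)
    return vectors


def keyword_in_document(vectors, keyword):
    """Steps S30-S33: chain shift and AND over the keyword's N-grams."""
    if not keyword or len(keyword) % N:
        raise ValueError("keyword length must be a positive multiple of N")
    grams = [keyword[i:i + N] for i in range(0, len(keyword), N)]
    current = vectors.get(grams[0], 0)          # S30: first bit vector
    for gram in grams[1:]:                      # S31-S32: loop over the rest
        # Move every flag N positions later, then keep only the flags
        # where the next N-gram actually starts.
        current = (current << N) & vectors.get(gram, 0)
    return current != 0                         # S33: any flag left is a hit


vectors = bit_vectors("あいうえお あいう えおかきくけこ えおかき")
print(keyword_in_document(vectors, "おかきくけこ"))
print(keyword_in_document(vectors, "うえおかきく"))
```

With the pseudo-document of the running example, the first call reports a hit and the second does not, which corresponds to the situations of FIG. 7 and FIG. 8. An N-gram missing from the index is treated as an all-zero vector, so the chain collapses to zero as soon as one search term is absent.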

Next, variations of Embodiment 1 are described.

FIG. 12 shows a variation that limits the size of the bit vector. In the example of FIG. 12, a bit vector longer than 1000 bits is folded back every 1000 bits and thereby reduced to a 1000-bit bit vector. Bits that overlap are ORed together. Because of the OR operation, N-grams that are not actually adjacent may occasionally be judged to be adjacent, but no search omissions occur. The bit vector length can of course be set to any size instead of 1000 bits.
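The folding described above can be sketched as follows. The text does not spell out how the shift should behave at the fold boundary; the cyclic rotation used here is an assumption, chosen so that an adjacency that straddles the boundary is still detected and the no-omission property holds. The names `fold`, `rotate`, and `may_be_adjacent` are hypothetical, and a width of 8 bits is used in the usage lines only to keep the numbers readable.

```python
def fold(vector, width):
    """OR together successive width-bit slices of a bit vector."""
    mask = (1 << width) - 1
    folded = 0
    while vector:
        folded |= vector & mask
        vector >>= width
    return folded


def rotate(vector, shift, width):
    """Cyclic shift within width bits (assumed behaviour at the fold line)."""
    mask = (1 << width) - 1
    shift %= width
    return ((vector << shift) | (vector >> (width - shift))) & mask


def may_be_adjacent(first, second, n, width):
    """True if some flag of `second` lies n positions after a flag of `first`,
    judged on the folded vectors (false positives are possible)."""
    return rotate(fold(first, width), n, width) & fold(second, width) != 0


# Positions 6 and 9 are truly adjacent for N = 3 and straddle the fold at 8.
print(may_be_adjacent(1 << 6, 1 << 9, 3, 8))
# Positions 6 and 17 are not adjacent, but 17 folds onto the same bit as 9.
print(may_be_adjacent(1 << 6, 1 << 17, 3, 8))
```

The second call illustrates the kind of false hit the text mentions: after folding, distinct positions that are congruent modulo the fold length become indistinguishable.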

FIG. 13 shows a modification in which the bit vector is divided into a plurality of sequences and sequences that contain no flag bit are omitted. In the example of FIG. 13, the bit vector (FIG. 13(A)) is divided into sequences of a predetermined length, for example 1 byte (FIG. 13(B)), so that it can be managed sequence by sequence using sequence numbers, and a sequence that contains no flag bit is itself omitted. That is, as shown in FIG. 13(C), a sequence number is used in addition to the identifier (KEY) indicating the N-gram, and the sequence numbers of sequences that contain no flag bit are omitted.
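One way to picture the sequence-based format is the sketch below, which keeps only the non-empty 1-byte sequences of a bit vector, keyed by sequence number, and can rebuild the full vector from them. The dictionary layout and the names `to_sequences` and `from_sequences` are assumptions made for illustration; the actual record layout of FIG. 13(C) is not reproduced here.

```python
def to_sequences(vector, seq_bits=8):
    """Split a bit vector into fixed-length sequences, dropping those with no flag bit."""
    mask = (1 << seq_bits) - 1
    sequences = {}
    number = 0
    while vector:
        chunk = vector & mask
        if chunk:  # sequences without any flag bit are omitted
            sequences[number] = chunk
        vector >>= seq_bits
        number += 1
    return sequences


def from_sequences(sequences, seq_bits=8):
    """Expand the stored sequences back into the full bit vector."""
    vector = 0
    for number, chunk in sequences.items():
        vector |= chunk << (number * seq_bits)
    return vector


sparse = to_sequences((1 << 1) | (1 << 40))
print(sparse)  # only sequence numbers 0 and 5 are stored
```

For sparse vectors, which are typical when an N-gram occurs only a few times in a document, most sequences are empty and are never stored.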

In practice, index data in the format of FIG. 13(C) is stored in the compressed index storage unit 15b (for example, a hard disk), as shown in FIG. 14. Once the keyword has been decomposed and its N-grams determined, the index data of those N-grams (in the format of FIG. 13(C)) is read from the compressed index storage unit 15b and expanded into the main memory 15a, where the shift processing and AND operations described above are performed to determine the adjacency relationship.

Next, Embodiment 2 will be described, in which the present invention is applied to a document search apparatus that uses both a morphological-analysis index and an N-gram index.

FIG. 15 shows a document search apparatus 200 according to Embodiment 2. In this figure, the document search apparatus 200 includes a search condition input unit 20, a search front end 21, an N-gram document search unit 22, an N-gram index storage unit 23, a morpheme word document search unit 24, a morpheme word index storage unit 25, a search result synthesis unit 26, a synthesized search result output unit 27, and so on. As in Embodiment 1, this example can be realized by installing software on a computer.

The search condition input unit 20 receives a search condition. The search condition may be entered as a natural-language sentence. In accordance with the search condition, the search front end 21 outputs hiragana character string keywords and katakana character string keywords to the N-gram document search unit 22, and outputs morpheme words as keywords to the morpheme word document search unit 24. The N-gram document search unit 22 and the N-gram index storage unit 23 correspond to the N-gram document search apparatus 100 of FIG. 1, and the N-gram index storage unit 23 corresponds to the index storage unit 15. The morpheme word document search unit 24 searches for documents by referring to the morpheme-word index stored in the morpheme word index storage unit 25. The morpheme-word index consists of index records, each of which has a morpheme word as its entry and contains the IDs of the documents in which that morpheme word appears. The records may also hold the appearance positions within each document.
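The routing performed by the search front end 21 might look like the following sketch. It is a simplified illustration, not the embodiment's logic: the name `route_keywords` is hypothetical, the keywords are assumed to have been extracted from the search condition already, and a keyword is treated as kana-only when every character falls in the Unicode Hiragana or Katakana ranges, which is an assumption of this sketch.

```python
import re

# Assumption: kana-only means every character is in U+3041-U+309F (hiragana)
# or U+30A0-U+30FF (katakana).
KANA_ONLY = re.compile(r"[\u3041-\u309F\u30A0-\u30FF]+")


def route_keywords(keywords):
    """Split keywords into (for N-gram search, for morpheme word search)."""
    for_ngram, for_morpheme = [], []
    for keyword in keywords:
        if KANA_ONLY.fullmatch(keyword):
            for_ngram.append(keyword)     # handled by the N-gram document search unit 22
        else:
            for_morpheme.append(keyword)  # handled by the morpheme word document search unit 24
    return for_ngram, for_morpheme


kana, other = route_keywords(["カタカナ", "検索", "ひらがな"])
print(kana, other)
```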

The search result synthesis unit 26 combines the search results of the N-gram document search unit 22 and the morpheme word document search unit 24. The same keyword may be supplied to both the N-gram document search unit 22 and the morpheme word document search unit 24 and the OR of the search results taken, so that nothing is missed. Alternatively, hiragana or katakana character string keywords in the search condition that are not contained in any morpheme word may be supplied to the N-gram document search unit 22, while keywords in the search condition that correspond to morpheme words are supplied to the morpheme word document search unit 24; the search result synthesis unit 26 then performs the processing required by the AND or OR conditions of the search condition. The synthesized search result output unit 27 outputs the synthesized search result as, for example, a document list.
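The combining step can be sketched with set operations on document IDs. This is a minimal illustration in which the name `combine_results` and the representation of each search result as a collection of document IDs are assumptions.

```python
def combine_results(ngram_hits, morpheme_hits, mode="OR"):
    """Combine two collections of document IDs under an AND or OR condition."""
    ngram_hits, morpheme_hits = set(ngram_hits), set(morpheme_hits)
    if mode == "OR":
        combined = ngram_hits | morpheme_hits  # union: no document is missed
    elif mode == "AND":
        combined = ngram_hits & morpheme_hits  # intersection: both searches must hit
    else:
        raise ValueError("mode must be 'AND' or 'OR'")
    return sorted(combined)  # output as a document list


print(combine_results({1, 3}, {3, 4}))         # OR of the two results
print(combine_results({1, 3}, {3, 4}, "AND"))  # AND of the two results
```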

The present invention is not limited to the embodiments described above, and various modifications can be made without departing from its spirit. For example, although the above examples were described as stand-alone apparatuses, the search method of the present invention may be realized using a plurality of computer systems, for example a document search server apparatus and arbitrary client apparatuses (including personal computers, portable information terminals, and the like).

FIG. 1 is a block diagram illustrating the configuration of the document search apparatus of Embodiment 1 of the present invention.
FIG. 2 is a flowchart illustrating an example of the index registration processing of Embodiment 1.
FIG. 3 is a diagram illustrating an example document, used to explain the operation of the index registration processing.
FIG. 4 is a diagram illustrating an example pseudo document, used to explain the operation of the index registration processing.
FIG. 5 is a diagram illustrating an example bit vector, used to explain the operation of the index registration processing.
FIG. 6 is a flowchart illustrating an example of the search processing of Embodiment 1.
FIG. 7 is a diagram illustrating an example in which the N-grams corresponding to a search keyword have the neighborhood relationship.
FIG. 8 is a diagram illustrating an example in which the N-grams corresponding to a search keyword do not have the neighborhood relationship.
FIG. 9 is a flowchart illustrating an example of the processing that determines from the bit vectors whether the N-grams corresponding to a search keyword have the neighborhood relationship.
FIG. 10 is a diagram illustrating an example of the N-grams used in the flowchart of FIG. 9.
FIG. 11 is a diagram illustrating an operation example of the flowchart of FIG. 9.
FIG. 12 is a diagram illustrating a modification of the above embodiment.
FIG. 13 is a diagram illustrating another modification of the above embodiment.
FIG. 14 is a diagram illustrating the other modification.
FIG. 15 is a block diagram illustrating the configuration of the document search apparatus of Embodiment 2 of the present invention.

Explanation of symbols

10 Document input unit
11 Character string extraction unit
12 Character string connection unit
13 Bit vector generation unit
14 Index registration unit
15 Index storage unit
15a Main memory
15b Compressed index storage unit
16 Keyword input unit
17 Search unit
18 Search result output unit
20 Search condition input unit
21 Search front end
22 N-gram document search unit
23 N-gram index storage unit
24 Morpheme word document search unit
25 Morpheme word index storage unit
26 Search result synthesis unit
27 Synthesized search result output unit
100 Document search apparatus
200 Document search apparatus
1000 Personal computer
1001 Recording medium

Claims

1. First search means for performing a document search using a search dictionary generated from a morphological analysis result; and
second search means for performing a document search using an N-gram dictionary,
wherein the second search means includes:
pseudo document generation means for generating a pseudo document by extracting and concatenating, from each search target document, at least one kind of character string selected in advance from among a hiragana character string, a katakana character string, an alphabet character string, and a mixed character string of hiragana, katakana, and alphabet;
appearance position storage means for storing, for each of the search target documents and for each N-gram entry, appearance position information representing the appearance position of the N-gram in the pseudo document; and
document specifying means for collating a search keyword, composed of the at least one kind of character string selected in advance from among the hiragana character string, the katakana character string, the alphabet character string, and the mixed character string of hiragana, katakana, and alphabet, with the appearance position information, and specifying a document including the search keyword.

2. Pseudo document generation means for generating a pseudo document by extracting and concatenating, from each search target document, at least one kind of character string selected in advance from among a hiragana character string, a katakana character string, an alphabet character string, and a mixed character string of hiragana, katakana, and alphabet;
appearance position storage means for storing, for each of the search target documents and for each N-gram entry, appearance position information representing the appearance position of the N-gram in the pseudo document; and
document specifying means for collating a search keyword, composed of the at least one kind of character string selected in advance from among the hiragana character string, the katakana character string, the alphabet character string, and the mixed character string of hiragana, katakana, and alphabet, with the appearance position information, and specifying a document including the search keyword.

3. A document search apparatus comprising:
appearance position storage means for storing, for each search target document and for each N-gram entry, appearance position information representing the appearance position of the N-gram as a bit vector in which a flag bit is set at the bit position corresponding to that position; and
document specifying means for collating a search keyword with the appearance position information and specifying a document including the search keyword,
wherein, when the search keyword is composed of two or more N-grams, a document including the search keyword is specified by determining that the flag bit positions of the bit vectors of the N-grams have the corresponding adjacency relationship.

4. A document search method comprising:
a pseudo document generation step in which pseudo document generation means generates a pseudo document by extracting and concatenating, from each search target document, at least one kind of character string selected in advance from among a hiragana character string, a katakana character string, an alphabet character string, and a mixed character string of hiragana, katakana, and alphabet;
an appearance position storage step of storing, for each of the search target documents and for each N-gram entry, appearance position information representing the appearance position of the N-gram in the pseudo document; and
a document specifying step in which document specifying means collates a search keyword, composed of the at least one kind of character string selected in advance from among the hiragana character string, the katakana character string, the alphabet character string, and the mixed character string of hiragana, katakana, and alphabet, with the appearance position information, and specifies a document including the search keyword.

5. A document search method comprising:
an appearance position storage step in which appearance position storage means stores, for each search target document and for each N-gram entry, appearance position information representing the appearance position of the N-gram as a bit vector in which a flag bit is set at the bit position corresponding to that position; and
a document specifying step in which document specifying means collates a search keyword with the appearance position information and specifies a document including the search keyword,
wherein, when the search keyword is composed of two or more N-grams, a document including the search keyword is specified by determining that the flag bit positions of the bit vectors of the N-grams have the corresponding adjacency relationship.

6. A computer program for document search, used for realizing:
pseudo document generation means for generating a pseudo document by extracting and concatenating, from each search target document, at least one kind of character string selected in advance from among a hiragana character string, a katakana character string, an alphabet character string, and a mixed character string of hiragana, katakana, and alphabet;
appearance position storage means for storing, for each of the search target documents and for each N-gram entry, appearance position information representing the appearance position of the N-gram in the pseudo document; and
document specifying means for collating a search keyword, composed of the at least one kind of character string selected in advance from among the hiragana character string, the katakana character string, the alphabet character string, and the mixed character string of hiragana, katakana, and alphabet, with the appearance position information, and specifying a document including the search keyword.

7. The computer program for document search according to claim 6, wherein the appearance position storage means represents the appearance position information, which represents the appearance position of the N-gram in the pseudo document, by a bit vector in which a flag bit is set at the bit position corresponding to that position.

8. The computer program for searching a document according to claim 6, wherein the pseudo document is generated by degenerating a plurality of identical character strings in the same document into one.

9. The computer program for searching a document according to claim 6, 7 or 8, wherein a blank character is used to represent a break between adjacent character strings in the pseudo document.

10. The computer program for document search according to any one of claims 6 to 9, wherein the at least one kind of character string selected in advance is a hiragana character string or a katakana character string.

11. The computer program for document search according to claim 7, wherein, when the bit length of the bit vector exceeds a predetermined length, the bit vector is folded back at the position of the predetermined length and the overlapping flag bits are ORed.

12. The computer program for document search according to claim 7, wherein, when the bit length of the bit vector exceeds a predetermined length, the bit vector is divided into sequences of the predetermined length, and the sequences not including the flag bit are omitted and complemented.

13. A computer program for document search, used for realizing:
appearance position storage means for storing, for each search target document and for each N-gram entry, appearance position information representing the appearance position of the N-gram as a bit vector in which a flag bit is set at the bit position corresponding to that position; and
document specifying means for collating a search keyword with the appearance position information and specifying a document including the search keyword,
wherein, when the search keyword is composed of two or more N-grams, a document including the search keyword is specified by determining that the flag bit positions of the bit vectors of the N-grams have the corresponding adjacency relationship.