JP4183767B2

JP4183767B2 - Character string search device and search method thereof

Info

Publication number: JP4183767B2
Application number: JP00241896A
Authority: JP
Inventors: 橋良浩高; 井恒順寺; 村昌義中; 木奏鈴; 藤理史佐
Original assignee: Nomura Research Institute Ltd
Current assignee: Nomura Research Institute Ltd
Priority date: 1996-01-10
Filing date: 1996-01-10
Publication date: 2008-11-19
Anticipated expiration: 2016-01-10
Also published as: JPH09190448A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキストファイル中から任意の文字列を検索する装置とその方法に係り、特に検索対象のテキストファイルについて機械的な文字列分割によってインデックスファイルを作成し、このインデックスファイルを用いて任意の文字列を効率よく検索することができる文字列検索装置およびその方法に関する。
【０００２】
【従来の技術】
一般に、コンピュータやワードプロセッサの分野では、テキストファイル（文書ファイル）から、任意の文字列を検索する技術は不可欠である。特に、最近ではコンピュータ等が取り扱うテキスト（文書）の量が膨大化しているので、確実かつ効率よく所定の文字列を検索する要求が高まっていた。
【０００３】
また、電子通信分野でも、通信ネットワーク上に多数のユーザーがメッセージを掲載するようになったので、掲載されている大量な情報の中から自分が求める情報を検索するために、大量なテキストから任意の文字列を効率よく検索する技術の開発が求められていた。
【０００４】
ところで、欧米の言語は、単語と単語の間にスペースが挿入されているので、このスペースを標識として一連のテキストから単語を抽出することが容易である。技術的に言えば、欧米の言語は一文字が一バイトになっており、スペースなどをそれを表すバイトによって検出することが簡単である。したがって、検索対象となるテキストについて、スペースを区切りとして単語と、テキストにおけるその単語の位置とを予め抽出してインデックスファイルを作成しておけば、そのインデックスファイルを参照することにより、検索しようとする単語や文字列の有無と、存在する場合の位置とを素早く検索することができる。
【０００５】
しかし、日本語や韓国語や中国語は、句読点があるものの、単語と単語が連続して文章を構成しており、かつ、一文字が複数バイトによって表される言語は（このことからこれらの言語をマルチバイト言語という）、インデックスファイルを作成するのが容易ではない。
【０００６】
すなわち、日本語等の場合は、文字列をいずれの位置で区切って単語として抽出するかは、単純なバイトの照合から判断することが困難である。このため、これらのマルチバイト言語は、辞書を用意しておき、文章の構文解析を行った後なければ、単語を単語として文字列から抽出することができない。
【０００７】
そこで、従来のマルチバイト言語の文字列検索は、主に以下の３つの方法のいずれかの方法によって行っていた。
【０００８】
(1) 単純検索による方法
ワードプロセッサの分野で一般に行われているように、インデックスファイルを作成することなく、文字列を検索するときは、テキスト全体について一致する文字列を検索する方法である。
【０００９】
(2) キーワード検索による方法
ある種のデータベースのように、所定のテキストに対してユーザが予めキーワードを指定することにより、インデックスファイルを作成しておき、そのインデックスファイルを利用して文字列を検索する方法である。
【００１０】
(3) インデックスファイルによる全文検索の方法
辞書を用意し、テキストを形態索解析等の手法を用いてテキスト全部を自動的に単語に分割してインデックスファイルを作成し、そのインデックスファイルを用いてテキストの全文から文字列を検索する方法である。
【００１１】
【発明が解決しようとする課題】
しかしながら、上記従来の文字列検索方法では、最近の、あるいは近い将来さらに顕著になる大テキストからの文字列検索を効率よく行うことはできなかった。
【００１２】
すなわち、上記単純検索の方法では、テキストの最初から逐一的に同一文字列を検索するので、大きなテキストを検索するには時間がかかり過ぎて実用に適していない。
【００１３】
次に、上記キーワード検索の方法では、ユーザーがテキストについてキーワードを入力しなければならないので、入力の時間と手間がかかる上に、入力したキーワード以外の文字列を検索するできないという問題があった。
【００１４】
最後に、上記インデックスファイルによる全文検索の方法では、テキストを構文解析するための時間がかかる上に、その構文解析によっても完全に単語を正確に分割することができなかった。たとえば、現在の構文解析の技術では「新党さきがけ」のような漢字とひらがなとからなる単語は、単語として抽出するのが困難であった。さらに、次々に生み出される新語を辞書に登録しなければ、新語を単語として抽出することができないので、継続的に辞書をメンテナンスしなければならなかった。
【００１５】
このため、検索するための準備を含めて、大量の文字を含むテキストから任意の文字列を簡単に検索する簡便な技術の開発が求められていた。
【００１６】
そこで、本願発明が解決しようとする課題は、辞書のメンテナンスや構文解析を行うことなく文字列検索のためのインデックスファイルを自動作成でき、このインデックスファイルを用いて任意の文字列を検索する文字列検索装置およびその検索方法を提供することにある。
【００１８】
【課題を解決するための手段】
本願発明に係る文字列検索装置は、
検索処理部とインデックス生成部とを有し、
前記インデックス生成部は、検索対象となるテキストを入力し、前記検索対象テキストを、前記検索処理部が入力した検索文字列の長さの文字列に文字配列をそのままに各文字を先頭に分割し、それぞれの分割された文字列にその文字列が前記検索対象テキストにおいて出現する位置に関する情報を付加してインデックスを作成し、これらのインデックスをソートしてインデックスファイルを作成し、
前記検索処理部は、前記インデックスの中から前記検索文字列と同一の文字列を検索する、ことを特徴とするものである。
また、本願発明に係る文字列検索方法は、
検索対象となるテキストと、検索文字列とを入力し、前記検索対象テキストを、前記検索文字列の長さの文字列に文字配列をそのままに各文字を先頭に分割し、それぞれの分割された文字列にその文字列が前記検索対象テキストにおいて出現する位置に関する情報を付加してインデックスを作成し、これらのインデックスをソートし、
前記インデックスの中から前記検索文字列と同一の文字列を検索する、ことを特徴とするものである。
【００１９】
【発明の実施の形態】
次に、本願発明の文字列検索装置およびその検索方法の実施形態について、添付の図面を用いて以下に説明する。
【００２０】
図１は、本発明による文字列検索装置の構成とその処理の流れを示したものである。図１に示すように、本発明による文字列検索装置１は、インデックス生成部２と検索処理部３とからなる。インデックス生成部２は、検索対象となるテキスト４を入力し、これを後述の方法で処理してインデックスファイル５を自動的に作成する。
【００２１】
一方、検索処理部３は、検索対象である検索文字列６を入力し、後述する検索のための処理を行って検索文字列を生成し、インデックスファイル５を参照することにより、検索文字列とその位置７を出力する。
【００２２】
次に上記インデックス生成部２と検索処理部３における処理をさらに説明する。
【００２３】
図２は、インデックス生成部２におけるインデックス生成のための処理の流れを示している。
図２に示すように、インデックス生成部２は、最初に検索対象となるテキストを入力し（ステップ１００）、これを固定長の文字列に分割する（ステップ１１０）。
【００２４】
つまり、インデックス生成部２は、検索対象となるテキストを入力すると、その構文（単語や助詞や接続詞等の別）に拘わらず、一定の長さの文字列（この文字列の長さを固定長という）に分割する。
【００２５】
たとえば、検索対象となるテキストを「辞書や単語分割機能を有する」とし、固定長をｎ＝３とすると、ステップ１１０では上記テキストを、
「辞書や」
「書や単」
「や単語」
「単語分」
「語分割」
「分割機」
「割機能」
「機能を」
「能を有」
「を有す」
「有する」
「する」
「る」
の１３個の固定長文字列に分割する。
【００２６】
次に上記固定長文字列にその出現する位置の情報、すなわち、検索対象テキストの最初の文字からその固定長文字列の先頭文字までの文字数を示す数値を付す（ステップ１２０）。
【００２７】
上記検索対象テキスト「辞書や単語分割機能を有する」の例で言えば、
「辞書や，０」
「書や単，１」
「や単語，２」
「単語分，３」
「語分割，４」
「分割機，５」
「割機能，６」
「機能を，７」
「能を有，８」
「を有す，９」
「有する，１０」
「する，１１」
「る，１２」
というように、各固定長文字列とその位置情報とをペアとして、１３個のインデックスを生成する。
【００２８】
なお、上記位置情報は検索対象テキストの最初の文字から固定長文字列の先頭文字までの文字数に限られず、検索対象テキストの末尾の文字からの文字数でもよく、また、一定の関数として与えてもよい。
【００２９】
次に、これらのインデックスをその先頭文字によって一定の順序に並べ替える（この操作をソートという）（ステップ１３０）。
【００３０】
上記インデックス「辞書や，０」，…，「る，１２」の例で言えば、
「する，１１」
「や単語，２」
「る，１２」
「を有す，９」
「割機能，６」
「機能を，７」
「語分割，４」
「辞書や，０」
「書や単，１」
「単語分，３」
「能を有，８」
「分割機，５」
「有する，１０」
というように、ソートする。ソートしたインデックスはインデックスファイルとして出力する（ステップ１４０）。
【００３１】
上記インデックス生成の処理で注目すべきことは、この処理方法によれば、インデックスを作成するのに、辞書を用意することもなく、また、困難な構文解析も行うことなく、機械的にテキストからインデックスを生成することができる点にある。このインデックスはソートによって後述するように検索が容易となる。
【００３２】
次に、上記インデックスの使用方法、すなわち、検索処理部３による処理を図３を用いて説明する。
図３に示すように、検索処理部３は、検索文字列を入力し（ステップ２００）、その長さを判断して、固定長と比較することによってその後の処理を振り分ける（ステップ２１０）。
【００３３】
最初に、検索文字列の長さｍと固定長ｎが等しい場合について説明する。
検索文字列の長さｍと固定長ｎが等しいときは、検索文字列と同一の文字列をインデックスファイルから検索する（ステップ２２０）。
【００３４】
たとえば、前記インデックスファイルを作成した「辞書や単語分割機能を有する」から、「語分割」という検索文字列を検索する場合がこれに該当する。
【００３５】
すなわち、検索文字列「語分割」の長さは３文字ゆえ、ｍ＝３となり、前述した固定長ｎ＝３と等しい（ｍ＝ｎ）。この場合は、前述したソートしたインデックスから同一の文字列を検索すればよい。インデックスの文字列には位置情報が付加されているので、その文字列の位置も知ることができる。
【００３６】
ここで注目すべきことは、前述したようにインデックスファイルはインデックスをソートしているので、全部を検索する必要がなく、「語」を先頭文字とする「語分割，４」なるインデックスを直ちに検索することができることである。これにより、従来の単純検索の方法に比べてはるかに効率的に検索することができる。
【００３７】
上記例ではインデックス「語分割，４」を得ることにより、テキストの最初の文字から４番目に「語分割」なる文字列が存在することを知ることができる。
【００３８】
次に、検索文字列の長さｍが固定長ｎより小さい場合について説明する。
検索文字列の長さｍが固定長ｎより小さいときは、検索文字列にワイルドカードを補充してワイルドカード文字列を作成し（ステップ２３０）、インデックスファイルから該当する文字列を検索する（ステップ２４０）。
【００３９】
たとえば、前記例の「辞書や単語分割機能を有する」から、「分割」という検索文字列を検索する場合がこれに該当する。この場合、検索文字列「分割」の長さは２文字ゆえ、ｍ＝２となり、前述した固定長ｎ＝３より小さい（ｍ＜ｎ）。
【００４０】
このときは、「分割＊」なるワイルドカード文字列をインデックスファイルから検索する。ここで「＊」がワイドカード文字であり、このワイルドカード文字に該当する部分は任意の文字であってよい。
【００４１】
上記インデックスファイル「辞書や，０」，…，「る，１２」の例で言えば、「分割＊」に該当する文字列として「分割機，５」なるインデックスを得ることができる。これによって、検索文字列「分割」はテキストの最初の文字から５番目に存在することを知ることができる。
【００４２】
ここで、注目すべきことは、ｍ＜ｎの場合、ワイルドカード文字＊は検索文字の後尾に付し、先頭文字によってソートされたインデックスの該当部分に直ちにアクセスことができることである。インデックスは、テキストの各文字を先頭として作成されているので、上述方法でも検索漏れを生じることがない。
【００４３】
最後に、検索文字列の長さｍが固定長ｎより大きい場合について説明する。
検索文字列の長さｍが固定長ｎより大きいときは、検索文字列を固定長文字列に分割し（ステップ２５０）、後述するフレーズ式を作成し（ステップ２６０）、インデックスファイルから該当するフレーズ式を検索する（ステップ２７０）。
【００４４】
たとえば、前記例の「辞書や単語分割機能を有する」から「単語分割機能を」という検索文字列を検索する場合がｍ＞ｎの場合に該当する。最初に「単語分割機能を」からフレーズ式を作成する。ここで、フレーズ式とは、文字列「○○○」と文字列「△△△」を含む検索文字列（これをフレーズという）において、文字列「○○○」と文字列「△△△」の先頭文字どうしがｐ文字離れて出現する場合に、これを「○○○」＜ｐ＞「△△△」と表し、この「○○○」＜ｐ＞「△△△」をフレーズ式という。なお、ｐがｎ（＝３）より小さい場合は、文字列「○○○」と文字列「△△△」の一部または全部が重複して場合であるが、これらも全く同一の方法によって上記フレーズ式に表すことができる。
【００４５】
上記文字列「○○○」と文字列「△△△」を含む検索文字列を検索するには、ｐ文字離れた「○○○」というインデックスと「△△△」というインデックスとを検索すればよい。
【００４６】
上記「辞書や単語分割機能を有する」から「単語分割機能を」という検索文字列を検索する例では、「単語分割機能を」から、
「単語分」＜３＞「割機能」＜１＞「機能を」
あるいは、「単語分」＜２＞「分割機」＜２＞「機能を」のようなフレーズ式を作成する。ここで、上記２つのフレーズ式は互いに等価であり、フレーズ式は検索文字列の全体をカバーしていればよい。
【００４７】
次にインデックス「辞書や，０」，…，「る，１２」から、上記フレーズ式に該当するインデックスを検索する。
【００４８】
これにより、インデックス「単語分，３」〜「機能を，７」が検索され、検索文字列は検索対象のテキストの最初の文字から３文字目に出現することを知ることができる。
【００４９】
上記フレーズ検索機能によれば、インデックス固定長より長い文字列も予め用意したインデックスファイルを用いて検索でき、インデックスファイルがソートされていることから、目的とする文字列を素早く検索することができる。
【００５０】
以上で上記実施形態の説明を終了するが、上記実施形態は、検索対象テキストについて予め固定長を定めてインデックスファイルを作成し、このインデックスファイルを用いて検索文字列を検索するものである。しかし、本発明の方法を用いれば、異なる検索方法も可能となる。以下にその検索方法について説明する。
【００５１】
上記異なる検索方法とは、予めインデックスファイルを作成することなく、検索する際に、検索文字列の長さに合わせて検索対象のテキストを分割する方法である。
【００５２】
この方法は、比較的少量、かつ、保存すべき期間が短いテキストに対しては有効なものである。
【００５３】
この方法によれば、所定の検索対象テキストに対して検索文字列を入力すると、その文字列の長さを固定長として検索対象テキストを分割してインデックスを作成する。
【００５４】
この場合、検索文字列の長さに満たないテキスト末端部のインデックスは作成を省略する。このようなインデックスは検索文字列と一致しないことが明らかだからである。
【００５５】
たとえば、「辞書や単語分割機能を有する」というテキストから「単語分割機能を有する」という文字列を検索しようとする場合、固定長ｎ＝１０（＝ｍ）として、「辞書や単語分割機能を有する」から、下記のインデックスを作成する。
【００５６】
「辞書や単語分割機能を，０」
「書や単語分割機能を有，１」
「や単語分割機能を有す，２」
「単語分割機能を有する，３」
このとき、９文字以下のインデックス、すなわち、「語分割機能を有する，４」〜「る，１２」を作成する必要がない。これらには検索文字列は含まれていないことが明らかだからである。
【００５７】
次にインデックスをソートする。上記例では、下記のように並べ変える。
【００５８】
「や単語分割機能を有す，２」
「辞書や単語分割機能を，０」
「書や単語分割機能を有，１」
「単語分割機能を有する，３」
この状態で検索文字列「単語分割機能を有する」と同一の文字列をインデックスから検索すれば、求める文字列の位置を知ることができる。上記例では文字列「単語分割機能を有する」がテキストの最初の文字から３文字目に存在することを知ることができる。
【００５９】
ここで注目すべき点は、この方法によれば、すべてのテキストについてインデックスを作成しておく必要がなく、記憶装置の利用効率を高くすることができるという点と、長い検索文字列を検索する場合、テキストの後尾のｎより短い文字列についてインデックスを作成する必要がなく、インデックスの作成が簡単であり、かつ、文字列の検索方法が同一文字列の発見であるのできわめて簡単である点にある。
【００６０】
したがって、比較的少量、かつ、保存すべき期間が短いテキストに対しては有効な検索方法であり、特に、長い文字列を検索する場合にきわめて効率よく検索することができる。
【００６１】
【発明の効果】
以上の説明から明らかなように、本発明による文字列検索装置およびその検索方法は、検索対象となるテキストの構文解析することなく、したがって構文解析のための辞書を用意することなく、機械的に一定長の文字列からなるインデックスを作成して、これらのインデックスをソートしておくことができる。
【００６２】
このインデックスを利用し、文字列を検索するときは、同一文字列検索、ワイルドカード文字列検索、フレーズ式検索のいずれかによって、検索文字列の長さに拘わらず、任意の長さの検索文字列を検索することができる。
【００６３】
これにより、大容量のテキストを多数検索する場合も、機械的な処理によってインデックスを作成しておき、任意の文字列を素早く、かつ、確実に検索することが可能となる。
【図面の簡単な説明】
【図１】本願発明による文字列検索装置の構成とその処理の流れを示したブロック図。
【図２】本願発明による文字列検索装置のインデックス生成部における処理を示したフローチャート。
【図３】本願発明による文字列検察装置の検索処理部における処理を示したフローチャート。
【符号の説明】
１文字列検察装置
２インデックス生成部
３検索処理部
４検索対象となるテキスト
５インデックスファイル
６検索文字列
７検索しようとする文字列とその位置
[0001]
[Technical field of the invention]
The present invention relates to an apparatus and a method for searching a text file for an arbitrary character string. In particular, it relates to a character string search apparatus and method that create an index file by mechanically dividing the text file to be searched into character strings, and that use this index file to search for an arbitrary character string efficiently.
[0002]
[Prior art]
In general, in the field of computers and word processors, a technique for retrieving an arbitrary character string from a text file (document file) is indispensable. In particular, recently, since the amount of text (documents) handled by computers and the like has become enormous, there has been an increasing demand for searching for a predetermined character string reliably and efficiently.
[0003]
Also, in the field of electronic communication, many users now post messages on communication networks. To let users find the information they want among the large amount of information posted, a technology was needed that efficiently searches a large amount of text for an arbitrary character string.
[0004]
In Western languages, a space is inserted between words, so it is easy to extract words from a run of text using the space as a marker. Technically speaking, in Western languages one character is one byte, and a space or similar delimiter is easily detected by the byte that represents it. Therefore, if an index file is created in advance by extracting each word, delimited by spaces, together with its position in the text to be searched, then the presence or absence of a word or character string, and its position if it exists, can be found quickly by referring to the index file.
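The space-delimited word index described in this paragraph can be sketched as follows. This is only an illustration of the prior-art idea: the function name `word_index` and the sample sentence are assumptions introduced here and do not appear in the patent.

```python
import re


def word_index(text: str) -> dict[str, list[int]]:
    """Map each space-delimited word to the character offsets where it starts."""
    index: dict[str, list[int]] = {}
    # \S+ matches a maximal run of non-space characters, i.e. one word.
    for match in re.finditer(r"\S+", text):
        index.setdefault(match.group(), []).append(match.start())
    return index


sample_index = word_index("to be or not to be")
print(sample_index["be"])  # [3, 16]
```

Looking up a word is then a single dictionary access, which is the efficiency the paragraph attributes to index files for space-delimited languages.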
[0005]
However, although Japanese, Korean, and Chinese have punctuation marks, their sentences consist of words written one after another without separation, and each character is represented by multiple bytes (for this reason these languages are called multibyte languages). Creating an index file for such languages is not easy.
[0006]
That is, in the case of Japanese and similar languages, it is difficult to determine from simple byte comparison where a character string should be divided to extract words. For this reason, in these multibyte languages, a word cannot be extracted from a character string as a word unless a dictionary is prepared and the sentence is syntactically analyzed.
[0007]
Therefore, the conventional multi-byte language character string search is mainly performed by one of the following three methods.
[0008]
(1) Simple search
As is generally done in word processors, no index file is created; when a character string is searched for, the entire text is scanned for a matching character string.
[0009]
(2) Keyword search
As in certain kinds of databases, the user specifies keywords for a given text in advance to create an index file, and character strings are searched for using that index file.
[0010]
(3) Full-text search using an index file
A dictionary is prepared, the entire text is automatically divided into words using a technique such as morphological analysis to create an index file, and a character string is searched for in the full text using that index file.
[0011]
[Problems to be solved by the invention]
However, the conventional character string search methods described above cannot efficiently search the large texts that have appeared recently and will become even more common in the near future.
[0012]
That is, in the above simple search method, the same character string is searched one by one from the beginning of the text. Therefore, it takes too much time to search for a large text and is not suitable for practical use.
[0013]
Next, in the above keyword search method, since the user has to input a keyword for text, it takes time and effort to input, and there is a problem that a character string other than the input keyword cannot be searched.
[0014]
Finally, in the full-text search method using an index file, syntactic analysis of the text takes time, and even with that analysis the text cannot be divided into words with complete accuracy. For example, with current analysis technology it has been difficult to extract as a single word a word made up of kanji and hiragana such as "新党さきがけ" (Shinto Sakigake, "New Party Sakigake"). Furthermore, newly coined words cannot be extracted as words unless they are registered in the dictionary, so the dictionary must be maintained continuously.
[0015]
For this reason, there has been a demand for a simple technique, covering the preparation for searching as well, for easily searching text containing a large number of characters for an arbitrary character string.
[0016]
Thus, the problem to be solved by the present invention is to provide a character string search device and search method that can automatically create an index file for character string search without dictionary maintenance or syntactic analysis, and that search for an arbitrary character string using this index file.
[0018]
[Means for Solving the Problems]
The character string search device according to the present invention
has a search processing unit and an index generation unit.
The index generation unit receives text to be searched; divides the search target text into character strings having the length of the search character string input to the search processing unit, preserving the character order and starting a string at each character; creates indexes by adding, to each divided character string, information on the position at which that character string appears in the search target text; and creates an index file by sorting these indexes.
The search processing unit searches the indexes for a character string identical to the search character string.
In addition, the character string search method according to the present invention
receives text to be searched and a search character string; divides the search target text into character strings having the length of the search character string, preserving the character order and starting a string at each character; creates indexes by adding, to each divided character string, information on the position at which that character string appears in the search target text; sorts these indexes;
and searches the indexes for a character string identical to the search character string.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of a character string search device and a search method thereof according to the present invention will be described below with reference to the accompanying drawings.
[0020]
FIG. 1 shows the configuration of a character string search apparatus according to the present invention and the flow of processing. As shown in FIG. 1, a character string search device 1 according to the present invention includes an index generation unit 2 and a search processing unit 3. The index generation unit 2 inputs the text 4 to be searched and processes this by a method described later to automatically create an index file 5.
[0021]
The search processing unit 3, on the other hand, receives a search character string 6 to be searched for, processes it for searching as described later, and, by referring to the index file 5, outputs the searched character string and its position 7.
[0022]
Next, processing in the index generation unit 2 and the search processing unit 3 will be further described.
[0023]
FIG. 2 shows a flow of processing for index generation in the index generation unit 2.
As shown in FIG. 2, the index generation unit 2 first inputs text to be searched (step 100) and divides it into fixed-length character strings (step 110).
[0024]
In other words, when the text to be searched is input, the index generation unit 2 divides it into character strings of a constant length (this length is called the fixed length), regardless of its syntax (the distinction between words, particles, conjunctions, and so on).
[0025]
For example, if the text to be searched is "辞書や単語分割機能を有する" ("having dictionary and word division functions") and the fixed length is n = 3, step 110 divides the text into the following 13 fixed-length character strings:
"辞書や"
"書や単"
"や単語"
"単語分"
"語分割"
"分割機"
"割機能"
"機能を"
"能を有"
"を有す"
"有する"
"する"
"る"
[0026]
Next, information on the position at which the fixed-length character string appears, that is, a numerical value indicating the number of characters from the first character of the search target text to the first character of the fixed-length character string is added (step 120).
[0027]
For the search target text "辞書や単語分割機能を有する" above, this yields
"辞書や, 0"
"書や単, 1"
"や単語, 2"
"単語分, 3"
"語分割, 4"
"分割機, 5"
"割機能, 6"
"機能を, 7"
"能を有, 8"
"を有す, 9"
"有する, 10"
"する, 11"
"る, 12"
In this way, 13 indexes are generated, each pairing a fixed-length character string with its position information.
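Steps 110 and 120 can be sketched in a few lines. The example text is the one used in the original Japanese description, 辞書や単語分割機能を有する; the function name `make_entries` is an assumption made for this illustration. The sketch relies on Python strings being sequences of characters rather than bytes, so multibyte characters need no special handling.

```python
def make_entries(text: str, n: int) -> list[tuple[str, int]]:
    """Pair the string of up to n characters starting at each character
    with the offset of its first character from the start of the text."""
    # Slicing past the end of the string simply yields the shorter tail
    # strings, which reproduces the last two entries of the example.
    return [(text[i:i + n], i) for i in range(len(text))]


text = "辞書や単語分割機能を有する"
entries = make_entries(text, 3)
for gram, position in entries:
    print(f"{gram}, {position}")
```

One entry is produced per character of the text, so a 13-character text gives 13 entries regardless of n.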
[0028]
The position information is not limited to the number of characters from the first character of the search target text to the first character of the fixed-length character string; it may instead be the number of characters counted from the last character of the search target text, or it may be given by some fixed function.
[0029]
Next, these indexes are rearranged in a certain order by the first character (this operation is called sorting) (step 130).
[0030]
For the indexes "辞書や, 0", ..., "る, 12" above, the result of sorting is
"する, 11"
"や単語, 2"
"る, 12"
"を有す, 9"
"割機能, 6"
"機能を, 7"
"語分割, 4"
"辞書や, 0"
"書や単, 1"
"単語分, 3"
"能を有, 8"
"分割機, 5"
"有する, 10"
The sorted indexes are output as an index file (step 140).
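Index generation through step 140 can be sketched as below. The description does not state which character ordering is used for the sort, so this sketch assumes Python's default ordering by Unicode code point; the kanji entries therefore come out in a different order from the listing above, although the hiragana entries still come first. Any consistent total order serves the purpose. The name `build_index` is likewise an assumption.

```python
def build_index(text: str, n: int) -> list[tuple[str, int]]:
    """Return the (string, position) entries for `text`, sorted so that
    entries sharing a leading character are adjacent."""
    entries = [(text[i:i + n], i) for i in range(len(text))]
    # Tuples sort by string first (code point order, an assumption of this
    # sketch) and then by position, which keeps duplicates in text order.
    return sorted(entries)


index = build_index("辞書や単語分割機能を有する", 3)
for gram, position in index:
    print(f"{gram}, {position}")
```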
[0031]
What is notable about the above index generation process is that indexes can be generated mechanically from the text without preparing a dictionary and without performing difficult syntactic analysis. Because the indexes are sorted, searching them is easy, as described later.
[0032]
Next, the method of using the index, that is, the processing by the search processing unit 3 will be described with reference to FIG.
As shown in FIG. 3, the search processing unit 3 inputs a search character string (step 200), determines its length, and distributes subsequent processing by comparing it with a fixed length (step 210).
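The branching in step 210 amounts to a three-way comparison of the search string length m with the fixed length n. A minimal sketch follows; the function name `choose_search_mode` and the returned labels are hypothetical and serve only to name the three branches described in the following paragraphs.

```python
def choose_search_mode(query: str, n: int) -> str:
    """Pick the search procedure by comparing m = len(query) with n."""
    m = len(query)
    if m == n:
        return "exact"      # step 220: look up the identical string
    if m < n:
        return "wildcard"   # steps 230-240: pad with wildcards, prefix match
    return "phrase"         # steps 250-270: split into a phrase expression


print(choose_search_mode("語分割", 3))  # exact
```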
[0033]
First, the case where the length m of the search character string is equal to the fixed length n will be described.
If the length m of the search character string is equal to the fixed length n, the same character string as the search character string is searched from the index file (step 220).
[0034]
For example, searching for the search character string "語分割" ("word division") in the text "辞書や単語分割機能を有する", from which the index file above was created, is such a case.
[0035]
That is, since the search character string "語分割" is 3 characters long, m = 3, which is equal to the fixed length n = 3 described above (m = n). In this case, it suffices to search the sorted indexes described above for the identical character string. Since position information is attached to each index string, the position of the character string is also obtained.
[0036]
What should be noted here is that, because the index file holds the indexes in sorted order as described above, there is no need to examine all of them; the index "語分割, 4", whose first character is "語", can be located immediately. As a result, the search is far more efficient than the conventional simple search method.
[0037]
In the above example, obtaining the index "語分割, 4" shows that the character string "語分割" exists at position 4 counted from the first character of the text.
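One way to exploit the sorted order for the m = n case is binary search. The sketch below uses Python's standard `bisect` module; the patent does not name a particular lookup algorithm, so binary search, the code point sort order, and the names `build_index` and `find_exact` are all assumptions of this illustration.

```python
from bisect import bisect_left


def build_index(text: str, n: int) -> list[tuple[str, int]]:
    """Sorted (string, position) entries, one per character of the text."""
    return sorted((text[i:i + n], i) for i in range(len(text)))


def find_exact(index: list[tuple[str, int]], query: str) -> list[int]:
    """Return the positions of all entries whose string equals `query`."""
    # (query,) sorts before every (query, position) tuple, so bisect_left
    # lands on the first matching entry if there is one.
    i = bisect_left(index, (query,))
    hits = []
    while i < len(index) and index[i][0] == query:
        hits.append(index[i][1])
        i += 1
    return hits


index = build_index("辞書や単語分割機能を有する", 3)
print(find_exact(index, "語分割"))  # [4]
```

Only a logarithmic number of entries is inspected before the match is reached, in contrast to scanning the whole text.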
[0038]
Next, a case where the length m of the search character string is smaller than the fixed length n will be described.
When the length m of the search character string is smaller than the fixed length n, a wildcard character string is created by padding the search character string with wildcards (step 230), and the matching character strings are searched for in the index file (step 240).
[0039]
For example, searching for the search character string "分割" ("division") in "辞書や単語分割機能を有する" from the above example is such a case. Since the search character string "分割" is two characters long, m = 2, which is smaller than the fixed length n = 3 described above (m < n).
[0040]
In this case, the wildcard character string "分割*" is searched for in the index file. Here "*" is the wildcard character, and the position it occupies may be any character.
[0041]
In the example index file "辞書や, 0", ..., "る, 12", the index "分割機, 5" is obtained as the character string matching "分割*". This shows that the search character string "分割" exists at position 5 counted from the first character of the text.
[0042]
What should be noted here is that, when m < n, the wildcard character * is appended to the end of the search character string, so the relevant part of the indexes, which are sorted by first character, can be accessed immediately. Since an index is created starting at every character of the text, this method does not miss any occurrence.
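Because the wildcard is only ever appended at the end, the m < n case reduces to a prefix lookup in the sorted indexes. The sketch below shows one way to do this with the standard `bisect` module, using the example text 辞書や単語分割機能を有する from the original description; the names `build_index` and `find_prefix` and the code point sort order are assumptions of this illustration.

```python
from bisect import bisect_left


def build_index(text: str, n: int) -> list[tuple[str, int]]:
    """Sorted (string, position) entries, one per character of the text."""
    return sorted((text[i:i + n], i) for i in range(len(text)))


def find_prefix(index: list[tuple[str, int]], query: str) -> list[int]:
    """Return positions of all entries whose string starts with `query`."""
    # Strings sharing a prefix are contiguous in sorted order and none of
    # them sorts before the prefix itself, so start at the insertion point.
    i = bisect_left(index, (query,))
    hits = []
    while i < len(index) and index[i][0].startswith(query):
        hits.append(index[i][1])
        i += 1
    return sorted(hits)


index = build_index("辞書や単語分割機能を有する", 3)
print(find_prefix(index, "分割"))  # [5]
```

The short tail entries ("する", "る" in the example) are what allow a short query at the very end of the text to be found.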
[0043]
Finally, a case where the length m of the search character string is larger than the fixed length n will be described.
When the length m of the search character string is larger than the fixed length n, the search character string is divided into fixed-length character strings (step 250), a phrase expression, described below, is created (step 260), and the indexes matching the phrase expression are searched for in the index file (step 270).
[0044]
For example, searching for the search character string "単語分割機能を" ("word division function", followed by the object particle を) in "辞書や単語分割機能を有する" from the above example is a case where m > n. First, a phrase expression is created from "単語分割機能を". A phrase expression is defined as follows: in a search character string (called a phrase) that contains a character string "○○○" and a character string "△△△", when the first characters of the two strings appear p characters apart, this is written "○○○" <p> "△△△", and this notation is called a phrase expression. When p is smaller than n (= 3), the character strings "○○○" and "△△△" overlap partly or entirely, but such cases can be written as phrase expressions in exactly the same way.
[0045]
To search for a search character string containing the character string "○○○" and the character string "△△△", it suffices to find an index "○○○" and an index "△△△" whose positions are p characters apart.
[0046]
In the example of searching for the search character string "単語分割機能を" in "辞書や単語分割機能を有する", a phrase expression such as
"単語分" <3> "割機能" <1> "機能を"
or "単語分" <2> "分割機" <2> "機能を" is created from "単語分割機能を". These two phrase expressions are equivalent; a phrase expression only needs to cover the entire search character string.
[0047]
Next, the indexes matching the phrase expression are searched for among the indexes "辞書や, 0", ..., "る, 12".
[0048]
As a result, the indexes "単語分, 3" through "機能を, 7" are found, which shows that the search character string appears at position 3 counted from the first character of the search target text.
[0049]
According to the phrase search function, character strings longer than the fixed index length can be searched using an index file prepared in advance, and the target character string can be searched quickly because the index file is sorted.
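The phrase search for m > n can be sketched as follows, using the example text 辞書や単語分割機能を有する from the original description. Two simplifications are assumed here: each piece is stored with its offset from the start of the query rather than with the gap <p> to the previous piece, and the pieces are chosen greedily with one overlapping piece at the end, which for the example reproduces the first phrase expression given above. The names `phrase_expression` and `find_phrase` are hypothetical.

```python
from bisect import bisect_left


def build_index(text, n):
    """Sorted (string, position) entries, one per character of the text."""
    return sorted((text[i:i + n], i) for i in range(len(text)))


def find_exact(index, query):
    """Positions of all entries whose string equals `query`."""
    i = bisect_left(index, (query,))
    hits = []
    while i < len(index) and index[i][0] == query:
        hits.append(index[i][1])
        i += 1
    return hits


def phrase_expression(query, n):
    """Split `query` into n-character pieces covering it completely.
    Each piece is returned with its offset from the start of the query."""
    if len(query) < n:
        raise ValueError("phrase search needs a query of at least n characters")
    offsets = list(range(0, len(query) - n + 1, n))
    if offsets[-1] != len(query) - n:
        offsets.append(len(query) - n)  # overlapping last piece covers the tail
    return [(query[o:o + n], o) for o in offsets]


def find_phrase(index, query, n):
    """Start positions where every piece occurs at its expected offset."""
    candidates = None
    for piece, offset in phrase_expression(query, n):
        starts = {p - offset for p in find_exact(index, piece)}
        candidates = starts if candidates is None else candidates & starts
    return sorted(candidates)


index = build_index("辞書や単語分割機能を有する", 3)
print(phrase_expression("単語分割機能を", 3))
print(find_phrase(index, "単語分割機能を", 3))  # [3]
```

Since the pieces together cover every character of the query, a start position that survives all intersections is a genuine occurrence of the whole query.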
[0050]
This completes the description of the above embodiment. In that embodiment, an index file is created for the search target text with a fixed length determined in advance, and a search character string is searched for using the index file. However, the method of the present invention also makes a different search method possible, which is described below.
[0051]
This different search method does not create an index file in advance; instead, at search time, the text to be searched is divided according to the length of the search character string.
[0052]
This method is effective for texts that are relatively small and need to be retained only for a short period.
[0053]
According to this method, when a search character string is input to a predetermined search target text, the search target text is divided with the length of the character string as a fixed length, and an index is created.
[0054]
In this case, creation of an index at the end of the text that is less than the length of the search character string is omitted. This is because it is clear that such an index does not match the search character string.
[0055]
For example, when the character string "単語分割機能を有する" ("having a word division function") is to be searched for in the text "辞書や単語分割機能を有する", the fixed length is set to n = 10 (= m) and the following indexes are created from "辞書や単語分割機能を有する".
[0056]
"辞書や単語分割機能を, 0"
"書や単語分割機能を有, 1"
"や単語分割機能を有す, 2"
"単語分割機能を有する, 3"
At this time, the indexes of 9 characters or fewer, that is, "語分割機能を有する, 4" through "る, 12", need not be created, because they clearly cannot contain the search character string.
[0057]
Next, the indexes are sorted. In the above example, they are rearranged as follows.
[0058]
"や単語分割機能を有す, 2"
"辞書や単語分割機能を, 0"
"書や単語分割機能を有, 1"
"単語分割機能を有する, 3"
If the indexes in this state are searched for a character string identical to the search character string "単語分割機能を有する", the position of the desired character string is obtained. In the above example, the character string "単語分割機能を有する" is found to exist at position 3 counted from the first character of the text.
[0059]
What should be noted here is the following. With this method, indexes need not be created and kept for every text, so the storage device is used more efficiently. When a long search character string is searched for, no indexes need to be created for the strings shorter than n at the tail of the text, so index creation is simple. And because the search consists only of finding an identical character string, the search itself is extremely simple.
[0060]
Therefore, it is an effective search method for a relatively small amount of text that has a short storage period, and can be searched very efficiently particularly when searching for a long character string.
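The on-demand variant can be sketched as below, using the example text 辞書や単語分割機能を有する and search string 単語分割機能を有する from the original description. The function names are assumptions, as are the code point sort order and the use of binary search for the lookup.

```python
from bisect import bisect_left


def build_query_length_index(text: str, m: int) -> list[tuple[str, int]]:
    """Index only the full m-character strings; shorter tail strings are
    skipped because they cannot equal an m-character query."""
    return sorted((text[i:i + m], i) for i in range(len(text) - m + 1))


def search_on_demand(text: str, query: str) -> list[int]:
    """Build a temporary index with n = len(query) and look the query up."""
    m = len(query)
    if m == 0 or m > len(text):
        return []
    index = build_query_length_index(text, m)
    i = bisect_left(index, (query,))
    hits = []
    while i < len(index) and index[i][0] == query:
        hits.append(index[i][1])
        i += 1
    return hits


text = "辞書や単語分割機能を有する"
print(len(build_query_length_index(text, 10)))    # 4
print(search_on_demand(text, "単語分割機能を有する"))  # [3]
```

The index is discarded after each query, which is the storage trade-off the description attributes to this method.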
[0061]
[Effects of the invention]
As is clear from the above description, the character string search device and search method according to the present invention can mechanically create indexes consisting of character strings of a fixed length and keep them sorted, without syntactically analyzing the text to be searched and therefore without preparing a dictionary for such analysis.
[0062]
When a character string is searched for using these indexes, a search character string of any length can be found by identical character string search, wildcard character string search, or phrase expression search, regardless of its length.
[0063]
As a result, even when a large number of large-capacity texts are searched, an index is created by mechanical processing, and an arbitrary character string can be searched quickly and reliably.
[Brief description of the drawings]
FIG. 1 is a block diagram showing the configuration of a character string search device according to the present invention and the flow of its processing.
FIG. 2 is a flowchart showing processing in the index generation unit of the character string search device according to the present invention.
FIG. 3 is a flowchart showing processing in the search processing unit of the character string search device according to the present invention.
[Explanation of symbols]
1 Character string search device
2 Index generation unit
3 Search processing unit
4 Text to be searched
5 Index file
6 Search character string
7 Character string to be searched for and its position

Claims

A character string search device comprising a search processing unit and an index generation unit, wherein
the index generation unit receives text to be searched; divides the search target text into character strings having the length of the search character string input to the search processing unit, preserving the character order and starting a string at each character; creates indexes by adding, to each divided character string, information on the position at which that character string appears in the search target text; and creates an index file by sorting these indexes; and
the search processing unit searches the indexes for a character string identical to the search character string.

A character string search method comprising: receiving text to be searched and a search character string; dividing the search target text into character strings having the length of the search character string, preserving the character order and starting a string at each character; creating indexes by adding, to each divided character string, information on the position at which that character string appears in the search target text; sorting these indexes; and
searching the indexes for a character string identical to the search character string.