JPH09212524A

JPH09212524A - Entire sentence retrieval method and electronic dictionary formation device

Info

Publication number: JPH09212524A
Application number: JP8035501A
Authority: JP
Inventors: Emi Ikeda; 恵美池田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-01-30
Filing date: 1996-01-30
Publication date: 1997-08-15

Abstract

PROBLEM TO BE SOLVED: To reduce retrieval time and to keep the retrieval time constant for any character string. SOLUTION: An index is expressed by the set of low-order bytes (LOW vector) and the set of high-order bytes (HIGH vector) of the character codes in a character string. When an arbitrary search string is retrieved, its first character is collated first (steps S1 and S2). If no index entry exists for it, the search concludes at this point that there is no matching location. Next, the sets of low-order bytes and high-order bytes of the search string are prepared, and collation with the LOW vector is executed first (steps S3 and S5). If the LOW vector does not match, the search concludes that there is no matching location. If it matches, collation with the HIGH vector is executed (steps S6 and S7). If the HIGH vector matches and holds information on the matching location, that location is fetched and the retrieval ends.

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention: The present invention relates to a full-text search method for finding character strings in a search document, and to an electronic dictionary device for looking up registered words in an electronic dictionary.

[0002]

2. Description of the Related Art: With the spread of computers and word processors, large volumes of documents are being digitized, and the spread of electronic publishing on CD-ROM, electronic mail, and network news has made this information available for use. As digitization progresses and the volume of electronic documents to be handled grows, various means of handling them efficiently and easily become necessary. One of these is information retrieval technology, which quickly finds only the needed documents among scattered ones. Demand for this technology is strong because it makes better use of electronic document resources and improves work efficiency. To meet this demand, full-text search systems are attracting attention as a replacement for keyword search, in which a search uses keywords assigned to each document in advance.

[0003] In addition, as large volumes of documents are digitized, there are more occasions to read electronic documents, and document editing is increasingly done on computers. This has created a need to consult dictionaries on the computer.

[0004]

PROBLEMS TO BE SOLVED BY THE INVENTION

The simplest idea for the full-text search system described above is to find the search pattern string by scanning the documents sequentially. Various fast string-matching algorithms have been researched and developed, but because they scan sequentially, they slow down as the amount of data grows and are not suited to large-scale search systems.

[0005] To speed up searching, full-text search methods that generate an auxiliary search file and search with it are attracting attention. Implementations using such an auxiliary file fall broadly into two types. The first extracts words and phrases to build an index, like the index of a book; research is also under way on post-processing the index to compress its size. This method is very effective for languages such as English, where words are written separately, but for Japanese, morphological analysis is first needed to extract the words. Morphological analysis is ambiguous, however, and ultimately requires manual checking, so fully automatic system generation is difficult. Moreover, because this method can search only for the words and phrases registered in the index, a character string that actually exists may not be found, depending on how it is specified. For example, if a compound noun made up of two or more nouns has been indexed under each of its component nouns, specifying the compound noun itself at search time finds nothing.

[0006] The second method treats a document simply as a sequence of characters and allows any character string to be searched. Compared with the first method, the index tends to be larger, but searching feels natural (a whole sentence can be used as the query, for example) and few matches are missed. The user can search with whatever phrase comes to mind, and the compound-noun problem described above is also solved. Some conventional systems, however, use a tree structure for the index, and because the depth of the tree affects search time, it has been difficult to keep the search speed constant. For this reason, a full-text search method that can keep the search time constant has been desired.

[0007] Many conventional electronic dictionaries register words in a way that resembles looking a word up in Japanese syllabary order: entries are sorted simply by the magnitude of their codes, as in a Japanese-language dictionary, or distributed by hashing, and the meaning is then retrieved by string matching against these entries. With such a classification, however, words that begin with the same character string, such as 電子計算機 (electronic computer) and 電子化辞書 (electronic dictionary), which both begin with 電子, are classified close together, and distinguishing them takes time. For this reason, an electronic dictionary device has been desired that retrieves registered words quickly and whose search time is constant regardless of the word.

[0008]

MEANS FOR SOLVING THE PROBLEMS

The present invention adopts the following structures to solve the problems described above. <Structure of Claim 1> The index of a character string appearing in a search document is represented by the set of low-order bytes and the set of high-order bytes of the codes of the characters in that string, and information indicating the location of the string in the search document is held in association with these sets. When a search is performed with an arbitrary search string, the set of low-order bytes and the set of high-order bytes of the codes of the characters in the search string are created, the index is consulted with these sets, and the matching location in the search document is retrieved.

[0009] <Explanation of Claim 1> In a two-byte Japanese character, the high-order byte carries information on the character type, and the low-order byte is assigned as the code that identifies the character itself. Take the full-width characters モ and ジ as an example. Treated as two-byte units, their character codes are 2562 and 2538 (hexadecimal), but the matching element can be narrowed down faster by using 6238 and 2525, formed by concatenating the low-order bytes and the high-order bytes respectively. Character frequency statistics also show that hiragana and katakana are used far more often than kanji, and among kanji, JIS level-1 kanji appear in real documents far more often than JIS level-2 kanji. Consequently, in combinations of characters, the high-order bytes of the characters are often identical.
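The byte regrouping described in this paragraph can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the function name `split_bytes` is hypothetical, and the two codes are the hexadecimal JIS codes given in the description for モ and ジ.

```python
def split_bytes(codes):
    """Regroup 16-bit character codes into (low bytes, high bytes)."""
    low = bytes(c & 0xFF for c in codes)   # low-order byte of each character
    high = bytes(c >> 8 for c in codes)    # high-order byte of each character
    return low, high

# Codes taken from the description: モ = 0x2562, ジ = 0x2538
low, high = split_bytes([0x2562, 0x2538])
print(low.hex(), high.hex())  # 6238 2525
```

The high-order sequence is the same for any pair of katakana, while the low-order sequence differs from pair to pair, which is why the description checks the low-order bytes first.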

[0010] The invention of Claim 1 therefore focuses on the low-order bytes, which carry most of the distinguishing information of the characters: a character string is converted into low-order byte sets and high-order byte sets of, for example, four characters each, and the search proceeds on these sets. This spreads out the distribution of combinations, narrows down candidates quickly, and improves search speed.

[0011] Four characters per set was chosen because, for ordinary documents, four characters are enough to narrow the candidates considerably, and because this size is also suitable as a processing unit for the device that executes the search. The number is not limited to four, however, and can be set as appropriate.

[0012] <Structure of Claim 2> In the full-text search method of Claim 1, the search is performed first with the set of low-order bytes of the character codes in the search string, and then with the set of high-order bytes.

[0013] <Explanation of Claim 2> In the invention of Claim 2, for モ and ジ for example, the low-order bytes 6238 are checked first. If no matching set of low-order bytes is found, the result "no matching location" is obtained at that point. That is, although countless combinations can follow モ, few character strings are actually expected to begin with the low-order byte sequence 6238, so checking 6238 first nearly settles these two characters in a single check.

[0014] <Structure of Claim 3> In the full-text search method of Claim 1 or 2, the search first consults the index with the first character of the search string and then consults the index with the remaining characters.

[0015] <Explanation of Claim 3> In the invention of Claim 3, the first character of the string is hashed, a new string is formed from the remaining characters, and the candidates are narrowed down from there. This makes the search faster than checking a combination of several characters from the outset. The "first character" here is normally the leading character of the string, but it is not limited to this; any specific character, such as the last one, may be used.

[0016] <Structure of Claim 4> In the full-text search method of any one of Claims 1 to 3, the index is a tree structure whose root is the first character of each character string appearing in the search document.

[0017] <Explanation of Claim 4> In the invention of Claim 4, the index is a tree, and the second and subsequent characters are represented as sets of low-order bytes and high-order bytes. As a result, the depth of the nodes below the root is more evenly distributed, and fewer branches are needed before a match is settled. Less work is spent traversing the index tree, and search speed improves.

[0018] <Structure of Claim 5> In the full-text search method of Claim 3 or 4, when the LOW vector of the index, which is a set of low-order bytes of a character string appearing in the search document, is consulted, it is consulted first with one character and then with the remaining characters.

[0019] <Explanation of Claim 5> In the invention of Claim 5, as in Claim 3, the first character within the LOW vector, that is, the second character of the search string, is hashed; a new string is formed from the remaining characters, and the candidates are narrowed down. In the tree structure, the LOW vector for the second to fifth characters in particular is expected to contain many elements. Using the second character of the search string as a hash value therefore speeds up access and makes the search faster than checking a combination of several characters within the LOW vector.

[0020] <Structure of Claim 6> The device comprises a true meaning registration database, which holds the meaning of each registered word together with the length of the character string of that meaning, and a search unit. The search unit is provided with an index in which the character codes of the registered words in the true meaning registration database are represented by sets of low-order bytes and high-order bytes. When a search word is given, the search unit creates the set of low-order bytes and the set of high-order bytes of the codes of the characters in the search word, consults the index with these sets, and retrieves the registered word corresponding to the search word.

[0021] <Explanation of Claim 6> The invention of Claim 6 uses the same full-text search method as the inventions of Claims 1 to 5 to retrieve registered words from the true meaning registration database. Given an arbitrary search word, the search unit creates the set of its low-order bytes and the set of its high-order bytes and collates these sets with the index.

[0022] Candidates are therefore narrowed down quickly, search time is shortened, and any word can be searched at a uniform speed. In addition, because the length of each meaning is stored with the registered word in the true meaning registration database, the amount of data involved is known in advance when, for example, search results are transferred or displayed, so access is fast.

[0023]

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiments of the present invention are described in detail below with reference to the drawings. First, the principle of the full-text search method and electronic dictionary device of the present invention is explained. The JIS code alone defines roughly 10,000 characters for use in Japanese documents, and Japanese text is made up of combinations of these many characters. Even a fairly short character string is therefore, in probabilistic terms, one occurrence out of an enormous number of possible combinations. If the pattern of character combinations is used as a clue, it should be easy to identify, or at least narrow down, the sought portion of the data. When the idea of character combinations is introduced into string matching in this way, it is desirable that the probability distribution of the combinations be uniform rather than skewed.

[0024] As a way of narrowing down quickly, the relationship between two characters is used. This is explained with the combination of the full-width characters モ and ジ. Their character codes are as follows.
モ: 2562
ジ: 2538

[0025] Treated as two-byte units, these codes are 2562 and 2538, but the matching element can be narrowed down faster with 6238 and 2525, formed by concatenating the low-order bytes and the high-order bytes respectively. Looking at the make-up of a two-byte Japanese character, the high-order byte carries information on the character type (hiragana, katakana, and so on), and the low-order byte is assigned as the code that identifies the character itself.

[0026] Character frequency statistics show that hiragana and katakana are used far more often than kanji. For example, countless combinations can follow モ, but checking the low-order bytes 6238 first nearly settles these two characters in a single check, because few character strings are actually expected to begin with the low-order byte sequence 6238.

[0027] Among kanji too, JIS level-1 kanji are used in real documents far more often than JIS level-2 kanji. For kanji as well, therefore, focusing on the low-order byte rather than the high-order byte spreads out the distribution of combinations and allows the candidates to be narrowed down at the first stage.

[0028] The present invention thus focuses on the low-order bytes, which carry most of the distinguishing information of the characters, converts a character string into low-order byte sets and high-order byte sets of four characters each, and proceeds with the search on these sets. Four characters per set was chosen for the following reason.

[0029] As described above, the more characters a string contains (two rather than one, three rather than two), the stricter the condition for narrowing it down becomes; that is, more information is available for narrowing. For ordinary documents, four characters are enough to narrow the candidates considerably, and this size is also suitable as a processing unit for the device that executes the search, so a unit of four characters was adopted.

[0030] <<Specific Example 1>> FIG. 1 is a flowchart showing Specific Example 1 of the full-text search method of the present invention. Before describing it, the index used to implement this example is explained.

[0031] FIG. 2 is an explanatory diagram of the index used to implement the full-text search method. As shown, the index in this example is implemented as a tree structure. In the figure, 1 is the leading character hash table, L1 and L2 are LOW vectors, and H1 and H2 are HIGH vectors.

[0032] The leading character hash table 1 is a 64-Kbyte one-dimensional array used to look up each character string with the character code of its first character as the array index, and it has an area for every character code. If the search document contains one or more strings beginning with that character, the area holds the information needed to continue to the LOW vector L1 for the second and subsequent characters of those strings; if no string beginning with that character code exists in the search document, the area holds NULL.
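A direct-indexed table of this kind can be sketched as below. The sketch assumes one slot per possible 16-bit character code and uses a Python list, with `None` standing in for NULL; the names `make_leading_table` and `register_leading` are hypothetical, and the empty dict is only a placeholder for a LOW vector.

```python
TABLE_SIZE = 1 << 16  # assumption: one slot per possible 16-bit character code

def make_leading_table():
    # None plays the role of NULL: no indexed string starts with this character.
    return [None] * TABLE_SIZE

def register_leading(table, code, low_vector):
    """Attach a LOW-vector object to the slot of a leading character code."""
    if table[code] is None:
        table[code] = low_vector
    return table[code]

table = make_leading_table()
l1 = register_leading(table, 0x2522, {})  # ア = 0x2522; {} stands in for LOW vector L1
print(table[0x2522] is l1)  # True
print(table[0x2562])        # None: no indexed string starts with モ
```

Because the character code itself is the array index, the first lookup costs the same for every character, which matches the goal of a search time that does not depend on the query.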

[0033] The LOW vectors L1 and L2 are sets of ordered sequences of the low-order bytes of the (4k+2)th, (4k+3)th, (4k+4)th, and (4k+5)th characters of the strings in the search document, where k ≥ 0. Along with each set of four low-order bytes, the LOW vectors L1 and L2 hold information (a pointer) to the HIGH vectors H1 and H2 that follow.
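The position arithmetic (4k+2 to 4k+5, counting characters from 1) can be made concrete with a small sketch. The helper name `group` is hypothetical; the sample string is the one used later in the description.

```python
def group(chars, k, size=4):
    """Return the characters at 1-based positions 4k+2 .. 4k+5."""
    start = size * k + 1  # 0-based index of the (4k+2)th character
    return chars[start:start + size]

s = "アメリカンフットボール"
print(group(s, 0))  # メリカン
print(group(s, 1))  # フットボ
print(group(s, 2))  # ール (a final group may be shorter than four characters)
```

The first character is excluded from every group because it is handled separately by the leading character hash table.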

[0034] In the tree structure, the first LOW vector L1, which covers the second to fifth characters of the strings, is expected to contain especially many elements. To find the matching element among them quickly, the vector is hashed using the code of the second character as the hash value, which speeds up access.

[0035] FIG. 3 is an explanatory diagram of this LOW vector. Internally, the LOW vector is organized in the same way as the relationship between the leading character hash table 1 and the LOW vector L1 shown in FIG. 2: table A is built using the leading character within the LOW vector L1 (the (4k+2)th character) as the hash value, and the low-order bytes of the (4k+3)th to (4k+5)th characters form vector B.
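One way to picture the table A / vector B split is the sketch below. It is an illustration under assumptions rather than the layout of FIG. 3: table A is modeled as a dict keyed by the low-order byte of the first character of the group, vector B as the remaining three low-order bytes, and the class name `LowVector` and the string `"H1"` are placeholders.

```python
class LowVector:
    """Table A keyed by the first low byte; vector B holds the other three."""

    def __init__(self):
        # first low byte -> {remaining low bytes: link to a HIGH vector}
        self.table_a = {}

    def add(self, low4, high_link):
        bucket = self.table_a.setdefault(low4[0], {})
        return bucket.setdefault(bytes(low4[1:]), high_link)

    def find(self, low4):
        bucket = self.table_a.get(low4[0])
        if bucket is None:
            return None  # rejected after looking at a single byte
        return bucket.get(bytes(low4[1:]))

lv = LowVector()
lv.add(bytes.fromhex("616A2B73"), "H1")    # low-order bytes of メリカン
print(lv.find(bytes.fromhex("616A2B73")))  # H1
print(lv.find(bytes.fromhex("6238AAAA")))  # None
```

Keying on one byte first keeps each bucket small even when the LOW vector as a whole holds many elements.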

[0036] The HIGH vectors H1 and H2 are sets of ordered sequences of the high-order bytes of the (4k+2)th, (4k+3)th, (4k+4)th, and (4k+5)th characters of the strings in the search document, where k ≥ 0. Along with each set of four high-order bytes, the HIGH vectors H1 and H2 hold identification information D: if the string continues, D is information (a pointer) to the LOW vector for the remaining characters; if the location of the string has been settled by the (4k+5)th character, D is the information on that location.
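The two possible meanings of the identification information D can be modeled as a small tagged union. This is a sketch only: the class names `NextLow` and `Location`, the `offset` field, and the sample offset are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class NextLow:
    """D when the string continues: a link to the next LOW vector."""
    low_vector: dict

@dataclass
class Location:
    """D when the match is settled: a position in the search document."""
    offset: int

D = Union[NextLow, Location]

def describe(d: D) -> str:
    if isinstance(d, NextLow):
        return "continue with next LOW vector"
    return f"found at offset {d.offset}"

# One HIGH vector: four high-order bytes -> identification information D
high_vector = {bytes.fromhex("25252525"): NextLow(low_vector={})}
print(describe(high_vector[bytes.fromhex("25252525")]))  # continue with next LOW vector
print(describe(Location(offset=120)))                    # found at offset 120
```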

[0037] Next, the full-text search method of this example is described following the flowchart of FIG. 1.

1. The first character of the input character string S(n) of length n is taken, and the leading character hash table 1 is consulted with its character code (step S1). As an example, let the search string S(n) be アメリカンフットボール ("American football").

[0038] The first character of アメリカンフットボール is ア, so the leading character hash table 1 is consulted to check whether an entry exists (step S2). If the value at the position of ア is greater than 0 (not NULL), it is the value for the LOW vector L1, so L1 is made the current LOW vector and k is set to 0 (step S3). If the value is 0, the string S(n) does not appear in the search document, and the search ends.

[0039] 2. The (4k+2)th, (4k+3)th, (4k+4)th, and (4k+5)th characters are taken from the string S(n), their codes are split into low-order bytes and high-order bytes, and these are stored in the search vectors L and H respectively (step S4). The search vectors L and H are used to search the LOW vectors L1, L2 and the HIGH vectors H1, H2 respectively; here they hold the low-order and high-order bytes of メリカン. The character codes are メ = 2561, リ = 256A, カ = 252B, and ン = 2573, so the search vector L = 616A2B73 and the search vector H = 25252525.
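Step S4 can be reproduced with a short sketch. The helper names `jis_code` and `search_vectors` are hypothetical. The sketch obtains JIS X 0208 codes from Python's `euc_jp` codec by clearing the top bit of each byte, which is an implementation shortcut of this example and not something the patent describes; it assumes every character is a two-byte JIS X 0208 character.

```python
def jis_code(ch):
    """JIS X 0208 code of a character, derived from its EUC-JP bytes."""
    b = ch.encode("euc_jp")
    if len(b) != 2:
        raise ValueError(f"{ch!r} is not a two-byte character")
    return ((b[0] & 0x7F) << 8) | (b[1] & 0x7F)

def search_vectors(s, k):
    """Search vectors L and H for the characters at positions 4k+2 .. 4k+5."""
    codes = [jis_code(c) for c in s[4 * k + 1 : 4 * k + 5]]
    low = bytes(c & 0xFF for c in codes)   # search vector L
    high = bytes(c >> 8 for c in codes)    # search vector H
    return low, high

L, H = search_vectors("アメリカンフットボール", 0)
print(L.hex().upper(), H.hex().upper())  # 616A2B73 25252525
```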

[0040] 3. The current LOW vector L1 is matched against the search vector L (step S5). That is, the LOW vector L1 is searched for the low-order byte set 616A2B73 of メリカン. If it is found, information pointing to the HIGH vector H1 is obtained, and H1 is made the current HIGH vector (step S6). If no equal element is found, the string S(n) does not appear in the search document, and the search ends.

[0041] 4. The current HIGH vector H1 is matched against the search vector H (step S7). As with the LOW vector L1 above, the HIGH vector H1 is searched for the high-order byte set 25252525 of メリカン. If no equal element is found, the string S(n) does not appear in the search document, and the search ends. If an equal element is found, identification information D is obtained and taken out (step S8).

[0042] 5. If the identification information D is a pointer to the LOW vector L2 (step S9), L2 is made the current LOW vector, k is set to k + 1, and processing returns to step 2 above (step S10). In this example, if the search document contains several strings beginning with アメリカン…, the location is not determined by the fifth character, so a pointer to the LOW vector L2 is stored. The search is then repeated with the search vectors L and H set to the low-order and high-order bytes of フットボ, the sixth to ninth characters of the string.

[0043] 6. If the identification information D taken from the HIGH vector H2 in the search on the sixth to ninth characters indicates a location, that location information is taken out (step S11) and the search ends. In this case characters still remain in the search string S(n), but the search ends because the location in the search document has been determined by フットボ. In other words, once a prefix match shows that only one string with that combination of characters exists, the string is identified.
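Steps S1 to S11 can be put together in a compact sketch. Everything here is illustrative: plain dicts stand in for the leading character hash table and the LOW and HIGH vectors, D is modeled as a `("low", vector)` or `("loc", offset)` tuple, the index is built by hand, the offsets 1024 and 2048 are made up, and the second indexed string アメリカンコーヒー is invented for the example. JIS codes are derived via the `euc_jp` codec as a shortcut.

```python
def jis_code(ch):
    """JIS X 0208 code derived from EUC-JP bytes (assumes two-byte characters)."""
    b = ch.encode("euc_jp")
    return ((b[0] & 0x7F) << 8) | (b[1] & 0x7F)

def low_bytes(s):
    return bytes(jis_code(c) & 0xFF for c in s)

def high_bytes(s):
    return bytes(jis_code(c) >> 8 for c in s)

def search(leading_table, s):
    """Return the stored location for s, or None if s is not indexed."""
    if not s:
        return None
    low_vec = leading_table.get(jis_code(s[0]))   # S1, S2: first character
    if low_vec is None:
        return None
    k = 0                                         # S3
    while True:
        grp = s[4 * k + 1 : 4 * k + 5]            # S4: positions 4k+2 .. 4k+5
        high_vec = low_vec.get(low_bytes(grp))    # S5: LOW vector first
        if high_vec is None:
            return None
        d = high_vec.get(high_bytes(grp))         # S6, S7: then HIGH vector
        if d is None:
            return None
        kind, value = d                           # S8: identification info D
        if kind == "loc":                         # S9
            return value                          # S11: location settled
        low_vec, k = value, k + 1                 # S10: next LOW vector

# Hand-built index for two strings that share the prefix アメリカン
l2 = {
    low_bytes("フットボ"): {high_bytes("フットボ"): ("loc", 1024)},
    low_bytes("コーヒー"): {high_bytes("コーヒー"): ("loc", 2048)},
}
l1 = {low_bytes("メリカン"): {high_bytes("メリカン"): ("low", l2)}}
leading_table = {jis_code("ア"): l1}

print(search(leading_table, "アメリカンフットボール"))  # 1024
print(search(leading_table, "アメリカンコーヒー"))      # 2048
print(search(leading_table, "モジ"))                    # None
```

As in the description, the first query is settled at フットボ even though ール remains. The sketch does not cover how the index is built or how queries shorter than one full group are handled.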

[0044] As described above, Specific Example 1 keeps the index tree shallow, so a search needs only a few branches. The time spent on branching is reduced, and search speed improves.

[0045] <<Specific Example 2>> FIG. 4 is a block diagram of the electronic dictionary device of Specific Example 2. The device consists of a search unit 10 and a true meaning registration database 20. The search unit 10 performs searches using the full-text search method of Specific Example 1 described above. The true meaning registration database 20 is a database that stores, for each registered word of the dictionary, its meaning and the length of the character string of that meaning.

[0046] FIG. 5 is a diagram showing the index structure of the electronic dictionary device. The index of this example has basically the same structure as the index of Specific Example 1 shown in FIG. 2. That is, the first character hash table 1 is a one-dimensional array of 64 Kbytes that is referenced using the character code of the first character of a registered word as the index, with an area provided for every character code. If one or more registered words beginning with a given character exist in the true meaning registration database 20, the area for that character holds the information leading to LOW vector L1 for the second and subsequent characters of those words. If no registered word beginning with that character code exists in the true meaning registration database 20, the area holds NULL.
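A minimal sketch of such a first-character table follows. It assumes 16-bit character codes obtained with Python's `ord` (the patent's own code set is not modeled), one slot per code, and a plain list of word remainders as a stand-in for the pointer to LOW vector L1. The names `TABLE_SIZE` and `build_first_char_table` are hypothetical.

```python
TABLE_SIZE = 1 << 16  # one slot per 16-bit character code (assumption)


def build_first_char_table(words):
    """Return a list in which slot[code] is None when no registered word
    starts with that character, and otherwise a stand-in for the pointer
    to LOW vector L1 (here: the list of word remainders)."""
    table = [None] * TABLE_SIZE
    for word in words:
        if not word:
            continue
        code = ord(word[0])
        if code >= TABLE_SIZE:        # outside the 16-bit range, not modeled
            continue
        if table[code] is None:
            table[code] = []
        table[code].append(word[1:])  # second and subsequent characters
    return table


table = build_first_char_table(["アメリカ", "アジア", "日本"])
print(table[ord("ア")])   # remainders of the words starting with ア
print(table[ord("イ")])   # None: no registered word starts with イ
```

Because the slot is reached directly from the character code, a search string whose first character has no registered word is rejected after a single array access.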

[0047] LOW vectors L1 and L2 are sets of sequences of the lower bytes of the (4k+2)th, (4k+3)th, (4k+4)th and (4k+5)th characters of the registered words in the true meaning registration database 20, where k ≥ 0. Together with the set of lower bytes for these four characters, LOW vectors L1 and L2 hold the information (a pointer) leading to the HIGH vectors H1 and H2 that follow them. In this example as well, as shown in FIG. 3, LOW vector L1 is hashed using the code of its first character (the second character of the registered word) as the hash value.

[0048] HIGH vectors H1 and H2 are sets of sequences of the upper bytes of the (4k+2)th, (4k+3)th, (4k+4)th and (4k+5)th characters of the registered words in the true meaning registration database 20, where k ≥ 0. Together with the set of upper bytes for these four characters, HIGH vectors H1 and H2 hold identification information D: if the registered word continues further, D is the information (a pointer) leading to the LOW vector for the remaining characters; if the location of the registered word has been determined by the (4k+5)th character, D is the information on that location.

[0049] Next, the search method of the electronic dictionary device of this example is described. FIG. 6 is a flowchart of the search method of this example.

1. The first character of the input search word, a character string S(n) of length n, is taken out (step S1), and the first character hash table 1 is referenced with its character code (step S2). If the value found is greater than 0, it is the information for LOW vector L1, so the LOW vector L1 obtained from this information becomes the current LOW vector and k is set to 0 (step S3). If the value is 0, the character string S(n) is not registered and the search ends.

[0050] 2. The (4k+2)th, (4k+3)th, (4k+4)th and (4k+5)th characters are taken from the character string S(n), the lower bytes and the upper bytes of these characters are separated, and they are stored in search vectors L and H respectively (step S4).

[0051] 3. The current LOW vector L1 is matched against search vector L (step S5). If an equal entry is found, it yields the information for HIGH vector H1, which becomes the current HIGH vector. If no equal entry is found, the character string S(n) is not registered and the search ends.

[0052] 4. The current HIGH vector H1 is matched against search vector H (step S6). If no equal entry is found, the character string S(n) does not appear in the search document and the search ends.

[0053] 5. If the information obtained from the equal entry is the information for LOW vector L2, that vector becomes the current LOW vector, k is set to k + 1, and processing returns to step 2 above (steps S7 and S8).

[0054] 6. If n ≤ 4k+5 (step S9), the information obtained from the equal entry above is the information leading into the true meaning registration database 20. The corresponding meaning and its information length are therefore taken from the true meaning registration database 20 (step S10), and the search ends. If n > 4k+5, the character string is not registered and the search ends.

[0055] As described above, according to Specific Example 2, the skew of the index tree formed by the set of registered words can be spread out, so candidates are narrowed down more quickly. This shortens the search processing time and allows any word to be searched at a uniform speed. In addition, because the length of the information is held at the head of the true meaning registration database 20, the amount of data is known when a search result is transferred or displayed, so the data can be accessed quickly.
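The benefit of holding the length in front of the meaning can be sketched with a length-prefixed record. The record layout here is an assumption made for illustration only: a 2-byte big-endian length followed by the UTF-8 bytes of the meaning. The names `pack_meaning` and `read_meaning` are hypothetical.

```python
import struct


def pack_meaning(meaning: str) -> bytes:
    """Encode a meaning as a 2-byte length followed by its bytes (assumed layout)."""
    body = meaning.encode("utf-8")
    return struct.pack(">H", len(body)) + body


def read_meaning(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Read one record at offset; return the meaning and the next offset."""
    (length,) = struct.unpack_from(">H", buf, offset)  # length is known up front
    start = offset + 2
    return buf[start:start + length].decode("utf-8"), start + length


database = pack_meaning("a country in North America") + pack_meaning("犬")
meaning, next_offset = read_meaning(database)
print(meaning, next_offset)
```

Because the length is read before the body, the caller knows how many bytes to transfer or display without scanning for a terminator.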

[Brief description of drawings]

FIG. 1 is a flowchart showing Specific Example 1 of the full-text search method of the present invention.

FIG. 2 is an explanatory diagram of the index used to realize Specific Example 1 of the full-text search method of the present invention.

FIG. 3 is an explanatory diagram of the LOW vector in Specific Example 1 of the full-text search method of the present invention.

FIG. 4 is a block diagram of the electronic dictionary device of the present invention.

FIG. 5 is a diagram showing the index structure of the electronic dictionary device of the present invention.

FIG. 6 is a flowchart of the search method in the electronic dictionary device of the present invention.

[Explanation of symbols]

1: first character hash table
10: search unit
20: true meaning registration database
L1, L2: LOW vectors
H1, H2: HIGH vectors

Claims

[Claims]

1. A full-text search method, wherein an index of a character string appearing in a search document is represented by a set of the lower bytes and a set of the upper bytes of the codes of the characters in the character string and holds information indicating the corresponding location, in the search document, of the character string corresponding to those sets, and wherein, when a search is performed with an arbitrary search character string, a set of the lower bytes and a set of the upper bytes of the codes of the characters in the search character string are created, the index is referenced with these sets, and the corresponding location in the search document is extracted.

2. The full-text search method according to claim 1, wherein the search is performed first with the set of lower bytes of the codes of the characters in the search character string and then with the set of upper bytes.

3. The full-text search method according to claim 1, wherein the index is referenced first with the first character of the search character string and then with the remaining character string.

4. The full-text search method according to claim 1, wherein the index has a tree structure whose root is the first character of a character string appearing in the search document.

5. The full-text search method according to claim 3 or 4, wherein, when the LOW vector of the index, which is a set of the lower bytes of a character string appearing in the search document, is referenced, the reference is made first with one character and then with the remaining character string.

6. An electronic dictionary device comprising: a true meaning registration database that holds the meaning of each registered word and the length of the character string of that meaning; and a search unit that has an index in which the character codes of the registered words registered in the true meaning registration database are represented by a set of lower bytes and a set of upper bytes, and that, when a search word is given, creates a set of the lower bytes and a set of the upper bytes of the character codes in the search word, references the index with these sets, and retrieves the registered word corresponding to the search word.