JP5494066B2 - SEARCH DEVICE, SEARCH METHOD, AND SEARCH PROGRAM

Info

Publication number: JP5494066B2
Application number: JP2010061451A
Authority: JP
Inventors: 順也大堀
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-03-17
Filing date: 2010-03-17
Publication date: 2014-05-14
Anticipated expiration: 2030-03-17
Also published as: JP2011197809A

Description

The present invention relates to a search device and the like.

Full-text search, which searches a plurality of document data for a specific character string, is known. Full-text search uses an inverted index. An inverted index is an index that stores, among other things, position information of the words contained in character data. Methods for creating an inverted index are broadly divided into the character-delimited method and the word-delimited method.

The character-delimited method creates an inverted index character by character, without considering the meaning of words. An inverted index created by the character-delimited method is referred to as a character index. A character index allows complete partial-match search. However, the search keyword and the character index must be compared one character at a time, so the search takes a long time.

The word-delimited method creates an inverted index in units of meaningful words. An inverted index created by the word-delimited method is referred to as a word index. With a word index, the search keyword is compared word by word, so the search time can be shortened compared with using a character index. However, depending on how the words are delimited, search omissions may occur.

Because the character-delimited method and the word-delimited method each have advantages and disadvantages, how to choose between them is important. For example, as a conventional technique that uses both methods, a technique has been disclosed that automatically selects the character index or the word index according to the length of the search keyword.

Japanese Patent Laid-Open No. 10-307835
JP 2001-34623 A
JP 2008-77673 A

However, depending on the document data to be searched, a more efficient full-text search may be possible by choosing between the character index and the word index even when the search keywords have the same length.

For example, document data stored in a biodatabase contains, in addition to text, IDs (identifications) for linking to other databases. In general, a word index is effective for document data that has no symbols such as IDs, and a character index is effective for document data that has symbols.

Consider, as an example, the search expression "1.1.1.1 AND suppressor". Ideally, a full-text search with this expression would return only document data that contains the ID "1.1.1.1" and also contains the word "suppressor". If the full-text search is attempted with the character index, however, document data containing IDs such as "1.1.1.11" and "1.1.1.12" is also hit.

In contrast, if the full-text search for the above expression is attempted with the word index, only document data containing the ID "1.1.1.1" is retrieved. However, because "suppressors" and "suppressor" do not match exactly, document data containing "suppressors" can no longer be retrieved.
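The two failure modes described above can be reproduced with a small sketch. It does not build real indexes: character-level matching is approximated by substring matching and word-level matching by comparing whitespace-separated tokens. The sample documents and the function names are assumptions made for illustration only.

```python
# Hypothetical sample documents keyed by document ID.
docs = {
    1: "EC 1.1.1.1 tumor suppressor",
    2: "EC 1.1.1.11 tumor suppressor",
    3: "EC 1.1.1.1 known suppressors",
}

def char_match(term, text):
    # Character-level matching behaves like substring matching.
    return term in text

def word_match(term, text):
    # Word-level matching requires a whole token to be equal to the term.
    return term in text.split()

terms = ["1.1.1.1", "suppressor"]

# AND search using only character-level matching.
char_hits = sorted(d for d, t in docs.items()
                   if all(char_match(term, t) for term in terms))

# AND search using only word-level matching.
word_hits = sorted(d for d, t in docs.items()
                   if all(word_match(term, t) for term in terms))

print(char_hits)  # includes document 2, whose ID is 1.1.1.11
print(word_hits)  # misses document 3, which contains "suppressors"
```

In this toy data set the desired answer is documents 1 and 3, which neither method returns on its own.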

The disclosed technology has been made in view of the above, and its object is to provide a search device, a search method, and a search program that can execute a full-text search efficiently regardless of the characteristics of the document data.

In one aspect, the search device disclosed in the present application includes: a storage unit that stores a first index delimited based on a first delimiting method and associated with document data, a second index delimited based on a second delimiting method and associated with document data, and a pattern file that defines features of predetermined characters; a determination unit that receives a search keyword and determines, based on the search keyword and the pattern file, whether to search for document data using the first index or using the second index; and a search unit that searches for document data using the first index or the second index based on the determination result of the determination unit.

According to one aspect of the search device disclosed in the present application, a full-text search can be executed efficiently regardless of the characteristics of the document data.

FIG. 1 is a diagram illustrating the configuration of the search device according to the first embodiment.
FIG. 2 is a diagram illustrating the system according to the second embodiment.
FIG. 3 is a diagram illustrating the configuration of the search device according to the second embodiment.
FIG. 4 is a diagram illustrating the data structure of the pattern file.
FIG. 5 is a flowchart illustrating the processing procedure of the search device according to the second embodiment.
FIG. 6 is a diagram illustrating the hardware configuration of a computer that constitutes the search device according to the embodiments.

Embodiments of the search device, search method, and search program disclosed in the present application are described below in detail with reference to the drawings. The present invention is not limited by these embodiments.

FIG. 1 is a diagram illustrating the configuration of the search device 100 according to the first embodiment. As illustrated in FIG. 1, the search device 100 includes a storage unit 110, a determination unit 120, and a search unit 130.

The storage unit 110 stores a pattern file 110a, a first index 110b, and a second index 110c. The pattern file 110a is data that defines features of predetermined characters. The first index 110b is data delimited based on a first delimiting method and associated with document data. The second index 110c is data delimited based on a second delimiting method and associated with document data.

The determination unit 120 receives a search keyword and determines, based on the search keyword and the pattern file 110a, whether to perform the search using the first index 110b or using the second index 110c.

The search unit 130 searches for document data using the first index 110b or the second index 110c, based on the determination result of the determination unit 120.

The search device 100 uses the pattern file 110a to determine whether to perform the search using the first index 110b or using the second index 110c. Because the index best suited to the features of the search keyword can thus be selected, a full-text search can be executed efficiently regardless of the characteristics of the document data.

Next, an example of a system according to the second embodiment is described. FIG. 2 is a diagram illustrating the system according to the second embodiment. As illustrated in FIG. 2, the system includes a user terminal 60 and a search device 200. The user terminal 60 and the search device 200 are connected via a network 50.

The user terminal 60 is a device that transmits a search keyword to the search device 200 and receives the search result for that keyword from the search device 200.

The search device 200 is a device that performs full-text search of document data. FIG. 3 is a diagram illustrating the configuration of the search device 200 according to the second embodiment. As illustrated in FIG. 3, the search device 200 includes a storage unit 210, an indexing processing unit 220, an input receiving unit 230, a search expression analysis processing unit 240, a scoring processing unit 250, and a search result output unit 260.

The storage unit 210 stores a pattern file 210a, a document data group 210b, a word index 210c, and a character index 210d.

The pattern file 210a is data that defines features of predetermined characters. FIG. 4 is a diagram illustrating the data structure of the pattern file 210a. As illustrated in FIG. 4, the pattern file has a No and a pattern for each entry. The No identifies each pattern. The pattern expresses the features of predetermined characters as a regular expression. Here, "characters" includes digits, symbols, and the like in addition to ordinary letters.

An example of the pattern notation is described here. In a pattern, [ ] matches any single letter, digit, or symbol listed between [ and ]. For example, [0-9] means a single digit. In a pattern, {n,m} means that the immediately preceding character (or bracket expression) is repeated from n to m times. For example, [0-9]{1,3} means a number of one, two, or three digits.

In a pattern, + means that the immediately preceding character is repeated one or more times. For example, [0-9]+ means a character string consisting of digits. In a pattern, * means that the immediately preceding character is repeated zero or more times. For example, [0-9]* means either the empty string or a character string consisting of digits.
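The notation above corresponds to ordinary regular-expression syntax, so it can be checked with Python's standard `re` module. The sketch below assumes that a pattern must match the whole string (`re.fullmatch`); the text does not state whether whole or partial matching is intended, and the helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# [0-9]      : exactly one digit
print(matches(r"[0-9]", "7"), matches(r"[0-9]", "42"))

# [0-9]{1,3} : one to three digits
print(matches(r"[0-9]{1,3}", "123"), matches(r"[0-9]{1,3}", "1234"))

# [0-9]+     : one or more digits (the empty string does not match)
print(matches(r"[0-9]+", "2010"), matches(r"[0-9]+", ""))

# [0-9]*     : zero or more digits (the empty string matches)
print(matches(r"[0-9]*", ""), matches(r"[0-9]*", "2010"))
```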

Returning to the description of FIG. 3, the document data group 210b contains a plurality of document data. Each document data is assigned a unique ID and contains various character strings.

The word index 210c is an inverted index that associates each word of each document data in the document data group 210b with the ID of the document data in which the word appears. The character index 210d is an inverted index that associates each character of each document data in the document data group 210b with the ID of the document data in which the character appears.
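As a rough illustration of the two inverted indexes, the following sketch builds both from a small in-memory document set. Several details are assumptions rather than something the text specifies: words are obtained by splitting on whitespace, the character index also records character positions, and the function names are hypothetical.

```python
from collections import defaultdict

def build_word_index(docs):
    """Map each whitespace-delimited word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def build_char_index(docs):
    """Map each character to {document ID: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index

# Hypothetical sample documents.
docs = {1: "EC 1.1.1.1 suppressor", 2: "EC 1.1.1.11 suppressors"}

word_index = build_word_index(docs)
char_index = build_char_index(docs)

print(sorted(word_index["EC"]))        # documents containing the word "EC"
print(sorted(word_index["1.1.1.1"]))   # documents containing exactly this token
print(dict(char_index["E"]))           # positions of "E" per document
```

Keeping positions in the character index is one common way to let a lookup verify that the characters of a keyword occur consecutively.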

The indexing processing unit 220 is a processing unit that generates the word index 210c and the character index 210d from the document data group 210b. It generates the word index 210c from the document data group 210b by the word-delimited method, and generates the character index 210d from the document data group 210b by the character-delimited method. Both indexes are generated in the same way as in the well-known word-delimited and character-delimited methods.

The input receiving unit 230 receives a search keyword from the user terminal 60 and outputs it to the search expression analysis processing unit 240. The input receiving unit 230 may also acquire the search keyword from an input device connected to the search device 200, such as a mouse or a keyboard.

The search expression analysis processing unit 240 is a processing unit that compares the search keyword with the pattern file 210a and determines whether to search for document data using the word index 210c or using the character index 210d. In the following, searching for document data using the word index 210c is referred to as a word-delimited search, and searching for document data using the character index 210d is referred to as a character-delimited search.

First, the search expression analysis processing unit 240 parses the search keyword. Suppose, for example, that the search keyword is "1.1.1.1 AND suppressor". By parsing this search keyword, the search expression analysis processing unit 240 extracts the conditional operator "AND" contained in it and the character strings "1.1.1.1" and "suppressor" on either side of the operator.
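A minimal sketch of this parsing step is shown below. It assumes that operators are the upper-case words AND and OR separated from the terms by whitespace, and it does not handle parentheses or quoting; the function name `parse_expression` is hypothetical.

```python
import re

def parse_expression(expression: str):
    """Split a search expression into its terms and the operators between them."""
    # The capturing group keeps the operators in the result of re.split.
    parts = re.split(r"\s+(AND|OR)\s+", expression.strip())
    terms = parts[0::2]      # even positions: character strings
    operators = parts[1::2]  # odd positions: AND / OR
    return terms, operators

terms, operators = parse_expression("1.1.1.1 AND suppressor")
print(terms)      # the character strings on either side of the operator
print(operators)  # the operators found between them
```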

The search expression analysis processing unit 240 compares each character string extracted from the search keyword with the patterns in the pattern file 210a and determines, for each character string, whether to perform a word-delimited search or a character-delimited search.

Specifically, when a character string matches any of the patterns in the pattern file 210a, the search expression analysis processing unit 240 determines that a word-delimited search is to be performed. For example, the character string "1.1.1.1" matches the pattern of No "2" in the pattern file 210a shown in FIG. 4. The search expression analysis processing unit 240 therefore determines that a word-delimited search is to be performed for the character string "1.1.1.1".

The character string "suppressor", on the other hand, matches none of the patterns in the pattern file 210a shown in FIG. 4. The search expression analysis processing unit 240 therefore determines that a character-delimited search is to be performed for the character string "suppressor".
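The per-string decision can be sketched as follows. The contents of FIG. 4 are not reproduced in this text, so both patterns below are hypothetical stand-ins; pattern No 2 is simply chosen so that it matches a dotted numeric ID such as "1.1.1.1". Whole-string matching and the function name `choose_method` are also assumptions.

```python
import re

# Hypothetical pattern file: No -> regular expression.
PATTERN_FILE = {
    1: r"[A-Z]{2}[0-9]{6}",      # assumed: two letters followed by six digits
    2: r"[0-9]+(\.[0-9]+){3}",   # assumed: four dot-separated numbers
}

def choose_method(term: str, pattern_file=PATTERN_FILE):
    """Return ("word", No) if some pattern matches the term, else ("char", None)."""
    for no, pattern in pattern_file.items():
        if re.fullmatch(pattern, term):
            return "word", no    # ID-like string: word-delimited search
    return "char", None          # ordinary word: character-delimited search

print(choose_method("1.1.1.1"))
print(choose_method("suppressor"))
```

Because the decision depends only on the pattern file, changing which strings count as IDs only requires editing the patterns.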

The search expression analysis processing unit 240 outputs data that associates each character string with its determination result to the scoring processing unit 250. It also outputs the conditional operator contained in the search keyword to the scoring processing unit 250.

The scoring processing unit 250 is a processing unit that acquires the character strings, their determination results, and the conditional operator from the search expression analysis processing unit 240 and, based on the acquired data, searches for the document data corresponding to the search keyword. As an example, assume that the determination result for the character string "1.1.1.1" is "perform a word-delimited search", the determination result for the character string "suppressor" is "perform a character-delimited search", and the conditional operator is "AND".

In this case, the scoring processing unit 250 compares the character string "1.1.1.1" with the word index 210c, identifies the document data corresponding to "1.1.1.1", and acquires the identified document data from the document data group 210b. The scoring processing unit 250 likewise compares the character string "suppressor" with the character index 210d, identifies the document data corresponding to "suppressor", and acquires the identified document data from the document data group 210b.

Because the conditional operator is "AND", the scoring processing unit 250 then compares the document data corresponding to the character string "1.1.1.1" with the document data corresponding to the character string "suppressor" and outputs the document data common to both to the search result output unit 260. When the conditional operator is "OR", the scoring processing unit 250 outputs both the document data corresponding to the character string "1.1.1.1" and the document data corresponding to the character string "suppressor" to the search result output unit 260.
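When each character string's hits are held as a set of document IDs, the AND and OR cases reduce to set intersection and set union. The sketch below illustrates this; the function name `combine` and the sample ID sets are hypothetical.

```python
def combine(operator: str, left: set, right: set) -> set:
    """Combine two sets of document IDs according to the conditional operator."""
    if operator == "AND":
        return left & right   # documents present in both results
    if operator == "OR":
        return left | right   # documents present in either result
    raise ValueError(f"unsupported operator: {operator}")

id_hits = {1, 3}        # hypothetical hits for "1.1.1.1"
word_hits = {1, 2, 3}   # hypothetical hits for "suppressor"

print(sorted(combine("AND", id_hits, word_hits)))
print(sorted(combine("OR", id_hits, word_hits)))
```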

When it retrieves document data, the scoring processing unit 250 may assign a score to each document data according to the frequency with which the character strings appear in that document data.

The search result output unit 260 sends the document data received from the scoring processing unit 250 to the user terminal 60. It may adjust the order in which the document data is displayed on the user terminal 60 according to the scores of the document data. The search result output unit 260 may also output the document data to a display device connected to the search device 200, such as a monitor or a liquid crystal display.
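One simple way to realize frequency-based scoring and score-ordered output is sketched below. The text does not give a scoring formula, so the choice of summing raw occurrence counts is an assumption, as are the function names and the sample documents.

```python
def score(text: str, terms) -> int:
    """Assumed scoring rule: total number of occurrences of all terms in the text."""
    return sum(text.count(term) for term in terms)

def rank(docs: dict, hit_ids, terms):
    """Order hit documents by descending score, breaking ties by document ID."""
    return sorted(hit_ids, key=lambda d: (-score(docs[d], terms), d))

# Hypothetical documents that were already selected as hits.
docs = {
    1: "suppressor of 1.1.1.1",
    3: "1.1.1.1 suppressor and suppressors",
}
terms = ["1.1.1.1", "suppressor"]

ranked = rank(docs, docs.keys(), terms)
print(ranked)  # document IDs, highest score first
```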

Next, the processing procedure of the search device 200 according to the second embodiment is described. FIG. 5 is a flowchart illustrating the processing procedure of the search device 200 according to the second embodiment. As illustrated in FIG. 5, the search device 200 acquires a search keyword (step S101) and parses it (step S102).

The search device 200 acquires an unselected pattern from the pattern file 210a (step S103) and determines whether the search keyword matches the pattern (step S104). If the search keyword matches the pattern (step S104, Yes), the search device 200 determines that a word-delimited search is to be performed (step S105) and proceeds to step S108.

If the search keyword does not match the pattern (step S104, No), the search device 200 determines whether any unselected pattern remains (step S106). If an unselected pattern remains (step S106, Yes), the search device 200 returns to step S103.

If no unselected pattern remains (step S106, No), the search device 200 determines that a character-delimited search is to be performed (step S107) and executes the search (step S108).

As described above, the search device 200 according to the second embodiment uses the pattern file 210a to determine whether to perform a word-delimited search or a character-delimited search. Because the index best suited to the features of the search keyword can thus be selected, a full-text search can be executed efficiently regardless of the characteristics of the document data.

In the second embodiment, when the search keyword is a search expression, the search expression is divided into a plurality of partial keywords, and whether to perform a word-delimited search or a character-delimited search is determined for each partial keyword. Search expressions written in the style of existing technology can therefore be used as they are to execute a full-text search.
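Putting the pieces together, the overall behavior can be sketched end to end. This is a simplified illustration, not the device itself: real index lookups are replaced by token comparison (word-delimited) and substring matching (character-delimited), the single ID pattern is a hypothetical stand-in for the pattern file, and operators are evaluated strictly left to right with no precedence.

```python
import re

# Hypothetical pattern file containing one ID-like pattern.
ID_PATTERNS = [r"[0-9]+(\.[0-9]+){3}"]

def search(expression: str, docs: dict, patterns=ID_PATTERNS) -> set:
    """Evaluate an AND/OR expression, choosing the search method per partial keyword."""
    parts = re.split(r"\s+(AND|OR)\s+", expression.strip())
    terms, operators = parts[0::2], parts[1::2]

    result = None
    for i, term in enumerate(terms):
        if any(re.fullmatch(p, term) for p in patterns):
            # Pattern matched: word-delimited search (whole-token equality).
            hits = {d for d, text in docs.items() if term in text.split()}
        else:
            # No pattern matched: character-delimited search (substring match).
            hits = {d for d, text in docs.items() if term in text}

        if result is None:
            result = hits
        elif operators[i - 1] == "AND":
            result = result & hits
        else:  # "OR"
            result = result | hits
    return result

# Hypothetical sample documents.
docs = {
    1: "EC 1.1.1.1 tumor suppressor",
    2: "EC 1.1.1.11 tumor suppressor",
    3: "EC 1.1.1.1 known suppressors",
}

print(sorted(search("1.1.1.1 AND suppressor", docs)))
```

With these sample documents, the mixed strategy excludes the document whose ID is "1.1.1.11" while still returning the one that contains "suppressors".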

In addition, the pattern file 210a of the second embodiment can easily be customized to suit the preferences of each user.

The components of the search device 200 shown in FIG. 3 are functional and conceptual, and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the search device 200 is not limited to the one illustrated; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the storage unit 210 may be mounted on a detachable external device, a portable terminal, or the like, and that external device or portable terminal may be connected to the search device 200 by wire or wirelessly.

The search device 200 can also be realized by installing its functions on a known information processing device such as a personal computer, a workstation, a mobile phone, a PHS terminal, a mobile communication terminal, or a PDA.

FIG. 6 is a diagram illustrating the hardware configuration of a computer that constitutes the search device according to the embodiments. As illustrated in FIG. 6, the computer 300 includes a CPU (Central Processing Unit) 301 that executes various arithmetic processes, an input device 302 that receives data input from a user, and a monitor 303. The computer 300 also includes a medium reading device 304 that reads programs and the like from a storage medium, and a network interface device 305 that exchanges data with other computers via a network. The computer 300 further includes a RAM (Random Access Memory) 306 that temporarily stores various information, and a hard disk device 307. The devices 301 to 307 are connected to a bus 308.

The hard disk device 307 stores a search program 307a that has the same functions as the search expression analysis processing unit 240, the scoring processing unit 250, and the indexing processing unit 220 shown in FIG. 3. The hard disk device 307 also stores various data 307b corresponding to the data 210a to 210d shown in FIG. 3.

When the CPU 301 reads the search program 307a from the hard disk device 307 and loads it into the RAM 306, the search program 307a comes to function as a search process 306a. The CPU 301 also reads the various data 307b into the RAM 306, where it is held as various data 306b. The search process 306a executes a full-text search using the various data 306b.

The search program 307a need not necessarily be stored in the hard disk device 307. The computer 300 may read and execute a program stored on a storage medium such as a CD-ROM. Alternatively, the program may be stored on a public line, the Internet, a LAN (Local Area Network), a WAN (Wide Area Network), or the like, and the computer 300 may read the program from there and execute it.

The following supplementary notes are further disclosed regarding the embodiments described above.

(Supplementary Note 1) A search device comprising:
a storage unit that stores a first index delimited based on a first delimiting method and associated with document data, a second index delimited based on a second delimiting method and associated with document data, and a pattern file that defines features of predetermined characters;
a determination unit that receives a search keyword and determines, based on the search keyword and the pattern file, whether to search for document data using the first index or using the second index; and
a search unit that searches for document data using the first index or the second index based on the determination result of the determination unit.

(Supplementary Note 2) The search device according to Supplementary Note 1, further comprising a keyword dividing unit that divides the search keyword into a plurality of partial keywords, wherein the determination unit determines, for each partial keyword, whether to search for document data using the first index or using the second index.

(Supplementary Note 3) The search device according to Supplementary Note 1 or 2, wherein the first delimiting method is a word-delimited method that delimits a character string into meaningful words, and the determination unit determines that document data is to be searched using the first index when a feature defined in the pattern file matches the search keyword.

(Supplementary Note 4) A search method in which a search device having a storage device that stores a first index delimited based on a first delimiting method and associated with document data, a second index delimited based on a second delimiting method and associated with document data, and a pattern file that defines features of predetermined characters executes:
a determination step of receiving a search keyword and determining, based on the search keyword and the pattern file, whether to search for document data using the first index or using the second index; and
a search step of searching for document data using the first index or the second index based on the determination result of the determination step.

(Supplementary Note 5) The search method according to Supplementary Note 4, further including a keyword dividing step of dividing the search keyword into a plurality of partial keywords, wherein the determination step determines, for each partial keyword, whether to search the document data using the first index or using the second index.

(Supplementary Note 6) The search method according to Supplementary Note 4 or 5, wherein the first delimiter method is a word delimiter method that delimits a character string into meaningful words, and the determination step determines that the document data is to be searched using the first index when a feature defined in the pattern file matches the search keyword.

(Supplementary Note 7) A search program that causes a computer, having a storage device that stores a first index delimited by a first delimiter method and associated with document data, a second index delimited by a second delimiter method and associated with document data, and a pattern file defining features of predetermined characters, to execute:
a determination procedure of receiving a search keyword and determining, based on the search keyword and the pattern file, whether to search the document data using the first index or using the second index; and
a search procedure of searching the document data using the first index or the second index based on the determination result of the determination step.

(Supplementary Note 8) The search program according to Supplementary Note 7, further causing the computer to execute a keyword dividing procedure of dividing the search keyword into a plurality of partial keywords, wherein the determination procedure determines, for each partial keyword, whether to search the document data using the first index or using the second index.

(Supplementary Note 9) The search program according to Supplementary Note 7 or 8, wherein the first delimiter method is a word delimiter method that delimits a character string into meaningful words, and the determination procedure determines that the document data is to be searched using the first index when a feature defined in the pattern file matches the search keyword.
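
The determination described in the supplementary notes above can be sketched as follows. This is a minimal illustration, not the patented implementation. It makes three assumptions: that the pattern file can be modelled as a list of regular expressions, that the search keyword is divided on whitespace, and that a keyword "matches" when a pattern matches it in full. The names `PATTERN_FILE`, `load_patterns`, `split_keyword`, and `determine` are hypothetical.

```python
import re

# Hypothetical pattern file contents: each line describes a feature of
# keywords that should be routed to the first (word-delimited) index.
PATTERN_FILE = [
    r"[0-9]+(?:[-/.][0-9]+)+",  # assumed feature: digit groups joined by symbols
    r"[A-Z]+-[0-9]+",           # assumed feature: letters, a hyphen, then digits
]


def load_patterns(lines):
    """Compile each pattern file line into a regular expression."""
    return [re.compile(line) for line in lines]


def split_keyword(keyword):
    """Divide the search keyword into partial keywords (assumed: on whitespace)."""
    return keyword.split()


def determine(keyword, patterns):
    """Decide, for each partial keyword, which index to search.

    Returns a list of (partial_keyword, "first" | "second") pairs:
    "first" when a pattern matches the whole partial keyword,
    "second" otherwise.
    """
    plan = []
    for part in split_keyword(keyword):
        matched = any(p.fullmatch(part) for p in patterns)
        plan.append((part, "first" if matched else "second"))
    return plan
```

With these assumed patterns, `determine("JP-5494066 search", load_patterns(PATTERN_FILE))` routes `JP-5494066` to the first index and `search` to the second.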

Description of symbols
100 Search device
110a Pattern file
110b First index
110c Second index
120 Determination unit
130 Search unit

Claims

A search device comprising:
a storage unit that stores a word index delimited by a word delimiter method, which delimits a character string into meaningful words, and associated with document data; a character index delimited by a character delimiter method, which delimits a character string into individual characters, and associated with document data; and a pattern file defining features of a predetermined character string that includes symbols;
a determination unit that receives a search character string and, based on the search character string and the pattern file, determines that the document data is to be searched using the word index when a feature defined in the pattern file matches the search character string, and determines that the document data is to be searched using the character index when the feature defined in the pattern file does not match the search character string; and
a search unit that searches the document data using the word index or the character index based on the determination result of the determination unit.

The search device according to claim 1, wherein
the pattern file includes information indicating a format of characters included in the predetermined character string, and
the determination unit determines whether to search the document data using the word index or using the character index based on whether the format of the characters included in the received search character string matches the format of characters indicated in the pattern file.

The search device according to claim 1 or 2, further comprising a character string dividing unit that divides the search character string into a plurality of partial character strings, wherein the determination unit determines, for each partial character string, that the document data is to be searched using the word index when a feature defined in the pattern file matches the partial character string, and that the document data is to be searched using the character index when the feature defined in the pattern file does not match the partial character string.

A search method in which a search device, having a storage device that stores a word index delimited by a word delimiter method, which delimits a character string into meaningful words, and associated with document data; a character index delimited by a character delimiter method, which delimits a character string into individual characters, and associated with document data; and a pattern file defining features of a predetermined character string that includes symbols, executes:
a determination step of receiving a search character string and, based on the search character string and the pattern file, determining that the document data is to be searched using the word index when a feature defined in the pattern file matches the search character string, and determining that the document data is to be searched using the character index when the feature defined in the pattern file does not match the search character string; and
a search step of searching the document data using the word index or the character index based on the determination result of the determination step.

A search program that causes a computer, having a storage device that stores a word index delimited by a word delimiter method, which delimits a character string into meaningful words, and associated with document data; a character index delimited by a character delimiter method, which delimits a character string into individual characters, and associated with document data; and a pattern file defining features of a predetermined character string that includes symbols, to execute:
a determination procedure of receiving a search character string and, based on the search character string and the pattern file, determining that the document data is to be searched using the word index when a feature defined in the pattern file matches the search character string, and determining that the document data is to be searched using the character index when the feature defined in the pattern file does not match the search character string; and
a search procedure of searching the document data using the word index or the character index based on the determination result of the determination step.

A search device comprising:
a storage unit that stores a first index delimited by a first delimiter method and associated with document data, a second index delimited by a second delimiter method and associated with document data, and a pattern file indicating a format of characters included in a character string;
a determination unit that receives a search character string and determines whether to search the document data using the first index or using the second index based on whether the format of characters included in the search character string matches the format of characters indicated in the pattern file; and
a search unit that searches the document data using the first index or the second index based on the determination result of the determination unit.

A search method in which a search device, having a storage device that stores a first index delimited by a first delimiter method and associated with document data, a second index delimited by a second delimiter method and associated with document data, and a pattern file indicating a format of characters included in a character string, executes:
a determination step of receiving a search character string and determining whether to search the document data using the first index or using the second index based on whether the format of characters included in the search character string matches the format of characters indicated in the pattern file; and
a search step of searching the document data using the first index or the second index based on the determination result of the determination step.

A search program that causes a computer, having a storage device that stores a first index delimited by a first delimiter method and associated with document data, a second index delimited by a second delimiter method and associated with document data, and a pattern file indicating a format of characters included in a character string, to execute:
a determination procedure of receiving a search character string and determining whether to search the document data using the first index or using the second index based on whether the format of characters included in the search character string matches the format of characters indicated in the pattern file; and
a search procedure of searching the document data using the first index or the second index based on the determination result of the determination procedure.
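
The storage, determination, and search roles recited in the claims above can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed implementation. Whitespace tokenization stands in for a real word segmenter (Japanese text would need morphological analysis), the character format in the pattern file is modelled as a single regular expression, and all names (`FORMAT`, `build_word_index`, `build_char_index`, `search`) are hypothetical.

```python
import re
from collections import defaultdict

# Assumed character format: alphanumeric runs joined by symbols.
FORMAT = re.compile(r"[A-Za-z0-9]+(?:[-_/.][A-Za-z0-9]+)+")


def build_word_index(docs):
    """Word-delimited inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():  # stand-in for a real word segmenter
            index[token].add(doc_id)
    return index


def build_char_index(docs):
    """Character-delimited inverted index: char -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].append(pos)
    return index


def search_char_index(index, query):
    """Return ids of documents where the query's characters occur consecutively."""
    hits = set()
    if not query:
        return hits
    for doc_id, starts in index.get(query[0], {}).items():
        for start in starts:
            # Every character must appear at its expected offset from start.
            if all(start + i in index.get(ch, {}).get(doc_id, ())
                   for i, ch in enumerate(query)):
                hits.add(doc_id)
                break
    return hits


def search(query, word_index, char_index):
    """Pick an index from the query's format, then search it."""
    if FORMAT.fullmatch(query):
        return "word", set(word_index.get(query, set()))
    return "char", search_char_index(char_index, query)
```

For example, with `docs = {1: "order AB-123 shipped", 2: "full text search engine"}`, the query `AB-123` matches the assumed format and is looked up once in the word index, while `ext sea` does not match and falls back to a character-by-character position check in the character index.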