JP2008287394A - Document retrieval method and apparatus, and computer program therefor

Info

Publication number: JP2008287394A
Application number: JP2007130321A
Authority: JP
Inventors: Izumi Takahashi (高橋 いづみ); Hisako Asano (浅野 久子)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-05-16
Filing date: 2007-05-16
Publication date: 2008-11-27
Anticipated expiration: 2027-05-16
Also published as: JP4950755B2

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval method, a document retrieval apparatus, and a computer program therefor that improve retrieval accuracy and absorb orthographic variations in retrieval results.

SOLUTION: The document retrieval apparatus 10 receives an input search character string as a query via an input section 1; uses a formal name/regular expression dictionary 31, a conversion list 41, and an insertion list 51 to create a regular expression for the received query; uniquely identifies the formal name of a word from the regular expression; expands the query with the created regular expression to create a query group; uses the expanded query group to retrieve documents via a document retrieval section 6 and extracts document data containing the query group; uses the identified formal name and its corresponding regular expression to unify the notation of the query in the extracted document data; and outputs information on the notation-unified document data via an output section 8.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a document search method and apparatus, and a computer program therefor, that generate a synonymous regular expression from a query (the character string to be searched for) entered into a search apparatus, build a regular expression dictionary from it, and perform query expansion and notation unification, thereby simultaneously improving search accuracy and absorbing orthographic variation in search results.

Conventionally, when a searcher enters a query (the character string to be searched for), a document search apparatus retrieves documents containing that query from a document database. Conventional apparatuses often add a function that retrieves not only documents containing the query itself but also documents containing orthographic variants of it, so that relevant documents are not missed. To absorb orthographic variation, the search apparatus uses a synonym dictionary. The dictionary is needed not only to expand the query but also to absorb the variants of the query that appear in each retrieved document.

For example, the question answering system described in Non-Patent Document 1 absorbs mismatches between the wording of a user's question and the text of a knowledge database by using a synonym dictionary in which synonymous words and phrases are grouped. Because this system retrieves, from a document collection, the text from which the answer to the question keywords is derived, its task of finding text that matches keywords is the same as that of a search system. When the system uses the synonym dictionary to absorb wording mismatches between the question and the text containing the answer, every member of the set consisting of the query and its orthographic variants must be registered in the dictionary in fully expanded form. In Non-Patent Document 1, a synonym dictionary containing 919 groups of synonymous expressions, 3,512 words, and 217 phrases was built by hand.

Non-Patent Document 1: Yoji Kiyota, Sadao Kurohashi, Fuyuko Kido, "Automatic Question Answering Based on a Large-Scale Text Knowledge Base: Dialog Navi", Natural Language Processing (自然言語処理), Vol. 10, No. 4, 2003

However, building a dictionary that includes orthographic variants raises the following problems.

・Because the expression patterns that include orthographic variants are countless, registering all of them makes the dictionary extremely large.

・New words and orthographic variants keep appearing every day, so collecting the expressions that exist at one point in time may not be enough. A fast response is also required.

・Building the dictionary by hand takes an enormous amount of time and cost.

These problems are especially pronounced for product names composed of character strings that contain alphanumeric characters and symbols.

The present invention has been made in view of the above problems, and its object is to provide a document search method, a document search apparatus, and a computer program therefor that improve accuracy at search time and absorb orthographic variation in search results.

To achieve the above object, the present invention proposes a document search method for a document search apparatus consisting of a computer device that comprises: a database of a formal name/regular expression dictionary whose entries each pair the formal name of a word with a regular expression for that formal name; a database of a conversion list, to which items can be freely registered, used as conversion rules when creating the regular expression of a query (the character string to be searched for); and a database of an insertion list, to which characters and symbols to be inserted when creating the regular expression of the query can be freely registered. When a query is entered, the document search apparatus extracts, from a plurality of document data stored in a document database, the document data containing a query group obtained by expanding the query, and outputs the extraction result. In this method, the document search apparatus: accepts the entered character string to be searched for as a query; creates a regular expression of the accepted query using the formal name/regular expression dictionary, the conversion list, and the insertion list, and uniquely identifies the formal name of the word from that regular expression; expands the query using the created regular expression to create a query group; performs a document search using the expanded query group and extracts document data containing the query group; unifies the notation of the query contained in the extracted document data using the identified formal name and the regular expression corresponding to it; and outputs information on the notation-unified document data.

According to the document search method of the present invention, orthographic variation patterns written as a regular expression are generated automatically from an entered query containing an alphanumeric and symbol string, and the regular expression is registered in a dictionary paired with the formal name. This makes it possible to expand the query at search time and to unify the notation of the search results.

That is, the present invention performs the following on character strings containing alphanumeric characters and symbols:

1. Describe the expressions of the entered query, including its various orthographic variants, as a regular expression and register it in the dictionary.

2. By fully expanding the regular expression, automatically generate regular expressions covering every conceivable orthographic variant.

3. When a new orthographic variant is found, automatically update the regular expression in the dictionary.

Using regular expressions in this way reduces the number of words registered in the dictionary, allows a single dictionary to serve both query expansion and the absorption of orthographic variation, and reduces the dictionary size.
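As a rough illustration of how one dictionary entry could serve both purposes, the sketch below expands an entry into a query group and then rewrites variants found in a text to the formal name. The entry "XYZ-100", its pattern, the structured `ELEMENTS` form, and the function names `expand` and `unify` are all hypothetical; the use of `re.sub` for notation unification is an assumption, not the procedure defined in this document.

```python
import re
from itertools import product

# Hypothetical dictionary entry: a formal name paired with one regular expression.
FORMAL_NAME = "XYZ-100"
PATTERN = r"(XYZ|xyz|Xyz)[- ]?100"

# The same pattern as structured elements: (alternatives, is_optional).
# This representation is an assumption made for the sketch.
ELEMENTS = [(["XYZ", "xyz", "Xyz"], False), (["-", " "], True), (["100"], False)]


def expand(elements):
    """Enumerate every string the element sequence can produce (query expansion)."""
    choices = [alts + [""] if optional else alts for alts, optional in elements]
    return ["".join(parts) for parts in product(*choices)]


def unify(text, pattern, formal_name):
    """Rewrite every variant matched by the pattern to the formal name."""
    return re.sub(pattern, formal_name, text)


query_group = expand(ELEMENTS)
unified = unify("Compare the xyz 100 with the Xyz100.", PATTERN, FORMAL_NAME)
```

Here the single pattern stands in for nine separately registered variants, which is the size reduction the paragraph above describes.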

In addition, automatically generating expressions that cover every conceivable orthographic variant is expected to improve both the speed of response to new words and the coverage, while also reducing cost.

Furthermore, because both the entered query and the formal name are described by regular expressions, the formal name is likely to be identified even when the entered query differs from it.

The present invention also proposes a document search apparatus to which the above document search method is applied, that is, a document search apparatus that, when a query (the character string to be searched for) is entered, extracts from a plurality of document data stored in a document database the document data containing a query group obtained by expanding the query, and outputs the extraction result. The apparatus comprises: input means for accepting the entered character string to be searched for as a query; a database of a formal name/regular expression dictionary whose entries each pair the formal name of a word with a regular expression for that formal name; a database of a conversion list, to which items can be freely registered, used as conversion rules when creating the regular expression of the query; a database of an insertion list, to which characters and symbols to be inserted when creating the regular expression of the query can be freely registered; formal name identification means for creating a regular expression of the accepted query using the formal name/regular expression dictionary, the conversion list, and the insertion list, and uniquely identifying the formal name of the word from that regular expression; query expansion means for expanding the query using the created regular expression to create a query group; document search means for performing a document search using the expanded query group and extracting document data containing the query group; notation unification means for unifying the notation of the query contained in the extracted document data using the formal name identified by the formal name identification means and the regular expression corresponding to it; and output means for outputting information on the notation-unified document data.

Furthermore, the present invention proposes a computer program for causing a computer device to operate as the above document search apparatus. The document search apparatus consists of a computer device that comprises a database of a formal name/regular expression dictionary whose entries each pair the formal name of a word with a regular expression for that formal name, a database of a conversion list, to which items can be freely registered, used as conversion rules when creating the regular expression of a query (the character string to be searched for), and a database of an insertion list, to which characters and symbols to be inserted when creating the regular expression of the query can be freely registered; when a query is entered, the apparatus extracts from a plurality of document data stored in a document database the document data containing a query group obtained by expanding the query, and outputs the extraction result. The computer program comprises the steps of: accepting the entered character string to be searched for as a query; creating a regular expression of the accepted query using the formal name/regular expression dictionary, the conversion list, and the insertion list, and uniquely identifying the formal name of the word from that regular expression; expanding the query using the created regular expression to create a query group; performing a document search using the expanded query group and extracting document data containing the query group; unifying the notation of the query contained in the extracted document data using the identified formal name and the regular expression corresponding to it; and outputting information on the notation-unified document data.

According to the document search method and document search apparatus of the present invention, using regular expressions reduces the number of words registered in the dictionary, allows a single dictionary to serve both query expansion and the absorption of orthographic variation, and reduces the dictionary size. In addition, automatically generating regular expressions that cover every conceivable orthographic variant is expected to improve both the speed of response to new words and the coverage, while also reducing cost. Furthermore, because both the entered query and the formal name are described by regular expressions, the formal name is more likely to be identified even when the entered query differs from it.

In addition, using the computer program of the present invention makes it easy to configure, on any computer device, a document search apparatus that carries out the document search method of the present invention.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a functional block diagram of a document search apparatus according to an embodiment of the present invention. In the figure, reference numeral 10 denotes a document search apparatus consisting of a computer device that operates according to a predetermined computer program. It comprises an input unit 1, a question expansion unit 2, a database 3 storing a formal name/regular expression dictionary 31, a database 4 storing a conversion list 41, a database 5 storing an insertion list 51, a document search unit 6, a notation unification unit 7, and an output unit 8. Each of these components is implemented by both the hardware and the software of the computer device.

Details are given later; an overview comes first.

The input unit 1 accepts a search query in any format. If the query contains two or more of the character types digits, alphabetic characters, symbols, and other characters, the question expansion unit 2 expands it into multiple synonymous expressions and outputs the expanded query group to the document search unit 6; otherwise it outputs the query unchanged. The document search unit 6 retrieves documents containing the output of the question expansion unit 2 and passes the resulting document set to the notation unification unit 7. Only when the query has been expanded, the notation unification unit 7 unifies the notation of the input query within the set of document data returned by the document search unit 6; it then passes the set to the output unit 8. The output unit 8 outputs the processing result in any form.

The input and output methods of the input unit 1 and the output unit 8 and the search method of the document search unit 6 use well-known existing techniques, for example those disclosed in Non-Patent Document 2: 情報の科学と技術 (Information Science and Technology), Vol. 54, No. 2, 2004, special feature "Internet Search Engines": "Search Engine Algorithms" (Susumu Kanemune), pp. 78-83, and "Search Engine Architecture" (Hayato Yamana), pp. 84-89.

The regular expressions here are written in Perl syntax as an example; when another notation is used, follow the conventions of that notation. Expressions containing indeterminate elements that "repeat any character any number of times", such as “.*” and “.+”, are not used. Also, although it differs from Perl syntax, “” is used for convenience to represent the space character.

Before the apparatus 10 is used, a setting (the dictionary registration function) must be configured in advance.

That is, when using the search apparatus 10, the user can choose, for example through a configuration file, whether the "dictionary registration function" updates the dictionary (registers new words and extends regular expressions).

When the formal name/regular expression dictionary 31 has to be created, for example when the apparatus 10 is used for the first time, the "dictionary registration function" must be set to ON.

The "dictionary registration function" can also be set to OFF, for example when the apparatus 10 is used only for query expansion, or when the query is an orthographic variant that should not be registered in the dictionary.

The conversion rules and symbols used for query expansion can be set in advance in the conversion list 41 and the insertion list 51.

Next, the operation of each component of the apparatus 10 is described in detail.

The input unit 1 acquires the character string to be searched for, entered by the user in any format, accepts it as a query, and passes the accepted query to the question expansion unit 2. The query consists of a character string.

As shown in FIG. 2, for example, the question expansion unit 2 consists of an input determination unit 21, a formal name identification unit 22, and a query expansion unit 23, and performs its processing using the formal name/regular expression dictionary 31, the conversion list 41, and the insertion list 51.

The formal name/regular expression dictionary 31 is a dictionary in which each entry pairs a regular expression representing the set of synonymous expressions of a word with the formal name of that word. FIG. 3 shows an example of the formal name/regular expression dictionary 31.

The conversion list 41 registers brand names that have many synonymous expressions, symbols whose deletion is not expected to change the meaning, and the like. For an entry such as a brand name with many synonymous expressions, the set of synonymous expressions is registered as a single entry in the conversion item. For example, in entries 1 to 4 of FIG. 4 the synonymous expressions of the conversion item are separated by "," (comma), as in (expression 1,expression 2,expression 3). An entry that can only be deleted is registered with one expression per entry, like (&) in entry 5.

The expressions registered in the entries of the insertion list 51 are used when a query is expanded into expressions that include orthographic variants. The insertion list 51 therefore registers symbols to be inserted during query expansion, such as the space character and the hyphen, which appear frequently in orthographic variants. FIG. 5 shows an example of the insertion list 51. In the insertion list 51 of FIG. 5, “” (space character) is registered as the insertion item of entry 1, and "-" (hyphen) as the insertion item of entry 2.

The input determination unit 21 outputs the query to the formal name identification unit 22 if the query received from the input unit 1 contains any of alphabetic characters, digits, and symbols and consists of two or more character types. Otherwise it outputs the query to the document search unit 6.
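The routing decision of the input determination unit can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `char_type` and `needs_expansion` are hypothetical, only ASCII letters and digits count as "alphabetic" and "digit", Unicode punctuation and symbol categories count as "symbol", and whitespace is ignored when counting character types. The document does not define the character classes this precisely.

```python
import unicodedata


def char_type(ch):
    """Classify one character as alpha, digit, space, symbol, or other."""
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    # Unicode general categories starting with P (punctuation) or S (symbol).
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "symbol"
    return "other"


def needs_expansion(query):
    """True if the query mixes two or more character types and at least one
    of them is alphabetic, digit, or symbol (whitespace is not counted)."""
    types = {char_type(ch) for ch in query} - {"space"}
    return len(types) >= 2 and bool(types & {"alpha", "digit", "symbol"})
```

Under these assumptions, a query such as "XYZ-100" would be sent on for formal name identification, while "XYZ" or a query written only in kanji would go straight to the document search.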

The formal name identification unit 22 performs the processing shown in FIG. 6 using the formal name/regular expression dictionary 31, the conversion list 41, and the insertion list 51. Because this processing needs the ON/OFF state of the "dictionary registration function" configured in advance, that state is read beforehand and used in steps (22s26) and (22s27).

The processing performed by the formal name identification unit 22 is described in detail below with reference to the flowcharts of FIGS. 6 and 7.

The formal name identification unit 22 compares the query received from the input determination unit 21 with the entries of the formal name/regular expression dictionary 31 and determines whether the query matches the regular expression of any entry (22s1). If there is a match, processing proceeds to step (22s3), and the regular expression of the matching entry is output to the query expansion unit 23. If there is no match, processing proceeds to step (22s2), where a regular expression is created.
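The lookup in step (22s1) can be pictured with the following sketch. The dictionary contents and the function name `find_matching_entry` are hypothetical, and the sketch assumes that "match" means the whole query matches the entry's regular expression; the document does not state whether partial matches count.

```python
import re


def find_matching_entry(query, dictionary):
    """Return (formal_name, pattern) for the first entry whose regular
    expression matches the whole query, or None if no entry matches."""
    for formal_name, pattern in dictionary.items():
        if re.fullmatch(pattern, query):
            return formal_name, pattern
    return None


# Hypothetical dictionary: formal name -> regular expression of its variants.
dictionary = {"XYZ-100": r"(XYZ|xyz|Xyz)[- ]?100"}

hit = find_matching_entry("xyz 100", dictionary)   # entry found: go to (22s3)
miss = find_matching_entry("ABC-1", dictionary)    # no entry: go to (22s2)
```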

The details of the regular expression creation in step (22s2) are shown in FIG. 7. First, the query is split at the boundaries between different character types, and the pieces become the elements of an array (22s21). Space characters are deleted rather than kept as elements. Next, rule application processing is performed (22s22). Its details are shown in FIG. 8, and it uses the conversion list 41.
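The splitting in step (22s21) can be sketched as below. The helper `char_type` and its character classes (ASCII letters, ASCII digits, Unicode punctuation and symbols, everything else) are assumptions made for illustration; the function name `split_by_char_type` is likewise hypothetical.

```python
import unicodedata
from itertools import groupby


def char_type(ch):
    """Classify one character as alpha, digit, space, symbol, or other."""
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "symbol"
    return "other"


def split_by_char_type(query):
    """Split the query wherever the character type changes; runs of
    whitespace act as boundaries but are dropped from the result."""
    return ["".join(run) for kind, run in groupby(query, key=char_type)
            if kind != "space"]


elements = split_by_char_type("XYZ-100 EX")
```

For the query "XYZ-100 EX" this yields the four elements "XYZ", "-", "100", and "EX".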

That is, in the rule application processing (22s22), steps (22s221) to (22s226) are repeated for each element of the array created in step (22s21), and the repetition ends once all elements have been examined.

In step (22s222), it is determined whether the array element matches an expression in the conversion list. If it matches, processing proceeds to step (22s223); otherwise it proceeds to step (22s224).

In step (22s223), the array element is changed to (matched entry,?), and processing proceeds to step (22s224).

In step (22s224), it is determined whether the array element contains alphabetic characters. If it does, processing proceeds to step (22s225); otherwise it proceeds to step (22s226).

In step (22s225), the array element is converted to (as is, all uppercase, all lowercase, first letter only uppercase), and any duplicates that arise within the element are removed. Processing then proceeds to step (22s226).
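The case conversion of step (22s225) can be sketched in a few lines. The function name `case_variants` is hypothetical, and returning a Python list rather than the comma-separated "(...)" string used in the description is a simplification for illustration.

```python
def case_variants(element):
    """Return the element as is, in all uppercase, in all lowercase, and with
    only the first letter uppercase, dropping duplicates but keeping order."""
    candidates = [element, element.upper(), element.lower(), element.capitalize()]
    return list(dict.fromkeys(candidates))


variants = case_variants("ex")
```

For "ex" the all-lowercase form duplicates the original, so three variants remain: "ex", "EX", and "Ex".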

In step (22s226), once all elements of the array have been examined, the rule application processing ends and processing proceeds to step (22s23), the symbol insertion processing within step (22s2).

In the symbol insertion step (22s23), the entries of the insertion list 51 are joined with "," (comma), a "?" character is appended at the end to form a single element, and this element is inserted between the elements of the query array. Processing then proceeds to step (22s24).
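The symbol insertion of step (22s23) can be sketched as follows. Instead of the string form with a trailing "?", this sketch represents each element as a pair (alternatives, is_optional); that representation and the function name `insert_optional_symbols` are assumptions made for illustration.

```python
def insert_optional_symbols(elements, insertion_list):
    """Place one optional element, built from the insertion list, between
    every pair of adjacent query elements."""
    separator = (list(insertion_list), True)  # True plays the role of "?"
    result = []
    for index, element in enumerate(elements):
        if index > 0:
            result.append(separator)
        result.append(element)
    return result


# Query elements after rule application, with a space and a hyphen registered
# in the (hypothetical) insertion list.
query_elements = [(["XYZ", "xyz", "Xyz"], False), (["100"], False)]
with_symbols = insert_optional_symbols(query_elements, [" ", "-"])
```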

In the regular expression creation step (22s24), the array elements are concatenated to create a regular expression. The details of this processing are shown in the flowchart of FIG. 9.

That is, in step (22s24), steps (22s241) to (22s248) are repeated for each element of the array.

In step (22s242), it is determined whether every item separated by "," (comma) within an element is a single character. If so, processing proceeds to step (22s243); otherwise it proceeds to step (22s244).

In step (22s243), the "," (comma) characters are deleted and the "()" characters are converted to "[]", and processing proceeds to step (22s245).

In step (22s244), each "," (comma) is converted to "|" (pipe), and processing proceeds to step (22s245).

In step (22s245), it is checked whether the element ends with a "?" character. If it does, processing proceeds to step (22s246); if not, to step (22s247).

In step (22s246), the "?" character is moved to just outside the right-hand bracket enclosing the element, and processing proceeds to step (22s247).

In step (22s247), once all elements have been processed, processing proceeds to step (22s248).

In step (22s248), all elements are concatenated into a single character string, and the processing ends. Processing then proceeds to step (22s25).
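Steps (22s242) to (22s248) can be approximated with the sketch below. It is not a literal transcription: each element is assumed to be a pair (alternatives, is_optional) rather than a comma-separated string, the output targets Python's `re` syntax rather than Perl's, literals are escaped with `re.escape`, and an element with a single non-optional alternative is emitted without brackets. The function names are hypothetical.

```python
import re


def element_to_regex(alternatives, optional=False):
    """Turn one element into a regex fragment: a character class when every
    alternative is a single character, otherwise a group joined by '|'."""
    if len(alternatives) == 1 and not optional:
        return re.escape(alternatives[0])
    if all(len(alt) == 1 for alt in alternatives):
        body = "[" + "".join(re.escape(alt) for alt in alternatives) + "]"
    else:
        body = "(" + "|".join(re.escape(alt) for alt in alternatives) + ")"
    # The optional marker ends up outside the closing bracket, as in (22s246).
    return body + ("?" if optional else "")


def build_regex(elements):
    """Concatenate all element fragments into one regular expression."""
    return "".join(element_to_regex(alts, opt) for alts, opt in elements)


pattern = build_regex([
    (["XYZ", "xyz", "Xyz"], False),
    (["-", " "], True),
    (["100"], False),
])
```

The resulting pattern accepts "XYZ-100", "xyz 100", and "Xyz100", but not "XYZ_100", because the underscore is not among the inserted symbols.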

In step (22s25), it is checked whether the formal name/regular expression dictionary 31 contains a formal name that matches the created regular expression. If it does, processing proceeds to step (22s26); if not, to step (22s27).

In step (22s26), the regular expression corresponding to the matched formal name is extended so that it includes the newly created regular expression, and the result is registered when the dictionary registration function is ON. The details of step (22s26) are described with reference to the flowchart of FIG. 10. As shown in FIG. 10, step (22s26) consists of steps (22s261) to (22s2610).

In step (22s261), the two regular expressions are split into units of bracket pairs and stored in arrays. Any part that is not enclosed in brackets is wrapped in "()" before being stored. The process then proceeds to step (22s262).

In step (22s262), elements that match exactly between the two regular expressions are located and, using them as boundaries, the groups of non-matching elements are compared with each other from the front. The process then proceeds to step (22s263).

In step (22s263), the groups of non-matching elements extracted in step (22s262) are searched for parts that match partially. The process then proceeds to step (22s264).

In step (22s264), it is determined whether the elements differ in only one part. If only one part of the element differs and the rest matches, the process proceeds to step (22s265); otherwise, it proceeds to step (22s267).

In step (22s265), it is determined whether the differing part of the element is a single character. If it is, the process proceeds to step (22s266); otherwise, it proceeds to step (22s268).

In step (22s266), if the element is enclosed in "[]", the non-matching character is added inside the brackets; if it is enclosed in "()", the character is joined with "|" (pipe). The process then proceeds to step (22s269).

Step (22s267) handles the case in which step (22s264) found no matching element: the character "?" is appended outside the brackets of each element, and the elements are inserted back at their original position. The process then proceeds to step (22s269).

In step (22s268), if the element is enclosed in "[]", the brackets are changed to "()" and the alternatives are joined with "|" (pipe). The process then proceeds to step (22s269).

In step (22s269), once all the non-matching parts detected in step (22s262) have been compared, the loop ends and the process proceeds to step (22s2610).

In step (22s2610), if the dictionary registration function described above is ON, the regular expression of the corresponding entry in the formal name/regular expression dictionary 31 shown in FIG. 3 is replaced with the newly built regular expression, and step (22s26) ends.

When step (22s26) ends, step (22s2) also ends, and the process proceeds to step (22s3).

In step (22s27), if the dictionary registration function is ON, the input query is registered as a formal name, together with the created regular expression, as a new entry in the formal name/regular expression dictionary 31 shown in FIG. 3. Step (22s2) then ends, and the process proceeds to step (22s3).

In step (22s3), once the regular expression creation of step (22s2) is complete, the pair of formal name and regular expression, either already registered in the dictionary or newly created, is output to the query expansion unit 23, and the processing ends.

The query expansion unit 23 expands the query using the regular expression created above and generates a query group. The details are shown in the flowchart of FIG. 11; the unit performs steps (23s1) to (23s9).

In step (23s1), the regular expression output by the formal name identification unit 22 is fully expanded except for its symbol parts, and the process proceeds to step (23s2).

Steps (23s2) to (23s8) are repeated for each not-yet-expanded symbol part of the regular expression.

In step (23s3), it is determined whether the character before or after the symbol part being processed is a symbol. If it is, the information being processed is copied, and steps (23s4) and (23s6) proceed in parallel. Otherwise, step (23s7) is performed.

In step (23s4), the symbols before and after the symbol part being processed are deleted, and the process returns to step (23s3).

In step (23s5), it is determined whether the position before or after the symbol part being processed is the beginning or end of the query. If it is, step (23s6) is performed; otherwise, step (23s7) is performed.

In step (23s6), the symbol part being processed is deleted without being expanded, and the process returns to step (23s2) to handle the next symbol part.

In step (23s7), the symbol part being processed is expanded, and the process proceeds to step (23s8).

In step (23s8), once all symbol parts in the regular expression have been processed, the loop ends and the process proceeds to step (23s9).

In step (23s9), among the expanded queries, any that duplicate a previously generated query are deleted, and the processing ends.

The query expansion unit 23 passes the set of synonymous expressions obtained by expanding the query, together with the formal name and the regular expression, to the document search unit 6. If no query expansion was performed, it passes only the query to the document search unit 6.

This completes the processing of the question expansion unit 2.

The document search unit 6 receives from the question expansion unit 2 either [the query only] or [formal name, regular expression, set of synonymous expressions] for the query. In the former case it searches with the query itself, and in the latter case with the set of synonymous expressions, using well-known existing techniques, and extracts the matching document data from the document data stored in the document database 100. It then outputs the list of retrieved document data, along with the formal name and the regular expression if present, to the notation unification unit 7.

As shown in FIG. 12, the notation unification unit 7 consists of an output determination unit 71 and a unification unit 72. When the question expansion unit 2 has expanded the query, the notation unification unit 7 receives the search results, the formal name of the query, and its regular expression from the document search unit 6. When no expansion was performed, it receives only the search results.

If the output determination unit 71 received only the search results from the document search unit 6, it passes them unchanged to the output unit 8. If it also received a formal name and a regular expression, it passes all of them to the unification unit 72 and requests processing.

The unification unit 72 uses the regular expression to find expressions in the retrieved document data that are notational variants of the query, and rewrites them with the formal name. It then passes the result to the output unit 8.

The output unit 8 outputs, in any desired format, information on the document data in which the notation of the query has been unified.

Next, a worked example of the above embodiment is described in detail.

Here, regular expressions follow Perl syntax. The symbol “” denotes a blank (space).

In this example, the dictionary registration function is set to ON in advance. The description assumes that the query "FOMA P902i" is input to the input unit 1 of the document search apparatus 10 as the character string to be searched. The formal name/regular expression dictionary 31, the conversion list 41, and the insertion list 51 used by the question expansion unit 2 are those shown in FIGS. 3, 4, and 5, respectively.

The input determination unit 21 of the question expansion unit 2 receives the query "FOMA P902i" accepted by the input unit 1. Because the query consists of alphabetic characters, digits, and a space, it is output to the formal name identification unit 22.

The formal name identification unit 22 performs the processing shown in FIG. 6. In step (22s1), the query "FOMA P902i" received from the input determination unit 21 does not match the regular expression of any entry in the formal name/regular expression dictionary 31 shown in FIG. 3, so the process proceeds to step (22s2), where a regular expression is created.

The regular expression creation of step (22s2) uses the conversion list 41, the insertion list 51, and the formal name/regular expression dictionary 31. In step (22s21), the query "FOMA P902i" is first split at boundaries between character types, giving [FOMA,P,902,i]. The “” (space) between FOMA and P is deleted.
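The splitting in step (22s21) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names `char_type` and `split_query` and the four character classes (ASCII letters, ASCII digits, whitespace, everything else) are assumptions made for the example.

```python
from itertools import groupby

def char_type(c: str) -> str:
    """Classify a character; the four classes are an assumption of this sketch."""
    if c.isspace():
        return "space"
    if c.isascii() and c.isalpha():
        return "alpha"
    if c.isascii() and c.isdigit():
        return "digit"
    return "other"

def split_query(query: str) -> list[str]:
    """Split at character-type boundaries and drop runs of whitespace."""
    return [
        "".join(group)
        for kind, group in groupby(query, key=char_type)
        if kind != "space"
    ]

print(split_query("FOMA P902i"))  # ['FOMA', 'P', '902', 'i']
```

Grouping consecutive characters by class keeps "FOMA" and "P" apart only because the space between them forms its own run, which is then discarded.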

Next, rule application is performed in step (22s22), using the conversion list 41.

In step (22s221), steps (22s221) to (22s226) are repeated for each element of the array [FOMA,P,902,i] created in step (22s21), starting with [FOMA].

In step (22s222), it is determined whether the array element [FOMA] matches an entry in the conversion list 41. It matches entry 1, so the process proceeds to step (22s223).

In step (22s223), the array element [FOMA] is changed to [(FOMA,フォーマ,?)], and the process proceeds to step (22s224).

In step (22s224), the changed array element (FOMA,フォーマ,?) contains alphabetic characters, so the process proceeds to step (22s225).

In step (22s225), (FOMA,フォーマ,?) → (FOMA,FOMA,foma,Foma,フォーマ,?) → (FOMA,foma,Foma,フォーマ,?), and the process proceeds to step (22s226).

In step (22s226), not all elements have been examined yet, so the process returns to step (22s221).

In step (22s221), the next array element [P] is processed.

In step (22s222), the array element [P] does not match any entry in the conversion list, so the process proceeds to step (22s224).

In step (22s224), the array element [P] contains an alphabetic character, so the process proceeds to step (22s225).

In step (22s225), P → (P,P,p,P) → (P,p), and the process proceeds to step (22s226).

In step (22s226), not all elements have been examined yet, so the process returns to step (22s221), and the remaining elements [902] and [i] are processed in the same way. Once all elements have been processed, the final array is [(FOMA,foma,Foma,フォーマ,?),(P,p),902,(i,I)]. Rule application then ends, and the process proceeds to step (22s23) of the regular expression creation of step (22s2).
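The case handling of step (22s225) can be illustrated with a short sketch. It assumes, based only on the intermediate forms shown above such as (P,P,p,P), that each alphabetic item is followed by its upper-case, lower-case, and capitalized forms before duplicates are removed; the function name `case_variants` is hypothetical.

```python
def case_variants(items):
    """Add upper-case, lower-case and capitalized forms, then drop duplicates."""
    result = []
    for item in items:
        if item.isascii() and item.isalpha():
            candidates = [item, item.upper(), item.lower(), item.capitalize()]
        else:
            # Katakana, digits, and the "?" marker pass through unchanged.
            candidates = [item]
        for candidate in candidates:
            if candidate not in result:
                result.append(candidate)
    return result

print(case_variants(["FOMA", "フォーマ", "?"]))  # ['FOMA', 'foma', 'Foma', 'フォーマ', '?']
print(case_variants(["P"]))                      # ['P', 'p']
print(case_variants(["i"]))                      # ['i', 'I']
```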

In step (22s23), the insertion list 51 shown in FIG. 5 is split at "," (comma) to give [“”,-], the character "?" is appended to form the single element [“”,-,?], and this element is inserted between the elements of the query array, giving [(FOMA,foma,Foma,フォーマ,?),(“”,-,?),(P,p),(“”,-,?),902,(“”,-,?),(i,I)]. The process then proceeds to step (22s24).
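A minimal sketch of the insertion in step (22s23) follows. The function name `insert_separators` and the tuple representation of elements are assumptions for illustration, and the space that the text writes as “” is written here as a literal space.

```python
def insert_separators(elements, insertion_list):
    """Place one optional separator element between neighbouring query elements."""
    # " ,-" -> (" ", "-", "?"): the trailing "?" marks the separator as optional.
    separator = tuple(insertion_list.split(",")) + ("?",)
    result = []
    for index, element in enumerate(elements):
        if index > 0:
            result.append(separator)
        result.append(element)
    return result

elements = [("FOMA", "foma", "Foma", "フォーマ", "?"), ("P", "p"), "902", ("i", "I")]
with_separators = insert_separators(elements, " ,-")
print(len(with_separators))  # 7
```

Four query elements give three gaps, so the resulting array has seven elements, which is the count processed in step (22s24).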

In step (22s24), the array elements are joined to create a regular expression. Steps (22s241) to (22s248) are repeated seven times, once per element, in the order (FOMA,foma,Foma,フォーマ,?), (“”,-,?), (P,p), (“”,-,?), 902, (“”,-,?), (i,I). Processing starts with (FOMA,foma,Foma,フォーマ,?).

In step (22s242), not all of the comma-separated items in (FOMA,foma,Foma,フォーマ,?) are single characters, so the process proceeds to step (22s244).

In step (22s246), the character "?" is moved to just outside the right of the element's outer brackets, giving (FOMA|foma|Foma|フォーマ)?, and the process proceeds to step (22s247).

In step (22s247), not all elements have been examined yet, so the process returns to step (22s241).

Next, in step (22s241), the array element (“”,-,?) is processed.

In step (22s242), all of the comma-separated items in (“”,-,?) are single characters, so the process proceeds to step (22s243).

In step (22s243), the commas are deleted from the array element (“”,-,?) and "()" is converted to "[]", giving [“”-?]. The process then proceeds to step (22s245).

In step (22s245), the array element [“”-?] ends with the character "?", so the process proceeds to step (22s246).

In step (22s246), the character "?" is moved to just outside the right of the element's outer brackets, giving [“”-]?, and the process proceeds to step (22s247).

In step (22s247), not all elements have been examined yet, so the process returns to step (22s241), and the remaining elements (P,p), (“”,-,?), 902, (“”,-,?), (i,I) are processed in the same way.

When step (22s247) determines that all elements have been processed, the loop ends and the process proceeds to step (22s248).

In step (22s248), all elements are concatenated into the character string (FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI], and the process proceeds to step (22s25).
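The joining procedure of steps (22s241) to (22s248) can be sketched as below. This is an illustration under stated assumptions, not the patent's code: the names `element_to_regex` and `join_elements` are hypothetical, elements are modelled as tuples (or plain strings for tokens such as 902), the space is written literally, and the input is the array produced in step (22s23), which still contains フォーマ.

```python
import re

def element_to_regex(element):
    """Convert one array element into a regex fragment (steps 22s242 to 22s246)."""
    if isinstance(element, str):            # plain token such as "902"
        return element
    items = list(element)
    optional = items[-1] == "?"             # a trailing "?" marks an optional element
    if optional:
        items = items[:-1]
    if all(len(item) == 1 for item in items):
        # Only single characters: drop the commas and use a character class.
        # The hyphen is placed last so that it cannot be read as a range.
        chars = sorted(items, key=lambda ch: ch == "-")
        body = "[" + "".join(chars) + "]"
    else:
        # Otherwise join the alternatives with "|" inside parentheses.
        body = "(" + "|".join(items) + ")"
    return body + ("?" if optional else "")

def join_elements(elements):
    """Concatenate all fragments into one regular expression (step 22s248)."""
    return "".join(element_to_regex(e) for e in elements)

separator = (" ", "-", "?")
elements = [("FOMA", "foma", "Foma", "フォーマ", "?"), separator, ("P", "p"),
            separator, "902", separator, ("i", "I")]
pattern = join_elements(elements)
print(pattern)  # (FOMA|foma|Foma|フォーマ)?[ -]?[Pp][ -]?902[ -]?[iI]
print(bool(re.fullmatch(pattern, "FOMA P902i")))  # True
```

The sketch does not escape regex metacharacters inside alternatives, which is acceptable only because the example items contain none.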

In step (22s25), the formal name "DoCoMo P902i", which matches the created regular expression, is found in entry 3 of the formal name/regular expression dictionary 31 in FIG. 3, so the process proceeds to step (22s26).

In step (22s26), the regular expression registered for the matched formal name is extended so that it also covers the newly created regular expression.

In step (22s261), the regular expression in the formal name/regular expression dictionary 31, (DoCoMo|Docomo|ドコモ)[“”-]?[Pp][“”-]902[“”-]?i, and the newly created regular expression, (FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI], are split into units of bracket pairs and stored in arrays. Parts without brackets are wrapped in () before being stored, giving [(DoCoMo|Docomo|ドコモ),[“”-]?,[Pp],[“”-],(902),[“”-]?,(i)] and [(FOMA|Foma|foma)?,[“”-]?,[Pp],[“”-]?,(902),[“”-]?,[iI]]. The process then proceeds to step (22s262).
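The splitting of step (22s261) can be approximated with a small tokenizer. The sketch below is an assumption-laden illustration: the name `split_into_units` is hypothetical, nested brackets are not supported, and the space is written literally rather than as “”.

```python
import re

# One unit is "(...)" or "[...]" with an optional trailing "?", or a run of plain text.
UNIT = re.compile(r"\([^()]*\)\??|\[[^\[\]]*\]\??|[^()\[\]?]+")

def split_into_units(regex: str) -> list[str]:
    """Split a regex into bracket-pair units, wrapping plain text in "()"."""
    units = []
    for token in UNIT.findall(regex):
        if token[0] in "([":
            units.append(token)
        else:
            units.append("(" + token + ")")
    return units

dictionary_regex = "(DoCoMo|Docomo|ドコモ)[ -]?[Pp][ -]902[ -]?i"
new_regex = "(FOMA|Foma|foma)?[ -]?[Pp][ -]?902[ -]?[iI]"
print(split_into_units(dictionary_regex))
# ['(DoCoMo|Docomo|ドコモ)', '[ -]?', '[Pp]', '[ -]', '(902)', '[ -]?', '(i)']
print(split_into_units(new_regex))
# ['(FOMA|Foma|foma)?', '[ -]?', '[Pp]', '[ -]?', '(902)', '[ -]?', '[iI]']
```

Once both expressions are in this form, the element-by-element comparison of steps (22s262) onward reduces to comparing two lists of strings.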

In step (22s262), elements that match between the two regular expressions are located. Here [[“”-]?,[Pp],[“”-],(902),[“”-]?] match, so the groups of elements before and after the matching part, namely [(DoCoMo|Docomo|ドコモ)] versus [(FOMA|Foma|foma)] and [(i)] versus [[iI]], are processed by repeating steps (22s263) to (22s268) and compared from the front. Changes such as insertions are applied to the elements of the newly created array (FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]. Processing starts with [(DoCoMo|Docomo|ドコモ)] and [(FOMA|Foma|foma)], so the process proceeds to step (22s263).

In step (22s263), the groups of non-matching elements extracted in step (22s262) are searched for parts that match partially. There is no partial match between [(DoCoMo|Docomo|ドコモ)] and [(FOMA|Foma|foma)], so the process proceeds to step (22s264).

In step (22s264), no part of the elements matches, so the process proceeds to step (22s267).

In step (22s267), the character "?" is appended outside the brackets of each element and the elements are joined, giving [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?]. The process then proceeds to step (22s269).

In step (22s269), not all elements have been compared yet, so the process returns to step (22s262) and the comparison is repeated for the next differing part.

In step (22s262), matching parts are sought between the two regular expressions [(i)] and [[iI]], and the process proceeds to step (22s263).

In step (22s263), the groups of non-matching elements extracted in step (22s262) are searched for parts that match partially. Here only i is common to [(i)] and [[iI]]. The process then proceeds to step (22s264).

In step (22s264), [(i)] and [[iI]] differ in only one part of the element while the rest matches, so the process proceeds to step (22s265).

In step (22s265), the differing part of [(i)] and [[iI]] is a single character, so the process proceeds to step (22s266).

In step (22s266), the element is enclosed in [], so the non-matching character is added inside the brackets, giving [[iI]], and the process proceeds to step (22s269).

In step (22s269), all differing parts have now been compared, so the array under construction is joined back into the character string [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?,[“”-]?,[Pp],[“”-]?,(902),[“”-]?,[iI]], the loop ends, and the process proceeds to step (22s2610).

In step (22s2610), entry 3 of the formal name/regular expression dictionary 31 shown in FIG. 3 is changed from DoCoMo P902i / (DoCoMo|Docomo|ドコモ)[“”-]?[Pp][“”-]902[“”-]?i to DoCoMo P902i / [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]]. Step (22s26) then ends, the process returns to step (22s2), which also ends, and the process proceeds to step (22s3).

In step (22s3), once the regular expression creation of step (22s2) is complete, DoCoMo P902i / [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]] is output to the query expansion unit 23, and the processing ends.

The query expansion unit 23 performs the processing shown in FIG. 11. In step (23s1), it fully expands the regular expression output by the formal name identification unit 22 except for its symbol parts, producing, for example, (Example 1) FOMA[“”-]?P[“”-]?902[“”-]?i and (Example 2) [“”-]?p[“”-]?902[“”-]?i, and the process proceeds to step (23s2).

In steps (23s2) to (23s8), processing proceeds for each not-yet-expanded symbol part of the regular expression.

In the case of (Example 1), steps (23s3) to (23s6) are not performed; in step (23s7), FOMA[“”-]?P[“”-]?902[“”-]?i is fully expanded as it stands into
FOMA P 902 i
FOMA P 902-i
FOMA P 902i
FOMA P-902 i
FOMA P-902-i
FOMA P-902i
FOMA P902 i
FOMA P902-i
FOMA P902i
FOMA-P 902 i
FOMA-P 902-i
FOMA-P 902i
FOMA-P-902 i
FOMA-P-902-i
FOMA-P-902i
FOMA-P902 i
FOMA-P902-i
FOMA-P902i
FOMAP 902 i
FOMAP 902-i
FOMAP 902i
FOMAP-902 i
FOMAP-902-i
FOMAP-902i
FOMAP902 i
FOMAP902-i
FOMAP902i
and the process proceeds to step (23s8).
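The full expansion in step (23s7) amounts to a Cartesian product over the choices at each position. The following sketch assumes a pre-parsed representation (a list of option lists) instead of parsing the regular expression itself; the names `expand`, `SEP`, and `parts` are hypothetical.

```python
from itertools import product

def expand(parts):
    """Return every string obtained by picking one option per position."""
    return ["".join(choice) for choice in product(*parts)]

# The optional separator [ -]? contributes a space, a hyphen, or nothing.
SEP = [" ", "-", ""]
parts = [["FOMA"], SEP, ["P"], SEP, ["902"], SEP, ["i"]]

queries = expand(parts)
print(len(queries))  # 27
print(queries[0])    # FOMA P 902 i
print(queries[-1])   # FOMAP902i
```

Three independent separators with three options each give 3 × 3 × 3 = 27 strings, matching the length of the list above.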

In the case of (Example 2), step (23s5) finds that the symbol part is preceded by the beginning of the query, so step (23s6) is performed, giving p[“”-]?902[“”-]?i, which is fully expanded in step (23s7) into
p 902 i
p 902-i
p 902i
p-902 i
p-902-i
p-902i
p902 i
p902-i
p902i
and the process proceeds to step (23s8).

In step (23s8), once all symbol parts in the expanded regular expression have been processed, the loop ends and the process proceeds to step (23s9).

In step (23s9), among the expanded queries, any that duplicate a previously generated query are deleted. Here there are no duplicates, so the processing ends, and the set of synonymous expressions for the expanded query (1728 variants, such as FOMA P-902i, FOMA-P902I, and p902i), the formal name DoCoMo P902i, and the regular expression [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]] are passed to the document search unit 6. This completes the processing of the question expansion unit 2.
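The figure of 1728 can be checked by multiplying the number of choices at each position of the regular expression. The sketch below takes the plain product and does not model the special handling of steps (23s3) to (23s6); the variable names are hypothetical and the space is written literally.

```python
from itertools import product
from math import prod

SEP = [" ", "-", ""]                      # [ -]? : space, hyphen, or nothing
groups = [
    ["DoCoMo", "Docomo", "ドコモ", ""],   # (DoCoMo|Docomo|ドコモ)?
    ["FOMA", "Foma", "foma", ""],         # (FOMA|Foma|foma)?
    SEP, ["P", "p"], SEP, ["902"], SEP, ["i", "I"],
]

count = prod(len(group) for group in groups)
variants = {"".join(choice) for choice in product(*groups)}
print(count, len(variants))  # 1728 1728
```

The product is 4 × 4 × 3 × 2 × 3 × 1 × 3 × 2 = 1728, and building the strings as a set confirms that none of them coincide.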

The document search unit 6 receives [formal name, regular expression, set of synonymous expressions] for the query from the question expansion unit 2 and searches with the set of synonymous expressions using existing techniques. It then outputs the list of retrieved document data, the formal name, and the regular expression to the notation unification unit 7.

The notation unification unit 7 performs the processing shown in FIG. 12. Because the question expansion unit 2 has expanded the query, the notation unification unit 7 receives
・the search results
・the formal name DoCoMo P902i
・the regular expression [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]]

Since the output determination unit 71 of the notation unification unit 7 has received a formal name and a regular expression, it passes them to the unification unit 72, which carries out the processing.

The unification unit 72 uses the regular expression [(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[“”-]?[Pp][“”-]?902[“”-]?[iI]] to find expressions in the retrieved document data that are notational variants of the query, and rewrites them with the formal name "DoCoMo P902i". It then passes the result to the output unit 8.
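The rewriting performed by the unification unit 72 corresponds to a regex substitution. The sketch below uses Python's `re` module rather than Perl; the function name `unify` is hypothetical. One detail is an assumption of this sketch and is not described in the source: because the pattern may begin with an optional separator, a match can swallow the space before the product name, so the replacement keeps that leading separator.

```python
import re

FORMAL_NAME = "DoCoMo P902i"
PATTERN = re.compile(
    r"(DoCoMo|Docomo|ドコモ)?(FOMA|Foma|foma)?[ -]?[Pp][ -]?902[ -]?[iI]"
)

def unify(text: str) -> str:
    """Rewrite every notational variant in the text with the formal name."""
    def replace(match: re.Match) -> str:
        matched = match.group(0)
        # Assumption: keep a separator captured at the start of the match
        # so that the preceding word stays separated.
        lead = matched[0] if matched[0] in " -" else ""
        return lead + FORMAL_NAME
    return PATTERN.sub(replace, text)

print(unify("I bought a foma-p902I and a Docomo P 902 i."))
# I bought a DoCoMo P902i and a DoCoMo P902i.
```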

The output unit 8 presents to the user, in any desired format, the set of document data in which the notation of the query has been unified.

As described above, the document search apparatus 10 of this embodiment, to which the document search method of the present invention is applied, uses regular expressions and can therefore reduce the number of words registered in the dictionary. A single dictionary serves both query expansion and the absorption of notational variants, which reduces the dictionary size. Automatically generating a regular expression that covers all conceivable notational variants improves both the speed of handling new words and the coverage, while reducing cost. Furthermore, because both the input query and the formal name are described by regular expressions, the formal name is more likely to be identified even when the input query differs from it. In addition, by using the computer program of the present invention applied to the document search apparatus 10 of this embodiment, a document search apparatus that carries out the document search method of the present invention can easily be built on any computer.

Functional block diagram of the document search apparatus in an embodiment of the present invention.
Functional block diagram showing the detailed configuration of the question expansion unit in an embodiment of the present invention.
Diagram showing an example of the formal name / regular expression dictionary in an embodiment of the present invention.
Diagram showing an example of the conversion list in an embodiment of the present invention.
Diagram showing an example of the insertion list in an embodiment of the present invention.
Flowchart explaining the processing operation of the formal name identification unit in an embodiment of the present invention.
Flowchart explaining the operation of the regular expression creation process in an embodiment of the present invention.
Flowchart explaining the operation of the rule application process in an embodiment of the present invention.
Flowchart explaining the operation of the regular expression creation process in an embodiment of the present invention.
Flowchart explaining the operation of the regular expression expansion process in an embodiment of the present invention.
Flowchart explaining the processing operation of the query expansion unit in an embodiment of the present invention.
Functional block diagram showing the detailed configuration of the notation unification unit in an embodiment of the present invention.

Explanation of symbols

1 … input unit, 2 … question expansion unit, 3, 4, 5 … databases, 6 … document search unit, 7 … notation unification unit, 8 … output unit, 21 … input discrimination unit, 22 … formal name identification unit, 23 … query expansion unit, 31 … formal name / regular expression dictionary, 41 … conversion list, 51 … insertion list, 71 … output discrimination unit, 72 … unification unit, 100 … document database.

Claims

A document search method for a document search device comprising a computer that has a database of a formal name / regular expression dictionary whose entries each pair the formal name of a word with a regular expression of that formal name, a database of a conversion list in which conversion rules used when creating a regular expression of a query, the query being a character string to be searched for, can be arbitrarily registered, and a database of an insertion list in which characters and symbols to be inserted when creating the regular expression of the query can be arbitrarily registered, the method being one in which, when a query is input, document data containing a query group obtained by expanding the query is extracted from a plurality of document data stored in a document database and the extraction result is output, characterized in that
the document search device:
accepts the input character string to be searched for as a query;
creates a regular expression of the accepted query using the formal name / regular expression dictionary, the conversion list, and the insertion list, and uniquely identifies the formal name of the word from the regular expression;
expands the query using the created regular expression to create a query group;
performs a document search using the expanded query group and extracts document data containing the query group;
unifies the notation of the query contained in the extracted document data using the identified formal name and the regular expression corresponding to the formal name; and
outputs information on the document data whose notation has been unified.

The document search method according to claim 1, wherein the document search device updates the formal name / regular expression dictionary using the regular expression created when a new query is input and information on the formal name of the word identified from that regular expression.

A document retrieval apparatus that, when a query, being a character string to be searched for, is input, extracts document data containing a query group obtained by expanding the query from a plurality of document data stored in a document database and outputs the extraction result, the apparatus comprising:
input means for accepting the input character string to be searched for as a query;
a database of a formal name / regular expression dictionary whose entries each pair the formal name of a word with a regular expression of that formal name;
a database of a conversion list in which conversion rules used when creating a regular expression of the query can be arbitrarily registered;
a database of an insertion list in which characters and symbols to be inserted when creating the regular expression of the query can be arbitrarily registered;
formal name identification means for creating a regular expression of the accepted query using the formal name / regular expression dictionary, the conversion list, and the insertion list, and uniquely identifying the formal name of the word from the regular expression;
query expansion means for expanding the query using the created regular expression to create a query group;
document search means for performing a document search using the expanded query group and extracting document data containing the query group;
notation unification means for unifying the notation of the query contained in the extracted document data using the formal name identified by the formal name identification means and the regular expression corresponding to the formal name; and
output means for outputting information on the document data whose notation has been unified.

A document retrieval apparatus wherein the formal name identification means has means for creating the formal name / regular expression dictionary using the created regular expression and information on the formal name of the word identified from that regular expression.

A computer program for operating a document search device comprising a computer that has a database of a formal name / regular expression dictionary whose entries each pair the formal name of a word with a regular expression of that formal name, a database of a conversion list in which conversion rules used when creating a regular expression of a query, the query being a character string to be searched for, can be arbitrarily registered, and a database of an insertion list in which characters and symbols to be inserted when creating the regular expression of the query can be arbitrarily registered, the document search device being one that, when a query is input, extracts document data containing a query group obtained by expanding the query from a plurality of document data stored in a document database and outputs the extraction result, the program comprising:
a step of accepting the input character string to be searched for as a query;
a step of creating a regular expression of the accepted query using the formal name / regular expression dictionary, the conversion list, and the insertion list, and uniquely identifying the formal name of the word from the regular expression;
a step of expanding the query using the created regular expression to create a query group;
a step of performing a document search using the expanded query group and extracting document data containing the query group;
a step of unifying the notation of the query contained in the extracted document data using the identified formal name and the regular expression corresponding to the formal name; and
a step of outputting information on the document data whose notation has been unified.

The computer program according to claim 5, comprising a step of updating the formal name / regular expression dictionary using the regular expression created when a character string to be newly searched for is input and information on the formal name of the word identified from that regular expression.