JP4813312B2

JP4813312B2 - Electronic document search method, electronic document search apparatus and program

Info

Publication number: JP4813312B2
Application number: JP2006267847A
Authority: JP
Inventors: 正吾新海; 啓北内; 一也小西; 徹高木
Original assignee: NTT Data Corp
Current assignee: NTT Data Corp
Priority date: 2006-09-29
Filing date: 2006-09-29
Publication date: 2011-11-09
Anticipated expiration: 2026-09-29
Also published as: JP2008090396A

Description

The present invention relates to an electronic document search technique that makes it possible to retrieve electronic documents beyond the range of the search conditions specified by a searcher.

There is growing demand for efficient retrieval of important information contained in large-scale collections of electronic documents, such as digitized newspaper articles and documents on the web. One known technique for meeting this demand works as follows: search rules for finding the desired electronic documents are created in advance, electronic documents extracted from the document group are subjected to morphological analysis, and the search rules are applied to the analysis results by pattern matching to retrieve the documents that conform to the rules.

FIG. 15 illustrates an example of a conventional search technique that uses morphological analysis. In this example, the search rule BANK = noun-proper noun-region "bank" is used to extract electronic documents (articles) that contain a word satisfying the condition "a noun that is a proper noun and denotes a region" together with the word "bank". The rule is classified under the category "BANK" and is expressed as a logical product (AND) of multiple words. Applying the rule to the morphological analysis results by pattern matching retrieves "AA bank" (where AA is a proper noun denoting a region name) and "BB bank" (where BB is a proper noun denoting a region name different from AA) as words belonging to the "BANK" category.
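The pattern matching described for FIG. 15 can be illustrated with a minimal Python sketch. The `Token` type, the part-of-speech tuples, and the requirement that the region name directly precede "bank" are assumptions made for illustration; the source only states that the rule is a logical product of conditions applied to morphological analysis results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    surface: str   # the word as written in the article
    pos: tuple     # hypothetical part-of-speech path from a morphological analyzer

REGION_NOUN = ("noun", "proper noun", "region")

def match_bank(tokens):
    """Return phrases in which a region proper noun is directly followed by 'bank'."""
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        # Conditions 1-3: noun / proper noun / region; condition 4: keyword "bank".
        if left.pos == REGION_NOUN and right.surface == "bank":
            hits.append(f"{left.surface} {right.surface}")
    return hits

tokens = [
    Token("AA", REGION_NOUN),
    Token("bank", ("noun", "common")),
    Token("announced", ("verb",)),
    Token("BB", REGION_NOUN),
    Token("bank", ("noun", "common")),
]
print(match_bank(tokens))  # ['AA bank', 'BB bank']
```

A proper noun that is not tagged as a region fails the first three conditions under this sketch, which is the kind of omission the later sections set out to address.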

The search rules used in this technique are regular expressions whose constituent elements are words and part-of-speech names, so writing them requires knowledge of grammar. To reduce this burden, various techniques for creating search rules efficiently have been proposed. One example analyzes the constituent elements of the electronic documents to be retrieved and their combination patterns, and automatically outputs frequently retrieved patterns as search rules (see, for example, Patent Document 1).

Another technique performs a similar-document search using, as search words, the words retrieved by an existing search rule, selects as similar words those words in the retrieved similar documents that have a high degree of similarity to the search words, and replaces constituent elements of the existing search rule with the selected similar words (see, for example, Patent Document 2).

Patent Document 1: JP 2001-318792 A
Patent Document 2: JP 2004-295797 A

With the technique described in Patent Document 1, if the searcher leaves out of the search rule some of the conditions needed to find electronic documents containing a desired word, it becomes difficult to create a search rule that retrieves those documents without omission. This situation is described concretely with reference to FIG. 16.

Suppose that digitized newspaper articles contain the financial institution names "CC bank", "DD bank", "EE bank", "○○ bank", and "FF securities". CC, DD, EE, and FF are proper nouns denoting different region names, and ○○ is a proper noun with a meaning other than a region name. If a searcher who wants articles containing these financial institution names specifies "CC bank", "DD bank", and "EE bank" as the search conditions, all three words consist of a proper noun denoting a region name followed by the characters "bank", so a search rule is generated that retrieves only articles containing "CC bank", "DD bank", and "EE bank". Because "○○" is not a proper noun denoting a region name, this rule does not retrieve articles containing "○○ bank". Likewise, because the characters "securities" do not match the characters "bank", articles containing "FF securities" are not retrieved either. Thus, unless the searcher enters appropriate search conditions, omissions occur and articles containing the desired words cannot be found.

With the technique described in Patent Document 2, the words obtained by similar-word expansion may include words that are unsuitable as constituent elements of a search rule, in which case it becomes difficult to create an appropriate search rule. For example, as shown in FIG. 17, when searching for articles that contain financial institution names, the technique of Patent Document 2 extracts words such as "exchange", "interest rate", and "deposit" as similar words for the word "bank" entered by the searcher. These words are unlikely to denote financial institution names, however, so a search rule generated by replacing the search word with them cannot properly retrieve articles containing financial institution names.

An object of the present invention is to provide a technique that can retrieve electronic documents containing a desired word even when the searcher has not appropriately specified the search conditions for finding them.

To solve the above problem, the present invention provides an electronic document search method, an electronic document search apparatus, and a computer program.
The electronic document search method provided by the present invention is performed by an apparatus that enables a search of an electronic document group beyond the range of search conditions specified in advance. The method comprises: a primary search step of accepting the search conditions as a primary search rule and searching the electronic document group with the primary search rule to obtain a primary document; a similarity search step of extracting words or phrases contained in the primary document as a similarity search rule, searching the electronic document group with the extracted similarity search rule to obtain one or more similar documents, calculating for each similar document a degree of similarity indicating how closely its features resemble those of the primary document, and storing the degree of similarity in association with each similar document; a search rule determination step of generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search over a wider range than the search conditions, searching the similar documents with each candidate, extracting the words or phrases obtained as search results together with the corresponding degrees of similarity, deriving the sum of the extracted degrees of similarity as an evaluation value for the search results of that candidate, and determining a candidate whose evaluation value satisfies a predetermined condition to be the secondary search rule; and a secondary search step of searching the electronic document group again with the determined secondary search rule.
This method yields, for example, a secondary search rule that searches a wider range than the search conditions specified by the searcher, so the desired electronic documents can be retrieved even when the searcher has not appropriately specified the search conditions for finding documents that contain a desired word. The evaluation value also gives a quantitative indication of the range over which documents similar to the primary document can be retrieved.

The primary search rule may include a logical condition over a plurality of words or phrases. In that case, the search rule determination step generates the secondary search rule candidates by, for example, excluding some of the words or phrases when the logical condition is a logical product (AND), and adding new words or phrases when the logical condition is a logical sum (OR).
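The broadening step just described can be sketched as follows. This is a minimal illustration, assuming a rule is represented as an operator string plus a tuple of terms; the function name `broaden` and the `extra_terms` parameter are hypothetical and do not appear in the source.

```python
from itertools import combinations

def broaden(rule_op, terms, extra_terms=()):
    """Generate broader candidate rules from a primary rule.

    AND rules are widened by dropping one or more terms (at least one is kept);
    OR rules are widened by appending one new term at a time.
    """
    candidates = []
    if rule_op == "AND":
        for keep in range(len(terms) - 1, 0, -1):
            for subset in combinations(terms, keep):
                candidates.append(("AND", subset))
    elif rule_op == "OR":
        for extra in extra_terms:
            if extra not in terms:
                candidates.append(("OR", tuple(terms) + (extra,)))
    else:
        raise ValueError(f"unsupported operator: {rule_op}")
    return candidates

# Dropping conditions from a conjunction can only widen the set of matches.
for candidate in broaden("AND", ("noun", "proper noun", "region")):
    print(candidate)
```

Either transformation can only enlarge the set of matching documents, which is why both serve as ways to search beyond the original conditions.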

In the similarity search step, words or phrases that are synonymous with or similar in meaning to the words or phrases contained in the primary document may be extracted as the similarity search rule, either in place of or together with those words or phrases. This makes it possible to retrieve many related documents whose content approximates that of the primary document.

The similarity search rule may also be, in place of the words or phrases contained in the primary document, co-occurrence information whose frequency of co-occurrence in the primary document exceeds a predetermined value. This allows similar documents more closely related to the primary document to be retrieved, further widening the search range of the secondary search. Here, "co-occurrence" means that other related words are called to mind together with a given word, and "co-occurrence information" is the words or phrases that co-occur. The period over which the frequency is measured may be chosen freely; for example, it may be one week.
In the search rule determination step, for example, a secondary search rule candidate with a relatively high evaluation value is determined to be the secondary search rule.

The electronic document search apparatus of the present invention comprises: primary search rule storage means for storing search conditions accepted from a searcher as a primary search rule; similar document management means for searching an electronic document group with the primary search rule to extract a primary document, searching the electronic document group with the words or phrases contained in the primary document to retrieve one or more similar documents, calculating for each similar document a degree of similarity indicating how closely its features resemble those of the primary document, and storing the degree of similarity in association with each similar document; and search rule determination means for generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search over a wider range than the search conditions, searching the similar documents with each of the generated candidates, extracting the words or phrases obtained as search results together with the corresponding degrees of similarity, deriving the sum of the extracted degrees of similarity as an evaluation value for the search results of that candidate, and determining a candidate whose evaluation value satisfies a predetermined condition to be the secondary search rule used to search the electronic document group again.

The computer program of the present invention causes a computer to function as: primary search rule storage means for storing search conditions accepted from a searcher as a primary search rule; similar document management means for searching an electronic document group with the primary search rule to extract a primary document, searching the electronic document group with the words or phrases contained in the primary document to retrieve one or more similar documents, calculating for each similar document a degree of similarity indicating how closely its features resemble those of the primary document, and storing the degree of similarity in association with each similar document; and search rule determination means for generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search over a wider range than the search conditions, searching the similar documents with each of the generated candidates, extracting the words or phrases obtained as search results together with the corresponding degrees of similarity, deriving the sum of the extracted degrees of similarity as an evaluation value for the search results of that candidate, and determining a candidate whose evaluation value satisfies a predetermined condition to be the secondary search rule used to search the electronic document group again.

According to the present invention, a secondary search rule is generated that is related to the primary document retrieved by the primary search and that enables a search over a wider range than the primary search. This provides the distinctive effect that an electronic document containing a desired word or phrase can be retrieved correctly even when, for example, the search conditions for that document were not specified correctly.

Embodiments of the present invention are described below.
[Electronic document search apparatus]
FIG. 1 shows an example of the functional configuration of an electronic document search apparatus to which the present invention is applied. The electronic document search apparatus 1 is realized, for example, by cooperation between a computer having an external storage device such as a hard disk and the computer program of the present invention.
That is, by reading and executing the computer program of the present invention, the computer forms the functions of an input/output I/F 2 ("I/F" stands for interface; the same applies hereinafter), a rule management unit 3, a search unit 4, an evaluation unit 5, a search result output unit 6, and a DB management unit 7 ("DB" stands for database; the same applies hereinafter), and constructs a rule DB 11, a document DB 12, and a file DB 13 in an external storage device accessible to the apparatus. The document DB 12 stores a large number of finance-related newspaper articles (an electronic document group) that have been digitized and are managed so as to be identifiable by an identifier such as an ID (identification). Articles can be added, deleted, and changed as needed.

The input/output I/F 2 is an interface for accepting the words or phrases used for a search, such as part-of-speech types or specific words, entered by a searcher through a keyboard or the like, and for outputting search results. In this embodiment, where there is no particular need to distinguish between the words and phrases used for a search, they are collectively referred to as "search words".

Based on a rule book loaded into a predetermined area of the external storage device, the rule management unit 3 generates the primary search rule, the similarity search rule, and the secondary search rule candidates (the candidates for the secondary search rule), and stores them in the rule DB 11. The rule book holds a plurality of rules for generating search rules, and any of them can be selected to suit the purpose. When a plurality of search rule candidates have been generated, the rule management unit 3 also performs processing to narrow down the candidates or to determine one of them as the search rule.

The search unit 4 searches the DBs 11 to 13 via the DB management unit 7. The evaluation unit 5 evaluates electronic documents and secondary search rule candidates; that is, it calculates the degrees of similarity and derives the evaluation values described later. The search result output unit 6 holds the search results temporarily or stores them in the file DB 13 via the DB management unit 7, and feeds them back to the rule management unit 3. The search result output unit 6 also outputs the search results, via the DB management unit 7, to an output device (not shown) such as a display device, an external recording device, or a printer. Results are sent to the output device when the final search results have been obtained, for example when an instruction to that effect is received from the searcher.
The DB management unit 7 writes (stores) information to the DBs 11 to 13 and reads information from them. These functions are described in detail later.

The search rules handled by the electronic document search apparatus 1 of this embodiment are now described in more detail.
The primary search rule contains the conditions for performing a primary search of the electronic document group stored in the document DB 12, and is expressed, for example, as a logical conditional expression over the words or phrases accepted through the input/output I/F 2. The words or phrases accepted as input include, for example, part-of-speech types (noun, proper noun, adjective, and so on), part-of-speech attributes (among proper nouns, regional names, including place names), and specific words (such as "bank", which generally appears in bank names, and "securities", which generally appears in the names of securities companies). The primary search rule is generated by applying a logical condition, for example a logical product (AND), to the search words. An electronic document retrieved by the primary search rule is a "primary document".

The similarity search rule is a rule for retrieving, from the electronic document group stored in the document DB 12, electronic documents that are similar to the primary document (these are the "similar documents"). In this embodiment, the required search words are extracted from one or more primary documents, and a logical conditional expression that includes the extracted search words as search conditions is used as the similarity search rule.

A secondary search rule candidate is a candidate for the secondary search rule, which takes the primary search rule as its starting point while enabling a search over a wider range than the primary search rule. Possible ways of widening the search range include adding search words when the condition is a logical sum (OR), deleting search words when the condition is a logical product (AND), and replacing some of the search words with words or phrases contained in the electronic documents retrieved with the primary search rule (for example, ones saved in advance in internal memory).

[Electronic document search method]
Next, the electronic document search method performed by the electronic document search apparatus 1 configured as described above is explained with reference to the overall procedure diagram of FIG. 2 and the operation diagrams of FIGS. 3 to 14. The example given here is one in which the apparatus automatically retrieves, from the electronic document group (a large number of newspaper articles) stored in the document DB 12, the electronic documents the searcher most likely wants, going beyond the search range the searcher specified by entering search conditions.

Referring to FIG. 2, the electronic document search apparatus 1 first performs primary search processing (step S101). The primary search processing is performed, for example, according to the procedure shown in FIG. 3.
That is, when "a proper noun denoting a region name" and "bank" are accepted through the input/output I/F 2 as the primary search rule for retrieving bank names ("BANK"), the rule management unit 3 stores this primary search rule in the rule DB 11. The primary search rule in this example includes, as search conditions to be combined by logical product, a word that is "a noun (condition 1) that is a proper noun (condition 2) and denotes a region (condition 3)" and the keyword "bank" (condition 4). The primary search rule is therefore expressed by the following logical conditional expression.
BANK = noun-proper noun-region "bank"

The search unit 4 reads this primary search rule and, in response to a search request issued automatically or at the searcher's instruction, searches the electronic document group stored in the document DB 12. FIG. 8 shows a concrete example of this search processing. In the example of FIG. 8, electronic documents (newspaper articles) containing the words "GG bank", "HH bank", and "II bank", shown at the lower right of the figure, are retrieved from the electronic document group shown at the upper right as primary documents that conform to the primary search rule shown at the upper left. Here, "GG", "HH", and "II" are proper nouns denoting different region names.

The search result output unit 6 stores in the file DB 13, as a primary document file, the primary documents that are the result of the primary search by the search unit 4 together with the search words (that is, words or phrases) extracted from those primary documents. The search words can be extracted, for example, by performing morphological analysis of the primary documents with a known morphological analysis method.

Returning to FIG. 2, when the primary search processing is complete, the electronic document search apparatus 1 performs similarity search processing (step S102). The similarity search processing is performed, for example, according to the procedure shown in FIG. 4.
First, the rule management unit 3 reads the search words from the primary document file and generates a similarity search rule that includes these search words as search conditions. The search unit 4 then searches the electronic document group stored in the document DB 12 with this similarity search rule, which retrieves one or more similar documents. The evaluation unit 5 calculates a degree of similarity for each retrieved similar document. The degree of similarity is a measure of how closely the features of a similar document resemble those of the primary document. In this embodiment, the measure is a numerical value that becomes larger as the similarity becomes higher.

That is, the evaluation unit 5 extracts the words or phrases contained in each similar document and in the primary document, calculates a feature vector for each, and expresses numerically how closely the feature vector of the similar document resembles that of the primary document. The degree of similarity can be calculated, for example, by the TF-IDF (Term Frequency-Inverse Document Frequency) method. For the TF-IDF method, see, for example, Takenobu Tokunaga, "Language and Computation 5: Information Retrieval and Language Processing", University of Tokyo Press, November 1999.
The calculation of the degree of similarity is of course not limited to the TF-IDF method, and other methods may be used. Once the evaluation unit 5 has calculated the degrees of similarity as described above, it performs processing to associate each one with the corresponding similar document, for example by linking the ID assigned to the similar document with its degree of similarity.
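One way to compute such a similarity is to build TF-IDF feature vectors and compare them by cosine similarity. The sketch below is only an illustration under stated assumptions: the source names TF-IDF but does not specify the weighting variant or the vector comparison, so the smoothed IDF formula, the cosine measure, and the sample word lists are all choices made here. The resulting values lie between 0 and 1, whereas the patent's figures use a different scale.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict) per tokenized document."""
    n = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / length) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical word lists standing in for morphological analysis output.
primary = ["bank", "loan", "ATM", "balance"]
doc_a = ["bank", "loan", "deposit", "balance"]
doc_b = ["weather", "rain", "forecast"]

vec_primary, vec_a, vec_b = tfidf_vectors([primary, doc_a, doc_b])
sim_a = cosine(vec_primary, vec_a)
sim_b = cosine(vec_primary, vec_b)
print(sim_a > sim_b)  # True: doc_a shares terms with the primary document
```

Storing each result keyed by the document ID would correspond to the linking of IDs and degrees of similarity described above.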

FIG. 9 shows a concrete example of the processing up to this point. In the example of FIG. 9, the search words extracted from the primary documents are "ATM", "handling", "finance", "institution", "product", "unsecured", "loan", and "balance". These search words are only examples; one or more morphemes contained in the primary documents may be extracted as search words. For example, only noun phrases may be selected as search words. The lower part of FIG. 9 shows each similar document and the degree of similarity to the primary documents associated with it. Similar document (a) has a degree of similarity of 72.1, similar document (b) 68.6, similar document (c) 66.4, similar document (d) 62.8, and similar document (e) 59.5. In the example of FIG. 9, similar document (a) is therefore the closest to the primary documents.
The search result output unit 6 stores the similar documents and their degrees of similarity in the file DB 13 as a similar document file.

Returning to FIG. 2, when the similarity search processing is complete, the electronic document search apparatus performs secondary search rule candidate generation processing (step S103). As shown for example in FIG. 5, this processing begins with the rule management unit 3 generating, based on the primary search rule stored in the rule DB 11, a plurality (X) of secondary search rule candidates that enable a search over a wider range than the primary search rule. FIG. 10 shows examples of the generated candidates. Each secondary search rule candidate is generated as follows.

(A) From the primary search rule BANK = noun-proper noun-region "bank", the condition "region" ("condition 3") is removed.
(B) to (E) "Condition 3" is removed from the primary search rule BANK = noun-proper noun-region "bank", and in addition the word in "condition 4" (the condition for extracting the word "bank") is replaced, as another word to be extracted, with each of the search terms "exchange", "interest rate", "securities", and "deposit" extracted in step S3.
In addition to the above, a secondary search rule candidate may be formed by replacing "condition 4" with a condition for extracting every search term other than "exchange", "interest rate", "securities", and "deposit".
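The generation of candidates (A) to (E) can be sketched as follows. This is only a minimal illustration, not the implementation described in this specification: the `Rule` class, its field names, and the `generate_candidates` function are hypothetical, and a rule is modelled simply as a part-of-speech path (conditions 1 to 3) plus a keyword (condition 4).

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical model of a search rule such as BANK = noun-proper noun-region "bank"."""
    category: str               # rule category, e.g. "BANK"
    pos_path: Tuple[str, ...]   # conditions 1-3 (part-of-speech path)
    keyword: str                # condition 4 (word to extract)

    def __str__(self) -> str:
        return f'{self.category} = {"-".join(self.pos_path)} "{self.keyword}"'


def generate_candidates(primary: Rule, search_terms: List[str]) -> List[Rule]:
    # (A): remove condition 3 (the last part-of-speech condition) only.
    relaxed = replace(primary, pos_path=primary.pos_path[:-1])
    candidates = [relaxed]
    # (B)-(E): additionally replace condition 4 with each extracted search term.
    for term in search_terms:
        candidates.append(replace(relaxed, keyword=term))
    return candidates


primary = Rule("BANK", ("noun", "proper noun", "region"), "bank")
candidates = generate_candidates(
    primary, ["exchange", "interest rate", "securities", "deposit"]
)
for candidate in candidates:
    print(candidate)
```

With the four search terms named above, the sketch yields X = 5 candidates, the first of which keeps the keyword "bank" and only drops the "region" condition.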

After the secondary search rule candidates have been generated, processing proceeds according to the procedure shown in FIG. 6. The rule management unit 3 extracts the first secondary search rule candidate and passes it to the search unit 4. The search unit 4, in cooperation with the evaluation unit 5, performs a secondary search process using that secondary search rule candidate (steps S104 and S105).

The secondary search process is performed as follows. The search unit 4 searches the similar document file using the secondary search rule candidate, extracts the search results and the corresponding similarities, and temporarily holds them as a secondary search result. The evaluation unit 5 sums the similarities attached to the words or phrases held as search results (that is, the similarities assigned to the similar documents from which they were extracted) and derives the total as the evaluation value of the search result for that secondary search rule candidate. It then passes this evaluation value, together with the secondary search result and the secondary search rule candidate, to the search result output unit 6.

FIGS. 11 and 12 show examples of this secondary search process. In the example of FIG. 11, secondary search rule candidate (A) (BANK = noun-proper noun "bank") yields the search results "XX Bank", "II Bank", "HH Bank", and "GG Bank". "XX Bank" carries the similarity "72.1" of its source similar document (a), "II Bank" the similarity "68.6" of similar document (b), "HH Bank" the similarity "66.4" of similar document (c), and "GG Bank" the similarity "62.8" of similar document (d). The evaluation unit 5 sums these similarities, so the evaluation value of the search result for secondary search rule candidate (A) is the total, "269.9".

Similarly, in the example of FIG. 12, secondary search rule candidate (D) (BANK = noun-proper noun "securities") yields the search result "JJ Securities". Because only one word is extracted, the evaluation value of the search result for secondary search rule candidate (D) equals the similarity "59.5" of similar document (e), from which "JJ Securities" was extracted.
The search result output unit 6 associates the evaluation value derived in this way with the secondary search rule candidate and the secondary search result, and stores them in the file DB 13 as a secondary document file.
The above secondary search process is repeated for all X secondary search rule candidates (steps S106 and S107: No).
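The evaluation described above (summing the similarities of the similar documents in which a candidate finds a match) can be sketched as follows. This is an illustrative sketch under stated assumptions: the layout of `SIMILAR_DOCS`, the part-of-speech paths attached to each phrase, the prefix-based matching in `secondary_search`, and the assignment of keywords to candidates (B), (C), and (E) by listing order are all assumptions; only the similarity values and the results for (A) and (D) come from the examples of FIGS. 9, 11, and 12. `Decimal` is used so that the totals are exact.

```python
from decimal import Decimal

# Similar document file: (document id, similarity, phrases found in the document).
# Each phrase is (part-of-speech path, keyword, surface form); the paths are assumed.
SIMILAR_DOCS = [
    ("a", Decimal("72.1"), [("noun-proper noun", "bank", "XX Bank")]),
    ("b", Decimal("68.6"), [("noun-proper noun", "bank", "II Bank")]),
    ("c", Decimal("66.4"), [("noun-proper noun", "bank", "HH Bank")]),
    ("d", Decimal("62.8"), [("noun-proper noun", "bank", "GG Bank")]),
    ("e", Decimal("59.5"), [("noun-proper noun", "securities", "JJ Securities")]),
]


def secondary_search(pos_path, keyword, docs=SIMILAR_DOCS):
    """Return (hits, evaluation value) for one secondary search rule candidate."""
    hits = [
        (surface, similarity)
        for _doc_id, similarity, phrases in docs
        for pos, word, surface in phrases
        if pos.startswith(pos_path) and word == keyword
    ]
    # The evaluation value is the total of the similarities attached to the hits.
    return hits, sum((s for _, s in hits), Decimal("0"))


# Keyword of "condition 4" for each candidate (order of (B), (C), (E) assumed).
CANDIDATE_KEYWORDS = {
    "A": "bank",
    "B": "exchange",
    "C": "interest rate",
    "D": "securities",
    "E": "deposit",
}

evaluation = {
    name: secondary_search("noun-proper noun", keyword)[1]
    for name, keyword in CANDIDATE_KEYWORDS.items()
}
print(evaluation)
```

Candidates whose keyword does not occur in any similar document simply receive an evaluation value of zero, which is why no special handling for empty results is needed.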

When the secondary search process has been completed for all X secondary search rule candidates (step S107: Yes), the electronic document search apparatus 1 performs a secondary search rule determination process (step S108) and a re-search process (step S109). These processes are performed, for example, according to the procedure shown in FIG. 7.
The secondary search rule determination process is performed by the rule management unit 3. The rule management unit 3 extracts the evaluation values recorded in the secondary document file, determines as the secondary search rule a secondary search rule candidate whose evaluation value satisfies a predetermined condition (for example, a candidate whose evaluation value is relatively high), and stores it in the rule DB 11. The predetermined condition may also be that the evaluation value exceeds a threshold stored in advance.

FIG. 13 shows a specific example of the secondary search rule determination process. FIG. 13 lists example evaluation values for the five secondary search rule candidates (A) to (E) derived by the procedure described so far: "269.9" for candidate (A), "0.0" for candidate (B), "0.0" for candidate (C), "59.5" for candidate (D), and "0.0" for candidate (E). The evaluation values of candidates (B), (C), and (E) are "0.0" because no word matching these candidates exists in similar documents (a) to (e). If the predetermined condition is "the single candidate with the relatively highest evaluation value", then secondary search rule candidate (A), BANK = noun-proper noun "bank", is determined as the secondary search rule. If, on the other hand, the predetermined condition is a threshold set to "50", the evaluation values of two candidates, (A) and (D), exceed the threshold. In this case, therefore, in addition to secondary search rule candidate (A), secondary search rule candidate (D), BANK = noun-proper noun "securities", is also determined as a secondary search rule.
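The two kinds of predetermined condition mentioned above, "relatively highest evaluation value" and "exceeds a threshold", can be sketched as follows. The function names `select_top` and `select_above` are hypothetical and chosen only for this illustration; the evaluation values are those of FIG. 13.

```python
# Evaluation values of secondary search rule candidates (A)-(E) from FIG. 13.
EVALUATION = {"A": 269.9, "B": 0.0, "C": 0.0, "D": 59.5, "E": 0.0}


def select_top(evaluation, k=1):
    """Pick the k candidates with the highest evaluation values."""
    ranked = sorted(evaluation, key=evaluation.get, reverse=True)
    return ranked[:k]


def select_above(evaluation, threshold):
    """Pick every candidate whose evaluation value exceeds the threshold."""
    return [name for name, value in evaluation.items() if value > threshold]


print(select_top(EVALUATION))          # condition: single highest value
print(select_above(EVALUATION, 50.0))  # condition: threshold of 50
```

The threshold variant can return several secondary search rules at once, which is what allows the re-search to cover both "bank" and "securities" in the example.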

The re-search process is performed by the search unit 4 and the search result output unit 6 in cooperation. The search unit 4 re-searches the electronic document group stored in the document DB 12 using the determined secondary search rule, and thereby retrieves the electronic documents that the searcher truly wants (newspaper articles, hereinafter referred to as "search documents"). The search unit 4 passes these search documents to the search result output unit 6. The search result output unit 6 stores the search documents in the file DB 13 as the re-search result.

FIG. 14 shows an example of this re-search process. The example of FIG. 14 shows the re-search result obtained when the two secondary search rule candidates described above (BANK = noun-proper noun "bank" and BANK = noun-proper noun "securities") are determined as secondary search rules. The search documents retrieved by these two secondary search rules include, in addition to newspaper articles containing "GG Bank", "II Bank", and "HH Bank", newspaper articles containing "XX Bank" and "JJ Securities", respectively. From the standpoint of financial institution names, banks and securities firms are alike, so it can be seen that the range of newspaper articles retrieved is expanded.

As described above, according to the electronic document search apparatus 1 of the present embodiment, a secondary search rule whose search range is wider than that of the search condition entered by the searcher is generated automatically. Therefore, even when the searcher has not specified the search condition appropriately, the newspaper articles that the searcher truly wants can be retrieved easily.

An embodiment of the present invention has been described above, but the present invention is not limited to this embodiment. For example, the present embodiment describes a case in which the electronic document group is stored in the document DB 12 within the system, but the electronic document group may reside anywhere that the electronic document search apparatus 1 can access. For example, it may reside on any server of a computer network to which the electronic document search apparatus 1 is connected.

In the present embodiment, an example was described in which the search unit 4 retrieves the primary document from the electronic document group stored in the document DB 12. Alternatively, the rule management unit 3 may read out a specific primary document designated in advance within the electronic document group.

In the present embodiment, a case was described in which the search condition of the primary search rule classified in the category "BANK" (BANK = noun-proper noun-region "bank") is the combination of conditions "a noun (condition 1) that is a proper noun (condition 2) and represents a region (condition 3)", that is, a proper noun representing a region. However, the proper noun is not limited to one representing a region, and the part of speech is not limited to nouns; other parts of speech may be used as search conditions. Furthermore, the search condition is not limited to parts of speech, as long as it is a condition for searching for a desired electronic document.

The secondary search rule candidates have been described as being generated by replacing the search word, but they may also be generated by replacing the search word with a word that is synonymous with or similar in meaning to it. Alternatively, the search rule generation unit 2 may extract and store words or phrases whose co-occurrence frequency in the primary document is higher than a predetermined threshold, and generate secondary search rule candidates by replacing the keyword represented by "condition 4" of the primary search rule with each of these words or phrases.
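The co-occurrence variant can be sketched as follows. The specification does not say how co-occurrence is counted, so this sketch assumes the simplest reading: a term co-occurs with the "condition 4" keyword when both appear in the same sentence of the primary document. The function `cooccurring_terms`, the tokenized sample sentences, and the threshold value are hypothetical.

```python
from collections import Counter


def cooccurring_terms(sentences, keyword, threshold):
    """Terms sharing a sentence with `keyword` more than `threshold` times."""
    counts = Counter()
    for tokens in sentences:
        if keyword in tokens:
            # Count each co-occurring term once per sentence.
            counts.update(set(tokens) - {keyword})
    return sorted(term for term, n in counts.items() if n > threshold)


# Toy primary document, already split into sentences and tokens (assumed input).
primary_document = [
    ["bank", "deposit", "interest rate"],
    ["bank", "deposit", "loan"],
    ["securities", "exchange"],
    ["bank", "loan"],
]

replacements = cooccurring_terms(primary_document, "bank", threshold=1)
# Replace the "condition 4" keyword with each frequently co-occurring term.
candidates = [f'BANK = noun-proper noun "{term}"' for term in replacements]
print(candidates)
```

A synonym-based variant would have the same shape, with the list of replacement terms coming from a thesaurus lookup instead of a co-occurrence count.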

Furthermore, at the stage where the evaluation values have been derived, each secondary search rule candidate and its evaluation value may be presented to the searcher so that a candidate designated by the searcher can be determined as the secondary search rule. In this case, the searcher can select, at his or her own discretion, any secondary search rule candidate with a high evaluation value, which has the advantage that a search rule better reflecting the searcher's preferences can be generated.

FIG. 1 is a functional configuration diagram of an electronic document search apparatus according to an embodiment of the present invention.
FIG. 2 is a processing procedure diagram of the electronic document search method according to the present embodiment.
FIG. 3 is an explanatory diagram of the procedure of the primary search process.
FIG. 4 is an explanatory diagram of the procedure of the similarity search process.
FIG. 5 is an explanatory diagram of the procedure of the secondary search rule candidate generation process.
FIG. 6 is an explanatory diagram of the procedure of the secondary search process.
FIG. 7 is an explanatory diagram of the procedure of the secondary search rule determination process and the re-search process.
FIG. 8 is an explanatory diagram showing an example of a search result for the primary document.
FIG. 9 is an explanatory diagram outlining the extraction of search terms and the search result for similar documents.
FIG. 10 is an explanatory diagram showing examples of secondary search rule candidates generated from the primary search rule.
FIG. 11 is an explanatory diagram showing the relationship between words extracted from similar documents, similarities, and an evaluation value.
FIG. 12 is an explanatory diagram showing the relationship between words extracted from similar documents, similarities, and an evaluation value.
FIG. 13 is an explanatory diagram showing the relationship between secondary search rule candidates and evaluation values.
FIG. 14 is an explanatory diagram showing search documents retrieved from the electronic document group by the secondary search rules.
FIG. 15 is a diagram showing a conventional search method using morphological analysis.
FIG. 16 is a diagram showing a conventional search method using similar-word expansion.

Explanation of symbols

1 ... electronic document search apparatus, 2 ... input/output I/F, 3 ... rule management unit, 4 ... search unit, 5 ... evaluation unit, 6 ... search result output unit, 7 ... DB management unit, 11 ... rule DB, 12 ... document DB, 13 ... file DB.

Claims

An electronic document search method performed by an apparatus that enables a search of an electronic document group beyond the range of a search condition specified in advance, the method comprising:
a primary search step of accepting the search condition as a primary search rule and obtaining a primary document by searching the electronic document group according to the primary search rule;
a similarity search step of extracting a word or phrase included in the primary document as a similar search rule, searching the electronic document group by the extracted similar search rule to obtain one or a plurality of similar documents, calculating, for each similar document, a similarity indicating how similar its features are to those of the primary document, and storing the similarity in association with each similar document;
a search rule determination step of generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search in a wider range than the search condition, searching the similar documents by each of these secondary search rule candidates, extracting a word or phrase that is a search result together with the corresponding similarity, deriving a total value of the extracted similarities as an evaluation value of the search result by that secondary search rule candidate, and determining a secondary search rule candidate whose evaluation value satisfies a predetermined condition as a secondary search rule; and
a secondary search step of re-searching the electronic document group according to the determined secondary search rule.

The electronic document search method according to claim 1, wherein
the primary search rule includes a logical condition over a plurality of words or phrases, and
the search rule determination step generates the secondary search rule candidates by excluding some of the plurality of words or phrases when the logical condition is a logical product, and by adding a new word or phrase when the logical condition is a logical sum.

The electronic document search method according to claim 2, wherein
the similarity search step extracts, as the similar search rule, a word or phrase that is synonymous with or similar in meaning to a word or phrase included in the primary document, instead of or together with that word or phrase.

The electronic document search method according to claim 1, wherein
the search rule determination step determines a secondary search rule candidate having a relatively high evaluation value as the secondary search rule.

An electronic document search apparatus comprising:
primary search rule storage means for storing a search condition received from a searcher as a primary search rule;
similar document management means for searching an electronic document group by the primary search rule to extract a primary document, searching the electronic document group by a word or phrase included in the primary document to retrieve one or a plurality of similar documents, calculating, for each similar document, a similarity indicating how similar its features are to those of the primary document, and storing the similarity in association with each similar document; and
search rule determining means for generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search in a wider range than the search condition, searching the similar documents by each of the generated secondary search rule candidates, extracting a word or phrase that is a search result together with the corresponding similarity, deriving a total value of the extracted similarities as an evaluation value of the search result by that secondary search rule candidate, and determining a secondary search rule candidate whose evaluation value satisfies a predetermined condition as a secondary search rule for re-searching the electronic document group.

A computer program for causing a computer to function as:
primary search rule storage means for storing a search condition received from a searcher as a primary search rule;
similar document management means for searching an electronic document group by the primary search rule to extract a primary document, searching the electronic document group by a word or phrase included in the primary document to retrieve one or a plurality of similar documents, calculating, for each similar document, a similarity indicating how similar its features are to those of the primary document, and storing the similarity in association with each similar document; and
search rule determining means for generating, based on the primary search rule, a plurality of secondary search rule candidates that enable a search in a wider range than the search condition, searching the similar documents by each of the generated secondary search rule candidates, extracting a word or phrase that is a search result together with the corresponding similarity, deriving a total value of the extracted similarities as an evaluation value of the search result by that secondary search rule candidate, and determining a secondary search rule candidate whose evaluation value satisfies a predetermined condition as a secondary search rule for re-searching the electronic document group.