JP2004151959A

JP2004151959A - Method and system of pseudo search

Info

Publication number: JP2004151959A
Application number: JP2002315879A
Authority: JP
Inventors: Keiko Aoki; 圭子青木; Naoki Inoue; 直己井ノ上; Yoshiji Sasano; 義二笹野
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2002-10-30
Filing date: 2002-10-30
Publication date: 2004-05-27

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system of pseudo search by which a high-precision search result can be obtained with a simple operation in a short time, irrespective of the operator's skill level.

SOLUTION: A search condition document DB 1 stores a search condition document as a search key in a text format. A document DB 2 to be searched stores a multiplicity of documents to be searched in a text format. A keyword extracting part 3 extracts a plurality of keywords from the search condition document stored in the search condition document DB 1. A full-text search part 4 implements a full-text search over all the documents stored in the document DB 2, using the keywords extracted by the keyword extracting part 3 as search keys. A pseudo search part 5 implements a pseudo search over the results of the full-text search by the full-text search part 4, using the search condition document as a search key.

COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a similarity search method and system for documents, and more particularly to a similarity search method and system that use a text document as a search key to retrieve documents similar to it from a large number of search target documents.
[0002]
[Prior art]
Two techniques are known for retrieving documents that contain desired information from electronic documents accurately and efficiently: "full-text search", which extracts every document containing an input keyword, and "similarity search", which takes a text document as the search key and retrieves documents whose content is similar to it.
[0003]
FIG. 5 is a functional block diagram of a conventional full-text search system. A search target document database (DB) 41 stores a large number of documents to be searched in a text format. The full-text search unit 43 performs a full-text search on all documents stored in the search target document DB 41 using a keyword manually input by the operator from the keyboard 42 as a search key. The full-text search result output unit 44 displays or prints out the full-text search result.
[0004]
FIG. 6 is a functional block diagram of a conventional similarity search system. Many documents to be searched are stored in a search target document DB 51 in a text format. The search condition document DB 52 stores a search condition document as a search key in a text format. The similarity search unit 53 performs a similarity search on all documents stored in the search target document DB 51 using a separately specified current search condition document as a search key. The similarity search result output unit 54 displays or prints out the similarity search result.
[0005]
Typical similarity search methods automatically extract keywords from text, weight documents using keyword frequencies and the like, and output search results in order of relevance to the search request. Retrieval models of this kind, such as the vector space model (VSM) and the probabilistic model (PBM), have been proposed and are disclosed in JP-A-10-269235 and JP-A-2001-331527.
[0006]
In the vector space model, the search condition document and each search target document are represented as vectors whose elements are, for example, the weights of the words they contain, and the best-matching documents are retrieved on the basis of "vector similarity". In the probabilistic model, the retrieved documents are ranked according to the probability of relevance between the search condition document and each search target document.
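The vector space model described above can be illustrated with a minimal sketch. It assumes raw term counts as vector elements, whitespace tokenization, and cosine similarity as the measure of vector similarity; none of these choices are specified in this document, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: term -> raw count (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: str, documents: dict[str, str]) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs for every document, best match first."""
    q = term_vector(query)
    scores = [
        (doc_id, cosine_similarity(q, term_vector(text)))
        for doc_id, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))
```

Note that `rank_by_similarity` scores every document in the collection, which is the cost that grows with collection size in a conventional similarity search.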
[0007]
Japanese Patent Application Laid-Open No. 2002-222208 discloses a search system that combines "full-text search (keyword search)" and "similarity search". In that system, a keyword is entered and a similarity search is executed first; the similarity search is then executed again using the documents obtained as the search key. Finally, a full-text search is executed using the characteristic words returned by that search as the search key.
[0008]
[Problems to be solved by the invention]
A full-text search system that takes keywords as input has a relatively simple configuration and can operate at high speed even at large scale, but it requires the operator to select keywords and enter them manually. Moreover, if the keywords are inappropriate, many irrelevant documents are retrieved or the desired document is not found at all. Search accuracy therefore depends heavily on the operator's proficiency, and a highly accurate search is not always possible.
[0009]
In contrast, a similarity search system lets the text of a document, or a fragment of it, be pasted in as is via the clipboard and used for the search, so usability is markedly better than with full-text search. However, when a similarity search must be highly accurate, similarity has to be calculated over every word in the document, which increases the amount of computation and lengthens the search time.
[0010]
Furthermore, a system that combines full-text search and similarity search not only requires manual keyword entry at the outset but also requires three search operations (two similarity searches and one full-text search).
[0011]
An object of the present invention is to solve the above problems of the prior art and to provide a similarity search method and system that yield highly accurate search results in a short time with a simple operation, regardless of the operator's proficiency.
[0012]
[Means for Solving the Problems]
To achieve the above object, the present invention provides a similarity search system that retrieves documents similar to a specified search condition document from a plurality of search target documents, the system comprising: means for extracting keywords from the search condition document; means for performing a full-text search on the plurality of search target documents using the extracted keywords as search keys; means for performing a similarity search on the full-text search results using the search condition document as the search key; and means for outputting the similarity search results.
[0013]
With the above features, keywords for the full-text search are extracted automatically simply by designating a document (the search condition document) as the search key, so appropriate keywords can be extracted regardless of the search operator's proficiency. Furthermore, a full-text search is executed over all search target documents using those keywords as search keys, and the similarity search is executed only on the full-text search results using the search condition document as the search key, so search accuracy is improved and search time is reduced at the same time.
[0014]
[Embodiments of the invention]
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a block diagram showing a configuration of a main part of a similarity search system according to an embodiment of the present invention.
[0015]
A search condition document database (DB) 1 stores search condition documents, which serve as search keys, in text format. A search target document DB 2 stores a large number of documents to be searched in text format. If the system is applied to a patent publication search system, the search condition document DB 1 would store papers and draft specifications written by inventors, and the search target document DB 2 would store known literature such as patent gazettes and published patent applications.
[0016]
The keyword extraction unit 3 extracts, from the current search condition document stored in the search condition document DB 1, keywords that represent the content of that document. The full-text search unit 4 performs a full-text search on all the documents stored in the search target document DB 2 using the keywords extracted by the keyword extraction unit 3 as search keys. The similarity search unit 5 performs a similarity search on the full-text search results from the full-text search unit 4 using the current search condition document as the search key. The search result output unit 6 displays or prints out the similarity search results.
[0017]
FIG. 2 is a functional block diagram of the keyword extraction unit 3. The morphological analysis unit 31 performs morphological analysis on the current search condition document and decomposes it into a plurality of words. The filtering unit 32 normalizes each word and performs filtering such as stop-word removal. Through this normalization and stop-word processing, inflected forms are reduced to their base forms, and spelling variants and abbreviations are unified. The statistical processing unit 33 weights each word based on parameters such as its frequency of occurrence and extracts a plurality of highly weighted words as keywords.
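The first two stages of the keyword extraction unit might be sketched as follows. This is only an illustration under stated assumptions: a regular-expression tokenizer for English text stands in for the morphological analysis unit 31 (Japanese text would need a real morphological analyzer), and `STOP_WORDS` and `VARIANTS` are small hypothetical tables, not data from this document.

```python
import re

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are", "to", "in", "for"}

# Hypothetical normalization table: inflected forms and abbreviations -> base form.
VARIANTS = {
    "db": "database",
    "databases": "database",
    "searched": "search",
    "searches": "search",
    "searching": "search",
}


def analyze(text: str) -> list[str]:
    """Stand-in for the morphological analysis unit 31: split text into words."""
    return re.findall(r"[A-Za-z0-9]+", text)


def filter_words(words: list[str]) -> list[str]:
    """Stand-in for the filtering unit 32: normalize, then drop stop words."""
    normalized = [VARIANTS.get(word.lower(), word.lower()) for word in words]
    return [word for word in normalized if word not in STOP_WORDS]
```

For example, `filter_words(analyze("The DB is searched for documents."))` yields the word list that would be handed to the statistical processing unit 33 for weighting.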
[0018]
The statistical processing unit 33, for example, obtains the relative frequency P(T = t | Ci) of each word t in the current search condition document Ci, and also obtains the relative frequency P(T = t) of the same word t across all search target documents. It then extracts as keywords either every word t that satisfies expression (1) below or the top N words among those that satisfy it. Here S is a predetermined constant.
P(T = t | Ci) / P(T = t) ≥ S ... (1)
[0019]
That is, a word with a high frequency of occurrence [P(T = t | Ci)] within a document is likely to represent the content of that document. However, a general word that appears frequently in every document cannot represent the content of a particular document, however often it occurs. In this embodiment, therefore, for each word t in the search condition document Ci, both its frequency of occurrence within that document [P(T = t | Ci)] and its frequency of occurrence in the search target documents [P(T = t)] are obtained, so that only words with a high [P(T = t | Ci)] and a low [P(T = t)] are extracted as keywords.
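The selection rule of expression (1) can be sketched as below. The function and parameter names are hypothetical, words are assumed to be already tokenized and filtered, and skipping words that never occur in the search target documents is an assumption of this sketch (the ratio is undefined for them, and they could not match anything in a full-text search).

```python
from collections import Counter
from typing import Optional


def relative_frequencies(words: list[str]) -> dict[str, float]:
    """Relative frequency of each word: count / total number of words."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()} if total else {}


def extract_keywords(
    condition_words: list[str],
    collection_words: list[str],
    threshold: float,
    top_n: Optional[int] = None,
) -> list[str]:
    """Keep words t with P(T=t|Ci) / P(T=t) >= threshold (S), highest ratio first."""
    p_doc = relative_frequencies(condition_words)   # P(T = t | Ci)
    p_all = relative_frequencies(collection_words)  # P(T = t)
    scored = []
    for word, p in p_doc.items():
        background = p_all.get(word)
        if background is None:
            continue  # assumption: ignore words absent from the collection
        ratio = p / background
        if ratio >= threshold:
            scored.append((word, ratio))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    keywords = [word for word, _ in scored]
    return keywords if top_n is None else keywords[:top_n]
```

A word such as "the" that is frequent in the condition document but even more frequent across the collection gets a ratio below 1 and is rejected, which is the behavior the paragraph above describes for general words.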
[0020]
Next, the operation of the present embodiment will be described in detail with reference to the flowchart of FIG. 3 and the sequence diagram of FIG. 4.
[0021]
In step S1 of FIG. 3, one search condition document Ci is designated from among the plurality of search condition documents stored in the search condition document DB 1. In step S2, the keyword extraction unit 3 performs morphological analysis (S21), filtering (S22), and statistical processing (S23) on the designated search condition document Ci, and a plurality of keywords predicted to represent the content of the search condition document Ci are automatically extracted.
[0022]
In step S3, the full-text search unit 4 performs a full-text search on all documents stored in the search target document DB 2 using the keyword extracted by the keyword extraction unit 3 as a search key. The plurality of document IDs obtained by the full-text search are notified to the similarity search unit 5.
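Step S3 might look like the following sketch, which builds an inverted index over the search target documents and returns the IDs of matching documents. The document does not say whether multiple keywords are combined with AND or OR; this sketch assumes OR (a document matches if it contains any keyword). Whitespace tokenization and all names are likewise assumptions.

```python
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: word -> set of IDs of documents containing that word."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def full_text_search(index: dict[str, set[str]], keywords: list[str]) -> set[str]:
    """Return IDs of documents containing at least one keyword (OR is an assumption)."""
    hits: set[str] = set()
    for keyword in keywords:
        # .get avoids creating empty entries for unknown keywords
        hits |= index.get(keyword.lower(), set())
    return hits
```

The returned ID set corresponds to the document IDs that are passed on to the similarity search unit 5.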
[0023]
As described above, in this embodiment, simply specifying a search condition document causes keywords to be extracted automatically from its content and a full-text search to be executed with those keywords as search keys, which enables automatic search using keywords that represent the content of the search condition document.
[0024]
In step S4, the similarity search unit 5 performs a similarity search using the current search condition document Ci as the search key, targeting only those search target documents, among the many stored in the search target document DB 2, that correspond to the document IDs provided by the full-text search unit 4. In step S5, the search result output unit 6 outputs the similarity search results.
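Steps S4 and S5 can be sketched as a similarity search that scores only the candidate documents whose IDs came from the full-text search. The similarity measure is not fixed by this document; the sketch assumes cosine similarity over raw term counts, and the names `similarity_search` and `candidate_ids` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_search(
    condition_doc: str,
    documents: dict[str, str],
    candidate_ids: set[str],
) -> list[tuple[str, float]]:
    """Score only the candidates from the full-text search; best match first."""
    query = Counter(condition_doc.lower().split())
    scored = [
        (doc_id, cosine(query, Counter(documents[doc_id].lower().split())))
        for doc_id in candidate_ids
        if doc_id in documents  # ignore IDs with no stored document
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Because the loop runs over `candidate_ids` rather than over every entry in `documents`, the number of similarity calculations is bounded by the size of the full-text search result, not by the size of the collection.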
[0025]
As described above, according to this embodiment, the similarity search is executed only on the set of documents narrowed down by the full-text search results, so the time required for the similarity search can be reduced.
[0026]
Furthermore, according to this embodiment, simply following the same procedure as a conventional similarity search, namely designating a search condition document, makes it possible to retrieve documents similar in content to the search condition document reliably and in a shorter time than before.
[0027]
[Effects of the invention]
According to the present invention, the following effects are achieved.
(1) Just by performing the same procedure as a conventional similarity search procedure of designating a search condition document, an accurate similarity search can be performed in a shorter time than in the related art.
(2) Simply specifying a search condition document causes keywords to be extracted automatically from its content and a full-text search to be executed using those keywords as search keys, which enables automatic search using keywords that represent the content of the search condition document.
(3) Since the similarity search is executed only on the search target documents narrowed down by the full-text search results, the time required for the similarity search is reduced and search accuracy is improved at the same time.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a main part of a similarity search system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the keyword extraction unit 3 of FIG. 1.
FIG. 3 is a flowchart showing an operation of the embodiment.
FIG. 4 is a sequence diagram showing an operation of the embodiment.
FIG. 5 is a functional block diagram of a conventional full-text search system.
FIG. 6 is a functional block diagram of a conventional similarity search system.
[Explanation of symbols]
1 ... Search condition document DB
2 ... Search target document DB
3 ... Keyword extraction unit
4 ... Full-text search unit
5 ... Similarity search unit
6 ... Search result output unit

Claims

1. A similarity search system for retrieving documents similar to a specified search condition document from a plurality of search target documents, comprising:
means for extracting a keyword from the search condition document;
means for performing a full-text search on the plurality of search target documents using the keyword as a search key;
means for performing a similarity search on the full-text search result using the search condition document as a search key; and
means for outputting the similarity search result.

2. The similarity search system according to claim 1, wherein the means for extracting the keyword includes:
means for performing morphological analysis on the search condition document;
means for statistically processing the result of the morphological analysis and weighting each word; and
means for extracting a word having a high weight as a keyword.

3. A similarity search method for retrieving documents similar to a specified search condition document from a plurality of search target documents, comprising:
extracting a keyword from the search condition document;
performing a full-text search on the plurality of search target documents using the keyword as a search key;
performing a similarity search on the full-text search result using the search condition document as a search key; and
outputting the similarity search result.