JP2002537604A

JP2002537604A - Document similarity search

Info

Publication number: JP2002537604A
Application number: JP2000600197A
Authority: JP
Inventors: Sharpe, William; Burns, Roland John
Original assignee: Hewlett Packard Co
Current assignee: HP Inc
Priority date: 1999-02-16
Filing date: 2000-02-15
Publication date: 2002-11-05
Also published as: WO2000049526A1; GB9903451D0; EP1072001A1

Abstract

(57) [Abstract] A method of searching a database, such as a database representing content on the World Wide Web, to find documents similar to a query document 1, including the step of decomposing (22) the query document into elements of different data types. A similarity search of a first data type is then performed (23) for one or more elements of the first data type, returning match results from the database for those elements. A similarity search of a second data type is performed for one or more elements of the second data type, returning match results from the database for those elements. The match results from the first and second data types are combined with appropriate weighting to provide match results for the query document. The data types can include text, pictures, graphics, and the layout of the document as a whole.

Description

DETAILED DESCRIPTION OF THE INVENTION

TECHNICAL FIELD OF THE INVENTION

The present invention relates to a method and apparatus for performing a search to find similar documents in response to a query, and more particularly to obtaining similar documents by using a single document as the query for the search.

[Prior art]

Similarity searching in databases of electronically stored documents is an important area of practical use. Such searching is well known for text. Typically, the input for such a search is a text string, and the search engine searches the database for entries matching the text string and returns those entries that meet an acceptable similarity threshold. Similar searching can be used for images. One example is the QBIC (Query by Image Content) package of IBM Corporation, which is described at, and available from, the site http://wwwqbic.almaden.ibm.com/.

Research has also been carried out, notably at the German Research Center for Artificial Intelligence GmbH (DFKI), on the use of structural analysis of documents in searching, in systems such as Office Maid and SALT. These systems are described in more detail at the site http://www.dfki.uni-kl.de.

Existing techniques are effective where the query consists essentially of one data type (that is, only a text string or only an image). In general, however, an electronic document consists of a combination of data types: a typical document may contain one or more text passages, one or more images, and line art. The text passages can also readily be subdivided into different types, such as headings, legends, and bulk text. With existing techniques such as those described above, a similarity search involves extracting one element of a particular data type and then performing a similarity search appropriate to that data type.

[Problems to be solved by the invention]

One example of this kind of approach is described in US Patent No. 6,002,798. It provides an initial structural analysis of a document into regions of different types (that is, not merely image plus text, but multiple regions with different functional meanings, such as title, heading, and text block). This structural information is then used to allow the user to carry out searching and text indexing on selected functional elements of the document. The mechanism is particularly useful for making text searching in complex documents more tractable. It is not, however, effective for enabling a search for documents that are similar to the query document as a whole.

It is desirable to provide a method of similarity searching that allows the features of a document which properly represent the complete document to be used appropriately in the search.

[Means for Solving the Problems]

Accordingly, in a first aspect of the invention, there is provided a method of searching a database to find documents similar to a query document, comprising the steps of: decomposing the query document into elements of different data types; performing a similarity search of a first data type for one or more of the elements of the first data type, and returning match results from the database for those elements; performing a similarity search of a second data type for one or more of the elements of the second data type, and returning match results from the database for those elements; and combining the match results from the first data type similarity search and the second data type similarity search to provide match results for the query document.

Advantageously, combining the individual match results for the query document allows the query to be refined further using any of the data types, singly or in further combination.

In a second aspect of the invention, there is provided a method of searching a database to find documents similar to a query document, comprising the steps of: decomposing the query document into elements of different data types; determining a layout element of a layout data type from the spatial arrangement of the elements in the document; and performing a layout similarity search on the layout element, and returning match results from the database for the layout element.
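The steps of the first aspect can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the `Element` class, the toy in-memory `DATABASE`, the use of set overlap (Jaccard) as a stand-in for a real text or image search engine, and plain averaging as the way the per-element match results are combined.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Element:
    data_type: str       # hypothetical labels such as "text" or "picture"
    features: frozenset  # toy feature set standing in for real content


# Toy database: document id -> feature set per data type (illustrative only).
DATABASE = {
    "doc_a": {"text": frozenset({"similarity", "search", "document"}),
              "picture": frozenset({"blue", "green"})},
    "doc_b": {"text": frozenset({"invoice", "total"}),
              "picture": frozenset({"blue", "green"})},
}


def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap of two feature sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def make_engine(data_type: str) -> Callable[[Element], Dict[str, float]]:
    """Build a stand-in search engine for one data type."""
    def engine(element: Element) -> Dict[str, float]:
        return {doc_id: jaccard(element.features, entry.get(data_type, frozenset()))
                for doc_id, entry in DATABASE.items()}
    return engine


def search_similar(elements: List[Element]) -> Dict[str, float]:
    """Search each element with the engine for its data type, then combine
    the per-element match results by simple averaging."""
    if not elements:
        return {}
    engines = {"text": make_engine("text"), "picture": make_engine("picture")}
    totals: Dict[str, float] = defaultdict(float)
    for element in elements:
        for doc_id, score in engines[element.data_type](element).items():
            totals[doc_id] += score
    return {doc_id: total / len(elements) for doc_id, total in totals.items()}


# A query document already decomposed into one text and one picture element.
query = [Element("text", frozenset({"similarity", "search", "document"})),
         Element("picture", frozenset({"blue", "green"}))]
results = search_similar(query)
```

In this toy setup `doc_a` matches the query on both data types, while `doc_b` matches only on the picture element, so a search on either data type alone would rank the two documents differently from the combined search.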

BEST MODE FOR CARRYING OUT THE INVENTION

Specific embodiments of the invention are described below, by way of example, with reference to the drawings.

A typical document contains several data types. The most basic data types are text and image. Document 1, shown in FIG. 1, contains a text block 12. The text block 12 is data of a first data type. Document 1 also contains two different kinds of image. One kind, image block 13, is a photographic image, which typically consists of an array of pixels each having a color value. The other kind, line art block 11, is also an image, but is a "drawn" image which can readily be represented as a combination of geometrical or mathematical elements and which can, in general, readily be scaled.
Photographic images and line art images (hereinafter "pictures" and "graphics") respond differently to different image processing and analysis techniques, and are most effectively handled as different data types. Moreover, since pictures and graphics are generally used for different purposes in a document, handling them separately is also practical for the purposes of similarity searching.

The steps involved in a similarity search for the document of FIG. 1 according to a first embodiment of the invention are shown in FIG. 2.

First, in step 21, document 1 is selected. For an electronic document, this can be done with any suitable application able to support the file type or types of the document. For a physical document, it can be done by scanning the document with a scanner.

Second, in step 22, the document is decomposed into separate elements. For document 1, these elements are graphics block 11, text block 12, and picture block 13. For text block 12, it is desirable to perform optical character recognition at this point, so that the text block element resulting from decomposition consists of ASCII text. Decomposition of the document is achieved by an analysis and recognition process, through which the different parts of the document are recognized as text, pictures, or graphics. Such decomposition of a document into separate data types is known, and can be carried out using, for example, the technique disclosed in F.M. Wahl, K.Y. Wong, and R.G. Casey, "Block Segmentation and Text Extraction in Mixed Text/Image Documents", Computer Graphics and Image Processing, Vol. 20 (1982) (a further example is disclosed in US Patent No. 6,002,798). Software adapted for use with a proprietary scanner to decompose the elements of a scanned page into separate data types (in order to optimize the scanning process for each data type) is provided by Hewlett-Packard Company as "HP PrecisionScan". The output of HP PrecisionScan is a set of elements, each of a single data type, and each of which can be selected for further processing.

The result of decomposition is a set of elements, each of a single data type. Depending on how the decomposition is carried out, for a particular data type such as text, either all the text is determined to be part of a single element, or physically separate regions of text are treated as separate elements. In one embodiment of the invention, all the elements of the document are used in the similarity search; in another embodiment, one or more elements are selected for use in the similarity search (or the user is given the opportunity to select part of an element for such further processing).

The separate elements are then used in similarity searches 23, 24 against a database (for example, a database representing content available on the World Wide Web). If all the elements are of one data type, the problem reduces to that of a conventional similarity search, which can be handled by a single search engine for the relevant data type. If the elements are of different data types, however, a separate search engine is used for each data type. Suitable search engines for similarity searching on the different data types are known. For text, for example, suitable linguistic matching toolkits are available from Teragram Corporation (http://www.teragram.com) and Inxight Software, Inc. (http://www.inxight.com). In each case, it is desirable for a suitable pre-processing step 23 to precede the matching step 24, as described briefly below for the main data types.

For example, "Inxight Summarizer" is a software component technology which summarizes a document by extracting key sentences from it. This is the pre-processing step 23. Such summaries can then be matched against each other in the matching step 24. Inxight Summarizer generates from a document an indicative summary containing key sentence elements. Stemming and text normalization techniques extract the essence of the text to give a concise and canonical synopsis of it. Here, "stemming" is the replacement of a word by its root and part of speech (for example, "I had wanted" becomes "to want/first person/pluperfect"), and "normalization" involves the replacement of any one of several forms by a single "concept" (for example, "2/3/99", "Feb 3rd, 1999" and "3rd February" are all alternative forms of the same concept). The matching step 24 can then be performed on the stemmed and normalized results of pre-processing step 23, with confidence that text of exactly the same content will be matched without being adversely affected by irrelevant grammatical considerations.

An example of an image search tool is, as noted above, the IBM QBIC package. QBIC is described further at http://wwwqbic.almaden.ibm.com. The package is adapted to pre-process an image by analysing a number of different criteria, such as the proportions of the colors occurring in the image, the layout of colors, and texture. A combination of these criteria is then used in the matching step 24. There are many other known applications which "search" a "new" image for a known object, from robot vision (robots searching for parts in a bin) to traffic monitoring systems (automatic detection of vehicle number plates). The present matching problem is essentially the reverse of these known problems.

It will also be appreciated that sequential methods can be used effectively. For example: first, a "straight edge" histogram is used to distinguish natural scenes from artificial ones; next, an "edge length" histogram is used (few long edges is likely to indicate a natural scene); a test is then made for a large area of blue tones in the upper part of the image (indicating an outdoor scene); and a test is made for a significant component of flesh tones. Flesh tones indicate an image that includes a human subject, which allows the same face to be found by face matching analysis. Clearly, combinations of sequential and parallel steps can be employed.

The result of the similarity searches is a set of series of matching scores for the documents in the database (for example, one series for each element searched).
Each of these sets of search scores must be normalized 25 before combination 26 in order to obtain the combined search result 27. The purpose of normalization step 25 is to ensure that the results of the different search steps 24 are given the correct balance. This can be done either by weighting each element of the document equally, or by weighting each element according to its perceived importance in the document or according to the user's assessment of the relative importance of the different elements of the document.

A preferred solution may involve both automatic and manual weighting. A particularly effective approach is to use summarization techniques on the text portions to generate a set of text search criteria, and to provide a set of possible criteria based on the non-text portions. These criteria are then presented to the user for verification. Such a user-based approach is easy to use (and it is easy for the user to tell when it is not working). For example, the user can be asked whether to search for matches to the text summary or to search on the image and drawing portions, according to whether the user wants "this person", "a scene like this", "pictures containing this object", or "pages like this page".

The combined result 27 is, as for a conventional similarity search, a series of matching scores (generally expressed as percentages) ranking the documents in the database from the best match to the worst. In general, the most effective user queries are achieved by using the results of a first query as the basis for constructing the next, so that the user can progressively refine the query. In practice, therefore, the combined result 27 will frequently be fed back to a later selection step, allowing effective iterative searching.

Information derived from page decomposition can be put to further use in the similarity search.
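Returning to normalization step 25 and combination step 26 described above, the following is a minimal Python sketch of one way they might be carried out. The patent does not specify a formula, so the choices here are assumptions: each engine's raw scores are rescaled so that its best match scores 1.0, the weights default to equal weighting unless the user supplies others, and the function names `normalize` and `combine` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one engine's raw scores so that its best match scores 1.0."""
    top = max(scores.values(), default=0.0)
    if top <= 0.0:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: value / top for doc_id, value in scores.items()}


def combine(per_element: Dict[str, Dict[str, float]],
            weights: Optional[Dict[str, float]] = None) -> List[Tuple[str, float]]:
    """Weighted average of normalized per-element scores, expressed as
    percentages and ranked from best match to worst."""
    if weights is None:  # default: every element weighted equally
        weights = {name: 1.0 for name in per_element}
    total_weight = sum(weights[name] for name in per_element)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    combined: Dict[str, float] = {}
    for name, scores in per_element.items():
        for doc_id, value in normalize(scores).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[name] * value
    ranked = [(doc_id, 100.0 * value / total_weight)
              for doc_id, value in combined.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Raw scores on different scales, one set per element searched (made-up data).
raw = {
    "text": {"doc_a": 8.0, "doc_b": 2.0},     # e.g. a count-based text score
    "picture": {"doc_a": 0.3, "doc_b": 0.6},  # e.g. a 0..1 image score
}
equal = combine(raw)                                          # equal weighting
user_weighted = combine(raw, {"text": 3.0, "picture": 1.0})   # user prefers text
```

Rescaling before weighting keeps an engine that happens to report large raw numbers from dominating the combined ranking; the weights then express only the intended relative importance of the elements.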
In addition to the separate elements obtained from page decomposition (graphics 11, text block 12, and picture 13), further information is provided by the arrangement of the different elements in the document. As shown in FIG. 3, a further output of page decomposition is a data type plan 31, which represents the document as a line art block, a text block, and an image block arranged in vertical sequence. Decomposition into layout is described in US Patent No. 6,002,798. The present inventors have recognized, however, that this data type plan can itself be used as a layout data type. This allows a further element, a layout data type element, to be used in a search 32 of the database (provided that layout information is available in, or derivable from, the database entries). The results of a similarity search on such a layout element can be combined with the similarity searches on the other elements, as shown in FIG. 2. In that case, the layout data type 31 results from decomposition step 22 and is then used in a search step 32 equivalent and parallel to search steps 23, 24 (followed by a normalization step, after which it is combined with the results from the other data types in step 26).

In a second embodiment of the invention, the similarity search is performed using the layout data type alone. The subsequent steps are essentially the same as in a conventional similarity search.
This is shown in FIG. 4, in which elements common to the first embodiment of the invention are given the same reference numerals as in FIG. 2.

Whether used on its own or as one of several elements in a combined search as described for the first embodiment, layout similarity searching becomes more powerful when a number of different data types are used for text and for whole document types. Using rule-based methods, specific functions can be assigned with relatively high reliability to different text blocks and to whole documents, particularly in the case of formal workflow documents. For example, it is well known that an isolated text block at the top of a page and handwriting at the bottom suggest a letter, so that different spatial regions of the document can be assigned to appropriate functional fields (address, letter text, and so on). Similarly, tables and currency totals in a document can be identified as having distinct uses, and their presence confines the document to a different group (bill, price list, or invoice). Layout searching can therefore include matching against templates representing different workflow document types (which facilitates the matching of a document determined to be a letter against other letters). A suitable mechanism is to normalize the layout for size, orientation, and skew, and then to perform an "exclusive OR" operation on the query element and the layout records in the database. This is effective when all the records involved have a generally common format.

The difficulty of this problem depends on the nature and types of the documents considered for matching. Where the "universe" of documents is well established, tools are available which can do a good job of classification and labelling within that universe (for example, OfficeMaid by DFKI). What is required in that case is classification according to a set of conventions established for every kind of document available for consideration. A "convention" here is essentially a rule that need not be strictly followed. A suitable approach to this problem is therefore rule-based (most preferably using fuzzy rules). Training a neural network is also an effective approach. The person skilled in the art will appreciate how conventional fuzzy rule or neural network approaches can be adapted for use in solving this problem.

The person skilled in the art will also appreciate that modifications of the embodiments described above can readily be made without departing from the scope of the invention as defined in the claims.
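The "exclusive OR" layout matching mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: the layouts are taken to be already normalized for size, orientation, and skew onto a common 4 x 2 grid, each grid cell carries a single-letter label (G, T, P, or "." for empty) chosen here for convenience, and the similarity formula and the names `to_masks` and `layout_similarity` are hypothetical rather than taken from the patent.

```python
from typing import Dict, List

GRID_ROWS, GRID_COLS = 4, 2   # assumed common grid after size normalization
TYPES = "GTP"                 # graphics, text, picture (illustrative labels)


def to_masks(layout: List[str]) -> Dict[str, int]:
    """Encode a layout grid as one bit mask per data type."""
    if len(layout) != GRID_ROWS or any(len(row) != GRID_COLS for row in layout):
        raise ValueError("layout must already be normalized to the common grid")
    cells = "".join(layout)
    return {t: int("".join("1" if c == t else "0" for c in cells), 2)
            for t in TYPES}


def layout_similarity(query: List[str], record: List[str]) -> float:
    """1.0 for identical layouts; lower as the XOR of the masks sets more bits."""
    q, r = to_masks(query), to_masks(record)
    differing = sum(bin(q[t] ^ r[t]).count("1") for t in TYPES)
    # A cell whose type changes flips a bit in two masks, hence the factor 2.
    return 1.0 - differing / (2 * GRID_ROWS * GRID_COLS)


# Data type plan like FIG. 3: line art on top, then text, then a picture.
query_plan = ["GG", "TT", "TT", "PP"]
# A letter-like record: text only, with an empty bottom row.
letter_plan = ["TT", "TT", "TT", ".."]

same = layout_similarity(query_plan, query_plan)
different = layout_similarity(query_plan, letter_plan)
```

A cell-by-cell comparison of this kind only makes sense when, as the description notes, the records share a generally common format; otherwise small shifts in block position would be counted as large differences.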

[Brief description of the drawings]

FIG. 1 is an explanatory diagram showing a typical document page containing different data types.

FIG. 2 is a flowchart showing a method according to a first embodiment of the invention for performing a similarity search on the document shown in FIG. 1.

FIG. 3 is an explanatory diagram representing the document shown in FIG. 1 as a layout of data types, and showing a search step that can be used in further embodiments of the method of the invention.

FIG. 4 is a flowchart showing a method according to a second embodiment of the invention for performing a similarity search on layout information.

Continuation of front page
(72) Inventor: Burns, Roland John, 62 Rock Harbor Lane, Foster City, California 94404, United States of America
F-terms (reference): 5B075 ND03 ND06 QM05; 5L096 BA17 BA20 EA45 JA03 JA09

Claims

[Claims]

1. A method of searching a database to find documents similar to a query document, comprising the steps of: decomposing the query document into elements of a plurality of different data types; performing a similarity search of a first data type for one or more of the elements of the first data type, and returning match results from the database for the one or more elements of the first data type; performing a similarity search of a second data type for one or more of the elements of the second data type, and returning match results from the database for the one or more elements of the second data type; and combining the match results from the similarity search of the first data type and the similarity search of the second data type to provide match results for the query document.

2. The method of claim 1, wherein one of the plurality of data types represents text.

3. The method of claim 2, wherein a plurality of the data types represent text, distinct ones of those data types representing different functional blocks of text.

4. The method according to any one of claims 1 to 3, wherein one of the plurality of data types represents a picture image.

5. The method according to claim 1, wherein one of the plurality of data types represents a graphic image.

6. The method of claim 1, wherein one of the plurality of data types is indicative of an arrangement of another data type in the document.

7. The method according to claim 1, wherein the steps of performing a similarity search and returning match results are performed separately for elements of each of three or more data types.

8. The method according to claim 1, wherein all features of a common data type in the document are treated as one element.

9. The method according to claim 1, wherein spatially different features of a common data type in the document are treated as separate elements.

10. The method according to claim 1, wherein an element is selectable or non-selectable by a user for the step of performing a similarity search.

11. The method according to claim 1, wherein the results of the similarity search for different elements are weighted and then combined.

12. The method of claim 11, wherein said weighting is selected by a user.

13. The method of claim 11, wherein the weighting is performed according to a determined meaning of each of the related elements in the document.

14. A method of searching a database to find documents similar to a query document, comprising the steps of: decomposing the query document into elements of different data types; determining a layout element of a layout data type; and performing a layout similarity search on the layout element, and returning match results from the database for the layout element.

15. The method of claim 14, wherein the layout similarity search comprises a search for templates representing different document types.

16. The method of claim 14, wherein the elements include elements of different data types representing different functional blocks of text.

17. The method according to claim 14, wherein the elements include elements of a data type representing an image.