JP3955410B2

JP3955410B2 - Similar information collating device, similar information collating method, and recording medium recording similar information collating program

Info

Publication number: JP3955410B2
Application number: JP07812599A
Authority: JP
Inventors: 雅彦徳永
Original assignee: AdIn Research Inc
Current assignee: AdIn Research Inc
Priority date: 1999-03-23
Filing date: 1999-03-23
Publication date: 2007-08-08
Anticipated expiration: 2019-03-23
Also published as: JP2000276472A

Description

【０００１】
【発明の属する技術分野】
本発明は、パターンとして表現できる情報の類似性を判定する類似情報照合装置及び類似情報照合方法に係わり、特に、コンピュータで情報処理される種々のパターンの照合技術に関する。
【０００２】
【従来の技術】
文字、図形、画像、音声、或いは、一般的な記号のような情報は、その情報を構成する要素が位置と特徴とによって表現することができる。このように、要素の特徴に着目したときに各要素の特徴が時空間的に関連して配置される情報は、所謂パターンと呼ばれる。従来より、このパターンをコンピュータで処理するため各種のパターン情報処理方式が提案されている。パターン情報処理の過程において、類似した情報の合致度を評価すべき状況、例えば、文字情報の場合に、類似文字列を検索することは頻繁に要求される。そのため、情報の類似性を判定する方式として、情報の要素をパターンとして表現し、パターン間の関連性を評価するパターン照合方式が知られている。
【０００３】
従来のパターン照合方式の例として、文字列照合システムについて説明する。例えば、文字列照合システムを利用する従来の高速全文検索技術は、“第２部高速全文検索の要素技術カギを握るインデクス処理”、日経バイト、１９９６年１０月号、ページ１５８−１６７に記載されている。この引用文献に記載されている従来の典型的な照合システムは、照合文字列及び被照合文字列を固定長の微小な部分に分割する。ここで、用語「照合文字列」及び「被照合文字列」の用法を簡単に説明すると、例えば、文字列Ａと類似した文字列Ｂを文書Ｃの中から見つける場合に、文字列Ａが「照合文字列」であり、文書Ｃの中の文字列Ｂが「被照合文字列」である。次に、照合システムは、照合文字列の微小部分が被照合文字列の微小部分文字列群に含まれるかどうかを判定し、当該微小部分文字列を含む被照合文字列を照合文字列に類似した文字列として出力する。
【０００４】
このようなタイプの照合システムは、文字列の中の微小部分が完全一致する文字列の有無を判定する。そのため、文字列の一部が欠落した場合、文字列の一部が他の文字列で置換された場合、或いは、文字列の中に他の文字列が混入した場合のように、照合文字列若しくは被照合文字列に局部的な変形が生じた場合に、変形した箇所の周辺の微小文字列が一致しないため、文字列が照合しないと判定される。このように従来技術の第１のタイプの照合システムでは、文字列の局所的な変形を許容できないという欠点がある。
【０００５】
【発明が解決しようとする課題】
本発明は、上述の従来の照合システムの問題点に鑑み、情報の類似性を判定する類似情報照合装置において、照合される情報に対応した第１のパターン或いは第２のパターン内で、部分的に一致する微小パターンが一部欠落、他のパターンとの置換、或いは、他のパターンによる混入などによって、パターンの全域に分散された場合でも、パターンの照合を行うことにより情報の類似性を判定することができる類似情報照合装置、類似情報照合方法及び類似情報照合プログラムを記録した記録媒体の提供を目的とする。
【０００６】
【課題を解決するための手段】
上記の目的を達成するため、本発明は、パターンの照合位置を追跡し、離間した照合位置を許容する連続性の概念を導入し、この連続性を評価して照合の漏れを防止する。
図１は本発明の原理構成図である。本発明の情報の類似性を判定する類似情報照合装置１は、
照合されるべき第１の情報及び第２の情報から、情報の要素の位置及び特徴により表されるパターンとして、上記第１の情報に対応する第１のパターン及び上記第２の情報に対応する第２のパターンを生成するパターン生成手段１０と、
上記第１のパターン及び上記第２のパターンの中で同じ特徴を有する上記第１のパターンに属する第１の要素及び上記第２のパターンに属する第２の要素の夫々の位置の対を座標とする照合位置により構成される照合マップ３０を作成する照合マップ生成手段２０と、
上記照合マップ３０内で近傍にある上記照合位置が順次に連結された経路毎に上記経路の連続性を評価する連続性評価手段４０と、
上記経路毎に評価された連続性に基づいて上記第１のパターンと上記第２のパターンの合致度を判定するパターン照合手段５０とを含む。
【０００７】
上記照合マップ作成手段２０は、同じ特徴を有する上記第１の要素及び上記第２の要素の複数の組合せに対し、個別に上記照合位置を作成することを特徴とする。
また、上記パターン照合手段５０は、上記照合位置毎に該照合位置を通過する上記経路に対し評価された連続性の中で最も高い連続性を該照合位置の評価値として設定する手段と、上記照合位置毎に設定された評価値に基づいて上記第１のパターンと上記第２のパターンの合致度を計算する手段とを有する。
【０００８】
さらに、上記パターン生成手段１０は、上記パターンとして表される上記情報の少なくとも一部の要素に対し、上記少なくとも一部の元の要素の特徴を置換可能な特徴を有する同義的な要素を生成する手段と、上記同義的な要素が上記元の要素と同時に列挙されるよう上記パターンを生成する手段とを有し、上記照合マップ生成手段２０と、上記連続性評価手段３０と、上記パターン照合手段４０とは、同時に列挙された上記同義的な要素を上記元の要素と並行して処理するよう適合されていることを特徴とする。
【０００９】
また、上記照合マップ生成手段２０は、上記要素が数値を表現する特徴を有する場合に、数値の表す値が一致する場合に同じ特徴であると判定する手段を有するように構成してもよい。
図２は、上記本発明の目的を達成する情報の要素の位置及び特徴により表される第１のパターンと第２のパターンを照合することにより情報の類似性を判定する類似情報照合方法の動作フローチャートである。同図に示す如く、本発明の類似情報照合方法は、
上記第１のパターン及び上記第２のパターンを入力する段階（ステップ１）と、
上記第１のパターン及び上記第２のパターンの中で同じ特徴を有する上記第１のパターンに属する第１の要素及び上記第２のパターンに属する第２の要素を検出する段階（ステップ２）と、
上記検出された第１の要素及び第２の要素の夫々の位置の対を座標とする照合マップを作成する照合マップ生成段階（ステップ３）と、
上記照合マップ内で近傍にある上記照合位置を順次に連結することにより経路を生成する経路生成段階（ステップ４）と、
上記生成された経路毎に上記経路の連続性を評価する連続性評価段階（ステップ５）と、
上記経路毎に評価された連続性に基づいて上記第１のパターンと上記第２のパターンの合致度を判定するパターン照合段階（ステップ６）とを含む。
【００１０】
また、情報の類似性を判定する類似情報照合システムにおいて、情報の類似性を判定する上記の本発明の類似情報照合装置及び方法は、コンピュータが読み取り可能な記録媒体に記録したプログラム（ソフトウェア）として実現してもよい。
したがって、本発明は、情報の類似性を判定する類似情報照合プログラムを記録したコンピュータが読み取り可能な記録媒体を含む。上記類似情報照合プログラムは、
照合されるべき第１の情報及び第２の情報から、情報の要素の位置及び特徴により表されるパターンとして、上記第１の情報に対応する第１のパターン及び上記第２の情報に対応する第２のパターンを生成させるパターン生成コードと、
上記第１のパターン及び上記第２のパターンの中で同じ特徴を有する上記第１のパターンに属する第１の要素及び上記第２のパターンに属する第２の要素の夫々の位置の対を座標とする照合位置により構成される照合マップを作成させる照合マップ生成コードと、
上記照合マップ内で近傍にある上記照合位置が順次に連結された経路毎に上記経路の連続性を評価させる連続性評価コードと、
上記経路毎に評価された連続性に基づいて上記第１のパターンと上記第２のパターンの合致度を判定させるパターン照合コードとを含むことを特徴とする。
【００１１】
【発明の実施の形態】
以下、添付図面を参照して本発明の一実施例による文字列照合システムを説明する。本実施例の文字列照合システムは、被検索文書ファイルに保存された被検索文書の中からオペレータが入力した検索文と類似した文を含む被検索文書をオペレータに提示するシステムである。
【００１２】
図３は、本発明の一実施例による文字列照合システムの概略的な構成図であり、図４は、この文字列照合システムの動作フローチャートである。文字列照合システムは、ステップ１０においてオペレータから入力された検索文を受ける照合データ生成部１１０を有する。また、照合データ生成部１１０は、検索文を照合に適した照合データとしての照合文字列に変換する（ステップ２０）。文字列照合システムは、ステップ２０において被検索文書ファイル１４０から被検索文書を取り出し、照合文字列との照合に適した被照合文字列及び被照合文字列が属する被検索文書の文書識別番号を含む被照合データを生成する被照合データ生成部１３０を更に有する。
【００１３】
照合データとしての照合文字列及び被照合データとしての被照合文字列は、種々の情報を表現するパターンの中で、特に、文字情報を表現するパターンである。文字列内の各文字がパターンの要素に対応する。要素は、その文字の特徴としての文字コードと、その文字の文字列内における位置とによって表される。
また、文字列照合システムは照合マップ生成部１５０を更に有し、照合マップ生成部１５０は、照合データ生成部１１０からの照合文字列と、被照合データ生成部１３０からの被照合文字列とを受け、共通文字を検出し（ステップ３０）、照合マップを作成、出力する（ステップ４０）。
【００１４】
照合マップは、照合文字列及び被照合文字列に共通して含まれる共通文字の照合文字列及び被照合文字列での位置を夫々Ｘ座標及びＹ座標として表される位置（Ｘ，Ｙ）を照合位置として有するマップである。照合マップは、パターンとして文字列が採用される場合には２次元のマップとして構築することができる。
文字列照合システムは連続性評価部１６０及び検索結果出力部１７０を更に有する。連続性評価部１６０は、照合マップ生成部１５０によって作成、出力された照合マップを受け、照合文字列と被照合文字列とを照合し、照合結果を検索結果出力部に渡す。そのため、連続性評価部１６０は、照合マップ内で、照合位置から近傍の照合位置を順次に追跡することにより一連の照合位置を含む経路を形成し（ステップ５０）、経路毎に連続性の値を計算し（ステップ６０）、各照合位置に対する評価値として、その照合位置を通過する経路の連続性の値の中で最高の連続性の値を選択し（ステップ７０）、照合文字列の各文字についての照合位置の評価値を集計する（ステップ８０）。この集計結果は、照合文字列と被照合文字列の類似性を表している。
【００１５】
検索結果出力部１７０は、連続性評価部１６０から照合結果を受け、照合文字列と類似していると判定された被照合文字列を含む被検索文書に関する情報を被検索文書ファイル１４０から取り出し、オペレータに通知する（ステップ９０）。
以下、本発明の一実施例の文字列照合システムについて詳述する。
【００１６】
図５は本例における文字列照合システムの照合データ生成部１１０の構成図である。同図に示す如く、照合データ生成部１１０は、検索文を入力し、検索文拡張辞書１２０を参照して拡張検索文を出力する検索文拡張部１１１と、検索拡張部を入力して正規化拡張検索文に変換する検索文正規化部１１２と、正規化拡張検索文を入力して数値表現部分を同じ形式に変換し、最終的な照合データとしての照合文字列を出力する数値表現置き換え部１１３とを有する。
【００１７】
検索文拡張部１１１は、オペレータから検索文（例えば、「文書検索の高速化」）が入力され、検索文の中の文字列（例えば、「検索」）を置き換え可能な文字列が検索文拡張辞書１２０内に存在するかどうかを判定し、置き換え可能な文字列（例えば、「抽出」）が検索文拡張辞書１２０内に存在する場合に、その置き換え可能な文字列（「抽出」）を、文字列（「検索」）の同義語として検索文に付加し、拡張検索文を出力する。置き換え可能な文字列が存在しない場合には、入力された検索文がそのまま拡張検索文として出力される。この場合の拡張検索文は、「文書｛検索｜抽出｝の高速化」のように表現され、｛文字列ａ｜文字列ｂ｝の部分が拡張された部分であり、文字列ａと文字列ｂが同義語であることを表す。
【００１８】
検索文正規化部１１２は、検索文拡張部１１１から出力された拡張検索文を入力し、拡張検索文内の文字の正規化を行い、正規化拡張検索文を出力する。文字の正規化とは、例えば、英数字カナの半角文字から全角文字への変換、英字の小文字から大文字への変換、或いは、句読点、改行制御文字及び伸張音等の検索の際に無視されるべきノイズ文字の削除等の処理を意味する。
【００１９】
数値表現置き換え部１１３は、検索文正規化部１１２から出力された正規化拡張検索文を受け、正規化拡張検索文中に数値により定量表現された部分文字列が存在するかどうかを判定する。定量表現された部分文字列が存在する場合、その部分文字列を値に変換し、正規化拡張検索文中の定量表現部分が値によって置換された最終的な照合データとしての照合文字列を作成、出力する。
【００２０】
図６は本例における文字列照合システムの被照合データ生成部１３０の構成図である。同図に示す如く、被照合データ生成部１３０は、被検索文書ファイル１４０から、検索に必要な全ての被検索文書を読み出し、出力する被検索文書読み込み部１３１を有する。被照合データ生成部１３０は、被検索文書正規化部１３２及び数値表現置き換え部１３３を更に有する。
【００２１】
被検索文書正規化部１３２は、被検索文書読み込み部１３１から被検索文書を入力し、被検索文書内の文字の正規化を行い、正規化被検索文書を出力する。文字の正規化については、検索文正規化部１１２で説明した通りである。
また、数値表現置き換え部１３３は、被検索文書正規化部１３２から正規化被検索文書を受け、正規化被検索文書中に数値により定量表現された部分文字列が存在する場合に、その部分文字列を値に変換し、正規化被検索文書中の定量表現部分が値によって置換された最終的な被照合データとしての被照合文字列を作成、出力する。
【００２２】
次に、本発明の一実施例による文字列照合システムの照合マップ生成部１５０の機能について詳述する。図３に示される如く、照合マップ生成部１５０は、照合データ生成部１１０及び被照合データ生成部１３０に接続され、照合データとしての照合文字列及び被照合データとしての被照合文字列を夫々から受け、照合マップを生成するよう機能する。
【００２３】
照合マップ生成部１５０は、最初に、照合文字列と被照合文字列の双方に共通して含まれる文字、すなわち、共通文字を検出する。例えば、照合文字列を「文書検索の高速化」とし、被照合文字列を「高速な文書の検索を行う」とすると、共通文字は、「高」、「速」、「文」、「書」、「の」、「検」及び「索」である。次に、照合文字列における共通文字の位置をＹ座標とし、被照合文字列における共通文字の位置をＸ座標とする照合位置により構成される照合マップを生成する。
【００２４】
図７は、照合マップの概念がよりよく理解されるように、一例として、上記の照合文字列及び被照合文字列に対し生成された照合マップを視覚的に表現した説明図である。同図において“○”で示される点が照合位置に対応する。本例では、簡単のため、被照合文字列は同義語を含まない場合を想定している。
一方、既に説明した通り、検索文拡張部１１１において、照合文字列「文書検索の高速化」が拡張検索文「文書｛検索｜抽出｝の高速化」の形として同義語を含むように拡張されている場合、図８に示すような照合マップが得られる。この場合、検索文字列内での共通文字の位置を表すＹ座標は補正される。すなわち、「検」と「抽」のＹ座標、並びに、「索」と「出」のＹ座標は一致するように補正される。
【００２５】
かくして、照合マップ生成部１５０は、照合文字列中の共通文字の位置を表すＹ座標値と、照合文字列に同義語が含まれる場合の共通文字の位置の補正値であるＹ補正値と、被照合文字列中の共通文字の位置を表すＸ座標値と、被照合データに対応する文書識別番号とを照合マップとして出力する。
連続性評価部１６０は、総合マップ生成部１５０から照合マップを入力する。連続性評価部１６０では、文書識別番号毎に、照合文字列と被照合文字列の類似性が評価される。そのため、連続性評価部１６０は、最初に、照合マップ内の照合位置を追跡し、全ての照合位置に連続性評価値を付与し、次に、同じ照合文字列内の文字に対し存在し得る複数の照合位置の連続性評価値の中から最大値を照合文字列内の当該文字の連続性評価値として選択する。最後に、照合文字列内の文字毎に得られた連続性評価値を照合文字列全体に関して集計し、正規化し、得られた値を照合文字列と被照合文字列の合致度とする。合致度は、文書識別番号と共に連続性評価部１６０から検索結果出力部１７０に送られる。
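[0026]
The continuity evaluation is described in detail below. FIG. 9 is an explanatory diagram of the path tracking performed for continuity evaluation in the character string matching system according to an embodiment of the present invention. For the collation positions of the collation map shown in FIG. 7, the path tracking process searches for other collation positions within an effective distance of a given collation position and links them. By repeating this process, the collation positions in the collation map are grouped into several paths, which may include branches. FIG. 9 shows a path from 「高」 to 「速」, and a path containing 「文」 and 「書」 that branches at 「書」: one branch leads from 「書」 to 「の」, and the other leads from 「書」 through 「検」 to 「索」.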
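The linking step can be sketched as follows. This is a minimal illustration, not the patented implementation: the name `build_links`, the 0-based coordinates, the effective distance of 3.2, and the restriction to links that advance in both X and Y are assumptions made for this sketch. The text itself specifies only that positions within an effective distance are linked.

```python
from math import hypot

# Collation positions (x, y) for the FIG. 7 example, 0-based:
# x = index in 「高速な文書の検索を行う」, y = index in 「文書検索の高速化」.
POSITIONS = {"文": (3, 0), "書": (4, 1), "検": (6, 2), "索": (7, 3),
             "の": (5, 4), "高": (0, 5), "速": (1, 6)}

def build_links(positions, effective_distance=3.2):
    """Link each position to every position ahead of it within the effective distance."""
    links = {p: [] for p in positions}
    for (x1, y1) in positions:
        for (x2, y2) in positions:
            forward = x2 > x1 and y2 > y1  # assumption: links only advance in X and Y
            if forward and hypot(x2 - x1, y2 - y1) <= effective_distance:
                links[(x1, y1)].append((x2, y2))
    return links

links = build_links(list(POSITIONS.values()))
```

Under these assumed settings, the position of 「書」 links to both 「検」 and 「の」, which gives the branch described for FIG. 9, while 「高」 links only to 「速」.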
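[0027]
FIG. 10 illustrates four typical forms of continuity between collation positions. In general, where consecutive characters match, the links between collation positions line up along the 45-degree diagonal running down and to the right. Part (A) shows an exact match, in which all collation positions lie on that diagonal. Part (B) shows a case in which one character of data is missing, (C) a case in which one character is substituted, and (D) a case in which two characters are inserted. By following these links, the match can be evaluated with continuity preserved even when data is missing, substituted, or inserted.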
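The four cases can be reproduced with a small sketch that prints the step between consecutive collation positions. The helper `steps`, the Latin test strings, and the choice to apply each deformation to the collated string are assumptions for illustration only.

```python
def steps(collation, collated):
    """Return the (dx, dy) step between consecutive collation positions.

    Assumes every character occurs at most once in each string, so each
    common character yields exactly one collation position (x, y).
    """
    positions = sorted((collated.index(c), y)
                       for y, c in enumerate(collation) if c in collated)
    return [(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(positions, positions[1:])]

exact = steps("ABCDE", "ABCDE")        # (A) exact match
missing = steps("ABCDE", "ABDE")       # (B) one character missing
substituted = steps("ABCDE", "ABXDE")  # (C) one character substituted
inserted = steps("ABCDE", "ABXYCDE")   # (D) two characters inserted
```

A step of (1, 1) is a link on the diagonal. Each deformation shows up as a single longer step, after which the diagonal resumes, which is why a tolerance on link length is enough to keep the path connected.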
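[0028]
Following path generation, the continuity evaluation unit 160 calculates the degree of match. To weight the links between collation positions, a character type is defined for every character, and the character at each collation position is classified. Assuming that characters are classified into two types in this example, "kanji" and "kana", the classification is as follows:
Kanji: 「高」 「速」 「文」 「書」 「検」 「索」
Kana: 「の」
Next, a weight for the link between character types is set as follows, according to the character types t1 and t2 of the characters before and after the link.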

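A weight is also assigned to each link according to its length (the distance between the collation positions before and after the link). For example, the weight based on link length is expressed as follows.
Weight based on link length (W_l) =
g(x₁, y₁, x₂, y₂) = 1 / {(x₂ − x₁)² + (y₂ − y₁)²}
Finally, combining the weight between character types (W_t) with the weight based on link length (W_l) gives the following evaluation value for a single link.
Evaluation value of one link =
v = W_t · W_l = f(t₁, t₂) · g(x₁, y₁, x₂, y₂)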
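The link evaluation value can be computed as in the sketch below. The length weight g follows the formula given in the text. The character-type weights in `TYPE_WEIGHT` are hypothetical placeholders, since the weight table is not reproduced in this text, and the kanji test by code-point range is a simplifying assumption.

```python
def char_type(ch):
    """Classify a character as 'kanji' or 'kana' (simplified two-way split)."""
    return "kanji" if "\u4e00" <= ch <= "\u9fff" else "kana"

# Hypothetical values of f(t1, t2); the source's weight table is not available here.
TYPE_WEIGHT = {("kanji", "kanji"): 1.0, ("kanji", "kana"): 0.5,
               ("kana", "kanji"): 0.5, ("kana", "kana"): 0.25}

def length_weight(x1, y1, x2, y2):
    """g(x1, y1, x2, y2) = 1 / ((x2 - x1)^2 + (y2 - y1)^2)."""
    return 1.0 / ((x2 - x1) ** 2 + (y2 - y1) ** 2)

def link_value(c1, p1, c2, p2):
    """v = f(t1, t2) * g(x1, y1, x2, y2) for the link from (c1, p1) to (c2, p2)."""
    return TYPE_WEIGHT[(char_type(c1), char_type(c2))] * length_weight(*p1, *p2)

v_adjacent = link_value("文", (3, 0), "書", (4, 1))  # diagonal neighbours
v_branch = link_value("書", (4, 1), "の", (5, 4))    # longer kanji-to-kana link
```

Because g falls off with the square of the distance, the diagonal link between adjacent characters scores far higher than the longer branch, whatever values are chosen for the type weights.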
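The continuity evaluation unit 160 then sums the link evaluation values over all links on a path obtained by path tracking in the collation map, yielding an evaluation value for the path as a whole. The evaluation value V of one path can be calculated, for example, according to the following equation.
[0029]
[Equation 1]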

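In the equation, k is the index of a link on the path of interest, n is the total number of links on the path of interest plus 1, and v_k is the evaluation value of each link on the path of interest.
The path evaluation value V obtained in this way is set, as the collation position evaluation value V_xy, at each collation position on the path of interest. When a path includes branches, the path containing the branch with the highest evaluation value among the path evaluation values calculated for each branch can, for example, be selected as the valid one. By obtaining the evaluation value V for every path generated in the collation map in this way, a collation position evaluation value V_xy is obtained for every collation position in the collation map.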
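[0030]
Next, an evaluation value is obtained for each character in the collation character string. For example, when each character in the collation character string has at most one corresponding collation position, as shown in FIG. 7, the evaluation value of that collation position is set as the evaluation value of each character that has a corresponding collation position, and the evaluation value of every other character in the collation character string is set to zero. When a character in the collation character string has two or more corresponding collation positions, the largest of their evaluation values is set as the evaluation value of that character. In this way, a continuity evaluation value is obtained for every character in the collation character string.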
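[0031]
Finally, to obtain the degree of match between the collation character string as a whole and the collated character string, the continuity evaluation values of all characters in the collation character string are aggregated into a total. The total continuity evaluation value V_total is calculated, for example, according to the following equation.
[0032]
[Equation 2]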

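The degree of match between the collation character string as a whole and the collated character string is expressed, for example, as the value obtained by dividing this total V_total by V_equal, the total continuity evaluation value for an exact match.
Degree of match = V_total / V_equal
Expressed in this way, the degree of match takes its maximum value of 1.0 for an exact match. The degree of match obtained in this way is sent, together with the document identification number, to the search result output unit 170 as the collation result.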
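The per-character selection and the final normalization can be sketched as follows. The function `degree_of_match` and its inputs are hypothetical. The sketch assumes non-negative evaluation values and assumes that the aggregation is a plain sum, because the aggregation equation is not reproduced in this text. V_equal is passed in by the caller for the same reason.

```python
def degree_of_match(position_values, collation_length, v_equal):
    """Aggregate collation-position values V_xy into a degree of match.

    position_values maps (x, y) to V_xy, where y indexes the collation string.
    """
    char_values = [0.0] * collation_length  # characters with no position score zero
    for (x, y), value in position_values.items():
        char_values[y] = max(char_values[y], value)  # best position per character
    v_total = sum(char_values)  # assumption: aggregation is a plain sum
    return v_total / v_equal

# Character 0 has two candidate positions, character 2 has none.
values = {(3, 0): 0.8, (9, 0): 0.5, (4, 1): 0.8}
score = degree_of_match(values, collation_length=3, v_equal=3.0)
```

In this toy input the lower-valued duplicate position for character 0 is ignored and the unmatched character contributes zero, so the score is 1.6 / 3.0.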
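[0033]
FIG. 11 is a configuration diagram of the search result output unit of the character string matching system according to an embodiment of the present invention. As shown in the figure, the search result output unit 170 includes a collation result conversion unit 171, a search result display unit 172, a search result selection unit 173, and a document display unit 174.
The collation result conversion unit 171 receives the degree of match and the document identification number from the continuity evaluation unit 160 as the collation result, reads the heading, summary information, and the like of the corresponding document from the searched document file 140 based on the document identification number, sorts the information on the documents in order of degree of match, and outputs it as the search result.
[0034]
The search result display unit 172 receives the search result from the collation result conversion unit 171, displays it on a display device such as a display, and passes it to the search result selection unit 173 in the next stage.
The search result selection unit 173 receives the search result from the search result display unit 172 together with the operator's instructions given in response to the displayed search result. When the operator specifies a document to be read, the unit reads the specified document from the searched document file 140 and outputs it as the selected document.
[0035]
The document display unit 174 receives the selected document output from the search result selection unit 173 and displays it on a display device such as a display.
With the configuration and operation described with reference to FIGS. 3 to 11, the character string matching system according to an embodiment of the present invention can collate a search sentence input by the operator against the documents stored in the searched document file and present to the operator the documents containing sentences similar to the search sentence.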
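[0036]
Next, the processing performed in the character string matching system when the search sentence expansion unit 111 outputs an extended search sentence is described. In this example, the search sentence is 「文書検索の高速化」 and the character string 「検索」 has the synonym 「抽出」. As already described, when the collation data contains a synonym, it is interpreted as multiple items of collation data, 「文書検索の高速化」 and 「文書抽出の高速化」. In addition, by using
Synonym data regular expression: 文書｛検索｜抽出｝の高速化
an extended search sentence is created in which the synonyms are enumerated within the collation data. When the collation data contains synonyms in this way, it is processed as though one of the synonyms at the same position had been selected. FIG. 12 is an explanatory diagram of path tracking in the collation map containing synonyms shown in FIG. 8. The effective distance used in path tracking is calculated taking into account a distance correction value. This value represents the difference between the distance between collation positions on a path as placed in the actually generated collation map and the distance between those positions on the theoretical collation map that would be generated if a single synonym had been selected.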
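[0037]
Finally, the processing performed when the numeric expression replacement unit 113 or 133 replaces a quantitative expression in the collation character string or the collated character string with a numerical value is described. FIG. 13 is a flowchart of the processing procedure for similar quantitative character matching.
First, a substring quantitatively expressed by numerals is extracted from the collation character string or the collated character string (step 100). Second, the extracted substring is converted into a value (step 101). Third, the degree of match between the numerical values is calculated based on the converted values (step 102).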
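[0038]
Here, a substring quantitatively expressed by numerals is extracted by detecting and taking out a portion of the character string in which numeric expression characters appear consecutively. For example, the following characters are detected as numeric expression characters.
１２３４５６７８９０
一二三四五六七八九零
十百千万・・・・
The degree of match can be calculated according to the following equation, based on the value obtained from the collation character string (Vs) and the value obtained from the collated character string (Vd).
[0039]
[Equation 3]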

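FIG. 14 is an explanatory diagram of the similar quantitative character matching process. It shows the collation character string and the collated character string, the quantitatively expressed substrings, the numerical values converted from the substrings, and the degree of match between the converted values.
Therefore, according to an embodiment of the present invention, a computer-based character string matching system can collate character strings even when partially matching character strings are scattered because substrings in the collation character string or the collated character string are missing, substituted with other character strings, or mixed with other character strings.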
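[0040]
The configuration of the character string matching system according to an embodiment of the present invention is not limited to the example described in the above embodiment. Each component of the system can be built as software (a program), recorded on a disk device or the like, and installed on the computer of the character string matching system as needed to perform character string matching. The program can also be stored on a portable recording medium such as a floppy disk or CD-ROM and used generally wherever such a character string matching system is used.
[0041]
The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.
[0042]
[Effects of the Invention]
As described above, according to the present invention, the collation positions of patterns are tracked during pattern matching, and the continuity of the patterns can be evaluated even when the collation positions are separated from one another. Collation is therefore possible even when the partially matching patterns between the collation pattern and the collated pattern are scattered because part of either pattern is missing, has been replaced with another pattern, or has had another pattern inserted into it. As a result, matches are not missed even if the operator is not familiar with the content of the collated pattern, and the burden on the operator is reduced.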
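[Brief Description of the Drawings]
[FIG. 1] A diagram of the principle configuration of the present invention.
[FIG. 2] An operation flowchart of the similar information collating method of the present invention.
[FIG. 3] A schematic configuration diagram of a character string matching system according to an embodiment of the present invention.
[FIG. 4] An operation flowchart of the character string matching system according to an embodiment of the present invention.
[FIG. 5] A configuration diagram of the collation data generation unit of the character string matching system according to an embodiment of the present invention.
[FIG. 6] A configuration diagram of the collated data generation unit of the character string matching system according to an embodiment of the present invention.
[FIG. 7] An explanatory diagram visually representing a collation map according to an embodiment of the present invention.
[FIG. 8] An explanatory diagram of a collation map that includes synonyms.
[FIG. 9] An explanatory diagram of the path tracking performed for continuity evaluation in the character string matching system according to an embodiment of the present invention.
[FIG. 10] A diagram illustrating the continuity of collation positions.
[FIG. 11] A configuration diagram of the search result output unit of the character string matching system according to an embodiment of the present invention.
[FIG. 12] An explanatory diagram of path tracking in the collation map including synonyms shown in FIG. 8.
[FIG. 13] A flowchart of the processing procedure for similar quantitative character matching.
[FIG. 14] An explanatory diagram of the similar quantitative character matching process.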
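[Explanation of Reference Numerals]
1 Similar information collating apparatus
10 Pattern generation means
20 Collation map generation means
30 Collation map
40 Continuity evaluation means
50 Pattern matching means
[0001]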
[Technical Field of the Invention]
The present invention relates to a similar information collation apparatus and a similar information collation method for determining the similarity of information that can be expressed as a pattern, and more particularly to a technique for collating various patterns processed by a computer.
[0002]
[Prior art]
Information such as characters, graphics, images, sounds, or general symbols can be expressed by the positions and features of the elements that make up the information. Information in which the features of the elements are arranged in a spatio-temporal relationship to one another is called a pattern. Various pattern information processing methods have been proposed for processing such patterns by computer. Pattern information processing frequently requires evaluating the degree of match between similar pieces of information; in the case of character information, for example, this means searching for similar character strings. For this reason, pattern matching methods that express the elements of information as patterns and evaluate the relationship between patterns are known as a way of determining the similarity of information.
[0003]
A character string matching system is described here as an example of a conventional pattern matching method. For example, a conventional high-speed full-text search technique using a character string matching system is described in “Part 2: Elemental technologies of high-speed full-text search: index processing holds the key”, Nikkei Byte, October 1996, pages 158-167. The typical conventional collation system described in this reference divides the collation character string and the collated character string into small fixed-length fragments. To explain the terms briefly: when a character string B similar to a character string A is to be found in a document C, character string A is the “collation character string” and character string B in document C is the “collated character string”. The collation system then determines whether each fragment of the collation character string is included in the set of fragments of the collated character string, and outputs a collated character string containing those fragments as a character string similar to the collation character string.
[0004]
This type of collation system determines whether there is a character string whose fragments match exactly. Consequently, when the collation character string or the collated character string is locally deformed, for example because part of the string is missing, part of it has been replaced with another character string, or another character string has been inserted into it, the fragments around the deformed portion do not match and the strings are judged not to match. This first type of prior-art collation system therefore has the drawback that it cannot tolerate local deformation of a character string.
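The drawback can be seen in a small sketch of fragment-based matching. The fragment length of 2, the use of overlapping fragments, and the helper names are assumptions for illustration; the cited system's actual index structure is not described here.

```python
def fragments(text, n=2):
    """Split text into overlapping fixed-length fragments of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fragment_hits(collation, collated, n=2):
    """Report, for each fragment of the collation string, whether the collated string contains it."""
    collated_set = set(fragments(collated, n))
    return {frag: frag in collated_set for frag in fragments(collation, n)}

# One inserted character 「の」 breaks the fragment that spans the insertion point.
hits = fragment_hits("文書検索", "文書の検索")
```

Here 「文書」 and 「検索」 are found but 「書検」 is not, so a system that requires every fragment to match exactly rejects the pair even though only one character was inserted.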
[0005]
[Problems to be solved by the invention]
In view of the above problems of conventional collation systems, an object of the present invention is to provide a similar information collating apparatus, a similar information collating method, and a recording medium on which a similar information collating program is recorded, which can determine the similarity of information by collating patterns even when, within the first pattern or the second pattern corresponding to the information to be collated, the partially matching fragments are scattered over the whole pattern because part of a pattern is missing, has been replaced with another pattern, or has had another pattern inserted into it.
[0006]
[Means for Solving the Problems]
To achieve the above object, the present invention tracks the collation positions of patterns, introduces a concept of continuity that tolerates collation positions separated from one another, and evaluates this continuity to prevent matches from being missed.
FIG. 1 is a diagram of the principle configuration of the present invention. The similar information collating apparatus 1 of the present invention, which determines the similarity of information, includes:
pattern generation means 10 for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each as a pattern represented by the positions and features of the elements of the information;
collation map generation means 20 for creating a collation map 30 composed of collation positions, each of which has as its coordinates the pair of positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature;
continuity evaluation means 40 for evaluating, for each path formed by sequentially connecting neighboring collation positions in the collation map 30, the continuity of the path; and
pattern matching means 50 for determining the degree of match between the first pattern and the second pattern based on the continuity evaluated for each path.
[0007]
The collation map generation means 20 creates a separate collation position for each of a plurality of combinations of a first element and a second element having the same feature.
The pattern matching means 50 has means for setting, as the evaluation value of each collation position, the highest continuity among those evaluated for the paths passing through that collation position, and means for calculating the degree of match between the first pattern and the second pattern based on the evaluation values set for the collation positions.
[0008]
Further, the pattern generation means 10 has means for generating, for at least some elements of the information represented as the pattern, synonymous elements whose features can replace the features of those original elements, and means for generating the pattern so that the synonymous elements are enumerated together with the original elements. The collation map generation means 20, the continuity evaluation means 40, and the pattern matching means 50 are adapted to process the enumerated synonymous elements in parallel with the original elements.
[0009]
The collation map generation means 20 may also be configured to have means for determining, when elements have features that express numerical values, that the elements have the same feature if the values they express are equal.
FIG. 2 is an operation flowchart of a similar information collating method that achieves the above object of the present invention by collating a first pattern and a second pattern, each represented by the positions and features of the elements of information, to determine the similarity of the information. As shown in the figure, the similar information collating method of the present invention includes:
a step of inputting the first pattern and the second pattern (step 1);
a step of detecting a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature (step 2);
a collation map generation step of creating a collation map whose coordinates are the pairs of positions of the detected first elements and second elements (step 3);
a path generation step of generating paths by sequentially connecting neighboring collation positions in the collation map (step 4);
a continuity evaluation step of evaluating the continuity of each generated path (step 5); and
a pattern matching step of determining the degree of match between the first pattern and the second pattern based on the continuity evaluated for each path (step 6).
[0010]
In a similar information collating system that determines the similarity of information, the similar information collating apparatus and method of the present invention described above may also be implemented as a program (software) recorded on a computer-readable recording medium.
Accordingly, the present invention includes a computer-readable recording medium on which a similar information collating program for determining the similarity of information is recorded. The similar information collating program includes:
pattern generation code for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each as a pattern represented by the positions and features of the elements of the information;
collation map generation code for creating a collation map composed of collation positions, each of which has as its coordinates the pair of positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature;
continuity evaluation code for evaluating, for each path formed by sequentially connecting neighboring collation positions in the collation map, the continuity of the path; and
pattern matching code for determining the degree of match between the first pattern and the second pattern based on the continuity evaluated for each path.
[0011]
[Embodiments of the Invention]
Hereinafter, a character string matching system according to an embodiment of the present invention will be described with reference to the accompanying drawings. The character string matching system of this embodiment presents to the operator those searched documents, among the documents stored in the searched document file, that contain a sentence similar to the search sentence input by the operator.
[0012]
FIG. 3 is a schematic configuration diagram of a character string matching system according to an embodiment of the present invention, and FIG. 4 is an operation flowchart of this system. The character string matching system has a collation data generation unit 110 that receives the search sentence input by the operator in step 10. The collation data generation unit 110 converts the search sentence into a collation character string, that is, collation data suitable for collation (step 20). The character string matching system further has a collated data generation unit 130 that, in step 20, retrieves the searched documents from the searched document file 140 and generates collated data containing a collated character string suitable for collation with the collation character string and the document identification number of the searched document to which the collated character string belongs.
[0013]
The collation character string (the collation data) and the collated character string (the collated data) are, among the patterns that express various kinds of information, patterns that express character information. Each character in a character string corresponds to an element of the pattern. An element is represented by the character code, as the feature of the character, and by the position of the character within the character string.
The character string matching system further has a collation map generation unit 150. The collation map generation unit 150 receives the collation character string from the collation data generation unit 110 and the collated character string from the collated data generation unit 130, detects the common characters (step 30), and creates and outputs a collation map (step 40).
[0014]
The collation map is a map whose collation positions are the positions (X, Y) obtained by taking, as the X coordinate and the Y coordinate, the positions in the collation character string and in the collated character string of each common character contained in both strings. When character strings are used as the patterns, the collation map can be constructed as a two-dimensional map.
The character string matching system further has a continuity evaluation unit 160 and a search result output unit 170. The continuity evaluation unit 160 receives the collation map created and output by the collation map generation unit 150, collates the collation character string with the collated character string, and passes the collation result to the search result output unit 170. To do so, the continuity evaluation unit 160 forms paths, each containing a series of collation positions, by tracking from each collation position to neighboring collation positions in turn within the collation map (step 50); calculates a continuity value for each path (step 60); selects, as the evaluation value of each collation position, the highest continuity value among the paths passing through that position (step 70); and aggregates the evaluation values of the collation positions for each character of the collation character string (step 80). The aggregated result represents the similarity between the collation character string and the collated character string.
[0015]
The search result output unit 170 receives the collation result from the continuity evaluation unit 160, retrieves from the searched document file 140 information on the searched documents containing a collated character string judged to be similar to the collation character string, and notifies the operator (step 90).
Hereinafter, a character string matching system according to an embodiment of the present invention will be described in detail.
[0016]
FIG. 5 is a configuration diagram of the collation data generation unit 110 of the character string matching system in this example. As shown in the figure, the collation data generation unit 110 has a search sentence expansion unit 111 that receives a search sentence, refers to the search sentence expansion dictionary 120, and outputs an extended search sentence; a search sentence normalization unit 112 that receives the extended search sentence and converts it into a normalized extended search sentence; and a numeric expression replacement unit 113 that receives the normalized extended search sentence, converts its numeric expressions into a common format, and outputs the collation character string as the final collation data.
[0017]
The search sentence expansion unit 111 receives a search sentence from the operator (for example, 「文書検索の高速化」, "speeding up document search") and determines whether the search sentence expansion dictionary 120 contains a character string that can replace a character string in the search sentence (for example, 「検索」, "search"). If such a replaceable character string (for example, 「抽出」, "extraction") exists in the dictionary, the unit adds it to the search sentence as a synonym of the original character string (「検索」) and outputs an extended search sentence. If no replaceable character string exists, the input search sentence is output unchanged as the extended search sentence. The extended search sentence in this example is written 「文書｛検索｜抽出｝の高速化」, where the part of the form ｛character string a｜character string b｝ is the expanded part and indicates that character string a and character string b are synonyms.
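A minimal sketch of this expansion step follows. The function `expand_query` and the dictionary layout (a mapping from a word to its list of synonyms) are assumptions for illustration; the structure of the search sentence expansion dictionary 120 is not specified in the text.

```python
def expand_query(query, synonyms):
    """Wrap each dictionary word found in the query in ｛word｜synonym...｝ notation."""
    expanded = query
    for word, alternatives in synonyms.items():
        if word in expanded:
            group = "｛" + "｜".join([word, *alternatives]) + "｝"
            expanded = expanded.replace(word, group)
    return expanded

dictionary = {"検索": ["抽出"]}  # hypothetical dictionary content
expanded = expand_query("文書検索の高速化", dictionary)
unchanged = expand_query("高速化", dictionary)
```

This simple version does not guard against a dictionary word that appears inside an already inserted group, so it is suitable only for small, non-overlapping dictionaries.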
[0018]
The search sentence normalization unit 112 receives the extended search sentence output from the search sentence expansion unit 111, normalizes the characters in it, and outputs a normalized extended search sentence. Character normalization means processing such as converting half-width alphanumeric and kana characters to full-width characters, converting lowercase letters to uppercase, and deleting noise characters that should be ignored in a search, such as punctuation marks, line-feed control characters, and prolonged sound marks.
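The normalization can be approximated with Python's standard library, as sketched below. Note one deliberate substitution: Unicode NFKC normalization converts half-width kana to full-width but converts full-width alphanumerics to ASCII, which is the opposite direction from the text for alphanumerics. Either choice gives a single canonical width. The contents of `NOISE` are also an assumption.

```python
import unicodedata

# Assumed noise set: Japanese punctuation, line breaks, and the prolonged sound mark.
NOISE = set("、。\n\rー")

def normalize(text):
    """Unify character width, uppercase letters, and drop noise characters."""
    text = unicodedata.normalize("NFKC", text)  # one canonical width per character
    text = text.upper()                         # lowercase letters to uppercase
    return "".join(ch for ch in text if ch not in NOISE)

sample = normalize("ｱbc、ｄ１\n")
```

Applying the same function to both the search sentence and the searched documents is what matters, since the collation map only compares characters for equality.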
[0019]
The numeric expression replacement unit 113 receives the normalized extended search sentence output from the search sentence normalization unit 112 and determines whether it contains a substring quantitatively expressed by numerals. If such a substring exists, the unit converts it into a value, and creates and outputs the collation character string, as the final collation data, in which the quantitative expression in the normalized extended search sentence has been replaced by the value.
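The replacement can be sketched as a scan for runs of numeral characters followed by conversion of each run to an integer. The helper names, the supported character set (ASCII digits, full-width digits, kanji digits, and the units 十, 百, 千, 万), and the choice to write the value back as ASCII digits are assumptions; how the value is represented inside the collation character string is not stated in the text.

```python
import re

DIGITS = {c: i for i, c in enumerate("零一二三四五六七八九")}
DIGITS.update({c: i for i, c in enumerate("０１２３４５６７８９")})
DIGITS.update({c: i for i, c in enumerate("0123456789")})
UNITS = {"十": 10, "百": 100, "千": 1000}
NUMERIC_RUN = re.compile("[0-9０-９零一二三四五六七八九十百千万]+")

def to_value(run):
    """Convert a run of numeral characters to an integer (values below 10^8)."""
    total = section = digit = 0
    for ch in run:
        if ch in DIGITS:
            digit = digit * 10 + DIGITS[ch]        # positional digits accumulate
        elif ch in UNITS:
            section += (digit or 1) * UNITS[ch]    # a bare 十 means 10
            digit = 0
        else:                                      # 万 closes the current group
            total += ((section + digit) or 1) * 10000
            section = digit = 0
    return total + section + digit

def replace_numeric(text):
    """Replace every run of numeral characters with its value in ASCII digits."""
    return NUMERIC_RUN.sub(lambda m: str(to_value(m.group())), text)

replaced = replace_numeric("価格は三百五十円")
```

After this step 「三百五十」 and 「３５０」 both become the same value, so they can produce a collation position. A real system would also need to avoid rewriting kanji such as 「一」 when they occur inside ordinary words, which this sketch does not attempt.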
[0020]
FIG. 6 is a configuration diagram of the collated data generation unit 130 of the character string matching system in this example. As shown in the figure, the collated data generation unit 130 has a searched document reading unit 131 that reads all the searched documents needed for the search from the searched document file 140 and outputs them. The collated data generation unit 130 further has a searched document normalization unit 132 and a numeric expression replacement unit 133.
[0021]
The searched document normalization unit 132 receives a searched document from the searched document reading unit 131, normalizes the characters in it, and outputs a normalized searched document. Character normalization is as described for the search sentence normalization unit 112.
The numeric expression replacement unit 133 receives the normalized searched document from the searched document normalization unit 132 and, if it contains a substring quantitatively expressed by numerals, converts that substring into a value, and creates and outputs the collated character string, as the final collated data, in which the quantitative expression in the normalized searched document has been replaced by the value.
[0022]
Next, the function of the collation map generation unit 150 of the character string matching system according to an embodiment of the present invention is described in detail. As shown in FIG. 3, the collation map generation unit 150 is connected to the collation data generation unit 110 and the collated data generation unit 130, receives from them the collation character string (the collation data) and the collated character string (the collated data), respectively, and generates a collation map.
[0023]
The collation map generation unit 150 first detects the characters contained in both the collation character string and the collated character string, that is, the common characters. For example, if the collation character string is 「文書検索の高速化」 ("speeding up document search") and the collated character string is 「高速な文書の検索を行う」 ("perform a fast search of documents"), the common characters are 「高」, 「速」, 「文」, 「書」, 「の」, 「検」, and 「索」. The unit then generates a collation map composed of collation positions in which the position of a common character in the collation character string is the Y coordinate and its position in the collated character string is the X coordinate.
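For this example, the collation map can be built with a direct comparison of every character pair, as sketched below. The function name `collation_map` and the 0-based indexing are assumptions; the figures may number positions differently.

```python
def collation_map(collation, collated):
    """Return collation positions (x, y): x indexes the collated string, y the collation string."""
    return [(x, y)
            for y, a in enumerate(collation)
            for x, b in enumerate(collated)
            if a == b]

positions = collation_map("文書検索の高速化", "高速な文書の検索を行う")
```

Every pair of equal characters yields its own position, so a character that occurs several times in either string produces several collation positions. The nested loop costs time proportional to the product of the two lengths, and an index from character to positions would be the usual way to reduce that.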
[0024]
FIG. 7 is an explanatory diagram visually representing the collation map generated for the collation character string and the collated character string as an example so that the concept of the collation map can be better understood. In the figure, the point indicated by “◯” corresponds to the collation position. In this example, for simplicity, it is assumed that the collated character string does not include a synonym.
On the other hand, as already described, when the search sentence expansion unit 111 expands the collation character string “acceleration of document search” to include a synonym, in the form of the expanded search sentence “acceleration of document {search | extraction}”, a collation map as shown in FIG. 8 is obtained. In this case, the Y coordinate indicating the position of a common character in the collation character string is corrected. That is, the Y coordinates of “inspection” and “drawing”, and the Y coordinates of “search” and “out”, are corrected so as to match.
[0025]
Thus, the collation map generation unit 150 outputs, as the collation map, a Y coordinate value that represents the position of the common character in the collation character string, a Y correction value that corrects the position of the common character when a synonym is included in the collation character string, an X coordinate value that represents the position of the common character in the collated character string, and a document identification number corresponding to the collated data.
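One way to picture the Y correction for synonyms is sketched below. The description does not give the correction rule in detail, so this sketch assumes a single `{a|b}` group, assumes that alternatives share the same Y range, and assumes that the text after the group starts after the longest alternative. The function name `corrected_positions` is hypothetical.

```python
import re


def corrected_positions(expanded: str) -> dict[str, list[tuple[str, int]]]:
    """Map each synonym variant to its characters paired with corrected Y values."""
    match = re.fullmatch(r"([^{}]*)\{([^{}]*)\}([^{}]*)", expanded)
    if match is None:
        # No synonym group: the positions need no correction.
        return {expanded: [(ch, y) for y, ch in enumerate(expanded, start=1)]}
    prefix, group, suffix = match.groups()
    alternatives = group.split("|")
    width = max(len(alt) for alt in alternatives)  # assumed shared Y range
    result = {}
    for alt in alternatives:
        entries = [(ch, y) for y, ch in enumerate(prefix, start=1)]
        # Characters of every alternative start at the same corrected Y.
        entries += [(ch, len(prefix) + i) for i, ch in enumerate(alt, start=1)]
        # The suffix starts after the widest alternative in every variant.
        entries += [(ch, len(prefix) + width + i) for i, ch in enumerate(suffix, start=1)]
        result[prefix + alt + suffix] = entries
    return result
```

With the hypothetical input `"ab{cd|ef}g"`, the characters "c" and "e" share one corrected Y value, as do "d" and "f", which mirrors the matching of Y coordinates described for FIG. 8.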
The continuity evaluation unit 160 receives the collation map from the collation map generation unit 150 and evaluates the similarity between the collation character string and the collated character string for each document identification number. To do so, the continuity evaluation unit 160 first tracks the collation positions in the collation map and assigns continuity evaluation values to all the collation positions. Then, from among the continuity evaluation values at the plural collation positions that can exist for the same character in the collation character string, it selects the maximum value as the continuity evaluation value of that character. Finally, the continuity evaluation values obtained for the characters in the collation character string are totaled over the entire collation character string and normalized, and the resulting value is used as the degree of match between the collation character string and the collated character string. The degree of match is sent from the continuity evaluation unit 160 to the search result output unit 170 together with the document identification number.
[0026]
Hereinafter, the continuity evaluation will be described in detail. FIG. 9 is an explanatory diagram of route tracking for continuity evaluation performed in the character string collation system according to the embodiment of the present invention. For each collation position of the collation map, the route tracking process searches for another collation position within the effective distance from that collation position. By repeating this route tracking process, the collation positions in the collation map are classified into several routes, which may include branches. FIG. 9 shows a route from “high” to “speed”, and a route from “sentence” to “document” that includes a branch from “document” to “no” and a branch from “document” to “search” via “inspection”.
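The route tracking described above can be sketched as follows. The description does not state in which direction a link may run, so this sketch assumes that a link goes only toward larger X and larger Y (the lower right of the collation map) and measures the effective distance as a Euclidean distance. The names `find_links` and `enumerate_routes` are hypothetical.

```python
import math

Position = tuple[int, int]


def find_links(positions: list[Position], effective_distance: float) -> dict[Position, list[Position]]:
    """Link each collation position to later positions within the effective distance."""
    links: dict[Position, list[Position]] = {p: [] for p in positions}
    for x1, y1 in positions:
        for x2, y2 in positions:
            # Assumption: links run only toward the lower right of the map.
            if x2 > x1 and y2 > y1 and math.hypot(x2 - x1, y2 - y1) <= effective_distance:
                links[(x1, y1)].append((x2, y2))
    return links


def enumerate_routes(links: dict[Position, list[Position]]) -> list[list[Position]]:
    """List every route from a position with no incoming link to a dead end."""
    targets = {q for outgoing in links.values() for q in outgoing}
    routes: list[list[Position]] = []

    def walk(path: list[Position]) -> None:
        outgoing = links[path[-1]]
        if not outgoing:
            routes.append(path)
            return
        for nxt in outgoing:  # each branch yields its own route
            walk(path + [nxt])

    for start in links:
        if start not in targets:
            walk([start])
    return routes
```

For the hypothetical positions `[(3, 1), (4, 2), (6, 3), (7, 4), (5, 5), (1, 6), (2, 7)]` and an effective distance of 3.2, this yields three routes, two of which share the first link and then branch at `(4, 2)`.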
[0027]
FIG. 10 is a diagram for explaining four typical forms of continuity of the collation positions. In general, where consecutive character strings are collated, the links between collation positions are arranged in the lower-right 45-degree direction. (A) in the figure shows a perfect match, in which all the collation positions are arranged in the lower-right 45-degree direction. (B) shows a case where one character of data is missing, (C) shows a case where one character of data is replaced, and (D) shows a case where two characters of data are mixed in. By tracking these links, collation can be evaluated while maintaining continuity even when data is missing, replaced, or mixed in.
[0028]
The continuity evaluation unit 160 performs a degree-of-match calculation process following the route generation. Here, in order to weight the links between the collation positions, a character type is set for every character, and each character at a collation position is classified. In this example, the character types are assumed to be classified into two types, “Kanji” and “Kana”, as follows.
Kanji: “high”, “speed”, “sentence”, “document”, “inspection”, “search”
Kana: “no”
Next, the link weight between character types, W_t = f(t₁, t₂), is set according to the character types t₁ and t₂ of the characters before and after the link.

Further, a weight corresponding to the length of the link (the distance between the collation positions before and after the link) is set for the link. For example, the weight depending on the link length is expressed as follows.
Weight by link length (W_l):
$$W_l = g(x_1, y_1, x_2, y_2) = \frac{1}{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
Finally, the weight between character types (W_t) and the weight by link length (W_l) are multiplied to obtain the following evaluation value for one link.
Evaluation value for one link:
$$v = W_t \cdot W_l = f(t_1, t_2) \cdot g(x_1, y_1, x_2, y_2)$$
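The evaluation value for one link can be sketched as follows. The length weight follows the expression given above. The character-type weights in `TYPE_WEIGHT` are purely hypothetical placeholders, since the actual values of f(t₁, t₂) are not reproduced here, and the function names are likewise assumptions.

```python
KANJI, KANA = "kanji", "kana"

# Hypothetical values for f(t1, t2); the real table is not given in the text.
TYPE_WEIGHT = {
    (KANJI, KANJI): 1.0,
    (KANJI, KANA): 0.5,
    (KANA, KANJI): 0.5,
    (KANA, KANA): 0.2,
}


def length_weight(p1: tuple[int, int], p2: tuple[int, int]) -> float:
    """W_l = 1 / ((x2 - x1)^2 + (y2 - y1)^2) for two distinct collation positions."""
    (x1, y1), (x2, y2) = p1, p2
    return 1.0 / ((x2 - x1) ** 2 + (y2 - y1) ** 2)


def link_value(t1: str, t2: str, p1: tuple[int, int], p2: tuple[int, int]) -> float:
    """v = W_t * W_l for the link from p1 (type t1) to p2 (type t2)."""
    return TYPE_WEIGHT[(t1, t2)] * length_weight(p1, p2)
```

A link between diagonally adjacent positions has a squared length of 2, so its length weight is 0.5; longer links, as arise with missing or mixed-in characters, are weighted progressively lower.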
Next, the continuity evaluation unit 160 aggregates the link evaluation values for all the links on the route acquired by the route tracking in the matching map, and obtains the evaluation value of one entire route. The evaluation value V of this one path can be calculated according to the following equation, for example.
[0029]
[Expression 1]

Here, k is the index of a link on the route of interest, n is the total number of links on the route of interest plus 1, and v_k represents the evaluation value of each link on the route of interest.
The evaluation value V of one route obtained in this way is set as the evaluation value V_xy of each collation position on the route of interest. Further, when the route includes branches, for example, the route through the branch having the highest evaluation value among the route evaluation values calculated for the respective branches can be selected as the effective one. In this way, by obtaining the evaluation value V of one route for all the routes generated in the collation map, the evaluation value V_xy is obtained for all the collation positions in the collation map.
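The assignment of route evaluation values to collation positions can be sketched as follows. The sketch assumes that the evaluation value V of a route is the plain sum of its link evaluation values, which is how the claims describe it, and for brevity it treats the character-type weight as 1 so that only the length weight contributes. The names `route_value` and `position_values` are hypothetical.

```python
Position = tuple[int, int]


def length_weight(p1: Position, p2: Position) -> float:
    """Link weight by length; the character-type weight is assumed to be 1 here."""
    (x1, y1), (x2, y2) = p1, p2
    return 1.0 / ((x2 - x1) ** 2 + (y2 - y1) ** 2)


def route_value(route: list[Position]) -> float:
    """Assumed form: V is the sum of the evaluation values of all links on the route."""
    return sum(length_weight(a, b) for a, b in zip(route, route[1:]))


def position_values(routes: list[list[Position]]) -> dict[Position, float]:
    """Give each position the highest V among the routes that pass through it."""
    values: dict[Position, float] = {}
    for route in routes:
        v = route_value(route)
        for position in route:
            values[position] = max(values.get(position, 0.0), v)
    return values
```

Taking the maximum per position is one way to realize the selection of the branch with the highest evaluation value: a position shared by two branches keeps the value of the better route.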
[0030]
Next, an evaluation value for each character in the collation character string is obtained. For example, as shown in FIG. 7, when there is at most one collation position corresponding to each character in the collation character string, the evaluation value of the corresponding collation position is set as the evaluation value of each character for which a collation position exists, and the evaluation values of the other characters in the collation character string are set to zero. When there are two or more collation positions corresponding to a character in the collation character string, the maximum among the evaluation values of the corresponding collation positions is set as the evaluation value of that character. Thus, continuity evaluation values can be obtained for all characters in the collation character string.
[0031]
Finally, in order to obtain the degree of match with the collated character string for the collation character string as a whole, the continuity evaluation values for all characters in the collation character string are totaled to obtain a total value. The total value V_total of the continuity evaluation values is calculated according to the following equation, for example.
[0032]
[Expression 2]

The degree of match with the collated character string for the collation character string as a whole is represented, for example, by the total value V_total of the continuity evaluation values divided by V_equal, the total value of the continuity evaluation values in the case of a complete match.
$$\text{Degree of match} = \frac{V_{total}}{V_{equal}}$$
By expressing the degree of match in this way, the degree of match takes a maximum value of 1.0 in the case of a complete match. The degree of match obtained in this way is sent to the next search result output unit 170 as a collation result together with the document identification number.
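The per-character selection, totaling, and normalization described above can be sketched as follows. The sketch assumes that V_total is the plain sum of the per-character values and that V_equal is supplied by the caller. The function name `degree_of_match` and the sample values are hypothetical.

```python
def degree_of_match(
    position_values: dict[tuple[int, int], float],
    collation_length: int,
    v_equal: float,
) -> float:
    """Return V_total / V_equal for one collated character string."""
    # Characters without any collation position keep an evaluation value of zero.
    per_character = [0.0] * collation_length
    for (_x, y), value in position_values.items():
        # Several positions may share the same Y; keep the maximum value.
        per_character[y - 1] = max(per_character[y - 1], value)
    v_total = sum(per_character)
    return v_total / v_equal


# Hypothetical map: the second character has two candidate positions,
# and the third character has no collation position at all.
example = degree_of_match({(1, 1): 1.0, (2, 2): 1.0, (5, 2): 0.5}, 3, 3.0)
```

In the example, the per-character values are 1.0, 1.0 and 0.0, so the degree of match is two thirds of the value for a complete match.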
[0033]
FIG. 11 is a block diagram of the search result output unit of the character string matching system according to an embodiment of the present invention. As shown in the figure, the search result output unit 170 includes a matching result conversion unit 171, a search result display unit 172, a search result selection unit 173, and a document display unit 174.
The collation result conversion unit 171 receives the degree of match and the document identification number from the continuity evaluation unit 160 as a collation result, reads the document heading, summary information, and the like corresponding to the collation result from the searched document file 140 based on the document identification number, rearranges the information on the collation result documents in order of degree of match, and outputs it as a search result.
[0034]
The search result display unit 172 inputs the search result from the collation result conversion unit 171, displays the search result on a display device such as a display, and passes the search result to the search result selection unit 173 in the next stage.
The search result selection unit 173 receives the search result from the search result display unit 172 and accepts an instruction from the operator in accordance with the search result display. If the operator specifies a document to be read, the specified document is read from the searched document file 140 and output as a selected document.
[0035]
The document display unit 174 receives the selected document output from the search result selection unit 173 and displays the read selected document on a display device such as a display.
With the configuration and operation described above with reference to the drawings, the character string collation system according to an embodiment of the present invention collates a search sentence input by an operator with the documents stored in the searched document file, and can present to the operator a document including a searched sentence similar to the search sentence.
[0036]
Next, the processing performed in the character string collation system according to the embodiment of the present invention when the search sentence expansion unit 111 outputs an expanded search sentence will be described. In this example, consider a case where the synonym “extraction” exists for the character string “search” in the search sentence “acceleration of document search”. As already described, when the collation data includes synonyms, it is interpreted that a plurality of collation data exist, namely “acceleration of document search” and “acceleration of document extraction”. Also, by using the
Synonym data regular expression: acceleration of document {search | extraction}
an expanded search sentence in which the synonyms are enumerated within the collation data is created. Thus, when the collation data contains synonyms, the collation data is processed as if one of the synonyms at the same position had been selected. FIG. 12 is an explanatory diagram of route tracking in the collation map including the synonyms shown in FIG. 8. The effective distance for route tracking is calculated in consideration of a distance correction value that represents the difference between the distance between collation positions on a route in the actually generated collation map and the theoretical distance between those collation positions in the collation map that would be generated if one synonym were selected.
[0037]
Finally, the processing performed when the numerical expression replacement unit 113 or 133 of the character string collation system according to one embodiment of the present invention replaces a quantitative expression in the collation character string or the collated character string with a numerical value will be described. FIG. 13 is a flowchart of a processing procedure for similar quantitative character collation.
First, a partial character string quantitatively expressed by a numerical value is extracted from a collation character string or a collated character string (step 100). Secondly, the extracted partial character string is converted into a value (step 101). Third, the degree of coincidence of numerical values is calculated based on the converted value (step 102).
[0038]
Here, the extraction of the partial character string quantitatively expressed by the numerical value is performed by detecting and extracting a portion where the numerical expression character appears continuously in the character string. For example, the following characters are detected as numerical expression characters.
1 2 3 4 5 6 7 8 9 0
1 2 3 4 5 6 7 8 9 0
110 million ...
The degree of match can be calculated according to the following equation based on the value (Vs) obtained from the collation character string and the value (Vd) obtained from the collated character string.
[0039]
[Expression 3]

FIG. 14 is an explanatory diagram of the similar quantitative character matching process. In the figure, a collation character string and a collated character string, a partial character string by a quantitative expression, a numerical value converted from the partial character string, and a matching degree of the converted numerical value are shown.
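Steps 100 to 102 can be sketched as follows. Only runs of ASCII digits are handled in this sketch; full-width digits and numerals written in other characters would need an additional conversion step. The ratio used in `numeric_match` is an assumption chosen for illustration, not the expression of the embodiment, and both function names are hypothetical.

```python
import re


def extract_values(text: str) -> list[int]:
    """Steps 100 and 101: extract runs of digits and convert them into values."""
    return [int(run) for run in re.findall(r"[0-9]+", text)]


def numeric_match(vs: int, vd: int) -> float:
    """Step 102 (assumed form): ratio of the smaller value to the larger one.

    vs is the value from the collation character string, vd the value from
    the collated character string; both are non-negative here.
    """
    if vs == vd:
        return 1.0  # also covers the case where both values are zero
    return min(vs, vd) / max(vs, vd)
```

Under this assumed ratio, equal values give 1.0 and the degree of match falls toward 0 as the two values diverge, so that, for example, 64 and 128 are treated as closer than 64 and 1024.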
Therefore, according to one embodiment of the present invention, in a character string collation system using a computer, the character strings can be collated even when partially matching character strings are dispersed because a partial character string in the collation character string or the collated character string is missing, has been replaced with another character string, or has had another character string mixed into it.
[0040]
The configuration of the character string collation system according to the embodiment of the present invention is not limited to the example described above. Each component of the character string collation system can also be constructed as software (a program), recorded in a disk device or the like, and installed as necessary in a computer of the character string collation system to perform character string collation. Furthermore, the constructed program can be stored in a portable recording medium such as a floppy disk or a CD-ROM and used widely wherever such a character string collation system is used.
[0041]
The present invention is not limited to the above-described embodiments, and various modifications and applications can be made within the scope of the claims.
[0042]
[Effect of the invention]
As described above, according to the present invention, when pattern matching is performed, the collation positions of the patterns are tracked, so that the continuity of the patterns can be evaluated even if the collation positions are separated. Therefore, collation can be performed even when partially matching patterns are dispersed between the collation pattern and the collated pattern because part of either pattern has been lost, replaced with another pattern, or mixed with another pattern. Consequently, according to the present invention, even if the operator is not familiar with the contents of the pattern to be collated, collation without omission is realized, and the operator's burden is reduced.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is an operation flowchart of the similar information matching method of the present invention.
FIG. 3 is a schematic configuration diagram of a character string matching system according to an embodiment of the present invention.
FIG. 4 is an operation flowchart of a character string matching system according to an embodiment of the present invention.
FIG. 5 is a configuration diagram of a collation data generation unit of a character string collation system according to an embodiment of the present invention.
FIG. 6 is a configuration diagram of a collated data generation unit of the character string collating system according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram visually representing a matching map according to an embodiment of the present invention.
FIG. 8 is an explanatory diagram of a matching map when synonyms are included.
FIG. 9 is an explanatory diagram of path tracking for continuity evaluation performed in the character string matching system according to the embodiment of the present invention.
FIG. 10 is a diagram illustrating continuity of collation positions.
FIG. 11 is a configuration diagram of a search result output unit of a character string matching system according to an embodiment of the present invention.
FIG. 12 is an explanatory diagram of route tracking in a matching map including the synonyms shown in FIG. 8.
FIG. 13 is a flowchart of a processing procedure for similar quantitative character collation.
FIG. 14 is an explanatory diagram of similar quantitative character collation processing.
[Explanation of symbols]
1 Similar information matching device
10 pattern generation means
20 collation map creation means
30 Matching map
40 Continuity evaluation means
50 pattern matching means

Claims

In the similar information collating apparatus for determining the similarity of character strings,
A search data generation unit that generates a search character string from the input search sentence;
A collated data generation unit that generates a searched character string from a storage device in which the document is stored;
A collation map generating unit that generates a collation map that associates each character of the search character string and each character of the search target character string with the two-dimensional array, and obtains coordinates of a collation position where the associated character matches;
A continuity evaluation unit that generates links each connecting collation positions within an effective distance starting from a predetermined collation position, generates a path connecting a series of the links in which the collation position at the end of one link matches the collation position at the start of the next, determines an evaluation value Vk for each link, weighted according to the combination of character types corresponding to the start and end of the link and weighted according to the length of the link, sets the value obtained by adding the evaluation values Vk of all links on the path as the evaluation value Vxy of each collation character forming the path, and sets the value obtained by adding the evaluation values Vxy over the search character string as an evaluation value Vt;
A pattern matching unit that determines the degree of matching between the search character string and the character string to be searched based on the magnitude of the evaluation value Vt;
A similar information collating apparatus characterized by comprising:

The pattern matching unit calculates the degree of match by dividing the evaluation value Vt by the evaluation value when the search character string and the searched character string completely match,
The similar information collating apparatus according to claim 1, wherein:

The pattern matching unit selects, for each collation position, the highest evaluation value V among the evaluation values V obtained for the paths passing through that collation position, and thereby determines the evaluation value Vxy of each collation character,
The similar information collating apparatus according to claim 1 or 2, characterized in that:

A search sentence expansion unit that generates characters or words synonymous with at least some characters or words of the input search character string;
The collation map generation unit replaces a search character string with a synonymous character or word, generates the collation map, obtains the coordinates of the collation position where the associated character matches,
The continuity evaluation unit processes the search character string before replacement and the search character string after replacement in parallel to determine an evaluation value Vt for each of the search character string before replacement and the search character string after replacement.
The similar information collating apparatus according to any one of claims 1 to 3.

When the search character string includes consecutive numerical values, the search data generation unit takes out consecutive numerical values from the search character string and the searched character string,
The pattern matching unit calculates the matching degree between the search character string and the searched character string based on the relative value of each numerical value.
The similar information collating apparatus according to claim 1, wherein:

In a similar information collating method in a similar information collating apparatus having a search data generation unit, a collated data generation unit, a collation map generation unit, a continuity evaluation unit, and a pattern matching unit,
The search data generation unit generating a search character string from the input search sentence;
The collated data generation unit generating a searched character string from a storage device in which the document is stored;
The collation map generation unit generating a collation map that associates each character of the search character string and each character of the searched character string with a two-dimensional array, and obtaining the coordinates of the collation positions where the associated characters match;
The continuity evaluation unit generating links each connecting collation positions within an effective distance starting from a predetermined collation position, generating a path connecting a series of the links in which the collation position at the end of one link matches the collation position at the start of the next, determining an evaluation value Vk for each link, weighted according to the combination of character types corresponding to the start and end of the link and weighted according to the length of the link, setting the value obtained by adding the evaluation values Vk of all links on the path as the evaluation value Vxy of each collation character forming the path, and setting the value obtained by adding the evaluation values Vxy over the search character string as an evaluation value Vt;
The pattern matching unit determining the degree of matching between the search character string and the character string to be searched based on the magnitude of the evaluation value Vt;
A similar information collating method characterized by comprising:

The pattern matching unit includes a step of calculating the degree of match by dividing the evaluation value Vt by the evaluation value when the search character string and the character string to be searched for completely match each other. Similar information matching method.

The pattern matching unit selecting, for each collation position, the highest evaluation value V among the evaluation values V obtained for the paths passing through that collation position, and thereby determining the evaluation value Vxy of each collation character,
The similar information collating method according to claim 6 or 7, wherein:

A search sentence expansion unit generating characters or words synonymous with at least some characters or words of the input search character string;
The collation map generation unit replaces a search character string with a synonymous character or word, generates the collation map, and obtains the coordinates of the collation position where the associated character matches;
The continuity evaluation unit processes the search character string before replacement and the search character string after replacement in parallel to determine an evaluation value Vt for each of the search character string before replacement and the search character string after replacement. The similar information matching method according to claim 6, wherein:

The search data generation unit, when the search character string includes consecutive numerical values, taking out the consecutive numerical values from the search character string and the searched character string;
The pattern matching unit calculating the degree of match between the search character string and the searched character string based on the relative value of each numerical value;
The similar information collating method according to claim 6, wherein:

A computer-readable storage medium storing a program for causing a computer to execute the similar information collating method according to claim 6.