JP3603395B2

JP3603395B2 - Matching device and matching method

Info

Publication number: JP3603395B2
Application number: JP19034395A
Authority: JP
Inventors: 忠信宮内; 良寛上田
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1995-07-26
Filing date: 1995-07-26
Publication date: 2004-12-22
Anticipated expiration: 2015-07-26
Also published as: JPH0944507A

Description

【０００１】
【発明の属する技術分野】
本発明は、データのマッチングの度合いを算出するマッチング装置およびマッチング方法に関連するものであり、特に、データ間の階層関係に基づくマッチング装置およびマッチング方法に関するものである。
【０００２】
【従来の技術】
従来より、テキスト検索や自動分類の研究が活発に行なわれている。こうした分野において、シソーラスと呼ばれる単語間の上位／下位関係および類義語を定義した辞書の重要性がますます高まっている。
【０００３】
テキスト検索においては、ユーザの入力した表現が検索対象中の表現と一致しないことがしばしば発生する。このため、検索された内容のヒット率を確保するために、類義語や上位／下位語を含めた検索が必要である。そこで、シソーラスを用いることにより検索要求の単語を展開する手法などがよく用いられる。
【０００４】
一般的には、検索時に検索要求に基づきシソーラスの階層をたどり、得られた類義語または上位／下位語のそれぞれを用いて検索がなされる。このとき、検索のたびに毎回シソーラスの階層をたどったのでは検索速度が低下するため、例えば、特開平２−２８０２７４号公報の「データベース検索システムの包括検索方式」で述べられているように、あらかじめシソーラスの上位／下位などを含めた包括インデックスを作成し、検索時の速度低下を防ごうとする技術などが提案されている。
【０００５】
しかしながら、このような包括インデックスを保持したとしても、包括インデックスに含まれる複数のデータベースキーに基づいた検索を行なう必要があるため、依然として単純一致検索に比べて速度の低下は発生する。また、こうした包括インデックス情報を予め作成しておく必要があるため、データ量やデータベース構築時のコストの増加を招くという問題もある。
【０００６】
例えば、検索キーとキーワード間でシソーラスを考慮したマッチングを行ない、マッチング度合いによって検索結果とするか否かを判定することも考えられるが、従来のマッチング手法では、マッチングの計算に時間がかかり、検索で用いることができる技術ではなかった。
【０００７】
一方、テキストの自動分類と呼ばれる分野では、なんらかの基準を用いてテキスト間の意味的な距離を用いることにより、テキストの分類が行なわれる。このとき、テキスト間の距離を求めるためにもシソーラスが用いられる。
【０００８】
例えば、情報処理，Ｖｏｌ．３６，Ｎｏ．２，１９９５．２，飯田，「人工知能におけるスーパーコンピューティング」，ｐｐ．１６４−１６８においては、句同士の間の距離を求める技術の一環として、概念階層を用いた単語間の距離計算方法が示されており、入力と用例間の意味距離が、あらかじめ付与された１０進シソーラスコードの照合により計算されている。
【０００９】
しかしながら、この計算法は共通の上位ノードへの距離のみによる計算であるため、木構造のリーフ（葉）であり、同一階層に並ぶ単語間の距離しか計算できない。単語はすべて同一レベルのリーフとして表現されるわけではないので、階層の途中の概念やリーフまでの階層数が異なる場合には適用できない。したがって、単語間の距離の比較においては、このような単純な方法では依然充分ではない。また、並列計算機を想定しているため、効率的な距離計算には連想メモリなど特殊な処理を必要とし、現在のところ一般的でない。
【００１０】
このように、従来の技術においては、シソーラスにおける単語間のマッチングを効率的に行なうには、依然不十分な技術しか存在しなかった。
【００１１】
【発明が解決しようとする課題】
本発明は、上述した事情に鑑みてなされたものであり、特に、シソーラスにおける単語間のマッチングを、高速かつ少ないデータサイズで効率的に行なうマッチング装置およびマッチング方法を提供することを目的とするものである。
【００１２】
【課題を解決するための手段】
本発明は、データ間の階層関係に基づくマッチングを行なうマッチング装置において、階層関係をもつデータ群の各データを各階層の同じ親を有する各データに対して順次割り振った値とともに保持し入力されたデータまで階層をたどったときのそれぞれの階層のデータに割り振られた値を並べたコードに変換する階層データコード化手段と、与えられた少なくとも二つのデータについての階層関係におけるデータ間の近さを示す概念距離として前記二つのデータから前記階層データコード化手段より得られたコードの排他的論理和と各階層の値のマスクコードとの論理積が非ゼロになる最大のマスクコードから該マスクコードに対応づけられている値を前記概念距離として算出する距離算出手段を有することを特徴とするものである。
また本発明は、データ間の階層関係に基づくマッチングを行なうマッチング方法において、階層関係をもつデータ群の各データを各階層の同じ親を有する各データに対して順次割り振った値とともに階層データコード化手段に保持しており、与えられた少なくとも二つのデータについて前記階層データコード化手段により該データまで階層をたどったときのそれぞれの階層のデータに割り振られた値を並べたコードに変換し、得られたコードの排他的論理和と各階層の値のマスクコードとの論理積が非ゼロになる最大のマスクコードから該マスクコードに対応づけられている値を距離算出手段で算出し、該値を階層関係におけるデータ間の近さを示す概念距離とすることを特徴とするものである。
【００１３】
さらに、請求項２および請求項５に記載の発明のように、与えられた少なくとも二つのデータについて前記階層データコード化手段より得られたコードと各階層の値のマスクコードとの論理積が非ゼロになる最小のマスクコードから各データの階層レベルを求めて前記データ間の階層レベルの差を算出する階層差算出手段を設けることができる。
【００１４】
さらに、請求項３および請求項６に記載の発明のように、前記距離算出手段および前記階層差算出手段の結果に基づき優先度を算出する優先度算出手段を設けることができる。
【００１５】
【作用】
本発明によれば、階層関係をもつデータ群の各データに対して、階層関係に応じたコードを保持させておく。そして、与えられた少なくとも二つのデータについて、各データまで階層をたどったときのそれぞれの階層のデータに割り振られた値を並べたコードに変換し、得られたコードの排他的論理和と各階層の値のマスクコードとの論理積が非ゼロになる最大のマスクコードから該マスクコードに対応づけられている値を算出し、その値を階層関係におけるデータ間の近さを示す概念距離とする。これにより、従来のように算術演算など、時間のかかる演算を行なうことなく距離計算を行なうことができ、データ間の階層関係に基づくマッチングが高速かつ容易に実現可能となる。また、請求項２および請求項３に記載の発明のように、用途に応じ、階層差算出手段や、さらに優先度算出手段を設けるなど、種々の構成をとることで、階層レベル差や優先度といったよりきめ細かな比較が可能となる。
【００１６】
【発明の実施の態様】
図１は、本発明のマッチング装置の１つの実施の態様を示す概略構成図である。図中、１は入力部、２は階層関係コード化部、３はマッチング部、４は出力部である。入力部１は、比較すべきデータを与える。階層関係コード化部２は、入力に応じて、あらかじめ階層間の位置にしたがって付与されたコードを返す。マッチング部３は、入力部１から与えられた複数のデータを、階層関係コード化部２から得られるコードを用いてビット論理演算でマッチングを行なう。出力部４は、マッチング部３により得られた結果を出力する。
【００１７】
図２は、本発明のマッチング装置の１つの実施の態様を類似語検索システムに適用した場合の一例を示すブロック構成図である。図中、１０は検索要求入力部、１１は端末、１２はＯＣＲ、１３は電話および音声認識部、１４は記憶装置、１５は赤外線・無線受信部、２０はシソーラスコード化部、２１はシソーラス、３０はマッチング部、３１は距離算出部、３２は階層差算出部、３３は優先度算出部、３４は出力指示部、４０は出力部、４１は端末、４２はファクシミリやプリンタ、４３は電話やポケベル、４４は赤外線・無線発信部、５０はデータベース部、５１はデータベースである。検索要求入力部１０は、図１に示した入力部１に対応し、同様に、シソーラスコード化部２０は階層関係コード化部２に、マッチング部３０はマッチング部３に、出力部４０は出力部４にそれぞれ対応する。データベース部５０は、検索対象となる各種の文書を電子的に記憶する部分であり、文書はデータベース５１に記憶される。
【００１８】
検索要求入力部１０は、例えば、端末１１、ＯＣＲ１２、音声を入力するマイクなどを具備した電話および音声認識部１３、メモリやディスク、テープ等の記憶装置１４、携帯情報機器などからの赤外線や無線を受信する赤外線・無線受信部１５などから構成されている。もちろん、これらのうちの一部でもよいし、これ以外の入力装置を用いるようにしてもよい。
【００１９】
出力部４０は、端末４１のディスプレイ、ファクシミリ／プリンタ４２、音声合成によりスピーカから出力する電話やポケベル４３、携帯情報機器へ赤外線や無線を用いて情報を伝送する赤外線・無線発信部４４などから構成される。もちろん、これらの一部で構成してもよいし、これ以外の出力装置を用いるようにしてもよく、いったんネットワーク等に接続された記憶装置に蓄えるように構成してもよい。
【００２０】
シソーラスコード化部２０は、入力された単語に応じて、シソーラス２１の階層構造に応じて保持されたコードに変換する部分である。
【００２１】
マッチング部３０は、入力された単語とデータベース５１中の文書のキーワードを比較し、検索要求に一致あるいは類似したキーワードを有する文書を出力指示する部分である。距離算出部３１は、シソーラスコード化部２０で変換された検索要求と文書のキーワードのコード同士を論理演算し、階層関係における距離を算出する部分である。階層差算出部３２は、シソーラスコード化部２０で変換された検索要求と文書のキーワードのコード同士を論理演算し、階層レベルの差を算出する部分である。優先度算出部３３は、距離算出部および階層差算出部の結果に基づき、マッチング結果の優先度を算出する部分である。出力指示部３４は、優先度算出部３３で得られた結果に基づき、データベース部５０から検索要求に一致または類似したキーワードを有する文書を取り出し、出力部４０に出力を指示する。
【００２２】
図３は、類似語検索システムの一例における全体の動作の一例を示すフローチャート、図４は、マッチングアルゴリズムの一例の概略を示すフローチャートである。まず、Ｓ６１において、検索要求入力部１０からユーザの検索要求ＲＱが入力されると、Ｓ６２において、シソーラスコード化部２０でシソーラスコードへの変換が行なわれ、検索要求のシソーラスコードＣ１が得られる。
【００２３】
続いて、Ｓ６３において、検索対象となるデータベース５１より一つの文書ＤＱが取り出される。さらに、Ｓ６４において、この文書のキーワードＫＷを順次取り出す。キーワードは、検索時にその場で文書中から抽出してもよいし、検索の高速性が必要であれば、あらかじめ文書に手動または自動で付与されたものを用いればよい。Ｓ６５において、キーワードＫＷは、検索要求ＲＱと同様に、シソーラスコード化部２０でコード変換され、キーワードのシソーラスコードＣ２が得られる。
【００２４】
検索要求のシソーラスコードＣ１とキーワードのシソーラスコードＣ２がともに得られると、Ｓ６６において、コードのマッチングが行なわれる。まず、階層差算出部３２は、Ｓ８１において、検索要求のシソーラスコードＣ１の階層レベルを論理演算によって求め、検索要求の階層レベルＬ１とする。同様に、Ｓ８２において、キーワードのシソーラスコードＣ２の階層レベルを論理演算によって求め、キーワードの階層レベルＬ２とする。Ｓ８３において、検索要求の階層レベルＬ１とキーワードの階層レベルＬ２の差を求め、階層レベル差ＬＤとする。続いて、距離算出部３１は、Ｓ８４において、検索要求のシソーラスコードＣ１とキーワードのシソーラスコードＣ２のビット排他論理和を計算し、概念距離ＣＤとする。さらに、優先度算出部３３は、Ｓ８５において、抽出された階層レベル差ＬＤと概念距離ＣＤに応じて優先度ＰＬを算出する。
【００２５】
このようにしてコードのマッチングが行なわれると、得られた階層レベル差ＬＤ、概念距離ＣＤ、優先度ＰＬなどが基準を満たしているか否かを、Ｓ６７でチェックする。Ｓ６８で基準を満たしていると判定されれば、出力指示部３４は、Ｓ６９で結果バッファＢＦに文書ＤＱを出力する。もちろん、出力部４０に直接出力指示を行なってもよい。
【００２６】
Ｓ７０において、現在処理中の文書に付与されているキーワードがすべて処理されたか否かを判定し、最後のキーワードの処理が終了するまで、Ｓ６４〜Ｓ７０の処理を繰り返し行なう。１つの文書について、すべてのキーワードの処理が終了すると、Ｓ７１において、すべての文書について処理を行なったか否かを判定し、最後の文書の処理が終了するまで、Ｓ６３〜Ｓ７０の処理を繰り返し行なう。
【００２７】
このような処理を行なうことによって、検索要求に一致あるいは類似したキーワードを有する文書が検索され、出力されることになる。上述のように、マッチング処理では階層差や概念距離を、論理演算を用いて行なっているので、高速にマッチング処理を行なうことができる。そのため、従来のようにシソーラス階層をたどったり、包括インデックスを用いることなどもなく、少ないデータサイズで高速な類似語検索を実現することができる。
【００２８】
以下、具体例を用いて、上述の動作の一例を説明する。図５は、シソーラスの構成の具体例を示す説明図である。例えば、「自家用車」という単語は、「自動車」の下位概念であり、「スポーツカー」、「セダン」、「ＲＶ車」などの上位概念を表わす。それぞれの単語には、図中の角カッコ“［］”内に示すように、各階層の同じ親を有する各単語について、レベル１から始まるコードを順次割り振る。ここでは一つの階層につき２進４ビットを用い、１６進数で示している。
【００２９】
実際に各単語に割り当てられるコードの例の一部も、図中の丸カッコ“（）”内に示している。例えば、「自家用車」という単語においては、
具体物［２］→機械［３］→乗物［Ａ］→車両［２］→自動車［４］→自家用車［４］
という階層をたどり、さらに１レベルの下位データを持つため、これを［０］と表わし、「２３Ａ２４４０Ｈ」という４バイトの１６進コードを割り当てる。ここで、末尾の‘Ｈ’は、１６進数であることを示しており、１６進の「２３Ａ２４４０」という値を示している。同様に、「自家用車」の下位に位置し、末端のデータである「セダン」という単語には、「２３Ａ２４４２Ｈ」というコードを割り当てる。ここでは各ノードが１５個以下の８階層としたため、４ビット７レベルのコードを用いたが、もちろんこの割り当ては自由であり、シソーラスの構造に応じて設計すればよい。なお、この割り当ては、シソーラスの階層構造をトラバースすることにより容易に自動的化できるため、人手によるコード付与などの手間を排除することが可能である。
【００３０】
図３のＳ６１で入力された検索要求ＲＱは、Ｓ６２において、シソーラスコードＣ１へ変換される。図６は、シソーラスコードへの変換の一例を示す概念図である。ここでは、データ構造として特開平５−２８１９４号公報の「データアクセス方式」に述べられているような、ハッシングを用いた例を示す。入力されたデータは、ハッシュ関数によって得られたハッシュ値をもとにハッシュテーブルを参照する。ここでは、ハッシュテーブルにはチェインインデックスへのポインタが格納されており、このポインタをもとにチェインインデックスを参照し、このチェインインデックスで衝突をチェックした後、変換データが参照される。変換データには、単語に対応するコードが格納されており、入力データに対応するコードが得られる。
【００３１】
例えば、「セダン」というデータが入力されると、ハッシュ関数によりハッシュ値が得られ、これをもとにハッシュテーブルが参照され、さらにチェインインデックにより衝突がチェックされた後、シソーラスコードＣ１として「２３Ａ２４４２Ｈ」が得られる。また、図６に示したように、「オートバイ」と「バイク」などの同義語は、入力データとしては異なるが、変換により同一化されたコードとなる。このように、シソーラスにおける階層関係をたどる必要なしに、直接、データからコードを得ることができる。
【００３２】
なお、ここではハッシュ法を用いた例について説明したが、ＢＴｒｅｅなど、他のデータ構造を用いてもよいことはもちろんである。
【００３３】
このようにして、入力された検索要求ＲＱは、ハッシュ関数により変換され、ハッシュテーブルとチェインインデックスをたどることにより、シソーラスコードＣ１に変換される。続いて、Ｓ６３で検索対象となるデータベース５１より文書が１つ取り出され、さらにＳ６４でこの文書のキーワードが取り出される。ここでは、キーワードＫＷとして「パソコン」が取り出されたとする。キーワードＫＷは、Ｓ６５で上述と同様の方法によりシソーラスコード化部２０でコード変換される。たとえばキーワードが「パソコン」の場合には、図５に示すように、「２３ＥＣ２２０Ｈ」というシソーラスコードに変換される。
【００３４】
検索要求のシソーラスコードＣ１と、キーワードのシソーラスコードＣ２がともに得られると、Ｓ６６でコードのマッチングが行なわれる。マッチングは、大きく階層差算出部３２におけるレベル差の計算と、距離算出部３１における概念距離の計算に分けられる。
【００３５】
まず、階層差算出部３２は、Ｓ８１，Ｓ８２において、検索要求のシソーラスコードＣ１とキーワードのシソーラスコードＣ２の階層レベルを計算する。具体的には、次に示すような４ビット単位のマスクコードを用意し、検索要求のシソーラスコードＣ１およびキーワードのシソーラスコードＣ２とのＡＮＤを順次とり、非ゼロになるまで繰り返すことでレベルが決定される。なお、レベル１までのＡＮＤがすべて０であった場合、すなわちＣ１またはＣ２のコードが０の場合、それはシソーラスの根、この場合「概念」であるからレベルは０となる。
レベル７：００００００ＦＨ
レベル６：０００００Ｆ０Ｈ
レベル５：００００Ｆ００Ｈ
レベル４：０００Ｆ０００Ｈ
レベル３：００Ｆ００００Ｈ
レベル２：０Ｆ０００００Ｈ
レベル１：Ｆ００００００Ｈ
【００３６】
「セダン」に関するコードは、上述のように「２３Ａ２４４２Ｈ」であったから、まず００００００ＦＨとのＡＮＤを取ると、
００００００ＦＨＡＮＤ２３Ａ２４４２Ｈ＝２Ｈ
となり、非ゼロであるのでレベル７の単語であることがわかり、終了する。
【００３７】
「パソコン」に関するコードは、上述のように「２３ＥＣ２２０Ｈ」であったから、まず００００００ＦＨとのＡＮＤを取ると、
００００００ＦＨＡＮＤ２３ＥＣ２２０Ｈ＝０Ｈ
となり、ゼロであるので、さらに０００００Ｆ０ＨとのＡＮＤを取る。すると、
０００００Ｆ０ＨＡＮＤ２３ＥＣ２２０Ｈ＝２０Ｈ
となり、非ゼロであるので、レベル６の単語であることがわかり、終了する。
【００３８】
ここでは、ＡＮＤによるマスクを用いた方法を述べたが、レベルに応じたビット数、例えば、レベル７の場合４ビットだけ右シフトし、下位桁あふれフラグをチェックするなどの方法でも、もちろん論理的に同一になる。いずれにせよ、これらのビット単位の論理演算は、一般に非常に高速に実行可能であるため、処理の高速化を実現することが可能である。
【００３９】
こうして得られた階層レベルの差の絶対値をＳ８３で計算し、階層レベル差ＬＤとする。ここでは｜７−６｜＝１となる。
【００４０】
続いて、距離算出部３１において、単語間の概念距離が計算される。概念距離は、コード間のビット排他的論理和（ＸＯＲ）に基づいて求められる。ここでは、
２３Ａ２４４２ＨＸＯＲ２３ＥＣ２２０Ｈ＝４Ｅ６６２Ｈ
となる。こうして得られる値は、単語（概念）間のいわば近さを示す。概念距離はここで得られた値の範囲に応じて付与する。この付与には、上述の階層レベルチェックと同様に、次に示すマスクコードとのＡＮＤで非ゼロになる最大のものとして、容易に求められる。なお、ＸＯＲの結果が全ビット０の場合、同一のコードであるから概念距離は０となる。
概念距離７：Ｆ００００００Ｈ
概念距離６：０Ｆ０００００Ｈ
概念距離５：００Ｆ００００Ｈ
概念距離４：０００Ｆ０００Ｈ
概念距離３：００００Ｆ００Ｈ
概念距離２：０００００Ｆ０Ｈ
概念距離１：００００００ＦＨ
【００４１】
ここでは、ビット排他論理和の値が「４Ｅ６６２Ｈ」であるので、
Ｆ００００００ＨＡＮＤ４Ｅ６６２Ｈ＝０Ｈ
０Ｆ０００００ＨＡＮＤ４Ｅ６６２Ｈ＝０Ｈ
００Ｆ００００ＨＡＮＤ４Ｅ６６２Ｈ＝４００００Ｈ
となり、概念距離＝５であることがわかる。
【００４２】
図７は、マッチング結果の具体例の説明図である。図７には、「セダン」を検索要求として、各キーワードと、そのキーワードのシソーラスコードＣ２、「セダン」のシソーラスコードＣ１「２３Ａ２４４２Ｈ」とのビット排他論理和の値、概念距離、レベル差、優先度について示している。優先度については、後で再びこの図を参照して説明する。
【００４３】
図７に示すように、概念距離はレベル差とは直接関係ない。例えば、「セダン」に関し、１レベル上位（親子関係）である「自家用車」とのビット排他論理和の値は２Ｈ、その１レベル下で同じ上位語を持つ「スポーツカー」とは３Ｈとなり、これらは非常に近い関係であり、概念距離は１になる。
【００４４】
一方、２レベル上位（祖父母関係）である「自動車」では、ビット排他論理和の値は４２Ｈであるが、その１レベル下位（おじ／おば）にあたる「バス」とは５２Ｈ、さらに１レベル下（いとこ）、すなわち同レベルの「ボンネットバス」では５０Ｈとなり、これらとの概念距離は２になる。さらに、前述の「パソコン」に対しては、ビット排他論理和の値は４Ｅ６６２Ｈとなり、概念距離は５である。
【００４５】
図８は、具体例における概念距離の一部を２次元にマッピングした概念図である。図８において、○はノードを示し、実線はシソーラス階層を示しており、高さが階層レベルを示している。ここで示したシソーラス階層は、図５に示したものの一部である。いま、「セダン」を基準とし、概念距離を２次元上の距離に対応させると、等しい概念距離のノードは、図８に示すように等距離の円弧上に配置して示すことができる。
【００４６】
なお、上述の例では、概念距離の計算にあたり、マスクコードによるＡＮＤを用いたが、論理的に同一であればさまざまな方法で実現し得る。図９は、概念距離のＯＲゲートによる算出法の説明図である。例えば、図９に示すように、同一階層に相当するビットのＯＲをとり、ＯＮとなる最上位ビットを概念距離とすることが考えられる。図９に示した例では、「セダン」と「パソコン」のシソーラスコードのビット排他論理和の値「４Ｅ６６２Ｈ」を４ビットごとにＯＲ回路に入力し、出力として「００１１１１１」を得る。この結果の最上位ビットの位置は５番目であるので、概念距離は５となる。この計算方法では、ワイヤードロジックにより簡単に回路を構成できるので、ハードウェア化に向いた方法ということができる。もちろん、他の方法を用いて演算してもよい。
【００４７】
また、この例では、この後に優先度を算出するために、０〜７に限定した概念距離を導入している。実際に上述のようにして、ある概念距離をもつ表現間のＸＯＲの演算を行なうと、その結果は特定の範囲（概念距離２の場合１０Ｈ〜ＦＦＨ）のみをとるので、例えば、単語間のおおざっぱな距離を求める場合には、ＸＯＲを計算した結果を直接用いてももちろんよい。
【００４８】
さらに、この実施の態様では、シソーラスの全階層に４ビットの固定長を割り当てているが、リーフに近付くにつれてノード数が増えるなどシソーラスの構造はさまざまである。このような場合には、各階層ごとに異なるビット数を割り当てればよく、記憶容量が限られている場合でも、これを有効に利用することが可能である。加えて、シソーラスにおいてはすべてのリーフデータが同じ階層レベルになるとは限らないが、本発明は異なる階層レベルにまたがるデータが比較可能であるため、このような場合においても非常に有効である。
【００４９】
続いて、優先度算出部３３ではＳ８５において、距離算出部３１で算出された概念距離と、階層差算出部３２で算出された階層レベル差に応じ、優先度を算出する。上述のように、概念距離と階層レベル差は直接関係はない。また、概念距離が同一でも、それらの間の類似度が異なることは多い。上述の図８に示すように、「セダン」に対して「自動車」と「ボンネットバス」はともに概念距離２であるが、類似度は「自動車」のほうが直接の孫であるため、類似度は高いと考えられる。一方、シソーラスの構成によっては上位ノードに対し下位ノードとして類義語が並ぶような場合もある。このような場合は同じ階層レベルの方が類似度が高くなる。このように、概念距離と階層レベル差を用いることにより、これらを別の軸として評価することができる。
【００５０】
この例では、優先度として、次のような計算式を用いる。
優先度＝総レベル数×２ − （概念距離×２ − レベル差）
ここで、この例では総レベル数は７としている。上述の図７に示した各キーワードでは、「セダン」に対し、次のような値が返される。
１４：セダン（７×２ − （０×２ − ０）＝１４）
１３：自家用車
１２：スポーツカー，自動車
１１：バス，車両，・・・
１０：ボンネットバス，オートバイ，・・・
・・・
５：パソコン，・・・
・・・
もちろん、上述の計算式のほか、種々の計算方法によって優先度を算出してもよい。
【００５１】
続いて、図３のＳ６７で、上述のようにして得られたマッチング結果を用いて、抽出基準のチェックを行なう。ここでは単純に優先度の値が１２以上の場合、適合結果とする。上述の図７に示した各キーワードの例では、文書が「セダン」、「自家用車」、「スポーツカー」、「自動車」のいずれかのキーワードを有していれば、結果バッファＢＦに出力される。
【００５２】
もちろん、この基準とする値は適宜設定すればよい。また、この抽出基準については、他の抽出基準を用いてもよい。例えば、必要に応じて文書全体を走査したのちに統計処理によって抽出基準に対して重みづけをするなどがある。
【００５３】
このようにして、検索要求と１つのキーワードとのマッチング処理および文書の抽出が終了する。ここまでの処理が１つの文書に付されているすべてのキーワードについて繰り返し、さらに、それらをデータベース５１中の各文書に対して順次行なうことで処理が進められる。
【００５４】
以上の処理により、結果バッファＢＦには、ユーザからの検索要求「セダン」に類似したキーワードを含む文書、具体的には「セダン」、「スポーツカー」、「ＲＶ車」、「自家用車」、「自動車」などを含む文書が蓄積される。この結果に対し、出力指示部３４は優先度に基づいた文書の出力指示を出力部４０に対して行なう。出力部４０は、検索結果を出力し、検索処理は終了する。
【００５５】
上述の類似語検索システムでは、マッチング部３０において、概念距離、階層レベル差、およびこれらから優先度を算出したが、例えば、概念距離のみを利用したり、概念距離と階層レベル差を利用するなど、これらのうちの一部の計算結果を利用してもよい。
【００５６】
本発明は、階層関係を持つデータ間の比較処理一般に適用でき、上述の類似語検索システムへの適用のみに限定されるものではないことは言うまでもない。以下、本発明のマッチング装置の１つの実施の態様を文書自動分類システムに適用した場合について述べる。
【００５７】
現在、自動分類においては、標本データを用いた自動分類が一般的な技術のひとつである。しかし、シソーラス展開を含めることは、コストの問題から従来は困難であった。本発明のマッチング装置を、標準データとのマッチングに適用することにより、低コストで自動分類が可能となる。
【００５８】
例えば、次のようなキーワードを持つ標本データ群（一部）に対して、ある文書を自動分類することを考える。
［標本１］
キーワード：携帯，電話，自動車，ＰＨＳ
カテゴリ：移動体電話
［標本２］
キーワード：端末，反射，抵抗，遅延
カテゴリ：ターミネータ
［標本３］
キーワード：携帯，端末，ＰＤＡ，ネットワーク
カテゴリ：携帯端末
【００５９】
入力文書も同様にキーワードを持つとする。これは、あらかじめ付与されたものでも、その場で抽出したものでもかまわない。
［入力文書］
キーワード：モーバイル，端末，通信
【００６０】
ここで、シソーラス展開を含めた入力文書と各標本のキーワード間の類似度を求める。類似度の計算には、上述の類似語検索システムの場合と同様の方法を用いることができる。さらに、入力文書の各キーワードとの類似度がもっとも低いもの同士の和を求める。標本１〜３において計算した例を示す。
（標本１）
モーバイル：携帯＝１，端末：ＰＨＳ＝４，通信：電話＝２合計７
（標本２）
モーバイル：遅延＝６，端末：端末＝０，通信：遅延＝３合計９
（標本３）
モーバイル：携帯＝１，端末：端末＝０，通信：ネットワーク＝２合計３
【００６１】
こうして得られた和のもっとも小さい標本が、入力文書との類似度がもっとも高いと考えられるため、入力された文書を当該カテゴリに分類する。この例では、標本３において計算された和が最も小さいので、入力文書は標本３のカテゴリ「携帯端末」に分類される。以上の処理を入力文書すべてに対して順次行なことで自動分類が行なわれる。
【００６２】
なお、上述の各システムへの適用例においては、シソーラスを用いた例について述べたが、階層構造をなすデータは一般に広く用いられているため、本発明はファイルディレクトリなど、シソーラス以外のデータ間のマッチングにも適用することができる。
【００６３】
【発明の効果】
以上の説明から明らかなように、本発明によれば、シソーラスのような階層構造をなすデータ間での比較が、ＡＮＤ、ＸＯＲといった単純なビット論理演算をベースとする簡単な処理のみで高速に実行することが可能となる。また、ビット論理演算処理をベースとすることにより、ソフトウェアによる実現はもちろん、ハードウェア化も容易である。
【００６４】
本発明により、従来の共通上位ノードまでの階層数といった単純な処理では困難であった、階層の途中にまたがるようなデータ間の場合でも比較可能となった。加えて、階層レベル差を用いることにより、データの種類に応じたきめ細かな比較基準を設け、優先度づけを行なうことができる。
【００６５】
このように、従来困難であったシソーラス展開を含むマッチング処理が簡単に、しかも高速に実行できるので、本発明のマッチング処理を適用した例えば類似テキスト検索やテキスト自動分類などが、高速かつ低コストで実現可能となるという効果がある。
【図面の簡単な説明】
【図１】本発明のマッチング装置の１つの実施の態様を示す概略構成図である。
【図２】本発明のマッチング装置の１つの実施の態様を類似語検索システムに適用した場合の一例を示すブロック構成図である。
【図３】類似語検索システムの一例における全体の動作の一例を示すフローチャートである。
【図４】マッチングアルゴリズムの一例の概略を示すフローチャートである。
【図５】シソーラスの構成の具体例を示す説明図である。
【図６】シソーラスコードへの変換の一例を示す概念図である。
【図７】マッチング結果の具体例の説明図である。
【図８】具体例における概念距離の一部を２次元にマッピングした概念図である。
【図９】概念距離のＯＲゲートによる算出法の説明図である。
【符号の説明】
１…入力部、２…階層関係コード化部、３…マッチング部、４…出力部、１０…検索要求入力部、１１…端末、１２…ＯＣＲ、１３…電話および音声認識部、１４…記憶装置、１５…赤外線・無線受信部、２０…シソーラスコード化部、２１…シソーラス、３０…マッチング部、３１…距離算出部、３２…階層差算出部、３３…優先度算出部、３４…出力指示部、４０…出力部、４１…端末、４２…ファクシミリ／プリンタ、４３…電話・ポケベル、４４…赤外線・無線発信部、５０…データベース部、５１…データベース。
[0001]
[Technical Field of the Invention]
The present invention relates to a matching device and a matching method for calculating a degree of data matching, and more particularly to a matching device and a matching method based on a hierarchical relationship between data.
[0002]
[Prior art]
Research on text search and automatic classification has long been active. In these fields, dictionaries called thesauri, which define broader/narrower-term relations and synonyms between words, are becoming increasingly important.
[0003]
In a text search, the expression entered by the user often does not match the expressions in the search target. To secure an adequate hit rate, the search therefore needs to include synonyms and broader/narrower terms. A commonly used approach is to expand the words of the search request using a thesaurus.
[0004]
Generally, at search time the thesaurus hierarchy is traversed based on the search request, and a search is performed with each synonym or broader/narrower term obtained. Because traversing the thesaurus hierarchy on every search reduces search speed, techniques have been proposed that create in advance a comprehensive index including the broader/narrower terms of the thesaurus so as to avoid the slowdown at search time, as described for example in "Comprehensive search method of database search system" in JP-A-2-280274.
[0005]
However, even if such a comprehensive index is held, it is necessary to perform a search based on a plurality of database keys included in the comprehensive index, so that the speed is still lower than that of the simple match search. Further, since it is necessary to create such comprehensive index information in advance, there is a problem that the data amount and the cost for constructing the database are increased.
[0006]
It is also conceivable, for example, to perform thesaurus-aware matching between the search key and each keyword and to decide from the degree of matching whether to include the item in the search results. With conventional matching techniques, however, the matching computation is time-consuming, so they were not practical for use in search.
[0007]
Meanwhile, in the field called automatic text classification, texts are classified using a semantic distance between texts measured by some criterion. A thesaurus is also used to obtain this distance between texts.
[0008]
For example, Iida, "Supercomputing in Artificial Intelligence", Information Processing, Vol. 36, No. 2, February 1995, pp. 164-168, shows a method of calculating the distance between words using a concept hierarchy as part of a technique for obtaining the distance between phrases. In that method, the semantic distance between an input and an example is calculated by comparing decimal thesaurus codes assigned in advance.
[0009]
However, because this method computes distance solely from the distance to the common ancestor node, it can only compute distances between words that are leaves of the tree structure and lie at the same level. Not all words are represented as leaves at the same level, so the method cannot be applied to concepts in the middle of the hierarchy or when the number of levels down to the leaf differs. Such a simple method is therefore still insufficient for comparing distances between words. Moreover, because it assumes a parallel computer, efficient distance calculation requires special processing such as associative memory, which is not common at present.
[0010]
As described above, the related art offered only inadequate techniques for efficiently matching words in a thesaurus.
[0011]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and its object is to provide a matching device and a matching method that, in particular, perform matching between words in a thesaurus efficiently, at high speed and with a small data size.
[0012]
[Means for Solving the Problems]
The present invention is a matching device that performs matching based on a hierarchical relationship between data, characterized by comprising: hierarchical data encoding means that holds each data item of a data group having a hierarchical relationship together with values assigned sequentially to the data items sharing the same parent at each level, and that converts an input data item into a code formed by concatenating the values assigned to the data at each level along the path traced down the hierarchy to that item; and distance calculating means that, for at least two given data items, takes the exclusive OR of the codes obtained from the hierarchical data encoding means, finds the largest mask code among the per-level mask codes whose logical AND with that exclusive OR is non-zero, and calculates the value associated with that mask code as the conceptual distance indicating the closeness of the data items in the hierarchical relationship.
The present invention is also a matching method that performs matching based on a hierarchical relationship between data, characterized in that: each data item of a data group having a hierarchical relationship is held in hierarchical data encoding means together with values assigned sequentially to the data items sharing the same parent at each level; at least two given data items are each converted by the hierarchical data encoding means into a code formed by concatenating the values assigned to the data at each level along the path traced down the hierarchy to that item; distance calculating means finds the largest mask code among the per-level mask codes whose logical AND with the exclusive OR of the resulting codes is non-zero and calculates the value associated with that mask code; and that value is used as the conceptual distance indicating the closeness of the data items in the hierarchical relationship.
[0013]
Further, as in the inventions of claims 2 and 5, hierarchy difference calculating means can be provided that, for at least two given data items, finds for each item the smallest mask code among the per-level mask codes whose logical AND with the code obtained from the hierarchical data encoding means is non-zero, determines the hierarchy level of each item from that mask code, and calculates the difference in hierarchy level between the items.
[0014]
Further, as in the inventions of claims 3 and 6, priority calculating means can be provided that calculates a priority based on the results of the distance calculating means and the hierarchy difference calculating means.
[0015]
[Action]
According to the present invention, each data item in a data group having a hierarchical relationship holds a code that reflects that relationship. For at least two given data items, each item is converted into a code formed by concatenating the values assigned to the data at each level along the path down the hierarchy to that item. The exclusive OR of the resulting codes is then taken, the largest per-level mask code whose logical AND with that exclusive OR is non-zero is found, and the value associated with that mask code is used as the conceptual distance indicating the closeness of the data items in the hierarchy. As a result, the distance can be calculated without time-consuming operations such as the arithmetic operations used in the related art, and matching based on the hierarchical relationship between data can be realized quickly and easily. In addition, as in the inventions of claims 2 and 3, various configurations can be adopted depending on the application, such as providing hierarchy difference calculating means and, further, priority calculating means, enabling finer-grained comparison by hierarchy level difference and priority.
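The distance calculation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes the layout used in the embodiment later in the description (seven levels of 4 bits each, with level 1 in the most significant field), and the names `LEVELS`, `BITS_PER_LEVEL`, `DISTANCE_MASKS`, and `concept_distance` are chosen here for illustration only.

```python
LEVELS = 7            # assumption: seven coded levels, as in the embodiment
BITS_PER_LEVEL = 4    # assumption: a fixed 4 bits per level
FIELD = (1 << BITS_PER_LEVEL) - 1   # 0xF, one level's worth of bits

# Mask codes paired with the distance value associated with each of them,
# ordered from the largest mask to the smallest:
# distance 7 -> 0xF000000 (level 1 field), ..., distance 1 -> 0x000000F (level 7 field).
DISTANCE_MASKS = [
    (distance, FIELD << (BITS_PER_LEVEL * (distance - 1)))
    for distance in range(LEVELS, 0, -1)
]

def concept_distance(code1: int, code2: int) -> int:
    """Return the value tied to the largest mask whose AND with code1 XOR code2 is non-zero."""
    diff = code1 ^ code2
    for distance, mask in DISTANCE_MASKS:   # largest mask first
        if diff & mask:
            return distance
    return 0   # identical codes: XOR is all zero bits
```

With the codes used in the worked example of the description, `concept_distance(0x23A2442, 0x23EC220)` evaluates to 5, because the two codes first differ in the level 3 field.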
[0016]
[Embodiments of the Invention]
FIG. 1 is a schematic configuration diagram showing one embodiment of the matching device of the present invention. In the figure, 1 is an input unit, 2 is a hierarchical relation coding unit, 3 is a matching unit, and 4 is an output unit. The input unit 1 provides the data to be compared. The hierarchical relation coding unit 2 returns, in response to an input, the code assigned in advance according to the position of that input in the hierarchy. The matching unit 3 matches the plural data items provided from the input unit 1 by bitwise logical operations on the codes obtained from the hierarchical relation coding unit 2. The output unit 4 outputs the result obtained by the matching unit 3.
[0017]
FIG. 2 is a block diagram showing an example in which one embodiment of the matching device of the present invention is applied to a similar word search system. In the figure, 10 is a search request input unit, 11 is a terminal, 12 is an OCR, 13 is a telephone and voice recognition unit, 14 is a storage device, 15 is an infrared / wireless receiving unit, 20 is a thesaurus coding unit, 21 is a thesaurus, 30 is a matching unit, 31 is a distance calculation unit, 32 is a hierarchy difference calculation unit, 33 is a priority calculation unit, 34 is an output instruction unit, 40 is an output unit, 41 is a terminal, 42 is a facsimile or printer, 43 is a telephone or pager, 44 is an infrared / wireless transmitting unit, 50 is a database unit, and 51 is a database. The search request input unit 10 corresponds to the input unit 1 shown in FIG. 1. Similarly, the thesaurus coding unit 20 corresponds to the hierarchical relation coding unit 2, the matching unit 30 to the matching unit 3, and the output unit 40 to the output unit 4. The database unit 50 electronically stores the various documents to be searched, and the documents are stored in the database 51.
[0018]
The search request input unit 10 includes, for example, a terminal 11, an OCR 12, a telephone and voice recognition unit 13 equipped with a microphone or the like for voice input, a storage device 14 such as a memory, disk, or tape, and an infrared / wireless receiving unit 15 that receives infrared or wireless signals from a portable information device or the like. Of course, only some of these may be provided, and other input devices may be used.
[0019]
The output unit 40 includes a display of a terminal 41, a facsimile / printer 42, a telephone or pager 43 that outputs from a speaker by voice synthesis, and an infrared / wireless transmitting unit 44 that transmits information to a portable information device using infrared or wireless signals. Of course, it may consist of only some of these, other output devices may be used, and the output may first be stored in a storage device connected to a network or the like.
[0020]
The thesaurus coding unit 20 is a unit that converts the input word into a code stored according to the hierarchical structure of the thesaurus 21.
[0021]
The matching unit 30 compares the input word with the keywords of the documents in the database 51 and instructs output of documents having a keyword that matches or is similar to the search request. The distance calculation unit 31 performs logical operations on the codes of the search request and of the document keyword, as converted by the thesaurus coding unit 20, to calculate their distance in the hierarchical relationship. The hierarchy difference calculation unit 32 performs logical operations on the same codes to calculate the difference in hierarchy level. The priority calculation unit 33 calculates the priority of the matching result based on the results of the distance calculation unit 31 and the hierarchy difference calculation unit 32. The output instruction unit 34, based on the result obtained by the priority calculation unit 33, retrieves from the database unit 50 the documents having a keyword that matches or is similar to the search request and instructs the output unit 40 to output them.
[0022]
FIG. 3 is a flowchart showing an example of the overall operation in an example of the similar word search system, and FIG. 4 is a flowchart showing an outline of an example of the matching algorithm. First, in S61, when a user's search request RQ is input from the search request input unit 10, in S62, the thesaurus coding unit 20 performs conversion to a thesaurus code, and obtains a thesaurus code C1 of the search request.
[0023]
Subsequently, in S63, one document DQ is extracted from the database 51 to be searched. Further, in S64, the keywords KW of this document are extracted one at a time. The keywords may be extracted from the document on the spot at search time or, if a fast search is required, keywords assigned to the document in advance, manually or automatically, may be used. In S65, the keyword KW is code-converted by the thesaurus coding unit 20 in the same way as the search request RQ, yielding the thesaurus code C2 of the keyword.
[0024]
When the thesaurus code C1 of the search request and the thesaurus code C2 of the keyword are both obtained, the code matching is performed in S66. First, in S81, the hierarchy difference calculation unit 32 obtains the hierarchy level of the thesaurus code C1 of the search request by a logical operation, and sets it as the hierarchy level L1 of the search request. Similarly, in S82, the hierarchical level of the keyword thesaurus code C2 is obtained by a logical operation, and is set as the keyword hierarchical level L2. In S83, a difference between the hierarchical level L1 of the search request and the hierarchical level L2 of the keyword is obtained, and is set as a hierarchical level difference LD. Subsequently, in S84, the distance calculation unit 31 calculates a bit exclusive OR of the thesaurus code C1 of the search request and the thesaurus code C2 of the keyword, and sets the result to the conceptual distance CD. Further, in S85, the priority calculation unit 33 calculates a priority PL according to the extracted hierarchy level difference LD and the conceptual distance CD.
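Steps S81 to S85 can be sketched in Python as below. This is an illustrative sketch under several assumptions: the 7-level, 4-bit layout of the later worked example; the level difference taken as an absolute value; and the example priority formula that the description gives for S85 (priority = total levels × 2 − (conceptual distance × 2 − level difference)). Instead of looping over mask codes, it uses integer bit tricks that are intended to be logically equivalent. The function names are hypothetical.

```python
TOTAL_LEVELS = 7      # assumption: total number of coded levels in the example
BITS_PER_LEVEL = 4    # assumption: fixed field width per level

def hierarchy_level(code: int) -> int:
    """S81/S82: level of the deepest non-zero field (0 for the root, code 0)."""
    if code == 0:
        return 0
    lowest_set_bit = (code & -code).bit_length() - 1   # index of the lowest 1 bit
    return TOTAL_LEVELS - lowest_set_bit // BITS_PER_LEVEL

def concept_distance(c1: int, c2: int) -> int:
    """S84: index (counted from the deepest level) of the highest differing field."""
    diff = c1 ^ c2
    return (diff.bit_length() + BITS_PER_LEVEL - 1) // BITS_PER_LEVEL

def match(c1: int, c2: int) -> tuple[int, int, int]:
    """Return (level difference LD, conceptual distance CD, priority PL)."""
    ld = abs(hierarchy_level(c1) - hierarchy_level(c2))   # S83
    cd = concept_distance(c1, c2)                         # S84
    pl = TOTAL_LEVELS * 2 - (cd * 2 - ld)                 # S85, example formula
    return ld, cd, pl
```

For the search request "sedan" (23A2442H) and the keyword "PC" (23EC220H) from the worked example, `match(0x23A2442, 0x23EC220)` returns a level difference of 1, a conceptual distance of 5, and a priority of 5.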
[0025]
When the code matching has been performed in this way, it is checked in S67 whether the obtained hierarchy level difference LD, conceptual distance CD, priority PL, and the like satisfy the criteria. If it is determined in S68 that the criteria are satisfied, the output instruction unit 34 outputs the document DQ to the result buffer BF in S69. Of course, the output instruction may instead be given directly to the output unit 40.
[0026]
In S70, it is determined whether all the keywords assigned to the document currently being processed have been processed, and the processing of S64 to S70 is repeated until the processing of the last keyword is completed. When the processing of all keywords is completed for one document, it is determined in S71 whether or not processing has been performed for all documents, and the processing of S63 to S70 is repeated until the processing of the last document is completed.
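The overall loop of S61 to S71 might look like the following Python sketch. It is only an outline under stated assumptions: `to_code`, `match`, and `accept` are hypothetical callables standing in for the thesaurus coding unit, the matching unit, and the acceptance check, and the `break` that stops after the first accepted keyword is an addition made here to avoid listing a document twice; the flowchart itself does not specify it.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

def search(
    request: str,
    documents: Iterable[Tuple[str, Sequence[str]]],   # (document id, keywords)
    to_code: Callable[[str], int],                    # word -> thesaurus code
    match: Callable[[int, int], object],              # (C1, C2) -> matching result
    accept: Callable[[object], bool],                 # criterion check
) -> List[str]:
    c1 = to_code(request)                  # S61, S62: request RQ -> code C1
    result_buffer: List[str] = []          # result buffer BF
    for doc_id, keywords in documents:     # S63, repeated until S71 says done
        for keyword in keywords:           # S64, repeated until S70 says done
            c2 = to_code(keyword)          # S65: keyword KW -> code C2
            result = match(c1, c2)         # S66: code matching
            if accept(result):             # S67, S68: criteria satisfied?
                result_buffer.append(doc_id)   # S69: output document DQ
                break                      # assumption: one hit per document suffices
    return result_buffer
```

Passing the matching and acceptance steps as parameters keeps the loop independent of whether the criterion uses the conceptual distance, the level difference, the priority, or some combination of them.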
[0027]
Through this processing, documents having a keyword that matches or is similar to the search request are retrieved and output. As described above, the matching process computes the hierarchy level difference and the conceptual distance with logical operations, so matching can be performed at high speed. A fast similar word search can therefore be realized with a small data size, without traversing the thesaurus hierarchy or using a comprehensive index as in the related art.
[0028]
The following describes an example of the above operation using a specific case. FIG. 5 is an explanatory diagram showing a specific example of the structure of a thesaurus. For example, the word "private car" is a subordinate concept of "automobile" and a superordinate concept of "sports car", "sedan", "RV", and so on. As shown in square brackets "[]" in the figure, the words sharing the same parent at each level are assigned codes sequentially, starting from level 1. Here, four binary bits are used per level, and the values are written in hexadecimal.
[0029]
Some of the codes actually assigned to the words are also shown in parentheses "()" in the figure. For example, the word "private car" is reached through the path
Concrete object [2] → machine [3] → conveyance [A] → vehicle [2] → automobile [4] → private car [4]
and has one further level of subordinate data below it, which is represented as [0], so it is assigned the 4-byte hexadecimal code "23A2440H". The suffix "H" indicates hexadecimal notation, so the code denotes the hexadecimal value 23A2440. Similarly, the word "sedan", which is located below "private car" and is terminal data, is assigned the code "23A2442H". Here the thesaurus is assumed to have eight levels with at most 15 nodes under each parent, so a code of seven 4-bit levels is used; this allocation is of course arbitrary and may be designed to suit the structure of the thesaurus. Because the assignment can easily be automated by traversing the hierarchical structure of the thesaurus, the effort of assigning codes by hand can be eliminated.
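The code assignment can be illustrated with a short Python sketch. It assumes the 7-level, 4-bit layout of this example; `encode_path` packs the per-level values along a path into one integer, and `assign_codes` shows one possible way to automate the numbering by traversing a tree given as nested dictionaries. Both function names and the nested-dictionary representation are assumptions made for illustration, and the sibling numbers produced by `assign_codes` depend on the order in which siblings are listed.

```python
LEVELS = 7            # assumption: seven coded levels
BITS_PER_LEVEL = 4    # assumption: 4 bits per level, values 1..15 (0 = no node)

def encode_path(values, levels=LEVELS, bits=BITS_PER_LEVEL):
    """Pack per-level sibling values (level 1 first) into a single code."""
    if len(values) > levels:
        raise ValueError("path is deeper than the code allows")
    code = 0
    for value in values:
        if not 1 <= value < (1 << bits):
            raise ValueError("sibling value does not fit in one field")
        code = (code << bits) | value
    # Levels below the word itself are filled with zero fields.
    return code << (bits * (levels - len(values)))

def assign_codes(tree, path=(), levels=LEVELS, bits=BITS_PER_LEVEL):
    """Walk {word: subtree} dictionaries, numbering siblings 1, 2, 3, ..."""
    codes = {}
    for number, (word, children) in enumerate(tree.items(), start=1):
        word_path = path + (number,)
        codes[word] = encode_path(word_path, levels, bits)
        codes.update(assign_codes(children, word_path, levels, bits))
    return codes
```

For the path shown above, `encode_path([2, 3, 0xA, 2, 4, 4])` yields 0x23A2440 ("private car"), and appending the value 2 for "sedan" yields 0x23A2442.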
[0030]
The search request RQ input in S61 of FIG. 3 is converted to the thesaurus code C1 in S62. FIG. 6 is a conceptual diagram illustrating an example of conversion into a thesaurus code. Here, an example is shown that uses hashing as the data structure, as described in "Data Access Method" of Japanese Patent Application Laid-Open No. 5-28194. For the input data, the hash table is looked up using the hash value obtained from a hash function. In this example the hash table stores pointers to a chain index; the chain index is referenced through the pointer, collisions are checked in the chain index, and then the conversion data is referenced. The conversion data stores the code corresponding to each word, so the code corresponding to the input data is obtained.
[0031]
For example, when the data "sedan" is input, a hash value is obtained by the hash function, the hash table is referred to based on the hash value, collisions are checked using the chain index, and the corresponding code is obtained from the conversion data. Also, as shown in FIG. 6, synonyms such as “motorcycle” and “motorcycle” differ as input data but are converted into the same code. In this way, the code can be obtained directly from the data without following the hierarchical relationships in the thesaurus.
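The lookup described above can be sketched as follows, assuming a Python dict stands in for the hash table and chain index (hashing and collision handling are then done by the runtime). The three codes are those given in the text; the entry "saloon" is a hypothetical synonym added only to show two inputs sharing one code, and to_thesaurus_code is a hypothetical name.

```python
# Word-to-code table. A Python dict is itself a hash table, so no explicit
# chain index is written here; this is a substitute for the structure in FIG. 6.
THESAURUS_CODES = {
    "sedan": 0x23A2442,        # code given in the text
    "private car": 0x23A2440,  # code given in the text
    "PC": 0x23EC220,           # code given in the text
    "saloon": 0x23A2442,       # hypothetical synonym sharing the code of "sedan"
}

def to_thesaurus_code(word):
    """Return the thesaurus code for a word, or None if it is not registered."""
    return THESAURUS_CODES.get(word)

print(f"{to_thesaurus_code('sedan'):07X}H")  # prints 23A2442H
```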
[0032]
Although an example using the hash method has been described here, it goes without saying that another data structure such as a B-tree may be used.
[0033]
In this way, the input search request RQ is hashed by the hash function and converted into the thesaurus code C1 by following the hash table and the chain index. Subsequently, in S63, one document is extracted from the search target database 51, and in S64 a keyword of this document is extracted. Here, it is assumed that "PC" has been extracted as the keyword KW. In S65, the keyword KW is converted into a code by the thesaurus coding unit 20 in the same manner as described above. For example, when the keyword is "PC", it is converted into the thesaurus code "23EC220H", as shown in FIG. 5.
[0034]
When the thesaurus code C1 of the search request and the thesaurus code C2 of the keyword are both obtained, the codes are matched in S66. Matching is roughly divided into calculation of the level difference in the hierarchy difference calculation unit 32 and calculation of the conceptual distance in the distance calculation unit 31.
[0035]
First, in S81 and S82, the hierarchy difference calculation unit 32 calculates the hierarchy level of each of the thesaurus code C1 of the search request and the thesaurus code C2 of the keyword. Specifically, the following mask codes in 4-bit units are prepared, each code is ANDed with the masks in turn, starting from level 7, and the level is determined as that of the first mask for which the result is non-zero. When the ANDs down to level 1 are all 0, that is, when the code itself is 0, the level is 0, because the code denotes the root of the thesaurus, in this case "concept".
Level 7: 000000FH
Level 6: 00000F0H
Level 5: 0000F00H
Level 4: 000F000H
Level 3: 00F0000H
Level 2: 0F00000H
Level 1: F000000H
[0036]
The code for "sedan" is "23A2442H" as described above, so first taking the AND with 000000FH gives
000000FH AND 23A2442H = 2H
Since the result is non-zero, the word is found to be a level 7 word, and the processing ends.
[0037]
The code for "PC" is "23EC220H" as described above, so first taking the AND with 000000FH gives
000000FH AND 23EC220H = 0H
Since the result is zero, the AND with 00000F0H is taken next:
00000F0H AND 23EC220H = 20H
Since the result is non-zero, the word is found to be a level 6 word, and the processing ends.
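The level check worked through above can be sketched as follows, assuming the 7-level, 4-bit layout of this example. The names LEVEL_MASKS and hierarchy_level are hypothetical and introduced only for illustration.

```python
NUM_LEVELS = 7

# (level, mask) pairs ordered from level 7 (000000FH) down to level 1 (F000000H).
LEVEL_MASKS = [(level, 0xF << (4 * (NUM_LEVELS - level)))
               for level in range(NUM_LEVELS, 0, -1)]

def hierarchy_level(code):
    """Return the level of the first mask, tried from level 7 upward, whose AND is non-zero."""
    for level, mask in LEVEL_MASKS:
        if code & mask:
            return level
    return 0  # every AND was zero: the code is 0, i.e. the root

sedan = 0x23A2442
pc = 0x23EC220
level_difference = abs(hierarchy_level(sedan) - hierarchy_level(pc))
print(hierarchy_level(sedan), hierarchy_level(pc), level_difference)  # prints 7 6 1
```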
[0038]
Here, a method using AND with masks has been described. However, a method of shifting the code by the number of bits corresponding to the level, for example shifting rightward by 4 bits for level 7 and checking the flag for overflow from the low-order end, is of course logically the same. In any case, such bitwise logical operations can generally be executed at very high speed, so high-speed processing can be realized.
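A shift-based variant might look like the following sketch. Python has no processor overflow flag, so this version inspects the low-order 4 bits before each shift instead; that substitution, and the name hierarchy_level_by_shift, are assumptions made for illustration.

```python
def hierarchy_level_by_shift(code, num_levels=7):
    """Find the hierarchy level by shifting the code right 4 bits at a time."""
    level = num_levels
    # While the lowest layer is empty, drop it and move one level up.
    while level > 0 and (code & 0xF) == 0:
        code >>= 4
        level -= 1
    return level

print(hierarchy_level_by_shift(0x23A2442))  # prints 7
print(hierarchy_level_by_shift(0x23EC220))  # prints 6
```

For the codes in this example the result agrees with the mask-based check.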
[0039]
The absolute value of the difference between the hierarchical levels thus obtained is calculated in S83, and is set as the hierarchical level difference LD. Here, | 7−6 | = 1.
[0040]
Subsequently, the distance calculating unit 31 calculates the conceptual distance between the words. The conceptual distance is obtained from the bitwise exclusive OR (XOR) of the codes. Here,
23A2442H XOR 23EC220H = 4E662H
The value thus obtained indicates the closeness between the words (concepts). The conceptual distance is assigned according to the range in which this value falls. As in the hierarchy level check described above, it can easily be obtained by ANDing the value with the following mask codes and taking the largest distance for which the result is non-zero. When the result of the XOR is all 0s, the conceptual distance is 0, because the codes are the same.
Concept distance 7: F000000H
Concept distance 6: 0F00000H
Concept distance 5: 00F0000H
Concept distance 4: 000F000H
Concept distance 3: 0000F00H
Concept distance 2: 00000F0H
Concept distance 1: 000000FH
[0041]
Here, since the value of the bit exclusive OR is “4E662H”,
F000000H AND 4E662H = 0H
0F00000H AND 4E662H = 0H
00F0000H AND 4E662H = 40000H
It can be seen that the conceptual distance = 5.
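The conceptual distance calculation above can be sketched as follows, again assuming seven 4-bit layers. The name conceptual_distance is hypothetical.

```python
def conceptual_distance(code1, code2, num_levels=7):
    """Return the largest distance whose mask ANDed with the XOR of the codes is non-zero."""
    diff = code1 ^ code2
    for distance in range(num_levels, 0, -1):
        mask = 0xF << (4 * (distance - 1))  # distance 7 -> F000000H, distance 1 -> 000000FH
        if diff & mask:
            return distance
    return 0  # XOR is all zeros: the codes are identical

sedan = 0x23A2442
pc = 0x23EC220
print(f"{sedan ^ pc:X}H", conceptual_distance(sedan, pc))  # prints 4E662H 5
```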
[0042]
FIG. 7 is an explanatory diagram of a specific example of a matching result. For each keyword, FIG. 7 shows the thesaurus code C2 of the keyword, the value of its bitwise exclusive OR with the thesaurus code C1 "23A2442H" of "sedan", the conceptual distance, the hierarchy level difference, and the priority. The priority is described again later.
[0043]
As shown in FIG. 7, the conceptual distance is not directly related to the hierarchy level difference. For example, with respect to "sedan", the value of the bitwise exclusive OR is 2H for "private car", which is one level higher (parent-child relationship), and 3H for "sports car", which is one level below the same broader word (sibling relationship). These are very close relationships, and the conceptual distance is 1 in both cases.
[0044]
On the other hand, for "automobile", which is two levels higher (grandparent relationship), the value of the bitwise exclusive OR is 42H; for "bus", which is one level below "automobile" (uncle/aunt relationship), it is 52H; and for "bonnet bus", which is one level further below and therefore at the same level as "sedan" (cousin relationship), it is 50H. The conceptual distance to each of them is 2. Further, for the above-mentioned "PC", the value of the bitwise exclusive OR is 4E662H and the conceptual distance is 5.
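The values quoted above can be cross-checked with the following sketch. Only the codes of "sedan", "private car", and "PC" are stated in the text; the other codes are derived here by XORing the quoted values back onto the code of "sedan", which is an inference for illustration. The distance is computed with a bit-length shortcut that, for 4-bit layers, gives the same result as the mask comparison.

```python
SEDAN = 0x23A2442

# XOR values with "sedan" as quoted in the text.
XOR_WITH_SEDAN = {
    "private car": 0x2,
    "sports car": 0x3,
    "automobile": 0x42,
    "bus": 0x52,
    "bonnet bus": 0x50,
    "PC": 0x4E662,
}

def distance_from_xor(diff):
    """Number of 4-bit layers needed to hold diff (0 when diff is 0)."""
    return (diff.bit_length() + 3) // 4

# Codes recovered from the XOR values (derived, not all stated in the text).
codes = {word: SEDAN ^ diff for word, diff in XOR_WITH_SEDAN.items()}
distances = {word: distance_from_xor(diff) for word, diff in XOR_WITH_SEDAN.items()}

for word in XOR_WITH_SEDAN:
    print(f"{word}: code {codes[word]:07X}H, distance {distances[word]}")
```

The recovered codes for "private car" and "PC" match the codes given earlier, which supports the consistency of the quoted values.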
[0045]
FIG. 8 is a conceptual diagram in which a part of the conceptual distance in the specific example is two-dimensionally mapped. In FIG. 8, ○ indicates a node, a solid line indicates a thesaurus hierarchy, and a height indicates a hierarchy level. The thesaurus hierarchy shown here is a part of the one shown in FIG. Now, if the concept distance is made to correspond to a two-dimensional distance based on “sedan”, nodes having the same concept distance can be arranged and shown on an equidistant arc as shown in FIG.
[0046]
In the above-described example, AND with mask codes is used in calculating the conceptual distance, but the calculation can be realized by various methods as long as they are logically the same. FIG. 9 is an explanatory diagram of a method of calculating the conceptual distance with OR gates. For example, as shown in FIG. 9, it is conceivable to OR together the bits belonging to the same layer and take the position of the most significant bit that is on as the conceptual distance. In the example shown in FIG. 9, the value "4E662H" of the bitwise exclusive OR of the thesaurus codes of "sedan" and "PC" is input to the OR circuits 4 bits at a time, and "0011111" is obtained as the output. Since the most significant on bit in the result is at the fifth position, the conceptual distance is 5. Since a circuit for this calculation can easily be configured by wired logic, the method can be said to be suitable for hardware. Of course, the calculation may be performed using another method.
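A software imitation of the OR-gate method could look like the sketch below. The function names are hypothetical, and the string output is used only to mirror the "0011111" shown for FIG. 9.

```python
def layer_or_bits(diff, num_levels=7):
    """OR the four bits of each layer; return one character per layer, top layer first."""
    bits = []
    for level in range(1, num_levels + 1):
        nibble = (diff >> (4 * (num_levels - level))) & 0xF
        bits.append("1" if nibble else "0")
    return "".join(bits)

def distance_from_or_bits(bits):
    """Position of the most significant '1', counted from the right end starting at 1."""
    first_one = bits.find("1")
    return 0 if first_one < 0 else len(bits) - first_one

bits = layer_or_bits(0x4E662)
print(bits, distance_from_or_bits(bits))  # prints 0011111 5
```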
[0047]
Further, in this example, a conceptual distance limited to 0 to 7 is introduced in order to calculate the priority afterwards. When the XOR operation is actually performed as described above between words having a given conceptual distance, the result falls only within a specific range (10H to FFH in the case of a conceptual distance of 2). Therefore, when only a rough distance is needed, the result of the XOR calculation may be used directly.
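The range mentioned above follows from the 4-bit layout: a conceptual distance d means the highest differing layer is the d-th from the bottom. A small sketch, with the hypothetical name xor_range, makes this concrete.

```python
def xor_range(distance):
    """Smallest and largest XOR values that yield the given conceptual distance."""
    if distance == 0:
        return (0, 0)
    return (16 ** (distance - 1), 16 ** distance - 1)

low, high = xor_range(2)
print(f"{low:X}H to {high:X}H")  # prints 10H to FFH
```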
[0048]
Further, in this embodiment, a fixed length of 4 bits is assigned to every level of the thesaurus. However, thesauri vary in structure; for example, the number of nodes may increase toward the leaves. In such a case, a different number of bits may be assigned to each level, so that even a limited storage capacity can be used effectively. In addition, in a thesaurus, not all leaf data is at the same hierarchical level; the present invention is very effective in such a case because data at different hierarchical levels can be compared.
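One way the variable-width idea might be realized is to generate the masks from a list of per-level bit widths, as in the sketch below. The function names and the example widths [2, 3, 4, 4, 5, 5, 5] are hypothetical; the text does not specify a concrete allocation.

```python
def build_masks(widths):
    """Return one mask per level, level 1 (top) first, for the given bit widths."""
    total = sum(widths)
    masks = []
    used = 0
    for width in widths:
        used += width
        masks.append(((1 << width) - 1) << (total - used))
    return masks

def conceptual_distance(code1, code2, masks):
    """Distance = number of levels from the bottom up to the highest differing level."""
    diff = code1 ^ code2
    for index, mask in enumerate(masks):  # level 1 first, i.e. largest distance first
        if diff & mask:
            return len(masks) - index
    return 0

fixed = build_masks([4] * 7)                   # reproduces the fixed 4-bit layout
varied = build_masks([2, 3, 4, 4, 5, 5, 5])    # hypothetical layout, also 28 bits
print(conceptual_distance(0x23A2442, 0x23EC220, fixed))  # prints 5
```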
[0049]
Subsequently, in S85, the priority calculation unit 33 calculates the priority from the conceptual distance calculated by the distance calculation unit 31 and the hierarchy level difference calculated by the hierarchy difference calculation unit 32. As described above, the conceptual distance and the hierarchy level difference have no direct relationship. Even when the conceptual distances are the same, the degree of similarity often differs. As shown in FIG. 8 described above, both "automobile" and "bonnet bus" have a conceptual distance of 2 with respect to "sedan", but the similarity to "automobile" is considered higher because "sedan" is its direct grandchild. On the other hand, depending on the configuration of the thesaurus, synonyms may be arranged as lower nodes of an upper node. In such a case, the similarity is higher between nodes at the same hierarchical level. By using both the conceptual distance and the hierarchy level difference in this way, these can be evaluated as separate axes.
[0050]
In this example, the following formula is used as the priority.
Priority = total number of levels × 2 − (conceptual distance × 2 − hierarchy level difference)
Here, in this example, the total number of levels is seven. For each of the keywords shown in FIG. 7, the following values are returned for “sedan”.
14: sedan (7 × 2 − (0 × 2 − 0) = 14)
13: Private car
12: Sports car, car
11: Bus, vehicle, ...
10: Bonnet bus, motorcycle, ...
...
5: PC, ...
...
Of course, the priority may be calculated by various calculation methods other than the above-described calculation formula.
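The priority formula of this example can be sketched as follows. The (conceptual distance, hierarchy level difference) pairs are those described in the text for each keyword relative to "sedan"; the name priority is hypothetical.

```python
TOTAL_LEVELS = 7

def priority(distance, level_difference, total_levels=TOTAL_LEVELS):
    """Priority = total levels x 2 - (conceptual distance x 2 - hierarchy level difference)."""
    return total_levels * 2 - (distance * 2 - level_difference)

# (conceptual distance, hierarchy level difference) relative to "sedan".
RELATIONS = {
    "sedan": (0, 0),
    "private car": (1, 1),
    "sports car": (1, 0),
    "automobile": (2, 2),
    "bus": (2, 1),
    "bonnet bus": (2, 0),
    "PC": (5, 1),
}

priorities = {word: priority(d, ld) for word, (d, ld) in RELATIONS.items()}
for word, value in sorted(priorities.items(), key=lambda item: -item[1]):
    print(value, word)
```

With this formula, a larger hierarchy level difference raises the priority for a given conceptual distance, which favors direct ancestors over same-level cousins.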
[0051]
Subsequently, in S67 of FIG. 3, the extraction criterion is checked using the matching result obtained as described above. Here, the criterion is simply that the priority is 12 or more; a keyword satisfying it is treated as a match. In the example of the keywords shown in FIG. 7 described above, a document having any of the keywords "sedan", "private car", "sports car", and "car" is output to the result buffer BF.
[0052]
Of course, this reference value may be set as appropriate. Another extraction criterion may also be used. For example, the extraction criterion may be weighted by statistical processing after the entire document has been scanned, as necessary.
[0053]
This completes the matching between the search request and one keyword and the extraction of the document. The processing up to this point is repeated for all the keywords attached to one document, and is then performed in turn for each document in the database 51.
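The overall loop over documents and keywords might be sketched as follows. The database contents and document identifiers are hypothetical; the three codes are those given in the text. Keeping the best priority per document and sorting the buffer by it are assumptions about how the result buffer could be organized.

```python
CODES = {"sedan": 0x23A2442, "private car": 0x23A2440, "PC": 0x23EC220}  # from the text

# Hypothetical database: document id -> keywords attached to the document.
DATABASE = {"doc-a": ["PC"], "doc-b": ["PC", "private car"], "doc-c": ["sedan"]}

def hierarchy_level(code, num_levels=7):
    for level in range(num_levels, 0, -1):
        if code & (0xF << (4 * (num_levels - level))):
            return level
    return 0

def conceptual_distance(code1, code2, num_levels=7):
    diff = code1 ^ code2
    for distance in range(num_levels, 0, -1):
        if diff & (0xF << (4 * (distance - 1))):
            return distance
    return 0

def priority(code1, code2, total_levels=7):
    level_difference = abs(hierarchy_level(code1) - hierarchy_level(code2))
    return total_levels * 2 - (conceptual_distance(code1, code2) * 2 - level_difference)

def search(request, database, threshold=12):
    """Collect documents having at least one keyword whose priority meets the threshold."""
    request_code = CODES[request]
    result_buffer = []
    for doc_id, keywords in database.items():
        best = max((priority(request_code, CODES[kw]) for kw in keywords), default=None)
        if best is not None and best >= threshold:
            result_buffer.append((doc_id, best))
    # Highest priority first, so output can follow the priority order.
    result_buffer.sort(key=lambda item: item[1], reverse=True)
    return result_buffer

print(search("sedan", DATABASE))  # prints [('doc-c', 14), ('doc-b', 13)]
```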
[0054]
By the above processing, documents including keywords similar to the search request "sedan" from the user, specifically documents including "sedan", "sports car", "RV car", "private car", "car", and the like, are accumulated in the result buffer BF. Based on this result, the output instructing unit 34 instructs the output unit 40 to output the documents according to the priority. The output unit 40 outputs the search result, and the search processing ends.
[0055]
In the above-described similar word search system, the matching unit 30 calculates the conceptual distance, the hierarchy level difference, and the priority based on them; however, only a part of these calculation results may be used.
[0056]
The present invention can be applied to general comparison processing between data having a hierarchical relationship, and it is needless to say that the present invention is not limited to the application to the similar word search system described above. Hereinafter, a case where one embodiment of the matching device of the present invention is applied to an automatic document classification system will be described.
[0057]
At present, automatic classification using sample data is one of the general techniques. However, including thesaurus expansion has traditionally been difficult because of its cost. By applying the matching device of the present invention to the matching with the sample data, automatic classification can be performed at low cost.
[0058]
For example, consider a case where a certain document is automatically classified using a group of sample data (shown in part) having the following keywords.
[Specimen 1]
Keywords: mobile, telephone, car, PHS
Category: Mobile Phone
[Specimen 2]
Keywords: terminal, reflection, resistance, delay
Category: Terminator
[Specimen 3]
Keywords: mobile, terminal, PDA, network
Category: Mobile Terminal
[0059]
It is assumed that the input document also has a keyword. This may be given in advance or extracted on the spot.
[Input Document]
Keywords: mobile, terminal, communication
[0060]
Here, the similarity, including thesaurus expansion, between the keywords of the input document and the keywords of each sample is obtained. For the calculation, the same method as in the above-described similar word search system can be used; a smaller value indicates a closer match. Then, for each keyword of the input document, the sample keyword giving the smallest value is selected, and the sum of these values is obtained. An example of the calculation for samples 1 to 3 is shown below.
(Specimen 1)
mobile: mobile = 1, terminal: PHS = 4, communication: telephone = 2, total 7
(Specimen 2)
mobile: delay = 6, terminal: terminal = 0, communication: delay = 3, total 9
(Specimen 3)
mobile: mobile = 1, terminal: terminal = 0, communication: network = 2, total 3
[0061]
Since the sample with the smallest sum obtained in this way is considered to have the highest similarity to the input document, the input document is classified into the category of that sample. In this example, the sum calculated for sample 3 is the smallest, so the input document is classified into the category "mobile terminal" of sample 3. The above processing is performed in turn on all the input documents to carry out automatic classification.
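The final selection step can be sketched as follows. The best-match values are copied from the worked example above; how each value was computed is not reproduced here, and the names BEST_MATCHES and classify are hypothetical.

```python
# Best-match value per input keyword, for each specimen, as listed in the text.
BEST_MATCHES = {
    "Mobile Phone": {"mobile": 1, "terminal": 4, "communication": 2},     # specimen 1
    "Terminator": {"mobile": 6, "terminal": 0, "communication": 3},       # specimen 2
    "Mobile Terminal": {"mobile": 1, "terminal": 0, "communication": 2},  # specimen 3
}

def classify(best_matches):
    """Sum the best-match values per category and pick the smallest total."""
    totals = {category: sum(values.values()) for category, values in best_matches.items()}
    return min(totals, key=totals.get), totals

category, totals = classify(BEST_MATCHES)
print(category, totals)
```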
[0062]
Note that, in the application examples to the respective systems described above, a thesaurus has been used as the example. However, since data having a hierarchical structure is in wide general use, the present invention can also be applied to matching between data other than a thesaurus, such as file directories.
[0063]
[Effects of the Invention]
As is apparent from the above description, according to the present invention, comparison between data having a hierarchical structure such as a thesaurus can be executed at high speed using only simple processing based on simple bitwise logical operations such as AND and XOR. Further, because the processing is based on bitwise logical operations, it is easy to realize not only in software but also in hardware.
[0064]
According to the present invention, comparison is possible even between data located partway down the hierarchy or at different levels, which has been difficult with simple conventional processing such as counting the number of levels up to a common upper node. In addition, by using the hierarchy level difference, it is possible to provide a detailed comparison criterion suited to the type of data and to assign a priority.
[0065]
As described above, matching that includes thesaurus expansion, which has been difficult in the past, can be executed easily and at high speed. This has the effect that, for example, similar text search and automatic text classification using the matching process of the present invention can be realized at high speed and at low cost.
[Brief description of the drawings]
FIG. 1 is a schematic configuration diagram showing one embodiment of a matching device of the present invention.
FIG. 2 is a block diagram showing an example of a case where one embodiment of the matching device of the present invention is applied to a similar word search system.
FIG. 3 is a flowchart illustrating an example of an overall operation in an example of a similar word search system.
FIG. 4 is a flowchart illustrating an outline of an example of a matching algorithm.
FIG. 5 is an explanatory diagram showing a specific example of the configuration of a thesaurus.
FIG. 6 is a conceptual diagram showing an example of conversion into a thesaurus code.
FIG. 7 is an explanatory diagram of a specific example of a matching result.
FIG. 8 is a conceptual diagram in which a part of the conceptual distance in a specific example is two-dimensionally mapped.
FIG. 9 is an explanatory diagram of a method of calculating a conceptual distance using an OR gate.
[Explanation of symbols]
1: input unit, 2: hierarchical relation coding unit, 3: matching unit, 4: output unit, 10: search request input unit, 11: terminal, 12: OCR, 13: telephone and voice recognition unit, 14: storage device, 15: infrared/wireless receiving unit, 20: thesaurus coding unit, 21: thesaurus, 30: matching unit, 31: distance calculating unit, 32: hierarchy difference calculating unit, 33: priority calculating unit, 34: output instruction unit, 40: output unit, 41: terminal, 42: facsimile/printer, 43: telephone/pager, 44: infrared/wireless transmission unit, 50: database unit, 51: database.

Claims

1. A matching device that performs matching based on a hierarchical relationship between data, comprising: hierarchical data encoding means that stores each data item of a data group having a hierarchical relationship together with a value sequentially allocated to the data items having the same parent at each hierarchical level, and that converts input data into a code in which the values assigned to the data items of the respective levels passed when tracing the hierarchy down to the input data are arranged; and distance calculating means that, for at least two given data items, finds the largest mask code for which the logical product of the exclusive OR of the codes obtained from the two data items by the hierarchical data encoding means and the mask code for the value of each level is non-zero, and calculates the value associated with that mask code as a conceptual distance indicating the closeness between the data in the hierarchical relationship.

2. The matching device according to claim 1, further comprising hierarchy difference calculating means that, for the at least two given data items, obtains the hierarchical level of each data item from the smallest mask code for which the logical product of the code obtained by the hierarchical data encoding means and the mask code for the value of each level is non-zero, and calculates the hierarchical level difference between the data items.

3. The matching device according to claim 2, further comprising a priority calculation unit that calculates a priority based on the results of the distance calculation unit and the hierarchy difference calculation unit.

4. A matching method for performing matching based on a hierarchical relationship between data, wherein: each data item of a data group having a hierarchical relationship is stored in hierarchical data encoding means together with a value sequentially allocated to the data items having the same parent at each hierarchical level; for at least two given data items, the hierarchical data encoding means converts each data item into a code in which the values assigned to the data items of the respective levels passed when tracing the hierarchy down to that data item are arranged; and distance calculating means finds the largest mask code for which the logical product of the exclusive OR of the obtained codes and the mask code for the value of each level is non-zero, and calculates the value associated with that mask code as a conceptual distance indicating the closeness between the data in the hierarchical relationship.

5. A matching method for performing matching based on a hierarchical relationship between data, wherein: each data item of a data group having a hierarchical relationship is stored in hierarchical data encoding means together with a value sequentially allocated to the data items having the same parent at each hierarchical level; for at least two given data items, the hierarchical data encoding means converts each data item into a code in which the values assigned to the data items of the respective levels passed when tracing the hierarchy down to that data item are arranged; distance calculating means finds the largest mask code for which the logical product of the exclusive OR of the obtained codes and the mask code for the value of each level is non-zero, and calculates the value associated with that mask code as a conceptual distance indicating the closeness between the data in the hierarchical relationship; and hierarchy difference calculating means obtains the hierarchical level of each data item from the smallest mask code for which the logical product of the converted code of that data item and the mask code for the value of each level is non-zero, and calculates the hierarchical level difference between the data items.

6. A matching method for performing matching based on a hierarchical relationship between data, wherein: each data item of a data group having a hierarchical relationship is stored in hierarchical data encoding means together with a value sequentially allocated to the data items having the same parent at each hierarchical level; for at least two given data items, the hierarchical data encoding means converts each data item into a code in which the values assigned to the data items of the respective levels passed when tracing the hierarchy down to that data item are arranged; distance calculating means finds the largest mask code for which the logical product of the exclusive OR of the obtained codes and the mask code for the value of each level is non-zero, and calculates the value associated with that mask code as a conceptual distance indicating the closeness between the data in the hierarchical relationship; hierarchy difference calculating means obtains the hierarchical level of each data item from the smallest mask code for which the logical product of the converted code of that data item and the mask code for the value of each level is non-zero, and calculates the hierarchical level difference between the data items; and priority calculating means calculates a priority based on the conceptual distance and the hierarchical level difference.