JP2000276472A

JP2000276472A - Method and device for similar information collation and recording medium for recording similar information collation program

Info

Publication number: JP2000276472A
Application number: JP11078125A
Authority: JP
Inventors: Masahiko Tokunaga; 雅彦徳永
Original assignee: AdIn Research Inc
Current assignee: AdIn Research Inc
Priority date: 1999-03-23
Filing date: 1999-03-23
Publication date: 2000-10-06
Anticipated expiration: 2019-03-23
Also published as: JP3955410B2

Abstract

PROBLEM TO BE SOLVED: To obtain a similar information collation device that can determine the similarity of information by pattern collation even when noise breaks up the minute patterns that partially coincide within the patterns being collated. SOLUTION: A similar information collation device 1 is provided with a means 10 which generates, from the information to be collated, first and second patterns represented by the positions and features of their elements; a means 20 which generates a collation map 30 consisting of collation positions whose coordinates are pairs of positions of first and second elements having the same features in the first and second patterns; a means 40 which evaluates the continuity of each route along which neighboring collation positions in the collation map 30 are successively connected; and a means 50 which determines the degree of coincidence between the first and second patterns on the basis of the continuity of each route.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a similar information collation device and a similar information collation method for determining the similarity of information that can be expressed as a pattern, and more particularly to techniques for collating the various patterns processed by computers.

[0002]

2. Description of the Related Art

Information such as characters, figures, images, speech, or symbols in general can be represented by the positions and features of its constituent elements. Information in which the features of the elements are arranged in a spatiotemporal relationship to one another is called a pattern. Various pattern information processing methods have been proposed for handling such patterns by computer. Pattern information processing frequently requires evaluating the degree of coincidence of similar information, for example searching for similar character strings in the case of character information. For this reason, a pattern collation method is known that determines the similarity of information by expressing the elements of the information as patterns and evaluating the relationship between the patterns.

[0003]

As an example of a conventional pattern collation method, a character string collation system is described here. A conventional high-speed full-text search technique using a character string collation system is described, for example, in "Part 2: Elemental Technologies of High-Speed Full-Text Search: Index Processing Holds the Key," Nikkei Byte, October 1996, pp. 158-167. The typical conventional collation system described in this reference divides the collation character string and the collated character string into minute parts of fixed length. To explain the terms briefly: when a character string B similar to a character string A is to be found in a document C, the character string A is the "collation character string" and the character string B in the document C is the "collated character string". The collation system then determines whether the minute parts of the collation character string are included in the group of minute partial character strings of the collated character string, and outputs a collated character string containing those minute partial character strings as a character string similar to the collation character string.

[0004]

A collation system of this type determines whether there is a character string whose minute parts match exactly. Consequently, when the collation character string or the collated character string is locally deformed, as when part of a character string is missing, part of a character string is replaced by another character string, or another character string is mixed into the character string, the minute character strings around the deformed portion no longer match, and the character strings are judged not to match. This first type of conventional collation system thus has the drawback that it cannot tolerate local deformation of a character string.
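
The weakness described above can be illustrated with a small sketch. It is not taken from the cited article: it assumes that the fixed-length minute parts are overlapping n-grams held in a set, and the function names `fragments` and `shared_fraction` are hypothetical. A single substituted character destroys every fragment that overlaps it.

```python
def fragments(text, n):
    """Return the set of fixed-length fragments (n-grams) of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_fraction(collation, collated, n):
    """Fraction of the collation string's fragments found in the collated string.

    Assumes len(collation) >= n so that at least one fragment exists.
    """
    wanted = fragments(collation, n)
    return len(wanted & fragments(collated, n)) / len(wanted)


# Identical strings share every fragment.
print(shared_fraction("ABCDEF", "ABCDEF", 3))
# One substituted character ("C" -> "X") removes three of the four
# 3-character fragments of the collation string; only "DEF" survives.
print(shared_fraction("ABCDEF", "ABXDEF", 3))
```

In this toy case five of the six characters still agree, yet most fragments are lost, which is the kind of local deformation the invention aims to tolerate.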

[0005]

SUMMARY OF THE INVENTION

In view of the above problems of conventional collation systems, an object of the present invention is to provide a similar information collation device that determines the similarity of information, together with a similar information collation method and a recording medium on which a similar information collation program is recorded, which can determine the similarity of information by collating patterns even when, within the first pattern or the second pattern corresponding to the information being collated, partially matching minute patterns are scattered across the whole pattern because of partial omission, replacement by other patterns, or the intrusion of other patterns.

[0006]

Means for Solving the Problems

To achieve the above object, the present invention tracks the collation positions of the patterns, introduces a concept of continuity that tolerates collation positions separated from one another, and evaluates this continuity to prevent matches from being missed. FIG. 1 shows the principle of the present invention. The similar information collation device 1 of the present invention, which determines the similarity of information, includes: pattern generation means 10 for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each pattern being represented by the positions and features of the elements of the information; collation map generation means 20 for creating a collation map 30 composed of collation positions, each having as its coordinates the pair of positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature; continuity evaluation means 40 for evaluating, for each route in which neighboring collation positions in the collation map 30 are sequentially connected, the continuity of that route; and pattern collation means 50 for determining the degree of coincidence between the first pattern and the second pattern on the basis of the continuity evaluated for each route.

[0007]

The collation map generation means 20 creates a separate collation position for each of a plurality of combinations of a first element and a second element having the same feature. The pattern collation means 50 has means for setting, for each collation position, the highest of the continuity values evaluated for the routes passing through that collation position as the evaluation value of the collation position, and means for calculating the degree of coincidence between the first pattern and the second pattern on the basis of the evaluation values set for the collation positions.

[0008]

Further, the pattern generation means 10 has means for generating, for at least some of the elements of the information represented as the pattern, synonymous elements having features that can replace the features of those original elements, and means for generating the pattern so that the synonymous elements are listed alongside the original elements. The collation map generation means 20, the continuity evaluation means 40, and the pattern collation means 50 are adapted to process the synonymous elements listed in this way in parallel with the original elements.

[0009]

The collation map generation means 20 may also be configured to have means which, when elements have features expressing numerical values, judges the features to be the same if the values they represent are equal. FIG. 2 is an operation flowchart of a similar information collation method that achieves the above object of the present invention by collating a first pattern and a second pattern, each represented by the positions and features of the elements of information, to determine the similarity of the information. As shown in the figure, the similar information collation method of the present invention includes: a stage of inputting the first pattern and the second pattern (step 1); a stage of detecting a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature (step 2); a collation map generation stage of creating a collation map whose collation positions have as coordinates the pairs of positions of the detected first and second elements (step 3); a route generation stage of generating routes by sequentially connecting neighboring collation positions in the collation map (step 4); a continuity evaluation stage of evaluating the continuity of each generated route (step 5); and a pattern collation stage of determining the degree of coincidence between the first pattern and the second pattern on the basis of the continuity evaluated for each route (step 6).

[0010]

In a similar information collation system that determines the similarity of information, the similar information collation device and method of the present invention described above may also be realized as a program (software) recorded on a computer-readable recording medium. The present invention therefore includes a computer-readable recording medium on which a similar information collation program for determining the similarity of information is recorded. The similar information collation program includes: pattern generation code for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each pattern being represented by the positions and features of the elements of the information; collation map generation code for creating a collation map composed of collation positions, each having as its coordinates the pair of positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature; continuity evaluation code for evaluating, for each route in which neighboring collation positions in the collation map are sequentially connected, the continuity of that route; and pattern collation code for determining the degree of coincidence between the first pattern and the second pattern on the basis of the continuity evaluated for each route.

[0011]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

A character string collation system according to one embodiment of the present invention is described below with reference to the accompanying drawings. The character string collation system of this embodiment presents to the operator those searched documents, among the searched documents stored in a searched document file, that contain a sentence similar to the search sentence entered by the operator.

[0012]

FIG. 3 is a schematic configuration diagram of the character string collation system according to one embodiment of the present invention, and FIG. 4 is an operation flowchart of this system. The character string collation system has a collation data generation unit 110 that receives the search sentence entered by the operator in step 10. The collation data generation unit 110 also converts the search sentence into a collation character string, that is, collation data in a form suitable for collation (step 20). The system further has a collated data generation unit 130 which, in step 20, takes the searched documents from the searched document file 140 and generates collated data containing a collated character string suitable for collation against the collation character string, together with the document identification number of the searched document to which the collated character string belongs.

[0013]

The collation character string (the collation data) and the collated character string (the collated data) are patterns that, among the patterns expressing various kinds of information, specifically express character information. Each character in a character string corresponds to an element of the pattern. An element is represented by the character code, which is the feature of the character, and by the position of the character within the character string. The character string collation system further has a collation map generation unit 150. The collation map generation unit 150 receives the collation character string from the collation data generation unit 110 and the collated character string from the collated data generation unit 130, detects the common characters (step 30), and creates and outputs a collation map (step 40).

[0014]

The collation map is a map whose collation positions are the positions (X, Y), where X and Y are the positions, in the collated character string and in the collation character string respectively, of a common character contained in both strings (as specified for the collation map generation unit 150 below). When character strings are used as the patterns, the collation map can be constructed as a two-dimensional map. The character string collation system further has a continuity evaluation unit 160 and a search result output unit 170. The continuity evaluation unit 160 receives the collation map created and output by the collation map generation unit 150, collates the collation character string against the collated character string, and passes the collation result to the search result output unit 170. To do so, the continuity evaluation unit 160 forms routes, each containing a series of collation positions, by successively tracking from each collation position to neighboring collation positions in the collation map (step 50); calculates a continuity value for each route (step 60); selects, as the evaluation value of each collation position, the highest of the continuity values of the routes passing through that collation position (step 70); and totals the evaluation values of the collation positions for each character of the collation character string (step 80). This total expresses the similarity between the collation character string and the collated character string.

[0015]

The search result output unit 170 receives the collation result from the continuity evaluation unit 160, retrieves from the searched document file 140 the information on the searched documents containing a collated character string judged to be similar to the collation character string, and notifies the operator (step 90). The character string collation system of this embodiment is described in detail below.

[0016]

FIG. 5 is a configuration diagram of the collation data generation unit 110 of the character string collation system in this example. As shown in the figure, the collation data generation unit 110 has: a search sentence expansion unit 111 that receives the search sentence and outputs an expanded search sentence by referring to a search sentence expansion dictionary 120; a search sentence normalization unit 112 that receives the expanded search sentence and converts it into a normalized expanded search sentence; and a numerical expression replacement unit 113 that receives the normalized expanded search sentence, converts its numerical expressions into a uniform format, and outputs the collation character string as the final collation data.

[0017]

The search sentence expansion unit 111 receives a search sentence from the operator (for example, 「文書検索の高速化」, "speeding up document search") and determines whether the search sentence expansion dictionary 120 contains a character string that can replace a character string in the search sentence (for example, 「検索」, "search"). If a replaceable character string (for example, 「抽出」, "extraction") exists in the search sentence expansion dictionary 120, the unit adds that character string (「抽出」) to the search sentence as a synonym of the original character string (「検索」) and outputs the expanded search sentence. If no replaceable character string exists, the search sentence is output unchanged as the expanded search sentence. The expanded search sentence in this example is written 「文書｛検索｜抽出｝の高速化」, where the part of the form ｛character string a｜character string b｝ is the expanded part and indicates that character string a and character string b are synonyms.
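
The expansion step can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary is assumed to be a plain mapping from a word to its list of synonyms, matching is a simple left-to-right scan, and the function name `expand_query` is hypothetical. The output reuses the ｛a｜b｝ notation from the text.

```python
def expand_query(query, synonyms):
    """Wrap every dictionary word found in query as ｛word｜synonym...｝.

    synonyms is an assumed stand-in for the search sentence expansion
    dictionary: {word: [synonym, ...]}.
    """
    out = []
    i = 0
    while i < len(query):
        for word, alternatives in synonyms.items():
            if query.startswith(word, i):
                out.append("｛" + "｜".join([word] + alternatives) + "｝")
                i += len(word)
                break
        else:
            # No dictionary word starts here: copy the character unchanged.
            out.append(query[i])
            i += 1
    return "".join(out)


print(expand_query("文書検索の高速化", {"検索": ["抽出"]}))
```

When the dictionary contains no word that occurs in the search sentence, the function returns the input unchanged, as the text specifies.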

[0018]

The search sentence normalization unit 112 receives the expanded search sentence output from the search sentence expansion unit 111, normalizes the characters in it, and outputs a normalized expanded search sentence. Character normalization means processing such as converting half-width alphanumeric and kana characters to full-width characters, converting lowercase letters to uppercase, and deleting noise characters that should be ignored in a search, such as punctuation marks, line-break control characters, and long-vowel marks.
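
A rough sketch of such normalization is shown below, under stated assumptions. It uses Unicode NFKC folding from Python's standard `unicodedata` module to unify character widths; note that NFKC maps full-width alphanumerics to their ASCII forms, which is the opposite direction from the half-width to full-width conversion described in the text, but both strings end up in the same form, which is what collation needs. The noise character set `NOISE` and the function name `normalize_text` are illustrative choices, not taken from the patent.

```python
import unicodedata

# Assumed noise characters: punctuation, line-break controls, long-vowel mark.
NOISE = set("、。,.!?\n\r\tー")


def normalize_text(text):
    """Unify character widths, uppercase letters, and drop noise characters."""
    folded = unicodedata.normalize("NFKC", text).upper()
    return "".join(ch for ch in folded if ch not in NOISE)


print(normalize_text("ａｂｃ、ＡＢＣ。\n"))
print(normalize_text("サーバー"))
```

The same function would be applied to both the search sentence and the searched documents so that equal text yields equal character codes.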

[0019]

The numerical expression replacement unit 113 receives the normalized expanded search sentence output from the search sentence normalization unit 112 and determines whether it contains a partial character string that expresses a quantity numerically. If such a partial character string exists, the unit converts it to a value, and creates and outputs the collation character string as the final collation data, in which the quantitative expressions of the normalized expanded search sentence have been replaced by their values.
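
The passage does not say which numeric notations are recognized or how the value is written back, so the following is only a sketch under assumptions: it handles runs of decimal digits (half-width or full-width) with optional thousands separators and rewrites each as a canonical ASCII integer. The pattern and the function name `replace_numeric` are hypothetical.

```python
import re

# Digits, optionally grouped in threes by ASCII or full-width commas.
_NUMBER = re.compile(r"\d+(?:[,，]\d{3})*")


def replace_numeric(text):
    """Rewrite each numeric run in a single canonical form."""
    def canonical(match):
        digits = re.sub(r"[,，]", "", match.group(0))
        # int() accepts full-width decimal digits as well as ASCII digits.
        return str(int(digits))
    return _NUMBER.sub(canonical, text)


print(replace_numeric("価格は１，０００円"))
print(replace_numeric("1,000円と２５００円"))
```

After this step, differently written occurrences of the same quantity yield identical characters and therefore produce collation positions.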

[0020]

FIG. 6 is a configuration diagram of the collated data generation unit 130 of the character string collation system in this example. As shown in the figure, the collated data generation unit 130 has a searched document reading unit 131 that reads from the searched document file 140 all the searched documents needed for the search and outputs them. The collated data generation unit 130 further has a searched document normalization unit 132 and a numerical expression replacement unit 133.

[0021]

The searched document normalization unit 132 receives a searched document from the searched document reading unit 131, normalizes the characters in it, and outputs a normalized searched document. Character normalization is as described for the search sentence normalization unit 112. The numerical expression replacement unit 133 receives the normalized searched document from the searched document normalization unit 132 and, if the document contains a partial character string that expresses a quantity numerically, converts that partial character string to a value. It then creates and outputs the collated character string as the final collated data, in which the quantitative expressions of the normalized searched document have been replaced by their values.

[0022]

Next, the function of the collation map generation unit 150 of the character string collation system according to one embodiment of the present invention is described in detail. As shown in FIG. 3, the collation map generation unit 150 is connected to the collation data generation unit 110 and the collated data generation unit 130, receives from them the collation character string (the collation data) and the collated character string (the collated data) respectively, and generates the collation map.

[0023]

The collation map generation unit 150 first detects the characters contained in both the collation character string and the collated character string, that is, the common characters. For example, if the collation character string is 「文書検索の高速化」 ("speeding up document search") and the collated character string is 「高速な文書の検索を行う」 ("perform a fast document search"), the common characters are 「高」, 「速」, 「文」, 「書」, 「の」, 「検」 and 「索」. The unit then generates a collation map composed of collation positions whose Y coordinate is the position of a common character in the collation character string and whose X coordinate is the position of that character in the collated character string.

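The construction of the collation map for this example can be sketched as below. The sketch assumes 0-based positions and represents the map as a sorted list of (X, Y) pairs; the indexing base, the data structure, and the function name `build_collation_map` are assumptions made for illustration.

```python
from collections import defaultdict


def build_collation_map(collation, collated):
    """Return sorted (x, y) pairs, one per pair of equal characters.

    y is the position in the collation string and x the position in the
    collated string. A character occurring several times yields one
    collation position per combination.
    """
    positions = defaultdict(list)
    for x, ch in enumerate(collated):
        positions[ch].append(x)
    points = []
    for y, ch in enumerate(collation):
        for x in positions.get(ch, []):
            points.append((x, y))
    return sorted(points)


for point in build_collation_map("文書検索の高速化", "高速な文書の検索を行う"):
    print(point)
```

For the two example strings this produces seven collation positions, one for each of the seven common characters.
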
[0024]

FIG. 7 is an explanatory diagram that visualizes, as an example, the collation map generated for the above collation character string and collated character string, so that the concept of the collation map can be better understood. The points marked "○" in the figure correspond to collation positions. For simplicity, this example assumes that the collated character string contains no synonyms. On the other hand, as already explained, when the search sentence expansion unit 111 has expanded the collation character string 「文書検索の高速化」 to include synonyms, in the form of the expanded search sentence 「文書｛検索｜抽出｝の高速化」, a collation map such as that shown in FIG. 8 is obtained. In this case the Y coordinate, which represents the position of a common character in the collation character string, is corrected: the Y coordinates of 「検」 and 「抽」 are corrected so that they coincide, as are the Y coordinates of 「索」 and 「出」.
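
One way to picture the Y correction is to assign positions while parsing the ｛a｜b｝ notation, so that each alternative starts at the same Y value. The sketch below assumes 0-based positions and that the position counter advances by the length of the longest alternative; the patent passage only gives the equal-length case, and the function name `corrected_positions` is hypothetical.

```python
import re


def corrected_positions(expanded):
    """Return (character, corrected Y) pairs for an expanded search sentence."""
    result = []
    y = 0
    # The capturing group keeps the ｛...｝ parts in the split output.
    for part in re.split(r"(｛[^｝]*｝)", expanded):
        if part.startswith("｛"):
            alternatives = part[1:-1].split("｜")
            for alternative in alternatives:
                for offset, ch in enumerate(alternative):
                    result.append((ch, y + offset))
            y += max(len(a) for a in alternatives)
        else:
            for ch in part:
                result.append((ch, y))
                y += 1
    return result


for ch, y in corrected_positions("文書｛検索｜抽出｝の高速化"):
    print(ch, y)
```

Here 「検」 and 「抽」 receive the same Y value, as do 「索」 and 「出」, and the characters after the expanded part keep the positions they had in the unexpanded sentence.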

[0025]

The collation map generation unit 150 thus outputs, as the collation map, the Y coordinate value representing the position of each common character in the collation character string, the Y correction value (the corrected position of the common character when the collation character string contains synonyms), the X coordinate value representing the position of the common character in the collated character string, and the document identification number corresponding to the collated data. The continuity evaluation unit 160 receives the collation map from the collation map generation unit 150 and evaluates, for each document identification number, the similarity between the collation character string and the collated character string. To do so, the continuity evaluation unit 160 first tracks the collation positions in the collation map and assigns a continuity evaluation value to every collation position. Next, from the continuity evaluation values of the several collation positions that may exist for one and the same character of the collation character string, it selects the maximum as the continuity evaluation value of that character. Finally, it totals the continuity evaluation values obtained for the individual characters over the whole collation character string, normalizes the total, and takes the result as the degree of coincidence between the collation character string and the collated character string. The degree of coincidence is sent, together with the document identification number, from the continuity evaluation unit 160 to the search result output unit 170.
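
The final aggregation can be sketched as follows. The passage does not state how the total is normalized, so dividing by the length of the collation character string is an assumption, as are the input format (a mapping from (X, Y) to an already computed continuity evaluation value) and the function name `match_degree`.

```python
def match_degree(point_scores, collation_length):
    """Keep the best score per collation-string character, then average.

    point_scores maps (x, y) collation positions to continuity evaluation
    values; y identifies the character of the collation string.
    """
    best = {}
    for (x, y), score in point_scores.items():
        best[y] = max(best.get(y, 0.0), score)
    # Assumed normalization: divide by the number of characters.
    return sum(best.values()) / collation_length


scores = {(0, 0): 1.0, (5, 0): 0.5, (1, 1): 1.0}
print(match_degree(scores, 4))
```

In the sample data the character at Y = 0 has two collation positions, and only the higher of the two values contributes to the total.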

[0026]

The continuity evaluation is described in detail below. FIG. 9 is an explanatory diagram of the route tracking performed for continuity evaluation in the character string collation system according to one embodiment of the present invention. For the collation positions of the collation map shown in FIG. 7, the route tracking process searches for other collation positions within an effective distance of a given collation position and links them. By repeating this process, the collation positions in the collation map are grouped into several routes, which may contain branches. FIG. 9 shows a route from 「高」 to 「速」, and a route containing 「文」 and 「書」 that includes a branch from 「書」 to 「の」 and a branch from 「書」 through 「検」 to 「索」.
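
A possible form of the link-building step is sketched below. The passage does not define the effective distance, so the sketch assumes that a link runs only toward larger X and larger Y and that its squared Euclidean length is at most 10; both conditions, the 0-based coordinates, and the function name `link_positions` are assumptions. Under these assumed settings the links happen to reproduce the routes described for FIG. 9.

```python
def link_positions(points, max_sq_dist=10):
    """Map each collation position to the positions it links forward to."""
    links = {p: [] for p in points}
    for (x1, y1) in points:
        for (x2, y2) in points:
            forward = x2 > x1 and y2 > y1
            near = (x2 - x1) ** 2 + (y2 - y1) ** 2 <= max_sq_dist
            if forward and near:
                links[(x1, y1)].append((x2, y2))
    return links


# Collation positions (x, y) of the example strings, 0-based:
# 高(0,5) 速(1,6) 文(3,0) 書(4,1) の(5,4) 検(6,2) 索(7,3)
points = [(0, 5), (1, 6), (3, 0), (4, 1), (5, 4), (6, 2), (7, 3)]
for start, targets in link_positions(points).items():
    print(start, "->", targets)
```

Following the links from each position that has no incoming link then enumerates the routes, including the branch at 「書」.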

[0027] FIG. 10 illustrates four typical forms of continuity of collation positions. In general, where a continuous character string matches, the links between collation positions line up in the 45-degree lower-right direction. Part (A) of the figure shows an exact match, in which all collation positions line up along the 45-degree lower-right direction. Part (B) shows a case where one character of data is missing, part (C) a case where one character of data is replaced, and part (D) a case where two characters of data are inserted. By tracking these links, the collation can be evaluated with continuity preserved even when data is missing, replaced, or inserted.

[0028] After the paths have been generated, the continuity evaluation unit 160 calculates the degree of matching. To weight the links between collation positions, a character type is set for every character, and each character at a collation position is classified. In this example, assuming classification into the two character types kanji and kana, the characters are classified as follows: kanji: "高", "速", "文", "書", "検", "索"; kana: "の". Next, according to the character types t_1 and t_2 of the characters before and after a link, the link weight between character types is set as follows:

$$W_t = f(t_1, t_2) = \begin{cases} W_{11} & (t_1 = \text{kanji},\ t_2 = \text{kanji}) \\ W_{12} & (t_1 = \text{kanji},\ t_2 = \text{kana}) \\ W_{21} & (t_1 = \text{kana},\ t_2 = \text{kanji}) \\ W_{22} & (t_1 = \text{kana},\ t_2 = \text{kana}) \end{cases}$$

A link is also given a weight according to its length (the distance between the collation positions before and after the link). For example, the weight by link length is expressed as follows:

$$W_l = g(x_1, y_1, x_2, y_2) = \frac{1}{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$

Finally, combining the weight between character types (W_t) with the weight by link length (W_l) gives the following evaluation value for one link:

$$v = W_t \cdot W_l = f(t_1, t_2) \cdot g(x_1, y_1, x_2, y_2)$$

The continuity evaluation unit 160 then totals the link evaluation values over all links on a path obtained by path tracking in the collation map, which yields an evaluation value for the path as a whole. The evaluation value V of one path can be calculated, for example, according to the following equation.

[0029]

(Equation 1)

In the equation, k is the index of a link on the path of interest, n is the total number of links on the path of interest plus 1, and v_k is the evaluation value of each link on the path of interest. The evaluation value V of one path obtained in this way is set at each collation position on the path of interest as the evaluation value V_xy of that collation position. When a path includes branches, for example, the path that includes the branch with the highest evaluation value among the path evaluation values calculated for each branch can be selected as the valid one. By obtaining the evaluation value V of one path in this way for every path generated in the collation map, an evaluation value V_xy is obtained for every collation position in the collation map.
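The link weighting of paragraph [0028] and the path evaluation of paragraph [0029] can be sketched together as follows. Several things here are assumptions made for illustration: the numeric values in `TYPE_WEIGHTS` (the text gives no values for W11 to W22), the Unicode-range test in `char_type`, and above all the form of Equation 1, which is not reproduced in this text. The sketch assumes V is the plain sum of the link values v_k over the n - 1 links, which fits the variable definitions but may differ from the original equation.

```python
KANJI, KANA = "kanji", "kana"

# Hypothetical values for W11, W12, W21, W22; the source gives no numbers.
TYPE_WEIGHTS = {
    (KANJI, KANJI): 1.0,
    (KANJI, KANA): 0.5,
    (KANA, KANJI): 0.5,
    (KANA, KANA): 0.25,
}

def char_type(ch):
    """Treat the hiragana/katakana Unicode blocks as kana, anything else as kanji."""
    return KANA if "\u3040" <= ch <= "\u30ff" else KANJI

def link_value(ch1, pos1, ch2, pos2):
    """v = W_t * W_l for a link between two distinct collation positions."""
    w_t = TYPE_WEIGHTS[(char_type(ch1), char_type(ch2))]
    (x1, y1), (x2, y2) = pos1, pos2
    w_l = 1.0 / ((x2 - x1) ** 2 + (y2 - y1) ** 2)
    return w_t * w_l

def path_value(link_values):
    """Assumed reading of Equation 1: V is the sum of v_k over the n - 1 links."""
    return sum(link_values)

def position_values(paths):
    """Give every collation position the best V among the paths through it.

    paths: list of (positions, link_values); a branching path is passed as
    several branch-free paths that share a prefix.
    """
    best = {}
    for positions, link_values in paths:
        v = path_value(link_values)
        for p in positions:
            best[p] = max(best.get(p, 0.0), v)
    return best

v1 = link_value("文", (0, 0), "書", (1, 1))  # kanji to kanji, squared length 2
v2 = link_value("書", (1, 1), "の", (3, 3))  # kanji to kana, squared length 8
v_xy = position_values([
    ([(0, 0), (1, 1), (2, 2)], [0.5, 0.5]),   # first branch
    ([(0, 0), (1, 1), (3, 2)], [0.5, 0.25]),  # second branch, same first link
])
```

Positions shared by both branches receive the value of the better branch, while a position that lies only on the weaker branch keeps that branch's lower value.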

[0030] Next, an evaluation value is obtained for each character in the collation character string. For example, as shown in FIG. 7, when each character in the collation character string has at most one corresponding collation position, a character that has a corresponding collation position takes the evaluation value of that collation position as its own evaluation value, and the evaluation values of the other characters in the collation character string are set to zero. When two or more collation positions correspond to a character in the collation character string, the largest of the evaluation values of those collation positions is set as the evaluation value of the character. In this way, a continuity evaluation value is obtained for every character in the collation character string.
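The per-character rule above (zero when no collation position exists, otherwise the maximum over the character's collation positions) can be sketched as follows. The function name `character_values` and the convention that the x coordinate indexes the collation character string are assumptions of the example.

```python
def character_values(num_chars, position_values):
    """Evaluation value per collation-string character.

    position_values maps (x, y) collation positions to their evaluation
    values, with x assumed to index the collation character string.
    Characters without any collation position keep the value 0.0.
    """
    values = [0.0] * num_chars
    for (x, _y), v in position_values.items():
        values[x] = max(values[x], v)
    return values

# Character 0 has two collation positions; characters 1 and 3 have none.
per_char = character_values(4, {(0, 0): 1.0, (0, 5): 0.3, (2, 2): 0.75})
```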

[0031] Finally, to obtain the degree of matching of the collation character string as a whole with the collated character string, the continuity evaluation values of all characters in the collation character string are totaled. The total V_total of the continuity evaluation values is calculated, for example, according to the following equation.

[0032]

(Equation 2)

The degree of matching of the collation character string as a whole with the collated character string is expressed, for example, as the total V_total of the continuity evaluation values divided by the total V_equal of the continuity evaluation values for an exact match:

$$\text{degree of matching} = \frac{V_{\mathrm{total}}}{V_{\mathrm{equal}}}$$

Expressed in this way, the degree of matching takes its maximum value of 1.0 in the case of an exact match. The degree of matching obtained in this way is sent, together with the document identification number, to the search result output unit 170 as the collation result.
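The normalization by the exact-match total can be sketched as follows. Equation 2 is not reproduced in this text, so the sketch assumes that V_total is the plain sum of the per-character continuity evaluation values; the function name `matching_degree` and the zero-division guard are likewise assumptions.

```python
def matching_degree(char_values, char_values_exact):
    """Degree of matching = V_total / V_equal.

    char_values: per-character evaluation values for the actual collation.
    char_values_exact: the values the same collation string would obtain
    against an exactly matching collated string.
    Assumes V_total and V_equal are plain sums of these values.
    """
    v_total = sum(char_values)
    v_equal = sum(char_values_exact)
    return v_total / v_equal if v_equal else 0.0

degree = matching_degree([1.0, 0.0, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0])
```

Dividing by the exact-match total keeps the result comparable across search sentences of different lengths, which is what allows documents to be ranked by it.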

[0033] FIG. 11 is a configuration diagram of the search result output unit of the character string collation system according to one embodiment of the present invention. As shown in the figure, the search result output unit 170 includes a collation result conversion unit 171, a search result display unit 172, a search result selection unit 173, and a document display unit 174. The collation result conversion unit 171 receives the degree of matching and the document identification number from the continuity evaluation unit 160 as the collation result, reads the heading, summary information, and the like of the document corresponding to the collation result from the searched document file 140 based on the document identification number, sorts the information on the collated documents in order of the degree of matching, and outputs it as the search result.

[0034] The search result display unit 172 receives the search result from the collation result conversion unit 171, displays it on a display device such as a display, and passes it to the search result selection unit 173 in the next stage. The search result selection unit 173 receives the search result from the search result display unit 172 as well as the operator's instructions given in response to the displayed search result. When the operator specifies a document to be read, the search result selection unit 173 reads the specified document from the searched document file 140 and outputs it as the selected document.

[0035] The document display unit 174 receives the selected document output from the search result selection unit 173 and displays it on a display device such as a display. With the configuration and operation described with reference to FIGS. 3 to 11, the character string collation system according to one embodiment of the present invention collates a search sentence entered by the operator against the documents stored in the searched document file, and can present to the operator the documents that contain text similar to the search sentence.

[0036] Next, the processing performed in the character string collation system according to one embodiment of the present invention when the search sentence expansion unit 111 outputs an expanded search sentence will be described. In this example, consider the case where the search sentence is "文書検索の高速化" (speeding up document search) and the character string "検索" (search) has the synonym "抽出" (extraction). As already described, when the collation data includes a synonym, it is interpreted that multiple collation data exist, here "文書検索の高速化" and "文書抽出の高速化". In addition, by using the synonym data regular expression

文書｛検索｜抽出｝の高速化

an expanded search sentence is created in which the synonyms are listed within the collation data. When the collation data includes synonyms in this way, the collation data is processed as if one of the synonyms at the same position had been selected. FIG. 12 is an explanatory diagram of path tracking in the collation map including synonyms shown in FIG. 8. The effective distance for path tracking is calculated taking into account a distance correction value. This value represents the difference between the distance between collation positions on a path as placed in the actually generated collation map and the distance between the same collation positions on the theoretical collation map that would be generated if one synonym had been selected.
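The interpretation that a synonym expression stands for several concrete collation data can be sketched as follows. This is only an illustration of the notation: the function name `expand_synonyms` is hypothetical, nested braces are not handled, and the embodiment processes the listed synonyms within a single collation map rather than necessarily enumerating every sentence as this sketch does.

```python
import re
from itertools import product

# Accept both full-width and ASCII braces and bars.
GROUP = re.compile(r"[{｛]([^{}｛｝]*)[}｝]")
BAR = re.compile(r"[|｜]")

def expand_synonyms(pattern):
    """Expand 'a{b|c}d' style notation into every concrete search sentence."""
    # With one capture group, re.split alternates literal text (even indices)
    # and the contents of each brace group (odd indices).
    parts = GROUP.split(pattern)
    choices = [BAR.split(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]

sentences = expand_synonyms("文書｛検索｜抽出｝の高速化")
```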

[0037] Finally, the processing performed when the numerical expression replacement unit 113 or 133 of the character string collation system according to one embodiment of the present invention replaces a quantitative expression in the collation character string or the collated character string with a numerical value will be described. FIG. 13 is a flowchart of the procedure for similar quantitative character collation. First, substrings that express a quantity numerically are extracted from the collation character string or the collated character string (step 100). Second, the extracted substrings are converted into values (step 101). Third, the degree of matching of the numerical values is calculated from the converted values (step 102).

[0038] Here, a substring that expresses a quantity numerically is extracted by detecting and taking out a portion of the character string in which numerical expression characters appear consecutively. For example, the following characters are detected as numerical expression characters:

１２３４５６７８９０一二三四五六七八九零十百千万 ...

The degree of matching can be calculated according to the following equation from the value (Vs) obtained from the collation character string and the value (Vd) obtained from the collated character string.
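The extraction step (step 100) can be sketched as follows. The character set is taken from the list above with ASCII digits added as an assumption; the names `NUMERIC_CHARS` and `extract_numeric_substrings` are hypothetical. Conversion to values (step 101) and the degree-of-matching formula (step 102, Equation 3) are not covered, since Equation 3 is not reproduced in this text.

```python
import re

# Full-width digits and kanji numerals from the text; ASCII digits are an
# addition assumed for convenience.
NUMERIC_CHARS = "0123456789１２３４５６７８９０一二三四五六七八九零十百千万"
NUMERIC_RUN = re.compile("[" + NUMERIC_CHARS + "]+")

def extract_numeric_substrings(text):
    """Return each maximal run of consecutive numerical expression characters."""
    return NUMERIC_RUN.findall(text)

runs = extract_numeric_substrings("価格は１２００円、重さ三百グラム")
```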

[0039]

(Equation 3)

FIG. 14 is an explanatory diagram of the similar quantitative character collation process. The figure shows the collation character string and the collated character string, the substrings containing quantitative expressions, the numerical values converted from the substrings, and the degree of matching of the converted values. Thus, according to one embodiment of the present invention, a computer-based character string collation system can collate character strings even when partially matching character strings are scattered because a substring is missing, replaced by another character string, or interrupted by an inserted character string within the collation character string or the collated character string.

[0040] The configuration of the character string collation system according to one embodiment of the present invention is not limited to the examples described in the above embodiment. Each component of the character string collation system may be built as software (a program), recorded on a disk device or the like, and installed as needed on the computer of the character string collation system to perform character string collation. Furthermore, the program may be stored on a portable recording medium such as a floppy (registered trademark) disk or a CD-ROM and used generally wherever such a character string collation system is used.

[0041] The present invention is not limited to the above embodiment, and various modifications and applications are possible within the scope of the claims.

[0042]

[Effects of the Invention] As described above, according to the present invention, the collation positions of the patterns are tracked during pattern collation, so the continuity of a pattern can be evaluated even when the collation positions are separated. Collation is therefore possible even when the partially matching patterns between the collation pattern and the collated pattern are scattered because part of either pattern is missing, replaced by another pattern, or interrupted by another pattern. As a result, even an operator who is not familiar with the content of the collated pattern can perform collation without omissions, which reduces the operator's burden.

[Brief description of the drawings]

FIG. 1 is a principle configuration diagram of the present invention.

FIG. 2 is an operation flowchart of the similar information collation method of the present invention.

FIG. 3 is a schematic configuration diagram of a character string collation system according to one embodiment of the present invention.

FIG. 4 is an operation flowchart of the character string collation system according to one embodiment of the present invention.

FIG. 5 is a configuration diagram of the collation data generation unit of the character string collation system according to one embodiment of the present invention.

FIG. 6 is a configuration diagram of the collated data generation unit of the character string collation system according to one embodiment of the present invention.

FIG. 7 is an explanatory diagram visually representing a collation map according to one embodiment of the present invention.

FIG. 8 is an explanatory diagram of a collation map that includes synonyms.

FIG. 9 is an explanatory diagram of the path tracking performed for continuity evaluation in the character string collation system according to one embodiment of the present invention.

FIG. 10 is a diagram illustrating the continuity of collation positions.

FIG. 11 is a configuration diagram of the search result output unit of the character string collation system according to one embodiment of the present invention.

FIG. 12 is an explanatory diagram of path tracking in the collation map including synonyms shown in FIG. 8.

FIG. 13 is a flowchart of the procedure for similar quantitative character collation.

FIG. 14 is an explanatory diagram of the similar quantitative character collation process.

[Explanation of symbols]

1 Similar information collation device
10 Pattern generation means
20 Collation map creation means
30 Collation map
40 Continuity evaluation means
50 Pattern collation means

Claims

[Claims]

1. A similar information collation device for determining the similarity of information, comprising: pattern generation means for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each pattern being represented by the positions and features of the elements of the information; collation map creation means for creating a collation map composed of collation positions whose coordinates are pairs of the positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature; continuity evaluation means for evaluating, for each path that sequentially connects neighboring collation positions in the collation map, the continuity of the path; and pattern collation means for determining the degree of matching between the first pattern and the second pattern based on the continuity evaluated for each path.

2. The similar information collation device according to claim 1, wherein the collation map creation means creates the collation positions individually for a plurality of combinations of the first element and the second element having the same feature.

3. The similar information collation device according to claim 1, wherein the pattern collation means comprises: means for setting, for each collation position, the highest continuity among the continuities evaluated for the paths passing through that collation position as the evaluation value of the collation position; and means for calculating the degree of matching between the first pattern and the second pattern based on the evaluation value set for each collation position.

4. The similar information collation device according to claim 1, wherein the pattern generation means comprises: means for generating, for at least some original elements of the information represented as the pattern, synonymous elements having features that can substitute for the features of those original elements; and means for generating the pattern so that the synonymous elements are listed alongside the original elements, and wherein the collation map creation means, the continuity evaluation means, and the pattern collation means are adapted to process the listed synonymous elements in parallel with the original elements.

5. The similar information collation device according to any one of claims 1 to 4, wherein the collation map creation means has means for judging, when the elements have features expressing numerical values, that the elements have the same feature when the values expressed by the numerical values match.

6. A similar information collation method for determining the similarity of information by collating a first pattern and a second pattern, each represented by the positions and features of elements of the information, the method comprising: a step of inputting the first pattern and the second pattern; a step of detecting a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature; a step of creating a collation map composed of collation positions whose coordinates are pairs of the positions of the detected first element and second element; a path generation step of generating paths by sequentially connecting neighboring collation positions in the collation map; a continuity evaluation step of evaluating the continuity of each of the generated paths; and a pattern collation step of determining the degree of matching between the first pattern and the second pattern based on the continuity evaluated for each path.

7. A computer-readable recording medium on which a similar information collation program for determining the similarity of information is recorded, the similar information collation program comprising: pattern generation code for generating, from first information and second information to be collated, a first pattern corresponding to the first information and a second pattern corresponding to the second information, each pattern being represented by the positions and features of the elements of the information; collation map creation code for creating a collation map composed of collation positions whose coordinates are pairs of the positions of a first element belonging to the first pattern and a second element belonging to the second pattern that have the same feature; continuity evaluation code for evaluating, for each path that sequentially connects neighboring collation positions in the collation map, the continuity of the path; and pattern collation code for determining the degree of matching between the first pattern and the second pattern based on the continuity evaluated for each path.

8. The recording medium according to claim 7, on which is recorded the similar information collation program wherein the collation map creation code creates the collation positions individually for a plurality of combinations of the first element and the second element having the same feature.

9. The recording medium according to claim 7, on which is recorded the similar information collation program wherein the pattern collation code comprises: code for setting, for each collation position, the highest continuity among the continuities evaluated for the paths passing through that collation position as the evaluation value of the collation position; and code for calculating the degree of matching between the first pattern and the second pattern based on the evaluation value set for each collation position.

10. The recording medium according to any one of claims 7 to 9, on which is recorded the similar information collation program wherein the pattern generation code comprises: code for generating, for at least some original elements of the information represented as the pattern, synonymous elements having features that can substitute for the features of those original elements; and code for generating the pattern so that the synonymous elements are listed alongside the original elements, and wherein the listed synonymous elements are processed in parallel with the original elements.

11. The recording medium according to any one of claims 7 to 10, on which is recorded the similar information collation program wherein the collation map creation code has code for judging, when the elements have features expressing numerical values, that the elements have the same feature when the values expressed by the numerical values match.