JP5780036B2

JP5780036B2 - Extraction program, extraction method and extraction apparatus

Info

Publication number: JP5780036B2
Application number: JP2011163604A
Authority: JP
Inventors: 正樹長尾
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-07-26
Filing date: 2011-07-26
Publication date: 2015-09-16
Anticipated expiration: 2031-07-26
Also published as: JP2013029891A

Description

The present invention relates to an extraction program, an extraction method, and an extraction apparatus for extracting predetermined character strings from a group of character strings.

As information technology advances, information processing such as collecting, processing, and storing information is increasingly performed by computer. As the volume of information handled grows, techniques for processing it efficiently are in demand.

For example, a technique is known that extracts character strings representing the characteristics of the content of a user-specified document and searches a document database for documents whose content is similar to that of the specified document. A technique is also known that, when a desired document is retrieved from a document database according to user-specified search conditions, sorts and displays the search results according to the degree to which each document matches the search conditions.

JP 11-338883 A
JP 2001-109766 A

One form of information processing is "name identification", which, where multiple pieces of data have been gathered, collates the data with one another to derive how they are related. Name identification involves extracting character strings that are similar to a given character string. Extracting similar character strings involves more conditional decisions than extracting identical character strings, so its processing load is high. When the amount of data subject to name identification is enormous, the amount of name identification processing is enormous as well. One aspect of the present disclosure therefore aims to suppress the increase in extraction processing load caused by an increase in the data from which similar character strings are extracted.

In the extraction technique disclosed in the present application, one or more substrings contained in a predetermined character string are extracted, each having a number of characters smaller than the quotient obtained by dividing the number of characters of the predetermined character string by a predetermined edit distance. Then, character strings that contain any of the substrings, among those extracted, whose count is no greater than a certain difference are extracted from a group of character strings. That difference is obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined edit distance and 1 and a natural number smaller than the quotient.
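The two numerical conditions in this paragraph can be written compactly. The symbols below are introduced here for illustration and do not appear in the original: L is the number of characters of the predetermined character string, d is the predetermined edit distance, n is the substring length (the natural number smaller than the quotient), and c is the number of substrings to be matched. This is one reading of the paragraph, not a statement of the claims.

```latex
\begin{align*}
n &< \frac{L}{d}, \\
c &\le (L + 1) - (d + 1)\,n .
\end{align*}
% Worked example (illustrative values): L = 8, d = 1, n = 2
%   c <= (8 + 1) - (1 + 1) * 2 = 5
% An 8-character string has L - n + 1 = 7 two-character substrings,
% and one edit can alter at most n = 2 of them, leaving at least 5.
```

Under this reading, the right-hand side equals (L - n + 1) - d n, that is, the number of length-n substrings of the string minus the most that d edits can disturb.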

According to one aspect of the present disclosure, an increase in the extraction processing load caused by an increase in the data from which similar character strings are extracted can be suppressed.

An explanatory diagram of the name identification function.
An explanatory diagram of the operation of the name identification function.
An explanatory diagram showing an example of the data structure of a name identification definition.
An explanatory diagram of name identification with rough narrowing.
A flowchart showing the processing procedure of name identification with rough narrowing.
A flowchart showing the procedure of the collation process.
An explanatory diagram showing an example of the data structure of a rough narrowing definition.
An explanatory diagram showing the flow of creating a rough narrowing definition.
An explanatory diagram showing details of the name identification process and the rough narrowing process.
An explanatory diagram showing the correspondence between the comparison conditions of the name identification process and the search conditions of the rough narrowing process.
An explanatory diagram showing the records that need to be retrieved from the set of name identification destination records.
An explanatory diagram showing the records that need to be retrieved from the set of name identification destination records.
An explanatory diagram showing the records that need to be retrieved from the set of name identification destination records.
An explanatory diagram showing the relationship between the comparison conditions of name identification and the search conditions of rough narrowing.
An explanatory diagram showing the relational expression between the comparison conditions of name identification and the search conditions of rough narrowing.
An explanatory diagram showing details of tuning a rough narrowing definition.
An explanatory diagram showing details of tuning a rough narrowing definition.
An explanatory diagram showing an example of the functional configuration of an information processing apparatus.
An explanatory diagram showing an example of the hardware configuration of the information processing apparatus.
An explanatory diagram showing the flow of creating a rough narrowing definition as performed by the information processing apparatus.

[1. Overview of name identification]
Name identification is a function that collates records, each containing a set of values, and determines the identity, similarity, and relatedness between them. In name identification, for example, the set of records to be identified is called the name identification source, and the set of records they are matched against is called the name identification destination. FIG. 1 illustrates the name identification function. As shown, the name identification process that implements this function detects, in the name identification destination, records that are identical to, similar to, or related to records of the name identification source, and outputs the detection result as the name identification result.

In recent years, as databases have grown in capacity and scale, methods for performing name identification at high speed have been in demand. The operation of a conventional name identification function is described with reference to FIG. 2, which illustrates that operation. As shown, the name identification source is a supplier table and the name identification destination is a customer table. Of the records in the supplier table, only one record J1 is shown, whereas all records M (M1 to Mn) of the customer table are shown. The source record J1 has five columns: ID, company name, postal code, address, and telephone number. The destination records M (M1 to Mn) have five columns: ID, customer name, postal code, customer address, and telephone number. The name identification process performs name identification of the single source record J1 against the destination records M (M1 to Mn).

First, the name identification process collates the values of each item subject to name identification (hereinafter, a "name identification target item") in the source record J1 and the destination record M1 by applying an evaluation function defined in advance for each name identification target item. Here, the source company name and the destination customer name form the first name identification target item, and the source postal code and the destination postal code form the second. The source address and the destination customer address form the third, and the source telephone number and the destination telephone number form the fourth. The process collates the first through fourth name identification target items by applying the evaluation functions fa(), fb(), fc(), and fd(), respectively. It then weights the evaluation value derived for each name identification target item with the corresponding weight A to D and adds the weighted values to derive a comprehensive evaluation value h. The process likewise derives a comprehensive evaluation value for each of the remaining destination records M2 to Mn against the source record J1, and creates a name identification candidate set containing the comprehensive evaluation values for the pairs of the source record J1 and the destination records M1 to Mn.
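The weighted sum that yields the comprehensive evaluation value h can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `comprehensive_score`, the record layout (plain dictionaries), and the field names are assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

# An evaluation function (fa, fb, ... in the text) compares two values
# and returns an evaluation value such as 1.0 or 0.
EvalFn = Callable[[str, str], float]

# One name identification target item:
# (source item name, destination item name, evaluation function, weight)
TargetItem = Tuple[str, str, EvalFn, float]


def exact(a: str, b: str) -> float:
    """Stand-in evaluation function used only for this illustration."""
    return 1.0 if a == b else 0.0


def comprehensive_score(source: Dict[str, str],
                        dest: Dict[str, str],
                        items: List[TargetItem]) -> float:
    """Weight each item's evaluation value and add the results (value h)."""
    return sum(weight * fn(source[src_item], dest[dst_item])
               for src_item, dst_item, fn, weight in items)
```

A caller would pass one tuple per name identification target item, for example `("company name", "customer name", exact, 0.2)`; the weights themselves come from the name identification definition.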

The name identification process then makes a determination for each pair of records in the name identification candidate set, based on predefined thresholds and determination rules. For example, it compares the comprehensive evaluation value h with the thresholds, labels a pair judged to match as "White" and a pair judged not to match as "Black", and outputs these as the name identification result. A pair that is neither "White" nor "Black" is output to a candidate list as "Gray", and the relatedness of the pairs in the candidate list is judged by a person. The name identification definition 103a, which must be set by a person, comprises the selection of name identification target items, the selection of evaluation functions, and the setting of weights and thresholds.

FIG. 3 shows a specific example of the name identification definition 103a. FIG. 3 illustrates an example of the data structure of a name identification definition: FIG. 3(A) shows the contents of the definition, and FIG. 3(B) shows a specific example.

As shown in FIG. 3(A), a name identification definition associates a name identification method d1, a source designation d2, a destination designation d3, a name identification target item designation d4, and thresholds d5. The name identification method d1 specifies how name identification is performed. One method is "self name identification", which takes a single record set, performs name identification on every pair of records within the set, detects matching records, and removes duplicates. Because the source and destination are the same record set, their structure (record items) is also the same. Another method is "other-party name identification", which takes different record sets as the source and the destination, performs name identification on combinations of source records and destination records, detects matching records, and associates them with each other. Because the source and destination are different sets, their structures (record items) generally differ. The source designation d2 specifies access information, such as the database name of the name identification source, and the items of the source records. The destination designation d3 specifies access information, such as the database name of the name identification destination, and the items of the destination records. The name identification target item designation d4 specifies each target item as a combination of a source item and a destination item, together with the evaluation function and weight applied to it. The thresholds d5 specify an upper threshold for the White determination and a lower threshold for the Black determination.

As shown in FIG. 3(B), the name identification method d1 is set to "other-party name identification", for example. The access information of the source designation d2 is "supplier table", and its record information lists the items ID (identification), company name, postal code, address, and telephone number. The access information of the destination designation d3 is "customer table", and its record information lists the items ID (identification), customer name, postal code, customer address, and telephone number.

In the name identification target item designation d4, the target items are specified as company name:customer name, postal code:postal code, address:customer address, and telephone number:telephone number, each in the form source item:destination item. An evaluation function and a weight are specified for each target item. For example, for company name:customer name, the evaluation function is "edit distance within one character" and the weight is 0.2. For postal code:postal code, the evaluation function is "exact match" and the weight is 0.2. For the thresholds d5, the upper threshold is 0.72 and the lower threshold is 0.26.

The "edit distance" evaluation function expresses as a distance the minimum number of edits needed to transform the destination value into the source value when the values of a name identification target item are collated. For example, the evaluation function "edit distance within one character" returns 1.0 when the edit distance is one character or less and returns 0 otherwise.

The "exact match" evaluation function indicates whether the two values of a name identification target item are completely identical: it returns 1.0 when they are and 0 otherwise. The "edit distance" and "exact match" functions shown in FIG. 3(B) are merely examples of evaluation functions.
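The two example evaluation functions can be sketched as below. The text does not say which edit operations count toward the edit distance, so this sketch assumes the common Levenshtein definition (single-character insertion, deletion, and substitution). The function names are introduced here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed definition): the minimum number of
    single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def within_edit_distance(a: str, b: str, max_distance: int = 1) -> float:
    """'Edit distance within one character': 1.0 if close enough, else 0."""
    return 1.0 if edit_distance(a, b) <= max_distance else 0.0


def exact_match(a: str, b: str) -> float:
    """'Exact match': 1.0 if the two values are identical, else 0."""
    return 1.0 if a == b else 0.0
```

For instance, `within_edit_distance("Fujitsu", "Fujitu")` evaluates to 1.0 because one deletion separates the two values.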

In the name identification process described with reference to FIGS. 1 to 3, the number of record pairs to be collated is the product of the number of source records and the number of destination records. For example, in FIG. 2, if the supplier table (the source) has 100,000 records and the customer table (the destination) has 2 million records, 200 billion pairs must be collated. Name identification on this scale takes an enormous amount of time.

[2. Speeding up name identification by rough narrowing]
One technique for speeding up large-scale name identification is to reduce the number of record pairs before the collation process compares source records with destination records. This section describes "rough narrowing", which coarsely narrows down the destination records that may match a source record before collation. Hereinafter, rough narrowing is also called the "preliminary search", and the collation process that follows is also called the "main search".

FIG. 4 illustrates name identification with rough narrowing. As shown, the rough narrowing process 102 searches the name identification destination 101 using a search condition generated for each record of the name identification source 100, and outputs the retrieved records as the search result 102b. The search condition is generated based on the rough narrowing definition 102a described later.

Suppose the search result 102b, which holds the destination candidates, contains 100 records on average per record of the source 100. The collation in the name identification process 103 then covers 100,000 source records × 100 destination candidates on average = 10 million pairs, a large reduction from the 200 billion pairs required when every record of the destination 101 is collated.
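The size of the reduction follows directly from the record counts given in the text; the reduction factor on the last line is derived here and is not stated in the original.

```latex
\begin{align*}
\text{without rough narrowing: } & 100{,}000 \times 2{,}000{,}000 = 2 \times 10^{11} \text{ pairs} \\
\text{with rough narrowing: }    & 100{,}000 \times 100 = 1 \times 10^{7} \text{ pairs} \\
\text{reduction factor: }        & \frac{2 \times 10^{11}}{1 \times 10^{7}} = 20{,}000
\end{align*}
```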

Next, the processing procedure of name identification with rough narrowing is described with reference to FIGS. 5 and 6, which are flowcharts of the procedure.

First, the rough narrowing process 102 reads the rough narrowing definition 102a and sets up the operating environment (step S100), then takes out, one at a time, a record to be identified from the name identification source 100 (hereinafter, a "source record") (step S101). The rough narrowing process 102 then coarsely searches the name identification destination 101, using as a condition the value of the corresponding item of the source record for each rough narrowing target item defined in the rough narrowing definition 102a (step S102). Specifically, it performs a fuzzy search of the destination 101 with a search condition formed by ORing the per-item conditions. Here, a fuzzy search is a search using, for example, n-grams. The rough narrowing process 102 stores the retrieved records as the search result 102b.

Next, the name identification process 103 takes out each record stored in the search result 102b in turn as a destination record (step S103) and collates the source record with it (step S104). Step S104 is described in detail later with reference to FIG. 6. The name identification process 103 then stores the collation result, which includes the comprehensive evaluation value, in the name identification candidate set (step S105).

The name identification process 103 then determines whether any search result records remain in the search result 102b (step S106). If records remain (step S106; Yes), the process returns to step S103 to take out the next search result record.

If, on the other hand, no search result records remain in the search result 102b (step S106; No), the name identification process 103 applies the threshold determination to each comprehensive evaluation value stored in the name identification candidate set and outputs the result (step S107). For example, when the comprehensive evaluation value is greater than or equal to the upper threshold, the process judges the collated pair of source and destination records to be a matching pair and labels it "White". When the value is less than the upper threshold and greater than or equal to the lower threshold, the process judges that the pair can be determined neither to match nor to mismatch and labels it "Gray". When the value is less than the lower threshold, the process judges the pair to be a mismatching pair and labels it "Black". Pairs labeled "White" or "Black" are output as the name identification result, and pairs labeled "Gray" are output as a candidate list. Whether the pairs in the candidate list match is decided by a person.
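The threshold determination of step S107 can be sketched as a small function. The default thresholds 0.72 and 0.26 are the example values from FIG. 3(B); the function name `classify` is an assumption introduced here.

```python
def classify(score: float, upper: float = 0.72, lower: float = 0.26) -> str:
    """Label a record pair from its comprehensive evaluation value."""
    if score >= upper:
        return "White"   # judged to match
    if score >= lower:
        return "Gray"    # undecided; goes to the candidate list for a person
    return "Black"       # judged not to match
```

Note that both boundaries are inclusive on the lower side, matching the "greater than or equal to" wording of the procedure.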

The rough narrowing process 102 then determines whether any source records remain in the name identification source 100 (step S108). If source records remain (step S108; Yes), the process returns to step S101 to take out the next source record. If no source records remain (step S108; No), the rough narrowing process 102 ends name identification with rough narrowing.
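The overall loop of FIG. 5 can be outlined as follows. This is a hedged sketch: the rough search, the collation, and the threshold determination are passed in as callables, and the function name `run_name_identification`, the dictionary record type, and the tuple layout of the outputs are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Judged = Tuple[Record, Record, float, str]  # (source, destination, score, label)


def run_name_identification(
    sources: Iterable[Record],
    rough_search: Callable[[Record], Iterable[Record]],
    collate: Callable[[Record, Record], float],
    classify: Callable[[float], str],
) -> Tuple[List[Judged], List[Judged]]:
    """Return (name identification result, candidate list of Gray pairs)."""
    results: List[Judged] = []
    candidate_list: List[Judged] = []
    for src in sources:                                  # S101, repeated via S108
        # S102: rough narrowing; S103 to S106: collate every retrieved record
        candidate_set = [(dst, collate(src, dst)) for dst in rough_search(src)]
        for dst, score in candidate_set:                 # S107: threshold determination
            label = classify(score)
            target = candidate_list if label == "Gray" else results
            target.append((src, dst, score, label))
    return results, candidate_list
```

Only the records returned by `rough_search` reach `collate`, which is where the reduction in collated pairs comes from.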

Next, the processing of step S104 shown in FIG. 5 is described with reference to FIG. 6, which is a flowchart of the collation process. The collation process collates one pair of a source record and a destination record at a time and derives its comprehensive evaluation value.

First, the name identification process 103 selects, one at a time, the name identification target items defined in the name identification definition 103a (step S110). Each name identification target item is defined in advance in the name identification definition 103a as a pair of items to be compared, consisting of a source item and a destination item. The name identification process 103 then picks, from the source record and the destination record, the values corresponding to the selected target item (step S111) and applies the evaluation function to the two values (step S112) to calculate an evaluation value. The evaluation function is the one defined in advance for that target item in the name identification definition 103a.

The name identification process 103 then determines whether any name identification target items remain (step S113). If target items remain (step S113; Yes), the process returns to step S110 to apply the evaluation function to the next one.

If, on the other hand, no name identification target items remain (step S113; No), the name identification process 103 weights the evaluation value of each target item with that item's weight and adds the weighted values (step S114). It then outputs the sum as the comprehensive evaluation value for the record pair (step S115) and ends the collation process for that pair.

FIG. 7 shows an example of the data structure of the rough narrowing definition. FIG. 7(A) shows the contents of the definition, and FIG. 7(B) shows a specific example. Like the name identification definition, the rough narrowing definition is created by a person.

As shown in FIG. 7(A), the rough narrowing definition associates target items with search conditions and, where needed, also defines a maximum detection count. Multiple target items can be specified, each as a pair consisting of a source item and a destination item to which a search condition is applied in the rough narrowing process, and a search condition is specified for each pair. The maximum detection count is the maximum number of destination records retained as the result of searching the destination for one source record.

As shown in FIG. 7(B), the rough narrowing definition 102a defines, for each rough narrowing target item d11, the source item, the destination item, and the search condition to apply; the maximum detection count d12 described above is also defined where needed. Each target item d11 is associated with a "source:destination" entry and a "search condition". The "source:destination" entry gives the names of the items used for rough narrowing in the source record and the destination record, in the form "source item: destination item". The search condition specifies, for each target item, how the destination item is searched using the value of the source item. Examples include "n-gram", which retrieves destination records whose target item contains some run of n consecutive characters from the value of the source record's target item, and "exact match", which retrieves destination records whose target item value is identical to that of the source record. When "n-gram" is used, an index match count c can also be specified. Here, an index in n-gram means a string of n consecutive characters in the target item of the source record. For example, the search condition "Bi-gram / index match count 4" retrieves the destination records whose target item contains at least four of the two-character substrings of the source record's target item.

In the example of FIG. 7(B), the search condition for the target item "company name: customer name" is "Bi-gram / match count 2", and the search condition for "address: customer address" is "Bi-gram / match count 4". The search condition for the target items "postal code: postal code" and "telephone number: telephone number" is "exact match". The union of the sets retrieved by these four search conditions corresponds to the search result 102b in FIG. 4. The maximum number of destination records detected for each source record is 1000.
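The four search conditions of FIG. 7(B) and their union can be sketched as below. This is an illustrative reading only: the tuple layout of a condition, the English item names, the sample records, and the choice to count every position of the source value whose n-character substring occurs in the destination value are assumptions, not details given in the text.

```python
from typing import Dict, List, Optional, Tuple

# (source item, destination item, n, c); n is None for an exact match.
Condition = Tuple[str, str, Optional[int], int]


def index_hits(source_value: str, destination_value: str, n: int) -> int:
    """Count the n-character substrings of source_value found in destination_value."""
    grams = [source_value[i:i + n] for i in range(len(source_value) - n + 1)]
    return sum(1 for g in grams if g in destination_value)


def satisfies(source: Dict[str, str], destination: Dict[str, str],
              cond: Condition) -> bool:
    src_item, dst_item, n, c = cond
    s, t = source[src_item], destination[dst_item]
    if n is None:
        return s == t                    # "exact match"
    return index_hits(s, t, n) >= c      # "n-gram / index match count c"


def rough_narrow(source: Dict[str, str], destinations: List[Dict[str, str]],
                 conditions: List[Condition], max_hits: int = 1000):
    """Union of the records matched by any condition, capped at max_hits."""
    hits = [rec for rec in destinations
            if any(satisfies(source, rec, cond) for cond in conditions)]
    return hits[:max_hits]


conditions = [
    ("company", "customer", 2, 2),          # Bi-gram / match count 2
    ("address", "customer_address", 2, 4),  # Bi-gram / match count 4
    ("zip", "zip", None, 0),                # exact match
    ("phone", "phone", None, 0),            # exact match
]
source = {"company": "Example Trading", "address": "12 Harbor Street",
          "zip": "100-0001", "phone": "03-0000-0000"}
destinations = [
    {"id": 1, "customer": "Example Trading Co", "customer_address": "99 Hill Road",
     "zip": "200-0002", "phone": "06-1111-1111"},
    {"id": 2, "customer": "Northwind", "customer_address": "Unknown",
     "zip": "100-0001", "phone": "00"},
    {"id": 3, "customer": "Zeta", "customer_address": "5 Pine Ave",
     "zip": "300-0003", "phone": "09-2222-2222"},
]
result = rough_narrow(source, destinations, conditions)
```

Record 1 is retrieved by the company-name bi-gram condition and record 2 by the postal-code exact match, while record 3 satisfies none of the four conditions and is excluded.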

[3. Creating the rough narrowing definition]
Name identification with rough narrowing has been described so far with reference to FIGS. 4 to 7. As noted above, both the name identification definition used in the name identification process and the rough narrowing definition used in the rough narrowing process are created by a person. Normally the name identification definition is created first, followed by the rough narrowing definition. FIG. 8 shows the flow for creating the rough narrowing definition.

In procedure T11, the user refers to the source data, the destination data, and the name identification definition already created, considers which destination records can be excluded because they need not be matched, and creates an initial rough narrowing definition.

In procedure T12, several source records are sampled, and the rough narrowing process is tried on the sampled records using the initial rough narrowing definition created in procedure T11.

In procedure T13, the number of destination records remaining after narrowing is evaluated in light of the processing speed of the computer to be used. If the narrowing is judged sufficient, the flow proceeds to procedure T14 described below. Otherwise, the rough narrowing definition is revised and the flow returns to procedure T12.

In procedure T14, the name identification process is tried on the records sampled in procedure T12.

In procedure T15, the result of the trial in procedure T14 is checked against the existing name identification definition to investigate whether any destination record that should be matched against a source record has been excluded by the rough narrowing process. If such a record has been excluded, the flow returns to procedure T12.

As described above, procedures T12 to T15 are repeated by trial and error until the rough narrowing definition is completed in procedure T16.

When creating the rough narrowing definition, narrowing the destination records too far out of concern for the computer's processing speed drops data needed by the subsequent name identification process and lowers the accuracy of name identification. Conversely, placing too much weight on accuracy prevents sufficient narrowing, so the name identification process takes an enormous amount of time. Creating the rough narrowing definition therefore involves trial and error, and when done by hand it typically takes several person-months.

[4. Technique for efficiently creating the rough narrowing definition]
A technique for efficiently creating the rough narrowing definition is described below. The technique derives the rough narrowing definition from a name identification definition created in advance. First, the name identification process and the rough narrowing process are described in more detail with reference to FIG. 9.

FIG. 9 shows one source record J1 and a set M of destination records. These correspond to the source record J1 and the set M of destination records shown in FIG. 2. Applying the rough narrowing process to the set M produces the set Ma of retrieved records. As illustrated, the rough narrowing process uses a search function to retrieve from the set M the records that contain a specific character string. The process may be coarse but must be fast. Examples of search functions include exact match as already described, n-gram such as uni-gram and bi-gram, prefix match, and suffix match.

Once the set Ma has been created, the name identification process is performed between the source record J1 and the set Ma. This yields the set of records that, in relation to the record J1, satisfy at least one of the matching rules defined in the name identification definition; this set is shown as the name identification set Mb. As illustrated, the name identification process uses a comparison function to compare the record J1 with each record in the set Ma. In contrast to the rough narrowing process, this comparison must be fine-grained and therefore takes time. Examples of comparison functions include exact match and edit distance. One concrete form of edit distance is a function that counts the number of characters that differ between the two strings being compared. Another divides that number by the number of characters in one of the two strings and treats the result as an edit ratio.
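The two edit-distance comparison functions can be illustrated as follows. The text does not say which edit distance is meant, so this sketch assumes the Levenshtein distance (insertions, deletions, and substitutions each cost 1); dividing by the length of the first argument in `edit_ratio` is likewise an assumption, since the text only says "one of the two strings".

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed definition)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_ratio(a: str, b: str) -> float:
    """Edit distance divided by the length of a, which must be non-empty."""
    return edit_distance(a, b) / len(a)
```

A comparison condition such as "edit distance within one character" then corresponds to `edit_distance(a, b) <= 1`, and a ratio-based condition to `edit_ratio(a, b) <= r`.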

The following points must be kept in mind when performing the rough narrowing process.
・The set Mb of records whose name identification result with the record J1 is White or Gray must be contained in the set Ma of retrieved records.
・At the same time, the set Ma should be as small as possible.
・The cost of the rough narrowing process itself (processing time, computing resources used, and so on) should be kept low.

FIG. 10 shows the correspondence between comparison conditions in the name identification process and search conditions in the rough narrowing process. As illustrated, when "exact match" is used as the comparison condition in the name identification process, the search condition in the rough narrowing process is also "exact match". This is because "exact match" is a well-defined string comparison function that can be used for rough narrowing as is.

Next, consider the case where the comparison condition in the name identification process is a "fuzzy match", that is, a condition based on edit distance. Edit distance is an evaluation function for fuzzy comparison, and it remains effective even when the order of characters in the two strings being compared has been partly changed. A search condition based on n-gram and the index match count is appropriate as a search condition that encompasses this comparison condition.

The following describes a concrete method for converting an edit-distance comparison condition of the name identification process into an n-gram search condition for rough narrowing. To evaluate on the basis of the "n" of n-gram and the index match count, the minimum number of characters in the comparison-source string must be taken into account.

First, consider the comparison condition "edit distance within one character". FIG. 11 shows the records that must be retrieved from the set M of destination records when the comparison source is the three-character string "ABC". Here "?" denotes a character that is none of "A", "B", and "C". In other words, the string patterns shown in FIG. 11 are those derived from the comparison-source string "ABC".

As shown in FIG. 11, the records that must be retrieved from the destination are those containing at least two of the one-character indexes of the comparison-source value "ABC" (that is, "A", "B", and "C"). The search condition for retrieving such records from the destination is therefore "1 (uni)-gram, index match count 2 or more".
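The claim for "ABC" can be checked by brute force. The sketch below enumerates every string of length 2 to 4 over the alphabet "ABC?" (with "?" standing for any other character), keeps those within edit distance 1 of "ABC", and finds the smallest number of uni-gram indexes that any of them contains. It assumes the Levenshtein distance and counts one hit per position of the source string; neither detail is stated in the text.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed definition of edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def index_hits(source: str, target: str, n: int) -> int:
    """Number of n-character substrings of source that occur in target."""
    return sum(1 for i in range(len(source) - n + 1)
               if source[i:i + n] in target)


source = "ABC"
alphabet = "ABC?"
# Strings within distance 1 of a 3-character string have length 2, 3, or 4.
candidates = ["".join(p) for length in range(2, 5)
              for p in product(alphabet, repeat=length)]
near = [t for t in candidates if edit_distance(source, t) <= 1]
min_hits = min(index_hits(source, t, 1) for t in near)
```

Here `min_hits` comes out as 2 (reached, for example, by "?BC"), so a uni-gram search with an index match count of 2 misses none of the strings within one edit of "ABC".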

Next, FIG. 12 shows the records that must be retrieved from the set M of destination records when the comparison condition is again "edit distance within one character" and the comparison source is the four-character string "ABCD". Here "?" denotes a character that is none of "A", "B", "C", and "D". In other words, the string patterns shown in FIG. 12 are those derived from the comparison-source string "ABCD".

As shown in FIG. 12, the records that must be retrieved from the destination are those containing at least one of the two-character indexes of the comparison-source value "ABCD" (that is, "AB", "BC", and "CD"). The search condition for retrieving such records is therefore "2 (bi)-gram, index match count 1 or more". Alternatively, the records containing at least three of the one-character indexes of "ABCD" (that is, "A", "B", "C", and "D") may be retrieved from the destination, in which case the search condition is "1 (uni)-gram, index match count 3 or more".

From the examples in FIGS. 11 and 12, the conversion from the comparison condition "edit distance within d characters" used in name identification to the search condition "n-gram, index match count c" used in rough narrowing can be written as follows.
n = F1(d, m)
c = F2(d, m, n)
Here m is the minimum number of characters in the comparison-source string, and F1 and F2 are mutually different functions.

FIG. 13 shows the records that must be retrieved from the set M of destination records when the comparison condition is "edit distance within two characters" and the comparison source is the three-character string "ABC". Here "?" denotes a character that is none of "A", "B", and "C". In other words, the string patterns shown in FIG. 13 are those derived from the comparison-source string "ABC".

As shown in FIG. 13, the records that must be retrieved from the destination are those containing at least one of the one-character indexes of the comparison-source value "ABC" (that is, "A", "B", and "C"). The search condition for retrieving such records from the destination is therefore "1 (uni)-gram, index match count 1 or more".

FIGS. 11 to 13 are individual examples; FIG. 14(A) shows the result over a wider range. FIG. 14(A) gives the relationship between the comparison condition for name identification (d and m) and the search condition for rough narrowing (n and c). As already stated, d is the "d" in the comparison condition "edit distance within d characters" and m is the minimum number of characters in the comparison-source string, while n and c are the "n" and "c" in the search condition "n-gram, index match count c". FIG. 14(A) lists the possible values of n and c for a total of 30 patterns, combining the three cases d = 1, 2, 3 with the ten cases m = 1, 2, ..., 10.

For example, for d = 1 and m = 3 the entry is "n = 1 / c = 2", which corresponds to the example in FIG. 11. For d = 1 and m = 4 the entry is "n = 1 / c = 3" or "n = 2 / c = 1", which corresponds to FIG. 12. For d = 2 and m = 3 the entry is "n = 1 / c = 1", which corresponds to FIG. 13. For d = 1 and m = 1 the entry is "None (all)", meaning that every destination record would be retrieved and rough narrowing would serve no purpose.
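The figures themselves are not reproduced in this text, so the functions F1 and F2 cannot be quoted. As an assumption, the sketch below uses the standard n-gram counting bound: a string of m characters has m - n + 1 n-grams, and each single-character edit can destroy at most n of them, giving c = m + 1 - n × (d + 1). This bound reproduces the four table entries quoted above and agrees with the update rule c' = c - (n' - n) × (d + 1) given for FIG. 17, but it should not be read as the patent's own formula. The function name and the `n_max` parameter are hypothetical.

```python
from typing import List, Tuple


def candidate_conditions(d: int, m: int, n_max: int = 3) -> List[Tuple[int, int]]:
    """List the (n, c) pairs with c >= 1 under the assumed bound.

    d: maximum edit distance allowed by the comparison condition
    m: minimum number of characters in the comparison-source string
    n_max: largest n-gram size considered (hypothetical cap)
    """
    pairs = []
    for n in range(1, n_max + 1):
        c = m + 1 - n * (d + 1)  # assumed bound, not quoted from FIG. 15
        if c >= 1:
            pairs.append((n, c))
    return pairs  # an empty list means rough narrowing is not possible


fig11 = candidate_conditions(d=1, m=3)   # the "ABC" case, one edit
fig12 = candidate_conditions(d=1, m=4)   # the "ABCD" case, one edit
fig13 = candidate_conditions(d=2, m=3)   # the "ABC" case, two edits
none_all = candidate_conditions(d=1, m=1)
```

An empty result plays the role of the "None (all)" entry: no positive match count can be guaranteed, so every destination record would have to be retrieved.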

FIG. 14(B) gives the relationship between the comparison condition for name identification (r and m) and the search condition for rough narrowing (n and c). Here r is the "r" in the comparison condition "edit distance (within ratio r)". The rest is the same as FIG. 14(A) and is not described again. For r = 0.33 and m = 1 the entry is "None (= 0%)", meaning that the condition is equivalent to specifying r = 0 and rough narrowing serves no purpose.

From FIGS. 14(A) and 14(B), the relationship between d (or r) and m on one side and n and c on the other can be expressed by formulas such as those shown in FIG. 15. In other words, the search condition for rough narrowing can be created from the comparison condition defined in the name identification definition. Hereinafter, m, n, c, and d are also called the first number, the second number, the third number, and the fourth number, respectively.

FIG. 15(A) shows the formulas for obtaining n and c from d and m. FIG. 15(B) shows the formulas for obtaining n and c from r and m, which turn out to be the same as those in FIG. 15(A) because r can be converted into d by d = r × m (fractional part truncated). A point common to FIGS. 15(A) and 15(B) is that, in actual rough narrowing, a larger n makes the search index much larger, so a maximum value of n must be fixed in advance.
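The conversion d = r × m with truncation can be sketched as below. Passing r as a decimal string and using `decimal.Decimal` is a choice made here to keep the truncation free of binary floating-point rounding; the function name is hypothetical.

```python
from decimal import Decimal


def ratio_to_distance(r: str, m: int) -> int:
    """Convert an edit ratio r (given as a decimal string) into d = r * m, truncated."""
    return int(Decimal(r) * m)  # int() drops the fractional part


d_m1 = ratio_to_distance("0.33", 1)  # 0.33 * 1 = 0.33
d_m4 = ratio_to_distance("0.33", 4)  # 0.33 * 4 = 1.32
```

For r = 0.33 and m = 1 the product truncates to d = 0, which is why that table entry is equivalent to specifying r = 0.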

The derivation of the rough narrowing search condition from the name identification comparison condition has been described with reference to FIGS. 9 to 15. In this way, the rough narrowing definition, which used to be created through repeated trial and error, can be created efficiently. Specifically, human error and omission of rough narrowing conditions that should have been defined, both of which can occur when the definition is created conventionally, are reduced. Acquiring the expertise and experience needed to balance rough narrowing accuracy against computer processing speed is no longer necessary, and the human cost of creating the rough narrowing definition is greatly reduced. Meanwhile, the user can create the name identification definition intuitively by looking at the actual source and destination data, without considering the rough narrowing search conditions.

Moreover, with formulas such as those in FIG. 15, whatever character strings are stored in the column to which the minimum character count of the comparison source applies, the set obtained by name identification (corresponding to the set Mb in FIG. 9) is contained in the set of records retrieved by rough narrowing (corresponding to the set Ma in FIG. 9).

In actual name identification, however, depending on the character strings contained in the source and destination records, it may be possible to narrow to an even smaller set by making n or c of the rough narrowing condition larger than the value given by formulas such as those in FIG. 15. Increasing n and c in this way is called tuning of the rough narrowing definition.

[5. Tuning the rough narrowing definition]
Tuning of the rough narrowing definition is described below with reference to FIGS. 16 and 17. It is assumed that the name identification definition and a rough narrowing definition based on formulas such as those in FIG. 15 (also called the initial rough narrowing definition) have already been created. To tune n and c of this definition, several source records are first sampled. The sampling rate s here is assumed to be 0.1. The source records must be analyzed so that the sampled records are as unbiased as possible; for example, a statistically based sampling method can be used. As shown in FIG. 16(A), the set P of destination records contains the 14 records R1 to R14, which FIG. 16(A) expresses as v = 14.

Next, name identification without rough narrowing is performed between the records sampled from the source and the destination, based on the comparison conditions defined in the name identification definition. This is called the name identification trial. In the trial, the destination records that, in relation to a sampled source record, satisfy at least one of the comparison conditions defined in the name identification definition are recorded for each comparison condition. The set of records recorded in this way is shown in FIG. 16(A) as the name identification set Pc. The set Pc contains the four records R11 to R14, which FIG. 16(A) expresses as x = 4.

Then rough narrowing is performed on the records sampled from the source and all destination records, using the rough narrowing definition created in advance. This is called the rough narrowing trial. The destination records retrieved by the trial are recorded for each column. FIG. 16(A) shows the set Pa produced by the rough narrowing trial. The set Pa contains the 12 records R3 to R14; that is, the trial excluded the records R1 and R2. FIG. 16(A) expresses this as w = 12.

Next, it is checked whether the set Pc of destination records judged White or Gray in the name identification trial is contained in the set Pa of destination records retrieved by the rough narrowing trial. The check proceeds as follows.

First, the intersection of the sets Pa and Pc contains the four records R11 to R14, which FIG. 16(A) expresses as y = 4. The proportion of the set Pa that belongs to the set Pc is called the rough narrowing precision, calculated as y/w = 4/12 = 33%. The proportion of the set Pc that is missing from the set Pa is called the leak rate, calculated as 1 - y/x = 1 - 4/4 = 0%.

The number of destination records expected to be retrieved if rough narrowing were performed for all source records is called the predicted rough narrowing count, predicted to be at most w/s = 12/0.1 = 120. The number of destination records expected to be judged White or Gray if name identification were performed for all source records is called the predicted name identification count, predicted to be at most x/s = 4/0.1 = 40.
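The quantities of FIG. 16(A) can be recomputed from the three sets. The sketch below uses Python sets with the record names R1 to R14 from the text; the variable names for the derived metrics are chosen here for illustration.

```python
# Sets from FIG. 16(A): all destination records, the rough narrowing
# result, and the name identification result.
P = {f"R{i}" for i in range(1, 15)}    # R1 .. R14
Pa = {f"R{i}" for i in range(3, 15)}   # R3 .. R14
Pc = {f"R{i}" for i in range(11, 15)}  # R11 .. R14
s = 0.1                                # sampling rate

v, w, x = len(P), len(Pa), len(Pc)
y = len(Pa & Pc)                       # size of the intersection

precision = y / w                      # rough narrowing precision
leak_rate = 1 - y / x                  # share of Pc missing from Pa
predicted_narrowed = w / s             # predicted rough narrowing count
predicted_matched = x / s              # predicted name identification count
```

A leak rate of 0 means that every record in Pc survived rough narrowing, which is the condition the subsequent check relies on.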

In other words, the check determines whether the leak rate is 0%. If it is, the rough narrowing search condition can probably be tightened further, so the flow proceeds to the search condition change described below. Note that the leak rate after a rough narrowing trial with a definition based on formulas such as those in FIG. 15 is always 0%, because n and c are chosen so that the leak rate is 0%.

In the search condition change, the rough narrowing search condition, that is, the values of n and c, is changed so that rough narrowing retrieves a smaller set of records. For example, only c is changed, to c' = c + 1. Alternatively, n is changed to n' = n + 1 and c is changed to c'. FIG. 17 shows the relationship between n' and c' in this case: c' = c - (n' - n) × (d + 1). This formula gives c' when n is changed to n' = n + 1. If n' is no greater than the maximum value of n for n-gram described with reference to FIG. 15 and c' is greater than 0, the rough narrowing trial, the check, and the search condition change are repeated using the search condition given by n' and c'. If the check then finds a nonzero leak rate, as in FIG. 16(B) for example, the values n and c, not n' and c', are adopted as final. Likewise, if the search condition change yields a c' of 0 or less, n and c rather than n' and c' are adopted as final.

In this way, the sequence of changing the search condition to n' and c', performing a rough narrowing trial with them, and checking the result of that trial against the result of the name identification trial is repeated. This determines a final value of n that is larger than the initial value of n by some amount.

As an example, the flow of the search condition change is described for an initial value n = 1 and an initial value c = 3. First, n' is obtained. If n' is larger than n by 1, then n' = n + 1 = 1 + 1 = 2. Next, c' is obtained using this n': c' = c - (n' - n) × (d + 1) = 3 - (2 - 1) × (1 + 1) = 1. The rough narrowing trial, the check, and the search condition change are then repeated using the search condition given by n' = 2 and c' = 1.
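The arithmetic of this example can be reproduced with a minimal sketch. The function names `tightened_condition` and `is_usable` and the parameter `n_max` (standing for the maximum n of FIG. 15) are hypothetical names chosen for illustration; d = 1 is taken from the (1 + 1) factor in the worked example.

```python
from typing import Tuple


def tightened_condition(n: int, c: int, d: int, step: int = 1) -> Tuple[int, int]:
    """Return (n', c') with n' = n + step and c' = c - (n' - n) * (d + 1)."""
    n_new = n + step
    c_new = c - (n_new - n) * (d + 1)
    return n_new, c_new


def is_usable(n_new: int, c_new: int, n_max: int) -> bool:
    """The new condition is usable only if n' <= n_max and c' > 0."""
    return n_new <= n_max and c_new > 0


# Worked example from the text: n = 1, c = 3, d = 1.
n_new, c_new = tightened_condition(n=1, c=3, d=1)
print(n_new, c_new)  # 2 1
```

Applying the function once more to n = 2, c = 1 gives c' = -1, which `is_usable` rejects, so under these assumptions n = 2 and c = 1 would be the last candidate condition to try.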

The tuning of the rough narrowing definition has been described above with reference to FIGS. 16 and 17. This tuning changes the search condition defined in the rough narrowing definition to a narrower search condition suited to the actual name identification source data and name identification destination data, so that rough narrowing can be performed more efficiently.

[6. Example]
Next, an example of creating and tuning the rough narrowing definition is described with reference to FIGS. 18 to 20. FIG. 18 shows an example of the functional configuration of the information processing apparatus 1 that performs name identification. The information processing apparatus 1 includes a control unit 11, a nonvolatile storage unit 12, and a volatile storage unit 13. The nonvolatile storage unit 12 holds a name identification source DB 121, which stores name identification source data, and a name identification destination DB 122, which stores name identification destination data. The nonvolatile storage unit 12 also stores a name identification definition 123 created in advance by the user. In addition, the nonvolatile storage unit 12 stores a rough narrowing definition 124 created by the rough narrowing definition generation unit 111 described later.

The control unit 11 includes a rough narrowing definition generation unit 111, a rough narrowing processing unit 113, and a name identification unit 114. The rough narrowing definition generation unit 111 creates a tuned rough narrowing definition 124 from the name identification definition 123 and stores it in the nonvolatile storage unit 12. The rough narrowing processing unit 113 refers to the records in the name identification source DB 121 and to the rough narrowing definition 124, and performs rough narrowing of the records in the name identification destination DB 122. The result of rough narrowing is stored in the volatile storage unit 13 as a rough narrowing processing result 131. Based on the rough narrowing processing result 131, the name identification unit 114 performs name identification between each record in the name identification source DB 121 and the narrowed-down records among the records in the name identification destination DB 122. The result of name identification is stored in the volatile storage unit 13 as a name identification processing result 132.

FIG. 19 shows an example of the hardware configuration of the information processing apparatus 1 that performs name identification. The information processing apparatus 1 includes a CPU 1010, an interface device 1020, a display device 1030, an input device 1040, a drive device 1050, an auxiliary storage device 1060, and a memory device 1070, which are connected to one another by a bus 1080.

A program that implements the functions of the information processing apparatus 1 is provided on a recording medium 1090 such as a CD-ROM. When the recording medium 1090 on which the program is recorded is set in the drive device 1050, the program is installed from the recording medium 1090 into the auxiliary storage device 1060 via the drive device 1050. The program need not be installed from the recording medium 1090; it can also be downloaded from another computer via a network. The auxiliary storage device 1060 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is given, the memory device 1070 reads the program from the auxiliary storage device 1060 and stores it. The CPU 1010 implements the functions of the information processing apparatus 1 in accordance with the program stored in the memory device 1070. The interface device 1020 is used as an interface for connecting to other computers through a network. The display device 1030 displays a GUI (Graphical User Interface) and the like provided by the program. The input device 1040 is a keyboard, a mouse, and the like, and is used by the user performing name identification to enter various instructions.

FIG. 20 shows the flow of the process, performed by the information processing apparatus 1, of creating the rough narrowing definition 124. First, in step S400, the rough narrowing definition generation unit 111 reads the name identification definition 123 stored in advance in the nonvolatile storage unit 12. A specific example of the name identification definition 123 is shown in FIG. 3. The rough narrowing definition generation unit 111 then generates an initial rough narrowing definition (not shown; also called the current rough narrowing definition) from the name identification definition 123 it has read, and stores it in the nonvolatile storage unit 12. The initial rough narrowing definition is generated as described with reference to FIGS. 10 to 15. That is, for each of the one or more comparison conditions defined in the name identification definition 123, if the comparison condition is "exact match", the rough narrowing search condition is also set to "exact match"; if the comparison condition is "fuzzy match", a fuzzy search condition for rough narrowing that uses an n-gram and the index match count c is defined.

In step S401, the rough narrowing definition generation unit 111 determines whether the generated initial rough narrowing definition contains a fuzzy condition. If the result is "NO", the process ends and the initial rough narrowing definition is used as is as the final rough narrowing definition 124.

If the result of step S401 is "YES", then in step S402 the name identification unit 114 samples some of the records in the name identification source DB 121. As described above, the name identification source records are analyzed so that the sampled records are as unbiased as possible. Although not shown, the sampled name identification source records are stored in the volatile storage unit 13.

In step S403, the name identification unit 114 performs name identification without rough narrowing, that is, a name identification trial, on the sampled records. For each comparison condition, the name identification unit 114 stores in the volatile storage unit 13, as a name identification trial result (not shown), the name identification destination records that, in relation to the sampled records, satisfy at least one of the comparison conditions defined in the name identification definition.

In step S404, the rough narrowing processing unit 113 performs a rough narrowing trial with reference to the records sampled by the name identification unit 114 and the current rough narrowing definition. The result is stored in the volatile storage unit 13 as a rough narrowing trial result (not shown).

In step S405, the rough narrowing definition generation unit 111 refers to the name identification trial result and the rough narrowing trial result stored in the volatile storage unit 13. It then calculates the rough narrowing matching rate and the leak rate as shown in FIGS. 16(A) and 16(B).
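One way to compute the two rates is sketched below. FIG. 16 is not reproduced in this section, so the definitions used here are assumptions: the matching rate is taken as the share of rough-narrowed records that also appear in the name identification trial result, and the leak rate as the share of name identification trial records that the rough narrowing trial missed. The function name `rough_narrowing_rates` and the record identifiers are hypothetical.

```python
from typing import Set, Tuple


def rough_narrowing_rates(trial_matches: Set[int], rough_hits: Set[int]) -> Tuple[float, float]:
    """Return (matching_rate, leak_rate) under the assumed definitions.

    trial_matches: record IDs found by the name identification trial.
    rough_hits:    record IDs retrieved by the rough narrowing trial.
    """
    kept = trial_matches & rough_hits      # true matches that survived narrowing
    missed = trial_matches - rough_hits    # true matches lost by narrowing
    matching_rate = len(kept) / len(rough_hits) if rough_hits else 1.0
    leak_rate = len(missed) / len(trial_matches) if trial_matches else 0.0
    return matching_rate, leak_rate


# Hypothetical sample: three true matches, six records retrieved by rough narrowing.
print(rough_narrowing_rates({1, 2, 3}, {1, 2, 3, 4, 5, 6}))  # (0.5, 0.0)
```

Under these definitions, a leak rate of 0 means the rough narrowing lost no true match, while a matching rate below 1 means the condition could still be tightened.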

In step S406, the rough narrowing definition generation unit 111 checks whether the leak rate calculated in step S405 is 0%. If the result is "YES", the process proceeds to step S407. If the result is "NO", the process proceeds to step S409, where the rough narrowing definition immediately preceding the current one is made the current rough narrowing definition, and then proceeds to step S410. Note that the leak rate is always 0% for the initial rough narrowing definition, so for it the check in step S406 always yields "YES".

In step S407, the rough narrowing definition generation unit 111 checks whether the rough narrowing matching rate calculated in step S405 is 100%. If the result is "YES", the process proceeds to step S410; otherwise, it proceeds to step S408.

In step S408, the rough narrowing definition generation unit 111 determines whether any of the fuzzy search conditions defined in the current rough narrowing definition is currently a tuning target. If the result is "YES", the process proceeds to step S412; otherwise, it proceeds to step S410.

In step S410, the rough narrowing definition generation unit 111 determines whether any of the fuzzy search conditions defined in the current rough narrowing definition has not yet been tuned. If the result is "YES", the process proceeds to step S411.

In step S411, the rough narrowing definition generation unit 111 selects one of the fuzzy search conditions that have not yet been tuned as the tuning target.

In step S412, the rough narrowing definition generation unit 111 changes the fuzzy search condition being tuned to a condition that narrows the results further. Specifically, as described with reference to FIG. 17, n' is calculated by adding 1 to the n of the n-gram in the fuzzy search condition being tuned. Then c' is calculated from this n' and the index match count c.

In step S413, the rough narrowing definition generation unit 111 determines whether the fuzzy search condition can be changed. Specifically, the change is judged possible when n' = n + 1 is equal to or less than the maximum value of n for the n-gram described with reference to FIG. 15 and c' is greater than 0. In that case, a new fuzzy search condition is created with n' and c' as the new n and c. If the change is not possible, a new fuzzy search condition is created with c' = c + 1. A new rough narrowing definition containing this new fuzzy search condition is then created, and step S404 and the subsequent steps are repeated. In other words, how far n and c in the initial rough narrowing definition can be increased is determined from the name identification trial result of step S403 and the rough narrowing trial result of step S404. Eventually the result of step S413 becomes "NO", and the process proceeds to step S410.

Eventually the result of step S410 becomes "NO", and the process ends. The rough narrowing definition at that point is used as is as the final rough narrowing definition 124.
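The tuning loop for a single fuzzy search condition can be sketched as follows. This is a simplified illustration, not the flow of FIG. 20 itself: it tunes one condition only, omits the c' = c + 1 fallback, and stops as soon as the condition can no longer be tightened. The function name `tune_fuzzy_condition`, the parameter `n_max` (the maximum n of FIG. 15), and the callback `trial`, which is assumed to run the rough narrowing trial on the sampled records and return the matching rate and leak rate, are all hypothetical.

```python
from typing import Callable, Tuple


def tune_fuzzy_condition(
    n: int,
    c: int,
    d: int,
    n_max: int,
    trial: Callable[[int, int], Tuple[float, float]],
) -> Tuple[int, int]:
    """Tighten (n, c) for as long as the leak rate on the sample stays at 0."""
    accepted = (n, c)  # last condition known to lose no true match
    while True:
        matching_rate, leak_rate = trial(n, c)
        if leak_rate > 0.0:
            return accepted  # roll back to the previous condition (cf. step S409)
        accepted = (n, c)
        if matching_rate >= 1.0:
            return accepted  # nothing left to narrow (cf. step S407)
        n_new = n + 1
        c_new = c - (n_new - n) * (d + 1)
        if n_new > n_max or c_new <= 0:
            return accepted  # the condition cannot be tightened further
        n, c = n_new, c_new


def fake_trial(n: int, c: int) -> Tuple[float, float]:
    # Made-up sample behaviour: true matches start leaking once n reaches 3.
    return (0.2 * n, 0.0 if n < 3 else 0.1)


print(tune_fuzzy_condition(n=1, c=5, d=1, n_max=5, trial=fake_trial))  # (2, 3)
```

In this made-up run the condition (n, c) = (3, 1) leaks, so the sketch returns the previously accepted (2, 3), mirroring the rollback to the preceding rough narrowing definition.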

[7. Other Embodiments]
In the name identification definition shown in FIG. 3, one comparison condition is defined for each combination of a name identification source column and a name identification destination column to be compared. In other embodiments, however, a plurality of comparison conditions can be defined for each such combination.

In step S412 of FIG. 20, the increase in n is not limited to 1; n may be increased by 2, by 3, or by a larger value.

The functional configuration and physical configuration of the information processing apparatus described above are not limited to the aspects described above. For example, the functions and physical resources may be implemented in integrated form or, conversely, in further distributed form.

The following appendices are further disclosed with respect to the above embodiments.
(Appendix 1)
An extraction program for extracting, from a character string group, character strings whose edit distance from a predetermined character string is equal to or less than a predetermined number, the program causing a computer to execute a process comprising:
extracting one or more partial character strings, each being a run of consecutive characters within the predetermined character string, whose number of consecutive characters is smaller than the quotient obtained by dividing the number of characters of the predetermined character string by the predetermined number;
extracting, from the character string group, character strings that include any of the extracted one or more partial character strings; and
determining, for each character string extracted from the character string group, whether its edit distance from the predetermined character string is equal to or less than a predetermined distance.
(Appendix 2)
The extraction program according to Appendix 1, wherein a character string whose edit distance from the predetermined character string is equal to or less than the predetermined number is a character string obtained by editing the predetermined character string a number of times equal to or less than the predetermined number.
(Appendix 3)
The extraction program according to Appendix 1, causing the computer to extract, from the character string group, character strings that include, of the one or more partial character strings, partial character strings not exceeding the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined number and 1 and a natural number smaller than the quotient.
(Appendix 4)
An extraction method for extracting, from a character string group, character strings whose edit distance from a predetermined character string is equal to or less than a predetermined number, the method causing a computer to execute a process comprising:
extracting one or more partial character strings, each being a run of consecutive characters within the predetermined character string, whose number of consecutive characters is smaller than the quotient obtained by dividing the number of characters of the predetermined character string by the predetermined number;
extracting, from the character string group, character strings that include any of the extracted one or more partial character strings; and
determining, for each character string extracted from the character string group, whether its edit distance from the predetermined character string is equal to or less than a predetermined distance.
(Appendix 5)
The extraction method according to Appendix 4, wherein a character string whose edit distance from the predetermined character string is equal to or less than the predetermined number is a character string obtained by editing the predetermined character string a number of times equal to or less than the predetermined number.
(Appendix 6)
The extraction method according to Appendix 4, causing the computer to extract, from the character string group, character strings that include, of the one or more partial character strings, partial character strings not exceeding the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined number and 1 and a natural number smaller than the quotient.
(Appendix 7)
An extraction apparatus for extracting, from a character string group, character strings whose edit distance from a predetermined character string is equal to or less than a predetermined number, the apparatus comprising:
first extraction means for extracting one or more partial character strings, each being a run of consecutive characters within the predetermined character string, whose number of consecutive characters is smaller than the quotient obtained by dividing the number of characters of the predetermined character string by the predetermined number;
second extraction means for extracting, from the character string group, character strings that include any of the extracted one or more partial character strings; and
determination means for determining, for each character string extracted from the character string group, whether its edit distance from the predetermined character string is equal to or less than a predetermined distance.
(Appendix 8)
The extraction apparatus according to Appendix 7, wherein a character string whose edit distance from the predetermined character string is equal to or less than the predetermined number is a character string obtained by editing the predetermined character string a number of times equal to or less than the predetermined number.
(Appendix 9)
The extraction apparatus according to Appendix 7, wherein the second extraction means extracts, from the character string group, character strings that include, of the one or more partial character strings, partial character strings not exceeding the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined number and 1 and a natural number smaller than the quotient.
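The three steps of Appendix 1 (partial character string extraction, candidate extraction, edit distance determination) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the text: the function names are hypothetical, the edit distance is taken to be the Levenshtein distance, a candidate must contain at least c = m + 1 - (d + 1) × n of the query's n-grams (counted per position in the query), and the sketch rejects parameter choices for which c is less than 1 so that the candidate filter cannot drop a true match.

```python
from typing import List


def ngrams(s: str, n: int) -> List[str]:
    """All runs of n consecutive characters in s, in order of position."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def extract_similar(query: str, strings: List[str], d: int, n: int) -> List[str]:
    """Return the strings whose edit distance from query is at most d."""
    m = len(query)
    if d < 1 or n < 1 or n * d >= m:
        raise ValueError("n must be smaller than m / d")
    c = m + 1 - (d + 1) * n  # required number of matching n-grams
    if c < 1:
        raise ValueError("c must be at least 1 for the filter to be usable")
    grams = ngrams(query, n)
    # Rough narrowing: keep strings that contain at least c of the query's n-grams.
    candidates = [s for s in strings if sum(g in s for g in grams) >= c]
    # Verification: run the costly edit distance check on the candidates only.
    return [s for s in candidates if edit_distance(query, s) <= d]


names = ["fujitsu", "fujitu", "fujitsuu", "hitachi", "fujifilm"]
print(extract_similar("fujitsu", names, d=1, n=2))
# ['fujitsu', 'fujitu', 'fujitsuu']
```

With m = 7, d = 1, and n = 2 the threshold is c = 4, so "hitachi" (one shared 2-gram) and "fujifilm" (three) are discarded before any edit distance is computed. A production system would look up the n-grams in an index rather than scanning every string as this sketch does.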

DESCRIPTION OF SYMBOLS
1 Information processing apparatus
11 Control unit
12 Nonvolatile storage unit
13 Volatile storage unit
100 Name identification source
101 Name identification destination
102 Rough narrowing process
102a Rough narrowing definition
102b Search result
103 Name identification process
103a Name identification definition
111 Rough narrowing definition generation unit
113 Rough narrowing processing unit
114 Name identification unit
121 Name identification source DB
122 Name identification destination DB
123 Name identification definition
124 Rough narrowing definition
131 Rough narrowing processing result
132 Name identification processing result
1010 CPU
1020 Interface device
1030 Display device
1040 Input device
1050 Drive device
1060 Auxiliary storage device
1070 Memory device
1080 Bus
1090 Recording medium
m Minimum number of characters of the comparison source character string (first number)
n Value of n in "n-gram" (second number)
c Index match count (third number)
d The d in "edit distance within d characters" (fourth number)
r The r in "edit distance ratio within r"

Claims

An extraction program that causes a computer to execute a process comprising:
extracting one or more partial character strings that are included in a predetermined character string and that have fewer characters than the quotient obtained by dividing the number of characters of the predetermined character string by a predetermined edit distance; and
extracting, from a character string group, character strings that include any of those extracted partial character strings that are equal to or less than the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined edit distance and 1 and a natural number smaller than the quotient.

The extraction program according to claim 1, further causing the computer to execute a process of determining, for each character string extracted from the character string group, whether its edit distance from the predetermined character string is equal to or less than the predetermined edit distance.

The extraction program according to claim 2, wherein a character string whose edit distance from the predetermined character string is equal to or less than the predetermined edit distance is a character string obtained by editing the predetermined character string a number of times equal to or less than the predetermined edit distance.

An extraction method that causes a computer to execute a process comprising:
extracting one or more partial character strings that are included in a predetermined character string and that have fewer characters than the quotient obtained by dividing the number of characters of the predetermined character string by a predetermined edit distance; and
extracting, from a character string group, character strings that include any of those extracted partial character strings that are equal to or less than the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined edit distance and 1 and a natural number smaller than the quotient.

An extraction apparatus comprising:
first extraction means for extracting one or more partial character strings that are included in a predetermined character string and that have fewer characters than the quotient obtained by dividing the number of characters of the predetermined character string by a predetermined edit distance; and
second extraction means for extracting, from a character string group, character strings that include any of those extracted partial character strings that are equal to or less than the difference obtained by subtracting, from the sum of the number of characters of the predetermined character string and 1, the product of the sum of the predetermined edit distance and 1 and a natural number smaller than the quotient.