JP6143638B2

JP6143638B2 - Data processing apparatus and data processing method

Info

Publication number: JP6143638B2
Application number: JP2013216568A
Authority: JP
Inventors: 浩司森川; 勝敏高梨; 聡宗形
Original assignee: 株式会社日立ソリューションズ東日本
Priority date: 2013-10-17
Filing date: 2013-10-17
Publication date: 2017-06-07
Anticipated expiration: 2033-10-17
Also published as: JP2015079380A

Description

The present invention relates to a data processing apparatus, and more particularly to a technique for extracting data that contains a metacharacter expression, that is, an expression meaning that one numerical value is selected from a plurality of numerical values.

In general, a customer who orders a product from a product catalog often specifies the product by its model number. When an order contains an input error (for example, it specifies a product that is not in the catalog), the order-receiving server can extract candidates for the correct product name from the catalog database by a search based on character string similarity (see, for example, Patent Document 1).

In product catalogs of industrial parts, part of a model number sometimes represents a processing specification. To cover the range of conditions under which the part can be processed, that part of the model number is written as a "metacharacter expression": the expression states the range of the processing specification, and one numerical value within that range is selected as the processing condition.

As an example of a model number with a metacharacter expression, in A{1,1.5,2,2.5,3}B the expression {1,1.5,2,2.5,3} means that one value is selected from the five values 1, 1.5, 2, 2.5, and 3. When placing an order, the customer enters, for example, A1.5B.

As another example, in A{1..3(0.5)}B the expression {1..3(0.5)} means that one value is selected from the values from 1 to 3 in steps of 0.5. In this case {1..3(0.5)} has the same meaning as {1,1.5,2,2.5,3}.
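As a minimal sketch of how the two notations relate, the following Python snippet expands a metacharacter expression into the values it allows. The function names `expand_meta` and `expand_model`, the regular expression, and the use of ASCII characters in place of the catalog's full-width characters are assumptions made for illustration; the patent defines only the notation.

```python
import re
from decimal import Decimal

NUM = r"\d+(?:\.\d+)?"
# Range form {lo..hi(step)}; the pattern is an assumption about the exact syntax.
RANGE_RE = re.compile(r"^\{(" + NUM + r")\.\.(" + NUM + r")\((" + NUM + r")\)\}$")


def expand_meta(expr: str) -> list[Decimal]:
    """Return every numeric value that a metacharacter expression allows."""
    m = RANGE_RE.match(expr)
    if m:
        lo, hi, step = (Decimal(g) for g in m.groups())
        values = []
        v = lo
        while v <= hi:  # Decimal avoids binary floating-point drift
            values.append(v)
            v += step
        return values
    # Enumeration form {v1,v2,...}
    return [Decimal(tok) for tok in expr.strip("{}").split(",")]


def expand_model(prefix: str, expr: str, suffix: str) -> list[str]:
    """Build the single-product model numbers covered by one catalog entry."""
    return [f"{prefix}{format(v.normalize(), 'f')}{suffix}" for v in expand_meta(expr)]


print(expand_model("A", "{1..3(0.5)}", "B"))
# ['A1B', 'A1.5B', 'A2B', 'A2.5B', 'A3B']
```

Both `{1..3(0.5)}` and `{1,1.5,2,2.5,3}` expand to the same five values, which is the equivalence stated above.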

Patent Document 1: International Publication WO 2007/132564

In general, character string similarity is calculated for a character string that represents a single product. Therefore, when the database contains data with metacharacter expressions, it is very difficult to calculate the similarity for a character string written with such an expression.

Consequently, to calculate character string similarity for a model number with a metacharacter expression, the model number had to be decomposed into character strings that each represent a single product. For example, the model number A{1..3(0.5)}B has to be decomposed into A1B, A1.5B, A2B, A2.5B, and A3B.

However, decomposing model numbers with metacharacter expressions into single-product model number strings makes the number of items grow explosively, so the decomposition was not practical in terms of computational resources and computation time.

When searching for similar character strings, a binary search tree indexed by distance, such as a VP (vantage-point) tree, can in some cases search faster than an exhaustive search. However, whether a VP tree can be used for character strings that contain metacharacter expressions has not yet been examined.

The present invention has been made to solve the above problems, and its object is to provide a data processing apparatus and a data processing method capable of extracting data that contains a metacharacter expression meaning that one numerical value is selected from a plurality of numerical values.

To achieve the above object, a data processing apparatus according to the present invention extracts, from a database, candidate data similar to input target data. The database stores data that includes a metacharacter expression, meaning that one numerical value is selected from a plurality of numerical values, and data that does not include a metacharacter expression. The apparatus comprises: a unit character dividing unit that, when the candidate data is to be extracted from data including a metacharacter expression (for example, the data 74B in FIG. 4, which includes metacharacter expressions), converts each numerical value outside a metacharacter expression in such a character string into a metacharacter expression (that is, a numeric selection expression with a single option), treats each metacharacter expression as one unit character, and divides the target data and the database data so that each numerical value and each character other than a numerical value is one unit character, that, when the candidate data is to be extracted from data not including a metacharacter expression (for example, the data 74A in FIG. 4, which does not include metacharacter expressions), divides the character strings of the target data and the database data so that each character is one unit character, and that calculates the number of resulting unit characters as the unit character count; a binary search tree creation unit (for example, the VP tree creation unit 60) that classifies the database data into data including a metacharacter expression and data not including one, and creates a binary search tree (for example, a VP tree) for each unit character count from the classified data; a binary search tree extraction unit (for example, the VP tree extraction unit 32) that extracts, as candidate data, the binary search trees that fall within a predetermined range based on the unit character count of the target data; a distance calculation unit that calculates the distance between the target data and the database data; and a candidate data extraction unit that extracts, as candidate data, those items among the candidate data extracted by the binary search tree extraction unit whose calculated distance is equal to or less than a predetermined value.

According to the present invention, data containing a metacharacter expression, meaning that one numerical value is selected from a plurality of numerical values, can be extracted.

A diagram showing the data processing apparatus according to an embodiment of the present invention.
A diagram showing an example of the catalog model number DB.
A diagram showing an example of the erroneous input correction history DB.
A diagram showing an example of the unit-character-count-ordered catalog model number DB.
A flowchart showing the input determination process.
A diagram showing an example of the input determination process.
A diagram showing an example of creating a VP tree.
A diagram showing the median values of the edit distances in the VP tree.
A diagram showing the relationship among the edit distances of the character strings q, p, and l.
A diagram showing the relationship among the edit distances of the character strings q, p, and r.
A diagram showing an example in which pruning is not possible even though the left pruning condition holds.
A diagram showing an example in which pruning is not possible even though the right pruning condition holds.
A flowchart showing the erroneous input process.
A diagram showing an example of extracting VP trees.
A flowchart showing the VP tree search process.
A diagram showing an example of searching a VP tree.
A diagram showing an example of the client's display screen.

Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 1 is a diagram showing a data processing apparatus according to an embodiment of the present invention. The data processing apparatus 100, which verifies a product model number input from a client 200, comprises: a database 70 of the product catalog; a processing unit 10 that searches whether the input target data is stored in the database 70 and, if it is not, extracts candidate data similar to the target data from the database 70; an input unit 81 for entering values such as the threshold used in the extraction; a display unit 82 that displays processing results; and a communication unit 85 that communicates with the client 200 and other devices via a network 300.

The database 70 is implemented by an HDD (hard disk drive) or a similar device. The processing unit 10 is realized by a CPU (central processing unit) executing programs held in RAM (random access memory) or on the HDD. The input unit 81 is a device such as a keyboard or mouse for giving instructions to the computer, for example an instruction to start a program. The display unit 82 is a display or the like and shows the execution status and results of processing by the data processing apparatus 100. The communication unit 85 exchanges various data and commands with other servers and the like via the network 300.

The database 70 stores the following: the catalog DB 71, which holds the detailed information of the product catalog such as product name, model number, manufacturer name, price, delivery date, and product features; the catalog model number DB 72 (see FIG. 2), which holds the model numbers of the catalog DB 71 classified by predetermined category; the erroneous input correction history DB 73 (see FIG. 3), which records the history of how many times erroneous inputs were corrected; the unit-character-count-ordered catalog model number DB 74 (see FIG. 4), which is the catalog model number DB 72 sorted by unit character count; the extraction condition information 75, which is information (for example, thresholds) used when extracting candidate data similar to the target data from the database 70; and the catalog model number VP tree DB 76 (catalog model number binary search tree DB), which is created from the unit-character-count-ordered catalog model number DB 74. A VP tree is a kind of binary search tree, which is the most basic tree structure among search trees.

The processing unit 10 has a plurality of programs: a search unit 20 (input determination processing unit) that searches whether the input target data is stored in the database 70; an extraction unit 30 (erroneous input processing unit) that, when the target data is not stored in the database 70, extracts candidate data similar to the target data from the database 70; a candidate presentation unit 40 that presents the extracted candidate data; a history update unit 50 that updates the erroneous input correction history DB 73; and a VP tree creation unit 60 (binary search tree creation unit) that creates the catalog model number VP tree DB 76 for each unit character count.

The search unit 20 has a unit character dividing unit 21 and a unit character comparison unit 22. When the candidate data is to be extracted from data including a metacharacter expression, the unit character dividing unit 21 converts each numerical value outside a metacharacter expression into a metacharacter expression (that is, a numeric selection expression with a single option), treats each metacharacter expression as one unit character, and divides the target data and the database data so that each numerical value and each character other than a numerical value is one unit character. When the candidate data is to be extracted from data not including a metacharacter expression, it divides the character strings of the target data and the database data so that each character is one unit character. In both cases it calculates the number of resulting unit characters as the unit character count. The unit character comparison unit 22 compares whether a unit character of the target data matches a unit character of the comparison target data.

The extraction unit 30 has the unit character dividing unit 21 described above, a VP tree extraction unit 32 (binary search tree extraction unit) that extracts the VP trees whose unit character count falls within a predetermined range, a distance calculation unit 33 that calculates the distance between the target data and the comparison target data, and a candidate data extraction unit 34 that extracts, as candidate data, those items among the candidate data extracted by the VP tree extraction unit 32 whose calculated distance is equal to or less than a predetermined value.

The data processing apparatus 100 of this embodiment has the following features, which are described in detail later.
(1) The VP tree creation unit 60 classifies the data in the database 70 into data including a metacharacter expression and data not including one, and creates a VP tree for each class of data.
(2) When a VP tree is built for data including metacharacter expressions, a numerical value is treated as one unit character and an edit distance that is a Hausdorff distance is used. In this case, the unit character dividing unit 21 converts each numerical value outside a metacharacter expression into a metacharacter expression (that is, a numeric selection expression with a single option) and treats it as one unit character, and treats every other character of the string as one unit character.

FIG. 2 is a diagram showing an example of the catalog model number DB. The catalog model number DB 72 holds the model numbers of a large number of products. For example, data without metacharacter expressions includes 123345, P10-5, and so on, and data with metacharacter expressions includes C{10,11}-D3, C{1,10}-B{2,3}, and so on. Many product model numbers are written with the metacharacter expressions described above.

Specifically, for the model number D51-M{0.1..100(0.01)}-LS{10,20,45,50,90}, the numerical value outside the metacharacter expressions (here, 51) is converted into a metacharacter expression (that is, a numeric selection expression with a single option) and treated as one unit character, and every other character of the string is treated as one unit character. The data thus becomes D{51}-M{0.1..100(0.01)}-LS{10,20,45,50,90}, which consists of the character "D", the metacharacter expression "{51}", the character "-", the character "M", the metacharacter expression "{0.1..100(0.01)}", the character "-", the character "L", the character "S", and the metacharacter expression "{10,20,45,50,90}", so its unit character count is 9. A metacharacter expression is a single unit character because it stands for one numerical value.

The metacharacter expression {0.1..100(0.01)} means that one value is specified from the values from 0.1 to 100 in steps of 0.01. The metacharacter expression {10,20,45,50,90} means that one of the listed values is specified.
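To make the size of such an entry concrete, the short calculation below counts how many single-product model numbers the example D51-M{0.1..100(0.01)}-LS{10,20,45,50,90} stands for, without enumerating them. The helper `range_count` is a hypothetical name introduced here; it assumes the range is inclusive at both ends, as the description of {1..3(0.5)} suggests.

```python
from decimal import Decimal


def range_count(lo: str, hi: str, step: str) -> int:
    """Number of values lo, lo+step, lo+2*step, ... that do not exceed hi."""
    return int((Decimal(hi) - Decimal(lo)) // Decimal(step)) + 1


m_choices = range_count("0.1", "100", "0.01")   # choices for {0.1..100(0.01)}
ls_choices = len([10, 20, 45, 50, 90])          # choices for {10,20,45,50,90}
total = m_choices * ls_choices                  # the two selections are independent

print(m_choices, ls_choices, total)
# 9991 5 49955
```

Under these assumptions a single catalog entry corresponds to 49,955 concrete model numbers, which illustrates why decomposing every entry before computing similarities is costly.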

FIG. 3 is a diagram showing an example of the erroneous input correction history DB. The erroneous input correction history DB 73 holds correction history information that records how many times erroneous inputs were corrected, and consists of a model number and an erroneous input count. For example, the model number C{10,11}-3D has an erroneous input count of 50, and the model number C{1,10}-B{2,3} has an erroneous input count of 100.

FIG. 4 is a diagram showing an example of the unit-character-count-ordered catalog model number DB. The unit character dividing unit 21 (see FIG. 1) classifies the database data into data 74A, which does not include metacharacter expressions, and data 74B, which does, and sorts the catalog model number DB 72 by unit character count in advance for each class. Each record of the unit-character-count-ordered catalog model number DB 74 consists of the unit character count, the model number after unit character division (with unit characters separated by commas), and the original model number.

For the data 74A without metacharacter expressions, each character of the string is one unit character. For example, the model number P10-5 is divided into P,1,0,-,5 and its unit character count is 5. The model number 123345 is divided into 1,2,3,3,4,5 and its unit character count is 6.

For the data 74B with metacharacter expressions, each numerical value outside a metacharacter expression is converted into a metacharacter expression (that is, a numeric selection expression with a single option), each metacharacter expression is treated as one unit character, and the target data and the database data are divided so that each numerical value and each character other than a numerical value is one unit character. For example, when the model number after unit character division is A,B,C,-,L,{1..10(0.1)},-,W,{2..5(0.5)},-,H,{1,2,3}, the unit character count is 12. When the model number after unit character division is C,{10,11},-,D,3, the unit character count is 5.

In FIG. 4, the model number C{10,11}-D3 becomes C,{10,11},-,D,{3} after unit character division: the numerical value "3", which lies outside a metacharacter expression, has been rewritten as the metacharacter expression "{3}".
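The division rules for the two classes of catalog data can be sketched as follows. This is an illustrative reading of the description, not the patented implementation: the function names `split_plain` and `split_meta`, the tokenizing regular expression, and the ASCII notation are assumptions.

```python
import re

# One token is: a braced metacharacter expression, a number (with an optional
# decimal part), or any other single character.
TOKEN_RE = re.compile(r"\{[^}]*\}|\d+(?:\.\d+)?|.")


def split_plain(model: str) -> list[str]:
    """Data without metacharacter expressions (74A): one unit per character."""
    return list(model)


def split_meta(model: str) -> list[str]:
    """Data with metacharacter expressions (74B).

    Each braced expression is one unit, a bare number is rewritten as the
    one-choice expression {n}, and every other character is one unit.
    """
    units = []
    for tok in TOKEN_RE.findall(model):
        if tok[0].isdigit():
            tok = "{" + tok + "}"
        units.append(tok)
    return units


print(split_plain("P10-5"))        # ['P', '1', '0', '-', '5']
print(split_meta("C{10,11}-D3"))   # ['C', '{10,11}', '-', 'D', '{3}']
print(len(split_meta("D51-M{0.1..100(0.01)}-LS{10,20,45,50,90}")))  # 9
```

The unit character count is simply the length of the returned list, which is the key by which the catalog model numbers are sorted.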

(Input determination process)
FIG. 5 is a flowchart showing the input determination process, and FIG. 6 is a diagram showing an example of it. The description refers to FIG. 1 as appropriate. In the input determination process S110, when the search unit 20 receives order data from the client 200, it takes the model number contained in the order data as the target data and searches whether it is stored in the database 70. The target data is, for example, "A10-B2" shown in FIG. 6(a).

The unit character dividing unit 21 divides the target data into unit characters according to the following rules (process S111).
(1) When the search destination does not include metacharacter expressions (for example, the data 74A): each character of the target data is one unit character.
(2) When the search destination includes metacharacter expressions (for example, the data 74B): each numerical value in the target data is one unit character.
That is, as shown in FIG. 6(b), case (1) yields "A,1,0,-,B,2" and case (2) yields "A,10,-,B,2".
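Process S111 can be sketched in a few lines. The function name `split_target` and its boolean parameter are hypothetical; the sketch only assumes that a "numerical value" is a run of digits with an optional decimal part.

```python
import re


def split_target(target: str, destination_has_meta: bool) -> list[str]:
    """Divide the target data into unit characters (process S111)."""
    if destination_has_meta:
        # Rule (2): a whole number is one unit, every other character is one unit.
        return re.findall(r"\d+(?:\.\d+)?|.", target)
    # Rule (1): every character is one unit.
    return list(target)


print(split_target("A10-B2", destination_has_meta=False))
# ['A', '1', '0', '-', 'B', '2']
print(split_target("A10-B2", destination_has_meta=True))
# ['A', '10', '-', 'B', '2']
```

The same target therefore has a unit character count of 6 against the data 74A and 5 against the data 74B.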

The unit character dividing unit 21 reads the corresponding catalog model number DB 72 into memory and stores it in a working list (process S112).

The unit character comparison unit 22 repeats processes S114 to S116 for each unit character from the beginning of the target data (process S113). It first determines whether the list contains a unit character that corresponds to the current unit character of the target data (process S114). If it does (process S114, Yes), the list is narrowed down (process S115) and processing advances to the next unit character (process S116). If it does not (process S114, No), processing proceeds to the erroneous input process (process S118).

A concrete example is described with reference to FIG. 6. Narrowing by the first unit character gives FIG. 6(c): the data matching "A" of the target data remains, for example A10-XY, A{10,13}-D{2,3}, and ABC-L{1..10(0.1)}-W{2..5(0.5)}-H{1,2,3}. Narrowing by the second unit character gives FIG. 6(d). In case (1) of process S111, the data matching "1" remains, for example A10-XY. In case (2) of process S111, the data matching "10" remains, for example A{10,13}-D{2,3}; note that the metacharacter expression {10,13} includes "10".

Narrowing by the third unit character gives FIG. 6(e). The data matching "0" in case (1) of process S111, or "-" in case (2), remains, which leaves A10-XY and A{10,13}-D{2,3}, unchanged from FIG. 6(d). Narrowing by the fourth unit character leaves A10-XY, as shown in FIG. 6(f), and narrowing by the fifth unit character leaves no matching data in the list, as shown in FIG. 6(g). In the example of FIG. 6, the target data "A10-B2" is therefore determined to be an erroneous input at the fifth unit character.
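The unit-by-unit narrowing of processes S113 to S116 might look like the sketch below for the data 74B. It is an assumption-laden illustration: `unit_matches` and `narrow` are hypothetical names, the catalog entries are supplied already divided into unit characters, and a range expression is taken to match a value that lies on its step grid.

```python
import re
from decimal import Decimal

NUM = r"\d+(?:\.\d+)?"
RANGE_RE = re.compile(r"^\{(" + NUM + r")\.\.(" + NUM + r")\((" + NUM + r")\)\}$")


def unit_matches(query_unit: str, catalog_unit: str) -> bool:
    """True if one unit of the target fits one unit of a catalog entry."""
    if not catalog_unit.startswith("{"):
        return query_unit == catalog_unit            # ordinary character
    if not re.fullmatch(NUM, query_unit):
        return False                                 # an expression needs a number
    value = Decimal(query_unit)
    m = RANGE_RE.match(catalog_unit)
    if m:                                            # {lo..hi(step)}
        lo, hi, step = (Decimal(g) for g in m.groups())
        return lo <= value <= hi and (value - lo) % step == 0
    choices = catalog_unit[1:-1].split(",")          # {v1,v2,...}
    return value in [Decimal(c) for c in choices]


def narrow(target_units: list[str], catalog: list[list[str]]) -> list[list[str]]:
    """Filter the working list one unit at a time; [] means erroneous input."""
    remaining = list(catalog)
    for i, unit in enumerate(target_units):
        remaining = [e for e in remaining if i < len(e) and unit_matches(unit, e[i])]
        if not remaining:
            return []                                # corresponds to S114 "No"
    # Entries longer than the target are only partial matches.
    return [e for e in remaining if len(e) == len(target_units)]


catalog_74b = [
    ["A", "{10,13}", "-", "D", "{2,3}"],
    ["A", "B", "C", "-", "L", "{1..10(0.1)}", "-", "W", "{2..5(0.5)}", "-", "H", "{1,2,3}"],
]
print(narrow(["A", "10", "-", "B", "2"], catalog_74b))   # []
print(narrow(["A", "13", "-", "D", "3"], catalog_74b))
# [['A', '{10,13}', '-', 'D', '{2,3}']]
```

An empty result corresponds to handing the target over to the erroneous input process; a non-empty result means the model number exists in the catalog.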

If, on the other hand, the target data "A10-B2" is found, the model number is correct and processing proceeds to the order receiving process (process S117). The order receiving process is handled by an order receiving server (not shown), and its description is omitted.

(VP tree creation process)
The creation of a VP tree for data without metacharacter expressions, which is the basis of this embodiment, is described first. FIG. 7 is a diagram showing an example of creating a VP tree, and FIG. 8 is a diagram showing the median values of the edit distances in the VP tree. A VP tree is a kind of binary search tree that uses the edit distance as its index. A character string is referred to here as a node.

FIG. 7 shows an example of a VP tree consisting of the seven nodes "ABCDEFG", "ABC4EFG", "ABC45FG", "AB345FG", "A23456G", "A234567", and "1234567". The VP tree in FIG. 7 is created with "ABCDEFG" chosen from the seven nodes as the reference model number.

The median M of the edit distances between the node "ABCDEFG" and all six of its child nodes is 4, so nodes with an edit distance of less than 4 are placed on the left and nodes with an edit distance of 4 or more on the right. The median M of the edit distances between the node "ABC4EFG" and its two child nodes is 1.5, so nodes with an edit distance of less than 1.5 are placed on its left and nodes with an edit distance of 1.5 or more on its right. Likewise, the median M of the edit distances between the node "A23456G" and its two child nodes is 1.5, and its children are placed in the same way. FIG. 8 illustrates the relationship between each node and its median M. Existing techniques such as the Hamming distance (signal distance) or the Levenshtein distance (edit distance) can be used to calculate the distance between data.
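The values of M quoted for FIG. 7 can be reproduced with a standard Levenshtein distance and the statistical median. The `levenshtein` function below is a textbook dynamic-programming version chosen for this sketch; the text only says that existing distance measures may be used.

```python
from statistics import median


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution or match
        prev = cur
    return prev[-1]


root = "ABCDEFG"
others = ["ABC4EFG", "ABC45FG", "AB345FG", "A23456G", "A234567", "1234567"]
dists = [levenshtein(root, s) for s in others]
M = median(dists)
left = [s for s, d in zip(others, dists) if d < M]
right = [s for s, d in zip(others, dists) if d >= M]

print(dists)   # [1, 2, 3, 5, 6, 7]
print(M)       # 4.0
print(left)    # ['ABC4EFG', 'ABC45FG', 'AB345FG']
print(right)   # ['A23456G', 'A234567', '1234567']
```

Repeating the calculation from "ABC4EFG" to "ABC45FG" and "AB345FG" gives distances 1 and 2, so the median at that node is 1.5, in agreement with the description.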

The edit distance is generally defined to satisfy the following.
(1) The edit distance is 0 if and only if the two character strings are equal.
(2) It is symmetric. That is, the edit distance d between two character strings s and t satisfies
d(s,t) = d(t,s).
(3) The triangle inequality holds. For an arbitrary character string u,
d(s,t) ≤ d(s,u) + d(u,t).

The following properties hold for every node of the VP tree created above that has child nodes. Hereinafter, the term "child node" also covers grandchildren and further descendants.
(a) A node holds the median M of the edit distances between itself and all of its child nodes.
(b) The edit distance between a node and any of its left child nodes is less than the node's median M.
(c) The edit distance between a node and any of its right child nodes is greater than or equal to the node's median M.
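Properties (a) to (c) can be illustrated with a minimal Python sketch that rebuilds the seven-node example. The names `levenshtein`, `VPNode`, and `build` are hypothetical, and taking the first item of each list as the reference node is an assumption of the sketch, not something the description specifies.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


def levenshtein(s: str, t: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


@dataclass
class VPNode:
    value: str
    m: Optional[float] = None          # median distance to all descendants
    left: Optional["VPNode"] = None    # descendants with distance < m
    right: Optional["VPNode"] = None   # descendants with distance >= m


def build(items: List[str]) -> Optional[VPNode]:
    """Assumption: the first item of each list acts as the reference node."""
    if not items:
        return None
    vantage, rest = items[0], items[1:]
    node = VPNode(vantage)
    if rest:
        dists = [levenshtein(vantage, x) for x in rest]
        node.m = median(dists)
        node.left = build([x for x, d in zip(rest, dists) if d < node.m])
        node.right = build([x for x, d in zip(rest, dists) if d >= node.m])
    return node


MODELS = ["ABCDEFG", "ABC4EFG", "ABC45FG", "AB345FG",
          "A23456G", "A234567", "1234567"]
tree = build(MODELS)
print(tree.m, tree.left.value, tree.left.m, tree.right.value, tree.right.m)
```

With this input the root median and the two second-level medians come out as 4, 1.5, and 1.5, in line with the values given for FIG. 7.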

FIG. 9 shows the relationship among the edit distances of character strings q, p, and l. Let q be the search character string (the target data), p the character string of a node in the VP tree, and l the character string of any left child node of node p. When the VP tree is searched for character strings whose edit distance from the search character string is D or less, the left branch of the node need not be searched if D + M < d(q, p) holds (hereinafter called left pruning).

When D + M < d(q, p) holds, no character string on the left side of node p has an edit distance of D or less from the search character string q. In FIG. 9, every left child node of node p lies inside the radius M around node p. By the triangle inequality, d(q, l) ≥ d(q, p) − d(p, l) > d(q, p) − M > D, so every character string on the left side of p is farther than D from q.

FIG. 10 shows the relationship among the edit distances of character strings q, p, and r. Let q be the search character string (the target data), p the character string of a node in the VP tree, and r the character string of any right child node of node p. When the VP tree is searched for character strings whose edit distance from the search character string is D or less, the right branch of the node need not be searched if D + d(q, p) ≤ M holds (hereinafter called right pruning).

When D + d(q, p) ≤ M holds, no character string on the right side of node p has an edit distance of D or less from the search character string q. As shown in FIG. 10, every right child node of node p lies outside the radius M around node p, so when D + d(q, p) ≤ M holds, every character string on the right side of p is farther than D from q.

When the left pruning condition of FIG. 9 or the right pruning condition of FIG. 10 holds, a VP tree search computes fewer edit distances than an exhaustive search, and is therefore faster. If no pruning at all is possible in the VP tree, the search degenerates into an exhaustive search.
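The effect of the two pruning conditions can be sketched in Python as a range search over plain strings. All names (`levenshtein`, `build`, `search`, `stats`) are hypothetical, and the first item of each list is assumed to be the reference node. One deliberate deviation: the sketch tests the right side with the strict form D + d(q, p) < M, a slightly more conservative choice than the condition stated above, so that a string at exactly distance D is never skipped.

```python
from statistics import median


def levenshtein(s, t):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def build(items):
    """Build a VP tree as nested dicts; the first item is the reference node."""
    if not items:
        return None
    p, rest = items[0], items[1:]
    dists = [levenshtein(p, x) for x in rest]
    m = median(dists) if dists else None
    return {
        "p": p,
        "m": m,
        "left": build([x for x, d in zip(rest, dists) if d < m]),
        "right": build([x for x, d in zip(rest, dists) if d >= m]),
    }


def search(node, q, D, hits, stats):
    """Collect every string within edit distance D of q, pruning where possible."""
    if node is None:
        return
    d = levenshtein(q, node["p"])
    stats["computed"] += 1
    if d <= D:
        hits.append(node["p"])
    if node["m"] is None:          # leaf: nothing below
        return
    if not (D + node["m"] < d):    # left pruning condition not met: descend
        search(node["left"], q, D, hits, stats)
    if not (D + d < node["m"]):    # right pruning (strict form) not met: descend
        search(node["right"], q, D, hits, stats)


MODELS = ["ABCDEFG", "ABC4EFG", "ABC45FG", "AB345FG",
          "A23456G", "A234567", "1234567"]
tree = build(MODELS)
hits, stats = [], {"computed": 0}
search(tree, "ABC4EFX", 1, hits, stats)
brute = [x for x in MODELS if levenshtein("ABC4EFX", x) <= 1]
print(hits, stats["computed"], len(MODELS))
```

For the hypothetical query "ABC4EFX" the search returns the same result as the exhaustive scan while computing fewer edit distances, because the right branch of the root is skipped.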

The pruning method described above is applicable to data 74A (see FIG. 4), which contains no metacharacter expressions. However, no prior study has examined whether it is applicable to data 74B (see FIG. 4), which contains metacharacter expressions, so this embodiment examines the question in detail, including the edit distance itself.

(Edit distance)
The edit distance calculated by the distance calculation unit 33 in this embodiment is described below. The edit distance uses the Levenshtein distance, the Damerau-Levenshtein distance, or a similar measure of the similarity between one character string and another. However, there has been no prior example of handling the edit distance for the metacharacter expressions of this embodiment. Because a metacharacter expression part can also be regarded as a set of numerical values, this embodiment examines how to handle it.

The larger the edit distance, the more the two character strings differ. The edit distance is defined by dividing the character strings of the target data into unit characters and taking the minimum number of operations needed to turn one character string into the other by inserting, deleting, or substituting unit characters (transposition may also be considered in some cases). Because a metacharacter expression part stands for a single numerical value, every metacharacter expression part is treated as one unit character when the edit distance is calculated. As already described, when the search destination contains metacharacter expressions, each numeric part of a catalog product model number in the target data is treated as one unit character when the edit distance is calculated.

The rules for calculating the edit distance for a meta-selection expression part are as follows.
(1) Comparing a meta-selection expression part {m1, m2, m3} with a character n: if n matches any of m1, m2, or m3, they are the same unit character; otherwise they are different unit characters.
(2) Comparing two meta-selection expression parts {m1, m2, m3} and {n1, n2, n3, n4}: if the two expressions have exactly the same contents, they are the same unit character; if the contents differ, they are different unit characters.

For example, when the edit distance between A1-B2 and C{10,11}-D3 is calculated, the character "A" is substituted with the character "C", the number "1" is substituted with the metacharacter expression part "{10,11}", the character "-" is identical, the character "B" is substituted with the character "D", and the number "2" is substituted with the number "3". The edit distance is therefore 4.

Similarly, when the edit distance between A1-B2 and C{1,10}-B{2,3} is calculated, the character "A" is substituted with the character "C", the number "1" is contained in the metacharacter expression part "{1,10}", the characters "-" and "B" are identical in both strings, and the number "2" is contained in the metacharacter expression part "{2,3}". The edit distance is therefore 1.

In other words, the rules for comparing unit characters are as follows.
(a) Comparing two numeric unit characters:
If the numbers are the same, add 0 to the edit distance (add nothing).
If the numbers differ, add 1 to the edit distance.
(b) Comparing a numeric unit character with a metacharacter expression part:
If the number is contained in the metacharacter expression part, add 0 to the edit distance.
If the number is not contained in the metacharacter expression part, add 1 to the edit distance.
(c) Comparing two metacharacter expression parts:
If the meta-selection expression parts {m1, ..., m_M} and {n1, ..., n_N} contain exactly the same numbers, add 0 to the edit distance.
If the meta-selection expression parts {m1, ..., m_M} and {n1, ..., n_N} do not contain exactly the same numbers, add 1 to the edit distance.
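Rules (a) to (c) can be sketched in Python as a token-level Levenshtein distance. The helper names (`tokenize`, `same_unit`, `meta_edit_distance`) and the regular expression are assumptions made for illustration; numbers are compared as strings, so "1" and "1.0" would count as different values in this sketch.

```python
import re

# One unit character is a {...} part, a number, or any other single character.
_TOKEN = re.compile(r"\{([^{}]*)\}|(\d+(?:\.\d+)?)|(.)")


def tokenize(s):
    """Split a model number into unit characters; {..} becomes a frozenset."""
    tokens = []
    for m in _TOKEN.finditer(s):
        if m.group(1) is not None:
            tokens.append(frozenset(v.strip() for v in m.group(1).split(",")))
        else:
            tokens.append(m.group(2) or m.group(3))
    return tokens


def same_unit(a, b):
    """Rules (a)-(c): decide whether two unit characters count as equal."""
    a_set, b_set = isinstance(a, frozenset), isinstance(b, frozenset)
    if a_set and b_set:      # (c) two metacharacter parts: exact match only
        return a == b
    if a_set:                # (b) number against a metacharacter part
        return b in a
    if b_set:
        return a in b
    return a == b            # (a) plain unit characters


def meta_edit_distance(s, t):
    """Levenshtein distance over unit characters using same_unit()."""
    a, b = tokenize(s), tokenize(t)
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if same_unit(x, y) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


print(meta_edit_distance("A1-B2", "C{10,11}-D3"))
print(meta_edit_distance("A1-B2", "C{1,10}-B{2,3}"))
```

The two calls reproduce the worked examples above, giving edit distances of 4 and 1 respectively.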

The pruning conditions were examined on the basis of these calculation rules.
FIG. 11 shows an example in which pruning is not permissible even though the left pruning condition is satisfied. Here, character strings containing metacharacter expressions are denoted by uppercase letters. With reference to FIG. 11, the left pruning condition "D (= 1) + M (= 2) < d(q, P) (= 4)" holds, yet pruning would miss L, which lies at distance 1 from q, as explained below.

A character string containing a metacharacter expression (hereinafter, a metacharacter string) such as P "C{10,11}-D3" can be regarded as the set of the character strings "C10-D3" and "C11-D3", which contain no metacharacters (hereinafter, element non-metacharacter strings).

By the definition of the calculation rules, the distance between a non-metacharacter string and a metacharacter string is the minimum of the distances between the non-metacharacter string and the element non-metacharacter strings that make up the metacharacter string. Since
d(A1-B2, C10-D3) = 4,
d(A1-B2, C11-D3) = 4,
it follows that d(A1-B2, C{10,11}-D3) = 4.

Similarly, the metacharacter string L "C{1,10}-B{2,3}" can be regarded as the set of the element non-metacharacter strings "C1-B2", "C1-B3", "C10-B2", and "C10-B3". Since
d(A1-B2, C1-B2) = 1,
d(A1-B2, C1-B3) = 2,
d(A1-B2, C10-B2) = 2,
d(A1-B2, C10-B3) = 3,
it follows that d(A1-B2, C{1,10}-B{2,3}) = 1.

On the other hand, for the distance between metacharacter string P and metacharacter string L, metacharacter expression parts (the {} parts) are compared by whether they match exactly. Regarding the metacharacter string L as the set of "C{1,10}-B2" and "C{1,10}-B3",
d(C{10,11}-D3, C{1,10}-B2) = 3,
d(C{10,11}-D3, C{1,10}-B3) = 2,
so d(P, L) = 2.

Consequently, the triangle inequality d(q, P) ≤ d(q, L) + d(L, P), on which the pruning conditions rest, does not hold (4 > 1 + 2). In other words, under the current calculation rules a distance cannot be properly defined between metacharacter strings.

Note that a distance can be defined among non-metacharacter strings a and b and a metacharacter string C, and the triangle inequality that holds in that case is
d(a, C) ≤ d(a, b) + d(b, C).
However, when D is a metacharacter string, this means that
d(a, C) ≤ d(a, D) + d(D, C)
does not hold.

It follows that a VP tree based on the above calculation rules must not be pruned, and left as it is, the benefit of using a VP tree would be lost. The VP tree construction was therefore improved to follow the rules below.
VP tree construction rule (1): Character strings that contain metacharacter expressions and character strings that do not are separated, and a VP tree is built for each. Because the search character string (the target data) contains no metacharacter expressions, this rule allows the ordinary edit distance calculation and pruning conditions to be applied to the VP tree that contains no metacharacter expressions.

Incidentally, for the edit distance d(s, t) between a character string s (of length LLs) and a character string t (of length LLt), the definition gives
|LLs − LLt| ≤ d(s, t) ≤ LLs + LLt.
From this, |LLs − d(s, t)| ≤ LLt ≤ LLs + d(s, t) is obtained.

Therefore, if the character strings to be searched are grouped by length in advance, then when character strings within edit distance D of a character string q (of length LLq) are extracted, the edit distance needs to be calculated only for character strings whose length LL satisfies |LLq − D| ≤ LL ≤ LLq + D. (Hereinafter, this is called a string-length group search.)

The following rule is therefore also set for VP tree construction.
VP tree construction rule (2): The character strings that contain metacharacter expressions and those that do not are each grouped by string length, and a VP tree is built for each group (string length).

VP tree construction rule (3): In the VP tree for character strings that contain metacharacter expressions, numeric parts that are not metacharacter expressions are also enclosed in the metacharacter symbols "{" and "}".

With this rule, comparison (c) above (comparing two metacharacter expression parts) is applied when the edit distance between metacharacter strings is calculated, and the edit distance between metacharacter strings becomes a Hausdorff distance. As a result, the triangle inequality d(q, U) ≤ d(q, V) + d(V, U) holds for a character string q that contains no metacharacter expressions and metacharacter strings U and V. The triangle inequality that left pruning in a VP tree presupposes therefore holds. Accordingly, even in a VP tree of metacharacter strings, the left side can be pruned when the left pruning condition is satisfied.
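The effect of construction rule (3) on the FIG. 11 example can be checked with a short Python sketch. The helper names and the `wrap_numbers` flag are hypothetical; with the flag set, a plain numeric part such as 3 is stored as the one-element set {3}, which is how the sketch models rule (3).

```python
import re

_TOKEN = re.compile(r"\{([^{}]*)\}|(\d+(?:\.\d+)?)|(.)")


def tokenize(s, wrap_numbers=False):
    """Unit characters of s; wrap_numbers=True applies construction rule (3)."""
    out = []
    for m in _TOKEN.finditer(s):
        if m.group(1) is not None:
            out.append(frozenset(v.strip() for v in m.group(1).split(",")))
        elif m.group(2) is not None:
            out.append(frozenset([m.group(2)]) if wrap_numbers else m.group(2))
        else:
            out.append(m.group(3))
    return tuple(out)


def same_unit(a, b):
    a_set, b_set = isinstance(a, frozenset), isinstance(b, frozenset)
    if a_set and b_set:
        return a == b        # two {..} parts: exact match only
    if a_set:
        return b in a        # number against a {..} part
    if b_set:
        return a in b
    return a == b


def dist(a, b):
    """Levenshtein distance over unit-character tuples."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if same_unit(x, y) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


q = tokenize("A1-B2")                       # the query is never wrapped
P_raw, L_raw = tokenize("C{10,11}-D3"), tokenize("C{1,10}-B{2,3}")
P_norm = tokenize("C{10,11}-D3", wrap_numbers=True)
L_norm = tokenize("C{1,10}-B{2,3}", wrap_numbers=True)

before = (dist(q, P_raw), dist(q, L_raw), dist(L_raw, P_raw))
after = (dist(q, P_norm), dist(q, L_norm), dist(L_norm, P_norm))
print(before, before[0] <= before[1] + before[2])
print(after, after[0] <= after[1] + after[2])
```

Before normalization the three distances are 4, 1, and 2, so d(q, P) ≤ d(q, L) + d(L, P) fails. After normalization d(L, P) becomes 3, because {3} and {2,3} no longer match, and the inequality holds for this example.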

FIG. 12 shows an example in which pruning is not permissible even though the right pruning condition is satisfied. The right pruning condition "D (= 1) + d(q, P) (= 1) ≤ M (= 2)" holds, yet pruning would miss R, which lies at distance 1 from q, as explained below.

The metacharacter string P "A{1,3}-D{2,3}" can be regarded as the set of the element non-metacharacter strings "A1-D2", "A1-D3", "A3-D2", and "A3-D3". Since
d(A1-B2, A1-D2) = 1,
d(A1-B2, A1-D3) = 2,
d(A1-B2, A3-D2) = 2,
d(A1-B2, A3-D3) = 3,
it follows that d(A1-B2, A{1,3}-D{2,3}) = 1.

Similarly, the metacharacter string R "C{1,10}-B{2,3}" can be regarded as the set of the element non-metacharacter strings "C1-B2", "C1-B3", "C10-B2", and "C10-B3". Since
d(A1-B2, C1-B2) = 1,
d(A1-B2, C1-B3) = 2,
d(A1-B2, C10-B2) = 2,
d(A1-B2, C10-B3) = 3,
it follows that d(A1-B2, C{1,10}-B{2,3}) = 1.

On the other hand, for the distance between metacharacter string P and metacharacter string R, metacharacter expression parts (the {} parts) are compared by whether they match exactly, so d(P, R) = 3.

Consequently, the triangle inequality d(P, R) ≤ d(P, q) + d(q, R), on which the pruning conditions rest, does not hold (3 > 1 + 1). In other words, under the current calculation rules a distance cannot be properly defined between metacharacter strings.

Note that a distance can be defined among non-metacharacter strings a and b and a metacharacter string C, and the triangle inequality that holds in that case is
d(a, C) ≤ d(a, b) + d(b, C).
However, when D is a metacharacter string, this means that
d(C, D) ≤ d(C, a) + d(a, D)
does not hold.

The following rule is therefore set for VP tree searches.
VP tree search rule (4): In a search of a metacharacter string VP tree, which contains metacharacter expressions, the left side is pruned if the left pruning condition holds, but the right side is always searched regardless of whether the right pruning condition holds.

This rule does not mean that the entire right half of the VP tree can never be pruned; it means only that the right side of each individual node must not be pruned.

In addition, because of VP tree construction rule (2), the search never becomes exhaustive even when no pruning at all is possible in the VP tree. In the worst case, the search time with the VP tree equals that of a string-length group search, and any pruning that does occur speeds the search up accordingly. How often pruning occurs depends on the data and on the structure of the VP tree at that time.

A VP tree need not be built for every search; once built, it can continue to be used as long as the underlying database does not change. Furthermore, when n is the number of data items used to build the VP tree, the time required to build it is O(n log n). When the database changes, the VP tree can therefore be rebuilt in a nightly batch. In addition, because the data can be partitioned as in VP tree construction rules (1) and (2), both the construction and the search of VP trees can be parallelized per partition, which speeds things up further.

(Erroneous input processing)
FIG. 13 is a flowchart showing the erroneous input process. FIG. 14 shows an example of VP tree extraction. The description refers to FIG. 1 as appropriate. The erroneous input process S130 extracts candidate data similar to the target data from the database 70 when the target data is not stored in the database 70.

On receiving the target data (process S131), the candidate data extraction unit 34 initializes the correction candidate list (process S132).

The VP tree extraction unit 32 reads the threshold stored in the extraction condition information 75 (process S133).

Before the VP tree extraction unit 32 performs the VP tree search process, the VP tree creation unit 60 has already taken the data 74A that contains no metacharacter expressions (without metacharacters) and the data 74B that contains metacharacter expressions (with metacharacters) from the unit-character-count-ordered catalog model number DB 74 (see FIG. 4), classified them by number of unit characters (see FIG. 14(a)), and created a VP tree for each number of unit characters (see FIG. 14(b)).

The candidate data extraction unit 34 extracts correction candidates through the VP tree search process without metacharacter expressions (process S134) and the VP tree search process with metacharacter expressions (process S135). Processes S134 and S135 are described with reference to FIG. 15.

FIG. 15 is a flowchart showing the VP tree search processes: (a) is the VP tree search process without metacharacter expressions, and (b) is the VP tree search process with metacharacter expressions.

In the VP tree search process S134 without metacharacter expressions shown in FIG. 15(a), the unit character dividing unit 21 divides the target data into unit characters (process S151). Here, each character of the character string is one unit character. For example, the target data "A10-B2" is divided into "A, 1, 0, -, B, 2", giving 6 unit characters. Based on the number of unit characters and the threshold, the VP tree extraction unit 32 narrows down the VP trees to those whose number of unit characters satisfies Expression (1) (process S152, see FIG. 14(c)). Process S151 can be omitted if it has already been executed in the input determination process S110.

|number of unit characters of the target data − threshold|
≤ narrowed-down number of unit characters
≤ (number of unit characters of the target data + threshold) ... Expression (1)
Specifically, when the target data has 6 unit characters and the threshold is 1, Expression (1) gives 5 ≤ narrowed-down number of unit characters ≤ 7.

The candidate data extraction unit 34 repeats processes S154 to S156 for each narrowed-down number of unit characters (process S153). In process S154, the candidate data extraction unit 34 performs a depth-first search, pruning the left side of a node when the left pruning condition is satisfied and pruning the right side of a node when the right pruning condition is satisfied. Then, in process S155, the candidate data extraction unit 34 registers the correction candidate data in the correction candidate list.

In the VP tree search process S135 with metacharacter expressions shown in FIG. 15(b), the unit character dividing unit 21 divides the target data into unit characters (process S161). Here, each number is one unit character. For example, the target data "A10-B2" is divided into "A, 10, -, B, 2", giving 5 unit characters. Based on the number of unit characters and the threshold, the VP tree extraction unit 32 narrows down the VP trees to those whose number of unit characters satisfies Expression (1) above (process S162, see FIG. 14(c)). Specifically, when the target data has 5 unit characters and the threshold is 1, Expression (1) gives 4 ≤ narrowed-down number of unit characters ≤ 6. Process S161 can be omitted if it has already been executed in the input determination process S110.
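The two ways of splitting the target data, and the narrowing by Expression (1), can be sketched in Python as follows. The function names are hypothetical, and the regular expression for numeric parts (digits with an optional decimal fraction) is an assumption of the sketch.

```python
import re


def split_per_character(s):
    """Splitting for the search without metacharacter expressions: one character each."""
    return list(s)


def split_with_numbers(s):
    """Splitting for the search with metacharacter expressions: a number is one unit."""
    return re.findall(r"\d+(?:\.\d+)?|.", s)


def unit_count_range(n_units, threshold):
    """Expression (1): |n - threshold| <= narrowed-down count <= n + threshold."""
    return abs(n_units - threshold), n_units + threshold


plain_units = split_per_character("A10-B2")
meta_units = split_with_numbers("A10-B2")
plain_range = unit_count_range(len(plain_units), 1)
meta_range = unit_count_range(len(meta_units), 1)
print(plain_units, plain_range)
print(meta_units, meta_range)
```

Only the VP trees whose number of unit characters falls inside the returned range would then be searched.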

The candidate data extraction unit 34 repeats processes S164 to S166 for each narrowed-down number of unit characters (process S163). In process S164, the candidate data extraction unit 34 performs a depth-first search and prunes the left side of a node when the left pruning condition is satisfied, but it does not test the right pruning condition and always includes the right side in the search. This is one of the features of this embodiment (see FIG. 16). Then, in process S165, the candidate data extraction unit 34 registers the correction candidate data in the correction candidate list.

Although processes S154 and S164 are described as using a depth-first search, they are not limited to it. For example, a breadth-first search may be used instead.

FIG. 16 shows an example of a VP tree search and illustrates process S164 concretely. At node A, if the left pruning condition is satisfied, node B and everything below it can be pruned. At node A, the search proceeds to node C without testing the right pruning condition. At node C, if the left pruning condition is satisfied, node D and everything below it can be pruned. At node C, the search proceeds to node E without testing the right pruning condition. At node E, if the left pruning condition is not satisfied, node F and the nodes below it are searched. Also at node E, the right pruning condition is not tested, and node G and the nodes below it are searched.
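Search rule (4) can be sketched in Python as a depth-first search that prunes only on the left. Everything here is illustrative: the helper names, the small catalog (including the "X..." entries), and the choice of the first item as the reference node are assumptions. Catalog entries are tokenized with plain numbers wrapped as one-element sets, following construction rule (3).

```python
import re
from statistics import median

_TOKEN = re.compile(r"\{([^{}]*)\}|(\d+(?:\.\d+)?)|(.)")


def tokenize(s, wrap_numbers=False):
    out = []
    for m in _TOKEN.finditer(s):
        if m.group(1) is not None:
            out.append(frozenset(v.strip() for v in m.group(1).split(",")))
        elif m.group(2) is not None:
            out.append(frozenset([m.group(2)]) if wrap_numbers else m.group(2))
        else:
            out.append(m.group(3))
    return tuple(out)


def same_unit(a, b):
    a_set, b_set = isinstance(a, frozenset), isinstance(b, frozenset)
    if a_set and b_set:
        return a == b
    if a_set:
        return b in a
    if b_set:
        return a in b
    return a == b


def dist(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if same_unit(x, y) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def build(items):
    """items: list of (text, tokens); the first item is the reference node."""
    if not items:
        return None
    (text, tok), rest = items[0], items[1:]
    ds = [dist(tok, t) for _, t in rest]
    m = median(ds) if ds else None
    left = [it for it, d in zip(rest, ds) if d < m]
    right = [it for it, d in zip(rest, ds) if d >= m]
    return {"text": text, "tok": tok, "m": m,
            "left": build(left), "right": build(right)}


def search_meta(node, q_tok, D, hits, visited):
    if node is None:
        return
    d = dist(q_tok, node["tok"])
    visited.append(node["text"])
    if d <= D:
        hits.append(node["text"])
    if node["m"] is None:
        return
    if not (D + node["m"] < d):       # left pruning condition not met: descend
        search_meta(node["left"], q_tok, D, hits, visited)
    search_meta(node["right"], q_tok, D, hits, visited)  # right is never pruned


CATALOG = ["X{7,8}/Y5", "X{7,8}/Y6", "X{7,9}/Y6",
           "C{1,10}-B{2,3}", "A{10,13}-D{2,3}"]   # hypothetical, 5 unit characters each
items = [(c, tokenize(c, wrap_numbers=True)) for c in CATALOG]
tree = build(items)
q_tok = tokenize("A10-B2")
hits, visited = [], []
search_meta(tree, q_tok, 1, hits, visited)
brute = [c for c, t in items if dist(q_tok, t) <= 1]
print(hits, len(visited), len(CATALOG))
```

In this toy catalog the left branch of the root is pruned, so fewer nodes are visited than the catalog contains, and the hits agree with an exhaustive scan.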

Returning to FIG. 13, the candidate data extraction unit 34 determines whether the correction candidate list contains any correction candidate for the target data (process S136). If there is no correction candidate (process S136, No), it increases the threshold (process S137) and returns to process S134.

If there are correction candidates (process S136, Yes), the candidate presentation unit 40 refers to the erroneous input correction history DB 73, sorts the candidates in descending order of the number of erroneous inputs (process S140), and outputs the sorted correction candidate data to the client 200 (process S141). An output example is described later with reference to FIG. 17.

The candidate presentation unit 40 then determines whether correction candidate data has been selected at the client 200 (process S142). If so (process S142, Yes), it updates the erroneous input correction history DB 73 by adding 1 to the erroneous input count of the selected correction candidate data (process S143) and ends the series of processes. If no correction candidate data is selected (process S142, No), it returns to the input determination process S110 (process S144, see FIG. 5) and ends the process.

FIG. 17 shows an example of the display screen of the client 200 as presented by the candidate presentation unit 40 (see FIG. 1). In the example of FIG. 17(a), the screen displays "The model number A10-B2 does not exist. Did you mean one of these? C{1,10}-B{2,3}, A{10,13}-D{2,3}". The customer can quickly notice the input error in the order and then enters, for example, C10-B2.

FIG. 17(b) is an example in which the candidate presentation unit 40 has expanded the model numbers containing metacharacter expressions in FIG. 17(a) into model numbers without metacharacter expressions. In the example of FIG. 17(b), the screen displays "The model number A10-B2 does not exist. Did you mean one of these? A10-D2, C10-B2, C10-B3, A10-D3, A13-D2, A13-D3, C1-B2, C1-B3". As in FIG. 17(a), the customer can quickly notice the input error in the order. In this example the candidates are output basically in ascending order of distance and in alphabetical order.
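Expanding a model number with metacharacter expressions into its concrete model numbers, as in FIG. 17(b), can be sketched in Python with `itertools.product`. The function name `expand` is hypothetical, and the sketch returns candidates in plain product order rather than reproducing the display order of the figure.

```python
import re
from itertools import product


def expand(model):
    """List every concrete model number represented by a {..} model number."""
    # re.split with a capturing group keeps the {..} parts in the result.
    parts = re.split(r"(\{[^{}]*\})", model)
    choices = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            choices.append([v.strip() for v in part[1:-1].split(",")])
        else:
            choices.append([part])   # literal text: a single choice
    return ["".join(combo) for combo in product(*choices)]


candidates = expand("C{1,10}-B{2,3}") + expand("A{10,13}-D{2,3}")
print(candidates)
```

The two calls together yield the eight concrete model numbers listed for FIG. 17(b).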

This embodiment has described input from the client 200, but an administrator may instead input the target data via the input unit 81.

The data to be compared against the target data may number in the millions of records. Applying the data processing apparatus of this embodiment makes it possible to detect input errors early and to present similar model numbers, which makes it an effective means of improving Internet business. Although this embodiment uses catalog model numbers as the target data, the data processing method is not limited to them and can be applied to any character string that contains numerical values in a metacharacter expression.
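To illustrate how such a character string might be prepared for comparison, the following sketch splits a string into unit characters in the two modes used by the unit character dividing unit 21: one where each metacharacter expression and each numerical value counts as a single unit character, and one where every character is its own unit. The regular expression, the function name, and the treatment of decimals such as 1.5 as one numerical value are assumptions for illustration.

```python
import re
from typing import List

# One unit character is a whole {...} group, a whole number, or any other single character.
UNIT = re.compile(r"\{[^{}]*\}|\d+(?:\.\d+)?|.")


def split_unit_chars(text: str, metachar_mode: bool = True) -> List[str]:
    """Split text into unit characters; the unit character count is len() of the result."""
    if metachar_mode:
        return UNIT.findall(text)
    return list(text)  # every character is one unit character


catalog_units = split_unit_chars("A{1,1.5,2,2.5,3}B")
order_units = split_unit_chars("A1.5B")
```

With this split, the catalog entry and the order string have the same unit character count, which is what allows trees to be grouped by unit character count.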

Description of symbols
10 Processing unit
20 Search unit
21 Unit character dividing unit
22 Unit character comparison unit
30 Extraction unit
32 VP tree extraction unit (binary search tree extraction unit)
33 Distance calculation unit
34 Candidate data extraction unit
40 Candidate presentation unit
50 History update unit
60 VP tree creation unit (binary search tree creation unit)
70 Database
71 Catalog DB
72 Catalog model number DB
73 Erroneous input correction history DB
74 Catalog model number DB ordered by unit character count
75 Extraction condition information
76 Catalog model number VP tree DB (catalog model number binary search tree DB)
100 Data processing apparatus
200 Client
300 Network
S110 Input determination process
S130 Erroneous input process
S134 VP tree search process for data not including a metacharacter expression
S135 VP tree search process for data including a metacharacter expression

Claims

1. A data processing apparatus that extracts, from a database, candidate data similar to input target data, wherein
the database stores data including a metacharacter expression, which means that one numerical value is selected from a plurality of numerical values, and data not including a metacharacter expression,
the data processing apparatus comprising:
a unit character dividing unit that, when the extraction destination of candidate data similar to the target data is data including a metacharacter expression, regards each numerical value outside a metacharacter expression in a character string including a metacharacter expression as a metacharacter expression, treats each metacharacter expression as one unit character, and divides the target data and the data of the database so that each single numerical value and each character other than a numerical value is one unit character; that, when the extraction destination of candidate data similar to the target data is data not including a metacharacter expression, divides the character strings of the target data and the data of the database so that each character is one unit character; and that calculates the number of divided unit characters as the unit character count;
a binary search tree creation unit that classifies the data of the database into data including a metacharacter expression and data not including a metacharacter expression, and generates a binary search tree for each unit character count based on the classified data;
a binary search tree extraction unit that extracts, as the candidate data, the binary search trees whose unit character count falls within a predetermined range based on the unit character count of the target data;
a distance calculation unit that calculates the distance between the target data and the data of the database; and
a candidate data extraction unit that extracts, as candidate data, those of the candidate data extracted by the binary search tree extraction unit whose calculated distance is equal to or less than a predetermined value.

2. The data processing apparatus according to claim 1, wherein
the candidate data extraction unit, in the case of a binary search tree including the metacharacter expression, searches the binary search tree even when the condition
d ≤ M − D
is satisfied, where D is the distance between the target data and the candidate data, M is the intermediate value of the distances of the binary search tree, and d is the distance between the target data and the root node of the binary search tree.

3. The data processing apparatus according to claim 1, wherein
the candidate data extraction unit does not search the binary search tree when the condition
D + M < d
is satisfied, where D is the distance between the target data and the candidate data, M is the intermediate value of the distances of the binary search tree, and d is the distance between the target data and the root node of the binary search tree.

4. A data processing method, wherein
a data processing apparatus includes a database and a processing unit,
the database stores data including a metacharacter expression, which means that one numerical value is selected from a plurality of numerical values, and
the processing unit searches whether input target data is stored in the database and, when the target data is not stored in the database, executes:
a process of regarding each numerical value outside a metacharacter expression in a character string including a metacharacter expression as a metacharacter expression, treating each metacharacter expression as one unit character, dividing the target data and the data of the database so that each single numerical value and each character other than a numerical value is one unit character, and calculating the number of divided unit characters as the unit character count;
a process of creating a binary search tree for each unit character count from the data of the database;
a process of extracting, as candidate data, the binary search trees whose unit character count falls within a predetermined range;
a process of calculating the distance between the target data and the data of the database; and
a process of extracting, from the extracted candidate data, data whose calculated distance is equal to or less than a predetermined value as the candidate data.
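The two search conditions stated in the claims can be sketched as a decision on which subtrees of a VP tree node to visit. The claims do not say which subtree each condition refers to; the mapping below (the inequality D + M < d skips the inner subtree, and d ≤ M − D would normally skip the outer subtree except for trees with metacharacter expressions) follows conventional VP tree search and is an assumption, as are the function and parameter names.

```python
from typing import Tuple


def subtrees_to_search(d: float, M: float, D: float, has_metachar: bool) -> Tuple[bool, bool]:
    """Return (search_inner, search_outer) for one VP tree node.

    d: distance between the target data and the root node (vantage point)
    M: intermediate value of the distances stored at the node
    D: distance bound between the target data and acceptable candidates
    """
    # If D + M < d, the inner subtree is assumed to be skipped.
    search_inner = not (D + M < d)
    # If d <= M - D, the outer subtree would normally be skipped, but a tree
    # that includes metacharacter expressions is searched anyway.
    search_outer = has_metachar or not (d <= M - D)
    return search_inner, search_outer
```

For example, with d = 1, M = 5 and D = 1, a tree without metacharacter expressions would visit only the inner subtree under these assumptions, while a tree with metacharacter expressions would visit both.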