JPWO2007132564A1

JPWO2007132564A1 - Data processing apparatus and method

Info

Publication number: JPWO2007132564A1
Application number: JP2008515434A
Authority: JP
Inventors: 良記伊藤; 広樹谷岡
Original assignee: 株式会社ジャストシステム
Priority date: 2006-05-13
Filing date: 2007-05-14
Publication date: 2009-09-24
Anticipated expiration: 2027-05-14
Also published as: JP5049965B2; WO2007132564A1

Abstract

A technique for quickly searching for data similar to given data is provided. When the data acquisition unit 41 acquires target data, the search unit 42 searches the database 60 to determine whether the target data exists there. If it does not, the element count comparison unit 44 first excludes from the candidates any data whose element count differs from that of the target data by a predetermined value or more. Next, the characteristic value comparison unit 45 calculates the characteristic value of the target data and excludes from the candidates any data for which the pseudo-distance between characteristic values is a predetermined value or more. The used element comparison unit 46 then excludes from the candidates any data whose used elements differ from those of the target data by a predetermined value or more. The edit distance calculation unit 47 calculates the edit distance between each remaining piece of data and the target data, and extracts data whose edit distance is a predetermined value or less as similar data. The candidate presentation unit 48 presents the extracted candidate data to the user.

Description

The present invention relates to data processing technology, and more particularly to a technique for searching for data similar to given data.

One of the functions provided in word processors and similar software is a spell checker, which detects misspellings, typographical errors, and omissions in English words and presents correction candidates. Conventional spell checkers encode the kinds of mistakes users tend to make as rules and, when a misspelling is detected, extract and present correction candidates based on those rules.

However, users do not always make the mistakes that the rules anticipate, so an appropriate correction candidate sometimes cannot be inferred. A technique is needed that extracts similar data with a more objective and faster algorithm.

The present invention has been made in view of these circumstances, and its object is to provide a technique for quickly searching for data similar to given data.

One aspect of the present invention relates to a data processing apparatus. This data processing apparatus comprises: a search unit that searches whether target data is stored in a database; and an extraction unit that, when the target data is not stored in the database, extracts candidate data similar to the target data from the database. The extraction unit includes: a distance calculation unit that calculates a distance between the target data and data stored in the database and extracts, as the candidate data, data for which the distance is smaller than a predetermined upper limit; and a characteristic value comparison unit that, before the distance calculation unit calculates the distance, calculates a characteristic value indicating, for each of a plurality of groups into which the constituent elements of data are classified, whether the target data contains a constituent element belonging to that group, calculates a pseudo-distance between the characteristic value of the target data and the characteristic value of data stored in the database, and excludes data for which the pseudo-distance is larger than the predetermined upper limit from the data for which the distance calculation unit calculates the distance.

According to this aspect, objectively similar data can be extracted by using the distance between data. In addition, by using the pseudo-distance between the characteristic values defined for each piece of data to narrow down the data subject to calculation before the distance between data is calculated, similar data can be extracted quickly.

The characteristic value may be a binary number with as many digits as there are groups. The characteristic value comparison unit may assign one bit to each group and calculate the characteristic value by setting the bit assigned to a group to "1" when the data contains a constituent element belonging to that group and to "0" when it does not. When calculating the pseudo-distance between two characteristic values, the characteristic value comparison unit may take, as the pseudo-distance, the larger of the following two counts: the number of "1" bits in the bit string obtained by bit-inverting one characteristic value and then taking the logical AND of the two, and the number of "1" bits in the bit string obtained by bit-inverting the other characteristic value and then taking the logical AND of the two. Alternatively, when calculating the pseudo-distance between two characteristic values, the characteristic value comparison unit may bit-invert whichever of the two characteristic values has more "1" bits, take the logical AND of the two, and use the number of "1" bits in the resulting bit string as the pseudo-distance.

The extraction unit may further include an element count comparison unit that, before the characteristic value comparison unit calculates the pseudo-distance, excludes data whose difference in the number of constituent elements exceeds the predetermined upper limit from the data for which the characteristic value comparison unit calculates the pseudo-distance. The extraction unit may further include a used element comparison unit that, before the distance calculation unit calculates the distance, calculates the number of constituent elements that are contained in the target data but not in the data stored in the database and the number of constituent elements that are contained in the data stored in the database but not in the target data, and excludes data for which either number exceeds the predetermined upper limit from the data for which the distance calculation unit calculates the distance. This allows similar data to be extracted even faster.

The database may store the data classified by number of constituent elements and by characteristic value. This improves the search efficiency of the database and shortens the time required for a search. It likewise improves the efficiency and speed of extracting similar data from the database.

The data processing apparatus may further comprise a learning unit that moves the target data, data extracted as the candidate data, or data selected by the user from among the candidate data to a higher position within the group of data having the same characteristic value. It may further comprise a learning unit that moves the group of data having the same characteristic value as the target data, as data extracted as the candidate data, or as data selected by the user from among the candidate data to a higher position within the group of data having the same number of constituent elements. This improves the search efficiency of the database. In addition, when candidates for similar data are presented, the display order can be optimized so that frequently used data appears higher.

The distance calculation unit may calculate, as the distance, the minimum number of operations needed to transform one piece of data into the other by inserting, deleting, or substituting constituent elements.

Another aspect of the present invention also relates to a data processing apparatus. This data processing apparatus comprises: a usage frequency calculation unit that acquires a group of data to be stored in a database and calculates, within the acquired group, the usage frequency of the constituent elements that make up each piece of data; a classification generation unit that classifies the constituent elements into a plurality of groups based on the usage frequency; a characteristic value calculation unit that calculates, for each piece of data, a characteristic value indicating for each group whether the data contains a constituent element belonging to that group; and a data sorting unit that classifies the data in the group by the number of elements used and by the characteristic value and stores it in the database.

The characteristic value may be a binary number with as many digits as there are groups. The characteristic value calculation unit may assign one bit to each group and calculate the characteristic value by setting the bit assigned to a group to "1" when the data contains a constituent element belonging to that group and to "0" when it does not.

Yet another aspect of the present invention relates to a data processing method. This data processing method comprises: a step of searching whether target data is stored in a database; and a step of extracting candidate data similar to the target data from the database when the target data is not stored in the database. The extracting step includes: a step of calculating a distance between the target data and data stored in the database and extracting, as the candidate data, data for which the distance is smaller than a predetermined upper limit; and a step of, before the distance is calculated, calculating a characteristic value indicating, for each of a plurality of groups into which the constituent elements of data are classified, whether the target data contains a constituent element belonging to that group, calculating a pseudo-distance between the characteristic value of the target data and the characteristic value of data stored in the database, and excluding data for which the pseudo-distance is larger than the predetermined upper limit from the data for which the distance is calculated.

Any combination of the above constituent elements, and any conversion of the expression of the present invention between a method, an apparatus, a system, and the like, is also valid as an aspect of the present invention.

According to the present invention, a technique for quickly searching for data similar to given data can be provided.

FIG. 1 is a diagram showing the configuration of the data processing apparatus according to the embodiment.
FIG. 2 is a diagram showing the classification of constituent elements.
FIG. 3 is a diagram showing an example of the internal data of the database.
FIG. 4 is a diagram showing the data after the candidates have been narrowed down by the element count comparison unit.
FIG. 5 is a diagram showing the data after the candidates have been narrowed down by the characteristic value comparison unit.
FIG. 6 is a diagram showing the data after the candidates have been narrowed down by the used element comparison unit.
FIG. 7 is a diagram showing the candidate data extracted by the edit distance calculation unit.
FIG. 8 is a flowchart showing the procedure of the data processing method according to the embodiment.

Explanation of symbols

10 data processing apparatus, 30 database generation unit, 31 usage frequency calculation unit, 32 classification generation unit, 33 characteristic value calculation unit, 34 data sorting unit, 41 data acquisition unit, 42 search unit, 43 candidate extraction unit, 44 element count comparison unit, 45 characteristic value comparison unit, 46 used element comparison unit, 47 edit distance calculation unit, 48 candidate presentation unit, 49 learning unit, 60 database.

The data processing apparatus according to the embodiment searches whether target data exists in a database and, if it does not, extracts similar data from the database and presents it. For example, using a dictionary database in which English words are registered, it can provide a spell checker function that checks whether an English word is spelled correctly and presents correction candidates when the spelling is judged to be incorrect. Using a DNA database in which DNA base sequences are registered, it can provide functions for identifying similar genes in different species and for estimating the time elapsed since the species diverged by measuring the distance between those genes. It can also extract similar images, music, and the like from a database of images, music, and the like.

In the present embodiment, to extract similar data, a "distance" between the target data and the data registered in the database is calculated, and data separated by a short distance are judged to be similar. The distance between data may be any measure that reflects the difference between the data; known techniques such as the Hamming distance (signal distance) or the Levenshtein distance (edit distance) can be used. The similarity of a local alignment may also be calculated using the Smith-Waterman algorithm or the like. An example of spell-checking English words using the edit distance is described below.

The edit distance is the minimum number of operations needed to transform one character string into another by inserting, deleting, or substituting characters, and it can generally be calculated with a dynamic programming algorithm. However, the more English words are registered in the dictionary to improve the accuracy of the spell checker, the more words there are for which the edit distance must be calculated. If, to check a single English word for misspelling, the edit distance between that word and every English word registered in the dictionary is calculated before correction candidates are presented, presenting the candidates takes a long time, which may impair the user's convenience rather than improve it.
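The dynamic programming calculation mentioned above can be sketched as follows. This is a minimal illustration of the standard Levenshtein algorithm with a cost of 1 per insertion, deletion, or substitution; the function name `edit_distance` is chosen here for illustration and does not come from the patent.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (cost 1 each)."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]
```

Each call costs time proportional to the product of the two string lengths, which is why running it against every dictionary entry becomes expensive as the dictionary grows.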

The present embodiment proposes a technique that, before the edit distance is actually calculated, excludes data with low similarity from the calculation in advance and calculates the edit distance only for the narrowed-down data, so that correction candidates can be extracted and presented in a short time even as the number of dictionary entries grows. In the present embodiment, a characteristic value is defined for each piece of data, a rough distance between data is measured by calculating the pseudo-distance between characteristic values, and data for which this pseudo-distance exceeds a predetermined upper limit is excluded in advance.

FIG. 1 shows the configuration of the data processing apparatus according to the embodiment. The data processing apparatus 10 includes a database generation unit 30, a data acquisition unit 41, a search unit 42, a candidate extraction unit 43, a candidate presentation unit 48, a learning unit 49, and a database 60. The database generation unit 30 includes a usage frequency calculation unit 31, a classification generation unit 32, a characteristic value calculation unit 33, and a data sorting unit 34. The candidate extraction unit 43 includes an element count comparison unit 44, a characteristic value comparison unit 45, a used element comparison unit 46, and an edit distance calculation unit 47. In terms of hardware components, these elements are realized by the CPU and memory of any computer, a program loaded into the memory, and so on; what is depicted here are the functional blocks realized by their cooperation. Those skilled in the art will therefore understand that these functional blocks can be realized in various forms by hardware alone, by software alone, or by a combination of the two.

First, the procedure for generating the database 60, which stores dictionary data of English words, is described. The usage frequency calculation unit 31 acquires the set of data to be stored in the database 60 and calculates, within that set, how many pieces of data use each element. In this example, when the usage frequency calculation unit 31 acquires a list of English words, it calculates the usage frequency of the characters used in those words. When the same element is used more than once in a single piece of data, the present embodiment counts the usage frequency as 1; depending on the content of the data, however, each occurrence of the element may be counted.

The classification generation unit 32 classifies the elements used in the data into a plurality of groups. In doing so, it classifies the elements so that the total usage frequency of the elements belonging to each group is roughly uniform. For example, if in a list of 10 English words "a" is used 8 times, "b" 4 times, "c" 6 times, and "d" 10 times, classifying them into the two groups "a" and "c", and "b" and "d", makes the total usage frequency of the characters in each group 14, which is uniform. The number of groups into which the classification generation unit 32 classifies the elements is determined in consideration of the efficiency of the search and candidate extraction described later. In the present embodiment, to simplify the description, the 26 letters of the alphabet are classified into five groups, as shown in FIG. 2.
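The frequency count and the balanced grouping can be sketched as follows. The patent does not say how the groups are balanced, so the greedy heuristic below (most frequent element first, into the currently lightest group) is purely an assumption made for illustration, as are the names `letter_usage` and `balance_groups`.

```python
from collections import Counter


def letter_usage(words):
    """Count, for each letter, how many words contain it (once per word)."""
    usage = Counter()
    for word in words:
        usage.update(set(word))  # set() makes repeated letters in a word count once
    return usage


def balance_groups(usage, group_count):
    """Greedy heuristic (an assumption, not taken from the source): place each
    element, most frequent first, into the group with the smallest running total."""
    groups = [[] for _ in range(group_count)]
    totals = [0] * group_count
    for letter, count in sorted(usage.items(), key=lambda item: (-item[1], item[0])):
        lightest = totals.index(min(totals))
        groups[lightest].append(letter)
        totals[lightest] += count
    return groups, totals
```

With the frequencies from the example in the text (a: 8, b: 4, c: 6, d: 10) and two groups, this heuristic happens to reproduce the grouping given there, with a total of 14 in each group. A greedy pass is not guaranteed to find the most even split in general.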

For each piece of data, the characteristic value calculation unit 33 calculates a characteristic value consisting of a bit string with as many digits as there are groups produced by the classification generation unit 32. The characteristic value calculation unit 33 assigns one bit to each group and sets that bit to "1" if the data uses an element belonging to the group and to "0" if it does not. In the present embodiment, the alphabet is classified into five groups as shown in FIG. 2, so the characteristic value is a 5-bit binary number. If an English word uses any of the characters "a", "b", "c", "d", or "e", the first bit counted from the most significant end is set to "1"; otherwise it is set to "0". Similarly, if the word uses any of the characters "f" through "j", the second bit counted from the most significant end is set to "1"; otherwise it is set to "0". In this way, the characteristic value calculation unit 33 calculates the characteristic values of all English words in the list. For example, the characteristic value of the English word "test" is "10010".
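A minimal sketch of this calculation follows. The text states only that "a" to "e" map to the first bit and "f" to "j" to the second; the remaining group boundaries used here ("k" to "o", "p" to "t", "u" to "z") are an assumption inferred from the worked examples in the description, since FIG. 2 is not reproduced. The names `GROUPS`, `characteristic_value`, and `as_bits` are illustrative.

```python
# Group boundaries after "fghij" are assumed; they are consistent with the
# examples in the text ("test" -> 10010, "tewt" -> 10011).
GROUPS = ("abcde", "fghij", "klmno", "pqrst", "uvwxyz")


def characteristic_value(word: str) -> int:
    """Set one bit per group (most significant bit = first group) if the word uses it."""
    value = 0
    for position, group in enumerate(GROUPS):
        if any(letter in group for letter in word.lower()):
            value |= 1 << (len(GROUPS) - 1 - position)
    return value


def as_bits(value: int) -> str:
    """Render a characteristic value as a 5-digit binary string."""
    return format(value, "05b")
```

For instance, `as_bits(characteristic_value("test"))` gives `"10010"`, matching the value stated above.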

The data sorting unit 34 sorts the data in the data set by the number of elements used and then, within each set of data with the same number of elements, groups together the data having the same characteristic value. That is, the list of English words is sorted by number of characters, and English words with the same number of characters are further classified by characteristic value. The dictionary data of English words is generated in this way. An example of the generated dictionary data is shown in FIG. 3.

Next, the procedure for checking the spelling of an English word is described. The data acquisition unit 41 acquires the data to be searched for; here, it acquires the English word to be spell-checked. The search unit 42 searches whether the acquired data exists in the database 60. As described above, the dictionary stored in the database 60 holds English words sorted by number of characters and classified by characteristic value, so the search unit 42 calculates the number of characters and the characteristic value of the acquired English word and searches the corresponding records in the database 60. For example, when the data acquisition unit 41 acquires the English word "test", the search unit 42 only needs to search the records whose element count is "4" and whose characteristic value is "10010". This allows the data to be searched efficiently.
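The two-level organization and the exact-match lookup can be sketched with nested dictionaries. This is only one possible in-memory layout, assumed for illustration; the patent does not prescribe a storage structure. The group boundaries in `GROUPS` are the same assumption inferred from the worked examples, and `build_index` and `contains` are hypothetical names.

```python
from collections import defaultdict

GROUPS = ("abcde", "fghij", "klmno", "pqrst", "uvwxyz")  # assumed boundaries


def characteristic_value(word: str) -> int:
    value = 0
    for position, group in enumerate(GROUPS):
        if any(letter in group for letter in word.lower()):
            value |= 1 << (len(GROUPS) - 1 - position)
    return value


def build_index(words):
    """Bucket words first by length, then by characteristic value."""
    index = defaultdict(lambda: defaultdict(list))
    for word in words:
        index[len(word)][characteristic_value(word)].append(word)
    return index


def contains(index, word):
    """Look only in the single bucket that could hold an exact match."""
    bucket = index.get(len(word), {}).get(characteristic_value(word), [])
    return word in bucket
```

A lookup for "test" touches only the bucket for length 4 and value `0b10010`, so words with a different length or a different set of letter groups are never compared.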

If the acquired data exists in the database 60, the processing ends. If it does not, the candidate extraction unit 43 extracts, from the data in the database 60, data similar to the acquired data. In the case of spell checking, an English word that is not in the database 60 may be misspelled, so the candidate extraction unit 43 extracts correction candidates. The candidate extraction unit 43 judges similarity based on the distance between data and extracts as candidates the data whose distance is smaller than a predetermined value. An example of extracting English words whose edit distance from the target data "tewt" is 2 or less is described below.

The element count comparison unit 44 calculates the number of elements in the target data and compares it with the data in the database 60. When data with an edit distance of n or less is to be extracted, data whose element count differs by n + 1 or more is excluded from the candidates without the edit distance being calculated. Since the target data "tewt" has "4" characters, the element count comparison unit 44 excludes from the candidates English words with "1" character and English words with "7" or more characters. As a result, the data in the database 60 shown in FIG. 3 is narrowed down by the major classification of element count, as shown in FIG. 4.
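The length filter can be sketched as a selection of buckets from a length-keyed index. The mapping layout and the function name `buckets_within_length` are assumptions made for illustration.

```python
def buckets_within_length(index, target, max_distance):
    """Keep only the length buckets that can still contain a candidate.

    A word whose length differs from the target by more than max_distance
    needs at least that many insertions or deletions, so it cannot qualify.
    """
    low = len(target) - max_distance
    high = len(target) + max_distance
    return {length: bucket for length, bucket in index.items() if low <= length <= high}


# Hypothetical toy index keyed by word length.
index = {1: ["a"], 2: ["up"], 4: ["test", "word"], 6: ["tested"], 7: ["testing"]}
kept = buckets_within_length(index, "tewt", 2)
print(sorted(kept))  # [2, 4, 6]
```

Because the dictionary is already organized by length, whole buckets are skipped without inspecting any individual word.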

The characteristic value comparison unit 45 calculates the characteristic value of the target data and compares it with the data in the database 60. It calculates the "pseudo-distance" between the characteristic value of the target data and the characteristic value of each piece of data in the database 60 as described below, and excludes from the candidates any data for which the calculated pseudo-distance is larger than the upper limit of the edit distance for extraction as a candidate. The distance between characteristic values described below is called a pseudo-distance because it does not satisfy symmetry among the distance axioms.

The characteristic value comparison unit 45 takes, as the pseudo-distance, the larger of the following two counts: the number of "1" bits in the bit string obtained by bit-inverting one characteristic value and then taking the logical AND of the two, and the number of "1" bits in the bit string obtained by bit-inverting the other characteristic value and then taking the logical AND of the two. For example, consider the pseudo-distance between the characteristic value "10011" of the target data "tewt" and the characteristic value "00100". First, the logical AND of "01100", the inversion of the former "10011", and the latter "00100" is "00100", which contains one "1" bit. Next, the logical AND of "11011", the inversion of the latter "00100", and the former "10011" is "10011", which contains three "1" bits. The larger of "1" and "3", namely "3", is therefore the pseudo-distance between the characteristic values "10011" and "00100".
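The calculation just described can be sketched with integer bit operations. The function name `pseudo_distance` and the `bit_count` parameter (5 in this embodiment) are illustrative choices, not names from the patent.

```python
def pseudo_distance(first: int, second: int, bit_count: int = 5) -> int:
    """Larger of the two one-sided bit differences between characteristic values."""
    mask = (1 << bit_count) - 1                # keep the inversion within bit_count bits
    only_in_second = (~first & mask) & second  # groups used by second but not by first
    only_in_first = (~second & mask) & first   # groups used by first but not by second
    return max(bin(only_in_second).count("1"), bin(only_in_first).count("1"))
```

For the values in the example, `pseudo_distance(0b10011, 0b00100)` evaluates to 3, which exceeds an upper limit of 2, so data with that characteristic value would be skipped.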

The false distance is calculated instead of the Hamming distance between eigenvalues because a substitution counts as 1 when the edit distance between data is calculated. If only deletions and insertions are considered, the Hamming distance would suffice. When substitutions are also considered, however, narrowing down candidates by comparing the Hamming distance, obtained from the exclusive OR of the eigenvalues, with the upper limit of the edit distance may exclude data whose edit distance does not actually exceed the upper limit. For example, the eigenvalue of the character string “afp” is “11010” and that of the character string “up” is “00011”. Their exclusive OR is “11001”, so the Hamming distance between the eigenvalues is 3. However, replacing “a” in “afp” with “u” and deleting “f” yields “up”, so the edit distance between the data is 2. The Hamming distance between eigenvalues can therefore be larger than the edit distance between the data. In contrast, for the false distance described above, the logical product of the inversion “00101” of the eigenvalue of “afp” and the eigenvalue “00011” of “up” is “00001”, and the logical product of the inversion “11100” of the latter and the former “11010” is “11000”, so the false distance between the eigenvalues is 2. Because the false distance between eigenvalues does not exceed the edit distance between the data, it can be used to narrow down the data.
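The “afp” / “up” comparison can be checked numerically. The Python sketch below is illustrative only: the function names are assumptions, and the eigenvalues are the ones quoted in the text rather than values computed from a letter grouping. The intuition it demonstrates is that a single substitution can clear one bit and set another, which adds 2 to the Hamming distance but at most 1 to each one-sided count used by the false distance.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits (1s in the exclusive OR)."""
    return bin(a ^ b).count("1")


def false_distance(a: int, b: int) -> int:
    """Larger of the two one-sided bit differences between a and b."""
    ones = lambda x: bin(x).count("1")
    return max(ones(a & ~b), ones(b & ~a))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with insert, delete and substitute each costing 1."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


afp, up = 0b11010, 0b00011   # eigenvalues quoted in the text
print("edit distance :", edit_distance("afp", "up"))
print("Hamming       :", hamming(afp, up))
print("false distance:", false_distance(afp, up))
```

With an upper limit of 2, filtering on the Hamming distance would discard “up” even though its edit distance is within the limit, whereas filtering on the false distance keeps it.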

When calculating the false distance between two eigenvalues, the eigenvalue comparison unit 45 may instead bit-invert the eigenvalue whose bit string contains more 1s, calculate the logical product of the two, and use the number of 1s in the resulting bit string as the false distance. For example, to calculate the distance between the eigenvalue “10011” of the target data “tewt” and the eigenvalue “00100”, the latter eigenvalue has fewer 1s, so it is bit-inverted to give “11011”. The logical product of this and “10011” is “10011”, which contains three 1s. Since this exceeds the upper limit of 2 on the edit distance, the eigenvalue comparison unit 45 excludes English words whose eigenvalue is “00100” from the candidates. As a result, the candidate data shown in FIG. 4 is further narrowed down by the middle classification, the eigenvalue, as shown in FIG. 5.
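The single-inversion shortcut can be sketched as follows. This sketch follows the worked example, which inverts the eigenvalue with fewer 1s; with that choice a single AND yields the same value as the two-sided definition, because the two one-sided counts differ by exactly the difference in the number of 1s. The function names and the integer representation are assumptions made for illustration.

```python
def ones(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def false_distance_two_sided(a: int, b: int) -> int:
    """Reference definition: larger of the two one-sided counts."""
    return max(ones(a & ~b), ones(b & ~a))


def false_distance_single(a: int, b: int) -> int:
    """One inversion and one AND, as in the worked example.

    The eigenvalue with fewer 1s is the one that gets inverted.
    """
    if ones(a) < ones(b):
        a, b = b, a                  # ensure a has at least as many 1s as b
    return ones(a & ~b)              # bits set in a that are not set in b


# Worked example: "00100" has fewer 1s, so it is the one inverted.
print(false_distance_single(0b10011, 0b00100))
```

Checking both functions over every pair of 5-bit values is a quick way to confirm that the shortcut and the two-sided definition agree.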

The used element comparison unit 46 compares the constituent elements of the target data with those of the data in the database. It calculates the difference between the elements used in the target data and the elements used in the data in the database 60, and excludes from the candidates any data for which this difference is larger than the upper limit of the edit distance for data to be extracted as candidates. Specifically, the used element comparison unit 46 calculates the number of elements that the target data uses but the data in the database 60 does not, and the number of elements that the data in the database 60 uses but the target data does not, and excludes the data from the candidates if either number exceeds the upper limit of the edit distance. For example, the used elements of the target data “tewt” and the data “word” differ as follows. Two elements, “t” and “e”, are used in the target data “tewt” but not in the data “word”, and three elements, “o”, “r”, and “d”, are used in the data “word” but not in the target data “tewt”. The difference in used elements is therefore 3, which is larger than the upper limit of 2 on the edit distance, so the data “word” is excluded from the candidates. As a result, the candidate data shown in FIG. 5 is further narrowed down as shown in FIG. 6.
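The used-element check maps naturally onto set differences. The following Python sketch is an illustration under assumptions: the function names are invented for this example, elements are taken to be characters, and the limit of 2 is the upper limit used in the worked example.

```python
def used_element_difference(target: str, candidate: str) -> tuple[int, int]:
    """Return (elements only in target, elements only in candidate)."""
    t, c = set(target), set(candidate)
    return len(t - c), len(c - t)


def passes_used_element_filter(target: str, candidate: str, limit: int) -> bool:
    """Keep the candidate unless either one-sided count exceeds the limit."""
    only_target, only_candidate = used_element_difference(target, candidate)
    return max(only_target, only_candidate) <= limit


# Worked example from the text: "tewt" against "word" with an upper limit of 2.
print(used_element_difference("tewt", "word"))
print(passes_used_element_filter("tewt", "word", limit=2))
```

Repeated characters are counted once here because the text compares which elements are used, not how often they occur.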

The edit distance calculation unit 47 calculates the edit distance between the target data and each of the candidates narrowed down as described above, and extracts as candidates the data whose edit distance is closer than a predetermined value. In place of the edit distance calculation unit 47, a configuration may be provided that calculates the distance or similarity between data by another method, such as the signal distance or the similarity given by the Smith-Waterman algorithm. In general, a distance takes a smaller value the more similar the data are, and a similarity takes a larger value the more similar the data are; here, high similarity is expressed as the distance being “close”. Accordingly, when an edit distance is calculated, data whose calculated value is smaller than a predetermined upper limit is extracted, and when a similarity is calculated, data whose calculated value is larger than a predetermined lower limit is extracted. By the above procedure, the candidate data shown in FIG. 7 is extracted.
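The final extraction step can be sketched with a standard dynamic-programming edit distance. This is a minimal sketch, not the implementation described in the figures: the candidate words below are hypothetical, the function names are assumptions, and the sketch keeps data whose distance is at or below the limit, which is one reading of the threshold.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]


def extract_similar(target: str, candidates: list[str], limit: int):
    """Return (word, distance) pairs whose distance is within the limit."""
    scored = [(word, edit_distance(target, word)) for word in candidates]
    return [(word, dist) for word, dist in scored if dist <= limit]


# Hypothetical candidate list, not the contents of FIG. 6 or FIG. 7.
print(extract_similar("tewt", ["test", "text", "team", "table"], limit=2))
```

For a similarity measure instead of a distance, the comparison would be reversed so that values above a lower limit are kept, as the paragraph above describes.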

The candidate presenting unit 48 presents the candidates extracted by the candidate extraction unit 43 to the user. The candidate presenting unit 48 preferably displays the candidates so that data with a closer calculated distance appears higher in the list. This allows candidates with higher similarity to be displayed at the top. The candidate presenting unit 48 may present the candidate data together with the calculated distance values. This allows the user to judge similarity by referring to the distance values when selecting data from the presented candidates. The data selected by the user from the presented candidate data is output to a word processor or the like.
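The presentation order described above amounts to sorting by distance and optionally showing the value. The short Python sketch below assumes the candidates arrive as (word, distance) pairs; the function name and the output format are illustrative choices, not something specified in the text.

```python
def present(candidates: list[tuple[str, int]]) -> list[str]:
    """Order candidates by ascending distance and label each with its distance."""
    ranked = sorted(candidates, key=lambda pair: pair[1])   # closest first
    return [f"{rank}. {word} (distance {dist})"
            for rank, (word, dist) in enumerate(ranked, 1)]


# Hypothetical input pairs.
for line in present([("team", 2), ("test", 1)]):
    print(line)
```

Python's `sorted` is stable, so candidates at the same distance keep the order in which they were extracted.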

The learning unit 49 trains the database 60 by reordering its data so that the candidate data extracted by the candidate extraction unit 43, and the data subsequently selected by the user, are placed higher in the database 60. The learning unit 49 moves the data group having the same eigenvalue as the target data, the data extracted as candidate data, or the data selected by the user from the candidate data, so that the group is placed higher among the data groups having the same number of constituent elements. In addition, the learning unit 49 moves the target data, the data extracted as candidate data, or the data selected by the user from the candidate data, so that it is placed higher within the data group having the same eigenvalue. As a result, data that is in use, and data that has been extracted or selected as similar to data in use, is reached earlier in subsequent searches, which improves search efficiency and speed. Moreover, when the candidate presenting unit 48 presents candidate data, the display order can be optimized so that data that is frequently used, extracted, or selected appears higher. This improves convenience for the user.
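The two-level reordering can be sketched as a move-to-front operation. Everything about the storage layout here is an assumption made for illustration: the database is modeled as a dict keyed by element count, each value being an ordered list of `[eigenvalue, words]` groups, and the eigenvalues in the sample data are placeholders rather than values taken from the figures. Moving all the way to the front is also just one way to place data "higher".

```python
def promote(db: dict, word: str, length: int, eigen: int) -> bool:
    """Move `word` to the front of its eigenvalue group, and that group
    to the front of the bucket for its element count."""
    groups = db.get(length, [])
    for index, (value, words) in enumerate(groups):
        if value == eigen and word in words:
            words.remove(word)
            words.insert(0, word)                  # word first within its group
            groups.insert(0, groups.pop(index))    # group first within its bucket
            return True
    return False                                   # word is not stored here


# Hypothetical database content with placeholder eigenvalues.
db = {4: [[0b00100, ["word"]], [0b10011, ["text", "test"]]]}
promote(db, "test", length=4, eigen=0b10011)
print(db)
```

Because later searches scan groups and words in stored order, promoted entries are examined first.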

Comparing the number of elements, the eigenvalues, and the used elements can each be done faster than calculating the edit distance, so narrowing down the candidates with these comparisons before calculating the edit distance speeds up the extraction of similar data. The processing time of these comparisons increases in the order of number of elements, eigenvalue, and used elements, so executing the faster comparisons first improves the efficiency of narrowing down and further speeds up the extraction of similar data.

FIG. 8 is a flowchart showing the procedure of the data processing method according to the embodiment. First, when the data acquisition unit 41 acquires target data (S10), the search unit 42 searches whether the target data exists in the database 60 (S14). If it does not exist in the database 60 (N in S14), the candidate extraction unit 43 extracts data similar to the target data from the database 60. First, the element number comparison unit 44 calculates the number of elements used in the target data and narrows down the candidates by excluding data whose difference in the number of elements is equal to or greater than a predetermined value (S16). Next, the eigenvalue comparison unit 45 calculates the eigenvalue of the target data and narrows down the candidates by excluding data for which the false distance between eigenvalues is equal to or greater than a predetermined value (S18). Further, the used element comparison unit 46 compares the elements used in the target data with the elements used in the data in the database 60 and narrows down the candidates by excluding data that differs by a predetermined value or more (S20). The edit distance calculation unit 47 calculates the edit distance between each of the remaining data and the target data, and extracts data whose edit distance is equal to or less than a predetermined value as similar data (S22). The candidate presenting unit 48 presents the extracted candidate data to the user (S24). If the target data exists in the database 60 (Y in S14), the process of extracting similar data is skipped. The learning unit 49 trains the database 60 by moving the target data, the data extracted as similar to the target data, or the data selected by the user from the extracted data, so that it is placed higher in the database 60 (S26).
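Steps S14 to S24 can be strung together as a cheap-filters-first pipeline. The Python sketch below is an illustration under several assumptions: the letter grouping in `GROUPS` is hypothetical (the grouping the document uses is defined elsewhere and is not reproduced here, so the eigenvalues will differ from those in the worked examples), the database is a flat list of words, every filter uses the same limit, and data is excluded when a value exceeds the limit. The learning step S26 is omitted.

```python
# Hypothetical grouping of letters into five groups (an assumption).
GROUPS = ["etaoin", "shrdlu", "cmfwyp", "vbgkjq", "xz"]


def eigenvalue(word: str) -> int:
    """One bit per group: 1 if the word uses any letter of that group."""
    value = 0
    for group in GROUPS:
        value = (value << 1) | int(any(ch in group for ch in word))
    return value


def false_distance(a: int, b: int) -> int:
    ones = lambda x: bin(x).count("1")
    return max(ones(a & ~b), ones(b & ~a))


def edit_distance(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def find_similar(target: str, database: list[str], limit: int = 2) -> list[str]:
    if target in database:                               # S14: exact hit
        return [target]                                  # extraction is skipped
    t_eigen, t_set = eigenvalue(target), set(target)
    hits = []
    for word in database:
        if abs(len(word) - len(target)) > limit:         # S16: element count
            continue
        if false_distance(eigenvalue(word), t_eigen) > limit:   # S18: eigenvalue
            continue
        w_set = set(word)
        if max(len(t_set - w_set), len(w_set - t_set)) > limit:  # S20: used elements
            continue
        dist = edit_distance(target, word)               # S22: edit distance
        if dist <= limit:
            hits.append((dist, word))
    return [word for dist, word in sorted(hits)]         # S24: closest first


print(find_similar("tewt", ["test", "text", "word", "tent", "banana"]))
```

Each of the three filters is a lower bound on the edit distance, so skipping a word at S16, S18, or S20 never discards a word that S22 would have accepted with the same limit.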

The present invention has been described above based on an embodiment. The embodiment is illustrative, and those skilled in the art will understand that various modifications can be made to the combinations of its constituent elements and processing steps, and that such modifications also fall within the scope of the present invention.

The present invention is applicable to a data processing apparatus that extracts data similar to given data.

Claims

1. A data processing apparatus comprising:
a search unit that searches whether target data is stored in a database; and
an extraction unit that extracts candidate data similar to the target data from the database when the target data is not stored in the database,
wherein the extraction unit includes:
a distance calculation unit that calculates a distance between the target data and data stored in the database, and extracts, as the candidate data, data for which the distance is smaller than a predetermined upper limit; and
an eigenvalue comparison unit that, before the distance calculation unit calculates the distance, calculates an eigenvalue indicating, for each of a plurality of groups into which the components of data are classified, whether a component belonging to that group is included in the target data, calculates a false distance between the eigenvalue of the target data and the eigenvalue of data stored in the database, and excludes data for which the false distance is greater than the predetermined upper limit from the data for which the distance calculation unit calculates the distance.

2. The data processing apparatus according to claim 1, wherein the eigenvalue is a binary number having the same number of digits as the number of groups, and the eigenvalue comparison unit assigns a bit to each of the groups and calculates the eigenvalue by setting the bit assigned to a group to “1” when a component belonging to that group is included in the data and to “0” when it is not.

3. The data processing apparatus according to claim 2, wherein, when calculating the false distance between two eigenvalues, the eigenvalue comparison unit sets as the false distance the larger of the number of “1”s included in the bit string obtained by bit-inverting one eigenvalue and then calculating the logical product of the two, and the number of “1”s included in the bit string obtained by bit-inverting the other eigenvalue and then calculating the logical product of the two.

4. The data processing apparatus according to claim 2, wherein, when calculating the false distance between two eigenvalues, the eigenvalue comparison unit sets as the false distance the number of “1”s included in the bit string obtained by bit-inverting whichever of the two eigenvalues has more “1”s in its bit string and then calculating the logical product of the two.

5. The data processing apparatus according to claim 1, wherein the extraction unit further includes an element number comparison unit that, before the eigenvalue comparison unit calculates the false distance, excludes data whose difference in the number of components exceeds the predetermined upper limit from the data for which the eigenvalue comparison unit calculates the false distance.

6. The data processing apparatus according to claim 1, wherein the extraction unit further includes a used element comparison unit that, before the distance calculation unit calculates the distance, calculates the number of components included in the target data but not in the data stored in the database and the number of components included in the data stored in the database but not in the target data, and excludes data for which either number exceeds the predetermined upper limit from the data for which the distance calculation unit calculates the distance.

7. The data processing apparatus according to any one of claims 1 to 6, wherein the database classifies and stores the data by number of components and by eigenvalue.

8. The data processing apparatus according to claim 7, further comprising a learning unit that places the target data, the data extracted as the candidate data, or the data selected by a user from the candidate data at a higher position within the data group having the same eigenvalue.

9. The data processing apparatus according to claim 7, further comprising a learning unit that places the data group having the same eigenvalue as the target data, the data extracted as the candidate data, or the data selected by a user from the candidate data at a higher position among the data groups having the same number of components.

10. The data processing apparatus according to any one of claims 1 to 9, wherein the distance calculation unit calculates, as the distance, the minimum number of steps required to transform one piece of data into another by inserting, deleting, or replacing components.

11. A data processing apparatus comprising:
a usage frequency calculation unit that acquires a data group to be stored in a database and calculates the usage frequency of the components constituting each piece of data in the acquired data group;
a classification unit that classifies the components into a plurality of groups based on the usage frequency;
an eigenvalue calculation unit that calculates, for each piece of data, an eigenvalue indicating for each group whether a component belonging to that group is included in the data; and
a data sorting unit that classifies the data included in the data group by number of elements used and by eigenvalue and stores the data in the database.

12. The data processing apparatus according to claim 11, wherein the eigenvalue is a binary number having the same number of digits as the number of groups, and the eigenvalue calculation unit assigns a bit to each of the groups and calculates the eigenvalue by setting the bit assigned to a group to “1” when a component belonging to that group is included in the data and to “0” when it is not.

13. A data processing method comprising:
searching whether target data is stored in a database; and
extracting candidate data similar to the target data from the database when the target data is not stored in the database,
wherein the extracting includes:
calculating a distance between the target data and data stored in the database, and extracting, as the candidate data, data for which the distance is smaller than a predetermined upper limit; and
before calculating the distance, calculating an eigenvalue indicating, for each of a plurality of groups into which the components of data are classified, whether a component belonging to that group is included in the target data, calculating a false distance between the eigenvalue of the target data and the eigenvalue of data stored in the database, and excluding data for which the false distance is greater than the predetermined upper limit from the data for which the distance is calculated.

14. A data processing program causing a computer to implement:
a function of searching whether target data is stored in a database; and
a function of extracting candidate data similar to the target data from the database when the target data is not stored in the database,
wherein the extracting function includes:
a function of calculating a distance between the target data and data stored in the database, and extracting, as the candidate data, data for which the distance is smaller than a predetermined upper limit; and
a function of, before calculating the distance, calculating an eigenvalue indicating, for each of a plurality of groups into which the components of data are classified, whether a component belonging to that group is included in the target data, calculating a false distance between the eigenvalue of the target data and the eigenvalue of data stored in the database, and excluding data for which the false distance is greater than the predetermined upper limit from the data for which the distance is calculated.