JP6549786B2

JP6549786B2 - Data cleansing system, method and program

Info

Publication number: JP6549786B2
Application number: JP2018510205A
Authority: JP
Inventors: 健太郎角井
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2016-04-08
Filing date: 2016-04-08
Publication date: 2019-07-24
Anticipated expiration: 2036-04-08
Also published as: WO2017175375A1; JPWO2017175375A1

Description

The present invention relates to data cleansing.

Solutions that integrate data scattered across an enterprise and use it for business analysis are attracting attention. Realizing such a solution requires master data that stores, without excess or deficiency, the attributes contained in the transaction data. In master data, records must correspond one-to-one with instances of the entity that the master data represents. That is, each record must be uniquely identifiable among all other records.

However, the master data of many companies is often not in a state suitable for such use. Because much master data is created from data entered by humans, records often lose their uniqueness, and the data contains duplicate records, due to various causes such as input errors and operational deficiencies. Duplicate records can also become a problem when multiple sets of master data are integrated, for example in a corporate merger.

Note that the uniqueness of a record means semantic uniqueness with respect to the entity of the master data. For example, in master data whose unit of entity is a person, such as an employee master or a customer master, data such as personal names and addresses can be ambiguous in expression because of spelling variations, aliases, or abbreviations. Ambiguity can also arise from the diversity of character encodings, such as the distinction between so-called half-width and full-width characters. Conversely, two people with exactly the same family name and given name may be different people. Therefore, in master data whose unit of entity is a person, it is important that, on the premise that such notational ambiguity exists (and, where possible, after unifying the notation according to fixed rules), each record corresponds to exactly one person and the same person does not appear in multiple records.

A duplicate record in master data is a record whose entity is duplicated, regardless of differences in notation. To make use of enterprise data, such duplicate records that may be contained in the master data must be investigated and corrected manually. This work is generally called data cleansing.

Techniques for detecting similar records are known as a means of supporting data cleansing. Because duplicate records involve notational ambiguity, they cannot always be detected by simple string matching. However, records with few notational differences between them, that is, similar records, are likely to be duplicates. Patent Document 1 discloses a technique that detects similar records by generating a feature vector from each record of a data table and comparing the vectors with one another.

In a data table, a column that makes records uniquely identifiable is called a "key". A key may be created artificially, but a column can also be adopted as a key when all the values stored in it (called "column values") are different, that is, when uniqueness is guaranteed. Even when no single column guarantees uniqueness, a combination of multiple columns that does guarantee uniqueness can be adopted as a composite key. Such a combination of columns that makes records uniquely identifiable is called a UCC (Unique Column Combination). Patent Document 2 discloses a technique for detecting UCCs.
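The definition of a UCC can be illustrated with a minimal Python sketch. It is not the detection technique of Patent Document 2 or of this embodiment; it is a brute-force check written for illustration only. The function names `is_unique` and `minimal_uccs`, the restriction to minimal combinations, and the sample rows are assumptions.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if the given columns together distinguish every row."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, column_ids):
    """Brute-force enumeration of minimal unique column combinations.

    Assumption: only minimal combinations are reported, so a superset of
    an already found UCC is skipped.
    """
    found = []
    for size in range(1, len(column_ids) + 1):
        for combo in combinations(column_ids, size):
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Hypothetical sample rows: no single column is unique,
# but the pair (C001, C002) is.
rows = [
    {"C001": "AAA", "C002": "CCC", "C003": 0},
    {"C001": "AAA", "C002": "DDD", "C003": 0},
    {"C001": "BBB", "C002": "CCC", "C003": 0},
]
uccs = minimal_uccs(rows, ["C001", "C002", "C003"])
```

The exhaustive search grows exponentially with the number of columns, which is one reason a dedicated detection technique is needed for tables with many columns.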

Patent Document 1: US Patent Application Publication No. 2007/0005556
Patent Document 2: International Publication No. 2014/191059

Even when the above technique for detecting similar records is adopted to support data cleansing of master data, the person in charge of data cleansing makes the final decision, such as whether a given record should be regarded as a duplicate and deleted, or whether the record should be kept and its spelling variations corrected. That person must find duplicate records visually and decide whether to delete a record or correct its notation. However, enterprise master data often consists of a large number of columns, sometimes more than 100, and the purpose of many columns is often unknown. In such cases, the workload on the person in charge of data cleansing is very large.

In addition, modifying the data table, for example by unifying spelling variations or unifying full-width and half-width characters, can cause a key (column) that had guaranteed the uniqueness of records to stop guaranteeing it. In particular, when a composite key that guarantees record uniqueness loses that guarantee because of a modification to the data table, the person in charge of data cleansing is likely to overlook the loss.
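The following Python sketch illustrates how unifying full-width and half-width characters can destroy the uniqueness of a key. The use of Unicode NFKC normalization as the unification rule, the helper names, and the sample codes are assumptions made for illustration; the source does not specify a normalization rule.

```python
import unicodedata


def normalize(value):
    """Fold full-width forms into their half-width equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", value).strip()


def has_duplicates(values):
    """Return True if at least two values in the column are equal."""
    return len(set(values)) != len(values)


# Hypothetical key column. The second value is "ABC-1" written with
# full-width characters (escaped here to avoid encoding ambiguity).
codes = ["ABC-1", "\uff21\uff22\uff23\uff0d\uff11", "XYZ-9"]

duplicated_before = has_duplicates(codes)
duplicated_after = has_duplicates([normalize(v) for v in codes])
```

Before normalization the column is unique. After normalization the first two values coincide, so the column can no longer serve as a key.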

That is, in the prior art, the workload on the person in charge of data cleansing is large, and the loss of a key or composite key that may result from that person's corrections may be overlooked. An object of the present invention is therefore to improve the work efficiency and/or work accuracy of data cleansing.

A data cleansing system according to one embodiment has a processor and a memory. The processor reads a data table from the memory, calculates the similarity between records of the data table, and detects a UCC (Unique Column Combination), which is a set of columns that makes each record of the data table uniquely identifiable.

According to the present invention, the work efficiency and/or work accuracy of data cleansing can be improved.

FIG. 1 shows an example of the physical configuration of the data cleansing system.
FIG. 2 shows an example of the functional configuration of the data cleansing system.
FIG. 3 shows an example of the functions included in the metadata generation unit.
FIG. 4 is a flowchart showing an example of data display processing.
FIG. 5 shows an example of a data table.
FIG. 6 shows an example of a similar record matrix.
FIG. 7 shows an example of a UCC list.
FIG. 8 is a flowchart showing an example of metadata generation processing.
FIG. 9 shows an example of a hash matrix.
FIG. 10 shows an example of a MinHash signature.
FIG. 11 shows an example of a position list index (PLI).
FIG. 12 is a flowchart showing an example of the processing for generating the similar record matrix.
FIG. 13 is a flowchart showing an example of the processing for extracting UCC candidate columns.
FIG. 14 is a flowchart showing an example of the processing for generating the UCC list.
FIG. 15 is a flowchart showing an example of data editing processing.
FIG. 16 is a flowchart showing an example of metadata regeneration processing.
FIG. 17 shows an example of a data table display screen.
FIG. 18 shows an example of an improved data table display screen.

Embodiments are described below. In the following description, information may be described using expressions such as "xxx table" or "xxx list", but the information may be expressed in any data structure. That is, to show that the information does not depend on the data structure, an "xxx table" or "xxx list" can be called "xxx information". Furthermore, the expressions "identification information", "identifier", "name", and "ID" are used when describing the content of each piece of information, and these are interchangeable.

In the following description, processing may be described with a "program" as the subject. Because a program is executed by a processor (for example, a CPU (Central Processing Unit)) and performs the defined processing while using at least one of a storage resource (for example, a memory) and a communication interface device as appropriate, the subject of the processing may instead be the processor or an apparatus having the processor. Part or all of the processing performed by the processor may be performed by a hardware circuit. A computer program may be installed from a program source. The program source may be a program distribution server or a storage medium (for example, a portable storage medium).

In the following description, full reference signs such as "PLI 303A" and "PLI 303B" are used when elements of the same kind are distinguished, and only the common number of the reference signs, such as "PLI 303", may be used when they are not distinguished.

FIG. 1 shows an example of the physical configuration of the data cleansing system 100.

The data cleansing system 100 is an example of a computer and has a processor 101, a memory 102, a storage 103, a network interface 104, and a console 105. Examples of the data cleansing system 100 include a personal computer, a rack-mount server, and a blade server. The processor 101 is connected to the memory 102, the storage 103, the network interface 104, and the console 105 so that bidirectional communication is possible. The data cleansing system 100 may have only some of these components, or may have more than one of the same component.

The processor 101 is a hardware arithmetic unit such as a CPU (Central Processing Unit). It reads programs from the memory 102 and executes them.

The memory 102 is composed of volatile semiconductor memory and holds programs, data, and the like. Examples of the memory 102 are DRAM (Dynamic Random Access Memory), MRAM (Magnetoresistive Random Access Memory), and FeRAM (Ferroelectric Random Access Memory).

The storage 103 is composed of a non-volatile storage device and holds programs, data, and the like. Examples of the storage 103 are an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these.

The network interface 104 is composed of a communication device such as a NIC (Network Interface Controller) and is connected to a network 106. The network interface 104 controls the protocols used to communicate with other apparatuses via the network 106. Examples of the network 106 are Ethernet (registered trademark), a wireless network based on the IEEE (Institute of Electrical and Electronics Engineers) 802.11 standard, a wide area network based on the SONET/SDH (Synchronous Optical Network/Synchronous Digital Hierarchy) standard, or a network combining several of these network technologies.

The console 105 is composed of, for example, input devices such as a keyboard and a mouse, and a display device such as a liquid crystal display panel. The console 105 may receive an operation signal corresponding to an operation entered from an input device and notify the processor 101 of the content of that signal. The console 105 also displays, on the display device, text, images, and the like based on the text information and graphical information output from the processor 101.

The OS (Operating System) and user programs stored in the storage 103 may be read into the memory 102 when the data cleansing system 100 starts up or when the programs themselves are executed. The various functions of the data cleansing system 100 may then be realized by the processor 101 executing the OS and user programs read into the memory 102. A program executed by the processor 101 may be introduced into the data cleansing system 100 via removable media (a CD-ROM, flash memory, or the like) or a network, and stored in the storage 103. For this purpose, the data cleansing system 100 may have an interface for reading data from removable media.

FIG. 2 shows an example of the functional configuration of the data cleansing system 100.

As functions, the data cleansing system 100 has a display unit 202, an operation unit 203, a data editing unit 204, an input/output unit 205, and a metadata generation unit 206. These functions may be realized by the processor 101 executing a program stored in the memory 102.

The operation unit 203 interprets operations entered through the console 105 as various instructions. The operation unit 203 may pass a data display instruction to the display unit 202 and a data editing instruction to the data editing unit 204.

The input/output unit 205 reads a data file 107 stored in the storage 103 and stores it in the memory 102 as a data table 201.

The metadata generation unit 206 generates metadata 207 based on the data table 201 stored in the memory 102. The metadata generation unit 206 may store the generated metadata 207 in the memory 102. Details of the metadata generation unit 206 are described later.

The display unit 202 displays information about data cleansing through the console 105. The display unit 202 may display information relating to the data table 201 and the metadata 207 through the console 105. The display unit 202 may also change how the information is displayed according to the content of the data display instruction received from the operation unit 203.

To distribute the processing load, improve availability, and so on, some or all of the above functions may be distributed across a plurality of data cleansing systems 100. The data cleansing system 100 may consist of one physical computer or of a plurality of logical or physical computers. When the data cleansing system 100 consists of a plurality of physical computers, the above functions may be realized by a plurality of processors 101 communicating via the network 106.

FIG. 3 shows an example of the functions included in the metadata generation unit 206.

As functions, the metadata generation unit 206 may include a similar record detection unit 208, a UCC detection unit 209, and a hash matrix generation unit 210. The metadata 207 may include a similar record matrix 250 and a UCC list 260.

The hash matrix generation unit 210 generates a hash matrix 301 (see FIG. 9) from the data table 201.

The similar record detection unit 208 uses the hash matrix 301 generated by the hash matrix generation unit 210 to calculate the similarity between records in the data table 201. The similar record detection unit 208 may store the calculated similarity in the similar record matrix 250.

The UCC detection unit 209 uses the hash matrix 301 generated by the hash matrix generation unit 210 to detect UCCs (sets of columns) in the data table 201. The UCC detection unit 209 may register each detected UCC in the UCC list 260.

The similar record detection unit 208 and the UCC detection unit 209 may use the same hash matrix 301 generated by the hash matrix generation unit 210. This reduces the processing load of the system as a whole compared with generating separate hash matrices for similar record detection and for UCC detection.

FIG. 4 is a flowchart showing an example of data display processing.

The input/output unit 205 reads the data file 107 from the storage 103 (step S402).

The input/output unit 205 parses the data file 107, which is serialized in a format such as CSV (Comma-Separated Values), and generates the data table 201 (see FIG. 5) (step S404).

The metadata generation unit 206 generates the metadata 207 from the generated data table 201 and stores it in the memory 102 (step S406). Details of this processing are described later.

The display unit 202 displays the data table 201 and the metadata 207 through the console 105 (step S408). Display examples are described later (see FIGS. 17 and 18).

FIG. 5 shows an example of the data table 201.

The data table 201 is the data to be cleansed in this embodiment. Any kind of data may be stored in the data table 201. The data table 201 consists of a plurality of records and a plurality of columns, and a value (called a column value or cell value) may be stored in each column of a record.

In this embodiment, for the sake of explanation, each record is given a record ID (R001, R002, ...) that uniquely identifies the record, and each column is given a column ID (C001, C002, ...) that uniquely identifies the column. The record IDs and/or column IDs need not be contained in the original data file 107 and may be assigned by the parsing performed by the input/output unit 205. A record ID may also be called a record name, and a column ID may also be called a column name.

The example of FIG. 5 shows that the record with record ID "R001" has the column value "AAA" in column ID "C001", the column value "CCC" in column ID "C002", the column value "0" in column ID "C003", and the column value "0" in column ID "C004".
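The parsing step S404 and the assignment of record IDs and column IDs can be sketched in Python as follows. This is a minimal illustration, not the embodiment's implementation. The function name `parse_data_file`, the assumption that the file has no header row, the dictionary layout, and the second sample row are all hypothetical; only the first row follows the R001 example of FIG. 5.

```python
import csv
import io


def parse_data_file(text):
    """Parse CSV text and assign generated record IDs and column IDs.

    Assumption: the file has no header row, so column IDs are generated.
    Returns (column_ids, table), where table maps record ID -> {column ID: value}.
    """
    rows = list(csv.reader(io.StringIO(text)))
    width = max((len(r) for r in rows), default=0)
    column_ids = [f"C{i:03d}" for i in range(1, width + 1)]
    table = {}
    for n, row in enumerate(rows, start=1):
        padded = row + [""] * (width - len(row))  # pad short rows
        table[f"R{n:03d}"] = dict(zip(column_ids, padded))
    return column_ids, table


# First row as in FIG. 5; the second row is made up for illustration.
column_ids, table = parse_data_file("AAA,CCC,0,0\nAAA,DDD,0,1\n")
```

All cell values are kept as strings here; whether the embodiment converts types during parsing is not stated in the source.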

FIG. 6 shows an example of the similar record matrix 250. The similar record matrix 250 may be included in the metadata 207.

The similar record matrix 250 manages the similarity between pairs of records in the data table 201.

Each row and each column of the similar record matrix 250 may be labeled with a record ID contained in the data table 201. The cell at the intersection of a row's record ID and a column's record ID may store the similarity between the record with the row's record ID and the record with the column's record ID.

The similarity may be a value in the range 0 to 1, where a larger value indicates greater similarity. The example of FIG. 6 shows that the similarity between record IDs "R002" and "R001" is "0.80" (relatively similar).

FIG. 7 shows an example of the UCC list 260. The UCC list 260 may be included in the metadata 207.

The UCC list 260 manages sets of column IDs (that is, UCCs) that can uniquely identify each record of the data table 201.

For example, when every record of the data table 201 can be uniquely identified by the pair of the column value of column ID "C001" and the column value of column ID "C002", the set of column IDs "C001" and "C002" is a UCC. In this case, the set of column IDs "C001" and "C002" is stored in the UCC list 260.

In the UCC list 260, each UCC may be given a UCC ID (U001, U002, ...) that uniquely identifies the UCC (set of column IDs).

FIG. 8 is a flowchart showing an example of metadata generation processing.

The metadata generation unit 206 calculates a hash value for each column value of the data table 201 using a certain hash function (step S602).

The metadata generation unit 206 uses the calculated hash values to generate the hash matrix 301 (see FIG. 9) for the data table 201 (step S604).

The metadata generation unit 206 uses the generated hash matrix 301 to generate the similar record matrix 250 (see FIG. 6) and the UCC list 260 (see FIG. 7) (step S606).

FIG. 9 shows an example of the hash matrix 301.

The hash matrix 301 is a matrix of hash values calculated by applying a certain hash function to each column value of the data table 201. A hash matrix 301 may be generated for each different hash function. In this embodiment, an ID that uniquely identifies each hash function is called a "hash function ID".

The rows and columns of the hash matrix 301 may be labeled with the record IDs and column IDs of the data table 201, respectively. The cell at the intersection of a row's record ID and a column's column ID may store the hash value of the column value of that column ID in the record with that record ID in the data table 201.

In FIG. 9, record IDs are attached to the rows and column IDs to the columns for the sake of explanation. The hash matrix 301 actually stored in memory need not carry such IDs.
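The construction of a hash matrix (steps S602 and S604) can be sketched in Python as follows. The source does not name the hash function, so the use of salted BLAKE2b truncated to 32 bits, the `seed` parameter standing in for a hash function ID, and the sample table are assumptions made for illustration.

```python
import hashlib


def cell_hash(value, seed):
    """Deterministic 32-bit hash of one cell value.

    Assumption: `seed` selects the hash function (h1, h2, ...) by
    salting BLAKE2b; the source does not specify the hash function.
    """
    digest = hashlib.blake2b(
        str(value).encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def hash_matrix(table, column_ids, seed):
    """Map each record ID to the list of hashes of its column values."""
    return {
        record_id: [cell_hash(record[c], seed) for c in column_ids]
        for record_id, record in table.items()
    }


# Hypothetical sample table keyed by record ID.
column_ids = ["C001", "C002"]
table = {
    "R001": {"C001": "AAA", "C002": "CCC"},
    "R002": {"C001": "AAA", "C002": "DDD"},
}
matrix_h1 = hash_matrix(table, column_ids, seed=1)
```

Equal column values always produce equal hash values under the same seed, which is the property that both similar record detection and UCC detection rely on.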

FIG. 10 shows an example of the MinHash signature 302.

The MinHash signature 302 is used in the MinHash method. The MinHash signature 302 may be generated based on the hash matrix 301.

For the sake of explanation, each row of the MinHash signature 302 in FIG. 10 may be labeled with a record ID of the hash matrix 301, and each column may be labeled with a hash function ID (h1, h2, ...) as described for FIG. 9.

The cell at the intersection of a record ID and a hash function ID in the MinHash signature 302 stores the smallest of the hash values belonging to that record ID in the hash matrix 301 generated with the hash function of that hash function ID. For example, if the hash matrix 301 of FIG. 9 was generated with the hash function of hash function ID "h1", then the cell at the intersection of record ID "R001" and hash function ID "h1" in the MinHash signature 302 of FIG. 10 stores "1234", the smallest of the hash values "1234", "4122", "5628", ... belonging to the record with record ID "R001" in the hash matrix 301 of FIG. 9. Similarly, the cell at the intersection of record ID "R001" and hash function ID "h2" stores the smallest of the hash values belonging to record ID "R001" in the hash matrix generated with the hash function of hash function ID "h2".
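The row-wise minimum described above can be sketched in Python as follows. The hash function (salted BLAKE2b), the seed values standing in for hash function IDs, and the sample table are assumptions. The function `estimated_similarity` applies the standard MinHash estimator (the fraction of matching signature positions approximates Jaccard similarity); whether the embodiment computes its similarity exactly this way is not confirmed by this part of the source.

```python
import hashlib


def cell_hash(value, seed):
    """32-bit hash of a cell value; `seed` selects the hash function (assumption)."""
    digest = hashlib.blake2b(
        str(value).encode("utf-8"),
        digest_size=4,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(table, seeds):
    """For each record, keep the minimum cell hash under each hash function."""
    return {
        record_id: [min(cell_hash(v, s) for v in record.values()) for s in seeds]
        for record_id, record in table.items()
    }


def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two records agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical sample table: R001 and R002 hold identical values.
table = {
    "R001": {"C001": "AAA", "C002": "CCC"},
    "R002": {"C001": "AAA", "C002": "CCC"},
    "R003": {"C001": "ZZZ", "C002": "YYY"},
}
signatures = minhash_signature(table, seeds=range(16))
same = estimated_similarity(signatures["R001"], signatures["R002"])
```

More hash functions give a finer-grained estimate at the cost of a longer signature per record.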

Alternatively, the hash value calculated by the hash function with hash function ID "h1" may be cyclically shifted, the XOR of the shifted value and a random number may be calculated, and the result may be used as a value equivalent to the hash value for hash function ID "h2". In this case, in the MinHash signature 302, the cell for hash function ID "h2" may store the minimum of these values equivalent to hash values for hash function ID "h2".
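Deriving a second hash value from the first by cyclic shift and XOR can be sketched in Python as follows. The 32-bit width, the left direction of the rotation, the shift amount, and the fixed random seed are assumptions; the source states only that the value is cyclically shifted and XORed with a random number.

```python
import random

MASK32 = 0xFFFFFFFF  # assumption: hash values are 32-bit unsigned integers


def rotate_left32(value, shift):
    """Cyclically shift a 32-bit value to the left by `shift` bits."""
    shift %= 32
    return ((value << shift) | (value >> (32 - shift))) & MASK32


def derive_hash(base_hash, shift, random_mask):
    """Derive an 'h2-like' value from an h1 hash: rotate, then XOR."""
    return rotate_left32(base_hash, shift) ^ random_mask


def derived_minimum(row_hashes, shift, random_mask):
    """Signature cell for the derived hash function: minimum over the row."""
    return min(derive_hash(h, shift, random_mask) for h in row_hashes)


# The shift and mask are fixed once per derived hash function and reused
# for every record, so equal h1 values still map to equal derived values.
rng = random.Random(42)  # fixed seed only to make the sketch reproducible
params_h2 = (7, rng.getrandbits(32))

row_r001 = [1234, 4122, 5628]  # h1 hashes of record R001 as in FIG. 9
signature_h2 = derived_minimum(row_r001, *params_h2)
```

Both the rotation and the XOR are bijections on 32-bit values, so the derivation introduces no additional collisions while avoiding a second pass of hashing over the original column values.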

FIG. 11 shows an example of a position list index (PLI) 303.

A PLI 303 may be generated for each column ID of the data table 201. In the example of FIG. 11, PLI 303A is the PLI for column ID "C001" of the data table 201. The same applies to PLI 303B, 303C, and 303D.

The PLI 303 for a given column ID associates and manages the record IDs that share the same hash value in the column of that column ID in the hash matrix 301, together with that shared hash value.

In the example of FIG. 11, PLI 303A shows that, in the hash matrix 301, a plurality of record IDs "R001" and "R003" have the same hash value "1234" in the column of column ID "C001".

Note that the structure of the PLI 303 resembles the data structure commonly known as a hash table. The PLI 303 may be obtained by building a hash table from the hash values of the hash matrix 301 and extracting only the buckets that have two or more entries.
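Building a PLI from one column of the hash matrix can be sketched in Python as follows. The function name `build_pli` and the dictionary layout are assumptions. In the sample data, only the value "1234" shared by R001 and R003 in the first column follows FIG. 11; the other hash values are made up.

```python
from collections import defaultdict


def build_pli(hash_matrix, column_index):
    """Group record IDs by hash value and keep buckets with 2+ entries."""
    buckets = defaultdict(list)
    for record_id, hashes in hash_matrix.items():
        buckets[hashes[column_index]].append(record_id)
    return {h: ids for h, ids in buckets.items() if len(ids) >= 2}


# Hash matrix keyed by record ID; index 0 is C001, index 1 is C002.
hash_matrix = {
    "R001": [1234, 4122],
    "R002": [9876, 4122],  # hypothetical values
    "R003": [1234, 7777],  # hypothetical except for 1234 in C001
}
pli_c001 = build_pli(hash_matrix, 0)
pli_c002 = build_pli(hash_matrix, 1)
```

An empty PLI means that no two records share a hash value in that column, and therefore no two records share a column value, so the column alone identifies every record.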

FIG. 12 is a flowchart showing an example of the process of generating the similar record matrix 250. This process corresponds to step S606 in FIG. 8.

The metadata generation unit 206 generates the MinHash signature 302 corresponding to the data table 201 (step S804). The MinHash signature 302 may be generated as described above with reference to FIG. 10.

The metadata generation unit 206 may divide the columns of the generated MinHash signature 302 into several groups (step S806). Each of these groups is referred to here as a "band".

For each record ID of the MinHash signature 302, the metadata generation unit 206 may concatenate the hash values of the columns belonging to a band, and may then apply a predetermined hash function to the concatenated value to calculate a hash value (step S808). The metadata generation unit 206 may perform this processing for each band. This hash calculation may be the algorithm known as LSH (Locality Sensitive Hashing). Records that yield the same hash value are known to be likely to be similar.
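The banding step can be illustrated with the sketch below. It assumes that each record's MinHash signature is a list of integers, uses SHA-1 as the "predetermined hash function", and takes the band width as a parameter `rows_per_band`; all of these are choices made for this example only.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, rows_per_band):
    """Return pairs of record IDs whose signatures agree on at least one band.

    signatures maps record ID -> MinHash signature (list of ints, equal length).
    """
    if not signatures:
        return set()
    sig_len = len(next(iter(signatures.values())))
    pairs = set()
    for start in range(0, sig_len, rows_per_band):
        buckets = defaultdict(list)
        for record_id, sig in signatures.items():
            # Concatenate the band's hash values, then hash the result.
            band = ",".join(str(v) for v in sig[start:start + rows_per_band])
            key = hashlib.sha1(band.encode("utf-8")).hexdigest()
            buckets[key].append(record_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical signatures: R001 and R002 agree on the first band only.
example_signatures = {
    "R001": [1, 2, 3, 4],
    "R002": [1, 2, 9, 9],
    "R003": [7, 8, 5, 6],
}
example_pairs = lsh_candidate_pairs(example_signatures, rows_per_band=2)
```

Narrower bands produce more candidate pairs, and wider bands produce fewer.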

The metadata generation unit 206 executes step S814 for every set of records having the same hash value (LOOP2). The set of records selected in each iteration of the loop is referred to as the "selected record set".

The metadata generation unit 206 calculates the probability that the hash values in the MinHash signatures 302 of the selected record set match (step S814). This probability is known to approximate the Jaccard coefficient, an index of the similarity between two sets. The probability is therefore taken as the similarity and stored in the similar record matrix 250.
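One common way to estimate this matching probability is the fraction of signature positions at which two records agree. The sketch below takes that approach; the function names and the dictionary layout of the similar record matrix are assumptions, not details given in the text.

```python
def estimated_similarity(sig_a, sig_b):
    """Fraction of positions at which two MinHash signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def build_similar_record_matrix(signatures, candidate_pairs):
    """Store the estimated similarity for each candidate pair of record IDs."""
    return {
        (a, b): estimated_similarity(signatures[a], signatures[b])
        for a, b in candidate_pairs
    }


# Hypothetical signatures that agree at three of four positions.
example_signatures = {"R001": [1, 2, 3, 4], "R003": [1, 2, 9, 4]}
example_matrix = build_similar_record_matrix(
    example_signatures, [("R001", "R003")]
)
```

Longer signatures give a finer-grained estimate at the cost of more hash computations per record.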

FIG. 13 is a flowchart showing an example of the process of extracting UCC candidate columns.

This process extracts columns that are likely to be included in a UCC (referred to as "UCC candidate columns") before the process of generating the UCC list 260 (see FIG. 14). Performing this process reduces the amount of processing needed for UCC detection.

The metadata generation unit 206 executes steps S904 to S908 for each column of the hash matrix 301 (LOOP1). The column selected in each iteration of the loop is referred to as the "selected column".

The metadata generation unit 206 calculates the cardinality of the selected column of the data table 201 using the hash values of the selected column in the hash matrix 301 (step S904). The cardinality of a column may be the number of distinct column values stored in that column. The HyperLogLog algorithm may be used to approximate the cardinality from the hash values.

The metadata generation unit 206 determines whether the calculated cardinality is less than or equal to a predetermined threshold (step S906). If the determination is affirmative (step S906: YES), the metadata generation unit 206 excludes the selected column from the UCC candidates (step S908). This is because a column with low cardinality has few distinct column values and is therefore unlikely to form part of a UCC.

If the determination is negative (step S906: NO), the metadata generation unit 206 need not do anything in particular. The UCC candidate columns are extracted by the above processing.
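The filtering in steps S904 to S908 might look like the following sketch. Note that it counts distinct hash values exactly with a set instead of using HyperLogLog, which is a simplification made here; the function name, the data layout, and the threshold value are likewise assumptions.

```python
def extract_ucc_candidates(hash_matrix, threshold):
    """Return the column IDs whose cardinality exceeds `threshold`.

    hash_matrix maps column ID -> list of hash values (one per record).
    """
    candidates = []
    for column_id, hash_values in hash_matrix.items():
        # Step S904: exact distinct count, standing in for an
        # approximate estimator such as HyperLogLog.
        cardinality = len(set(hash_values))
        # Steps S906 and S908: drop low-cardinality columns.
        if cardinality <= threshold:
            continue
        candidates.append(column_id)
    return candidates


# Hypothetical hash matrix with four records and three columns.
example_hash_matrix = {
    "C001": [1, 2, 3, 4],
    "C002": [1, 1, 1, 2],
    "C003": [5, 6, 7, 7],
}
example_candidates = extract_ucc_candidates(example_hash_matrix, threshold=2)
```

An approximate estimator becomes attractive when the table is too large for an exact distinct count to be cheap.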

FIG. 14 is a flowchart showing an example of the process of generating the UCC list 260. This process corresponds to step S606 in FIG. 8. It is an example of generating the UCC list from the UCC candidate columns extracted by the process of FIG. 13.

The metadata generation unit 206 executes steps S1004 to S1008 for each of the UCC candidate columns (LOOP1). The UCC candidate column selected in each iteration of the loop is referred to as the "selected UCC candidate column".

The metadata generation unit 206 generates the PLI 303 for the selected UCC candidate column using the hash values of the hash matrix 301 (step S1004).

The metadata generation unit 206 determines whether the PLI 303 contains an entry having two or more record IDs (step S1006). If the PLI 303 contains no such entry (step S1006: NO), the metadata generation unit 206 registers the selected UCC candidate column in the UCC list 260 (step S1008), because the selected UCC candidate column can guarantee the uniqueness of records on its own. If the PLI 303 contains an entry having two or more record IDs (step S1006: YES), the metadata generation unit 206 need not do anything in particular.
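A minimal sketch of this first loop is shown below. It assumes each candidate column is given as a mapping from record ID to hash value; the function names and example values are hypothetical.

```python
from collections import defaultdict


def build_pli(hash_column):
    """Group record IDs by hash value and keep only groups of two or more."""
    buckets = defaultdict(list)
    for record_id, hash_value in hash_column.items():
        buckets[hash_value].append(record_id)
    return {h: ids for h, ids in buckets.items() if len(ids) >= 2}


def single_column_uccs(candidate_columns):
    """Split candidate columns into single-column UCCs and the remainder.

    candidate_columns maps column ID -> {record ID: hash value}.
    """
    ucc_list, remaining = [], []
    for column_id, hash_column in candidate_columns.items():
        pli = build_pli(hash_column)      # step S1004
        if not pli:                       # step S1006: NO
            ucc_list.append(column_id)    # step S1008
        else:                             # step S1006: YES
            remaining.append(column_id)
    return ucc_list, remaining


# Hypothetical candidates: C001 has no repeated hash value, C002 does.
example_candidates = {
    "C001": {"R001": 11, "R002": 12, "R003": 13},
    "C002": {"R001": 21, "R002": 21, "R003": 23},
}
example_uccs, example_remaining = single_column_uccs(example_candidates)
```

The remaining columns are the ones that would go on to be examined in combination.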

Next, the metadata generation unit 206 executes steps S1012 to S1016 for every combination of the remaining UCC candidate columns that were not registered in the UCC list by the above processing (LOOP2). The combination of UCC candidate columns selected in each iteration of the loop is referred to as the "selected UCC candidate column set".

The metadata generation unit 206 treats each entry of the PLIs 303 for the selected UCC candidate column set as a set of record IDs and calculates their intersection (step S1012).

The metadata generation unit 206 determines whether the calculated intersection is the empty set (step S1014). If the intersection is empty (step S1014: YES), the metadata generation unit 206 registers the selected UCC candidate column set in the UCC list 260, because that set of columns can guarantee the uniqueness of records. If the intersection is not empty (step S1014: NO), the metadata generation unit 206 need not do anything in particular.
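The second loop can be sketched as follows for pairs of columns. It follows the criterion as stated in the text, namely that a pair is registered when no PLI entry of one column shares a record ID with any PLI entry of the other. The restriction to pairs, the function names, and the example PLIs are assumptions.

```python
from itertools import combinations


def pair_is_ucc(pli_a, pli_b):
    """True if no entry of pli_a shares a record ID with an entry of pli_b."""
    for ids_a in pli_a.values():
        for ids_b in pli_b.values():
            if set(ids_a) & set(ids_b):   # steps S1012 and S1014
                return False
    return True


def pair_uccs(plis):
    """Return the column-ID pairs registered as UCCs.

    plis maps column ID -> PLI ({hash value: [record IDs]}).
    """
    return [
        (a, b)
        for a, b in combinations(sorted(plis), 2)
        if pair_is_ucc(plis[a], plis[b])
    ]


# Hypothetical PLIs of three remaining candidate columns.
example_plis = {
    "C002": {10: ["R001", "R002"]},
    "C003": {20: ["R003", "R004"]},
    "C004": {30: ["R002", "R003"]},
}
example_pair_uccs = pair_uccs(example_plis)
```

This criterion is conservative: a pair whose entries share only a single record ID still identifies every record uniquely, but it is not registered here.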

FIG. 15 is a flowchart showing an example of the data editing process.

When the data table 201 is edited (corrected), the contents of the metadata 207 may change accordingly. This process is an example of the data editing process and the metadata regeneration process that it triggers.

When the data editing unit 204 receives a data editing instruction (step S1102), it edits the data table 201 (step S1104). The data editing instruction may be passed to the data editing unit 204 from the operation unit 203, which accepts data editing input operations through the console 105.

The metadata generation unit 206 regenerates the metadata 207 (step S1106). Details of this process are described later (see FIG. 16).

The display unit 202 displays the contents of the edited data table 201 and the regenerated metadata 207 through the console 105 (step S1108).

FIG. 16 is a flowchart showing an example of the metadata regeneration process. This process corresponds to step S1106 in FIG. 15.

Two examples are described here: the case in which a data editing instruction for record deletion is received, and the case in which a data editing instruction for a cell update is received. A data editing instruction for record deletion is issued, for example, when a record determined to be a duplicate is deleted during cleansing work. A data editing instruction for a cell update is issued, for example, when data is rewritten to unify notation.

The metadata generation unit 206 determines whether the received data editing instruction is a record deletion or a cell update (step S1202).

First, the case in which a data editing instruction for record deletion is received is described. In this case, the PLIs 303 and the similar record matrix 250 are updated.

The metadata generation unit 206 acquires, from the hash matrix 301, the hash value of each column value belonging to the record to be deleted (step S1206).

From the PLI 303 of each column, the metadata generation unit 206 deletes the acquired hash value and the record ID to be deleted (step S1208).

The metadata generation unit 206 deletes the record ID to be deleted from the hash matrix 301, the MinHash signature 302, and the similar record matrix 250 (step S1210). The process then ends.
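The record deletion path (steps S1206 to S1210) could be sketched as below. The dictionary layouts are assumptions. The sketch also assumes that a PLI entry is dropped once fewer than two record IDs remain in it, in keeping with the earlier description that a PLI holds only buckets with two or more entries; the text itself does not spell this out.

```python
def delete_record(record_id, hash_matrix, plis, signatures, similarity):
    """Remove one record from all metadata structures, in place.

    hash_matrix: {record ID: {column ID: hash value}}
    plis:        {column ID: {hash value: [record IDs]}}
    signatures:  {record ID: MinHash signature}
    similarity:  {(record ID, record ID): similarity}
    """
    row = hash_matrix.pop(record_id)                      # steps S1206, S1210
    for column_id, hash_value in row.items():             # step S1208
        entry = plis.get(column_id, {}).get(hash_value)
        if entry and record_id in entry:
            entry.remove(record_id)
            if len(entry) < 2:                            # assumed rule
                del plis[column_id][hash_value]
    signatures.pop(record_id, None)                       # step S1210
    for pair in [p for p in similarity if record_id in p]:
        del similarity[pair]


# Hypothetical metadata for three records and one column.
ex_hash_matrix = {
    "R001": {"C001": 1234},
    "R002": {"C001": 5678},
    "R003": {"C001": 1234},
}
ex_plis = {"C001": {1234: ["R001", "R003"]}}
ex_signatures = {"R001": [1], "R002": [2], "R003": [1]}
ex_similarity = {("R001", "R003"): 1.0, ("R001", "R002"): 0.0}
delete_record("R003", ex_hash_matrix, ex_plis, ex_signatures, ex_similarity)
```

Only the entries touched by the deleted record are updated, so nothing has to be recomputed from the data table.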

Next, the case in which a data editing instruction for updating a cell value is received is described. In this case, the UCC list 260 is updated. In the description of this process, the updated cell value is referred to as the "updated cell value".

The metadata generation unit 206 calculates a hash value from the updated cell value and updates the hash matrix 301 with the calculated hash value (steps S1222 to S1224).

The metadata generation unit 206 updates the PLI 303 of the column containing the updated cell value (step S1226).

The metadata generation unit 206 determines whether the column ID of the column containing the updated cell value (referred to as the "updated column ID") is included in the UCC list 260 (step S1228). If the determination is affirmative (step S1228: YES), the metadata generation unit 206 proceeds to step S1230; if it is negative (step S1228: NO), the process ends.

For each UCC that includes the updated column ID, the metadata generation unit 206 performs the following steps S1232 to S1236 (step S1230).

Specifically, the metadata generation unit 206 acquires the entry for the updated hash value from the PLI 303 of the updated column ID. The metadata generation unit 206 then calculates the intersection of the record ID group of the acquired entry and the record ID groups of the entries of the other PLIs 303 (step S1232).

The metadata generation unit 206 determines whether the calculated intersection is the empty set (step S1234). In other words, it determines whether the acquired record ID group is absent from the record ID groups of all the other PLIs 303, or is contained in the record ID group of at least one of them.

If the intersection is not the empty set (step S1234: NO), the metadata generation unit 206 deletes from the UCC list 260 the set of column IDs of the PLIs 303 whose intersection is non-empty (step S1236), because this set of column IDs is no longer a UCC as a result of the cell value update.

If the intersection is the empty set (step S1234: YES), the metadata generation unit 206 need not do anything in particular.
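The cell update path (steps S1222 to S1236) can be sketched as follows. The sketch assumes that UCCs are tuples of one or two column IDs and that the hash of the updated cell value is computed elsewhere. It also treats a single-column UCC as broken when the updated hash value now has a PLI entry; this last rule is an assumption added here for completeness.

```python
from collections import defaultdict


def build_pli(hash_matrix, column_id):
    """PLI of one column: {hash value: set of record IDs}, size >= 2 only."""
    buckets = defaultdict(set)
    for record_id, row in hash_matrix.items():
        buckets[row[column_id]].add(record_id)
    return {h: ids for h, ids in buckets.items() if len(ids) >= 2}


def apply_cell_update(record_id, column_id, new_hash,
                      hash_matrix, plis, ucc_list):
    """Update the metadata for one cell and return the revised UCC list."""
    hash_matrix[record_id][column_id] = new_hash            # S1222 to S1224
    plis[column_id] = build_pli(hash_matrix, column_id)     # S1226
    if not any(column_id in ucc for ucc in ucc_list):       # S1228: NO
        return list(ucc_list)
    updated_entry = plis[column_id].get(new_hash, set())
    kept = []
    for ucc in ucc_list:                                    # S1230
        if column_id not in ucc:
            kept.append(ucc)
            continue
        shared = set(updated_entry)                         # S1232
        for other in ucc:
            if other != column_id:
                shared &= set().union(*plis[other].values())
        if not shared:                                      # S1234: YES
            kept.append(ucc)                                # else S1236
    return kept


# Hypothetical table: C001 is a single-column UCC before the update.
ex_matrix = {
    "R001": {"C001": 1, "C002": 10},
    "R002": {"C001": 2, "C002": 10},
    "R003": {"C001": 3, "C002": 30},
}
ex_plis = {"C001": {}, "C002": {10: {"R001", "R002"}}}
ex_result = apply_cell_update("R002", "C001", 1, ex_matrix, ex_plis,
                              [("C001",)])
```

Because only the UCCs that contain the updated column are re-examined, the remaining entries of the UCC list are carried over without further checks.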

In addition to the above processing, the metadata generation unit 206 may also update the similar record matrix 250.

In the above processing, the UCC list 260 is updated only when the column containing the updated cell value belongs to a UCC. That is, this embodiment reduces the amount of processing needed to update the UCC list 260 when a cell value is updated.

FIG. 17 shows an example of the data table display screen 400. This screen may be displayed by the processing of step S408 in FIG. 4 or step S1108 in FIG. 15.

Based on the data table 201, the display unit 202 may generate a data table display screen 400 such as that shown in FIG. 17 and display it on the console 105. The data table display screen 400 may show the record IDs and column IDs of the data table 201 together with the data.

FIG. 18 shows an example of the improved data table display screen 401. This screen may be displayed by the processing of step S408 in FIG. 4 or step S1108 in FIG. 15.

Based on the data table 201 and the metadata 207, the display unit 202 may generate an improved data table display screen 401 such as that shown in FIG. 18 and display it on the console 105.

The improved data table display screen 401 may include buttons 402 for selecting a column ID or column ID group belonging to the UCC list 260. When the person in charge of data cleansing presses a button 402, the columns of the data table corresponding to the column ID or column ID group selected by that button 402 may be highlighted in a manner distinguishable from the other columns, for example in a different color (see the hatched portion of FIG. 18).

The inter-record similarity display area 403 of the improved data table display screen 401 may show the similarity between records held in the similar record matrix 250. In doing so, the display unit 202 may place records with high similarity as near the top as possible. For example, the display unit 202 may sort and display the records in descending order of similarity.

Note that the improved data table display screen 401 of FIG. 18 is merely one example of displaying the information contained in the metadata 207 on the console 105, and the display mode is not limited to it.

According to this embodiment, records with high similarity can be displayed at the top. This allows the person in charge of data cleansing to easily find records that are likely to need cleansing.

Also according to this embodiment, columns belonging to a UCC can be displayed in a recognizable manner. This allows the person in charge of data cleansing to easily recognize which cell values, if modified, could break the UCC relationship.

Furthermore, according to the present invention, records with high similarity and columns belonging to a UCC can be displayed together. This allows the person in charge of data cleansing to correct data while bringing the semantic uniqueness of records into agreement with their notational uniqueness. That is, the person in charge of data cleansing can carry out the cleansing work efficiently.

The embodiments described above are examples given to explain the present invention and are not intended to limit its scope to those embodiments. Those skilled in the art can practice the present invention in various other forms without departing from its gist.

For example, part of the configuration of one embodiment may be replaced with the configuration of another embodiment. The configuration of another embodiment may be added to the configuration of one embodiment. Other configurations may be added to, deleted from, or substituted for part of the configuration of each embodiment.

Some or all of the configurations, functions, processing units, processing means, and the like in the embodiments described above may be implemented in hardware, for example by designing them as integrated circuits, or in software, by having a processor interpret and execute programs that implement the respective functions. Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk drive, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

The drawings show the control lines and information lines considered necessary for the explanation, and do not necessarily show all the control lines and information lines required in an implementation. That is, almost all of the components may be interconnected even where this is not illustrated.

100: data cleansing system; 201: data table; 206: metadata generation unit; 208: similar record detection unit; 209: UCC detection unit; 210: hash matrix generation unit

Claims

A data cleansing system comprising a processor and a memory, wherein
the processor:
reads a data table from the memory,
calculates the similarity between records of the data table,
detects a UCC (Unique Column Combination), which is a set of columns that enables each record of the data table to be uniquely identified, and
displays the similarity and the UCC.

The data cleansing system according to claim 1, wherein
the processor generates a hash matrix from the data table, and
performs the calculation of the similarity and the detection of the UCC using the generated hash matrix.

The data cleansing system according to claim 2, wherein the processor calculates the similarity between records of the data table based on the MinHash method.

The data cleansing system according to claim 2, wherein
the processor calculates the cardinality of each column of the data table from the generated hash matrix, and
excludes from the UCC candidates any column whose calculated cardinality is equal to or less than a predetermined threshold.

The data cleansing system according to claim 1, wherein
the processor, when displaying the contents of the data table, displays records having a high degree of similarity at the top and displays the columns included in the UCC in a manner distinguishable from other columns, and
accepts changes to values in the data table.

The data cleansing system according to claim 5, wherein
the processor, when there are a plurality of UCCs, accepts a selection of a UCC, and
displays the columns included in the selected UCC in a manner distinguishable from other columns.

The data cleansing system according to claim 5, wherein
the processor performs the detection of the UCC again when a value in a column included in the UCC is changed.

A data cleansing method comprising:
acquiring a data table,
calculating the similarity between records of the data table,
detecting a UCC (Unique Column Combination), which is a set of columns that enables each record of the data table to be uniquely identified, and
displaying the similarity and the UCC.

A computer program that causes a system that performs data cleansing to execute:
acquiring a data table,
calculating the similarity between records of the data table,
detecting a UCC (Unique Column Combination), which is a set of columns that enables each record of the data table to be uniquely identified, and
displaying the similarity and the UCC.