WO2017175375A1 - Data cleansing system, method, and program - Google Patents

Data cleansing system, method, and program

Info

Publication number
WO2017175375A1
Authority
WO
WIPO (PCT)
Prior art keywords
ucc
data
data table
record
column
Prior art date
Application number
PCT/JP2016/061532
Other languages
English (en)
Japanese (ja)
Inventor
健太郎 角井
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所
Priority to PCT/JP2016/061532 (WO2017175375A1)
Priority to JP2018510205A (JP6549786B2)
Publication of WO2017175375A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to data cleansing.
  • the master data of many companies is often not suitable for use.
  • Records are often not unique for various reasons, such as input errors and operational inadequacies, so duplicate records are frequently included.
  • duplicate records may become a problem when multiple master data are integrated in a corporate merger.
  • the uniqueness of the record is the semantic uniqueness of the entity of the master data.
  • Data such as personal names and address notations may be ambiguous because of spelling variations, alternative names, or omissions.
  • Ambiguity may also arise from the diversity of character encodings, such as the distinction between half-width and full-width characters. Conversely, even if two names are written identically, they may refer to different people.
  • It is important that each record corresponds to exactly one person and that the same person does not appear in a plurality of records.
  • A duplicate record in master data means that the entity indicated by the record is duplicated, regardless of notation. In order to utilize the data in the company, it is necessary to manually investigate and correct such duplicate records that can be included in the master data. This operation is generally called data cleansing.
  • Patent Document 1 discloses a technique for detecting similar records by generating feature vectors from each record of a data table and comparing them with each other.
  • a column that can uniquely identify a record is called a “key”.
  • A key may be created artificially, but if the values stored in a column (called “column values”) are all different (that is, uniqueness is guaranteed), the column can be used as a key. Even if uniqueness cannot be guaranteed with one column, a combination of multiple columns can be used as a composite key if that combination guarantees uniqueness.
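  • The notion of a key and a composite key described above can be illustrated with a minimal sketch. The function name `is_unique_combination` and the sample rows are assumptions made for illustration only; they do not appear in the original text, and the brute-force check is not the detection method of the embodiment.

```python
from typing import Sequence


def is_unique_combination(rows: Sequence[dict], columns: Sequence[str]) -> bool:
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # two records collide on this column set
        seen.add(key)
    return True


# Hypothetical sample data (not taken from the figures).
rows = [
    {"C001": "AAA", "C002": "CCC"},
    {"C001": "BBB", "C002": "CCC"},
    {"C001": "AAA", "C002": "DDD"},
]

print(is_unique_combination(rows, ["C001"]))          # single column
print(is_unique_combination(rows, ["C001", "C002"]))  # composite key
```

  • In this sample neither column is unique on its own, while the pair of columns distinguishes every record, which is the situation a composite key is meant to cover.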
  • However, a key (column) that is meant to guarantee the uniqueness of records may fail to do so.
  • In particular, the person performing data cleansing is likely to overlook the loss of uniqueness of a composite key.
  • an object of the present invention is to improve the work efficiency and / or work accuracy of data cleansing.
  • the data cleansing system includes a processor and a memory.
  • The processor reads the data table from the memory, calculates the similarity between records in the data table, and detects a UCC (Unique Column Combination), which is a set of columns that allows each record of the data table to be uniquely identified.
  • A physical configuration example of a data cleansing system is shown.
  • A configuration example of the functions of the data cleansing system is shown.
  • An example of the functions of the metadata generation unit is shown. A flowchart showing an example of data display processing is shown. An example of a data table is shown. An example of a similar record matrix is shown. An example of a UCC list is shown. A flowchart showing an example of metadata generation processing is shown.
  • Information may be described in terms of “xxx table” or “xxx list”, but the information may be expressed in any data structure. That is, to show that the information does not depend on the data structure, the “xxx table” or the “xxx list” can be called “xxx information”. Furthermore, in describing the contents of each piece of information, the expressions “identification information”, “identifier”, “name”, and “ID” are used, but these can be replaced with each other.
  • the process may be described using “program” as a subject.
  • the program is executed by a processor (for example, a CPU (Central Processing Unit)), so that a predetermined process is appropriately performed. Since the processing is performed using at least one of a storage resource (for example, a memory) and a communication interface device, the subject of the processing may be a processor and an apparatus having the processor. Part or all of the processing performed by the processor may be performed by a hardware circuit.
  • the computer program may be installed from a program source.
  • the program source may be a program distribution server or a storage medium (for example, a portable storage medium).
  • FIG. 1 shows a physical configuration example of the data cleansing system 100.
  • the data cleansing system 100 is an example of a computer, and includes a processor 101, a memory 102, a storage 103, a network interface 104, and a console 105.
  • An example of the data cleansing system 100 is a personal computer, a rack mount server, a blade server, or the like.
  • the processor 101 is connected to the memory 102, the storage 103, the network interface 104, and the console 105 so as to be capable of bidirectional communication.
  • the data cleansing system 100 may have only some of these components, or may have multiple identical components.
  • the processor 101 is a hardware arithmetic device such as a CPU (Central Processing Unit), and reads a program from the memory 102 and executes it.
  • The memory 102 is composed of a volatile semiconductor memory, and holds programs and data. Examples of the memory 102 are DRAM (Dynamic Random Access Memory), MRAM (Magnetoresistive Random Access Memory), and FeRAM (Ferroelectric Random Access Memory).
  • the storage 103 is composed of a non-volatile storage device and holds programs and data.
  • An example of the storage 103 is an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination thereof.
  • the network interface 104 is composed of a communication device such as a NIC (Network Interface Controller) and is connected to the network 106, for example.
  • the network interface 104 controls a protocol for communicating with other devices via the network 106.
  • Examples of the network 106 include Ethernet (registered trademark), a wireless network based on the IEEE (Institute of Electrical and Electronics Engineers) 802.11 standard, a SONET/SDH (Synchronous Optical Network / Synchronous Digital Hierarchy) network, and a network that combines several of these network technologies.
  • the console 105 includes, for example, an input device such as a keyboard and a mouse and a display device such as a liquid crystal display panel.
  • The console 105 may receive an operation signal corresponding to an operation input from the input device, and notify the processor 101 of the content of the operation signal. The console 105 then displays, on the display device, text and images based on the text information and graphical information output from the processor 101.
  • the OS (Operating System) and the user program stored in the storage 103 may be read into the memory 102 when the data cleansing system 100 is started or when it is executed.
  • Various functions of the data cleansing system 100 may be realized by the processor 101 executing the OS and the user program read to the memory 102.
  • a program executed by the processor 101 may be introduced into the data cleansing system 100 via a removable medium (CD-ROM, flash memory, etc.) or a network and stored in the storage 103. For this reason, the data cleansing system 100 may have an interface for reading data from a removable medium.
  • FIG. 2 shows a configuration example of the functions of the data cleansing system 100.
  • the data cleansing system 100 includes a display unit 202, an operation unit 203, a data editing unit 204, an input / output unit 205, and a metadata generation unit 206 as functions. These functions may be realized by a program stored in the memory 102 being executed by the processor 101.
  • the operation unit 203 interprets the operation content input through the console 105 as various commands.
  • the operation unit 203 may pass a data display command to the display unit 202 and a data editing command to the data editing unit 204.
  • the input / output unit 205 reads the data file 107 stored in the storage 103 and stores it in the memory 102 as the data table 201.
  • the metadata generation unit 206 generates metadata 207 based on the data table 201 stored in the memory 102.
  • the metadata generation unit 206 may store the generated metadata 207 in the memory 102. Details of the metadata generation unit 206 will be described later.
  • the display unit 202 displays information related to data cleansing through the console 105.
  • the display unit 202 may display information related to the data table 201 and the metadata 207 through the console 105.
  • the display unit 202 may change the information display mode according to the content of the data display command received from the operation unit 203.
  • the above-described functions may be distributed to a plurality of data cleansing systems 100 in order to distribute processing load and improve availability.
  • the data cleansing system 100 may be composed of one physical computer or a plurality of logical or physical computers.
  • the above-described functions may be realized by a plurality of processors 101 performing communication via the network 106.
  • FIG. 3 shows an example of functions included in the metadata generation unit 206.
  • the metadata generation unit 206 may include a similar record detection unit 208, a UCC detection unit 209, and a hash matrix generation unit 210 as functions.
  • the metadata 207 may include a similar record matrix 250 and a UCC list 260.
  • the hash matrix generation unit 210 generates a hash matrix 301 (see FIG. 9) from the data table 201.
  • the similar record detection unit 208 calculates the similarity between records included in the data table 201 using the hash matrix 301 generated by the hash matrix generation unit 210.
  • The similar record detection unit 208 may store the calculated similarity in the similar record matrix 250.
  • the UCC detection unit 209 detects the UCC (column set) from the data table 201 using the hash matrix 301 generated by the hash matrix generation unit 210.
  • the UCC detection unit 209 may register the detected UCC in the UCC list 260.
  • the similar record detection unit 208 and the UCC detection unit 209 may use the same hash matrix 301 generated by the hash matrix generation unit 210. Thereby, the processing amount of the entire system can be reduced as compared with a case where a hash matrix for detecting similar records and a hash matrix for detecting UCC are generated separately.
  • FIG. 4 is a flowchart showing an example of data display processing.
  • the input / output unit 205 reads the data file 107 from the storage 103 (step S402).
  • the input / output unit 205 parses the data file 107 serialized in a format such as CSV (Comma-Separated Values) format, for example, and generates a data table 201 (see FIG. 5) (step S404).
  • the metadata generation unit 206 generates metadata 207 from the generated data table 201 and stores it in the memory 102 (step S406). Details of this processing will be described later.
  • the display unit 202 displays the data table 201 and the metadata 207 through the console 105 (step S408).
  • the display example will be described later (see FIGS. 17 and 18).
  • FIG. 5 shows an example of the data table 201.
  • the data table 201 is target data for data cleansing in this embodiment. Any data may be stored in the data table 201.
  • the data table 201 includes a plurality of records and a plurality of columns, and a value (referred to as a column value or a cell value) may be stored in each column of the record.
  • A record ID (R001, R002, ...) that can uniquely identify the record is assigned to each record.
  • A column ID (C001, C002, ...) that can uniquely identify the column is assigned to each column.
  • the record ID and / or the column ID do not need to be included in the original data file 107 and may be given by the parsing process of the input / output unit 205.
  • the record ID may be called a record name.
  • the column ID may be called a column name.
  • The record with the record ID “R001” has a column value “AAA” in the column ID “C001”, a column value “CCC” in the column ID “C002”, a column value “0” in the column ID “C003”, and a column value “0” in the column ID “C004”.
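  • The parsing step that produces the data table 201 can be sketched as follows. This is a minimal illustration, assuming a CSV file without a header row; the function name `parse_data_file`, the nested-dictionary layout, and the second sample row are assumptions, while the first row reproduces the values given above for the record “R001”.

```python
import csv
import io


def parse_data_file(text: str) -> dict:
    """Parse CSV text into {record_id: {column_id: value}}.

    Record IDs (R001, ...) and column IDs (C001, ...) are assigned here,
    since they need not be present in the original file.
    """
    table = {}
    column_ids = []
    for r, values in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not column_ids:
            # Derive the column IDs from the width of the first row.
            column_ids = [f"C{c:03d}" for c in range(1, len(values) + 1)]
        table[f"R{r:03d}"] = dict(zip(column_ids, values))
    return table


# First row follows the example in the text; the second row is hypothetical.
data_table = parse_data_file("AAA,CCC,0,0\nBBB,CCC,1,0\n")
print(data_table["R001"])
```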
  • FIG. 6 shows an example of the similar record matrix 250.
  • the similar record matrix 250 may be included in the metadata 207.
  • the similar record matrix 250 manages the similarity between two records included in the data table 201.
  • Each record ID included in the data table 201 may be assigned to each row and each column in the similar record matrix 250.
  • the cell at the intersection of the row record ID and the column record ID may store the similarity between the record ID record of the row and the record ID record of the column.
  • the similarity may be a value that can be in the range of 0 to 1, indicating that the larger the value, the more similar.
  • the example of FIG. 6 indicates that the similarity between the record IDs “R002” and “R001” is “0.80” (relatively similar).
  • FIG. 7 shows an example of the UCC list 260.
  • the UCC list 260 may be included in the metadata 207.
  • the UCC list 260 manages a set of column IDs (that is, UCC) that can uniquely identify each record of the data table 201.
  • the column IDs “C001” and “C002” are UCC.
  • a set of column IDs “C001” and “C002” is stored in the UCC list 260.
  • Each UCC may be given a UCC ID (U001, U002, ...) that can uniquely identify the UCC (a set of column IDs).
  • FIG. 8 is a flowchart showing an example of metadata generation processing.
  • the metadata generation unit 206 calculates a hash value of each column value of the data table 201 using a certain hash function (step S602).
  • the metadata generation unit 206 generates a hash matrix 301 (see FIG. 9) for the data table 201 using the calculated hash value (step S604).
  • the metadata generation unit 206 generates the similar record matrix 250 (see FIG. 6) and the UCC list 260 (see FIG. 7) using the generated hash matrix 301 (step S606).
  • FIG. 9 shows an example of the hash matrix 301.
  • the hash matrix 301 is a matrix composed of hash values calculated by applying a certain hash function to each column value of the data table 201.
  • the hash matrix 301 may be generated for each different hash function.
  • an ID that can uniquely identify each hash function is referred to as a “hash function ID”.
  • the hash matrix 301 may have each record ID and each column ID of the data table 201 in each row and each column. In the cell at the intersection of the row record ID and the column ID of the column, the hash value of the column value of the column ID in the record of the record ID of the data table 201 may be stored.
  • record IDs are assigned to rows and column IDs are assigned to columns for the sake of explanation. Such an ID may not be assigned to the hash matrix 301 actually stored in the memory.
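  • Generating the hash matrix 301 from the data table 201 can be sketched as follows. The text only says that "a certain hash function" is used, so the choice of a 32-bit value taken from SHA-256, the function names, and the sample table are all assumptions made for illustration.

```python
import hashlib


def hash_value(value: str, bits: int = 32) -> int:
    """Deterministic hash of a column value, truncated to `bits` bits."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def build_hash_matrix(data_table: dict) -> dict:
    """Map every cell of {record_id: {column_id: value}} to its hash value."""
    return {
        record_id: {cid: hash_value(v) for cid, v in record.items()}
        for record_id, record in data_table.items()
    }


# Hypothetical two-record table.
data_table = {
    "R001": {"C001": "AAA", "C002": "CCC"},
    "R002": {"C001": "AAA", "C002": "DDD"},
}
hash_matrix = build_hash_matrix(data_table)
# Equal column values always yield equal hash values.
print(hash_matrix["R001"]["C001"] == hash_matrix["R002"]["C001"])
```

  • A deterministic hash is used here rather than Python's built-in `hash`, whose results for strings vary between interpreter runs.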
  • FIG. 10 shows an example of the MinHash signature 302.
  • the MinHash signature 302 is used for the MinHash method.
  • the MinHash signature 302 may be generated based on the hash matrix 301.
  • Each record ID of the hash matrix 301 may be assigned to each row of the MinHash signature 302. Further, each hash function ID (h1, h2, ...) described in FIG. 9 may be assigned to each column.
  • The cell at the intersection of a record ID and a hash function ID of the MinHash signature 302 stores the minimum hash value among the plurality of hash values belonging to that record ID in the hash matrix 301 generated from the hash function with that hash function ID. For example, if the hash matrix 301 in FIG. 9 is generated from the hash function with the hash function ID “h1”, the cell of the MinHash signature 302 at the intersection of the record ID “R001” and the hash function ID “h1” stores “1234”, the smallest of the hash values “1234”, “4122”, “5628”, ... belonging to the record with the record ID “R001” in the hash matrix 301.
  • Alternatively, the hash value calculated by the hash function with the hash function ID “h1” may be cyclically shifted and XORed with a random number, and the result may be used as the value corresponding to the hash value for the hash function ID “h2”.
  • the cell having the hash function ID “h2” may store the minimum value among the values corresponding to the hash value related to the hash function ID “h2”.
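  • The construction of the MinHash signature 302, including the derivation of further hash functions by cyclic shift and XOR, can be sketched as follows. The 32-bit width, the seeded random parameters, and the function names are assumptions; the hash values of “R001” follow the example in the text, and those of “R002” are hypothetical.

```python
import random

BITS = 32
MASK = (1 << BITS) - 1


def rotate_left(x: int, shift: int) -> int:
    """Cyclically shift a BITS-wide value to the left."""
    shift %= BITS
    return ((x << shift) | (x >> (BITS - shift))) & MASK


def derive_params(num_functions: int, seed: int = 0) -> list:
    """(shift, xor_mask) pairs; the first pair leaves the h1 values unchanged."""
    rng = random.Random(seed)
    params = [(0, 0)]
    for _ in range(num_functions - 1):
        params.append((rng.randrange(1, BITS), rng.getrandbits(BITS)))
    return params


def minhash_signature(hash_matrix: dict, params: list) -> dict:
    """For each record, keep the minimum derived hash value per hash function."""
    return {
        record_id: [
            min(rotate_left(h, shift) ^ mask for h in row.values())
            for shift, mask in params
        ]
        for record_id, row in hash_matrix.items()
    }


hash_matrix = {
    "R001": {"C001": 1234, "C002": 4122, "C003": 5628},
    "R002": {"C001": 2000, "C002": 4122, "C003": 7777},  # hypothetical
}
signature = minhash_signature(hash_matrix, derive_params(3))
print(signature["R001"][0])  # minimum of the h1 values for R001
```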
  • FIG. 11 shows an example of the position list index (PLI) 303.
  • the PLI 303 may be generated for each column ID of the data table 201.
  • The PLI 303A is a PLI related to the column ID “C001” of the data table 201.
  • The same applies to the PLIs 303B, 303C, and 303D.
  • The PLI 303 for a given column ID associates each hash value that is shared by a plurality of record IDs in that column of the hash matrix 301 with those record IDs.
  • the PLI 303A indicates that there are a plurality of record IDs “R001” and “R003” having the same hash value “1234” in the column of the column ID “C001” in the hash matrix 301.
  • PLI 303 is similar to a data structure commonly known as a hash table.
  • the PLI 303 may be a table in which a hash table is generated using a hash value of the hash matrix 301 and only a bucket having two or more entries is extracted.
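  • Building a PLI as a hash table from which only buckets with two or more entries are kept can be sketched as follows. The function name and the dictionary layout are assumptions; the sample values reproduce the case in the text where “R001” and “R003” share the hash value “1234” in the column “C001”, and the value for “R002” is hypothetical.

```python
from collections import defaultdict


def build_pli(hash_matrix: dict, column_id: str) -> dict:
    """Return {hash_value: [record_ids]} for values shared by 2+ records."""
    buckets = defaultdict(list)
    for record_id, row in hash_matrix.items():
        buckets[row[column_id]].append(record_id)
    # Keep only buckets that contain at least two record IDs.
    return {h: rids for h, rids in buckets.items() if len(rids) >= 2}


hash_matrix = {
    "R001": {"C001": 1234},
    "R002": {"C001": 9999},  # hypothetical
    "R003": {"C001": 1234},
}
print(build_pli(hash_matrix, "C001"))
```

  • A column whose PLI is empty contains no repeated value, which is the property the UCC detection relies on.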
  • FIG. 12 is a flowchart showing an example of the generation process of the similar record matrix 250. This process corresponds to the process in step S606 in FIG.
  • the metadata generation unit 206 generates a MinHash signature 302 corresponding to the data table 201 (step S804).
  • the MinHash signature 302 may be generated as described above with reference to FIG.
  • the metadata generation unit 206 may divide the plurality of columns of the generated MinHash signature 302 into several groups. Here, each divided group is referred to as a “band” (step S806).
  • the metadata generation unit 206 may combine the hash values of the columns belonging to the band for each record ID of the MinHash signature 302. Then, the metadata generation unit 206 may calculate a hash value by applying a predetermined hash function to the combined hash value (step S808). The metadata generation unit 206 may perform this process for each band.
  • This hash value calculation process may be an algorithm known as so-called LSH (Locality Sensitive Hashing). In this case, it is known that a set of records having the same hash value is likely to be similar.
  • the metadata generation unit 206 executes the process of step S814 for each set of all records having the same hash value (LOOP2).
  • a set of records selected in each loop process is referred to as a “selected record set”.
  • The metadata generation unit 206 calculates the probability that the hash values of the MinHash signatures 302 of the selected record set match (step S814). This probability is known to approximate the Jaccard coefficient, an index of similarity between two sets. Therefore, this probability is used as the similarity and stored in the similar record matrix 250.
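  • The banding step and the similarity estimate described above can be sketched as follows. The parameter `columns_per_band`, the use of SHA-256 as the "predetermined hash function", the function names, and the sample signature are assumptions made for illustration, not details from the original text.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def band_key(values) -> str:
    """Combine the hash values of one band and hash the result."""
    joined = ",".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def candidate_pairs(signature: dict, columns_per_band: int) -> set:
    """Pairs of record IDs that fall into the same bucket in some band."""
    pairs = set()
    width = len(next(iter(signature.values())))
    for start in range(0, width, columns_per_band):
        buckets = defaultdict(list)
        for record_id, sig in signature.items():
            buckets[band_key(sig[start:start + columns_per_band])].append(record_id)
        for rids in buckets.values():
            pairs.update(combinations(sorted(rids), 2))
    return pairs


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of signature positions that match (Jaccard estimate)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def similar_record_matrix(signature: dict, columns_per_band: int) -> dict:
    return {
        (a, b): estimated_similarity(signature[a], signature[b])
        for a, b in candidate_pairs(signature, columns_per_band)
    }


# Hypothetical MinHash signature with four hash functions.
signature = {
    "R001": [1, 2, 3, 4],
    "R002": [1, 2, 9, 4],
    "R003": [7, 8, 5, 6],
}
print(similar_record_matrix(signature, columns_per_band=2))
```

  • Only pairs that collide in at least one band are compared, so the similarity is not computed for every pair of records.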
  • FIG. 13 is a flowchart showing an example of the extraction process of the UCC candidate column.
  • This process is a process for extracting a column (referred to as a “UCC candidate column”) that is likely to be included in the UCC before the UCC list 260 generation process (see FIG. 14). By performing this process, the processing amount of UCC detection can be reduced.
  • the metadata generation unit 206 executes steps S904 to S908 for all the columns of the hash matrix 301 (LOOP1). A column selected in each loop process is referred to as a “selected column”.
  • the metadata generation unit 206 calculates the cardinality of the selected column of the data table 201 using the hash value of the selected column of the hash matrix 301 (step S904).
  • The cardinality of a column may be the number of distinct column values stored in the column.
  • the HyperLogLog algorithm may be adopted as a method for approximating the cardinality from the hash value.
  • the metadata generation unit 206 determines whether or not the calculated cardinality is equal to or less than a predetermined threshold (step S906). When the determination result is affirmative (step S906: YES), the metadata generation unit 206 excludes the selected column from the UCC candidate (step S908). This is because a column having a low cardinality has a small number of different column values and thus has a low possibility of forming a UCC.
  • When the determination result is negative (step S906: NO), the metadata generation unit 206 does not have to do anything.
  • Through the above processing, the UCC candidate columns are extracted.
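  • The extraction of UCC candidate columns by cardinality can be sketched as follows. Note that this sketch counts distinct hash values exactly with a set, as a stand-in for the approximate HyperLogLog estimate mentioned in the text; the function names, the threshold, and the sample hash matrix are assumptions.

```python
def column_cardinality(hash_matrix: dict, column_id: str) -> int:
    """Exact number of distinct hash values in a column (HyperLogLog stand-in)."""
    return len({row[column_id] for row in hash_matrix.values()})


def extract_ucc_candidate_columns(hash_matrix: dict, column_ids, threshold: int) -> list:
    """Exclude columns whose cardinality is equal to or less than the threshold."""
    return [
        cid for cid in column_ids
        if column_cardinality(hash_matrix, cid) > threshold
    ]


# Hypothetical hash matrix: C002 holds the same value in every record.
hash_matrix = {
    "R001": {"C001": 1, "C002": 5, "C003": 0},
    "R002": {"C001": 2, "C002": 5, "C003": 0},
    "R003": {"C001": 3, "C002": 5, "C003": 1},
}
print(extract_ucc_candidate_columns(hash_matrix, ["C001", "C002", "C003"], threshold=1))
```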
  • FIG. 14 is a flowchart illustrating an example of a process for generating the UCC list 260. This process corresponds to the process in step S606 in FIG. This process is an example of a process for generating a UCC list from the UCC candidate columns extracted in the process of FIG.
  • the metadata generation unit 206 executes steps S1004 to S1008 for each of all UCC candidate columns (LOOP1).
  • the UCC candidate column selected in each loop process is referred to as “selected UCC candidate column”.
  • the metadata generation unit 206 generates the PLI 303 for the selected UCC candidate column using the hash value of the hash matrix 301 (step S1004).
  • the metadata generation unit 206 determines whether there is an entry having two or more record IDs in the PLI 303 (step S1006). If there is no entry having two or more record IDs in the PLI 303 (step S1006: NO), the metadata generation unit 206 registers the selected UCC candidate column in the UCC list 260 (step S1008). This is because the selected UCC candidate column alone can guarantee the uniqueness of the record. When there is an entry having two or more record IDs in the PLI 303 (step S1006: YES), the metadata generation unit 206 does not have to do anything.
  • the metadata generation unit 206 executes Steps S1012 to S1016 for each of all the pairs based on the remaining UCC candidate columns that are not registered in the UCC list in the above processing (LOOP2).
  • a set of UCC candidate columns selected in each loop process is referred to as a “selected UCC candidate column set”.
  • The metadata generation unit 206 regards each entry of the PLIs 303 for the selected UCC candidate column set as a set of record IDs, and calculates their common set, that is, their intersection (step S1012).
  • the metadata generation unit 206 determines whether or not the calculated common set is an empty set (step S1014). When the common set is an empty set (step S1014: YES), the metadata generation unit 206 registers the set of selected UCC candidate columns in the UCC list 260. This is because the selected UCC candidate column set can guarantee the uniqueness of the record in the column set. When the common set is not an empty set (step S1014: NO), the metadata generation unit 206 does not have to do anything.
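  • The two loops of the UCC list generation can be sketched as follows. This is an interpretation under stated assumptions: the text tests whether the common set is empty, whereas the sketch treats a pair of columns as conflicting only when two PLI entries share two or more record IDs, since a single shared record cannot duplicate itself. The function names and the sample hash matrix are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_pli(hash_matrix: dict, column_id: str) -> dict:
    """{hash_value: set(record_ids)} restricted to values shared by 2+ records."""
    buckets = defaultdict(set)
    for record_id, row in hash_matrix.items():
        buckets[row[column_id]].add(record_id)
    return {h: rids for h, rids in buckets.items() if len(rids) >= 2}


def detect_uccs(hash_matrix: dict, candidate_columns) -> list:
    plis = {cid: build_pli(hash_matrix, cid) for cid in candidate_columns}
    ucc_list, remaining = [], []
    # First loop: a column with an empty PLI is unique on its own.
    for cid in candidate_columns:
        if plis[cid]:
            remaining.append(cid)
        else:
            ucc_list.append((cid,))
    # Second loop: intersect the PLI entries of each remaining column pair.
    for a, b in combinations(remaining, 2):
        conflict = any(
            len(ra & rb) >= 2
            for ra in plis[a].values()
            for rb in plis[b].values()
        )
        if not conflict:
            ucc_list.append((a, b))
    return ucc_list


# Hypothetical hash matrix.
hash_matrix = {
    "R001": {"C001": 1, "C002": 10, "C003": 100},
    "R002": {"C001": 2, "C002": 10, "C003": 101},
    "R003": {"C001": 1, "C002": 11, "C003": 102},
}
print(detect_uccs(hash_matrix, ["C001", "C002", "C003"]))
```

  • In this sample the column “C003” is unique alone, and the columns “C001” and “C002” are unique only in combination.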
  • FIG. 15 is a flowchart showing an example of the data editing process.
  • This process is an example of a data editing process and a metadata regeneration process that occurs accordingly.
  • the data editing unit 204 edits the data table 201 (step S1104).
  • This data editing command may be passed to the data editing unit 204 from the operation unit 203 that has received the data editing input operation via the console 105.
  • the metadata generation unit 206 regenerates the metadata 207 (step S1106). Details of this processing will be described later (see FIG. 16).
  • the display unit 202 displays the contents of the edited data table 201 and the regenerated metadata 207 through the console 105 (step S1108).
  • FIG. 16 is a flowchart showing an example of the metadata regeneration process. This processing corresponds to the processing in step S1106 in FIG.
  • a data editing command related to record deletion is issued, for example, when a record determined to be a duplicate record is deleted in a cleansing operation.
  • a data editing command related to cell update is issued, for example, when data is rewritten to unify the notation.
  • the metadata generation unit 206 determines whether the received data editing command is record deletion or cell update (step S1202).
  • the metadata generation unit 206 acquires the hash value of each column value belonging to the record to be deleted from the hash matrix 301 (step S1206).
  • the metadata generation unit 206 deletes the acquired hash value and the record ID to be deleted from the PLI 303 of each column (step S1208).
  • the metadata generation unit 206 deletes the record ID to be deleted from the hash matrix 301, the MinHash signature 302, and the similar record matrix 250 (step S1210). Then, this process ends.
  • the metadata generation unit 206 calculates a hash value from the updated cell value, and updates the hash matrix 301 using the calculated hash value (steps S1222 to S1224).
  • the metadata generation unit 206 updates the PLI 303 of the column including the updated cell value (step S1226).
  • the metadata generation unit 206 determines whether or not the column ID including the updated cell value (referred to as “updated column ID”) is included in the UCC list 260 (step S1228). If the determination result is affirmative (S1228: YES), the metadata generation unit 206 proceeds to the next step S1230, and if negative (step S1228: NO), the process ends.
  • the metadata generation unit 206 performs the processing of the next steps S1232 to S1236 for each UCC including the update column ID (step S1230).
  • the metadata generation unit 206 acquires the updated hash value entry from the PLI 303 of the update column ID. Then, the metadata generation unit 206 calculates a common set between the record ID group of the acquired entry and the record ID group of the entry of another PLI 303 (step S1232).
  • The metadata generation unit 206 determines whether or not the calculated common set is an empty set (step S1234). That is, the metadata generation unit 206 determines whether the acquired record ID group shares any record ID with a record ID group of another PLI 303.
  • When the common set is not an empty set (step S1234: NO), the metadata generation unit 206 deletes from the UCC list 260 the column ID pair whose PLIs 303 yielded the non-empty common set (step S1236). This is because the column ID pair is no longer a UCC due to the update of the cell value.
  • the metadata generation unit 206 does not need to do anything when the common set is an empty set (step S1234: YES).
  • the metadata generation unit 206 may execute an update process of the similar record matrix 250 in addition to the above process.
  • the update process of the UCC list 260 is executed only when the column including the update cell value belongs to the UCC. That is, according to the present embodiment, the update processing amount of the UCC list 260 when the cell value is updated can be reduced.
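  • The idea of re-checking only the UCCs that contain the updated column can be sketched as follows. For brevity the sketch re-checks uniqueness with a direct scan of the hash matrix instead of the PLI intersection described above; that substitution, the function names, and the sample data are assumptions made for illustration.

```python
def is_unique(hash_matrix: dict, columns) -> bool:
    """True if no two records share the same hash values in `columns`."""
    seen = set()
    for row in hash_matrix.values():
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def apply_cell_update(hash_matrix, ucc_list, record_id, column_id, new_hash):
    """Update one cell and return the UCC list with broken UCCs removed."""
    hash_matrix[record_id][column_id] = new_hash
    kept = []
    for ucc in ucc_list:
        # UCCs that do not contain the updated column are left untouched.
        if column_id in ucc and not is_unique(hash_matrix, ucc):
            continue
        kept.append(ucc)
    return kept


# Hypothetical state before the update.
hash_matrix = {
    "R001": {"C001": 1, "C002": 10, "C003": 100},
    "R002": {"C001": 2, "C002": 10, "C003": 101},
    "R003": {"C001": 1, "C002": 11, "C003": 102},
}
ucc_list = [("C003",), ("C001", "C002")]

# Rewriting R002/C001 makes R001 and R002 identical on (C001, C002).
ucc_list = apply_cell_update(hash_matrix, ucc_list, "R002", "C001", 1)
print(ucc_list)
```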
  • FIG. 17 shows an example of the data table display screen 400. This screen may be displayed by the process of step S408 in FIG. 4 or step S1108 in FIG.
  • the display unit 202 may generate a data table display screen 400 as shown in FIG. 17 based on the data table 201 and display it on the console 105. On the data table display screen 400, the record ID and column ID of the data table 201 may be displayed together.
  • FIG. 18 shows an example of the improved data table display screen 401. This screen may be displayed by the process of step S408 in FIG. 4 or step S1108 in FIG.
  • the display unit 202 may generate an improved data table display screen 401 as shown in FIG. 18 based on the data table 201 and the metadata 207 and display it on the console 105.
  • the improved data table display screen 401 may include a button 402 for selecting a column ID or a column ID group belonging to the UCC list 260.
  • When the person in charge of data cleansing presses this button 402, the column corresponding to the column ID or the column ID group selected by the button 402 in the data table may be highlighted so as to be distinguished from other columns, for example in a different color (see shaded area in FIG. 18).
  • the similarity between the records in the similar record matrix 250 may be displayed.
  • The display unit 202 may display records with higher similarity nearer the top of the screen.
  • the display unit 202 may sort and display records in descending order of similarity.
  • the improved data table display screen 401 in FIG. 18 is merely an example of displaying the information included in the metadata 207 on the console 105, and the display mode is not limited to this.
  • a record having a high similarity can be displayed at the top. Therefore, the person in charge of data cleansing can easily find a record that seems to require data cleansing.
  • the columns belonging to the UCC can be displayed in a recognizable manner.
  • The person in charge of data cleansing can easily recognize which cell values, if corrected, could cause a UCC relationship to be lost.
  • records with high similarity and columns belonging to UCC can be displayed together.
  • the person in charge of data cleansing can correct the data while matching the semantic uniqueness of the record with the uniqueness of the notation. That is, the person in charge of data cleansing can efficiently perform the data cleansing work.
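  • Sorting records so that those with high similarity appear at the top can be sketched as follows. Ranking each record by its highest similarity to any other record is an assumption of this sketch, as are the function name and all sample values except the similarity 0.80 between “R001” and “R002”, which follows the example in the text.

```python
def rank_records_by_similarity(similar_record_matrix: dict) -> list:
    """Order record IDs by their highest similarity to any other record."""
    best = {}
    for (a, b), similarity in similar_record_matrix.items():
        best[a] = max(best.get(a, 0.0), similarity)
        best[b] = max(best.get(b, 0.0), similarity)
    # Descending similarity; ties are broken by record ID for stable output.
    return sorted(best, key=lambda rid: (-best[rid], rid))


similar_record_matrix = {
    ("R001", "R002"): 0.80,
    ("R001", "R003"): 0.10,  # hypothetical
    ("R002", "R003"): 0.30,  # hypothetical
}
print(rank_records_by_similarity(similar_record_matrix))
```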
  • a part of the configuration of a certain embodiment may be replaced with the configuration of another embodiment.
  • the configuration of another embodiment may be added to the configuration of one embodiment. You may add, delete, or replace another structure with respect to a part of structure of each Example.
  • Each of the configurations, functions, processing units, processing means, and the like in the above-described embodiments may be realized in hardware by designing a part or all of them, for example, with an integrated circuit. They may also be realized in software by a processor interpreting and executing a program that realizes each function. Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk drive, or an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
  • a storage device such as a memory, a hard disk drive, or an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, or a DVD.
  • only the control lines and/or information lines considered necessary for the description are shown; not all control lines and/or information lines required in an actual implementation are necessarily shown. In practice, almost all the components may be considered to be connected to each other.
  • Data cleansing system, 201: Data table, 206: Metadata generation unit, 208: Similar record detection unit, 209: UCC detection unit, 210: Hash matrix generation unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention improves the efficiency of data cleansing work. The data cleansing system retrieves a data table from storage, calculates the similarity between records of the data table, detects a unique column combination (UCC), which is a set of columns that uniquely identifies each record of the data table, and displays the similarity and the UCC.
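
The processing summarized in the abstract can be illustrated with a minimal sketch. It assumes Jaccard similarity over character bigrams as the similarity measure and a brute-force subset search for UCC detection; neither choice is stated in this part of the text, and the function names, column names, and sample rows are all hypothetical.

```python
from itertools import combinations


def bigrams(text):
    """Set of character bigrams of a string (the string itself if too short)."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def record_similarity(a, b):
    """Jaccard similarity between the pooled bigram sets of two records."""
    ga = set().union(*(bigrams(v) for v in a))
    gb = set().union(*(bigrams(v) for v in b))
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


def similar_pairs(rows):
    """All (similarity, i, j) record pairs, most similar first."""
    pairs = [(record_similarity(rows[i], rows[j]), i, j)
             for i, j in combinations(range(len(rows)), 2)]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def minimal_uccs(columns, rows):
    """Brute-force search for minimal unique column combinations (UCCs)."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(range(len(columns)), size):
            if any(set(u) <= set(combo) for u in found):
                continue  # a subset is already unique, so combo is not minimal
            projected = [tuple(row[c] for c in combo) for row in rows]
            if len(set(projected)) == len(rows):  # no two records collide
                found.append(combo)
    return [tuple(columns[c] for c in combo) for combo in found]


# Hypothetical sample table: the first two records probably denote one person.
columns = ("name", "city", "phone")
rows = [
    ("Taro Yamada", "Tokyo", "03-1111"),
    ("Taro Yamada", "Tokyo", "03-1112"),
    ("Hanako Sato", "Osaka", "06-2222"),
]

for similarity, i, j in similar_pairs(rows):
    print(f"{similarity:.2f}", rows[i], rows[j])
print("UCCs:", minimal_uccs(columns, rows))
```

In this sample only the phone column distinguishes the two most similar records, so it is the cell a person in charge of data cleansing would inspect first. The brute-force search is exponential in the number of columns and is meant only to make the definition of a UCC concrete.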
PCT/JP2016/061532 2016-04-08 2016-04-08 Data cleansing system, method, and program WO2017175375A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2016/061532 WO2017175375A1 (fr) 2016-04-08 2016-04-08 Data cleansing system, method, and program
JP2018510205A JP6549786B2 (ja) 2016-04-08 2016-04-08 Data cleansing system, method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2016/061532 WO2017175375A1 (fr) 2016-04-08 2016-04-08 Data cleansing system, method, and program

Publications (1)

Publication Number Publication Date
WO2017175375A1 (fr) 2017-10-12

Family

ID=60001074

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/061532 WO2017175375A1 (fr) 2016-04-08 2016-04-08 Data cleansing system, method, and program

Country Status (2)

Country Link
JP (1) JP6549786B2 (fr)
WO (1) WO2017175375A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024009404A1 (fr) * 2022-07-05 2024-01-11 日本電信電話株式会社 Log data analysis device, log data analysis method, and log data analysis program

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113568894A (zh) * 2020-04-28 2021-10-29 中移动信息技术有限公司 Database data redundancy processing method and apparatus, electronic device, and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011257854A (ja) * 2010-06-07 2011-12-22 Hitachi Ltd Medical information management system, medical information management method, and medical information management program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FUKUDA: "IBM infosphere identity insight solutions: Intelligent Solution for Fighting Threat and fraud", PROVISION, vol. 65, 21 May 2010 (2010-05-21), pages 45 - 51 *

Also Published As

Publication number Publication date
JPWO2017175375A1 (ja) 2019-01-17
JP6549786B2 (ja) 2019-07-24

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 2018510205

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16897931

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16897931

Country of ref document: EP

Kind code of ref document: A1