WO2014000786A1 - A method for repairing records in a database - Google Patents

A method for repairing records in a database

Info

Publication number
WO2014000786A1
Authority
WO
WIPO (PCT)
Prior art keywords
database
data
records
computer readable
modification
Prior art date
Application number
PCT/EP2012/062446
Other languages
French (fr)
Inventor
Ihab Francis Ilyas Kaldas
Mourad Ouzzani
Original Assignee
Qatar Foundation
Hoarton, Lloyd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qatar Foundation, Hoarton, Lloyd
Priority to PCT/EP2012/062446
Publication of WO2014000786A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • the present invention relates to a method for repairing records in a database and more particularly relates to holistic database record repair.
  • a database is a collection of information arranged in an organised manner.
  • a typical database might include medical, financial or accounting information, demographics and market survey data, bibliographic or archival data, personnel and organisational information, public governmental records, private business or customer data such as addresses and phone numbers, etc.
  • Such information is usually contained in computer files arranged in a preselected database format, and the data contents within them can be maintained for convenient access on magnetic or optical media, both for storage and for updating the file contents as needed.
  • Poor data quality can have undesirable implications for the effectiveness of a business or other organisation or entity. For example, in healthcare, where incorrect information about patients in an Electronic Health Record (EHR) may lead to wrong treatments and prescriptions, ensuring the accuracy of database entries is of prime importance.
  • the present invention seeks to provide an improved method for repairing records in a database.
  • a method for repairing records in a database comprising generating a plurality of constraint specifications from specified classes and methods; applying the constraint specifications to the database; generating modification data comprising a list of modifications to be made to the records according to the constraint specifications and modifying the database according to the modification data to produce a modified database instance.
  • the method comprises storing the modification data for access after the database has been modified.
  • the method comprises storing the modification data by annotating the database with the modification data to provide an annotated database instance.
  • the method comprises storing the modification data in a file which is separate from the database.
  • the method comprises integrating the specified classes and methods in a late binding mode.
  • the specified classes and methods comprise specified data characteristics.
  • the specified classes and methods comprise specified data quality constraints and/or rules.
  • the specified classes and methods comprise a specified budget.
  • the method further comprises providing feedback information to a user which is indicative of the progress of the database repair.
  • the method further comprises applying a safety check on each constraint specification prior to applying each constraint specification to the database.
  • the method further comprises generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
  • the step of modifying the database comprises modifying the database to repair more than two data quality issues simultaneously.
  • a computer readable storage medium storing machine readable instructions that, when executed by a processor, implement a method for repairing records in a database comprising generating a plurality of constraint specifications from specified classes and methods; applying the constraint specifications to the database; generating modification data comprising a list of modifications to be made to the records according to the constraint specifications; and modifying the database according to the modification data to produce a modified database instance.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data so that the modification data can be accessed after the database has been modified.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data by annotating the database with the modification data to provide an annotated database instance.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data in a file which is separate from the database.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: integrating the specified classes and methods in a late binding mode.
  • the specified classes and methods comprise specified data characteristics.
  • the specified classes and methods comprise specified data quality constraints and/or rules.
  • the specified classes and methods comprise a specified budget.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: providing feedback information to a user which is indicative of the progress of the database repair.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: applying a safety check on the constraint specifications prior to applying the constraint specifications to the database.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
  • the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: modifying the database to repair more than two data quality issues simultaneously.
  • Figure 1 is a schematic representation of a method for repairing records in a database according to one embodiment of the invention
  • Figure 2 is a schematic block diagram of a method according to an embodiment of the invention
  • Figure 3 is a schematic block diagram of part of a method according to an embodiment of the invention.
  • Figure 4 is a schematic block diagram of part of a method according to an embodiment of the invention.
  • Figure 5 is a schematic block diagram of an apparatus according to an embodiment of the invention.
  • a database cleaning platform 1 comprises a core 2.
  • the core 2 is operable to receive data via plugin modules which may be provided by field experts.
  • the plugin data modules comprise metadata 3, data quality rules and constraints 4 and a data repair budget 5. It is, however, to be appreciated that the core 2 may be operable to receive a greater or fewer number of data modules.
  • the core 2 is operable to receive the data from a database 6.
  • the database 6 is a dirty database in the sense that some of the records in the database comprise data quality problems such as incorrect, duplicate or outdated information or missing values.
  • the system 1 is operable to generate a plurality of constraint specifications from specified classes and methods, such as the plugin data modules 3-5 and then to apply each constraint specification to the database 6.
  • the core 2 generates modification data comprising a list of modifications to be made to the records in the database 6 according to each constraint specification.
  • the modification data is stored in a modification data record 7 which retains data auditing information.
  • the modification data record 7 may be a data file which is separate to the database 6 or the modification data record may be stored by annotating the database 6 to produce an annotated database instance.
  • the core 2 modifies the database 6 according to the modification data 7 to produce a modified database instance 8 which, in this example, is a repaired database instance in which the data quality problems such as incorrect, duplicate or out-dated information or missing values are repaired.
  • the system 1 is operable to store the modification data 7 so that the modification data 7 persists after repair of the database 6.
  • the modification data 7 comprises audit information detailing each operation performed by the core 2 during the cleaning process.
  • the modification data 7 is thus an audit record of the changes that were made to the records in the database during the cleaning process. Since the modification data 7 persists after the cleaning process, the modification data 7 can be analysed by a data cleaning debugger to enable the data cleaning debugger to inspect and review the changes that were made to the database 6 during the cleaning process.
  • a method of repairing records in a database of an embodiment of the invention comprises receiving the database 6 which comprises dirty data.
  • the core 2 receives constraints specifications 9 which comprise constraints specifications to effect deduplication, functional dependency (FD) resolution, conditional functional dependency (CFD) resolution and other user defined rules.
  • the system 1 provides abstract classes and methods as data specification inputs 10.
  • the core 2 compiles the constraints specifications and the data specifications in a late binding mode.
  • the constraints specifications are compiled in the core 2 by an abstract constraints compiler 11.
  • the abstract constraints compiler 11 generates modification data which, in this embodiment, is stored alongside the database 6 to produce an annotated database instance 12.
  • the annotated database instance 12 comprises the database 6, which is not yet changed, and the modification data which is provided as a data file that is either attached to or stored separately from the database 6.
  • the annotated database instance 12 is then input into an abstract repair solver 13 which is provided within the core 2.
  • the abstract repair solver 13 modifies the records in the database 6 according to the annotated modification data to at least partly clean the data in the database 6 by removing the data quality problems such as incorrect, duplicate or out-dated information or missing values.
  • the system 1 comprises a cleaning progress monitor 14 which monitors the system and provides feedback to a user about the progress of the cleaning process.
  • the abstract repair solver 13 outputs a modified database instance 8 in which some or all of the records in the database have been cleaned. If, however, the process was unable to clean all records in the database 6 or if a user intervenes to request further cleaning, the system 1 repeats the cleaning process to incrementally repair the records in the database 6. During the incremental repair, new constraints specifications may be provided to allow further cleaning and repair of the database.
  • the core 2 is operable to stop the cleaning process if the process is not able to identify one or more correct values to clean the database 6.
  • the core 2 is also operable to stop the cleaning process if it detects that the process replaces the same data with different values during automatic repeat iterations of the process. This minimises the chance of the core 2 becoming stuck in a processing loop.
  • the modification data used during the cleaning process is stored as auditing information 15 which can be reviewed by a data cleaning debugger 16.
  • a user can inspect the auditing information 15 to see what occurred during the cleaning process. This can assist the user in reviewing the database modifications to find potential errors in the constraints and data specifications.
  • the abstract constraints compiler 11 incorporates an input which is operable to receive the constraints specifications and rules 9 specified by a user.
  • the abstract constraints compiler 11 preferably incorporates a constraints safety test module 17 which checks the constraints specification and rules 9 input by the user. If the constraints specifications and rules 9 are not appropriate then the constraints safety test module 17 generates a report and transmits the report to the user. If, on the other hand, the constraints specifications and rules 9 are appropriate then the abstract constraints compiler parses the records in the database 6 to identify records which match the constraints and violate or satisfy the constraints. The abstract constraints compiler 11 then generates modification data indicative of the modifications required to the records in the database 6 to repair the records according to the constraints specifications and rules 9.
  • the abstract constraints compiler 11 stores the modification data as a data annotation 18 which is applied to the database to produce an annotated database instance 12.
  • the abstract repair solver 13 will now be described in more detail with reference to figure 4.
  • the abstract repair solver 13 initially generates from the annotated database instance all possible value modification clauses that are necessary to apply the modification data to repair the records in the database.
  • the abstract repair solver 13 uses the generated modification clauses to repair the records in the database according to one of a number of different repair methods. Two such repair methods are discussed below but it is to be appreciated that other repair methods may be used in further embodiments of the invention.
  • a first repair method transforms the modification clauses into conjunctive normal form clauses which are then fed into a satisfiability (SAT) solver.
  • Embodiments of the invention provide a holistic and extensible data cleaning platform where any data cleaning methods can be deployed with a minimal implementation effort from experts.
  • Embodiments of the invention can also handle multiple data cleaning methods targeting different data quality issues such as deduplication, FD-repair and outdated data in a holistic and consistent way. The system therefore allows users to clean data in a database more easily and efficiently than conventional data cleaning systems.
  • the holistic nature of an embodiment of the invention is achieved by enabling simultaneous execution of two or more data cleaning routines on a single database.
  • the stored modification data provides a record of each cleaning action which can be accessed by an end user or an auditor after cleaning has occurred to inspect and analyse each cleaning action.
  • FIG. 5 is a schematic block diagram of an apparatus according to an embodiment of the invention which is suitable for implementing any of the systems or processes described above.
  • Apparatus 400 includes one or more processors, such as processor 401, providing an execution platform for executing machine readable instructions such as software. Commands and data from the processor 401 are communicated over a communication bus 399.
  • the system 400 also includes a main memory 402, such as a Random Access Memory (RAM), where machine readable instructions may reside during runtime, and a secondary memory 405.
  • the secondary memory 405 includes, for example, a hard disk drive 407 and/or a removable storage drive 430, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc., or a nonvolatile memory where a copy of the machine readable instructions or software may be stored.
  • the secondary memory 405 may also include ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM).
  • data representing any one or more of updates, possible updates or candidate replacement entries, and listings for identified tuples may be stored in the main memory 402 and/or the secondary memory 405.
  • the removable storage drive 430 reads from and/or writes to a removable storage unit 409 in a well-known manner.
  • a user interfaces with the system 400 with one or more input devices 411, such as a keyboard, a mouse, a stylus, and the like in order to provide user input data.
  • the display adaptor 415 interfaces with the communication bus 399 and the display 417 and receives display data from the processor 401 and converts the display data into display commands for the display 417.
  • a network interface 419 is provided for communicating with other systems and devices via a network (not shown).
  • the system can include a wireless interface 421 for communicating with wireless devices in the wireless community.
  • the system 400 shown in figure 5 is provided as an example of a possible platform that may be used, and other types of platforms may be used as is known in the art.
  • One or more of the steps described above may be implemented as instructions embedded on a computer readable medium and executed on the system 400.
  • the steps may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, they may exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats for performing some of the steps.
  • any of the above may be embodied on a computer readable medium, which include storage devices and signals, in compressed or uncompressed form.
  • suitable computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes.
  • Examples of computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running a computer program may be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general. It is therefore to be understood that those functions enumerated above may be performed by any electronic device capable of executing the above-described functions.
  • equivalence classes 405 can reside in memory 402 having been derived from records of a database 209.
  • One or more of the algorithms of blocks 300, 305 or 307 can reside in memory 402 such as to provide respective engines 403 for cleaning, merging and selecting records of a database, including a modified instance of a database for example. That is, engine 403 can be a cleaning engine or a merge engine which is operable to perform the processes associated with the tasks of blocks 300, 305, 307 for example.
  • a database 209 is shown in figure 5 as a standalone database connected to bus 399. However, it can be a database which can be queried and have data written to it from a remote location using the wired or wireless network connections mentioned above. Alternatively, database 209 may be stored in memory 405, such as on a HDD of system 400 for example.
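
The list above describes two conditions under which the core 2 stops the cleaning process: no correct value can be identified, or the same data is being replaced with different values across repeat iterations. The following is a minimal sketch of such a guard, not the patented implementation. The function `repair_until_stable`, the `repair_step` callback, the `max_rounds` limit and the returned reason strings are all hypothetical names introduced for illustration, and "revisiting an earlier value" is used here as one possible way to detect the loop.

```python
def repair_until_stable(value, repair_step, max_rounds=10):
    """Repeat a repair step on one value until it settles or must be stopped.

    repair_step (hypothetical) returns the proposed replacement for a value,
    or None when no candidate value can be identified.
    """
    seen = {value}                      # every value this cell has held so far
    for _ in range(max_rounds):
        new_value = repair_step(value)
        if new_value is None:
            return value, "no_candidate"   # stop: nothing to repair with
        if new_value == value:
            return value, "stable"         # stop: the repair has converged
        if new_value in seen:
            return value, "oscillation"    # stop: the cell keeps flipping
        seen.add(new_value)
        value = new_value
    return value, "max_rounds"

# Two rules that undo each other, an assumed example of a processing loop.
flip = {"NYC": "New York", "New York": "NYC"}
final, reason = repair_until_stable("NYC", flip.get)
```

Tracking the set of values already seen keeps the check independent of how many rules are involved, at the cost of remembering the history of each repaired cell.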

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for repairing records in a database comprises generating at least one constraints specification from specified classes and methods, applying the or each constraints specification to the database and generating modification data comprising a list of modifications to be made to the records according to the or each constraints specification. The method further comprises modifying the database according to the modification data to produce a modified database instance. The modification data persists after the cleaning process has occurred to enable the cleaning process to be audited.

Description

A METHOD FOR REPAIRING RECORDS IN A DATABASE
Description of Invention
The present invention relates to a method for repairing records in a database and more particularly relates to holistic database record repair.
A database is a collection of information arranged in an organised manner. A typical database might include medical, financial or accounting information, demographics and market survey data, bibliographic or archival data, personnel and organisational information, public governmental records, private business or customer data such as addresses and phone numbers, etc.
Such information is usually contained in computer files arranged in a preselected database format, and the data contents within them can be maintained for convenient access on magnetic or optical media, both for storage and for updating the file contents as needed.
Poor data quality can have undesirable implications for the effectiveness of a business or other organisation or entity. For example, in healthcare, where incorrect information about patients in an Electronic Health Record (EHR) may lead to wrong treatments and prescriptions, ensuring the accuracy of database entries is of prime importance.
A large variety of computational procedures for cleaning or repairing erroneous or duplicate entries in databases have been proposed. Typically, such procedures can automatically or semi-automatically identify errors and, when possible, correct them. Typically, however, these approaches have several limitations relating to the introduction of new database errors as a result of changes that have been made. For example, a repair in order to correct a functional dependency problem may lead to duplication errors. Similarly, deduplication can lead to functional dependency violations within a database.
Most existing data cleaning methods tackle the different issues involved in cleaning a dirty database in isolation. Furthermore, these methods work for specific data and specific constraints. Each time a new data cleaning problem arises or a new data set needs to be cleaned, practitioners either build a new data cleaning system from scratch or adopt an existing tool. The former option is clearly prohibitive and requires high expertise. In the latter case, considerable efforts will usually go into preparing the data, customising the data quality constraints for the tool, and even tweaking the tool if the code is available. Since these tools tackle each specific cleaning problem in isolation, fixing one dirty piece of data often has a negative impact on other parts of the data or violates other data constraints.
The present invention seeks to provide an improved method for repairing records in a database.
According to one aspect of the present invention, there is provided a method for repairing records in a database, the method comprising generating a plurality of constraint specifications from specified classes and methods; applying the constraint specifications to the database; generating modification data comprising a list of modifications to be made to the records according to the constraint specifications and modifying the database according to the modification data to produce a modified database instance.
Preferably, the method comprises storing the modification data for access after the database has been modified.
Conveniently, the method comprises storing the modification data by annotating the database with the modification data to provide an annotated database instance.
Advantageously, the method comprises storing the modification data in a file which is separate from the database.
Preferably, the method comprises integrating the specified classes and methods in a late binding mode.
Conveniently, the specified classes and methods comprise specified data characteristics.
Advantageously, the specified classes and methods comprise specified data quality constraints and/or rules.
Preferably, the specified classes and methods comprise a specified budget.
Conveniently, the method further comprises providing feedback information to a user which is indicative of the progress of the database repair.
Advantageously, the method further comprises applying a safety check on each constraint specification prior to applying each constraint specification to the database.
Preferably, the method further comprises generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
Conveniently, the step of modifying the database comprises modifying the database to repair more than two data quality issues simultaneously.
According to another aspect of the present invention, there is provided a computer readable storage medium storing machine readable instructions that, when executed by a processor, implement a method for repairing records in a database comprising generating a plurality of constraint specifications from specified classes and methods; applying the constraint specifications to the database; generating modification data comprising a list of modifications to be made to the records according to the constraint specifications; and modifying the database according to the modification data to produce a modified database instance.
Preferably, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data so that the modification data can be accessed after the database has been modified.
Conveniently, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data by annotating the database with the modification data to provide an annotated database instance.
Advantageously, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: storing the modification data in a file which is separate from the database.
Preferably, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: integrating the specified classes and methods in a late binding mode.
Preferably, the specified classes and methods comprise specified data characteristics.
Conveniently, the specified classes and methods comprise specified data quality constraints and/or rules.
Advantageously, the specified classes and methods comprise a specified budget.
Preferably, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: providing feedback information to a user which is indicative of the progress of the database repair.
Conveniently, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: applying a safety check on the constraint specifications prior to applying the constraint specifications to the database.
Advantageously, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
Preferably, the computer readable storage medium further stores instructions that, when executed by the processor, implement a method for repairing records in a database further comprising: modifying the database to repair more than two data quality issues simultaneously.
In order that the invention may be more readily understood, and so that further features thereof may be appreciated, embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings in which:
Figure 1 is a schematic representation of a method for repairing records in a database according to one embodiment of the invention;
Figure 2 is a schematic block diagram of a method according to an embodiment of the invention;
Figure 3 is a schematic block diagram of part of a method according to an embodiment of the invention;
Figure 4 is a schematic block diagram of part of a method according to an embodiment of the invention; and
Figure 5 is a schematic block diagram of an apparatus according to an embodiment of the invention.
Referring to figure 1, a database cleaning platform 1 comprises a core 2. The core 2 is operable to receive data via plugin modules which may be provided by field experts. The plugin data modules comprise metadata 3, data quality rules and constraints 4 and a data repair budget 5. It is, however, to be appreciated that the core 2 may be operable to receive a greater or fewer number of data modules.
The core 2 is operable to receive the data from a database 6. In this example, the database 6 is a dirty database in the sense that some of the records in the database comprise data quality problems such as incorrect, duplicate or outdated information or missing values.
The system 1 is operable to generate a plurality of constraint specifications from specified classes and methods, such as the plugin data modules 3-5 and then to apply each constraint specification to the database 6. The core 2 generates modification data comprising a list of modifications to be made to the records in the database 6 according to each constraint specification. The modification data is stored in a modification data record 7 which retains data auditing information. The modification data record 7 may be a data file which is separate to the database 6 or the modification data record may be stored by annotating the database 6 to produce an annotated database instance. The core 2 modifies the database 6 according to the modification data 7 to produce a modified database instance 8 which, in this example, is a repaired database instance in which the data quality problems such as incorrect, duplicate or out-dated information or missing values are repaired.
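
The flow just described (apply constraint specifications, collect a list of modifications, then produce a modified database instance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are modelled as a list of dictionaries, a constraint specification is modelled as a function that returns proposed modifications, and the names `Modification`, `fill_missing_country` and `repair` are hypothetical.

```python
from typing import NamedTuple

class Modification(NamedTuple):
    row: int        # index of the record to change
    column: str     # attribute to change
    old: object     # value before repair
    new: object     # value after repair

def fill_missing_country(records):
    """Hypothetical rule: a missing 'country' value is repaired to 'QA'."""
    return [Modification(i, "country", None, "QA")
            for i, r in enumerate(records) if r.get("country") is None]

def repair(records, constraints):
    # 1. Apply every constraint to the unchanged database and gather
    #    the full list of proposed modifications.
    modifications = [m for check in constraints for m in check(records)]
    # 2. Build the modified instance from a copy, so the dirty input
    #    stays available for comparison.
    repaired = [dict(r) for r in records]
    for m in modifications:
        repaired[m.row][m.column] = m.new
    return repaired, modifications

dirty = [{"name": "A", "country": None}, {"name": "B", "country": "QA"}]
clean, log = repair(dirty, [fill_missing_country])
```

Returning the modification list alongside the repaired copy is what allows the changes to be stored and reviewed later.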
The system 1 is operable to store the modification data 7 so that the modification data 7 persists after repair of the database 6. The modification data 7 comprises audit information detailing each operation performed by the core 2 during the cleaning process. The modification data 7 is thus an audit record of the changes that were made to the records in the database during the cleaning process. Since the modification data 7 persists after the cleaning process, the modification data 7 can be analysed by a data cleaning debugger to enable the data cleaning debugger to inspect and review the changes that were made to the database 6 during the cleaning process.
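By way of illustration only, the relationship between the database 6, the modification data 7 and the modified database instance 8 might be sketched as follows. The `Modification` fields, the `apply_modifications` function and the sample records are hypothetical names chosen for this sketch and are not taken from the specification; the sketch merely assumes that each entry of the modification data records the old value, the new value and the constraint that requested the change.

```python
from copy import deepcopy
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    """One entry of the modification data (hypothetical layout)."""
    record_id: int      # key of the record to change
    attribute: str      # attribute (column) to change
    old_value: object   # value before repair, retained for auditing
    new_value: object   # value after repair
    constraint: str     # constraint specification that requested the change


def apply_modifications(database, modifications):
    """Return a repaired copy; the input database and the log are left untouched."""
    repaired = deepcopy(database)
    for mod in modifications:
        current = repaired[mod.record_id][mod.attribute]
        if current != mod.old_value:
            # The log no longer matches the data it was generated from.
            raise ValueError(f"stale modification for record {mod.record_id}")
        repaired[mod.record_id][mod.attribute] = mod.new_value
    return repaired


dirty = {1: {"zip": "10001", "city": "New York"},
         2: {"zip": "10001", "city": "NYC"}}
audit_log = [Modification(2, "city", "NYC", "New York", "fd:zip->city")]
repaired = apply_modifications(dirty, audit_log)
```

Because `audit_log` is never consumed or altered by the repair, it remains available afterwards for inspection, which is the property the preceding paragraph attributes to the modification data 7.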
Referring now to figure 2, a method of repairing records in a database of an embodiment of the invention comprises receiving the database 6 which comprises dirty data. The core 2 receives constraints specifications 9 which comprise constraints specifications to effect deduplication, functional dependency (FD) resolution, conditional functional dependency (CFD) resolution and other user defined rules. The system 1 provides abstract classes and methods as data specification inputs 10. The core 2 compiles the constraints specifications and the data specifications in a late binding mode.
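One possible reading of the abstract classes and the late binding mode is sketched below, purely as an assumption. The class `ConstraintSpecification`, the `REGISTRY` dictionary, the `register` decorator and the `NotNull` rule are all hypothetical; the specification does not define a concrete interface. The point of the sketch is only that the concrete class is resolved by name when the core runs, not when the core is written.

```python
from abc import ABC, abstractmethod


class ConstraintSpecification(ABC):
    """Abstract class exposed by the platform; experts subclass it (hypothetical)."""

    @abstractmethod
    def violations(self, records):
        """Return the ids of the records that violate this constraint."""


REGISTRY = {}  # plugin name -> concrete class, filled in as plugins are loaded


def register(name):
    """Class decorator that makes a plugin discoverable by name."""
    def decorator(cls):
        REGISTRY[name] = cls
        return cls
    return decorator


@register("not_null")
class NotNull(ConstraintSpecification):
    """Example plugin: the given attribute must not be missing."""

    def __init__(self, attribute):
        self.attribute = attribute

    def violations(self, records):
        return [rid for rid, rec in records.items()
                if rec.get(self.attribute) is None]


def compile_specs(config):
    # Late binding: each class is looked up by name only at this point.
    return [REGISTRY[name](**kwargs) for name, kwargs in config]


records = {1: {"phone": "555-0100"}, 2: {"phone": None}}
specs = compile_specs([("not_null", {"attribute": "phone"})])
found = [spec.violations(records) for spec in specs]
```

Under this arrangement a new rule can be added by registering another subclass, without any change to `compile_specs`.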
The constraints specifications are compiled in the core 2 by an abstract constraints compiler 11. The abstract constraints compiler 11 generates modification data which, in this embodiment, is stored alongside the database 6 to produce an annotated database instance 12. The annotated database instance 12 comprises the database 6, which is not yet changed, and the modification data which is provided as a data file that is either attached to or stored separately from the database 6.
The annotated database instance 12 is then input into an abstract repair solver 13 which is provided within the core 2. The abstract repair solver 13 modifies the records in the database 6 according to the annotated modification data to at least partly clean the data in the database 6 by removing the data quality problems such as incorrect, duplicate or out-dated information or missing values. The system 1 comprises a cleaning progress monitor 14 which monitors the system and provides feedback to a user about the progress of the cleaning process.
The abstract repair solver 13 outputs a modified database instance 8 in which some or all of the records in the database have been cleaned. If, however, the process was unable to clean all records in the database 6 or if a user intervenes to request further cleaning, the system 1 repeats the cleaning process to incrementally repair the records in the database 6. During the incremental repair, new constraints specifications may be provided to allow further cleaning and repair of the database. The core 2 is operable to stop the cleaning process if the process is not able to identify one or more correct values to clean the database 6. The core 2 is also operable to stop the cleaning process if it detects that the process replaces the same data with different values during automatic repeat iterations of the process. This minimises the chance of the core 2 becoming stuck in a processing loop.
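The incremental repair and its stop conditions could, for example, be sketched as below. The function `repair_until_stable`, the `propose` callback and the two sample rules are hypothetical; the sketch assumes that a repeat iteration is reported as a list of `(record_id, attribute, new_value)` triples and that writing a cell with a value different from one written earlier counts as the looping behaviour described above.

```python
def repair_until_stable(database, propose, max_rounds=10):
    """Apply `propose` repeatedly, in place; return (database, stop reason)."""
    history = {}  # (record_id, attribute) -> values written so far
    for _ in range(max_rounds):
        changes = propose(database)
        if not changes:
            # Nothing (more) can be proposed: either clean or no value found.
            return database, "no_changes"
        for rid, attr, value in changes:
            seen = history.setdefault((rid, attr), set())
            if seen and value not in seen:
                # Same cell, different value than before: stop to avoid a loop.
                return database, "oscillation"
            seen.add(value)
            database[rid][attr] = value
    return database, "max_rounds"


def uppercase_rule(db):
    """Hypothetical rule that converges after one round."""
    return [(rid, "state", rec["state"].upper())
            for rid, rec in db.items() if rec["state"] != rec["state"].upper()]


def conflicting_rules(db):
    """Two hypothetical rules that disagree about the same cell."""
    return [(1, "state", "NY" if db[1]["state"] != "NY" else "N.Y.")]


clean_db, clean_reason = repair_until_stable({1: {"state": "ny"}}, uppercase_rule)
stuck_db, stuck_reason = repair_until_stable({1: {"state": "ny"}}, conflicting_rules)
```

The `max_rounds` bound is an additional safeguard introduced for the sketch and is not mentioned in the specification.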
The modification data used during the cleaning process is stored as auditing information 15 which can be reviewed by a data cleaning debugger 16. A user can inspect the auditing information 15 to see what occurred during the cleaning process. This can assist the user in reviewing the database modifications to find potential errors in the constraints and data specifications.
The abstract constraints compiler 11 will now be described in more detail with reference to figure 3. The abstract constraints compiler incorporates an input which is operable to receive the constraints specifications and rules 9 specified by a user. The abstract constraints compiler 11 preferably incorporates a constraints safety test module 17 which checks the constraints specifications and rules 9 input by the user. If the constraints specifications and rules 9 are not appropriate then the constraints safety test module 17 generates a report and transmits the report to the user. If, on the other hand, the constraints specifications and rules 9 are appropriate then the abstract constraints compiler parses the records in the database 6 to identify the records to which the constraints apply and to determine whether each such record violates or satisfies the constraints. The abstract constraints compiler 11 then generates modification data indicative of the modifications required to the records in the database 6 to repair the records according to the constraints specifications and rules 9. The abstract constraints compiler 11 stores the modification data as a data annotation 18 which is applied to the database to produce an annotated database instance 12.
The abstract repair solver 13 will now be described in more detail with reference to figure 4. The abstract repair solver 13 initially generates from the annotated database instance all possible value modification clauses that are necessary to apply the modification data to repair the records in the database.
The abstract repair solver 13 uses the generated modification clauses to repair the records in the database according to one of a number of different repair methods. Two such repair methods are discussed below but it is to be appreciated that other repair methods may be used in further embodiments of the invention.
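As a non-limiting illustration of the safety test and of the identification of violating records, the following sketch handles a single functional dependency. The functions `safety_check` and `fd_violations`, the schema and the sample records are hypothetical, and the particular checks performed (unknown attributes, a right-hand side repeated on the left-hand side) are assumptions; the specification does not state what the constraints safety test module 17 actually verifies.

```python
from collections import defaultdict


def safety_check(schema, lhs, rhs):
    """Return a report of problems with the rule lhs -> rhs; empty means safe."""
    problems = [f"unknown attribute: {a}" for a in (*lhs, rhs) if a not in schema]
    if rhs in lhs:
        problems.append("right-hand side also appears on the left-hand side")
    return problems


def fd_violations(records, lhs, rhs):
    """Group record ids that agree on `lhs` but carry more than one `rhs` value."""
    groups = defaultdict(list)
    for rid, rec in records.items():
        groups[tuple(rec[a] for a in lhs)].append(rid)
    return {key: rids for key, rids in groups.items()
            if len({records[r][rhs] for r in rids}) > 1}


schema = {"zip", "city"}
records = {1: {"zip": "10001", "city": "New York"},
           2: {"zip": "10001", "city": "NYC"},
           3: {"zip": "60601", "city": "Chicago"}}

report = safety_check(schema, ("zip",), "state")      # rule rejected
violating = fd_violations(records, ("zip",), "city")  # rule accepted and applied
```

Here `violating` groups records 1 and 2 under the shared zip code, which is the kind of information from which modification data could subsequently be generated.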
A first repair method, labelled as option 1 in figure 4, transforms the modification clauses into conjunctive normal form clauses which are then fed into a satisfiability (SAT) solver. The SAT solver then generates the repaired database instance.
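A toy version of the first repair method is sketched below under stated assumptions. The encoding (one Boolean variable per cell and candidate value, with exactly-one, equality and trusted-value clauses), the function names `build_cnf` and `solve`, and the sample cells are all hypothetical; the specification does not give the clause format. The exhaustive search stands in for a real SAT solver and is only practical for very small instances.

```python
from itertools import combinations, product


def build_cnf(cells, candidates, equal_pairs, fixed):
    """Encode the repair as CNF over variables x[cell, value] (signed integers)."""
    var = {(c, v): i + 1 for i, (c, v) in enumerate(product(cells, candidates))}
    clauses = []
    for c in cells:
        clauses.append([var[c, v] for v in candidates])      # at least one value
        for a, b in combinations(candidates, 2):
            clauses.append([-var[c, a], -var[c, b]])          # at most one value
    for c1, c2 in equal_pairs:                                # cells that must agree
        for v in candidates:
            clauses.append([-var[c1, v], var[c2, v]])
            clauses.append([-var[c2, v], var[c1, v]])
    for c, v in fixed.items():                                # trusted values
        clauses.append([var[c, v]])
    return var, clauses


def solve(num_vars, clauses):
    """Exhaustive search for a satisfying assignment; None if unsatisfiable."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses):
            return bits
    return None


cells = ["t1.city", "t2.city"]
candidates = ["New York", "NYC"]
var, clauses = build_cnf(cells, candidates,
                         equal_pairs=[("t1.city", "t2.city")],
                         fixed={"t1.city": "New York"})
model = solve(len(var), clauses)
# Read the repaired value of each cell back from the satisfying assignment.
repair = {c: v for (c, v), i in var.items() if model[i - 1]}
```

In this instance the trusted value for `t1.city` and the equality clause together leave a single satisfying assignment, so both cells are repaired to the same value.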
In a second repair method, labelled as option 2 in figure 4, the value modification clauses are transformed into a set of graphs which are then solved and used to generate a repaired database instance.
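The second repair method is not detailed further in the specification. One way it might look, offered only as an assumption, is to treat records as nodes, connect records that must agree on an attribute, and repair each connected component to its most frequent value. The functions `connected_components` and `repair_by_majority`, the majority rule itself and the sample data are hypothetical.

```python
from collections import Counter


def connected_components(nodes, edges):
    """Union-find over record ids; each component is a set that must agree."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


def repair_by_majority(records, attribute, edges):
    """Return a copy in which every component carries its most common value."""
    repaired = {rid: dict(rec) for rid, rec in records.items()}
    for component in connected_components(records, edges):
        counts = Counter(records[rid][attribute] for rid in sorted(component))
        winner = counts.most_common(1)[0][0]
        for rid in component:
            repaired[rid][attribute] = winner
    return repaired


records = {1: {"city": "New York"}, 2: {"city": "NYC"},
           3: {"city": "New York"}, 4: {"city": "Chicago"}}
edges = [(1, 2), (2, 3)]  # e.g. records flagged as sharing a zip code
repaired = repair_by_majority(records, "city", edges)
```

Other graph formulations, such as a minimum vertex cover over a conflict graph, would fit the same description equally well.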
Embodiments of the invention provide a holistic and extensible data cleaning platform where any data cleaning methods can be deployed with a minimal implementation effort from experts. Embodiments of the invention can also handle multiple data cleaning methods targeting different data quality issues such as deduplication, FD-repair and outdated data in a holistic and consistent way. The system therefore allows users to clean data in a database more easily and efficiently than conventional data cleaning systems.
The holistic nature of an embodiment of the invention is achieved by enabling simultaneous execution of two or more data cleaning routines on a single database. The stored modification data provides a record of each cleaning action which can be accessed by an end user or an auditor after cleaning has occurred to inspect and analyse each cleaning action.
Figure 5 is a schematic block diagram of an apparatus according to an embodiment of the invention which is suitable for implementing any of the systems or processes described above. Apparatus 400 includes one or more processors, such as processor 401, providing an execution platform for executing machine readable instructions such as software. Commands and data from the processor 401 are communicated over a communication bus 399. The system 400 also includes a main memory 402, such as a Random Access Memory (RAM), where machine readable instructions may reside during runtime, and a secondary memory 405. The secondary memory 405 includes, for example, a hard disk drive 407 and/or a removable storage drive 430, representing a floppy diskette drive, a magnetic tape drive, a compact disk drive, etc., or a nonvolatile memory where a copy of the machine readable instructions or software may be stored. The secondary memory 405 may also include ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM). In addition to software, data representing any one or more of updates, possible updates or candidate replacement entries, and listings for identified tuples may be stored in the main memory 402 and/or the secondary memory 405. The removable storage drive 430 reads from and/or writes to a removable storage unit 409 in a well-known manner.
A user interfaces with the system 400 with one or more input devices 411, such as a keyboard, a mouse, a stylus, and the like in order to provide user input data. The display adaptor 415 interfaces with the communication bus 399 and the display 417 and receives display data from the processor 401 and converts the display data into display commands for the display 417. A network interface 419 is provided for communicating with other systems and devices via a network (not shown). The system can include a wireless interface 421 for communicating with wireless devices in the wireless community.
It will be apparent to one of ordinary skill in the art that one or more of the components of the system 400 may not be included and/or other components may be added as is known in the art. The system 400 shown in figure 5 is provided as an example of a possible platform that may be used, and other types of platforms may be used as is known in the art. One or more of the steps described above may be implemented as instructions embedded on a computer readable medium and executed on the system 400. The steps may be embodied by a computer program, which may exist in a variety of forms both active and inactive. For example, they may exist as software program(s) comprised of program instructions in source code, object code, executable code or other formats for performing some of the steps. Any of the above may be embodied on a computer readable medium, which include storage devices and signals, in compressed or uncompressed form. Examples of suitable computer readable storage devices include conventional computer system RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Examples of computer readable signals, whether modulated using a carrier or not, are signals that a computer system hosting or running a computer program may be configured to access, including signals downloaded through the Internet or other networks. Concrete examples of the foregoing include distribution of the programs on a CD ROM or via Internet download. In a sense, the Internet itself, as an abstract entity, is a computer readable medium. The same is true of computer networks in general. It is therefore to be understood that those functions enumerated above may be performed by any electronic device capable of executing the above-described functions.
In one embodiment, equivalence classes 405 can reside in memory 402 having been derived from records of a database 209. One or more of the algorithms of blocks 300, 305 or 307 can reside in memory 402 so as to provide respective engines 403 for cleaning, merging and selecting records of a database, including a modified instance of a database for example. That is, engine 403 can be a cleaning engine or a merge engine which is operable to perform the processes associated with the tasks of blocks 300, 305, 307 for example.
A database 209 is shown in figure 5 as a standalone database connected to bus 399. However, it can be a database which can be queried and have data written to it from a remote location using the wired or wireless network connections mentioned above. Alternatively, database 209 may be stored in memory 405, such as on a HDD of system 400 for example.
When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.

Claims

1. A method for repairing records in a database, the method comprising:
generating a plurality of constraint specifications from specified classes and methods;
applying the constraint specifications to the database;
generating modification data comprising a list of modifications to be made to the records according to the constraint specifications; and
modifying the database according to the modification data to produce a modified database instance.
2. A method according to claim 1, wherein the method comprises storing the modification data for access after the database has been modified.
3. A method according to claim 2, wherein the method comprises storing the modification data by annotating the database with the modification data to provide an annotated database instance.
4. A method according to claim 2, wherein the method comprises storing the modification data in a file which is separate from the database.
5. A method according to any one of the preceding claims, wherein the method comprises integrating the specified classes and methods in a late binding mode.
6. A method according to any one of the preceding claims, wherein the specified classes and methods comprise specified data characteristics.
7. A method according to any one of the preceding claims, wherein the specified classes and methods comprise specified data quality constraints and/or rules.
8. A method according to any one of the preceding claims, wherein the specified classes and methods comprise a specified budget.
9. A method according to any one of the preceding claims, wherein the method further comprises:
providing feedback information to a user which is indicative of the progress of the database repair.
10. A method according to any one of the preceding claims, wherein the method further comprises:
applying a safety check on each constraint specification prior to applying each constraint specification to the database.
11. A method according to any one of the preceding claims, wherein the method further comprises:
generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
12. A method according to any one of the preceding claims, wherein the step of modifying the database comprises modifying the database to repair more than two data quality issues simultaneously.
13. A computer readable storage medium storing machine readable instructions that, when executed by a processor, implement a method for repairing records in a database comprising:
generating a plurality of constraint specifications from specified classes and methods;
applying the constraint specifications to the database;
generating modification data comprising a list of modifications to be made to the records according to the constraint specifications; and
modifying the database according to the modification data to produce a modified database instance.
14. A computer readable storage medium according to claim 13 further storing instructions that, when executed by the processor, implement a method for repairing records in a database further comprising:
storing the modification data so that the modification data can be accessed after the database has been modified.
15. A computer readable storage medium according to claim 14 further storing instructions that, when executed by the processor, implement a method for repairing records in a database further comprising:
storing the modification data by annotating the database with the modification data to provide an annotated database instance.
16. A computer readable storage medium according to claim 14 further storing instructions that, when executed by the processor, implement a method for repairing records of a database further comprising:
storing the modification data in a file which is separate from the database.
17. A computer readable storage medium according to any one of claims 13 to 16 further storing instructions that, when executed by the processor, implement a method for repairing records of a database further comprising:
integrating the specified classes and methods in a late binding mode.
18. A computer readable storage medium according to any one of claims 13 to 17, wherein the specified classes and methods comprise specified data characteristics.
19. A computer readable storage medium according to any one of claims 13 to 18, wherein the specified classes and methods comprise specified data quality constraints and/or rules.
20. A computer readable storage medium according to any one of claims 13 to 19, wherein the specified classes and methods comprise a specified budget.
21. A computer readable storage medium according to any one of claims 13 to 20 further storing instructions that, when executed by the processor, implement a method for repairing records in a database further comprising:
providing feedback information to a user which is indicative of the progress of the database repair.
22. A computer readable storage medium according to any one of claims 13 to 21 further storing instructions that, when executed by the processor, implement a method for repairing records in a database further comprising:
applying a safety check on the constraint specifications prior to applying the constraint specifications to the database.
23. A computer readable storage medium according to any one of claims 13 to 22 further storing instructions that, when executed by the processor, implement a method for repairing records in a database further comprising:
generating value modification clauses from the modification data, the value modification clauses comprising data for transformation into data to be used to repair records in the database.
24. A computer readable storage medium according to any one of claims 13 to 23, wherein the step of modifying the database comprises modifying the database to repair more than two data quality issues simultaneously.
PCT/EP2012/062446 2012-06-27 2012-06-27 A method for repairing records in a database WO2014000786A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2012/062446 WO2014000786A1 (en) 2012-06-27 2012-06-27 A method for repairing records in a database

Publications (1)

Publication Number Publication Date
WO2014000786A1 true WO2014000786A1 (en) 2014-01-03

Family

ID=46640646

Country Status (1)

Country Link
WO (1) WO2014000786A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HASIMAH HJ MOHAMED ET AL: "E-Clean: A Data Cleaning Framework for Patient Data", INFORMATICS AND COMPUTATIONAL INTELLIGENCE (ICI), 2011 FIRST INTERNATIONAL CONFERENCE ON, IEEE, 12 December 2011 (2011-12-12), pages 63 - 68, XP032104304, ISBN: 978-1-4673-0091-9, DOI: 10.1109/ICI.2011.21 *
HIMA PRASAD K ET AL: "Data Cleansing Techniques for Large Enterprise Datasets", SRII GLOBAL CONFERENCE (SRII), 2011 ANNUAL, IEEE, 29 March 2011 (2011-03-29), pages 135 - 144, XP031897210, ISBN: 978-1-61284-415-2, DOI: 10.1109/SRII.2011.26 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140052723A1 (en) * 2012-08-20 2014-02-20 Research In Motion Limited Methods and devices for applying constraints to data object
US9223825B2 (en) * 2012-08-20 2015-12-29 Blackberry Limited Methods and devices for applying constraints to data object
US10472345B2 (en) 2016-02-04 2019-11-12 Merck Sharp & Dohme Corp. Methods of preparing hydroxylamine derivatives useful in the preparation of anti-infective agents
CN108121778A (en) * 2017-12-14 2018-06-05 浙江航天恒嘉数据科技有限公司 A kind of heterogeneous database exchange and cleaning system and method
CN108121778B (en) * 2017-12-14 2020-12-25 浙江航天恒嘉数据科技有限公司 Heterogeneous data exchange and cleaning system and method
WO2020219405A1 (en) 2019-04-26 2020-10-29 Merck Sharp & Dohme Corp. Process for the preparation of intermediates useful for making (2s,5r)-7-oxo-n-piperidin-4-yl-6-(sulfoxy)-1,6-diazabicyclo[3.2.1]octane-2-carboxamide

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12745646

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 17.06.2015)

122 Ep: pct application non-entry in european phase

Ref document number: 12745646

Country of ref document: EP

Kind code of ref document: A1