WO2015199533A1

WO2015199533A1 - System and method for managing change data in database

Info

Publication number: WO2015199533A1
Application number: PCT/MY2015/050062
Authority: WO
Inventors: Tong Khin Thong; Azizah MAIDIN; Wah Yan KOONG; Nagendran M. Perumal
Original assignee: Mimos Berhad
Priority date: 2014-06-26
Filing date: 2015-06-25
Publication date: 2015-12-30
Also published as: MY188208A

Abstract

The present invention relates to an improved system for managing change data including the data newly added, removed and modified in database and to the method thereof. The system comprises an extract, transform and load (ETL) unit (100) that is configured for extracting data from a data source and loading the data into a staging table; and a unit for identifying the change data in the staging table, coupled to the ETL unit (100). The system can be characterized by a rule-based key generator (101 ), a record linkage unit (102), a delta data comparison unit (103) and a change data extraction unit (104) for quick and easy detection or capture of the change data in continuous manner without involving modification of the source's system and source database.

Description

SYSTEM AND METHOD FOR MANAGING CHANGE DATA IN DATABASE

FIELD OF THE INVENTION The present invention relates generally to electrical digital data processing for error monitoring and detection. More particularly, the present invention relates to an improved system for managing change data in database and to the method thereof. BACKGROUND OF THE IN VENTION

Computers are tools that can be used for storing and providing access to huge amounts of information. Database is a common component for storing information on the computer systems while providing ability to process information tremendously including performing a host of various tasks. Typically, a database may be an organized collection of related information stored as "records" having many "fields" of information. Using the database, information or data may be added or deleted from data files, information retrieved from or updated in such files, and so forth.

To capture change data which may be the deletion, insertion or modification of data in a database, prior art suggests various techniques and approaches; U.S. Patent no. 6,999,977, for example, suggests data extraction, transport, arid loading techniques where modifications made to on-line transaction processing (OLTP) tables are maintained in a database object, referred to as a change table. In this document the change data in the change table is then made available to analysis applications in a controlled manner like a publication_^ subscription model. One of the problems associated with most prior art arises

database to capture the changes. In addition to thai, most techniques described in prior art require a source data table key for detecting the change data, where in its absence, the techniques cannot work around.

As data volumes grow bigger through the database such that, even performing simple tasks, for example identifying change data, can become unwieldy and very time-consuming, it would be desirable to provide more sophisticated, efficient system and method for managing change data in the database. It is also preferred that the improved system and method can help to quickly identif and extract change data captured from any database without modifying its structure, thus migrate the change data to a target data set continuously.

Therefore, a need exists for improved system and method for managing change data in a database in order to obtain better performance in a database environment,

SUMMARY OF THE INVENTIO

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later. Accordingly, the present invention provides a system for managing change data in a database. The system comprises an extract, transform and load (ETL) unit for extracting data from a data source and loading the data into a staging table, and a unit for identifying the change data in the Staging table coupled to the ETL unit.

The system of the present invention can be characterized by the ETL unit that is configured for providing first table and a second table and for loading the data thereto. In one preferred embodiment, the first table and the second table thereto, and a reference number column having a reference number corresponding to the row sequence number. The second table preferably comprises a delta column fo indicating modification status of the data. Essentially, the first table and the second table, each comprising the staging table that is loaded with the data extracted from a data target and the data source, respectively. According to one preferred embodiment, the unit for identifying that is utilized in the present invention comprises a rule-based key generator, a record linkage unit, and a delta data comparison unit, and a change data extraction unit.

The rule-based key generator is configured for obtaining a set of reference keys based on the staging table. Preferably, the set of reference keys comprises at least one data key column with non-duplicated data. In one preferred embodiment, the set of reference keys comprises a combination of data key columns.

The record linkage unit is configured for linking, based on the set of reference keys, the data loaded in the first table and the second table by way of mapping the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto and mapping the row sequence number of the second table to the reference number column in the first table for locating the data removed therefrom. In one preferred embodiment, the record linkage unit assigns a default Ό' value of the reference number to the reference number column of the second table indicative of the data newly added thereto. In another preferred embodiment, the record linkage unit assigns a default Ό' value of the reference number to the reference number column of the first table indicative of the data removed therefrom. The record linkage unit links the data using a subtracted reference key that can be selected from the set of reference keys if the reference number column of the second table rema i ns u nmap ped .

The delta data comparison unit is configured for examining the data linked between the first table and the second table to detect the data modified in the the first table and the second table which has a reference number in the second table that corresponds to the row sequence number of the first table. The delta data comparison unit is also adapted for marking the delta column with a modification status flag. Preferably, the delta data comparison unit marks the delta column with a first modification status flag indicating that the data is modified or a second modification status flag indicating that the data is not modified. The change data extraction unit is configured for extracting and exporting the data newly added to, removed from, and modified in the database to the data target. In one preferred embodiment, the change data extraction unit identifies and extracts the change data which may include the data newly added to the database When the reference number in the reference number column of the second table reflects the default '0' value, the data removed from the database when the reference number in the reference number column of the first table reflects the default Ό' value, and the data modified in the database when the delta column is marked with the first modification status flag.

In accordance with another aspect, the present invention provides a method of managing change data in a database. The method comprises the steps of extracting data from a data source and loading the data into a staging table; and identifying the change data in the staging table.

The method of the present invention may be characterized by the steps of providing, based on the staging table, a first table and a second table comprising a row identifier column having a row sequence number assigned thereto, and a reference number column having a reference number corresponding to the row sequence number, wherein the second table further comprising a delta column for indicating modification status of the data; obtaining a set of reference keys based on the staging table, wherein the set of reference keys comprising at least one data key column with non-duplscated data; based on the set of reference keys. linking the data loaded in the first table and the second table by way of mapping the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto and mapping the row sequence number of the second table to the reference number column in the first the first table and the second table having a reference number in the second table that corresponds to the row sequence number of the first table for deteciing the data modified in the database, thus marking the delta column with a modification status flag; and extracting the data newly added to, removed from, and modified in the database to export to a data target. An advantage of the present invention is thai the system and method does not involve modification of the source's system and source database. The present invention suggests that such unnecessary modification may be avoided by way of providing the first table and the second table with additional columns including a row identifier column, a reference number column, and a delta column in order to detect chang data such as deletion, insertion and modification of data. Moreover, the present invention quickly extracts and migrates the change data captured from the database to a target data set continuously that completes the process in relatively fast and save time.

Another advantage of the present invention is that the system and method can be used on any type of data, including those with no key. The present invention generates a set of reference keys locally through a rule-based key generator that may be used to link the data between the first table and the second table. Advantageously, the present invention adapts a key subtraction technique to link the data for obtaining accurate data matching.

The foregoing and other objects, features, aspects arid advantages Of the present invention will become better understood from a careful reading of a detailed description provided herein below with appropriate reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS A more complete appreciation of the invention and many of the attendant advantages thereof will be readily as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein: Figure 1 is a schematic block diagram showing architecture of the system for managing change data in a database according to one embodiment of the present invention;

Figure 2 is a flow diagram showing the method of managing change data in a database according to one embodiment of the present invention; Figure 3 is a flow diagram showing the steps of obtaining a set of reference keys according to one embodiment of the present invention; Figure 4 is a flow diagram showing the steps of linking the data between the first table and the second table according to one embodiment of the present invention; and

Figure 5 is a flow diagram showing the steps of examining the data linked between the first table and the second table according to one embodiment of the present invention;

It is noted that the drawings may not be to scale. The drawings are intended to depict only typical aspects of the invention, and therefore should not be considered as limiting the scope of the invention.

DETAILED DESCRIPTION OF THE INVENTION

The present invention aims to provide system and method for managing change data in database that is capable of quickly, easily detecting or capturing change data including newly added data, removed data and modified data continuously in a database without involving modification of the source's system and source database. The present invention preferably detects the change data using a dynamically generated reference key that is configured for data linking or matching through additional columns or fields to the staging table, in that, advantageously, the present invention may be applied to any type of data.

Figure 1 shows architecture of the system according to one exemplary transform and load (ETL) unit 100 that may be configured to extract data from a data source into and out of a staging database. The ETL unit 100 refers to a data integration process for transferring raw data from the data source to a data target and next preparing the information for downstream uses. In one embodiment, the ETL unit 100 loads the data extracted from the data source containing a file in, for example, a comma- separated values (CSV) format into a staging table. The ETL unit 100 is coupled to a unit for identifying the change data captured from the staging table. As can be seen from Figure 1 , the staging database essentially Stores data extracted from the data source as well as from the data target. In the context of the present invention, the data source may be a specific data set, metadata set, database or metadata repository from where data or metadata are available.

The term "data source" can also refer to the characteristics and components of the raw statistical data used for compiling statistical aggregates.

The term "data target" in the present invention may refer to meta-objects in a database that is uniformly seen within a query definition and having data that is reported uniformly by one or more reporting modules. According to one preferred embodiment of the present invention, the ETL unit 100 is configured to generate two tables, namely a first table and a second table based upon the staging table. The first table and the second table comprise a row identifier column and a reference number column being added to the staging table. The row identifier Column essentially contains a row sequence number assigned to the data with automatically increasing field where the row sequence number is immediately generated when the data is inserted into the row of each of the two tables. The reference number column essentially contains a reference number which corresponds to the ro sequence number of the other table that can be Used for linking the data between the first table and the second table. For instance, the reference number of the second table corresponds to the row sequence number of the first table) and so forth. The reference number column has a default Ό" value of the reference number in the event that the data between the first table and second table is unlinked. The second table further comprises a delta column, in that the second table may be distinguished from the first table. The delta column essentially indicates modification status of the data by way of marking a modification status fl g therein. In preferred embodiments of the present invention, the first table is preferably loaded with the data thai is extracted from the data target, in which Such data can be remarked as old data. The second table is preferably loaded with the data that is extracted from the data source, in whieh such data Can be remarked as new data.

Tables 1 and 2 show an example of the first table and the second table, respectively before the data linking.

Table 1 : First table with additional row identifier column and reference number column loaded with data extracted from data target (old data), before the data linking.

(Default value for reference number column Table 2: Second table with additional row identifier column, reference number column and delta column loaded with data extracted from data source (new data), before the data linking.

(Default value for reference number column = 'ϋ')

(Default value for delta column = 'N' or no) Preferably, for every table, the reference number column has: a default '0' value of the reference number, and the delta column has a default 'NT modification status flag. In accordance with one preferred embodiment of the present invention, the unit for identifying the change data comprises a rule-based key generator 101 , a record linkage unit 102, a delta data comparison unit 103 and a change data extraction unit 104, as shown in Figure 1.

Accordingly, the rule-based key generator 101 is configured to obtain a set of reference keys based on the staging table In the staging database. The set of reference keys may comprise at least one data key column with non-duplicated data. The set of reference keys can comprise of a combination of data key columns with non-duplicated data. In one preferred embodiment, the rule-based key generator 101 comprises a rule-based key identifier engine module thai can be used to browse through every column in the staging table to obtain a reference key. The rule implied in the rule-based key generator 101 may include column type, length, field with null or blank value, filed with only one character length, field with date time format, field with double format, and field with text and the like so long that the rule can reflect unsuitafoility and defeats the purpose of a reference key.

Figure 3 depicts the steps of obtaining a set of reference keys 204 according to one exemplary embodiment of the present invention. In the first step, the rule-based key generator 101 selects a first field or column Of the staging table and identifies the first field as a first reference key. To ensure the first reference key has no duplicated data, a data duplication check is performed to the key. If no duplicated data is found, the first reference key will be selected and utilized in the next step which is the data linking step 205. If otherwise, the first reference key will be disregarded and a new reference key will be considered.

In one example, if the staging table has only one field, the rule-based key rii_i riafr¾r 'ΐ ΩΊ fiolH in

the staging table to obtain a reference key, The combination of the fields must pass a duplication check before it can be selected and used to link the data between the first table and the second table. However, if the combination does not pass the duplication check, a new combination will be considered. The new combination may be a combination of more than two fields. The step of obtaining a set of reference keys 204 in the rule-based key generator 101 may be configured to repeat and continue until a set of reference keys is determined. The architecture of the system shows that the rule-based key generator 101 is connected to the record linkage unit 102.

The record linkage unit 102 is configured to link the data loaded in the first table and the second table by using the set of reference keys obtained through the rule-based key generator 101. The step of data linking 205 is shown generally in Figure d and particularly in Figure 4.

According to Figure 4, the data is linked by way of mapping the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto, and mapping the row sequence number of the second table to the reference number column in the first table for locating the data removed therefrom. in one preferred embodiment, the record linkage unit 102 assigns a default '0' value of the reference number to the reference number column of the second table indicative of the data which is newly added thereto. It is also preferred that the record linkage unit 102 assigns a default '0' value of the reference number to the reference number column of the first table indicative of the data which is removed therefrom.

Accordingly, the data which has been linked between the first table and the second table is required to undergo a data linking inspection, whereby the reference number in the reference number column will be inspected . If the data or record reflects at reference number that is not equal to Ό', i.e. the default value of reference number, the data is accepted to proceed with the subsequent step unsuccessfully mapped since the data has no match in both the first table and the second table. In response to this, the record linkage unit 102 subtracts the reference key that is currently used for the data linking, removes one column at a time and thereby re-linking the data (including re-mapping the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto, and re-mapping the row sequence number of the second table to the reference number column in the first table for locating the data removed therefrom) using the subtracted reference key in order to find its match. The present invention sets no limit to the number of reference key subtraction such that the subtraction may repeat and continue until the reference number column in the tables has no reference number that reflects a zero ("0") value.

Examples of the data linked between the first table and the second table based upon Tables 1 and 2 are shown in the following tables.

Table 3: First table having columns completely mapped with information, after the data linking.

(Default value for reference number co umn

Table 4: Second table having columns completely mapped with information, after the data linking.

(Default value for reference number column = 'Ο')

(Default value for delta column = Ή')

second table, the record linkage unit 102 may be connected to the delta data comparison unit 103.

The delta data comparison unit 103 comprises a delta data comparison module that can be configured to examine the data linked between the two tables for detecting the data which has been modified in the database as comparing the data between the first table and the second table, in one embodiment, the delta data comparison unit 103 examines the data linked between the first table and the Second table which has a reference number in the second table that corresponds to the row sequence number of the first table. The delta data comparison unit 103 preferably marks the delta column with a modification status flag upon the examination.

Figure 5 shows the steps of examining the data linked between the first table and the second table 206 according to one exemplary embodiment of the present invention. Accordingly, upon comparing and examining the data located in one row of the first table together with its corresponding row data of the second table, the delta data comparison unit 103 marks the relevant delta column with a first modification status flag to indicate that the data is modified or a second modification status flag to indicate that the data is not modified. I one preferred embodiment, the first modification status flag may be a 'Yes modification' or Ύ' flag and the second modification status flag may be a 'No modification' or 'N' flag, which is the default flag for the delta column. The step of examining the data which is a row-wise basis may continue until every data is examined and marked with a flag.

To extract the change data, the change data extraction unit 104 is adapted into the system of the present invention. The change data extraction unit 104 extracts and exports the data which is newly added, removed and modified to the data target for further relevant processes. Figure 5 partly shows the step of extracting and exporting the change data 206. Preferably, the change data extraction unit 104 identifies, extracts and exports the change data which includes the data newly added to the database when the reference number in the reference number column of the second table reflects the default 'Ο' value, the data removed the first table reflects the default Ό" value, and the data modified in the database when the delta column is marked with the first modification status flag. These change data can be exported in a comma- separated values (CSV) format or in any other suitable format for subsequent action such as data updating in the data target. The method of managing change data in a database according to the present invention will now be described by way of example, with reference to Figure 2. Example 1

The method can be initialized by the steps of extracting data 200 from a data source and loading the data into a staging table; and identifying the change data in the staging table.

In one preferred embodiment of the present invention, the method comprises the step of providing, based on the staging table, a first table and a second table 201 comprising a row identifier column having a row sequence number assigned thereto, and a reference number column having a reference number corresponding to the row sequence number, wherein the second table further comprising a delta column for indicating modification status of the data. In steps 202 and 203, the first table is preferably loaded with the data that is extracted from the data target, in which such data can be remarked as old data. The second table is preferably loaded with the data that is extracted from the data source, in which such data can be remarked as new data.

Upon that, the method continues with the step of obtaining a set of reference keys 204 based on the staging table, wherein the set of reference keys comprising at least one data key column ith non-duplicated data.

Next, based on the set of reference keys obtained previously, the step of linking the data 205 loaded in the first table and the second table by way of mapping the row sequence number of the first table to the referenc number

f r m rtrvinn the row sequence number of the second table to the reference number column Jn the first table for locating the data removed therefrom commences.

Once ihe data between the first table and the second table is linked, the step of examining the data 206 between the first table and the second table haying a reference number in the second table that corresponds to the row sequence number of the first table for detecting the data modified in the database, thus the step of marking the delta column 207 with a modification status flag begins. Following that, the method initiates the ste of extracting the data 208 which is newly added, removed, and modified in the database so as to export to the data target.

In various embodiments, the present invention may be performed at various change data detection frequencies and time intervals with no limit. In some cases, the frequency and time interval, for example, can depend on exported source data frequency.

While this invention has been particularly shown and described with reference to the exemplar embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope Of the invention as defined by the appended claims.

Claims

A system for managing change data in a database, comprising:

an extract, transform and load (ETL) unit (1:00) for extracting data from a data source and loading the data Into a staging table; and

a unit for identifying the change data in the staging table, coupled to the

ETL unit (100);

characterized in that,

the ETL unit (100) configured for providing a first table and a second table and for loading the data thereto, wherein the first table and the second table comprising a row identifier column having a row sequence number assigned thereto, and a reference number column having a reference number corresponding to the row sequence number, the second table further comprising a delta column for indicating modification status of the data; and the unit for identifying comprising:

a rule-based key generator (101) configured for obtaining a set of reference keys based on the staging table, wherein the set of reference keys comprising at least one data key column with non-duplicated data;

a record linkage unit (102) configured for linking, based on the set of reference keys, the data loaded in the first table and the second table, wherein the record linkage unit (102) maps the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto and maps the row sequence number of the second table to the reference number column in the first table for locating the data removed therefrom;

a delta data comparison unit (103) configured for examining the data linked betwee the first table and the second table to detect the data modified in the database and for marking the delta column with a modification status

a change data extraction unit (104) configured for extracting and exporting the data newly added to, removed from, and modified in the database to a data target. A system according to Claim 1 , wherein the record linkage unit (102) assigning a default Ό' value of the reference number to the reference number column of the second table indicative of the data newly added thereto.

A system according to Claim 1 , wherein the record linkage unit (102) assigning a default Ό' value of the reference number to the reference number colunin of the first table indicative of the data removed therefrom.

A system according to Claim 1 , wherein if the reference number column of the second table remains unmapped, the record linkage unit (102) linking the data using a subtracted reference key that can be selected from the set of reference keys.

A system according to Claim 1 , wherein the delta data comparison unit (103) examining the data linked between the first table and the second table having a reference number in the second table that corresponds to the row sequence number of the first table.

A system according to Claim 1 , wherein the delta data comparison unit (103) marking the delta column with a first modification status flag indicating thai the data is modified or a second modification status flag indicating that the data is not modified.

A system according to Claim 1 , wherein the change data extraction unit (104) identifying and extracting the change data including the data newl added to the database when the reference number in the reference number column of the second table reflects a default '0' value, the data removed from the database when the reference number in the reference number column of the when the delta column is marked with a first modification status flag.

8. A system according to Claim 1 , wherein the set of reference keys comprising a combination of data key columns.

9. A system according to Claim 1 , wherein the first table and the second table comprising the staging table each being respectively loaded with the data extracted from the data target and the data source. 10. A method of managing change data in a database^ comprising:

extracting data from a data source and loading the data into a staging table; and

identifying the change data in the staging table;

characterized in that,

the method comprising the steps of;

providing, based on the staging table, a first table and a second table comprising a row identifier column having a row sequence number assigned thereto, and a reference number column having a reference number corresponding to the row sequence number, wherein the second table further comprising a delta column for indicating modification status of the data;

obtaining a set of reference keys based on the staging table, wherein the set of reference keys comprising at least one data key column with non- duplicated data;

based on the set of reference keys, linking the data loaded in the first table and the second table by way of mapping the row sequence number of the first table to the reference number column in the second table for locating the data newly added thereto and mapping the row sequence number of the second table to the reference number column in the first table for locating the data removed therefrom;

examining the data linked between the first table and the second table having a reference number in the second table that corresponds to the row sequence number of the first table for detecting the data modified in the database, thus marking the delta column with a modification status flag; and

in †hd

database to export to a data target.