WO2016022019A1

WO2016022019A1 - Method for data input into a database

Info

Publication number: WO2016022019A1
Application number: PCT/MY2015/050077
Authority: WO
Inventors: Boon Keong SEAH; Kwang Ming NG
Original assignee: Mimos Berhad
Priority date: 2014-08-05
Filing date: 2015-07-28
Publication date: 2016-02-11
Also published as: MY187720A

Abstract

A database is a central source of data for query and analysis. Functions of a database include retrieving, analyzing, transforming, loading and managing data; making them available for further use. Disclosed herein is a method for improved data input into a database whilst maintaining the structure of metric of facts. Data is extracted from a data source (1), transformed and loaded into fact tables in a data warehouse by means of bridge tables. Automation of dynamically adding dimensions into multiple fact tables with respect to the data source (1) is disclosed in the present invention.

Description

METHOD FOR DATA INPUT INTO A DATABASE FIELD OF TECHNOLOGY The present invention relates to a method for data input into a database, more particularly the present invention relates to a method for data input into a data warehouse.

BACKGROUND OF THE INVENTION

A data warehouse allows an organization or company to create a consolidated view of its enterprise data, and using the data for reporting and data analysis. A data warehouse can be used to analyse a particular subject area, integrate data from multiple data sources as well as store current and historical data. Creating a data warehouse requires mapping data between sources and targets, then capturing the details of the transformation in a metadata repository.

One of the challenges faced is that frequent changes occurs on the management to analyse the data differently. For example, analysing the location of patient for a disease, distribution of bed in a hospital and patient history on top of data for the type of disease. Due to the fact that frequent new requests are made to meet new scope of data analysis, additional data or columns have to be added into the fact table. This affects significant process time if the fact table is huge. In addition, changes has to be made to the existing ETL (Extraction, Transformation and Loading) process of data transformation to fact table. Further compounding to the issue, a data warehouse generally comprises numerous fact tables in order to cater for various business process.

Patent US 20090055439 Al mentions a method for dynamically adding dimensions specific to a tenant in a data warehouse without changing the structure of the fact table. However, US 20090055439 Al does not teach of adding dimensions into multiple fact tables as well as restriction to data input as only a predefined number of columns is able to be added to the fact table.

Patent US 20120109917 Al teaches a method for maintaining a hierarchy bridge table consistent with a corresponding hierarchy table. However, US 20120109917 Al supports only a single hierarchy table node. Furthermore, a hierarchy bridge table must always be correct when compared to the corresponding hierarchy to prevent the system from being defective.

Patent US 8352458 B2 points out techniques for transforming records prior to loading the records into a data warehouse. The system mentioned in US 8352458 B2 uses the dimension tables for adding foreign keys corresponding to dimension tables with data stored in a staging table before loading to a fact table.

The problem in the aforementioned prior arts is that the bridge table is unable to support data loading into multiple fact tables in the data warehouse without changing the structure of the fact table. The present invention addresses this problem.

SUMMARY OF THE INVENTION The present invention aims to provide a method for data input into a database whilst maintaining the structure of the metric of facts, known as fact tables.

It is an object of the present invention to provide a method for data input into a database comprising the steps of extraction of data from a data source, categorizing data in at least a first medium, providing at least a second medium, generating data mapping relationship from at least a first medium to at least a second medium, transforming said data to at least a metric of fields and loading a database with metric of facts from said at least a second medium. It is a further object of the present invention to provide a method for automation of dynamically adding new categories into multiple metric of facts with respect to at least a first medium for categorizing data .

It is a further object of the invention to provide a method for supporting multiple metric of facts by having a prefix column in the metric of facts referencing to at least a second medium.

It is a further object of the present invention to provide a method for data input in order to avoid changes to existing extract, transform and load process of data transformation to the metric of facts. The present invention provides a method for data input into a database for supporting multiple fact tables.

BRIEF DESCRIPTION OF DRAWINGS

Figure 1 illustrates a method of data input into a database according to the present invention.

Figure 2 illustrates a graphical representation of a first medium according to the present invention.

Figure 3 illustrates a method for assigning surrogate keys in the first medium according to the present invention.

Figure 4 illustrates a graphical representation of a second medium according to the present invention.

Figure 5 illustrates a method for partitioning the second medium according to the present invention.

Figure 6 illustrates a method for updating assigned surrogate key to the second medium according to the present invention.

Figure 7 illustrates a method for updating predefined column in each metric of facts with assigned surrogate keys according to the present invention.

Figure 8 illustrates a method for adding of a predefined column to a metric of facts according to the present invention.

Figure 9 illustrates a flowchart for achieving dynamic dimension process according to the present invention. DETAILED DESCRIPTION OF THE EMBODIMENTS

Described below are various preferred embodiments of the present invention with reference to the accompanying drawings. Each of the following preferred embodiments describes an example in which a method of data input into a database may be implemented and the improvements over existing prior art. The configuration of the invention is not limited to the configuration mentioned in the following description.

It will be appreciated that data is populated into a fact table in a database or data warehouse by first extracting data from a data source (1). This data in its natural key form will be assigned a surrogate key (9) representation. The data extracted from the data source (1) will then be transformed into a metric of facts (4) or also known as fact tables through intermediate mediums or bridge tables such as dimension tables. For each row of data population, the measure is calculated and the aggregate value from multiple dimension tables will be loaded to fact tables in a data warehouse. This method of data population is carried out by mapping the aggregate value from a first medium (2) to a predefined metric of facts (4) preceding to transforming data to at least a metric of facts (4), and is capable of maintaining the structure of the metric of facts.

As demand increases by the management for more analysis on the data from the data warehouse, data sources (1) are provided with new additional columns. For each additional column or field in the data source (1), individual dimension tables will be generated to map the natural key with the newly assigned surrogate keys (9). Additional columns have to be created in the existing fact table to accommodate these additional fields. Therefore, changes to the existing extract, transform and load, ETL process from the data source (1) to the fact table are required. The process of transforming data from the data source (1) to dimension tables and finally loading data into to the fact tables has to be repeated for each fact table in the data warehouse with a plurality of fact tables. If the fact tables have large data size, this process will be time consuming. Furthermore, frequent changes on the columns of fact tables and ETL process will cause adverse impact on the database design. The abovementioned problems are solved with the introduction of the present invention. The present invention enables automation of dynamically adding dimensions into multiple fact tables with respect to the data source (1) defined whilst retaining the structure of the fact table. The present invention moreover discloses a method of data mapping that supports multiple fact tables using a prefixed field to reference the dynamic dimension table.

Referring to Figure 1, the figure illustrates a method for data input into a database according to the present invention comprising the steps of extracting data from a data source (1), categorizing said data in at least a first medium (2), providing at least a second medium (3), generating data mapping relationship from at least a first medium (2) to at least a second medium (3), transforming said data to at least a metric of facts (4) and loading the database with at least a metric of facts (4) from at least a second medium (3). Referring to Figure 1 and Figure 2, Figure 2 illustrates at least a first medium (2) shown in Figure 1 wherein at least a first medium categorizes data from the data source (1) according to the present invention. A key step of data input into the database is through categorizing or defining data by at least a first medium (2) in a form of fact definition table. This said at least a first medium (2) or fact definition table has to be defined in order to facilitate the definition of one or more metric of facts (4) or fact tables to be added to the data warehouse.

Said at least a first medium (2) comprises a first plurality of fields (5) interlinking with one another for representing the data in categories (6). Said at least a first medium (2) or fact definition table comprises at least a field for identification of the metric of facts (4) (i.e. "Fact ID" column), a field for naming of the metric of facts (4) (i.e. "Fact Name" column), a field indicating the number of categories (6) to be mapped from the data source (1) (i.e. "num_dimension" column), a field for naming of the granularity of data (7) (i.e. "dimension" column) and a field for identifying the data source (1) (i.e. "data source" column). For each of these fields in the fact definition table, a user is to fill in the data accordingly.

The abovementioned "Fact ID" uniquely identifies the ID of the fact table in the data warehouse while the "Fact Name" identifies the name of the fact table. The number of categories (6) to be mapped from the data source (1) is defined by the "num_dimension" field. The "dimension" field enables a user to define multiple dimension table name and the column name mapping to the data source (1) table. The field name defined here has the same column name in the data source (1) table to enable matching mapping name. The "data source" column identifies the name of the table of the data source (1). Referring to Figure 3, the figure illustrates naming of the granularity of data (7) according to the present invention. The naming of the granularity of data (7) is registered under the "dimension" field mentioned above. This step involves assigning surrogate key (9) to be represented in said at least a second medium (3) prior to be represented in a metric of facts (4). A naming convention for the granularity of data (7) comprises a string of a plurality of category names and plurality of field names in the sequence of field name then category name as seen below:

An example of the above naming convention is:

Ethnic'Src Ethnic ID' 11 Service Type'Src Service Type' Referring to Figure 1 and Figure 4, Figure 4 illustrates at least a second medium (3) shown in Figure 1, comprising a second plurality of fields (8) interlinking with one another for generating data mapping relationship according to the present invention. Once the data has been categorized in at least a first medium (2) in that of the fact definition table, at least a second medium (3) is provided which generates data mapping relationship from said at least a first medium (2) to said at least a second medium (3). Said at least a second medium (3) also referred to as dynamic dimension table will be defined to facilitate transforming of data to one or more fact tables.

The dynamic dimension table comprises a field for identification of the categories (6) (i.e. "DynamicDim ID" column), a field mapping the identification of the metric of facts (4) from at least a first medium (2) (i.e. "Fact ID" column), a field containing surrogate key of categories (6) obtained from at least a first medium (2) (i.e. "key value" column) and a field for identifying name of the data (i.e. "dimension name" column).

The "DynamicDim ID" identifies the ID of each dynamic dimension or category from the data source (1). The "Fact ID" maps each fact table with the dynamic dimension table. The "key value" contains the surrogate key (9) of the dimension table. The surrogate key (9) is obtained from mapping of each element in the "dimension" column of the Fact definition table. Furthermore, 'dimension name' column identifies the dimension table name.

Referring to Figure 5, the figure illustrates a method for partitioning of said at least a second medium (3) based on numbers of each metric of facts (4) in the database according to the present invention. The figure clearly shows that said at least a second medium (3) in that of dynamic dimension table is partitioned based on the number of fact tables in the data warehouse (i.e. "Fact ID column). For the every fact table in the data warehouse, a predefined column (10) is added to map the surrogate key (9) of the dynamic dimension table.

Referring to Figure 6, the figure illustrates a method for the storing of mapping relationship from the data source (1) field to at least a second medium (3) through a first medium according to the present invention. The figure shows multiple "data source" fields such as service type dimension and ethnic dimension are mapped and stored in "DynamicDim" table in the steps of:

i. determining number of categories (6) or dimensions (num_dimension) ii. parsing at least a second medium (3) or each dimension table and the field of "dimension" column, followed by reading of table name in "data source" column.

iii. assigning temporary variables to dimension table name, column name and data source (1) name

iv. executing function Num=Num+ l

The table name of the data source (1) from the database will then be read, initializing Numl(Numl = 0) Referring to Figure 7, the figure illustrates a method for assigning of surrogate keys (9) from said at least a second medium (3) to at least a metric of facts (4) according to the present invention. The multiple fact tables or metric of facts (4) are assigned with surrogate keys (9) utilizing the mapping of relationship between multiple data sources (1) and multiple dimension tables stored, where for each row of data read from data source (1), row > Numl

i. assigning a surrogate key (9) to each unique combination of values based on each column name of dimension table name defined in the Fact Definition matching column name defined in the data source (1) ii. updating assigned surrogate key (9) to a field (ie: "Key Value" column) in the dynamic dimension table for each fact table

iii. updating a predefined column (10) (ie: "DynamicDim" column) in each fact table in the data warehouse with the assigned surrogate keys (9) of the dynamic dimension table

iv. executing Increment Numl =Numl + 1

Referring to figure 8, the figure illustrates a method for updating of a predefined "DynamicDim" column (10) added to existing fact table according to the present invention. A default value of less than zero (i.e. -1) is assigned to the predefined column (10) which means that when there is no value in the surrogate key (9) of the dynamic dimension, the function will not run.

Referring to Figure 9, the figure illustrates the steps for dynamically adding dimensions into a fact table in the form of a flowchart according to the present invention. Data obtained from multiple sources will be categorized into at least a first medium (2), in this case a fact definition table for mapping of relationship between multiple data source (1), column names and multiple dimension table column name for each fact table (21). If the number of rows in the fact definition table exceeds zero (22), the subsequent process entails. Details or columns in the fact definition table will be read and a temporary variable is initialized to zero (23). The dynamic dimension table will be partitioned based on each fact table in the data warehouse (24). If numbers of dimension or fields exceeds the temporary variable, the process will loop (25). The mapping of the relationship between multiple data source (1) and multiple dimension table column will be stored through means of parsing the multiple dimension name, column name and data source (31) and then assigning temporary variables to them (32). The table name of the data source (1) from the database will be read, initializing Numl (Numl =0) (26). Finally, multiple fact tables will be assigned surrogate keys (9) utilizing the mapping of relationship between multiple data sources (1) and multiple dimension tables (41). The assigned surrogate keys (9) will be updated to a column in the dynamic dimension table (42) for each metric of fact (4). The data is then populated into the data base (1) through updating of each column in the metric of facts with assigned surrogate keys (43). The process flow will be repeated if the data source (1) is larger than Numl (27).

In as much as the present invention is subject to many variations, modifications and changes in detail, it is intended that all matter contained in the foregoing description or shown in the accompanying drawings shall be interpreted as illustrative and not in a limiting sense.

Claims

A method for data input into a database comprising the steps of:

extracting data from a data source (1);

categorizing said data in at least a first medium (2);

providing at least a second medium (3);

generating data mapping relationship from at least a first medium

(2) to at least a second medium

(3);

transforming said data to at least a metric of facts (4); and

loading the database with said at least a metric of facts (4);

characterized in that the method further comprising a step of mapping a value from said at least a first medium (2) to a predefined metric in said at least a metric of facts (4) preceding to said step of transforming said data to at least a metric of facts

(4) which is capable of maintaining structure of the metric of facts.

The method of claim 1, wherein said at least a first medium (2) comprises a first plurality of fields (5) interlinking with one another for representing the data in categories.

The method of claim 2, wherein said first plurality of fields

(5) comprises a combination of at least a field for:

identifying the metric of facts (4);

naming the metric of facts (4);

indicating the number of categories (6) to be mapped from the data source;

naming the granularity of data (7) for forming a part of fields in said at least a second medium (3); and

identifying the data source (1).

The method of claim 3, wherein said granularity of data (7) comprises a string of a plurality of category (6) names and plurality of field names.

The method of claim 1, wherein said at least a second medium (3) comprises a second plurality of fields (8) interlinking with one another for generating data mapping relationship between said at least a first medium (2) and at least a second medium (3).

6. The method of claim 5, wherein said second plurality of fields (8) comprises a combination of at least a field for:

identifying the categories (6);

mapping the identification of the metric of facts (4) from at least a first medium (2);

storing surrogate key (9) of categories (6) obtained from said at least a first medium (2); and

identifying name of the data source (1).

7. The method of claim 6, wherein said at least a second medium (3) is partitioned based on numbers of each metric of facts (4) in the database.

8. The method of claim 1, wherein the mapping of relationship between data source (1) field and at least a second medium (3) is stored comprising the steps of: determining number of categories (6);

parsing said at least a second medium (3), the fields of categories (6) and identified data source (1);

assigning temporary variables to said at least a second medium (3), the fields of categories (6) and data source (1); and

repeating abovementioned steps for every number of categories (6).

9. The method of claim 8, wherein the data source (1) field is read and a recurring function initialized.

10. The method of claim 1, wherein said at least a metric of facts (4) is assigned with surrogate keys (9) by steps of:

assigning the surrogate keys (9) to each combination of values based on each field name of said at least a second medium (3) defined in the data source (i);

updating the surrogate keys (9) to said at least a second medium (3) for each of the metric of facts (4);

updating a predefined field (10) in each metric of facts (4) in the database with assigned surrogate keys (9) of said at least a second medium (3); and

executing a recurring function.

11. The method of claim 10, wherein said predefined (10) field in the existing metric of facts (4) is assigned a value of less than zero then a recurring function is executed.