CA2395244A1

CA2395244A1 - Data restructurer for flattening hierarchies

Info

Publication number: CA2395244A1
Application number: CA002395244A
Authority: CA
Inventors: Alexander Gorelik; Sachinder S. Chawla; Sridhar Gantimahapatruni
Original assignee: Individual
Current assignee: Acta Technology Inc
Priority date: 1999-12-20
Filing date: 2000-12-20
Publication date: 2001-06-28
Also published as: WO2001046866A3; WO2001046866A2; EP1397757A2; AU2912701A

Abstract

In a computer database system, queries are applied against stored data including at least one hierarchy representing divisions of the stored data and responses to queries depend on the hierarchy, a hierarchical data structure representing parent-child pairings is input and a flattened hierarchical data structure representing a transitive closure of the hierarchy represented by the hierarchical data structure is generated. The flattened hierarchical data structure is used in generating a response to a query.

Description

DATA RESTRUCTURER FOR FLATTE\~ING HIERARCHIES
BACKGROUND OF THE INVE~TTiON
The present invention relates to database processing in general and hierarchical database processing in particular.
In large business organization, transactional data is collected as transactions occur and is then periodically aggregated into an analysis form to allow decision makers to glean information from the collected data. For example, a manufacturer might collect transactions relating to sales, ordering and production. An example of such a transaction is a sales record, indicating an amount of the sale, the salesperson making the sale, the items sold, the office associated with that salesperson, and other related data. In a large organization, these transaction records are often aggregated into mufti-dimensional tables for analysis, to provide informative views of the organization's data. A long list of thousands of sales records is not likely to be as useful to a decision maker as a report aggregating those sales records by geographical region, dates, types of items, salesperson, or other "dimension".
One approach to dealing with the volume of data created by an organization is to periodically populate an enterprise resource planning (ERP) database with transaction data stored in a multidimensional (MD) database. An analyst or other decision maker could then query the MD database to obtain a "slice" of the data along a particular dimension to glean information about the data. For example, an analyst might query the MD
database to obtain a report showing the total sales for each state for each month to get a breakdown of the annual U.S. sales of the organization. In other words, the analyst is requesting a slice of the aggregated data in the MD database along the geographic and time dimensions.
In all but the simplest MD databases, the dimensions are hierarchical. For example, if the data has a geographic component, that data might be grouped by country, sales region, state, city, field office, etc. Such grouping would allow an analyst to slice the MD data at many different geographical levels. In order to be able to provide data in such a form, the ERP system needs to store business hierarchies, such as a cost center hierarchy, a cost element hierarchy or an organization reporting hierarchy. In the above example, the ERP system would include a hierarchy to indicate which cost centers belong in which sales region, which field offices are allocated to which cities, along with a hierarchy of cities, states and countries. Such hierarchies can be stored in tables within the MD
database system as a collection of child-parent nodes.
Suppose the MD data is organized by geography and sales data is aggregated by salesperson. Suppose further that the hierarchy indicates, for each salesperson, an associated field office and each field office is associated with a sales region and each sales region is associated with a country. If an analyst queries a transactional database system to provide a report of sales by country, the transactional database system would have to examine each record in the transactional database system's table of sales records, and accumulate the results. In a real-time environment, where the analyst expects a real-time response to the query, such processing is difficult, and often impossible, to achieve in real-time. With an MD database, however, such processing is more easily done, because the data is aggregated.
When the analyst makes the same query of an MD database system, the MD
database system only needs to get the aggregated records and total them up according to the dimensions and slice indicated by the query. Thus, in response to a query for sales in a country of interest, the MD database system would query a geographical hierarchy table to identify which sales regions are in the country of interest, query the geographical hierarchy table to identify which field offices are in those sales regions, query the geographical hierarchy table to identify which salespersons are associated with those field offices, then query the MD database for sales by salesperson and total the sales returned.
This process of issuing multiple queries and summarizing the results takes time and will slow down the response to analyst queries, which can be a problem where the analyst is analyzing data in real-time, issuing queries, reviewing the responses and issuing other queries in response to the responses, in an effort to understand the organization's data.
In view of the nature of the analysis process, improved performance of an MD
database querying system would be a great benefit to analysts and organizations with large amounts of data to be analyzed.
SUMMARY OF THE INVENTION
In order to improve performance of a database querying system where a hierarchical structure is used in generating a response to a query, a flattener generates a horizontal representation or vertical representation of a transitive closure of a hierarchical data structure. The flattened representation can be generated in advance of any queries, so that the database querying system can more quickly respond to a query.
Preferably, the flattened representations are only generated as often as the hierarchical structure changes and are stored, so that a flattened representation does not need to be generated for each query.
In a standard hierarchical structure, such as a table of parent-child pairs, only the immediate parents and children are represented, whereas in a flattened representation, all the required relationships, such as grandparent/grandchildren relationships are included.
More particularly, a flattened representation represents the transitive closure of the hierarchical structure, in that all ancestor-descendant pairings are represented in the flattened structure. While the flattened structure typically stores the hierarchy much less efficiently than a standard parent-child hierarchy table, a precalculated flattened structure can improve the response times for responses to queries that require the hierarchical information.
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 is a block diagram of a computer system according to one embodiment of the present invention.
Fig. 2 illustrates a hierarchy; Fig. 2(a) is a graph of the hierarchy and Fig.
2(b) shows a table data structure representing the hierarchy.
Fig. 3 is an illustration of a vertically flattened representation of the hierarchy shown in Fig. 2.
Fig. 4 is an illustration of a horizontally flattened representation of the hierarchy shown in Fig. 2.
Fig. 5 shows a table data structure representing an extended hierarchy, with child attributes stored in parent-child pairing records.
Fig. 6 shows a table data structure representing an extended hierarchy, with parent attributes and child attributes stored in parent-child pairing records;
the table of Fig. 6 also illustrates how parent-child pairings are stored when a child node is a child of more than one parent node.
Fig. 7 illustrates the conversion of a single node hierarchy into a transitively closed hierarchy table.
Fig. 8 illustrates a hierarchy.

Fig. 9 illustrates a.n extended hierarchy; Fig. 9(a) shows a table data structure representing the hierarchy; and F'ig. 9(b) shows a table data structure representing the corresponding vertically flattened r epresentation of the hierarchy.
Fig. 10 illustrates another extended hierarchy; Fig. 10(a) shows a table data structure representing the hierarchy; and Fig. 10(b) shows a table data structure representing the corresponding horizontally flattened representation of the hierarchy.
DESCRIPTION OF THE SPECIFIC EMBODIMENTS
Using the computer system described herein, a flattened hierarchy data structure is created for a hierarchical data structure so that the computer system can respond to queries more quickly. Using a flattened hierarchy table, a single query is all that is needed to obtain all the information about a hierarchy that is needed for responding to a query on the hierarchical data.
Fig. 1 illustrates such a computer system 10 according to one embodiment of the present invention. Computer system 10 is shown comprising a transactional data system 20 that maintains transactional data 22, such as OLTP (online transaction processing) data.
Transactional data system 20 is coupled to a data warehouser 30 that extracts, transforms and loads the transactional data into an aggregated data set 32. Aggregated data set 32 and associated metadata, including a hierarchical data structure 34, are provided to a flattener 40 and an analysis system 50, such as an OLAP (online analysis system) such as an database system.
Aggregated data set 32 is stored in an MD (multidimensional) database 52 in the format expected by analysis system 50. Hierarchical data structure 34 might be provided to analysis system 50 for storage as a parent-child table 56, but where the parent-child data is not separately needed, hierarchical data structure 34 can be provided only to flattener 40, which generates a flattened hierarchy from the hierarchical data structure 34 and provides that flattened hierarchy to analysis system 50 for storage as a flattened hierarchy table in table structure 54. Alternatively, analysis system could store hierarchical data structure 34 as parent-child table 56 and provide that table to flattener 40 for generation of the flattened hierarchy. With table structure 54 and MD database 52 in place, analysis system 50 fields queries and issues responses that use the data in table structure 54 and MD
database 52 to create the responses. Table structure 54 comprises vertically flattened hierarchies or horizontally flattened hierarchies, or both.

An example of a simple hierarchy 80 is shown in Fig. 2. Fig. 2(a) shows the hierarchy is graph form. As shown there, hierarchy 80 is made up of four nodes and three edges connecting those nodes. In a typical computer system, the hierarchy is likely to have many more nodes and possibly more levels. In hierarchy 80, the nodes are labeled "A", "B", S "C" and "D". Node A is the root node, node B is an intermediate node and nodes C and D
are leaf nodes. Hierarchy 80 is represented in a database system as a table, such as table 82 shown in Fig. 2(b). Table 82 can be a relational database table.
Fig. 3 illustrates a flat structure 84V that might result as a vertical flattening of table 82, while Fig. 4 illustrates a corresponding horizontally flattened hierarchy 84H. In flat structure 84V, each possible ancestor-descendant pairing is represented by a record in the structure, including the degenerate cases where a node is paired with itself.
As structure 84V
is generated by flattener 40, flattener 40 adds the fields for the columns shown in Fig. 3 (DEPTH, ANCESTOR ROOT and DESCENDANT LEAF).
The DEPTH field indicates the number of nodes between the ancestor and the descendant, including the descendant. The DEPTH field is useful for quickly filtering out different levels of pairings. For example, a list of all nodes can be easily obtained by querying flat structure 84V for all records with DEPTH=0 and a list of all parent-child pairings can be obtained by querying for all records with DEPTH=1.
The ANCESTOR ROOT field indicates whether or not the ancestor referenced by the row is a root node, i.e., has no parent, while the DESCENDANT LEAF
field indicates whether or not the descendant referenced by the row is a leaf node, i.e., has no children. These fields are useful for operations that are specific to root or leaf nodes. It should be understood that the structures and operations described herein apply to tree structures (single root, each child being a child of exactly one parent) as well as forest structures (possibly more than one tree, possibly more than one root per connected graph and multiple parent nodes for some child nodes).
Fig. 5 illustrates a more real-world example. Suppose an organization stores its sales data with a geographical dimension. A part of the corresponding hierarchy is shown as parent-child table 100 in Figure 5(a). According to that table 100, California (CA) and New York (NY) sales count as U.S. sales and U.S. sales count as worldwide (Earth) sales. If the geographical element of the data is the state of a sale, then the process of determining worldwide sales requires querying table 100 to determine the children of the root node record (Earth), then querying table 100 to determine the children of those children, and so on. Once a query engine determines which intermediate and leaf nodes are under the selected parent node, the multidimensional database can be queried for all data matching those root, intermediate and leaf nodes.
If a flattened hierarchy, such as vertically flattened hierarchy table 105V
shown in Fig. 5 (b), is used instead of the table 100, querying to determine the relevant nodes can be done in one query. Continuing the above example, the query engine need only query table 105V once, to obtain all relevant nodes (e.g., using the SQL statement "SELECT
ANCESTOR=Earth). If a horizontally flattened hierarchy table is used, such as table 105H
shown in Fig. 5 (c), determining the relevant nodes still only requires one query (e.g., the SQL statement "SELECT LEVEL 0=Earth). Of course, the query of table 105H
requires knowledge of the "level" at which the data is being sliced. Construction of table 105H also typically requires advanced knowledge of the maximum level for any leaf node.
In the example used in Fig. 5, the maximum level is two, so the columns labeled "..."
and "LEVEL n" could be replaced with one column labeled "LEVEL 3" containing all nulls, or the LEVEL 3 column could be omitted and the leaf node status of the last two records in table 105H could be assumed by the fact that there is no column beyond LEVEL
2.
For most useful applications, nodes of a hierarchy include attributes. In a homogeneous hierarchy, the children and parents are the same objects (i.e., have the same structure for their attributes). By contrast, a heterogeneous schema is one where different nodes have different schema. A homogeneous hierarchy can be represented in a relational table as a set of parent-child records, such as the relational table shown in Fig. 2(b) or Fig.
5(b). Information about a node could be stored in MD database 52, but where the information is specific to a node as it is positioned in its hierarchy, the preferred approach is to store the information as an attribute in the hierarchical data. For example, Fig. 6 shows a table 110 that is a relational table representing an extended hierarchy (an extended parent-child table) with attributes of the node stored in the table. Each record represents a parent-child pairing and the attributes for the child node of the pair are stored with the record.
In some cases, the attributes of the child node (or the parent node) depend on the pairing. For example, in an extended parent-child table, such as table 110' shown in Fig.
7, a field labeled "Rank" is included in the table. The rank values found in the Rank column represent the ranking, in size, of the child among other siblings (children of a common parent). If each child node is connected by an edge to only one parent node, then the ranking of the child is fixed and only one record per child is needed in table 110', but where a child node connects to more than one parent node (e.g., child node "Russia" connects to parent node "Earth" and parent node "Asia"), multiple records might be needed in table 110' to handle differing values for attributes that depend on which pairing is being examined.
Another feature to note from the table shown in Figs. 6-7 is that where both the parent and child attributes are stored for each pairing, the "root-null"
pairing (see the first record of table 110) is not needed. If only the child attributes are stored for each pairing, then the root-null pairing is needed, to holds the values for the attributes of the root node. In come cases, a null value for the PARENT field is undesirable, such as when an application includes programs that do not expect null values. For example, a program that lists all nodes might fail upon encountering a record with a null value. In such cases, a table such as table 110' shown in Fig. 7, with child and parent attributes stored in each pairing, could be used.
One disadvantage of table 110' over table 110 is that table 110' takes up more space.
In order to analyze a hierarchy using relational or dimensional modeling, the hierarchy frequently has to be flattened into a transitive closure, which is a complete list of ancestors-descendant pairings. A complete list, strictly speaking, would include all possible pairings of a node with itself, to accommodate the degenerate case of a single node hierarchy.
For example, given the single node hierarchy table 114 shown in Fig. 8, transitive closure table 116 results from the flattening process.
A more complex, and more typical, flattening is shown in Fig. 9. There, a two-level hierarchy with attributes is represented as table 120 shown in Fig.
9(a). The corresponding vertically flattened hierarchy is shown as table 125V in Fig.
9(b).
Using the system described herein, there are two ways to represent a hierarchy - a horizontal flattening and a vertical flattening. The horizontally flattened hierarchy is represented by a table having a set of rows with one row for each edge and one column for each level or attribute. This type of representation is typically useful for normalized hierarchies - hierarchies where the depth (number of levels) is the same for most branches.
The vertically flattened hierarchy is represented by a table having a separate row for each ancestor-descendent pair (including pairings of a node with itself). This type of representation is typically used for non-normalized hierarchies - hierarchies where the depth is very different for different branches.
If a hierarchy has attributes, the attributes should be included in the flattened representations, so that the attributes are available in the, preferably, one query of the hierarchy table. This is shown, for example, in table 12~V. For a vertical hierarchy, each row would include the parent attributes for the ancestor and the child attributes for the descendent.
For a horizontal representation, the parent and child attributes are provided for each level. Since parent and child attributes are sometime the same, there may exist some redundancy of attributes. For purposes of explanation, however, such redundancy is ignored in order to generalize the handling. An example of such a horizontal representation is shown in Fig. 10. There, Fig. 10(a) shows a parent-child table 130 that includes the parent attribute fields "PS" (PoliticalSystem) and the child attribute fields "PE"
(PoliticalEntity) and "CR"
(Child Rank, in population). In the corresponding horizontally flattened hierarchy, represented by table 135 in Fig. 10(b), the root node attribute is stored in the "PtypeO" field, level 1 child attributes are stored in the "Ctypel" and "Crankl" fields, level 1 parent attributes are stored in the "Ptypel" field and level 2 child attributes are stored in the "Ctype2" and "Crank2" fields. Note that there are no level 2 parent attributes, because none of the nodes at level 2 are parent nodes; they are all leaf nodes. Also, there are no level 0 child attributes, because level 0 is the root level and therefore there are no children nodes at that level.
The above description is illustrative and not restrictive. Many variations of the invention will become apparent to those of skill in the art upon review of this disclosure.
The scope of the invention should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims along with their full scope of equivalents.

Claims

WHAT IS CLAIMED IS:

1. In a computer database system, wherein queries are applied against stored data including at least one hierarchy representing divisions of the stored data and responses to queries depend on the hierarchy, a method comprising the steps of:
inputting a hierarchical data structure representing parent-child pairings;
generating a flattened hierarchical data structure representing a transitive closure of the hierarchy represented by the hierarchical data structure; and using the flattened hierarchical data structure in generating a response to a query.

2. The method of claim 1, wherein the hierarchical data structure represents a homogeneous hierarchy.

3. The method of claim 1, wherein the flattened hierarchical data structure is a horizontal representation.

4. The method of claim 1, wherein the flattened hierarchical data structure is a vertical representation.

5. The method of claim 1, wherein the hierarchical data structure includes attributes for parent nodes, attributes for child nodes, attributes specific to parent-child pairings, or a combination of more than one of the above.

6. The method of claim 1, wherein the hierarchical data structure is input as a relational data structure.

7. The method of claim 1, wherein the flattened hierarchical data structure is output as a relational data structure.

8. The method of claim 1, wherein the step of generating is performed before the computer database system receives a query for which a response to that query depends on the hierarchy.

9. The method of claim 1, further comprising the steps of:
determining when the hierarchical data structure changes;

when the hierarchical data structure changes, regenerating the flattened hierarchical data structure representing the transitive closure of the hierarchy represented by the hierarchical data structure.