CN112579706A

CN112579706A - Data warehouse model and application thereof

Info

Publication number: CN112579706A
Application number: CN201910929830.5A
Authority: CN
Inventors: 景超
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-09-27
Filing date: 2019-09-27
Publication date: 2021-03-30

Abstract

The invention relates to a data warehouse model and application thereof, based on each service field, adding an agent key, an operation time, an operation type, a pre-recorded field and a state automatic ending time field, and specifically designing and applying to various data warehouse data use scenes, the invention has the advantages that: 1. unifying data organization forms; 2. more storage engines can be compatible; 3. providing more direct single table analysis capability and more flexible data association logic; 4. can help realize more highly cohesive data integration work; 5. continuous integration of new data sources is more convenient; 6. a more effective continuous integration modeling thought is provided; 7. the state machine can promote the design and development of a state transition checking function according to a predefined state machine, check the accuracy of data in real time and alarm in time.

Description

Data warehouse model and application thereof

Technical Field

The invention relates to a data warehouse model and application thereof, and belongs to the technical field of data warehouse models.

Background

The star model is a multidimensional data relationship, which is composed of a Fact Table (Fact Table) and a set of Dimension tables (Dimension Table) to form a theme. Each dimension table has one dimension as a primary key, and the primary keys of all these dimensions are combined into the primary key of the fact table. The non-primary key attributes of Fact tables are called facts (Fact), which are generally numeric values or other data that can be calculated; the dimensions are data of types such as characters and time, and the data organized in this way can be subjected to aggregation calculation of summation (sum), averaging (average), counting (count) and percentage (percentage) according to different dimensions (part or all of the main keys of the fact table), and even 20-80 analysis, so that the condition of the business theme can be analyzed from different angles and numbers.

The star model is an unnormalized structure, and each dimension of the multidimensional dataset is directly connected with a fact table. It is emphasized that the dimensions are preprocessed to assemble a plurality of dimensions into a fact table to form a wide table. This is why we will often see some large-width tables, which are generally fact tables and include primary keys associated with dimensions and some metric information, while dimension tables are specific information of the dimensions inside the fact tables, and when in use, data are combined by join, which is relatively convenient for analysis of OLAP, and since there is no gradient dimension in the star model, data have some redundancy, as shown in fig. 1.

When one or more dimension tables are not directly connected to a fact table, but are connected to the fact table through other dimension tables, the dimension tables are illustrated as if a plurality of snowflakes are connected together, so that the snowflake model is called. The snowflake model is an extension of the star model, and is used for further layering the dimension tables of the star model, the original dimension tables can be extended into small fact tables to form local 'hierarchical' areas, and the decomposed tables are connected to the main dimension table but not the fact tables. As shown in FIG. 2, the region dimension table is further decomposed into dimension tables of country, province, city, etc. It has the advantages that: query performance is improved by minimizing the amount of data storage and coalescing smaller dimension tables. The snowflake model is more in line with the database paradigm, and data redundancy is reduced, but when data are analyzed, operation is complex, and more join tables are needed, so that the performance of the snowflake model is not necessarily higher than that of a star model.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data warehouse model, which adopts a brand-new model structure design, not only can clearly mark the state of each record, but also can facilitate the modification of data in the model, and improve the work efficiency of data storage application in the data warehouse.

The invention adopts the following technical scheme for solving the technical problems: the invention designs a data warehouse model, which is used for constructing tables in a data warehouse aiming at various data records extracted from various data sources, wherein the various data records extracted from the various data sources are stored in the data warehouse in a mode of storing different version data in a new data record mode and storing historical version data records in an old mode; the data warehouse model includes business fields corresponding to the data sources, respectively; the system also comprises an agent key and an operation moment, wherein the agent key is used for uniquely identifying each data record in the data warehouse; the operation time is used for identifying the starting time of the state corresponding to each data record in the data warehouse.

As a preferred technical scheme of the invention: the data warehouse model also comprises an operation type, wherein the operation type is used for identifying the change of the business fields corresponding to different business data in each data record under the same natural key and at the adjacent operation time in the data warehouse.

As a preferred technical scheme of the invention: the data warehouse model also comprises a front record field, and the front record field is used for identifying the surrogate key of the data record under the same natural key and adjacent to the previous operation moment in the data warehouse.

As a preferred technical scheme of the invention: the data warehouse model also comprises a state automatic ending time field, wherein the state automatic ending time field is used for identifying the time when the states corresponding to the data records in the data warehouse respectively automatically end, and if the states corresponding to the data records do not have the automatic ending time, the state automatic ending time field of the data record is empty or infinite.

As a preferred technical scheme of the invention: the data warehouse model also comprises state machines respectively corresponding to the tables, and the state machines are used for identifying the change relation, the change action and the change condition between different states corresponding to the data records in the corresponding tables.

In view of the above, the technical problem to be solved by the present invention is to provide an application of a data warehouse model, which is to record data in a data warehouse by applying a model with a completely new structural design, so that not only can the state of each record be clearly marked, but also the data in the model can be modified conveniently, and the work efficiency of data storage application in the data warehouse can be improved.

The invention adopts the following technical scheme for solving the technical problems: the invention designs an application of a data warehouse model, which comprises a state duration obtaining method, wherein the state duration obtaining method is used for obtaining the duration of a state corresponding to a target data record in a data warehouse and is implemented as follows:

if the next data record of the adjacent time sequence with the same natural key exists in the data warehouse corresponding to the target data record, and the operation time of the next data record of the adjacent time sequence is less than the operation time of the target data record, the time difference from the operation time in the target data record to the operation time in the next data record of the adjacent time sequence is the duration of the state corresponding to the target data record;

if the next data record with the same natural key does not exist after the target data record and the automatic ending time of the state in the target data record is null or infinite, the time difference between the operating time and the current time in the target data record is the duration of the state corresponding to the target data record;

if the next data record with the same natural key does not exist after the target data record, and the field of the state automatic ending time in the target data record is non-empty and non-infinite, or the operating time in the next data record is later than the field of the state automatic ending time in the target data record, the time difference between the operating time in the target data record and the field of the state automatic ending time is the duration of the state corresponding to the target data record.

As a preferred technical solution of the present invention, the method includes a latest data record obtaining method, configured to obtain a latest data record corresponding to a target operation type under a target natural key in a data warehouse, and execute the following steps:

firstly, acquiring various data records corresponding to target operation types under target natural keys in a data warehouse; and then selecting the data record corresponding to the last operation moment, namely the latest data record corresponding to the target operation type under the target natural key.

As a preferred technical solution of the present invention, the method includes a target class state statistics duration obtaining method, which is used for implementing duration statistics of a target class state under a target natural key in a data warehouse, and the following is executed:

firstly, acquiring various data records corresponding to target natural key down and target class states in a data warehouse within a time-duration period as data records to be processed; then obtaining the time difference between the operation time in each data record to be processed and the operation time in the next data record corresponding to the operation time in the adjacent time sequence; and finally, counting to obtain an accumulation result of each time difference, namely counting the time length of the target natural key and the target class state in the data warehouse.

As a preferred technical solution of the present invention, the present invention includes a status checking method, which is used for checking a status corresponding to a data record in each table in a data warehouse, and when the status is periodically or specifically operated with respect to the table, the following is performed:

and checking the corresponding table according to the change relation, the change action and the change condition between different states corresponding to each data record in the table represented in the state machine, and giving an alarm if an error occurs.

Compared with the prior art, the data warehouse model and the application thereof have the following technical effects by adopting the technical scheme:

the invention designs a data warehouse model and application thereof, which have the following beneficial effects: 1. the data organization form is unified, so that the modeling is simpler and more standard; 2. more storage engines can be compatible, and the efficiency of the ETL of the data is improved; 3. the method provides more direct single-table analysis capability and more flexible data association logic, and is suitable for service analysis of various scenes; 4. The data integration work with higher cohesion can be realized, and the data is organized more in an object-oriented manner; 5. continuous integration of new data sources is more convenient, because each model table is relatively independent, the influence of the continuous integration is relatively low; 6. a more effective continuous integration modeling idea is provided, a basic service model and requirements are separately analyzed, a plurality of bins which are more universal relative to a foundation can be initially provided, and then continuous design and integration construction are carried out according to the requirements; 7. the state machine can promote the design and development of a state transition checking function according to a predefined state machine, check the accuracy of data in real time and alarm in time.

Drawings

FIG. 1 is an exemplary schematic diagram of a prior art star model;

FIG. 2 is an exemplary schematic diagram of a snowflake model in the prior art;

FIG. 3 is an exemplary diagram of a design data warehouse model of the present invention.

Detailed Description

The following description will explain embodiments of the present invention in further detail with reference to the accompanying drawings.

There are three TYPEs of gradual changes to the slowly changing dimension, and in many cases, in order to preserve the historical version, the TYPE2 form of the slowly changing dimension is used to organize the dimension data, i.e., the historical version is kept as it is when the new version is stored in the form of a new record. Three additional fields are typically added-a surrogate key (to distinguish version records), a state start time (to record the start time of a state), and a state end time (to record the end time of a state, if the state has not yet ended, the record is null). When a new state is generated, the state ending time field of the previous state is set to the time, and the new state is stored in a form of new record, so that the pull-linked list is formed by the state records.

At this time, each record is stored in the form of a state, having a state start time and a state end time. However, many times, modifying a large number of data records in a database is a very performance consuming matter, the performance of a pruning operation is generally much worse than that of an incremental search operation, and some storage engines have difficulty modifying existing data, such as HDFS.

Therefore, the invention designs a data warehouse model for realizing the construction of each table in the data warehouse aiming at each data record extracted from each data source, wherein based on the TYPE2 form, each data record extracted from each data source is stored in the data warehouse in a new data record form and a historical version data record is still stored in the data warehouse in the process that each data record is sequentially recorded in the data warehouse. And the state record is converted into an action record, the state record has a starting time and an ending time, if the starting time and the ending time are converted into an operation time, the action record is formed, the action causes the state transition, the time difference between the two actions is the duration of the state, and if the next action does not exist, the state is always continued.

When the data records are extracted from the data sources and stored in the data warehouse in the conventional TYPE2 form, the data warehouse model comprises a proxy key and an operation time, wherein the proxy key is used for uniquely identifying each data record in the data warehouse, besides each service field respectively corresponding to each data source; the operation time is used for identifying the starting time of the state corresponding to each data record in the data warehouse.

For example, for the change operation of the account address in the account information table, the original account change record a is not changed, an account change record B is newly added, the proxy key +1 is operated at the time of changing the account, the value of the account address field is the new address, and the other field values are copied again without changing.

The invention designs a data warehouse model, which comprises an operation type, a pre-record field and a state automatic ending time field besides a new agent key and an operation time, wherein the operation type is used for identifying the change of service fields corresponding to different service data in each data record under the same natural key and at the adjacent operation time in a data warehouse. The introduction of the operation type is that if a dimension table has a plurality of dimensions, such as a user interface address, a user interface type, a native place and the like, each dimension is possibly changed, only the proxy key and two fields at the operation time are used, which dimension is not distinguished is changed, and the operation type field is added for distinguishing at the time; the operation type is converted from a service state, for example, the operation type is "change home address".

In practical application, in order to form an action chain better and obtain the state duration conveniently, a previous record field is introduced for identifying a surrogate key of a data record under the same natural key and at the next previous operation time in a data warehouse, namely, the previous action record is associated in the ETL process of the data.

And the automatic state ending time field is used for identifying the time when the states corresponding to the data records in the data warehouse are automatically ended respectively, and if the states corresponding to the data records do not have the automatic ending time, the automatic state ending time field of the data records is empty or infinite, such as the failure time of the program content, or the user watches the program content without any operation.

Based on the newly-added design of each field of the data warehouse model, the state machine corresponding to each table is further designed, and the state machine is used for identifying the change relationship, the change action and the change condition between different states corresponding to each data record in the corresponding table. The operation type field stores code values of actions or conditions for state migration, and the code values are converted from state enumeration values, for example, user online states, which are caused by user login actions. At the same time, combing the states, actions and conditions into a state machine, determining a complete state conversion chain, and outputting a state conversion table, namely the state machine corresponding to each table. For example, in a simple user online detection system, a user online and offline is a complete state chain, and finally, the sorted state machine conversion logic is added to the ETL processing logic, so that a state machine is realized, and the state conversion table is shown in the following table 1.

TABLE 1

In practical applications, the pre-recorded field depends heavily on the state machine, a mature and correct state machine can perform real-time inspection during ETL of data, and an immature or incorrect state machine can cause a wrong record when data is lost, resulting in a certain cost for subsequent correction (modifying the value of the pre-recorded field of the next action record as a surrogate key for a new supplementary record). Therefore, this field can be determined whether to add or not according to actual conditions.

To sum up, as shown in fig. 3, the dimension table and the fact table of the data warehouse model designed by the present invention are unified into the same data organization form, and there is no definite theme or definite theme core, that is, the star body, the star corner, the center and the edge can not be clearly distinguished, and they are like one star on the sky on the earth, and the stars are independent from each other, but can be combined and linked into a constellation at will according to the imagination of people, which is the business logic, and there is no association relationship that needs to be strictly organized according to a certain shape. The original dimension table and the original fact table are only unified in table structure form, but have unchanged functions, and are still respectively called as the dimension table and the fact table. Each dimension table also resembles an object in an object-oriented programming language.

In order to better fit the business reality, as shown in fig. 3, fields such as location, establishment time, number of persons in the department, etc. are added to the department table; the product table is added with fields such as production time, production address and the like. From the graph, if a star model or a snowflake model is used for statistical analysis of the product production condition in a certain historical time period or the number of people in departments in a certain historical time period, a product production fact table and a department personnel fact table need to be additionally designed and then are respectively associated with a product dimension table and a department dimension table. The data warehouse model designed by the invention has the advantages that the fact table and the dimension table are integrated and are not distinguished, so that the product table can analyze the production condition of the product in a certain historical time period, even the production condition of the product in a certain area, and the fact table does not need to be additionally designed. Similarly, the department table itself can be used for various statistical analyses related to departments, and no fact table needs to be separately designed for the department table.

For the new model, a modeling method matching the new model is required. There are two general ideas for modeling the number of bins: the first idea is from top to bottom. Firstly determining a theme according to requirements, then determining a fact table of the theme, and finally determining a related dimension table. Concept two, from bottom to top. And finding out all dimensions according to the service model, determining facts according to the incidence relation among the dimensions, and finally screening out useful topics. It should be noted that the modeling step is likely to be an iterative process, so as to continuously modify the errors and omissions of the previous design and facilitate the continuous integration of the subsequently added data sources.

The invention is used for developing the idea II from bottom to top, starting from the business model, abandoning the data source and designing the data model conforming to the business model. First, dimensions and their attributes are determined. If there are business models, all dimensions involved in the business models are determined. If there is no business model, then the dimensions need to be analyzed from existing data sources. After the dimensions are determined, the attributes of each dimension are respectively determined, meanwhile, the simple attributes and the compound attributes (similar to the simple data types and the complex data types in java language) are identified, whether the attributes need to be independent of a dimension table is analyzed, and for the compound attributes, the dimension table is generally independent. The dimension table after being independent and the dimension table to which the dimension table is subordinate are called a slave dimension table and a master dimension table respectively.

And designing a corresponding physical model according to the dimension and the attribute analyzed, and creating a physical table structure. And adding a proxy key, an operation time, an operation type, a pre-record field and a state automatic ending time field in each dimension table, and including all service fields. It should be noted that it is not necessary to separately record a start operation and an end operation for the same state, and the operation time of the next operation is the end time of the state and the start time of the next state.

At this stage, if there is no clear requirement, the design of the basic model and the establishment of the physical table structure are completed, the corresponding data source table can be searched, the mapping relation between the data source table and the physical model table is made, and the ETL work from the data source table to the physical model table is performed. Since the dimension tables of the model have analyzable capability, the analysis of the single table and the association analysis between the master dimension table and the slave dimension tables can be carried out at the moment.

If it is necessary to preset some subjects for multi-table analysis, or there are some explicit requirements for multi-table analysis, then the following procedure is followed: the dimension table related to the requirement is listed, and the association relationship between the dimensions is clear, including the action and the quantity, for example, the relationship of many-to-many when the user watches the film and television programs, and the result is the fact table, so that the physical modeling can be carried out and the ETL can be started. Similarly, if new needs are added subsequently, continuous modeling and ETL integration work are carried out in the same way.

If there is no analysis requirement, but some multi-table analysis topics are preset, the association relations among all dimensions can be listed in a permutation and combination mode, including pairwise association, triplex association, quadruplex association and … …, and finally, the association relations with practical significance are screened out, which are fact tables. Of course, the result of permutation-combining is very large, so it is necessary to schedule well-implemented priorities and schedules.

In summary, the modeling method of the present invention is divided into two stages. The first stage is an independent dimension stage, and analyzes and designs all independent dimension tables (except a master dimension table and a slave dimension table). The second stage is a dimension association stage, and association relation analysis is carried out according to requirements or permutation and combination to finally determine the fact table. The modeling advantages of the invention are shown, the basic business model and the requirements can be separately analyzed and designed, a relatively basic and universal multi-bin model can be provided initially, and then continuous design and integrated construction can be carried out according to the requirements.

The data warehouse model provided by the invention is applied to practice and comprises a state duration obtaining method, a latest data record obtaining method, a target state statistical duration obtaining method and a state checking method. The state duration obtaining method is used for obtaining the duration of the state corresponding to the target data record in the data warehouse, and is implemented as follows:

In the first case, as an example, it is known from the syntax of natural language that one predicate is determined, which is a state duration that is a difference between operation timings of two records adjacent to a surrogate key of the same subject and object, and a subject and an object to which the predicate belongs need to be found. In connection with the present invention, the known predicate is the operation type field, and the subject and object need to be determined according to the actual business scenario. Taking the user watching the on-demand program as an example, the subject is user, i.e., userid, the object is program, i.e., content, and the predicate is various actions in watching the program, including play, pause, fast forward, fast backward, exit, etc., and is an enumerated value in the operation type field. Then, to calculate the duration of the first action (press play) of the content of u01 viewing content of c01 (no care is given to what the next action is), the difference between the operation timings of the two records adjacent to the surrogate key of u01+ c01 is the duration of the first play action of the user u01 viewing content c 01. Of course, in practice, the subject may be a plurality of fields, and the object may be a plurality of fields.

The latest data record obtaining method is used for obtaining the latest data record corresponding to the target operation type under the target natural key in the data warehouse and is implemented as follows:

The method for obtaining the statistic duration of the target class state is used for realizing the duration statistics of the target class state under a target natural key in a data warehouse and is implemented as follows:

The state checking method is used for checking the states corresponding to the data records in each table in the data warehouse, and when the operation is performed periodically or aiming at the tables, the following steps are executed:

and checking the corresponding table according to the change relation, the change action and the change condition between different states corresponding to each data record in the table represented in the state machine, and giving an alarm if an error occurs. Therefore, when a mature correct state machine exists, the action record with the state conversion error appears, and the repair can be rapidly made, so that the correct real record can be additionally recorded. If the recorded data is not timely subjected to additional recording and completely lost or a correct state machine is not available, the action recording with wrong state conversion occurs, and a certain data additional recording algorithm is required to be used in data application to avoid data loss. In practical application, according to the defined state machine, if two adjacent action records with incorrect state transition are found, one action record conforming to the state machine is inserted into the adjacent action records.

The data warehouse model designed by the technical scheme and the application thereof have the beneficial effects that: 1. the data organization form is unified, so that the modeling is simpler and more standard; 2. more storage engines can be compatible, and the efficiency of the ETL of the data is improved; 3. the method provides more direct single-table analysis capability and more flexible data association logic, and is suitable for service analysis of various scenes; 4. The data integration work with higher cohesion can be realized, and the data is organized more in an object-oriented manner; 5. continuous integration of new data sources is more convenient, because each model table is relatively independent, the influence of the continuous integration is relatively low; 6. a more effective continuous integration modeling idea is provided, a basic service model and requirements are separately analyzed, a plurality of bins which are more universal relative to a foundation can be initially provided, and then continuous design and integration construction are carried out according to the requirements; 7. the state machine can promote the design and development of a state transition checking function according to a predefined state machine, check the accuracy of data in real time and alarm in time.

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims

1. A data warehouse model is used for constructing tables in a data warehouse aiming at various data records extracted from various data sources, wherein the various data records extracted from the various data sources are stored in the data warehouse in a mode that different versions of data are stored in a new data record mode and historical version data records are still stored; the data warehouse model includes business fields corresponding to the data sources, respectively; the method is characterized in that: the system also comprises an agent key and an operation moment, wherein the agent key is used for uniquely identifying each data record in the data warehouse; the operation time is used for identifying the starting time of the state corresponding to each data record in the data warehouse.

2. The data warehouse model of claim 1, wherein: the data warehouse model also comprises an operation type, wherein the operation type is used for identifying the change of the business fields corresponding to different business data in each data record under the same natural key and at the adjacent operation time in the data warehouse.

3. The data warehouse model of claim 2, wherein: the data warehouse model also comprises a front record field, and the front record field is used for identifying the surrogate key of the data record under the same natural key and adjacent to the previous operation moment in the data warehouse.

4. A data warehouse model as claimed in claim 3, wherein: the data warehouse model also comprises a state automatic ending time field, wherein the state automatic ending time field is used for identifying the time when the states corresponding to the data records in the data warehouse respectively automatically end, and if the states corresponding to the data records do not have the automatic ending time, the state automatic ending time field of the data record is empty or infinite.

5. A data warehouse model as claimed in claim 3, wherein: the data warehouse model also comprises state machines respectively corresponding to the tables, and the state machines are used for identifying the change relationship, the change action and the change condition between different states corresponding to the data records in the corresponding tables.

6. An application of the data warehouse model as claimed in claim 5, comprising a state duration obtaining method for obtaining duration of the state corresponding to the target data record in the data warehouse, the following are performed:

if the next data record of the adjacent time sequence with the same natural key exists in the data warehouse corresponding to the target data record, and the operation time of the next data record of the adjacent time sequence is less than the operation time of the target data record, the time difference from the operation time in the target data record to the operation time in the next data record of the adjacent time sequence is the duration of the state corresponding to the target data record; if the next data record of the same natural key does not exist after the target data record and the automatic ending time of the state in the target data record is null or infinite, the time difference between the operating time and the current time in the target data record is the duration of the state corresponding to the target data record;

7. The application of the data warehouse model as claimed in claim 6, comprising a latest data record obtaining method for obtaining the latest data record corresponding to the target operation type under the target natural key in the data warehouse, and performing the following steps: firstly, acquiring various data records corresponding to target operation types under target natural keys in a data warehouse; and then selecting the data record corresponding to the last operation moment, namely the latest data record corresponding to the target operation type under the target natural key.

8. The application of the data warehouse model as claimed in claim 6, comprising a target class state statistics duration obtaining method for achieving duration statistics of target class states under target natural keys in the data warehouse, the following steps are performed:

9. The application of the data warehouse model according to claim 6, comprising a status checking method, configured to implement checking, periodically or operating on the table, of the status corresponding to the data records in each table in the data warehouse, as follows: