CN115510021B

CN115510021B - Method and system for constructing data warehouse standard layer

Info

Publication number: CN115510021B
Application number: CN202210749186.5A
Authority: CN
Inventors: 杨立才; 邵宏力; 胡超; 刘磊; 李云; 邓知知
Original assignee: Jiangsu Kunshan Rural Commercial Bank Co ltd
Current assignee: Jiangsu Kunshan Rural Commercial Bank Co ltd
Priority date: 2022-06-29
Filing date: 2022-06-29
Publication date: 2023-12-22
Anticipated expiration: 2042-06-29
Also published as: CN115510021A

Abstract

The invention relates to a method and a system for constructing a data warehouse standard layer. The standard layer comprises a table model and a field model. For each table in the database, it is determined whether the table is an island table, and each non-island table is put into the standard layer as a table model; an island table is a table that has no foreign key relationship with any other table. For each field of each table in the database, it is determined whether the field is a master data field. When the field is a master data field, it is put into the standard layer; when it is not, it is put into the standard layer if the fill rate in its field characteristics is greater than a threshold and the field is not a default-value field. When the analyzed data type is inconsistent with the original type and the proportion of data judged to be of the analyzed type is 100%, a type conversion is recommended; if the field is a code value field, configuring a code value conversion rule is recommended. The invention improves the degree of data standardization.

Description

Method and system for constructing data warehouse standard layer

Technical Field

The invention belongs to the technical field of business intelligence, and particularly relates to a method and a system for constructing a data warehouse standard layer.

Background

In building data systems such as data warehouses, data management platforms and data lakes, data must be standardized before it is loaded. In the traditional approach, data standardization relies on manually inspecting the PDM files, table comments, field contents and the like of a database to judge whether each field of each table needs to be standardized, and how. This approach depends heavily on manual work. Standardization becomes very difficult when field names, table names and field code values are not named in a standard way, when explanations of the names are missing, when personnel do not understand the data structures and relationships, when PDM files or explanatory documents are missing, or when personnel do not understand the business processes of the organization. In particular, when the organization's systems are complex and the system tables hold large amounts of data, a great deal of labor is needed to identify and judge the data, and problems such as incomplete standardization and omissions in implementing the standards still remain.

Disclosure of Invention

The invention provides a method and a system for constructing a data warehouse standard layer.

To solve the technical problems in the prior art, the invention provides a method for constructing a data warehouse standard layer, wherein the standard layer comprises a table model and a field model, and the method comprises the following steps:

for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model; an island table is a table that has no foreign key relationship with any other table;

for each field of each table in the database, determining whether the field is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than a threshold (for example, 2%) and the field is not a default-value field;

when the analyzed data type of a field is inconsistent with its original type and the proportion of data judged to be of the analyzed type is 100%, recommending a type conversion; for example, when the original type is text and all of the stored data (100%) are found to be floating-point numbers (that is, decimals), conversion to the more precise floating-point type is recommended; if the field is a code value field, configuring a code value conversion rule is recommended. The code values referred to here are enumerated values.
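The field selection and recommendation rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `FieldProfile` class, its attribute names and the helper functions are hypothetical, the 2% threshold is the example value given in the text, and treating the type recommendation and the code value recommendation as two independent checks is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FieldProfile:
    """Hypothetical container for the field characteristics used by the rules."""
    name: str
    is_master_data: bool
    fill_rate: float            # share of non-null rows, 0.0 to 1.0
    is_default_value: bool      # True if the field is a default-value field
    original_type: str
    analyzed_type: str
    analyzed_type_ratio: float  # share of values matching analyzed_type
    is_code_value: bool


def float_share(values):
    """Share of non-null values that parse as floating-point numbers."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    parsed = 0
    for v in non_null:
        try:
            float(v)
            parsed += 1
        except (TypeError, ValueError):
            pass
    return parsed / len(non_null)


def enters_standard_layer(f, threshold=0.02):
    """Master data fields always enter; others need fill rate > threshold."""
    if f.is_master_data:
        return True
    return f.fill_rate > threshold and not f.is_default_value


def recommendations(f):
    """Return the recommended actions for a field (possibly empty)."""
    recs = []
    # A conversion is recommended only when 100% of the data fit the analyzed type.
    if f.analyzed_type != f.original_type and f.analyzed_type_ratio == 1.0:
        recs.append(f"convert {f.original_type} to {f.analyzed_type}")
    if f.is_code_value:
        recs.append("configure code value conversion rule")
    return recs
```

For instance, a text field whose non-null values all parse as decimals would pass the fill-rate test and receive a recommendation to convert from text to float.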

As a preferred embodiment, whether each table in the database is an island table is determined through a table-level knowledge graph. The table-level knowledge graph displays the tables and the foreign key relationships among them as a visual graph structure; it comprises nodes and edges, where each node represents a table and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by checking whether its node has edges in the table-level knowledge graph: when a node has no edge to any other node, the table represented by that node is an island table.
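The island test on the table-level knowledge graph amounts to finding nodes with no edge to any other node. A minimal Python sketch follows; the function name, the edge representation as `(pk_table, fk_table)` pairs and the table names in the usage note are hypothetical, and ignoring self-referencing edges is an assumption drawn from the phrase "no edge to any other node".

```python
def find_island_tables(tables, fk_edges):
    """Return the tables that have no foreign key edge to any other table.

    tables   -- iterable of table names (graph nodes)
    fk_edges -- iterable of (pk_table, fk_table) pairs (directed FK edges)
    """
    connected = set()
    for pk_table, fk_table in fk_edges:
        # A self-reference is not an edge to another node, so it is skipped.
        if pk_table != fk_table:
            connected.add(pk_table)
            connected.add(fk_table)
    return [t for t in tables if t not in connected]
```

With tables `customer`, `loan_contract` and `bank_info` and a single edge from `customer` to `loan_contract`, only `bank_info` would be reported as an island table and left out of the table model.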

As a preferred embodiment, whether a field of a table in the database is a master data field is determined through a field-level knowledge graph. The field-level knowledge graph displays the fields of each table and the inter-table relationships as a visual graph structure; it comprises nodes and edges, where each node represents a field and each edge represents a relationship between fields. The inter-table relationships are embodied as relationships between fields from different tables and comprise at least foreign key relationships, data equality and data null-removed equality. To determine master data fields, pairs of fields whose inter-table relationship is a foreign key relationship, data equality or data null-removed equality are found through the field-level knowledge graph; when the original data of the two fields come from different business systems, both fields are taken as master data fields.
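The master data rule can be illustrated with a short Python sketch. The edge triples, the relation labels `"FK"`, `"equality"` and `"same"`, the field identifiers and the `source_system` mapping are all hypothetical stand-ins for whatever the field-level knowledge graph actually stores.

```python
# Relation types that qualify a field pair for the master data check
# (labels are illustrative).
MASTER_RELATION_TYPES = {"FK", "equality", "same"}


def find_master_data_fields(edges, source_system):
    """Return the fields that qualify as master data fields.

    edges         -- iterable of (field_a, field_b, relation) triples
    source_system -- dict mapping each field to its business system number
    """
    masters = set()
    for field_a, field_b, relation in edges:
        if relation not in MASTER_RELATION_TYPES:
            continue
        # Both fields qualify only when they come from different systems.
        if source_system[field_a] != source_system[field_b]:
            masters.update((field_a, field_b))
    return masters
```

A pair linked by a qualifying relation but originating in the same business system is deliberately not treated as master data in this sketch.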

As a preferred embodiment, the table-level knowledge graph is obtained as follows: acquiring, for each table in the database, the business system it comes from, its table name and the field names in it; for each table, analyzing the characteristics of each field according to the values of the fields in the table; for each table, calculating the intra-table functional dependencies among its fields according to the table name, the field names and the field values; for each table, identifying its primary key according to the intra-table functional dependencies, searching other tables for the corresponding foreign keys according to the characteristics of the primary key, and forming foreign key relationships between the primary key and the foreign keys; and displaying the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

As a preferred embodiment, the inter-table relationships in the field-level knowledge graph are obtained as follows: determining, through the intra-table functional dependencies, the table A to which a foreign key belongs, finding the closure of the foreign key field, and de-duplicating the fields in the closure to form a temporary table B whose primary key is the foreign key field; through the foreign key relationship, inner-joining the table C where the primary key is located, as the left table, with the temporary table B, as the right table, to form a new temporary table D; and comparing the values, in the temporary table D, of the fields from table A and table C to obtain the following inter-table relationships:

data equality: the two columns of data, in the temporary table D, of a field from table A and a field from table C are completely equal;

data null-removed equality: the two columns of data, in the temporary table D, of a field from table A and a field from table C are equal after null values are removed.

The invention also provides a system for constructing a data warehouse standard layer, comprising: a processor; a database storing tables; and a memory in which a program is stored,

wherein when the processor executes the program, the following operations are performed:

for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model, an island table being a table that has no foreign key relationship with any other table; for each field of each table in the database, determining whether the field is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than a threshold (for example, 2%) and the field is not a default-value field; when the analyzed data type is inconsistent with the original type and the proportion of data judged to be of the analyzed type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured.

Compared with the prior art, the invention has the remarkable advantages that:

(1) The invention exports all data files from a plurality of upstream business systems and loads them into a big data platform. Using the mass storage and computing capacity of the big data platform, it calculates and analyzes all table data of all business systems to obtain the primary key-foreign key relationships and functional dependencies between each table and the other tables, and standardizes each table and each field according to these relationships;

(2) In building a data warehouse or data lake system, the method and system can directly standardize all tables and fields that need to enter the data warehouse or data lake, without relying on manual identification and judgment, without requiring personnel to understand the data tables, relationships, field contents and field code values, and without investing costly human resources. This improves the efficiency and degree of data standardization and ensures that the data entering the data warehouse or data lake is uniform.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

Drawings

Fig. 1 is a schematic flow chart of one embodiment of the present invention.

Fig. 2 is a schematic diagram showing a specific flow of step 300 in fig. 1.

Fig. 3 is a simplified schematic diagram of a field level knowledge graph.

Fig. 4 is a field level knowledge graph overview schematic.

Fig. 5 is a detailed schematic diagram of a field level knowledge graph section.

Fig. 6 is another detailed schematic diagram of a field level knowledge graph.

Detailed Description

It is easy to understand that various embodiments of the present invention can be envisioned by those of ordinary skill in the art without altering the true spirit of the present invention in light of the present teachings. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit or restrict the invention. Rather, these embodiments are provided so that this disclosure will be thorough and complete by those skilled in the art. Preferred embodiments of the present invention are described in detail below with reference to the attached drawing figures, which form a part of the present application and are used in conjunction with embodiments of the present invention to illustrate the innovative concepts of the present invention.

The method and the system for constructing the data warehouse standard layer can finish the construction of the data warehouse standard layer only by obtaining the key characteristic information through data analysis.

The invention relates to a method for constructing a standard layer of a data warehouse, wherein the standard layer comprises a table model and a field model;

for each field of each table in the database, it is determined whether the field is a master data field; when the field is a master data field, the field is put into the standard layer; when the field is not a master data field, the field is put into the standard layer if the fill rate in its field characteristics is greater than 2% and the field is not a default-value field;

when the analyzed data type is inconsistent with the original type and the proportion of data judged to be of the analyzed type is 100%, a type conversion is recommended; if the field is a code value field, configuring a code value conversion rule is recommended.

In the invention, preferably, whether each table in the database is an island table is determined through a table-level knowledge graph. The table-level knowledge graph displays the tables and the foreign key relationships among them as a visual graph structure; it comprises nodes and edges, where each node represents a table and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by checking whether its node has edges in the table-level knowledge graph: when a node has no edge to any other node, the table represented by that node is an island table. The table-level knowledge graph is obtained as follows: acquiring, for each table in the database, the business system it comes from, its table name and the field names in it; for each table, analyzing the characteristics of each field according to the values of the fields in the table; for each table, calculating the intra-table functional dependencies among its fields according to the table name, the field names and the field values; for each table, identifying its primary key according to the intra-table functional dependencies, searching other tables for the corresponding foreign keys according to the characteristics of the primary key, and forming foreign key relationships between the primary key and the foreign keys; and displaying the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

In the invention, preferably, whether a field of a table in the database is a master data field is determined through a field-level knowledge graph. The field-level knowledge graph displays the fields of each table and the inter-table relationships as a visual graph structure; it comprises nodes and edges, where each node represents a field and each edge represents a relationship between fields. The inter-table relationships are embodied as relationships between fields from different tables and comprise at least foreign key relationships, data equality and data null-removed equality. To determine master data fields, pairs of fields whose inter-table relationship is a foreign key relationship, data equality or data null-removed equality are found through the field-level knowledge graph; when the original data of the two fields come from different business systems, both fields are taken as master data fields. The inter-table relationships in the field-level knowledge graph are obtained as follows: determining, through the intra-table functional dependencies, the table A to which a foreign key belongs, finding the closure of the foreign key field, and de-duplicating the fields in the closure to form a temporary table B whose primary key is the foreign key field; through the foreign key relationship, inner-joining the table C where the primary key is located, as the left table, with the temporary table B, as the right table, to form a new temporary table D; and comparing the values, in the temporary table D, of the fields from table A and table C to obtain the following inter-table relationships: data equality, where the two columns of data are completely equal; and data null-removed equality, where the two columns of data are equal after null values are removed.

The invention also provides a system for constructing a data warehouse standard layer, comprising: a processor; a database storing tables; and a memory in which a program is stored, wherein when the processor executes the program, the following operations are performed: for each table in the database, determining whether the table is an island table, and putting each non-island table into the standard layer as a table model, an island table being a table that has no foreign key relationship with any other table; for each field of each table in the database, determining whether the field is a master data field; when the field is a master data field, putting the field into the standard layer; when the field is not a master data field, putting the field into the standard layer if the fill rate in its field characteristics is greater than 2% and the field is not a default-value field; when the analyzed data type is inconsistent with the original type and the proportion of data judged to be of the analyzed type is 100%, recommending a type conversion; and if the field is a code value field, recommending that a code value conversion rule be configured.

The method and system for constructing a data warehouse standard layer according to the invention are described in detail below with reference to one specific embodiment. In practice, to make it easy to store the results calculated at each step, a series of tables is created in the computing system to hold the result data of each step. Other tools, such as text documents, may of course also be used to store the calculation results of the respective steps. As one example, the following series of data tables may be used to store the calculation results of each step in the process of constructing the data warehouse standard layer:

Table 1: table list table TABLES_LIST;

Table 2: field list table COLUMNS_LIST;

Table 3: master data field information table master_data_info;

Table 4: field characteristic information table column_field_info;

Table 5: standard layer model table.

The table templates so constructed can be placed in advance in a storage device of the system. As shown in Fig. 1, the method for constructing a standard layer of a data warehouse in this embodiment includes the following steps:

S100, acquiring the table names of the data tables used for constructing the data warehouse, and storing the table names in the table list table TABLES_LIST.

The list of all tables is read from the database by the table data reading device, and the table names are stored in the table list template preset in the storage device to form the full database table list table TABLES_LIST shown in Table 1. If the tables come from different business systems, the method further comprises obtaining the business system number of each table.

Table 1 shows a list of all tables read from the database.

Table 1: table list table TABLES_LIST (partial example)

SYS_CODE	TABLE_CODE	COMMENT
S03	ods.ods_s03_acc_accp	Silver-colored cushion cap account
S03	ods.ods_s03_ctr_loan_cont	Contract master form
S03	ods.ods_s03_prd_bank_info	Bank information
S55	ods.ods_s55_bt_discount_batch	Post buying batch
S58	ods.ods_s58_m_ci_customer	Customer basic information table
S58	ods.ods_s58_m_ci_person	Personal customer information master table
S57	ods.ods_s57_tb_fss_transbook	Transfer information flow table

The meanings of the items in Table 1 are as follows:

SYS_CODE is the business system number. A business system is one of the working systems used by an organization; for example, a bank may have a loan system, an agency payroll system and so on, and the data of these business systems are stored in the data warehouse in the form of tables.

TABLE_CODE is the English name of the table in the data warehouse.

COMMENT is the Chinese name of each table. The Chinese names are shown in the list only for ease of illustration; in practice this column need not be included.

S200, obtaining the fields of each table and storing them in the field list table COLUMNS_LIST.

The table data reading device acquires the field information of each table from the data tables stored in the data warehouse, and stores the field information in the field list template preset in the storage device to form the field list table COLUMNS_LIST. Part of the field list table is shown in Table 2.

Table 2: field list table COLUMNS_LIST (partial example)

SYS_CODE	TABLE_CODE	COL_NUM	COL_CODE	COMMENT
S58	ods.ods_s58_m_ci_person	1	cust_no	Customer number
S58	ods.ods_s58_m_ci_person	2	cust_name	Customer name
S58	ods.ods_s58_m_ci_person	3	cust_eng_name	English name of customer
S58	ods.ods_s58_m_ci_person	4	py_name	Pinyin name

The meanings of the items in Table 2 are as follows:

SYS_CODE is the business system number.

TABLE_CODE is the English name of the table in the data warehouse.

COL_NUM is the field number.

COL_CODE is the field name.

COMMENT is the Chinese name of each field. The Chinese names are shown in the list only for ease of illustration; in practice this column need not be included.

S300, obtaining a table-level knowledge graph and a field-level knowledge graph.

As shown in fig. 2, the present step specifically includes the following steps:

S301, for each table, analyzing the characteristics of each field according to the values of the fields in the table.

The features include qualitative and quantitative features; the qualitative features may include the data type of the field, and the quantitative features may include the length of the field.

In this embodiment, the qualitative features of a field are obtained by the following qualitative analysis of the values of the field (that is, the data in the field):

COL_TYPE is the data type of the field, such as strings of different storage lengths, text, numeric values, dates, times and so on.

COL_NULLABLE indicates whether the field is nullable. It is a qualitative feature of the field and is optional; in some embodiments it may be omitted.

COL_PK indicates whether the field is a primary key and is a qualitative feature of the field. This feature cannot be obtained at this step; it is recorded in the basic meta information and qualitative feature record table of the field (Table 4) after the foreign keys are obtained in a subsequent step.

COL_AUTOINCRE indicates whether the field is an auto-increment field. It is a qualitative feature of the field and is optional; in some embodiments it may be omitted.

COL_DEFULT indicates whether the field is a default-value field. It is a qualitative feature of the field and is optional; in some embodiments it may be omitted.

CODE_VALUE_FLG indicates whether the field is a code value field. It is a qualitative feature of the field and is optional; in some embodiments it may be omitted.

In this embodiment, the quantitative features comprise:

COL_RECORDS is the number of rows of the field and is an index feature.

COL_DISTINCT is the number of rows of the field after de-duplication and is an index feature.

COL_NOTONULL is the number of rows whose field value is not NULL and is an index feature. It is optional; in some embodiments it may be omitted.

Of course, not all of the qualitative and quantitative features previously described are required in the present invention.
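As an illustration of how such quantitative features might be computed from the raw values of one field, consider the following Python sketch. It is not taken from the patent: the function name and the keys `NOT_NULL_ROWS`, `FILL_RATE`, `MIN_LEN` and `MAX_LEN` are hypothetical, NULL is modelled as Python `None`, and counting distinct values over non-null rows only is an assumption.

```python
def profile_field(values):
    """Compute simple quantitative features for one field (column)."""
    records = len(values)
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    return {
        "COL_RECORDS": records,              # total number of rows
        "COL_DISTINCT": len(set(non_null)),  # rows left after de-duplication
        "NOT_NULL_ROWS": len(non_null),      # rows whose value is not NULL
        # Fill rate: share of rows that carry a value.
        "FILL_RATE": len(non_null) / records if records else 0.0,
        "MIN_LEN": min(lengths) if lengths else 0,
        "MAX_LEN": max(lengths) if lengths else 0,
    }
```

The fill rate derived here is the quantity compared against the threshold (for example, 2%) when deciding whether a non-master-data field enters the standard layer, and the minimum and maximum lengths are the kind of length features used later when matching foreign key candidates.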

S302, for each table, calculating the functional dependencies among the fields of the same table according to the table name, the field names and the field values; these are called intra-table functional dependencies.

Many methods in the prior art can calculate functional dependencies, and this embodiment does not elaborate on them. Briefly, for ease of understanding, a functional dependency consists of determinant fields, from which values are derived, and dependent fields, whose values are derived. For example, in the table prd_bank_info the determinant field is the bank code bank_no and the dependent field is the bank name. The intra-table functional dependency can thus be understood as: the bank name can be derived from the bank code bank_no, or in other words the bank name depends on the bank code bank_no.
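Since the embodiment leaves the calculation method open, the following Python sketch shows only the basic check that a candidate functional dependency holds on the data: every value of the determinant fields must map to exactly one value of the dependent field. The function name and the column name `bank_name` are hypothetical; this is a naive data-driven test, not a full dependency discovery algorithm.

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the fields in lhs functionally determine the field rhs.

    rows -- list of dicts, one per table row
    lhs  -- tuple of determinant field names
    rhs  -- dependent field name
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # setdefault stores the first value seen for this key; any later
        # row with the same key but a different value violates the FD.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True
```

On rows where each `bank_no` always appears with the same `bank_name`, `holds_fd(rows, ("bank_no",), "bank_name")` returns `True`; the reverse direction fails as soon as two bank codes share one name.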

S303, identifying the primary key of each table according to the intra-table functional dependencies.

Various methods exist in the prior art for obtaining the primary key of a table, and this embodiment does not elaborate on them. The invention preferably finds the primary key using the candidate key set method; the primary key may be one or more candidate keys.
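The text does not spell out the candidate key set method, so the sketch below uses the generic textbook approach instead: compute the attribute closure under the known functional dependencies and enumerate minimal attribute sets whose closure covers the whole table. The function names and the example field `city` are hypothetical, and the brute-force enumeration is only practical for small numbers of fields.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) tuples."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(all_attrs, fds):
    """Enumerate minimal attribute sets whose closure is the full table."""
    attrs = sorted(all_attrs)
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            # Skip supersets of keys already found, so results stay minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if closure(combo, fds) == set(attrs):
                keys.append(combo)
    return keys
```

For a table with fields `bank_no`, `bank_name` and `city` and the single dependency `bank_no -> bank_name, city`, the only candidate key found is `("bank_no",)`.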

S304, searching other tables for the corresponding foreign keys according to the characteristics of the primary key, and forming foreign key relationships between the primary key and the foreign keys.

Various methods exist in the prior art for obtaining foreign key relationships, and this embodiment does not elaborate on them.

The invention preferably obtains the foreign key relationships in the following way:

Fields in other tables that match the data type and field length of the primary key are taken as foreign keys. A field matches when its data type is the same as that of the primary key, its minimum length is greater than or equal to the minimum length of the primary key, and its maximum length is less than or equal to the maximum length of the primary key.

The fields that match the data type and field length of the primary key may be filtered further, for example by:

traversing the primary keys in sequence and, for each primary key, generating a corresponding Bloom filter from its values by hashing;

comparing the values of each field that matches the data type and field length of the primary key against the Bloom filter of that primary key, and taking the field as a finally determined foreign key when the data coincidence rate is greater than a preset threshold.
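The two-stage foreign key search above (type and length matching, then Bloom filter comparison) can be sketched in Python as follows. Everything named here is illustrative: the `BloomFilter` class, its size and hash count, the dictionary keys `type`, `min_len` and `max_len`, and the 0.9 threshold are assumptions, since the text only speaks of "a preset threshold". The coincidence rate is taken to be the share of non-null candidate values that the filter reports as present, which is also an assumption.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative choices."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, value):
        data = str(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def matches_type_and_length(pk, col):
    """Same data type, and the column's length range lies within the key's."""
    return (col["type"] == pk["type"]
            and col["min_len"] >= pk["min_len"]
            and col["max_len"] <= pk["max_len"])


def coincidence_rate(pk_values, candidate_values):
    """Share of non-null candidate values found in the primary key's filter."""
    bloom = BloomFilter()
    for v in pk_values:
        bloom.add(v)
    non_null = [v for v in candidate_values if v is not None]
    if not non_null:
        return 0.0
    hits = sum(1 for v in non_null if v in bloom)
    return hits / len(non_null)


def is_foreign_key(pk, pk_values, col, col_values, threshold=0.9):
    """Apply the cheap metadata match first, then the Bloom filter check."""
    return (matches_type_and_length(pk, col)
            and coincidence_rate(pk_values, col_values) > threshold)
```

A Bloom filter never reports a stored value as absent but may occasionally report an absent value as present, so the measured coincidence rate can slightly overestimate the true overlap; the trade-off is that primary key values need not be held in memory in full.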

S305, displaying the tables and the foreign key relationships among them as a visual graph structure, which serves as the table-level knowledge graph.

After the foreign key relationships are obtained, the tables in the database and the foreign key relationships among them are stored as a graph structure in a graph database preset in the storage device, forming a visual table-level knowledge graph that is convenient to query.

The table-level knowledge graph is shown in Fig. 3. It contains one type of node and one type of edge. Each circular node represents a table and stores information about that table, including its basic meta information and related characteristic information such as its English name, number of fields, table comment (Chinese name) and table number. Apart from the English table name, these items are optional additions, and a node may or may not store them. The table-level knowledge graph contains only one kind of relationship, the foreign key relationship, shown as an arrowed edge connecting two nodes and labeled FK. Each edge is directed: the node it starts from is the table to which the primary key belongs, and the node the arrow points to is the table to which the foreign key belongs. Each edge also stores foreign key relationship information, such as the English name of the primary key field, the English name of the foreign key field, and the coincidence rate between the primary key and the foreign key. Preferably, because a foreign key may be a composite foreign key, the primary key fields and the foreign key fields are each stored on the edge as a list, and fields with the same index are associated with each other, so that the field mapping of a composite foreign key is stored completely.
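One way to picture the information held on a foreign key edge, including the index-aligned field lists for composite foreign keys, is the small Python data structure below. The class and attribute names are hypothetical and no particular graph database is implied.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FkEdge:
    """Directed FK edge: from the primary key table to the foreign key table."""
    pk_table: str
    fk_table: str
    pk_fields: List[str]      # primary key field names
    fk_fields: List[str]      # foreign key field names, aligned by index
    coincidence_rate: float   # overlap between primary and foreign key values

    def field_mapping(self) -> List[Tuple[str, str]]:
        """Pair each primary key field with the foreign key field at the same index."""
        if len(self.pk_fields) != len(self.fk_fields):
            raise ValueError("pk_fields and fk_fields must have equal length")
        return list(zip(self.pk_fields, self.fk_fields))
```

Storing the two lists in parallel keeps a composite key such as `(org_no, cust_no)` mapped position by position onto its counterpart fields, which is the association by index described above.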

S306, calculating the inter-table relationships.

In the invention, the inter-table relationships are embodied as relationships between fields from different tables, including functional dependencies, data equality and data null-removed equality between fields in different tables. The invention refers to a functional dependency between fields in different tables as an inter-table functional dependency. Inter-table functional dependencies in turn include unidirectional and bidirectional dependencies. Thus, in the invention, the inter-table relationships comprise four relationships:

unidirectional dependency;

bidirectional dependency;

data equality;

data null-removed equality.

The inter-table relationships supplement the primary key-foreign key relationships, greatly enriching the relationships between tables and enabling more functions.

The calculation method of the relation among the four tables comprises the following steps: for the primary and foreign keys in the foreign key relationship,

firstly, selecting a table A to which an external key belongs through a function dependency relationship in the table, finding an external key field (comprising a joint external key) and a closure of the external key field, and in the current closure, because all other fields in the closure can be pushed out through the external key, removing the weight of the fields in the closure to form a temporary table B with the external key field as a main key;

Secondly, taking a table C where the main key is positioned as a left table, taking a temporary table B as a right table, and connecting to form a new temporary table D, wherein the fields in the temporary table D are actually from the tables A and C;

in the temporary table D, calculating the intra-table function dependency relationship of each segment in the temporary table D, wherein the intra-table function dependency relationship of the temporary table D is the inter-table function dependency relationship of the table A and the table C;

finally, the following table relationship is obtained by comparing the data of the field values in the table A to which the foreign key belongs and the table C to which the primary key belongs:

(1) Unidirectional dependency: a field of table A and a field of table C have a unidirectional dependency in temporary table D. The relationship type is recorded as fd. This embodiment stores only dependencies between single fields.

(2) Bidirectional dependency: a field of table A and a field of table C have a bidirectional dependency in temporary table D, that is, the values of the two fields correspond one to one. The relationship type is recorded as bfd. This embodiment stores only dependencies between single fields.

(3) Data equality: the two columns of data for a field of table A and a field of table C are completely equal in temporary table D, so the data can be considered to have a strong association or redundancy relationship. The relationship type is recorded as equality.

(4) Data null-removal equality: the two columns of data for a field of table A and a field of table C are equal in temporary table D after null values are removed, so the data can be considered to have a weak association or redundancy relationship. The relationship type is recorded as same.
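The four comparisons above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `determines` and `classify_pair` are hypothetical, each argument is one column of temporary table D given as a list, NULL is represented by `None`, and "null-removal equality" is assumed to mean that rows in which either value is NULL are dropped before comparing. The source does not say whether the relationship types are mutually exclusive, so the sketch returns every type that holds.

```python
from typing import Hashable, List, Sequence


def determines(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> bool:
    """Return True if each value in xs maps to exactly one value in ys (xs -> ys)."""
    seen = {}
    for x, y in zip(xs, ys):
        # setdefault stores y the first time x is seen, then returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True


def classify_pair(left: Sequence[Hashable], right: Sequence[Hashable]) -> List[str]:
    """Return the relationship types that hold between two columns of table D."""
    labels: List[str] = []
    if list(left) == list(right):
        labels.append("equality")
    else:
        # Assumption: drop rows where either side is NULL, then compare the rest
        pairs = [(l, r) for l, r in zip(left, right)
                 if l is not None and r is not None]
        if pairs and all(l == r for l, r in pairs):
            labels.append("same")
    left_to_right = determines(left, right)
    right_to_left = determines(right, left)
    if left_to_right and right_to_left:
        labels.append("bfd")
    elif left_to_right:
        labels.append("fd")  # right depends on left only
    return labels
```

Calling `classify_pair(right, left)` with the arguments swapped checks the dependency in the opposite direction.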

S307: the foreign key relationships, the intra-table functional dependencies and the inter-table relationships are displayed in a visual graph structure as the field-level knowledge graph.

In this step, the fields are connected by the foreign key relationships, the intra-table functional dependencies and the inter-table relationships, stored in a graph database preset in a storage device, and displayed as the field-level knowledge graph in a visual graph structure. An overview of the field-level knowledge graph is shown in fig. 4. The field-level knowledge graph contains one type of node and seven types of edges. Each round node represents a field and stores information about that field, including the table name, English field name, service system number, field number, Chinese name, analyzed data type, analyzed field length, whether the field is nullable, whether the field is a primary key, whether the field is an auto-increment field, whether the field is a default value, the field type judgment data proportion, whether the field contains Chinese, the proportion of Chinese data, whether the field is a code value field, the number of rows, the number of rows after de-duplication, the maximum length, minimum length, average length, length variance and median length of the field, the number of rows with non-NULL values, and the like. Of these items, all except the table name and the English field name are preferred rather than required, and items may be added or removed according to actual requirements. Because fig. 4 is limited in size, it shows only part of the field-level knowledge graph and cannot show all seven edge types, so the local detail views in figs. 5 and 6 are used to illustrate the graph further. Note that figs. 5 and 6, like fig. 4, each show part of the field-level knowledge graph; they are not enlargements of part of fig. 4.

The seven edge types are:

(1) Foreign key: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled FK, denoting a foreign key relationship. Each edge is directed: the starting node is the primary key and the node the arrow points to is the foreign key. Each edge also stores the analyzed information, mainly the coincidence rate of the primary and foreign keys.

(2) Joint foreign key: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled JFK, denoting a joint foreign key relationship. Because several fields are associated, a joint key made up of several fields is shown as the same number of edges; for example, if a joint primary key consists of 3 fields, the joint foreign key generates 3 edges. Each edge is directed: the starting node belongs to the table of the primary key and the node the arrow points to belongs to the table of the foreign key. Each edge also stores the analyzed information, mainly the coincidence rate of the primary and foreign keys.

(3) Intra-table functional dependency: shown in the figures as an edge connecting two nodes and labeled FD, denoting an intra-table functional dependency. Because intra-table functional dependencies are usually complex, only the relationships with fd_level equal to 1 in the functional dependency record table are selected in fig. 5 or fig. 6 to generate these edges. Each edge is directed: the starting node is the field in left_column of the functional dependency record table and the node the arrow points to is the field in the corresponding right_column, indicating that right_column depends on left_column.

(4) Unidirectional inter-table functional dependency: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled EXFD, denoting a unidirectional inter-table functional dependency. Rows with REL_TYPE equal to fd in the inter-table multiple-relationship record table are converted into this relationship. Each edge is directed: the starting node is the field in LEFT_COL_CODE of the inter-table multiple-relationship record table and the node the arrow points to is the field in the corresponding RIGHT_COL_CODE, indicating that RIGHT_COL_CODE depends on LEFT_COL_CODE.

(5) Bidirectional inter-table functional dependency: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled EXBFD, denoting a bidirectional inter-table functional dependency. Rows with REL_TYPE equal to bfd in the inter-table multiple-relationship record table are converted into this relationship. Each edge is undirected (the figures draw it as a directed edge because of a limitation of the graph database in the storage device; it is treated as undirected in actual calculation). The edge is drawn from the field in LEFT_COL_CODE of the inter-table multiple-relationship record table to the field in the corresponding RIGHT_COL_CODE, indicating that RIGHT_COL_CODE and LEFT_COL_CODE depend on each other.

(6) Inter-table data equality: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled equalils, denoting an inter-table data equality relationship. Rows with REL_TYPE equal to equality in the inter-table multiple-relationship record table are converted into this relationship. Each edge is undirected (the figures draw it as a directed edge because of a limitation of the graph database in the storage device; it is treated as undirected in actual calculation). The edge is drawn from the field in LEFT_COL_CODE of the inter-table multiple-relationship record table to the field in the corresponding RIGHT_COL_CODE, indicating that the data of RIGHT_COL_CODE and LEFT_COL_CODE are equal.

(7) Inter-table data null-removal equality: shown in fig. 5 or fig. 6 as an edge connecting two nodes and labeled SAME, denoting an inter-table data null-removal equality relationship. Rows with REL_TYPE equal to same in the inter-table multiple-relationship record table are converted into this relationship. Each edge is undirected (the figures draw it as a directed edge because of a limitation of the graph database in the storage device; it is treated as undirected in actual calculation). The edge is drawn from the field in LEFT_COL_CODE of the inter-table multiple-relationship record table to the field in the corresponding RIGHT_COL_CODE, indicating that the data of RIGHT_COL_CODE and LEFT_COL_CODE are equal after null values are removed.

Of course, in the present invention, not all of the relationships between tables in the field level knowledge graph described above need be used.
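As a rough illustration of how rows of the inter-table multiple-relationship record table might be turned into edge types (4) to (7), consider the following Python sketch. The `Edge` type, the `rows_to_edges` function and the dictionary-based row format are hypothetical. The edge label for data equality is not clearly legible in the source, so `EQUALITY` below is only a placeholder. The other labels and the REL_TYPE values are taken from the description above.

```python
from typing import Dict, List, NamedTuple


class Edge(NamedTuple):
    source: str      # field taken from LEFT_COL_CODE
    target: str      # field taken from RIGHT_COL_CODE
    label: str       # edge label shown in the graph
    directed: bool   # False means the edge is treated as undirected


# REL_TYPE value -> (edge label, treated as directed?)
REL_TYPE_TO_EDGE = {
    "fd": ("EXFD", True),
    "bfd": ("EXBFD", False),
    "equality": ("EQUALITY", False),  # placeholder label, see lead-in
    "same": ("SAME", False),
}


def rows_to_edges(rows: List[Dict[str, str]]) -> List[Edge]:
    """Convert record-table rows into graph edges, skipping unknown REL_TYPE values."""
    edges: List[Edge] = []
    for row in rows:
        mapping = REL_TYPE_TO_EDGE.get(row["REL_TYPE"])
        if mapping is None:
            continue
        label, directed = mapping
        edges.append(Edge(row["LEFT_COL_CODE"], row["RIGHT_COL_CODE"], label, directed))
    return edges
```

Keeping the `directed` flag on the edge lets a graph store that only supports directed edges still record which relationships should be treated as undirected in later calculation.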

S400: acquire all main data fields of the tables in the table list from step 100.

The field-level knowledge graph is used to find pairs of fields whose inter-table relationship is a foreign key relationship, data equality or data null-removal equality. When the original data of the two fields come from different service systems, both fields are taken as main data fields. As described above, the field-level knowledge graph comprises nodes and edges, where each node represents a field and each edge represents a relationship between fields; the foreign key, data equality and data null-removal equality relationships appear as the corresponding edges in the field-level knowledge graph. All fields found in this way are marked as main data fields and recorded in a main data information table preset in the storage device.
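The selection rule above can be sketched in Python as follows. This is an illustrative sketch only: the `Field` type, the `find_master_fields` function and the lower-case relationship names `fk`, `equality` and `same` are assumptions made for the example, and each edge is represented as a tuple of two fields plus the relationship name.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Field(NamedTuple):
    sys_code: str    # service system number
    table_code: str  # table name in the data warehouse
    col_code: str    # field name


# Relationship kinds that can make a pair of fields main data (illustrative names)
MASTER_RELS = {"fk", "equality", "same"}


def find_master_fields(edges: Iterable[Tuple[Field, Field, str]]) -> Set[Field]:
    """Return fields linked by a qualifying edge across different service systems."""
    master: Set[Field] = set()
    for left, right, rel in edges:
        if rel in MASTER_RELS and left.sys_code != right.sys_code:
            master.update((left, right))
    return master
```

An edge between two fields of the same service system, or an edge of another kind such as a functional dependency, does not mark either field as main data under this rule.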

An example of a part of the main data field information table is shown in table 3.

Table 3: Main data field information table MASTER_DATA_INFO (partial example)

SYS_CODE	TABLE_CODE	COL_CODE	MASTER_ID	ORDER
s58	ods.ods_s58_m_ci_org	regi_regis_date	1821	1
s53	ods.ods_s53_vai_cus_com_xd	reg_start_date	1821	2
s03	ods.ods_s03_cus_com	reg_start_date	1821	3
s28	ods.ods_s28_cus_com	reg_start_date	1821	4
s53	ods.ods_s53_vai_cus_com_xd	fina_per_tel	1825	1
s03	ods.ods_s03_cus_com	fina_per_tel	1825	2
s28	ods.ods_s28_cus_com	fina_per_tel	1825	3

The items in Table 3 have the following meanings:

SYS_CODE is the service system number;

TABLE_CODE is the English name of the table in the data warehouse;

COL_CODE is the field name;

MASTER_ID is the main data group number; fields with the same group number belong to the same group, which indicates that the data is shared;

ORDER is the sequence number within the group. Fields are sorted in descending order of their field dimension value (i.e., the number of rows after de-duplication), and a smaller sequence number indicates a more important field.
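The ORDER column can be illustrated with a short Python sketch. The function name `assign_order` and the de-duplicated row counts used in the usage note are invented for illustration and do not come from the source. Ties are left in input order, which is an assumption because the source does not say how ties are broken.

```python
from typing import Dict


def assign_order(distinct_rows: Dict[str, int]) -> Dict[str, int]:
    """Rank the fields of one main data group by descending de-duplicated row count."""
    ranked = sorted(distinct_rows, key=lambda name: distinct_rows[name], reverse=True)
    # Sequence numbers start at 1; a smaller number means a more important field
    return {name: position for position, name in enumerate(ranked, start=1)}
```

For example, with hypothetical counts `{"s03.fina_per_tel": 800, "s53.fina_per_tel": 1200}`, the field from s53 would receive sequence number 1 and the field from s03 sequence number 2.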

S500, acquiring the characteristics of all the fields in the field list through the field level knowledge graph.

The field characteristics generally include:

1) COL_RECORDS: number of rows of the field

2) COL_NOTINULL: number of rows with a non-NULL field value

3) COL_DISTINCT: number of rows of the field after de-duplication

4) COL_TYPE: analyzed data type of the field

5) COL_TYPE_JUDGE_RATE: field type judgment data proportion

6) COL_DEFULT: whether the field is a default value

7) CODE_VALUE_FLG: whether the field is a code value field

8) FILL_RATE: fill rate

Features 1) to 8) are field features derived from the field-level knowledge graph; among them, the fill rate FILL_RATE is calculated as COL_NOTINULL / COL_RECORDS × 100%. These features are recorded in a field feature information table preset in the storage device.

Table 4: Field feature information table COLUMNS_FEATURE_INFO
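The fill rate formula can be written as a small Python function. The function and parameter names are hypothetical, and returning 0.0 for a table with no rows is an assumption, since the source does not describe that case.

```python
def fill_rate(not_null_rows: int, total_rows: int) -> float:
    """Return the fill rate in percent: non-NULL rows / total rows * 100."""
    if total_rows == 0:
        # Assumption: an empty field is treated as having a fill rate of 0%
        return 0.0
    return not_null_rows / total_rows * 100.0
```

For a field with 200 rows of which 50 are non-NULL, the function returns 25.0, that is, a fill rate of 25%.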

Of course, not all of the field features described above need be used in the present invention.

S600: determine, through the table-level knowledge graph, whether each table is an island table.

As described above, the table-level knowledge graph is a knowledge graph that displays each table and the foreign key relationships between tables in a visual graph structure. It comprises nodes and edges, where each node represents a table and each edge represents a foreign key relationship. Whether a table has a foreign key relationship is determined by whether edges exist between nodes in the table-level knowledge graph: when a node has no edge to any other node, the table it represents has no foreign key relationship with any other table and is an island table. Whether each table is an island table is marked in the table list table tabs_list.

Table 1: Table list table tabs_list after labeling (partial example)

SYS_CODE	TABLE_CODE	COMMENT	IS_ISOLATION
S03	ods.ods_s03_acc_accp	Silver-colored cushion cap account	N
S03	ods.ods_s03_ctr_loan_cont	Contract master form	N
S03	ods.ods_s03_prd_bank_info	Bank information	N
S55	ods.ods_s55_bt_discount_batch	Post buying batch	N
S58	ods.ods_s58_m_ci_customer	Customer basic information table	N
S58	ods.ods_s58_m_ci_person	Personal customer information master table	N
S57	ods.ods_s57_tb_fss_transbook	Transfer information flow table	N

In the table, IS_ISOLATION indicates whether the table is an island table, where Y means yes and N means no.
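Island detection on the table-level knowledge graph can be sketched in Python as follows. The function name `mark_island_tables` is hypothetical, the graph is assumed to be given as a list of table names plus a list of foreign key edges between table names, and a self-referencing foreign key is assumed not to count, because the description requires an edge to another node.

```python
from typing import Dict, Iterable, Set, Tuple


def mark_island_tables(tables: Iterable[str],
                       fk_edges: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Return an IS_ISOLATION flag ("Y" or "N") for every table."""
    connected: Set[str] = set()
    for primary_table, foreign_table in fk_edges:
        # Assumption: a self-reference is not an edge to another table
        if primary_table != foreign_table:
            connected.update((primary_table, foreign_table))
    return {table: ("N" if table in connected else "Y") for table in tables}
```

A table that appears in no foreign key edge, or only in an edge to itself, is flagged Y and is therefore left out of the standard layer in the later assembly step.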

S700: form the standard layer from the results of the above steps according to the following assembly rules.

According to the IS_ISOLATION column of Table 1, the tables marked N (i.e., the non-island tables) are added to the standard layer; that is, each non-island table is put into the standard layer as a table model.

The recommended standard layer fields are then determined from Tables 3 and 4:

when a field is a main data field, the suggestion shown for the field on the display device is to retain it, i.e., the field is added to the standard layer;

when a field is not a main data field, it is recommended to be retained, i.e., added to the standard layer, if its fill rate is greater than 2% and it is not a default value; otherwise it is not recommended for the standard layer.
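The assembly rules can be summarized in a short Python sketch. The function names `select_tables` and `recommend_keep` are hypothetical; the 2% threshold is the value used in this embodiment and is exposed as a parameter because the claims refer only to "the threshold".

```python
from typing import Dict, List

FILL_RATE_THRESHOLD = 2.0  # percent, the value used in this embodiment


def select_tables(is_isolation: Dict[str, str]) -> List[str]:
    """Keep the tables whose IS_ISOLATION flag is N (non-island tables)."""
    return [table for table, flag in is_isolation.items() if flag == "N"]


def recommend_keep(is_master: bool, fill_rate: float, is_default: bool,
                   threshold: float = FILL_RATE_THRESHOLD) -> bool:
    """Return True if the field is recommended for the standard layer."""
    if is_master:
        # Main data fields are always retained
        return True
    # Other fields need a fill rate above the threshold and a non-default value
    return fill_rate > threshold and not is_default
```

Note that the comparison is strict: a field that is not main data and has a fill rate of exactly 2% is not recommended.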

The final standard layer model is shown in Table 5.

Table 5: Information on ods.ods_s03_ctr_loan_cont shown on the display device

COL_COL_TYPE is the field English name, SORCE_COL_TYPE is the field original data TYPE, COL_TYPE is the field analysis judging TYPE, and ANALY_INFO is the assembled field analysis result.

The table structures in the above embodiments are merely examples. In actual operation, the columns of each table are not limited to the items shown in the above embodiments and may include other data items.

The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention.

It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes described in the context of a single embodiment or with reference to a single figure in order to streamline the description and help those skilled in the art understand the various aspects of the invention. However, the features included in the exemplary embodiments should not be construed as essential features of the patent claims.

Those skilled in the art will appreciate that all or part of the flow of the methods of the embodiments described above may be accomplished by a computer program instructing the associated hardware, where the program may be stored on a computer-readable storage medium. The computer-readable storage medium is a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.

It should be understood that the devices, modules, units, components, etc. included in the system of one embodiment of the invention may be adaptively changed to arrange them in a device or system different from the embodiment. The system of the embodiments may include different devices, modules, units or components combined into one device, module, unit or component, or they may be divided into a plurality of sub-devices, sub-modules, sub-units or sub-components.

The apparatus, modules, units, or components of embodiments of the invention may be implemented in hardware, in software running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that embodiments in accordance with the present invention may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present invention can also be implemented as a computer program product or a computer readable medium for carrying out a part or all of the methods described herein.

Claims

1. A method of constructing a standard layer of a data warehouse, wherein the standard layer comprises a table model and a field model;

for each table field in the database, determining whether it is a main data field; when the field is a main data field, the field is put into the standard layer; when the field is not a main data field, the field is put into the standard layer if the fill rate in its field features is greater than the threshold and the field is not a default value;

when the analyzed data type is inconsistent with the original type and the field type judgment data proportion is 100%, recommending a type conversion; if the field is a code value field, recommending that a code value conversion rule be configured;

determining whether each table in the database is an island table or not through a table level knowledge graph;

the table level knowledge graph is a knowledge graph which displays each table and the external key relation among the tables in a visual graph structure; the table level knowledge graph comprises nodes and edges, wherein each node represents a table, and each edge represents an external key relation;

determining whether a corresponding table has a foreign key relationship by whether edges exist between nodes in the table level knowledge graph, wherein when a node has no edge to any other node, the table represented by the node is an island table;

determining whether the fields of each table in the database are main data fields or not through a field level knowledge graph;

the field level knowledge graph is a knowledge graph which displays the field of each table and the relationship between tables in a visual graph structure form; the field level knowledge graph comprises nodes and edges, wherein each node represents a field, and each edge represents a relationship among the fields; the relationships among the tables are embodied as relationships among fields from different tables and at least comprise foreign key relationships, data equality or data nulling equality;

when determining the main data fields, two fields whose inter-table relationship is a foreign key relationship, data equality or data null-removal equality are found through the field level knowledge graph, and when the original data of the two fields come from different service systems, the two fields are taken as main data fields.

2. The method for constructing a standard layer of a data warehouse as claimed in claim 1, wherein the method for obtaining the table level knowledge graph comprises the following steps:

acquiring a service system from which each table in a database comes, a table name and a field name in each table;

for each table, analyzing the characteristics of each field according to the values of the fields in the table; aiming at each table, according to the table name, the field name and the value of the field, calculating to obtain the function dependency relationship in the table among the fields in the table;

for each table, identifying a main key of each table according to the function dependency relationship in the table, searching and determining corresponding external keys in other tables according to the characteristics of the main key, and forming an external key relationship between the main key and the external keys;

and displaying each table and the external key relation among the tables in a visual graph structure form as a table level knowledge graph.

3. The method for constructing a standard layer of a data warehouse as claimed in claim 1, wherein the method for obtaining the relationships between the tables in the field level knowledge graph is as follows:

determining a table A to which a foreign key belongs, finding the foreign key field and the closure of the foreign key field through the intra-table functional dependencies, and de-duplicating the fields in the closure to form a temporary table B with the foreign key field as its primary key;

through the foreign key relationship, taking the table C in which the primary key is located as the left table and the temporary table B as the right table, and performing an inner join to form a new temporary table D;

comparing the values in temporary table D of the fields from tables A and C to form the following inter-table relationships:

4. A system for building a standard layer of a data warehouse, comprising:

a processor; a database; and a memory in which a program is stored, a database storing tables,

5. The system for constructing a standard layer of a data warehouse of claim 4, wherein the method for obtaining the table level knowledge graph comprises the following steps:

6. The system for constructing a standard layer of a data warehouse of claim 4, wherein the method for obtaining the relationships between the tables in the field level knowledge graph comprises: