CN115858513A - Data governance method, data governance device, computer equipment and storage medium - Google Patents

Data governance method, data governance device, computer equipment and storage medium

Info

Publication number
CN115858513A
Authority
CN
China
Prior art keywords
data table
data
service
information
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211564533.3A
Other languages
Chinese (zh)
Inventor
彭思翔
黄越
朱晓娟
王雅迪
罗源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Healthcare Shenzhen Co Ltd
Original Assignee
Tencent Healthcare Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Healthcare Shenzhen Co Ltd
Priority to CN202211564533.3A
Publication of CN115858513A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data governance method, a data governance device, a computer device, a storage medium and a computer program product, and belongs to the technical fields of big data and cloud computing. The method comprises: performing database statement parsing on the table creation statements of the business data tables to be governed to obtain the data table information contained in each business data table; taking each business data table to be governed as an entity and performing database-table association mining on it based on the data table information to obtain the data table association relationships; then, based on the partitioned business data tables and the data table association relationships, performing business topic merging on the partitioned business data tables to obtain the business data tables under each business topic, the data table association relationships providing data support for the topic merging; and finally performing data standardization governance on the business data tables under each business topic to obtain the data governance result, thereby realizing metadata governance and standardization across different business modules.

Description

Data governance method, data governance device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a data governance method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology, big data technology has emerged. Big data refers to data sets so large in scale that they cannot be captured, managed, processed, and organized, within a reasonable time and with mainstream software tools, into information that supports enterprise decision-making. In the big data field, a single core service library can contain thousands of data tables covering many different business modules, the relationships among these data tables are intricate, and the daily increment of a core single table can reach millions of rows. How to capture the data relationships among the tables and perform data governance and standardization on them is therefore a pressing problem.
Current data governance methods generally perform governance by identifying abnormal data: an abnormal data item is detected and compared with the corresponding normal data item, and the abnormal item is then repaired. However, the accuracy of such methods is difficult to guarantee; depending on business needs, some seemingly abnormal values may not be true outliers, and processing them makes the business data inaccurate.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a data governance method, an apparatus, a computer device, a computer readable storage medium and a computer program product capable of improving the accuracy of data governance.
In a first aspect, the present application provides a method of data governance. The method comprises the following steps:
performing database statement parsing on a table creation statement of a business data table to be governed to obtain data table information;
taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain a data table association relationship;
based on partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, performing business topic merging on each partitioned business data table to obtain the business data tables under each business topic;
and performing data standardization governance on the business data tables under each business topic to obtain a data governance result.
In a second aspect, the application further provides a data governance device. The device comprises:
a statement parsing module, configured to perform database statement parsing on a table creation statement of a business data table to be governed to obtain data table information;
an association mining module, configured to take the business data table to be governed as an entity and perform database-table association mining on it based on the data table information to obtain a data table association relationship;
a business topic merging module, configured to perform business topic merging on each partitioned business data table, based on the partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, to obtain the business data tables under each business topic;
and a data governance module, configured to perform data standardization governance on the business data tables under each business topic to obtain a data governance result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the following steps when executing the computer program:
performing database statement parsing on a table creation statement of a business data table to be governed to obtain data table information;
taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain a data table association relationship;
based on partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, performing business topic merging on each partitioned business data table to obtain the business data tables under each business topic;
and performing data standardization governance on the business data tables under each business topic to obtain a data governance result.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
performing database statement parsing on a table creation statement of a business data table to be governed to obtain data table information;
taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain a data table association relationship;
based on partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, performing business topic merging on each partitioned business data table to obtain the business data tables under each business topic;
and performing data standardization governance on the business data tables under each business topic to obtain a data governance result.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program which when executed by a processor performs the steps of:
performing database statement parsing on a table creation statement of a business data table to be governed to obtain data table information;
taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain a data table association relationship;
based on partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, performing business topic merging on each partitioned business data table to obtain the business data tables under each business topic;
and performing data standardization governance on the business data tables under each business topic to obtain a data governance result.
According to the data governance method, apparatus, computer device, storage medium, and computer program product, database statement parsing is first performed on the table creation statements of the business data tables to be governed to obtain the data table information contained in each table; each business data table to be governed is then taken as an entity and database-table association mining is performed on it based on the data table information, so that the data table association relationships are identified from the resulting entity-relationship diagram. Next, based on the partitioned business data tables obtained by partitioning the business data tables to be governed and on the data table association relationships, business topic merging is performed on each partitioned business data table to obtain the business data tables under each business topic, the data table association relationships providing data support for the topic merging. Finally, data standardization governance is performed on the business data tables under each business topic to obtain the data governance result, thereby realizing metadata governance and standardization across different business modules. Through automatic statement parsing, data table association mining, and data table association analysis, the core data assets and the related business data tables of the original business are quickly identified during governance and standardization; the mined data associations provide data support for business topic merging and standardization, completing the data governance operation and guaranteeing its accuracy.
Drawings
FIG. 1 is a diagram of an environment in which a data governance method according to one embodiment may be implemented;
FIG. 2 is a schematic flow chart diagram of a data governance method in one embodiment;
FIG. 3 is a diagram illustrating data relationship redundancy resulting from tables sharing the same primary key in one embodiment;
FIG. 4 is a diagram of a result of a clustering cut in one embodiment;
FIG. 5 is a diagram illustrating the construction of minimal data connectivity in one embodiment;
FIG. 6 is a diagram that illustrates the merging of business topics based on data relationship mining results, under an embodiment;
FIG. 7 is a diagram illustrating service table merging under a service topic in one embodiment;
FIG. 8 is a diagram of a topic normalization flow in one embodiment;
FIG. 9 is a block diagram of an embodiment of an overall data governance and standardization architecture and system architecture;
FIG. 10 is a schematic flow diagram illustrating data relationship mining in one embodiment;
FIG. 11 is a block diagram of the structure of a data governance device in one embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The present application relates to the field of cloud technology. Cloud technology is a hosting technology that unifies hardware, software, network, and other resources in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. It is the general term for the network, information, integration, management platform, and application technologies applied under the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing will become an important support: background services of technical network systems, such as video websites, picture websites, and web portals, require large amounts of computing and storage resources. With the development of the internet industry, each item may carry its own identification mark that must be transmitted to a background system for logical processing, data of different levels are processed separately, and industrial data of all kinds require strong backend support, which can only be provided through cloud computing.
The application specifically relates to big data (Big data) technology within cloud technology. Big data refers to data sets that cannot be captured, managed, and processed by conventional software tools within a certain time range; they are massive, fast-growing, and diversified information assets that yield stronger decision-making power, insight, and process-optimization capability only under new processing modes. With the advent of the cloud era, big data has attracted increasing attention; processing large amounts of data effectively within a tolerable elapsed time requires special techniques, including massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.
In this context, the following terms are to be understood:
Primary-primary edge (PPE): also called a primary-primary key relationship; when two data tables share the same primary key, they are connected through that primary key.
Primary-foreign edge (PFE): also called a primary-foreign key relationship; the primary key of table A is a foreign key of table B, and the connecting edge between tables A and B in the entity-relationship diagram (ER diagram) is a primary-foreign key relationship.
Community algorithm: a community discovery algorithm for mining useful information from a graph network; for example, it can be used to mine the maximal cliques in the graph network, or to compute the centrality of a connecting edge and thereby obtain the importance of that edge.
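As a concrete illustration of the clique-mining use of a community algorithm mentioned above, the classic Bron-Kerbosch procedure can enumerate the maximal cliques of a small table-association graph. The table names and edges below are illustrative assumptions, not taken from the patent:

```python
# Minimal sketch: enumerate maximal cliques of a table-association graph
# (classic Bron-Kerbosch, without pivoting). An edge between two tables means
# they are related by a shared key. Table names are hypothetical.
def bron_kerbosch(r, p, x, adj, out):
    """r: clique under construction; p: candidates; x: already processed."""
    if not p and not x:
        out.append(sorted(r))  # r is maximal: no candidate can extend it
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p.remove(v)
        x.add(v)

adj = {
    "t_patient": {"t_visit", "t_order"},
    "t_visit":   {"t_patient", "t_order"},
    "t_order":   {"t_patient", "t_visit", "t_drug"},
    "t_drug":    {"t_order"},
}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
```

Here the triangle t_patient/t_visit/t_order and the pair t_order/t_drug come out as the two maximal cliques, i.e. candidate tightly related table groups.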
The data governance method provided by the embodiments of the application can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system, specifically a data warehouse, is integrated on the server 104 and stores the data the server 104 needs to process. When a user on the terminal 102 side needs to govern specified business data, the processing is realized through the server 104: the terminal 102 specifies the business data to be processed, the server 104 takes out the corresponding business data tables to be governed and synchronizes them into its data warehouse, and then performs database statement parsing on the table creation statements of the business data tables to be governed to obtain data table information; takes each business data table to be governed as an entity and performs database-table association mining on it based on the data table information to obtain the data table association relationships; performs business topic merging on each partitioned business data table, based on the partitioned business data tables obtained by partitioning the business data tables to be governed and on the data table association relationships, to obtain the business data tables under each business topic; and performs data standardization governance on the business data tables under each business topic to obtain the data governance result.
The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like. The terminal 102 may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In one embodiment, as shown in fig. 2, a data governance method is provided, which is described by taking the example that the method is applied to the server 104 in fig. 1, and includes the following steps:
step 201, performing database statement analysis processing on a table building statement of a business data table to be managed to obtain data table information.
Here, the business data table to be governed is an external business data table; a data table is a grid-like virtual table that temporarily stores data and records detailed business-related data; and the purpose of data governance is to govern the data of the business data tables to be governed so as to improve the availability of the data. In one embodiment, the business data tables to be governed are medical business data tables: the relationships among the tables of medical business data are intricate, and the daily increment of a core single table is huge. How to capture the data relationships among the tables from complicated medical business modules and jointly govern and standardize the associated table data, thereby providing a good data base for data applications, is a problem to be solved urgently. A table creation statement is a database statement used to create a business data table to be governed, and specifically records the structure of the data table, the field names, the field descriptions, the primary key information, and other contents. Table creation statements are originally used to build the business data tables on the business database side and are therefore stored there; when the business data tables in a business database need to be governed, they are synchronized from the business database to the data warehouse of the server 104 as the business data tables to be governed, and at that point the server 104 can also obtain the table creation statements from the business database in order to analyze the association relationships among the business tables.
Database statement parsing refers to the process of obtaining the relevant information of the business data table to be governed by parsing its table creation statement; the resulting data table information specifically includes key information such as the table name, the field information, and the primary and foreign keys. The parsing method can be chosen according to the complexity of the table creation statement: more complex statements can be parsed through lexical and syntactic analysis, while simpler statements can be parsed directly with regular expressions.
Specifically, in the data governance process, the user may submit a data governance request to the server 104 through the terminal 102. After obtaining the request, the server 104 determines the business data tables to be governed, synchronizes them to the data warehouse, and partitions them. Since the scheme mainly analyzes the associations within the business data tables to be governed and then realizes data governance based on those associations, automatic database statement parsing is performed on the table creation statements to obtain the corresponding data table information, on the basis of which the data table association analysis for the governance process can be realized. In one embodiment, the scheme is particularly suitable for the governance of medical data: the business data tables to be governed are taken out of a medical data and information system, the corresponding table creation statements are obtained, and database statement parsing is performed on them through regular expressions to obtain the data table information.
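The regular-expression parsing described above can be sketched as follows. The DDL text, field names, and extraction patterns are illustrative assumptions; a real deployment would need dialect-aware lexical and syntactic analysis for complex table creation statements:

```python
import re

# Minimal sketch: recover table name, fields, and primary key from a
# simplified CREATE TABLE statement using regular expressions.
DDL = """
CREATE TABLE t_visit (
    visit_id BIGINT COMMENT 'visit identifier',
    patient_id BIGINT COMMENT 'patient identifier',
    visit_date DATE,
    PRIMARY KEY (visit_id)
);
"""

def parse_create_table(ddl):
    name = re.search(r"CREATE\s+TABLE\s+(\w+)", ddl, re.I).group(1)
    # Body = everything between the first "(" and the last ")".
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    pk = re.search(r"PRIMARY\s+KEY\s*\(([^)]*)\)", body, re.I)
    fields = []
    for line in body.splitlines():
        m = re.match(r"\s*(\w+)\s+(\w+)", line)
        if m and m.group(1).upper() != "PRIMARY":  # skip the key clause
            fields.append({"name": m.group(1), "type": m.group(2)})
    return {"table": name, "fields": fields,
            "primary_key": [c.strip() for c in pk.group(1).split(",")] if pk else []}

info = parse_create_table(DDL)
```

The returned dictionary corresponds to the "data table information" of step 201: table name, field list, and primary key.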
Step 203, taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain the data table association relationship.
An entity refers to a thing that exists objectively and is distinguishable from other things. In the scheme of the application, the association analysis can be carried out by taking each business data table to be governed as an entity, so that the association relationships among the business data tables to be governed are determined and effective data governance is realized. The association analysis can be realized through the entity-relationship diagram (E-R diagram) method, which provides a way of representing entity types, attributes, and relationships and is an effective method for describing a conceptual model of the real world. In an E-R diagram, an entity type is represented by a rectangle with the entity name written inside; the attributes of an entity are represented by ellipses or rounded rectangles connected to the entity type by solid line segments; and a relationship between entity types is represented by a diamond containing the relationship name, connected to the related entity types by solid line segments beside which the type of the relationship (1:1, 1:n, or m:n) is marked. In the scheme of the application, the business data tables to be governed serve as the entities of the entity-relationship diagram, the data contents of the business data tables to be governed (such as the primary-primary key relationships and primary-foreign key relationships) serve as its data, and the relationships among the data tables serve as the relationships among the entities.
The database-table association mining on the data table information mainly uses the primary-primary key association information and the primary-foreign key association information.
Specifically, in the data governance process, after the data table information is obtained, the data table association relationships may be analyzed based on it, specifically in the form of the entity-relationship diagram method. The table names of the business data tables to be governed, together with the primary-primary key and primary-foreign key attributes between the tables, are extracted from the data table information; the table names then form the entities of the entity-relationship diagram, the attributes of the diagram are obtained from the data table information corresponding to each business data table to be governed, and the associations between the entities are determined based on attributes such as the primary-primary key and primary-foreign key relationships. The entity-relationship diagram is thus constructed, and from it the data table association relationships among the data tables are obtained.
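A minimal sketch of the primary-primary / primary-foreign key mining described above, assuming the data table information has already been parsed into the illustrative structure shown (table and field names are hypothetical):

```python
# Minimal sketch: derive table-association edges from parsed data table
# information. Two tables sharing a primary key get a primary-primary edge
# (PPE); a table whose primary key appears among another table's fields gets
# a primary-foreign edge (PFE).
tables = {
    "t_patient": {"primary_key": ["patient_id"], "fields": ["patient_id", "name"]},
    "t_visit":   {"primary_key": ["visit_id"],   "fields": ["visit_id", "patient_id"]},
    "t_billing": {"primary_key": ["visit_id"],   "fields": ["visit_id", "amount"]},
}

def mine_associations(tables):
    edges = []
    names = sorted(tables)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pk_a = set(tables[a]["primary_key"])
            pk_b = set(tables[b]["primary_key"])
            if pk_a == pk_b:
                edges.append((a, b, "PPE"))  # same primary key
            elif pk_a & set(tables[b]["fields"]) or pk_b & set(tables[a]["fields"]):
                edges.append((a, b, "PFE"))  # one table's key is the other's foreign key
    return edges

edges = mine_associations(tables)
```

The resulting edge list is one possible machine representation of the entity-relationship diagram's connections between table entities.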
Step 205, performing business topic merging on each partitioned business data table, based on the partitioned business data tables obtained by partitioning the business data table to be governed and on the data table association relationship, to obtain the business data tables under each business topic.
Partitioning, i.e., data partitioning, refers to dividing logically unified data into smaller physical units that can be managed independently; it is also called data division or data sharding. In the scheme of the application, the business data table to be governed is partitioned into the partitioned business data tables through the partition fields it carries. Business topic merging refers to merging different partitioned business data tables along the dimension of the business topic. A business topic specifically refers to one of the data categories divided according to the business logic corresponding to the business data.
Specifically, the scheme of the application further includes data synchronization and data partitioning of the business data tables to be governed. After determining the business data tables to be processed, the server 104 synchronizes the business data tables to be governed from the business system to the data warehouse to start data governance; the synchronization is mainly performed at the staging (stg) layer of the data warehouse and can be carried out by full synchronization or incremental synchronization. Then, in the process from the staging layer to the first operational data (ods1) layer, the synchronized business data tables are partitioned according to the partition fields to obtain the partitioned business data tables. The partitioned business data tables then undergo business topic merging based on the data table association relationships, yielding the business data tables under each business topic. Business topics could be divided according to the actual business logic, but manual division is labor-intensive and tables are easily missed; in the application, therefore, the results of the data association mining are used to divide the topics and to merge the data tables under each business topic, obtaining the business data tables under each business topic.
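Topic merging driven by the mined association relationships can be sketched as grouping tables into connected components of the association graph, e.g. with a union-find structure. The table names and edges below are illustrative assumptions:

```python
# Minimal sketch: group tables into business topics by taking the connected
# components of the mined association graph, using union-find.
def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def merge_topics(tables, edges):
    parent = {t: t for t in tables}
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two components
    topics = {}
    for t in tables:
        topics.setdefault(find(parent, t), []).append(t)
    return [sorted(group) for group in topics.values()]

tables = ["t_patient", "t_visit", "t_billing", "t_drug", "t_stock"]
edges = [("t_patient", "t_visit"), ("t_visit", "t_billing"), ("t_drug", "t_stock")]
topics = merge_topics(tables, edges)
```

Each returned group is a candidate business topic: all tables inside it are transitively connected by mined key relationships.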
Step 207, performing data standardization governance on the business data tables under each business topic to obtain a data governance result.
Data standardization governance refers to processing the business data tables through data standardization to obtain standardized business data tables. Because the data are relatively complicated in the application process, data transformation can be performed through data standardization, which is specifically a way of data transformation in data mining by which data are transformed or unified into a form suitable for mining.
Specifically, after the topics are merged, the table structure of each business data table changes, and a large amount of content under each business topic is fused into the table; this meets the requirements of the business application but may not meet the normalization requirements of a database table. Therefore, the business data tables after topic merging undergo data standardization governance to meet those requirements. In a specific embodiment, the standardization governance of a business data table comprises three passes: primary key standardization, data field standardization, and inter-table association standardization; after these, the business data tables corresponding to the data governance result are obtained and stored structurally into the data warehouse for subsequent big data analysis. In one embodiment, the scheme is particularly suitable for the governance of medical data: after the business data tables under each business topic are standardized and the data governance result is obtained, data mining and analysis are performed on the result according to the needs of medical data analysis, so that the medical data analysis is carried out effectively and its accuracy is improved.
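As an illustration of one of the three standardization passes, data field standardization can be sketched as mapping heterogeneous field names onto a unified naming convention. The alias table below is a hypothetical example, not taken from the patent:

```python
# Minimal sketch: field standardization as a rename pass over a record.
# The alias mapping is illustrative; a real system would derive it from the
# parsed data table information and a naming standard.
FIELD_ALIASES = {
    "pat_id": "patient_id",
    "patientID": "patient_id",
    "visitdate": "visit_date",
}

def standardize_fields(record):
    out = {}
    for key, value in record.items():
        canonical = FIELD_ALIASES.get(key, key).lower()  # map alias, then normalize case
        out[canonical] = value
    return out
```

Applied to every row of a merged table, such a pass gives all tables under a topic a consistent field vocabulary before inter-table association standardization.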
According to the data governance method, database statement parsing is first performed on the table creation statements of the business data tables to be governed to obtain the data table information contained in each table; each business data table to be governed is then taken as an entity, and database-table association mining is performed on it based on the data table information to obtain the data table association relationships. Next, based on the partitioned business data tables obtained by partitioning the business data tables to be governed and on the data table association relationships, business topic merging is performed on each partitioned business data table to obtain the business data tables under each business topic, the data table association relationships providing data support for the topic merging; data standardization governance is then performed on the business data tables under each business topic to obtain the data governance result, realizing metadata governance and standardization across different business modules. Through automatic statement parsing, data table association mining, and data table association analysis, the core data assets and the related business data tables of the original business are quickly identified during governance and standardization; the mined data associations provide data support for business topic merging and standardization, completing the data governance operation and guaranteeing its accuracy.
In one embodiment, the method further comprises: performing full-scale synchronization on the historical data in the business data tables to be governed to obtain historical business data tables; partitioning the historical business data tables according to their partition fields, and performing partition field conversion and business data deduplication on the partitioned historical business data tables to obtain the partition business data tables.
Full-scale synchronization is one of the data synchronization modes; data synchronization covers the processes of extracting data from the business database and transmitting the data to the data warehouse. Full-scale synchronization synchronizes all data in the business database to the data warehouse, which is the simplest way to keep the data on both sides consistent. In contrast, incremental synchronization synchronizes only new and changed business data to the data warehouse. In a particular embodiment, the period of data synchronization may be days, i.e., data synchronization may occur daily. The historical data in the business data tables to be governed specifically refers to all data in the business data tables to be governed. A historical business data table is a business data table stored in the data warehouse after full-scale synchronization; the data it stores is consistent with the data stored in the corresponding business data table to be governed. A partition field refers to the data table field used for partitioning the business data table to be governed; after data synchronization, the historical business data table can be partitioned based on its original partition field. Partition field conversion and business data deduplication are used to convert the partition field of the partitioned business table into a field with business meaning and attributes, and specifically comprise three processes: partition dismantling, partition extraction, and partition deduplication.
Specifically, when data governance is performed, the business data tables to be governed in the business database can be synchronously stored into the data warehouse, and data governance is realized on top of the data warehouse. The data warehouse can acquire the business data tables to be governed stored in the business system database through full-scale synchronization, i.e., by performing full-scale synchronization on the historical data in the business data tables to be governed. After the full-scale synchronization, historical business data tables corresponding to the business data tables to be governed are generated in the data warehouse; the historical business data tables are then partitioned through their original partition fields, and partition field conversion and business data deduplication are performed on the partitioned historical business data tables, i.e., the partitioned tables are processed through partition dismantling, partition extraction, and partition deduplication, so that the partition business data tables are obtained. At this point, the partition fields of the partition business data tables have been converted into fields with business meaning and attributes. In one embodiment, data synchronization and data governance take days as their unit: the historical data in the business data tables to be governed is first fully synchronized, and the business data updated in real time is then synchronized through subsequent timed incremental synchronization, so that data governance can cover all business data. In this embodiment, through data synchronization and data partitioning, more accurate partition business data tables can be obtained, ensuring the rationality and effectiveness of data governance.
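The partition deduplication step above can be sketched as follows; this is a minimal illustration assuming list-of-dict rows and invented field names (`order_id`, `ds`), not the patent's actual implementation:

```python
def dedup_partitions(rows, key_field, partition_field):
    """Keep, for each business key, only the row from the latest partition.

    `rows` is a list of dicts; `partition_field` (e.g. a date string such as
    '2022-12-01') is the technical partition column produced by full
    synchronization, here promoted to an ordinary business attribute.
    """
    latest = {}
    for row in rows:
        k = row[key_field]
        # ISO date strings compare correctly as plain strings
        if k not in latest or row[partition_field] > latest[k][partition_field]:
            latest[k] = row
    return list(latest.values())

rows = [
    {"order_id": 1, "ds": "2022-12-01", "status": "created"},
    {"order_id": 1, "ds": "2022-12-02", "status": "paid"},
    {"order_id": 2, "ds": "2022-12-01", "status": "created"},
]
deduped = dedup_partitions(rows, "order_id", "ds")
```

After deduplication, each business key keeps only its newest partition row, which mirrors the "partition dismantling, extraction, deduplication" sequence described above.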
In one embodiment, step 201 includes: performing lexical analysis on the table-building statements of the business data tables to be governed to obtain lexical tokens; performing syntax parsing on the lexical tokens to obtain an abstract syntax tree; and obtaining the data table information based on the abstract syntax tree.
Database statement parsing is specifically used to extract the table name, library name, and values of related fields from a database statement, and comprises a lexical analysis part and a syntax analysis part. Lexical analysis is mainly used to convert the input into lexical tokens, which include keywords and non-keywords. Syntax parsing is the process of generating an abstract syntax tree. The abstract syntax tree converts the table-building statements of the business data tables to be governed into structured data for analysis, so that the various data table information can be accurately extracted.
Specifically, the scheme of the application can parse the table-building statements of the database by combining lexical analysis with syntax parsing. The parsing process first processes the table-building statements of the business data tables to be governed through lexical analysis to obtain the corresponding lexical tokens; then, based on the lexical tokens, the table-building statements can be converted into an abstract syntax tree through syntax parsing, from which data table information such as the data table structure, field names, field comments, and primary key information can easily be extracted. In another embodiment, when the table-building statements are simple, database statement parsing can be realized directly through regular expressions. A regular expression is a text pattern, a concept in computer science, comprising ordinary characters (e.g., letters between a and z) and special characters (called "metacharacters"); it uses a single string to describe and match a series of strings conforming to a certain syntactic rule, and is typically used to retrieve and replace text that conforms to a certain pattern. For simple table-building statements, parsing can be performed directly through regular expressions. In this embodiment, constructing the abstract syntax tree effectively parses the table-building statements of the business data tables to be governed, ensuring the accuracy of data table information extraction.
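The regular-expression path for simple table-building statements can be illustrated as follows; the DDL, field names, and patterns are invented for this sketch, and a real SQL dialect would need the lexer/parser pipeline described above:

```python
import re

# Toy CREATE TABLE statement (entirely made up for illustration).
ddl = """
CREATE TABLE patient_record (
    record_id BIGINT COMMENT 'record number',
    patient_name VARCHAR(64) COMMENT 'name',
    PRIMARY KEY (record_id)
)
"""

# Extract the table name, the primary key, and the field names.
table = re.search(r"CREATE\s+TABLE\s+(\w+)", ddl, re.I).group(1)
pk = re.search(r"PRIMARY\s+KEY\s*\((\w+)\)", ddl, re.I).group(1)
fields = re.findall(r"^\s*(\w+)\s+(?:BIGINT|VARCHAR\(\d+\))", ddl, re.I | re.M)
```

This only works because the statement is simple and the type list is known in advance, which is exactly the limitation that motivates the abstract-syntax-tree approach for general DDL.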
In one embodiment, performing base-table association relation mining on the data table information through the entity contact graph method to obtain the data table association relations includes: extracting the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information from the data table information; constructing a first entity contact graph based on the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information; performing edge screening on the first entity contact graph through a community discovery algorithm to obtain a second entity contact graph; and obtaining the data table association relations based on the second entity contact graph.
Primary-key-to-primary-key association information, also called a primary key relationship, arises when two data tables share the same primary key and are thereby connected. When the primary key of table A is a foreign key of table B, the connection between the two tables in the entity contact graph (ER diagram) is a primary-foreign-key relationship. A community discovery algorithm can find useful information in a graph network; for example, such an algorithm can mine the maximal cliques in the graph network, or compute the centrality of connecting edges to obtain the importance of each edge. The first entity contact graph refers to the preliminarily screened entity contact graph, and the second entity contact graph is the minimum connection graph obtained by screening again on the basis of the first entity contact graph.
Specifically, the scheme performs base-table association relation mining through the entity contact graph method. During the mining, two kinds of information, namely the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information, are extracted from the data table information. Based on these two kinds of information, the associations between the partition business data tables are determined, and the entities (partition business data tables) are connected accordingly, so that a first entity contact graph containing all association relations is established. Then, edge screening is performed on the first entity contact graph through a community discovery algorithm, removing the connecting edges of lower importance, to obtain the second entity contact graph, from which reliable data table association relations can be obtained. In this embodiment, the entity contact graph corresponding to the business data tables is constructed from the two association attributes between business data tables, primary-key-to-primary-key association information and primary-key-to-foreign-key association information, and is then screened through the community discovery algorithm, so that the data table association relations can be effectively mined and the accuracy of the mining is ensured.
In one embodiment, constructing the first entity contact graph based on the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information includes: performing clique mining on the primary-key-to-primary-key association information to obtain primary-key-to-primary-key association information cliques; determining node degree information for the nodes in the primary-key-to-foreign-key association information; screening the edges in the primary-key-to-primary-key association information cliques based on the node degree information to obtain first screened edges; screening the edges corresponding to the primary-key-to-foreign-key association information based on the node degree information to obtain second screened edges; and constructing the first entity contact graph based on the first screened edges and the second screened edges.
Clique mining refers to the process of performing community mining on the graph data to obtain primary-key-to-primary-key association information cliques (PPE cliques). The nodes are the entities in the entity contact graph, i.e., the partition business data tables. Node degree information (degree) is a basic concept of graph structures, referring to the number of edges associated with a node, also called the degree of association.
Specifically, in an actual business database, multiple business data tables often share the same primary key, and a certain association relation exists between these tables. If all this information is retained, the situation shown in fig. 3 occurs: the tables (A-E) with the same primary key are connected with each other, the dotted lines representing primary-key-to-primary-key association information (PPE), and they are also fully connected with all other tables (F-H) associated through a primary-foreign-key relationship, the solid lines representing primary-key-to-foreign-key association information (PFE), so that the table relationships become disordered. As can be seen from fig. 3, tables with the same primary key associate with each other to form a primary key association mesh structure. In order to identify the primary key association mesh structures among all tables, the application clusters all primary key association relations using the clique mining method in the community discovery algorithm. As shown in fig. 4, after the PPE connection information is input, the maximal clique mining algorithm outputs the node information of each node clique (Clique) after clustering; nodes in the same clique share the same primary key, and the name of each clique and its node members are stored in dictionary form: { node clique 1: [A, B, C, D, E], node clique 2: [P, Q, R], node clique 3: [W, X, Y, Z] } represents three cliques. Then the corresponding node degrees can be computed: node degrees are calculated over all primary-foreign-key association connections, each primary-foreign-key association edge being recorded as a triple (u, v, e), where u and v are the two end nodes of the edge e. The degree D of each node is computed as follows:
for (u, v, e) in all_PFE_edges:
    D[u] += 1
    D[v] += 1
Here, all_PFE_edges denotes the connecting edges in all primary-key-to-foreign-key association information, and the degree D of every node starts at zero. The method brings only PFE connection information into the node degree calculation, avoiding interference from redundant PPE information. Then, all edges in the primary-key-to-primary-key association information cliques can be screened based on the node degree information, removing some of the connecting edges representing association relations; the remainder are the first screened edges. Similarly, all edges in the primary-key-to-foreign-key association information can be screened based on the node degree information, removing some of the connecting edges; the remainder are the second screened edges. The entity contact graph formed by the association relations represented by the remaining first screened edges and second screened edges is the required first entity contact graph. In this embodiment, clique mining on the primary key association information determines the primary-key-to-primary-key association information cliques, and two rounds of screening by node degree effectively extract the important data table association relations for constructing the first entity contact graph, ensuring the validity of the entity contact graph.
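The clique mining and PFE-only degree calculation above can be sketched with networkx (an assumed tooling choice; the node names and edges are illustrative):

```python
import networkx as nx
from collections import defaultdict

# PPE edges connect tables that share a primary key; PFE edges are
# primary-foreign-key connections (all values invented for the sketch).
ppe_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("P", "Q")]
pfe_edges = [("A", "F"), ("A", "G"), ("B", "F")]

# Maximal clique mining over the PPE graph: each clique is a group of
# tables with the same primary key.
ppe_graph = nx.Graph(ppe_edges)
cliques = {f"clique_{i}": sorted(c)
           for i, c in enumerate(nx.find_cliques(ppe_graph))}

# Node degrees computed from PFE edges only, as the text prescribes.
degree = defaultdict(int)
for u, v in pfe_edges:
    degree[u] += 1
    degree[v] += 1
```

Restricting the degree count to PFE edges is what keeps the fully connected PPE mesh from inflating every clique member's importance equally.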
In one embodiment, screening the edges in the primary-key-to-primary-key association information cliques based on the node degree information to obtain the first screened edges includes: determining, based on the node degree information, the key node with the maximum node degree in each primary-key-to-primary-key association information clique; and taking the edges in the clique that contain the key node as the screening-retained edges, thereby screening the edges in the clique to obtain the first screened edges. Screening the edges corresponding to the primary-key-to-foreign-key association information based on the node degree information to obtain the second screened edges includes: taking the edges in the primary-key-to-foreign-key association information that contain a key node as the screening-retained edges, thereby screening these edges to obtain the second screened edges.
The key node is the node with the maximum node degree in a primary-key-to-primary-key association information clique, and represents the core table under a business topic. A screening-retained edge is an edge retained after screening; apart from the screening-retained edges, the edges represented by the other association information are removed.
Specifically, in the solution of the present application, a complete association information graph may first be constructed from the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information. Then the primary key information and the primary-foreign-key information are screened through the node degree information. First, the key node with the maximum node degree in each primary-key-to-primary-key association information clique is determined from the node degree information; this node also represents the core table under a business topic. Then the primary-key association information is screened: only the edges containing the key node need to be retained, i.e., the primary key association information corresponding to the core table is kept, and the edges representing primary key information between other nodes can be removed directly, yielding the first screened edges. Similarly, for the primary-foreign-key association information, only the edges containing a key node need to be retained, i.e., the primary-foreign-key association information corresponding to the core table is kept, and the primary-foreign-key relationships of the other nodes can be removed, yielding the second screened edges. The association relation graph containing only the first screened edges and the second screened edges is then the first entity contact graph. In one embodiment, the process of constructing the first entity contact graph may be as shown before and after the first arrow in fig. 5, where nodes A, B, C, D, and E are connected through primary key relationships (PPE) and are also connected with nodes F, G, and H through primary-foreign-key relationships (PFE). After the node information is screened through node degrees, the key node is determined to be node A, and the corresponding association relations can be deleted: the nodes other than key node A in the PPE clique are connected only with key node A (these connections being marked as PPE) and are no longer connected to each other, and only key node A keeps external PFE connections, the other nodes in the PPE clique no longer being externally connected through PFE. The result is shown in the upper right-hand corner of fig. 5. In this scheme, edge screening by identifying key nodes can effectively clean up the association relations between nodes and reduce redundant edges, ensuring the validity of the first entity contact graph.
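The key-node screening described above can be sketched in plain Python; the node names follow the fig. 5 example, and computing degrees from PFE edges only is carried over from the earlier step:

```python
# One PPE clique (tables sharing a primary key) and its PFE edges
# to external tables (edges invented for the sketch).
ppe_clique = ["A", "B", "C", "D", "E"]
pfe_edges = [("A", "F"), ("A", "G"), ("A", "H"),
             ("B", "F"), ("C", "G")]

# Degrees from PFE edges only.
degree = {}
for u, v in pfe_edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Key node: the clique member with the highest PFE degree (the core table).
key = max(ppe_clique, key=lambda n: degree.get(n, 0))

# First screening: within the clique, keep only edges incident to the key node.
ppe_kept = [(key, n) for n in ppe_clique if n != key]
# Second screening: keep only PFE edges that contain the key node.
pfe_kept = [(u, v) for u, v in pfe_edges if key in (u, v)]
```

The retained edges form the star-shaped structure shown in the upper-right of fig. 5, with the core table at the center.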
In one embodiment, performing edge screening on the first entity contact graph through the community discovery algorithm to obtain the second entity contact graph comprises: performing a centrality calculation on the edges of the primary-key-to-foreign-key association information in the first entity contact graph through the community discovery algorithm to obtain a centrality calculation result; and performing edge screening on the first entity contact graph based on the centrality calculation result to obtain the second entity contact graph.
Centrality is used to quantify the importance of a vertex in a graph, and can likewise quantify the importance of a node or an edge. In the scheme of the application, the importance of an edge is identified and judged mainly through the centrality of the edge.
Specifically, after the primary-key association information and the primary-foreign-key association information have been screened through the node degree information, the centrality of each edge in the first entity contact graph can be calculated through the community discovery algorithm; specifically, the centrality of the edges representing primary-key-to-foreign-key association information (PFE) in the first entity contact graph is calculated. The centrality of an edge is related to the node information in the entity contact graph, more precisely to the two end nodes of the edge: its value depends on the number of shortest paths between each pair of nodes and on the fraction of those shortest paths that pass through the edge, satisfying the relationship shown in the following formula:
c(e) = Σ_{s,t} δ(s,t|e) / δ(s,t)
in the above formula, s and t range over all nodes in the first entity contact graph, δ(s,t) is the number of shortest paths from node s to node t, and δ(s,t|e) is the number of those shortest paths that pass through the edge e. After the centrality of all edges is calculated by this formula, the first entity contact graph is further screened based on the centrality calculation result to obtain the second entity contact graph. In this embodiment, the importance of each edge is measured through centrality, so that the entity contact graph is further screened and redundant edges are removed, ensuring the validity of the second entity contact graph.
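The formula above is the standard edge betweenness centrality, which networkx implements directly (an assumed tooling choice; the toy graph loosely mirrors the A/F/G/H example):

```python
import networkx as nx

# For each edge e, sum over node pairs (s, t) the fraction of shortest
# s-t paths that pass through e; normalized=False gives raw sums.
G = nx.Graph([("A", "F"), ("A", "G"), ("A", "H"), ("F", "G")])
centrality = nx.edge_betweenness_centrality(G, normalized=False)
```

Here the edge A-H scores highest because every shortest path reaching the otherwise-isolated node H must cross it, which is exactly the notion of importance the screening step relies on.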
In one embodiment, performing edge screening on the first entity contact graph based on the centrality calculation result to obtain the second entity contact graph includes: clearing the connecting edges in the first entity contact graph; and re-adding connecting edges to the first entity contact graph in descending order of centrality to obtain the second entity contact graph, where each added connecting edge has an isolated node among its two end nodes, an isolated node being a node without any connection information.
Specifically, in the scheme of the application, the edge screening of the first entity contact graph may be implemented by centrality ranking, the objects to be screened being the connecting edges of primary-key-to-foreign-key association information (PFE). Specifically, the PFE connecting edges in the first entity contact graph may first be cleared, and then re-added in descending order of centrality. That is, the centrality of every PFE connecting edge in the first entity contact graph is calculated, and the edges are stored as a centrality list from large to small, each element of the list represented as a triple (u, v, e); the PFE connecting edges in the current first entity contact graph are then deleted, and the edges e are taken out of the centrality list in order. If at least one of the two end nodes of an edge e is an isolated node (i.e., a node without any connection information), the edge is connected in the graph; otherwise it is not. When all edges of the centrality list have been judged, the resulting graph is the second entity contact graph representing the minimum connection relation. In one embodiment, the process of constructing the second entity contact graph may refer to the lower two contact graphs in fig. 5. First, centrality is calculated for the upper-right first entity contact graph, determining the centrality of the PFE connecting edges related to key node A and of the PFE edges between the external nodes; for example, the centrality of AF is 50, of AG is 45, of FG is 43, and of the remaining edges 22 and 15, giving the sorted centrality list {50, 45, 43, 22, 15}. The connecting edges are then taken in order: the edge AF is taken first, and since F is an isolated node, nodes A and F are connected; similarly, node A and node H are connected. In this scheme, edge screening by the centrality of the connecting edges can effectively clean up the association relations between nodes and reduce redundant edges, ensuring the validity of the second entity contact graph.
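The greedy re-addition procedure can be sketched as follows; the centrality values follow the example in the text, and for brevity the sketch tracks only PFE connectivity when testing whether an endpoint is isolated:

```python
# PFE edges with their centralities (values follow the fig. 5 example;
# the edge list itself is partly assumed).
ranked = [("A", "F", 50), ("A", "G", 45), ("F", "G", 43),
          ("A", "H", 22), ("G", "H", 15)]

connected = set()   # nodes that already have a retained PFE edge
kept = []
for u, v, _c in sorted(ranked, key=lambda t: t[2], reverse=True):
    # Keep the edge only if at least one endpoint is still isolated.
    if u not in connected or v not in connected:
        kept.append((u, v))
        connected.update((u, v))
```

The result connects each external node to the graph exactly once through its most central edge, yielding the minimum connection relation described above.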
In one embodiment, performing business topic merging on the partition business data tables based on the data table association relations, determining the business topic of each partition business data table, and obtaining the business data tables under each business topic includes: identifying the core data table of each business topic among the partition business data tables; determining the associated data tables of each business topic among the partition business data tables based on the data table association relations of the core data table; and fusing the data information of the associated data tables into the core data table to obtain the business data table under each business topic.
The core data table refers to the most central table under a business topic, i.e., the business data table serving as a key node in the entity contact graph. Through the association relations between the core data table and the other data tables, the associated data tables having an association relation with the current business topic can be determined among the partition business data tables.
Specifically, when classifying the business data tables, the core data table of each business topic among the partition business data tables can be identified through the constructed entity contact graph; each core data table corresponds to one business topic, and the other data tables linked to the core data table through primary-key-to-primary-key or primary-key-to-foreign-key association relations can serve as supplementary descriptions of the branch businesses. The essence of topic merging is therefore to take the core table as the center and fuse in the information of the associated tables, thereby obtaining the business data table under each business topic. In a specific embodiment, a schematic diagram of topic division and data table merging through data relationship mining according to the present application can be shown in fig. 6, where the table in a solid block is the core table under a business topic, and the tables in a dashed-line block represent a business topic centered on that core table with the associated table information fused in. The fusion of field information can be as shown in fig. 7, taking the project filing topic in a certain business library as an example: the core table is the project filing summary information table, and the topic is fused around this core table. Among the associated tables, the personnel basic information table is a dimension table. Although the "personnel type" field is also present in the dimension table, it can be used as a dimension field for statistical analysis, so the field is retained; and considering that its value may be missing or inaccurate, the field is replaced based on the personnel basic information table.
The project filing, transfer filing, and remote filing information all belong to branch businesses within the filing business, each recording in detail the data of one of the three different filing businesses; their serial number is therefore used as the association field to supplement the core table with information, i.e., a field expansion operation. In this embodiment, the core data table of each business topic is determined first, and the branch business information within the topic is then determined based on the data table association relations of the core data table, which effectively ensures the accuracy of business topic division and improves the effect of data governance.
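The two fusion operations above, field replacement from a dimension table and field expansion via the serial-number association field, can be sketched with pandas; all table contents and column names here are invented:

```python
import pandas as pd

# Core table with a possibly stale dimension field.
core = pd.DataFrame({"serial_no": [1, 2], "person_id": [10, 11],
                     "person_type": ["?", "staff"]})
# Person dimension table holding the authoritative person_type.
person_dim = pd.DataFrame({"person_id": [10, 11],
                           "person_type": ["doctor", "staff"]})
# Branch-service detail table joined on the serial-number field.
branch = pd.DataFrame({"serial_no": [1, 2],
                       "filing_detail": ["project", "transfer"]})

# Field replacement: trust the dimension table's value for person_type.
merged = core.drop(columns="person_type").merge(person_dim, on="person_id")
# Field expansion: attach branch-service details via the association field.
merged = merged.merge(branch, on="serial_no")
```

The replacement drop-and-merge guards against the missing or inaccurate values mentioned above, while the second merge is the field expansion that pulls branch data into the topic table.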
In one embodiment, obtaining the data governance result based on the business data tables under each business topic comprises: determining the business primary key of a business data table; performing primary key standardization on the business data table based on the business primary key to obtain a primary-key-standardized data table; and performing data field standardization and inter-table association standardization on the primary-key-standardized data table to obtain the data governance result.
Regarding primary key standardization: a non-business primary key is usually added to the source table in the business database as the unique identifier of each record, whereas in a topic table the business primary key must be determined according to actual business usage, and standardization is performed by deduplicating the data on that business primary key. Regarding non-standard data fields: business data may contain non-standard values such as nulls or values outside the permitted value range, so the data fields need to be standardized. Regarding non-standard inter-table associations: tables belonging to different topics need to be joined in data applications, but irregular source data causes some of these joins to fail, so normalization is needed.
Specifically, after the core data table is determined, a round of data standardization governance can be performed based on it, improving the usability of the data and ensuring the effect of data governance. First, primary key standardization is performed on the business data table based on the business primary key to obtain the primary-key-standardized data table. Primary key standardization effectively deduplicates the business data, preventing subsequent processing from repeatedly handling duplicate records and ensuring the efficiency of data governance. Then, data field standardization and inter-table association standardization are performed to obtain the data governance result. In one embodiment, as shown in fig. 8, after the business primary key is determined, the latest record can be retained according to the update time (update_time), achieving primary key deduplication for topic table A. In the actual governance process, after the data quality control standard is determined, each field of the topic table is automatically matched with the corresponding quality control rule for quality verification, and dirty data not meeting the standard is screened out and handled by the business side. In this embodiment, data standardization governance effectively governs the business data tables under each business topic, yielding an accurate data governance result.
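The primary-key deduplication by latest update time can be sketched with pandas; the column names are illustrative, not taken from the patent:

```python
import pandas as pd

# Topic table with duplicate business keys at different update times.
df = pd.DataFrame({
    "biz_key": ["k1", "k1", "k2"],
    "update_time": ["2022-12-01", "2022-12-05", "2022-12-03"],
    "value": [1, 2, 3],
})

# Sort ascending by update_time, then keep the last (latest) row per key.
deduped = (df.sort_values("update_time")
             .drop_duplicates(subset="biz_key", keep="last")
             .sort_values("biz_key")
             .reset_index(drop=True))
```

Sorting before `drop_duplicates(keep="last")` is what guarantees the retained row is the most recent one for each business primary key.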
In one embodiment, the method further comprises: determining the entity contact graph corresponding to the data table association relations; feeding back the entity contact graph and acquiring the association adjustment information corresponding to it; adjusting the entity contact graph based on the association adjustment information to obtain a target entity contact graph; and constructing a data warehouse model according to the target entity contact graph, where the data warehouse model is used to store the data governance result in a structured manner.
The data warehouse model is a model obtained by modeling the data warehouse, and can realize the functions of data storage, data summarization, data analysis and the like.
Specifically, the data governance method can be applied to data warehouse modeling. The entity-relationship (ER) diagram involved in analyzing the data table associations is determined first; the ER diagram is then fed back to the modeling staff, and the association adjustment information they return is obtained. The ER diagram is adjusted based on that information to obtain a target ER diagram, and a data warehouse model is finally constructed from the target ER diagram. When building a data warehouse, data insight and data-table relationship combing are usually performed to deepen the understanding of the business, and the fastest route to data insight is to draw and display an ER diagram of the base tables involved in the warehouse. In one embodiment, after a user inputs the table-building statements of the relevant tables, the statements can be automatically parsed and the entity connections optimized to obtain an optimized ER diagram, which is then displayed on the web side using a graph database (neo4j). With minor fine-tuning on this basis, the user quickly obtains the required final ER diagram, stores it in a database or data asset platform, and performs the corresponding data warehouse modeling. In another embodiment, the solution of the present application can also be applied to a data standards platform, on which a user needs to sort out and determine all table and field relationships in a database, including field associations, field value ranges, and the like.
When combing, the same field name may appear in different database tables; although the names match, the data actually stored may differ, or the association constraints may differ (such as strong or weak constraints). With the present method and device, the ER diagram can be analyzed quickly before field-level combing, the different association constraints and field-table relationships located, and the different types of associations managed on the ER diagram within the system. Linked type changes can then be made during combing, ensuring the internal consistency of the standardization results and avoiding repeated labor. Once the data standard is completed and converted into the corresponding data quality-control rules, the ER diagram can be quickly converted into rule content such as associated-property checks. In one embodiment, the business data table to be governed comprises a raw medical data table, and the method further comprises: performing data governance on the raw medical data table based on the data warehouse model to obtain the data governance result corresponding to the raw medical data table; and storing that result to the data warehouse model in a structured manner. In this embodiment, data warehouse modeling effectively constructs the corresponding data warehouse model, enabling subsequent storage and analysis of the governance results of the relevant business data and ensuring the effectiveness of data processing.
The application also provides an application scenario to which the data governance method is applied. Specifically, the method is applied in that scenario as follows:
When a user needs to perform big-data analysis and the business systems involved are complicated, the data relationships between tables can be captured directly from the business systems by this data governance method, and the related table data governed and standardized jointly, providing a sound data basis for data applications. The overall architecture and system diagram for governing and standardizing the business data in a business data source is shown in fig. 9: steps in shaded boxes are the detailed execution of the steps in octagonal boxes; manual tasks and automatic tasks are distinguished, triggered manually by the user or executed automatically by the system, respectively; and step tasks expand a given step node in detail, all executed within the shaded boxes. First, at the data source layer of the system, the business data tables in the business data source may be updated in real time. The server implementing the data governance of the present application can synchronize them into the server, online or offline, through first synchronization (full volume) and configured timed synchronization (including normal synchronization and supplementary synchronization); the synchronized tables are named X_history, as shown in the figure. During synchronization, the table-building statements of the business data source are parsed by an SQL parser to obtain data table information containing table names, primary keys, foreign keys, and field information. The associations among the data tables synchronized from the business data source are then mined through database-table association mining, and the main tables and associated tables under each topic
are further determined. In the data warehouse, the source (staging) layer (stg) first roughly partitions the synchronized data tables, mainly based on the partition fields in the data, while empty (null) data table partitions are filtered out. The roughly partitioned business data tables are then converted in the first data operation layer (ods1) through partition-field conversion and business data deduplication to obtain the partitioned business data tables; this conversion includes steps such as partition dismantling, partition extraction, and partition deduplication. At this point the partitioned business data tables are still relatively scattered, so topic merging is performed from the first data operation layer to the second data operation layer (ods2) according to topic dimensions (determined by the database-table association mining). Topic merging is divided into three processes according to the triggering event, namely first execution, daily execution, and execution after supplementary synchronization, corresponding respectively to first synchronization, normal synchronization, and supplementary data synchronization. Topic merging specifically comprises business table merging, which depends on the topic of each business table; the tables and field ranges contained in each topic depend on the topic-table associations mined at the source layer, and the process is mainly realized by looking up the business partitions corresponding to the tables.
In the further processing from the second data operation layer to the data detail layer (dwd), the topic tables undergo topic standardization, finally yielding business data tables that contain business partitions, are divided by topic, conform to the topic structure specification, and have been cleaned. The mining of topic base-table associations can be shown in fig. 10 and includes: step 1001, performing SQL parsing on the table-building statements of the business data table to be governed to obtain the data table information; when mining database-table associations, the table-building statements are parsed directly through SQL statement analysis to identify key data table information including table names, field information, and primary/foreign keys. Step 1003, extracting the primary key to primary key (PK-PK) association information from the data table information and mining PK-PK association cliques: the PK-PK associations are identified from the primary/foreign key information in the data table information, and the cliques are mined with a community mining algorithm. Step 1005, extracting the primary key to foreign key (PK-FK) association information and determining the node degree of the nodes in it: the PK-FK associations are identified from the primary/foreign key information, node degrees are calculated from them, and the degree of each node in the graph structure is determined.
Step 1007, screening the edges in the primary key to primary key (PK-PK) association cliques based on the node degree information to obtain first screened edges, and screening the edges corresponding to the primary key to foreign key (PK-FK) association information based on the node degree information to obtain second screened edges. Once the node degrees are obtained, the edges of the data table associations can be screened by node degree, removing redundant association information. Step 1009, constructing a preliminary entity-relationship (ER) diagram based on the first and second screened edges, i.e., based on the data table associations with redundancy removed. Step 1011, determining the centrality of each edge in the preliminary ER diagram and ranking the edges by centrality; this step further screens the data table associations of the preliminary diagram. Step 1013, performing edge screening on the preliminary ER diagram based on the edge ranking and constructing a minimum connection graph. The minimum connection graph is built from the further-screened associations and retains the stronger associations between data tables, so subsequent data analysis can proceed on it, i.e., step 1015, topic/table/field matching analysis and the like.
Performing SQL parsing on the table-building statements to obtain the table information may specifically include: performing lexical analysis on the table-building statements of the business data table to be governed to obtain lexical tokens; performing syntax parsing on the tokens to obtain an abstract syntax tree; and obtaining the data table information from the abstract syntax tree. Constructing the preliminary ER diagram based on the primary key to primary key (PK-PK) association information and the primary key to foreign key (PK-FK) association information comprises: clustering and mining the PK-PK association information to obtain PK-PK association cliques; determining the node degrees of the nodes in the PK-FK association information; screening the edges in the PK-PK cliques by node degree to obtain first screened edges; screening the edges corresponding to the PK-FK association information by node degree to obtain second screened edges; and constructing the preliminary ER diagram from the first and second screened edges. Constructing the minimum connection graph then includes: calculating, via a community discovery algorithm, the centrality of the PK-FK association edges in the preliminary ER diagram to obtain a centrality calculation result; and performing edge screening on the preliminary ER diagram based on that result to obtain the minimum connection graph.
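The extraction of table name, primary key, and foreign keys from a table-building statement can be illustrated with a small sketch. A production parser would build a real abstract syntax tree as described above; the regex-based version below is a simplified stand-in, and the example schema (visit, patient) is hypothetical:

```python
import re

def parse_create_table(sql):
    """Extract table name, primary key, and foreign keys from one CREATE TABLE
    statement. A regex sketch standing in for the lexer/parser/AST pipeline."""
    info = {"table": None, "primary_key": None, "foreign_keys": []}
    m = re.search(r"CREATE\s+TABLE\s+(\w+)", sql, re.I)
    if m:
        info["table"] = m.group(1)
    m = re.search(r"PRIMARY\s+KEY\s*\(\s*(\w+)\s*\)", sql, re.I)
    if m:
        info["primary_key"] = m.group(1)
    for fk in re.finditer(
        r"FOREIGN\s+KEY\s*\(\s*(\w+)\s*\)\s*REFERENCES\s+(\w+)\s*\(\s*(\w+)\s*\)",
        sql, re.I,
    ):
        info["foreign_keys"].append(
            {"column": fk.group(1), "ref_table": fk.group(2), "ref_column": fk.group(3)}
        )
    return info

sql = """CREATE TABLE visit (
    visit_id INT,
    patient_id INT,
    PRIMARY KEY (visit_id),
    FOREIGN KEY (patient_id) REFERENCES patient (patient_id)
)"""
info = parse_create_table(sql)
```

The extracted primary/foreign key pairs are exactly the raw material for the PK-PK and PK-FK association mining described above.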
Then, the core data table of each business topic in the partitioned business data tables can be further identified; the associated data tables of each topic are determined based on the data table associations of the core data table; and the data of the associated tables is fused into the core data table to obtain the business data table under each topic. A business primary key is determined in the business data table; primary-key standardization is performed on the business data table based on it to obtain a primary-key-standardized data table; and data-field standardization and inter-table association standardization are performed on that table to obtain the data governance result. After the data governance result is stored in the data warehouse in a structured manner, subsequent data analysis and other processes can proceed.
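The fusion of an associated table into a topic's core table can be sketched as a simple keyed join. This is an illustrative assumption about the fusion step, not the patented algorithm; the field names (patient_id, dept) are hypothetical:

```python
def merge_topic(core_rows, assoc_rows, join_key):
    """Fuse fields of an associated table into the core table of a topic,
    joining on the shared key (a left-join sketch)."""
    assoc_index = {row[join_key]: row for row in assoc_rows}
    merged = []
    for row in core_rows:
        extra = assoc_index.get(row[join_key], {})
        combined = {**extra, **row}  # core-table fields win on name conflicts
        merged.append(combined)
    return merged

core = [{"patient_id": 1, "name": "A"}, {"patient_id": 2, "name": "B"}]
assoc = [{"patient_id": 1, "dept": "cardiology"}]
topic_table = merge_topic(core, assoc, "patient_id")
```

Core rows without a match are kept unchanged, which matches the idea that the core table anchors the topic and associated tables only enrich it.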
It should be understood that, although the steps in the flowcharts of the embodiments are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, there is no strict ordering constraint, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may comprise multiple sub-steps or stages, which need not be performed at the same time but may be performed at different times; their execution order is likewise not necessarily sequential, and they may be performed in turn or in alternation with other steps, or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the application also provides a data governance device for implementing the above data governance method. The implementation scheme provided by the device is similar to that recorded for the method, so for the specific limitations in the one or more data governance device embodiments below, reference may be made to the limitations on the data governance method above; details are not repeated here.
In one embodiment, as shown in fig. 11, there is provided a data governance device comprising:
and the statement analysis module 1102 is configured to perform database statement analysis processing on a table building statement of the service data table to be managed to obtain data table information.
And the incidence relation mining module 1104 is used for mining the base table incidence relation of the to-be-treated business data table as an entity based on the data table information to obtain the data table incidence relation.
And a service theme merging module 1106, configured to perform service theme merging processing on each partition service data table based on the partition service data table and data table association relationship obtained by partitioning the service data table to be managed, so as to obtain a service data table under each service theme.
And the data management module 1108 is configured to perform data standardized management on the service data table under each service theme to obtain a data management result.
In one embodiment, the device further includes a data synchronization module configured to: perform full synchronization of the historical data in the business data table to be governed to obtain a historical business data table; partition the historical business data table according to its partition fields; and perform partition-field conversion and business data deduplication on the partitioned historical business data table to obtain the partitioned business data tables.
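The rough partitioning by partition field, with null partitions filtered out, can be sketched as follows. The field name org is a hypothetical partition field chosen only for illustration:

```python
from collections import defaultdict

def partition_table(rows, partition_field):
    """Split a synchronized history table into per-partition tables and drop
    rows whose partition value is null, mirroring the rough partitioning step."""
    partitions = defaultdict(list)
    for row in rows:
        key = row.get(partition_field)
        if key is None:  # filter empty (null) partition values
            continue
        partitions[key].append(row)
    return dict(partitions)

history = [
    {"org": "h1", "v": 1},
    {"org": "h2", "v": 2},
    {"org": None, "v": 3},   # dropped: null partition value
    {"org": "h1", "v": 4},
]
parts = partition_table(history, "org")
```

Each value of `parts` then plays the role of one partitioned business data table for the downstream conversion and deduplication steps.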
In one embodiment, the statement parsing module 1102 is specifically configured to: perform lexical analysis on the table-building statement of the business data table to be governed to obtain lexical tokens; perform syntax parsing on the tokens to obtain an abstract syntax tree; and obtain the data table information from the abstract syntax tree.
In one embodiment, the association mining module 1104 is specifically configured to: extract the primary key to primary key (PK-PK) association information and the primary key to foreign key (PK-FK) association information from the data table information; construct a first entity-relationship (ER) diagram based on the PK-PK and PK-FK association information; perform edge screening on the first ER diagram through a community discovery algorithm to obtain a second ER diagram; and obtain the data table associations from the second ER diagram.
In one embodiment, the association mining module 1104 is specifically configured to: cluster and mine the primary key to primary key (PK-PK) association information to obtain PK-PK association cliques; determine the node degrees of the nodes in the primary key to foreign key (PK-FK) association information; screen the edges in the PK-PK cliques by node degree to obtain first screened edges; screen the edges corresponding to the PK-FK association information by node degree to obtain second screened edges; and construct the first entity-relationship diagram from the first and second screened edges.
In one embodiment, the association mining module 1104 is specifically configured to: determine, based on the node degree information, the key node with the maximum node degree in each primary key to primary key (PK-PK) association clique; take the edges in the clique that contain the key node as the retained edges, screening the clique's edges to obtain the first screened edges; and screen the edges in the primary key to foreign key (PK-FK) association information, retaining those that contain a key node, to obtain the second screened edges.
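The degree-based screening rule, keeping only edges incident to the highest-degree key node of a clique, can be sketched as follows. The table names are hypothetical, and tie-breaking between equal-degree nodes is left unspecified here:

```python
from collections import Counter

def screen_clique_edges(edges):
    """Within one PK-PK association clique, keep only edges incident to the
    node with the highest degree (the 'key node'), discarding the rest."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    key_node = max(degree, key=degree.get)
    kept = [e for e in edges if key_node in e]
    return kept, key_node

clique = [
    ("patient", "visit"),
    ("patient", "order"),
    ("patient", "lab"),
    ("visit", "order"),
]
kept, key_node = screen_clique_edges(clique)
# 'patient' has the highest degree, so only its edges survive
```

This removes redundant peripheral edges while keeping the hub table of the clique connected to all its neighbors.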
In one embodiment, the association mining module 1104 is specifically configured to: calculate, through a community discovery algorithm, the centrality of the primary key to foreign key (PK-FK) association edges in the first entity-relationship diagram to obtain a centrality calculation result; and perform edge screening on the first entity-relationship diagram based on that result to obtain the second entity-relationship diagram.
In one embodiment, the association mining module 1104 is specifically configured to: remove the primary key to foreign key (PK-FK) association edges from the first entity-relationship diagram; and re-add them to the diagram in order of descending centrality, adding an edge only when one of its two end nodes is still isolated (an isolated node being a node without any connection), to obtain the second entity-relationship diagram.
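The strip-and-re-add rule can be sketched as follows. It assumes the PK-FK edges arrive already sorted by descending centrality; the table names and the specific edge list are hypothetical:

```python
def rebuild_min_graph(pk_pk_edges, pk_fk_edges_by_centrality):
    """Keep all PK-PK edges, strip the PK-FK edges, then re-add PK-FK edges in
    descending-centrality order, but only when one endpoint is still isolated
    (i.e. has no edge yet)."""
    connected = set()
    for a, b in pk_pk_edges:  # PK-PK edges are always retained
        connected.update((a, b))
    kept = list(pk_pk_edges)
    for a, b in pk_fk_edges_by_centrality:
        if a not in connected or b not in connected:  # an endpoint is isolated
            kept.append((a, b))
            connected.update((a, b))
    return kept

pk_pk = [("patient", "visit")]
pk_fk_sorted = [  # assumed pre-sorted, highest centrality first
    ("visit", "order"),
    ("patient", "order"),  # skipped: both endpoints already connected
    ("order", "lab"),
]
min_graph = rebuild_min_graph(pk_pk, pk_fk_sorted)
```

Because an edge is added only to attach a still-isolated node, the result keeps every table reachable while dropping the weaker redundant links, which is the intent of the minimum connection graph.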
In one embodiment, the business topic merging module 1106 is specifically configured to: identify the core data table of each business topic in the partitioned business data tables; determine the associated data tables of each topic based on the data table associations of the core data table; and fuse the data of the associated tables into the core data table to obtain the business data table under each topic.
In one embodiment, the data governance module 1108 is specifically configured to: determine the business primary key in the business data table; perform primary-key standardization on the business data table based on it to obtain a primary-key-standardized data table; and perform data-field standardization and inter-table association standardization on that table to obtain the data governance result.
In one embodiment, the device further comprises a data warehouse modeling module configured to: determine the entity-relationship (ER) diagram corresponding to the data table associations; feed the ER diagram back and obtain association adjustment information for it; adjust the ER diagram based on that information to obtain a target ER diagram; and construct a data warehouse model from the target ER diagram, the model being used to store the data governance result in a structured manner.
In one embodiment, the device further comprises a data storage module configured to: perform data governance on the raw medical data table based on the data warehouse model to obtain the data governance result corresponding to the raw medical data table; and store that result to the data warehouse model in a structured manner.
All or part of the modules in the data governance device can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent from, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing data relevant to data governance. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a data governance method.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures related to the present solution and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data for analysis, stored data, and displayed data) referred to in the present application are information and data authorized by the user or sufficiently authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; the non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, or the like.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their description is specific and detailed, it should not therefore be construed as limiting the scope of the application. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, all of which fall within its protection scope. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (15)

1. A data governance method, comprising:
performing database statement parsing on a table-building statement of a business data table to be governed to obtain data table information;
taking the business data table to be governed as an entity, and performing database-table association mining on the business data table to be governed based on the data table information to obtain data table associations;
performing business topic merging on each partitioned business data table, based on the partitioned business data tables obtained by partitioning the business data table to be governed and on the data table associations, to obtain a business data table under each business topic; and
performing data standardization governance on the business data table under each business topic to obtain a data governance result.
2. The method of claim 1, further comprising:
performing full synchronization of historical data in the business data table to be governed to obtain a historical business data table; and
partitioning the historical business data table according to its partition fields, and performing partition-field conversion and business data deduplication on the partitioned historical business data table to obtain the partitioned business data tables.
3. The method of claim 1, wherein performing database statement parsing on the table-building statement of the business data table to be governed to obtain the data table information comprises:
performing lexical analysis on the table-building statement of the business data table to be governed to obtain lexical tokens;
performing syntax parsing on the lexical tokens to obtain an abstract syntax tree; and
obtaining the data table information based on the abstract syntax tree.
4. The method of claim 1, wherein taking the business data table to be governed as an entity and performing database-table association mining on it based on the data table information to obtain the data table associations comprises:
extracting primary-key-to-primary-key association information and primary-key-to-foreign-key association information from the data table information;
constructing a first entity-relationship diagram based on the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information;
performing edge screening on the first entity-relationship diagram through a community discovery algorithm to obtain a second entity-relationship diagram; and
obtaining the data table associations based on the second entity-relationship diagram.
5. The method of claim 4, wherein constructing the first entity-relationship diagram based on the primary-key-to-primary-key association information and the primary-key-to-foreign-key association information comprises:
clustering and mining the primary-key-to-primary-key association information to obtain primary-key-to-primary-key association cliques;
determining node degree information of the nodes in the primary-key-to-foreign-key association information;
screening the edges in the primary-key-to-primary-key association cliques based on the node degree information to obtain first screened edges;
screening the edges corresponding to the primary-key-to-foreign-key association information based on the node degree information to obtain second screened edges; and
constructing the first entity-relationship diagram based on the first screened edges and the second screened edges.
6. The method of claim 5, wherein screening the edges in the primary-key-to-primary-key association-information cliques based on the node degree information to obtain the first screened edges comprises:
determining, based on the node degree information, the key node with the maximum node degree in each primary-key-to-primary-key association-information clique;
taking the edges in the clique that contain the key node as edges to be retained, and screening the edges in the clique accordingly to obtain the first screened edges;
and wherein screening the edges corresponding to the primary-key-to-foreign-key association information based on the node degree information to obtain the second screened edges comprises:
taking the edges in the primary-key-to-foreign-key association information that contain the key node as edges to be retained, and screening those edges accordingly to obtain the second screened edges.
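The degree-based screening rule of claims 5–6 — find the highest-degree "key" node in an association clique and keep only the edges incident to it — can be sketched in a few lines. The tuple edge format is an assumption; ties on maximum degree are broken arbitrarily here, which the claims do not specify.

```python
from collections import Counter

def screen_clique_edges(clique_edges):
    """Inside one association clique, keep only the edges that contain
    the node with the maximum degree (the 'key node')."""
    degree = Counter()
    for u, v in clique_edges:
        degree[u] += 1
        degree[v] += 1
    key_node = max(degree, key=degree.get)      # highest-degree node in the clique
    return [(u, v) for (u, v) in clique_edges if key_node in (u, v)]
```

The effect is to collapse a dense clique into a star centered on its most-connected table, which prunes redundant pairwise associations before the graph is assembled.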
7. The method of claim 4, wherein performing edge screening on the first entity relationship graph through the community discovery algorithm to obtain the second entity relationship graph comprises:
performing, through the community discovery algorithm, a centrality calculation on the edges of the primary-key-to-foreign-key association information in the first entity relationship graph to obtain a centrality calculation result;
and performing edge screening on the first entity relationship graph based on the centrality calculation result to obtain the second entity relationship graph.
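The claim does not name the centrality measure, but edge betweenness is the quantity the classic Girvan–Newman community-discovery algorithm computes, so a Brandes-style edge-betweenness sketch is a plausible reading (an assumption, not the patent's stated choice). The adjacency-dict format is also illustrative.

```python
from collections import deque, defaultdict

def edge_betweenness(adj):
    """Edge betweenness centrality for an undirected graph given as
    {node: set(neighbors)}, via Brandes' BFS accumulation."""
    bet = defaultdict(float)
    for s in adj:
        # BFS from s: distances, shortest-path counts, predecessors
        dist = {s: 0}
        sigma = defaultdict(float); sigma[s] = 1
        preds = defaultdict(list)
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # back-propagate path dependencies onto the edges
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    # each undirected edge was counted once per endpoint's BFS source
    return {e: b / 2 for e, b in bet.items()}
```

Edges with the highest betweenness sit between communities; screening the PK-FK edges by this score separates table clusters that belong to different business themes.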
8. The method of claim 7, wherein performing edge screening on the first entity relationship graph based on the centrality calculation result to obtain the second entity relationship graph comprises:
removing the connecting edges of the primary-key-to-foreign-key association information from the first entity relationship graph;
and re-adding the connecting edges of the primary-key-to-foreign-key association information to the first entity relationship graph in descending order of centrality to obtain the second entity relationship graph, wherein an isolated node exists among the two end nodes of each re-added connecting edge, an isolated node being a node without connection information.
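One reading of the remove-then-re-add rule: strip all PK-FK edges, then restore them from highest to lowest centrality, accepting an edge only if one of its endpoints is still isolated. That interpretation, plus the edge/centrality data shapes, are assumptions; the sketch is not the patented procedure.

```python
def readd_pk_fk_edges(pk_pk_edges, pk_fk_centrality):
    """Re-add PK-FK edges in descending centrality order, keeping an edge
    only when at least one endpoint is still isolated (edge-less)."""
    connected = set()
    for u, v in pk_pk_edges:                # PK-PK edges stay in the graph
        connected.update((u, v))
    kept = []
    for (u, v), _c in sorted(pk_fk_centrality.items(), key=lambda kv: -kv[1]):
        if u not in connected or v not in connected:    # an isolated endpoint
            kept.append((u, v))
            connected.update((u, v))
    return kept
```

Under this rule every node ends up attached by its single strongest cross-table link, so the second graph stays sparse instead of re-absorbing every original PK-FK edge.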
9. The method of claim 1, wherein performing business theme merging on the partitioned business data tables based on the data table association relation, determining the business theme of each partitioned business data table, and obtaining the business data table under each business theme comprises:
identifying the core data table of each business theme among the partitioned business data tables;
determining the associated data tables of each business theme among the partitioned business data tables based on the data table association relation of the core data table;
and fusing the data information of the associated data tables into the core data table to obtain the business data table under each business theme.
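The core-table-plus-neighbors merge of claim 9 might be sketched as follows. Taking only the core table's direct neighbors as "associated tables", and fusing by unioning column sets, are simplifying assumptions; the topic and table names in the test are hypothetical.

```python
from collections import defaultdict

def merge_by_topic(core_tables, relations, tables):
    """For each theme's core table, gather its directly associated tables
    from the mined relation and fuse their columns into one theme view."""
    adj = defaultdict(set)
    for a, b in relations:                  # undirected association relation
        adj[a].add(b)
        adj[b].add(a)
    topics = {}
    for topic, core in core_tables.items():
        members = {core} | adj[core]        # core table + associated tables
        fused_columns = set()
        for name in members:
            fused_columns |= set(tables.get(name, []))
        topics[topic] = {"tables": members, "columns": fused_columns}
    return topics
```

A real fusion step would join rows on shared keys rather than union column names, but the grouping logic is the same.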
10. The method of claim 9, wherein obtaining the data governance result based on the business data table under each business theme comprises:
determining the business primary key in the business data table;
performing primary key standardization on the business data table based on the business primary key to obtain a primary-key-standardized data table;
and performing data field standardization and inter-table association standardization on the primary-key-standardized data table to obtain the data governance result.
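A minimal sketch of the primary-key standardization step: rename the business primary key to a uniform column name and normalize its values. The standard name `"id"` and the strip/lower-case normalization are assumptions, not the patent's rules.

```python
def standardize_primary_key(rows, business_key, std_name="id"):
    """Rename the business primary-key column to a standard name and
    normalize its values (stringified, stripped, lower-cased)."""
    out = []
    for row in rows:
        row = dict(row)
        value = str(row.pop(business_key)).strip().lower()
        out.append({std_name: value, **row})
    return out
```

With every theme table carrying the same standardized key column, the later inter-table association standardization can join tables without per-source key mappings.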
11. The method according to any one of claims 1 to 10, further comprising:
determining the entity relationship graph corresponding to the data table association relation;
feeding back the entity relationship graph and acquiring association adjustment information corresponding to the entity relationship graph;
adjusting the entity relationship graph based on the association adjustment information to obtain a target entity relationship graph;
and constructing a data warehouse model according to the target entity relationship graph, wherein the data warehouse model is used for storing the data governance result in a structured manner.
12. The method of claim 11, wherein the business data table to be governed comprises a raw medical data table, the method further comprising:
performing data governance on the raw medical data table based on the data warehouse model to obtain a data governance result corresponding to the raw medical data table;
and storing the data governance result corresponding to the raw medical data table in the data warehouse model in a structured manner.
13. A data governance device, the device comprising:
a statement parsing module, configured to perform database statement parsing on the table-building statement of a business data table to be governed to obtain data table information;
an association relation mining module, configured to take the business data table to be governed as an entity and perform base-table association mining on the business data table to be governed based on the data table information to obtain a data table association relation;
a business theme merging module, configured to perform business theme merging on each partitioned business data table, obtained by partitioning the business data table to be governed, based on the data table association relation, to obtain a business data table under each business theme;
and a data governance module, configured to perform data standardization governance on the business data table under each business theme to obtain a data governance result.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 12.
CN202211564533.3A 2022-12-07 2022-12-07 Data governance method, data governance device, computer equipment and storage medium Pending CN115858513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211564533.3A CN115858513A (en) 2022-12-07 2022-12-07 Data governance method, data governance device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211564533.3A CN115858513A (en) 2022-12-07 2022-12-07 Data governance method, data governance device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115858513A true CN115858513A (en) 2023-03-28

Family

ID=85670763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211564533.3A Pending CN115858513A (en) 2022-12-07 2022-12-07 Data governance method, data governance device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115858513A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501757A (en) * 2023-06-20 2023-07-28 鹏城实验室 ER diagram-based simulation data construction method and device
CN116501757B (en) * 2023-06-20 2023-10-03 鹏城实验室 ER diagram-based simulation data construction method and device
CN117112536A (en) * 2023-08-25 2023-11-24 中电金信软件有限公司 Database migration method, device, equipment and storage medium
CN117539861A (en) * 2023-10-20 2024-02-09 国家开放大学 Relational data table association reconstruction method and device for data management

Similar Documents

Publication Publication Date Title
CN112685385B (en) Big data platform for smart city construction
CN115858513A (en) Data governance method, data governance device, computer equipment and storage medium
CN109657074B (en) News knowledge graph construction method based on address tree
CN110300963A (en) Data management system in large-scale data repository
EP3513313A1 (en) System for importing data into a data repository
CN109522290B (en) HBase data block recovery and data record extraction method
Khattak et al. Change management in evolving web ontologies
CN106339274A (en) Method and system for obtaining data snapshot
CN104318481A (en) Power-grid-operation-oriented holographic time scale measurement data extraction conversion method
WO2021032146A1 (en) Metadata management method and apparatus, device, and storage medium
CN106802905B (en) Collaborative data exchange method of isomorphic PLM system
CN114925045A (en) PaaS platform for large data integration and management
CN113420026B (en) Database table structure changing method, device, equipment and storage medium
CN116662441A (en) Distributed data blood margin construction and display method
CN112148578A (en) IT fault defect prediction method based on machine learning
CN114329096A (en) Method and system for processing native map database
CN115640406A (en) Multi-source heterogeneous big data analysis processing and knowledge graph construction method
CN107704620B (en) Archive management method, device, equipment and storage medium
CN115640300A (en) Big data management method, system, electronic equipment and storage medium
CN115203435A (en) Entity relation generation method and data query method based on knowledge graph
CN113672692B (en) Data processing method, data processing device, computer equipment and storage medium
CN113626447B (en) Civil aviation data management platform and method
Gu Integration and optimization of ancient literature information resources based on big data technology
Glake et al. Data management in multi-agent simulation systems
CN116680445B (en) Knowledge-graph-based multi-source heterogeneous data fusion method and system for electric power optical communication system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination