CN113946568A

CN113946568A - Data management system and method

Info

Publication number: CN113946568A
Application number: CN202010683514.7A
Authority: CN
Inventors: 王苏栋; 张学武; 贾森
Original assignee: Cic Guoxin Beijing Technology Development Co ltd
Current assignee: Cic Guoxin Beijing Technology Development Co ltd
Priority date: 2020-07-15
Filing date: 2020-07-15
Publication date: 2022-01-18

Abstract

The invention is applicable to the technical field of data processing and provides a data management system and method. The data management system comprises: a data source manager for acquiring source data information of the data to be processed; a data model builder for building, according to the source data information, a data model corresponding to the source data information, the data model being configured with a preset data governance rule; and a governance script generator for acquiring model information of the data model and generating a data governance script according to the model information and the preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed. The invention takes the data model as standard guidance and standardizes the standards and formats of complex data, thereby avoiding disordered data structures and inconsistent standards and formats during data governance. It also reduces the workload of data governance and avoids the complex operations, such as manually writing code and stored procedures, required in traditional data governance.

Description

Data management system and method

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a data management system and a data management method.

Background

With the explosive development of information technology and the Internet, mankind has entered the big data era, and data has become a fundamental strategic resource of today's world. The development of big data technology can be divided into three stages. In the early stage, in the 1990s, as data mining theory and database technology gradually matured, a number of business intelligence tools and knowledge management technologies began to be applied, such as data warehouses, expert systems and knowledge management systems. The data of this stage was essentially system operational data. In the mature stage, the first decade of the 21st century, Web 2.0 applications developed rapidly and generated large amounts of unstructured data that traditional processing methods could not handle, which drove the rapid rise of big data technology, and the Hadoop platform came to prominence. The data of this stage was essentially user-generated data. In the large-scale application stage, after 2010, big data applications spread to all industries, decisions became data-driven, and the level of intelligence of the information society improved greatly.

At present, the main factor limiting the use of big data technology is its disordered use. Because the technology is not selected to fit the specific application scenario and the characteristics of the mass data involved, many problems arise, such as high system construction cost and poor application results. In many application scenarios, the big data technology adopted is ill-suited to one or more of the data acquisition, storage and processing stages. Moreover, many practitioners have not fully mastered the technology when making the selection or, lacking practical experience, select a technology by considering only a single dimension rather than multiple dimensions, so that the selection is often not result-oriented and lacks a clear purpose. Taking a social credit system application as an example, credit-related data must be collected from every government department and channel, which raises the problems of data acquisition from multi-source heterogeneous systems, inconsistent data standards, difficult data integration and application, and poor data timeliness, and imposes a large amount of complex and tedious coding work on those implementing the data projects.

Therefore, the existing data governance mode suffers from the technical problems of difficult data acquisition from multi-source heterogeneous systems, inconsistent data standards, difficult data integration and application, and poor data timeliness.

Disclosure of Invention

An embodiment of the invention aims to provide a data management system that solves the technical problems of the conventional data governance mode, namely difficult data acquisition from multi-source heterogeneous systems, inconsistent data standards, difficult data integration and application, and poor data timeliness.

The embodiment of the invention is realized as follows: a data management system comprises a data source manager, a data model builder and a governance script generator, the data source manager being in communication with the data model builder;

the data source manager is used for acquiring source data information of data to be processed;

the data model builder is used for building a data model corresponding to the source data information according to the source data information, and the data model is configured with a preset data governance rule;

the governance script generator is used for acquiring model information of the data model, and for generating a data governance script according to the model information and the preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed.

Another objective of an embodiment of the present invention is to provide a data management method, including:

acquiring source data information of data to be processed;

establishing a data model corresponding to the source data information according to the source data information, wherein the data model is configured with a preset data governance rule;

obtaining model information of the data model;

and generating a data governance script according to the model information and the preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed.

In the data management system provided by the embodiment of the invention, the data source manager manages the data sources and obtains the source data information of the data to be processed, which shields the differences among heterogeneous data and reduces the difficulty of data integration and application. The data model builder models the data to be processed, so that data standards and data relationships are managed through the data model and the standards and formats of complex data are standardized, avoiding disordered data structures and inconsistent standards and formats during data governance. Finally, the governance script generator automatically generates the data governance script according to the model information of the data model and the preset data governance rule, which reduces the workload of data governance, avoids the manual writing of code and stored procedures required in conventional data governance, removes the need to know the characteristic attributes of the data sources of the various access platforms, and saves time.

Drawings

FIG. 1 is a schematic structural diagram of a data governance system according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a data source manager provided by an embodiment of the present invention;

FIG. 3 is a UML diagram of data source management provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a hierarchical data model provided by an embodiment of the present invention, taking the legal person library in a credit system application as an example;

FIG. 5 is a conceptual model diagram of the legal person library provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of the logical model of the legal person library provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of a process for processing a cleaning rule according to an embodiment of the present invention;

FIG. 8 is a diagram illustrating a conversion rule processing procedure according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a string processing procedure according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of a governance script generator according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of another data governance system according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of a conversion task management process according to an embodiment of the present invention;

FIG. 13 is a diagram of the data changes under the extraction and loading strategies according to an embodiment of the present invention;

FIG. 14 is a schematic diagram of a governance script runner provided by an embodiment of the present invention;

FIG. 15 is a flow chart of an implementation of a data governance method according to an embodiment of the present invention;

FIG. 16 is a flow chart of another implementation of a data governance method according to an embodiment of the present invention;

FIG. 17 is a flowchart illustrating another implementation of a data governance method according to an embodiment of the present invention;

FIG. 18 is a flowchart of an implementation of another data governance method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, etc. may be used to describe various information in embodiments of the present invention, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another.

In the conventional data governance mode, the problems of data acquisition from multi-source heterogeneous systems, inconsistent data standards, difficult data integration and application, and poor data timeliness raise four questions: how to unify data standards, how to shield the complexity caused by multi-source heterogeneity, how to integrate data efficiently, and how to keep data synchronization timely. Conventionally these can only be addressed by combining database stored procedures (Oracle, MySQL, GBase, etc.), various data platforms and technical frameworks (Hadoop, MapReduce, Spark, etc.) and ETL tools (such as Kettle), at the cost of a large amount of coding work. Therefore, the embodiment of the invention provides a data governance system that takes the data model as standard guidance, configures extraction, cleaning and loading strategies for the specific data environment to generate data governance tasks, and automatically publishes and schedules those tasks; on top of this automation, tasks can also be configured and scheduled individually according to actual requirements. The system can effectively address the multi-source, complexity and timeliness problems of data governance in existing credit system applications. Specifically, the data source manager manages the data sources and obtains the source data information of the data to be processed, which shields the differences among heterogeneous data, reduces the difficulty of data integration and application, and avoids the disordered use of big data technology, a problem that arises because the technology covers a wide range, is hard to master and is easily misused in practice. The data model builder models the data to be processed, so that data standards and data relationships are managed through the data model and the standards and formats of complex data are standardized, avoiding disordered data structures and inconsistent standards and formats during data governance. Finally, the governance script generator automatically generates the data governance script according to the model information of the data model and the preset data governance rule, which reduces the workload of data governance and avoids the manual writing of code and stored procedures required in traditional data governance; at the same time, data mappings are adapted automatically on access, so the characteristic attributes of the data sources of the various access platforms need not be known, which saves time.

To further explain the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description of the embodiments, structures, features and effects according to the present invention will be given with reference to the accompanying drawings and preferred embodiments.

Fig. 1 shows a schematic structural diagram of a data governance system provided by an embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:

In the embodiment of the invention, the data governance system comprises a data source manager 101, a data model builder 102 and a governance script generator 103, wherein the data source manager 101 is in communication with the data model builder 102.

The data source manager 101 is configured to obtain source data information of data to be processed.

In the embodiment of the present invention, the data source manager 101 acts as a data source interface that can access the multi-source heterogeneous data sources of different organizations. It supports relational databases, such as Oracle, MySQL, SQL Server, GBase (Nanda General), Kingbase (Renda Jincang) and Dameng; non-relational stores, such as HBase, Hive, HDFS and MongoDB; files (Excel, CSV) and FTP; and services, such as HTTP services and web services. As shown in the schematic block diagram of the data source manager in fig. 2, relational sources, non-relational sources, files and service APIs are accessed; a data directory, metadata information management and data services are formed inside the data source manager; and this information is published in the form of services. Referring to the UML diagram of data source management shown in fig. 3, the differences among data sources are handled by encapsulating the behavior of each kind of data source in a class that inherits from IDBService.
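The inheritance arrangement described for fig. 3 can be illustrated with a minimal sketch. Only the name `IDBService` comes from the text; the two methods, the in-memory adapter classes and `build_catalog` are hypothetical stand-ins chosen for illustration, not the interface of the actual system.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class IDBService(ABC):
    """Common interface for data source adapters (methods are assumed, not from the patent)."""

    @abstractmethod
    def list_datasets(self) -> List[str]:
        """Return the names of the tables, collections or files the source exposes."""

    @abstractmethod
    def describe(self, dataset: str) -> Dict[str, str]:
        """Return field name -> field type metadata for one dataset."""


class InMemoryRelationalService(IDBService):
    """Stand-in for a relational adapter; a real one would query the database catalog."""

    def __init__(self, tables: Dict[str, Dict[str, str]]) -> None:
        self._tables = tables

    def list_datasets(self) -> List[str]:
        return sorted(self._tables)

    def describe(self, dataset: str) -> Dict[str, str]:
        return dict(self._tables[dataset])


class InMemoryFileService(IDBService):
    """Stand-in for a CSV/Excel adapter: each file is one dataset, every column is text."""

    def __init__(self, files: Dict[str, List[str]]) -> None:
        self._files = files

    def list_datasets(self) -> List[str]:
        return sorted(self._files)

    def describe(self, dataset: str) -> Dict[str, str]:
        return {column: "string" for column in self._files[dataset]}


def build_catalog(sources: Dict[str, IDBService]) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Build a uniform data directory without knowing the concrete source types."""
    return {
        name: {ds: service.describe(ds) for ds in service.list_datasets()}
        for name, service in sources.items()
    }
```

Because `build_catalog` only calls the shared interface, adding a new kind of source means adding one subclass, which is the sense in which the differences among sources are shielded.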

In embodiments of the present invention, the source data information includes, but is not limited to, the database type and the database connection information (username, password, IP, port) and, for a relational database, table information.

In the embodiment of the invention, acquiring the source data information of the data to be processed through the data source manager 101 shields the differences among heterogeneous data, greatly reduces the difficulty of data integration and application, and avoids the disordered use of big data technology, a problem that arises because the technology covers a wide range, is hard to master and is easily misused in practice.
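As a rough illustration, the source data information listed above could be held in a small record type. The class name, field names, the set of relational type names and the example values below are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed identifiers for relational sources; the real system may use other names.
RELATIONAL_TYPES = {"oracle", "mysql", "sqlserver"}


@dataclass
class SourceDataInfo:
    db_type: str
    host: str
    port: int
    username: str
    password: str
    # Table name -> column names; only meaningful for relational sources.
    tables: Dict[str, List[str]] = field(default_factory=dict)

    def is_relational(self) -> bool:
        return self.db_type.lower() in RELATIONAL_TYPES

    def describe(self) -> str:
        """Human-readable summary that deliberately leaves out the password."""
        summary = f"{self.db_type}://{self.username}@{self.host}:{self.port}"
        if self.is_relational():
            summary += f" tables={sorted(self.tables)}"
        return summary


# Hypothetical connection details, for illustration only.
info = SourceDataInfo(
    "MySQL", "10.0.0.5", 3306, "etl_user", "secret",
    {"legal_person": ["id", "name"]},
)
```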

The data model builder 102 is configured to build a data model corresponding to the source data information according to the source data information, where the data model is configured with a preset data governance rule.

In the embodiment of the present invention, the data model builder 102 is configured to model data to be processed according to types, characteristics, and access protocols of various application services, and manage data standards and data relationships. The data models are divided into three types: hierarchical models, mesh models, and relational models.

In an embodiment of the present invention, the hierarchical data model is a data model that organizes data in a tree (hierarchy) structure. Drawn as a graph, it resembles an inverted tree. As follows from the definition of a tree (or binary tree) in basic data structures, each tree has exactly one root node and all other nodes are non-root nodes. Each node represents a record type that corresponds to an entity, and each field of the record type corresponds to an attribute of the entity; every record type and each of its fields must be recorded. Fig. 4 shows an example based on the legal person library in a credit system application.
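A minimal sketch of such a hierarchy follows, assuming each record type keeps a reference to its single parent. The class, the helper and the field names are hypothetical; the entity names are borrowed loosely from the legal person library example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class RecordType:
    """One node of the hierarchy: a record type with its fields and at most one parent."""
    name: str
    fields: List[str]
    parent: Optional["RecordType"] = field(default=None, repr=False)
    children: List["RecordType"] = field(default_factory=list, repr=False)

    def add_child(self, child: "RecordType") -> "RecordType":
        # A hierarchical model allows only one parent per node (1:n relationships).
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a parent")
        child.parent = self
        self.children.append(child)
        return child


def path_from_root(node: RecordType) -> List[str]:
    """Walk parent links up to the root and return the names from root to node."""
    path = []
    current: Optional[RecordType] = node
    while current is not None:
        path.append(current.name)
        current = current.parent
    return list(reversed(path))


root = RecordType("legal_person_basic_info", ["credit_code", "name"])
tax = root.add_child(RecordType("tax_info", ["tax_id"]))
branch = root.add_child(RecordType("branch_office_info", ["branch_name"]))
```

Attaching `tax` under `branch` as well would raise `ValueError`, which is exactly the restriction that the mesh model relaxes.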

In an embodiment of the invention, the mesh data model is a data structure model that uses a directed graph to represent entities and the connections between them. It can be viewed as a relaxed extension of the hierarchical data model: nodes are allowed to exist without a parent node, so the model may contain two or more nodes that have no parent, and a node may have more than one parent node, so the model as a whole forms a directed graph. The correspondence between nodes is therefore no longer 1:n but m:n, which overcomes the limitations of the hierarchical data model.

In the embodiment of the invention, the relational data model uses tables to represent entities and the relationships between them. Relational databases, such as Oracle and MySQL, are currently the most popular and most commonly used databases.
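The difference from a tree can be sketched with a small directed-graph class in which a node may have several parents. The class, its method names and the example entities are assumptions for illustration; the `is_hierarchical` check is deliberately simple and does not detect cycles.

```python
from typing import Dict, List, Set


class MeshModel:
    """Directed graph of entities in which a node may have several parents (m:n)."""

    def __init__(self) -> None:
        self._parents: Dict[str, Set[str]] = {}
        self._children: Dict[str, Set[str]] = {}

    def link(self, parent: str, child: str) -> None:
        # Register both endpoints so that parentless nodes are still known to the model.
        for node in (parent, child):
            self._parents.setdefault(node, set())
            self._children.setdefault(node, set())
        self._children[parent].add(child)
        self._parents[child].add(parent)

    def parents_of(self, node: str) -> List[str]:
        return sorted(self._parents.get(node, set()))

    def roots(self) -> List[str]:
        """Nodes without any parent; a mesh model may have more than one."""
        return sorted(n for n, parents in self._parents.items() if not parents)

    def is_hierarchical(self) -> bool:
        """True when there is exactly one parentless node and no node has two parents."""
        single_parent = all(len(p) <= 1 for p in self._parents.values())
        return len(self.roots()) == 1 and single_parent


# Hypothetical entities: a branch office linked to two parent entities.
mesh = MeshModel()
mesh.link("legal_person", "branch_office")
mesh.link("tax_authority", "branch_office")
```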

In the embodiment of the invention, a model type is selected according to the business scenario, and the resulting model is exposed as an HTTP service. Taking the legal person library as an example, it belongs to the relational model. The legal person library is built to unify and share the basic information of legal persons, and its construction comprises conceptual model design and logical model design. The conceptual model is the highest-level data model and defines the core business concepts and their relationships; the logical model refines and decomposes the conceptual model and mainly describes entities, attributes and the relationships among entities. The legal person library describes the basic information of a legal person and mainly includes legal person basic information, extended basic information, organization code information, registration information, branch office information, tax information, statistical information and the like, as shown in fig. 5. Fig. 6 shows the logical model of the legal person library, which details the entities, entity attributes and associations in the conceptual model: the legal person basic information entity defines the attribute information common to all legal persons, and the other entities contain the attribute information specific to different types of legal persons. Taking the legal person basic information as an example, the model is described in Table 1 below:

TABLE 1

In the embodiment of the invention, the data model builder 102 is implemented with the Spring Cloud family of frameworks, which builds on the development convenience of Spring Boot to simplify the development of distributed-system infrastructure such as service registration and discovery, a configuration center, a message bus, load balancing, circuit breakers and data monitoring, all of which can be started and deployed with one click in the Spring Boot development style.

The governance script generator 103 is configured to obtain model information of the data model, and to generate a data governance script according to the model information and the preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed.

In the embodiment of the present invention, the model information of the data model includes, but is not limited to, logical relations of the data model, a directory of the model, a standard specification of information items of the model, and the like.

In the embodiment of the invention, the preset data governance rules are set by the user according to the data governance requirements and comprise data standards and the related governance rules. The data standards classify data elements and maintain key-value pairs for the data: the key is the code of the data and the value is its name. For example, in the gender classification, 0 denotes male and 1 denotes female. The governance rules are determined by the actual governance requirements; they may be cleaning rules, conversion rules, string processing rules, numeric processing rules and so on, and each is published as a service.
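The key-value form of a data standard can be sketched as a plain mapping. The gender codes come from the example above; the function names and the choice to reject unknown codes with an error are assumptions of this sketch.

```python
from typing import Dict, List

# Data standard for the gender classification: code (key) -> name (value).
GENDER_STANDARD: Dict[str, str] = {"0": "male", "1": "female"}


def decode(standard: Dict[str, str], code: str) -> str:
    """Translate a code into its standard name; unknown codes are rejected (assumed policy)."""
    try:
        return standard[code]
    except KeyError:
        raise ValueError(f"code {code!r} is not defined in the data standard") from None


def apply_standard(records: List[dict], field_name: str, standard: Dict[str, str]) -> List[dict]:
    """Return copies of the records with one coded field replaced by its standard name."""
    return [
        {**record, field_name: decode(standard, str(record[field_name]))}
        for record in records
    ]
```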

A cleaning rule takes an input parameter and returns true or false. For example, for the rule that checks whether a value is an ID card number, the input 388388111 returns false and the input 37011219881012XXXX returns true. As shown in the schematic diagram of the cleaning rule processing procedure in fig. 7, general cleaning rules are implemented in a Java back end and published as HTTP interfaces, and each cleaning rule is bound to a field in the data model. During data governance the data model is matched with the metadata, so the cleaning rule is automatically bound to the corresponding field of the source data. The cleaning rules are connected in series like a chain of responsibility, and data passing through the chain is cleaned and filtered step by step.

A conversion rule takes data A as input and returns data B, for example taking an ID card number and returning the date of birth. As shown in the schematic diagram of the conversion rule processing procedure in fig. 8, general conversion rules are implemented in a Java back end and published as HTTP interfaces, and each conversion rule is bound to a field in the model. During data governance the model is matched with the source data, so the conversion rule is automatically bound to the corresponding field of the source data. When the source data under governance passes through a conversion rule node, the existing fields are converted according to the conversion rule.

String processing covers a series of operations on string text. As shown in the schematic diagram of the string processing procedure in fig. 9, general string processing rules are implemented in a Java back end and published as HTTP interfaces, and each string processing rule is bound to a field in the model. During data governance the model is matched with the source data, so the string processing rule is automatically bound to the corresponding field of the source data, and the string operations are performed on the source data under governance according to the bound rule.

A numeric processing rule performs a calculation on a numeric value. Its interface format is, for example: name: is odd; description: determines whether a number is odd, returning true if it is and false otherwise. The numeric processing procedure is as follows: general numeric processing rules are implemented in a Java back end and published as HTTP interfaces, and each numeric processing rule is bound to a field in the model. During data governance the model is matched with the source data, so the numeric processing rule is automatically bound to the corresponding field of the source data, and the numeric operations are performed on the source data under governance according to the bound rule.
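The rule types described above can be sketched as plain functions, with cleaning rules chained so that a record must pass every rule bound to a field. This is a simplified illustration in Python rather than the Java and HTTP implementation the text describes. The ID check is format-only (18 characters with a parseable birth date, no checksum), and all function names and the synthetic ID number used with it are assumptions.

```python
import re
from datetime import date, datetime
from typing import Callable, Dict, List, Optional

# Format-only pattern: 17 digits followed by a digit or "X" (no checksum validation).
ID_PATTERN = re.compile(r"[0-9]{17}[0-9X]")


def birth_date_from_id(id_number: str) -> Optional[date]:
    """Conversion rule: ID card number in, birth date out (None if it cannot be parsed)."""
    if not ID_PATTERN.fullmatch(id_number):
        return None
    try:
        return datetime.strptime(id_number[6:14], "%Y%m%d").date()
    except ValueError:
        return None


def is_id_card_number(value: str) -> bool:
    """Cleaning rule: true when the value has the expected shape and a valid birth date."""
    return birth_date_from_id(value) is not None


def is_not_blank(value: str) -> bool:
    """Cleaning rule: true when the value contains something other than whitespace."""
    return bool(value.strip())


def is_odd(number: int) -> bool:
    """Numeric processing rule: true when the number is odd."""
    return number % 2 == 1


CleaningRule = Callable[[str], bool]


def passes_chain(value: str, chain: List[CleaningRule]) -> bool:
    """Run the rules in order; all() stops at the first rule that rejects the value."""
    return all(rule(value) for rule in chain)


def clean(records: List[Dict[str, str]],
          bindings: Dict[str, List[CleaningRule]]) -> List[Dict[str, str]]:
    """Keep only records whose bound fields pass their whole rule chain."""
    return [
        record for record in records
        if all(passes_chain(record.get(name, ""), chain)
               for name, chain in bindings.items())
    ]
```

A call such as `clean(rows, {"id_number": [is_not_blank, is_id_card_number]})` binds a two-rule chain to one field, mirroring the binding of rules to model fields.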

In the embodiment of the invention, the governance script generator processes the data model corresponding to the data to be processed according to the specified data governance rules, guided by the standard specification of the information items of the model. Because the governance rules are bound to fields in the model, the model can be matched with the source data during data governance and the rules are automatically bound to the fields of the source data, from which the corresponding data governance script is generated. The data governance scripts are scheduled and executed uniformly by the governance script runner, and the implementer organizes the governance tasks by business, which avoids the complexity of handling many separate governance tasks and keeps data transmission timely. This facilitates the application of big data technology, effectively reduces the difficulty of selecting, matching and implementing big data technology in practice, and improves the ability to apply it.
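The binding and generation step can be sketched as follows. The patent does not specify the script format or how model fields are matched to source columns, so the JSON output, the case-insensitive name matching and every name below are assumptions made for illustration.

```python
import json
from typing import Dict, List


def match_fields(model_fields: List[str], source_columns: List[str]) -> Dict[str, str]:
    """Map model fields to source columns by case-insensitive name (assumed matching rule)."""
    lookup = {column.lower(): column for column in source_columns}
    return {
        name: lookup[name.lower()]
        for name in model_fields
        if name.lower() in lookup
    }


def generate_script(model: Dict[str, object],
                    rule_bindings: Dict[str, List[str]],
                    source_columns: List[str]) -> str:
    """Emit a JSON 'script' listing, per matched source column, the rules to apply."""
    mapping = match_fields(list(model["fields"]), source_columns)
    steps = [
        {"column": column, "rule": rule}
        for field_name, column in mapping.items()
        for rule in rule_bindings.get(field_name, [])
    ]
    return json.dumps(
        {"model": model["name"], "field_mapping": mapping, "steps": steps},
        indent=2,
    )
```

The rules are bound once to model fields; the source columns they apply to are only resolved at generation time, so the same model and bindings can be reused for differently named sources.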

Fig. 10 is a schematic structural diagram of a governance script generator according to an embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.

In this embodiment of the present invention, the governance script generator 103 includes:

A model information obtaining unit 1001, configured to obtain model information of the data model.

A governance script generating unit 1002, configured to generate a data governance script according to the model information and the preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed.

Fig. 11 is a schematic structural diagram of another data governance system provided in the embodiment of the present invention, and for convenience of explanation, only the parts related to the embodiment of the present invention are shown.

In this embodiment of the present invention, the governance script generating unit 1002 includes:

A task type determining module 1101, configured to determine a task type according to the preset data governance rule.

In an embodiment of the present invention, the task type determining module 1101 determines from the data governance rule whether the governance task is a conversion task or a quality inspection task. A conversion task combines and converts data into a new data set, whereas a quality inspection task inspects an existing data set and produces a quality report. Quality inspection mainly covers field checks, information integrity, business consistency and information duplication. Accordingly, when the data governance rule involves governance rules for field checks, information integrity, business consistency or information duplication, the task is determined to be a quality inspection task; when the data governance rule involves cleaning rules, field conversion rules, a target data set and the like, the task is determined to be a conversion task.
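The decision made by the task type determining module can be sketched as a classification over the kinds of rules present. The rule-kind identifiers are hypothetical labels for the categories named above, and raising an error for mixed or unknown input is an assumption, since the text does not say how such cases are handled.

```python
from enum import Enum
from typing import Iterable


class TaskType(Enum):
    CONVERSION = "conversion"
    QUALITY_INSPECTION = "quality_inspection"


# Hypothetical identifiers for the rule categories named in the description.
QUALITY_RULE_KINDS = {
    "field_check", "information_integrity",
    "business_consistency", "information_duplication",
}
CONVERSION_RULE_KINDS = {"cleaning", "field_conversion", "target_dataset"}


def determine_task_type(rule_kinds: Iterable[str]) -> TaskType:
    """Classify a governance task from the kinds of rules it contains."""
    kinds = set(rule_kinds)
    has_quality = bool(kinds & QUALITY_RULE_KINDS)
    has_conversion = bool(kinds & CONVERSION_RULE_KINDS)
    if has_quality and not has_conversion:
        return TaskType.QUALITY_INSPECTION
    if has_conversion and not has_quality:
        return TaskType.CONVERSION
    # Mixed or unrecognized input: rejected here as an assumed policy.
    raise ValueError(f"cannot determine a single task type from {sorted(kinds)}")
```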

And a conversion script generating module 1102, configured to generate a conversion script according to the model information and a preset data governance rule when the task type is a conversion task, so that a governance script runner operates the conversion script to complete governance of the data to be processed.

In the embodiment of the invention, the cleaning rules, and/or the field conversion rules, and/or the target data sets and the like preset based on user requirements are bound with the fields in the data model, the data model is matched with the source data during data management, the preset rules are automatically bound with the fields of the source data, and when the source data of the data management passes through the nodes of the preset management rules, the existing fields are managed according to the preset management rules.

In the embodiment of the invention, the preset data governance rules comprise cleaning rules, field conversion rules and target data sets. The conversion task is to combine and convert the source data into a new data set, and the target data set is the new data set to be converted; namely, a plurality of data extraction, conversion and loading processes are packaged, so that data management is performed on source data to a target data set according to preset management rules, and as shown in fig. 12, A, B, C data sets are converted into a new data set D through a conversion task.

In this embodiment of the present invention, the conversion script generating module 1102 is configured to: generate a cleaning script and a field conversion script according to the model information, the cleaning rule and the field conversion rule; generate a corresponding target data script according to the target data set; perform data extraction and loading processing on the target data script according to the model information; and generate a conversion script from the target data script after the extraction and loading processing, the cleaning script and the field conversion script, so that a governance script runner runs the conversion script to complete governance of the data to be processed.

In the embodiment of the invention, the cleaning rule for the extracted data is read according to the model information, and data cleaning is completed by generating a cleaning script that calls the cleaning rule in the rule service; selection and renaming of source data fields are completed by generating a field conversion script.

The target data set information is read by calling the service provided by the data source manager; if the target data set is a relational database, this information includes the database type, the database connection information (user name, password, IP, and port), and the table information. When the target data set is read, the target table may not yet exist, in which case a script for creating the target data needs to be generated (for example, different table-creation scripts are generated according to the type of the target data source).

Then, according to the model information, the target data script is subjected to full load processing (periodically loading the source data in full to the target data source) and incremental load processing (loading data to the target data source from a captured point in time (update time) or checkpoint), as illustrated by the extraction and loading strategy data change diagram in fig. 13:

Full extraction -> full load: the source data set is extracted in full and loaded in full to the target data set, so that the data volume of the target data set is the same as that of the source data set; this is suitable for scenarios with a small data volume.

Full extraction -> incremental load: the source data set is extracted in full, and only incremental data is loaded to the target data set, so that the target data set grows by the increment of the source data set and changed data is updated.

Incremental extraction -> incremental load: the newly added data is extracted from the source data set and loaded to the target data set, so that the source data set is consistent with the target data set.

Incremental extraction -> full load: the newly added data of the source data set is extracted and loaded in full to the target data set, so that the target data set gains the newly added data of the source data set.

The target data script after the data extraction and loading processing, the cleaning script, and the field conversion script are connected in series and packaged, in sequence, into a conversion script.
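The four extraction and loading combinations above can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory tables, the `updated` column used as the checkpoint, and the function names `extract` and `load` are all assumptions made for illustration, and rows are assumed never to be deleted from the source.

```python
from typing import Dict, Optional

Row = Dict[str, object]
Table = Dict[int, Row]  # rows keyed by a hypothetical primary key


def extract(source: Table, checkpoint: Optional[int] = None) -> Table:
    """Full extraction when checkpoint is None; otherwise incremental
    extraction of rows whose 'updated' value is later than the checkpoint."""
    return {
        key: dict(row)
        for key, row in source.items()
        if checkpoint is None or row["updated"] > checkpoint
    }


def load(target: Table, batch: Table, incremental: bool) -> int:
    """Write the batch into the target in place and return the number of
    rows written. A full load writes every row of the batch; an incremental
    load writes only rows that are new or differ from the target's copy."""
    written = 0
    for key, row in batch.items():
        if incremental and target.get(key) == row:
            continue  # unchanged row: nothing to load
        target[key] = dict(row)
        written += 1
    return written
```

Under these assumptions, a full extraction followed by an incremental load writes only the rows the target lacks, whereas a full load rewrites every extracted row, which is why the source recommends the full-to-full combination only for small data volumes.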

The quality inspection script generation module 1103 is configured to generate a quality inspection script according to the model information and the preset data governance rules when the task type is a quality inspection task, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.

In the embodiment of the invention, the preset data governance rules comprise format check rules and business logic check rules; the format check rule includes field check, information integrity rule, etc.

The quality inspection script generating module 1103 is configured to, when the task type is a quality inspection task, generate a corresponding format check script, a corresponding business logic check script, and a corresponding quality report script according to the model information, the format check rules, and the business logic check rules; and generate a quality inspection script according to the format check script, the business logic check script, and the quality report script, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.

In the embodiment of the invention, fields are checked against the model standard according to the format check rules: for example, whether the length of a field conforms to the standard, and whether the field matches a specific format, such as that of an ID card number, a mobile phone number, an email address, or a unified social credit code. The generated field check script is in XML format, and its check logic is implemented in JavaScript. Next, the model information is compared with the data to be processed to determine whether any information is missing, for example, whether the unified social credit code in the enterprise table is missing, or whether key information such as the enterprise name is null; the generated information integrity script is in XML format. Whether the data conforms to business logic is then judged according to the logical relationships of the governance object, for example, a missing primary key, an enterprise that does not exist in the master library, or inconsistent values of the same field in different data tables, and a business consistency check script (business logic check script) in XML format is generated. Whether the data contains duplicates is likewise judged according to the logical relationships of the governance object, and an information repeatability check script in XML format is generated. Finally, a corresponding quality report script is generated from these check scripts. The scripts are connected in series, packaged into a quality inspection script in sequence, and automatically issued to the governance script runner, which is started automatically. The quality report template is as follows:

According to the data management system provided by the embodiment of the invention, the task type is determined according to the preset data governance rules, and the corresponding task script is then generated automatically based on the model information and the preset data governance rules, so that the governance script runner runs the task script to complete governance of the data to be processed. This greatly reduces the workload of data governance and avoids operations such as manually writing code and stored procedures required in traditional data governance. In addition, data mapping is connected adaptively, so the characteristic attributes of the data sources of the various access platforms do not need to be known, which saves time. Furthermore, because data governance work is reusable and cumulative, data models can be accumulated for the data structures involved in the services of the same organization and data governance tasks can be generated automatically, which gives high reusability and improves data governance efficiency.

For convenience of description, the structural schematic diagram of another data management system provided in the embodiment of the present invention only shows the parts related to the embodiment of the present invention, which are similar to the above embodiment, except that:

in the embodiment of the invention, the preset data management rule carries script scheduling period information;

the management script operator is used for configuring the scheduling period of the data management script according to the script scheduling period information; and running the data treatment script according to the scheduling period to complete the treatment of the data to be treated.

In the embodiment of the present invention, as shown in fig. 14, the governance script runner is a data governance scheduling container that includes a PDI platform, a scheduler, a task device, and a trigger. It is a web program dedicated to scheduling and monitoring data governance tasks and conversions. The whole framework is integrated using Spring + Spring MVC, conversions and jobs are executed by calling the Kettle API, and task scheduling is implemented with the Quartz framework; for details, reference may be made to the prior art, which is not described here.

In the embodiment of the invention, the governance script runner carries the execution of all governance tasks. Through its interface, the data governance scheduling container can configure the scheduling period of tasks, start and stop tasks, and classify and organize tasks by service. All data governance scripts are scheduled and executed uniformly by the governance script runner, and implementers classify and organize the governance tasks by service, which avoids disorder when there are many data governance tasks and ensures that data is transmitted in real time. It also facilitates the application of big data technology, effectively reduces the difficulty of selecting, matching, and implementing big data technology in practice, and improves the ability to apply it.
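As a rough illustration of scheduling by period with start and stop control, the following Python sketch models tasks that carry a period and a service label. It does not use Quartz or PDI; the class `GovernanceTask`, its fields, and the function `run_due_tasks` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class GovernanceTask:
    name: str
    service: str              # hypothetical label used to group tasks by service
    period: timedelta         # scheduling period taken from the governance rule
    enabled: bool = True      # False models a task that has been stopped
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        """A task is due if it is enabled and has never run, or if at
        least one full period has elapsed since its last run."""
        if not self.enabled:
            return False
        return self.last_run is None or now - self.last_run >= self.period


def run_due_tasks(tasks: List[GovernanceTask], now: datetime) -> List[str]:
    """Mark every due task as run at `now` and return the task names.
    Actually executing the governance script is outside this sketch."""
    ran = []
    for task in tasks:
        if task.is_due(now):
            task.last_run = now
            ran.append(task.name)
    return ran
```

A production scheduler would persist task state and use trigger expressions; the sketch only shows how a period carried in the rule can decide when a script runs.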

Fig. 15 is a flowchart of an implementation of a data governance method according to an embodiment of the present invention, and for convenience of description, only a part related to the embodiment of the present invention is shown, and details are as follows:

in step S1501, source data information of data to be processed is acquired.

In step S1502, a data model corresponding to the source data information is established according to the source data information, and the data model is configured with a preset data governance rule.

In the embodiment of the invention, the data to be processed is modeled according to the types, characteristics, and access protocols of the various application services, and the data standards and data relationships are managed. The data models are divided into three types: hierarchical models, network models, and relational models.

In the embodiment of the invention, a model type is selected according to the service scenario, and an HTTP service is provided. Take the legal person library as an example, which belongs to the relational model. The legal person library is built to unify and share basic legal person information, and its construction includes conceptual model design and logical model design. The conceptual model is the highest-level data model and defines the core business concepts and the relationships among them; the logical model further refines and decomposes the conceptual model and mainly describes the entities, their attributes, and the relationships among the entities. The legal person information library describes the basic information of a legal person and mainly includes legal person basic information, legal person basic information extension information, organization code information, registration information, branch office information, tax information, statistical information, and the like, as shown in fig. 5. Fig. 6 shows the logical model of the legal person library, which details the entities, entity attributes, and associations in the legal person conceptual model: the legal person basic information entity defines the attribute information common to all legal persons, and the other entities contain the attribute information specific to different types of legal persons. Taking the legal person basic information as an example, the model description is shown in table 1 above.

In step S1503, model information of the data model is acquired.

In step S1504, a data governance script is generated according to the model information and a preset data governance rule, so that a governance script runner runs the data governance script to complete governance of the data to be processed.

In the embodiment of the invention, the data model corresponding to the data to be processed is configured with the specified data governance rules through the standard specification of the model's information items, and the governance rules are bound to fields in the model. During data governance, the model can therefore be matched with the source data and automatically bound to the source data fields, and the corresponding data governance script is generated. The data governance scripts are scheduled and executed uniformly by the governance script runner, and implementers classify and organize the governance tasks by service, which avoids disorder when there are many data governance tasks and ensures that data is transmitted in real time. This also facilitates the application of big data technology, effectively reduces the difficulty of selecting, matching, and implementing big data technology in practice, and improves the ability to apply it.
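The binding of rules to model fields, and of model fields to source fields, might look like the following Python sketch. The source does not say how fields are matched; case-insensitive name matching is an assumption made here, and `bind_fields`, `generate_script`, and the rule names are hypothetical.

```python
from typing import Dict, List


def bind_fields(
    model_fields: Dict[str, List[str]], source_columns: List[str]
) -> Dict[str, dict]:
    """Bind each model field (with its governance rules) to the source
    column of the same name, ignoring case. Unmatched fields are skipped."""
    by_name = {column.lower(): column for column in source_columns}
    bindings = {}
    for field, rules in model_fields.items():
        column = by_name.get(field.lower())
        if column is not None:
            bindings[field] = {"source_column": column, "rules": list(rules)}
    return bindings


def generate_script(bindings: Dict[str, dict]) -> List[str]:
    """Emit one textual step per (rule, bound column) pair. The step
    syntax is made up for illustration only."""
    return [
        f"{rule}({binding['source_column']})"
        for binding in bindings.values()
        for rule in binding["rules"]
    ]
```

Because the rules live on the model rather than on any one source, the same model can be bound to a different source table and yield a new script without rewriting the rules.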

According to the data management method provided by the embodiment of the invention, the source data information of the data to be processed is obtained to shield the differences among heterogeneous data, which reduces the difficulty of data integration; the data to be processed is modeled so that data standards and data relationships are managed through the data model and complicated data is standardized in standard and format, which avoids mixed data structures and non-uniform standards and formats during data governance; and finally, a data governance script is generated automatically according to the model information of the data model and the preset data governance rules, which reduces the workload of data governance, avoids complicated operations such as manually writing code and stored procedures required in traditional data governance, removes the need to know the characteristic attributes of the data sources of the various access platforms, and saves time.

Fig. 16 shows an implementation flow of another data governance method provided in the embodiment of the present invention, and for convenience of description, only the parts related to the embodiment of the present invention are shown, which are similar to the above embodiment, except that the step S1504 includes:

in step S1601, a task type is determined according to the preset data governance rule.

In the embodiment of the invention, the task types include a conversion task and a quality inspection task: a conversion task combines and converts data into a new data set, whereas a quality inspection task inspects a known data set. A quality inspection task inspects existing data and produces a quality report; the inspection mainly covers field checks, information integrity, business consistency, and information repetition. That is, when the data governance rule involves governance rules such as field check, information integrity, business consistency, or information repetition, the task is determined to be a quality inspection task; when the data governance rule involves a cleaning rule, a field conversion rule, a target data set, or the like, the task is determined to be a conversion task.
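The decision in step S1601 can be sketched in Python as a lookup on which kinds of rule are present. The dictionary keys below are hypothetical names for the rule kinds listed above, and the handling of a rule set that mixes both kinds (rejected here) is an assumption, since the source does not address it.

```python
QUALITY_KEYS = {
    "field_check",
    "information_integrity",
    "business_consistency",
    "information_repetition",
}
CONVERSION_KEYS = {"cleaning_rule", "field_conversion_rule", "target_dataset"}


def determine_task_type(rule: dict) -> str:
    """Return 'conversion' or 'quality_inspection' depending on which
    kinds of governance rule appear in `rule`."""
    has_quality = bool(QUALITY_KEYS & rule.keys())
    has_conversion = bool(CONVERSION_KEYS & rule.keys())
    if has_quality == has_conversion:
        # Either both kinds or neither kind is present: not covered by the source.
        raise ValueError("cannot determine task type from rule keys")
    return "conversion" if has_conversion else "quality_inspection"
```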

In step S1602, when the task type is a conversion task, a conversion script is generated according to the model information and a preset data governance rule, so that the governance script runner runs the conversion script to complete governance of the data to be processed.

In the embodiment of the present invention, as shown in fig. 17, the step S1602 includes:

in step S1701, a cleaning script and a field conversion script are generated based on the model information, the cleaning rule, and the field conversion rule.

In the embodiment of the invention, the cleaning rule for the extracted data is read according to the model information, and data cleaning is completed by generating a cleaning script that calls the cleaning rule in the rule service; selection and renaming of source data fields are completed by generating a field conversion script.
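A minimal Python sketch of these two steps follows. The cleaning rule shown (trim whitespace and turn empty strings into nulls) and the column names are illustrative assumptions; the source does not list its cleaning rules.

```python
from typing import Dict


def clean(row: Dict[str, object]) -> Dict[str, object]:
    """Hypothetical cleaning rule: strip surrounding whitespace from
    string values and replace empty strings with None."""
    return {
        key: (value.strip() or None) if isinstance(value, str) else value
        for key, value in row.items()
    }


def convert_fields(row: Dict[str, object], mapping: Dict[str, str]) -> Dict[str, object]:
    """Field conversion: keep only the source fields named in `mapping`
    (selection) and emit them under their target names (renaming)."""
    return {target: row[source] for source, target in mapping.items()}
```

For example, with the mapping `{"qymc": "enterprise_name"}`, a source row is reduced to the single target field `enterprise_name`, and any unmapped source fields are dropped.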

In step S1702, a corresponding target data script is generated according to the target data set; and performing data extraction and loading processing on the target data script according to the model information.

In the embodiment of the invention, the target data set information is read by calling the service provided by the data source manager; if the target data set is a relational database, this information includes the database type, the database connection information (user name, password, IP, and port), and the table information. When the target data set is read, the target table may not yet exist, in which case a script for creating the target data needs to be generated (for example, different table-creation scripts are generated according to the type of the target data source).
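Generating a different table-creation script per target database type could be sketched as below. The logical type names, the two supported database types, and the function `create_table_script` are assumptions for illustration; a real generator would also handle keys, lengths, and identifier escaping.

```python
from typing import List, Tuple

# Hypothetical mapping from logical model types to column types per database.
TYPE_MAP = {
    "mysql": {"string": "VARCHAR(255)", "integer": "INT", "date": "DATE"},
    "postgresql": {"string": "VARCHAR(255)", "integer": "INTEGER", "date": "DATE"},
}
# Identifier quoting differs between the two databases.
QUOTE = {"mysql": "`", "postgresql": '"'}


def create_table_script(db_type: str, table: str, fields: List[Tuple[str, str]]) -> str:
    """Build a CREATE TABLE statement for the given target database type.
    `fields` is a list of (column name, logical type) pairs."""
    types = TYPE_MAP[db_type]
    q = QUOTE[db_type]
    columns = ", ".join(f"{q}{name}{q} {types[kind]}" for name, kind in fields)
    return f"CREATE TABLE IF NOT EXISTS {q}{table}{q} ({columns});"
```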

In step S1703, a conversion script is generated from the target data script after the data extraction and loading processing, the cleaning script, and the field conversion script, so that the governance script runner runs the conversion script to complete governance of the data to be processed.

In the embodiment of the present invention, according to the model information, the target data script is subjected to full load processing (periodically loading the source data in full to the target data source) and incremental load processing (loading data to the target data source from a captured point in time (update time) or checkpoint), as illustrated by the extraction and loading strategy data change diagram in fig. 13:

Full extraction -> full load: the source data set is extracted in full and loaded in full to the target data set, so that the data volume of the target data set is the same as that of the source data set; this is suitable for scenarios with a small data volume.

Full extraction -> incremental load: the source data set is extracted in full, and only incremental data is loaded to the target data set, so that the target data set grows by the increment of the source data set and changed data is updated.

Incremental extraction -> incremental load: the newly added data is extracted from the source data set and loaded to the target data set, so that the source data set is consistent with the target data set.

Incremental extraction -> full load: the newly added data of the source data set is extracted and loaded in full to the target data set, so that the target data set gains the newly added data of the source data set.

The target data script after the data extraction and loading processing, the cleaning script, and the field conversion script are connected in series and packaged, in sequence, into the conversion script.
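Connecting scripts in series and packaging them in order can be pictured as function composition. In the Python sketch below each "script" is modeled as a function from rows to rows; the helper `package` and the two example steps are hypothetical and stand in for the generated scripts.

```python
from typing import Callable, Iterable, List

Rows = List[dict]
Step = Callable[[Rows], Rows]


def package(steps: Iterable[Step]) -> Step:
    """Return a single callable that runs the given steps in order,
    feeding each step's output to the next."""
    ordered = list(steps)

    def run(rows: Rows) -> Rows:
        for step in ordered:
            rows = step(rows)
        return rows

    return run


def drop_nameless(rows: Rows) -> Rows:
    """Example cleaning step: discard rows without a name."""
    return [row for row in rows if row.get("name")]


def rename_name(rows: Rows) -> Rows:
    """Example field conversion step: rename 'name' to 'enterprise_name'."""
    return [{"enterprise_name": row["name"]} for row in rows]


conversion_script = package([drop_nameless, rename_name])
```

The order matters: here the renaming step relies on the cleaning step having already removed rows without a name.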

In step S1603, when the task type is a quality inspection task, a quality inspection script is generated according to the model information and a preset data management rule, so that the management script operator runs the quality inspection script to complete management of the data to be processed.

In the embodiment of the invention, the preset data governance rules comprise format check rules and business logic check rules; the format check rule includes field check, information integrity rule, etc. As shown in fig. 18, the step S1603 includes:

in step S1801, a corresponding format check script, a corresponding service logic check script, and a corresponding quality report script are generated according to the model information, the format check rule, and the service logic check rule.

In the embodiment of the invention, fields are checked against the model standard according to the format check rules: for example, whether the length of a field conforms to the standard, and whether the field matches a specific format, such as that of an ID card number, a mobile phone number, an email address, or a unified social credit code. The generated field check script is in XML format, and its check logic is implemented in JavaScript. Next, the model information is compared with the data to be processed to determine whether any information is missing, for example, whether the unified social credit code in the enterprise table is missing, or whether key information such as the enterprise name is null; the generated information integrity script is in XML format. Whether the data conforms to business logic is then judged according to the logical relationships of the governance object, for example, a missing primary key, an enterprise that does not exist in the master library, or inconsistent values of the same field in different data tables, and a business consistency check script (business logic check script) in XML format is generated. Whether the data contains duplicates is likewise judged according to the logical relationships of the governance object, and an information repeatability check script in XML format is generated. Finally, a corresponding quality report script is generated from these check scripts.
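The kinds of check described above can be sketched in Python as follows. The source implements its check logic in JavaScript inside XML scripts; this sketch only mirrors the logic. The regular expressions are simplified assumptions (an 11-digit mainland mobile number and a loose email shape), and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Simplified, assumed patterns; real validation rules are stricter.
MOBILE = re.compile(r"1[3-9]\d{9}")
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def check_format(value: str, pattern: "re.Pattern[str]", max_length: int) -> bool:
    """Field check: the value must fit the length limit and match the
    whole pattern."""
    return len(value) <= max_length and pattern.fullmatch(value) is not None


def missing_fields(row: Dict[str, object], required: List[str]) -> List[str]:
    """Information integrity check: list required fields that are absent,
    None, or an empty string."""
    return [name for name in required if row.get(name) in (None, "")]


def duplicated_keys(rows: List[Dict[str, object]], key: str) -> List[object]:
    """Information repetition check: return the key values that occur
    in more than one row, in sorted order."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)
```

A business consistency check, such as confirming that an enterprise exists in the master library, would follow the same shape with a lookup against a second data set.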

In step S1802, a quality inspection script is generated according to the format check script, the business logic check script, and the quality report script, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.

In the embodiment of the invention, the scripts are connected in series and packaged into the quality inspection script in sequence; the quality inspection script is automatically issued to the governance script runner, which is started automatically.

According to the data management method provided by the embodiment of the invention, the task type is determined according to the preset data governance rules, and the corresponding task script is then generated automatically based on the model information and the preset data governance rules, so that the governance script runner runs the task script to complete governance of the data to be processed. This greatly reduces the workload of data governance and avoids operations such as manually writing code and stored procedures required in traditional data governance. In addition, data mapping is connected adaptively, so the characteristic attributes of the data sources of the various access platforms do not need to be known, which saves time. Furthermore, because data governance work is reusable and cumulative, data models can be accumulated for the data structures involved in the services of the same organization and data governance tasks can be generated automatically, which gives high reusability and improves data governance efficiency.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A data management system, characterized by comprising a data model builder, a data source manager in communication with the data model builder, and a governance script generator;

the data source manager is used for acquiring source data information of data to be processed;

the governance script generator is used for acquiring model information of the data model; and generating a data governance script according to the model information and a preset data governance rule so that a governance script operator runs the data governance script to complete governance of the data to be processed.

2. The data governance system of claim 1, wherein the governance script generator comprises:

a model information acquisition unit for acquiring model information of the data model; and

a governance script generating unit for generating a data governance script according to the model information and a preset data governance rule, so that the governance script runner runs the data governance script to complete governance of the data to be processed.

3. The data governance system of claim 2, wherein the governance script generation unit comprises:

the task type determining module is used for determining the task type according to the preset data governing rule;

the conversion script generating module is used for generating a conversion script according to the model information and the preset data governance rule when the task type is a conversion task, so that the governance script runner runs the conversion script to complete governance of the data to be processed; and

the quality inspection script generating module is used for generating a quality inspection script according to the model information and the preset data governance rule when the task type is a quality inspection task, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.

4. The data governance system of claim 3, wherein the preset data governance rules comprise a cleaning rule, a field conversion rule, and a target dataset;

the conversion script generation module is used for generating a cleaning script and a field conversion script according to the model information, the cleaning rule and the field conversion rule; generating a corresponding target data script according to the target data set; performing data extraction and loading processing on the target data script according to the model information; and generating a conversion script from the target data script after the data extraction and loading processing, the cleaning script and the field conversion script, so that the governance script runner runs the conversion script to complete governance of the data to be processed.

5. The data governance system of claim 3, wherein the preset data governance rules comprise format check rules and business logic check rules;

the quality inspection script generation module is used for generating a corresponding format check script, a corresponding business logic check script and a corresponding quality report script according to the model information, the format check rules and the business logic check rules when the task type is the quality inspection task; and generating a quality inspection script according to the format check script, the business logic check script and the quality report script, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.

6. The data governance system of claim 1, wherein the preset data governance rules carry script scheduling period information;

7. A data governance method, comprising:

acquiring source data information of data to be processed;

obtaining model information of the data model;

8. The data governance method according to claim 7, wherein said step of generating a data governance script according to said model information and a predetermined data governance rule, so that a governance script operator runs said data governance script to complete governance of said data to be processed comprises:

determining the task type according to the preset data governing rule;

when the task type is a conversion task, generating a conversion script according to model information and a preset data treatment rule, so that a treatment script operator runs the conversion script to complete treatment on the data to be treated;

and when the task type is a quality inspection task, generating a quality inspection script according to the model information and preset data management rules, so that a management script operator operates the quality inspection script to complete the management of the data to be processed.

9. The data governance method of claim 8, wherein the preset data governance rules comprise a cleaning rule, a field conversion rule, and a target dataset;

the step of generating, when the task type is a conversion task, a conversion script according to the model information and the preset data governance rule so that the governance script runner runs the conversion script to complete governance of the data to be processed comprises:

generating a cleaning script and a field conversion script according to the model information, the cleaning rule and the field conversion rule;

generating a corresponding target data script according to the target data set; performing data extraction loading processing on the target data script according to the model information;

and generating a conversion script from the target data script after the data extraction and loading processing, the cleaning script and the field conversion script, so that the governance script runner runs the conversion script to complete governance of the data to be processed.

10. The data governance method according to claim 8, wherein the preset data governance rules comprise format check rules and business logic check rules;

the step of generating, when the task type is a quality inspection task, a quality inspection script according to the model information and the preset data governance rule so that the governance script runner runs the quality inspection script to complete governance of the data to be processed comprises:

generating a corresponding format check script, a corresponding business logic check script and a corresponding quality report script according to the model information, the format check rule and the business logic check rule;

and generating a quality inspection script according to the format check script, the business logic check script and the quality report script, so that the governance script runner runs the quality inspection script to complete governance of the data to be processed.