CN112732987A - Full life cycle data map generation system and method

Info

Publication number: CN112732987A
Application number: CN202011642227.8A
Authority: CN
Inventors: 郭德坡; 孙伟; 高体伟; 苏萌; 赵群; 左云鹏; 姜楠; 连海俊
Original assignee: Beijing Percent Technology Group Co ltd
Current assignee: Beijing Percent Technology Group Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-04-30
Anticipated expiration: 2040-12-31
Also published as: CN112732987B

Abstract

The application discloses a full-life-cycle data map generation system. Data of a data source is obtained through a multi-source heterogeneous data access task; a first blood relationship, a second blood relationship and a third blood relationship are respectively determined through a data conversion task, a data management task and a diversified data processing script task; and information of the first data and the second data under a specified analysis dimension is determined through a data source table multi-dimensional statistics task. A data map generation module generates and displays a data map according to the blood relationships and the information under the specified analysis dimension. This solves the problem that the prior-art manner of generating and displaying a data map cannot generate and display information under different analysis dimensions and the blood relationships among multi-source data, effectively enriches the analysis dimensions in which the data map presents information, and broadens the applicability of analysis results based on data map information. The application also discloses a full-life-cycle data map generation method.

Description

Full life cycle data map generation system and method

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a system and a method for generating a full-life-cycle data map.

Background

As data categories multiply, data volumes grow rapidly. A data map is a visual, interface-based way of showing the condition and distribution of data assets and the blood relationships among them. It helps the relevant personnel quickly understand how data assets are distributed, trace data back to its source, and assess in advance the impact of changes to the data.

In the prior art, a data map is usually constructed and displayed according to the degree of blood relationship association among data. This approach can only show how data are related at the blood relationship level, and therefore expresses only a two-dimensional relationship among homologous data. In actual research, development and production, however, a data map is often needed to display the blood relationships among multi-source data from different analysis dimensions, so as to enrich the analysis dimensions in which the data map presents information and to broaden the applicability of analysis results based on data map information.

Disclosure of Invention

The embodiment of the application provides a full-life-cycle data map generation system, which is used for solving the problem that the prior-art manner of generating and displaying a data map cannot generate and display information under different analysis dimensions and the blood relationships among multi-source data.

The embodiment of the application adopts the following technical scheme:

In a first aspect, a full-life-cycle data map generation system is provided, comprising a data source management module, a task configuration management module, a scheduling task management module and a data map generation module, wherein:

the data source management module is used for storing first data to be subjected to blood relationship analysis and derived from a data source;

the task configuration management module is used for configuring a multi-source heterogeneous data access task, a data conversion task, a data management task and a diversified data processing script task;

the multi-source heterogeneous data access task is used for acquiring first data from a corresponding data source according to the input identification of the first data, storing the first data into the data source management module, and determining a first blood relationship between the first data and corresponding second data stored to a target source;

the data conversion task is used for converting the first data to obtain converted data and determining a second blood relationship between the first data and the converted data;

the data management task is used for auditing the converted data to obtain audited data serving as the second data to be stored in a target source;

the diversified data processing script task is used for determining a third blood relationship between the first data and the second data according to the specified analysis dimension;

the scheduling task management module is used for scheduling and executing target tasks; the target tasks comprise the multi-source heterogeneous data access task, the data conversion task, the data management task, the diversified data processing script task and a data source table multi-dimensional statistics task, so as to obtain the first blood relationship, the second blood relationship, the third blood relationship, and information of the first data and the second data under the specified analysis dimension; the data source table multi-dimensional statistics task is used for analyzing the information of the first data and the second data under the specified analysis dimension;

and the data map generation module is used for generating and displaying a data map according to the first blood relationship, the second blood relationship, the third blood relationship and the information under the specified analysis dimension, which are obtained by executing the target task.

In a second aspect, a full-life-cycle data map generation method is provided, including:

storing first data to be subjected to blood relationship analysis, which are derived from a data source;

configuring a multi-source heterogeneous data access task, a data conversion task, a data management task and a diversified data processing script task;

scheduling and executing target tasks; the target tasks comprise the multi-source heterogeneous data access task, the data conversion task, the data management task, the diversified data processing script task and a data source table multi-dimensional statistics task, so as to obtain a first blood relationship, a second blood relationship, a third blood relationship, and information of the first data and the second data under the specified analysis dimension; the data source table multi-dimensional statistics task is used for analyzing the information of the first data and the second data under the specified analysis dimension;

and generating and displaying a data map according to the obtained first blood relationship, the second blood relationship, the third blood relationship and the information under the specified analysis dimension.

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

Data of a data source is obtained through a multi-source heterogeneous data access task; a first blood relationship, a second blood relationship and a third blood relationship are respectively determined through a data conversion task, a data management task and a diversified data processing script task; and information of the first data and the second data under a specified analysis dimension is determined through a data source table multi-dimensional statistics task. The data map generation module generates and displays a data map according to the blood relationships and the information under the specified analysis dimension. This solves the problem that the prior-art manner of generating and displaying a data map cannot generate and display information under different analysis dimensions and the blood relationships among multi-source data, effectively enriches the analysis dimensions in which the data map presents information, and broadens the applicability of analysis results based on data map information.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a general structural framework diagram of a full-life-cycle data map generation system according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of data source types available for data collection according to an embodiment of the present disclosure;

FIG. 3 is a schematic interface diagram for a user to configure the relevant information according to the embodiment of the present application;

FIG. 4 is a schematic diagram of a multi-source heterogeneous data access task configuration interface for structured data according to an embodiment of the present application;

FIG. 5 is a schematic view of a multi-source heterogeneous data access task configuration interface for unstructured data according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a data governance task configuration interface provided by an embodiment of the present application;

FIG. 7 is a diagram of a data standard system provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a diversified data processing script task configuration page provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of a task node selection interface provided in an embodiment of the present application;

FIG. 10 is a schematic diagram of a target task selection interface provided in an embodiment of the present application;

FIG. 11 is a schematic diagram of a task detail parameter setting interface provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of a configuration message notification setting interface provided in an embodiment of the present application;

FIG. 13 is a schematic diagram of a scheduling cycle setting interface for scheduling tasks according to an embodiment of the present application;

FIG. 14 is a schematic specific flowchart of a full-life-cycle data map requesting method according to an embodiment of the present disclosure;

FIG. 15 is a graphical illustration of a relationship of data provided by an embodiment of the present application;

FIG. 16 is a diagram illustrating a data map field relationship according to an embodiment of the present application;

FIG. 17 is a schematic view illustrating multidimensional information of a data resource according to an embodiment of the present application;

FIG. 18 is another schematic view illustrating multidimensional information of a data resource according to an embodiment of the present application.

Detailed Description

The first embodiment is as follows:

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.

In order to solve the problem that the prior-art manner of generating and displaying a data map cannot generate and display information under different analysis dimensions and the blood relationships among multi-source data, the embodiment of the application provides a full-life-cycle data map generation system.

Referring to fig. 1, fig. 1 is a schematic diagram of an overall structural framework of a full-life-cycle data map generating system according to an embodiment of the present disclosure.

The full-life-cycle data map generation system comprises a data source management module (namely data source management in fig. 1), a task configuration management module (namely task configuration management in fig. 1), a scheduling task management module (namely scheduling task management in fig. 1) and a data map generation module (namely data map generation in fig. 1). The specific functions of each module are described in detail as follows:

1. The data source management module is used to store first data that originates from a data source and is to be subjected to blood relationship analysis.

2. The task configuration management module is used to configure a multi-source heterogeneous data access task, a data conversion task, a data management task and a diversified data processing script task.

The diversified data processing script task is used for determining a third blood relationship between the first data and the second data according to the specified analysis dimension.

3. The scheduling task management module is used to schedule and execute target tasks. The target tasks comprise the multi-source heterogeneous data access task, the data conversion task, the data management task, the diversified data processing script task and a data source table multi-dimensional statistics task, so as to obtain the first blood relationship, the second blood relationship, the third blood relationship, and information of the first data and the second data under the specified analysis dimension.

The data source table multi-dimensional statistics task is used for analyzing the information of the first data and the second data under the specified analysis dimension.

4. The data map generation module is used to generate and display a data map according to the first blood relationship, the second blood relationship, the third blood relationship and the information under the specified analysis dimension, which are obtained by the scheduling task management module.
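The interplay of modules 3 and 4 can be illustrated with a minimal sketch. It is not the patented implementation: the function names `run_target_tasks` and `build_data_map`, the task keys, and the stub task bodies are all hypothetical, and each blood relationship (data lineage) is reduced to a list of (source, target) pairs.

```python
def run_target_tasks(tasks, context):
    """Run (name, callable) pairs in order; each task sees the shared context."""
    for name, task in tasks:
        context[name] = task(context)
    return context


def build_data_map(context):
    """Merge the three kinds of blood relationship with the per-dimension statistics."""
    edges = []
    for kind in ("access", "conversion", "script"):
        edges.extend((src, dst, kind) for src, dst in context[kind])
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    return {"nodes": nodes, "edges": edges, "dimensions": context["statistics"]}


# Stub tasks (assumptions): each returns what the real task would compute.
tasks = [
    ("access", lambda ctx: [("src.users", "ods.users")]),          # first relationship
    ("conversion", lambda ctx: [("src.users", "tmp.users_clean")]),  # second relationship
    ("governance", lambda ctx: "audited"),
    ("script", lambda ctx: [("src.users", "ods.users")]),          # third relationship
    ("statistics", lambda ctx: {"ods.users": {"row_count": 2}}),
]
data_map = build_data_map(run_target_tasks(tasks, {}))
print(data_map["nodes"])
```

Keeping the relationship kind on every edge lets a display layer filter or colour edges by the task that produced them.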

The data source management module (module 1) can acquire data from any type of data source and can be configured flexibly.

For example, as shown in fig. 2, the data source management module may display the types of data sources available for data collection on the display device, where the types of data sources include, but are not limited to:

an FTP file server; NoSQL non-relational databases (e.g., Memcached, Redis, MongoDB, etc.); RDBMS relational databases such as Kingbase, MySQL, Oracle, DM, UXDB, Gauss 200 and Shentong; and the like.

Based on the displayed data source types, the user can use a user input device (such as a mouse, a keyboard and the like) to select the type of the data source whose data is to be subjected to blood relationship analysis, and then configure the related information of the data source. FIG. 3 shows the interface that the data source management module displays on the display device for the user to configure this information.

The configured information may include, for example, the name of the data source, the machine name/IP address, the port number, the user name, the password, the database, the JDBC parameters, the connection pool trigger, and the like. In the embodiment of the present application, the information selected and/or configured by the user for uniquely locating the data source is referred to as a data source identifier. Based on the data source identifier, the data source management module can uniquely locate the corresponding data source, and then acquire data from the data source.

The selected and configured information is input into the data source management module, so that the data source management module can acquire data according to the information.

It can be understood that, if the data source identifiers of different data sources are input to the data source management module, the data source management module may obtain data from the different data sources according to the data source identifiers of the different data sources, thereby implementing the obtaining of "multi-source data".
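As a rough illustration of a data source identifier, the sketch below groups the configured items into one record and registers several of them to obtain "multi-source data". The class names, the choice of fields that make up the locating key, and the sample hosts are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSourceIdentifier:
    """Hypothetical container for the configured items of one data source."""
    name: str
    source_type: str   # e.g. "mysql", "ftp"
    host: str          # machine name or IP address
    port: int
    database: str
    username: str
    password: str = field(default="", repr=False)  # keep the password out of repr()

    def key(self):
        # Fields assumed sufficient to locate one data source uniquely.
        return (self.source_type, self.host, self.port, self.database)


class DataSourceRegistry:
    """Holds identifiers so that data can later be fetched from each source."""

    def __init__(self):
        self._sources = {}

    def add(self, ident):
        self._sources[ident.key()] = ident

    def names(self):
        return sorted(s.name for s in self._sources.values())


registry = DataSourceRegistry()
registry.add(DataSourceIdentifier("orders", "mysql", "10.0.0.5", 3306, "sales", "etl", "secret"))
registry.add(DataSourceIdentifier("logs", "ftp", "10.0.0.9", 21, "", "etl"))
print(registry.names())
```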

In one embodiment, the function of the data source management module may be implemented by the four steps shown in FIG. 1, namely: selecting a data source type, adding a new data source, testing the connection, and storing the data source. The steps are explained as follows:

Selecting the data source type means acquiring the information of the data source type and the address of the data source input by the user. In the embodiment of the present application, the information of the data source type and the address of the data source belong to the category of the data source identifier.

Adding a new data source means storing the data source type information and the address input by the user, so that data can subsequently be obtained from the corresponding data source according to the stored information and address.

Testing the connection means requesting a connection to the data source that corresponds to the stored address and matches the stored data source type information, and testing whether the connection is established successfully.

Storing the data source means acquiring data from the data source once the test connection has been established successfully. For example, the full amount of data from the data source may be obtained, or data generated by the data source within a specified time frame (e.g., the last week) may be obtained, etc.
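The test connection and data acquisition steps might look like the following sketch. An in-memory SQLite database stands in for a real data source, and the helper names `try_connect` and `fetch_rows`, the `events` table and its `created_at` column are all hypothetical.

```python
import sqlite3


def try_connect(address):
    """Test connection: return an open connection, or None if it cannot be established."""
    try:
        conn = sqlite3.connect(address)
        conn.execute("SELECT 1")
        return conn
    except sqlite3.Error:
        return None


def fetch_rows(conn, table, since=None):
    """Fetch the full data, or only rows whose created_at is on or after `since`."""
    # `table` is assumed to come from trusted configuration, not from end users.
    sql = f"SELECT id, created_at FROM {table}"
    params = ()
    if since is not None:
        sql += " WHERE created_at >= ?"
        params = (since,)
    return conn.execute(sql + " ORDER BY id", params).fetchall()


conn = try_connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "2020-12-01"), (2, "2020-12-28"), (3, "2020-12-30")],
)
full = fetch_rows(conn, "events")                        # full amount of data
recent = fetch_rows(conn, "events", since="2020-12-24")  # specified time frame
print(len(full), len(recent))
```

ISO-formatted date strings compare correctly as text, which is why the time-frame filter can use a plain `>=` here.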

In a specific embodiment, the functions of the task configuration management module (module 2) are shown in FIG. 1 and comprise three parts: data blood relationship analysis strategy formulation, data auditing, and script publishing.

Data blood relationship analysis strategy formulation may include the following steps shown in FIG. 1 (these steps are not necessarily executed in the order shown in the figure):

multi-source heterogeneous data access task configuration, selecting a source and a target, configuring mapping information, and setting a data blood relationship analysis strategy; and data conversion task configuration, selecting a database table, configuring conversion rules, and setting a data blood relationship analysis strategy.

Data auditing may include the following steps shown in FIG. 1 (which are not necessarily performed in the order shown):

data management task configuration, data standard system (establishment), selecting the audit data table, and configuring field and table audit rules.

Script publishing, which may include the following steps shown in FIG. 1 (which steps are not necessarily performed in the order shown in the figure):

diversified data processing script task configuration, selecting a script type, developing the script task, and debugging, saving and publishing the script task.

First, the respective steps involved in the "data-blood-relationship analysis strategy formulation" are described.

Multi-source heterogeneous data access task configuration means generating the multi-source heterogeneous data access task according to the configuration information that the user inputs through the multi-source heterogeneous data access task configuration interface shown in FIG. 4 or FIG. 5. The task goals of this task include: storing the data acquired from the data source to the target source; and analyzing, based on the data blood relationship analysis strategy that has been set (referred to as strategy 1 for short), the blood relationship between the data acquired from the data source and the data stored to the target source.

The configuration items shown in fig. 4 are mainly used for configuring data access tasks for "structured" type data, and the configuration items mainly include 4 categories: basic information, source, target, and advanced settings.

The basic information includes: configuration items such as resource names, types, import modes and the like;

the source includes: configuration items such as source type, data source, source database, source table name and source table field;

the target includes: configuration items such as a target source, a target library, a target table name, a target table field and the like;

advanced settings include: a configuration item for whether to overwrite existing data.

For example, assuming that the user configures an import mode, a data source, a source table, a field, a target source, a target library and a target table name, a multi-source heterogeneous data access task may be generated according to this information. The task comprises: using the import mode, acquiring the data in the field from the source table of the data source stored by the data source management module, and storing it in the table with the target table name in the target library of the target source.

By executing the multi-source heterogeneous data access task, the task configuration management module can acquire the data stored by the data source management module and analyze the blood relationship (first blood relationship) between the acquired data and the corresponding data stored to the target source.
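A minimal sketch of such an access task follows, assuming that the first blood relationship (data lineage) can be recorded as one edge per mapped field. The `FieldRef` record, the function `run_access_task`, and the sample source and target names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRef:
    """Locates one field: source (or target source), database, table, field."""
    source: str
    database: str
    table: str
    field: str


def run_access_task(source_rows, field_mapping, src, dst):
    """Copy the mapped fields to the target and return (target_rows, lineage_edges).

    `src` and `dst` are (source, database, table) triples taken from the task
    configuration; `field_mapping` maps source field names to target field names.
    """
    target_rows = [
        {dst_f: row[src_f] for src_f, dst_f in field_mapping.items()}
        for row in source_rows
    ]
    edges = [
        (FieldRef(*src, src_f), FieldRef(*dst, dst_f))
        for src_f, dst_f in field_mapping.items()
    ]
    return target_rows, edges


rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
target_rows, first_lineage = run_access_task(
    rows,
    {"id": "user_id", "name": "user_name"},
    src=("mysql_prod", "crm", "users"),
    dst=("warehouse", "ods", "ods_users"),
)
print(first_lineage[0])
```

Because each edge carries the full source, database, table and field path, coarser relationships (table to table, library to library) can be derived later by dropping trailing components.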

The configuration items shown in fig. 5 are mainly used for configuring data access tasks for data of an "unstructured" type, and mainly include 4 categories: basic information, source, target, and advanced settings. The configuration items of each category are specifically shown in fig. 5, and are not described herein again.

Selecting a source and a target means that, according to the source information input by the user, an available data source matching that information is created in the data source management module (the data in the created data source may be acquired by the data source management module from other connected data sources and stored), and, according to the target information input by the user, a target source matching that information is created in the task configuration management module.

The created data source comprises a source table; the created target source comprises a target library containing a target table with the user-configured target table name.

Configuring the mapping information means establishing, in the task configuration management module, the mapping relationship between the data source and the target source. The mapping can be refined to the level of the source database, source table and source table field of the data source on one side, and the target library, target table and target table field of the target source on the other. The mapping relationship characterizes which data source the data comes from (down to a specific source database, source table and source table field) and which target source it is finally stored to (down to a specific target library, target table and target table field).

Setting a data blood relationship analysis strategy (which is called strategy 1 for short) refers to setting the data blood relationship analysis strategy in the task configuration management module. The policies may be input to the task configuration management module by a user.

The data blood relationship analysis strategy can include, but is not limited to, at least one of the following:

a homologous, same-database data blood relationship analysis strategy, used for blood relationship analysis of data of the same type of database from the same type of data source;

a homologous, different-database data blood relationship analysis strategy, used for blood relationship analysis of data of different types of databases from the same type of data source;

a heterologous data blood relationship analysis strategy, used for blood relationship analysis of data of the same or different types of databases from different data sources.

Under these strategies, whether data is homologous or heterologous may be determined as follows: for the data to be subjected to blood relationship analysis, the attributes of the data sources from which the data originates are compared, based on the mapping relationship established in the step of configuring mapping information. If the attributes are the same, the data is judged to be homologous; otherwise, it is judged to be heterologous.

Further, based on the mapping relationship, if the data to be subjected to blood relationship analysis comes from the same database of the same data source, it is judged to be homologous, same-database data; if it comes from different databases of the same data source, it is judged to be homologous, different-database data.

The attribute may be the type of the data source, i.e., the source type. Specific types include, for example, FTP file servers and RDBMS relational databases such as Kingbase, MySQL, Oracle, DM, UXDB, Gauss 200, Shentong, and the like.

The homologous, same-database data blood relationship analysis strategy comprises: for homologous, same-database data, calculating the table-to-table, field-to-table and field-to-field blood relationships within the database;

the homologous, different-database data blood relationship analysis strategy comprises: for homologous, different-database data, calculating the table-to-library, table-to-table, field-to-table and field-to-field blood relationships;

the heterologous data blood relationship analysis strategy comprises: for heterologous data, calculating blood relationships of all types, namely at the field-to-field, field-to-table, table-to-library and data source-to-data source levels.
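The choice among the three strategies can be sketched as a small classifier. This is only one reading of the text: it assumes the compared attribute is the source type, and it treats two data sets as "same database" only when both the data source and the database match. The dictionary keys and function names are hypothetical.

```python
# Relationship levels each strategy computes, copied from the three descriptions above.
LEVELS = {
    "homologous_same_db": ["table-to-table", "field-to-table", "field-to-field"],
    "homologous_different_db": [
        "table-to-library", "table-to-table", "field-to-table", "field-to-field",
    ],
    "heterologous": [
        "field-to-field", "field-to-table", "table-to-library", "source-to-source",
    ],
}


def classify(a, b):
    """Pick a strategy for two data locations given as dicts with the keys
    'source_type', 'source' and 'database'."""
    if a["source_type"] != b["source_type"]:
        return "heterologous"
    same_db = (a["source"], a["database"]) == (b["source"], b["database"])
    return "homologous_same_db" if same_db else "homologous_different_db"


def lineage_levels(a, b):
    return LEVELS[classify(a, b)]


crm = {"source_type": "mysql", "source": "prod", "database": "crm"}
erp = {"source_type": "mysql", "source": "prod", "database": "erp"}
files = {"source_type": "ftp", "source": "share", "database": ""}
print(classify(crm, crm), classify(crm, erp), classify(crm, files))
```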

The following introduces the remaining steps that implement the data blood relationship analysis strategy formulation function:

Data conversion task configuration means configuring a task that converts the data whose blood relationship is to be analyzed, in order to ensure the usability of the data. The task goals of this task include: converting data from a data source, and determining, based on the data blood relationship analysis strategy that has been set (strategy 2, described later), the blood relationship between the data before conversion and the data after conversion.

The data conversion is performed after the data is acquired from the data source and before the data is stored in the target source. Data acquired from a data source may be converted at this stage by performing a data conversion task, and then the converted data may be stored to a target source.

The conversion referred to herein may include, but is not limited to: summarizing data, identifying and correcting errors in data, deduplicating, selecting and/or excluding data according to given criteria, sorting (numeric sorting, alphabetical sorting, temporal sorting, reverse sorting, zone sorting, etc.), transposing (rotating row data into column data, or rotating column data into row data), splitting (dividing a column of data into two or more columns of data), and so forth.

Selecting a database table refers to selecting a source database to be converted from the source databases created by the data source management module.

Configuring the conversion rule means configuring the rules that the conversion task follows during execution, for example, the criteria used to identify and correct data errors, the criteria used to sort data, and the like.

Setting a data blood relationship analysis strategy (which is called herein as strategy 2 for short) refers to setting the data blood relationship analysis strategy in the task configuration management module. The policies may be input to the task configuration management module by a user.

However, setting strategy 2 differs from setting the data blood relationship analysis strategy described above (strategy 1), because the two strategies themselves are different.

Strategy 1 is the data blood relationship analysis strategy set for the multi-source heterogeneous data access task and is relatively simple: it only needs to analyze the blood relationships of homologous same-database data, homologous different-database data and heterologous data, calculating the table-to-table, field-to-table, field-to-field, table-to-library and data source-to-data source blood relationships as applicable. Strategy 2 is relatively complex: a blood relationship analysis strategy must be set for each conversion rule adopted, so as to accurately determine the blood relationship between the data from the data source before conversion and the data stored in the target source after conversion. Conversion rules that do not affect the blood relationship can be ignored, and no corresponding data blood relationship analysis strategy needs to be set for them.
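The idea that each conversion rule carries its own blood relationship (data lineage) rule can be sketched as follows. Splitting a column produces one edge per new column, while sorting is treated as a rule that does not affect the relationship and therefore produces none. The function names and the sample column names are assumptions made for illustration.

```python
def split_column(rows, column, into, sep=" "):
    """Split one column into several; every new column descends from the original."""
    out = []
    for row in rows:
        parts = row[column].split(sep, len(into) - 1)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(into, parts))
        out.append(new_row)
    edges = [(column, new_col) for new_col in into]
    return out, edges


def sort_rows(rows, key):
    """Sorting reorders rows but derives no new field, so it contributes no edges."""
    return sorted(rows, key=lambda r: r[key]), []


rows = [
    {"id": 2, "full_name": "Ada Lovelace"},
    {"id": 1, "full_name": "Alan Turing"},
]
rows, split_edges = split_column(rows, "full_name", ["first_name", "last_name"])
rows, sort_edges = sort_rows(rows, "id")
second_lineage = split_edges + sort_edges
print(second_lineage)
```

Returning the edges alongside the converted rows keeps the relationship record in step with the conversion that was actually applied.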

The following introduces 4 steps involved in data auditing:

Data management task configuration means configuring a data management task, namely a data auditing task. The task goal of this task is: auditing the data of the data source and/or the target source to obtain the audited data. Auditing the data of the data source can occur before or after the conversion; auditing the data of the target source may occur after the converted data has been stored to the target source. FIG. 6 is a schematic diagram of a data management task configuration interface provided in an embodiment of the present application.

By executing the data management task, the data in the table fields of the data tables of the selected database needing to be audited can be audited based on the set field and table audit rules and the standards established by the data standard system, so as to obtain the audited data.

The data standard system (establishment) refers to establishing the naming specifications, data element standards, code table standards, field standards and the like for data. Based on the established specifications and standards, when data is audited, the audited data can be renamed, set according to the data element standard, set according to the code table standard and the field standard, and the like. FIG. 7 is a schematic diagram of a data standard system according to the embodiment of the present application.

The selection of the audit data table refers to the selection of the table field of the data table of the database needing audit from the data source and/or the target source.

Configuring field and table audit rules means setting the field and table audit rules used to audit the data in the selected table fields.
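Field audit rules could be expressed as one predicate per field, as in the sketch below. The rule set, the code table values and the 11-digit phone format are invented for illustration and are not taken from the application.

```python
import re

# Hypothetical field audit rules: field name -> predicate the value must satisfy.
RULES = {
    "user_id": lambda v: isinstance(v, int) and v > 0,
    "gender_code": lambda v: v in {"0", "1", "2", "9"},  # assumed code table
    "phone": lambda v: isinstance(v, str) and re.fullmatch(r"\d{11}", v) is not None,
}


def audit(rows, rules):
    """Return (passed_rows, failures); failures lists (row_index, failed_fields)."""
    passed, failures = [], []
    for i, row in enumerate(rows):
        bad = [f for f, check in rules.items() if f in row and not check(row[f])]
        if bad:
            failures.append((i, bad))
        else:
            passed.append(row)
    return passed, failures


rows = [
    {"user_id": 1, "gender_code": "1", "phone": "13800000000"},
    {"user_id": -5, "gender_code": "7", "phone": "123"},
]
audited, failures = audit(rows, RULES)
print(len(audited), failures)
```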

The following introduces the steps involved in the script publishing:

the diversified data processing script task configuration refers to the configuration of diversified data processing script tasks to be executed. The script may be of a type such as SHELL, PYTHON, SPARK, or the like. When performing task configuration, a technician may select a script type based on the interface shown in fig. 8, further develop a diversified data processing script of the type, input the developed diversified data processing script into the task configuration management module as a diversified data processing script task to be executed, and thus complete configuration of the diversified data processing script task.

The task goal of the diversified data processing script task is to perform blood relationship analysis on the data to be analyzed according to data blood relationship analysis strategy 3, thereby obtaining the corresponding blood relationship analysis result. Data blood relationship analysis strategy 3 can be set flexibly according to specific requirements, and may be any data blood relationship analysis strategy other than data blood relationship analysis strategies 1 and 2. For example, data blood relationship analysis strategy 3 may be: analyzing the operations on the source data table and the target table to determine the operational association between the data stored to the target source and the data from the data source, or to determine the similarity between the data stored to the target source and the data from the data source.
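The application does not specify how the similarity between source data and target data is measured. As one possible sketch, the Jaccard index over row sets could be used; this choice, and the function name `jaccard_similarity`, are assumptions made purely for illustration.

```python
def jaccard_similarity(source_rows, target_rows):
    """Share of distinct rows common to both tables (Jaccard index).

    Rows must be hashable, for example tuples of field values.
    """
    a, b = set(source_rows), set(target_rows)
    if not a and not b:
        return 1.0  # two empty tables are treated as identical
    return len(a & b) / len(a | b)
```

A value close to 1 would suggest that the target table was derived from the source table with few changes.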

Selecting a script type refers to a technician selecting a script type of a diversified data processing script to be developed based on an interface as shown in fig. 8.

The development script task refers to the development of diversified data processing scripts by technicians.

Debugging, saving and releasing the script task means that a technician inputs the developed diversified data processing script into the task configuration management module and, after the diversified data processing script task is configured, debugs, saves and releases it. Diversified data processing script tasks that pass debugging are saved; the saved diversified data processing script tasks can be released, according to actual requirements, to the nodes that execute them.

The following describes "3, schedule task management module".

In one embodiment, the specific functions thereof can be referred to in fig. 1. Wherein:

Selecting the required task nodes from the task menu refers to selecting the various task nodes that may be involved in the analysis of the data blood relationship. For example, the task nodes may be selected based on a graphical operation interface as shown in fig. 9. In fig. 9, the graphical identifiers corresponding to access checkpoint data, access vehicle data and access vehicle entry information are task nodes representing the selected multi-source heterogeneous data access tasks; the graphical identifier corresponding to auditing represents the task node of the selected data management task; the graphical identifier corresponding to cleaning represents the task node of the selected data conversion task; and the graphical identifier corresponding to the analysis of virus-involved vehicles represents the task node of the selected diversified data processing script task.

Besides the task nodes, the delay nodes, the conditional nodes, the sink nodes, the sub-scheduling nodes and the like can be configured, and can be flexibly selected and combined according to needs.

Selecting a target task refers to selecting a task to be executed. For example, the target task may be selected based on a graphical user interface as shown in fig. 10. In fig. 10, access checkpoint data, access vehicle data and access vehicle entry information indicate that the selected multi-source heterogeneous data access tasks include accessing checkpoint data, accessing vehicle data and accessing vehicle entry information; auditing indicates that a data management task is selected; cleaning indicates that a data conversion task is selected; and the analysis of virus-involved vehicles indicates that a diversified data processing script task is selected. The selected tasks are the tasks to be executed, also called the target tasks.

The task detailed parameter setting means setting a value of a parameter required by the task when the task is executed, for the task to be executed. For example, parameter setting may be performed based on a graphical operation interface as shown in fig. 11. In fig. 11, specific variable values of variables named param1 included in a task, such as 123, may be set for a task to be executed, such as a task of "checkpoint passing through vehicle log audit".

The configuration message notification setting refers to setting, for the task to be executed, the message type, the message reminding mode and the like of the notifications/messages that the task can send during execution. For example, the configuration message notification setting may be performed based on a graphical operation interface as shown in fig. 12. Based on the set message type and message reminding mode, when a corresponding event occurs during task execution, such as a node starting, a node ending, the task running time meeting the alarm condition, or a running error, a message of the set message type is sent in the set message reminding mode.

Setting a scheduling period of a scheduling task means setting a specific scheduling period for the scheduling task. For example, the setting of the scheduling period may be performed based on a graphical operation interface as shown in fig. 13.

And the scheduling task executor is used for executing the scheduling task according to the set scheduling period.
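A minimal sketch of how an executor might derive its run times from a configured scheduling period is given below. The fixed-interval model and the name `next_runs` are assumptions; the application only states that a scheduling period is set and followed.

```python
from datetime import datetime, timedelta


def next_runs(start, period, count):
    """Return the next `count` run times after `start` for a fixed period."""
    return [start + period * i for i in range(1, count + 1)]
```

For a period of six hours starting at midnight, the first two runs fall at 06:00 and 12:00 of the same day.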

And analyzing results, namely acquiring the data blood relationship analyzed by executing the task to be executed. As can be seen from the foregoing description, by executing a multi-source heterogeneous data access task, it is possible to obtain a blood relationship between data acquired from a data source and corresponding data stored to a target source according to the data blood relationship analysis policy 1; by executing the data conversion task, the determination of the blood relationship between the data before conversion and the data after conversion can be realized according to the data blood relationship analysis strategy 2.

And the execution result monitoring and alarm notification refers to monitoring the execution process and the result of the task to be executed, and if an event triggering the sending of a notification/message or the alarm is generated, sending the notification/message or the alarm according to the setting of the configuration message notification setting.
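The interplay between the message notification setting and execution monitoring can be sketched as follows. The event names, the `NotifyConfig` fields and the rule that the running time is checked when a node ends are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class NotifyConfig:
    events: set           # event types that should trigger a message
    channel: str          # hypothetical reminding mode, e.g. "email" or "sms"
    max_runtime_s: float  # running time above which an alarm is raised


def notifications(config, task_events):
    """Map monitored task events to the messages that should be sent.

    task_events is a list of (event_type, runtime_seconds) tuples.
    """
    out = []
    for event_type, runtime in task_events:
        if event_type in config.events:
            out.append((config.channel, event_type))
        # Assumption: the running time is only checked when a node ends.
        if (event_type == "node_end"
                and "runtime_alarm" in config.events
                and runtime > config.max_runtime_s):
            out.append((config.channel, "runtime_alarm"))
    return out
```

Events that are not listed in the configuration, such as `node_start` in a configuration that only lists `node_end`, produce no message.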

The following describes "4, data map generation module".

the management of the data resource analysis dimension information refers to the selection of a data source and/or a target source to be subjected to analysis dimension information management.

The public analysis dimension configuration refers to configuring public analysis dimensions. The public analysis dimensions may be a set of default analysis dimensions, including but not limited to at least one of: the data storage capacity of the data source and/or the target source, table field information, change information statistics and the like. The public analysis dimension information of the data can subsequently be displayed on the data map.

Private analysis dimension configuration refers to configuring a private analysis dimension. The private analysis dimension of the data can be displayed on the data map subsequently. The private analysis dimension generally refers to a specific analysis dimension that is different from the public analysis dimension and can be displayed in the data map, such as but not limited to: the data type of the data source and/or the target source, the type of the data source, the total amount of data satisfying a set condition (which may be set in advance as needed), and the like.

The data source table multi-dimensional statistical task configuration refers to configuring a multi-dimensional statistical task aiming at data in a source table and/or a target table based on public analysis dimensions configured by public analysis dimension configuration and private analysis dimensions configured by private analysis dimension configuration. The task target of the data source table multi-dimensional statistical task is to perform information statistics of the analysis dimensions on data in the source table and/or the target table according to the configured public analysis dimensions and private analysis dimensions to obtain corresponding statistical results.
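A multi-dimensional statistical task of this kind can be sketched as a set of named statistic functions applied to a table. The dimension names, the sample rows and the `speed > 120` condition are hypothetical and serve only to show how public and private dimensions might be combined.

```python
def table_statistics(rows, public_dims, private_dims):
    """Compute one statistic per configured analysis dimension.

    rows is a list of dicts (one per record); public_dims and private_dims
    map a dimension name to a function that takes the rows and returns a value.
    """
    stats = {}
    for name, fn in {**public_dims, **private_dims}.items():
        stats[name] = fn(rows)
    return stats


rows = [{"plate": "A1", "speed": 80}, {"plate": "B2", "speed": 130}]
public_dims = {
    "row_count": len,
    "fields": lambda rs: sorted({k for r in rs for k in r}),
}
private_dims = {
    # Hypothetical private dimension: total amount of data meeting a set condition.
    "rows_over_120": lambda rs: sum(1 for r in rs if r["speed"] > 120),
}
stats = table_statistics(rows, public_dims, private_dims)
```

Keeping each dimension as an independent function lets a user add a private dimension without touching the default public ones.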

Generating the scheduling task refers to generating a scheduling task according to the task target of the data source table multi-dimensional statistical task. The task goal of the scheduling task is to schedule and execute the tasks to be executed, so as to obtain the data blood relationship and/or the statistical results for the configured public analysis dimensions and private analysis dimensions. According to specific requirements, the tasks to be executed may include one, several or all of the selected target tasks and the data source table multi-dimensional statistical task.

The generated scheduling task can be transmitted to the scheduling task executor, so that the scheduling task executor executes the scheduling task, and the task to be executed is executed in the process of executing the scheduling task.

And the storage of the base table relation and the field relation refers to the storage of the data blood relationship obtained by the analysis of the task scheduling management module.

The data resource multi-analysis dimension information is information of public analysis dimension and private analysis dimension of data obtained by executing a data source table multi-dimensional statistical task and is stored.

And the data calibration refers to calibrating the stored data blood relationship and the information of the public analysis dimension and the private analysis dimension of the data.

In addition to the above steps, the data map generation module is further configured to generate and display a data map according to the information of each blood relationship obtained by executing the target task and the public analysis dimension and the private analysis dimension of the data.

In the embodiment of the present application, a process for calibrating data includes the following steps:

Step one: if the execution result monitoring and alarm notification link sends a notification of abnormal task execution, the data generated by the failed task, such as blood relationship analysis results, is automatically cleaned up;

Step two: if the task is executed normally and a blood relationship analysis result is generated, whether the blood relationship contains a circular relationship is detected; if so, the circular relationship is temporarily set as unavailable data, and an administrator is reminded by mail or short message to check and correct it manually.
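The check for a circular relationship in step two can be sketched with a depth-first search over the blood relationship edges. The application does not state which detection algorithm is used; the function names and the returned status strings below are assumptions for illustration.

```python
def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    edges is an iterable of (upstream, downstream) pairs.
    """
    graph = {}
    for up, down in edges:
        graph.setdefault(up, []).append(down)
        graph.setdefault(down, [])
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is already on the path
                return stack[stack.index(nxt):]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def calibrate(edges):
    """Mark a blood relationship result unavailable if it contains a cycle."""
    cycle = find_cycle(edges)
    return ("unavailable" if cycle else "available"), cycle
```

The recursive search is adequate for small graphs; a very deep blood relationship chain would call for an iterative variant to stay within Python's recursion limit.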

In the embodiment of the application, a manual calibration interface can be provided for the condition that inaccurate data exist in the information of the public analysis dimension and the private analysis dimension of the data and the data blood relationship caused by some uncontrollable reasons; and providing a manual input interface for the data consanguinity relation which can not be completely obtained through program analysis and the information of the public analysis dimension and the private analysis dimension of the data.

Summarizing the above data calibration, it may comprise at least one of the following:

Firstly, monitoring the execution process of the execution node on the data blood relationship analysis target script task; if the execution fails, sending out an execution failure prompt, and clearing the data blood relationship analysis result generated by the data blood relationship analysis target script task which fails to be executed; if the execution is successful, analyzing whether a circular relationship exists in the data blood relationship analysis result generated by the data blood relationship analysis target script task which is successfully executed; if it exists, sending out a data correction prompt, and setting the corresponding data blood relationship analysis result to an unavailable state.

Secondly, in the execution process of the data blood relationship analysis target script task by the execution node, periodically monitoring the data correctness of a data blood relationship analysis result generated by the execution node executing the data blood relationship analysis target script task, if incorrect data exists, sending a data error prompt, and displaying a data calibration operation interface on a display device; correcting the incorrect data based on data entered into the data calibration operator interface via a user input device.

Thirdly, displaying a data entry interface on a display device; displaying respective visual content in the data map based on data input to the data entry interface via a user input device;

fourthly, displaying a data map query interface on a display device; and inquiring the data map matched with the data map inquiry condition for displaying based on the data map inquiry condition input into the data map inquiry interface through the user input device.

By adopting the system provided by the embodiment of the application, the data of the data source is obtained by the multi-source heterogeneous data access task, the first blood relationship, the second blood relationship and the third blood relationship are respectively determined by the data conversion task, the data management task and the diversified data processing script task, and the information of the first data and the second data in the designated analysis dimension is determined by the data source table multidimensional statistic task; the data map generation module generates and displays a data map according to the blood relationship and the information under the specified analysis dimensionality, so that the problem that the blood relationship between the information under different analysis dimensionalities and multi-source data cannot be generated and displayed in a data map generation and display mode in the prior art is solved, the information display analysis dimensionality of the data map is effectively enriched, and the application width of an analysis result of data map information is improved.

Based on the same inventive concept, the embodiment of the present application further provides a full life cycle data map generation method, including the following steps:

Step 11, storing first data to be subjected to blood relationship analysis from a data source;

for a case that different data source data may need to be acquired, step 11 may specifically include:

acquiring at least two different input data source identifications;

and respectively acquiring data from the data sources corresponding to the at least two different data source identifications as first data according to the at least two different data source identifications, and storing the first data.

Step 12, configuring a multi-source heterogeneous data access task, a data conversion task, a data management task and a diversified data processing script task;

Step 13, scheduling and executing the target task;

The target task includes: a multi-source heterogeneous data access task, a data conversion task, a data management task, a diversified data processing script task and a data source table multi-dimensional statistical task, so as to obtain the first blood relationship, the second blood relationship, the third blood relationship and the information of the first data and the second data under the specified analysis dimension.

And the data source table multidimensional statistic task is used for analyzing the information of the first data and the second data under the specified analysis dimension.

Step 14, generating and displaying a data map according to the obtained first blood relationship, the second blood relationship, the third blood relationship and the information under the specified analysis dimension.

Optionally, the first blood relationship is determined according to at least one of the following preset data blood relationship analysis strategies:

In order to support the user, the specified analysis dimension may be flexibly set according to the user's own needs, and optionally, the specified analysis dimension may be determined in the following manner:

and acquiring a public analysis dimension of default setting and a private analysis dimension input by a user as the specified analysis dimension.

In order to correct the error data, the method provided in the embodiment of the present application may further include:

monitoring the execution process of the target task, if the execution fails, sending out an execution failure prompt, and clearing a data blood relationship analysis result generated by the target task which fails to be executed; if the execution is successful, analyzing whether a circular circulation relationship exists in a data blood relationship analysis result generated by the target task which is successfully executed; if the data exists, sending out a data correction prompt, and setting the corresponding data blood relationship analysis result to be in an unavailable state.

Alternatively, in order to correct the error data, the method provided by the embodiment of the present application may further include:

in the execution process of the target task, periodically monitoring the data correctness of the data blood relationship analysis result generated by the target task, if incorrect data exists, sending a data error prompt, and displaying a data calibration operation interface on a display device; correcting the incorrect data based on data entered into the data calibration operator interface via a user input device.

Optionally, to solve the problem that information required by the user may be missing in the data map, the method provided in the embodiment of the present application may further include: displaying a data entry interface on a display device; displaying respective visual content in the data map based on data input to the data entry interface via a user input device.

Optionally, to facilitate a user to query a data map, the method provided in the embodiment of the present application may further include: displaying a data map query interface on a display device; and inquiring the data map matched with the data map inquiry condition for displaying based on the data map inquiry condition input into the data map inquiry interface through the user input device.

By adopting the method provided by the embodiment of the application, the data of the data source is obtained by the multi-source heterogeneous data access task, the first blood relationship, the second blood relationship and the third blood relationship are respectively determined by the data conversion task, the data management task and the diversified data processing script task, and the information of the first data and the second data under the designated analysis dimension is determined by the data source table multi-dimensional statistic task; the data map generation module generates and displays a data map according to the blood relationship and the information under the specified analysis dimensionality, so that the problem that the blood relationship between the information under different analysis dimensionalities and multi-source data cannot be generated and displayed in a data map generation and display mode in the prior art is solved, the information display analysis dimensionality of the data map is effectively enriched, and the application width of an analysis result of data map information is improved.

Example two:

in order to solve the problem that the blood relationship of multi-source data under different dimensions cannot be generated and displayed in a data map generation and display mode in the prior art, the embodiment of the application provides a data map request method.

Referring to fig. 14, fig. 14 is a schematic flowchart illustrating a full-life-cycle data map requesting method according to an embodiment of the present disclosure.

S21: The user sends a request instruction to the full-life-cycle data map generation system as required.

In the embodiment of the present application, the request instruction includes: a data source identifier, a target source identifier, a user authority level identifier and a data query dimension identifier.

S22: the full life cycle data map generation system parses the user instructions.

The system parses the instruction sent by the user to obtain the data source identifier, the target source identifier and the user authority level identifier. For example, the data source identifier is used for acquiring data in a MySQL database, and the target source identifier is used for acquiring information on virus-involved vehicles in the MySQL database. The user authority level identifier is used for judging whether the system can provide the data map query service to the user: if the user authority level identifier does not match the authority identifier information preset in the system, the system cannot provide the data map query service to the user.
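The parsing and authority check of S22 can be sketched as follows. The field names, the dictionary form of the raw instruction and the preset level set `KNOWN_LEVELS` are assumptions; the application does not define the instruction format.

```python
from dataclasses import dataclass


@dataclass
class MapRequest:
    data_source_id: str
    target_source_id: str
    authority_level: int
    query_dimension: str


KNOWN_LEVELS = {1, 2, 3, 4, 5}  # hypothetical authority levels preset in the system


def parse_request(raw):
    """Parse a raw request instruction and refuse unknown authority levels."""
    req = MapRequest(
        data_source_id=raw["data_source_id"],
        target_source_id=raw["target_source_id"],
        authority_level=int(raw["authority_level"]),
        query_dimension=raw["query_dimension"],
    )
    if req.authority_level not in KNOWN_LEVELS:
        raise PermissionError("data map query service is not available to this user")
    return req
```

Rejecting the request at parse time means that no data map is generated for a user whose authority identifier does not match the preset information.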

S23: Generate a data map matched with the user authority level based on the user authority level identifier in the user instruction.

As shown in fig. 15, the user authority level is obtained by parsing the user authority level identifier in the user instruction; the authority level information of the user is matched with the data range information of the database to obtain the data range corresponding to the user authority level, and a data map is generated. Following the above example, when the user sends a data map query instruction for virus-involved vehicles to the system, the system parses the parameters in the instruction to obtain the user authority level information. Based on this information, the system automatically matches the available data corresponding to the level and generates, from that data, a data map of virus-involved vehicles corresponding to the user level. In this case, if the user authority level is level 2, the user can query the vehicle data map of the China and European vehicles, and if the user authority level is level 5, the user can query the vehicle data map of the China and European vehicles.
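Matching an authority level to a data range can be sketched as a lookup followed by a filter. The mapping `LEVEL_SCOPES`, the placeholder region names and the `region` field are hypothetical; the actual correspondence between levels and data ranges is configured in the system and is not reproduced here.

```python
# Hypothetical mapping from authority level to the data ranges it may see.
LEVEL_SCOPES = {
    2: {"region_a"},
    5: {"region_a", "region_b"},
}


def visible_rows(rows, level):
    """Keep only the rows whose region lies inside the level's data range."""
    scope = LEVEL_SCOPES.get(level, set())
    return [row for row in rows if row["region"] in scope]
```

A level that is absent from the mapping gets an empty scope, so no rows are returned for it.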

S24: Generate a blood relationship map based on the data map corresponding to the user level.

Based on the data map corresponding to the user level, data blood relationship analysis is performed on the data in the data map to obtain the dependency relationship among the data, and a DAG (Directed Acyclic Graph) which is a skeleton Graph of the data map is constructed according to the dependency relationship among the data.

For a specific strategy for analyzing the relationship between the data at the blood relationship, reference may be made to the "data blood relationship analysis strategy" in the first embodiment, which is not described herein again.

For example, if the current authority level of the user is level 2, the user can query a virus-involved vehicle data map composed of Chinese virus-involved vehicle information. The dependency relationships among the corresponding data are then obtained through data blood relationship analysis, and a DAG (Directed Acyclic Graph), which is a skeleton graph of the virus-involved vehicle data map, is generated and displayed based on these dependency relationships.
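Constructing the DAG from the dependency relationships can be sketched with the standard library module `graphlib` (Python 3.9 or later). The function name `build_dag` and the table names in the usage note are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def build_dag(dependencies):
    """Build a predecessor map and a topological order from dependency pairs.

    dependencies is an iterable of (upstream, downstream) pairs. A CycleError
    is raised if the pairs do not form a directed acyclic graph.
    """
    predecessors = {}
    for up, down in dependencies:
        predecessors.setdefault(down, set()).add(up)
        predecessors.setdefault(up, set())
    order = list(TopologicalSorter(predecessors).static_order())
    return predecessors, order
```

For the chain `raw` to `cleaned` to `report`, the topological order lists each table after the tables it depends on, which is the order in which a skeleton graph would naturally be laid out.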

S25: Based on the data query dimension identifier in the user instruction, acquire data of the different query dimensions from the data map corresponding to the user level and compile statistics on it.

As shown in fig. 16, assuming that the query dimension identifier in the above example indicates the field dimension, the system, based on the DAG obtained in S24, counts and displays the field-dimension information of the data corresponding to the different nodes in the DAG. For example, if the user queries statistics on the field information in the DAG for virus-involved vehicles, then, as shown in fig. 17, the system may count and display information such as the field name, the field type, whether to distinguish the field, and the field description.

As shown in fig. 18, in the embodiment of the present application, the user may also query the dimensions of partition information, storage information, consanguinity information, production information, change information, DDL statements, and the like.

When the system queries data, an asynchronous mechanism is used for querying the data.
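The application only states that an asynchronous mechanism is used for queries. One possible sketch uses Python's `asyncio` to issue the queries for several dimensions concurrently; the coroutine names and the placeholder result strings are assumptions, and `asyncio.sleep(0)` stands in for a real non-blocking database call.

```python
import asyncio


async def query_dimension(dimension):
    """Stand-in for a non-blocking query of one analysis dimension."""
    await asyncio.sleep(0)  # placeholder for real asynchronous I/O
    return dimension, f"stats for {dimension}"


async def query_all(dimensions):
    """Run the queries for all dimensions concurrently and collect the results."""
    pairs = await asyncio.gather(*(query_dimension(d) for d in dimensions))
    return dict(pairs)


result = asyncio.run(query_all(["field", "partition", "storage"]))
```

With real I/O, the total waiting time approaches that of the slowest query rather than the sum of all queries.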

S26: Sort and summarize the statistical results of S25, and display the processing results.

Claims

1. A full life cycle data map generation system is characterized by comprising a data source management module, a task configuration management module, a scheduling task management module and a data map generation module, wherein:

2. The system of claim 1, wherein the first blood relationship is determined according to at least one of the following preset data blood relationship analysis strategies:

3. The system of claim 1, wherein the data map generation module is further configured to:

4. The system of claim 1, wherein the data map generation module is further configured to perform at least one of:

monitoring the execution process of the target task, if the execution fails, sending out an execution failure prompt, and clearing a data blood relationship analysis result generated by the target task which fails to be executed; if the execution is successful, analyzing whether a circular circulation relationship exists in a data blood relationship analysis result generated by the target task which is successfully executed; if the data exists, sending out a data correction prompt, and setting the corresponding data blood relationship analysis result to be in an unavailable state;

in the execution process of the target task, periodically monitoring the data correctness of the data blood relationship analysis result generated by the target task, if incorrect data exists, sending a data error prompt, and displaying a data calibration operation interface on a display device; correcting the incorrect data based on data input into the data calibration operation interface through a user input device;

displaying a data entry interface on a display device; displaying respective visual content in the data map based on data input to the data entry interface via a user input device;

displaying a data map query interface on a display device; and inquiring the data map matched with the data map inquiry condition for displaying based on the data map inquiry condition input into the data map inquiry interface through the user input device.

5. The system of claim 1, wherein the multi-source heterogeneous data access task is specifically configured to:

acquiring at least two different input data source identifications;

and respectively acquiring data from the data sources corresponding to the at least two different data source identifications according to the at least two different data source identifications, and storing the data as the first data to the data source management module.

6. A full life cycle data map generation method, comprising:

7. The method of claim 6, wherein the first blood relationship is determined according to at least one of the following preset data blood relationship analysis strategies:

8. The method of claim 6, wherein the specified analysis dimension is determined by:

9. The method of claim 6, further comprising at least one of:

10. The method of claim 6, wherein the multi-source heterogeneous data access task is specifically configured to:

acquiring at least two different input data source identifications;