WO2022143045A1 - Method and apparatus for determining data lineage, storage medium, and electronic device - Google Patents

Method and apparatus for determining data lineage, storage medium, and electronic device

Info

Publication number
WO2022143045A1
WO2022143045A1 PCT/CN2021/136131 CN2021136131W WO2022143045A1 WO 2022143045 A1 WO2022143045 A1 WO 2022143045A1 CN 2021136131 W CN2021136131 W CN 2021136131W WO 2022143045 A1 WO2022143045 A1 WO 2022143045A1
Authority
WO
WIPO (PCT)
Prior art keywords
metadata
data
relationship
graph database
database
Prior art date
Application number
PCT/CN2021/136131
Other languages
English (en)
French (fr)
Inventor
韩林
侯春华
申光
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2022143045A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation

Definitions

  • The present application relates to the field of communications, and in particular to a method and apparatus for determining data lineage, a storage medium, and an electronic device.
  • Extract-Transform-Load (ETL) is a key step in building a data warehouse.
  • Recording and analyzing data flows also has great practical value, for example for data traceability, data value assessment, data quality assessment, and as a reference for data archiving and destruction.
  • In the related art, data lineage analysis methods based on a relational database make model building, data storage, and lineage querying complex, and no effective technical solution has yet been proposed.
  • An embodiment of the present application provides a method for determining data lineage, including: acquiring metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field; analyzing and processing the metadata so as to save the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the inclusion relationships indicate the pairwise inclusion relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and, in response to a data query request for target data, determining the data lineage of the target data through the graph database.
  • An embodiment of the present application also provides an apparatus for determining data lineage, including: an acquisition module configured to acquire metadata of an ETL task, where the metadata includes at least one of the following: a database, a data table, and a data field; a processing module configured to analyze and process the metadata so as to save the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the inclusion relationships indicate the pairwise inclusion relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and a response module configured to respond to a data query request for target data and determine the data lineage of the target data through the graph database.
  • An embodiment of the present application further provides a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
  • An embodiment of the present application further provides an electronic device, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any one of the foregoing method embodiments.
  • FIG. 1 is a hardware structure block diagram of a computer terminal for the method for determining data lineage according to an embodiment of the present application.
  • FIG. 2 is a flowchart of a method for determining data lineage according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of metadata type definition and creation according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the construction of data lineage according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of data lineage analysis according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of graph traversal nodes and directed edges according to an embodiment of the present application.
  • FIG. 7 is a structural block diagram of an apparatus for determining data lineage according to an embodiment of the present application.
  • The embodiments of the present application provide a method and apparatus for determining data lineage, a storage medium, and an electronic device, so as to at least solve the problem that, in data lineage analysis methods based on a relational database, model building, data storage, and lineage querying are all complex.
  • FIG. 1 is a hardware structure block diagram of a computer terminal for the method for determining data lineage according to an embodiment of the present application.
  • the computer terminal may include one or more (only one is shown in FIG.
  • processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA, etc.
  • the above-mentioned computer terminal may further include a transmission device 106 and an input and output device 108 for communication functions.
  • the structure shown in FIG. 1 is only a schematic diagram, which does not limit the structure of the above-mentioned computer terminal.
  • the computer terminal may also include more or fewer components than those shown in FIG. 1 , or have a different configuration with equivalent or more functions than those shown in FIG. 1 .
  • The memory 104 can be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the method for determining data lineage in the embodiments of the present application. The processor 102 runs the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above method.
  • Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, memory 104 may further include memory located remotely from processor 102, which may be connected to a computer terminal through a network.
  • The transmission device 106 is used to receive or transmit data via a network.
  • A specific example of the above network may include a wireless network provided by the communication provider of the computer terminal.
  • The transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • The transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
  • FIG. 2 is a flowchart of the method for determining data lineage according to an embodiment of the present application. As shown in FIG. 2, the method includes:
  • Step S202: acquire metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
  • Step S204: analyze and process the metadata so as to save the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the inclusion relationships indicate the pairwise inclusion relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field.
  • Step S206: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
  • Through the above steps, the metadata of the ETL task is acquired, where the metadata includes at least one of the following: a database, a data table, and a data field; the metadata is analyzed and processed so that the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata are saved in the graph database; and, in response to a data query request for target data, the data lineage of the target data is determined through the graph database. This technical solution addresses the problem that, in data lineage analysis methods based on a relational database, model building, data storage, and lineage querying are all complex. With the graph-database-based method, model building, data storage, and lineage querying become simpler and more efficient.
  • The specific implementation steps of step S204 are as follows:
  • Step 1: Analyze and process the metadata so as to save the metadata of the ETL task in the graph database. Specifically: obtain the metadata of the data source end of the ETL task and the metadata of the data destination end of the ETL task; determine, according to the metadata types provided by the graph database, a first metadata type for the metadata of the data source end and a second metadata type for the metadata of the data destination end; save the metadata of the data source end in the graph database according to the first metadata type, and save the metadata of the data destination end in the graph database according to the second metadata type.
  • Step 2: Analyze and process the metadata so as to save the inclusion relationships of the metadata in the graph database. Specifically: determine the pairwise inclusion relationships among the database, the data tables, and the data fields; create these pairwise inclusion relationships using the object creation method provided by the graph database; and save the created pairwise inclusion relationships in the graph database.
  • That is, the object creation method provided by the graph database is used to create the pairwise inclusion relationships among the database, the data tables, and the data fields, and the created relationships are then saved in the graph database.
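  • Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: `MetadataGraph`, `add_entity`, `add_containment`, and `save_table` are hypothetical names, and an in-memory structure stands in for the graph database. It only shows typed metadata nodes plus inclusion edges (database, data table, data field).

```python
from dataclasses import dataclass, field


@dataclass
class MetadataGraph:
    """Toy in-memory stand-in for a graph database (hypothetical, not a real API)."""
    nodes: dict = field(default_factory=dict)   # qualified name -> metadata type
    contains: set = field(default_factory=set)  # (parent, child) inclusion edges

    def add_entity(self, qualified_name: str, metadata_type: str) -> str:
        # Get-or-create: saving the same entity twice still leaves one node.
        self.nodes.setdefault(qualified_name, metadata_type)
        return qualified_name

    def add_containment(self, parent: str, child: str) -> None:
        self.contains.add((parent, child))


def save_table(graph, database, table, columns):
    """Save a database, one of its tables, and the table's fields with inclusion edges."""
    db = graph.add_entity(database, "database")
    tbl = graph.add_entity(f"{database}.{table}", "data_table")
    graph.add_containment(db, tbl)
    for column in columns:
        col = graph.add_entity(f"{database}.{table}.{column}", "data_field")
        graph.add_containment(tbl, col)
    return tbl


graph = MetadataGraph()
save_table(graph, "src_db", "A", ["id", "name"])   # data source end
save_table(graph, "dst_db", "B", ["id", "label"])  # data destination end
```

  • Using the qualified name as the node key is one simple way to keep each metadata object unique when the same table appears in several ETL tasks.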
  • Step 3: Analyze and process the metadata so as to save the mapping relationships between the metadata in the graph database. Specifically: create an ETL task metadata type in the graph database, where the ETL task metadata type includes an input/output list attribute and a metadata mapping relationship attribute, and the input/output list attribute is used to store the metadata of the data source end and the metadata of the data destination end; obtain the mapping relationships between the metadata and save them in the created ETL task metadata type, so that the mapping relationships between the metadata are saved in the graph database.
  • That is, the ETL task metadata type includes an input/output list and a metadata mapping relationship. The metadata of the data source end and the data destination end of the ETL task is stored in the corresponding input/output list, and the mapping relationships between the metadata are stored in the mapping relationship attribute of the corresponding ETL task metadata type, so that the mapping relationships are saved in the graph database.
  • In this way, the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata are all saved in the graph database.
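  • A hedged sketch of the ETL task metadata type described in Step 3: a record holding an input list, an output list, and a field mapping. The names `EtlTask` and `source_fields` and the sample table and field names are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class EtlTask:
    """Hypothetical ETL task metadata record."""
    name: str
    inputs: list = field(default_factory=list)         # metadata of the data source end
    outputs: list = field(default_factory=list)        # metadata of the data destination end
    field_mapping: dict = field(default_factory=dict)  # source field -> destination field


def source_fields(task, destination_field):
    """Field-level lineage: which source fields feed the given destination field."""
    return sorted(src for src, dst in task.field_mapping.items()
                  if dst == destination_field)


task = EtlTask(
    name="copy_a_to_b",
    inputs=["src_db.A"],
    outputs=["dst_db.B"],
    field_mapping={"src_db.A.id": "dst_db.B.id", "src_db.A.name": "dst_db.B.label"},
)
label_sources = source_fields(task, "dst_db.B.label")
```

  • Keeping the field mapping on the task record, rather than on the tables, means a table can take part in many tasks without its own metadata changing.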
  • Step S206 may be implemented as follows: in response to the data query request, a traversal query is performed over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data.
  • Specifically, the traversal query over the input/output list is performed using the traversal language in the input direction and/or the output direction, so as to determine the data lineage of the target data.
  • That is, to determine the data lineage of the target data through the graph database, a data query request is first obtained; then, based on the input/output list of the ETL task metadata type and starting from the target data, the data lineage of the target data is queried using the traversal language of the graph database in the input direction and/or the output direction.
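  • The direction-based query can be illustrated with a small sketch. It assumes that the "input" direction means looking toward the tables that feed the target and the "output" direction means looking toward the tables the target feeds; the `Task` record and `related_tables` function are hypothetical and do not use any graph database traversal language.

```python
from collections import namedtuple

# Hypothetical ETL task record: names of its input and output tables.
Task = namedtuple("Task", ["name", "inputs", "outputs"])


def related_tables(tasks, target, direction="both"):
    """Return the tables one ETL task away from `target`.

    direction="input"  -> upstream tables that feed `target`
    direction="output" -> downstream tables fed by `target`
    direction="both"   -> both sets
    """
    upstream, downstream = set(), set()
    for task in tasks:
        if target in task.outputs:
            upstream.update(task.inputs)
        if target in task.inputs:
            downstream.update(task.outputs)
    if direction == "input":
        return {"upstream": upstream}
    if direction == "output":
        return {"downstream": downstream}
    return {"upstream": upstream, "downstream": downstream}


tasks = [Task("job1", ["A"], ["B"]), Task("job2", ["B"], ["C"])]
result = related_tables(tasks, "B")
```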
  • An ETL task can be regarded as a form of data flow. The three basic elements of a data flow are the data source, the data flow direction, and the data destination; beyond these, it can also carry more detailed information, such as field-level lineage.
  • The specific implementation steps are as follows:
  • Step 1: Abstract the metadata types of the data source and the data destination of the ETL task. The metadata types include those of the database, the data table, and the data field. Define the inclusion relationships between the metadata types, for example, a data field belongs to a data table and a data table belongs to a database. The metadata types are then initialized. An ETL task metadata type is also defined for the ETL task itself; it includes the input and output object list attributes of the ETL transformation and the corresponding field mapping relationship. The input and output object list attributes are an important part of analyzing data lineage.
  • Step 2: Analyze the specific information of the data source and the data destination in the ETL task (equivalent to the inclusion relationships of the metadata in the above embodiment), that is, the database, data tables, and data fields and the relationships among them. Use the metadata object creation method provided by the graph database to create metadata objects of the corresponding metadata types, together with the inclusion dependencies between those objects. Each metadata object and each inclusion dependency is globally unique.
  • Step 1 and Step 2 are shown in FIG. 3, which is a schematic diagram of metadata type definition and creation according to an embodiment of the present application.
  • Step 3: Analyze the data flow between the data elements in the ETL task (equivalent to the mapping relationships between the metadata in the above embodiment), that is, the direction from the data source to the data destination and the field correspondences, and create an ETL task metadata type object to store this information. Most importantly, the metadata objects created for the data source and the data destination of the ETL task are stored in the input and output object lists of the ETL task metadata object, and the field correspondences between the data source and data destination metadata objects are stored in the field mapping relationship of the ETL task metadata object.
  • FIG. 4 is a schematic diagram of constructing data lineage according to an embodiment of the present application.
  • Step 4: Based on the data stored in the graph database, query and analyze data lineage using the query language and methods provided by the graph database. Using the traversal language, start from the input object and traverse in both the input and output directions. The number of traversal levels can be adjusted as the query requires.
  • Step 5: After the queried lineage data has been organized and transformed, it can be consumed, for example by drawing a data lineage diagram that clearly shows the upstream sources and downstream destinations of the data along with the related ETL task information, which makes data traceability and other analyses easy to carry out. FIG. 5 is a schematic diagram of data lineage analysis according to an embodiment of the present application.
  • Step 1: First, abstract the metadata types for the MySQL database and the Oracle database and for their tables and fields. Because MySQL and Oracle are both relational databases, they are abstracted into the same types, namely RdbResource (database), RdbTable (data table), and RdbColumn (data table field, equivalent to the data field in the above embodiment).
  • The Atlas metadata types are defined in xml format.
  • RdbResource contains a list attribute of type RdbTable, namely the data tables it contains;
  • RdbTable contains a list attribute of type RdbColumn, namely the data table fields it contains;
  • RdbColumn holds the data attributes of a data table field, such as name and type.
  • ETLJob contains a list attribute of type ETLJobTrans;
  • ETLJobTrans contains, in addition to its basic attributes, the input and output object list attributes and the field mapping relationship.
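  • The shape of these five types can be sketched with plain Python dataclasses. This is only a model of the attributes named above, not Atlas type definitions and not the Atlas API; the attribute names (`tables`, `columns`, `trans`, `inputs`, `outputs`, `field_mapping`) and the sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RdbColumn:
    name: str
    type: str


@dataclass
class RdbTable:
    name: str
    columns: List[RdbColumn] = field(default_factory=list)  # contained table fields


@dataclass
class RdbResource:
    name: str
    tables: List[RdbTable] = field(default_factory=list)    # contained data tables


@dataclass
class ETLJobTrans:
    name: str
    inputs: List[RdbTable] = field(default_factory=list)    # input object list
    outputs: List[RdbTable] = field(default_factory=list)   # output object list
    field_mapping: Dict[str, str] = field(default_factory=dict)


@dataclass
class ETLJob:
    name: str
    trans: List[ETLJobTrans] = field(default_factory=list)


# MySQL table A is loaded into Oracle table B by a single transformation.
table_a = RdbTable("A", [RdbColumn("id", "int")])
table_b = RdbTable("B", [RdbColumn("id", "number")])
mysql = RdbResource("mysql_db", [table_a])
oracle = RdbResource("oracle_db", [table_b])
job = ETLJob("a_to_b", [ETLJobTrans("copy", [table_a], [table_b], {"A.id": "B.id"})])
```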
  • Step 2: Analyze the databases to which MySQL data table A and Oracle data table B of the ETL task belong, and the data table fields they contain. After connecting to the databases through configuration, this analysis can be completed by automated database metadata analysis, followed by calls to the Atlas interface to create the corresponding RdbResource, RdbTable, and RdbColumn type objects.
  • Step 3: Analyze the transformation steps in the ETL task (equivalent to the mapping relationships between the metadata in the above embodiment), create the corresponding ETLJob and ETLJobTrans metadata type objects, and add MySQL data table A and Oracle data table B to the input and output object list attributes of the ETLJobTrans object, respectively. The corresponding field mapping relationships are also stored in Atlas.
  • Step 4: Query and analyze the data stored in Atlas. Using the graph traversal language provided by Atlas, start from MySQL data table A and iteratively query the relationships in which it appears in the input and output object list attributes of ETLJobTrans metadata objects.
  • The traversal direction is as follows: follow the edge from MySQL data table A to the input object list node, then the edge to the ETLJobTrans node, then the outgoing edge of that node to the output object list node, and finally the outgoing edge of the output object list to the node of Oracle data table B. This is a simple way to query the downstream data lineage of MySQL data table A. Repeating the query yields multi-level downstream lineage; conversely, querying in the opposite direction yields the upstream data lineage of MySQL data table A.
  • FIG. 6 is a schematic diagram of graph traversal nodes and directed edges according to an embodiment of the present application.
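  • The repeated hop described in Step 4 can be sketched over a plain list of directed edges. This is a simplified illustration that assumes edges strictly alternate table, ETLJobTrans, table (the input and output object list nodes are folded into the edges); `downstream_levels` and the node names are hypothetical, and no Atlas or graph traversal language call is used.

```python
def downstream_levels(edges, start, max_levels):
    """Follow directed edges table -> ETLJobTrans -> table, one level per ETL hop."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)

    levels, frontier, seen = [], {start}, {start}
    for _ in range(max_levels):
        next_tables = set()
        for table in frontier:
            for trans in out.get(table, []):        # table is an input of trans
                for target in out.get(trans, []):   # target is an output of trans
                    if target not in seen:
                        seen.add(target)
                        next_tables.add(target)
        if not next_tables:
            break
        levels.append(sorted(next_tables))
        frontier = next_tables
    return levels


edges = [("A", "trans1"), ("trans1", "B"), ("B", "trans2"), ("trans2", "C")]
downstream = downstream_levels(edges, "A", max_levels=2)

# Reversing every edge turns the same walk into an upstream query.
reversed_edges = [(dst, src) for src, dst in edges]
upstream = downstream_levels(reversed_edges, "C", max_levels=1)
```

  • The `seen` set keeps the walk from looping when ETL tasks form a cycle, and `max_levels` plays the role of the adjustable number of traversal levels.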
  • Step 5: The data returned by the graph traversal can be simplified and transformed as needed, so as to give lineage data consumers a data structure that is easy to process. For example, it can be reduced to two sets of data: a node data list and a directed edge data list. The node data list contains only data table nodes and ETL task nodes, and each directed edge stores only the source node id and the destination node id. The data consumer can reconstruct the actual lineage from these two sets of data.
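  • A minimal sketch of this simplification, assuming traversal results arrive as dictionaries with `id`, `type`, and `name` keys plus `(source_id, target_id)` edge pairs. The functions `simplify` and `rebuild` and the key names are hypothetical.

```python
def simplify(entities, edges):
    """Keep only table and ETL task nodes; edges keep just source and target ids."""
    keep = {e["id"] for e in entities if e["type"] in ("table", "etl_task")}
    nodes = [{"id": e["id"], "type": e["type"], "name": e["name"]}
             for e in entities if e["id"] in keep]
    directed = [{"source": s, "target": t}
                for s, t in edges if s in keep and t in keep]
    return nodes, directed


def rebuild(nodes, directed):
    """Consumer side: restore an adjacency map keyed by node name."""
    names = {n["id"]: n["name"] for n in nodes}
    graph = {n["name"]: [] for n in nodes}
    for edge in directed:
        graph[names[edge["source"]]].append(names[edge["target"]])
    return graph


entities = [
    {"id": 1, "type": "table", "name": "A"},
    {"id": 2, "type": "column", "name": "A.id"},   # dropped by simplify()
    {"id": 3, "type": "etl_task", "name": "copy"},
    {"id": 4, "type": "table", "name": "B"},
]
edges = [(1, 2), (1, 3), (3, 4)]
nodes, directed = simplify(entities, edges)
graph = rebuild(nodes, directed)
```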
  • The custom metadata feature of the graph database is used to predefine various data elements, including but not limited to common relational databases (such as MySQL, Oracle, and SQL Server), big data related types (Hive, Impala, HBase, ES, MongoDB, etc.), structured or unstructured files (FTP, HDFS), and the ETL task type itself, and the types are initialized in the graph database. Then, for each ETL task, the lineage between its data elements is analyzed, converted into relationships between entities in the graph database, and stored in the graph database. Finally, data lineage analysis is performed using the efficient and convenient query methods provided by the graph database.
  • This embodiment also provides an apparatus for determining data lineage. The apparatus is used to implement the above embodiments and preferred implementations, and what has already been described will not be repeated.
  • As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function.
  • Although the apparatus described in the following embodiments is preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
  • FIG. 7 is a structural block diagram of an apparatus for determining data lineage according to an embodiment of the present application. As shown in FIG. 7, the apparatus includes:
  • an acquisition module 72, configured to acquire metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field;
  • a processing module 74, configured to analyze and process the metadata so as to save the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the inclusion relationships indicate the pairwise inclusion relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field;
  • a response module 76, configured to respond to a data query request for target data and determine the data lineage of the target data through the graph database.
  • Through the above apparatus, the metadata of the ETL task is acquired, where the metadata includes at least one of the following: a database, a data table, and a data field; the metadata is analyzed and processed so that the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata are saved in the graph database; and, in response to a data query request for target data, the data lineage of the target data is determined through the graph database. This technical solution addresses the problem that, in data lineage analysis methods based on a relational database, model building, data storage, and lineage querying are all complex. With the graph-database-based method, model building, data storage, and lineage querying become simpler and more efficient.
  • The processing module is further configured to analyze and process the metadata so as to save the metadata of the ETL task in the graph database. Specifically: obtain the metadata of the data source end of the ETL task and the metadata of the data destination end of the ETL task; determine, according to the metadata types provided by the graph database, a first metadata type for the metadata of the data source end and a second metadata type for the metadata of the data destination end; save the metadata of the data source end in the graph database according to the first metadata type, and save the metadata of the data destination end in the graph database according to the second metadata type.
  • The processing module is further configured to analyze and process the metadata so as to save the inclusion relationships of the metadata in the graph database. Specifically: determine the pairwise inclusion relationships among the database, the data tables, and the data fields; create these pairwise inclusion relationships using the object creation method provided by the graph database; and save the created pairwise inclusion relationships in the graph database.
  • That is, the object creation method provided by the graph database is used to create the pairwise inclusion relationships among the database, the data tables, and the data fields, and the created relationships are then saved in the graph database.
  • The processing module is further configured to analyze and process the metadata so as to save the mapping relationships between the metadata in the graph database. Specifically: create an ETL task metadata type in the graph database, where the ETL task metadata type includes an input/output list attribute and a metadata mapping relationship attribute, and the input/output list attribute is used to store the metadata of the data source end and the metadata of the data destination end; obtain the mapping relationships between the metadata and save them in the created ETL task metadata type, so that the mapping relationships between the metadata are saved in the graph database.
  • That is, the ETL task metadata type includes an input/output list and a metadata mapping relationship. The metadata of the data source end and the data destination end of the ETL task is stored in the corresponding input/output list, and the mapping relationships between the metadata are stored in the mapping relationship attribute of the corresponding ETL task metadata type, so that the mapping relationships are saved in the graph database.
  • In this way, the metadata of the ETL task, the inclusion relationships of the metadata, and the mapping relationships between the metadata are all saved in the graph database.
  • The response module is further configured to, in response to the data query request, perform a traversal query over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data.
  • The response module is further configured to perform the traversal query over the input/output list using the traversal language in the input direction and/or the output direction, so as to determine the data lineage of the target data.
  • That is, to determine the data lineage of the target data through the graph database, a data query request is first obtained; then, based on the input/output list of the ETL task metadata type and starting from the target data, the data lineage of the target data is queried using the traversal language of the graph database in the input direction and/or the output direction.
  • The above modules can be implemented in software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
  • Embodiments of the present application further provide a storage medium, where a computer program is stored in the storage medium, wherein the computer program is configured to execute the steps in any one of the above method embodiments when running.
  • the above-mentioned storage medium may be configured to store a computer program for executing the following steps:
  • Step S1: obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
  • Step S2: analyze and process the metadata so as to save the metadata of the ETL task, the containment relationship of the metadata, and the mapping relationship between the metadata in a graph database, where the containment relationship indicates the pairwise containment relationship among the database, the data table, and the data field, and the mapping relationship indicates the pairwise mapping relationship among the database, the data table, and the data field.
  • Step S3: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
  • the above-mentioned storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.
  • Embodiments of the present application further provide an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the above-mentioned electronic device may further include a transmission device and an input-output device, wherein the transmission device is connected to the above-mentioned processor, and the input-output device is connected to the above-mentioned processor.
  • the above-mentioned processor may be configured to execute the following steps through a computer program:
  • Step S1: obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
  • Step S2: analyze and process the metadata so as to save the metadata of the ETL task, the containment relationship of the metadata, and the mapping relationship between the metadata in a graph database, where the containment relationship indicates the pairwise containment relationship among the database, the data table, and the data field, and the mapping relationship indicates the pairwise mapping relationship among the database, the data table, and the data field.
  • Step S3: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
  • the above-mentioned storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
  • the modules or steps of the present application described above can be implemented by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or the modules or steps can each be fabricated as individual integrated circuit modules, or several of them can be fabricated as a single integrated circuit module.
  • the present application is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

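The present application provides a method and apparatus for determining data lineage, a storage medium, and an electronic apparatus. The method includes: obtaining metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field; analyzing and processing the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and, in response to a data query request for target data, determining the data lineage of the target data through the graph database. That is, the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata are saved in the graph database, and the data lineage of the target data is then determined through the graph database.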

Description

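Method and apparatus for determining data lineage, storage medium, and electronic apparatus
Cross-Reference
This application is based on, and claims priority to, Chinese patent application No. 202011617620.1, filed on December 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates mainly to the field of communications, and in particular to a method and apparatus for determining data lineage, a storage medium, and an electronic apparatus.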
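Background
With the rapid development of informatization and Internet technology, the era of the "information explosion" has arrived. For governments and enterprises alike, digitization has become an inevitable trend in their own development. The data in the various information systems is not only huge in volume but also stored on a wide variety of media and in a wide variety of formats. Eliminating "data silos", integrating and sharing data well, and mining and analyzing the integrated data are therefore increasingly important.
Among the approaches to solving "data silos", data warehouse technology is a best practice. A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data, and extract-transform-load (ETL) is the key step in building one. Recording and analyzing how data flows while it is exchanged and shared through ETL also has considerable practical value, for example for data provenance tracing, data value assessment, data quality assessment, and as a reference for data archiving and destruction.
With data lineage analysis methods based on relational databases, model creation, storage efficiency, and query efficiency cannot meet the needs of complex scenarios. Modeling data lineage in a traditional relational database is complicated: it involves multiple associated tables and many concepts that are hard for developers to understand. Storage requires writing to multiple tables, so the code logic is complicated. Query speed is limited by multi-table joins in the relational database, and the performance problem is especially pronounced when the data lineage chain is long and complex.
No effective technical solution has yet been proposed for the problem that, in data lineage analysis methods based on relational databases, model building, data storage, and data lineage querying are all relatively complex.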
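Summary
An embodiment of the present application provides a method for determining data lineage, including: obtaining metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field; analyzing and processing the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and, in response to a data query request for target data, determining the data lineage of the target data through the graph database.
An embodiment of the present application further provides an apparatus for determining data lineage, including: an obtaining module, configured to obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field; a processing module, configured to analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and a response module, configured to respond to a data query request for target data and determine the data lineage of the target data through the graph database.
An embodiment of the present application further provides a computer-readable storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
An embodiment of the present application further provides an electronic apparatus, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any one of the above method embodiments.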
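Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a computer terminal for the method for determining data lineage according to an embodiment of the present application;
FIG. 2 is a flowchart of the method for determining data lineage according to an embodiment of the present application;
FIG. 3 is a schematic diagram of metadata type definition and creation according to an embodiment of the present application;
FIG. 4 is a schematic diagram of data lineage construction according to an embodiment of the present application;
FIG. 5 is a schematic diagram of data lineage analysis according to an embodiment of the present application;
FIG. 6 is a schematic diagram of graph traversal nodes and directed edges according to an embodiment of the present application;
FIG. 7 is a structural block diagram of the apparatus for determining data lineage according to an embodiment of the present application.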
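Detailed Description
The embodiments of the present application provide a method and apparatus for determining data lineage, a storage medium, and an electronic apparatus, to at least solve the problem that, in data lineage analysis methods based on relational databases, model building, data storage, and data lineage querying are all relatively complex.
The present application is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, for the purposes of the embodiments of the present application described here. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.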
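The method provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a computer terminal as an example, FIG. 1 is a block diagram of the hardware structure of a computer terminal for the method for determining data lineage according to an embodiment of the present application. As shown in FIG. 1, the computer terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in FIG. 1, or have a different configuration with functions equivalent to or greater than those shown in FIG. 1. The memory 104 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the method for determining data lineage in the embodiments of the present application. The processor 102 runs the computer programs stored in the memory 104 to execute various functional applications and data processing, that is, to implement the above method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof. The transmission device 106 is used to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.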
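An embodiment of the present application provides a method for determining data lineage, applied to the computer terminal described above. FIG. 2 is a flowchart of the method for determining data lineage according to an embodiment of the present application. As shown in FIG. 2, the method includes:
Step S202: obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
Step S204: analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field.
Step S206: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
Through the above steps, metadata of an ETL task is obtained, where the metadata includes at least one of the following: a database, a data table, and a data field; the metadata is analyzed and processed so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and, in response to a data query request for target data, the data lineage of the target data is determined through the graph database. That is, the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata are saved in the graph database, and the data lineage of the target data is then determined through the graph database. This technical solution solves the problem that, in data lineage analysis methods based on relational databases, model building, data storage, and data lineage querying are all relatively complex; the graph-database-based method for determining data lineage makes model building, data storage, and data lineage querying simpler and more efficient.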
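Step S204 is implemented as follows:
Step 1: Analyze and process the metadata so as to save the metadata of the ETL task in the graph database. Specifically: obtain the metadata of the data source side of the ETL task and the metadata of the data destination side of the ETL task; determine, according to the metadata types provided by the graph database, a first metadata type for the source-side metadata and a second metadata type for the destination-side metadata; and save the source-side metadata in the graph database according to the first metadata type and the destination-side metadata in the graph database according to the second metadata type.
In other words, the metadata of the data source side and of the data destination side of the ETL task is obtained; the metadata types of the source-side and destination-side metadata, namely the first metadata type and the second metadata type, are determined using the metadata type definition method provided by the graph database; and the source-side and destination-side metadata is then saved in the graph database according to the first and second metadata types, respectively.
Step 2: Analyze and process the metadata so as to save the containment relationships of the metadata in the graph database. Specifically: determine the pairwise containment relationships among the database, the data table, and the data field; create the pairwise containment relationships using the object creation method provided by the graph database; and save the created pairwise containment relationships in the graph database.
It should be noted that this step analyzes the specific information in the source-side and destination-side metadata of the ETL task, namely the pairwise containment relationships among the database, the data table, and the data field. These relationships are created with the object creation method provided by the graph database, and the created pairwise containment relationships are then saved in the graph database.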
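The containment step can be sketched in a few lines. This is a minimal illustration, not the applicant's implementation: it assumes the analyzed metadata arrives as (database, table, field) triples, and the edge label `contains` and the dotted identifiers are hypothetical names chosen for the example.

```python
# Hypothetical edge label; the source does not name the relationship type.
CONTAINS = "contains"

def containment_edges(columns):
    """Derive database->table and table->field edges from (db, table, field) triples."""
    edges = set()
    for db, table, column in columns:
        table_id = f"{db}.{table}"
        column_id = f"{table_id}.{column}"
        edges.add((db, CONTAINS, table_id))         # a database contains a table
        edges.add((table_id, CONTAINS, column_id))  # a table contains a field
    return edges

edges = containment_edges([
    ("sales_db", "orders", "id"),
    ("sales_db", "orders", "amount"),
])
```

Because the edges are collected in a set, the shared database-to-table edge is stored once even though two fields refer to the same table.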
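Step 3: Analyze and process the metadata so as to save the mapping relationships between the metadata in the graph database. Specifically: create an ETL task metadata type in the graph database, where the ETL task metadata type contains an input/output list and the mapping relationship of the metadata, and the input/output list attribute stores the source-side metadata and the destination-side metadata; obtain the mapping relationships between the metadata and save them in the created ETL task metadata type, so that the mapping relationships between the metadata are saved in the graph database.
Specifically, the mapping relationship between the source-side metadata and the destination-side metadata of the ETL task is analyzed; it can be understood as the direction information from the data source side to the data destination side together with the correspondence between fields. An ETL task metadata type is created that includes an input/output list and the mapping relationship of the metadata. The source-side and destination-side metadata of the ETL task is stored in the corresponding input/output list, and the mapping relationship between the metadata is stored in the mapping-relationship attribute of the corresponding ETL task metadata type, which completes saving the mapping relationships between the metadata in the graph database.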
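As a rough sketch of what such an ETL task record might hold, the following uses a plain Python dataclass. The class name `EtlTask`, its attribute names, and the helper `record_task` are assumptions made for illustration; the source only states that the type carries an input/output list and a field mapping.

```python
from dataclasses import dataclass, field

@dataclass
class EtlTask:
    name: str
    inputs: list = field(default_factory=list)         # source-side metadata ids
    outputs: list = field(default_factory=list)        # destination-side metadata ids
    field_mapping: dict = field(default_factory=dict)  # source field -> destination field

def record_task(name, source_table, dest_table, column_pairs):
    """Fill the input/output lists and the field mapping for one ETL task."""
    task = EtlTask(name=name)
    task.inputs.append(source_table)
    task.outputs.append(dest_table)
    for src_col, dst_col in column_pairs:
        task.field_mapping[f"{source_table}.{src_col}"] = f"{dest_table}.{dst_col}"
    return task

task = record_task("load_orders", "mysql.A", "oracle.B", [("id", "order_id")])
```

Keeping direction (inputs versus outputs) separate from the field-level mapping mirrors the two pieces of information the paragraph above distinguishes.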
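Steps 1 to 3 above complete saving the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in the graph database.
Step S206 can be implemented in many ways. In one exemplary embodiment, in response to the data query request, a traversal query is performed over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data.
Specifically, the traversal query is performed over the input/output list using the traversal language in the input direction and/or the output direction, so as to determine the data lineage of the target data.
To determine the data lineage of the target data through the graph database, a data query request is first obtained; then, based on the input/output list of the ETL task metadata type and starting from the target data, the data lineage of the target data is queried with the traversal language of the graph database in the input direction and/or the output direction.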
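The direction-aware traversal can be illustrated without a real graph database. The sketch below is an assumption-laden stand-in: `TASKS` is a hypothetical in-memory table of input/output lists, and a breadth-first search plays the role of the graph traversal language.

```python
from collections import deque

# Hypothetical ETL tasks, each given as (input list, output list).
TASKS = {
    "job1": (["A"], ["B"]),
    "job2": (["B"], ["C"]),
}

def lineage(target, direction):
    """Walk upstream ('input') or downstream ('output') from the target data."""
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for inputs, outputs in TASKS.values():
            if direction == "output" and node in inputs:
                neighbours = outputs
            elif direction == "input" and node in outputs:
                neighbours = inputs
            else:
                continue
            for n in neighbours:
                if n not in seen and n != target:
                    seen.add(n)
                    queue.append(n)
    return seen
```

Calling `lineage("B", "input")` collects the upstream side and `lineage("B", "output")` the downstream side; taking the union of both calls covers the "and/or" case in the text.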
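The flow of the above method for determining data lineage is explained below with reference to several optional embodiments, which are not intended to limit the technical solutions of the embodiments of the present application.
To effectively integrate scattered, heterogeneous data resources and eliminate "data silos", ETL tools are currently used to orchestrate processing tasks. An ETL task can be regarded as a form of data flow, whose three basic elements are the data source, the data flow direction, and the data destination; it may also carry more detailed information, such as field-level lineage correspondences. The implementation steps are as follows:
Step 1: Abstract the metadata types of the ETL task's data source and data destination. These include metadata types for databases, data tables, and data fields, together with the containment (membership) relationships between the types; for example, a data field belongs to a data table, and a data table belongs to a database. The metadata types are initialized using the metadata type definition method provided by the graph database. In addition, an ETL task metadata type is defined for ETL tasks; it contains the input and output object list attributes of the transformations in the ETL task and the corresponding field mapping relationships. The input and output object list attributes are an important part of analyzing data lineage.
Step 2: Analyze the specific information of the data source and data destination in the ETL task (corresponding to the containment relationships of the metadata in the above embodiment), that is, the databases, data tables, and data fields and the relationships among them. Use the metadata object creation method provided by the graph database to create metadata objects of the corresponding metadata types and the containment relationships between the objects; these metadata objects and containment relationships are globally unique. Steps 1 and 2 are shown in FIG. 3, which is a schematic diagram of metadata type definition and creation according to an embodiment of the present application.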
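One way to picture the global-uniqueness requirement is a get-or-create registry keyed by a unique name. The sketch below is hypothetical: the source does not say how uniqueness is enforced, and the `MetadataRegistry` class and the qualified-name strings are illustrative assumptions.

```python
class MetadataRegistry:
    """In-memory stand-in for a metadata store with unique objects."""

    def __init__(self):
        self._objects = {}     # qualified name -> metadata object
        self.contains = set()  # (parent qualified name, child qualified name)

    def get_or_create(self, type_name, qualified_name):
        # Return the existing object when the qualified name is already known.
        obj = self._objects.get(qualified_name)
        if obj is None:
            obj = {"type": type_name, "qualified_name": qualified_name}
            self._objects[qualified_name] = obj
        return obj

    def link(self, parent, child):
        self.contains.add((parent["qualified_name"], child["qualified_name"]))

    def __len__(self):
        return len(self._objects)

registry = MetadataRegistry()
for _ in range(2):  # analyzing the same table twice must not duplicate it
    db = registry.get_or_create("database", "mysql://host/shop")
    table = registry.get_or_create("table", "mysql://host/shop.A")
    registry.link(db, table)
```

With this shape, two ETL tasks that touch the same table end up pointing at one shared object, which is what lets lineage chains connect across tasks.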
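Step 3: Analyze the data flow between the data elements in the ETL task (corresponding to the mapping relationships between metadata in the above embodiment), that is, the direction information from the data source to the data destination and the correspondence between fields. Create an ETL task metadata object and store the relevant information in it. Most importantly, the already created metadata objects for the data source and the data destination of the ETL task are stored in the input and output object lists of the ETL task metadata type, respectively, and the field correspondences between the source and destination metadata objects are stored in the field mapping relationship of the ETL task metadata object. This is shown in FIG. 4, which is a schematic diagram of data lineage construction according to an embodiment of the present application.
Step 4: Based on the data already stored in the graph database, perform data lineage queries and analysis using the query language and methods provided by the graph database. The principle is to use the input and output object lists stored in the ETL task metadata objects and, with the traversal language of the graph database, perform traversal queries in both the input and output directions starting from the input object. The traversal depth can be adjusted to suit the query, which yields the lineage.
Step 5: After the lineage data returned by the query has been organized and transformed, it can be consumed, for example by drawing a data lineage graph that clearly shows the upstream sources and downstream destinations of the data and the associated ETL task information, which makes analyses such as data provenance tracing convenient. This is shown in FIG. 5, which is a schematic diagram of data lineage analysis according to an embodiment of the present application.
The following takes as an example an ETL task that extracts data from data table A in a MySQL database and loads it into data table B in an Oracle database, with the Atlas metadata tool as the graph data storage framework. The steps are as follows:
Step 1: First abstract the metadata types for the MySQL and Oracle databases, their tables, and their fields. Because MySQL and Oracle are both relational databases, they are abstracted into the same set of types: RdbResource (database), RdbTable (data table), and RdbColumn (data table field, corresponding to the data field in the above embodiment). The Atlas metadata types are defined in XML format. Besides basic attributes such as name and type, RdbResource contains a list attribute of type RdbTable holding the data tables it contains, and RdbTable contains a list attribute of type RdbColumn holding the table fields it contains. For ETL tasks, the ETL task metadata type ETLJob is abstracted; besides basic attributes such as name and type, ETLJob contains a list attribute of type ETLJobTrans. Besides its basic attributes, ETLJobTrans contains input and output object list attributes and the field mapping relationships.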
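To make the shape of these five types concrete, here is a sketch using Python dataclasses. It only mirrors the attributes named in the text; it is not an Atlas type definition and does not use any Atlas API, and the attribute names (`columns`, `tables`, `inputs`, `outputs`, `field_mapping`, `transformations`, `data_type`, `db_type`) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class RdbColumn:
    name: str
    data_type: str

@dataclass
class RdbTable:
    name: str
    columns: list = field(default_factory=list)   # contained RdbColumn objects

@dataclass
class RdbResource:
    name: str
    db_type: str                                  # e.g. "mysql" or "oracle"
    tables: list = field(default_factory=list)    # contained RdbTable objects

@dataclass
class ETLJobTrans:
    name: str
    inputs: list = field(default_factory=list)    # input RdbTable objects
    outputs: list = field(default_factory=list)   # output RdbTable objects
    field_mapping: dict = field(default_factory=dict)

@dataclass
class ETLJob:
    name: str
    transformations: list = field(default_factory=list)  # ETLJobTrans objects

table_a = RdbTable("A", [RdbColumn("id", "int")])
table_b = RdbTable("B", [RdbColumn("id", "number")])
mysql = RdbResource("shop", "mysql", [table_a])
oracle = RdbResource("dw", "oracle", [table_b])
job = ETLJob("a_to_b", [ETLJobTrans("copy", [table_a], [table_b], {"A.id": "B.id"})])
```

MySQL and Oracle share the same three classes here, which reflects the text's point that both are abstracted into one set of relational types.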
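Step 2: Analyze the databases to which MySQL data table A and Oracle data table B of the ETL task belong and the table fields they contain. Once the database connections have been configured, this analysis can be done with automated database metadata analysis, after which the Atlas interface is called to create the corresponding RdbResource, RdbTable, and RdbColumn objects.
Step 3: Analyze the transformation steps in the ETL task (corresponding to the mapping relationships between metadata in the above embodiment) and create the corresponding ETLJob and ETLJobTrans metadata objects. MySQL data table A and Oracle data table B are added to the input and output object list attributes of the ETLJobTrans object, respectively, and the corresponding field mapping relationships are also stored in Atlas.
Step 4: Query and analyze the data already stored in Atlas. Using the graph traversal language provided by Atlas, perform an iterative traversal query that starts from MySQL table A and passes through the ETLJobTrans metadata objects, following the lineage associated through their input and output object list attributes. The traversal proceeds as follows: query the input object list node whose incoming edge comes from MySQL data table A, then the ETLJobTrans node reached through its incoming edge, then the output object list node on that node's outgoing edge, and finally the Oracle data table B node on the output object list's outgoing edge. This simple one-level query yields the downstream data lineage of MySQL data table A; repeating it several times yields the downstream lineage across multiple levels. Conversely, querying in the opposite direction yields the upstream data lineage of MySQL data table A. This is shown in FIG. 6, which is a schematic diagram of graph traversal nodes and directed edges according to an embodiment of the present application.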
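The one-hop-then-repeat pattern in Step 4 can be sketched over a labelled edge list. This is a simplification made for illustration: the input and output object list nodes are collapsed into the edge labels `input_of` and `output_to`, the third table `hive.C` is invented to show a second level, and plain Python stands in for the graph traversal language.

```python
# Hypothetical labelled edges: table -> transformation -> table.
EDGES = [
    ("mysql.A", "input_of", "trans1"),
    ("trans1", "output_to", "oracle.B"),
    ("oracle.B", "input_of", "trans2"),
    ("trans2", "output_to", "hive.C"),
]

def step(nodes, label, reverse=False):
    """Follow edges with the given label from a set of nodes, forward or backward."""
    if reverse:
        return {src for src, lab, dst in EDGES if lab == label and dst in nodes}
    return {dst for src, lab, dst in EDGES if lab == label and src in nodes}

def downstream(table, levels=1):
    """Repeat the one-level hop 'levels' times, returning the tables found per level."""
    frontier, result = {table}, []
    for _ in range(levels):
        frontier = step(step(frontier, "input_of"), "output_to")
        if not frontier:
            break
        result.append(sorted(frontier))
    return result

def upstream(table, levels=1):
    """Same hop in the opposite direction."""
    frontier, result = {table}, []
    for _ in range(levels):
        frontier = step(step(frontier, "output_to", reverse=True), "input_of", reverse=True)
        if not frontier:
            break
        result.append(sorted(frontier))
    return result
```

Bounding the loop by `levels` corresponds to the adjustable traversal depth mentioned in the text and also keeps the walk finite if the lineage graph contains a cycle.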
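Step 5: The data returned by the graph traversal query can be simplified and transformed as needed and provided to consumers of the lineage data in a data structure that is convenient to process. For example, it can be reduced to two data sets: a node list and a directed edge list. The node list contains only data table nodes and ETL task nodes, and each directed edge stores only a source node id and a destination node id; from these two data sets the consumer can reconstruct the actual lineage. In addition, customized graph database traversal queries can retrieve other related data, such as database-level lineage and table-field-level lineage.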
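The node-list and edge-list reduction might look like the following. The input format (a list of source table, ETL task, destination table triples) and the dictionary keys `id`, `kind`, `source`, and `target` are assumptions for the sketch; the source only specifies that edges carry a source node id and a destination node id.

```python
def simplify(raw_paths):
    """Reduce (source_table, etl_task, dest_table) triples to a node list and an edge list."""
    nodes, edges = {}, []
    for src, task, dst in raw_paths:
        for node_id, kind in ((src, "table"), (task, "etl_task"), (dst, "table")):
            nodes.setdefault(node_id, {"id": node_id, "kind": kind})
        edges.append({"source": src, "target": task})
        edges.append({"source": task, "target": dst})
    return list(nodes.values()), edges

def rebuild(nodes, edges):
    """Consumer side: reconstruct an adjacency map from the two lists."""
    adjacency = {n["id"]: [] for n in nodes}
    for e in edges:
        adjacency[e["source"]].append(e["target"])
    return adjacency

nodes, edges = simplify([("mysql.A", "job1", "oracle.B")])
graph = rebuild(nodes, edges)
```

A consumer that draws the lineage graph needs nothing beyond these two lists, which keeps the interface independent of the graph database behind it.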
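Through the above steps, for the various ETL tasks that are developed and defined, the graph database's custom metadata facility is first used to predefine the various data elements, including but not limited to common relational databases (such as MySQL, Oracle, and SQL Server), big-data-related types (Hive, Impala, HBase, ES, MongoDB, and so on), structured or unstructured files (FTP, HDFS), and the ETL task type itself, and the types are initialized in the graph database. Then, for each ETL task, the lineage between its data elements is analyzed, converted into relationships between graph database entities, and stored in the graph database. Finally, the efficient and convenient query methods provided by the graph database are used for data lineage analysis.
The above is a detailed description of preferred implementations of the present application, but implementation is not limited to them. In particular, the definition of the graph metadata types and the use of the graph database traversal query language can be customized or varied for different data lineage analysis needs, for example by simplifying or omitting the definition of the ETL task metadata type and directly associating data table metadata objects through directed edges. All such equivalent customizations or variations fall within the scope defined by the claims of the present application.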
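The simplified variant mentioned here, where tables are linked directly and the ETL task type is dropped, can be sketched as a single set comprehension. The function name `direct_edges` and the task format are hypothetical.

```python
def direct_edges(tasks):
    """Collapse each task's (inputs, outputs) pair into source-table -> destination-table edges."""
    return {(src, dst) for inputs, outputs in tasks for src in inputs for dst in outputs}

edges = direct_edges([(["A1", "A2"], ["B"]), (["B"], ["C"])])
```

The trade-off is that the resulting graph no longer records which ETL task produced an edge or how fields map, so this form suits table-level lineage only.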
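From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
This embodiment further provides an apparatus for determining data lineage. The apparatus is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 7 is a structural block diagram of the apparatus for determining data lineage according to an embodiment of the present application. As shown in FIG. 7, the apparatus includes:
an obtaining module 72, configured to obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field;
a processing module 74, configured to analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field;
a response module 76, configured to respond to a data query request for target data and determine the data lineage of the target data through the graph database.
With the present application, metadata of an ETL task is obtained, where the metadata includes at least one of the following: a database, a data table, and a data field; the metadata is analyzed and processed so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field; and, in response to a data query request for target data, the data lineage of the target data is determined through the graph database. That is, the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata are saved in the graph database, and the data lineage of the target data is then determined through the graph database. This technical solution solves the problem that, in data lineage analysis methods based on relational databases, model building, data storage, and data lineage querying are all relatively complex; the graph-database-based method for determining data lineage makes model building, data storage, and data lineage querying simpler and more efficient.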
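In an optional embodiment, the processing module is further configured to analyze and process the metadata so as to save the metadata of the ETL task in the graph database. Specifically: obtain the metadata of the data source side of the ETL task and the metadata of the data destination side of the ETL task; determine, according to the metadata types provided by the graph database, a first metadata type for the source-side metadata and a second metadata type for the destination-side metadata; and save the source-side metadata in the graph database according to the first metadata type and the destination-side metadata in the graph database according to the second metadata type.
In other words, the metadata of the data source side and of the data destination side of the ETL task is obtained; the metadata types of the source-side and destination-side metadata, namely the first metadata type and the second metadata type, are determined using the metadata type definition method provided by the graph database; and the source-side and destination-side metadata is then saved in the graph database according to the first and second metadata types, respectively.
In an optional embodiment, the processing module is further configured to analyze and process the metadata so as to save the containment relationships of the metadata in the graph database. Specifically: determine the pairwise containment relationships among the database, the data table, and the data field; create the pairwise containment relationships using the object creation method provided by the graph database; and save the created pairwise containment relationships in the graph database.
It should be noted that this analyzes the specific information in the source-side and destination-side metadata of the ETL task, namely the pairwise containment relationships among the database, the data table, and the data field. These relationships are created with the object creation method provided by the graph database, and the created pairwise containment relationships are then saved in the graph database.
In an optional embodiment, the processing module is further configured to analyze and process the metadata so as to save the mapping relationships between the metadata in the graph database. Specifically: create an ETL task metadata type in the graph database, where the ETL task metadata type contains an input/output list and the mapping relationship of the metadata, and the input/output list attribute stores the source-side metadata and the destination-side metadata; obtain the mapping relationships between the metadata and save them in the created ETL task metadata type, so that the mapping relationships between the metadata are saved in the graph database.
Specifically, the mapping relationship between the source-side metadata and the destination-side metadata of the ETL task is analyzed; it can be understood as the direction information from the data source side to the data destination side together with the correspondence between fields. An ETL task metadata type is created that includes an input/output list and the mapping relationship of the metadata. The source-side and destination-side metadata of the ETL task is stored in the corresponding input/output list, and the mapping relationship between the metadata is stored in the mapping-relationship attribute of the corresponding ETL task metadata type, which completes saving the mapping relationships between the metadata in the graph database.
This completes saving the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in the graph database.
In an exemplary embodiment, the response module is further configured to respond to the data query request by performing a traversal query over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data.
Specifically, the response module is further configured to perform the traversal query over the input/output list using the traversal language in the input direction and/or the output direction, so as to determine the data lineage of the target data.
To determine the data lineage of the target data through the graph database, a data query request is first obtained; then, based on the input/output list of the ETL task metadata type and starting from the target data, the data lineage of the target data is queried with the traversal language of the graph database in the input direction and/or the output direction.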
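It should be noted that the above modules can be implemented in software or hardware. For the latter, this can be done in, but is not limited to, the following ways: the modules are all located in the same processor, or the modules are located in different processors in any combination.
An embodiment of the present application further provides a storage medium in which a computer program is stored, where the computer program is configured to execute, when run, the steps of any one of the above method embodiments.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:
Step S1: obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
Step S2: analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field.
Step S3: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.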
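An embodiment of the present application further provides an electronic apparatus, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps of any one of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps through a computer program:
Step S1: obtain metadata of an extract-transform-load (ETL) task, where the metadata includes at least one of the following: a database, a data table, and a data field.
Step S2: analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, where the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field.
Step S3: in response to a data query request for target data, determine the data lineage of the target data through the graph database.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
Optionally, for specific examples in this embodiment, refer to the examples described in the above embodiments and optional embodiments; they are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present application described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases the steps shown or described can be performed in an order different from the one given here, or the modules or steps can each be fabricated as individual integrated circuit modules, or several of them can be fabricated as a single integrated circuit module. The present application is thus not limited to any particular combination of hardware and software.
The above are only preferred embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the principles of the present application shall fall within the scope of protection of the present application.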

Claims (10)

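  1. A method for determining data lineage, comprising:
    obtaining metadata of an extract-transform-load (ETL) task, wherein the metadata includes at least one of the following: a database, a data table, and a data field;
    analyzing and processing the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, wherein the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field;
    in response to a data query request for target data, determining the data lineage of the target data through the graph database.
  2. The method according to claim 1, wherein analyzing and processing the metadata so as to save the metadata of the ETL task in the graph database comprises:
    obtaining the metadata of the data source side of the ETL task and the metadata of the data destination side of the ETL task;
    determining, according to the metadata types provided by the graph database, a first metadata type of the source-side metadata and a second metadata type of the destination-side metadata;
    saving the source-side metadata in the graph database according to the first metadata type, and saving the destination-side metadata in the graph database according to the second metadata type.
  3. The method according to any one of claims 1 to 2, wherein analyzing and processing the metadata so as to save the containment relationships of the metadata in the graph database comprises:
    determining the pairwise containment relationships among the database, the data table, and the data field;
    creating the pairwise containment relationships using the object creation method provided by the graph database, and saving the created pairwise containment relationships in the graph database.
  4. The method according to any one of claims 1 to 3, wherein analyzing and processing the metadata so as to save the mapping relationships between the metadata in the graph database comprises:
    creating an ETL task metadata type in the graph database, wherein the ETL task metadata type contains an input/output list and the mapping relationship of the metadata, and the input/output list attribute is used to store the source-side metadata and the destination-side metadata;
    obtaining the mapping relationships between the metadata and saving the mapping relationships in the created ETL task metadata type, so as to save the mapping relationships between the metadata in the graph database.
  5. The method according to any one of claims 1 to 4, wherein, in response to a data query request for target data, determining the data lineage of the target data through the graph database comprises:
    in response to the data query request, performing a traversal query over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data.
  6. The method according to claim 5, wherein performing a traversal query over the input/output list using the traversal language of the graph database, so as to determine the data lineage of the target data, comprises:
    performing the traversal query over the input/output list using the traversal language in the input direction and/or the output direction, so as to determine the data lineage of the target data.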
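  7. An apparatus for determining data lineage, comprising:
    an obtaining module, configured to obtain metadata of an extract-transform-load (ETL) task, wherein the metadata includes at least one of the following: a database, a data table, and a data field;
    a processing module, configured to analyze and process the metadata so as to save the metadata of the ETL task, the containment relationships of the metadata, and the mapping relationships between the metadata in a graph database, wherein the containment relationships indicate the pairwise containment relationships among the database, the data table, and the data field, and the mapping relationships indicate the pairwise mapping relationships among the database, the data table, and the data field;
    a response module, configured to respond to a data query request for target data and determine the data lineage of the target data through the graph database.
  8. The apparatus according to claim 7, wherein the processing module is further configured to: obtain the metadata of the data source side of the ETL task and the metadata of the data destination side of the ETL task; determine, according to the metadata types provided by the graph database, a first metadata type of the source-side metadata and a second metadata type of the destination-side metadata; and save the source-side metadata in the graph database according to the first metadata type and the destination-side metadata in the graph database according to the second metadata type.
  9. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 6.
  10. An electronic apparatus, comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 6.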
PCT/CN2021/136131 2020-12-30 2021-12-07 Method and apparatus for determining data lineage, storage medium, and electronic apparatus WO2022143045A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011617620.1 2020-12-30
CN202011617620.1A CN114691786A (zh) 2020-12-30 2020-12-30 Method and apparatus for determining data lineage, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
WO2022143045A1 true WO2022143045A1 (zh) 2022-07-07

Family

ID=82134098

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/136131 WO2022143045A1 (zh) 2020-12-30 2021-12-07 Method and apparatus for determining data lineage, storage medium, and electronic apparatus

Country Status (2)

Country Link
CN (1) CN114691786A (zh)
WO (1) WO2022143045A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115062049A (zh) * 2022-07-28 2022-09-16 浙江城云数字科技有限公司 一种数据血缘分析方法及装置
CN115168363A (zh) * 2022-07-29 2022-10-11 北京远舢智能科技有限公司 元数据的处理方法、装置、电子设备及存储介质
CN115757655A (zh) * 2022-11-14 2023-03-07 中国兵器工业计算机应用技术研究所 一种基于元数据管理的数据血缘分析系统和方法
CN116028248A (zh) * 2023-03-30 2023-04-28 紫金诚征信有限公司 适用于web端的数据处理方法、装置及电子设备
CN116166718A (zh) * 2023-04-25 2023-05-26 北京捷泰云际信息技术有限公司 一种数据血缘获取方法和装置
CN116541887B (zh) * 2023-07-07 2023-09-15 云启智慧科技有限公司 一种大数据平台数据安全保护方法
CN117273131A (zh) * 2023-11-22 2023-12-22 四川三合力通科技发展集团有限公司 一种跨节点数据关系发现系统及方法
CN117555950A (zh) * 2024-01-12 2024-02-13 山东再起数据科技有限公司 基于数据中台的数据血缘关系构建方法

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
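CN109739893A (zh) * 2018-12-28 2019-05-10 上海连尚网络科技有限公司 Metadata management method, device, and computer-readable medium
CN109739894A (zh) * 2019-01-04 2019-05-10 深圳前海微众银行股份有限公司 Method, apparatus, device, and storage medium for supplementing metadata descriptions
CN112115192A (zh) * 2020-10-09 2020-12-22 北京东方通软件有限公司 Efficient process orchestration method and system for an ETL system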

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
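CN115062049A (zh) * 2022-07-28 2022-09-16 浙江城云数字科技有限公司 Data lineage analysis method and apparatus
CN115062049B (zh) * 2022-07-28 2022-11-18 浙江城云数字科技有限公司 Data lineage analysis method and apparatus
CN115168363A (zh) * 2022-07-29 2022-10-11 北京远舢智能科技有限公司 Metadata processing method and apparatus, electronic device, and storage medium
CN115757655B (zh) * 2022-11-14 2023-07-07 中国兵器工业计算机应用技术研究所 Data lineage analysis system and method based on metadata management
CN115757655A (zh) * 2022-11-14 2023-03-07 中国兵器工业计算机应用技术研究所 Data lineage analysis system and method based on metadata management
CN116028248A (zh) * 2023-03-30 2023-04-28 紫金诚征信有限公司 Data processing method and apparatus for the web side, and electronic device
CN116028248B (zh) * 2023-03-30 2023-07-25 紫金诚征信有限公司 Data processing method and apparatus for the web side, and electronic device
CN116166718A (zh) * 2023-04-25 2023-05-26 北京捷泰云际信息技术有限公司 Data lineage acquisition method and apparatus
CN116541887B (zh) * 2023-07-07 2023-09-15 云启智慧科技有限公司 Data security protection method for a big data platform
CN117273131A (zh) * 2023-11-22 2023-12-22 四川三合力通科技发展集团有限公司 Cross-node data relationship discovery system and method
CN117273131B (zh) * 2023-11-22 2024-02-13 四川三合力通科技发展集团有限公司 Cross-node data relationship discovery system and method
CN117555950A (zh) * 2024-01-12 2024-02-13 山东再起数据科技有限公司 Data lineage construction method based on a data middle platform
CN117555950B (zh) * 2024-01-12 2024-04-02 山东再起数据科技有限公司 Data lineage construction method based on a data middle platform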

Also Published As

Publication number Publication date
CN114691786A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
WO2022143045A1 (zh) Method and apparatus for determining data lineage, storage medium, and electronic apparatus
EP3274875B1 (en) System and method for querying data sources
CN102799622B (zh) Distributed SQL query method based on an extended MapReduce framework
Zhao et al. Modeling MongoDB with relational model
CN109656963B (zh) Metadata acquisition method, apparatus, device, and computer-readable storage medium
WO2019143705A1 (en) Dimension context propagation techniques for optimizing sql query plans
US11693912B2 (en) Adapting database queries for data virtualization over combined database stores
US20140358844A1 (en) Workflow controller compatibility
WO2020238597A1 (zh) Hadoop-based data update method, apparatus, system, and medium
JP2010524060A (ja) Data merging in distributed computing
US9201700B2 (en) Provisioning computer resources on a network
US10311045B2 (en) Aggregation/evaluation of heterogenic time series data
AU2017254506B2 (en) Method, apparatus, computing device and storage medium for data analyzing and processing
Al Naami et al. GISQF: An efficient spatial query processing system
CN114461603A (zh) Multi-source heterogeneous data fusion method and apparatus
CN111309868A (zh) Knowledge graph construction and retrieval method and apparatus
WO2018045610A1 (zh) Method and apparatus for executing distributed computing tasks
Chakraborty et al. Semantic etl—State-of-the-art and open research challenges
CN114969441A (zh) Knowledge mining engine system based on a graph database
US10459987B2 (en) Data virtualization for workflows
CN116383238B (zh) Graph-structure-based data virtualization system, method, apparatus, device, and medium
Dombrowski et al. Knowledge graphs for an automated information provision in the factory planning
CN112506887A (zh) Vehicle terminal CAN bus data processing method and apparatus
CN111159213A (zh) Data query method, apparatus, system, and storage medium
CN109344175A (zh) Method and system for extending the data analysis capability of a relational database, and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913786

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13.11.2023)