CN117056369A

CN117056369A - Data lineage processing method, device, equipment and medium

Info

Publication number: CN117056369A
Application number: CN202311015269.2A
Authority: CN
Inventors: 丁锐
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2023-08-11
Filing date: 2023-08-11
Publication date: 2023-11-14

Abstract

The application discloses a data lineage processing method, device, equipment and medium, which can be applied in the field of big data or the field of finance. The method comprises the following steps: based on a hook plug-in, intercepting and parsing a database operation request to obtain related information of a data table, where the related information of the data table characterizes the lineage relationships among data tables and among the fields in the data tables; asynchronously sending the related information of the data table to a message queue; and consuming the related information of the data table output by the message queue and storing it in a target database, where the target database is used for querying and displaying the lineage relationships among the data tables and among the fields in the data tables. Because the hook plug-in directly intercepts the database operation request and parses out the lineage relationships involving the data table, data lineage analysis can be carried out quickly, which improves processing efficiency and accuracy.

Description

Data lineage processing method, device, equipment and medium

Technical Field

The present application relates to the field of big data technologies, and in particular to a method, an apparatus, a device, and a medium for processing data lineage.

Background

Data lineage, also called the lineage relationship of data, refers to the relationships that form naturally among data over its whole life cycle, from generation, processing, and fusion through circulation to final retirement. Because data must ultimately be integrated across its full life cycle and fed back to support actual business, the analysis and processing of data lineage is of great significance to that business.

Take a financial institution as an example: it has many business systems and complex data structures. Conventional data lineage processing relies mainly on analyzing the extract, transform, load (ETL) jobs that move the data. However, because ETL jobs are numerous and their processing is complicated, errors are likely to occur. As a result, lineage analysis based on ETL jobs has low processing efficiency, and the lineage relationships it produces have poor accuracy.

Disclosure of Invention

The embodiments of the application provide a method, a device, equipment and a medium for processing data lineage, which are used for improving the efficiency of data lineage processing and the accuracy of the lineage relationships obtained.

In a first aspect, an embodiment of the present application provides a method for processing data lineage, including:

based on a hook plug-in, intercepting and parsing a database operation request to obtain related information of a data table; the related information of the data table is used for characterizing the lineage relationships among the data tables and the lineage relationships among the fields in the data tables;

asynchronously sending the related information of the data table to a message queue;

consuming the related information of the data table output by the message queue, and storing the related information of the data table in a target database; the target database is used for querying and displaying the lineage relationships among the data tables and the lineage relationships among the fields in the data tables.

Optionally, the message queue is a Kafka message queue, and consuming the related information of the data table output by the message queue comprises:

registering a Kafka topic and consuming the related information of the data table based on the topic.

Optionally, the method further comprises:

integrating the fields of the same data table in the same database to obtain an integrated field;

and updating the lineage relationships among the fields in the data table based on the integrated field.

Optionally, the hook plug-in is obtained by the following steps:

determining a currently running data engine; each data engine corresponds to one hook plug-in;

based on the data engine, a corresponding hook plug-in is determined.

Optionally, the data engine includes the data warehouse tool Hive; the hook plug-in corresponding to Hive is the Hive hook, a hook plug-in mounted in Hive;

intercepting and parsing the database operation request based on the hook plug-in to obtain the related information of the data table comprises:

configuring a Hive hook interface based on the Hive hook; the Hive hook interface is used for intercepting and parsing the database operation request to obtain the related information of the data table.

Optionally, the data engine includes the big data computing engine Spark; the hook plug-in corresponding to Spark is the Spark hook, a hook plug-in mounted in Spark;

configuring a Spark hook interface based on the Spark hook; the Spark hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table.

Optionally, the data engine includes the data query engine Presto; the hook plug-in corresponding to Presto is the Presto hook, a hook plug-in mounted in Presto;

configuring a Presto hook interface based on the Presto hook; the Presto hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table.

In a second aspect, an embodiment of the present application provides a device for processing data lineage, including:

a parsing module, used for intercepting and parsing the database operation request based on the hook plug-in to obtain the related information of the data table; the related information of the data table is used for characterizing the lineage relationships among the data tables and the lineage relationships among the fields in the data tables;

a sending module, used for asynchronously sending the related information of the data table to the message queue;

a consumption module, used for consuming the related information of the data table output by the message queue and storing the related information of the data table in a target database; the target database is used for querying and displaying the lineage relationships among the data tables and the lineage relationships among the fields in the data tables.

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a system bus;

the processor and the memory are connected through the system bus;

the memory is used for storing one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform any implementation of the data lineage processing method described above.

In a fourth aspect, an embodiment of the present application provides a computer readable storage medium storing instructions which, when executed on an electronic device, cause the electronic device to perform any implementation of the data lineage processing method described above.

As can be seen from the above technical solutions, the embodiments of the present application have the following advantages:

in the embodiments of the application, the database operation request is intercepted and parsed based on the hook plug-in, and once the related information of the data table is obtained, it can be sent asynchronously to the message queue. Because the related information of the data table characterizes the lineage relationships among the data tables and among the fields in the data tables, it can be stored in a target database after being consumed from the message queue, and the target database can be used for querying and displaying the lineage relationships. The hook plug-in thus directly intercepts the database operation request and parses out both the table-level and the field-level lineage relationships, so that data lineage analysis can be carried out quickly, which improves processing efficiency and accuracy. In addition, buffering the information in the message queue before consuming it makes it easier to integrate the lineage relationships, and once the information is stored in the target database the lineage relationships can be queried and displayed, so that the lineage of the data can be grasped clearly and intuitively, which helps the financial institution carry out its subsequent business smoothly.

Drawings

Fig. 1 is a flowchart of a data lineage processing method according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a data lineage processing device according to an embodiment of the present application.

Detailed Description

As described above, taking a financial institution as an example, its business systems are numerous and its data structures are complex. Conventional data lineage processing relies mainly on analyzing the ETL jobs that move the data. However, because ETL jobs are numerous and their processing is complicated, errors are likely to occur, so lineage analysis based on ETL jobs has low processing efficiency and yields lineage relationships of poor accuracy.

To solve the above problems, an embodiment of the present application provides a data lineage processing method, which may include: intercepting and parsing a database operation request based on a hook plug-in and, once the related information of the data table is obtained, asynchronously sending it to a message queue. Because the related information of the data table characterizes the lineage relationships among the data tables and among the fields in the data tables, it can be stored in a target database after being consumed from the message queue, and the target database can be used for querying and displaying the lineage relationships.

The hook plug-in thus directly intercepts the database operation request and parses out both the table-level and the field-level lineage relationships, so that data lineage analysis can be carried out quickly, which improves processing efficiency and accuracy. In addition, buffering the information in the message queue before consuming it makes it easier to integrate the lineage relationships, and once the information is stored in the target database the lineage relationships can be queried and displayed, so that the lineage of the data can be grasped clearly and intuitively, which helps the financial institution carry out its subsequent business smoothly.

It should be noted that the data lineage processing method, device, equipment and medium provided by the embodiments of the application can be used in the big data field or the financial field. These fields are merely examples and do not limit the fields of application. In addition, the embodiments of the application do not limit the execution subject of the data lineage processing method; for example, the method may be applied to data processing equipment such as a terminal device or a server. The terminal device may be an electronic device such as a smartphone, a computer, a personal digital assistant (PDA), or a tablet computer. The server may be a standalone server, a server cluster, or a cloud server.

For ease of understanding, the terms involved in the embodiments of the present application will be described first.

A hook plug-in, also simply called a hook, modifies or extends the original behavior of an operating system, application, or other software component by intercepting function calls, messages, or events passed between software modules. Specifically, a hook plug-in allows an arbitrary program to be installed in advance in an original program; when execution of the original program reaches the position of the hook, the original program is intercepted and the installed program is executed first. In the data lineage processing scenario provided by the embodiments of the application, the hook plug-in is used for intercepting and parsing the database operation request, so as to parse out the related information of the data table, namely the lineage relationships among the data tables and among the fields in the data tables.

The Kafka message queue is a high-throughput distributed publish-subscribe messaging system, used mainly for buffering in data processing systems, especially for real-time streaming data processing. The topic is a central concept in Kafka: messages are classified by topic, and both producing and consuming a message are carried out against a topic.

The data warehouse tool Hive can be used for extracting, transforming, and loading data; it is a mechanism for handling large-scale data and is suited to statistical analysis over a data warehouse. Accordingly, a Hive hook is a hook plug-in mounted in Hive.

The big data computing engine Spark is a fast, general-purpose computing engine designed for large-scale data processing. Accordingly, a Spark hook is a hook plug-in mounted in Spark.

Presto is a data query engine that supports fast interactive analysis of data. Accordingly, a Presto hook is a hook plug-in mounted in Presto.

In order to make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Fig. 1 is a flowchart of a data lineage processing method according to an embodiment of the present application. Referring to Fig. 1, the data lineage processing method provided in the embodiment of the present application may include:

S101: Based on the hook plug-in, intercept and parse the database operation request to obtain the related information of the data table.

The related information of the data table is used for characterizing the lineage relationships among the data tables and the lineage relationships among the fields in the data tables. Specifically, it may indicate in which data table of which database a given piece of data is stored, which field the data corresponds to, and the attributes of that field. The table-level and field-level lineage relationships can be characterized on this basis.
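As a non-limiting illustration, the related information of the data table could be modeled as a small record. The sketch below is an assumption made for explanation only: the names `FieldRef`, `LineageRecord`, and `table_edges`, and the example database and table names, are hypothetical and do not appear in the application.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class FieldRef:
    """Hypothetical reference to one field: database, data table, field, attribute."""
    database: str
    table: str
    name: str
    data_type: str = "string"  # one example of a field attribute


@dataclass
class LineageRecord:
    """Hypothetical record for one parsed operation: source fields feed a target field."""
    engine: str
    target: FieldRef
    sources: list = field(default_factory=list)

    def table_edges(self):
        """Derive table-level lineage edges from the field-level lineage."""
        target_table = (self.target.database, self.target.table)
        return sorted({((s.database, s.table), target_table) for s in self.sources})

    def to_json(self):
        """Serialize the record so that it can be sent as a message."""
        return json.dumps(asdict(self), sort_keys=True)


record = LineageRecord(
    engine="hive",
    target=FieldRef("dw", "customer_summary", "full_name"),
    sources=[
        FieldRef("ods", "customer", "last_name"),
        FieldRef("ods", "customer", "first_name"),
    ],
)
print(record.table_edges())
```

A single record of this kind carries both levels of lineage: the field-level relationship is stored directly, and the table-level relationship is derived from it.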

In an embodiment of the present application, each data engine may correspond to one hook plug-in. Accordingly, the hook plug-in may be obtained by: determining the currently running data engine; and determining the corresponding hook plug-in based on that data engine. In this way the hook plug-in can be obtained quickly, which makes it convenient to intercept and parse the database operation request and thereby obtain the lineage relationships among the data tables and among the fields in the data tables. For example, the big data platform of a commercial bank may include three data engines: Hive, Spark, and Presto. Correspondingly, the hook plug-in corresponding to Hive is the Hive hook mounted in Hive; the hook plug-in corresponding to Spark is the Spark hook mounted in Spark; and the hook plug-in corresponding to Presto is the Presto hook mounted in Presto.
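As a non-limiting illustration, the one-to-one correspondence between data engines and hook plug-ins can be sketched as a lookup table. The registry `HOOKS_BY_ENGINE` and the function `select_hook` are hypothetical names introduced here; how the running engine is actually detected is not specified in the application and is left to the caller.

```python
# Hypothetical registry: one hook plug-in per data engine.
HOOKS_BY_ENGINE = {
    "hive": "Hive hook",
    "spark": "Spark hook",
    "presto": "Presto hook",
}


def select_hook(running_engine: str) -> str:
    """Return the hook plug-in registered for the currently running data engine."""
    key = running_engine.strip().lower()
    try:
        return HOOKS_BY_ENGINE[key]
    except KeyError:
        raise ValueError(f"no hook plug-in registered for engine {running_engine!r}") from None


print(select_hook("Spark"))
```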

Based on this, for the three different data engines and their respective hook plugins, the embodiments of the present application may provide different implementation manners to implement the process of intercepting and analyzing the database operation request, that is, S101, which are described below.

As a possible implementation, for the Hive hook, S101 may specifically include: configuring a Hive hook interface based on the Hive hook; the Hive hook interface is used for intercepting and parsing the database operation request to obtain the related information of the data table. Specifically, in Hive, the Hive hook interface may be the ExecuteWithHookContext interface, which may be used to obtain the related information of the data table from the HookContext.
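As a non-limiting illustration of what parsing a hook context can yield, the sketch below pairs the tables read by one intercepted operation with the tables it writes. It is a Python stand-in, not Hive code: the dictionary and its keys `inputs` and `outputs` are hypothetical substitutes for the HookContext, and the table names are invented.

```python
def table_edges_from_context(hook_context):
    """Pair every input table with every output table of one intercepted operation."""
    inputs = sorted(set(hook_context.get("inputs", [])))
    outputs = sorted(set(hook_context.get("outputs", [])))
    return [(source, target) for target in outputs for source in inputs if source != target]


# Hypothetical context for a query that reads two tables and writes one.
context = {
    "inputs": ["ods.customer", "ods.account"],
    "outputs": ["dw.customer_summary"],
}
edges = table_edges_from_context(context)
print(edges)
```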

As another possible implementation, for the Spark hook, S101 may specifically include: configuring a Spark hook interface based on the Spark hook; the Spark hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table. Specifically, in Spark, the Spark hook interface may be the QueryExecutionListener interface. This interface may be used to obtain the currently executed QueryExecution through the onSuccess callback function, obtain all Attribute objects through the output method of LogicalPlan, and traverse and parse those Attribute objects using the exprId mapping of NamedExpression to obtain the related information of the data table.
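As a non-limiting illustration of traversal by expression identifier, the sketch below resolves each output column back to the base-table columns it depends on. It is a toy model rather than Spark code: the classes `Attribute` and `Alias`, the function `resolve_sources`, and all identifiers and table names are hypothetical stand-ins chosen for explanation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """Hypothetical column read from a base table, identified by an expression id."""
    expr_id: int
    name: str
    table: str


@dataclass(frozen=True)
class Alias:
    """Hypothetical named expression that consumes other expression ids."""
    expr_id: int
    name: str
    references: tuple


def resolve_sources(output_ids, aliases, base_attributes):
    """Map each output expression id to the base columns reachable through references."""
    alias_by_id = {a.expr_id: a for a in aliases}
    base_by_id = {a.expr_id: a for a in base_attributes}
    lineage = {}
    for out_id in output_ids:
        found, stack, seen = set(), [out_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in base_by_id:
                attr = base_by_id[current]
                found.add(f"{attr.table}.{attr.name}")
            elif current in alias_by_id:
                stack.extend(alias_by_id[current].references)
        lineage[out_id] = sorted(found)
    return lineage


base = [
    Attribute(1, "last_name", "ods.customer"),
    Attribute(2, "first_name", "ods.customer"),
    Attribute(3, "balance", "ods.account"),
]
aliases = [
    Alias(10, "full_name", (1, 2)),
    Alias(11, "total", (3,)),
    Alias(12, "label", (10,)),
]
result = resolve_sources([12, 11], aliases, base)
print(result)
```

The traversal follows references until it reaches a base-table column, so an output column defined on top of another alias (id 12 above) is still traced to its original fields.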

As yet another possible implementation, for the Presto hook, S101 may specifically include: configuring a Presto hook interface based on the Presto hook; the Presto hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table. Specifically, in Presto, the Presto hook interface may be the EventListener interface, which may be used to parse the related information of the data table from the QueryCompletedEvent passed to the queryCompleted callback function.

S102: Asynchronously send the related information of the data table to a message queue.

In the embodiment of the application, the information is sent to the message queue asynchronously, so the sender does not need to wait for a response from the message queue. This improves the flexibility and availability of the data engine and thereby its performance.
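As a non-limiting illustration of asynchronous sending, the sketch below lets the caller hand over a message and return at once while a background thread performs the delivery. A standard-library queue stands in for the message queue; the class `AsyncSender` and the `deliver` callback are hypothetical, and a real deployment would deliver to a message broker instead.

```python
import queue
import threading


class AsyncSender:
    """Hypothetical fire-and-forget sender backed by a background worker thread."""

    def __init__(self, deliver):
        self._buffer = queue.Queue()
        self._deliver = deliver  # callable that performs the actual delivery
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, message):
        """Enqueue the message and return immediately, without waiting for delivery."""
        self._buffer.put(message)

    def _run(self):
        while True:
            message = self._buffer.get()
            try:
                if message is None:  # sentinel: stop the worker
                    return
                self._deliver(message)
            finally:
                self._buffer.task_done()

    def close(self):
        """Flush pending messages and stop the worker."""
        self._buffer.put(None)
        self._buffer.join()


delivered = []
sender = AsyncSender(delivered.append)
sender.send({"target": "dw.customer_summary", "sources": ["ods.customer"]})
sender.close()
print(delivered)
```

Because `send` only enqueues, the intercepted database operation is not delayed by the delivery itself, which is the property the asynchronous mode is meant to provide.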

S103: Consume the related information of the data table output by the message queue, and store the related information of the data table in a target database.

In the embodiment of the present application, consuming the related information of the data table output by the message queue, that is, S103, may specifically include: registering a Kafka topic and consuming the related information of the data table based on the topic. In Kafka, each topic is consumed by the consumers subscribed to it. Registering the topic and consuming the related information of the data table based on it therefore makes it easier to integrate the lineage relationships. Once the information is stored in the target database, the lineage relationships can be queried and displayed, so that the lineage of the data can be grasped clearly and intuitively, which helps the financial institution carry out its subsequent business smoothly.
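As a non-limiting illustration of topic-based consumption, the sketch below uses an in-memory stand-in for the message queue and a dictionary as the target store. The class `InMemoryBroker`, the function `consume_into`, the topic names, and the message layout are all hypothetical; they do not reproduce the Kafka client API.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Hypothetical stand-in for a message queue that classifies messages by topic."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, topic):
        """Yield and remove every message currently waiting on the topic."""
        pending = self._topics[topic]
        while pending:
            yield pending.popleft()


def consume_into(broker, topic, store):
    """Consume one topic and merge each message into the store (target -> sources)."""
    count = 0
    for message in broker.poll(topic):
        store.setdefault(message["target"], set()).update(message["sources"])
        count += 1
    return count


broker = InMemoryBroker()
broker.publish("lineage", {"target": "dw.customer_summary", "sources": ["ods.customer"]})
broker.publish("lineage", {"target": "dw.customer_summary", "sources": ["ods.account"]})
broker.publish("other", {"target": "ignored", "sources": []})

store = {}
consumed = consume_into(broker, "lineage", store)
print(consumed, store)
```

Only messages on the subscribed topic are consumed, and messages about the same target table are merged into one entry, which is one simple way the lineage relationships could be integrated before storage.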

In addition, the target database may be Elasticsearch (a search server that can provide database services). Correspondingly, the target database can be used for querying and displaying the lineage relationships among the data tables and the lineage relationships among the fields in the data tables.
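As a non-limiting illustration of a lineage query, the sketch below walks stored table-level relationships to list every upstream table of a given table. A plain dictionary stands in for the target database, and the function `upstream_tables` and the table names are hypothetical; no search-server query syntax is shown.

```python
from collections import deque


def upstream_tables(store, table):
    """Return all direct and indirect source tables of `table`, sorted by name."""
    seen, pending = set(), deque([table])
    while pending:
        current = pending.popleft()
        for source in store.get(current, ()):
            if source not in seen:
                seen.add(source)
                pending.append(source)
    return sorted(seen)


# Hypothetical stored lineage: target table -> set of direct source tables.
stored = {
    "report.kpi": {"dw.customer_summary"},
    "dw.customer_summary": {"ods.customer", "ods.account"},
}
print(upstream_tables(stored, "report.kpi"))
```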

Further, during data processing and analysis, it may sometimes be desirable to integrate fields of the same data table in the same database to facilitate data lineage processing. Take a data table containing user information as an example: the user information may include the user's name, and the data table may use two separate fields to store the user's last name and first name. Combining the two fields yields the user's name, which simplifies subsequent data lineage processing. Based on this, in the embodiment of the present application, fields of the same data table in the same database may be integrated to obtain an integrated field, and the lineage relationships among the fields in the data table are updated based on the integrated field.
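As a non-limiting illustration of field integration, the sketch below merges the last-name and first-name fields of one data table into a single name field and rewrites the field-level lineage accordingly. The function `integrate_fields`, the mapping layout, and the table and field names are hypothetical.

```python
def integrate_fields(field_lineage, table, parts, merged_name):
    """Merge `parts` of `table` into `merged_name` and update field-level lineage.

    `field_lineage` maps (table, field) to the set of upstream (table, field) pairs.
    """
    part_keys = {(table, part) for part in parts}
    merged_key = (table, merged_name)
    updated, merged_sources = {}, set()
    for key, sources in field_lineage.items():
        # Downstream references to a merged part now point at the integrated field.
        new_sources = {merged_key if s in part_keys else s for s in sources}
        if key in part_keys:
            merged_sources |= new_sources
        else:
            updated[key] = new_sources
    updated.setdefault(merged_key, set()).update(merged_sources)
    return updated


lineage = {
    ("customer", "last_name"): {("raw", "surname")},
    ("customer", "first_name"): {("raw", "given_name")},
    ("report", "greeting"): {("customer", "first_name"), ("customer", "last_name")},
}
merged = integrate_fields(lineage, "customer", ["last_name", "first_name"], "name")
print(merged)
```

After integration, the two original fields no longer appear separately: the integrated field inherits their upstream sources, and downstream fields refer to the integrated field instead.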

Based on S101 to S103 above, in the embodiment of the present application, the database operation request is intercepted and parsed based on the hook plug-in, and once the related information of the data table is obtained, it can be sent asynchronously to the message queue. Because the related information of the data table characterizes the lineage relationships among the data tables and among the fields in the data tables, it can be stored in a target database after being consumed from the message queue, and the target database can be used for querying and displaying the lineage relationships. The hook plug-in thus directly intercepts the database operation request and parses out both the table-level and the field-level lineage relationships, so that data lineage analysis can be carried out quickly, which improves processing efficiency and accuracy. In addition, buffering the information in the message queue before consuming it makes it easier to integrate the lineage relationships, and once the information is stored in the target database the lineage relationships can be queried and displayed, so that the lineage of the data can be grasped clearly and intuitively, which helps the financial institution carry out its subsequent business smoothly.

Based on the data lineage processing method provided by the above embodiment, an embodiment of the application further provides a data lineage processing device. The device is described below with reference to the embodiments and the drawings.

Fig. 2 is a schematic structural diagram of a data lineage processing device according to an embodiment of the present application. Referring to Fig. 2, a data lineage processing device 200 according to an embodiment of the present application includes:

the parsing module 201, configured to intercept and parse the database operation request based on the hook plug-in to obtain the related information of the data table; the related information of the data table is used for characterizing the lineage relationships among the data tables and the lineage relationships among the fields in the data tables;

the sending module 202, configured to asynchronously send the related information of the data table to a message queue;

the consumption module 203, configured to consume the related information of the data table output by the message queue and store the related information of the data table in a target database; the target database is used for querying and displaying the lineage relationships among the data tables and the lineage relationships among the fields in the data tables.

As one embodiment, the message queue is a Kafka message queue; the consumption module 203 includes:

a consumption sub-module, used for registering a Kafka topic and consuming the related information of the data table based on the topic.

As an embodiment, the data lineage processing device 200 further includes:

an integration module, used for integrating the fields of the same data table in the same database to obtain an integrated field;

and an updating module, used for updating the lineage relationships among the fields in the data table based on the integrated field.

As an embodiment, the hook plug-in is obtained by the following modules:

a first determining module, used for determining the currently running data engine; each data engine corresponds to one hook plug-in;

and a second determining module, used for determining the corresponding hook plug-in based on the data engine.

As one embodiment, the data engine includes the data warehouse tool Hive; the hook plug-in corresponding to Hive is the Hive hook, a hook plug-in mounted in Hive;

the parsing module 201 includes:

a first configuration module, used for configuring a Hive hook interface based on the Hive hook; the Hive hook interface is used for intercepting and parsing the database operation request to obtain the related information of the data table.

As one embodiment, the data engine includes the big data computing engine Spark; the hook plug-in corresponding to Spark is the Spark hook, a hook plug-in mounted in Spark;

the parsing module 201 includes:

a second configuration module, used for configuring a Spark hook interface based on the Spark hook; the Spark hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table.

As one embodiment, the data engine includes the data query engine Presto; the hook plug-in corresponding to Presto is the Presto hook, a hook plug-in mounted in Presto;

the parsing module 201 includes:

a third configuration module, used for configuring a Presto hook interface based on the Presto hook; the Presto hook interface is used for identifying, intercepting, and parsing the database operation request to obtain the related information of the data table.

Further, an embodiment of the present application further provides an electronic device, including: a processor, a memory, and a system bus;

the processor and the memory are connected through the system bus;

Further, an embodiment of the application also provides a computer readable storage medium storing instructions which, when executed on an electronic device, cause the electronic device to execute any implementation of the data lineage processing method.

From the above description of the embodiments, it will be apparent to those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by software together with a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product. The software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application. It should be noted that the embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method.

It is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for processing data lineage, comprising:

2. The processing method according to claim 1, wherein the message queue is a Kafka message queue, and consuming the related information of the data table output by the message queue comprises:

3. The processing method according to claim 1, wherein the method further comprises:

4. The processing method according to claim 1, wherein the hook plug-in is obtained by:

based on the data engine, a corresponding hook plug-in is determined.

5. The processing method of claim 4, wherein the data engine comprises the data warehouse tool Hive; the hook plug-in corresponding to Hive is the Hive hook, a hook plug-in mounted in Hive;

6. The processing method of claim 4, wherein the data engine comprises the big data computing engine Spark; the hook plug-in corresponding to Spark is the Spark hook, a hook plug-in mounted in Spark;

7. The processing method of claim 4, wherein the data engine comprises the data query engine Presto; the hook plug-in corresponding to Presto is the Presto hook, a hook plug-in mounted in Presto;

8. A data lineage processing device, comprising:

9. An electronic device, the device comprising: a processor, a memory, and a system bus;

the processor and the memory are connected through the system bus;

the memory is used for storing one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the data lineage processing method of any one of claims 1 to 7.

10. A computer readable storage medium having instructions stored therein which, when executed on an electronic device, cause the electronic device to perform the data lineage processing method of any one of claims 1 to 7.