CN112925777A

CN112925777A - Method and system for detecting data lineage of a HIVE database

Info

Publication number: CN112925777A
Application number: CN202110211183.1A
Authority: CN
Inventors: 苏瑀; 陈筱进; 刘登贺; 张世杰
Original assignee: Jilin Yillion Bank Co ltd
Current assignee: Jilin Yillion Bank Co ltd
Priority date: 2021-02-25
Filing date: 2021-02-25
Publication date: 2021-06-08

Abstract

The invention discloses a method and a system for detecting data lineage in a HIVE database. The method comprises the following steps: configuring a LineageLogger Hook function; parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log; cleaning the hive.log log into JSON format, and importing the cleaned data into the open-source graph database neo4j; querying the dependency relationships between fields through the neo4j interface; and calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage. The invention can effectively analyze and sort out the data lineage among data tables and fields.

Description

Method and system for detecting data lineage of a HIVE database

Technical Field

The invention relates to the technical field of big data management, and in particular to a method and a system for detecting the data lineage of a HIVE database.

Background

Since the era of big data began in 2013, big data has brought new opportunities and challenges to every industry, and industries attach growing importance to the value implicit in massive data. A data warehouse gathers the commonly used and important business-related indicator data from massive data, which reduces the time cost of data retrieval, improves data quality and consistency, and improves the use of historical data, so that the value hidden in the data can be better mined.

Data lineage vividly depicts how data is aggregated layer by layer from the bottom up, accurately and clearly reveals the lineage among data entities at every level, and strongly supports the development, testing, operation and maintenance of business systems. It records the entire history of data processing, including the origin of the data and every subsequent processing step, and is particularly important for analyzing data, tracking its dynamic evolution, measuring its reliability, and ensuring its quality. As a system runs and the related business systems are continually adjusted in practice, more and more data nodes develop problems, maintenance costs become very high, and only a few commonly used reports keep working normally. When this happens, the data lineage can be used to trace back and identify the specific nodes at fault.

When part of the data is abnormal and an alarm is raised, the cause of the abnormality can be traced downward through the data lineage graph, and the data entities that may be affected can be analyzed upward through the impact graph. When a table structure changes, the impact graph shows which programs need to be modified. Data lineage also helps data warehouse staff sort out the business more clearly, makes it easier to establish the dependencies for ETL task scheduling, and makes it quick to judge whether a failed batch task affects downstream systems.
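The two directions of analysis described above can be illustrated with a minimal sketch. It assumes the lineage is already available as a list of (source, target) pairs; the field names and the `reachable` helper are hypothetical and are not part of the disclosed method.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "source field feeds target field".
EDGES = [
    ("ods.orders.amount", "dwd.order_detail.amount"),
    ("dwd.order_detail.amount", "dws.daily_sales.total_amount"),
    ("dws.daily_sales.total_amount", "ads.sales_report.revenue"),
    ("ods.orders.user_id", "dwd.order_detail.user_id"),
]


def reachable(edges, start, direction="downstream"):
    """Return every node reachable from `start` by breadth-first search.

    direction="downstream" follows source -> target (impact analysis);
    direction="upstream" follows target -> source (root-cause tracing).
    """
    adjacency = {}
    for source, target in edges:
        if direction == "downstream":
            adjacency.setdefault(source, []).append(target)
        else:
            adjacency.setdefault(target, []).append(source)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}  # the start node itself is not part of the answer
```

Calling `reachable(EDGES, field, "upstream")` lists the fields an abnormal value may have come from, while the default direction lists the fields that a change to `field` may affect.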

Metadata management becomes increasingly important as the number of tables and models built in the data warehouse grows, and table-level metadata lineage maintains the relationships between tables. Good metadata management makes the relationship between each table and model clear. Mining metadata lineage plays an important role in tracking data flow, troubleshooting business problems, reducing maintenance costs, and improving development efficiency.

Therefore, how to effectively determine data lineage is an urgent problem to be solved.

Disclosure of Invention

In view of this, the invention provides a method for detecting the data lineage of a HIVE database, which can effectively analyze and sort out the data lineage among data tables and fields.

The invention provides a method for detecting data lineage in a HIVE database, which comprises the following steps:

configuring a LineageLogger Hook function;

parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log;

cleaning the hive.log log into JSON format, and importing the cleaned data into the open-source graph database neo4j;

querying the dependency relationships between fields through the neo4j interface;

calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage.

Preferably, configuring the LineageLogger Hook function includes:

configuring the hive-site.xml file by adding parameters on Hive version 2.0 or above, and configuring the Hook output at the same time.

Preferably, calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage includes:

calling the neo4j graph database API through the visualization tool Tableau, parsing the JSON strings, and visually displaying the data lineage.

A system for detecting data lineage in a HIVE database, comprising:

the configuration module is used for configuring a LineageLogger Hook function;

the first analysis module is used for parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log;

the cleaning module is used for cleaning the hive.log log into JSON format and importing the cleaned data into the open-source graph database neo4j;

the query module is used for querying the dependency relationships between fields through the neo4j interface;

and the second analysis module is used for calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage.

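Preferably, the configuration module is specifically configured to:

configure the hive-site.xml file by adding parameters on Hive version 2.0 or above, and configure the Hook output at the same time.

Preferably, the second parsing module is specifically configured to:

call the neo4j graph database API through the visualization tool Tableau, parse the JSON strings, and visually display the data lineage.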

In summary, the invention discloses a method for detecting data lineage in a HIVE database. When the data lineage of the HIVE database needs to be detected, a LineageLogger Hook function is first configured; the HiveSQL is then parsed based on the LineageLogger Hook function to generate a hive.log log; the hive.log log is cleaned into JSON format, and the cleaned data is imported into the open-source graph database neo4j; the dependency relationships between fields are queried through the neo4j interface; and the neo4j graph database API is called, the JSON strings are parsed, and the data lineage is displayed visually. The invention can effectively analyze and sort out the data lineage among data tables and fields.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of embodiment 1 of the method for detecting data lineage in a HIVE database disclosed in the present invention;

FIG. 2 is a flowchart of embodiment 2 of the method for detecting data lineage in a HIVE database disclosed in the present invention;

FIG. 3 is a flowchart of embodiment 3 of the method for detecting data lineage in a HIVE database disclosed in the present invention;

FIG. 4 is a schematic structural diagram of embodiment 1 of the system for detecting data lineage in a HIVE database disclosed in the present invention;

FIG. 5 is a schematic structural diagram of embodiment 2 of the system for detecting data lineage in a HIVE database disclosed in the present invention;

FIG. 6 is a schematic structural diagram of embodiment 3 of the system for detecting data lineage in a HIVE database disclosed in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a flowchart of embodiment 1 of the method for detecting data lineage in a HIVE database disclosed in the present invention. The method may include the following steps:

S101, configuring a LineageLogger Hook function;

When the data lineage of a HIVE database needs to be detected, a LineageLogger Hook function is first configured on Hive version 2.0 or above.

S102, parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log;

After the LineageLogger Hook function has been configured, it is used to perform field-level analysis of the operations run by the hive component, obtain the dependency relationships among the fields, and generate a hive.log log.

S103, cleaning the hive.log log into JSON format, and importing the cleaned data into the open-source graph database neo4j;

After the hive.log log is generated, it contains a large amount of irrelevant data, so it cannot display the field dependencies between tables clearly and concisely, and it cannot be loaded directly through the neo4j interface. The hive.log log therefore needs to be cleaned, the cleaned data is stored in JSON format, and the processed data is imported into the open-source graph database neo4j.

S104, querying the dependency relationship among the fields by using a neo4j interface;

Then, relying on neo4j's strong ability to represent relationships, the field dependencies are combined.

S105, calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage.

Finally, the neo4j graph database API is called and the JSON strings are parsed to visualize the data lineage.

In summary, in the above embodiment, when the data lineage of the HIVE database needs to be detected, the LineageLogger Hook function is configured first; the HiveSQL is then parsed based on the LineageLogger Hook function to generate a hive.log log; the hive.log log is cleaned into JSON format, and the cleaned data is imported into the open-source graph database neo4j; the dependency relationships between fields are queried through the neo4j interface; and the neo4j graph database API is called, the JSON strings are parsed, and the data lineage is displayed visually. The data lineage among data tables and fields can thus be analyzed and sorted out effectively.
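The cleaning described in step S103 might look like the following sketch. It assumes that each lineage entry in hive.log is a single line whose logger tag contains `hooks.LineageLogger`, followed by a JSON payload with `vertices` and `edges` arrays; the sample line, the `clean` function, and the output record layout are illustrative assumptions, and the actual log layout depends on the Hive version and logging configuration.

```python
import json

MARKER = "hooks.LineageLogger"  # assumed logger tag that precedes the JSON payload

# Illustrative input: one unrelated line and one assumed lineage line.
SAMPLE_LOG = [
    "2021-02-25T10:00:00,000 INFO  [main] ql.Driver: Executing command",
    '2021-02-25T10:00:01,000 INFO  [main] hooks.LineageLogger: '
    '{"version":"1.0","queryText":"insert into t2 select a from t1",'
    '"edges":[{"sources":[1],"targets":[0],"edgeType":"PROJECTION"}],'
    '"vertices":[{"id":0,"vertexType":"COLUMN","vertexId":"default.t2.a"},'
    '{"id":1,"vertexType":"COLUMN","vertexId":"default.t1.a"}]}',
]


def extract_lineage(line):
    """Return the JSON payload of a lineage line, or None for any other line."""
    if MARKER not in line:
        return None
    start = line.find("{", line.index(MARKER))
    if start < 0:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None  # truncated or malformed payloads are skipped


def clean(lines):
    """Keep only column-to-column projection edges as compact records."""
    records = []
    for line in lines:
        payload = extract_lineage(line)
        if payload is None:
            continue
        # Map numeric vertex ids to fully qualified column names.
        names = {
            v["id"]: v["vertexId"]
            for v in payload.get("vertices", [])
            if v.get("vertexType") == "COLUMN"
        }
        for edge in payload.get("edges", []):
            if edge.get("edgeType") != "PROJECTION":
                continue
            for s in edge.get("sources", []):
                for t in edge.get("targets", []):
                    if s in names and t in names:
                        records.append({"source": names[s], "target": names[t]})
    return records
```

Writing the result with `json.dumps(clean(lines))` gives a JSON file of source and target field pairs that is small enough to inspect and straightforward to load into a graph database.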

FIG. 2 is a flowchart of embodiment 2 of the method for detecting data lineage in a HIVE database disclosed in the present invention. The method may include the following steps:

S201, configuring the hive-site.xml file by adding parameters on Hive version 2.0 or above, and configuring the Hook output at the same time;

Specifically, from the perspective of the method used to parse HiveSQL, the hive-site.xml file is configured by adding parameters on Hive version 2.0 or above, and the Hook output is configured at the same time by modifying the hive-log4j2.properties configuration file, which completes the configuration of the LineageLogger Hook function.

S202, parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log;

S203, cleaning the hive.log log into JSON format, and importing the cleaned data into the open-source graph database neo4j;

S204, querying the dependency relationship among the fields by using a neo4j interface;

S205, calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage.

To sum up, this embodiment builds on the above embodiment: when the LineageLogger Hook function is configured, the hive-site.xml file may be configured by adding parameters on Hive version 2.0 or above, and the Hook output is configured at the same time, which completes the configuration of the LineageLogger Hook function.
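The text does not name the parameter that is added to hive-site.xml. One plausible reading, taken here as an assumption, is Hive's post-execution hook property `hive.exec.post.hooks` set to the class `org.apache.hadoop.hive.ql.hooks.LineageLogger`. The sketch below adds that property to a hive-site.xml document; the `add_lineage_hook` helper is hypothetical, and the logger settings in hive-log4j2.properties are not covered.

```python
import xml.etree.ElementTree as ET

# Assumed property and class; the source only says "adding parameters".
HOOK_PROPERTY = "hive.exec.post.hooks"
HOOK_CLASS = "org.apache.hadoop.hive.ql.hooks.LineageLogger"


def add_lineage_hook(hive_site_xml):
    """Return hive-site.xml text with the lineage hook property set."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == HOOK_PROPERTY:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            # Overwrites any existing value; a deployment that already uses
            # other hooks would need to append to the comma-separated list.
            value.text = HOOK_CLASS
            break
    else:
        # No such property yet: create <property><name/><value/></property>.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = HOOK_PROPERTY
        ET.SubElement(prop, "value").text = HOOK_CLASS
    return ET.tostring(root, encoding="unicode")
```

Applying the function twice leaves a single property in place, so it can be run safely as part of a repeatable configuration step.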

FIG. 3 is a flowchart of embodiment 3 of the method for detecting data lineage in a HIVE database disclosed in the present invention. The method may include the following steps:

S301, configuring the hive-site.xml file by adding parameters on Hive version 2.0 or above, and configuring the Hook output at the same time;

S302, parsing the HiveSQL based on the LineageLogger Hook function to generate a hive.log log;

S303, cleaning the hive.log log into JSON format, and importing the cleaned data into the open-source graph database neo4j;

S304, querying the dependency relationship among the fields by using a neo4j interface;

S305, calling the neo4j graph database API through the visualization tool Tableau, parsing the JSON strings, and visually displaying the data lineage.

Finally, the neo4j graph database API is called through the visualization tool Tableau, which offers high flexibility and dynamism, and the JSON strings are parsed to visualize the data lineage.
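The JSON parsing in step S305 can be sketched independently of the display tool. The example assumes the response comes from Neo4j's transactional HTTP endpoint, whose body contains `results` (each with `columns` and `data` rows) and `errors`, and that the query returned columns named `source` and `target`; those column names and the `to_graph` helper are hypothetical.

```python
import json

# Illustrative response body in the shape returned by the Neo4j HTTP API.
SAMPLE_RESPONSE = (
    '{"results":[{"columns":["source","target"],'
    '"data":[{"row":["default.t1.a","default.t2.a"],"meta":[null,null]},'
    '{"row":["default.t2.a","default.t3.a"],"meta":[null,null]}]}],'
    '"errors":[]}'
)


def to_graph(response_text):
    """Convert a query response into node and link lists for display."""
    body = json.loads(response_text)
    if body.get("errors"):
        raise RuntimeError(body["errors"][0].get("message", "query failed"))

    nodes, links = set(), []
    for result in body.get("results", []):
        columns = result["columns"]
        for item in result["data"]:
            row = dict(zip(columns, item["row"]))  # pair column names with values
            nodes.update((row["source"], row["target"]))
            links.append({"source": row["source"], "target": row["target"]})
    return {"nodes": sorted(nodes), "links": links}
```

A node list plus a link list is the shape most graph visualizations expect, so the result can be handed to whichever display tool is used.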

In summary, from the perspective of the method used to parse HiveSQL, the present invention configures the LineageLogger Hook function on Hive version 2.0 or above to parse the HiveSQL, performs field-level analysis of the operations run by the Hive component, obtains the dependency relationships between fields, and generates a log. This approach makes full use of Hive's internal parsing, which improves the efficiency of parsing HiveSQL and reduces its complexity. The log of field dependencies is cleaned and preprocessed and then stored in JSON format, which is convenient to store and query. The JSON data file is imported into the open-source graph database neo4j, and the field dependencies are combined by relying on neo4j's strong ability to represent relationships. The data lineage is then displayed visually by calling the neo4j graph database API through the visualization tool Tableau, which offers high flexibility and dynamism. This reduces the overall maintenance cost of the data warehouse and reduces data quality problems.
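The import into neo4j and the dependency query can be expressed as Cypher statements. The sketch below only builds the statements and their parameters, so it runs without a database; the `Column` label, the `FEEDS` relationship type, and the helper names are assumptions chosen for illustration, not names taken from the invention.

```python
import json

# Create one node per field and one relationship per dependency.
IMPORT_CYPHER = (
    "UNWIND $rows AS row "
    "MERGE (s:Column {name: row.source}) "
    "MERGE (t:Column {name: row.target}) "
    "MERGE (s)-[:FEEDS]->(t)"
)

# Find every field that feeds the given field, directly or indirectly.
UPSTREAM_CYPHER = (
    "MATCH (s:Column)-[:FEEDS*1..]->(t:Column {name: $name}) "
    "RETURN DISTINCT s.name AS source"
)


def import_request(records):
    """Pair the import statement with the cleaned JSON records."""
    return {"statement": IMPORT_CYPHER, "parameters": {"rows": list(records)}}


def upstream_request(column):
    """Pair the upstream query with the field to trace."""
    return {"statement": UPSTREAM_CYPHER, "parameters": {"name": column}}


def http_payload(*requests):
    """Serialize requests in the shape accepted by the Neo4j HTTP API."""
    return json.dumps({"statements": list(requests)})
```

Using `MERGE` rather than `CREATE` keeps the import idempotent, so reloading the same log does not duplicate fields or dependencies.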

FIG. 4 is a schematic structural diagram of embodiment 1 of the system for detecting data lineage in a HIVE database disclosed in the present invention. The system may include:

a configuration module 401, configured to configure a LineageLogger Hook function;

a first analysis module 402, configured to parse the HiveSQL based on the LineageLogger Hook function and generate a hive.log log;

a cleaning module 403, configured to clean the hive.log log into JSON format and import the cleaned data into the open-source graph database neo4j;

a query module 404, configured to query the dependency relationships between fields through the neo4j interface;

and a second analysis module 405, configured to call the neo4j graph database API, parse the JSON strings, and visually display the data lineage.
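The way the five modules hand results to one another can be sketched as a simple pipeline. This is only an illustration of the module structure: the `LineageSystem` class and its attribute names are hypothetical, and each module is represented by a plain callable whose real behavior is described in the method embodiments.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class LineageSystem:
    """Chains the five modules; each one consumes the previous one's output."""

    configure: Callable[[], Any]             # configuration module 401
    analyse: Callable[[Any], Any]            # first analysis module 402
    clean_and_import: Callable[[Any], Any]   # cleaning module 403
    query: Callable[[Any], Any]              # query module 404
    display: Callable[[Any], Any]            # second analysis module 405

    def run(self):
        config = self.configure()
        log = self.analyse(config)
        graph = self.clean_and_import(log)
        dependencies = self.query(graph)
        return self.display(dependencies)
```

Keeping each module behind a callable makes it possible to replace one stage, for example the display stage, without touching the others.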

FIG. 5 is a schematic structural diagram of embodiment 2 of the system for detecting data lineage in a HIVE database disclosed in the present invention. The system may include:

a configuration module 501, configured to configure the hive-site.xml file by adding parameters on Hive version 2.0 or above, and to configure the Hook output at the same time;

a first analysis module 502, configured to parse the HiveSQL based on the LineageLogger Hook function and generate a hive.log log;

a cleaning module 503, configured to clean the hive.log log into JSON format and import the cleaned data into the open-source graph database neo4j;

a query module 504, configured to query the dependency relationships between fields through the neo4j interface;

and a second analysis module 505, configured to call the neo4j graph database API, parse the JSON strings, and visually display the data lineage.

FIG. 6 is a schematic structural diagram of embodiment 3 of the system for detecting data lineage in a HIVE database disclosed in the present invention. The system may include:

a configuration module 601, configured to configure the hive-site.xml file by adding parameters on Hive version 2.0 or above, and to configure the Hook output at the same time;

a first analysis module 602, configured to parse the HiveSQL based on the LineageLogger Hook function and generate a hive.log log;

a cleaning module 603, configured to clean the hive.log log into JSON format and import the cleaned data into the open-source graph database neo4j;

a query module 604, configured to query the dependency relationships between fields through the neo4j interface;

and a second parsing module 605, configured to call the neo4j graph database API through the visualization tool Tableau, parse the JSON strings, and visually display the data lineage.

The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be understood by cross-reference. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for detecting data lineage in a HIVE database, comprising:

configuring a LineageLogger Hook function;

2. The method of claim 1, wherein configuring the LineageLogger Hook function comprises:

3. The method according to claim 2, wherein calling the neo4j graph database API, parsing the JSON strings, and visually displaying the data lineage comprises:

4. A system for detecting data lineage in a HIVE database, comprising:

the configuration module is used for configuring a LineageLogger Hook function;

5. The system of claim 4, wherein the configuration module is specifically configured to:

6. The system of claim 5, wherein the second parsing module is specifically configured to: