CN110532261B - Method and device for visually monitoring Hive data warehouse

Info

Publication number: CN110532261B
Application number: CN201910672433.4A
Authority: CN
Inventors: 和思扬
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-07-24
Filing date: 2019-07-24
Publication date: 2022-09-20
Anticipated expiration: 2039-07-24
Also published as: CN110532261A

Abstract

The embodiment of the invention discloses a method and a device for visually monitoring a Hive data warehouse. The method comprises the following steps: storing, through a buffer, specific information on each table and partition of the Hive data warehouse together with routine task information; when a routine task is submitted, parsing the stored information through a Structured Query Language (SQL) parser; after parsing the information of each table, acquiring the relationships between the tables, and merging the information and the relationships of each table to obtain merged information on each dimension of each table; and reading the merged information on each dimension of each table for web page display. The embodiment of the invention can sort out complicated dependency relationships between database tables and support the optimization and adjustment of cluster tasks, so that a manager can observe every dimension of the data warehouse, which improves monitoring convenience and reduces management cost.

Description

Method and device for visually monitoring Hive data warehouse

Technical Field

The invention relates to a Hive data warehouse technology, in particular to a method and a device for visually monitoring a Hive data warehouse.

Background

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Hive is a data warehouse tool based on Hadoop; it can map a structured data file to a database table, provides a simple Structured Query Language (SQL) query function, and can convert SQL statements into tasks of the MapReduce computing framework, which run on the resource manager YARN. SQL is a database query and programming language used for accessing data and for querying, updating and managing relational database systems; .sql is also the file extension of database script files.

In the prior art, when an enterprise-level Hive data warehouse is monitored, the intricate dependency relationships between database tables cannot be sorted out and cluster tasks cannot be optimized and adjusted. As a result, it is difficult for managers to observe all dimensions of the data warehouse, management and maintenance become more complex, and the cost of sorting out and controlling business data tables is high.

Disclosure of Invention

In order to solve the above technical problem, embodiments of the present invention provide a method and an apparatus for visually monitoring a Hive data warehouse, which can sort out intricate dependency relationships between database tables, support the optimization and adjustment of cluster tasks, enable a manager to observe every dimension of the data warehouse, improve monitoring convenience, and reduce management cost.

In order to achieve the object of the present invention, in one aspect, an embodiment of the present invention provides a method for visually monitoring a Hive data warehouse, including:

storing specific information and task routine information of each table and partition of the Hive data warehouse through a buffer;

when a routine task is submitted, parsing the stored information through a Structured Query Language (SQL) parser;

analyzing the information of each table, acquiring the relationship between the information in each table, and merging the information and the relationship of each table to obtain merged information of each dimension of each table;

and reading the merged information of each dimension of each table for web page display.

Further, the method comprises:

in the Hive data warehouse, periodically refreshing the information on each partition of each table stored in the data warehouse as first-class information, and storing the first-class information through a buffer;

each time a Hive script is submitted, parsing the SQL statements in the first-class information through an SQL parser, identifying the data source tables and the corresponding target table of each SQL statement, and storing the dependency information between the data source tables and the target table in the buffer as second-class information;

and converting the SQL statements into MapReduce computing-framework tasks that run on the resource manager YARN, capturing specific information about each task as third-class information, and storing the third-class information in the buffer.
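The three classes of information described above can be pictured as three record types held by one buffer. The following is only a minimal sketch: the class names `PartitionInfo`, `Dependency`, `TaskRun` and `Buffer`, and all field names, are hypothetical and do not appear in the original text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartitionInfo:
    """First-class information: one partition of one table (hypothetical fields)."""
    table: str
    partition: str
    size_bytes: int


@dataclass
class Dependency:
    """Second-class information: one source-table -> target-table edge."""
    source_table: str
    target_table: str


@dataclass
class TaskRun:
    """Third-class information: execution details of one routine task."""
    target_table: str
    start_ts: float
    end_ts: float
    memory_mb: int


@dataclass
class Buffer:
    """Holds the three classes of information side by side."""
    partitions: List[PartitionInfo] = field(default_factory=list)
    dependencies: List[Dependency] = field(default_factory=list)
    task_runs: List[TaskRun] = field(default_factory=list)

    def counts(self) -> Dict[str, int]:
        # Number of stored records per information class.
        return {
            "first": len(self.partitions),
            "second": len(self.dependencies),
            "third": len(self.task_runs),
        }
```

Keeping the three record types separate mirrors the fact that they arrive at different times: partition information on a timer, dependencies on script submission, and task details after execution.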

Further, the method comprises:

and the buffer merges the stored first-class through third-class information to form merged information for each table.

Further, the method comprises:

the merged information of each table is the detailed data of that table, and includes:

partition size, output time, resource consumption, and upstream and downstream relationships.

Optionally, the method further comprises:

reading the final merged result of the buffer for front-end display, and providing related settings at the front end.

Further, the method further comprises: the buffer performs a global scan, or a scan of specified databases, on the Hive data warehouse according to its configuration, caches the information of each table in the database in the form of a local file, and refreshes it periodically at a time interval specified by the user, wherein the refresh frequency can be set on a web interface.
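The choice between a global scan and a scan of specified databases, together with the user-specified refresh interval, might be expressed as follows. This is a sketch under stated assumptions: `ScanConfig`, `databases_to_scan` and `refresh_due` are hypothetical names, and the one-hour default interval is an arbitrary illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScanConfig:
    # None means "global scan"; a list restricts the scan to those databases.
    databases: Optional[List[str]] = None
    # Refresh interval in seconds (assumed to be set from the web interface).
    refresh_interval_s: int = 3600


def databases_to_scan(config: ScanConfig, all_databases: List[str]) -> List[str]:
    """Return the databases the buffer should scan under this configuration."""
    if config.databases is None:
        return list(all_databases)
    wanted = set(config.databases)
    return [db for db in all_databases if db in wanted]


def refresh_due(config: ScanConfig, last_refresh_ts: float, now_ts: float) -> bool:
    """True when at least one full refresh interval has elapsed."""
    return now_ts - last_refresh_ts >= config.refresh_interval_s
```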

On the other hand, an embodiment of the present invention further provides a device for visually monitoring a Hive data warehouse, including:

the storage module is used for storing specific information and task routine information of each table and partition of the Hive data warehouse through the buffer;

the analysis module is used for parsing the stored information through a Structured Query Language (SQL) parser when a routine task is submitted;

the merging module is used for acquiring the relationships between the information in the tables after the information of each table has been parsed, and merging the information and the relationships of each table to obtain the merged information of each dimension of each table;

and the display module is used for reading the merged information of each dimension of each table for web page display.

Further, the storage module is configured to:

Further, the merging module is configured to:

Further, the apparatus is further configured to:

The embodiment of the invention stores, through the buffer, specific information on each table and partition of the Hive data warehouse together with routine task information; when a routine task is submitted, parses the stored information through a Structured Query Language (SQL) parser; after parsing the information of each table, acquires the relationships between the tables, and merges the information and the relationships of each table to obtain merged information on each dimension of each table; and reads the merged information on each dimension of each table for web page display. The embodiment of the invention can sort out complicated dependency relationships between database tables and support the optimization and adjustment of cluster tasks, so that a manager can observe every dimension of the data warehouse, which improves monitoring convenience and reduces management cost.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.

FIG. 1 is a flow chart of a method for visually monitoring a Hive data warehouse according to an embodiment of the invention;

FIG. 2 is a schematic diagram illustrating an implementation of a method for visually monitoring a Hive data warehouse according to an embodiment of the present invention;

FIG. 3 is a structural diagram of an apparatus for visually monitoring a Hive data warehouse according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

The steps illustrated in the flow charts of the figures may be performed in a computer system, for example as a set of computer-executable instructions. Also, while a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in an order different from the one shown here.

Fig. 1 is a flowchart of a method for visually monitoring a Hive data warehouse according to an embodiment of the present invention, and as shown in fig. 1, the method according to the embodiment of the present invention includes the following steps:

Step 101: storing specific information and task routine information of each table and partition of the Hive data warehouse through a buffer;

The embodiment of the invention provides a visual monitoring method based on a Hive data warehouse, mainly applied to the visual monitoring of enterprise-level Hive data warehouses. The monitoring information is accurate to the table level and includes all partitions of a table, the total data file size of each partition, the output time, the memory occupation, the upstream and downstream lineage graph of the table, and the like. This important information is displayed in a web interface, which gives data warehouse managers an intuitive way to monitor the routine runs of each table and makes it easier to sort out the complicated business dependencies among tables.

Step 102: when a routine task is submitted, parsing the stored information through a Structured Query Language (SQL) parser;

Step 103: after parsing the information of each table, acquiring the relationships between the information in the tables, and merging the information and the relationships of each table to obtain merged information of each dimension of each table;

Step 104: reading the merged information of each dimension of each table for web page display.

Further, the method comprises:

and converting the SQL statements into MapReduce computing-framework tasks that run on the resource manager YARN, capturing specific information about each task as third-class information, and storing the third-class information in the buffer.

Further, the method comprises:

and the buffer merges the stored first-class through third-class information to form merged information for each table.

Fig. 2 is a schematic diagram illustrating an implementation of a method for visually monitoring a Hive data warehouse according to an embodiment of the present invention, and as shown in fig. 2, an implementation process of a technical solution according to an embodiment of the present invention is as follows:

the embodiment of the invention comprises a buffer, an sql parser and a web platform.

Specifically, a buffer is introduced outside the Hive data warehouse, and it periodically refreshes its stored information on each partition of each table of the data warehouse.
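One way such a buffer could keep its table and partition information between refreshes is a snapshot written to a local file. The sketch below assumes a JSON file and uses the hypothetical helper names `save_snapshot` and `load_snapshot`; the write goes through a temporary file so that a reader never sees a half-written snapshot.

```python
import json
import os
import tempfile
import time


def save_snapshot(path: str, tables: dict) -> None:
    """Write the cached table information to a local JSON file atomically."""
    payload = {"refreshed_at": time.time(), "tables": tables}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)


def load_snapshot(path: str) -> dict:
    """Read back the most recent snapshot."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A periodic refresh would call `save_snapshot` after each scan, while the display side only ever calls `load_snapshot`.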

An SQL parser is introduced. Each time a Hive script is submitted, the parser parses the SQL statements, identifies the data source tables and the output target table of each SQL statement, and stores the dependency information in the buffer.
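The idea of deriving source and target tables from a submitted script can be illustrated with a deliberately simplified, regex-based sketch. It is not a real SQL parser: the function name `extract_dependencies` is hypothetical, statements are split naively on semicolons, and common table expressions or semicolons inside string literals are not handled.

```python
import re
from typing import List, Tuple

# Target table: INSERT OVERWRITE TABLE t / INSERT INTO [TABLE] t
_TARGET = re.compile(
    r"\bINSERT\s+(?:OVERWRITE|INTO)\s+(?:TABLE\s+)?([\w.]+)", re.IGNORECASE
)
# Source tables: any identifier directly following FROM or JOIN
_SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_dependencies(script: str) -> List[Tuple[str, str]]:
    """Return (source_table, target_table) pairs found in a Hive script."""
    deps: List[Tuple[str, str]] = []
    for statement in script.split(";"):
        targets = _TARGET.findall(statement)
        if not targets:
            continue  # statements without INSERT produce no target table
        sources = sorted(set(_SOURCE.findall(statement)) - set(targets))
        for target in targets:
            deps.extend((source, target) for source in sources)
    return deps
```

Branches of a UNION are picked up through their own FROM clauses, so no separate UNION rule is needed in this sketch.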

The parsed SQL task is submitted to YARN for MapReduce computation; information such as the start time and end time of the task and its memory occupation is captured and stored in the buffer.
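As an illustration of capturing task information, the sketch below turns an application report (a plain dictionary) into a small record. The keys `id`, `startedTime`, `finishedTime` and `memorySeconds` are modeled on the application object of the YARN ResourceManager REST API and should be treated as assumptions to verify against the cluster's version; `TaskMetrics` and `metrics_from_report` are hypothetical names. Note that `memorySeconds` is an aggregate, not the peak memory occupation.

```python
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    app_id: str
    start_ms: int           # task start time, milliseconds since the epoch
    end_ms: int             # task end time, milliseconds since the epoch
    memory_mb_seconds: int  # aggregate memory usage (MB x seconds), not a peak

    @property
    def duration_s(self) -> float:
        """Wall-clock duration of the task in seconds."""
        return (self.end_ms - self.start_ms) / 1000.0


def metrics_from_report(report: dict) -> TaskMetrics:
    """Build a TaskMetrics record from an application report dictionary."""
    return TaskMetrics(
        app_id=report["id"],
        start_ms=int(report["startedTime"]),
        end_ms=int(report["finishedTime"]),
        memory_mb_seconds=int(report.get("memorySeconds", 0)),
    )
```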

Here, YARN is a resource manager responsible for managing and scheduling cluster resources; it allocates cluster resources such as CPU, memory, file system and disk. MapReduce is a computing framework that runs on YARN.

The buffer merges the information obtained in the above steps to form detailed data for each table, including partition size, output time, resource consumption, upstream and downstream relationships, and the like.
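The merging step can be sketched as a fold of the three kinds of records into one detail entry per table. The function name `merge_table_details` and the tuple layouts of its inputs are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def merge_table_details(
    partitions: Iterable[Tuple[str, str, int]],      # (table, partition, size_bytes)
    dependencies: Iterable[Tuple[str, str]],         # (source_table, target_table)
    task_runs: Iterable[Tuple[str, float, int]],     # (table, duration_s, memory_mb)
) -> Dict[str, dict]:
    """Combine partition, dependency and task information per table."""
    details = defaultdict(
        lambda: {"partitions": {}, "upstream": set(), "downstream": set(), "runs": []}
    )
    for table, partition, size_bytes in partitions:
        details[table]["partitions"][partition] = size_bytes
    for source, target in dependencies:
        # Each edge is recorded on both ends so either table can be queried.
        details[target]["upstream"].add(source)
        details[source]["downstream"].add(target)
    for table, duration_s, memory_mb in task_runs:
        details[table]["runs"].append({"duration_s": duration_s, "memory_mb": memory_mb})
    return dict(details)
```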

The final merged result of the buffer is read for front-end web display, and related settings are provided at the front end.

The method for memory allocation for a single Hive task is implemented by introducing a preprocessor and a MapReduce configurator after the task is submitted and before the MapReduce job is executed, and a schematic diagram is shown in FIG. 2. The specific implementation process is as follows:

a1, introducing a buffer outside the Hive data warehouse, wherein the buffer can perform a global scan, or a scan of specified databases, on the Hive data warehouse according to its configuration, caches information such as the table names of all tables in the database, the partition names, the total data file size of each partition, and the creation time in the form of a local file, and refreshes this information periodically at a time interval specified by the user; the refresh frequency can be set on a web interface;

a2, introducing an SQL parser; each time a Hive script is submitted, the parser scans the entire content of the script and parses the SQL statements, identifies the data source tables and the generated target table of each SQL statement according to the INSERT/FROM/JOIN/UNION clauses and the like, and stores the dependency information in the buffer;

a3, submitting the parsed SQL task to YARN for MapReduce computation, capturing information such as the start time and end time of the task and its peak memory occupation, and storing the information in the buffer;

a4, the buffer merges the table and partition information obtained in step a1, the upstream and downstream dependency information of each table obtained in step a2, and the task execution information obtained in step a3, to form, for each table, the partition data sizes, the upstream data sources and downstream data destinations, the time and memory consumed when the data is produced in each routine run, and the like;

a5, reading the final merged result of the buffer and implementing the front-end display through Java Web or the like. The displayed content includes: the tables contained in each Hive database; the size of the partition data contained in each table; the direct upstream data source tables of each table and the downstream tables into which its data is imported; and the time and resources consumed by each routine run, and the like. The front end provides a settings interface in which the user can specify the refresh frequency of the buffer, cache Hive data warehouse information for the most recent XX days, retain the cached information in the buffer for XX days, and the like.
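The retention setting mentioned in step a5 (keeping cached information for a configured number of days) could be applied as a simple pruning pass over the cached partitions. In this sketch, `prune_partitions` and the assumption that partitions are keyed by calendar day are illustrative only.

```python
from datetime import date, timedelta
from typing import Dict


def prune_partitions(sizes_by_day: Dict[date, int], today: date, keep_days: int) -> Dict[date, int]:
    """Keep only the entries from the most recent `keep_days` days, today included."""
    cutoff = today - timedelta(days=keep_days)
    return {day: size for day, size in sizes_by_day.items() if day > cutoff}
```

With `keep_days=2` and a current date of 2019-07-24, the entries for 2019-07-23 and 2019-07-24 are kept and older ones are dropped.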

By visually monitoring the Hive data warehouse, the embodiment of the invention makes it easier for warehouse managers to sort out complex table dependency relationships. The memory consumption and output time of each table in its routine runs can be tracked more accurately, which facilitates the optimization and adjustment of cluster tasks. In addition, the data sizes of all routine partitions of a given table over a period of time are displayed, so that a user can compare them side by side and locate anomalies more easily; for example, if the partition file generated by one routine run is far smaller than those of the adjacent time points before and after it, the problem can be traced through that routine run. Managers can therefore observe all dimensions of the data warehouse more easily, the convenience of management and maintenance is greatly improved, and the cost of sorting out and controlling business data tables is reduced.
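The anomaly example above, a partition far smaller than its neighbors in time, can be turned into a simple check. The text does not define "far smaller", so the threshold `ratio=0.5` is an arbitrary assumption, as is the function name `find_small_partitions`.

```python
from typing import List, Tuple


def find_small_partitions(sizes: List[Tuple[str, int]], ratio: float = 0.5) -> List[str]:
    """Flag partitions much smaller than both chronological neighbors.

    `sizes` is a list of (partition_name, size_bytes) in chronological order.
    A partition is flagged when its size is below `ratio` times the smaller
    of the two adjacent partition sizes. The first and last entries are
    skipped because they have only one neighbor.
    """
    anomalies = []
    for i in range(1, len(sizes) - 1):
        name, size = sizes[i]
        prev_size = sizes[i - 1][1]
        next_size = sizes[i + 1][1]
        if size < ratio * min(prev_size, next_size):
            anomalies.append(name)
    return anomalies
```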

The embodiment of the invention introduces the buffer and the SQL parser to store and display important information on each table of the Hive data warehouse, including the partition sizes, lineage, routine output status and the like of the table. The information is presented to data warehouse managers in a visual interface, so that each table of the Hive data warehouse can be monitored from a macroscopic perspective and maintenance becomes more convenient.
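Once the direct dependencies are stored, the lineage of a table can be followed beyond its immediate neighbors. The text only describes direct upstream and downstream tables, so the breadth-first traversal below is an illustrative extension; `downstream_tables` is a hypothetical name.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def downstream_tables(edges: Iterable[Tuple[str, str]], start: str) -> List[str]:
    """Return every table reachable downstream of `start`, nearest first.

    `edges` holds (source_table, target_table) pairs.
    """
    children: Dict[str, List[str]] = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)

    seen = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for child in children.get(table, []):
            if child not in seen:  # guards against cycles and repeated edges
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```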

The visual monitoring method of the Hive data warehouse comprises the following steps: important information on each table and each partition of the Hive data warehouse and related information on routine tasks are stored through a buffer; when a routine task is submitted, the upstream and downstream lineage of each table is parsed by an SQL parser and stored through the buffer; the information is merged; and the information on each dimension of each table is used for web page display.

FIG. 3 is a structural diagram of an apparatus for visually monitoring a Hive data warehouse according to an embodiment of the present invention. As shown in FIG. 3, an apparatus for visually monitoring a Hive data warehouse according to another aspect of the embodiment of the present invention includes:

the storage module 301 is used for storing specific information of each table and partition of the Hive data warehouse and task routine information through a buffer;

the analysis module 302 is used for parsing the stored information through a Structured Query Language (SQL) parser when a routine task is submitted;

the merging module 303 is configured to acquire the relationships between the information in the tables after the information of each table has been parsed, and to merge the information and the relationships of each table to obtain merged information of each dimension of each table;

a presentation module 304, configured to read the merging information of each dimension of each table, for web page presentation.

Further, the storage module 301 is configured to:

Further, the merging module 303 is configured to:

Further, the apparatus is further configured to:

Wherein the apparatus is configured to parse and store the important information of each table of the data warehouse by means of the buffer and the SQL parser, mainly through the following steps:

introducing a buffer outside the Hive data warehouse, which periodically refreshes its stored information on each partition of each table of the data warehouse;

introducing an SQL parser, which, each time a Hive script is submitted, parses the SQL statements, identifies the data source tables and the output target table of each SQL statement, and stores the dependency information in the buffer;

submitting the parsed SQL task to YARN for MapReduce computation, capturing information such as the start and end time of the task and its memory occupation, and storing the information in the buffer;

the buffer merges the acquired information to form detailed data for each table, including partition size, output time, resource consumption, upstream and downstream relationships, and the like;

the final merged result of the buffer is read for front-end web display, and related settings are provided at the front end.

The embodiment of the invention stores, through the buffer, specific information on each table and partition of the Hive data warehouse together with routine task information; when a routine task is submitted, parses the stored information through a Structured Query Language (SQL) parser; after parsing the information of each table, acquires the relationships between the tables, and merges the information and the relationships of each table to obtain merged information on each dimension of each table; and reads the merged information on each dimension of each table for web page display. The embodiment of the invention can sort out complicated dependency relationships between database tables and support the optimization and adjustment of cluster tasks, so that a manager can observe every dimension of the data warehouse, which improves monitoring convenience and reduces management cost.

Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for visually monitoring a Hive data warehouse, which is characterized by comprising the following steps:

specific information and task routine information of each table and partition of the Hive data warehouse are stored through a buffer;

when a routine task is submitted, parsing the stored information through a Structured Query Language (SQL) parser;

and reading the merging information of each dimension of each table for web page display.

2. The method for visually monitoring a Hive data warehouse of claim 1, which comprises:

3. A method of visually monitoring a Hive data warehouse according to claim 2, which comprises:

4. A method of visually monitoring a Hive data warehouse according to claim 3, which comprises:

5. The method of visually monitoring a Hive data warehouse according to claim 1, further comprising:

6. The method of visually monitoring a Hive data warehouse according to claim 1, further comprising: the buffer performs a global scan, or a scan of specified databases, on the Hive data warehouse according to its configuration, caches the information of each table in the database in the form of a local file, and refreshes it periodically at a time interval specified by the user, wherein the refresh frequency can be set on a web interface.

7. An apparatus for visually monitoring a Hive data warehouse, comprising:

the storage module is used for storing specific information and task routine information of each table and partition of the Hive data warehouse through a buffer;

8. The apparatus for visually monitoring a Hive data warehouse according to claim 7, wherein the storage module is configured to:

9. An apparatus for visually monitoring a Hive data warehouse according to claim 8, wherein the merge module is configured to:

10. An apparatus for visually monitoring a Hive data warehouse according to claim 9, wherein said apparatus is further configured to: