CN109815283B

CN109815283B - Heterogeneous data source visual query method

Info

Publication number: CN109815283B
Application number: CN201811547858.4A
Authority: CN
Inventors: 王海荣; 肖万来; 刘娟; 朱文洲; 徐云龙
Original assignee: Sugon Nanjing Research Institute Co ltd
Current assignee: Sugon Nanjing Research Institute Co ltd
Priority date: 2018-12-18
Filing date: 2018-12-18
Publication date: 2020-10-09
Anticipated expiration: 2038-12-18
Also published as: CN109815283A

Abstract

The invention discloses a heterogeneous data source visual query method.A user selects a query range on a foreground visual interface; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range; detecting whether a plug-in required by a data source to be loaded is a data source plug-in built in a presto cluster, if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, and embedding the plug-in into the Presto cluster, loading the configuration file in the catalog folder, and completing the loading of the data source. The method dynamically loads the data sources by configuring different data source basic information and adopting the operation of a graphical interface, thereby realizing the query of the heterogeneous data sources based on presto.

Description

Heterogeneous data source visual query method

Technical Field

The invention relates to the technical field of computers, in particular to a heterogeneous data source visual query method.

Background

In the era of big data explosion, the storage forms of data are various, mainly structured data, semi-structured data and unstructured data. Different data are stored in different data sources, most of structured data are stored in a relational data source, and domestic databases such as mysql, oracle, sqlserver, dreams, gold warehouse data sources and the like are the mainstream; storing historical data or data subjected to cleaning and conversion in a data source of a large data platform such as hive or an MPP data source based on a share nothing framework; the semi-structured data, such as streaming data, is mainly stored in kafka, a large number of structured log files are stored in a file server, and semi-structured key value data stored in a memory database redis stored in a memory database. Data in different data sources need to be subjected to integration query, and unified statistical analysis is carried out according to information in different data sources. Aiming at how to quickly query big data in a heterogeneous data source, the traditional solution is to write a specific query program, generally a jdbc program, aiming at different data sources, encapsulate a result set queried in the jdbc in an integration, then encapsulate data in a specific second set aiming at other data sources, such as a semi-structured data source and log data, and finally query and analyze the data by using a java api aiming at the data in the integration and the second set. In another way, data is uniformly aggregated into a platform, such as an oracle database or a hive data warehouse, and then the aggregated data is queried and analyzed by using respective query methods.

The current query method needs to write a large number of adaptation programs for different data sources, because adaptation needs to be performed on streaming message type data and on semi-structured log files in addition to the jdbc interface provided by common relational data. The general adaptive interface can not use the structured query language to make summary statistics on the data. For the second kind of data gathered to the unified platform, a large number of collection and extraction programs need to be written to gather the data in a unified mode, linear expansion query cannot be conducted according to query complexity, only machine resources of the machine can be used, and extra nodes cannot be used for enhancing query. The problem that may exist in this method is that when the statistical requirement changes or the queried data source changes, a large amount of manpower is needed to adjust the statistical mode and the summary mode, and when the structured query statement is not used for querying, a large amount of manpower cost may be needed for maintenance. When a large data set occurs, the query may overflow and the query takes a lot of time due to the limited resources of the nodes.

Disclosure of Invention

The purpose of the invention is as follows: the invention aims to provide a visual query method for heterogeneous data sources, which is used for dynamically loading data sources by configuring different basic information of the data sources and adopting the operation of a graphical interface to realize the query of heterogeneous data sources based on presto.

The technical scheme is as follows: the invention adopts the following technical scheme:

a heterogeneous data source visualization query method comprises the following steps: a user selects a query scope on a foreground visual interface, wherein the query scope comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range;

the method for adding the data source to be loaded in the presto cluster comprises the following steps:

(1) registering a data source to be loaded;

(2) generating a configuration file of a data source to be loaded in a catalog folder;

(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, embedding the plug-in into the presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source.

In order to realize visualization, the following steps are also included before sending the query statement to the presto cluster:

decomposing the query task into a data table and an operation command; constructing a vector diagram according to the decomposed query task by using a Raphaeljs control, wherein nodes in the vector diagram comprise a data table and an operation command in the query task; displaying the constructed vector diagram on a foreground visual interface; the background analyzes the vector diagram into a structured query statement and sends the structured query statement to a presto cluster to complete query; the operation command comprises connection, function, screening and aggregation.

As an optimization, after sending an inquiry statement to the presto cluster, judging whether the quantity of the task data blocks of the started presto working nodes exceeds a threshold value of the quantity of the task data blocks returned by the presto cluster; if the number of the task data blocks exceeds the threshold value, starting a new presto working node service; re-sending the query statement to the presto cluster;

and when the number of the task data blocks in a period of time is lower than the threshold value of the number of the task data blocks of the started presto working nodes, closing partial presto working node services.

Priority levels are set for all presto working nodes, presto working nodes with high priority levels are started firstly when starting, and presto working nodes with low priority levels are closed firstly when closing.

Furthermore, the user is set with authority, the user is matched with the query queue according to the user authority, and the query statement sent by the user is submitted to the query queue matched with the user.

Has the advantages that: compared with the prior art, the visual query method for the heterogeneous data sources realizes hot deployment by detecting the built-in data source plug-ins of the presto cluster and dynamically loading the configuration files of the newly added data sources, can enable presto to access the newly added data sources in real time without restarting the presto, and improves the usability of the system.

Drawings

FIG. 1 is a flow diagram of data source loading of structured data;

FIG. 2 is a flow diagram of data source loading for a kafka message queue;

FIG. 3 is a flowchart of Hive loading of multiple big data platforms;

FIG. 4 is a flow diagram of visualizing queries;

FIG. 5 is a flow chart of resource scheduling;

FIG. 6 is a flow diagram of limiting the number of concurrencies of queries.

Detailed Description

The invention is further elucidated with reference to the drawings and the detailed description.

A heterogeneous data source visualization query method comprises the following steps:

step 1, a visual interface is established on a foreground, a user selects a query range on the foreground visual interface, and the query range comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster;

adding a data source to be loaded in a presto cluster, comprising the following steps:

(1) registering a data source to be loaded;

the registration procedure differs for different types of data sources. For a data source of structured data, inputting data source information including an IP address, a port number, a user name and a password, and testing the Connectivity of the data source through jdbc (Java DataBase Connectivity).

For a data source of semi-structured data, a profile is required to be configured, which is mainly divided into the following types,

one is a kafka message queue, and the description file comprises the address of the broker of kafka, and description information of the corresponding message and the corresponding table in kafka. presto will parse and segment the data according to the description in the specification file and perform the query operation.

One is a redis memory type database, the description file includes the address and port of the redis, the password of the redis, the serial number of the current database and the description of the key value stored in the redis, and the presto segments and analyzes the data according to the description of the key value.

The description file is saved in json format and defines column names and column descriptions.

(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating a plug-in of the data source to be loaded, embedding the plug-in into the presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source without restarting presto service.

As shown in FIG. 1, a flow diagram is loaded for a data source of structured data; as shown in fig. 2, a flow diagram for loading data sources for a kafka message queue.

The loading process of the plurality of hive big data sources comprises the following steps:

configuring a hive data source plug-in; configuring a source data service of each hive, wherein the service mainly describes and defines tables and column fields in the hive; configuring hdfs related file settings, and specifying hdfs addresses, namenodes and other related addresses; specifying operation rights, such as rename and delete, for each hive table; and dynamically loading a data source, and interfacing the hive data sources of one or more big data platforms.

After the data source is loaded, the virtual table structure information of different data sources can be checked on the interface.

Step 2, decomposing the query task into a data table and an operation command; a vector diagram is constructed by Raphaeljs controls according to decomposed query tasks, nodes in the vector diagram comprise data tables and operation commands in the query tasks, and the nodes are mainly divided into the following categories: (1) a data table control node, which describes the relevant information of the data table; (2) connecting control nodes, wherein the operation of the nodes is to perform correlation operation on tables, and the correlation operation comprises internal connection, left connection, right connection and the like; (3) the function control node can perform function operation on the specific data column; (4) the screening control node comprises row screening and column screening; (5) and aggregating the control nodes, performing aggregation statistics according to specific values, and performing statistical analysis by using a complex window function. And selecting nodes by a user according to the requirement, and determining the connection sequence of the nodes so as to generate the vector diagram.

The invention uses svg to render vector graphics in a browser by using a lightweight js library of Raphael. Then, each step of operation of the query task is virtualized to be a control node in the foreground, such as a data table control, which describes one table in the data warehouse from the logical structure, and the association control indicates that the association operation needs to be performed on the two tables. The user selects the nodes by dragging, the vector diagram can specifically describe the position of each dragged control node, and the size and the shape of each control node and the connection line between the control nodes are recorded.

And displaying the constructed vector diagram on a foreground visual interface.

Step 3, the background stores the vector diagram generated by the foreground, analyzes the nodes according to the node description of the vector diagram analysis, and converts the vector diagram into a structured query statement;

and the background carries out dynamic optimization adjustment on the converted structured query statement according to the complexity of the query. For complex queries, a packing and splitting mode is adopted, related query requirements are packaged into sub-queries, screening and filtering are performed again based on results of the sub-queries, and the speed is increased; for simple query, predicate push-down is adopted, and screening conditions are enhanced and filtered, so that filtering can be performed in advance, and the query rate is improved. The flow of visualizing the query is shown in fig. 4.

And sending the optimized and adjusted structured query statement to a presto cluster.

Step 4, after the presto cluster receives the query statement, dividing the query task into a plurality of task blocks according to the size of the data volume and the complexity of the data, and returning the number of the task blocks to the task database; binding the computing nodes and the task quantity blocks by default; judging whether the number of the task data blocks of the started presto working node exceeds a threshold value of the number of the task data blocks returned by the presto cluster; and if the number threshold of the task data blocks is exceeded, starting a new presto working node service to provide additional query, computing power and memory, and sending query statements to the presto cluster again to equally distribute the tasks to all presto working nodes including the newly started working nodes. The flow of resource scheduling is shown in fig. 5.

For presto working nodes with different performances, the working state of the presto working nodes can affect the working capacity of the whole cluster. In the invention, priority levels are set for all presto working nodes according to the performance, the working nodes with excellent performance have high priority levels, the working nodes with poor performance have low priority levels, the presto working nodes with high priority levels are started firstly when being started, and the presto working nodes with low priority levels are closed firstly when being closed, so that the utilization rate of the working nodes with good performance is improved, and the improvement of the performance of the whole cluster is facilitated.

In order to solve the problem of insufficient resources caused by simultaneous execution of a large number of queries, the invention sets the authority for the user, matches the user with the query queue according to the user authority, and submits the query statement sent by the user to the query queue matched with the user. Different queues have different attributes, the queues have the limit of statement capacity, when the capacity in the queues reaches the upper limit, the statements submitted later enter a waiting stage, when the queues are idle, the original waiting statements are analyzed, the task is segmented, the resource limit is judged, and then the system resources are dynamically adjusted according to the resource requirements. Therefore, the number of the query concurrencies of the same user is limited, and the concurrent query efficiency of different users is improved. The flow of limiting the number of concurrencies of queries is shown in FIG. 6.

Claims

1. A heterogeneous data source visualization query method is characterized by comprising the following steps: a user selects a query scope on a foreground visual interface, wherein the query scope comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range;

(1) registering a data source to be loaded;

(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, embedding the plug-in into a presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source;

before sending the query statement to the presto cluster, the method also comprises the following steps:

2. The method for visual query of heterogeneous data sources according to claim 1, further comprising: the background carries out dynamic optimization adjustment on the converted structured query statement according to the complexity of the query; the optimization adjustment comprises: for complex query, a packing and splitting mode is adopted, related query requirements are packaged into sub-queries, and screening and filtering are performed again based on results of the sub-queries; for simple query, predicate push-down is adopted, screening conditions are enhanced, and filtering is performed; and sending the optimized and adjusted structured query statement to a presto cluster.

3. The method for visual query of heterogeneous data sources according to claim 1, further comprising determining whether a threshold value of the number of task data blocks of a presto working node that has been started is exceeded according to the number of task data blocks returned by a presto cluster after a query statement is sent to the presto cluster; if the number of the task data blocks exceeds the threshold value, starting a new presto working node service; re-sending the query statement to the presto cluster;

4. The visual query method for the heterogeneous data sources according to claim 3, characterized in that priority levels are set for all presto working nodes, the presto working nodes with high priority levels are started first when starting, and the presto working nodes with low priority levels are closed first when closing.

5. The visual query method for the heterogeneous data sources according to claim 1, wherein a right is set for a user, the user and the query queue are matched according to the user right, and a query statement sent by the user is submitted to the query queue matched by the user.