CN109815283B - Heterogeneous data source visual query method - Google Patents

Heterogeneous data source visual query method Download PDF

Info

Publication number
CN109815283B
CN109815283B CN201811547858.4A CN201811547858A CN109815283B CN 109815283 B CN109815283 B CN 109815283B CN 201811547858 A CN201811547858 A CN 201811547858A CN 109815283 B CN109815283 B CN 109815283B
Authority
CN
China
Prior art keywords
query
presto
data source
data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811547858.4A
Other languages
Chinese (zh)
Other versions
CN109815283A (en
Inventor
王海荣
肖万来
刘娟
朱文洲
徐云龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sugon Nanjing Research Institute Co ltd
Original Assignee
Sugon Nanjing Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sugon Nanjing Research Institute Co ltd filed Critical Sugon Nanjing Research Institute Co ltd
Priority to CN201811547858.4A priority Critical patent/CN109815283B/en
Publication of CN109815283A publication Critical patent/CN109815283A/en
Application granted granted Critical
Publication of CN109815283B publication Critical patent/CN109815283B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a heterogeneous data source visual query method.A user selects a query range on a foreground visual interface; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range; detecting whether a plug-in required by a data source to be loaded is a data source plug-in built in a presto cluster, if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, and embedding the plug-in into the Presto cluster, loading the configuration file in the catalog folder, and completing the loading of the data source. The method dynamically loads the data sources by configuring different data source basic information and adopting the operation of a graphical interface, thereby realizing the query of the heterogeneous data sources based on presto.

Description

Heterogeneous data source visual query method
Technical Field
The invention relates to the technical field of computers, in particular to a heterogeneous data source visual query method.
Background
In the era of big data explosion, the storage forms of data are various, mainly structured data, semi-structured data and unstructured data. Different data are stored in different data sources, most of structured data are stored in a relational data source, and domestic databases such as mysql, oracle, sqlserver, dreams, gold warehouse data sources and the like are the mainstream; storing historical data or data subjected to cleaning and conversion in a data source of a large data platform such as hive or an MPP data source based on a share nothing framework; the semi-structured data, such as streaming data, is mainly stored in kafka, a large number of structured log files are stored in a file server, and semi-structured key value data stored in a memory database redis stored in a memory database. Data in different data sources need to be subjected to integration query, and unified statistical analysis is carried out according to information in different data sources. Aiming at how to quickly query big data in a heterogeneous data source, the traditional solution is to write a specific query program, generally a jdbc program, aiming at different data sources, encapsulate a result set queried in the jdbc in an integration, then encapsulate data in a specific second set aiming at other data sources, such as a semi-structured data source and log data, and finally query and analyze the data by using a java api aiming at the data in the integration and the second set. In another way, data is uniformly aggregated into a platform, such as an oracle database or a hive data warehouse, and then the aggregated data is queried and analyzed by using respective query methods.
The current query method needs to write a large number of adaptation programs for different data sources, because adaptation needs to be performed on streaming message type data and on semi-structured log files in addition to the jdbc interface provided by common relational data. The general adaptive interface can not use the structured query language to make summary statistics on the data. For the second kind of data gathered to the unified platform, a large number of collection and extraction programs need to be written to gather the data in a unified mode, linear expansion query cannot be conducted according to query complexity, only machine resources of the machine can be used, and extra nodes cannot be used for enhancing query. The problem that may exist in this method is that when the statistical requirement changes or the queried data source changes, a large amount of manpower is needed to adjust the statistical mode and the summary mode, and when the structured query statement is not used for querying, a large amount of manpower cost may be needed for maintenance. When a large data set occurs, the query may overflow and the query takes a lot of time due to the limited resources of the nodes.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a visual query method for heterogeneous data sources, which is used for dynamically loading data sources by configuring different basic information of the data sources and adopting the operation of a graphical interface to realize the query of heterogeneous data sources based on presto.
The technical scheme is as follows: the invention adopts the following technical scheme:
a heterogeneous data source visualization query method comprises the following steps: a user selects a query scope on a foreground visual interface, wherein the query scope comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range;
the method for adding the data source to be loaded in the presto cluster comprises the following steps:
(1) registering a data source to be loaded;
(2) generating a configuration file of a data source to be loaded in a catalog folder;
(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, embedding the plug-in into the presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source.
In order to realize visualization, the following steps are also included before sending the query statement to the presto cluster:
decomposing the query task into a data table and an operation command; constructing a vector diagram according to the decomposed query task by using a Raphaeljs control, wherein nodes in the vector diagram comprise a data table and an operation command in the query task; displaying the constructed vector diagram on a foreground visual interface; the background analyzes the vector diagram into a structured query statement and sends the structured query statement to a presto cluster to complete query; the operation command comprises connection, function, screening and aggregation.
As an optimization, after sending an inquiry statement to the presto cluster, judging whether the quantity of the task data blocks of the started presto working nodes exceeds a threshold value of the quantity of the task data blocks returned by the presto cluster; if the number of the task data blocks exceeds the threshold value, starting a new presto working node service; re-sending the query statement to the presto cluster;
and when the number of the task data blocks in a period of time is lower than the threshold value of the number of the task data blocks of the started presto working nodes, closing partial presto working node services.
Priority levels are set for all presto working nodes, presto working nodes with high priority levels are started firstly when starting, and presto working nodes with low priority levels are closed firstly when closing.
Furthermore, the user is set with authority, the user is matched with the query queue according to the user authority, and the query statement sent by the user is submitted to the query queue matched with the user.
Has the advantages that: compared with the prior art, the visual query method for the heterogeneous data sources realizes hot deployment by detecting the built-in data source plug-ins of the presto cluster and dynamically loading the configuration files of the newly added data sources, can enable presto to access the newly added data sources in real time without restarting the presto, and improves the usability of the system.
Drawings
FIG. 1 is a flow diagram of data source loading of structured data;
FIG. 2 is a flow diagram of data source loading for a kafka message queue;
FIG. 3 is a flowchart of Hive loading of multiple big data platforms;
FIG. 4 is a flow diagram of visualizing queries;
FIG. 5 is a flow chart of resource scheduling;
FIG. 6 is a flow diagram of limiting the number of concurrencies of queries.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
A heterogeneous data source visualization query method comprises the following steps:
step 1, a visual interface is established on a foreground, a user selects a query range on the foreground visual interface, and the query range comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster;
adding a data source to be loaded in a presto cluster, comprising the following steps:
(1) registering a data source to be loaded;
the registration procedure differs for different types of data sources. For a data source of structured data, inputting data source information including an IP address, a port number, a user name and a password, and testing the Connectivity of the data source through jdbc (Java DataBase Connectivity).
For a data source of semi-structured data, a profile is required to be configured, which is mainly divided into the following types,
one is a kafka message queue, and the description file comprises the address of the broker of kafka, and description information of the corresponding message and the corresponding table in kafka. presto will parse and segment the data according to the description in the specification file and perform the query operation.
One is a redis memory type database, the description file includes the address and port of the redis, the password of the redis, the serial number of the current database and the description of the key value stored in the redis, and the presto segments and analyzes the data according to the description of the key value.
The description file is saved in json format and defines column names and column descriptions.
(2) Generating a configuration file of a data source to be loaded in a catalog folder;
(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating a plug-in of the data source to be loaded, embedding the plug-in into the presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source without restarting presto service.
As shown in FIG. 1, a flow diagram is loaded for a data source of structured data; as shown in fig. 2, a flow diagram for loading data sources for a kafka message queue.
The loading process of the plurality of hive big data sources comprises the following steps:
configuring a hive data source plug-in; configuring a source data service of each hive, wherein the service mainly describes and defines tables and column fields in the hive; configuring hdfs related file settings, and specifying hdfs addresses, namenodes and other related addresses; specifying operation rights, such as rename and delete, for each hive table; and dynamically loading a data source, and interfacing the hive data sources of one or more big data platforms.
After the data source is loaded, the virtual table structure information of different data sources can be checked on the interface.
Step 2, decomposing the query task into a data table and an operation command; a vector diagram is constructed by Raphaeljs controls according to decomposed query tasks, nodes in the vector diagram comprise data tables and operation commands in the query tasks, and the nodes are mainly divided into the following categories: (1) a data table control node, which describes the relevant information of the data table; (2) connecting control nodes, wherein the operation of the nodes is to perform correlation operation on tables, and the correlation operation comprises internal connection, left connection, right connection and the like; (3) the function control node can perform function operation on the specific data column; (4) the screening control node comprises row screening and column screening; (5) and aggregating the control nodes, performing aggregation statistics according to specific values, and performing statistical analysis by using a complex window function. And selecting nodes by a user according to the requirement, and determining the connection sequence of the nodes so as to generate the vector diagram.
The invention uses svg to render vector graphics in a browser by using a lightweight js library of Raphael. Then, each step of operation of the query task is virtualized to be a control node in the foreground, such as a data table control, which describes one table in the data warehouse from the logical structure, and the association control indicates that the association operation needs to be performed on the two tables. The user selects the nodes by dragging, the vector diagram can specifically describe the position of each dragged control node, and the size and the shape of each control node and the connection line between the control nodes are recorded.
And displaying the constructed vector diagram on a foreground visual interface.
Step 3, the background stores the vector diagram generated by the foreground, analyzes the nodes according to the node description of the vector diagram analysis, and converts the vector diagram into a structured query statement;
and the background carries out dynamic optimization adjustment on the converted structured query statement according to the complexity of the query. For complex queries, a packing and splitting mode is adopted, related query requirements are packaged into sub-queries, screening and filtering are performed again based on results of the sub-queries, and the speed is increased; for simple query, predicate push-down is adopted, and screening conditions are enhanced and filtered, so that filtering can be performed in advance, and the query rate is improved. The flow of visualizing the query is shown in fig. 4.
And sending the optimized and adjusted structured query statement to a presto cluster.
Step 4, after the presto cluster receives the query statement, dividing the query task into a plurality of task blocks according to the size of the data volume and the complexity of the data, and returning the number of the task blocks to the task database; binding the computing nodes and the task quantity blocks by default; judging whether the number of the task data blocks of the started presto working node exceeds a threshold value of the number of the task data blocks returned by the presto cluster; and if the number threshold of the task data blocks is exceeded, starting a new presto working node service to provide additional query, computing power and memory, and sending query statements to the presto cluster again to equally distribute the tasks to all presto working nodes including the newly started working nodes. The flow of resource scheduling is shown in fig. 5.
And when the number of the task data blocks in a period of time is lower than the threshold value of the number of the task data blocks of the started presto working nodes, closing partial presto working node services.
For presto working nodes with different performances, the working state of the presto working nodes can affect the working capacity of the whole cluster. In the invention, priority levels are set for all presto working nodes according to the performance, the working nodes with excellent performance have high priority levels, the working nodes with poor performance have low priority levels, the presto working nodes with high priority levels are started firstly when being started, and the presto working nodes with low priority levels are closed firstly when being closed, so that the utilization rate of the working nodes with good performance is improved, and the improvement of the performance of the whole cluster is facilitated.
In order to solve the problem of insufficient resources caused by simultaneous execution of a large number of queries, the invention sets the authority for the user, matches the user with the query queue according to the user authority, and submits the query statement sent by the user to the query queue matched with the user. Different queues have different attributes, the queues have the limit of statement capacity, when the capacity in the queues reaches the upper limit, the statements submitted later enter a waiting stage, when the queues are idle, the original waiting statements are analyzed, the task is segmented, the resource limit is judged, and then the system resources are dynamically adjusted according to the resource requirements. Therefore, the number of the query concurrencies of the same user is limited, and the concurrent query efficiency of different users is improved. The flow of limiting the number of concurrencies of queries is shown in FIG. 6.

Claims (5)

1. A heterogeneous data source visualization query method is characterized by comprising the following steps: a user selects a query scope on a foreground visual interface, wherein the query scope comprises one or more data sources; judging whether the selected data source is loaded into a presto cluster; if the selected data source is not loaded to the presto cluster, adding a data source to be loaded in the presto cluster; sending a query statement to the presto cluster, and completing query in the selected query range;
the method for adding the data source to be loaded in the presto cluster comprises the following steps:
(1) registering a data source to be loaded;
(2) generating a configuration file of a data source to be loaded in a catalog folder;
(3) detecting whether the data source plug-in is a data source plug-in built in a presto cluster or not according to the type of a data source to be loaded, and if so, dynamically loading a configuration file in a catalog folder by the presto cluster; if not, creating the plug-in of the data source to be loaded, embedding the plug-in into a presto cluster, loading the configuration file in the catalog folder, and completing the dynamic loading of the data source;
before sending the query statement to the presto cluster, the method also comprises the following steps:
decomposing the query task into a data table and an operation command; constructing a vector diagram according to the decomposed query task by using a Raphaeljs control, wherein nodes in the vector diagram comprise a data table and an operation command in the query task; displaying the constructed vector diagram on a foreground visual interface; the background analyzes the vector diagram into a structured query statement and sends the structured query statement to a presto cluster to complete query; the operation command comprises connection, function, screening and aggregation.
2. The method for visual query of heterogeneous data sources according to claim 1, further comprising: the background carries out dynamic optimization adjustment on the converted structured query statement according to the complexity of the query; the optimization adjustment comprises: for complex query, a packing and splitting mode is adopted, related query requirements are packaged into sub-queries, and screening and filtering are performed again based on results of the sub-queries; for simple query, predicate push-down is adopted, screening conditions are enhanced, and filtering is performed; and sending the optimized and adjusted structured query statement to a presto cluster.
3. The method for visual query of heterogeneous data sources according to claim 1, further comprising determining whether a threshold value of the number of task data blocks of a presto working node that has been started is exceeded according to the number of task data blocks returned by a presto cluster after a query statement is sent to the presto cluster; if the number of the task data blocks exceeds the threshold value, starting a new presto working node service; re-sending the query statement to the presto cluster;
and when the number of the task data blocks in a period of time is lower than the threshold value of the number of the task data blocks of the started presto working nodes, closing partial presto working node services.
4. The visual query method for the heterogeneous data sources according to claim 3, characterized in that priority levels are set for all presto working nodes, the presto working nodes with high priority levels are started first when starting, and the presto working nodes with low priority levels are closed first when closing.
5. The visual query method for the heterogeneous data sources according to claim 1, wherein a right is set for a user, the user and the query queue are matched according to the user right, and a query statement sent by the user is submitted to the query queue matched by the user.
CN201811547858.4A 2018-12-18 2018-12-18 Heterogeneous data source visual query method Active CN109815283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811547858.4A CN109815283B (en) 2018-12-18 2018-12-18 Heterogeneous data source visual query method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811547858.4A CN109815283B (en) 2018-12-18 2018-12-18 Heterogeneous data source visual query method

Publications (2)

Publication Number Publication Date
CN109815283A CN109815283A (en) 2019-05-28
CN109815283B true CN109815283B (en) 2020-10-09

Family

ID=66602100

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811547858.4A Active CN109815283B (en) 2018-12-18 2018-12-18 Heterogeneous data source visual query method

Country Status (1)

Country Link
CN (1) CN109815283B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674163A (en) * 2019-08-26 2020-01-10 天津浪淘科技股份有限公司 Heterogeneous data query system and method based on BS framework
CN110909386A (en) * 2019-11-21 2020-03-24 福建南威软件有限公司 Unified authorization access method and system for multiple data sources
US11048716B1 (en) * 2020-01-31 2021-06-29 Snowflake Inc. Managed virtual warehouses for tasks
CN111897824B (en) * 2020-03-25 2024-09-17 上海云砺信息科技有限公司 Data operation method, device, equipment and storage medium
CN112328668B (en) * 2020-09-10 2024-08-06 北京锐安科技有限公司 Data visualization realization method, device, equipment and storage medium
CN112364047B (en) * 2020-11-03 2024-04-09 山东华智人才科技有限公司 Presto-based dynamic data processing method and cross-library query method
CN114721746B (en) * 2022-06-06 2022-09-09 杭州玳数科技有限公司 Trino catalog hot loading method and system
CN115168690B (en) * 2022-09-06 2022-12-27 深圳市明源云科技有限公司 Data query method and device based on browser plug-in, electronic equipment and medium
CN116302206B (en) * 2023-03-31 2024-03-12 中电云计算技术有限公司 Presto data source hot loading method based on MQ
CN117289924A (en) * 2023-10-13 2023-12-26 河北云在信息技术服务有限公司 Visual task scheduling system and method based on Flink

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335472A (en) * 2015-09-30 2016-02-17 北京京东尚科信息技术有限公司 Method and device for updating configuration list of data query engine
CN105426467A (en) * 2015-11-16 2016-03-23 北京京东尚科信息技术有限公司 SQL query method and system for Presto
CN108874926A (en) * 2018-05-31 2018-11-23 康键信息技术(深圳)有限公司 Mass data inquiry method, device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8977600B2 (en) * 2013-05-24 2015-03-10 Software AG USA Inc. System and method for continuous analytics run against a combination of static and real-time data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335472A (en) * 2015-09-30 2016-02-17 北京京东尚科信息技术有限公司 Method and device for updating configuration list of data query engine
CN105426467A (en) * 2015-11-16 2016-03-23 北京京东尚科信息技术有限公司 SQL query method and system for Presto
CN108874926A (en) * 2018-05-31 2018-11-23 康键信息技术(深圳)有限公司 Mass data inquiry method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN109815283A (en) 2019-05-28

Similar Documents

Publication Publication Date Title
CN109815283B (en) Heterogeneous data source visual query method
US11888702B2 (en) Intelligent analytic cloud provisioning
JP6117378B2 (en) System and method for a distributed database query engine
US20180285417A1 (en) Intelligent query parameterization of database workloads
US9633104B2 (en) Methods and systems to operate on group-by sets with high cardinality
US11055352B1 (en) Engine independent query plan optimization
CN105677812A (en) Method and device for querying data
US9930113B2 (en) Data retrieval via a telecommunication network
US20070250517A1 (en) Method and Apparatus for Autonomically Maintaining Latent Auxiliary Database Structures for Use in Executing Database Queries
US11620310B1 (en) Cross-organization and cross-cloud automated data pipelines
US20220413819A1 (en) Code translations of resilient distributed datasets in databases
CN110740079A (en) full link benchmark test system for distributed scheduling system
US11599512B1 (en) Schema inference for files
US11775544B2 (en) Feature sets using semi-structured data storage
Sinthong et al. Aframe: Extending dataframes for large-scale modern data analysis
CN114443680A (en) Database management system, related apparatus, method and medium
CN108319604B (en) Optimization method for association of large and small tables in hive
US11868353B1 (en) Fingerprints for database queries
Sinthong et al. AFrame: Extending DataFrames for large-scale modern data analysis (Extended Version)
CN115982230A (en) Cross-data-source query method, system, equipment and storage medium of database
US20220215021A1 (en) Data Query Method and Apparatus, Computing Device, and Storage Medium
CN113742346A (en) Asset big data platform architecture optimization method
CN109918410B (en) Spark platform based distributed big data function dependency discovery method
EP2990960A1 (en) Data retrieval via a telecommunication network
US12007994B1 (en) Partition granular selectivity estimation for predicates

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant