CN114756629B - Multi-source heterogeneous data interaction analysis engine and method based on SQL - Google Patents

Multi-source heterogeneous data interaction analysis engine and method based on SQL

Info

Publication number
CN114756629B
CN114756629B (granted from application CN202210677318.8A)
Authority
CN
China
Prior art keywords
data
task
sql
sub
plan
Prior art date
Legal status
Active
Application number
CN202210677318.8A
Other languages
Chinese (zh)
Other versions
CN114756629A (en)
Inventor
罗实
杨博
张璠璠
张涛
傅四维
王永恒
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210677318.8A
Publication of CN114756629A
Application granted
Publication of CN114756629B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention discloses an SQL-based multi-source heterogeneous data interactive analysis engine and method. The engine comprises an interactive console, a coordinator, executors and source drivers connected in sequence, and the method comprises the following steps. Step S1: the interactive console acquires SQL data and feeds back the SQL execution result. Step S2: the coordinator parses the SQL data into a logical plan tree consisting of a series of tasks with dependency relationships, splits the logical plan tree into sub-plans that can be executed concurrently, each sub-plan containing a group of tasks, and schedules the sub-plans to the executors. Step S3: according to the sub-plan, the executor acquires data-source data through the source driver, executes the tasks in the sub-plan, and feeds back the SQL execution result to the interactive console. The invention achieves interactive correlation and fusion analysis of multi-source heterogeneous data, reduces cost, improves data accuracy and timeliness, and offers good extensibility and maintainability.

Description

Multi-source heterogeneous data interaction analysis engine and method based on SQL
Technical Field
The invention relates to the technical field of big data interactive analysis, in particular to an SQL-based multi-source heterogeneous data interactive analysis engine and method.
Background
In the big data era, the explosive growth of data volume and the diversification of data structures pose two core problems for data analysis: real-time processing and correlated fusion. In an application system, data is often dispersed across multiple heterogeneous storage structures such as databases, search engines, cache systems, message queues and local files. Such multi-source, heterogeneous analysis scenarios make data correlation analysis difficult: a single business analysis often requires several intermediate steps to be coordinated, the cost of actual operation is high, and data accuracy and timeliness cannot meet requirements. In recent years the data lake has been proposed to integrate various data sources uniformly, but its storage-centric design remains inconvenient for accessing and managing massive, diverse data. How to analyze large-scale multi-source heterogeneous data efficiently, conveniently and in real time has become a new challenge.
Disclosure of Invention
In order to overcome the defects of the prior art and achieve interactive correlation and fusion analysis of multi-source heterogeneous data by loading different data source connectors, the invention adopts the following technical scheme:
a multisource heterogeneous data interaction analysis engine based on SQL comprises an interaction console, a coordinator, an actuator and a source driver which are sequentially connected;
the interactive console is used for acquiring SQL data and returning an SQL execution result;
the coordinator is used for analyzing the SQL data to generate a logic plan tree, wherein the logic plan tree comprises a series of tasks with dependency relationship, the logic plan tree is divided into sub-plans capable of being executed concurrently, each sub-plan comprises a group of tasks, and the sub-plans are dispatched to an actuator;
the executor acquires data source data through the source driver according to the sub-plan, executes tasks in the sub-plan, and feeds back SQL execution results to the interactive console;
and the source driver is used for establishing a link between the actuator and a data source and acquiring data of the data source.
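As a minimal sketch only, the four components above could be expressed as Java interfaces such as the following; all type and method names (InteractiveConsole, Coordinator, Executor, SourceDriver, SubPlan and so on) are illustrative assumptions, not the patent's actual API.

    // Hypothetical interfaces mirroring the console / coordinator / executor / source driver roles.
    import java.util.List;

    interface InteractiveConsole {
        String submitSql(String sql);                 // acquire SQL data, return the execution result
    }

    interface Coordinator {
        LogicalPlanTree parse(String sql);            // SQL -> logical plan tree of dependent tasks
        List<SubPlan> split(LogicalPlanTree plan);    // split into concurrently executable sub-plans
        void schedule(List<SubPlan> subPlans);        // dispatch sub-plans to executors
    }

    interface Executor {
        ResultSetLike execute(SubPlan subPlan);       // run the group of tasks of one sub-plan
    }

    interface SourceDriver {
        void connect(String dataSourceUri);           // establish the link to a concrete data source
        Iterable<Object[]> readBatch(String table, int batchSize);   // fetch source data in batches
    }

    // Marker types standing in for the structures named in the text.
    class LogicalPlanTree { }
    class SubPlan { }
    class ResultSetLike { }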
Furthermore, the interactive console receives and counts SQL data, performs multi-task execution and control, monitors the state of the coordinator and/or the executors, maintains and queries the data source list, and feeds back the SQL execution result. The structure of the invention adopts the classic MVC pattern: a user operates interactively through a client, and the task progress and analysis results are fed back in real time through the console.
Furthermore, the coordinator comprises an SQL parsing module, a metadata interface, a task planning module, a task scheduling module and a service discovery module;
the SQL parsing module parses the SQL data to generate an abstract syntax tree, and acquires data source information through the metadata interface for verification;
the metadata interface is used for querying and checking data source information;
the task planning module is used for converting the abstract syntax tree into a logical plan tree;
the task scheduling module is used for splitting the tasks in the logical plan tree and scheduling and distributing the resulting sub-plans to a group of executors;
the service discovery module acquires the state of the executors and is used for allocating executors during scheduling.
Further, the service discovery module collects heartbeat and task load data from each executor and maintains the list of executors currently available for distribution, so that the task scheduling module can select executors for sub-plan distribution according to the currently available executors and their load states; a minimal selection sketch follows.
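The sketch below shows one way such a module could pick the least-loaded live executor; the class and field names (ExecutorNode, the 10-second heartbeat window, the running-task count as the load metric) are assumptions made for illustration, not details taken from the patent.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    class ExecutorNode {
        final String id;
        volatile Instant lastHeartbeat;   // updated whenever a heartbeat arrives
        volatile int runningTasks;        // task load reported together with the heartbeat
        ExecutorNode(String id) { this.id = id; this.lastHeartbeat = Instant.now(); }
    }

    class ServiceDiscovery {
        private static final Duration HEARTBEAT_TIMEOUT = Duration.ofSeconds(10);

        // Keep only executors with a fresh heartbeat, then prefer the lowest task load.
        Optional<ExecutorNode> pickExecutor(List<ExecutorNode> nodes) {
            Instant deadline = Instant.now().minus(HEARTBEAT_TIMEOUT);
            return nodes.stream()
                    .filter(n -> n.lastHeartbeat.isAfter(deadline))
                    .min(Comparator.comparingInt((ExecutorNode n) -> n.runningTasks));
        }
    }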
Further, the tasks in the logical plan tree comprise a data reading task, a data filtering task, a data association task, a grouping and aggregation task, a sorting task, an extraction task and an output task (a minimal sketch of these node types follows this list);
the data reading task reads data from a data source;
the data filtering task filters the data read from the data source;
the data association task joins the data that has been read in parallel and filtered;
the grouping and aggregation task groups the associated data;
the sorting task sorts the grouped data;
the extraction task extracts the sorted data;
the output task outputs the extracted data.
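The following sketch, offered only as an illustration, models the logical plan tree as a hierarchy of task nodes matching the task kinds listed above; the class names follow the DataScanTask/FilterTask/JoinTask naming used later in the description, while the fields are assumptions made for the sketch.

    import java.util.List;

    abstract class PlanTask {
        final List<PlanTask> inputs;          // upstream tasks this task depends on
        PlanTask(List<PlanTask> inputs) { this.inputs = inputs; }
    }

    class DataScanTask extends PlanTask {     // read data from one data source table
        final String source, table;
        DataScanTask(String source, String table) { super(List.of()); this.source = source; this.table = table; }
    }
    class FilterTask extends PlanTask {       // filter the scanned rows
        final String predicate;
        FilterTask(PlanTask in, String predicate) { super(List.of(in)); this.predicate = predicate; }
    }
    class JoinTask extends PlanTask {         // associate the two filtered branches
        JoinTask(PlanTask left, PlanTask right) { super(List.of(left, right)); }
    }
    class AggregateTask extends PlanTask {    // group and aggregate
        AggregateTask(PlanTask in) { super(List.of(in)); }
    }
    class SortTask extends PlanTask {         // sort the grouped results
        SortTask(PlanTask in) { super(List.of(in)); }
    }
    class LimitTask extends PlanTask {        // extract the first N rows
        final int n;
        LimitTask(PlanTask in, int n) { super(List.of(in)); this.n = n; }
    }
    class OutputTask extends PlanTask {       // emit the final result
        OutputTask(PlanTask in) { super(List.of(in)); }
    }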
Furthermore, the executor comprises an execution queue, a first judging module, a second judging module, a third judging module, a splitting module, an execution module, a merging module and a fourth judging module;
the execution queue queues the sub-plans acquired by the executor in order;
the first judging module takes the current task from the execution queue; if the current task is a remote-type task in the sub-plan, a remote channel is opened to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise the current task is passed to the second judging module;
the second judging module acquires and judges the current task; if the current task is the data reading task in the sub-plan, it is passed to the third judging module, otherwise it is passed to the splitting module;
the third judging module judges, for the current task, whether the corresponding source driver has been loaded; if so, the data-source data is read in batches, otherwise the corresponding source driver is loaded, a connection pool is initialized, and the corresponding data source is connected;
the splitting module splits the current task according to the data processing scale and the resources of the executor the task runs on; by splitting the task inside the executor reasonably, the data blocks processed by the split slice subtasks can be executed entirely in memory, avoiding disk I/O and improving analysis timeliness compared with traditional big-data computing frameworks such as MapReduce and Hive; the tasks of a sub-plan are executed asynchronously in the executor in pipeline fashion, and distributed concurrent processing plus pipelined computation provide the technical support for real-time analysis of big data;
the execution module executes the split subtasks concurrently;
the merging module merges, rearranges and deduplicates the intermediate results produced by the subtasks to obtain the execution result of the current task, and feeds back the execution result;
the fourth judging module judges whether an unexecuted task remains in the execution queue; if so, the unexecuted task becomes the current task and enters the first judging module, otherwise task execution ends.
The executor cluster adopts a parallel processing architecture: each executor has independent computing and storage resources, and the failure of a task on a single node does not affect the overall process. The scale of the cluster can be adjusted dynamically according to the data volume to be processed, so that analysis of data at the hundred-million-record level can be handled with room to spare.
Furthermore, when the coordinator splits the logical plan tree into sub-plans it adds remote-type exchange tasks; when sub-plans are scheduled onto different executors, a remote channel is established for data exchange between them, since in a distributed environment the data must be transmitted remotely over the network.
An SQL-based multi-source heterogeneous data interaction analysis method comprises the following steps:
step S1: acquiring SQL data;
step S2: parsing the SQL data to generate a logical plan tree, the logical plan tree comprising a series of tasks with dependency relationships, and splitting the logical plan tree into sub-plans that can be executed concurrently, each sub-plan comprising a group of tasks;
step S3: according to the sub-plan, establishing a connection with the data source, acquiring the data-source data, executing the tasks in the sub-plan and returning the SQL execution result.
Further, the step S3 comprises the following steps:
step S31: acquiring a sub-plan and putting the tasks of the sub-plan into an execution queue;
step S32: taking the current task from the execution queue and initializing it;
step S33: judging whether the current task is a remote-type task; if so, opening a remote channel to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise entering step S34;
step S34: judging whether the current task is the data reading task of the sub-plan; if so, judging whether the corresponding source driver has been loaded: if loaded, the data-source data is read directly in batches; if not, the corresponding source driver is loaded, a connection pool is initialized, the corresponding data source is connected and the data-source data is then read in batches; if the current task is not a data reading task, entering step S35;
step S35: splitting the current task according to the data processing scale and the resources of the executor the task runs on; by splitting the task inside the executor reasonably, the data blocks processed by the split slice subtasks can be executed entirely in memory, avoiding disk I/O and improving analysis timeliness compared with traditional big-data computing frameworks such as MapReduce and Hive; the tasks of a sub-plan are executed asynchronously in the executor in pipeline fashion, and distributed concurrent processing plus pipelined computation provide the technical support for real-time analysis of big data; executing the split subtasks concurrently; merging, rearranging and deduplicating the intermediate results of the subtasks to obtain the execution result of the current task and feeding it back; then judging whether an unexecuted task remains in the execution queue; if so, taking it as the current task and returning to step S32, otherwise ending task execution.
Further, the sub-plans are obtained by segmenting the logical plan tree and adding remote-type exchange tasks; when sub-plans are scheduled onto different executors, a remote channel is established for data exchange between them, since in a distributed environment the data must be transmitted remotely over the network.
The invention has the following advantages and beneficial effects:
The SQL-based multi-source heterogeneous data interaction analysis engine and method perform correlated fusion analysis over a variety of mainstream heterogeneous data sources; for a specific data source only the corresponding source driver plug-in needs to be loaded. This solves the problems of increased cost and reduced data accuracy and timeliness that arise when, in a multi-source heterogeneous analysis scenario, a single business analysis has to be completed through multiple coordinated intermediate steps, and it gives the engine good extensibility and maintainability. The SQL is compiled into a logical plan tree consisting of multiple tasks; through reasonable splitting, the sub-plans are scheduled to a distributed cluster for concurrent execution; the tasks of a sub-plan are executed asynchronously in the executors in pipeline fashion; and distributed concurrent processing plus pipelined computation provide the technical support for real-time analysis of big data.
Drawings
FIG. 1 is a schematic structural diagram of an interaction analysis engine in an embodiment of the present invention.
FIG. 2 is a console page display diagram of the interaction analysis engine in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a logical plan tree structure of a coordinator in an embodiment of the present invention.
FIG. 4 is a flowchart of a method for constructing an interactive analysis engine according to an embodiment of the present invention.
FIG. 5 is a flow chart illustrating the execution of the worker node of the executor in an embodiment of the present invention.
FIG. 6 is a flow chart of multi-threaded scheduled execution of a sub-task in an embodiment of the invention.
FIG. 7 is a schematic diagram of an apparatus for constructing an interactive analysis engine according to the present invention.
Detailed Description
The following describes in detail embodiments of the present invention with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are given by way of illustration and explanation only, not limitation.
As shown in FIG. 1, the SQL-based multi-source heterogeneous data interaction analysis engine comprises an interactive console, a coordinator, executors and source drivers.
The interactive console is used for acquiring SQL data and returning the SQL execution result; it receives and counts SQL data, performs multi-task execution and control, monitors the state of the coordinator and/or the executors, maintains and queries the data source list, and feeds back the SQL execution result.
Specifically, the interactive console is the bridge between the engine and the user; its functions include receiving user input, collecting SQL execution statistics, monitoring the node status of the coordinator and executors, and displaying the fed-back analysis results. As shown in fig. 2, the top of the console page is the cluster monitoring module, which monitors queue data, cluster resources and network traffic: the queue data shows the states of SQL query statements (executing, queued, blocked), the cluster resources show the executor, CPU and memory states, and the network traffic shows the number of concurrent network tasks, the row count and the byte count. The left side is the data source panel, used to select the data source to be queried. The right side, from top to bottom, contains the SQL query input area, the query result display area and the query history panel.
The coordinator is used for parsing the SQL data to generate a logical plan tree, the logical plan tree comprising a series of tasks with dependency relationships; the logical plan tree is split into sub-plans that can be executed concurrently, each sub-plan comprising a group of tasks, and the sub-plans are scheduled to the executors. The coordinator comprises an SQL parsing module, a metadata interface, a task planning module, a task scheduling module and a service discovery module;
the SQL parsing module parses the SQL data to generate an abstract syntax tree and acquires data source information through the metadata interface for verification;
the metadata interface is used for querying and checking data source information;
the task planning module is used for converting the abstract syntax tree into the logical plan tree;
the task scheduling module splits the tasks of the logical plan tree and distributes the resulting sub-plans to a group of executors;
the service discovery module acquires the state of the executors and is used for allocating executors during scheduling.
Tasks in the logical plan tree comprise a data reading task, a data filtering task, a data association task, a grouping and aggregation task, a sorting task, an extraction task and an output task;
the data reading task reads data from a data source;
the data filtering task filters the data read from the data source;
the data association task joins the data that has been read in parallel and filtered;
the grouping and aggregation task groups the associated data;
the sorting task sorts the grouped data;
the extraction task extracts the sorted data;
the output task outputs the extracted data.
When splitting out the sub-plans, the coordinator adds remote-type exchange tasks, and when scheduling onto different executors a remote channel is established for data exchange between them.
Specifically, the coordinator is responsible for receiving the SQL query request, parsing the SQL statement to generate a logical plan tree, splitting the logical plan tree into sub-plans that can be executed concurrently in the distributed cluster, and finally scheduling the sub-plans to the executor nodes for execution according to a specific algorithm. The coordinator contains the following five sub-module controls:
a) SQL parsing
The SQL parsing control is responsible for parsing the SQL syntax: it parses the SQL into an abstract syntax tree using antlr4 (a parser generator tool) and applies certain optimizations. The data sources are then verified by querying the metadata configuration. Take the query "select t1.Rank, count(*) as num from c1.Student t1 join c2.Family t2 on t1.familyId = t2.Id where t1.Age < 18 and t2.Income > 30 group by t1.Rank sort by num limit 5" as an example: this statement produces, per rank group, the count of students who are under 18 years of age and whose family annual income is greater than 300,000 (the income column being stored in units of ten thousand). Since the student and family tables belong to two different data sources c1 and c2, the SQL parsing control first parses and optimizes the SQL syntax through antlr4, then checks the information of the two data sources c1 and c2, querying whether existing data source configurations with the aliases c1 and c2 are present; if a data source configuration is missing or incomplete, an SQL statement error is fed back to the console directly. Otherwise, after verification passes, the SQL abstract syntax tree is passed to the task planning control.
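As a hedged sketch of the verification step just described, the listing below stubs out the antlr4 parse itself (parseToAst) and shows only the data source alias check against the metadata configuration; SqlParsingControl, MetadataApi and SourceRef are hypothetical names invented for the illustration.

    import java.util.List;

    class SqlParsingControl {
        /** A table reference such as c1.Student, split into source alias + table name. */
        record SourceRef(String sourceAlias, String table) { }

        interface MetadataApi {                      // backed by the loaded source configuration file
            boolean hasSource(String alias);
        }

        Object parseAndVerify(String sql, MetadataApi metadata) {
            Object ast = parseToAst(sql);            // antlr4 parse + optimization, omitted here
            for (SourceRef ref : referencedSources(ast)) {
                if (!metadata.hasSource(ref.sourceAlias())) {
                    // missing or incomplete source configuration: report the SQL error to the console
                    throw new IllegalArgumentException("unknown data source alias: " + ref.sourceAlias());
                }
            }
            return ast;                              // verification passed, hand the AST to task planning
        }

        private Object parseToAst(String sql) { return sql; }      // placeholder for the generated parser
        private List<SourceRef> referencedSources(Object ast) {    // placeholder extraction from the AST
            return List.of(new SourceRef("c1", "Student"), new SourceRef("c2", "Family"));
        }
    }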
b) Metadata API
The metadata API control is generated when the coordinator loads the source configuration file, which declares the data source information to be loaded and analyzed, including the alias, IP address, port, schema library and data tables of each data source. During SQL parsing the metadata API control provides the source information query and verification function for the user's SQL.
c) Task planning
The task planning control converts the SQL abstract syntax tree into a logical plan tree; the logical plan tree finally generated for the SQL statement above is shown in fig. 3. The logical plan tree is composed of a series of tasks with dependency relationships, the input of an upstream task depending on the output of its downstream task. DataScanTask is responsible for reading data from a specific data source, the left and right branches corresponding to the acquisition of c1.Student and c2.Family respectively; FilterTask filters the read source data, the left branch selecting students younger than 18 and the right branch selecting families with an annual income above 300,000; JoinTask joins the data of the left and right branches on the family id; AggregateTask groups the students by rank and counts them; SortTask sorts the statistics of the rank groups in ascending order; LimitTask takes the first 5 entries of the sorted result; OutputTask represents the output task of the final result data.
d) Task scheduling
The task scheduling control is responsible for splitting the tasks of the logical plan tree and scheduling the resulting sub-plans across the distributed cluster. As shown in fig. 3, according to the structure of the logical plan tree, the tree is cut by dotted lines into 4 sub-plans, and each sub-plan is distributed to one or more executors for execution according to a specific scheduling algorithm. Data exchange between different executors requires an added exchange task (ExchangeTask), indicating that in a distributed environment the data must be transmitted remotely over the network. A minimal sketch of this stage-splitting step follows.
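The sketch below is illustrative only: it cuts a plan tree into sub-plans at exchange boundaries, in the spirit of the dotted lines in fig. 3. PlanNode, ExchangeNode-style boundaries and SubPlan are hypothetical names, and a real implementation would also weigh data size and the scheduling policy.

    import java.util.ArrayList;
    import java.util.List;

    class PlanNode {
        final String name;
        final List<PlanNode> children = new ArrayList<>();
        boolean exchangeBoundary;            // true where the coordinator inserts an ExchangeTask
        PlanNode(String name, boolean exchangeBoundary, PlanNode... kids) {
            this.name = name;
            this.exchangeBoundary = exchangeBoundary;
            for (PlanNode k : kids) children.add(k);
        }
    }

    class SubPlan {
        final List<String> tasks = new ArrayList<>();   // the group of tasks shipped to executors
    }

    class PlanSplitter {
        // Walk the tree; start a new sub-plan whenever an exchange boundary is crossed.
        List<SubPlan> split(PlanNode root) {
            List<SubPlan> all = new ArrayList<>();
            collect(root, newSubPlan(all), all);
            return all;
        }
        private SubPlan newSubPlan(List<SubPlan> all) {
            SubPlan sp = new SubPlan();
            all.add(sp);
            return sp;
        }
        private void collect(PlanNode node, SubPlan current, List<SubPlan> all) {
            current.tasks.add(node.name);
            for (PlanNode child : node.children) {
                // a boundary below this node means the child belongs to a different sub-plan
                collect(child, child.exchangeBoundary ? newSubPlan(all) : current, all);
            }
        }
    }

Applied to the example of fig. 3, this would plausibly yield the four sub-plans described there: one per scan-and-filter branch, one for join and aggregation, and one for sort, limit and output.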
e) Service discovery
The service discovery module collects heartbeat and task load data from each executor, maintains the list of executors currently available for distribution, and allows the task scheduling module to select executors for sub-plan distribution according to the currently available executors and their load states.
Specifically, the service discovery control is an auxiliary control of the coordinator; it collects heartbeat and task load data from every executor in the cluster and maintains the list of executors currently available for distribution. Through the service discovery control, task scheduling can preferentially select execution nodes for plan distribution according to the currently available executors, their load states and a specific scheduling algorithm.
The executor is responsible for executing the tasks of the sub-plans. Multiple executor nodes form a distributed working cluster; each executor node in the cluster is a running process and receives the sub-plans distributed by the coordinator over HTTP. Each executor node also monitors its own state information, including the number of tasks, the thread pool size, the task execution states, and its own CPU and memory utilization, and sends heartbeat and load data to the coordinator periodically to indicate its health.
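A minimal sketch of the periodic heartbeat reporting just described is given below; the Heartbeat fields, the 3-second interval and the report() transport are assumptions made for the sketch, since the text only specifies that heartbeats carrying load data are sent on a timer.

    import java.lang.management.ManagementFactory;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    class HeartbeatReporter {
        record Heartbeat(String nodeId, int runningTasks, double systemLoad, long freeMemoryBytes) { }

        private final String nodeId;
        private final AtomicInteger runningTasks = new AtomicInteger();
        private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        HeartbeatReporter(String nodeId) { this.nodeId = nodeId; }

        void start() {
            timer.scheduleAtFixedRate(() -> report(new Heartbeat(
                    nodeId,
                    runningTasks.get(),
                    ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage(),
                    Runtime.getRuntime().freeMemory())), 0, 3, TimeUnit.SECONDS);
        }

        void taskStarted()  { runningTasks.incrementAndGet(); }
        void taskFinished() { runningTasks.decrementAndGet(); }

        private void report(Heartbeat hb) {
            // stand-in for the real call to the coordinator's service discovery endpoint
            System.out.println("heartbeat -> coordinator: " + hb);
        }
    }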
According to the sub-plan, the executor acquires data-source data through the source driver, executes the tasks in the sub-plan, and feeds back the SQL execution result to the interactive console.
The executor comprises an execution queue, a first judging module, a second judging module, a third judging module, a splitting module, an execution module, a merging module and a fourth judging module;
the execution queue queues the sub-plans acquired by the executor in order;
the first judging module takes the current task from the execution queue; if the current task is a remote-type task in the sub-plan, a remote channel is opened to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise the current task is passed to the second judging module;
the second judging module acquires and judges the current task; if the current task is the data reading task in the sub-plan, it is passed to the third judging module, otherwise it is passed to the splitting module;
the third judging module judges, for the current task, whether the corresponding source driver has been loaded; if so, the data-source data is read in batches, otherwise the corresponding source driver is loaded, a connection pool is initialized, and the corresponding data source is connected;
the splitting module splits the current task according to the data processing scale and the resources of the executor the task runs on; by splitting the task inside the executor reasonably, the data blocks processed by the split slice subtasks can be executed entirely in memory, avoiding disk I/O and improving analysis timeliness compared with traditional big-data computing frameworks such as MapReduce and Hive; the tasks of a sub-plan are executed asynchronously in the executor in pipeline fashion, and distributed concurrent processing plus pipelined computation provide the technical support for real-time analysis of big data.
Specifically, the splitting module divides the input data of a task into multiple slice data subsets, and schedules the task's processing logic over these slice data subsets with multiple threads for concurrent execution. The input data depends on the upstream task. When the upstream task is a DataScanTask, slicing is implemented per source data type: for HDFS (Hadoop Distributed File System), a slice data subset corresponds to a sequence file under the partition data; for the MySQL relational database management system, a slice data subset corresponds to a batch of row records of an entity table. When the upstream task is not a DataScanTask, the upstream output data is re-split. The splitting logic generally considers factors such as the memory the current node can use, the number of concurrent tasks, the size of the single-task execution thread pool, and the maximum combined length of the field types of a single record.
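Purely as an illustration of this splitting, the listing below sizes slices so that each fits in memory and runs them concurrently on a bounded thread pool; the sizing formula and the Slice/Row names are assumptions, since the text only lists the factors the real splitting logic weighs.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    class SliceSplitter {
        record Row(Object[] fields) { }
        record Slice(List<Row> rows) { }

        // Rough slice sizing: usable memory divided by (concurrency * max bytes per row).
        static int rowsPerSlice(long usableMemoryBytes, int concurrentTasks, int maxRowBytes) {
            return (int) Math.max(1, usableMemoryBytes / ((long) concurrentTasks * maxRowBytes));
        }

        static List<Slice> split(List<Row> input, int rowsPerSlice) {
            List<Slice> slices = new ArrayList<>();
            for (int i = 0; i < input.size(); i += rowsPerSlice) {
                slices.add(new Slice(input.subList(i, Math.min(i + rowsPerSlice, input.size()))));
            }
            return slices;
        }

        // Run the task's processing logic over all slices concurrently and gather the outputs.
        static <R> List<R> runConcurrently(List<Slice> slices,
                                           java.util.function.Function<Slice, R> logic,
                                           int threadPoolSize) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threadPoolSize);
            try {
                List<Future<R>> futures = new ArrayList<>();
                for (Slice s : slices) futures.add(pool.submit(() -> logic.apply(s)));
                List<R> results = new ArrayList<>();
                for (Future<R> f : futures) results.add(f.get());
                return results;
            } finally {
                pool.shutdown();
            }
        }
    }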
The execution module executes the split subtasks concurrently;
the merging module merges, rearranges and deduplicates the intermediate results produced by the subtasks to obtain the execution result of the current task, and feeds back the execution result;
the fourth judging module judges whether an unexecuted task remains in the execution queue; if so, the unexecuted task becomes the current task and enters the first judging module, otherwise task execution ends.
The executor cluster adopts a parallel processing architecture: each executor has independent computing and storage resources, and the failure of a task on a single node does not affect the overall process. The scale of the cluster can be adjusted dynamically according to the data volume to be processed, so that analysis of data at the hundred-million-record level can be handled with room to spare.
The source driver is used for establishing the connection between the executor and a data source and acquiring data of the data source.
Specifically, the source drivers are the implementation basis of multi-source heterogeneous data association analysis: each source driver is a concrete implementation of a specific data source under a unified interface, providing operations such as source instance management, connection and thread pool maintenance, data reading and SQL execution. The source drivers are attached to the executor as plug-ins, and their loading is realized by dynamic binding.
As shown in fig. 4, the SQL-based multi-source heterogeneous data interaction analysis method comprises the following steps:
step S1: acquiring SQL data;
step S2: parsing the SQL data to generate a logical plan tree, the logical plan tree comprising a series of tasks with dependency relationships, and splitting the logical plan tree into multiple sub-plans, each sub-plan comprising a group of tasks;
in particular, multiple sub-plans can be executed concurrently. The sub-plan is an independent part of a plan tree logic structure, the sub-plan is allocated to an actuator cluster for execution, as shown in fig. 3, a "grouping aggregation task" is a sub-plan, which can be allocated to multiple actuators for concurrent execution through a hash splitting group by field (also, an actuator may concurrently execute multiple different sub-plans), and finally, the sub-plan is completed in the cluster through a combining group by field.
Step S3: according to the sub-plan, establishing a connection with the data source, acquiring the data-source data, executing the tasks in the sub-plan and returning the SQL execution result. As shown in fig. 5, this comprises the following steps:
step S31, acquiring a sub-plan, and putting tasks in the sub-plan into an execution queue;
step S32, acquiring a current task from the execution queue and initializing;
step S33: judging whether the current task is a remote-type task; if so, opening a remote channel to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise entering step S34;
step S34: judging whether the current task is the data reading task of the sub-plan; if so, judging whether the corresponding source driver has been loaded: if loaded, reading the data-source data directly; if not, loading the corresponding source driver and then reading the data-source data; if the current task is not a data reading task, entering step S35;
step S35: splitting the current task according to the data processing scale and the resources of the executor the task runs on; executing the split subtasks concurrently; merging, rearranging and deduplicating the intermediate results of the subtasks to obtain the execution result of the current task and feeding it back; then judging whether an unexecuted task remains in the execution queue; if so, taking it as the current task and returning to step S32, otherwise ending task execution.
When the sub-plans are split out, remote-type exchange tasks are added, and when scheduling onto different executors a remote channel is established for data exchange between them.
In an embodiment of the present invention, step S3 specifically executes the following process:
step S31: acquiring the sub-plan, and putting the task set of the sub-plan into the local execution queue in order;
step S32: taking the first task out of the local execution queue and initializing the task's state information;
step S33: judging whether the current task is of a remote task type; the remote tasks include OutputTask and ExchangeTask, which respectively output the result to the coordinator or exchange intermediate data with another executor. The ExchangeTask is a logical task added by the coordinator's task scheduling control when splitting the sub-plans, indicating that in a distributed environment data must be transmitted between different nodes over the network. If the task is a remote task, a remote channel is opened and the data is transmitted over the network to the coordinator or to other executors; otherwise entering step S34;
step S34: judging whether the current task is a data source reading task (DataScanTask); if so, judging whether the corresponding data source driver has been loaded: if loaded, the target data-source data is read directly in batches; if not, the data source driver is loaded, a connection pool is initialized, the corresponding data source is connected, and the target data-source data is then read in batches; if the current task is not a DataScanTask, entering step S35;
step S35: slicing the task according to the node's memory and thread pool sizes and the data scale handled by the task. A slice is the smallest execution-unit set of data to be operated on, for example a single sequence file of a partition in an HDFS data source, a batch of row records of an entity table in a MySQL data source, or a batch of sub-data output by an upstream task. The split slice data set is executed with multi-threaded scheduling, as shown in fig. 6; the intermediate results of the slices are then merged, with local merge, rearrangement (shuffle) and deduplication optimization, and the result of the current task is returned; finally, it is judged whether tasks remain in the executor's local queue: if not, the process ends, and if so, the flow loops back to step S32.
The SQL-based multi-source heterogeneous data interaction analysis engine and its construction method can perform correlated fusion analysis over a variety of mainstream heterogeneous data sources; for a specific data source only the corresponding source driver plug-in needs to be loaded, which gives very good extensibility and maintainability. The invention compiles the SQL into a logical plan tree consisting of multiple tasks; through reasonable splitting, the sub-plans are scheduled to the distributed cluster for concurrent execution; the tasks of a sub-plan are executed asynchronously in the executors in pipeline fashion; and distributed concurrent processing plus pipelined computation provide the technical support for real-time analysis of big data.
The structure of the invention adopts the classic MVC pattern: a user operates interactively through a client, and the task progress and analysis results are fed back in real time through the console. The executor cluster adopts a parallel processing architecture: each executor has independent computing and storage resources, and the failure of a task on a single node does not affect the overall process. The scale of the cluster can be adjusted dynamically according to the data volume to be processed, so that analysis of data at the hundred-million-record level can be handled with room to spare.
The method splits the tasks inside the executor reasonably, ensuring that the data blocks processed by the split slice subtasks can be executed entirely in memory, avoiding disk I/O and improving analysis timeliness compared with traditional big-data computing frameworks such as MapReduce and Hive.
Corresponding to the embodiment of the SQL-based multi-source heterogeneous data interaction analysis method, the invention also provides an embodiment of an SQL-based multi-source heterogeneous data interaction analysis apparatus.
Referring to fig. 7, the SQL-based multi-source heterogeneous data interaction analysis apparatus provided by the embodiment of the invention comprises a memory and one or more processors; the memory stores executable code, and when the one or more processors execute the executable code they implement the SQL-based multi-source heterogeneous data interaction analysis method of the above embodiment.
The embodiment of the SQL-based multi-source heterogeneous data interaction analysis apparatus can be applied to any device with data processing capability, such as a computer. The apparatus embodiment may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, as a logical apparatus it is formed by the processor of the device reading the corresponding computer program instructions from non-volatile storage into memory and running them. In terms of hardware, fig. 7 shows a hardware structure diagram of a device with data processing capability on which the SQL-based multi-source heterogeneous data interaction analysis apparatus is located; besides the processor, memory, network interface and non-volatile storage shown in fig. 7, the device may also include other hardware according to its actual function, which is not described again here.
The specific details of the implementation process of the functions and actions of each unit in the above device are the implementation processes of the corresponding steps in the above method, and are not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer-readable storage medium, wherein a program is stored on the computer-readable storage medium, and when the program is executed by a processor, the SQL-based multi-source heterogeneous data interaction analysis method in the embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. An SQL-based multi-source heterogeneous data interaction analysis engine, comprising an interactive console, a coordinator, executors and source drivers connected in sequence, characterized in that:
the interactive console is used for acquiring SQL data and returning the SQL execution result;
the coordinator is used for parsing the SQL data to generate a logical plan tree, the logical plan tree comprising a series of tasks with dependency relationships, splitting the logical plan tree into sub-plans that can be executed concurrently, each sub-plan comprising a group of tasks, and scheduling the sub-plans to the executors;
the executor acquires data-source data through the source driver according to the sub-plan, executes the tasks in the sub-plan, and feeds back the SQL execution result to the interactive console; the executor comprises an execution queue, a first judging module, a second judging module, a third judging module, a splitting module, an execution module, a merging module and a fourth judging module;
the execution queue queues the sub-plans acquired by the executor;
the first judging module takes the current task from the execution queue; if the current task is a remote-type task in the sub-plan, a remote channel is opened to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise the current task is passed to the second judging module;
the second judging module acquires and judges the current task; if the current task is the data reading task in the sub-plan, it is passed to the third judging module, otherwise it is passed to the splitting module;
the third judging module judges, for the current task, whether the corresponding source driver has been loaded; if so, the data-source data is read, otherwise the corresponding source driver is loaded;
the splitting module splits the current task according to the data processing scale and the resources of the executor the task runs on;
the execution module executes the split subtasks concurrently;
the merging module merges, rearranges and deduplicates the intermediate results produced by the subtasks to obtain the execution result of the current task, and feeds back the execution result;
the fourth judging module judges whether an unexecuted task remains in the execution queue; if so, the unexecuted task becomes the current task and enters the first judging module, otherwise task execution ends;
the source driver is used for establishing a connection between the executor and a data source and acquiring data of the data source.
2. The SQL-based multi-source heterogeneous data interaction analysis engine of claim 1, wherein: the interactive console receives and counts SQL data, performs multi-task execution and control, monitors the state of the coordinator and/or the executors, maintains and queries the data source list, and feeds back the SQL execution result.
3. The SQL-based multi-source heterogeneous data interaction analysis engine of claim 1, wherein: the coordinator comprises an SQL parsing module, a metadata interface, a task planning module, a task scheduling module and a service discovery module;
the SQL parsing module parses the SQL data to generate an abstract syntax tree, and acquires data source information through the metadata interface for verification;
the metadata interface is used for querying and verifying data source information;
the task planning module is used for converting the abstract syntax tree into the logical plan tree;
the task scheduling module splits the tasks in the logical plan tree and distributes the resulting sub-plans to a group of executors;
the service discovery module acquires the state of the executors and is used for allocating executors during scheduling.
4. The SQL-based multi-source heterogeneous data interaction analysis engine of claim 3, wherein: the service discovery module collects heartbeat and task load data from each executor, maintains the list of executors currently available for distribution, and allows the task scheduling module to select executors for sub-plan distribution according to the currently available executors and their load states.
5. The SQL-based multi-source heterogeneous data interaction analysis engine of claim 1, wherein: the tasks in the logical plan tree comprise a data reading task, a data filtering task, a data association task, a grouping and aggregation task, a sorting task, an extraction task and an output task;
the data reading task reads data from a data source;
the data filtering task filters the data read from the data source;
the data association task joins the data that has been read in parallel and filtered;
the grouping and aggregation task groups the associated data;
the sorting task sorts the grouped data;
the extraction task extracts the sorted data;
the output task outputs the extracted data.
6. The SQL-based multi-source heterogeneous data interaction analysis engine of claim 1, wherein: when splitting out the sub-plans, the coordinator adds remote-type exchange tasks, and when scheduling onto different executors a remote channel is established for data exchange between them.
7. An SQL-based multi-source heterogeneous data interaction analysis method, characterized by comprising the following steps:
step S1: acquiring SQL data;
step S2: parsing the SQL data to generate a logical plan tree, the logical plan tree comprising a series of tasks with dependency relationships, and splitting the logical plan tree into sub-plans that can be executed concurrently, each sub-plan comprising a group of tasks;
step S3: according to the sub-plan, establishing a connection with the data source, acquiring the data-source data, executing the tasks in the sub-plan and returning the SQL execution result, comprising the following steps:
step S31: acquiring a sub-plan and putting the tasks of the sub-plan into an execution queue;
step S32: taking the current task from the execution queue and initializing it;
step S33: judging whether the current task is a remote-type task; if so, opening a remote channel to transmit data, the remote-type tasks including the output task used to feed back the execution result; otherwise entering step S34;
step S34: judging whether the current task is the data reading task of the sub-plan; if so, judging whether the corresponding source driver has been loaded: if loaded, reading the data-source data directly; if not, loading the corresponding data source driver and then reading the data-source data; if the current task is not a data reading task, entering step S35;
step S35: splitting the current task according to the data processing scale and the resources of the executor the task runs on; executing the split subtasks concurrently; merging, rearranging and deduplicating the intermediate results of the subtasks to obtain the execution result of the current task and feeding it back; then judging whether an unexecuted task remains in the execution queue; if so, taking it as the current task and returning to step S32, otherwise ending task execution.
8. The SQL-based multi-source heterogeneous data interaction analysis method of claim 7, wherein: when the sub-plans are split out, remote-type exchange tasks are added, and when scheduling onto different executors a remote channel is established for data exchange between them.
CN202210677318.8A 2022-06-16 2022-06-16 Multi-source heterogeneous data interaction analysis engine and method based on SQL Active CN114756629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210677318.8A CN114756629B (en) 2022-06-16 2022-06-16 Multi-source heterogeneous data interaction analysis engine and method based on SQL

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210677318.8A CN114756629B (en) 2022-06-16 2022-06-16 Multi-source heterogeneous data interaction analysis engine and method based on SQL

Publications (2)

Publication Number Publication Date
CN114756629A CN114756629A (en) 2022-07-15
CN114756629B 2022-10-21

Family

ID=82336175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210677318.8A Active CN114756629B (en) 2022-06-16 2022-06-16 Multi-source heterogeneous data interaction analysis engine and method based on SQL

Country Status (1)

Country Link
CN (1) CN114756629B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117056316B (en) * 2023-10-10 2024-01-26 之江实验室 Multi-source heterogeneous data association query acceleration method, device and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609451A (en) * 2012-01-11 2012-07-25 华中科技大学 SQL (structured query language) query plan generation method oriented to streaming data processing
CN109284302A (en) * 2018-08-10 2019-01-29 新华三大数据技术有限公司 Data processing method and device
CN111625696A (en) * 2020-07-28 2020-09-04 北京升鑫网络科技有限公司 Distributed scheduling method, computing node and system of multi-source data analysis engine
CN112579625A (en) * 2020-09-28 2021-03-30 京信数据科技有限公司 Multi-source heterogeneous data treatment method and device
CN112579626A (en) * 2020-09-28 2021-03-30 京信数据科技有限公司 Construction method and device of multi-source heterogeneous SQL query engine
CN113032423A (en) * 2021-05-31 2021-06-25 北京谷数科技股份有限公司 Query method and system based on dynamic loading of multiple data engines

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7340452B2 (en) * 2003-12-16 2008-03-04 Oracle International Corporation Parallel single cursor model on multiple-server configurations
FR2872940B1 (en) * 2004-07-09 2010-07-30 Inst Curie HETEROGENEOUS DATABASE INTERROGATION SYSTEM AND INTERROGATION METHOD
KR20180035035A (en) * 2016-09-28 2018-04-05 한국전자통신연구원 Method and apparatus for optimizing query in data engine
CN107729366B (en) * 2017-09-08 2021-01-05 广东省建设信息中心 Universal multi-source heterogeneous large-scale data synchronization system
CN108363746B (en) * 2018-01-26 2022-07-26 福建星瑞格软件有限公司 Unified SQL query system supporting multi-source heterogeneous data
CN108920261B (en) * 2018-05-23 2020-03-24 中国航天系统科学与工程研究院 Two-stage adaptive scheduling method suitable for massive parallel data processing tasks
CN110263105B (en) * 2019-05-21 2021-09-10 北京百度网讯科技有限公司 Query processing method, query processing system, server, and computer-readable medium
CN112527876A (en) * 2020-12-08 2021-03-19 国网四川省电力公司信息通信公司 Unified database access system based on multi-source heterogeneous data analysis

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609451A (en) * 2012-01-11 2012-07-25 华中科技大学 SQL (structured query language) query plan generation method oriented to streaming data processing
CN109284302A (en) * 2018-08-10 2019-01-29 新华三大数据技术有限公司 Data processing method and device
CN111625696A (en) * 2020-07-28 2020-09-04 北京升鑫网络科技有限公司 Distributed scheduling method, computing node and system of multi-source data analysis engine
CN112579625A (en) * 2020-09-28 2021-03-30 京信数据科技有限公司 Multi-source heterogeneous data treatment method and device
CN112579626A (en) * 2020-09-28 2021-03-30 京信数据科技有限公司 Construction method and device of multi-source heterogeneous SQL query engine
CN113032423A (en) * 2021-05-31 2021-06-25 北京谷数科技股份有限公司 Query method and system based on dynamic loading of multiple data engines

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HSQL: A Highly Scalable Cloud Database for Multi-user Query Processing; Chao-Rui Chang et al.; 2012 IEEE Fifth International Conference on Cloud Computing; 2012-08-02; pp. 943-944 *
Impala fast query technology for big data from distribution network monitoring (配电网监测大数据的Impala快速查询技术); Qu Zhijian et al.; Journal of Electric Power Science and Technology (电力科学与技术学报); 2018-06-28 (No. 02); pp. 148-156 *

Also Published As

Publication number Publication date
CN114756629A (en) 2022-07-15

Similar Documents

Publication Publication Date Title
CN107239335B (en) Job scheduling system and method for distributed system
US20220171781A1 (en) System And Method For Analyzing Data Records
CN109933306B (en) Self-adaptive hybrid cloud computing framework generation method based on operation type recognition
US9639575B2 (en) Method and system for processing data queries
US8214356B1 (en) Apparatus for elastic database processing with heterogeneous data
US10769147B2 (en) Batch data query method and apparatus
CN106897322A (en) The access method and device of a kind of database and file system
JP5730386B2 (en) Computer system and parallel distributed processing method
CN113360554B (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN109918184A (en) Picture processing system, method and relevant apparatus and equipment
CN114756629B (en) Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN113297057A (en) Memory analysis method, device and system
CN112182031B (en) Data query method and device, storage medium and electronic device
Chen et al. Pisces: optimizing multi-job application execution in mapreduce
CN117056303B (en) Data storage method and device suitable for military operation big data
CN107818181A (en) Indexing means and its system based on Plcient interactive mode engines
CN109033196A (en) A kind of distributed data scheduling system and method
Davidson et al. Technical review of apache flink for big data
CN115168297A (en) Bypassing log auditing method and device
KR102605932B1 (en) Method for providing data processing service for structured data and non-structured data based on work space and server for performing the method
KR102605933B1 (en) Method for allocating work space on server based on instance feature and apparatus for performing the method
CN117390040B (en) Service request processing method, device and storage medium based on real-time wide table
CN113553320B (en) Data quality monitoring method and device
US11663216B2 (en) Delta database data provisioning
Monu et al. A Review on Storage and Large-Scale Processing of Data-Sets Using Map Reduce, YARN, SPARK, AVRO, MongoDB

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant