CN110955701B

CN110955701B - Distributed data query method, device and distributed system

Info

Publication number: CN110955701B
Application number: CN201911176025.6A
Authority: CN
Inventors: 杨华卫; 毕伟; 贾晓芸
Original assignee: Zhongsi Boan Technology Beijing Co ltd
Current assignee: Zhongsi Boan Technology Beijing Co ltd
Priority date: 2019-11-26
Filing date: 2019-11-26
Publication date: 2023-04-25
Anticipated expiration: 2039-11-26
Also published as: CN110955701A

Abstract

The invention provides a query method and device of distributed data and a distributed system, wherein the method comprises the following steps: after acquiring the query request, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node; generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan; and generating a query result corresponding to the query request according to the last data processing result fed back by the last target node. According to the query method, the query device and the query system for the distributed data, provided by the embodiment of the invention, the planning node generates the ordered query plan based on the processing parameters of the target nodes, heterogeneous attributes such as trust relationship, threat level, hardware support and the like can be fused in the heterogeneous environment, and each target node can process the data according to the security requirement or the hardware support and the like, so that the data sharing is transparent to the heterogeneous environment.

Description

Distributed data query method, device and distributed system

Technical Field

The invention relates to the technical field of distributed data, in particular to a query method and device of distributed data and a distributed system.

Background

Currently, there is an increasing demand for data sharing in the e-government, healthcare, financial and artificial intelligence industries; precision medicine, for example, requires sharing clinical, genetic, environmental and lifestyle data to better treat and prevent diseases. The data owners typically act as data sources that together form a distributed data system.

Query processing on distributed data sources has been widely studied, and various protocols have been designed based on different security settings and threat assumptions. Existing query processing algorithms and systems target homogeneous environments, i.e., they assume that the same protocol is used between all parties. In many practical scenarios, however, distributed data sharing is implemented in heterogeneous environments in which different protocols are used between different parties. The practical reasons for this security heterogeneity include the various trust relationships between data owners, the different threat levels along different communication channels and at different computing nodes, the degree of special hardware support available, etc.

If the security query processing technique designed for a homogeneous environment is used in a heterogeneous environment, the most stringent security requirements and/or the lowest available hardware support among data owners will have to be met, which will result in unnecessarily high computational expense.

Disclosure of Invention

In order to solve the above problems, an objective of an embodiment of the present invention is to provide a distributed data query method, a distributed data query device, and a distributed system.

In a first aspect, an embodiment of the present invention provides a method for querying distributed data, including:

after a query request is acquired, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node;

generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;

and generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.

In one possible implementation, the generating the ordered query plan according to the processing parameters of all the target nodes includes:

and allocating corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of the directed acyclic structure according to the dependency relationship among all the atomic operations.

In one possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes:

instructing the current atomic operation of the target node to acquire the preamble data set of each preamble atomic operation, wherein a preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the preamble data set is the data processing result determined after the preamble atomic operation performs data processing;

performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to the subsequent atomic operation when a subsequent atomic operation exists, wherein a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;

repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.

In one possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and generating a data processing result conforming to the security mode;

The generating the query result corresponding to the query request according to the last data processing result fed back by the last target node comprises the following steps: and carrying out security mode removal processing on the last data processing result which is fed back by the last target node and accords with the security mode, and taking the processing result after the security mode removal processing as a query result corresponding to the query request.

In one possible implementation, after the generating the ordered query plan according to the processing parameters of all the target nodes, the method further includes:

determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;

and distributing the query resources to the corresponding target nodes.

In a second aspect, an embodiment of the present invention further provides a distributed data query apparatus, including:

the preprocessing module is used for determining a plurality of target nodes corresponding to the query request after the query request is acquired, and determining processing parameters of each target node;

the query plan module is used for generating an ordered query plan according to the processing parameters of all the target nodes and indicating the target nodes to sequentially process data according to the query plan;

and the result generation module is used for generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.

In a third aspect, an embodiment of the present invention further provides a distributed system, including: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;

the planning node is used for determining a plurality of target nodes corresponding to the query request from the data owner nodes after the query request is acquired, and determining processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relations between the target nodes and other target nodes;

the planning node is also used for generating an ordered query plan according to the processing parameters of all the target nodes and sending the query plan to the target nodes;

the target node is used for carrying out data processing according to the query plan and sending the data processing result to other target nodes until the last target node sends the last data processing result to the planning node;

and the planning node generates a query result corresponding to the query request according to the final data processing result.

In one possible implementation, the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:

and the planning node distributes corresponding one or more atomic operations for each target node, determines the dependency relationship among all the atomic operations, and generates a query plan of the directed acyclic structure according to the dependency relationship among all the atomic operations.

In one possible implementation, the target node performing data processing according to the query plan includes:

the current atomic operation of the target node acquires a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;

In one possible implementation, the system further comprises a query node for initiating the query request;

after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;

the query node feeds back query resources matched with the query cost to the planning node; the planning node distributes the query resources to the corresponding target nodes after receiving the query resources.
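The cost and resource steps above can be pictured with a small sketch. The text does not say how the query cost is computed or what form the query resources take, so everything here is an assumption made for illustration: each atomic operation has a fixed integer cost, the query cost is the sum over the plan, and each target node receives resources equal to the cost of its own operations. The names `estimate_query_cost` and `distribute_resources` are hypothetical.

```python
def estimate_query_cost(assignments, cost_per_operation):
    """Return (per-node cost, total cost) for a plan.

    assignments: target node id -> list of atomic operation names (assumed layout).
    cost_per_operation: atomic operation name -> integer cost (assumed cost model).
    """
    per_node = {
        node: sum(cost_per_operation[op] for op in ops)
        for node, ops in assignments.items()
    }
    return per_node, sum(per_node.values())


def distribute_resources(resources, per_node_cost):
    """Split the resources fed back by the query node among the target nodes.

    Assumption: each node is given exactly the cost of its own operations.
    """
    total = sum(per_node_cost.values())
    if resources < total:
        raise ValueError("query resources do not cover the query cost")
    return dict(per_node_cost)


# Hypothetical plan: node A runs one operation, node B runs three.
assignments = {"A": ["a1"], "B": ["b1", "b2", "b3"]}
costs = {"a1": 2, "b1": 1, "b2": 1, "b3": 3}
per_node, total = estimate_query_cost(assignments, costs)
shares = distribute_resources(total, per_node)
```

In this sketch the planning node would send `total` to the query node and, once matching resources are returned, hand each target node its entry in `shares`.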

In the solution provided in the first aspect of the embodiment of the present invention, when a query is required, a planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, organizes the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes feed back a final data processing result after the query plan is executed; the planning node then returns the final data processing result to the query node, thereby completing the query process of distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computing tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained when the target nodes execute their computing tasks. Because the target nodes execute the computing tasks in the security mode, the data can be kept locally and data leakage can be avoided, the security of the target nodes is protected from threats at the data source, and private data can be shared safely and effectively.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a flowchart of a distributed data query method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a query plan with a directed acyclic structure in a distributed data query method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a distributed data query method according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a distributed data query device according to an embodiment of the present invention;

FIG. 5 shows a first architecture diagram of a distributed system provided by an embodiment of the present invention;

FIG. 6 shows a second structural schematic diagram of the distributed system according to the embodiment of the present invention.

Detailed Description

In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on those shown in the drawings; they are used merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation or be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

The distributed data query method provided by the embodiment of the invention is executed by a trusted planning node, which generates an ordered query plan so as to realize the data query. Referring to FIG. 1, the method includes:

Step 101: after the query request is acquired, a plurality of target nodes corresponding to the query request are determined, and processing parameters of each target node are determined.

In the embodiment of the invention, the data are stored in a distributed manner on some of the nodes of the distributed system; these nodes are the data owner nodes. When another node needs to query certain data, the corresponding data must be obtained from one or more data owner nodes. In this embodiment, a trusted planning node is selected as an intermediary to supervise the whole data query process; the planning node is trusted or auditable, and can be implemented with blockchain technology. After acquiring the query request, the planning node determines which data owner nodes store data matching the query request, and takes all the data owner nodes storing the corresponding data as target nodes.

In this embodiment, after determining the target nodes, the planning node needs to determine a processing parameter of each target node, where the processing parameter may specifically include a hardware parameter of the target node, and a trust relationship between the target node and other target nodes. The trust relationship between the target node and other nodes is used for representing whether data transmission is allowed between the two target nodes, the cost when the data is transmitted, and the like; the hardware parameter is a parameter related to device hardware at the target node, which may be used to represent the computing power of the target node, or to represent a computing priority, etc.
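As an illustration of the processing parameters described above, the following minimal sketch groups a hardware parameter and a trust relationship into one record per target node. The field names (`compute_power`, `has_secure_enclave`, `transfer_cost`) and the convention that a missing entry means "transfer not allowed" are assumptions made for this example; the text does not fix a concrete representation.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingParameters:
    """Hypothetical record of one target node's processing parameters."""

    node_id: str
    # Hardware parameters (assumed fields): relative computing power and
    # whether special hardware support such as a secure enclave is available.
    compute_power: float
    has_secure_enclave: bool
    # Trust relationship (assumed encoding): peer node id -> cost of sending
    # data to that peer. A peer that is absent may not receive data.
    transfer_cost: dict = field(default_factory=dict)

    def may_send_to(self, peer_id: str) -> bool:
        """True if data transmission to the peer is allowed."""
        return peer_id in self.transfer_cost


def collect_parameters(records):
    """Index the records by node id, as a planning node might do."""
    return {r.node_id: r for r in records}


params = collect_parameters([
    ProcessingParameters("A", 1.0, True, {"B": 2.0}),
    ProcessingParameters("B", 4.0, False, {}),
])
```

Here node A may send data to node B at cost 2.0, while node B is not allowed to send data to node A.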

In this embodiment, the query request may be generated autonomously by the planning node, or may be generated by another query node and sent to the planning node. In addition, it can be understood by those skilled in the art that the "query node", "planning node" and "target node" in this embodiment are nodes that perform different functions in the query process, and the terms do not restrict a given node to only one of these functions. For example, a data owner node that serves as a target node in the above process may serve as a query node performing a query operation in another query process, or, if it is trusted, as the planning node in another query process.

Step 102: generating an ordered query plan according to the processing parameters of all the target nodes, and instructing the target nodes to sequentially process data according to the query plan.

In the embodiment of the invention, after the planning node determines the processing parameters of all the target nodes, the planning node can generate a corresponding query plan. In this embodiment, the query plan is an execution plan corresponding to the query request constructed by the planning node, and based on the query plan, each target node can learn which data processing operations are needed by itself, that is, which computing tasks need to be executed; meanwhile, the query plan has sequential characteristics, namely the query plan is orderly, and a plurality of target nodes can sequentially execute calculation tasks based on the sequential characteristics of the query plan and finally obtain corresponding results, namely final data processing results. In this embodiment, the planning node is a planner of the query plan, and after the planning node issues the query plan to the target node, the target node performs a corresponding calculation task as an executor to obtain a corresponding data processing result.

The query plan in this embodiment may specify an execution sequence of the target nodes: after the current target node finishes its data processing, it sends the data processing result to the next target node in the sequence, so that the next target node can continue to execute its computing task based on the data processing result of the current target node, and so on, until the last target node generates the last data processing result. The "order" in the present embodiment is not limited to a linear, head-to-tail sequence; it may also be the order represented by a directed acyclic structure, etc.

Step 103: generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.

In the embodiment of the invention, all target nodes execute the ordered query plan in sequence; after the last target node processes its data, it generates a corresponding processing result, namely the last data processing result, and feeds it back to the planning node. In this embodiment, the final data processing result is the final result obtained after all the target nodes have executed the query plan, and it represents the corresponding query result. At this point, the planning node may send the final data processing result directly to the query node as the query result; alternatively, the planning node may process the final data processing result to generate a query result and send the generated query result to the query node. When the final data processing result generated by the target node cannot be provided to the query node directly, the trusted planning node processes it to generate data that can be provided to the query node.
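Steps 101 to 103 can be sketched for the simplest case, a linear sequence of target nodes. This is only an illustration under stated assumptions: nodes are plain dictionaries, the ordering uses a hypothetical numeric `priority` processing parameter, and each node's computing task is to merge its local values into the result received from the previous node. None of these choices comes from the text.

```python
def select_target_nodes(query_key, data_owner_nodes):
    """Step 101: keep the data owner nodes that store data matching the query."""
    return [n for n in data_owner_nodes if query_key in n["store"]]


def generate_ordered_plan(target_nodes):
    """Step 102 (planning): order the target nodes.

    Assumption: a lower hypothetical "priority" value means earlier execution.
    """
    return [n["id"] for n in sorted(target_nodes, key=lambda n: n["priority"])]


def execute_in_order(query_key, plan, nodes_by_id):
    """Step 102 (execution): each node processes the previous result and passes it on."""
    result = []
    for node_id in plan:
        local = nodes_by_id[node_id]["store"][query_key]
        # Assumed computing task: deduplicated merge of local data into the result.
        result = sorted(set(result) | set(local))
    return result


def handle_query(query_key, data_owner_nodes):
    """Planning-node view of steps 101 to 103."""
    targets = select_target_nodes(query_key, data_owner_nodes)
    plan = generate_ordered_plan(targets)
    nodes_by_id = {n["id"]: n for n in targets}
    last_result = execute_in_order(query_key, plan, nodes_by_id)
    return plan, last_result  # Step 103: the last result becomes the query result


nodes = [
    {"id": "A", "priority": 2, "store": {"patients": [3, 1]}},
    {"id": "B", "priority": 1, "store": {"patients": [2, 3]}},
    {"id": "C", "priority": 3, "store": {"other": [9]}},
]
plan, query_result = handle_query("patients", nodes)
```

Node C holds no matching data and is therefore not a target node; B runs before A because of its lower priority value.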

Optionally, in a distributed network, when data owner nodes share data there is a risk that the data is obtained by an illegitimate node and leaked, which may even cause economic loss and legal liability for the data owner nodes; security and privacy concerns therefore hinder distributed data sharing. In order to avoid data leakage during the data query process, the method provided in this embodiment performs data processing in a security mode, so that private data is shared safely and effectively. Specifically, the step 102 "instructing the target node to sequentially perform data processing according to the query plan" includes:

Step A1: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and generating a data processing result conforming to the security mode.

In the embodiment of the invention, the target nodes are data owner nodes. When a target node executes a computing task based on the query plan, it must perform the data processing in the security mode, so that the generated data processing result conforms to the security mode. The security mode may be a secure enclave, a Docker container, encryption, secure multiparty computation, or the like; that is, the target node may use a sandbox mechanism or the like to allocate independent computing power to a computing task, or may encrypt the data processing result, so that other target nodes can hardly obtain sensitive information from the data processing result of the local target node after receiving it; similarly, the local target node can hardly obtain sensitive information from the data processing results sent by other target nodes. In this embodiment, the target node is required to execute the query plan honestly, but it is not assumed to refrain from trying to steal the data of other nodes; that is, the target node is allowed to attempt to derive information from the data it obtains from other target nodes. The security mode ensures the security and privacy of the data during transmission between the target nodes.

Meanwhile, the last data processing result generated by the last target node also conforms to the security mode, so security mode removal processing is needed to ensure that the query node can read the last data processing result normally. Specifically, the step 103 of generating the query result corresponding to the query request according to the last data processing result fed back by the last target node includes:

Step A2: carrying out security mode removal processing on the last data processing result which is fed back by the last target node and conforms to the security mode, and taking the processing result after the security mode removal processing as the query result corresponding to the query request.

In the embodiment of the invention, the planning node can perform security mode removal processing on the final data processing result, thereby generating data that no longer depends on the security mode; this data can be used as the query result provided to the query node, so that the query node can read the data in the final data processing result normally. The security mode removal processing is the inverse of the processing applied by the target node when it performs data processing in the security mode; for example, if the target node performs data processing by means of encryption, the security mode removal processing is the corresponding decryption.
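The encryption example of steps A1 and A2 can be sketched as a pair of inverse functions. The cipher below is a toy stand-in chosen only so that the example runs with the standard library: it XORs the serialized result with a SHA-256 based keystream and is not a secure construction. The shared key and the function names `apply_security_mode` and `remove_security_mode` are assumptions; a real deployment would rely on a vetted cryptographic library, a secure enclave, or another of the security modes listed above.

```python
import hashlib
import json


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: concatenated SHA-256 digests of key + counter (not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def apply_security_mode(result, key: bytes) -> bytes:
    """Step A1 stand-in: a target node protects its data processing result."""
    plain = json.dumps(result).encode("utf-8")
    stream = _keystream(key, len(plain))
    return bytes(p ^ s for p, s in zip(plain, stream))


def remove_security_mode(protected: bytes, key: bytes):
    """Step A2 stand-in: the planning node applies the inverse processing."""
    stream = _keystream(key, len(protected))
    plain = bytes(c ^ s for c, s in zip(protected, stream))
    return json.loads(plain.decode("utf-8"))


# Hypothetical key assumed to be shared by the last target node and the planning node.
key = b"shared-secret"
protected_result = apply_security_mode([1, 2, 3], key)
query_result = remove_security_mode(protected_result, key)
```

The round trip returns the original result, which is the only property the sketch is meant to show: removal is the exact inverse of the processing applied in the security mode.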

When a query is needed, the planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, organizes the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes feed back the final data processing result after the query plan is executed; the planning node then feeds the final data processing result back to the query node, thereby completing the query process of the distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computing tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained when the target nodes execute their computing tasks. Because the target nodes execute the computing tasks in the security mode, the data can be kept locally and data leakage can be avoided, the security of the target nodes is protected from threats at the data source, and private data can be shared safely and effectively. In this embodiment, the secure computation mode in which data is kept locally helps ensure the security of the data and reduces the security threats to the data during transmission and at the transmission destination; at the same time, dispatching the algorithm is generally much faster than transferring the data, which makes the data processing faster.

On the basis of the embodiment, the planning node specifically generates a query plan of a directed acyclic structure. Specifically, the step 102 "generating an ordered query plan according to the processing parameters of all the target nodes" includes:

Step B1: allocating corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of the directed acyclic structure according to the dependency relationship among all the atomic operations.

In the embodiment of the invention, the query plan of the directed acyclic structure is generated with the atomic operation as the basic unit. An atomic operation is a basic operation in the query process, and may specifically be projection (project), selection (selection), natural join (natural join), deduplication (set difference), union (set union), renaming (renaming), and so on. A dependency between two atomic operations means that one atomic operation needs the data of the other atomic operation in order to perform its data processing. In this embodiment, the dependency relationship is directional: if atomic operation A depends on atomic operation B, then atomic operation B does not depend on atomic operation A. After the dependency relationships between all atomic operations are determined, a query plan of a directed acyclic structure can be generated. A schematic diagram of the structure of the query plan provided in this embodiment is shown in FIG. 2, which represents the query plan as a directed acyclic graph (DAG, Directed Acyclic Graph). In FIG. 2, each circle represents an atomic operation, the dependency between two atomic operations is represented by a directed edge, and each dashed box represents a target node. That is, FIG. 2 includes five target nodes A, B, C, D, E, which are allocated 1, 3, 5, 4, and 3 atomic operations respectively; for example, the target node B includes three atomic operations b1, b2, and b3. Meanwhile, the atomic operation a1 has a directed edge pointing to the atomic operation b3, so the atomic operation b3 depends on the atomic operation a1.
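Step B1 can be sketched with the standard-library `graphlib` module (Python 3.9 or later). The function `build_query_plan` and its input layout are assumptions made for illustration: atomic operations are plain strings, assignments map a target node to its operations, and each dependency is a pair (preamble operation, dependent operation). The example uses only the edges of FIG. 2 that the text names explicitly.

```python
from graphlib import CycleError, TopologicalSorter


def build_query_plan(assignments, dependencies):
    """Build a directed acyclic query plan from atomic operations.

    assignments: target node id -> list of atomic operation names.
    dependencies: iterable of (pred, succ) pairs, meaning succ depends on pred.
    Returns the owner of each operation, its preamble operations, and one
    valid execution order. Raises ValueError if the dependencies form a cycle.
    """
    owner = {op: node for node, ops in assignments.items() for op in ops}
    predecessors = {op: set() for op in owner}
    for pred, succ in dependencies:
        if pred not in owner or succ not in owner:
            raise ValueError(f"unknown atomic operation in edge {pred!r} -> {succ!r}")
        predecessors[succ].add(pred)
    try:
        # TopologicalSorter expects a mapping: node -> its predecessors.
        order = list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("dependencies are not acyclic") from exc
    return {"owner": owner, "predecessors": predecessors, "order": order}


# Edges named in the description of FIG. 2: a1, b1 and b2 point to b3.
plan = build_query_plan(
    {"A": ["a1"], "B": ["b1", "b2", "b3"]},
    [("a1", "b3"), ("b1", "b3"), ("b2", "b3")],
)
```

Rejecting cyclic input at planning time enforces the directional property of dependencies: an operation can never, directly or indirectly, depend on itself.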

Optionally, after determining the query plan of the directed acyclic structure, the planning node may instruct the target node to perform the computing task. Specifically, the step 102 of "instructing the target node to sequentially perform data processing according to the query plan" includes:

Step B2: the current atomic operation of the target node is instructed to acquire a preamble data set of the preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing.

Step B3: performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to the subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is other atomic operation with the dependency relation pointed by the current atomic operation.

Step B4: repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.

In the embodiment of the invention, the planning node can send the query plan to the target nodes, so that each target node knows which data processing it needs to perform, and the atomic operations perform data processing in sequence according to the query plan of the directed acyclic structure. In this embodiment, a target node that needs to perform data processing is taken as the current target node, and a current atomic operation of the current target node is determined; if the dependency relationship of another atomic operation points to the current atomic operation, that atomic operation is a preamble atomic operation of the current atomic operation. As in FIG. 2, atomic operation a1 points to atomic operation b3, i.e., atomic operation a1 has a dependency relationship that points to atomic operation b3, so atomic operation a1 is a preamble atomic operation of atomic operation b3; similarly, atomic operations b1 and b2 are also preamble atomic operations of atomic operation b3.

If the current atomic operation has no preamble atomic operation, it may be referred to as an initial atomic operation; atomic operations a1, b1, etc. in FIG. 2 are all initial atomic operations. In this embodiment, execution of the query plan may start from an initial atomic operation, that is, the current atomic operation is an initial atomic operation. Since there is no preamble atomic operation, the preamble data set is empty, and the initial atomic operation performs data processing only on locally stored data, the data processing being the processing defined by the initial atomic operation. For example, if the initial atomic operation a1 is a deduplication process, it performs deduplication on the locally stored data to obtain the corresponding processing result, and uses that result as the data set of the initial atomic operation a1, that is, the current data set.

If the current atomic operation is not an initial atomic operation, that is, preamble atomic operations exist, the current atomic operation acquires the data sets of all its preamble atomic operations, that is, the preamble data sets, and then performs the data processing corresponding to the current atomic operation on all the preamble data sets to generate the corresponding data processing result, that is, the current data set of the current atomic operation. Optionally, the current atomic operation may perform data processing only on the preamble data sets, or may process the preamble data sets together with the data stored at the local node.

In addition, if the dependency relationship of the current atomic operation points to another atomic operation, that other atomic operation is a subsequent atomic operation of the current atomic operation; as shown in fig. 2, atomic operation b3 is a subsequent atomic operation of atomic operation a1, and atomic operation c5 is a subsequent atomic operation of atomic operation b3. If the current atomic operation has a subsequent atomic operation, it sends the generated current data set to the subsequent atomic operation, so that the subsequent atomic operation can in turn be taken as the current atomic operation and steps B2 and B3 continue to be executed, until no subsequent atomic operation exists; the data set generated by the current atomic operation at that point can be used as the final data processing result. In fig. 2, atomic operation e3 is the last atomic operation of the directed acyclic structure, and the data set of atomic operation e3 is the final data processing result. Those skilled in the art will appreciate that fig. 2, which contains a single last atomic operation e3 (i.e., an atomic operation with no subsequent atomic operation), is only an example; in practical applications there may be multiple last atomic operations, i.e., multiple atomic operations without a subsequent atomic operation, in which case the data sets of all the last atomic operations are used as the final data processing result. In this embodiment, the end condition is that all atomic operations have been traversed: once every atomic operation has performed its data processing, every atomic operation has completed the computing task allocated to it, the query plan generated by the planning node has been fully executed, and the final result can be obtained.
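The traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function `execute_plan`, the helper operations `dedup`, `union` and `count`, the sample data, and the edge from b3 to e3 are all hypothetical and chosen only to mirror the a1, b1, b2 to b3 relationship described for fig. 2.

```python
from collections import deque


def execute_plan(ops, edges, local_data):
    """Run every atomic operation once, in dependency order.

    ops: name -> function(predecessor_data_sets, local_data_or_None)
    edges: list of (predecessor, successor) dependency pairs
    local_data: name -> locally stored data (optional per operation)
    Returns the data sets of all last atomic operations (no successor).
    """
    preds = {name: [] for name in ops}
    succs = {name: [] for name in ops}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # Initial atomic operations are those without any predecessor.
    remaining = {name: len(preds[name]) for name in ops}
    ready = deque(name for name in ops if remaining[name] == 0)

    datasets = {}
    while ready:
        current = ready.popleft()
        pred_sets = [datasets[p] for p in preds[current]]
        datasets[current] = ops[current](pred_sets, local_data.get(current))
        # Hand the current data set on to each subsequent atomic operation.
        for nxt in succs[current]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)

    # End condition: every atomic operation has been traversed.
    if len(datasets) != len(ops):
        raise ValueError("query plan is not a directed acyclic structure")
    return {name: datasets[name] for name in ops if not succs[name]}


# Hypothetical operations used only for this illustration.
def dedup(pred_sets, local):
    return sorted(set(local))


def union(pred_sets, local):
    merged = set(local or [])
    for data_set in pred_sets:
        merged.update(data_set)
    return sorted(merged)


def count(pred_sets, local):
    return sum(len(data_set) for data_set in pred_sets)


ops = {"a1": dedup, "b1": dedup, "b2": dedup, "b3": union, "e3": count}
edges = [("a1", "b3"), ("b1", "b3"), ("b2", "b3"), ("b3", "e3")]
local_data = {"a1": [3, 1, 3], "b1": [2, 2], "b2": [5], "b3": [1, 9]}

result = execute_plan(ops, edges, local_data)
print(result)  # {'e3': 5}
```

Returning a dictionary keyed by operation name covers the case of several last atomic operations, since each of them contributes its own data set to the final data processing result.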

In this embodiment, the atomic operation δ may be represented by a multi-tuple

$$\delta = \langle op_\delta,\ m_\delta,\ X_{\delta 1}, \ldots, X_{\delta m_\delta},\ D_j \rangle$$

where $op_\delta$ denotes the operation corresponding to the atomic operation δ, such as projection or deduplication; $m_\delta$ denotes the number of predecessor atomic operations of δ; $X_{\delta i}$ denotes the predecessor data set of the i-th predecessor atomic operation of δ; j indicates that δ belongs to the j-th target node, with $j \in [1, n]$ and n being the number of target nodes; and $D_j$ denotes the data stored locally at the j-th target node, $D_j$ being optional.

Specifically, referring to fig. 2, the atomic operation a1 performs its computing task first. Since a1 is an initial atomic operation and target node A is the first target node, the multi-tuple of a1 is $\langle op_{a1},\ 0,\ D_1 \rangle$. That is, a1 performs the corresponding processing on the locally stored data $D_1$, and the processing result is the data set $I_{a1}$ of atomic operation a1, i.e.

$$I_{a1} = op_{a1}(D_1)$$

where the function $op_\delta(X)$ denotes processing the data X according to the atomic operation δ, and $I_\delta$ denotes the data set of the atomic operation δ. After determining its data set $I_{a1}$, atomic operation a1 can send $I_{a1}$ to its subsequent atomic operation, i.e. to atomic operation b3. The initial atomic operations b1 and b2 are processed in a similar way, which is not repeated here. If the data processing is performed in the security mode, the function $op_\delta(X)$ means that the data X is processed in the security mode according to the atomic operation δ.

For atomic operation b3, target node B is taken as the second target node, i.e. j = 2, and b3 has three predecessor atomic operations, so its multi-tuple can be $\langle op_{b3},\ 3,\ X_{b3,1},\ X_{b3,2},\ X_{b3,3},\ D_2 \rangle$, where $X_{b3,1}$, $X_{b3,2}$ and $X_{b3,3}$ denote the predecessor data sets of the three predecessor atomic operations, i.e. the data sets $I_{a1}$, $I_{b1}$ and $I_{b2}$ of atomic operations a1, b1 and b2 respectively. Atomic operation b3 processes the corresponding predecessor data sets to generate its data set $I_{b3}$, and

$$I_{b3} = op_{b3}(X_{b3,1},\ X_{b3,2},\ X_{b3,3},\ D_2)$$

The multi-tuple takes the above form when atomic operation b3 needs to process the data $D_2$ stored at target node B; if atomic operation b3 does not need to process $D_2$, its multi-tuple can be $\langle op_{b3},\ 3,\ X_{b3,1},\ X_{b3,2},\ X_{b3,3} \rangle$, or $D_2$ in the above multi-tuple is assigned null.

Repeating the above process, the last atomic operation e3 determines its data set $I_{e3}$, and the data set $I_{e3}$ is the final data processing result.
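As a rough illustration of the multi-tuple representation, the following sketch models one atomic operation as a record holding its operation, its predecessor data sets, the index of its target node, and the optional locally stored data. The class name `AtomicOp`, its field names, and the helper operations `dedup` and `merge` are assumptions introduced only for this example; the sample values are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AtomicOp:
    op: Callable[..., list]            # the operation, e.g. deduplication
    predecessor_sets: List[list]       # one data set per predecessor
    node_index: int                    # j: index of the owning target node
    local_data: Optional[list] = None  # locally stored data, may be absent

    @property
    def m(self) -> int:
        """Number of predecessor atomic operations."""
        return len(self.predecessor_sets)

    def evaluate(self) -> list:
        """Apply the operation to the predecessor sets and, if present, local data."""
        inputs = list(self.predecessor_sets)
        if self.local_data is not None:
            inputs.append(self.local_data)
        return self.op(*inputs)


# Hypothetical operations for the illustration.
def dedup(data):
    return sorted(set(data))


def merge(*data_sets):
    return sorted(set().union(*data_sets))


# Initial atomic operation on the first target node: no predecessors.
a1 = AtomicOp(op=dedup, predecessor_sets=[], node_index=1, local_data=[3, 1, 3])
I_a1 = a1.evaluate()

# Atomic operation with three predecessors on the second target node;
# local_data is left as None to model the case where it is not needed.
b3 = AtomicOp(op=merge, predecessor_sets=[I_a1, [2], [5]], node_index=2)
I_b3 = b3.evaluate()

print(I_a1)  # [1, 3]
print(I_b3)  # [1, 2, 3, 5]
```

Leaving `local_data` as `None` corresponds to assigning the local data element a null value rather than dropping it from the tuple; either choice yields the same data set here.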

In the embodiment of the invention, the planning node constructs a distributed query plan by designing a directed acyclic graph, expresses the task of each target node through atomic operations at the relational algebra level and their dependency relationships, sends the atomic operations to the target nodes for data computation, and can optimize the plan globally. The method can uniformly express the data sharing modes among nodes in an environment with inconsistent security characteristics, thereby realizing data sharing in a heterogeneous environment; generating the user task as a global query plan also facilitates optimizing that plan, for example by minimizing data transmission between nodes or dispatching algorithms to more secure nodes for computation.

In the above embodiment, the query node needs to provide a certain resource to perform the query operation. Specifically, after the step 102 of generating the ordered query plan according to the processing parameters of all the target nodes, the method further includes:

Step C1: determine the corresponding query cost according to the query plan, and acquire query resources matching the query cost.

Step C2: allocate the query resources to the corresponding target nodes.

In the embodiment of the invention, after the planning node determines the query plan, it can calculate the cost for the target nodes to execute the query plan, namely the query cost, and request the corresponding query resource from the query node. If the query node provides the query resource to the planning node, the planning node proceeds to issue the query plan to the target nodes so that they execute the corresponding computing tasks; if the query node does not provide the query resource, the query process ends. In addition, after the query result is fed back to the query node, the planning node allocates the corresponding query resource to each target node. The query resource may be a fee-type resource, a points-type resource, or another resource capable of rewarding the target nodes.
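Steps C1 and C2 can be illustrated with the following minimal sketch. The cost model is an assumption made only for this example: each atomic operation is given an integer unit cost, the query cost is the sum over all operations, and each target node is allocated the resources matching the cost of its own operations. The function names `query_cost` and `settle` and the sample costs are hypothetical.

```python
def query_cost(plan_costs):
    """Step C1 (assumed cost model): total cost is the sum of per-operation costs."""
    return sum(sum(costs) for costs in plan_costs.values())


def settle(plan_costs, offered_resources):
    """Return the per-node allocation, or None if the query cannot proceed.

    plan_costs: target node -> list of integer costs, one per atomic operation
    offered_resources: amount of query resource provided by the query node
    """
    if offered_resources < query_cost(plan_costs):
        return None  # the query node did not provide enough; the query ends
    # Step C2 (assumed rule): each target node receives the cost of its own work.
    return {node: sum(costs) for node, costs in plan_costs.items()}


# Hypothetical per-operation costs for three target nodes.
plan_costs = {"A": [2], "B": [1, 1, 3], "E": [1]}

print(query_cost(plan_costs))  # 8
print(settle(plan_costs, 8))   # {'A': 2, 'B': 5, 'E': 1}
print(settle(plan_costs, 5))   # None
```

Other allocation rules, such as weighting by data volume or by the security level of a node, would fit the same two-step structure.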

Specifically, in this embodiment, the nodes in the distributed network are divided into three types, namely query nodes, planning nodes, and data-owning nodes, and the overall flow of the query process is shown in fig. 3, where the target nodes are data-owning nodes. In fig. 3, the target nodes are responsible for sharing data and executing computing tasks, the query node initiates data query or retrieval tasks, and the planning node constructs the tasks of the target nodes into a query plan and coordinates execution of the plan. A query initiated by the query node is executed by the target nodes under the coordination of the planning node; the result is returned to the planning node and then to the query node.

The above describes in detail the flow of the distributed data query method, which may also be implemented by a corresponding device, and the structure and function of the device are described in detail below.

The query device for distributed data provided by the embodiment of the invention can be specifically arranged in a query node. Referring to fig. 4, the query device includes:

a preprocessing module 41, configured to determine a plurality of target nodes corresponding to a query request after the query request is acquired, and determine a processing parameter of each target node;

a query plan module 42, configured to generate an ordered query plan according to processing parameters of all the target nodes, and instruct the target nodes to sequentially perform data processing according to the query plan;

and the result generating module 43 is configured to generate a query result corresponding to the query request according to the last data processing result fed back by the last target node.

On the basis of the above embodiment, the query plan module 42 generates an ordered query plan according to the processing parameters of all the target nodes, including:

On the basis of the above embodiment, the query plan module 42 instructs the target node to sequentially perform data processing according to the query plan includes:

On the basis of the above-described embodiments,

the query plan module 42 instructing the target nodes to sequentially perform data processing according to the query plan includes: instructing the target nodes to sequentially perform data processing according to the query plan in a preset security mode, and generating a data processing result conforming to the security mode;

the result generating module 43 generating a query result corresponding to the query request according to the final data processing result fed back by the last target node includes: performing security-mode removal processing on the final data processing result, conforming to the security mode, that is fed back by the last target node, and taking the result after the security-mode removal processing as the query result corresponding to the query request.
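The pairing of security-mode processing with security-mode removal can be illustrated by a toy example. The security mode itself is not specified in this passage; additive masking modulo a prime is used here purely as a stand-in technique, and the names `make_masks`, `masked_step`, `remove_masks`, the modulus and the sample values are all assumptions. In this sketch the trusted planning node issues one random mask per target node, each target node adds its masked local value to the running total it receives, and the planning node removes the masks from the final result.

```python
import secrets

MODULUS = 2**61 - 1  # a prime modulus, chosen arbitrarily for the sketch


def make_masks(nodes):
    """Planning node: draw one random mask per target node."""
    return {node: secrets.randbelow(MODULUS) for node in nodes}


def masked_step(running_total, local_value, mask):
    """Target node: add its masked local value to the incoming running total."""
    return (running_total + local_value + mask) % MODULUS


def remove_masks(masked_total, masks):
    """Planning node: strip all masks from the final masked result."""
    return (masked_total - sum(masks.values())) % MODULUS


# Hypothetical local values held by three target nodes.
local_values = {"A": 10, "B": 25, "C": 7}
masks = make_masks(local_values)

running = 0
for node in ["A", "B", "C"]:  # the ordered query plan
    running = masked_step(running, local_values[node], masks[node])

query_result = remove_masks(running, masks)
print(query_result)  # 42
```

Each intermediate total forwarded between target nodes is offset by masks that the receiving node does not know, while the planning node, which issued the masks, can recover the plain sum.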

On the basis of the embodiment, the device further comprises a resource allocation module;

after the query plan module 42 generates an ordered query plan based on the processing parameters of all the target nodes, the resource allocation module is configured to:

determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost; and distributing the query resources to the corresponding target nodes.

When the query device for distributed data provided by the embodiment of the invention performs a query, the planning node acts as a supervisor: during the query process, the planning node selects the corresponding data-owning nodes as target nodes, constructs the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes feed back the final data processing result after executing the query plan and the planning node feeds the result back to the query node, thereby completing the query of the distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computing tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained by having the target nodes execute those tasks. Because the target nodes execute the computing tasks in the security mode, the data can remain local while data leakage is avoided, threats to the security of the target nodes are prevented at the data source, and private data can be shared securely and effectively.

Based on the same inventive concept, an embodiment of the present invention further provides a distributed system, referring to fig. 5, including: a planning node 51 and a plurality of data-owning nodes 52, the planning node 51 being a trusted node; the data-owning nodes 52 may be directly or indirectly connected to the planning node 51.

The planning node 51 is configured to determine, after acquiring a query request, a plurality of target nodes corresponding to the query request from among the data-owning nodes 52, and determine processing parameters of the target nodes, where the processing parameters include hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;

the planning node 51 is further configured to generate an ordered query plan according to the processing parameters of all the target nodes, and send the query plan to the target nodes;

the target node is configured to perform data processing according to the query plan, and send a data processing result to other target nodes until the last target node sends a last data processing result to the planning node 51;

the planning node 51 generates a query result corresponding to the query request according to the final data processing result.

On the basis of the above embodiment, the generating, by the planning node 51, an ordered query plan according to the processing parameters of all the target nodes includes:

the planning node 51 allocates a corresponding one or more atomic operations to each of the target nodes, determines the dependency relationships between all the atomic operations, and generates a query plan of a directed acyclic structure according to the dependency relationships between all the atomic operations.
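The construction of a query plan with a directed acyclic structure can be sketched as follows, using the `graphlib` module of the Python standard library (Python 3.9 or later) to check acyclicity and derive one valid execution order. The function `build_query_plan`, its argument layout, and the sample assignment are assumptions made for illustration only; the actual planning logic, including how processing parameters influence the assignment, is not modeled.

```python
from graphlib import CycleError, TopologicalSorter


def build_query_plan(assignments, dependencies):
    """Build an ordered plan from per-node atomic operations and dependencies.

    assignments: target node -> list of atomic operation names
    dependencies: list of (predecessor, successor) pairs
    Returns a list of (operation, target node) in a valid execution order.
    """
    owner = {op: node for node, ops in assignments.items() for op in ops}

    # graphlib expects a mapping: operation -> set of its predecessors.
    graph = {op: set() for op in owner}
    for pred, succ in dependencies:
        if pred not in owner or succ not in owner:
            raise KeyError(f"unassigned atomic operation in ({pred}, {succ})")
        graph[succ].add(pred)

    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("dependencies do not form a directed acyclic structure") from exc
    return [(op, owner[op]) for op in order]


# Hypothetical assignment of atomic operations to three target nodes.
assignments = {"A": ["a1"], "B": ["b1", "b2", "b3"], "E": ["e3"]}
dependencies = [("a1", "b3"), ("b1", "b3"), ("b2", "b3"), ("b3", "e3")]

plan = build_query_plan(assignments, dependencies)
for op, node in plan:
    print(op, "on target node", node)
```

The relative order of independent operations such as a1, b1 and b2 is not fixed by the dependencies, so any order that places them before b3 is acceptable.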

On the basis of the above embodiment, the target node performing data processing according to the query plan includes:

On the basis of the above embodiment, referring to fig. 6, the distributed system further comprises a query node 52 for initiating the query request;

after the planning node 51 generates an ordered query plan according to the processing parameters of all the target nodes, the planning node 51 determines a corresponding query cost according to the query plan and sends the query cost to the query node 52;

the query node 52 feeds back query resources matching the query cost to the planning node 51; the planning node 51, after receiving the query resource, allocates the query resource to the corresponding target node.

The distributed system provided by the embodiment of the invention divides the nodes into three types, namely the query node 52, the planning node 51 and the data-owning nodes 52, and all or some of the data-owning nodes 52 in fig. 5 can serve as target nodes. In this embodiment, a given node may play a different role in different query tasks; for example, a data-owning node 52 may also act as a query node 52 and initiate a query. Furthermore, figs. 5 and 6 are merely schematic representations of the structure of the distributed system and are not intended to limit the architecture on which the distributed system must be based; for example, a query node may be indirectly connected to the planning node through other data-owning nodes and initiate a query. Likewise, figs. 5 and 6 show the communication connections permitted between the nodes, and do not indicate that the planning node must generate the query plan according to all the connections between the data-owning nodes shown in the figures. For a detailed description of the distributed system, refer to the embodiments corresponding to figs. 1 to 3; details are not repeated here.

The distributed system provided by this embodiment can generate a global ordered query plan based on the trusted planning node, allocate computing tasks to the corresponding target nodes, and obtain the corresponding query result by having the target nodes execute those tasks. Because the target nodes execute the computing tasks in the security mode, the data can remain local while data leakage is avoided, threats to the security of the target nodes are prevented at the data source, and private data can be shared securely and effectively.

The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for querying distributed data, comprising:

generating a query result corresponding to the query request according to the last data processing result fed back by the last target node;

wherein the generating an ordered query plan according to the processing parameters of all the target nodes comprises:

distributing corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations;

the instructing the target node to sequentially perform data processing according to the query plan includes:

2. The method of claim 1, wherein

the instructing of the target nodes to sequentially perform data processing according to the query plan includes: instructing the target nodes to sequentially perform data processing according to the query plan in a preset security mode, and generating a data processing result conforming to the security mode;

3. The method of claim 1, further comprising, after said generating an ordered query plan based on processing parameters of all of said target nodes:

and distributing the query resources to the corresponding target nodes.

4. A distributed data querying device, comprising:

the result generation module is used for generating a query result corresponding to the query request according to the last data processing result fed back by the last target node;

wherein the query plan module generating an ordered query plan according to the processing parameters of all the target nodes comprises:

The query plan module instructs the target node to sequentially perform data processing according to the query plan, including:

5. A distributed system, comprising: a planning node and a plurality of data-owning nodes, wherein the planning node is a trusted node;

the planning node generates a query result corresponding to the query request according to the final data processing result;

wherein the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:

the planning node distributes corresponding one or more atomic operations for each target node, determines the dependency relationship among all the atomic operations, and generates a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations;

The target node performing data processing according to the query plan comprises the following steps:

6. The distributed system of claim 5 further comprising a query node for initiating the query request;