CN110955701B - Distributed data query method, device and distributed system - Google Patents

Publication number
CN110955701B
Authority
CN
China
Prior art keywords
query
atomic operation
node
data processing
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911176025.6A
Other languages
Chinese (zh)
Other versions
CN110955701A (en)
Inventor
杨华卫
毕伟
贾晓芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongsi Boan Technology Beijing Co ltd
Original Assignee
Zhongsi Boan Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongsi Boan Technology Beijing Co ltd filed Critical Zhongsi Boan Technology Beijing Co ltd
Priority to CN201911176025.6A priority Critical patent/CN110955701B/en
Publication of CN110955701A publication Critical patent/CN110955701A/en
Application granted granted Critical
Publication of CN110955701B publication Critical patent/CN110955701B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a distributed data query method, a distributed data query device and a distributed system. The method comprises the following steps: after a query request is acquired, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node; generating an ordered query plan according to the processing parameters of all the target nodes, and instructing the target nodes to perform data processing in sequence according to the query plan; and generating a query result corresponding to the query request according to the final data processing result fed back by the last target node. In the method, device and system provided by the embodiments of the invention, the planning node generates the ordered query plan based on the processing parameters of the target nodes, so that heterogeneous attributes such as trust relationships, threat levels and hardware support can be combined in a heterogeneous environment. Each target node can process data according to its own security requirements or hardware support, so that data sharing is transparent to the heterogeneous environment.

Description

Distributed data query method, device and distributed system
Technical Field
The invention relates to the technical field of distributed data, in particular to a query method and device of distributed data and a distributed system.
Background
Currently, there is an increasing demand for data sharing in the e-government, healthcare, financial and artificial intelligence industries. In precision medicine, for example, clinical, genetic, environmental and lifestyle data are shared in order to better treat and prevent diseases. Data owners typically act as data sources and together form a distributed data system.
Query processing over distributed data sources has been widely studied, and various protocols have been designed for different security settings and threat assumptions. Existing query processing algorithms and systems target homogeneous environments, i.e., they assume that the same protocol is used between all parties. In many practical scenarios, however, distributed data sharing takes place in heterogeneous environments, where different protocols are used between different parties. In practice, security heterogeneity arises from the varying trust relationships between data owners, the different threat levels on different communication channels and computing nodes, the degree of special hardware support available, and so on.
If the security query processing technique designed for a homogeneous environment is used in a heterogeneous environment, the most stringent security requirements and/or the lowest available hardware support among data owners will have to be met, which will result in unnecessarily high computational expense.
Disclosure of Invention
In order to solve the above problems, an objective of an embodiment of the present invention is to provide a distributed data query method, a distributed data query device, and a distributed system.
In a first aspect, an embodiment of the present invention provides a method for querying distributed data, including:
after a query request is acquired, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;
and generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.
In one possible implementation, the generating the ordered query plan according to the processing parameters of all the target nodes includes:
and allocating corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of the directed acyclic structure according to the dependency relationship among all the atomic operations.
In one possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes:
instructing the current atomic operation of the target node to acquire a predecessor data set of a predecessor atomic operation, wherein the predecessor atomic operation is another atomic operation having a dependency relationship pointing to the current atomic operation, and the predecessor data set is the data processing result determined after the predecessor atomic operation performs data processing;
performing the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to the subsequent atomic operation, wherein the subsequent atomic operation is another atomic operation to which a dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
In one possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and to generate a data processing result conforming to the security mode;
the generating the query result corresponding to the query request according to the last data processing result fed back by the last target node includes: performing security mode removal processing on the last data processing result that is fed back by the last target node and conforms to the security mode, and taking the result of the security mode removal processing as the query result corresponding to the query request.
In one possible implementation, after the generating the ordered query plan according to the processing parameters of all the target nodes, the method further includes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and distributing the query resources to the corresponding target nodes.
In a second aspect, an embodiment of the present invention further provides a distributed data query apparatus, including:
the preprocessing module is used for determining a plurality of target nodes corresponding to the query request after the query request is acquired, and determining processing parameters of each target node;
the query plan module is used for generating an ordered query plan according to the processing parameters of all the target nodes and indicating the target nodes to sequentially process data according to the query plan;
and the result generation module is used for generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.
In a third aspect, an embodiment of the present invention further provides a distributed system, including: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is used for determining, after the query request is acquired, a plurality of target nodes corresponding to the query request from among the data owner nodes, and determining processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node is also used for generating an ordered query plan according to the processing parameters of all the target nodes and sending the query plan to the target nodes;
the target node is used for carrying out data processing according to the query plan and sending the data processing result to other target nodes until the last target node sends the last data processing result to the planning node;
and the planning node generates a query result corresponding to the query request according to the final data processing result.
In one possible implementation, the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:
and the planning node distributes corresponding one or more atomic operations for each target node, determines the dependency relationship among all the atomic operations, and generates a query plan of the directed acyclic structure according to the dependency relationship among all the atomic operations.
In one possible implementation, the target node performing data processing according to the query plan includes:
the current atomic operation of the target node acquires a predecessor data set of a predecessor atomic operation, wherein the predecessor atomic operation is another atomic operation having a dependency relationship pointing to the current atomic operation, and the predecessor data set is the data processing result determined after the predecessor atomic operation performs data processing;
performing the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to the subsequent atomic operation, wherein the subsequent atomic operation is another atomic operation to which a dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
In one possible implementation, the system further comprises a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
the query node feeds back query resources matched with the query cost to the planning node; the planning node distributes the query resources to the corresponding target nodes after receiving the query resources.
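The cost and resource exchange between the query node, the planning node and the target nodes can be sketched as follows. The text does not say how the query cost is computed or how resources are split, so the per-operation unit cost, the proportional split and the operation counts below are all assumptions made purely for illustration.

```python
# Hypothetical number of atomic operations assigned to each target node.
OPS_PER_NODE = {"A": 1, "B": 3, "C": 5, "D": 4, "E": 3}
UNIT_COST = 2  # hypothetical price of one atomic operation


def query_cost(ops_per_node, unit_cost):
    """Planning node: derive a query cost from the query plan (assumed formula)."""
    return unit_cost * sum(ops_per_node.values())


def distribute(resources, ops_per_node):
    """Planning node: split the received resources in proportion to operation counts."""
    total_ops = sum(ops_per_node.values())
    return {node: resources * n // total_ops for node, n in ops_per_node.items()}


cost = query_cost(OPS_PER_NODE, UNIT_COST)   # sent to the query node
shares = distribute(cost, OPS_PER_NODE)      # query resources passed on to target nodes
print(cost, shares)
```

Integer division is used here for simplicity; with counts that do not divide evenly, a real allocation would also have to decide what happens to the remainder.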
In the solution provided in the first aspect of the embodiment of the present invention, when a query is required, the planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, organizes the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes can feed back a final data processing result after the query plan has been executed; the planning node then returns the final data processing result to the query node, thereby completing the query process for the distributed data. In this embodiment, the planning node generates a global ordered query plan and allocates computing tasks to the corresponding target nodes, and the corresponding query result is obtained once the target nodes have executed their computing tasks. Because the target nodes execute the computing tasks in the security mode, the data can be kept locally and data leakage can be avoided, which removes the threat to the security of the target nodes at the data source and enables safe and effective sharing of private data.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a flowchart of a distributed data query method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a query plan with a directed acyclic structure in a distributed data query method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a distributed data query method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a distributed data query device according to an embodiment of the present invention;
FIG. 5 shows a first architecture diagram of a distributed system provided by an embodiment of the present invention;
FIG. 6 shows a second structural schematic diagram of the distributed system according to the embodiment of the present invention.
Detailed Description
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
The distributed data query method provided by the embodiment of the invention is executed by the trusted planning node, and the planning node can generate an ordered query plan so as to realize data query. Referring to fig. 1, the method includes:
Step 101: after the query request is acquired, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node.
In the embodiment of the invention, the data are stored in a distributed manner on some of the nodes of the distributed system, and these nodes are the data owner nodes; when another node needs to query certain data, the corresponding data must be obtained from one or more data owner nodes. In this embodiment, a trusted planning node is selected as an intermediary that supervises the whole data query process; the planning node is trusted or auditable, and can be implemented using blockchain technology. After the planning node acquires the query request, it determines which data owner nodes store data matching the query request, and takes all the data owner nodes that store the corresponding data as target nodes.
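As a rough illustration, the selection of target nodes can be sketched as a lookup over a catalog of what each data owner node stores. The catalog layout, the node names and the `select_target_nodes` helper are assumptions made for this example; the text does not specify how the planning node learns which nodes hold matching data.

```python
# Hypothetical catalog: data owner node -> names of the data sets it stores.
CATALOG = {
    "hospital_a": {"clinical", "genetic"},
    "lab_b": {"genetic"},
    "agency_c": {"environmental"},
}


def select_target_nodes(requested, catalog):
    """Return, in sorted order, the data owner nodes storing any requested data set."""
    wanted = set(requested)
    return sorted(node for node, stored in catalog.items() if stored & wanted)


targets = select_target_nodes(["genetic", "environmental"], CATALOG)
print(targets)
```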
In this embodiment, after determining the target nodes, the planning node needs to determine a processing parameter of each target node, where the processing parameter may specifically include a hardware parameter of the target node, and a trust relationship between the target node and other target nodes. The trust relationship between the target node and other nodes is used for representing whether data transmission is allowed between the two target nodes, the cost when the data is transmitted, and the like; the hardware parameter is a parameter related to device hardware at the target node, which may be used to represent the computing power of the target node, or to represent a computing priority, etc.
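One possible way to represent these processing parameters is sketched below. The field names, the relative compute score and the convention that a missing peer means "transmission not allowed" are assumptions for illustration only; the text names only the two categories, hardware parameters and trust relationships.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingParameters:
    """Hypothetical container for the per-node parameters described above."""
    node: str
    compute_power: float               # hardware parameter (assumed relative score)
    has_secure_enclave: bool = False   # hardware parameter (assumed flag)
    # Trust relationship: peer node -> transfer cost. A peer that is absent
    # from this mapping may not receive data from this node (assumed convention).
    transfer_cost: dict = field(default_factory=dict)

    def may_send_to(self, peer):
        """Whether data transmission from this node to `peer` is allowed."""
        return peer in self.transfer_cost


params_a = ProcessingParameters("A", compute_power=2.0, transfer_cost={"B": 1.5})
print(params_a.may_send_to("B"), params_a.may_send_to("C"))
```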
In this embodiment, the query request may be generated by the planning node itself, or may be generated by another query node and sent to the planning node. In addition, those skilled in the art will understand that the "query node", "planning node" and "target node" in this embodiment are roles performing different functions in the query process, and a given node is not limited to performing only one of these functions. For example, a data owner node that serves as a target node in the above process may serve as a query node in another query process, or, if it is trusted, as the planning node in another query process.
Step 102: generating an ordered query plan according to the processing parameters of all the target nodes, and instructing the target nodes to sequentially perform data processing according to the query plan.
In the embodiment of the invention, after the planning node determines the processing parameters of all the target nodes, the planning node can generate a corresponding query plan. In this embodiment, the query plan is an execution plan corresponding to the query request constructed by the planning node, and based on the query plan, each target node can learn which data processing operations are needed by itself, that is, which computing tasks need to be executed; meanwhile, the query plan has sequential characteristics, namely the query plan is orderly, and a plurality of target nodes can sequentially execute calculation tasks based on the sequential characteristics of the query plan and finally obtain corresponding results, namely final data processing results. In this embodiment, the planning node is a planner of the query plan, and after the planning node issues the query plan to the target node, the target node performs a corresponding calculation task as an executor to obtain a corresponding data processing result.
The query plan in this embodiment may specify an execution sequence of the target nodes, and after the current target node finishes data processing, the data processing result may be sent to the next-order target node, so that the next-order target node may continue to execute the calculation task based on the data processing result of the current target node, and so on, until the last target node generates the last data processing result. The "order" in the present embodiment is not limited to the order of the end-to-end, but may be the order represented by the directed acyclic structure, etc.
Step 103: generating a query result corresponding to the query request according to the last data processing result fed back by the last target node.
In the embodiment of the invention, all the target nodes execute the ordered query plan in sequence. After the last target node has processed its data, it generates the corresponding processing result, i.e., the final data processing result, and feeds it back to the planning node. The final data processing result is the result obtained after all the target nodes have executed the query plan, and it represents the corresponding query result. The planning node may send the final data processing result directly to the query node as the query result, or it may process the final data processing result to generate the query result and then send the generated query result to the query node. The latter applies when the final data processing result generated by the target node cannot be provided to the query node directly; the trusted planning node then processes it into data that can be provided to the query node.
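Steps 101 to 103 can be sketched end to end for the simplest case, a strictly linear plan in which each target node processes the result handed over by the previous one. The node names, the sample data and the representation of computing tasks as Python callables are assumptions for illustration; an actual plan may instead follow a directed acyclic structure.

```python
def run_query(plan, local_data, tasks):
    """Run an ordered plan: each target node processes the previous node's result."""
    result = None
    for node in plan:  # the plan fixes the execution order
        result = tasks[node](local_data[node], result)
    return result      # final data processing result, fed back to the planning node


def start(own, previous):
    """First node in the plan: there is no previous result to combine with."""
    return set(own)


def intersect(own, previous):
    """Later nodes: keep only the values also held locally."""
    return previous & set(own)


# Hypothetical query: which record IDs are known to all three data owner nodes?
local_data = {"A": [1, 2, 3, 4], "B": [2, 3, 4, 5], "C": [3, 4, 6]}
tasks = {"A": start, "B": intersect, "C": intersect}
plan = ["A", "B", "C"]

final = run_query(plan, local_data, tasks)
query_result = sorted(final)  # the planning node turns the final result into the query result
print(query_result)
```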
Optionally, when data owner nodes share data in the distributed network, there is a risk that the data are obtained by an illegitimate node and leaked, which may even expose the data owner nodes to economic loss and legal liability; security and privacy concerns therefore hinder distributed data sharing. To avoid data leakage during the data query process, the method provided in this embodiment performs data processing under a security mode, so that private data can be shared safely and effectively. Specifically, the step 102 of "instructing the target node to sequentially perform data processing according to the query plan" includes:
Step A1: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and to generate a data processing result conforming to the security mode.
In the embodiment of the invention, the target node is a data owner node, and when it executes a computing task based on the query plan, it must perform the data processing in the security mode, so that the generated data processing result conforms to the security mode. The security mode may be a secure enclave, a Docker container, encryption, secure multiparty computation, or the like. That is, the target node may use a sandbox mechanism or the like to allocate isolated computing resources to a computing task, or may encrypt the data processing result, so that other target nodes can hardly obtain sensitive information from the data processing result of the local target node after receiving it; similarly, the local target node can hardly obtain sensitive information from the data processing results sent by other target nodes. In this embodiment, the target node is required to execute the query plan honestly, but no assumption is made that it refrains from trying to steal the data of other nodes; that is, the target node is allowed to attempt to infer information from the data it obtains from other target nodes. The security mode ensures the security and privacy of data in transit between target nodes.
Meanwhile, the last data processing result generated by the last target node also conforms to the security mode, so the security mode must be removed to ensure that the query node can read the result normally. Specifically, the step 103 of generating the query result corresponding to the query request according to the last data processing result fed back by the last target node includes:
Step A2: performing security mode removal processing on the last data processing result that is fed back by the last target node and conforms to the security mode, and taking the result of the security mode removal processing as the query result corresponding to the query request.
In the embodiment of the invention, the planning node can perform security mode removal processing on the final data processing result to produce data that are no longer bound to the security mode; these data serve as the query result provided to the query node, so that the query node can read the data in the final data processing result normally. The security mode removal processing is the inverse of the processing that the target node applied under the security mode; for example, if the target node processes data by encryption, the security mode removal processing is the corresponding decryption.
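The relationship between applying and removing a security mode can be illustrated with the encryption example. The sketch below uses a toy XOR keystream built from SHA-256 purely as a stand-in; it is not a secure cipher and is not the scheme described in the text. The shared key, the JSON serialization and the function names are likewise assumptions.

```python
import hashlib
import json


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction, not secure)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def apply_security_mode(result, key: bytes) -> bytes:
    """Target node side: serialize and mask a data processing result."""
    plain = json.dumps(result).encode("utf-8")
    return bytes(p ^ k for p, k in zip(plain, _keystream(key, len(plain))))


def remove_security_mode(blob: bytes, key: bytes):
    """Planning node side: the inverse processing, recovering the readable result."""
    plain = bytes(c ^ k for c, k in zip(blob, _keystream(key, len(blob))))
    return json.loads(plain.decode("utf-8"))


shared_key = b"hypothetical-key"  # assumed to be known to the planning node
protected = apply_security_mode([3, 4], shared_key)
print(remove_security_mode(protected, shared_key))
```

The point of the sketch is only the round trip: whatever the target node applies, the planning node must be able to invert before handing the result to the query node.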
When a query is required, the planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, organizes the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes can feed back the final data processing result after the query plan has been executed; the planning node then feeds the final data processing result back to the query node, thereby completing the query process for the distributed data. In this embodiment, the planning node generates a global ordered query plan and allocates computing tasks to the corresponding target nodes, and the corresponding query result is obtained once the target nodes have executed their computing tasks. Because the target nodes execute the computing tasks in the security mode, the data can be kept locally and data leakage can be avoided, which removes the threat to the security of the target nodes at the data source and enables safe and effective sharing of private data. Keeping data local in this way helps to protect the data, reducing the security threats that data face in transit and at the transmission destination; at the same time, dispatching an algorithm is generally much faster than transferring data, which makes data processing faster.
On the basis of the embodiment, the planning node specifically generates a query plan of a directed acyclic structure. Specifically, the step 102 "generating an ordered query plan according to the processing parameters of all the target nodes" includes:
Step B1: allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to those dependency relationships.
In the embodiment of the invention, the query plan with a directed acyclic structure is built from atomic operations as its basic units. An atomic operation is a basic operation in the query process, such as projection (project), selection (selection), natural join (natural join), deduplication (set difference), union (set union), renaming (renaming), and so on. A dependency relationship between two atomic operations means that one atomic operation needs the data produced by the other in order to perform its data processing. In this embodiment, the dependency relationship is directional: if atomic operation A depends on atomic operation B, then atomic operation B does not depend on atomic operation A. After the dependency relationships between all atomic operations have been determined, a query plan with a directed acyclic structure can be generated. FIG. 2 shows a schematic diagram of such a query plan, represented as a directed acyclic graph (DAG). In FIG. 2, each circle represents an atomic operation, each directed edge represents a dependency relationship between two atomic operations, and each dashed box represents a target node. FIG. 2 thus contains five target nodes A, B, C, D and E, which are allocated 1, 3, 5, 4 and 3 atomic operations respectively; for example, target node B contains the three atomic operations b1, b2 and b3. The atomic operation a1 has a directed edge pointing to the atomic operation b3, which means that the atomic operation b3 depends on the atomic operation a1.
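A minimal sketch of building such a plan uses Python's standard `graphlib` module (Python 3.9 or later). Only the edges into b3 that the description names are modeled; the remaining edges of FIG. 2 are not given in the text and are omitted. Treating b2 as having no predecessors and the `build_plan` helper are assumptions of this example.

```python
from graphlib import CycleError, TopologicalSorter

# Atomic operation -> target node that runs it (names follow FIG. 2).
OWNER = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}

# Atomic operation -> set of operations it depends on.
# b2 is assumed to have no predecessors of its own.
DEPENDS_ON = {"a1": set(), "b1": set(), "b2": set(), "b3": {"a1", "b1", "b2"}}


def build_plan(depends_on):
    """Return one valid execution order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError(f"dependencies are not acyclic: {exc.args[1]}") from exc


order = build_plan(DEPENDS_ON)
print([(op, OWNER[op]) for op in order])
```

Because dependencies are directional, a pair of operations that depend on each other forms a cycle, and `build_plan` rejects it instead of returning an order.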
Optionally, after determining the query plan of the directed acyclic structure, the planning node may instruct the target node to perform the computing task. Specifically, the step 102 of "instructing the target node to sequentially perform data processing according to the query plan" includes:
Step B2: instructing the current atomic operation of the target node to acquire the predecessor data set of a predecessor atomic operation, wherein the predecessor atomic operation is another atomic operation having a dependency relationship pointing to the current atomic operation, and the predecessor data set is the data processing result determined after the predecessor atomic operation performs data processing.
Step B3: performing the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to the subsequent atomic operation, wherein the subsequent atomic operation is another atomic operation to which a dependency relationship of the current atomic operation points.
Step B4: repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
In the embodiment of the invention, the planning node can send the query plan to the target nodes so that each target node knows which data processing it needs to perform, and the atomic operations perform their data processing in turn according to the query plan of the directed acyclic structure. In this embodiment, a target node that needs to perform data processing is taken as the current target node, and the current atomic operation of the current target node is determined; if the dependency relationship of an atomic operation points to the current atomic operation, that atomic operation is a preamble atomic operation of the current atomic operation. As in fig. 2, the atomic operation a1 points to the atomic operation b3, i.e., a1 has a dependency relationship that points to b3, so a1 is a preamble atomic operation of b3; similarly, the atomic operations b1 and b2 are also preamble atomic operations of b3.
If the current atomic operation has no preamble atomic operation, it may be referred to as an initial atomic operation; the atomic operations a1, b1, etc. in fig. 2 are all initial atomic operations. In this embodiment, execution of the query plan may start from the initial atomic operations, i.e., the current atomic operation is an initial atomic operation. Since there is no preamble atomic operation, the preamble data set is empty, and the initial atomic operation performs its own data processing only on locally stored data. For example, if the initial atomic operation a1 is a deduplication process, a1 deduplicates the locally stored data and takes the result as the data set of a1, i.e., the current data set.
If the current atomic operation is not an initial atomic operation, i.e., it has preamble atomic operations, the current atomic operation acquires the data sets of all its preamble atomic operations (the preamble data sets) and performs its data processing on all of them to generate the corresponding data processing result, which is the current data set, i.e., the data set of the current atomic operation. Optionally, the current atomic operation may process only the preamble data sets, or may process the preamble data sets together with the data stored on the local node.
In addition, if the dependency relationship of the current atomic operation points to another atomic operation, that other atomic operation is a subsequent atomic operation of the current atomic operation. As in fig. 2, the atomic operation b3 is a subsequent atomic operation of the atomic operation a1, and the atomic operation c5 is a subsequent atomic operation of the atomic operation b3. If the current atomic operation has a subsequent atomic operation, it sends the generated current data set to the subsequent atomic operation, which then acts as the current atomic operation and continues to execute steps B2 and B3. This continues until an atomic operation has no subsequent atomic operation, at which point the data set it generates can be taken as the final data processing result. In fig. 2, the atomic operation e3 is the last atomic operation of the directed acyclic structure, and the data set of e3 is the final data processing result. Those skilled in the art will appreciate that fig. 2 contains a single last atomic operation e3 (i.e., an atomic operation with no subsequent atomic operation) only as an example; in practical applications there may be several last atomic operations, in which case the data sets of all of them are taken together as the final data processing result. In this embodiment, the end condition is that all atomic operations have been traversed: when every atomic operation has performed its data processing, every allocated computing task has been executed, the query plan generated by the planning node has been fully executed, and the final result can be obtained.
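Steps B2 to B4 can be illustrated with a small single-process sketch. It is a simplification under stated assumptions: the function `run_plan`, the example operations `dedup` and `union`, and the sample data are all hypothetical, and the "sending" of a data set between target nodes is modelled as a dictionary lookup rather than network transfer.

```python
from graphlib import TopologicalSorter


def run_plan(depends_on, operations, local_data):
    """Execute atomic operations in dependency order (steps B2 to B4).

    depends_on: operation name -> list of preamble operation names
    operations: operation name -> function(preamble_sets, local_rows)
    local_data: operation name -> rows stored on the node owning the operation
    Returns the data sets of every operation without a subsequent operation.
    """
    data_sets = {}
    has_successor = set()
    for op in TopologicalSorter(depends_on).static_order():
        preamble_names = depends_on.get(op, [])
        # Step B2: collect the preamble data sets (empty for initial operations).
        preamble_sets = [data_sets[p] for p in preamble_names]
        has_successor.update(preamble_names)
        # Step B3: process them, optionally together with locally stored rows.
        data_sets[op] = operations[op](preamble_sets, local_data.get(op, []))
    # Step B4: operations that nobody depends on hold the final result(s).
    return {op: ds for op, ds in data_sets.items() if op not in has_successor}


def dedup(preamble_sets, local_rows):
    """Initial operation: deduplicate locally stored rows."""
    return sorted(set(local_rows))


def union(preamble_sets, local_rows):
    """Merge all preamble data sets with the local rows."""
    return sorted(set(local_rows).union(*preamble_sets))


plan = {"a1": [], "b1": [], "b2": [], "b3": ["a1", "b1", "b2"]}
ops = {"a1": dedup, "b1": dedup, "b2": dedup, "b3": union}
local = {"a1": [3, 1, 3], "b1": [2, 2], "b2": [4], "b3": [5]}
final = run_plan(plan, ops, local)
print(final)
```

Returning a dictionary rather than a single data set covers the case noted above in which several atomic operations have no subsequent atomic operation.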
In this embodiment, an atomic operation $\delta$ may be represented by a tuple $(op_\delta, m_\delta, X_{\delta,1}, \ldots, X_{\delta,m_\delta}, D_j)$, where $op_\delta$ represents the operation processing corresponding to the atomic operation $\delta$, such as projection, deduplication, etc.; $m_\delta$ represents the number of preamble atomic operations of $\delta$; $X_{\delta,i}$ represents the preamble data set of the $i$-th preamble atomic operation of $\delta$; $j$ indicates that $\delta$ belongs to the $j$-th target node, with $j \in [1, n]$ and $n$ the number of target nodes; and $D_j$ represents the data stored locally at the $j$-th target node. $D_j$ is not mandatory.
Specifically, referring to fig. 2, the atomic operation a1 first performs its computing task. Since a1 is an initial atomic operation and the target node A is the first target node ($j = 1$), the tuple of a1 is $(op_{a1}, 0, D_1)$, where $D_1$ is the data stored locally at target node A. That is, a1 processes $D_1$ according to $op_{a1}$, and the result of the processing is the data set $I_{a1}$ of a1, i.e. $I_{a1} = op_{a1}(D_1)$, where the function $op_\delta(X)$ represents processing the data $X$ according to the atomic operation $\delta$, and $I_\delta$ represents the data set of the atomic operation $\delta$. After determining its data set $I_{a1}$, a1 can send $I_{a1}$ to its subsequent atomic operation, i.e., to the atomic operation b3. The other initial atomic operations, such as b1 and b2, are processed in a similar way, which is not repeated here. If the data processing is performed in the security mode, the function $op_\delta(X)$ means that the data $X$ is processed according to the atomic operation $\delta$ in the security mode.
For the atomic operation b3, the target node B is the second target node, i.e. $j = 2$, and b3 has three preamble atomic operations, so its tuple may be $(op_{b3}, 3, X_{b3,1}, X_{b3,2}, X_{b3,3}, D_2)$, where $X_{b3,1}$, $X_{b3,2}$ and $X_{b3,3}$ represent the preamble data sets of the three preamble atomic operations, i.e. the data sets $I_{a1}$, $I_{b1}$ and $I_{b2}$ of the atomic operations a1, b1 and b2 respectively. The atomic operation b3 performs data processing on the corresponding preamble data sets to generate its data set $I_{b3}$, with $I_{b3} = op_{b3}(X_{b3,1}, X_{b3,2}, X_{b3,3}, D_2)$. The tuple takes the above form when b3 needs to process the data $D_2$ stored at target node B; if b3 does not need to process $D_2$, its tuple may be $(op_{b3}, 3, X_{b3,1}, X_{b3,2}, X_{b3,3})$, or $D_2$ in the tuple may be set to null.
Repeating the above process, the last atomic operation e3 can determine its data set $I_{e3}$, and the data set $I_{e3}$ is the final data processing result.
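The tuple representation and the worked a1/b3 example can be mirrored in a short sketch. The class name `AtomicOpTuple`, its field names, the `union` operation and the sample values are assumptions made for illustration; the symbol names in the comments (`op`, `m`, `X`, `D_j`, `I`) follow the description, with the node-local data written as `D_j` by assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AtomicOpTuple:
    """Hypothetical container for the tuple describing one atomic operation."""
    op: Callable[..., frozenset]           # op: processing applied by the operation
    preamble: Sequence[frozenset] = ()     # X_1 ... X_m: preamble data sets
    local: Optional[frozenset] = None      # D_j: node-local data, may be absent

    @property
    def m(self) -> int:
        """Number of preamble atomic operations."""
        return len(self.preamble)

    def evaluate(self) -> frozenset:
        """Compute the data set I = op(X_1, ..., X_m[, D_j])."""
        args = list(self.preamble)
        if self.local is not None:
            args.append(self.local)
        return self.op(*args)


def union(*sets):
    """Example operation: set union of all inputs."""
    return frozenset().union(*sets)


# Initial operations on nodes A and B: no preamble data sets, only local data.
I_a1 = AtomicOpTuple(op=union, local=frozenset({1, 3})).evaluate()
I_b1 = AtomicOpTuple(op=union, local=frozenset({2})).evaluate()
I_b2 = AtomicOpTuple(op=union, local=frozenset({4})).evaluate()

# b3 on node B (j = 2): three preamble data sets plus local data.
b3 = AtomicOpTuple(op=union, preamble=(I_a1, I_b1, I_b2), local=frozenset({5}))
I_b3 = b3.evaluate()

# Variant in which b3 does not need the node-local data.
b3_no_local = AtomicOpTuple(op=union, preamble=(I_a1, I_b1, I_b2))
print(sorted(I_b3), sorted(b3_no_local.evaluate()))
```

Making the local data optional reflects the two forms of the b3 tuple described above: with the node-local data, or with that element omitted or set to null.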
In the embodiment of the invention, the planning node constructs a distributed query plan in the form of a directed acyclic graph, expresses the task of each target node through atomic operations at the relational-algebra level and the dependency relationships between them, and sends the atomic operations to the target nodes for computation, which allows the plan to be optimized globally. The method can express, in a uniform way, how data is shared among nodes in an environment with inconsistent security characteristics, and thus realizes data sharing in a heterogeneous environment. Generating a global query plan from the user task also facilitates optimization of that plan, for example minimizing data transmission between nodes or assigning computations to more secure nodes.
In the above embodiment, the query node needs to provide a certain resource to perform the query operation. Specifically, after the step 102 of generating the ordered query plan according to the processing parameters of all the target nodes, the method further includes:
Step C1: determine the corresponding query cost according to the query plan, and acquire query resources matching the query cost.
Step C2: allocate the query resources to the corresponding target nodes.
In the embodiment of the invention, after the planning node determines the query plan, it can calculate the cost for the target nodes to execute the query plan, i.e. the query cost, and request the corresponding query resource from the query node. If the query node provides the query resource to the planning node, the planning node goes on to issue the query plan to the target nodes so that they execute the corresponding computing tasks; if the query node does not provide the query resource, the query process ends. In addition, after the query result is fed back to the query node, the planning node allocates the corresponding query resource to each target node. The query resource may be a fee-type resource, a points-type resource, or any other resource capable of rewarding the target nodes.
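Steps C1 and C2 might be sketched as follows. The description does not specify how the query cost is computed or how the resource is divided, so both the flat per-operation cost model and the proportional split are assumptions chosen only to make the flow concrete; the function names `query_cost`, `allocate` and `settle` are likewise hypothetical.

```python
def query_cost(ops_per_node, unit_cost=1):
    """Assumed cost model: a flat cost for every atomic operation in the plan."""
    return sum(ops_per_node.values()) * unit_cost


def allocate(ops_per_node, resource):
    """Assumed split: integer shares proportional to each node's operation count.

    Units lost to rounding go to the nodes with the most operations first.
    """
    total = sum(ops_per_node.values())
    shares = {node: resource * n // total for node, n in ops_per_node.items()}
    leftover = resource - sum(shares.values())
    by_load = sorted(ops_per_node, key=lambda node: (-ops_per_node[node], node))
    for node in by_load[:leftover]:
        shares[node] += 1
    return shares


def settle(ops_per_node, offered):
    """Return per-node shares, or None when the query node offers too little."""
    cost = query_cost(ops_per_node)
    if offered < cost:
        return None  # no matching resource: the query process ends
    return allocate(ops_per_node, cost)


# Operation counts of the five target nodes as given for fig. 2.
fig2 = {"A": 1, "B": 3, "C": 5, "D": 4, "E": 3}
print(query_cost(fig2))
print(settle(fig2, offered=16))
print(allocate(fig2, 100))
```

In a real deployment the cost would presumably also reflect data volume, transfer between nodes and the security mode in use; the sketch only shows where such a model would plug in.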
Specifically, in this embodiment, the nodes in the distributed network are divided into three types: query nodes, planning nodes and data owner nodes. The overall flow of the query process is shown in fig. 3, where the target nodes are data owner nodes. In fig. 3, the target nodes are responsible for sharing data and executing computing tasks, the query node initiates data query or retrieval tasks, and the planning node constructs the tasks of the target nodes into a query plan and coordinates execution of the plan. A query initiated by the query node is executed by the target nodes under the coordination of the planning node; the result is returned to the planning node and then to the query node.
The above describes in detail the flow of the distributed data query method, which may also be implemented by a corresponding device, and the structure and function of the device are described in detail below.
The query device for distributed data provided by the embodiment of the invention can be specifically arranged in a query node. Referring to fig. 4, the query device includes:
a preprocessing module 41, configured to determine a plurality of target nodes corresponding to a query request after the query request is acquired, and determine a processing parameter of each target node;
A query plan module 42, configured to generate an ordered query plan according to processing parameters of all the target nodes, and instruct the target nodes to sequentially perform data processing according to the query plan;
and the result generating module 43 is configured to generate a query result corresponding to the query request according to the last data processing result fed back by the last target node.
On the basis of the above embodiment, the query plan module 42 generates an ordered query plan according to the processing parameters of all the target nodes, including:
allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan of the directed acyclic structure according to the dependency relationships among all the atomic operations.
On the basis of the above embodiment, the query plan module 42 instructing the target node to sequentially perform data processing according to the query plan includes:
instructing the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the preamble data set is the data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above embodiment,
the query plan module 42 instructing the target node to sequentially perform data processing according to the query plan includes: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and to generate a data processing result conforming to the security mode;
the result generating module 43 generating a query result corresponding to the query request according to the last data processing result fed back by the last target node includes: performing security mode removal processing on the last data processing result that is fed back by the last target node and conforms to the security mode, and taking the result of the security mode removal processing as the query result corresponding to the query request.
On the basis of the embodiment, the device further comprises a resource allocation module;
after the query plan module 42 generates an ordered query plan based on the processing parameters of all the target nodes, the resource allocation module is configured to:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost; and distributing the query resources to the corresponding target nodes.
When a query is to be performed with the distributed data query device, the planning node acts as a supervisor: it selects the data owner nodes involved in the query as target nodes, constructs the computing tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan. After the target nodes have executed the query plan, the final data processing result is fed back to the planning node, which in turn feeds it back to the query node, completing the distributed data query. In this embodiment, a global ordered query plan is generated by the planning node, computing tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained by having the target nodes execute those tasks. The target nodes execute the computing tasks in the security mode, so the data can remain local while data leakage is avoided; this prevents the security of a target node from being threatened at the data source and enables safe and effective sharing of private data.
Based on the same inventive concept, an embodiment of the present invention further provides a distributed system which, referring to fig. 5, includes a planning node 51 and a plurality of data owner nodes 52, the planning node 51 being a trusted node; the data owner nodes 52 may be directly or indirectly connected to the planning node 51.
The planning node 51 is configured to determine, after acquiring a query request, a plurality of target nodes corresponding to the query request from among the data owner nodes 52, and to determine processing parameters of the target nodes, where the processing parameters include hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node 51 is further configured to generate an ordered query plan according to the processing parameters of all the target nodes, and send the query plan to the target nodes;
the target node is configured to perform data processing according to the query plan, and send a data processing result to other target nodes until the last target node sends a last data processing result to the planning node 51;
the planning node 51 generates a query result corresponding to the query request according to the final data processing result.
On the basis of the above embodiment, the generating, by the planning node 51, an ordered query plan according to the processing parameters of all the target nodes includes:
the planning node 51 allocates a corresponding one or more atomic operations to each of the target nodes, determines the dependency relationships between all the atomic operations, and generates a query plan of a directed acyclic structure according to the dependency relationships between all the atomic operations.
On the basis of the above embodiment, the target node performing data processing according to the query plan includes:
the current atomic operation of the target node acquires a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above embodiment, referring to fig. 6, the distributed system further comprises a query node 52 for initiating the query request;
after the planning node 51 generates an ordered query plan according to the processing parameters of all the target nodes, the planning node 51 determines a corresponding query cost according to the query plan and sends the query cost to the query node 52;
the query node 52 feeds back query resources matching the query cost to the planning node 51; the planning node 51, after receiving the query resource, allocates the query resource to the corresponding target node.
The distributed system provided by the embodiment of the invention divides the nodes into three types, namely the query node 52, the planning node 51 and the data owner nodes 52, and all or part of the data owner nodes 52 in fig. 5 can be used as target nodes. In this embodiment, a node may act as a different type of node in different query tasks; for example, a data owner node 52 may also act as a query node 52 and initiate a query. Furthermore, fig. 5 and fig. 6 merely illustrate the structure of the distributed system schematically and do not limit the architecture on which it must be based; for example, a query node may be connected to the planning node indirectly through other data owner nodes and initiate a query that way. Likewise, fig. 5 and fig. 6 show the communication connections that are permitted between the nodes, and do not mean that the planning node must generate the query plan according to all the connections between the data owner nodes shown in the figures. For a detailed description of the distributed system, refer to the embodiments corresponding to fig. 1 to fig. 3; it is not repeated here.
The distributed system provided by this embodiment can generate a global ordered query plan by means of the trusted planning node, allocate computing tasks to the corresponding target nodes, and obtain the corresponding query result by having the target nodes execute those tasks. The target nodes execute the computing tasks in the security mode, so the data can remain local while data leakage is avoided; this prevents the security of a target node from being threatened at the data source and enables safe and effective sharing of private data.
The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for querying distributed data, comprising:
after a query request is acquired, determining a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;
generating a query result corresponding to the query request according to the last data processing result fed back by the last target node;
wherein the generating an ordered query plan according to the processing parameters of all the target nodes comprises:
distributing corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations;
the instructing the target node to sequentially perform data processing according to the query plan includes:
indicating the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
2. The method of claim 1, wherein
the instructing the target node to sequentially perform data processing according to the query plan includes: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, and to generate a data processing result conforming to the security mode;
the generating the query result corresponding to the query request according to the last data processing result fed back by the last target node includes: performing security mode removal processing on the last data processing result that is fed back by the last target node and conforms to the security mode, and taking the result of the security mode removal processing as the query result corresponding to the query request.
3. The method of claim 1, further comprising, after said generating an ordered query plan based on processing parameters of all of said target nodes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and distributing the query resources to the corresponding target nodes.
4. A distributed data querying device, comprising:
the preprocessing module is used for determining a plurality of target nodes corresponding to the query request after the query request is acquired, and determining processing parameters of each target node;
the query plan module is used for generating an ordered query plan according to the processing parameters of all the target nodes and indicating the target nodes to sequentially process data according to the query plan;
the result generation module is used for generating a query result corresponding to the query request according to the last data processing result fed back by the last target node;
wherein the query plan module generating an ordered query plan according to the processing parameters of all the target nodes comprises:
distributing corresponding one or more atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations;
The query plan module instructs the target node to sequentially perform data processing according to the query plan, including:
indicating the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
5. A distributed system, comprising: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is used for determining, after a query request is acquired, a plurality of target nodes corresponding to the query request from among the data owner nodes, and determining processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node is also used for generating an ordered query plan according to the processing parameters of all the target nodes and sending the query plan to the target nodes;
the target node is used for carrying out data processing according to the query plan and sending the data processing result to other target nodes until the last target node sends the last data processing result to the planning node;
the planning node generates a query result corresponding to the query request according to the final data processing result;
wherein the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:
the planning node distributes corresponding one or more atomic operations for each target node, determines the dependency relationship among all the atomic operations, and generates a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations;
The target node performing data processing according to the query plan comprises the following steps:
the current atomic operation of the target node acquires a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operation with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
6. The distributed system of claim 5 further comprising a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
The query node feeds back query resources matched with the query cost to the planning node; the planning node distributes the query resources to the corresponding target nodes after receiving the query resources.
CN201911176025.6A 2019-11-26 2019-11-26 Distributed data query method, device and distributed system Active CN110955701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911176025.6A CN110955701B (en) 2019-11-26 2019-11-26 Distributed data query method, device and distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911176025.6A CN110955701B (en) 2019-11-26 2019-11-26 Distributed data query method, device and distributed system

Publications (2)

Publication Number Publication Date
CN110955701A CN110955701A (en) 2020-04-03
CN110955701B true CN110955701B (en) 2023-04-25

Family

ID=69977076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911176025.6A Active CN110955701B (en) 2019-11-26 2019-11-26 Distributed data query method, device and distributed system

Country Status (1)

Country Link
CN (1) CN110955701B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021254288A1 (en) * 2020-06-14 2021-12-23 Wenfei Fan Querying shared data with security heterogeneity
CN114518850B (en) * 2022-02-23 2024-03-12 云链网科技(广东)有限公司 Secure deduplication storage system based on trusted execution protection, comprising deduplication and encryption

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7984043B1 (en) * 2007-07-24 2011-07-19 Amazon Technologies, Inc. System and method for distributed query processing using configuration-independent query plans
CN104063486A (en) * 2014-07-03 2014-09-24 四川中亚联邦科技有限公司 Big data distributed storage method and system
CN105404690A (en) * 2015-12-16 2016-03-16 华为技术服务有限公司 Database querying method and apparatus
CN105608077A (en) * 2014-10-27 2016-05-25 青岛金讯网络工程有限公司 Big data distributed storage method and system
CN107301205A (en) * 2017-06-01 2017-10-27 华南理工大学 A kind of distributed Query method in real time of big data and system
CN110263105A (en) * 2019-05-21 2019-09-20 北京百度网讯科技有限公司 Inquiry processing method, query processing system, server and computer-readable medium

Also Published As

Publication number Publication date
CN110955701A (en) 2020-04-03

Similar Documents

Publication Publication Date Title
CN110955701B (en) Distributed data query method, device and distributed system
US8635607B2 (en) Cloud-based build service
US10621002B2 (en) Iterative task centric resource scheduling for a user program between different computing frameworks
Pérez et al. A Newton-based heuristic algorithm for multi-objective flexible job-shop scheduling problem
US10579435B2 (en) Executing a foreign program on a parallel computing system
Zheng et al. K-swaps: Cooperative negotiation for solving task-allocation problems
Konur et al. Military system of systems architecting with individual system contracts
Cao et al. Querying shared data with security heterogeneity
CN104008200B (en) Lock the treating method and apparatus of application
CN117591299A (en) Peer-to-peer distributed computing system for heterogeneous device types
Nagarajan et al. An algorithm for cooperative task allocation in scalable, constrained multiple robot systems
US20230003753A1 (en) Systems and methods for managing experimental requests at remote laboratories
Zheng et al. Generalized reaction functions for solving complex-task allocation problems
CN114356511A (en) Task allocation method and system
Chronopoulos et al. A distributed discrete-time neural network architecture for pattern allocation and control
Frasheri et al. Analysis of perceived helpfulness in adaptive autonomous agent populations
Fazal et al. Task allocation in multi-robot system using resource sharing with dynamic threshold approach
KR101577265B1 (en) Apparatus for resource management and operation method of the same
Sergienko et al. Kernel technology to solve discrete optimization problems
CN110955726B (en) Method and device for determining distributed cost, storage medium and electronic equipment
Semenov Merging variables: one technique of search in pseudo-Boolean optimization
WO2021254288A1 (en) Querying shared data with security heterogeneity
US20210166229A1 (en) Method for carrying out transactions
Khanna et al. Scheduling file transfers for data-intensive jobs on heterogeneous clusters
Georgiou et al. Algorithmics of Wireless Networks: 19th International Symposium, ALGOWIN 2023, Amsterdam, The Netherlands, September 7–8, 2023, Revised Selected Papers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant