CN110955701A - Distributed data query method and device and distributed system - Google Patents

Publication number
CN110955701A
Authority
CN
China
Prior art keywords
query
node
atomic operation
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911176025.6A
Other languages
Chinese (zh)
Other versions
CN110955701B (en)
Inventor
杨华卫
毕伟
贾晓芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongsi Boan Technology Beijing Co ltd
Original Assignee
Zhongsi Boan Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongsi Boan Technology Beijing Co ltd
Priority to CN201911176025.6A
Publication of CN110955701A
Application granted
Publication of CN110955701B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a query method, a query device and a distributed system of distributed data, wherein the method comprises the following steps: after the query request is obtained, determining a plurality of target nodes corresponding to the query request, and determining a processing parameter of each target node; generating an ordered query plan according to the processing parameters of all target nodes, and indicating the target nodes to sequentially process data according to the query plan; and generating a query result corresponding to the query request according to the final data processing result fed back by the final target node. By the distributed data query method, the distributed data query device and the distributed system, the planning node generates the ordered query plan based on the processing parameters of the target nodes, the heterogeneous attributes such as trust relationship, threat level and hardware support can be fused in the heterogeneous environment, and each target node can process data according to the security requirement or hardware support of the target node, so that the data sharing is transparent to the heterogeneous environment.

Description

Distributed data query method and device and distributed system
Technical Field
The invention relates to the technical field of distributed data, in particular to a distributed data query method, a distributed data query device and a distributed system.
Background
Demand for data sharing is growing in industries such as e-government, healthcare, finance and artificial intelligence. In precision medicine, for example, clinical, genetic, environmental and lifestyle data need to be shared to better treat and prevent disease. Data owners typically act as data sources that together form a distributed data system.
Query processing on distributed data sources has been extensively studied; and, various protocols have been designed based on different security settings and threat assumptions. There currently exist query processing algorithms and systems in a homogeneous environment, i.e. the same protocol is assumed to be used between the parties. In many practical scenarios, distributed data sharing is often implemented in heterogeneous environments, and different protocols are used among the parties. The real reasons for security heterogeneity are various trust relationships between data owners, different threat levels along different communication channels and different computing nodes, the degree of special hardware support available, etc.
If the secure query processing techniques designed for homogeneous environments are used in heterogeneous environments, the most stringent security requirements in the data owner and/or the lowest available hardware support will have to be met, which will result in unnecessarily high computational costs.
Disclosure of Invention
In order to solve the foregoing problems, embodiments of the present invention provide a method, an apparatus, and a distributed system for querying distributed data.
In a first aspect, an embodiment of the present invention provides a method for querying distributed data, including:
after a query request is obtained, determining a plurality of target nodes corresponding to the query request, and determining a processing parameter of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;
and generating a query result corresponding to the query request according to the final data processing result fed back by the final target node.
In one possible implementation, the generating an ordered query plan according to the processing parameters of all the target nodes includes:
and distributing one or more corresponding atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations.
In a possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes:
instructing the current atomic operation of the target node to acquire the predecessor data sets of its predecessor atomic operations, wherein a predecessor atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a predecessor data set is the data processing result produced by a predecessor atomic operation;
performing the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to each subsequent atomic operation when one exists, wherein a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
In a possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes: indicating the target node to sequentially perform data processing according to the query plan in a preset safety mode to generate a data processing result conforming to the safety mode;
the generating a query result corresponding to the query request according to the last data processing result fed back by the target node includes: and performing security mode removal processing on the last data processing result which is fed back by the target node and accords with the security mode, and taking the processing result after security mode removal processing as the query result corresponding to the query request.
In one possible implementation, after the generating an ordered query plan according to the processing parameters of all the target nodes, the method further includes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and allocating the query resources to the corresponding target nodes.
In a second aspect, an embodiment of the present invention further provides a device for querying distributed data, including:
a preprocessing module, used for determining, after a query request is obtained, a plurality of target nodes corresponding to the query request, and determining processing parameters of each target node;
a query plan module, used for generating an ordered query plan according to the processing parameters of all the target nodes, and instructing the target nodes to perform data processing in sequence according to the query plan;
and a result generation module, used for generating a query result corresponding to the query request according to the final data processing result fed back by the final target node.
In a third aspect, an embodiment of the present invention further provides a distributed system, including: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is used for determining, after a query request is obtained, a plurality of target nodes corresponding to the query request from among the data owner nodes, and determining processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node is also used for generating an ordered query plan according to the processing parameters of all the target nodes and sending the query plan to the target nodes;
the target node is used for processing data according to the query plan and sending data processing results to other target nodes until the last target node sends the last data processing result to the planning node;
and the planning node generates a query result corresponding to the query request according to the final data processing result.
In a possible implementation manner, the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:
and the planning node allocates one or more corresponding atomic operations to each target node, determines the dependency relationship among all the atomic operations, and generates a query plan with a directed acyclic structure according to the dependency relationship among all the atomic operations.
In one possible implementation manner, the data processing performed by the target node according to the query plan includes:
the current atomic operation of the target node acquires the predecessor data sets of its predecessor atomic operations, wherein a predecessor atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a predecessor data set is the data processing result produced by a predecessor atomic operation;
the data processing corresponding to the current atomic operation is performed on all the predecessor data sets of the current atomic operation, the corresponding data processing result is taken as the current data set of the current atomic operation, and the current data set is sent to each subsequent atomic operation when one exists, wherein a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and the above process is repeated until all the atomic operations are traversed, and the data set of the last atomic operation is taken as the final data processing result.
In a possible implementation, the system further includes a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
the query node feeds back query resources matched with the query cost to the planning node; and after receiving the query resources, the planning node allocates the query resources to the corresponding target nodes.
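As a non-limiting illustration, the cost and resource steps above can be sketched in Python. The source does not specify a cost model, so everything here is an assumption made for the example: the per-operation costs, the names `plan_cost` and `allocate_resources`, and the rule of splitting resources in proportion to each target node's share of the cost.

```python
from typing import Dict

def plan_cost(op_cost: Dict[str, float]) -> float:
    """Total query cost, assumed here to be the sum of per-operation costs."""
    return sum(op_cost.values())

def allocate_resources(resources: float,
                       op_cost: Dict[str, float],
                       owner: Dict[str, str]) -> Dict[str, float]:
    """Split the query resources among target nodes (assumed rule: pro rata by cost)."""
    per_node: Dict[str, float] = {}
    for op, cost in op_cost.items():
        node = owner[op]  # target node that executes this operation
        per_node[node] = per_node.get(node, 0.0) + cost
    total = sum(per_node.values())
    if total == 0:
        return {}
    return {node: resources * cost / total for node, cost in per_node.items()}

# Hypothetical plan: four operations spread over two target nodes.
op_cost = {"op1": 2.0, "op2": 1.0, "op3": 1.0, "op4": 4.0}
owner = {"op1": "A", "op2": "B", "op3": "B", "op4": "B"}

cost = plan_cost(op_cost)                          # planning node -> query node
shares = allocate_resources(cost, op_cost, owner)  # planning node -> target nodes
```

In this sketch the query node is assumed to return resources exactly equal to the quoted cost; other matching rules would fit the description equally well.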
In the solution provided by the first aspect of the embodiments of the present invention, when a query is required, a planning node acts as a supervisor: it selects the corresponding data owner nodes as target nodes of the query process, builds the computation tasks of the target nodes into an ordered query plan and coordinates execution of the query plan, so that the target nodes feed back a final data processing result after the query plan is executed, and the planning node returns the final data processing result to the query node, thereby completing the query process of distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computation tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained when the target nodes execute those tasks. Because the query plan is generated based on the processing parameters of the target nodes, heterogeneous attributes such as trust relationship, threat level and hardware support can be fused in a heterogeneous environment, and each target node can perform data processing according to its own security requirements, hardware support and the like, so that data sharing is transparent to the heterogeneous environment. The target nodes execute the computation tasks in the security mode, so the data can be kept locally and data leakage avoided, the security of the data sources is not threatened, and private data can be shared safely and effectively.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 shows a flowchart of a distributed data query method provided by an embodiment of the present invention;
Fig. 2 shows a schematic diagram of a query plan with a directed acyclic structure in the distributed data query method provided by an embodiment of the present invention;
Fig. 3 shows an overall flowchart of the distributed data query method provided by an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a distributed data query device provided by an embodiment of the present invention;
Fig. 5 shows a first schematic structural diagram of a distributed system provided by an embodiment of the present invention;
Fig. 6 shows a second schematic structural diagram of the distributed system provided by an embodiment of the present invention.
Detailed Description
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The distributed data query method provided by the embodiment of the invention is executed by the trusted planning node, and the planning node can generate an ordered query plan so as to realize data query. Referring to fig. 1, the method includes:
Step 101: after the query request is obtained, determine a plurality of target nodes corresponding to the query request, and determine processing parameters of each target node.
In the embodiment of the invention, data is stored in a distributed manner on some nodes of a distributed system; these nodes are data owner nodes. When another node needs to query certain data, the corresponding data must be obtained from one or more data owner nodes. In this embodiment, a trusted planning node is selected as an intermediary to supervise the whole data query process; the planning node is trusted or auditable, and may be implemented using blockchain technology. After obtaining the query request, the planning node determines which data owner nodes store data matching the query request, and takes those data owner nodes as the target nodes.
In this embodiment, after determining the target nodes, the planning node needs to determine processing parameters of each target node, where the processing parameters may specifically include hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes. The trust relationship between the target node and other nodes is used for indicating whether data transmission is allowed between the two target nodes, the cost when the data is transmitted and the like; the hardware parameter is a parameter related to device hardware at the target node, and may be used to indicate the computing capability of the target node, or indicate the computing priority, etc.
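As a non-limiting illustration, the processing parameters described above could be represented as follows. The class and field names (`ProcessingParameters`, `TrustRelation`, `compute_capability`, `has_secure_enclave`) are hypothetical and chosen for this sketch; the source only states that the parameters cover hardware parameters and trust relationships.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TrustRelation:
    """Hypothetical record of what one target node permits toward a peer."""
    transfer_allowed: bool  # whether data may be sent to the peer
    transfer_cost: float    # illustrative cost of sending data to the peer

@dataclass
class ProcessingParameters:
    """Hypothetical container for one target node's processing parameters."""
    node_id: str
    compute_capability: float         # hardware parameter (illustrative scale)
    has_secure_enclave: bool = False  # example of special hardware support
    trust: Dict[str, TrustRelation] = field(default_factory=dict)

    def can_send_to(self, peer_id: str) -> bool:
        """A transfer is allowed only if a trust relation exists and permits it."""
        relation = self.trust.get(peer_id)
        return relation is not None and relation.transfer_allowed

params_a = ProcessingParameters(
    node_id="A",
    compute_capability=2.0,
    has_secure_enclave=True,
    trust={"B": TrustRelation(True, 1.5), "C": TrustRelation(False, 0.0)},
)
```

Treating a missing trust entry as "transfer not allowed" is a design choice of this sketch, not something the source prescribes.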
In this embodiment, the query request may be generated autonomously by the planning node, or may be generated by another query node and sent to the planning node. In addition, those skilled in the art can understand that the "query node", "planning node", and "target node" in this embodiment are roles that nodes perform in the query process; a given node is not limited to performing only one of these roles. For example, a data owner node serving as a target node in the above process may also serve as a query node that initiates queries in other query processes, or, if it is trusted, as a planning node in other query processes.
Step 102: generate an ordered query plan according to the processing parameters of all the target nodes, and instruct the target nodes to perform data processing in sequence according to the query plan.
In the embodiment of the invention, after the planning node determines the processing parameters of all the target nodes, the corresponding query plan can be generated. In this embodiment, the query plan is an execution plan constructed by the planning node for the query request; based on the query plan, each target node knows which data processing operations it needs to perform, that is, which computation tasks it needs to execute. The query plan is also ordered, so the target nodes can execute their computation tasks in sequence and finally obtain the corresponding result, namely the final data processing result. In this embodiment, the planning node is the planner of the query plan; after the planning node issues the query plan to the target nodes, each target node acts as an executor and executes its computation task to obtain the corresponding data processing result.
The query plan in this embodiment may specify an execution sequence of the target nodes, and after the data processing of the current target node is completed, the data processing result may be sent to the next-ranked target node, so that the next-ranked target node may continue to execute the calculation task based on the data processing result of the current target node, and this is repeated until the last target node generates the last data processing result. The "order" in the present embodiment is not limited to the order in which they are arranged in order from the beginning to the end, and may be the order expressed by a directed acyclic structure, or the like.
Step 103: generate a query result corresponding to the query request according to the final data processing result fed back by the final target node.
In the embodiment of the invention, all the target nodes sequentially execute the ordered query plan, and after the last target node performs data processing, a corresponding processing result, namely a final data processing result, can be generated, and the final target node can feed back the final data processing result to the planning node. In this embodiment, the final data processing result is a final result obtained after all the target nodes execute the query plan, and the final data processing result may represent a corresponding query result. At this time, the planning node can directly send the final data processing result as a corresponding query result to the query node; or, the planning node may process the final data processing result to generate a query result, and send the generated query result to the query node. And when the final data processing result generated by the target node cannot be directly provided for the query node, the trusted planning node performs processing to generate data which can be provided for the query node.
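As a non-limiting illustration, steps 101 to 103 can be sketched for the simplest case of a linear order. The names `TargetNode` and `run_query`, and the use of a `priority` mapping as a stand-in for ordering derived from processing parameters, are assumptions made for this example.

```python
from typing import Callable, Dict, List, Set

Rows = List[dict]

class TargetNode:
    """Hypothetical target node holding some tables and one computation task."""
    def __init__(self, node_id: str, tables: Set[str],
                 task: Callable[[Rows], Rows]) -> None:
        self.node_id = node_id
        self.tables = tables  # names of the data sets stored on this node
        self.task = task      # processing applied to the incoming result

def run_query(requested: str, nodes: List[TargetNode],
              priority: Dict[str, int]) -> Rows:
    # Step 101: target nodes are the nodes that store matching data.
    targets = [n for n in nodes if requested in n.tables]
    # Step 102: build an ordered plan (here simply sorted by an assumed priority)
    # and let each node process the previous node's result in turn.
    plan = sorted(targets, key=lambda n: priority[n.node_id])
    data: Rows = []
    for node in plan:
        data = node.task(data)
    # Step 103: the final node's result becomes the query result.
    return data

nodes = [
    TargetNode("B", {"patients"}, lambda rows: [r for r in rows if r["age"] >= 40]),
    TargetNode("A", {"patients"},
               lambda rows: rows + [{"id": 1, "age": 35}, {"id": 2, "age": 52}]),
    TargetNode("C", {"billing"}, lambda rows: rows),
]
result = run_query("patients", nodes, {"A": 0, "B": 1, "C": 2})
```

Node C is skipped because it holds no matching data; node A contributes rows and node B filters them.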
Optionally, when data owner nodes share data in a distributed network, the data may be obtained by an illegitimate node, causing data leakage and even economic loss or legal liability for the data owner nodes; security and privacy concerns are therefore a primary obstacle to distributed data sharing. To avoid data leakage during the query process, the method provided by this embodiment performs data processing in a security mode, so that private data is shared safely and effectively. Specifically, the step 102 of "instructing the target node to sequentially perform data processing according to the query plan" includes:
Step A1: instruct the target nodes to perform data processing in sequence according to the query plan in a preset security mode, generating data processing results that conform to the security mode.
In the embodiment of the invention, the target nodes are all data owner nodes, and when a target node executes a computation task based on the query plan, the data processing needs to be carried out in the security mode, so that the generated data processing result conforms to the security mode. The security mode may specifically be a secure enclave, a Docker container, encryption, secure multi-party computation, and the like. That is, the target node may allocate isolated computing capacity to the computation task using a sandbox mechanism or the like, or encrypt the data processing result, so that it is difficult for other target nodes to obtain sensitive information from the data processing result of the local target node after receiving it; similarly, it is difficult for the local target node to obtain sensitive information from the data processing results sent by other target nodes. In this embodiment, the target node is required to execute the query plan honestly, but no assumption is made that it refrains from trying to obtain other nodes' data; that is, the target node is allowed to attempt to derive data from what it acquires from other target nodes. In the security mode, the security and privacy of data transmission between the target nodes can be ensured.
Meanwhile, since the final data processing result generated by the final target node is a processing result that conforms to the security mode, in order to ensure that the query node can read the final data processing result normally, the final data processing result is processed by using a trusted planning node in this embodiment. Specifically, the step 103 of "generating the query result corresponding to the query request according to the last data processing result fed back by the last target node" includes:
Step A2: perform security mode removal processing on the final data processing result, fed back by the final target node, that conforms to the security mode, and take the result of the removal processing as the query result corresponding to the query request.
In the embodiment of the invention, the planning node can perform security mode removal processing on the final data processing result to generate data independent of the security mode, and this data can be provided to the query node as the query result, so that the query node can read the data in the final data processing result normally. "Security mode removal processing" is the inverse of the processing applied by the target node when it performs data processing in the security mode; for example, if the target node performs data processing in an encrypted form, the security mode removal processing is the corresponding decryption.
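As a non-limiting illustration, the pairing of a security mode with its removal processing can be sketched for the encryption case. The XOR keystream below is a toy placeholder used only so the example is self-contained; it is not a secure cipher, and the function names and the shared key are assumptions of this sketch, not the mechanism of the source.

```python
import hashlib
import json

def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter style (illustration only, NOT secure)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def apply_security_mode(rows: list, key: bytes) -> bytes:
    """Target-node side: serialize the result and mask it with the keystream."""
    plain = json.dumps(rows, sort_keys=True).encode("utf-8")
    return bytes(p ^ k for p, k in zip(plain, _keystream(key, len(plain))))

def remove_security_mode(blob: bytes, key: bytes) -> list:
    """Planning-node side: the inverse operation, recovering the readable result."""
    plain = bytes(c ^ k for c, k in zip(blob, _keystream(key, len(blob))))
    return json.loads(plain.decode("utf-8"))

shared_key = b"hypothetical key shared with the planning node"
final_rows = [{"id": 2, "age": 52}]
protected = apply_security_mode(final_rows, shared_key)    # final target node
query_result = remove_security_mode(protected, shared_key)  # planning node
```

A real deployment would substitute an authenticated encryption scheme, an enclave, or a multi-party computation protocol for the toy masking step.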
When a query is required, a planning node acts as a supervisor: it selects the corresponding data owner nodes as target nodes of the query process, builds the computation tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan, so that the target nodes feed back a final data processing result after the query plan is executed, and the planning node returns the final data processing result to the query node, thereby completing the query process of the distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computation tasks are allocated to the corresponding target nodes, and the corresponding query result is obtained when the target nodes execute those tasks. Because the query plan is generated based on the processing parameters of the target nodes, heterogeneous attributes such as trust relationship, threat level and hardware support can be fused in a heterogeneous environment, and each target node can perform data processing according to its own security requirements, hardware support and the like, so that data sharing is transparent to the heterogeneous environment. The target nodes execute the computation tasks in the security mode, so the data can be kept locally and data leakage avoided, the security of the data sources is not threatened, and private data can be shared safely and effectively. Keeping the data local and computing on it securely helps protect the data, because it reduces the security threats the data would face in transit and at the destination; meanwhile, dispatching an algorithm is generally much faster than transmitting data, so the data processing speed is higher.
On the basis of the above embodiment, the planning node specifically generates a query plan of a directed acyclic structure. Specifically, the step 102 of "generating an ordered query plan according to the processing parameters of all the target nodes" includes:
Step B1: allocate one or more corresponding atomic operations to each target node, determine the dependency relationships among all the atomic operations, and generate a query plan of a directed acyclic structure according to the dependency relationships among all the atomic operations.
In the embodiment of the invention, the query plan of the directed acyclic structure is generated with the atomic operation as the basic unit. An atomic operation is a basic operation in a query process, and may specifically be projection, selection, natural join, set difference, set intersection, renaming, and the like. A dependency relationship between two atomic operations means that one of the atomic operations depends on the data of the other when performing data processing. In this embodiment, the dependency relationship is directional: for two atomic operations, if atomic operation A depends on atomic operation B, then atomic operation B does not depend on atomic operation A. After the dependency relationships among all the atomic operations are determined, a query plan with a directed acyclic structure can be generated. A structural diagram of the query plan provided in this embodiment is shown in fig. 2, which represents the query plan as a Directed Acyclic Graph (DAG). In fig. 2, each circle represents an atomic operation, each directed edge represents a dependency between two atomic operations, and each dashed box represents a target node. That is, fig. 2 contains five target nodes A, B, C, D, E, which are allocated 1, 3, 5, 4, and 3 atomic operations respectively; for example, target node B contains three atomic operations b1, b2, and b3. Atomic operation a1 has a directed edge pointing to atomic operation b3, so atomic operation b3 depends on atomic operation a1.
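As a non-limiting illustration, a dependency mapping can be checked for cycles and turned into an execution order with the Python standard library (`graphlib`, Python 3.9+). Only the edges into b3 that the description states are used here; the rest of fig. 2 is not reproduced, and the names `build_order` and `is_acyclic` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Atomic operation -> set of operations it depends on (its predecessors).
dependencies: Dict[str, Set[str]] = {
    "a1": set(),
    "b1": set(),
    "b2": set(),
    "b3": {"a1", "b1", "b2"},
}

# Atomic operation -> target node it is allocated to.
owner: Dict[str, str] = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}

def build_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order; raises CycleError if the plan has a cycle."""
    return list(TopologicalSorter(deps).static_order())

def is_acyclic(deps: Dict[str, Set[str]]) -> bool:
    """True if the dependency relationships form a directed acyclic structure."""
    try:
        build_order(deps)
        return True
    except CycleError:
        return False

order = build_order(dependencies)
```

Any topological order is acceptable: every atomic operation appears after all of the operations it depends on, so b3 always comes after a1, b1 and b2.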
Optionally, after determining the query plan with the directed acyclic structure, the planning node may instruct the target nodes to perform their computation tasks. Specifically, "instructing the target node to sequentially perform data processing according to the query plan" in step 102 includes:
Step B2: instructing the current atomic operation of the target node to acquire the preamble data sets of its preamble atomic operations, where a preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a preamble data set is the data processing result produced by a preamble atomic operation.
Step B3: performing the data processing corresponding to the current atomic operation on all of its preamble data sets, taking the result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to it, where a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points.
Step B4: repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
In the embodiment of the invention, the planning node can send the query plan to the target nodes so that each target node knows which data processing it needs to perform, and the atomic operations perform data processing in sequence according to the query plan with the directed acyclic structure. In this embodiment, the target node that currently needs to perform data processing is taken as the current target node, and its current atomic operation is determined. If the dependency relationship of some atomic operation points to the current atomic operation, that atomic operation is a preamble atomic operation of the current atomic operation. In FIG. 2, atomic operation a1 points to atomic operation b3, i.e., a1 has a dependency relationship pointing to b3, so a1 is a preamble atomic operation of b3; similarly, atomic operations b1 and b2 are also preamble atomic operations of b3.
If the current atomic operation has no preamble atomic operation, it is an initial atomic operation; atomic operations a1, b1, etc. in FIG. 2 are all initial atomic operations. In this embodiment, execution of the query plan may start from an initial atomic operation, i.e., the current atomic operation is an initial atomic operation. Since there is no preamble atomic operation, the preamble data set is empty, and the initial atomic operation can only process locally stored data, applying the processing that the initial atomic operation itself specifies. For example, if the initial atomic operation a1 is deduplication, a1 deduplicates the locally stored data and takes the result as the data set of a1, i.e., the current data set.
If the current atomic operation is not an initial atomic operation, i.e., it has preamble atomic operations, it obtains the data sets of all its preamble atomic operations (the preamble data sets) and performs the corresponding data processing on all of them to generate the data set of the current atomic operation (the current data set). Optionally, the current atomic operation may process only the preamble data sets, or may process the preamble data sets together with data stored in the local node.
In addition, if the dependency relationship of the current atomic operation points to another atomic operation, that other atomic operation is a subsequent atomic operation of the current atomic operation. In FIG. 2, atomic operation b3 is a subsequent atomic operation of atomic operation a1, and atomic operation c5 is a subsequent atomic operation of atomic operation b3. If the current atomic operation has a subsequent atomic operation, it sends the generated current data set to that subsequent atomic operation, which then takes over as the current atomic operation and performs steps B2 and B3. If the current atomic operation has no subsequent atomic operation, the data set it generates can be used as the final data processing result. In FIG. 2, atomic operation e3 is the last atomic operation of the directed acyclic structure, and its data set can be the final data processing result. Those skilled in the art will understand that FIG. 2 illustrates the case of a single last atomic operation e3 (i.e., an atomic operation without a subsequent atomic operation). In practical applications there may be several last atomic operations, in which case the data sets of all of them together may be used as the final data processing result. In this embodiment, traversal of all the atomic operations is the end condition: once every atomic operation has performed its data processing, all assigned computation tasks are complete, the query plan generated by the planning node has been fully executed, and the final result can be obtained.
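Steps B2 to B4 can be illustrated by the following minimal single-process sketch. It is a simulation only: in the described system the data sets are sent between target nodes, whereas here they are kept in one dictionary. The function `run_plan`, the argument layout, and the example operations and data are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter


def run_plan(preds, ops, local_data):
    """Execute atomic operations in dependency order and return the final data sets.

    preds:      {op name: list of preamble op names}; every op must be a key
    ops:        {op name: function(preamble_sets, local) -> data set}
    local_data: {op name: data stored on the node that owns the op}
    """
    results = {}
    for name in TopologicalSorter(preds).static_order():
        # Step B2: collect the data sets of all preamble atomic operations.
        preamble = [results[p] for p in preds[name]]
        # Step B3: process them; an initial operation receives an empty list
        # and can only work on its locally stored data.
        results[name] = ops[name](preamble, local_data.get(name))
    # Step B4: operations that nothing depends on are the "last" operations.
    has_successor = {p for ps in preds.values() for p in ps}
    return {name: results[name] for name in preds if name not in has_successor}


# Hypothetical fragment of FIG. 2: a1, b1, b2 feed b3.
preds = {"a1": [], "b1": [], "b2": [], "b3": ["a1", "b1", "b2"]}
ops = {
    "a1": lambda pre, local: set(local),                      # deduplication
    "b1": lambda pre, local: {x for x in local if x > 2},     # selection
    "b2": lambda pre, local: {x for x in local if x % 2 == 0},
    "b3": lambda pre, local: set().union(*pre),               # union of inputs
}
local_data = {"a1": [1, 1, 2], "b1": [1, 3, 5], "b2": [4, 5, 6]}

final = run_plan(preds, ops, local_data)
```

Returning a dictionary keyed by operation name covers the case mentioned above in which several last atomic operations exist and all of their data sets form the final result.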
In this embodiment, an atomic operation $\delta$ may be represented as a tuple of the form

$$\langle op_\delta,\ m_\delta,\ X_{\delta,1}, \ldots, X_{\delta,m_\delta},\ D_j \rangle$$

where $op_\delta$ denotes the processing corresponding to the atomic operation $\delta$, such as projection or deduplication; $m_\delta$ denotes the number of preamble atomic operations of $\delta$; $X_{\delta,i}$ denotes the preamble data set of the $i$-th preamble atomic operation of $\delta$; $j$ indicates that $\delta$ belongs to the $j$-th target node, with $j \in [1, n]$ and $n$ the number of target nodes; and $D_j$ denotes the data stored locally in the $j$-th target node, which is optional in the tuple.

Specifically, referring to FIG. 2, atomic operation a1 performs its computation task first. Since a1 is an initial atomic operation and target node A is the first target node, the tuple of a1 is

$$\langle op_{a1},\ 0,\ D_1 \rangle$$

That is, atomic operation a1 performs the corresponding processing $op_{a1}$ on the data $D_1$ stored in target node A, and the processing result is the data set $I_{a1}$ of atomic operation a1, namely

$$I_{a1} = op_{a1}(D_1)$$

where the function $op_\delta(X)$ denotes processing the data $X$ according to the atomic operation $\delta$, and $I_\delta$ denotes the data set of the atomic operation $\delta$. After determining its data set $I_{a1}$, atomic operation a1 can send $I_{a1}$ to its subsequent atomic operation, i.e., to atomic operation b3. The initial atomic operations b1 and b2 are processed in a similar way, which is not repeated here. If the data processing is performed in the security mode, the function $op_\delta(X)$ denotes processing the data $X$ according to the atomic operation $\delta$ in the security mode.

For atomic operation b3, target node B is the second target node, i.e., $j = 2$, and b3 has three preamble atomic operations, so its tuple can be

$$\langle op_{b3},\ 3,\ X_{b3,1}, X_{b3,2}, X_{b3,3},\ D_2 \rangle$$

where $X_{b3,1}$, $X_{b3,2}$, and $X_{b3,3}$ may represent the preamble data sets of the three preamble atomic operations, namely the data sets of atomic operations a1, b1, and b2:

$$X_{b3,1} = I_{a1},\quad X_{b3,2} = I_{b1},\quad X_{b3,3} = I_{b2}$$

Atomic operation b3 generates its data set $I_{b3}$ by processing the corresponding preamble data sets, and

$$I_{b3} = op_{b3}(X_{b3,1}, X_{b3,2}, X_{b3,3}, D_2)$$

When atomic operation b3 needs to process the data $D_2$ stored in target node B, the tuple takes the form above. If atomic operation b3 does not need to process $D_2$, its tuple may be

$$\langle op_{b3},\ 3,\ X_{b3,1}, X_{b3,2}, X_{b3,3} \rangle$$

or $D_2$ in the tuple above may be assigned null.

Repeating the above process, the last atomic operation e3 determines its data set $I_{e3}$, and $I_{e3}$ is the final data processing result.
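The tuple representation of an atomic operation (the processing $op_\delta$, the count $m_\delta$ of preamble atomic operations, the preamble data sets, and the optional locally stored data) and the resulting data set $I_\delta$ could be mirrored in code roughly as follows. The class `AtomicOp`, its field names, and the example data are hypothetical, and set union stands in for whatever processing an operation actually performs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class AtomicOp:
    """Tuple-like record for one atomic operation (field names are illustrative)."""
    op: Callable[..., set]          # op_delta: the processing to apply
    preamble: Sequence[set] = ()    # preamble data sets of the preamble operations
    local: Optional[set] = None     # data stored locally on the owning node; optional

    @property
    def m(self) -> int:
        # m_delta: number of preamble atomic operations
        return len(self.preamble)

    def data_set(self) -> set:
        # I_delta: apply op to the preamble data sets, plus local data when present
        inputs = list(self.preamble)
        if self.local is not None:
            inputs.append(self.local)
        return self.op(*inputs)


def union(*xs):
    """Stand-in processing: set union of all inputs."""
    return set().union(*xs)


# Initial operations have no preamble data sets and work on local data only.
a1 = AtomicOp(op=union, local={1, 2})
b1 = AtomicOp(op=union, local={3})
b2 = AtomicOp(op=union, local={4})

# b3 receives the data sets of a1, b1, b2; here its local data is left unset,
# matching the case where the operation does not need locally stored data.
b3 = AtomicOp(op=union, preamble=[a1.data_set(), b1.data_set(), b2.data_set()])
result_b3 = b3.data_set()
```

Leaving `local` as `None` corresponds to assigning null to the local-data element of the tuple.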
In the embodiment of the invention, the planning node constructs a distributed query plan as a directed acyclic graph, expresses the task of each target node as atomic operations at the relational-algebra level together with their dependency relationships, and sends the atomic operations to the target nodes for computation, so that the plan can be optimized globally. The method provides a uniform way to express data sharing among nodes in an environment with inconsistent security characteristics, and thus realizes data sharing in a heterogeneous environment. Compiling user tasks into a global query plan also facilitates optimizing that plan, for example by minimizing data transmission among nodes or by dispatching computation to more secure nodes.
In the above embodiment, the query node needs to provide certain resources for the query operation to be performed. Specifically, after "generating an ordered query plan according to the processing parameters of all the target nodes" in step 102, the method further includes:
Step C1: determining the corresponding query cost according to the query plan, and acquiring query resources matching the query cost.
Step C2: allocating the query resources to the corresponding target nodes.
In the embodiment of the invention, after determining the query plan, the planning node can calculate the cost of executing the query plan on the target nodes, i.e., the query cost, and request the corresponding query resources from the query node. If the query node provides the query resources to the planning node, the planning node goes on to issue the query plan to the target nodes so that they execute the corresponding computation tasks; if the query node does not provide the query resources, the query process ends. In addition, after the query result has been fed back to the query node, the planning node allocates the corresponding query resources to each target node. The query resources may specifically be a fee, credits, or any other resource capable of rewarding the target nodes.
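Steps C1 and C2 might be sketched as below. The description does not state how the query cost is computed or how resources are divided, so this sketch assumes a hypothetical per-operation price table (`unit_cost`) and gives each target node the summed price of the atomic operations it executes. The functions `query_cost` and `settle` are illustrative names only.

```python
def query_cost(plan_ops, unit_cost):
    """Step C1 (assumed model): sum a per-operation price over the whole plan.

    plan_ops:  list of (target node, operation kind) pairs
    unit_cost: {operation kind: price}, a hypothetical price table
    """
    return sum(unit_cost[kind] for _, kind in plan_ops)


def settle(plan_ops, unit_cost, offered):
    """Step C2 (assumed model): split the resources among the target nodes.

    Returns {target node: resources}, or None when the query node offers
    less than the query cost, in which case the query process ends.
    """
    cost = query_cost(plan_ops, unit_cost)
    if offered < cost:
        return None
    shares = {}
    for node, kind in plan_ops:
        shares[node] = shares.get(node, 0) + unit_cost[kind]
    return shares


# Hypothetical plan: node A runs one deduplication, node B a selection and a join.
plan_ops = [("A", "dedup"), ("B", "select"), ("B", "join")]
unit_cost = {"dedup": 2, "select": 1, "join": 5}

cost = query_cost(plan_ops, unit_cost)
shares = settle(plan_ops, unit_cost, offered=8)
refused = settle(plan_ops, unit_cost, offered=7)
```

Other cost models, for example ones weighted by data volume or by the security mode used, would fit the same two-step shape.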
Specifically, in this embodiment, the nodes in the distributed network are divided into three types: the query node, the planning node, and the data owner nodes. The overall query process is shown in FIG. 3, where the target nodes are data owner nodes. In FIG. 3, the target nodes are responsible for sharing data and executing computation tasks, the query node initiates a data query or retrieval task, and the planning node builds the tasks of the target nodes into a query plan and coordinates the execution of that plan. A query is initiated by the query node; after the planning node coordinates execution by the target nodes, the query result is returned to the planning node and then to the query node.
The above describes in detail the flow of the query method for distributed data, which may also be implemented by a corresponding apparatus, and the following describes in detail the structure and function of the apparatus.
The query device for distributed data provided by the embodiment of the invention can be specifically arranged in a query node. Referring to fig. 4, the query apparatus includes:
a preprocessing module 41, configured to, after a query request is obtained, determine a plurality of target nodes corresponding to the query request, and determine the processing parameters of each target node;
a query plan module 42, configured to generate an ordered query plan according to the processing parameters of all the target nodes, and instruct the target nodes to sequentially perform data processing according to the query plan;
and a result generating module 43, configured to generate a query result corresponding to the query request according to the final data processing result fed back by the last target node.
On the basis of the above embodiment, the query plan module 42 generating an ordered query plan according to the processing parameters of all the target nodes includes:
allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to those dependency relationships.
On the basis of the above embodiment, the query plan module 42 instructing the target nodes to sequentially perform data processing according to the query plan includes:
instructing the current atomic operation of the target node to acquire the preamble data sets of its preamble atomic operations, where a preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a preamble data set is the data processing result produced by a preamble atomic operation;
performing the data processing corresponding to the current atomic operation on all of its preamble data sets, taking the result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to it, where a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above embodiment,
the query plan module 42 instructing the target nodes to sequentially perform data processing according to the query plan includes: instructing the target nodes to sequentially perform data processing according to the query plan in a preset security mode, so as to generate data processing results conforming to the security mode;
the result generating module 43 generating a query result corresponding to the query request according to the final data processing result fed back by the last target node includes: performing security-mode removal processing on the final data processing result, conforming to the security mode, that is fed back by the target node, and taking the result of the removal processing as the query result corresponding to the query request.
On the basis of the above embodiment, the apparatus further includes a resource allocation module;
after the query plan module 42 generates the ordered query plan according to the processing parameters of all the target nodes, the resource allocation module is configured to:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost; and allocating the query resources to the corresponding target nodes.
When a query needs to be performed, the planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, builds the computation tasks of the target nodes into an ordered query plan, and coordinates the execution of the plan. After executing the query plan, the target nodes feed back the final data processing result, and the planning node returns it to the query node, completing the query over the distributed data. In this embodiment, a global ordered query plan is generated by the planning node, computation tasks are allocated to the corresponding target nodes, and the query result is obtained by having the target nodes execute those tasks. Because the query plan is generated from the processing parameters of the target nodes, heterogeneous attributes such as trust relationships, threat levels, and hardware support can be integrated in a heterogeneous environment, and each target node can process data according to its own security requirements, hardware support, and so on, so that data sharing is transparent to the heterogeneous environment. Because the target nodes execute the computation tasks in the security mode, the data can remain local while data leakage is avoided, the security of the data source is not threatened, and private data can be shared safely and effectively.
Based on the same inventive concept, an embodiment of the present invention further provides a distributed system. As shown in FIG. 5, the distributed system includes a planning node 51 and a plurality of data owner nodes 52, where the planning node 51 is a trusted node and each data owner node 52 may be directly or indirectly connected to the planning node 51.
The planning node 51 is configured to, after acquiring a query request, determine a plurality of target nodes corresponding to the query request from among the data owner nodes 52, and determine the processing parameters of the target nodes, where the processing parameters include the hardware parameters of each target node and the trust relationships between that target node and other target nodes;
the planning node 51 is further configured to generate an ordered query plan according to the processing parameters of all the target nodes, and send the query plan to the target nodes;
each target node is configured to perform data processing according to the query plan and send its data processing result to other target nodes, until the last target node sends the final data processing result to the planning node 51;
the planning node 51 generates a query result corresponding to the query request according to the final data processing result.
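The processing parameters named above (hardware parameters and trust relationships) could be carried in a small record per target node. This section does not specify how the planning node uses them when ordering the plan, so the check below is purely an assumed example: it flags plan edges that would move a data set between two target nodes that have no trust relationship. `ProcessingParams`, `untrusted_transfers`, and the `"tee"` hardware key are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingParams:
    """Processing parameters of one target node (field names are illustrative)."""
    hardware: dict = field(default_factory=dict)  # e.g. {"tee": True}, assumed key
    trusts: set = field(default_factory=set)      # nodes this node may send data to


def untrusted_transfers(edges, owner, params):
    """List plan edges that cross nodes lacking a trust relationship.

    edges:  iterable of (preamble op, subsequent op)
    owner:  {op name: target node name}
    params: {target node name: ProcessingParams}
    """
    bad = []
    for pred, succ in edges:
        src, dst = owner[pred], owner[succ]
        # Edges inside one node never leave that node, so they are always allowed.
        if src != dst and dst not in params[src].trusts:
            bad.append((pred, succ))
    return bad


# Hypothetical fragment of FIG. 2 with assumed trust relationships.
owner = {"a1": "A", "b1": "B", "b3": "B", "c5": "C"}
edges = [("a1", "b3"), ("b1", "b3"), ("b3", "c5")]
params = {
    "A": ProcessingParams(hardware={"tee": True}, trusts={"B"}),
    "B": ProcessingParams(),   # B trusts no other node in this example
    "C": ProcessingParams(),
}

violations = untrusted_transfers(edges, owner, params)
```

A planner could use such a list to reroute an operation or to require the security mode on the flagged transfers; either choice is outside what the text specifies.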
On the basis of the above embodiment, the planning node 51 generating an ordered query plan according to the processing parameters of all the target nodes includes:
the planning node 51 allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to those dependency relationships.
On the basis of the above embodiment, the target node performing data processing according to the query plan includes:
the current atomic operation of the target node acquiring the preamble data sets of its preamble atomic operations, where a preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a preamble data set is the data processing result produced by a preamble atomic operation;
performing the data processing corresponding to the current atomic operation on all of its preamble data sets, taking the result as the current data set of the current atomic operation, and, when a subsequent atomic operation exists, sending the current data set to it, where a subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above embodiment, referring to fig. 6, the distributed system further includes a query node 52 for initiating the query request;
after the planning node 51 generates an ordered query plan according to the processing parameters of all the target nodes, the planning node 51 determines corresponding query cost according to the query plan, and sends the query cost to the query node 52;
the query node 52 feeds back the query resource matched with the query cost to the planning node 51; after receiving the query resource, the planning node 51 allocates the query resource to the corresponding target node.
In the distributed system provided by the embodiment of the present invention, the nodes are divided into three types, namely the query node 52, the planning node 51, and the data owner nodes 52, and all or some of the data owner nodes 52 in FIG. 5 may serve as target nodes. In this embodiment, a node may play different roles in different query tasks; for example, a data owner node 52 may also initiate a query as a query node 52. In addition, FIG. 5 and FIG. 6 only schematically show the structure of the distributed system and do not require the distributed system to be based on this architecture; for example, the query node may connect indirectly to the planning node through other data owner nodes and initiate a query. Likewise, FIG. 5 and FIG. 6 show the permitted communication connections between the nodes, and do not mean that the planning node must generate the query plan according to the complete set of connections between the data owner nodes shown in the figures. For a detailed description of the distributed system, reference may be made to the embodiments corresponding to FIG. 1 to FIG. 3, which are not repeated here.
The distributed system provided by this embodiment can generate a global ordered query plan at the trusted planning node, allocate computation tasks to the corresponding target nodes, and obtain the query result by having the target nodes execute those tasks. Because the query plan is generated from the processing parameters of the target nodes, heterogeneous attributes such as trust relationships, threat levels, and hardware support can be integrated in a heterogeneous environment, and each target node can process data according to its own security requirements, hardware support, and so on, so that data sharing is transparent to the heterogeneous environment. Because the target nodes execute the computation tasks in the security mode, the data can remain local while data leakage is avoided, the security of the data source is not threatened, and private data can be shared safely and effectively.
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (10)

1. A query method of distributed data is characterized by comprising the following steps:
after a query request is obtained, determining a plurality of target nodes corresponding to the query request, and determining the processing parameters of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;
and generating a query result corresponding to the query request according to the final data processing result fed back by the final target node.
2. The method of claim 1, wherein generating an ordered query plan based on the processing parameters of all of the target nodes comprises:
allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to the dependency relationships among all the atomic operations.
3. The method of claim 2, wherein the instructing the target node to perform data processing in sequence according to the query plan comprises:
indicating the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operations with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation is subjected to data processing;
performing data processing corresponding to the current atomic operation on all the preorder data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is other atomic operations with a dependency relationship pointed by the current atomic operation;
and repeating the above processes until all the atomic operations are traversed, and taking the last data set of the atomic operations as the final data processing result.
4. The method of claim 1,
the instructing the target node to sequentially perform data processing according to the query plan includes: instructing the target node to sequentially perform data processing according to the query plan in a preset security mode, so as to generate a data processing result conforming to the security mode;
the generating a query result corresponding to the query request according to the final data processing result fed back by the target node includes: performing security-mode removal processing on the final data processing result, conforming to the security mode, that is fed back by the target node, and taking the result of the removal processing as the query result corresponding to the query request.
5. The method of claim 1, further comprising, after said generating an ordered query plan according to processing parameters of all of said target nodes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and allocating the query resources to the corresponding target nodes.
6. An apparatus for querying distributed data, comprising:
a preprocessing module, configured to, after a query request is obtained, determine a plurality of target nodes corresponding to the query request, and determine the processing parameters of each target node;
a query plan module, configured to generate an ordered query plan according to the processing parameters of all the target nodes, and instruct the target nodes to sequentially perform data processing according to the query plan;
and a result generating module, configured to generate a query result corresponding to the query request according to the final data processing result fed back by the target node.
7. A distributed system, comprising: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is configured to, after a query request is obtained, determine a plurality of target nodes corresponding to the query request from among the data owner nodes, and determine the processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node is further configured to generate an ordered query plan according to the processing parameters of all the target nodes, and send the query plan to the target nodes;
the target node is configured to perform data processing according to the query plan and send its data processing result to other target nodes, until the last target node sends the final data processing result to the planning node;
and the planning node generates a query result corresponding to the query request according to the final data processing result.
8. The distributed system of claim 7 wherein the planning node generating an ordered query plan based on the processing parameters of all of the target nodes comprises:
the planning node allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to the dependency relationships among all the atomic operations.
9. The distributed system of claim 8, wherein the target node performing data processing according to the query plan comprises:
the current atomic operation of the target node acquiring the preamble data sets of its preamble atomic operations, wherein a preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and a preamble data set is the data processing result produced by a preamble atomic operation;
performing data processing corresponding to the current atomic operation on all the preorder data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is other atomic operations with a dependency relationship pointed by the current atomic operation;
and repeating the above processes until all the atomic operations are traversed, and taking the last data set of the atomic operations as the final data processing result.
10. The distributed system of any one of claims 7-9, further comprising a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
the query node feeds back query resources matched with the query cost to the planning node; and after receiving the query resources, the planning node allocates the query resources to the corresponding target nodes.
CN201911176025.6A 2019-11-26 2019-11-26 Distributed data query method, device and distributed system Active CN110955701B (en)

Publications (2)

Publication Number Publication Date
CN110955701A (en) 2020-04-03
CN110955701B (en) 2023-04-25

Family ID: 69977076

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201911176025.6A Active CN110955701B (en) 2019-11-26 2019-11-26 Distributed data query method, device and distributed system

Country Status (1)

Country Link
CN (1) CN110955701B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7984043B1 (en) * 2007-07-24 2011-07-19 Amazon Technologies, Inc. System and method for distributed query processing using configuration-independent query plans
CN104063486A (en) * 2014-07-03 2014-09-24 四川中亚联邦科技有限公司 Big data distributed storage method and system
CN105608077A (en) * 2014-10-27 2016-05-25 青岛金讯网络工程有限公司 Big data distributed storage method and system
CN105404690A (en) * 2015-12-16 2016-03-16 华为技术服务有限公司 Database querying method and apparatus
CN107301205A (en) * 2017-06-01 2017-10-27 华南理工大学 A kind of distributed Query method in real time of big data and system
CN110263105A (en) * 2019-05-21 2019-09-20 北京百度网讯科技有限公司 Inquiry processing method, query processing system, server and computer-readable medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021254288A1 (en) * 2020-06-14 2021-12-23 Wenfei Fan Querying shared data with security heterogeneity
CN114518850A (en) * 2022-02-23 2022-05-20 云链网科技(广东)有限公司 Safe re-deletion storage system with re-deletion before encryption based on trusted execution protection
CN114518850B (en) * 2022-02-23 2024-03-12 云链网科技(广东)有限公司 Safe re-deleting storage system based on trusted execution protection and comprising re-deleting and encryption

Also Published As

Publication number Publication date
CN110955701B (en) 2023-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant