CN110955701A - Distributed data query method and device and distributed system - Google Patents
- Publication number
- CN110955701A (application CN201911176025.6A)
- Authority
- CN
- China
- Prior art keywords
- query
- node
- atomic operation
- data
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a distributed data query method, a query device and a distributed system. The method comprises: after a query request is obtained, determining a plurality of target nodes corresponding to the query request and determining a processing parameter of each target node; generating an ordered query plan according to the processing parameters of all target nodes, and instructing the target nodes to process data in sequence according to the query plan; and generating a query result corresponding to the query request according to the final data processing result fed back by the last target node. Because the planning node generates the ordered query plan from the processing parameters of the target nodes, heterogeneous attributes such as trust relationships, threat levels and hardware support can be integrated in a heterogeneous environment, and each target node can process data according to its own security requirements or hardware support, so that data sharing works transparently across the heterogeneous environment.
Description
Technical Field
The invention relates to the technical field of distributed data, in particular to a distributed data query method, a distributed data query device and a distributed system.
Background
Demand for data sharing is growing in industries such as e-government, healthcare, finance and artificial intelligence. In precision medicine, for example, clinical, genetic, environmental and lifestyle data need to be shared to better treat and prevent disease. Data owners typically act as data sources and together form a distributed data system.
Query processing over distributed data sources has been studied extensively, and various protocols have been designed for different security settings and threat assumptions. Existing query processing algorithms and systems target homogeneous environments, that is, they assume that all parties use the same protocol. In many practical scenarios, however, distributed data sharing takes place in heterogeneous environments in which the parties use different protocols. The practical causes of this security heterogeneity include the varying trust relationships between data owners, the different threat levels on different communication channels and computing nodes, and the differing degrees of special hardware support available.
If secure query processing techniques designed for homogeneous environments are used in a heterogeneous environment, every party has to meet the most stringent security requirement among the data owners and/or work with the lowest available level of hardware support, which results in unnecessarily high computational cost.
Disclosure of Invention
In order to solve the foregoing problems, embodiments of the present invention provide a method, an apparatus, and a distributed system for querying distributed data.
In a first aspect, an embodiment of the present invention provides a method for querying distributed data, including:
after a query request is obtained, determining a plurality of target nodes corresponding to the query request, and determining a processing parameter of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and instructing the target nodes to perform data processing in sequence according to the query plan;
and generating a query result corresponding to the query request according to the final data processing result fed back by the last target node.
In one possible implementation, the generating an ordered query plan according to the processing parameters of all the target nodes includes:
allocating one or more corresponding atomic operations to each target node, determining the dependency relationships among all the atomic operations, and generating a query plan with a directed acyclic structure according to those dependency relationships.
In a possible implementation manner, the instructing the target node to sequentially perform data processing according to the query plan includes:
instructing the current atomic operation of the target node to acquire the predecessor data set of each predecessor atomic operation, wherein a predecessor atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the predecessor data set is the data processing result produced by that predecessor atomic operation;
performing the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, taking the corresponding data processing result as the current data set of the current atomic operation, and, when a successor atomic operation exists, sending the current data set to the successor atomic operation, wherein a successor atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations have been traversed, and taking the data set of the last atomic operation as the final data processing result.
In a possible implementation, the instructing the target nodes to perform data processing in sequence according to the query plan includes: instructing the target nodes to perform data processing in sequence according to the query plan in a preset security mode, so as to generate data processing results that conform to the security mode;
the generating a query result corresponding to the query request according to the final data processing result fed back by the last target node includes: performing security mode removal on the final data processing result, conforming to the security mode, that is fed back by the last target node, and taking the result of the security mode removal as the query result corresponding to the query request.
In one possible implementation, after the generating an ordered query plan according to the processing parameters of all the target nodes, the method further includes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and allocating the query resources to the corresponding target nodes.
In a second aspect, an embodiment of the present invention further provides a device for querying distributed data, including:
a preprocessing module, configured to determine, after a query request is obtained, a plurality of target nodes corresponding to the query request and to determine a processing parameter of each target node;
a query plan module, configured to generate an ordered query plan according to the processing parameters of all the target nodes and to instruct the target nodes to perform data processing in sequence according to the query plan;
and a result generation module, configured to generate a query result corresponding to the query request according to the final data processing result fed back by the last target node.
In a third aspect, an embodiment of the present invention further provides a distributed system, including a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is configured to determine, after a query request is obtained, a plurality of target nodes corresponding to the query request from among the data owner nodes, and to determine processing parameters of the target nodes, wherein the processing parameters include hardware parameters of each target node and the trust relationships between that target node and the other target nodes;
the planning node is further configured to generate an ordered query plan according to the processing parameters of all the target nodes and to send the query plan to the target nodes;
each target node is configured to perform data processing according to the query plan and to send its data processing result to other target nodes, until the last target node sends the final data processing result to the planning node;
and the planning node generates a query result corresponding to the query request according to the final data processing result.
In a possible implementation manner, the generating, by the planning node, an ordered query plan according to the processing parameters of all the target nodes includes:
and the planning node allocates one or more corresponding atomic operations to each target node, determines the dependency relationship among all the atomic operations, and generates a query plan with a directed acyclic structure according to the dependency relationship among all the atomic operations.
In one possible implementation manner, the data processing performed by the target node according to the query plan includes:
the current atomic operation of the target node acquires the predecessor data set of each predecessor atomic operation, wherein a predecessor atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the predecessor data set is the data processing result produced by that predecessor atomic operation;
the data processing corresponding to the current atomic operation is performed on all the predecessor data sets of the current atomic operation, the corresponding data processing result is taken as the current data set of the current atomic operation, and, when a successor atomic operation exists, the current data set is sent to the successor atomic operation, wherein a successor atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and the above process is repeated until all the atomic operations have been traversed, and the data set of the last atomic operation is taken as the final data processing result.
In a possible implementation, the system further includes a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
the query node feeds back query resources matched with the query cost to the planning node; and after receiving the query resources, the planning node allocates the query resources to the corresponding target nodes.
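The cost-and-resource exchange described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the query cost is computed or what form the query resources take, so the per-operation costs, the `plan_cost` and `allocate` helpers, and the proportional split are all hypothetical.

```python
def plan_cost(op_costs: dict[str, int], owner: dict[str, str]) -> tuple[int, dict[str, int]]:
    """Return the total plan cost and the cost attributable to each target node.

    Assumption: the plan cost is the sum of hypothetical per-operation costs.
    """
    per_node: dict[str, int] = {}
    for op, cost in op_costs.items():
        node = owner[op]
        per_node[node] = per_node.get(node, 0) + cost
    return sum(per_node.values()), per_node


def allocate(resources: float, per_node: dict[str, int]) -> dict[str, float]:
    """Split the resources received from the query node among the target nodes.

    Assumption: each target node receives a share proportional to its cost.
    """
    total = sum(per_node.values())
    if resources < total:
        raise ValueError("query resources do not cover the query cost")
    return {node: resources * cost / total for node, cost in per_node.items()}


# Hypothetical costs for four atomic operations owned by two target nodes.
op_costs = {"a1": 2, "b1": 1, "b2": 1, "b3": 4}
owner = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}

total_cost, per_node = plan_cost(op_costs, owner)   # sent to the query node
shares = allocate(total_cost, per_node)             # after resources are fed back
```

Here the planning node plays both roles: it computes the cost it reports to the query node, and it distributes whatever the query node returns.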
In the solution provided by the first aspect of the embodiments of the present invention, when a query is required, a planning node acts as supervisor: it selects the relevant data owner nodes as the target nodes of the query, organizes the computation tasks of the target nodes into an ordered query plan, and coordinates execution of the plan, so that the final data processing result is fed back once all target nodes have executed the plan; the planning node then returns the final data processing result to the query node, completing the distributed data query. In this embodiment, the planning node generates a global ordered query plan and allocates computation tasks to the corresponding target nodes, and the query result is obtained by having the target nodes execute those tasks. Because the query plan is generated from the processing parameters of the target nodes, heterogeneous attributes such as trust relationships, threat levels and hardware support can be integrated in a heterogeneous environment, and each target node can process data according to its own security requirements or hardware support, so that data sharing works transparently across the heterogeneous environment. Because the target nodes execute their computation tasks in a security mode, the data can remain local and data leakage is avoided, the security of the data sources is not threatened, and private data can be shared safely and effectively.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a distributed data query method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a query plan with a directed acyclic structure in the distributed data query method according to an embodiment of the present invention;
Fig. 3 is an overall flowchart of the distributed data query method according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a distributed data query device according to an embodiment of the present invention;
Fig. 5 is a first schematic structural diagram of a distributed system according to an embodiment of the present invention;
Fig. 6 is a second schematic structural diagram of the distributed system according to an embodiment of the present invention.
Detailed Description
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
The distributed data query method provided by the embodiment of the invention is executed by the trusted planning node, and the planning node can generate an ordered query plan so as to realize data query. Referring to fig. 1, the method includes:
Step 101: after a query request is obtained, determine a plurality of target nodes corresponding to the query request, and determine the processing parameters of each target node.
In the embodiment of the invention, data is stored in a distributed manner on some nodes of a distributed system; these nodes are the data owner nodes. When another node needs to query some data, the corresponding data has to be obtained from one or more data owner nodes. In this embodiment, a trusted planning node is selected as an intermediary that supervises the whole data query process; the planning node is trusted or auditable, and may be implemented with blockchain technology. After obtaining the query request, the planning node determines which data owner nodes store data matching the query request, and takes all data owner nodes that store such data as target nodes.
In this embodiment, after determining the target nodes, the planning node determines the processing parameters of each target node, which may specifically include the hardware parameters of the target node and the trust relationships between the target node and the other target nodes. A trust relationship between two target nodes indicates, for example, whether data transmission between them is allowed and what transmitting the data costs; a hardware parameter is a parameter of the device hardware at the target node, and may indicate, for example, the computing capability or the computing priority of the target node.
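As an illustration of what such processing parameters might look like in practice, the following minimal sketch models them as plain records. The class and field names (`TrustLink`, `ProcessingParams`, `compute_capability`, `has_secure_enclave`, `transfer_cost`) are hypothetical and are not taken from the description; they only mirror the two kinds of parameter it names, hardware parameters and trust relationships.

```python
from dataclasses import dataclass, field


@dataclass
class TrustLink:
    """Trust relationship from one target node to a peer (hypothetical fields)."""
    allowed: bool          # whether data may be transmitted to the peer at all
    transfer_cost: float   # illustrative cost of transmitting data to the peer


@dataclass
class ProcessingParams:
    """Processing parameters the planning node collects for one target node."""
    node_id: str
    compute_capability: float           # hardware parameter, arbitrary illustrative unit
    has_secure_enclave: bool = False    # hardware parameter: special hardware support
    trust: dict[str, TrustLink] = field(default_factory=dict)  # keyed by peer node id

    def can_send_to(self, peer_id: str) -> bool:
        # In this sketch, a peer with no recorded trust relationship is treated as untrusted.
        link = self.trust.get(peer_id)
        return link is not None and link.allowed


params_a = ProcessingParams(
    node_id="A",
    compute_capability=2.0,
    has_secure_enclave=True,
    trust={
        "B": TrustLink(allowed=True, transfer_cost=1.5),
        "C": TrustLink(allowed=False, transfer_cost=0.0),
    },
)
```

A planner could consult `can_send_to` when deciding which target node may receive another node's intermediate result.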
In this embodiment, the query request may be generated by the planning node itself, or may be generated by a separate query node and sent to the planning node. In addition, those skilled in the art will understand that the "query node", "planning node" and "target node" in this embodiment are roles performed by nodes during a query, and a given node is not limited to a single role. For example, a data owner node that serves as a target node in the above process may act as a query node in another query, or, if it is trusted, as the planning node in another query.
Step 102: generate an ordered query plan according to the processing parameters of all the target nodes, and instruct the target nodes to process data in sequence according to the query plan.
In the embodiment of the invention, after the planning node has determined the processing parameters of all the target nodes, it can generate the corresponding query plan. The query plan is the execution plan that the planning node constructs for the query request; from it, each target node learns which data processing operations, that is, which computation tasks, it has to perform. The query plan is also ordered: the target nodes execute their computation tasks in the order the plan defines and eventually obtain the final data processing result. The planning node is thus the planner of the query plan; after it issues the query plan to the target nodes, the target nodes act as executors and carry out their computation tasks to obtain the corresponding data processing results.
The query plan in this embodiment may specify an execution order for the target nodes. When the current target node finishes its data processing, it sends its data processing result to the next target node in the order, which continues its own computation task based on that result; this repeats until the last target node generates the final data processing result. The "order" in this embodiment is not limited to a linear first-to-last sequence; it may also be, for example, the order expressed by a directed acyclic structure.
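The simplest case, a linear execution order, can be sketched as below. This is an illustrative, single-process simulation rather than the patented implementation: the `run_chain` helper, the two task functions and the sample rows are all made up for the example, and real target nodes would exchange results over a network.

```python
from typing import Callable

Rows = list[dict]
Task = Callable[[Rows, Rows], Rows]   # (local data, received result) -> new result


def run_chain(plan: list[tuple[str, Task]], local_data: dict[str, Rows]) -> Rows:
    """Execute tasks in plan order, forwarding each result to the next target node."""
    carried: Rows = []                       # the first node has received nothing yet
    for node_id, task in plan:
        carried = task(local_data[node_id], carried)
    return carried                           # the final data processing result


def select_over_60(local: Rows, _received: Rows) -> Rows:
    # Selection on the node's own data; ignores any received result.
    return [row for row in local if row["age"] > 60]


def join_on_id(local: Rows, received: Rows) -> Rows:
    # Joins the received rows with local rows that share the same "id".
    by_id = {row["id"]: row for row in local}
    return [{**row, **by_id[row["id"]]} for row in received if row["id"] in by_id]


# Made-up sample data held locally by two target nodes.
local_data = {
    "A": [{"id": 1, "age": 72}, {"id": 2, "age": 45}],
    "B": [{"id": 1, "gene": "X"}, {"id": 2, "gene": "Y"}],
}
final_result = run_chain([("A", select_over_60), ("B", join_on_id)], local_data)
```

Only node A's filtered rows travel to node B; node B's local data is never sent elsewhere in this sketch.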
Step 103: generate a query result corresponding to the query request according to the final data processing result fed back by the last target node.
In the embodiment of the invention, the target nodes execute the ordered query plan in sequence; after the last target node performs its data processing, it generates the final data processing result and feeds it back to the planning node. The final data processing result is the result obtained after all the target nodes have executed the query plan, and it represents the corresponding query result. The planning node can either send the final data processing result directly to the query node as the query result, or process the final data processing result to generate the query result and send that to the query node. The latter applies when the final data processing result generated by the target nodes cannot be provided to the query node directly: the trusted planning node processes it into data that can be.
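Steps 101 to 103 can be tied together in a small skeleton of the planning node's role. This is a sketch under assumptions: `handle_query`, the `catalog` mapping and the stubbed `execute_plan` callable are hypothetical names, and plan generation is reduced to sorting the target nodes, whereas the description derives the plan from processing parameters.

```python
from typing import Callable, Optional


def handle_query(
    requested: set[str],
    catalog: dict[str, set[str]],
    execute_plan: Callable[[list[str]], list],
    postprocess: Optional[Callable[[list], list]] = None,
) -> list:
    """Planning-node skeleton: select targets, run the plan, form the query result."""
    # Step 101: nodes that store any of the requested data become target nodes.
    targets = sorted(node for node, held in catalog.items() if held & requested)
    if not targets:
        return []
    # Step 102 (stubbed): the "plan" here is just the sorted target list.
    final_result = execute_plan(targets)
    # Step 103: forward the final result directly, or post-process it first.
    return final_result if postprocess is None else postprocess(final_result)


# Hypothetical catalog of which node stores which data sets.
catalog = {"A": {"clinical"}, "B": {"genetic"}, "C": {"billing"}}


def fake_execute(targets: list[str]) -> list:
    # Stand-in for distributed execution by the target nodes.
    return [f"processed by {node}" for node in targets]


direct = handle_query({"clinical", "genetic"}, catalog, fake_execute)
cleaned = handle_query(
    {"clinical", "genetic"}, catalog, fake_execute,
    postprocess=lambda rows: [row.upper() for row in rows],
)
```

The optional `postprocess` argument corresponds to the case where the final data processing result cannot be handed to the query node as it is.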
Optionally, when data owner nodes share data in a distributed network, there is a risk that the data is obtained by an unauthorized node, causing data leakage and even economic loss and legal liability for the data owner nodes; security and privacy concerns have therefore hindered distributed data sharing from the outset. To avoid data leakage during data query, the method provided by this embodiment performs data processing in a security mode, so that private data is shared safely and effectively. Specifically, "instructing the target nodes to process data in sequence according to the query plan" in step 102 includes:
Step A1: instruct the target nodes to perform data processing in sequence according to the query plan in a preset security mode, generating data processing results that conform to the security mode.
In the embodiment of the invention, the target nodes are all data owner nodes, and when they execute computation tasks based on the query plan, the data processing is carried out in the security mode, so that the generated data processing results conform to the security mode. The security mode may specifically be a secure enclave, a Docker container, encryption, secure multi-party computation, or the like. That is, a target node may use a sandbox mechanism or similar to allocate isolated computing power to the computation task, or may encrypt its data processing result, so that other target nodes that receive the data processing result of the local target node can hardly obtain the sensitive information in it; likewise, the local target node can hardly obtain sensitive information from the data processing results sent by other target nodes. In this embodiment, a target node is required to execute the query plan honestly, but is not otherwise prevented from trying to steal data from other nodes; that is, a target node may attempt to extract information from the data it acquires from other target nodes. The security mode ensures the security and privacy of data transmission between the target nodes.
Meanwhile, because the final data processing result generated by the last target node conforms to the security mode, this embodiment has the trusted planning node process it so that the query node can read it normally. Specifically, "generating a query result corresponding to the query request according to the final data processing result fed back by the last target node" in step 103 includes:
Step A2: perform security mode removal on the final data processing result, conforming to the security mode, that is fed back by the last target node, and take the result of the security mode removal as the query result corresponding to the query request.
In the embodiment of the invention, the planning node can perform security mode removal on the final data processing result to generate data that is independent of the security mode; this data serves as the query result provided to the query node, so that the query node can read it normally. "Security mode removal" is the inverse of the processing that the target nodes apply under the security mode; for example, if the target nodes process data in encrypted form, security mode removal is the corresponding decryption.
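The relationship between processing in a security mode and its removal can be illustrated with a round trip. The sketch below uses a toy XOR mask purely as a stand-in for the encryption case; it is NOT secure and is not the mechanism the description prescribes. The function names and the shared `key` are hypothetical, and a real deployment would use an established cipher, an enclave, or a secure multi-party computation protocol.

```python
import json
from itertools import cycle


def apply_security_mode(rows: list[dict], key: bytes) -> bytes:
    """Target-node side: serialize the result and mask it (toy XOR, NOT secure)."""
    plain = json.dumps(rows).encode("utf-8")
    return bytes(b ^ k for b, k in zip(plain, cycle(key)))


def remove_security_mode(blob: bytes, key: bytes) -> list[dict]:
    """Planning-node side: undo the masking and parse the result."""
    plain = bytes(b ^ k for b, k in zip(blob, cycle(key)))
    return json.loads(plain.decode("utf-8"))


key = b"demo-key"                          # hypothetical key known to the planning node
rows = [{"id": 1, "gene": "X"}]            # made-up final data processing result

protected = apply_security_mode(rows, key)         # what the last target node feeds back
query_result = remove_security_mode(protected, key)  # what the query node receives
```

The point of the example is only the inverse relationship: whatever transformation the target nodes apply, the planning node applies its counterpart before handing the result to the query node.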
When a query is required, a planning node acts as supervisor: it selects the relevant data owner nodes as the target nodes of the query, organizes the computation tasks of the target nodes into an ordered query plan, and coordinates execution of the plan, so that the final data processing result is fed back once all target nodes have executed the plan; the planning node then returns the final data processing result to the query node, completing the distributed data query. In this embodiment, the planning node generates a global ordered query plan and allocates computation tasks to the corresponding target nodes, and the query result is obtained by having the target nodes execute those tasks. Because the query plan is generated from the processing parameters of the target nodes, heterogeneous attributes such as trust relationships, threat levels and hardware support can be integrated in a heterogeneous environment, and each target node can process data according to its own security requirements or hardware support, so that data sharing works transparently across the heterogeneous environment. Because the target nodes execute their computation tasks in a security mode, the data can remain local and data leakage is avoided, the security of the data sources is not threatened, and private data can be shared safely and effectively. In this embodiment, keeping the data local and computing on it securely helps ensure data security and reduces the security threats to data in transit and at its destination; moreover, dispatching an algorithm is generally much faster than transmitting data, so data processing is faster.
On the basis of the above embodiment, the planning node specifically generates a query plan with a directed acyclic structure. Specifically, "generating an ordered query plan according to the processing parameters of all the target nodes" in step 102 includes:
Step B1: allocate one or more corresponding atomic operations to each target node, determine the dependency relationships among all the atomic operations, and generate a query plan with a directed acyclic structure according to those dependency relationships.
In the embodiment of the invention, the query plan with a directed acyclic structure is built from atomic operations as its basic units. An atomic operation is a basic operation in a query process, and may specifically be projection, selection, natural join, deduplication (set difference), union (set intersection), renaming, and the like. A dependency relationship between two atomic operations means that one of them needs data from the other when performing its data processing. In this embodiment, the dependency relationship is directional: if atomic operation A depends on atomic operation B, then atomic operation B does not depend on atomic operation A. Once the dependency relationships among all the atomic operations are determined, a query plan with a directed acyclic structure can be generated. Fig. 2 shows the structure of a query plan provided in this embodiment, represented as a directed acyclic graph (DAG). In Fig. 2, each circle represents an atomic operation, each directed edge represents a dependency relationship between two atomic operations, and each dashed box represents a target node. Fig. 2 thus contains five target nodes A, B, C, D and E, which are allocated 1, 3, 5, 4 and 3 atomic operations respectively; for example, target node B contains the three atomic operations b1, b2 and b3. Atomic operation a1 has a directed edge pointing to atomic operation b3, so atomic operation b3 depends on atomic operation a1.
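A directed acyclic plan of this kind can be sketched with Python's standard `graphlib` module. The example encodes only the Fig. 2 edges that the description states explicitly (a1, b1 and b2 each pointing to b3); the remaining operations of Fig. 2 are not reproduced because their edges are not given in the text. The `build_plan` helper and the `owner` mapping are illustrative names, not part of the source.

```python
from graphlib import CycleError, TopologicalSorter

# Maps each atomic operation to the set of operations it depends on.
dependencies: dict[str, set[str]] = {
    "a1": set(),
    "b1": set(),
    "b2": set(),
    "b3": {"a1", "b1", "b2"},
}

# Which target node each atomic operation is allocated to.
owner = {"a1": "A", "b1": "B", "b2": "B", "b3": "B"}


def build_plan(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises CycleError if deps are cyclic."""
    return list(TopologicalSorter(deps).static_order())


order = build_plan(dependencies)

# A cyclic dependency cannot form a valid query plan and is rejected.
try:
    build_plan({"x": {"y"}, "y": {"x"}})
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Any order returned places `b3` after `a1`, `b1` and `b2`, which is exactly the constraint the directed edges express; the relative order of the three independent operations is not fixed.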
Optionally, after determining the query plan with a directed acyclic structure, the planning node may instruct the target nodes to perform their computation tasks. Specifically, "instructing the target nodes to process data in sequence according to the query plan" in step 102 includes:
Step B2: instruct the current atomic operation of the target node to acquire the predecessor data set of each predecessor atomic operation, wherein a predecessor atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the predecessor data set is the data processing result produced by that predecessor atomic operation.
Step B3: perform the data processing corresponding to the current atomic operation on all the predecessor data sets of the current atomic operation, take the corresponding data processing result as the current data set of the current atomic operation, and, when a successor atomic operation exists, send the current data set to the successor atomic operation, wherein a successor atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points.
Step B4: repeat the above process until all the atomic operations have been traversed, and take the data set of the last atomic operation as the final data processing result.
In the embodiment of the invention, the planning node can send the query plan to the target nodes so that each target node knows which data processing it needs to perform, and the atomic operations perform data processing in sequence according to the query plan of the directed acyclic structure. In this embodiment, a target node that needs to perform data processing is taken as the current target node, and the current atomic operation of the current target node is determined; if the dependency relationship of some atomic operation points to the current atomic operation, that atomic operation is a preamble atomic operation of the current atomic operation. As in FIG. 2, atomic operation a1 points to atomic operation b3, i.e., the dependency relationship of atomic operation a1 points to atomic operation b3, so atomic operation a1 is a preamble atomic operation of atomic operation b3; similarly, atomic operations b1 and b2 are also preamble atomic operations of atomic operation b3.
If the current atomic operation has no preamble atomic operation, it is an initial atomic operation; atomic operations a1, b1, etc. in FIG. 2 are all initial atomic operations. In this embodiment, execution of the query plan may start from an initial atomic operation, that is, the current atomic operation is an initial atomic operation. Since there is no preamble atomic operation, the preamble data set is empty, so the initial atomic operation can only process locally stored data, and the data processing performed is the processing defined by that initial atomic operation. For example, if the initial atomic operation a1 is deduplication, a1 deduplicates the locally stored data and takes the result as the data set of a1, i.e., the current data set.
If the current atomic operation is not an initial atomic operation, that is, it has preamble atomic operations, the current atomic operation obtains the data sets of all its preamble atomic operations (the preamble data sets) and then performs the data processing corresponding to the current atomic operation on all of them; the resulting data processing result is the data set of the current atomic operation, i.e., the current data set. Optionally, the current atomic operation may process only the preamble data sets, or may jointly process the preamble data sets and the data stored in the local node.
Additionally, if the dependency relationship of the current atomic operation points to another atomic operation, that other atomic operation is a subsequent atomic operation of the current atomic operation; as in fig. 2, atomic operation b3 is a subsequent atomic operation of atomic operation a1, and atomic operation c5 is a subsequent atomic operation of atomic operation b3. If the current atomic operation has a subsequent atomic operation, the current atomic operation sends the generated current data set to the subsequent atomic operation, so that the subsequent atomic operation can in turn perform the above steps B2 and B3 as the current atomic operation. If the current atomic operation has no subsequent atomic operation, the data set it generates can be used as the final data processing result. In FIG. 2, atomic operation e3 is the last atomic operation of the directed acyclic structure, and the data set of atomic operation e3 can be the final data processing result. It will be understood by those skilled in the art that fig. 2 illustrates an example with a single last atomic operation e3 (i.e., an atomic operation without a subsequent atomic operation). In practical applications there may be a plurality of last atomic operations, i.e., a plurality of atomic operations without a subsequent atomic operation, in which case the data sets of all the last atomic operations may be used as the final data processing result. In this embodiment, traversal of all the atomic operations is taken as the end condition: when every atomic operation has performed its corresponding data processing, every atomic operation has completed the calculation task assigned to it, the query plan generated by the planning node has been completely executed, and the final result can be obtained.
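As a non-limiting illustration of steps B2 to B4, the following Python sketch executes a small plan in dependency order. The function `run_plan`, the three example operations, and the sample local data are all assumptions made for this example; the plan shape (a1, b1 and b2 feeding b3, which feeds e3) is a simplified, hypothetical variant of fig. 2.

```python
def run_plan(ops, edges, local_data):
    """Run every atomic operation once, in dependency order.

    ops:        name -> function(preamble_sets, local_rows) returning a set
    edges:      (predecessor, successor) pairs
    local_data: name -> locally stored rows used by that operation (optional)
    Returns the data sets of all operations that have no subsequent operation.
    """
    preds = {name: [] for name in ops}
    succs = {name: [] for name in ops}
    for p, s in edges:
        preds[s].append(p)
        succs[p].append(s)

    data_sets = {}
    pending = set(ops)
    while pending:
        # An operation is ready once all of its preamble data sets exist.
        ready = sorted(op for op in pending if all(p in data_sets for p in preds[op]))
        if not ready:
            raise ValueError("dependencies contain a cycle")
        for op in ready:
            preamble = [data_sets[p] for p in preds[op]]                 # step B2
            data_sets[op] = ops[op](preamble, local_data.get(op, []))    # step B3
            pending.remove(op)
    # Step B4: the data sets of the last operation(s) form the final result.
    return {op: data_sets[op] for op in ops if not succs[op]}


def dedup(preamble, local_rows):
    return set(local_rows)              # initial operation: local data only


def union(preamble, local_rows):
    return set().union(*preamble)       # combines all preamble data sets


def select_even(preamble, local_rows):
    return {x for s in preamble for x in s if x % 2 == 0}


ops = {"a1": dedup, "b1": dedup, "b2": dedup, "b3": union, "e3": select_even}
edges = [("a1", "b3"), ("b1", "b3"), ("b2", "b3"), ("b3", "e3")]
local_data = {"a1": [1, 1, 2], "b1": [2, 3], "b2": [3, 4, 4]}
result = run_plan(ops, edges, local_data)
print(result)  # {'e3': {2, 4}}
```

Because the return value collects every operation without a subsequent operation, the same sketch also covers the case of several last atomic operations.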
In this embodiment, an atomic operation $\delta$ may be represented as a tuple $\langle op_\delta, m_\delta, X_{\delta 1}, \ldots, X_{\delta m_\delta}, j, D_j \rangle$, where $op_\delta$ denotes the operation processing corresponding to the atomic operation $\delta$, such as projection or deduplication; $m_\delta$ denotes the number of preamble atomic operations of $\delta$; $X_{\delta i}$ denotes the preamble data set of the $i$-th preamble atomic operation of $\delta$; $j$ indicates that $\delta$ belongs to the $j$-th target node, with $j \in [1, n]$ and $n$ the number of target nodes; and $D_j$, as written here, denotes the data stored locally in the $j$-th target node. $D_j$ is optional.
Specifically, referring to FIG. 2, atomic operation a1 first performs its calculation task. Since a1 is an initial atomic operation and target node A is the first target node, the tuple of a1 is $\langle op_{a1}, 0, 1, D_1 \rangle$; that is, a1 performs the corresponding processing on the data $D_1$ stored in target node A, and the processing result is the data set $I_{a1}$ of a1, i.e., $I_{a1} = op_{a1}(D_1)$. Here the function $op_\delta(X)$ denotes the corresponding processing of the data $X$ according to the atomic operation $\delta$, and $I_\delta$ denotes the data set of the atomic operation $\delta$. After determining its data set $I_{a1}$, a1 can send $I_{a1}$ to its subsequent atomic operation, i.e., to atomic operation b3. The processing of the other initial atomic operations, such as b1 and b2, is similar and is not described again here. If the data processing is performed in the security mode, the function $op_\delta(X)$ denotes the corresponding processing of the data $X$ according to the atomic operation $\delta$ in the security mode.
For atomic operation b3, target node B is the second target node, i.e., $j = 2$, and b3 has three preamble atomic operations, so its tuple can be $\langle op_{b3}, 3, X_{b3,1}, X_{b3,2}, X_{b3,3}, 2, D_2 \rangle$, where $X_{b3,1}$, $X_{b3,2}$ and $X_{b3,3}$ are the preamble data sets of the three preamble atomic operations, namely the data sets of atomic operations a1, b1 and b2: $X_{b3,1} = I_{a1}$, $X_{b3,2} = I_{b1}$, $X_{b3,3} = I_{b2}$. Atomic operation b3 generates its data set $I_{b3}$ by processing the corresponding preamble data sets, and $I_{b3} = op_{b3}(X_{b3,1}, X_{b3,2}, X_{b3,3}, D_2)$.
The tuple takes the above form when atomic operation b3 needs to process the data $D_2$ stored in target node B. If atomic operation b3 does not need to process $D_2$, its tuple may be $\langle op_{b3}, 3, X_{b3,1}, X_{b3,2}, X_{b3,3}, 2 \rangle$, or $D_2$ in the tuple may be assigned null.
Repeating the above process, the last atomic operation e3 determines its data set $I_{e3}$, and the data set $I_{e3}$ is the final data processing result.
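As a non-limiting illustration, the tuple form of an atomic operation can be sketched in Python as a small record type. The class name `AtomicOp`, its field names, and the helper `union_all` are assumptions made for this example; they mirror the elements listed above (the operation, the preamble data sets and their count, the index j of the target node, and the optional local data) rather than any interface defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AtomicOp:
    op: Callable[..., frozenset]                 # the operation processing
    node_index: int                              # j: index of the owning target node
    preamble_sets: List[frozenset] = field(default_factory=list)
    local_data: Optional[frozenset] = None       # optional locally stored data

    @property
    def m(self) -> int:
        """Number of preamble atomic operations."""
        return len(self.preamble_sets)

    def data_set(self) -> frozenset:
        """Apply the operation to the preamble data sets and, if present, the local data."""
        inputs = list(self.preamble_sets)
        if self.local_data is not None:
            inputs.append(self.local_data)
        return self.op(*inputs)


def union_all(*sets):
    """Hypothetical operation used for illustration: union of all inputs."""
    return frozenset().union(*sets)


# Initial operation on the first target node: no preamble data sets, only local data.
a1 = AtomicOp(op=union_all, node_index=1, local_data=frozenset({1, 2}))

# Operation on the second target node with three preamble data sets and no local data.
b3 = AtomicOp(
    op=union_all,
    node_index=2,
    preamble_sets=[a1.data_set(), frozenset({3}), frozenset({4})],
)
print(a1.m, b3.m, sorted(b3.data_set()))  # 0 3 [1, 2, 3, 4]
```

Leaving `local_data` as `None` corresponds to the case in which the local data is omitted from the tuple or assigned null.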
In the embodiment of the invention, the planning node constructs a distributed query plan as a directed acyclic graph, expresses the task of each target node through atomic operations at the relational algebra level and the dependency relationships among them, and sends the atomic operations to the target nodes for data calculation, so that the plan can be optimized globally. The method can uniformly express the data sharing mode among nodes in an environment with inconsistent security characteristics, and realizes data sharing in a heterogeneous environment. Generating the user tasks into a global query plan facilitates global optimization of the plan, such as minimizing data transmission among nodes and dispatching computation to safer nodes.
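As a non-limiting illustration of one quantity such an optimization might consider, the following Python sketch counts the directed edges whose two atomic operations lie on different target nodes, since each such edge implies a transfer of a data set between nodes. The function names and the convention that the first letter of an operation name identifies its target node (as with b1, b2 and b3 on target node B) are assumptions made for this example.

```python
def node_of(op_name):
    """Assumed naming convention: 'b3' belongs to target node 'B'."""
    return op_name[0].upper()


def cross_node_edges(edges):
    """Return the dependency edges that require sending a data set to another node."""
    return [(p, s) for p, s in edges if node_of(p) != node_of(s)]


# Edges named in the description of fig. 2.
edges = [("a1", "b3"), ("b1", "b3"), ("b2", "b3"), ("b3", "c5")]
transfers = cross_node_edges(edges)
print(transfers)  # [('a1', 'b3'), ('b3', 'c5')]
```

A planner comparing two candidate plans could, under these assumptions, prefer the one with fewer cross-node edges.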
In the above embodiment, the query node needs to provide certain resources to perform the query operation. Specifically, after the step 102 "generating an ordered query plan according to the processing parameters of all the target nodes", the method further includes:
Step C1: Determine the corresponding query cost according to the query plan, and acquire query resources matched with the query cost.
Step C2: Allocate the query resources to the corresponding target nodes.
In the embodiment of the invention, after the planning node determines the query plan, it can calculate the cost of the target nodes executing the query plan, namely the query cost, and request the corresponding query resources from the query node. If the query node provides the query resources to the planning node, the planning node proceeds to issue the query plan to the target nodes so that they execute the corresponding calculation tasks; if the query node does not provide the query resources, the query process ends. In addition, after the query result is fed back to the query node, the planning node allocates the corresponding query resources to each target node. The query resources may specifically be resources in the form of a fee, resources in the form of credits, or other resources capable of rewarding the target nodes.
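As a non-limiting illustration of steps C1 and C2, the following Python sketch computes a query cost and distributes the received query resources. The cost model (a fixed unit cost per atomic operation) and the proportional allocation rule are assumptions made for this example, since the embodiment does not prescribe either; the operation counts are those given for target nodes A to E in fig. 2.

```python
def query_cost(ops_per_node, unit_cost=1.0):
    """Assumed cost model: every atomic operation costs `unit_cost`."""
    return unit_cost * sum(ops_per_node.values())


def allocate(resources, ops_per_node):
    """Assumed rule: split resources in proportion to the operations each node runs."""
    total = sum(ops_per_node.values())
    return {node: resources * count / total for node, count in ops_per_node.items()}


def settle(ops_per_node, offered):
    """Return the per-node allocation, or None if the query node provides too little."""
    cost = query_cost(ops_per_node)
    if offered is None or offered < cost:
        return None  # the query process ends
    return allocate(offered, ops_per_node)


ops_per_node = {"A": 1, "B": 3, "C": 5, "D": 4, "E": 3}
print(query_cost(ops_per_node))    # 16.0
print(settle(ops_per_node, 16.0))  # {'A': 1.0, 'B': 3.0, 'C': 5.0, 'D': 4.0, 'E': 3.0}
```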
Specifically, in this embodiment, the nodes in the distributed network are divided into three types: the query node, the planning node, and the data owner nodes. The overall query process is shown in fig. 3, where the target nodes are data owner nodes. In fig. 3, the target nodes are responsible for sharing data and executing calculation tasks, the query node initiates a data query or retrieval task, and the planning node constructs the tasks of the target nodes into a query plan and coordinates execution of the plan. A query is initiated by the query node; after the planning node coordinates execution by the target nodes, the query result is returned to the planning node and then to the query node.
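As a non-limiting illustration of the three roles, the following Python sketch has a planning-node function select target nodes, pass a partial result from one target node to the next, and return the final result for the query node. The function `run_query`, the dictionary layout of the data owner nodes, and the ordering of target nodes by name are assumptions made for this example; in the embodiment the order follows from the query plan generated from the processing parameters.

```python
def run_query(key, data_owner_nodes):
    """Planning-node role: choose target nodes, chain their processing, return the result."""
    # Select the data owner nodes that hold the requested attribute as target nodes.
    targets = [name for name in sorted(data_owner_nodes) if key in data_owner_nodes[name]]

    partial = set()
    for name in targets:
        # Target-node role: merge locally stored values into the result received so far.
        partial |= set(data_owner_nodes[name][key])

    # The last partial result is what the planning node returns to the query node.
    return targets, partial


data_owner_nodes = {
    "A": {"city": ["Paris", "Rome"]},
    "B": {"age": [30]},
    "C": {"city": ["Rome", "Oslo"]},
}
targets, answer = run_query("city", data_owner_nodes)
print(targets, sorted(answer))  # ['A', 'C'] ['Oslo', 'Paris', 'Rome']
```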
The above describes in detail the flow of the query method for distributed data, which may also be implemented by a corresponding apparatus, and the following describes in detail the structure and function of the apparatus.
The query device for distributed data provided by the embodiment of the invention can be specifically arranged in a query node. Referring to fig. 4, the query apparatus includes:
the preprocessing module 41 is configured to, after a query request is obtained, determine a plurality of target nodes corresponding to the query request, and determine a processing parameter of each target node;
the query plan module 42 is configured to generate an ordered query plan according to the processing parameters of all the target nodes, and instruct the target nodes to sequentially perform data processing according to the query plan;
and a result generating module 43, configured to generate a query result corresponding to the query request according to the last data processing result fed back by the last target node.
On the basis of the above embodiment, the generating an ordered query plan by the query plan module 42 according to the processing parameters of all the target nodes includes:
and distributing one or more corresponding atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations.
On the basis of the above embodiment, the query plan module 42 instructs the target node to sequentially perform data processing according to the query plan, including:
indicating the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operations with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation is subjected to data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above-described embodiments,
the query plan module 42 instructs the target node to sequentially perform data processing according to the query plan, including: indicating the target node to sequentially perform data processing according to the query plan in a preset security mode to generate a data processing result conforming to the security mode;
the generating a query result corresponding to the query request by the result generating module 43 according to the last data processing result fed back by the last target node includes: and performing security mode removal processing on the last data processing result which is fed back by the target node and accords with the security mode, and taking the processing result after security mode removal processing as the query result corresponding to the query request.
On the basis of the above embodiment, the apparatus further includes a resource allocation module;
after the query plan module 42 generates the ordered query plan according to the processing parameters of all the target nodes, the resource allocation module is configured to:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost; and allocating the query resources to the corresponding target nodes.
When a query needs to be performed, the planning node acts as a supervisor: it selects the corresponding data owner nodes as the target nodes of the query process, constructs the calculation tasks of the target nodes into an ordered query plan, and coordinates execution of the query plan. After the query plan is executed, the target nodes feed back the final data processing result, and the planning node returns the final data processing result to the query node, completing the query process for the distributed data. In this embodiment, the planning node generates a global ordered query plan and allocates calculation tasks to the corresponding target nodes, and the corresponding query result is obtained by the target nodes executing the calculation tasks. Because the query plan is generated based on the processing parameters of the target nodes, heterogeneous attributes such as trust relationship, threat level and hardware support can be fused in a heterogeneous environment, and each target node can process data according to its own security requirements or hardware support, so that data sharing is transparent to the heterogeneous environment. When the target nodes execute the calculation tasks in the security mode, the data can remain local and data leakage can be avoided, so the security of the data source is not threatened and private data can be shared safely and effectively.
Based on the same inventive concept, an embodiment of the present invention further provides a distributed system. As shown in fig. 5, the distributed system includes a planning node 51 and a plurality of data owner nodes 52, where the planning node 51 is a trusted node and each data owner node 52 may be directly or indirectly connected to the planning node 51.
The planning node 51 is configured to, after acquiring the query request, determine a plurality of target nodes corresponding to the query request from the data owner nodes 52, and determine processing parameters of the target nodes, where the processing parameters include hardware parameters of the target nodes and trust relationships between the target nodes and other target nodes;
the planning node 51 is further configured to generate an ordered query plan according to the processing parameters of all the target nodes, and send the query plan to the target nodes;
the target node is configured to perform data processing according to the query plan, and send a data processing result to other target nodes until the last target node sends a last data processing result to the planning node 51;
the planning node 51 generates a query result corresponding to the query request according to the final data processing result.
On the basis of the above embodiment, the generating, by the planning node 51, an ordered query plan according to the processing parameters of all the target nodes includes:
the planning node 51 allocates one or more corresponding atomic operations to each target node, determines the dependency relationship among all the atomic operations, and generates a query plan with a directed acyclic structure according to the dependency relationship among all the atomic operations.
On the basis of the above embodiment, the data processing by the target node according to the query plan includes:
a current atomic operation of the target node acquires a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
On the basis of the above embodiment, referring to fig. 6, the distributed system further includes a query node 52 for initiating the query request;
after the planning node 51 generates an ordered query plan according to the processing parameters of all the target nodes, the planning node 51 determines corresponding query cost according to the query plan, and sends the query cost to the query node 52;
the query node 52 feeds back the query resource matched with the query cost to the planning node 51; after receiving the query resource, the planning node 51 allocates the query resource to the corresponding target node.
In the distributed system provided by the embodiment of the present invention, the nodes are divided into three types, that is, the query node 52, the planning node 51, and the data owner nodes 52, and all or some of the data owner nodes 52 in fig. 5 may be used as target nodes. In this embodiment, a node may act as a different type of node in different query tasks; for example, a data owner node 52 may also initiate a query as a query node 52. In addition, fig. 5 and 6 only schematically show the structure of the distributed system, and do not limit the distributed system to this architecture; for example, the query node may connect to the planning node indirectly through other data owner nodes and initiate a query. Meanwhile, fig. 5 and fig. 6 show the allowable communication connection relationships between the respective nodes, and do not indicate that the planning node must generate the query plan according to all of the connection relationships between the data owner nodes shown in the figures. For a detailed description of the distributed system, reference may be made to the embodiments corresponding to fig. 1 to fig. 3, which are not described herein again.
The distributed system provided by this embodiment may generate a global ordered query plan based on the trusted planning node, allocate calculation tasks to the corresponding target nodes, and obtain the corresponding query result by the target nodes executing the calculation tasks. Because the query plan is generated based on the processing parameters of the target nodes, heterogeneous attributes such as trust relationship, threat level and hardware support can be fused in a heterogeneous environment, and each target node can process data according to its own security requirements or hardware support, so that data sharing is transparent to the heterogeneous environment. When the target nodes execute the calculation tasks in the security mode, the data can remain local and data leakage can be avoided, so the security of the data source is not threatened and private data can be shared safely and effectively.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the modifications or alternative embodiments within the technical scope of the present invention, and shall be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A query method of distributed data is characterized by comprising the following steps:
after a query request is obtained, determining a plurality of target nodes corresponding to the query request, and determining a processing parameter of each target node;
generating an ordered query plan according to the processing parameters of all the target nodes, and indicating the target nodes to sequentially perform data processing according to the query plan;
and generating a query result corresponding to the query request according to the final data processing result fed back by the final target node.
2. The method of claim 1, wherein generating an ordered query plan based on the processing parameters of all of the target nodes comprises:
and distributing one or more corresponding atomic operations for each target node, determining the dependency relationship among all the atomic operations, and generating a query plan of a directed acyclic structure according to the dependency relationship among all the atomic operations.
3. The method of claim 2, wherein the instructing the target node to perform data processing in sequence according to the query plan comprises:
indicating the current atomic operation of the target node to acquire a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is other atomic operations with a dependency relationship pointing to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation is subjected to data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
4. The method of claim 1,
the instructing the target node to sequentially perform data processing according to the query plan includes: indicating the target node to sequentially perform data processing according to the query plan in a preset security mode to generate a data processing result conforming to the security mode;
the generating a query result corresponding to the query request according to the last data processing result fed back by the target node includes: and performing security mode removal processing on the last data processing result which is fed back by the target node and accords with the security mode, and taking the processing result after security mode removal processing as the query result corresponding to the query request.
5. The method of claim 1, further comprising, after said generating an ordered query plan according to processing parameters of all of said target nodes:
determining corresponding query cost according to the query plan, and acquiring query resources matched with the query cost;
and allocating the query resources to the corresponding target nodes.
6. An apparatus for querying distributed data, comprising:
a preprocessing module, configured to determine, after a query request is obtained, a plurality of target nodes corresponding to the query request, and to determine processing parameters of each target node;
a query plan module, configured to generate an ordered query plan according to the processing parameters of all the target nodes, and to instruct the target nodes to sequentially perform data processing according to the query plan;
and a result generation module, configured to generate a query result corresponding to the query request according to the last data processing result fed back by the target node.
7. A distributed system, comprising: a planning node and a plurality of data owner nodes, wherein the planning node is a trusted node;
the planning node is used for determining a plurality of target nodes corresponding to the query request from the data owner nodes after the query request is obtained, and determining processing parameters of the target nodes, wherein the processing parameters comprise hardware parameters of the target nodes and trust relations between the target nodes and other target nodes;
the planning node is also used for generating an ordered query plan according to the processing parameters of all the target nodes and sending the query plan to the target nodes;
the target node is used for processing data according to the query plan and sending data processing results to other target nodes until the last target node sends the last data processing result to the planning node;
and the planning node generates a query result corresponding to the query request according to the final data processing result.
8. The distributed system of claim 7 wherein the planning node generating an ordered query plan based on the processing parameters of all of the target nodes comprises:
and the planning node allocates one or more corresponding atomic operations to each target node, determines the dependency relationship among all the atomic operations, and generates a query plan with a directed acyclic structure according to the dependency relationship among all the atomic operations.
9. The distributed system of claim 8, wherein the target node performing data processing according to the query plan comprises:
a current atomic operation of the target node acquires a preamble data set of a preamble atomic operation, wherein the preamble atomic operation is another atomic operation whose dependency relationship points to the current atomic operation, and the preamble data set is a data processing result determined after the preamble atomic operation performs data processing;
performing data processing corresponding to the current atomic operation on all the preamble data sets of the current atomic operation, taking a corresponding data processing result as the current data set of the current atomic operation, and sending the current data set to a subsequent atomic operation when the subsequent atomic operation exists, wherein the subsequent atomic operation is another atomic operation to which the dependency relationship of the current atomic operation points;
and repeating the above process until all the atomic operations are traversed, and taking the data set of the last atomic operation as the final data processing result.
10. The distributed system of any one of claims 7-9, further comprising a query node for initiating the query request;
after the planning node generates an ordered query plan according to the processing parameters of all the target nodes, the planning node determines corresponding query cost according to the query plan and sends the query cost to the query node;
the query node feeds back query resources matched with the query cost to the planning node; and after receiving the query resources, the planning node allocates the query resources to the corresponding target nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911176025.6A CN110955701B (en) | 2019-11-26 | 2019-11-26 | Distributed data query method, device and distributed system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911176025.6A CN110955701B (en) | 2019-11-26 | 2019-11-26 | Distributed data query method, device and distributed system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110955701A true CN110955701A (en) | 2020-04-03 |
CN110955701B CN110955701B (en) | 2023-04-25 |
Family
ID=69977076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911176025.6A Active CN110955701B (en) | 2019-11-26 | 2019-11-26 | Distributed data query method, device and distributed system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110955701B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7984043B1 (en) * | 2007-07-24 | 2011-07-19 | Amazon Technologies, Inc. | System and method for distributed query processing using configuration-independent query plans |
CN104063486A (en) * | 2014-07-03 | 2014-09-24 | 四川中亚联邦科技有限公司 | Big data distributed storage method and system |
CN105404690A (en) * | 2015-12-16 | 2016-03-16 | 华为技术服务有限公司 | Database querying method and apparatus |
CN105608077A (en) * | 2014-10-27 | 2016-05-25 | 青岛金讯网络工程有限公司 | Big data distributed storage method and system |
CN107301205A (en) * | 2017-06-01 | 2017-10-27 | 华南理工大学 | A kind of distributed Query method in real time of big data and system |
CN110263105A (en) * | 2019-05-21 | 2019-09-20 | 北京百度网讯科技有限公司 | Inquiry processing method, query processing system, server and computer-readable medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021254288A1 (en) * | 2020-06-14 | 2021-12-23 | Wenfei Fan | Querying shared data with security heterogeneity |
CN114518850A (en) * | 2022-02-23 | 2022-05-20 | 云链网科技(广东)有限公司 | Secure deduplication storage system with deduplication before encryption, based on trusted execution protection |
CN114518850B (en) * | 2022-02-23 | 2024-03-12 | 云链网科技(广东)有限公司 | Secure deduplication storage system with deduplication before encryption, based on trusted execution protection |
Also Published As
Publication number | Publication date |
---|---|
CN110955701B (en) | 2023-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9846589B2 (en) | | Virtual machine placement optimization with generalized organizational scenarios |
KR101906912B1 (en) | | Cloud-based build service |
Aguilar et al. | | Task assignment and transaction clustering heuristics for distributed systems |
Hutter et al. | | Parallel algorithm configuration |
CN110874484A (en) | | Data processing method and system based on neural network and federated learning |
Pérez et al. | | A Newton-based heuristic algorithm for multi-objective flexible job-shop scheduling problem |
US10621002B2 (en) | | Iterative task centric resource scheduling for a user program between different computing frameworks |
CN110955701B (en) | | Distributed data query method, device and distributed system |
US8046759B2 (en) | | Resource allocation method and system |
WO2015127667A1 (en) | | Executing foreign program on parallel computing system |
CN107430557B (en) | | Multi-party encryption cube processing device, method and system |
Konur et al. | | Military system of systems architecting with individual system contracts |
CN108134848B (en) | | SOA system resource optimization method based on graph theory K-segmentation |
Tompkins | | Optimization techniques for task allocation and scheduling in distributed multi-agent operations |
US7454749B2 (en) | | Scalable parallel processing on shared memory computers |
Furugyan | | Computation planning in multiprocessor real time automated control systems with an additional resource |
CN109791502B (en) | | Peer-to-peer distributed computing system for heterogeneous device types |
Nagarajan et al. | | An algorithm for cooperative task allocation in scalable, constrained multiple robot systems |
US20230003753A1 (en) | | Systems and methods for managing experimental requests at remote laboratories |
Kononov et al. | | Control of a Complex of Works in Multiprocessor Real-time ACS |
CN114356511A (en) | | Task allocation method and system |
Pop et al. | | The Art of Scheduling for Big Data Science. |
CN110955726B (en) | | Method and device for determining distributed cost, storage medium and electronic equipment |
WO2016110461A1 (en) | | Parallel data streaming between cloud-based applications and massively parallel systems |
Amaris et al. | | Generic algorithms for scheduling applications on heterogeneous multi-core platforms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||