WO2021254288A1 - Querying shared data with security heterogeneity

Info

Publication number: WO2021254288A1
Authority: WO (WIPO (PCT))
Prior art keywords: data, sites, toll, security, plan
Application number: PCT/CN2021/099861
Other languages: French (fr)
Inventors: Wenfei Fan; Yang Cao
Original assignee: Wenfei Fan
Application filed by Wenfei Fan
Publication of WO2021254288A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G06F16/245 - Query processing
    • G06F16/2453 - Query optimisation
    • G06F16/24534 - Query rewriting; Transformation
    • G06F16/24542 - Plan optimisation
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • The execution of the query plan is mediated and monitored by the query planner, and it takes place at the participating sites only. After executing the query plan, the sites send the results to the query planner, who then decrypts and returns them to the client who issued the query.
  • Toll function Toll (). Toll (ξ) estimates the charge for executing plan ξ. It consists of the following: Toll_0, an upfront charge that depends on the type of the capsule; Toll_d, the communication charge for transferring shared data; and Toll_c, the computation charge for sustaining the capsule while executing op. These charges stem from the security protocol between S_i and S_j, and are reflected as costs incurred by the security facility employed, including, e.g., encryption cost, beyond their normal scope.
  • Example 2 [Case study]. Continuing with Example 1, we show that applications A1 and A2 can be abstracted by DShare. Denote by (a) Household (address, pid, name) the registration relation maintained by the government, (b) Reg (pid, clinic) and EMR (pid, disease) the clinic registration relation and medical record relation, respectively, owned by the clinics and hospitals, and (c) Insurance (pid, company, policy) the customer records maintained by insurance firms. We present their data sharing pacts ρ with the associated toll functions.
  • For A1, the pact ρ specifies the lowest security level for a capsule C. Based on this, the query planner may pick a Docker container as C, and remote data will be uploaded to C directly. The shared data will not be stored after the computation, since a capsule is ephemeral. This meets the security requirements of A1, since clinics are trusted and do not risk side-channel leakage.
  • The toll function estimates the costs of the available capsules C, and is determined by the type, configuration and duration of the capsule to use (which in turn depend on the operations to be carried out in C). For A1: Toll_0 = 0, since Docker incurs negligible upfront cost; Toll_(i, j) (X_i) = c_ij · |X_i| is the communication cost for transferring X_i from S_i to the Docker container at S_j (c_ij is a coefficient denoting the unit network price); and Toll_c (with a coefficient c similar to c_ij) is the cost of sustaining the Docker container for the join.
  • For A2, the pact ρ requires a higher security level for data sharing, as described in Example 1. Accordingly, the query planner may take enclaves as capsules. Hospital data is required to be encrypted using HOM [19] before uploading it to insurance enclaves, to prevent leakage of user identities and their EMR records.
  • Toll functions can be readily deduced from the types of capsules and the complexity of the operations. Alternatively, as a common practice, they can also be empirically estimated by testing or learning over small dataset samples.
  • Tolls often denote the economic incentives defined by smart contracts for consensus.
  • There has also been recent work from the security community on toll (cost) models of hybrid protocols, e.g., [14, 25]. Note that toll functions only specify the toll charges of all possible capsule usages in an application; they do not determine which capsules are to be used, or where and how.
  • DShare allows arbitrary positive polynomial functions as Toll (ξ) that are composed of submodular set functions [21] for Toll_(i, j) (X) (of Toll_d) and Toll_c (bi-modular [18] if the op is a binary operator, e.g., join). All toll functions in Example 2 are of this type. Our industry partners find that these suffice to express common security charges in practice.
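To make the toll model concrete, the following is a minimal sketch of a toll function of the shape DShare admits: an upfront charge Toll_0 per capsule type, a sustaining charge Toll_c, and a communication charge Toll_(i, j) that is submodular in the set of shared relations. The capsule catalogue, coefficients, and the particular coverage-style submodular form are illustrative assumptions, not taken from the patent.

```python
# Illustrative sketch of a DShare-style toll function; the coefficient
# values, capsule types and the concrete submodular form are assumptions
# for illustration, not taken from the patent.

UPFRONT = {"docker": 0.0, "enclave": 5.0}       # Toll_0 per capsule type
SUSTAIN_RATE = {"docker": 0.1, "enclave": 0.5}  # rate for Toll_c

def toll_d(c_ij, sizes_gb):
    """Communication toll Toll_(i,j)(X): a crude coverage-style submodular
    function of the set X of shared relations (duplicate sizes are only
    charged once, so adding a relation never helps more in a larger set)."""
    return c_ij * sum(set(sizes_gb))

def toll(plan_ops):
    """Toll(xi): sum the three components over the atomic operations."""
    total = 0.0
    for capsule, c_ij, sizes_gb, duration in plan_ops:
        total += UPFRONT[capsule]                  # Toll_0
        total += toll_d(c_ij, sizes_gb)            # Toll_d
        total += SUSTAIN_RATE[capsule] * duration  # Toll_c
    return total

print(toll([("docker", 1.0, [2.0, 2.0, 1.0], 3.0)]))  # 3.3
```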
  • The data owners are semi-honest (a.k.a. honest but curious), i.e., each owner will faithfully execute the query plan, but may try to derive information about other parties' data. This is why all operations are executed in capsules and the intermediate relations are stored in protected mode.
  • The query planner runs at a trusted third party and is assumed trustworthy, similar to the honest broker in secure database systems, e.g., [35, 9], but without direct access to data.
  • DShare enforces the security protocols of a data sharing pact ρ by (a) the trusted query planner, and (b) proper capsules for carrying out query plans. The pact specifies the minimum security requirement between each pair S_i and S_j of sites, and the query planner enforces the security guarantee by picking the right capsules.
  • Each operation (op, t_c, X_1, ..., X_n, j) is performed by a capsule that meets the maximum of all the lowest security levels for (S_i, S_j) (i ∈ [1, n]). Follow-up operations retain security levels that are no lower.
  • Among qualifying capsules, the planner picks those that meet the security requirements and minimize the cost; e.g., it picks Docker containers for A1 of Example 1, which satisfy the security requirements specified by the protocol and are cheaper than enclaves and SMC systems.
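The capsule-picking rule above (take the maximum of all pairwise minimum levels, then choose the cheapest qualifying capsule) can be sketched as follows; the numeric security levels, the capsule catalogue and its costs are hypothetical, not from the patent.

```python
# Sketch of how a planner could pick the cheapest capsule satisfying the
# maximum of all pairwise minimum security levels; the numeric levels,
# capsule catalogue and costs are hypothetical.

CAPSULES = [("docker", 1, 1.0), ("enclave", 2, 5.0), ("smc", 3, 20.0)]
# (type, security level attained, cost)

def pick_capsule(min_level, host_j, input_sites):
    """min_level[(i, j)]: lowest level a capsule at S_j must attain to
    receive data of S_i, as fixed by the pact's security protocols."""
    required = max(min_level[(i, host_j)] for i in input_sites)
    for ctype, level, cost in sorted(CAPSULES, key=lambda c: c[2]):
        if level >= required:  # cheapest capsule meeting the requirement
            return ctype, cost
    raise ValueError("no capsule satisfies the pact")

min_level = {(1, 3): 1, (2, 3): 2}  # e.g. S_2's data needs an enclave at S_3
print(pick_capsule(min_level, host_j=3, input_sites=[1, 2]))  # ('enclave', 5.0)
```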
  • TBQA (toll-bounded query answering) takes as input a query Q, a distributed instance D, a data sharing pact ρ and a toll budget B. Output: a distributed plan ξ for Q over D such that the toll Toll (ξ) of ξ over D under ρ is no larger than B. Note that a data sharing pact ρ specifies only the minimum security requirements for sharing data between sites and their associated toll charges.
  • The parallel execution cost of a plan is defined inductively: if the plan is a single atomic operation δ, its cost is that of δ; if it consists of sub-plans ξ_1, ..., ξ_l and an atomic operation δ, where ξ_1, ..., ξ_l are the predecessors of δ in ξ, then its cost is the maximum cost over ξ_1, ..., ξ_l plus the cost of δ (see the sketch below).
  • TBQA_d denotes the decision version of TBQA: given the same input as TBQA and an additional number L, decide whether there exists a distributed query plan ξ for Q over D such that its toll is at most B and its parallel execution cost is at most L.
  • TBA, the toll-bounded answerability problem, is to check whether TBQA has a feasible solution at all; thus TBQA is at least as hard as TBA.
  • Theorem 1. Both TBQA_d and TBA are intractable.
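The inductive parallel-execution-cost just described reads naturally as a short recursion over the plan DAG. The sketch below assumes that inductive reading (maximum over predecessor sub-plans plus the operation's own cost); the node costs are invented.

```python
# Minimal sketch of the inductive parallel execution cost of a DAG plan:
# cost(xi) = cost(delta) for a single atomic operation, and
# cost(xi) = max_i cost(xi_i) + cost(delta) when delta has predecessor
# sub-plans xi_1, ..., xi_l.  (The recursion shape is inferred from the text.)
from functools import lru_cache

preds = {"scan1": [], "scan2": [], "join": ["scan1", "scan2"], "proj": ["join"]}
op_cost = {"scan1": 2.0, "scan2": 3.0, "join": 5.0, "proj": 1.0}

@lru_cache(maxsize=None)
def cost(delta):
    ps = preds[delta]
    if not ps:                  # a single atomic operation
        return op_cost[delta]
    return max(cost(p) for p in ps) + op_cost[delta]

print(cost("proj"))  # max(2, 3) + 5 + 1 = 9.0
```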
  • Step (1) Finding toll-minimized canonical plans (Section 4). The canonical plan ξ_Q extends the algebra tree T_Q (cf. [3]) of Q into a DAG by replacing each algebra operation op of Q with a distributed query plan ξ_op that has minimized toll. Note that here an edge from op_1 to op_2 of Q in T_Q may be extended to multiple edges, which connect atomic operations in ξ_op_1 to those in ξ_op_2 based on their dependencies. The size of ξ_Q is measured as its total number of bits.
  • Step (2) Checking answerability. If the toll Toll (ξ_Q) of ξ_Q exceeds the toll budget B, Q cannot be answered within the budget.
  • Step (3) Reducing parallel execution cost (Section 5). Otherwise, the remaining budget B - Toll (ξ_Q) is used to rebalance ξ_Q and reduce its parallel execution cost.
  • Theorem 2. There exists a PTIME O (log n)-approximation algorithm for computing plans with minimum toll for joins.
  • One can readily verify that the reduction is approximation-preserving [7] and is in O (n^2) time.
  • The set covering specifies the assignment of the atomic operations for the work units to their host sites (line 4). The capsule types of the atomic operations are picked such that they minimize the cost while satisfying all the relevant security levels specified in the protocols (line 5).
  • Example 3. Recall the query for A2 given in Example 2, denoted by Q. Assume the simplified data sharing scenario shown in Fig. 8. We show how algorithm MTJ generates the distributed query plan ξ_Q depicted in Fig. 9 for Q.
  • Operation op_1 has a set of 4 work units u_ij for all i ∈ {1, 2}, j ∈ {3, 4} (see Fig. 9). The candidate set consists of pairs (i, W) of a host site and a set of work units; MTJ picks (3, {u_13, u_23}) and (4, {u_14, u_24}) as a cover of all work units with total toll 0. It then interprets the cover as consisting of the atomic joins for I_32 and I_42 of Fig. 9. Note that this is actually the optimal plan for op_1, since it incurs no toll at all.
  • Although Algorithm 1 is an O (log n)-approximation, it has exponential time complexity, since the candidate set of line 3 is of size exponential in the number of work units. The algorithm therefore conducts a numeric search instead; the search terminates when its range [a, b] has a sufficiently small gap (b - a).
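The covering step at the core of MTJ can be illustrated with the classical greedy heuristic for weighted set cover, which is the textbook O(log n)-approximation; the work units and the candidate (site, cover, toll) triples below are invented to mirror Example 3, and this is a sketch of the covering idea rather than the patent's exact procedure.

```python
# Greedy weighted set cover, the classical O(log n)-approximation that a
# covering step like MTJ's relies on; candidates, tolls and work units
# are made up to echo Example 3.

def greedy_cover(work_units, candidates):
    """candidates: list of (site, covered_units, toll). Repeatedly pick the
    candidate with the best toll per newly covered work unit."""
    uncovered, plan = set(work_units), []
    while uncovered:
        site, units, toll = min(
            (c for c in candidates if c[1] & uncovered),
            key=lambda c: c[2] / len(c[1] & uncovered))
        plan.append((site, units & uncovered))
        uncovered -= units
    return plan

units = {"u13", "u23", "u14", "u24"}
cands = [(3, {"u13", "u23"}, 0.0), (4, {"u14", "u24"}, 0.0),
         (1, {"u13", "u14", "u23", "u24"}, 7.0)]
print(greedy_cover(units, cands))  # picks sites 3 and 4 at total toll 0
```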
  • Plan rebalancing is motivated by the following. Consider the sub-plan of ξ_Q for an operation op_i of Q. This sub-plan is generated to minimize toll (Section 4) and hence could be imbalanced. Observe that the parallel execution cost is dominated by the maximum cost of the individual sites (Section 2.2); hence imbalanced workloads increase it.
  • Rebalancing works by iteratively applying an atomic balancing operator, denoted ⊕_b below, to optimize each sub-plan under its allocated toll budget B_i (Section 4) for each operation op_i, such that (a) the optimized sub-plan is guaranteed to have a lower cost than the original, and (b) it incurs at most B_i of toll. That is, ⊕_b makes use of the toll allowance B_i on op_i, and re-distributes the work units handled by the sub-plan in a more balanced and optimized way, to reduce its cost.
  • Given a sub-plan, operator ⊕_b works in two phases: (1) it first re-distributes the work units of the sub-plan across the n sites subject to the toll budget B_i allocated to op_i, yielding a plan that is guaranteed to reduce the cost; and (2) it then prepares the answers of the optimized sub-plan for the sub-plan subsequent to it. Phase (2) is carried out by simply recovering the input distribution expected by the subsequent sub-plan; it ensures compatibility, since the subsequent sub-plan works with a certain input distribution (i.e., the distribution of the answer of its predecessor) due to the heterogeneous security protocols (see Section 2.1).
  • For phase (1) of ⊕_b, we parameterize ⊕_b with an integer k that controls the degree of changes: the larger k is, the larger the cost reduction, but the more toll is consumed. We denote by ⊕_b [k] the operator ⊕_b instantiated with k; applying ⊕_b [k] selects k work units of the sub-plan for re-distribution, to reduce its parallel execution cost.
  • Algorithm ReBal. The algorithm, denoted by ReBal, works as follows. Given a sub-plan of ξ_Q computed in Section 4, a database D over the n sites, a data sharing pact ρ and the parameter k for ⊕_b, ReBal returns an optimized sub-plan by re-distributing k of its work units.
  • ReBal first (a) identifies a set of k bottleneck work units for op_i, and then (b) re-distributes them. It does (a) by iteratively identifying bottleneck sites and adding their bottleneck work units to the set, where bottleneck sites are those whose work units have the maximum cost among all sites. It carries out (b) by assigning the selected work units one by one to the sites with the least workload w.r.t. the cost of executing all work units of op_i.
  • Algorithm ReBal is near-optimal among all algorithms of its kind. More specifically, consider the class of algorithms that optimize a sub-plan by selecting and re-distributing k of its work units. Then we have the following: ReBal is a 2-approximation of the optimal algorithm in this class, and is in O (n^2 log n) time.
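A minimal sketch of ReBal's two steps under simplifying assumptions: (a) peel the k most expensive work units off the bottleneck sites, then (b) reassign them one by one to the currently least-loaded site. The unit costs are invented, and the toll accounting for the moves is omitted for brevity.

```python
# Sketch of a ReBal-style rebalance: (a) take the k costliest work units
# from bottleneck sites, (b) reassign them greedily to the least-loaded
# site.  Toll accounting for the moves is deliberately left out.
import heapq

def rebal(assignment, unit_cost, k):
    """assignment: {site: [work units]}.  Returns a rebalanced copy."""
    assignment = {s: list(us) for s, us in assignment.items()}
    moved = []
    for _ in range(k):  # (a) peel max-cost units off the bottleneck site
        bottleneck = max(assignment,
                         key=lambda s: sum(unit_cost[u] for u in assignment[s]))
        u = max(assignment[bottleneck], key=unit_cost.get)
        assignment[bottleneck].remove(u)
        moved.append(u)
    load = [(sum(unit_cost[u] for u in us), s) for s, us in assignment.items()]
    heapq.heapify(load)
    for u in sorted(moved, key=unit_cost.get, reverse=True):
        l, s = heapq.heappop(load)   # (b) least-loaded site first
        assignment[s].append(u)
        heapq.heappush(load, (l + unit_cost[u], s))
    return assignment

cost = {"u1": 5, "u2": 4, "u3": 1, "u4": 1}
print(rebal({"s1": ["u1", "u2", "u3"], "s2": ["u4"]}, cost, k=2))
```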
  • The following provides various embodiments for querying shared data with security heterogeneity.
  • FIG. 1 illustrates a flow diagram of an exemplary method 100 for querying shared data with security heterogeneity, according to an aspect.
  • A SQL query for shared data over a plurality of sites can be received.
  • The SQL query can be issued by any of the sites.
  • The sites support query services collectively over their private data, and each site manages its data with its own DBMS and has its own local database schema.
  • The SQL query can also be obtained from a client, e.g., via a user input, a computer program or a client device.
  • The data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between them.
  • A security protocol for a pair of sites specifies the lowest security guarantees and the encryption scheme of the data, under which one site can share the data of the other.
  • The data sharing pact further comprises a toll function that measures the parallel execution cost of the distributed query plan.
  • Executing the distributed query plan at the sites and returning the results for the SQL query are mediated and monitored by a trusted third party, and execution takes place at the sites only. After executing the distributed query plan, the sites send the results to the trusted third party, who then decrypts and returns them to the client that issued the SQL query.
  • The disclosed embodiment proposes an approach to query answering under heterogeneous security models, which defines query plans by incorporating data sharing agreements and the use of various security facilities.
  • The embodiment aims to demonstrate the need, challenges and feasibility of querying shared data with security heterogeneity.
  • FIG. 2 illustrates a flow diagram of an exemplary method 200 for generating a distributed query plan, according to an aspect.
  • Generating a canonical plan, which consists of toll-minimized sub-plans for each operation of the SQL query: the method first generates a distributed query plan for the SQL query in a canonical form. That is, it extends the algebra tree of the SQL query into a DAG by replacing each algebra operation of the SQL query with a distributed query sub-plan of that operation whose toll has been minimized.
  • Optimizing the canonical plan to reduce its parallel execution cost by rebalancing the toll budgets of the sub-plans: this distributes a total toll budget over all sub-plans of the query plan so that the total cost reduction of the query plan is maximized.
  • The embodiment thus generates a toll-minimized query plan for the SQL query in canonical form and further optimizes the plan to reduce both the data sharing toll and the parallel execution cost.
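The two operations of method 200 suggest the following skeleton of a planner driver. SubPlan, generate_subplan and rebalance are hypothetical stand-ins for the MTJ and ReBal procedures sketched earlier, not interfaces defined by the patent.

```python
# Skeleton of the two-phase planner of method 200.  SubPlan,
# generate_subplan and rebalance are illustrative stand-ins, not the
# patent's actual interfaces.
from dataclasses import dataclass

@dataclass
class SubPlan:
    op: str
    toll: float   # toll charged for this sub-plan
    cost: float   # its parallel execution cost

def generate_subplan(op, pact):  # stand-in for the MTJ step
    return SubPlan(op, toll=pact.get(op, 1.0), cost=10.0)

def rebalance(sp, allowance):    # stand-in for ReBal: trade toll for cost
    used = min(allowance, 2.0)
    return SubPlan(sp.op, sp.toll + used, sp.cost - 3.0 * used), used

def plan_query(algebra_ops, pact, budget_B):
    # Phase 1: canonical plan = one toll-minimized sub-plan per operation.
    subplans = [generate_subplan(op, pact) for op in algebra_ops]
    leftover = budget_B - sum(sp.toll for sp in subplans)
    if leftover < 0:
        raise RuntimeError("not answerable within the toll budget")
    # Phase 2: spend the leftover budget on the costliest sub-plans first.
    out = []
    for sp in sorted(subplans, key=lambda s: s.cost, reverse=True):
        sp, used = rebalance(sp, leftover)
        leftover -= used
        out.append(sp)
    return out

print(plan_query(["join", "project"], {"join": 2.0}, budget_B=6.0))
```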
  • FIG. 3 illustrates a flow diagram of an exemplary method 300 for executing a distributed query plan, according to an aspect.
  • The logic unit can be a Docker container, an enclave, an SMC system or a trusted third party. Data is shared using the logic units, to which the sites can transfer and upload datasets. Each logic unit is hosted by a site and used to perform all computations. The computation in each logic unit has direct access to the data at the site associated with it, but cannot access data at other sites except the part that is uploaded to it.
  • The result of the operation can be stored in a protected mode, e.g., encrypted with OPE, guarded by access right controls, or encrypted with a symmetric scheme whose keys are held by the query issuer.
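As one concrete reading of "protected mode", the sketch below stores an operation's result encrypted under a symmetric key held by the query issuer. It uses Fernet from the third-party cryptography package merely as a stand-in for the OPE or symmetric schemes named above; the file name and row format are invented.

```python
# One way to store an operation's result in a protected mode: symmetric
# encryption with a key held only by the query issuer.  Fernet (from the
# third-party 'cryptography' package) stands in for the schemes above.
from cryptography.fernet import Fernet

issuer_key = Fernet.generate_key()  # held only by the query issuer

def store_protected(result_rows, path):
    token = Fernet(issuer_key).encrypt(repr(result_rows).encode())
    with open(path, "wb") as f:     # the site keeps only ciphertext
        f.write(token)

def read_protected(path, key):
    with open(path, "rb") as f:
        return Fernet(key).decrypt(f.read()).decode()

store_protected([("pid1", "addr1")], "op_result.bin")
print(read_protected("op_result.bin", issuer_key))
```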
  • FIG. 4 illustrates a block diagram of an exemplary system 400 for querying shared data with security heterogeneity, according to an aspect.
  • The system 400 comprises an interface component 410, a planning component 420 and an executing component 430.
  • The interface component 410 is configured to receive a SQL query for shared data over a plurality of sites.
  • The planning component 420 is configured to generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites.
  • The executing component 430 is configured to execute the distributed query plan at the sites and return the results for the SQL query.
  • The planning component 420 further comprises a generating component 510 and an optimizing component 520.
  • The generating component 510 is configured to generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query.
  • The optimizing component 520 is configured to optimize the canonical plan to reduce its parallel execution cost by rebalancing the toll budgets of the sub-plans.
  • The executing component 430 further comprises a picking component 610, a transferring component 620 and a performing component 630.
  • The picking component 610 is configured to pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan.
  • The transferring component 620 is configured to transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site.
  • The performing component 630 is configured to perform the operation and store the result of the operation at the designated site by the logic unit.
  • TFACC: a real-life dataset that integrates the MOT Test Data [32] of the UK Ministry of Transport tests for vehicles from 2005 to 2016 with the National Public Transport Access Nodes (NaPTAN) [31]. It has 19 tables with 113 attributes, about 46.7 GB of data in total.
  • TPCH: we also used the standard benchmark TPCH [2] with its built-in queries. TPCH generates data using TPC-H dbgen [2], with 8 relations. It has 22 built-in SQL queries, which were rewritten into RA queries in our tests. Along the same lines as for TFACC, we additionally generated 30 random queries with the number of joins varying from 1 to 5.
  • Each relation of the datasets was randomly partitioned and distributed over a random subset of the machines (sites) .
  • ONE selects the best site S* to evaluate a query Q centrally at S*, i.e., it transfers all queried relations to S* and executes Q there; the site S* is chosen so that the evaluation incurs the minimum toll among all sites.
  • DASH_0 follows DASH to process the operations op of Q one by one, but centrally at the best site for each op.
  • DASH- follows the framework of DASH to process the operations of Q one by one, but randomly assigns work units to sites holding the data involved, e.g., assigning u_ij to S_i.
  • In the tests, the toll function Toll_(i, j) (X) is c_ij · |X|, normalized so that it equals 1 when X has 1 GB of size.
  • Varying toll budget. Varying the total budget B from 20% of B_m to B_m, we tested the query evaluation time of all methods. The result for TPCH is reported in Fig. 10B and shows the following.
  • DASH is on average 2.76 and 2.55 times faster than DASH_no on the two datasets, respectively.
  • SMCQL can be naturally integrated into DASH as capsules, and becomes more practical in the heterogeneous setting.
  • DASH_smc is on average more than 18.89 times faster than SMCQL (SMCQL cannot finish within 48 hours in all cases).
  • The embodiments described above can be implemented by hardware elements, software elements, or a combination of software and hardware.
  • The hardware elements can include circuitry.
  • The software elements can include computer code stored as machine-readable instructions on a tangible, non-transitory, machine-readable storage medium. Some embodiments can be implemented in one or a combination of hardware, firmware, and software.
  • Some embodiments can be implemented in a computing system or computing device including a memory comprising instructions, and one or more processors in communication with the memory, where the one or more processors execute the instructions to perform the functions or operations described in this disclosure.
  • Some embodiments can also be implemented as instructions stored on a machine-readable medium, which can be read and executed by a computing platform to perform the operations described in this disclosure.
  • A machine-readable medium can include any mechanism for storing or transmitting data in a form readable by a machine, e.g., a computer.
  • A machine-readable storage medium can include read-only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, or any other machine-readable storage medium.
  • Some embodiments can also be a software product including a machine-readable medium that stores instructions which, when executed, cause one or more processors to perform the functions or operations described in this disclosure.
  • [8] TrustedDB: A Trusted Hardware based Database with Privacy and Data Confidentiality. In SIGMOD.
  • [36] EnclaveDB: A Secure Database using SGX. In IEEE Security & Privacy.


Abstract

A method and system for querying shared data with security heterogeneity. The method includes receiving a SQL query for shared data over a plurality of sites (110); generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites (120); and executing the distributed query plan at the sites and returning the results for the SQL query (130). The solution above can answer distributed SQL queries in the heterogeneous security setting and reduce data sharing toll and query evaluation cost.

Description

QUERYING SHARED DATA WITH SECURITY HETEROGENEITY

TECHNICAL FIELD
The embodiments relate to database storage and access, and more particularly, relate to a method and system for querying shared data with security heterogeneity.
BACKGROUND
There has been an increasing need for secure data sharing. Security and privacy issues hamper data sharing, since organizations are becoming increasingly aware of the economic losses and legal liabilities due to data breaches. To tackle this, a number of techniques have been devised to enable data sharing while providing certain security guarantees, e.g., encryption schemes such as order-preserving symmetric encryption (OPE) and homomorphic encryption (HOM), Docker containers, or hardware-assisted enclaves such as Intel SGX and ARM TrustZone. In practice, a group of data owners often adopts a heterogeneous security scheme under which each pair of parties decides its own protocol for sharing data with diverse levels of trust. The scheme also keeps track of how the data is used. Distributed secure SQL query processing has been well studied in homogeneous environments. However, SQL query answering in a heterogeneous setting is much more challenging than in the homogeneous setting.
SUMMARY
A simplified summary is provided herein to help enable a basic or general understanding of various aspects of exemplary, non-limiting embodiments that follow in the more detailed description and the accompanying drawings.
An aspect disclosed herein relates to a computer-implemented method, comprising:
receiving a SQL query for shared data over a plurality of sites;
generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
executing the distributed query plan at the sites and returning the results for the SQL query.
In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
In some embodiments, the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
In some embodiments, generating the distributed query plan further comprises:
generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;
optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
In some embodiments, executing the distributed query plan at the sites further comprises:
picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;
transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;
performing the operation and storing the result of the operation at the designated site by the logic unit.
In some embodiments, the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
In some embodiments, the result of the operation can be stored in a protected mode.
Another aspect relates to a computing system, comprising:
a memory comprising instructions, and
one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:
receive a SQL query for shared data over a plurality of sites;
generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
execute the distributed query plan at the sites and return the results for the SQL query.
In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
In some embodiments, the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
In some embodiments, the one or more processors execute the instructions to generate a distributed query plan comprises:
generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;
optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
In some embodiments, the one or more processors execute the instructions to execute the distributed query plan at the sites further comprises:
picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;
transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;
performing the operation and storing the result of the operation at the designated site by the logic unit.
In some embodiments, the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
In some embodiments, the result of the operation can be stored in a protected mode.
A further aspect relates to a computer-readable storage medium comprising computer instructions that, when executed by one or more processors, cause the one or more processors to:
receive a SQL query for shared data over a plurality of sites;
generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
execute the distributed query plan at the sites and return the results for the SQL query.
In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
In some embodiments, the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
In some embodiments, the storage medium further comprises instructions that cause the one or more processors to:
generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;
optimize the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
In some embodiments, the storage medium further comprises instructions that cause the one or more processors to:
pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;
transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;
perform the operation and store the result of the operation at the designated site by the logic unit.
In some embodiments, the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
In some embodiments, the result of the operation can be stored in a protected mode.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing summary and the following detailed description of illustrative implementations are better understood when read in conjunction with the appended drawings. For the purpose of illustrating the implementations, there is shown in the drawings example constructions of the implementations; however, the implementations are not limited to the specific methods and instrumentalities disclosed. In the drawings:
FIG. 1 illustrates a flow diagram of an exemplary method for querying shared data with security heterogeneity, according to an aspect;
FIG. 2 illustrates a flow diagram of another exemplary method for generating a distributed query plan, according to an aspect;
FIG. 3 illustrates a flow diagram of an exemplary method for executing a distributed query plan, according to an aspect;
FIG. 4 illustrates a block diagram of an exemplary system for querying shared data with security heterogeneity, according to an aspect;
FIG. 5 illustrates a block diagram of an exemplary implementation for planning component, according to an aspect;
FIG. 6 illustrates a block diagram of an exemplary implementation for executing component, according to an aspect;
FIG. 7 illustrates an exemplary algorithm MTJ for generating toll-minimized plans;
FIG. 8 illustrates a simplified scenario of A3 for Example 1;
FIG. 9 illustrates a distributed query plan ξ Q in Example 1;
FIGS. 10A-10D illustrate curve comparison diagrams of experimental results.
DETAILED DESCRIPTION
1. Introduction
There have been increasing demands for sharing data from e-government [20], healthcare [37], finance [23] and the AI industry [15], among others. For example, precision medicine requires sharing of clinical, genetic, environmental and lifestyle data for better disease treatment and prevention. However, security and privacy issues hamper data sharing, since organizations are becoming increasingly aware of the economic losses and legal liabilities due to data breaches. To tackle this, a number of techniques have been devised to enable data sharing while providing certain security guarantees, e.g., encryption schemes such as order-preserving symmetric encryption (OPE) [12] and homomorphic encryption (HOM) [19], Docker containers, or hardware-assisted enclaves [36] such as Intel SGX and ARM TrustZone. The existing work assumes homogeneous settings, i.e., the same security protocol is assumed between all pairs of peers. In many real-world scenarios, however, there are often various trust relationships and hence different security requirements between data owners, as illustrated by the following case study taken from our industry partner.
Example 1: A data-sharing company (name withheld) has built a blockchain-based platform (similar to, e.g., MHMD [24] ) , to enable secure data analytics over distributed datasets owned by users such as government agencies, hospitals, clinics, researchers, drug stores, insurance firms and pharmaceutical companies, etc. Users of the same type are connected via an internal network, and these internal networks are connected to an external network via gateways.
Computations within an internal network can be carried out in Docker containers hosted by nodes of the network. Depending on the trust levels among the data owners, the container may use encryption schemes (e.g., OPE or HOM) to secure data uploaded to it, or use plaintext when the users trust each other. Computations across multiple internal networks may be conducted in hardware-assisted enclaves. Such an enclave incurs much higher upfront costs than Docker since, among other things, it requires a consensus among all gateways in the blockchain and will be audited. Datasets uploaded to the enclaves can be either encrypted or not, depending on the contract used for the consensus.
These security facilities (e.g., Docker and enclaves) reflect different security protocols and requirements between data owners and users. They cope with different threat levels, and incur security costs through their usage and upfront costs. Below are some example applications on the platform.
(A1) One is to find people who register in multiple clinics. Since clinics maintain high trustworthiness, the computation can be done in the clinic internal network by using a Docker container: all clinics upload their registration datasets to the container, where the answers are computed.
(A2) A government agent wants to find addresses of all households with at least one member who has contracted disease Z but has no health insurance. This involves government with household registration data, hospitals that have electronic medical records (EMR) of patients, and insurance firms. Government data can be uploaded to enclaves in the hospital network without encryption; hospital data can be loaded to enclaves in the insurance network with HOM encryption; and insurance data can be loaded to hospital enclaves with OPE encryption (more efficient but less secure than HOM) as hospitals have  higher trust levels than insurance firms.
Such emerging applications introduce new challenges.
(a) The need for a heterogeneous security scheme. An application may involve multiple types of peers (data owners and users) , and various security facilities and guarantees are needed due to the trust levels between these peers. For example, plaintext can be shared between hospitals (via Docker) for A1 and hospitals only share encrypted data with insurance firms via more costly enclaves (A2) . To support these applications, a heterogeneous scheme is needed, to support varying trust levels and security means between different peers.
(b) Query processing with security heterogeneity. Query answering in a heterogeneous setting is much more challenging than in the homogeneous settings. We need to take into account various security charges when deciding where and how the computations should be carried out (i.e., query planning) , e.g., how many containers/enclaves are needed and how to distribute them among sites at multiple networks. This is nontrivial. We want to reduce the security costs. At the same time, we want to minimize the parallel execution time of the query plan. For example, for A2 of Example 1, it would be better to send insurance data to enclaves in the hospital network instead of the other way around.
Contributions & organization. This disclosure studies query evaluation in a heterogeneous security setting.
(1) Abstraction of data sharing scheme. We make a first attempt to study query processing under a data sharing scheme with heterogeneous security protocols. We demonstrate that the scheme can support emerging applications that bear various levels of trust between different peers, as commonly found in the real world. We define distributed query plans and incorporate data sharing cost (security charge) in terms of toll functions that abstract the usage of various security facilities (i.e., Docker containers/enclaves with plain/encrypted data) based on the access rights, trust levels and security demands among the peers.
(2) Querying shared data. We formalize the problem of querying shared data under heterogeneous security as a bi-criteria optimization problem, to minimize both parallel query evaluation cost and data sharing toll. We show that the problem is highly nontrivial: while it is decidable (NEXPTIME), it is already PSPACE-hard for SQL, and finding optimal plans remains intractable for SPC queries even in special cases. Despite these, we introduce a framework for querying shared data.
(3) Distributed plan generation. Underlying the framework, we develop a polynomial-time (PTIME) algorithm to generate distributed query plans while minimizing data sharing toll. We show that the algorithm is an O (log n)-approximation for joins, i.e., its join plans are within O (log n) of optimal ones, where n is the number of data owners.
(4) Distributed plan optimization. We further minimize the parallel evaluation cost of the query plans generated in (3) while retaining a toll budget, via workload rebalancing. We show that the problem is also intractable. This said, we give a PTIME 2-approximation algorithm for rebalancing workload.
(5) Experimental study. Using real-life and synthetic data, we empirically evaluate the effectiveness and efficiency of our plan generation algorithms. We find the following. (a) Security heterogeneity does have an evident impact on query evaluation performance. (b) Our proposed method is effective in reducing both data sharing toll and parallel execution cost, outperforming its competitors by 25.57 and 10.27 times on average, respectively. (c) Existing security systems can be readily incorporated into the data sharing scheme and serve as security facilities.
Position of the work. This work is not to introduce another security protocol. It is not to investigate security guarantees of heterogeneous security models; nor is it to improve existing secure database systems (e.g., [9, 35] ) . Heterogeneous security protocols are already being used in real life, and the security community is studying their properties (e.g., [25, 14] ) . Instead, we study query evaluation over shared data when we are given heterogeneous security protocols. We aim to study the impact of such heterogeneous security protocols on query processing, in terms of costs incurred by data sharing agreements and reflected by the use of different security facilities. The costs stem from existing security facilities, protocols and systems. We make a first attempt to evaluate queries in their presence for emerging applications.
Related work. Distributed secure query processing has been well studied in homogeneous environments.
Complete trust. With complete trust among the data owners, the problem becomes the standard parallel/distributed query processing problem [33]; its goal is to minimize the communication costs of answering queries [4, 11, 27].
Related is the study of federated databases [40, 28, 30] , which aim to provide an interface for querying a collection of distributed and autonomous relational databases. Heterogeneity has also been studied in this context, known as multistores or polystores e.g., [17, 26, 22] . However, the heterogeneity here refers to various data and programming models used by different data owners, not security heterogeneity.
Hardware-assisted solutions. With hardware support (e.g., security enclaves) , query processing over distributed sources can be carried out with moderate overhead, e.g., TrustedDB [8] , Cipherbase [6] and EnclaveDB [36] . An enclave protects the data and code running inside of it from being spied upon, even if the entire software stack of the host is compromised. In addition, a remote party can verify the code running inside the enclave through a process known as attestation. Thus, two data owners without trust can still jointly compute a function, by sending their data to an enclave, which runs some code (attested by both owners) to compute the function. The enclave can be hosted by one of the data owners, or even a third (untrusted) party. Docker containers are used in lieu of an enclave if the system administrator can be trusted.
Software-only solutions. In the absence of special hardware support, one can still build a distributed query processing system over distrusted peers using homomorphic encryption [19] or secure multi-party computation (SMC) [44, 39, 13] , such as CryptDB [35] , Monomi [41]  and SMCQL [10, 9] . Both homomorphic encryption and SMC are capable of computing arbitrary functions over the data, but those general-purpose solutions are not very practical. In practice, one has to limit the types of queries supported by designing special-purpose protocols. Even so, these software-only solutions tend to be much more expensive than those with hardware support.
Local differential privacy (LDP). LDP [16] has recently emerged as another approach to querying distributed data. LDP algorithms for answering range queries on a single table have been developed [29, 43]. Different from security-based solutions, query results in the LDP model have to be probabilistic with certain random noise, and information leakage accumulates in the LDP model as more queries are run on the same data.
This work differs from the prior work in the following. (1) We study a heterogeneous security environment, a setting being increasingly used in practice. It supports data sharing among a group of data owners and allows each pair of hosts to adopt a security protocol of their choice, as opposed to the homogeneous security assumption in the previous work. (2) We provide the first abstraction of heterogeneous security protocols. (3) We formalize the problem of querying shared data as a bi-criteria optimization problem, and study its complexity. (4) We propose an approach to answering generic SQL queries in the heterogeneous security setting.
2. Data Sharing with Security Heterogeneity
We start with a scheme to abstract data sharing with security heterogeneity (Section 2.1) , and then formalize the problem of querying shared data under the scheme (Section 2.2) .
2.1 An Abstraction of Data Sharing
The scheme, referred to as DShare , is to abstract computations over shared datasets with heterogeneous security protocols. It is characterized by the notions of data owners, a query planner, data sharing pacts and distributed query plans.
A group of data owners often agree upon data sharing protocols in practice. Each owner contributes its data for sharing, and is also a client of the shared data under the protocols.
Data owners. A collection of data owners, or simply sites, S = {S_1, ..., S_n}, agree to support query services collectively over their private data. Each owner manages its data by its own DBMS and has its own local database schema.
We assume a global schema R, deduced by, e.g., schema mapping, to provide a uniform interface to write queries against the data, where each R_i is a relation schema. Owner S_i has an instance D_i of R for i ∈ [1, n]. Here some relations in D_i are possibly empty, i.e., D_i does not necessarily have every relation of R. We denote (D_1, ..., D_n) by D and refer to it as a distributed instance of R at S. The answer to a query Q over D is defined as Q (D_1 ∪ ... ∪ D_n) under the normal interpretation. Note that via renaming, this also allows us to compute local answers at any single site or any subset of S.
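A tiny worked example of the semantics just defined, with an invented two-site instance: the answer to Q over D is Q(D_1 ∪ ... ∪ D_n), and the query mirrors application A1 (people registered in more than one clinic).

```python
# Tiny illustration of the query semantics: the answer to Q over a
# distributed instance D = (D_1, ..., D_n) is Q(D_1 ∪ ... ∪ D_n).
# Schema and data are invented for illustration.

D1 = {"Reg": {("p1", "clinicA")}, "EMR": set()}           # site S_1
D2 = {"Reg": {("p1", "clinicB")}, "EMR": {("p1", "Z")}}   # site S_2

def union_instance(*sites):
    out = {}
    for Di in sites:
        for rel, tuples in Di.items():
            out.setdefault(rel, set()).update(tuples)
    return out

def Q(db):  # people registered in more than one clinic (cf. application A1)
    regs = db["Reg"]
    return {p for p, c in regs for p2, c2 in regs if p == p2 and c != c2}

print(Q(union_instance(D1, D2)))  # {'p1'}
```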
Query planner. Each data owner may issue a query Q, to compute the answer to Q over D. The query will be handled by a trusted third party called the query planner. Upon receiving a query Q from a data owner, the planner comes up with a distributed query plan that picks security facilities (Docker or enclave) to comply with the security protocols agreed among the sites, and the encryption schemes required for the operations. It also estimates a security charge, referred to as a toll, for the query plan. If this is agreeable to the query issuer, the query planner will instruct the data owners to carry out the query plan, charge the toll to the query issuer, and allocate the revenue to the data owners accordingly.
The query planner oversees the execution of the query plans and acts as an intermediary between the data owners and the query issuer. However, it does not access any of the local databases or carry out operations in the query plans itself.
Data sharing pact. We next abstract the varying security protocols between pairs of data owners based on their trust levels. We use the following computation model.
(1) Data is shared using capsules, logic units to which data owners can transfer and upload datasets; physically, a capsule can be instantiated with a Docker container, an enclave, an SMC system such as SMCQL [9] , or even a trusted third party. All computations are carried out in capsules only.
(2) Each capsule C is associated with a site S j, referred to as a capsule hosted by S j. Computation in C has direct access to the data on S j but cannot access data at other sites except the part that is uploaded to C. When the computation in C completes, the intermediate results are stored at S j in a protected mode via, e.g., access right controls, OPE encryption [12] , or symmetric encryption with keys held only by the query issuer as commonly used in cloud cryptograph such as Azure [1] ; then the capsule C will be terminated.
A security protocol for a pair of sites (S i, S j) specifies: (a) the lowest security guarantees that a capsule hosted by S j must attain in order for S i to share its data with S j; and (b) the encryption of data at S i, i.e., which part of the data can be shared, what data needs to be encrypted before sending it to a capsule of S j, and what encryption scheme to use.
A data sharing pact ρ for a set S of sites consists of (1) a security protocol for each pair of sites in S, and (2) a toll function Toll () that measures the costs of all types of available capsules that can be employed to evaluate queries. In the real world, such costs are incurred by, e.g., renting trusted third party facilities as the capsules and encryption overhead.
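To make the pact abstraction concrete, the following is a minimal Python sketch of security protocols and a pact. The identifiers (CAPSULE_LEVEL, SecurityProtocol, Pact, allowed) and the numeric ordering of capsule types are illustrative assumptions, not part of the disclosure.

from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

# Capsule types ordered by an assumed increasing security level.
CAPSULE_LEVEL = {"docker": 1, "enclave": 2, "smc": 3, "trusted_third_party": 4}

@dataclass
class SecurityProtocol:            # protocol for an ordered pair (S_i, S_j)
    min_capsule_level: int         # lowest level a capsule at S_j must attain
    shareable_relations: Set[str]  # which relations of S_i may be shared
    encryption: Dict[str, str]     # relation -> scheme, e.g. "OPE" or "HOM"

@dataclass
class Pact:                        # data sharing pact rho
    protocols: Dict[Tuple[int, int], SecurityProtocol]
    toll: Callable[..., float]     # Toll() over available capsule types

    def allowed(self, i: int, j: int, capsule: str) -> bool:
        # May a capsule of this type at S_j receive data of S_i?
        p = self.protocols[(i, j)]
        return CAPSULE_LEVEL[capsule] >= p.min_capsule_level

# Usage: an enclave at S_2 may receive (HOM-encrypted) EMR data of S_1.
pact = Pact({(1, 2): SecurityProtocol(2, {"EMR"}, {"EMR": "HOM"})},
            toll=lambda *args: 0.0)
print(pact.allowed(1, 2, "enclave"))   # -> True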
Distributed query plan. We next define query plans over a set S of sites w.r.t. a data sharing pact ρ.
Consider a distributed instance D at S. A distributed query plan ξ on D at S is a DAG (directed acyclic graph) , where the nodes of ξ denote atomic operations and edges represent their dependencies. Each atomic operation δ is an (n+3) -tuple (op, t c, X 1, ..., X n, j) , where (i) op is an operator in the relational algebra (RA; projection π, selection σ, natural join ⋈, set difference -, set union ∪ and renaming λ) ; (ii) t c is a capsule of a certain type, e.g., Docker, enclave, SMC system, or trusted third party; (iii) j∈ [1, n] denotes the site that hosts a capsule to carry out op; and (iv) X i is a relation, which is either part of data in D i that can be shared with S j by the security protocol between S i and S j, or the intermediate result I i computed at site S i by operations prior to δ in ξ.
Executing δ= (op, t c, X 1, ..., X n, j) involves the steps below:
(1) Set up a capsule C of type t c for op, hosted by site S j, such that for each i∈ [1, n] , C meets the security requirement of ρ for (S i, S j) , where (a) X i⊆D i, or (b) there exists X k (k∈ [1, n] ) that contains intermediate answers computed over data from S i.
(2) For i∈ [1, n] , transfer X i from S i to the capsule C hosted by S j, based on the security protocol between S i and S j.
(3) Perform the computation I j=op (X 1, ..., X n) and store I j in a protected mode (e.g., encrypted with OPE) at S j.
(4) Add the relation I j to D j at S j.
That is, intermediate results of op are computed and stored as such to comply with the security guarantees of pact ρ. In particular, when X i=∅ for all i∈ [1, n] with i≠j, (op, t c, X 1, ..., X n, j) simply executes op on local data X j at site S j.
Condition (1b) ensures that each followup operation δ′= (op′, X 1′, ..., X n′, p) of δ also complies with the security requirement of (S i, S p) if X j′⊆D j and X j′ contains intermediate results I j computed by δ from the data of S i, even if X i′=∅.
Edges of the DAG plan ξ specify the dependencies of the atomic operations in ξ. In particular, if there exists X i (in atomic operation δ= (op, t c, X 1, ..., X n, j) ) that comes from relations at site S j computed by atomic operation δ′, then there is a directed edge from δ′ to δ in ξ.
The execution of the query plan is mediated and monitored by the query planner, and it takes place at the participating sites only. After executing the query plan, the sites send the results to the query planner, who then decrypts and returns the results to the client who issued the query.
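The following toy Python sketch walks through steps (1)-(4) above for a single atomic operation. The capsule, encryption and protected store are simplified stand-ins (assumptions), so the sketch illustrates only the control flow, not a real security facility.

# A runnable toy sketch of executing one atomic operation
# delta = (op, t_c, X_1, ..., X_n, j). Relations are lists of tuples;
# capsule, transfer and protected storage are simplified stand-ins.

def execute_atomic_op(op, t_c, inputs, j, allowed, encrypt, stores):
    """inputs: {i: X_i}; allowed(i, j, t_c) checks the pact; encrypt(i, j, X)
    applies the protocol's scheme; stores[j] is S_j's protected store."""
    # (1) The capsule must satisfy the pact for every contributing pair.
    for i in inputs:
        assert allowed(i, j, t_c), f"capsule '{t_c}' too weak for (S_{i}, S_{j})"
    # (2) Ship each remote X_i to the capsule; local data needs no transfer.
    staged = {i: (X if i == j else encrypt(i, j, X)) for i, X in inputs.items()}
    # (3) Evaluate op inside the (ephemeral) capsule.
    result = op(staged)
    # (4) Store the intermediate result at S_j in protected mode.
    stores[j].append(result)
    return result

# Usage: a join of two one-column relations on their value.
allowed = lambda i, j, t_c: True        # permissive pact (assumption)
encrypt = lambda i, j, X: X             # identity stand-in for OPE/HOM
stores = {2: []}
join = lambda xs: [a for a in xs[1] for b in xs[2] if a == b]
print(execute_atomic_op(join, "docker", {1: [(1,), (2,)], 2: [(2,), (3,)]}, 2,
                        allowed, encrypt, stores))   # -> [(2,)]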
Toll. We next present the toll function Toll () . For a query plan ξ over D, Toll (ξ) is the sum of Toll (δ) for all operations δ= (op, t c, X 1, ..., X n, j) in ξ, where Toll (δ) estimates the charge for executing δ. It consists of the following:
(a) an upfront cost Toll 0 for setting up the capsule for op;
(b) cost Toll d for transferring remote data via secure channel to the capsule for op hosted by S j, determined by the amount of data and encryption overhead; and
(c) cost Toll c for executing op in the capsule, determined by the duration that the computation op takes.
More specifically, (a) Toll 0 depends on the type of the capsule; (b) Toll d is measured as the sum of Toll (i, j) (X i) over all sites S i that have data X i required by δ, where Toll (i, j) (X i) is the cost of encrypting and transferring X i of S i to the capsule hosted by S j; it is determined by the size of X i and the encryption scheme required by the protocol between S i and S j; in particular, Toll (j, j) (X j) =0 since C is hosted by S j and can access its local data X j without extra cost. Finally, (c) Toll c is the charge for sustaining the capsule for executing op.
Using familiar terms, we refer to Toll d and Toll c as communication and computation cost, respectively. These costs stem from the security protocol between S i and S j, and reflect the overhead incurred by the security facility employed (e.g., encryption cost) beyond the normal communication and computation costs.
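As an illustration of the decomposition Toll (δ) =Toll 0+Toll d+Toll c, the sketch below assumes simple per-capsule upfront costs and linear transfer tolls Toll (i, j) (X i) =c ij|X i|; the concrete numbers and names are illustrative, not prescribed by the pact model.

UPFRONT = {"docker": 0.0, "enclave": 5.0, "smc": 20.0}   # Toll_0 per type

def toll_delta(t_c, sizes, j, c, toll_c):
    """Toll(delta) = Toll_0 + Toll_d + Toll_c for one atomic operation.
    sizes: {i: |X_i|}; c: {(i, j): unit transfer/encryption price};
    toll_c: charge for sustaining the capsule while op runs."""
    toll_0 = UPFRONT[t_c]
    # Toll_d: sum of Toll_(i,j)(X_i) = c_ij * |X_i|; Toll_(j,j) = 0.
    toll_d = sum(c[(i, j)] * n for i, n in sizes.items() if i != j)
    return toll_0 + toll_d + toll_c

def toll_plan(per_op_tolls):
    """Toll(xi): the sum of Toll(delta) over all operations of the plan."""
    return sum(per_op_tolls)

# Usage: an enclave join shipping 10 units of data from S_1 to S_2.
d1 = toll_delta("enclave", {1: 10, 2: 50}, j=2, c={(1, 2): 0.3}, toll_c=4.0)
print(d1, toll_plan([d1, 2.5]))   # -> 12.0 14.5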
Example 2: [Case study] Continuing with Example 1, we show that applications A1 and A2 can be abstracted by DShare. Denote by (a) Household (address, pid, name) the registration relation maintained by the government, (b) Reg (pid, clinic) and EMR (pid, disease) the clinic registration relation and medical record relation, respectively, owned by the clinics and hospitals, and (c) Insurance (pid, company, policy) the customer records maintained by insurance firms. We present their data sharing pacts ρ with associated toll functions.
Application A1.
It is expressed as a join query over the clinics' Reg relations. The pact ρ specifies the lowest security level for a capsule C. Based on this, the query planner may pick a Docker container as C, and remote data will be directly uploaded to C. The shared data will not be stored after the computation since a capsule is ephemeral. These meet the security requirements of A1 since clinics are trusted and do not risk side-channel leakage.
The toll function estimates the costs of available capsules C, and is determined by the type, configuration and duration of the capsule to use (which in turn depend on the operations to be carried out in C) . For Docker, Toll 0=0 since Docker incurs negligible upfront cost; Toll (i, j) (X i) =c ij|X i| is the communication cost for transferring X i from S i to the Docker at S j (c ij is a coefficient denoting the unit network price) ; and Toll c is c|Reg| 2 (c is a coefficient similar to c ij) , the cost of sustaining the Docker container for the join.
Application A2.
It is an RA query that computes I-Insurance, where I is the join of Household and EMR. As the query involves data from three internal networks, pact ρ requires a higher security level for data sharing, as described in Example 1. Based on this, the query planner may take enclaves as capsules. In particular, since insurance firms have a lower level of trust than hospitals, data is required to be encrypted using HOM [19] before uploading it to insurance enclaves, to prevent leakage of user identities and their EMR records.
Here the toll function is determined by the capsule types, operations, encryption schemes and communication. First consider, e.g., the join δ of Household and EMR at a hospital site S j. For the enclave, Toll (δ) is estimated as follows: Toll 0=L, where L is the upfront cost of setting up the enclave at S j, and is estimated as the average time for reaching consensus among the gateways; Toll (i, j) (X i) is c ij|X i|, where c ij is a coefficient reflecting the cost of shipping Household to the enclave at S j; and Toll c=c|D Household||D EMR|, where D Household and D EMR are the datasets for the join at S j.
Now consider the operation I-Insurance, where I is the join result above. Assume that the set difference takes place in a capsule C hosted by the insurance network. Then the type of C is determined by both the protocol between the government and the insurance firms and the one between the hospitals and the insurance firms, to warrant security for each data owner involved.
As shown above, in practice toll functions can be readily deduced from the types of capsules and the complexity of operations. Alternatively, as a common practice, they can also be empirically estimated by testing or learning over small dataset samples. In addition, for blockchain-based data sharing systems similar to Example 1 or MHMD [24] , tolls often denote the economic incentives defined by smart contracts for consensus. There has also been recent work from the security community on toll (cost) models of hybrid protocols, e.g., [14, 25] . Note that toll functions only specify the toll charges of all possible capsule usage in an application; they do not determine which capsules are used, where, or how.
DShare allows arbitrary positive polynomial functions as Toll (δ) that are composed of submodular set functions [21] for Toll (i, j) (X) (of Toll d) and Toll c (bisubmodular [18] if op of δ is a binary operator, e.g., join) . All toll functions in Example 2 are of this type. Our industry partners find that these suffice to express common security charges in practice.
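For intuition, the brute-force check below verifies the diminishing-returns property on a small universe for two toll-style set functions: a modular function c·|X| (as in the simple pacts) and a concave function of |X|. The check itself is only illustrative; it is not part of the method.

# Numeric check of submodularity for toll-style set functions.
# f(X) = c * |X| is modular, hence submodular; sqrt(|X|) is submodular
# as a concave function of the set size.
from itertools import chain, combinations
from math import sqrt

def is_submodular(f, universe):
    """f(S u {x}) - f(S) >= f(T u {x}) - f(T) for all S <= T, x not in T."""
    subsets = list(chain.from_iterable(
        combinations(universe, r) for r in range(len(universe) + 1)))
    for S in map(set, subsets):
        for T in map(set, subsets):
            if S <= T:
                for x in universe - T:
                    if f(S | {x}) - f(S) < f(T | {x}) - f(T) - 1e-9:
                        return False
    return True

U = {1, 2, 3, 4}
print(is_submodular(lambda X: 2.0 * len(X), U))   # -> True: modular
print(is_submodular(lambda X: sqrt(len(X)), U))   # -> True: concave of |X|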
Guarantees and properties. We adopt the following threat model. The data owners are semi-honest (a.k.a. honest but curious) , i.e., each owner will faithfully execute the query plan, but may try to derive information about other parties’ data. This is why all operations are executed in capsules and the intermediate relations are stored in protected mode. The query planner runs at a trusted third party and is assumed trustworthy, similar to the honest broker in secure database systems e.g., [35, 9] , but without direct access to data.
Under this threat model, DShare enforces the security protocols of data sharing pact ρ by (a) the trusted query planner, and (b) proper capsules for carrying out query plans.
(1) The pact specifies the minimum security requirement between each pair S i and S j of sites. The query planner enforces the security guarantee by picking the right capsules. Each operation (op, t c, X 1, ..., X n, j) is performed by a capsule that meets the maximum of all the lowest security levels for (S i, S j) (i∈ [1, n] ) . Followup operations retain no lower security levels.
(2) Depending on the security protocols, different pairs of sites may have distinct security requirements. The planner picks capsules that meet the security requirements and minimize the cost; e.g., it picks Docker containers for A1 of Example 1, which satisfy the security requirements specified by the protocol and are cheaper than enclaves and SMC systems.
Remark. Composing security protocols is a challenging and active topic in the security community (e.g., [25, 14] ) . It has emerged in semi-trusted data federations, e.g., MHMD [24] and Example 1. This paper takes a heterogeneous setting used in practice and focuses on a query planner that selects capsules and distributes computations across the federation, to improve query performance while retaining the required security levels. This said, the query planner can be adapted to other security composition and propagation protocols.
2.2 The Problem of Querying Shared Data
Critical to DShare is its query planner. While there has been a host of research on security protocols and facilities, no prior work has studied how to generate query plans that comply with a data sharing pact with security heterogeneity.
This motivates us to study the toll-bounded query answering problem, denoted by TBQA. Informally, it is to find the best distributed query plan for a given query subject to a toll budget imposed by a data sharing pact. It is stated as follows.
Input: A global schema R, n sites S = {S 1, ..., S n} , a distributed instance D of R over S, a data sharing pact ρ, a natural number B, and an RA query Q over R.
Output: A distributed plan ξ for Q over D such that the toll Toll (ξ) of ξ over D under ρ is no larger than B.
Objective: Minimize the parallel execution cost of ξ over D, denoted by cost (ξ, D) .
As remarked earlier, a data sharing pact ρ specifies only the minimum security requirements for sharing data between sites and their associated toll charges. We have to find a query plan ξ that determines, in addition to conventional planning, how to select and distribute capsules for computations so as to meet the heterogeneous security requirements of ρ, while taking advantage of the heterogeneity and minimizing the parallel execution cost cost (ξ, D) .
To complete the statement of TBQA, we define cost (ξ, D) below. Let cost (δ, D) be the execution cost of atomic operation δ over D. Then cost (ξ, D) is inductively defined as follows: If ξ is a single atomic operation δ, cost (ξ, D) =cost (δ, D) . If ξ consists of sub-plans ξ 1, ..., ξ l and an atomic operation δ, where ξ 1, ..., ξ l are predecessors of δ in ξ, then cost (ξ, D) =cost (δ, D) +max i∈ [1, l] cost (ξ i, D) .
Intuitively, cost (ξ, D) characterizes the total parallel execution cost of the atomic operations in ξ when the parallel execution of independent atomic operations is fully exploited.
We assume that the query planner can efficiently estimate the cost incurred by an atomic operation δ= (op, t c, X 1, ..., X n, j) . For example, when op is the join R⋈S, cost (δ, D) is |R|×|S|.
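A small sketch of the inductive definition: the parallel execution cost is the cost of an operation plus the maximum cost among its predecessor sub-plans, i.e., the critical path of the DAG. The node names and costs below are made up for illustration.

# Sketch of cost(xi, D): cost of each atomic operation plus the maximum
# cost among its predecessor sub-plans (the DAG's critical path).
from functools import lru_cache

def parallel_cost(op_cost, preds, sink):
    """op_cost: {node: cost(delta, D)}; preds: {node: [predecessors]};
    sink: the final operation of the plan xi."""
    @lru_cache(maxsize=None)
    def cost(node):
        below = [cost(p) for p in preds.get(node, [])]
        return op_cost[node] + (max(below) if below else 0.0)
    return cost(sink)

# Usage: two independent joins (costs 4 and 6) feed a set difference (cost 2).
op_cost = {"join1": 4.0, "join2": 6.0, "diff": 2.0}
preds = {"diff": ["join1", "join2"]}
print(parallel_cost(op_cost, preds, "diff"))   # -> 2 + max(4, 6) = 8.0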
3. Querying Shared Data
In this section, we first study the complexity of querying shared data and then outline our approach to solving TBQA.
Complexity of TBQA. Denote by TBQA d the decision version of TBQA. That is to decide, given the same input as TBQA and an additional number L, whether there exists a distributed query plan ξ for Q over D such that its toll is at most B and its parallel execution cost cost (ξ, D) is at most L.
We also study a related problem, referred to as the toll-bounded answerability problem and denoted by TBA. Given R, S, D, Q, ρ and B as in TBQA, it is to decide whether there exists a distributed query plan ξ for Q over D with toll at most B.
Intuitively, TBA checks whether TBQA has a feasible solution at all. TBQA is at least as hard as TBA.
We say that a data sharing pact ρ is simple if Toll 0=Toll c=0 and Toll  (i, j) (X) =c ij|X|. Both problems are intractable even under such simple pacts that involve only two sites.
Theorem 1: Both TBQA d and TBA are
(1) decidable in NEXPTIME ;
(2) PSPACE -hard even when ρ is simple; and
(3) hard for a complexity class given only as an image in the original, even when Q is in SPC and ρ is simple.
Moreover, (2) and (3) hold even when S has two sites only.
Our approach. In light of Theorem 1, practical solutions to TBQA have to be approximate. We next propose such an approach, which consists of two steps outlined below.
Step (1) : Finding toll-minimized canonical plans (Section 4) . We first generate a distributed query plan ξ Q for Q in a canonical form: ξ Q extends the algebra tree T Q (cf. [3] ) of Q into a DAG by replacing each algebra operation op of Q with a distributed query plan ξ op that has minimized toll over D. Note that here an edge from op 1 to op 2 of Q in T Q may be extended to multiple edges, which connect atomic operations in ξ op 1 to those in ξ op 2 based on their dependencies.
Step (2) : Reducing parallel execution cost (Section 5) . Given ξ Q of step (1) , we check whether Toll (ξ Q) of ξ Q exceeds the toll budget B. We return “No” if so, i.e., budget B is too small to answer Q in D under ρ. Otherwise, we further improve ξ Q by making use of the remaining toll allowance, to reduce its parallel execution cost cost (ξ Q, D) without exceeding B.
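Putting the two steps together, a high-level sketch of the planner loop might look as follows, where toll_minimized_plan and rebalance stand in (as assumptions) for the procedures of Sections 4 and 5.

# High-level sketch of the two-step planner; the two stand-in callables
# are assumptions for the Section 4 and Section 5 procedures.

def plan_query(Q, D, pact, B, toll, toll_minimized_plan, rebalance):
    # Step (1): canonical plan with a toll-minimized sub-plan per operation.
    xi_Q = toll_minimized_plan(Q, D, pact)
    if toll(xi_Q) > B:
        return None          # budget too small to answer Q under the pact
    # Step (2): spend the remaining allowance B - Toll(xi_Q) to reduce
    # the parallel execution cost without exceeding B.
    return rebalance(xi_Q, D, pact, B - toll(xi_Q))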
4. Generating Toll-Minimized Plans
In this section, we show how to carry out step (1) of our approach (Section 3) . Given an RA query Q, a distributed instance D of schema R at n sites S = {S 1, ..., S n} , and a data sharing pact ρ, we generate a distributed plan ξ Q for Q, which consists of a toll-minimized plan ξ δ for each operation δ in Q. Below we focus on joins; the other RA operations are similar and simpler.
Approximability. One can verify that even for joins, TBQA remains intractable, by reduction from the vertex cover problem, which is NP -complete [34] . Nonetheless, there exists a PTIME approximation. Assume that |D| is reasonably large, and that the constant Toll 0 is negligible w.r.t. Toll d or Toll c.
Theorem 2: There exists a PTIME O (logn) -approximation algorithm for computing plans with minimum toll for joins.
As a proof, we give such an algorithm, denoted by MTJ, as shown in Fig. 7.
Consider a join query Q=R⋈S. Algorithm MTJ generates a plan ξ for Q over D by reduction to the minimum set cover (MSC) problem [34] . Below we first present algorithm MTJ via the reduction to MSC, which gives us an O (logn) -approximation. However, a direct use of the reduction yields a naive version of MTJ with exponential time (EXPTIME) complexity. Nonetheless, we develop a technique that reduces its complexity from EXPTIME to PTIME.
Approximation by reduction. We start with a naive version of MTJ by an approximation-preserving reduction [7] to MSC, so that MTJ computes toll-minimized join plans by making use of available approximation algorithms for MSC.
Reduction. The idea of the reduction is to (a) represent each query plan ξ for the join query Q as a “workload” distribution plan that assigns the necessary data movement for answering Q in D among the sites; and (b) reformulate the assignment problem as a variant of MSC that admits a PTIME logarithmic-factor approximation algorithm [42] .
Consider a join query Q=R⋈S, a distributed database D over sites S 1, ..., S n and a data sharing pact ρ. We construct an instance of MSC, i.e., a universe U of elements and a set T of weighted subsets of U, such that each c-approximation answer to MSC encodes a distributed join plan for Q with toll at most c times the minimum toll among all plans for Q.
Denote by R (i) the instance of relation R at site S i (i∈ [1, n] ) ; similarly for S (i) . For convenience, we assume w.l.o.g. that neither R (i) nor S (i) is empty for all i∈ [1, n] . For any i, j∈ [1, n] , u ij=R (i) ⋈S (j) is called a work unit of Q in D. Then:
(1) U consists of all work units of Q in D; and
(2) T consists of pairs (i, W) for all i∈ [1, n] and W⊆U.
We say that (i, W) covers element u jk in U if u jk∈W. The weight of (i, W) , denoted by t (i, W) , is defined as the sum of the total Toll d of fetching R (j) and S (k) from sites S j and S k to site S i, and the total Toll c of computing R (j) ⋈S (k) , for all units u jk in W. Note that this has to take into account toll sharing for relations appearing in multiple work units in W.
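A toy construction of this MSC instance is sketched below for Q=R⋈S with linear transfer tolls. It enumerates all subsets W⊆U, which is exactly the exponential blow-up addressed later; the coefficient names and sizes are illustrative assumptions.

from itertools import chain, combinations

def msc_instance(n, size_R, size_S, c, comp):
    """U: work units u_jk = R^(j) join S^(k); candidates: {(i, W): t(i, W)}.
    size_R[j], size_S[k]: fragment sizes; c[(j, i)]: transfer coefficient
    from S_j to S_i (c[(i, i)] = 0); comp: Toll_c coefficient per unit."""
    U = [(j, k) for j in range(n) for k in range(n)]
    subsets = chain.from_iterable(
        combinations(U, r) for r in range(1, len(U) + 1))
    candidates = {}
    for W in map(frozenset, subsets):
        Rs = {j for (j, _) in W}   # fragments of R needed at the host site,
        Ss = {k for (_, k) in W}   # paid once even if shared by several units
        for i in range(n):
            toll_d = sum(c[(j, i)] * size_R[j] for j in Rs) + \
                     sum(c[(k, i)] * size_S[k] for k in Ss)
            toll_c = sum(comp * size_R[j] * size_S[k] for (j, k) in W)
            candidates[(i, W)] = toll_d + toll_c
    return U, candidates

U, T = msc_instance(2, size_R=[3, 4], size_S=[5, 2],
                    c={(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}, comp=0.1)
print(len(U), len(T))   # -> 4 work units, 30 weighted candidate pairs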
Algorithm. One can readily verify that the reduction is approximation-preserving [7] . This gives us an O (logn) -approximation algorithm, as a naive implementation of MTJ, for computing minimum-toll join plans (Algorithm 1) , by using the O (log|U|) -approximation of MSC [42] (|U|=n 2) . Here the set cover C⊆T specifies the assignment of the atomic operations for the work units in C to their host sites (line 4) . The capsule types of the atomic operations are picked such that they minimize the cost while satisfying all the relevant security levels specified in the protocols for S (line 5) .
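The naive MTJ can invoke the textbook greedy for weighted set cover, which repeatedly picks the candidate with the smallest weight per newly covered work unit; this is the standard O (log|U|) -approximation consistent with [42] . A minimal sketch:

# Textbook greedy for weighted set cover: repeatedly pick the candidate
# (i, W) with the smallest weight per newly covered element.

def greedy_set_cover(U, candidates):
    """U: iterable of elements; candidates: {(i, W): weight}, W a frozenset."""
    uncovered = set(U)
    cover = []
    while uncovered:
        (i, W), w = min(
            ((key, wt) for key, wt in candidates.items()
             if key[1] & uncovered),
            key=lambda kv: kv[1] / len(kv[0][1] & uncovered))
        cover.append((i, W))      # host site i computes the units in W
        uncovered -= W
    return cover

# Usage: two candidates covering {a, b}; the cheaper-per-element one wins.
cands = {(0, frozenset({"a"})): 1.0, (1, frozenset({"a", "b"})): 1.5}
print(greedy_set_cover({"a", "b"}, cands))   # -> [(1, frozenset({'a', 'b'}))]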
Example 3: Recall the query for A2 given in Example 2, denoted by Q. Assume the simplified data sharing scenario shown in Fig. 8. We show how algorithm MTJ generates the distributed query plan ξ Q depicted in Fig. 9 for Q.
Take the join op 1 of Household and EMR in Q for example. Then op 1 has a set U 1 of 4 work units u ij for all i∈ {1, 2} , j∈ {3, 4} (see Fig. 9) . After the reduction, the set T consists of (i, W) for all sites S i and W⊆U 1. MTJ picks (3, {u 13, u 23} ) and (4, {u 14, u 24} ) as a cover C of U 1 with total toll 0. It then interprets C as a plan ξ op 1 consisting of the atomic joins for I 32 and I 42 of Fig. 9. Note that this is actually the optimal plan for op 1 since ξ op 1 incurs no toll at all.
Assume that the sizes |I 32|, |I 42| and |Insurance (i) | for all i∈ [5, 7] are N. Then along the same lines, MTJ generates a distributed plan ξ op 2 with the atomic operations for I 33 and I 43 of Fig. 9, which is also optimal for op 2.
From exponential to polynomial. While Algorithm 1 is an O (logn) -approximation, it has an exponential time complexity since the set T of line 3 is of size exponential in |U| (i.e., exponential in n 2) . Nonetheless, below we show that this can be reduced to PTIME, based on the following:
(a) (i *, W *) identified in line 3 of Algorithm 1 equals argmin i∈ [1, n], W⊆U t (i, W) /|W|, where U is the set of all work units of Q in D.
(b) For each i, computing min W⊆U t (i, W) /|W| is equivalent to finding the minimum α such that there exists a subset W of U with f α (W) ≤0, where f α (W) =t (i, W) -α|W|.
(c) For any fixed α, checking whether this holds can be done efficiently in PTIME since one can prove that f α (W) is a submodular function, and submodular minimization can be done in PTIME via, e.g., [21] .
From these, the while loop (line 3) of Algorithm 1 can actually be implemented in PTIME by computing argmin W⊆U f α (W) as W i for each i∈ [1, n] , via a binary search for the minimum α in [0, α max] such that min W f α (W) ≤0, where α max is an upper bound on the ratio t (i, W) /|W| derived from the coefficients c (i, k) , in which c (i, k) is the constant in the toll function Toll (i, k) (X) =c (i, k) |X| specified by ρ. The search terminates when the range [a, b] for α has a gap (|b-a|) less than 1/n 2. Hence, there are at most log (n 2α max) rounds of search, where each round is an invocation of submodular minimization, which is in PTIME (e.g., [21] ) .
That is, the while loop of Algorithm 1 is reduced to PTIME in n and logα max. Therefore, MTJ can be implemented in PTIME in n and logα max, and is an O (logn) -approximation for computing minimum-toll distributed plans for joins.
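The sketch below illustrates the binary search on α with f α (W) =t (i, W) -α|W|. To keep the illustration runnable on a tiny universe, it minimizes f α by brute force where a real implementation would call a PTIME submodular minimizer (e.g., [21] ); the parameter defaults are assumptions.

from itertools import chain, combinations

def min_ratio(t, U, eps=1e-3, alpha_max=1e3):
    """Binary search for min over nonempty W <= U of t(W)/|W|, using
    f_alpha(W) = t(W) - alpha*|W|. Brute force stands in for submodular
    minimization purely to keep the toy runnable."""
    subsets = [frozenset(s) for s in chain.from_iterable(
        combinations(U, r) for r in range(1, len(U) + 1))]
    lo, hi, best = 0.0, alpha_max, None
    while hi - lo > eps:                     # O(log(alpha_max / eps)) rounds
        alpha = (lo + hi) / 2
        W = min(subsets, key=lambda S: t(S) - alpha * len(S))
        if t(W) - alpha * len(W) <= 0:       # some W attains ratio <= alpha
            hi, best = alpha, W
        else:
            lo = alpha
    return best, hi

# Usage: with a modular t, the cheapest-per-element subset wins.
t = lambda S: sum({"a": 4.0, "b": 1.0, "c": 2.0}[x] for x in S)
print(min_ratio(t, {"a", "b", "c"}))         # -> (frozenset({'b'}), ~1.0)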
5. Plan Rebalancing
After generating a toll-minimized canonical plan ξ Q for Q in Section 4, we next study how to further optimize ξ Q by reducing its parallel execution cost cost (ξ Q, D) . This is to carry out step (2) of our approach outlined in Section 3. While the adjustments may increase the toll of the revised plan, we make sure that the toll stays below the budget B, i.e., we make use of the remaining toll allowance B-Toll (ξ Q) to reduce cost (ξ Q, D) .
Our technique, referred to as plan rebalancing, is motivated by the following. Consider the sub-plan ξ op i of ξ Q for an operation op i of Q. Here ξ op i is generated to minimize toll (Section 4) and hence could be imbalanced. Observe that cost (ξ op i, D) is dominated by the maximum cost of individual sites (Section 2.2) ; hence imbalanced workloads increase cost (ξ op i, D) .
In light of this, rebalancing works by iteratively applying an atomic balancing operator κ b to optimize ξ op i under its allocated toll budget B i (Section 4) for each operation op i, such that (a) the optimized sub-plan ξ′ op i is guaranteed to have a lower cost than ξ op i; and (b) ξ′ op i incurs at most B i of toll. That is, κ b makes use of the toll allowance B i on op i, and re-distributes the work units handled by ξ op i in a more balanced and optimized way, to reduce the cost of ξ op i.
However, there are two key challenges to rebalancing.
(C1) How to design κ b such that it can optimize ξ op i in an optimal way under a given toll budget B i on op i?
(C2) How to distribute the total toll budget B over all sub-plans of ξ Q (i.e., operations of Q) so that the total cost reduction of ξ Q is maximized?
We tackle (C2) by iteratively allocating toll budget to individual sub-plans of ξ Q, in the same spirit as the gradient descent algorithm [38] for optimization problems. Below we focus on (C1) : we propose operator κ b and show that it is a near-optimal design of its kind.
Given a sub-plan ξ op i, operator κ b works in two phases: (1) it first re-distributes the work units of ξ op i across the n sites subject to a toll budget B i allocated to op i; this yields a plan ξ′ op i that is guaranteed to reduce the cost; and (2) it then prepares the answers of ξ′ op i for the sub-plan that is subsequent to ξ op i in ξ Q. Here phase (2) is carried out by simply recovering the input distribution expected by that subsequent sub-plan. This ensures that ξ′ op i is compatible with its subsequent sub-plan, since the latter works with a certain input distribution (i.e., the distribution of the answer of ξ op i) due to the heterogeneous security protocols (see Section 2.1) .
Below we focus on phase (1) of κ b. We parameterize κ b with an integer k that controls the degree of changes to ξ op i: the larger k is, the more the cost is reduced but the more toll is consumed. We denote by κ b [k] the operator κ b instantiated with k. We apply κ b [k] to ξ op i by selecting k work units of ξ op i for re-distribution, to reduce its parallel execution cost.
It is nontrivial to pick the k units whose re-distribution yields the lowest cost. Below we first provide an algorithm for unit selection, and then prove that it is near-optimal among all such algorithms.
Algorithm ReBal. The algorithm, denoted by ReBal, works as follows. Given a sub-plan ξ op i of ξ Q computed in Section 4, a database D over n sites S = {S 1, ..., S n} , a data sharing pact ρ and the parameter k for κ b, ReBal returns an optimized sub-plan ξ′ op i by re-distributing k work units of op i. More specifically, ReBal first (a) identifies a set U b of k bottleneck work units for op i, and then (b) re-distributes them to improve ξ op i. It does (a) by iteratively identifying bottleneck sites and picking and adding their bottleneck work units to U b, where bottleneck sites are those whose work units have the maximum costs among all. It carries out (b) by assigning the work units in U b one by one to the sites with the least workload w.r.t. the cost of executing all work units of op i.
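A toy sketch of the two steps of ReBal on one sub-plan is given below: take k units off maximum-load (bottleneck) sites, then assign each to the currently least-loaded site. Budget and pact checks are elided, and the unit costs are made up; this is an illustration under those assumptions, not the full algorithm.

# Toy sketch of ReBal: (a) bottleneck unit selection, (b) reassignment
# of the selected units to the least-loaded sites.

def rebal(assign, unit_cost, k):
    """assign: {site: [unit, ...]}; unit_cost: {unit: cost}; returns a new
    assignment with k units moved off bottleneck sites."""
    load = {s: sum(unit_cost[u] for u in us) for s, us in assign.items()}
    moved = []
    for _ in range(k):                         # (a) pick bottleneck units
        s = max(load, key=load.get)            # site with the maximum load
        if not assign[s]:
            break
        u = max(assign[s], key=unit_cost.get)  # its most expensive unit
        assign[s].remove(u)
        load[s] -= unit_cost[u]
        moved.append(u)
    for u in sorted(moved, key=unit_cost.get, reverse=True):
        s = min(load, key=load.get)            # (b) least-loaded site
        assign[s].append(u)
        load[s] += unit_cost[u]
    return assign

assign = {1: ["u11", "u12", "u13"], 2: ["u21"]}
cost = {"u11": 5.0, "u12": 4.0, "u13": 3.0, "u21": 1.0}
print(rebal(assign, cost, k=1))   # moves u11 from S_1 to S_2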
Analysis. Algorithm ReBal is near-optimal among all algorithms of its kind. More specifically, denote by A k the class of algorithms that optimize ξ op i by selecting and re-distributing k work units of ξ op i. Then we have the following.
Proposition 3:
(1) It is NP -complete to find the optimal re-distribution of k work units of ξ op i.
(2) ReBal is a 2-approximation of the optimal in A k and runs in O (n 2 logn) time.
6. Embodiments
The following describes various embodiments for querying shared data with security heterogeneity.
FIG. 1 illustrates a flow diagram of an exemplary method 100 for querying shared data with security heterogeneity, according to an aspect. At block 110, a SQL query for shared data over a plurality of sites can be received. In some embodiments, the SQL query can be issued from any of the sites. The sites support query services collectively over their private data. Each site manages its data by its own DBMS and has its own local database schema. In some embodiments, the SQL query can also be obtained from a client, i.e., a user input, a computer program or a client device.
At block 120, generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites. In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites. A security protocol for a pair of sites specifies the lowest security guarantees and the encryption scheme of the data, so that one site can share the data of the other site. In some embodiments, the data sharing pact further comprises a toll function that measures the parallel execution cost of the distributed query plan.
At block 130, executing the distributed query plan at the sites and returning the results for the SQL query. The execution of the distributed query plan can be mediated and monitored by a trusted third party, and it takes place at the sites only. After executing the distributed query plan, the sites will send the results to the trusted third party, who then decrypts and returns the results to the client who issued the SQL query.
As discussed herein, the disclosed embodiment proposes an approach to query answering under heterogeneous security models, which defines query plans by incorporating data sharing agreements and the use of various security facilities. The embodiment aims to demonstrate the need, challenges and feasibility of querying shared data with security heterogeneity.
FIG. 2 illustrates a flow diagram of an exemplary method 200 for generating a distributed query plan, according to an aspect. At block 210, generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query. In some embodiments, the method can first generate a distributed query plan for the SQL query in a canonical form. That is, it extends the algebra tree of the SQL query into a DAG by replacing each algebra operation of the SQL query with a distributed query sub-plan for that operation, which has minimized toll.
At block 220, optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans. The method can distribute a total toll budget over all sub-plans of the query plan so that the total cost reduction of the query plan can be maximized.
The embodiment generates a toll-minimized query plan for the SQL query in a canonical form and further optimizes the query plan to reduce data sharing toll and parallel execution cost.
FIG. 3 illustrates a flow diagram of an exemplary method 300 for executing a distributed query plan, according to an aspect. At block 310, picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan.
In some embodiments, the logic unit can be selected from Docker container, enclave, SMC system, or trusted third party. Data is shared using the logic units to which the sites can transfer and upload datasets. Each logic unit is hosted by a site and used to perform all computations. The computation in each logic unit has direct access to the data at the site associated with it, but cannot access data at other sites except the part that is uploaded to the site associated with it.
At block 320, transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site.
At block 330, performing the operation and storing the result of the operation at the designated site by the logic unit. In some embodiments, the result of the operation can be stored in a protected mode, e.g., encrypted with OPE, access right controls, or symmetric encryption with keys.
FIG. 4 illustrates a block diagram of an exemplary system 400 for querying shared data with security heterogeneity, according to an aspect. The system 400 comprises an interface component 410, a planning component 420 and an executing component 430. The interface component 410 is configured to receive a SQL query for shared data over a plurality of sites. The planning component 420 is configured to generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites. The executing component 430 is configured to execute the distributed query plan at the sites and return the results for the SQL query.
In some embodiments, as illustrated in Fig. 5, the planning component 420 further comprises a generating component 510 and an optimizing component 520. The generating component 510 is configured to generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query. The optimizing component 520 is configured  to optimize the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
In some embodiments, as illustrated in Fig. 6, the executing component 430 further comprises a picking component 610, a transferring component 620 and a performing component 630. The picking component 610 is configured to pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan. The transferring component 620 is configured to transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site. The performing component 630 is configured to perform the operation and store the result of the operation at the designated site by the logic unit.
7. Experimental Study
Using benchmarks and real-life datasets, we conducted experiments to evaluate (1) the impact of heterogeneous security protocols on querying shared data; (2) the effectiveness of our toll-minimized planning technique; (3) the effectiveness of our toll-bounded plan optimization; and (4) integration of SMC-based system (SMCQL [9] ) and related comparison.
Experimental setting. We start with the setting.
Real-life dataset. We used TFACC , a real-life dataset that integrates the MOT Test Data [32] of Ministry of Transport test for vehicles in the UK from 2005 to 2016, and National Public Transport Access Nodes (NaPTAN) [31] . It has 19 tables with 113 attributes, about 46.7GB of data in total.
We generated 30 RA queries over TFACC. We used 5 query templates with the number #join of joins varying from 1 to 5. We generated the queries by instantiating the templates with values randomly selected from the dataset.
TPCH benchmark. We also used the standard benchmark TPCH [2] with its built-in queries. TPCH generates data using TPC-H dbgen [2] , with 8 relations. It has 22 built-in SQL queries, which were rewritten into RA queries in our tests. Along the same lines as for TFACC , we additionally generated 30 random queries with #join varying from 1 to 5.
Each relation of the datasets was randomly partitioned and distributed over a random subset of the machines (sites) .
Data sharing pacts. We used three simple data sharing pacts.
(1) Uniform pact ρ U. Under ρ U, the toll functions are of the form Toll (i, j) (X) =c ij|X| for all pairs S i and S j of sites, and the coefficients c ij are drawn from a uniform distribution over [0, 100] . Here |X| is the size of the transferred data X.
(2) Power law pact ρ P. Under ρ P, the toll functions are similar to those under ρ U except that the toll coefficients are drawn from a power law distribution over [0, 100] .
(3) Constant pact ρ c/∞ [p] . Under ρ c/∞ [p] , for any pair of sites S i and S j, Toll (i, j) (X) is either a constant C ij=C or +∞, where the probability of Toll (i, j) (X) =+∞ is p%.
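For illustration, the coefficients of the three pacts could be generated as sketched below; the exact distribution parameters are not specified above, so those used here (e.g., the Pareto shape) are assumptions.

# Sketch of generating the coefficients c_ij of the three experimental
# pacts; distribution parameters are assumptions.
import random

def uniform_pact(n):                       # rho_U: c_ij ~ Uniform[0, 100]
    return {(i, j): random.uniform(0, 100)
            for i in range(n) for j in range(n) if i != j}

def power_law_pact(n, a=2.5):              # rho_P: power-law, clamped to 100
    return {(i, j): min(100.0, random.paretovariate(a))
            for i in range(n) for j in range(n) if i != j}

def constant_pact(n, C=10.0, p=0.15):      # rho_{c/inf}[p]: C or +inf w.p. p
    return {(i, j): (float("inf") if random.random() < p else C)
            for i in range(n) for j in range(n) if i != j}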
Implementation. We developed a prototype system, referred to as DASH (DatA SHare) , for querying shared data under a heterogeneous data sharing pact such as ρ U, ρ P and ρ c/∞ [p] .
Prototyping. DASH employs PostgreSQL as the DBMS at each site. It implements the framework of Section 3 as the query planner to generate distributed plans. By default, DASH uses PostgreSQL to execute plans at each site. Given a toll budget B, DASH employs the techniques of Sections 4 and 5 to generate plans subject to B while minimizing the parallel cost. If B is not specified, DASH generates plans with minimized toll.
Baselines. We are not aware of any existing systems that query shared data and support heterogeneous security protocols. Nonetheless, we designed and implemented three variants of DASH as baselines for comparison:
ONE: selects the best site S * to evaluate a query Q centrally at S *, i.e., transferring all queried relations to S * and executing Q at S *; it ensures that at site S *, the evaluation incurs the minimum toll among all sites.
DASH 0: follows DASH to process operations op of Q one by one, but centrally at the best site for each op.
DASH -: follows the framework of DASH to process the operations of Q one by one, but randomly assigns work units to sites holding the data involved, e.g., assigning the work unit u ij to S i.
Configuration. The experiments were conducted on 20 Linux servers, each with a 6-core Intel i5-8400 2.8GHz CPU, 32 GB of memory and 1TB of HDD. The instances are fully connected with high-speed intra-network channels. By default, we used pact ρ P, the entire TFACC , 32 GB of TPCH , and all queries.
Experimental Results. We next report our findings.
Exp-1: Impact of heterogeneous security protocols. We first evaluated the impact of security protocols on the toll consumed by query evaluation over the distributed datasets. We evaluated all queries over both datasets under all three data sharing pacts ρ U, ρ P and ρ c/∞ [p] , with p ranging from 5% to 55%. Table 1 reports the average toll usage per query by all four methods. The results tell us the following.
Table 1: Average toll usage per query (Exp-1) . (The table itself is given as an image in the original. ) *In all toll functions Toll (i, j) (X) =c ij|X| , |X|=1 if X has 1GB of size.
(1) Different security pacts charge toll differently. Under ρ c/∞ [p] , some TFACC or TPCH queries cannot be answered with a finite toll by DASH -, DASH 0 or ONE when p≥15%, while DASH can answer all the queries even when p=50%.
(2) On both TPCH and TFACC , DASH consistently generates plans that incur the minimum toll under all the security pacts. For example, on TFACC under ρ P, the average toll consumption per query of DASH is 45.2, 1.8 and 1.7 times less than that of DASH -, DASH 0 and ONE , respectively.
Exp-2: Effectiveness of toll-minimized planning. We next evaluated the effectiveness of the toll-minimized planning of DASH. We tested the average toll usage per query when varying the sizes |D| of the datasets from 2 -4×|D max| to |D max|, where |D max| = 46.7 GB for TFACC and 32 GB for TPCH. As reported in Fig. 10A for TPCH , we can see the following. (a) Over larger datasets all methods consume more toll, as expected. (b) However, DASH consistently charges a much smaller toll than the other methods, e.g., 3.48, 7.83 and 91.51 times less than ONE , DASH 0 and DASH - on average over TPCH , respectively; moreover, the gap increases with larger |D|. The results for TFACC are similar (see [5] ) .
Exp-3: Effectiveness of optimization. We next evaluated the effectiveness of toll-bounded query optimization of DASH. We compared with a variant of DASH, denoted by DASH no, which turned off the optimization of Section 5. We evaluated the average query evaluation time of DASH and DASH no with all queries, full datasets, and a total toll budget B m=10|D|, where |D| is the total size of the dataset of all sites. To favor ONE , DASH 0 and DASH -, we set B m large so that these baselines can answer all the queries within the toll budget.
(1) Varying toll budget. Varying the total budget B from 20%B m to B m, we tested the query evaluation time of all methods. The result for TPCH is reported in Fig. 10B and shows the following. (a) DASH is the fastest among all; e.g., DASH is 1.86, 14.54 and 14.02 times faster than DASH -, DASH 0 and ONE , respectively, when B=B m on TPCH. (b) The optimization of DASH is effective: DASH is on average 2.76 and 2.55 times faster than DASH no on the two datasets, respectively.
(2) Varying datasets. Varying the size of the datasets in the same way as in Exp-2 with the full toll budget B m, we tested the average evaluation time per query. The results on TPCH are given in Fig. 10C (the results on TFACC are similar and omitted) . Similar to Exp-3 (1) , DASH consistently performs the best among all the methods, and does better as the datasets get larger.
Exp-4: Integration with SMCQL [9] . We evaluated the feasibility and performance of integrating DASH with SMC systems such as SMCQL [9] . We took SMCQL as the capsules for DASH and denote the integrated system by DASH smc. We evaluated the performance of DASH smc and SMCQL over 1 GB of TPCH (SMCQL does not scale to larger datasets) . In particular, to simulate the case study of Example 1, we used 20 machines and partitioned them into three groups, with 2, 10 and 8 machines representing governments, hospitals and insurance firms, respectively. To favor SMCQL and prevent DASH smc from bypassing SMCQL capsules, we set the protocols the same as in Fig. 8, except that (a) insurance machines do not send data to hospitals, and (b) all computations over insurance machines must use SMCQL capsules. We randomly distributed the TPCH relations over the machines. Using three TPCH queries Q4, Q12 and Q19 (simplified due to the restricted query support of SMCQL) , we evaluated the performance of DASH smc and SMCQL.
(1) SMCQL can be naturally integrated into DASH as capsules and becomes more practical in the heterogeneous setting. DASH smc is on average more than 18.89 times faster than SMCQL (SMCQL cannot finish within 48 hours for all cases) .
(2) DASH smc improves by 1.83 times when B increases from 20%B m to B m while SMCQL is insensitive to B (Fig. 10D) .
Summary. We find the following. (1) Security heterogeneity has a big impact on querying shared data. (2) Our proposed method effectively reduces both toll consumption and parallel execution cost. On average DASH consumes 2.59, 4.64 and 69.47 times less toll than ONE , DASH 0 and DASH -, respectively, and is 14.16, 14.44 and 2.2 times faster. (3) Existing systems can be integrated with our method as capsules to alleviate efficiency bottlenecks in the heterogeneous setting; this speeds up SMCQL by 18.89 times over 1GB of TPCH.
It should be noted that the embodiments described above can be implemented by hardware elements, software elements, or some combination of software and hardware. The hardware elements can include circuitry. The software elements can include computer code stored as machine-readable instructions on a tangible, non-transitory, machine-readable storage medium. Some embodiments can be implemented in one or a combination of hardware, firmware, and software.
Some embodiments can be implemented in a computing system or computing device including a memory comprising instructions, and one or more processors in communication with the memory, the one or more processors execute the instructions to perform the functions or operations described in this disclosure.
Some embodiments can also be implemented as instructions stored on a machine-readable medium, which can be read and executed by a computing platform to perform the operations described in this disclosure. A machine-readable medium can include any mechanism for storing or transmitting data in a form readable by a machine, e.g., a computer. For example, a machine-readable storage medium can include read only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; or any other machine-readable storage medium. Some embodiments can also be software product including a machine-readable medium, which stores instructions that, when executed, cause one or more processors to perform the functions or operations described in this disclosure.
The descriptions of the various embodiments of the disclosure have been presented for  purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure.
References
[1] 2018. Azure encryption. https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-overview.
[2] 2019. TPC-H. http://www.tpc.org/tpch/.
[3] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.
[4] Foto N. Afrati and Jeffrey D. Ullman. 2011. Optimizing Multiway Joins in a Map-Reduce Environment. TKDE 23, 9 (2011), 1282–1298.
[5] Anonymous. 2020. Full version. https://bit.ly/it2lx.
[6] Arvind Arasu, Spyros Blanas, Ken Eguro, Manas Joglekar, Raghav Kaushik, Donald Kossmann, Ravi Ramamurthy, Prasang Upadhyaya, and Ramarathnam Venkatesan. 2013. Secure database-as-a-service with Cipherbase. In SIGMOD.
[7] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. 1999. Complexity and Approximability Properties: Combinatorial Optimization Problems and Their Approximability Properties.
[8] Sumeet Bajaj and Radu Sion. 2011. TrustedDB: A Trusted Hardware based Database with Privacy and Data Confidentiality. In SIGMOD.
[9] Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N. Kho, and Jennie Rogers. 2017. SMCQL: Secure Query Processing for Private Data Networks. PVLDB 10, 6 (2017), 673–684.
[10] Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. 2018. ShrinkWrap: Efficient SQL Query Processing in Differentially Private Data Federations. PVLDB 12, 3 (2018), 307–320.
[11] Paul Beame, Paraschos Koutris, and Dan Suciu. 2013. Communication steps for parallel query processing. In PODS. 273–284.
[12] Alexandra Boldyreva, Nathan Chenette, Younho Lee, and Adam O'Neill. 2009. Order-Preserving Symmetric Encryption. In EUROCRYPT.
[13] Elette Boyle, Kai-Min Chung, and Rafael Pass. 2015. Large-Scale Secure Computation: Multi-party Computation for (Parallel) RAM Programs. In CRYPTO. 742–762.
[14] Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Thomas Schneider. 2018. HyCC: Compilation of Hybrid Protocols for Practical Secure Computation. In CCS.
[15] Department for Digital, Culture, Media & Sport and Department for Business, Energy & Industrial Strategy. 2017. Growing the artificial intelligence industry in the UK.
[16] John C Duchi, Michael I Jordan, and Martin J Wainwright. 2013. Local privacy and statistical minimax rates. In FOCS. 429–438.
[17] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44, 2 (2015), 11–16.
[18] Satoru Fujishige and Satoru Iwata. 2005. Bisubmodular Function Minimization. SIAM J. Discrete Math. 19, 4 (2005).
[19] Craig Gentry. 2009. Fully Homomorphic Encryption Using Ideal Lattices. In STOC.
[20] Government Digital Service, HM Passport Office and UK Statistics Authority. 2018. Information sharing code of practice.
[21] Martin Grötschel, László Lovász, and Alexander Schrijver. 1981. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica 1, 2 (1981), 169–197.
[22] Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. 2014. Demonstration of the Myria big data management service. In SIGMOD.
[23] HM Treasury. 2015. Data sharing and open data in banking: Response to the call for evidence.
[24] Horizon 2020 Research and Innovation Action. 2016. My Health, My Data. http://www.myhealthmydata.eu/.
[25] Muhammad Ishaq, Ana L. Milanova, and Vassilis Zikas. 2019. Efficient MPC via Program Analysis: A Framework for Efficient Optimal Mixing. In CCS.
[26] Boyan Kolev, Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. The CloudMdsQL Multistore System. In SIGMOD.
[27] Paraschos Koutris and Dan Suciu. 2011. Parallel evaluation of conjunctive queries. In PODS. ACM, 223–234.
[28] Ralf Kramer. 1997. Databases on the Web: Technologies for Federation Architectures and Case Studies (Tutorial) . In SIGMOD. 503–506.
[29] Tejas Kulkarni. 2019. Answering Range Queries Under Local Differential Privacy. In SIGMOD.
[30] Ee-Peng Lim, San-Yih Hwang, Jaideep Srivastava, Dave Clements, and M. Ganesh. 1995. Myriad: Design and Implementation of a Federated Database Prototype. Softw., Pract. Exper. 25, 5 (1995), 533–562.
[31] Find open data. 2014. http://data.gov.uk/dataset/naptan.
[32] Find open data. 2019. https://data.gov.uk/dataset/e3939ef8-30c7-4ca8-9c7c-ad9475cc9b2f/anonymised-mot-tests-and-results.
[33] M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems, Third Edition. Springer.
[34] Christos H Papadimitriou. 1994. Computational Complexity. Addison-Wesley.
[35] Raluca Ada Popa, Catherine M. Redfield, Nickolai Zeldovich, and Hari Balakrishnan. 2011. CryptDB: Protecting Confidentiality with Encrypted Query Processing. In SOSP.
[36] Christian Priebe, Kapil Vaswani, and Manuel Costa. 2018. EnclaveDB: A secure database using SGX. In IEEE Security & Privacy.
[37] The Register. 2016. https://www.theregister.co.uk/2016/07/06/caredata_binned/.
[38] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016).
[39] Adi Shamir. 1979. How to Share a Secret. Commun. ACM 22, 11 (1979), 612–613.
[40] Amit P. Sheth and James A. Larson. 1990. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Comput. Surv. 22, 3 (1990), 183–236.
[41] Stephen Tu, M. Frans Kaashoek, Samuel Madden, and Nickolai Zeldovich. 2013. Processing analytical queries over encrypted data. In PVLDB.
[42] Vijay V. Vazirani. 2003. Approximation Algorithms. Springer.
[43] Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. 2019. Answering Multi-Dimensional Analytical Queries under Local Differential Privacy. In SIGMOD.
[44] Andrew Chi-Chih Yao. 1982. Protocols for Secure Computations (Extended Abstract) . In FOCS. 160–164.

Claims (21)

  1. A computer-implemented method, comprising:
    receiving a SQL query for shared data over a plurality of sites;
    generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
    executing the distributed query plan at the sites and returning the results for the SQL query.
  2. The method of claim 1, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
  3. The method of claim 2, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
  4. The method of claim 3, wherein generating a distributed query plan further comprises:
    generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;
    optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
  5. The method of claim 4, wherein executing the distributed query plan at the sites further comprises:
    picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;
    transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;
    performing the operation and storing the result of the operation at the designated site by the logic unit.
  6. The method of claim 5, wherein the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
  7. The method of claim 5, wherein the result of the operation can be stored in a protected mode.
  8. A computing system, comprising:
    a memory comprising instructions, and
    one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:
    receive a SQL query for shared data over a plurality of sites;
    generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
    execute the distributed query plan at the sites and returning the results for the SQL query.
  9. The computing system of claim 8, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
  10. The computing system of claim 9, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
  11. The computing system of claim 10, wherein generating a distributed query plan comprises:
    generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;
    optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
  12. The computing system of claim 11, wherein executing the distributed query plan at the sites further comprises:
    picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;
    transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;
    performing the operation and storing the result of the operation at the designated site by the logic unit.
  13. The computing system of claim 12, wherein the logic unit is selected from the group consisting of a Docker container, an enclave, an SMC system, and a trusted third party.
  14. The computing system of claim 12, wherein the result of the operation is stored in a protected mode.
  15. A computer-readable storage medium comprising computer instructions that, when executed by one or more processors, cause the one or more processors to:
    receive a SQL query for shared data over a plurality of sites;
    generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;
    execute the distributed query plan at the sites and return the results for the SQL query.
  16. The computer-readable storage medium of claim 15, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
  17. The computer-readable storage medium of claim 16, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
  18. The computer-readable storage medium of claim 17, further comprising instructions that cause the one or more processors to:
    generate a canonical plan that consists of toll-minimized sub-plans for each operation of the SQL query;
    optimize the canonical plan to reduce its parallel execution cost by rebalancing the toll budgets of the sub-plans.
  19. The computer-readable storage medium of claim 18, further comprising instructions that cause the one or more processors to:
    pick and set up, for each operation in the distributed query plan, a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site;
    transfer shared data from the other site to the logic unit hosted by the designated site, based on the security protocol between the other site and the designated site;
    perform, by the logic unit, the operation and store the result of the operation at the designated site.
  20. The computer-readable storage medium of claim 19, wherein the logic unit is selected from the group consisting of a Docker container, an enclave, an SMC system, and a trusted third party.
  21. The computer-readable storage medium of claim 19, wherein the result of the operation is stored in a protected mode.
PCT/CN2021/099861 2020-06-14 2021-06-11 Querying shared data with security heterogeneity WO2021254288A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNPCT/CN2020/096011 2020-06-14
CN2020096011 2020-06-14

Publications (1)

Publication Number Publication Date
WO2021254288A1 (en) 2021-12-23

Family

ID=79268486

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/099861 WO2021254288A1 (en) 2020-06-14 2021-06-11 Querying shared data with security heterogeneity

Country Status (1)

Country Link
WO (1) WO2021254288A1 (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080256253A1 (en) * 2007-04-10 2008-10-16 International Business Machines Corporation Method and Apparatus for Cooperative Data Stream Processing
CN101933018A (en) * 2008-01-29 2010-12-29 惠普开发有限公司 Query deployment plan for a distributed shared stream processing system
US20140188845A1 (en) * 2013-01-03 2014-07-03 Sap Ag Interoperable shared query based on heterogeneous data sources
CN108182192A (en) * 2016-12-08 2018-06-19 南京航空航天大学 A kind of half-connection inquiry plan selection algorithm based on distributed data base
CN110659327A (en) * 2019-08-16 2020-01-07 平安科技(深圳)有限公司 Method and related device for realizing interactive query of data between heterogeneous databases
CN110955701A (en) * 2019-11-26 2020-04-03 中思博安科技(北京)有限公司 Distributed data query method and device and distributed system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAO YANG; FAN WENFEI; WANG YANGHAO; YI KE: "Querying Shared Data with Security Heterogeneity", Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 11 June 2020 (2020-06-11), pages 575-585, XP058551534, DOI: 10.1145/3318464.3389784 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21826875; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 21826875; Country of ref document: EP; Kind code of ref document: A1)