WO2021254288A1

WO2021254288A1 - Querying shared data with security heterogeneity

Info

Publication number: WO2021254288A1
Application number: PCT/CN2021/099861
Authority: WO
Inventors: Wenfei Fan; Yang Cao
Original assignee: Wenfei Fan
Priority date: 2020-06-14
Filing date: 2021-06-11
Publication date: 2021-12-23

Abstract

A method and system for querying shared data with security heterogeneity. The method includes receiving a SQL query for shared data over a plurality of sites(110); generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites(120); and executing the distributed query plan at the sites and returning the results for the SQL query(130). The solution above can answer distributed SQL queries in the heterogeneous security setting and reduce data sharing toll and query evaluation cost.

Description

QUERYING SHARED DATA WITH SECURITY HETEROGENEITY

TECHNICAL FIELD

The embodiments relate to database storage and access, and more particularly, relate to a method and system for querying shared data with security heterogeneity.

BACKGROUND

There has been increasing need for secure data sharing. Security and privacy issues hamper data sharing, since organizations are becoming increasingly aware of the economical loss and legal liabilities due to data breaches. To tackle it, a number of techniques have been devised to enable data sharing while providing certain security guarantees, e.g., encryption schemes such as order-preserving symmetric encryption (OPE) and homomorphic encryption (HOM) , Docker containers, or hardware-assisted enclaves such as Intel SGX and ARM TrustZone. In practice, a group of data owners often adopt a heterogeneous security scheme under which each pair of parties decide their own protocol to share data with diverse levels of trust. The scheme also keeps track of how the data is used. Distributed secure SQL query processing has been well studied in homogeneous environments. However, SQL query answering in a heterogeneous setting is much more challenging than in the homogeneous settings.

SUMMARY

A simplified summary is provided herein to help enable a basic or general understanding of various aspects of exemplary, non-limiting embodiments that follow in the more detailed description and the accompanying drawings.

An aspect disclosed herein relates to a computer-implemented method, comprising:

receiving a SQL query for shared data over a plurality of sites;

generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;

executing the distributed query plan at the sites and returning the results for the SQL query.

In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.

In some embodiments, the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.

In some embodiments, generating a distributed query plan further comprising:

generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;

optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.

In some embodiments, executing the distributed query plan at the sites further comprises:

picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;

transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;

performing the operation and storing the result of the operation at the designated site by the logic unit.

In some embodiments, the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.

In some embodiments, the result of the operation can be stored in a protected mode.

Another aspect relates to a computing system, comprising:

a memory comprising instructions, and

one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:

receive a SQL query for shared data over a plurality of sites;

generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;

execute the distributed query plan at the sites and returning the results for the SQL query.

In some embodiments, the one or more processors execute the instructions to generate a distributed query plan comprises:

In some embodiments, the one or more processors execute the instructions to execute the distributed query plan at the sites further comprises:

A further aspect relates to a computer-readable storage medium comprising computer instructions that when executed by one or more processors, cause the one or more processors to:

receive a SQL query for shared data over a plurality of sites;

In some embodiments, further comprising instructions that cause the one or more processors to:

generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;

optimize the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.

pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;

transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;

perform the operation and storing the result of the operation at the designated site by the logic unit.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing summary and the following detailed description of illustrative implementations are better understood when read in conjunction with the appended drawings. For the purpose of illustrating the implementations, there is shown in the drawings example constructions of the implementations; however, the implementations are not limited to the specific methods and instrumentalities disclosed. In the drawings:

FIG. 1 illustrates a flow diagram of an exemplary method for querying shared data with security heterogeneity, according to an aspect;

FIG. 2 illustrates a flow diagram of another exemplary method for generating a distributed query plan, according to an aspect;

FIG. 3 illustrates a flow diagram of an exemplary method for executing a distributed query plan, according to an aspect;

FIG. 4 illustrates a block diagram of an exemplary system for querying shared data with security heterogeneity, according to an aspect;

FIG. 5 illustrates a block diagram of an exemplary implementation for planning component, according to an aspect;

FIG. 6 illustrates a block diagram of an exemplary implementation for executing component, according to an aspect;

FIG. 7 illustrates an exemplary algorithm MTJ for generating toll-minimized plans;

FIG. 8 illustrates a simplified scenario of A3 for Example 1;

FIG. 9 illustrates a distributed query plan ξ _Q in Example 1;

FIG. 10A-10D illustrates curve comparison diagrams of experimental results.

Detailed Description

1. Introduction

There have been increasing demands for sharing data from e-government [20] , healthcare [37] , finance [23] and the AI industry [15] , among other things. For example, precision medicine requires sharing of clinical, genetic, environmental and lifestyle data for better disease treatment and prevention. However, security and privacy issues hamper data sharing, since organizations are becoming increasingly aware of the economical loss and legal liabilities due to data breaches. To tackle it, a number of techniques have been devised to enable data sharing while providing certain security guarantees, e.g., encryption schemes such as order-preserving symmetric encryption (OPE) [12] and homomorphic encryption (HOM) [19] , Docker containers, or hardware-assisted enclaves [36] such as Intel SGX and ARM TrustZone. The existing work assumes homogeneous settings, i.e., the same security protocol is assumed between all pairs of peers. In many real-world scenarios, however, there are often various trust relationships and hence different security requirements between data owners, as illustrated by the following case study taken from our industry partner.

Example 1: A data-sharing company (name withheld) has built a blockchain-based platform (similar to, e.g., MHMD [24] ) , to enable secure data analytics over distributed datasets owned by users such as government agencies, hospitals, clinics, researchers, drug stores, insurance firms and pharmaceutical companies, etc. Users of the same type are connected via an internal network, and these internal networks are connected to an external network via gateways.

Computations within an internal network can be carried out in Docker containers hosted by nodes of the network. Depending on the trust levels among the data owners, the container may use encryption schemes (e.g., OPE or HOM) to secure data uploaded to it, or use plaintext when the users trust each other. Computations across multiple internal networks may be conducted in hardware-assisted enclaves. Such an enclave incurs much higher upfront costs than Docker since, among other things, it requires a consensus among all gateways in the blockchain and will be audited. Datasets uploaded to the enclaves can be either encrypted or not, depending on the contract used for the consensus.

These security facilities (e.g., Docker and enclaves) reflect different security protocols and requirements between data owners and users. They cope with different threat levels , and incur security costs by their usage and upfront costs. Below are some example applications on the platform.

(A1) One is to find people who register in multiple clinics. Since clinics maintain high trustworthiness, the computation can be done in the clinic internal network by using a Docker container: all clinics upload their registration datasets to the container, where the answers are computed.

(A2) A government agent wants to find addresses of all households with at least one member who has contracted disease Z but has no health insurance. This involves government with household registration data, hospitals that have electronic medical records (EMR) of patients, and insurance firms. Government data can be uploaded to enclaves in the hospital network without encryption; hospital data can be loaded to enclaves in the insurance network with HOM encryption; and insurance data can be loaded to hospital enclaves with OPE encryption (more efficient but less secure than HOM) as hospitals have higher trust levels than insurance firms.

Such emerging applications introduce new challenges.

(a) The need for a heterogeneous security scheme. An application may involve multiple types of peers (data owners and users) , and various security facilities and guarantees are needed due to the trust levels between these peers. For example, plaintext can be shared between hospitals (via Docker) for A1 and hospitals only share encrypted data with insurance firms via more costly enclaves (A2) . To support these applications, a heterogeneous scheme is needed, to support varying trust levels and security means between different peers.

(b) Query processing with security heterogeneity. Query answering in a heterogeneous setting is much more challenging than in the homogeneous settings. We need to take into account various security charges when deciding where and how the computations should be carried out (i.e., query planning) , e.g., how many containers/enclaves are needed and how to distribute them among sites at multiple networks. This is nontrivial. We want to reduce the security costs. At the same time, we want to minimize the parallel execution time of the query plan. For example, for A2 of Example 1, it would be better to send insurance data to enclaves in the hospital network instead of the other way around.

Contributions &organization. This disclosure studies query evaluation in a heterogeneous security setting.

(1) Abstraction of data sharing scheme. We make a first attempt to study query processing under a data sharing scheme with heterogeneous security protocols. We demonstrate that the scheme can support emerging applications that bear various levels of trust between different peers, as commonly found in the real world. We define distributed query plans and incorporate data sharing cost (security charge) in terms of toll functions that abstract the usage of various security facilities (i.e., Docker containers/enclaves with plain/encrypted data) based on the access rights, trust levels and security demands among the peers.

(2) Querying shared data. We formalize the problem of querying shared data under heterogeneous security as a bi-criteria optimization problem, to minimize both parallel query evaluation cost and data sharing toll. We show that the problem is highly nontrivial: while it is decidable (NEXPTIME ) , it is already PSPACE -hard for SQL and

-hard for SPC to find optimal plans in special cases. Despite these, we introduce a framework for querying shared data.

(3) Distributed plan generation. Underlying the framework, we develop a polynomial-time (PTIME ) algorithm to generate distributed query plans while minimizing data sharing toll. We show that the algorithm is an O (logn) -approximation for joins , i.e., its join plans are within O (logn) of optimal ones, where n is the number of data owners.

(4) Distributed plan optimization. We further minimize the parallel evaluation cost of query plans generated in (3) while retaining a toll budget via workload rebalance. We show that the problem is also intractable. This said, we give a PTIME 2-approximate algorithm for rebalancing workload.

(5) Experimental study. Using real-life and synthetic data, we empirically evaluate the effectiveness and efficiency of our plan generation algorithms. We find the following. (a) Security heterogeneity does have an evident impact on query evaluation performance. (b) Our proposed method is effective in reducing both data sharing toll and parallel execution cost, outperforming its competitors by 25.57 and 10.27 times on average, respectively. (c) Existing security systems can be readily incorporated into the data sharing scheme and serve as security facilities.

Position of the work. This work is not to introduce another security protocol. It is not to investigate security guarantees of heterogeneous security models; nor is it to improve existing secure database systems (e.g., [9, 35] ) . Heterogeneous security protocols are already being used in real life, and the security community is studying their properties (e.g., [25, 14] ) . Instead, we study query evaluation over shared data when we are given heterogeneous security protocols. We aim to study the impact of such heterogeneous security protocols on query processing, in terms of costs incurred by data sharing agreements and reflected by the use of different security facilities. The costs stem from existing security facilities, protocols and systems. We make a first attempt to evaluate queries in their presence for emerging applications.

Related work. Distributed secure query processing has been well studied in homogeneous environments. Complete trust. With complete trust among the data owners, the problem becomes the standard parallel/distributed query processing problem [33] ; its goal is to minimize the communication costs of answering queries [4, 11, 27] .

Related is the study of federated databases [40, 28, 30] , which aim to provide an interface for querying a collection of distributed and autonomous relational databases. Heterogeneity has also been studied in this context, known as multistores or polystores e.g., [17, 26, 22] . However, the heterogeneity here refers to various data and programming models used by different data owners, not security heterogeneity.

Hardware-assisted solutions. With hardware support (e.g., security enclaves) , query processing over distributed sources can be carried out with moderate overhead, e.g., TrustedDB [8] , Cipherbase [6] and EnclaveDB [36] . An enclave protects the data and code running inside of it from being spied upon, even if the entire software stack of the host is compromised. In addition, a remote party can verify the code running inside the enclave through a process known as attestation. Thus, two data owners without trust can still jointly compute a function, by sending their data to an enclave, which runs some code (attested by both owners) to compute the function. The enclave can be hosted by one of the data owners, or even a third (untrusted) party. Docker containers are used in lieu of an enclave if the system administrator can be trusted.

Software-only solutions. In the absence of special hardware support, one can still build a distributed query processing system over distrusted peers using homomorphic encryption [19] or secure multi-party computation (SMC) [44, 39, 13] , such as CryptDB [35] , Monomi [41] and SMCQL [10, 9] . Both homomorphic encryption and SMC are capable of computing arbitrary functions over the data, but those general-purpose solutions are not very practical. In practice, one has to limit the types of queries supported by designing special-purpose protocols. Even so, these software-only solutions tend to be much more expensive than those with hardware support.

Locally differential privacy (LDP) . LDP [16] has recently emerged as another approach to querying distributed data. LDP algorithms for answering range queries on a single table have been developed [29, 43] . Different from security-based solutions, query results in the LDP model have to be probabilistic with certain random noise, and information leakage will accumulate in the LDP model when more queries are run on the same data.

This work differs from the prior work in the following. (1) We study a heterogeneous security environment, a setting being increasingly used in practice. It supports data sharing among a group of data owners and allows each pair of hosts to adopt a security protocol of their choice, as opposed to the homogeneous security assumption in the previous work. (2) We provide the first abstraction of heterogeneous security protocols. (3) We formalize the problem of querying shared data as a bi-criteria optimization problem, and study its complexity. (4) We propose an approach to answering generic SQL queries in the heterogeneous security setting.

2. Data Sharing with Security Heterogeneity

We start with a scheme to abstract data sharing with security heterogeneity (Section 2.1) , and then formalize the problem of querying shared data under the scheme (Section 2.2) .

2.1 An Abstraction of Data Sharing

The scheme, referred to as DShare , is to abstract computations over shared datasets with heterogeneous security protocols. It is characterized by the notions of data owners, a query planner, data sharing pacts and distributed query plans.

A group of data owners often agree upon data sharing protocols in practice. Each owner contributes its data for sharing, and is also a client of the shared data under the protocols.

Data owners. A collection of data owners, or simply sites,

agree to support query services collectively over their private data. Each owner manages its data by its own DBMS and has its own local database schema.

We assume a global schema

deduced by, e.g., schema mapping, to provide a uniform interface to write queries against the data, where R _i is a relation schema. Owner S _i has an instance

of

for i∈ [1, n] . Here some relations in

are possibly empty, i.e.,

does not necessarily have every relation of

We denote

by

and refer to it as a distributed instance of

at

The answer to a query Q over

is defined as

under the normal interpretation. Note that via renaming, this also allows us to compute local answers at any single site or any subset of

Query planner. Each data owner may issue a query Q, to compute the answer to Q over

The query will be handled by a trusted third party called query planner. Upon receiving a query Q from a data owner, the planner will come up with a distributed query plan that picks security facilities (Docker or enclave) , to comply with the security protocols agreed among the sites, and encryption schemes required for the operations. It also estimates a security charge, referred to as a toll, for the query plan. If it is agreeable by the query issuer, the query planner will instruct the data owners to carry out the query plan, charges the toll to the query issuer, and allocates the revenue to the data owners accordingly.

The query planner oversees the execution of the query plans and acts as an intermediary between the data owners and the query issuer. However, it does not access any of the local databases or carry out operations in the query plans itself.

Data sharing pact. We next abstract the varying security protocols between pairs of data owners based on their trust levels. We use the following computation model.

(1) Data is shared using capsules, logic units to which data owners can transfer and upload datasets; physically, a capsule can be instantiated with a Docker container, an enclave, an SMC system such as SMCQL [9] , or even a trusted third party. All computations are carried out in capsules only.

(2) Each capsule C is associated with a site S _j, referred to as a capsule hosted by S _j. Computation in C has direct access to the data on S _j but cannot access data at other sites except the part that is uploaded to C. When the computation in C completes, the intermediate results are stored at S _j in a protected mode via, e.g., access right controls, OPE encryption [12] , or symmetric encryption with keys held only by the query issuer as commonly used in cloud cryptograph such as Azure [1] ; then the capsule C will be terminated.

A security protocol for a pair of sites (S _i, S _j) specifies: [ (a) ] the lowest security guarantees that a capsule hosted by S _j must attain in order for S _j to share the data of S _i; and [ (b) ] encryption of data at S _i, i.e., which part of the data can be shared, what data needs to be encrypted before sending it to a capsule of S _j , and what encryption scheme to use.

A data sharing pact ρ for a set

of sites consists of (1) a security protocol for each pair of sites in

and (2) a toll function Toll () that measures the costs of all types of available capsules that can be employed to evaluate queries. In the real world, such costs are incurred by, e.g., renting trusted third party facilities as the capsules and encryption overhead.

Distributed query plan. We next define query plans over a set

of sites w.r.t. a data sharing pact ρ.

Consider a distributed instance

at

A distributed query plan ξ on

at

is a DAG (directed acyclic graph) , where the nodes of ξ denote atomic operations and edges represent their dependencies. Each atomic operation δ is an (n+3) -tuple (op, t _c, X ₁, ..., X _n, j) , where (i) op is an operator in the relational algebra ( RA ; projection π, selection σ, natural join

set difference -, set union ∪ and renaming λ) ; (ii) t _c is a capsule of a certain type, e.g., Docker, enclave, SMC system, or trusted third party; (iii) j∈ [1, n] denotes the site that hosts a capsule to carry op out ; and (iv) X _i is a relation, which is either part of data in

that can be shared with S _j by the security protocol between S _i and S _j, or the intermediate result

computed at site S _i by operations prior to δ in ξ.

Executing δ= (op, t _c, X ₁, ..., X _n, j) involves steps below:

(1) Set up a capsule C of type t _c for op, hosted by site S _j, such that for each i∈ [1, n] , C meets the security requirement of ρ for (S _i, S _j) , where (a)

or (b) there exists

that contains intermediate answers computed over data from S _i.

(2) For i∈ [1, n] , transfer X _i from S _i to the capsule C hosted by S _j, based on the security protocol between S _i and S _j.

(3) Perform the computation

and store

in a protected mode (e.g., encrypted with OPE) at S _j.

(4) Add the relation

to

at S _j.

That is, intermediate results of op are computed and stored as such to comply with the security guarantees of pact ρ. In particular, when

for all i∈ [1, n] , i≠j, (op, t _c, X ₁, ..., X _n, j) simply executes op on local data X _j at site S _j.

Condition (1b) ensures that each followup operation δ′= (op', X ₁', ..., X _n', p) of δ also complies with the security requirement of (S _i, S _p) if

and X _j'contains intermediate results

computed by δ from the data of S _i, even if

Edges of the DAG plan ξ specifies the dependencies of the atomic operations in ξ. In particular, if there exists X _i (in atomic operation δ= (op, t _c, X ₁, ..., X _n, j) ) that comes from relations at site S _j computed by atomic operation δ′, then there exists a directed edge from δ′ to δ in ξ.

The execution of the query plan is mediated and monitored by the query planer, and it takes place at the participating sites only. After executing the query plan, the sites will send the results to the query planner, who then decrypts and returns the results to the client who issued the query.

Toll. We next presents the toll function Toll () . For a query plan ξ over

Toll (ξ) is the sum of Toll (δ) for all operations δ= (op, t _c, X ₁, ..., X _n, j) in ξ , where Toll (δ) estimates the charge for executing δ. It consists of the following:

(a) an upfront cost Toll ₀ for setting up the capsule for op;

(b) cost Toll _d for transferring remote data via secure channel to the capsule for op hosted by S _j, determined by the amount of data and encryption overhead; and

(c) cost Toll _c for executing op in the capsule, determined by the duration that the computation op takes.

More specifically, (a) Toll ₀ depends on the type of the capsule; (b) Toll _d is measured as

for all sites S _i that have data X _i required by δ, where Toll _(i, j) (X _i) is the cost of encrypting and transferring X _i of S _i to the capsule hosted by S _j; it is determined by the size of X _i and the encryption scheme required by the protocol between S _i and S _j; in particular, Toll _(j, j) (X _j) =0 since C is hosted by S _j and can access its local data X _j without extra cost. Finally, (c) Toll _c is the charge for sustaining the capsule for executing op.

Using our familiar terms, we refer to Toll _d and Toll _c as communication and computation cost, respectively. These costs stem from the security protocol between S _i and S _j, and are reflected as costs incurred by the security facility employed, including e.g., encryption cost, beyond their normal scope.

Example 2: [Case study] Continuing with Example 1, we show that applications A1 and A2 can be abstracted by DShare. Denote by (a) Household (address, pid , name ) the registration relation maintained by government, (b) Reg (pid, clinic ) and EMR (pid, disease ) the clinic registration relation and medical record relation, respectively, owned by the clinics and hospitals, and (c) Insurance (pid, company , policy ) the customer records maintained by insurance firms. We present their data sharing pacts ρ with associated toll functions.

Application A1.

It is expressed as a join query

The pact ρ specifies the lowest security level for a capsule C. Based on this, the query planner may pick Docker container as C, and remote data will be directly uploaded to C. The shared data will not be stored after the computation since a capsule is ephemeral. These meet the security requirements of A1 since clinics are trusted and do not risk side-channel leakage.

The toll function estimates the costs of available capsules C, and is determined by the type, configuration and duration of C to use (which in turn depends on the operations to be carried out in C) . For Docker, Toll ₀=0 since Docker incurs negligible upfront cost; Toll _(i, j) (X _i) =c _ij|X _i| is the communication cost for transferring X _i from S _i to the Docker at S _j (c _ij is a coefficient denoting the unit network price) ; and

is c|Reg| ² (c is a coefficient similar to c _ij) , the cost of sustaining the Docker container for the join.

Application A2.

It is an RA query

As the query involves data from three internal networks, pact ρ requires a higher security level for data sharing as described in Example 1. Based on this, the query planner may take enclaves as capsules. In particular, since insurance firms have lower level of trust than hospitals, data is required to be encrypted using HOM [19] before uploading it to insurance enclaves, to prevent leakage of user identifications and their EMR records.

Here the toll function is determined by the capsule types, operations, encryption schemes and communication. First consider, e.g., the join δ of Household and EMR at a hospital site S _j. For enclave, Toll (δ) is estimated as follows : Toll ₀=L, where L is the upfront cost of setting up the enclave at S _j, and is estimated as the average time for reaching the consensus among the gateways; Toll _(i, j) (X _i) is c _ij|X _i|, where c _ij is a coefficient reflecting the cost of shipping Household to the enclave at S _j; and

where D _Household and D _EMR are the datasets for the join at S _j.

Now consider operation I-Insurance, where I is the join result above. Assume that the set difference takes place in a capsule C hosted by the insurance network. Then the type of C is determined by both the protocol between government and insurance firm and the one between hospital and insurance, to warrant security for each data owner involved.

As shown above, in practice toll functions can be readily deduced from the types of capsules and the complexity of operations. Alternatively, as a common practice, they can also be empirically estimated by testing or learning over small dataset samples. In addition, for blockchain-based data sharing systems similar to Example 1 or MHMD [24] , tolls often denote the economic incentives defined by smart contracts for consensus. There has also been recent work from the security community on toll (cost) models of hybrid protocols, e.g., [14, 25]. Note that toll functions only specify toll charges of all possible capsules usage in an application; they do not determine which, where and how capsules to be used.

DShare allows arbitrary positive polynomial functions as Toll (δ) that are composed of submodular set functions [21] for Toll _(i, j) (X) (of Toll _d) and Toll _c (bi-modular [18] if op of δ is a binary operator, e.g., join) . All toll functions in Example 2.1 are of this type. Our industry partners find that these suffice to express common security charges in practice.

Guarantees and properties. We adopt the following threat model. The data owners are semi-honest (a.k.a. honest but curious) , i.e., each owner will faithfully execute the query plan, but may try to derive information about other parties’ data. This is why all operations are executed in capsules and the intermediate relations are stored in protected mode. The query planner runs at a trusted third party and is assumed trustworthy, similar to the honest broker in secure database systems e.g., [35, 9] , but without direct access to data.

Under this threat model, DShare enforces the security protocols of data sharing pact ρ by (a) the trusted query planner, and (b) proper capsules for carrying out query plans.

(1) The pact specifies the minimum security requirement between each pair S _i and S _j of sites. The query planner enforces the security guarantee by picking right capsules. Each operation (op, t _c, X ₁, ..., X _n, j) is performed by a capsule that meets the maximum of all lowest security levels for (S _i, S _j) (i∈ [1, n] ) . Followup operations retain no less security levels .

(2) Depending on the security protocols, different pairs of sites may have distinct security requirements. The planner picks capsules that meet the security requirements and minimize the cost, e.g., it picks docker containers for A1 of Example 1, which satisfy the security requirements specified by the protocol and are cheaper than enclaves and SMC systems.

Remark. Composing security protocols is a challenging and active topic of the security community (e.g., [25, 14] ) . It has emerged in semi-trusted data federations, e.g., MHMD [24] and Example 1. This paper takes a heterogeneous setting used in practice and focuses on query planner that selects capsules and distributes computations across the federation, to improve query performance while retaining the required security levels. This said, the query planner can be adapted to other security composition and propagation protocols.

2.2 The Problem of Querying Shared Data

Critical to DShare is its query planner. While there has been a host of research on security protocols and facilities, no prior work has studied how to generate query plans that comply with a data sharing pact with security heterogeneity.

This motivates us to study the toll-bounded query answering problem, denoted by TBQA. Informally, it is to find the best distributed query plan for a given query subject to a toll budget imposed by a data sharing pact. It is stated as follows.

Input: A global schema

n sites

adistributed instance

of

over

a data sharing pact ρ , a natural number B, and an RA query Q over

Output: A distributed plan ξ for Q over

such that the toll Toll (ξ) of ξ over

under ρ is no larger than B.

Objective: Minimize the parallel execution cost of ξ over

denoted by

As remarked earlier, a data sharing pact ρ specifies only the minimum security requirements for sharing data between sites and their associated toll charges. We have to find query plan ξ that determines, in addition to conventional planning, how to select and distribute capsules for computations that can meet heterogeneous security requirements of ρ, while taking advantages of the heterogeneity and minimizing its parallel execution cost

To complete the statement of TBQA , we define

below. Let

be the execution cost of atomic operation δ over

Then

is inductively defined as follows:

If ξ is a single atomic operation δ,

If ξ consists of sub-plans ξ ₁, ..., ξ _l and an atomic operation δ, where ξ ₁, ..., ξ _l are predecessors of δ in ξ, then

Intuitively,

characterizes the total parallel execution costs of atomic operations in ξ when parallel execution of independent atomic operations is fully exploited.

We assume that the query planner can efficiently estimate the cost incurred by an atomic operation δ= (op, t _c, X ₁, ..., X _n, j) . For example, when op is

is |R|×|S|.

3. Querying Shared Data

In this section, we first study the complexity of querying shared data and then outline our approach to solving TBQA.

Complexity of TBQA. Denote by TBQA ^d the decision version of TBQA. That is to decide, given the same input of TBQA and an additional number L, whether there exists a distributed query plan ξ for Q over

such that its toll is at most B and its parallel execution cost

is at most L.

We also study a related problem, referred to as toll-bounded answerability problem and denoted by TBA. Given

ρ and B as in TBQA, it is to decide whether there exists a distributed query plan ξ for Q over

with toll at most B.

Intuitively, TBA is to check whether TBQA even has a feasible solution or not. TBQA is at least as hard as TBA.

We say that a data sharing pact ρ is simple if Toll ₀=Toll _c=0 and Toll _(i, j) (X) =c _ij|X|. Both problems are intractable even under such simple pacts that involve only two sites.

Theorem 1 : Both TBQA ^d and TBA are

(1) decidable in NEXPTIME ;

(2) PSPACE -hard even when ρ is simple; and

(3)

-hard even when Q is in SPC and ρ is simple.

Moreover, (2) and (3) hold even when

has two sites only.

Our approach. In light of Theorem 1, practical solutions to TBQA have to be approximate. We next propose such an approach, which consists of two steps outlined below.

Step (1) : Finding toll-minimized canonical plans (Section 4) . We first generate a distributed query plan ξ _Q for Q in a canonical form : ξ _Q extends the algebra tree T _Q (cf. [3] ) of Q into a DAG by replacing each algebra operation op of Q with a distributed query plan ξ _op that has minimized the toll in

Note that here an edge from op ₁ to op ₂ of Q in T _Q may be extended to multiple edges, which connect atomic operations in

to those in

based on their dependencies.

Step (2) : Reducing parallel execution cost (Section 5) . Given ξ _Q of step (1) , we check whether Toll (ξ _Q) of ξ _Q exceeds the toll budget B. We return “No” if so, i.e., budget B is too small to answer Q in

under ρ. Otherwise, we further improve ξ _Q by making use of the remaining toll allowance, to reduce its parallel execution cost

without exceeding B.

4. Generating Toll-Minimized Plans

In this section, we show how to carry out step (1) of our approach (Section 3) . Given an RA query Q, a distributed instance

of schema

at n sites

and a data sharing pact ρ, we generate a distributed plan ξ _Q for Q, which consists of toll-minimized plan ξ _δ for each operation δ in Q. Below we focus on joins; the other RA operations are similar and simpler.

Approximability. One can verify that even for joins, TBQA remains intractable, by reduction from the vertex cover problem , which is NP -complete [34] . Nonetheless, there exists a PTIME approximation. Assume that

is reasonably large, and constant Toll ₀ is negligible w.r.t. Toll _d or Toll _c.

Theorem 2: There exists a PTIME O (logn) -approximation algorithm for computing plans with minimum toll for joins.

As a proof, we give such an algorithm, denoted by MTJ, as shown in Fig. 7.

Consider query

Algorithm MTJ generates a plan ξ for Q over

by reduction to the minimum set cover (MSC ) problem [34] . Below we first present algorithm MTJ by reduction to MSC, which gives us an O (logn) -approximation. However, a direct use of the reduction yields a naive version of MTJ with an exponential time (EXPTIME) complexity. Nonetheless, we develop a technique that is able to reduce its complexity from EXPTIME to PTIME.

Approximation by reduction. We start with a naive version of MTJ by approximation-preserving reduction [7] to MSC, so that MTJ computes toll minimized join plans by making use of available approximate algorithms for MSC.

Reduction. The idea of the reduction is to (a) represent each query plan ξ for join query Q as a “workload” distribution plan that assigns necessary data movement for answering Q in

among the sites; and (b) reformulate the assignment problem as a variant of MSC that admits a PTIME logarithmic-factor approximation algorithm [42] .

Consider a join query

distributed database

over sites S ₁, ..., S _n and data sharing pact ρ. We construct an instance of MSC , i.e., a universe U of elements and a set

of weighted subsets of U, such that each c-approximation answer to MSC encodes a distributed join plan for Q with toll at most c-times of the minimum toll among all plans for Q.

Denote by

the instance of relation R at site S _i (i∈ [1, n] ) ; similarly for

For convenience, we assume w. l. o. g. that neither

nor

is empty for all i∈ [1, n] . For any i, j∈ [1, n] ,

is called a work unit of Q in

Then:

(1) U consists of all work units of Q in

and

(2)

consists of pairs (i, W) for all i∈ [1, n] and

We say that (i, W) covers element

in U if u _jk∈W. The weight of (i, W) , denoted by t (i, W) , is defined as the sum of the total Toll _d of fetching

and

from sites S _j and S _k to site S _i, and total Toll _c of computing

for all units

in W. Note that this has to take into account toll sharing for relations appearing in multiple work units in W.

Algorithm. One can readily verify that the reduction is approximation-preserving [7] . This gives us an O (logn) -approximation algorithm, as a naive implementation of MTJ , for computing minimum toll join plans (Algorithm 1) , by using the O (log|U|) -approximation of MSC [42] (|U|=n ²) . Here set covering

specifies the assignment of atomic operations for the work units in

to their host sites (line 4) . The capsule types of the atomic operations are such picked that they minimize the cost while satisfying all the relevant security levels specified in the protocols for

(line 5) .

Example 3 : Recall the query for A2 given in Example 2.1, denoted by Q. Assume a simplified data sharing scenarios shown in Fig. 8. We show how algorithm MTJ generates the distributed query plan ξ _Q depicted in Fig. 9 for Q.

Take the join

of Q for example. Then op ₁ has a set

of 4 work units

for all i∈ {1, 2} , j∈ {3, 4} (see Fig. 9) . After the reduction, the set

consists of (i, W) for all

MTJ picks (3, {u ₁₃, u ₂₃} ) and (4, {u ₁₄, u ₂₄} ) as

that covers

with total toll 0. It then interprets

as

consisting of the atomic joins for I ₃₂ and I ₄₂ of Fig. 9. Note that this is actually the optimal plan for op ₁ since

incurs no toll at all.

Assume that sizes |I ₃₂|, |I ₄₂| and

for all i∈ [5, 7] are N. Then along the same lines, MTJ generates a distributed plan

with the atomic operations for I ₃₃ and I ₄₃ of Fig. 9, which is also optimal for op ₂ with

From exponential to polynomial . While Algorithm 1 is an O (logn) -approximation, it has an exponential time complexity since

of line 3 is of size exponential in |U| (i.e., exponential in n ²) . Nonetheless, below we show that this can be reduced to PTIME , based on the following:

(a) (i _*, W _*) identified in line 3 of Algorithm 3 equals

where U is the set of all work units of Q in

(b) For each i, computing

is equivalent to finding the minimum α such that there exists a subset W of U with f _α (W) ≤0, where

(c) For any fixed α, checking whether this holds can be done efficiently in PTIME since one can prove that f _α (W) is a submodular function, and submodular minimization can be done in PTIME via, e.g., [21] .

From these, the while loop (line 3) of Algorithm 3 can actually be implemented in PTIME by computing

as W _i for each i∈ [1, n] , which can be implemented by a binary search for the minimum α in [0, α _max] such that min _Wf _α (W) ≤0, where

in which c _(i, k) is the constant in the toll function Toll _(i, k) (X) =c _(i, k) |X| specified by ρ. The search terminates when the range [a, b] for α has a gap (|b-a|) less than

Hence, there are at most logn ²*α _max rounds of search, where each round is an invocation of submodular minimization, which is in PTIME (e.g., [21] ) .

That is, the while loop of Algorithm 3 is reduced to PTIME in n and

Therefore, MTJ can be implemented in PTIME in n and log

and is an O (logn) -approximation for computing minimum-toll distributed plans for joins.

5. Plan Rebalancing

After generating a toll-minimized canonical plan ξ _Q for Q in Section 4, we next study how to further optimize ξ _Q by reducing its parallel execution cost

This is to carry out step (2) of our approach outlined in Section 3. While the adjustments may increase the toll of the revised plan, we make sure that the toll is below the budget B, i.e., we make use of the remaining toll allowance B-Toll (ξ _Q) to reduce

Our technique, referred to as plan rebalancing, is motivated by the following. Consider the sub-plan

of ξ _Q for an operation op _i of Q. Here

is generated to minimize toll (Section 4) and hence could be imbalanced. Observe that

is dominated by the maximum cost of individual sites (Section 2.2) ; hence imbalanced workloads increase

In light of this, rebalancing works by iteratively applying an atomic balancing operator κ _b to optimize

under its allocated toll budget B _i (Section 4) for each operation op _i, such that (a) the optimized sub-plan

is guaranteed to have a lower cost than

and (b)

incurs at most B _i of toll. That is, κ _b makes use of toll allowance B _i on op _i, and re-distributes the work units handled by

in a more balanced and optimized way, to reduce the cost of

However, there are two key challenges to rebalancing.

(C1) How to design κ _b such that it can optimize

in an optimal way under a given toll budget B _i on op _i?

(C2) How to distribute the total toll budget B over all sub-plans of ξ _Q (i.e., operations of Q) so that the total cost reduction of ξ _Q is maximized?

We tackle (C2) by iteratively allocating toll budget to individual sub-plans of ξ _Q in the same spirit of the gradient descent algorithm [38] for optimization problems Below we focus on (C1) : we propose operator κ _b and show that it is a near-optimal design of its kind.

Given a sub-plan

operator κ _b works in two phases: (1) it first re-distributes the work units of

across n sites subject to a toll budget B _i allocated to op _i; this yields plan

that guarantees to reduce the cost; and (2) it then prepares the answers of

for

that is subsequent to

Here phase (2) is carried out by simply recovering the input distribution for the subsequent sub-plan

of

It is to ensure that

is compatible with its subsequent sub-plan

since

works with a certain input distribution (i.e., the distribution of the answer of

) due to the heterogeneous security protocols (see Section 2.1) .

Below we focus on phase (1) of κ _b. We parameterize κ _b with an integer k that controls the degree of changes to

the larger k is, the larger cost is reduced but more toll is consumed. We denote by κ _b [k] the operator κ _b instantiated with k. We apply κ _b [k] to

by selecting k work units of

for re-distribution, to reduce its parallel execution cost.

It is nontrivial to pick k units that inflict the lowest cost. Below we first provide an algorithm for unit selection, and then prove that it is near-optimal among all such algorithms.

Algorithm ReBal. The algorithm, denoted by ReBal works as follows. Given a sub-plan

of ξ _Q computed in Section 4, a database

over n sites

a data sharing pact ρ and parameter k for κ _b, ReBal returns an optimized sub-plan

by re-distributing k work units for

More specifically, ReBal first (a) identifies a set

of k bottleneck work units for op _i, and then (b) re-distributes them to improve

It does (a) by iteratively identifying bottleneck sites, picking and adding its bottleneck work units to

where bottleneck sites have work units of maximum costs among all. It carries out (b) by assigning work units in

one by one to sites with least workload w.r.t. the cost of executing all work units of op _i.

Analysis. Algorithm ReBal is near-optimal of all algorithms of its kind. More specifically, denote by

the class of algorithms that optimize

by selecting and re-distributing k work units of

Then we have the following.

Proposition 3:

(1) It is NP -complete to find the optimal optimization of

by re-distributing k work units.

(2) ReBal is a 2-approximation of the optimal in

and is in O (n ² logn) -time.

6. Embodiments

The following provides the various embodiments for querying shared data with security heterogeneity.

FIG. 1 illustrates a flow diagram of an exemplary method 100 for querying shared data with security heterogeneity, according to an aspect. At block 110, a SQL query for shared data over a plurality of sites can be received. In some embodiments, the SQL query can be issued from each of the sites. The sites support query services collectively over their private data. Each site manages its data by its own DBMS and has its own local database schema. In some embodiments, the SQL query can be also obtained from a client, i.e., a user input or a computer program or a client device.

At block 120, generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites. In some embodiments, the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites. A security protocol for a pair of sites specifies the lowest security guarantees and encryption scheme of data so that a site can share the data of the other site. In some embodiments, the data sharing pact further comprises a toll function that measures the parallel execution cost of the distributed query plan.

At block 130, executing the distributed query plan at the sites and returning the results for the SQL query. The execution of the distributed query plan can be mediated and monitored by a trusted third party, and it takes place at the sites only. After executing the distributed query plan, the sites will send the results to the trusted third party, who then decrypts and returns the results to the client who issued the SQL query.

As discussed herein, the disclosed embodiment proposes an approach to query answering under heterogeneous security models, which defines query plans by incorporating data sharing agreements and the use of various security facilities. The embodiment aims to demonstrate the need, challenges and feasibility of querying shared data with security heterogeneity.

FIG. 2 illustrates a flow diagram of an exemplary method 200 for generating a distributed query plan, according to an aspect. At block 210, generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query. In some embodiments, the method can first generate a distributed query plan for the SQL query in a canonical form. That is, extend the algebra tree of the SQL query into a DAG by replacing each algebra operation of the SQL query with a distributed query sub-plan of the operation. The distributed query sub-plan of the operation has minimized toll.

At block 220, optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans. The method can distribute a total toll budget over all sub-plans of the query plan so that the total cost reduction of the query plan can be maximized.

The embodiment generates a toll-minimized query plan for the SQL query in a canonical form and further optimizes the query plan to reduce data sharing toll and parallel execution cost.

FIG. 3 illustrates a flow diagram of an exemplary method 300 for executing a distributed query plan, according to an aspect. At block 310, picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan.

In some embodiments, the logic unit can be selected from Docker container, enclave, SMC system, or trusted third party. Data is shared using the logic units to which the sites can transfer and upload datasets. Each logic unit is hosted by a site and used to perform all computations. The computation in each logic unit has direct access to the data at the site associated with it, but cannot access data at other sites except the part that is uploaded to the site associated with it.

At block 320, transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site.

At block 330, performing the operation and storing the result of the operation at the designated site by the logic unit. In some embodiments, the result of the operation can be stored in a protected mode, e.g., encrypted with OPE, access right controls, or symmetric encryption with keys.

FIG. 4 illustrates a block diagram of an exemplary system 400 for querying shared data with security heterogeneity, according to an aspect. The system 400 comprises an interface component 410, a planning component 420 and an executing component 430. The interface component 410 is configured to receive a SQL query for shared data over a plurality of sites. The planning component 420 is configured to generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites. The executing component 430 is configured to execute the distributed query plan at the sites and returning the results for the SQL query.

In some embodiments, as illustrated in Fig. 5, the planning component 420 further comprises a generating component 510 and an optimizing component 520. The generating component 510 is configured to generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query. The optimizing component 520 is configured to optimize the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.

In some embodiments, as illustrated in Fig. 6, the executing component 430 further comprises a picking component 610, a transferring component 620 and a performing component 630. The picking component 610 is configured to pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan. The transferring component 620 is configured to transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site. The performing component 630 is configured to perform the operation and storing the result of the operation at the designated site by the logic unit.

7. Experimental Study

Using benchmarks and real-life datasets, we conducted experiments to evaluate (1) the impact of heterogeneous security protocols on querying shared data; (2) the effectiveness of our toll-minimized planning technique; (3) the effectiveness of our toll-bounded plan optimization; and (4) integration of SMC-based system (SMCQL [9] ) and related comparison.

Experimental setting. We start with the setting.

Real-life dataset. We used TFACC , a real-life dataset that integrates the MOT Test Data [32] of Ministry of Transport test for vehicles in the UK from 2005 to 2016, and National Public Transport Access Nodes (NaPTAN) [31] . It has 19 tables with 113 attributes, about 46.7GB of data in total.

We generated 30 RA queries over TFACC. We used 5 query templates with the number #join of joins varying from 1 to 5. We generated the queries by instantiating the templates with values randomly selected from the dataset.

TPCH benchmark. We also used standard benchmark TPCH [2] with its built-in queries. TPCH generates data using TPC-H dbgen [2] , with 8 relations. It has 22 built-in SQL queries, which were rewritten into RA queries in our tests. Along the same lines as for TFACC , we also additionally generated 30 random queries with #join varying from 1 to 5.

Each relation of the datasets was randomly partitioned and distributed over a random subset of the machines (sites) .

Data sharing pacts. We used three simple data sharing pacts.

(1) Uniform pac ρ _U. Under ρ _U, the toll functions are of the form Toll _i, j (X) =c _i, j (|X|) for all pairs S _i and S _j of sites, and the coefficients c _i, j are from a uniform distribution of [0, 100] . Here |X| is the size of the transferred data X.

(2) Power law pact ρ _P. Under ρ _P, the toll functions are similar to those under ρ _U except that the toll coefficients are from a power law distribution of [0, 100] .

(3) Constant pact ρ _c/∞ [p] . Under ρ _c/∞ [p] , for any pair of sites S _i and S _j, Toll _i, j (X) is either a random constant C _ij=C or +∞, where the probability of Toll _i, j (X) =+∞ is p%.

Implementation. We developed a prototype system, referred toDASH (DatA SHare) , for querying shared data under a heterogeneous data sharing pact such as ρ _U, ρ _P and ρ _c/∞ [p] .

Prototyping. DASHemploys PostgreSQL as the DBMS at each site. It implements the framework of Section 3 as the query planner to generate distributed plans. By default, DASH uses PostgreSQL to execute plans at each site. Given a toll budget B, DASH employs the techniques of

Sections

4 and 5 to generate plans subject to B while minimizing parallel cost. If B is not specified, DASHgenerates plans with minimized toll.

Baselines. We are not aware of any existing systems that query shared data and support heterogeneous security protocols. Nonetheless, we designed and implemented three variants of DASH as baselines for comparison:

ONE: selects the best site S _* to evaluate a query Q centrally at S _*, i.e., transferring all queried relations to S _* and executing Q at S _*; it ensures that at site S _*, the evaluation incurs the minimum toll among all sites.

DASH ⁰: follows DASH to process operations op of Q one by one, but centrally at the best site for each op.

DASH ^-: follows the framework of DASHto process operations of Q one by one, but randomly assigns work units to sites with data involved, e.g., assigning

to S _i.

Configuration. The experiments were conducted on 20 Linux servers, each with 6-core Intel i5-8400 2.8GHz CPU, 32 GB of memory and 1TB of HDD. The instances are fully connected with high speed intra-network channels. By default, we used model ρ _P, entire TFACC , 32 GB of TPCH , and all queries.

Experimental Results. We next report our findings.

Exp-1: Impact of heterogeneous security protocols. We first evaluated the impact of security protocols on toll consumed by query evaluation over the distributed datasets. We evaluated all queries over both datasets under all three data sharing pacts ρ _U, ρ _P and ρ _c/∞ [p] , when p%ranges from 5%to 55%. Table 1 reports the average toll usage per query by all four methods. The results tell us the following.

^*In all toll functions Toll _i, j (X) =c _ij|X| _, |X|=1 if X has 1GB of size.

Table 1: Average toll usage per query (Exp-1)

(1) Different security pacts charge toll differently. Under ρ _c/∞ [p] , some TFACC or TPCH queries cannot be answered with a finite toll by DASH ^-, DASH ⁰ or ONE when p≥15%, while DASH can answer all the queries even when p=50%.

(2) On both TPCH and TFACC, DASHconsistently generates plans that incur the minimum toll under all the security pacts. For example, on TFACC under ρ _P, the average toll consumption per query of DASH is 45.2, 1.8 and 1.7 times less than that of DASH ^-, DASH ⁰ and ONE, respectively.

Exp-2: Effectiveness of toll-minimized planning. We next evaluated the effectiveness of toll-minimized planning of DASH. We tested the average toll usage per query for query evaluation when varying the sizes |D| of datasets from 2 ^-4×|D _max| to |D _max|, where |D _max|=46.7 GB for TFACC and 32 GB for TPCH. As reported in Fig. 10A for TPCH , we can see the following. (a) Over larger datasets all methods consume larger toll, as expected. (b) However, DASH consistently charges much smaller toll than the other methods, e.g., 3.48, 7.83 and 91.51 times less than ONE, DASH ⁰ and DASH ^- on average over TPCH , respectively; moreover , the gap increases with larger D. The results for TFACC are similar (see [5] ) .

Exp-3: Effectiveness of optimization. We next evaluated the effectiveness of toll-bounded query optimization of DASH. We compared with a variant of DASH, denoted by DASH _no, which turned off the optimization of Section 5. We evaluated the average query evaluation time of DASH and DASH _no with all queries, full datasets, and a total toll budget B _m=10|D|, where |D| is the total size of the dataset of all sites. To favor ONE , DASH ⁰ and DASH ^-, we set B _m large so that these baselines can answer all the queries within the toll budget.

(1) Varying toll budget. Varying the total budget B from 20%B _m to B _m, we tested the query evaluation time of all methods. The result for TPCH is reported in Fig. 10B and shows the following. (a) DASHis the fastest among all. e.g., DASH is 1.86, 14.54 and 14.02 times faster than DASH ^-, DASH ⁰ and ONE, respectively , when B=B _m on TPCH. (b) The optimization of DASHis effective: DASH is on average 2.76 and 2.55 times faster than DASH _no on the two datasets, respectively.

(3) Varying datasets. Varying the size of datasets in the same way as Exp-2 with full toll budget B _m, we tested the average evaluation time per query. The results on TPCH is given in Fig. 10C (the results on TFACC are similar and omitted) . Similar to Exp-3 (1) , DASH consistently performs the best among all the methods, and does better when the datasets get larger.

Exp-4: Integration with SMCQL [9] . We evaluated the feasibility and performance of integrating DASH with SMC systems such as SMCQL [9] . We took SMCQL as the capsules for DASH and denote the integrated system by DASH _smc. We evaluated the performance of DASH _smc and SMCQL over 1 GB of TPCH (SMCQL does not scale to larger datasets) . In particular, to simulate the case study of Example 1, we used 20 machines and partitioned them into three groups, with 2, 10 and 8 machines representing governments, hospitals and insurance firms, respectively. To favor SMCQL and prevent DASH _smc from bypassing SMCQL capsules, we set the protocols the same as Fig. 0 except that (a) insurance machines do not send data to hospitals, and (b) all computations over insurance machines must use SMCQL capsules. We randomly distributed TPCH relations over the machines. Using three TPCH queries Q4, Q12 and Q19 (simplified due to the restriction of query support on SMCQL) , we evaluated the performance of DASH _smc and SMCQL.

(1) SMCQL can be naturally integrated into DASH as capsules and becomes more practical in the heterogeneous setting. DASH _smc is on average more than 18.89 times faster than SMCQL (SMCQL cannot finish within 48 hours for all cases) .

(2) DASH _smc improves by 1.83 times when B increases from 20%B _m to B _m while SMCQL is insensitive to B (Fig. 10D) .

Summary. We find the following. (1) Security heterogeneity has a big impact on querying shared data. (2) Our proposed method effectively reduces both toll consumption and parallel execution cost. On average DASH consumes 2.59, 4.64, 69.47 times less toll than ONE , DASH ⁰ and DASH ^-, respectively, and is 14.16, 14.44 and 2.2 times faster. (3) Existing systems can be integrated with our method as capsules and alleviate efficiency bottleneck in the heterogeneous setting; it speeds up SMCQL by 18.89 times over 1GB of TPCH.

It should be noted that the embodiments described above can be implemented by hardware elements, software elements, or some combination of software and hardware. The hardware elements can include circuitry. The software elements can include computer code stored as machine-readable instructions on a tangible, non-transitory, machine-readable storage medium. Some embodiments can be implemented in one or a combination of hardware, firmware, and software.

Some embodiments can be implemented in a computing system or computing device including a memory comprising instructions, and one or more processors in communication with the memory, the one or more processors execute the instructions to perform the functions or operations described in this disclosure.

Some embodiments can also be implemented as instructions stored on a machine-readable medium, which can be read and executed by a computing platform to perform the operations described in this disclosure. A machine-readable medium can include any mechanism for storing or transmitting data in a form readable by a machine, e.g., a computer. For example, a machine-readable storage medium can include read only memory (ROM) ; random access memory (RAM) ; magnetic disk storage media; optical storage media; flash memory devices; or any other machine-readable storage medium. Some embodiments can also be software product including a machine-readable medium, which stores instructions that, when executed, cause one or more processors to perform the functions or operations described in this disclosure.

The descriptions of the various embodiments of the disclosure have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure.

References

[1] 2018. Azure encryption. https: //docs. microsoft. com/en-us/azure/security/fundamentals/encryption-overview.

[2] 2019. TPC-H. http: //www. tpc. org/tpch/.

[3] Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases. Addison-Wesley.

[4] Foto N. Afrati and Jeffrey D. Ullman. 2011. Optimizing Multiway Joins in a Map-Reduce Environment. TKDE 23, 9 (2011 ) , 1282–1298.

[5] Anonymous. 2020. Full version. https: //bit. ly/it2lx.

[6] Arvind Arasu, Spyros Blanas, Ken Eguro, Manas Joglekar, Raghav Kaushik, Donald Kossmann, Ravi Ramamurthy, Prasang Upadhyaya, and Ramarathnam Venkatesan. 2013. Secure database-as-a-service with Cipherbase. In SIGMOD.

[7] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. 1999. Complexity and Approximability Properties: Combinatorial Optimization Problems and Their Approximability Properties.

[8] Sumeet Bajaj and Radu Sion. 2011. TrustedDB: A Trusted Hardware based Database with Privacy and Data Confidentiality. In SIGMOD.

[9] Johes Bater, Gregory Elliott, Craig Eggen, Satyender Goel, Abel N. Kho, and Jennie Rogers. 2017. SMCQL: Secure Query Processing for Private Data Networks. PVLDB 10 , 6 (2017 ) , 673–684.

[10] Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. 2018. ShrinkWrap: Efficient SQL Query Processing in Differentially Private Data Federations. PVLDB 12, 3 (2018 ) , 307–320.

[11] Paul Beame, Paraschos Koutris, and Dan Suciu. 2013. Communication steps for parallel query processing. In PODS. 273–284.

[12] Alexandra Boldyreva, Nathan Chenette, Younho Lee, and Adam O’ Neill. 2009. Order-Preserving Symmetric Encryption. In EUROCRYPT.

[13] Elette Boyle, Kai-Min Chung, and Rafael Pass. 2015. Large-Scale Secure Computation: Multi-party Computation for (Parallel) RAM Programs. In CRYPTO. 742–762.

[14] Niklas Büscher, Daniel Demmler, Stefan Katzenbeisser, David Kretzmer, and Thomas Schneider. 2018. HyCC: Compilation of Hybrid Protocols for Practical Secure Computation. In CCS.

[15] Department for Digital, Culture, Media &Sport and Department for Business, Energy &Industrial Strategy. 2017. Growing the artificial intelligence industry in the UK.

[16] John C Duchi, Michael I Jordan, and Martin J Wainwright. 2013. Local privacy and statistical minimax rates. In FOCS. 429–438.

[17] Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magdalena Balazinska , Bill Howe , Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stanley B. Zdonik. 2015. The BigDAWG Polystore System. SIGMOD Record 44 , 2 (2015 ) , 11–16.

[18] Satoru Fujishige and Satoru Iwata. 2005. Bisubmodular Function Minimization. SIAM J. Discrete Math. 19, 4 (2005 ).

[19] Craig Gentry. 2009. Fully Homomorphic Encryption Using Ideal Lattices. In STOC.

[20] Government Digital Service, HM Passport Office and UK Statistics Authority. 2018. Information sharing code of practice.

[21] Martin

László Lovász, and Alexander Schrijver. 1981. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica 1, 2 (1981 ) , 169–197.

[22] Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. 2014. Demonstration of the Myria big data management service. In SIGMOD.

[23] HM Treasury. 2015. Data sharing and open data in banking: Response to the call for evidence.

[24] Horizon 2020 Research and Innovation Action. 2016. My Health, My Data. http: //www. myhealthmydata. eu/.

[25] Muhammad Ishaq, Ana L. Milanova , and Vassilis Zikas. 2019. Efficient MPC via Program Analysis: A Framework for Efficient Optimal Mixing. In CCS.

[26] Boyan Kolev, Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jiménez-Peris, Raquel Pau, and José Pereira. 2016. The CloudMdsQL Multistore System. In SIGMOD.

[27] Paraschos Koutris and Dan Suciu. 2011. Parallel evaluation of conjunctive queries. In PODS. ACM, 223–234.

[28] Ralf Kramer. 1997. Databases on the Web: Technologies for Federation Architectures and Case Studies (Tutorial) . In SIGMOD. 503–506.

[29] Tejas Kulkarni. 2019. Answering Range Queries Under Local Differential Privacy. In SIGMOD.

[30] Ee-Peng Lim, San-Yih Hwang, Jaideep Srivastava, Dave Clements, and M. Ganesh. 1995. Myriad: Design and Implementation of a Federated Database Prototype. Softw., Pract. Exper. 25, 5 (1995 ) , 533–562.

[31] Find open data. 2014. http: //data. gov. uk/dataset/naptan.

[32] Find open data. 2019. https: //data. gov. uk/dataset/e3939ef8-30c7-4ca8-9c7c-ad9475cc9b2f/anonymised-mot-tests-and-results.

[33] M. Tamer

and Patrick Valduriez. 2011. Principles of Distributed Database Systems, Third Edition. Springer.

[34] Christos H Papadimitriou. 1994. Computational Complexity. Addison-Wesley.

[35] Raluca Ada Popa, Catherine M. Redfield, Nickolai Zeldovich, and Hari Balakrishnan. 2011. CryptDB: Protecting Confidentiality with Encrypted Query Processing. In SOSP.

[36] Christian Priebe, Kapil Vaswani, and Manuel Costa. 2018. EnclaveDB: a secure database using SGX. In IEEE Security &Privacy.

[37] The Register. 2016. https: //www. theregister. co. uk/2016/07/06/caredata_binned/.

[38] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016 ) .

[39] Adi Shamir. 1979. How to Share a Secret. Commun. ACM 22, 11 (1979 ) , 612–613.

[40] Amit P. Sheth and James A. Larson. 1990. Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Database. ACM Comput. Surv. 22, 3 (1990 ) , 183–236.

[41] Stephen Tu, M. Frans Kaashoek, Samuel Madden, and Nickolai Zeldovich. 2013. Processing analytical queries over encrypted data. In PVLDB.

[42] Vijay V. Vazirani. 2003. Approximation Algorithms. Springer.

[43] Tianhao Wang, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, and Somesh Jha. 2019. Answering Multi-Dimensional Analytical Queries under Local Differential Privacy. In SIGMOD.

[44] Andrew Chi-Chih Yao. 1982. Protocols for Secure Computations (Extended Abstract) . In FOCS. 160–164.

Claims

A computer-implemented method, comprising:

receiving a SQL query for shared data over a plurality of sites;

generating a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;

executing the distributed query plan at the sites and returning the results for the SQL query.
The method of claim 1, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
The method of claim 2, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
The method of claim 3, wherein generating a distributed query plan further comprising:

generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;

optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
The method of claim 4, wherein executing the distributed query plan at the sites further comprises:

picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;

transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;

performing the operation and storing the result of the operation at the designated site by the logic unit.
The method of claim 5, wherein the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
The method of claim 5, wherein the result of the operation can be stored in a protected mode.
A computing system, comprising:

a memory comprising instructions, and

one or more processors in communication with the memory, wherein the one or more processors execute the instructions to:

receive a SQL query for shared data over a plurality of sites;

generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;

execute the distributed query plan at the sites and returning the results for the SQL query.
The computing system of claim 8, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
The computing system of claim 9, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
The computing system of claims 10, wherein, the one or more processors execute the instructions to generate a distributed query plan comprises:

generating a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;

optimizing the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
The computing system of claims 11, wherein the one or more processors execute the instructions to execute the distributed query plan at the sites further comprises:

picking and setting up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;

transferring shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;

performing the operation and storing the result of the operation at the designated site by the logic unit.
The computing system of claim 12, wherein the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
The computing system of claim 12, wherein the result of the operation can be stored in a protected mode.
A computer-readable storage medium comprising computer instructions that when executed by one or more processors, cause the one or more processors to:

receive a SQL query for shared data over a plurality of sites;

generate a distributed query plan that complies with a data sharing pact with security heterogeneity between pairs of the sites;

execute the distributed query plan at the sites and returning the results for the SQL query.
The computer-readable storage medium of claim 15, wherein the data sharing pact comprises a security protocol for each pair of sites that specifies a minimum security requirement for sharing data between each pair of the sites.
The computer-readable storage medium of claim 16, wherein the data sharing pact comprises a toll function that measures the parallel execution cost of the distributed query plan.
The computer-readable storage medium of claim 17, further comprising instructions that cause the one or more processors to:

generate a canonical plan which consists of toll-minimized sub-plans for each operation of the SQL query;

optimize the canonical plan to reduce its parallel execution cost by rebalancing toll budget of the sub-plans.
The computer-readable storage medium of claim 18, further comprising instructions that cause the one or more processors to:

pick and set up a logic unit hosted by a designated site that meets the minimum security requirement for sharing data between any other site and the designated site, for each operation in the distributed query plan;

transfer shared data from the other site to the logic unit hosted by the designated site based on the security protocol between the other site and the designated site;

perform the operation and storing the result of the operation at the designated site by the logic unit.
The computer-readable storage medium of claim 19, wherein the logic unit can be selected from the group consisting of Docker container, enclave, SMC system and trusted third party.
The computer-readable storage medium of claim 19, wherein the result of the operation can be stored in a protected mode.