WO2019121898A1

WO2019121898A1 - A computer-implemented method of applying a first function to each data element in a data set, and a worker node and system for implementing the same

Info

Publication number: WO2019121898A1
Application number: PCT/EP2018/085819
Authority: WO
Inventors: Meilof Geert VEENINGEN
Original assignee: Koninklijke Philips N.V.
Priority date: 2017-12-22
Filing date: 2018-12-19
Publication date: 2019-06-27

Abstract

According to an aspect, there is provided a computer-implemented method of applying a first function to each data element in a first data set, the method comprising (i) selecting each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element; (ii) if the number of selected data elements is less than a threshold value, selecting one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value; (iii) performing a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set; (iv) revealing the locations of the selected data elements in the shuffled data set; (v) applying the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and (vi) forming an output based on the shuffled processed data set; wherein steps (i)-(vi) are performed using multi-party computation, MPC, techniques.

Description

A COMPUTER-IMPLEMENTED METHOD OF APPLYING A FIRST FUNCTION TO EACH DATA ELEMENT IN A DATA SET, AND A WORKER NODE AND SYSTEM FOR IMPLEMENTING THE SAME

FIELD OF THE INVENTION

The disclosure relates to the application of a first function to each data element in a data set, and in particular to a computer-implemented method, a worker node and a system for applying a first function to each data element in a data set.

BACKGROUND OF THE INVENTION

In settings where sensitive information from multiple mutually distrusting parties needs to be processed, cryptography-based privacy-preserving techniques such as multiparty computation (MPC) can be used. In particular, when using MPC, sensitive data is “secret shared” between multiple parties so that no individual party can learn the data without the help of other parties. Using cryptographic protocols between these parties, it is possible to perform computations on such “secret shared” data. Although a wide range of primitive operations on secret shared data are available, not all traditional programming language constructs are available. For instance, it is not possible to have an “if” statement with a condition involving a sensitive variable, simply because no party in the system should know whether the condition holds. Hence, efficient methods to perform higher-level operations (e.g., sorting a list or finding its maximum) are needed that make use only of operations available on secret-shared data.

One common operation occurring in information processing is the “map” operation, where the same function f is applied to all elements in a data set.

SUMMARY OF THE INVENTION

One way to perform the “map” operation on secret-shared data is to apply a function f under MPC to the secret shares of each data element in the data set. However, suppose a function f is to be mapped to a data set for which:

it is computationally expensive to compute function f on input x using MPC;

there is a criterion φ that is straightforward to check on input x such that, if it is true, f(x) = g(x), where function g is straightforward to compute (e.g., it is a constant); and

it is known that φ holds for a large part of the data set.

If privacy of the data is not an issue, then the time taken for the “map” operation could be reduced by applying g instead of f on data elements for which φ holds. Translated to the MPC setting, this would mean that, for each data element x of the data set, it is checked if φ holds using MPC; and if φ holds then g is executed on x using MPC; and otherwise f is executed on x using MPC. However, this would leak information about x since, to be able to branch on φ(x), it would be necessary to reveal whether or not φ(x) is true.

A related problem to applying the map in the setting with a “non-triviality” criterion φ is “oblivious database filtering”, where a data set is converted into a (randomly ordered) sub data set containing only items satisfying the criterion φ. Indeed, in some settings where mapping with a criterion φ could be used, oblivious database filtering would be an alternative.

There is therefore a need for an improved technique for applying a first function to each data element in a data set that addresses one or more of the above issues.

Briefly, the techniques described herein speed up the “map” operation in privacy-preserving computation, particularly in settings where applying a function on the data set is expensive, there is an easily computed criterion under which there exists a simplification, and the criterion is expected to hold for a large part of the data set. In particular, the techniques provide that rather than filter out “trivial” elements of the data set, the data set is randomly shuffled so that locations of the “trivial” elements can be revealed without loss of privacy.

According to a first specific aspect, there is provided a computer-implemented method of applying a first function to each data element in a first data set, the method comprising (i) selecting each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element; (ii) if the number of selected data elements is less than a threshold value, selecting one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value; (iii) performing a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set; (iv) revealing the locations of the selected data elements in the shuffled data set; (v) applying the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and (vi) forming an output based on the shuffled processed data set; wherein steps (i)-(vi) are performed using multi-party computation, MPC, techniques. According to a second aspect, there is provided a worker node for use in the method according to the first aspect.

According to a third aspect, there is provided a system for applying a first function to each data element in a first data set, the system comprising a plurality of worker nodes, wherein the plurality of worker nodes are configured to use multiparty computation, MPC, techniques to: select each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element; select one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value if the number of selected data elements is less than a threshold value; perform a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set; reveal the locations of the selected data elements in the shuffled data set; apply the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and form an output based on the shuffled processed data set.

According to a fourth aspect, there is provided a worker node configured for use in the system according to the third aspect.

According to a fifth aspect, there is provided a worker node for use in applying a first function to each data element in a first data set, wherein the worker node is configured to use one or more multiparty computation, MPC, techniques with at least one other worker node to select each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element; select one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value if the number of selected data elements is less than a threshold value; perform a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set; reveal the locations of the selected data elements in the shuffled data set; apply the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and form an output based on the shuffled processed data set.

According to a sixth aspect, there is provided a computer-implemented method of operating a worker node to apply a first function to each data element in a first data set, the method comprising: (i) selecting each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied only if the result of applying the first function to the data element is different to the result of applying a second function to the data element; (ii) if the number of selected data elements is less than a threshold value, selecting one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value; (iii) performing a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set; (iv) revealing the locations of the selected data elements in the shuffled data set; (v) applying the first function to each of the selected data elements in the shuffled data set and applying the second function to each of the remaining data elements in the shuffled data set to form a shuffled processed data set comprising processed data elements; and (vi) forming an output based on the shuffled processed data set; wherein steps (i)-(vi) are performed using multiparty computation, MPC, techniques with one or more other worker nodes.

According to a seventh aspect, there is provided a computer program product comprising a computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform the method according to the first aspect or the sixth aspect.

These and other aspects will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

Exemplary embodiments will now be described, by way of example only, with reference to the following drawings, in which:

Fig. 1 is a block diagram of a system comprising a plurality of worker nodes according to an embodiment of the techniques described herein;

Fig. 2 is a block diagram of a worker node that can be used in embodiments of the techniques described herein;

Fig. 3 is a diagram illustrating a procedure according to an embodiment of the techniques described herein; and

Fig. 4 is a flow chart illustrating a method of applying a first function to each data element in a data set.

DETAILED DESCRIPTION OF EMBODIMENTS

Fig. 1 is a block diagram of a system 1 in which the techniques and principles described herein may be implemented. The system 1 comprises a plurality of worker nodes 2, with three worker nodes 2 being shown in Fig. 1. Each worker node 2 is able to participate in multiparty computations (MPCs), with one or more of the other worker nodes 2.

Multiparty computation techniques allow the computation of a joint function on sensitive (private) inputs from mutually distrusting parties without requiring those parties to disclose these inputs to a trusted third party or to each other (thus preserving the privacy of these inputs). Cryptographic protocols ensure that no participating party (or coalition of parties) learns anything from this computation except its intended part of the computation outcome.

In the system shown in Fig. 1, an input for the computation can be provided by one or more worker nodes 2 and/or by one or more input nodes (not shown in Fig. 1). The output of the computation may be returned to the node that provided the input(s), e.g. one or more worker nodes 2 and/or one or more input nodes, and/or the output can be provided to one or more nodes that did not provide an input, e.g. one or more of the other worker nodes 2 and/or one or more output nodes (not shown in Fig. 1). Often, a recipient of the output of the MPC is a node that requested the computation.

The plurality of worker nodes 2 in Fig. 1 can be considered as a “committee” of worker nodes 2 that can perform an MPC. A single committee may perform the whole MPC, but in some cases multiple committees (comprising a respective plurality of worker nodes 2) can perform respective parts of the MPC.

The worker nodes 2 are interconnected and thus can exchange signalling therebetween (shown as signals 3). The worker nodes 2 may be local to each other, or one or more of the worker nodes 2 may be remote from the other worker nodes 2. In that case, the worker nodes 2 may be interconnected via one or more wireless or wired networks, including the Internet and a local area network.

Each worker node 2 can be any type of electronic device or computing device. For example, a worker node 2 can be, or be part of, any suitable type of electronic device or computing device, such as a server, computer, laptop, smart phone, etc. It will be appreciated that the worker nodes 2 shown in Fig. 1 do not need to be the same type of device, and for example, one or more worker nodes 2 can be servers, one or more worker nodes 2 can be desktop computers, etc.

Fig. 2 is a block diagram of an exemplary worker node 2. The worker node 2 includes interface circuitry 4 for enabling a data connection to other devices or nodes, such as other worker nodes 2. In particular the interface circuitry 4 can enable a connection between the worker node 2 and a network, such as the Internet or a local area network, via any desirable wired or wireless communication protocol. The worker node 2 further includes a processing unit 6 for performing operations on data and for generally controlling the operation of the worker node 2. The worker node 2 further includes a memory unit 8 for storing any data required for the execution of the techniques described herein and for storing computer program code for causing the processing unit 6 to perform method steps as described in more detail below.

The processing unit 6 can be implemented in numerous ways, with software and/or hardware, to perform the various functions described herein. The processing unit 6 may comprise one or more microprocessors or digital signal processors (DSPs) that may be programmed using software or computer program code to perform the required functions and/or to control components of the processing unit 6 to effect the required functions. The processing unit 6 may be implemented as a combination of dedicated hardware to perform some functions (e.g. amplifiers, pre-amplifiers, analog-to-digital convertors (ADCs) and/or digital-to-analog convertors (DACs)) and a processor (e.g., one or more programmed microprocessors, controllers, DSPs and associated circuitry) to perform other functions. Examples of components that may be employed in various embodiments of the present disclosure include, but are not limited to, conventional microprocessors, DSPs, application specific integrated circuits (ASICs), and field-programmable gate arrays (FPGAs).

The memory unit 8 can comprise any type of non-transitory machine-readable medium, such as cache or system memory including volatile and non-volatile computer memory such as random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), and electrically erasable PROM (EEPROM).

If a worker node 2 stores or holds one or more data sets that can be processed in a multiparty computation, the data set(s) can be stored in the memory unit 8.

As noted above, one common operation occurring in information processing is the map operation, where the same function f is applied to all data elements in a data set. However, applying function f can be computationally expensive, particularly where the data set is secret/private and the function f has to be applied under MPC to each individual data element.

For some functions f there can be a criterion φ that is straightforward to check on input (data element) x such that, if it is true, f(x) = g(x), where function g is straightforward to compute (e.g., it is a constant), which means that the time taken for the map operation could be reduced by applying g instead of f on data elements for which φ holds. This can mean that, for each data element x of the data set, it is checked if φ holds using MPC; and if φ holds then g is executed on x using MPC; and otherwise f is executed on x using MPC.

However, this would leak information about x since, to be able to branch on φ(x), it would be necessary to reveal whether or not φ(x) is true.

Thus, to respect the sensitivity of the data elements in the data set, techniques are required whose program flow does not depend on sensitive data. The techniques described herein provide improvements to the application of a function f to a data set that is secret or private to one or more parties, where there is a criterion φ for function f as described above, which means that function f does not need to be applied to all data elements in the data set.

A first embodiment of the techniques presented herein is described with reference to Fig. 3, which illustrates a map operation on a data set 20 that comprises a plurality of data elements 22. It will be appreciated that although Fig. 3 shows the data set 20 as having five data elements 22, the data set 20 may comprise fewer data elements 22, or typically many more than five data elements 22. The data elements 22 are numbered consecutively in Fig. 3 from #1 to #5 for ease of identification.

Firstly, the data elements 22 in the data set 20 that do not satisfy φ are selected using MPC techniques. That is, the criterion φ is checked for all data elements 22 in the data set 20 using MPC techniques, and those not satisfying criterion φ are selected. Thus, the check is performed by two or more worker nodes 2 using MPC techniques so that no individual worker node 2 learns the content of a data element 22 or learns whether a particular data element 22 satisfies φ. As noted above, φ is satisfied only if f(x) = g(x), i.e. φ is satisfied only if (or if and only if) the result of applying function f to data element x is the same as the result of applying function g to data element x. Selecting a data element 22 that does not satisfy φ means computing a selection bit that is ‘1’ if the data element is ‘selected’ and ‘0’ if not.

In Fig. 3, the data elements 22 that are found to satisfy φ are shown in light grey boxes and the data elements 22 that are found not to satisfy φ are shown in dark grey boxes. It will be appreciated that Fig. 3 shows this distinction between the data elements 22 for ease of understanding only, and no individual worker node 2 knows which data elements 22 satisfy/do not satisfy φ. In the example of Fig. 3, data elements #2 and #5 are found not to satisfy the criterion φ. Data elements #1, #3 and #4 are found to satisfy the criterion φ.

If the number of selected data elements 22 (i.e. the number of data elements 22 that do not satisfy φ), which is denoted M, is less than a threshold number (or upper bound) N, then N - M additional data elements 22 are selected. That is, some of the data elements 22 that do satisfy the criterion φ are selected so that the total number of selected data elements 22 is equal to N. The additional data elements 22 can be selected randomly. The remaining data elements 22 that do satisfy the criterion φ and that are not selected are referred to as non-selected data elements 22. The selection of the additional N - M data elements 22 is performed using MPC techniques by two or more worker nodes 2 so that no individual worker node 2 learns the values of the data elements 22 or which data elements 22 or additional data elements 22 are selected. The worker nodes 2 that perform the selection of the additional data elements 22 may be the same or different to the worker nodes 2 that perform the selection of the data elements 22 that did not satisfy the criterion φ.

In the example of Fig. 3, N is 3 and only data elements #2 and #5 are found to not satisfy φ, so one additional data element 22 is selected to bring the total number of selected data elements 22 to 3. In the example of Fig. 3, data element #1 is selected.

Next, a random shuffle operation 24 is performed on the data set 20 to randomly change the order of the data elements 22 in the data set 20. This results in a shuffled data set 26. The random shuffling operation 24 applies equally to the selected data elements 22 and the non-selected data elements 22. It will be appreciated from the next step that the knowledge (or identity) of which data elements 22 were selected in the data set 20 is not lost by the shuffling operation 24, and so the selected data elements 22 can still be identified after the shuffling operation 24. In particular the selection bit for each data element 22 as described above that indicates whether or not a data element 22 is selected is shuffled along with the data elements 22.

The random shuffling operation 24 is performed using MPC techniques by two or more worker nodes 2 so that no individual worker node 2 learns the values of the data elements 22, or where each data element 22 in the data set 20 ends up in the shuffled data set 26. The worker nodes 2 that perform the random shuffling operation 24 may be the same or different to the worker nodes 2 that perform the selection of the data elements 22.

In the example of Fig. 3, the data elements #1 to #5 are randomly shuffled so that they are now in the order #2, #4, #1, #3 and #5 (with the selected data elements still being #1, #2 and #5).

Next, the locations of the N selected data elements 22 in the shuffled data set 26 are revealed. Thus, each location in the shuffled data set 26 at which there is a selected data element 22 is revealed. In the example of Fig. 3, the 1st, 3rd and 5th locations will be revealed in this step (as these are the locations to which data elements #1, #2 and #5 were shuffled). Revealing the locations in the shuffled data set 26 means opening the (secret shared or encrypted) selection bit. Revealing the locations can be performed as an MPC operation by a (or the) plurality of worker nodes 2.

Then, function f is applied to the data elements 22 at each of the revealed locations in the shuffled data set 26. This means that function f is applied to all of the selected data elements 22 in the shuffled data set 26 (i.e. the data elements 22 that did not satisfy the criterion φ and the N - M additional selected data elements 22). The function g is applied to the rest of the data elements 22 in the shuffled data set 26 (i.e. function g is applied to the other locations in the shuffled data set 26). This means that function g is applied to the non-selected data elements 22 (i.e. some of the elements that do satisfy criterion φ).

The application of the function f or g as appropriate is indicated by apply operation 28 and results in a shuffled processed data set 30. The application of the function f or g (as appropriate) to the data elements 22 in the shuffled data set 26 is performed using MPC techniques by two or more worker nodes 2 so that no individual worker node 2 learns the values of the data elements 22 in the shuffled data set 26 or the result of applying the function f or g to any data element 22. The worker nodes 2 that perform the apply operation 28 may be the same or different to the worker nodes 2 that perform the criterion check, selection and/or shuffling operations.

In the example of Fig. 3, function f is applied to the data elements 22 at the 1st, 3rd and 5th locations in the shuffled data set 26 (so f is applied to data elements #1, #2 and #5), and function g is applied to the data elements 22 at the remaining locations in the shuffled data set 26 (i.e. the 2nd and 4th locations, and thus data elements #3 and #4).

After the apply operation 28, the data elements in the shuffled processed data set 30 are unshuffled by reversing the random shuffle operation 24. The unshuffle operation 32 returns each processed data element in the shuffled processed data set 30 back to the position/location of its corresponding data element 22 in the data set 20 (i.e. the data element 22 that was processed using f or g to form the processed data element). The unshuffling operation 32 results in an unshuffled processed data set 34. It will be noted that in the unshuffled processed data set 34, each processed data element was obtained either by directly computing f of that data element 22 in data set 20, or by computing g of that data element 22 if φ was satisfied (and that data element 22 was not selected). Thus, based on the definition of criterion φ, the end result of the technique shown in Fig. 3 is the application of function f to all data elements 22 in the original data set 20.

The unshuffling operation 32 is performed using MPC techniques by two or more worker nodes 2 so that no individual worker node 2 learns the values of the processed data elements, which processed data elements move to which locations in the unshuffled processed data set 34, or the content of the unshuffled processed data set 34. The worker nodes 2 that perform the unshuffling operation 32 may be the same or different to the worker nodes 2 that perform any one or more of the earlier operations.

In the example of Fig. 3, the processed data elements in the shuffled processed data set 30, ordered #2, #4, #1, #3 and #5, are unshuffled to #1, #2, #3, #4 and #5 in the unshuffled processed data set 34.
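The flow of Fig. 3 can be illustrated with a minimal plaintext sketch. It is not an MPC implementation: every value that would be secret-shared between worker nodes 2 is an ordinary Python value here, and the names `fast_map`, `phi`, `n_bound` and the example functions are hypothetical choices made only for this illustration. The example data mirrors Fig. 3, with elements #2 and #5 not satisfying the criterion and N = 3.

```python
import random

def fast_map(data, f, g, phi, n_bound, rng):
    """Plaintext stand-in for select / pad / shuffle / reveal / apply / unshuffle."""
    # Selection bit is 1 where the criterion phi does NOT hold.
    sel = [0 if phi(x) else 1 for x in data]
    pad = n_bound - sum(sel)
    if pad < 0:
        raise ValueError("more non-trivial elements than the bound N")
    # Pad with randomly chosen elements that satisfy phi until exactly N are selected.
    trivial = [i for i, s in enumerate(sel) if s == 0]
    for i in rng.sample(trivial, pad):
        sel[i] = 1
    # Shuffle the elements together with their selection bits.
    perm = list(range(len(data)))
    rng.shuffle(perm)
    shuffled = [data[p] for p in perm]
    shuffled_sel = [sel[p] for p in perm]
    # Under MPC, only these locations would be opened.
    revealed = [pos for pos, s in enumerate(shuffled_sel) if s == 1]
    # Apply f at the revealed locations and g everywhere else.
    processed = [f(x) if s else g(x) for x, s in zip(shuffled, shuffled_sel)]
    # Unshuffle: return each processed element to its original position.
    out = [None] * len(data)
    for pos, p in enumerate(perm):
        out[p] = processed[pos]
    return out, revealed

# Hypothetical example: f is "expensive", g is the constant 0, phi is "x equals 0".
f = lambda x: x ** 3
g = lambda x: 0
phi = lambda x: x == 0
data = [0, 7, 0, 0, 12]          # elements #2 and #5 do not satisfy phi
result, revealed = fast_map(data, f, g, phi, 3, random.Random(0))
```

In this sketch `result` equals the list obtained by applying `f` to every element, while `f` itself is evaluated at only N = 3 locations, and the revealed locations always number exactly N regardless of how many elements actually failed the criterion.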

The shuffling and unshuffling operations described above can be performed using known approaches, as described in more detail below. The operation to select N - M additional data elements based on the number of data elements selected so far is also described in more detail below.

A second embodiment of the techniques presented herein relates to a so-called map-reduce operation on a data set. In a map-reduce operation/computation, the task is to compute ⊕_i f(x_i), where ⊕ is an associative and commutative operator (e.g. addition or multiplication) and f(x_i) is equal to a neutral element of the associative operator (e.g. zero in the case of addition) whenever the criterion φ is satisfied. In this second embodiment, the order of the data elements does not matter, so reversing the shuffle (unshuffling) prior to applying ⊕ is not necessary. In addition, it is not necessary in this embodiment to apply the function g to the non-selected data elements. Moreover, if f is known not to act on elements for which φ is satisfied, then those elements can be skipped when taking the sum.
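The map-reduce variant can be sketched in the same plaintext style. This is an illustration only, under the assumptions that the operator is passed as an ordinary Python function, that its neutral element is supplied explicitly, and that all names (`fast_map_reduce`, `neutral`, `n_bound`) are hypothetical. Neither g nor the unshuffle step appears, because the non-selected elements contribute only the neutral element.

```python
import operator
import random
from functools import reduce

def fast_map_reduce(data, f, phi, n_bound, op, neutral, rng):
    """Plaintext stand-in: reduce f over only the N revealed locations."""
    sel = [0 if phi(x) else 1 for x in data]
    pad = n_bound - sum(sel)
    if pad < 0:
        raise ValueError("more non-trivial elements than the bound N")
    # Pad the selection up to exactly N elements.
    trivial = [i for i, s in enumerate(sel) if s == 0]
    for i in rng.sample(trivial, pad):
        sel[i] = 1
    # Shuffle, then reveal the locations of the selected elements.
    order = list(range(len(data)))
    rng.shuffle(order)
    revealed = [pos for pos, src in enumerate(order) if sel[src] == 1]
    # Non-selected elements are skipped entirely; no unshuffle is needed.
    return reduce(op, (f(data[order[pos]]) for pos in revealed), neutral)

# Hypothetical example: sum of squares, where f(0) = 0 is neutral for addition.
data = [0, 7, 0, 0, 12]
total = fast_map_reduce(data, lambda x: x * x, lambda x: x == 0,
                        3, operator.add, 0, random.Random(1))
```

Here `total` matches the sum of squares over the whole data set even though `f` is evaluated at only three locations.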

More detailed implementations of the above embodiments are described below with reference to a particular MPC framework. Thus, the techniques described herein provide for carrying out a “map” operation on a secret-shared data set. The data elements in the data set are assumed to be vectors so that the full data set is a matrix with the elements as rows, secret-shared between a number of worker nodes 2 (so either an input node has secret-shared the data set with the worker nodes 2 beforehand, or the data set is the result of a previous multiparty computation). In the first (map) embodiment, the result of the map operation is another secret-shared data set, given as a matrix that contains the result of applying the “map” operation on the data set; and in the second (map-reduce) embodiment, the result is a secret-shared vector that contains the result of applying a “map-reduce” operation on the data set.

The techniques described herein can be based on any standard technique for performing multiparty computations between multiple worker nodes 2. To implement the techniques, it is necessary to be able to compute on numbers in a given ring with the primitive operations of addition and multiplication. In the following description, as is standard in the art, multiparty computation algorithms are described as normal algorithms, except that secret-shared values are between brackets, e.g., [x], and operations like [x] · [y] induce a cryptographic protocol between the worker nodes 2 implementing the given operation. The Open protocol x ← Open([x]) converts secret-shared values to plaintext. Examples of such frameworks are passively secure MPC based on Shamir secret sharing as described in “Design of large scale applications of secure multiparty computation: secure linear programming” by S. de Hoogh, PhD thesis, Eindhoven University of Technology, 2012, or the SPDZ family of protocols, which are known to those skilled in the art (and for example are described in “Practical covertly secure MPC for dishonest majority - or: Breaking the SPDZ limits” (by I. Damgard, M. Keller, E. Larraia, V. Pastro, P. Scholl, and N. P. Smart, in Computer Security - ESORICS 2013 - 18th European Symposium on Research in Computer Security, Egham, UK, September 9-13, 2013, Proceedings, pages 1-18, 2013) and “MASCOT: faster malicious arithmetic secure computation with oblivious transfer” (by M. Keller, E. Orsini, and P. Scholl, in IACR Cryptology ePrint Archive, 2016:505, 2016)).
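The bracket notation and the Open protocol can be illustrated with a toy additive secret-sharing sketch. It is deliberately not Shamir secret sharing and not SPDZ: it omits multiplication, authentication and all network communication, and the modulus `P` and the function names are hypothetical choices made only for this illustration. It shows that a single share reveals nothing useful on its own, while addition and multiplication by a public constant can be carried out share by share.

```python
import secrets

P = 2**61 - 1  # a prime modulus, chosen arbitrarily for this illustration

def share(x, n_parties=3):
    """Split x into n_parties additive shares modulo P (the value [x])."""
    parts = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    parts.append((x - sum(parts)) % P)
    return parts

def open_shares(shares):
    """Stand-in for x <- Open([x]): combine all shares to recover x."""
    return sum(shares) % P

def add(xs, ys):
    """[x] + [y]: each party adds its own two shares locally."""
    return [(a + b) % P for a, b in zip(xs, ys)]

def scale(xs, k):
    """k * [x] for a public constant k: each party scales its own share."""
    return [(a * k) % P for a in xs]

x_shared = share(5)
y_shared = share(7)
sum_opened = open_shares(add(x_shared, y_shared))
scaled_opened = open_shares(scale(x_shared, 3))
```

Multiplying two secret-shared values, as in [x] · [y], requires an interactive protocol between the worker nodes and is outside the scope of this sketch.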

Other functions and operations are also required for implementing the techniques described herein. A function [b] ← eqz([x]) is required that sets b = 0 if x = 0, and b = 1 otherwise. A protocol to implement this function is described in “Design of large scale applications of secure multiparty computation: secure linear programming”.

An operation [M'], [π] ← Shuffle([M]) is required that shuffles a data set [M] according to a randomly generated permutation [π] and returns the shuffled data set and the permutation. It should be noted that [π] is not necessarily secret shared in the same way as [M]; it is just required that [π] is available for unshuffling. An operation [M'] ← UnShuffle([M], [π]) is additionally required that shuffles a (modified) data set [M] according to the inverse permutation π⁻¹, i.e. that reverses the shuffle. This can be as described in “Efficient, Oblivious Data Structures for MPC” by Marcel Keller and Peter Scholl, https://eprint.iacr.org/2014/137.pdf. In the paper “Round-efficient oblivious database manipulation” by S. Laur, J. Willemson and B. Zhang in Cryptology ePrint Archive, Report 2011/429, several possible protocols are presented that perform this random shuffle. In particular, in the regime of passively secure MPC based on Shamir secret sharing, the “resharing based oblivious shuffle for semihonest setting” method can be used. In the 3-party case, first parties 1 and 2 shuffle and re-share their secret shares according to a shared permutation π₁; then parties 1 and 3 do the same for permutation π₂; and parties 2 and 3 do the same for permutation π₃. In this way, the overall permutation is not known to any of the parties. Although it is not mentioned in the “Round-efficient oblivious database manipulation” paper, it is clearly easy to perform the inverse permutation: first parties 2 and 3 apply π₃⁻¹ and re-share, then parties 1 and 3 perform π₂⁻¹, and finally parties 1 and 2 apply π₁⁻¹.
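The way the three pairwise permutations compose, and how they are undone in reverse order, can be illustrated in the clear. The following is a minimal sketch that ignores secret sharing and re-sharing entirely; the helper names `apply_perm`, `invert_perm`, `shuffle` and `unshuffle` are illustrative assumptions and are not taken from the cited papers.

```python
import random

def apply_perm(data, perm):
    """Return the list whose position i holds data[perm[i]]."""
    return [data[p] for p in perm]

def invert_perm(perm):
    """Return the inverse permutation of perm."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv

def shuffle(data, rng):
    """Apply three random permutations in turn (one per pair of parties)."""
    perms = []
    for _ in range(3):
        perm = list(range(len(data)))
        rng.shuffle(perm)
        data = apply_perm(data, perm)
        perms.append(perm)
    return data, perms

def unshuffle(data, perms):
    """Undo the shuffle by applying the inverse permutations in reverse order."""
    for perm in reversed(perms):
        data = apply_perm(data, invert_perm(perm))
    return data

rng = random.Random(0)          # fixed seed, for reproducibility of the sketch only
original = list("abcdefgh")
shuffled, perms = shuffle(original, rng)
restored = unshuffle(shuffled, perms)
```

In an actual protocol each pair of parties would know only its own permutation, so that no single party learns the overall permutation; the sketch keeps all three in one list purely for illustration.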

Algorithm 1 below provides a detailed implementation of the mapping procedure described herein.

The algorithm takes as arguments the function f of the mapping, the simplified function g and the predicate/criterion φ specifying when the simplified function g can be used, an upper bound N on the number of data elements for which φ does not hold, and the data set [M] to which the operation is applied. First, a vector [v] is computed that contains a one for each row of [M] where φ is not satisfied, and a zero where φ is satisfied (line 3 of Algorithm 1).

It will be noted that, by assumption, Σ_i [v_i] ≤ N, so ‘ones’ are added to [v] to ensure that Σ_i [v_i] = N, in order not to leak for how many elements φ was satisfied (lines 5-9). A variable [C] is used to keep track of how many ‘ones’ still need to be added. For each entry [v_i] of [v], the variable [c] is computed that denotes whether it should be flipped from ‘zero’ to ‘one’. This is the case if [C] is not zero and [v_i] is not one, i.e. if [C] · (1 − [v_i]) is not zero, i.e. if [c] = eqz([C] · (1 − [v_i])) is one. The update is performed by adding [c] to [v_i] and subtracting [c] from [C].
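The padding step can be sketched on plaintext values as follows. This is only an illustration under stated assumptions: in the method itself every value is secret shared, the names `pad_selection` and `remaining` are hypothetical, and the initial value of the counter (N minus the number of ones in v) is assumed here because the listing of Algorithm 1 is not reproduced in this text. The `eqz` helper follows the convention used in this description (0 for a zero input, 1 otherwise).

```python
def eqz(x):
    """Convention used in this description: 0 if x == 0, and 1 otherwise."""
    return 0 if x == 0 else 1

def pad_selection(v, n_bound):
    """Flip zeros of v to ones until its Hamming weight equals n_bound."""
    remaining = n_bound - sum(v)          # plays the role of [C] (assumed initial value)
    if remaining < 0:
        raise ValueError("more selected elements than the upper bound N")
    padded = []
    for v_i in v:
        c = eqz(remaining * (1 - v_i))    # 1 only if ones are still needed and v_i is 0
        padded.append(v_i + c)            # add [c] to [v_i]
        remaining -= c                    # subtract [c] from [C]
    return padded

v = [0, 1, 0, 0, 1, 0]
padded = pad_selection(v, 4)
```

The sketch always flips the first available zeros; this is harmless in the method because the vector is shuffled before it is opened.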

The other steps are now straightforward: the Shuffle procedure is applied to the data set concatenated with the [v] vector, giving [M'] and [v']. Since [v'] has been randomly shuffled and it is known that it has Hamming weight N, it is a completely random vector with Hamming weight N, so it can be opened without revealing information. Then, f is applied to each row of [M'] for which v'_i is ‘one’ and g to each row for which it is not, and the result is un-shuffled to obtain [N]. Since the rows of [N] are in the same order as the rows of [M] and, by construction, g has only been applied to those rows satisfying φ, the overall result is a map on the original data set, as requested.
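The overall flow described above (select, pad, shuffle, open, apply, un-shuffle) can be simulated on plaintext data. The sketch below is not Algorithm 1 itself and performs no multiparty computation; the function name `shuffled_map`, the example functions `f`, `g` and `phi`, and the example data are all assumptions chosen for illustration.

```python
import random

def shuffled_map(rows, f, g, phi, n_bound, rng):
    """Plaintext simulation: apply f where phi fails (plus padding) and g elsewhere."""
    # Selection vector: 1 where phi is NOT satisfied.
    v = [0 if phi(row) else 1 for row in rows]
    remaining = n_bound - sum(v)
    if remaining < 0:
        raise ValueError("upper bound N violated")
    # Pad with extra ones so that exactly n_bound rows are selected.
    for i in range(len(v)):
        if remaining and v[i] == 0:
            v[i] = 1
            remaining -= 1
    # Shuffle the rows together with their selection bits.
    order = list(range(len(rows)))
    rng.shuffle(order)
    shuffled = [(rows[j], v[j]) for j in order]
    # The shuffled bits are "opened": f is used where the bit is 1, g otherwise.
    processed = [f(row) if bit else g(row) for row, bit in shuffled]
    f_calls = sum(bit for _, bit in shuffled)
    # Un-shuffle so that the output is in the original row order.
    result = [None] * len(rows)
    for pos, j in enumerate(order):
        result[j] = processed[pos]
    return result, f_calls

def f(x):
    return sum(range(x + 1))    # stands in for an expensive function

def g(x):
    return 0                    # cheap function that agrees with f when phi holds

def phi(x):
    return x == 0               # criterion under which g may replace f

rows = [0, 3, 0, 0, 5, 0, 0, 2]
result, f_calls = shuffled_map(rows, f, g, phi, 4, random.Random(0))
```

Here `f` is evaluated on four rows instead of eight, and the output equals the result of mapping `f` over every row.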

Some extensions to the above embodiments and algorithms are set out below:

Obtaining upper bounds (N) - The algorithms above assume that an upper bound N is available on the number of data elements in the data set that should be selected. In some situations, such an upper bound may already be available and predefined. For example, the apply operation may be combined with the disclosure of an aggregated version of the data set from which an upper bound can be determined. In other situations, an upper bound may not be available but revealing it may not be considered a privacy problem. In this case, after determining the selection, the total number of selected elements can be opened up by the worker nodes 2 and used as a value for N. As an alternative, the total number of selected elements can be rounded or perturbed so as not to reveal its exact value. In yet other situations, a likely upper bound may be available but it may be violated. In such a case, the total number of selected elements can be computed and compared to the supposed upper bound, only leaking the result of that comparison.
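Two of the options above can be sketched on plaintext values. The helper names and the choice of rounding up to a multiple of a fixed granularity are assumptions made for illustration; the text only states that the count can be rounded or perturbed. Rounding up (rather than to the nearest value) is chosen here so that the result remains a valid upper bound.

```python
def rounded_bound(count, granularity):
    """Round count up to the next multiple of granularity, hiding its exact value."""
    if granularity <= 0:
        raise ValueError("granularity must be positive")
    return -(-count // granularity) * granularity   # ceiling division, then scale back

def bound_holds(count, supposed_bound):
    """Comparison whose single-bit result is the only thing that would be revealed."""
    return count <= supposed_bound

n_from_rounding = rounded_bound(7, 5)
```

In the method itself, both the rounding and the comparison would be computed on the secret-shared count, with only the final value or bit being opened.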

Flexible application - While the techniques according to the first embodiment described above avoid unnecessary executions of f, they do so at the expense of checking φ, selecting additional data elements and performing the shuffling and unshuffling operations. Hence, if the upper bound N is not small enough, then the techniques described above do not save time. For instance, the algorithm may only save time if at most five out of ten data elements do not satisfy φ. If the execution times of the various computations are known, then based on the upper bound N a flexible decision can be made as to whether to perform a traditional applying/mapping operation (i.e. applying f to each data element) or a shuffled mapping operation. If these execution times are not known beforehand, they can be measured as the computation progresses. In addition, if the upper bound N is zero, then the shuffling/unshuffling procedures can be skipped.
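Such a decision can be sketched as a simple cost comparison. The cost model below, the function name `use_shuffled_map` and the numeric costs in the example are all hypothetical; they are chosen only so that the break-even point falls at five out of ten elements, as in the example above.

```python
def use_shuffled_map(n, n_bound, cost_f, cost_g, cost_phi, cost_shuffle, cost_unshuffle):
    """Return True if the shuffled mapping is estimated to be cheaper than applying f everywhere."""
    plain = n * cost_f
    if n_bound == 0:
        # No element needs f, so shuffling and unshuffling can be skipped.
        shuffled = n * cost_phi + n * cost_g
    else:
        shuffled = (n * cost_phi + cost_shuffle
                    + n_bound * cost_f + (n - n_bound) * cost_g
                    + cost_unshuffle)
    return shuffled < plain

# Hypothetical per-operation costs (arbitrary units).
costs = dict(cost_f=100, cost_g=1, cost_phi=5, cost_shuffle=200, cost_unshuffle=200)
worth_it_at_5 = use_shuffled_map(10, 5, **costs)
worth_it_at_6 = use_shuffled_map(10, 6, **costs)
```

The cost of padding the selection is assumed to be folded into `cost_phi`; a measured cost model could replace these constants as the computation progresses.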

The flow chart in Fig. 4 shows a method of applying a first function to each data element in a first data set according to the techniques described herein. The method steps in Fig. 4 are described below in terms of the operations performed in a system 1 by a plurality of worker nodes 2 to apply the first function to data elements in the data set, with each step being performed by two or more worker nodes 2 as a multiparty computation. However, it will be appreciated that each step as illustrated and described below can also be understood as referring to the operations of an individual worker node 2 in the multiparty computation.

In addition, it will be appreciated that any particular worker node 2 in the system 1 may participate in or perform any one or more of the steps shown in Fig. 4. Thus, a particular worker node 2 may only participate in or perform one of the steps in Fig. 4, or a particular worker node 2 may participate in or perform any two or more (consecutive or non-consecutive) steps in Fig. 4, or a particular worker node 2 may participate in or perform all of the steps shown in Fig. 4.

At the start of the method, there is a data set, referred to as a first data set, that comprises a plurality of data elements. The data set can be provided to the system 1 by an input node as a private/secret input, or the data set can belong to one of the worker nodes 2 that is to participate in the method, in which case that worker node 2 can provide the data set to the method and the other worker nodes 2 as a private/secret input. In the method, a function f, referred to as a first function, is to be applied (mapped) to each of the data elements in the data set. For the method to be effective in improving the performance of the mapping of the first function on to the first data set, the first function should be relatively computationally expensive to compute as part of a multiparty computation, there should be a criterion that is easy to check for any particular data element such that, if true, the result of applying the first function to the data element is equal to the result of applying a second function to the data element (where the second function is relatively computationally easy to compute as part of a MPC), and the criterion should hold for a large part of the data set.

In a first step, step 101, each data element 22 in the first data set 20 that does not satisfy a criterion is selected. As noted above, the criterion is satisfied for a particular data element 22 only if (or if and only if) the result of applying the first function to the data element 22 is equal to the result of applying the second function to the data element 22. This selection is performed as a MPC by a plurality of worker nodes 2. Also as noted above, selecting data elements 22 can mean determining a value for a selection bit.

Step 101 can be performed by determining whether each data element 22 in the first data set 20 satisfies the criterion, and selecting those data elements 22 that do not satisfy the criterion. In this case, the check of the criterion is performed as a MPC by a plurality of worker nodes 2.

In step 103, one or more of the data elements 22 in the first data set 20 that do satisfy the criterion are selected so that the total number of selected data elements 22 is equal to a threshold value N. The selection in step 103 can be performed randomly. The selection of these additional data elements 22 can be performed as a MPC by a plurality of worker nodes 2. If the number M of data elements 22 selected in step 101 (i.e. the number of data elements 22 that do not satisfy the criterion) is less than N, then N - M data elements 22 that do satisfy the criterion are selected. It will be appreciated that if the number of selected data elements 22 is equal to N (i.e. the number of data elements 22 that do not satisfy the criterion is equal to N), then N - M is zero and no additional data elements 22 are selected in step 103. If the number of selected data elements 22 is greater than N, then the method can either be stopped, or a conventional f-mapping can be performed on all of the data elements 22.

In some embodiments, the threshold value N may be determined as described above, and can be determined prior to steps 101 and/or 103 being performed, but in other embodiments the threshold value N can be determined based on the total number of data elements 22 in the first data set 20 that do not satisfy the criterion. In this case, to avoid revealing the exact number of data elements 22 in the first data set that do not satisfy the criterion to the worker nodes 2, the total number can be rounded or perturbed in order to generate the threshold value N.

Then, in step 105, a shuffle operation 24 is performed to shuffle the order of the data elements 22 in the first data set 20 to form a shuffled data set 26. This shuffle operation 24 is a random shuffle. The shuffle operation 24 can be performed as described above, and can be performed as a MPC by a plurality of worker nodes 2. It will be appreciated that the knowledge of which data elements 22 were selected in steps 101 and 103 (if appropriate), or alternatively the knowledge of the locations at which data elements 22 were selected in steps 101 and 103 (if appropriate), is retained during the shuffle operation 24, so that the information as to which of the data elements 22 (or locations) in the shuffled data set 26 are the selected data elements 22 is preserved.

Next, in step 107, the locations of the selected data elements 22 in the shuffled data set 26 are revealed. This enables the worker nodes 2 to know which data elements 22/locations the first function f should be applied to in the next step. As noted above, revealing the locations in the shuffled data set 26 can mean opening the (secret shared or encrypted) selection bit for each data element 22. Revealing the locations can be performed as an MPC operation by a (or the) plurality of worker nodes 2.
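Opening a secret-shared selection bit can be illustrated with a toy additive secret sharing scheme. This is a swapped-in simplification and an assumption of this sketch: the description refers to Shamir secret sharing and SPDZ-style protocols, whereas the code below uses plain additive shares modulo an illustrative prime, with hypothetical helper names `share` and `open_value`.

```python
import random

PRIME = 2**61 - 1   # illustrative modulus (an assumption of this sketch)

def share(value, n_parties, rng):
    """Split value into n_parties additive shares modulo PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def open_value(shares):
    """Reconstruct the shared value once every party has published its share."""
    return sum(shares) % PRIME

rng = random.Random(1)
shuffled_bits = [0, 1, 1, 0, 1]                        # selection bits after shuffling
shared_bits = [share(b, 3, rng) for b in shuffled_bits]
revealed_locations = [i for i, s in enumerate(shared_bits) if open_value(s) == 1]
```

Because the bits are opened only after the random shuffle, the revealed positions refer to the shuffled data set and not to positions in the first data set.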

Next, in step 109, the first function (f) is applied (operation 28) to the selected data elements 22 at the revealed locations in the shuffled data set 26. The result of this processing is a shuffled processed data set 30 comprising processed data elements. This applying/mapping step is performed as a MPC by a plurality of worker nodes 2. In the embodiments below where the output is to be a processed data set, step 109 can additionally comprise applying the second function g (operation 28) to each of the remaining (i.e. unselected) data elements 22 in the shuffled data set 26. In the embodiments below where the output is to be a combination of the processed data elements (e.g. a combination using an associative and commutative operator), applying the second function g to each of the non-selected data elements 22 is unnecessary.

Finally, in step 111, an output of the method is formed based on the results of step 109. Again, forming the output in step 111 can be performed as a MPC by a plurality of worker nodes 2.

In some embodiments (corresponding to the first embodiment described above), the output of the method is to be a second data set 36 where each data element of the second data set 36 corresponds to the result of applying the first function (f) to the respective data element 22 in the first data set 20. Therefore, in some embodiments, the method further comprises the step of unshuffling the processed data elements in the shuffled processed data set 30. This unshuffling operation 32 reverses the shuffling performed in step 105, and thus returns each processed data element in the shuffled processed data set 30 back to the position/location of its corresponding data element 22 in the first data set 20 (i.e. the data element 22 that was processed using f or g to form the processed data element). The unshuffling operation 32 results in an unshuffled processed data set 34. Thus, the output in step 111 can correspond to (or be) the unshuffled processed data set 34. As noted above, step 111 of forming the output can be performed as an MPC, which means that the unshuffling operation can be performed as an MPC.

In some embodiments, corresponding to the map-reduce embodiment (the second embodiment) above, the output of the method in step 111 is a combination of the results of applying the first function or the second function to the data elements 22 in the shuffled data set 26. In particular, the combination of the results can be formed using an associative and commutative operator (e.g. addition), where the criterion being satisfied by a data element 22 in the first data set 20 means that the result of applying the first function (f) or the second function (g) to the data element 22 is a neutral element for the associative and commutative operator (e.g. zero). In this case, the unshuffling operation described above for the first embodiment is unnecessary. Moreover, in some embodiments, since the result of applying the second function to a data element 22 is a neutral element for the operator, step 111 may comprise forming the output by combining only the data elements in the shuffled processed data set 30 that were determined using the first function f (i.e. the data elements determined using the second function g can be skipped). Alternatively, as noted above, the second function g may not be applied to the non-selected data elements 22 in applying step 109.
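The map-reduce variant can be simulated on plaintext data in the same spirit. The sketch below assumes addition as the associative and commutative operator and zero as its neutral element; the function name `shuffled_map_reduce` and the example `f` and `phi` are illustrative assumptions, and no multiparty computation is performed.

```python
import random

def shuffled_map_reduce(rows, f, phi, n_bound, rng):
    """Sum f over the selected rows only; rows satisfying phi contribute the neutral element 0."""
    v = [0 if phi(row) else 1 for row in rows]      # 1 where phi is NOT satisfied
    remaining = n_bound - sum(v)
    if remaining < 0:
        raise ValueError("upper bound N violated")
    for i in range(len(v)):                         # pad the selection up to n_bound
        if remaining and v[i] == 0:
            v[i] = 1
            remaining -= 1
    tagged = list(zip(rows, v))
    rng.shuffle(tagged)                             # shuffle rows with their selection bits
    total = 0
    for row, bit in tagged:                         # bits are "opened" after the shuffle
        if bit:
            total += f(row)                         # g is never evaluated
    return total                                    # no un-shuffle is needed

def f(x):
    return x * x            # stands in for an expensive function with f(0) == 0

def phi(x):
    return x == 0           # when phi holds, f(x) is the neutral element for addition

rows = [0, 3, 0, 0, 5, 0, 0, 2]
total = shuffled_map_reduce(rows, f, phi, 4, random.Random(0))
```

Rows added as padding satisfy the criterion, so applying `f` to them contributes zero and the total is unchanged.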

As noted above, any worker node 2 in the system 1 may perform any one or more of the steps shown in Fig. 4 or as described above as part of a MPC with one or more other worker nodes 2. As such, a particular worker node 2 may perform, or be configured or adapted to perform, any one or more of steps 101, 103, 105, 107, 109, 111 and the steps described above.

There are therefore provided improved techniques for applying a first function to each data element in a data set that address one or more of the issues with conventional techniques. Generally, the need for multiparty computation arises in many circumstances, for example where multiple mutually distrusting parties want to enable joint analysis on their data sets. Applying a map operation on a list of data elements is a general concept that occurs in many analytics algorithms. The techniques described herein are to be used with data sets for which there is a large number of “trivial” data elements for which the map operation is easy (i.e. where φ is satisfied). The χ²-based logrank test in survival analysis (if no events take place at most time points) is one such example, but those skilled in the art will be aware of other data sets/tests that the techniques described herein can be applied to.

Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the principles and techniques described herein, from a study of the drawings, the disclosure and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored or distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems. Any reference signs in the claims should not be construed as limiting the scope.

Claims

1. A computer-implemented method of applying a first function to each data element in a first data set, the method comprising:

(i) selecting each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element;

(ii) if the number of selected data elements is less than a threshold value, selecting one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value;

(iii) performing a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set;

(iv) revealing the locations of the selected data elements in the shuffled data set;

(v) applying the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and

(vi) forming an output based on the shuffled processed data set; wherein steps (i)-(vi) are performed using multi-party computation, MPC, techniques.

2. A computer-implemented method as claimed in claim 1, wherein the criterion is satisfied if and only if the result of applying the first function to the data element is equal to the result of applying the second function to the data element.

3. A computer-implemented method as claimed in claim 1 or 2, wherein the method further comprises the step of:

determining the threshold value based on the number of data elements in the first data set that do not satisfy the criterion.

4. A computer-implemented method as claimed in any of claims 1-3, wherein step (v) further comprises applying the second function to each of the remaining data elements in the shuffled data set to form the shuffled processed data set.

5. A computer-implemented method as claimed in claim 4, wherein the step of forming an output comprises:

performing a reverse shuffle operation on the shuffled processed data set to reverse the shuffling of the order of the data elements in step (iii) and provide a processed data set; and

providing the processed data set as the output.

6. A computer-implemented method as claimed in any of claims 1-5, wherein the step of forming an output comprises:

forming the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set.

7. A computer-implemented method as claimed in claim 6, wherein the step of forming an output comprises:

forming the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set that were formed from the selected data elements.

8. A computer-implemented method as claimed in claim 6 or 7, wherein the criterion is such that the criterion is satisfied if the result of applying the first function to the data element and the result of applying the second function to the data element is a neutral element for the associative and commutative operator.

9. A computer-implemented method as claimed in any of claims 1-8, wherein each of the steps of the method are performed by a plurality of worker nodes.

10. A worker node for use in the method according to any of claims 1-9.

11. A system for applying a first function to each data element in a first data set, the system comprising:

a plurality of worker nodes, wherein the plurality of worker nodes are configured to use multiparty computation, MPC, techniques to: select each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element;

select one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value if the number of selected data elements is less than a threshold value;

perform a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set;

reveal the locations of the selected data elements in the shuffled data set; apply the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and

form an output based on the shuffled processed data set.

12. A system as claimed in claim 11, wherein the criterion is satisfied if and only if the result of applying the first function to the data element is equal to the result of applying the second function to the data element.

13. A system as claimed in claim 11 or 12, wherein the plurality of worker nodes are further configured to:

determine the threshold value based on the number of data elements in the first data set that do not satisfy the criterion.

14. A system as claimed in any of claims 11-13, wherein the plurality of worker nodes are further configured to apply the second function to each of the remaining data elements in the shuffled data set to form the shuffled processed data set.

15. A system as claimed in claim 14, wherein the plurality of worker nodes are configured to form an output by:

performing a reverse shuffle operation on the shuffled processed data set to reverse the shuffling of the order of the data elements in the shuffle operation and provide a processed data set; and

providing the processed data set as the output.

16. A system as claimed in any of claims 11-15, wherein the plurality of worker nodes are configured to form an output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set.

17. A system as claimed in claim 16, wherein the plurality of worker nodes are configured to form the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set that were formed from the selected data elements.

18. A system as claimed in claim 16 or 17, wherein the criterion is such that the criterion is satisfied if the result of applying the first function to the data element and the result of applying the second function to the data element is a neutral element for the associative and commutative operator.

19. A worker node configured for use in the system according to any of claims 11-18.

20. A worker node for use in applying a first function to each data element in a first data set, wherein the worker node is configured to use one or more multiparty computation, MPC, techniques with at least one other worker node to:

select each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied if the result of applying the first function to the data element is different to the result of applying a second function to the data element;

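select one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value if the number of selected data elements is less than a threshold value;

perform a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set;

reveal the locations of the selected data elements in the shuffled data set; apply the first function to the data elements at each of the revealed locations in the shuffled data set to form a shuffled processed data set comprising processed data elements; and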

form an output based on the shuffled processed data set.

21. A system as claimed in claim 20, wherein the criterion is satisfied if and only if the result of applying the first function to the data element is equal to the result of applying the second function to the data element.

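22. A system as claimed in claim 20 or 21, wherein the plurality of worker nodes are further configured to:

determine the threshold value based on the number of data elements in the first data set that do not satisfy the criterion.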

23. A system as claimed in any of claims 20-22, wherein the plurality of worker nodes are further configured to apply the second function to each of the remaining data elements in the shuffled data set to form the shuffled processed data set.

24. A system as claimed in claim 23, wherein the plurality of worker nodes are configured to form an output by:

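performing a reverse shuffle operation on the shuffled processed data set to reverse the shuffling of the order of the data elements in the shuffle operation and provide a processed data set; and

providing the processed data set as the output.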

25. A system as claimed in any of claims 20-24, wherein the plurality of worker nodes are configured to form an output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set.

26. A system as claimed in claim 25, wherein the plurality of worker nodes are configured to form the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set that were formed from the selected data elements.

27. A system as claimed in claim 25 or 26, wherein the criterion is such that the criterion is satisfied if the result of applying the first function to the data element and the result of applying the second function to the data element is a neutral element for the associative and commutative operator.

28. A computer-implemented method of operating a worker node to apply a first function to each data element in a first data set, the method comprising:

(i) selecting each data element in the first data set that does not satisfy a criterion, wherein the criterion is not satisfied only if the result of applying the first function to the data element is different to the result of applying a second function to the data element;

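(ii) if the number of selected data elements is less than a threshold value, selecting one or more of the data elements in the first data set that does satisfy the criterion such that the total number of selected data elements is equal to the threshold value;

(iii) performing a shuffle operation to shuffle the order of the data elements in the first data set to form a shuffled data set;

(iv) revealing the locations of the selected data elements in the shuffled data set;

(v) applying the first function to each of the selected data elements in the shuffled data set and applying the second function to each of the remaining data elements in the shuffled data set to form a shuffled processed data set comprising processed data elements; and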

(vi) forming an output based on the shuffled processed data set; wherein steps (i)-(vi) are performed using multiparty computation, MPC, techniques with one or more other worker nodes.

29. A computer-implemented method as claimed in claim 28, wherein the criterion is satisfied if and only if the result of applying the first function to the data element is equal to the result of applying the second function to the data element.

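30. A computer-implemented method as claimed in claim 28 or 29, wherein the method further comprises the step of:

determining the threshold value based on the number of data elements in the first data set that do not satisfy the criterion.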

31. A computer-implemented method as claimed in any of claims 28-30, wherein step (v) further comprises applying the second function to each of the remaining data elements in the shuffled data set to form the shuffled processed data set.

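32. A computer-implemented method as claimed in claim 31, wherein the step of forming an output comprises:

performing a reverse shuffle operation on the shuffled processed data set to reverse the shuffling of the order of the data elements in step (iii) and provide a processed data set; and

providing the processed data set as the output.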

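33. A computer-implemented method as claimed in any of claims 28-32, wherein the step of forming an output comprises:

forming the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set.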

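34. A computer-implemented method as claimed in claim 33, wherein the step of forming an output comprises:

forming the output by using an associative and commutative operator to combine the processed data elements in the shuffled processed data set that were formed from the selected data elements.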

35. A computer-implemented method as claimed in claim 33 or 34, wherein the criterion is such that the criterion is satisfied if the result of applying the first function to the data element and the result of applying the second function to the data element is a neutral element for the associative and commutative operator.