WO2020209793A1 - Privacy preserving system for mapping common identities - Google Patents

Privacy preserving system for mapping common identities

Info

Publication number
WO2020209793A1
Authority
WO
WIPO (PCT)
Prior art keywords
identifiers
databases
server
match
multiply encrypted
Application number
PCT/SG2020/050210
Other languages
French (fr)
Inventor
Shuwei CAO
Geong Sen POH
Hoon Wei Lim
Peck Yoke LEONG
Jia Xu
Varsha CHITTAWAR
Original Assignee
Singapore Telecommunications Limited
Application filed by Singapore Telecommunications Limited filed Critical Singapore Telecommunications Limited
Publication of WO2020209793A1 publication Critical patent/WO2020209793A1/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/008 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/14 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using a plurality of keys or algorithms
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/32 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L 9/3218 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using proof of knowledge, e.g. Fiat-Shamir, GQ, Schnorr, or non-interactive zero-knowledge proofs

Definitions

  • the present invention relates to a system for mapping common identities stored in separate databases.
  • a system for mapping common identities stored in separate databases comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.
  • the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
  • the results of the match are sent to a server storing datasets attributable to the matching multiply encrypted identifiers.
  • the datasets are contributed by the one or more databases storing the identifiers from which the multiply encrypted identifiers are derived. The sending of the results of the match to the server is to link the stored datasets that are attributable to the matching multiply encrypted identifier.
  • At least one of the identifiers may comprise an initial segment of identification data.
  • the consolidation of the multiply encrypted identifiers by the mapper module then includes those derived from the initial segments of identification data that match.
  • the results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained.
  • An extension of this first scenario has at least one of the identifiers comprise a further segment of the identification data, whereby the mapper module is configured to locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data.
  • the mapper module then returns the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
  • At least one of the identifiers comprises different identification data with at least one in common with another of the identifiers.
  • the consolidation of the multiply encrypted identifiers by the mapper module then includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data.
  • the results of the match are returned to each of the databases storing the common identification data.
  • the server is configured to establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
  • Each of the multiply encrypted identifiers may be accompanied with schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived.
  • the system has a query analyser configured to interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query.
  • the query analyser divides the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters and distributes the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
  • the system has a machine learning module configured to, when being trained: receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived. The machine learning module then updates parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and reiterates the updating of the parameters until convergence criteria is met. After having been trained, the machine learning module is configured to reference data from one or more of the databases when responding to received queries.
  • Figure 1 shows a schematic of the system architecture in which data integration in accordance with various embodiments of the present invention is deployed.
  • Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities, into a database of the system of Figure 1.
  • FIG. 3 shows operation flow for the data integration protocol that is implemented by the mapper module of Figure 1.
  • Figure 4 shows operation flow for the query analyser of Figure 1 when executing a distributed query protocol.
  • Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module of Figure 1.
  • the present application finds relevance for organisations seeking to share, combine and jointly process their datasets. Having access to data from other organisations provides insight to make better decisions or create a greater social impact, since there is a limit to the insights that can be extracted from dataset(s) belonging to a single organisation.
  • the objective of the present application is to enable organisations to share and generate insights from joint datasets in a privacy-preserving manner.
  • the present application adopts an approach where joint datasets for analysis are achieved while data remains on the organisations' premises. This ensures that data does not leave an organisation within which it is stored, thus preserving data privacy.
  • a mapper module is adopted which identifies the databases that contain the data upon which joint analysis can be based.
  • the mapper module uses identification data commonality as one criterion to determine whether data in separate databases can be used for joint analysis, i.e. if an organisation stores data attributable to a common entity (e.g. the same party may have patient records stored with a hospital database, policy records stored with an insurance company database and past expenditure records stored with a credit card company), such data stored in different organisations, once recognised, becomes a joint dataset usable for joint analysis.
  • the mapper module co-ordinates which of its registered databases stores data attributable to a common entity, whereby an identifier used by the common entity forms the shared attribute to locate the data across the registered databases. As such, the data can be traced through such an identifier.
  • Each identifier comprises one or more of identification data (e.g. any type of PII (personally identifiable information), such as national registration identity number, social security number, foreign identification number, telephone contact numbers, email address, residence address, name, bank account number, credit card primary account number) provided by the common entity when registering with each of these databases to establish data ownership.
  • the identifier may also comprise identification data unique to or generated by each of the databases (e.g. IP address, serial number, IMEI (international mobile equipment identity) number, virtual identification number).
  • when the mapper module discovers that an identifier matches, i.e. the same identifier is found in several databases, the mapper module records the matching identifier in a database used to store matching identifiers.
  • the mapper module performs identity matching on encrypted identifiers, rather than in their plain-text form.
  • when each of the databases transmits its stored identifiers for the mapper module to determine whether each has a common identity stored in another database, the identifiers are already in randomised format, having been encrypted at their respective database.
  • Such encryption is through the use of an encryption key generated at each of the respective databases. This key is called a storage key, a dataset key or a client key.
  • the encrypted identifiers undergo one or more layers of encryption before they are received by the mapper module, so that the mapper module receives multiply encrypted identifiers.
  • One reason for using multiple encryption is that the storage key used by each database to encrypt its transmitted identifiers is different from the storage key used by another database (e.g. a first database uses storage key x_1, while a second database uses storage key x_2), so that each database is able to independently encrypt its transmitted identifiers without needing to communicate and share a key a priori, as many existing techniques require.
  • the mapper module is unable to perform matching if in receipt of identifiers encrypted with just their respective storage keys alone. Receiving the storage key from each of the databases to decrypt the singly encrypted identifiers would then reveal the identifiers, running contrary to the purpose of privacy preservation.
  • Multiple encryption achieves an outer layer of encryption to the inner encryption provided by the storage key. When the mapper module removes the inner encryption from receiving the respective storage key, the identifiers are still encrypted by the outer encryption, so that the identifiers are still masked to the mapper module, thereby achieving privacy preservation.
  • the multiple encryption used is homomorphic or commutative in nature, i.e. encryption that allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext.
  • Each of these additional layers of encryption may be applied by an intermediary that routes the encrypted identifiers to the mapper module. For instance, a server uses its own key (called a central key or a server key) to apply an additional layer of encryption to produce the multiply encrypted identifiers. As such, the records kept in the mapper module are encrypted forms of the identifiers.
  • the mapper module receives the storage key used to encrypt each identifier stored at each database.
  • the storage encryption is removed from each of the multiply encrypted identifiers, whereby matching is then performed on the multiply encrypted identifiers following the storage encryption removal (called "matching multiply encrypted identifiers").
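
The patent leaves the concrete commutative or key-homomorphic scheme open (PCT/SG2017/050575 is cited later as one option). Purely as an illustrative sketch, the layering and stripping described above can be modelled in Python with Pohlig-Hellman style exponentiation, where encryption layers commute and removing the storage key leaves the central layer intact; the toy 10-bit prime and all names here are assumptions for readability, not the patented construction.

```python
import hashlib
import secrets

# Toy safe prime p = 2q + 1 (q prime). A real deployment would use a large,
# vetted group (e.g. a 2048-bit MODP group or an elliptic curve); this tiny
# prime exists purely to keep the sketch self-contained and runnable.
P = 1019
Q = 509  # order of the quadratic-residue subgroup

def hash_to_group(identifier: str) -> int:
    """Hash an identifier, then square mod p to land in the order-q subgroup."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(digest % P, 2, P)

def keygen() -> int:
    """A storage key x (or central key y) is a random exponent in [1, q-1]."""
    return secrets.randbelow(Q - 1) + 1

def encrypt(element: int, key: int) -> int:
    """F_key(element): exponentiation commutes, so F_y(F_x(m)) == F_x(F_y(m))."""
    return pow(element, key, P)

def strip_layer(element: int, key: int) -> int:
    """Remove one encryption layer by raising to the inverse exponent mod q."""
    return pow(element, pow(key, -1, Q), P)

# Commutativity check: encrypt-with-x then y equals encrypt-with-y then x.
m = hash_to_group("S1187561A")
x, y = keygen(), keygen()
assert encrypt(encrypt(m, x), y) == encrypt(encrypt(m, y), x)
# Stripping the storage key x from F_y(F_x(m)) leaves F_y(m), still masked by y,
# so the mapper can match identifiers without ever seeing them in plaintext.
assert strip_layer(encrypt(encrypt(m, x), y), x) == encrypt(m, y)
```
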
  • The results of the match are then transmitted so as to facilitate the location of joint datasets.
  • results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived, such results indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases.
  • Data attributable to each of the matching identities is then available for integration, i.e. usable to become a dataset for joint analysis. It will thus be appreciated that data integration effected by the mapper module does not result in data being combined in a central location. Rather, the data integration effected by the mapper module points to the databases containing the data that can be used for joint dataset analysis.
  • identifiers and datasets are two distinctly different types of data. Identifiers are used to establish ownership of content found in a dataset.
  • the mapper module is not restricted to perform matching on multiply encrypted identifiers that are derived from singular identification data contained within a column entry (e.g. any one of national registration identity number, telephone contact number, email address, credit card primary account number or IP address in complete form).
  • the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from a segment of identification data (e.g. segment "3456" from a national registration identity number "S1234567G"). This is advantageous in situations where only a selected segment of identification data is required to locate a common identity from other identities stored across all databases. In situations where a further segment of the identification data is required, matching can be performed on the smaller result set obtained from matching the initial segments, as described below.
  • the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from different identification data (e.g. two or more of national registration identity number, telephone contact number, email address, credit card primary account number and IP address), where at least one of the identification data is present in another multiply encrypted identifier.
  • a first multiply encrypted identifier is derived from national registration identity number, telephone contact number and email address.
  • a second multiply encrypted identifier is derived from telephone contact number, credit card primary account number and IP address. The first multiply encrypted identifier is considered to match the second multiply encrypted identifier if the telephone contact number is the same.
  • a third multiply encrypted identifier may also be present, derived from credit card primary account number and virtual identification number.
  • the third multiply encrypted identifier is considered to match the first and second multiply encrypted identifiers if the credit card primary account number, found in the second multiply encrypted identifier and the third multiply encrypted identifier, matches. While the first multiply encrypted identifier does not have the credit card primary account number, the first multiply encrypted identifier is considered to match the third multiply encrypted identifier because of the linkage brought about by the second multiply encrypted identifier.
  • the first multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common telephone contact number); while the third multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common credit card primary account number).
  • This other implementation thus allows for matching of identifiers comprising multiple different identification data, i.e. identification data contained over several column entries in a database.
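
The patent does not prescribe how this transitive linkage is computed; one natural realisation is a union-find over the encrypted field values, merging any two identifiers that share an encrypted value for a field. A minimal hypothetical sketch, with placeholder strings standing in for multiply encrypted data:

```python
# Identifiers that share any (encrypted) identification datum are merged into
# one identity group via union-find, so linkage propagates transitively.
from collections import defaultdict

def link_identities(identifiers: list[dict]) -> list[set[int]]:
    parent = list(range(len(identifiers)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    # Any two identifiers sharing an encrypted value for the same field match.
    seen: dict[tuple[str, str], int] = {}
    for idx, ident in enumerate(identifiers):
        for field, value in ident.items():
            key = (field, value)
            if key in seen:
                union(idx, seen[key])
            else:
                seen[key] = idx

    groups = defaultdict(set)
    for idx in range(len(identifiers)):
        groups[find(idx)].add(idx)
    return list(groups.values())

# The three identifiers from the example: 0 and 1 share a phone number,
# 1 and 2 share a card number, so all three land in one identity group.
ids = [
    {"nric": "enc_a", "phone": "enc_p", "email": "enc_e"},
    {"phone": "enc_p", "card": "enc_c", "ip": "enc_i"},
    {"card": "enc_c", "virtual_id": "enc_v"},
]
print(link_identities(ids))  # [{0, 1, 2}]
```
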
  • the protocol used by the mapper module finds a match in a privacy-preserving manner among the encrypted identifiers contributed by each of the databases.
  • the multiply encrypted identifiers having a common identification are considered to match.
  • the mapper module is further configured with a zero-knowledge protocol that allows verification of the correctness of the attributes (i.e. the multiply encrypted identifiers) processed during data integration.
  • the zero-knowledge protocol allows the database to verify the correctness of the blinded PII (i.e. the encrypted identifiers) stored in the mapper module and verify that the data received by the mapper module has not been tampered with. Any modification by the mapper module will be detected through the zero-knowledge protocol.
  • the proof of the zero-knowledge proof protocol is computed from a random element generated by the server, a random element generated by the database seeking to authenticate the server, and the central key. The proof is validated if the stored identifiers processed with the proof match their encrypted versions processed with the random element generated by the server.
  • a query analyser operates in tandem with the mapper module to allow a data consumer to formulate a dataset query directed at data stored in each of the databases.
  • the query analyser queries the datasets through a distributed query mechanism, whereby the dataset query is divided into sub-queries, each categorised for its intended destination. Each sub-query is then sent to the database storing the data fitting the parameters of the sub-query.
  • a machine learning module that has a distributed machine learning function is deployed.
  • the machine learning module allows the data consumer to run analysis on the databases that reside on premise, whereby the machine learning module is trained with vector values of data belonging to the matching encrypted identifiers stored in the mapper module.
  • How the mapper module, the query analyser and the machine learning module operate is described below.
  • Figure 1 shows a schematic of the system 100 architecture in which data integration in accordance with various embodiments of the present invention is deployed.
  • Users of the system 100 can be a data contributor and/or a data consumer.
  • a data contributor wishes to share its data either to sell to a data consumer, or to integrate its data with other data contributors so that better insights can be obtained from joint datasets.
  • the system 100 uses a protocol that enables privacy-preserving integration of datasets from multiple databases contributed by participating contributors.
  • the system 100 comprises a mapper module 102; a central platform 104; databases 106a, 106n, 108; a query analyser 110 and a machine learning module 112.
  • databases 106a and 106n are client side components
  • the central platform 104, the database 108, the query analyser 110 and the machine learning module 112 are server side components.
  • the central platform 104 has a frontend 114 for website access and a backend on which an application programming interface (API) 116 resides.
  • the central platform 104 acts as a server to client terminals to which the databases 106a and 106n are associated.
  • the system 100 may have more than one central platform, and more than two databases to which a central platform serves as a server.
  • Figure 1 shows a cluster to which the central platform 104 and the two databases 106a and 106n belong.
  • the system 100 seeks to map common identities stored in the separate databases 106a and 106n. Initialisation of the mapper module 102 to map common identities is described below with reference to Figures 2 and 3.
  • Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities.
  • the operation flow includes uploading a dataset to database 250 (analogous to databases 106a and 106n shown in Figure 1); key creation to randomise the identification data; and uploading the database schema to the server (refer to the central platform 104 of Figure 1) through the server API 116.
  • Database schema refers to the schema of fields under which data is organised in the database. The following describes the steps involved in this operation flow of Figure 2.
  • a user 250 selects and uploads a file containing a dataset.
  • the content of the dataset depends on the transactions which the identifier owner performs and which the databases 106a and 106n are responsible for recording (e.g. sum insured and insurance policies if the database 106a belongs to an insurance company; illness and duration of ward stay if the database 106n belongs to a hospital).
  • the user 250 defines the dataset schema and PII column(s) containing identifiers ID_i.
  • each identifier comprises one or more of identification data of the party that owns data stored in the database 106a and 106n.
  • In step 203, a user application 252 generates a dataset key x which is used to perform storage encryption at the databases 106a and 106n. In step 204, the user application 252 imports the uploaded file into the databases 106a and 106n. In step 205, the user application 252 randomises the identifiers ID_i in the PII columns with the storage key x, to produce encrypted identifiers F_x(ID_i).
  • In step 206, the user application 252 sends the dataset schema and the encrypted identifiers F_x(ID_i) to the central platform 104 through its server API 116.
  • In step 207, the user application 252 sends the storage key x to the mapper module 102, for example through REST API over HTTPS.
  • a new dataset key x_n may be generated for each dataset upload operation.
  • the system 100 adopts a data integration protocol without data leaving the premise. Only randomised identity attributes are submitted to the central server for blinding. As long as each dataset shares a PII column that allows for uniquely identifying a record, i.e. the mapper module 102 is able to identify common identities from its received randomised identifiers ID_i, the data integration protocol is able to match and then inform each participant of the matched PIIs without revealing the underlying identity information of the non-matched PIIs. The steps are performed while the datasets remain on premise.
  • Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module 102.
  • a user 350 selects two or more datasets for integration/merging.
  • the user 350 defines the merged dataset schema.
  • In step 303, the server frontend (FE) 114 posts the merged schema to the server API 116.
  • In step 304, the server API 116 randomises the PII columns of the selected datasets with its own server key y (also referred to as the central key), resulting in multiply encrypted identifiers F_y(F_x(ID_i)).
  • In step 305, the server API 116 sends the multiply encrypted identifiers F_y(F_x(ID_i)) to the mapper module 102.
  • the multiple encryption refers to the encryption performed on the identifier at the database using the storage key x_n (see step 205 of Figure 2) and the encryption performed by the server API 116 using the central key y on the identifier that is already encrypted using the storage key x_n (as per step 304).
  • the storage key x_n used to encrypt each of the identifiers stored by the databases 106a and 106n may change.
  • the central key y can also change for each batch of multiple encryption that is performed.
  • In step 306, the mapper module 102 decrypts each of the multiply encrypted identifiers F_y(F_x(ID_i)) with their respective storage keys x.
  • the receipt of these storage keys x, used to perform storage encryption at each of the databases (cf. the databases 106a and 106n shown in Figure 2) storing one or more of the identifiers ID_i, by the mapper module 102 was described with reference to Figure 2 (see step 207).
  • the decryption using the respective storage key x removes the storage encryption from each of the multiply encrypted identifiers F_y(F_x(ID_i)), resulting in identifiers encrypted only by the central encryption, F_y(ID_i).
  • In step 307, the server and the mapper module 102 build and store the data mapping results based on the mapping of the PII columns, i.e. an inner join.
  • the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of the storage encryption, hereafter referred to as "matching multiply encrypted identifiers" (rather than the verbose phrase "multiply encrypted identifiers that match after the removal of the storage encryption"). Each of such multiply encrypted identifiers is derived from common identification data.
  • In step 308, the mapper module 102 discards the encrypted key information, i.e. all the received storage keys x.
  • the mapper module 102 returns results of the match to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
  • the results of the match are returned so as to provide an indication which of the one or more identities stored in one of the databases (e.g. database 106a of Figure 1) is also stored in one or more of the other databases (e.g. database 106n of Figure 1).
  • the results of the match are returned via the mapper module 102 sending mapped record labels to the respective user applications.
  • the label is a stateful parameter that indexes the randomised PIIs.
  • the label is used to locate an identifier within the database storing the identifier from which the multiply encrypted identifier F_y(F_x(ID_i)) is derived. For example, each multiply encrypted identifier F_y(F_x(ID_i)) received by the mapper module 102 is accompanied with such a label.
  • the label can also be used to combine datasets, each of which is contributed from a different database.
  • the mapper module 102 performs intersection of the blinded PIIs provided by all the participants.
  • the mapper module 102 compiles a table of matching blinded PIIs and their accompanying labels, thereby consolidating the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels.
  • the objective is that only randomised PII is required by the central platform 104. Once matched, the central platform 104 has common labels that can be passed to the participants to ascertain which PIIs match the PIIs contributed by other participants.
  • In step 309, the mapper module 102 sends the table consisting of the labels to every participant, thereby returning each of the common labels to each of the databases storing one or more of the identifiers located by the common label.
  • the records that belong to the matched PIIs can then be retrieved or queried accordingly.
  • Table 1 below shows a possible layout for labelled identification data.
  • the label is generated when datasets are imported into a client module.
  • a client module x_1 for the database 106a tabulates a generated label for each dataset attributable to an identification data as follows: Table 2: Data organisation in client module x_1
  • a client module x_n for the database 106n tabulates a generated label for each dataset attributable to an identification data as follows.
  • the database schema for the database 106a comprises the data fields “age group”, “gender” and “cost”, while the database schema for the database 106n comprises the data fields “length of stay”, “diagnosis code” and "hospitalisation cost”.
  • Client module x_1 sends (100, F_x1(S1187561A)), (200, F_x1(S7765432B)) to the central platform 104. That is, the client module x_1 transmits each of its stored identifiers, encrypted with the storage key x_1. Each of the encrypted identifiers F_x1(ID_i) is accompanied with its corresponding label.
  • Client module x_n sends (455, F_xn(S1187561A)), (610, F_xn(S7658920C)) to the central platform 104. That is, the client module x_n transmits each of its stored identifiers, encrypted with the storage key x_n. Each of the encrypted identifiers F_xn(ID_i) is accompanied with its corresponding label.
  • the central platform 104 further encrypts with key s: (100, F_s(F_x1(S1187561A))), (200, F_s(F_x1(S7765432B))), (455, F_s(F_xn(S1187561A))), (610, F_s(F_xn(S7658920C))).
  • the central platform 104 performs central encryption with the central key s to produce multiply encrypted identifiers F_s(F_xi(ID_i)).
  • the mapper module 102 receives the randomised strings from the central platform 104 and the storage keys x_1 and x_n.
  • the mapper module 102 removes the inner encryptions: (100, F_s(S1187561A)), (200, F_s(S7765432B)), (455, F_s(S1187561A)), (610, F_s(S7658920C)).
  • the mapper module 102 sends the common label 100 to client module x_1 and the common label 455 to client module x_n, as sketched in the example below.
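
Continuing the illustrative sketch introduced earlier (reusing its hash_to_group, keygen, encrypt and strip_layer helpers), the label flow of this worked example can be reproduced end to end; the labels and identifier values are those of the example above, everything else is assumed:

```python
# End-to-end sketch of the label flow. In the real protocol the mapper
# receives storage keys per dataset, not attached to each ciphertext;
# they are bundled here only to keep the sketch compact.
from collections import defaultdict

x1, xn, s = keygen(), keygen(), keygen()  # storage keys and central key

# Step 1: each client module encrypts its identifiers and attaches labels.
client_x1 = [(100, encrypt(hash_to_group("S1187561A"), x1)),
             (200, encrypt(hash_to_group("S7765432B"), x1))]
client_xn = [(455, encrypt(hash_to_group("S1187561A"), xn)),
             (610, encrypt(hash_to_group("S7658920C"), xn))]

# Step 2: the central platform adds the outer layer with its key s.
blinded = [(label, encrypt(c, s), key)
           for batch, key in ((client_x1, x1), (client_xn, xn))
           for label, c in batch]

# Step 3: the mapper strips each storage key; identifiers stay masked by s.
by_value = defaultdict(list)
for label, ciphertext, storage_key in blinded:
    by_value[strip_layer(ciphertext, storage_key)].append(label)

# Step 4: values seen under more than one label are the common identities.
common = [labels for labels in by_value.values() if len(labels) > 1]
print(common)  # [[100, 455]] -- the labels returned to client modules x1, xn
```
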
  • a user C generates a random key x and the server S in the central platform generates a random key y.
  • C_2 similarly performs the matching and sends the set of matching labels to S.
  • In step (6), C_2 performs the same protocol as C_1; the same applies in scenarios where there are more than two users.
  • the server S stores all the lists of matching labels so that when a user wishes to query for the records of the matched PIIs from other users, it is possible to cross-verify that the records returned are indeed those of the matched PIIs. This is performed by the server S and the users sending the set of matched labels to the querying user.
  • the second application may be used where client terminals store their entire databases containing all attributes, less the identifiers, at the server S.
  • This is a central storage setting where all data providers (e.g. client terminals C_1 and C_2) share their datasets (e.g. age, salary, zip code, insured amount) but not the identifiers (e.g. national registration identity number) to which the datasets belong.
  • client terminals C_1 and C_2 can perform matching based on the identifiers in their datasets using the above outlined protocol of the second implementation.
  • client terminal C_1 sends the matched labels to the server S for the server S to correctly combine the stored datasets obtained from client terminals C_1 and C_2.
  • client terminals C_1 and C_2 may have datasets as tabulated below.
  • Client terminal C_1 sends the illness data to the server S while client terminal C_2 sends the insurance data to the server S. Since each of the client terminals C_1 and C_2 incorporates the mapper module 102 (see Figure 3), each will possess multiply encrypted identifiers derived from identifiers stored in both the client terminals C_1 and C_2. Client terminal C_1 identifies the common identities (8229012 and 8056798) from their multiply encrypted format. Client terminal C_1 demarcates a label for each of the matching multiply encrypted identifiers as a common label. Client terminal C_1 then transmits the common labels to the server S to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, as illustrated below. Table 6: Server S dataset
  • the datasets stored in the server S are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifier are derived.
  • the server S or the central platform 104 establishes knowledge of the central key, through the use of a zero knowledge protocol (ZKP), to each of the databases 106a and 106n storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
  • the ZKP protocol used by the server S or the central platform 104 is outlined below:
  • server S processes the multiply encrypted identifiers with the random element s generated by the server S;
  • the server S or the central platform 104 is configured to generate a random element s and receive a random element c associated with the database 106a and 106n seeking to authenticate the server S or the central platform 104.
  • the server S or the central platform 104 then computes a proof p of the zero knowledge protocol based on the random element s generated by the server, the received random element c associated with the database and the central key y.
  • the database 106a and 106n seeking to authenticate the server S or the central platform 104 is configured to receive from the server: a subset of the multiply encrypted identifiers derived from identifiers stored at the database 106a and 106n, the subset having been processed by the random element s generated by the server S or the central platform 104; and the proof p of the zero knowledge protocol.
  • the database 106a and 106n process the stored identifiers from which the subset is obtained with the proof p to obtain evaluation parameters.
  • the database 106a and 106n validate the proof p of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
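
The steps above are consistent with a Schnorr-style proof of knowledge of the exponent behind the central encryption. As a hypothetical instantiation only (the patent does not name a concrete scheme), the following sketch reuses the toy group of the earlier commutative-encryption example, with names following the text: random element s from the server, random challenge c from the database, proof p:

```python
# The server proves knowledge of its central key y with b_i = a_i^y,
# without revealing y. Verification: a_i^p == commitment_i * b_i^c.

def zkp_prove(a_values: list[int], y: int, c: int):
    s = keygen()                                    # server's random element
    commitments = [pow(a, s, P) for a in a_values]  # identifiers processed with s
    p = (s + c * y) % Q                             # proof from s, c and the key y
    return commitments, p

def zkp_verify(a_values: list[int], b_values: list[int],
               commitments: list[int], c: int, p: int) -> bool:
    # a_i^p = a_i^(s + c*y) = a_i^s * (a_i^y)^c, so the check holds iff
    # the server really knew y; any tampering with b_i breaks the equality.
    return all(pow(a, p, P) == (t * pow(b, c, P)) % P
               for a, b, t in zip(a_values, b_values, commitments))

y = keygen()
a_values = [hash_to_group(i) for i in ("S1187561A", "S7765432B")]
b_values = [pow(a, y, P) for a in a_values]         # centrally encrypted identifiers
c = keygen()                                        # database's random challenge
commitments, p = zkp_prove(a_values, y, c)
assert zkp_verify(a_values, b_values, commitments, c, p)
```
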
  • the sketched proof is as below.
  • a_i are provided by the client, and b_i are provided by the server S.
  • the server S can prove to the client its knowledge of the key y' such that each b_i is the encryption of a_i under y'.
  • when the client chooses a vector (r_1, ..., r_n) with each r_i randomly chosen from [1, 2^l], the inner product between this vector and a non-zero vector (e_1, ..., e_n) is zero (i.e. the two vectors are perpendicular) only with negligible probability. For the check to pass with non-negligible probability, it has to be the case that the vector (e_1, ..., e_n) is a zero vector, thus e_i = 0 for each i.
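
Written out, this is the standard soundness bound for a random linear check (a reconstruction of the garbled passage; the explicit 2^-l bound is assumed rather than quoted from the patent):

```latex
% For any fixed non-zero error vector e, at most one value of r_{i*}
% (for an index i* with e_{i*} != 0) can satisfy the equation, so
\Pr_{r_1,\dots,r_n \leftarrow [1,\,2^{l}]}
  \left[ \sum_{i=1}^{n} r_i\, e_i = 0 \right] \;\le\; 2^{-l}
  \qquad \text{whenever } (e_1,\dots,e_n) \neq \mathbf{0},
% and acceptance with non-negligible probability forces e_i = 0 for all i.
```
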
  • a PII to which a dataset in a database belonging to the system 100 of Figure 1 is attributed may not always be atomic.
  • the PII may instead contain a few segments.
  • An example is IPv4 addresses, where an address (e.g. 192.168.1.1) can be divided into four segments, each matching an octet of its dotted format representation (This can be applied to IPv6 address in a similar manner).
  • the segments of an IP address are crucial for network analysis, for instance, to ascertain whether a group of IP addresses originated from an identical subnet. Losing this structure means the dataset loses utility and is of limited value. Any method that randomises (or masks) such PII must be able to preserve its structure. An approach to directly randomise each segment, treating every segment as a single identity attribute, is inefficient. This means if a PII has W segments, randomisation and blinding would have to be performed on all W segments in order to preserve its structure.
  • the client chooses the segments to be randomised and creates a list L containing identifiers of these selected segments, which serve as initial segments of identification data for the Structure Preserving Data Integration protocol. Only the segments in the list are randomised and blinded, in stages. The intuition is that in the majority of scenarios the resulting dataset of the first match (or intersection) would contain fewer records compared to the original datasets.
  • step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the initial segments of identification data that match.
  • the results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the initial segments of identification data that match is obtained.
  • a second chosen or further segment of the identification data may be processed next, but on the smaller dataset originated from the first intersection. That is, the mapper module 102 locates the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the further segments of identification data that match is obtained. Subsequent mapping can be carried out for the remaining segments, depending on how fine-grained the matching needs to be. Any key homomorphic encryption (or commutative encryption) scheme can be deployed, such as the one discussed in international application no. PCT/SG2017/050575.
  • a user C generates a random key x and the server S in the central platform 104 generates a random key y.
  • C_1 sends the set of matching tags (k_i) to S, for i ∈ [1, r], r ∈ [n].
  • the Structure Preserving Data Integration protocol works for any PII that can be divided into segments in order to preserve its structure. Two use cases of the structure-preserving data integration protocol are discussed below.
  • H(.) is a cryptographic hash function. This representation preserves prefixes of an IP address and is the only way that preserves the structure of an IP address without leaking information.
  • the client for example, might wish to find a mapping on segment H(A.B), and at a later stage, H(A.B.C).
  • the client C first randomises all four segments and generalises the non-identity attributes, in order to protect against the server learning the segments of the IP addresses.
  • the client C then creates a list L that contains two identifiers, one for H(A.B) and another for H(A.B.C).
  • the client C and the server S choose the identifier of H(A.B), i.e. an initial segment, from the list, retrieve H(A.B), jointly randomise and blind to find the matching records in the datasets contributed by other participating clients C.
  • the client C and the server S need only randomise and blind this initial segment.
  • the client C and the server S would only need to map from the results of matching H(A.B), which may contain fewer records compared to the full dataset.
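
A brief sketch of this prefix-hash representation and the staged matching it enables; plain SHA-256 stands in for the jointly randomised and blinded values of the real protocol, and the chosen segment indices play the role of the client's list L:

```python
import hashlib

def ip_segments(address: str) -> list[str]:
    """Represent an IPv4 address by hashes of its nested prefixes:
    H(A), H(A.B), H(A.B.C), H(A.B.C.D) -- preserving subnet structure."""
    octets = address.split(".")
    return [hashlib.sha256(".".join(octets[:i + 1]).encode()).hexdigest()
            for i in range(len(octets))]

def staged_match(dataset_a: list[str], dataset_b: list[str],
                 segment_list: list[int]) -> set[str]:
    """Intersect on a coarse segment first, then refine only the smaller
    set of survivors on finer segments (indices 1 then 2 below stand for
    H(A.B) then H(A.B.C))."""
    surviving = set(dataset_a)
    for seg in segment_list:
        b_segments = {ip_segments(ip)[seg] for ip in dataset_b}
        surviving = {ip for ip in surviving
                     if ip_segments(ip)[seg] in b_segments}
    return surviving

a = ["192.168.1.1", "192.168.2.9", "10.0.0.5"]
b = ["192.168.1.77", "172.16.4.2"]
print(staged_match(a, b, segment_list=[1, 2]))
# {'192.168.1.1'} -- shares both the H(A.B) and H(A.B.C) prefixes with b
```
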
  • Beyond IP addresses, it is possible that multiple columns in a database are used to represent an identity attribute, e.g. name, email and phone number. Instead of randomising all columns as one attribute, coarse-grained or fine-grained randomisation of multiple columns is used.
  • a client may define each column as a segment and create the list L containing the segments that are to be processed.
  • the consolidation of step 307 includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases 106a and 106n storing the common identification data.
  • the leakage of information due to in-stage interactions between the client C and the server S was modelled. Specifically, the server S and the client C learn the sets of matched randomised PIIs and their sizes for every stage of matching. This is because the clients C submit k_i to the server at the end of each iteration. It was found that, given the extra leakage of information, the server S and the client C are not able to learn any information about the underlying identity attribute.
  • An interactive stateful leakage function was defined which, given the dataset as input, outputs the size n of the dataset, the number of segments W and the list of segments L to be processed.
  • the segments are not unique, given that key homomorphic encryption is deterministic.
  • the server S is able to learn how many IDs have the same segments. This information may also enable the server S to learn, for example, that two datasets may be of the same domain by observing the received k_i, if the lists L submitted by the two participants are the same.
  • a potential mitigation is to adopt a prefix-preserving pseudonymisation mechanism that partitions the set of IP addresses, performs a migration to hide repetitive segments and replicates the IP addresses into a multi-view setting, so that the recipient has no idea which view is the real set of IP addresses.
  • Figure 4 shows operation flow for the query analyser 110 of Figure 1 when executing a distributed query protocol.
  • the distributed query protocol divides a query into sub-queries that are directed to the datasets of the respective organisations. Only record labels and aggregated results will be sent out from the datasets. The results will be merged based on mapped record labels.
  • the distributed query protocol ensures that raw data never leaves the respective organisations premises.
  • the distributed query protocol, executed by the query analyser 110, interrogates the fields under which data is organised in each of the databases 106a and 106n.
  • the mapper module 102 receives the schema of fields together with the multiply encrypted identifiers, such as in step 305 of Figure 3.
  • the query analyser 110 determines which of the schema of fields in the mapper module 102 has a format corresponding to search parameters of the received dataset query.
  • the query analyser 110 is also able to divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters.
  • the sub-queries are then distributed amongst the databases 106a and 106n, so that each of the databases 106a and 106n receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database 106a and 106n.
  • the query analyser 110 is also configured to merge the responses from each of the databases 106a and 106n to their received sub-queries and return the merged response to the dataset query received by the query analyser 110.
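
Schematically, the division of a dataset query into sub-queries can be illustrated as below; the schemas mirror the fields named earlier for databases 106a and 106n, while the query shape and field values are assumptions (Tables 7 to 9 are not reproduced in this text):

```python
# Each search parameter is routed to the database whose schema of fields
# contains the matching column, so every database only ever sees the
# sub-query whose parameters fit its own schema.
schemas = {
    "db_106a": ["age group", "gender", "cost"],
    "db_106n": ["length of stay", "diagnosis code", "hospitalisation cost"],
}

def divide_query(query: dict[str, object],
                 schemas: dict[str, list[str]]) -> dict[str, dict]:
    sub_queries: dict[str, dict] = {db: {} for db in schemas}
    for field, parameter in query.items():
        for db, fields in schemas.items():
            if field in fields:
                sub_queries[db][field] = parameter
    # Drop databases that received no matching parameters.
    return {db: q for db, q in sub_queries.items() if q}

query = {"age group": "30-39", "diagnosis code": "J18", "gender": "F"}
print(divide_query(query, schemas))
# {'db_106a': {'age group': '30-39', 'gender': 'F'},
#  'db_106n': {'diagnosis code': 'J18'}}
```
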
  • In step 401, a customer sends a query on a dataset ("dataset query").
  • In step 402, the Server Frontend (FE) 114 posts the dataset query to the Server API 116.
  • the server API 116 relays the dataset query to the query analyser (QA) 110.
  • In step 403, the QA 110 breaks down the dataset query into sub-queries. The sub-queries are sent to the user apps for relay to the databases 106a and 106n.
  • In step 404, each of the user apps queries its on-premise database 106a and 106n.
  • In step 405, the user apps send the results back to the QA 110.
  • In step 406, the QA 110 merges the results from the user apps.
  • In step 407, the QA 110 sends the merged results back to the Server API 116 for relay back to the Server FE 114.
  • In step 408, the Server FE 114 displays the merged results.
  • Tables 7 and 8 below each provide an example of data organised under a schema of fields found in a database.
  • Table 9 shows an example of a dataset query and its division into sub-queries based on the schema of fields shown in Table 7 and Table 8.
  • Table 9: Sample dataset query and sub-queries. Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module 112.
  • the machine learning module allows a data consumer to run analysis on the databases 106a and 106n that reside on premise.
  • the distributed machine learning (DML) protocol allows the data consumer to request that a machine learning model be computed on datasets residing in different organisations.
  • the machine learning module 112 is trained using vector values of data belonging to identifiers stored in the databases 106a and 106n. Recalling from Figures 2 and 3 that common identification data serves to integrate datasets, the identifiers used to train the machine learning module 112 are those derived from matching multiply encrypted identifiers consolidated in the mapper module 102. Parameters used by the machine learning module 112 for outcome prediction are updated based on an aggregation of the received vector values. The updating of the parameters is reiterated until convergence criteria is met. The machine learning module 112 is then configured to, after having been trained, reference data from one or more of the databases 106a and 106n when responding to received queries.
  • In step 501, a customer selects the machine learning model type to be trained on the datasets.
  • In step 502, the Server Frontend (FE) 114 posts the selected machine learning model to the Server API 116, which relays the request to the Distributed ML (DML) module 112.
  • In step 503, the DML module 112 requests the dataset dimensions and response variable from the user apps.
  • Each of the user apps queries its on-premise database 106a and 106n;
  • In step 504, the user apps relay the dataset dimensions and response variable back to the DML module 112;
  • In step 505, the DML module 112 initialises the parameter values used by the machine learning module 112 for outcome prediction and sends the relevant parameter values to each user app, along with the machine learning model type information;
  • In step 506, each user app computes local aggregated results based on the model type and uploads the aggregated results to the DML module 112;
  • In step 507, the DML module 112 uses the local aggregated results from the user apps to update the parameter values. If the convergence criteria are satisfied or the maximum number of iterations has been reached, it signals protocol finish to the user apps and performs step 509 below. Otherwise, the DML module 112 relays the updated parameter values to the user apps;
  • In step 508, the user apps receive the updated parameters. If the protocol finish signal has not been received, step 506 is repeated;
  • In step 509, the DML module 112 relays the model parameters back to the Server API 116;
  • In step 510, the Server Frontend 114 displays the results.
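
The loop of steps 503 to 509 follows the familiar federated-learning pattern. A minimal sketch under stated assumptions: linear regression as the selected model type, simple averaging as the aggregation rule, and a step-size-based convergence criterion, none of which the patent fixes:

```python
import numpy as np

# Steps 503-509 in miniature: the server initialises parameters, each user
# app computes a local aggregate (here a least-squares gradient), and the
# server averages the aggregates and updates until convergence.
def local_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    return X.T @ (X @ w - y) / len(y)    # per-database aggregated result

def train(parties, dim: int, lr: float = 0.1,
          tol: float = 1e-6, max_iters: int = 1000) -> np.ndarray:
    w = np.zeros(dim)                     # step 505: initialise parameters
    for _ in range(max_iters):
        grads = [local_gradient(X, y, w) for X, y in parties]  # step 506
        step = lr * np.mean(grads, axis=0)                     # step 507
        w -= step
        if np.linalg.norm(step) < tol:    # convergence criterion satisfied
            break
    return w                              # step 509: final model parameters

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(2):                        # two on-premise databases
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w))
print(train(parties, dim=2))              # approaches [2.0, -1.0]
```
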
  • the system 100 provides a data contributor with an access control mechanism and a configurable risk assessment module, so that the data contributor may decide whether to reveal certain attributes of its database for query or learning by the data consumer. These mechanisms and the configurable risk assessment module can also be used to set the price plan for the data consumer to perform queries or analysis on more sensitive information.
  • the distributed query and distributed machine learning steps can be performed independently from the intersection mechanism.
  • Upon determining a common identity from its multiply encrypted form, the mapper module 102 is configured to query a registration directory storing an address of each of the databases 106a and 106n. With reference to step 309 of Figure 3, this allows the mapper module 102 to obtain the address of each of the databases 106a and 106n to which matching results should be sent.
  • the system 100 of Figure 1 uses multiple encryption that comprises storage encryption and a central encryption.
  • the removal of the storage encryption results in matching to be performed on identifiers encrypted by the central encryption.
  • This central encryption is performed using a central key y at a server from which the mapper module 102 receives the multiply encrypted identifiers.
  • the mapper module 102 is configured to return the results of the match through a server from which the mapper module 102 receives the multiply encrypted identifiers.
  • the mapper module 102 is integrated with the server.
  • the mapper module 102 is integrated into a client terminal that integrates one or more of the databases 106a and 106n. It will be appreciated that integration refers to these components being part of a designated network and does not necessarily require for them to be internally situated.
  • the present invention provides a system for privacy-preserving data sharing and integration, which enables organisations to match, query and analyse their datasets without the datasets leaving the premises of the organisations.
  • the system is initiated by a user registering to the system and an administrator managing the list of users and the applications.
  • a data integration protocol enables matching of personally identifiable information (PII) in a privacy-preserving manner. Users (who can be data donors and/or data consumers) upload and integrate their datasets by performing the data integration protocol together with a central platform to randomise and map/intersect their PII information using the mapper module of the user and the server application. Once mapped, the information (or labels) on the sets of matched PIIs are returned to each respective user.
  • Coupled with the data integration protocol is a zero-knowledge proof protocol that allows verification of the correctness of the attributes processed by the server during data integration. Any modification by the server on the attributes will be detected through the zero-knowledge proof protocol.
  • the data integration protocol is flexible in that it caters for a PII comprising more than one segment (e.g. IP addresses) or different identification data (e.g. multi-column IDs), using structure preserving techniques.
  • data handled by the data integration protocol is also labelled in a way that, even when the datasets remain on premise, organisations are able to retrieve or refer to records of matched PIIs.

Abstract

According to an aspect of the present invention, there is provided a system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.

Description

Privacy Preserving System For Mapping Common Identities
FIELD
The present invention relates to a system for mapping common identities stored in separate databases.
BACKGROUND
The proliferation of interconnected communication devices enables individuals and organisations to easily communicate and share information. This results in organisations possessing large amounts of sensitive information compared to the past. Commensurately, management of such information is also becoming more regulated. Privacy protection laws have been introduced, such as the General Data Protection Regulation (GDPR) by the EU.
The approach of ensuring data never leaves an organisation premise is preferred for preserving data privacy and providing better control over datasets. There are known techniques that provide privacy-preserving data intersection between two or more participants. Some of these techniques only provide intersection in specific settings and extensive engineering effort is required for them to be practically deployable. While comprehensive, private set operation (PSO) protocols that provide privacy-preserving intersection incur substantial computational and communication overhead.
There is thus a need for a system that can share datasets in a privacy-preserving manner without data leaving an organisation.
SUMMARY OF THE INVENTION
According to an aspect of the present invention, there is provided a system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.
In one implementation, the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived. In another implementation, the results of the match are sent to a server storing datasets attributable to the matching multiply encrypted identifiers. In this other implementation, the datasets are contributed by the one or more databases storing the identifiers from which the multiply encrypted identifiers are derived. The sending of the results of the match to the server is to link the stored datasets that are attributable to the matching multiply encrypted identifier.
In a first scenario, at least one of the identifiers may comprise an initial segment of identification data. The consolidation of the multiply encrypted identifiers by the mapper module then includes those derived from the initial segments of identification data that match. The results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained. An extension of this first scenario has at least one of the identifiers comprise a further segment of the identification data, whereby the mapper module is configured to locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The mapper module then returns the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
In a second scenario, at least one of the identifiers comprises different identification data with at least one in common with another of the identifiers. The consolidation of the multiply encrypted identifiers by the mapper module then includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases storing the common identification data.
In an implementation where the multiple encryption comprises the storage encryption and a central encryption performed using a central key at a server from which the mapper module receives the multiply encrypted identifiers, the server is configured to establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
Each of the multiply encrypted identifiers may be accompanied with schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived. The system has a query analyser configured to interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query. The query analyser divides the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters and distributes the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
The system has a machine learning module configured to, when being trained: receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived; update parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and reiterate the updating of the parameters until convergence criteria are met. After having been trained, the machine learning module is configured to reference data from one or more of the databases when responding to received queries.
BRIEF DESCRIPTION OF THE DRAWINGS
Representative embodiments of the present invention are herein described, by way of example only, with reference to the accompanying drawings, wherein:
Figure 1 shows a schematic of the system architecture in which data integration in accordance with various embodiments of the present invention is deployed.
Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities, into a database of the system of Figure 1.
Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module of Figure 1.
Figure 4 shows operation flow for the query analyser of Figure 1 when executing a distributed query protocol.
Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module of Figure 1.
DETAILED DESCRIPTION
In the following description, various embodiments are described with reference to the drawings, where like reference characters generally refer to the same parts throughout the different views.
The present application finds relevance for organisations seeking to share, combine and jointly process their datasets. Having access to data from other organisations provides insight to make better decisions or create greater social impact, since there is a limit to the insights that can be extracted from datasets belonging to a single organisation.
However, sharing data between different organisations raises privacy concerns, since such datasets could include customer data and medical data which are sensitive in nature and privy to the hosting organisation. The objective of the present application is to enable organisations to share and generate insights from joint datasets in a privacy-preserving manner.
The present application adopts an approach where joint datasets for analysis are achieved while data remains on the organisations' premises. This ensures that data does not leave the organisation within which it is stored, thus preserving data privacy. To locate joint datasets, a mapper module is adopted which identifies the databases that contain the data upon which joint analysis can be based. The mapper module uses identification data commonality as one criterion to determine whether data in separate databases can be used for joint analysis, i.e. if organisations store data attributable to a common entity (e.g. the same party may have patient records stored with a hospital database, policy records stored with an insurance company database and past expenditure records stored with a credit card company), such data stored in different organisations, once recognised, becomes a joint dataset usable for joint analysis. The mapper module co-ordinates which of its registered databases stores data attributable to a common entity, whereby an identifier used by the common entity forms the shared attribute to locate the data across the registered databases. As such, the data can be traced through such an identifier. Each identifier comprises one or more items of identification data (e.g. any type of PII (personally identifiable information), such as national registration identity number, social security number, foreign identification number, telephone contact numbers, email address, residence address, name, bank account number, credit card primary account number) provided by the common entity when registering with each of these databases to establish data ownership. In addition, the identifier may also comprise identification data unique to or generated by each of the databases (e.g. IP address, serial number, IMEI (international mobile equipment identity) number, virtual identification number). Therefore, such an identifier provides a means to locate joint datasets. When the mapper module discovers that an identifier matches, i.e. the same identifier is found in several databases, the mapper records the matching identifier in a database used to store matching identifiers.
Since the present application adopts the approach of privacy preservation, the mapper module performs identity matching on encrypted identifiers, rather than in their plain-text form. In more detail, when each of the databases transmits their stored identifiers for the mapper module to determine whether each has a common identity stored in another database, they are already in randomised format from having been encrypted at their respective database. Such encryption is through the use of an encryption key generated at each of the respective databases. This key is called a storage key, a dataset key or a client key. The encrypted identifiers undergo one or more layers of encryption before they are received by the mapper module, so that the mapper module receives multiply encrypted identifiers.
One reason for using multiple encryption is because the storage key used by each database to encrypt its transmitted identifiers is different from the storage key used by another database (e.g. a first database uses storage key x1, while a second database uses storage key x2), so that each database is able to independently encrypt its transmitted identifiers without the need to a priori communicate and share a key as in many existing techniques. The mapper module is unable to perform matching if in receipt of identifiers encrypted with just their respective storage keys alone. Receiving the storage key from each of the databases to decrypt the singly encrypted identifiers would then reveal the identifiers, running contrary to the purpose of privacy preservation. Multiple encryption achieves an outer layer of encryption to the inner encryption provided by the storage key. When the mapper module removes the inner encryption from receiving the respective storage key, the identifiers are still encrypted by the outer encryption, so that the identifiers are still masked to the mapper module, thereby achieving privacy preservation.
The multiple encryption used is homomorphic or commutative in nature, i.e. encryption that allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext. Each of these additional layers of encryption may be applied by an intermediary that routes the encrypted identifiers to the mapper module. For instance, a server uses its own key (called a central key or a server key) to apply an additional layer of encryption to produce the multiply encrypted identifiers. As such, the records kept in the mapper module are encrypted forms of the identifiers. The mapper module receives the storage key used to encrypt each identifier stored at each database. The storage encryption is removed from each of the multiply encrypted identifiers, whereby matching is then performed on the multiply encrypted identifiers following the storage encryption removal (called "matching multiply encrypted identifiers"). The results of the match are then transmitted so as to facilitate the location of joint datasets.
In the case where the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived, such results indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases. Data attributable to each of the matching identities is then available for integration, i.e. usable to become a dataset for joint analysis. It will thus be appreciated that data integration effected by the mapper module does not result in data being combined in a central location. Rather, the data integration effected by the mapper module points to the databases containing the data that can be used for joint dataset analysis. In addition, in the context of the present application, identifiers and dataset are two distinctively different types of data. Identifiers are used to establish ownership of content found in a dataset.
The mapper module is not restricted to performing matching on multiply encrypted identifiers that are derived from singular identification data contained within a column entry (e.g. any one of national registration identity number, telephone contact number, email address, credit card primary account number or IP address in complete form). In one implementation, the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from a segment of identification data (e.g. segment "3456" from a national registration identity number "S1234567G"). This is advantageous in situations where only a selected segment of identification data is required to locate a common identity from other identities stored across all databases. In situations where a further segment of the identification data is required (e.g. further segment "7G" to eliminate national registration identity number "S1834568H" and locate national registration identity number "S1234567G"), the location of multiply encrypted identifiers can be based on results of the matching performed on the initial segment of identification data. The mapper module would then have to analyse fewer records when using the results from the earlier search for common identities having the initial segment of identification data.
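The staged narrowing described above can be illustrated with a minimal sketch; the function and variable names are assumptions, and strings such as "E(3456)" stand in for multiply encrypted segments after storage-encryption removal:

```python
from collections import defaultdict

def match_stage(records):
    """records: (database, label, blinded_segment) triples.
    Group labels by blinded segment; keep segments seen in more than one database."""
    groups = defaultdict(list)
    for db, label, seg in records:
        groups[seg].append((db, label))
    return {seg: entries for seg, entries in groups.items()
            if len({db for db, _ in entries}) > 1}

# Stage 1: match on the blinded initial segment (e.g. derived from "3456").
initial = [("106a", 100, "E(3456)"), ("106n", 455, "E(3456)"), ("106a", 200, "E(9912)")]
survivors = {(db, lbl) for entries in match_stage(initial).values() for (db, lbl) in entries}

# Stage 2: refine on the blinded further segment (e.g. derived from "7G"),
# scanning only the fewer records that survived stage 1.
further = [("106a", 100, "E(7G)"), ("106n", 455, "E(7G)"), ("106a", 200, "E(8H)")]
refined = match_stage(r for r in further if (r[0], r[1]) in survivors)
```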
In another implementation, the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from different identification data (e.g. two or more of national registration identity number, telephone contact number, email address, credit card primary account number and IP address), where at least one of the identification data is present in another multiply encrypted identifier. For example: A first multiply encrypted identifier is derived from national registration identity number, telephone contact number and email address. A second multiply encrypted identifier is derived from telephone contact number, credit card primary account number and IP address. The first multiply encrypted identifier is considered to match the second multiply encrypted identifier if the telephone contact number is the same. A third multiply encrypted identifier may also be present, derived from credit card primary account number and virtual identification number. The third multiply encrypted identifier is considered to match the first and second multiply encrypted identifiers if the credit card primary account number, found in the second multiply encrypted identifier and the third multiply encrypted identifier, matches. While the first multiply encrypted identifier does not have the credit card primary account number, the first multiply encrypted identifier is considered to match the third multiply encrypted identifier because of the linkage brought about by the second multiply encrypted identifier. The first multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common telephone contact number), while the third multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common credit card primary account number). This other implementation thus allows for matching of identifiers comprising multiple different identification data, i.e. identification data contained over several column entries in a database. The protocol used by the mapper module finds a match in a privacy-preserving manner among the encrypted identifiers contributed by each of the databases. The multiply encrypted identifiers having common identification data are considered to match.
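The transitive linkage in this example is essentially connected-component grouping, which a union-find structure captures. The sketch below uses plaintext stand-ins for the multiply encrypted values, and all names are hypothetical; it groups the three identifiers above into one identity:

```python
parent = {}

def find(x):
    """Union-find with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

identifiers = {
    "first":  {"nric": "S1234567G", "phone": "91234567", "email": "a@x.com"},
    "second": {"phone": "91234567", "card": "4111-0000", "ip": "192.168.1.1"},
    "third":  {"card": "4111-0000", "virtual_id": "V-778"},
}
# Two identifiers match if any identification datum coincides; matches are
# transitive, so "first" links to "third" through "second".
for name, fields in identifiers.items():
    for value in fields.values():
        union(name, value)
assert find("first") == find("third")
```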
The mapper module is further configured with a zero-knowledge protocol that allows verification of the correctness of the attributes (i.e. the multiply encrypted identifiers) processed during data integration. The zero-knowledge protocol allows the database to verify the correctness of the blinded PII (i.e. the encrypted identifiers) stored in the mapper module and verify that the data received by the mapper module has not been tampered with. Any modification by the mapper module will be detected through the zero-knowledge protocol. The proof of the zero-knowledge proof protocol is computed from a random element generated by the server, a random element generated by the database seeking to authenticate the server and the central key. The proof is validated if the comparison of processing the stored identifiers with the proof against their encrypted versions after being processed with the random element generated by the server is a match.
After the matching results (i.e. the matched PIIs) are received by each of the databases, a query analyser operates in tandem with the mapper module to allow a data consumer to formulate a dataset query directed at data stored in each of the databases. The query analyser queries the datasets through a distributed query mechanism, whereby the dataset query is divided into sub-queries, each categorised for its intended destination. Each sub-query is then sent to the database storing the data fitting the parameters of the sub-query.
If, instead of a query, the data consumer wishes to perform analysis on the data stored in each of the databases, a machine learning module that has a distributed machine learning function is deployed. The machine learning module allows the data consumer to run analysis on the databases that reside on premise, whereby the machine learning module is trained with vector values of data belonging to the matching encrypted identifiers stored in the mapper module.
The system in which the mapper module, the query analyser and the machine learning module operate is described below.
Figure 1 shows a schematic of the system 100 architecture in which data integration in accordance with various embodiments of the present invention is deployed. Users of the system 100 can be a data contributor and/or a data consumer. A data contributor wishes to share its data either to sell to a data consumer, or to integrate its data with other data contributors so that better insights can be obtained from joint datasets. To share data in a privacy-preserving manner, the system 100 uses a protocol that enables privacy-preserving integration of datasets from multiple databases contributed by participating contributors.
The system 100 comprises a mapper module 102; a central platform 104; databases 106a, 106n, 108; a query analyser 110 and a machine learning module 112. In the implementation shown in Figure 1, databases 106a and 106n are client side components, while the central platform 104, the database 108, the query analyser 110 and the machine learning module 112 are server side components.
The central platform 104 has a frontend 114 for website access and a backend on which an application programming interface (API) 116 resides. The central platform 104 acts as a server to client terminals to which the databases 106a and 106n are associated. The system 100 may have more than one cluster of a central platform and the databases to which that central platform serves as a server. For the sake of simplicity, Figure 1 shows one cluster, to which the central platform 104 and the two databases 106a and 106n belong. The system 100 seeks to map common identities stored in the separate databases 106a and 106n. Initialisation of the mapper module 102 to map common identities is described below with reference to Figures 2 and 3.
Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities. The operation flow includes a user 250 uploading a dataset to a database (analogous to databases 106a and 106n shown in Figure 1); key creation to randomise the identification data; and uploading database schema to the server (see the central platform 104 of Figure 1) through the server API 116. Database schema refers to the schema of fields under which data is organised in the database. The following describes the steps involved in this operation flow of Figure 2. Here we denote F(.) as a key homomorphic encryption or a commutative encryption scheme, where the following property holds: given keys x, y, and identity attribute IDi, Fy(Fx(IDi)) = Fx(Fy(IDi)).
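A minimal sketch of such a commutative F(.) is exponentiation in a DDH-hard group, i.e. Fk(m) = H(m)^k mod p. The parameters below are deliberately tiny toy values (a real deployment would use a 2048-bit group), and all names are illustrative assumptions rather than the patented construction:

```python
import hashlib

P = 1019                 # toy safe prime, P = 2*Q + 1; far too small for real use
Q = (P - 1) // 2         # 509, the (prime) order of the quadratic-residue subgroup

def hash_to_group(identifier):
    """Hash an identifier into the order-Q subgroup (square the digest mod P)."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(digest % P, 2, P)

def F(key, element):
    """Exponentiation cipher: commutative because F_y(F_x(m)) = m^(x*y) = F_x(F_y(m))."""
    return pow(element, key, P)

def remove_layer(key, element):
    """Strip one encryption layer by exponentiating with key^{-1} mod Q (Python 3.8+)."""
    return pow(element, pow(key, -1, Q), P)

x, y = 123, 456                                  # storage key x, central key y
inner = F(x, hash_to_group("S1234567G"))         # storage encryption at the database
multi = F(y, inner)                              # central encryption at the server
# The mapper removes the storage layer; the identifier stays blinded under y:
assert remove_layer(x, multi) == F(y, hash_to_group("S1234567G"))
```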
In step 201, a user 250 selects and uploads a file containing a dataset. The content of the dataset depends on the transactions which the identifier owner performs and that the databases 106a and 106n are responsible for recording (e.g. sum insured and insurance policies if the database 106a belongs to an insurance company; illness and duration of ward stay if the database 106n belongs to a hospital). In step 202, the user 250 defines the dataset schema and the PII column(s) containing identifiers IDi. As mentioned above, each identifier comprises one or more items of identification data of the party that owns data stored in the databases 106a and 106n.
In step 203, a user application 252 generates a dataset key x which is used to perform storage encryption at the databases 106a and 106n. In step 204, the user application 252 imports the uploaded file into the databases 106a and 106n. In step 205, the user application 252 randomises the identifiers IDi in the PII columns with the storage key x, to produce encrypted identifiers Fx(IDi).
In step 206, the user application 252 sends the dataset schema and the encrypted identifiers Fx(IDi) to the central platform 104 through its server API 116. In step 207, the user application 252 sends the storage key x to the mapper module 102, for example through a REST API over HTTPS. A new dataset key xn may be generated for each dataset upload operation.
The system 100 adopts a data integration protocol without data leaving the premises. Only randomised identity attributes are submitted to the central server for blinding. As long as each dataset shares a PII column that allows for uniquely identifying a record, i.e. the mapper module 102 is able to identify common identities from its received randomised identifiers, the data integration protocol is able to match and then inform each participant of the matched PIIs without revealing the underlying identity information of the non-matched PIIs. The steps are performed while the datasets remain on premise. It is computationally infeasible for the server/central platform 104 to re-identify the underlying identity of the randomised identifiers, as long as a secure key homomorphic encryption (or commutative encryption) scheme is used, such as one that is based on a DDH (Decisional Diffie-Hellman) assumption. The data integration protocol is described in greater detail with reference to Figure 3.
Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module 102.
In step 301, a user 350 selects two or more datasets for integration/merging. In step 302, the user 350 defines the merged dataset schema.
In step 303, the server frontend (FE) 114 posts the merged schema to the server API 116. In step 304, the server API 116 randomises the PII columns of the selected datasets with its own server key y (also referred to as the central key), resulting in multiply encrypted identifiers Fy(Fx(IDi)).
In step 305, the server API 116 sends the multiply encrypted identifiers Fy(Fx(IDi)) to the mapper module 102. This results in the mapper module 102 receiving multiply encrypted identifiers Fy(Fx(IDi)), where each is derived from an identifier IDi having undergone multiple encryption. The multiple encryption refers to the encryption performed on the identifier at the database using the storage key xn (see the step 205 of Figure 2) and the encryption performed by the server API 116 using the central key y on the identifier that is already encrypted using the storage key xn (as per the step 304). As mentioned earlier, the storage key xn used to encrypt each of the identifiers stored by the databases 106a and 106n may change. The central key y can also change for each batch of multiple encryption that is performed.
In step 306, the mapper module 102 decrypts each of the multiply encrypted identifiers Fy(Fx(IDi)) with their respective storage keys x. The receipt of these storage keys x, used to perform storage encryption at each of the databases (cf. the databases 106a and 106n shown in Figure 2) storing one or more of the identifiers IDi, by the mapper module 102 was described with reference to Figure 2 (see the step 207). The decryption using the respective storage key x removes the storage encryption from each of the multiply encrypted identifiers Fy(Fx(IDi)), resulting in identifiers encrypted by the central encryption, Fy(IDi). While only dual encryption has been discussed thus far (i.e. firstly by using the storage key x, followed by using the central key y), another implementation may use several layers of encryption (i.e. more than two). In this other implementation, the identifiers will still be multiply encrypted after removal of the storage encryption.
In step 307, the server and the mapper module 102 build and store the data mapping results based on the mapping of the PII columns, i.e. an inner join. The mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of the storage encryption, hereafter referred to as "matching multiply encrypted identifiers" (rather than the verbose phrase "multiply encrypted identifiers that match after the removal of the storage encryption"). Each such multiply encrypted identifier is derived from common identification data. In step 308, the mapper module 102 discards the encryption key information, i.e. all the received storage keys x. In step 309, the mapper module 102 returns results of the match to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived. The results of the match are returned so as to provide an indication of which of the one or more identities stored in one of the databases (e.g. database 106a of Figure 1) is also stored in one or more of the other databases (e.g. database 106n of Figure 1). In one implementation, the results of the match are returned via the mapper module 102 sending mapped record labels to the respective user applications.
The label is a stateful parameter that indexes the randomised PIIs. The label is used to locate an identifier within the database storing the identifier from which the multiply encrypted identifier Fy(Fx(IDi)) is derived. For example, each multiply encrypted identifier Fy(Fx(IDi)) received by the mapper module 102 is accompanied with such a label. The label can also be used to combine datasets, each of which is contributed from a different database.
Returning to the step 307, the mapper module 102 performs intersection of the blinded PIIs provided by all the participants. The mapper module 102 compiles a table of matching blinded PIIs and their accompanying labels, thereby consolidating the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels. The objective is that only randomised PII is required by the central platform 104. Once matched, the central platform 104 has common labels that can be passed to the participants to ascertain which are the PIIs that match with the PIIs contributed by other participants. This is achieved in the step 309, where the mapper module 102 sends the table consisting of the labels to every participant, thereby returning each of the common labels to each of the databases storing one or more of the identifiers located by the common label. The records that belong to the matched PIIs can then be retrieved or queried accordingly.
Table 1 below shows a possible layout for labelled identification data.
Table 1: Labelled identification data

Label    Randomised identification data
100      Fx1(S1187561A)
200      Fx1(S7765432B)
In a first implementation, the label is generated when datasets are imported into a client module. For example, with reference to step 204 of Figure 2, a client module x1 for the database 106a tabulates a generated label for each dataset attributable to identification data as follows:
Table 2: Data organisation in client module x1

Label    Identification data    Dataset (age group, gender, cost)
100      S1187561A              …
200      S7765432B              …
while a client module xn for the database 106n tabulates a generated label for each dataset attributable to identification data as follows.
Table 3: Data organisation in client module xn

Label    Identification data    Dataset (length of stay, diagnosis code, hospitalisation cost)
455      S1187561A              …
610      S7658920C              …
The database schema for the database 106a comprises the data fields "age group", "gender" and "cost", while the database schema for the database 106n comprises the data fields "length of stay", "diagnosis code" and "hospitalisation cost".
The following operation flow may then be executed to achieve labelled data integration:
Labelled Data Integration Protocol
a) Client module x1 sends (100, Fx1(S1187561A)), (200, Fx1(S7765432B)) to the central platform 104. That is, the client module x1 transmits each of its stored identifiers, encrypted with the storage key x1. Each of the encrypted identifiers Fx1(IDx1) is accompanied with its corresponding label.
b) Client module xn sends (455, Fxn(S1187561A)), (610, Fxn(S7658920C)) to the central platform 104. That is, the client module xn transmits each of its stored identifiers, encrypted with the storage key xn. Each of the encrypted identifiers Fxn(IDxn) is accompanied with its corresponding label.
c) The central platform 104 further encrypts with key s: (100, Fs(Fx1(S1187561A))), (200, Fs(Fx1(S7765432B))), (455, Fs(Fxn(S1187561A))), (610, Fs(Fxn(S7658920C))). As such, the central platform 104 performs central encryption with the central key s to produce multiply encrypted identifiers Fs(Fxi(IDi)).
d) The mapper module 102 receives the randomised strings from the central platform 104 and the storage keys x1 and xn. The mapper module 102 removes the inner encryptions: (100, Fs(S1187561A)), (200, Fs(S7765432B)), (455, Fs(S1187561A)), (610, Fs(S7658920C)).
e) The mapper module 102 performs mapping by locating the matching multiply encrypted identifiers: Fs(S1187561A) = Fs(S1187561A)
f) Common labels are identified: (100, 455)
g) The mapper module 102 sends the common label 100 to client module x1 and the common label 455 to client module xn
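An end-to-end toy run of steps a) to g) can be sketched by reusing P, Q, F, hash_to_group and remove_layer from the earlier sketch; the keys are arbitrary illustrative values, and the toy modulus is so small that hash collisions are possible (a realistic group size rules this out):

```python
from collections import defaultdict

x1, xn, s = 11, 22, 33   # storage keys of client modules x1 and xn; central key s

# a), b) each client module sends its labelled, storage-encrypted identifiers
client_x1 = [(100, F(x1, hash_to_group("S1187561A"))),
             (200, F(x1, hash_to_group("S7765432B")))]
client_xn = [(455, F(xn, hash_to_group("S1187561A"))),
             (610, F(xn, hash_to_group("S7658920C")))]

# c) the central platform adds the outer layer with the central key s
blinded = [("x1", lbl, F(s, ct)) for lbl, ct in client_x1] + \
          [("xn", lbl, F(s, ct)) for lbl, ct in client_xn]

# d) the mapper removes each inner (storage) layer with the received storage key
keys = {"x1": x1, "xn": xn}
stripped = [(src, lbl, remove_layer(keys[src], ct)) for src, lbl, ct in blinded]

# e), f) matching multiply encrypted identifiers share the same residual value
groups = defaultdict(list)
for src, lbl, ct in stripped:
    groups[ct].append((src, lbl))
common = [entries for entries in groups.values() if len(entries) > 1]
# g) common == [[("x1", 100), ("xn", 455)]]: label 100 goes to x1, 455 to xn
```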
In a second implementation, only matching labels are sent to a server. The protocol used in this second implementation is outlined below.
Labelled Data Integration Protocol
Key Setup
(1). A user C generates a random key x and the server S in the central platform generates a random key y.
Randomization
(2). C performs randomisation on each IDi of a labelled set {(labeli, IDi)}, i ∈ [1, n]:
(a) for each IDi, compute ai = Fx(IDi);
(3). C submits to S the randomised sequence of IDs {(labeli, ai)}, i ∈ [1, n].
Blinding & Matching
(4). S runs Blinding (see steps 304 to 307 of Figure 3) & Proving (described in greater detail under Zero Knowledge Proof of Correctness Protocol) on {(labeli, ai)}, i ∈ [1, n], and sends the proof p of the zero knowledge protocol to C.
(5) C runs Verification (described in greater detail under Zero Knowledge Proof of Correctness Protocol) on p.
(6). Given two sets of tuples {(labeli, bi)}, i ∈ [1, n], from C1 and {(label'j, b'j)}, j ∈ [1, n'], from C2, perform intersection such that:
• if bi = b'j for some i ∈ [1, n] and j ∈ [1, n'], store labeli;
• discard all non-matching tuples.
C1 sends the set of matching labels {labeli} to S, for i ∈ [1, r], r ≤ n. C2 similarly performs the matching and sends its set of matching labels {label'j} to S, for j ∈ [1, r'], r' ≤ n'.
In Step (6), C2 performs the same protocol as C1; the same applies in scenarios where there are more than two users. The server S stores all the lists of matching labels so that, when a user wishes to query for the records of the matched PIIs from other users, it is possible to cross-verify that the records returned are indeed those of the matched PIIs. This is performed by the server S and the users sending the set of matched labels to the querying user.
The second implementation may be used where client terminals store their entire databases containing all attributes, less the identifiers, at the server S. This is a central storage setting where all data providers (e.g. client terminals C1 and C2) share their datasets (e.g. age, salary, zip code, insured amount) but not the identifiers (e.g. national registration identity number) to which the datasets belong. Once the datasets containing the attributes but not the identifiers are centrally stored, client terminals C1 and C2 can perform matching based on the identifiers in their datasets using the above outlined protocol of the second implementation. Once the matched labels are obtained, client terminal C1 sends the matched labels to the server S for the server S to correctly combine the stored datasets obtained from client terminals C1 and C2.
To illustrate, client terminals C1 and C2 may have datasets as tabulated below.
Table 4: Client terminal C1 dataset
Table 5: Client terminal C2 dataset
Client terminal C1 sends the illness data to the server S while client terminal C2 sends the insurance data to the server S. Since each of the client terminals C1 and C2 incorporates the mapper module 102 (see Figure 3), each will possess multiply encrypted identifiers derived from identifiers stored in both the client terminals C1 and C2. Client terminal C1 identifies the common identities (8229012 and 8056798) from their multiply encrypted format. Client terminal C1 demarcates a label for each of the matching multiply encrypted identifiers as a common label. Client terminal C1 then transmits the common labels to the server S to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, as illustrated below. Table 6: Server S dataset
From the above, it will be appreciated that the datasets stored in the server S are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifiers are derived.
Client terminal C2 can also similarly perform the matching and send its labels for the server S to validate that the matched labels submitted by C1 and C2 are indeed correct, i.e. if the labels by C1 = the labels by C2.
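A small sketch of this linking-and-validation step at the server S follows; labels and field values are illustrative placeholders:

```python
def link_and_validate(labels_c1, labels_c2, c1_data, c2_data):
    """Server S: accept only if both clients report the same matched labels,
    then link the stored datasets that share a matched label."""
    if set(labels_c1) != set(labels_c2):
        raise ValueError("matched labels from C1 and C2 disagree")
    return {lbl: {**c1_data[lbl], **c2_data[lbl]} for lbl in labels_c1}

c1_data = {"L1": {"illness": "..."}, "L2": {"illness": "..."}}                # from C1
c2_data = {"L1": {"insured_amount": "..."}, "L2": {"insured_amount": "..."}}  # from C2
joined = link_and_validate(["L1", "L2"], ["L2", "L1"], c1_data, c2_data)
```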
The server S or the central platform 104 establishes knowledge of the central key, through the use of a zero knowledge protocol (ZKP), to each of the databases 106a and 106n storing one or more of the identifiers from which the multiply encrypted identifiers are derived. The ZKP used by the server S or the central platform 104 is outlined below:
Zero Knowledge Proof of Correctness Protocol
Blinding & Proving
(5). S blinds each received ai by computing bi = Fy(ai).
(6). S returns {bi}, i ∈ [1, n], to C.
(7). C and S jointly compute a zero-knowledge proof p of correctness:
(a) C generates and sends ri, i ∈ [1, n], to S, where ri ∈R [1, 2^l], with l the security parameter;
(b) S computes U = ∏i ai^ri and V = ∏i bi^ri. This serves to determine a subset of the multiply encrypted identifiers for use to authenticate S;
(c) S picks a random element s from [1, q − 1], q being the group order, and computes T1 = U^s and T2 = a1^s. That is, the server S processes the multiply encrypted identifiers with the random element s generated by the server S;
(d) S sends (U, V, T1, T2) to C;
(e) C randomly selects and sends c to S, i.e. C sends a random element c from the database 106a and 106n seeking to authenticate S;
(f) S computes t = s − c·y and outputs the proof as p = t.
Verification
(8). C verifies p:
(a) for all ai and bi, compute U = ∏i ai^ri and V = ∏i bi^ri;
(b) compute evaluation parameters T'1 = U^t · V^c and T'2 = a1^t · b1^c. The evaluation parameters T'1 and T'2 are thus obtained from processing the stored identifiers in the database 106a and 106n, from which the subset of item (7)(b) is obtained, with the proof p. These evaluation parameters T'1 and T'2 are used to validate the proof p of the zero knowledge protocol;
(c) output true if T'1 = T1 and T'2 = T2.
From step (7), the server S or the central platform 104 is configured to generate a random element s and receive a random element c associated with the database 106a and 106n seeking to authenticate the server S or the central platform 104. The server S or the central platform 104 then computes a proof p of the zero knowledge protocol based on the random element s generated by the server, the received random element c associated with the database and the central key y.
From step (8), the database 106a and 106n seeking to authenticate the server S or the central platform 104 is configured to receive from the server: a subset of the multiply encrypted identifiers derived from identifiers stored at the database 106a and 106n, the subset having been processed by the random element s generated by the server S or the central platform 104; and the proof p of the zero knowledge protocol. The database 106a and 106n processes the stored identifiers from which the subset is obtained with the proof p to obtain evaluation parameters. The database 106a and 106n validates the proof p of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
Further detail is provided below on the server S construction of the ZKP p of correctness (in terms of computation performed on the randomised dataset submitted by the client) as stated in steps (7)-(8) above. These steps are advantageous in avoiding the high communication/computation overhead of computing a ZKP for every pair (ai, bi) separately.
By step (5) of the protocol, the server S knows ai = Fx(IDi) and bi = Fy (ai) = Fxy(IDi) for all i in a submitted dataset; on the other hand, at step (8), the client has knowledge of all elements ai and bi as well. The server and the client can thus perform a ZKP protocol. The sketched proof is as below.
The server S can prove to the client its knowledge of the key y (that was used for blinding) such that V = U^y, without revealing the secret exponent y to the client, where U = ∏i ai^ri and V = ∏i bi^ri are computed by the client. The values ai are provided by the client, and the values bi are provided by the server. Similarly, the server S can prove to the client its knowledge of the key y' such that b1 = a1^y', without revealing the secret exponent y' to the client, where a1 is provided by the client, and b1 is provided by the server. y = y' can be proven using the proof technique for equality of discrete logarithms, i.e. log_U(V) = log_a1(b1).
Exponents ri are uniformly randomly chosen by the client from an exponentially (in the security parameter) large space, after the values of ai, bi and y = log_a1(b1) are fixed. Let g be a generator in the group and let wi = bi · ai^(−y). For each wi, there exists a unique exponent ei such that wi = g^ei. This results in V · U^(−y) = ∏i wi^ri = g^(∑i ri·ei).
That is, for a fixed (unknown) vector (e1, ..., en), the client chooses a vector (r1, ..., rn) with each ri randomly chosen from [1, 2^l]; the inner product between these two vectors is zero (i.e. the two vectors are perpendicular) with non-negligible probability only if the vector (e1, ..., en) is a zero vector. Thus bi = ai^y for each i.
The zero-knowledge nature of the ZKP protocol for n elements can be proved in the same way as in "Ivan Damgard: On Σ-Protocols, http://www.cs.au.dk/~ivan/Sigma.pdf, Section 5" and "https://courses.cs.ut.ee/MTAT.07.003/2016_fall/uploads/Main/0902-proof-of-knowledge-for-double-exponent.pdf".
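The proof of steps (7)-(8) is, in effect, a Chaum–Pedersen-style proof of equality of discrete logarithms. A minimal sketch, reusing the toy group (P, Q), F and hash_to_group from the earlier sketch (all names are illustrative assumptions), is:

```python
import secrets

def prove_and_verify(a_list, b_list, y):
    """One run of steps (7)-(8): S proves knowledge of y with b_i = a_i^y."""
    n = len(a_list)
    # (7a) C picks random exponents r_i
    r = [secrets.randbelow(Q - 1) + 1 for _ in range(n)]
    # (7b) aggregate U = prod a_i^{r_i}, V = prod b_i^{r_i}
    U = V = 1
    for ai, bi, ri in zip(a_list, b_list, r):
        U = U * pow(ai, ri, P) % P
        V = V * pow(bi, ri, P) % P
    # (7c) S commits with a random s: T1 = U^s, T2 = a_1^s
    s = secrets.randbelow(Q - 1) + 1
    T1, T2 = pow(U, s, P), pow(a_list[0], s, P)
    # (7e) C challenges with c; (7f) S responds with t = s - c*y mod Q
    c = secrets.randbelow(Q - 1) + 1
    t = (s - c * y) % Q
    # (8) C recomputes T'1 = U^t * V^c and T'2 = a_1^t * b_1^c and compares
    T1_check = pow(U, t, P) * pow(V, c, P) % P
    T2_check = pow(a_list[0], t, P) * pow(b_list[0], c, P) % P
    return T1_check == T1 and T2_check == T2

x, y = 123, 456
a_list = [F(x, hash_to_group(i)) for i in ("S1187561A", "S7765432B")]
b_list = [F(y, a) for a in a_list]      # S's blinding with the central key y
assert prove_and_verify(a_list, b_list, y)
```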
A PII to which a dataset in a database belonging to the system 100 of Figure 1 is attributable may not always be atomic. The PII may instead contain a few segments. An example is IPv4 addresses, where an address (e.g. 192.168.1.1) can be divided into four segments, each matching an octet of its dotted-format representation (this can be applied to IPv6 addresses in a similar manner). The segments of an IP address are crucial for network analysis, for instance, to ascertain whether a group of IP addresses originated from an identical subnet. Losing this structure means the dataset loses utility and is of limited value. Any method that randomises (or masks) such PII must be able to preserve its structure. An approach that directly randomises each segment, treating every segment as a single identity attribute, is inefficient: if a PII has W segments, randomisation and blinding would have to be performed on all W segments in order to preserve its structure.
A more efficient, simpler and effective protocol is described in the Structure Preserving Data Integration protocol below. The client chooses the segments to be randomised and creates a list L containing identifiers of these selected segments, which serve as initial segments of identification data for the Structure Preserving Data Integration protocol. Only the segments in the list are randomised and blinded, in stages. The intuition is that in the majority of scenarios the resulting dataset of the first match (or intersection) would contain fewer records compared to the original datasets.
With reference to Figures 2 and 3, at least one of the identifiers IDi stored in the databases 106a and 106n comprises such an initial segment of identification data. The consolidation of step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the initial segments of identification data that match. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the initial segments of identification data that match is obtained.
A second chosen or further segment of the identification data may be processed next, but on the smaller dataset originating from the first intersection. That is, the mapper module 102 locates the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the further segments of identification data that match is obtained. Subsequent mapping can be carried out for the remaining segments, depending on how fine-grained the matching is desired to be. Any key homomorphic encryption (or commutative encryption) scheme can be deployed, such as the one discussed in international application no. PCT/SG2017/050575. In order for the server in the central platform 104 to perform the intersection in stages, the technique discussed above for the Labelled Data Integration Protocol is adopted, as shown in Step (6) of the Structure Preserving Data Integration protocol discussed below. This is so that the server is able to learn the set of matched PIIs for every stage.
The Structure Preserving Data Integration protocol is outlined below:
Key Setup
(1). A user C generates a random key x and the server S in the central platform 104 generates a random key y.
Generalization & Randomization
(2). C divides the identity attribute into W segments: IDi = (IDi,1, ..., IDi,W).
(3). C also prepares an ordered list L = (lh, ..., lm), h, m ∈ [1, W], h ≤ m.
(4). C performs randomisation on each IDi of the dataset:
(a) for each IDi,w, compute ai,w = Fx(IDi,w), w ∈ [h, m];
(5). C submits to S the list L and the randomised dataset {(ai,h, ..., ai,m)}, i ∈ [1, n]. The list L is also given to other participating clients.
Blinding & Matching
(6). S creates a unique tag ki for each record i in the dataset, such that ki indexes the randomised record (ai,h, ..., ai,m).
(7). For each lv in L, v ∈ [h, m], S and C jointly perform matching:
(a) S runs Blinding & Proving (see Zero Knowledge Proof of Correctness Protocol discussed above) on inputs (ki, ai,lv), for all i ∈ [1, n];
(b) S returns {(ki, bi,lv)}, i ∈ [1, n], together with the proof p;
(c) C verifies p. If p is valid, C performs the following (otherwise C aborts):
(i) for each bi,lv in the tuple (i ∈ [1, n]), extract (ki, bi,lv);
(d) C runs the Intersection step (see the next heading below) to find matched blinded segments;
(e) S retrieves records for tags in {ki}, i ∈ [1, r];
(f) S restricts the working dataset to these matched records for all i ∈ [1, r], and assigns n := r.
(8). S returns the final matched set {(ki, bi,lv)} to C for verification and integration.
Intersection
(7a). Given two sets of tuples {(ki, bi,lv)} from C1 and {(k'j, b'j,lv)} from C2, perform intersection such that:
• if bi,lv = b'j,lv for some i ∈ [1, n] and j ∈ [1, n'], store ki;
• discard all non-matching tuples.
C1 sends the set of matching tags {ki} to S, for i ∈ [1, r], r ≤ n.
The Structure Preserving Data Integration protocol works for any PII that can be divided into segments in order to preserve its structure. Two use cases of the structure-preserving data integration protocol are discussed below.
IP Addresses
Using letters, an IPv4 address may be represented in dotted format as (A.B.C.D). Every letter represents an octet of the address (e.g. A := 192).
In order to preserve the structure of the address, the IP address is divided into four segments as H(A), H(A.B), H(A.B.C) and H(A.B.C.D), where H(.) is a cryptographic hash function. This representation preserves prefixes of an IP address and is the only way that preserves the structure of an IP address without leaking information. Based on the above representation, the client, for example, might wish to find a mapping on segment H(A.B), and at a later stage, H(A.B.C). The client C first randomises all four segments and generalises the non-identity attributes, in order to protect against the server learning the segments of the IP addresses. The client C then creates a list L that contains two identifiers, one for H(A.B) and another for H(A.B.C). Next, the client C and the server S choose the identifier of H(A.B), i.e. an initial segment, from the list, retrieve H(A.B), and jointly randomise and blind it to find the matching records in the datasets contributed by other participating clients. Then, in order to match and integrate records on H(A.B), the client C and the server S need only randomise and blind this initial segment. At a later stage, in order to match based on H(A.B.C), the client C and the server S would only need to map from the results of matching H(A.B), which may contain fewer records compared to the full dataset.
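The nested-prefix hashing can be sketched as follows (a minimal illustration; in the protocol each hash would additionally be randomised with Fx before leaving the premises):

```python
import hashlib

def ip_segments(addr):
    """Return H(A), H(A.B), H(A.B.C), H(A.B.C.D) for a dotted IPv4 address."""
    octets = addr.split(".")
    return [hashlib.sha256(".".join(octets[:i + 1]).encode()).hexdigest()
            for i in range(len(octets))]

# Two addresses in the same /16 share their first two segment hashes:
a, b = ip_segments("192.168.1.1"), ip_segments("192.168.7.9")
assert a[0] == b[0] and a[1] == b[1] and a[2] != b[2]
```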
Solutions exist that use a lookup table of randomised IP addresses or a key that produces pseudonymized IP addresses. The lookup table or key must be shared between the parties that perform pseudonymization. Such mechanisms are not suitable since each participant will be able to learn the plain datasets given the lookup table or key.
Multi-column IDs
In addition to IP addresses, it is possible that multiple columns in a database are used to represent an identity attribute, e.g. name, email and phone number. Instead of randomising all columns as one attribute, coarse-grain or fine-grain randomisation of multiple columns is used. As in the previous IP addresses use case, a client may define each column as a segment and create the list L containing the segments that are to be processed. In more detail, and with reference to Figures 2 and 3, when at least one of the identifiers IDi comprises different identification data with at least one in common with another of the identifiers, the consolidation of step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases 106a and 106n storing the common identification data.
Leakage Profile
The leakage of information due to in-stage interactions between the client C and the server S was modelled. Specifically, the server S and the client C learn the sets of matched randomised PIIs and their sizes for every stage of matching. This is because the clients C submit ki to the server at the end of each iteration. It was found that, given the extra leakage of information, the server S and the client C are not able to learn any information of the underlying identity attribute. An interactive stateful leakage function Lf was defined which, given the dataset as input, outputs n the size of the dataset, the number of segments W and the list of segments L to be processed. Lf also outputs an empty list Imatch. For every stage lv of privacy-preserving matching of two or more datasets, Lf registers the size rlv of the subset of the matched attributes, and the set of matched randomised IDs in Imatch, for i ∈ [1, r]. This means a leakage profile of (n, L, Imatch), where the server learns the size and the matched randomised IDs of each intersection.
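A toy rendering of this stateful leakage function follows; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class LeakageProfile:
    n: int                # size of the dataset
    W: int                # number of segments per identity attribute
    L: list               # ordered list of segments to be processed
    Imatch: list = field(default_factory=list)   # matched randomised IDs per stage

    def register_stage(self, matched_ids):
        """Record r_lv (the matched subset size) and the matched randomised IDs."""
        self.Imatch.append((len(matched_ids), list(matched_ids)))

profile = LeakageProfile(n=1000, W=4, L=["H(A.B)", "H(A.B.C)"])
profile.register_stage(["b_3", "b_17"])   # stage 1: the server sees size 2 and the IDs
```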
Leakage from intersection of segments
The segments (e.g. in the case of a repetitive 192 or 192.168) are not unique, given that key homomorphic encryption is deterministic. The server S is able to learn how many IDs have the same segments. This information may also enable the server S to learn that, for example, two datasets may be of the same domain, by observing the received ki, if the lists L submitted by the two participants are the same. A potential mitigation is to adopt a prefix-preserving pseudonymisation mechanism that partitions the set of IP addresses, performs a migration to hide repetitive segments and replicates the IP addresses into a multi-view setting, so that the recipient has no idea which view is the real set of IP addresses.
Figure 4 shows operation flow for the query analyser 110 of Figure 1 when executing a distributed query protocol.
The distributed query protocol divides a query into sub-queries that are directed to the datasets of the respective organisations. Only record labels and aggregated results will be sent out from the datasets. The results will be merged based on mapped record labels. The distributed query protocol ensures that raw data never leaves the respective organisations' premises.
The distributed query protocol, executed by the query analyser 110, interrogates fields under which data is organised in each of the databases 106a and 106n. In one implementation, the mapper module 102 receives the schema of fields together with the multiply encrypted identifiers, such as in step 305 of Figure 3. When the query analyser 110 receives a dataset query, the query analyser 110 determines which of the schema of fields in the mapper module 102 has a format corresponding to search parameters of the received dataset query. The query analyser 110 is also able to divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters. The sub-queries are then distributed amongst the databases 106a and 106n, so that each of the databases 106a and 106n receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database 106a and 106n. The query analyser 110 is also configured to: merge responses from each of the databases 106a and 106n to their received sub-query and return the merged response to the dataset query received by the query analyser 110.
The steps for performing a distributed query are described below with reference to Figure 4.
In step 401, a customer sends a query on a dataset (a "dataset query").
In step 402, the Server Frontend (FE) 114 posts the dataset query to the Server API 116. The server API 116 relays the dataset query to the query analyser (QA) 110. In step 403, the QA 110 breaks down the dataset query into sub-queries. The sub-queries are sent to user apps for relay to the databases 106a and 106n.
In step 404, each of the user apps queries their on-premise database 106a and 106n. In step 405, the user apps send results back to the QA 110.
In step 406, the QA 110 merges results from the user apps. In step 407, the QA 110 sends the merged results back to the Server API 116 for relay back to the Server FE 114. In step 408, the Server FE 114 displays the merged results.
Tables 7 and 8 below each provide an example of data organised under a schema of fields found in a database.
Table 7: Sample schema of fields
Table 8: Sample schema of fields
Table 9 shows an example of a dataset query and its division into sub-queries based on the schema of fields shown in Table 7 and Table 8.
Table 9: Sample dataset query and sub-queries
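For illustration, reusing the schemas of the databases 106a and 106n from the Figure 2 example (the snake_case field names and the split_query function are assumptions of this sketch), a dataset query may be divided into sub-queries as follows:

```python
SCHEMAS = {
    "106a": {"age_group", "gender", "cost"},
    "106n": {"length_of_stay", "diagnosis_code", "hospitalisation_cost"},
}

def split_query(parameters):
    """Divide a dataset query into sub-queries keyed by destination database."""
    sub_queries = {}
    for db, fields in SCHEMAS.items():
        subset = {k: v for k, v in parameters.items() if k in fields}
        if subset:
            sub_queries[db] = subset
    return sub_queries

# A query spanning both schemas is split into one sub-query per database:
print(split_query({"age_group": "30-39", "diagnosis_code": "J18"}))
# {'106a': {'age_group': '30-39'}, '106n': {'diagnosis_code': 'J18'}}
```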
Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module 112. The machine learning module allows a data consumer to run analysis on the databases 106a and 106n that reside on premise. The distributed machine learning (DML) protocol allows the data consumer to request a machine learning model to be computed on datasets residing in different organisations.
The machine learning module 112 is trained using vector values of data belonging to identifiers stored in the databases 106a and 106n. Recalling from Figures 2 and 3 that common identification data serves to integrate datasets, the identifiers used to train the machine learning module 112 are those derived from matching multiply encrypted identifiers consolidated in the mapper module 102. Parameters used by the machine learning module 112 for outcome prediction are updated based on an aggregation of the received vector values. The updating of the parameters is reiterated until convergence criteria are met. The machine learning module 112 is then configured to, after having been trained, reference data from one or more of the databases 106a and 106n when responding to received queries.
The steps for performing distributed machine learning are described below.
In step 501, a Customer selects machine learning model type to be trained on datasets.
In step 502, the Server Frontend (FE) 114 posts the selected machine learning model to the Server API 116 which relays the request to the Distributed ML module (DML) 112.
In step 503, the DML module 112 requests dataset dimensions and response variable from user apps. Each of the user apps queries its on-premise database 106a and 106n;
In step 504, the user apps relay back dataset dimensions and response variable to the DML module 112;
In step 505, the DML module 112 initialises parameter values used by the machine learning module 112 for outcome prediction and sends the relevant parameter values to each user app, along with machine learning model type information;
In step 506, each user app computes local aggregated results based on the model type and uploads the aggregated results to the DML module 112;
In step 507, the DML module 112 uses the local aggregated results from the donor apps to update the parameter values. If the convergence criteria are satisfied or the maximum number of iterations has been reached, it signals protocol finish to the user apps and performs step 509 below. Otherwise, the DML module 112 relays the updated parameter values to the user apps;
In step 508, the user apps receive updated parameters. If protocol finish signal has not been received, step 506 is repeated;
In step 509, the DML module 112 relays back model parameters to the Server API 116;
In step 510, the Server Frontend 114 displays results.
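Steps 503 to 508 follow the shape of gradient aggregation over on-premise partitions. A minimal sketch with a linear model follows; the model choice, names and learning rate are illustrative assumptions, not the patented algorithm:

```python
import numpy as np

def local_aggregate(X, y, w):
    """Step 506: each user app computes its local gradient aggregate on premise."""
    return X.T @ (X @ w - y), len(y)

def train(parties, dim, lr=0.1, max_iters=1000, tol=1e-6):
    w = np.zeros(dim)                                  # step 505: initialise parameters
    for _ in range(max_iters):
        grads, counts = zip(*(local_aggregate(X, y, w) for X, y in parties))
        step = lr * sum(grads) / sum(counts)           # step 507: aggregate and update
        w -= step
        if np.linalg.norm(step) < tol:                 # convergence criteria met
            break
    return w

# Two on-premise parties holding rows of a joint dataset (matched by labels):
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 3)), rng.normal(size=(60, 3))
true_w = np.array([1.0, -2.0, 0.5])
w = train([(X1, X1 @ true_w), (X2, X2 @ true_w)], dim=3)
assert np.allclose(w, true_w, atol=1e-3)
```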
Returning to Figure 1, the system 100 provides a data contributor with an access control mechanism and a configurable risk assessment module, so that the data contributor may decide whether to reveal certain attributes of its database for query or learning by the data consumer. These mechanisms and the configurable risk assessment module can also be used to set the price plan for the data consumer to perform query or analysis on more sensitive information. The distributed query and distributed machine learning steps can be performed independently from the intersection mechanism.
Upon determining a common identity from its multiply encrypted form, the mapper module 102 is configured to query a registration directory storing an address of each of the databases 106a and 106n. With reference to step 309 of Figure 3, this allows the mapper module 102 to obtain the address of each of the databases 106a and 106n to which matching results should be sent.
The system 100 of Figure 1 uses multiple encryption that comprises storage encryption and a central encryption. The removal of the storage encryption results in matching to be performed on identifiers encrypted by the central encryption. This central encryption is performed using a central key y at a server from which the mapper module 102 receives the multiply encrypted identifiers.
The mapper module 102 is configured to return the results of the match through a server from which the mapper module 102 receives the multiply encrypted identifiers. In one implementation, the mapper module 102 is integrated with the server. In another implementation, the mapper module 102 is integrated into a client terminal that integrates one or more of the databases 106a and 106n. It will be appreciated that integration refers to these components being part of a designated network and does not necessarily require for them to be internally situated.
In summary, the present invention provides a system for privacy-preserving data sharing and integration, which enables organisations to match, query and analyse their datasets without the datasets leaving the premises of the organisations. The system is initiated by a user registering to the system and an administrator managing the list of users and the applications. A data integration protocol enables matching of personally identifiable information (PII) in a privacy-preserving manner. Users (who can be data donors and/or data consumers) upload and integrate their datasets by performing the data integration protocol together with a central platform to randomise and map/intersect their PII information using the mapper module of the user and the server application. Once mapped, the information (or labels) on the sets of matched PIIs are returned to each respective user. Coupled with the data integration protocol is a zero-knowledge proof protocol that allows verification of the correctness of the attributes processed by the server during data integration. Any modification by the server on the attributes will be detected through the zero-knowledge proof protocol.
With the information of the matched PIIs, distributed query and distributed machine learning protocols enable computations on the on-premise datasets. The data integration protocol is flexible in that it caters for a PII comprising more than one segment (e.g. IP addresses) or different identification data (e.g. multi-column IDs), using structure preserving techniques. The data integration protocol is also labelled in a way that, even when the datasets are on premise, organisations are able to retrieve or refer to records of matched PIIs.
While this invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes can be made and equivalents may be substituted for elements thereof, without departing from the spirit and scope of the invention. In addition, modification may be made to adapt the teachings of the invention to particular situations and materials, without departing from the essential scope of the invention. Thus, the invention is not limited to the particular examples that are disclosed in this specification, but encompasses all embodiments falling within the scope of the appended claims.

Claims

1. A system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to:
receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption;
receive a storage key used to perform storage encryption at each database storing one or more of the identifiers;
remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key;
consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and
transmit results of the match.
2. The system of claim 1, wherein the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
3. The system of claim 1 or 2, wherein when at least one of the identifiers comprises an initial segment of identification data, the consolidation includes the multiply encrypted identifiers derived from the initial segments of identification data that match; and the results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained.
4. The system of claim 3, wherein when at least one of the identifiers comprises a further segment of the identification data, the mapper module is further configured to:
locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data; and
return the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
5. The system of any one of the preceding claims, wherein when at least one of the identifiers comprises different identification data with at least one in common with another of the identifiers, the consolidation includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data; and the results of the match are returned to each of the databases storing the common identification data.
6. The system of any one of the preceding claims, wherein the results of the match indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases.
7. The system of any one of the preceding claims, wherein the mapper module is further configured to:
query a registration directory storing an address of each of the databases; and
obtain the address of each of the databases to which the matching results should be sent.
8. The system of any one of the preceding claims, wherein the multiple encryption comprises the storage encryption and a central encryption, whereby the removal of the storage encryption results in the matching being performed on identifiers encrypted by the central encryption, and wherein the central encryption is performed using a central key at a server from which the mapper module receives the multiply encrypted identifiers.
9. The system of claim 8, further comprising the server, wherein the server is configured to: establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
10. The system of claim 9, wherein the server is further configured to:
generate a random element (s);
receive a random element (c) associated with the database seeking to authenticate the server; and
compute a proof of the zero knowledge protocol based on the random element (s) generated by the server, the received random element (c) associated with the database and the central key.
11. The system of claim 10, wherein validation of the proof of the zero knowledge protocol comprises having the database seeking to authenticate the server being configured to:
receive from the server:
a subset of the multiply encrypted identifiers derived from identifiers stored at the database, the subset having been processed by the random element generated by the server; and
the proof of the zero knowledge protocol;
process the stored identifiers from which the subset is obtained with the proof to obtain evaluation parameters; and
validate the proof of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
12. The system of any one of the preceding claims, wherein each of the multiply encrypted identifiers is accompanied with a schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived, wherein the system further comprises a query analyser configured to:
interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query.
13. The system of claim 12, wherein the query analyser is further configured to:
divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters; and
distribute the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
14. The system of claim 13, wherein the query analyser is further configured to:
merge responses from each of the databases to their received sub-query; and
return the merged response to the dataset query.
15. The system of any one of the preceding claims, wherein the system further comprises a machine learning module configured to, when being trained:
receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived;
update parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and
reiterate the updating of the parameters until a convergence criterion is met.
16. The system of claim 15, wherein the machine learning module is configured to, after having been trained, reference data from one or more of the databases when responding to received queries.
17. The system of any one of the preceding claims, wherein the mapper module is further configured to return the results of the match through a server from which the mapper module receives the multiply encrypted identifiers.
18. The system of claim 17, wherein the mapper module is integrated with the server.
19. The system of any one of claims 1 to 17, wherein the mapper module is integrated into a client terminal that integrates one or more of the databases.
20. The system of any one of the preceding claims, wherein each of the multiply encrypted identifiers is accompanied with a label for locating the identifier within the database storing the identifier from which the multiply encrypted identifier is derived and wherein the mapper module is further configured to:
consolidate the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels; and
return each of the common labels to each of the databases storing one or more of the identifiers located by the common label.
21. The system of any one of claims 1 to 17, further comprising client terminals with each incorporating the mapper module, wherein the mapper module of the client terminal is further configured to:
demarcate a label for each of the matching multiply encrypted identifiers as a common label; and
transmit the common labels to a server to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, wherein the datasets are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifiers are derived.
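For intuition on the verification described in claims 9 to 11, the following is a minimal Schnorr-style sketch, under the assumption (carried over from the matching sketch above) that central encryption is exponentiation by the central key. The one-shot interaction ordering and all names are simplifications for exposition; the specification's protocol may differ.

```python
# Schnorr-style proof that the server knows the central key k behind
# enc = h^k mod P, without revealing k. Illustrative sketch only; the
# commitment/challenge interaction is collapsed into single calls.
import secrets

P = 2 ** 127 - 1
ORDER = P - 1

def prove(h: int, central_key: int, challenge: int):
    """Server: commit with random element (s), then answer the challenge (c)."""
    s = secrets.randbelow(ORDER)
    commitment = pow(h, s, P)                        # identifier processed by s
    response = (s + challenge * central_key) % ORDER
    return commitment, response

def verify(h: int, enc: int, challenge: int, commitment: int, response: int) -> bool:
    """Database: recompute the evaluation parameter and compare."""
    return pow(h, response, P) == (commitment * pow(enc, challenge, P)) % P

k = secrets.randbelow(ORDER - 2) + 2   # stands in for the central key
h = 5                                  # stands in for a hashed identifier
enc = pow(h, k, P)                     # its central encryption, held by the database
c = secrets.randbelow(ORDER)           # random element (c) chosen by the database
a, z = prove(h, k, c)
assert verify(h, enc, c, a, z)         # valid iff h^z == a * enc^c (mod P)
```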

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201903227S 2019-04-11
SG10201903227S 2019-04-11

Publications (1)

Publication Number Publication Date
WO2020209793A1 (en) 2020-10-15

Family

ID=72752224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050210 WO2020209793A1 (en) 2019-04-11 2020-04-06 Privacy preserving system for mapping common identities

Country Status (1)

Country Link
WO (1) WO2020209793A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005109291A2 (en) * 2004-05-05 2005-11-17 Ims Health Incorporated Data record matching algorithms for longitudinal patient level databases
US20090150362A1 (en) * 2006-08-02 2009-06-11 Epas Double Blinded Privacy-Safe Distributed Data Mining Protocol
US20110060903A1 (en) * 2008-03-19 2011-03-10 Takuya Yoshida Group signature system, apparatus and storage medium
US20160147945A1 (en) * 2014-11-26 2016-05-26 Ims Health Incorporated System and Method for Providing Secure Check of Patient Records
WO2019098941A1 (en) * 2017-11-20 2019-05-23 Singapore Telecommunications Limited System and method for private integration of datasets

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183765A (en) * 2020-10-30 2021-01-05 浙江大学 Multi-source multi-modal data preprocessing method and system for shared learning
CN112492586A (en) * 2020-11-23 2021-03-12 中国联合网络通信集团有限公司 Encryption transmission scheme optimization method and device
CN112492586B (en) * 2020-11-23 2023-05-23 中国联合网络通信集团有限公司 Encryption transmission scheme optimization method and device
WO2023134055A1 (en) * 2022-01-13 2023-07-20 平安科技(深圳)有限公司 Privacy-based federated inference method and apparatus, device, and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20787417

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20787417

Country of ref document: EP

Kind code of ref document: A1