WO2020209793A1 - Privacy preserving system for mapping common identities - Google Patents

Privacy preserving system for mapping common identities

Info

Publication number
WO2020209793A1
Authority
WO
WIPO (PCT)
Prior art keywords
identifiers
databases
server
match
multiply encrypted
Application number
PCT/SG2020/050210
Other languages
French (fr)
Inventor
Shuwei CAO
Geong Sen POH
Hoon Wei Lim
Peck Yoke LEONG
Jia Xu
Varsha CHITTAWAR
Original Assignee
Singapore Telecommunications Limited
Application filed by Singapore Telecommunications Limited filed Critical Singapore Telecommunications Limited
Publication of WO2020209793A1 publication Critical patent/WO2020209793A1/en

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/008 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 - Protecting data
    • G06F 21/62 - Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 - Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 - Protecting personal data, e.g. for financial or medical purposes
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/14 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using a plurality of keys or algorithms
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/32 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L 9/3218 - Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using proof of knowledge, e.g. Fiat-Shamir, GQ, Schnorr, or non-interactive zero-knowledge proofs

Definitions

  • the present invention relates to a system for mapping common identities stored in separate databases.
  • a system for mapping common identities stored in separate databases comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.
  • the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
  • the results of the match are sent to a server storing datasets attributable to the matching multiply encrypted identifiers.
  • the datasets are contributed by the one or more databases storing the identifiers from which the multiply encrypted identifiers are derived. The sending of the results of the match to the server is to link the stored datasets that are attributable to the matching multiply encrypted identifier.
  • At least one of the identifiers may comprise an initial segment of identification data.
  • the consolidation of the multiply encrypted identifiers by the mapper module then includes those derived from the initial segments of identification data that match.
  • the results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained.
  • An extension of this first scenario has at least one of the identifiers comprise a further segment of the identification data, whereby the mapper module is configured to locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data.
  • the mapper module then returns the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
  • At least one of the identifiers comprises different identification data with at least one in common with another of the identifiers.
  • the consolidation of the multiply encrypted identifiers by the mapper module then includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data.
  • the results of the match are returned to each of the databases storing the common identification data.
  • the server is configured to establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
  • Each of the multiply encrypted identifiers may be accompanied with schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived.
  • the system has a query analyser configured to interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query.
  • the query analyser divides the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters and distributes the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
  • the system has a machine learning module configured to, when being trained: receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived. The machine learning module then updates parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and reiterates the updating of the parameters until convergence criteria is met. After having been trained, the machine learning module is configured to reference data from one or more of the databases when responding to received queries.
  • Figure 1 shows a schematic of the system architecture in which data integration in accordance with various embodiments of the present invention is deployed.
  • Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities, into a database of the system of Figure 1.
  • FIG. 3 shows operation flow for the data integration protocol that is implemented by the mapper module of Figure 1.
  • Figure 4 shows operation flow for the query analyser of Figure 1 when executing a distributed query protocol.
  • Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module of Figure 1.
  • the present application finds relevance for organisations seeking to share, combine and jointly process their datasets. Having access to data from other organisations provides insight to make better decisions or create a greater social impact, since there is a limit to the insights that can be extracted from dataset(s) belonging to a single organisation.
  • the objective of the present application is to enable organisations to share and generate insights from joint datasets in a privacy-preserving manner.
  • the present application adopts an approach where joint datasets for analysis are achieved while data remains on the organisations' premises. This ensures that data does not leave an organisation within which it is stored, thus preserving data privacy.
  • a mapper module is adopted which identifies the databases that contain the data upon which joint analysis can be based.
  • the mapper module uses identification data commonality as one criterion to determine whether data in separate databases can be used for joint analysis, i.e. if an organisation stores data attributable to a common entity (e.g. the same party may have patient records stored with a hospital database, policy records stored with an insurance company database and past expenditure records stored with a credit card company), such data stored in different organisations, once recognised, becomes a joint dataset usable for joint analysis.
  • the mapper module co-ordinates which of its registered databases stores data attributable to a common entity, whereby an identifier used by the common entity forms the shared attribute to locate the data across the registered databases. As such, the data can be traced through such an identifier.
  • Each identifier comprises one or more of identification data (e.g. any type of PII (personally identifiable information), such as national registration identity number, social security number, foreign identification number, telephone contact numbers, email address, residence address, name, bank account number, credit card primary account number) provided by the common entity when registering with each of these databases to establish data ownership.
  • the identifier may also comprise identification data unique to or generated by each of the databases (e.g. IP address, serial number, IMEI (international mobile equipment identity) number, virtual identification number).
  • when the mapper module discovers that an identifier matches, i.e. the same identifier is found in several databases, the mapper module records the matching identifier in a database used to store matching identifiers.
  • the mapper module performs identity matching on encrypted identifiers, rather than in their plain-text form.
  • when each of the databases transmits its stored identifiers for the mapper module to determine whether each has a common identity stored in another database, the identifiers are already in randomised format, having been encrypted at their respective database.
  • Such encryption is through the use of an encryption key generated at each of the respective databases. This key is called a storage key, a dataset key or a client key.
  • the encrypted identifiers undergo one or more layers of encryption before they are received by the mapper module, so that the mapper module receives multiply encrypted identifiers.
  • One reason for using multiple encryption is that the storage key used by each database to encrypt its transmitted identifiers is different from the storage key used by another database (e.g. a first database uses storage key x_1, while a second database uses storage key x_2), so that each database is able to independently encrypt its transmitted identifiers without needing to communicate and share a key a priori, as many existing techniques require.
  • the mapper module is unable to perform matching if in receipt of identifiers encrypted with just their respective storage keys alone. Receiving the storage key from each of the databases to decrypt the singly encrypted identifiers would then reveal the identifiers, running contrary to the purpose of privacy preservation.
  • Multiple encryption achieves an outer layer of encryption to the inner encryption provided by the storage key. When the mapper module removes the inner encryption from receiving the respective storage key, the identifiers are still encrypted by the outer encryption, so that the identifiers are still masked to the mapper module, thereby achieving privacy preservation.
  • the multiple encryption used is homomorphic or commutative in nature, i.e. encryption that allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext.
  • Each of these additional layers of encryption may be applied by an intermediary that routes the encrypted identifiers to the mapper module. For instance, a server uses its own key (called a central key or a server key) to apply an additional layer of encryption to produce the multiply encrypted identifiers. As such, the records kept in the mapper module are encrypted forms of the identifiers.
  • the mapper module receives the storage key used to encrypt each identifier stored at each database.
  • the storage encryption is removed from each of the multiply encrypted identifiers, whereby matching is then performed on the multiply encrypted identifiers following the storage encryption removal (called "matching multiply encrypted identifiers").
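
The patent leaves the concrete commutative or key-homomorphic scheme open (PCT/SG2017/050575 is cited later as one option). Purely as an illustrative sketch, the layering and stripping described above can be modelled in Python with Pohlig-Hellman style exponentiation, where encryption layers commute and removing the storage key leaves the central layer intact; the toy 10-bit prime and all names here are assumptions for readability, not the patented construction.

```python
import hashlib
import secrets

# Toy safe prime p = 2q + 1 (q prime). A real deployment would use a large,
# vetted group (e.g. a 2048-bit MODP group or an elliptic curve); this tiny
# prime exists purely to keep the sketch self-contained and runnable.
P = 1019
Q = 509  # order of the quadratic-residue subgroup

def hash_to_group(identifier: str) -> int:
    """Hash an identifier, then square mod p to land in the order-q subgroup."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(digest % P, 2, P)

def keygen() -> int:
    """A storage key x (or central key y) is a random exponent in [1, q-1]."""
    return secrets.randbelow(Q - 1) + 1

def encrypt(element: int, key: int) -> int:
    """F_key(element): exponentiation commutes, so F_y(F_x(m)) == F_x(F_y(m))."""
    return pow(element, key, P)

def strip_layer(element: int, key: int) -> int:
    """Remove one encryption layer by raising to the inverse exponent mod q."""
    return pow(element, pow(key, -1, Q), P)

# Commutativity check: encrypt-with-x then y equals encrypt-with-y then x.
m = hash_to_group("S1187561A")
x, y = keygen(), keygen()
assert encrypt(encrypt(m, x), y) == encrypt(encrypt(m, y), x)
# Stripping the storage key x from F_y(F_x(m)) leaves F_y(m), still masked by y,
# so the mapper can match identifiers without ever seeing them in plaintext.
assert strip_layer(encrypt(encrypt(m, x), y), x) == encrypt(m, y)
```
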
  • The results of the match are then transmitted so as to facilitate the location of joint datasets.
  • results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived, such results indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases.
  • Data attributable to each of the matching identities is then available for integration, i.e. usable to become a dataset for joint analysis. It will thus be appreciated that data integration effected by the mapper module does not result in data being combined in a central location. Rather, the data integration effected by the mapper module points to the databases containing the data that can be used for joint dataset analysis.
  • identifiers and datasets are two distinctly different types of data. Identifiers are used to establish ownership of content found in a dataset.
  • the mapper module is not restricted to perform matching on multiply encrypted identifiers that are derived from singular identification data contained within a column entry (e.g. any one of national registration identity number, telephone contact number, email address, credit card primary account number or IP address in complete form).
  • the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from a segment of identification data (e.g. segment "3456" from a national registration identity number "S1234567G"). This is advantageous in situations where only a selected segment of identification data is required to locate a common identity from other identities stored across all databases. In situations where a further segment of the identification data is required, matching can be performed on the smaller result set obtained from matching the initial segments, as described below.
  • the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from different identification data (e.g. two or more of national registration identity number, telephone contact number, email address, credit card primary account number and IP address), where at least one of the identification data is present in another multiply encrypted identifier.
  • a first multiply encrypted identifier is derived from national registration identity number, telephone contact number and email address.
  • a second multiply encrypted identifier is derived from telephone contact number, credit card primary account number and IP address. The first multiply encrypted identifier is considered to match the second multiply encrypted identifier if the telephone contact number is the same.
  • a third multiply encrypted identifier may also be present, derived from credit card primary account number and virtual identification number.
  • the third multiply encrypted identifier is considered to match the first and second multiply encrypted identifiers if the credit card primary account number, found in the second multiply encrypted identifier and the third multiply encrypted identifier, matches. While the first multiply encrypted identifier does not have the credit card primary account number, the first multiply encrypted identifier is considered to match the third multiply encrypted identifier because of the linkage brought about by the second multiply encrypted identifier.
  • the first multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common telephone contact number); while the third multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common credit card primary account number).
  • This other implementation thus allows for matching of identifiers comprising multiple different identification data, i.e. identification data contained over several column entries in a database.
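
The patent does not prescribe how this transitive linkage is computed; one natural realisation is a union-find over the encrypted field values, merging any two identifiers that share an encrypted value for a field. A minimal hypothetical sketch, with placeholder strings standing in for multiply encrypted data:

```python
# Identifiers that share any (encrypted) identification datum are merged into
# one identity group via union-find, so linkage propagates transitively.
from collections import defaultdict

def link_identities(identifiers: list[dict]) -> list[set[int]]:
    parent = list(range(len(identifiers)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        parent[find(i)] = find(j)

    # Any two identifiers sharing an encrypted value for the same field match.
    seen: dict[tuple[str, str], int] = {}
    for idx, ident in enumerate(identifiers):
        for field, value in ident.items():
            key = (field, value)
            if key in seen:
                union(idx, seen[key])
            else:
                seen[key] = idx

    groups = defaultdict(set)
    for idx in range(len(identifiers)):
        groups[find(idx)].add(idx)
    return list(groups.values())

# The three identifiers from the example: 0 and 1 share a phone number,
# 1 and 2 share a card number, so all three land in one identity group.
ids = [
    {"nric": "enc_a", "phone": "enc_p", "email": "enc_e"},
    {"phone": "enc_p", "card": "enc_c", "ip": "enc_i"},
    {"card": "enc_c", "virtual_id": "enc_v"},
]
print(link_identities(ids))  # [{0, 1, 2}]
```
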
  • the protocol used by the mapper module finds a match in a privacy-preserving manner among the encrypted identifiers contributed by each of the databases.
  • the multiply encrypted identifiers having a common identification are considered to match.
  • the mapper module is further configured with a zero-knowledge protocol that allows verification of the correctness of the attributes (i.e. the multiply encrypted identifiers) processed during data integration.
  • the zero-knowledge protocol allows the database to verify the correctness of the blinded PII (i.e. the encrypted identifiers) stored in the mapper module and verify that the data received by the mapper module has not been tampered with. Any modification by the mapper module will be detected through the zero-knowledge protocol.
  • the proof of the zero-knowledge proof protocol is computed from a random element generated by the server, a random element generated by the database seeking to authenticate the server, and the central key. The proof is validated if the stored identifiers processed with the proof match their encrypted versions processed with the random element generated by the server.
  • a query analyser operates in tandem with the mapper module to allow a data consumer to formulate a dataset query directed at data stored in each of the databases.
  • the query analyser queries the datasets through a distributed query mechanism, whereby the dataset query is divided into sub-queries, each categorised for its intended destination. Each sub-query is then sent to the database storing the data fitting the parameters of the sub-query.
  • a machine learning module that has a distributed machine learning function is deployed.
  • the machine learning module allows the data consumer to run analysis on the databases that reside on premise, whereby the machine learning module is trained with vector values of data belonging to the matching encrypted identifiers stored in the mapper module.
  • How the mapper module, the query analyser and the machine learning module operate is described below.
  • Figure 1 shows a schematic of the system 100 architecture in which data integration in accordance with various embodiments of the present invention is deployed.
  • Users of the system 100 can be a data contributor and/or a data consumer.
  • a data contributor wishes to share its data either to sell to a data consumer, or to integrate its data with other data contributors so that better insights can be obtained from joint datasets.
  • the system 100 uses a protocol that enables privacy-preserving integration of datasets from multiple databases contributed by participating contributors.
  • the system 100 comprises a mapper module 102; a central platform 104; databases 106a, 106n, 108; a query analyser 110 and a machine learning module 112.
  • databases 106a and 106n are client side components
  • the central platform 104, the database 108, the query analyser 110 and the machine learning module 112 are server side components.
  • the central platform 104 has a frontend 114 for website access and a backend on which an application programming interface (API) 116 resides.
  • the central platform 104 acts as a server to client terminals to which the databases 106a and 106n are associated.
  • the system 100 may have more than one central platform, and more than two databases to which a central platform serves as a server.
  • Figure 1 shows a cluster to which the central platform 104 and the two databases 106a and 106n belong.
  • the system 100 seeks to map common identities stored in the separate databases 106a and 106n. Initialisation of the mapper module 102 to map common identities is described below with reference to Figures 2 and 3.
  • Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities.
  • the operation flow includes uploading a dataset to database 250 (analogous to databases 106a and 106n shown in Figure 1); key creation to randomise the identification data; and uploading the database schema to the server (refer to the central platform 104 of Figure 1) through the server API 116.
  • Database schema refers to the schema of fields under which data is organised in the database. The following describes the steps involved in this operation flow of Figure 2.
  • a user 250 selects and uploads a file containing a dataset.
  • the content of the dataset depends on the transactions which the identifier owner performs and which the databases 106a and 106n are responsible for recording (e.g. sum insured and insurance policies if the database 106a belongs to an insurance company; illness and duration of ward stay if the database 106n belongs to a hospital).
  • the user 250 defines the dataset schema and PII column(s) containing identifiers ID_i.
  • each identifier comprises one or more of identification data of the party that owns data stored in the database 106a and 106n.
  • In step 203, a user application 252 generates a dataset key x which is used to perform storage encryption at the databases 106a and 106n. In step 204, the user application 252 imports the uploaded file into the databases 106a and 106n. In step 205, the user application 252 randomises the identifiers ID_i in the PII columns with the storage key x, to produce encrypted identifiers F_x(ID_i).
  • In step 206, the user application 252 sends the dataset schema and the encrypted identifiers F_x(ID_i) to the central platform 104 through its server API 116.
  • In step 207, the user application 252 sends the storage key x to the mapper module 102, for example through REST API over HTTPS.
  • a new dataset key x_n may be generated for each dataset upload operation.
  • the system 100 adopts a data integration protocol without data leaving the premise. Only randomised identity attributes are submitted to the central server for blinding. As long as each dataset shares a PII column that allows for uniquely identifying a record, i.e. the mapper module 102 is able to identify common identities from its received randomised identifiers ID_i, the data integration protocol is able to match and then inform each participant of the matched PIIs without revealing the underlying identity information of the non-matched PIIs. The steps are performed while the datasets remain on premise.
  • Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module 102.
  • a user 350 selects two or more datasets for integration/merging.
  • the user 350 defines the merged dataset schema.
  • In step 303, the server frontend (FE) 114 posts the merged schema to the server API 116.
  • In step 304, the server API 116 randomises the PII columns of the selected datasets with its own server key y (also referred to as the central key), resulting in multiply encrypted identifiers F_y(F_x(ID_i)).
  • In step 305, the server API 116 sends the multiply encrypted identifiers F_y(F_x(ID_i)) to the mapper module 102.
  • the multiple encryption refers to the encryption performed on the identifier at the database using the storage key x_n (see step 205 of Figure 2) and the encryption performed by the server API 116 using the central key y on the identifier that is already encrypted using the storage key x_n (as per step 304).
  • the storage key x_n used to encrypt each of the identifiers stored by the databases 106a and 106n may change.
  • the central key y can also change for each batch of multiple encryption that is performed.
  • In step 306, the mapper module 102 decrypts each of the multiply encrypted identifiers F_y(F_x(ID_i)) with their respective storage keys x.
  • the receipt of these storage keys x, used to perform storage encryption at each of the databases (cf. the databases 106a and 106n shown in Figure 2) storing one or more of the identifiers ID_i, by the mapper module 102 was described with reference to Figure 2 (see step 207).
  • the decryption using the respective storage key x removes the storage encryption from each of the multiply encrypted identifiers F_y(F_x(ID_i)), resulting in identifiers encrypted only by the central encryption, F_y(ID_i).
  • In step 307, the server and the mapper module 102 build and store the data mapping results based on the mapping of the PII columns, i.e. an inner join.
  • the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of the storage encryption, hereafter referred to as "matching multiply encrypted identifiers" (rather than the verbose phrase "multiply encrypted identifiers that match after the removal of the storage encryption"). Each of such multiply encrypted identifiers is derived from common identification data.
  • In step 308, the mapper module 102 discards the encrypted key information, i.e. all the received storage keys x.
  • the mapper module 102 returns results of the match to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
  • the results of the match are returned so as to provide an indication which of the one or more identities stored in one of the databases (e.g. database 106a of Figure 1) is also stored in one or more of the other databases (e.g. database 106n of Figure 1).
  • the results of the match are returned via the mapper module 102 sending mapped record labels to the respective user applications.
  • the label is a stateful parameter that indexes the randomised PIIs.
  • the label is used to locate an identifier within the database storing the identifier from which the multiply encrypted identifier F_y(F_x(ID_i)) is derived. For example, each multiply encrypted identifier F_y(F_x(ID_i)) received by the mapper module 102 is accompanied with such a label.
  • the label can also be used to combine datasets, each of which is contributed from a different database.
  • the mapper module 102 performs intersection of the blinded PIIs provided by all the participants.
  • the mapper module 102 compiles a table of matching blinded PIIs and their accompanying labels, thereby consolidating the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels.
  • the objective is that only randomised PII is required by the central platform 104. Once matched, the central platform 104 has common labels that can be passed to the participants to ascertain which PIIs match the PIIs contributed by other participants.
  • In step 309, the mapper module 102 sends the table consisting of the labels to every participant, thereby returning each of the common labels to each of the databases storing one or more of the identifiers located by the common label.
  • the records that belong to the matched PIIs can then be retrieved or queried accordingly.
  • Table 1 below shows a possible layout for labelled identification data.
  • the label is generated when datasets are imported into a client module.
  • a client module x_1 for the database 106a tabulates a generated label for each dataset attributable to an identification data as follows: Table 2: Data organisation in client module x_1
  • a client module x_n for the database 106n tabulates a generated label for each dataset attributable to an identification data as follows.
  • the database schema for the database 106a comprises the data fields “age group”, “gender” and “cost”, while the database schema for the database 106n comprises the data fields “length of stay”, “diagnosis code” and "hospitalisation cost”.
  • Client module x_1 sends (100, F_x1(S1187561A)), (200, F_x1(S7765432B)) to the central platform 104. That is, the client module x_1 transmits each of its stored identifiers, encrypted with the storage key x_1. Each of the encrypted identifiers F_x1(ID_i) is accompanied with its corresponding label.
  • Client module x_n sends (455, F_xn(S1187561A)), (610, F_xn(S7658920C)) to the central platform 104. That is, the client module x_n transmits each of its stored identifiers, encrypted with the storage key x_n. Each of the encrypted identifiers F_xn(ID_i) is accompanied with its corresponding label.
  • the central platform 104 further encrypts with key s: (100, F_s(F_x1(S1187561A))), (200, F_s(F_x1(S7765432B))), (455, F_s(F_xn(S1187561A))), (610, F_s(F_xn(S7658920C))).
  • the central platform 104 performs central encryption with the central key s to produce multiply encrypted identifiers F_s(F_xi(ID_i)).
  • the mapper module 102 receives the randomised strings from the central platform 104 and the storage keys x_1 and x_n.
  • the mapper module 102 removes the inner encryptions: (100, F_s(S1187561A)), (200, F_s(S7765432B)), (455, F_s(S1187561A)), (610, F_s(S7658920C)).
  • the mapper module 102 sends the common label 100 to client module x_1 and the common label 455 to client module x_n, as sketched in the example below.
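
Continuing the illustrative sketch introduced earlier (reusing its hash_to_group, keygen, encrypt and strip_layer helpers), the label flow of this worked example can be reproduced end to end; the labels and identifier values are those of the example above, everything else is assumed:

```python
# End-to-end sketch of the label flow. In the real protocol the mapper
# receives storage keys per dataset, not attached to each ciphertext;
# they are bundled here only to keep the sketch compact.
from collections import defaultdict

x1, xn, s = keygen(), keygen(), keygen()  # storage keys and central key

# Step 1: each client module encrypts its identifiers and attaches labels.
client_x1 = [(100, encrypt(hash_to_group("S1187561A"), x1)),
             (200, encrypt(hash_to_group("S7765432B"), x1))]
client_xn = [(455, encrypt(hash_to_group("S1187561A"), xn)),
             (610, encrypt(hash_to_group("S7658920C"), xn))]

# Step 2: the central platform adds the outer layer with its key s.
blinded = [(label, encrypt(c, s), key)
           for batch, key in ((client_x1, x1), (client_xn, xn))
           for label, c in batch]

# Step 3: the mapper strips each storage key; identifiers stay masked by s.
by_value = defaultdict(list)
for label, ciphertext, storage_key in blinded:
    by_value[strip_layer(ciphertext, storage_key)].append(label)

# Step 4: values seen under more than one label are the common identities.
common = [labels for labels in by_value.values() if len(labels) > 1]
print(common)  # [[100, 455]] -- the labels returned to client modules x1, xn
```
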
  • a user C generates a random key x and the server S in the central platform generates a random key y.
  • C_2 similarly performs the matching and sends the set of matching labels to S.
  • In step (6), C_2 performs the same protocol as C_1; the same applies in scenarios where there are more than two users.
  • the server S stores all the lists of matching labels so that when a user wishes to query for the records of the matched PIIs from other users, it is possible to cross-verify that the records returned are indeed those of the matched PIIs. This is performed by the server S and the users sending the set of matched labels to the querying user.
  • the second application may be used where client terminals store their entire databases containing all attributes, less the identifiers, at the server S.
  • This is a central storage setting where all data providers (e.g. client terminals C_1 and C_2) share their datasets (e.g. age, salary, zip code, insured amount) but not the identifiers (e.g. national registration identity number) to which the datasets belong.
  • client terminals C_1 and C_2 can perform matching based on the identifiers in their datasets using the above outlined protocol of the second implementation.
  • client terminal C_1 sends the matched labels to the server S for the server S to correctly combine the stored datasets obtained from client terminals C_1 and C_2.
  • client terminals C_1 and C_2 may have datasets as tabulated below.
  • Client terminal C_1 sends the illness data to the server S while client terminal C_2 sends the insurance data to the server S. Since each of the client terminals C_1 and C_2 incorporates the mapper module 102 (see Figure 3), each will possess multiply encrypted identifiers derived from identifiers stored in both the client terminals C_1 and C_2. Client terminal C_1 identifies the common identities (8229012 and 8056798) from their multiply encrypted format. Client terminal C_1 demarcates a label for each of the matching multiply encrypted identifiers as a common label. Client terminal C_1 then transmits the common labels to the server S to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, as illustrated below. Table 6: Server S dataset
  • the datasets stored in the server S are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifier are derived.
  • the server S or the central platform 104 establishes knowledge of the central key, through the use of a zero knowledge protocol (ZKP), to each of the databases 106a and 106n storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
  • the ZKP protocol used by the server S or the central platform 104 is outlined below:
  • server S processes the multiply encrypted identifiers with the random element s generated by the server S;
  • the server S or the central platform 104 is configured to generate a random element s and receive a random element c associated with the database 106a and 106n seeking to authenticate the server S or the central platform 104.
  • the server S or the central platform 104 then computes a proof p of the zero knowledge protocol based on the random element s generated by the server, the received random element c associated with the database and the central key y.
  • the database 106a and 106n seeking to authenticate the server S or the central platform 104 is configured to receive from the server: a subset of the multiply encrypted identifiers derived from identifiers stored at the database 106a and 106n, the subset having been processed by the random element s generated by the server S or the central platform 104; and the proof p of the zero knowledge protocol.
  • the database 106a and 106n process the stored identifiers from which the subset is obtained with the proof p to obtain evaluation parameters.
  • the database 106a and 106n validate the proof p of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
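
The steps above are consistent with a Schnorr-style proof of knowledge of the exponent behind the central encryption. As a hypothetical instantiation only (the patent does not name a concrete scheme), the following sketch reuses the toy group of the earlier commutative-encryption example, with names following the text: random element s from the server, random challenge c from the database, proof p:

```python
# The server proves knowledge of its central key y with b_i = a_i^y,
# without revealing y. Verification: a_i^p == commitment_i * b_i^c.

def zkp_prove(a_values: list[int], y: int, c: int):
    s = keygen()                                    # server's random element
    commitments = [pow(a, s, P) for a in a_values]  # identifiers processed with s
    p = (s + c * y) % Q                             # proof from s, c and the key y
    return commitments, p

def zkp_verify(a_values: list[int], b_values: list[int],
               commitments: list[int], c: int, p: int) -> bool:
    # a_i^p = a_i^(s + c*y) = a_i^s * (a_i^y)^c, so the check holds iff
    # the server really knew y; any tampering with b_i breaks the equality.
    return all(pow(a, p, P) == (t * pow(b, c, P)) % P
               for a, b, t in zip(a_values, b_values, commitments))

y = keygen()
a_values = [hash_to_group(i) for i in ("S1187561A", "S7765432B")]
b_values = [pow(a, y, P) for a in a_values]         # centrally encrypted identifiers
c = keygen()                                        # database's random challenge
commitments, p = zkp_prove(a_values, y, c)
assert zkp_verify(a_values, b_values, commitments, c, p)
```
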
  • the sketched proof is as below.
  • a_i are provided by the client, and b_i are provided by the server S.
  • the server S can prove to the client its knowledge of the key y' such that each b_i is the encryption of a_i under y'.
  • when the client chooses a vector (r_1, ..., r_n) with each r_i randomly chosen from [1, 2^l], the inner product between this vector and a non-zero vector (e_1, ..., e_n) is zero (i.e. the two vectors are perpendicular) only with negligible probability. For the check to pass with non-negligible probability, it has to be the case that the vector (e_1, ..., e_n) is a zero vector, thus e_i = 0 for each i.
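
Written out, this is the standard soundness bound for a random linear check (a reconstruction of the garbled passage; the explicit 2^-l bound is assumed rather than quoted from the patent):

```latex
% For any fixed non-zero error vector e, at most one value of r_{i*}
% (for an index i* with e_{i*} != 0) can satisfy the equation, so
\Pr_{r_1,\dots,r_n \leftarrow [1,\,2^{l}]}
  \left[ \sum_{i=1}^{n} r_i\, e_i = 0 \right] \;\le\; 2^{-l}
  \qquad \text{whenever } (e_1,\dots,e_n) \neq \mathbf{0},
% and acceptance with non-negligible probability forces e_i = 0 for all i.
```
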
  • a PII to which a dataset in a database belonging to the system 100 of Figure 1 is attributed may not always be atomic.
  • the PII may instead contain a few segments.
  • An example is IPv4 addresses, where an address (e.g. 192.168.1.1) can be divided into four segments, each matching an octet of its dotted format representation (This can be applied to IPv6 address in a similar manner).
  • the segments of an IP address are crucial for network analysis, for instance, to ascertain whether a group of IP addresses originated from an identical subnet. Losing this structure means the dataset loses utility and is of limited value. Any method that randomises (or masks) such PII must be able to preserve its structure. An approach to directly randomise each segment, treating every segment as a single identity attribute, is inefficient. This means if a PII has W segments, randomisation and blinding would have to be performed on all W segments in order to preserve its structure.
  • the client chooses the segments to be randomised and creates a list L containing identifiers of these selected segments, which serve as initial segments of identification data for the Structure Preserving Data Integration protocol. Only the segments in the list are randomised and blinded, in stages. The intuition is that in the majority of scenarios the resulting dataset of the first match (or intersection) would contain fewer records compared to the original datasets.
  • step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the initial segments of identification data that match.
  • the results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the initial segments of identification data that match is obtained.
  • a second chosen or further segment of the identification data may be processed next, but on the smaller dataset originated from the first intersection. That is, the mapper module 102 locates the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the further segments of identification data that match is obtained. Subsequent mapping can be carried out for the remaining segments, depending on how fine-grained the matching needs to be. Any key homomorphic encryption (or commutative encryption) scheme can be deployed, such as the one discussed in international application no. PCT/SG2017/050575.
  • a user C generates a random key x and the server S in the central platform 104 generates a random key y.
  • C_1 sends the set of matching tags (k_i) to S, for i ∈ [1, r], r ∈ [n].
  • the Structure Preserving Data Integration protocol works for any PII that can be divided into segments in order to preserve its structure. Two use cases of the structure-preserving data integration protocol are discussed below.
  • H(.) is a cryptographic hash function. This representation preserves prefixes of an IP address and is the only way that preserves the structure of an IP address without leaking information.
  • the client for example, might wish to find a mapping on segment H(A.B), and at a later stage, H(A.B.C).
  • the client C first randomises all four segments and generalises the non-identity attributes, in order to protect against the server learning the segments of the IP addresses.
  • the client C then creates a list L that contains two identifiers, one for H(A.B) and another for H(A.B.C).
  • the client C and the server S choose the identifier of H(A.B), i.e. an initial segment, from the list, retrieve H(A.B), jointly randomise and blind to find the matching records in the datasets contributed by other participating clients C.
  • the client C and the server S need only randomise and blind this initial segment.
  • the client C and the server S would only need to map from the results of matching H(A.B), which may contain fewer records compared to the full dataset.
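
A brief sketch of this prefix-hash representation and the staged matching it enables; plain SHA-256 stands in for the jointly randomised and blinded values of the real protocol, and the chosen segment indices play the role of the client's list L:

```python
import hashlib

def ip_segments(address: str) -> list[str]:
    """Represent an IPv4 address by hashes of its nested prefixes:
    H(A), H(A.B), H(A.B.C), H(A.B.C.D) -- preserving subnet structure."""
    octets = address.split(".")
    return [hashlib.sha256(".".join(octets[:i + 1]).encode()).hexdigest()
            for i in range(len(octets))]

def staged_match(dataset_a: list[str], dataset_b: list[str],
                 segment_list: list[int]) -> set[str]:
    """Intersect on a coarse segment first, then refine only the smaller
    set of survivors on finer segments (indices 1 then 2 below stand for
    H(A.B) then H(A.B.C))."""
    surviving = set(dataset_a)
    for seg in segment_list:
        b_segments = {ip_segments(ip)[seg] for ip in dataset_b}
        surviving = {ip for ip in surviving
                     if ip_segments(ip)[seg] in b_segments}
    return surviving

a = ["192.168.1.1", "192.168.2.9", "10.0.0.5"]
b = ["192.168.1.77", "172.16.4.2"]
print(staged_match(a, b, segment_list=[1, 2]))
# {'192.168.1.1'} -- shares both the H(A.B) and H(A.B.C) prefixes with b
```
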
  • Beyond IP addresses, it is possible that multiple columns in a database are used to represent an identity attribute, e.g. name, email and phone number. Instead of randomising all columns as one attribute, coarse-grained or fine-grained randomisation of multiple columns is used.
  • a client may define each column as a segment and create the list L containing the segments that are to be processed.
  • the consolidation of step 307 includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases 106a and 106n storing the common identification data.
  • the leakage of information due to in-stage interactions between the client C and the server S was modelled. Specifically, the server S and the client C learn the sets of matched randomised PIIs and their sizes for every stage of matching. This is because the clients C submit k_i to the server at the end of each iteration. It was found that, given the extra leakage of information, the server S and the client C are not able to learn any information about the underlying identity attribute.
  • An interactive stateful leakage function was defined which, given the dataset as input, outputs the size n of the dataset, the number of segments W and the list of segments L to be processed.
  • the segments are not unique, given that key homomorphic encryption is deterministic.
  • the server S is able to learn how many IDs have the same segments. This information may also enable the server S to learn, for example, that two datasets may be of the same domain by observing the received k_i, if the lists L submitted by the two participants are the same.
  • a potential mitigation is to adopt a prefix-preserving pseudonymisation mechanism that partitions the set of IP addresses, performs a migration to hide repetitive segments and replicates the IP addresses into a multi-view setting, so that the recipient has no idea which view is the real set of IP addresses.
  • Figure 4 shows operation flow for the query analyser 110 of Figure 1 when executing a distributed query protocol.
  • the distributed query protocol divides a query into sub-queries that are directed to the datasets of the respective organisations. Only record labels and aggregated results will be sent out from the datasets. The results will be merged based on mapped record labels.
  • the distributed query protocol ensures that raw data never leaves the respective organisations premises.
  • the distributed query protocol, executed by the query analyser 110, interrogates the fields under which data is organised in each of the databases 106a and 106n.
  • the mapper module 102 receives the schema of fields together with the multiply encrypted identifiers, such as in step 305 of Figure 3.
  • the query analyser 110 determines which of the schema of fields in the mapper module 102 has a format corresponding to search parameters of the received dataset query.
  • the query analyser 110 is also able to divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters.
  • the sub-queries are then distributed amongst the databases 106a and 106n, so that each of the databases 106a and 106n receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database 106a and 106n.
  • the query analyser 110 is also configured to merge the responses from each of the databases 106a and 106n to their received sub-queries and return the merged response to the dataset query received by the query analyser 110.
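
Schematically, the division of a dataset query into sub-queries can be illustrated as below; the schemas mirror the fields named earlier for databases 106a and 106n, while the query shape and field values are assumptions (Tables 7 to 9 are not reproduced in this text):

```python
# Each search parameter is routed to the database whose schema of fields
# contains the matching column, so every database only ever sees the
# sub-query whose parameters fit its own schema.
schemas = {
    "db_106a": ["age group", "gender", "cost"],
    "db_106n": ["length of stay", "diagnosis code", "hospitalisation cost"],
}

def divide_query(query: dict[str, object],
                 schemas: dict[str, list[str]]) -> dict[str, dict]:
    sub_queries: dict[str, dict] = {db: {} for db in schemas}
    for field, parameter in query.items():
        for db, fields in schemas.items():
            if field in fields:
                sub_queries[db][field] = parameter
    # Drop databases that received no matching parameters.
    return {db: q for db, q in sub_queries.items() if q}

query = {"age group": "30-39", "diagnosis code": "J18", "gender": "F"}
print(divide_query(query, schemas))
# {'db_106a': {'age group': '30-39', 'gender': 'F'},
#  'db_106n': {'diagnosis code': 'J18'}}
```
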
  • In step 401, a customer sends a query on a dataset ("dataset query").
  • In step 402, the Server Frontend (FE) 114 posts the dataset query to the Server API 116.
  • the server API 116 relays the dataset query to the query analyser (QA) 110.
  • In step 403, the QA 110 breaks down the dataset query into sub-queries. The sub-queries are sent to the user apps for relay to the databases 106a and 106n.
  • In step 404, each of the user apps queries its on-premise database 106a and 106n.
  • In step 405, the user apps send the results back to the QA 110.
  • In step 406, the QA 110 merges the results from the user apps.
  • In step 407, the QA 110 sends the merged results back to the Server API 116 for relay back to the Server FE 114.
  • In step 408, the Server FE 114 displays the merged results.
  • Tables 7 and 8 below each provide an example of data organised under a schema of fields found in a database.
  • Table 9 shows an example of a dataset query and its division into sub-queries based on the schema of fields shown in Table 7 and Table 8.
  • Table 9: Sample dataset query and sub-queries. Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module 112.
  • the machine learning module allows a data consumer to run analysis on the databases 106a and 106n that reside on premise.
  • the distributed machine learning (DML) protocol allows the data consumer to request that a machine learning model be computed on datasets residing in different organisations.
  • the machine learning module 112 is trained using vector values of data belonging to identifiers stored in the databases 106a and 106n. Recalling from Figures 2 and 3 that common identification data serves to integrate datasets, the identifiers used to train the machine learning module 112 are those derived from matching multiply encrypted identifiers consolidated in the mapper module 102. Parameters used by the machine learning module 112 for outcome prediction are updated based on an aggregation of the received vector values. The updating of the parameters is reiterated until convergence criteria is met. The machine learning module 112 is then configured to, after having been trained, reference data from one or more of the databases 106a and 106n when responding to received queries.
  • In step 501, a customer selects the machine learning model type to be trained on the datasets.
  • In step 502, the Server Frontend (FE) 114 posts the selected machine learning model to the Server API 116, which relays the request to the Distributed ML (DML) module 112.
  • In step 503, the DML module 112 requests the dataset dimensions and response variable from the user apps.
  • Each of the user apps queries its on-premise database 106a and 106n;
  • In step 504, the user apps relay the dataset dimensions and response variable back to the DML module 112;
  • In step 505, the DML module 112 initialises the parameter values used by the machine learning module 112 for outcome prediction and sends the relevant parameter values to each user app, along with the machine learning model type information;
  • In step 506, each user app computes local aggregated results based on the model type and uploads the aggregated results to the DML module 112;
  • In step 507, the DML module 112 uses the local aggregated results from the user apps to update the parameter values. If the convergence criteria are satisfied or the maximum number of iterations has been reached, it signals protocol finish to the user apps and performs step 509 below. Otherwise, the DML module 112 relays the updated parameter values to the user apps;
  • In step 508, the user apps receive the updated parameters. If the protocol finish signal has not been received, step 506 is repeated;
  • In step 509, the DML module 112 relays the model parameters back to the Server API 116;
  • In step 510, the Server Frontend 114 displays the results.
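
The loop of steps 503 to 509 follows the familiar federated-learning pattern. A minimal sketch under stated assumptions: linear regression as the selected model type, simple averaging as the aggregation rule, and a step-size-based convergence criterion, none of which the patent fixes:

```python
import numpy as np

# Steps 503-509 in miniature: the server initialises parameters, each user
# app computes a local aggregate (here a least-squares gradient), and the
# server averages the aggregates and updates until convergence.
def local_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    return X.T @ (X @ w - y) / len(y)    # per-database aggregated result

def train(parties, dim: int, lr: float = 0.1,
          tol: float = 1e-6, max_iters: int = 1000) -> np.ndarray:
    w = np.zeros(dim)                     # step 505: initialise parameters
    for _ in range(max_iters):
        grads = [local_gradient(X, y, w) for X, y in parties]  # step 506
        step = lr * np.mean(grads, axis=0)                     # step 507
        w -= step
        if np.linalg.norm(step) < tol:    # convergence criterion satisfied
            break
    return w                              # step 509: final model parameters

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(2):                        # two on-premise databases
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w))
print(train(parties, dim=2))              # approaches [2.0, -1.0]
```
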
  • the system 100 provides a data contributor with an access control mechanism and a configurable risk assessment module, so that the data contributor may decide whether to reveal certain attributes of its database for query or learning by the data consumer. These mechanisms and the configurable risk assessment module can also be used to set the price plan for the data consumer to perform queries or analysis on more sensitive information.
  • the distributed query and distributed machine learning steps can be performed independently from the intersection mechanism.
  • Upon determining a common identity from its multiply encrypted form, the mapper module 102 is configured to query a registration directory storing an address of each of the databases 106a and 106n. With reference to step 309 of Figure 3, this allows the mapper module 102 to obtain the address of each of the databases 106a and 106n to which matching results should be sent.
  • the system 100 of Figure 1 uses multiple encryption that comprises storage encryption and a central encryption.
  • the removal of the storage encryption results in matching to be performed on identifiers encrypted by the central encryption.
  • This central encryption is performed using a central key y at a server from which the mapper module 102 receives the multiply encrypted identifiers.
  • the mapper module 102 is configured to return the results of the match through a server from which the mapper module 102 receives the multiply encrypted identifiers.
  • the mapper module 102 is integrated with the server.
  • the mapper module 102 is integrated into a client terminal that integrates one or more of the databases 106a and 106n. It will be appreciated that integration refers to these components being part of a designated network and does not necessarily require for them to be internally situated.
  • the present invention provides a system for privacy-preserving data sharing and integration, which enables organisations to match, query and analyse their datasets without the datasets leaving the premises of the organisations.
  • the system is initiated by a user registering to the system and an administrator managing the list of users and the applications.
  • a data integration protocol enables matching of personally identifiable information (PII) in a privacy-preserving manner. Users (who can be data donors and/or data consumers) upload and integrate their datasets by performing the data integration protocol together with a central platform to randomise and map/intersect their PII information using the mapper module of the user and the server application. Once mapped, the information (or labels) on the sets of matched PIIs are returned to each respective user.
  • Coupled with the data integration protocol is a zero-knowledge proof protocol that allows verification of the correctness of the attributes processed by the server during data integration. Any modification by the server on the attributes will be detected through the zero-knowledge proof protocol.
  • the data integration protocol is flexible in that it caters for a PII comprising more than one segment (e.g. IP addresses) or different identification data (e.g. multi-column IDs), using structure preserving techniques.
  • data handled by the data integration protocol is also labelled in a way that, even when the datasets remain on premise, organisations are able to retrieve or refer to records of matched PIIs.

Abstract

According to an aspect of the present invention, there is provided a system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.

Description

Privacy Preserving System For Mapping Common Identities
FIELD
The present invention relates to a system for mapping common identities stored in separate databases.
BACKGROUND
The proliferation of interconnected communication devices enables individuals and organisations to easily communicate and share information. This results in organisations possessing large amounts of sensitive information compared to the past. Commensurately, management of such information is also becoming more regulated. Privacy protection laws have been introduced, such as the General Data Protection Regulation (GDPR) by the EU.
The approach of ensuring data never leaves an organisation premise is preferred for preserving data privacy and providing better control over datasets. There are known techniques that provide privacy-preserving data intersection between two or more participants. Some of these techniques only provide intersection in specific settings and extensive engineering effort is required for them to be practically deployable. While comprehensive, private set operation (PSO) protocols that provide privacy-preserving intersection incur substantial computational and communication overhead.
There is thus a need for a system that can share datasets in a privacy-preserving manner without data leaving an organisation.
SUMMARY OF THE INVENTION
According to an aspect of the present invention, there is provided a system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to: receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption; receive a storage key used to perform storage encryption at each database storing one or more of the identifiers; remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key; consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and transmit results of the match.
In one implementation, the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived. In another implementation, the results of the match are sent to a server storing datasets attributable to the matching multiply encrypted identifiers. In this other implementation, the datasets are contributed by the one or more databases storing the identifiers from which the multiply encrypted identifiers are derived. The sending of the results of the match to the server is to link the stored datasets that are attributable to the matching multiply encrypted identifier.
In a first scenario, at least one of the identifiers may comprise an initial segment of identification data. The consolidation of the multiply encrypted identifiers by the mapper module then includes those derived from the initial segments of identification data that match. The results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained. An extension of this first scenario has at least one of the identifiers comprise a further segment of the identification data, whereby the mapper module is configured to locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The mapper module then returns the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
In a second scenario, at least one of the identifiers comprises different identification data with at least one in common with another of the identifiers. The consolidation of the multiply encrypted identifiers by the mapper module then includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases storing the common identification data.
In an implementation where the multiple encryption comprises the storage encryption and a central encryption performed using a central key at a server from which the mapper module receives the multiply encrypted identifiers, the server is configured to establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
Each of the multiply encrypted identifiers may be accompanied with schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived. The system has a query analyser configured to interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query. The query analyser divides the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters and distributes the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
The system has a machine learning module configured to, when being trained: receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived; update parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and reiterate the updating of the parameters until convergence criteria are met. After having been trained, the machine learning module is configured to reference data from one or more of the databases when responding to received queries.
BRIEF DESCRIPTION OF THE DRAWINGS
Representative embodiments of the present invention are herein described, by way of example only, with reference to the accompanying drawings, wherein:
Figure 1 shows a schematic of the system architecture in which data integration in accordance with various embodiments of the present invention is deployed.
Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities, into a database of the system of Figure 1.
Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module of Figure 1.
Figure 4 shows operation flow for the query analyser of Figure 1 when executing a distributed query protocol.
Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module of Figure 1.
DETAILED DESCRIPTION
In the following description, various embodiments are described with reference to the drawings, where like reference characters generally refer to the same parts throughout the different views.
The present application finds relevance for organisations seeking to share, combine and jointly process their datasets. Having access to data from other organisations provides insight to make better decisions or create greater social impact, since there is a limit to the insights that can be extracted from datasets belonging to a single organisation.
However, sharing data between different organisations raises privacy concerns, since such datasets could include customer data and medical data which are sensitive in nature and privy to the hosting organisation. The objective of the present application is to enable organisations to share and generate insights from joint datasets in a privacy-preserving manner.
The present application adopts an approach where joint datasets for analysis are achieved while data remains on the organisations' premises. This ensures that data does not leave the organisation within which it is stored, thus preserving data privacy. To locate joint datasets, a mapper module is adopted which identifies the databases that contain the data upon which joint analysis can be based. The mapper module uses identification data commonality as one criterion to determine whether data in separate databases can be used for joint analysis, i.e. if organisations store data attributable to a common entity (e.g. the same party may have patient records stored with a hospital database, policy records stored with an insurance company database and past expenditure records stored with a credit card company), such data stored in different organisations, once recognised, becomes a joint dataset usable for joint analysis. The mapper module co-ordinates which of its registered databases stores data attributable to a common entity, whereby an identifier used by the common entity forms the shared attribute to locate the data across the registered databases. As such, the data can be traced through such an identifier. Each identifier comprises one or more items of identification data (e.g. any type of PII (personally identifiable information), such as national registration identity number, social security number, foreign identification number, telephone contact numbers, email address, residence address, name, bank account number, credit card primary account number) provided by the common entity when registering with each of these databases to establish data ownership. In addition, the identifier may also comprise identification data unique to or generated by each of the databases (e.g. IP address, serial number, IMEI (international mobile equipment identity) number, virtual identification number). Therefore, such an identifier provides a means to locate joint datasets. When the mapper module discovers that an identifier matches, i.e. the same identifier is found in several databases, the mapper records the matching identifier in a database used to store matching identifiers.
Since the present application adopts the approach of privacy preservation, the mapper module performs identity matching on encrypted identifiers, rather than in their plain-text form. In more detail, when each of the databases transmits their stored identifiers for the mapper module to determine whether each has a common identity stored in another database, they are already in randomised format from having been encrypted at their respective database. Such encryption is through the use of an encryption key generated at each of the respective databases. This key is called a storage key, a dataset key or a client key. The encrypted identifiers undergo one or more layers of encryption before they are received by the mapper module, so that the mapper module receives multiply encrypted identifiers.
One reason for using multiple encryption is because the storage key used by each database to encrypt its transmitted identifiers is different from the storage key used by another database (e.g. a first database uses storage key x1, while a second database uses storage key x2), so that each database is able to independently encrypt its transmitted identifiers without the need to a priori communicate and share a key as in many existing techniques. The mapper module is unable to perform matching if in receipt of identifiers encrypted with just their respective storage keys alone. Receiving the storage key from each of the databases to decrypt the singly encrypted identifiers would then reveal the identifiers, running contrary to the purpose of privacy preservation. Multiple encryption achieves an outer layer of encryption to the inner encryption provided by the storage key. When the mapper module removes the inner encryption from receiving the respective storage key, the identifiers are still encrypted by the outer encryption, so that the identifiers are still masked to the mapper module, thereby achieving privacy preservation.
The multiple encryption used is homomorphic or commutative in nature, i.e. encryption that allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext. Each of these additional layers of encryption may be applied by an intermediary that routes the encrypted identifiers to the mapper module. For instance, a server uses its own key (called a central key or a server key) to apply an additional layer of encryption to produce the multiply encrypted identifiers. As such, the records kept in the mapper module are encrypted forms of the identifiers. The mapper module receives the storage key used to encrypt each identifier stored at each database. The storage encryption is removed from each of the multiply encrypted identifiers, whereby matching is then performed on the multiply encrypted identifiers following the storage encryption removal (called "matching multiply encrypted identifiers"). The results of the match are then transmitted so as to facilitate the location of joint datasets.
In the case where the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived, such results indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases. Data attributable to each of the matching identities is then available for integration, i.e. usable to become a dataset for joint analysis. It will thus be appreciated that data integration effected by the mapper module does not result in data being combined in a central location. Rather, the data integration effected by the mapper module points to the databases containing the data that can be used for joint dataset analysis. In addition, in the context of the present application, identifiers and dataset are two distinctively different types of data. Identifiers are used to establish ownership of content found in a dataset.
The mapper module is not restricted to performing matching on multiply encrypted identifiers that are derived from singular identification data contained within a column entry (e.g. any one of national registration identity number, telephone contact number, email address, credit card primary account number or IP address in complete form). In one implementation, the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from a segment of identification data (e.g. segment "3456" from a national registration identity number "S1234567G"). This is advantageous in situations where only a selected segment of identification data is required to locate a common identity from other identities stored across all databases. In situations where a further segment of the identification data is required (e.g. further segment "7G" to eliminate national registration identity number "S1834568H" and locate national registration identity number "S1234567G"), the location of multiply encrypted identifiers can be based on results of the matching performed on the initial segment of identification data. The mapper module would then have to analyse fewer records when using the results from the earlier search for common identities having the initial segment of identification data.
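The staged narrowing described above can be illustrated with a minimal sketch; the function and variable names are assumptions, and strings such as "E(3456)" stand in for multiply encrypted segments after storage-encryption removal:

```python
from collections import defaultdict

def match_stage(records):
    """records: (database, label, blinded_segment) triples.
    Group labels by blinded segment; keep segments seen in more than one database."""
    groups = defaultdict(list)
    for db, label, seg in records:
        groups[seg].append((db, label))
    return {seg: entries for seg, entries in groups.items()
            if len({db for db, _ in entries}) > 1}

# Stage 1: match on the blinded initial segment (e.g. derived from "3456").
initial = [("106a", 100, "E(3456)"), ("106n", 455, "E(3456)"), ("106a", 200, "E(9912)")]
survivors = {(db, lbl) for entries in match_stage(initial).values() for (db, lbl) in entries}

# Stage 2: refine on the blinded further segment (e.g. derived from "7G"),
# scanning only the fewer records that survived stage 1.
further = [("106a", 100, "E(7G)"), ("106n", 455, "E(7G)"), ("106a", 200, "E(8H)")]
refined = match_stage(r for r in further if (r[0], r[1]) in survivors)
```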
In another implementation, the mapper module is able to perform matching based on multiply encrypted identifiers that are derived from different identification data (e.g. two or more of national registration identity number, telephone contact number, email address, credit card primary account number and IP address), where at least one of the identification data is present in another multiply encrypted identifier. For example: A first multiply encrypted identifier is derived from national registration identity number, telephone contact number and email address. A second multiply encrypted identifier is derived from telephone contact number, credit card primary account number and IP address. The first multiply encrypted identifier is considered to match the second multiply encrypted identifier if the telephone contact number is the same. A third multiply encrypted identifier may also be present, derived from credit card primary account number and virtual identification number. The third multiply encrypted identifier is considered to match the first and second multiply encrypted identifiers if the credit card primary account number, found in the second multiply encrypted identifier and the third multiply encrypted identifier, matches. While the first multiply encrypted identifier does not have the credit card primary account number, the first multiply encrypted identifier is considered to match the third multiply encrypted identifier because of the linkage brought about by the second multiply encrypted identifier. The first multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common telephone contact number), while the third multiply encrypted identifier matches the second multiply encrypted identifier (by virtue of their common credit card primary account number). This other implementation thus allows for matching of identifiers comprising multiple different identification data, i.e. identification data contained over several column entries in a database. The protocol used by the mapper module finds a match in a privacy-preserving manner among the encrypted identifiers contributed by each of the databases. The multiply encrypted identifiers having common identification data are considered to match.
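The transitive linkage in this example is essentially connected-component grouping, which a union-find structure captures. The sketch below uses plaintext stand-ins for the multiply encrypted values, and all names are hypothetical; it groups the three identifiers above into one identity:

```python
parent = {}

def find(x):
    """Union-find with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

identifiers = {
    "first":  {"nric": "S1234567G", "phone": "91234567", "email": "a@x.com"},
    "second": {"phone": "91234567", "card": "4111-0000", "ip": "192.168.1.1"},
    "third":  {"card": "4111-0000", "virtual_id": "V-778"},
}
# Two identifiers match if any identification datum coincides; matches are
# transitive, so "first" links to "third" through "second".
for name, fields in identifiers.items():
    for value in fields.values():
        union(name, value)
assert find("first") == find("third")
```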
The mapper module is further configured with a zero-knowledge protocol that allows verification of the correctness of the attributes (i.e. the multiply encrypted identifiers) processed during data integration. The zero-knowledge protocol allows the database to verify the correctness of the blinded PII (i.e. the encrypted identifiers) stored in the mapper module and verify that the data received by the mapper module has not been tampered with. Any modification by the mapper module will be detected through the zero-knowledge protocol. The proof of the zero-knowledge proof protocol is computed from a random element generated by the server, a random element generated by the database seeking to authenticate the server and the central key. The proof is validated if the comparison of processing the stored identifiers with the proof against their encrypted versions after being processed with the random element generated by the server is a match.
After the matching results (i.e. the matched PIIs) are received by each of the databases, a query analyser operates in tandem with the mapper module to allow a data consumer to formulate a dataset query directed at data stored in each of the databases. The query analyser queries the datasets through a distributed query mechanism, whereby the dataset query is divided into sub-queries, each categorised for its intended destination. Each sub-query is then sent to the database storing the data fitting the parameters of the sub-query.
If, instead of a query, the data consumer wishes to perform analysis on the data stored in each of the databases, a machine learning module that has a distributed machine learning function is deployed. The machine learning module allows the data consumer to run analysis on the databases that reside on premise, whereby the machine learning module is trained with vector values of data belonging to the matching encrypted identifiers stored in the mapper module.
The system in which the mapper module, the query analyser and the machine learning module operate is described below.
Figure 1 shows a schematic of the system 100 architecture in which data integration in accordance with various embodiments of the present invention is deployed. Users of the system 100 can be a data contributor and/or a data consumer. A data contributor wishes to share its data either to sell to a data consumer, or to integrate its data with other data contributors so that better insights can be obtained from joint datasets. To share data in a privacy-preserving manner, the system 100 uses a protocol that enables privacy-preserving integration of datasets from multiple databases contributed by participating contributors.
The system 100 comprises a mapper module 102; a central platform 104; databases 106a, 106n, 108; a query analyser 110 and a machine learning module 112. In the implementation shown in Figure 1, databases 106a and 106n are client side components, while the central platform 104, the database 108, the query analyser 110 and the machine learning module 112 are server side components.
The central platform 104 has a frontend 114 for website access and a backend on which an application programming interface (API) 116 resides. The central platform 104 acts as a server to client terminals to which the databases 106a and 106n are associated. The system 100 may have more than one cluster of a central platform and the databases to which that central platform serves as a server. For the sake of simplicity, Figure 1 shows one cluster, to which the central platform 104 and the two databases 106a and 106n belong. The system 100 seeks to map common identities stored in the separate databases 106a and 106n. Initialisation of the mapper module 102 to map common identities is described below with reference to Figures 2 and 3.
Figure 2 shows operation flow taken when uploading data, the identities to which the data belongs and the randomisation of these identities. The operation flow includes a user 250 uploading a dataset to a database (analogous to databases 106a and 106n shown in Figure 1); key creation to randomise the identification data; and uploading database schema to the server (see the central platform 104 of Figure 1) through the server API 116. Database schema refers to the schema of fields under which data is organised in the database. The following describes the steps involved in this operation flow of Figure 2. Here we denote F(.) as a key homomorphic encryption or a commutative encryption scheme, where the following property holds: given keys x, y, and identity attribute IDi, Fy(Fx(IDi)) = Fx(Fy(IDi)).
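A minimal sketch of such a commutative F(.) is exponentiation in a DDH-hard group, i.e. Fk(m) = H(m)^k mod p. The parameters below are deliberately tiny toy values (a real deployment would use a 2048-bit group), and all names are illustrative assumptions rather than the patented construction:

```python
import hashlib

P = 1019                 # toy safe prime, P = 2*Q + 1; far too small for real use
Q = (P - 1) // 2         # 509, the (prime) order of the quadratic-residue subgroup

def hash_to_group(identifier):
    """Hash an identifier into the order-Q subgroup (square the digest mod P)."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(digest % P, 2, P)

def F(key, element):
    """Exponentiation cipher: commutative because F_y(F_x(m)) = m^(x*y) = F_x(F_y(m))."""
    return pow(element, key, P)

def remove_layer(key, element):
    """Strip one encryption layer by exponentiating with key^{-1} mod Q (Python 3.8+)."""
    return pow(element, pow(key, -1, Q), P)

x, y = 123, 456                                  # storage key x, central key y
inner = F(x, hash_to_group("S1234567G"))         # storage encryption at the database
multi = F(y, inner)                              # central encryption at the server
# The mapper removes the storage layer; the identifier stays blinded under y:
assert remove_layer(x, multi) == F(y, hash_to_group("S1234567G"))
```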
In step 201, a user 250 selects and uploads a file containing a dataset. The content of the dataset depends on the transactions which the identifier owner performs and that the databases 106a and 106n are responsible for recording (e.g. sum insured and insurance policies if the database 106a belongs to an insurance company; illness and duration of ward stay if the database 106n belongs to a hospital). In step 202, the user 250 defines the dataset schema and the PII column(s) containing identifiers IDi. As mentioned above, each identifier comprises one or more items of identification data of the party that owns data stored in the databases 106a and 106n.
In step 203, a user application 252 generates a dataset key x which is used to perform storage encryption at the databases 106a and 106n. In step 204, the user application 252 imports the uploaded file into the databases 106a and 106n. In step 205, the user application 252 randomises the identifiers IDi in the PII columns with the storage key x, to produce encrypted identifiers Fx(IDi).
In step 206, the user application 252 sends the dataset schema and the encrypted identifiers Fx(IDi) to the central platform 104 through its server API 116. In step 207, the user application 252 sends the storage key x to the mapper module 102, for example through a REST API over HTTPS. A new dataset key xn may be generated for each dataset upload operation.
The system 100 adopts a data integration protocol without data leaving the premises. Only randomised identity attributes are submitted to the central server for blinding. As long as each dataset shares a PII column that allows for uniquely identifying a record, i.e. the mapper module 102 is able to identify common identities from its received randomised identifiers, the data integration protocol is able to match and then inform each participant of the matched PIIs without revealing the underlying identity information of the non-matched PIIs. The steps are performed while the datasets remain on premise. It is computationally infeasible for the server/central platform 104 to re-identify the underlying identity of the randomised identifiers, as long as a secure key homomorphic encryption (or commutative encryption) scheme is used, such as one that is based on a DDH (Decisional Diffie-Hellman) assumption. The data integration protocol is described in greater detail with reference to Figure 3.
Figure 3 shows operation flow for the data integration protocol that is implemented by the mapper module 102.
In step 301, a user 350 selects two or more datasets for integration/merging. In step 302, the user 350 defines the merged dataset schema.
In step 303, the server frontend (FE) 114 posts the merged schema to the server API 116. In step 304, the server API 116 randomises the PII columns of the selected datasets with its own server key y (also referred to as the central key), resulting in multiply encrypted identifiers Fy(Fx(IDi)).
In step 305, the server API 116 sends the multiply encrypted identifiers Fy(Fx(IDi)) to the mapper module 102. This results in the mapper module 102 receiving multiply encrypted identifiers Fy(Fx(IDi)), where each is derived from an identifier IDi having undergone multiple encryption. The multiple encryption refers to the encryption performed on the identifier at the database using the storage key xn (see the step 205 of Figure 2) and the encryption performed by the server API 116 using the central key y on the identifier that is already encrypted using the storage key xn (as per the step 304). As mentioned earlier, the storage key xn used to encrypt each of the identifiers stored by the databases 106a and 106n may change. The central key y can also change for each batch of multiple encryption that is performed.
In step 306, the mapper module 102 decrypts each of the multiply encrypted identifiers Fy(Fx(IDi)) with their respective storage keys x. The receipt of these storage keys x, used to perform storage encryption at each of the databases (cf. the databases 106a and 106n shown in Figure 2) storing one or more of the identifiers IDi, by the mapper module 102 was described with reference to Figure 2 (see the step 207). The decryption using the respective storage key x removes the storage encryption from each of the multiply encrypted identifiers Fy(Fx(IDi)), resulting in identifiers encrypted by the central encryption, Fy(IDi). While only dual encryption has been discussed thus far (i.e. firstly by using the storage key x, followed by using the central key y), another implementation may use several layers of encryption (i.e. more than two). In this other implementation, the identifiers will still be multiply encrypted after removal of the storage encryption.
In step 307, the server and the mapper module 102 build and store the data mapping results based on the mapping of the PII columns, i.e. an inner join. The mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of the storage encryption, hereafter referred to as "matching multiply encrypted identifiers" (rather than the verbose phrase "multiply encrypted identifiers that match after the removal of the storage encryption"). Each such multiply encrypted identifier is derived from common identification data. In step 308, the mapper module 102 discards the encryption key information, i.e. all the received storage keys x. In step 309, the mapper module 102 returns results of the match to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived. The results of the match are returned so as to provide an indication of which of the one or more identities stored in one of the databases (e.g. database 106a of Figure 1) is also stored in one or more of the other databases (e.g. database 106n of Figure 1). In one implementation, the results of the match are returned via the mapper module 102 sending mapped record labels to the respective user applications.
The label is a stateful parameter that indexes the randomised PIIs. The label is used to locate an identifier within the database storing the identifier from which the multiply encrypted identifier Fy(Fx(IDi)) is derived. For example, each multiply encrypted identifier Fy(Fx(IDi)) received by the mapper module 102 is accompanied with such a label. The label can also be used to combine datasets, each of which is contributed from a different database.
Returning to the step 307, the mapper module 102 performs intersection of the blinded PIIs provided by all the participants. The mapper module 102 compiles a table of matching blinded PIIs and their accompanying labels, thereby consolidating the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels. The objective is that only randomised PII is required by the central platform 104. Once matched, the central platform 104 has common labels that can be passed to the participants to ascertain which are the PIIs that match with the PIIs contributed by other participants. This is achieved in the step 309, where the mapper module 102 sends the table consisting of the labels to every participant, thereby returning each of the common labels to each of the databases storing one or more of the identifiers located by the common label. The records that belong to the matched PIIs can then be retrieved or queried accordingly.
Table 1 below shows a possible layout for labelled identification data.
Table 1: Labelled identification data

Label    Randomised identification data
100      Fx1(S1187561A)
200      Fx1(S7765432B)
In a first implementation, the label is generated when datasets are imported into a client module. For example, with reference to step 204 of Figure 2, a client module x1 for the database 106a tabulates a generated label for each dataset attributable to identification data as follows:
Table 2: Data organisation in client module x1

Label    Identification data    Dataset (age group, gender, cost)
100      S1187561A              …
200      S7765432B              …
while a client module xn for the database 106n tabulates a generated label for each dataset attributable to identification data as follows.
Table 3: Data organisation in client module xn

Label    Identification data    Dataset (length of stay, diagnosis code, hospitalisation cost)
455      S1187561A              …
610      S7658920C              …
The database schema for the database 106a comprises the data fields "age group", "gender" and "cost", while the database schema for the database 106n comprises the data fields "length of stay", "diagnosis code" and "hospitalisation cost".
The following operation flow may then be executed to achieve labelled data integration:
Labelled Data Integration Protocol
a) Client module x1 sends (100, Fx1(S1187561A)), (200, Fx1(S7765432B)) to the central platform 104. That is, the client module x1 transmits each of its stored identifiers, encrypted with the storage key x1. Each of the encrypted identifiers Fx1(IDx1) is accompanied with its corresponding label.
b) Client module xn sends (455, Fxn(S1187561A)), (610, Fxn(S7658920C)) to the central platform 104. That is, the client module xn transmits each of its stored identifiers, encrypted with the storage key xn. Each of the encrypted identifiers Fxn(IDxn) is accompanied with its corresponding label.
c) The central platform 104 further encrypts with key s: (100, Fs(Fx1(S1187561A))), (200, Fs(Fx1(S7765432B))), (455, Fs(Fxn(S1187561A))), (610, Fs(Fxn(S7658920C))). As such, the central platform 104 performs central encryption with the central key s to produce multiply encrypted identifiers Fs(Fxi(IDi)).
d) The mapper module 102 receives the randomised strings from the central platform 104 and the storage keys x1 and xn. The mapper module 102 removes the inner encryptions: (100, Fs(S1187561A)), (200, Fs(S7765432B)), (455, Fs(S1187561A)), (610, Fs(S7658920C)).
e) The mapper module 102 performs mapping by locating the matching multiply encrypted identifiers: Fs(S1187561A) = Fs(S1187561A)
f) Common labels are identified: (100, 455)
g) The mapper module 102 sends the common label 100 to client module x1 and the common label 455 to client module xn
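An end-to-end toy run of steps a) to g) can be sketched by reusing P, Q, F, hash_to_group and remove_layer from the earlier sketch; the keys are arbitrary illustrative values, and the toy modulus is so small that hash collisions are possible (a realistic group size rules this out):

```python
from collections import defaultdict

x1, xn, s = 11, 22, 33   # storage keys of client modules x1 and xn; central key s

# a), b) each client module sends its labelled, storage-encrypted identifiers
client_x1 = [(100, F(x1, hash_to_group("S1187561A"))),
             (200, F(x1, hash_to_group("S7765432B")))]
client_xn = [(455, F(xn, hash_to_group("S1187561A"))),
             (610, F(xn, hash_to_group("S7658920C")))]

# c) the central platform adds the outer layer with the central key s
blinded = [("x1", lbl, F(s, ct)) for lbl, ct in client_x1] + \
          [("xn", lbl, F(s, ct)) for lbl, ct in client_xn]

# d) the mapper removes each inner (storage) layer with the received storage key
keys = {"x1": x1, "xn": xn}
stripped = [(src, lbl, remove_layer(keys[src], ct)) for src, lbl, ct in blinded]

# e), f) matching multiply encrypted identifiers share the same residual value
groups = defaultdict(list)
for src, lbl, ct in stripped:
    groups[ct].append((src, lbl))
common = [entries for entries in groups.values() if len(entries) > 1]
# g) common == [[("x1", 100), ("xn", 455)]]: label 100 goes to x1, 455 to xn
```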
In a second implementation, only matching labels are sent to a server. The protocol used in this second implementation is outlined below.
Labelled Data Integration Protocol
Key Setup
(1). A user C generates a random key x and the server S in the central platform generates a random key y.
Randomization
(2). C performs randomisation on each IDi of a labelled set {(labeli, IDi)}, i ∈ [1, n]:
(a) for each IDi, compute ai = Fx(IDi);
(3). C submits to S the randomised sequence of IDs {(labeli, ai)}, i ∈ [1, n].
Blinding & Matching
(4). S runs Blinding (see steps 304 to 307 of Figure 3) & Proving (described in greater detail under Zero Knowledge Proof of Correctness Protocol) on {(labeli, ai)}, i ∈ [1, n], and sends the proof p of the zero knowledge protocol to C.
(5) C runs Verification (described in greater detail under Zero Knowledge Proof of Correctness Protocol) on p.
(6). Given two sets of tuples {(labeli, bi)}, i ∈ [1, n], from C1 and {(label'j, b'j)}, j ∈ [1, n'], from C2, perform intersection such that:
• if bi = b'j for some i ∈ [1, n] and j ∈ [1, n'], store labeli;
• discard all non-matching tuples.
C1 sends the set of matching labels {labeli} to S, for i ∈ [1, r], r ≤ n. C2 similarly performs the matching and sends its set of matching labels {label'j} to S, for j ∈ [1, r'], r' ≤ n'.
In Step (6), C2 performs the same protocol as C1; the same applies in scenarios where there are more than two users. The server S stores all the lists of matching labels so that, when a user wishes to query for the records of the matched PIIs from other users, it is possible to cross-verify that the records returned are indeed those of the matched PIIs. This is performed by the server S and the users sending the set of matched labels to the querying user.
The second implementation may be used where client terminals store their entire databases containing all attributes, less the identifiers, at the server S. This is a central storage setting where all data providers (e.g. client terminals C1 and C2) share their datasets (e.g. age, salary, zip code, insured amount) but not the identifiers (e.g. national registration identity number) to which the datasets belong. Once the datasets containing the attributes but not the identifiers are centrally stored, client terminals C1 and C2 can perform matching based on the identifiers in their datasets using the above outlined protocol of the second implementation. Once the matched labels are obtained, client terminal C1 sends the matched labels to the server S for the server S to correctly combine the stored datasets obtained from client terminals C1 and C2.
To illustrate, client terminals C1 and C2 may have datasets as tabulated below.
Table 4: Client terminal C1 dataset
Table 5: Client terminal C2 dataset
Client terminal C1 sends the illness data to the server S while client terminal C2 sends the insurance data to the server S. Since each of the client terminals C1 and C2 incorporates the mapper module 102 (see Figure 3), each will possess multiply encrypted identifiers derived from identifiers stored in both the client terminals C1 and C2. Client terminal C1 identifies the common identities (8229012 and 8056798) from their multiply encrypted format. Client terminal C1 demarcates a label for each of the matching multiply encrypted identifiers as a common label. Client terminal C1 then transmits the common labels to the server S to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, as illustrated below. Table 6: Server S dataset
From the above, it will be appreciated that the datasets stored in the server S are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifiers are derived.
Client terminal C2 can also similarly perform the matching and send its labels for the server S to validate that the matched labels submitted by C1 and C2 are indeed correct, i.e. if the labels by C1 = the labels by C2.
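A small sketch of this linking-and-validation step at the server S follows; labels and field values are illustrative placeholders:

```python
def link_and_validate(labels_c1, labels_c2, c1_data, c2_data):
    """Server S: accept only if both clients report the same matched labels,
    then link the stored datasets that share a matched label."""
    if set(labels_c1) != set(labels_c2):
        raise ValueError("matched labels from C1 and C2 disagree")
    return {lbl: {**c1_data[lbl], **c2_data[lbl]} for lbl in labels_c1}

c1_data = {"L1": {"illness": "..."}, "L2": {"illness": "..."}}                # from C1
c2_data = {"L1": {"insured_amount": "..."}, "L2": {"insured_amount": "..."}}  # from C2
joined = link_and_validate(["L1", "L2"], ["L2", "L1"], c1_data, c2_data)
```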
The server S or the central platform 104 establishes knowledge of the central key, through the use of a zero knowledge protocol (ZKP), to each of the databases 106a and 106n storing one or more of the identifiers from which the multiply encrypted identifiers are derived. The ZKP used by the server S or the central platform 104 is outlined below:
Zero Knowledge Proof of Correctness Protocol
Blinding & Proving
(5). S blinds each received ai by computing bi = Fy(ai).
(6). S returns {bi}, i ∈ [1, n], to C.
(7). C and S jointly compute a zero-knowledge proof p of correctness:
(a) C generates and sends ri, i ∈ [1, n], to S, where ri ∈R [1, 2^l], with l the security parameter;
(b) S computes U = ∏i ai^ri and V = ∏i bi^ri. This serves to determine a subset of the multiply encrypted identifiers for use to authenticate S;
(c) S picks a random element s from [1, q − 1], q being the group order, and computes T1 = U^s and T2 = a1^s. That is, the server S processes the multiply encrypted identifiers with the random element s generated by the server S;
(d) S sends (U, V, T1, T2) to C;
(e) C randomly selects and sends c to S, i.e. C sends a random element c from the database 106a and 106n seeking to authenticate S;
(f) S computes t = s − c·y and outputs the proof as p = t.
Verification
(8). C verifies p:
(a) for all ai and bi, compute U = ∏i ai^ri and V = ∏i bi^ri;
(b) compute evaluation parameters T'1 = U^t · V^c and T'2 = a1^t · b1^c. The evaluation parameters T'1 and T'2 are thus obtained from processing the stored identifiers in the database 106a and 106n, from which the subset of item (7)(b) is obtained, with the proof p. These evaluation parameters T'1 and T'2 are used to validate the proof p of the zero knowledge protocol;
(c) output true if T'1 = T1 and T'2 = T2.
From step (7), the server S or the central platform 104 is configured to generate a random element s and receive a random element c associated with the database 106a and 106n seeking to authenticate the server S or the central platform 104. The server S or the central platform 104 then computes a proof p of the zero knowledge protocol based on the random element s generated by the server, the received random element c associated with the database and the central key y.
From step (8), the database 106a and 106n seeking to authenticate the server S or the central platform 104 is configured to receive from the server: a subset of the multiply encrypted identifiers derived from identifiers stored at the database 106a and 106n, the subset having been processed by the random element s generated by the server S or the central platform 104; and the proof p of the zero knowledge protocol. The database 106a and 106n processes the stored identifiers from which the subset is obtained with the proof p to obtain evaluation parameters. The database 106a and 106n validates the proof p of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
Further detail is provided below on the server S construction of the ZKP p of correctness (in terms of computation performed on the randomised dataset submitted by the client) as stated in steps (7)-(8) above. These steps are advantageous in avoiding the high communication/computation overhead of computing a ZKP for every pair (ai, bi) separately.
By step (5) of the protocol, the server S knows ai = Fx(IDi) and bi = Fy (ai) = Fxy(IDi) for all i in a submitted dataset; on the other hand, at step (8), the client has knowledge of all elements ai and bi as well. The server and the client can thus perform a ZKP protocol. The sketched proof is as below.
The server S can prove to the client its knowledge of the key y (that was used for blinding) such that V = U^y, without revealing the secret exponent y to the client, where U = ∏i ai^ri and V = ∏i bi^ri are computed by the client. The values ai are provided by the client, and the values bi are provided by the server. Similarly, the server S can prove to the client its knowledge of the key y' such that b1 = a1^y', without revealing the secret exponent y' to the client, where a1 is provided by the client, and b1 is provided by the server. y = y' can be proven using the proof technique for equality of discrete logarithms, i.e. log_U(V) = log_a1(b1).
Exponents ri are uniformly randomly chosen by the client from an exponentially (in the security parameter) large space, after the values of ai, bi and y = log_a1(b1) are fixed. Let g be a generator in the group and let wi = bi · ai^(−y). For each wi, there exists a unique exponent ei such that wi = g^ei. This results in V · U^(−y) = ∏i wi^ri = g^(∑i ri·ei).
That is, for a fixed (unknown) vector (e1, ..., en), the client chooses a vector (r1, ..., rn) with each ri randomly chosen from [1, 2^l]; the inner product between these two vectors is zero (i.e. the two vectors are perpendicular) with non-negligible probability only if the vector (e1, ..., en) is a zero vector. Thus bi = ai^y for each i.
The zero-knowledge nature of the ZKP protocol for n elements can be proved in the same way as in "Ivan Damgard: On Σ-Protocols, http://www.cs.au.dk/~ivan/Sigma.pdf, Section 5" and "https://courses.cs.ut.ee/MTAT.07.003/2016_fall/uploads/Main/0902-proof-of-knowledge-for-double-exponent.pdf".
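The proof of steps (7)-(8) is, in effect, a Chaum–Pedersen-style proof of equality of discrete logarithms. A minimal sketch, reusing the toy group (P, Q), F and hash_to_group from the earlier sketch (all names are illustrative assumptions), is:

```python
import secrets

def prove_and_verify(a_list, b_list, y):
    """One run of steps (7)-(8): S proves knowledge of y with b_i = a_i^y."""
    n = len(a_list)
    # (7a) C picks random exponents r_i
    r = [secrets.randbelow(Q - 1) + 1 for _ in range(n)]
    # (7b) aggregate U = prod a_i^{r_i}, V = prod b_i^{r_i}
    U = V = 1
    for ai, bi, ri in zip(a_list, b_list, r):
        U = U * pow(ai, ri, P) % P
        V = V * pow(bi, ri, P) % P
    # (7c) S commits with a random s: T1 = U^s, T2 = a_1^s
    s = secrets.randbelow(Q - 1) + 1
    T1, T2 = pow(U, s, P), pow(a_list[0], s, P)
    # (7e) C challenges with c; (7f) S responds with t = s - c*y mod Q
    c = secrets.randbelow(Q - 1) + 1
    t = (s - c * y) % Q
    # (8) C recomputes T'1 = U^t * V^c and T'2 = a_1^t * b_1^c and compares
    T1_check = pow(U, t, P) * pow(V, c, P) % P
    T2_check = pow(a_list[0], t, P) * pow(b_list[0], c, P) % P
    return T1_check == T1 and T2_check == T2

x, y = 123, 456
a_list = [F(x, hash_to_group(i)) for i in ("S1187561A", "S7765432B")]
b_list = [F(y, a) for a in a_list]      # S's blinding with the central key y
assert prove_and_verify(a_list, b_list, y)
```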
A PII to which a dataset in a database belonging to the system 100 of Figure 1 is attributable may not always be atomic. The PII may instead contain a few segments. An example is IPv4 addresses, where an address (e.g. 192.168.1.1) can be divided into four segments, each matching an octet of its dotted-format representation (this can be applied to IPv6 addresses in a similar manner). The segments of an IP address are crucial for network analysis, for instance, to ascertain whether a group of IP addresses originated from an identical subnet. Losing this structure means the dataset loses utility and is of limited value. Any method that randomises (or masks) such PII must be able to preserve its structure. An approach that directly randomises each segment, treating every segment as a single identity attribute, is inefficient: if a PII has W segments, randomisation and blinding would have to be performed on all W segments in order to preserve its structure.
A more efficient, simpler and effective protocol is described in the Structure Preserving Data Integration protocol below. The client chooses the segments to be randomised and creates a list L containing identifiers of these selected segments, which serve as initial segments of identification data for the Structure Preserving Data Integration protocol. Only the segments in the list are randomised and blinded, in stages. The intuition is that in the majority of scenarios the resulting dataset of the first match (or intersection) would contain fewer records compared to the original datasets.
With reference to Figures 2 and 3, at least one of the identifiers IDi stored in the databases 106a and 106n comprises such an initial segment of identification data. The consolidation of step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the initial segments of identification data that match. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the initial segments of identification data that match is obtained.
A second chosen or further segment of the identification data may be processed next, but on the smaller dataset originating from the first intersection. That is, the mapper module 102 locates the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data. The results of the match are returned to each of the databases 106a and 106n storing the identification data from which each of the further segments of identification data that match is obtained. Subsequent mapping can be carried out for the remaining segments, depending on how fine-grained the matching is desired to be. Any key homomorphic encryption (or commutative encryption) scheme can be deployed, such as the one discussed in international application no. PCT/SG2017/050575. In order for the server in the central platform 104 to perform the intersection in stages, the technique discussed above for the Labelled Data Integration Protocol is adopted, as shown in Step (6) of the Structure Preserving Data Integration protocol discussed below. This is so that the server is able to learn the set of matched PIIs for every stage.
The Structure Preserving Data Integration protocol is outlined below:
Key Setup
(1). A user C generates a random key x and the server S in the central platform 104 generates a random key y.
Generalization & Randomization
(2). C divides the identity attribute into W segments: IDi = (IDi,1, ..., IDi,W).
(3). C also prepares an ordered list L = (lh, ..., lm), h, m ∈ [1, W], h ≤ m.
(4). C performs randomisation on each IDi of the dataset:
(a) for each IDi,w, compute ai,w = Fx(IDi,w), w ∈ [h, m];
(5). C submits to S the list L and the randomised dataset {(ai,h, ..., ai,m)}, i ∈ [1, n]. The list L is also given to other participating clients.
Blinding & Matching
(6). S creates a unique tag ki for each record i in the dataset, such that ki indexes the randomised record (ai,h, ..., ai,m).
(7). For each lv in L, v ∈ [h, m], S and C jointly perform matching:
(a) S runs Blinding & Proving (see Zero Knowledge Proof of Correctness Protocol discussed above) on inputs (ki, ai,lv), for all i ∈ [1, n];
(b) S returns {(ki, bi,lv)}, i ∈ [1, n], together with the proof p;
(c) C verifies p. If p is valid, C performs the following (otherwise C aborts):
(i) for each bi,lv in the tuple (i ∈ [1, n]), extract (ki, bi,lv);
(d) C runs the Intersection step (see the next heading below) to find matched blinded segments;
(e) S retrieves records for tags in {ki}, i ∈ [1, r];
(f) S restricts the working dataset to these matched records for all i ∈ [1, r], and assigns n := r.
(8). S returns the final matched set {(ki, bi,lv)} to C for verification and integration.
Intersection
(7a). Given two sets of tuples {(ki, bi,lv)} from C1 and {(k'j, b'j,lv)} from C2, perform intersection such that:
• if bi,lv = b'j,lv for some i ∈ [1, n] and j ∈ [1, n'], store ki;
• discard all non-matching tuples.
C1 sends the set of matching tags {ki} to S, for i ∈ [1, r], r ≤ n.
The Structure Preserving Data Integration protocol works for any PII that can be divided into segments in order to preserve its structure. Two use cases of the structure-preserving data integration protocol are discussed below.
IP Addresses
Using letters, an IPv4 address may be represented in dotted format as (A.B.C.D). Every letter represents an octet of the address (e.g. A := 192).
In order to preserve the structure of the address, the IP address is divided into four segments as H(A), H(A.B), H(A.B.C) and H(A.B.C.D), where H(.) is a cryptographic hash function. This representation preserves prefixes of an IP address and is the only way that preserves the structure of an IP address without leaking information. Based on the above representation, the client, for example, might wish to find a mapping on segment H(A.B), and at a later stage, H(A.B.C). The client C first randomises all four segments and generalises the non-identity attributes, in order to protect against the server learning the segments of the IP addresses. The client C then creates a list L that contains two identifiers, one for H(A.B) and another for H(A.B.C). Next, the client C and the server S choose the identifier of H(A.B), i.e. an initial segment, from the list, retrieve H(A.B), and jointly randomise and blind it to find the matching records in the datasets contributed by other participating clients. Then, in order to match and integrate records on H(A.B), the client C and the server S need only randomise and blind this initial segment. At a later stage, in order to match based on H(A.B.C), the client C and the server S would only need to map from the results of matching H(A.B), which may contain fewer records compared to the full dataset.
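The nested-prefix hashing can be sketched as follows (a minimal illustration; in the protocol each hash would additionally be randomised with Fx before leaving the premises):

```python
import hashlib

def ip_segments(addr):
    """Return H(A), H(A.B), H(A.B.C), H(A.B.C.D) for a dotted IPv4 address."""
    octets = addr.split(".")
    return [hashlib.sha256(".".join(octets[:i + 1]).encode()).hexdigest()
            for i in range(len(octets))]

# Two addresses in the same /16 share their first two segment hashes:
a, b = ip_segments("192.168.1.1"), ip_segments("192.168.7.9")
assert a[0] == b[0] and a[1] == b[1] and a[2] != b[2]
```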
Solutions exist that use a lookup table of randomised IP addresses or a key that produces pseudonymized IP addresses. The lookup table or key must be shared between the parties that perform pseudonymization. Such mechanisms are not suitable since each participant will be able to learn the plain datasets given the lookup table or key.
Multi-column IDs
In addition to IP addresses, it is possible that multiple columns in a database are used to represent an identity attribute, e.g. name, email and phone number. Instead of randomising all columns as one attribute, coarse-grain or fine-grain randomisation of multiple columns is used. As in the previous IP addresses use case, a client may define each column as a segment and create the list L containing the segments that are to be processed. In more detail, and with reference to Figures 2 and 3, when at least one of the identifiers IDi comprises different identification data with at least one in common with another of the identifiers, the consolidation of step 307 (where the mapper module 102 consolidates the multiply encrypted identifiers that match after the removal of storage encryption) includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data. The results of the match are returned to each of the databases 106a and 106n storing the common identification data.
Leakage Profile
The leakage of information due to in-stage interactions between the client C and the server S was modelled. Specifically, the server S and the client C learn the sets of matched randomised PIIs and their sizes for every stage of matching. This is because the clients C submit ki to the server at the end of each iteration. It was found that, given the extra leakage of information, the server S and the client C are not able to learn any information of the underlying identity attribute. An interactive stateful leakage function Lf was defined which, given the dataset as input, outputs n the size of the dataset, the number of segments W and the list of segments L to be processed. Lf also outputs an empty list Imatch. For every stage lv of privacy-preserving matching of two or more datasets, Lf registers the size rlv of the subset of the matched attributes, and the set of matched randomised IDs in Imatch, for i ∈ [1, r]. This means a leakage profile of (n, L, Imatch), where the server learns the size and the matched randomised IDs of each intersection.
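A toy rendering of this stateful leakage function follows; the class and field names are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class LeakageProfile:
    n: int                # size of the dataset
    W: int                # number of segments per identity attribute
    L: list               # ordered list of segments to be processed
    Imatch: list = field(default_factory=list)   # matched randomised IDs per stage

    def register_stage(self, matched_ids):
        """Record r_lv (the matched subset size) and the matched randomised IDs."""
        self.Imatch.append((len(matched_ids), list(matched_ids)))

profile = LeakageProfile(n=1000, W=4, L=["H(A.B)", "H(A.B.C)"])
profile.register_stage(["b_3", "b_17"])   # stage 1: the server sees size 2 and the IDs
```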
Leakage from intersection of segments
The segments (e.g. in the case of a repetitive 192 or 192.168) are not unique, given that key homomorphic encryption is deterministic. The server S is able to learn how many IDs have the same segments. This information may also enable the server S to learn that, for example, two datasets may be of the same domain, by observing the received ki, if the lists L submitted by the two participants are the same. A potential mitigation is to adopt a prefix-preserving pseudonymisation mechanism that partitions the set of IP addresses, performs a migration to hide repetitive segments and replicates the IP addresses into a multi-view setting, so that the recipient has no idea which view is the real set of IP addresses.
Figure 4 shows operation flow for the query analyser 110 of Figure 1 when executing a distributed query protocol.
The distributed query protocol divides a query into sub-queries that are directed to the datasets of the respective organisations. Only record labels and aggregated results will be sent out from the datasets. The results will be merged based on mapped record labels. The distributed query protocol ensures that raw data never leaves the respective organisations' premises.
The distributed query protocol, executed by the query analyser 110, interrogates fields under which data is organised in each of the databases 106a and 106n. In one implementation, the mapper module 102 receives the schema of fields together with the multiply encrypted identifiers, such as in step 305 of Figure 3. When the query analyser 110 receives a dataset query, the query analyser 110 determines which of the schema of fields in the mapper module 102 has a format corresponding to search parameters of the received dataset query. The query analyser 110 is also able to divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters. The sub-queries are then distributed amongst the databases 106a and 106n, so that each of the databases 106a and 106n receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database 106a and 106n. The query analyser 110 is also configured to: merge responses from each of the databases 106a and 106n to their received sub-query and return the merged response to the dataset query received by the query analyser 110.
The steps for performing a distributed query are described below with reference to Figure 4.
In step 401, a customer sends a query on a dataset (a "dataset query").
In step 402, the Server Frontend (FE) 114 posts the dataset query to the Server API 116. The server API 116 relays the dataset query to the query analyser (QA) 110. In step 403, the QA 110 breaks down the dataset query into sub-queries. The sub-queries are sent to user apps for relay to the databases 106a and 106n.
In step 404, each of the user apps queries their on-premise database 106a and 106n. In step 405, the user apps send results back to the QA 110.
In step 406, the QA 110 merges results from the user apps. In step 407, the QA 110 sends the merged results back to the Server API 116 for relay back to the Server FE 114. In step 408, the Server FE 114 displays the merged results.
Tables 7 and 8 below each provide an example of data organised under a schema of fields found in a database.
Table 7: Sample schema of fields
Table 8: Sample schema of fields
Table 9 shows an example of a dataset query and its division into sub-queries based on the schema of fields shown in Table 7 and Table 8.
Table 9: Sample dataset query and sub-queries
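For illustration, reusing the schemas of the databases 106a and 106n from the Figure 2 example (the snake_case field names and the split_query function are assumptions of this sketch), a dataset query may be divided into sub-queries as follows:

```python
SCHEMAS = {
    "106a": {"age_group", "gender", "cost"},
    "106n": {"length_of_stay", "diagnosis_code", "hospitalisation_cost"},
}

def split_query(parameters):
    """Divide a dataset query into sub-queries keyed by destination database."""
    sub_queries = {}
    for db, fields in SCHEMAS.items():
        subset = {k: v for k, v in parameters.items() if k in fields}
        if subset:
            sub_queries[db] = subset
    return sub_queries

# A query spanning both schemas is split into one sub-query per database:
print(split_query({"age_group": "30-39", "diagnosis_code": "J18"}))
# {'106a': {'age_group': '30-39'}, '106n': {'diagnosis_code': 'J18'}}
```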
Figure 5 shows operation flow for a distributed machine learning protocol used by the machine learning module 112. The machine learning module allows a data consumer to run analysis on the databases 106a and 106n that reside on premise. The distributed machine learning (DML) protocol allows the data consumer to request a machine learning model to be computed on datasets residing in different organisations.
The machine learning module 112 is trained using vector values of data belonging to identifiers stored in the databases 106a and 106n. Recalling from Figures 2 and 3 that common identification data serves to integrate datasets, the identifiers used to train the machine learning module 112 are those derived from matching multiply encrypted identifiers consolidated in the mapper module 102. Parameters used by the machine learning module 112 for outcome prediction are updated based on an aggregation of the received vector values. The updating of the parameters is reiterated until convergence criteria are met. The machine learning module 112 is then configured to, after having been trained, reference data from one or more of the databases 106a and 106n when responding to received queries.
The steps for performing distributed machine learning are described below.
In step 501, a Customer selects machine learning model type to be trained on datasets.
In step 502, the Server Frontend (FE) 114 posts the selected machine learning model to the Server API 116 which relays the request to the Distributed ML module (DML) 112.
In step 503, the DML module 112 requests dataset dimensions and response variable from user apps. Each of the user apps queries its on-premise database 106a and 106n;
In step 504, the user apps relay back dataset dimensions and response variable to the DML module 112;
In step 505, the DML module 112 initialises parameter values used by the machine learning module 112 for outcome prediction and sends the relevant parameter values to each user app, along with machine learning model type information;
In step 506, each user app computes local aggregated results based on the model type and uploads the aggregated results to the DML module 112;
In step 507, the DML module 112 uses the local aggregated results from the donor apps to update the parameter values. If the convergence criteria are satisfied or the maximum number of iterations has been reached, it signals protocol finish to the user apps and performs step 509 below. Otherwise, the DML module 112 relays the updated parameter values to the user apps;
In step 508, the user apps receive updated parameters. If protocol finish signal has not been received, step 506 is repeated;
In step 509, the DML module 112 relays back model parameters to the Server API 116;
In step 510, the Server Frontend 114 displays results.
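Steps 503 to 508 follow the shape of gradient aggregation over on-premise partitions. A minimal sketch with a linear model follows; the model choice, names and learning rate are illustrative assumptions, not the patented algorithm:

```python
import numpy as np

def local_aggregate(X, y, w):
    """Step 506: each user app computes its local gradient aggregate on premise."""
    return X.T @ (X @ w - y), len(y)

def train(parties, dim, lr=0.1, max_iters=1000, tol=1e-6):
    w = np.zeros(dim)                                  # step 505: initialise parameters
    for _ in range(max_iters):
        grads, counts = zip(*(local_aggregate(X, y, w) for X, y in parties))
        step = lr * sum(grads) / sum(counts)           # step 507: aggregate and update
        w -= step
        if np.linalg.norm(step) < tol:                 # convergence criteria met
            break
    return w

# Two on-premise parties holding rows of a joint dataset (matched by labels):
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 3)), rng.normal(size=(60, 3))
true_w = np.array([1.0, -2.0, 0.5])
w = train([(X1, X1 @ true_w), (X2, X2 @ true_w)], dim=3)
assert np.allclose(w, true_w, atol=1e-3)
```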
Returning to Figure 1, the system 100 provides a data contributor with an access control mechanism and a configurable risk assessment module, so that the data contributor may decide whether to reveal certain attributes of its database for query or learning by the data consumer. These mechanisms and the configurable risk assessment module can also be used to set the price plan for the data consumer to perform query or analysis on more sensitive information. The distributed query and distributed machine learning steps can be performed independently from the intersection mechanism.
Upon determining a common identity from its multiply encrypted form, the mapper module 102 is configured to query a registration directory storing an address of each of the databases 106a and 106n. With reference to step 309 of Figure 3, this allows the mapper module 102 to obtain the address of each of the databases 106a and 106n to which matching results should be sent.
The system 100 of Figure 1 uses multiple encryption that comprises storage encryption and a central encryption. The removal of the storage encryption results in matching to be performed on identifiers encrypted by the central encryption. This central encryption is performed using a central key y at a server from which the mapper module 102 receives the multiply encrypted identifiers.
The mapper module 102 is configured to return the results of the match through a server from which the mapper module 102 receives the multiply encrypted identifiers. In one implementation, the mapper module 102 is integrated with the server. In another implementation, the mapper module 102 is integrated into a client terminal that integrates one or more of the databases 106a and 106n. It will be appreciated that integration refers to these components being part of a designated network and does not necessarily require for them to be internally situated.
In summary, the present invention provides a system for privacy-preserving data sharing and integration, which enables organisations to match, query and analyse their datasets without the datasets leaving the premises of the organisations. The system is initiated by a user registering to the system and an administrator managing the list of users and the applications. A data integration protocol enables matching of personally identifiable information (PII) in a privacy-preserving manner. Users (who can be data donors and/or data consumers) upload and integrate their datasets by performing the data integration protocol together with a central platform to randomise and map/intersect their PII information using the mapper module of the user and the server application. Once mapped, the information (or labels) on the sets of matched PIIs are returned to each respective user. Coupled with the data integration protocol is a zero-knowledge proof protocol that allows verification of the correctness of the attributes processed by the server during data integration. Any modification by the server on the attributes will be detected through the zero-knowledge proof protocol.
With the information of the matched PIIs, distributed query and distributed machine learning protocols enable computations on the on-premise datasets. The data integration protocol is flexible in that it caters for a PII comprising more than one segment (e.g. IP addresses) or different identification data (e.g. multi-column IDs), using structure preserving techniques. The data integration protocol is also labelled in a way that, even when the datasets are on premise, organisations are able to retrieve or refer to records of matched PIIs.
While this invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes can be made and equivalents may be substituted for elements thereof, without departing from the spirit and scope of the invention. In addition, modification may be made to adapt the teachings of the invention to particular situations and materials, without departing from the essential scope of the invention. Thus, the invention is not limited to the particular examples that are disclosed in this specification, but encompasses all embodiments falling within the scope of the appended claims.

Claims

1. A system for mapping common identities stored in separate databases, the system comprising: a mapper module configured to:
receive multiply encrypted identifiers, where each is derived from an identifier having undergone multiple encryption;
receive a storage key used to perform storage encryption at each database storing one or more of the identifiers;
remove the storage encryption from each of the multiply encrypted identifiers using its respective storage key;
consolidate the multiply encrypted identifiers that match after the removal of the storage encryption; and
transmit results of the match.
2. The system of claim 1, wherein the results of the match are returned to each of the databases storing one or more of the identifiers from which the matching multiply encrypted identifiers are derived.
3. The system of claim 1 or 2, wherein when at least one of the identifiers comprises an initial segment of identification data, the consolidation includes the multiply encrypted identifiers derived from the initial segments of identification data that match; and the results of the match are returned to each of the databases storing the identification data from which each of the initial segments of identification data that match is obtained.
4. The system of claim 3, wherein when at least one of the identifiers comprises a further segment of the identification data, the mapper module is further configured to:
locate the multiply encrypted identifiers derived from the further segment of the identification data that match from the consolidation performed on the initial segments of identification data; and
return the results of the match to each of the databases storing the identification data from which each of the further segments of identification data that match is obtained.
5. The system of any one of the preceding claims, wherein when at least one of the identifiers comprises different identification data with at least one in common with another of the identifiers, the consolidation includes the multiply encrypted identifiers derived from the identifiers comprising the common identification data; and the results of the match are returned to each of the databases storing the common identification data.
6. The system of any one of the preceding claims, wherein the results of the match indicate which of the one or more identities stored in one of the databases is also stored in one or more of the other databases.
7. The system of any one of the preceding claims, wherein the mapper module is further configured to:
query a registration directory storing an address of each of the databases; and
obtain the address of each of the databases to which the matching results should be sent.
8. The system of any one of the preceding claims, wherein the multiple encryption comprises the storage encryption and a central encryption, whereby the removal of the storage encryption results in the matching being performed on identifiers encrypted by the central encryption, and wherein the central encryption is performed using a central key at a server from which the mapper module receives the multiply encrypted identifiers.
9. The system of claim 8, further comprising the server, wherein the server is configured to: establish knowledge of the central key, through the use of a zero knowledge protocol, to each of the databases storing one or more of the identifiers from which the multiply encrypted identifiers are derived.
10. The system of claim 9, wherein the server is further configured to:
generate a random element (s);
receive a random element (c) associated with the database seeking to authenticate the server; and
compute a proof of the zero knowledge protocol based on the random element (s) generated by the server, the received random element (c) associated with the database and the central key.
11. The system of claim 10, wherein validation of the proof of the zero knowledge protocol comprises having the database seeking to authenticate the server being configured to:
receive from the server:
a subset of the multiply encrypted identifiers derived from identifiers stored at the database, the subset having been processed by the random element generated by the server; and
the proof of the zero knowledge protocol;
process the stored identifiers from which the subset is obtained with the proof to obtain evaluation parameters; and
validate the proof of the zero knowledge protocol in response to the evaluation parameters matching the received subset.
12. The system of any one of the preceding claims, wherein each of the multiply encrypted identifiers is accompanied with a schema of fields under which data is organised in the database storing the identifier from which the multiply encrypted identifier is derived, wherein the system further comprises a query analyser configured to:
interrogate the schema of fields, in response to receipt of a dataset query, to determine which of them have a format corresponding to search parameters of the dataset query.
13. The system of claim 12, wherein the query analyser is further configured to:
divide the dataset query into one or more sub-queries with each containing a subset of the corresponding search parameters; and
distribute the sub-queries amongst the databases, so that each of the databases receives the sub-query with parameters of format that corresponds to the schema of fields under which data is organised in the database.
14. The system of claim 13, wherein the query analyser is further configured to:
merge responses from each of the databases to their received sub-query; and
return the merged response to the dataset query.
15. The system of any one of the preceding claims, wherein the system further comprises a machine learning module configured to, when being trained:
receive vector values of data belonging to the identifiers from which the matching multiply encrypted identifiers are derived;
update parameters used by the machine learning module for outcome prediction based on an aggregation of the received vector values; and
reiterate the updating of the parameters until a convergence criterion is met.
16. The system of claim 15, wherein the machine learning module is configured to, after having been trained, reference data from one or more of the databases when responding to received queries.
17. The system of any one of the preceding claims, wherein the mapper module is further configured to return the results of the match through a server from which the mapper module receives the multiply encrypted identifiers.
18. The system of claim 17, wherein the mapper module is integrated with the server.
19. The system of any one of claims 1 to 17, wherein the mapper module is integrated into a client terminal that integrates one or more of the databases.
20. The system of any one of the preceding claims, wherein each of the multiply encrypted identifiers is accompanied with a label for locating the identifier within the database storing the identifier from which the multiply encrypted identifier is derived and wherein the mapper module is further configured to:
consolidate the labels for the multiply encrypted identifiers that match after the removal of the storage encryption for demarcation as common labels; and
return each of the common labels to each of the databases storing one or more of the identifiers located by the common label.
21. The system of any one of claims 1 to 17, further comprising client terminals with each incorporating the mapper module, wherein the mapper module of the client terminal is further configured to:
demarcate a label for each of the matching multiply encrypted identifiers as a common label; and
transmit the common labels to a server to link datasets stored therein that belong to the identifier from which the matching multiply encrypted identifier is derived, wherein the datasets are contributed by one or more databases storing identifiers from which the matching multiply encrypted identifiers are derived.
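For intuition on the verification described in claims 9 to 11, the following is a minimal Schnorr-style sketch, under the assumption (carried over from the matching sketch above) that central encryption is exponentiation by the central key. The one-shot interaction ordering and all names are simplifications for exposition; the specification's protocol may differ.

```python
# Schnorr-style proof that the server knows the central key k behind
# enc = h^k mod P, without revealing k. Illustrative sketch only; the
# commitment/challenge interaction is collapsed into single calls.
import secrets

P = 2 ** 127 - 1
ORDER = P - 1

def prove(h: int, central_key: int, challenge: int):
    """Server: commit with random element (s), then answer the challenge (c)."""
    s = secrets.randbelow(ORDER)
    commitment = pow(h, s, P)                        # identifier processed by s
    response = (s + challenge * central_key) % ORDER
    return commitment, response

def verify(h: int, enc: int, challenge: int, commitment: int, response: int) -> bool:
    """Database: recompute the evaluation parameter and compare."""
    return pow(h, response, P) == (commitment * pow(enc, challenge, P)) % P

k = secrets.randbelow(ORDER - 2) + 2   # stands in for the central key
h = 5                                  # stands in for a hashed identifier
enc = pow(h, k, P)                     # its central encryption, held by the database
c = secrets.randbelow(ORDER)           # random element (c) chosen by the database
a, z = prove(h, k, c)
assert verify(h, enc, c, a, z)         # valid iff h^z == a * enc^c (mod P)
```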

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
SG10201903227S 2019-04-11
SG10201903227S 2019-04-11

Publications (1)

Publication Number Publication Date
WO2020209793A1 (en) 2020-10-15

Family

ID=72752224

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2020/050210 WO2020209793A1 (en) 2019-04-11 2020-04-06 Privacy preserving system for mapping common identities

Country Status (1)

Country Link
WO (1) WO2020209793A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005109291A2 (en) * 2004-05-05 2005-11-17 Ims Health Incorporated Data record matching algorithms for longitudinal patient level databases
US20090150362A1 (en) * 2006-08-02 2009-06-11 Epas Double Blinded Privacy-Safe Distributed Data Mining Protocol
US20110060903A1 (en) * 2008-03-19 2011-03-10 Takuya Yoshida Group signature system, apparatus and storage medium
US20160147945A1 (en) * 2014-11-26 2016-05-26 Ims Health Incorporated System and Method for Providing Secure Check of Patient Records
WO2019098941A1 (en) * 2017-11-20 2019-05-23 Singapore Telecommunications Limited System and method for private integration of datasets

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183765A (en) * 2020-10-30 2021-01-05 浙江大学 Multi-source multi-modal data preprocessing method and system for shared learning
CN112492586A (en) * 2020-11-23 2021-03-12 中国联合网络通信集团有限公司 Encryption transmission scheme optimization method and device
CN112492586B (en) * 2020-11-23 2023-05-23 中国联合网络通信集团有限公司 Encryption transmission scheme optimization method and device
WO2023134055A1 (en) * 2022-01-13 2023-07-20 平安科技(深圳)有限公司 Privacy-based federated inference method and apparatus, device, and storage medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20787417

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20787417

Country of ref document: EP

Kind code of ref document: A1