EP4226267A1

EP4226267A1 - Method for evaluating the risk of re-identification of anonymised data

Info

Publication number: EP4226267A1
Application number: EP21810059.2A
Authority: EP
Inventors: Morgan GUILLAUDEUX; Olivier BREILLACQ
Original assignee: Big Data Sante
Current assignee: Big Data Sante
Priority date: 2020-10-07
Filing date: 2021-10-07
Publication date: 2023-08-16
Also published as: WO2022074301A1; US20230367901A1; CA3194820A1; FR3114892A1; WO2022074302A1; CA3194570A1; US20240005035A1; EP4226268A1

Abstract

The method of the invention provides a protection rate (txP2) representative of the risk of re-identification of data. In the case of a distance-based correspondence-seeking attack, the method comprises the steps of: a) linking an original dataset (EDO) comprising a plurality of original individuals (IO) with an anonymised dataset (EDA) comprising a plurality of anonymised individuals (IA); b) transforming (PCA, MCA, FAMD) the original individuals and the anonymous individuals into a Euclidean space; c) identifying, for each original individual, one or more nearest anonymous individuals based on a distance, by a method referred to as the "k-NN" method; and d) calculating the protection rate as a mean number (Nm) of anonymous individuals that are nearest to a considered original individual (IOi) and are not the valid anonymous individual corresponding to that original individual, the nearest anonymous individuals being those identified in step c) whose distance (dij) to the considered original individual is less than the distance between the considered original individual and the valid anonymous individual.

Description

Title of the invention: PROCEDURE FOR ASSESSING THE RISK OF RE-IDENTIFICATION OF ANONYMIZED DATA

The invention generally relates to the anonymization of sensitive data intended to be shared with third parties, for example, for research, analysis or exploitation purposes. More particularly, the invention relates to a method for evaluating the risk of re-identification of anonymized data.

In general, data is a source of performance for organizations and constitutes an important asset for them. Data provides crucial and valuable information for the production of quality goods and services, as well as for decision-making. It provides a competitive advantage that allows organizations to survive and stand out from the competition. The sharing of data, for example in the form of "open data", is today perceived as offering many opportunities, in particular for the extension of human knowledge, innovation and the creation of new products and services.

With digital technologies and technological innovations, data has become easy to share beyond the organizations that collect and store it for their own use. The digital transformation of society, with the rise of social networks, the generalization of online consumption, the dematerialization of services, etc., generates a phenomenon of massification of data known as "big data". This phenomenon has been amplified by the adoption, in a large number of countries, of "open data" public policies which promote the opening and sharing of data. The technologies that are currently available allow the storage, processing and analysis of this ever-growing mass of data and make it possible to extract knowledge and actionable information from it.

The data may contain personal data, which is subject to regulations relating to the protection of privacy. Thus, in general, the use, storage and sharing of personal data are subject in France to the European General Data Protection Regulation ("GDPR") and to the French law known as the "Informatique et Libertés" law. Certain data, such as those relating to the state of health, private and family life, assets and others, are particularly sensitive and must be subject to special precautions.

Several anonymization methods are known and used to process original data in such a way as to protect the privacy of individuals. Data anonymization can be defined as a process that removes the association between the identifying dataset and the data subject. The process of anonymization aims to prevent the singling out of an individual within a dataset, the linking of two records within the same dataset or between two distinct datasets when one of the records corresponds to individual-specific data, and the inference of information from the dataset. Thus, following an anonymization process, the data is presented in a form that should not allow individuals to be identified, even by combination with other data.

The anonymization method called "k-anonymization" is one of the most widely used methods. This method seeks to make each record of a data set indistinguishable from at least k-1 other records of this data set. The so-called "L-diversity" anonymization method is an extension of the "k-anonymization" method which allows better data protection by involving in each group of k records, called "k-group", the presence of at least L sensitive attribute values.
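As an illustration of these two definitions only, the following minimal Python sketch measures the k and L values actually achieved by a small table. The function name, the column names and the toy records are hypothetical and are not taken from any particular anonymization tool.

```python
from collections import defaultdict

def k_anonymity_and_l_diversity(records, quasi_identifiers, sensitive):
    """Return (k, l) for a list of dict records.

    k is the size of the smallest group of records sharing the same
    quasi-identifier values; l is the smallest number of distinct
    sensitive values found in any such group.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec[sensitive])
    k_value = min(len(values) for values in groups.values())
    l_value = min(len(set(values)) for values in groups.values())
    return k_value, l_value

# Hypothetical toy table: a generalized age band and a truncated postcode
# act as quasi-identifiers, the diagnosis is the sensitive attribute.
records = [
    {"age": "30-39", "zip": "750**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "750**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "690**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "690**", "diagnosis": "flu"},
]
k_value, l_value = k_anonymity_and_l_diversity(records, ["age", "zip"], "diagnosis")
print(k_value, l_value)
```

In this toy table every record is indistinguishable from one other record (k = 2), but the second group contains a single diagnosis value, so the table is only 1-diverse.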

In general, the main known anonymization algorithms modify data by deleting, generalizing or replacing personal information in individual records. An alteration of the informative content of the data may be the consequence of excessive anonymization. However, it is important that anonymized data remains quality data that retains a maximum of informative content. It is on this condition that anonymized data remain useful for the extraction of knowledge through analysis and reconciliation with other data.

The choice of the anonymization algorithm and the adjustment of its operating parameters are important to reconcile both the obligation to respect privacy and the need to preserve the usefulness of the data. In the state of the art, there is no known single anonymization algorithm that adapts to all contexts and that gives the best result every time. Several anonymization algorithms exist with varying degrees of reliability and contexts of applicability. The context of applicability of anonymization algorithms is characterized, among other things, by the type of data to be anonymized and by the desired use of the anonymized data.

The degree of reliability of the anonymization algorithm is directly related to the risk of re-identification of anonymized data. This risk includes the risk of individualization, i.e. the possibility of isolating an individual, the risk of correlation, i.e. the possibility of linking distinct sets of data concerning the same individual, and the risk of inference, that is, the possibility of inferring information about an individual. However, faced with the development of information technologies which make it possible to link data from different sources, it is almost impossible to guarantee anonymization which would offer a zero risk of re-identification. Different methods for evaluating the risk of re-identification of a set of data having undergone anonymization processing, also referred to as “metrics” below, have been proposed and provide quantitative evaluations of this risk.

Some of these metrics use a method called record-linkage, which is described by Robinson-Cox J. F. in the article "A record-linkage approach to imputation of missing data: Analyzing tag retention in a tag-recapture experiment", Journal of Agricultural, Biological, and Environmental Statistics 3(1), 1998, pp. 48-61. This method, which consists of comparing individuals from a data set that has undergone anonymization processing with an original data set, was initially developed to improve data quality by linking records relating to the same person that are held in distinct files. It also makes it possible to assess the robustness of anonymization processing in the face of a re-identification attempt in which the attacker is in possession of the set of anonymized data and of the original data of one or more individuals whose membership in the anonymized cohort he seeks to prove.

Deterministic linking methods, discussed by Gill L. in the article "Methods for Automatic Record Matching and Linking and Their Use in National Statistics", National Statistics Methodology Series no. 25, 2001, London: Office for National Statistics, assume the existence of a set of common variables in the files to be linked. The major problem with such an assumption is that an exact matching procedure for the values taken by the variables common to the individuals is not always possible, or sufficient, to establish a link between the records. This issue is addressed by Winkler W.E. in the article "Matching and record linkage", Cox B. G. (Ed.), Business Survey Methods, Wiley, New York, 1995, pp. 355-384. In reality, the variables common to two matched records exhibit a multitude of small or large differences, resulting from several factors, which prevent a perfect correspondence of the values of these variables.

To overcome the aforementioned problem, non-deterministic methods have been developed and make it possible to establish a link between two records, with a matching that can be probabilistic or based on a distance between individuals.

Probabilistic matching makes it possible to establish probabilities of links between records. Two records are considered linked when the probability of a link between them exceeds a certain threshold. Probabilistic matching is described by Fellegi I.P. et al., Jaro M.A., and Winkler W.E. in their respective articles "A theory of record linkage", Journal of the American Statistical Association 64, 1969, pp. 1183-1210, "Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida", Journal of the American Statistical Association 84, 1989, pp. 414-420, and "Advanced methods for record linkage", Proceedings of the American Statistical Association Section on Survey Research Methods, 1995, pp. 467-472. Distance-based matching is described by Pagliuca D. et al. in the publication "Some Results of Individual Ranking Method on the System of Enterprise Accounts Annual Survey, Esprit SDC Project", Deliverable MI-3/D2, 1999. In this approach, distances are established between individuals, and each individual is associated with the closest record or the second closest record, and is then said to be "linked to nearest" or "linked to 2nd nearest", respectively.
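The "linked to nearest" and "linked to 2nd nearest" notions can be sketched as follows in Python. This is only an illustrative sketch: the function names and the toy coordinates are hypothetical, a Euclidean distance is assumed, and the masked record at position i is assumed to be the counterpart of the original record at position i.

```python
import math

def rank_of_true_match(original, candidates, true_index):
    """Return 1 if the true record is the nearest candidate, 2 if it is
    the second nearest, and so on (Euclidean distance)."""
    order = sorted(range(len(candidates)),
                   key=lambda j: math.dist(original, candidates[j]))
    return order.index(true_index) + 1

def linkage_ranks(originals, masked):
    """Rank of the true counterpart for every original record.
    Assumes masked[i] is the masked version of originals[i]."""
    return [rank_of_true_match(o, masked, i) for i, o in enumerate(originals)]

# Hypothetical toy records described by two numeric variables.
originals = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
masked = [(0.6, 0.0), (1.5, 0.0), (5.2, 4.9)]

ranks = linkage_ranks(originals, masked)
linked_to_nearest = ranks.count(1)
linked_to_2nd_nearest = ranks.count(2)
print(ranks, linked_to_nearest, linked_to_2nd_nearest)
```

Here the first and third records are "linked to nearest", while the second record is only "linked to 2nd nearest" because the masked version of the first record lies closer to it than its own masked version.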

The aim of the present invention is to provide a new method for evaluating the risk of re-identification of anonymized data during a distance-based matching search attack.

According to a first aspect, the invention relates to a data processing method implemented by computer for the evaluation of a risk of re-identification of anonymized data, the method providing a protection rate representative of the risk of re-identification in the case of a distance-based match-seeking attack, the method comprising the steps of a) linking an original data set comprising a plurality of original individuals to an anonymized data set comprising a plurality of anonymous individuals, the anonymous individuals being produced by a process of anonymization of the original individuals; b) transforming the original individuals and the anonymous individuals into a Euclidean space, the original individuals and anonymous individuals being represented by coordinates in the Euclidean space; c) identifying, for each said original individual, one or more nearest anonymous individuals on the basis of a distance, by a so-called "k-NN" method; and d) calculating the protection rate as an average number of anonymous individuals nearest to the original individual under consideration which are not the valid anonymous individual corresponding to the original individual under consideration, the nearest anonymous individuals being those identified in step c) whose distance to the original individual under consideration is less than the distance between the original individual under consideration and the valid anonymous individual.

According to a particular characteristic of the method, the aforementioned distance is a Euclidean distance.

According to another particular characteristic of the method, the transformation of step b) is carried out by a factorial method and/or using an artificial neural network called an “auto-encoder”.

According to yet another particular characteristic of the method, the factorial method used in step b) is a so-called "Principal Component Analysis" method when the individuals include variables of the continuous type, a so-called "Multiple Correspondence Analysis" method when the individuals include qualitative type variables, or a so-called "Factor Analysis of Mixed Data" method when the individuals include mixed "continuous/qualitative" type variables.

The invention also relates to a data anonymization computer system comprising a data storage device storing program instructions for implementing the method as described briefly above.

The invention also relates to a computer program product comprising a medium in which are recorded program instructions readable by a processor for implementing the method as described briefly above.

Other advantages and characteristics of the present invention will appear more clearly on reading the description below of several particular embodiments with reference to the appended drawings, in which:

[Fig.1] Fig.1 is a flowchart showing a particular embodiment of the method according to the invention.

[Fig.2] Fig.2 represents an illustrative diagram relating to the embodiment of the method according to the invention of Fig.1.

[Fig.3] Fig.3 shows an example of a general architecture of a data anonymization computer system in which the method according to the invention is implemented.

In the description that follows, for purposes of explanation and not limitation, specific details are provided in order to facilitate an understanding of the technology described. It will be apparent to those skilled in the art that other modes or embodiments may be practiced outside of the specific details described below. In other cases, detailed descriptions of well-known methods, techniques, etc. are omitted so as not to complicate the description with unnecessary detail.

Assessing the risk of re-identification requires comparing a set of original data made up of so-called original individuals with a set of anonymized data made up of so-called anonymous individuals. Individuals are typically data records. Each anonymized individual in the anonymized dataset represents an anonymized version of a corresponding original individual. A pair formed by an original individual and a corresponding anonymous individual is referred to as an “original/anonymous pair”. Re-identification risk is the risk that an attacker will successfully link an original individual to their anonymized record, i.e. the corresponding anonymous individual, thus forming a valid original/anonymous pair.

The method according to the invention for the evaluation of the risk of re-identification of data provides a metric, based on an individual-centric approach, which makes it possible to quantify the risk of re-identification of personal data during a distance-based match search. With reference to Figs. 1 and 2, a particular embodiment, designated MR2, of the method of the invention is now described, having an interesting applicability in the context of a distance-based match-seeking attack. This particular embodiment MR2 is built with a decidedly different approach compared to the known methods of the state of the art, by establishing a protection rate which is based on the evaluation of a density of presence of anonymous individuals in the immediate environment of the original individuals.

As visible in Fig.1, this embodiment MR2 comprises five steps S2-1 to S2-5.

The first step S2-1 is a data joining step. In step S2-1, a set of original data EDO comprising a plurality of original individuals IO is linked to a set of anonymized data EDA comprising a plurality of anonymous individuals IA. The anonymized data EDA is the data output by an anonymization process that has processed the original data EDO, and corresponds to that original data.

The second step S2-2 transforms the individuals IO and IA into a Euclidean space. In accordance with the invention, various transformation methods may be used. Typically, but not exclusively, a factorial method or an artificial neural network called an "autoencoder" can be used to convert the individuals IO and IA into coordinates in a Euclidean space.

Different factorial methods can be used depending on the type of data. Thus, Principal Component Analysis, known as "PCA", will typically be used when the variables are continuous. Multiple Correspondence Analysis, known as "MCA", will typically be used when the variables are qualitative. Factor Analysis of Mixed Data, known as "FAMD", will typically be used when the variables are mixed, that is to say, partly of the continuous type and partly of the qualitative type.

In the example embodiment discussed here, a factorial method is used in step S2-2. In this step S2-2, significant axes of variance are identified in the data sets by multivariate data analysis. These significant axes of variance determine the axes of Euclidean space onto which individuals IO and IA are projected.
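For continuous variables, the projection onto axes of variance could be sketched as below in plain Python. This is a minimal sketch under several assumptions that the description does not specify: the axes are fitted on the original individuals only and both sets are then projected with the same axes, the axes are obtained by power iteration on the covariance matrix, and the function names and toy values are hypothetical. Qualitative or mixed variables (MCA, FAMD) are not covered, and a dedicated statistics library would normally be used in practice.

```python
import math

def fit_pca(rows, n_components=2, n_iter=200):
    """Return (means, components) of a PCA fitted on `rows` by power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(p)]
    centered = [[r[c] - means[c] for c in range(p)] for r in rows]
    # Sample covariance matrix of the centered variables.
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    components = []
    for _component in range(n_components):
        # Arbitrary start vector; in rare cases it could be orthogonal to
        # the axis sought, which a random start would avoid.
        v = [float(c + 1) for c in range(p)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _step in range(n_iter):
            w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * cov[a][b] * v[b]
                         for a in range(p) for b in range(p))
        components.append(v)
        # Deflation: remove the variance explained by the axis just found.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(p)]
               for a in range(p)]
    return means, components

def project(rows, means, components):
    """Coordinates of each row on the fitted axes."""
    return [[sum((r[c] - means[c]) * comp[c] for c in range(len(means)))
             for comp in components] for r in rows]

# Hypothetical toy data: five individuals described by two continuous variables.
originals = [[2.0, 1.0], [4.0, 3.5], [6.0, 4.5], [8.0, 7.0], [10.0, 8.0]]
anonymized = [[2.5, 1.2], [3.8, 3.0], [6.3, 4.9], [7.6, 6.8], [9.9, 8.3]]

means, axes = fit_pca(originals, n_components=2)
orig_xy = project(originals, means, axes)   # coordinates of the IO individuals
anon_xy = project(anonymized, means, axes)  # coordinates of the IA individuals
print(orig_xy[0], anon_xy[0])
```

Because both sets are expressed in the same coordinate system, distances between an original individual and any anonymous individual can then be computed directly from these coordinates.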

The transformation of the individuals IO and IA into a Euclidean space makes it possible to calculate the mathematical distance between the individuals from their coordinates. The method of the invention preferably uses a Euclidean distance as the mathematical distance. However, it will be noted that the use of various other mathematical distances, such as a Manhattan distance, a Mahalanobis distance and the like, is included within the scope of the present invention.

The third step S2-3 is a step of calculating a mathematical distance, such as a Euclidean distance. Fig.2 represents the original individuals IO and the anonymous individuals IA by black circles and white circles, respectively, in a Euclidean space having axes A1 and A2. In this step S2-3, as illustrated in Fig.2, for each original individual IOi, the mathematical distance dii which separates it from the anonymous individual IAi with which it forms a valid original/anonymous pair (IOi, IAi) is calculated.

The fourth step S2-4 is a step of counting, for each original individual IOi, the number Nj of invalid anonymous individuals IAj separated from the original individual IOi by a mathematical distance dij which is less than the distance dii calculated in step S2-3. The "k-nearest neighbors" method, known as "k-NN", is used here to identify, for each original individual, the one or more nearest anonymous individuals on the basis of a mathematical distance, such as a Euclidean distance.

In this step S2-4, as illustrated in Fig.2, the number Nj of invalid anonymous individuals IAj present in the area contained in the circle of radius dii (the distance calculated in step S2-3) centered on the original individual IOi is therefore counted.
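The counting of steps S2-3 and S2-4 for a single original individual can be sketched as follows. This is a sketch only: a brute-force neighbor search stands in for the "k-NN" method, a Euclidean distance is assumed, and the function names, the optional parameter k and the toy coordinates are hypothetical.

```python
import math

def k_nearest(original, anonymized, k):
    """Brute-force k-NN: (distance, index) pairs of the k anonymous
    individuals nearest to `original`, sorted by increasing distance."""
    pairs = sorted((math.dist(original, a), j) for j, a in enumerate(anonymized))
    return pairs[:k]

def count_closer_invalid(original, anonymized, valid_index, k=None):
    """Number of invalid anonymous individuals, among the k nearest
    neighbors, that are strictly closer than the valid one.

    With k=None all anonymous individuals are examined; with a smaller k
    the count is capped at k.
    """
    if k is None:
        k = len(anonymized)
    d_valid = math.dist(original, anonymized[valid_index])  # step S2-3
    return sum(1 for d, j in k_nearest(original, anonymized, k)
               if j != valid_index and d < d_valid)         # step S2-4

# Hypothetical coordinates in a Euclidean space with axes A1 and A2.
io_i = (0.0, 0.0)
ia = [(3.0, 4.0), (1.0, 0.0), (0.0, 2.0), (6.0, 0.0), (-2.0, -2.0)]
n_invalid = count_closer_invalid(io_i, ia, valid_index=0)
print(n_invalid)
```

In this toy configuration the valid anonymous individual lies at distance 5 from the original individual, and three invalid anonymous individuals fall inside the circle of that radius.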

The higher the number Nj, the better the original individual IOi is protected against re-identification. Indeed, since the Nj invalid anonymous individuals IAj are closer, in terms of mathematical distance, to the original individual IOi than the valid anonymous individual IAi, a distance-based attack will select one of the Nj invalid anonymous individuals IAj in priority as being the corresponding anonymous individual. The number Nj represents the number of possible matches that the attacker would consider before selecting the valid anonymous individual IAi.

The fifth step S2-5 is a step of calculating the data protection rate against re-identification, designated here txP2, for the data set considered. The protection rate txP2 is calculated here as the mean number Nm of invalid anonymous individuals IAj that lie closer to an original individual than its valid anonymous individual, taken over the original individuals of the data set considered.
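Taking the rate as an average over the original individuals, as stated in the claims and the abstract, the calculation of steps S2-3 to S2-5 can be written compactly as follows. The symbols n (number of original individuals) and N_i (the count obtained in step S2-4 for the original individual IOi, denoted Nj in the text) are notations introduced here for illustration.

```latex
N_i = \bigl|\{\, j \neq i \;:\; d_{ij} < d_{ii} \,\}\bigr|,
\qquad
\mathrm{txP2} = N_m = \frac{1}{n} \sum_{i=1}^{n} N_i
```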

By way of example, we consider here the case of an attacker who is in possession of a data set containing anonymous data (individuals IA) of 100 people to which a considered person i belongs. The attacker is also in possession of the original datum (individual IOi) of the considered person i. The attacker attempts to prove that the original datum (individual IOi) of the considered person i is part of the anonymized cohort. In order to re-identify the valid original/anonymous pair (IOi, IAi), the attacker must carry out a matching of the individuals and uses for this a mathematical distance between them, such as a Euclidean distance. If, for example, the data protection rate is txP2=7 for this data set, this means that the attacker will then find himself in a situation, as represented in Fig.2, in which he will have on average Nj=7 invalid anonymous individuals IAj that are closer than the valid anonymous individual IAi and may be selected instead of it. Thus, the denser the environment of the original individual IOi, with numerous invalid anonymous individuals IAj, the more difficult this individual IOi will be to re-identify.
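The interpretation of the protection rate can be checked on a small toy data set with the following sketch. It is not a reference implementation of the method: the function name and the coordinates are hypothetical, a Euclidean distance is assumed, all anonymous individuals are examined by brute force, and the anonymous individual at position i is assumed to be the valid counterpart of the original individual at position i.

```python
import math
from statistics import mean

def protection_rate(originals, anonymized):
    """Return (rate, counts), where counts[i] is the number of invalid
    anonymous individuals strictly closer to originals[i] than its valid
    counterpart anonymized[i], and rate is the mean of these counts."""
    counts = []
    for i, orig in enumerate(originals):
        d_valid = math.dist(orig, anonymized[i])
        counts.append(sum(
            1 for j, anon in enumerate(anonymized)
            if j != i and math.dist(orig, anon) < d_valid
        ))
    return mean(counts), counts

# Hypothetical coordinates of four individuals after transformation.
originals = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
anonymized = [(1.2, 0.1), (0.1, 0.9), (0.2, 0.2), (5.1, 5.0)]

rate, counts = protection_rate(originals, anonymized)
print(rate, counts)
```

In this toy set the first two individuals each have two invalid anonymous individuals closer than their valid counterpart, the third has one and the fourth has none, giving a rate of 1.25. An attacker who picks the nearest anonymous individual would therefore re-identify only the isolated fourth individual at the first attempt.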

A general architecture of a data anonymization computer system SAD in which the method according to the invention for evaluating the risk of re-identification is implemented is shown by way of example in Fig.3.

The SAD system is implemented here in a local computer system DSL and comprises two software modules MAD and MET. The MAD and MET software modules are hosted in data storage devices SD, such as memory and/or hard disk, of the local computer system DSL. The local computer system DSL also hosts an original database BDO in which original data DO is stored and an anonymized database BDA in which anonymized data DA is stored.

The MAD software module implements a data anonymization process which processes the original data DO and outputs the anonymized data DA.

The software module MET implements the method according to the invention for the evaluation of the risk of re-identification of the data. The software module MET receives as input original data DO and anonymized data DA and provides as output a protection rate TP against the risk of re-identification. The implementation of the method according to the invention is ensured by the execution of code instructions of the software module MET by a processor (not shown) of the local computer system DSL. The protection rate TP provided by the software module MET provides a measure of the performance of the data anonymization process implemented by the software module MAD.

Of course, the invention is not limited to the embodiments which have been described here by way of illustration. The person skilled in the art, depending on the applications of the invention, may make various modifications and variants falling within the scope of protection of the invention.

Claims

1. Data processing method implemented by computer for the evaluation of a risk of re-identification of anonymized data, said method providing a protection rate (txP2) representative of said risk of re-identification in the case of a distance-based match-seeking attack, said method comprising the steps of a) linking an original dataset (EDO) comprising a plurality of original individuals (IO) to an anonymized dataset (EDA) comprising a plurality of anonymous individuals (IA), said anonymous individuals (IA) being produced by a process of anonymizing said original individuals (IO); b) transforming (PCA, MCA, FAMD) said original individuals (IO) and said anonymous individuals (IA) into a Euclidean space (A1, A2), said original individuals (IO) and anonymous individuals (IA) being represented by coordinates in said Euclidean space (A1, A2); c) identifying for each said original individual (IO) one or more said closest anonymous individuals (IA) on the basis of a distance, by a method called "k-NN"; and d) calculating said protection rate (txP2) as being an average number (Nm) of said anonymous individuals (IAj) closest to a said original individual (IOi) which are not a valid anonymous individual (IAi) corresponding to said original individual (IOi), said closest anonymous individuals (IAj) being those identified in step c) and having a distance (dij) with said original individual (IOi) less than the distance (dii) between said original individual (IOi) and said valid anonymous individual (IAi).

2. Method according to Claim 1, characterized in that said distance is a Euclidean distance.

3. Method according to Claim 1 or 2, characterized in that the transformation of step b) is carried out by a factorial method (PCA, MCA, FAMD) and/or using an artificial neural network called an "autoencoder".

4. Method according to Claim 3, characterized in that said factorial method is a method called "Principal Component Analysis" (PCA) when said individuals (IO, IA) comprise variables of continuous type, a method called "Multiple Correspondence Analysis" (MCA) when said individuals (IO, IA) include qualitative type variables, or a method called "Factor Analysis of Mixed Data" (FAMD) when said individuals (IO, IA) include mixed "continuous/qualitative" type variables.

5. Data anonymization computer system (SAD) comprising a data storage device (SD) storing program instructions (MET) for implementing the method according to any one of Claims 1 to 4.

6. Computer program product comprising a medium in which are recorded program instructions (MET) readable by a processor for implementing the method according to any one of Claims 1 to 4.