FR3048101A1

FR3048101A1 - METHOD AND DEVICE FOR EVALUATING THE ROBUSTNESS OF AN ANONYMIZATION OF A DATA SET

Info

Publication number: FR3048101A1
Application number: FR1651422A
Authority: FR
Inventors: Paul Gibert; Marwen Hasnaoui
Original assignee: Digital & Ethics
Current assignee: Digital & Ethics
Priority date: 2016-02-22
Filing date: 2016-02-22
Publication date: 2017-08-25

Abstract

The present invention essentially relates to a method for evaluating the robustness of an anonymization of a data set, characterized in that it comprises the following steps:
- Acquisition (1000) of an anonymized data set;
- Structuring (1010) of the anonymized data set into at least one table, each column of the table corresponding to a property and each row corresponding to a record, a record being a set of values, each value corresponding to a property;
- Structuring (1020) of at least one category of properties, such a category being a set of properties;
- Production (1030) of at least one indicator according to distributions of values within a property corresponding to the at least one structured category;
- Comparison (1040) of the at least one indicator with at least one predetermined threshold and, if the comparison fails, production of an alert message.

Description

Method and device for evaluating the robustness of an anonymization of a data set

Technical field of the invention

The field of the invention is that of the handling of sensitive data: data capable of identifying an individual, or of revealing a secret or harmful information about that individual. Such data relate in particular to the privacy of individuals or to industrial secrets.

More precisely, the field of the invention is that of the exchange of data relating to legal entities, in particular persons, whether natural or legal, and the provision of data containing personal information or trade secrets.

Such data are, for example, medical data, financial data, commercial data, or network activity logs; the list is not exhaustive.

Even more precisely, the field of the invention is that of evaluating the robustness of an anonymization of a data set.

State of the prior art

The anonymization of data, and a fortiori of personal data, consists in modifying the content or structure of the data so as to make it very difficult or impossible to "re-identify" the natural or legal persons, or more generally the entities, to which the data relate, or to acquire new information about an individual. English speakers sometimes also speak of de-identification (DE-ID).

The choice to anonymize data often results from a compromise, involving professional ethics, law and morality, between a desire or an obligation to protect individuals and their personal data. Anonymization is used in particular for the dissemination and sharing of data deemed to be of public interest, such as open data.

However, anonymization techniques have limits, and data sets can be subjected to processing that makes it possible to re-identify an individual:
- linking an individual to a record;
- finding an individual from homogeneous records;
- revealing information about an individual;
- obtaining (possibly probabilistic) information about an individual.

These various attack risks are known in the literature, but the literature presents them as examples of re-identification and offers no method for quantifying, in a global way, the sensitivity of a data set to these different attacks. These attack methods are:
- The search for a unique record, which is therefore highly distinctive and makes it easy to find the corresponding entity. For example, consider the following anonymized table from a care center:

If it is also known that this care center treats Bob, from department 93 and aged 17, then it is simple to find that Bob has cancer.
- A homogeneity attack, for example by considering the following anonymized table:

If it is also known that this care center treats Bob, from department 75 and aged 23, then it is simple to find that Bob has HIV, because all the patients of the care center who come from department 75 and are between 20 and 29 years old have HIV.
- A probabilistic inference attack, which exploits information created by the anonymization or by any other processing of the data. For example, in the original data 50% of individuals suffer from cancer. After processing, the individuals are distributed into anonymous categories. If 50% of the individuals in each of these categories again suffer from cancer, then the anonymization has created no additional information. The rate may, however, vary for some anonymous categories. In that case the processing and/or transformation of the data, here the anonymization, has created information that can be exploited for re-identification.
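The probabilistic inference attack described above can be illustrated with a minimal sketch. The records, the category labels "A" and "B" and the helper name `rate_by_group` are hypothetical and serve only to show how a per-category rate that departs from the global rate reveals information created by the grouping.

```python
from collections import defaultdict

# Hypothetical anonymized records: (anonymous category, has_cancer).
records = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

def rate_by_group(rows):
    """Return the share of positive records in each anonymous category."""
    counts = defaultdict(lambda: [0, 0])  # category -> [positives, total]
    for group, positive in rows:
        counts[group][0] += int(positive)
        counts[group][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

# Rate over the whole data set (50% in this made-up example).
global_rate = sum(positive for _, positive in records) / len(records)
group_rates = rate_by_group(records)

# Gap between each category's rate and the global rate. A non-zero gap is
# information that the grouping itself has created.
gaps = {g: abs(rate - global_rate) for g, rate in group_rates.items()}
```

In this example the global rate is 50%, but knowing that an individual falls in category "A" raises the estimated probability to 75%.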

Anonymization techniques are therefore known, but there is no known way to industrialize an evaluation of the effectiveness of anonymization. This effectiveness is therefore, at best, evaluated empirically when the first data set is produced, and the same anonymization is then used for all subsequent data sets.

We speak of a first data set because in practice live or updated data are used. Criteria for extracting data from a database are therefore defined. A first extraction is performed, to which an anonymization is applied. This anonymization is evaluated by hand and, if it is satisfactory, the extraction criteria are scheduled so as to produce a data set at regular intervals. Each such data set is then anonymized and sent to its recipient. As the content of the original data changes, the reliability of the anonymization changes too. In the state of the art this problem is ignored.

Disclosure of the invention

The invention solves this problem by proposing an automated method for evaluating the robustness of the anonymization of a data set (quantifying the level of risk of re-identification of an individual and of disclosure of information). The method according to the invention calculates, for predetermined categories of data, one or more indicators. The value of these indicators makes it possible to automate the decision to deliver the anonymized data set.

The invention therefore essentially relates to a method for evaluating the robustness of an anonymization of a data set, characterized in that it comprises the following steps:
- Acquisition of an anonymized data set;
- Structuring of the anonymized data set into at least one table, each column of the table corresponding to a property and each row corresponding to a record, a record being a set of values, each value corresponding to a property;
- Structuring of at least one category of properties, such a category being a set of properties;
- Production of at least one indicator according to distributions of values within a property corresponding to the at least one structured category;
- Comparison of the at least one indicator with at least one predetermined threshold and, if the comparison fails, production of an alert message.

The invention also has the following features, to be considered in any technically possible combination:
- the evaluation is carried out at each anonymization, its result making it possible to control the transmission of the anonymized data set;
- a category belongs to the group formed of at least the following elements:
  o explicit identifier: a datum whose knowledge alone makes it possible to associate the record with an entity;
  o quasi-identifier: a datum which, considered with at least one other datum of the same category, makes it possible to associate the record with an entity;
  o sensitive attribute: a datum whose association with an entity could harm that entity;
  o non-sensitive attribute: a datum whose association with an entity would not directly harm that entity;
- an indicator is a uniqueness indicator;
- the uniqueness indicator is calculated from data belonging to the quasi-identifier category;
- an indicator is a diversity indicator;
- the diversity indicator is calculated from data belonging to the sensitive-data category and to the quasi-identifier category;
- an indicator is a proximity indicator;
- the proximity indicator is calculated from data belonging to the sensitive-data category and to the quasi-identifier category;
- several indicators are combined to control the production of the alert message.
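The steps listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the table contents, the property names, the functions `smallest_group_size` and `evaluate`, the threshold value, and the reading of a "failed comparison" as "indicator below threshold" are all hypothetical.

```python
from collections import Counter

# Step 1010: the anonymized data set as a table (one dict per record,
# one key per property). The values are made up.
table = [
    {"postal_code": "75", "age": "20-29", "pathology": "flu"},
    {"postal_code": "75", "age": "20-29", "pathology": "asthma"},
    {"postal_code": "93", "age": "10-19", "pathology": "cancer"},
]

# Step 1020: a category is a set of properties.
categories = {
    "quasi_identifier": ["postal_code", "age"],
    "sensitive_attribute": ["pathology"],
}

def smallest_group_size(rows, properties):
    """Step 1030 (example indicator): size of the smallest group of records
    sharing the same values for the given properties."""
    groups = Counter(tuple(row[p] for p in properties) for row in rows)
    return min(groups.values())

def evaluate(rows, cats, thresholds):
    """Step 1040: compare each indicator to its threshold, collect alerts."""
    indicators = {
        "uniqueness": smallest_group_size(rows, cats["quasi_identifier"]),
    }
    alerts = []
    for name, threshold in thresholds.items():
        # Assumption: the comparison fails when the indicator is below
        # its threshold.
        if indicators[name] < threshold:
            alerts.append(
                f"ALERT: {name}={indicators[name]} below threshold {threshold}"
            )
    return alerts

alerts = evaluate(table, categories, {"uniqueness": 2})
```

With this data the single record from "93" aged "10-19" forms a group of size 1, so one alert is produced; an empty list would mean the data set can be delivered.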

The present invention also relates to a device for implementing steps of the method according to the invention.

The present invention also relates to a non-transitory storage device on which instruction codes for implementing the method according to the invention are recorded.

Brief description of the figures

Other features and advantages of the invention will emerge on reading the description which follows, with reference to the appended figures, which illustrate:
- Figure 1: an infrastructure allowing the implementation of the invention;
- Figure 2: steps of the method according to the invention.

For the sake of clarity, identical or similar elements are marked with identical reference signs in all the figures unless otherwise specified. The invention will be better understood on reading the description which follows and on examining the accompanying figures. These are presented by way of indication and in no way limit the invention.

Detailed description of an embodiment

Figure 1 shows a robustness evaluation server 100 according to the invention. The robustness evaluation server according to the invention comprises:
- a microprocessor 110;
- storage means 120, for example a hard disk, a memory card, an integrated component or part of an integrated component dedicated to data storage, or a RAID array. The storage means are local or remote;
- a communication interface 130, for example a communication card using the Ethernet protocol. Other protocols, such as IP, can be envisaged. The communication interface may be wired or wireless.

The microprocessor 110, the storage means 120 and the communication interface 130 of the robustness evaluation server are interconnected by a bus 150.

When an action is attributed to a device, it is in fact performed by a microprocessor of the device, controlled by instruction codes stored in a memory of the device. If an action is attributed to an application, it is in fact performed by a microprocessor of the device in a memory of which the instruction codes corresponding to the application are stored. When a device or an application sends a message, the message is sent via a communication interface of said device or said application. In these cases, a device is real or virtual.

Figure 1 shows that the storage means 120 comprise a plurality of zones. Each zone is structured to fulfill a function. The storage means 120 thus comprise:
- A processing zone 120.1 comprising instruction codes corresponding to the implementation of the method according to the invention, that is, instruction codes produced by using a development environment to implement the method according to the invention.
- A category zone 120.2, that is, a zone structured to record the description of a category of data. Such a structure makes it possible to associate a category identifier, for example a name, with information describing the category. It can be a row in a table or a hierarchical structure such as XML. The categories are:
  o explicit identifier: a datum whose knowledge alone makes it possible to associate the record with an entity;
  o quasi-identifier: a datum which, considered with at least one other datum of the same category, makes it possible to associate the record with an entity. A quasi-identifier is simple or composite: it is simple when it comprises only one key, and composite when it comprises at least two keys;
  o sensitive attribute: a datum whose association with an entity could harm that entity;
  o non-sensitive attribute: a datum whose association with an entity would not directly harm that entity.
- An anonymized-data zone 120.3. This zone is used to record the data that are supposed to be anonymized. Conceptually, such a zone can be seen as a table in which each row corresponds to a record and each column to a property of a record.
- A contract zone 120.4. This zone makes it possible at least to associate an identifier of an indicator with a threshold value. The content of the contract zone is therefore used to determine robustness according to the position of the result of the robustness evaluation relative to the threshold value.

The structure of zone 120.2 also makes it possible to associate a category with a list of properties. A property, and in particular its value, characterizes a record.

For the data zone, if the data are taken to be data on the patients of a hospital, then a record corresponds to a patient. The properties are, for example:
- Name;
- Postal code;
- Age;
- Pathology;
- Sex;
- Number of children.
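As a rough illustration of how zones 120.2, 120.3 and 120.4 might be laid out in memory, the sketch below uses plain Python structures. The `Category` class, the assignment of each hospital property to a category, the sample record and the threshold values are all assumptions made for this example; the text only requires that a category be linked to a description and a list of properties, and that an indicator identifier be linked at least to a threshold.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    """One entry of the category zone 120.2."""
    name: str                                       # category identifier
    description: str                                # description information
    properties: list = field(default_factory=list)  # associated properties

# Category zone 120.2 (the assignment of properties is an assumption).
category_zone = {
    "explicit_identifier": Category(
        "explicit_identifier", "alone links a record to an entity",
        ["name"]),
    "quasi_identifier": Category(
        "quasi_identifier", "links a record to an entity when combined",
        ["postal_code", "age", "sex"]),
    "sensitive_attribute": Category(
        "sensitive_attribute", "could harm the entity if linked to it",
        ["pathology"]),
    "non_sensitive_attribute": Category(
        "non_sensitive_attribute", "would not directly harm the entity",
        ["number_of_children"]),
}

# Anonymized-data zone 120.3: one dict per record, one key per property.
data_zone = [
    {"postal_code": "75", "age": "20-29", "sex": "M", "pathology": "flu",
     "number_of_children": 0},
]

# Contract zone 120.4: indicator identifier -> threshold (values made up).
contract_zone = {
    "uniqueness": {"threshold": 5},
    "diversity": {"threshold": 3},
}
```

A dict keyed by indicator name keeps the lookup "is this indicator associated with a threshold?" trivial, which is the test used later to decide which indicators to calculate.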

Figure 2 shows a step 1000 of acquiring an anonymized data set. Step 1000 can be implemented by the server in whole or in part. The server 100 at least updates the anonymized-data zone 120.3, that is, records the data so that they can be read back. Such a step is a step of transforming an original data set by implementing the following steps:
- Selection or extraction according to predetermined criteria;
- Anonymization, for example by transformation and/or generalization of property values.
In general, for a device, acquiring a datum means that the datum is read from a local or remote memory.
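Anonymization by generalization, as mentioned in the steps above, can be sketched as follows. The generalization rules (ten-year age ranges, two-character postal prefixes, dropping the name) and the function names are hypothetical choices for illustration; the text does not prescribe any particular transformation.

```python
def generalize_age(age, width=10):
    """Replace an exact age by a range of `width` years, e.g. 23 -> '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_postal_code(code, keep=2):
    """Keep only the first `keep` characters of a postal code."""
    return code[:keep]

def anonymize(record):
    """Drop the name and generalize age and postal code (assumed rules)."""
    return {
        "postal_code": generalize_postal_code(record["postal_code"]),
        "age": generalize_age(record["age"]),
        "pathology": record["pathology"],
    }

# Hypothetical original record and its anonymized form.
original = {"name": "Bob", "postal_code": "75011", "age": 23, "pathology": "flu"}
anonymized = anonymize(original)
```

Coarser ranges put more records into the same group, which is what the indicators described later measure.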

In a variant of the invention, the anonymization reduces to acquiring an anonymized data set, without performing the transformations themselves. At the end of the anonymization step, the anonymized-data zone 120.3 therefore contains a set of records of which some properties have been anonymized.

In a step 1010, the robustness evaluation server 100 structures these anonymized data by associating with each property a name by which the property can be designated. The anonymization and structuring steps can be merged, since the structure of the anonymized data must be known for the rest of the method. The anonymized data can be treated as a table.

In a step 1020, the server structures at least one category of properties. A definition of such a category is read from the category zone 120.2. The category zone 120.2 can be updated as soon as the structure of the anonymized data is known. Once the structure is known, each property can be associated with one of the categories listed above, thereby producing a category structure. A property may also belong to no category, in which case it is ignored by the method according to the invention.
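Step 1020 can be sketched as grouping the table's columns by category and skipping any column that has no category. The column names, the mapping and the function name `properties_by_category` are hypothetical.

```python
def properties_by_category(property_to_category, columns):
    """Group the table's columns by category; uncategorized columns are skipped."""
    grouped = {}
    for column in columns:
        category = property_to_category.get(column)
        if category is None:
            continue  # property belongs to no category: ignored by the method
        grouped.setdefault(category, []).append(column)
    return grouped

# Hypothetical columns of the structured table; "admission_id" is
# deliberately left without a category.
columns = ["postal_code", "age", "pathology", "admission_id"]
property_to_category = {
    "postal_code": "quasi_identifier",
    "age": "quasi_identifier",
    "pathology": "sensitive_attribute",
}

structure = properties_by_category(property_to_category, columns)
```

Because the mapping is plain data, the composition of the categories can be changed from one run to the next without touching the code.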

Note that the composition of the categories is a parameter of the method. The composition of the categories can therefore vary from one implementation to another.

The server then proceeds to a step 1030 of calculating at least one indicator. The indicators to be calculated are determined by acquiring the content of zone 120.4, which defines thresholds per indicator. If an indicator is associated with a threshold, then that indicator must be calculated.

Depending on the variant, the way an indicator is calculated is:
- Predetermined: an indicator identifier is associated with a category of data or, more specifically, with a property;
- Configurable: the contract zone makes it possible to associate an indicator with a category of data or, more specifically, with a property.
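The two variants above, together with the rule that only indicators with a threshold are calculated, might be sketched as follows. The layout of the contract entries (a dict with a `threshold` key and an optional `inputs` key) and the names `DEFAULT_INPUTS` and `plan_indicators` are assumptions; the default category assignments follow the features listed in the disclosure of the invention.

```python
# Predetermined variant: each indicator is tied in advance to the
# categories it reads.
DEFAULT_INPUTS = {
    "uniqueness": ["quasi_identifier"],
    "diversity": ["quasi_identifier", "sensitive_attribute"],
    "proximity": ["quasi_identifier", "sensitive_attribute"],
}

def plan_indicators(contract_zone):
    """Return, for each indicator that has a threshold, the categories to use."""
    plan = {}
    for name, entry in contract_zone.items():
        if "threshold" not in entry:
            continue  # no threshold: the indicator is not calculated
        # Configurable variant: the contract entry may override the defaults.
        plan[name] = entry.get("inputs", DEFAULT_INPUTS[name])
    return plan

# Hypothetical contract zone: "proximity" has no entry, so it is skipped.
contract_zone = {
    "uniqueness": {"threshold": 5},
    "diversity": {"threshold": 3, "inputs": ["sensitive_attribute"]},
}

plan = plan_indicators(contract_zone)
```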

Indicators are calculated relative to equivalence classes. An equivalence class is a set of records having an identical quasi-identifier value.
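A minimal sketch of how equivalence classes might be built, assuming records are held as dictionaries and that the quasi-identifier is the tuple of values of the properties in the quasi-identifier category. The function name and the sample data are hypothetical.

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        classes[key].append(record)
    return dict(classes)

# Illustrative anonymized records (generalized age and postal code).
records = [
    {"age": "30-39", "zip": "750**", "disease": "flu"},
    {"age": "30-39", "zip": "750**", "disease": "asthma"},
    {"age": "40-49", "zip": "690**", "disease": "flu"},
]
classes = equivalence_classes(records, ["age", "zip"])
```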

The indicators that can be calculated are:
- A uniqueness indicator: it evaluates the level of detail reduction of a quasi-identifier. In one embodiment, it is the minimum cardinality among the cardinalities of the equivalence classes making up the data set, more particularly for a property in the quasi-identifier category.
- A diversity indicator: it evaluates the level of diversity of the values of a sensitive attribute. For example:
  o the minimum number of distinct values taken by a sensitive attribute within the different equivalence classes making up the data set is calculated, and/or
  o the minimum entropy of the distribution of the values taken by a sensitive attribute within the different equivalence classes making up the data set is calculated. An entropy diversity indicator is then obtained.
- A proximity indicator: it evaluates how faithfully the values of a sensitive attribute within an equivalence class reflect the values of that attribute in the complete data set. Proximity can be quantified by different metrics. By way of example, at least two modes of calculation can be used:
  o calculation of a divergence, for example the maximum Kullback-Leibler divergence between the distribution of the values of a sensitive attribute within the equivalence classes and the distribution of the values of that attribute in the data set, and/or
  o calculation of the maximum difference between the distribution of the values of a sensitive attribute within the equivalence classes and the distribution of the values of that attribute in the database. This indicator is called the tD indicator.
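The two counting-based indicators can be sketched as below. This is an illustration under stated assumptions, not the implementation of the invention: equivalence classes are assumed to be given as a dictionary mapping a quasi-identifier value to its list of records, and the function names are hypothetical.

```python
def uniqueness(classes):
    """Minimum cardinality among the equivalence classes."""
    return min(len(rows) for rows in classes.values())

def distinct_diversity(classes, sensitive):
    """Minimum number of distinct sensitive values found in any class."""
    return min(len({row[sensitive] for row in rows}) for rows in classes.values())

# Illustrative classes keyed by quasi-identifier value.
classes = {
    ("30-39", "750**"): [{"disease": "flu"}, {"disease": "asthma"}, {"disease": "flu"}],
    ("40-49", "690**"): [{"disease": "flu"}, {"disease": "flu"}],
}
k_value = uniqueness(classes)                     # smallest class holds 2 records
l_value = distinct_diversity(classes, "disease")  # second class has a single disease
```

In this example the second class drives both results: it is the smallest, and all its records share the same sensitive value.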

It is recalled that the entropy IE within an equivalence class Ci is defined by the following equations:
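The equations are not reproduced in this text. A form consistent with the surrounding definitions, assuming the standard Shannon entropy and writing S for the set of values of the sensitive attribute (S and the notation p(C_i, s) are assumptions, not the notation of the source), would be:

```latex
IE(C_i) = -\sum_{s \in S} p(C_i, s)\,\log p(C_i, s),
\qquad
IE = \min_{C_i \in C} IE(C_i)
```

The second expression reflects the minimum-entropy selection described for the entropy diversity indicator.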

where:
- p is the probability distribution of the values of a sensitive attribute within an equivalence class Ci;
- C is the set of equivalence classes of the anonymized data set.
The calculation of the Kullback-Leibler divergence is also recalled:
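The formula is not reproduced in this text. Assuming the standard definition of the Kullback-Leibler divergence, and again writing S for the set of values of the sensitive attribute (an assumed symbol), a consistent form would be:

```latex
D_{KL}(C_i) = \sum_{s \in S} p(C_i, s)\,\log \frac{p(C_i, s)}{f(s)},
\qquad
D_{KL} = \max_{C_i \in C} D_{KL}(C_i)
```

The maximum over the classes corresponds to the "maximum Kullback-Leibler divergence" named in the list of indicators.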

where:
- p is the probability distribution of the values of a sensitive attribute within an equivalence class Ci;
- f is the probability distribution of the values of a sensitive attribute within the anonymized data set.
The tD indicator is obtained by applying the following formulas:

With the definitions of p and f given above.

For a data set, as many indicator values are obtained as there are equivalence classes in the data set. The value retained to characterize the data set is the most sensitive one, that is, the one revealing the lowest robustness.

An indicator is therefore calculated in two steps:
- calculation of an indicator value for each equivalence class;
- selection of one value among the calculated values.
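The two-step calculation can be sketched for the distribution-based indicators as follows. Several points are assumptions: the natural logarithm is used, the function names are hypothetical, and, since the tD formulas are not reproduced in the text, the total variation distance is used purely as a stand-in for them.

```python
import math
from collections import Counter

def distribution(values):
    """Empirical probability of each value in a list."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def entropy(p):
    return -sum(q * math.log(q) for q in p.values())

def kl_divergence(p, f):
    # Every value seen in a class also occurs in the full data set, so f[v] > 0.
    return sum(q * math.log(q / f[v]) for v, q in p.items())

def variation_distance(p, f):
    # ASSUMPTION: stand-in for the tD formulas, which are missing from the text.
    return 0.5 * sum(abs(p.get(v, 0.0) - fv) for v, fv in f.items())

def indicator(class_values, per_class, select):
    """Step 1: one value per equivalence class. Step 2: keep the weakest one."""
    return select(per_class(values) for values in class_values)

# Sensitive values of two illustrative equivalence classes.
class_values = [["flu", "asthma", "flu", "asthma"], ["flu", "flu"]]
f = distribution([v for values in class_values for v in values])

entropy_diversity = indicator(class_values, lambda vs: entropy(distribution(vs)), min)
kl_proximity = indicator(class_values, lambda vs: kl_divergence(distribution(vs), f), max)
td_proximity = indicator(class_values, lambda vs: variation_distance(distribution(vs), f), max)
```

The selection function differs by indicator: the lowest entropy and the highest divergence each reveal the lowest robustness.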

Five indicators and their modes of calculation have just been described. A contract must have at least one indicator associated with a threshold. Once the indicator or indicators have been calculated, the value or values obtained are compared, in a comparison step 1040, with their respective thresholds. If the comparison fails, the server produces an alert message. Such a message is, for example, an email sent to a predetermined recipient. Such a message may also be the update of a value in a local or remote database.
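The comparison step 1040 can be sketched as follows. The direction of each comparison is an assumption not stated in the text: uniqueness and diversity are taken to pass when at or above their threshold, proximity when at or below it. The indicator names, the function name `compare` and the message format are hypothetical.

```python
import operator

# ASSUMPTION: pass rule per indicator (value, threshold) -> bool.
PASS_RULE = {
    "uniqueness": operator.ge,
    "diversity": operator.ge,
    "proximity": operator.le,
}

def compare(indicators, thresholds):
    """Return one alert message per indicator that fails its threshold."""
    if not thresholds:
        raise ValueError("a contract needs at least one indicator with a threshold")
    alerts = []
    for name, threshold in thresholds.items():
        value = indicators[name]
        if not PASS_RULE[name](value, threshold):
            alerts.append(f"ALERT {name}: value {value} fails threshold {threshold}")
    return alerts

# Only indicators that have a threshold in the contract are checked.
alerts = compare(
    {"uniqueness": 2, "diversity": 1, "proximity": 0.41},
    {"uniqueness": 5, "proximity": 0.5},
)
```

The resulting messages could then be sent by email or written to a database, as the description indicates.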

If emission of the anonymized file is made conditional on reading the value updated by the method according to the invention, it becomes possible to automatically block the emission of a file whose anonymization is not sufficiently robust. The emission of anonymized files is thus controlled.

In a variant of the invention, the alert message is in fact a report that is emitted whatever the result of the robustness evaluation.

It will be understood that such a method is easily integrated into a controlled automation of data dissemination. Extraction and anonymization can already be automated. It therefore suffices to add a step according to the invention and give it control over an emission step. This ensures that no data set whose anonymization is insufficiently robust can be emitted.
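As an illustration of this integration, the sketch below places the evaluation between anonymization and emission. All four callables and the function name `run_pipeline` are hypothetical placeholders; the evaluation is assumed to return a list of alert messages, empty when every threshold is met.

```python
def run_pipeline(extract, anonymize, evaluate, emit):
    """Emit the anonymized data set only if the evaluation raises no alert."""
    data = anonymize(extract())
    alerts = evaluate(data)
    if alerts:
        return False, alerts  # emission is blocked
    emit(data)
    return True, []

# Illustrative run with stand-in steps; "sent" collects what was emitted.
sent = []
ok, alerts = run_pipeline(
    extract=lambda: [{"age": 34}, {"age": 36}],
    anonymize=lambda rows: [{"age": "30-39"} for _ in rows],
    evaluate=lambda rows: [] if len(rows) >= 2 else ["uniqueness below threshold"],
    emit=sent.append,
)
```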

Claims

1. A method for evaluating the robustness of an anonymization of a data set, characterized in that it comprises the following steps:
- Acquisition (1000) of an anonymized data set;
- Structuring (1010) of the anonymized data set into at least one table, each column of the table corresponding to a property, each row corresponding to a record, a record being a set of values, each value corresponding to a property;
- Structuring (1020) of at least one category of properties, such a category being a set of properties;
- Production (1030) of at least one indicator according to distributions of values within a property corresponding to the at least one structured category;
- Comparison (1040) of the at least one indicator with at least one predetermined threshold and, if the comparison fails, production of an alert message.

2. A method for evaluating the robustness of an anonymization according to claim 1, characterized in that the evaluation is performed at each anonymization, its result being used to control an emission of the anonymized data set.

3. A method for evaluating the robustness of an anonymization according to one of the preceding claims, characterized in that a category belongs to the group consisting of at least the following elements:
- explicit identifier: a datum whose knowledge alone makes it possible to associate the record with an entity;
- quasi-identifier: a datum which, considered together with at least one other datum of the same category, makes it possible to associate the record with an entity;
- sensitive attribute: a datum whose association with an entity could be detrimental to that entity;
- non-sensitive attribute: a datum whose association with an entity would not cause that entity direct harm.

4. A method of evaluating the robustness of an anonymization according to one of the preceding claims, characterized in that an indicator is an indicator of uniqueness.

5. A method of evaluating the robustness of an anonymization according to claim 4 characterized in that the uniqueness indicator is calculated from data belonging to the category of quasi-identifiers.

6. A method for evaluating the robustness of an anonymization according to one of the preceding claims, characterized in that an indicator is an indicator of diversity.

7. A method of evaluating the robustness of an anonymization according to claim 6 characterized in that the diversity indicator is calculated from data belonging to the category of sensitive data and the category of quasi-identifier.

8. A method of evaluating the robustness of an anonymization according to one of the preceding claims, characterized in that an indicator is a proximity indicator.

9. A method of evaluating the robustness of an anonymization according to claim 6 characterized in that the proximity indicator is calculated from data belonging to the category of sensitive data and the category of quasi-identifier.

10. A method for evaluating the robustness of an anonymization according to one of the preceding claims characterized in that several indicators are combined to control the production of the alert message.

11. Device (100) implementing a step of a method according to one of the preceding claims.

12. Non-transitory storage device (120) on which are recorded instruction codes for implementing a method according to one of claims 1 to 10.