WO2019158840A1

WO2019158840A1 - Automatic processing method for anonymizing a digital data set

Info

Publication number: WO2019158840A1
Application number: PCT/FR2019/050280
Authority: WO
Inventors: Fatma BOUATTOUR; Mohamed KASRAOUI; Paul-Olivier GIBERT
Original assignee: Digital & Ethics
Priority date: 2018-02-13
Filing date: 2019-02-08
Publication date: 2019-08-22
Also published as: FR3077894A1; EP3752948A1; FR3077894B1

Abstract

The invention relates to the field of digital data processing, more particularly to the automatic processing of large volumes of digital data, consisting in modifying the content and/or the structure of these data in order to make it very difficult or impossible to identify the (natural or artificial) person or entity in question, in particular by anonymizing the data.

Description

AUTOMATIC PROCESSING METHOD FOR ANONYMIZING A DIGITAL DATA SET

Field of the invention

The present invention relates to the field of digital data processing, and more particularly to the automatic processing of large volumes of digital data by modifying the content and/or structure of these data in order to make it very difficult or impossible to "re-identify" the persons (natural or legal) or entities concerned, in particular by anonymization.

Businesses now accumulate large volumes of data whose value can be realized through processing and monetization. These data include personal data, which makes them subject to regulatory and ethical requirements before they are disseminated. Anonymization is therefore a crucial step to prevent access to personal data. It usually results in a loss of information, which must be controlled in order to preserve the usefulness of the data for users. To target the anonymization, it is therefore necessary to decide which variables qualify as identifiers or as sensitive to disclosure. A rationalized analysis of the attributes of a dataset, their characteristics and their modalities is therefore indispensable for classifying the attributes before any anonymization exercise and/or assessment of the risk of disclosure of personal data. Rationalizing the identification of the attributes to be anonymized addresses the potential subjectivity and/or imprecision of the analyses that can arise when the classification of the attributes is left to the choice of the user/anonymizer and is not based on an expert's opinion.

The choice to anonymize data often results from an ethical and legal compromise, driven by a desire or an obligation to protect individuals and their personal data. In particular, anonymization is used for the dissemination and sharing of data deemed to be of public interest, such as open data.

A first step usually consists of removing the identifiers, such as surnames, first names, tax identifiers and social security numbers, from the files or databases concerned.

The next step is to apply "filters" and "cryptographic transformations" to the files or databases (e.g. encryption and/or hashing of the data by a dedicated algorithm such as SHA, the Secure Hash Algorithm). Before this work, however, the data manager carries out or commissions a study clarifying the need for anonymization, its objectives and its requirements (e.g. whether the anonymization must be reversible), prioritizing where necessary the data to be protected according to their degree of "sensitivity" and according to the purpose of the processing that the information must subsequently undergo. The data manager can thus produce and compare several anonymization scenarios in order to choose the solution that seems most relevant (according to its own requirements and those of the law). In all cases the anonymization must resist dictionary attacks.
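The requirement to resist dictionary attacks can be illustrated with a minimal sketch. It assumes, purely for illustration, that identifiers are replaced by SHA-256 digests, and contrasts an unkeyed hash with a keyed hash (HMAC) whose secret key is held only by the data manager. The function names, the key and the identifier values are hypothetical; keyed hashing is one common mitigation and is not presented here as the method of the invention.

```python
import hashlib
import hmac
from typing import List, Optional


def plain_hash(value: str) -> str:
    """Unkeyed SHA-256 digest: anyone can recompute it for a guessed value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, secret_key: bytes) -> str:
    """HMAC-SHA256 digest: recomputing it requires knowledge of the secret key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(target_digest: str, candidates: List[str]) -> Optional[str]:
    """Hash every candidate without a key and return the one that matches, if any."""
    for candidate in candidates:
        if plain_hash(candidate) == target_digest:
            return candidate
    return None


# Hypothetical values, for illustration only.
SECRET_KEY = b"key-held-only-by-the-data-manager"
identifier = "ID-000123"
guesses = ["ID-000122", "ID-000123", "ID-000124"]

recovered_from_plain = dictionary_attack(plain_hash(identifier), guesses)
recovered_from_keyed = dictionary_attack(keyed_hash(identifier, SECRET_KEY), guesses)
```

With the unkeyed hash, an attacker who can enumerate plausible identifiers recovers the original value; with the keyed digest the same enumeration fails without the key. Whoever holds the key can still re-link the values, so on its own this amounts to pseudonymization rather than anonymization.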

Several phases and levels of anonymization sometimes follow one another: for example, the hospital carries out a first anonymization, the data processing center can then complete this work, and the secondary users (researchers in general) can further anonymize the reworked data (before its publication in a journal or distribution to other users). Many methods exist: deletion of some data and/or manual transcoding, generalization, addition of noise, use of pseudonyms (for example for the doctor/patient pair), and encryption (usually with a public key, possibly fragmented, held by the competent authority).

In the medical field, the notion of anonymized identity and of re-identification of the patient covers the direct and indirect means of re-identification (e.g. name, address, ...) but also the encrypted data if the means of decryption is available.

To limit the risk of information leakage, a person (e.g. a patient) is included in an anonymous database only if it is mandatory or really useful, and only one anonymized database can be associated with a given project. Greater legal certainty is obtained if all the persons listed in it have given their consent (in writing or by providing their identifier, for a medico-commercial study, for example), but this type of database induces interpretation bias.

Of course, at each level of production or data storage:

- Internal staff must be subject to access control mechanisms to prevent unauthorized access;

- Mechanisms should be provided to detect and block intrusion attempts (through the Internet or other means), and in particular malicious attempts at data inference, abuse of power, etc.

State of the art

Patent application WO 2015066523 describes an example of a computer-implemented method for providing better levels of data privacy, anonymity and security by allowing the subjects to whom data belong to remain "dynamically anonymous", that is to say anonymous for as long as they wish and to the extent desired.

Embodiments include systems that create, access, use, store and/or erase data with increased levels of privacy, anonymity and security, thereby obtaining better qualified and more accurate information. For data to be shared with third parties, embodiments may enable controlled information sharing that delivers temporally, geographically and/or usage-limited information to the receiving party. In one example, anonymity scores can be calculated for the shared data items, so that the level of consent/commitment required from the data subject before the relevant data items are shared with third parties can be specified.

Patent application WO2012080081 relates to a computer-implemented method of anonymizing data from a data source for a target application, the method comprising: identifying sensitive data items in data from the data source through a discovery tool and generating data definitions for data items indicating the sensitive data items, the data definitions including at least one property for the data items; specifying a runtime rule set including at least one runtime rule, the runtime rule including a runtime anonymization protocol, the runtime rule set being specified via an interface; mapping the runtime rule set to the data definitions generated by the discovery tool for each of the sensitive data items; and consuming the generated data definitions and applying the mapped runtime anonymization protocol to the data definition of the sensitive data item, to anonymize the sensitive data item for the target application.

Patent application EP2752786 is also known. It describes an anonymization device and an anonymization method in which all the data satisfy the anonymity level requested for each of them, and which prevent the loss of information value that results from abstracting the entire data collection. The anonymization device comprises: an anonymization means for performing an anonymization processing in which a group of data is treated as a processing unit, for a data collection comprising at least two data items; an anonymity level specifying means for specifying an adaptive anonymity level for each group; and an anonymity evaluation means for judging whether a group meets the specified adaptive anonymity level. On the basis of the result from the anonymity evaluation means, the anonymization means further performs an anonymization processing of the data collection that has already been anonymized.

European patent application EP2573699 discloses another example of an anonymization device, which automatically configures a general hierarchical tree of attribute values in identity information protection technology. In addition, the anonymization device quantitatively evaluates the amount of information lost when an attribute value is generalized, and can thus automatically evaluate priorities between anonymized data and between data being anonymized. The information on each person includes the person's attribute values for a plurality of attributes. Anonymization is performed by obscuring the attribute values, and a structure in which the attribute values to be obscured are expressed as a tree according to the obscuration level is called a general hierarchical tree. The described device configures this tree automatically using frequency information on the attribute values. Moreover, by defining a means of measuring the amount of information lost using the general hierarchical tree, the quantity of information lost between two anonymized data sets, or between data being anonymized, is quantitatively evaluated.

US patent application 2017/0124336 describes an automated method of identifying the attributes for the anonymization exercise. This method is based on data encryption, a step carried out before studying the level of sensitivity of the data and therefore their degree of requirement in terms of anonymization. The application proposes three methods for choosing values/attributes for anonymization. A first method consists in comparing the different values with values present in a dictionary, with which different levels of sensitivity are associated. Attributes for which the presence of sensitive values in the dataset exceeds a predetermined threshold are selected for anonymization. A second classification method is based on comparing the distribution of the values of an attribute in the dataset with a known distribution. This method can confirm the results of the first method. A final method is to provide the anonymizer with a portion of the dataset in its original version (before encryption) and to generate from this sample a number of expressions for one or more attributes. The rest of the dataset is encrypted and compared with these generated expressions to identify certain attributes and their sensitivity.

Disadvantages of prior art

The solutions of the prior art are suited to preparing anonymous databases at the time they are created. However, they do not make it easy to change the anonymization, for example when the addition of new entries modifies the context of anonymization. In that case the solutions of the prior art require reprocessing the entire database, which may take considerable computation time for databases that can amount to several terabytes.

Furthermore, the solutions of the prior art do not allow the level of anonymization requirement to be adjusted in a flexible and dynamically scalable manner according to the possibilities of re-identification through elaborate processing of the data.

An overly demanding anonymisation leads to the loss of any usefulness / value of the data.

Conversely, if the richness of the information accessible by processing the data is favored, the anonymization risks being insufficient with respect to regulatory standards.

The trade-off between these two constraints evolves with the number and the nature of the entries recorded in the database.

For example, sex combined with age can be identifying, which requires a transformation/anonymization action, especially when the data also contain information relating to a given pathology. However, if all the entries/records correspond to the same sex, or to the same age group, the information is in fact not identifying. But if new entries change this situation, the "sex" or "age" information may require different treatment.
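The sex/age example can be made concrete with a small sketch. It assumes, as a simplification that is not stated in the text, that a combination of attributes is treated as identifying when at least one record is alone in its group of identical values. The record contents and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

Record = Dict[str, str]


def distinct_values(records: List[Record], attribute: str) -> Set[str]:
    """Set of values actually taken by an attribute in the data set."""
    return {record[attribute] for record in records}


def smallest_group_size(records: List[Record], attributes: List[str]) -> int:
    """Size of the smallest group of records sharing the same values for `attributes`."""
    groups = Counter(tuple(record[a] for a in attributes) for record in records)
    return min(groups.values())


# Hypothetical records: every entry has the same sex and age group.
records = [
    {"sex": "F", "age_group": "30-39", "pathology": "A"},
    {"sex": "F", "age_group": "30-39", "pathology": "B"},
    {"sex": "F", "age_group": "30-39", "pathology": "A"},
]
size_before = smallest_group_size(records, ["sex", "age_group"])

# A new entry changes the situation: it is alone in its (sex, age group) group.
records_after = records + [{"sex": "M", "age_group": "60-69", "pathology": "A"}]
size_after = smallest_group_size(records_after, ["sex", "age_group"])
```

Before the new entry, the combination distinguishes no one (a single group of three records); afterwards one record is unique on sex and age group, which is the kind of change that would call for reclassifying these attributes.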

In addition, anonymization requires a preliminary step of identifying the attributes/values to be anonymized. This step is left to the choice of the anonymizer/user and is therefore subject to subjectivity and imprecision in the classification. Moreover, even the work that focuses on the classification of attributes does not provide a clear and documented methodology for qualifying attributes.

Solution provided by the invention

The present invention aims to overcome these disadvantages by proposing a method that provides different levels of anonymization through a classification of the variables of a database.

The invention relates, in its most general sense, to a method of automatically processing a digital data set, consisting of:

- recording an original data set in a non-permanent memory,

- recording in a permanent memory:

o a digital file constituted by a table defining at least the identifiers/names of the variables and, for each of said variables:

■ an "Identifier Status" parameter [identifier "I", quasi-identifier "IQ", non-identifier "NP"],

■ a "Sensitivity Status" parameter [yes "S", or no "NS"]. This parameter depends on the selected definition of sensitivity:

- "regulatory" sensitivity, limited to the legal requirements in terms of protection of privacy,

- "general" sensitivity, encompassing other aspects such as psychological, cultural, ...

o a digital file constituted by a table of census variables of the reference population with, for each variable:

■ the different modalities/values taken by the variable according to the census,

■ the frequency of appearance of each modality in the reference population (France, United States, ...),

■ an order of the identification power of the different census variables,

o a digital file constituted by a table of variables with an established order of the degree of ease (208) with which a potential attacker can access the information on the different variables. This order can be deduced from databases tracing the history of attacks,

o a digital file constituted by a table of "sensitive" attributes, for which the values/modalities are classified in order of sensitivity.
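The four tables kept in permanent memory can be pictured as simple data structures. The following is only a sketch: the class and field names are hypothetical, and the text does not prescribe any particular storage format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class IdentifierStatus(Enum):
    IDENTIFIER = "I"
    QUASI_IDENTIFIER = "IQ"
    NON_IDENTIFIER = "NP"


class SensitivityStatus(Enum):
    SENSITIVE = "S"
    NOT_SENSITIVE = "NS"


@dataclass
class AttributeEntry:
    """One row of the attribute table."""
    name: str
    identifier_status: IdentifierStatus
    sensitivity_status: SensitivityStatus


@dataclass
class CensusVariable:
    """One row of the reference-population table."""
    name: str
    modality_frequencies: Dict[str, float]  # modality -> share of the population
    identification_rank: int                # 1 = highest identification power


@dataclass
class PermanentStore:
    """The four tables recorded in permanent memory."""
    attributes: Dict[str, AttributeEntry] = field(default_factory=dict)
    census: Dict[str, CensusVariable] = field(default_factory=dict)
    # Variables ordered from easiest to hardest to access for an attacker.
    accessibility_order: List[str] = field(default_factory=list)
    # For each sensitive attribute, its values from most to least sensitive.
    sensitivity_order: Dict[str, List[str]] = field(default_factory=dict)


store = PermanentStore()
store.attributes["postal_code"] = AttributeEntry(
    "postal_code", IdentifierStatus.QUASI_IDENTIFIER, SensitivityStatus.NOT_SENSITIVE
)
store.sensitivity_order["disease"] = [
    "sexually transmitted disease", "chronic disease", "other",
]
```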

The method consists of applying:

- a first treatment based on the attribute repository, called "Initial Classification", consisting of:

o associating a "status" parameter with each of the variables of said original data set, and processing the variables associated with a "hidden" status ("I", "IQ" or "S"), that is to say requiring an action before the data are shared, to prevent their normal use (without anonymization, for example) in said data set,

o assigning to each of the variables associated with an "NP"/"NS" status a flag of non-processing and of conservation in the final data set,

- a second treatment concerning the residual variables associated with a "quasi-identifier" status, consisting of:

o prohibiting their exploitation by assigning them a "hidden" status to prevent their normal use in the final version of the data set,

OR

o assigning to each of said residual variables:

■ a first indicator of the availability of the associated value from external data sources, such as a web crawler or a repository of historical attacks,

and/or

■ a second indicator corresponding to the frequencies of the values of said variables in the general population (also called the reference population), of which the data set is a subset,

o ordering each of said residual variables according to said associated indicators, which results, for example, in different processing/anonymization levels during the anonymization process. This order expresses the final classification of the attributes assigned the numerical sequence "IQ",

- a third treatment concerning the residual variables associated with a "regulatory" sensitivity parameter, consisting of:

o prohibiting their exploitation by assigning them a "hidden" status to prevent their normal use in the final version (215) of the data set,

OR

o assigning to each of these residual variables a sensitivity indicator by referring to a list of sensitive variables whose modalities/values are ranked from the most sensitive to the least sensitive. These indicators are calculated from the frequency of occurrence of the most sensitive values of the sensitive attribute. They are then compared with a previously defined "acceptable" frequency threshold,

o keeping, for each of the residual variables for which the frequency of occurrence of the sensitive values is greater than the threshold value, the "hidden" status to prevent their normal use in said data set,

o assigning to the remaining variables a "hidden" status that is more "flexible" in terms of processing requirements during the anonymization process.
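The first treatment, and the selection of the residual variables handed to the second and third treatments, can be sketched as follows. This assumes a repository reduced to a mapping from variable name to a pair (identifier status, sensitivity status); the function names and the sample variables are hypothetical.

```python
from typing import Dict, List, Tuple

Status = Tuple[str, str]  # (identifier status, sensitivity status)


def initial_classification(
    variables: List[str], repository: Dict[str, Status]
) -> Tuple[Dict[str, Status], List[str]]:
    """Split variables into "hidden" ones (action required) and kept ones."""
    hidden: Dict[str, Status] = {}
    kept: List[str] = []
    for name in variables:
        identifier_status, sensitivity_status = repository[name]
        if identifier_status in ("I", "IQ") or sensitivity_status == "S":
            hidden[name] = (identifier_status, sensitivity_status)
        else:
            # "NP" and "NS": flagged for conservation without processing.
            kept.append(name)
    return hidden, kept


def residual_variables(hidden: Dict[str, Status]) -> Tuple[List[str], List[str]]:
    """Variables passed on to the quasi-identifier and sensitivity treatments."""
    quasi_identifiers = [n for n, (i, _) in hidden.items() if i == "IQ"]
    sensitive = [n for n, (_, s) in hidden.items() if s == "S"]
    return quasi_identifiers, sensitive


# Hypothetical repository content.
repository = {
    "social_security_number": ("I", "NS"),
    "postal_code": ("IQ", "NS"),
    "disease": ("NP", "S"),
    "favourite_colour": ("NP", "NS"),
}
hidden, kept = initial_classification(list(repository), repository)
quasi_identifiers, sensitive = residual_variables(hidden)
```

In this sketch only `favourite_colour` is kept untouched; `postal_code` goes on to the quasi-identifier ordering and `disease` to the sensitivity analysis.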

According to a particular embodiment, the method comprises a fourth treatment concerning the residual variables associated with a "general" sensitivity parameter, consisting of assigning to some of said variables a "hidden" status to prevent their normal use in said data set.

According to one variant, the method comprises, prior to the first classification step, a treatment assigning to each of the variables for which no correspondence with the attribute repository (201) is established a provisional status in the attribute repository (201), which can be changed to a definitive status or rejected according to the opinion of an operator.

Advantageously, the method further comprises a step of dynamically applying, to the variables that cannot be associated with the attribute repository, a specific treatment consisting of recording in said repository the pair "variable, status" pending validation/rejection according to the opinion of an operator. This may also lead to enrichment of the "Power of identification" (207) and/or "sensitivity" repositories.
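The handling of variables that have no match in the repository can be sketched as a pending queue reviewed by an operator. How the proposed status is obtained is not specified in the text; here it is simply passed in as an argument, and all names are hypothetical.

```python
from typing import Dict, Optional


def lookup_or_register(
    name: str,
    repository: Dict[str, str],
    pending: Dict[str, str],
    proposed_status: str,
) -> Optional[str]:
    """Return the known status, or record a provisional (variable, status) pair."""
    if name in repository:
        return repository[name]
    pending[name] = proposed_status
    return None


def operator_review(
    name: str, repository: Dict[str, str], pending: Dict[str, str], accept: bool
) -> None:
    """Make the provisional status definitive, or reject it."""
    proposed_status = pending.pop(name)
    if accept:
        repository[name] = proposed_status


repository = {"postal_code": "IQ"}
pending: Dict[str, str] = {}

known = lookup_or_register("postal_code", repository, pending, "IQ")
unknown = lookup_or_register("loyalty_card_number", repository, pending, "I")
operator_review("loyalty_card_number", repository, pending, accept=True)
```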

According to a variant, said treatments are applied periodically [for example at each evolution of the data set (210) or of the regulatory framework].

Advantageously, said treatments applied to the "hidden" variables/values consist of:

- deleting said variables/values (especially for variables assigned the status "I"),

- saving said variables in a DMZ,

- anonymizing at least a part of the values corresponding to said variables.
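Two of the treatments listed above, deletion and anonymization of values, can be sketched as follows. The anonymizer shown is a simple generalization of a postal code, chosen only as an example; storage in a DMZ is not modelled, and all names and values are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def generalize_postal_code(name: str, value: str) -> str:
    """Hypothetical anonymizer: keep only the first two characters of a postal code."""
    return value[:2] + "***" if name == "postal_code" else value


def apply_hidden_treatments(
    records: List[Record],
    hidden_status: Dict[str, str],
    anonymize: Callable[[str, str], str],
) -> List[Record]:
    """Delete variables with status "I" and anonymize the other hidden variables."""
    output = []
    for record in records:
        treated = {}
        for name, value in record.items():
            status = hidden_status.get(name)
            if status == "I":
                continue  # deletion of directly identifying variables
            treated[name] = anonymize(name, value) if status else value
        output.append(treated)
    return output


records = [{"social_security_number": "ID-000123", "postal_code": "75011", "sex": "F"}]
hidden_status = {"social_security_number": "I", "postal_code": "IQ"}
shared = apply_hidden_treatments(records, hidden_status, generalize_postal_code)
```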

Detailed description of a non-limiting example of the invention

The present invention will be better understood on reading the following detailed description of a non-limiting example of the invention, with reference to the appended drawings, in which:

Figure 1 shows the flow diagram of the set of treatments.

Figure 2 shows the set of processing modules for implementing the invention.

Figure 3 shows a detailed view of the logic diagram of the first classification step.

Figure 4 presents a detailed view of the logic diagram of the attribute identification power analysis.

Figure 5 provides a detailed view of the logic diagram of attribute sensitivity analysis.

Context of the invention

The present invention relates to the automatic classification of the attributes of a digital data set in order to better target anonymization and/or re-identification (RI) risk assessment exercises. The aim is to automate the technical processes that ensure compliance with the regulatory framework on the protection of personal data.

The proliferation of personal data and the legal and regulatory developments in this area make database anonymization a significant issue for the owners/users of digital databases.

Some national and European bodies, such as the CNIL and the G29, insist on the importance of protecting personal data and propose anonymization methodologies that strike a compromise between the protection of privacy and the exploitation of the data. The regulatory framework is further strengthened by the European General Data Protection Regulation (GDPR), which aims to harmonize European legislation on personal data protection. In order to guarantee the protection of data, the anonymization work must be verified by assessing the risk of re-identification of personal data.

Anonymization and the assessment of the risk of disclosure of personal data generally concern certain variables in a dataset, particularly those of an identifying nature or of a sensitive character. At the same time, anonymization involves a loss of information in the dataset, which can affect the usefulness of the data for users such as researchers. It is therefore relevant for a user or owner of the data to target the variables on which the anonymization or the measurement of re-identification risk will be carried out. The classification of the attributes of a dataset is thus an asset in striking a balance between the obligation to respect privacy and the guarantee of the usefulness of the data.

The classification of the attributes is carried out "manually" by the owner of the data and remains tied to the owner's judgment. This leaves the classification of variables open to subjectivity and may therefore result in anonymization or re-identification risk assessment decisions that do not conform to the requirements for handling personal data. In addition, the context in which datasets are disseminated, the evolution of laws and customs and the characteristics of certain data sets mean that the classification of variables is never final and that an expert assessment is always desirable to ensure the ethical use of personal data. Given these elements, there is a technical problem related to the preliminary analysis (manual or automatic) of the attributes of a dataset in order to target the anonymization exercises and/or the assessment of the risk of re-identification of the data by a potential attacker of the dataset.

There is therefore a need to rationalize the classification of attributes, first in order to introduce objectivity into a task that is often subjective, and second to perform it automatically in view of the considerable number of data requiring treatment in some applications. This will make it possible to compare data sets of the same kind in terms of the risk of disclosure. A good classification of the attributes will facilitate decisions about the methods of anonymization and/or of measuring the risk of disclosure of personal data. In addition, a classification of attributes that depends on the dataset and is not necessarily definitive gives data owners more flexibility to satisfy different pairings of context of use and type of customer.

The present invention provides an attribute classification methodology that helps data owners share their data while meeting the requirements applicable to personal data, automatically and dynamically, with the parameters rescaled automatically as new data are introduced into the database.

The data owner accesses a dataset with attributes. Each attribute has a name by which it can be classified. Each attribute can take different modalities/values and so can also be classified according to the composition of these values (distribution, frequency or other).

The innovation of this classification methodology therefore lies in particular in the use of the modalities of the different attributes of a dataset in the attribute classification process.

Description of the invention

The invention classifies the data in two stages. The classification begins with a first step, in which the attributes of the dataset to be processed undergo a first classification using a purpose-built database called the "Attributes Repository". The invention is described through a detailed example with reference to the appended Figures 1 to 5, which show the functional architecture and the logic of the main functional modules.

1- Repository of attributes (201)

The "Attributes Repository" (201) applies a classification of the attributes according to two main criteria for the anonymization of personal data, namely:

- their identifying character (202) and

- their sensitive nature (203)

The identifying character (202) results in the recording of a numerical sequence with three states: "I" when the variable is directly identifying, such as the social security number; "IQ" when the variable can become identifying when combined with other variables associated with the same state, such as the postal code; or "NP" otherwise. The variables associated with the numerical sequence "NP" are not processed within the scope of this invention, which can reduce computation time in the anonymization process (204).

The sensitive character (203) results in the recording of a numerical sequence that can take two states: "S" when the variable is sensitive in the sense that its disclosure should be avoided, and "NS" in the other cases.

The repository (201) takes the form of a file containing variables drawn from the state of the art, the recommendations of privacy protection bodies and the use cases encountered. These variables are categorized to facilitate the use of the repository when classifying the attributes of a given dataset. The categories listed are: health, education and work, addresses, numbers and dates ...

Attribute classification is then based on two elements:

- the identifying character (202) of the attributes, and more precisely their identification power, and

- the sensitive nature (203) of the data in the legal sense and also, more generally, in the sense of customs, society, ...

For the sensitive character, two aspects are considered:

- belonging to a particular category in the legal sense and

- a more general sensitivity not limited to the level of legal requirement.

Attributes belonging, according to the law, to a "particular category" are classified as sensitive variables and assigned the numerical sequence "S", for example health data, criminal record, etc.

"General" sensitivity is not, however, reduced to legal sensitivity; it takes into account ethical and social aspects. The number of repetitions, for example, can be considered as a sensitive variable, and thus this variable can be associated with a sequence "S" or "NS" depending on the user's choice.

These criteria come from the literature on anonymization and their inclusion in categorizing variables helps to reduce the subjectivity of qualification and analysis. Indeed, most of the anonymization software / tools do not provide support to their users in the step of classifying the attributes of the datasets.

This repository (201) can be continuously enriched and is supposed to bring together a large set of variables related to many sectors of activity, in order to increase its usefulness.

The processing results in the enrichment of the data table constituting the repository (201) by numerical parameters defined as follows:

■ Category: the theme to which the attribute refers.

■ Attribute: the name of the attribute.

■ Identifier status: classifies the variable as an identifier "I" (to be eliminated from the anonymized version), a quasi-identifier "IQ" or a non-identifier "NP".

■ Special category in the legal sense: the attributes that must be considered sensitive and thus protected within the meaning of the law.

■ General sensitivity: includes sensitivity in the legal sense but also in the sense of ethics, custom, society, ...

■ Additional remarks: details to be taken into account when classifying.

Two other repositories are added to refine the classification of attributes (2nd classification stage):

2- Repository of sensitivity of the attributes (205)

In order to provide flexibility to users at the time of classification of the attributes, the "sensitivity of attributes" repository (205) proposes to reference, according to the degree of sensitivity, the different modalities / values of an attribute classified as sensitive and therefore assigned the numerical sequence "S".

Certain attributes classified as "sensitive" and assigned to the numerical sequence "S" take values that do not necessarily have the same degree of sensitivity and / or protection requirement, hence the interest of proposing a more refined analysis of sensitivity and sensitivity order for the different modalities of the sensitive attributes (206).

For example, to establish the order of sensitivity of the "Disease" attribute, it is relevant to take into account that certain diseases are more sensitive to disclosure than others, that is to say that their disclosure could cause more harm to the person (s) concerned.

Based on the international classification of diseases published by the World Health Organization (WHO), an order of sensitivity of the different diseases (depending on the degree of dangerousness and/or social judgments) can be proposed, taking the following form:

High Sensitivity Diseases: Sexually Transmitted Diseases, ...

Moderate Sensitivity Diseases: Chronic Diseases, ...

Low Sensitivity Diseases: Other

Validation of this categorization would require the advice of an expert.

Finally, the "Attributes Sensitivity Repository" (205) consists of the list of sensitive attributes identified by the "Attributes Repository" (201); for each attribute, the various possible modalities (which can evolve) are ranked in order of sensitivity and/or of requirement in terms of privacy protection and from a socio-cultural point of view.
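The way this repository feeds the sensitivity treatment can be sketched as follows: the indicator is taken here as the share of records whose value belongs to the most sensitive modalities, and it is compared with an "acceptable" threshold. The threshold value, the status labels and the function names are assumptions made for illustration.

```python
from typing import List, Set


def sensitivity_indicator(values: List[str], most_sensitive: Set[str]) -> float:
    """Share of records whose value belongs to the most sensitive modalities."""
    if not values:
        return 0.0
    return sum(1 for value in values if value in most_sensitive) / len(values)


def sensitivity_treatment(
    values: List[str], most_sensitive: Set[str], acceptable_threshold: float
) -> str:
    """Strict "hidden" status above the threshold, more flexible status otherwise."""
    indicator = sensitivity_indicator(values, most_sensitive)
    return "hidden" if indicator > acceptable_threshold else "hidden-flexible"


# Hypothetical "disease" column and hypothetical threshold.
disease_values = ["sexually transmitted disease", "chronic disease", "other", "other"]
most_sensitive = {"sexually transmitted disease"}

indicator = sensitivity_indicator(disease_values, most_sensitive)
strict = sensitivity_treatment(disease_values, most_sensitive, acceptable_threshold=0.2)
flexible = sensitivity_treatment(disease_values, most_sensitive, acceptable_threshold=0.5)
```

Here one record in four carries a highly sensitive value, so the attribute keeps the strict status with a threshold of 0.2 and receives the more flexible one with a threshold of 0.5.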

3- Repository of the Power of identification of the data (207)

The qualification of the quasi-identifier attributes assigned the numerical sequence "IQ" can be improved by moving to a finer degree of analysis (212). Indeed, the power of identification can vary from one quasi-identifying attribute to another. Thus, the level of requirement in terms of anonymization and/or anonymization evaluation can differ depending on the power of a quasi-identifier in the re-identification of an individual.

Two decision rules facilitating the classification of quasi-identifiers according to their power of identification are determined, the aim being to create an "Identification Power Repository" (207). Specifically, we propose two criteria on which the order of identifying power is based: the "ease of access of the attributes" and the "frequency of appearance in the reference population".

3.1- Facilitated Accessibility Repository (208)

The basic principle of this "Facilitated Accessibility Repository" (208) is that a potential attacker would not be able to access all the quasi-identifier attributes, assigned the numerical sequence "IQ", with the same degree of ease. Indeed, all other things being equal, some quasi-identifier "IQ" attributes are easier to access than others because of their public availability (on the Internet, on official sites, in competition results, etc.).

We therefore propose an order by category of attributes. If we consider, for example, the category of "dates", the different dates that can be found in datasets do not necessarily have the same degree of accessibility. We consider for example the following classification:

Dates easy to access: dates of birth, ...

Dates less accessible: dates of hospitalization, ...

Dates difficult to access: medical check dates, ...

The goal is to have a repository of quasi-identifying attributes, assigned the numerical sequence "IQ", classified according to their ease of access by a potential attacker.
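Such a repository can be sketched as a mapping from attribute to accessibility level, used to order the quasi-identifiers of a data set. The level labels follow the date example above; the numeric ranks, attribute names and function name are hypothetical.

```python
from typing import Dict, List

# Lower rank = easier for a potential attacker to access.
ACCESSIBILITY_RANK = {"easy": 0, "less accessible": 1, "difficult": 2}

# Hypothetical repository content for the "dates" category.
accessibility_repository: Dict[str, str] = {
    "date_of_birth": "easy",
    "hospitalization_date": "less accessible",
    "medical_check_date": "difficult",
}


def order_by_accessibility(attributes: List[str], repository: Dict[str, str]) -> List[str]:
    """Sort quasi-identifiers from the easiest to the hardest to access."""
    return sorted(attributes, key=lambda name: ACCESSIBILITY_RANK[repository[name]])


ordered = order_by_accessibility(
    ["medical_check_date", "date_of_birth", "hospitalization_date"],
    accessibility_repository,
)
```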

3.2- Reference Population Repository (209)

We also consider that the power of identification of a quasi-identifier attribute, assigned the numerical sequence "IQ", may depend on the frequency of appearance of its different modalities in the reference population, such as the French population. For example, all other things being equal, we can consider that the variable "date of birth" has a higher identification power than age: the "date of birth" carries more information than age and identifies individuals more precisely. In the same spirit, "being a woman / man" is less identifying than "being a philosophy teacher".

The "Reference Population Repository" (209) is therefore based on the distribution of the different attributes in the reference population, for example that of a country. For France, we refer for example to the data of the latest census of the French population (2013) to deduce the distribution of a set of attributes.

The data recorded concern the following variables at this level: age, socio-professional category, department of birth, department of previous residence, department of current residence, department of work, degree obtained, nationality, sector of activity, region of birth, region of previous residence, region of work, sex, marital status and type of activity. This list can be enriched by other data on the French population which will expand the list of attributes.

From this census, the attributes are classified according to the frequency of appearance of their different modalities / proposed values. The decision rule is:

The occurrence frequencies of the least frequent values / modalities of two quasi-identifying attributes A and B are compared. The attribute whose least frequent modality has the lower percentage of appearance is considered to have the higher identification power, which subsequently calls for a higher level of anonymisation and / or implies a higher risk of re-identification. This processing yields an order of the identification power of the attributes. The reference population repository (209) can be extended by taking into account the characteristics of other reference populations, such as those of the United States or Canada. We will ultimately have a database giving the main characteristics of the reference populations (the populations to which the data sets are attached).
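As an illustration only, the decision rule above can be sketched as follows. The function names and the modality shares are hypothetical placeholders chosen for the example; they are not census figures and are not taken from the repository (209).

```python
from typing import Dict, List

# Share of each modality in the reference population, per attribute.
# All figures below are illustrative placeholders, NOT census data.
Distribution = Dict[str, float]


def rarest_share(distribution: Distribution) -> float:
    """Return the share of the least frequent modality of an attribute."""
    return min(distribution.values())


def rank_by_identification_power(attributes: Dict[str, Distribution]) -> List[str]:
    """Order attribute names from highest to lowest identification power.

    Decision rule: the attribute whose least frequent modality is rarer
    in the reference population is considered more identifying.
    """
    return sorted(attributes, key=lambda name: rarest_share(attributes[name]))


# Hypothetical reference population distributions.
population = {
    "sex": {"F": 0.52, "M": 0.48},
    "marital_status": {"married": 0.45, "single": 0.40,
                       "divorced": 0.08, "widowed": 0.07},
    "occupation": {"employee": 0.60, "manager": 0.399,
                   "philosophy teacher": 0.001},
}

ranking = rank_by_identification_power(population)
print(ranking)
```

With these placeholder shares, "occupation" ranks first because its rarest modality covers the smallest part of the population, in line with the "teacher in philosophy" example given earlier.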

These two criteria of "ease of access" and "reference population" make it possible to build the identification power repository (207).

The two criteria can complement each other so as to cover as many as possible of the quasi-identifying attributes, assigned the numerical sequence "IQ", of a dataset.

Description of an Example of Implementation of the Invention

The classification of the attributes can proceed according to the following methodology:

Step (1): The data owner / user accesses a dataset (210) that contains attributes with different denominations. The data owner examines the attribute dictionary (if it exists), or the attributes directly, in order to classify them.

Step (2): During this step, the user accesses the "attribute repository" (201).

Step (3): In this step, the computer processes the data set (210) to match each of its attributes with the attribute repository (201). The attributes of the dataset (210) for which a match is found are assigned a marker. This matching can be done manually by the user, by comparing the list of attributes of the dataset with the attribute repository, or automatically by search algorithms such as the Rabin-Karp algorithm, string searching, approximate string searching, or semantic algorithms such as the Lesk algorithm.

Step (4): This step distinguishes the attributes of the dataset (210) for which a match has been found, on the one hand, from the attributes for which no match has been determined, on the other hand.

Step (5): This step consists in registering in the attribute repository (201) the attributes of the dataset (210) for which no match has been found. These variables are registered with a temporary status, which can be changed to final status or rejected according to the opinion of an operator.
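Steps (3) to (5) can be sketched as follows, as an illustration only. Approximate string matching is done here with `difflib.get_close_matches` from the Python standard library, which is one possible choice among the algorithms mentioned above. The repository layout, the `normalize` helper, the function names and the 0.8 cutoff are assumptions made for the example.

```python
import difflib

# Hypothetical excerpt of the attribute repository (201).
attribute_repository = {
    "date of birth": {"status": "IQ", "state": "final"},
    "social security number": {"status": "I", "state": "final"},
    "diagnosis": {"status": "S", "state": "final"},
}


def normalize(name: str) -> str:
    """Lower-case a denomination and unify separators before matching."""
    return name.strip().lower().replace("_", " ")


def match_attributes(dataset_attributes, repository, cutoff=0.8):
    """Steps (3) and (4): split attributes into matched and unmatched.

    Returns a dict {dataset attribute: repository entry} and the list of
    attributes for which no sufficiently close entry was found.
    """
    names = list(repository)
    matched, unmatched = {}, []
    for attribute in dataset_attributes:
        hits = difflib.get_close_matches(
            normalize(attribute), names, n=1, cutoff=cutoff
        )
        if hits:
            matched[attribute] = hits[0]   # the "marker" of Step (3)
        else:
            unmatched.append(attribute)
    return matched, unmatched


def register_pending(unmatched, repository):
    """Step (5): record unmatched attributes with a temporary status."""
    for attribute in unmatched:
        repository[normalize(attribute)] = {"status": None, "state": "temporary"}


matched, unmatched = match_attributes(
    ["Date_of_Birth", "diagnosis ", "favourite colour"], attribute_repository
)
register_pending(unmatched, attribute_repository)
print(matched, unmatched)
```

The temporary entries would then wait for an operator to validate or reject them, as described in Step (5).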

Step (6): This step performs a first classification of the attributes, denoted "Initial Classification" (211), based on the "attribute repository" (201). This step only affects those attributes for which a match with the "attribute repository" (201) has been established. At the end of this step, each of the marked attributes will have a status based on the attribute repository (201), expressed as a numerical sequence that can take different states: "I", "IQ", "NP", "S" or "NS".

This step is described in more detail with reference to the logic diagram object of FIG.

Using this repository, a user / owner of the data can make a first classification, denoted "Initial Classification" (211) of the attributes of its data set in order to target the anonymisation / disclosure risk measurement exercises.

A user accesses (301) the attribute dictionary of the dataset to be studied and the "attribute repository" (201). For the attributes for which a match in the attribute repository has been found (303), determining their identifier (304) / sensitive (305) status yields an initial classification of the attributes (306). This first classification is determined by referring to the different columns of the "attribute repository" file (201). Again, the correspondence between the attributes of the dataset (210) and their status in the "attribute repository" (201) can be established manually or automatically by search algorithms.

For the attributes of the dataset (210) assigned a numerical sequence "I", "NP" or "NS", the initial classification of the attributes (306) corresponds to their definitive classification. These attributes will therefore be permanently stored in the classification module (213), on which the anonymization process is based:

- Attributes assigned the numerical sequence "I" will undergo special processing and will not appear in the final dataset (215), in order to ensure privacy.

- Attributes assigned a numerical sequence "NP" or "NS" will not undergo (214) any particular processing (204) and will be kept directly in the final dataset (215).
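The routing that follows from the initial classification (211) can be sketched as below, as an illustration only. The sketch assumes, for simplicity, a single status per attribute, as in the description above. The route names and the `initial_classification` function are hypothetical.

```python
from collections import defaultdict

# Statuses whose initial classification is definitive.
FINAL_REMOVE = {"I"}            # identifiers: absent from the final dataset (215)
FINAL_KEEP = {"NP", "NS"}       # kept as-is, flag of non-processing (214)
# Statuses that go on to a finer analysis (hypothetical route names).
NEEDS_ANALYSIS = {
    "IQ": "identification_power_analysis",   # analysis (212)
    "S": "sensitivity_analysis",             # analysis (206)
}


def initial_classification(statuses):
    """Group attribute names by the treatment implied by their status."""
    routes = defaultdict(list)
    for attribute, status in statuses.items():
        if status in FINAL_REMOVE:
            routes["removed_from_final_dataset"].append(attribute)
        elif status in FINAL_KEEP:
            routes["kept_unchanged"].append(attribute)
        elif status in NEEDS_ANALYSIS:
            routes[NEEDS_ANALYSIS[status]].append(attribute)
        else:
            raise ValueError(f"unknown status {status!r} for {attribute!r}")
    return dict(routes)


# Hypothetical statuses obtained from the attribute repository (201).
routes = initial_classification({
    "social security number": "I",
    "date of birth": "IQ",
    "diagnosis": "S",
    "record id": "NP",
    "shoe size": "NS",
})
print(routes)
```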

Step (7): The user then chooses either to grant the attributes assigned a numerical sequence "IQ" or "S" a hidden status preventing their normal use in the final data set (215) and to go directly to the anonymisation process (204), or to proceed to the further processing of the data set (210) described below.

Step (8): This step applies only to the attributes assigned the numerical sequence "S", as selected by a filtering module (501). This step, called "sensitivity analysis" (206), is presented in more detail in the logic diagram of FIG.

The processing will be based on the result of the initial classification of the attributes (306) and on the "sensitivity repository" (205).

By accessing (502) the "attribute sensitivity repository" (205), the computer examines the distribution of the modalities of the sensitive attribute in the data set (503). The occurrence frequencies of the most sensitive modalities of the attribute are then calculated for the data set under study (504).
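The frequency computation (504) and its comparison with a predefined threshold (505 to 507) can be sketched as follows, as an illustration only. The function name, the example values, the set of sensitive modalities and the thresholds are hypothetical. The strict comparison reflects the wording "exceeds the threshold".

```python
from collections import Counter


def sensitivity_class(values, sensitive_modalities, threshold):
    """Classify a sensitive attribute from the values observed in the dataset.

    values: the column of the attribute in the dataset (210)
    sensitive_modalities: modalities flagged as most sensitive in (205)
    threshold: frequency threshold chosen beforehand (505)
    """
    if not values:
        raise ValueError("the attribute has no values")
    counts = Counter(values)
    # Counter returns 0 for modalities that do not occur in the column.
    sensitive_count = sum(counts[m] for m in sensitive_modalities)
    frequency = sensitive_count / len(values)
    label = "sensitive" if frequency > threshold else "less sensitive"
    return label, frequency


# Hypothetical column: 2 sensitive values out of 10.
diagnoses = ["flu"] * 8 + ["HIV", "cancer"]
strict = sensitivity_class(diagnoses, {"HIV", "cancer"}, threshold=0.1)
lenient = sensitivity_class(diagnoses, {"HIV", "cancer"}, threshold=0.5)
print(strict, lenient)
```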

The frequency of appearance of the sensitive modalities of this sensitive attribute is then compared to a previously defined frequency threshold (505). The attribute in question retains its "sensitive" character if the frequency of appearance of the "sensitive" modalities in the data set exceeds the chosen threshold (506). Otherwise, the attribute is assigned to a "less sensitive" class (507). We thus obtain a final classification of the sensitive attributes (508). This rule provides flexibility to users during the anonymization process (204) in order to obtain the final version of the dataset (215).

Step (9): This step applies only to the attributes assigned the numerical sequence "IQ", as selected by a filtering module (401). This step, named "analysis of the power of identification" (212), is presented in more detail in the logic diagram of Figure 4.

The processing will be based on the result of the initial classification of the attributes (306) and on the "identification power referential" (207).

The computer accesses (402) the "attribute access facility repository" (208) and then compares (403) the degrees of ease of access of the various attributes of the dataset (210) assigned the numerical sequence "IQ", based on this same repository (208). This comparison results in an order of "ease of access" of the different attributes.

The computer then accesses (404) the "reference population repository" (209) and sorts (405) the attributes assigned the numerical sequence "IQ" according to the order established in that repository (209). This ordering can be done manually or automatically by sorting algorithms such as "selection sort", "tree sort", etc.
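The two orderings (403) and (405) can be sketched as follows, as an illustration only. The text does not specify how they are merged into a final order (406). The lexicographic combination used here (ease of access first, then rarity in the reference population) is therefore an assumption, as are the class name, the rank values and the shares.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuasiIdentifier:
    name: str
    access_rank: int      # 1 = easiest for an attacker to obtain, per (208)
    rarest_share: float   # share of the rarest modality, per (209)


def final_order(attributes: List[QuasiIdentifier]) -> List[QuasiIdentifier]:
    """Order quasi-identifiers from highest to lowest re-identification power.

    ASSUMPTION: easier access ranks first; ties are broken by the rarity of
    the least frequent modality in the reference population.
    """
    return sorted(attributes, key=lambda a: (a.access_rank, a.rarest_share))


# Hypothetical values, following the date categories given in section 3.1.
candidates = [
    QuasiIdentifier("date of hospitalization", access_rank=2, rarest_share=0.0001),
    QuasiIdentifier("sex", access_rank=1, rarest_share=0.48),
    QuasiIdentifier("medical check date", access_rank=3, rarest_share=0.0001),
    QuasiIdentifier("date of birth", access_rank=1, rarest_share=0.00003),
]

ordered_names = [a.name for a in final_order(candidates)]
print(ordered_names)
```

Other combinations, such as a weighted score of the two criteria, would fit the description equally well. The choice would depend on how the two repositories are populated.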

The comparison of the quasi-identifying attributes from the point of view of ease of access (403) and the order of the different attributes in terms of their characteristics in the reference population (405) make it possible to obtain a final order (406) of the attributes assigned the numerical sequence "IQ" according to their power of re-identification. This order provides flexibility to users during the anonymization process (204), in terms of the anonymization requirements for the different tagged attributes of the dataset (210).

Step (10): This step marks the end of the classification process of the attributes of the dataset (210). The results of the sensitivity analysis (206) and of the analysis of the power of identification (212) are grouped in a classification module (213), on which the computer relies for the data processing (204) of the data set (210). This processing may result in an anonymization of certain attributes, with different degrees of requirement, in order to arrive at a final version of the dataset (215). In all cases, data processing must meet privacy needs while maintaining the usefulness of the dataset (210).

Claims

1 - Method of automatically processing a digital data set, consisting of:

- recording in a non-permanent memory a set of original data,

- recording in a permanent memory:

o a digital file (201) constituted by a table determining at least identifiers / denominations of the variables and, for each of said variables:

■ an "Identifier Status" parameter [identifier "I", quasi-identifier "IQ", non-identifier "NP"]

■ a "Sensitivity Status" parameter [yes "S", or no "NS"]

o a digital file consisting of a table of census variables of the reference population (209) with, for each:

■ an order of the power of identification of the different census variables

o a digital file constituted by a table of variables with an established order of the degree of ease (208) with which a potential attacker can access the information on the various variables

o a digital file consisting of a table of "sensitive" attributes (205), for which the values / modalities are ranked in order of sensitivity,

the method applying:

- a first processing based on the attribute repository (201), denoted "Initial Classification" (211), consisting of:

o associating with each of the variables of said original data set a "status" parameter and processing the variables associated with a "hidden" status ("I", "IQ" or "S"),

o assigning to each of the variables associated with an "NP" / "NS" status a flag of non-processing and final preservation (214) in the final data set (215),

- a second processing relating to the residual variables associated with a "quasi-identifier" status, consisting of:

o prohibiting their exploitation by assigning a "hidden" status to prevent their normal use in the final version (215) of said dataset (210)

OR

o assigning to each of said residual variables:

■ a first indicator corresponding to the availability of the associated value from external data sources

and / or

■ a second indicator corresponding to the frequency of the values of said variables in the general population of which the data set is a subset

o directing each of said residual variables in accordance with said associated indicators, which will result, for example, in different levels of processing / anonymisation during the anonymization process (204), determining the final classification of the attributes assigned a numerical sequence "IQ" (406),

- a third processing concerning the residual variables associated with a "regulatory" sensitivity parameter, consisting of:

OR

o assigning to each of these residual variables a sensitivity indicator by referring to a list of sensitive variables with their different modalities / values (205) ranging from the most sensitive to the least sensitive, calculated on the basis of the frequency of appearance of the most sensitive values of the sensitive attribute

o assigning the remaining variables a "hidden" status, but one that is more "flexible" in terms of processing requirements during the anonymization process (204).

2 - Method according to claim 1, characterized in that it further comprises a fourth processing concerning the residual variables associated with a "general" sensitivity parameter, consisting of assigning some of said variables a "hidden" status to prevent their normal use in said data set.

3 - Method according to claim 2, characterized in that it comprises, before the first classification step, a processing for assigning to each of the variables for which no correspondence with the attribute repository (201) is established a temporary status in the attribute repository (201), which can be changed to definitive status or rejected according to the opinion of an operator.

4 - Method according to claim 1, characterized in that it further comprises a step of dynamically applying, to variables that cannot be matched to the attribute repository (201), a specific processing consisting of recording in said repository the pair "variable, status" pending validation / rejection according to the opinion of an operator. This would also imply potential enrichments of the "power of identification" (207) and / or "sensitivity" (205) repositories.

5 - Method according to claim 1, characterized in that said processings are applied periodically [for example at each evolution of the data set (210) or at each evolution of the regulatory framework].

6 - Method according to claim 1, characterized in that said processings applied to the "hidden" variables / values consist of:

saving said variables in a DMZ