FR3117230A1

FR3117230A1 - Methods for determining an anonymous data structure, data counting methods, device and system for implementing such methods

Info

Publication number: FR3117230A1
Application number: FR2012827A
Authority: FR
Inventors: Baptiste Olivier
Original assignee: Orange SA
Current assignee: Orange SA
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2022-06-10
Also published as: WO2022123172A1

Abstract

Methods for determining an anonymous data structure, data counting methods, device and system for implementing such methods. The invention relates to a method for determining a data structure from at least one personal datum d relating to an individual. Said method comprises a step of initializing a data structure L as well as a set of steps of: determining a value W_{d} equal to b × h(d) + (1-b) × h(V), where h is a hash function, b is a Bernoulli variable, and V is a uniform random variable independent of b; and, if the value W_{d} does not belong to the structure L: inserting the value W_{d} into the structure L if the cardinality of said structure L is less than a given number k; otherwise, if the cardinality of the structure L is greater than k and if the value W_{d} is less than the largest value of the structure L, inserting the value W_{d} into the structure L by replacing said largest value. Figure for abstract: Fig. 3

Description

Methods for determining an anonymous data structure, data counting methods, device and system for implementing such methods

The present invention belongs to the general field of information processing. More particularly, it relates to methods for determining anonymous data structures from "personal" data, as well as methods for counting distinct elements in sets of personal data. It also relates to a processing device and a computer system configured to implement said methods.

"Personal data" generally means data relating to persons who are identified directly or indirectly.

Personal data can be of different kinds, and can concern either natural or legal persons. Examples include medical data, university data, data reflecting certain characteristics of individuals acquired on or via one or more communication networks, such as data from a social graph of individuals representing a network of connections and relationships between these individuals (commonly referred to as a "social network"), data extracted from records of calls made in a telecommunications network (representative of the mobility of individuals between the various relay antennas of the network), browsing data of individuals on the public Internet (in particular the sites visited and the transitions from one site to another), data relating to the use by individuals of various connected objects, etc.

It is therefore understandable that making this type of data public can infringe the privacy of the individuals concerned. Yet, with the current development of telecommunications networks and of the growing number of services that rely on them (social networks, connected objects, etc.), there is a spectacular increase in the personal data exchanged via or on these networks.

The state of the art currently includes various methods for anonymizing (i.e. making anonymous) personal data stored in a database. As opposed to personal data, anonymous data such as those obtained via these methods are data from which it is impossible to: (i) target an individual, (ii) know whether data are linked to a single individual, and (iii) infer information about an individual. Thus, data anonymization consists in modifying the content or structure of personal data in order to make it difficult, at least theoretically, to identify the individuals concerned from the anonymized data.

With access to such anonymized data, for example if an entity holding them decides to make them public, it becomes possible to carry out counting operations with a limited risk of revealing sensitive information about said individuals. In particular, because of the large volumes of data that can now be collected, there is strong interest in operations that count the number of distinct data in a set of anonymized data, in order to carry out statistical studies on events performed by the individuals concerned during a determined period.

By way of non-limiting example, such statistical studies may concern the characterization of population flows in a given geographical area, for example to determine which actions can be taken to regulate such flows (modification of one or more items of signage, construction of one or more new infrastructures intended to facilitate travel, etc.), or to assess the attractiveness of said geographical area. Thus, one example application consists in analyzing data structures obtained after anonymization of the telephone numbers of individuals recorded over a determined period in a train station (in other words, in this example application, an event performed by an individual consists of a telephone call made or received from said station).

It should be noted that the problem of counting anonymized data has already been studied extensively. In particular, various methods have been proposed for constructing data structures from which counting operations can be carried out relatively efficiently. Known examples include Bloom filters, Flajolet-Martin estimators and so-called "k-minimum values" structures (abbreviated "KMV"). Such structures are described, for example, in the document: "Cardinality estimation: An experimental survey", Proceedings of the VLDB Endowment, 11(4): 499-512, 2017.

Each of these structures nevertheless has drawbacks. For a given set of personal data, a Bloom filter contains all the hash values of these personal data, which does not guarantee very effective confidentiality against an attack, for example a brute-force attack.

A KMV structure, although more compact (it contains only the k smallest hash values), inherits the same anonymization drawback as that mentioned above for a Bloom filter.
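As an illustration of the conventional KMV structure referred to above, the following minimal Python sketch keeps the k smallest distinct hash values of a data set. The function names `h` and `kmv`, the constant `M` and the choice of SHA-256 truncated to 32 bits are assumptions made for the example only; the text does not specify a hash function.

```python
import hashlib

M = 2**32  # assumed cardinality of the image of the hash function


def h(d: str) -> float:
    """Hash a datum to a discrete value in [0, 1) (SHA-256 is an assumption)."""
    digest = hashlib.sha256(d.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / M


def kmv(data, k):
    """Return the k smallest distinct hash values of `data`, in increasing order."""
    return sorted({h(d) for d in data})[:k]
```

Because every retained value is a deterministic hash of a real datum, anyone able to enumerate candidate data can test whether a given datum is present, which is the weakness the passage describes.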

Finally, a Flajolet-Martin estimator does not make it possible to count distinct data in unions or intersections of sets, regardless of the level of anonymization that such an estimator may offer. This is particularly penalizing for statistical studies involving different groups of individuals (for example: determining how many people traveled by train between two cities during a given period of the day, from two data sets, each set containing personal data of persons present in a station of one of said two cities).

Ultimately, it emerges from the foregoing that the data structures proposed so far are deficient in that they do not make it possible both to count distinct data in sets, in particular in intersections and unions of sets, and at the same time to control the level of anonymization of these operations.

The objective of the present invention is to remedy all or part of the drawbacks of the prior art, in particular those set out above, by proposing a solution which, in comparison with the solutions of the state of the art, makes it possible to carry out counting operations that estimate the number of distinct elements in data sets, in particular in intersections and unions of data sets, with a better compromise between compactness (number of data processed to provide an excellent counting estimate) and a strong guarantee of confidentiality.

To this end, and according to a first aspect, the invention relates to a method, called the "first method", for determining a data structure from at least one datum d, said at least one datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event performed by said individual. Furthermore, said method comprises a step of initializing a data structure L to an empty set as well as a set E_DET of steps of:
- determining a value W_{d} equal to b × h(d) + (1-b) × h(V), where:
● h is a hash function with discrete values between 0 and 1,
● b is a Bernoulli variable with parameter p chosen in an interval that depends on M, r and ɛ, where M is the cardinality of the image of the hash function h, r is an upper bound on the number of times the datum d can be stored during said duration, and ɛ is a strictly positive number,
● V is a uniform random variable independent of b,
and, if the value W_{d} does not belong to the structure L,
- inserting the value W_{d} into the structure L if the cardinality of said structure L is less than a given number k,
- otherwise, if the cardinality of the structure L is greater than k and if the value W_{d} is less than the largest value of the structure L, inserting the value W_{d} into the structure L by replacing said largest value.
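The set E_DET of steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `h`, `M`, `randomized_value` and `e_det`, the use of SHA-256 truncated to 32 bits, and the domain from which V is drawn (64-bit integers) are not taken from the text, and the value of p is left to the caller because its admissible interval is not reproduced here. The structure is treated as full once it holds k values.

```python
import hashlib
import random

M = 2**32  # assumed cardinality of the image of h


def h(x) -> float:
    """Hash function with discrete values in [0, 1) (SHA-256 is an assumption)."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / M


def randomized_value(d, p, rng):
    """Return W_d = b*h(d) + (1-b)*h(V), with b ~ Bernoulli(p) and V uniform."""
    b = 1 if rng.random() < p else 0
    v = rng.getrandbits(64)  # V, drawn independently of b (domain is an assumption)
    return b * h(d) + (1 - b) * h(v)


def e_det(L, d, k, p, rng):
    """Apply the steps of E_DET to datum d and update the set L in place."""
    w = randomized_value(d, p, rng)
    if w in L:
        return L              # W_d already belongs to L: nothing to do
    if len(L) < k:
        L.add(w)              # plain insertion while L holds fewer than k values
    elif w < max(L):
        L.remove(max(L))      # insertion by replacement of the largest value
        L.add(w)
    return L
```

Iterating `e_det` over a collection of data, starting from `L = set()`, yields a structure of at most k distinct values; with p equal to 1 the result coincides with a conventional KMV structure.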

Thus, said first method yields a data structure L that is more compact than the data set from which it is derived and that is formed solely of mutually distinct values, while ensuring differential privacy of order ɛ for these values (because of the interval in which the parameter p is chosen).

It should be noted that such a data structure L differs from a conventional KMV-type structure in that the presence of a datum in said structure L is guaranteed with a probability p (which is what ensures differential privacy of order ɛ), whereas the KMV structure comprises, by construction, all the hash values of the distinct values of the data set initially considered. Thus, the data structure L obtained by means of the first method offers a much better level of confidentiality than a conventional KMV structure.

It is also understood that the gain in compactness provided by said data structure L makes it possible to carry out counting operations (estimating the number of distinct elements in a data set, in a union of data sets or in an intersection of data sets) very efficiently (speed of computation, savings in memory storage, etc.).

In particular embodiments, the determination method may further include one or more of the following characteristics, taken in isolation or in any technically possible combination.

In particular embodiments, the data structure L is determined from a plurality of data, the steps of said set E_DET being iterated for each datum of said plurality of data, the data structure considered for the insertion of a datum during a current iteration of the set of steps E_DET being the data structure into which a datum was inserted during the iteration of the set of steps E_DET preceding said current iteration.

In particular embodiments, said set E_DET of steps is implemented:
- after all the data of said plurality of data have been stored, or
- dynamically, after each storage of a datum of said plurality of data.

In particular embodiments, said method comprises, for each execution of the step of insertion by replacement, a step of incrementing a value N called the "counting value", said counting value N being representative, at the end of the method, of the total number of times said step of insertion by replacement has been executed during the implementation of the method.

In particular embodiments, said counting value N is initialized either to zero or to a realization of a centered Laplace random variable.
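The handling of the counting value N can be sketched as follows in Python. The function names `init_count` and `insert_and_count` and the `laplace_scale` parameter are hypothetical; the text does not state the scale of the Laplace variable, so it is left to the caller. The Laplace draw uses the fact that the difference of two independent exponential variables with the same mean follows a centered Laplace distribution.

```python
import random


def init_count(rng, laplace_scale=None):
    """Initial value of N: zero, or a draw from a centered Laplace distribution.

    `laplace_scale` is an assumed parameter; the text does not specify it.
    """
    if laplace_scale is None:
        return 0.0
    rate = 1 / laplace_scale
    # Difference of two i.i.d. exponentials ~ Laplace(0, laplace_scale).
    return rng.expovariate(rate) - rng.expovariate(rate)


def insert_and_count(L, w, k, n):
    """Insert the value w into the set L and return the updated counting value."""
    if w in L:
        return n
    if len(L) < k:
        L.add(w)
    elif w < max(L):
        L.remove(max(L))
        L.add(w)
        n += 1  # one more insertion by replacement
    return n
```

Only insertions by replacement increment N; plain insertions made while the structure holds fewer than k values leave it unchanged.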

According to a second aspect, the invention relates to a method, called the "second method", for determining a data structure, called the "second data structure" L2, from a first data structure L1 obtained by applying a k'-minimum values algorithm to data, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event performed by said individual, said k'-minimum values algorithm being implemented with a hash function h with discrete values between 0 and 1. Furthermore, said method comprises steps of:
- determining a value D1 equal to the quotient of k'-1 by the largest value of the first structure L1,
- uniformly sampling a number D1 of values in the image of the hash function h, so as to obtain a set L_D1 comprising said D1 sampled values,
- uniformly sampling a number D2 of values in the set L_D1, D2 being equal to the integer part of the product [1-p] × D1, where the parameter p depends on M and ɛ, M being the cardinality of the image of the hash function h and ɛ being a strictly positive number,
- selecting a number D3 of smallest values among said D2 sampled values, D3 being equal to the integer part of the product [1-p] × k, where k is a given number less than k',
- uniformly sampling a number D4 of values between the largest value of the first structure L1 and 1, so as to obtain a set L_D4 comprising said D4 sampled values, D4 being equal to D1-k,
- uniformly sampling a number D5 of values in the union of the sets L_D1 and L_D4, D5 being equal to the integer part of the product p × D1,
- selecting a number D6 of smallest values among said D5 sampled values, D6 being equal to k-D3,
- grouping said smallest values selected during said selections, so as to form said second structure L2.
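The steps of the second method can be followed literally in a short Python sketch. Several points are assumptions made for the example: the function name `second_method`, sampling without replacement, taking the integer part of D1 before sampling, representing the image of h as the values i/M for i from 0 to M-1, and passing p as a plain argument since its expression is not reproduced in this text.

```python
import math
import random


def second_method(L1, k_prime, k, p, M, rng):
    """Build L2 from a KMV structure L1 by following the steps as worded."""
    top = max(L1)
    D1 = int((k_prime - 1) / top)                    # integer part (assumption)
    L_D1 = [i / M for i in rng.sample(range(M), D1)]

    D2 = math.floor((1 - p) * D1)
    D3 = math.floor((1 - p) * k)
    smallest_D3 = sorted(rng.sample(L_D1, D2))[:D3]

    D4 = D1 - k
    first_above = math.floor(top * M) + 1            # image values above max(L1)
    L_D4 = [i / M for i in rng.sample(range(first_above, M), D4)]

    D5 = math.floor(p * D1)
    pool = sorted(set(L_D1) | set(L_D4))             # union of L_D1 and L_D4
    D6 = k - D3
    smallest_D6 = sorted(rng.sample(pool, D5))[:D6]

    # Grouping of the two selections; duplicates, if any, are merged.
    return sorted(set(smallest_D3) | set(smallest_D6))
```

Because the two selections may share values, the resulting structure can hold slightly fewer than k values in this sketch.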

Said second method inherits the same advantages as those mentioned above with reference to said first method.

According to a third aspect, the invention relates to a method for inserting at least one datum into a data structure L, said at least one datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event performed by said individual. Furthermore, said method comprises steps of:
- obtaining a data structure L determined according to a first or a second determination method in accordance with the invention,
- implementing, for said at least one datum to be inserted, a set of steps identical to the set E_DET of steps of a first determination method according to the invention.

Said third method inherits the same advantages as those mentioned above with reference to said first method.

In particular embodiments, the insertion method may further include one or more of the following characteristics, taken in isolation or in any technically possible combination.

In particular embodiments, a plurality of data is considered for insertion into the structure L, said implementation step being iterated for each of the data to be inserted, the data structure considered for the insertion of a datum during a current iteration of the implementation step being the data structure into which a datum was inserted during the iteration of the implementation step preceding said current iteration.

In particular embodiments, said set identical to said set E_DET of steps of the first determination method is implemented:
- after all the data of said plurality of data have been stored, or
- dynamically, after each storage of a datum of said plurality of data.

In particular embodiments, when the data structure L obtained has been determined according to a first determination method in accordance with the invention, so as to obtain a counting value N corresponding to the total number of times the step of insertion by replacement has been executed, the insertion method also comprises, for each execution of the step identical to the step of insertion by replacement of said first determination method, a step of incrementing said counting value N.

According to a fourth aspect, the invention relates to a method for estimating the number of distinct data in a data set, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event carried out by said individual. Said method comprises the steps of:
- obtaining a data structure determined according to a determination or insertion method in accordance with the invention, the cardinality of said obtained structure being equal to k,
- determining a value Q equal to the quotient of k-1 by the largest value of the obtained data structure,
- determining an estimator G of the number of distinct data in the data set as a function of Q.
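The value Q of the second step can be computed directly from the structure. The sketch below assumes that the structure is a Python set of hash values normalized to the unit interval; the function name `q_value` and the sample values are hypothetical.

```python
def q_value(L: set) -> float:
    """Return Q = (k - 1) / max(L), where k is the cardinality of L."""
    k = len(L)
    if k < 2:
        raise ValueError("the structure must hold at least two values")
    return (k - 1) / max(L)


# Hypothetical structure of k = 5 normalized hash values.
L = {0.01, 0.03, 0.0625, 0.1, 0.125}
Q = q_value(L)  # (5 - 1) / 0.125
```

With normalized hash values, Q has the same form as the classical k-minimum-values cardinality estimate; in the method described here it is only an intermediate quantity from which the estimator G is derived.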

In particular embodiments, the estimator G is equal to g^-1(Q), where g^-1 is the inverse function of a function g of unknown β having the expression:

[expression missing from the source text]

in which:
● [definition missing from the source text],
● [definition missing from the source text],
● if the respective numbers of occurrences of the data of said data set are all equal to the same given value R, r_moy is equal to R,
● otherwise, and if the data structure is determined according to a first method or an insertion method in accordance with the invention, so that a counting value N is obtained, r_moy is equal to the integer part of the quotient of N by Q.
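Since the expression of g is not available in this text, the sketch below only illustrates the two operations that are fully specified: computing r_moy as the integer part of N/Q, and evaluating g^-1(Q) numerically. The inversion uses bisection and assumes that g is continuous and strictly increasing on the search interval, which is an assumption of this example and not a property stated in the text. The function `g_demo` is a stand-in chosen purely for demonstration; it is not the function g of the invention.

```python
import math
from typing import Callable


def r_moy(N: int, Q: float) -> int:
    """Integer part of the quotient of the counting value N by Q."""
    return math.floor(N / Q)


def invert_increasing(g: Callable[[float], float], Q: float,
                      lo: float, hi: float, tol: float = 1e-9) -> float:
    """Solve g(beta) = Q by bisection, assuming g increases on [lo, hi]."""
    if not g(lo) <= Q <= g(hi):
        raise ValueError("Q is not bracketed by g(lo) and g(hi)")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < Q:
            lo = mid      # the solution lies in the upper half
        else:
            hi = mid      # the solution lies in the lower half
    return (lo + hi) / 2


def g_demo(beta: float) -> float:
    """Stand-in for g (NOT the expression of the invention)."""
    return 0.9 * beta + 5.0


G = invert_increasing(g_demo, Q=95.0, lo=0.0, hi=1e6)  # close to 100
r = r_moy(N=640, Q=32.0)                               # floor(640 / 32)
```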

According to a fifth aspect, the invention relates to a method for estimating the number of distinct data in a union of a plurality of data sets, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event carried out by said individual. Said method comprises the steps of:
- obtaining, for each data set, a data structure determined according to a determination or insertion method in accordance with the invention,
- selecting a number D7 of smallest values of a structure corresponding to the union of the obtained data structures, D7 being equal to the smallest of the respective cardinalities of said obtained data structures,
- determining a value Q_UNION equal to the quotient of D7-1 by the largest of said selected smallest values,
- determining an estimator G_UNION of the number of distinct data in said union of data sets as a function of Q_UNION.
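The selection step and the computation of Q_UNION can be sketched as follows, assuming each structure is a Python set of normalized hash values. The function name `q_union` and the sample values are hypothetical.

```python
def q_union(structures: list) -> tuple:
    """Return (D7, selected values, Q_UNION) for a list of structures."""
    if not structures or any(len(s) == 0 for s in structures):
        raise ValueError("every structure must be non-empty")
    d7 = min(len(s) for s in structures)      # smallest cardinality
    union = set().union(*structures)          # union of the structures
    selected = sorted(union)[:d7]             # the D7 smallest values
    return d7, selected, (d7 - 1) / selected[-1]


# Hypothetical structures obtained for two data sets.
A = {0.01, 0.03125, 0.09, 0.2}
B = {0.02, 0.03125, 0.125}
D7, selected, Q_UNION = q_union([A, B])
# D7 = 3, selected = [0.01, 0.02, 0.03125], Q_UNION = 2 / 0.03125
```

Truncating the union to D7 values keeps the combined structure no larger than the smallest input structure, so the quotient is computed on a sample of consistent size.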

In particular embodiments, the estimator G_UNION is equal to g^-1(Q_UNION), where g^-1 is the inverse function of a function g of unknown β having the expression:

[expression missing from the source text]

in which:
● [definition missing from the source text],
● [definition missing from the source text],
● if the respective numbers of occurrences of the data of said data sets are all equal to the same given value R, r_moy is equal to R,
● otherwise, and if the data structure is determined according to a first method or an insertion method in accordance with the invention, so that a value N_SUM equal to the sum of the respective counting values of said data structures is obtained, r_moy is equal to the integer part of the quotient of N_SUM by Q_UNION.

According to a sixth aspect, the invention relates to a method for estimating the number of distinct data in an intersection of a plurality of data sets, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the occurrence of an event carried out by said individual. Said method comprises the steps of:
- obtaining, for each data set, a data structure determined according to a determination or insertion method in accordance with the invention,
- selecting a number D7 of smallest values of a structure corresponding to the union of said obtained data structures, D7 being equal to the smallest of the respective cardinalities of said obtained data structures,
- determining a value Q_UNION equal to the quotient of D7-1 by the largest of said selected smallest values,
- determining a value Q_INTER equal to the quotient of the cardinality of a structure corresponding to the intersection of said obtained data structures by D7,
- determining an estimator G_INTER of the number of distinct data in said intersection of data sets as a function of Q_INTER and Q_UNION.
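The quantity Q_INTER can be sketched as follows, reading the step literally: the cardinality of the intersection of the obtained structures, divided by D7. Each structure is assumed to be a Python set of normalized hash values; the function name `q_inter` and the sample values are hypothetical, and the final combination into G_INTER is left out because its expression is not given at this point.

```python
def q_inter(structures: list) -> float:
    """Return Q_INTER = |intersection of the structures| / D7."""
    if not structures or any(len(s) == 0 for s in structures):
        raise ValueError("every structure must be non-empty")
    d7 = min(len(s) for s in structures)              # smallest cardinality
    common = set.intersection(*map(set, structures))  # values shared by all structures
    return len(common) / d7


# Hypothetical structures obtained for two data sets.
A = {0.01, 0.03125, 0.09, 0.2}
B = {0.02, 0.03125, 0.125}
Q_INTER = q_inter([A, B])   # one shared value, D7 = 3
```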

In particular embodiments, the data sets, called sets “A_1,…, A_T”, are T in number and are respectively associated with parameters r_1,…, r_T, the cardinality of the data structure associated with a set A_i being equal to k_i, and the estimator G_INTER has the expression:

[expression missing from the source text]

in which:
● if, for each set A_i, the respective numbers of occurrences of the data of said set A_i are all equal to the same given value R_i, r_i is equal to R_i,
● otherwise, and if the data structure is determined according to a first method or an insertion method in accordance with the invention, so that a counting value N_i is obtained for each data structure associated with a set A_i, the parameter r_i of each set A_i is equal to the integer part of the quotient of N_i by a value Q_i, said value Q_i being equal to the quotient of k_i-1 by the largest value of the data structure obtained for said set A_i.
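The second case above can be sketched as follows: for each set A_i, Q_i is computed from its structure and r_i is the integer part of N_i/Q_i. The structures are assumed to be Python sets of normalized hash values; the function name and the sample values are hypothetical.

```python
import math


def occurrence_parameters(structures: list, counters: list) -> list:
    """Return [r_1, ..., r_T] with r_i = floor(N_i / Q_i), Q_i = (k_i - 1) / max(L_i)."""
    params = []
    for L_i, N_i in zip(structures, counters):
        if len(L_i) < 2:
            raise ValueError("each structure must hold at least two values")
        Q_i = (len(L_i) - 1) / max(L_i)
        params.append(math.floor(N_i / Q_i))
    return params


# Hypothetical structures and counting values for T = 2 data sets.
structures = [{0.05, 0.125, 0.25}, {0.1, 0.5}]
counters = [40, 7]
r = occurrence_parameters(structures, counters)
# Q_1 = 2 / 0.25 = 8 and Q_2 = 1 / 0.5 = 2, so r = [5, 3]
```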

Thus, the estimation methods according to the invention offer additional functionality for computing the cardinalities of intersections and unions of user data sets while guaranteeing protection of the privacy of these users. Such an advantage is not offered by the known solutions of the prior art.

It is important to note that the first, second and third methods are alternative solutions for determining the same data structure. Such a data structure constitutes a particular technical element on which the estimation methods according to the invention rely to estimate the number of distinct elements in data sets, in particular in intersections and unions of data sets, with a better compromise between compactness and a strong guarantee of confidentiality. In other words, the different methods of the invention are linked by the same general inventive concept, which consists in determining and using a data structure according to the invention to carry out the counting operations in question precisely and confidentially.

According to a seventh aspect, the invention relates to a computer program comprising instructions for implementing any one of the methods according to the invention when said computer program is executed by a computer.

This program may use any programming language, and may be in the form of source code, object code, or code intermediate between source code and object code, such as a partially compiled form, or in any other desirable form.

According to an eighth aspect, the invention relates to a computer-readable information or recording medium on which a computer program according to the invention is recorded.

The information or recording medium may be any entity or device capable of storing the program. For example, the medium may comprise a storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard disk.

Furthermore, the information or recording medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means. The program according to the invention may in particular be downloaded over an Internet-type network.

Alternatively, the information or recording medium may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute, or to be used in the execution of, the method in question.

According to a ninth aspect, the invention relates to a processing device comprising means configured to implement a method according to the invention.

According to a tenth aspect, the invention relates to a computer system comprising:
- a database in which at least one personal datum relating to an individual is stored, said at least one datum having been stored during a storage process of determined duration following the occurrence of an event carried out by said individual,
- a processing device according to the invention.

Other characteristics and advantages of the present invention will emerge from the description given below, with reference to the appended drawings, which illustrate a non-limiting exemplary embodiment. The figures show, respectively:
- schematically, in its environment, a particular embodiment of a computer system according to the invention;
- schematically, an example hardware architecture of a processing device belonging to said computer system;
- in flowchart form, the main steps of a method, called the “first method”, for determining a data structure according to the invention;
- in flowchart form, the main steps of a method, called the “second method”, for determining a data structure according to the invention;
- in flowchart form, the main steps of a method, called the “third method”, for inserting at least one datum into a data structure according to the invention;
- in flowchart form, the main steps of a method, called the “fourth method”, for estimating the number of distinct data in a data set according to the invention;
- in flowchart form, the main steps of a method, called the “fifth method”, for estimating the number of distinct data in a union of data sets according to the invention;
- in flowchart form, the main steps of a method, called the “sixth method”, for estimating the number of distinct data in an intersection of data sets according to the invention.

A particular embodiment of a computer system 10 according to the invention is represented schematically, in its environment, in the appended drawings.

In this embodiment, the computer system 10 comprises a processing device 11 configured to carry out processing operations that anonymize one or more data sets (a set may comprise one or more data), and to carry out operations for counting distinct elements in such data sets, by implementing various methods according to the invention that are described in more detail below.

The data considered in the context of the present invention are data relating to one or more individuals, stored during a storage process of determined duration following the occurrence of an event carried out by said individual or individuals. Furthermore, the data set or sets on which the methods according to the invention operate are themselves included in a more general data set denoted “X” (X is therefore itself a data set), which groups together the personal data of all the individuals likely to carry out said event during said duration.

Although no limitation is attached to the nature of the personal data considered or to the context in which they were acquired, it is considered in the present embodiment that the personal data processed by the computer system 10 are extracted from mobility traces identified in a mobile telecommunications network for a plurality of individuals. Such mobility traces are conventionally reported in call records generated by the network, and reflect the mobility of individuals during communications established on the mobile telecommunications network between different relay antennas of the network, over a given period of time.

More specifically, it is considered here that the personal data processed are mobile telephone numbers belonging to one or more individuals, these telephone numbers having been stored in a database 12, also shown in the drawings, after said individuals made one or more telephone calls while they were in a train station during a determined period of time, for example between 4 p.m. and 8 p.m.

In other words, for this specific example of implementation, an event carried out by an individual consists of a telephone call made or received in a train station, and the storage process in question consists of storing the telephone number of that individual in the database 12 during the period of time concerned.

Consequently, a data set intended to be used in a method according to the invention corresponds, for this specific example of implementation, to a set of telephone numbers. Note also that a telephone number may appear several times in such a data set if the individual associated with that number made and/or received several calls in the station during said period of time. Finally, the set X corresponds, in this example of implementation, to all the telephone numbers of the individuals likely to make and/or receive a telephone call in said station. In practice, the set X is formed of the numbers of all the inhabitants of the city in which the station in question is located. Nothing, however, excludes considering a larger set X, for example one formed of the telephone numbers of the individuals living in the country in which said station is located.
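The distinction between the number of stored data and the number of distinct data in such a set can be illustrated with a short sketch. The telephone numbers below are invented for the example, and an exact count of this kind is only a reference point: it requires keeping the raw numbers, which is what the anonymous data structure of the invention is meant to avoid.

```python
from collections import Counter

# Hypothetical data set: one entry per call made or received in the station.
calls = ["0611111111", "0622222222", "0611111111", "0633333333", "0611111111"]

occurrences = Counter(calls)              # number of occurrences of each number
total_stored = sum(occurrences.values())  # 5 stored data
distinct = len(occurrences)               # 3 distinct individuals
```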

No limitation is attached to the number of stations that may be considered, as detailed below in the description of methods that provide counting estimates in unions and/or intersections of data sets. Of course, the period of time from 4 p.m. to 8 p.m. (and therefore of determined duration equal to 4 hours) is given here only by way of example, and any other value may be considered.

More generally, considering personal data in the form of telephone numbers is only one particular example of application of the invention, and other examples may of course be considered, as is also described below through various technical applications of the invention.

An example hardware architecture of the processing device 11 belonging to the computer system 10, for the implementation of the various methods according to the invention, is also represented schematically in the appended drawings.

As illustrated in the drawings, the processing device 11 has the hardware architecture of a computer. Thus, the processing device 11 comprises, in particular, a processor 1, a random access memory 2, a read-only memory 3 and a non-volatile memory 4. It further comprises a communication module 5.

The read-only memory 3 of the processing device 11 constitutes a recording medium in accordance with the invention, readable by the processor 1, on which are recorded a plurality of computer programs PROG_1, PROG_2, PROG_3, PROG_4, PROG_5, PROG_6 in accordance with the invention, comprising instructions for executing steps of methods according to the invention. Each of the programs PROG_i, i being an integer index ranging from 1 to 6, defines at least one functional module of the processing device 11, which relies on or controls the hardware elements 1 to 5 of the processing device 11 mentioned above. The functions executed by such modules are specified below in connection with the description of the methods that can be implemented when the instructions (in the form of code) of said programs PROG_i are executed by the processor 1.

The communication module 5 enables the processing device 11 to communicate with the database 12, and in particular to access the personal data stored therein. It may comprise, for example, a network card or any other means of connecting to a communication network to which the processing device 11 and the database 12 belong, or of communicating over a digital data bus connecting the processing device 11 to the database 12.

It is noted that, in the illustrated embodiment, the database 12 is an element distinct from the processing device 11. This is, however, only one implementation variant of the invention, and nothing prevents the database 12 from being stored, for example, directly in a memory of the processing device 11, such as its non-volatile memory 4.

The remainder of the description sets out the various methods that can be implemented by the processing device 11 by means of the programs PROG_i respectively. To this end, it is assumed from here on, in a non-limiting manner, that the data set or sets on which these methods rely come from the database 12 after the end of the storage process contemplated within the meaning of the invention.

That said, such arrangements in no way limit the invention. In particular, nothing prevents said methods from being implemented dynamically, in parallel with said storage process and after each storage of a datum in the database 12.

A figure shows, in flowchart form, the main steps of a determination method according to the invention, referred to as the "first method". The instructions dedicated to executing the steps of the first method are contained in the program PROG_1.

In its general principle, said first determination method consists in generating, from a set A of personal data, a data structure that guarantees differential privacy of order ɛ for said set A, and from which distinct elements can be counted particularly efficiently.

The anonymization method known as "differential privacy" is well known to those skilled in the art. It makes it possible to anonymize data and is particularly valued because it allows the level of anonymity obtained to be quantified formally and rigorously, in other words, the risk of re-identifying the personal data of the individuals involved from the anonymous data obtained (i.e. the data contained in the data structure generated by the first determination method). This advantageously makes it possible to control the trade-off between the usefulness of the anonymous data obtained and the level of anonymity guaranteed.

Indeed, too high a level of anonymity can result in a loss of useful information about the original data. Conversely, an anonymous data set that is too close to the initial data set reveals too much information about the individuals concerned. Such control is therefore important because it makes it possible to know whether or not the level of anonymity considered is reasonable.

In practice, the level of anonymity is measured by means of a parameter, denoted ɛ in the context of the present invention, which reflects the fact that two data structures generated by said first method have almost the same probability distribution if the data sets supplied as input to the first method are neighboring, that is to say differ by the contribution of a single datum. The "almost" is measured by said parameter ɛ: the smaller ɛ is, the closer the probability distributions are and the harder it is to detect the participation of a particular individual in the data structures (and therefore the better the anonymity achieved), which is the goal pursued by differential privacy. The parameter ɛ and the anonymity achieved by the algorithm thus vary in opposite directions.
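For reference, the usual formal statement of this property, as found in the differential privacy literature rather than quoted from the present description, can be written as follows. The notation is an assumption introduced for illustration: \mathcal{M} denotes the randomized procedure (here, the first method), A and A' two neighboring input sets, and S any set of possible outputs.

```latex
\Pr\bigl[\mathcal{M}(A) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(A') \in S\bigr]
\qquad \text{for all neighboring } A, A' \text{ and all } S .
```

With ɛ = 0 the two output distributions coincide exactly; as ɛ grows, they are allowed to differ by a larger multiplicative factor.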

Since differential privacy has been studied extensively, it is not described further here. For more details, reference may be made, for example, to the document by C. Dwork, F. McSherry, K. Nissim and A. Smith entitled "Calibrating Noise to Sensitivity in Private Data Analysis", Theory of Cryptography, pages 265-284, 2006.

Said first method first comprises a step E10 of initializing a data structure L to an empty set. Said initialization step E10 is implemented by an initialization module (not shown in the figures) of the determination device 11, configured for this purpose.

The purpose of step E10 is therefore to create a structure capable of receiving data. In practice, from a computing point of view, implementing said step E10 consists in instantiating an empty list.

Said first method also comprises a set E_DET of steps implemented for each datum of the set A. In order to describe the implementation of said set of steps E_DET, a datum of the set A, denoted "d", is considered.

Accordingly, for said datum d, said set of steps E_DET comprises a step E_DET_10 of determining a value W_{d} equal to b x h(d) + (1-b) x h(V), where:
● h is a hash function defined on the set X and taking discrete values in the interval [0,1], for example values uniformly distributed between 0 and 1,
● b is a Bernoulli variable with parameter p, where p satisfies a bounding condition (the expression is missing from this text), M being the cardinality of the image of the hash function h, and r being an upper bound on the number of times the datum d can be stored during said duration,
● V is a random variable uniform on the set X and independent of b.
Said step E_DET_10 is implemented by a determination module (not shown in the figures) of the determination device 11, configured for this purpose.
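By way of illustration only, step E_DET_10 might be sketched as follows in Python. The description does not prescribe a particular hash function, so the SHA-256-based function `h` below is a hypothetical stand-in; the value of `M` is the example value mentioned in the description; and `p` is simply passed in as a parameter, since the bounding expression for p is not available in this text. The names `draw_w` and `universe` are likewise assumptions.

```python
import hashlib
import random

M = 1_000_000  # assumed cardinality of the image of h (example value from the description)

def h(x, m=M):
    """Hypothetical hash: maps a string to one of m discrete values in [0, 1)."""
    digest = hashlib.sha256(x.encode("utf-8")).digest()
    return (int.from_bytes(digest, "big") % m) / m

def draw_w(d, p, universe, rng):
    """Return W_d = b*h(d) + (1-b)*h(V).

    b is a Bernoulli(p) draw; V is drawn uniformly from `universe`
    (standing in for the set X) by a separate, independent draw.
    """
    b = 1 if rng.random() < p else 0
    v = rng.choice(universe)
    return b * h(d) + (1 - b) * h(v)
```

With p close to 1 the value W_d is almost always the true hash h(d); with p close to 0 it is almost always the hash of a random element of X, so p controls how often the true datum contributes.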

The value of the parameter r appearing in the bounds on the parameter p is determined by the application envisaged for said first method.

Thus, in the present mode of implementation, considering, as mentioned earlier, that a personal datum is the mobile telephone number belonging to an individual, this number having been stored in the database 12 during a period of 4 hours (from 4 p.m. to 8 p.m.) in a given railway station, the parameter r can for example be set to 5. It is indeed reasonable to assume that an individual does not make and/or receive more than five telephone calls within four hours in said station.

Of course, taking r equal to 5 is only one implementation variant of the invention, and other values can be envisaged provided that they reflect the reality of the application envisaged for the implementation of said first method. In particular, for the present mode of implementation, nothing prevents a value lower or higher than 5 from being considered.

In general, the person skilled in the art is able to set a value for the parameter r in view of the application envisaged.

The value of the parameter M appearing in the bounds on the parameter p is likewise determined by the application envisaged for said first method. Thus, following the considerations set out above for the parameter r, the parameter M can for example be set to one million. Nothing, however, prevents another value from being considered for the parameter M, this value preferably being at least of the order of the square of the cardinality of the set X, so as to avoid any risk of collisions in the image of the hash function h (such collisions can indeed be a source of cardinality estimation errors for the estimation methods according to the invention that are described in more detail later).

It is important to note that choosing p within the bounds given above advantageously guarantees differential privacy of order ɛ for the data structure obtained at the end of said first method, as the inventor has been able to demonstrate.

It is also noted that the bounds given above for the parameter p reflect the level of privacy that one wishes to grant to the data forming the data structure obtained at the end of said first method. More precisely, the closer p is to the value 1/2, the better the data structure obtained at the end of the first method protects the data it contains against a risk of re-identification.

Said set E_DET of steps also comprises, for said datum d, a first test step E_DET_20 consisting in checking whether or not the value W_{d} belongs to the structure L as determined so far. Said first step E_DET_20 is implemented by a first test module (not shown in the figures) of the determination device 11, configured for this purpose.

Accordingly, if the value W_{d} does not belong to the structure L (negative response to the first test step E_DET_20), said set E_DET of steps comprises a second test step E_DET_30 consisting in checking whether the cardinality of said structure L is less than a given number k. Said second step E_DET_30 is implemented by a second test module (not shown in the figures) of the determination device 11, configured for this purpose.

Accordingly, if the cardinality of said structure L is less than k (positive response to the second test step E_DET_30), the set E_DET of steps comprises a step E_DET_40 of inserting the value W_{d} into the structure L.

Conversely, if the cardinality of the structure L is not less than k (negative response to the second test step E_DET_30), said set E_DET of steps comprises a third test step E_DET_50 consisting in checking whether the value W_{d} is less than the largest value of the structure L as determined so far. Said third step E_DET_50 is implemented by a third test module (not shown in the figures) of the determination device 11, configured for this purpose.

Accordingly, if the value W_{d} is less than the largest value of the structure L (positive response to the third test step E_DET_50), the set of steps E_DET comprises a step E_DET_60 of inserting the value W_{d} into the structure L by replacing said largest value.

Said set E_DET of steps has so far been described for a single datum d belonging to the set A. The other data belonging to the set A are processed in a similar manner, namely the steps of said set E_DET are iterated for each of said other data of the set A. These iterations are executed such that the data structure considered for inserting a datum during a current iteration of the set of steps E_DET is the data structure resulting from the iterations of the set of steps E_DET preceding said current iteration.
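The iteration of steps E10 and E_DET_20 to E_DET_60 over the data of the set A can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes the values W_{d} have already been computed by step E_DET_10 and are supplied as an iterable; the function name `build_structure` is hypothetical; and returning the structure sorted is a convenience choice, not something stated in the description.

```python
def build_structure(w_values, k):
    """Apply E10, then E_DET_20..E_DET_60 to each precomputed value W_d."""
    L = []                               # E10: instantiate an empty list
    for w in w_values:
        if w in L:                       # E_DET_20: W_d already in L, skip
            continue
        if len(L) < k:                   # E_DET_30: room left in L
            L.append(w)                  # E_DET_40: plain insertion
        else:
            largest = max(L)
            if w < largest:              # E_DET_50: compare with largest value
                L[L.index(largest)] = w  # E_DET_60: insertion by replacement
    return sorted(L)

# Example: with k = 3, only the three smallest distinct values are kept.
print(build_structure([0.5, 0.2, 0.5, 0.9, 0.1, 0.95], 3))  # [0.1, 0.2, 0.5]
```

A plain list keeps the sketch close to the text; for large k, a max-heap combined with a set would avoid the linear scans performed by `max` and `in`.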

Given the nature of said second and third test steps E_DET_30, E_DET_50, it will be understood that the number k is a size that said structure L should not exceed. It is therefore a fixed number, advantageously less than the cardinality of the data set A, so that the data structure L obtained at the end of the first method is more compact than said data set A.

It is also noted, given the nature of the first test step E_DET_20, that the data structure L obtained at the end of the first method contains only mutually distinct values.

Ultimately, at the end of said first method, a data structure L is obtained that is more compact than the set A from which it is derived, is formed solely of mutually distinct values, and ensures differential privacy of order ɛ for these values.

It is noted that such a data structure L differs from a conventional KMV-type structure in that the presence of a datum in said structure L is guaranteed only with a probability p (which is what leads to differential privacy of order ɛ), whereas the KMV structure contains, by construction, all the hash values of the distinct values of the set A. The data structure L obtained by means of the first method therefore offers a much better level of privacy than a conventional KMV structure.

It is noted that the first determination method has so far been described on the assumption that a data set A comprises a plurality of data. It will nevertheless be understood that it can be implemented with a single personal datum (in which case, of course, the steps of said set of steps E_DET are executed only once).

It is also noted that the first method has been described above with reference to the main steps executed during its implementation. It nevertheless remains possible to envisage more specific modes of implementation of said first method in which further steps are executed. Such other modes of implementation are useful in particular for implementing methods for counting in sets, as described in more detail later.

Thus, in a particular mode of implementation, said first method also comprises, for each execution of the step E_DET_60 of insertion by replacement, a step of incrementing a value referred to as the "count value" N. In this way, at the end of the method, said count value N is representative of the total number of times said step of insertion by replacement has been executed during the implementation of the method.

Said count value N is for example initialized to zero.

According to another example, said count value N is initialized to a realization of a centered Laplace random variable. Proceeding in this way is advantageous in that the information obtained at the end of the first method, namely a pair formed of the data structure L and said count value N, also complies with the principle of differential privacy, as explained for example in the document by Dwork et al. mentioned above.
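As a hedged illustration of this second example, a centered Laplace realization can be drawn as the difference of two independent exponential draws with the same rate. The function names and the `scale` parameter below are assumptions: the description does not state the scale of the Laplace variable (in the Laplace mechanism of Dwork et al. it is typically the sensitivity divided by ɛ).

```python
import random

def laplace_sample(scale, rng):
    """Draw from a centered Laplace distribution with the given scale.

    Uses the fact that the difference of two independent exponential
    variables with rate 1/scale follows a Laplace(0, scale) distribution.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def init_count(scale, rng):
    """Initial count value N: a noisy zero instead of an exact zero."""
    return laplace_sample(scale, rng)

# N is then incremented by 1 at each insertion by replacement (E_DET_60).
rng = random.Random()
N = init_count(scale=1.0, rng=rng)  # scale value is purely illustrative
N += 1
```

Because the noise is added once at initialization, the final N equals the true number of replacements plus a single Laplace perturbation.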

The invention is not limited to obtaining a data structure L such as that described above by means of the first method alone. According to another aspect, the invention also proposes generating, from a KMV structure previously determined for said set A, a data structure that is more compact than the set A from which it is derived, is formed solely of mutually distinct values, and ensures privacy of order ɛ for these values. In other words, the invention also proposes transforming a first structure L1 of the KMV type into a second structure L2 that inherits the same advantages as those described above with reference to the structure L obtained at the end of the first method.

A figure shows, in flowchart form, the main steps of another determination method according to the invention, referred to as the "second method". The instructions dedicated to executing the steps of the second method are contained in the program PROG_2.

For the description of the second method, a set A comprising a plurality of personal data is again considered. It is also assumed that a first data structure L1 has been obtained by applying a k'-minimum values algorithm to the data of the set A. It is noted that the implementation of said k'-minimum values algorithm conventionally uses a hash function h, this hash function being defined, in the context of the present invention, on the set X (it is a hash function taking discrete values in the interval [0,1], for example values uniformly distributed between 0 and 1). It is further noted that the number k' corresponds, in a manner known per se, to the cardinality of the first structure L1, and is set according to the application envisaged.

For example, in the application context considered here (calls made and/or received in a station by individuals during a period of 4 hours), said number k' can be taken equal to 100.

In general, the number k' represents a trade-off between the ability to store enough data to produce reliable count estimates (better when k' is large) and the compactness of the structure L1 (better when k' is small).

As illustrated in the corresponding figure, said second method comprises a step F10 of determining a value D1 equal to the quotient of k'-1 by the largest value of the first structure L1 (i.e. D1 = (k'-1)/max(L1)). Step F10 is implemented by a determination module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

The second method also comprises a step F20 of uniformly sampling a number D1 of values in the image of the hash function h, i.e. in h(X), so as to obtain a set L_D1 comprising said D1 sampled values. Step F20 is implemented by a first sampling module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

"Uniform sampling of a value in a set" conventionally means randomly selecting a value from said set with a uniform probability equal to the inverse of the cardinality of said set. Consequently, uniformly sampling a plurality of values in said set consists in repeating this operation (with replacement) as many times as there are values in said plurality.
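The definition above can be illustrated with a minimal Python sketch. The function name `uniform_sample` and the example values are assumptions made for illustration only; they do not come from the description.

```python
import random

def uniform_sample(population, n, rng=random):
    """Draw n values uniformly at random from population, with replacement."""
    population = list(population)
    # Each draw selects any element with probability 1 / len(population).
    return rng.choices(population, k=n)

# Illustrative values only.
values = [0.1, 0.25, 0.5, 0.75]
sample = uniform_sample(values, 10, random.Random(0))
```

Because the draws are made with replacement, the same value may appear several times in the sample, and the sample may be larger than the set it is drawn from.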

The second method also comprises a step F30 of uniformly sampling a number D2 of values in the set L_D1, D2 being equal to the integer part of the product of [1-p] by D1 (i.e. [1-p] x D1), with, in a manner similar to what was described previously for the first method:

[bounds on the parameter p: expression not reproduced in the source text]

Step F30 is implemented by a second sampling module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

Note that the bounds proposed for the parameter p in said second method differ from those proposed for the first method in that the parameter r is here taken equal to 1. This difference stems from the fact that the second method starts from the structure L1, which is a conventional KMV structure, the aim being, in particular, to transform L1 into another structure that provides a differential privacy guarantee. In a manner known per se, the values contained in a conventional KMV structure, and therefore a fortiori in the structure L1, are all distinct from one another, so that the maximum number of times a value can appear in the structure L1 is equal to 1.

Said second method also comprises a step F40 of selecting a number D3 of smallest values among said D2 sampled values, D3 being equal to the integer part of the product [1-p] x k, where k is a given number less than k'. Step F40 is implemented by a first selection module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

Said second method also comprises a step F50 of uniformly sampling a number D4 of values between the largest value of the first structure L1 and 1, so as to obtain a set L_D4 comprising said D4 sampled values, D4 being equal to D1-k. Step F50 is implemented by a third sampling module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

Said second method also comprises a step F60 of uniformly sampling a number D5 of values in the union of the sets L_D1 and L_D4, D5 being equal to the integer part of the product p x D1. Step F60 is implemented by a fourth sampling module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

Said second method also comprises a step F70 of selecting a number D6 of smallest values among said D5 sampled values, D6 being equal to k-D3. Step F70 is implemented by a second selection module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.

Said second method also comprises a step F80 of grouping together the smallest values selected in steps F40 and F70, so as to form a second structure L2.
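Steps F10 to F80 can be sketched in Python as follows. This is only an illustrative reading of the description, under several assumptions that the text does not state: D1 is rounded down to an integer, the image h(X) is represented by a finite list `hash_image`, the draws of step F50 are taken from a continuous uniform distribution on [max(L1), 1], and the parameter p is supplied by the caller (its bounds are not reproduced here). The function name `second_method` and the numerical values are hypothetical.

```python
import math
import random

def second_method(l1, k, p, hash_image, rng=None):
    """Sketch of steps F10 to F80; returns the structure L2 as a sorted list."""
    rng = rng or random.Random()
    k_prime = len(l1)
    top = max(l1)
    # F10: D1 = (k'-1) / max(L1), rounded down here (assumption) to get a count.
    d1 = math.floor((k_prime - 1) / top)
    # F20: D1 uniform draws (with replacement) in the image h(X).
    l_d1 = rng.choices(hash_image, k=d1)
    # F30: D2 = integer part of (1-p) x D1 uniform draws in L_D1.
    d2 = math.floor((1 - p) * d1)
    sample_d2 = rng.choices(l_d1, k=d2)
    # F40: keep the D3 = integer part of (1-p) x k smallest of these values.
    d3 = math.floor((1 - p) * k)
    kept_1 = sorted(sample_d2)[:d3]
    # F50: D4 = D1 - k uniform draws between max(L1) and 1.
    d4 = max(d1 - k, 0)
    l_d4 = [rng.uniform(top, 1.0) for _ in range(d4)]
    # F60: D5 = integer part of p x D1 uniform draws in the union of L_D1 and L_D4.
    d5 = math.floor(p * d1)
    sample_d5 = rng.choices(l_d1 + l_d4, k=d5)
    # F70: keep the D6 = k - D3 smallest of these values.
    d6 = k - d3
    kept_2 = sorted(sample_d5)[:d6]
    # F80: group both selections; a set removes any duplicate draws.
    return sorted(set(kept_1) | set(kept_2))

# Hypothetical discretised hash image and KMV structure with k' = 100.
rng = random.Random(1)
hash_image = [i / 10000 for i in range(1, 10001)]
L1 = sorted(rng.sample(hash_image, 500))[:100]
L2 = second_method(L1, k=50, p=0.3, hash_image=hash_image, rng=rng)
```

Since sampling is done with replacement, duplicate draws may occur, so in this sketch L2 can contain fewer than k values once duplicates are removed.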

Ultimately, at the end of said second method, and similarly to what was described above for the first method, a data structure L2 is obtained that is more compact than the set A from which it is derived, that contains only mutually distinct values, and that ensures privacy of order ɛ for these values.

In addition, this data structure L2 has the advantage of being more compact than the data structure L1, since k is less than k', while offering a much stronger privacy guarantee.

From a theoretical point of view, the mechanism that generates the data structure L2 from the data structure L1 satisfies the differential privacy guarantee of order ɛ provided that k' is greater than [1-p] x D_A + p x k, where D_A is the number of distinct elements of set A. Since the number k' is given by the use case of the second method, k must be chosen so that the condition k' > [1-p] x D_A + p x k is satisfied. Since KMV-type structures are used in situations where k' is less than D_A (otherwise it would suffice to store all the data of set A in order to count its distinct elements), choosing k less than k' therefore makes it possible to guarantee the differential privacy of order ɛ of the structure L2.
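As a purely illustrative aid, the stated condition can be checked numerically for candidate values of k. The function names and the values of k', D_A and p below are hypothetical; in practice D_A is the quantity being estimated and is not known exactly.

```python
def satisfies_condition(k_prime, d_a, p, k):
    """True when k' > (1-p) x D_A + p x k."""
    return k_prime > (1 - p) * d_a + p * k

def admissible_k(k_prime, d_a, p):
    """All integers k with 1 <= k < k' that satisfy the condition."""
    return [k for k in range(1, k_prime)
            if satisfies_condition(k_prime, d_a, p, k)]

# Illustrative values only: k' = 100, D_A = 120, p = 0.5.
ks = admissible_k(k_prime=100, d_a=120, p=0.5)
```

With these values the condition reads 100 > 60 + 0.5 x k, so the admissible values are the integers k from 1 to 79.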

The invention therefore makes it possible to obtain a compact data structure that guarantees differential privacy of order ɛ in two different ways: either by creating said data structure directly from a raw data set A, or by transforming a KMV-type data structure itself obtained from said raw data set A.

According to another aspect, the invention also makes it possible to enrich a data structure obtained by either the first method or the second method, by inserting into it a data item newly stored in the database 12.

The corresponding figure shows, in flowchart form, the main steps of a data insertion method according to the invention, referred to as the "third method". The instructions for executing the steps of the third method are contained in the program PROG_3.

For the description of the third method, we again consider a set A comprising a plurality of personal data items. We also consider a data item d_new newly stored in the database, for example following a new execution of the storage process, said data item d_new again being a personal data item relating to an individual and stored following the occurrence of an event (a telephone call made or received in the train station in the present embodiment) involving said individual.

Said third method therefore comprises a step G10 of obtaining a structure L determined according to said first method or according to said second method.

In a particular embodiment, said obtaining step G10 consists in carrying out the first method or the second method on the data of set A. Of course, when the second method is the one carried out, said third method relies on a first structure obtained by applying a k'-minimum-values algorithm to set A.

In another particular embodiment, said data structure L has been determined by the determination device 11 by means of the first method or the second method, prior to said obtaining step G10. In this case, the term "obtaining" refers to accessing a memory of the determination device 11 in which the structure L was stored at the end of the first method or the second method.

Said third method also comprises a step G20 of carrying out, for said data item d_new, a set of steps identical to the set E_DET of steps of said first method. In other words, steps E_DET_10 to E_DET_60 (where applicable) are executed for said data item d_new.

Of course, said third method is not limited to inserting a single data item into the structure L. A plurality of data items can be inserted into the structure L by repeating, in a manner similar to what was described for the first method, the steps of said set E_DET for each of the data items to be inserted.

Furthermore, when the data structure L obtained has been determined according to the first method in such a way as to obtain a count value N corresponding to the total number of times the insertion-by-replacement step E_DET_60 was executed during said first method, the third method may also comprise, in a particular embodiment, an update of said count value N. More precisely, in this particular embodiment, said third method comprises, during step G20 and for each execution of the step identical to the insertion-by-replacement step E_DET_60 of said first method, a step of incrementing said count value N. In other words, in this particular embodiment, the count of the number of insertions by replacement is continued.
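The insertion of a data item with continuation of the count N can be sketched as follows, based on the set of steps summarised in the abstract (W_d = b x h(d) + (1-b) x h(V), insertion while the structure holds fewer than k values, otherwise replacement of the largest value). Several elements are assumptions made for illustration: the hash `h` built from SHA-256, the choice P(b = 1) = 1 - p for the Bernoulli variable, the random 64-bit token standing in for V, and the names `insert`, `n_count` and the sample identifiers.

```python
import hashlib
import random

def h(value):
    """Hypothetical hash into [0, 1): first 8 bytes of SHA-256, scaled."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def insert(structure, d_new, k, p, n_count, rng):
    """Apply the E_DET steps to d_new; return the updated count N."""
    # b is a Bernoulli variable; P(b = 1) = 1 - p is an assumption of this sketch.
    b = 1 if rng.random() < 1 - p else 0
    # V is uniform and independent of b; a random 64-bit token stands in for it.
    v = rng.getrandbits(64)
    w = b * h(d_new) + (1 - b) * h(v)
    if w in structure:
        return n_count
    if len(structure) < k:
        structure.add(w)
    elif w < max(structure):
        # Insertion by replacement of the largest value; N is incremented.
        structure.remove(max(structure))
        structure.add(w)
        n_count += 1
    return n_count

rng = random.Random(0)
L, N = set(), 0
for d in ["id-1", "id-2", "id-3", "id-1", "id-4", "id-5"]:
    N = insert(L, d, k=3, p=0.1, n_count=N, rng=rng)
```

Keeping N as an explicit value returned by `insert` makes it straightforward to resume the count from the value obtained at the end of the first method.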

So far, the device in charge of implementing the third method has been assumed to be the determination device 11. That said, nothing precludes said third method from being implemented by another device configured similarly to said determination device 11, after the latter has determined the structure L. In that case, the term "obtaining" for said step G10 refers to a data transfer (transmission/reception) between the determination device 11 and said other device. This data transfer is carried out by the communication module 5 fitted to the determination device 11 and by a similar communication module fitted to said other device.

The invention is not limited to obtaining data structures by means of said first, second and third methods. From a structure thus obtained, the invention also makes it possible to carry out counting operations aimed at estimating the number of distinct elements in a data set, said data set possibly itself resulting from union(s) or intersection(s) of data sets.

The corresponding figure shows, in flowchart form, the main steps of a counting method according to the invention, referred to as the "fourth method". The instructions for executing the steps of the fourth method are contained in the program PROG_4.

Said fourth method consists in estimating the number of distinct data items present in set A. To do so, it relies on a data structure determined for said set A by means of any one of said first, second and third methods. The number of distinct data items present in set A is then estimated, in particular, from a quantity derived from characteristics of the determined data structure.

For the description of the fourth method, we again consider a set A comprising a plurality of personal data items, as was the case for Figures 3, 4 and 5. Given the application context considered here, estimating the number of distinct data items contained in set A amounts to estimating the number of individuals who made and/or received telephone calls in the train station during the time period concerned.

Said fourth method first comprises a step H10 of obtaining a data structure L for said set A, said data structure L being determined according to any one of said first, second and third methods, and the cardinality of the structure obtained being equal to a number k.

The technical considerations described above for step G10 of the third method regarding the meaning of the term "obtaining" apply in the same way to step H10.

As a reminder, said data structure L is determined by means of the parameter p, whose value is chosen so as to be bounded above by a quantity that depends on the number M, so as to guarantee the differential privacy of order ɛ of said data structure L.

Said fourth method also comprises a step H20 of determining a value Q equal to the quotient of k-1 by the largest value of the data structure L (i.e. Q = (k-1)/max(L)). Step H20 is implemented by a determination module (not shown in the figures) fitted to the determination device 11 and configured for this purpose.
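The computation of step H20 can be sketched in a few lines of Python. The function name `kmv_quotient` and the example values are assumptions made for illustration.

```python
def kmv_quotient(structure):
    """Q = (k-1) / max(L), where k is the cardinality of L."""
    k = len(structure)
    return (k - 1) / max(structure)

# Illustrative structure with k = 4 values in [0, 1].
L = [0.1, 0.2, 0.25, 0.5]
Q = kmv_quotient(L)  # (4 - 1) / 0.5 = 6.0
```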

Said fourth method also comprises a step H30 of determining, as a function of Q, an estimator G of the number of distinct data items of set A.

Ultimately, an estimator G is thus obtained from a data structure L that is more compact than set A and that also guarantees differential privacy of order ɛ.

In a preferred embodiment, the estimator G is equal to g^-1(Q), where g^-1 is the inverse of a function g of unknown β given by:

[expression of g(β): formula not reproduced in the source text]

in which:
● [coefficient a: formula not reproduced in the source text],
● [coefficient b: formula not reproduced in the source text].

It should be noted that the expression of the coefficient a involves the binomial coefficient [formula not reproduced in the source text], which can also be written, following another convention: [formula not reproduced in the source text].

The parameter r_moy appearing in the expressions of the coefficients a and b can take different values depending on the assumptions made about the application context envisaged for said fourth method.

Thus, one may for example assume that the respective numbers of occurrences of the data items of said set A are all equal to the same given value R. In this case, the parameter r_moy is equal to R. In particular, if it is reasonable to assume that each data item of set A appears only once in said set A, the following simplified expressions are obtained for the coefficients a and b:
● [simplified coefficient a: formula not reproduced in the source text]
● [simplified coefficient b: formula not reproduced in the source text]

According to another example, it is assumed that set A comprises several data items whose respective numbers of occurrences in said set A differ from one another. In this case, the parameter r_moy is equal to the integer part of the quotient of N by Q. Recall that N is the count value mentioned previously, which implies that, to carry out the fourth method in the present example, the data structure L must be determined by means of an embodiment of the first or third method that yields said count value N.
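The value of r_moy in this example follows directly from N and Q. The function name and the numerical values below are hypothetical and serve only to illustrate the arithmetic.

```python
import math

def r_moy_estimate(n_count, q):
    """r_moy = integer part of N / Q."""
    return math.floor(n_count / q)

# Illustrative values: N = 1300 replacements, Q = 400.
r_moy = r_moy_estimate(n_count=1300, q=400.0)  # floor(3.25) = 3
```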

It is important to note that taking r_moy equal to the integer part of the quotient of N by Q is not limited to the assumption considered in the previous example. Such a value of r_moy can, for example, also be used when the respective numbers of occurrences of the data items of said set A are all equal to the same value, without that common value being known.

Moreover, determining the estimator G by means of said function g^-1 is only a preferred variant of the fourth method. This variant is particularly advantageous for providing a very precise estimate of the number of distinct data items present in set A. The invention nevertheless covers other embodiments of the fourth method, for example one in which the estimator G is taken equal to Q.
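Both variants of step H30 can be sketched as follows. The expression of g is not reproduced here, so the sketch takes g as a caller-supplied function and inverts it numerically by bisection, which assumes that g is continuous and strictly increasing on the search interval; this is an assumption of the sketch, not a property stated in the description. The function `toy_g` is a hypothetical stand-in used only to exercise the code and is not the function g of the invention.

```python
def estimate_g(q, g=None, lo=0.0, hi=1e12, iterations=200):
    """Return G = g^-1(q) by bisection, or G = q when no g is supplied."""
    if g is None:
        return q  # simplest mode: the estimator G is taken equal to Q
    # Assumes g is continuous and strictly increasing on [lo, hi].
    if not (g(lo) <= q <= g(hi)):
        raise ValueError("q is outside the range of g on [lo, hi]")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def toy_g(beta):
    """Hypothetical increasing function, for illustration only."""
    return 0.8 * beta

G_simple = estimate_g(100.0)            # G = Q
G_inverse = estimate_g(100.0, g=toy_g)  # solves toy_g(beta) = 100
```

A fixed number of bisection steps is used instead of a tolerance test so that the loop always terminates, whatever the magnitude of the solution.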

The corresponding figure shows, in flowchart form, the main steps of another counting method according to the invention, referred to as the "fifth method". The instructions for executing the steps of the fifth method are contained in the program PROG_5.

Said fifth method consists in estimating the number of distinct data items present in the union of a plurality of data sets A_1, A_2,…, A_T, T being an integer strictly greater than 1 (a data set A_i, i being an integer index between 1 and T, corresponding for example to the set A considered for the fourth method).

Said fifth method first comprises a step J10 of obtaining, for each data set A_i, a data structure L_i determined according to any one of said first, second and third methods, the cardinality of said structure L_i being equal to a number k_i (note that the values k_1, …, k_T may be, in whole or in part, distinct from one another).

The implementation of said step J10 is similar to that of step H10.

Said fifth method also comprises a step J20 of selecting a number D7 of smallest values of a so-called “union structure” L_UNION corresponding to the union of the data structures L_1, L_2, …, L_T. Said number D7 is equal to the smallest of the respective cardinalities of said data structures L_1, L_2, …, L_T (i.e. D7 = min(k_1, …, k_T)).

The fifth method also comprises a step J30 of determining a value Q_UNION equal to the quotient of D7-1 by the largest value among said D7 selected smallest values.

The fifth method also comprises a step J40 of determining, as a function of Q_UNION, an estimator G_UNION of the number of distinct data items in said union of data sets A_1, A_2, …, A_T.

Ultimately, an estimator G_UNION is therefore obtained from a data structure L_UNION which is more compact than the union of the data sets A_1, A_2, …, A_T, and which also guarantees differential privacy of order ɛ.

In a preferred mode of implementation, the estimator G_UNION is equal to g^-1(Q_UNION), where g^-1 is the inverse of a function g of unknown β having the expression:

[expression not reproduced]

expression in which:
● [expression not reproduced],
● [expression not reproduced].

The parameter r_moy considered in the expressions of the coefficients a and b can take different values depending on the assumptions made about the application context envisaged for said fifth method.

Thus, one can for example assume that the respective numbers of occurrences of the data of said sets A_1, A_2, …, A_T are all equal to the same given value R. In this case, the parameter r_moy is equal to R.

According to another example, it is assumed that said union of data sets A_1, A_2, …, A_T comprises several data items whose respective numbers of occurrences in said union differ from one another. In this case, the parameter r_moy is equal to the integer part of the quotient of a value N_SUM by Q_UNION, said value N_SUM being equal to the sum of the respective counting values of said data structures L_1, L_2, …, L_T. It should therefore be noted that, to implement the fifth method in the present example, each data structure L_i must be determined by means of a mode of implementation of the first or third method that yields a counting value N_i associated with said structure L_i. N_SUM is thus equal to N_1 + N_2 + … + N_T.
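As a minimal illustration of the arithmetic described for this example, the following Python sketch computes r_moy as the integer part of N_SUM / Q_UNION. The function name `r_moy_from_counts` and its argument names are hypothetical and do not appear in the original text; the counting values N_1, …, N_T and the value Q_UNION are assumed to have been obtained beforehand.

```python
import math


def r_moy_from_counts(counting_values, q_union):
    """Return the integer part of N_SUM / Q_UNION.

    counting_values: the counting values N_1, ..., N_T (assumed already known).
    q_union: the value Q_UNION from step J30 (assumed already computed).
    """
    n_sum = sum(counting_values)          # N_SUM = N_1 + N_2 + ... + N_T
    return math.floor(n_sum / q_union)    # integer part of the quotient
```

For instance, with counting values 120 and 80 and Q_UNION equal to 50, N_SUM is 200 and the sketch returns 4.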

It is important to note that taking r_moy equal to the integer part of the quotient of N_SUM by Q_UNION is not limited to the single assumption considered in the previous example. Such a value of r_moy can, for example, also be used when the respective numbers of occurrences of the data of said union of sets A_1, A_2, …, A_T are all equal to the same value, without that common value being known. Such a value of r_moy can also be envisaged when the respective numbers of occurrences of the data of each set A_i are all equal to the same value r_i, but there exist at least two indices i and j (j therefore also being between 1 and T) such that r_i differs from r_j.

Moreover, determining the estimator G_UNION by means of said function g^-1 is only a preferred variant of implementation of the fifth method. This variant proves particularly advantageous for providing a very precise estimate of the number of distinct data items present in the union of the data sets A_1, A_2, …, A_T. The invention nevertheless covers other modes of implementation of the fifth method, for example a mode in which the estimator G_UNION is set equal to Q_UNION.
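Steps J20 to J40 can be sketched in Python as follows, under the simple variant in which G_UNION is set equal to Q_UNION (the preferred variant based on g^-1 is not shown). The function name `union_estimate` is hypothetical, and each structure L_i is assumed to be given as a collection of hash values strictly between 0 and 1 or equal to 1.

```python
def union_estimate(structures):
    """Estimate the number of distinct items in the union of the data sets.

    structures: the data structures L_1, ..., L_T, each an iterable of
    hash values in (0, 1] (an assumption made for this sketch).
    """
    sets = [set(s) for s in structures]
    d7 = min(len(s) for s in sets)            # D7 = min(k_1, ..., k_T)
    l_union = set().union(*sets)              # union structure L_UNION
    smallest = sorted(l_union)[:d7]           # step J20: D7 smallest values
    q_union = (d7 - 1) / smallest[-1]         # step J30: (D7 - 1) / largest selected
    return q_union                            # step J40, variant G_UNION = Q_UNION
```

With the toy structures {0.1, 0.2, 0.4} and {0.2, 0.3, 0.5, 0.6}, D7 is 3, the three smallest values of the union are 0.1, 0.2 and 0.3, and the sketch returns 2 / 0.3.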

The figure represents, in the form of a flowchart, the main steps of another counting method according to the invention, called the “sixth method”. The instructions dedicated to executing the steps of the sixth method are contained in the program PROG_6.

Said sixth method consists in estimating the number of distinct data items present in the intersection of a plurality of data sets A_1, A_2, …, A_T, T being an integer strictly greater than 1 (a data set A_i, i being an integer index between 1 and T, corresponding for example to the set A considered for the fourth method).

Purely by way of illustration, returning to the application context in which a personal datum relating to an individual corresponds to a telephone number of said individual stored after that individual has made or received a call in a given railway station during a given period of time, said sixth method advantageously makes it possible to estimate the number of people who travelled between two stations during a given period of time. More particularly, if one considers a set A_1 (respectively A_2) comprising all the numbers (possibly with repetitions) of the individuals who made and/or received telephone calls in a first station (respectively in a second station), said sixth method makes it possible to estimate the number of distinct elements in the intersection of A_1 and A_2.

Said sixth method first comprises a step M10 of obtaining, for each data set A_i, a data structure L_i determined according to any one of said first, second and third methods, the cardinality of said structure L_i being equal to a number k_i (note that the values k_1, …, k_T may be, in whole or in part, distinct from one another).

The implementation of said step M10 is similar to that of step H10 or of step J10.

The sixth method also comprises a step M20 of selecting a number D7 of smallest values of a so-called “union structure” L_UNION corresponding to the union of the data structures L_1, L_2, …, L_T. Said number D7 is equal to the smallest of the respective cardinalities of said data structures L_1, L_2, …, L_T (i.e. D7 = min(k_1, …, k_T)).

The sixth method also comprises a step M30 of determining a value Q_UNION equal to the quotient of D7-1 by the largest value among said D7 selected smallest values.

The sixth method also comprises a step M40 of determining a value Q_INTER equal to the quotient, by D7, of the cardinality of a so-called “intersection structure” L_INTER corresponding to the intersection of said data structures L_1, L_2, …, L_T.

The sixth method also comprises a step M50 of determining, as a function of Q_INTER and Q_UNION, an estimator G_INTER of the number of distinct data items in said intersection of data sets.

Ultimately, an estimator G_INTER is therefore obtained from a data structure L_INTER which is more compact than the intersection of the data sets A_1, A_2, …, A_T, and which also guarantees differential privacy of order ɛ.

In a preferred mode of implementation, the estimator G_INTER has the expression:

[expression not reproduced]

The parameters r_1, r_2, …, r_T are numbers respectively associated with the data sets A_1, A_2, …, A_T. Each parameter r_i considered in the expression of G_INTER can take different values depending on the assumptions made about the application context envisaged for said sixth method.

Thus, one can for example assume that, for each set A_i, the respective numbers of occurrences of the data of said set A_i are all equal to the same given value R_i. In this case, the parameter r_i associated with the set A_i is equal to R_i.

According to another example, it is assumed that at least one data set (among the data sets A_1, …, A_T) comprises several data items whose respective numbers of occurrences in said at least one set differ from one another. In this case, the parameter r_i of each set A_i is equal to the integer part of the quotient of N_i by a value Q_i, said value Q_i being equal to the quotient of k_i-1 by the largest value of the data structure L_i obtained for said set A_i. It should therefore be noted that, to implement the sixth method in the present example, each data structure L_i must be determined by means of a mode of implementation of the first or third method that yields a counting value N_i associated with said structure L_i.
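The computation of r_i described in this example can be sketched in Python as follows. The function name `r_i_from_structure` is hypothetical; the structure L_i is assumed to be a non-empty collection of k_i hash values in (0, 1], and the counting value N_i is assumed to be already available.

```python
import math


def r_i_from_structure(l_i, n_i):
    """Return the integer part of N_i / Q_i for one data set A_i.

    l_i: the data structure L_i, a collection of k_i hash values in (0, 1]
         (an assumption made for this sketch).
    n_i: the counting value N_i associated with L_i.
    """
    k_i = len(l_i)
    q_i = (k_i - 1) / max(l_i)      # Q_i = (k_i - 1) / largest value of L_i
    return math.floor(n_i / q_i)    # integer part of N_i / Q_i
```

For a structure holding 0.1, 0.25 and 0.5 with N_i equal to 12, Q_i is 2 / 0.5 = 4 and the sketch returns 3.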

It is important to note that taking r_i equal to the integer part of the quotient of N_i by Q_i is not limited to the single assumption considered in the previous example. The same provisions can also apply when the respective numbers of occurrences of the data of a set A_i are all equal to the same value, without that common value being known.

Moreover, determining the estimator G_INTER according to said preferred mode is only one variant of implementation of the sixth method. This variant proves particularly advantageous for providing a very precise estimate of the number of distinct data items present in the intersection of the data sets A_1, A_2, …, A_T. The invention nevertheless covers other modes of implementation of the sixth method, for example a mode in which the estimator G_INTER is set equal to the product of Q_INTER and Q_UNION.
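Steps M20 to M50 can be sketched in Python as follows, under the simple variant in which G_INTER is set equal to the product of Q_INTER and Q_UNION (the preferred expression of G_INTER is not shown). The function name `intersection_estimate` is hypothetical, and each structure L_i is assumed to be a collection of hash values in (0, 1].

```python
def intersection_estimate(structures):
    """Estimate the number of distinct items common to all data sets.

    structures: the data structures L_1, ..., L_T, each an iterable of
    hash values in (0, 1] (an assumption made for this sketch).
    """
    sets = [set(s) for s in structures]
    d7 = min(len(s) for s in sets)                  # D7 = min(k_1, ..., k_T)
    smallest = sorted(set().union(*sets))[:d7]      # step M20: D7 smallest of L_UNION
    q_union = (d7 - 1) / smallest[-1]               # step M30
    l_inter = set.intersection(*sets)               # intersection structure L_INTER
    q_inter = len(l_inter) / d7                     # step M40: |L_INTER| / D7
    return q_inter * q_union                        # step M50, product variant
```

In the two-station illustration, `structures` would hold the structures obtained for A_1 and A_2. With the toy structures {0.1, 0.2, 0.4} and {0.2, 0.4, 0.5}, Q_UNION is 2 / 0.4, Q_INTER is 2 / 3, and the sketch returns their product.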

The various methods according to the invention have so far been described in relation to an application context in which a personal datum relating to an individual corresponds to a telephone number of said individual stored after that individual has made or received a call in a given railway station during a given period of time.

Nevertheless, the invention is not limited to such an application context, and can ultimately be implemented in any context in which it is useful to be able to estimate precisely, quickly and anonymously the number of distinct data items in a data set, in a union of data sets or in an intersection of data sets.

Thus, and more generally, the invention finds a particularly advantageous application in the statistical analysis of attendance in one or more geographical areas, making it possible, for example, to size infrastructure according to flows of individuals, to identify locations of interest according to profiles of individuals, etc.

Furthermore, the invention is not limited to the storage of data corresponding to mobile telephone numbers. The invention can, for example, be applied to the storage of data corresponding to connection information for a computer server, such as an Internet server, so that statistical analyses of network traffic can be carried out.

Claims

Method for determining a data structure from at least one datum d, said at least one datum being a personal datum relating to an individual and stored during a storage process of determined duration following the performance of an event by said individual, said method comprising a step (E10) of initializing a data structure L to an empty set as well as a set E_DET of steps of:
- determination (E_DET_10) of a value W_{d} equal to b x h(d) + (1-b) x h(V), where:
● h is a hash function with discrete values between 0 and 1,
● b is a Bernoulli variable with parameter p, with [expression for p not reproduced], M being the cardinality of the image of the hash function h, r being an upper bound on the number of times the datum d can be stored during said duration, and ɛ being a strictly positive number,
● V is a uniform random variable independent of b,
and, if the value W_{d} does not belong to structure L (E_DET_20),
- insertion (E_DET_40) of the value W_{d} in the structure L if the cardinality of said structure L is less than a given number k (E_DET_30),
- otherwise, if the cardinality of the structure L is greater than k and if the value W_{d} is less than the largest value of the structure L (E_DET_50), insertion (E_DET_60) of the value W_{d} in the structure L by replacing said largest value.

Method according to claim 1, said method comprising, for each execution of the step of insertion by replacement, a step of incrementing a value called the "counting value" N, said counting value N being representative, at the end of the method, of the total number of times said step of insertion by replacement has been executed during the implementation of the method.

Method for determining a data structure, called the "second data structure" L2, from a first data structure L1 obtained by applying a k'-minimum-values algorithm to data, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the performance of an event by said individual, the implementation of said k'-minimum-values algorithm using a hash function h with discrete values between 0 and 1, said method comprising steps of:
- determination (F10) of a value D1 equal to the quotient of k'-1 by the greatest value of the first structure L1,
- uniform sampling (F20) of a number D1 of values in the image of the hash function h, so as to obtain a set L_D1 comprising said D1 sampled values,
- uniform sampling (F30) of a number D2 of values in the set L_D1, D2 being equal to the integer part of the product [1-p] x D1, with [expression for p not reproduced], M being the cardinality of the image of the hash function h, and ɛ being a strictly positive number,
- selection (F40) of a number D3 of smallest values among said D2 sampled values, D3 being equal to the integer part of the product [1-p] x k, where k is a given number less than k',
- uniform sampling (F50) of a number D4 of values between the largest value of the first structure L1 and 1, so as to obtain a set L_D4 comprising said D4 sampled values, D4 being equal to D1-k,
- uniform sampling (F60) of a number D5 of values in the union of the sets L_D1 and L_D4, D5 being equal to the integer part of the product p x D1,
- selection (F70) of a number D6 of smallest values among said D5 sampled values, D6 being equal to k-D3,
- grouping (F80) of said smallest values selected during said selections, so as to form said second structure L2.

Method for inserting at least one datum into a data structure L, said at least one datum being a personal datum relating to an individual and stored during a storage process of determined duration following the performance of an event by said individual, said method comprising steps of:
- obtaining (G10) a data structure L determined according to a method in accordance with any one of claims 1 to 3,
- implementation (G20), for said at least one datum to be inserted, of a set of steps identical to the set E_DET of steps of a method in accordance with any one of claims 1 to 3.

Method according to claim 4, in which, when the data structure L obtained has been determined according to a determination method according to claim 2, so as to obtain a counting value N corresponding to the total number of times the step of insertion by replacement has been executed, the insertion method also comprises, for each execution of the step identical to the step of insertion by replacement of said determination method according to claim 2, a step of incrementing said counting value N.

Method for estimating the number of distinct data in a set of data, each datum being a personal datum relating to an individual and stored during a storage process of determined duration following the performance of an event by said individual, said method comprising steps of:
- obtaining (H10) a data structure determined according to a method in accordance with any one of claims 1 to 5, the cardinality of said structure obtained being equal to k,
- determination (H20) of a value Q equal to the quotient of k-1 by the largest value of the data structure obtained,
- determination (H30) of an estimator G of the number of distinct data of the data set as a function of Q.

Method according to claim 6, in which the estimator G is equal to g^-1(Q), where g^-1 is the inverse of a function g of unknown β having the expression:

[expression not reproduced]

expression in which:
● [expression not reproduced],
● [expression not reproduced],
● if the respective numbers of occurrences of the data of said data set are all equal to the same given value R, r_moy is equal to R,
● otherwise, and if the data structure is determined according to a method in accordance with any one of claims 2 and 5, r_moy is equal to the integer part of the quotient of N by Q.

Method for estimating the number of distinct pieces of data in a union of a plurality of sets of data, each piece of data being personal data relating to an individual and stored during a storage process of determined duration following the completion of an event by said individual, said method comprising steps of:
- obtaining (J10), for each set of data, a data structure determined according to a method in accordance with any one of claims 1 to 5,
- selection (J20) of a number D7 of smallest values of a structure corresponding to the union of the data structures obtained, D7 being equal to the smallest value among the respective cardinals of said data structures obtained,
- determination (J30) of a Q_UNION value equal to the quotient of D7-1 by the largest value among said smallest selected values,
- determination (J40) of an estimator G_UNION of the number of distinct data of said union of data sets as a function of Q_UNION.

Method according to claim 8, in which the estimator G_UNION is equal to g^-1(Q_UNION), where g^-1 is the inverse of a function g of unknown β having the expression:

[expression not reproduced]

expression in which:
● [expression not reproduced],
● [expression not reproduced],
● if the respective numbers of occurrences of the data of said data sets are all equal to the same given value R, r_moy is equal to R,
● otherwise, and if each data structure is determined according to a method in accordance with any one of claims 2 and 5, so as to obtain a value N_SUM equal to the sum of the respective counting values of said data structures, r_moy is equal to the integer part of the quotient of N_SUM by Q_UNION.

Method for estimating the number of distinct pieces of data in an intersection of a plurality of sets of data, each piece of data being personal data relating to an individual and stored during a storage process of determined duration following the completion of an event by said individual, said method comprising steps of:
- obtaining (M10), for each set of data, a data structure determined according to a method in accordance with any one of claims 1 to 5,
- selection (M20) of a number D7 of smallest values of a structure corresponding to the union of said obtained data structures, D7 being equal to the smallest value among the respective cardinals of said obtained data structures,
- determination (M30) of a Q_UNION value equal to the quotient of D7-1 by the largest value among said smallest selected values,
- determination (M40) of a Q_INTER value equal to the quotient, by D7, of the cardinality of a structure corresponding to the intersection of said obtained data structures,
- determination (M50) of an estimator G_INTER of the number of distinct data of said intersection of sets of data as a function of Q_INTER and Q_UNION.

Method according to claim 10, in which the sets of data, called sets "A_1, …, A_T", are T in number and respectively associated with parameters r_1, …, r_T, the cardinality of a data structure associated with a set A_i being equal to k_i, and the estimator G_INTER having the expression:

[expression not reproduced]

expression in which:
● if, for each set A_i, the respective numbers of occurrences of the data of said set A_i are all equal to the same given value R_i, r_i is equal to R_i,
● otherwise, and if each data structure is determined according to a method in accordance with any one of claims 2 and 5, so as to obtain for each data structure associated with a set A_i a counting value N_i, the parameter r_i of each set A_i is equal to the integer part of the quotient of N_i by a value Q_i, said value Q_i being equal to the quotient of k_i-1 by the largest value of the data structure obtained for said set A_i.

Computer program comprising instructions for implementing a method according to any one of claims 1 to 11 when said program is executed by a computer.

A computer-readable recording medium on which a computer program according to claim 12 is recorded.

Processing device (11) comprising means configured to implement a method according to any one of claims 1 to 11.

Computer system (10) comprising:
- a database (12) in which is stored at least one personal datum relating to an individual, said at least one datum having been stored during a memorization process of determined duration following the occurrence of an event by said individual,
- a processing device (11) according to claim 14.