EP3036678A1 - Method and apparatus for utility-aware privacy preserving mapping in view of collusion and composition - Google Patents

Method and apparatus for utility-aware privacy preserving mapping in view of collusion and composition

Info

Publication number
EP3036678A1
EP3036678A1 EP13812233.8A EP13812233A EP3036678A1 EP 3036678 A1 EP3036678 A1 EP 3036678A1 EP 13812233 A EP13812233 A EP 13812233A EP 3036678 A1 EP3036678 A1 EP 3036678A1
Authority
EP
European Patent Office
Prior art keywords
data
bound
public
private
privacy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP13812233.8A
Other languages
German (de)
French (fr)
Inventor
Nadia FAWAZ
Abbasali Makhdoumi KAKHAKI
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Definitions

  • This invention relates to a method and an apparatus for preserving privacy, and more particularly, to a method and an apparatus for preserving privacy of user data in view of collusion or composition.
  • This service, or other benefit that the user derives from allowing access to the user's data may be referred to as utility.
  • privacy risks arise as some of the collected data may be deemed sensitive by the user, e.g., political opinion, health status, income level, or may seem harmless at first sight, e.g., product ratings, yet lead to the inference of more sensitive data with which it is correlated.
  • the latter threat refers to an inference attack, a technique of inferring private data by exploiting its correlation with publicly released data.
  • FIG. 1 is a pictorial example illustrating collusion and composition.
  • FIG. 2 is a flow diagram depicting an exemplary method for preserving privacy, in accordance with an embodiment of the present principles.
  • FIG. 3 is a flow diagram depicting another exemplary method for preserving privacy, in accordance with an embodiment of the present principles.
  • FIG. 4 is a block diagram depicting an exemplary privacy agent, in accordance with an embodiment of the present principles.
  • FIG. 5 is a block diagram depicting an exemplary system that has multiple privacy agents, in accordance with an embodiment of the present principles.
  • the present principles provide a method for processing user data for a user, comprising the steps of: accessing the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data; determining a first information leakage bound between the private data and a first and second released data; determining a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first information leakage bound; determining a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound; modifying the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data; and releasing the modified first and second public data to at least one of a service provider and a data collecting agency as described below.
  • the present principles also provide a method for processing user data for a user, comprising the steps of: accessing the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data; determining a first information leakage bound between the private data and a first and second released data; determining a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first information leakage bound, wherein each of the second bound and the third bound substantially equals the first bound; determining a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound; modifying the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data; and releasing the modified first and second public data to at least one of a service provider and a data collecting agency as described below.
  • the term analyst, which for example may be a part of a service provider's system, as used in the present application, refers to a receiver of the released data, who ostensibly uses the data in order to provide utility to the user. Often the analyst is a legitimate receiver of the released data. However, an analyst could also illegitimately exploit the released data and infer some information about private data of the user. This creates a tension between privacy and utility requirements. To reduce the inference threat while maintaining utility, the user may release a "distorted version" of data, generated according to a conditional probabilistic mapping, called a "privacy preserving mapping," designed under a utility constraint.
  • we refer to the data a user would like to remain private as "private data," the data the user is willing to release as "public data," and the data the user actually releases as "released data."
  • a user may want to keep his political opinion private, and is willing to release his TV ratings with modification (for example, the user's actual rating of a program is 4, but he releases the rating as 3).
  • the user's political opinion is considered to be private data for this user
  • the TV ratings are considered to be public data
  • the released modified TV ratings are considered to be the released data.
  • another user may be willing to release both political opinion and TV ratings without modifications, and thus, for this other user, there is no distinction between private data, public data and released data when only political opinion and TV ratings are considered. If many people release political opinions and TV ratings, an analyst may be able to derive the correlation between political opinions and TV ratings, and thus, may be able to infer the political opinion of the user who wants to keep it private.
  • Private data refers to data that the user not only indicates should not be publicly released, but also does not want to be inferred from other data that he would release.
  • Public data is data that the user would allow the privacy agent to release, possibly in a distorted way to prevent the inference of the private data.
  • public data is the data that the service provider requests from the user in order to provide him with the service. The user however will distort (i.e., modify) it before releasing it to the service provider.
  • public data is the data that the user indicates as being "public” in the sense that he would not mind releasing it as long as the release takes a form that protects against inference of the private data.
  • whether a specific category of data is considered as private data or public data is based on the point of view of a specific user. For ease of notation, we call a specific category of data private data or public data from the perspective of the current user. For example, when trying to design a privacy preserving mapping for a current user who wants to keep his political opinion private, we call the political opinion private data for both the current user and for another user who is willing to release his political opinion.
  • we use the distortion between the released data and public data as a measure of utility.
  • when the distortion is larger, the released data is more different from the public data, and more privacy is preserved, but the utility derived from the distorted data may be lower for the user.
  • when the distortion is smaller, the released data is a more accurate representation of the public data and the user may receive more utility, for example, receive more accurate content recommendations.
  • we model the privacy-utility tradeoff and design the privacy preserving mapping by solving an optimization problem minimizing the information leakage, which is defined as mutual information between private data and released data, subject to a distortion constraint.
  • finding the privacy preserving mapping relies on the fundamental assumption that the prior joint distribution that links private data and released data is known and can be provided as an input to the optimization problem.
  • the true prior distribution may not be known, but rather some prior statistics may be estimated from a set of sample data that can be observed.
  • the prior joint distribution could be estimated from a set of users who do not have privacy concerns and publicly release different categories of data, which may be considered to be private or public data by the users who are concerned about their privacy.
  • the marginal distribution of the public data to be released, or simply its second order statistics may be estimated from a set of users who only release their public data.
  • the statistics estimated based on this set of samples are then used to design the privacy preserving mapping mechanism that will be applied to new users, who are concerned about their privacy.
  • the public data is denoted by a random variable $X \in \mathcal{X}$ with the probability distribution $P_X$.
  • X is correlated with the private data, denoted by random variable $S \in \mathcal{S}$.
  • the correlation of S and X is defined by the joint distribution $P_{S,X}$.
  • the released data, denoted by random variable $Y \in \mathcal{Y}$, is a distorted version of X.
  • Y is achieved via passing X through a kernel, $P_{Y|X}$.
  • the term "kernel" refers to a conditional probability that maps data X to data Y probabilistically. That is, the kernel $P_{Y|X}$ is the privacy preserving mapping that we wish to design.
  • $D(\cdot)$ is the K-L divergence
  • $E(\cdot)$ is the expectation of a random variable
  • $H(\cdot)$ is the entropy
  • $\epsilon \in [0,1]$ is called the leakage factor
  • $I(S; Y)$ represents the information leakage.
  • any distortion metric can be used, such as the
  • leakage factor, $\epsilon$, and distortion level, $D$, of a privacy preserving mapping.
  • our objective is to limit the amount of private information that can be inferred, given a utility constraint.
  • the objective can be mathematically formulated as to find the probability mapping $P_{Y|X}$ that minimizes the maximum information leakage $I(S; Y)$ given a distortion constraint, where the maximum is taken over the uncertainty in the statistical knowledge on the distribution $P_{S,X}$ available at the privacy agent:
  • the probability distribution P S Y can be obtained from the joint distribution
  • Theorem 1 decouples the dependency of Y and S into two terms, one relating S and X, and one relating X and Y. Thus, one can upper bound the information leakage even without knowing $P_{S,X}$, by minimizing the term relating X and Y.
  • the application of this result in our problem is the following:
  • I(S; X) is the intrinsic information embedded in X about S, which we do not have control on.
  • the value of $\Delta$ does not affect the mapping we will find, but the value of $\Delta$ affects what we think is the privacy guarantee (in terms of the leakage factor) resulting from this mapping. If the $\Delta$ bound is tight, then the privacy guarantee will be tight. If the $\Delta$ bound is not tight, we may then be paying more distortion than is actually necessary for a target leakage factor, but this does not affect the privacy guarantee.
  • Maximal correlation is a measure of correlation between two random variables with applications both in information theory and computer science.
  • maximal correlation provides its relation with $S^*(X; Y)$.
  • Ahlswede and P. Gacs, "Spreading of sets in product spaces and hypercontraction of the markov operator," The Annals of Probability (hereinafter “Ahlswede”):
  • Collusion: a private data, $S$, is correlated with two public data, $X_1$ and $X_2$.
  • Each privacy preserving mapping is designed to protect against the inference of S from each of the released data separately.
  • Decentralization simplifies the design, by breaking one large optimization with many variables (joint design) into several smaller optimizations with fewer variables.
  • Composition: a private data $S$ is correlated with the public data, $X_1$ and $X_2$, through the joint probability distribution $P(S, X_1, X_2)$.
  • FIG. 1 provides examples of collusion and composition:
  • Example 1: collusion when a single private data and multiple public data are considered.
  • Example 2: collusion when multiple private data and multiple public data are considered.
  • Example 3: composition when a single private data and multiple public data are considered.
  • Example 4: composition when multiple private data and multiple public data are considered.
  • a private data, $S$, is correlated with two public data, $X_1$ and $X_2$.
  • Netflix is a legitimate receiver of information about TV rating, but not snack rating
  • Kraft Foods is a legitimate receiver of information about snack rating, but not TV rating. However, they may share information in order to infer more about the user's private data.
  • Example 2: private data $S_1$ is correlated with public data $X_1$, and private data $S_2$ is correlated with public data $X_2$.
  • income as private data $S_1$
  • gender as private data $S_2$
  • TV rating as public data $X_1$
  • snack rating as public data $X_2$.
  • Two privacy preserving mappings are applied on these public data to obtain two released data, $Y_1$ and $Y_2$, provided to two analysts, respectively.
  • Example 3: a private data, $S$, is correlated with public data $X_1$ and $X_2$ through joint probability distribution $P_{S,X_1,X_2}$.
  • we consider political opinion as private data $S$, TV rating for Fox news as public data $X_1$ and TV rating for ABC news as public data $X_2$.
  • An analyst, for example, Comcast, asks for both $X_1$ and $X_2$.
  • the privacy preserving mappings are designed separately and we want to analyze the privacy guarantees when the privacy agent combines her information $Y_1$ and $Y_2$ about both $S_1$ and $S_2$.
  • Comcast is a legitimate receiver of both TV ratings for Fox news and ABC news.
  • Example 4: two private data, $S_1$ and $S_2$, are correlated with public data, $X_1$ and $X_2$, through joint probability distribution $P_{S_1,S_2,X_1,X_2}$.
  • income as private data $S_1$
  • gender as private data $S_2$
  • TV rating as public data $X_1$
  • snack rating as public data $X_2$.
  • mappings for large size X are more difficult to design than mappings for small size X (possibly one variable, or a small vector), as the complexity of the optimization problem which provides a solution to the privacy mapping scales with the size of vector X.
  • a private random variable $S$ is correlated with $X_1$ and $X_2$.
  • Distorted versions of $X_1$ and $X_2$ are denoted by $Y_1$ and $Y_2$, respectively.
  • privacy preserving mappings $P(Y_1|X_1)$ and $P(Y_2|X_2)$ are applied on $X_1$ and $X_2$ to obtain $Y_1$ and $Y_2$, respectively, given distortion constraints.
  • the individual information leakages are $I(S; Y_1)$ and $I(S; Y_2)$.
  • $Y_1$ and $Y_2$ are combined together into a pair $(Y_1, Y_2)$, either by colluding entities, or by a privacy agent through composition.
  • a private random variable $S$ is correlated with $X_1$ and $X_2$.
  • Distorted versions of $X_1$ and $X_2$ are denoted by $Y_1$ and $Y_2$, respectively.
  • the privacy preserving mappings $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$ are designed with given distortion constraints, and the individual information leakages are $I(S; Y_1)$ and $I(S; Y_2)$, respectively.
  • the two released data $Y_1$ and $Y_2$ are combined together into a pair $(Y_1, Y_2)$, either by colluding entities, or by a privacy agent through composition.
  • Lemma 1: Assume $Y_1$, $Y_2$, and $S$ form a Markov chain in any order. If the privacy preserving mappings leak $I(Y_1; S)$ and $I(Y_2; S)$ bits by $Y_1$ and $Y_2$, respectively, then at most $I(Y_1; S) + I(Y_2; S)$ bits of information are leaked by the pair $Y_1$ and $Y_2$. In other words, $I(S; Y_1, Y_2) \le I(Y_1; S) + I(Y_2; S)$. Moreover, if $S \to Y_1 \to Y_2$, then $I(S; Y_1, Y_2) \le$
  • Lemma 1 applies regardless of how much knowledge on $P_{S,X}$ is available when the mapping is designed.
  • the bounds in Lemma 1 hold when $P_{S,X}$ is known. They also hold if the privacy preserving mappings are designed using the method based on the separability result in Theorem 1.
  • FIG. 2 illustrates an exemplary method 200 for preserving privacy in view of collusion or composition, in accordance with an embodiment of the present principles.
  • Method 200 starts at step 205.
  • it collects statistical information based on the single private data $S$ and public data $X_1$ and $X_2$.
  • it decides the cumulative privacy guarantee for the private data $S$ in view of collusion or composition of released data $Y_1$ and $Y_2$. That is, it decides a leakage factor $\epsilon$ for $I(S; Y_1, Y_2)$.
  • the privacy preserving mappings are designed in a decentralized fashion for public data $X_1$ and $X_2$.
  • it determines a privacy preserving mapping $P_{Y_1|X_1}$ for public data $X_1$, given leakage factor $\epsilon_1$ for $I(S; Y_1)$.
  • it determines a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon_2$ for $I(S; Y_2)$.
  • collusion may occur when a legitimate receiver of released data $Y_1$ (but not $Y_2$) exchanges information about $Y_2$ with a legitimate receiver of released data $Y_2$ (but not $Y_1$).
  • both released data are legitimately received by the same receiver, and composition occurs when the receiver combines information from both released data to infer more information about the user.
  • $S^*(X_1, X_2; Y_1, Y_2) = \max\{S^*(X_1; Y_1), S^*(X_2; Y_2)\}$.
  • I(Yi>' 3 ⁇ 4 ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ) is the only required inequality as mentioned in Anantharam to obtain the inequality (20) (see Anantharam, page 10, part C).
  • FIG. 3 illustrates an exemplary method 300 for preserving privacy in view of collusion or composition, in accordance with an embodiment of the present principles.
  • Method 300 is similar to method 200, except that $S^*(X_1; Y_1) \le \epsilon$ (330) and $S^*(X_2; Y_2) \le \epsilon$ (335). Note that method 200 works under some Markov chain assumptions stated in Lemma 1, while method 300 works more generally.
  • Multiple private data, multiple public data
  • the cumulative information leakage of the pair $Y_1$ and $Y_2$ is bounded by (21). In particular, if $X_1$ and $X_2$ are independent, then this bound holds.
  • method 200 determines privacy preserving mappings considering a single private data and two public data in view of collusion or composition.
  • method 200 can be applied with some modifications.
  • in step 210, we collect statistical information based on $S_1$, $S_2$, $X_1$ and $X_2$.
  • in step 230, we design a privacy preserving mapping $P_{Y_1|X_1}$ for public data $X_1$, given leakage factor $\epsilon_1$ for $I(S_1; Y_1)$.
  • in step 235, we design a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon_2$ for $I(S_2; Y_2)$.
  • in step 310, we collect statistical information based on $S_1$, $S_2$, $X_1$ and $X_2$.
  • in step 335, we design a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon$ for $I(S_2; Y_2)$.
  • a privacy agent is an entity that provides privacy service to a user.
  • a privacy agent may perform any of the following:
  • FIG. 4 depicts a block diagram of an exemplary system 400 where a privacy agent can be used.
  • Public users 410 release their private data (S) and/or public data (X).
  • the information released by the public users becomes statistical information useful for a privacy agent.
  • a privacy agent 480 includes statistics collecting module 420, privacy preserving mapping decision module 430, and privacy preserving module 440.
  • Statistics collecting module 420 may be used to collect joint distribution P s x , marginal probability measure P x , and/or mean and covariance of public data.
  • Statistics collecting module 420 may also receive statistics from data aggregators, such as bluekai.com.
  • privacy preserving mapping decision module 430 designs several privacy preserving mapping mechanisms.
  • Privacy preserving module 440 distorts public data of private user 460 before it is released, according to the conditional probability.
  • the privacy preserving module may design separate privacy preserving mappings for $X_1$ and $X_2$, respectively, in view of composition.
  • each colluding entity may use system 400 to design a separate privacy preserving mapping.
  • the privacy agent needs only the statistics to work, without knowledge of the entire data that was collected in the data collection module and from which the statistics were computed.
  • the data collection module could be a standalone module that collects data and then computes statistics, and need not be part of the privacy agent. The data collection module shares the statistics with the privacy agent.
  • a privacy agent sits between a user and a receiver of the user data (for example, a service provider).
  • a privacy agent may be located at a user device, for example, a computer, or a set-top box (STB).
  • a privacy agent may be a separate entity.
  • All the modules of a privacy agent may be located at one device, or may be distributed over different devices. For example, statistics collecting module 420 may be located at a data aggregator who only releases statistics to module 430; the privacy preserving mapping decision module 430 may be located at a "privacy service provider" or at the user end on the user device connected to module 420; and the privacy preserving module 440 may be located at a privacy service provider, who then acts as an intermediary between the user and the service provider to whom the user would like to release data, or at the user end on the user device.
  • the privacy agent may provide released data to a service provider, for example, Comcast or Netflix, in order for private user 460 to improve received service based on the released data; for example, a recommendation system provides movie recommendations to a user based on the user's released movie rankings.
  • In FIG. 5, we show that there are multiple privacy agents in the system. In different variations, there need not be privacy agents everywhere, as this is not a requirement for the privacy system to work. For example, there could be a privacy agent only at the user device, or at the service provider, or at both. In FIG. 5, we show the same privacy agent "C" for both Netflix and Facebook. In another embodiment, the privacy agents at Facebook and Netflix can, but need not, be the same.
  • the implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal.
  • An apparatus may be implemented in, for example, appropriate hardware, software, and firmware.
  • the methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device.
  • processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
  • the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
  • Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
  • Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information. Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with “accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory).
  • receiving is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
  • implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted.
  • the information may include, for example, instructions for performing a method, or data produced by one of the described implementations.
  • a signal may be formatted to carry the bitstream of a described embodiment.
  • Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal.
  • the formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream.
  • the information that the signal carries may be, for example, analog or digital information.
  • the signal may be transmitted over a variety of different wired or wireless links, as is known.
  • the signal may be stored on a processor-readable medium.
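
The leakage and distortion quantities defined above can be made concrete with a small numerical sketch. The following is only an illustration under stated assumptions, not the method of the present principles: the binary prior `P_SX`, the bit-flipping kernel `binary_kernel`, and the use of Hamming distance as the distortion metric are all hypothetical choices made for this example. It evaluates the information leakage I(S; Y) and the expected distortion for the Markov chain S -> X -> Y as the kernel becomes noisier.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def binary_kernel(flip):
    """Hypothetical kernel P_{Y|X}: release X unchanged w.p. 1 - flip, else flip it."""
    return {(x, y): (1 - flip) if x == y else flip
            for x in (0, 1) for y in (0, 1)}

def leakage_and_distortion(p_sx, kernel):
    """Return (I(S;Y), E[Hamming(X, Y)]) for the Markov chain S -> X -> Y."""
    p_sy, distortion = {}, 0.0
    for (s, x), p in p_sx.items():
        for (x_in, y), k in kernel.items():
            if x_in != x:
                continue
            # P(s, y) = sum_x P(s, x) * P(y | x)
            p_sy[(s, y)] = p_sy.get((s, y), 0.0) + p * k
            if x != y:
                distortion += p * k
    return mutual_information(p_sy), distortion

# Made-up prior P_{S,X}: private bit S and public bit X agree 80% of the time.
P_SX = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

for flip in (0.0, 0.1, 0.3, 0.5):
    leak, dist = leakage_and_distortion(P_SX, binary_kernel(flip))
    print(f"flip={flip:.1f}  I(S;Y)={leak:.4f} bits  distortion={dist:.2f}")
```

In this toy setting a larger flip probability lowers the leakage and raises the distortion, which is the privacy-utility tension described above. A real design would search over kernels subject to a distortion constraint rather than sweep a single parameter.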

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Medical Informatics (AREA)
  • Storage Device Security (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Operations Research (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Automation & Control Theory (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)

Abstract

The present embodiments focus on the privacy-utility tradeoff encountered by a user who wishes to release to an analyst some public data that is correlated with his private data, in the hope of getting some utility. When multiple data are released to one or more analysts, we design privacy preserving mappings in a decentralized fashion. In particular, each privacy preserving mapping is designed to protect against the inference of private data from each of the released data separately. Decentralization simplifies the design, by breaking one large joint optimization problem with many variables into several smaller optimizations with fewer variables.
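
As a sanity check on what happens when separately designed releases are combined, the following sketch compares the leakage of the pair (Y1, Y2) with the sum of the individual leakages on a toy model. Everything in it is an assumption made for illustration: S is a uniform bit, X1 and X2 are noisy copies of S that are conditionally independent given S, and each mapping is a simple bit-flipping kernel with hypothetical crossover probabilities. Under this model Y1 and Y2 are conditionally independent given S, so the additive bound I(S; Y1, Y2) <= I(S; Y1) + I(S; Y2) is expected to hold.

```python
import math
from itertools import product

def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def step(a, b, q):
    """Probability that a binary symmetric step with crossover q turns a into b."""
    return 1 - q if a == b else q

def joint_s_y1_y2(noise=(0.1, 0.2), flip=(0.2, 0.3)):
    """Joint pmf of (S, Y1, Y2) for the toy chain Y1 <- X1 <- S -> X2 -> Y2."""
    joint = {}
    for s, x1, x2, y1, y2 in product((0, 1), repeat=5):
        p = (0.5
             * step(s, x1, noise[0]) * step(s, x2, noise[1])
             * step(x1, y1, flip[0]) * step(x2, y2, flip[1]))
        joint[(s, y1, y2)] = joint.get((s, y1, y2), 0.0) + p
    return joint

def leakages(joint):
    """Return (I(S;Y1), I(S;Y2), I(S;Y1,Y2)) from a joint pmf over (s, y1, y2)."""
    first, second, pair = {}, {}, {}
    for (s, y1, y2), p in joint.items():
        first[(s, y1)] = first.get((s, y1), 0.0) + p
        second[(s, y2)] = second.get((s, y2), 0.0) + p
        pair[(s, (y1, y2))] = pair.get((s, (y1, y2)), 0.0) + p
    return (mutual_information(first),
            mutual_information(second),
            mutual_information(pair))

i1, i2, i12 = leakages(joint_s_y1_y2())
print(f"I(S;Y1)={i1:.4f}  I(S;Y2)={i2:.4f}  I(S;Y1,Y2)={i12:.4f}")
print("additive bound holds:", i12 <= i1 + i2 + 1e-12)
```

The pair always leaks at least as much as either release alone, which is why a cumulative guarantee has to be budgeted across the individual mappings.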

Description

Method and Apparatus for Utility-Aware Privacy Preserving Mapping in View of Collusion and Composition
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of the filing date of the following U.S. Provisional Application, which is hereby incorporated by reference in its entirety for all purposes: Serial No. 61/867,544, filed on August 19, 2013, and titled "Method and Apparatus for Utility-Aware Privacy Preserving Mapping in View of Collusion and Composition."
This application is related to U.S. Provisional Patent Application Serial No. 61/691,090 filed on August 20, 2012, and titled "A Framework for Privacy against Statistical Inference" (hereinafter "Fawaz"). The provisional application is expressly incorporated by reference herein in its entirety.
In addition, this application is related to the following applications: (1) Attorney Docket No. PU130120, entitled "Method and Apparatus for Utility-Aware Privacy Preserving Mapping against Inference Attacks," and (2) Attorney Docket No. PU130122, entitled "Method and Apparatus for Utility-Aware Privacy Preserving Mapping through Additive Noise," which are commonly assigned, incorporated by reference in their entireties, and concurrently filed herewith.
TECHNICAL FIELD
This invention relates to a method and an apparatus for preserving privacy, and more particularly, to a method and an apparatus for preserving privacy of user data in view of collusion or composition.
BACKGROUND
In the era of Big Data, the collection and mining of user data has become a fast growing and common practice by a large number of private and public institutions. For example, technology companies exploit user data to offer personalized services to their customers, government agencies rely on data to address a variety of challenges, e.g., national security, national health, budget and fund allocation, and medical institutions analyze data to discover the origins of and potential cures for diseases. In some cases, the collection, the analysis, or the sharing of a user's data with third parties is performed without the user's consent or awareness. In other cases, data is released voluntarily by a user to a specific analyst, in order to get a service in return, e.g., product ratings released to get recommendations. This service, or other benefit that the user derives from allowing access to the user's data, may be referred to as utility. In either case, privacy risks arise as some of the collected data may be deemed sensitive by the user, e.g., political opinion, health status, income level, or may seem harmless at first sight, e.g., product ratings, yet lead to the inference of more sensitive data with which it is correlated. The latter threat refers to an inference attack, a technique of inferring private data by exploiting its correlation with publicly released data.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a pictorial example illustrating collusion and composition.
FIG. 2 is a flow diagram depicting an exemplary method for preserving privacy, in accordance with an embodiment of the present principles.
FIG. 3 is a flow diagram depicting another exemplary method for preserving privacy, in accordance with an embodiment of the present principles.
FIG. 4 is a block diagram depicting an exemplary privacy agent, in accordance with an embodiment of the present principles.
FIG. 5 is a block diagram depicting an exemplary system that has multiple privacy agents, in accordance with an embodiment of the present principles.
SUMMARY
The present principles provide a method for processing user data for a user, comprising the steps of: accessing the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data; determining a first information leakage bound between the private data and a first and second released data; determining a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first information leakage bound; determining a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound; modifying the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data; and releasing the modified first and second public data to at least one of a service provider and a data collecting agency as described below. The present principles also provide an apparatus for performing these steps.
The present principles also provide a method for processing user data for a user, comprising the steps of: accessing the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data; determining a first information leakage bound between the private data and a first and second released data; determining a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first information leakage bound, wherein each of the second bound and the third bound substantially equals the first bound; determining a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound; modifying the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data; and releasing the modified first and second public data to at least one of a service provider and a data collecting agency as described below. The present principles also provide an apparatus for performing these steps. The present principles also provide a computer readable storage medium having stored thereon instructions for processing user data for a user according to the methods described above.
DETAILED DESCRIPTION
In the database and cryptography literatures from which differential privacy arose, the focus has been algorithmic. In particular, researchers have used differential privacy to design privacy preserving mechanisms for inference algorithms, transporting, and querying data. More recent works focused on the relation of differential privacy with statistical inference. It is shown that differential privacy does not guarantee a limited information leakage. Other frameworks similar to differential privacy exist such as the Pufferfish framework, which can be found in an article by D. Kifer and A. Machanavajjhala, "A rigorous and customizable framework for privacy," in ACM PODS, 2012, which however does not focus on utility preservation. Many approaches rely on information-theoretic techniques to model and analyze privacy-accuracy tradeoff. Most of these information-theoretic models focus mainly on collective privacy for all or subsets of the entries of a database, and provide asymptotic guarantees on the average remaining uncertainty per database entry, or equivocation per input variable, after the output release. In contrast, the framework studied in the present application provides privacy in terms of bounds on the information leakage that an analyst achieves by observing the released output.
We consider the setting described in Fawaz, where a user has two kinds of data that are correlated: some data that he would like to remain private, and some non-private data that he is willing to release to an analyst and from which he may derive some utility, for example, the release of media preferences to a service provider to receive more accurate content recommendations.
The term analyst, which for example may be a part of a service provider's system, as used in the present application, refers to a receiver of the released data, who ostensibly uses the data in order to provide utility to the user. Often the analyst is a legitimate receiver of the released data. However, an analyst could also illegitimately exploit the released data and infer some information about private data of the user. This creates a tension between privacy and utility requirements. To reduce the inference threat while maintaining utility the user may release a "distorted version" of data, generated according to a conditional probabilistic mapping, called "privacy preserving mapping," designed under a utility constraint.
In the present application, we refer to the data a user would like to remain private as "private data," the data the user is willing to release as "public data," and the data the user actually releases as "released data." For example, a user may want to keep his political opinion private, and is willing to release his TV ratings with modification (for example, the user's actual rating of a program is 4, but he releases the rating as 3). In this case, the user's political opinion is considered to be private data for this user, the TV ratings are considered to be public data, and the released modified TV ratings are considered to be the released data. Note that another user may be willing to release both political opinion and TV ratings without modifications, and thus, for this other user, there is no distinction between private data, public data and released data when only political opinion and TV ratings are considered. If many people release political opinions and TV ratings, an analyst may be able to derive the correlation between political opinions and TV ratings, and thus, may be able to infer the political opinion of the user who wants to keep it private.
Regarding private data, this refers to data that the user not only indicates that it should not be publicly released, but also that he does not want it to be inferred from other data that he would release. Public data is data that the user would allow the privacy agent to release, possibly in a distorted way to prevent the inference of the private data.
In one embodiment, public data is the data that the service provider requests from the user in order to provide him with the service. The user however will distort (i.e., modify) it before releasing it to the service provider. In another embodiment, public data is the data that the user indicates as being "public" in the sense that he would not mind releasing it as long as the release takes a form that protects against inference of the private data.
As discussed above, whether a specific category of data is considered as private data or public data is based on the point of view of a specific user. For ease of notation, we call a specific category of data as private data or public data from the perspective of the current user. For example, when trying to design privacy preserving mapping for a current user who wants to keep his political opinion private, we call the political opinion as private data for both the current user and for another user who is willing to release his political opinion.
In the present principles, we use the distortion between the released data and public data as a measure of utility. When the distortion is larger, the released data is more different from the public data, and more privacy is preserved, but the utility derived from the distorted data may be lower for the user. On the other hand, when the distortion is smaller, the released data is a more accurate representation of the public data and the user may receive more utility, for example, receive more accurate content recommendations. In one embodiment, to preserve privacy against statistical inference, we model the privacy-utility tradeoff and design the privacy preserving mapping by solving an optimization problem minimizing the information leakage, which is defined as mutual information between private data and released data, subject to a distortion constraint. In Fawaz, finding the privacy preserving mapping relies on the fundamental assumption that the prior joint distribution that links private data and released data is known and can be provided as an input to the optimization problem. In practice, the true prior distribution may not be known, but rather some prior statistics may be estimated from a set of sample data that can be observed. For example, the prior joint distribution could be estimated from a set of users who do not have privacy concerns and publicly release different categories of data, which may be considered to be private or public data by the users who are concerned about their privacy. Alternatively when the private data cannot be observed, the marginal distribution of the public data to be released, or simply its second order statistics, may be estimated from a set of users who only release their public data. The statistics estimated based on this set of samples are then used to design the privacy preserving mapping mechanism that will be applied to new users, who are concerned about their privacy. 
In practice, there may also exist a mismatch between the estimated prior statistics and the true prior statistics, due for example to a small number of observable samples, or to the incompleteness of the observable data.
To formulate the problem, the public data is denoted by a random variable $X \in \mathcal{X}$ with the probability distribution $P_X$. $X$ is correlated with the private data, denoted by random variable $S \in \mathcal{S}$. The correlation of $S$ and $X$ is defined by the joint distribution $P_{S,X}$. The released data, denoted by random variable $Y \in \mathcal{Y}$, is a distorted version of $X$. $Y$ is obtained by passing $X$ through a kernel, $P_{Y|X}$. In the present application, the term "kernel" refers to a conditional probability that maps data $X$ to data $Y$ probabilistically. That is, the kernel $P_{Y|X}$ is the privacy preserving mapping that we wish to design. Since $Y$ is a probabilistic function of only $X$, in the present application, we assume $S \to X \to Y$ form a Markov chain. Therefore, once we define $P_{Y|X}$, we have the joint distribution $P_{S,X,Y} = P_{Y|X} P_{S,X}$ and in particular the joint distribution $P_{S,Y}$.
In the following, we first define the privacy notion, and then the accuracy notion.
Definition 1. Assume $S \to X \to Y$. A kernel $P_{Y|X}$ is called $\epsilon$-divergence private if the distribution $P_{S,Y}$ resulting from the joint distribution $P_{S,X,Y} = P_{Y|X} P_{S,X}$ satisfies
$D(P_{S,Y} \,\|\, P_S P_Y) = E_{S,Y}\left[\log \frac{P(S|Y)}{P(S)}\right] = I(S;Y) \le \epsilon H(S)$, (1)
where $D(\cdot)$ is the K-L divergence, $E(\cdot)$ is the expectation of a random variable, $H(\cdot)$ is the entropy, $\epsilon \in [0,1]$ is called the leakage factor, and the mutual information $I(S;Y)$ represents the information leakage.
We say a mechanism has full privacy if $\epsilon = 0$. In extreme cases, $\epsilon = 0$ implies that the released random variable, $Y$, is independent of the private random variable, $S$, and $\epsilon = 1$ implies that $S$ is fully recoverable from $Y$ ($S$ is a deterministic function of $Y$). Note that one can make $Y$ completely independent of $S$ to have full privacy ($\epsilon = 0$), but this may lead to a poor accuracy level. We define accuracy as follows.
Definition 2. Let $d: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^+$ be a distortion measure. A kernel $P_{Y|X}$ is called $D$-accurate if $E[d(X,Y)] \le D$.
It should be noted that any distortion metric can be used, such as the Hamming distance if $X$ and $Y$ are binary vectors, or the Euclidean norm if $X$ and $Y$ are real vectors, or even more complex metrics modeling the variation in utility that a user would derive from the release of $Y$ instead of $X$. The latter could, for example, represent the difference in the quality of content recommended to the user based on the release of his distorted media preferences $Y$ instead of his true preferences $X$.
There is a tradeoff between leakage factor, $\epsilon$, and distortion level, $D$, of a privacy preserving mapping. In one embodiment, our objective is to limit the amount of private information that can be inferred, given a utility constraint. When inference is measured by information leakage between private data and released data and utility is indicated by distortion between public data and released data, the objective can be mathematically formulated as finding the probability mapping $P_{Y|X}$ that minimizes the maximum information leakage $I(S;Y)$ given a distortion constraint, where the maximum is taken over the uncertainty in the statistical knowledge on the distribution $P_{S,X}$ available at the privacy agent:
$\min_{P_{Y|X}} \max_{P_{S,X}} I(S;Y)$, s.t. $E[d(X,Y)] \le D$.
The probability distribution $P_{S,Y}$ can be obtained from the joint distribution $P_{S,X,Y} = P_{Y|X} P_{S,X}$.
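As a minimal illustrative sketch (not part of the described embodiments), the two quantities in the optimization above, the information leakage I(S;Y) and the expected distortion E[d(X,Y)], can be evaluated numerically for a fixed kernel. The binary prior, the bit-flip kernel, the Hamming distortion, and the helper names `mutual_information`, `leakage_and_distortion`, and `flip_kernel` are assumptions chosen for illustration. The sketch only evaluates candidate kernels under a known prior; it does not solve the min-max problem.

```python
import math


def mutual_information(joint):
    """I(A;B) in bits for a joint pmf given as a 2-D list joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    total = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (p_a[a] * p_b[b]))
    return total


def leakage_and_distortion(p_sx, kernel, d):
    """Return (I(S;Y), E[d(X,Y)]) for p_sx[s][x] and kernel[x][y] = P(Y=y | X=x)."""
    n_x, n_y = len(kernel), len(kernel[0])
    # Markov chain S -> X -> Y: P(s, y) = sum_x P(s, x) P(y | x)
    p_sy = [[sum(p_sx[s][x] * kernel[x][y] for x in range(n_x))
             for y in range(n_y)] for s in range(len(p_sx))]
    p_x = [sum(col) for col in zip(*p_sx)]
    distortion = sum(p_x[x] * kernel[x][y] * d(x, y)
                     for x in range(n_x) for y in range(n_y))
    return mutual_information(p_sy), distortion


def hamming(x, y):
    """Hamming distortion on single symbols."""
    return 0.0 if x == y else 1.0


def flip_kernel(q):
    """Hypothetical kernel: release X unchanged with prob. 1 - q, flipped with prob. q."""
    return [[1 - q, q], [q, 1 - q]]


# Assumed toy prior: S and X are fair bits that agree with probability 0.9.
p_sx = [[0.45, 0.05], [0.05, 0.45]]

results = {q: leakage_and_distortion(p_sx, flip_kernel(q), hamming)
           for q in (0.0, 0.1, 0.3, 0.5)}
for q, (leak, dist) in results.items():
    print(f"flip prob {q:.1f}: leakage {leak:.4f} bits, distortion {dist:.2f}")
```

In this toy setting, raising the flip probability lowers the leakage and raises the distortion, which is the tradeoff the distortion constraint is meant to control.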
In the following, we propose a scheme to achieve privacy (i.e., to minimize information leakage) subject to the distortion constraint, based on some techniques in statistical inference, called maximal correlation. We show how we can use this theory to design privacy preserving mappings without the full knowledge of the joint probability measure $P_{S,X}$. In particular, we prove a separability result on the information leakage: more precisely, we provide an upper bound on the information leakage in terms of $I(S;X)$ times a maximal correlation factor, which is determined by the kernel, $P_{Y|X}$. This permits formulating the optimum mapping without the full knowledge of the joint probability measure $P_{S,X}$.
Next, we provide a definition that is used in stating a decoupling result.
Definition 3. For a given joint distribution $P_{X,Y}$, let $S^*(X;Y) = \sup_{r(x) \ne p(x)} \frac{D(r(y) \,\|\, p(y))}{D(r(x) \,\|\, p(x))}$, where $r(y)$ is the marginal measure of $p(y|x) r(x)$ on $Y$.
Note that $S^*(X;Y) \le 1$ because of the data processing inequality for divergence. The following is a result of an article by V. Anantharam, A. Gohari, S. Kamath, and C. Nair, "On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover," arXiv preprint arXiv:1304.6133, 2013 (hereinafter "Anantharam").
Theorem 1. If $S \to X \to Y$ form a Markov chain, the following bound holds:
$I(S;Y) \le S^*(X;Y)\, I(S;X)$, (6)
and the bound is tight as we vary $S$. In other words, we have
$\sup_{S:\, S \to X \to Y} \frac{I(S;Y)}{I(S;X)} = S^*(X;Y)$,
assuming $I(S;X) \ne 0$.
Theorem 1 decouples the dependency of $Y$ and $S$ into two terms, one relating $S$ and $X$, and one relating $X$ and $Y$. Thus, one can upper bound the information leakage even without knowing $P_{S,X}$, by minimizing the term relating $X$ and $Y$. The application of this result in our problem is the following:
Assume we are in a regime where $P_{S,X}$ is not known and $I(S;X) \le \Delta$ for some $\Delta \in [0, H(S)]$. $I(S;X)$ is the intrinsic information embedded in $X$ about $S$, over which we have no control. The value of $\Delta$ does not affect the mapping we will find, but it affects what we think is the privacy guarantee (in terms of the leakage factor) resulting from this mapping. If the $\Delta$ bound is tight, then the privacy guarantee will be tight. If the $\Delta$ bound is not tight, we may then be paying more distortion than is actually necessary for a target leakage factor, but this does not affect the privacy guarantee.
Using Theorem 1, we have
$I(S;Y) \le S^*(X;Y)\, I(S;X) \le \Delta\, S^*(X;Y)$.
Therefore, the optimization problem becomes finding $P_{Y|X}$ that minimizes the following objective function:
$\min_{P_{Y|X}} \max_{P_X} S^*(X;Y)$, s.t. $E[d(X,Y)] \le D$. (8)
In order to study this optimization problem in more detail, we review some results in maximal correlation literature. Maximal correlation (or Renyi correlation) is a measure of correlation between two random variables with applications both in information theory and computer science. In the following, we define maximal correlation and provide its relation with $S^*(X;Y)$.
Definition 4. Given two random variables $X$ and $Y$, the maximal correlation of $(X, Y)$ is
$\rho_m(X;Y) = \max_{(f(X),\, g(Y)) \in T} E[f(X) g(Y)]$, (9)
where $T$ is the collection of pairs of real-valued random variables $f(X)$ and $g(Y)$ such that $E[f(X)] = E[g(Y)] = 0$ and $E[f(X)^2] = E[g(Y)^2] = 1$.
This measure was first introduced by Hirschfeld (H. O. Hirschfeld, "A connection between correlation and contingency," in Proceedings of the Cambridge Philosophical Society, vol. 31) and Gebelein (H. Gebelein, "Das statistische Problem der Korrelation als Variations- und Eigenwert-problem und sein Zusammenhang mit der Ausgleichungsrechnung," Zeitschrift fur angew. Math. und Mech. 21, pp. 364-379 (1941)), and then studied by Renyi (A. Renyi, "On measures of dependence," Acta Mathematica Hungarica, vol. 10, no. 3). Recently, Anantharam et al. and Kamath et al. (S. Kamath and V. Anantharam, "Non-interactive simulation of joint distributions: The hirschfeld-gebelein-renyi maximal correlation and the hypercontractivity ribbon," in Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, hereinafter "Kamath") studied the maximal correlation and provided a geometric interpretation of this quantity. The following is a result of an article by R. Ahlswede and P. Gacs, "Spreading of sets in product spaces and hypercontraction of the markov operator," The Annals of Probability (hereinafter "Ahlswede"):
$\max_{P_X} \rho_m^2(X;Y) = \max_{P_X} S^*(X;Y)$. (10)
Substituting (10) in (8), the privacy preserving mapping is the solution of
$\min_{P_{Y|X}} \max_{P_X} \rho_m^2(X;Y)$, s.t. $E[d(X,Y)] \le D$. (11)
It is shown in an article by H. S. Witsenhausen, "On sequences of pairs of dependent random variables," SIAM Journal on Applied Mathematics, vol. 28, no. 1, that the maximal correlation $\rho_m(X;Y)$ is characterized by the second largest singular value of the matrix $Q$ with entries $Q_{x,y} = \frac{P(x,y)}{\sqrt{P(x) P(y)}}$. The optimization problem can be solved by the power iteration algorithm or the Lanczos algorithm for finding singular values of a matrix.
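The singular-value characterization above can be sketched in a few lines. The following is an illustrative sketch only, assuming a finite joint pmf with strictly positive marginals. The function name `maximal_correlation`, the iteration count, and the deflation of the top singular direction are choices made for this example, not details taken from the present application. It computes the maximal correlation for a fixed joint distribution and does not perform the optimization over kernels.

```python
import math
import random


def maximal_correlation(p_xy, iters=500, seed=0):
    """Second largest singular value of Q[x][y] = P(x,y) / sqrt(P(x) P(y)).

    Power iteration on Q^T Q, restricted to the subspace orthogonal to the
    top right singular vector sqrt(P(y)), whose singular value is 1.
    Assumes every marginal probability is strictly positive.
    """
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    n_x, n_y = len(p_x), len(p_y)
    q = [[p_xy[x][y] / math.sqrt(p_x[x] * p_y[y]) for y in range(n_y)]
         for x in range(n_x)]
    top = [math.sqrt(p) for p in p_y]  # unit-norm vector, since sum(p_y) = 1

    rng = random.Random(seed)
    v = [rng.random() for _ in range(n_y)]
    sigma = 0.0
    for _ in range(iters):
        # Deflation: remove the component along the top singular vector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < 1e-15:  # nothing left: all other singular values are zero
            return 0.0
        v = [a / norm for a in v]
        u = [sum(q[x][y] * v[y] for y in range(n_y)) for x in range(n_x)]  # Q v
        sigma = math.sqrt(sum(a * a for a in u))  # ||Q v|| for unit-norm v
        v = [sum(q[x][y] * u[x] for x in range(n_x)) for y in range(n_y)]  # Q^T Q v
    return sigma


# Assumed example: X and Y are fair bits that agree with probability 0.9.
rho = maximal_correlation([[0.45, 0.05], [0.05, 0.45]])
print(f"maximal correlation: {rho:.4f}, squared: {rho ** 2:.4f}")
```

For the binary symmetric example the matrix Q is [[0.9, 0.1], [0.1, 0.9]], whose singular values are 1 and 0.8, so the sketch should report a maximal correlation close to 0.8.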
In the above, we discuss how privacy preserving mappings can be designed using the separability result in Theorem 1. The methods discussed above are among the techniques that can be used to address new challenges in the design of privacy preserving mapping mechanisms that arise when multiple data releases to one or several analysts occur. In the present application, we provide privacy mapping mechanisms in view of collusion or composition.
In the following, we define the challenges under collusion and composition.
Collusion: a private data, $S$, is correlated with two public data, $X_1$ and $X_2$. Two privacy preserving mappings are applied on these public data to obtain two released data, $Y_1$ and $Y_2$, respectively, which are then released to two analysts. We wish to analyze the cumulative privacy guarantees on $S$ when the analysts share $Y_1$ and $Y_2$. In the present application, we also refer to the analysts that share $Y_1$ and $Y_2$ as colluding entities.
We focus on the case where the two privacy-preserving mappings are designed in a decentralized fashion: Each privacy preserving mapping is designed to protect against the inference of S from each of the released data separately.
Decentralization simplifies the design, by breaking one large optimization with many variables (joint design) into several smaller optimizations with fewer variables.
Composition: a private data $S$ is correlated with the public data, $X_1$ and $X_2$, through the joint probability distribution $P(S, X_1, X_2)$. Assume that we are able to design separately two privacy preserving mappings, where one mapping transforms $X_1$ into $Y_1$, and the other mapping transforms $X_2$ into $Y_2$. An analyst requests the pair $(X_1, X_2)$. We wish to re-use these two separate privacy mappings to generate a privacy preserving mapping for the pair $(X_1, X_2)$, which still guarantees a certain level of privacy.
FIG. 1 provides examples on collusion and composition:
Example 1 : collusion when a single private data and multiple public data are considered;
Example 2: collusion when multiple private data and multiple public data are considered;
Example 3: composition when a single private data and multiple public data are considered; and
Example 4: composition when multiple private data and multiple public data are considered.
In Example 1, a private data, $S$, is correlated with two public data, $X_1$ and $X_2$. In this example, we consider political opinion as private data $S$, TV rating as public data $X_1$ and snack rating as public data $X_2$. Two privacy preserving mappings are applied on these public data to obtain two released data, $Y_1$ and $Y_2$, provided to two entities, respectively. For example, the distorted TV rating ($Y_1$) is provided to Netflix, and the distorted snack rating ($Y_2$) is provided to Kraft Foods. The privacy preserving mappings are designed in a decentralized fashion. Each of the privacy preserving mapping schemes is designed to protect $S$ from the corresponding analyst. If Netflix exchanges information ($Y_1$) with Kraft ($Y_2$), the user's private data ($S$) may be recovered more accurately than if they only depend on $Y_1$ or $Y_2$ alone. We wish to analyze the privacy guarantees when the analysts share $Y_1$ and $Y_2$. In this example, Netflix is a legitimate receiver of information about TV rating, but not snack rating, and Kraft Foods is a legitimate receiver of information about snack rating, but not TV rating. However, they may share information in order to infer more about the user's private data.
In Example 2, private data $S_1$ is correlated with public data $X_1$, and private data $S_2$ is correlated with public data $X_2$. In this example, we consider income as private data $S_1$, gender as private data $S_2$, TV rating as public data $X_1$ and snack rating as public data $X_2$. Two privacy preserving mappings are applied on these public data to obtain two released data, $Y_1$ and $Y_2$, provided to two analysts, respectively.
In Example 3, a private data, $S$, is correlated with public data $X_1$ and $X_2$ through joint probability distribution $P_{S,X_1,X_2}$. In this example, we consider political opinion as private data $S$, TV rating for Fox news as public data $X_1$ and TV rating for ABC news as public data $X_2$. An analyst, for example, Comcast, asks for both $X_1$ and $X_2$. Again, the privacy preserving mappings are designed separately and we want to analyze the privacy guarantees when the privacy agent combines her information $Y_1$ and $Y_2$ about both $S_1$ and $S_2$. In this example, Comcast is a legitimate receiver of both TV ratings for Fox news and ABC news.
In Example 4, two private data, $S_1$ and $S_2$, are correlated with public data, $X_1$ and $X_2$, through joint probability distribution $P_{S_1,S_2,X_1,X_2}$. In this example, we consider income as private data $S_1$, gender as private data $S_2$, TV rating as public data $X_1$ and snack rating as public data $X_2$.
As discussed above, multiple random variables (for example, $X_1$ and $X_2$) are involved when there is collusion or composition. However, mappings for large size $X$ (large vector with multiple variables) are more difficult to design than mappings for small size $X$ (possibly one variable, or a small vector), as the complexity of the optimization problem which provides a solution to the privacy mapping scales with the size of vector $X$.
In one embodiment, we simplify the design of the optimization problem by breaking one large optimization with many variables into several smaller optimizations with fewer variables. Both collusion and composition problems can be captured in the following setting.
Assume a private random variable $S$ is correlated with $X_1$ and $X_2$. Distorted versions of $X_1$ and $X_2$ are denoted by $Y_1$ and $Y_2$, respectively. We perform two separate privacy preserving mappings, $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$, on $X_1$ and $X_2$ to obtain $Y_1$ and $Y_2$, respectively, given distortion constraints. The individual information leakages are $I(S;Y_1)$ and $I(S;Y_2)$. Assume that $Y_1$ and $Y_2$ are combined together into a pair $(Y_1, Y_2)$, either by colluding entities, or by a privacy agent through composition. In the present principles, we address the question of how privacy guarantees combine under multiple releases, i.e., the question of obtaining the resulting cumulative information leakage when multiple released data are combined, either through composition or collusion. The rules of combination of privacy guarantees help in addressing the issue of colluding entities, who share data that is released to them individually in order to improve their inference of private data. Combination rules also help in the design of privacy preserving mapping mechanisms by allowing the joint design for multiple pieces of data to be broken into several simpler design problems for individual pieces of data. The combination of privacy preserving schemes is studied in several existing works. The focus of these works is on differential privacy in the presence of collusion or composition. However, the present principles consider privacy in the presence of collusion or composition under an information-theoretic privacy metric.
In the following, we first discuss the case where the releases are related to the same private data (e.g., Example 1 and Example 3), and then extend the analysis to the case where the releases are related to different but correlated pieces of private data (e.g., Example 2 and Example 4).
Single private data, multiple public data
Assume a private random variable $S$ is correlated with $X_1$ and $X_2$. Distorted versions of $X_1$ and $X_2$ are denoted by $Y_1$ and $Y_2$, respectively. We perform two separate privacy preserving mappings on $X_1$ and $X_2$ to obtain $Y_1$ and $Y_2$, respectively. $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$ are designed with given distortion constraints, and the individual information leakages are $I(S;Y_1)$ and $I(S;Y_2)$, respectively. Assume the two released data $Y_1$ and $Y_2$ are combined together into a pair $(Y_1, Y_2)$, either by colluding entities, or by a privacy agent through composition. We want to analyze the resulting cumulative privacy leakage $I(S;Y_1,Y_2)$ under this combination of information.
Lemma 1. Assume $Y_1$, $Y_2$, and $S$ form a Markov chain in any order. If the privacy preserving mappings leak $I(Y_1;S)$ and $I(Y_2;S)$ bits by $Y_1$ and $Y_2$, respectively, then at most $I(Y_1;S) + I(Y_2;S)$ bits of information are leaked by the pair $Y_1$ and $Y_2$. In other words, $I(Y_1,Y_2;S) \le I(Y_1;S) + I(Y_2;S)$. Moreover, if $S \to Y_1 \to Y_2$, then $I(S;Y_1,Y_2) \le I(Y_1;S)$. If $S \to Y_2 \to Y_1$, then $I(S;Y_1,Y_2) \le I(Y_2;S)$.
Proof: Note that if three random variables form a Markov chain, $A \to B \to C$, then we have $I(A;B) \ge I(A;B|C)$, $I(B;C) \ge I(B;C|A)$, and $I(A;C|B) = 0$. The proof follows from this fact. □
Lemma 1 applies regardless of how much knowledge on $P_{S,X}$ is available when the mapping is designed. The bounds in Lemma 1 hold when $P_{S,X}$ is known. They also hold if the privacy preserving mappings are designed using the method based on the separability result in Theorem 1.
Note that using $Y_1$ and $Y_2$ together might lead to full recovery of $S$. For instance, let $S$, $Y_1$, and $Y_2$ be three Bern(1/2) random variables such that $S = Y_1 \oplus Y_2$ and $Y_1 \perp Y_2$. Then, we have $I(Y_1;S) = I(Y_2;S) = 0$, whereas $I(Y_1,Y_2;S) = 1$ bit and $S$ is fully recoverable from $(Y_1, Y_2)$. Another example is when $Y_1 = S + N$, where $N$ is some noise, and $Y_2 = S - N$. We can fully recover $S$ by adding $Y_1$ and $Y_2$.
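The exclusive-or example can be checked numerically. The sketch below is illustrative only: it assumes fair (parameter 1/2) and independent bits for the two releases, and the helper names `mutual_information` and `marginal` are introduced here for the example.

```python
import math
from itertools import product


def mutual_information(pairs):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    p_a, p_b = {}, {}
    for (a, b), p in pairs.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in pairs.items() if p > 0)


def marginal(joint, pick):
    """Collapse a joint pmf over (s, y1, y2) onto the pair returned by pick."""
    out = {}
    for (s, y1, y2), p in joint.items():
        key = pick(s, y1, y2)
        out[key] = out.get(key, 0.0) + p
    return out


# Y1 and Y2 are independent fair bits and S = Y1 XOR Y2.
joint = {(y1 ^ y2, y1, y2): 0.25 for y1, y2 in product((0, 1), repeat=2)}

leak_y1 = mutual_information(marginal(joint, lambda s, y1, y2: (s, y1)))
leak_y2 = mutual_information(marginal(joint, lambda s, y1, y2: (s, y2)))
leak_pair = mutual_information(marginal(joint, lambda s, y1, y2: (s, (y1, y2))))

print(f"I(S;Y1) = {leak_y1:.3f} bits")
print(f"I(S;Y2) = {leak_y2:.3f} bits")
print(f"I(S;Y1,Y2) = {leak_pair:.3f} bits")
```

Each release alone leaks nothing, while the pair leaks the full bit. This does not contradict Lemma 1, because here S, Y1, and Y2 are pairwise independent and do not form a Markov chain in any order.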
FIG. 2 illustrates an exemplary method 200 for preserving privacy in view of collusion or composition, in accordance with an embodiment of the present principles. Method 200 starts at step 205. At step 210, it collects statistical information based on the single private data $S$ and public data $X_1$ and $X_2$. At step 220, it decides the cumulative privacy guarantee for the private data $S$ in view of collusion or composition of released data $Y_1$ and $Y_2$. That is, it decides a leakage factor $\epsilon$ for $I(S;Y_1,Y_2)$.
Following Lemma 1, the privacy preserving mappings are designed in a decentralized fashion for public data $X_1$ and $X_2$. At step 230, it determines a privacy preserving mapping $P_{Y_1|X_1}$ for public data $X_1$, given leakage factor $\epsilon_1$ for $I(S;Y_1)$. Similarly, at step 235, it determines a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon_2$ for $I(S;Y_2)$.
In one embodiment, we may set $\epsilon = \epsilon_1 + \epsilon_2$, for example, $\epsilon_1 = \epsilon_2 = \epsilon/2$. According to the privacy preserving mappings designed at steps 230 and 235,
$I(S;Y_1) \le \epsilon_1 H(S)$, $I(S;Y_2) \le \epsilon_2 H(S)$.
Using Lemma 1, we have
$I(Y_1,Y_2;S) \le I(Y_1;S) + I(Y_2;S) \le \epsilon_1 H(S) + \epsilon_2 H(S) \le \epsilon H(S)$.
At steps 240 and 245, we distort data $X_1$ and $X_2$ according to privacy preserving mappings $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$, respectively. At steps 250 and 255, the distorted data are released as $Y_1$ and $Y_2$, respectively.
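The additive bound that underlies the choice of per-release leakage factors in method 200 can be illustrated on a small case in which the Markov assumption of Lemma 1 holds. The sketch below is an assumption-laden toy example: S is a fair bit, and the two releases are noisy copies of S with independent flips, so that Y1, S, and Y2 form a Markov chain with S in the middle. The flip probabilities and helper names are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def mutual_information(pairs):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    p_a, p_b = {}, {}
    for (a, b), p in pairs.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in pairs.items() if p > 0)


def marginal(joint, pick):
    """Collapse a joint pmf over (s, y1, y2) onto the pair returned by pick."""
    out = {}
    for (s, y1, y2), p in joint.items():
        key = pick(s, y1, y2)
        out[key] = out.get(key, 0.0) + p
    return out


# Hypothetical releases: each Yi flips S independently with the given probability.
FLIP_1, FLIP_2 = 0.2, 0.3
joint = {}
for s, y1, y2 in product((0, 1), repeat=3):
    p1 = FLIP_1 if y1 != s else 1 - FLIP_1
    p2 = FLIP_2 if y2 != s else 1 - FLIP_2
    joint[(s, y1, y2)] = 0.5 * p1 * p2

h_s = 1.0  # entropy of a fair bit, in bits
leak_1 = mutual_information(marginal(joint, lambda s, y1, y2: (s, y1)))
leak_2 = mutual_information(marginal(joint, lambda s, y1, y2: (s, y2)))
leak_pair = mutual_information(marginal(joint, lambda s, y1, y2: (s, (y1, y2))))

# Smallest leakage factors met by each release, and their sum.
eps_1, eps_2 = leak_1 / h_s, leak_2 / h_s
eps = eps_1 + eps_2

print(f"I(S;Y1) = {leak_1:.4f}, I(S;Y2) = {leak_2:.4f}")
print(f"I(S;Y1,Y2) = {leak_pair:.4f}, bound (eps1 + eps2) H(S) = {eps * h_s:.4f}")
```

In this example the cumulative leakage is larger than either individual leakage but stays below the sum of the two, which is why splitting a target leakage factor between the two mappings is sufficient under the Markov assumption.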
As discussed before, collusion may occur when a legitimate receiver of released data $Y_1$ (but not $Y_2$) exchanges information about $Y_1$ with a legitimate receiver of released data $Y_2$ (but not $Y_1$). On the other hand, for composition, both released data are legitimately received by the same receiver, and composition occurs when the receiver combines information from both released data to infer more information about the user.
Next, we use the results on maximal correlation to upper bound the cumulative amount of information leaked by the pair $Y_1$ and $Y_2$.
Theorem 4. Let $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$ be designed separately, i.e., $P_{Y_1,Y_2|X_1,X_2} = P_{Y_1|X_1} P_{Y_2|X_2}$, and $\lambda = \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$. If $I(Y_1;Y_2) \ge \lambda I(X_1;X_2)$, then we have
$I(S;Y_1,Y_2) \le I(S;X_1,X_2) \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$. (19)
Proof: To prove the theorem we give the following.
Proposition 4. Let $P_{Y_1,Y_2|X_1,X_2} = P_{Y_1|X_1} P_{Y_2|X_2}$ and $\lambda = \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$. If $I(Y_1;Y_2) \ge \lambda I(X_1;X_2)$, then we have
$S^*(X_1,X_2;Y_1,Y_2) \le \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$. (20)
Moreover, if $X_1$ and $X_2$ are independent (or equivalently, $(X_1,Y_1)$ and $(X_2,Y_2)$ are independent), then we have
$S^*(X_1,X_2;Y_1,Y_2) = \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$.
First, we prove this proposition. The particular case where independence holds has been previously proved in Anantharam, and the proof for the general case follows the same lines as the proof of tensorization of $S^*(X;Y)$ by noting that $I(Y_1;Y_2) \ge \lambda I(X_1;X_2)$ is the only required inequality, as mentioned in Anantharam, to obtain the inequality (20) (see Anantharam, page 10, part C).
Back to the proof of Theorem 4: Since we have the Markov chain $S \to (X_1,X_2) \to (Y_1,Y_2)$, using Theorem 1, we obtain
$I(S;Y_1,Y_2) \le I(S;X_1,X_2)\, S^*(X_1,X_2;Y_1,Y_2)$.
Now, using Proposition 4 concludes the proof. □
Therefore, if both mappings are designed separately with small maximal correlation, then we can still bound the cumulative amount of information leaked by the pair $Y_1$ and $Y_2$.
Corollary 1. The first term in the upper bound (19), i.e., $I(X_1,X_2;S)$, can be bounded as follows:
If $X_1$, $X_2$, and $S$ form a Markov chain in any order, then $I(X_1,X_2;S) \le I(X_1;S) + I(X_2;S)$. Moreover, if $S \to X_1 \to X_2$, then $I(S;X_1,X_2) \le I(X_1;S)$. If $S \to X_2 \to X_1$, then $I(S;X_1,X_2) \le I(X_2;S)$.
Proof: The proof is similar to that of Lemma 1.
Note that $I(S;Y_1)$, $I(S;Y_2)$ and $I(S;Y_1,Y_2)$ are less than or equal to $H(S)$. If we choose
$S^*(X_1;Y_1) \le \epsilon$, $S^*(X_2;Y_2) \le \epsilon$,
we get
$I(S;Y_1,Y_2) \le I(S;X_1,X_2) \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\} \le H(S) \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\} \le \epsilon H(S)$.
FIG. 3 illustrates an exemplary method 300 for preserving privacy in view of collusion or composition, in accordance with an embodiment of the present principles. Method 300 is similar to method 200, except that $S^*(X_1;Y_1) \le \epsilon$ (330) and $S^*(X_2;Y_2) \le \epsilon$ (335). Note that method 200 works under some Markov chain assumptions stated in Lemma 1, while method 300 works more generally.
Multiple private data, multiple public data
Assume we have two private random variables $S_1$ and $S_2$, which correlate with $X_1$ and $X_2$, respectively. We distort $X_1$ and $X_2$ to obtain $Y_1$ and $Y_2$, respectively. An analyst has access to $Y_1$ and $Y_2$ and wishes to discover $(S_1,S_2)$.
Theorem 5. Let $P_{Y_1|X_1}$ and $P_{Y_2|X_2}$ be designed separately, i.e., $P_{Y_1,Y_2|X_1,X_2} = P_{Y_1|X_1} P_{Y_2|X_2}$, and $\lambda = \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}$. If $I(Y_1;Y_2) \ge \lambda I(X_1;X_2)$, then we obtain
$$I(S_1,S_2;Y_1,Y_2) \le I(S_1,S_2;X_1,X_2) \max\{S^*(X_1;Y_1), S^*(X_2;Y_2)\}. \quad (21)$$
Proof: Similar to the proof of Theorem 4. □
Therefore, the cumulative information leakage of the pair $Y_1$ and $Y_2$ is bounded by (21). In particular, if $X_1$ and $X_2$ are independent, then this bound holds.
In FIG. 2, we discuss method 200 that determines privacy preserving mappings considering a single private data and two public data in view of collusion or composition. When there are two private data, method 200 can be applied with some modifications. Specifically, at step 210, we collect statistical information based on $S_1$, $S_2$, $X_1$ and $X_2$. At step 230, we design a privacy preserving mapping $P_{Y_1|X_1}$ for public data $X_1$, given leakage factor $\epsilon_1$ for $I(S_1;Y_1)$. At step 235, we design a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon_2$ for $I(S_2;Y_2)$.
Similarly, in FIG. 3, we discuss method 300 that determines privacy preserving mappings considering a single private data and two public data in view of collusion or composition. When there are two private data, method 300 can be applied with some modifications. Specifically, at step 310, we collect statistical information based on $S_1$, $S_2$, $X_1$ and $X_2$. At step 330, we design a privacy preserving mapping $P_{Y_1|X_1}$ for public data $X_1$, given leakage factor $\epsilon$ for $I(S_1;Y_1)$. At step 335, we design a privacy preserving mapping $P_{Y_2|X_2}$ for public data $X_2$, given leakage factor $\epsilon$ for $I(S_2;Y_2)$.
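The difference between the leakage factors used in methods 200 and 300 can be sketched as a small helper. This is a hedged illustration, not the method of the present principles: the function name and the `weight` parameter are hypothetical, and the additive split for method 200 is an assumption (the text only states that separate factors $\epsilon_1$ and $\epsilon_2$ are used there), whereas method 300 uses the same $\epsilon$ for both mappings as described above.

```python
def split_leakage_budget(epsilon, method, weight=0.5):
    """Return (eps1, eps2), the leakage factors for the two mappings.

    method=200: assumed additive split, so that eps1 + eps2 equals epsilon.
    method=300: both mappings are designed with the full epsilon.
    `weight` (hypothetical) is the share of epsilon given to the first mapping.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if method == 200:
        eps1 = weight * epsilon
        return eps1, epsilon - eps1
    if method == 300:
        return epsilon, epsilon
    raise ValueError("method must be 200 or 300")


print(split_leakage_budget(0.1, method=200))
print(split_leakage_budget(0.1, method=300))
```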
In the above, we discuss two private data or two public data. The present principles can also be applied when there are more than two private or public data.
A privacy agent is an entity that provides privacy service to a user. A privacy agent may perform any of the following:
- receive from the user what data he deems private, what data he deems public, and what level of privacy he wants;
- compute the privacy preserving mapping;
- implement the privacy preserving mapping for the user (i.e., distort his data according to the mapping); and
- release the distorted data, for example, to a service provider or a data collecting agency.
The present principles can be used in a privacy agent that protects the privacy of user data. FIG. 4 depicts a block diagram of an exemplary system 400 where a privacy agent can be used. Public users 410 release their private data (S) and/or public data (X). As discussed before, public users may release public data as is, that is, Y = X. The information released by the public users becomes statistical information useful for a privacy agent.
A privacy agent 480 includes statistics collecting module 420, privacy preserving mapping decision module 430, and privacy preserving module 440.
Statistics collecting module 420 may be used to collect joint distribution $P_{S,X}$, marginal probability measure $P_X$, and/or mean and covariance of public data.
Statistics collecting module 420 may also receive statistics from data aggregators, such as bluekai.com. Depending on the available statistical information, privacy preserving mapping decision module 430 designs several privacy preserving mapping mechanisms. Privacy preserving module 440 distorts public data of private user 460 before it is released, according to the conditional probability. When the public data is multi-dimensional, for example, when $X$ includes both $X_1$ and $X_2$, the privacy preserving module may design separate privacy preserving mappings for $X_1$ and $X_2$, respectively, in view of composition. When there is collusion, each colluding entity may use system 400 to design a separate privacy preserving mapping.
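The distortion step performed by privacy preserving module 440 amounts to drawing a released value from the conditional distribution of $Y$ given $X$. The sketch below shows one possible way to do this for discrete data; the function name, the dictionary representation of the mapping, and the example probabilities are all assumptions made for illustration.

```python
import random


def distort(x, mapping, rng=random):
    """Draw a released value Y for public value X = x from P(Y | X = x)."""
    outcomes, weights = zip(*mapping[x].items())
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Hypothetical mapping for a binary public attribute: flip with probability 0.1.
P_Y_GIVEN_X = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.1, 1: 0.9},
}

rng = random.Random(0)  # seeded only so that the demonstration is repeatable
released = [distort(x, P_Y_GIVEN_X, rng) for x in (0, 1, 1, 0)]
print(released)
```

For multi-dimensional public data, the same function could be called once per component with a separately designed mapping for each, which matches the separate-design setting discussed above.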
Note that the privacy agent needs only the statistics to work; it does not need the entire data that was collected in the data collection module and from which the statistics were computed. Thus, in another embodiment, the data collection module could be a standalone module that collects data and then computes statistics, and need not be part of the privacy agent. The data collection module shares the statistics with the privacy agent.
A privacy agent sits between a user and a receiver of the user data (for example, a service provider). For example, a privacy agent may be located at a user device, for example, a computer or a set-top box (STB). In another example, a privacy agent may be a separate entity. All the modules of a privacy agent may be located at one device, or may be distributed over different devices. For example, statistics collecting module 420 may be located at a data aggregator who only releases statistics to module 430; privacy preserving mapping decision module 430 may be located at a "privacy service provider" or at the user end on the user device connected to module 420; and privacy preserving module 440 may be located at a privacy service provider, who then acts as an intermediary between the user and the service provider to whom the user would like to release data, or at the user end on the user device.
The privacy agent may provide released data to a service provider, for example, Comcast or Netflix, in order for private user 460 to improve received service based on the released data. For example, a recommendation system provides movie recommendations to a user based on the user's released movie rankings.
In FIG. 5, we show that there are multiple privacy agents in the system. In different variations, there need not be privacy agents everywhere, as this is not a requirement for the privacy system to work. For example, there could be a privacy agent only at the user device, or at the service provider, or at both. In FIG. 5, we show the same privacy agent "C" for both Netflix and Facebook. In another embodiment, the privacy agents at Facebook and Netflix can, but need not, be the same.
The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users.
Reference to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well as any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Additionally, this application or its claims may refer to "determining" various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application or its claims may refer to "accessing" various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application or its claims may refer to "receiving" various pieces of information. Receiving is, as with "accessing", intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, "receiving" is typically involved, in one way or another, during operations such as, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.

Claims

1. A method for processing user data for a user, comprising the steps of:
accessing the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data;
determining (220, 320) a first information leakage bound between the private data and a first and second released data;
determining a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first bound;
determining (230, 235, 330, 335) a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound;
modifying (240, 245, 340, 345) the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data; and
releasing (250, 255, 350, 355) the modified first and second public data to at least one of a service provider and a data collecting agency.
2. The method of claim 1, wherein a combination of the second bound and the third bound substantially corresponds to the first bound.
3. The method of claim 1, wherein each of the second bound and the third bound substantially equals the first bound.
4. The method of claim 1, wherein the releasing step releases the modified first public data to a first receiver and releases the modified second public data to a second receiver, wherein the first and second receivers are configured to exchange information about the modified first and second public data.
5. The method of claim 1, wherein the releasing step releases the modified first and second public data to a same receiver.
6. The method of claim 1, further comprising the step of:
determining whether collusion or composition occurs at the at least one of a service provider and a data collecting agency.
7. The method of claim 1, wherein the steps of determining the first and second privacy preserving mappings are based on maximal correlation techniques.
8. The method of claim 1, wherein the private data includes a first private data and a second private data, wherein the step of determining a second information leakage bound determines the second bound between the first private data and the first public data and the third bound between the second private data and the second public data.
9. An apparatus for processing user data for a user, comprising:
a processor configured to access the user data, which includes private data, a first public data and a second public data, the first public data corresponding to a first category of data, and the second public data corresponding to a second category of data;
a privacy preserving mapping decision module (430) configured to
determine a first information leakage bound between the private data and a first and second released data,
determine a second information leakage bound between the private data and the first released data, and a third information leakage bound between the private data and the second released data, responsive to the first bound, and
determine a first privacy preserving mapping that maps the first category of data to the first released data responsive to the second bound and a second privacy preserving mapping that maps the second category of data to the second released data responsive to the third bound; and
a privacy preserving module (440) configured to
modify the first and second public data for the user, based on the first and second privacy preserving mappings respectively, to form the first and second released data, and
release the modified first and second public data to at least one of a service provider and a data collecting agency.
10. The apparatus of claim 9, wherein a combination of the second bound and the third bound substantially corresponds to the first bound.
11. The apparatus of claim 9, wherein each of the second bound and the third bound substantially equals the first bound.
12. The apparatus of claim 9, wherein the privacy preserving module (440) releases the modified first public data to a first receiver and releases the modified second public data to a second receiver, wherein the first and second receivers are configured to exchange information about the modified first and second public data.
13. The apparatus of claim 9, wherein the privacy preserving module (440) releases the modified first and second public data to a same receiver.
14. The apparatus of claim 9, wherein the privacy preserving mapping decision module (430) is further configured to determine whether collusion or composition occurs at the at least one of a service provider and a data collecting agency.
15. The apparatus of claim 9, wherein privacy preserving mapping decision module (430) determines the first and second privacy preserving mappings based on maximal correlation techniques.
16. The apparatus of claim 9, wherein the private data includes a first private data and a second private data, and wherein the privacy preserving mapping decision module (430) determines the second information leakage bound between the first private data and the first public data and the third information leakage bound between the second private data and the second public data.
17. A computer readable storage medium having stored thereon instructions for processing user data for a user, according to claims 1-8.
EP13812233.8A 2013-08-19 2013-11-21 Method and apparatus for utility-aware privacy preserving mapping in view of collusion and composition Withdrawn EP3036678A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201361867544P 2013-08-19 2013-08-19
PCT/US2013/071287 WO2015026385A1 (en) 2013-08-19 2013-11-21 Method and apparatus for utility-aware privacy preserving mapping in view of collusion and composition

Publications (1)

Publication Number Publication Date
EP3036678A1 (en) 2016-06-29

Family

ID=49880941

Family Applications (1)

Application Number Title Priority Date Filing Date
EP13812233.8A Withdrawn EP3036678A1 (en) 2013-08-19 2013-11-21 Method and apparatus for utility-aware privacy preserving mapping in view of collusion and composition

Country Status (5)

Country Link
EP (1) EP3036678A1 (en)
JP (1) JP2016535898A (en)
KR (1) KR20160044485A (en)
CN (1) CN105612529A (en)
WO (1) WO2015026385A1 (en)

Also Published As

Publication number Publication date
WO2015026385A1 (en) 2015-02-26
KR20160044485A (en) 2016-04-25
CN105612529A (en) 2016-05-25
JP2016535898A (en) 2016-11-17

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20160315

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20190426