WO2015175141A1 - Method, apparatus and system for preserving privacy during media consumption and recommendation - Google Patents

Method, apparatus and system for preserving privacy during media consumption and recommendation

Info

Publication number
WO2015175141A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
data
privacy
willing
make public
Application number
PCT/US2015/026070
Other languages
French (fr)
Other versions
WO2015175141A9 (en)
Inventor
Subrahmanya Sandilya BHAMIDIPATI
Nadia FAWAZ
Branislav Kveton
Amy Zhang
Original Assignee
Thomson Licensing
Application filed by Thomson Licensing
Publication of WO2015175141A1
Publication of WO2015175141A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Definitions

  • the present principles relate to statistical inference and privacy-preserving techniques. More particularly, the present principles relate to preserving privacy of user information against inference while retaining the effectiveness of content consumption and personalized recommendations.
  • the present principles propose a method, apparatus and system for preserving user privacy while enabling content consumption and preserving the quality of personalized recommendations.
  • a method for preserving user privacy includes applying a quantization to data at least one user is willing to make public to control a number of optimization variables, estimating a distribution that links the quantized data to data at least one user considers private and applying convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
  • an apparatus for preserving user privacy includes a memory for storing at least one of program routines, content and data and a processor for executing the program routines.
  • the apparatus is configured to apply a quantization to data at least one user is willing to make public to control a number of optimization variables, estimate a distribution that links the quantized data to data at least one user considers private and apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
  • a system for preserving user privacy includes a content source, a user interface and an apparatus in communication with the content source and the user interface.
  • the apparatus of the system includes a memory for storing at least one of program routines, content and data and a processor for executing the program routines, wherein the apparatus is configured to apply a quantization to data at least one user is willing to make public to control a number of optimization variables, estimate a distribution that links the quantized data to data at least one user considers private and apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
  • FIG. 1a depicts a high level block diagram of a system for preserving user privacy while enabling content consumption and recommendation in accordance with an embodiment of the present principles
  • FIG. 1b depicts a functional diagram of the system of FIG. 1a in accordance with an embodiment of the present principles
  • FIG. 2 depicts a high level block diagram of a server suitable for implementation in the system of FIG. 1 in accordance with an embodiment of the present principles
  • FIG. 3 depicts a high level illustrative depiction of a privacy dashboard usable in the user system of FIG. 1 in accordance with an embodiment of the present principles
  • FIG. 4 depicts a high level illustrative depiction of a program display page usable in the user system of FIG. 1 in accordance with an embodiment of the present principles
  • FIG. 5 depicts a high level illustrative depiction of a program history display page including ratings usable in the user system of FIG. 1 in accordance with an embodiment of the present principles
  • FIG. 6 depicts an illustrative graph of the privacy-utility tradeoff in accordance with an embodiment of the present principles
  • FIG. 7 depicts an illustrative receiver operating characteristic (ROC) curve showing the performance of, for example, a logistic regression classifier attempting to infer a user's political views in accordance with an embodiment of the present principles;
  • FIG. 8a depicts the six (6) top TV show recommendations based on actual ratings given to programming by a user in accordance with an embodiment of the present principles
  • FIG. 8b depicts the six (6) top TV show recommendations based on ratings distorted for privacy in accordance with an embodiment of the present principles.
  • FIG. 9 depicts a flow diagram of a method for preserving user privacy in accordance with an embodiment of the present principles.
  • Embodiments of the present principles advantageously provide a method, apparatus and system for preserving user privacy while enabling content consumption and content recommendation.
  • Although the present principles will be described primarily within the context of video content consumption, such as TV programming, and program recommendations and program ratings, the specific embodiments of the present principles should not be treated as limiting the scope of the invention. It will be appreciated by those skilled in the art and informed by the teachings of the present principles that the concepts of the present principles can be advantageously applied to any content including audio, video and any combination thereof and can be interfaced with online video services, as well as TV and VoD services and can be implemented with publicly released data other than program ratings.
  • the present principles can also be extended to other media content, such as music, books, news, and to other products, services, or locations rated online by users.
  • the concepts of the present principles can also be adapted to protect privacy in the context of social networks. For example, users can be informed of the privacy risks of actions such as likes, connecting to friends, etc., in embodiments, prior to taking those actions, and users can be provided means to control these risks.
  • data distortion could for example amount to simply avoiding taking some actions, or avoiding the release of some data.
  • any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function.
  • the present principles as defined by such claims reside in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
  • explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and can implicitly include, without limitation, digital signal processor ("DSP") hardware, read-only memory ("ROM") for storing software, random access memory ("RAM"), and non-volatile storage.
  • Embodiments of the present principles provide an interactive privacy-preserving personalized video consumption system, which protects a user's privacy while enabling the delivery of relevant content recommendations to the user.
  • Embodiments of the present principles implement an information theoretic framework to design a utility-aware privacy-preserving mapping that perturbs a user's video ratings to prevent inference of user attributes that can be considered by the user as private, such as political affiliation, age, gender, while maintaining the utility of the released perturbed ratings for recommendation.
  • a method, apparatus and system of the present principles implement convex optimization to learn a probability mapping from actual ratings to perturbed ratings that minimizes distortion subject to a privacy constraint.
  • a quantization step is implemented to enable the control of the number of optimization variables.
  • FIG. la depicts a high level block diagram of a system 100 for preserving user privacy while enabling content consumption and recommendation in accordance with an embodiment of the present principles.
  • the system 100 of FIG. la illustratively comprises a user client 105, a privacy server 110, and a recommendation server 115.
  • the user client 105 can comprise a web interface.
  • such web interface can be written in HTML5 and JavaScript.
  • the privacy server 110 and the recommendation server 115 can comprise Flask interfaces; Flask is a Python-based micro web framework.
  • the privacy server 110, and the recommendation server 115 comprise a privacy agent 125.
  • Although the user client 105, the privacy server 110, and the recommendation server 115 are depicted as comprising separate components, in alternate embodiments of the present principles, the functionality of each of the components can be integrated into a single component or any other combination of components.
  • the components of FIG. la can, in one embodiment, comprise at least a portion of a content recommendation system.
  • FIG. lb depicts a functional diagram of the system of FIG. la in accordance with an embodiment of the present principles. More specifically, in FIG. lb, data from at least one user which includes personal user attributes and data a user might consider public, such as program ratings, is communicated to the privacy agent. The privacy agent performs quantization of the data to quantize B into C, which will be described in greater detail below. In the embodiment of FIG. lb, an estimation of a distribution that links the quantized data to data at least one user considers private, such as user attributes including age, gender and political preference, is performed. Again, such process is described in greater detail below.
  • the estimated distribution is then implemented to determine and design a privacy mapping of data a user is willing to make public to a distorted version of such data, such that inference of user attributes that might be considered private from data that a user is willing to make public becomes much more difficult or specifically cannot outperform an uninformed random guess.
  • a privacy mapping is determined as a result of an Algorithm described in greater detail below. As such, and as depicted in FIG. 1a,
  • user data a user is willing to make public can be manipulated by the mapping of the present principles to create a distorted version of such data to be communicated to, for example a service provider/content source, to be used by the service provider/content source for providing, for example, personalized content recommendations for the user, without being able to determine from the distorted version of the data, personal attributes of the user, which the user has indicated the user wants to remain private.
  • such information is also communicated back to the user to be presented to the user via, for example, a user interface to inform the user of privacy risks associated with the data released by the user (described in greater detail below).
  • the user can further use the user interface to communicate data to a system of the present principles; the data including at least which attributes a user wants to keep private and data a user is willing to publicly release, for example, program ratings.
  • the user client 105 enables a user to interact with available privacy settings, provide ratings for consumed content, such as TV shows, and displays recommendations based on the user's privacy settings and privatized ratings.
  • the privacy server 110 and the recommendation server 115 serve client requests (web pages) and store and retrieve data from databases, such as user and privacy mapping data for the privacy server and content and data for the recommendation server 115.
  • the privacy server 110 further performs rating privatization based on the user's privacy settings and sends privatized ratings to the recommendation server 115 and to the user client 105, which will be described in further detail below.
  • the recommendation server 115 generates recommendations based on the user's privatized ratings, and communicates such recommendations to the user client 105.
  • the system 100 of FIG. 1 can further comprise at least one database 120.
  • the database can comprise one or more MongoDB databases.
  • the database(s) 120 can store user privacy settings and user interactions with the content (e.g. ratings) and data related to privacy mapping. Such database(s) 120 are accessed by the privacy server 110 to gain access to such data.
  • the database(s) 120 can further store content metadata used to display on the web interface at the client side and as such can be accessed by the client server 105.
  • the database(s) 120 can further store content profiles for recommendation purposes and can be accessed by the recommendation server 115.
  • Although the database is illustratively a separate component, in alternate embodiments of the present principles, a database of the present principles can be an integrated component of at least one of the privacy server 110, the recommendation server 115 and the user client.
  • FIG. 2 depicts a high level block diagram of a server 200 such as the privacy server 110 and/or the recommendation server 115 suitable for implementation in the system of FIG. 1 in accordance with an embodiment of the present principles.
  • the server 200 of FIG. 2 comprises a processor 210 as well as a memory 220 for storing control programs, instructions, software, video content, data, user ratings and the like.
  • the processor 210 cooperates with conventional support circuitry 230 such as power supplies, clock circuits, cache memory and the like as well as circuits that assist in executing the software routines stored in the memory 220.
  • some of the process steps discussed herein as software processes can be implemented within hardware, for example, as circuitry that cooperates with the processor 210 to perform various steps.
  • the recommendation server 115 also contains input-output circuitry 240 that forms an interface between the various respective functional elements communicating with the server 200.
  • the memory 220 can be a hard disk storage device, a static RAM, a DRAM, ROM, etc., or combinations of the same.
  • Although the server 200 of FIG. 2 is depicted as a general purpose computer that is programmed to perform various control functions in accordance with the present invention,
  • the invention can be implemented in hardware, for example, as an application specific integrated circuit (ASIC).
  • the process steps described herein are intended to be broadly interpreted as being equivalently performed by software, hardware, or a combination thereof.
  • the user client 105 can include a privacy dashboard which displays the privacy settings of the user and a privacy monitor.
  • FIG. 3 depicts a high level illustrative depiction of a privacy dashboard 300 usable in the user client 105 of the system 100 of FIG. 1 in accordance with an embodiment of the present principles.
  • the privacy settings of the privacy dashboard 300 in accordance with the illustrative embodiment of FIG. 3 enable a user to select any combination of attributes that a user deems private and wishes to protect.
  • the privacy dashboard 300 illustratively includes three attributes (age, gender, and political views) from which a user can select and indicate that the user wishes to keep private and protect in accordance with the present principles. It should be noted that in accordance with the present principles, it is not required for the user to reveal what his political view, age, or gender are, but only whether the user considers any of these features as sensitive information that the user wants to remain private.
  • the privacy monitor 305 displays the inference threat for each private attribute from the actual TV show ratings entered by the user, and from the distorted privacy-preserving ratings.
  • the privacy monitor 305 enables the user to compare the privacy risk he would face without activating the privacy protection of the present principles with the risk that remains after his ratings are sanitized using the privacy-preserving mechanism of the present principles.
  • A denotes the vector of personal attributes that the user wants to keep private
  • B denotes the vector of data the user is willing to make public.
  • the user private attributes, A, are linked to B by the joint probability distribution p_{A,B}.
  • B̂ is generated according to a conditional probabilistic mapping p_{B̂|B}.
  • the privacy-preserving mapping, p_{B̂|B}, is designed in such a way that it renders any statistical inference of the A data based on the observation of B̂ harder, yet, at the same time, preserves some utility to the released data B̂, by limiting the distortion generated by the mapping.
  • the privacy-preserving mapping is designed to control the privacy leakage, modeled as the mutual information I(A; B̂) between the private attributes, A, and the publicly released data B̂, subject to a utility requirement, modeled by a constraint on the average distortion E[d(B, B̂)].
  • the information theoretic privacy metric and the differential privacy metric are equivalent with respect to the private data, A.
  • a privacy risk metric on a scale [0,100] is implemented.
  • the privacy risk can be defined according to equation one (1), which follows:
  • the mutual information between the private data A and the distorted data B̂ can be defined as follows:
  • FIG. 4 depicts a high level illustrative depiction of a program display page usable in the user system 100 of FIG. 1 in accordance with an embodiment of the present principles.
  • a privacy risk monitor 405 reminds the user of the user's privacy risk based on available history of the user's previous program ratings.
  • the privacy risk monitor 405 dynamically updates the risk numbers to inform the user of how the privacy risk would evolve if the user selected a particular rating.
  • the privacy risk monitor 405 displays the updated risk based on actual ratings, before sanitization in accordance with the present principles.
  • the rating value is sanitized in accordance with the present principles.
  • a user is then able to use the privacy dashboard 300 to check that the privacy risk after the sanitization of the rating value has decreased and, in the best case scenario, is equal to zero (0) for the attributes the user selected as private.
  • FIG. 5 depicts a high level illustrative depiction of a program history display page 500 including ratings usable in the user system 100 of FIG. 1 in accordance with an embodiment of the present principles.
  • the program history display page 500 of FIG. 5 displays the user's actual ratings and the distorted ratings on a display of a respective program.
  • the privacy-preserving mapping requires characterizing the value of p_{B̂|B}(b̂|b) for all possible pairs (b, b̂), i.e. solving the convex optimization problem over |B| · |B̂| variables, where |B| and |B̂| denote the sizes of the alphabets of B and B̂.
  • Quantization can be used to reduce the number of optimization variables from |B| · |B̂| to K², where K denotes the number of quantization levels.
  • the rating vector B is completed into B_c using low rank matrix factorization, a standard collaborative filtering technique.
  • the completed rating vector B_c is then input into a quantization process that maps B_c to a cluster center, C.
  • the cluster center, C, is then fed to a privacy optimization algorithm that finally outputs a distorted rating vector B̂.
  • the following algorithm can be used to describe the quantized privacy-preserving mapping of the present principles:
  • computing the privacy Risk(A, b), as well as finding the privacy-preserving mapping as a solution to the privacy convex optimization, rely on the fundamental assumption that the prior distribution p_{A,B} that links private attributes A and data B is known and can be used as an input to the algorithm.
  • the true distribution may not be known, but can rather be estimated from a set of sample data that can be observed, for example from a set of users who do not have privacy concerns and publicly release both their attributes A and their original data B.
  • such a dataset may contain a small number of samples or be incomplete, which makes the estimation of the prior distribution challenging.
  • FIG. 6 depicts an illustrative graph of the privacy-utility tradeoff in accordance with an embodiment of the present principles. The graph of FIG. 6 depicts the mutual information I(A; B̂) against the end-to-end distortion (quantization + privacy mapping) per rating.
  • the focus is on perfect privacy, i.e. I(A; B̂) = 0.
  • FIG. 7 depicts the privacy risk of inferring the political views from the original rating vectors, and the second curve from the top shows that merely binarizing the ratings is not enough to ensure privacy.
  • FIG. 8 depicts a high level illustrative depiction of a recommendations page in accordance with an embodiment of the present principles.
  • FIG. 8a depicts the six (6) top TV show recommendations based on actual ratings given to programming by a user
  • FIG. 8b depicts the six (6) top TV show recommendations based on ratings distorted for privacy in accordance with the present principles.
  • low rank matrix factorization (MF) is used to predict missing ratings for shows not rated by the user from the ratings the user provided.
  • the MF recommender engine (not shown) was trained by alternating regularized least squares.
  • the recommender engine can be a component/function of the recommendation server 115 of FIG. 1. As depicted in FIG. 8, there is an overlap of 4 out of 6 recommendations without and with privacy, which illustrates that the privacy-preserving process of the present principles maintains utility while protecting user privacy.
  • Table I Rating prediction RMSE [0041] More precisely, in each test set, 10% of the ratings were removed and attempted to be predicted. Table I above depicts the RMSE in rating prediction based on actual ratings and on distorted ratings. In Table I above, r denotes predicted ratings based on the actual ratings provided by users for other shows, while r̂ denotes predicted ratings based on the ratings distorted for privacy. The prediction RMSE for r (RMSE1, privacy not activated) and for r̂ (RMSE2, privacy activated) are calculated on the 10% of ratings that were removed. Table I shows that the RMSE for rating prediction does not degrade much when privacy protection is activated with respect to rating prediction without privacy.
  • FIG. 9 depicts a flow diagram of a method for preserving user privacy in accordance with an embodiment of the present principles.
  • the method 900 begins at step 902 in which quantization is used on data at least one user may consider as private and data a user is willing to make public previously provided by at least one user to reduce the number of optimization variables.
  • quantization can be used to reduce the number of optimization variables, from |B| · |B̂| to K², where K denotes the number of quantization levels.
  • the quantization step can further include completing the rating vector, B, into B_c using low rank matrix factorization, a standard collaborative filtering technique.
  • the completed rating vector B_c is then input into a quantization process that maps B_c to a cluster center, C.
  • the method 900 can then proceed to step 904.
  • a distribution that links the data the at least one user may consider as private and the quantized value of the data the user is willing to make public is estimated.
  • the method 900 can then proceed to step 908.
  • convex optimization is applied to the distribution determined in step 904 to determine a mapping from actual data a user is willing to make public to a distorted version of that data.
  • convex optimization is applied to the estimated distribution to determine a mapping from actual user program ratings to perturbed program ratings that minimizes distortion subject to a privacy constraint.
  • Such mapping can then be applied to user data that a user is willing to make public to distort that data such that attributes that a user wishes to remain private cannot be determined from the distorted version of the public data distorted in accordance with embodiments of the present principles.
  • the design of the privacy mapping can be accomplished using Algorithm 1, presented above. The method 900 can then be exited.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Storage Device Security (AREA)

Abstract

The present principles provide an interactive privacy-preserving method, apparatus and system for, for example, content consumption and content recommendation processes and systems, that provides a user with privacy, transparency and control of personal attributes, while maintaining the quality of personalized recommendations the user receives. A user is informed about the risk of releasing data related to, for example, media preferences with respect to attributes the user considers private (e.g., political views, age, gender).

Description

METHOD, APPARATUS AND SYSTEM FOR PRESERVING PRIVACY DURING MEDIA CONSUMPTION AND RECOMMENDATION
TECHNICAL FIELD
[0001] The present principles relate to statistical inference and privacy-preserving techniques. More particularly, the present principles relate to preserving privacy of user information against inference while retaining the effectiveness of content consumption and personalized recommendations.
BACKGROUND
[0002] With the advent of targeted advertising and the popularity of mining user data, users of content consumption systems are finding their privacy threatened. To address this rising concern, some privacy-preserving mechanisms have been proposed, such as the mechanism described in B. Fung and K. Wang and R. Chen and P. Yu, "Privacy Preserving Data Publishing: A Survey of Recent Developments," ACM Computing Surveys, 2010.
[0003] Such mechanisms provide strong theoretical guarantees, but often lack practicality. For instance, reaching a sufficiently high level of privacy often requires that the user data be distorted to the point where it is not usable. What is needed is an interactive privacy system, which can implement information-theoretic privacy to provide practical policies for protecting user profiles, while maintaining the utility of sanitized user data.
SUMMARY
[0004] The present principles propose a method, apparatus and system for preserving user privacy while enabling content consumption and preserving the quality of personalized recommendations.
[0005] In one embodiment of the present principles a method for preserving user privacy includes applying a quantization to data at least one user is willing to make public to control a number of optimization variables, estimating a distribution that links the quantized data to data at least one user considers private and applying convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
[0006] In an alternate embodiment of the present principles, an apparatus for preserving user privacy includes a memory for storing at least one of program routines, content and data and a processor for executing the program routines. In one embodiment, the apparatus is configured to apply a quantization to data at least one user is willing to make public to control a number of optimization variables, estimate a distribution that links the quantized data to data at least one user considers private and apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
[0007] In an alternate embodiment of the present principles, a system for preserving user privacy includes a content source, a user interface and an apparatus in communication with the content source and the user interface. The apparatus of the system includes a memory for storing at least one of program routines, content and data and a processor for executing the program routines, wherein the apparatus is configured to apply a quantization to data at least one user is willing to make public to control a number of optimization variables, estimate a distribution that links the quantized data to data at least one user considers private and apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
[0008] These and other aspects, features and advantages of the present principles will become apparent from the following detailed description of exemplary embodiments, which is to be read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0009] The teachings of the present principles can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
FIG. 1a depicts a high level block diagram of a system for preserving user privacy while enabling content consumption and recommendation in accordance with an embodiment of the present principles;
FIG. 1b depicts a functional diagram of the system of FIG. 1a in accordance with an embodiment of the present principles;
FIG. 2 depicts a high level block diagram of a server suitable for implementation in the system of FIG. 1 in accordance with an embodiment of the present principles;
FIG. 3 depicts a high level illustrative depiction of a privacy dashboard usable in the user system of FIG. 1 in accordance with an embodiment of the present principles;
FIG. 4 depicts a high level illustrative depiction of a program display page usable in the user system of FIG. 1 in accordance with an embodiment of the present principles;
FIG. 5 depicts a high level illustrative depiction of a program history display page including ratings usable in the user system of FIG. 1 in accordance with an embodiment of the present principles;
FIG. 6 depicts an illustrative graph of the privacy-utility tradeoff in accordance with an embodiment of the present principles;
FIG. 7 depicts an illustrative receiver operating characteristic (ROC) curve showing the performance of, for example, a logistic regression classifier attempting to infer a user's political views in accordance with an embodiment of the present principles;
FIG. 8a depicts the six (6) top TV show recommendations based on actual ratings given to programming by a user in accordance with an embodiment of the present principles;
FIG. 8b depicts the six (6) top TV show recommendations based on ratings distorted for privacy in accordance with an embodiment of the present principles; and
FIG. 9 depicts a flow diagram of a method for preserving user privacy in accordance with an embodiment of the present principles.
It should be understood that the drawing(s) are for purposes of illustrating the concepts of the various described principles and are not necessarily the only possible configuration for illustrating the principles.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION
[0010] Embodiments of the present principles advantageously provide a method, apparatus and system for preserving user privacy while enabling content consumption and content recommendation. Although the present principles will be described primarily within the context of video content consumption, such as TV programming, and program recommendations and program ratings, the specific embodiments of the present principles should not be treated as limiting the scope of the invention. It will be appreciated by those skilled in the art and informed by the teachings of the present principles that the concepts of the present principles can be advantageously applied to any content including audio, video and any combination thereof and can be interfaced with online video services, as well as TV and VoD services and can be implemented with publicly released data other than program ratings. The present principles can also be extended to other media content, such as music, books, news, and to other products, services, or locations rated online by users. The concepts of the present principles can also be adapted to protect privacy in the context of social networks. For example, users can be informed of the privacy risks of actions such as likes, connecting to friends, etc., in embodiments, prior to taking those actions, and users can be provided means to control these risks. In such a context, data distortion could for example amount to simply avoiding taking some actions, or avoiding the release of some data.
[0011] In the claims hereof, any element expressed as a means for performing a specified function is intended to encompass any way of performing that function including, for example, a) a combination of circuit elements that performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function. The present principles as defined by such claims reside in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for. It is thus regarded that any means that can provide those functionalities are equivalent to those shown herein.
[0012] Moreover, all statements herein reciting principles, aspects, and embodiments of the present principles, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
[0013] The functions of the various elements shown in the figures can be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions can be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which can be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and can implicitly include, without limitation, digital signal processor ("DSP") hardware, read-only memory ("ROM") for storing software, random access memory ("RAM"), and non-volatile storage. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).
[0014] Thus, for example, it will be appreciated by those skilled in the art that the block diagrams presented herein represent conceptual views of illustrative system components and/or circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudocode, and the like represent various processes which may be substantially represented in computer readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
[0015] Furthermore, because some of the constituent system components and methods depicted in the accompanying drawings can be implemented in software, the actual connections between the system components or the process function blocks may differ depending upon the manner in which the present principles are programmed. Given the teachings herein, one of ordinary skill in the pertinent art will be able to contemplate these and similar implementations or configurations of the present principles.
[0016] Embodiments of the present principles provide an interactive privacy-preserving personalized video consumption system, which protects a user's privacy while enabling the delivery of relevant content recommendations to the user. Embodiments of the present principles implement an information theoretic framework to design a utility-aware privacy-preserving mapping that perturbs a user's video ratings to prevent inference of user attributes that can be considered by the user as private, such as political affiliation, age, gender, while maintaining the utility of the released perturbed ratings for recommendation.
[0017] In one embodiment, a method, apparatus and system of the present principles implement convex optimization to learn a probability mapping from actual ratings to perturbed ratings that minimizes distortion subject to a privacy constraint. In various embodiments, to reduce an optimization size, a quantization step is implemented to enable the control of the number of optimization variables.
[0018] FIG. 1a depicts a high level block diagram of a system 100 for preserving user privacy while enabling content consumption and recommendation in accordance with an embodiment of the present principles. The system 100 of FIG. 1a illustratively comprises a user client 105, a privacy server 110, and a recommendation server 115. In one embodiment of the system 100 of FIG. 1a, the user client 105 can comprise a web interface. In various embodiments of the present principles such web interface can be written in HTML5 and JavaScript. Similarly, in an embodiment of the system 100 of FIG. 1a, the privacy server 110 and the recommendation server 115 can comprise Flask interfaces; Flask is a Python-based micro web framework. In the embodiment of FIG. 1a, the privacy server 110 and the recommendation server 115 comprise a privacy agent 125. Although in the illustrative embodiment of FIG. 1a the user client 105, the privacy server 110, and the recommendation server 115 are depicted as comprising separate components, in alternate embodiments of the present principles, the functionality of each of the components can be integrated into a single component or any other combination of components. The components of FIG. 1a can, in one embodiment, comprise at least a portion of a content recommendation system.
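By way of illustration only, a privacy server of the kind described above could expose its rating-privatization function through a small Flask route such as the one sketched below. The route name, the payload fields and the sanitize() stub are assumptions made for this sketch; they are not taken from the described embodiment.

# Hypothetical sketch of the privacy server's Flask interface described above.
from flask import Flask, jsonify, request

app = Flask(__name__)

def sanitize(ratings, private_attributes):
    """Stand-in for the privacy agent's mapping from actual to distorted ratings;
    the actual privacy-preserving optimization is described later (Algorithm 1)."""
    return ratings  # placeholder: a real agent would return distorted ratings

@app.route("/ratings", methods=["POST"])
def privatize_ratings():
    """Accept raw ratings plus privacy settings and return privatized ratings
    that can be forwarded to the recommendation server and the user client."""
    payload = request.get_json()
    distorted = sanitize(payload["ratings"], payload["private_attributes"])
    return jsonify({"distorted_ratings": distorted})

if __name__ == "__main__":
    app.run(port=5000)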
[0019] FIG. 1b depicts a functional diagram of the system of FIG. 1a in accordance with an embodiment of the present principles. More specifically, in FIG. 1b, data from at least one user which includes personal user attributes and data a user might consider public, such as program ratings, is communicated to the privacy agent. The privacy agent performs quantization of the data to quantize B into C, which will be described in greater detail below. In the embodiment of FIG. 1b, an estimation of a distribution that links the quantized data to data at least one user considers private, such as user attributes including age, gender and political preference, is performed. Again, such process is described in greater detail below. The estimated distribution is then implemented to determine and design a privacy mapping of data a user is willing to make public to a distorted version of such data, such that inference of user attributes that might be considered private from data that a user is willing to make public becomes much more difficult or specifically cannot outperform an uninformed random guess. In the embodiment of FIG. 1b, such privacy mapping is determined as a result of an Algorithm described in greater detail below. As such, and as depicted in FIG. 1a, user data a user is willing to make public, such as program ratings, can be manipulated by the mapping of the present principles to create a distorted version of such data to be communicated to, for example, a service provider/content source, to be used by the service provider/content source for providing, for example, personalized content recommendations for the user, without being able to determine from the distorted version of the data, personal attributes of the user, which the user has indicated the user wants to remain private. In the embodiment of FIG. 1b, such information is also communicated back to the user to be presented to the user via, for example, a user interface to inform the user of privacy risks associated with the data released by the user (described in greater detail below). The user can further use the user interface to communicate data to a system of the present principles; the data including at least which attributes a user wants to keep private and data a user is willing to publicly release, for example, program ratings.
[0020] In the system 100 of FIG. 1 (FIG. 1a and FIG. 1b), the user client 105 enables a user to interact with available privacy settings, provide ratings for consumed content, such as TV shows, and displays recommendations based on the user's privacy settings and privatized ratings. In the system 100 of FIG. 1, the privacy server 110 and the recommendation server 115 serve client requests (web pages) and store and retrieve data from databases, such as user and privacy mapping data for the privacy server and content and data for the recommendation server 115. The privacy server 110 further performs rating privatization based on the user's privacy settings and sends privatized ratings to the recommendation server 115 and to the user client 105, which will be described in further detail below. The recommendation server 115 generates recommendations based on the user's privatized ratings, and communicates such recommendations to the user client 105.
[0021] The system 100 of FIG. 1 can further comprise at least one database 120. In one embodiment of the present principles the database can comprise one or more MongoDB databases. The database(s) 120 can store user privacy settings and user interactions with the content (e.g. ratings) and data related to privacy mapping. Such database(s) 120 are accessed by the privacy server 110 to gain access to such data. The database(s) 120 can further store content metadata used to display on the web interface at the client side and as such can be accessed by the user client 105. The database(s) 120 can further store content profiles for recommendation purposes and can be accessed by the recommendation server 115. Although in FIG. 1a the database is illustratively a separate component, in alternate embodiments of the present principles, a database of the present principles can be an integrated component of at least one of the privacy server 110, the recommendation server 115 and the user client.
[0022] FIG. 2 depicts a high level block diagram of a server 200 such as the privacy server 110 and/or the recommendation server 115 suitable for implementation in the system of FIG. 1 in accordance with an embodiment of the present principles. The server 200 of FIG. 2 comprises a processor 210 as well as a memory 220 for storing control programs, instructions, software, video content, data, user ratings and the like. The processor 210 cooperates with conventional support circuitry 230 such as power supplies, clock circuits, cache memory and the like as well as circuits that assist in executing the software routines stored in the memory 220. As such, it is contemplated that some of the process steps discussed herein as software processes can be implemented within hardware, for example, as circuitry that cooperates with the processor 210 to perform various steps. The recommendation server 115 also contains input-output circuitry 240 that forms an interface between the various respective functional elements communicating with the server 200. As noted throughout this disclosure, the memory 220 can be a hard disk storage device, a static RAM, a DRAM, ROM, etc., or combinations of the same.
[0023] Although the server 200 of FIG. 2 is depicted as a general purpose computer that is programmed to perform various control functions in accordance with the present invention, the invention can be implemented in hardware, for example, as an application specific integrated circuit (ASIC). As such, the process steps described herein are intended to be broadly interpreted as being equivalently performed by software, hardware, or a combination thereof.
[0024] In the system 100 of FIG. 1, the user client 105 can include a privacy dashboard which displays the privacy settings of the user and a privacy monitor. FIG. 3 depicts a high level illustrative depiction of a privacy dashboard 300 usable in the user client 105 of the system 100 of FIG. 1 in accordance with an embodiment of the present principles. The privacy settings of the privacy dashboard 300 in accordance with the illustrative embodiment of FIG. 3 enable a user to select any combination of attributes that a user deems private and wishes to protect. For example, in the embodiment of FIG. 3, the privacy dashboard 300 illustratively includes three attributes (age, gender, and political views) from which a user can select and indicate that the user wishes to keep private and protect in accordance with the present principles. It should be noted that in accordance with the present principles, it is not required for the user to reveal what his political view, age, or gender are, but only whether the user considers any of these features as sensitive information that the user wants to remain private.
[0025] In the embodiment of the privacy dashboard 300 of FIG. 3, the privacy monitor 305 displays the inference threat for each private attribute from the actual TV show ratings entered by the user, and from the distorted privacy-preserving ratings. Thus, the privacy monitor 305 enables the user to compare the privacy risk he would face without activating the privacy protection of the present principles with the risk that remains after his ratings are sanitized using the privacy-preserving mechanism of the present principles.
[0026] In considering privacy, it is considered that a user has two types of data: some data that the user would like to remain private such as political views, age, gender, and some data that the user is willing to release publicly and from which the user will derive some utility, such as the release of media preferences (TV show ratings) to a service provider. The release of such data, for example, enables the user to receive content recommendations. In one embodiment, A denotes the vector of personal attributes that the user wants to keep private, and B denotes the vector of data the user is willing to make public. The user private attributes, A, are linked to B by the joint probability distribution p_{A,B}. Thus, a party interested in determining personal information about the user, such as a service provider or a third party with whom the user may exchange data, can observe the B data and infer some information about the personal attributes, A, that a user wishes to keep private.
[0027] In accordance with embodiments of the present principles, to reduce this inference threat, instead of releasing B, the user will release a distorted version of B, denoted B̂. In one embodiment of the present principles, B̂ is generated according to a conditional probabilistic mapping p_{B̂|B} considered herein as the privacy-preserving mapping. The privacy-preserving mapping, p_{B̂|B}, is designed in such a way that it renders any statistical inference of the A data based on the observation of B̂ harder, yet, at the same time, preserves some utility to the released data B̂, by limiting the distortion generated by the mapping. The privacy-preserving mapping is designed to control the privacy leakage, modeled as the mutual information I(A; B̂) between the private attributes, A, and the publicly released data B̂, subject to a utility requirement, modeled by a constraint on the average distortion E_{B,B̂}[d(B, B̂)]. Achieving perfect privacy, I(A; B̂) = 0, renders the released data B̂ statistically independent from the private data A, and any inference algorithm that tries to infer the private data, A, from the released data, B̂, cannot outperform an uninformed random guess. The privacy and utility metrics and the design of the privacy-preserving mapping are discussed in greater detail below.
[0028] It should be noted that in the local setting, perfect privacy I(A; B̂) = 0 is equivalent to statistical independence between A and B̂. That is, p_{B̂|A}(b̂|a) = p_{B̂|A}(b̂|a′) = p_{B̂}(b̂) for all a, a′ and b̂, which in turn is equivalent to B̂ being locally 0-differentially private with respect to A. Indeed, in the local setting, on one hand the local database A is of size 1 as it contains only the data of a single individual user, thus all databases a, a′ are neighboring databases. On the other hand, the service provider asks for the query B, which due to its correlation with A can be considered as a randomized function of A, and receives the sanitized version, B̂. Thus, in the local privacy setting at perfect privacy, the information theoretic privacy metric and the differential privacy metric are equivalent with respect to the private data, A.
[0029] In one embodiment of the present principles, to model the inference threat for each private attribute from a particular rating vector representing the user history of ratings, a privacy risk metric on a scale [0,100] is implemented. For a private attribute, A, and a specific vector of ratings B = b, the privacy risk can be defined according to equation one (1), which follows:
Risk(A, b) = (1 − H(A|B = b) / H(A)) × 100    (1)
where H(A) = −Σ_a p_A(a) log p_A(a) denotes the entropy of the variable A distributed according to p_A(a), and represents the inherent uncertainty on A. Similarly, it should be noted that H(A|B = b) = −Σ_a p_{A|B}(a|b) log p_{A|B}(a|b) denotes the remaining entropy of A given the observation B = b, and represents the remaining uncertainty on A after observing B = b. Intuitively, the privacy risk Risk(A, b) measures the percentage by which the uncertainty on A decreases due to the observation of B = b, relative to the original uncertainty prior to observing B. A privacy Risk(A, b) = 0 means that the rating vector B = b does not provide any information about the private attribute A, while a Risk(A, b) = 100 implies that no uncertainty is left about the attribute A from observing the rating vector B = b. The privacy risk based on the user's actual rating vector B = b is Risk(A, b), while the privacy risk based on the distorted ratings B̂ = b̂ is Risk(A, b̂), and is obtained by replacing B = b in (1) by B̂ = b̂. The mutual information between the private data A and the distorted data B̂ can be defined as follows:
I(A; B̂) = H(A) − H(A|B̂) = Σ_{b̂} p_{B̂}(b̂) (H(A) − H(A|B̂ = b̂))    (2)
which is related to the average of the privacy risks over all possible distorted rating vectors B̂. Achieving perfect privacy (I(A; B̂) = 0) ensures a 0-privacy risk, meaning that any inference algorithm that would try to infer A from B̂ would not outperform an uninformed random guess.
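For illustration, the quantities in equations (1) and (2) could be computed from an estimated joint distribution as in the following sketch; the array layout p_AB[a, b] and the function names are assumptions of this sketch rather than part of the described embodiment.

import numpy as np

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zero entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def privacy_risk(p_AB, b):
    """Risk(A, b) = (1 - H(A | B = b) / H(A)) * 100 on a [0, 100] scale, per equation (1)."""
    p_A = p_AB.sum(axis=1)                        # marginal of the private attribute A
    p_A_given_b = p_AB[:, b] / p_AB[:, b].sum()   # posterior of A given the observation B = b
    return (1.0 - entropy(p_A_given_b) / entropy(p_A)) * 100.0

def mutual_information(p_AB):
    """I(A; B) = H(A) - H(A | B), the average leakage of equation (2)."""
    p_B = p_AB.sum(axis=0)
    h_A_given_B = sum(p_B[b] * entropy(p_AB[:, b] / p_B[b])
                      for b in range(p_AB.shape[1]) if p_B[b] > 0)
    return entropy(p_AB.sum(axis=1)) - h_A_given_B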
[0030] Referring back to the privacy dashboard 300 of FIG. 3, after completing the entering of the user's privacy settings, the user can then access a program guide, such as a TV guide (not shown), and pick a program that the user would like to consume/watch. In various embodiments of the present principles, on a program display page, the user can give a star rating to the program. For example, FIG. 4 depicts a high level illustrative depiction of a program display page usable in the user system 100 of FIG. 1 in accordance with an embodiment of the present principles. On the program display page 400 of FIG. 4, a privacy risk monitor 405 reminds the user of the user's privacy risk based on available history of the user's previous program ratings. When the user hovers a pointing/selection device above the star ratings for the displayed program, for each possible rating, for example {1, ..., 5}, the privacy risk monitor 405 dynamically updates the risk numbers to inform the user of how the privacy risk would evolve if the user selected a particular rating.
[0031] In one embodiment of the present principles, the privacy risk monitor 405 displays the updated risk based on actual ratings, before sanitization in accordance with the present principles. In such an embodiment, once the user selects a rating, the rating value is sanitized in accordance with the present principles. A user is then able to use the privacy dashboard 300 to check that the privacy risk after the sanitization of the rating value has decreased and, in the best case scenario, is equal to zero (0) for the attributes the user selected as private.
[0032] In one embodiment of the present principles, the system 100 of FIG. 1 implements a privacy-preserving process for the release of the user program ratings to, for example, a service provider, that ensures perfect privacy (I(A; B̂) = 0) against statistical inference of the user's private features, while at the same time minimizing the distortion to the released data/ratings. FIG. 5 depicts a high level illustrative depiction of a program history display page 500 including ratings usable in the user system 100 of FIG. 1 in accordance with an embodiment of the present principles. The program history display page 500 of FIG. 5 displays the user's actual ratings and the distorted ratings on a display of a respective program.
[0033] The privacy-preserving mapping, p_{B̂|B}(b̂|b), of embodiments of the present principles requires characterizing the value of p_{B̂|B}(b̂|b) for all possible pairs (b, b̂) of original and distorted rating vectors, i.e., solving the convex optimization problem over |B| |B̂| variables, where |B| and |B̂| denote the sizes of the alphabets of B and B̂. When B̂ ranges over the same alphabet as B, and the size of the alphabet |B| = 6^50 is large, solving the convex optimization over |B|² variables may become intractable. Quantization can be used to reduce the number of optimization variables from |B|² to K², where K denotes the number of quantization levels. It should be noted that the choice of K is a tradeoff between the size of the optimization and the additional distortion introduced by quantization. Quantization assumes that the vectors B lie in a metric space. Directly applying quantization on the original rating vector B, where unrated shows are assigned a 0 rating, would cause unrated programs to be perceived as strongly disliked by the user, when such programs may not actually be disliked, but may simply be unknown to the user, for example. To circumvent this issue, the rating vector B is completed into Bc using low rank matrix factorization, a standard collaborative filtering technique. The completed rating vector Bc is then input into a quantization process that maps Bc to a cluster center, C. In one embodiment of the present principles, K-means clustering is used for the quantization, with K = 75 cluster centers, where the choice of K was guided empirically. The cluster center, C, is then fed to a privacy optimization algorithm that finally outputs a distorted rating vector B̂.
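A possible realization of this completion-and-quantization step is sketched below; it is purely illustrative, the rank, regularization and iteration-count values are assumptions, and only K = 75 comes from the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def complete_and_quantize(R, rank=10, n_iters=20, lam=0.1, K=75, seed=0):
    # Complete the rating matrix R (0 = unrated) by low rank matrix
    # factorization via alternating regularized least squares, then quantize
    # each completed rating vector Bc to one of K cluster centers (K-means).
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    rated = R > 0
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        for u in range(n_users):        # update user factors
            idx = rated[u]
            U[u] = np.linalg.solve(V[idx].T @ V[idx] + reg, V[idx].T @ R[u, idx])
        for i in range(n_items):        # update item factors
            idx = rated[:, i]
            V[i] = np.linalg.solve(U[idx].T @ U[idx] + reg, U[idx].T @ R[idx, i])
    Bc = U @ V.T                        # completed rating vectors
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(Bc)
    return Bc, km.cluster_centers_, km.labels_
```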
[0034] For example, in one embodiment of the present principles, the following algorithm can be used to describe the quantized privacy-preserving mapping of the present principles:
Algorithm 1: Quantized privacy-preserving mapping
Input: prior p_{A,C}
Solve: convex optimization
    minimize over p_{B̂|C}:  E_{p_{C,B̂}}[ d(C, B̂) ]
    subject to I(A; B̂) ≤ ε, and p_{B̂|C} ∈ Simplex
Remap: p_{B̂|B} = p_{B̂|C(B)}
Output: mapping p_{B̂|B}
In summary, the design of the privacy-preserving mapping described in the embodiment of Algorithm 1 follows the Markov chain A → B → Bc → C → B̂.
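A minimal sketch of the convex optimization in Algorithm 1, written with the CVXPY modeling library, is given below. It assumes the prior p_{A,C} and the distortion values d(c, b̂) are available as arrays, restricts the distorted vectors B̂ to the set of cluster centers, and uses the joint convexity of the relative entropy to express the constraint I(A; B̂) ≤ ε; these modeling choices and all names are assumptions, not part of the original disclosure.

```python
import numpy as np
import cvxpy as cp

def privacy_mapping(p_AC, D, eps):
    # p_AC: (nA, K) joint prior over private attribute A and cluster center C
    # D:    (K, K) distortion d(c, b_hat) between centers and candidate outputs
    # eps:  bound on the information leakage I(A; B_hat), in nats
    nA, K = p_AC.shape
    p_C = p_AC.sum(axis=0)                      # marginal of C
    p_A = p_AC.sum(axis=1)                      # marginal of A

    Q = cp.Variable((K, K), nonneg=True)        # mapping p(b_hat | c)

    # Expected distortion E[d(C, B_hat)] = sum_{c, b_hat} p(c) Q[c, b_hat] d(c, b_hat)
    distortion = cp.sum(cp.multiply(p_C[:, None] * D, Q))

    # I(A; B_hat) as a sum of relative entropies; both arguments are affine
    # in Q and rel_entr is jointly convex, so the constraint is convex.
    p_ABh = p_AC @ Q                            # joint p(a, b_hat)
    p_Bh = cp.sum(p_ABh, axis=0)                # marginal p(b_hat)
    outer = cp.vstack([pa * p_Bh for pa in p_A])
    leakage = cp.sum(cp.rel_entr(p_ABh, outer))

    prob = cp.Problem(cp.Minimize(distortion),
                      [leakage <= eps, cp.sum(Q, axis=1) == 1])
    prob.solve()
    return Q.value                              # privacy-preserving mapping p(b_hat | c)
```

Composing the resulting mapping p(b̂|c) with the completion and quantization steps yields the end-to-end chain A → B → Bc → C → B̂ described above.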
[0035] It should be noted that computing the privacy Risk(A, b), as well as finding the privacy-preserving mapping as a solution to the privacy convex optimization, rely on the fundamental assumption that the prior distribution p_{A,B} that links the private attributes A and the data B is known and can be used as an input to the algorithm. In practice, the true distribution may not be known, but can rather be estimated from a set of sample data that can be observed, for example from a set of users who do not have privacy concerns and publicly release both their attributes A and their original data B. However, such a dataset may contain a small number of samples or be incomplete, which makes the estimation of the prior distribution challenging.
[0036] In the Algorithm 1 illustrated above, the prior distribution between the private data and the quantized completed data is used. The distribution is estimated using Kernel Density Estimation with a Gaussian kernel of width σ = 9.5. In Algorithm 1, ε bounds the amount of information about the private data A that is leaked by the distorted data B̂, and thus represents the level of the privacy requirement on the user side. Varying ε enables the study of the tradeoff between privacy requirement and distortion. FIG. 6 depicts an illustrative graph of the privacy-utility tradeoff in accordance with an embodiment of the present principles. The graph of FIG. 6 plots the mutual information I(A; B̂) against the end-to-end distortion (quantization + privacy mapping) per rating. K-means quantization introduces a distortion of 1.08 per rating and yields a mutual information I(A; C) = 0.2. With 0.14 additional distortion, the privacy-preserving mapping achieves perfect privacy I(A; B̂) = 0 for an end-to-end distortion of 1.22.
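One way the prior between the private attribute and the quantized completed data might be estimated with Kernel Density Estimation is sketched below; the scikit-learn API usage and the function and variable names are assumptions, and only the Gaussian kernel and the width σ = 9.5 come from the description.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def estimate_prior(samples_Bc, labels_A, centers, sigma=9.5):
    # Estimate p(C | A = a) with a Gaussian kernel of width sigma, evaluated
    # at the K cluster centers, and combine it with the empirical p(A) to
    # obtain the joint prior p(A, C) used as input to Algorithm 1.
    values_A, counts = np.unique(labels_A, return_counts=True)
    p_A = counts / counts.sum()
    rows = []
    for a in values_A:
        kde = KernelDensity(kernel="gaussian", bandwidth=sigma)
        kde.fit(samples_Bc[labels_A == a])
        dens = np.exp(kde.score_samples(centers))   # density at each center
        rows.append(dens / dens.sum())
    p_AC = np.array(rows) * p_A[:, None]            # joint p(A, C)
    return p_AC / p_AC.sum()
```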
[0037] In one embodiment of the present principles, the focus is on perfect privacy, I(A; B̂) = 0, and thus on ε being close to 0. At perfect privacy, any inference algorithm that tries to infer A from B̂ can only perform as well as an uninformed random guess. In such an embodiment, intuitively, B̂ is statistically independent from A and, thus, the privacy mapping of the present principles has statistically 'erased' any information about the private data A that could be inferred from B̂.
[0038] FIG. 7 depicts an illustrative receiver operating characteristic (ROC) curve showing the performance of, for example, a logistic regression classifier attempting to infer a user's political views from the original rating vector, from a binarized version of the rating vector where ratings >= 4 are mapped to 1 (like) and ratings <= 3 are mapped to 0 (dislike), or from rating vectors distorted according to a privacy-preserving mapping with average distortion <= 1 (second curve from the bottom) or distortion <= 2 (bottom curve). In the embodiment of FIG. 7, 10-fold cross validation was used and the false positive rate (e.g., Democrats falsely classified as Republicans) was plotted against the true positive rate (e.g., Republicans correctly classified). The top curve in FIG. 7 depicts the privacy risk of inferring the political views from the original rating vectors, and the second curve from the top shows that merely binarizing the ratings is not enough to ensure privacy. The straight diagonal line of FIG. 7 depicts the results of an uninformed random guess. As seen in FIG. 7, the straight diagonal line is very similar to the bottom curve, which demonstrates that, with distortion <= 2, the privacy-preserving mechanism of the present principles successfully ensures privacy against logistic regression of political views from distorted ratings. Further inference attacks were performed with other classifiers, including Naive Bayes and SVM, and similar results were observed.
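The inference attack behind FIG. 7 could be reproduced along the following lines; the scikit-learn classifier and the helper names are assumptions, and the private attribute is assumed to be encoded as a binary 0/1 label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

def inference_attack(B, B_hat, A, folds=10):
    # Try to infer the private attribute A (e.g., political views, 0/1) from
    # the original ratings B and from the privacy-distorted ratings B_hat,
    # using 10-fold cross validation as in the embodiment of FIG. 7.
    results = {}
    for name, X in (("original", B), ("distorted", B_hat)):
        scores = cross_val_predict(LogisticRegression(max_iter=1000),
                                   X, A, cv=folds, method="predict_proba")[:, 1]
        fpr, tpr, _ = roc_curve(A, scores)
        results[name] = (fpr, tpr, roc_auc_score(A, scores))
    return results
```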
[0039] The privacy-preserving process of the present principles preserves the relevance of recommendations even when ratings distorted for privacy are used. FIG. 8 depicts a high level illustrative depiction of a recommendations page in accordance with an embodiment of the present principles. In FIG. 8, FIG. 8a depicts the six (6) top TV show recommendations based on actual ratings given to programming by a user, and FIG. 8b depicts the six (6) top TV show recommendations based on ratings distorted for privacy in accordance with the present principles. In one embodiment of the present principles, low rank matrix factorization (MF) is used to predict ratings for shows the user has not rated from the ratings the user has provided. The MF recommender engine (not shown) was trained by alternating regularized least squares. In one embodiment of the present principles, the recommender engine can be a component/function of the recommendation server 115 of FIG. 1. As depicted in FIG. 8, there is an overlap of 4 out of 6 recommendations without and with privacy, which illustrates that the privacy-preserving process of the present principles maintains utility while protecting user privacy.
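The overlap between the recommendations shown in FIGs. 8a and 8b might be measured with a helper such as the following; it is purely illustrative.

```python
import numpy as np

def top_n_overlap(pred_actual, pred_private, n=6):
    # Count how many of the top-n shows recommended from ratings predicted
    # with actual data also appear in the top-n computed from ratings
    # predicted with privacy-distorted data (4 out of 6 in FIG. 8).
    top_actual = set(np.argsort(pred_actual)[::-1][:n])
    top_private = set(np.argsort(pred_private)[::-1][:n])
    return len(top_actual & top_private)
```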
[0040] Further testing was performed to illustrate that the privacy-preserving process of the present principles is able to eliminate the privacy threat from B for chosen attributes A with little effect on the quality of recommendations. For example, in one experiment, 5-fold cross validation was used to split a dataset into a training set containing 80% of the data and a test set containing the remaining 20% of the data, on which the MF recommender engine was tested both with and without privacy activated to compare the relevance of recommendations in the two cases. The random splitting into training and test sets was performed 5 times, as shown in the first row of Table I, below.
Set         1        2        3        4        5
RMSE1 (r)   1.2434   1.3208   1.2657   1.3359   1.2928
RMSE2 (r̂)   1.3469   1.3522   1.4182   1.3969   1.3708
Table I: Rating prediction RMSE

[0041] More precisely, in each test set, 10% of the ratings were removed and then predicted. Table I above depicts the RMSE in rating prediction based on actual ratings and on distorted ratings. In Table I, r denotes predicted ratings based on the actual ratings provided by users for other shows, while r̂ denotes predicted ratings based on the ratings distorted for privacy. The prediction RMSE for r (RMSE1, privacy not activated) and for r̂ (RMSE2, privacy activated) are calculated on the 10% of ratings that were removed. Table I shows that the RMSE for rating prediction does not degrade much when privacy protection is activated, relative to rating prediction without privacy. It should be noted that the results presented above are for the case of perfect privacy (I(A; B̂) = 0), meaning that any inference algorithm that would try to infer A, e.g., political views, from the distorted ratings B̂ would not outperform an uninformed random guess. If the privacy requirement were less stringent, for example I(A; B̂) ≤ ε for some ε > 0, then the RMSE for rating prediction with privacy protection would be even closer to the RMSE without privacy. Finally, it should be noted that using a more advanced and optimized recommendation engine, instead of the aforementioned standard MF recommendation engine, would result in better rating prediction quality both without and with privacy protection.
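The RMSE figures in Table I could be computed with a helper along these lines; the array layout is an assumption.

```python
import numpy as np

def rating_rmse(actual, predicted, held_out):
    # RMSE over the held-out 10% of ratings, computed once with predictions
    # from actual ratings (RMSE1) and once with predictions from ratings
    # distorted for privacy (RMSE2).
    diff = (actual - predicted)[held_out]
    return float(np.sqrt(np.mean(diff ** 2)))
```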
[0042] FIG. 9 depicts a flow diagram of a method for preserving user privacy in accordance with an embodiment of the present principles. The method 900 begins at step 902, in which quantization is used on data that at least one user may consider as private and data a user is willing to make public, previously provided by at least one user, to reduce the number of optimization variables. As described above, in one embodiment of the present principles, quantization can be used to reduce the number of optimization variables from |B|² to K², where K denotes the number of quantization levels. In various embodiments of the present principles, the quantization step can further include completing the rating vector, B, into Bc using low rank matrix factorization, a standard collaborative filtering technique. The completed rating vector Bc is then input into a quantization process that maps Bc to a cluster center, C. The method 900 can then proceed to step 904.
[0043] At step 904, a distribution that links the data the at least one user may consider as private and the quantized value of the data the user is willing to make public is estimated. As described above, in one embodiment of the present principles, the distribution is estimated using Kernel Density Estimation with a Gaussian kernel of width σ = 9.5. The method 900 can then proceed to step 906.

[0044] At step 906, convex optimization is applied to the distribution estimated at step 904 to determine a mapping from actual data a user is willing to make public to a distorted version of that data. For example, as described above, in one embodiment of the present principles, convex optimization is applied to the estimated distribution to determine a mapping from actual user program ratings to perturbed program ratings that minimizes distortion subject to a privacy constraint. Such a mapping can then be applied to user data that a user is willing to make public to distort that data such that attributes that a user wishes to remain private cannot be determined from the distorted version of the public data distorted in accordance with embodiments of the present principles. As described above, in one embodiment of the present principles, the design of the privacy mapping can be accomplished using Algorithm 1, presented above. The method 900 can then be exited.
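Putting the steps of method 900 together, a purely illustrative end-to-end sketch, reusing the hypothetical helpers introduced above (complete_and_quantize, estimate_prior, privacy_mapping), might look as follows; the file names, the distortion measure, and the ε value are assumptions.

```python
import numpy as np

R = np.load("ratings.npy")        # hypothetical user-item rating matrix, 0 = unrated
A = np.load("attributes.npy")     # hypothetical private attribute labels

# Step 902: complete the ratings and quantize them to K = 75 cluster centers.
Bc, centers, labels = complete_and_quantize(R, K=75)

# Step 904: estimate the prior linking the private attribute to the centers.
p_AC = estimate_prior(Bc, A, centers, sigma=9.5)

# Step 906: solve the convex optimization for the privacy-preserving mapping.
D = np.square(np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2))
Q = privacy_mapping(p_AC, D, eps=1e-3)
```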
The present description illustrates embodiments of the present principles. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the present principles and are included within its scope. That is, having described various embodiments of a method, apparatus and system for preserving user privacy while enabling content consumption and recommendation (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes can be made in the particular embodiments of the present principles disclosed which are within the scope and spirit of the invention. While the foregoing is directed to various embodiments of the present principles, other and further embodiments of the invention may be devised without departing from the basic scope thereof.

Claims

1. A method, comprising the steps of:
applying a quantization to data at least one user is willing to make public to control a number of optimization variables;
estimating a distribution that links the quantized data to data at least one user considers private; and
applying convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
2. The method of claim 1, wherein the quantization further comprises using a collaborative filtering technique.
3. The method of claim 2, wherein the filtering technique comprises low rank matrix factorization.
4. The method of claim 1, wherein the distribution is estimated using Kernel Density Estimation.
5. The method of claim 4, wherein the Kernel Density Estimation comprises a Gaussian kernel with a width of 9.5.
6. The method of claim 1, comprising applying the determined mapping to data a user is willing to make public to distort the data the user is willing to make public such that an inference of data the user considers private cannot be made from the distorted data.
7. The method of claim 1, wherein the data a user considers private comprises at least one of the group consisting of gender, age and political affiliation.
8. The method of claims 1 or 6, wherein the data at least one user is willing to make public comprises program ratings.
9. The method of claim 1, wherein said mapping is determined using the following Algorithm:
Input: prior p_{A,C}
Solve: convex optimization
    minimize over p_{B̂|C}:  E_{p_{C,B̂}}[ d(C, B̂) ]
    subject to I(A; B̂) ≤ ε, and p_{B̂|C} ∈ Simplex
Remap: p_{B̂|B} = p_{B̂|C(B)}
Output: mapping p_{B̂|B}.
10. The method of claim 9, wherein the mapping determined using the Algorithm follows the Markov chain A → B → Bc → C → B̂.
11. The method of claim 9, wherein ε bounds an amount of information about data a user considers private, A, that is leaked by the distorted data, B̂, and varying ε enables a study of a tradeoff between a privacy requirement and distortion.
12. The method of claim 1, comprising communicating distorted data to a source of content such that the user can, in return, receive content recommendations from the source of content while preventing the source of content from making any inferences of private attributes of the user from the distorted data.
13. The method of claim 1, wherein said data at least one user is willing to make public comprises at least one of program rankings and user attributes of a plurality of previous users of a content recommendation system.
14. An apparatus for preserving privacy, comprising:
a memory for storing at least one of program routines, content and data; and
a processor for executing said program routines;
said apparatus being configured to:
apply a quantization to data at least one user is willing to make public to control a number of optimization variables;
estimate a distribution that links the quantized data to data at least one user considers private; and
apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
15. The apparatus of claim 14, wherein said apparatus comprises at least one of a privacy server and a recommendation server.
16. The apparatus of claim 14, wherein said distorted data increases a difficulty of inferring private attributes of a user from the data the user is willing to make public.
17. A system for preserving privacy, comprising:
a content source;
a user interface; and
an apparatus in communication with the content source and the user interface, the apparatus comprising a memory for storing at least one of program routines, content and data and a processor for executing said program routines, wherein the apparatus is configured to:
apply a quantization to data at least one user is willing to make public to control a number of optimization variables;
estimate a distribution that links the quantized data to data at least one user considers private; and
apply convex optimization to the distribution to determine a mapping from data a user is willing to make public to a distorted version of the data the user is willing to make public.
18. The system of claim 17, wherein said apparatus receives a user input identifying user attributes a user wishes to keep private and data a user is willing to make public and the apparatus determines a distorted version of the data using said mapping.
19. The system of claim 17, wherein the distorted version of the data is communicated to said content source and, in return, recommendations for the user are received.
20. The system of claim 19, wherein said data a user is willing to make public comprises program ratings and said recommendations comprise content recommendations.
21. The system of claim 17, wherein said apparatus communicates privacy risk information to said user interface.
PCT/US2015/026070 2014-05-16 2015-04-16 Method, apparatus and system for preserving privacy during media consumption and recommendation WO2015175141A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461994433P 2014-05-16 2014-05-16
US61/994,433 2014-05-16

Publications (2)

Publication Number Publication Date
WO2015175141A1 true WO2015175141A1 (en) 2015-11-19
WO2015175141A9 WO2015175141A9 (en) 2016-01-21

Family

ID=53051916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/026070 WO2015175141A1 (en) 2014-05-16 2015-04-16 Method, apparatus and system for preserving privacy during media consumption and recommendation

Country Status (1)

Country Link
WO (1) WO2015175141A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10216959B2 (en) 2016-08-01 2019-02-26 Mitsubishi Electric Research Laboratories, Inc Method and systems using privacy-preserving analytics for aggregate data
CN111125517A (en) * 2019-12-06 2020-05-08 陕西师范大学 Implicit matrix decomposition recommendation method based on differential privacy and time perception
CN111161029A (en) * 2019-12-30 2020-05-15 亳州职业技术学院 Self-recommendation e-commerce platform

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
AMY ZHANG ET AL: "PriView: Media Consumption and Recommendation Meet Privacy Against Inference Attacks", WEB 2.0 SECURITY & PRIVACY WORKSHOP, 17 May 2014 (2014-05-17), XP055203375 *
B. FUNG; K. WANG; R. CHEN; P. YU: "Privacy Preserving Data Publishing: A Survey of Recent Developments", ACM COMPUTING SURVEYS, 2010
DAVID REBOLLO-MONEDERO ET AL: "From t-Closeness-Like Privacy to Postrandomization via Information Theory", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE SERVICE CENTER, LOS ALAMITOS, CA, US, vol. 22, no. 11, 1 November 2010 (2010-11-01), pages 1623 - 1636, XP011296422, ISSN: 1041-4347 *
FLAVIO DU PIN CALMON ET AL: "Privacy against statistical inference", COMMUNICATION, CONTROL, AND COMPUTING (ALLERTON), 2012 50TH ANNUAL ALLERTON CONFERENCE ON, IEEE, 1 October 2012 (2012-10-01), pages 1401 - 1408, XP032345161, ISBN: 978-1-4673-4537-8, DOI: 10.1109/ALLERTON.2012.6483382 *
SALAMATIAN SALMAN ET AL: "How to hide the elephant- or the donkey- in the room: Practical privacy against statistical inference for large data", 2013 IEEE GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING, IEEE, 3 December 2013 (2013-12-03), pages 269 - 272, XP032566685, DOI: 10.1109/GLOBALSIP.2013.6736867 *
SALMAN SALAMATIAN ET AL: "SPPM: Sparse Privacy Preserving Mappings", 23 July 2014 (2014-07-23), XP055203391, Retrieved from the Internet <URL:http://auai.org/uai2014/proceedings/individuals/199.pdf> [retrieved on 20150720] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10216959B2 (en) 2016-08-01 2019-02-26 Mitsubishi Electric Research Laboratories, Inc Method and systems using privacy-preserving analytics for aggregate data
CN111125517A (en) * 2019-12-06 2020-05-08 陕西师范大学 Implicit matrix decomposition recommendation method based on differential privacy and time perception
CN111125517B (en) * 2019-12-06 2023-03-14 陕西师范大学 Implicit matrix decomposition recommendation method based on differential privacy and time perception
CN111161029A (en) * 2019-12-30 2020-05-15 亳州职业技术学院 Self-recommendation e-commerce platform
CN111161029B (en) * 2019-12-30 2020-08-18 亳州职业技术学院 Self-recommendation e-commerce platform

Also Published As

Publication number Publication date
WO2015175141A9 (en) 2016-01-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15720513

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15720513

Country of ref document: EP

Kind code of ref document: A1