EP2864940A2 - Verfahren und vorrichtung zur verschleierung von benutzerdemografie - Google Patents

Verfahren und vorrichtung zur verschleierung von benutzerdemografie

Info

Publication number
EP2864940A2
Authority
EP
European Patent Office
Prior art keywords
ratings
demographic information
user
movie
particular user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP13784040.1A
Other languages
English (en)
French (fr)
Inventor
Smriti Bhagat
Udi WEINSBERG
Stratis Ioannidis
Nina Taft
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thomson Licensing SAS
Original Assignee
Thomson Licensing SAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing SAS filed Critical Thomson Licensing SAS
Publication of EP2864940A2

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/045 Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present invention relates generally to user profiling and user privacy in recommender systems. More specifically, the invention relates to demographic information inference.
  • the present invention includes a method and apparatus to obfuscate demographic information that can be determined from a user's ratings of digital content.
  • gender information may be determined from a user's movie ratings.
  • an obfuscation method and apparatus are presented.
  • the obfuscation method includes training an inference engine that is in communication with an obfuscation engine.
  • the inference engine determines demographic information using a training data set which includes movie ratings and demographic information from a plurality of other users.
  • movie ratings are received from the new user, where these ratings are received without demographic information.
  • the demographic information of the new user is determined using the trained inference engine.
  • Extra movie ratings are then added to the user-generated ratings. The extra ratings are generated so as to be adverse to a determination of the user's demographic information by an external inference engine.
  • the external inference engine may be part of a recommender system which recommends movies for user viewing.
  • Figure 1 illustrates an exemplary environment embodiment for an inference engine according to aspects of the invention
  • FIG. 2a depicts a Receiver Operating Characteristic (ROC) plot of different classifiers for a Flixster training data set
  • FIG. 2b depicts a Receiver Operating Characteristic (ROC) plot of different classifiers for a Movielens training data set
  • Figure 2c depicts the increase of precision according to size for a Flixster training data set
  • Figure 2d depicts the cumulative distribution function (CDF) of classifier confidence for the Flixster training data set
  • Figure 3 illustrates an example flow diagram of a use of the inference engine according to aspects of the invention
  • Figure 4 illustrates an example inference engine according to aspects of the invention
  • Figure 5a depicts a first embodiment example of an obfuscation engine environment
  • Figure 5b depicts a second embodiment example of an obfuscation engine environment
  • Figure 5c depicts an example obfuscation engine according to aspects of the invention.
  • Figure 6 illustrates an example flow diagram of a use of the obfuscation engine according to aspects of the invention.
  • Figure 1 depicts an exemplary system 100 or environment for an inference engine as discussed herein. Other environments are possible.
  • the system 100 of Figure 1 depicts a recommender system 130 which provides content recommendations to users on a network 120.
  • Typical examples of the recommender system include content recommender systems which are operated by content providers such as Netflix®, Hulu®, Amazon®, and the like.
  • a recommender system 130 provides candidate digital content for subscribing users.
  • Such content can include streaming video, DVD mailings, books, articles, and merchandise.
  • candidate movies can be recommended to a user based on her past movie selection or select user profile characteristics. As one example embodiment, the instance of streaming video is considered.
  • the inference engine 135 can be a data processing device that can infer demographic information from non-demographic information provided by a user 125 who sends movie ratings to the recommender system 130.
  • the inference engine 135 functions to process the movie ratings provided by user 125 and infer demographic information.
  • the demographic information discussed is gender. But one of skill in the art will recognize that other demographic information may also be inferred according to aspects of the invention. Such demographic information may include, but is not limited to, age, ethnicity, political orientation, and the like.
  • the inference engine 135 operates using training data acquired via users 1, 2 to n (105, 110 to 115, respectively). These users provide movie rating data as well as demographic information to the inference engine 135 via the recommender system 130.
  • the training data set may be acquired over time as users 105 through 115 use the recommender system.
  • the inference engine can input a training data set in one or more data loads directly imported via an input port 136.
  • Port 136 may be used to input a training data set from a network, a disk drive, or other data source containing the training data.
  • Inference engine 135 utilizes algorithms to process the training data set.
  • the inference engine 135 subsequently utilizes user 125 (user X) inputs containing movie ratings.
  • Movie ratings contain one or more of a movie identification, such as a movie title or a movie index or reference number, and a rating value; these are used to infer demographic information concerning user 125.
  • a "movie title", or more generically "movie identifier", as used in this discussion, is an identifier, such as a name, title, or database index of the movie, show, documentary, series episode, digital game, or other digital content viewed by user 125.
  • a rating value is a subjective measure of the viewed digital content as judged by user 125.
  • rating values are quality assessments made by the user 125 and are graded on a scale from 1 to 5; 1 being a low subjective score and 5 being a high subjective score.
  • Those of skill in the art will recognize that other rating scales may equivalently be used, such as a 1-to-10 numeric scale, an alphabetical scale, a five-star scale, a ten half-star scale, or a word scale ranging from "bad" to "excellent".
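  • As one illustration, an alternative rating scale can be mapped onto the 1-to-5 scale used in this discussion with a simple linear rescaling. The helper below is hypothetical and not part of the invention; it merely shows one consistent way to normalize the scales listed above.

```python
# Hypothetical helper: linearly map a rating from an arbitrary numeric
# scale [scale_min, scale_max] onto the 1-to-5 integer scale.

def normalize_rating(value, scale_min, scale_max):
    """Return the 1..5 integer rating equivalent of `value`."""
    fraction = (value - scale_min) / (scale_max - scale_min)
    return round(1 + 4 * fraction)

print(normalize_rating(7, 1, 10))      # a 7 on a 1-to-10 scale -> 4
print(normalize_rating(3.5, 0.5, 5))   # a 3.5 on a half-star scale -> 4
```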
  • the information provided by user 125 does not contain demographic information, and the inference engine 135 determines user 125's demographic information from only her movie ratings.
  • a training data set is used to teach the inference engine 135.
  • the training data set may be available to both the recommender system 130 as well as the inference engine 135.
  • a characterization of the training data set is now provided.
  • S_i ⊆ M is the set of movies for which a rating by user i ∈ N is in the dataset, and r_ij, for j ∈ S_i, is the rating given by user i ∈ N to movie j ∈ M.
  • the training set also contains a binary variable y_i ∈ {0,1} indicating the gender of the user (bit 0 is mapped to male users).
  • the training data set is assumed unadulterated: neither ratings nor gender labels have been tampered with or obfuscated.
  • the recommender mechanism throughout this discussion is assumed to be matrix factorization, since this is commonly used in commercial systems. Although matrix factorization is utilized as an example, any recommender mechanism may be used. Alternate recommender mechanisms include the neighborhood method (clustering of users), contextual similarity of items, or other mechanisms known to those of skill in the art. Ratings for the set S_0 are generated by appending the provided ratings to the rating matrix of the training set and factorizing it. More specifically, we associate with each user i ∈ N ∪ {0} a latent feature vector u_i ∈ R^d, and with each movie j ∈ M a latent feature vector v_j ∈ R^d. The regularized mean square error is defined as Σ_i Σ_{j ∈ S_i} (r_ij − μ − u_i · v_j)^2 + λ (Σ_i ||u_i||^2 + Σ_j ||v_j||^2), where μ is the average rating of the entire dataset.
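  • The matrix-factorization step above can be sketched as follows. The latent dimension, learning rate, and stochastic-gradient training loop are illustrative choices, not the patent's prescribed implementation; only the objective (squared error around the global mean μ with an L2 penalty) follows the text.

```python
import numpy as np

# Sketch of matrix factorization: each user i has a latent vector u_i in R^d,
# each movie j has v_j in R^d, and we minimize the regularized squared error
# of (mu + u_i . v_j) against the observed ratings.

def factorize(ratings, n_users, n_movies, d=8, lam=0.1, lr=0.01, epochs=200, seed=0):
    """ratings: list of (user, movie, value) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, d))
    V = 0.1 * rng.standard_normal((n_movies, d))
    mu = np.mean([r for _, _, r in ratings])        # global average rating
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - (mu + U[i] @ V[j])            # prediction residual
            U[i] += lr * (err * V[j] - lam * U[i])  # SGD step with L2 penalty
            V[j] += lr * (err * U[i] - lam * V[j])
    return mu, U, V

def predict(mu, U, V, i, j):
    """Predicted rating of movie j by user i."""
    return mu + U[i] @ V[j]
```

Ratings for movies the user has not rated are then read off from `predict`, which is how the recommender fills in the set M − S_0.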
  • Flixster is a publicly available online social network for rating and reviewing movies. Flixster allows users to enter demographic information into their profiles and share their movie ratings and reviews with their friends and the public. The dataset has 1M users, of which only 34.2K share their age and gender. This subset of 34.2K users, who have rated 17K movies and provided 5.8M ratings, is considered. The 12.8K males and 21.4K females have provided 2.4M and 3.4M ratings, respectively. Flixster allows users to provide half-star ratings; however, to be consistent across the evaluation datasets, the ratings are rounded up to integers from 1 to 5. The second dataset is Movielens, publicly available from the GroupLens™ research team. It consists of 3.7K movies and 1M ratings by 6K users. The 4331 males and 1709 females provided 750K and 250K ratings, respectively.
  • demographic information can include many characteristics.
  • the determination of gender as an example demographic is expressed as one embodiment in the current invention. However, the determination of different or multiple demographic characteristics of a user is within the scope of the present invention.
  • Three different types of classifiers are examined: Bayesian classifiers, support vector machines (SVM), and logistic regression. In the Bayesian setting, several different generative models are examined; for all models, assume that points (x_i, y_i) are sampled independently from the same joint distribution P(x, y). Given P, the predicted label ŷ ∈ {0,1} attributed to characteristic vector x is the one with maximum likelihood, i.e., ŷ = argmax_{y ∈ {0,1}} P(x, y).
  • a mixed Naive Bayes is now described according to an aspect of the invention.
  • This model is based on the assumption that users give normally distributed ratings. More specifically, each rating r_ij is modeled as a Gaussian random variable whose mean and variance depend on the movie j and on the user's gender y_i.
  • the value p_i also serves as a confidence value for the classification of user i.
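  • The mixed naive-Bayes idea above can be sketched as follows, under the stated assumption of per-movie, per-gender Gaussian ratings. The smoothing constant and data layout are illustrative; the posterior probability of the winning class plays the role of the confidence p_i.

```python
import math
from collections import defaultdict

# Sketch: for each gender y and movie j, model ratings as N(mean_jy, var_jy).
# A user's predicted gender maximizes the posterior; the posterior of the
# chosen class is the confidence p_i.

def fit(profiles, labels):
    """profiles: list of {movie: rating} dicts; labels: 0 (male) / 1 (female)."""
    stats = {0: defaultdict(list), 1: defaultdict(list)}
    for prof, y in zip(profiles, labels):
        for j, r in prof.items():
            stats[y][j].append(r)
    params, priors = {}, {}
    for y in (0, 1):
        priors[y] = labels.count(y) / len(labels)
        params[y] = {}
        for j, rs in stats[y].items():
            m = sum(rs) / len(rs)
            v = sum((r - m) ** 2 for r in rs) / len(rs) + 1e-2  # smoothed variance
            params[y][j] = (m, v)
    return params, priors

def classify(profile, params, priors):
    logp = {}
    for y in (0, 1):
        lp = math.log(priors[y])
        for j, r in profile.items():
            if j in params[y]:
                m, v = params[y][j]
                lp += -0.5 * math.log(2 * math.pi * v) - (r - m) ** 2 / (2 * v)
        logp[y] = lp
    y_hat = max(logp, key=logp.get)
    conf = 1.0 / (1.0 + math.exp(logp[1 - y_hat] - logp[y_hat]))  # posterior p_i
    return y_hat, conf
```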
  • One of the great benefits of using logistic regression is that the coefficients β capture the extent of the correlation between each movie and the class. In the current instance, a large positive β_j indicates that movie j is correlated with the class male, whereas a large negative β_j indicates that movie j is correlated with the class female.
  • We select the regularization parameter so that we have at least 1000 movies correlated with each gender that have a non-zero coefficient.
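  • A minimal sketch of the logistic-regression view described above, assuming a users-by-movies rating matrix X with zeros for unrated movies. The proximal-gradient loop with soft-thresholding is a generic way to fit an L1-regularized logistic model, not the patent's exact procedure; the L1 penalty drives most coefficients to zero, and the sign of the surviving β_j indicates the correlated class.

```python
import numpy as np

# Generic L1-regularized logistic regression via proximal gradient descent.
# Class 1 corresponds to the positive-coefficient gender in this sketch.

def train_l1_logreg(X, y, lam=0.01, lr=0.1, epochs=500):
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted P(y = 1)
        grad = X.T @ (p - y) / n                # logistic-loss gradient
        beta -= lr * grad
        # soft-thresholding = proximal step for the L1 penalty
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)
    return beta

def top_movies(beta, k=10):
    """Indices of the k most positively and most negatively correlated movies."""
    order = np.argsort(beta)
    return order[-k:][::-1], order[:k]
```

In practice the regularization strength `lam` would be tuned, as the text describes, until enough movies per gender retain non-zero coefficients.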
  • support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns, and are used for classification and regression analysis.
  • an SVM finds a hyperplane that separates users belonging to different genders in a way that minimizes the distance of incorrectly classified users from the hyperplane as is well known in the art.
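  • The SVM classifier above can be sketched with a plain sub-gradient hinge-loss loop on the same users-by-movies rating matrix. Labels here are +1/−1 and all parameter values are illustrative; a production system would use a tuned SVM library rather than this minimal loop.

```python
import numpy as np

# Sketch of a linear SVM: find a hyperplane w separating the two genders by
# minimizing the hinge loss plus an L2 penalty with sub-gradient steps.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                     # misclassified or inside margin
        grad = lam * w - (y[active] @ X[active]) / n
        w -= lr * grad
    return w
```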
  • precision in a classification task is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, which are items incorrectly labeled as belonging to the class).
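  • The precision definition above, together with the recall measure discussed below for the evaluation, in code form with hypothetical labels:

```python
# Precision: true positives / all items labeled positive.
# Recall: true positives / all items that are actually positive.

def precision(predicted, actual, positive):
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    return tp / (tp + fp)

def recall(predicted, actual, positive):
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    return tp / (tp + fn)

print(precision(["M", "M", "F"], ["M", "F", "F"], "M"))  # 1 TP, 1 FP -> 0.5
```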
  • Table 2 shows that logistic regression outperforms all other models for Flixster users and both genders.
  • For the Movielens dataset, SVM performs better than all other algorithms, while logistic regression is second best.
  • the inference performs better for the gender that is dominant in each dataset (female in Flixster and male in Movielens). This is especially evident for SVM, which exhibits very high recall for the dominant class and low recall for the minority class.
  • the mixed model improves significantly on the Bernoulli model and performs similarly to the multinomial model. This indicates that the Gaussian distribution might not be a sufficiently accurate model for the distribution of the ratings.
  • the movie-gender correlation was considered.
  • the coefficients computed by logistic regression expose movies that are most correlated with males and females.
  • Table 3 lists the top 10 movies correlated with each gender for Flixster; similar observations as the ones below hold for Movielens.
  • the movies are ordered based on their average rank across the 10-folds. Average rank was used since the coefficients can vary significantly between folds, but the order of movies does not.
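  • The average-rank ordering described above can be sketched as follows; the fold scores are illustrative stand-ins for per-fold logistic-regression coefficients.

```python
# Rank movies within each fold by score, then order by average rank across
# folds. Coefficient values vary between folds, but the relative order is
# stable, which is why average rank is used.

def order_by_average_rank(fold_scores):
    """fold_scores: list of {movie: score} dicts, one per fold."""
    ranks = {}
    for scores in fold_scores:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, movie in enumerate(ordered):
            ranks.setdefault(movie, []).append(rank)
    return sorted(ranks, key=lambda m: sum(ranks[m]) / len(ranks[m]))
```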
  • the top gender-correlated movies are quite different depending on the input representation used; of the top correlated movies, only 35 are the same for males across the two inputs, and 27 are the same for females. The comparison yielded Jaccard distances of 0.19 and 0.16, respectively.
  • Table 3 shows that in both datasets some of the top male correlated movies have plots that involve gay males, (such as Latter Days, Beautiful Thing, and Eating Out); we observed the same results when using X.
  • the main reason for this is that all of these movies have a relatively small number of ratings, ranging from a few tens to a few hundred. In this case, a small difference in the rating distributions between genders, relative to the class priors, is sufficient to make the movie highly correlated with the class.
  • Figure 3 represents a method according to aspects of the invention to generate demographic information from user ratings which do not have demographic information and to utilize those results for useful purposes.
  • the end purposes of using such generated demographic information include the targeting of advertisements to the user 125, and/or to provide enhanced recommendations via a recommender system 130.
  • the method 300 of Figure 3 begins with an input of a training data set having rating and demographic information representing a plurality of users into an inference engine at step 305.
  • Figure 1 illustrated the inference engine 135 as part of the recommender system 130. This step may be accomplished using the recommender system connection 137 to the network 120, or may be accomplished via direct input to the inference engine 135 via port 136. If the input is via the recommender system network connection 137, then the training data set may be a one-by-one accumulation of demographic and rating information (movie ratings or any other digital content ratings), or one or more loads of at least one user training data set having demographic and rating information.
  • the data is one or more downloads of at least one user training data set.
  • at step 310, the recommender system 130 trains the inference engine using the information from the training data set. Step 310 can be skipped if the inference engine 135 has a direct download via port 136. In either event, steps 305 and 310 represent a training of the inference engine 135 with a training data set having both user demographic information and user rating information.
  • a new user that is not in the training data set such as user 125, interacts with the recommender system 130 and provides only ratings.
  • these ratings can be, for example, movie ratings having movie identifier information and subjective rating value information.
  • the ratings provided by user 125 are without demographic information.
  • the inference engine 135 uses a classification algorithm to determine the new user's demographic information based on the new user's ratings.
  • the classification algorithm is preferably one of support vector machines (SVM) or logistic regression, as discussed earlier.
  • the determined demographic information such as gender, may be used for many useful purposes. Two examples are provided in Figure 3.
  • the demographic information determined at step 320 is used at step 325 by the recommender system 130 to provide enhanced recommendations to the new user.
  • the recommender system 130 is a movie recommender system, such as one operated by NetflixTM or HuluTM.
  • the demographic information such as gender
  • the recommender system 130 can use the determined demographic information from step 320 to target specific advertisements to the new user at step 330.
  • the gender-specific advertisements may be targeted to the new user.
  • Such advertisements may include perfume purchase discount suggestions for females or beard shaving equipment purchase discounts for males.
  • the recommender system may have access to potential advertisements from an internal or external database or network server (not shown).
  • steps 325 and 330 represent useful actions taken to exploit the demographic information extracted from the ratings provided by the new user 125.
  • Steps 315 through 330 may be repeated for each new user that utilizes the services of the recommender system 130.
  • a user that receives an enhanced recommendation or an advertisement from the recommender system would receive the enhanced recommendation or advertisement on a display device associated with the user, such as user 125.
  • Such user display devices are well known and include display devices associated with home television systems, stand-alone televisions, personal computers, and handheld devices, such as personal digital assistants, laptops, tablets, cell phones, and web notebooks.
  • Figure 4 is an example block diagram of an inference engine 135.
  • the inference engine 135 interfaces with the recommender system 130 as depicted in Figure 1.
  • Inference engine interface 410 functions to connect the communication components of the inference engine 135 to those of the recommender system 130.
  • the inference engine interface 410 to the recommender system at 405 may be a serial or parallel link, or an embedded or external function, as is known to those of skill in the art.
  • the inference engine may be combined with the recommender system or may be separate from the recommender system.
  • Interface port 405 allows the recommender system 130 to provide training data to the inference engine
  • Processor 420 provides computation functions for the inference engine 135.
  • the processor can be any form of CPU or controller that utilizes communications between elements of the inference engine to control communication and computation processes for the inference engine.
  • bus 415 provides a communication path between the various elements of inference engine 135 and that other point to point interconnections are also feasible.
  • Program memory 430 can provide a repository for instructions related to the method 300 of Figure 3.
  • Data memory 440 can provide the repository for storage of information such as training data sets, downloads, uploads, or scratchpad calculations. Those of skill in the art will recognize that memory 430 and 440 may be combined or separate and may be incorporated, in whole or in part, into processor 420.
  • Processor 420 utilizes the storage and retrieval properties of program memory to execute instructions, such as computer instructions, to perform the steps of method 300, in order to produce demographic information for use by the recommender system 130.
  • Estimator 450 may be separate or part of processor 420 and functions to provide calculation resources for determination of the demographic information from a new user's ratings. As such, estimator 450 can provide computation resources for the classifier, preferably either SVM or logistic regression. The estimator can provide interim calculations to data memory 440 or processor 420 in the determination of a new user's demographic information. Such interim calculations include the probability of the demographic information related to the new user given only her rating information.
  • the estimator 450 may be hardware, but is preferably a combination of hardware and firmware or software.
  • the inference algorithms correctly predict the gender of users with a precision of 70%-80%.
  • the above discussed technique for determining demographic information from a user's ratings may invoke privacy concerns for the user.
  • Some users may find it desirable to obfuscate their demographic information from reliable determination.
  • An obfuscation mechanism to protect detectable demographic information from reliable detection is addressed below.
  • Figure 5a depicts an example environment 500 in which an obfuscation mechanism can reside with respect to the inference engine 135 of a recommender system.
  • the obfuscation mechanism can reside in multiple places.
  • the obfuscation mechanism may reside in the cloud connected to network 120 or in user 125 equipment. If located in the cloud (not shown), the obfuscation mechanism could be a network service offered to many users. If located in the user equipment, then the obfuscation mechanism essentially contains an inference engine with additional computational elements.
  • an obfuscation engine 126 is able to monitor recommendations coming from the user 125 and add additional ratings to the user's ratings in order to reduce the accuracy of any inference engine located in the recommender system 130.
  • a content aggregator that serves to distribute content to a user could also act to preserve a user's demographic information by providing an obfuscation engine along with the content aggregation service.
  • Figure 5b depicts such a content aggregator service.
  • a content aggregator 560 connects to the network 120 via link 555 and can obtain access to digital content that may be of interest to user 125. The user 125 may gain access to the content aggregator directly via link 582 or via the network 120.
  • the content aggregator acts as a provider of digital content for user 125 and offers that content to the user for a fee.
  • One content provider may be a recommender system 130.
  • the content aggregator 560 acts as a conduit for digital content that may be rated by user 125.
  • the content aggregator can offer obfuscation services to the user via an obfuscation engine 570 that operates with an inference engine 575.
  • the obfuscation engine 570 acts to obfuscate user 125's demographic information so that when user 125 rates digital content acquired from the recommender system 130 content provider, additional and obfuscating ratings are added to the ratings passed to the recommender system 130. The added ratings are adverse to a correct determination of the demographic information.
  • the inference engine 135 associated with the recommender system cannot accurately determine demographic information from user 125 via his ratings.
  • Figure 5c depicts an example block diagram 590 of an obfuscation engine 599.
  • the obfuscation engine 599 interfaces with a network, such as 120 on Figure 5b, via network interface 591.
  • the network interface 591 allows user data, such as user ratings, and a training data set to be accessed via the network, such as the internet.
  • a receiver in the network interface enables receipt of training data and user-provided ratings, such as movie ratings.
  • a transmitter within the network interface 591 enables extra ratings generated by the rating generator 595 to be sent to a network.
  • the extra ratings as well as the user-provided ratings are sent to a recommender system 130 where an inference engine, such as 135 of Figure 5b, is prevented from accurately determining the demographic information of the user.
  • Processor 592 provides computation functions for the obfuscation engine 599.
  • the processor can be any form of CPU or controller that utilizes communications between elements of the obfuscation engine to control communication and computation processes for the obfuscation engine.
  • bus 597 provides a communication path between the various elements of obfuscation engine 599; other point-to-point interconnections are also feasible.
  • Program memory 593 can provide a repository for instructions related to the method 600 of Figure 6.
  • Data memory 594 can provide the repository for storage of information such as training data sets, downloads, uploads, or scratchpad calculations. Those of skill in the art will recognize that memory 593 and 594 may be combined or separate and may be incorporated, in whole or in part, into processor 592.
  • Processor 592 utilizes program memory instructions to execute a method, such as method 600, to produce obfuscation data that is adverse to an accurate determination of a user's demographic information. The obfuscation data is transmitted to a network-based recommender system via network interface 591.
  • the inference engine 596 may be separate or part of processor 592 and functions to provide calculation resources for determination of the demographic information from a new user's ratings. As such, the inference engine may be similar to that of Figure 4 or may utilize the computation resources shown in Figure 5c.
  • the rating generator 595 operates to generate ratings for use by the obfuscation technique described below. Specifically, the rating generator generates extra ratings that mimic user ratings but that are adverse to the accurate determination of the user's demographic information.
  • the rating generator creates ratings that are sent to an external inference engine, such as the inference engine in a recommender system (See Figure 1).
  • the extra ratings being sent to the external inference system act to muddle the ratings from the new user such that an accurate determination of user demographic information is not likely.
  • the obfuscation engine 599 may be hardware based, but is preferably a combination of hardware and firmware or software.
  • a user such as user 125, indexed by 0, views and rates digital content items such as movies.
  • r_0j denotes the rating of movie j ∈ S_0, and the user's rating profile is defined as the set of (movie, rating) pairs R_0 = {(j, r_0j) : j ∈ S_0}.
  • the user submits R'_0 (i.e., the obfuscated rating profile) in place of R_0.
  • this obfuscation aims at striking a good balance between the following two conflicting goals: (a) R'_0 can be used to provide relevant recommendations to the user, and (b) it is difficult to infer the user's demographic information, such as gender, from R'_0.
  • the obfuscated rating profile R'_0 is assumed to be submitted to a recommender system 130 that has a module that implements a gender inference engine 135.
  • the recommender system 130 uses R'_0 to predict the user's ratings on M − S'_0, and potentially, recommend movies that might be of interest to the user.
  • the gender inference engine 135 is a classification mechanism that uses the same R'_0 to profile and label the user as either male or female.
  • the obfuscation engine 126 and gender inference engine 135 are not.
  • the simple approach is taken that both recommender system 130 and inference engine 135 are oblivious to the fact that any kind of obfuscation is taking place. Both mechanisms take the profile R'_0 at "face value" and do not reverse-engineer the "true" profile R_0.
  • the recommender system 130 and inference engine 135 have access to the training dataset. It is assumed that the training data set is unadulterated; neither ratings nor demographic information, such as gender labels, have been tampered with or obfuscated.
  • the obfuscation engine 126 may also have a partial view of the training set. In one embodiment, the training dataset is public, and the obfuscation engine 126 has full access to it.
  • a confidence value of the classifier used in the inference engine 135 is the obstacle that an obfuscation engine needs to overcome when trying to hide demographic information, such as gender, from the classifier.
  • the obfuscation engine attempts to lower this confidence value of the classifier in the inference engine 135. Therefore, an evaluation of whether the classifier has different confidence values when it outputs a correct or incorrect classification is undertaken.
  • Figure 2d plots the cumulative distribution function (CDF) of the confidence value for correct and incorrect classifications. Figure 2d shows that the confidence is higher when the classification is correct, with a median confidence for incorrect classifications of 0.65, while for correct classification it is 0.85. Moreover, nearly 20% of correct classifications have a confidence of 1.0, which holds for less than 1% of incorrect classifications.
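  • The comparison behind Figure 2d can be sketched as follows; the confidence values below are illustrative stand-ins, not the Flixster results.

```python
# Given per-user confidence values split by whether the classification was
# correct, compare medians and evaluate the empirical CDF, as in Figure 2d.

def empirical_cdf(values):
    xs = sorted(values)
    return lambda t: sum(1 for x in xs if x <= t) / len(xs)

def median(values):
    xs = sorted(values)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else 0.5 * (xs[mid - 1] + xs[mid])

correct = [0.7, 0.85, 0.9, 1.0]      # illustrative confidences
incorrect = [0.55, 0.6, 0.65, 0.7]
print(median(correct), median(incorrect))
```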
  • CDF cumulative distribution function
  • the obfuscation engine has a mechanism that takes as input a user i's rating profile R_i, a parameter k that represents the number of permitted alterations, and information from the training set, and outputs an altered rating profile R'_i such that it is hard to infer the gender of the user while minimally impacting the quality of recommendations received.
  • a mechanism can alter R_i by adding, deleting, or changing movie ratings. Focus is placed on the setting in which the obfuscation engine is only allowed to add k movie ratings, since deleting movies is impractical in most services and changing ratings is less effective than adding ratings when the viewing event is a strong predictor of the user's demographic attributes.
  • a fixed number k need not be used; rather, the number of added ratings corresponds to a given percentage of movies in a user's rating profile.
  • the obfuscation engine needs to make two non-trivial decisions: which movies should be added, and what rating should be assigned to each added movie.
  • These added movie ratings are termed extra ratings.
  • the extra ratings are adverse to a correct determination of the demographic information of the user.
  • the rating values in the rating pairs (title, rating value) of the extra ratings are not assigned as "noise" but have some useful value. For example, if this rating corresponds to the average rating over all users, or the predicted rating (using matrix factorization) for a specific user, then the rating value is a reasonable predictor of how the user may have rated if she had watched the movie.
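The average-rating assignment described above can be sketched as a toy computation. The (user, movie, rating) triple layout and all names are illustrative assumptions; only the idea of assigning each extra movie its mean rating over the training set comes from the text.

```python
# Sketch of the "average rating" assignment: each extra movie receives the
# mean rating it got across all users in the training set, so the value is
# a plausible prediction rather than noise. Data and names are illustrative.

def average_ratings(training_ratings):
    """Map movie -> mean rating over all (user, movie, rating) triples."""
    totals, counts = {}, {}
    for _user, movie, rating in training_ratings:
        totals[movie] = totals.get(movie, 0.0) + rating
        counts[movie] = counts.get(movie, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

train = [("u1", "die_hard", 4.0), ("u2", "die_hard", 5.0), ("u3", "rambo", 3.0)]
avg = average_ratings(train)
print(avg["die_hard"])  # → 4.5
```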
  • Each strategy takes as input the set of movies S_i rated by the user i, the number of movies k to be added, and ordered lists L_M and L_F of male- and female-correlated movies, respectively, and outputs an altered set of movies S'_i, where S_i ⊆ S'_i, k = |S'_i| − |S_i|, and L_M ∩ L_F = ∅.
  • the lists L_M and L_F are stored in decreasing order of the value of a scoring function w: L_M ∪ L_F → ℝ, where w(j) indicates how strongly correlated a movie j ∈ L_M ∪ L_F is with the associated gender.
  • the obfuscation mechanism uses the available training data to compute the average rating for each movie j ∈ S'_i − S_i and adds these ratings to user i's altered rating profile R'_i.
  • the obfuscation mechanism computes the latent factors of movies by performing matrix factorization on the training dataset, and uses those to predict a user's ratings. The predicted ratings for all movies j ∈ S'_i − S_i are added to R'_i.
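A minimal matrix-factorization sketch of this predicted-rating assignment is given below, using plain stochastic gradient descent with illustrative hyper-parameters; the patent does not specify the factorization procedure, so everything here is an assumption about one standard way to realize it.

```python
# Toy matrix factorization via SGD: learn latent factors from the training
# ratings, then predict a rating as the dot product of the user and movie
# factor vectors. Hyper-parameters and data are illustrative, not the
# patent's exact procedure.
import random

def factorize(ratings, n_users, n_items, dims=2, steps=1500, lr=0.03, reg=0.02):
    random.seed(0)
    U = [[random.random() * 0.1 for _ in range(dims)] for _ in range(n_users)]
    V = [[random.random() * 0.1 for _ in range(dims)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(U[u][d] * V[i][d] for d in range(dims))
            for d in range(dims):
                u_d, v_d = U[u][d], V[i][d]          # use old values for both updates
                U[u][d] += lr * (err * v_d - reg * u_d)
                V[i][d] += lr * (err * u_d - reg * v_d)
    return U, V

def predict(U, V, u, i):
    """Predicted rating of movie i by user u."""
    return sum(a * b for a, b in zip(U[u], V[i]))

# Tiny 2-user, 2-movie example; factors are fit to the observed ratings and
# can then predict user 0's rating for the unrated movie 1.
train = [(0, 0, 4.0), (1, 0, 4.0), (1, 1, 2.0)]
U, V = factorize(train, n_users=2, n_items=2)
print(round(predict(U, V, 0, 1), 2))
```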
  • Table 4 Accuracy of gender inference for different strategies and noise levels, on assigning average movie ratings
  • Table 4 shows the accuracy of inference for all three movie selection strategies (i.e., random, sampled and greedy) when the rating assigned is the average movie rating.
  • the accuracy is computed using 10-fold cross-validation, where the model is trained on unadulterated data and tested on obfuscated data. Since the accuracy of inference is the highest for the logistic regression classifier, it would be the natural choice as the inference mechanism for a recommender system.
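This evaluation protocol, training on unaltered profiles and scoring on obfuscated ones, can be sketched as below. Here `fit`, `predict_one`, and `obfuscate` are hypothetical stand-ins for the classifier and obfuscation mechanism; only the fold logic reflects the text.

```python
# Sketch of k-fold cross-validation where the classifier is fit on clean
# profiles but tested on obfuscated ones. All component functions are
# hypothetical stand-ins; the fold bookkeeping is the point.

def cross_validate(profiles, labels, fit, predict_one, obfuscate, folds=10):
    n = len(profiles)
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))              # every folds-th profile
        train_X = [profiles[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = fit(train_X, train_y)                   # trained on clean data
        for i in test_idx:
            if predict_one(model, obfuscate(profiles[i])) == labels[i]:
                correct += 1                            # scored on obfuscated data
    return correct / n

# Toy demo: a "classifier" predicting the majority training label, with an
# identity obfuscation, giving a baseline accuracy.
fit = lambda X, y: max(set(y), key=y.count)
predict_one = lambda model, x: model
identity = lambda x: x
acc = cross_validate([[0]] * 10, ["F"] * 6 + ["M"] * 4, fit, predict_one, identity)
print(acc)  # → 0.6
```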
  • the obfuscation mechanism above uses ordered lists that correspond well to the inference mechanism's notion of male- or female-correlated movies. In general, however, the obfuscation mechanism does not know which inference algorithm is used, and thus lists such as L_M and L_F may match such a notion internal to the inference algorithm only weakly.
  • the obfuscation mechanism is evaluated under such a scenario, with Multinomial Naive Bayes and SVM classifiers. The obfuscation still performs well: as seen in Table 4, the inference accuracy of the Multinomial classifier drops from 71% to 42.1% for Flixster, and from 76% to 60% for the Movielens dataset (with 10% extra ratings and the greedy strategy).
  • the RMSE increases with additional ratings, although negligibly.
  • a slight decrease in RMSE with extra ratings occurs. This may occur because adding extra ratings increases the density of the original rating matrix, which may in turn improve the performance of matrix factorization solutions.
  • Another explanation could be that the extra ratings are not arbitrary, but somewhat meaningful (i.e., the average across all users). The key observation is that for both datasets the change in RMSE is not significant: a maximum of 0.015 for Flixster (with the random strategy and 10% extra ratings), and 0.058 for Movielens (with the sampled strategy and 10% extra ratings).
  • the obfuscation engine preserves recommender system quality of recommendations to the user.
  • Preserving recommendation quality is an appealing feature for the obfuscation engine.
  • the tradeoff when the rating assignment corresponds to the "predicted ratings" approach is considered.
  • the motivation behind this rating assignment is that, in principle, this obfuscation results in no change in RMSE as compared with the RMSE on unaltered data. In other words, there is no tradeoff to be made on the utility front with this choice of rating assignment.
  • Table 5 shows the accuracy of gender inference when this rating assignment is used. The results are similar to those in Table 4, where the rating assignment is the average movie rating.
  • the accuracy of gender inference is slightly lower with predicted ratings; for example, for the greedy strategy with 1% extra ratings, the accuracy of the logistic regression classifier reduces from 57.7% to 48.4% - and this benefit comes without sacrificing the quality of recommendations.
  • the experimental evaluation shows that with a small amount of additional ratings, it is possible to protect a user's gender by obfuscation, with an insignificant change to the quality of recommendations received by the user.
  • Figure 6 depicts an example method 600 for the production of a set of ratings (title, rating value) from a user which can hide the user's demographic information from accurate detection. The method also advantageously does not adversely affect the quality of the recommendations the user receives.
  • the method begins at step 605 with the introduction of a training set of ratings from other users.
  • the training data set having both ratings (title, rating value) and demographic information of the other users.
  • the training data set is used to train an inference engine, such as 575 or 596 in Figures 5b and 5c respectively.
  • the trained inference engine can determine demographic information of the user 125. As such, it somewhat emulates the function of an inference engine that is in a recommender system, such as 130 in Figure 5b, that is accessed by the user 125.
  • After training the inference engine, the obfuscation engine is ready for use by a new user.
  • a new user, who is not one of the users in the training data set, provides ratings to the obfuscation engine.
  • the obfuscation engine receives ratings, such as movie ratings.
  • the received movie ratings are only rating pairs of (title, rating value) and are without demographic information of the new user.
  • the inference engine uses a classification algorithm to determine the new user's demographic information based on the user's ratings.
  • the obfuscation engine generates ratings that are adverse to an accurate determination of demographic information by another inference engine. That is, the generated ratings are extra ratings that can be added to the ratings of the user that help obfuscate detectable demographic information of the user.
  • If the inference engine infers the gender of user 125 as female, then the extra ratings generated by the obfuscation engine will provide data that leads to an incorrect inference of the user's gender. Accordingly, an external inference engine, such as one in a recommender system, would be unable to accurately determine the gender demographic information of new user 125. Thus, the extra ratings are adverse to an accurate detection of the demographic information of the new user.
  • the extra ratings are transmitted to a recommender system (RS) by the obfuscation engine at step 630.
  • This has the effect of obscuring the demographic information of the user 125 as detected by an inference engine in the recommender system 130.
  • This obfuscation occurs because the external inference engine, such as 135 of Figure 5b, receives not only the user's normally generated ratings, but also the extra ratings having rating pairs (title, rating value) that act adverse to an accurate demographic information determination. That is, the extra ratings act to prevent an accurate determination of the user's demographic information by an inference engine.
  • a recommender system 130 having an inference engine 135 is thus prevented from performing an accurate determination of a user's demographic information by the extra ratings.
  • Steps 615 through 630 may be repeated for a new user. Accordingly, multiple new users can have their demographic information obfuscated by the method 600.
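The steps of method 600 can be sketched as a single pipeline. The component functions (`train_fn`, `infer_fn`, `extra_fn`) below are hypothetical stand-ins for the inference-engine training, demographic inference, and extra-rating generation described above; only the ordering of steps comes from the text.

```python
# High-level sketch of method 600: train an inference engine on the training
# set, infer the new user's demographics from her ratings, generate extra
# ratings adverse to that inference, and transmit the augmented profile.
# All component functions are hypothetical stand-ins.

def obfuscate_profile(train_fn, infer_fn, extra_fn, training_set, user_ratings, k):
    engine = train_fn(training_set)                # step 605: train inference engine
    inferred = infer_fn(engine, user_ratings)      # step 620: infer demographics
    extras = extra_fn(inferred, user_ratings, k)   # step 625: adverse extra ratings
    return user_ratings + extras                   # step 630: profile sent to recommender

# Toy stand-ins: infer "F" if any female-correlated title is rated, then
# counter with a male-correlated extra rating (and vice versa).
def train_fn(_training_set):
    return {"female_titles": {"titanic", "notebook"}}

def infer_fn(engine, ratings):
    return "F" if any(t in engine["female_titles"] for t, _ in ratings) else "M"

def extra_fn(gender, _ratings, k):
    pool = [("die_hard", 4.2)] if gender == "F" else [("notebook", 4.0)]
    return pool[:k]

print(obfuscate_profile(train_fn, infer_fn, extra_fn, [], [("titanic", 5.0)], 1))
# → [('titanic', 5.0), ('die_hard', 4.2)]
```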

EP13784040.1A 2012-06-21 2013-06-10 Verfahren und vorrichtung zur verschleierung von benutzerdemografie Withdrawn EP2864940A2 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261662618P 2012-06-21 2012-06-21
PCT/US2013/044890 WO2014007943A2 (en) 2012-06-21 2013-06-10 Method and apparatus for obfuscating user demographics

Publications (1)

Publication Number Publication Date
EP2864940A2 true EP2864940A2 (de) 2015-04-29

Family

ID=49514015

Family Applications (1)

Application Number Title Priority Date Filing Date
EP13784040.1A Withdrawn EP2864940A2 (de) 2012-06-21 2013-06-10 Verfahren und vorrichtung zur verschleierung von benutzerdemografie

Country Status (5)

Country Link
EP (1) EP2864940A2 (de)
JP (1) JP2015521769A (de)
KR (1) KR20150023433A (de)
CN (1) CN104641386A (de)
WO (1) WO2014007943A2 (de)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125439A1 (en) * 2014-10-31 2016-05-05 The Nielsen Company (Us), Llc Methods and apparatus to correct segmentation errors
CN109189979B (zh) * 2018-08-13 2020-11-24 腾讯科技(深圳)有限公司 音乐推荐方法、装置、计算设备和存储介质
CN112185583B (zh) * 2020-10-14 2022-05-31 天津之以科技有限公司 一种基于贝叶斯网络的数据挖掘检疫方法

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US6970904B1 (en) * 1999-12-29 2005-11-29 Rode Consulting, Inc. Methods and apparatus for sharing computational resources
AU2002259247A1 (en) * 2002-02-25 2003-09-09 Predictive Media Corporation Compact implementations for limited-resource platforms
US20110153391A1 (en) * 2009-12-21 2011-06-23 Michael Tenbrock Peer-to-peer privacy panel for audience measurement
CN102387207A (zh) * 2011-10-21 2012-03-21 华为技术有限公司 基于用户反馈信息的推送方法和推送系统

Non-Patent Citations (2)

Title
None *
See also references of WO2014007943A2 *

Also Published As

Publication number Publication date
WO2014007943A2 (en) 2014-01-09
JP2015521769A (ja) 2015-07-30
KR20150023433A (ko) 2015-03-05
WO2014007943A3 (en) 2014-04-10
CN104641386A (zh) 2015-05-20


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20150116

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RIN1 Information on inventor provided before grant (corrected)

Inventor name: WEINSBERG, UDI

Inventor name: IOANNIDIS, STRATIS

Inventor name: TAFT, NINA

Inventor name: BHAGAT, SMRITI

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20180530

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20181211