JP2015521769A

JP2015521769A - Method and apparatus for obfuscating user demographics

Info

Publication number: JP2015521769A
Application number: JP2015518432A
Authority: JP
Inventors: Smriti Bhagat; Udi Weinsberg; Stratis Ioannidis; Nina Taft
Original assignee: Thomson Licensing SAS
Current assignee: Thomson Licensing SAS
Priority date: 2012-06-21
Filing date: 2013-06-10
Publication date: 2015-07-30
Also published as: WO2014007943A2; WO2014007943A3; CN104641386A; KR20150023433A; EP2864940A2

Abstract

A method of obfuscating the accurate detection of demographic information of a new user who provides ratings to a digital content service having a recommendation system includes training an inference engine that detects five items of demographic information. The training set includes movie ratings and demographic information from a plurality of other users. The new user enters ratings, such as movie ratings, and the inference engine determines the new user's demographic information. An obfuscation engine then adds movie ratings to the recommendation system such that the recommendation system's inference engine fails to accurately detect the new user's demographic information.

Description

（関連出願の相互参照）
本出願は、２０１２年６月２１日に出願された米国仮特許出願第６１／６６２６１８号「ＭｅｔｈｏｄａｎｄＡｐｐａｒａｔｕｓＦｏｒＯｂｆｕｓｃａｔｉｎｇＵｓｅｒＤｅｍｏｇｒａｐｈｉｃｓＢａｓｅｄｏｎＲａｔｉｎｇｓ（レイティングに基づいてユーザのデモグラフィックスを難読化する方法および装置」の優先権を主張し、その全体を援用により本明細書に組み込むものとする。 (Cross-reference of related applications)
This application is based on US Provisional Patent Application No. 61 / 664,618, “Method and Apparatus for Obfuscating User Demographics Based on Ratings, filed June 21, 2012, and a method of obfuscating user demographics based on ratings and "Device" priority is claimed and is incorporated herein by reference in its entirety.

The present invention relates generally to user profiling and user privacy in recommendation systems. More particularly, the present invention relates to the inference of demographic information.

Inference of user demographics has been studied in different contexts and for various types of user-generated data. In the context of interaction networks, graph structure has proved useful for inferring demographics using link-based information from blogs and from social network data derived from Facebook. Other work relies on textual features obtained from users' writing to infer demographics.

The main drawback of text-based inference is that most users do not write reviews, so these methods are not applicable. Similarly, a recommendation system may be unable to obtain the social network of the user whose details it wishes to infer.

It can be seen that a method of inferring user demographics from as little information as possible is desirable. The present invention relates to such an inference method.

This summary briefly introduces some of the concepts that are described further below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

The present invention includes a method and apparatus for obfuscating demographic information that can be determined from user ratings of digital content. In one embodiment, gender information may be determined from a user's movie ratings. To address privacy concerns, an obfuscation method and an obfuscation apparatus are presented. The obfuscation method includes training an inference engine that communicates with an obfuscation engine. The inference engine determines demographic information using a training data set that includes movie ratings and demographic information from a plurality of other users. Movie ratings are then received from a new user. The movie ratings from this particular user are received without demographic information. The new user's demographic information is determined using the trained inference engine. Additional movie ratings are then added to the ratings generated by the user. The additional ratings are generated so as to contradict the result that an external inference engine would obtain for the user's demographic information. The external inference engine may be part of a recommendation system that recommends movies for the user to watch.

Additional features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

The foregoing summary, as well as the following detailed description of exemplary embodiments, is better understood when read with reference to the accompanying drawings, which are given by way of example and do not limit the claimed invention.

An exemplary environment embodiment for an inference engine according to aspects of the present invention.
Receiver operating characteristic (ROC) plots of different classifiers for the Flixster training data set.
Receiver operating characteristic (ROC) plots of different classifiers for the Movielens training data set.
The improvement in precision with the size of the Flixster training data set.
The cumulative distribution function (CDF) of the confidence values for Flixster.
An example flow diagram for the use of an inference engine according to aspects of the present invention.
An example inference engine according to aspects of the present invention.
An example of a first embodiment of an obfuscation engine environment.
An example of a second embodiment of an obfuscation engine environment.
An example obfuscation engine according to aspects of the present invention.
An example flow diagram of an obfuscation engine according to aspects of the present invention.

In the following detailed description of various exemplary embodiments, reference is made to the accompanying drawings, which form a part hereof. The drawings show, by way of example, various embodiments in which the present invention may be practiced. It should be understood that other embodiments may be utilized and that structural and functional changes may be made without departing from the scope of the invention.

Profiling users through demographic information such as gender, age, income, and race is very important for targeted advertising and personalized content delivery. Recommendation systems can also benefit from such information to make personalized recommendations. However, users of recommendation systems often do not volunteer demographic information. This may be intentional, to protect privacy, or unintentional, out of inconvenience or indifference. Consequently, conventional collaborative filtering methods, which collect ratings from many users and extract meaningful information from the patterns that emerge, rely only on the ratings that users provide and do not use such information.

At first glance, disclosing ratings to a recommendation system may seem fairly harmless. Users certainly gain utility from this disclosure, namely the ability to discover relevant content and items. Nevertheless, considerable work has gone into inferring user demographics from user activity by correlating demographics with activity on social networks, blogs, microblogs, and the like. It is therefore natural to ask whether demographic information, including not only age, gender, and race but also political orientation, can likewise be inferred from the information disclosed to a collaborative filtering system. In fact, regardless of the rating value, the mere fact that a user has interacted with an item (for example, watched a particular movie, listened to a particular song, or bought a product) may be correlated with demographic information.

The potential success of such inference has several important implications. On the one hand, from the recommender's point of view, profiling users with respect to demographic information leads to several applications. Such profiling can inform recommendations and can also generate additional revenue from advertising, because advertisers are inherently interested in targeting specific demographic groups. The present invention relates to such inference techniques. The information to be inferred is assumed to be the user's gender, but the method of the present invention also applies when inferring other demographic characteristics (age, race, political orientation, and so on). Likewise, although the specific embodiments are directed to movie ratings, this is only an example. Any type of rating may be used, including but not limited to ratings of songs, digital games, products, and restaurants. For simplicity and clarity, the example of determining demographic information from movie ratings is used throughout, but the approach is applicable to other types of ratings.

FIG. 1 shows an exemplary system 100, that is, an environment for the inference engine described herein. Other environments are possible. The system 100 of FIG. 1 shows a recommendation system 130 that recommends content to users on a network 120. Typical examples of recommendation systems include the content recommendation systems operated by content providers such as Netflix (registered trademark), Hulu (registered trademark), and Amazon (registered trademark). Typically, the recommendation system 130 provides candidate digital content to subscribing users. Such content can include streaming video, DVDs by mail, books, articles, and merchandise. As an example, in the case of streaming video, candidate movies can be recommended to a user based on the user's past movie selections or on selected characteristics of the user's profile. The streaming video case is considered here as an example embodiment.

In the context of the present invention, the inference engine 135 may be a data processing device that can infer demographic information from the non-demographic information provided by a user 125 who submits movie ratings to the recommendation system 130. The inference engine 135 processes the movie ratings provided by the user 125 to infer demographic information. In the example case, the demographic information considered is gender. However, those skilled in the art will recognize that other demographic information may be inferred in accordance with aspects of the present invention. Such demographic information includes, but is not limited to, age, race, and political orientation.

In accordance with aspects of the present invention, and as described below, the inference engine 135 operates using training data acquired through users 1, 2 through n (105, 110 through 115, respectively). These users provide movie rating data and demographic information to the inference engine 135 via the recommendation system 130. The training data set may be acquired over time as users 105 through 115 use the recommendation system. Alternatively, the inference engine can receive the training data set by directly importing one or more data loads via an input port 136. Port 136 may be used to input a training data set from a network, a disk drive, or another data source containing the training data.

The inference engine 135 processes the training data set using an algorithm. The inference engine 135 then uses input from user 125 (user X) that includes movie ratings. A movie rating includes one or more items that can identify a movie, such as the movie title or a movie index or reference number, together with a rating value used to infer demographic information about the user 125. A "movie title" or, more generically, a "movie identifier" as used in this description is the name or title of a movie, show, documentary, series, digital game, or other digital content viewed by the user 125, or an identifier such as a database index. The rating value is a subjective measure, determined by the user 125, of the digital content viewed. Typically, the rating value is a quality assessment made by the user 125 on a scale of 1 to 5, where 1 is a low subjective score and 5 is a high subjective score. Those skilled in the art will recognize that other scales may equally be used, such as a numeric scale of 1 to 10, letter grades, a five-star scale, a ten-step scale in half stars, or verbal ratings ranging from "poor" to "excellent". Note that, in accordance with aspects of the present invention, the information provided by the user 125 does not include demographic information, and the inference engine 135 determines the demographic information of the user 125 from the movie ratings of the user 125 alone.

In accordance with aspects of the present invention, the training data set is used to train the inference engine 135. The training data set may be available to both the recommendation system 130 and the inference engine 135. The characteristics of the training data set are as follows. The training data set includes a set of users N = {1, ..., N}, each of whom has rated a subset of the movies M in the catalog. S_i ⊆ M denotes the set of movies for which a rating by user i ∈ N is in the data set, and r_ij, j ∈ S_i, denotes the rating that user i ∈ N gave to movie j ∈ M. In addition, for each i ∈ N, the training data set includes a binary variable y_i ∈ {0, 1} indicating the user's gender (bit 0 is mapped to male users). The training data set is assumed to be clean; that is, neither the ratings nor the gender labels have been tampered with or obfuscated.

Throughout this description, the recommendation mechanism is assumed to be matrix factorization, which is commonly used in commercial systems. Matrix factorization is used as an example, but any recommendation mechanism may be used. Alternative recommendation mechanisms include neighborhood methods (user clustering), contextual similarity of items, and other mechanisms known to those skilled in the art. The ratings for the set M \ S_0 are generated by appending the provided ratings to the rating matrix of the training set and factorizing the result. More specifically, each user i ∈ N ∪ {0} is associated with a latent feature vector u_i ∈ R^d, and each movie j ∈ M is associated with a latent feature vector v_j ∈ R^d. The regularized mean squared error is defined by the following equation.
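The equation itself does not survive in this text. A plausible reconstruction, assuming the standard regularized matrix factorization objective and using only the symbols defined around it (r_ij, u_i, v_j, μ, λ), might read as follows; the exact form and any normalizing constant in the original are not known.

```latex
\mathrm{MSE} \;=\; \sum_{i \in N \cup \{0\}} \; \sum_{j \in S_i}
  \left( r_{ij} - \langle u_i, v_j \rangle - \mu \right)^2
  \;+\; \lambda \sum_{i \in N \cup \{0\}} \lVert u_i \rVert_2^2
  \;+\; \lambda \sum_{j \in M} \lVert v_j \rVert_2^2
```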

Here, μ is the average rating over the entire data set. The vectors u_i and v_j are constructed by minimizing the MSE through gradient descent. The values d = 20 and λ = 0.3 are used. Having profiled both users and movies in this way, the rating of user 0 for a movie j ∈ M \ S_0′ is predicted as ⟨u_0, v_j⟩ + μ.
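The factorization and prediction steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the patent: it assumes a regularized squared-error objective, full-batch gradient descent, and small random initialization, and the function names (`factorize`, `predict`, `objective`), the learning rate, and the step count are all hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def objective(ratings, U, V, mu, lam):
    """Assumed objective: squared error on observed ratings plus L2 penalty."""
    err = sum((r - dot(U[i], V[j]) - mu) ** 2 for (i, j), r in ratings.items())
    reg = sum(dot(u, u) for u in U.values()) + sum(dot(v, v) for v in V.values())
    return err + lam * reg


def factorize(ratings, d=20, lam=0.3, lr=0.01, steps=200, seed=0):
    """ratings maps (user, movie) -> rating; returns (U, V, mu)."""
    rng = random.Random(seed)
    users = sorted({i for i, _ in ratings})
    movies = sorted({j for _, j in ratings})
    mu = sum(ratings.values()) / len(ratings)  # average rating of the data set
    U = {i: [rng.uniform(-0.1, 0.1) for _ in range(d)] for i in users}
    V = {j: [rng.uniform(-0.1, 0.1) for _ in range(d)] for j in movies}
    for _ in range(steps):
        # Gradient of the penalty term, then add the data term per rating.
        gU = {i: [2 * lam * x for x in U[i]] for i in users}
        gV = {j: [2 * lam * x for x in V[j]] for j in movies}
        for (i, j), r in ratings.items():
            e = r - dot(U[i], V[j]) - mu
            for k in range(d):
                gU[i][k] -= 2 * e * V[j][k]
                gV[j][k] -= 2 * e * U[i][k]
        # Simultaneous update of all latent vectors.
        for i in users:
            U[i] = [x - lr * g for x, g in zip(U[i], gU[i])]
        for j in movies:
            V[j] = [x - lr * g for x, g in zip(V[j], gV[j])]
    return U, V, mu


def predict(U, V, mu, i, j):
    """Predicted rating <u_i, v_j> + mu."""
    return dot(U[i], V[j]) + mu
```

Here user 0 would simply be one more key in `ratings`; after factorization, `predict(U, V, mu, 0, j)` gives the estimate for any movie j that appears in the data.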

Consider two example training data sets, Flixster and Movielens. Flixster is a publicly available online social network for rating and reviewing movies. With Flixster, users can enter demographic information in their profiles and share their movie ratings and reviews with friends and the public. The data set has 1 million users, of whom 34,200 share their age and gender. This subset of 34,200 users has rated 17,000 movies and provided 5.8 million ratings. The 12,800 men provided 2.4 million ratings, and the 21,400 women provided 3.4 million ratings. Flixster lets users give half-star ratings, but to keep the evaluation data sets consistent with each other, those ratings are rounded up to integers from 1 to 5. The other data set is Movielens. This second data set is publicly available from the Grouplens (trademark) research team. It consists of 3,700 movies and 1 million ratings by 6,000 users. The 4,331 men provided 750,000 ratings, and the 1,709 women provided 250,000 ratings.
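The rounding of Flixster half-star ratings can be illustrated with a one-line helper. This is only a sketch; the function name `to_integer_rating` is hypothetical, and it assumes ratings arrive as multiples of 0.5 between 0.5 and 5.0.

```python
import math


def to_integer_rating(half_star_rating):
    """Round a half-star rating (0.5, 1.0, ..., 5.0) up to an integer in 1..5."""
    return int(math.ceil(half_star_rating))
```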

To determine demographic information, the inference engine uses a classifier. As noted above, demographic information can include many characteristics. As an example demographic, the determination of gender is described as one embodiment of the present invention. However, the determination of another demographic characteristic, or of several, is also within the scope of the present invention.

To train the classifier, each user i ∈ N in the training data set is associated with a feature vector x_i ∈ R^M such that x_ij = r_ij if j ∈ S_i and x_ij = 0 otherwise. Recall that the binary variable y_i indicates the gender of user i; gender serves as the dependent variable in the classification. X ∈ R^(N×M) denotes the matrix of feature vectors, and Y ∈ {0, 1}^N denotes the vector of genders.
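The construction of X and Y can be sketched as follows. This is an illustrative helper under stated assumptions: `build_features` and its argument layout (a per-user dictionary of movie index to rating, with movies indexed from 0) are hypothetical and not taken from the source.

```python
def build_features(ratings_by_user, gender_by_user, num_movies):
    """Return (X, Y): X[i][j] is r_ij if user i rated movie j, else 0;
    Y[i] is the binary gender label y_i."""
    users = sorted(ratings_by_user)
    X = []
    for i in users:
        row = [0] * num_movies
        for j, r in ratings_by_user[i].items():
            row[j] = r
        X.append(row)
    Y = [gender_by_user[i] for i in users]
    return X, Y
```

A dense row per user keeps the sketch short; with thousands of movies and mostly unrated entries, a sparse representation would be the more practical choice.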

[Equation (1) is not reproduced in the source text.]

Next, class prior classification is described. Class prior classification serves as a baseline method for evaluating the performance of the other classifiers. Given a data set in which the gender classes of the population are unevenly distributed, this basic classification strategy classifies every user as having the majority gender. This is equivalent to using equation (1) with the generative model P(y | x) = P(y), where P(y) is estimated from the training set [estimating expression not reproduced in the source text].
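A minimal sketch of this baseline follows. It assumes that P(y) is estimated as the fraction of training users carrying each label, which is the usual empirical estimate but is not confirmed by the surviving text; the function names are hypothetical.

```python
from collections import Counter


def class_prior(Y):
    """Empirical estimate of P(y): share of training users with each label."""
    counts = Counter(Y)
    n = len(Y)
    return {label: c / n for label, c in counts.items()}


def classify_by_prior(prior):
    """Every user receives the majority label, regardless of ratings."""
    return max(prior, key=prior.get)
```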

Next, mixed Naive Bayes according to aspects of the present invention is described. This is an alternative to the multinomial Naive Bayes referred to above, and the inventors call it mixed Naive Bayes. The model is based on the assumption that users give normally distributed ratings. More specifically, [expression not reproduced in the source text].

Next, the use of logistic regression in the present invention is described. A significant drawback of all the Bayesian methods above is that they assume movie ratings are independent. To address this drawback, the inventors used logistic regression. Recall that linear regression produces a set of coefficients β = {β_0, β_1, ..., β_M}. A user i ∈ N with feature vector x_i is classified by first computing the probability P_i [expression not reproduced in the source text]. The user is classified as female if P_i < 0.5 and as male otherwise. The value P_i also serves as a confidence value for the classification of user i. One major advantage of logistic regression is that the coefficients β capture the degree of correlation between each movie and the class. In this case, a large positive β_j indicates that movie j is correlated with the class male, and a small (strongly negative) β_j indicates that movie j is correlated with the class female. The regularization parameter is chosen so that at least 1,000 movies correlated with each gender have non-zero coefficients.
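The classification rule can be sketched as below. The expression for P_i is missing from the text, so this sketch assumes the standard logistic form, P_i = 1 / (1 + exp(-(β_0 + Σ_j β_j x_ij))), read as the probability of the class male so that it agrees with the threshold rule stated above. The function names are hypothetical.

```python
import math


def male_probability(beta0, beta, x):
    """Assumed logistic form: sigmoid(beta0 + sum_j beta_j * x_j)."""
    z = beta0 + sum(b * v for b, v in zip(beta, x))
    # Numerically stable sigmoid for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(beta0, beta, x):
    """Female if P_i < 0.5, male otherwise; P_i doubles as the confidence."""
    p = male_probability(beta0, beta, x)
    return ("female" if p < 0.5 else "male"), p
```

With coefficients `[1.0, -1.0]`, a user who rated only the first movie highly is pushed toward male and a user who rated only the second is pushed toward female, which mirrors the interpretation of positive and negative β_j given above.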

In machine learning, a support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data and recognize patterns, and it is used for classification and regression analysis. As is well known in the art, an SVM intuitively finds a hyperplane that separates users belonging to different genders while minimizing the distance of misclassified users from the hyperplane. SVM retains many of the advantages of logistic regression: it does not assume independence in the feature space, and it produces coefficients. Because the feature space (the number of movies) is already quite large, a linear SVM is used in the classifier evaluation. After a logarithmic search over the parameter space (C), the inventors found that C = 1 gave the best results.

All algorithms were evaluated on both the Flixster and Movielens data sets. Ten-fold cross-validation was performed, average precision and recall were computed for both data sets, and the average receiver operating characteristic (ROC) curve was computed across the ten folds. For the ROC, the true positive rate is computed as the proportion of men in the data set who are correctly classified as men, and the false positive rate as the proportion of women in the data set who are incorrectly classified as men. Table 1 summarizes the classification results for three metrics: AUC, precision, and recall. Table 2 shows the same results broken down by gender. The ROC curves are shown in FIGS. 2a and 2b.
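The true and false positive rates described above, with male as the positive class, can be computed as in the following sketch. It assumes each user has a confidence score for the class male (for example, the P_i value of a logistic regression) and sweeps a threshold over the observed scores; the function names are hypothetical.

```python
def roc_point(is_male, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted_male = [s >= threshold for s in scores]
    males = sum(1 for m in is_male if m)
    females = len(is_male) - males
    tp = sum(1 for m, p in zip(is_male, predicted_male) if m and p)
    fp = sum(1 for m, p in zip(is_male, predicted_male) if not m and p)
    tpr = tp / males if males else 0.0
    fpr = fp / females if females else 0.0
    return fpr, tpr


def roc_curve(is_male, scores):
    """ROC points from the strictest threshold to the most permissive."""
    thresholds = sorted(set(scores), reverse=True)
    return [(0.0, 0.0)] + [roc_point(is_male, scores, t) for t in thresholds]
```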

As the ROC curves show, SVM and logistic regression outperform all of the Bayesian models, since their curves dominate the other curves on both data sets. In particular, logistic regression performed best on Flixster and SVM performed best on Movielens. The performance of the Bernoulli, mixed, and multinomial models does not differ greatly. This result is further confirmed by the AUC values in Table 1. The table also shows the weakness of the simple class prior model, which is outperformed by all of the other methods.

In general, precision in a classification task is the number of true positives (that is, the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (that is, the sum of the true positives and the false positives, which are items incorrectly labeled as belonging to the positive class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (that is, the sum of the true positives and the false negatives, which are items that should have been labeled as belonging to the positive class but were not).
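These two definitions translate directly into code. The sketch below is illustrative only; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Calling the function once per gender, with `positive` set to each label in turn, yields the kind of per-gender breakdown that Table 2 is said to report.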

Table 2 shows that, in terms of precision and recall, logistic regression outperforms all other models for both genders among Flixster users. For Movielens users, SVM outperforms all other algorithms, and logistic regression is second best. In general, inference performs better for the majority gender in each data set (female in Flixster, male in Movielens). This is especially evident for SVM, which shows very high recall for the majority class and low recall for the minority class. The mixed model improves significantly on the Bernoulli model, and its results are similar to those of the multinomial model. This suggests that a Gaussian distribution may not be a sufficiently accurate estimate of the rating distribution.

The effect of training set size was also evaluated. Because ten-fold cross-validation is used, the training set is large relative to the evaluation set. The Flixster data is used to assess how the number of users in the training set affects inference accuracy. In addition to ten-fold cross-validation, in which the evaluation set has 3,000 users, 100-fold cross-validation was performed with an evaluation set of 300 users. Furthermore, the training set was grown incrementally, starting from 100 users and adding 100 more users at each iteration.

FIG. 2c shows the precision of logistic regression inference on Flixster for the two evaluation set sizes. The figure shows that, for both sizes, roughly 300 users in the training set are enough for the algorithm to exceed 70% precision, and that 5,000 users in the training set yield a precision above 74%. This indicates that a relatively small number of users is sufficient for training.

With the SVM and linear regression classifiers characterized in detail on two available data sets and favorable results obtained, a novel method and apparatus for implementing an inference engine were devised. FIG. 3 illustrates a method, according to an aspect of the invention, that generates demographic information from the ratings of a user for whom no demographic information is available and puts the result to useful purposes. End uses of the generated demographic information include targeted advertising to user 125 and/or enhanced recommendations via recommendation system 130.

Method 300 of FIG. 3 begins at step 305 by inputting into an inference engine a training data set containing ratings and demographic information for a plurality of users. FIG. 1 shows inference engine 135 as part of recommendation system 130. This step may be accomplished through connection 137 of the recommendation system to network 120, or by direct input to inference engine 135 via port 136. When input arrives through network connection 137, the training data set may be accumulated one entry at a time from demographic and rating information (movie ratings or ratings of any other digital content), or may be loaded as one or more training data sets containing the demographic and rating information of at least one user. When input is made directly to inference engine 135 via input port 136, the data may be one or more downloaded training data sets for at least one user. In step 310, recommendation system 130 trains the inference engine using information from the training data set. Step 310 may be omitted if inference engine 135 receives the download directly via port 136. In either case, steps 305 and 310 represent training inference engine 135 with a training data set that contains both user demographic information and user rating information.

In step 315, a new user not included in the training data set, such as user 125, interacts with recommendation system 130 and provides ratings only. As noted above, these ratings may be, for example, movie ratings comprising movie identifier information and objective rating value information. The ratings provided by user 125 carry none of the demographic information that the inference engine seeks. After new user 125 has entered ratings into the recommendation system, inference engine 135 in step 320 uses a classification algorithm to determine the new user's demographic information from those ratings. The classification algorithm is preferably one of the support vector machine (SVM) or logistic regression classifiers described above.
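Steps 310 and 320 can be illustrated with a toy logistic regression classifier. The following is only a minimal sketch under stated assumptions, not the patented implementation: the function names, the label encoding (1 for female, 0 for male), the 1 to 5 rating scale, the learning rate, and the four-user training set are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(profiles, labels, num_movies, epochs=200, lr=0.1):
    """Fit one coefficient per movie by plain stochastic gradient descent.

    profiles: list of dicts {movie_id: rating on an assumed 1-5 scale}
    labels:   1 for female, 0 for male (hypothetical encoding)
    """
    beta = [0.0] * num_movies
    bias = 0.0
    for _ in range(epochs):
        for profile, y in zip(profiles, labels):
            z = bias + sum(beta[j] * r / 5.0 for j, r in profile.items())
            err = sigmoid(z) - y
            bias -= lr * err
            for j, r in profile.items():
                beta[j] -= lr * err * r / 5.0
    return beta, bias


def infer_gender(profile, beta, bias):
    """Return (label, confidence) for a ratings-only profile."""
    z = bias + sum(beta[j] * r / 5.0 for j, r in profile.items())
    p_female = sigmoid(z)
    label = "F" if p_female >= 0.5 else "M"
    return label, max(p_female, 1.0 - p_female)


# Hypothetical training set: movies 0-1 rated by women, movies 2-3 by men.
TRAIN_PROFILES = [{0: 5, 1: 4}, {0: 4, 1: 5}, {2: 5, 3: 4}, {2: 4, 3: 5}]
TRAIN_LABELS = [1, 1, 0, 0]
BETA, BIAS = train_logistic(TRAIN_PROFILES, TRAIN_LABELS, num_movies=4)
```

Calling `infer_gender` on a new ratings-only profile returns a label together with a confidence value, the quantity that an obfuscation mechanism would later try to reduce.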

Once the new user's demographic information has been determined, that information, such as gender, may be put to many useful purposes. Two examples are shown in FIG. 3. In one example, the demographic information determined in step 320 is used by recommendation system 130 in step 325 to make enhanced recommendations to the new user. For example, if recommendation system 130 is a movie recommendation system operated by Netflix™ or Hulu™, demographic information such as gender may be used to select more precisely movies suited to that user's gender for the new user to watch. Alternatively, recommendation system 130 can use the demographic information determined in step 320 for targeted advertising to the new user in step 330. For example, once the new user's gender has been determined, gender-specific advertisements may be targeted at that user. Such advertisements may include a discount offer on perfume for women or a discount offer on shavers for men. The recommendation system may have access to candidate advertisements from an internal database, an external database, or a network server (not shown).

Either or both of steps 325 and 330 may be performed as useful actions that exploit the demographic information extracted from the ratings provided by new user 125. Steps 315 through 330 may be repeated for each new user of the services of recommendation system 130. A user who receives an enhanced recommendation or advertisement from the recommendation system receives it on a display device associated with the user, such as user 125. Such user display devices are well known and include displays associated with home television equipment, stand-alone televisions, personal computers, and handheld devices such as personal digital assistants, laptops, tablets, mobile phones, and web notebooks.

FIG. 4 is a block diagram of inference engine 135. Inference engine 135 interfaces with recommendation system 130 as shown in FIG. 1. Inference engine interface 410 serves to connect the communication components of inference engine 135 to those of recommendation system 130. The link at 405 from inference engine interface 410 to the recommendation system may be serial or parallel, and embedded or external, as known to those skilled in the art. The inference engine may thus be integrated with the recommendation system or separate from it. Interface port 405 allows recommendation system 130 to provide training data to inference engine 135 and allows inference results to be provided to the recommendation system. An alternative training data set interface is input port 136, through which training data can conveniently be input from a network or from another digital data source such as a storage medium.

Processor 420 provides computation functions for inference engine 135. The processor may be any form of CPU or controller that uses communication between the elements of the inference engine to control its communication and computation processes. Those skilled in the art will recognize that bus 415 provides a communication path between the various elements of inference engine 135 and that other point-to-point interconnections are also feasible.

Program memory 430 can provide a repository of memory related to method 300 of FIG. 3. Data memory 440 can provide a repository for storing information such as training data sets, downloads, uploads, and scratchpad calculations. Those skilled in the art will recognize that memories 430 and 440 may be combined or separate and may be incorporated in whole or in part into processor 420. Processor 420 uses the storage and retrieval properties of the program memory to execute instructions, such as computer instructions, that perform the steps of method 300 in order to generate the demographic information used by recommendation system 130.

Estimator 450, which may be separate from or part of processor 420, serves to provide computational resources for determining demographic information from a new user's ratings. Estimator 450 can therefore provide computational resources for a classifier, preferably SVM or logistic regression. The estimator can supply data memory 440 or processor 420 with interim computations made while determining the new user's demographic information. These interim computations include the probabilities of demographic information associated with a new user who has provided only rating information. Estimator 450 may be hardware, but is preferably a combination of hardware and firmware or software.

Given a relatively small training set, the inference algorithms correctly predict a user's gender with a precision of 70% to 80%. However, the techniques described above for determining demographic information from user ratings may raise user privacy concerns. Some users may wish to obfuscate their demographic information so that it cannot be reliably determined. An obfuscation mechanism that protects otherwise detectable demographic information from reliable detection is described below.

FIG. 5a shows an exemplary environment 500 in which an obfuscation mechanism directed at inference engine 135 of the recommendation system may reside. The obfuscation mechanism can reside in several places: in a cloud connected to network 120, or on the device of user 125. When it resides in the cloud (not shown), the obfuscation mechanism is a network service offered to many users. When it resides on the user device, the obfuscation mechanism essentially comprises an inference engine with additional computational elements. For example, as shown in FIG. 5, obfuscation engine 126 can monitor recommendations from user 125 and add additional ratings to that user's ratings in order to reduce the accuracy of the inference engine residing in recommendation system 130.

In another embodiment, a content aggregator that delivers content to users can also act to protect a user's demographic information by offering an obfuscation engine alongside its content aggregation service. FIG. 5b shows such a content aggregator service. In the configuration of FIG. 5b, content aggregator 560 connects to network 120 via link 555 and can obtain access to digital content that may interest user 125. User 125 may access the content aggregator directly via link 582 or through network 120 via link 581. In either case, the content aggregator acts as a provider of digital content to user 125, supplying content to the user for a fee. The content provider may be recommendation system 130. Content aggregator 560 thus functions as a conduit for digital content that user 125 can rate. As a privacy service, the content aggregator can offer obfuscation to the user through obfuscation engine 570, which operates together with inference engine 575. Obfuscation engine 570 obfuscates the demographic information of user 125: when user 125 rates digital content obtained from recommendation system 130, the content provider, additional obfuscating ratings are added to the ratings sent to recommendation system 130. The added ratings work against an accurate determination of the demographic information. Consequently, inference engine 135 associated with the recommendation system cannot accurately determine the demographic information of user 125 from that user's ratings.

FIG. 5c shows an exemplary block diagram 590 of obfuscation engine 599. Obfuscation engine 599 interfaces with a network, such as network 120 of FIG. 5b, through network interface 591. Network interface 591 provides access over a network such as the Internet to user data, such as user ratings, and to training data sets. The receiver of the network interface can therefore receive training data and user-provided ratings such as movie ratings. In addition, the transmitter in network interface 591 can send the additional ratings generated by rating generator 595 to the network. In one embodiment, the additional ratings and the user-provided ratings are sent to recommendation system 130, where they prevent an inference engine, such as 135 in FIG. 5b, from accurately determining the user's demographic information.

Processor 592 provides computation functions for obfuscation engine 599. The processor may be any form of CPU or controller that uses communication between the elements of the obfuscation engine to control its communication and computation processes. Those skilled in the art will recognize that bus 597 provides a communication path between the various elements of obfuscation engine 599 and that point-to-point connections are feasible in place of a bus architecture.

Program memory 593 can provide a repository of memory related to method 600 of FIG. 6. Data memory 594 can provide a repository for storing information such as training data sets, downloads, uploads, or scratchpad calculations. Those skilled in the art will recognize that memories 593 and 594 may be combined or separate and may be incorporated in whole or in part into processor 592. Processor 592 executes a method such as method 600 using the instructions in program memory, thereby generating obfuscation data that works against an accurate determination of the user's demographic information. The obfuscation data is sent to the network-based recommendation system through network interface 591.

Inference engine 596, which may be separate from or part of processor 592, serves to provide computational resources for determining demographic information from a new user's ratings. The inference engine may therefore resemble the inference engine of FIG. 4 and may use the computational resources shown in FIG. 5c. Rating generator 595 operates to generate the ratings used by the obfuscation techniques described below. Specifically, the rating generator produces additional ratings that mimic user ratings but work against an accurate determination of demographic information by the inference engine in the recommendation system. The rating generator thus creates ratings to be sent to an external inference engine, such as the inference engine of the recommendation system (see FIG. 1). The additional ratings sent to the external inference system interfere with the ratings from the new user so that the user's demographic information is not easily determined accurately. Obfuscation engine 599 may be hardware-based, but is preferably a combination of hardware and firmware or software.

The characteristics of the obfuscation engine are described next. A user indexed by 0, such as user 125, views and rates digital content items such as movies. Suppose the universe of movies that the user can rate comprises a catalog of $M$ movies; the user rates a subset $S_0$ of the catalog $\mathcal{M} = \{1, 2, \ldots, M\}$. Let $r_{0j} \in \mathbb{R}$ denote the rating of movie $j \in S_0$, and define the user's rating profile as the set of (movie, rating) pairs $H_0 \equiv \{(j, r_{0j}) : j \in S_0\}$. Referring to FIG. 5, the user submits $H_0$ (i.e., 139) to the obfuscation mechanism, which outputs an altered rating profile $H'_0 = \{(j, r'_{0j}) : j \in S'_0\}$ (i.e., 138) with $S'_0 \neq S_0$. In brief, the obfuscation aims to balance two conflicting goals: (a) $H'_0$ can still be used to provide relevant recommendations to the user, and (b) it is difficult to infer the user's demographic information, such as gender, from $H'_0$.

More specifically, the obfuscated rating profile $H'_0$ is assumed to be submitted to recommendation system 130, which has a module implementing gender inference engine 135. Recommendation system 130 uses $H'_0$ to predict the user's ratings for the movies in $\mathcal{M} \setminus S'_0$ and, where appropriate, to recommend movies likely to interest the user. Gender inference engine 135 is a classification mechanism that uses the same $H'_0$ to profile the user and label the user as male or female.

Although the implementation of recommendation system 130 may be known, obfuscation engine 126 and gender inference engine 135 are not. As a first step toward this problem, a simple approach is taken in which neither recommendation system 130 nor inference engine 135 is aware that any kind of obfuscation is taking place. Both mechanisms take the profile $H'$ at face value and do not attempt to reverse-engineer the "true" profile $H$.

As noted above, recommendation system 130 and inference engine 135 have access to the training data set. The training data set is assumed to be pure, meaning that neither the ratings nor the demographic information, such as gender labels, has been tampered with or obfuscated. Obfuscation engine 126 may see part of the training set. In one embodiment, the training data set is public and obfuscation engine 126 has full access to it.

In general, the confidence value of the classifier used in inference engine 135 is the obstacle that the obfuscation engine must overcome when trying to hide demographic information such as gender from the classifier. The obfuscation engine seeks to lower this confidence value. An evaluation was therefore made of whether the classifier's confidence values differ between correct and incorrect classifications. For the classifier used in the inference engine, FIG. 2d shows the cumulative distribution function (CDF) of the confidence values for correct and for incorrect classifications. FIG. 2d shows that confidence is higher when the classification is correct: the median confidence value is 0.65 for incorrect classifications and 0.85 for correct ones. Furthermore, close to 20% of correct classifications have a confidence value of 1.0, compared with less than 1% of incorrect classifications.

The obfuscation engine comprises a mechanism that takes as input the rating profile $H_i$ of user $i$, a parameter $k$ giving the number of permitted alterations, and information from the training set, and that outputs an altered rating profile $H'_i$ from which the user's gender is hard to infer, while minimizing the impact on the quality of the recommendations received. In general, such a mechanism could alter $H_i$ by adding, deleting, or changing movie ratings. The focus here is on the setting in which the obfuscation engine is permitted only to add $k$ movie ratings, because deleting movies is impractical in most services, and because changing ratings is less effective than adding them when the viewing event itself is a strong predictor of the user's demographic attributes. Because the number of movies rated varies from user to user (and is small for some users), $k$ is not fixed; instead, the number of additions corresponds to a given percentage of the movies in the user's rating profile. To add movies to a user's profile, the obfuscation engine must make two key decisions: which movies to add, and which rating to assign to each.

These added movie ratings are referred to as additional ratings. The additional ratings work against an accurate determination of the user's demographic information. The rating value in each additional (title, rating value) pair is not assigned as mere "noise" but carries some useful value. For example, if the value corresponds to the average rating across all users, or to the rating predicted for the particular user (using matrix factorization), it is a reasonable prediction of how that user would have rated the movie after watching it.

To construct the obfuscation scheme, it is first assumed that the obfuscation mechanism has full access to the training data set and can use it to derive the information needed to select which movies and ratings to add. For movie selection in the obfuscation engine, the inventors chose three strategies. Each strategy takes as input the set $S_i$ of movies rated by user $i$, the number $k$ of movies to add, and ordered lists $L_M$ and $L_F$ of male-correlated and female-correlated movies, respectively, and outputs an altered set of movies $S'_i$, where $S_i \subseteq S'_i$. The lists $L_M$ and $L_F$ are stored in descending order of the value of a scoring function $w : L_M \cup L_F \to \mathbb{R}$, where $w(j)$ indicates how strongly movie $j \in L_M \cup L_F$ is correlated with the associated gender. A concrete example of a scoring function is $w(j) = \beta_j$, where $\beta_j$ is the coefficient of movie $j$ obtained by learning a logistic regression model from the training data set. This instantiation of the scoring function is the one used in the evaluation. It is further assumed that $k < \min(|L_M|, |L_F|) - |S_i|$ and that $L_M \cap L_F = \emptyset$.

The movie selection process is as follows. For a given female (or male) user $i$, initialize $S'_i = S_i$. Each strategy repeatedly picks a movie $j$ from $L_M$ (or $L_F$) and, if $j \notin S'_i$, adds $j$ to $S'_i$, until $k$ movies have been added. The set $S'_i$ is the desired output. The three strategies differ in how movies are picked from the ordered list.

Under the random strategy, for a given female (male) user $i$, a movie $j$ is picked uniformly at random from the list corresponding to the opposite gender, $L_M$ ($L_F$), regardless of the movie's score. Under the sampling strategy, movies are sampled according to the distribution of the scores associated with the movies in the list for the opposite gender. For example, given three movies $j_1$, $j_2$, $j_3$ with scores 0.5, 0.3, and 0.2, respectively, $j_1$ is picked with probability 0.5, and so on. Under the greedy strategy, the movie with the highest score in the list for the opposite gender is picked.
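The three selection strategies can be sketched as a single routine. This is an illustrative sketch only: the function name `select_movies`, its parameters, and the choice to sample without replacement are assumptions, and the scores are assumed to be positive (for example, the magnitudes of logistic regression coefficients).

```python
import random


def select_movies(rated, k, opposite_list, scores, strategy, rng=None):
    """Return S'_i: the rated set S_i plus k movies from the opposite-gender list.

    rated:         set of movie ids already rated by the user (S_i)
    opposite_list: movie ids for the opposite gender, in descending score order
    scores:        dict {movie_id: w(j)}, assumed positive
    strategy:      "random", "sampling", or "greedy"
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducibility (assumption)
    # Only movies not already in the profile are eligible.
    candidates = [j for j in opposite_list if j not in rated]
    if k > len(candidates):
        raise ValueError("k exceeds the number of unrated candidate movies")
    chosen = []
    for _ in range(k):
        if strategy == "random":
            pick = rng.choice(candidates)  # uniform, ignores scores
        elif strategy == "sampling":
            weights = [scores[j] for j in candidates]
            pick = rng.choices(candidates, weights=weights, k=1)[0]
        elif strategy == "greedy":
            pick = candidates[0]  # highest remaining score
        else:
            raise ValueError("unknown strategy: " + strategy)
        chosen.append(pick)
        candidates.remove(pick)  # never add the same movie twice
    return set(rated) | set(chosen)
```

With the scores 0.5, 0.3, and 0.2 from the example in the text, the sampling branch draws the first movie with probability 0.5 on the first pick, while the greedy branch always takes it.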

Turning to the assignment of rating values in the (title, rating value) pairs, it was noted earlier that the binary event of whether a movie is included in a profile (indicating whether it was watched) is a signal for gender inference almost as strong as the ratings themselves. This has an important implication for an obfuscation mechanism, which must decide both which movies to add to the user profile and which rating value to give each of them. The finding suggests that the choice of movies to add can have a large effect on hindering gender inference. If, however, the actual rating has little effect on gender inference, a rating value can be chosen that helps preserve the quality of recommendations. Given that the binary inclusion event is itself a signal for gender inference, the additional movies can be assigned ratings that have little impact on the recommendations provided to the user by recommendation system 130. Two rating assignments are proposed: the average movie rating and the predicted rating.

Under the average movie rating assignment, the obfuscation mechanism uses the available training data to compute the average rating of every movie $j \in S'_i - S_i$ and adds the computed average ratings to the altered rating profile $H'_i$ of user $i$. Under the predicted rating assignment, the obfuscation mechanism computes latent factors of the movies by performing matrix factorization on the training data set and uses those latent factors to predict the user's ratings. The predicted ratings of all movies $j \in S'_i - S_i$ are added to $H'_i$.
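The average movie rating assignment can be sketched as follows. The function names and the dictionary-based profile representation are assumptions made for illustration; the predicted-rating variant, which requires matrix factorization, is not shown.

```python
def average_ratings(training_profiles):
    """Compute the mean rating of each movie across the training profiles.

    training_profiles: list of dicts {movie_id: rating}
    """
    totals, counts = {}, {}
    for profile in training_profiles:
        for j, r in profile.items():
            totals[j] = totals.get(j, 0.0) + r
            counts[j] = counts.get(j, 0) + 1
    return {j: totals[j] / counts[j] for j in totals}


def assign_average_ratings(profile, altered_set, movie_avg):
    """Build H'_i: keep the user's own ratings, add averages for S'_i - S_i."""
    obfuscated = dict(profile)  # original ratings are left untouched
    for j in altered_set - set(profile):
        obfuscated[j] = movie_avg[j]
    return obfuscated
```

Keeping the user's own ratings unchanged and filling only the added movies reflects the add-only setting described above.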

The discussion above assumed that the obfuscation engine 126 has unrestricted access to the training set. However, the mechanism needs access only to the following quantities: (a) for movie selection, ordered lists of male-correlated and female-correlated movies; and (b) for rating assignment, the average movie ratings and the movie latent factors used to predict a user's movie ratings. Note that this information can be obtained from publicly available data sets, such as the Netflix Prize™ data set. Provided that the users of such a public data set are, as a whole, statistically similar to the users of the particular recommendation system, there is no need to assume full access to the specific training data set used by the inference engine 135.

All combinations of the movie selection and rating assignment strategies proposed above were evaluated. For each user i, values of k corresponding to 1%, 5%, and 10% of |S_i| were evaluated. The scores of the movies in lists L_M and L_F are set to the corresponding logistic regression coefficients.
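The evaluation setup can be illustrated with a short sketch of the three movie selection strategies (random, sampling, greedy) and of how k is derived from |S_i|. The code is a hedged illustration under stated assumptions: the function names, the rounding rule in `extra_count`, and the use of the score magnitude as the sampling weight are hypothetical choices made here, not details confirmed by the description.

```python
import random


def extra_count(profile_size, fraction):
    """k as a fraction of |S_i|; rounding and the minimum of 1 are assumptions."""
    return max(1, round(fraction * profile_size))


def select_movies(opposite_list, rated, k, strategy, rng=None):
    """Pick k unrated movies from the list correlated with the opposite gender.

    opposite_list -- list of (movie, score) pairs, e.g. L_M for a female user
    rated         -- set of movies already in the user's profile S_i
    strategy      -- "random", "sampled" or "greedy"
    Scores are assumed to be nonzero (the sampled strategy needs positive weights).
    """
    rng = rng or random.Random(0)
    candidates = [(m, s) for m, s in opposite_list if m not in rated]
    k = min(k, len(candidates))
    if strategy == "greedy":
        # Highest-magnitude scores first.
        ranked = sorted(candidates, key=lambda ms: abs(ms[1]), reverse=True)
        return [m for m, _ in ranked[:k]]
    if strategy == "random":
        # Uniform choice that ignores the scores.
        return rng.sample([m for m, _ in candidates], k)
    if strategy == "sampled":
        # Draw without replacement, with probability proportional to |score|.
        pool, chosen = list(candidates), []
        for _ in range(k):
            weights = [abs(s) for _, s in pool]
            idx = rng.choices(range(len(pool)), weights=weights)[0]
            chosen.append(pool.pop(idx)[0])
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")


movie_list = [("A", 0.9), ("B", 0.5), ("C", 0.1), ("D", 0.7)]
greedy_pick = select_movies(movie_list, {"A"}, 2, "greedy")
random_pick = select_movies(movie_list, {"A"}, 2, "random")
sampled_pick = select_movies(movie_list, {"A"}, 2, "sampled")
```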

Obfuscation increases privacy by reducing the performance of gender inference. Table 4 shows the inference accuracy for all three movie selection strategies (random, sampling, and greedy) when the assigned rating is the average movie rating. Accuracy is computed using 10-fold cross validation in which the model is trained on unaltered data and tested on obfuscated data. Because inference accuracy is highest for the logistic regression classifier, that classifier would be a natural choice as the inference mechanism of a recommendation system. For the Flixster data set, adding only 1% additional ratings with the greedy strategy reduces the accuracy to 15%, compared with 76.5% on unaltered data (a relative decrease of 80%), and adding 10% additional ratings brings the accuracy close to zero. In terms of the privacy-utility trade-off of the different obfuscation mechanisms, adding just 1% additional ratings to a user's profile thus lowers inference accuracy by 80%.

Thus, if the obfuscation mechanism selects movies according to the greedy strategy, adding a small number of movies is sufficient to obfuscate gender. Even when movies are selected with the random strategy, which ignores the movie scores and hence the logistic regression coefficients, adding only 10% additional movies correlated with the opposite gender is enough to reduce the accuracy of gender inference by 63% (from 76.5% to 28.5%). A similar trend is observed for the Movielens data set.
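The percentage decreases quoted for the Flixster data set are relative to the accuracy on unaltered data, not differences in percentage points. The arithmetic can be checked with a few lines; the helper name `relative_drop` is introduced here purely for illustration.

```python
def relative_drop(before, after):
    """Relative decrease in accuracy, in percent of the original accuracy."""
    return 100.0 * (before - after) / before


greedy_1pct = relative_drop(76.5, 15.0)    # greedy strategy, 1% additional ratings
random_10pct = relative_drop(76.5, 28.5)   # random strategy, 10% additional ratings
print(round(greedy_1pct), round(random_10pct))
```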

The obfuscation mechanism above uses ordered lists that correspond well to the inference mechanism's own notion of which movies are male-correlated or female-correlated. In general, however, the obfuscation mechanism does not know which inference algorithm is in use, so lists such as L_M and L_F may match the notion embedded in that algorithm less closely. The obfuscation mechanism was therefore also evaluated in this scenario, against a multinomial naive Bayes classifier and an SVM classifier. As Table 4 shows, obfuscation still performs well: with 10% additional ratings and the greedy strategy, the inference accuracy of the multinomial classifier falls from 71% to 42.1% on the Flixster data set and from 76% to 60% on the Movielens data set.

The effect on the quality of the recommendations that users see when they obfuscate their gender was also considered. This effect is measured by computing the root mean square error (RMSE) of matrix factorization on a held-out test set of 10 ratings for each user. Again, 10-fold cross validation was performed: the data of nine tenths of the users are unaltered, and the remaining tenth have ratings with added noise. That is, H' is used for one tenth of the users and H for the rest. This is equivalent to evaluating the change in RMSE when 10% of the users of the system obfuscate their gender. Overall, the inventors found that obfuscation has only a negligible effect on RMSE. For Flixster, RMSE increased as ratings were added, compared with the case of no additional ratings, but the increase was negligible. For the Movielens training data set, the additional ratings slightly reduce RMSE. This can happen because the additional ratings increase the density of the original rating matrix, which may improve the performance of the matrix factorization solution. Another explanation is that the additional ratings are not arbitrary but carry some meaning (they are the average over all users). The main result is that, for both data sets, the change in RMSE is not significant: at most 0.015 for Flixster (random strategy, 10% additional ratings) and at most 0.058 for Movielens (sampling strategy, 10% additional ratings). The obfuscation engine therefore preserves the quality of the recommendations made to users of the recommendation system.
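For reference, the RMSE metric used in this comparison can be computed as in the sketch below. The numbers in the example are made up for illustration only and are not results from the evaluation; the function name and argument layout are likewise assumptions.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and held-out ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical held-out ratings and two sets of predictions: one from a model
# fitted on unaltered profiles, one from a model fitted after obfuscation.
held_out = [4.0, 3.0, 5.0, 2.0]
pred_unaltered = [3.5, 3.0, 4.5, 2.5]
pred_obfuscated = [3.5, 3.2, 4.5, 2.5]

# The quantity of interest is the change in RMSE caused by obfuscation.
delta = rmse(pred_obfuscated, held_out) - rmse(pred_unaltered, held_out)
```

A small `delta` corresponds to the situation described above, in which obfuscation leaves recommendation quality essentially unchanged.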

The privacy-utility trade-off of the proposed obfuscation is considered next. Here, high privacy corresponds to low accuracy of gender inference, and high utility corresponds to low RMSE, which is often used as a proxy for high-quality recommendations. In the evaluation, the inventors found that for the Flixster training data set, utility decreases as privacy increases. As noted above, with the Movielens training data set, utility increases as privacy increases, but only slightly. The obfuscation mechanism can significantly reduce the accuracy of gender inference while causing only a small change in the quality of the recommendations.

Preserving the quality of recommendations is an attractive feature of the obfuscation engine. One evaluation considers the trade-off when the rating assignment follows the "predicted rating" approach. The motivation behind this rating assignment is that, in principle, the obfuscation leaves RMSE unchanged compared with the RMSE on unaltered data. In other words, with this choice of rating assignment no trade-off is made on the utility front. Table 5 shows the accuracy of gender inference with this rating assignment. The results are similar to those in Table 4, where the rating assignment is the average movie rating. For the Movielens training data set, the accuracy of gender inference is slightly lower when predicted ratings are used. For example, for the greedy strategy with 1% additional ratings, the accuracy of the logistic regression classifier falls from 57.7% to 48.4%. This benefit comes without sacrificing the quality of the recommendations. In conclusion, the experimental evaluation shows that with a small number of additional ratings, obfuscation can protect a user's gender without significantly changing the quality of the recommendations the user receives.

FIG. 6 shows an exemplary method 600 for creating a set of ratings (title, rating value) for a user that can hide that user's demographic information from accurate detection. An advantage of the method is that it does not adversely affect the recommendations the user would receive from the recommendation system 130 with its inference engine 135. The method begins at step 605 by inputting a training set of ratings from other users. The training data set contains both the ratings (title, rating value) and the demographic information of the other users. In step 610, the training data set is used to train an inference engine, such as 575 in FIG. 5b or 596 in FIG. 5c. The trained inference engine can determine the demographic information of the user 125. The trained inference engine thus emulates, to some extent, the function of the inference engine within the recommendation system accessed by the user 125, such as 130 in FIG. 5b.

After the inference engine has been trained, the obfuscation engine is ready for use by new users. In step 615, a new user, who is not one of the users in the training data set, provides ratings to the obfuscation engine. The obfuscation engine thus receives ratings, such as movie ratings. The received movie ratings consist only of (title, rating value) pairs and do not include the new user's demographic information.

In step 620, an inference engine such as 575 or 596 uses a classification algorithm to determine the new user's demographic information from that user's ratings. In step 625, the obfuscation engine generates ratings that counter an accurate determination of the demographic information by another inference engine. That is, the generated ratings are additional ratings that can be added to the user's ratings and that help obfuscate the user's detectable demographic information. As a simple example, if the inference engine infers that the gender of the user 125 is female, the additional ratings generated by the obfuscation engine supply data that lead to an incorrect inference of the user's gender. As a result, an external inference engine, such as the inference engine of the recommendation system, can no longer accurately determine the gender of the new user 125. In this way, the additional ratings counter accurate detection of the new user's demographic information.

In step 630, the obfuscation engine sends the additional ratings to the recommendation system (RS). This has the effect of obfuscating the demographic information of the user 125 as detected by the inference engine of the recommendation system 130. The obfuscation occurs because an external inference engine, such as 135 in FIG. 5b, receives not only the ratings the user normally generates but also additional ratings whose (title, rating value) pairs counter an accurate determination of the demographic information. That is, the additional ratings act to prevent the inference engine from accurately determining the user's demographic information. According to aspects of the invention, the additional ratings prevent the recommendation system 130 with the inference engine 135 from accurately determining the user's demographic information. However, the quality of the recommendations that the user 125 receives from the recommendation system 130 is not significantly degraded by the additional ratings. In essence, the quality of the recommendations received by the user 125 from the recommendation system 130 remains the same with the additional ratings as without them. Steps 615 to 630 may be repeated for each new user. Thus, a large number of new users can obfuscate their demographic information using the method 600.
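The flow of steps 605 to 630 can be sketched end to end on toy data. The sketch below is a deliberately simplified stand-in and not the described apparatus: the "inference engine" is a per-title gender score rather than the logistic regression, SVM, or naive Bayes classifiers discussed above, and all function names, data, and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict


def train_inference_engine(training):
    """Steps 605-610: learn per-title scores; positive means female-correlated."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for gender, ratings in training:
        for title in ratings:
            counts[title][gender] += 1
    return {t: (c["F"] - c["M"]) / (c["F"] + c["M"]) for t, c in counts.items()}


def mean_ratings(training):
    """Average rating per title, used for the average movie rating assignment."""
    sums, nums = defaultdict(float), defaultdict(int)
    for _gender, ratings in training:
        for title, value in ratings.items():
            sums[title] += value
            nums[title] += 1
    return {t: sums[t] / nums[t] for t in sums}


def infer_gender(scores, ratings):
    """Step 620: toy classifier; a total of exactly 0 arbitrarily maps to "M"."""
    total = sum(scores.get(title, 0.0) for title in ratings)
    return "F" if total > 0 else "M"


def generate_additional_ratings(scores, means, ratings, inferred, k):
    """Step 625: greedily pick k unrated titles correlated with the other gender."""
    sign = -1 if inferred == "F" else 1
    candidates = [t for t in scores if t not in ratings and sign * scores[t] > 0]
    candidates.sort(key=lambda t: sign * scores[t], reverse=True)
    return {t: means[t] for t in candidates[:k]}


training = [
    ("F", {"Romance": 5, "Drama": 4}),
    ("F", {"Romance": 4}),
    ("M", {"Action": 5, "War": 4}),
    ("M", {"Action": 3, "Drama": 2}),
]
scores = train_inference_engine(training)
means = mean_ratings(training)

new_user = {"Romance": 5}                      # step 615: ratings only
before = infer_gender(scores, new_user)        # step 620
extra = generate_additional_ratings(scores, means, new_user, before, k=2)
sent_to_recommender = {**new_user, **extra}    # step 630: user + additional ratings
after = infer_gender(scores, sent_to_recommender)
```

On this toy data the inferred gender flips once the additional ratings are included, which mirrors the intended effect on an external inference engine; real classifiers and data would of course behave less cleanly.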

Although specific architectures have been shown for implementing the obfuscation engine of FIGS. 5a, 5b, and 5c, those skilled in the art will recognize that implementation alternatives exist, such as distributing the functions of the components, consolidating components, or locating the engine in a server as a user-privacy service offered to users. Such alternatives are equivalent to the functions and structures of the configurations shown and described.

Claims

In a method of obfuscating demographic information of a specific user, performed by an obfuscation engine,
Training an inference engine communicatively coupled to the obfuscation engine to determine demographic information using a training data set including ratings from other users and demographic information;
Receiving a rating from the specific user that includes only rating information and sent to a recommendation system for evaluating the rating from the specific user;
Determining the demographic information of the particular user from the rating provided by the particular user;
Generating a rating against the determined demographic information of the particular user by the obfuscation engine;
Transmitting the generated rating to the recommendation system;
The generated rating of the specific user obfuscates the determination of the demographic information of the specific user by the recommendation system;
Method.

The method of claim 1, wherein the rating from the particular user is a movie rating that includes a movie title and movie rating value information and no demographic information.

The method of claim 1, wherein the demographic information of the particular user is gender information.

The method of claim 1, wherein transmitting the generated rating to the recommendation system includes transmitting an additional rating that is against the determined demographic information of the particular user.

The method of claim 4, wherein the additional rating is about 10% of a movie rating provided by the particular user.

The method of claim 1, wherein the determining step includes determining the demographic information of the particular user using a classifier.

The method of claim 6, wherein the classifier is one of a support vector machine, a logistic regression algorithm, and a Bayesian approach such as a naive Bayes model, a multinomial model, or a mixed model.

The method of claim 1, wherein the generated rating maintains a recommendation system quality.

An obfuscation device that obfuscates an accurate determination of demographic information of a particular user providing movie ratings to a recommendation system via the device,
A receiver in the network interface for inputting a training data set including movie ratings and demographic information from a plurality of other users;
A processor having access to memory for determining demographic information based on the movie rating by executing a program using an inference engine;
A rating generation unit for generating an additional rating against the determined demographic information;
A transmitter in the network interface for transmitting both the movie rating provided by the user and the additional rating to the recommendation system;
A combination of the user-provided rating and the additional rating prevents the recommendation system from determining the demographic information;
Obfuscation device.

The apparatus of claim 9, wherein the apparatus is part of a user device.

The apparatus of claim 9, wherein the movie rating includes movie title information and a movie rating value.

The apparatus of claim 1, wherein the determined demographic information of the particular user is gender information.

The apparatus of claim 1, further comprising a classifier that assists the processor in determining the demographic information for the particular user.

The apparatus of claim 1, wherein the classifier is one of a support vector machine and a logistic regression algorithm.