CN109299436B - Preference sorting data collection method meeting local differential privacy - Google Patents
- Publication number
- CN109299436B (application CN201811079995.XA)
- Authority
- CN
- China
- Prior art keywords
- preference
- data
- data collection
- distribution
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
The application discloses a preference sorting data collection method meeting local differential privacy. A user terminal converts preference sorting data using Rule I and Rule II, adds noise to the converted data and sends it to a data collection platform. The data collection platform and the user terminal cooperate to realize an algorithm meeting local differential privacy, complete the construction of the whole RI model, and generate preference sorting data based on the constructed model. With this method, the collected preference sorting data retain high data utility while privacy disclosure is avoided.
Description
Technical Field
The application relates to a data collection technology, in particular to a preference sorting data collection method meeting local differential privacy.
Background
Preference ranking data is a typical kind of personal data. For a user, preference ranking data refers to the ranking that the user gives to a given set of items (ItemSet) according to his or her own preference for the items in the set. For example, if the item set is {cola, white spirit, sprite, plain boiled water, beer} and a user's preference ranking of these five items is <white spirit, cola, beer, sprite, plain boiled water>, the user likes white spirit most and plain boiled water least. With the rapid development of information technologies such as the mobile internet and cloud computing, and the increasing popularity of mobile terminals such as smart phones, it is common for users to share their preference ranking data with various data collectors (e.g., service providers) through mobile applications in order to enjoy personalized services. For service providers, in turn, collecting and analyzing user preference ranking data is essential for providing a better user experience and creating new revenue opportunities. However, a user's preference ranking data often contains extremely sensitive personal information, and direct collection of such data by the data collector may cause serious privacy disclosure problems for individuals.
FIG. 1 is a diagram illustrating a current user preference data collection scenario. The scenario mainly involves two roles, namely users (the data contributors) and a data collector. Given an item set $\{x_1, x_2, \ldots, x_d\}$ formed by $d$ items, each user $u_i$ ($1 \le i \le n$) has preference ranking data $\sigma_i = \langle \sigma_i(1), \sigma_i(2), \ldots, \sigma_i(d) \rangle$, and users are independent of each other. Here $\sigma_i(j) = x_k$ means that the rank of $x_k$ in $\sigma_i$ is $j$. The data collector uses the data collection platform to collect the preference ranking data of each user over the network and so obtains a preference ranking data set, from which a model of the preference ranking data is constructed. The model can generate new preference ranking data that has the same statistical characteristics as the users' original preference ranking data without directly revealing the original data, so that user privacy is protected to a certain extent. The data collector may analyze the collected preference ranking data model directly, or may release the model or the newly generated preference ranking data to a third party (e.g., a research institution).
It can be seen from the above processing that, during the process of collecting the user preference ranking data, it is possible to prevent the user of the model and the new preference ranking data from acquiring the user privacy, but before the model of the preference ranking data is formed, the user preference privacy data may still be revealed. Specifically, for each user, there are three roles that may pose a threat to their privacy: 1) a data collector; 2) other users; 3) any potential attacker in addition to the data collector and other users.
Privacy-preserving preference ranking data collection offers a feasible solution to the personal privacy disclosure problem caused by collecting preference ranking data. Local Differential Privacy, proposed in recent years, is a differential privacy technique designed specifically for the privacy disclosure problem caused by data collection. In particular, the technique requires each data contributor to first add a suitable amount of noise to the data he or she owns and then send the noisy data to the data collector, thereby protecting the contributor's privacy.
There is already some work on data collection that satisfies local differential privacy. Based on information theory, Duchi et al. propose a high-dimensional data collection method for mean estimation and risk-minimization statistical tasks. Extending that method with sampling techniques, Nguyên et al. propose a data collection method known as Harmony. Specifically, for each piece of high-dimensional data, the method randomly selects one dimension of the data; if the dimension holds continuous data, it is collected with the method proposed by Duchi et al.; if the dimension holds discrete data, it is collected with the SH mechanism. To obtain the frequent items of multidimensional data, Qin et al. propose a two-phase data collection method called LDPMiner. In the first phase, the method uses the SH mechanism to determine a preliminary candidate space of frequent items from the noisy data; in the second phase, it derives precise frequent items using the RAPPOR mechanism. Based on the EM (Expectation Maximization) algorithm, Fanti et al. propose an extended RAPPOR mechanism. The mechanism assumes that all dimensions of the high-dimensional data are mutually independent, collects the data of every dimension with the RAPPOR mechanism, and uses these data as input to the EM algorithm to infer the joint distribution of the whole data, so the mechanism can be used to regenerate data resembling the original. However, when the data dimension is high, the mechanism has high time complexity and converges slowly. To address this problem, Ren et al. propose a new method that combines the EM algorithm with Lasso regression and greatly improves the efficiency of the extended RAPPOR mechanism.
A local differential privacy algorithm can be applied to preference ranking data directly, as follows: assume a value space formed by all possible preference rankings, regard the preference ranking data of each user as one discrete value in that space, and then directly apply a single-dimensional multi-valued data collection method satisfying local differential privacy, such as the RAPPOR, SH or OLH algorithm, to collect the data. However, the converted data has a huge value space: for a given item set $\{x_1, x_2, \ldots, x_d\}$, the size of the value space of the converted data is $d!$. These algorithms therefore introduce a large amount of noise into the collected data, making the resulting preference ranking data unusable.
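To make the size of this value space concrete, the following minimal Python sketch computes $d!$ for a few item-set sizes. The helper name `ranking_space_size` is hypothetical and is not part of the patent.

```python
import math


def ranking_space_size(d: int) -> int:
    """Number of distinct full rankings of d items, i.e. d!."""
    return math.factorial(d)


# Even modest item sets produce a value space far larger than
# any realistic number of users could populate.
for d in (5, 10, 20):
    print(f"d = {d:2d}: d! = {ranking_space_size(d)}")
```

For the five-item example above the space already has 120 values, and it grows factorially with the number of items.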
Disclosure of Invention
The application provides a preference sorting data collection method meeting local differential privacy, and preference sorting data collection realized by the method can ensure that the collected preference sorting data has higher data utility while avoiding privacy disclosure.
In order to achieve the purpose, the following technical scheme is adopted in the application:
A preference ranking data collection method satisfying local differential privacy, comprising:
the data collection platform initializes all vectors $z_j$ in a first vector set to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from a preset preference item set and sends it to the corresponding user terminal; wherein $i$ is the user terminal index and $j$ is the index of a preference item in the preference item set;
each user terminal assigns, according to the preference ranking data of its own user, the values $t_i[A_j]$ of a tuple $t_i$ over all attributes in an attribute set and, according to the preference item index $j$ sent to it by the data collection platform, uses $t_i[A_j]$ to generate a value subscript and sends the value subscript to the data collection platform; wherein $A_j$ is the $j$-th attribute in the attribute set, $t_i[A_j]$ denotes the value of $t_i$ on $A_j$, the number of attributes in the tuple equals the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute equals the rank of the corresponding preference item; the value subscript is drawn from $k \in \{1, 2, \ldots, |dom(A_j)|\}$ under a probability condition that depends on $I(t_i[A_j])$, where $I(t_i[A_j])$ denotes the index of $t_i[A_j]$ in $dom(A_j)$ and $dom(A_j)$ denotes the value space of attribute $A_j$;
the data collection platform, according to the value subscript sent by each user terminal, increases the corresponding value in the first vector set by 1;
the data collection platform updates each value $z_j[k]$ of all vectors in the first vector set to a corrected value computed with $\varepsilon'$, where $\varepsilon'$ is a preset first privacy budget;
the data collection platform determines two further kinds of distribution from the first vector set, uses the first vector set and these distributions to compute the mutual information of all triples in the preference item set, constructs a K-thin chain, and sends the K-thin chain to each user terminal;
the data collection platform initializes all vectors $z_j'$ in a second vector set to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from the preset preference item set and sends it to the corresponding user terminal; wherein $i$ is the user terminal index, $j$ is the index of a preference item in the preference item set, and the preference item indexes selected for different user terminals may be the same or different;
each user terminal assigns, according to the preference ranking data of its own user, the values $t_i'[A_j']$ of a tuple $t_i'$ over all attributes in an attribute set and, according to the preference item index $j$ sent to it by the data collection platform, uses $t_i'[A_j']$ to generate a value subscript and sends the value subscript to the data collection platform; wherein the attribute set comprises two subsets, one corresponding to the collection of leaf item sets of the K-thin chain and the other corresponding to its collection of internal item sets, and the value subscript is drawn under the same kind of probability condition as above;
the data collection platform, according to the value subscript sent by each user terminal, increases the corresponding value in the second vector set by 1;
the data collection platform updates each value $z_j'[k]$ of all vectors in the second vector set to a corrected value computed with $\varepsilon''$, where $\varepsilon''$ is a preset second privacy budget, $\varepsilon' + \varepsilon'' = \varepsilon$, and $\varepsilon$ is the total privacy budget for establishing the RI model;
obtaining, from the second vector set, the distribution information of the leaf nodes and the distribution information of the internal nodes of the K-thin chain;
generating preference ranking data using an RI model that includes distribution information of the leaf nodes and distribution information of the interior nodes.
Preferably, the method further comprises: generating a specified number of preference ranking data according to the mutual information of the triples and the distribution information of the leaf nodes and of the internal nodes of the K-thin chain.
Preferably, the data collection platform determining the first of the two kinds of distribution from the first vector set comprises:
for each distribution to be determined, constructing a Lasso regression model, wherein the response vector is a column vector of length $2d$ that stores the information of the two collected distributions, the design matrix is a binary matrix of size $2d \times d(d-1)$, and the coefficient vector is a column vector of length $d(d-1)$ that stores the information of the joint distribution;
solving the Lasso regression model by the least angle regression method, estimating the coefficient vector and thereby determining the joint distribution, and computing the distribution to be determined from the joint distribution.
Preferably, the data collection platform determining the second of the two kinds of distribution from the first vector set comprises:
for each distribution to be determined, constructing a Lasso regression model, wherein the response vector is a column vector of length $(d+2)$ that stores the information of the two known distributions, the design matrix is a binary matrix of size $(d+2) \times 2d$, and the coefficient vector is a column vector of length $2d$ that stores the information of the joint distribution;
solving the Lasso regression model by the least angle regression method, estimating the coefficient vector and thereby determining the joint distribution.
According to the above technical scheme, the user terminal converts preference sorting data using Rule I and Rule II, adds noise to the converted data and sends it to the data collection platform; the data collection platform and the user terminal cooperate to realize an algorithm meeting local differential privacy and build the whole RI model, and the established RI model is used to generate preference sorting data meeting local differential privacy. With this method, the collected preference sorting data retain high data utility while privacy disclosure is avoided.
Drawings
FIG. 1 is a diagram illustrating a current user preference data collection scenario;
FIG. 2 is a schematic view of an example of a 2-thin chain;
FIG. 3 is a first performance comparison diagram of the present application;
FIG. 4 is a second performance comparison diagram of the present application;
FIG. 5 is a third performance comparison diagram of the present application.
Detailed Description
For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.
To solve the problem noted in the Background, namely that existing local differential privacy methods yield unusable results when applied to preference ranking data, the applicant proposes a preference ranking data algorithm satisfying local differential privacy (the SAFARI algorithm). Its main idea is that the data collector collects distribution information on a series of small value spaces selected according to the riffle independent model (RI model), uses this distribution information to approximate the overall distribution of the preference ranking data and build a model, and then uses the model to generate preference ranking data. Because the SAFARI algorithm processes several small value spaces instead of one large value space, it can greatly reduce the scale of the noise.
The process of the present application is described in detail below.
At present, preference ranking data can be modeled with the RI model. Exploiting the mutual exclusivity between the dimensions of preference ranking data, the RI model approximates the overall distribution of the data by the product of two kinds of low-dimensional distribution, namely Relative Ranking Distributions and Interleaving Distributions, and thereby models the data effectively. New preference ranking data are then generated with the established model, which protects the privacy of the users.
The structure of the RI model is a binary tree called K-thin chain. Wherein the root node represents the original item set, the other nodes represent sub item sets of the original item set, and the size of the item set of the leaf node does not exceed a constant K. FIG. 2 is an example of a 2-thin chain.
In this example, the original item set {cola, white spirit, sprite, plain boiled water, beer} is first divided into two sub-item sets that are mutually exclusive and riffle independent of each other, namely {plain boiled water} and {cola, white spirit, sprite, beer}. Since the size of the sub-item set {cola, white spirit, sprite, beer} exceeds 2 (i.e., the value of K), it is further divided into the sub-item sets {cola, sprite} and {white spirit, beer}, which are again mutually exclusive and riffle independent of each other.
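The tree in this example can be written down as a small data structure. The following Python sketch is illustrative only: the class `Node`, the helper `leaf_sizes_within` and the constant `FIG2_CHAIN` are hypothetical names, and the check covers only the two properties stated in the text (children partition their parent, and no leaf holds more than K items).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    items: frozenset
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaf_sizes_within(root: Node, k: int) -> bool:
    """True if children partition each parent and every leaf has <= k items."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_leaf():
            if len(node.items) > k:
                return False
            continue
        if node.left is None or node.right is None:
            return False  # an internal node must have exactly two children
        if node.left.items & node.right.items:
            return False  # children must be mutually exclusive
        if node.left.items | node.right.items != node.items:
            return False  # children must cover the parent
        stack.extend([node.left, node.right])
    return True


# The 2-thin chain described for FIG. 2.
FIG2_CHAIN = Node(
    frozenset({"cola", "white spirit", "sprite", "plain boiled water", "beer"}),
    left=Node(frozenset({"plain boiled water"})),
    right=Node(
        frozenset({"cola", "white spirit", "sprite", "beer"}),
        left=Node(frozenset({"cola", "sprite"})),
        right=Node(frozenset({"white spirit", "beer"})),
    ),
)

print(leaf_sizes_within(FIG2_CHAIN, 2))
```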
The learning process of the RI model comprises two stages of structure learning and parameter learning:
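1) Structure learning. First, the mutual information of all triples in the item set is calculated, which is defined as follows:
Given an item set $\{x_1, x_2, \ldots, x_d\}$, for any triple $(x_i, x_j, x_k)$ in the set, where $x_i$, $x_j$ and $x_k$ are three different items, the mutual information of the triple is
$$I\big(\sigma^{-1}(x_i);\ [\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]\big) = \sum_{\sigma^{-1}(x_i)} \ \sum_{[\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]} \Pr\big(\sigma^{-1}(x_i), [\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]\big) \log \frac{\Pr\big(\sigma^{-1}(x_i), [\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]\big)}{\Pr\big(\sigma^{-1}(x_i)\big)\,\Pr\big([\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]\big)}$$
where $\sigma^{-1}(x_i)$ denotes the rank of item $x_i$ and $[\sigma^{-1}(x_j) < \sigma^{-1}(x_k)]$ is a binary variable. Specifically, $[\sigma^{-1}(x_j) < \sigma^{-1}(x_k)] = 1$ means $\sigma^{-1}(x_j) < \sigma^{-1}(x_k)$, that is, item $x_j$ is ranked before item $x_k$; $[\sigma^{-1}(x_j) < \sigma^{-1}(x_k)] = 0$ means $\sigma^{-1}(x_j) > \sigma^{-1}(x_k)$, that is, item $x_j$ is ranked after item $x_k$.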
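As an illustration of this definition, the sketch below computes the empirical mutual information between the rank of one item and the binary "ranked before" variable of two other items. It works on raw, non-private rankings, whereas the method in this application estimates the required distributions from noisy reports; the function name `triplet_mutual_information` and the choice of the natural logarithm are assumptions.

```python
import math
from collections import Counter


def triplet_mutual_information(rankings, a, b, c):
    """Empirical mutual information (in nats) between the rank of item `a`
    and the binary variable "item `b` is ranked before item `c`"."""
    n = len(rankings)
    joint = Counter()
    for sigma in rankings:
        rank_a = sigma.index(a) + 1                 # ranks are 1-based
        b_before_c = 1 if sigma.index(b) < sigma.index(c) else 0
        joint[(rank_a, b_before_c)] += 1

    # Marginal counts derived from the joint counts.
    rank_counts, flag_counts = Counter(), Counter()
    for (rank_a, flag), count in joint.items():
        rank_counts[rank_a] += count
        flag_counts[flag] += count

    mi = 0.0
    for (rank_a, flag), count in joint.items():
        p_joint = count / n
        # p_joint / (p_rank * p_flag) simplifies to count * n / (rank_count * flag_count)
        ratio = count * n / (rank_counts[rank_a] * flag_counts[flag])
        mi += p_joint * math.log(ratio)
    return mi


rankings = [("a", "b", "c"), ("c", "b", "a")]
print(triplet_mutual_information(rankings, "a", "b", "c"))
```

In the two-ranking example the rank of `a` fully determines whether `b` precedes `c`, so the value is the entropy of a fair binary variable.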
And then constructing the K-thin chain on the original item set by using an anchor point algorithm according to the mutual information of the triples.
2) Parameter learning. From the constructed K-thin chain, the distribution of each node is learned so as to approximate the overall distribution of the original preference ranking data set. The distributions of leaf nodes are called Relative Ranking Distributions, and the distributions of internal nodes (including the root node) are called Interleaving Distributions.
The relative sequencing distribution and the cross distribution are determined through the process, and the modeling of the RI model is completed.
The preference ranking data collection method of the present application is based on the RI model, and preference ranking data are generated from the established RI model. The difference is that, during modeling, the acquisition of the triple mutual information and of the relative ranking and interleaving distributions satisfies local differential privacy.
The preference ranking data algorithm (SAFARI algorithm) meeting the local differential privacy in the present application relates to two rules Rule I, Rule II and one SAFA algorithm, and specifically may include 5 stages:
1. Each user converts his or her own preference ranking data according to a conversion rule (denoted Rule I), so that the data collection platform can obtain the distribution information required for calculating the triple mutual information. The content of this distribution information is described in detail later.
2. Using a privacy budget of $\varepsilon'$ and with the cooperation of the users, the data collection platform calls the SAFA algorithm to collect, from the data converted by the users in stage 1, the distribution information required for calculating the triple mutual information. In the SAFA algorithm, a user adds noise to the converted data before sending it to the data collection platform, so that privacy disclosure is avoided. The data collection platform then calculates all the triple mutual information in the RI model using the collected distribution information.
3. Each user converts his or her own preference ranking data according to another conversion rule (denoted Rule II), so that the data collection platform can determine the Relative Ranking Distributions and Interleaving Distributions of the K-thin chain.
4. Using a privacy budget of $\varepsilon''$ and with the cooperation of the users, the data collector calls the SAFA algorithm to collect, from the data converted by the users in stage 3, the relative ranking distribution information and the interleaving distribution information of the K-thin chain. In the SAFA algorithm, a user adds noise to the converted data before sending it to the data collection platform, so that privacy disclosure is avoided.
Once the relative ranking distribution information and the interleaving distribution information are obtained, the construction of the RI model is complete.
5. According to the constructed riffle independent model, the data collector generates n pieces of new preference ranking data.
Through the processing of the stage 1 to the stage 4, the modeling of the preference ranking data RI meeting the local differential privacy can be realized. The data collection platform may issue the completed RI model to a third party, or, in order to better provide the preference ranking data to the third party, preferably, may further generate new preference ranking data to be issued to the third party through the stage 5 process.
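The generation step of stage 5 can be sketched as a recursive draw from the tree: each leaf samples a relative ranking of its items, and each internal node samples an interleaving that merges the rankings of its two children. The dictionary layout, the names `sample_ranking` and `EXAMPLE_MODEL`, and all probabilities below are hypothetical illustrations, not values from the application.

```python
import random


def sample_ranking(node, rng):
    """Draw one ranking (most preferred first) from a riffle independent tree."""
    if "ranking_dist" in node:  # leaf: sample a relative ranking
        orders, probs = zip(*node["ranking_dist"].items())
        return list(rng.choices(orders, weights=probs, k=1)[0])

    left = sample_ranking(node["left"], rng)
    right = sample_ranking(node["right"], rng)
    patterns, probs = zip(*node["interleaving_dist"].items())
    pattern = rng.choices(patterns, weights=probs, k=1)[0]

    # Merge the two child rankings according to the sampled interleaving.
    merged, li, ri = [], 0, 0
    for side in pattern:
        if side == "L":
            merged.append(left[li])
            li += 1
        else:
            merged.append(right[ri])
            ri += 1
    return merged


# Hypothetical model with the tree shape of FIG. 2; probabilities are made up.
EXAMPLE_MODEL = {
    "left": {"ranking_dist": {("plain boiled water",): 1.0}},
    "right": {
        "left": {"ranking_dist": {("cola", "sprite"): 0.7, ("sprite", "cola"): 0.3}},
        "right": {"ranking_dist": {("white spirit", "beer"): 0.6, ("beer", "white spirit"): 0.4}},
        "interleaving_dist": {("L", "R", "L", "R"): 0.5, ("R", "L", "R", "L"): 0.5},
    },
    "interleaving_dist": {("R", "R", "R", "R", "L"): 0.8, ("L", "R", "R", "R", "R"): 0.2},
}

rng = random.Random(0)
print([sample_ranking(EXAMPLE_MODEL, rng) for _ in range(3)])
```

Each interleaving pattern must contain as many "L" entries as the left child has items and as many "R" entries as the right child has items; the sketch does not validate this.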
In addition, in the above data collection method, stage 2 and stage 4 each perform part of the local differential privacy processing, so when the overall privacy budget is $\varepsilon$, the two budgets must satisfy $\varepsilon' + \varepsilon'' = \varepsilon$. Preferably, in practical applications a fixed split of $\varepsilon$ between $\varepsilon'$ and $\varepsilon''$ is usually used.
We will next introduce Rule I, Rule II and SAFA methods used in the above data collection method satisfying the local differential privacy, respectively.
Design Rule I
As described above, Rule I converts a user's preference ranking data, and the converted data are used to calculate the mutual information of triples, so Rule I must be designed according to how that mutual information is calculated. In particular, by the definition of mutual information, to compute it for any possible triple $(x_i, x_j, x_k)$ the data collection platform needs three types of distribution information: the distribution of the rank of $x_i$, $\Pr(\sigma^{-1}(x_i))$; the distribution of the binary order variable, $\Pr([\sigma^{-1}(x_j) < \sigma^{-1}(x_k)])$; and their joint distribution, $\Pr(\sigma^{-1}(x_i), [\sigma^{-1}(x_j) < \sigma^{-1}(x_k)])$.
To accomplish this task, an intuitive conversion method is for each user to convert his or her preference ranking data so as to provide information about all three types of distribution. In particular, each user converts the data into a tuple comprising multiple attributes, where each attribute corresponds to one of these distributions.
However, this conversion method puts redundant information into the user's converted data and increases the conversion complexity, because the first two types of distribution can be derived from the joint distribution.
In fact, the data collector only needs to collect the joint distributions and then derive the other two types of distribution from them. Thus, each user need only convert his or her preference ranking data to provide the joint distributions. Because there are $O(d^3)$ different joint distributions, the number of attributes in each user's converted tuple is $O(d^3)$.
Unfortunately, when $d$ is relatively large, the curse of dimensionality means that under LDP this conversion leaves a large amount of noise in the data collected by the data collector. To solve this problem, we designed Rule I. According to this conversion rule, each user only needs to convert his or her preference ranking data to provide the rank distributions of single items, $\Pr(\sigma^{-1}(x_i))$. The data collection platform only needs to collect these distributions and then estimate the other two types of distribution from them using a regression model. In particular, the applicant has found that this estimation problem is a sparse linear regression problem, so a Lasso regression model, which solves such problems effectively, is selected. The details of Rule I, and how the Lasso regression model is used to estimate the other two types of distribution, are described below.
Rule I: each user terminal $u_i$ first converts the preference ranking data $\sigma_i$ of its user into a tuple $t_i$ containing all attributes in an attribute set. Each attribute $A_j$ in the attribute set corresponds to one item $x_j$, and the value space of $A_j$ is $dom(A_j) = \{1, 2, \ldots, d\}$. $dom(A_j)$ consists of $d$ possible values, which represent the possible absolute ranks of $x_j$. Then, for each attribute $A_j$ in the attribute set, each user $u_i$ assigns $t_i[A_j]$ according to $\sigma_i$.
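A minimal sketch of the Rule I conversion follows: the j-th entry of the tuple is the absolute rank (1 to d) of the j-th item of the agreed item order. The function name `rule_one` and the argument names are hypothetical.

```python
def rule_one(ranking, item_order):
    """Convert a ranking (most preferred first) into the tuple of absolute
    ranks, one attribute per item, in the agreed `item_order`."""
    rank_of = {item: rank for rank, item in enumerate(ranking, start=1)}
    return tuple(rank_of[item] for item in item_order)


item_order = ["cola", "white spirit", "sprite", "plain boiled water", "beer"]
ranking = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]
print(rule_one(ranking, item_order))  # (2, 1, 4, 5, 3)
```

The platform and every user terminal must use the same `item_order`, otherwise the reported ranks would be attributed to the wrong items.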
Distribution estimation using Lasso. Based on the collected rank distributions of single items, the data collection platform can estimate the other two types of distribution as follows.
First, the data collection platform estimates the pairwise order distributions from the rank distributions of single items. In particular, for each pairwise order distribution, the data collector constructs a Lasso regression model, where
1) the response vector is a column vector of length $2d$ that stores the information of the rank distributions of the two items involved;
2) the design matrix is a binary matrix of size $2d \times d(d-1)$;
3) the coefficient vector is a column vector of length $d(d-1)$ that stores the information of the joint distribution of the ranks of the two items.
By solving the Lasso regression model with the least angle regression method, the data collection platform can estimate the coefficient vector and thereby obtain the joint distribution. From the joint distribution, the data collection platform can calculate the pairwise order distribution.
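The linear structure behind this regression can be illustrated without a solver. The sketch below builds one plausible binary matrix of size $2d \times d(d-1)$ that maps the joint rank distribution of two items onto their two marginal rank distributions; the names `marginal_to_joint_matrix`, `pair_columns` and `matvec`, and the row and column ordering, are assumptions. Solving the resulting Lasso problem (for example by least angle regression) is not shown.

```python
def pair_columns(d):
    """Column order: all rank pairs (r, s) with r != s, ranks 1..d."""
    return [(r, s) for r in range(1, d + 1) for s in range(1, d + 1) if r != s]


def marginal_to_joint_matrix(d):
    """Binary matrix M with 2d rows and d(d-1) columns such that
    M @ joint = [marginal of first item ; marginal of second item]."""
    cols = pair_columns(d)
    matrix = [[0] * len(cols) for _ in range(2 * d)]
    for c, (r, s) in enumerate(cols):
        matrix[r - 1][c] = 1        # first item has rank r
        matrix[d + s - 1][c] = 1    # second item has rank s
    return matrix


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


d = 3
M = marginal_to_joint_matrix(d)
uniform_joint = [1 / (d * (d - 1))] * (d * (d - 1))
print(matvec(M, uniform_joint))  # both marginals are uniform over 3 ranks
```

Because the system has $2d$ equations and $d(d-1)$ unknowns, it is underdetermined for larger $d$, which is why a sparsity-promoting estimator such as Lasso is a natural fit.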
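Then, the data collector estimates the joint distributions of an item's rank and the binary order variable from the rank distributions of single items and the pairwise order distributions. In particular, for each such joint distribution, the data collector constructs a Lasso regression model, where
1) the response vector is a column vector of length $(d+2)$ that stores the information of the rank distribution of the item and of the pairwise order distribution;
2) the design matrix is a binary matrix of size $(d+2) \times 2d$;
3) the coefficient vector is a column vector of length $2d$ that stores the information of the joint distribution.
Similarly, by solving the Lasso regression model with the least angle regression method, the data collector can estimate the coefficient vector and thereby obtain the joint distribution.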
It should be noted that, in Rule I above, the correspondence between each attribute $A_j$ in the attribute set and the preference items must be the same on the data collection platform side and on the user terminal side. That is, in SAFARI, the data collection platform and every user terminal must run the same Rule I. In addition, as the above processing shows, under this conversion rule the number of attributes in each user's converted tuple is $O(d)$. For each attribute $A_j$ in the attribute set, the value space size $|dom(A_j)|$ is only $d$, which is clearly much smaller than $d!$. By estimating the frequency of every possible value of each attribute in the set, the data collection platform obtains the rank distributions of single items and then estimates the other two types of distribution from them. Such processing does not generate a large amount of redundant data, so the utility of the user preference data is effectively improved.
Design Rule II
After the K-thin chain is constructed, the data collection platform needs to collect information about its Relative Ranking Distributions and Interleaving Distributions. Rule II is designed for this purpose. According to this conversion rule, each user terminal only needs to convert the preference ranking data of its user to provide information about these two types of distribution.
Rule II: each user terminal $u_i$ first converts the preference ranking data $\sigma_i$ of its user into a tuple $t_i$ containing all attributes in an attribute set. In particular, the attribute set consists of two subsets, a leaf subset and an internal subset, where:
The leaf subset corresponds to the collection of leaf item sets of the K-thin chain, excluding the leaf item sets that contain only one item. Because the relative ranking distribution of a leaf item set containing only one item can be inferred trivially, users do not need to provide information about such leaf item sets.
Each attribute $A_j$ in the leaf subset corresponds to one leaf item set $l_k$ in this collection. The value space of $A_j$ consists of all relative rankings of $l_k$. In particular, when $K = 1$, all leaf item sets contain only one item and the leaf subset is empty, where $K$ denotes the maximum number of items contained in a leaf item set.
Each attribute $A_j$ in the internal subset corresponds to one internal item set $g_k$ in the collection of internal item sets. The value space of $A_j$ consists of all interleavings of $g_k$. Then, for each attribute $A_j$ in the attribute set, each user $u_i$ assigns $t_i[A_j]$ according to $\sigma_i$.
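The two kinds of attribute value in Rule II can be sketched as follows. A relative ranking is the user's ranking restricted to the items of a leaf item set; an interleaving records, position by position, which child of an internal item set each item comes from. The function names and the "L"/"R" encoding are hypothetical choices for illustration.

```python
def relative_ranking(ranking, leaf_items):
    """Order of the leaf's items as they appear in the user's ranking."""
    return tuple(item for item in ranking if item in leaf_items)


def interleaving(ranking, left_items, right_items):
    """For an internal item set split into left/right children, mark each
    of its items (in ranking order) with the child it belongs to."""
    return tuple(
        "L" if item in left_items else "R"
        for item in ranking
        if item in left_items or item in right_items
    )


ranking = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]

# Leaf attributes of the FIG. 2 chain (single-item leaves are skipped).
print(relative_ranking(ranking, {"cola", "sprite"}))
print(relative_ranking(ranking, {"white spirit", "beer"}))

# Internal attributes: the inner node and the root.
print(interleaving(ranking, {"cola", "sprite"}, {"white spirit", "beer"}))
print(interleaving(ranking, {"plain boiled water"},
                   {"cola", "white spirit", "sprite", "beer"}))
```

Under this encoding a leaf with m items has m! possible values, and an internal node whose children hold p and q items has as many values as there are ways to choose p positions out of p + q.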
It should be noted that, in Rule II above, each attribute $A_j$ in the attribute set corresponds one-to-one to a leaf item set or an internal item set, and this correspondence must be the same on the data collection platform side and on the user terminal side. That is, in SAFARI, the data collection platform and every user terminal must run the same Rule II. In addition, as the above processing shows, under this conversion rule the number of attributes in each user's converted tuple is $O(d)$. For each attribute $A_j$ in the leaf subset, the maximum value space size is $K!$; for each attribute $A_j$ in the internal subset, the value space is the set of interleavings of the corresponding internal item set. The maximum value space size of any attribute in the attribute set is therefore clearly much smaller than $d!$. By estimating the frequency of every possible value of each attribute in the set, the data collection platform obtains the relative ranking and interleaving distribution information of the K-thin chain. Such processing does not generate a large amount of redundant data, so the utility of the user preference data is effectively improved.
SAFA process
In the data collection process of the present application, both stage 2 and stage 4 require SAFA processing on the data transformed by the user terminal, and the SAFA processing is described in detail here.
In order to collect distribution information required for constructing the RI model, the data collection platform needs to estimate the frequency of any one possible value of each attribute in the tuple after the transformation by the user under the condition that the LDP is satisfied.
The data collection platform could directly invoke Harmony, the state-of-the-art method for analyzing multi-attribute data under LDP, to complete this task. Specifically, for each attribute $A_j$ in the attribute set, the data collector maps the value space of $A_j$ into a binary matrix $\Phi_j$ of size $|dom(A_j)| \times |dom(A_j)|$. Then, for each user $u_i$, the data collector randomly selects one attribute from the set (say $A_r$) and invokes the SH algorithm [11] to collect the value of $u_i$'s tuple $t_i$ on $A_r$.
We observe that, whether Rule I or Rule II is applied, the value-space size of every attribute in the transformed attribute set is far smaller than $d!$. However, even for attributes with a small value space, Harmony still maps the value space of each dimension into a matrix, which adds unnecessary noise to the collected data, especially when handling binary attributes. The literature reports that the generalized randomized response algorithm performs best when estimating the frequencies of a small number of discrete values. We therefore propose a new LDP algorithm, named Sampling Randomizer for Multiple Attributes (SAFA), to perform frequency estimation more accurately on multi-attribute data with small value spaces while satisfying LDP. Its main idea is that each user terminal randomly selects one attribute, perturbs the value of that attribute with the generalized randomized response algorithm, and sends the perturbed result to the data collection platform.
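The user-side idea described above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard generalized randomized response probabilities (keep the true value with probability $e^{\varepsilon}/(e^{\varepsilon} + m - 1)$, otherwise report one of the other $m - 1$ values uniformly), and the function names `grr_perturb` and `safa_user_report`, the 0-based indices, and the `rng` parameter are hypothetical, not taken from the application.

```python
import math
import random


def grr_perturb(true_index, domain_size, eps, rng=random):
    """Perturb one value index with generalized randomized response.

    Assumption: standard GRR probabilities; indices are 0-based here.
    """
    if domain_size == 1:
        return true_index  # nothing to hide in a single-valued domain
    keep_prob = math.exp(eps) / (math.exp(eps) + domain_size - 1)
    if rng.random() < keep_prob:
        return true_index
    # Otherwise pick uniformly among the other domain_size - 1 indices.
    other = rng.randrange(domain_size - 1)
    return other if other < true_index else other + 1


def safa_user_report(tuple_values, domain_sizes, eps, rng=random):
    """Sample one attribute index j and report (j, perturbed value index)."""
    j = rng.randrange(len(tuple_values))
    return j, grr_perturb(tuple_values[j], domain_sizes[j], eps, rng)
```

For example, `safa_user_report([0, 1, 2], [2, 3, 4], 1.0)` returns a pair `(j, k)` in which `k` is a possibly perturbed index within the value space of attribute `j`. In the step list of the application the index `j` is chosen by the data collection platform and sent to the user; the sketch samples it locally only to stay self-contained.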
As mentioned above, it is necessary to apply SAFA algorithm for processing in stage 2 and stage 4, which is a process that needs to be performed by the user terminal and the data collection platform in cooperation. The SAFA algorithms applied in stage 2 and stage 4 are the same, but the distribution information to be obtained by the SAFA algorithms is different, and the processing of the SAFA algorithms is described in the following.
The specific process of SAFA is as follows:
1. The data collection platform initializes all vectors in the vector set, that is, every value in each vector is set to 0; here, for stage 2, the vector set is the first vector set, and for stage 4, the vector set is the collection of the relative-ordering distribution set and the cross distribution set;
2. For each user terminal $u_i$, the following operations are performed:
2.1. The data collection platform randomly selects an index $j$ from the attribute set obtained by the Rule I or Rule II transformation;
2.2. The data collection platform sends $j$ to $u_i$;
2.3. $u_i$ generates a perturbed value index $k$ from $t_i[A_j]$ and sends it to the data collection platform, where $k \in \{1, 2, \ldots, |dom(A_j)|\}$;
3. After the above operations have been performed for all user terminals, the data collection platform updates each value $z_j[k]$ of all vectors in the vector set.
This completes the description of the SAFA process. To show that the SAFA method of the present application satisfies local differential privacy, a theoretical proof is given below.
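The platform-side counting and update in the SAFA process can be illustrated with the following Python sketch. The update formula of the application is not legible in this text, so the sketch substitutes the standard unbiased estimator for generalized randomized response, $(c/n_j - q)/(p - q)$, normalized by the number of reports $n_j$ received for attribute $j$. The names `grr_probs` and `safa_aggregate` and the 0-based indices are assumptions made for illustration.

```python
import math


def grr_probs(eps, domain_size):
    """Return (p, q) of standard GRR: p keeps the true value, q reports one specific other value."""
    denom = math.exp(eps) + domain_size - 1
    return math.exp(eps) / denom, 1.0 / denom


def safa_aggregate(reports, domain_sizes, eps):
    """Estimate per-attribute value frequencies from (j, k) reports.

    Assumption: this is the textbook GRR estimator, not necessarily
    the exact update formula used in the application.
    """
    # Initialize every vector z_j to zeros, then count the reports.
    counts = [[0] * m for m in domain_sizes]
    for j, k in reports:
        counts[j][k] += 1

    estimates = []
    for z_j, m in zip(counts, domain_sizes):
        n_j = sum(z_j)
        if n_j == 0:
            estimates.append([0.0] * m)  # no report for this attribute
            continue
        p, q = grr_probs(eps, m)
        estimates.append([(c / n_j - q) / (p - q) for c in z_j])
    return estimates
```

Because the correction subtracts the expected contribution of noise, individual estimates can be slightly negative or exceed 1 on small samples; post-processing such as clipping is a separate design choice that the sketch leaves out.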
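Theorem: For any user $u_i$ and privacy budget $\varepsilon'$, SAFA satisfies $\varepsilon'$-LDP.
Proof:
By the definition of LDP, for any two different tuples $t_i$ and $t_i'$ and any possible output, where $j$ is the index of the attribute selected by the data collector, we need to show that the probability of producing that output from $t_i$ is at most $e^{\varepsilon'}$ times the probability of producing it from $t_i'$.
Since $j$ is selected at random, independently of the user's tuple, this ratio depends only on the perturbation of the value on $A_j$ (1).
We discuss (1) in all 4 possible cases.
In every case the bound holds. Therefore, the theorem is proved.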
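The four cases can be made concrete under the assumption that SAFA uses the standard generalized randomized response probabilities; the exact expressions of the application are not legible in this text, so the symbols $m$, $p$ and $q$ below are introduced only for illustration. With $m = |dom(A_j)|$, let $p$ be the probability of reporting the true index and $q$ the probability of reporting one specific other index:

```latex
p = \frac{e^{\varepsilon'}}{e^{\varepsilon'} + m - 1}, \qquad
q = \frac{1}{e^{\varepsilon'} + m - 1}, \qquad
\frac{p}{q} = e^{\varepsilon'} .

% Ratio of the probabilities of reporting k from t_i and from t_i',
% depending on whether k equals the true index of each tuple on A_j:
\frac{\Pr[k \mid t_i[A_j]]}{\Pr[k \mid t_i'[A_j]]} =
\begin{cases}
  p/p = 1                       & \text{$k$ is the true index of both,} \\
  p/q = e^{\varepsilon'}        & \text{$k$ is the true index of $t_i$ only,} \\
  q/p = e^{-\varepsilon'}       & \text{$k$ is the true index of $t_i'$ only,} \\
  q/q = 1                       & \text{$k$ is the true index of neither.}
\end{cases}
```

Each of the four values is at most $e^{\varepsilon'}$, and the probability of selecting $j$ is the same for both tuples and cancels in the ratio, which is consistent with the $\varepsilon'$-LDP bound stated in the theorem.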
The above formal analysis shows that the preference ranking data collection algorithm satisfying local differential privacy (SAFARI) of the present application guarantees local differential privacy for every user while ensuring that the data collected by the data collector has high utility.
Here, the above preference ranking data collection method satisfying the local differential privacy in the present application is summarized as follows:
1. The data collection platform initializes all vectors $z_j$ in the first vector set to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from the preset preference item set and sends it to the corresponding user terminal, where $i$ is the user terminal index and $j$ is the index of a preference item in the preference item set;
2. Each user terminal assigns values to the tuple $t_i$ over all attributes of the attribute set according to the user's own preference ranking data and, according to the preference item index $j$ sent to it by the data collection platform, uses $t_i[A_j]$ to generate a value index $k$ and sends it to the data collection platform. Here $A_j$ is the $j$-th attribute in the attribute set; the number of attributes in the tuple equals the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute equals the rank of the corresponding preference item. The value index $k$ is drawn according to the perturbation probability distribution of SAFA;
3. Using the value index $k$ sent by each user terminal, the data collection platform increases the corresponding value $z_j[k]$ in the first vector set by 1;
4. The data collection platform updates each value $z_j[k]$ of all vectors in the first vector set, where $\varepsilon'$ is a preset first privacy budget;
5. From the first vector set, the data collection platform determines two further distribution sets, uses the first vector set together with these two sets to compute the mutual information for all triples of items in the preference item set, constructs the K-thin chain, and sends it to each user terminal;
6. The data collection platform initializes all vectors $z_{j'}$ in the second vector set to zero vectors and, for each user terminal $u_i$, selects an index $j'$ and sends it to the corresponding user terminal, where $i$ is the user terminal index and the indices selected for different user terminals may be the same or different;
7. Each user terminal assigns values to the tuple $t_i'$ over all attributes of the attribute set according to the user's own preference ranking data and, according to the index $j'$ sent to it by the data collection platform, uses $t_i'[A_{j'}]$ to generate a value index and sends it to the data collection platform. Here the attribute set comprises two subsets, the first corresponding to the collection of leaf item sets and the second corresponding to the collection of internal item sets. The value index is again drawn according to the perturbation probability distribution of SAFA;
8. Using the value index sent by each user terminal, the data collection platform increases the corresponding value in the second vector set by 1;
9. The data collection platform updates each value $z_{j'}[k]$ of all vectors in the second vector set, where $\varepsilon''$ is a preset second privacy budget, $\varepsilon' + \varepsilon'' = \varepsilon$, and $\varepsilon$ is the total privacy budget for building the RI model;
10. From the second vector set (i.e., the collection of the relative-ordering distribution set and the cross-ordering distribution set), the distribution information of the leaf nodes and the distribution information of the internal nodes is obtained;
11. New preference ranking data is generated using the established RI model, where the RI model comprises the distribution information of the leaf nodes and the distribution information of the internal nodes.
In the above data collection method, the processing of step 1 and the user terminal's assignment of $t_i[A_j]$ in step 2 may be performed in any order; likewise, the processing of step 6 and the user terminal's assignment of $t_i'[A_{j'}]$ in step 7 may be performed in any order.
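The tuple assignment in step 2 (one attribute per preference item, whose value is the rank of that item) can be sketched as follows. The function name `ranking_to_tuple`, the 0-based item indices and the 1-based ranks are assumptions chosen for illustration.

```python
def ranking_to_tuple(ranking):
    """Convert a preference ranking into a rank tuple.

    ranking: item indices (0-based) listed from most to least preferred.
    Returns t with t[j] = rank (1-based) of item j, so the tuple has one
    attribute per preference item.
    """
    d = len(ranking)
    if sorted(ranking) != list(range(d)):
        raise ValueError("ranking must be a permutation of 0..d-1")
    t = [0] * d
    for rank, item in enumerate(ranking, start=1):
        t[item] = rank
    return t
```

For instance, a user who prefers item 2, then item 0, then item 1 has the ranking `[2, 0, 1]`, which gives the tuple `[2, 3, 1]`: item 0 is ranked second, item 1 third, and item 2 first.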
Next, a comparison with RAPPOR, SH and OLH shows that the SAFARI method proposed in the present application has a significant advantage in the utility of the data collected by the data collection platform. To illustrate this, the first-order marginal distribution ($Q_1$) and the second-order marginal distribution ($Q_2$) are used to measure the utility of the preference ranking data collected by the four algorithms RAPPOR, SH, OLH and SAFARI. For both, utility is measured by the $L_1$ distance between the marginal distribution of the data generated by each algorithm and that of the original data. The experimental setup is as follows: the performance of each method was tested on two real data sets, Sushi and Jester, whose characteristics are shown in Table 2.
Table 2. Data set characteristics
Data set | Number of users | Number of items |
---|---|---|
Sushi | 5,000 | 3~10 |
Jester | 20,000 | 3~10 |
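The $L_1$-distance utility measure described above can be sketched in Python. The sketch assumes that the first-order marginal of a set of rankings is, for each item, the distribution of the position at which the item is ranked, and that distances are averaged over items; both are assumptions about the experimental setup, and the function names are hypothetical.

```python
def first_order_marginals(rankings, d):
    """marg[item][pos] = fraction of rankings that place item at position pos (0-based)."""
    marg = [[0.0] * d for _ in range(d)]
    weight = 1.0 / len(rankings)
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            marg[item][pos] += weight
    return marg


def l1_distance(p, q):
    """L1 distance between two discrete distributions of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_l1(marg_a, marg_b):
    """Average per-item L1 distance between two sets of marginals."""
    return sum(l1_distance(a, b) for a, b in zip(marg_a, marg_b)) / len(marg_a)
```

A smaller average distance between the marginals of the generated rankings and those of the original rankings indicates higher utility; identical data sets give a distance of 0.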
The performance of the SAFARI process is illustrated by analyzing the experimental data below.
First, the first-order marginal distribution and second-order marginal distribution are used to measure the performance of the four methods of RAPPOR, SH, OLH and SAFARI. The results of the experiment are shown in FIG. 3.
As can be seen from FIG. 3, on both data sets, as the privacy budget increases, the $L_1$ distance between the first-order and second-order marginal distributions of the data generated by RAPPOR, SH, OLH and SAFARI and the marginal distributions of the original data set decreases, and the $L_1$ distance of SAFARI is consistently smaller than that of RAPPOR, SH and OLH. This is because, on the one hand, the K-thin chain makes the accuracy of the distribution information collected with SAFARI very robust, so the added noise has little influence; on the other hand, RAPPOR, SH and OLH introduce a large amount of noise when the privacy budget is reduced.
Next, we tested the effectiveness of Rule I using the data sets Sushi and Jester. To this end, we compare it with another version of Rule I (referred to here as the alternative version of Rule I). In the alternative version, each user transforms his or her preference ranking data so as to provide the required distributions directly. The data collector uses the SAFA method to collect distribution information from the data converted by the users according to Rule I and according to the alternative version, respectively, and the average $L_1$ distance of the distributions in $S_3$ so obtained is reported. The results of the experiment are shown in FIG. 4.
As can be seen from FIG. 4, on both data sets the alternative version yields better utility when $d$ does not exceed 4. This is because, when $d$ is relatively small, the advantage of Lasso regression does not outweigh the impact of information loss. When $d$ is relatively large, however, Rule I yields quite good results, which demonstrates the superiority of Rule I.
Finally, we tested the effectiveness of the SAFA algorithm using the data sets Sushi and Jester. To this end, we compare it with the Harmony method. The data collector uses the SAFA and Harmony methods, respectively, to collect $S_1$ from the data transformed by the users according to Rule I, and the average $L_1$ distance of the distributions obtained is reported. The results of the experiment are shown in FIG. 5.
As can be seen from fig. 5, the distribution information collected using the SAFA method contains a smaller amount of noise for different data sets. This is because, when the value space of the attribute is small, unnecessary noise is introduced by the process of mapping the value space of each attribute to a matrix in the Harmony algorithm.
According to the data collection method, the preference sorting data collection is realized, privacy disclosure is avoided, and meanwhile the collected preference sorting data is guaranteed to have high data utility.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (5)
1. A method for collecting preference ranking data satisfying local differential privacy, comprising:
the data collection platform initializes all vectors $z_j$ in a first vector set to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from a preset preference item set and sends it to the corresponding user terminal; wherein $i$ is the user terminal index and $j$ is the index of a preference item in the preference item set;
each user terminal assigns values to a tuple $t_i$ over all attributes of an attribute set according to the user's own preference ranking data and, according to the preference item index $j$ sent to that user terminal by the data collection platform, uses $t_i[A_j]$ to generate a value index $k$ and sends it to the data collection platform; wherein $A_j$ is the $j$-th attribute in the attribute set, $t_i[A_j]$ denotes the value of $t_i$ on $A_j$, the number of attributes in the tuple equals the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute equals the rank of the corresponding preference item; the value index satisfies a prescribed probability condition, in which $I(t_i[A_j])$ denotes the index of $t_i[A_j]$ in $dom(A_j)$, $dom(A_j)$ denotes the value space of attribute $A_j$, and $\Pr[\cdot]$ denotes a probability;
the data collection platform uses the value index $k$ sent by each user terminal to increase the corresponding value $z_j[k]$ in the first vector set by 1;
the data collection platform updates each value $z_j[k]$ of all vectors in the first vector set; wherein $\varepsilon'$ is a preset first privacy budget and $n$ denotes the total number of user terminals;
the data collection platform determines two further distribution sets from the first vector set, uses the first vector set together with these two sets to compute the mutual information for all triples of items in the preference item set, constructs a K-thin chain, and sends it to each user terminal; wherein a binary variable marks whether one item is ranked before or after another item, $k_1, k_2, k_3$ denote the indices, in the preference item set, of the items in any triple, $d$ is the total number of items contained in the preference item set, and the rank of an item denotes its position in the ranking;
the data collection platform initializes all vectors $z_{j'}$ in a second vector set to zero vectors and, for each user terminal $u_i$, selects an index $j'$ and sends it to the corresponding user terminal; wherein $i$ is the user terminal index, and the indices selected for different user terminals are the same or different;
each user terminal assigns values to a tuple $t_i'$ over all attributes of an attribute set according to the user's own preference ranking data and, according to the index $j'$ sent to that user terminal by the data collection platform, uses $t_i'[A_{j'}]$ to generate a value index and sends it to the data collection platform; wherein the attribute set comprises two subsets, the first corresponding to the collection of leaf item sets and the second corresponding to the collection of internal item sets; a further collection is composed of the leaf item sets containing only one item; $|\cdot|$ denotes the total number of items contained in a set; and the value index satisfies a prescribed probability condition;
the data collection platform uses the value index sent by each user terminal to increase the corresponding value in the second vector set by 1;
the data collection platform updates each value $z_{j'}[k]$ of all vectors in the second vector set; wherein $\varepsilon''$ is a preset second privacy budget, $\varepsilon' + \varepsilon'' = \varepsilon$, and $\varepsilon$ is the total privacy budget for building an RI model;
obtaining, according to the second vector set, the distribution information of the leaf nodes and the distribution information of the internal nodes;
generating preference ranking data using an RI model that includes distribution information of the leaf nodes and distribution information of the interior nodes.
3. The method of claim 1 or 2, wherein the data collection platform determining the first of the two distribution sets from the first vector set comprises:
for each distribution in that set, constructing a Lasso regression model, wherein the response vector is a column vector of length $2d$ that stores the information of two known distributions, the coefficient matrix is a binary matrix of size $2d \times d(d-1)$, and the vector to be estimated is a column vector of length $d(d-1)$ that stores the information of the joint distribution;
4. The method of claim 1 or 2, wherein the data collection platform determining the second of the two distribution sets from the first vector set comprises:
for each distribution in that set, constructing a Lasso regression model, wherein the response vector is a column vector of length $(d+2)$ that stores the information of two known distributions, the coefficient matrix is a binary matrix of size $(d+2) \times 2d$, and the vector to be estimated is a column vector of length $2d$ that stores the information of the joint distribution;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811079995.XA CN109299436B (en) | 2018-09-17 | 2018-09-17 | Preference sorting data collection method meeting local differential privacy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109299436A CN109299436A (en) | 2019-02-01 |
CN109299436B (en) | 2021-10-15 |
Family
ID=65163261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811079995.XA Active CN109299436B (en) | 2018-09-17 | 2018-09-17 | Preference sorting data collection method meeting local differential privacy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109299436B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||