CN109299436B - Preference sorting data collection method meeting local differential privacy - Google Patents

Preference sorting data collection method meeting local differential privacy

Info

Publication number
CN109299436B
Authority
CN
China
Prior art keywords
preference
data
data collection
distribution
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811079995.XA
Other languages
Chinese (zh)
Other versions
CN109299436A (en)
Inventor
程祥
苏森
杨健宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201811079995.XA
Publication of CN109299436A
Application granted
Publication of CN109299436B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

The application discloses a preference ranking data collection method satisfying local differential privacy. A user terminal converts its preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends the result to a data collection platform. The data collection platform and the user terminals cooperate to run an algorithm satisfying local differential privacy, complete the construction of the whole RI model, and generate preference ranking data from the constructed model. With this method, the collected preference ranking data retains high data utility while privacy disclosure is avoided.

Description

Preference sorting data collection method meeting local differential privacy
Technical Field
The application relates to a data collection technology, in particular to a preference sorting data collection method meeting local differential privacy.
Background
Preference ranking data is a typical kind of personal data. For a user, preference ranking data is the ranking that the user gives to a given set of items (item set) according to his or her preference for the items in the set. For example, if the item set is {cola, white spirit, sprite, plain boiled water, beer} and a user's preference ranking of these five items is <white spirit, cola, beer, sprite, plain boiled water>, then the user likes white spirit most and plain boiled water least. With the rapid development of information technologies such as mobile internet and cloud computing and the increasing popularity of mobile terminals such as smartphones, it is common for users to share their preference ranking data with various data collectors (e.g., service providers) through mobile applications so as to enjoy personalized services. For service providers, in turn, collecting and analyzing user preference ranking data is essential for providing a better user experience and creating new revenue opportunities. However, a user's preference ranking data often contains extremely sensitive personal information, and direct collection of such data by the data collector may cause serious privacy disclosure problems for individuals.
FIG. 1 is a diagram illustrating a current user preference data collection scenario. The scenario mainly involves two roles: users (i.e., data contributors) and a data collector. Given an item set $\mathcal{X}=\{x_1,x_2,\ldots,x_d\}$ formed by $d$ items, each user $u_i$ ($1 \le i \le n$) holds one piece of preference ranking data $\sigma_i=\langle\sigma_i(1),\sigma_i(2),\ldots,\sigma_i(d)\rangle$, and users are independent of each other. Here $\sigma_i(j)=x_k$ means that the rank of $x_k$ in $\sigma_i$ is $j$. The data collector uses the data collection platform to collect the preference ranking data of each user over the network, thereby obtaining a preference ranking data set, from which a model of the preference ranking data is constructed. The model can generate new preference ranking data that has the same statistical characteristics as the users' original preference ranking data. Because the original preference ranking data is not released directly, user privacy is protected to a certain extent. The data collector may analyze the collected preference ranking data model directly, or may open the model or the newly generated preference ranking data to a third party (e.g., a research institution).
It can be seen from the above processing that, during the process of collecting the user preference ranking data, it is possible to prevent the user of the model and the new preference ranking data from acquiring the user privacy, but before the model of the preference ranking data is formed, the user preference privacy data may still be revealed. Specifically, for each user, there are three roles that may pose a threat to their privacy: 1) a data collector; 2) other users; 3) any potential attacker in addition to the data collector and other users.
Privacy-preserving preference ranking data collection provides a feasible way to solve the personal privacy disclosure problem brought by preference ranking data collection. Local differential privacy (LDP), proposed in recent years, is a differential privacy technique designed specifically to address the personal privacy disclosure caused by data collection. Specifically, the technique requires each data contributor to first add a suitable amount of noise to the data he or she owns and then send the noisy data to the data collector, thereby protecting the data contributor's privacy.
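To make the "perturb locally, then send" idea concrete, the following is a minimal sketch of generalized (k-ary) randomized response, a standard local differential privacy mechanism. It is not the mechanism of the present application; the function names `randomized_response` and `estimate_frequencies` and their parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def randomized_response(value, domain, epsilon, rng=random):
    """Report the true value with probability p, otherwise another value chosen uniformly."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def estimate_frequencies(reports, domain, epsilon):
    """Debias the observed report frequencies on the collector side."""
    k = len(domain)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)  # probability of reporting one specific other value
    counts = {v: 0 for v in domain}
    for r in reports:
        counts[r] += 1
    return {v: (counts[v] / n - q) / (p - q) for v in domain}
```

Each contributor would call `randomized_response` on the device, and only the perturbed report would leave the device; the collector would call `estimate_frequencies` on the pooled reports. The larger the domain, the closer `p` is to `q`, and the noisier the estimates become.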
Currently, there is some work on data collection satisfying local differential privacy. Based on information theory, Duchi et al. propose a high-dimensional data collection method for mean estimation and risk minimization tasks. By extending this method with sampling techniques, Nguyên et al. propose a data collection method known as Harmony. Specifically, for each piece of high-dimensional data, the method randomly selects one dimension of the data; if the dimension holds continuous data, collection follows the method proposed by Duchi et al.; if the dimension holds discrete data, collection uses the SH mechanism. To obtain the frequent items of multidimensional data, Qin et al. propose a two-phase data collection method called LDPMiner. In the first phase, the method uses the SH mechanism to preliminarily determine a candidate space of frequent items from the noisy data; in the second phase, it derives precise frequent items using the RAPPOR mechanism. Based on the EM (Expectation Maximization) algorithm, Fanti et al. propose an extended RAPPOR mechanism. The mechanism assumes that all dimensions of the high-dimensional data are mutually independent, collects the data of every dimension using the RAPPOR mechanism, and uses the collected data as input to the EM algorithm to infer the joint distribution of the whole data, so that the mechanism can be used to regenerate the original data. However, when the data dimension is high, the mechanism has high time complexity and converges slowly. To address this problem, Ren et al. propose a new method that combines the EM algorithm with Lasso regression and greatly improves the efficiency of the extended RAPPOR mechanism.
A local differential privacy algorithm can be applied directly to preference ranking data as follows: assume a value space formed by all possible preference rankings, treat each user's preference ranking data as one discrete value in that space, and then directly apply a single-dimensional multi-valued data collection method satisfying local differential privacy, such as the RAPPOR, SH, or OLH algorithm, to collect the data. However, the converted data has a huge value space: for a given item set $\mathcal{X}=\{x_1,x_2,\ldots,x_d\}$, the size of the value space of the converted data is $d!$. These algorithms therefore introduce a large amount of noise into the collected data, making the resulting preference ranking data unusable.
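A short calculation illustrates how quickly the $d!$ value space grows compared with the $d$ possible ranks of a single item. The helper name `ranking_space_size` is hypothetical.

```python
import math


def ranking_space_size(d):
    """Number of distinct full rankings of d items."""
    return math.factorial(d)


for d in (5, 10, 20):
    print(d, ranking_space_size(d))
# 5 120
# 10 3628800
# 20 2432902008176640000
```

With only 20 items, a mechanism that treats each ranking as one discrete value would have to spread its reports over more than $2.4 \times 10^{18}$ candidates, which is why far fewer users than candidates makes the estimates noise-dominated.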
Disclosure of Invention
The application provides a preference sorting data collection method meeting local differential privacy, and preference sorting data collection realized by the method can ensure that the collected preference sorting data has higher data utility while avoiding privacy disclosure.
In order to achieve the purpose, the following technical scheme is adopted in the application:
A preference ranking data collection method satisfying local differential privacy, comprising:
the data collection platform initializes all vectors $z_j$ in the first vector set
Figure BDA00018016641200000313
to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from the preset preference item set and sends it to the corresponding user terminal; wherein $i$ is the user terminal index and $j$ is the index of a preference item in the preference item set;
each user terminal, according to the user's own preference ranking data, assigns the values $t_i[A_j]$ of the tuple $t_i$ that contains all attributes in the attribute set
Figure BDA00018016641200000314
and, according to the preference item index $j$ sent to the user terminal by the data collection platform, uses $t_i[A_j]$ to generate a value subscript
Figure BDA00018016641200000315
and sends it to the data collection platform; wherein $A_j$ is the $j$-th attribute in the attribute set
Figure BDA00018016641200000317
$t_i[A_j]$ denotes the value of $t_i$ on $A_j$, the number of attributes in the tuple equals the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute equals the rank of the corresponding preference item; the value subscript satisfies the condition
Figure BDA0001801664120000031
where $k \in \{1,2,\ldots,|dom(A_j)|\}$, $I(t_i[A_j])$ denotes the index of $t_i[A_j]$ in $dom(A_j)$, and $dom(A_j)$ denotes the value space of attribute $A_j$;
the data collection platform, according to the
Figure BDA00018016641200000316
sent by each user terminal, increases the value
Figure BDA0001801664120000032
in the first vector set by 1;
the data collection platform updates each value $z_j[k]$ of all vectors in the first vector set to
Figure BDA0001801664120000033
where
Figure BDA0001801664120000034
and $\varepsilon'$ is a preset first privacy budget;
the data collection platform determines
Figure BDA0001801664120000035
and
Figure BDA0001801664120000036
from the first vector set, uses the first vector set,
Figure BDA0001801664120000037
and
Figure BDA0001801664120000038
to compute, for all triplets
Figure BDA0001801664120000039
in the preference item set, the mutual information
Figure BDA00018016641200000310
constructs the K-thin chain
Figure BDA00018016641200000311
and sends it to each user terminal;
the data collection platform initializes all vectors $z'_j$ in the second vector set
Figure BDA00018016641200000312
to zero vectors and, for each user terminal $u_i$, selects a preference item index $j$ from the preset preference item set and sends it to the corresponding user terminal; wherein $i$ is the user terminal index, $j$ is the index of a preference item in the preference item set, and the preference item indexes selected for different user terminals may be the same or different;
each user terminal, according to the user's own preference ranking data, assigns the values $t'_i[A'_j]$ of the tuple $t'_i$ that contains all attributes in the attribute set
Figure BDA0001801664120000041
and, according to the preference item index $j$ sent to the user terminal by the data collection platform, uses $t'_i[A'_j]$ to generate a value subscript
Figure BDA0001801664120000042
and sends it to the data collection platform; wherein the attribute set
Figure BDA0001801664120000043
comprises two subsets
Figure BDA0001801664120000044
and
Figure BDA0001801664120000045
Figure BDA0001801664120000046
corresponds to, for
Figure BDA0001801664120000047
the set of its leaf item sets
Figure BDA0001801664120000048
Figure BDA0001801664120000049
corresponds to
Figure BDA00018016641200000410
and the value subscript satisfies the condition
Figure BDA00018016641200000411
the data collection platform, according to the
Figure BDA00018016641200000412
sent by each user terminal, increases the value
Figure BDA00018016641200000413
in the second vector set by 1;
the data collection platform updates each value $z'_j[k]$ of all vectors in the second vector set to
Figure BDA00018016641200000414
where
Figure BDA00018016641200000415
$\varepsilon''$ is a preset second privacy budget, $\varepsilon' + \varepsilon'' = \varepsilon$, and $\varepsilon$ is the total privacy budget for building the RI model;
obtaining, from the second vector set, for
Figure BDA00018016641200000416
the distribution information of its leaf nodes and the distribution information of its internal nodes;
generating preference ranking data using an RI model that includes distribution information of the leaf nodes and distribution information of the interior nodes.
Preferably, the method further comprises: according to the mutual information of the triplets and, for
Figure BDA00018016641200000417
the distribution information of its leaf nodes and the distribution information of its internal nodes, generating a specified number of pieces of preference ranking data.
Preferably, the data collection platform determining
Figure BDA00018016641200000418
from the first vector set comprises:
for each distribution in
Figure BDA00018016641200000419
namely each
Figure BDA00018016641200000420
constructing a Lasso regression model
Figure BDA00018016641200000421
where
Figure BDA00018016641200000422
is a column vector of length $2d$ that stores the information of the distributions
Figure BDA00018016641200000423
and
Figure BDA00018016641200000424
Figure BDA00018016641200000425
is a binary matrix of size $2d \times d(d-1)$, and
Figure BDA00018016641200000426
is a column vector of length $d(d-1)$ that stores the information of the joint distribution
Figure BDA00018016641200000427
solving the Lasso regression model by the least angle regression method to estimate
Figure BDA00018016641200000428
and determine the joint distribution
Figure BDA00018016641200000429
and, according to the joint distribution
Figure BDA00018016641200000430
computing
Figure BDA00018016641200000431
Preferably, the data collection platform determining
Figure BDA00018016641200000432
from the first vector set comprises:
for each distribution in
Figure BDA00018016641200000433
namely each
Figure BDA00018016641200000434
constructing a Lasso regression model
Figure BDA00018016641200000435
Figure BDA00018016641200000436
where
Figure BDA00018016641200000437
is a column vector of length $(d+2)$ that stores the information of the distributions
Figure BDA00018016641200000438
and
Figure BDA00018016641200000439
Figure BDA00018016641200000440
is a binary matrix of size $(d+2) \times 2d$, and
Figure BDA00018016641200000441
is a column vector of length $2d$ that stores the information of the joint distribution
Figure BDA00018016641200000442
solving the Lasso regression model by the least angle regression method to estimate
Figure BDA0001801664120000051
and determine the joint distribution
Figure BDA0001801664120000052
Preferably,
Figure BDA0001801664120000053
According to the above technical scheme, the user terminal converts its preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends the result to the data collection platform. The data collection platform and the user terminals cooperate to run an algorithm satisfying local differential privacy and complete the construction of the whole RI model, and the constructed RI model is used to generate preference ranking data satisfying local differential privacy. With this method, the collected preference ranking data retains high data utility while privacy disclosure is avoided.
Drawings
FIG. 1 is a diagram illustrating a current user preference data collection scenario;
FIG. 2 is a schematic view of an example of a 2-thin chain;
FIG. 3 is a first performance comparison diagram in the present application;
FIG. 4 is a second performance comparison diagram in the present application;
FIG. 5 is a third performance comparison diagram in the present application.
Detailed Description
For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.
To solve the problem mentioned in the background, namely that the data obtained by directly applying existing local differential privacy methods to preference ranking data is unusable, the applicant proposes a preference ranking data algorithm satisfying local differential privacy (the SAFARI algorithm). The main idea is that the data collector collects distribution information on a series of small value spaces selected according to the riffled independence model (RI model), uses the distribution information collected on these small value spaces to approximate the overall distribution of the preference ranking data, builds a model, and generates preference ranking data from the built model. Because the SAFARI algorithm processes several small value spaces instead of one large value space, it can greatly reduce the scale of the noise.
The process of the present application is described in detail below.
At present, preference ranking data can be modeled with the RI model. Exploiting the mutual exclusivity between the dimensions of preference ranking data, the RI model approximates the overall distribution of the preference ranking data by the product of two kinds of low-dimensional distributions, namely relative ranking distributions (Relative Ranking Distributions) and cross distributions (Interleaving Distributions), and thus models the preference ranking data effectively. New preference ranking data is then generated from the established model, thereby protecting user privacy.
The structure of the RI model is a binary tree called K-thin chain. Wherein the root node represents the original item set, the other nodes represent sub item sets of the original item set, and the size of the item set of the leaf node does not exceed a constant K. FIG. 2 is an example of a 2-thin chain.
In this example, the original item set {cola, white spirit, sprite, plain boiled water, beer} is first divided into two sub item sets that are mutually exclusive and riffle independent of each other, namely {plain boiled water} and {cola, white spirit, sprite, beer}. Since the size of the sub item set {cola, white spirit, sprite, beer} exceeds 2 (i.e., the value of K), it is further divided into the sub item sets {cola, sprite} and {white spirit, beer}, which are again mutually exclusive and riffle independent of each other.
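The tree structure described above can be sketched as a small data structure. This is only an illustration built from the textual description (binary tree, each split partitions the parent item set, leaf item sets no larger than K); the class `Node` and the helpers `leaves` and `is_valid_chain` are hypothetical names, and no riffled-independence test is performed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class Node:
    items: FrozenSet[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self):
        return self.left is None and self.right is None


def leaves(node):
    """Collect the leaf nodes from left to right."""
    if node.is_leaf():
        return [node]
    return leaves(node.left) + leaves(node.right)


def is_valid_chain(node, k):
    """Check that every split is a partition and every leaf has at most k items."""
    if node.is_leaf():
        return len(node.items) <= k
    if node.left is None or node.right is None:
        return False
    if node.left.items & node.right.items:
        return False
    if node.left.items | node.right.items != node.items:
        return False
    return is_valid_chain(node.left, k) and is_valid_chain(node.right, k)


# The 2-thin chain of the example above.
root = Node(
    frozenset({"cola", "white spirit", "sprite", "plain boiled water", "beer"}),
    left=Node(frozenset({"plain boiled water"})),
    right=Node(
        frozenset({"cola", "white spirit", "sprite", "beer"}),
        left=Node(frozenset({"cola", "sprite"})),
        right=Node(frozenset({"white spirit", "beer"})),
    ),
)
```

In this sketch the three leaves would carry relative ranking distributions and the two internal nodes would carry the distributions that describe how their children's rankings are merged.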
The learning process of the RI model comprises two stages of structure learning and parameter learning:
1) Structure learning. First, the mutual information of all triplets in the item set is computed, which is defined as follows:
Given an item set $\{x_1,x_2,\ldots,x_d\}$, for any triplet in the set
Figure BDA0001801664120000061
where
Figure BDA0001801664120000062
and
Figure BDA0001801664120000063
are three different items in the set, the mutual information of the triplet is
Figure BDA0001801664120000064
where
Figure BDA0001801664120000065
denotes the rank of item
Figure BDA0001801664120000066
and
Figure BDA0001801664120000067
is a binary variable. Specifically,
Figure BDA0001801664120000068
means
Figure BDA0001801664120000069
that is, item
Figure BDA00018016641200000610
is ranked before item
Figure BDA00018016641200000611
and
Figure BDA00018016641200000612
means
Figure BDA00018016641200000613
that is, item
Figure BDA00018016641200000614
is ranked after item
Figure BDA00018016641200000615
Then, the K-thin chain on the original item set is constructed from the mutual information of the triplets using an anchor point algorithm.
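The exact formula for the triplet mutual information is contained in an equation image that is not reproduced here. Following only the textual description (the rank of one item, and a binary variable indicating whether a second item is ranked before a third), a non-private empirical estimate might be sketched as follows. The function name `triplet_mutual_information` and the list-of-rankings input format are assumptions made for illustration.

```python
import math
from collections import Counter


def triplet_mutual_information(rankings, a, b, c):
    """Estimate I(rank of a ; [b is ranked before c]) from a list of rankings.

    Each ranking is a list of items ordered from most to least preferred.
    """
    n = len(rankings)
    joint = Counter()
    for sigma in rankings:
        rank_a = sigma.index(a) + 1                 # 1-based rank of item a
        b_first = sigma.index(b) < sigma.index(c)   # True if b precedes c
        joint[(rank_a, b_first)] += 1

    rank_counts = Counter()
    binary_counts = Counter()
    for (r, f), cnt in joint.items():
        rank_counts[r] += cnt
        binary_counts[f] += cnt

    mi = 0.0
    for (r, f), cnt in joint.items():
        p_joint = cnt / n
        p_indep = (rank_counts[r] / n) * (binary_counts[f] / n)
        mi += p_joint * math.log(p_joint / p_indep)
    return mi
```

A value near zero suggests that the rank of `a` carries little information about the relative order of `b` and `c`, which is the kind of signal a structure-learning step can use when deciding how to split the item set. In the method of this application these quantities would be computed from noisy collected distributions rather than from raw rankings.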
2) Parameter learning. Based on the constructed K-thin chain, the distribution of each node is learned so as to approximate the overall distribution of the original preference ranking data set. The distributions of the leaf nodes are called relative ranking distributions (Relative Ranking Distributions), and the distributions of the internal nodes (including the root node) are called cross distributions (Interleaving Distributions).
Once the relative ranking distributions and the cross distributions have been determined through the above process, the modeling of the RI model is complete.
The preference ranking data collection method of the present application is based on the RI model, and preference ranking data is generated from the established RI model. The difference is that, during modeling, both the acquisition of the triplet mutual information and the acquisition of the relative ranking distributions and cross distributions satisfy local differential privacy.
The preference ranking data algorithm satisfying local differential privacy in the present application (the SAFARI algorithm) involves two rules, Rule I and Rule II, and one SAFA algorithm, and may specifically comprise 5 stages:
Stage 1
1. Each user converts his or her own preference ranking data according to a conversion rule (denoted Rule I), so that the data collection platform can obtain the distribution information needed to compute the triplet mutual information. The content of this distribution information is described in detail later.
Stage 2
1. The data collection platform calls the SAFA algorithm with privacy budget $\varepsilon'$ and, with the cooperation of the users, collects from the data converted by the users in Stage 1 the distribution information needed to compute the triplet mutual information. In the SAFA algorithm, each user adds noise to the converted data before sending it to the data collection platform, which avoids privacy disclosure.
2. The data collection platform computes all triplet mutual information in the RI model using the collected distribution information.
3. The data collection platform constructs the K-thin chain
Figure BDA0001801664120000071
4. The data collection platform sends
Figure BDA0001801664120000072
to each user.
Stage 3
1. Each user converts his or her own preference ranking data according to another conversion rule (denoted Rule II), so that the data collection platform can determine, for
Figure BDA0001801664120000073
the relative ranking distribution (Relative Ranking Distributions) information and the cross distribution (Interleaving Distributions) information.
Stage 4
1. The data collector calls the SAFA algorithm with privacy budget $\varepsilon''$ and, with the cooperation of the users, collects from the data converted by the users in Stage 3, for
Figure BDA0001801664120000074
the relative ranking distribution information and the cross distribution information. In the SAFA algorithm, each user adds noise to the converted data before sending it to the data collection platform, which avoids privacy disclosure.
The construction of the RI model is thus complete once the relative ranking distribution information and the cross distribution information have been obtained.
Stage 5
According to the constructed riffled independence model, the data collector generates n pieces of new preference ranking data.
Through Stage 1 to Stage 4, RI modeling of the preference ranking data that satisfies local differential privacy is achieved. The data collection platform may release the completed RI model to a third party or, to better provide preference ranking data to the third party, may preferably generate new preference ranking data through Stage 5 and release that data to the third party.
In addition, in the above data collection method, Stage 2 and Stage 4 carry out the local differential privacy processing in two parts. Therefore, with an overall privacy budget of $\varepsilon$, we have $\varepsilon = \varepsilon' + \varepsilon''$. Preferably, in practical applications one usually takes
Figure BDA0001801664120000075
We will next introduce Rule I, Rule II and SAFA methods used in the above data collection method satisfying the local differential privacy, respectively.
Design Rule I
As described above, Rule I is used to convert a user's preference ranking data, and the converted data is used to compute the triplet mutual information, so Rule I must be designed according to how the triplet mutual information is computed. Specifically, to compute the mutual information of any possible triplet $(x_i,x_j,x_k)$ according to its definition, the data collection platform needs to collect three types of distribution information:
Figure BDA0001801664120000081
Figure BDA0001801664120000082
Figure BDA0001801664120000083
To accomplish this task, an intuitive conversion method is for each user to convert his or her preference ranking data so as to provide information about these three types of distributions. Specifically, each user converts his or her preference ranking data into a tuple comprising a plurality of attributes, where each attribute corresponds to one distribution in
Figure BDA0001801664120000084
However, this conversion method leaves redundant information in the user's converted data and increases the conversion complexity, because
Figure BDA0001801664120000085
In fact, the data collector only needs to collect
Figure BDA0001801664120000086
and can then derive from it the distributions in
Figure BDA0001801664120000087
and
Figure BDA00018016641200000825
Thus, each user only needs to convert his or her preference ranking data to provide the distributions in
Figure BDA0001801664120000089
Since
Figure BDA00018016641200000810
contains $O(d^3)$ different distributions, the number of attributes in each user's converted tuple is $O(d^3)$.
Unfortunately, when $d$ is relatively large, the curse of dimensionality causes this conversion to introduce a large amount of noise into the data that the data collector gathers under LDP. To solve this problem, we designed Rule I. Under this conversion rule, each user only needs to convert his or her preference ranking data to provide the distributions in
Figure BDA00018016641200000811
The data collection platform only needs to collect the distributions in
Figure BDA00018016641200000812
and then estimate from them, using a regression model, the distributions in
Figure BDA00018016641200000813
and
Figure BDA00018016641200000814
In particular, the applicant has found that this estimation problem is a sparse linear regression problem. Therefore, the Lasso regression model, which solves this kind of problem effectively, is selected. The details of Rule I, and of how the Lasso regression model is used to estimate the distributions in
Figure BDA00018016641200000815
and
Figure BDA00018016641200000816
are described below.
Rule I: each user terminal $u_i$ first converts the preference ranking data $\sigma_i$ of the corresponding user into a tuple $t_i$ containing all attributes in the attribute set
Figure BDA00018016641200000817
Each attribute $A_j$ in
Figure BDA00018016641200000818
corresponds to an item
Figure BDA00018016641200000819
The value space of $A_j$ is $dom(A_j)=\{1,2,\ldots,d\}$. $dom(A_j)$ consists of $d$ possible values, which represent all possible absolute ranks of
Figure BDA00018016641200000820
Then, for each attribute $A_j$ in
Figure BDA00018016641200000821
each user $u_i$ assigns a value to $t_i[A_j]$ according to $\sigma_i$.
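Rule I as described above, one attribute per item whose value is the item's absolute rank, can be sketched in a few lines. The function name `rule_one` and the convention that a ranking is a list ordered from most to least preferred are assumptions made for illustration.

```python
def rule_one(sigma, items):
    """Convert a ranking into a tuple of absolute ranks, one attribute per item.

    `items` fixes the attribute order and must be identical on the platform
    side and on every user terminal.
    """
    rank_of = {item: pos + 1 for pos, item in enumerate(sigma)}
    return tuple(rank_of[item] for item in items)


items = ["cola", "white spirit", "sprite", "plain boiled water", "beer"]
sigma = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]
t = rule_one(sigma, items)
print(t)  # (2, 1, 4, 5, 3)
```

Each of the $d$ attributes takes one of only $d$ values, so the value space handled per attribute is far smaller than the $d!$ space of whole rankings.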
Distribution estimation using Lasso: determining
Figure BDA00018016641200000822
and
Figure BDA00018016641200000823
Based on the collected
Figure BDA00018016641200000824
the data collection platform can estimate, in the following manner, the distributions in
Figure BDA0001801664120000091
and
Figure BDA0001801664120000092
First, the data collection platform estimates, from the distributions in
Figure BDA0001801664120000093
the distribution information in
Figure BDA0001801664120000094
Specifically, for each distribution in
Figure BDA0001801664120000095
namely each
Figure BDA0001801664120000096
the data collector constructs a Lasso regression model
Figure BDA0001801664120000097
where
1)
Figure BDA0001801664120000098
is a column vector of length $2d$ that stores the information of the distributions
Figure BDA0001801664120000099
and
Figure BDA00018016641200000910
2)
Figure BDA00018016641200000911
is a binary matrix of size $2d \times d(d-1)$;
3)
Figure BDA00018016641200000912
is a column vector of length $d(d-1)$ that stores the information of the joint distribution
Figure BDA00018016641200000913
By solving the Lasso regression model with the least-angle regression method, the data collection platform can estimate
Figure BDA00018016641200000914
thereby obtaining the information of the joint distribution
Figure BDA00018016641200000915
From the joint distribution
Figure BDA00018016641200000916
the data collection platform can then calculate the information of the distribution
Figure BDA00018016641200000917
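The dimensions stated for this regression (a left-hand vector of length 2d, a binary matrix of size 2d × d(d-1), and d(d-1) unknowns) are consistent with a matrix that maps a joint distribution over pairs of distinct ranks onto its two marginals. The sketch below builds such a matrix under that assumption; the exact matrix in the patent's figure is not legible in this text, and the names `build_design_matrix` and `matvec` are hypothetical.

```python
from itertools import permutations


def build_design_matrix(d):
    """Binary matrix M (2d rows, d*(d-1) columns) with M @ joint = marginals.

    Column c stands for one pair (r1, r2) of distinct ranks of two items.
    Rows 0..d-1 hold Pr[A_1 = r]; rows d..2d-1 hold Pr[A_2 = r].
    """
    pairs = list(permutations(range(1, d + 1), 2))
    M = [[0] * len(pairs) for _ in range(2 * d)]
    for col, (r1, r2) in enumerate(pairs):
        M[r1 - 1][col] = 1       # pair contributes to the marginal of the first item
        M[d + r2 - 1][col] = 1   # and to the marginal of the second item
    return M, pairs


def matvec(M, x):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


d = 3
M, pairs = build_design_matrix(d)
joint = [1.0 / len(pairs)] * len(pairs)   # hypothetical uniform joint distribution
marginals = matvec(M, joint)              # each of the 2d entries is close to 1/d
```

In the estimation step the roles are reversed: the noisy marginals are known, the joint vector is unknown, and a Lasso solver based on least-angle regression would be applied to recover a sparse estimate of it. The solver itself is not shown here.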
Then, the data collector estimates, from the distributions in
Figure BDA00018016641200000918
and
Figure BDA00018016641200000919
the distribution information in
Figure BDA00018016641200000920
Specifically, for each distribution in
Figure BDA00018016641200000921
namely
Figure BDA00018016641200000922
the data collector constructs a Lasso regression model
Figure BDA00018016641200000923
Figure BDA00018016641200000924
Wherein,
1)
Figure BDA00018016641200000925
is a column vector of length (d+2) that stores the information of the distributions
Figure BDA00018016641200000926
and
Figure BDA00018016641200000927
2)
Figure BDA00018016641200000928
is a binary matrix of size (d+2) × 2d;
3)
Figure BDA00018016641200000929
is a column vector of length 2d that stores the information of the joint distribution
Figure BDA00018016641200000930
Similarly, by solving the Lasso regression model with the least-angle regression method, the data collector can estimate
Figure BDA00018016641200000931
thereby obtaining the information of the joint distribution
Figure BDA00018016641200000932
It should be noted that, in Rule I described above, the correspondence between each attribute A_j in
Figure BDA00018016641200000933
and the preference items needs to be consistent on the data collection platform side and the user terminal side. That is, in SAFARI, the data collection platform and each user terminal need to run the same Rule I. In addition, as can be seen from the above processing, under this conversion rule the number of attributes in the tuple converted by each user is O(d). For each attribute A_j in the attribute set
Figure BDA00018016641200000934
the value space size |dom(A_j)| is only d, which is clearly much smaller than d!. By estimating the frequency of every possible value of each attribute in the set
Figure BDA00018016641200000935
the data collection platform can obtain the distribution information of
Figure BDA00018016641200000936
and then, based on this distribution information, estimate the distributions in
Figure BDA00018016641200000937
and
Figure BDA00018016641200000938
Such processing does not generate a large amount of redundant data, and thus the utility of the user preference data can be effectively improved.
Design Rule II
After constructing the K-thin chain
Figure BDA0001801664120000101
the data collection platform needs to collect information about the relative ranking distributions and the interleaving distributions of
Figure BDA0001801664120000102
To this end, Rule II is designed. According to this conversion rule, each user terminal only needs to convert the preference ranking data of the corresponding user in order to provide information about the two types of distributions.
Rule II: each user terminal u_i first converts the preference ranking data σ_i of the corresponding user into a tuple t_i that contains all the attributes in the attribute set
Figure BDA0001801664120000103
Specifically, the attribute set
Figure BDA0001801664120000104
is formed from two subsets
Figure BDA0001801664120000105
and
Figure BDA0001801664120000106
that is,
Figure BDA0001801664120000107
Wherein,
1)
Figure BDA0001801664120000108
Figure BDA0001801664120000109
corresponds to a collection of leaf item sets of
Figure BDA00018016641200001010
namely
Figure BDA00018016641200001011
Figure BDA00018016641200001012
is the part of
Figure BDA00018016641200001033
formed by the leaf item sets that contain only one item. Because the relative ranking distribution of each leaf item set in
Figure BDA00018016641200001014
can be easily inferred, users do not need to provide information about
Figure BDA00018016641200001015
Each attribute A_j in
Figure BDA00018016641200001016
corresponds to a leaf item set l_k in the collection
Figure BDA00018016641200001017
The value space of A_j consists of all possible relative rankings of the items in l_k. In particular, when K is 1, all of the leaf item sets in
Figure BDA00018016641200001018
contain only one item and, in this case,
Figure BDA00018016641200001019
where K denotes the maximum number of items contained in any leaf item set of
Figure BDA00018016641200001020
2)
Figure BDA00018016641200001021
Figure BDA00018016641200001022
Each attribute A_j in this subset corresponds to an internal item set g_k in the collection of internal item sets
Figure BDA00018016641200001023
The value space of A_j consists of all possible interleavings associated with g_k. Then, for each attribute A_j in
Figure BDA00018016641200001024
each user u_i assigns a value to t_i[A_j] according to σ_i.
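The two kinds of attribute values in Rule II can be sketched as follows, assuming the usual riffle-independence reading: a relative ranking is the order in which the items of one leaf item set appear in σ_i, and an interleaving records how the items of the two child item sets under an internal item set are merged in σ_i. The function names, the 'L'/'R' encoding and the sample item sets are hypothetical, and the patent's own encoding may differ.

```python
def relative_ranking(sigma, leaf_items):
    """Order in which the items of one leaf item set appear in sigma."""
    leaf = set(leaf_items)
    return tuple(item for item in sigma if item in leaf)


def interleaving(sigma, left_items, right_items):
    """Pattern of 'L'/'R' marks showing how two child item sets merge in sigma."""
    left, right = set(left_items), set(right_items)
    return "".join(
        "L" if item in left else "R"
        for item in sigma
        if item in left or item in right
    )


sigma = ["c", "a", "e", "d", "b"]                      # one user's ranking, best first
print(relative_ranking(sigma, ["a", "b"]))             # ('a', 'b')
print(interleaving(sigma, ["a", "b"], ["c", "d", "e"]))  # RLRRL
```

A leaf item set with at most K items has at most K! relative rankings, which matches the value space bound given for the leaf attributes.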
It should be noted that, in Rule II described above, each attribute A_j in
Figure BDA00018016641200001025
corresponds one-to-one to a leaf item set or an internal item set, and this correspondence needs to be kept consistent on the data collection platform side and the user terminal side. That is, in SAFARI, the data collection platform and each user terminal need to run the same Rule II. In addition, as can be seen from the above processing, under this conversion rule the number of attributes in the tuple converted by each user is O(d). For each attribute A_j in
Figure BDA00018016641200001026
the maximum value space size is K!; for each attribute A_j in
Figure BDA00018016641200001027
the maximum value space size is
Figure BDA00018016641200001028
Thus, for the attribute set
Figure BDA00018016641200001029
the maximum value space size of any attribute is
Figure BDA00018016641200001030
Clearly, this value is much smaller than d!. By estimating, for the set
Figure BDA00018016641200001031
the frequency of every possible value of each attribute, the data collection platform can obtain the distribution information of
Figure BDA00018016641200001032
Such processing does not generate a large amount of redundant data, and thus the utility of the user preference data can be effectively improved.
SAFA process
In the data collection process of the present application, both stage 2 and stage 4 require SAFA processing on the data transformed by the user terminal, and the SAFA processing is described in detail here.
In order to collect distribution information required for constructing the RI model, the data collection platform needs to estimate the frequency of any one possible value of each attribute in the tuple after the transformation by the user under the condition that the LDP is satisfied.
The data collection platform can directly call the Harmony method, a state-of-the-art method for analyzing multi-attribute data under LDP, to complete this task. Specifically, for each attribute A_j in
Figure BDA0001801664120000111
the data collector maps the value space of A_j into a binary matrix Φ_j of size |dom(A_i)| × |dom(A_j)|. Then, for each user u_i, the data collector randomly selects an attribute (denoted A_r) from the set
Figure BDA0001801664120000119
and calls the SH algorithm [11] to collect the value of A_r in the tuple t_i of u_i.
We observe that, whether Rule I or Rule II is applied, the value space sizes of all the attributes in the converted attribute set
Figure BDA00018016641200001110
are far smaller than d!. However, even for attributes with a small value space, the Harmony method still maps the value space of each dimension into a matrix, so that the collected data contains unnecessary noise, especially when handling binary attributes. The literature states that the generalized randomized response algorithm works best when estimating the frequencies of a small number of discrete values. Therefore, we propose a new LDP algorithm, named Sampling Randomizer for Multiple Attributes (SAFA), to perform frequency estimation more accurately on multi-attribute data with small value spaces while satisfying LDP. Its main idea is that each user terminal randomly selects an attribute, perturbs the value of that attribute with the generalized randomized response algorithm, and sends the perturbed result to the data collection platform.
As mentioned above, it is necessary to apply SAFA algorithm for processing in stage 2 and stage 4, which is a process that needs to be performed by the user terminal and the data collection platform in cooperation. The SAFA algorithms applied in stage 2 and stage 4 are the same, but the distribution information to be obtained by the SAFA algorithms is different, and the processing of the SAFA algorithms is described in the following.
The specific process of SAFA is as follows:
1. The data collection platform initializes all vectors in the vector set
Figure BDA0001801664120000112
that is, every value in each vector is set to 0. Here, for stage 2, the vector set
Figure BDA0001801664120000113
is
Figure BDA0001801664120000114
and for stage 4, the vector set
Figure BDA0001801664120000115
is the collection of the relative ranking distribution set and the interleaving distribution set;
2. For each user terminal u_i, the following operations are performed:
2.1. The data collection platform randomly selects an index j from the attribute set obtained by conversion under Rule I or Rule II
Figure BDA0001801664120000116
2.2. The data collection platform sends j to u_i;
2.3. u_i generates a noisy value index, denoted
Figure BDA0001801664120000117
such that
Figure BDA0001801664120000118
where k ∈ {1, 2, ..., |dom(A_j)|};
2.4. u_i sends
Figure BDA0001801664120000121
to the data collector;
2.5. The data collector increases the value of
Figure BDA0001801664120000122
by 1;
After the above operations have been executed for all the user terminals, the following operations are performed:
3. For each vector z_j in the collection
Figure BDA00018016641200001213
the following processing is performed:
3.1. The data collector sets the probability
Figure BDA0001801664120000123
3.2. The data collector sets the probability
Figure BDA0001801664120000124
3.3. Each value z_j[k] in the vector z_j is updated to
Figure BDA0001801664120000125
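The steps above can be sketched as a small simulation. The formulas for the two probabilities and for the final update are not legible in this text, so the sketch assumes the standard generalized randomized response setting, p = e^ε / (e^ε + |dom(A_j)| - 1) and q = 1 / (e^ε + |dom(A_j)| - 1), together with the usual unbiased estimator normalized by the number of reports received per attribute. The function names `grr_perturb` and `safa_collect` are hypothetical.

```python
import math
import random


def grr_perturb(value, domain_size, epsilon, rng):
    """Generalized randomized response over {1, ..., domain_size} (assumed form)."""
    p = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < p:
        return value                        # report the true value index
    other = rng.randrange(1, domain_size)   # uniform over the remaining values
    return other if other < value else other + 1


def safa_collect(tuples, domain_sizes, epsilon, seed=0):
    """Simulate SAFA and return estimated value frequencies per attribute."""
    rng = random.Random(seed)
    m = len(domain_sizes)
    counts = [[0] * size for size in domain_sizes]   # step 1: zero vectors z_j
    reports = [0] * m
    for t in tuples:                                 # step 2: one report per user
        j = rng.randrange(m)                         # 2.1-2.2: platform picks index j
        noisy = grr_perturb(t[j], domain_sizes[j], epsilon, rng)  # 2.3
        counts[j][noisy - 1] += 1                    # 2.4-2.5: increment z_j[k]
        reports[j] += 1
    estimates = []
    for j, size in enumerate(domain_sizes):          # step 3: debias each vector
        p = math.exp(epsilon) / (math.exp(epsilon) + size - 1)
        q = 1.0 / (math.exp(epsilon) + size - 1)
        n_j = reports[j]
        estimates.append(
            [(c / n_j - q) / (p - q) if n_j else 0.0 for c in counts[j]]
        )
    return estimates


# Hypothetical data: 2000 users who all hold the tuple (1, 2), d = 3.
est = safa_collect([[1, 2]] * 2000, [3, 3], epsilon=2.0)
```

Because each user reports on a single sampled attribute, the whole privacy budget of the stage is spent on that one report rather than being divided among all attributes.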
The above is the specific processing of SAFA. To show that the SAFA method in the present application satisfies local differential privacy, a theoretical proof is given below.
Theorem: for any user u_i and privacy budget ε', SAFA satisfies ε'-LDP.
Proof:
By the definition of LDP, for any two different tuples t_i, t'_i and any
Figure BDA0001801664120000126
where
Figure BDA0001801664120000127
is the index of the attribute selected by the data collector, we need to show that
Figure BDA0001801664120000128
Since j is selected at random,
Figure BDA0001801664120000129
We discuss (1) in all 4 possible cases.
Case 1: if
Figure BDA00018016641200001210
and
Figure BDA00018016641200001211
then
Figure BDA00018016641200001212
Case 2: if
Figure BDA0001801664120000131
and
Figure BDA0001801664120000132
then
Figure BDA0001801664120000133
Case 3: if
Figure BDA0001801664120000134
and
Figure BDA0001801664120000135
then
Figure BDA0001801664120000136
Case 4: if
Figure BDA0001801664120000137
and
Figure BDA0001801664120000138
then
Figure BDA0001801664120000139
In summary,
Figure BDA00018016641200001310
holds. Therefore, the theorem is proved.
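The four cases can be checked numerically with a short sketch. It assumes the standard generalized randomized response probabilities (p for reporting the true value, q for reporting any other value); the case equations in the patent's figures are not legible here, so this reproduces the usual argument and is not the patent's exact derivation. The function names are hypothetical.

```python
import math


def grr_probabilities(domain_size, epsilon):
    """Assumed probabilities: p for the true value, q for each other value."""
    denom = math.exp(epsilon) + domain_size - 1
    return math.exp(epsilon) / denom, 1.0 / denom


def worst_case_ratio(domain_size, epsilon):
    """Largest ratio Pr[output | t_i] / Pr[output | t'_i] over the four cases.

    The probability of sampling attribute j is the same for both tuples and
    cancels, so only the perturbation probabilities remain.
    """
    p, q = grr_probabilities(domain_size, epsilon)
    ratios = [
        p / p,  # output equals the value in both tuples
        p / q,  # output equals the value in t_i only
        q / p,  # output equals the value in t'_i only
        q / q,  # output equals neither value
    ]
    return max(ratios)


for size in (2, 5, 10):
    assert worst_case_ratio(size, 1.0) <= math.exp(1.0) + 1e-9
```

Under this assumption the worst case is p / q = e^ε', which is exactly the bound required by ε'-LDP.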
The formal analysis above shows that the preference ranking data collection algorithm satisfying local differential privacy in the present application (SAFARI) satisfies local differential privacy for each user, while ensuring that the data collected by the data collector has high data utility.
Here, the above preference ranking data collection method satisfying the local differential privacy in the present application is summarized as follows:
1. The data collection platform initializes, in the first vector set
Figure BDA00018016641200001319
all vectors z_j to 0 vectors and, for each user terminal u_i, selects a preference item index j from a preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, and j is a preference item index in the preference item set;
2. Each user terminal, using the user's own preference ranking data, assigns values t_i[A_j] to the tuple containing all the attributes in the attribute set
Figure BDA00018016641200001320
and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i[A_j] to generate a value index
Figure BDA00018016641200001311
and sends it to the data collection platform; wherein A_j is an attribute in the attribute set
Figure BDA00018016641200001312
the number of attributes in the tuple is equal to the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute is equal to the rank of the corresponding preference item; the value index satisfies the condition
Figure BDA00018016641200001313
3. According to the value index sent by each user terminal,
Figure BDA00018016641200001314
the data collection platform increases by 1 the value of the following entry of the first vector set
Figure BDA00018016641200001315
4. The data collection platform updates each value z_j[k] of all the vectors in the first vector set to
Figure BDA00018016641200001316
Figure BDA00018016641200001317
Wherein,
Figure BDA00018016641200001318
and ε' is a preset first privacy budget;
5. From the first vector set (i.e.,
Figure BDA0001801664120000141
), the data collection platform determines
Figure BDA0001801664120000142
and
Figure BDA0001801664120000143
and, using the first vector set,
Figure BDA0001801664120000144
and
Figure BDA0001801664120000145
computes, for all triples in the preference item set
Figure BDA0001801664120000146
the mutual information
Figure BDA0001801664120000147
and constructs the K-thin chain
Figure BDA0001801664120000148
which it sends to each user terminal;
6. The data collection platform initializes, in the second vector set
Figure BDA0001801664120000149
all vectors z_j' to 0 vectors and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, j is a preference item index in the preference item set, and the preference item indexes selected for different user terminals may be the same or different;
7. Each user terminal, using the user's own preference ranking data, assigns values t_i'[A_j'] to the tuple containing all the attributes in the attribute set
Figure BDA00018016641200001410
and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i'[A_j'] to generate a value index
Figure BDA00018016641200001411
and sends it to the data collection platform; wherein the attribute set
Figure BDA00018016641200001412
comprises two subsets
Figure BDA00018016641200001413
and
Figure BDA00018016641200001427
Figure BDA00018016641200001416
Figure BDA00018016641200001417
corresponds to the collection of leaf item sets of
Figure BDA00018016641200001418
namely
Figure BDA00018016641200001419
Figure BDA00018016641200001420
corresponds to
Figure BDA00018016641200001428
and the value index satisfies the condition
Figure BDA00018016641200001421
8. According to the value index sent by each user terminal,
Figure BDA00018016641200001422
the data collection platform increases by 1 the value of the following entry of the second vector set
Figure BDA00018016641200001423
9. The data collection platform updates each value z_j'[k] of all the vectors in the second vector set to
Figure BDA00018016641200001424
Wherein,
Figure BDA00018016641200001425
ε'' is a preset second privacy budget, ε' + ε'' = ε, and ε is the total privacy budget for establishing the RI model;
10. From the second vector set (i.e., the collection of the relative ranking distribution set and the interleaving distribution set), obtain, for
Figure BDA00018016641200001429
the distribution information of the leaf nodes and the distribution information of the internal nodes.
11. Generate new preference ranking data using the established RI model, wherein the RI model comprises, for
Figure BDA00018016641200001426
the distribution information of the leaf nodes and the distribution information of the internal nodes.
In the above data collection method, the processing of step 1 and the assignment of t_i[A_j] by the user terminal in step 2 may be performed in any order; likewise, the processing of step 6 and the assignment of t_i'[A_j'] by the user terminal in step 7 may be performed in any order.
Next, a comparison with RAPPOR, SH and OLH shows that the SAFARI method proposed by the present application has a significant advantage in the utility of the data collected by the data collection platform. To better illustrate the advantages of the method of the present application, the first-order marginal distribution (Q_1) and the second-order marginal distribution (Q_2) are employed to measure the utility of the preference ranking data collected by the four algorithms RAPPOR, SH, OLH and SAFARI. For both the first-order and the second-order marginal distribution, the L1 distance between the marginal distribution of the data generated by each algorithm and that of the original data is used to measure the utility of the collected data. The specific experimental setup is as follows: we tested the performance of each method using two real data sets, Sushi and Jester. The specific characteristics of these two data sets are shown in Table 2.
TABLE 2 Data set characteristics
Data set | Number of users | Number of items
Sushi | 5,000 | 3~10
Jester | 20,000 | 3~10
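The utility measure used in these experiments can be sketched as follows, assuming that the first-order marginal distribution is the probability of each item being placed at each rank and that the L1 distance is summed over all (item, rank) pairs. The patent does not spell out the definition of Q_1 in this text, so this is one plausible reading; the function names and sample rankings are hypothetical.

```python
from collections import Counter


def first_order_marginal(rankings, items):
    """Pr[item is placed at rank r] for every (item, r) pair."""
    n = len(rankings)
    counts = Counter()
    for sigma in rankings:
        for rank, item in enumerate(sigma, start=1):
            counts[(item, rank)] += 1
    d = len(items)
    return {
        (item, r): counts[(item, r)] / n
        for item in items
        for r in range(1, d + 1)
    }


def l1_distance(p, q):
    """Sum of absolute differences between two marginals with the same keys."""
    return sum(abs(p[key] - q[key]) for key in p)


items = ["a", "b", "c"]
original = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"], ["a", "b", "c"]]
synthetic = [["a", "b", "c"], ["b", "a", "c"], ["b", "a", "c"], ["a", "c", "b"]]
dist = l1_distance(
    first_order_marginal(original, items),
    first_order_marginal(synthetic, items),
)
print(dist)  # 1.0
```

A smaller distance means that the generated rankings preserve the per-item rank statistics of the original data more closely.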
The performance of the SAFARI process is illustrated by analyzing the experimental data below.
First, the first-order marginal distribution and second-order marginal distribution are used to measure the performance of the four methods of RAPPOR, SH, OLH and SAFARI. The results of the experiment are shown in FIG. 3.
As can be seen from FIG. 3, on the different data sets, as the privacy budget becomes larger, the L1 distance between the first-order and second-order marginal distributions of the data generated by the RAPPOR, SH, OLH and SAFARI algorithms and the marginal distributions of the original data set decreases, but the L1 distance of the SAFARI algorithm is consistently smaller than that of RAPPOR, SH and OLH. This is because, on the one hand, the K-thin chain makes the accuracy of the distribution information collected by the data collector with SAFARI very robust, so that the influence of the added noise is small; on the other hand, the RAPPOR, SH and OLH algorithms introduce a large amount of noise when the privacy budget is reduced.
Next, we tested the effectiveness of Rule I using the data sets Sushi and Jester. For this purpose, we compare it with another version of Rule I (referred to below as the alternative version). In the alternative version, each user converts his or her preference ranking data so as to directly provide the distributions in
Figure BDA0001801664120000151
We let the data collector use the SAFA method to collect distribution information from the data converted by users according to Rule I and according to the alternative version, respectively, and report the average L1 distance of the distributions obtained in S_3. The results of the experiment are shown in FIG. 4.
As can be seen from FIG. 4, on the different data sets, the alternative version yields better utility when d does not exceed 4. This is because, when d is relatively small, the advantage of Lasso regression does not outweigh the impact of information loss. However, Rule I yields quite good results when d is relatively large, which demonstrates the superiority of Rule I.
Finally, we tested the effectiveness of the SAFA algorithm using the data sets Sushi and Jester. For this purpose, we compare it with the Harmony method. We let the data collector collect S_1 from the data converted by users according to Rule I, using the SAFA method and the Harmony method respectively, and report the average L1 distance of the distributions obtained. The results of the experiment are shown in FIG. 5.
As can be seen from fig. 5, the distribution information collected using the SAFA method contains a smaller amount of noise for different data sets. This is because, when the value space of the attribute is small, unnecessary noise is introduced by the process of mapping the value space of each attribute to a matrix in the Harmony algorithm.
According to the data collection method, the preference sorting data collection is realized, privacy disclosure is avoided, and meanwhile the collected preference sorting data is guaranteed to have high data utility.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A method for collecting preference ranking data satisfying local differential privacy, comprising:
the data collection platform initializes, in the first vector set
Figure FDA0003133224450000011
all vectors z_j to 0 vectors and, for each user terminal u_i, selects a preference item index j from a preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, and j is a preference item index in the preference item set;
each user terminal, using the user's own preference ranking data, assigns values t_i[A_j] to the tuple containing all the attributes in the attribute set
Figure FDA00031332244500000124
and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i[A_j] to generate a value index
Figure FDA0003133224450000012
and sends it to the data collection platform; wherein A_j is, in the attribute set
Figure FDA0003133224450000013
the j-th attribute, t_i[A_j] represents the value of t_i on A_j, the number of attributes in the tuple is equal to the number of preference items in the preference ranking data, the attributes correspond one-to-one to the preference items, and the value of each attribute is equal to the rank of the corresponding preference item; the value index satisfies the condition
Figure FDA0003133224450000014
Figure FDA00031332244500000125
I(t_i[A_j]) represents the index of t_i[A_j] in dom(A_j), dom(A_j) represents the value space of attribute A_j, and Pr[·] represents a probability distribution value;
according to the value index sent by each user terminal,
Figure FDA0003133224450000015
the data collection platform increases by 1 the value of the following entry of the first vector set
Figure FDA0003133224450000016
the data collection platform updates each value z_j[k] of all the vectors in the first vector set to
Figure FDA0003133224450000017
Figure FDA0003133224450000018
wherein,
Figure FDA0003133224450000019
ε' is a preset first privacy budget,
Figure FDA00031332244500000110
and n represents the total number of the user terminals;
the data collection platform determines, from the first vector set,
Figure FDA00031332244500000111
and
Figure FDA00031332244500000112
and, using the first vector set,
Figure FDA00031332244500000113
and
Figure FDA00031332244500000114
computes, for all triples in the preference item set
Figure FDA00031332244500000115
the mutual information
Figure FDA00031332244500000116
and constructs a K-thin chain
Figure FDA00031332244500000117
which is sent to each user terminal; wherein
Figure FDA00031332244500000118
is a binary variable used to mark whether
Figure FDA00031332244500000119
is ranked before or after
Figure FDA00031332244500000120
k_1, k_2, k_3 represent the indexes, within the preference item set, of the items in any triple of the preference item set, d is the total number of items contained in the preference item set, and
Figure FDA00031332244500000121
represents the rank of item
Figure FDA00031332244500000122
the data collection platform initializes, in the second vector set
Figure FDA00031332244500000123
all vectors z_j' to 0 vectors and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, j is a preference item index in the preference item set, and the preference item indexes selected for different user terminals are the same or different;
each user terminal, using the user's own preference ranking data, assigns values t_i'[A_j'] to the tuple containing all the attributes in the attribute set
Figure FDA0003133224450000021
and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i'[A_j'] to generate a value index
Figure FDA0003133224450000022
and sends it to the data collection platform; wherein the attribute set
Figure FDA0003133224450000023
comprises two subsets
Figure FDA0003133224450000024
Figure FDA0003133224450000025
corresponds to the collection of leaf item sets of
Figure FDA0003133224450000026
Figure FDA0003133224450000027
corresponds to the collection of internal item sets of
Figure FDA0003133224450000028
Figure FDA0003133224450000029
is the part of
Figure FDA00031332244500000210
composed of the leaf item sets containing only one item, |·| represents the total number of items contained in a set, and the value index satisfies the condition
Figure FDA00031332244500000211
Figure FDA00031332244500000212
according to the value index sent by each user terminal,
Figure FDA00031332244500000213
the data collection platform increases by 1 the value of the following entry of the second vector set
Figure FDA00031332244500000214
the data collection platform updates each value z_j'[k] of all the vectors in the second vector set to
Figure FDA00031332244500000215
Figure FDA00031332244500000216
wherein,
Figure FDA00031332244500000217
ε'' is a preset second privacy budget, ε' + ε'' = ε, and ε is the total privacy budget for establishing an RI model;
obtaining, from the second vector set, for
Figure FDA00031332244500000218
the distribution information of the leaf nodes and the distribution information of the internal nodes;
generating preference ranking data using an RI model that includes distribution information of the leaf nodes and distribution information of the interior nodes.
2. The method of claim 1, further comprising: generating a specified number of preference ranking data according to the mutual information of the triples and, for
Figure FDA00031332244500000228
the distribution information of the leaf nodes and the distribution information of the internal nodes.
3. The method of claim 1 or 2, wherein determining, by the data collection platform from the first vector set,
Figure FDA00031332244500000219
comprises:
for each distribution in
Figure FDA00031332244500000220
namely
Figure FDA00031332244500000221
constructing a Lasso regression model
Figure FDA00031332244500000222
wherein,
Figure FDA00031332244500000223
is a column vector of length 2d that stores the information of the distributions
Figure FDA00031332244500000224
and
Figure FDA00031332244500000225
Figure FDA00031332244500000226
is a binary matrix of size 2d × d(d-1), and
Figure FDA00031332244500000229
is a column vector of length d(d-1) that stores the information of the joint distribution
Figure FDA00031332244500000227
solving the Lasso regression model by the least-angle regression method to obtain an estimate of
Figure FDA0003133224450000031
thereby determining the joint distribution
Figure FDA0003133224450000032
and, according to the joint distribution
Figure FDA0003133224450000033
computing
Figure FDA0003133224450000034
4. The method of claim 1 or 2, wherein determining, by the data collection platform from the first vector set,
Figure FDA0003133224450000035
comprises:
for each distribution in
Figure FDA0003133224450000036
namely
Figure FDA0003133224450000037
constructing a Lasso regression model
Figure FDA0003133224450000038
Figure FDA0003133224450000039
wherein,
Figure FDA00031332244500000310
is a column vector of length (d+2) that stores the information of the distributions
Figure FDA00031332244500000311
and
Figure FDA00031332244500000312
Figure FDA00031332244500000313
is a binary matrix of size (d+2) × 2d, and
Figure FDA00031332244500000314
is a column vector of length 2d that stores the information of the joint distribution
Figure FDA00031332244500000315
solving the Lasso regression model by the least-angle regression method to obtain an estimate of
Figure FDA00031332244500000316
thereby determining the joint distribution
Figure FDA00031332244500000317
5. The method according to claim 1 or 2,
Figure FDA00031332244500000318
CN201811079995.XA 2018-09-17 2018-09-17 Preference sorting data collection method meeting local differential privacy Active CN109299436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811079995.XA CN109299436B (en) 2018-09-17 2018-09-17 Preference sorting data collection method meeting local differential privacy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811079995.XA CN109299436B (en) 2018-09-17 2018-09-17 Preference sorting data collection method meeting local differential privacy

Publications (2)

Publication Number Publication Date
CN109299436A CN109299436A (en) 2019-02-01
CN109299436B true CN109299436B (en) 2021-10-15

Family

ID=65163261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811079995.XA Active CN109299436B (en) 2018-09-17 2018-09-17 Preference sorting data collection method meeting local differential privacy

Country Status (1)

Country Link
CN (1) CN109299436B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110022531B (en) * 2019-03-01 2021-01-19 华南理工大学 Localized differential privacy urban garbage data report and privacy calculation method
CN113811868A (en) * 2019-06-12 2021-12-17 阿里巴巴集团控股有限公司 Method and system for responding multidimensional analysis query under local differential privacy
CN110443063B (en) * 2019-06-26 2023-03-28 电子科技大学 Adaptive privacy-protecting federal deep learning method
CN112995076B (en) * 2019-12-17 2022-09-27 国家电网有限公司大数据中心 Discrete data frequency estimation method, user side, data center and system
CN111669366B (en) * 2020-04-30 2021-04-27 南京大学 Localized differential private data exchange method and storage medium
CN112329056B (en) * 2020-11-03 2021-11-02 石家庄铁道大学 Government affair data sharing-oriented localized differential privacy method
JPWO2022107284A1 (en) * 2020-11-19 2022-05-27
CN113111383B (en) * 2021-04-21 2022-05-20 山东大学 Personalized differential privacy protection method and system for vertically-divided data
CN114091100B (en) * 2021-11-23 2024-05-03 北京邮电大学 Track data collection method and system meeting local differential privacy
CN115098931B (en) * 2022-07-20 2022-12-16 江苏艾佳家居用品有限公司 Small sample analysis method for mining personalized requirements of indoor design of user

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740245A (en) * 2014-12-08 2016-07-06 北京邮电大学 Frequent item set mining method
CN106991335A (en) * 2017-02-20 2017-07-28 南京邮电大学 Data publishing method based on differential privacy protection
CN107862219A (en) * 2017-11-14 2018-03-30 哈尔滨工业大学深圳研究生院 Method for protecting demand privacy in social networks
CN107871087A (en) * 2017-11-08 2018-04-03 广西师范大学 Personalized differential privacy protection method for high-dimensional data publishing in a distributed environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9672364B2 (en) * 2013-03-15 2017-06-06 Microsoft Technology Licensing, Llc Differentially private linear queries on histograms
US10885467B2 (en) * 2016-04-28 2021-01-05 Qualcomm Incorporated Differentially private iteratively reweighted least squares

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
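A high-dimensional data publishing algorithm satisfying differential privacy based on a latent tree model; Su Weihang (苏炜航) et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2018-04-30; Vol. 39, No. 4; pp. 681-685 *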

Also Published As

Publication number Publication date
CN109299436A (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN109299436B (en) Preference sorting data collection method meeting local differential privacy
Ma et al. A highly accurate prediction algorithm for unknown web service QoS values
CN111159483B (en) Tensor calculation-based social network diagram abstract generation method
CN113206831B (en) Data acquisition privacy protection method facing edge calculation
Wang et al. Discover community leader in social network with PageRank
CN107046557A (en) Intelligent medical call inquiry system using dynamic Skyline queries in a mobile cloud computing environment
CN108024307A (en) Heterogeneous network access selection method and system based on the Internet of Things
CN114385376A (en) Client selection method for federated learning of lower edge side of heterogeneous data
Cui et al. Communication-efficient federated recommendation model based on many-objective evolutionary algorithm
CN108173958A (en) Optimized data storage method based on an ant colony algorithm in a multi-cloud environment
Huo et al. Aggregated inference
CN112612948B (en) Deep reinforcement learning-based recommendation system construction method
Chang et al. Personalized multimedia recommendation systems using higher-order tensor singular-value-decomposition
CN115618127A (en) Collaborative filtering algorithm of neural network recommendation system
CN107679709A (en) Supplier selection method and device based on intuitionistic fuzzy numbers and reputation propagation
CN113902113A (en) Convolutional neural network channel pruning method
CN113850317A (en) Multi-type neighbor aggregation graph convolution recommendation method and system
CN112990672A (en) Introduced technology evaluation selection method
CN112765413A (en) Graph data query recommendation method based on user characteristics
Zhou et al. Hgena: A hyperbolic graph embedding approach for network alignment
CN111523054A (en) Project recommendation method and system based on active account and similar accounts
Chen et al. Irlm: inductive representation learning model for personalized poi recommendation
CN109919790A (en) Group type recognition methods, device, electronic equipment and storage medium
Liu et al. Attentive-feature transfer based on mapping for cross-domain recommendation
CN114826967B (en) Information sharing capability evaluation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant