CN109299436B

CN109299436B - Preference sorting data collection method meeting local differential privacy

Info

Publication number: CN109299436B
Application number: CN201811079995.XA
Authority: CN
Inventors: 程祥; 苏森; 杨健宇
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-09-17
Filing date: 2018-09-17
Publication date: 2021-10-15
Anticipated expiration: 2038-09-17
Also published as: CN109299436A

Abstract

The application discloses a preference ranking data collection method satisfying local differential privacy. A user terminal converts its preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends the result to a data collection platform. The data collection platform and the user terminals cooperate to run an algorithm satisfying local differential privacy, complete the construction of the whole RI model, and generate preference ranking data from the constructed model. With this method, the collected preference ranking data retains high data utility while privacy disclosure is avoided.

Description

Preference sorting data collection method meeting local differential privacy

Technical Field

The application relates to a data collection technology, in particular to a preference sorting data collection method meeting local differential privacy.

Background

Preference ranking data is a typical kind of personal data. A user's preference ranking data is the ranking that the user gives to a given set of items (ItemSet) according to his or her own preference for the items in the set. For example, if the item set is {cola, white spirit, sprite, plain boiled water, beer} and a user's preference ranking of these five items is <white spirit, cola, beer, sprite, plain boiled water>, then the user likes white spirit the most and plain boiled water the least. With the rapid development of information technologies such as the mobile internet and cloud computing, and the increasing popularity of mobile terminals such as smart phones, it has become common for users to share their preference ranking data with various data collectors (e.g., service providers) through mobile applications in order to enjoy personalized services. For service providers, in turn, collecting and analyzing user preference ranking data is essential for providing a better user experience and creating new revenue opportunities. However, a user's preference ranking data often contains extremely sensitive personal information, and direct collection of such data by a data collector may cause serious privacy disclosure problems for individuals.

FIG. 1 is a diagram illustrating a current user preference data collection scenario. The scenario mainly involves two roles, namely users (the data contributors) and a data collector. Given an item set X = {x₁, x₂, ..., x_d} formed by d items, each user u_i (1 ≤ i ≤ n) holds one piece of preference ranking data σ_i = <σ_i(1), σ_i(2), ..., σ_i(d)>, and the users are independent of each other. Here σ_i(j) = x_k means that the rank of x_k in σ_i is j. The data collector uses the data collection platform to collect each user's preference ranking data over the network and obtain a preference ranking data set, that is, to construct a model of the preference ranking data. The model can generate new preference ranking data that has the same statistical characteristics as the users' original preference ranking data. Because the original preference ranking data is not released directly, user privacy is protected to a certain extent. The data collector may analyze the collected preference ranking data model directly, or may release the model or the newly generated preference ranking data to a third party (e.g., a research institution).

It can be seen from the above processing that, during the process of collecting the user preference ranking data, it is possible to prevent the user of the model and the new preference ranking data from acquiring the user privacy, but before the model of the preference ranking data is formed, the user preference privacy data may still be revealed. Specifically, for each user, there are three roles that may pose a threat to their privacy: 1) a data collector; 2) other users; 3) any potential attacker in addition to the data collector and other users.

Privacy-preserving preference ranking data collection offers a feasible way to solve the personal privacy disclosure problem caused by collecting preference ranking data. Local differential privacy (Local Differential Privacy), proposed in recent years, is a differential privacy technique designed specifically for the privacy disclosure problem caused by data collection. Specifically, it requires each data contributor to first add a suitable amount of noise to his or her own data and then send the noisy data to the data collector, thereby protecting the privacy of the data contributor.

Currently, there is some work on data collection satisfying local differential privacy. Based on information theory, Duchi et al. propose a high-dimensional data collection method aimed at mean estimation and risk minimization tasks. Harmony, a data collection method, was later proposed by extending this method with sampling techniques. Specifically, for each piece of high-dimensional data, Harmony randomly selects one dimension of the data; if the dimension holds continuous data, collection is performed with the method proposed by Duchi et al.; if the dimension holds discrete data, collection is performed with the SH mechanism. To obtain the frequent items of multidimensional data, Qin et al. propose a two-phase data collection method called LDPMiner. In the first phase, the method uses the SH mechanism to determine a preliminary candidate space of frequent items from the noisy data; in the second phase, it derives precise frequent items using the RAPPOR mechanism. Based on the EM (expectation-maximization) algorithm, Fanti et al. propose an extended RAPPOR mechanism. The mechanism assumes that all dimensions of the high-dimensional data are independent of each other, collects the data of every dimension with the RAPPOR mechanism, and uses the collected data as input to an EM algorithm to infer the joint distribution of the whole data, so that the mechanism can be used to generate data that follows the original distribution. However, when the data dimension is high, the mechanism has high time complexity and converges slowly. To address this problem, Ren et al. combine the EM algorithm with Lasso regression, which greatly improves the efficiency of the extended RAPPOR mechanism.

A local differential privacy algorithm can be applied directly to preference ranking data as follows: assume a value space formed by all possible preference rankings, regard each user's preference ranking data as one discrete value in that space, and then directly apply a single-dimensional multi-valued data collection method satisfying local differential privacy, such as the RAPPOR, SH, or OLH algorithm, to collect the data. However, the converted data has a huge value space: for a given item set X = {x₁, x₂, ..., x_d}, the value space of the converted data has size d!. These algorithms therefore introduce a large amount of noise into the collected data, making the finally obtained preference ranking data unusable.
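To make the size argument concrete, the following minimal sketch prints d! for a few item-set sizes. The function name `ranking_space_size` is hypothetical and is not part of the application.

```python
from math import factorial


def ranking_space_size(d: int) -> int:
    """Number of distinct full rankings of d items, i.e. d!."""
    return factorial(d)


if __name__ == "__main__":
    for d in (5, 10, 20):
        # A frequency oracle applied directly to whole rankings must
        # estimate one frequency per possible ranking.
        print(f"d = {d}: value space size d! = {ranking_space_size(d)}")
```

Even for the five-item example above the space already holds 120 values, and it grows factorially with d.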

Disclosure of Invention

The application provides a preference sorting data collection method meeting local differential privacy, and preference sorting data collection realized by the method can ensure that the collected preference sorting data has higher data utility while avoiding privacy disclosure.

In order to achieve the purpose, the following technical scheme is adopted in the application:

a method of preferred ranking data collection satisfying local differential privacy, comprising:

the data collection platform initializes all vectors z_j in the first vector set to zero vectors and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, and j is a preference item index in the preference item set;

for each user terminal, according to the user's own preference ranking data, values are assigned to the tuple t_i[A_j] over all attributes of the attribute set, and, according to the preference item index j sent to the user terminal by the data collection platform, the tuple t_i[A_j] is used to generate a value subscript

which is sent to the data collection platform; wherein A_j is the j-th attribute of the attribute set, t_i[A_j] represents the value of t_i on A_j, the number of attributes in the tuple is equal to the number of preference items in the preference ranking data, the attributes correspond to the preference items one to one, and the value of each attribute is equal to the rank of the corresponding preference item; the value subscript satisfies the condition

k ∈ {1, 2, ..., |dom(A_j)|}, I(t_i[A_j]) represents the index of t_i[A_j] in dom(A_j), and dom(A_j) represents the value space of attribute A_j;

according to the value subscript sent by each user terminal, the data collection platform increases the corresponding value of the corresponding vector in the first vector set by 1;

the data collection platform updates each value z_j[k] of all vectors in the first vector set to

wherein

ε′ is a preset first privacy budget;

the data collection platform determines from the first set of vectors

and

and, using the first vector set,

and

computes, for all triplets in the preference item set

the mutual information

constructs the K-thin chain

and sends it to each user terminal;

the data collection platform initializes all vectors z_j′ in the second vector set to zero vectors and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, j is a preference item index in the preference item set, and the preference item indexes selected for different user terminals may be the same or different;

values are assigned to the tuple t_i′[A_j′] over all attributes in it, and, according to the preference item index j sent to the user terminal by the data collection platform, the tuple t_i′[A_j′] is used to generate a value subscript

which is sent to the data collection platform; wherein the attribute set

comprises two subsets

and

corresponds to

collection of leaf item sets of

corresponds to

the value subscript satisfies the condition

according to the value subscript sent by each user terminal, the data collection platform increases the corresponding value of the corresponding vector in the second vector set by 1;

the data collection platform updates each value z_j′[k] of all vectors in the second vector set to

wherein

ε″ is a preset second privacy budget, ε′ + ε″ = ε, and ε is the total privacy budget for establishing the RI model;

obtaining, according to the second vector set, the distribution information of the leaf nodes and the distribution information of the internal nodes of the K-thin chain;

generating preference ranking data using an RI model that includes distribution information of the leaf nodes and distribution information of the interior nodes.

Preferably, the method further comprises: generating a specified number of pieces of preference ranking data according to the mutual information of the triplets and the distribution information of the leaf nodes and of the internal nodes of the K-thin chain.

Preferably, the data collection platform determining, from the first vector set,

comprises the following steps:

for each

distribution in

a Lasso regression model is constructed

wherein

is a column vector of length 2d that stores the distributions

and

as its entries,

is a binary matrix of size 2d × d(d-1),

is a column vector of length d(d-1) that stores the joint distribution

as its entries;

solving the Lasso regression model by the least angle regression method to obtain an estimate of

and determining the joint distribution

according to the joint distribution

computing

comprises the following steps:

for each

distribution in

a Lasso regression model is constructed

wherein

is a column vector of length (d+2) that stores the distributions

and

as its entries,

is a binary matrix of size (d+2) × 2d,

is a column vector of length 2d that stores the joint distribution

as its entries;

and determining the joint distribution

Preferably,

According to the above technical scheme, the user terminal converts preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends the result to the data collection platform. The data collection platform and the user terminals cooperate to run an algorithm satisfying local differential privacy and build the whole RI model, and the established RI model is used to generate preference ranking data satisfying local differential privacy. With this method, the collected preference ranking data retains high data utility while privacy disclosure is avoided.

Drawings

FIG. 1 is a diagram illustrating a current user preference data collection scenario;

FIG. 2 is a schematic view of an example of a 2-thin chain;

FIG. 3 is a first performance comparison diagram of the present application;

FIG. 4 is a second performance comparison diagram of the present application;

FIG. 5 is a third performance comparison diagram of the present application.

Detailed Description

For the purpose of making the objects, technical means and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings.

To solve the problem mentioned in the background, namely that existing local differential privacy methods yield unusable data when applied to preference ranking data, the applicant proposes a preference ranking data collection algorithm satisfying local differential privacy (the SAFARI algorithm). Its main idea is that the data collector collects distribution information over a series of small value spaces selected according to the riffle independent model (RI model), uses the collected distribution information to approximate the overall distribution of the preference ranking data and thereby build a model, and then uses the built model to generate preference ranking data. Because the SAFARI algorithm handles several small value spaces instead of one large value space, it can greatly reduce the scale of the noise.

The process of the present application is described in detail below.

At present, preference ranking data can be modeled with an RI model. Exploiting the mutual exclusivity between the dimensions of preference ranking data, the RI model approximates the overall distribution of the preference ranking data by the product of two kinds of low-dimensional distributions, namely relative ranking distributions (Relative Ranking Distributions) and cross distributions (Interleaving Distributions), and thus models the data effectively. New preference ranking data is then generated from the established model, thereby protecting user privacy.

The structure of the RI model is a binary tree called K-thin chain. Wherein the root node represents the original item set, the other nodes represent sub item sets of the original item set, and the size of the item set of the leaf node does not exceed a constant K. FIG. 2 is an example of a 2-thin chain.

In this example, the original item set {cola, white spirit, sprite, plain boiled water, beer} is first divided into two mutually exclusive item subsets that are riffle independent of each other, namely {plain boiled water} and {cola, white spirit, sprite, beer}. Since the size of the subset {cola, white spirit, sprite, beer} exceeds 2 (the value of K), it is further divided into the mutually exclusive, riffle independent subsets {cola, sprite} and {white spirit, beer}.
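The tree in this example can be written down as a small data structure. The sketch below is only an illustration under stated assumptions: the class `Node` and the helpers `leaf_item_sets` and `is_valid_chain` are hypothetical names, and the validity check encodes just the two properties stated in the text (children split their parent into mutually exclusive subsets, and every leaf item set has at most K items).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of the binary tree; `items` is the item set it represents."""
    items: frozenset
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf_item_sets(node):
    """Collect the item sets of all leaf nodes, left to right."""
    if node.left is None and node.right is None:
        return [node.items]
    return leaf_item_sets(node.left) + leaf_item_sets(node.right)


def is_valid_chain(node, k):
    """Children must partition their parent; every leaf holds at most k items."""
    if node.left is None and node.right is None:
        return len(node.items) <= k
    if node.left is None or node.right is None:
        return False
    disjoint = not (node.left.items & node.right.items)
    covers = (node.left.items | node.right.items) == node.items
    return (disjoint and covers
            and is_valid_chain(node.left, k)
            and is_valid_chain(node.right, k))


# The 2-thin chain described for FIG. 2.
chain = Node(
    frozenset({"cola", "white spirit", "sprite", "plain boiled water", "beer"}),
    left=Node(frozenset({"plain boiled water"})),
    right=Node(
        frozenset({"cola", "white spirit", "sprite", "beer"}),
        left=Node(frozenset({"cola", "sprite"})),
        right=Node(frozenset({"white spirit", "beer"})),
    ),
)
```

With K = 2 the example tree passes the check; with K = 1 it fails because two leaves contain two items each.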

The learning process of the RI model comprises two stages of structure learning and parameter learning:

1) Structure learning. First, the mutual information of all triplets in the item set is calculated, which is defined as follows:

given an item set (x₁, x₂, ..., x_d), for any triplet in the set

wherein

and

are three different items of the set, the mutual information of the triplet is

wherein

represents the rank of the item

and

is a binary variable. Specifically,

represents

that is, the item

is ranked before the item

in the ranking;

represents

that is, the item

is ranked after the item

in the ranking.

Then the K-thin chain is constructed on the original item set from the triplet mutual information using an anchor point algorithm.
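As an illustration of the quantity used in structure learning, the sketch below computes an empirical mutual information between the rank of one item and the binary variable "item b is ranked before item c". It is a non-private sketch over raw rankings, whereas the application estimates the required distributions under local differential privacy; the function name `triplet_mutual_information` and the use of natural logarithms are assumptions.

```python
from collections import Counter
from math import log


def triplet_mutual_information(rankings, a, b, c):
    """Empirical I(rank of a ; [b ranked before c]) in nats.

    `rankings` is a list of sequences, each listing items from most to
    least preferred.
    """
    n = len(rankings)
    joint = Counter()
    for ranking in rankings:
        rank_a = ranking.index(a) + 1              # 1-based rank of item a
        b_before_c = ranking.index(b) < ranking.index(c)
        joint[(rank_a, b_before_c)] += 1

    rank_counts = Counter()
    binary_counts = Counter()
    for (rank_a, b_before_c), count in joint.items():
        rank_counts[rank_a] += count
        binary_counts[b_before_c] += count

    mi = 0.0
    for (rank_a, b_before_c), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (count / n) * log(
            count * n / (rank_counts[rank_a] * binary_counts[b_before_c])
        )
    return mi
```

When the rank of a fully determines whether b precedes c, the value equals the entropy of the binary variable; when the two are independent in the sample, it is 0.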

2) Parameter learning. Based on the constructed K-thin chain, the distribution of each node is learned to approximate the overall distribution of the original preference ranking data set. The distributions of leaf nodes are called relative ranking distributions (Relative Ranking Distributions), and the distributions of internal nodes (including the root node) are called cross distributions (Interleaving Distributions).

Once the relative ranking distributions and the cross distributions have been determined through the above process, the modeling of the RI model is complete.

The preference ranking data collection method of the present application is based on the RI model, and preference ranking data is generated according to the established RI model. The difference is that, during modeling, the acquisition of the triplet mutual information and of the relative ranking and cross distributions satisfies local differential privacy.

The preference ranking data algorithm (SAFARI algorithm) meeting the local differential privacy in the present application relates to two rules Rule I, Rule II and one SAFA algorithm, and specifically may include 5 stages:

Stage 1

1. Each user converts his or her own preference ranking data according to a conversion rule (denoted Rule I), so that the data collection platform can obtain the distribution information required for calculating the triplet mutual information. The content of this distribution information is described in detail later.

Stage 2

1. Using the privacy budget ε′ and with the cooperation of the users, the data collection platform calls the SAFA algorithm to collect, from the data converted by the users in stage 1, the distribution information required for calculating the triplet mutual information. In the SAFA algorithm, a user adds noise to the converted data before sending it to the data collection platform, so that privacy disclosure is avoided.

2. The data collection platform calculates the mutual information of all triplets in the RI model using the collected distribution information.

3. The data collection platform constructs the K-thin chain

4. The data collection platform sends

to each user.

Stage 3

1. Each user converts his or her own preference ranking data according to another conversion rule (denoted Rule II), so that the data collection platform can determine, with respect to

the relative ranking distribution (Relative Ranking Distributions) information and the cross distribution (Interleaving Distributions) information.

Stage 4

1. Using the privacy budget ε″ and with the cooperation of the users, the data collector calls the SAFA algorithm to collect, from the data converted by the users in stage 3, with respect to

the relative ranking distribution information and the cross distribution information. In the SAFA algorithm, a user adds noise to the converted data before sending it to the data collection platform, so that privacy disclosure is avoided.

Thus, once the relative ranking distribution information and the cross distribution information are obtained, the construction of the RI model is complete.

Stage 5

According to the constructed RI model, the data collector generates n pieces of new preference ranking data.

Through stages 1 to 4, an RI model of the preference ranking data satisfying local differential privacy is built. The data collection platform may release the completed RI model to a third party or, to better serve the third party, may preferably further generate new preference ranking data through stage 5 and release that data to the third party.

In addition, in the above data collection method, stage 2 and stage 4 each complete part of the local differential privacy processing, so the overall privacy budget ε satisfies ε = ε′ + ε″. Preferably, in practical applications, the following is usually taken:

We will next introduce Rule I, Rule II and SAFA methods used in the above data collection method satisfying the local differential privacy, respectively.

Design Rule I

As described above, Rule I is used to convert a user's preference ranking data, and the converted data is used to calculate the triplet mutual information, so Rule I must be designed according to how the triplet mutual information is calculated. Specifically, to compute the mutual information of any possible triplet (x_i, x_j, x_k) according to its definition, the data collection platform needs to collect three types of distribution information:

To accomplish this task, an intuitive conversion method is for each user to convert his or her preference ranking data so as to provide information about all three types of distributions. Specifically, each user converts the preference ranking data into a tuple comprising a plurality of attributes, wherein, for the set

each attribute corresponds to one of its distributions.

However, this conversion method introduces redundant information into the user's converted data and increases the conversion complexity, because

In fact, the data collector only needs to collect the distributions in

and then derive from them the distributions in

and

in turn. Thus, each user only needs to convert his or her preference ranking data to provide the distributions in

alone. Since

contains O(d³) different distributions, the number of attributes in each user's converted tuple is O(d³).

Unfortunately, when d is relatively large, the curse of dimensionality causes the data collected under LDP to contain a large amount of noise. To solve this problem, Rule I is designed. According to this conversion rule, each user only needs to convert his or her preference ranking data to provide the distributions in

alone. The data collection platform only needs to collect the distributions in

and then estimate from them, using a regression model, the distributions in

and

in turn. In particular, the applicant has found that this estimation problem is a sparse linear regression problem. Therefore, a Lasso regression model, which solves such problems effectively, is selected. The following describes the details of Rule I and how the Lasso regression model is used to estimate the distributions in

and

in turn.

Rule I: each user terminal u_i first converts the preference ranking data σ_i of the corresponding user into a tuple t_i that contains all attributes of the attribute set

wherein, within the set

each attribute A_j corresponds to one item

and A_j has the value space dom(A_j) = {1, 2, ..., d}. dom(A_j) consists of d possible values, which represent, for the item

its possible absolute ranks. Then, for the set

each user u_i assigns a value to t_i[A_j] for each attribute A_j according to σ_i.
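A minimal sketch of this conversion, assuming the attributes are ordered like the item set and that ranks are 1-based; the function name `rule_one` is hypothetical.

```python
def rule_one(ranking, items):
    """Convert a preference ranking into a tuple of absolute ranks.

    `ranking` lists the items from most to least preferred; `items` fixes
    the attribute order A_1, ..., A_d, which the platform and every user
    terminal must share. The returned list t satisfies
    t[j] = rank of items[j] in `ranking` (1-based).
    """
    rank_of = {item: rank for rank, item in enumerate(ranking, start=1)}
    return [rank_of[item] for item in items]


items = ["cola", "white spirit", "sprite", "plain boiled water", "beer"]
ranking = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]
converted = rule_one(ranking, items)
print(converted)
```

For the example ranking from the Background section, cola has rank 2, white spirit rank 1, sprite rank 4, plain boiled water rank 5, and beer rank 3. Each attribute takes one of only d values, rather than one of d! whole rankings.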

Distribution estimation using Lasso: determining

and

Based on the collected distributions in

the data collection platform may estimate, in the following manner, the distributions in

and

in turn.

First, the data collection platform estimates, from the distributions in

the distribution information in

Specifically, for each

distribution in

the data collector constructs a Lasso regression model

wherein

1)

is a column vector of length 2d that stores the distributions

and

as its entries;

2)

is a binary matrix of size 2d × d(d-1);

3)

is a column vector of length d(d-1) that stores the joint distribution

as its entries.

By solving the Lasso regression model with the least angle regression method, the data collection platform can estimate

and thereby obtain the joint distribution

in estimated form. According to the joint distribution

the data collection platform can then calculate the distribution

as well.
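The dimensions given above (a vector of length 2d, a binary matrix of size 2d × d(d-1), and a vector of length d(d-1)) are consistent with a linear system that maps a joint distribution over the ranks of two distinct items to the two single-item rank distributions. The sketch below builds such a matrix under that assumption; the symbols in the original formulas are not reproduced in this text, so the layout of rows and columns, and the names `build_marginal_system` and `marginals_from_joint`, are hypothetical. It only constructs and checks the system; solving it with Lasso (for example by least angle regression) is not shown.

```python
from itertools import permutations


def build_marginal_system(d):
    """Build a binary matrix of size 2d x d(d-1).

    Column (r, s), r != s, stands for the joint event "item a has rank r
    and item b has rank s". Rows 0..d-1 are the marginal of item a,
    rows d..2d-1 are the marginal of item b.
    """
    cells = list(permutations(range(1, d + 1), 2))   # d(d-1) joint cells
    matrix = [[0] * len(cells) for _ in range(2 * d)]
    for col, (r, s) in enumerate(cells):
        matrix[r - 1][col] = 1        # contributes to P(rank of a = r)
        matrix[d + s - 1][col] = 1    # contributes to P(rank of b = s)
    return matrix, cells


def marginals_from_joint(matrix, joint):
    """Multiply the binary matrix by the joint vector to get both marginals."""
    return [sum(m * p for m, p in zip(row, joint)) for row in matrix]


d = 4
matrix, cells = build_marginal_system(d)
uniform_joint = [1 / len(cells)] * len(cells)
marginals = marginals_from_joint(matrix, uniform_joint)
```

Because each column has exactly two nonzero entries, the system is sparse, which fits the description of the estimation problem as a sparse linear regression.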

Then, the data collector estimates, from the distributions in

and

the distribution information in

Specifically, for each

distribution in

the data collector constructs a Lasso regression model

wherein

1)

is a column vector of length (d+2) that stores the distributions

and

as its entries;

2)

is a binary matrix of size (d+2) × 2d;

3)

is a column vector of length 2d that stores the joint distribution

as its entries.

Similarly, by solving the Lasso regression model with the least angle regression method, the data collector can estimate

and thereby obtain the joint distribution

in estimated form.

It should be noted that, in Rule I described above, within the set

each attribute A_j corresponds one to one with a preference item, and this correspondence must be kept consistent on the data collection platform side and the user terminal side. That is, in SAFARI, the data collection platform and each user terminal must run the same Rule I. In addition, as can be seen from the above, under this conversion rule the number of attributes in each user's converted tuple is O(d). For the attribute set

the value space size |dom(A_j)| of each attribute A_j is only d, which is clearly much smaller than d!. By estimating, for the set

the frequency of every possible value of each attribute, the data collection platform can obtain the distributions in

and then estimate from them the distributions in

and

in turn. Such processing does not generate a large amount of redundant data, and thus effectively improves the utility of the collected preference data.

Design Rule II

After the K-thin chain

is constructed, the data collection platform needs to collect, for

the information of its relative ranking distributions (Relative Ranking Distributions) and cross distributions (Interleaving Distributions). To this end, Rule II is designed. According to this conversion rule, each user terminal only needs to convert the preference ranking data of the corresponding user to provide information about these two types of distributions.

Rule II: each user terminal u_i first converts the preference ranking data σ_i of the corresponding user into a tuple t_i that contains all attributes of the attribute set

Specifically, the attribute set

consists of two subsets

and

namely

wherein

1)

corresponds to

collection of leaf item sets

which is

composed of the leaf item sets containing only one item. Because, for

the relative ranking distribution of each leaf item set can be easily inferred, users do not need to provide information about

itself.

Each attribute A_j in this subset corresponds, in the collection

to one leaf item set l_k. The value space of A_j consists of all possible relative rankings of l_k. In particular, when K is 1,

all of the leaf item sets contain only one item and, in this case,

wherein, for the K-thin chain

K represents the maximum number of items contained in any of its leaf item sets.

2)

Each attribute A_j in this subset corresponds, in the collection of internal item sets

to one internal item set g_k. The value space of A_j consists of all possible interleavings (cross rankings) of g_k. Then, for

It should be noted that, in Rule II described above, within the set

each attribute A_j corresponds one to one with a leaf item set or an internal item set, and this correspondence must be kept consistent on the data collection platform side and the user terminal side. That is, in SAFARI, the data collection platform and each user terminal must run the same Rule II. In addition, as can be seen from the above, under this conversion rule the number of attributes in each user's converted tuple is O(d). For

the maximum value space size of each attribute A_j is K!; for

the maximum value space size of each attribute A_j is

Thus, for the attribute set

the maximum value space size of any attribute is

which is clearly much smaller than d!. By estimating, for the set

the frequency of every possible value of each attribute, the data collection platform can obtain, with respect to

the required distribution information. Such processing does not generate a large amount of redundant data, and thus effectively improves the utility of the collected preference data.
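The two kinds of attribute values used by Rule II can be sketched as follows. This is an illustration under stated assumptions: a relative ranking is encoded as the items of a leaf item set in the order they appear in the user's ranking, and an interleaving is encoded as a sequence of "L"/"R" labels saying which child of an internal node each item comes from. The function names and both encodings are hypothetical, not taken from the application.

```python
def relative_ranking(ranking, leaf_items):
    """Order of the items of one leaf item set within the full ranking."""
    return tuple(item for item in ranking if item in leaf_items)


def interleaving(ranking, left_items, right_items):
    """For one internal node, record which child each ranked item belongs to."""
    labels = []
    for item in ranking:
        if item in left_items:
            labels.append("L")
        elif item in right_items:
            labels.append("R")
    return tuple(labels)


ranking = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]

# Leaf item sets and internal nodes of the 2-thin chain of FIG. 2.
leaf_a = {"cola", "sprite"}
leaf_b = {"white spirit", "beer"}
leaf_c = {"plain boiled water"}

rel_a = relative_ranking(ranking, leaf_a)
rel_b = relative_ranking(ranking, leaf_b)
inner = interleaving(ranking, leaf_a, leaf_b)
root = interleaving(ranking, leaf_c, leaf_a | leaf_b)
```

A leaf item set of size at most K has at most K! relative rankings, and a leaf with a single item has only one, which matches the remark that such leaves need not be reported.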

SAFA process

In the data collection process of the present application, both stage 2 and stage 4 require SAFA processing on the data transformed by the user terminal, and the SAFA processing is described in detail here.

In order to collect distribution information required for constructing the RI model, the data collection platform needs to estimate the frequency of any one possible value of each attribute in the tuple after the transformation by the user under the condition that the LDP is satisfied.

The data collection platform could directly call the Harmony method, the most advanced method for analyzing multi-attribute data under LDP, to complete this task. Specifically, for the set

the data collector maps the value space of each attribute A_j into a binary matrix Φ_j of size |dom(A_i)| × |dom(A_j)|. Then, for each user u_i, the data collector randomly selects, from the set

one attribute (assumed to be A_r) and calls the SH algorithm [11] to collect the value of u_i's tuple t_i on A_r.

We observe that, whether Rule I or Rule II is applied, in the converted attribute set

the value space sizes of all attributes are far smaller than d!. However, even for attributes with small value spaces, the Harmony method still maps the value space of each dimension into a matrix, which introduces unnecessary noise into the collected data, especially when handling binary attributes. The literature indicates that the generalized randomized response algorithm works best when estimating the frequencies of a small number of discrete values. Therefore, we propose a new LDP algorithm, named Sampling Randomizer for Multiple Attributes (SAFA), to perform frequency estimation more accurately on multi-attribute data with small value spaces while satisfying LDP. Its main idea is that each user terminal randomly selects one attribute, perturbs the value of that attribute using the generalized randomized response algorithm, and sends the perturbed result to the data collection platform.
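The user-side perturbation can be sketched with the textbook form of generalized randomized response: the true value index is kept with probability p = e^ε / (e^ε + m − 1) and otherwise replaced by one of the other m − 1 indices chosen uniformly. The probabilities used in the application are not reproduced in this text, so these formulas, and the function name `grr_perturb`, are assumptions.

```python
import math
import random


def grr_perturb(true_index, domain_size, epsilon, rng=random):
    """Generalized randomized response on a 1-based value index.

    Keeps `true_index` with probability e^eps / (e^eps + m - 1); otherwise
    returns one of the remaining m - 1 indices uniformly at random.
    """
    keep_probability = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < keep_probability:
        return true_index
    # Draw from the m - 1 other indices by skipping over the true one.
    other = rng.randrange(1, domain_size)          # 1 .. m - 1
    return other if other < true_index else other + 1


if __name__ == "__main__":
    rng = random.Random(7)
    # A user whose absolute rank for the sampled item is 3 out of d = 5.
    print([grr_perturb(3, 5, 1.0, rng) for _ in range(10)])
```

For any two inputs, the ratio of the probabilities of producing the same output is at most p / q = e^ε, where q = 1 / (e^ε + m − 1), which is the ε-LDP condition for this textbook mechanism.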

As mentioned above, SAFA must be applied in stage 2 and stage 4, and it is carried out jointly by the user terminal and the data collection platform. The SAFA algorithm applied in the two stages is the same; only the distribution information to be obtained differs. The processing of the SAFA algorithm is described below.

The specific process of SAFA is as follows:

1. The data collection platform initializes all vectors in the vector set […], that is, every value in each vector is set to 0. Here, for stage 2, the vector set is […]; for stage 4, the vector set is the collection of the relative ordering distribution set and the cross distribution set;

2. For each user terminal u_i, the following operations are performed:

2.1. The data collection platform randomly selects an index j from the attribute set […] obtained by the Rule I or Rule II transformation;

2.2. The data collection platform sends j to u_i;

2.3. u_i generates a noisy value index, denoted […], such that […], where k ∈ {1, 2, ..., |dom(A_j)|};

2.4. u_i sends the noisy value index […] to the data collector;

2.5. The data collector increases the value of […] by 1;

After the above operations have been executed for all user terminals, the following operations are executed:

3. For each vector z_j in the set […], the following processing is performed:

3.1. The data collector sets the probability […];

3.2. The data collector sets the probability […];

3.3. Each value z_j[k] in the vector z_j is updated to […].
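The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation of the present application: the formulas of steps 2.3 and 3.1 to 3.3 are not reproduced in this text, so the sketch assumes the standard generalized random response probabilities p = e^ε' / (e^ε' + |dom(A_j)| − 1) and q = 1 / (e^ε' + |dom(A_j)| − 1), and an estimator that normalizes each vector by the number of reports received for that attribute. The function names `grr_perturb`, `safa_collect` and `safa_estimate` are hypothetical, and value indices are 0-based here.

```python
import math
import random


def grr_perturb(true_index, domain_size, epsilon, rng):
    """User side (step 2.3): generalized random response over 0..domain_size-1."""
    p = math.exp(epsilon) / (math.exp(epsilon) + domain_size - 1)
    if rng.random() < p:
        return true_index  # report the true index with probability p
    # Otherwise report one of the other indices uniformly (probability q each).
    other = rng.randrange(domain_size - 1)
    return other if other < true_index else other + 1


def safa_collect(tuples, domain_sizes, epsilon, rng):
    """Steps 1 and 2: zero-initialize the vectors, then count noisy reports."""
    counts = [[0] * size for size in domain_sizes]
    for t in tuples:
        j = rng.randrange(len(domain_sizes))  # step 2.1: sample an attribute index
        noisy = grr_perturb(t[j], domain_sizes[j], epsilon, rng)
        counts[j][noisy] += 1  # step 2.5
    return counts


def safa_estimate(counts, epsilon):
    """Step 3 (assumed form): debias the observed frequencies with p and q."""
    estimates = []
    for z in counts:
        size, n_j = len(z), sum(z)
        if n_j == 0:  # no user was assigned this attribute
            estimates.append([0.0] * size)
            continue
        p = math.exp(epsilon) / (math.exp(epsilon) + size - 1)
        q = 1.0 / (math.exp(epsilon) + size - 1)
        estimates.append([(c / n_j - q) / (p - q) for c in z])
    return estimates


rng = random.Random(0)
tuples = [(0, 1)] * 300 + [(2, 0)] * 100  # two attributes with 3 and 2 values
counts = safa_collect(tuples, [3, 2], 1.0, rng)
estimates = safa_estimate(counts, 1.0)
```

Because each user reports on a single sampled attribute, the whole privacy budget is spent on one generalized random response, which is the design choice motivating SAFA for attributes with small value spaces.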

To show that the SAFA procedure of the present application satisfies local differential privacy, a theoretical proof is given below.

Theorem: For any user u_i, given the privacy budget ε', SAFA satisfies ε'-LDP.

Proof:

By the definition of LDP, for any two different tuples t_i, t′_i and any output […], where […] is the index of the attribute selected by the data collector, we need to show that […].

Since j is randomly selected, […] (1)

We discuss (1) in all 4 possible cases.

Case 1: if […] and […], then […].

Case 2: if […] and […], then […].

Case 3: if […] and […], then […].

Case 4: if […] and […], then […].

In summary, […] holds. Therefore, the theorem is proved.
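The case analysis can be illustrated with a worked derivation. It is a reconstruction under an assumption, not the source's own formulas, which are not reproduced in this text: the perturbation of step 2.3 is taken to be the standard generalized random response, and the symbols \tilde{k}, p and q are notation introduced here for illustration.

```latex
% Assumed perturbation of step 2.3 (generalized random response):
\Pr\big[\tilde{k} = k \mid t_i[A_j]\big] =
\begin{cases}
p = \dfrac{e^{\varepsilon'}}{e^{\varepsilon'} + |dom(A_j)| - 1}, & k = I(t_i[A_j]),\\[2ex]
q = \dfrac{1}{e^{\varepsilon'} + |dom(A_j)| - 1}, & k \neq I(t_i[A_j]).
\end{cases}

% The index j is sampled independently of the tuple, so \Pr[j] cancels:
\frac{\Pr\big[\langle j, \tilde{k}\rangle \mid t_i\big]}
     {\Pr\big[\langle j, \tilde{k}\rangle \mid t'_i\big]}
= \frac{\Pr[j]\,\Pr\big[\tilde{k} \mid t_i[A_j]\big]}
       {\Pr[j]\,\Pr\big[\tilde{k} \mid t'_i[A_j]\big]}
\in \left\{ \frac{p}{p},\ \frac{p}{q},\ \frac{q}{p},\ \frac{q}{q} \right\}
= \left\{ 1,\ e^{\varepsilon'},\ e^{-\varepsilon'},\ 1 \right\}.

% Hence, in all four cases:
e^{-\varepsilon'} \;\le\;
\frac{\Pr\big[\langle j, \tilde{k}\rangle \mid t_i\big]}
     {\Pr\big[\langle j, \tilde{k}\rangle \mid t'_i\big]}
\;\le\; e^{\varepsilon'}.
```

The four ratios correspond to whether the reported index matches the true index under t_i and whether it matches the true index under t′_i, which is one natural reading of the four cases.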

The formal analysis above shows that the preference ranking data collection algorithm satisfying local differential privacy (SAFARI) of the present application guarantees local differential privacy for each user, while ensuring that the data collected by the data collector has high data utility.

Here, the above preference ranking data collection method satisfying the local differential privacy in the present application is summarized as follows:

1. The data collection platform initializes all vectors in the first vector set […] to zero vectors;

2. Each user terminal uses the user's own preference ranking data to assign values to the tuple entries t_i[A_j] of all attributes in the attribute set […] and, according to the preference item index j sent to the user terminal by the data collection platform, uses t_i[A_j] to generate a value index […] and sends it to the data collection platform; wherein A_j is an attribute in the attribute set […], the number of attributes in the tuple is equal to the number of preference items in the preference ranking data, the attributes correspond to the preference items one by one, and the value of each attribute is equal to the rank of the corresponding preference item; the value index satisfies the condition […];

3. Using the […] sent by each user terminal, the data collection platform increases the corresponding value […] in the first vector set […] by 1;

4. The data collection platform updates each value z_j[k] of all vectors in the first vector set to […], wherein […] and ε' is a preset first privacy budget;

5. The data collection platform determines […] and […] from the first vector set […], uses the first vector set, […] and […] to compute the mutual information […] of all triplets […] in the preference item set, and constructs a K-thin chain […], which is sent to each user terminal;

6. The data collection platform initializes all vectors in the second vector set […] to zero vectors;

7. Each user terminal uses the user's own preference ranking data to assign values to the tuple entries t_i′[A_j′] of all attributes in the attribute set […] and, according to the preference item index j sent to the user terminal by the data collection platform, uses t_i′[A_j′] to generate a value index […] and sends it to the data collection platform; wherein the attribute set […] comprises two subsets […] and […], […] corresponds to the collection of leaf item sets of […], and […] corresponds to […]; the value index satisfies the condition […];

8. Using the […] sent by each user terminal, the data collection platform increases the corresponding value […] in the second vector set […] by 1;

9. The data collection platform updates each value z_j′[k] of all vectors in the second vector set to […], wherein […];

10. From the second vector set […] (i.e., the collection of the relative ordering distribution set and the cross distribution set), the distribution information of the leaf nodes and the distribution information of the inner nodes is obtained;

11. New preference ranking data are generated using the constructed RI model, wherein the RI model is […].

In the above data collection method, the processing of step 1 and the assignment of t_i[A_j] by the user terminal in step 2 can be performed in any order; likewise, the processing of step 6 and the assignment of t_i′[A_j′] by the user terminal in step 7 can be performed in any order.
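The tuple assignment described in step 2 (one attribute per preference item, whose value equals the rank of that item) can be illustrated with the beverage example from the Background section. This is a small sketch only: the function name `ranking_to_rank_tuple` is hypothetical, the fixed ordering of the item set is assumed to be agreed between the platform and the user terminals, and the tuples of step 7 are not covered.

```python
def ranking_to_rank_tuple(ranking, item_set):
    """Map a preference ranking to a tuple holding the rank of each item.

    ranking  -- items ordered from most to least preferred
    item_set -- items in the fixed order that defines attributes A_1..A_d
    """
    rank_of = {item: rank for rank, item in enumerate(ranking, start=1)}
    return tuple(rank_of[item] for item in item_set)


item_set = ["cola", "white spirit", "sprite", "plain boiled water", "beer"]
ranking = ["white spirit", "cola", "beer", "sprite", "plain boiled water"]
rank_tuple = ranking_to_rank_tuple(ranking, item_set)
```

Here `rank_tuple` holds one rank per item in `item_set` order, so each attribute takes one of only d values rather than one of the d! possible rankings.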

Next, a comparison with RAPPOR, SH and OLH shows that the SAFARI method proposed by the present application has a significant advantage in the utility of the data collected by the data collection platform. To better illustrate the advantages of the method of the present application, the first-order marginal distribution (Q₁) and the second-order marginal distribution (Q₂) are used to measure the utility of the preference ranking data collected by the four algorithms RAPPOR, SH, OLH and SAFARI. For both the first-order and the second-order marginal distribution, utility is measured by the L₁ distance between the marginal distribution of the data generated by each algorithm and the marginal distribution of the original data. The specific experimental setup is as follows: we tested the performance of each method on two real data sets, Sushi and Jester. The specific characteristics of these two data sets are shown in Table 2.

Table 2. Data set characteristics

Data set	Number of users	Number of items
Sushi	5,000	3～10
Jester	20,000	3～10
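The utility metric described above can be sketched as follows. The source does not define Q₁ precisely, so this sketch assumes that the first-order marginal distribution is the fraction of rankings that place each item at each position; the function names `first_order_marginal` and `l1_distance` and the toy rankings are hypothetical. A second-order version would count pairs of items in the same way.

```python
from collections import Counter


def first_order_marginal(rankings, items):
    """Fraction of rankings that place each item at each position (0-based)."""
    counts = Counter(
        (item, pos) for ranking in rankings for pos, item in enumerate(ranking)
    )
    n = len(rankings)
    return {
        (item, pos): counts[(item, pos)] / n
        for item in items
        for pos in range(len(items))
    }


def l1_distance(dist_a, dist_b):
    """Sum of absolute differences over the union of keys."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(k, 0.0) - dist_b.get(k, 0.0)) for k in keys)


items = ["a", "b"]
original = [("a", "b"), ("a", "b")]   # toy stand-in for the original data
generated = [("a", "b"), ("b", "a")]  # toy stand-in for the generated data
distance = l1_distance(
    first_order_marginal(original, items),
    first_order_marginal(generated, items),
)
```

A smaller distance means the generated rankings reproduce the original marginals more closely.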

The performance of the SAFARI process is illustrated by analyzing the experimental data below.

First, the first-order marginal distribution and second-order marginal distribution are used to measure the performance of the four methods of RAPPOR, SH, OLH and SAFARI. The results of the experiment are shown in FIG. 3.

As can be seen from FIG. 3, on the different data sets, as the privacy budget becomes larger, the L₁ distances between the first-order and second-order marginal distributions of the data generated by the RAPPOR, SH, OLH and SAFARI algorithms and the marginal distributions of the original data set decrease, and the L₁ distance of the SAFARI algorithm is consistently smaller than those of RAPPOR, SH and OLH. This is because: on the one hand, for the SAFARI algorithm, the K-thin chain makes the data collector very robust to the accuracy of the related distribution information collected using SAFARI, so the influence of the added noise is small; on the other hand, the RAPPOR, SH and OLH algorithms introduce a large amount of noise when the privacy budget is small.

Next, we tested the effectiveness of Rule I using the data sets Sushi and Jester. To this end, we compare it with an alternative version of Rule I. In the alternative version, each user transforms his or her preference ranking data so as to directly provide the distribution of […]. We let the data collector use the SAFA method to collect distribution information from the data transformed by users according to Rule I and according to the alternative version, respectively, and report the average L₁ distance of the distributions in S₃ so obtained. The results of the experiment are shown in FIG. 4.

As can be seen from FIG. 4, on the different data sets, the alternative version yields better utility when d does not exceed 4. This is because, when d is relatively small, the advantage of Lasso regression does not outweigh the impact of information loss. However, Rule I yields quite good results when d is relatively large, which demonstrates the superiority of Rule I.

Finally, we tested the effectiveness of the SAFA algorithm using the data sets Sushi and Jester. To this end, we compare it with the Harmony method. We let the data collector collect the distributions in S₁ from the data transformed by users according to Rule I, using the SAFA and Harmony methods respectively, and report the average L₁ distance of the distributions obtained. The results of the experiment are shown in FIG. 5.

As can be seen from FIG. 5, on the different data sets, the distribution information collected using the SAFA method contains less noise. This is because, when the value space of an attribute is small, mapping the value space of each attribute to a matrix in the Harmony algorithm introduces unnecessary noise.

According to the data collection method, the preference sorting data collection is realized, privacy disclosure is avoided, and meanwhile the collected preference sorting data is guaranteed to have high data utility.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for collecting preference ranking data satisfying local differential privacy, comprising:

the data collection platform initializes all vectors in the first vector set […] to zero vectors […];

I(t_i[A_j]) represents the index of t_i[A_j] in dom(A_j), dom(A_j) represents the value space of attribute A_j, and Pr[·] represents a probability value;

using the […] sent by each user terminal, the data collection platform increases the corresponding value […] in the first vector set […] by 1;

wherein […], ε' is a preset first privacy budget, and n represents the total number of the user terminals;

the data collection platform determines […] and […] from the first vector set, uses the first vector set, […] and […] to compute the mutual information […] of all triplets […] in the preference item set, and constructs a K-thin chain […] to be sent to the respective user terminals,

wherein […] is a binary variable used to mark whether […] is ranked before or after […], k₁, k₂, k₃ represent the indices of the items within any triplet in the preference item set, d is the total number of items contained in the preference item set, and […] represents the rank of item […];

the data collection platform initializes all vectors z_j in the second vector set […] to zero vectors and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; wherein i is a user terminal index, j is a preference item index in the preference item set, and the preference item indexes selected for different user terminals are the same or different;

[…] sending […] to the data collection platform; wherein the attribute set […] comprises two subsets […] and […], […] corresponds to the collection of leaf item sets of […], […] corresponds to the collection of inner item sets of […], […] is […] composed of the leaf item sets containing only one item, |·| represents the total number of items contained in a set, and the value index satisfies the condition […]