CN109299436A

CN109299436A - Preference ranking data collection method satisfying local differential privacy

Info

Publication number: CN109299436A
Application number: CN201811079995.XA
Authority: CN
Inventors: 程祥; 苏森; 杨健宇
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-09-17
Filing date: 2018-09-17
Publication date: 2019-02-01
Anticipated expiration: 2038-09-17
Also published as: CN109299436B

Abstract

This application discloses a preference ranking data collection method that satisfies local differential privacy. The user terminal converts its preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends the result to the data collection platform. The data collection platform and the user terminals cooperate to run an algorithm that satisfies local differential privacy and complete the construction of the entire RI model, and preference ranking data are then generated from the constructed model. With this method, the collected preference ranking data retain high data utility while privacy leakage is avoided.

Description

A preference ranking data collection method satisfying local differential privacy

Technical field

This application relates to data collection techniques, and in particular to a preference ranking data collection method satisfying local differential privacy.

Background technique

Preference ranking data are a typical kind of personal data. For a user, preference ranking data refer to the ordering that the user gives to the items of a given item set (Item Set) according to how much the user likes each item. For example, if the item set is {cola, white wine, Sprite, plain water, beer} and a user's preference ranking over these five items is <white wine, cola, beer, Sprite, plain water>, then the user likes white wine the most and plain water the least. With the rapid development of information technologies such as the mobile Internet and cloud computing, and the growing popularity of mobile terminals such as smartphones, it has become commonplace for users to share their preference ranking data with various data collectors (for example, service providers) through mobile applications in exchange for personalized services. For service providers, in turn, collecting and analyzing users' preference ranking data is essential for providing a better user experience and creating new revenue opportunities. However, users' preference ranking data usually contain extremely sensitive personal information, and collecting these data directly may lead to serious leakage of individual privacy.

Fig. 1 is a schematic diagram of a current scenario for collecting user preference data. The scenario mainly involves two roles, users (i.e., data contributors) and the data collector. Given an item set χ = {x₁, x₂, ..., x_d} consisting of d items, each user u_i (1 ≤ i ≤ n) holds one piece of preference ranking data σ_i = <σ_i(1), σ_i(2), ..., σ_i(d)>, and users are mutually independent. Here, σ_i(j) = x_k means that x_k is ranked j-th in σ_i. The data collector uses a data collection platform to collect the preference ranking data of each user over the network, thereby obtaining a preference ranking data set from which a model of the preference ranking data is built. New preference ranking data can be generated from this model; the generated data have the same statistical properties as the users' original preference ranking data, and because the original data are not released directly, user privacy is protected to some extent. The data collector can analyze the resulting model directly, or release the model or the newly generated preference ranking data to third parties (for example, research institutions).

As can be seen from the above process, collecting user preference ranking data in this way prevents the users of the model and of the new preference ranking data from obtaining user privacy, but before the preference ranking data model is formed, the user's private preference data may still be leaked. Specifically, for each user, the following three roles still pose a threat to the user's privacy: 1) the data collector; 2) other users; 3) any potential attacker other than the data collector and other users.

Privacy-preserving preference ranking data collection techniques offer a feasible solution to the individual privacy leakage caused by collecting preference ranking data. Local differential privacy (Local Differential Privacy, LDP), proposed in recent years, is a differential privacy technique designed specifically to address the individual privacy leakage caused by data collection. In particular, it requires each data contributor to first add a suitable amount of noise to the data it holds and then send the noisy data to the data collector, thereby protecting the privacy of the data contributor.

At present, there is some work on data collection under local differential privacy. Based on information theory, Duchi et al. proposed a high-dimensional data collection method aimed at mean computation and statistical risk minimization tasks. By extending this method with sampling techniques, a data collection method called Harmony was proposed. For each piece of high-dimensional data, Harmony randomly selects one dimension of the data; if that dimension holds continuous data, it is collected with the method of Duchi et al.; if it holds discrete data, it is collected with the SH mechanism. To obtain the frequent items of multidimensional data, Qin et al. proposed a two-stage data collection method called LDPMiner. In the first stage, the method uses the SH mechanism to determine a preliminary candidate space of frequent items from the noisy data; in the second stage, it uses the RAPPOR mechanism to obtain accurate frequent items. Based on the EM (Expectation Maximization) algorithm, Fanti et al. proposed an extended RAPPOR mechanism. The mechanism assumes that the dimensions of the high-dimensional data are mutually independent, collects each dimension with the RAPPOR mechanism, and feeds these data to the EM algorithm to infer the joint distribution of the overall data, which is then used for data generation. However, when the data dimension is high, the mechanism has both high time complexity and slow convergence. To address this problem, Ren et al. combined the EM algorithm with Lasso regression and proposed a new method that substantially improves the efficiency of the extended RAPPOR mechanism.

A local differential privacy algorithm could be applied directly to preference ranking data as follows: assume a value space consisting of all possible preference rankings, regard each user's preference ranking data as one discrete value in that value space, and then collect the data directly with a method for one-dimensional multi-valued data that satisfies local differential privacy, such as the RAPPOR, SH, or OLH algorithm. However, the converted data have an enormous value space: for a given item set χ = {x₁, x₂, ..., x_d}, the size of the value space of the converted data is d!. These algorithms therefore leave a large amount of noise in the collected data, which makes the resulting preference ranking data unusable.
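To give a sense of scale, the following minimal sketch compares the d! value space of the direct approach with a value space of size d, which is the size of a single per-item rank attribute. The function names are illustrative and do not come from the application.

```python
from math import factorial


def direct_value_space(d: int) -> int:
    """Size of the value space when a whole ranking of d items is one value (d!)."""
    return factorial(d)


def per_attribute_value_space(d: int) -> int:
    """Size of the value space of one absolute-rank attribute (d possible ranks)."""
    return d


if __name__ == "__main__":
    for d in (5, 10, 20):
        print(f"d={d}: direct={direct_value_space(d)}, "
              f"per-attribute={per_attribute_value_space(d)}")
```

Already at d = 10 the direct value space holds more than three million values, so each value receives very few true reports relative to the noise that a local mechanism must add.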

Summary of the invention

This application provides a preference ranking data collection method satisfying local differential privacy. Preference ranking data collected with this method retain high data utility while privacy leakage is avoided.

To achieve the above object, the application adopts the following technical solution:

A preference ranking data collection method satisfying local differential privacy, comprising:

The data collection platform initializes every vector z_j in a first vector set to the zero vector and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; where i is the user terminal index and j is the index of a preference item in the preference item set;

For each user terminal, using the preference ranking data of its own user, values are assigned to the tuple entries t_i[A_j] for all attributes in an attribute set, and, according to the preference item index j sent to that user terminal by the data collection platform, a value index is generated from the tuple entry t_i[A_j] and sent to the data collection platform; where A_j is the j-th attribute in the attribute set, t_i[A_j] denotes the value of A_j in t_i, the number of attributes in the tuple equals the number of preference items in the preference ranking data, attributes and preference items correspond one to one, and the value of each attribute equals the rank of the corresponding preference item; the value index satisfies the condition [formula omitted in source], k ∈ {1, 2, ..., |dom(A_j)|}, where I(t_i[A_j]) denotes the index of t_i[A_j] in dom(A_j) and dom(A_j) denotes the value space of attribute A_j;

The data collection platform uses the value index sent by each user terminal to add 1 to the corresponding value in the first vector set;

The data collection platform updates each value z_j[k] of every vector in the first vector set to [formula omitted in source], where ε' is a preset first privacy budget;

The data collection platform determines two further sets of distributions from the first vector set and, using the first vector set together with these two sets, computes the mutual information of all triples in the preference item set, constructs a K-thin chain, and sends it to each user terminal;

The data collection platform initializes every vector z_j′ in a second vector set to the zero vector and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; where i is the user terminal index, j is the index of a preference item in the preference item set, and the preference item indexes selected for different user terminals may be the same or different;

For each user terminal, using the preference ranking data of its own user, values are assigned to the tuple entries t_i′[A_j′] for all attributes in an attribute set, and, according to the preference item index j sent to that user terminal by the data collection platform, a value index is generated from the tuple entry t_i′[A_j′] and sent to the data collection platform; where the attribute set includes two subsets, one corresponding to the set of leaf item sets of the K-thin chain and the other corresponding to the set of internal item sets of the K-thin chain, and the value index satisfies the condition [formula omitted in source];

The data collection platform uses the value index sent by each user terminal to add 1 to the corresponding value in the second vector set;

The data collection platform updates each value z_j′[k] of every vector in the second vector set to [formula omitted in source], where ε" is a preset second privacy budget, ε' + ε" = ε, and ε is the overall privacy budget for building the RI model;

The leaf-node distribution information and internal-node distribution information of the K-thin chain are obtained from the second vector set;

Using the RI model that includes the leaf-node distribution information and the internal-node distribution information, preference ranking data are generated.

Preferably, the method further comprises: generating a specified quantity of preference ranking data according to the mutual information of the triples and the leaf-node distribution information and internal-node distribution information of the K-thin chain.

Preferably, the determining performed by the data collection platform according to the first vector set includes:

For each distribution in the first of the two sets to be determined, constructing a Lasso regression model, in which the response is a column vector of length 2d that stores the information of two of the collected distributions, the design matrix is a binary matrix of size 2d × d(d-1), and the coefficient vector is a column vector of length d(d-1) that stores the information of the joint distribution;

Solving the Lasso regression model by the least angle regression method to estimate the coefficient vector and determine the joint distribution, and then computing the target distribution from the joint distribution;

For each distribution in the second of the two sets to be determined, constructing a Lasso regression model, in which the response is a column vector of length (d+2) that stores the information of the known distributions, the design matrix is a binary matrix of size (d+2) × 2d, and the coefficient vector is a column vector of length 2d that stores the information of the joint distribution;

Solving the Lasso regression model by the least angle regression method to estimate the coefficient vector and determine the joint distribution.

Preferably, [formula omitted in source].

As can be seen from the above technical solution, in this application the user terminal converts its preference ranking data using Rule I and Rule II, adds noise to the converted data, and sends them to the data collection platform. The data collection platform and the user terminals cooperate to run an algorithm that satisfies local differential privacy and complete the construction of the entire RI model, and the RI model so built is then used to generate preference ranking data that satisfy local differential privacy. With this method, the collected preference ranking data retain high data utility while privacy leakage is avoided.

Brief description of the drawings

Fig. 1 is a schematic diagram of a current scenario for collecting user preference data;

Fig. 2 is a schematic diagram of an example 2-thin chain;

Fig. 3 is performance comparison diagram one of this application;

Fig. 4 is performance comparison diagram two of this application;

Fig. 5 is performance comparison diagram three of this application.

Specific embodiment

To make the purpose, technical means, and advantages of this application clearer, the application is described in further detail below with reference to the drawings.

To solve the problem noted in the background, namely that data become unusable when existing local differential privacy methods are applied to preference ranking data, the applicant proposes a preference ranking data algorithm satisfying local differential privacy (the SAFARI algorithm). Its main idea is that the data collector collects distribution information over a series of small value spaces selected according to the riffle independent model (RI model), uses that distribution information to approximate the overall distribution of the preference ranking data and thereby build the model, and then generates preference ranking data from the model. Because the SAFARI algorithm handles multiple small value spaces rather than one large value space, it greatly reduces the scale of the noise.

The processing of the application is described in detail below.

At present, preference ranking data can be modeled with the RI model. According to the independence relationships among the dimensions of the preference ranking data, the RI model approximates the overall distribution of the preference ranking data by the product of two kinds of low-dimensional distributions, relative ranking distributions (Relative Ranking Distributions) and interleaving distributions (Interleaving Distributions), and thus models the preference ranking data effectively. The model is then used to generate new preference ranking data, which protects user privacy.

The structure of the RI model is a binary tree called a K-thin chain. The root node represents the original item set, every other node represents a subset of the original item set, and the item set of each leaf node contains no more than a constant K items. Fig. 2 shows an example of a 2-thin chain.

In this example, the original item set {cola, white wine, Sprite, plain water, beer} is first divided into two disjoint subsets that have a riffle independent relationship, namely {plain water} and {cola, white wine, Sprite, beer}. Because the size of the subset {cola, white wine, Sprite, beer} exceeds 2 (i.e., the value of K), it is further divided into the disjoint subsets {cola, Sprite} and {white wine, beer}, which also have a riffle independent relationship.
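The 2-thin chain of this example can be written down as a small binary tree. The sketch below is only an illustration: the `Node` class and the `is_k_thin_chain` check are hypothetical names, and the check encodes just the two properties stated in the text (children split their parent into disjoint subsets, and every leaf holds at most K items).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Node:
    """One node of a K-thin chain: an item set plus optional children."""
    items: FrozenSet[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaves(node: Node) -> List[FrozenSet[str]]:
    """Collect the leaf item sets from left to right."""
    if node.is_leaf():
        return [node.items]
    return leaves(node.left) + leaves(node.right)


def is_k_thin_chain(node: Node, k: int) -> bool:
    """Check that children partition their parent and leaves have at most k items."""
    if node.is_leaf():
        return len(node.items) <= k
    if node.left is None or node.right is None:
        return False
    disjoint = not (node.left.items & node.right.items)
    covers = (node.left.items | node.right.items) == node.items
    return (disjoint and covers
            and is_k_thin_chain(node.left, k)
            and is_k_thin_chain(node.right, k))


# The example from the text: first split off {plain water}, then split the rest.
inner = Node(
    frozenset({"cola", "white wine", "Sprite", "beer"}),
    left=Node(frozenset({"cola", "Sprite"})),
    right=Node(frozenset({"white wine", "beer"})),
)
root = Node(
    frozenset({"cola", "white wine", "Sprite", "plain water", "beer"}),
    left=Node(frozenset({"plain water"})),
    right=inner,
)
```

With this tree, `is_k_thin_chain(root, 2)` holds, while `is_k_thin_chain(root, 1)` does not, because two leaves contain two items each.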

The learning process of the RI model includes two stages, structure learning and parameter learning:

1) Structure learning. First, the mutual information of all triples in the item set is computed, defined as follows:

Given an item set (x₁, x₂, ..., x_d), for any triple in the item set made up of three distinct items, the mutual information of the triple is [formula omitted in source]

Here, the formula involves the rank of one item of the triple and a binary variable. The binary variable takes one value when one item of the triple is ranked before another, and its other value when that item is ranked after the other.
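As an illustration of this definition, the sketch below computes an empirical triple mutual information directly from a list of raw rankings. It assumes the usual form in the riffle independence literature, namely the mutual information between the rank of item a and the binary event "b is ranked before c"; the exact notation of the application is not recoverable from the text, and in the application the platform works from privately collected distributions rather than raw rankings. The function name is illustrative.

```python
from collections import Counter
from math import log
from typing import Hashable, List, Sequence


def triple_mutual_information(rankings: List[Sequence[Hashable]],
                              a: Hashable, b: Hashable, c: Hashable) -> float:
    """Empirical mutual information between rank(a) and the event rank(b) < rank(c).

    Each ranking lists the items from most to least preferred.
    """
    n = len(rankings)
    joint = Counter()
    for ranking in rankings:
        order = list(ranking)
        rank_a = order.index(a) + 1                   # absolute rank of a, 1-based
        b_before_c = order.index(b) < order.index(c)  # binary variable
        joint[(rank_a, b_before_c)] += 1

    # Marginal counts of the rank of a and of the binary variable.
    rank_counts = Counter()
    omega_counts = Counter()
    for (rank_a, omega), count in joint.items():
        rank_counts[rank_a] += count
        omega_counts[omega] += count

    mi = 0.0
    for (rank_a, omega), count in joint.items():
        p_joint = count / n
        p_rank = rank_counts[rank_a] / n
        p_omega = omega_counts[omega] / n
        mi += p_joint * log(p_joint / (p_rank * p_omega))
    return mi
```

A value close to zero suggests that the rank of a carries little information about the relative order of b and c, which is the kind of evidence used to place a in a different subset from b and c.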

Then, based on the mutual information of the triples, a K-thin chain is constructed over the original item set using the anchor point algorithm.

2) Parameter learning. According to the constructed K-thin chain, the distribution of each node is learned in order to approximate the overall distribution of the original preference ranking data set. The distributions of leaf nodes are called relative ranking distributions (Relative Ranking Distributions), and the distributions of internal nodes (including the root node) are called interleaving distributions (Interleaving Distributions).

Once the relative ranking distributions and interleaving distributions have been determined by the above process, the modeling of the RI model is complete.

The preference ranking data collection method of this application is based on the RI model and generates preference ranking data from the RI model that is built. The difference is that, during modeling, the acquisition of the triple mutual information and of the relative ranking and interleaving distributions all satisfy local differential privacy.

The preference ranking data algorithm satisfying local differential privacy (the SAFARI algorithm) in this application involves two rules, Rule I and Rule II, and the SAFA algorithm, and may specifically include 5 stages:

Stage 1

1. Each user converts his or her own preference ranking data according to a conversion rule (denoted Rule I), so that the data collection platform can obtain the distribution information required to compute the triple mutual information. The content of this distribution information is discussed in detail later.

Stage 2

1. The data collection platform uses a privacy budget of ε' and calls the SAFA algorithm, which runs with the cooperation of the users, to collect from the data converted by users in stage 1 the distribution information required to compute the triple mutual information. Here, ε' characterizes the strength of privacy protection. In the SAFA algorithm, each user adds noise to the converted data before sending them to the data collection platform, which avoids leaking privacy.

2. The data collection platform uses the collected distribution information to compute all triple mutual information in the RI model.

3. The data collection platform constructs a K-thin chain.

4. The data collection platform sends the K-thin chain to each user.

Stage 3

1. Each user converts his or her own preference ranking data according to another conversion rule (denoted Rule II), so that the data collection platform can determine the relative ranking distribution (Relative Ranking Distributions) information and interleaving distribution (Interleaving Distributions) information of the K-thin chain.

Stage 4

1. The data collector uses a privacy budget of ε" and calls the SAFA algorithm, which runs with the cooperation of the users, to collect from the data converted by users in stage 3 the relative ranking distribution information and interleaving distribution information of the K-thin chain. Here, ε" characterizes the strength of privacy protection. In the SAFA algorithm, each user adds noise to the converted data before sending them to the data collection platform, which avoids leaking privacy.

At this point, once the relative ranking distribution information and interleaving distribution information have been obtained, the construction of the RI model is complete.

Stage 5

According to the constructed riffle independent model, the data collector generates n new pieces of preference ranking data.
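The generation step can be pictured as repeated riffle merging up the K-thin chain: each leaf contributes a relative ranking, and each internal node contributes an interleaving that says from which child the next item is taken. The sketch below shows only the deterministic merge; in the RI model each relative ranking and interleaving would be drawn from the corresponding learned distribution, which is omitted here. The function name and the 0/1 encoding of an interleaving are assumptions made for illustration.

```python
from typing import List, Sequence


def riffle_merge(left_order: Sequence[str],
                 right_order: Sequence[str],
                 interleaving: Sequence[int]) -> List[str]:
    """Merge two relative rankings according to an interleaving.

    interleaving is a sequence of flags: 0 takes the next item from
    left_order, 1 takes the next item from right_order.
    """
    flags = list(interleaving)
    if (len(flags) != len(left_order) + len(right_order)
            or flags.count(0) != len(left_order)
            or flags.count(1) != len(right_order)):
        raise ValueError("interleaving does not match the sizes of the two orders")
    left_iter, right_iter = iter(left_order), iter(right_order)
    return [next(left_iter) if flag == 0 else next(right_iter) for flag in flags]


# Bottom-up generation over the 2-thin chain of Fig. 2.
inner_ranking = riffle_merge(["cola", "Sprite"], ["white wine", "beer"], [1, 0, 1, 0])
full_ranking = riffle_merge(["plain water"], inner_ranking, [1, 1, 1, 1, 0])
```

Here `full_ranking` is the ranking <white wine, cola, beer, Sprite, plain water>, i.e. the example user's ranking from the background section.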

The processing in stages 1 to 4 above realizes RI modeling of preference ranking data under local differential privacy. The data collection platform can release the completed RI model to third parties; alternatively, to better provide preference ranking data to third parties, it can preferably go on to generate new preference ranking data through the processing in stage 5 and release these data to third parties.

In addition, in the above data collection method the local differential privacy processing is split between stage 2 and stage 4. Therefore, given an overall privacy budget of ε, ε' + ε" = ε. Preferably, in practical applications, one usually takes [formula omitted in source].

Rule I, Rule II, and the SAFA method used in the above data collection method satisfying local differential privacy are introduced in turn below.

Design Rule I

As mentioned above, Rule I is used to convert the user's preference ranking data, and the converted data are used to compute the mutual information of triples; the design of Rule I therefore has to follow the way the triple mutual information is computed. Specifically, according to the definition of triple mutual information, in order to compute the mutual information of any possible triple (x_i, x_j, x_k), the data collection platform needs to collect three types of distribution information: [formulas omitted in source]

To accomplish this task, an intuitive conversion method is to let each user convert his or her preference ranking data so as to provide information about these three types of distributions. In particular, each user converts the preference ranking data into a tuple containing multiple attributes, where each attribute corresponds to one of these distributions.

However, this conversion method leaves a large amount of redundant information in each user's converted data and increases the conversion complexity, because [formula omitted in source].

In fact, the data collector only needs to collect the distributions of one of the three types and can then derive the information of the other two types from them. Each user therefore only needs to convert the preference ranking data so as to provide information about the distributions of that one type. Because that type contains O(d³) different distributions, the number of attributes in each user's converted tuple is O(d³).

Unfortunately, when d is relatively large, the curse of dimensionality means that, under LDP, this way of converting leaves a large amount of noise in the data collected by the data collector. Rule I is designed to solve this problem. Under this conversion rule, each user only needs to convert the preference ranking data so as to provide information about the distributions in a single set. The data collection platform only needs to collect the distributions in that set and then estimates the information of the distributions in the other two sets from them with a regression model. In particular, the applicant has found that this estimation problem is a sparse linear regression (sparse linear regression) problem, and therefore chooses the Lasso regression model, which solves such problems effectively. The details of Rule I, and how the Lasso regression model is used to estimate the information of the distributions in the other two sets, are described below.

Rule I: each user terminal u_i first converts the preference ranking data σ_i of its user into a tuple t_i containing all attributes of an attribute set. Each attribute A_j in the attribute set corresponds to one item, and the value space of A_j is dom(A_j) = {1, 2, ..., d}. dom(A_j) consists of d possible values, which represent the absolute ranks that the item may have. Then, for each attribute A_j in the attribute set, each user u_i assigns a value to t_i[A_j] according to σ_i.
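A minimal sketch of the Rule I conversion follows, assuming the attribute order is the order in which the items are listed in the item set (the text requires only that platform and terminals agree on the correspondence). The function name `rule_one` is illustrative.

```python
from typing import Hashable, Sequence, Tuple


def rule_one(ranking: Sequence[Hashable], item_set: Sequence[Hashable]) -> Tuple[int, ...]:
    """Convert a ranking into a tuple of absolute ranks, one attribute per item.

    ranking lists the items from most to least preferred; entry j of the
    result is the 1-based rank of item_set[j] in that ranking.
    """
    if len(ranking) != len(item_set) or set(ranking) != set(item_set):
        raise ValueError("ranking must be a permutation of the item set")
    rank_of = {item: rank for rank, item in enumerate(ranking, start=1)}
    return tuple(rank_of[item] for item in item_set)


item_set = ["cola", "white wine", "Sprite", "plain water", "beer"]
ranking = ["white wine", "cola", "beer", "Sprite", "plain water"]
t_i = rule_one(ranking, item_set)
```

For the example user this gives the tuple (2, 1, 4, 5, 3): cola is ranked second, white wine first, Sprite fourth, plain water fifth, and beer third. Each of the d attributes takes one of only d values.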

Distribution estimation with Lasso. Based on the collected distributions, the data collection platform can estimate the information of the distributions in the other two sets as follows.

First, the data collection platform estimates the distribution information of the first of these sets from the collected distributions. In particular, for each distribution in that set, the data collector constructs a Lasso regression model, where

1) the response is a column vector of length 2d, which stores the information of two of the collected distributions;

2) the design matrix is a binary matrix of size 2d × d(d-1);

3) the coefficient vector is a column vector of length d(d-1), which stores the information of the joint distribution.

By solving the Lasso regression model with the least angle regression method, the data collection platform can estimate the coefficient vector and thereby obtain the information of the joint distribution. From the joint distribution, the data collection platform can compute the information of the target distribution.

Then, the data collector estimates the distribution information of the second of these sets from the distributions already obtained. In particular, for each distribution in that set, the data collector constructs a Lasso regression model, where

1) the response is a column vector of length (d+2), which stores the information of the known distributions;

2) the design matrix is a binary matrix of size (d+2) × 2d;

3) the coefficient vector is a column vector of length 2d, which stores the information of the joint distribution.

Similarly, by solving the Lasso regression model with the least angle regression method, the data collector can estimate the coefficient vector and thereby obtain the information of the joint distribution.

It should be noted that in Rule I above, each attribute A_j in the attribute set corresponds one to one with a preference item, and this correspondence must be consistent between the data collection platform and the user terminals. That is, in SAFARI, the data collection platform and every user terminal must run the same Rule I. In addition, as can be seen from the above, under this conversion rule the number of attributes in each user's converted tuple is O(d). For each attribute A_j in the attribute set, the size of its value space |dom(A_j)| is only d, which is obviously far smaller than d!. By estimating the frequency of every possible value of each attribute in the set, the data collector obtains the distribution information of the collected set and then estimates the information of the distributions in the other two sets accordingly. Such processing does not generate large amounts of redundant data and therefore effectively improves the utility of the user preference data.

Design Rule II

After constructing the K-thin chain, the data collection platform needs to collect the relative ranking distributions (Relative Ranking Distributions) and interleaving distributions (Interleaving Distributions) of the K-thin chain. Rule II is designed for this purpose. Under this conversion rule, each user terminal only needs to convert the preference ranking data of its user so as to provide information about these two types of distributions.

Rule II: each user terminal u_i first converts the preference ranking data σ_i of its user into a tuple t_i containing all attributes of an attribute set. In particular, the attribute set consists of two subsets, where

1) [formula omitted in source]

The first subset corresponds to the set of leaf item sets of the K-thin chain. Within that set, there is a subset made up of the leaf item sets that contain only one item; because the relative ranking distribution of each leaf item set in this subset is easy to infer, users do not need to provide information about it.

Each attribute A_j in the first subset corresponds to one leaf item set l_k in the set of leaf item sets. The value space of A_j consists of all relative rankings of l_k. In particular, when K is 1, every leaf item set contains only one item, and in that case [formula omitted in source], where K denotes the maximum number of items contained in a leaf item set of the K-thin chain.

2) [formula omitted in source]

Each attribute A_j in the second subset corresponds to one internal item set g_k in the set of internal item sets. The value space of A_j consists of all interleavings of g_k. Then, for each attribute A_j in the attribute set, each user u_i assigns a value to t_i[A_j] according to σ_i.
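The Rule II conversion can be sketched as follows for the 2-thin chain of Fig. 2. The encoding is an assumption made for illustration: a relative ranking is the leaf's items in the order they appear in the user's ranking, and an interleaving is a tuple of 0/1 flags telling which child of the internal node each successive item belongs to. The names `rule_two`, `leaf_sets` and `internal_sets` are hypothetical.

```python
from typing import Dict, FrozenSet, Sequence, Tuple


def relative_ranking(ranking: Sequence[str], leaf_items: FrozenSet[str]) -> Tuple[str, ...]:
    """Order of the leaf's items as they appear in the full ranking."""
    return tuple(item for item in ranking if item in leaf_items)


def interleaving(ranking: Sequence[str],
                 left_items: FrozenSet[str],
                 right_items: FrozenSet[str]) -> Tuple[int, ...]:
    """0/1 flags over the node's items: 0 = item from the left child, 1 = right child."""
    return tuple(0 if item in left_items else 1
                 for item in ranking
                 if item in left_items or item in right_items)


def rule_two(ranking: Sequence[str],
             leaf_sets: Dict[str, FrozenSet[str]],
             internal_sets: Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]]) -> Dict[str, tuple]:
    """Build the Rule II tuple as a dict from attribute name to attribute value."""
    tuple_values: Dict[str, tuple] = {}
    for name, items in leaf_sets.items():
        if len(items) > 1:  # single-item leaves need not be reported
            tuple_values[name] = relative_ranking(ranking, items)
    for name, (left_items, right_items) in internal_sets.items():
        tuple_values[name] = interleaving(ranking, left_items, right_items)
    return tuple_values


water = frozenset({"plain water"})
soft = frozenset({"cola", "Sprite"})
alcohol = frozenset({"white wine", "beer"})
leaf_sets = {"l_water": water, "l_soft": soft, "l_alcohol": alcohol}
internal_sets = {"g_root": (water, soft | alcohol), "g_inner": (soft, alcohol)}
ranking = ["white wine", "cola", "beer", "Sprite", "plain water"]
t_i = rule_two(ranking, leaf_sets, internal_sets)
```

The resulting tuple has four attributes: two relative rankings over two-item leaves and two interleavings, each with a value space far smaller than the 5! = 120 possible full rankings.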

It should be noted that in Rule II above, each attribute A_j in the attribute set corresponds one to one with a leaf item set or an internal item set, and this correspondence must be consistent between the data collection platform and the user terminals. That is, in SAFARI, the data collection platform and every user terminal must run the same Rule II. In addition, as can be seen from the above, under this conversion rule the number of attributes in each user's converted tuple is O(d). For each attribute A_j in the first subset, the maximum value space size is K!; for each attribute A_j in the second subset, the maximum value space size is [formula omitted in source]. The maximum value space of any attribute in the attribute set is therefore [formula omitted in source], which is obviously far smaller than d!. By estimating the frequency of every possible value of each attribute in the set, the data collector obtains the distribution information of the K-thin chain. Such processing does not generate large amounts of redundant data and therefore effectively improves the utility of the user preference data.

SAFA method

In the data collection process of this application described above, both stage 2 and stage 4 require SAFA processing of the data converted by the user terminals. The SAFA processing is described in detail here.

To collect the distribution information needed to build the RI model, the data collection platform needs to estimate, under LDP, the frequency of every possible value of each attribute in the users' converted tuples.

The data collection platform could complete this task by directly calling the Harmony method, currently the most advanced method for analyzing multi-attribute data under LDP. In particular, for each attribute A_j in the attribute set, the data collector maps the value space of A_j to a binary matrix Φ_j of size |dom(A_j)| × |dom(A_j)|. Then, for each user u_i, the data collector randomly selects an attribute (say A_r) from the attribute set and calls the SH algorithm [11] to collect the value of A_r in u_i's tuple t_i.

It can be observed that, whether Rule I or Rule II is used, the value-space size of every attribute in the converted attribute set […] is much smaller than d!. However, even for attributes with a small value space, the Harmony method still maps the value space of each dimension to a matrix, so the collected data contain unnecessary noise, especially when binary attributes are processed. The literature points out that the generalized randomized response algorithm performs best when the frequencies of a small number of discrete values are estimated. We therefore propose a new LDP algorithm, named Sampling Randomizer for Multiple Attributes (SAFA), which estimates the frequencies of multi-attribute data with small value spaces more accurately under LDP. The main idea of the method is that one attribute is randomly selected for each user terminal, the user terminal perturbs the value of that attribute with the generalized randomized response algorithm, and the perturbed result is sent to the data collection platform.
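The perturbation idea described above can be sketched as follows. This is a minimal illustration that assumes the standard form of generalized randomized response: the true value index is kept with probability p = e^ε / (e^ε + |dom(A_j)| − 1), and each other index is reported with probability q = 1 / (e^ε + |dom(A_j)| − 1). The function names and the 0-based indexing are assumptions made for the example and do not come from the application.

```python
import math
import random

def grr_probabilities(epsilon, domain_size):
    """Assumed standard GRR probabilities: p keeps the true index, q is for each other index."""
    denom = math.exp(epsilon) + domain_size - 1
    return math.exp(epsilon) / denom, 1.0 / denom

def choose_attribute(num_attributes, rng=random):
    """Pick one attribute index uniformly at random (0-based)."""
    return rng.randrange(num_attributes)

def grr_perturb(true_index, domain_size, epsilon, rng=random):
    """Return a noisy value index in range(domain_size) for one attribute value."""
    p, _ = grr_probabilities(epsilon, domain_size)
    if rng.random() < p:
        return true_index
    # Otherwise pick uniformly among the domain_size - 1 other indices.
    other = rng.randrange(domain_size - 1)
    return other if other < true_index else other + 1
```

A user terminal holding a tuple would call `grr_perturb` only on the value of the selected attribute, so each user reports a single noisy index.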

As noted above, stage 2 and stage 4 both use the SAFA algorithm, which is carried out cooperatively by the user terminals and the data collection platform. The SAFA algorithm applied in stage 2 and stage 4 is the same; only the distribution information to be obtained differs. The processing of the SAFA algorithm is therefore described once below.

The detailed SAFA procedure is as follows:

1. The data collection platform initializes all vectors in the vector set […], that is, sets every value in each vector to 0. Here, for stage 2 the vector set is […]; for stage 4 the vector set is the union of the relative-ranking distribution set and the interleaving distribution set;

2. For each user terminal u_i, the following operations are performed:

2.1. The data collection platform randomly selects an index j from […], the index range of the attributes obtained by conversion with Rule I or Rule II;

2.2. The data collection platform sends j to u_i;

2.3. u_i generates a noisy value index such that

[…]

where k ∈ {1, 2, ..., |dom(A_j)|};

2.4. u_i sends the noisy value index to the data collector;

2.5. The data collector increases by 1 the entry of z_j at the received noisy value index;

After the above operations have been performed for all user terminals, the following operations are performed:

3. For each vector z_j in the set […], the following processing is performed:

3.1. The data collector sets the probability […];

3.2. The data collector sets the probability […];

3.3. Each value z_j[k] in the vector z_j is updated to […].
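The platform-side part of the procedure (steps 1, 2.5 and 3) can be sketched as follows. The sketch assumes the standard generalized randomized response probabilities p = e^ε / (e^ε + |dom(A_j)| − 1) and q = 1 / (e^ε + |dom(A_j)| − 1), and it assumes that the update in step 3.3 is the usual unbiased estimator (z_j[k] / n_j − q) / (p − q), where n_j is the number of reports received for attribute j. The exact expressions of the application are not reproduced here, and all function names are hypothetical.

```python
import math

def new_counters(domain_sizes):
    """Step 1: one zero vector z_j per attribute A_j."""
    return [[0] * size for size in domain_sizes]

def record_report(counters, j, noisy_index):
    """Step 2.5: count one noisy report for attribute j (0-based indices)."""
    counters[j][noisy_index] += 1

def estimate_frequencies(z_j, epsilon):
    """Step 3 (assumed form): turn the noisy counts of one attribute into frequency estimates."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    size = len(z_j)
    n_j = sum(z_j)  # number of reports received for this attribute
    if n_j == 0:
        return [0.0] * size
    denom = math.exp(epsilon) + size - 1
    p = math.exp(epsilon) / denom  # probability of reporting the true index
    q = 1.0 / denom                # probability of reporting each other index
    return [(count / n_j - q) / (p - q) for count in z_j]
```

Because each report is counted only for the attribute that was sampled for that user, the estimator normalizes by the per-attribute report count n_j rather than by the total number of users. This is a design choice of the sketch.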

The above is the specific processing of the SAFA method. To show that the SAFA method in this application satisfies local differential privacy, a theoretical proof is given below.

Theorem: For any user u_i and privacy budget ε', SAFA satisfies ε'-LDP.

Proof:

By the definition of LDP, for any two different tuples t_i and t_i' and any […], where […] is the attribute index selected by the data collector, we need to prove […]

Because j is selected at random, […]

We discuss (1) in all four possible cases.

Case 1: if […] and […]

Case 2: if […] and […]

Case 3: if […] and […]

Case 4: if […] and […]

In summary, […] holds. The theorem is therefore proved.
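The four-case argument can be illustrated with the following worked bound. It is a sketch under stated assumptions, not the application's own derivation: it assumes the standard generalized randomized response probabilities p and q, writes the noisy value index as \tilde{v} (an assumed symbol), and takes the four cases to be the combinations of whether the true index of t_i[A_j] and of t_i'[A_j] equals the reported index k.

```latex
% Assumed GRR probabilities:
%   p = e^{\varepsilon'} / (e^{\varepsilon'} + |dom(A_j)| - 1)
%   q = 1 / (e^{\varepsilon'} + |dom(A_j)| - 1)
% The index j is selected independently of the tuple, so the
% selection probability is the same in numerator and denominator
% and cancels in the ratio.
\[
\frac{\Pr[\tilde{v} = k \mid t_i, j]}{\Pr[\tilde{v} = k \mid t_i', j]}
\in \left\{ \frac{p}{p},\ \frac{p}{q},\ \frac{q}{p},\ \frac{q}{q} \right\}
= \left\{ 1,\ e^{\varepsilon'},\ e^{-\varepsilon'},\ 1 \right\},
\qquad \text{hence} \qquad
\frac{\Pr[\tilde{v} = k \mid t_i]}{\Pr[\tilde{v} = k \mid t_i']} \le e^{\varepsilon'} .
\]
```

Under these assumptions the largest ratio is p/q = e^{ε'}, which is exactly the bound required by ε'-LDP.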

The formal analysis above shows that the preference ranking data collection algorithm satisfying local differential privacy in this application (SAFARI) guarantees local differential privacy for each user while ensuring that the data collected by the data collector have high utility.

The preference ranking data collection method satisfying local differential privacy described above is summarized as follows:

1. The data collection platform initializes every vector z_j in the first vector set […] to the zero vector and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; where i is the user terminal index and j is the index of a preference item in the preference item set;

2. Each user terminal uses the user's own preference ranking data to assign values to the tuple t_i containing all attributes in the attribute set […] and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i[A_j] to generate a value index and sends it to the data collection platform; where A_j is the j-th attribute in the attribute set […], the number of attributes in the tuple equals the number of preference items in the preference ranking data, attributes correspond one-to-one with preference items, and the value of each attribute equals the rank of the corresponding preference item; the value index satisfies the condition […];

3. Using the value index sent by each user terminal, the data collection platform adds 1 to the value […] in the first vector set;

4. The data collection platform updates each value z_j[k] of every vector in the first vector set to […], where […], and ε' is the preset first privacy budget;

5. Based on the first vector set (namely […]), the data collection platform determines […] and […], uses the first vector set, […] and […] to calculate the mutual information […] of all triples […] in the preference item set, constructs the K-thin chain […], and sends it to each user terminal;

6. The data collection platform initializes every vector z_j' in the second vector set […] to the zero vector and, for each user terminal u_i, selects a preference item index j from the preset preference item set and sends it to the corresponding user terminal; where i is the user terminal index, j is the index of a preference item in the preference item set, and the indices selected for different user terminals may be the same or different;

7. Each user terminal uses the user's own preference ranking data to assign values to the tuple t_i′ containing all attributes in the attribute set […] and, according to the preference item index j sent to that user terminal by the data collection platform, uses t_i′[A_j′] to generate a value index and sends it to the data collection platform; where the attribute set […] comprises two subsets […] and […], […] corresponds to the collection of leaf item sets of […], […] corresponds to the collection of internal item sets of […], and the value index satisfies the condition […];

8. Using the value index sent by each user terminal, the data collection platform adds 1 to the value […] in the second vector set;

9. The data collection platform updates each value z_j'[k] of every vector in the second vector set to […], where […], ε'' is the preset second privacy budget, ε' + ε'' = ε, and ε is the total privacy budget for building the RI model;

10. From the second vector set (namely the union of the relative-ranking distribution set and the interleaving distribution set), the leaf-node distribution information and internal-node distribution information of […] are obtained;

11. Using the RI model thus built, new preference ranking data are generated, where the RI model comprises […], the leaf-node distribution information and the internal-node distribution information.

In the above data collection method, the processing of step 1 and the user terminal's assignment of t_i[A_j] in step 2 may be executed in any order, and the processing of step 6 and the user terminal's assignment of t_i′[A_j′] in step 7 may be executed in any order.

Next, SAFARI is compared with RAPPOR, SH and OLH. The comparison shows that the SAFARI method proposed in this application has a clear advantage in the utility of the data collected by the data collection platform. To better illustrate the advantages of the method, the first-order marginal distribution (Q₁) and the second-order marginal distribution (Q₂) are used to measure the utility of the preference ranking data collected by the four algorithms RAPPOR, SH, OLH and SAFARI. For both the first-order and the second-order marginal distributions, the utility of the collected data is measured by the L₁ distance between the marginal distribution of the data generated by each algorithm and that of the original data. The experimental setup is as follows: two real data sets, Sushi and Jester, are used to test the performance of each method. The characteristics of the two data sets are shown in Table 2.

Table 2. Data set characteristics

Data set	Number of users	Number of items
Sushi	5,000	3~10
Jester	20,000	3~10
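The utility measure described before Table 2 can be sketched for the first-order case as follows. The sketch assumes that the first-order marginal is the frequency with which each item appears at each position, and that the L₁ distance is summed over all (item, position) pairs without further normalization. Both choices, as well as the function names, are assumptions of the example.

```python
from collections import Counter

def first_order_marginal(rankings, items):
    """Map each (item, position) pair to the fraction of rankings placing item at position."""
    n = len(rankings)
    counts = Counter()
    for ranking in rankings:
        for position, item in enumerate(ranking):
            counts[(item, position)] += 1
    return {(item, position): counts[(item, position)] / n
            for item in items for position in range(len(items))}

def l1_distance(dist_a, dist_b):
    """Sum of absolute differences over all keys of the two distributions."""
    keys = set(dist_a) | set(dist_b)
    return sum(abs(dist_a.get(key, 0.0) - dist_b.get(key, 0.0)) for key in keys)

items = ("a", "b", "c")
original = [("a", "b", "c"), ("a", "c", "b")]
generated = [("a", "b", "c"), ("b", "a", "c")]
distance = l1_distance(first_order_marginal(original, items),
                       first_order_marginal(generated, items))
print(distance)  # 3.0
```

A smaller distance means that the generated rankings reproduce the positional statistics of the original rankings more closely.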

The performance of the SAFARI method is illustrated below by analyzing the experimental data.

First, the first-order and second-order marginal distributions are used to measure the performance of the four methods RAPPOR, SH, OLH and SAFARI. The experimental results are shown in Figure 3.

Figure 3 shows that, on both data sets, as the privacy budget increases, the L₁ distances between the first-order and second-order marginal distributions of the data generated by RAPPOR, SH, OLH and SAFARI and those of the original data set decrease, and that the result for SAFARI is always smaller than those for RAPPOR, SH and OLH. The reasons are as follows. On the one hand, for SAFARI, the K-thin chain makes the accuracy of the distribution information collected by the data collector highly robust, so it is less affected by the added noise. On the other hand, RAPPOR, SH and OLH introduce a large amount of noise when the privacy parameter decreases.

Next, we test the effectiveness of Rule I on the Sushi and Jester data sets by comparing it with another version of Rule I (denoted Rule I*). In Rule I*, each user converts his or her preference ranking data so as to directly provide the distribution information in […]. We let the data collector use the SAFA method to collect distribution information from the data converted by users under Rule I and under Rule I*, respectively, and report the average L₁ distance of the distributions in S₃ that it obtains. The experimental results are shown in Figure 4.

Figure 4 shows that, on both data sets, Rule I* yields better utility when d is no more than 4. This is because, when d is small, the advantage brought by Lasso regression does not outweigh the effect of the information loss it introduces. When d is relatively large, however, Rule I yields considerably better results, which demonstrates the superiority of Rule I.

Finally, we test the effectiveness of the SAFA algorithm on the Sushi and Jester data sets by comparing it with the Harmony method. We let the data collector use SAFA and Harmony, respectively, to collect the distribution information in S₁ from the data converted by users under Rule I, and report the average L₁ distance of the distributions obtained. The experimental results are shown in Figure 5.

Figure 5 shows that, on both data sets, the distribution information collected with the SAFA method contains less noise. This is because, when the value space of an attribute is small, mapping the value space of each attribute to a matrix in the Harmony algorithm introduces unnecessary noise.

The tests above show that the preference ranking data collection realized by the data collection method of this application ensures that the collected preference ranking data have high utility while privacy leakage is avoided.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A preference ranking data collection method satisfying local differential privacy, characterized by comprising:

a data collection platform initializing every vector z_j in a first vector set […] to the zero vector and, for each user terminal u_i, selecting a preference item index j from a preset preference item set and sending it to the corresponding user terminal; where i is the user terminal index and j is the index of a preference item in the preference item set;

each user terminal using the user's own preference ranking data to assign values to a tuple t_i containing all attributes in an attribute set […] and, according to the preference item index j sent to the user terminal by the data collection platform, using t_i[A_j] to generate a value index and sending it to the data collection platform; where A_j is the j-th attribute in the attribute set […], t_i[A_j] denotes the value of A_j in t_i, the number of attributes in the tuple equals the number of preference items in the preference ranking data, attributes correspond one-to-one with preference items, and the value of each attribute equals the rank of the corresponding preference item; the value index satisfies the condition […], where I(t_i[A_j]) denotes the index of t_i[A_j] in dom(A_j) and dom(A_j) denotes the value space of attribute A_j;

the data collection platform using the value index sent by each user terminal to add 1 to the value […] in the first vector set;

the data collection platform determining […] and […] according to the first vector set, using the first vector set, […] and […] to calculate the mutual information […] of all triples […] in the preference item set, constructing a K-thin chain […], and sending it to each user terminal;

the data collection platform initializing every vector z_j' in a second vector set […] to the zero vector and, for each user terminal u_i, selecting a preference item index j from the preset preference item set and sending it to the corresponding user terminal; where i is the user terminal index, j is the index of a preference item in the preference item set, and the indices selected for different user terminals may be the same or different;

each user terminal using the user's own preference ranking data to assign values to a tuple t_i′ containing all attributes in an attribute set […] and, according to the preference item index j sent to the user terminal by the data collection platform, using t_i′[A_j′] to generate a value index and sending it to the data collection platform; where the attribute set […] comprises two subsets […] and […], […] corresponds to the collection of leaf item sets of […], […] corresponds to the collection of internal item sets of […], and the value index satisfies the condition […];

the data collection platform using the value index sent by each user terminal to add 1 to the value […] in the second vector set;

obtaining, according to the second vector set, the leaf-node distribution information and internal-node distribution information of […];

generating preference ranking data using an RI model that includes the leaf-node distribution information and the internal-node distribution information.

2. The method according to claim 1, characterized in that the method further comprises: generating a specified quantity of preference ranking data according to the mutual information of the triples and the leaf-node distribution information and internal-node distribution information of […].

3. The method according to claim 1 or 2, characterized in that the data collection platform determining […] according to the first vector set comprises: […]

4. The method according to claim 1 or 2, characterized in that the data collection platform determining […] according to the first vector set comprises: […]

solving the Lasso regression model by the least angle regression method to estimate […], and determining the joint distribution […].

5. The method according to claim 1 or 2, characterized in that