CN105989161A - Big data processing method and apparatus - Google Patents
Big data processing method and apparatus
- Publication number
- CN105989161A (CN201510095692.7A)
- Authority
- CN
- China
- Prior art keywords
- noise
- query result
- query
- susceptibility
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention provide a big data processing method and apparatus. The method comprises the steps of receiving a query instruction sent by a client and determining a query function K according to the query instruction; performing a query on a big data set D according to the query function to obtain a query result R, wherein the query result R equals {Rj}, j is greater than or equal to 1 and less than or equal to m, and m is a positive integer greater than or equal to 1; obtaining a sensitive degree S(K) of the query function K, wherein the sensitive degree S(K) represents the sensitiveness of the query function K; determining noise N required to be added in the query result R according to the query result R and the sensitive degree S(K), wherein the noise N equals {Nj} and noise components Nj of the noise N are in one-to-one correspondence with query result components Rj; and performing noise-adding processing on the query result components Rj according to the noise components Nj to obtain a noise-added query result R', wherein R' equals {R'j}. According to the method capable of performing noise-adding processing on big data, proposed by the embodiment of the invention, the data query result with differential privacy can be output.
Description
Technical field
The present invention relates to the field of data processing, and more particularly to a method and an apparatus for processing big data.
Background
The purpose of privacy-preserving data mining is to protect individuals' private data while still enabling data sharing among users. Differential privacy is a rigorous theoretical model for describing and analyzing data publishing methods. Its goal is to provide an effective way to maximize the accuracy of statistical queries against a statistical database while minimizing the chance of identifying individual records.
Currently feasible data processing procedures with differential privacy can only be applied to small-scale data. For big data, each component of the query result vector has an independent coordinate, and each independent coordinate is a random variable with an exponential-scale distribution. As a result, there is not yet an effective way to implement differential privacy on large-scale data.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for processing big data that make differentially private queries possible on large-scale data.
In a first aspect, an embodiment of the present invention provides a method for processing big data, comprising: receiving a query instruction sent by a client, and determining a query function K according to the query instruction; querying a big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtaining a sensitivity S(K) of the query function K, where S(K) characterizes the sensitivity of the query function K; determining, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N correspond one-to-one to the query result components Rj; and adding noise to the query result components Rj according to the noise components Nj to obtain a noise-added query result R' = {R'j}.
With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the sensitivity S(K) of the query function K includes: obtaining a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of S(K), where D1 and D2 are two different subsets of the big data set D and differ in at most one data record.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, determining the noise N according to the query result R and the sensitivity S(K) includes: generating, according to the query result R, a noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(K).
With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, the query function K is a hash function F, and the method includes: training on a training set of the big data set D to obtain the hash function F, where the training set is a subset of the big data set D and includes an attribute set X and classification labels Y. The attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
In a second aspect, an embodiment of the present invention provides an apparatus for processing big data, comprising: a receiving module, configured to receive a query instruction sent by a client and determine a query function K according to the query instruction; a first determining module, configured to query a big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; an obtaining module, configured to obtain the sensitivity S(K) of the query function K determined by the first determining module, where S(K) characterizes the sensitivity of the query function K; a second determining module, configured to determine, according to the query result R and the sensitivity S(K) obtained by the obtaining module, the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N correspond one-to-one to the query result components Rj; and a noise-adding module, configured to add noise to the query result components Rj according to the noise components Nj determined by the second determining module, to obtain a noise-added query result R' = {R'j}.
With reference to the second aspect, in a first possible implementation of the second aspect, the obtaining module is specifically configured to: obtain a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and set the minimum value of the norm of the difference between K(D1) and K(D2) as the value of the sensitivity S(K), where D1 and D2 are two different subsets of the big data set D and differ in at most one data record.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the second determining module is specifically configured to: generate, according to the query result R, a noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrate the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj of the noise N follows the Laplace noise distribution determined by the sensitivity S(K).
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the query function K is a hash function F, and the first determining module is further configured to: train on a training set of the big data set D to obtain the hash function F, where the training set is a subset of the big data set D and includes an attribute set X and classification labels Y. The attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. Noise addition with differential privacy can thus be applied to the query result of the original big data set, yielding a query result with differential privacy. The embodiments of the present invention can therefore add noise to large-scale data, avoid leaking sensitive data as far as possible, and achieve differentially private queries.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an example system scenario to which the embodiments of the present invention can be applied.
Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention.
Fig. 3 is a flowchart of a method for processing big data according to another embodiment of the present invention.
Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the present invention.
Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an example system scenario to which the embodiments of the present invention can be applied.
When no noise processing with differential privacy is performed, a client user submits an exact query request 1 directly to a database holding the original sensitive data, as shown by dashed arrow 1 in the figure, and the database returns the exact query result 2 to the client, as shown by dashed arrow 2. In this way, individuals' private data in the original sensitive data is easily disclosed to the client.
With a query mechanism added, the client user can submit a statistical query request 3 through the query mechanism to the database holding the original sensitive data, as shown by solid arrow 3 in the figure. The database returns the statistical query result 4 to the client through the query mechanism. Before it is returned to the client, the statistical query result undergoes sensitivity-based noise processing, and the noise-added statistical query result 5 is returned to the client, as shown by solid arrow 5. The query result thus has privacy protection, and leakage of individuals' private data is avoided as far as possible.
Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
Step 210: receive a query instruction sent by a client, and determine a query function K according to the query instruction.
Step 220: query the big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Step 230: obtain the sensitivity S(K) of the query function K, where S(K) characterizes the sensitivity of the query function K.
Step 240: determine, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N correspond one-to-one to the query result components Rj.
Step 250: add noise to the query result components Rj according to the noise components Nj to obtain a noise-added query result R' = {R'j}.
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined and the noise is determined based on that sensitivity, so that noise addition with differential privacy can be applied to the query result of the original big data set, yielding a query result with differential privacy. The embodiments of the present invention can therefore add noise to large-scale data, avoid leaking sensitive data as far as possible, and achieve differentially private queries.
Specifically, in step 210, after the query instruction sent by the client is received, a concrete statistical query function K is chosen according to the client's requirements for querying aggregate information about the relevant user data set. For example, the statistical query function may be a summation (sum) or averaging (average) function, or a hash function obtained by training on classification query results. The query result R is obtained by running a statistical query on the big data set D according to the query function K.
Specifically, in step 230, the sensitivity S(K) of the query function K is obtained; S(K) characterizes the sensitivity of the query function K. For any function F, the sensitivity S(F) is defined as the minimum value that satisfies the condition ||F(D1) − F(D2)||_M ≤ S(F), where the data sets D1 and D2 differ in at most one data record and M denotes a metric space. It should be understood that D1 and D2 differing in one data record means that D1 and D2 contain the same number of data elements and that the numerical value or value type of one element differs. Optionally, as an embodiment of the present invention, obtaining the sensitivity S(K) of the query function K in step 230 includes: computing the query result K(D1) of the data set D1 and the query result K(D2) of the data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of S(K), where D1 and D2 differ in at most one data record and are two different subsets of the big data set D. Note that the difference between K(D1) and K(D2) in a metric space here refers to the absolute value of the difference between K(D1) and K(D2).
Specifically, obtaining the sensitivity S(K) of the query function K includes calculating S(K) according to the following formula: S(K) = min ||K(D1) − K(D2)||_M, where the data sets D1 and D2 differ in at most one data record and M denotes a metric space.
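The sensitivity computation can be illustrated with a small sketch. It reads the definition as the smallest S for which |K(D1) − K(D2)| ≤ S holds over every neighboring pair that is tried, which is the largest observed difference; that reading is an assumption of this sketch. The function name `empirical_sensitivity`, the sample data, and the assumed value range `domain` are hypothetical. The sketch only probes neighbors of one given data set, so it is an empirical estimate rather than a proven bound.

```python
def empirical_sensitivity(query, records, candidate_values):
    """Smallest S with |query(D1) - query(D2)| <= S over the neighbors tried.

    D1 is `records`; each neighbor D2 has the same size and differs from D1
    in exactly one record, whose value is replaced by a candidate value.
    """
    base = query(records)
    worst = 0.0
    for i in range(len(records)):
        for value in candidate_values:
            neighbor = list(records)
            neighbor[i] = value  # D2 differs from D1 in one record only
            worst = max(worst, abs(base - query(neighbor)))
    return worst


ages = [23, 35, 41, 58]   # hypothetical data set D1
domain = [0, 100]         # assumed extreme values a single record can take

s_sum = empirical_sensitivity(sum, ages, domain)
s_avg = empirical_sensitivity(lambda d: sum(d) / len(d), ages, domain)
```

For the sum query the estimate is driven by the record that can move furthest inside the assumed range, while for the average it shrinks with the number of records.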
Optionally, as an embodiment of the present invention, determining the noise N according to the query result R and the sensitivity S(K) includes: generating, according to the query result, a noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(K).
Specifically, in step 240, a noise-addition mechanism is selected according to the Laplace differential privacy theorem, and a noise N' = [N'1, ..., N'j, ..., N'm] is generated whose noise components are mutually independent. This noise is calibrated according to the sensitivity S(K) of K to obtain the calibrated noise N = [N1, ..., Nj, ..., Nm], whose noise components are likewise mutually independent.
Specifically, adding noise to the query result Rj according to the noise Nj to obtain the differentially private query result R'j means adding the calibrated noise to the statistical query result and outputting the privacy-protected query result R' = [R'1, ..., R'j, ..., R'm] = R + N = [R1, ..., Rj, ..., Rm] + [N1, ..., Nj, ..., Nm].
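A minimal sketch of this generate, calibrate, and add sequence follows. It assumes zero-mean Laplace noise, a privacy parameter `epsilon`, and calibration by multiplying unit-scale noise by S(K)/epsilon; the function names and sample values are hypothetical. A Laplace(0, 1) draw is produced as the difference of two independent Exp(1) draws.

```python
import random


def laplace_noise(m, rng):
    """Return m independent Laplace(0, 1) draws (difference of two Exp(1) draws)."""
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(m)]


def add_calibrated_noise(result, sensitivity, epsilon, rng):
    """Return (R', N) where N is unit Laplace noise rescaled to S/epsilon."""
    raw = laplace_noise(len(result), rng)              # N': independent components
    scale = sensitivity / epsilon                      # assumed calibration factor
    noise = [scale * n for n in raw]                   # N: calibrated noise
    noisy = [r + n for r, n in zip(result, noise)]     # R' = R + N, component-wise
    return noisy, noise


rng = random.Random(0)
R = [120.0, 45.0, 300.0]   # hypothetical query result with m = 3 components
R_noisy, N = add_calibrated_noise(R, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller `epsilon` widens the noise and strengthens privacy at the cost of accuracy.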
Optionally, as an embodiment of the present invention, according to the client's query requirement for the big data set D, the query function K that determines the query result R is a hash function F, and the query result components of R are Rj, where 1 ≤ j ≤ m and m is a positive integer greater than or equal to 1.
Optionally, as an embodiment of the present invention, determining, according to the client's query requirement for the big data set D, that the query function K of the query result R is a hash function F includes: training on a training set of the big data set D to obtain the hash function F, and generating a first hash classification table T of the hash function F. The training set includes an attribute set X and classification labels Y, where the attribute set X is the set of data characterizing the attributes of the elements in the training set and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
Optionally, as an embodiment of the present invention, adding noise to the query result components Rj according to the noise Nj to obtain the differentially private query result R' includes: adding noise to the first hash classification table T of the hash function F(x) according to the noise Nj, to obtain a second hash classification table T' that corresponds to the query result R' and has differential privacy.
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. Noise addition with differential privacy can thus be applied to the query result of the original big data set, yielding a query result with differential privacy. The embodiments of the present invention can therefore add noise to large-scale data, avoid leaking sensitive data as far as possible, and achieve differentially private queries.
Fig. 3 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 3, the method includes:
Step 310: receive a query instruction sent by a client, and determine a query function F according to the query instruction, where the query function F is a hash function.
Step 320: query the big data set D according to the query function F to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Step 330: obtain the sensitivity S(F) of the query function F, where S(F) characterizes the sensitivity of the query function F.
Step 340: determine, according to the query result R and the sensitivity S(F), the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N correspond one-to-one to the query result components Rj.
Step 350: add noise to the query result components Rj according to the noise components Nj to obtain a noise-added query result R' = {R'j}.
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. Noise addition with differential privacy can thus be applied to the query result of the original big data set, yielding a query result with differential privacy. The embodiments of the present invention can therefore add noise to large-scale data, avoid leaking sensitive data as far as possible, and achieve differentially private queries.
It should be understood that, in step 310, the query requirement for the big data set D means, for example, that a certain attribute in some subset of D can be summed to obtain a statistical query function, where 1 ≤ j ≤ m and m is a positive integer greater than or equal to 1. According to the requirement for querying the big data set D, a hash function F can be constructed by training through differentially private random decision hashing (English: Differentially Private Random Decision Hashing, abbreviated DPRDH).
Optionally, as an embodiment of the present invention, obtaining the hash function F of the big data set D includes: constructing a random decision hash with differential privacy according to a training set of the big data set D, so as to obtain the hash function F by training, and generating a first hash classification table T of the hash function F. The training set includes an attribute set X and classification labels Y, where the attribute set X is the set of data characterizing the attributes of the elements in the training set and the classification labels Y are the set of data characterizing the classification results of the elements in the training set. It should be understood that m hash classification tables can be obtained by training according to the requirement for querying the big data set D, and that all of these m tables are first hash classification tables T. It should also be understood that the first hash classification table T is a class of hash classification tables, obtained by training on the training set of the big data set D, that contains at least one hash classification table.
Specifically, the random decision hash with differential privacy is constructed as follows. Input: the training set {attribute set X, classification labels Y}; the first hash classification table comprises m initial hash classification tables and L classes. The attribute set X may be numerical, categorical, or binary. The classification labels Y are the label symbols obtained by classifying according to the attribute set X; there are L classes under Y, and L is a positive integer greater than or equal to 1. Output: m hash classification tables, whose collection is T = [T1, ..., Tj, ..., Tm], that is, the first hash classification table T, where any sub-table Tj = [bk_key1, bk_key2, ..., bk_keyL].
Specifically, based on the inputs and outputs above, the random decision hash with differential privacy is constructed by the following procedure:
1. Randomly generate m mask vectors (maskvector), any one of which is denoted maskvectorj.
2. Uniformly encode the attribute set X of the input training set into binary form, obtaining m binary encodings Xbinary, any one of which is denoted Xbinaryj.
3. Train the random hash with differential privacy as follows:
For i = 1; i ≤ |X|; ++i do
    For j = 1; j ≤ m; ++j do
        Compute the key value: key = maskvectorj AND Xbinaryj;
        Assign the key value to the hash classification table: bk_key = Tj[key];
        Adjust the key value in the hash classification table: bk_key(Y) += 1;
    End
End
It should be understood that the outer loop in step 3 runs the inner loop once for every element of the attribute set X, so that each element is assigned to its corresponding key values in the m hash classification tables. The inner loop takes the logical AND of the mask vector and the binary encoding as the key value and assigns each key value to a hash classification table Tj; it runs m times to obtain the m hash classification tables T = [T1, ..., Tj, ..., Tm].
4. After step 3, m hash classification tables T = [T1, ..., Tj, ..., Tm] are obtained, where any sub-table Tj = [bk_key1, bk_key2, ..., bk_keyL] corresponds to Table 1. Each column vector Y = [Y1bi, ..., Yibi, ..., YLbn] of this table is referred to as a mask vector.
Table 1
Classification label (L classes) | Bucket 1 | Bucket 2 | … | Bucket i | … | Bucket n |
Y1 | Y1b1 | Y1b2 | … | Y1bi | … | Y1bn |
Y2 | Y2b1 | Y2b2 | … | Y2bi | … | Y2bn |
… | … | … | … | … | … | … |
Yi | Yib1 | Yib2 | … | Yibi | … | Yibn |
… | … | … | … | … | … | … |
YL | YLb1 | YLb2 | … | YLbi | … | YLbn |
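The training loop of step 3 can be sketched in Python as follows. The sketch assumes that each record of the attribute set X has already been encoded as an integer whose bits form its binary encoding, that class labels are integers from 0 to L − 1, and that each table maps a key to a bucket of L per-class counts. The names `train_hash_tables`, `num_bits`, and the sample data are hypothetical.

```python
import random
from collections import defaultdict


def train_hash_tables(x_binary, labels, num_classes, m, num_bits, rng):
    """Build m hash classification tables of per-class counts.

    x_binary: records already encoded as integers (assumed binary encoding)
    labels:   class index of each record, in range(num_classes)
    """
    # Step 1: one random mask vector per table.
    masks = [rng.getrandbits(num_bits) for _ in range(m)]
    # Each table maps key -> bucket, a list of L class counts starting at zero.
    tables = [defaultdict(lambda: [0] * num_classes) for _ in range(m)]
    # Step 3: outer loop over records, inner loop over the m tables.
    for code, y in zip(x_binary, labels):
        for mask, table in zip(masks, tables):
            key = mask & code        # key = maskvector_j AND Xbinary
            table[key][y] += 1       # bk_key(Y) += 1
    return masks, tables


# Hypothetical 4-bit records with two classes.
X_bin = [0b1011, 0b1001, 0b0110, 0b0100]
Y = [0, 0, 1, 1]
masks, T = train_hash_tables(X_bin, Y, num_classes=2, m=3, num_bits=4,
                             rng=random.Random(1))
```

Whatever masks are drawn, every record lands in exactly one bucket per table, so each table's counts sum to the number of training records.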
Optionally, as an embodiment of the present invention, calculating the sensitivity S(F) of the query function F(x) includes: computing the query result F(D1) of the data set D1 and the query result F(D2) of the data set D2; and taking the minimum value of the difference between F(D1) and F(D2) in a metric space as the value of S(F), where D1 and D2 differ in at most one data record and are two different subsets of the big data set D.
Optionally, as an embodiment of the present invention, the sensitivity S(F) of the hash function F(x) is calculated by the following formula: S(F) = min ||F(D1) − F(D2)||_M, where the data sets D1 and D2 differ in at most one data record and M denotes a metric space.
Optionally, determining the noise N according to the query result R and the sensitivity S(F) includes: generating, according to the query result, a noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(F) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(F), that is, Nj follows Lap(S(F)/ε), so that the noise-added query result R' has ε-differential privacy.
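For reference, the notation Lap(S(F)/ε) can be written out explicitly. The block below uses the standard Laplace density with scale b and assumes zero mean, which the text does not state.

```latex
\begin{aligned}
p(x \mid b) &= \frac{1}{2b}\exp\!\left(-\frac{|x|}{b}\right),
  \qquad b = \frac{S(F)}{\varepsilon},\\
R'_j &= R_j + N_j,
  \qquad N_j \sim \mathrm{Lap}\!\left(\frac{S(F)}{\varepsilon}\right),\\
\operatorname{Var}(N_j) &= 2b^2 = \frac{2\,S(F)^2}{\varepsilon^2}.
\end{aligned}
```

The standard deviation of each component is therefore \(\sqrt{2}\,S(F)/\varepsilon\): it grows with the sensitivity and shrinks as ε increases.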
Optionally, as an embodiment of the present invention, determining, according to the client's query requirement for the big data set D, that the query function K(x) of the query result R is a hash function F(x) includes: training on a training set of the big data set D to obtain the hash function F(x), and generating a first hash classification table T of the hash function F(x), where the training set includes the attribute set X and the classification labels Y of the big data set D.
Optionally, as an embodiment of the present invention, adding noise to the query result components Rj according to the noise Nj to obtain the differentially private query result R' includes:
adding noise to the first hash classification table T of the hash function F(x) according to the noise Nj, to obtain a second hash classification table T' corresponding to the differentially private query result R'.
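Turning a first hash classification table T into a second table T' might look like the sketch below. It assumes a table is a mapping from key to a list of per-class counts and that independent zero-mean Laplace noise with scale S(F)/epsilon is added to every count; the function name `noisy_table` and the sample table are hypothetical.

```python
import random


def noisy_table(table, sensitivity, epsilon, rng):
    """Return T': a copy of table T with Laplace noise added to every count."""
    scale = sensitivity / epsilon  # assumed Laplace scale, Lap(S/epsilon)
    return {
        key: [
            # Laplace(0, 1) draw as the difference of two Exp(1) draws.
            count + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
            for count in counts
        ]
        for key, counts in table.items()
    }


T_j = {0b1000: [5, 0], 0b0100: [0, 5]}   # hypothetical first table
T_j_noisy = noisy_table(T_j, sensitivity=1.0, epsilon=0.5, rng=random.Random(7))
```

The original table is left unchanged, so only the noisy copy needs to be exposed to the prediction stage.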
Optionally, as an embodiment of the present invention, the query result R' with differential privacy can be predicted and output by constructing a differentially private random decision hashing classifier (English: Differentially Private Random Decision Hashing Classifier, abbreviated DPRDHC).
Specifically, the differentially private random hash classifier is constructed to predict and output the query result R' by the following procedure:
1. Input the set of m second hash classification tables T' = [T'1, ..., T'j, ..., T'm] and the identity column X' to be classified.
2. Initialize the classification label vectors (label vectors) and the classification statistics (label count and label average).
3. Encode the column X' to be classified.
4. Predict with the differentially private random hash classifier as follows:
For j = 1; j ≤ m; ++j do
    Compute the key value: key = maskvectorj AND Xbinaryj;
    Assign the key value to the hash classification table: bk_key = Tj[key];
    Adjust the key value in the hash classification table: label count += bk_key;
End
5. Compute the arithmetic mean of the classification labels over the m second hash classification tables: label avg = label count / m.
6. Take the maximum over the labels as the classification label value: Y' = argmax(label avg).
7. Output the classification label value Y'.
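The prediction steps above can be sketched as follows. The sketch assumes the record to classify is already encoded as an integer, that each noisy table maps a key to a list of L per-class values, and that a key missing from a table contributes zeros; this last choice, the function name `predict_label`, and the sample tables are assumptions for illustration.

```python
def predict_label(code, masks, noisy_tables, num_classes):
    """Return the class index with the largest average bucket value."""
    label_count = [0.0] * num_classes
    for mask, table in zip(masks, noisy_tables):
        key = mask & code                                # key = maskvector_j AND Xbinary
        bucket = table.get(key, [0.0] * num_classes)     # unseen key: assumed zeros
        for c in range(num_classes):
            label_count[c] += bucket[c]                  # label count += bk_key
    m = len(noisy_tables)
    label_avg = [v / m for v in label_count]             # label avg = label count / m
    return max(range(num_classes), key=label_avg.__getitem__)  # argmax


# Hypothetical masks and noisy tables for two classes.
masks = [0b1100, 0b0011]
tables = [
    {0b1000: [5.2, 0.7], 0b0100: [0.3, 4.9]},
    {0b0011: [2.1, 1.8], 0b0000: [0.4, 3.6]},
]
y_a = predict_label(0b1011, masks, tables, num_classes=2)
y_b = predict_label(0b0100, masks, tables, num_classes=2)
```

Averaging over the m tables smooths out the noise added to any single table before the argmax is taken.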
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. Noise addition with differential privacy can thus be applied to the query result of the original big data set, yielding a query result with differential privacy. The embodiments of the present invention can therefore add noise to large-scale data, avoid leaking sensitive data as far as possible, and achieve differentially private queries.
The embodiments of the present invention are described in more detail below with reference to concrete steps.
Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the present invention. As shown in Fig. 4, the method 400 includes the following steps:
Step 401: obtain a statistical query function F.
Step 402: generate mutually independent noise N' = [N'1, ..., N'j, ..., N'm].
Step 403: calculate the standard deviation D = [D1, ..., Dj, ..., Dm] of the noise N'.
Step 404: calculate the sensitivity S(F) of the statistical query function.
Step 405: calibrate the standard deviation D of the noise N' to obtain the calibrated noise N = [N1, ..., Nj, ..., Nm].
Step 406: obtain the statistical query result R = [R1, ..., Rj, ..., Rm].
Step 407: add the calibrated noise N to the statistical query result and output the privacy-protected query result R' = [R'1, ..., R'j, ..., R'm] = R + N = [R1, ..., Rj, ..., Rm] + [N1, ..., Nj, ..., Nm].
Optionally, in step 401, a specific statistical query function F, for example a summation or averaging function, is selected according to the client's aggregate-information query requirement on the relevant user data set. F may also be a hash function obtained by training on classification query results.
Optionally, in step 402, a suitable noise mechanism is selected according to the query result of the statistical query function F in order to generate mutually independent noise N′ = [N′_1, ..., N′_j, ..., N′_m]. The components of the noise N′ are mutually independent; for example, N′ may follow a Laplace distribution, in which case each component of N′ is independent and follows a Laplace distribution. It should be understood that choosing a suitable noise mechanism means selecting the noise-adding mechanism according to the Laplace differential privacy theorem.
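Step 402 can be illustrated with a short sampler. The sketch below is an assumption-laden illustration rather than the mechanism of the embodiment: it draws each component as the difference of two independent unit exponential variables, which is a standard way to obtain a zero-mean Laplace variable, and the names `laplace_sample` and `generate_noise` are hypothetical.

```python
import random
from typing import List, Optional


def laplace_sample(scale: float, rng: random.Random) -> float:
    """Draw one zero-mean Laplace sample with the given scale.

    The difference of two independent Exp(1) variables is Laplace(0, 1);
    multiplying by `scale` gives Laplace(0, scale).
    """
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def generate_noise(m: int, scale: float = 1.0,
                   seed: Optional[int] = None) -> List[float]:
    """Generate N' = [N'_1, ..., N'_m] with mutually independent components."""
    rng = random.Random(seed)       # a seed is only for reproducible experiments
    return [laplace_sample(scale, rng) for _ in range(m)]
```

A fixed seed makes the noise reproducible and should be used for experiments only; a real deployment would need an unpredictable source of randomness.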
Optionally, in step 403, the standard deviation of each independent component of the noise N′ is calculated, giving the standard deviation D = [D_1, ..., D_j, ..., D_m].
Optionally, in step 404, the query result F(D1) of a data set D1 and the query result F(D2) of a data set D2 are calculated, and the minimum value of the difference between the query result F(D1) and the query result F(D2) in a metric space is taken as the value of the sensitivity S(F), where the data set D1 and the data set D2 differ in at most one record. Specifically, calculating the sensitivity S(F) of the statistical query function F includes calculating the sensitivity according to the following formula: S(F) = min ||F(D1) − F(D2)||_M, where the data sets D1 and D2 differ in at most one record, M denotes a metric space, and D1 and D2 are two different subsets of the big data set D.
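The comparison of neighbouring data sets in step 404 can be sketched as below. Two assumptions are made and should be read as such. First, a neighbour D2 is formed by replacing exactly one record of D1 with a candidate value, so both data sets keep the same size. Second, the sketch aggregates with the maximum, which is the usual worst-case convention for sensitivity in the differential privacy literature, whereas the formula above is written with min; the per-pair differences are returned separately so that either aggregate can be applied. The function names are hypothetical, and because only one D1 is examined the result is an estimate around that data set, not a bound over all data sets.

```python
from typing import Callable, Iterable, List, Sequence


def neighbour_differences(query: Callable[[Sequence[float]], float],
                          data: Sequence[float],
                          candidates: Iterable[float]) -> List[float]:
    """|F(D1) - F(D2)| for every D2 obtained from D1 by replacing one record."""
    candidate_values = list(candidates)
    base = query(data)                          # F(D1)
    diffs = []
    for i in range(len(data)):
        for value in candidate_values:
            neighbour = list(data)
            neighbour[i] = value                # D2 differs from D1 in record i only
            diffs.append(abs(base - query(neighbour)))
    return diffs


def sensitivity(query: Callable[[Sequence[float]], float],
                data: Sequence[float],
                candidates: Iterable[float]) -> float:
    # Assumption: worst case (max) over the neighbours, not the min written in the text.
    return max(neighbour_differences(query, data, candidates))
```

For a summation query over values bounded in [0, 10], passing the two bounds as candidates recovers the largest change that a single record can cause.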
Optionally, in step 405, the standard deviation D of the noise N′ is calibrated to obtain the calibrated noise N = [N_1, ..., N_j, ..., N_m], such that each component N_j of the calibrated noise N follows Lap(S(F)/ε), so that a query result with ε-differential privacy can be output. The value of ε lies in the interval [0, 1] and may be specified by the user.
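One way to read the calibration in step 405 is as a rescaling. The sketch below assumes that each N′_j is a zero-mean Laplace variable whose standard deviation is D_j, and uses the fact that a Laplace distribution with scale b has standard deviation √2·b; multiplying N′_j by (√2·S(F)/ε)/D_j then yields a variable that follows Lap(S(F)/ε). The function name `calibrate` is hypothetical, and ε is required to be strictly positive because the scale S(F)/ε is undefined at ε = 0.

```python
import math
from typing import List, Sequence


def calibrate(raw_noise: Sequence[float],
              raw_std: Sequence[float],
              sensitivity: float,
              epsilon: float) -> List[float]:
    """Rescale each N'_j (standard deviation D_j) so that N_j follows Lap(S/epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    # A Laplace distribution with scale b = S/epsilon has standard deviation sqrt(2) * b.
    target_std = math.sqrt(2.0) * sensitivity / epsilon
    return [n * target_std / d for n, d in zip(raw_noise, raw_std)]
```

A smaller ε gives a larger scale and therefore more noise, which is the usual trade-off between privacy and accuracy.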
Optionally, in step 406, the statistical query result R = [R_1, ..., R_j, ..., R_m] is obtained according to the statistical query function F. It should be understood that this step may also be performed before the noise N′ is generated; the present invention is not limited in this respect.
Optionally, in step 407, the calibrated noise N is added to the statistical query result, and the privacy-protected query result R′ = [R′_1, ..., R′_j, ..., R′_m] = R + N = [R_1, ..., R_j, ..., R_m] + [N_1, ..., N_j, ..., N_m] is output. Because each component of the noise N is obtained by calibration according to the sensitivity S(F) of the statistical function and follows the Lap(S(F)/ε) distribution, the output query result R′ has ε-differential privacy.
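The privacy claim in step 407 follows from a short density-ratio argument. The derivation below is a sketch under two assumptions that go beyond the text: the metric on M is the L1 norm, and S(F) is at least as large as ||F(D1) − F(D2)||_1 for every pair of neighbouring data sets. Here p_D denotes the density of the output R′ when the query is run on data set D, F_j denotes the j-th component of F, and b = S(F)/ε.

```latex
\begin{aligned}
p_{D}(r) &= \prod_{j=1}^{m} \frac{1}{2b}\exp\!\left(-\frac{|r_j - F_j(D)|}{b}\right),
\qquad b = \frac{S(F)}{\varepsilon},\\
\frac{p_{D_1}(r)}{p_{D_2}(r)}
&= \exp\!\left(\frac{1}{b}\sum_{j=1}^{m}\bigl(|r_j - F_j(D_2)| - |r_j - F_j(D_1)|\bigr)\right)\\
&\le \exp\!\left(\frac{\lVert F(D_1) - F(D_2)\rVert_1}{b}\right)
\le \exp\!\left(\frac{S(F)}{b}\right) = e^{\varepsilon}.
\end{aligned}
```

The first inequality is the triangle inequality applied to each component. The second is where the sensitivity enters: it holds only when S(F) bounds the difference for the pair of data sets at hand.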
Figs. 1 to 4 describe in detail, from the method perspective, the specific process of processing big data. An apparatus for processing big data is described in detail below with reference to Figs. 5 and 6.
Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the present invention. As shown in Fig. 5, the apparatus 500 includes: a receiving module 510, a first determining module 520, an acquisition module 530, a second determining module 540, and a noise-adding module 550.
The receiving module 510 is configured to receive a query instruction sent by a client and to determine a query function K according to the query instruction.
The first determining module 520 is configured to query a big data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
The acquisition module 530 is configured to obtain the sensitivity S(K) of the query function K determined by the receiving module, where the sensitivity S(K) characterizes how sensitive the query function K is.
The second determining module 540 is configured to determine, according to the query result R and the sensitivity S(K) obtained by the acquisition module, the noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j.
The noise-adding module 550 is configured to add noise to the query result components R_j according to the noise components N_j determined by the second determining module, to obtain a noise-added query result R′ = {R′_j}.
Specifically, the receiving module 510 obtains the client's aggregate-information query requirement on the relevant user data set and selects a specific statistical query function K. The statistical query function may be, for example, a summation (sum) or averaging (average) function, or a hash function obtained by training on classification query results. The query result R is obtained by performing a statistical query on the big data set D according to the query function K.
Optionally, as an embodiment of the present invention, the acquisition module 530 is specifically configured to: calculate the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2; and set the minimum value of the difference between the query result K(D1) and the query result K(D2) as the value of the sensitivity S(K), where the data set D1 and the data set D2 differ in at most one record and are two different subsets of the big data set D. It should be understood that D1 and D2 differing in one record means that D1 and D2 contain the same number of data elements while the value or the value type of one element differs. It should also be noted that the difference between the query result K(D1) and the query result K(D2) here refers to the absolute value of the difference between the two.
Specifically, the acquisition module 530 is further configured to calculate the sensitivity S(K) according to the following formula: S(K) = min ||K(D1) − K(D2)||_M, where the data sets D1 and D2 differ in at most one record, D1 and D2 are two different subsets of the big data set D, and M denotes a metric space.
Optionally, as an embodiment of the present invention, the second determining module 540 is specifically configured to: generate, according to the query result, noise N′ that follows a Laplace distribution, where the noise components of the noise N′ are mutually independent; and obtain the noise N by correcting the noise N′ according to the sensitivity S(K), where each noise component N_j follows a Laplace distribution determined by the sensitivity S(K).
Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to: determine, according to the client's query requirement on the big data set D, that the query function K(x) of the query result R is a hash function F(x), where the query result components of the query result R are R_j, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to: obtain the hash function F(x) by training on a training set of the big data set D, and generate a first hash classification table T of the hash function F(x). The training set is a subset of the big data set D and includes an attribute set X and classification labels Y, where the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
Optionally, as an embodiment of the present invention, the first determining module 520 is further configured to perform noise-adding processing on the first hash classification table T of the hash function F(x) according to the noise N_j, to obtain a second hash classification table T′ whose query results R′_j have differential privacy.
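Turning a first hash classification table T into a noisy second table T′ can be sketched as follows. The layout of the table is an assumption, since the text does not specify it: each key maps to a list of per-label counts, and independent zero-mean Laplace noise with scale S/ε is added to every count. The sensitivity that applies to the table entries is likewise taken as an input rather than derived, and the name `add_noise_to_table` is hypothetical.

```python
import random
from typing import Dict, List, Optional


def add_noise_to_table(table: Dict[int, List[float]],
                       sensitivity: float,
                       epsilon: float,
                       seed: Optional[int] = None) -> Dict[int, List[float]]:
    """Return a noisy copy T' of the table T; T itself is left unchanged."""
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    rng = random.Random(seed)           # a seed is only for reproducible experiments
    scale = sensitivity / epsilon
    noisy: Dict[int, List[float]] = {}
    for key, counts in table.items():
        # The difference of two Exp(1) draws is a zero-mean Laplace(0, 1) sample.
        noisy[key] = [c + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
                      for c in counts]
    return noisy
```

Returning a copy keeps the unperturbed table out of the query path, so that only T′ is ever consulted when answering a client.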
Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the present invention. It should be noted that the device shown in Fig. 6 corresponds to the embodiments of Fig. 2 to Fig. 4 and can implement each process of the methods for processing big data in the embodiments of Fig. 1 to Fig. 4; detailed descriptions are omitted where appropriate to avoid repetition. The apparatus for processing big data shown in Fig. 6 includes: a processor 610, a memory 620, and a bus 630. The processor 610 and the memory 620 are connected by the bus 630, the memory 620 is configured to store instructions, and the processor 610 is configured to execute the instructions stored in the memory 620. Specifically, the processor 610 is configured to: receive a query instruction sent by a client, and determine a query function K according to the query instruction; query a big data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtain the sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes how sensitive the query function K is; determine, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j; and perform noise-adding processing on the query result components R_j according to the noise components N_j, to obtain a noise-added query result R′ = {R′_j}.
Optionally, as an embodiment of the present invention, the processor 610 is configured to: obtain the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2; and take the minimum value of the difference between the query result K(D1) and the query result K(D2) in a metric space as the value of the sensitivity S(K), where the data set D1 and the data set D2 differ in at most one record and are two different subsets of the big data set D.
Specifically, the processor 610 is configured to obtain the sensitivity S(K) according to the following formula: S(K) = min ||K(D1) − K(D2)||_M, where the data set D1 and the data set D2 differ in at most one record, D1 and D2 are two different subsets of the big data set D, and M denotes a metric space.
Optionally, as an embodiment of the present invention, the processor 610 is configured to: generate, according to the query result, noise N′ that follows a Laplace distribution, where the noise components of the noise N′ are mutually independent; and obtain the noise N by correcting the noise N′ according to the sensitivity S(K), where each noise component N_j follows a Laplace distribution determined by the sensitivity S(K).
A person of ordinary skill in the art will appreciate that the method steps and units described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and composition of each embodiment have been described above generally in terms of function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
The methods or steps described in connection with the embodiments disclosed herein may be implemented in hardware, in a software program executed by a processor, or in a combination of the two. The software program may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Although the present invention has been described in detail with reference to the accompanying drawings and in connection with preferred embodiments, the present invention is not limited thereto. Without departing from the spirit and essence of the present invention, a person of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments of the present invention, and all such modifications or substitutions shall fall within the scope of the present invention.
Claims (8)
1. A method for processing big data, characterized by comprising:
receiving a query instruction sent by a client, and determining a query function K according to the query instruction;
querying a big data set D according to the query function K to obtain a query result R, where the query result R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;
obtaining a sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes how sensitive the query function K is;
determining, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where the noise N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j;
performing noise-adding processing on the query result components R_j according to the noise components N_j, to obtain a noise-added query result R′ = {R′_j}.
2. The method according to claim 1, characterized in that the obtaining a sensitivity S(K) of the query function K comprises:
obtaining the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2;
taking the minimum value of the difference between the query result K(D1) and the query result K(D2) in a metric space as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and the data set D1 and the data set D2 differ in at most one record.
3. The method according to claim 1 or 2, characterized in that the determining the noise N according to the query result R and the sensitivity S(K) comprises:
generating, according to the query result R, noise N′ that follows a Laplace distribution, where the noise components of the noise N′ are mutually independent;
obtaining the noise N by correcting the noise N′ according to the sensitivity S(K), where each noise component N_j of the noise N follows a Laplace distribution determined by the sensitivity S(K).
4. The method according to any one of claims 1 to 3, characterized in that the query function K is a hash function F, and the method comprises:
obtaining the hash function F by training on a training set of the big data set D;
where the training set is a subset of the big data set D, the training set further includes an attribute set X and classification labels Y, the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
5. An apparatus for processing big data, characterized by comprising:
a receiving module, configured to receive a query instruction sent by a client and to determine a query function K according to the query instruction;
a first determining module, configured to query a big data set D according to the query function K to obtain a query result R, where the query result R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;
an acquisition module, configured to obtain the sensitivity S(K) of the query function K determined by the first determining module, where the sensitivity S(K) characterizes how sensitive the query function K is;
a second determining module, configured to determine, according to the query result R and the sensitivity S(K) obtained by the acquisition module, the noise N to be added to the query result R, where the noise N = {N_j} and the noise components N_j of the noise N correspond one-to-one to the query result components R_j;
a noise-adding module, configured to add noise to the query result components R_j according to the noise components N_j determined by the second determining module, to obtain a noise-added query result R′ = {R′_j}.
6. The apparatus according to claim 5, characterized in that the acquisition module is specifically configured to:
obtain the query result K(D1) of a data set D1 and the query result K(D2) of a data set D2;
set the minimum value of the norm of the difference between the query result K(D1) and the query result K(D2) as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and the data set D1 and the data set D2 differ in at most one record.
7. The apparatus according to claim 5 or 6, characterized in that the second determining module is specifically configured to:
generate, according to the query result R, noise N′ that follows a Laplace distribution, where the noise components of the noise N′ are mutually independent;
obtain the noise N by correcting the noise N′ according to the sensitivity S(K), where each noise component N_j of the noise N follows a Laplace distribution determined by the sensitivity S(K).
8. The apparatus according to any one of claims 5 to 7, characterized in that the query function K is a hash function F, and the first determining module is further configured to:
obtain the hash function F by training on a training set of the big data set D;
where the training set is a subset of the big data set D, the training set includes an attribute set X and classification labels Y, the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510095692.7A CN105989161A (en) | 2015-03-04 | 2015-03-04 | Big data processing method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510095692.7A CN105989161A (en) | 2015-03-04 | 2015-03-04 | Big data processing method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105989161A true CN105989161A (en) | 2016-10-05 |
Family
ID=57038338
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510095692.7A Pending CN105989161A (en) | 2015-03-04 | 2015-03-04 | Big data processing method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105989161A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108664488A (en) * | 2017-03-28 | 2018-10-16 | 华为技术有限公司 | A kind of processing method and processing device of traffic statistics achievement data |
CN108664488B (en) * | 2017-03-28 | 2020-11-10 | 华为技术有限公司 | Method and device for processing voice system index data |
CN109492429A (en) * | 2018-10-30 | 2019-03-19 | 华南师范大学 | A kind of method for secret protection of data publication |
CN109492429B (en) * | 2018-10-30 | 2020-10-16 | 华南师范大学 | Privacy protection method for data release |
CN113157541A (en) * | 2021-04-20 | 2021-07-23 | 贵州优联博睿科技有限公司 | Distributed database-oriented multi-concurrent OLAP (on-line analytical processing) type query performance prediction method and system |
CN113157541B (en) * | 2021-04-20 | 2024-04-05 | 贵州优联博睿科技有限公司 | Multi-concurrency OLAP type query performance prediction method and system for distributed database |
CN113553363A (en) * | 2021-09-23 | 2021-10-26 | 支付宝(杭州)信息技术有限公司 | Query processing method and device |
WO2023045504A1 (en) * | 2021-09-23 | 2023-03-30 | 支付宝(杭州)信息技术有限公司 | Query processing method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20161005 |